Topics and Goals

In the current age of Big Data, Information Access technologies, such as Information Retrieval (IR) and Recommender Systems (RS), play a crucial role in quickly and effectively retrieving relevant resources to meet the information needs of users. Such systems face major challenges in terms of both efficiency and effectiveness, since they need to process huge amounts of complex and heterogeneous information, often relying on computationally intensive methods. Research in Quantum Computing (QC) has led to the development of powerful devices that can now tackle realistic problems, promising great performance improvements for some computationally intensive tasks. In fact, by leveraging quantum-mechanical phenomena such as superposition and entanglement, QC technologies can explore problem spaces that are exponentially larger than those that classical machines with an equivalent number of traditional bits can handle. Nevertheless, neither the use of QC technologies nor how to express IR/RS algorithms in a form suitable for QC has been investigated in the field yet. At the same time, Information Access has a pressing need for computational resources, and QC could provide several potential advantages, such as faster search/recommendation/access times, improved accuracy, the ability to handle larger and more complex datasets, and, last but not least, greener computation. Therefore, the objective of QuantumCLEF is to design and develop an evaluation infrastructure for QC algorithms and, in particular, for Quantum Annealing (QA) algorithms closely related to the Information Access field, which will make it possible to:
• identify new problem formulations to efficiently and effectively employ quantum annealers for their resolution;
• compare the performance of QA approaches with respect to their non-quantum counterparts using traditional hardware;
• involve researchers from different fields (e.g., Information Retrieval, Recommender Systems, Operations Research...) to work together and learn more about QA technologies.

Our Tasks

QuantumCLEF addresses two different tasks involving computationally intensive problems that are closely related to the Information Access field: Feature Selection and Clustering. There is one problem for each task, and each problem is solvable with the QA paradigm. For each task, participants will be asked to submit solutions obtained with both Quantum Annealing and Simulated Annealing, so that the two methods can be compared in terms of both efficiency and effectiveness.

Task 1: Feature Selection

Apply quantum annealers to find the most relevant subset of features to train a learning model, e.g., for ranking. This problem is very impactful, since many IR and RS systems involve the optimization of learning models, and reducing the dimensionality of the input data can improve their performance.
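
As a concrete illustration, the sketch below shows one common QUBO formulation for feature selection (relevance minus redundancy, plus a penalty on the number of selected features), solved with D-Wave's Simulated Annealing sampler from the neal package. The toy data, the scores and the parameters k and alpha are illustrative assumptions, not the official task formulation.

import numpy as np
import dimod
from neal import SimulatedAnnealingSampler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))   # toy feature matrix (200 samples, 20 features)
y = rng.normal(size=200)         # toy relevance labels

n = X.shape[1]
relevance = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(n)])
redundancy = np.abs(np.corrcoef(X, rowvar=False))

k = 5        # desired number of selected features (assumption)
alpha = 2.0  # strength of the cardinality penalty (assumption)

# QUBO: minimize -relevance + redundancy + alpha * (sum_i x_i - k)^2
Q = {}
for i in range(n):
    Q[(i, i)] = -relevance[i] + alpha * (1 - 2 * k)
    for j in range(i + 1, n):
        Q[(i, j)] = redundancy[i, j] + 2 * alpha

bqm = dimod.BinaryQuadraticModel.from_qubo(Q)
sampleset = SimulatedAnnealingSampler().sample(bqm, num_reads=100)
selected = sorted(i for i, bit in sampleset.first.sample.items() if bit == 1)
print("Selected features:", selected)

The same BQM can be submitted to a quantum annealer by swapping the sampler (e.g., an EmbeddingComposite over a DWaveSampler), which is how QA and SA runs can be compared on the same formulation.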

Task 1A: The IR Task

Select the most relevant features in the considered datasets to train a LambdaMART model and thus achieve the highest score. A baseline using Recursive Feature Elimination (RFE) with a Logistic Regression classifier will be used as an overall alternative.
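
For reference, a minimal sketch of such an RFE + Logistic Regression baseline with scikit-learn is shown below; the toy data and the number of features to keep are assumptions.

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 46))            # toy data with 46 features, as in MQ2007
y = (rng.random(500) > 0.5).astype(int)   # toy binary relevance labels

selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=20)
selector.fit(X, y)
kept = np.flatnonzero(selector.support_)  # indices of the selected features
print("Kept features:", kept)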

Datasets

  • MQ2007 (one of the LETOR datasets)
  • ISTELLA (this dataset poses an additional challenge, since its number of features cannot fit directly into the QPU)

Metrics

The selected features will then be used to train a LambdaMART model, whose performance will be measured on the Test Dataset in terms of nDCG@10.
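
For clarity, nDCG@10 for a single query can be computed with scikit-learn as in the minimal sketch below; the relevance labels and model scores are toy values.

import numpy as np
from sklearn.metrics import ndcg_score

true_relevance = np.asarray([[3, 2, 0, 1, 0, 2, 0, 0, 1, 0, 0, 3]])  # graded relevance labels
model_scores = np.asarray([[2.1, 1.3, 0.2, 0.9, 0.1, 1.7, 0.0, 0.3, 0.8, 0.2, 0.1, 2.5]])  # ranker output

print("nDCG@10:", ndcg_score(true_relevance, model_scores, k=10))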

Task 1B: The RS Task

The task is to select the subset of features that produces the best recommendation quality when used with an Item-Based KNN recommendation model. The KNN model computes the item-item similarity as the cosine between feature vectors, applying a shrinkage of 5 to the denominator; the number of neighbors is 100. The baselines for this task are the same Item-Based KNN recommendation model trained first with all the features and then with the features selected by a Bayesian search optimizing the recommendation effectiveness of the model.
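
A minimal NumPy sketch of the shrunk cosine item-item similarity described above (shrinkage of 5 at the denominator, 100 neighbors) is shown below; the item-feature matrix is toy data and the variable names are assumptions.

import numpy as np

rng = np.random.default_rng(0)
ICM = rng.random((1000, 150))        # toy items x features matrix
shrink, k = 5, 100

norms = np.linalg.norm(ICM, axis=1)
similarity = ICM @ ICM.T / (np.outer(norms, norms) + shrink)
np.fill_diagonal(similarity, 0.0)    # an item is not its own neighbor

# keep only the top-k neighbors per item (KNN pruning)
for i in range(similarity.shape[0]):
    top_k = np.argpartition(similarity[i], -k)[-k:]
    mask = np.ones_like(similarity[i], dtype=bool)
    mask[top_k] = False
    similarity[i, mask] = 0.0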

Datasets

The dataset is private and refers to a task of music recommendation. The dataset contains both collaborative data and two different sets of item features:

  • 150_ICM: Contains 150 features for each item.
  • 500_ICM: Contains 500 features for each item.

The User Rating Matrix (URM) contains tuples in the form (UserID, ItemID), listing which user interacted with which item. The Item Content Matrix (ICM) contains tuples in the form (ItemID, FeatureID, Value). Note that the ICM is sparse and any missing (ItemID, FeatureID) pair should be treated as missing data; a common assumption is to use a value of 0. The features refer to different types of descriptors and tags associated with the songs. Some of the features have been normalized. The Training Dataset can be downloaded HERE. Note that a private holdout of the data will be used for testing.
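
A minimal sketch of loading the URM and ICM tuples into sparse matrices with SciPy is shown below; the file names and column layout are assumptions, while the missing-entries-as-0 convention follows the description above.

import numpy as np
import pandas as pd
import scipy.sparse as sps

urm_df = pd.read_csv("URM_train.csv")  # assumed file name; columns: UserID, ItemID
icm_df = pd.read_csv("ICM_150.csv")    # assumed file name; columns: ItemID, FeatureID, Value

# implicit interactions: a 1 for every (UserID, ItemID) tuple
URM = sps.csr_matrix((np.ones(len(urm_df)), (urm_df["UserID"], urm_df["ItemID"])))

# missing (ItemID, FeatureID) pairs are implicitly 0 in the sparse representation
ICM = sps.csr_matrix((icm_df["Value"], (icm_df["ItemID"], icm_df["FeatureID"])))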

Metrics

The selected features will be used to train an Item-Based KNN recommendation model, whose performance will be measured on the Test Dataset with nDCG@10.

Submissions

Participants should submit the final set of features selected through their own solution using only the provided Training Datasets. Each participating team can provide at most 5 different subsets of features, so that it is possible to try different alternatives to achieve the best selection.
Submissions should be produced with both Quantum Annealing and Simulated Annealing, to compare the performance of quantum annealers against a traditional hardware alternative.
The submissions should be txt files according to the following format:

[Task]_[Dataset]_[Method]_[Groupname]_[SubmissionID].txt

1
4
5
8
...
44
45
['id1', 'id2', ..., 'idn']

where each line reports one of the features that are kept; features that are removed should not be listed in the file. In this example, features 1, 4, 5, 8, ..., 44, 45 are kept while features 2, 3, 6, 7, ... are removed.
The last line of the file represents the ids of the solved problems that relate to this given submission. For example, if you solved 3 different problems with QA or SA to achieve the final submission (e.g., you split the problem into subproblems and solve them separately), you should provide their ids in a list at the end of the submission file. Their ids can be retrieved directly from the code or through the dashboard.
The submission files should be left in your workspace in the directory called /config/workspace/submissions and the file names should comply with the following rules (a minimal sketch of writing such a file is shown after the list):

  • [Task]: it should be either 1A or 1B based on the task the submission refers to
  • [Dataset]: it should be either MQ2007, ISTELLA, 150_ICM or 500_ICM based on the dataset used
  • [Method]: it should be either QA or SA based on the method used
  • [Groupname]: the name of your group
  • [SubmissionID]: a submission ID that must be the same for the submissions using the same algorithm but performed with different methods (e.g., QA or SA)
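
A minimal sketch of writing a Task 1 submission file in this format is shown below; the selected features, problem ids, group name and submission ID are placeholders.

selected_features = [1, 4, 5, 8, 44, 45]  # features to keep (example)
problem_ids = ['id1', 'id2', 'id3']       # ids of the QA/SA problems solved for this run

filename = "/config/workspace/submissions/1A_MQ2007_QA_MyGroup_1.txt"
with open(filename, "w") as f:
    for feature in selected_features:
        f.write(f"{feature}\n")
    f.write(str(problem_ids) + "\n")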

Task 2: Clustering

Use QA to cluster documents, represented as embeddings, to ease the browsing of large collections. Clustering can help organize large collections, support users in exploring them, and provide search results similar to a given query. Furthermore, it can be used to divide users according to their interests or to build user models from the cluster centroids, speeding up the system or improving its effectiveness for users with limited data. Clustering is, however, a very complex task for QA, since the architecture of quantum annealers allows only a limited number of items and clusters to be considered. A baseline using K-medoids clustering with cosine distance will be used as an overall alternative.
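
For reference, a minimal sketch of the K-medoids baseline with cosine distance, using the scikit-learn-extra package on toy embeddings, is shown below; the number of clusters and the embedding dimensionality are illustrative.

import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))  # toy 768-dimensional sentence embeddings

kmedoids = KMedoids(n_clusters=10, metric="cosine", random_state=0).fit(embeddings)
centroids = kmedoids.cluster_centers_     # the medoid vectors
labels = kmedoids.labels_                 # cluster assignment of each document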

Task 2A: The IR Task

Obtain a list of representative centroids of the given dataset of embeddings (each embedding corresponds to a sentence taken from Yahoo). The cluster quality will then be measured with standard evaluation measures for clustering and with appropriate test queries that will be used to retrieve the most relevant documents for each query. Instead of comparing the query embedding with every document embedding in the corpus, the search will be restricted to the clusters that are most likely to contain relevant documents, thereby reducing the search space and improving retrieval speed.

Datasets

A split of the ANTIQUE dataset in which each sentence taken from Yahoo is turned into an embedding using a transformer model. The split contains roughly 6500 sentences; a smaller dataset of roughly 2200 sentences is also provided for testing the clustering algorithm.
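
As an illustration, sentences can be turned into embeddings with the sentence-transformers library as in the sketch below; the model name is an assumption and not necessarily the one used to build the official dataset.

from sentence_transformers import SentenceTransformer

sentences = ["How do I fix a flat tire?", "What is the capital of France?"]
model = SentenceTransformer("all-mpnet-base-v2")  # assumed model, produces 768-dimensional vectors
embeddings = model.encode(sentences)              # shape: (len(sentences), 768)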

Metrics

  • the Davies-Bouldin Index will be used to measure the overall cluster quality;
  • nDCG@10 will be used to measure the overall retrieval quality based on a set of 50 queries. Each query will be transformed into its corresponding embedding; the Cosine Similarity will then be used to find the closest centroid and its corresponding cluster of documents; finally, all the documents belonging to that cluster will be retrieved and ranked using the Cosine Similarity between the documents and the query (see the sketch after this list).
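
A minimal sketch of this centroid-restricted retrieval is shown below; the embeddings, centroids and cluster assignments are random placeholders.

import numpy as np

def cosine(a, b):
    return a @ b.T / (np.linalg.norm(a, axis=-1, keepdims=True) * np.linalg.norm(b, axis=-1))

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(6500, 768))
centroids = rng.normal(size=(10, 768))
cluster_of_doc = rng.integers(0, 10, size=6500)  # cluster assignment of each document
query_embedding = rng.normal(size=(1, 768))

# 1) find the centroid closest to the query
closest_cluster = int(np.argmax(cosine(query_embedding, centroids)))

# 2) rank only the documents of that cluster by cosine similarity to the query
candidate_ids = np.flatnonzero(cluster_of_doc == closest_cluster)
scores = cosine(query_embedding, doc_embeddings[candidate_ids]).ravel()
ranking = candidate_ids[np.argsort(-scores)][:10]  # top-10 documents for nDCG@10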

Submissions

Participants should submit lists of 10, 25 and 50 vectors that represent the final centroids achieved through their clustering algorithm. Each centroid should be followed by the list of documents that belong to its cluster. Each team can provide at most 5 different submissions. Submissions should be produced with both Quantum Annealing and Simulated Annealing, to compare the performance of quantum annealers against a traditional hardware alternative.
The submissions should be txt files according to the following format:

[Centroids]_[Method]_[Groupname]_[SubmissionID].txt

[
	{'centroid' : [coord1, coord2, ..., coord767, coord768], 'docs': ['id1', 'id2', ..., 'idn']},
	{'centroid' : [coord1, coord2, ..., coord767, coord768], 'docs': ['id1', 'id2', ..., 'idn']},
	...
	{'centroid' : [coord1, coord2, ..., coord767, coord768], 'docs': ['id1', 'id2', ..., 'idn']},
]
['id1', 'id2', ..., 'idn']

where each line reports one of the centroids with its coordinates and its corresponding associated documents using only their ids.
The last line of the file represents the ids of the solved problems that relate to this given submission. For example, if you solved 3 different problems with QA or SA to achieve the final submission (e.g., you split the problem into subproblems and solve them separately), you should provide their ids in a list at the end of the submission file. Their ids can be retrieved directly from the code or through the dashboard.
The submission files should be left in your workspace in the directory called /config/workspace/submissions and the file names should comply with the following rules (a minimal sketch of writing such a file is shown after the list):

  • [Centroids]: it should be either 10, 25 or 50 based on the number of centroids
  • [Method]: it should be either QA or SA based on the method used
  • [Groupname]: the name of your group
  • [SubmissionID]: a submission ID that must be the same for the submissions using the same algorithm but performed with different methods (e.g., QA or SA)
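
A minimal sketch of writing a Task 2 submission file in this format is shown below; the centroids (truncated to 3 coordinates for brevity), document ids, problem ids and group name are placeholders.

clusters = [
    {'centroid': [0.12, -0.05, 0.33], 'docs': ['doc_1', 'doc_7']},  # 768 coordinates in practice
    {'centroid': [0.40, 0.01, -0.22], 'docs': ['doc_2', 'doc_5']},
]
problem_ids = ['id1', 'id2']

filename = "/config/workspace/submissions/10_QA_MyGroup_1.txt"
with open(filename, "w") as f:
    f.write("[\n")
    for cluster in clusters:
        f.write("\t" + str(cluster) + ",\n")
    f.write("]\n")
    f.write(str(problem_ids) + "\n")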

For Everyone

For each task we will propose a baseline that can be a starting point for the participants. In addition, we will provide tutorials covering both the usage of the provided libraries to interact with D-Wave quantum annealers and the basics of the Annealing process. Participants can also find here a small tutorial covering how to use our infrastructure and 2 notebooks about Feature Selection and Clustering. In this way, participants will learn more about the fundamental concepts underlying the QA paradigm and will be able to explore and experiment with new solutions. Participants will only need a laptop with internet access and knowledge of the Python programming language!
We have already been granted access to D-Wave Quantum Annealers through a project granted by CINECA, the largest Italian supercomputing center and one of the most important worldwide. We have already made an agreement that grants us sufficient QPU access time, which can be distributed among several participants. In addition, we will use Amazon cloud resources to deploy our evaluation infrastructure.



Reproducibility represents a core aspect of this Lab. Each participating team will have access to its own machine with fixed characteristics (e.g., a fixed number of CPU cores, a fixed amount of RAM...). This ensures that all the teams will work with the same hardware capabilities and that the final results achieved by different groups will be easily comparable. In addition, we will set up public repositories that the teams will use to publish their own solutions and results. In this manner, future research groups in this field can use our datasets, results and solutions to benchmark their own approaches against the ones obtained in our Lab. Participants will face further challenges in each task, such as developing new problem formulations that account for the limited connectivity of the quantum annealer or splitting the original problems into subproblems. Participants will have access to the D-Wave Problem Inspector to analyze and visualize how the problem has been physically embedded in the Quantum Processing Unit.
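
For illustration, the sketch below shows how a small QUBO could be submitted to a D-Wave quantum annealer and opened in the Problem Inspector; it assumes valid D-Wave API credentials are configured in the environment, and the QUBO itself is a toy example.

import dimod
import dwave.inspector
from dwave.system import DWaveSampler, EmbeddingComposite

# toy two-variable QUBO
bqm = dimod.BinaryQuadraticModel.from_qubo({(0, 0): -1.0, (1, 1): -1.0, (0, 1): 2.0})

sampler = EmbeddingComposite(DWaveSampler())  # finds a physical embedding on the QPU
sampleset = sampler.sample(bqm, num_reads=100)

dwave.inspector.show(sampleset)               # visualize how the problem was embedded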



Deadlines

Here you can find all the important deadlines (all deadlines are strict):

  • Registration closes: April 22, 2024
  • Runs submission deadline: May 6, 2024
  • Evaluation results out: May 20, 2024
  • Participants' papers submission deadline: May 31, 2024 (extended to June 3, 2024)
  • Notification of acceptance for participants' papers: June 24, 2024
  • Camera-ready participants' papers submission: July 8, 2024
  • QuantumCLEF Workshop: September 9-12, 2024 during the CLEF Conference

Contacts

Here you can find the email addresses of all the organizers of the Lab. Do not hesitate to contact us for any information!