Our Tasks
QuantumCLEF 2025 addresses three different tasks involving computationally intensive problems that are closely related to the Information Access field: Feature Selection, Instance Selection, and Clustering. There is one problem for each task, and each problem is solvable with the QA paradigm. For each task, participants are asked to submit solutions obtained with both Quantum Annealing and Simulated Annealing, so that the two methods can be compared in terms of efficiency and effectiveness.
Task 1: Feature Selection
Apply quantum annealers to find the most relevant subset of features to train a learning model, e.g., for ranking. This problem is very impactful, since many IR and RS systems involve the optimization of learning models, and reducing the dimensionality of the input data can improve their performance.
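To give an idea of how such a problem can be cast into the QUBO form required by annealers, the following Python sketch encodes a common feature-selection objective (reward features correlated with the target, penalize redundant pairs, constrain the number of selected features) and solves it with the dwave-neal Simulated Annealing sampler. This is only an illustrative sketch with toy data and hand-picked weights, not the official task formulation; on the QPU side, the same binary quadratic model could be submitted to a D-Wave sampler instead.

# Illustrative QUBO sketch for feature selection (not the official task formulation).
# Assumptions: toy data, a budget of k features, hand-picked penalty weights alpha and lam.
import numpy as np
import dimod
from neal import SimulatedAnnealingSampler

rng = np.random.default_rng(0)
X = rng.random((200, 20))                        # toy feature matrix
y = (X[:, 0] + X[:, 1] > 1).astype(float)        # toy relevance target
n, k, alpha, lam = X.shape[1], 5, 1.0, 2.0

rel = np.abs([np.corrcoef(X[:, i], y)[0, 1] for i in range(n)])   # feature-target correlation
red = np.abs(np.corrcoef(X, rowvar=False))                        # feature-feature correlation

Q = {}
for i in range(n):
    # Linear terms: reward relevance, plus the expansion of the lam * (sum_i x_i - k)^2 penalty.
    Q[(i, i)] = -rel[i] + lam * (1 - 2 * k)
    for j in range(i + 1, n):
        # Quadratic terms: penalize redundant pairs and enforce the cardinality constraint.
        Q[(i, j)] = alpha * red[i, j] + 2 * lam

bqm = dimod.BinaryQuadraticModel.from_qubo(Q)
best = SimulatedAnnealingSampler().sample(bqm, num_reads=100).first.sample
selected = sorted(i for i, v in best.items() if v == 1)
print("Selected features:", selected)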
Task 1A: The IR Task
Select the most relevant features in the considered datasets to train a LambdaMART model and thus achieve the highest score. A baseline using Recursive Feature Elimination (RFE) with a Logistic Regression classifier will be used as an overall alternative.
Datasets
- MQ2007 (one of the LETOR datasets)
- ISTELLA (this dataset poses an additional challenge, since its number of features cannot fit directly on the QPU)
Metrics
The selected features will then be used to train a LambdaMART model, whose performance will be measured on the Test Dataset in terms of nDCG@10.
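For reference, nDCG@10 compares the Discounted Cumulative Gain of the produced ranking with that of the ideal ranking of the same documents. A minimal, self-contained version is sketched below; the official evaluation may use a slightly different gain or discount convention.

import numpy as np

def ndcg_at_k(relevances_in_ranked_order, k=10):
    # nDCG@k with exponential gain and log2 discount (a common convention).
    rel = np.asarray(relevances_in_ranked_order, dtype=float)[:k]
    dcg = np.sum((2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevances_in_ranked_order, dtype=float))[::-1][:k]
    idcg = np.sum((2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

# Example: graded relevance labels of the top documents returned for one query.
print(ndcg_at_k([2, 3, 0, 1, 2, 0, 0, 1, 0, 0]))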
Task 1B: The RS Task
The task is to select the subset of features that produces the best recommendation quality when used by an Item-Based KNN recommendation model. The KNN model computes the item-item similarity as the cosine of the feature vectors, applying a shrink term of 5 to the denominator; the number of neighbours k is 100. The baselines for this task are the same Item-Based KNN recommendation model trained using all the features, and the same model trained using the features selected by a Bayesian search that optimizes the model's recommendation effectiveness.
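To make the baseline concrete, the sketch below computes a shrunk cosine item-item similarity from an item-feature matrix and keeps the 100 most similar neighbours per item, mirroring the description above; the actual baseline implementation may differ in its details.

# Illustrative shrunk cosine item-item similarity for an Item-Based KNN recommender.
import numpy as np
import scipy.sparse as sp

def shrunk_cosine_topk(icm, shrink=5, k=100):
    # icm: sparse (n_items x n_features) matrix; returns a sparse top-k similarity matrix.
    icm = sp.csr_matrix(icm, dtype=np.float64)
    norms = np.sqrt(np.asarray(icm.multiply(icm).sum(axis=1)).ravel())
    sim = (icm @ icm.T).toarray()                       # item-item dot products (dense, for a sketch)
    sim /= np.outer(norms, norms) + shrink              # cosine denominator plus the shrink term
    np.fill_diagonal(sim, 0.0)                          # no self-similarity
    for row in sim:                                     # keep only the k largest values per row
        if row.size > k:
            row[row < np.partition(row, -k)[-k]] = 0.0
    return sp.csr_matrix(sim)

# Recommendation scores for all users are then URM @ similarity (users x items).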
Datasets
The dataset is private and concerns a music recommendation task. It contains both collaborative data and two different sets of item features:
- 150_ICM: Contains 150 features for each item.
- 500_ICM: Contains 500 features for each item.
The User Rating Matrix (URM) contains tuples in the form (UserID, ItemID), listing which user interacted with which item. The Item Content Matrix (ICM) contains tuples in the form (ItemID, FeatureID, Value); note that the ICM is sparse, and any missing (ItemID, FeatureID) pair should be treated as missing data, commonly assumed to have value 0. The features refer to different types of descriptors and tags associated with the songs; some of them have been normalized. The Training Dataset can be downloaded HERE. Note that a private holdout of the data will be used for testing.
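A minimal way to load the two matrices into sparse format is sketched below. The file names and column names are assumptions; only the (UserID, ItemID) and (ItemID, FeatureID, Value) layouts come from the description above, and missing pairs are implicitly treated as 0.

# Hypothetical loader: file names and column names are assumptions.
import numpy as np
import pandas as pd
import scipy.sparse as sp

urm_df = pd.read_csv("URM_train.csv")     # expected columns: UserID, ItemID
icm_df = pd.read_csv("ICM_150.csv")       # expected columns: ItemID, FeatureID, Value

n_users = urm_df["UserID"].max() + 1
n_items = max(urm_df["ItemID"].max(), icm_df["ItemID"].max()) + 1
n_features = icm_df["FeatureID"].max() + 1

# Implicit interactions: 1 where the user interacted with the item.
urm = sp.csr_matrix((np.ones(len(urm_df)), (urm_df["UserID"], urm_df["ItemID"])),
                    shape=(n_users, n_items))
# Sparse item-feature matrix: missing (ItemID, FeatureID) pairs are simply 0.
icm = sp.csr_matrix((icm_df["Value"], (icm_df["ItemID"], icm_df["FeatureID"])),
                    shape=(n_items, n_features))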
Metrics
The selected features will be used to train an Item-Based KNN recommendation model, whose performance will be measured on the Test Dataset in terms of nDCG@10.
Submissions
Participants should submit the final set of features selected by their own solution, using only the provided Training Datasets. Each participating team can provide at most 5 different subsets of features, so that it is possible to try different alternatives and achieve the best selection. Submissions should be produced with both Quantum Annealing and Simulated Annealing, in order to compare the performance of quantum annealers against a possible traditional hardware alternative.
The submissions should be txt files, named according to the following format:
[Task]_[Dataset]_[Method]_[Groupname]_[SubmissionID].txt
and containing, for example:
1
4
5
8
...
44
45
['id1', 'id2', ..., 'idn']
where each line reports one of the features that is kept; the removed features should not be listed in this file. In this example, features 1, 4, 5, 8, ..., 44, 45 are kept while features 2, 3, 6, 7, ... are removed. The last line of the file reports the ids of the solved problems that relate to this submission. For example, if you solved 3 different problems with QA or SA to obtain the final submission (e.g., you split the problem into subproblems and solved them separately), you should provide their ids in a list at the end of the submission file. These ids can be retrieved directly from the code or through the dashboard.
The submission files should be placed in your workspace, in the directory called /config/workspace/submissions, and the file names should comply with the following rules:
- [Task]: it should be either 1A or 1B based on the task the submission refers to
- [Dataset]: it should be either MQ2007, ISTELLA, 150_ICM or 500_ICM based on the dataset used
- [Method]: it should be either QA or SA based on the method used
- [Groupname]: the name of your group
- [SubmissionID]: a submission ID that must be the same for the submissions using the same algorithm but performed with different methods (e.g., QA or SA)
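A small helper that writes a submission file following the format and naming rules above may look like the sketch below; the feature ids and problem ids in the example call are placeholders.

# Writes a Task 1 submission file following the format described above (placeholder values).
import os

def write_feature_submission(task, dataset, method, group, submission_id,
                             kept_features, problem_ids,
                             base_dir="/config/workspace/submissions"):
    name = f"{task}_{dataset}_{method}_{group}_{submission_id}.txt"
    path = os.path.join(base_dir, name)
    with open(path, "w") as f:
        for feat in kept_features:
            f.write(f"{feat}\n")              # one kept feature per line
        f.write(str(problem_ids) + "\n")      # last line: ids of the solved problems
    return path

# Example:
# write_feature_submission("1A", "MQ2007", "SA", "MyGroup", "1", [1, 4, 5, 8], ['id1', 'id2'])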
Task 2: Instance Selection
This section is under construction...
Task 3: Clustering
Use QA to cluster documents, represented as embeddings, to ease the browsing of large collections. Clustering can help organize large collections, support users in exploring a collection, and provide search results similar to a given query. Furthermore, it can be used to divide users according to their interests or to build user models from the cluster centroids, speeding up the runtime of the system or improving its effectiveness for users with limited data. Clustering is, however, a very complex task for QA, since the architecture of quantum annealers only allows clustering a limited number of items into a limited number of clusters. A baseline using K-medoids clustering with cosine distance will be used as an overall alternative.
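One way to phrase such a clustering problem in QUBO form, loosely following the K-medoids baseline, is to select k medoids that are central and mutually far apart, with a penalty enforcing the number of medoids, and then assign every document to its nearest medoid. The sketch below solves this with the dwave-neal Simulated Annealing sampler on toy embeddings; the weights and the formulation itself are illustrative assumptions, not the official approach.

# Illustrative medoid-selection QUBO solved with Simulated Annealing (not the official approach).
import numpy as np
import dimod
from neal import SimulatedAnnealingSampler
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))        # toy stand-in for the provided sentence embeddings
D = cosine_distances(emb)
n, k = len(emb), 10
gamma, lam = 1.0 / n, 2.0                # hand-picked weights (assumptions)

Q = {}
for i in range(n):
    # Centrality term plus the expansion of the lam * (sum_i z_i - k)^2 cardinality penalty.
    Q[(i, i)] = D[i].mean() + lam * (1 - 2 * k)
    for j in range(i + 1, n):
        # Reward medoids that are far apart from each other; keep roughly k of them.
        Q[(i, j)] = -gamma * D[i, j] + 2 * lam

bqm = dimod.BinaryQuadraticModel.from_qubo(Q)
best = SimulatedAnnealingSampler().sample(bqm, num_reads=200).first.sample
medoids = [i for i, v in best.items() if v == 1]
assignment = np.argmin(D[:, medoids], axis=1)     # assign each document to its nearest medoid
print(len(medoids), "medoids selected")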
Task 3A: The IR Task
Obtain a list of representative centroids of the given dataset of embeddings (each embedding corresponds to a sentence taken from Yahoo). The cluster quality will then be measured both with standard evaluation measures for clustering and with suitable test queries that will be used to retrieve the most relevant documents for each query. Instead of comparing the query embedding with every document embedding in the corpus, the search will be restricted to the clusters that are most likely to contain relevant documents, thereby reducing the search space and improving retrieval speed.
Datasets
A split of the ANTIQUE dataset in which each sentence, taken from Yahoo, is turned into an embedding using a transformer model. The split contains roughly 6500 sentences; another, smaller dataset of roughly 2200 sentences is also provided to test the clustering algorithm.
Metrics
- the Davies-Bouldin Index will be used to measure the overall cluster quality;
- nDCG@10 will be used to measure the overall retrieval quality, based on a set of 50 queries. Each query will be transformed into its corresponding embedding; Cosine Similarity is then used to find the closest centroid and its corresponding cluster of documents; finally, all the documents belonging to that cluster are retrieved and ranked by their Cosine Similarity to the query (a sketch of this procedure follows the list).
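The two evaluation steps can be sketched as follows; the function and variable names are illustrative, and the official evaluation scripts may differ.

# Sketch of the evaluation described above: cluster quality plus cluster-restricted retrieval.
import numpy as np
from sklearn.metrics import davies_bouldin_score
from sklearn.metrics.pairwise import cosine_similarity

def cluster_quality(embeddings, labels):
    # Lower Davies-Bouldin values indicate more compact, better-separated clusters.
    return davies_bouldin_score(embeddings, labels)

def retrieve(query_emb, centroids, clusters, doc_embs, doc_ids, top_k=10):
    # clusters: one list of document indices per centroid, in the same order as centroids.
    c = int(np.argmax(cosine_similarity(query_emb.reshape(1, -1), centroids)))
    members = clusters[c]
    # Rank only the documents of the selected cluster by similarity to the query.
    sims = cosine_similarity(query_emb.reshape(1, -1), doc_embs[members]).ravel()
    return [doc_ids[members[i]] for i in np.argsort(-sims)[:top_k]]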
Submissions
Participants should submit lists of 10, 25 and 50 vectors representing the final centroids obtained with their clustering algorithm. Each centroid should be followed by the list of documents that belong to its cluster. Each team can provide at most 5 different submissions. Submissions should be produced with both Quantum Annealing and Simulated Annealing, in order to compare the performance of quantum annealers against a possible traditional hardware alternative.
The submissions should be txt files, named according to the following format:
[Centroids]_[Method]_[Groupname]_[SubmissionID].txt
and containing, for example:
[
{'centroid' : [coord1, coord2, ..., coord767, coord768], 'docs': ['id1', 'id2', ..., 'idn']},
{'centroid' : [coord1, coord2, ..., coord767, coord768], 'docs': ['id1', 'id2', ..., 'idn']},
...
{'centroid' : [coord1, coord2, ..., coord767, coord768], 'docs': ['id1', 'id2', ..., 'idn']},
]
['id1', 'id2', ..., 'idn']
where each line reports one of the centroids with its coordinates and the documents associated with it, identified only by their ids. The last line of the file reports the ids of the solved problems that relate to this submission. For example, if you solved 3 different problems with QA or SA to obtain the final submission (e.g., you split the problem into subproblems and solved them separately), you should provide their ids in a list at the end of the submission file. These ids can be retrieved directly from the code or through the dashboard.
The submission files should be placed in your workspace, in the directory called /config/workspace/submissions, and the file names should comply with the following rules:
- [Centroids]: it should be either 10, 25 or 50 based on the number of centroids
- [Method]: it should be either QA or SA based on the method used
- [Groupname]: the name of your group
- [SubmissionID]: a submission ID that must be the same for the submissions using the same algorithm but performed with different methods (e.g., QA or SA)
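A small helper that writes a clustering submission file following the format and naming rules above may look like the sketch below; the directory and the dictionary-per-line layout mirror the example, while the values are placeholders.

# Writes a clustering submission file matching the example format above (placeholder values).
import os

def write_clustering_submission(n_centroids, method, group, submission_id,
                                centroids, clusters, problem_ids,
                                base_dir="/config/workspace/submissions"):
    # centroids: list of coordinate lists; clusters: list of document-id lists (same order).
    name = f"{n_centroids}_{method}_{group}_{submission_id}.txt"
    path = os.path.join(base_dir, name)
    with open(path, "w") as f:
        f.write("[\n")
        for centroid, docs in zip(centroids, clusters):
            f.write(str({'centroid': list(centroid), 'docs': list(docs)}) + ",\n")
        f.write("]\n")
        f.write(str(problem_ids) + "\n")      # last line: ids of the solved problems
    return path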
Deadlines
Here you can find all the important deadlines; all of them are strict:
- Registration closes: April 22, 2024
- Runs submission deadline: May 6, 2024
- Evaluation results out: May 20, 2024
- Participants' papers submission deadline: May 31, 2024. Follow the instructions. Click here for the LaTeX template.
- Notification of acceptance for participants' papers: June 24, 2024
- Camera-ready participants' papers submission: July 8, 2024
- QuantumCLEF Workshop: September 9-12, 2024 during the CLEF Conference