In the current age of Big Data, Information Access technologies, such as Information Retrieval (IR) and Recommender Systems (RS), play a crucial role in quickly and effectively retrieving relevant resources to meet the information needs of users.
Such systems face major challenges in terms of both efficiency and effectiveness, since they need to work with huge amounts of complex and heterogeneous information while often relying on computationally intensive methods.
Research in Quantum Computing (QC) has resulted in the development of very powerful devices that are now able to tackle realistic problems, promising great performance improvements for some computationally intensive tasks. In fact, by leveraging quantum-mechanical phenomena such as superposition and entanglement, QC technologies can explore problem spaces that are exponentially larger than those that classical machines with an equivalent number of traditional bits can handle.
Nevertheless, neither the use of QC technologies nor how to express IR/RS algorithms in a form suitable for QC has been investigated in the field yet. However, Information Access has an extreme need for computational resources, and QC could provide many potential advantages, such as faster search/recommendation/access times, improved accuracy, the ability to handle larger and more complex datasets, and, last but not least, greener computation.
Therefore, the objective of QuantumCLEF is to design and develop an evaluation infrastructure for QC algorithms and, in particular, for Quantum Annealing (QA) algorithms closely related to the Information Access field, which will make it possible to:
• identify new problem formulations so that quantum annealers can be employed efficiently and effectively for their resolution;
• compare the performance of QA approaches with their non-quantum counterparts running on traditional hardware;
• involve researchers from different fields (e.g., Information Retrieval, Recommender Systems, Operations Research...) to work together and learn more about QA technologies.
Our Tasks
QuantumCLEF addresses two different tasks involving computationally intensive problems that are closely related to the Information Access field: Feature Selection and Clustering. There is one problem for each task, and each problem is solvable with the QA paradigm. For each task, participants are asked to submit their solutions using both Quantum Annealing (QA) and Simulated Annealing (SA), so that the two methods can be compared in terms of both efficiency and effectiveness.
Task 1: Feature Selection
Apply quantum annealers to find the most relevant subset of features to train a learning model, e.g., for ranking. This problem is very impactful, since many IR and RS systems involve the optimization of learning models, and reducing the dimensionality of the input data can improve their performance.
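Below is a minimal sketch (not the official task formulation) of how feature selection can be cast as a QUBO problem and solved with Simulated Annealing using the D-Wave Ocean stack (dimod and neal); the correlation-based relevance/redundancy objective, the penalty weights and the synthetic data are illustrative assumptions.

    import numpy as np
    import dimod
    import neal

    def feature_selection_bqm(X, y, k, alpha=0.5, penalty=2.0):
        """Build a QUBO rewarding target-relevant features, penalizing redundant
        pairs and deviations from exactly k selected features (illustrative)."""
        n = X.shape[1]
        relevance = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(n)])
        redundancy = np.abs(np.corrcoef(X, rowvar=False))
        bqm = dimod.BinaryQuadraticModel('BINARY')
        for i in range(n):
            # Linear terms: relevance reward plus the linear part of penalty * (sum(x) - k)^2.
            bqm.add_variable(i, -alpha * relevance[i] + penalty * (1 - 2 * k))
        for i in range(n):
            for j in range(i + 1, n):
                # Quadratic terms: redundancy plus the quadratic part of the cardinality penalty.
                bqm.add_interaction(i, j, (1 - alpha) * redundancy[i, j] + 2 * penalty)
        return bqm

    # Toy usage with random data; variable labels are 0-based feature indices.
    rng = np.random.default_rng(0)
    X, y = rng.random((200, 30)), rng.integers(0, 2, 200)
    sampleset = neal.SimulatedAnnealingSampler().sample(feature_selection_bqm(X, y, k=10), num_reads=100)
    selected = sorted(v for v, bit in sampleset.first.sample.items() if bit == 1)
    print(selected)

The same BQM could in principle be submitted to a D-Wave quantum annealer instead of the Simulated Annealing sampler, which is what makes the QA/SA comparison possible.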
Task 1A: The IR Task
Select the most relevant features in the considered datasets to train a LambdaMART model and thus achieve the highest score. A baseline using Recursive Feature Elimination (RFE) with a Logistic Regression classifier will be used as an overall alternative.
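The following is a minimal sketch of the kind of RFE baseline mentioned above, built with scikit-learn's RFE and a Logistic Regression estimator; the synthetic data and the number of selected features are illustrative assumptions.

    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.random((500, 46))      # e.g., MQ2007 provides 46 features per query-document pair
    y = rng.integers(0, 2, 500)    # binarized relevance labels (an assumption for this toy example)

    # Recursively drop the least important features according to the Logistic Regression weights.
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=20).fit(X, y)
    kept = np.flatnonzero(selector.support_) + 1   # 1-based feature ids, as in the submission format
    print(kept)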
Datasets
MQ2007 (one of the LETOR datasets)
ISTELLA (an additional challenge, since its number of features does not fit directly in the QPU)
Metrics
The selected features will then be used to train a LambdaMART model and measure its performance on the Test Dataset. The performance will be measured in terms of nDCG@10.
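A minimal sketch of this evaluation step, using LightGBM's LGBMRanker as one common LambdaMART implementation and scikit-learn's ndcg_score; the organizers' exact training setup is not specified here, so hyperparameters and data shapes are assumptions.

    import numpy as np
    import lightgbm as lgb
    from sklearn.metrics import ndcg_score

    def evaluate_selection(X_train, y_train, group_train, X_test, y_test, selected):
        """Train LambdaMART on the selected feature columns and return nDCG@10 on the test data."""
        model = lgb.LGBMRanker(objective="lambdarank", n_estimators=200)
        model.fit(X_train[:, selected], y_train, group=group_train)
        scores = model.predict(X_test[:, selected])
        # ndcg_score expects one row per query; a single test query is assumed here for brevity.
        return ndcg_score([y_test], [scores], k=10)

    # Toy usage with synthetic data: two training queries of 50 documents each.
    rng = np.random.default_rng(0)
    X_tr, y_tr = rng.random((100, 46)), rng.integers(0, 3, 100)
    X_te, y_te = rng.random((50, 46)), rng.integers(0, 3, 50)
    print(evaluate_selection(X_tr, y_tr, [50, 50], X_te, y_te, selected=list(range(20))))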
Task 1B: The RS Task
The task is to select the subset of features that produces the best recommendation quality when used in an Item-Based KNN recommendation model. The KNN model computes the item-item similarity as the cosine of the feature vectors, applies a shrinkage of 5 to the denominator, and uses 100 neighbors.
The baselines for this task are the same Item-Based KNN recommendation model trained using all the features, and the same model trained using the features selected by a Bayesian search that optimizes the model's recommendation effectiveness.
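A minimal sketch of the shrunk cosine item-item similarity described above, computed from a sparse Item Content Matrix; the dense top-k selection is illustrative and not memory-efficient for large collections.

    import numpy as np
    import scipy.sparse as sps

    def shrunk_cosine_similarity(ICM, shrink=5, k=100):
        """Item-item cosine similarity on the ICM rows, with a shrinkage term added to the
        denominator, keeping only the k most similar neighbors per item."""
        numerator = (ICM @ ICM.T).toarray()
        norms = np.sqrt(np.asarray(ICM.multiply(ICM).sum(axis=1))).ravel()
        similarity = numerator / (np.outer(norms, norms) + shrink)
        np.fill_diagonal(similarity, 0.0)          # no self-similarity
        for row in similarity:                     # keep only the top-k neighbors of each item
            row[np.argsort(row)[:-k]] = 0.0
        return sps.csr_matrix(similarity)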
Datasets
The dataset is private and refers to a task of music recommendation. The dataset contains both collaborative data and two different sets of item features:
150_ICM: Contains 150 features for each item.
500_ICM: Contains 500 features for each item.
The User Rating Matrix (URM) contains tuples in the form (UserID, ItemID), listing which user interacted with which item.
The Item Content Matrix (ICM) contains tuples in the form (ItemID, FeatureID, Value). Note that the ICM is sparse and any missing (ItemID, FeatureID) couple should be treated as missing data; a common assumption is to use a value of 0. The features refer to different types of descriptors and tags associated with the songs. Some of the features have been normalized.
The Training Dataset can be downloaded HERE. Note that a private holdout of the data will be used for testing.
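A minimal sketch of how the URM and ICM tuples could be loaded into sparse matrices; the file names, column layout and the assumption that ids are contiguous integers starting from 0 are illustrative, not the official data format.

    import numpy as np
    import pandas as pd
    import scipy.sparse as sps

    # Hypothetical file names, matching the tuple formats described above.
    urm_df = pd.read_csv("URM_train.csv")   # columns: UserID, ItemID
    icm_df = pd.read_csv("150_ICM.csv")     # columns: ItemID, FeatureID, Value

    n_users = urm_df["UserID"].max() + 1
    n_items = max(urm_df["ItemID"].max(), icm_df["ItemID"].max()) + 1
    n_features = icm_df["FeatureID"].max() + 1

    # Implicit interactions: a 1 wherever a (UserID, ItemID) tuple is present.
    URM = sps.csr_matrix((np.ones(len(urm_df)), (urm_df["UserID"], urm_df["ItemID"])),
                         shape=(n_users, n_items))
    # Missing (ItemID, FeatureID) couples are simply absent from the sparse matrix, i.e., treated as 0.
    ICM = sps.csr_matrix((icm_df["Value"], (icm_df["ItemID"], icm_df["FeatureID"])),
                         shape=(n_items, n_features))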
Metrics
The selected features will be used to train an Item-Based KNN recommendation model and measure its performance on the Test Dataset with nDCG@10.
Submissions
Participants should submit the final set of features selected with their own solution, using only the provided Training Datasets. Each participating team can provide at most 5 different subsets of features, so that it is possible to try different alternatives to achieve the best selection. The submissions must be produced using both Quantum Annealing and Simulated Annealing, to compare the performance of quantum annealers with a possible traditional hardware alternative.
The submissions should be txt files according to the following format:
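A minimal sketch of such a file, consistent with the example discussed below (the exact format of the last line listing the problem ids is an assumption):

    1
    4
    5
    8
    ...
    44
    45
    [problem_id_1, problem_id_2, problem_id_3]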
Each line reports the id of one feature that is kept; features that are removed must not appear in the file.
In this example, features 1, 4, 5, 8, ..., 44, 45 are kept while features 2, 3, 6, 7, ... are removed.
The last line of the file lists the ids of the problems solved to produce this submission. For example, if you solved 3 different problems with QA or SA to achieve the final submission (e.g., you split the problem into subproblems and solved them separately), you should provide their ids in a list at the end of the submission file. The ids can be retrieved directly from the code or through the dashboard.
The submission files should be left in your workspace in the directory called /config/workspace/submissions and the file names should comply with the following rules:
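For instance, a file name assembled from the placeholders explained below might look like [Task]_[Dataset]_[Method]_[Groupname]_[SubmissionID].txt; the exact separator and ordering are an assumption.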
[Task]: it should be either 1A or 1B based on the task the submission refers to
[Dataset]: it should be either MQ2007, ISTELLA, 150_ICM or 500_ICM based on the dataset used
[Method]: it should be either QA or SA based on the method used
[Groupname]: the name of your group
[SubmissionID]: a submission ID that must be the same for the submissions using the same
algorithm but performed with different methods (e.g., QA or SA)
Task 2: Clustering
Use QA to cluster documents, represented as embeddings, to ease the browsing of large collections. Clustering can be helpful for organizing large collections, helping users explore them, and providing search results similar to a given query. Furthermore, it can be used to divide users according to their interests or to build user models from the cluster centroids, improving the runtime of the system or its effectiveness for users with limited data.
Clustering is, however, a very complex task in the case of QA, since the architecture of quantum annealers only allows clustering a limited number of items into a limited number of clusters. A baseline using K-medoids clustering with cosine distance will be used as an overall alternative.
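A minimal sketch of such a K-medoids baseline with cosine distance, using scikit-learn-extra's KMedoids; the number of clusters, the embedding dimensionality and the random data are illustrative assumptions.

    import numpy as np
    from sklearn_extra.cluster import KMedoids

    rng = np.random.default_rng(0)
    embeddings = rng.random((2200, 384))   # e.g., one 384-dimensional embedding per sentence

    kmedoids = KMedoids(n_clusters=10, metric="cosine", random_state=0).fit(embeddings)
    centroids = kmedoids.cluster_centers_  # the medoid embeddings, i.e., the cluster representatives
    labels = kmedoids.labels_              # the cluster each document belongs to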
Task 2A: The IR Task
Obtain a list of representative centroids of the given dataset of embeddings (each embedding represents a sentence picked from Yahoo). The cluster quality will then be measured with standard evaluation measures for clustering and also with appropriate test queries used to retrieve the most relevant documents for each query. Instead of comparing the query embedding with every document embedding in the corpus, the search will be restricted to the clusters that are most likely to contain relevant documents, thereby reducing the search space and improving retrieval speed.
Datasets
A split of the ANTIQUE dataset in which each sentence taken from Yahoo is turned into an embedding using a transformer model. The split contains roughly 6500 sentences; a smaller dataset of roughly 2200 sentences is also provided to test the clustering algorithm.
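A minimal sketch of how sentences can be encoded into embeddings with a transformer; the sentence-transformers library and the specific model name are assumptions, since the encoder actually used by the organizers is not stated here.

    from sentence_transformers import SentenceTransformer

    # Hypothetical encoder; the Lab does not state which transformer model is used.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = ["How do I fix a flat bicycle tire?",
                 "What is the best way to learn a new language?"]
    embeddings = model.encode(sentences)   # one dense vector per sentence
    print(embeddings.shape)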
Metrics
• the Davies-Bouldin Index will be used to measure the overall cluster quality;
• nDCG@10 will be used to measure the overall retrieval quality based on a set of 50 queries: each query will be transformed into its corresponding embedding, then the Cosine Similarity is used to find the closest centroid and its corresponding cluster of documents, and finally all the documents belonging to that cluster will be retrieved and ranked by the Cosine Similarity between the documents and the query (see the sketch after this list).
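A minimal sketch of these two measures, using scikit-learn's davies_bouldin_score and a retrieval function restricted to the cluster of the closest centroid; data shapes and variable names are assumptions.

    import numpy as np
    from sklearn.metrics import davies_bouldin_score
    from sklearn.metrics.pairwise import cosine_similarity

    def cluster_quality(embeddings, labels):
        # Lower Davies-Bouldin values indicate more compact, better separated clusters.
        return davies_bouldin_score(embeddings, labels)

    def retrieve(query_embedding, centroids, embeddings, labels, k=10):
        # Find the centroid closest to the query embedding ...
        closest = np.argmax(cosine_similarity(query_embedding.reshape(1, -1), centroids))
        # ... then rank only the documents of that cluster by cosine similarity with the query.
        doc_ids = np.flatnonzero(labels == closest)
        sims = cosine_similarity(query_embedding.reshape(1, -1), embeddings[doc_ids]).ravel()
        return doc_ids[np.argsort(-sims)][:k]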
Submissions
Participants should submit lists of 10, 25 and 50 vectors representing the final centroids obtained with their clustering algorithm. Each centroid should be followed by the list of documents that belong to its cluster. Each team can provide at most 5 different submissions. The submissions must be produced using both Quantum Annealing and Simulated Annealing, to compare the performance of quantum annealers with a possible traditional hardware alternative.
The submissions should be txt files according to the following format:
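A minimal sketch of such a file, with made-up coordinates, document ids and problem ids (the exact separators and the format of the last line are assumptions):

    0.0213 0.8741 ... 0.4512; 3, 57, 101, 230
    0.3310 0.0547 ... 0.9125; 4, 88, 143
    ...
    [problem_id_1, problem_id_2]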
Each line reports one of the centroids with its coordinates, followed by the ids of the documents assigned to that centroid.
The last line of the file lists the ids of the problems solved to produce this submission. For example, if you solved 3 different problems with QA or SA to achieve the final submission (e.g., you split the problem into subproblems and solved them separately), you should provide their ids in a list at the end of the submission file. The ids can be retrieved directly from the code or through the dashboard.
The submission files should be left in your workspace in the directory called /config/workspace/submissions and the file names should comply with the following rules:
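For instance, a file name assembled from the placeholders explained below might look like [Centroids]_[Method]_[Groupname]_[SubmissionID].txt; the exact separator and ordering are an assumption.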
[Centroids]: it should be either 10, 25 or 50 based on the number of centroids
[Method]: it should be either QA or SA based on the method used
[Groupname]: the name of your group
[SubmissionID]: a submission ID that must be the same for the submissions using the same
algorithm but performed with different methods (e.g., QA or SA)
For Everyone
For each task we will propose a baseline that can be a starting point for the participants. In addition, we will provide tutorials covering both the usage of the libraries needed to interact with D-Wave quantum annealers and the basics of the Annealing process. Participants can also find here a small tutorial covering how to use our infrastructure and two notebooks about Feature Selection and Clustering.
In this way, participants will learn more about the fundamental concepts underlying the QA paradigm and will be able to explore and experiment with new solutions.
Participants will only need a laptop with internet access and knowledge
of the Python programming language!
We have already been granted access to D-Wave Quantum Annealers through a project supported by CINECA, the largest Italian supercomputing center and one of the most important worldwide. The agreement grants us sufficient QPU access time to be distributed among the participants. In addition, we will use Amazon cloud resources to deploy our evaluation infrastructure.
Reproducibility is a core aspect of this Lab. Each participating team will have access to its own machine with fixed characteristics (e.g., a fixed number of CPU cores and a fixed amount of RAM). This ensures that all the teams will work with the same hardware capabilities and that the final results achieved by different groups will be easily comparable. Each workspace is given 1200 CPU millicores and 10 GB of RAM. In addition, we will set up public repositories that the teams will use to publish their own solutions and results. In this way, future research groups in this field can use our datasets, results and solutions to benchmark their own approaches against the ones obtained in our Lab.
Participants will face further challenges in each task, such as developing new problem formulations that account for the limited connectivity of the quantum annealer or splitting the original problems into subproblems. Participants will have access to the D-Wave Problem Inspector in order to both analyze and visualize how the problem has been physically embedded in the Quantum Processing Unit.
Deadlines
Here you can find all the important strict deadlines:
Registration closes: April 22, 2024
Runs submission deadline: May 6, 2024
Evaluation results out: May 20, 2024
Participant's papers submission deadline: May 31, 2024. Follow the instructions. Click here for the LaTeX template.
Notification of acceptance for participant's papers: June 24, 2024
Camera-ready participant's papers submission: July 8, 2024
QuantumCLEF Workshop: September 9-12, 2024 during the CLEF Conference
Contacts
Here you can find the email addresses of all the organizers of the Lab. Do not hesitate to contact us for any information!