Generative Artificial Intelligence (AI) and Large Language Models (LLMs) are revolutionizing technology and society thanks to their versatility and applicability to a wide array of tasks and use cases, across multiple media and modalities. As a new and relatively untested technology, LLMs raise several issues for research and application alike, including questions about their quality, reliability, predictability, and veracity, as well as about how to develop proper evaluation methodologies to assess their various capacities.
Much effort is being put into investigating the various capacities of LLMs with respect to their quality, reliability, reasoning capabilities, and more. Many dataset ensembles are being adapted and used to evaluate the overall performance of LLMs, but a number of issues still need to be addressed. In particular: (i) too often, the evaluation is compromised because the test data is publicly available and models have already seen the ground truth during pre-training; this problem is known as contamination, and it is severe; (ii) with the goal of testing anthropomorphic properties of models (such as common sense reasoning) and linguistic competence, datasets are drifting away from current practical application challenges.
The MonsterCLEF lab will focus on a specific aspect of LLMs, namely their versatility.
Our goal is to systematically explore how well a given LLM performs across a number of different real-world application challenges, compared to algorithms specifically trained for each task, while avoiding contamination issues.
The MonsterCLEF lab is organized as a meta-challenge across a selection of tasks chosen from the other labs running in CLEF 2024 and participants are asked to develop a generative AI/LLM-based system that will be run against all the selected tasks with no or minimal adaptation. For each targeted task we rely on the same dataset, experimental setting, and evaluation measures adopted for that specific task. In this way, the LLM-based systems participating in the MonsterCLEF lab are directly comparable with the specialized systems participating in each targeted task.
This allows us to systematically evaluate the performance of the same LLM-based system across a wide range of very different tasks and to provide feedback to each targeted task about how a general-purpose LLM system performs compared to systems specifically developed for that task. Moreover, since the datasets for CLEF 2024 will not yet be public, we are able to experiment with previously unseen data, thus avoiding the risk of contamination.
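As an illustration only, the following Python sketch shows what such a meta-challenge harness might look like: the same LLM-based system is applied, with minimal adaptation (a per-task instruction), to every targeted task, and each task keeps its own data and evaluation measure. All names here (Task, run_meta_challenge, the toy task and toy LLM) are hypothetical and do not correspond to any official MonsterCLEF API or infrastructure.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str                                   # targeted CLEF task (hypothetical placeholder)
    instruction: str                            # the only task-specific adaptation
    inputs: List[str]                           # test items from the task's own dataset
    references: List[str]                       # ground truth held by the task organizers
    evaluate: Callable[[List[str], List[str]], float]  # the task's own evaluation measure


def run_meta_challenge(llm: Callable[[str], str], tasks: List[Task]) -> Dict[str, float]:
    """Run one unchanged LLM-based system across all targeted tasks."""
    scores: Dict[str, float] = {}
    for task in tasks:
        predictions = [llm(f"{task.instruction}\n\n{item}") for item in task.inputs]
        # Scoring reuses the evaluation measure of the original task, so the result
        # is directly comparable with the specialized systems in that task.
        scores[task.name] = task.evaluate(predictions, task.references)
    return scores


if __name__ == "__main__":
    # Toy stand-ins: an "LLM" that echoes its input and an exact-match measure.
    toy_llm = lambda prompt: prompt.splitlines()[-1]
    exact_match = lambda preds, refs: sum(p == r for p, r in zip(preds, refs)) / len(refs)
    demo = Task("demo-task", "Answer with the input unchanged.",
                ["foo", "bar"], ["foo", "bar"], exact_match)
    print(run_meta_challenge(toy_llm, [demo]))

The key design choice this sketch is meant to convey is that the harness never re-trains or fine-tunes anything per task; the only per-task element is the instruction, while data and scoring remain those of the targeted lab.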
The following labs are targeted by MonsterCLEF: