MonsterCLEF

One Lab to Rule Them All

About


Generative Artificial Intelligence (AI) and Large Language Models (LLMs) are revolutionizing technology and society thanks to their versatility and applicability to a wide array of tasks and use cases, across multiple media and modalities. As a new and relatively untested technology, LLMs raise several issues for research and application alike, including questions about their quality, reliability, predictability, and veracity, as well as about how to develop proper evaluation methodologies to assess their various capacities.

Much effort is being put into investigating the various capacities of LLMs with respect to their quality, reliability, reasoning capabilities, and more. Many dataset ensembles are being adapted and used to evaluate the overall performance of LLMs, but a number of issues remain to be addressed. In particular: (i) too often, the evaluation is compromised because test data is publicly available and models have already seen the ground truth data during pre-training; this problem, known as contamination, is severe; (ii) in the attempt to test anthropomorphic properties of models (such as common sense reasoning) and linguistic competence, datasets are drifting away from current practical application challenges.

The MonsterCLEF lab will focus on a specific aspect of LLMs, namely their versatility.

Objectives


Our goal is to systematically explore how well a given LLM performs across a number of different real-world application challenges compared with algorithms specifically trained for each task, while avoiding contamination issues.

The MonsterCLEF lab is organized as a meta-challenge across a selection of tasks chosen from the other labs running in CLEF 2024. Participants are asked to develop a generative AI/LLM-based system that will be run against all the selected tasks with no or minimal adaptation. For each targeted task we rely on the same dataset, experimental setting, and evaluation measures adopted for that specific task. In this way, the LLM-based systems participating in the MonsterCLEF lab are directly comparable with the specialized systems participating in each targeted task.

This allows us to systematically evaluate the performance of the same LLM-based system across a wide range of very different tasks and to provide feedback to each targeted task about how a general-purpose LLM system performs compared to systems specifically developed for that task. Moreover, since the CLEF 2024 datasets will not yet be public, we are able to experiment with previously unseen data, thus avoiding the risk of contamination.
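
To make the setup concrete, the following sketch shows one possible way a participant could organise such a task-agnostic system: the LLM call stays fixed across tasks, a prompt template is the only per-task adaptation, and each task is scored with its own official measure. All names in the sketch (Task, run_llm, the toy accuracy metric) are hypothetical illustrations and not part of any MonsterCLEF tooling.

    # Minimal sketch of the meta-challenge idea: one fixed LLM-based system,
    # a per-task prompt template as the only adaptation, and each task scored
    # with its own measure. All names here are illustrative assumptions.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Task:
        name: str                                   # e.g. "EXIST Task 1: sexism detection"
        prompt_template: str                        # the only per-task adaptation
        evaluate: Callable[[List[str], List[str]], float]  # the task's own metric

    def run_llm(prompt: str) -> str:
        """Placeholder for the single, fixed LLM call; a real system would query its model here."""
        return "yes"

    def run_on_task(task: Task, inputs: List[str], gold: List[str]) -> float:
        """Apply the same system to one task and score it with that task's measure."""
        predictions = [run_llm(task.prompt_template.format(text=x)) for x in inputs]
        return task.evaluate(predictions, gold)

    if __name__ == "__main__":
        # Toy accuracy metric standing in for each task's official measure.
        accuracy = lambda pred, gold: sum(p == g for p, g in zip(pred, gold)) / len(gold)
        toy = Task(name="Toy binary classification",
                   prompt_template="Answer yes or no: is the following text sexist?\n{text}",
                   evaluate=accuracy)
        print(run_on_task(toy, ["example input"], ["yes"]))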

Participating Labs and Offered Tasks


The following labs are targeted by MonsterCLEF:

  • CheckThat! provides a diverse collection of challenges to the research community interested in developing technology to support and understand the journalistic verification process.
    • Task 1: Checkworthiness
    • Task 3: Detection of Persuasion Techniques in News Articles
  • ELOQUENT is a lab devoted to the evaluation of certain quality aspects of content generated by LLMs. It intends to use LLMs to test their own capacities, and is thus a good fit for this meta-lab evaluation effort.
    • Task 3: Robustness
    • Task 4: Voight-Kampff
  • EXIST is a lab devoted to the detection and characterization of sexism in online content.
    • Task 1: Detection of sexist content
    • Task 2: Characterization of sexist content: source intention
    • Task 3: Characterization of sexist content: sexism categorization
  • ImageCLEF aims to provide an evaluation forum for the cross-language annotation and retrieval of images. For the MonsterCLEF lab, ImageCLEF will provide two image caption tasks in the biomedical domain (radiological images).
    • Task 1: ImageCLEFmedical -> ImageCLEF-caption -> Caption Prediction
  • LongEval is a shared task evaluating the temporal persistence of Information Retrieval systems and text classifiers.
    • Task 1: LongEval Retrieval
    • Task 2: LongEval Classification
  • PAN is a series of scientific events and shared tasks on authorship identification and verification, author profiling, and plagiarism detection.
    • Task 2: Multilingual Text Detoxification
    • Task 4: Generative AI Authorship Verification 2024
  • Touché is a series of scientific events and shared tasks on computational argumentation and causality.
    • Task 1: Human Value Detection
    • Task 2: Ideology and Power Identification in Parliamentary Debates
    • Task 3: Image Retrieval for Arguments

Organizers


Nicola Ferro

University of Padua, Italy

Julio Gonzalo

UNED, Spain

Jussi Karlgren

Silo AI, Sweden

Henning Müller

HES-SO, Switzerland