MMLU

From Wikipedia, the free encyclopedia

Measuring Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several other versions and spin-offs, such as MMLU-Pro, MMMLU and MMLU-Redux.

Overview

MMLU consists of 15,908 multiple-choice questions, 1,540 of which are used to select optimal settings for models, such as temperature, batch size and learning rate. The questions span 57 subjects, ranging from complex STEM fields and international law to nutrition and religion. It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024.[1][2]
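
MMLU is distributed through public repositories such as the Hugging Face dataset "cais/mmlu" cited above.[2] The following Python sketch illustrates, assuming the column names of that release ("question", "choices", "answer") and its per-subject configurations, how a single subject can be loaded and scored by accuracy; the predictor my_model_predict is a hypothetical placeholder, not part of any official evaluation code.

from datasets import load_dataset

# Illustrative sketch only: load one of the 57 MMLU subjects from the
# Hugging Face "cais/mmlu" dataset and score a placeholder model by accuracy.
subject = "abstract_algebra"
data = load_dataset("cais/mmlu", subject)

def my_model_predict(question: str, choices: list[str]) -> int:
    """Hypothetical predictor: returns the index (0-3) of the chosen option."""
    return 0  # a real evaluation would query a language model here

test_set = data["test"]
correct = 0
for row in test_set:
    prediction = my_model_predict(row["question"], row["choices"])
    correct += int(prediction == row["answer"])  # "answer" holds the gold index

accuracy = correct / len(test_set)
print(f"{subject}: {accuracy:.1%}")  # random guessing averages about 25%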

The benchmark was released by Dan Hendrycks and a team of researchers on 7 September 2020. It was designed to be more challenging than existing benchmarks at the time, such as General Language Understanding Evaluation (GLUE), because models had begun to outperform humans on those easier tests. When MMLU was released, most existing language models scored near the level of random chance (25%); the best-performing model, GPT-3 175B, achieved 43.9% accuracy. The creators of MMLU estimate that human domain experts achieve around 89.8% accuracy.[1] By mid-2024, leading language models such as Claude 3.5 Sonnet, GPT-4o and Llama 3.1 405B consistently scored around 88%.[3][4][5] As of 2025, MMLU has been partially phased out in favor of more difficult alternatives.

Limitations

In June 2024, researchers released a paper detailing their manual analysis of 5,700 questions in the benchmark, which revealed that it contained a significant number of ground-truth errors. For example, 57% of questions in the "Virology" subset were found to contain errors, including multiple correct answers (4%), unclear questions (14%) and completely incorrect answers (33%). Overall, they estimated that 6.5% of MMLU questions contained an error, suggesting that the maximum attainable score is significantly below 100%.[6] Data contamination also poses a significant threat to the benchmark's validity; companies could easily include its questions and answers in their models' training data, undermining the meaningfulness of the resulting scores.[7]

Examples

The following examples are sourced from the "Abstract Algebra", "International Law" and "Professional Medicine" tasks, respectively; the correct answers are (B), (B) and (D).[1]

Question 1:

Find all c in ℤ₃ such that ℤ₃[x]/(x² + c) is a field.

(A) 0 │ (B) 1 │ (C) 2 │ (D) 3
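
As an illustrative aside (not part of the benchmark), the marked answer can be verified directly: ℤ₃[x]/(x² + c) is a field exactly when x² + c is irreducible over ℤ₃, which for a quadratic means it has no root modulo 3. A short Python check:

for c in range(3):
    # x^2 + c is reducible over Z_3 iff it has a root modulo 3
    has_root = any((x * x + c) % 3 == 0 for x in range(3))
    print(f"c = {c}:", "field" if not has_root else "not a field")
# Only c = 1 yields an irreducible polynomial, hence a field, matching answer (B).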

Question 2:

Would a reservation to the definition of torture in the International Covenant on Civil and Political Rights (ICCPR) be acceptable in contemporary practice?

(A) This is an acceptable reservation if the reserving country’s legislation employs a different definition.
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR.
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law.
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties.

Question 3:

A 33-year-old man undergoes a radical thyroidectomy for thyroid cancer. During the operation, moderate hemorrhaging requires ligation of several vessels in the left side of the neck. Postoperatively, serum studies show a calcium concentration of 7.5 mg/dL, albumin concentration of 4 g/dL, and parathyroid hormone concentration of 200 pg/mL. Damage to which of the following vessels caused the findings in this patient?

(A) Branch of the costocervical trunk.
(B) Branch of the external carotid artery.
(C) Branch of the thyrocervical trunk.
(D) Tributary of the internal jugular vein.

References

  1. ^ a b c Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2021). "Measuring Massive Multitask Language Understanding". ICLR. arXiv:2009.03300.
  2. ^ "cais/mmlu". Hugging Face. 2024-07-08. Retrieved 2024-07-24.
  3. ^ "Introducing Claude 3.5 Sonnet". Anthropic. Retrieved 2025-04-06.
  4. ^ "Hello GPT-4o". OpenAI. 2024-05-13. Retrieved 2025-04-06.
  5. ^ "Introducing Llama 3.1: Our most capable models to date". Meta blog. 2024-07-23. Retrieved 2025-04-06.
  6. ^ Gema, Aryo Pradipta; Leang, Joshua Ong Jun; Hong, Giwon; Devoto, Alessio; Mancino, Alberto Carlo Maria; Saxena, Rohit; He, Xuanli; Zhao, Yu; Du, Xiaotang; Madani, Mohammad Reza Ghasemi; Barale, Claire; McHardy, Robert; Harris, Joshua; Kaddour, Jean; Krieken, Emile van; Minervini, Pasquale (2024-06-07). "Are We Done with MMLU?". arXiv:2406.04127 [cs.CL].
  7. ^ Roose, Kevin (2024-04-15). "A.I. Has a Measurement Problem". The New York Times. ISSN 0362-4331. Retrieved 2024-04-21.