A benchmark of expert-level academic questions to assess AI capabilities

Long Phan et al. (including F.J. Rodrigo-Ginés)

Nature, 2026. doi:10.1038/s41586-025-09962-4

Abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities.


Related work

Media Bias Within Information Disorder: Bridging Two Research Communities Through a Systematic Review. InDor Workshop @ LREC 2026, 2026
From Co-Pilots to Co-Workers: A Formal Typology of Human–Agent Collaboration in Organizations. IEEE Conference on Artificial Intelligence (CAI) 2026, 2026
The Epistemic Limits of NLP Models in Media Bias Detection: Toward a Framework for Context-Aware and Reflexive AI Systems. IEEE Conference on Artificial Intelligence (CAI) 2026, 2026
Modeling the impact of research data unavailability on science. Journal of Informetrics, 2026