A benchmark of expert-level academic questions to assess AI capabilities

Long Phan et al. (including F.J. Rodrigo-Ginés)

Nature, 2026. doi:10.1038/s41586-025-09962-4

Abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities.

Related work