A benchmark of expert-level academic questions to assess AI capabilities
Nature, 2026. doi:10.1038/s41586-025-09962-4
Abstract
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities.
Related work
- Modeling the impact of research data unavailability on science. Journal of Informetrics, 2026
- Agentic-JEPA: A Self-Supervised World Model for Planning in Text-Based Agent Environments. Preprint, 2026
- Media Bias Bias-Mitigated Dataset (MBBMD): A Hierarchical, Perspectivist, and Counterfactually-Augmented Corpus for Bias Detection in Spanish News. Procesamiento del Lenguaje Natural (SEPLN) 2026, 2026
- zenodo-mcp: A Model Context Protocol Server for the Zenodo Open-Research Repository. Technical report (Zenodo), 2026