Four Leading AI Models Failed This Elite Maths Challenge

AI has demonstrated impressive abilities in coding, scientific research and problem-solving. Yet a new study suggests that the world’s most advanced AI systems still struggle to match elite human mathematicians when faced with original research questions.

The finding comes from First Proof, a project designed to test whether AI can solve genuine mathematical problems at the frontier of research.

The initiative challenged four leading AI systems with ten unpublished research-level mathematics problems. Professional mathematicians had recently solved these problems, but the solutions had not yet been published.

The project released its results on June 10. Expert mathematicians anonymously reviewed and graded every submission.

The outcome was clear.

“Not one model matched the standard of a top human mathematician.”

Researchers say the trial is the first benchmark to combine three critical elements. It used genuine research-level questions, ensured the problems were absent from training data and relied on formal evaluation by experts.

According to reports published by First Proof and discussed by Nature, earlier AI benchmarks often faced criticism because models may have encountered test questions during training. As a result, strong performance could reflect memorisation rather than genuine reasoning.

First Proof attempted to eliminate that concern by sourcing problems directly from unpublished mathematical research.

OpenAI joins test as hallucinations remain a concern

OpenAI was the only major technology company to enter a commercially available model. The company submitted ChatGPT 5.5 Pro.

The remaining systems came from academic teams at UCLA, Princeton University and ETH Zurich.

Some notable AI projects did not participate. These included Google’s Aletheia and the unreleased version of Anthropic’s Claude Mythos. Organisers said they could not independently verify whether those systems operated without human assistance.

The study also highlighted a familiar problem for large language models.

Even when researchers instructed the AI systems to verify references and reasoning, the models still generated inaccurate information.

“The models also displayed a familiar weakness, hallucination.”

Experts say hallucinations remain one of the biggest barriers to deploying AI in advanced scientific and mathematical work, where a single error can invalidate an entire proof.

Human experts remain ahead, but AI continues to improve

The findings arrive only weeks after an OpenAI chatbot reportedly solved an 80-year-old mathematical problem posed by the late Hungarian mathematician Paul Erdős.

That achievement fuelled speculation that AI could soon transform mathematical research.

However, First Proof researchers argue that solving a long-standing puzzle differs significantly from tackling a completely new problem whose solution has never been published.

“Cracking an old puzzle is a very different thing from solving a brand-new research problem.”

The team believes future versions of the benchmark could help researchers understand where AI adds value. Potential applications include checking proofs, identifying mistakes and suggesting new research directions.

For now, however, the results send a clear message.

“The humans are still winning.”