Saturday, 17 January 2026

Improvement in reasoning performance of LLMs over time

If you tried using ChatGPT when it first came out and concluded it wasn't much use for a scientific reasoning task, it might be time to try it again. For the task I semi-described in the previous post, it worked exceptionally well. Here I've tried to gather some information on what's changed between the original release and today, and how you can keep track of improvements in performance.

Background

The LLM developed by OpenAI is called GPT ("Generative Pre-trained Transformer"). It was released publicly as a chatbot in Nov 2022 and changed the world (for better or worse). That release was already version 3.5. Since then, LLMs have been developed by several companies and groups, often with new versions appearing multiple times a year.

For scientific problems, a particular landmark was the development of LLMs with 'reasoning' capability: the ability to perform multi-step logical problem-solving. OpenAI's o1 model (Sept 2024) was the poster child for this. Currently, OpenAI's GPT-5.2 represents the state of the art, though the upcoming Gemini 3 Pro (currently in preview) may have the edge on it. But how do we know this?

The benchmark

LLM benchmarks exist for coding tasks, solving maths problems, medical exams and so forth, but for reasoning a single benchmark, the GPQA benchmark, has established itself as the gold standard, in particular its "Diamond" subset. GPQA stands for "Google-Proof Q&A". In Nov 2023, Rein et al. (New York University) published a preprint describing how they employed incentivised PhD-level contractors to put together a set of scientific multiple-choice questions (4 choices each) covering biology, chemistry and physics that could not be answered by Google searches and required domain knowledge. Random answering scores 25%; PhD-level non-scientists (note: my interpretation) with access to the internet scored 34%; PhD-level scientists scored 69.7% (note: I don't see this figure in the preprint, but it is widely quoted).

Just a note about the 69.7%. The exact figure doesn't matter, but there's an idea floating around that beating it means superhuman performance. Um, no. An expert in biology is not expected to answer the physics questions very well. Are coding LLMs superhuman if they are better than a Python expert at answering Java questions? If they beat an expert in their own area of expertise, now that would be impressive. Given that three experts sitting together would be expected to get 100%, it follows that the benchmark is not capable of measuring superhuman performance (this is actually mentioned at the end of the preprint). A more interesting number is perhaps 55%, which roughly measures how well a scientist does when answering questions outside their own domain. I estimated this by assuming the expert answers their own third of the questions perfectly, contributing 33.3 percentage points; the remaining 69.7 - 33.3 = 36.4 points must then come from the other two-thirds of the questions, giving 36.4 × 3/2 ≈ 55%.
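For the record, here is that back-of-the-envelope calculation spelled out as a short Python snippet. The one-third split between subjects and the 100% in-domain accuracy are assumptions of mine carried over from the reasoning above, not figures from the preprint.

    # Back-of-the-envelope estimate of expert accuracy outside their own field.
    # Assumptions: questions split evenly across the three subjects, and an
    # expert answers their own third perfectly. Neither is from the preprint.

    overall = 69.7           # widely quoted PhD-expert accuracy (%)
    in_domain_share = 1 / 3  # assumed fraction of questions inside the expert's field
    in_domain_acc = 100.0    # assumed accuracy on those questions (%)

    # Percentage points contributed by in-domain questions (~33.3).
    in_domain_points = in_domain_share * in_domain_acc

    # The remaining points must come from the other two-thirds of the questions.
    out_of_domain_acc = (overall - in_domain_points) / (1 - in_domain_share)

    print(f"estimated out-of-domain accuracy: {out_of_domain_acc:.1f}%")  # ~54.5%, i.e. roughly 55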

The results

The key point is that this is a set of difficult scientific questions that require reasoning, and that we can track the performance of LLMs against it over time. Here are the results for OpenAI models, taken from an analysis by Epoch AI. Each individual model was tested 16 times, with a typical standard error of around 2% (a rough sketch of where an error of that size comes from follows the list):

  • GPT-3.5 (Nov 2022): 28% (taken from here)
  • GPT-4 (Mar 2023): 36%
  • GPT-4 Turbo (Nov 2023): 42%
  • GPT-4o (Aug 2024): 49%
  • o1-mini (Sep 2024): 62% (at "high" reasoning setting)
  • o1 (Dec 2024): 76% (medium)
  • o3-mini (Jan 2025): 77% (high)
  • GPT-4.1 (Apr 2025): 67% (note: the 4.1s are non-reasoning models)
    • GPT-4.1 mini: 66% (this is a smaller/faster/cheaper version of GPT-4.1)
    • GPT-4.1 nano: 49%
  • GPT-5 (Aug 2025): 85% (medium)
    • GPT-5-mini: 72%
    • GPT-5-nano: 67%
  • GPT-5.1 (Nov 2025): 85% (medium), 88% (high)
  • GPT-5.2 (Dec 2025): 88% (medium), 88% (high), 91% (xhigh)
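As a rough sense check on that ~2% figure, here is a minimal sketch using a plain binomial approximation over the 198 questions of the Diamond subset. The 85% accuracy is just an example value, and this is not necessarily how Epoch AI computes its error bars.

    # Rough illustration: the standard error of a score measured on a fixed set
    # of ~200 multiple-choice questions, using a plain binomial approximation.
    # Sketch only; not necessarily how Epoch AI computes its reported errors.
    from math import sqrt

    n_questions = 198   # size of the GPQA Diamond subset
    accuracy = 0.85     # example score, roughly GPT-5-class

    se = sqrt(accuracy * (1 - accuracy) / n_questions)
    print(f"binomial standard error ~ {se * 100:.1f} percentage points")  # about 2.5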

Conclusion

In summary, there have been major improvements in reasoning over the last three years, and performance has not yet reached a peak. Gemini 3 Pro Preview (Nov 2025) is reported at 93%. In August, a new kid on the block, Autopoiesis, reported 92.4% for their Aristotle X1 Verify model (though this is neither publicly available nor independently verified). Either way, we are getting to the point where the field will need a more difficult benchmark.

Finally, it should be noted that raw performance isn't the only relevant criterion. Price, energy usage, and the ability to run locally will also be important considerations. Let's see what 2026 brings.

Image credit: Epoch AI, CC BY, https://epoch.ai/benchmarks/gpqa-diamond.
