Evaluating & Ranking GPT-5 Reasoning Ability

Given the proliferation of reasoning models, we wanted to go beyond knowledge-based benchmarks to test reasoning abilities such as pattern recognition, lateral thinking, abstraction, contextual reasoning (accounting for British cultural references), and multi-step inference.
In addition to reasoning, we aimed to assess how effectively models make decisions when presented with judgment calls—such as choosing between making an educated guess based on available clues or calling a function to retrieve additional information. This capability is crucial for building multi-agent orchestration systems.
Another objective was to measure improvements in the latest GPT-5 models, particularly the effect of the reasoning effort and verbosity parameters, by comparing them with previous generations and evaluating their token usage and reasoning-time efficiency.
What is Only Connect?
Only Connect tests contestants' ability to identify connections between seemingly unrelated clues. It prioritizes lateral thinking, pattern recognition, and creative problem-solving over quick recall. The game consists of four rounds (a code sketch of these formats follows the list):
- Connections: Players identify the common thread linking 1-4 clues
- Sequences: Players predict the fourth element in a sequence after seeing 1-3 clues
- Wall: Players group 16 elements into four categories (similar to the NYT Connections game)
- Missing Vowels: Players reconstruct phrases from cryptic-looking strings in which the vowels have been removed and the spacing scrambled
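For concreteness, here is one way the four round formats could be represented in code. This is purely illustrative and not the schema of our dataset; the class and field names are our own shorthand.

```python
from dataclasses import dataclass

@dataclass
class ConnectionsQuestion:
    clues: list[str]               # up to four clues, revealed one at a time
    answer: str                    # the common thread linking them

@dataclass
class SequencesQuestion:
    clues: list[str]               # the first elements of the sequence
    answer: str                    # the fourth element to be predicted

@dataclass
class WallQuestion:
    elements: list[str]            # 16 elements to be sorted
    groups: dict[str, list[str]]   # four named groups of four elements each

@dataclass
class MissingVowelsQuestion:
    category: str                  # the category shown to contestants
    puzzle: str                    # phrase with vowels removed and spacing scrambled
    answer: str                    # the original phrase
```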
Given its emphasis on clever reasoning rather than knowledge recall, Only Connect provides an ideal challenge for benchmarking LLMs' reasoning capabilities. We also wanted to track performance improvements across successive model generations.
Methodology
The models selected for analysis included GPT-3, GPT-4-Mini, GPT-4.1, Claude Sonnet 4, Claude Opus 4, and Claude Opus 4.1, along with GPT-5 in eight parameter configurations (low/high verbosity crossed with minimal/low/medium/high reasoning effort), as sketched below.
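As a rough sketch of how the GPT-5 sweep can be wired up, the snippet below enumerates the eight configurations and passes them through the OpenAI Responses API's reasoning effort and verbosity parameters. The helper name and prompt handling are simplified placeholders, not our actual harness.

```python
from itertools import product
from openai import OpenAI

client = OpenAI()

VERBOSITY_LEVELS = ["low", "high"]
REASONING_EFFORTS = ["minimal", "low", "medium", "high"]

def ask_gpt5(prompt: str, verbosity: str, effort: str) -> str:
    """Query GPT-5 with one verbosity/reasoning-effort configuration."""
    response = client.responses.create(
        model="gpt-5",
        input=prompt,
        reasoning={"effort": effort},
        text={"verbosity": verbosity},
    )
    return response.output_text

# The eight GPT-5 configurations covered in this post.
GPT5_CONFIGS = list(product(VERBOSITY_LEVELS, REASONING_EFFORTS))
```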
Questions were sourced from the Only Connect game show and posed according to the official rules. For rounds requiring a straight guess, we gave the LLMs all available clues and used structured output parameters to receive JSON responses. Where contestants may request additional clues, we exposed this as a function call that triggers a follow-up API call with the supplementary clue, as sketched below.
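The clue-on-request mechanic can be modelled as a function (tool) the model may call; when it does, we reveal the next clue and make a follow-up API call. The tool name and loop below are illustrative simplifications on our part, and the structured-output settings are omitted for brevity.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool the model can call instead of guessing early.
REQUEST_CLUE_TOOL = {
    "type": "function",
    "name": "request_additional_clue",
    "description": "Reveal the next clue for the current question.",
    "parameters": {"type": "object", "properties": {}, "additionalProperties": False},
}

def answer_question(prompt: str, hidden_clues: list[str]) -> str:
    """Let the model answer directly or request more clues via the tool."""
    conversation = [{"role": "user", "content": prompt}]
    for _ in range(len(hidden_clues) + 2):  # guard against endless clue requests
        response = client.responses.create(
            model="gpt-5",
            input=conversation,
            tools=[REQUEST_CLUE_TOOL],
        )
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            return response.output_text  # the model committed to an answer
        # Reveal the next clue (or say there are none left) and ask again.
        conversation += response.output
        for call in calls:
            clue = hidden_clues.pop(0) if hidden_clues else "No further clues are available."
            conversation.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": clue,
            })
    return response.output_text
```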
For evaluation, deterministic answers (such as those in Missing Vowels) were checked with standard string methods, while questions with multiple correct answers or varied phrasings were scored with the deepeval library for more nuanced evaluation. Points were assigned using the game's official scoring system. We simulated eight randomly selected episodes from series 3-10 and aggregated the results. A sketch of the evaluation step follows.
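In sketch form, the deterministic check is a normalised string comparison, while open-ended answers go through an LLM-judged metric. The snippet below uses deepeval's GEval as we understand its interface; the criteria wording and the example clue are ours, not taken from the dataset.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def _norm(s: str) -> str:
    return " ".join(s.lower().split())

def check_missing_vowels(model_answer: str, expected: str) -> bool:
    """Deterministic check: normalise case and whitespace, then compare exactly."""
    return _norm(model_answer) == _norm(expected)

# LLM-judged check for answers that can be phrased in many ways.
correctness = GEval(
    name="Correctness",
    criteria=(
        "Decide whether the actual output describes the same connection "
        "as the expected output, allowing for different wording."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# Illustrative test case, not from the dataset.
case = LLMTestCase(
    input="What connects: Mercury, Gemini, Apollo, Artemis?",
    actual_output="They are all crewed NASA spaceflight programmes.",
    expected_output="NASA human spaceflight programmes",
)
correctness.measure(case)
print(correctness.score, correctness.reason)
```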
Results Overview
The best-performing models were GPT-5 and similar reasoning-optimized models. Verbosity had minimal impact on accuracy.
We found a strong correlation between response time and accuracy. GPT-5 models with higher reasoning parameters (high, medium) consistently outperformed those with lower settings (low, minimal).
A similar pattern emerged for token usage: reasoning models consumed comparatively high token counts but were correspondingly more effective. The verbosity parameter significantly affected token usage while having only a minor impact on accuracy.
While GPT-5 and other reasoning models perform well, they come at a cost in both time and token usage.
Looking at individual rounds, models performed best on Missing Vowels, which is unsurprising given that the round prioritizes speed over lateral logic: LLMs routinely handle poor grammar and spelling, making this relatively straightforward. The Wall proved the most challenging round, with significant performance gaps between the top and bottom performers. This likely stems from the complexity of a prompt containing 16 different elements, which the more powerful reasoning models processed more effectively. We'll compare NYT Connections games against the Only Connect dataset in a future post.
Next Steps
We'll publish the complete dataset this week alongside a granular analysis identifying which questions posed the greatest challenge for models. We'll also implement a more realistic competitive format, pairing models against each other and awarding points for correctly answering questions that opponents miss.
Interested in pushing the boundaries of AI research and knowledge? Reach out to careers@ingram.tech.
Stay at the Forefront of AI Research
Subscribe to be notified when the next post in this series is published. We'll share the complete dataset, granular analysis of model performance, and competitive head-to-head model comparisons.