Claude Sonnet 4 vs Sonnet 4.5: Evaluating Claude models using Only Connect

As a follow-up to our previous, more OpenAI-focused blog post, we decided to see how Claude Sonnet 4.5 compares to Sonnet 4 at Only Connect, a British quiz show. Only Connect is known for its fiendishly difficult lateral-thinking questions, many of which are also abstract or highly specific to British culture.
Data
We used a dataset of Only Connect questions, restricted for this test to the Connections and Sequences rounds. In Connections, the player has to guess what connects the four clues; in Sequences, the player has to guess the fourth item in the sequence.
For example, if the clues are CARED, WORN, LACK, COLD, the connection is that each forms a new word when prepended with an S (“scared”, “sworn”, “slack”, “scold”).
A sequence might be: DELAWARE, PENNSYLVANIA, NEW JERSEY, with the final item being GEORGIA (US states by date of admission to the Union).
We tagged each question with a domain and a technique. A domain is the topic of the question, with tags including ‘Language’, ‘Literature’, ‘History’, ‘Geography’, ‘Science’, ‘Entertainment’, ‘Sport’, ‘Society & Culture’, ‘British Trivia’ and ‘Miscellaneous’. A technique describes how to arrive at the answer: ‘Etymology’, ‘Anagram/Wordplay’, ‘Translation’, ‘Category/Set’, ‘Shared Feature’, ‘Sequence/Progression’, ‘Cultural Reference’ or ‘Parody/Irony’.
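For illustration, a single tagged record might look like this (the field names are ours, not the exact dataset schema):

```python
# One tagged Connections question (field names are illustrative,
# not the exact dataset schema).
question = {
    "round": "Connections",            # or "Sequences"
    "clues": ["CARED", "WORN", "LACK", "COLD"],
    "answer": "words when prepended with an S",
    "domain": "Language",              # one of the domain tags above
    "technique": "Anagram/Wordplay",   # one of the technique tags above
}
```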
The clues are revealed one at a time, and answering with fewer clues is incentivized. We extracted 1,636 Connections and 1,963 Sequences questions. Each question was fed to each model with a prompt explaining the rules, and the model was forced to respond with a tool call: either make a guess or, while clues remained, request another.
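In rough outline, the tool setup looked like the following sketch using the Anthropic Python SDK (the tool names, schemas and helper function are illustrative, not our exact harness):

```python
import anthropic

client = anthropic.Anthropic()

# Two tools: commit to a guess, or ask for the next clue. The second is
# only offered while clues remain. Names and schemas are illustrative.
TOOLS = [
    {
        "name": "make_guess",
        "description": "Commit to a final answer for the question.",
        "input_schema": {
            "type": "object",
            "properties": {"guess": {"type": "string"}},
            "required": ["guess"],
        },
    },
    {
        "name": "request_clue",
        "description": "Reveal the next clue, if any remain.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

def ask(model: str, prompt: str, clues_remaining: bool):
    # Drop the request_clue tool once all clues are revealed.
    tools = TOOLS if clues_remaining else TOOLS[:1]
    return client.messages.create(
        model=model,                 # e.g. "claude-sonnet-4-5"
        max_tokens=1024,
        tools=tools,
        tool_choice={"type": "any"}, # force the model to call a tool
        messages=[{"role": "user", "content": prompt}],
    )
```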
Evaluation
We used the deepeval framework to evaluate the results. It supports flexible custom metrics, is model-agnostic (so we could experiment with different evaluators), and handles thousands of evaluations in parallel quickly.
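As a minimal sketch, a judge-style correctness metric can be built with deepeval's GEval (the criteria wording and the example test case below are illustrative, not our exact setup):

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# LLM-as-judge metric: does the model's guess express the same
# connection as the reference answer, allowing paraphrases?
correctness = GEval(
    name="Only Connect correctness",
    criteria=(
        "Decide whether the guess in 'actual output' expresses the same "
        "connection or sequence item as 'expected output', allowing for "
        "paraphrases and minor wording differences."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Clues: CARED, WORN, LACK, COLD",
    actual_output="They all make new words with an S in front",
    expected_output="words when prepended with an S",
)

evaluate(test_cases=[test_case], metrics=[correctness])
```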
Overall Performance Comparison
Overall, Sonnet 4.5 outperforms Sonnet 4, though not by a wide margin.
Connections Performance
Sequences Performance
However, when normalizing by the number of correct answers, Sonnet 4.5 is also more time-expensive…
Connection Time Efficiency
Sequence Time Efficiency
…and more token-expensive (a sketch of the per-correct-answer calculation follows the charts):
Connection Token Efficiency
Sequence Token Efficiency
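The per-correct-answer calculation itself is straightforward; a minimal sketch, assuming each result records correctness, latency and token usage (field names are illustrative):

```python
def cost_per_correct(results):
    """Time and tokens spent per correct answer.

    `results` is assumed to be a list of dicts with `correct` (bool),
    `latency_s` (float) and `total_tokens` (int) per question.
    """
    n_correct = sum(r["correct"] for r in results)
    total_time = sum(r["latency_s"] for r in results)
    total_tokens = sum(r["total_tokens"] for r in results)
    return total_time / n_correct, total_tokens / n_correct
```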
Domain Analysis
Connection Domains
Sequence Domains
The biggest gaps between Sonnet 4.5 and Sonnet 4 by domain are: British Trivia (+50%), Politics (+50%) and Food (+40%) for Connections; and Architecture (+33%), Art (+20%) and Language (+13%) for Sequences.
Technique Analysis
Connection Techniques
Sequence Techniques
The largest gaps in Connections are: Translation (+18%), Cultural Reference (+17%), and Etymology (+15%). For Sequences: Anagram/Wordplay (+22%), Cultural Reference (+13%), and Shared Feature (+11%).
Guess Timing Strategy
An interesting behavioral difference between the models is when they choose to make their guesses during the question.
Connections guess timing: Claude Sonnet 4 vs Claude Sonnet 4.5
Sequences guess timing: Claude Sonnet 4 vs Claude Sonnet 4.5
On the Only Connect TV show, players earn more points for guessing before all clues are revealed. Other LLMs we have tested understood this trade-off and behaved strategically, guessing earlier. Here, both models generally waited until all clues were revealed, even though an earlier guess can carry a higher expected score, as the sketch below illustrates.
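To make the trade-off concrete, here is a toy expected-points calculation under the show's 5/3/2/1 scoring (points for answering after one, two, three or four clues); the accuracy-by-clue-count figures are made up for illustration:

```python
# Expected points for guessing at each stage under the show's
# 5/3/2/1 scoring. Accuracy-by-clue-count values are hypothetical.
SCORES = {1: 5, 2: 3, 3: 2, 4: 1}
accuracy_by_clues = {1: 0.10, 2: 0.30, 3: 0.50, 4: 0.60}  # illustrative

for n_clues, acc in accuracy_by_clues.items():
    print(f"guess after {n_clues} clue(s): EV = {SCORES[n_clues] * acc:.2f}")

# With these made-up numbers, guessing after three clues maximizes
# expected points, yet both models usually waited for all four.
```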
Conclusion
Overall, Sonnet 4 and 4.5 perform fairly well at Only Connect-style questions, with an accuracy of 50-60%. This still falls short of reasoning models like GPT-5 and Claude Opus. That said, those reasoning models take roughly 10x the time and tokens to respond, so Sonnet 4, and especially 4.5, could be a good faster, cheaper alternative. One area where they do not perform well, however, is strategic thinking: maximizing points by taking risks and answering sooner.
Interested in exploring the frontiers of AI reasoning and cognition? Reach out to careers@ingram.tech.
Stay at the Forefront of AI Research
Subscribe to be notified when the next post in this series is published. We'll share the complete dataset, granular analysis of model performance, and competitive head-to-head model comparisons.
