Claude Sonnet 4 vs Sonnet 4.5: Evaluating Claude models using Only Connect

As a follow-up to our previous, more OpenAI-focused blog post, we decided to see how Claude Sonnet 4.5 compares to Sonnet 4 at Only Connect, a British quiz show. Only Connect is known for its fiendishly difficult lateral-thinking questions, many of which are also abstract or highly specific to British culture.
Data
We used a dataset of Only Connect questions, restricted for this test to the Connections and Sequences rounds. In Connections, the player has to guess what connects the four clues; in Sequences, the player has to guess the fourth item in the sequence.
For example, if the clues are CARED, WORN, LACK, COLD, the connection is that each forms a new word when prepended with an S (“scared”, “sworn”, “slack”, “scold”).
A sequence might be: DELAWARE, PENNSYLVANIA, NEW JERSEY, with the final item being GEORGIA (US states by date of admission to the Union).
We tagged each question with a domain and a technique. A domain is the topic of the question, with tags including ‘Language’, ‘Literature’, ‘History’, ‘Geography’, ‘Science’, ‘Entertainment’, ‘Sport’, ‘Society & Culture’, ‘British Trivia’ and ‘Miscellaneous’. A technique describes how to arrive at the answer: ‘Etymology’, ‘Anagram/Wordplay’, ‘Translation’, ‘Category/Set’, ‘Shared Feature’, ‘Sequence/Progression’, ‘Cultural Reference’ or ‘Parody/Irony’.
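For illustration, a single tagged record might look like this (the field names are ours, not the exact dataset schema):

```python
# One tagged Connections question (field names are illustrative,
# not the exact dataset schema).
question = {
    "round": "Connections",            # or "Sequences"
    "clues": ["CARED", "WORN", "LACK", "COLD"],
    "answer": "words when prepended with an S",
    "domain": "Language",              # one of the domain tags above
    "technique": "Anagram/Wordplay",   # one of the technique tags above
}
```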
The clues are revealed one at a time, and answering with fewer clues is incentivized. We extracted 1,636 Connections and 1,963 Sequences questions. Each question was fed to each model with a prompt explaining the rules, and the model was forced to respond with a tool call: either make a guess or, while clues remained, request another.
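In rough outline, the tool setup looked like the following sketch using the Anthropic Python SDK (the tool names, schemas and helper function are illustrative, not our exact harness):

```python
import anthropic

client = anthropic.Anthropic()

# Two tools: commit to a guess, or ask for the next clue. The second is
# only offered while clues remain. Names and schemas are illustrative.
TOOLS = [
    {
        "name": "make_guess",
        "description": "Commit to a final answer for the question.",
        "input_schema": {
            "type": "object",
            "properties": {"guess": {"type": "string"}},
            "required": ["guess"],
        },
    },
    {
        "name": "request_clue",
        "description": "Reveal the next clue, if any remain.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

def ask(model: str, prompt: str, clues_remaining: bool):
    # Drop the request_clue tool once all clues are revealed.
    tools = TOOLS if clues_remaining else TOOLS[:1]
    return client.messages.create(
        model=model,                 # e.g. "claude-sonnet-4-5"
        max_tokens=1024,
        tools=tools,
        tool_choice={"type": "any"}, # force the model to call a tool
        messages=[{"role": "user", "content": prompt}],
    )
```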
Evaluation
We used the deepeval framework to evaluate the results. It supports flexible custom metrics, is model-agnostic (so we could experiment with different evaluators), and handles thousands of evaluations in parallel quickly.
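As a minimal sketch, a judge-style correctness metric can be built with deepeval's GEval (the criteria wording and the example test case below are illustrative, not our exact setup):

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# LLM-as-judge metric: does the model's guess express the same
# connection as the reference answer, allowing paraphrases?
correctness = GEval(
    name="Only Connect correctness",
    criteria=(
        "Decide whether the guess in 'actual output' expresses the same "
        "connection or sequence item as 'expected output', allowing for "
        "paraphrases and minor wording differences."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Clues: CARED, WORN, LACK, COLD",
    actual_output="They all make new words with an S in front",
    expected_output="words when prepended with an S",
)

evaluate(test_cases=[test_case], metrics=[correctness])
```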
Overall Performance Comparison
Overall, Sonnet 4.5 outperforms Sonnet 4, though not by a wide margin.
Connections Performance
Sequences Performance
However, when normalizing by the number of correct answers, Sonnet 4.5 is also more time-expensive…
Connection Time Efficiency
Sequence Time Efficiency
…and more token-expensive (a sketch of the per-correct-answer calculation follows the charts):
Connection Token Efficiency
Sequence Token Efficiency
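The per-correct-answer calculation itself is straightforward; a minimal sketch, assuming each result records correctness, latency and token usage (field names are illustrative):

```python
def cost_per_correct(results):
    """Time and tokens spent per correct answer.

    `results` is assumed to be a list of dicts with `correct` (bool),
    `latency_s` (float) and `total_tokens` (int) per question.
    """
    n_correct = sum(r["correct"] for r in results)
    total_time = sum(r["latency_s"] for r in results)
    total_tokens = sum(r["total_tokens"] for r in results)
    return total_time / n_correct, total_tokens / n_correct
```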
Domain Analysis
Connection Domains
Sequence Domains
The biggest gaps between Sonnet 4.5 and Sonnet 4 by domain are: British Trivia (+50%), Politics (+50%) and Food (+40%) for Connections; and Architecture (+33%), Art (+20%) and Language (+13%) for Sequences.
Technique Analysis
Connection Techniques
Sequence Techniques
The largest gaps in Connections are: Translation (+18%), Cultural Reference (+17%), and Etymology (+15%). For Sequences: Anagram/Wordplay (+22%), Cultural Reference (+13%), and Shared Feature (+11%).
Guess Timing Strategy
An interesting behavioral difference between the models is when they choose to make their guesses during the question.
Connections guess timing: Claude Sonnet 4 vs Claude Sonnet 4.5
Sequences guess timing: Claude Sonnet 4 vs Claude Sonnet 4.5
On the Only Connect TV show, players earn more points for guessing before all clues are revealed. Other LLMs we have tested understood this trade-off and behaved strategically, guessing earlier. Here, both models generally waited until all clues were revealed, even though an earlier guess can carry a higher expected score, as the sketch below illustrates.
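To make the trade-off concrete, here is a toy expected-points calculation under the show's 5/3/2/1 scoring (points for answering after one, two, three or four clues); the accuracy-by-clue-count figures are made up for illustration:

```python
# Expected points for guessing at each stage under the show's
# 5/3/2/1 scoring. Accuracy-by-clue-count values are hypothetical.
SCORES = {1: 5, 2: 3, 3: 2, 4: 1}
accuracy_by_clues = {1: 0.10, 2: 0.30, 3: 0.50, 4: 0.60}  # illustrative

for n_clues, acc in accuracy_by_clues.items():
    print(f"guess after {n_clues} clue(s): EV = {SCORES[n_clues] * acc:.2f}")

# With these made-up numbers, guessing after three clues maximizes
# expected points, yet both models usually waited for all four.
```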
Conclusion
Overall, Sonnet 4 and 4.5 perform fairly well at Only Connect-style questions, with an accuracy of 50-60%. This still falls short of reasoning models like GPT-5 and Claude Opus. That said, those reasoning models take roughly 10x the time and tokens to respond, so Sonnet 4, and especially 4.5, could be a good faster, cheaper alternative. One area where they do not perform well, however, is strategic thinking: maximizing points by taking risks and answering sooner.
Interested in exploring the frontiers of AI reasoning and cognition? Reach out to careers@ingram.tech.
Stay at the Forefront of AI Research
Subscribe to be notified when the next post in this series is published. We'll share the complete dataset, granular analysis of model performance, and competitive head-to-head model comparisons.
