How does WholeSum compare to other AI tools?

When working with AI, an important starting point is to check whether it can handle the simplest possible datasets: unambiguous examples with objectively correct answers. If it can’t get these right, you can't trust it with anything more complex.

A simple test illustrates the problem. Suppose we have twelve survey responses about work challenges, eleven of them valid. Six participants unambiguously mention insufficient staffing as a problem. Now paste the data into GPT-5 Thinking or Claude Sonnet 4.5 and ask each model to summarise the main themes:

Graphic showing GPT-5 and Claude performance on theme finding

This kind of dataset is pretty much as clean and unambiguous as we’ll ever get. If the model struggles with a very simple test like this, it’s a big red flag.

In contrast, this is what we obtain if we ask WholeSum to find the dominant themes in the dataset:

Theme                                       Match   Total valid   Percentage
Work-life balance and workload management     6         11           55%
Insufficient staffing                         6         11           55%
Job dissatisfaction                           1         11            9%

Unlike GPT-5 and Claude, WholeSum tallies the insufficient-staffing and work-life-balance themes correctly.
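The percentages in the table are straightforward tallies. A minimal sketch of the counting, using hypothetical per-response label sets chosen to match the table (themes can overlap within a single response; this is not WholeSum's actual implementation):

```python
from collections import Counter

# Hypothetical label sets consistent with the table above:
# 11 valid responses, themes may overlap within a response.
labels = (
    [{"work-life balance", "insufficient staffing"}] * 4  # mention both
    + [{"work-life balance"}] * 2
    + [{"insufficient staffing"}] * 2
    + [{"job dissatisfaction"}]
    + [set()] * 2                                         # no clear theme
)

counts = Counter(theme for resp in labels for theme in resp)
total_valid = len(labels)

# Report each theme as a share of valid responses.
for theme, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{theme}: {n}/{total_valid} = {n / total_valid:.0%}")
```

With these labels, 6/11 rounds to 55% and 1/11 to 9%, matching the table.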

If we scale up the test, the gap becomes even clearer. For example, suppose we generate 1000 synthetic responses with known, unambiguous theme labels (work-life balance, team communication, technology issues).

An example response matching 'work-life balance' and 'technology issues' would be: "My kids barely recognize me because I'm always working. I need better compute to do my job".

We can then vary the simulated theme proportions and see how well different methods can recover the truth. In this benchmark, Gemini 2.5 Pro consistently undercounted themes while Gemini 3 bounced between being near the mark and under-estimating, sometimes wildly. In contrast, WholeSum stayed within a percentage point or so of the true value across the benchmark:
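The benchmark boils down to comparing estimated theme percentages against known labels. A minimal sketch with illustrative proportions and a hypothetical `recovery_error` helper (the names and numbers are assumptions, not WholeSum's implementation):

```python
import random

# Assign known theme labels at chosen proportions, then score how far
# an estimator's theme percentages fall from the ground truth.
random.seed(0)
THEMES = ["work-life balance", "team communication", "technology issues"]
true_props = [0.40, 0.35, 0.25]  # illustrative simulated proportions

n = 1000
ground_truth = random.choices(THEMES, weights=true_props, k=n)

def recovery_error(estimated_counts: dict, truth: list) -> dict:
    """Absolute error per theme, in percentage points."""
    total = len(truth)
    return {t: abs(estimated_counts.get(t, 0) - truth.count(t)) / total * 100
            for t in THEMES}

# A perfect counter recovers the truth exactly:
exact = {t: ground_truth.count(t) for t in THEMES}
print(recovery_error(exact, ground_truth))  # every theme: 0.0
```

Systematic undercounting, as seen with Gemini 2.5 Pro in our runs, shows up directly as positive per-theme errors under this metric.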

Chart showing theme recovery in different models

Because WholeSum processes data in a structured way, its analysis time is also far more predictable than that of general-purpose reasoning models. In one of our tests, Gemini 3 Pro took over 15 minutes to analyse 1000 responses; WholeSum took 3 seconds.

This is why we're building WholeSum. If you need analysis you can trust – especially at scale – language models that produce skewed, inconsistent results will leave you with fragile, flawed analysis. Our architecture is designed from the ground up to avoid that.