We speak to a lot of teams using tools like ChatGPT and Claude to analyse qualitative data. They usually describe the same experience: the initial "wow" moment. Then, the frustration.
What surprises many people is that issues don't just appear at scale. They show up even when analysing 10 or 12 responses. So what's going on?
Large language models (LLMs) are extraordinary tools. They help you get started quickly. They surface themes. They can save hours of manual work.
But when you're working with precious human contributions such as feedback, interviews, and survey responses, it's worth understanding their natural limitations.
Below are a few of the most common frustrations we see, and why they happen.
1. Why earlier quotes get picked more often
One pattern people notice is that selected quotes often come from the first few responses in a dataset. This isn't random.
LLMs tend to pay more attention to:
- The start of documents
- Clearly summarised takeaways
- Structured or "headline-like" information
They've been trained heavily on sources like news articles and Wikipedia, where key points are typically top-loaded.
As one market researcher we talked to put it:
"The quotes always seem to be selected from the first few responses."
When analysing qualitative data, important insights can appear anywhere, of course - so this is unhelpful at best and dangerously biased at worst. If the model's attention is uneven, your output may be too.
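One quick sanity check is to look at where the model's quotes actually came from. Below is a minimal, illustrative Python sketch (the response data and quote list are made up) that maps each selected quote back to its source response and reports how many fall in the first quarter of the dataset.

```python
# A minimal sketch: check whether quotes an LLM selected cluster near the
# start of the dataset. The data and quote list here are illustrative.

def quote_positions(responses: list[str], selected_quotes: list[str]) -> list[int]:
    """Return the index of the source response each quote came from (-1 if no match)."""
    positions = []
    for quote in selected_quotes:
        match = next(
            (i for i, r in enumerate(responses) if quote.lower() in r.lower()),
            -1,
        )
        positions.append(match)
    return positions


responses = [f"Response {i}: ..." for i in range(1, 13)]       # e.g. 12 survey answers
selected_quotes = ["Response 1", "Response 2", "Response 3"]   # what the model cited

positions = quote_positions(responses, selected_quotes)
in_first_quarter = sum(1 for p in positions if 0 <= p < len(responses) / 4)
print(f"{in_first_quarter}/{len(selected_quotes)} quotes came from the first 25% of responses")
```

If the answer is consistently "most of them", positional bias is probably shaping your output.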
2. Why sentiment is often misunderstood
Human feedback is rarely purely positive or negative. Someone might say:
"I love the concept, but it's confusing to use."
"It's helpful overall, although the onboarding was frustrating."
That's not neutral. It's nuanced.
Many AI tools compress sentiment into three broad categories:
- Positive
- Neutral
- Negative
Mixed signals often get labelled "neutral", even when the emotional weight leans clearly one way.
As a UX writer we talked to said:
"A lot of the 'positive' responses in my data were neutral at best, and I could see several 'neutral' ones that were clearly negative."
The issue isn't that models can't detect sentiment at all. It's that human experience often carries layered meaning, and reducing it to a single label can flatten what matters. At a population level, that can significantly skew the overall takeaway.
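If you're prompting a general-purpose model yourself, one mitigation is to allow a "mixed" label and ask for per-aspect sentiment rather than a single tag. The sketch below is illustrative only: the prompt wording, label set and JSON shape are assumptions, not any particular tool's schema.

```python
# A hedged sketch: instead of forcing positive/neutral/negative, allow a
# "mixed" label plus per-aspect sentiment. Prompt and schema are illustrative.

import json

LABELS = ["positive", "negative", "neutral", "mixed"]

def build_prompt(response_text: str) -> str:
    return (
        "Classify the overall sentiment of this response as one of "
        f"{LABELS}. If it is mixed, also list each aspect mentioned with its "
        "own sentiment. Return JSON with keys 'overall' and 'aspects'.\n\n"
        f"Response: {response_text}"
    )

example = "I love the concept, but it's confusing to use."
print(build_prompt(example))

# A plausible (hand-written) output for the example above:
expected = {
    "overall": "mixed",
    "aspects": [
        {"aspect": "concept", "sentiment": "positive"},
        {"aspect": "usability", "sentiment": "negative"},
    ],
}
print(json.dumps(expected, indent=2))
```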
3. Why hallucinations happen (even with 12 responses)
This is the issue people expect at scale, and the one they're most surprised to see in small datasets.
LLMs don't "check" information in the way we might assume. They generate the most statistically likely continuation of text based on patterns they've learned. So occasionally, they will produce something that sounds plausible...but wasn't actually said.
A deputy head teacher, who uses WholeSum for school surveys, told us:
"ChatGPT literally just made stuff up that was not even in the surveys."
Even with just 12 responses.
This happens because generation and verification are different tasks. LLMs are optimised for fluent language generation, not rigorous source validation.
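That's why it helps to verify quotes against the raw data before trusting them. A rough, hand-rolled sketch using only the Python standard library is shown below; the matching threshold and sample data are illustrative.

```python
# A minimal verification sketch: because generation and verification are
# different tasks, check every quote in the model's output against the raw
# responses before trusting it. Threshold and data are illustrative.

from difflib import SequenceMatcher

def is_grounded(quote: str, responses: list[str], threshold: float = 0.85) -> bool:
    """True if the quote appears (near-)verbatim in at least one source response."""
    q = quote.strip().lower()
    for r in responses:
        if q in r.lower():
            return True  # exact substring match
        if SequenceMatcher(None, q, r.lower()).ratio() >= threshold:
            return True  # close paraphrase of a whole response
    return False

responses = [
    "I love the concept, but it's confusing to use.",
    "It's helpful overall, although the onboarding was frustrating.",
]
claimed_quotes = [
    "the onboarding was frustrating",            # grounded
    "staff said the canteen food was terrible",  # not in the data, so flag it
]

for quote in claimed_quotes:
    status = "OK" if is_grounded(quote, responses) else "NOT FOUND - possible hallucination"
    print(f"{status}: {quote}")
```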
4. Why scale becomes fiddly
"A million tokens" sounds enormous. In practice, qualitative data adds up quickly.
To analyse larger datasets, teams often:
- Split responses into chunks
- Summarise each chunk
- Then summarise the summaries
This can work, but each additional layer introduces abstraction. Over time, you risk drift in themes and inconsistent categorisation. It can also be time-consuming, and it gets expensive with top reasoning models.
As one data scientist put it:
"At the moment, we have to do a lot of chunking to get all the data summarised by AI. It's a pain."
The bigger picture
Large language models are extraordinary at many things. But they are not designed specifically for rigorous insight extraction from data.
Analysing human experience properly requires:
- Balanced handling of uncertainty and nuance
- Consistency in outputs, regardless of dataset size
- Traceability — trusted links back to source data
AI is reshaping what we can do with unstructured audience data. The opportunity is huge.
We just need to understand what these tools are optimised for, and where additional structure, validation, or specialised systems are required. When you're working with real human voices, accuracy and defensibility matter. And that's where the next generation of qualitative analytics needs to focus.