“It’s like talking to a PhD level expert” -Sam Altman
https://www.youtube.com/live/0Uu_VJeVVfo?si=PJGU-MomCQP1tyPk
These are always amazing when juxtaposed with apparently impressive LLM reasoning, knowledge, and creativity. You can trivially get them to make the most basic mistakes about words and numbers, and double down on those mistakes, repeatedly explaining that they're totally correct.
Have any systems tried prompting LLMs with a warning like "You don't intuitively or automatically know many facts about words, spelling, or the structure or context of text, when considered as text; for example, you don't intuitively or automatically know how words or other texts are spelled, how many letters they contain, or what the result of applying some code, mechanical transformation, or substitution to a word or text is. Your natural guesses about these subjects are likely to be wrong, because your training doesn't necessarily let you infer correct answers about them. If the content or structure of a word or text, or the result of applying a transformation, code, or the like to a text, is a subject of conversation, or you are going to make a claim about it, always use a tool to confirm your intuitions."?
This is a great idea. Like, if someone asked me to count the number of B's in your paragraph, I'd yeet it through `grep -o 'B' file.txt | wc -l` or similar; why would I sit there counting it by hand?
As a human, if you give me a number on screen like 100000000, I can't be totally sure if that's 100 million or 1 billion without getting close and counting carefully. I really ought to have my glasses on. The mouse pointer helps some as an ersatz thousands-separator, but still.
Since we're giving them tools, especially for math, it makes way more sense to start giving them access to some of the finest tools ever. Make an MCP into Mathematica or MATLAB and let the LLM write some math and have classical solvers actually deal with the results. Let the LLM write little bits of Bash or Python as its primary approach for dealing with these kinds of analytical questions.
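To make that concrete, here's a minimal sketch in plain Python of the kind of throwaway snippet an LLM could write and run instead of guessing; the paragraph text and the number are just made-up stand-ins:

```python
# A minimal sketch of the throwaway code an LLM could emit and execute
# via a code tool instead of guessing. The text and number are made up.
paragraph = "Blueberries and blackberries both begin with the letter B."

# Count a letter case-insensitively, the same job as `grep -o | wc -l`.
print(paragraph.lower().count("b"))

# Disambiguate a long number by printing it with thousands separators.
n = 100000000
print(f"{n:,}")  # 100,000,000 -> 100 million, not 1 billion
```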
It's like giving a kid a calculator...
If you have to build an MCP for every system you aren’t building intelligence in the first place.
You don't need specialised MCPs for this. In the past you could add "use python" to your ChatGPT prompt and it would do the right thing. This is exactly the intelligent "use the right tool for the right thing" idea. ChatGPT just wasn't trained to apply it in the right circumstances automatically.
Why? Just because? Analogize it to the human brain.
You can’t just prompt your way out of a systemic flaw
What's the systematic flaw?
I often tell LLMs to ask questions if required, and that it is a skilled developer working alongside me. That seems to help them be more collaborative rather than prescriptive.
A couple of weeks ago, I asked Google, ordinary Google search, how many times the letter r is found in 'preferred', and it told me 2. This century has taken quite a bitter turn against those of us who think that the 'enough' in 'good enough' ought to exclude products indistinguishable from the most grievously disgraceful products of sloth. But I have also lately realized that human beings, brains, society, culture, education, technology, computers, etc., are all extremely complicated emergent properties of a universe that is far beyond our understanding. And we ought not to complain too seriously, because this, too, shall pass.
With data starvation driving AI companies towards synthetic data, I'm surprised that an easily synthesized problem like this hasn't been trained out of relevance. Yet here we are with proof that it hasn't.
Are we a hundred percent sure it isn't a watermark that is by design?
A quick test anyone can run and say, yup, that is a model XYZ derivative running under the hood.
Because, as you quite rightly point out, it is trivial to train the model not to have this behaviour. For me, that is when Occam kicks in.
I remember initially believing the explanation for the Strawberry problem, but one day I sat down and thought about it, and realized it made absolutely zero sense.
The explanation that Karpathy was popularizing was that it has to do with tokenization.
However, models are not conscious of tokens, and they certainly don't have any ability to count them without tool help.
Additionally, if it were a tokenization issue, we would expect to spot the issue everywhere.
So yeah, I'm thinking it's a model tag or insignia of some kind, similar to the fun logos you find when examining many silicon integrated circuits under a microscope.
Isn't that just an artifact caused by the tokenization of the training and input data?
See
https://platform.openai.com/tokenizer
https://github.com/openai/tiktoken
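For anyone curious what the model actually "sees", here's a quick sketch using the tiktoken library linked above; I'm assuming the o200k_base encoding (the one GPT-4o uses), and whatever GPT-5 actually uses may split the word differently:

```python
# Sketch using tiktoken (linked above). Assumes the o200k_base encoding,
# which GPT-4o uses; GPT-5's actual encoding may split the word differently.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("blueberry")

# Print the byte chunks the model receives instead of individual letters.
for t in tokens:
    print(t, enc.decode_single_token_bytes(t))
```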
My phone still has gpt-4o which gets it right: https://files.catbox.moe/0yg6cu.jpg
But my browser has gpt-5 which says 3: https://files.catbox.moe/63qkce.jpg
Claude spells it out letter by letter: https://files.catbox.moe/f1irfx.jpg
So I thought GPT-5 Thinking might get it right, and it does: https://files.catbox.moe/xlchnr.jpg
It refuses to show the thinking process for this question though, so it's unclear whether it even used the reasoning model or fell back on a non-reasoning one.
> While GPT‑5 in ChatGPT is a system of reasoning, non-reasoning, and router models, GPT‑5 in the API platform is the reasoning model that powers maximum performance in ChatGPT. Notably, GPT‑5 with minimal reasoning is a different model than the non-reasoning model in ChatGPT, and is better tuned for developers. The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.
https://openai.com/index/introducing-gpt-5-for-developers/
> How many times does the letter b appear in blueberry?
AI> Thinking... .oO{ Well, let's see. The letter b is the second letter in the alphabet, so the answer must be two, right? That sounds right to me anyway. }
AI> Two. The answer is two.
Checks out!
I love this test. Demonstrates the "understanding" process of the language model.
Q: How many times does the letter b appear in blueberry?
The letter “b” appears 3 times in the word blueberry:
• 1st b – at the beginning: Blueberry
• 2nd b – in the middle: blueberry
• 3rd b – again in the middle: blueberry (wait… no — actually…)
Let’s break it down carefully:
blueberry → b, l, u, e, b, e, r, r, y
Letters:
1. b
2. l
3. u
4. e
5. b
6. e
7. r
8. r
9. y
So the letter b appears twice.
Final answer: 2 times.
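A single tool call would have skipped the whole wobble; in Python, for instance:

```python
# str.count counts non-overlapping occurrences of a substring.
print("blueberry".count("b"))  # 2
```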
If you choose the thinking model, it doesn't make this mistake. It means the auto router should be tuned to call the thinking model on edge cases like these.
If you are going to release a new set of models and eliminate the old ones, your new smallest model should have capabilities equal to or greater than your old models of similar size. This is not a routing problem. This is a problem of driving down the cost of running and scaling the business.
Do you not think 4o has the same capabilities as GPT-5 without thinking?
Maybe it's joking