> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.
This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
[1] https://arxiv.org/abs/1712.02950
The singular defects (or high-norm tokens) [1] may be related to attention sinks. It is interesting that all of the high-norm tokens share the same direction. Maybe the theory behind it is not very complex, and the issue can be fixed cleverly during training. A quick way to sanity-check the shared-direction claim is sketched below.
[1] https://openreview.net/pdf?id=4yBnUokU2v
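If you have access to a layer's hidden states, a rough check could look something like this (a sketch only; the outlier threshold and the synthetic activations are placeholders, not anything from the paper):

```python
import numpy as np

def high_norm_directions(hidden_states, norm_factor=4.0):
    # hidden_states: (seq_len, d_model) activations from one layer.
    # A token counts as "high-norm" if its norm exceeds norm_factor times the
    # median norm (the threshold is an arbitrary placeholder, not from the paper).
    norms = np.linalg.norm(hidden_states, axis=-1)
    outliers = np.where(norms > norm_factor * np.median(norms))[0]
    if len(outliers) < 2:
        return outliers, None
    units = hidden_states[outliers] / norms[outliers, None]
    return outliers, units @ units.T  # pairwise cosine similarity of outlier directions

# Synthetic demo: two tokens blown up along a shared "sink" direction plus noise.
rng = np.random.default_rng(0)
d = 64
sink_dir = rng.normal(size=d)
sink_dir /= np.linalg.norm(sink_dir)
states = rng.normal(size=(16, d))
states[0] = 50 * sink_dir + rng.normal(size=d)
states[7] = 60 * sink_dir + rng.normal(size=d)

idx, cos = high_norm_directions(states)
print(idx)  # -> [0 7]
print(cos)  # off-diagonal entries close to 1.0 if the directions really match
```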
Seems like this was a better solution to the same problem: https://www.evanmiller.org/attention-is-off-by-one.html
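For anyone who skips the link: the proposed fix is just an implicit extra zero logit in the softmax denominator, so a head can attend to (almost) nothing instead of being forced to hand out 100% of its weight. A minimal numpy sketch of the idea (my own variable names):

```python
import numpy as np

def softmax(x):
    # Standard softmax: the weights always sum to exactly 1.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def softmax_one(x):
    # "Off-by-one" softmax from the post: an implicit extra logit of 0 in the
    # denominator lets the weights sum to less than 1, so a head can abstain
    # instead of dumping its attention on a sink token.
    m = np.maximum(np.max(x), 0.0)  # shift for numerical stability
    z = np.exp(x - m)
    return z / (z.sum() + np.exp(-m))

scores = np.array([-4.0, -5.0, -3.5])  # a query that matches nothing well
print(softmax(scores).sum())      # 1.0: forced to attend somewhere
print(softmax_one(scores).sum())  # ~0.05: the head mostly opts out
```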
Did this end up working? It sounds plausible but it needs some empirical validation.
Yeah, attention sinks were applied to gpt-oss
I found a fairly large improvement in my toy transformer model where I added a "global" token akin to the CLS token in ViT.
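In case it helps anyone reproduce this, the general pattern is roughly the following (a PyTorch sketch; `SinkBlock` and the sizes are illustrative, not my exact code): prepend a learned embedding before self-attention and drop it again afterwards, so every position has somewhere harmless to park attention.

```python
import torch
import torch.nn as nn

class SinkBlock(nn.Module):
    """Toy self-attention block with a learned 'global'/sink token prepended."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.sink = nn.Parameter(torch.zeros(1, 1, d_model))  # learned sink embedding
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        sink = self.sink.expand(x.size(0), -1, -1)
        h = self.norm(torch.cat([sink, x], dim=1))  # prepend the global token
        out, _ = self.attn(h, h, h)
        return x + out[:, 1:, :]  # drop the sink's output; residual add as usual

block = SinkBlock()
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```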
Another approach I've seen is the "Diff transformer" from MS Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).
> Barbero et al. have shown that attention sinks serve as "pressure valves" preventing what researchers call "over-mixing"—a pathological state where deep models processing long sequences blur important distinctions between tokens. The presence of a sink draws attention away from other tokens, limiting the spread of information (and noise) and resulting in more stable embeddings.
This sounds like it is working for the wrong reasons. Surely the right behavior is for the right tokens to receive attention rather than the first handful. Jamming everything there is the complementary sin to blurring. I would investigate attention equalization paired with a sparsity prior, or something similar, to prevent blurring.
The point is that there's not always a right token to attend to. If the information you're looking for is not there, no clever attention scheme will find it. The best you can hope for when that happens is that the value returned in the "not found" case is distinguishable from the "found" case. Having an attention sink serve as a fixed "not found" value is one way to do this.
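To make that concrete, here is a toy numerical illustration (my own construction, not from the article): with plain softmax, a query that matches nothing still gets back a blurry average of unrelated values, whereas a sink key with a fixed value vector turns "not found" into a recognizable output.

```python
import numpy as np

def attend(scores, values):
    # Single-query softmax attention over a handful of keys/values.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w, w @ values

rng = np.random.default_rng(1)
values = rng.normal(size=(4, 8))                # four unrelated "content" tokens
no_match = np.array([-3.9, -4.2, -4.0, -4.1])   # the query matches none of them

# Without a sink: weights must sum to 1, so the output is a blurry mix.
w, out = attend(no_match, values)
print(w.round(2))          # roughly uniform over the four tokens

# With a sink key at logit 0 and a fixed zero value vector: the sink absorbs
# most of the attention, so "not found" collapses to a recognizable output.
w2, out2 = attend(np.concatenate(([0.0], no_match)),
                  np.vstack([np.zeros(8), values]))
print(w2.round(2))         # the sink takes ~0.93 of the weight
print(np.linalg.norm(out), np.linalg.norm(out2))  # out2 is much closer to zero
```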
Good point. Does that make them mitigate hallucinations?
In a sense? As the article notes, models trained using standard attention develop attention sinks naturally and removing them makes the model deteriorate completely, so the hallucinations you're thinking of were most likely output by a model that had already mitigated them in this way.
> The first few tokens often carried minimal semantic information—sometimes just a start-of-sequence marker or common words like "the" or "a."
I wonder if it makes sense to use the first word as a title of sorts, rather than going straight into a grammatically correct sentence, when prompting.
Some people start their prompts with "Hello" or "Please" or something similar, out of some habitual sense of politeness, I think. It would be hilarious if those prompts really work better because the model can use those words as attention sinks.
One point that Karpathy has made in some of his videos is that using additional tokens in the prompt can facilitate computation. If you ask a transformer to do some basic math, it will be more likely to get the right answer (or at least a better approximation) with a more verbose prompt. To me, this backs up the use of more conversational language ("Please," etc.) when prompting.
However, that seems to be contradicted by what was shown recently with the successful International Math Olympiad effort. Their prompts, such as https://github.com/aw31/openai-imo-2025-proofs/blob/main/pro... , were very terse. It's hard to tell where the prompt stops and the CoT response starts, in fact.
So there is probably some interplay between the need for attention sinks and the use of step-by-step reasoning. It might not be too surprising if the latter works because it's an indirect way to optimize the former.
> It's hard to tell where the prompt stops and the CoT response starts, in fact.
That's because you're looking at the final output that includes neither the prompt nor the intermediate chain of thought.
Good point -- I can see that, but it all ends up in the same context, anyway. Point being, the model seems to prefer to conserve tokens.
That said, now I'm wondering if all those dashes it spews out are more than just window dressing.
I wonder if the model could also just make its own sink tokens if the prompt doesn't have any. E.g. if the model first emits some "fluff" like "The answer to this question is:" before starting with the actual answer, it could use those tokens as attention sinks. Same with "thinking tokens" that don't directly contribute to the answer or invisible formatting tokens, etc.
True, along with "You're absolutely right! What an insightful observation. You're going places, bro," yadda yadda yadda.
It would be amusing if all that gratuitous sycophancy actually helped with inference accuracy. It would also be worth treating that as a bug to be fixed, of course.
"Magnets. How do they work?"
The heuristic doesn't work quite so well when applied to the actual original version of that line.
Is there a way to hint in the prompting what information should be retained in the attention sinks?
This is nice and useful because the new GPT-OSS model uses this technique. Kudos to the original authors!
And, as always, the FOSS ecosystem moves quickly: llama.cpp already fully supports them! https://github.com/ggml-org/llama.cpp/pull/15157