From Wikipedia:
> The Hierarchical navigable small world (HNSW) algorithm is a graph-based approximate nearest neighbor search technique used in many vector databases.[1] Nearest neighbor search without an index involves computing the distance from the query to each point in the database, which for large datasets is computationally prohibitive. For high-dimensional data, tree-based exact vector search techniques such as the k-d tree and R-tree do not perform well enough because of the curse of dimensionality.
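To make the "computationally prohibitive" part concrete, search without an index is just a full scan of distances. A minimal Python/numpy sketch (illustrative only, not from the article):

```python
import numpy as np

def brute_force_knn(query, vectors, k=10):
    """Exact k-NN: touch every stored vector, O(N * d) per query."""
    dists = np.linalg.norm(vectors - query, axis=1)  # N distance computations
    top = np.argpartition(dists, k)[:k]              # k smallest, unordered
    return top[np.argsort(dists[top])]               # sorted by distance

# Fine at 10^5 vectors; hopeless at 10^9, which is why ANN indexes exist.
db = np.random.rand(100_000, 768).astype(np.float32)
q = np.random.rand(768).astype(np.float32)
print(brute_force_knn(q, db, k=5))
```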
At very high scale, there's less usage of graphs, or clustering gets layered on top of graphs.
Graphs can be complex to build and rebalance. Graph-like data structures, where one node holds a pointer out to another node somewhere else in memory, aren't that cache friendly.
Add to that, people almost always want to *filter* vector search results, and this is a huge blind spot for consumers and providers. It's where the ugly performance surprises come from. Filtered HNSW isn't straightforward: you have to just keep traversing the graph looking for results that satisfy your filter.
HNSW came out of a benchmark regime where we just indexed some vectors and tried only to maximize recall versus query latency. It doesn't take into account the filtering / indexing almost everyone wants.
Turbopuffer, for example, doesn't use graphs at all; it uses SPFresh. And they recently got 200ms latency on 100B vectors.
https://turbopuffer.com/docs/vector
There is an entire section of the post about that. I believe it's more the illusion of a problem, born of product design issues, than a real challenge, since far-away results that match the filter are totally useless.
I'm facing the problem you describe daily. It's especially bad because it's very difficult for me to predict if the set of filters will reduce the dataset by ~1% (in which case following the original vector index is fine) or by 99.99% (in which case you just want to brute force the remaining vectors).
Tried a million different things, but haven't heard of Turbopuffer yet. Any references on how they perform with such additional filters?
Lucene and ES implement a shortcut for filters that are restrictive enough. Since the engine is already optimized for figuring out whether something falls into your filter set, you first determine that set's size. You traverse the HNSW normally, and if you end up visiting more nodes than your filter set's cardinality, you just switch to brute-forcing the distance comparisons over the filter set. So the worst-case scenario is doing 2x your filter set's size in vector distance operations. Quite neat.
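In pseudo-Python, the trick reads roughly like this (a sketch of the idea as described above, not Lucene's actual code; `traverse` is a hypothetical stand-in for the graph walk):

```python
import numpy as np

def filtered_knn(query, k, traverse, vectors, filter_ids):
    # filter_ids: set of doc ids passing the metadata filter (cheap to get
    # in Lucene/ES, since filters are already inverted-index lookups).
    budget = len(filter_ids)
    visited = 0
    hits = []

    for node_id in traverse(query):      # normal HNSW visit order
        visited += 1
        if node_id in filter_ids:
            hits.append((np.linalg.norm(vectors[node_id] - query), node_id))

        # Escape hatch: more graph hops than filtered docs means brute
        # forcing the filter set is now cheaper. Worst case you end up
        # doing ~2x |filter set| distance computations in total.
        if visited > budget:
            brute = [(np.linalg.norm(vectors[i] - query), i) for i in filter_ids]
            return sorted(brute)[:k]

    return sorted(hits)[:k]
```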
Oh that's nice! Any references on this shortcut? How do you activate that behavior? I was playing around with ES, but the only suggestion I found was to use `count` on filters before deciding (manually) which path to take.
Here you go https://github.com/apache/lucene/pull/656 - no need to do anything from the user side to trigger it as far as I know.
I think hybrid search with vector similarity and filtering has mostly been solved by Vespa, and not even recently.
https://blog.vespa.ai/vespa-hybrid-billion-scale-vector-sear...
For sure. But it's "solved" differently by every vector database. You have to pay attention to how it's solved.
Just curious: what is the state of the art for filtered vector search results? I took a quick look at the SPFresh paper and didn't see it specifically address filtering.
Full text search has this same issue.
two great points here: (1) quantization is how you speed up vector indexes, and (2) how you build your graph matters much, much less*
These are the insights behind DiskANN, which has replaced HNSW in most production systems.
past that, well, you should really go read the DiskANN paper instead of this article: product quantization is way, way more effective than simple int8 or binary quant.
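for anyone who hasn't met PQ: split each vector into m subvectors, k-means each subspace into 256 centroids, store one byte per subspace, and answer queries with per-subspace lookup tables. A rough numpy/scipy sketch of the idea (not DiskANN's implementation):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def pq_train(X, m=8, ksub=256):
    """One codebook of `ksub` centroids per subspace; each vector becomes m bytes.
       Assumes the dimension is divisible by m and len(X) >> ksub."""
    n, d = X.shape
    dsub = d // m
    codebooks, codes = [], np.empty((n, m), dtype=np.uint8)
    for j in range(m):
        sub = X[:, j * dsub:(j + 1) * dsub]
        centroids, labels = kmeans2(sub, ksub, minit='points', seed=1)
        codebooks.append(centroids)
        codes[:, j] = labels
    return codebooks, codes

def pq_search(q, codebooks, codes, k=10):
    """Asymmetric distance: per-subspace query-to-centroid tables, then every
       encoded vector is scored with m table lookups instead of a full distance."""
    m, dsub = len(codebooks), codebooks[0].shape[1]
    tables = np.stack([
        np.linalg.norm(codebooks[j] - q[j * dsub:(j + 1) * dsub], axis=1) ** 2
        for j in range(m)
    ])                                                # (m, ksub)
    dists = tables[np.arange(m), codes].sum(axis=1)   # (n,) approximate distances
    return np.argsort(dists)[:k]
```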
here's my writeup from a year and a half ago: https://dev.to/datastax/why-vector-compression-matters-64l
and if you want to skip forward several years to the cutting edge, check out https://arxiv.org/abs/2509.18471 and the references list for further reading
* but it still matters more than a lot of people thought circa 2020
Hi! I worked with product quantization in the past, in the context of a library I released to read LLMs stored in the llama.cpp format (GGUF). However, in the context of in-memory HNSWs, I found it to make a small difference: the recall is already almost perfect with int8. Of course it's very different when you are quantizing an actual neural network with, for instance, 4-bit quants; there it makes a huge difference. But in my use case I picked what would be the fastest, given that both performed equally well. What could potentially be done with PQ in the case of Redis Vector Sets is to make 4-bit quants work decently (though not as well as int8 anyway), but given how fat the data structure nodes are per se, I don't think this is a great tradeoff.
All this to say: the blog post mostly tells the conclusions, but to reach that design many things were tried, including things that looked cooler but in practice were not the best fit. It's not by chance that Redis HNSWs are easily able to reach 50k full queries/sec on decent hardware.
if you're getting near-perfect recall with int8 and no reranking then you're either testing an unusual dataset or a tiny one, but if it works for you then great!
Near-perfect recall vs fp32, not in absolute terms. TL;DR: it's not int8 that ruins it, at least if the int8 quants are computed per-vector and not with global centroids. Also, recall is a very illusory metric, but that's an argument for another blog post. In short, what really matters is that the best candidates are collected: the long tail is full of elements that are either far enough away or practically equivalent anyway, and all of this lives under the illusion that the embedding model already captures the similarity our application demands. That is, indeed, already an illusion, so if the 60th result shows up 72nd, it normally does not matter. The reranking that really matters (if there is the ability to do it) is the LLM picking / reranking: that, yes, makes all the difference.
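For readers wondering what "per-vector int8" means here: each vector carries its own scale factor, so one outlier vector can't wreck everyone else's precision. A minimal sketch of that kind of scheme (an assumption about the general approach, not the actual Redis Vector Sets code):

```python
import numpy as np

def int8_quantize(v):
    """Per-vector int8: the scale is derived from this vector alone."""
    scale = float(np.abs(v).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                      # all-zero vector, avoid division by zero
    return np.round(v / scale).astype(np.int8), np.float32(scale)

def int8_dot(q1, s1, q2, s2):
    """Approximate dot product computed directly on the quantized codes."""
    return s1 * s2 * float(np.dot(q1.astype(np.int32), q2.astype(np.int32)))
```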
This is well worth reading in full. The section about threading is particularly interesting: most of Redis is single-threaded, but antirez decided to use threads for the HNSW implementation and explains why.
Thanks! Appreciate your words.
> Similarly, there are, right now, efforts in order to really check if the “H” in the HNSWs is really needed, and if instead a flat data structure with just one layer would perform more or less the same (I hope I’ll cover more about this in the future: my feeling is that the truth is in the middle, and that it makes sense to modify the level selection function to just have levels greater than a given threshold).
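For context, the standard HNSW level assignment is an exponential-style draw, so upper layers get exponentially fewer nodes. One possible reading of the thresholded variant hinted at in the quote is sketched below (speculative; `min_upper_level` is a made-up parameter, not from the post or the Redis code):

```python
import math, random

def hnsw_level(m_l):
    """Standard HNSW: level = floor(-ln(U) * mL), with mL usually 1 / ln(M)."""
    return int(-math.log(1.0 - random.random()) * m_l)

def thresholded_level(m_l, min_upper_level=2):
    """Speculative variant: a node either stays on the base layer or gets a
       level of at least `min_upper_level`, thinning out the middle layers."""
    level = hnsw_level(m_l)
    return level if level >= min_upper_level else 0
```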
Small world networks have been my obsession for almost a decade, from a biological and sociological evolution angle. Love this stuff.
This is the paper he's referring to, which I was just reading again yesterday :)
Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" (2025) https://arxiv.org/abs/2412.01940
Also:
Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data (2010) https://www.jmlr.org/papers/v11/radovanovic10a.html
As someone also interested not just in the repair of the social fabric, but in the evolution of gendered social dynamics in social species (biologically and culturally), it's interesting to me that there might be some property of networks that:
1. Makes any random point as close as possible to any other random point (all solutions are equally likely to be close to the unknown solution somewhere in the network)
2. Keeps the number of edges that must be maintained minimal (valuable in social networks, where edges are relationships that carry a maintenance cost)
3. Involves some negotiation between hierarchy (aka entrepreneurial networks, with opportunity for value extraction through rent-seeking traffic) and hubness (which dissolves capture points in networks by "weaving" them out of existence). All the short paths in a small world pass through hubs, which necessarily maintain many more, weaker connections via much more "gossip"-like comms protocols
Am working on this stuff related to collective intelligence and democracy work, if anyone wants to be in touch https://linkedin.com/in/patcon-
Fascinating things about other small-world applications, very far from what I do, that I wasn't aware of. Thanks.
> many programmers are smart, and if instead of creating a magic system they have no access to, you show them the data structure, the tradeoffs, they can build more things, and model their use cases in specific ways. And your system will be simpler, too.
Basically my entire full-time job is spent prosecuting this argument. It is indeed true that many programmers are smart, but it is equally true that many programmers _are not_ smart, and those programmers have to contribute too. More hands is usually better than simpler systems for reasons that have nothing to do with technical proficiency.
>> More hands is usually better than simpler systems for reasons that have nothing to do with technical proficiency.
If you are working on open source databases, or something close to the metal, I agree with antirez; if you are working at some established tech business (e.g., a very old ecommerce site), I agree with you.
To be clear, I'm not disagreeing with antirez at all. I feel his argument in my bones. I am a smart programmer. I want simple, powerful systems that leave the kid gloves in the drawer.
The unfortunate reality is that a large cadre of people cannot handle such tools, and those people still have extremely valuable contributions to make.
I say this as a full-time research engineer at a top-10 university. We are not short on talent, new problems, or funding. There is ample opportunity to make our systems as simple/"pure" as possible, and I make that case vigorously. The fact remains that intentionally limiting scope for the sake of the many is often better than cultivating an elite few.
> long, sad story about MacOS and bad habits – I hadn’t lost something like that since the 90s, during blackouts
would love to hear this story as well now!
Well TLDR I'm an idiot :D
I write blog posts in TextEdit, just the white page to fill with text. Normally this is fine, as I finish writing and publish the blog post. This time, after writing it, I left it there for weeks: I wanted to refine it a bit, the post didn't feel "ready". Then I had to reboot the computer for an upgrade, and I somehow quit the application by hitting cancel when it prompted to save the document :-|
However, rewriting it was fast, and the second version was better. So, it's fine. But from now on I'll use the less pleasant (to me) Google Docs.
why not vim?
I code with vim, but to write prose I want a big white window with nothing else.