Congrats on the launch! I'm one of the authors of that paper you cited, glad it was useful and inspiring for building this :) Let me know if we can support in any way!
Wow! I enjoyed reading it a lot and it was definitely inspiring for this project!
Would love to talk to you about it and make sure we capture all of the pain points if you're open to it? :)
Absolutely, will DM you on X!
Graph DB OOMing 101. Can it do Erdős/Bacon numbers?
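For anyone unfamiliar: an Erdős or Bacon number is just the unweighted shortest-path distance from one fixed node in a collaboration graph, so a breadth-first search is enough. A minimal Python sketch over a made-up in-memory adjacency map (the `coauthors` data and `hop_distance` name are invented for illustration, and this is not HelixDB's API or HQL):

```python
from collections import deque

def hop_distance(graph, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

# Hypothetical co-authorship graph, purely for illustration.
coauthors = {
    "Erdős": ["A", "B"],
    "A": ["Erdős", "C"],
    "B": ["Erdős"],
    "C": ["A", "D"],
    "D": ["C"],
    "E": [],
}
```

The `seen` set is what keeps memory bounded by the number of nodes rather than the number of paths, which is where naive recursive path queries tend to blow up.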
Graph DBs have been plagued with exploding complexity of queries as doing things like allowing recursion or counting paths isn't as trivial as it may sound. Do you have benchmarks and comparisons against other engines and query languages?
This is very interesting. Are there any examples of interacting with LLMs? If the queries are compiled and loaded into the database ahead of time, the pattern of asking an LLM to generate a query from a natural language request seems difficult, because current LLMs aren't going to know your query language yet, and compiling each query for each prompt would add unnecessary overhead.
This is definitely a problem we want to work on fixing quickly. We're currently planning an MCP tool that can traverse the graph and decide for itself at each step where to go next, as opposed to having to generate actual text-written queries.
I mentioned in another comment that you can provide a grammar with constrained decoding to force the LLM to generate tokens that comply with the grammar. This ensures that only valid syntactic constructs are produced.
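To make the idea concrete, here's a toy sketch of grammar-constrained greedy decoding in Python. Everything in it is an assumption for illustration: the grammar is a tiny hand-written state machine, the token names are invented and are not real HQL, and `fake_model` is a stand-in for an LLM's per-token scores. Real implementations mask logits over the model's actual vocabulary.

```python
# Toy grammar as a finite-state machine: state -> {allowed token: next state}.
# The token names are invented for illustration and are not real HQL.
GRAMMAR = {
    "start": {"QUERY": "name"},
    "name": {"get_users": "arrow"},
    "arrow": {"=>": "body"},
    "body": {"N<User>": "end", "V<Doc>": "end"},
    "end": {"<eos>": None},
}

def constrained_decode(score_fn, grammar=GRAMMAR, start="start", max_steps=16):
    """Greedy decoding that only considers tokens the grammar allows next."""
    state, output = start, []
    for _ in range(max_steps):
        if state is None:  # the grammar has reached its final state
            break
        allowed = grammar[state]
        scores = score_fn(output)
        # Mask: ignore every token the grammar does not permit in this state.
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        output.append(token)
        state = allowed[token]
    return output

def fake_model(prefix):
    """Stand-in for an LLM: it always scores an invalid token highest."""
    return {"SELECT": 9.0, "V<Doc>": 2.0, "N<User>": 1.0, "QUERY": 0.5,
            "get_users": 0.4, "=>": 0.3, "<eos>": 0.1}

tokens = constrained_decode(fake_model)
```

Even though the stand-in model always prefers `SELECT`, it never appears in `tokens`, because it is never in the allowed set. Note this only guarantees syntax; the query can still be semantically wrong.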
Looks very interesting, but I've seen these kinds of multi-paradigm databases like Gel, Helix and Surreal and I'm not sure that any of them quite hit the graph spot.
Does Helix support much of the graph algorithm world? For things like GraphRAG.
Either way, I'd be all over it if there was a Python SDK which worked with the generated types!
Congrats! Any chance Helixdb can be run in the browser too, maybe via WASM? I'm looking for a vector db that can be pre-populated on the server and then be searched on the client so user queries (chat) stay on-device for privacy / compliance reasons.
Interesting, we've had a few people ask about this. So essentially you'd call the server to retrieve the HNSW and then store it in the browser and use WASM to query it?
Currently the roadblock for that is the LMDB storage engine. We have our own storage engine on our roadmap, which we want to include WASM support. If you wanna talk about it, reach out to me on Twitter: https://x.com/georgecurtiss
Can I run this as an embedded DB like sqlite?
Can I sidestep the DSL? I want my LLMs to generate queries and using a new language is going to make that hard or expensive.
Currently you can't run us embedded and I'm not sure how you could sidestep the DSL :/
We're working on putting our grammar into llama.cpp so that it only outputs grammatically correct HQL. But even without that, it shouldn't be hard or expensive to do. I wrote a Claude wrapper that had our docs in its context window, and it did a good job of writing queries most of the time.
It sounds very intriguing indeed. However, the README makes some claims. Are there any benchmarks to support them?
> Built for performance we're currently 1000x faster than Neo4j, 100x faster than TigerGraph
Those were actual benchmarks that we ran; we didn't get a chance to write them up properly before posting. I'll get on it now and notify by replying to this comment when they're on the readme :)
How does it compare with https://kuzudb.com/ ?
Kuzu doesn't support incremental indexing on the vectors. The vector index is completely separate and decoupled from the graph.
I.e., you have to re-index all of the vectors when you make an update to them.
Nice "I'll have this name" when there's already the helix editor :)
First I'm hearing of it. The Beatles must've been super pissed when Apple took their name :(
https://crates.io/search?q=Helix
I'm surprised no one on the team searched crates.io once before picking the name. Good luck!
I don't think `helix-editor` is even on crates.io, just placeholders.
https://github.com/helix-editor/helix/discussions/7038
That being said, when I saw `helix-db` I was thrown too. "What's a text editor doing writing a vector-graph database, I thought they were working on plugins?"
We just started off as a side project and thought the name fitted well, with the strands, graph-type structure, connections...
We didn't think of getting people to use it until we found it was solving a real pain point for people, so we weren't worried about trademarks or names. There was no other helix db so that was good enough for us at the time.
> There was no other helix db
https://en.wikipedia.org/wiki/Helix_(database)
There was no active one. We saw this and thought it would be a nice nod to history. We've actually spoken to some developers at Apple who thought this was really neat :)
It's not the end of the world, just me being a bit grumpy. I mean it when I say good luck! :)
Thank you :)
I can't tell if this is droll sarcasm, but just in case not...
https://en.wikipedia.org/wiki/Apple_Corps_v_Apple_Computer
perhaps it’s a homage to the famous Helix database (see Wikipedia)
well noted
How do you think about building the graph relationships? Any special approaches you use?
Pretty much the same way you would with any graph DB, with the added benefit of being able to treat a vector as a node by creating those explicit relationships between them.
Does that answer your question properly?
How scalable is your DB in your tests? Could it be performant on graphs with 1B/10B/100B connections?
Super cool!!! I'll try it this week and come back to give feedback.
I look forward to it :)
What is the max number of dimensions supported for a vector?
There is currently no cap. We will probably impose a similar cap to Qdrant or Pinecone some time soon (~64k). There's obviously a performance trade-off as you go up, but we hope to massively offset this by doing binary quantisation within the next couple of months.
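For readers who haven't met binary quantisation: the common form keeps only the sign of each dimension (1 bit instead of a 32-bit float) and compares vectors by Hamming distance. Below is a generic Python sketch of that idea; the thread doesn't say how HelixDB will implement it, and the function names and sample vectors are made up.

```python
def binarize(vector):
    """Pack the sign of each dimension into one integer (1 bit per dimension)."""
    bits = 0
    for i, x in enumerate(vector):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Count the bit positions where two packed codes differ."""
    return bin(a ^ b).count("1")

def nearest(query, codes):
    """Return the index of the stored code closest to the query in Hamming distance."""
    q = binarize(query)
    return min(range(len(codes)), key=lambda i: hamming(q, codes[i]))

# Made-up 4-dimensional vectors, purely for illustration.
vectors = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.5, 0.3, -0.1, 0.8],
    [0.7, -0.1, 0.5, 0.6],
]
codes = [binarize(v) for v in vectors]
best = nearest([0.8, -0.3, 0.2, -0.9], codes)
```

Because the codes are lossy, systems that use this usually re-rank the top candidates with the original float vectors.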
How can I migrate neo4j to this?
We can build an ingestion engine for you :)
We've built SQL and PGVector ones already, just waiting for someone who could make use of other ones before we build them.
Let us know! Twitter in my bio
how did you get it 3 OOMs faster than neo4j?
On comparable benchmarks with comparable guarantees? Comparable persistence levels? I’m very skeptical.
Because neo4j sucks! Partly because they're working with a monolith that I imagine is difficult to iterate on and it's written in Java. We've had the benefit of working on this in Rust which lets us get really nitty and gritty with different optimisations.
My friend who I worked on this with is putting together a technical blog on those graph optimisations so I'll link it here when he's done
What method/model are you using for sparse search?
We're going to use BM25. Currently it is just dense search. Sparse is coming very soon.
Have you thought about SPLADE models? E.g.: https://arxiv.org/abs/2109.10086
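For reference, BM25 scores a document by summing, over the query terms, an IDF weight times a length-normalised, saturating term frequency. A small self-contained Python sketch of the standard Okapi formula follows; the `k1` and `b` defaults, the IDF variant (the non-negative `log(1 + ...)` form) and the sample documents are my choices for illustration, not anything HelixDB has confirmed.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up corpus, tokenised by whitespace for simplicity.
docs = [
    "graph database with vector search".split(),
    "vector index for semantic search".split(),
    "text editor with plugins".split(),
]
scores = bm25_scores(["graph", "vector"], docs)
```

Whitespace tokenisation is the weak point here for non-English text, since BM25 only matches exact terms.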
Looks really interesting, I'll have a proper read. What would be your reasoning to incorporate this if we already have vector functionality and semantic search?
My project deals w/ non-English text, and BM25 performance is middling. A language-specific sparse model helps.
Looks nice! Are you looking to compete with https://www.falkordb.com or do something a bit different?
Pretty much, our biggest focus is on Graph and Hybrid RAG. They seem to have really honed in on Graph RAG since the last time I checked their website.
One of the problems I know people experience with them is that they're super slow at bulk reading.
Oh also, they aren't built in Rust haha
> so much easier that it’s worth a bit of a learning curve
I think you misspelled "vendor lock in"
You can literally use us for free haha. There's not a language that properly encapsulates graph and vector functionality, so we needed to make our own. Also, we thought it was dumb that query languages weren't type-safe... So we changed that
why not surrealdb?
General consensus is it's really slow, I like the concept of surreal though. Our first, and extremely bare bones, version of the graph db was 1-2 orders of magnitude faster than surreal (we haven't run benchmarks against surreal recently, but I'll put them here when we're done)