Just migrated all of our embeddings to this same model a few weeks ago at my company, and it's a game changer. Having 32k context is a 64x increase compared with our previously used model. Plus, being natively multilingual and producing standard 1024-dimensional vectors made it a seamless transition, even with millions of embeddings across thousands of databases.
I do recommend using https://github.com/huggingface/text-embeddings-inference for fast inference.
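With TEI it's basically a container plus an HTTP call. A rough sketch of the client side (the URL and port here just assume the default local setup; swap in whatever host you're actually serving on):

    # Query a text-embeddings-inference server via its /embed endpoint.
    # Assumes a TEI instance is already running and reachable locally.
    import requests

    TEI_URL = "http://127.0.0.1:8080/embed"

    def embed(texts):
        # TEI accepts a string or a list of strings under "inputs"
        resp = requests.post(TEI_URL, json={"inputs": texts})
        resp.raise_for_status()
        return resp.json()  # one embedding vector per input

    vectors = embed(["forum topic text goes here", "another topic"])
    print(len(vectors), len(vectors[0]))  # e.g. 2 x 1024

Nothing fancy, but it batches well and the server handles tokenization and truncation for you.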
What does it mean to generate a 1000-long float16 array from 32k tokens of context? Surely the embedding you get is no longer representative of the text.
Depends on your needs. You surely don't want 32k long chunks for doing the standard RAG pipeline, that's for sure.
My use case is basically a recommendation engine, where I retrieve a list of similar forum topics based on the one currently being read. Since it's dynamic user-generated content, a topic can vary from 10 to 100k tokens. Ideally I would generate embeddings from an LLM-generated summary, but that would increase inference costs considerably at the scale I'm applying it.
Having a larger possible context out of the box meant that a simple swap of embedding models greatly increased the quality of recommendations.
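For anyone curious, the retrieval side is nothing exotic: embed each topic once, then rank by cosine similarity against the embedding of the topic being read. A simplified in-memory sketch (in practice this would be a query against a vector index/database rather than a brute-force scan; names here are just illustrative):

    import numpy as np

    def cosine_scores(query_vec, topic_matrix):
        # Cosine similarity between one query vector and a matrix of topic vectors
        q = query_vec / np.linalg.norm(query_vec)
        m = topic_matrix / np.linalg.norm(topic_matrix, axis=1, keepdims=True)
        return m @ q

    def recommend(current_vec, topic_matrix, top_k=10):
        # topic_matrix: (N, 1024) precomputed topic embeddings
        # current_vec: (1024,) embedding of the topic currently being read
        scores = cosine_scores(current_vec, topic_matrix)
        return np.argsort(-scores)[:top_k]  # indices of the most similar topics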