Vector search has been around for a long time; Google, for example, has been using it since the late 1990s. This powerful, almost magical technology serves as a core component of most web and app services today, including modern AI-powered search using retrieval-augmented generation (RAG). It provides a lightweight way to leverage AI's natural language processing to find data with semantic context.
A vector database stores data as numerical embeddings that capture the semantic meaning of text, images, or other content, and it retrieves results by finding the closest vectors using similarity search rather than exact matches. Sounds like an AI large language model (LLM), right? This makes it great for web and app content because users can search by meaning and intent, not just keywords, so synonyms and loosely related concepts still match. It also scales efficiently, which is one reason large organizations like Google have used it.
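To make "closest vectors" concrete, here's a minimal sketch of cosine similarity, the distance measure most vector databases use for that comparison. (Real engines use optimized approximate-nearest-neighbor indexes rather than computing this pairwise, so treat it as an illustration, not an implementation.)

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: 1.0 means the vectors point in the same direction
    // (very similar meaning); values near 0 mean unrelated content.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            magA += a[i] * a[i];
            magB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }
}
```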
A happy side effect of how these platforms work is that they're also good at handling misspellings, to a point. To get truly robust handling of spelling variations, however, two strategies are common:
Spell correct the actual search text before using it
Include character n-grams in your vector database entries
Character n-grams break text into overlapping sequences of characters; embedding those sequences (alongside the commonly used semantic embeddings) lets vector search systems better match terms despite typos, inflections, or spelling variations. Without these n-grams, a misspelled query like "saracha sauce" would likely return a higher score for "hot sauce" entries. But by including character n-grams, a combined (fused) search more consistently returns a higher score for items with the correct spelling "sriracha sauce".
Using these n-grams can better handle searches with:
typos
missing letters
swapped letters
phonetic-ish variants
common misspellings
How does this work? At a high level, it adds a character match capability to the standard semantic search used by most vector database implementations. Here's a quick example of what happens under the hood. Take the first word in our previous example:
sriracha
3-grams: sri, rir, ira, rac, ach, cha
4-grams: srir, rira, irac, rach, acha
saracha
3-grams: sar, ara, rac, ach, cha
4-grams: sara, arac, rach, acha
Shared grams:
shared 3-grams: rac, ach, cha
shared 4-grams: rach, acha
So even though the beginning is wrong (sri vs. sar), the ending chunks that carry a lot of the distinctive shape of "sriracha" survive (rach, acha, cha). And since the second word ("sauce") is the same in both queries, they share even more grams.
When these matches are fused with semantic matches, it adds weight to the correctly spelled "sriracha sauce" entry, yielding a better match set.
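If you're curious how those grams get produced, the sliding window is only a few lines. This snippet is purely illustrative (the method name is mine, not from any library), but it reproduces the lists above:

```csharp
using System.Collections.Generic;

// Slide a window of size n across the text, emitting overlapping grams.
public static IEnumerable<string> GetNGrams(string text, int n)
{
    for (int i = 0; i <= text.Length - n; i++)
        yield return text.Substring(i, n);
}

// GetNGrams("sriracha", 3) => sri, rir, ira, rac, ach, cha
// GetNGrams("saracha", 3)  => sar, ara, rac, ach, cha
```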
When it comes to including character n-grams, there are only a couple of changes you need to make to a standard semantic vector database implementation:
When you generate embeddings, you also need to generate character n-gram embeddings; this is true both when you store data in the database and when you search.
When searching, you need to execute a search on both the semantic vectors and the n-gram vectors, then fuse the results using Reciprocal Rank Fusion (RRF). RRF scores each item by summing 1 / (k + rank) across every result list it appears in (k is a smoothing constant, commonly 60), which makes it a great way to merge disparate result sets whose raw scores aren't directly comparable.
The following samples fill those gaps. They are written in C# for .NET, which is part of a common stack we use to build cross-platform, secure, high-performance web and mobile apps and services for our clients. We also tend to prefer the vector database Qdrant for its performance, maintainability, and open source model, so that is what the samples reference.
References to AiService.GenerateEmbeddingsAsync() are not covered here; essentially, it's a method that generates standard semantic embeddings, so replace it with your own (likely existing) method. References to QdrantService.Client are merely references to a standard Qdrant client provided by the Qdrant NuGet package.
Note: Some of the code was generated by AI, but was reviewed and refactored by an actual human developer (me!).
First, you need a way to create n-grams. The CharNGramEmbedding class below will fill that gap. It allows you to generate character n-grams for a given string, and it also provides a method for fusing the semantic and n-gram search results into a single, weighted result set.
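A minimal version of that class might look like the sketch below. The specifics are illustrative assumptions, not prescriptions: 3- and 4-grams, a 512-dimension hashed vector, a 0.7/0.3 semantic/n-gram weighting, and k = 60 for RRF. Feature hashing is just one common way to turn a bag of grams into a fixed-size vector suitable for a vector database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Qdrant.Client.Grpc;

public static class CharNGramEmbedding
{
    // Break text into overlapping character n-grams (3- and 4-grams here).
    public static IEnumerable<string> GenerateNGrams(string text, int minN = 3, int maxN = 4)
    {
        text = text.ToLowerInvariant();
        for (int n = minN; n <= maxN; n++)
            for (int i = 0; i <= text.Length - n; i++)
                yield return text.Substring(i, n);
    }

    // Hash each gram into a bucket of a fixed-size vector, then
    // L2-normalize so cosine similarity behaves sensibly.
    public static float[] GenerateEmbedding(string text, int dimensions = 512)
    {
        var vector = new float[dimensions];
        foreach (var gram in GenerateNGrams(text))
            vector[StableHash(gram) % dimensions] += 1f;

        var norm = (float)Math.Sqrt(vector.Sum(v => (double)v * v));
        if (norm > 0)
            for (int i = 0; i < vector.Length; i++)
                vector[i] /= norm;
        return vector;
    }

    // string.GetHashCode() is randomized per process in modern .NET, which
    // would make stored vectors unusable after a restart, so use a stable
    // FNV-1a hash instead.
    private static int StableHash(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)(hash & int.MaxValue);
        }
    }

    // Weighted Reciprocal Rank Fusion: each hit contributes
    // weight / (k + rank) from every list it appears in.
    public static List<ScoredPoint> FuseResults(
        IReadOnlyList<ScoredPoint> semanticHits,
        IReadOnlyList<ScoredPoint> ngramHits,
        double semanticWeight = 0.7,
        double ngramWeight = 0.3,
        double k = 60)
    {
        var scores = new Dictionary<PointId, double>();
        var points = new Dictionary<PointId, ScoredPoint>();

        void Accumulate(IReadOnlyList<ScoredPoint> hits, double weight)
        {
            for (int rank = 0; rank < hits.Count; rank++)
            {
                var id = hits[rank].Id;
                scores.TryGetValue(id, out var current);
                scores[id] = current + weight / (k + rank + 1);
                points[id] = hits[rank];
            }
        }

        Accumulate(semanticHits, semanticWeight);
        Accumulate(ngramHits, ngramWeight);

        return scores.OrderByDescending(kv => kv.Value)
                     .Select(kv => points[kv.Key])
                     .ToList();
    }
}
```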
Now that you have the character n-gram generation and fusion handled, following is an example of performing a Qdrant upsert of a sample food object, including both sets of vectors.
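Something along these lines works, assuming a collection created with two named vectors ("semantic" and "ngram"). The Food record, collection name, and vector sizes are illustrative, and AiService.GenerateEmbeddingsAsync() is assumed to return a float[]:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Qdrant.Client.Grpc;

// Illustrative model; swap in your own.
public record Food(Guid Id, string Name, string Description);

public static class FoodIndexer
{
    // One-time setup: a collection with two named vectors.
    public static async Task CreateCollectionAsync() =>
        await QdrantService.Client.CreateCollectionAsync("foods", new VectorParamsMap
        {
            Map =
            {
                ["semantic"] = new VectorParams { Size = 1536, Distance = Distance.Cosine },
                ["ngram"] = new VectorParams { Size = 512, Distance = Distance.Cosine }
            }
        });

    public static async Task UpsertAsync(Food food)
    {
        // Semantic embedding of the full text (your existing method).
        float[] semantic = await AiService.GenerateEmbeddingsAsync(
            $"{food.Name}. {food.Description}");

        // Character n-gram embedding of the name, from the class above.
        float[] ngram = CharNGramEmbedding.GenerateEmbedding(food.Name);

        var point = new PointStruct
        {
            Id = food.Id,
            Vectors = new Dictionary<string, float[]>
            {
                ["semantic"] = semantic,
                ["ngram"] = ngram
            },
            Payload =
            {
                ["name"] = food.Name,
                ["description"] = food.Description
            }
        };

        await QdrantService.Client.UpsertAsync("foods", new List<PointStruct> { point });
    }
}
```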
Lastly, the following example shows how you can search the Qdrant data using both sets of vectors. Embeddings (semantic and character n-grams) for the prompt are generated and used in the search.
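Here's a sketch of that search using the same assumed names as above, with one search per named vector and the results fused via the class's RRF method. Note the over-fetching via topK * 4, which the paragraph below explains:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Qdrant.Client.Grpc;

public static class FoodSearch
{
    public static async Task<List<ScoredPoint>> SearchAsync(string query, int topK = 10)
    {
        // Embed the query both ways, exactly as at index time.
        float[] semantic = await AiService.GenerateEmbeddingsAsync(query);
        float[] ngram = CharNGramEmbedding.GenerateEmbedding(query);

        // Over-fetch so RRF has enough candidates to fuse (see below).
        ulong fetchLimit = (ulong)(topK * 4);

        var semanticHits = await QdrantService.Client.SearchAsync(
            "foods", semantic, vectorName: "semantic", limit: fetchLimit);

        var ngramHits = await QdrantService.Client.SearchAsync(
            "foods", ngram, vectorName: "ngram", limit: fetchLimit);

        // Fuse with weighted RRF and keep the final top-K.
        return CharNGramEmbedding.FuseResults(semanticHits, ngramHits)
                                 .Take(topK)
                                 .ToList();
    }
}
```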
For the best fused results, each search (semantic and n-gram) needs to return 3-5 times the size of the final result set. This is because you're trying to recover a good final top-K from two imperfect retrievers. If each retriever returns only about K results, you often don't have enough overlap and near misses to let fusion do its job, especially since the two methods return different items and their rank positions aren't directly comparable.
There's usually more to the story, so if you have questions or comments about this post, let us know!
Do you need a new software development partner for an upcoming project? We would love to work with you! From websites and mobile apps to cloud services and custom software, we can help!