Vector search has been around for a long time; Google, for example, has been using it since the late 1990s. This powerful, almost magical technology serves as a core component of most web and app services today, including modern AI-powered search using retrieval-augmented generation (RAG). It provides a lightweight way to leverage AI's natural language processing to find data with semantic context.
A vector database stores data as numerical embeddings that capture the semantic meaning of text, images, or other content, and it retrieves results by finding the closest vectors using similarity search rather than exact matches. Sounds like an AI large language model (LLM), right? This makes it great for web and app content because users can search by meaning and intent, not just keywords, so synonyms and loosely related concepts still match. It also scales efficiently, which is one reason large organizations like Google have used it.
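To make "closest vectors" concrete, here's a minimal sketch of cosine similarity, the distance measure most vector databases use for that comparison. (Real engines use optimized approximate-nearest-neighbor indexes rather than computing this pairwise, so treat it as an illustration, not an implementation.)

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: 1.0 means the vectors point in the same direction
    // (very similar meaning); values near 0 mean unrelated content.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            magA += a[i] * a[i];
            magB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }
}
```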
A happy side effect of how these platforms work is that they're also good at handling misspellings, to a point. To get truly robust handling of spelling variations, however, two strategies are common:
Spell correct the actual search text before using it
Include character n-grams in your vector database entries
Character n-grams break text into overlapping sequences of characters; embedding those sequences (alongside the commonly used semantic embeddings) lets vector search systems better match terms despite typos, inflections, or spelling variations. Without these n-grams, a misspelled query like "saracha sauce" would likely return a higher score for "hot sauce" entries. But by including character n-grams, a combined (fused) search more consistently returns a higher score for items with the correct spelling "sriracha sauce".
Using these n-grams can better handle searches with:
typos
missing letters
swapped letters
phonetic-ish variants
common misspellings
How does this work? At a high level, it adds a character match capability to the standard semantic search used by most vector database implementations. Here's a quick example of what happens under the hood. Take the first word in our previous example:
sriracha
3-grams: sri, rir, ira, rac, ach, cha
4-grams: srir, rira, irac, rach, acha
saracha
3-grams: sar, ara, rac, ach, cha
4-grams: sara, arac, rach, acha
Shared grams:
shared 3-grams: rac, ach, cha
shared 4-grams: rach, acha
So even though the beginning is wrong (sri vs. sar), the ending chunks that carry a lot of the distinctive shape of "sriracha" survive (rach, acha, cha). And since the second word ("sauce") is the same in both queries, they share even more grams.
When these matches are fused with semantic matches, it adds weight to the correctly spelled "sriracha sauce" entry, yielding a better match set.
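If you're curious how those grams get produced, the sliding window is only a few lines. This snippet is purely illustrative (the method name is mine, not from any library), but it reproduces the lists above:

```csharp
using System.Collections.Generic;

// Slide a window of size n across the text, emitting overlapping grams.
public static IEnumerable<string> GetNGrams(string text, int n)
{
    for (int i = 0; i <= text.Length - n; i++)
        yield return text.Substring(i, n);
}

// GetNGrams("sriracha", 3) => sri, rir, ira, rac, ach, cha
// GetNGrams("saracha", 3)  => sar, ara, rac, ach, cha
```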
When it comes to including character n-grams, there are only a couple of changes you need to make to a standard semantic vector database implementation:
When you generate embeddings, you also need to generate character n-gram embeddings; this is true both when you store data in the database and when you search.
When searching, you need to execute a search on both the semantic vectors and the n-gram vectors, then fuse the results using Reciprocal Rank Fusion (RRF). RRF scores each item by summing 1 / (k + rank) across every result list it appears in (k is a smoothing constant, commonly 60), which makes it a great way to merge disparate result sets whose raw scores aren't directly comparable.
The following samples fill those gaps. They are written in C# for .NET, which is part of a common stack we use to build cross-platform, secure, high-performance web and mobile apps and services for our clients. We also tend to prefer the vector database Qdrant for its performance, maintainability, and open source model, so that is what the samples reference.
References to AiService.GenerateEmbeddingsAsync() are not covered here; essentially, it's a method that generates standard semantic embeddings, so replace it with your own (likely existing) method. References to QdrantService.Client are merely references to a standard Qdrant client provided by the Qdrant NuGet package.
Note: Some of the code was generated by AI, but was reviewed and refactored by an actual human developer (me!).
First, you need a way to create n-grams. The CharNGramEmbedding class below will fill that gap. It allows you to generate character n-grams for a given string, and it also provides a method for fusing the semantic and n-gram search results into a single, weighted result set.
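A minimal version of that class might look like the sketch below. The specifics are illustrative assumptions, not prescriptions: 3- and 4-grams, a 512-dimension hashed vector, a 0.7/0.3 semantic/n-gram weighting, and k = 60 for RRF. Feature hashing is just one common way to turn a bag of grams into a fixed-size vector suitable for a vector database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Qdrant.Client.Grpc;

public static class CharNGramEmbedding
{
    // Break text into overlapping character n-grams (3- and 4-grams here).
    public static IEnumerable<string> GenerateNGrams(string text, int minN = 3, int maxN = 4)
    {
        text = text.ToLowerInvariant();
        for (int n = minN; n <= maxN; n++)
            for (int i = 0; i <= text.Length - n; i++)
                yield return text.Substring(i, n);
    }

    // Hash each gram into a bucket of a fixed-size vector, then
    // L2-normalize so cosine similarity behaves sensibly.
    public static float[] GenerateEmbedding(string text, int dimensions = 512)
    {
        var vector = new float[dimensions];
        foreach (var gram in GenerateNGrams(text))
            vector[StableHash(gram) % dimensions] += 1f;

        var norm = (float)Math.Sqrt(vector.Sum(v => (double)v * v));
        if (norm > 0)
            for (int i = 0; i < vector.Length; i++)
                vector[i] /= norm;
        return vector;
    }

    // string.GetHashCode() is randomized per process in modern .NET, which
    // would make stored vectors unusable after a restart, so use a stable
    // FNV-1a hash instead.
    private static int StableHash(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)(hash & int.MaxValue);
        }
    }

    // Weighted Reciprocal Rank Fusion: each hit contributes
    // weight / (k + rank) from every list it appears in.
    public static List<ScoredPoint> FuseResults(
        IReadOnlyList<ScoredPoint> semanticHits,
        IReadOnlyList<ScoredPoint> ngramHits,
        double semanticWeight = 0.7,
        double ngramWeight = 0.3,
        double k = 60)
    {
        var scores = new Dictionary<PointId, double>();
        var points = new Dictionary<PointId, ScoredPoint>();

        void Accumulate(IReadOnlyList<ScoredPoint> hits, double weight)
        {
            for (int rank = 0; rank < hits.Count; rank++)
            {
                var id = hits[rank].Id;
                scores.TryGetValue(id, out var current);
                scores[id] = current + weight / (k + rank + 1);
                points[id] = hits[rank];
            }
        }

        Accumulate(semanticHits, semanticWeight);
        Accumulate(ngramHits, ngramWeight);

        return scores.OrderByDescending(kv => kv.Value)
                     .Select(kv => points[kv.Key])
                     .ToList();
    }
}
```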
Now that you have the character n-gram generation and fusion handled, following is an example of performing a Qdrant upsert of a sample food object, including both sets of vectors.
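Something along these lines works, assuming a collection created with two named vectors ("semantic" and "ngram"). The Food record, collection name, and vector sizes are illustrative, and AiService.GenerateEmbeddingsAsync() is assumed to return a float[]:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Qdrant.Client.Grpc;

// Illustrative model; swap in your own.
public record Food(Guid Id, string Name, string Description);

public static class FoodIndexer
{
    // One-time setup: a collection with two named vectors.
    public static async Task CreateCollectionAsync() =>
        await QdrantService.Client.CreateCollectionAsync("foods", new VectorParamsMap
        {
            Map =
            {
                ["semantic"] = new VectorParams { Size = 1536, Distance = Distance.Cosine },
                ["ngram"] = new VectorParams { Size = 512, Distance = Distance.Cosine }
            }
        });

    public static async Task UpsertAsync(Food food)
    {
        // Semantic embedding of the full text (your existing method).
        float[] semantic = await AiService.GenerateEmbeddingsAsync(
            $"{food.Name}. {food.Description}");

        // Character n-gram embedding of the name, from the class above.
        float[] ngram = CharNGramEmbedding.GenerateEmbedding(food.Name);

        var point = new PointStruct
        {
            Id = food.Id,
            Vectors = new Dictionary<string, float[]>
            {
                ["semantic"] = semantic,
                ["ngram"] = ngram
            },
            Payload =
            {
                ["name"] = food.Name,
                ["description"] = food.Description
            }
        };

        await QdrantService.Client.UpsertAsync("foods", new List<PointStruct> { point });
    }
}
```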
Lastly, the following example shows how you can search the Qdrant data using both sets of vectors. Embeddings (semantic and character n-grams) for the prompt are generated and used in the search.
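Here's a sketch of that search using the same assumed names as above, with one search per named vector and the results fused via the class's RRF method. Note the over-fetching via topK * 4, which the paragraph below explains:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Qdrant.Client.Grpc;

public static class FoodSearch
{
    public static async Task<List<ScoredPoint>> SearchAsync(string query, int topK = 10)
    {
        // Embed the query both ways, exactly as at index time.
        float[] semantic = await AiService.GenerateEmbeddingsAsync(query);
        float[] ngram = CharNGramEmbedding.GenerateEmbedding(query);

        // Over-fetch so RRF has enough candidates to fuse (see below).
        ulong fetchLimit = (ulong)(topK * 4);

        var semanticHits = await QdrantService.Client.SearchAsync(
            "foods", semantic, vectorName: "semantic", limit: fetchLimit);

        var ngramHits = await QdrantService.Client.SearchAsync(
            "foods", ngram, vectorName: "ngram", limit: fetchLimit);

        // Fuse with weighted RRF and keep the final top-K.
        return CharNGramEmbedding.FuseResults(semanticHits, ngramHits)
                                 .Take(topK)
                                 .ToList();
    }
}
```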
For the best fused results, each search (semantic and n-gram) needs to return 3-5 times the size of the final result set. This is because you're trying to recover a good final top-K from two imperfect retrievers. If each retriever returns only about K results, you often don't have enough overlap and near misses to let fusion do its job, especially since the two methods return different items and their rank positions aren't directly comparable.
There's usually more to the story, so if you have questions or comments about this post, let us know!
Do you need a new software development partner for an upcoming project? We would love to work with you! From websites and mobile apps to cloud services and custom software, we can help!