13 May 2026

One Model, Two Speeds

Accuracy and speed pull in opposite directions. More detailed descriptions of text are more precise but slower to compare; less detailed ones are faster but lose nuance. ThinkableSpace resolves this with one model used at two levels of detail. A first pass uses compact 128-number descriptions to quickly identify candidate matches. A second pass re-ranks them using full 768-number descriptions for finer precision. Both come from the same model: the compact description is simply the beginning of the full one. Nothing is wasted.

In the previous post we explained how ThinkableSpace searches a large knowledge base in milliseconds by navigating a graph rather than scanning through everything. But there's a further challenge hidden inside that fast search: accuracy and speed often pull in opposite directions.

A more detailed description of a document chunk is more precise, better at distinguishing subtly different ideas. But more detailed descriptions take more time to compare. A less detailed description is faster to work with, but loses some nuance.

ThinkableSpace resolves this tension with an elegant idea: use the same AI model at different levels of detail, depending on what each stage of the search needs.

The Russian doll model

Imagine an AI model that, when it reads a passage of text, produces not just one description but a set of nested descriptions, like Russian dolls, where each one fits inside the next.

The smallest doll captures the most essential meaning: the broad topic, the general category of ideas. Open it up and you find a slightly larger one, adding more detail. Open that, and the next one adds more still. The largest doll holds the complete, nuanced description of everything in the text.

This is not just a metaphor: it's a real technique called Matryoshka Representation Learning, named after those same Russian dolls.

The model learns to pack the most important information into the first numbers of its output, with each additional number adding progressively finer detail. This means you can use just the first portion of the output for a fast, rough comparison, or the full output for a precise one, and both come from exactly the same model.
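ThinkableSpace's internal code isn't shown in this post, but the truncation idea can be sketched in a few lines of NumPy. The function name and the example vector below are illustrative, not taken from the actual system; the one assumption is that the model was trained Matryoshka-style, so the first 128 numbers are meaningful on their own:

```python
import numpy as np

def truncate_embedding(full_vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` numbers of a Matryoshka-style embedding
    and re-normalise, so cosine similarity still works at the smaller size."""
    compact = full_vec[:dims]
    norm = np.linalg.norm(compact)
    return compact / norm if norm > 0 else compact

# A stand-in for one chunk's full 768-number description.
rng = np.random.default_rng(0)
full = rng.normal(size=768)
full /= np.linalg.norm(full)

compact = truncate_embedding(full, 128)  # the "smallest doll"
print(compact.shape)  # (128,)
```

The key property is that `compact` is literally a prefix of `full`: no second model, no separate index of small vectors to keep in sync.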

Why this matters for search

A typical search has two very different requirements at different stages.

At the start, you need broad coverage. You're scanning a large collection and trying to identify which chunks are in the right neighbourhood of meaning, eliminating the vast majority that are clearly irrelevant. At this stage, speed matters most. Precision matters less, because you're making rough cuts, not final judgements.

At the end, you need fine precision. You've narrowed down to a shortlist of candidates that all seem relevant. Now you need to rank them carefully, to decide which passages are truly the closest match to what the user is looking for. At this stage, precision matters most. Speed matters less, because you're comparing only a handful of candidates.

These two requirements are perfectly served by two levels of detail from the same model.

Two stages, one model

When you search in ThinkableSpace, the search runs in two stages.

In the first stage, the system uses a compact, 128-number description of each chunk. These short descriptions can be compared extremely quickly, and the graph navigation described in the previous post becomes even faster at this resolution. The result is a list of several hundred candidate chunks, everything that's plausibly relevant to your query.

In the second stage, the system takes that shortlist and re-ranks it using full, 768-number descriptions. These longer descriptions capture subtler distinctions between ideas. A chunk that seemed relevant in the broad first pass might be revealed as a poorer match than another when examined in more detail. The ranking improves.
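The two stages can be sketched end to end with NumPy. This is a simplified brute-force version: the real system navigates a graph rather than scoring every chunk, and the function name, shortlist size, and random data are all illustrative assumptions. It assumes all full 768-number vectors are already unit-length:

```python
import numpy as np

def two_stage_search(query_full, chunks_full, shortlist_size=200,
                     top_k=10, compact_dims=128):
    """Stage 1: rough cut with 128-number prefixes.
    Stage 2: re-rank the shortlist with full 768-number descriptions."""
    # Stage 1: compare re-normalised compact prefixes to find candidates.
    q = query_full[:compact_dims]
    q = q / np.linalg.norm(q)
    c = chunks_full[:, :compact_dims]
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    rough_scores = c @ q
    candidates = np.argsort(-rough_scores)[:shortlist_size]

    # Stage 2: re-score only the shortlist with the full vectors.
    fine_scores = chunks_full[candidates] @ query_full
    order = np.argsort(-fine_scores)
    return candidates[order][:top_k]

# Illustrative data: 1,000 chunks, each a unit 768-number description.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(1000, 768))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)
query = rng.normal(size=768)
query /= np.linalg.norm(query)

top = two_stage_search(query, chunks)
print(top)  # indices of the 10 best-ranked chunks
```

Stage 1 touches every chunk but reads only a sixth of each vector; stage 2 reads full vectors but touches only a few hundred chunks. That asymmetry is where the speed comes from.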

The total time for both stages together is still measured in milliseconds. But the accuracy is substantially better than if only the fast, compact descriptions had been used throughout.

The best of both worlds

The elegance of this approach is that nothing is wasted. There's no separate "fast model" and "accurate model" to maintain. No trade-off where choosing speed means accepting a worse experience.

The same AI model produces both descriptions simultaneously. The compact description is simply the beginning of the full one. Using more of it adds precision. Using less adds speed.

This lets ThinkableSpace adapt: broader recall when you need to sweep a large collection, sharper precision when you're comparing the finalists.

What you experience

From a user's perspective, none of this is visible. You type a query and results appear: fast, relevant, and ranked sensibly. The two-stage process happens invisibly in the background, completed before most people finish reading the first result.

But the architecture behind it is what makes that experience possible. Search that feels effortless usually isn't. It's the result of deliberate choices about where to spend precision and where to spend speed, made once, invisibly, so you never have to think about it.