Serve Matryoshka Sentence Embeddings with nomic-ai/nomic-embed-text-v1.5
In this cookbook, we build an Encoderfile that serves Matryoshka sentence embeddings using the nomic-ai/nomic-embed-text-v1.5 model. You’ll package the model into a single, self-contained binary that runs fully offline and can be deployed as a REST API, gRPC service, or CLI.
Along the way, we show how to apply the model’s recommended Matryoshka post-processing and select a fixed embedding dimensionality at build time, making it easier to balance retrieval quality, latency, and memory footprint in production.
Check out the full code on GitHub.
What are Matryoshka Embeddings?
Matryoshka embeddings are embeddings that remain semantically meaningful even when truncated. A single model can produce embeddings at multiple dimensionalities by taking prefixes of the output vector, making it easy to balance retrieval quality against storage and performance constraints in downstream systems.
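For intuition, here is a minimal NumPy sketch (toy random vectors, not the actual model) showing that a truncated-and-renormalized prefix of a vector still carries most of the similarity signal:

```python
import numpy as np

# Toy "full-dimensional" embeddings; in practice these come from the model.
rng = np.random.default_rng(0)
full_a = rng.normal(size=768)
full_b = full_a + 0.1 * rng.normal(size=768)  # a semantically "close" vector

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components, then L2-normalize the prefix."""
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

for dim in (768, 512, 256, 64):
    a = truncate_and_normalize(full_a, dim)
    b = truncate_and_normalize(full_b, dim)
    # Cosine similarity is just the dot product of unit vectors.
    print(dim, float(a @ b))
```

For Matryoshka-trained models the effect is much stronger than for random vectors: the training objective deliberately concentrates the most useful information in the earliest dimensions.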
This Encoderfile is useful when you want to standardize on a fixed embedding size while still benefiting from a Matryoshka-trained model’s training regime. By selecting the embedding dimensionality at build time, you can tailor the binary to your storage, indexing, and memory constraints—then deploy it as a stable, reproducible artifact.
This is a good fit for production search and retrieval systems, offline indexing pipelines, and environments with strict operational or compliance requirements, where embedding shape must be fixed and predictable, and runtime configuration is intentionally limited.
How Matryoshka Embeddings Are Applied in This Encoderfile
The nomic-ai/nomic-embed-text-v1.5 model produces token-level hidden states at its full native dimensionality. On their own, these outputs are not directly usable as sentence embeddings. This Encoderfile applies the post-processing steps recommended by the model authors and compiles them directly into the binary.
All post-processing is implemented as a Lua transform and runs inside the Encoderfile at inference time. There is no runtime configuration: the embedding shape and normalization behavior are fixed at build time.
---Generated by Encoderfile ❤️
---Remember: Lua is 1-indexed!
MatryoshkaDim = 512
Eps = 1e-5

---Postprocessing script follows instructions from the official model repository:
---https://huggingface.co/nomic-ai/nomic-embed-text-v1.5

---Postprocess embeddings
---Must return 2D tensor of shape [batch_size, *]
---@param arr Tensor 3D tensor of shape [batch_size, seq_len, hidden_dim]
---@param mask Tensor Attention mask of shape [batch_size, seq_len]
---@return Tensor
function Postprocess(arr, mask)
  ---Step 1: mean pool
  local embeddings = arr:mean_pool(mask)
  ---Step 2: layer_norm along 2nd axis (1st axis in PyTorch land)
  embeddings = embeddings:layer_norm(2, Eps)
  ---Step 3: truncate along 2nd axis
  embeddings = embeddings:truncate_axis(2, MatryoshkaDim)
  ---Step 4: l2 normalize along 2nd axis (1st axis in PyTorch land)
  embeddings = embeddings:lp_normalize(2.0, 2)
  return embeddings
end
The transform performs the following steps:
- Mean pooling: token-level embeddings are averaged across the sequence using the attention mask, producing a single vector per input text.
- Layer normalization: the pooled embeddings are normalized to stabilize scale and match the model's reference implementation.
- Matryoshka truncation: the embedding vector is truncated to a fixed dimensionality (MatryoshkaDim). Because the model was trained with a Matryoshka objective, the prefix of the vector remains semantically meaningful even at lower dimensions.
- L2 normalization: the final embeddings are L2-normalized, making them suitable for cosine similarity and nearest-neighbor search.
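For intuition, the four steps above can be sketched in NumPy. This is an illustrative re-implementation, not the code the binary runs (the Encoderfile executes the Lua transform shown earlier):

```python
import numpy as np

MATRYOSHKA_DIM = 512
EPS = 1e-5

def postprocess(hidden, mask):
    """hidden: [batch, seq_len, hidden_dim]; mask: [batch, seq_len] of 0/1."""
    # Step 1: mean pooling over valid (unmasked) tokens only.
    mask_f = mask[:, :, None].astype(hidden.dtype)
    pooled = (hidden * mask_f).sum(axis=1) / np.clip(mask_f.sum(axis=1), 1e-9, None)
    # Step 2: layer normalization over the hidden dimension.
    mu = pooled.mean(axis=1, keepdims=True)
    var = pooled.var(axis=1, keepdims=True)
    normed = (pooled - mu) / np.sqrt(var + EPS)
    # Step 3: Matryoshka truncation to the first MATRYOSHKA_DIM components.
    truncated = normed[:, :MATRYOSHKA_DIM]
    # Step 4: L2 normalization so dot products equal cosine similarity.
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Fake hidden states with the model's native 768-dim width.
hidden = np.random.default_rng(0).normal(size=(2, 8, 768))
mask = np.ones((2, 8))
emb = postprocess(hidden, mask)
print(emb.shape)  # (2, 512)
```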
By compiling these steps into the Encoderfile, every inference produces embeddings with a fixed, predictable shape and identical semantics across environments. This avoids runtime configuration drift and makes the resulting binary easier to deploy in production systems where embedding dimensionality, memory usage, and indexing behavior must be tightly controlled.
The result is a single, reproducible artifact that serves Matryoshka embeddings at a chosen dimensionality—without requiring downstream systems to understand or reimplement the post-processing logic.
Building the Encoderfile
The container-based build is the easiest and most reproducible path: all dependencies are pinned and handled for you.
Step 1: Build the Encoderfile
Run:
This step:
- downloads the model artifacts
- applies the Matryoshka post-processing configuration
- builds the final Encoderfile binary
Step 2: Run the Encoderfile
Run:
The container runs the Encoderfile directly and starts an embedding server, exposing both an HTTP endpoint (port 8080) and a gRPC endpoint (port 50051). To see more options, run:
Alternatively, you can build manually. Use this path if you want full control over the build environment or want to inspect each step.
Step 1: Install Prerequisites
Ensure the encoderfile CLI is installed and available in your PATH. For installation instructions, check out our Getting Started guide.
To install the Hugging Face CLI (for downloading model artifacts):
Step 2: Download Model
Run the following:
This script downloads the nomic-ai/nomic-embed-text-v1.5 model files (config.json, tokenizer.json, tokenizer_config.json, and onnx/model.onnx) expected by the Encoderfile build configuration.
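If you prefer to script the download in Python, the same files can be fetched with the huggingface_hub library. The target directory below is an assumption; point it wherever your build configuration looks for model artifacts:

```python
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# The four files the Encoderfile build configuration expects.
MODEL_FILES = ("config.json", "tokenizer.json", "tokenizer_config.json", "onnx/model.onnx")

def download_model(local_dir="models/nomic-embed-text-v1.5"):
    """Fetch the model artifacts; `local_dir` is an illustrative assumption."""
    return [
        hf_hub_download(
            repo_id="nomic-ai/nomic-embed-text-v1.5",
            filename=filename,
            local_dir=local_dir,
        )
        for filename in MODEL_FILES
    ]

# Example usage (requires network access):
# paths = download_model()
```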
Step 3: Build the Encoderfile
Run the following:
This produces a single executable binary, named nomic-embed-text-v1_5.encoderfile. All configuration—model weights, embedding dimensionality, and post-processing logic—is compiled into this file.
Step 4: Run the Encoderfile
To start the embedding server:
Running Inference
You can verify that the server is running by executing the following in a separate terminal:
You should get back the following:
The following Python snippet shows how to extract sentence embeddings:
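As a sketch of what such a client can look like, the snippet below posts text to the HTTP endpoint started above. The endpoint path ("/embed") and payload schema are illustrative assumptions; consult the server's own documentation for the exact API. Note that nomic-embed-text-v1.5 expects a task prefix such as "search_document: " or "search_query: " on each input text:

```python
import requests  # pip install requests

SERVER = "http://localhost:8080"  # HTTP port exposed by the Encoderfile server

def embed(texts, server=SERVER):
    """Request sentence embeddings over HTTP.

    NOTE: the "/embed" path and the {"texts": ...} / {"embeddings": ...}
    schema are assumptions for illustration, not the confirmed API.
    """
    resp = requests.post(f"{server}/embed", json={"texts": texts}, timeout=30)
    resp.raise_for_status()
    return resp.json()["embeddings"]

# Example usage (requires the server to be running):
# vecs = embed(["search_document: Encoderfile ships models as single binaries."])
# len(vecs[0]) should equal MatryoshkaDim (512), fixed at build time.
```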