🚀 Overview

Encoderfile packages transformer encoders—optionally with classification heads—into a single, self-contained executable. No Python runtime, no dependencies, no network calls. Just a fast, portable binary that runs anywhere.

While Llamafile focuses on generative models, Encoderfile is purpose-built for encoder architectures with optional classification heads. It supports embedding, sequence classification, and token classification models—covering most encoder-based NLP tasks, from text similarity to classification and tagging—all within one compact binary.

Under the hood, Encoderfile uses ONNX Runtime for inference, ensuring compatibility with a wide range of transformer architectures.

Why?

  • Smaller footprint: a single binary measured in tens-to-hundreds of megabytes, not gigabytes of runtime and packages
  • Compliance-friendly: deterministic, offline, security-boundary-safe
  • Integration-ready: drop into existing systems as a CLI, microservice, or API without refactoring your stack

Encoderfiles can run as:

  • REST API
  • gRPC microservice
  • CLI
  • (Future) MCP server
  • (Future) FFI support for near-universal cross-language embedding

Supported Architectures

Encoderfile supports the following Hugging Face model classes (and their ONNX-exported equivalents):

  • Embeddings / Feature Extraction: AutoModel, AutoModelForMaskedLM (example models: bert-base-uncased, distilbert-base-uncased)
  • Sequence Classification: AutoModelForSequenceClassification (example models: distilbert-base-uncased-finetuned-sst-2-english, roberta-large-mnli)
  • Token Classification: AutoModelForTokenClassification (example models: dslim/bert-base-NER, bert-base-cased-finetuned-conll03-english)
  • ✅ All architectures must be encoder-only transformers: no decoders, no encoder-decoder hybrids (so no T5, no BART).
  • ⚙️ Models must have ONNX-exported weights (path/to/your/model/model.onnx).
  • 🧠 The ONNX graph inputs must include input_ids and may optionally include attention_mask (see the check after this list).
  • 🚫 Models relying on generation heads (AutoModelForSeq2SeqLM, AutoModelForCausalLM, etc.) are not supported.
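
To see which inputs your exported graph actually declares, you can inspect it directly. A minimal sketch, assuming the onnx Python package is installed and your weights sit at my_model/model.onnx (both are placeholders, not part of Encoderfile itself):

python -c "import onnx; print([i.name for i in onnx.load('my_model/model.onnx').graph.input])"

For a DistilBERT-style export this typically prints something like ['input_ids', 'attention_mask'].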

Gotchas

  • XLNet, Transformer-XL, and derivative architectures are not yet supported.

🧰 Setup

Prerequisites:

  • Rust
  • Python
  • uv
  • protoc

To set up your dev environment, run the following:

make setup

This will install Rust dependencies, create a virtual environment, and download model weights for integration tests (these will show up in models/).

🏗️ Building an Encoderfile

Prepare your Model

To create an Encoderfile, you need a Hugging Face model downloaded to an accessible local directory, and that directory must contain exported ONNX weights.

Export a Model

optimum-cli export onnx \
  --model <model_id>  \
  --task <task_type> \
  <path_to_model_directory>

Task types: see the Hugging Face task guide for available tasks (feature-extraction, text-classification, token-classification, etc.).
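
For example, to export bert-base-uncased for embedding use, with ./my_model as an arbitrary output directory:

optimum-cli export onnx \
  --model bert-base-uncased \
  --task feature-extraction \
  ./my_model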

Use a pre-exported model

Some models on HuggingFace already have ONNX weights in their repos.
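
One way to pull such a repo locally is with huggingface-cli. This is just a sketch; it assumes the Hugging Face CLI is installed, and the repo id is a placeholder you replace with a repo that actually ships model.onnx:

huggingface-cli download <model_repo_id> --local-dir my_model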

Your model directory should look like this:

my_model/
├── config.json
├── model.onnx
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── vocab.txt
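
Before building, it can help to confirm the required files are actually present. A quick sketch (my_model is the example directory above; the exact tokenizer files vary by model):

for f in config.json model.onnx tokenizer.json tokenizer_config.json special_tokens_map.json; do
  test -f "my_model/$f" || echo "missing: $f"
done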

Build the binary

uv run -m encoderbuild build \
    -n my-model-name \
    -t [embedding|sequence_classification|token_classification] \
    -m path/to/model/dir
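
For example, assuming the bert-base-uncased export from earlier lives in ./my_model, an embedding build could look like this (the name is arbitrary):

uv run -m encoderbuild build \
    -n bert-embedder \
    -t embedding \
    -m ./my_model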

Run REST Server

Your final binary is target/release/encoderfile. To run it as a server (default port: 8080; override with --http-port):

chmod +x target/release/encoderfile
./target/release/encoderfile serve
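
To serve on a different port, pass --http-port; the value below is just an example:

./target/release/encoderfile serve --http-port 9000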

REST API Usage

Embeddings

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["this is a sentence"]}'

Extracts token-level embeddings for each input.

Sequence Classification / Token Classification

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["this is a sentence"]}'

Returns predictions and logits.

🔧 Walkthrough Example - Sequence Classification

Let's use Encoderfile to perform sentiment analysis on a few input strings.

We'll work with distilbert-base-uncased-finetuned-sst-2-english, a DistilBERT model fine-tuned for sentiment classification on SST-2.

Export Model to ONNX

optimum-cli export onnx \
  --model distilbert-base-uncased-finetuned-sst-2-english \
  --task text-classification \
  <path_to_model_directory>

Build Encoderfile

uv run -m encoderbuild build \
  -n sentiment-analyzer \
  -t sequence_classification \
  -m <path_to_model_directory>

Start Server

Use the --http-port flag to start the REST server on a specific port (default: 8080).

./target/release/encoderfile serve 

Analyze Sentiment

curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["This is the cutest cat ever!", "Boring video, waste of time", "These cats are so funny!"]}'

Expected Output

JSON Output
{
    "results": [
        {
            "logits": [
                -4.045369,
                4.3970084
            ],
            "scores": [
                0.00021549074,
                0.9997845
            ],
            "predicted_index": 1,
            "predicted_label": "POSITIVE"
        },
        {
            "logits": [
                4.7616825,
                -3.8323877
            ],
            "scores": [
                0.9998148,
                0.0001851664
            ],
            "predicted_index": 0,
            "predicted_label": "NEGATIVE"
        },
        {
            "logits": [
                -4.2407384,
                4.565653
            ],
            "scores": [
                0.00014975043,
                0.9998503
            ],
            "predicted_index": 1,
            "predicted_label": "POSITIVE"
        }
    ],
    "model_id": "sentiment-analyzer"
}