# Creating a llamafile
A llamafile bundles the llamafile executable, model weights, and a set of
default arguments into a single self-contained file using the
APE (Actually Portable Executable) format,
which supports ZIP as a container for extra data. If you have already
downloaded a llamafile, you can inspect its contents with
`unzip -vl <filename.llamafile>` (or, on Windows, rename it to `.zip` and
open it in your ZIP GUI).
## Prerequisites
llamafile uses `zipalign` to bundle files
into the executable. It is included as a git submodule and built alongside
llamafile, so if you have already compiled llamafile you will find the
`zipalign` executable in the `o//third_party/zipalign` folder. To build it
on its own, invoke its `make` target from the repository root.
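The exact target name depends on the build layout; assuming the `o//third_party/zipalign` output path mentioned above, the standalone build would look like:

```shell
# Build only the zipalign tool (the target path is an assumption
# based on the output folder mentioned above)
make -j8 o//third_party/zipalign/zipalign
```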
> [!NOTE]
> The `zipalign` tool referenced here is not the Android `zipalign`. See the GitHub repo above for an in-depth description and up-to-date code.
### What you need
- The llamafile executable — download a prebuilt binary from the releases page, or build from source following these instructions.
- Model weights in GGUF format — download from Hugging Face (search here), or use weights already on disk from another application.
- A `.args` file — specifies default arguments (at minimum, the model path, so it loads automatically).
## Examples

### TUI, text-only
Let's see how this works in practice with a simple, text-only language model, e.g. Qwen3-0.6B:
- Search for the model weights in GGUF format (for the sake of this example we'll download these with Q8 quantization).
- Create a file named `.args` with the following content:
  ```
  -m
  /zip/Qwen3-0.6B-Q8_0.gguf
  -fa
  on
  --temp
  0.6
  --top-k
  20
  --top-p
  0.95
  --min-p
  0
  --presence-penalty
  1.5
  -c
  40960
  -n
  32768
  --no-context-shift
  --no-mmap
  ...
  ```
> [!NOTE]
> There is one argument per line. Most arguments are optional — the model name is the only required one (the above replicates the parameters suggested here). The `/zip/` path prefix is required whenever referencing a file packaged inside the llamafile. The `...` token is replaced with any additional CLI arguments the user passes at runtime.
- Copy the llamafile executable and run `zipalign` to embed the weights and args:

  ```shell
  cp o//llamafile/llamafile Qwen3-0.6B-Q8.llamafile
  o//third_party/zipalign/zipalign -j0 \
    Qwen3-0.6B-Q8.llamafile \
    Qwen3-0.6B-Q8_0.gguf \
    .args
  ```

- Run it:

  ```shell
  ./Qwen3-0.6B-Q8.llamafile
  ```
Congratulations, you've just made your own LLM executable that's easy to share with your friends!
Your new llamafile will load the Qwen model in the TUI by default. You can also run it as a web server with `./Qwen3-0.6B-Q8.llamafile --server` (the trailing `...` in `.args` forwards any extra command-line flags).
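llamafile's built-in server comes from llama.cpp, whose HTTP server exposes an OpenAI-compatible chat endpoint. Assuming the upstream defaults (port 8080, `/v1/chat/completions`), a query against the running server might look like:

```shell
# Query the embedded model over the OpenAI-compatible chat endpoint
# (port 8080 and the endpoint path are assumed llama.cpp server defaults)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```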
### Server, multimodal
Now, let us build another llamafile that serves a multimodal model via HTTP. If you want to be able to just type `./llava.llamafile`
and have it run the web server without having to specify arguments,
embed both the weights and the following `.args` file
(the weights used in this example are downloaded from here):
```
-m
/zip/llava-v1.6-mistral-7b.Q8_0.gguf
--mmproj
/zip/mmproj-model-f16.gguf
--server
--host
0.0.0.0
-ngl
9999
--no-mmap
...
```
Next, add both the weights and the `.args` file to the executable:
```shell
cp o//llamafile/llamafile llava.llamafile
o//third_party/zipalign/zipalign -j0 \
  llava.llamafile \
  llava-v1.6-mistral-7b.Q8_0.gguf \
  mmproj-model-f16.gguf \
  .args
./llava.llamafile
```
## Distribution
One good way to share a llamafile with your friends is by posting it on
Hugging Face. If you do that, it's recommended that you mention in
your Hugging Face commit message which git revision or released version
of llamafile you used when building your llamafile. That way everyone
online will be able to verify the provenance of its executable content. If
you've made changes to the llama.cpp or cosmopolitan source code, then
the Apache 2.0 license requires you to explain what changed. One way you
can do that is by embedding a notice in your llamafile using `zipalign`
that describes the changes, and mentioning it in your Hugging Face commit.
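Since `zipalign` simply adds files into the llamafile's ZIP section, such a notice can be embedded like any other file. A sketch (the `NOTICE.txt` and `my-model.llamafile` names are hypothetical):

```shell
# Embed a plain-text notice describing your source changes
# (NOTICE.txt is a hypothetical filename)
o//third_party/zipalign/zipalign -j0 \
  my-model.llamafile \
  NOTICE.txt
```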