Creating a llamafile

A llamafile bundles the llamafile executable, model weights, and a set of default arguments into a single self-contained file using the APE (Actually Portable Executable) format, which supports ZIP as a container for extra data. If you have already downloaded a llamafile, you can inspect its contents with unzip -vl <filename.llamafile> (or on Windows, rename it to .zip and open it in your ZIP GUI).

Prerequisites

llamafile uses zipalign to bundle files into the executable. It is included as a git submodule and built alongside llamafile, so if you have already compiled llamafile you have the zipalign executable in the o//third_party/zipalign folder. To build it on its own:

make o//third_party/zipalign

[!NOTE] The zipalign tool referenced here is not the Android zipalign. See its GitHub repository for an in-depth description and up-to-date code.

What you need

  • The llamafile executable — download a prebuilt binary from the releases page, or build from source following these instructions.

  • Model weights in GGUF format — download from Hugging Face (search here), or use weights already on disk from another application.

  • A .args file — specifies default arguments (at minimum, the model path so it loads automatically).

Examples

TUI, text-only

Let's see how this works in practice with a simple, text-only language model, e.g. Qwen3-0.6B:

  • Search for the model weights in GGUF format (for this example we'll download the Q8_0 quantization)
  • Create a file named .args with the following content:
-m
/zip/Qwen3-0.6B-Q8_0.gguf
-fa
on
--temp
0.6
--top-k
20
--top-p
0.95
--min-p
0
--presence-penalty
1.5
-c
40960
-n
32768
--no-context-shift
--no-mmap
...

[!NOTE] There is one argument per line. Most arguments are optional — the model name is the only required one (the above replicates the parameters suggested here). The /zip/ path prefix is required whenever referencing a file packaged inside the llamafile. The ... token is replaced with any additional CLI arguments the user passes at runtime.
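If you're scripting the build, the .args file can be written with a quoted heredoc (quoting keeps every line verbatim). A minimal version needs only the required model path plus the literal ... pass-through token:

```shell
# Minimal .args: the model path is the only required argument;
# the literal "..." line lets users append their own flags at runtime.
cat > .args <<'EOF'
-m
/zip/Qwen3-0.6B-Q8_0.gguf
...
EOF
```

Any of the optional arguments shown above can be added as extra lines, one argument or value per line.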

  • Copy the llamafile executable and run zipalign to embed the weights and args:
cp o//llamafile/llamafile Qwen3-0.6B-Q8.llamafile

o//third_party/zipalign/zipalign -j0 \
  Qwen3-0.6B-Q8.llamafile \
  Qwen3-0.6B-Q8_0.gguf \
  .args

./Qwen3-0.6B-Q8.llamafile

Congratulations, you've just made your own LLM executable that's easy to share with your friends!

Your new llamafile will start loading the Qwen model in the TUI. You can also run it as a web server with:

./Qwen3-0.6B-Q8.llamafile --server

Server, multimodal

Now, let us build another llamafile running a multimodal model served via HTTP. If you want to be able to just say:

./llava.llamafile

...and have it run the web server without having to specify arguments, embed both the weights and the following .args file (weights used in this example are downloaded from here):

-m
/zip/llava-v1.6-mistral-7b.Q8_0.gguf
--mmproj
/zip/mmproj-model-f16.gguf
--server
--host
0.0.0.0
-ngl
9999
--no-mmap
...

Next, add both the weights and the argument file to the executable:

cp o//llamafile/llamafile llava.llamafile

o//third_party/zipalign/zipalign -j0 \
  llava.llamafile \
  llava-v1.6-mistral-7b.Q8_0.gguf \
  mmproj-model-f16.gguf \
  .args

./llava.llamafile

Distribution

One good way to share a llamafile with your friends is by posting it on Hugging Face. If you do, it's recommended that you mention in your Hugging Face commit message which git revision or released version of llamafile you used to build it. That way everyone online will be able to verify the provenance of its executable content. If you've made changes to the llama.cpp or cosmopolitan source code, the Apache 2.0 license requires you to describe what changed. One way to do that is to embed a notice in your llamafile using zipalign that describes the changes, and mention it in your Hugging Face commit.