Getting Started with llamafile
The easiest way to try it for yourself is to download our example llamafile for the Qwen3.5 model (license: Apache 2.0). Qwen3.5 is a recent LLM that can do more than just chat; you can also upload images and ask it questions about them. With llamafile, this all happens locally: no data ever leaves your computer.
NOTE: we chose this model because it's the smallest one we have built a llamafile for, and therefore the most likely to work out-of-the-box for you. Please let us know if you still run into issues with it! If, on the other hand, you have powerful hardware and/or GPUs, feel free to choose a larger, more capable model, which should provide more accurate responses.
- Download Qwen3.5-0.8B-Q8_0.llamafile (1.77 GB).
- Open your computer's terminal.
  - If you're using macOS, Linux, or BSD, you'll need to grant permission for your computer to execute this new file, e.g. with `chmod +x Qwen3.5-0.8B-Q8_0.llamafile`. (You only need to do this once.)
  - If you're on Windows, rename the file by adding ".exe" to the end.
- Run the llamafile, e.g. `./Qwen3.5-0.8B-Q8_0.llamafile`.
- A chat interface will open in the terminal window. That's it: you can immediately start writing. You can also upload an image by using the `/upload` command and specifying the path to the image, or type `/help` to see the available commands.
- Note that while llamafile is running, you can also chat with it through llama.cpp's Web UI: just open a browser window and go to http://localhost:8080/.
- When you're done chatting, press Control-C to shut down llamafile.
Having trouble? See the Troubleshooting page.
JSON API Quickstart
Since llamafile relies on llama.cpp for serving models, it inherits all of its features. When started, in addition to hosting a web UI chat server at http://127.0.0.1:8080/, it also exposes endpoints compatible with the OpenAI API and Anthropic's Messages API. For further details on the available fields and endpoints, refer to the APIs documentation and llama.cpp server's README.
Curl API Client Example
The simplest way to get started with the API is to copy and paste the following curl command into your terminal:

```sh
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "LLaMA_CPP",
    "messages": [
      {
        "role": "system",
        "content": "You are LLAMAfile, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about python exceptions"
      }
    ]
  }' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
```
The response is piped through Python to pretty-print the JSON:

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "In the world of Python, where magic breaks and errors occur,\nA script fails when it should not have failed.\nWith a `KeyError`, I can't access the key,\nSo I tell you to use the `except` clause!"
      }
    }
  ],
  "created": 1773659260,
  "model": "Qwen3.5-0.8B-Q8_0.gguf",
  "system_fingerprint": "b1773565177-7f5ee5496",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 52,
    "prompt_tokens": 49,
    "total_tokens": 101
  },
  "id": "chatcmpl-KOqwN6C0oRzINGZuFqZ95bU1iPfc6RFO",
  "timings": {
    "cache_n": 0,
    "prompt_n": 49,
    "prompt_ms": 54.944,
    "prompt_per_token_ms": 1.1213061224489795,
    "prompt_per_second": 891.8171228887594,
    "predicted_n": 52,
    "predicted_ms": 405.856,
    "predicted_per_token_ms": 7.804923076923076,
    "predicted_per_second": 128.1242608215722
  }
}
```
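When scripting against the API, the fields you usually want are the assistant's reply and the token counts. A minimal sketch of pulling those out of a response shaped like the one above (the dict here is an abbreviated stand-in for the server's actual output):

```python
# Extract the reply and token usage from an OpenAI-style chat
# completion response. This dict is an abbreviated stand-in for
# the real server output shown above.
response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "In the world of Python, where magic breaks and errors occur, ...",
            },
        }
    ],
    "model": "Qwen3.5-0.8B-Q8_0.gguf",
    "usage": {"completion_tokens": 52, "prompt_tokens": 49, "total_tokens": 101},
}

reply = response["choices"][0]["message"]["content"]
total_tokens = response["usage"]["total_tokens"]
print(reply)
print(f"tokens used: {total_tokens}")
```

The same field paths apply whether you parse the curl output with `json.load` or use a client library.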
Python API Client Example
If you've already developed your software using the [`openai` Python package](https://pypi.org/project/openai/) (published by OpenAI), then you should be able to port your app to talk to llamafile instead by making a few changes to `base_url` and `api_key`. This example assumes you've run `pip3 install openai` to install OpenAI's client software, which is required by this example. Their package is just a simple Python wrapper around the OpenAI API interface, which can be implemented by any server.

```python
#!/usr/bin/env python3
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)
print(completion.choices[0].message)
```
This prints a `ChatCompletionMessage` such as:

```
ChatCompletionMessage(content="A script that crashes like a ghost,\nWhen it tries to solve the problem deep and fast.\nThe error message pops up in a bright light,\nAnd tells us what's wrong when we try to fix it.", refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None)
```
Using llamafile with external weights
Even though our example llamafiles have the weights built-in, you don't have to use llamafile that way. Instead, you can download just the llamafile software (without any weights included) from our releases page. You can then use it alongside any external weights you may have on hand. External weights are particularly useful for Windows users because they enable you to work around Windows' 4GB executable file size limit.
For Windows users, here's an example using the gpt-oss LLM (whose weights are over 12 GB):

```sh
curl -L -o llamafile.exe https://huggingface.co/mozilla-ai/llamafile_0.10.0/resolve/main/llamafile_0.10.0
curl -L -o gpt-oss.gguf https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-Q5_K_S.gguf
./llamafile.exe -m gpt-oss.gguf
```

Windows users may need to change `./llamafile.exe` to `.\llamafile.exe` when running the above command.
Running llamafile with models downloaded by third-party applications
This section answers the question: "I already have a model downloaded locally by application X; can I use it with llamafile?" The general answer is yes, as long as those models are stored locally in GGUF format, though how hacky this is in practice varies by application. A few examples (tested on a Mac) follow.
LM Studio
LM Studio stores downloaded models in ~/.cache/lm-studio/models/lmstudio-community, in subdirectories with the same name as the models, minus their quantization level. So if you have downloaded e.g. the gpt-oss-20b-MXFP4.gguf file, it will be stored in ~/.cache/lm-studio/models/lmstudio-community/gpt-oss-20b-GGUF/ and you can run llamafile as follows: `llamafile -m ~/.cache/lm-studio/models/lmstudio-community/gpt-oss-20b-GGUF/gpt-oss-20b-MXFP4.gguf`
Ollama
When you download a new model with ollama, all its metadata is stored in a manifest file under ~/.ollama/models/manifests/registry.ollama.ai/library/. The directory and manifest file names match the model name as returned by `ollama list`. For instance, for llama3:latest the manifest file will be ~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest.
The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template, etc) to a sha256 digest. The digest corresponding to the element whose mediaType is application/vnd.ollama.image.model is the one referring to the model's GGUF file.
Each sha256 digest is also used as a filename in the ~/.ollama/models/blobs directory (if you look into that directory you'll see only those sha256-* filenames). This means you can directly run llamafile by passing the sha256 digest as the model filename. So if e.g. the llama3:latest GGUF file digest is sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29, you can run llamafile as follows:
```sh
cd ~/.ollama/models/blobs
llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
```
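The digest lookup described above can also be scripted rather than done by eye. A sketch, assuming a manifest with the `layers`/`mediaType`/`digest` layout implied by the description above (the field names and the sample manifest fragment here are assumptions, not real Ollama output):

```python
def model_blob_digest(manifest: dict) -> str:
    """Return the blob filename of the GGUF weights referenced by an
    Ollama manifest: the layer whose mediaType marks the image model."""
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == "application/vnd.ollama.image.model":
            # Digests appear as "sha256:<hex>" in the manifest, but blob
            # filenames use "sha256-<hex>".
            return layer["digest"].replace("sha256:", "sha256-")
    raise KeyError("no model layer found in manifest")

# Illustrative manifest fragment (assumed shape, not real Ollama output):
manifest = {
    "layers": [
        {"mediaType": "application/vnd.ollama.image.license",
         "digest": "sha256:aaaa"},
        {"mediaType": "application/vnd.ollama.image.model",
         "digest": "sha256:00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29"},
    ]
}
print(model_blob_digest(manifest))
```

You could then `cd ~/.ollama/models/blobs` and pass the printed name to `llamafile -m` as shown above.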