Quickstart

You can deploy Lumigator either locally or into a distributed environment using Kubernetes. In this quickstart guide, we’ll start Lumigator locally and evaluate a model on a dataset that we upload. Looking to develop Lumigator? If so, you’ll be interested in the local development guide for details of running in development mode.

If you hit any issues running the quickstart, we want to hear about it! You can submit an issue here.

Prerequisites

Lumigator is currently supported on Linux and Mac. Windows is not yet tested, but we welcome contributions, see Contributing Guide.

Before you start, make sure you have the following:

A working installation of Docker
- On Mac, Docker Desktop >= 4.37, and docker-compose >= 2.31.
- On Linux, please also complete the post-installation steps.
The directory $HOME/.cache/huggingface/ must exist and be readable and writeable. Lumigator uses this directory for accessing cached huggingface hub models.
If you want to evaluate against hosted API-based LLM services such as the platforms provided by OpenAI, Mistral, or Deepseek, you will need to set the appropriate API key setting as a secret in Lumigator. Refer to API settings configuration for more details.
If your system has an NVIDIA GPU, you need to have installed the NVIDIA Container Toolkit following their instructions. After that, open a terminal and run:
```
export RAY_WORKER_GPUS=1
export RAY_WORKER_GPUS_FRACTION=1.0
export GPU_COUNT=1
```
Important: Run the next deployment steps in this same terminal, or thes env vars must be set in your shell configuration

Local Deployment

Lumigator is run locally using docker-compose. In order to deploy the latest release of Lumigator:

Clone the Lumigator repository:

user@host:~$ git clone git@github.com:mozilla-ai/lumigator.git

Change to the Lumigator directory:
```
user@host:~$ cd lumigator
```

Run the start-lumigator make target:

user@host:~/lumigator$ make start-lumigator

This will run all of the components needed for Lumigator.

This creates multiple container services networked together to make up all the components of the Lumigator application:

minio: Local storage for datasets that mimics S3-API compatible functionality.
backend: Lumigator’s FastAPI REST API. Access the Swagger HTTP Docs at http://localhost:8000/docs
ray: A Ray cluster for submitting several types of jobs. Access the Ray dashboard at http://localhost:8265
mlflow: Used to track experiments and metrics, accessible at http://localhost:8001
frontend: Lumigator’s Web UI, accessible at http://localhost:80

Verify

To verify that Lumigator is running, open a browser and navigate to http://localhost:8000. You should receive the following JSON response:

{"Hello": "Lumigator!🐊"}

Using Lumigator

Now that Lumigator is deployed, we can use it to compare a few models. In this guide, we’ll evaluate GPT-4o for a few samples of the dialogsum dataset that we store here, on the task of summarization. Lumigator also supports evaluation on translation tasks; refer to the translation evaluation guide for details.

We will show how to do this using either cURL or the Lumigator SDK. See the UI guide for information about how to use the UI.

The steps are as follows:

Upload the dialogsum dataset to Lumigator
Create an experiment, which is a container for running the workflow for each model
Run the summarization workflow for the model
Retrieve the results of the workflow

Upload a Dataset

To upload a dataset, you need to send a POST request to the /datasets endpoint. The request should include the dataset file.

To run this example, first cd to the lumigator directory. Then run

cURL

user@host:~/lumigator$ export DATASET_PATH=lumigator/sample_data/summarization/dialogsum_exc.csv
user@host:~/lumigator$ curl -s http://localhost:8000/api/v1/datasets/ \
  -H 'Accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'dataset=@'$DATASET_PATH';type=text/csv' \
  -F 'format=job' | jq
{
  "id": "9ac42307-e0e5-4635-a9ce-48acdb451742",
  "filename": "dialogsum_exc.csv",
  "format": "job",
  "size": 3603,
  "ground_truth": true,
  "run_id": null,
  "generated": false,
  "generated_by": null,
  "created_at": "2025-02-19T20:00:01"
}

Python SDK

from lumigator_sdk.lumigator import LumigatorClient
from lumigator_schemas.datasets import DatasetFormat

dataset_path = 'lumigator/sample_data/summarization/dialogsum_exc.csv'
client = LumigatorClient('localhost:8000')

response = client.datasets.create_dataset(
    open(dataset_path, 'rb'),
    DatasetFormat.JOB
)

Note

The dataset file should be in CSV format and contain a header row with the following columns: examples, ground_truth. The ground_truth column is optional since you can generate it using Lumigator. See here for an example.

You can verify that the dataset was uploaded successfully by asking the API to list all datasets and checking that the uploaded dataset is in the list:

cURL

user@host:~/lumigator$ curl -s http://localhost:8000/api/v1/datasets/ | jq -r '.items | .[] | .filename'
dialogsum_exc.csv

Python SDK

datasets = client.datasets.get_datasets()
print(datasets.items[-1].filename)

Create an Experiment

Now that you have uploaded a dataset, you can create an experiment. An experiment is a container for running evaluations of models with the dataset. To this end, you need to send a POST request to the /experiments endpoint. The request should include the following required fields:

A name for the experiment job.
A short description.
The ID of the dataset you want to use for evaluations.
The task definition, which is summarization for this example.

Here is an example of how to create an experiment:

Note

The steps assume you only have uploaded a single dataset. If you have multiple datasets uploaded, this command will use the latest one. If you want a different dataset, replace the "$(curl -s http://localhost:8000/api/v1/datasets/ | jq -r '.items | .[0].id')" code with the ID of the dataset you’d like to use.

cURL

Set the following variables:

user@host:~/lumigator$ export EXP_NAME="DialogSum Summarization" \
       EXP_DESC="See which model best summarizes Dialogues." \
       EXP_DATASET="$(curl -s http://localhost:8000/api/v1/datasets/ | jq -r '.items | .[0].id')" \
       EXP_TASK="summarization"

Define the JSON string:

user@host:~/lumigator$ export JSON_STRING=$(jq -n \
        --arg name "$EXP_NAME" \
        --arg desc "$EXP_DESC" \
        --arg dataset_id "$EXP_DATASET" \
        --arg task "$EXP_TASK" \
        '{name: $name, description: $desc, dataset: $dataset_id, task_definition: {task: $task}}')

Create the experiment:

user@host:~/lumigator$ curl -s http://localhost:8000/api/v1/experiments/ \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "$JSON_STRING" | jq
{
  "id": "1",
  "name": "DialogSum Summarization",
  "description": "See which model best summarizes Dialogues.",
  "created_at": "2025-02-19T20:11:55.492000",
  "task_definition": {
    "task": "summarization"
  },
  "dataset": "9ac42307-e0e5-4635-a9ce-48acdb451742",
  "updated_at": null,
  "workflows": null
}

Python SDK

from lumigator_schemas.experiments import ExperimentCreate
from lumigator_schemas.tasks import SummarizationTaskDefinition

dataset_id = datasets.items[-1].id
request = ExperimentCreate(
    name="DialogSum Summarization",
    description="See which model best summarizes Dialogues.",
    dataset=dataset_id,
    task_definition=SummarizationTaskDefinition()
)
experiment_response = client.experiments.create_experiment(request)
experiment_id = experiment_response.id
print(f"Experiment created and has ID: {experiment_id}")

Create a new secret for your OpenAI API key

Before creating your first evaluation workflow with an OpenAI model, you need to create a secret in Lumigator to store your OpenAI API key. This secret will be used by Lumigator to authenticate with the OpenAI API.

cURL

Set the following variables:

user@host:~/lumigator$ export SECRET_NAME="openai_api_key" \
       VALUE="sk-..." \
       DESCRIPTION="My OpenAI API Key"

Define the JSON string:

usert@host:~/lumigator$ export JSON_STRING=$(jq -n \
        --arg value "$VALUE" \
        --arg desc "$DESCRIPTION" \
        '{value: $value, description: $desc}')

Submit the secret:

user@host:~/lumigator$ curl -X PUT http://localhost:8000/api/v1/settings/secrets/$SECRET_NAME \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "$JSON_STRING" | jq

Python SDK

api_key = client.settings.secrets.APIKey.OPENAI

response = client.settings.secrets.upload_api_key(
    api_key,
    secret_value="sk-...",
)

Trigger the workflows

Now it’s time to evaluate a model! Let’s trigger workflows to evaluate GPT-4o. This process can be repeated for as many models as you would like to evaluate in the experiment.

Note

the steps assume you only have created a single experiment. If you have multiple experiments, replace the "$(curl -s http://localhost:8000/api/v1/experiments/ | jq -r '.items | .[0].id')" code with the ID of the experiment you want.

cURL

Set the following variables:

user@host:~/lumigator$ export WORKFLOW_NAME="OpenAI 4o" \
       WORKFLOW_DESC="Summarize with 4o." \
       EXPERIMENT_ID="$(curl -s http://localhost:8000/api/v1/experiments/ | jq -r '.items | .[0].id')" \
       SECRET_KEY_NAME="openai_api_key"

Define the JSON string:

user@host:~/lumigator$ export JSON_STRING=$(jq -n \
        --arg name "$WORKFLOW_NAME" \
        --arg model "gpt-4o" \
        --arg provider "openai" \
        --arg desc "$WORKFLOW_DESC" \
        --arg secret_key_name "$SECRET_KEY_NAME" \
        --arg exp_id "$EXPERIMENT_ID" \
        '{name: $name, description: $desc, model: $model, provider: $provider, secret_key_name: $secret_key_name, experiment_id: $exp_id}')

Trigger the workflow:

user@host:~/lumigator$ curl -s http://localhost:8000/api/v1/workflows/ \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d "$JSON_STRING" | jq
{
  "id": "2ad30ca4aa7c4aa59eb6dc7235eebd57",
  "experiment_id": "1",
  "model": "gpt-4o",
  "name": "OpenAI 4o",
  "description": "Summarize with 4o.",
  "system_prompt": "You are a helpful assistant, expert in text summarization. For every prompt you receive, provide a summary of its contents in at most two sentences.",
  "status": "created",
  "created_at": "2025-03-24T09:15:30.376000",
  "updated_at": null
}

Python SDK

from lumigator_schemas.workflows import WorkflowCreateRequest

request = WorkflowCreateRequest(
    name="OpenAI 4o",
    description="Summarize with 4o.",
    model="gpt-4o",
    provider="openai",
    secret_key_name="openai_api_key",
    experiment_id=experiment_id
)
client.workflows.create_workflow(request).model_dump()

Get the results

Now that the workflow has been triggered we can get the experiment, which will give us all the details about the containing workflows. When the workflows are completed, this call will give you back all of the information about the evaluation, to let you compare results and review performance.

cURL

Set the following variables:

user@host:~/lumigator$ export EXPERIMENT_ID="$(curl -s http://localhost:8000/api/v1/experiments/ | jq -r '.items | .[0].id')"

Get the experiment!

user@host:~/lumigator$ curl -s http://localhost:8000/api/v1/experiments/$EXPERIMENT_ID | jq
{
  "id": "1",
  "name": "DialogSum Summarization",
  "description": "See which model best summarizes Dialogues.",
  "created_at": "2025-02-19T20:11:55.492000",
  "task_definition": {
    "task": "summarization"
  },
  "dataset": "9ac42307-e0e5-4635-a9ce-48acdb451742",
  "updated_at": "2025-02-19T20:11:55.492000",
  "workflows": [
    {
      "id": "ffa38f72fe7e4b06a60de5bf797c31d6",
      "experiment_id": "1",
      "model": "gpt-4o",
      "name": "OpenAI 4o",
      "description": "Summarize with 4o.",
      "status": "succeeded",
      "created_at": "2025-02-19T20:30:33.713000",
      "updated_at": null,
      "jobs": [...]
      "metrics": {
        "rouge1_mean": 0.224,
        "rouge2_mean": 0.106,
        "rougeL_mean": 0.195,
        "rougeLsum_mean": 0.195,
        "bertscore_f1_mean": 0.872,
        "bertscore_precision_mean": 0.866,
        "bertscore_recall_mean": 0.878,
        "meteor_mean": 0.276
      },
    }
  ]
}

Python SDK

experiment_details = client.experiments.get_experiment(experiment_id)
print(experiment_details.model_dump_json())

The metrics we use to evaluate are ROUGE, METEOR, and BERT score. They all measure similarity between predicted summaries and those provided with the ground truth, but each of them focuses on different aspects:

ROUGE - (Recall-Oriented Understudy for Gisting Evaluation), which compares an automatically-generated summary to one generated by a machine learning model on a score of 0 to 1 in a range of metrics comparing statistical similarity of two texts.
METEOR - Looks at the harmonic mean of precision and recall.
BERTScore - Generates embeddings of ground truth input and model output and compares their cosine similarity

Terminate Session

In order to shut down Lumigator, you can stop the containers that were started using Docker Compose. This can be done by simply running the following command:

user@host:~/lumigator$ make stop-lumigator

Next Steps

Congratulations! You have successfully uploaded a dataset, created an experiment, run a model evaluation in the experiment, and retrieved the results.

For info about developing lumigator, see the local development guide.

For information about deploying lumigator into a Kubernetes cluster, see kubernetes installation.