Running LLM-as-judge evaluation
Lumigator relies on DeepEval's G-Eval implementation to run LLM-as-judge evaluations of models used for summarization and translation tasks. From the user's point of view, these look just like any other evaluation metric that can be specified via an API call.
Note
At present, LLM-as-judge metrics are only available via the API, but we are planning to make them available in the UI soon. Once they are, you'll be able to add them to your evaluations just like any other metric.
By default, DeepEval uses gpt-4o to power all of its evaluation metrics. It is, however, possible to use self-hosted models as an alternative, and we'll show you how in this guide.
Available metrics
The LLM-as-judge evaluation metrics implemented in Lumigator are inspired by the paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (reference code here). For each task (e.g. summarization, translation), Lumigator provides a different metric that evaluates outputs across several dimensions. In particular:
g_eval_summarization: this metric uses an original sample and a reference summary to evaluate a newly generated summary across the following dimensions: coherence, consistency, fluency, relevance (see the reference g-eval prompts here for comparison).
g_eval_translation: this metric uses an original sample and a reference translation to evaluate a newly generated translation across the following dimensions: consistency, fluency.
g_eval_translation_noref: this metric runs an evaluation across the same dimensions as the previous one, but explicitly ignores the reference translation (helpful in case you don't have ground truth available).
When you choose any of the above metrics, Lumigator will prompt an external LLM (either OpenAI's or self-hosted) to evaluate your model's predictions across all the dimensions that are predefined for the specified task. These dimensions, as well as the prompts that are used for them, are specified in this JSON file.
Running LLM-as-judge with gpt-4o
Assuming you have already run inference on a given dataset, first retrieve the dataset's ID: you can do that by clicking on the dataset row in the Web UI and then choosing Copy ID in the right pane, or via the API as shown in the sketch below.
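The following is a minimal sketch of looking up dataset IDs via the API; it assumes your Lumigator backend runs on localhost:8000 and exposes a dataset listing endpoint at /api/v1/datasets/ (check your deployment's API reference for the exact path):
# List datasets and their IDs (endpoint path assumed; see the API reference)
user@host:~/lumigator$ curl -s 'http://localhost:8000/api/v1/datasets/' \
  -H 'accept: application/json'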
With the dataset ID at hand, running an LLM-as-judge evaluation boils down to specifying the corresponding g_eval metric for your task (or adding it to any other metrics you'd like to calculate). For instance, for translation:
user@host:~/lumigator$ curl -X 'POST' \
  'http://localhost:8000/api/v1/jobs/evaluator/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llm-as-judge translation eval",
    "description": "",
    "dataset": "<paste your dataset id here>",
    "max_samples": -1,
    "job_config": {
      "secret_key_name": "openai_api_key",
      "job_type": "evaluator",
      "metrics": [
        "bleu",
        "g_eval_translation"
      ]
    }
  }'
Note
If you use OpenAI models, you will first have to configure your API keys in the Lumigator UI (under Settings) and specify openai_api_key as your job's secret_key_name parameter.
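Submitting the job returns a job ID in the response. To follow the run and inspect the scores once it completes, you can poll the job endpoint. This is a minimal sketch assuming the backend exposes job details at /api/v1/jobs/<job_id>; the exact paths for job status and result download may differ, so check your deployment's API reference:
# Poll the evaluation job (endpoint path assumed; see the API reference)
user@host:~/lumigator$ curl -s 'http://localhost:8000/api/v1/jobs/<paste your job id here>' \
  -H 'accept: application/json'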
Running LLM-as-judge with Ollama
You can run LLM-as-judge with Ollama by simply specifying the metric you want to calculate, the Ollama server URL, and the name of the model you want to use. For instance, for summarization:
user@host:~/lumigator$ curl -X 'POST' \
  'http://localhost:8000/api/v1/jobs/evaluator/' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llm-as-judge summarization eval",
    "description": "",
    "dataset": "<paste your dataset id here>",
    "max_samples": -1,
    "job_config": {
      "job_type": "evaluator",
      "metrics": [
        "bertscore",
        "g_eval_summarization"
      ],
      "llm_as_judge": {
        "model_name": "gemma3:27b",
        "model_base_url": "http://localhost:11434"
      }
    }
  }'
Note
If you want to use a local model for LLM-as-judge, first make sure it is capable of performing the task appropriately (in particular, the model needs multilingual capabilities to properly evaluate translations). For instance, the quantized version of gemma3:27b running on Ollama can be a good starting point for both summarization and translation.
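If you go the Ollama route, you can pull the judge model ahead of time and check that the server is reachable before submitting the evaluation job. A minimal sketch, assuming Ollama is installed locally and listening on its default port (11434):
# Pull the judge model, then list locally available models to confirm the server responds
user@host:~/lumigator$ ollama pull gemma3:27b
user@host:~/lumigator$ curl -s 'http://localhost:11434/api/tags'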