Evaluation

lm-evaluation-harness

EleutherAI’s lm-evaluation-harness package is used internally to access a variety of benchmark datasets. The model to evaluate can be loaded directly from the HuggingFace Hub, from a local model checkpoint saved on the filesystem, or from a Weights & Biases artifact, depending on the path parameter specified in the evaluation config.

In the evaluation directory, there are sample config files for evaluating a model loaded from the HuggingFace Hub (lm_harness_hf_config.yaml) or a model served by a local vLLM inference server (lm_harness_inference_server_config.yaml).
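
For example, assuming your lm-buddy version exposes the lm-harness evaluation task under that subcommand name (the exact name may differ; the Prometheus example at the end of this section follows the same pattern), a run against the HuggingFace sample config would look like:

lm_buddy evaluate lm-harness --config /path/to/lm_harness_hf_config.yaml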

Prometheus

Evaluation relies on Prometheus as an LLM judge. We serve it internally via vLLM, but any other OpenAI API-compatible service should work (e.g. llamafile via its api_like_OAI.py script).
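
As a quick sanity check that the judge endpoint is reachable, you can query it with any OpenAI-compatible client. The snippet below is a minimal sketch that assumes Prometheus is served locally at http://localhost:8000/v1; adjust base_url (and api_key, if your server requires one) to match your deployment:

from openai import OpenAI

# Assumed local endpoint; point base_url at wherever Prometheus is served
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models exposed by the server to confirm Prometheus is available
for model in client.models.list():
    print(model.id)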

Input datasets must be saved as a HuggingFace datasets.Dataset. The code below shows how to convert the Prometheus benchmark datasets to this format and optionally save them as wandb artifacts:

import wandb
from datasets import load_dataset
from lm_buddy.tracking.artifact_utils import (
    ArtifactType,
    build_directory_artifact,
)
from lm_buddy.jobs.common import JobType

artifact_name = "tutorial_vicuna_eval"
dataset_fname = "/path/to/prometheus/evaluation/benchmark/data/vicuna_eval.json"
output_path = "/tmp/tutorial_vicuna_eval"

# Load the JSON benchmark file and save it to disk in HuggingFace datasets format
ds = load_dataset("json", data_files=dataset_fname, split="train")
ds.save_to_disk(output_path)

# Log the saved dataset directory as a wandb artifact
with wandb.init(
    job_type=JobType.PREPROCESSING,
    project="wandb-project-name",
    entity="wandb-entity-name",
    name=artifact_name,
):
    artifact = build_directory_artifact(
        dir_path=output_path,
        artifact_name=artifact_name,
        artifact_type=ArtifactType.DATASET,
        reference=False,
    )
    wandb.log_artifact(artifact)
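
To verify the conversion, you can reload the saved dataset from disk with the standard datasets API (using the output_path defined above):

from datasets import load_from_disk

# Reload the converted dataset and inspect its size and first record
ds = load_from_disk("/tmp/tutorial_vicuna_eval")
print(ds)
print(ds[0])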

In the evaluation directory, you will find a sample prometheus_config.yaml file for running Prometheus evaluation. Before using it, you will need to specify the path to the input dataset, the base_url where the Prometheus model is served, and the tracking options for saving the evaluation output to wandb.

You can then run the evaluation as:

lm_buddy evaluate prometheus --config /path/to/prometheus_config.yaml