Datasets

A dataset is a collection of data points (or samples) used for training, testing, or evaluating machine learning models.

An evaluation dataset is a dataset used specifically to assess a model’s performance on a particular task.

Why bother creating your own?

Unlike generic, open-source benchmarks, a curated dataset can reflect the nuances, vocabulary, and challenges unique to your use case. Intuitively, summarizing patient information requires a different vocabulary than summarizing a film. Even within the same domain, some models may be good enough to handle short technical messages, yet struggle with long technical papers.

Would a smaller, more specialized model work better for you than a generic LLM? Would you really lose performance if you switched to it? Those are questions Lumigator can help you answer for your use case. The emphasis here is on “for your use case”; for an LLM, data is what best defines a use case.

Content

At its simplest, an evaluation dataset for a specific task should contain the following key components:

  • Input text or examples: These are the samples that the model will process; e.g. each individual text from your domain that a model should summarize.

  • Ground truth: Answers that you would deem correct for each input. Ideally, these are produced by humans who are experts in the topic.

You can also add metadata in additional columns that detail the source, date of collection, or level of difficulty of each sample. However, Lumigator does not use this additional metadata.
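
As a concrete illustration, a minimal evaluation dataset can be as simple as a CSV file with one column for the input texts, one for the ground truth, and any metadata columns you want to keep for your own records. The sketch below is illustrative only; the column names (examples, ground_truth, source) are assumptions, so check Preparing your own evaluation dataset for the exact format Lumigator expects.

    # Illustrative sketch: build a small evaluation dataset as a CSV file.
    # Column names ("examples", "ground_truth", "source") are assumptions;
    # see "Preparing your own evaluation dataset" for the format Lumigator expects.
    import pandas as pd

    rows = [
        {
            "examples": "Patient presented with a mild fever and a persistent dry cough...",
            "ground_truth": "Patient has a mild fever and a dry cough; no further tests ordered.",
            "source": "clinic-notes",  # optional metadata, not used by Lumigator
        },
        {
            "examples": "The support ticket describes a login failure after the latest update...",
            "ground_truth": "User cannot log in since the latest update.",
            "source": "support-tickets",
        },
    ]

    pd.DataFrame(rows).to_csv("evaluation_dataset.csv", index=False)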

What is a good dataset?

A good way to think of this is through the analogy of designing a good exam for a class. Will it cover the syllabus completely? Is it long enough to avoid the odd (un)lucky answer having too much weight? For each question you include in your exam, do you know what the ideal answer would be, so that you can compare students’ answers against it?

What to consider while you are curating?

This is a very broad topic, and the source of many academic and industry papers, articles, posts, and tools. As a set of rough guidelines while you build your first evaluation dataset for Lumigator, consider the following:

  • Is your data representative of your domain in production?

    • Are typical text lengths covered? Consider plotting the length distribution of the texts you collected (see the sketch after this list).

    • Is there good coverage of the vocabulary in your data? (E.g. do some of your users employ very distinct abbreviations that the rest do not use?)

    • Is there any seasonality in your data? For example, in the patient information example above: could the group of people who use a specific vocabulary only come in during summer internships? Did you collect examples from those dates?

    • Are all your users represented in your data? Beware of choosing samples from only one or two of your users: you could find that the model you select fails miserably when, in production, it needs to work for everyone.

  • Are there enough samples? (There is no hard number that is guaranteed to be enough, and it also depends on the quality of your samples, but you will frequently need more than you think.)

  • Are you selecting examples of all difficulties? Be careful not to choose only examples that are easy, or only examples that are hard, to summarize; you want examples of all types.

  • Do you have quality ground truth? Did human experts create it? Lumigator can help you create synthetic ground truth, but bear in mind that this is not ideal.

  • Could you be cheating? If your data was involved in the training or fine-tuning of any of the models you are evaluating, make sure only new, unseen data is present in the evaluation dataset. Following the exam analogy, letting samples be shared between training and evaluation is like slipping a student a list of right answers before the exam (learn more about data leakage here). A rough way to check for verbatim overlap is sketched below.
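
As mentioned in the first checklist item, a quick sanity check is to look at the length distribution of the texts you collected. The sketch below assumes the CSV from the earlier example, with an examples column; adapt the file and column names to your own data.

    # Minimal sketch: inspect the length distribution of the collected texts.
    # Assumes the CSV from the earlier example, with an "examples" column.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("evaluation_dataset.csv")
    lengths = df["examples"].str.split().str.len()  # rough length in words

    print(lengths.describe())  # min/median/max give a quick feel for coverage

    plt.hist(lengths, bins=30)
    plt.xlabel("Text length (words)")
    plt.ylabel("Number of samples")
    plt.title("Length distribution of evaluation samples")
    plt.show()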
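
Similarly, if you have access to the data a model was trained or fine-tuned on, a rough leakage check is to look for evaluation samples that also appear in that data. The sketch below only catches exact duplicates after light normalization, and the file and column names are assumptions.

    # Rough leakage check: flag evaluation samples that also appear in the
    # training data. Only exact duplicates (after light normalization) are
    # caught; near-duplicates require fuzzier matching.
    import pandas as pd

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    eval_df = pd.read_csv("evaluation_dataset.csv")   # column names are assumptions
    train_df = pd.read_csv("training_dataset.csv")    # hypothetical training data file

    train_texts = {normalize(t) for t in train_df["examples"]}
    leaked = eval_df[eval_df["examples"].map(normalize).isin(train_texts)]

    print(f"{len(leaked)} of {len(eval_df)} evaluation samples also appear in the training data")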

I’m convinced, how do I prepare one?

Follow the guide on Preparing your own evaluation dataset.