API Reference
speech_to_text_finetune.config
Config
Bases: BaseModel
Store configuration used for finetuning
Attributes:

Name | Type | Description |
---|---|---|
model_id | str | HF model id of a Whisper model used for finetuning |
dataset_id | str | HF dataset id of a Common Voice dataset version, ideally from the mozilla-foundation repo |
language | str | registered language string that is supported by the Common Voice dataset |
repo_name | str | used both for the local dir and HF; "default" will create a name based on the model and language id |
n_train_samples | int | explicitly set how many samples to train+validate on. If -1, use all train+val data available |
n_test_samples | int | explicitly set how many samples to evaluate on. If -1, use all eval data available |
training_hp | TrainingConfig | store selective hyperparameter values from Seq2SeqTrainingArguments |
Source code in src/speech_to_text_finetune/config.py
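A minimal sketch of constructing a Config in Python. The values are illustrative, and it assumes TrainingConfig can be instantiated with defaults:

```python
from speech_to_text_finetune.config import Config, TrainingConfig

# Illustrative values; field names follow the attribute table above.
config = Config(
    model_id="openai/whisper-small",
    dataset_id="mozilla-foundation/common_voice_17_0",
    language="Greek",
    repo_name="default",           # derives a name from the model and language id
    n_train_samples=-1,            # use all available train+val data
    n_test_samples=-1,             # use all available eval data
    training_hp=TrainingConfig(),  # assumes all hyperparameters have defaults
)
```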
TrainingConfig
Bases: BaseModel
More info at https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments
Source code in src/speech_to_text_finetune/config.py
speech_to_text_finetune.data_process
DataCollatorSpeechSeq2SeqWithPadding
dataclass
Data collator class, in the format expected by Seq2SeqTrainer, used for processing input data and labels in batches while finetuning.
Source code in src/speech_to_text_finetune/data_process.py
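For reference, collators of this shape typically follow the canonical pattern from the Hugging Face audio course. The sketch below shows that pattern; it may differ in details from this class's actual implementation:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import torch


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any  # e.g. a WhisperProcessor

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad the log-mel input features into a batch tensor.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the tokenized labels, masking padding with -100 so the loss ignores it.
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Drop the BOS token if it was prepended during tokenization; Whisper re-adds it.
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```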
load_and_proc_hf_fleurs(language_id, n_test_samples, processor, eval_batch_size)
Load only the test split of FLEURS for a specific language and process it for Whisper.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
language_id | str | a registered language identifier from FLEURS (see https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py) | required |
n_test_samples | int | number of samples to use from the test split | required |
processor | WhisperProcessor | Processor from Whisper to process the dataset | required |
eval_batch_size | int | batch size to use for processing the dataset | required |
Returns:

Name | Type | Description |
---|---|---|
DatasetDict | Dataset | the processed test split as an HF Dataset |
Source code in src/speech_to_text_finetune/data_process.py
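A usage sketch; the FLEURS language identifier and batch size are illustrative:

```python
from transformers import WhisperProcessor

from speech_to_text_finetune.data_process import load_and_proc_hf_fleurs

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Greek", task="transcribe"
)
test_set = load_and_proc_hf_fleurs(
    language_id="el_gr",  # check fleurs.py for the exact identifier string
    n_test_samples=100,
    processor=processor,
    eval_batch_size=8,
)
```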
load_dataset_from_dataset_id(dataset_id, language_id=None)
This function loads a dataset based on the dataset_id and, if it is a local path, the content of its directory. Possible cases:

1. The dataset_id is a path to a local Common Voice dataset directory.
2. The dataset_id is a path to a local custom dataset directory.
3. The dataset_id is a HuggingFace dataset ID.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset_id | str | Path to a processed dataset directory, a local dataset directory, or a HuggingFace dataset ID | required |
language_id | str | Language identifier for the dataset (e.g., 'en' for English). Only used for the HF dataset case | None |
Returns:

Name | Type | Description |
---|---|---|
DatasetDict | DatasetDict | A processed dataset ready for training with train/test splits |
str | str | Path where the processed dataset directory is saved |

Raises:

Type | Description |
---|---|
ValueError | If the dataset cannot be found locally or on HuggingFace |
Source code in src/speech_to_text_finetune/data_process.py
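A usage sketch for the HF dataset case; the dataset id and language are illustrative:

```python
from speech_to_text_finetune.data_process import load_dataset_from_dataset_id

dataset, save_path = load_dataset_from_dataset_id(
    dataset_id="mozilla-foundation/common_voice_17_0",
    language_id="en",
)
print(dataset)    # DatasetDict with train/test splits
print(save_path)  # where the processed version is saved
```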
process_dataset(dataset, processor, batch_size, proc_dataset_path)
Process a dataset into the format expected by a Whisper model, then save it locally for future use.
Source code in src/speech_to_text_finetune/data_process.py
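A sketch of a typical call, continuing from load_dataset_from_dataset_id above. The batch size is illustrative, and the assumption that the function returns the processed dataset is not confirmed by the docstring:

```python
from transformers import WhisperProcessor

from speech_to_text_finetune.data_process import (
    load_dataset_from_dataset_id,
    process_dataset,
)

dataset, save_path = load_dataset_from_dataset_id(
    "mozilla-foundation/common_voice_17_0", "en"
)
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
processed = process_dataset(
    dataset=dataset,
    processor=processor,
    batch_size=8,                 # illustrative
    proc_dataset_path=save_path,  # save location returned above
)
```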
try_find_processed_version(dataset_id, language_id=None)
Try to load a processed version of the dataset if it exists locally. Checks whether:

1. The dataset_id is a local path to an already processed dataset directory, or
2. The dataset_id is a path to a local dataset, but a processed version already exists locally, or
3. The dataset_id is a HuggingFace dataset ID, but a processed version already exists locally.
Source code in src/speech_to_text_finetune/data_process.py
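A sketch of the intended lookup flow, assuming the function returns the processed dataset when one is found and None otherwise:

```python
from speech_to_text_finetune.data_process import (
    load_dataset_from_dataset_id,
    try_find_processed_version,
)

dataset = try_find_processed_version(
    dataset_id="mozilla-foundation/common_voice_17_0", language_id="en"
)
if dataset is None:
    # No processed version cached locally; load and process from scratch.
    dataset, save_path = load_dataset_from_dataset_id(
        dataset_id="mozilla-foundation/common_voice_17_0", language_id="en"
    )
```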
speech_to_text_finetune.finetune_whisper
run_finetuning(config_path='config.yaml')
Complete pipeline for preprocessing the Common Voice dataset and then finetuning a Whisper model on it.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
config_path | str | yaml filepath that follows the format defined in config.py | 'config.yaml' |
Returns:

Type | Description |
---|---|
Tuple[Dict, Dict] | evaluation metrics from the baseline and the finetuned models |
Source code in src/speech_to_text_finetune/finetune_whisper.py
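End-to-end usage is a single call; the config path follows the Config schema above:

```python
from speech_to_text_finetune.finetune_whisper import run_finetuning

baseline_metrics, finetuned_metrics = run_finetuning(config_path="config.yaml")
print("baseline:", baseline_metrics)
print("finetuned:", finetuned_metrics)
```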
speech_to_text_finetune.utils
compute_wer_cer_metrics(pred, processor, wer, cer, normalizer)
Word Error Rate (wer) is a metric that measures the ratio of errors an ASR model makes in a transcript to the total number of words spoken. Lower is better. Character Error Rate (cer) is similar to wer, but operates on characters instead of words. This metric is better suited for languages with no concept of a "word", like Chinese or Japanese. Lower is better.
More info: https://huggingface.co/learn/audio-course/en/chapter5/fine-tuning#evaluation-metrics
Note 1: WER/CER can be larger than 1.0 if the number of insertions I is larger than the number of correct words C. Note 2: WER/CER doesn't tell the whole story and is not fully representative of the quality of the ASR model.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
pred | EvalPrediction | Transformers object that holds predicted tokens and ground truth labels | required |
processor | WhisperProcessor | Whisper processor used to decode tokens to strings | required |
wer | EvaluationModule | module that calls the computing function for WER | required |
cer | EvaluationModule | module that calls the computing function for CER | required |
normalizer | BasicTextNormalizer | Normalizer from Whisper | required |
Returns:

Name | Type | Description |
---|---|---|
wer | Dict | computed WER metric |
Source code in src/speech_to_text_finetune/utils.py
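As a worked example of the notes above: WER = (S + D + I) / N, where S, D, I count substitutions, deletions, and insertions, and N is the number of reference words. A minimal sketch with the evaluate library, from which EvaluationModule arguments like wer and cer typically come:

```python
import evaluate

wer = evaluate.load("wer")

predictions = ["the cat sat on on the mat"]  # one inserted word
references = ["the cat sat on the mat"]      # 6 reference words

# (0 substitutions + 0 deletions + 1 insertion) / 6 words ≈ 0.167
print(wer.compute(predictions=predictions, references=references))
```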
upload_custom_hf_model_card(hf_repo_name, model_id, dataset_id, language_id, language, n_train_samples, n_eval_samples, baseline_eval_results, ft_eval_results)
Create and upload a custom Model Card (https://huggingface.co/docs/hub/model-cards) to the Hugging Face repo of the finetuned model that highlights the evaluation results before and after finetuning.
Source code in src/speech_to_text_finetune/utils.py
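A call sketch; all values are illustrative, and the shape of the eval-results dicts is an assumption based on the WER metrics above:

```python
from speech_to_text_finetune.utils import upload_custom_hf_model_card

upload_custom_hf_model_card(
    hf_repo_name="my-user/whisper-small-el",  # illustrative repo
    model_id="openai/whisper-small",
    dataset_id="mozilla-foundation/common_voice_17_0",
    language_id="el",
    language="Greek",
    n_train_samples=1000,
    n_eval_samples=200,
    baseline_eval_results={"wer": 55.0},  # assumed dict shape
    ft_eval_results={"wer": 30.0},        # assumed dict shape
)
```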
speech_to_text_finetune.make_custom_dataset_app
save_text_audio_to_file(audio_input, sentence, dataset_dir, is_train_sample)
Save the audio recording to a .wav file, using the index of the text sentence in the filename, and save the associated text sentence to a .csv file under the same index.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
audio_input | Audio | Gradio audio object to be converted to audio data and then saved to a .wav file | required |
sentence | str | The text sentence that will be associated with the audio | required |
dataset_dir | str | The dataset directory path to store the indexed sentences and the associated audio files | required |
is_train_sample | bool | Whether to save the text-recording pair to the train or test dataset | required |
Returns:

Name | Type | Description |
---|---|---|
str | str | Status text for Gradio app |
None | None | Returning None here will reset the audio module to record again from scratch |
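A sketch of wiring this callback into a Gradio Blocks app; the layout and dataset_dir are illustrative, not the app's actual UI:

```python
import gradio as gr

from speech_to_text_finetune.make_custom_dataset_app import save_text_audio_to_file

with gr.Blocks() as app:
    sentence = gr.Textbox(label="Sentence to read")
    audio = gr.Audio(label="Recording")
    status = gr.Textbox(label="Status")
    save = gr.Button("Save")
    # The second return value (None) resets the audio component for the next take.
    save.click(
        lambda a, s: save_text_audio_to_file(a, s, "my_dataset", True),
        inputs=[audio, sentence],
        outputs=[status, audio],
    )

app.launch()
```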