# API Reference

## speech_to_text_finetune.config

### Config

Bases: `BaseModel`

Store configuration used for finetuning.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_id` | `str` | HF model id of a Whisper model used for finetuning | *required* |
| `dataset_id` | `str` | HF dataset id of a Common Voice dataset version, ideally from the mozilla-foundation repo | *required* |
| `dataset_source` | `str` | Either "HF" or "local", determining where to fetch the dataset from | *required* |
| `language` | `str` | Registered language string that is supported by the Common Voice dataset | *required* |
| `repo_name` | `str` | Used both for the local directory and the HF repo; "default" will create a name based on the model and language id | *required* |
| `training_hp` | `TrainingConfig` | Stores selected hyperparameter values from Seq2SeqTrainingArguments | *required* |
Source code in src/speech_to_text_finetune/config.py
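A minimal sketch of building a `Config` in Python. The field values are illustrative, and the no-argument `TrainingConfig()` assumes its fields have defaults; otherwise pass the Seq2SeqTrainingArguments-style fields explicitly.

```python
from speech_to_text_finetune.config import Config, TrainingConfig

# Illustrative values; adjust to your model, dataset, and language.
config = Config(
    model_id="openai/whisper-small",
    dataset_id="mozilla-foundation/common_voice_17_0",
    dataset_source="HF",
    language="Hindi",
    repo_name="default",  # derives a name from the model and language id
    training_hp=TrainingConfig(),  # assumes default hyperparameter values
)
```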
### TrainingConfig

Bases: `BaseModel`

More info at https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments
Source code in src/speech_to_text_finetune/config.py
## speech_to_text_finetune.data_process

### DataCollatorSpeechSeq2SeqWithPadding

`dataclass`

Data collator class in the format expected by Seq2SeqTrainer, used for processing input data and labels in batches while finetuning.
Source code in src/speech_to_text_finetune/data_process.py
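A hedged usage sketch. The constructor argument here is an assumption (collators of this kind typically wrap the Whisper processor); check the dataclass definition in data_process.py for the exact field names.

```python
from transformers import WhisperProcessor

from speech_to_text_finetune.data_process import DataCollatorSpeechSeq2SeqWithPadding

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# `processor=` is an assumed field name for illustration only.
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
# Pass it to Seq2SeqTrainer via the `data_collator=` argument.
```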
### _process_inputs_and_labels_for_whisper(batch, feature_extractor, tokenizer)

Use Whisper's feature extractor to transform the input audio arrays into log-Mel spectrograms, and the tokenizer to transform the text labels into tokens. This function is expected to be called via the .map method so that the data is processed batch by batch.
Source code in src/speech_to_text_finetune/data_process.py
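A hedged sketch of calling it through `Dataset.map`, with `functools.partial` binding the static arguments so `.map` only supplies the batch. The model id and language are illustrative; `dataset` is assumed to come from `load_common_voice` or `load_local_dataset` below.

```python
from functools import partial

from transformers import WhisperFeatureExtractor, WhisperTokenizer

from speech_to_text_finetune.data_process import _process_inputs_and_labels_for_whisper

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-small", language="Hindi", task="transcribe"
)

# `dataset` is a DatasetDict loaded elsewhere (see the loaders below).
processed = dataset.map(
    partial(
        _process_inputs_and_labels_for_whisper,
        feature_extractor=feature_extractor,
        tokenizer=tokenizer,
    ),
)
```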
### load_common_voice(dataset_id, language_id)

Load the default train+validation split used for finetuning and a test split used for evaluation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_id` | `str` | Official Common Voice dataset id from the mozilla-foundation organisation on Hugging Face | *required* |
| `language_id` | `str` | A registered language identifier from Common Voice (most often in ISO-639 format) | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `DatasetDict` | `DatasetDict` | HF Dataset dictionary that consists of two distinct Datasets |
Source code in src/speech_to_text_finetune/data_process.py
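A hedged usage sketch; the dataset id and language code are illustrative. Note that the mozilla-foundation Common Voice repos are gated on Hugging Face, so you need to accept their terms and be authenticated.

```python
from speech_to_text_finetune.data_process import load_common_voice

dataset = load_common_voice(
    dataset_id="mozilla-foundation/common_voice_17_0",  # illustrative version
    language_id="hi",  # Hindi, ISO-639 code
)
# Split names are assumed to be "train" and "test".
print(dataset)
```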
### load_local_dataset(dataset_dir, train_split=0.8)

Load sentences and the accompanying recorded audio files into a pandas DataFrame, then split them into train/test and finally load them into two distinct train and test Datasets.

Sentences and audio files should be indexed like this (illustrative layout inferred from the text.csv + .wav expectation below; exact filenames may differ):
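```
local_data/
├── text.csv      # one indexed sentence per row
├── rec_0.wav     # recording for sentence 0 (filename pattern assumed)
├── rec_1.wav     # recording for sentence 1
└── ...
```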
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_dir` | `str` | Path to the local dataset; a text.csv and .wav files are expected under the directory | *required* |
| `train_split` | `float` | Percentage split of the dataset into the train+validation set versus the test set | `0.8` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `DatasetDict` | `DatasetDict` | HF Dataset dictionary in the same exact format as the Common Voice dataset from load_common_voice |
Source code in src/speech_to_text_finetune/data_process.py
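A hedged usage sketch; the directory path is illustrative.

```python
from speech_to_text_finetune.data_process import load_local_dataset

# 80% train+validation / 20% test, per the default split above.
dataset = load_local_dataset(dataset_dir="local_data", train_split=0.8)
```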
### process_dataset(dataset, feature_extractor, tokenizer)

Process a dataset into the format expected by a Whisper model.
Source code in src/speech_to_text_finetune/data_process.py
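A hedged sketch chaining the pieces above, reusing the `dataset`, `feature_extractor`, and `tokenizer` from the earlier sketches.

```python
from speech_to_text_finetune.data_process import process_dataset

# `dataset`, `feature_extractor`, and `tokenizer` as created above.
processed = process_dataset(
    dataset=dataset,
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
)
```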
## speech_to_text_finetune.finetune_whisper

### compute_word_error_rate(pred, tokenizer, metric)

Word Error Rate (WER) is a metric that measures the ratio of errors the ASR model makes in a generated transcript to the total number of words spoken. Lower is better. To identify an "error", we measure the difference between the ASR-generated transcript and the ground truth transcript using the following counts:

- S is the number of substitutions (words the ASR swapped for different words from the ground truth)
- D is the number of deletions (words the ASR skipped / didn't generate compared to the ground truth)
- I is the number of insertions (additional words the ASR generated that are not found in the ground truth)
- C is the number of correct words (words that are identical between the ASR and ground truth transcripts)

Then:

WER = (S + D + I) / (S + D + C)

Note 1: WER can be larger than 1.0 if the number of insertions I is larger than the number of correct words C.
Note 2: WER doesn't tell the whole story and is not fully representative of the quality of the ASR model.
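A worked example using the Hugging Face evaluate library; the example strings are illustrative.

```python
import evaluate

wer_metric = evaluate.load("wer")

# Reference has 6 words. Against it, the prediction below has
# S=1 ("sat" -> "sits"), D=1 ("on" dropped), I=1 ("big" added), C=4.
score = wer_metric.compute(
    references=["the cat sat on the mat"],
    predictions=["the cat sits the big mat"],
)
print(score)  # (1 + 1 + 1) / (1 + 1 + 4) = 0.5
```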
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `pred` | `EvalPrediction` | Transformers object that holds predicted tokens and ground truth labels | *required* |
| `tokenizer` | `WhisperTokenizer` | Whisper tokenizer used to decode tokens to strings | *required* |
| `metric` | `EvaluationModule` | Module that calls the computing function for WER | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `wer` | `Dict` | Computed WER metric |
Source code in src/speech_to_text_finetune/finetune_whisper.py
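A hedged sketch of wiring this function into Seq2SeqTrainer, which calls `compute_metrics` with a single `EvalPrediction`; `functools.partial` binds the remaining arguments ahead of time. The model id is illustrative, and this wiring is an assumption based on the signature above.

```python
from functools import partial

import evaluate
from transformers import WhisperTokenizer

from speech_to_text_finetune.finetune_whisper import compute_word_error_rate

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
wer_metric = evaluate.load("wer")

# Bind tokenizer and metric so the trainer only supplies `pred`.
compute_metrics = partial(
    compute_word_error_rate,
    tokenizer=tokenizer,
    metric=wer_metric,
)
# Pass `compute_metrics=compute_metrics` when constructing Seq2SeqTrainer.
```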
### run_finetuning(config_path='config.yaml')
Complete pipeline for preprocessing the Common Voice dataset and then finetuning a Whisper model on it.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `config_path` | `str` | Path to a yaml file that follows the format defined in config.py | `'config.yaml'` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[Dict, Dict]` | Evaluation metrics from the baseline and the finetuned models |
Source code in src/speech_to_text_finetune/finetune_whisper.py
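A hedged usage sketch, assuming a config.yaml in the working directory that follows the format defined in config.py.

```python
from speech_to_text_finetune.finetune_whisper import run_finetuning

# Evaluation metrics before and after finetuning, per the return type above.
baseline_results, finetuned_results = run_finetuning(config_path="config.yaml")
```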
## speech_to_text_finetune.hf_utils

### get_available_languages_in_cv(dataset_id)

Checks if a dictionary with the languages already exists as a .json file and loads it. If not:

1. Downloads a languages.py file from a Common Voice dataset repo, which stores all languages available.
2. Dynamically imports the file as a module and returns the dictionary defined inside.

The returned dictionary is a mapping between the available languages and their Common Voice identifiers.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_id` | `str` | Needs to be a specific Common Voice dataset id, e.g. mozilla-foundation/common_voice_17_0 | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Dict` | `Dict` | A language mapping dictionary |
Source code in src/speech_to_text_finetune/hf_utils.py
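A hedged usage sketch, using the example dataset id from the table above.

```python
from speech_to_text_finetune.hf_utils import get_available_languages_in_cv

languages = get_available_languages_in_cv("mozilla-foundation/common_voice_17_0")
print(len(languages))  # number of languages available in this Common Voice version
```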
### upload_custom_hf_model_card(hf_repo_name, model_id, dataset_id, language_id, language, n_train_samples, n_eval_samples, baseline_eval_results, ft_eval_results)

Create and upload a custom Model Card (https://huggingface.co/docs/hub/model-cards) to the Hugging Face repo of the finetuned model, highlighting the evaluation results before and after finetuning.
Source code in src/speech_to_text_finetune/hf_utils.py
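A hedged call sketch. The repo name, sample counts, and metric values are illustrative, and the shape of the eval-results dictionaries (a `wer`-style metric dict) is an assumption.

```python
from speech_to_text_finetune.hf_utils import upload_custom_hf_model_card

upload_custom_hf_model_card(
    hf_repo_name="my-user/whisper-small-hi",  # illustrative repo name
    model_id="openai/whisper-small",
    dataset_id="mozilla-foundation/common_voice_17_0",
    language_id="hi",
    language="Hindi",
    n_train_samples=6000,   # illustrative counts
    n_eval_samples=3000,
    baseline_eval_results={"eval_wer": 55.2},  # assumed metric-dict shape
    ft_eval_results={"eval_wer": 32.1},
)
```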
## speech_to_text_finetune.make_local_dataset_app

### save_text_audio_to_file(audio_input, sentence, dataset_dir)

Save the audio recording to a .wav file whose filename uses the index of the text sentence, and save the associated text sentence to a .csv file under the same index.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `audio_input` | `Audio` | Gradio audio object to be converted to audio data and then saved to a .wav file | *required* |
| `sentence` | `str` | The text sentence that will be associated with the audio | *required* |
| `dataset_dir` | `str` | The dataset directory path to store the indexed sentences and the associated audio files | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | Status text for the Gradio app |
| `None` | `None` | Returning None here will reset the audio module to record again from scratch |
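A hypothetical Gradio wiring sketch. The component names, layout, and dataset directory are illustrative, and the real app in make_local_dataset_app.py may differ.

```python
import gradio as gr

from speech_to_text_finetune.make_local_dataset_app import save_text_audio_to_file

with gr.Blocks() as app:
    sentence = gr.Textbox(label="Sentence to read aloud")
    audio = gr.Audio(sources=["microphone"], label="Recording")  # Gradio 4-style
    status = gr.Textbox(label="Status")
    save_btn = gr.Button("Save")

    # The function returns (status_text, None); mapping the second output
    # back onto the Audio component resets it for the next recording.
    save_btn.click(
        lambda audio_in, text: save_text_audio_to_file(audio_in, text, "local_data"),
        inputs=[audio, sentence],
        outputs=[status, audio],
    )

app.launch()
```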