API Reference

speech_to_text_finetune.config

Config

Bases: BaseModel

Store configuration used for finetuning

Parameters:

Name Type Description Default
model_id str

HF model id of a Whisper model used for finetuning

required
dataset_id str

HF dataset id of a Common Voice dataset version, ideally from the mozilla-foundation repo

required
dataset_source str

can be "HF" or "local", to determine from where to fetch the dataset

required
language str

registered language string that is supported by the Common Voice dataset

required
repo_name str

used both for the local dir and HF; "default" will create a name based on the model and language id

required
training_hp TrainingConfig

store selected hyperparameter values from Seq2SeqTrainingArguments

required
Source code in src/speech_to_text_finetune/config.py
class Config(BaseModel):
    """
    Store configuration used for finetuning

    Args:
        model_id (str): HF model id of a Whisper model used for finetuning
        dataset_id (str): HF dataset id of a Common Voice dataset version, ideally from the mozilla-foundation repo
        dataset_source (str): can be "HF" or "local", to determine from where to fetch the dataset
        language (str): registered language string that is supported by the Common Voice dataset
        repo_name (str): used both for the local dir and HF; "default" will create a name based on the model and language id
        training_hp (TrainingConfig): store selected hyperparameter values from Seq2SeqTrainingArguments
    """

    model_id: str
    dataset_id: str
    dataset_source: str
    language: str
    repo_name: str
    training_hp: TrainingConfig

TrainingConfig

Bases: BaseModel

More info at https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments

Source code in src/speech_to_text_finetune/config.py
class TrainingConfig(BaseModel):
    """
    More info at https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments
    """

    push_to_hub: bool
    hub_private_repo: bool
    max_steps: int
    per_device_train_batch_size: int
    gradient_accumulation_steps: int
    learning_rate: float
    warmup_steps: int
    gradient_checkpointing: bool
    fp16: bool
    eval_strategy: str
    per_device_eval_batch_size: int
    predict_with_generate: bool
    generation_max_length: int
    save_steps: int
    logging_steps: int
    load_best_model_at_end: bool
    save_total_limit: int
    metric_for_best_model: str
    greater_is_better: bool

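A minimal Python sketch of constructing these config objects directly; in practice run_finetuning loads them from a YAML file via load_config. All values below are illustrative only, not recommended settings.

from speech_to_text_finetune.config import Config, TrainingConfig

cfg = Config(
    model_id="openai/whisper-tiny",
    dataset_id="mozilla-foundation/common_voice_17_0",
    dataset_source="HF",
    language="Hindi",
    repo_name="default",
    training_hp=TrainingConfig(
        push_to_hub=False,
        hub_private_repo=True,
        max_steps=100,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=1,
        learning_rate=1e-5,
        warmup_steps=10,
        gradient_checkpointing=True,
        fp16=False,
        eval_strategy="steps",
        per_device_eval_batch_size=8,
        predict_with_generate=True,
        generation_max_length=225,
        save_steps=50,
        logging_steps=25,
        load_best_model_at_end=True,
        save_total_limit=1,
        metric_for_best_model="wer",
        greater_is_better=False,
    ),
)
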
speech_to_text_finetune.data_process

DataCollatorSpeechSeq2SeqWithPadding dataclass

Data Collator class in the format expected by Seq2SeqTrainer used for processing input data and labels in batches while finetuning. More info here:

Source code in src/speech_to_text_finetune/data_process.py
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """
    Data Collator class in the format expected by Seq2SeqTrainer used for processing
    input data and labels in batches while finetuning. More info here:
    """

    processor: WhisperProcessor
    decoder_start_token_id: int

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        input_features = [
            {"input_features": feature["input_features"]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(
            input_features, return_tensors="pt"
        )

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if labels already have a bos token, remove it since it's appended later
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

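A hedged usage sketch (the model id is illustrative): the collator is built from a WhisperProcessor and the model's decoder start token, and is then passed to Seq2SeqTrainer via the data_collator argument, exactly as run_finetuning does below.

from transformers import WhisperForConditionalGeneration, WhisperProcessor

from speech_to_text_finetune.data_process import DataCollatorSpeechSeq2SeqWithPadding

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-tiny", language="English", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
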
_process_inputs_and_labels_for_whisper(batch, feature_extractor, tokenizer)

Use Whisper's feature extractor to transform the input audio arrays into log-Mel spectrograms and the tokenizer to transform the text-label into tokens. This function is expected to be called using the .map method in order to process the data batch by batch.

Source code in src/speech_to_text_finetune/data_process.py
def _process_inputs_and_labels_for_whisper(
    batch: Dict, feature_extractor: WhisperFeatureExtractor, tokenizer: WhisperTokenizer
) -> Dict:
    """
    Use Whisper's feature extractor to transform the input audio arrays into log-Mel spectrograms
     and the tokenizer to transform the text-label into tokens. This function is expected to be called using
     the .map method in order to process the data batch by batch.
    """
    audio = batch["audio"]

    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

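A hedged, per-example sketch of what this helper produces; the model id and the fake audio below are illustrative only.

import numpy as np
from transformers import WhisperFeatureExtractor, WhisperTokenizer

from speech_to_text_finetune.data_process import _process_inputs_and_labels_for_whisper

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-tiny", language="English", task="transcribe"
)

# A fake example mimicking one dataset row: 1 second of silence at 16 kHz.
example = {
    "audio": {"array": np.zeros(16000, dtype=np.float32), "sampling_rate": 16000},
    "sentence": "hello world",
}
processed = _process_inputs_and_labels_for_whisper(example, feature_extractor, tokenizer)
print(np.asarray(processed["input_features"]).shape)  # log-Mel spectrogram, e.g. (80, 3000)
print(processed["labels"])  # sentence token ids wrapped in Whisper special tokens
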
load_common_voice(dataset_id, language_id)

Load the default train+validation split used for finetuning and a test split used for evaluation.

Parameters:

Name Type Description Default
dataset_id str

official Common Voice dataset id from the mozilla-foundation organisation on Hugging Face

required
language_id str

a registered language identifier from Common Voice (most often in ISO-639 format)

required

Returns:

Name Type Description
DatasetDict DatasetDict

HF Dataset dictionary that consists of two distinct Datasets

Source code in src/speech_to_text_finetune/data_process.py
def load_common_voice(dataset_id: str, language_id: str) -> DatasetDict:
    """
    Load the default train+validation split used for finetuning and a test split used for evaluation.
    Args:
        dataset_id: official Common Voice dataset id from the mozilla-foundation organisation on Hugging Face
        language_id: a registered language identifier from Common Voice (most often in ISO-639 format)

    Returns:
        DatasetDict: HF Dataset dictionary that consists of two distinct Datasets
    """
    common_voice = DatasetDict()

    common_voice["train"] = load_dataset(
        dataset_id, language_id, split="train+validation", trust_remote_code=True
    )
    common_voice["test"] = load_dataset(
        dataset_id, language_id, split="test", trust_remote_code=True
    )
    common_voice = common_voice.remove_columns(
        [
            "accent",
            "age",
            "client_id",
            "down_votes",
            "gender",
            "locale",
            "path",
            "segment",
            "up_votes",
        ]
    )

    return common_voice

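A hedged usage sketch; the dataset id and language id are placeholders, and the mozilla-foundation datasets may require accepting their terms on Hugging Face first.

from speech_to_text_finetune.data_process import load_common_voice

common_voice = load_common_voice("mozilla-foundation/common_voice_17_0", "hi")
print(common_voice)
# DatasetDict with a "train" split (train+validation) and a "test" split
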
load_local_dataset(dataset_dir, train_split=0.8)

Load sentences and accompanying recorded audio files into a pandas DataFrame, then split into train/test and finally load them into two distinct train and test Datasets.

Sentences and audio files should be indexed like this: <index>: <sentence> should be accompanied by rec_<index>.wav

Parameters:

Name Type Description Default
dataset_dir str

path to the local dataset, expecting a text.csv and .wav files under the directory

required
train_split float

percentage split of the dataset to train+validation and test set

0.8

Returns:

Name Type Description
DatasetDict DatasetDict

HF Dataset dictionary in the same exact format as the Common Voice dataset from load_common_voice

Source code in src/speech_to_text_finetune/data_process.py
def load_local_dataset(dataset_dir: str, train_split: float = 0.8) -> DatasetDict:
    """
    Load sentences and accompanied recorded audio files into a pandas DataFrame, then split into train/test and finally
    load it into two distinct train Dataset and test Dataset.

    Sentences and audio files should be indexed like this: <index>: <sentence> should be accompanied by rec_<index>.wav

    Args:
        dataset_dir (str): path to the local dataset, expecting a text.csv and .wav files under the directory
        train_split (float): percentage split of the dataset to train+validation and test set

    Returns:
        DatasetDict: HF Dataset dictionary in the same exact format as the Common Voice dataset from load_common_voice
    """
    text_file = dataset_dir + "/text.csv"

    dataframe = pd.read_csv(text_file)
    audio_files = sorted(
        [f"{dataset_dir}/{f}" for f in os.listdir(dataset_dir) if f.endswith(".wav")]
    )

    dataframe["audio"] = audio_files
    train_index = round(len(dataframe) * train_split)

    my_data = DatasetDict()
    my_data["train"] = Dataset.from_pandas(dataframe[:train_index])
    my_data["test"] = Dataset.from_pandas(dataframe[train_index:])

    return my_data

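A hedged usage sketch, assuming a directory in the layout produced by make_local_dataset_app (see further below); the directory name is illustrative.

from speech_to_text_finetune.data_process import load_local_dataset

# my_dataset/
#   text.csv   (columns: index, sentence)
#   rec_0.wav, rec_1.wav, ...
dataset = load_local_dataset("my_dataset", train_split=0.8)
print(dataset["train"].num_rows, dataset["test"].num_rows)
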
process_dataset(dataset, feature_extractor, tokenizer)

Process dataset to the expected format by a Whisper model.

Source code in src/speech_to_text_finetune/data_process.py
def process_dataset(
    dataset: DatasetDict,
    feature_extractor: WhisperFeatureExtractor,
    tokenizer: WhisperTokenizer,
) -> DatasetDict:
    """
    Process dataset to the expected format by a Whisper model.
    """
    # Create a new column that consists of the resampled audio samples in the right sample rate for whisper
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

    dataset = dataset.map(
        _process_inputs_and_labels_for_whisper,
        fn_kwargs={"feature_extractor": feature_extractor, "tokenizer": tokenizer},
        remove_columns=dataset.column_names["train"],
        num_proc=1,
    )
    return dataset

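A hedged end-to-end sketch combining the loaders above; the model and dataset ids are illustrative.

from transformers import WhisperFeatureExtractor, WhisperTokenizer

from speech_to_text_finetune.data_process import load_common_voice, process_dataset

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
tokenizer = WhisperTokenizer.from_pretrained(
    "openai/whisper-tiny", language="English", task="transcribe"
)

dataset = load_common_voice("mozilla-foundation/common_voice_17_0", "en")
dataset = process_dataset(dataset, feature_extractor, tokenizer)
# Every example now holds "input_features" (log-Mel spectrogram) and "labels" (token ids).
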
speech_to_text_finetune.finetune_whisper

compute_word_error_rate(pred, tokenizer, metric)

Word Error Rate (wer) is a metric that measures the ratio of errors the ASR model makes given a transcript to the total words spoken. Lower is better. To identify an "error" we measure the difference between the ASR generated transcript and the ground truth transcript using the following formula:

- S is the number of substitutions (number of words ASR swapped for different words from the ground truth)
- D is the number of deletions (number of words ASR skipped / didn't generate compared to the ground truth)
- I is the number of insertions (number of additional words ASR generated, not found in the ground truth)
- C is the number of correct words (number of words that are identical between ASR and ground truth scripts)

then: WER = (S+D+I) / (S+D+C)

Note 1: WER can be larger than 1.0, if the number of insertions I is larger than the number of correct words C. Note 2: WER doesn't tell the whole story and is not fully representative of the quality of the ASR model.

Parameters:

Name Type Description Default
pred EvalPrediction

Transformers object that holds predicted tokens and ground truth labels

required
tokenizer WhisperTokenizer

Whisper tokenizer used to decode tokens to strings

required
metric EvaluationModule

module that calls the computing function for WER

required

Returns:

Name Type Description
wer Dict

computed WER metric

Source code in src/speech_to_text_finetune/finetune_whisper.py
def compute_word_error_rate(
    pred: EvalPrediction, tokenizer: WhisperTokenizer, metric: EvaluationModule
) -> Dict:
    """
    Word Error Rate (wer) is a metric that measures the ratio of errors the ASR model makes given a transcript to the
    total words spoken. Lower is better.
    To identify an "error" we measure the difference between the ASR generated transcript and the
    ground truth transcript using the following formula:
    - S is the number of substitutions (number of words ASR swapped for different words from the ground truth)
    - D is the number of deletions (number of words ASR skipped / didn't generate compared to the ground truth)
    - I is the number of insertions (number of additional words ASR generated, not found in the ground truth)
    - C is the number of correct words (number of words that are identical between ASR and ground truth scripts)

    then: WER = (S+D+I) / (S+D+C)

    Note 1: WER can be larger than 1.0, if the number of insertions I is larger than the number of correct words C.
    Note 2: WER doesn't tell the whole story and is not fully representative of the quality of the ASR model.

    Args:
        pred (EvalPrediction): Transformers object that holds predicted tokens and ground truth labels
        tokenizer (WhisperTokenizer): Whisper tokenizer used to decode tokens to strings
        metric (EvaluationModule): module that calls the computing function for WER
    Returns:
        wer (Dict): computed WER metric
    """
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

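A hedged worked example of the formula, using the same evaluate metric ("wer") that run_finetuning loads; the sentences are made up.

import evaluate

metric = evaluate.load("wer")
reference = ["the cat sat on the mat"]
prediction = ["the cat sit on mat"]  # 1 substitution (sat -> sit), 1 deletion ("the")
print(metric.compute(predictions=prediction, references=reference))  # (1+1+0)/(1+1+4) = 2/6 ≈ 0.333
# compute_word_error_rate reports this value multiplied by 100, i.e. ~33.3
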
run_finetuning(config_path='config.yaml')

Complete pipeline for preprocessing the Common Voice dataset and then finetuning a Whisper model on it.

Parameters:

Name Type Description Default
config_path str

yaml filepath that follows the format defined in config.py

'config.yaml'

Returns:

Type Description
Tuple[Dict, Dict]

Tuple[Dict, Dict]: evaluation metrics from the baseline and the finetuned models

Source code in src/speech_to_text_finetune/finetune_whisper.py
def run_finetuning(
    config_path: str = "config.yaml",
) -> Tuple[Dict, Dict]:
    """
    Complete pipeline for preprocessing the Common Voice dataset and then finetuning a Whisper model on it.

    Args:
        config_path (str): yaml filepath that follows the format defined in config.py

    Returns:
        Tuple[Dict, Dict]: evaluation metrics from the baseline and the finetuned models
    """
    cfg = load_config(config_path)

    language_id = LANGUAGES_NAME_TO_ID[cfg.language]

    if cfg.repo_name == "default":
        cfg.repo_name = f"{cfg.model_id.split('/')[1]}-{language_id}"
    local_output_dir = f"./artifacts/{cfg.repo_name}"

    logger.info(f"Finetuning starts soon, results saved locally at {local_output_dir}")
    hf_repo_name = ""
    if cfg.training_hp.push_to_hub:
        hf_username = get_hf_username()
        hf_repo_name = f"{hf_username}/{cfg.repo_name}"
        logger.info(
            f"Results will also be uploaded in HF at {hf_repo_name}. "
            f"Private repo is set to {cfg.training_hp.hub_private_repo}."
        )

    logger.info(f"Loading the {cfg.language} subset from the {cfg.dataset_id} dataset.")
    if cfg.dataset_source == "HF":
        dataset = load_common_voice(cfg.dataset_id, language_id)
    elif cfg.dataset_source == "local":
        dataset = load_local_dataset(cfg.dataset_id, train_split=0.8)
    else:
        raise ValueError(f"Unknown dataset source {cfg.dataset_source}")

    device = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"

    logger.info(
        f"Loading {cfg.model_id} on {device} and configuring it for {cfg.language}."
    )
    feature_extractor = WhisperFeatureExtractor.from_pretrained(cfg.model_id)
    tokenizer = WhisperTokenizer.from_pretrained(
        cfg.model_id, language=cfg.language, task="transcribe"
    )
    processor = WhisperProcessor.from_pretrained(
        cfg.model_id, language=cfg.language, task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained(cfg.model_id)

    model.generation_config.language = cfg.language.lower()
    model.generation_config.task = "transcribe"
    model.generation_config.forced_decoder_ids = None

    logger.info("Preparing dataset...")
    dataset = process_dataset(dataset, feature_extractor, tokenizer)

    data_collator = DataCollatorSpeechSeq2SeqWithPadding(
        processor=processor,
        decoder_start_token_id=model.config.decoder_start_token_id,
    )

    training_args = Seq2SeqTrainingArguments(
        output_dir=local_output_dir,
        hub_model_id=hf_repo_name,
        report_to=["tensorboard"],
        **cfg.training_hp.model_dump(),
    )

    metric = evaluate.load("wer")

    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        data_collator=data_collator,
        compute_metrics=partial(
            compute_word_error_rate, tokenizer=tokenizer, metric=metric
        ),
        processing_class=processor.feature_extractor,
    )

    feature_extractor.save_pretrained(training_args.output_dir)
    tokenizer.save_pretrained(training_args.output_dir)
    processor.save_pretrained(training_args.output_dir)

    logger.info(
        f"Before finetuning, run evaluation on the baseline model {cfg.model_id} to easily compare performance"
        f" before and after finetuning"
    )
    baseline_eval_results = trainer.evaluate()
    logger.info(f"Baseline evaluation complete. Results:\n\t {baseline_eval_results}")

    logger.info(
        f"Start finetuning job on {dataset['train'].num_rows} audio samples. Monitor training metrics in real time in "
        f"a local tensorboard server by running in a new terminal: tensorboard --logdir {training_args.output_dir}/runs"
    )
    trainer.train()
    logger.info("Finetuning job complete.")

    logger.info(f"Start evaluation on {dataset['test'].num_rows} audio samples.")
    eval_results = trainer.evaluate()
    logger.info(f"Evaluation complete. Results:\n\t {eval_results}")

    if cfg.training_hp.push_to_hub:
        logger.info(f"Uploading model and eval results to HuggingFace: {hf_repo_name}")
        trainer.push_to_hub()
        upload_custom_hf_model_card(
            hf_repo_name=hf_repo_name,
            model_id=cfg.model_id,
            dataset_id=cfg.dataset_id,
            language_id=language_id,
            language=cfg.language,
            n_train_samples=dataset["train"].num_rows,
            n_eval_samples=dataset["test"].num_rows,
            baseline_eval_results=baseline_eval_results,
            ft_eval_results=eval_results,
        )

    return baseline_eval_results, eval_results

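A hedged usage sketch: point the pipeline at a YAML file that follows the Config format above.

from speech_to_text_finetune.finetune_whisper import run_finetuning

baseline_results, finetuned_results = run_finetuning(config_path="config.yaml")
print(baseline_results["eval_wer"], finetuned_results["eval_wer"])
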
speech_to_text_finetune.hf_utils

get_available_languages_in_cv(dataset_id)

Checks if a dictionary with the languages already exists as a .json file and loads it. If not: downloads a languages.py file from a Common Voice dataset repo which stores all available languages. Then, dynamically imports the file as a module and returns the dictionary defined inside. Since the dictionary is in the format {<ISO-639-id>: <Full language name>}, e.g. {'ab': 'Abkhaz'}, we swap it to use the full language name as key and the ISO id as value instead. Then the dictionary is saved as json for easier loading next time, and languages.py is removed as it's no longer necessary.

Parameters:

Name Type Description Default
dataset_id str

It needs to be a specific Common Voice dataset id, e.g. mozilla-foundation/common_voice_17_0

required

Returns:

Name Type Description
Dict Dict

A language mapping dictionary in the format {<Full language name>: <ISO-639-id>}, e.g. {'Abkhaz': 'ab'}

Source code in src/speech_to_text_finetune/hf_utils.py
def get_available_languages_in_cv(dataset_id: str) -> Dict:
    """
    Checks if a dictionary with the languages already exists as .json and loads it. If not:
    Downloads a languages.py file from a Common Voice dataset repo which stores all languages available.
    Then, dynamically imports the file as a module and returns the dictionary defined inside.
    Since the dictionary is in the format {<ISO-639-id>: <Full language name>} , e.g. {'ab': 'Abkhaz'}
    We swap to use the full language name as key and the ISO id as value instead.
    Then save the dictionary as json for easier loading next time and remove languages.py as it's no longer necessary

    Args:
        dataset_id: It needs to be a specific Common Voice dataset id, e.g. mozilla-foundation/common_voice_17_0

    Returns:
        Dict: A language mapping dictionary in the format {<Full language name>: <ISO-639-id>} , e.g. {'Abkhaz': 'ab'}
    """
    lang_map_file_name = f"./artifacts/languages_{dataset_id.split('/')[1]}.json"

    if Path(lang_map_file_name).is_file():
        logger.info(f"Found {lang_map_file_name} locally, loading the dictionary.")
        with open(lang_map_file_name) as json_file:
            lang_name_to_id = json.load(json_file)
        return lang_name_to_id

    logger.info(
        f"{lang_map_file_name} not found locally. Downloading it from {dataset_id}..."
    )
    filepath = hf_hub_download(
        repo_id=dataset_id, filename="languages.py", repo_type="dataset", local_dir="."
    )
    # Dynamically load LANGUAGES dictionary from languages.py as module
    spec = importlib.util.spec_from_file_location("languages_map_module", filepath)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    lang_id_to_name = module.LANGUAGES

    # Swap keys <> values
    lang_name_to_id = dict((v, k) for k, v in lang_id_to_name.items())

    logger.info(f"Saving {lang_map_file_name} locally to use it next time.")
    Path("./artifacts").mkdir(exist_ok=True)
    with open(lang_map_file_name, "w") as lang_file:
        json.dump(lang_name_to_id, lang_file, indent=4)

    # Cleanup
    os.remove(filepath)

    return lang_name_to_id

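A hedged usage sketch with the dataset id from the docstring example:

from speech_to_text_finetune.hf_utils import get_available_languages_in_cv

languages = get_available_languages_in_cv("mozilla-foundation/common_voice_17_0")
print(languages["Abkhaz"])  # 'ab'
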
upload_custom_hf_model_card(hf_repo_name, model_id, dataset_id, language_id, language, n_train_samples, n_eval_samples, baseline_eval_results, ft_eval_results)

Create and upload a custom Model Card (https://huggingface.co/docs/hub/model-cards) to the Hugging Face repo of the finetuned model that highlights the evaluation results before and after finetuning.

Source code in src/speech_to_text_finetune/hf_utils.py
def upload_custom_hf_model_card(
    hf_repo_name: str,
    model_id: str,
    dataset_id: str,
    language_id: str,
    language: str,
    n_train_samples: int,
    n_eval_samples: int,
    baseline_eval_results: Dict,
    ft_eval_results: Dict,
) -> None:
    """
    Create and upload a custom Model Card (https://huggingface.co/docs/hub/model-cards) to the Hugging Face repo
    of the finetuned model that highlights the evaluation results before and after finetuning.
    """
    card_metadata = ModelCardData(
        model_name=f"Finetuned {model_id} on {language}",
        base_model=model_id,
        datasets=[dataset_id],
        language=language_id,
        license="apache-2.0",
        library_name="transformers",
        eval_results=[
            EvalResult(
                task_type="automatic-speech-recognition",
                task_name="Speech-to-Text",
                dataset_type="common_voice",
                dataset_name=f"Common Voice ({language})",
                metric_type="wer",
                metric_value=round(ft_eval_results["eval_wer"], 3),
            )
        ],
    )
    content = f"""
---
{card_metadata.to_yaml()}
---

# Finetuned {model_id} on {n_train_samples} {language} training audio samples from {dataset_id}.

This model was created from the Mozilla.ai Blueprint:
[speech-to-text-finetune](https://github.com/mozilla-ai/speech-to-text-finetune).

## Evaluation results on {n_eval_samples} audio samples of {language}:

### Baseline model (before finetuning) on {language}
- Word Error Rate: {round(baseline_eval_results["eval_wer"], 3)}
- Loss: {round(baseline_eval_results["eval_loss"], 3)}

### Finetuned model (after finetuning) on {language}
- Word Error Rate: {round(ft_eval_results["eval_wer"], 3)}
- Loss: {round(ft_eval_results["eval_loss"], 3)}
"""

    card = ModelCard(content)
    card.push_to_hub(hf_repo_name)

speech_to_text_finetune.make_local_dataset_app

save_text_audio_to_file(audio_input, sentence, dataset_dir)

Save the audio recording in a .wav file using the index of the text sentence in the filename, and save the associated text sentence in a .csv file using the same index.

Parameters:

Name Type Description Default
audio_input Audio

Gradio audio object to be converted to audio data and then saved to a .wav file

required
sentence str

The text sentence that will be associated with the audio

required
dataset_dir str

The dataset directory path to store the indexed sentences and the associated audio files

required

Returns:

Name Type Description
str str

Status text for Gradio app

None None

Returning None here will reset the audio module to record again from scratch

Source code in src/speech_to_text_finetune/make_local_dataset_app.py
def save_text_audio_to_file(
    audio_input: gr.Audio,
    sentence: str,
    dataset_dir: str,
) -> Tuple[str, None]:
    """
    Save the audio recording in a .wav file using the index of the text sentence in the filename.
    And save the associated text sentence in a .csv file using the same index.

    Args:
        audio_input (gr.Audio): Gradio audio object to be converted to audio data and then saved to a .wav file
        sentence (str): The text sentence that will be associated with the audio
        dataset_dir (str): The dataset directory path to store the indexed sentences and the associated audio files

    Returns:
        str: Status text for Gradio app
        None: Returning None here will reset the audio module to record again from scratch
    """
    Path(dataset_dir).mkdir(parents=True, exist_ok=True)
    text_data_path = Path(f"{dataset_dir}/text.csv")

    if text_data_path.is_file():
        text_df = pd.read_csv(text_data_path)
    else:
        text_df = pd.DataFrame(columns=["index", "sentence"])

    index = len(text_df)
    text_df = pd.concat(
        [text_df, pd.DataFrame([{"index": index, "sentence": sentence}])],
        ignore_index=True,
    )
    text_df = text_df.sort_values(by="index")
    text_df.to_csv(f"{dataset_dir}/text.csv", index=False)

    audio_filepath = f"{dataset_dir}/rec_{index}.wav"

    sr, data = audio_input
    sf.write(file=audio_filepath, data=data, samplerate=sr)

    return (
        f"""✅ Updated {dataset_dir}/text.csv \n✅ Saved recording to {audio_filepath}""",
        None,
    )
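
A hedged usage sketch: when the Gradio audio component is configured with type="numpy", it delivers the recording as a (sample_rate, numpy array) tuple, which is what this function unpacks.

import numpy as np

from speech_to_text_finetune.make_local_dataset_app import save_text_audio_to_file

fake_recording = (16000, np.zeros(16000, dtype=np.int16))  # 1 second of silence
status, reset = save_text_audio_to_file(fake_recording, "hello world", "my_dataset")
print(status)  # paths to the updated text.csv and the saved rec_<index>.wav
print(reset)   # None, which clears the Gradio audio component for the next recording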