# Step-by-Step Guide: How the Speech-to-Text-Finetune Blueprint Works
This Blueprint enables you to fine-tune a Speech-to-Text (STT) model, using either your own data or the Common Voice dataset. This step-by-step guide walks you through the end-to-end process of finetuning an STT model based on your needs.
## Overview
This Blueprint consists of three independent, yet complementary, components:

- **Transcription app** 🎙️📝: A simple UI that lets you record your voice, pick any HF STT/ASR model, and get an instant transcription.
- **Dataset maker app** 📂🎤: Another UI app that enables you to easily and quickly create your own Speech-to-Text dataset.
- **Finetuning script** 🛠️🤖: A script to finetune your own STT model, either using Common Voice data or your own custom data created by the Dataset maker app.
## Step-by-Step Guide
Visit the Getting Started page for the initial project setup.

The following guide is a suggested user flow for getting the most out of this Blueprint.
### Step 1 - Initial transcription testing

Start by testing the quality of the Speech-to-Text models available on HuggingFace:

1. Simply execute the Transcription app (the launch command is shown right after this list).
2. Select or add the HF model id of your choice.
3. Record a sample of your voice and get the transcribed text back. You may find that there are sometimes inaccuracies for your voice/accent/chosen language, indicating the model could benefit from finetuning on additional data.
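The Transcription app is started with the same command used again in Step 5:

```bash
python demo/transcribe_app.py
```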
### Step 2 - Make your Custom Dataset for STT finetuning

1. Create your own custom dataset by running the Dataset maker app (a hedged launch command is sketched after this list).
2. Follow the instructions in the app to create at least 10 audio samples, which will be saved locally.
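A minimal launch sketch, assuming the Dataset maker app ships as a demo script alongside the other apps; the script name below is a hypothetical placeholder, so check the repository for the actual entry point:

```bash
# Hypothetical script name, assumed to live alongside the other demo apps;
# see the repository for the actual Dataset maker entry point.
python demo/make_custom_dataset_app.py
```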
### Step 3 - Creating a finetuned STT model using your custom data

1. Configure `config.yaml` with the model, custom data directory and hyperparameters of your choice. Note that if you select `push_to_hub: True` you need to have an HF account and log in locally. An example configuration is sketched after this step.
2. Finetune a model by running the finetuning script (a hedged command sketch also follows below).

> [!TIP]
> You can prematurely and gracefully stop the finetuning job by pressing CTRL+C. The rest of the code (evaluation, uploading the model) will execute as normal.
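A minimal configuration sketch for custom data, mirroring the Common Voice example in Step 4; the `dataset_id` path and the hyperparameter values shown are assumptions for illustration, not canonical defaults:

```yaml
model_id: openai/whisper-tiny          # base model to finetune
dataset_id: path/to/my_custom_dataset  # directory created by the Dataset maker app (assumed path)
dataset_source: custom
language: English
repo_name: default

training_hp:
  push_to_hub: False                   # set to True to upload; requires a local HF login
  hub_private_repo: True
  ...
```

To launch the run, the entry point is assumed here to sit next to `evaluate_whisper.py` from Step 7; the script name below is a hypothetical placeholder, so check the repository for the actual one:

```bash
# Hypothetical script name; see the repository for the actual finetuning entry point.
python finetune_whisper.py
```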
### Step 4 - (Optional) Creating a finetuned STT model using Common Voice data

1. Go to https://commonvoice.mozilla.org/en/datasets, pick your language and dataset version, and download the dataset.
2. Move the zipped file under a directory of your choice and extract it (a sketch of this step follows the note below).
3. Configure `config.yaml` with the model, Common Voice dataset path and hyperparameters of your choice. For example:
   ```yaml
   model_id: openai/whisper-tiny
   dataset_id: path/to/common_voice_data/language_id
   dataset_source: custom
   language: English
   repo_name: default

   training_hp:
     push_to_hub: False
     hub_private_repo: True
     ...
   ```
4. Finetune a model by running the finetuning script, as in Step 3.
> [!NOTE]
> Every time you load a new dataset, the script will have to process it before feeding it to the STT model. The script will then also save this processed dataset version locally, so that next time you want to finetune a model on the same dataset, the processing step will be skipped, saving time & computation.
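A sketch of the move-and-extract step from item 2, assuming the usual Common Voice `.tar.gz` archive; the archive and target directory names are assumptions and will differ by language and dataset version:

```bash
# Archive and directory names are assumptions; adjust them to your download.
mkdir -p path/to/common_voice_data
tar -xzf cv-corpus-*.tar.gz -C path/to/common_voice_data
```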
### Step 5 - Evaluate transcription accuracy with your finetuned STT model

1. Start the Transcription app:

   ```bash
   python demo/transcribe_app.py
   ```

2. Provided that you set `push_to_hub: True` when you finetuned, you can select your HuggingFace model id. If not, you can specify the local path to your model.
3. Record a sample of your voice and get the transcribed text back.
4. You can easily switch between models with the same recorded sample to evaluate whether the finetuned model has improved transcription accuracy.

### Step 6 - Compare transcription performance between two models

1. Start the Model Comparison app:

   ```bash
   python demo/model_comparison_app.py
   ```

2. Select a baseline model, for example the model you used as a base for finetuning.
3. Select a comparison model, for example your finetuned model.
4. Record a sample of your voice and get two transcriptions back side-by-side for an easier manual evaluation.
### Step 7 - Evaluate a model on the Fleurs dataset for a specific language

- Configure the arguments through the command line according to your needs and execute the command below:

  ```bash
  python evaluate_whisper.py --model_id openai/whisper-tiny --lang_code sw_ke --language Swahili --eval_batch_size 8 --n_test_samples -1 --fp16 True
  ```
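The same script can point at your own finetuned model; the model id below is a placeholder for illustration:

```bash
# The model id is a placeholder; use your own HF repo id or a local model path.
python evaluate_whisper.py --model_id your-username/your-finetuned-model --lang_code sw_ke --language Swahili --eval_batch_size 8 --n_test_samples -1 --fp16 True
```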
## 🎨 Customizing the Blueprint
To better understand how you can tailor this Blueprint to suit your specific needs, please visit the Customization Guide.
## 🤝 Contributing to the Blueprint
Want to help improve or extend this Blueprint? Check out the Future Features & Contributions Guide to see how you can contribute your ideas, code, or feedback to make this Blueprint even better!
## 📖 Resources & References

If you are interested in learning more about this topic, you might find the following resources helpful:

- Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers (Blog post by HuggingFace which inspired the implementation of the Blueprint!)
- Automatic Speech Recognition Course from HuggingFace (Series of blog posts)
- Fine-Tuning ASR Models: Key Definitions, Mechanics, and Use Cases (Blog post by Gladia)
- Active Learning Approach for Fine-Tuning Pre-Trained ASR Model for a low-resourced Language (Paper)
- Exploration of Whisper fine-tuning strategies for low-resource ASR (Paper)