Transcribers
This section presents the automatic speech recognition (ASR) systems implemented by the ASRBench CLI. Each transcriber is introduced briefly, followed by its configuration in YAML format. The more complex parameters are explained in detail in the Common parameters section at the end.
Wav2Vec
Wav2Vec 2.0, by Meta AI, is a self-supervised learning model that learns speech representations from raw audio using a Transformer architecture. It performs well even with little labeled data, achieving low error rates on standard benchmarks.
```yaml
wav2vec:
  asr: "wav2vec"
  model: "facebook/wav2vec2-large-xlsr-53-portuguese"
  compute_type: "float32" # Set the precision of the calculations
  device: "cpu"
```
Available models
The facebook/wav2vec2-large-xlsr-53-portuguese model is fine-tuned for Portuguese. Other pre-trained models can be used, as described in the Wav2Vec documentation.
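Outside the CLI, the same checkpoint can be loaded directly with the Hugging Face transformers library. The sketch below is illustrative only; the file name audio.wav and the 16 kHz mono input are assumptions, not part of the ASRBench configuration.

```python
# Minimal sketch: loading the Wav2Vec2 checkpoint with Hugging Face transformers.
# The file "audio.wav" (16 kHz mono) is an assumed example input.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_id = "facebook/wav2vec2-large-xlsr-53-portuguese"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)  # float32 weights on CPU by default

speech, sample_rate = sf.read("audio.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # per-frame CTC logits

predicted_ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
print(processor.batch_decode(predicted_ids)[0])
```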
Whisper
Whisper, from OpenAI, is a speech recognition system trained at large scale on 680,000 hours of multilingual audio. Its generalization ability allows it to perform competitively across a wide range of scenarios without task-specific fine-tuning.
```yaml
whisper:
  asr: "whisper"
  model: "medium"
  device: "cpu"
  language: "en"
  fp16: false # Enable 16-bit floating point
```
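For reference, this configuration maps closely onto the openai-whisper Python API. A minimal sketch, assuming an input file named audio.wav:

```python
# Minimal sketch using the openai-whisper package; "audio.wav" is an assumed input.
import whisper

model = whisper.load_model("medium", device="cpu")
# fp16=False matches the YAML above and avoids a warning on CPU-only hardware.
result = model.transcribe("audio.wav", language="en", fp16=False)
print(result["text"])
```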
Faster Whisper
Faster Whisper is a reimplementation of Whisper built on the CTranslate2 inference library. It offers greater speed and lower memory consumption while maintaining the accuracy of the original model.
```yaml
faster_whisper:
  asr: "faster_whisper"
  model: "medium"
  compute_type: "int8" # Defines the precision of the calculations
  device: "cpu"
  beam_size: 5 # Controls the beam search width
  language: "en"
```
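The options above correspond directly to the faster-whisper Python API. A minimal sketch, again assuming an input file named audio.wav:

```python
# Minimal sketch using the faster-whisper package; "audio.wav" is an assumed input.
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5, language="en")

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:  # segments are decoded lazily as they are consumed
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```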
Vosk
Vosk is an offline speech recognition toolkit that supports more than 20 languages and dialects. Its architecture combines deep neural networks, hidden Markov models, and finite-state transducers, making it well suited to embedded systems.
```yaml
vosk:
  asr: "vosk"
  model: "medium"
  language: "en"
```
Vosk models
Vosk models must be downloaded separately and configured correctly. See the documentation for details.
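As an illustration of what the transcriber does internally, the sketch below uses the vosk Python package directly. The model directory path and the audio.wav file (16-bit mono PCM) are assumptions:

```python
# Minimal sketch using the vosk package; the model path "model" and
# "audio.wav" (16-bit mono PCM WAV) are assumed examples.
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("model")  # path to an unpacked Vosk model directory
wf = wave.open("audio.wav", "rb")

rec = KaldiRecognizer(model, wf.getframerate())
while True:
    data = wf.readframes(4000)  # feed audio in small chunks
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```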
Common parameters
Some parameters are shared between transcribers and influence performance and accuracy. Below, we explain the most relevant ones:
Compute Type
Defines the numerical precision used in the model's calculations. The options are:
- int8: 8-bit integers; optimized for speed, with a possible loss of transcription accuracy.
- float16: 16-bit floating point; balances speed and accuracy.
- float32: 32-bit floating point; the most accurate option, but the slowest.
The choice depends on the hardware and the balance between speed and quality.
FP16
Enables (true) or disables (false) the use of 16-bit floating point. When enabled, it reduces memory consumption and speeds up processing, but may slightly reduce transcription accuracy.
Beam Size
Controls the width of the search in the Beam Search algorithm, used to generate text sequences. Larger values (e.g. 5 or 10) increase accuracy, but consume more time and memory. Smaller values (e.g. 1 or 3) are faster but less accurate.
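The trade-off can be measured directly. Below is a minimal sketch using faster-whisper; the model size and the input file audio.wav are assumptions:

```python
# Minimal sketch comparing beam sizes with faster-whisper;
# "audio.wav" is an assumed input.
import time
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cpu", compute_type="int8")

for beam_size in (1, 5):
    start = time.perf_counter()
    segments, _ = model.transcribe("audio.wav", beam_size=beam_size)
    text = " ".join(segment.text for segment in segments)  # forces lazy decoding
    elapsed = time.perf_counter() - start
    print(f"beam_size={beam_size}: {elapsed:.1f}s | {text[:60]}")
```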