Canary-v2-French-TV-Media

This model is a fine-tuned version of NVIDIA Canary-1B optimized for French media content (TV, Radio, Podcasts).

Canary-v2 is a multi-task encoder-decoder model. This version has been specifically adapted to handle the linguistic nuances of French broadcasting, including rapid speech, overlapping dialogue, and 2026 media terminology.

Key Improvements

  • Domain Adaptation: Fine-tuned on 5 thematic streams (News, Society, Entertainment, Documentaries, Sports).
  • Robust Punctuation: Improved handling of sentence boundaries in spontaneous media speech.
  • Accuracy: Significant WER reduction in complex acoustic environments compared to the base Canary-1B.

Usage with NVIDIA NeMo

Canary-v2 requires specific input prompts to define the task (ASR/Translation) and the source/target languages.

Installation

pip install "nemo_toolkit[all]==2.6.0"

Transcription example

import nemo.collections.asr as nemo_asr
from huggingface_hub import hf_hub_download

# Load the model
REPO_ID="Archime/canary-v2-fr-tv-media"
MODEL_NAME="canary_v2_fr_tv_media.nemo"

model_path = hf_hub_download(
    repo_id=REPO_ID,
    filename=MODEL_NAME
)
model = nemo_asr.models.EncDecMultiTaskModel.restore_from(model_path)

# Transcription with specific prompts
# pnc="yes" enables punctuation and capitalization
# task="asr", source_lang="fr", target_lang="fr"
transcription = model.transcribe(
    ["path/to/media_audio.wav"],
    task="asr",
    source_lang="fr",
    target_lang="fr",
    pnc="yes"
)

print(f"Transcription: {transcription[0]}")

Performance (WER/CER)

The following tables compare the base Canary-1B-v2 model with this fine-tuned French TV Media version.


Scores are reported in decimal format (e.g., 0.0334 = 3.34% WER). Results focus on the Processed mode (normalized text).

WER (Word Error Rate) Comparison

| Dataset (Processed) | Base Canary-v2 | Fine-tuned Media | Improvement |
|---|---|---|---|
| News (Info) | 0.0465 | 0.0377 | -18.9% |
| Society (Talks) | 0.0475 | 0.0390 | -17.9% |
| Sports | 0.0999 | 0.0825 | -17.4% |
| Documentaries | 0.0394 | 0.0340 | -13.7% |
| Entertainment | 0.0652 | 0.0614 | -5.8% |
| Fleurs (FR) | 0.0824 | 0.0806 | -2.2% (Stable) |
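The relative improvement figures follow directly from the decimal scores; a minimal sanity check in plain Python:

```python
def rel_change(base: float, tuned: float) -> float:
    """Relative change in percent between baseline and fine-tuned scores."""
    return (tuned - base) / base * 100

# Figures taken from the WER table above
pairs = {
    "News (Info)": (0.0465, 0.0377),
    "Sports": (0.0999, 0.0825),
    "Fleurs (FR)": (0.0824, 0.0806),
}
for name, (base, tuned) in pairs.items():
    print(f"{name}: {rel_change(base, tuned):+.1f}%")
# News (Info): -18.9%  /  Sports: -17.4%  /  Fleurs (FR): -2.2%
```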

CER (Character Error Rate) - Detailed

| Dataset (Processed) | Base Canary-v2 (CER) | Fine-tuned (CER) | Improvement |
|---|---|---|---|
| Documentaries | 0.0149 | 0.0139 | -6.7% |
| News | 0.0178 | 0.0153 | -14.0% |
| Society | 0.0207 | 0.0178 | -14.0% |
| Sports | 0.0461 | 0.0409 | -11.3% |

Robustness Check (Non-Target Language)

We monitored the English performance to ensure no "catastrophic forgetting" occurred during French specialization.

| Dataset | Base Canary-v2 | Fine-tuned Media | Impact |
|---|---|---|---|
| Fleurs (EN-US) | 0.0739 | 0.0768 | +3.9% (Negligible) |

⚠️ Limitations & Bias

While this model provides state-of-the-art performance for French media transcription, users should consider the following limitations:

  • Multilingual Degradation: To achieve high precision in French TV & Media, the model's performance in other languages (English, German, Spanish) originally supported by Canary-v2 has decreased.
  • Acoustic Complexity: Despite improvements in the Sports and Entertainment sectors, accuracy may still drop during segments with extreme overlapping speech (e.g., heated political debates) or very high levels of non-speech ambient noise (e.g., stadium crowds, loud music beds).
  • Bias in Media Language: The model is trained on broadcast-quality data. It may exhibit bias towards "Standard French" as spoken in media and might perform slightly less accurately on strong regional accents or non-professional recordings (e.g., low-quality phone calls).
  • Inference Resources: With nearly 1 Billion parameters, this model requires significant VRAM (at least 8GB+ for comfortable inference) compared to smaller architectures like Parakeet.

Fine-tuning Methodology: Parameter-Efficient Adaptation

To adapt Canary-v2 to the specificities of French media (TV, Radio, Podcasts), we employed a targeted fine-tuning strategy focusing on the linguistic generation component while preserving the acoustic foundation.

Freezing Strategy

  • Encoder (Frozen): The massive 810M parameter Conformer Encoder was kept in eval mode. This ensures that the model retains its high-quality, universal acoustic representation and remains robust to various recording conditions.
  • Decoder (Trainable): The Transformer Decoder and Token Classifier were fully trained. This allows the model to learn the specific syntax, media-related vocabulary, and punctuation patterns of 2026 French broadcasting.
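As a rough sketch of this freezing strategy in plain PyTorch — the attribute names `encoder` and `transf_decoder` are assumed from the architecture summary, and the code is shown on a toy module rather than the actual NeMo model:

```python
import torch.nn as nn

def freeze_encoder(model: nn.Module) -> None:
    """Freeze the encoder (no gradients, eval mode); leave the rest trainable."""
    for p in model.encoder.parameters():
        p.requires_grad = False
    model.encoder.eval()  # also fixes Dropout/BatchNorm behavior during training

# Toy stand-in for the real Canary model, for illustration only
toy = nn.Module()
toy.encoder = nn.Linear(8, 8)
toy.transf_decoder = nn.Linear(8, 8)
freeze_encoder(toy)

trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy.parameters() if not p.requires_grad)
print(trainable, frozen)  # 72 72
```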

Training Architecture Summary

| Component | Type | Parameters | Mode |
|---|---|---|---|
| Encoder | ConformerEncoder | 810 M | EVAL (Frozen) |
| Transf_decoder | TransformerDecoderNM | 151 M | TRAIN |
| Log_softmax | TokenClassifier | 16.8 M | TRAIN |

Training Statistics

  • Trainable Parameters: 117 M (12.2%)
  • Non-trainable Parameters: 844 M (87.8%)
  • Total Model Size: 962 M parameters (~3.8 GB)
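These figures are consistent with simple arithmetic, assuming 4 bytes per parameter (fp32):

```python
trainable_m = 117      # parameters updated during fine-tuning (in millions)
non_trainable_m = 844  # frozen parameters (in millions)
total_m = trainable_m + non_trainable_m  # 961 M, ~962 M after rounding

print(f"trainable share: {trainable_m / total_m:.1%}")  # 12.2%
print(f"fp32 size: {total_m * 1e6 * 4 / 1e9:.2f} GB")   # 3.84 GB (~3.8 GB)
```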

Training Data

This model was specifically fine-tuned on the Archime/french_tv_media_dataset_2026 dataset.

The dataset consists of 5 major thematic streams, which were carefully balanced during the fine-tuning process to ensure broad coverage of the media landscape:

  • News (Info): Television news bulletins and hourly news flashes.
  • Society: Debates, talk shows, and panel discussions.
  • Entertainment: Game shows, variety shows, and live entertainment.
  • Documentaries: Voice-over narrations and on-location field interviews.
  • Sports: Play-by-play commentary and post-match interviews.

This diversity enables the model to be robust across various linguistic registers—ranging from the formal, scripted language of documentaries to the spontaneous (and often noisy) speech found in entertainment and live sports.
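One simple way to balance such streams during fine-tuning is uniform per-stream sampling, so that small streams are not drowned out by large ones. A minimal sketch with hypothetical file names (the actual dataset layout and balancing procedure may differ):

```python
import random

# Hypothetical per-stream utterance lists; real streams are far larger.
streams = {
    "news": ["news_0001.wav", "news_0002.wav", "news_0003.wav"],
    "society": ["soc_0001.wav"],
    "entertainment": ["ent_0001.wav"],
    "documentaries": ["doc_0001.wav"],
    "sports": ["spo_0001.wav"],
}

def balanced_batch(streams, k, seed=0):
    """Draw k utterances, picking the stream uniformly first so every
    theme is sampled with equal probability regardless of its size."""
    rng = random.Random(seed)
    names = list(streams)
    return [rng.choice(streams[rng.choice(names)]) for _ in range(k)]

print(balanced_batch(streams, 4))
```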


Reproducing Results

To reproduce the metrics displayed in the performance table, ensure you have NVIDIA NeMo installed and use the official evaluation scripts.

Evaluation Command (Example: Media Domains)

You can adjust the text processing flags to toggle between Raw and Normalized scores.

  1. Install dependencies:
pip install -r script/requirements.txt
  2. Prepare the test datasets:
python script/prepare_datasets_test_NeMo.py
  3. Run the evaluation (FR):
python script/speech_to_text_eval_manifests.py \
+models="{canary-1b-v2:'nvidia/canary-1b-v2',canary_v2_fr_tv_media:'Archime/canary-v2-fr-tv-media'}" \
+prompt.source_lang=fr +prompt.target_lang=fr +prompt.task=asr +prompt.pnc=yes \
+dataset_manifests.fleurs_fr_fr="path/to/nemo_datasets/fleurs/fleurs_test_manifest.json" \
+dataset_manifests.info="path/to/nemo_datasets/french_tv_media_dataset_2026/archime_test_info_manifest.json" \
+dataset_manifests.societe="path/to/nemo_datasets/french_tv_media_dataset_2026/archime_test_societe_manifest.json" \
+dataset_manifests.divertissements="path/to/nemo_datasets/french_tv_media_dataset_2026/archime_test_divertissements_manifest.json" \
+dataset_manifests.documentaires="path/to/nemo_datasets/french_tv_media_dataset_2026/archime_test_documentaires_manifest.json" \
+dataset_manifests.sports="path/to/nemo_datasets/french_tv_media_dataset_2026/archime_test_sports_manifest.json" \
use_cer=True \
batch_size=32
  4. Run the evaluation (EN):
python script/speech_to_text_eval_manifests.py \
+models="{canary-1b-v2:'nvidia/canary-1b-v2',canary_v2_fr_tv_media:'Archime/canary-v2-fr-tv-media'}" \
+prompt.source_lang=en +prompt.target_lang=en +prompt.task=asr +prompt.pnc=yes \
+dataset_manifests.fleurs_en_us="path/to/nemo_datasets/fleurs/fleurs_en_us/fleurs_test_manifest.json" \
use_cer=True \
batch_size=32  
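The manifest paths above follow NeMo's JSON Lines convention: one JSON object per utterance with `audio_filepath`, `duration` (in seconds), and reference `text`. A minimal sketch for building such a manifest — the entry values here are illustrative, not actual dataset content:

```python
import json

# Illustrative utterance; real manifests hold one line per test utterance.
entries = [
    {"audio_filepath": "clips/info_0001.wav", "duration": 12.4,
     "text": "le journal de vingt heures commence maintenant"},
]

with open("archime_test_info_manifest.json", "w", encoding="utf-8") as f:
    for e in entries:
        # ensure_ascii=False keeps French accented characters readable
        f.write(json.dumps(e, ensure_ascii=False) + "\n")
```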

Citation

If you use this model in your research or product, please cite the original Canary work and this fine-tuned version:

Base model:

@misc{canary-1b-v2,
  author = {NVIDIA},
  title = {canary-1b-v2},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/nvidia/canary-1b-v2}
}

This model:

@misc{canary-v2-fr-tv-media,
  author = {Archime},
  title = {canary-v2-fr-tv-media},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/Archime/canary-v2-fr-tv-media}}
}

Contact

For questions or issues, please open an issue on the repository.
