whisper-25TPS-VQ-32k-large-v3-turbo
Adds a pooling layer with stride 2 to reach 25 tokens per second (TPS), together with a 32768-entry VQ codebook.
This model introduces VQ on top of malaysia-ai/whisper-25TPS-large-v3-turbo.
WandB at https://wandb.ai/huseinzol05/whisperconv?nw=nwuserhuseinzol05
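The pooling plus VQ head can be pictured roughly as below. This is a minimal sketch only: it assumes average pooling and a nearest-neighbour codebook lookup, and the class name and defaults are illustrative rather than the repository's actual module (see the source code link at the bottom for the real implementation).

import torch
import torch.nn as nn

class PooledVQHead(nn.Module):
    # hypothetical module name; 1280 is the large-v3-turbo encoder width
    def __init__(self, hidden_size = 1280, codebook_size = 32768):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size = 2, stride = 2)   # 50 TPS -> 25 TPS
        self.codebook = nn.Embedding(codebook_size, hidden_size)

    def forward(self, encoder_hidden_states):
        # encoder_hidden_states: (batch, frames, hidden)
        x = self.pool(encoder_hidden_states.transpose(1, 2)).transpose(1, 2)
        # nearest-codebook assignment gives the discrete audio token ids
        distances = torch.cdist(x, self.codebook.weight[None].expand(x.size(0), -1, -1))
        return distances.argmin(dim = -1)                        # (batch, frames // 2)

head = PooledVQHead()
tokens = head(torch.randn(1, 1500, 1280))   # 30 s of encoder frames -> (1, 750) token ids

Whisper-large-v3-turbo emits roughly 50 encoder frames per second, so the stride-2 pool halves that to the 25 TPS this model advertises.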
Training dataset
- malaysia-ai/common_voice_17_0
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_segments
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_manglish_segments
How to get audio tokens
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
import librosa

model_id = "mesolitica/whisper-25TPS-VQ-32k-large-v3-turbo"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code = True, torch_dtype = 'auto').cuda()
encoder = model.model.get_encoder()

y, sr = librosa.load('common_voice_ba_26517811.mp3', sr = feature_extractor.sampling_rate)
features = feature_extractor([y], return_tensors = 'pt', return_attention_mask = True)
for k in features.keys():
    features[k] = features[k].cuda()

encoded = encoder(**features)
# encoded[1] holds the discrete audio token ids; keep only the positions
# where encoded[2] (the downsampled frame mask) is 1
print(encoded[1][0, encoded[2][0] == 1])
tensor([14135, 7585, 12890, 32383, 15559, 4515, 252, 32713, 252, 16296,
3050, 18175, 15733, 5619, 5619, 1770, 7520, 32041, 26287, 8139,
8453, 28652, 4327, 26837, 20927, 26620, 12310, 12310, 12938, 29755,
29755, 18102, 5597, 8076, 8076, 8076, 9772, 31738, 31738, 1856,
24397, 27124, 5538, 1970, 29984, 8891, 20453, 20453, 1815, 1465,
1465, 26893, 5597, 9531, 11871, 11871, 6484, 21016, 14653, 18417,
9598, 9598, 30138, 27531, 18071, 18071, 30147, 24892, 434, 16557,
30589, 25516, 30876, 30876, 32039, 29394, 27996, 10042, 1939, 16692,
8163, 16665, 16665, 4507, 28100, 31251, 3051, 3051, 12157, 19865,
27147, 27357, 21524, 19750, 20016, 9031, 20016, 13475, 30149, 30149,
21785, 4176, 24032, 19334, 17387, 31375, 2659, 16509, 31672, 7785,
10352, 30063, 8518, 30730, 29357, 28538, 7072], device='cuda:0')
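At 25 TPS, each second of audio maps to 25 discrete tokens, so the 117 tokens printed above correspond to roughly 4.7 seconds of speech.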
How to decode
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
import librosa

model_id = "mesolitica/whisper-25TPS-VQ-32k-large-v3-turbo"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code = True, torch_dtype = 'auto').cuda()

y, sr = librosa.load('common_voice_ba_26517811.mp3', sr = feature_extractor.sampling_rate)

# language tagging for the decoder input ids
input_ids = tokenizer(
    '<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>',
    add_special_tokens = False, return_tensors = 'pt')['input_ids']

features = feature_extractor([y], return_tensors = 'pt', return_attention_mask = True)
features['decoder_input_ids'] = input_ids
for k in features.keys():
    features[k] = features[k].cuda()

generate_kwargs = dict(
    **features,
    max_new_tokens=1024,
)
generation_output = model.generate(**generate_kwargs)
print(tokenizer.decode(generation_output[0]))
Output,
<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Кубах сирта был холква кешене битарафлыг сирпаса.<|endoftext|>
Evaluation
Evaluated on malaysia-ai/common_voice_17_0/test across up to 115 languages, with the following conditions (a minimal sketch of the normalization is shown after this list),
- Lower case.
- Remove punctuation.
- Provide language tagging for decoder input ids,
<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>.
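The normalization and decoder prompt can be sketched as below. This is an illustration only (the regex and helper names are assumptions); the actual evaluation script lives in the source code repository linked at the bottom.

import re

def normalize(text):
    # lower case, strip punctuation, collapse whitespace
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def decoder_prompt(lang):
    # language tagging for the decoder input ids
    return f'<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>'

print(decoder_prompt('ru'))          # -> <|startoftranscript|><|ru|><|transcribe|><|notimestamps|>
print(normalize('Hello, World!'))    # -> hello world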
We also compared against mesolitica/whisper-conv-large-v3-turbo; the Difference CER column is relative to that model,
lang: gl, samples: 9949, CER: 0.179959138002185, Difference CER: 0.13721869488084443
lang: en, samples: 16379, CER: 0.18379318688291674, Difference CER: 0.12280680287314846
lang: ar, samples: 10458, CER: 0.442098309275015, Difference CER: 0.2194370734765707
lang: kab, samples: 14972, CER: 0.5076671557224989, Difference CER: 0.18320063206386478
lang: ml, samples: 703, CER: 0.7163533887103217, Difference CER: 0.2929944834997549
lang: kk, samples: 514, CER: 0.4628507980881995, Difference CER: 0.29241639009023807
lang: ltg, samples: 2904, CER: 0.4205037910930837, Difference CER: 0.18932788573261197
lang: fr, samples: 16145, CER: 0.1654582834310926, Difference CER: 0.11697265184252421
lang: de, samples: 16170, CER: 0.1724037776794763, Difference CER: 0.14608880590128248
lang: fi, samples: 1554, CER: 0.39460069431609057, Difference CER: 0.3440490009933553
lang: pt, samples: 9432, CER: 0.17848626855907665, Difference CER: 0.13761340489197915
lang: ia, samples: 1816, CER: 0.15100353821127493, Difference CER: 0.09107791393755202
lang: eu, samples: 13621, CER: 0.296558713578422, Difference CER: 0.24527039634593922
lang: ro, samples: 3896, CER: 0.23927959309114133, Difference CER: 0.1885134193753486
lang: sw, samples: 12086, CER: 0.37654191236655504, Difference CER: 0.22579246201638664
lang: sv-SE, samples: 5247, CER: 0.30514152709793296, Difference CER: 0.2436479140179749
lang: ta, samples: 8263, CER: 0.48104527337629344, Difference CER: 0.341981281259172
lang: et, samples: 2653, CER: 0.5043074385490827, Difference CER: 0.4102667579878675
lang: lg, samples: 11902, CER: 0.40869471032709415, Difference CER: 0.23476138336318905
lang: it, samples: 15154, CER: 0.1520149858391226, Difference CER: 0.12816344195114243
lang: mhr, samples: 15107, CER: 0.33643170300979874, Difference CER: 0.2197327291197385
lang: sr, samples: 1539, CER: 0.27426151226637097, Difference CER: 0.09744828405684117
lang: mr, samples: 1437, CER: 0.6049676171879896, Difference CER: 0.4127790267325503
lang: ka, samples: 12608, CER: 0.4641408894888927, Difference CER: 0.3391600692577844
lang: es, samples: 15848, CER: 0.11158011174118186, Difference CER: 0.08995314166952252
lang: be, samples: 15878, CER: 0.22578637993729334, Difference CER: 0.19194390408900153
lang: lt, samples: 4753, CER: 0.3434212862030025, Difference CER: 0.253312491014122
lang: ca, samples: 16389, CER: 0.1315824165221354, Difference CER: 0.09176576155922209
lang: eo, samples: 14773, CER: 0.17372091016054594, Difference CER: 0.12776604038944495
lang: tr, samples: 11235, CER: 0.32994471022721983, Difference CER: 0.27318948867844095
lang: hu, samples: 11435, CER: 0.3561362381156991, Difference CER: 0.3114171485676524
lang: ja, samples: 6033, CER: 0.9982833411296135, Difference CER: 0.628491637183775
lang: br, samples: 2202, CER: 0.5381746564223735, Difference CER: 0.16834193910688278
lang: ne-NP, samples: 217, CER: 0.604053957230674, Difference CER: -0.02706696647602469
lang: uz, samples: 12006, CER: 0.4207017429304853, Difference CER: 0.2754451831277186
lang: ru, samples: 10184, CER: 0.2849535442232065, Difference CER: 0.25471650017165537
lang: dv, samples: 2213, CER: 0.6553536784214389, Difference CER: 0.21568280449693655
lang: tt, samples: 4953, CER: 0.4218519170807958, Difference CER: 0.269218179819226
lang: rw, samples: 14797, CER: 0.4474459282723276, Difference CER: 0.2412264486491754
lang: bn, samples: 9327, CER: 0.6288092560840732, Difference CER: 0.3984746323986289
lang: ug, samples: 6108, CER: 0.4840831192258846, Difference CER: 0.33341606506417826
lang: rm-sursilv, samples: 1361, CER: 0.38524075200621666, Difference CER: 0.1648936703303748
lang: bg, samples: 3201, CER: 0.331453799832877, Difference CER: 0.2699784190733914
lang: ab, samples: 9108, CER: 0.4736496871597669, Difference CER: 0.23927317199856718
lang: uk, samples: 9915, CER: 0.2814522916429826, Difference CER: 0.22192610550326083
lang: mt, samples: 1662, CER: 0.46426521755300143, Difference CER: 0.21599540144825657
lang: fa, samples: 10292, CER: 0.37443357243073633, Difference CER: 0.1881830242987229
lang: pl, samples: 9186, CER: 0.31906700327448057, Difference CER: 0.27611312786639164
lang: bas, samples: 541, CER: 0.48938294627539863, Difference CER: 0.058741911919594825
lang: nl, samples: 11255, CER: 0.21872580062032035, Difference CER: 0.19298044816143148
lang: zh-CN, samples: 10335, CER: 0.8210631692844381, Difference CER: 0.5807757008875891
lang: tok, samples: 2175, CER: 0.17319962128358807, Difference CER: 0.11941545248985369
lang: ur, samples: 4052, CER: 0.4027290974374358, Difference CER: 0.2692039179316977
lang: sk, samples: 2593, CER: 0.3354639680359322, Difference CER: 0.2060686428463557
lang: oc, samples: 254, CER: 0.4043809122006512, Difference CER: 0.15007764689220227
lang: yue, samples: 2585, CER: 0.7505072448354659, Difference CER: 0.5056939610440645
lang: mrj, samples: 7102, CER: 0.3983174300047843, Difference CER: 0.2207365424592496
lang: fy-NL, samples: 3167, CER: 0.4013634703164395, Difference CER: 0.21497581337341332
lang: cs, samples: 9055, CER: 0.30871198296817226, Difference CER: 0.2694371466902227
lang: th, samples: 10982, CER: 0.6311950035438857, Difference CER: 0.41644986961973657
lang: ckb, samples: 5262, CER: 0.41982876859913076, Difference CER: 0.2067183633022011
lang: mn, samples: 1896, CER: 0.6504788197607608, Difference CER: 0.24050040955516266
lang: ky, samples: 1604, CER: 0.49966345148800123, Difference CER: 0.29977656835105015
lang: skr, samples: 1006, CER: 0.5273596174776184, Difference CER: 0.09399999204068638
lang: hy-AM, samples: 4281, CER: 0.5024096428595386, Difference CER: 0.3484867650554775
lang: sl, samples: 1242, CER: 0.3072356434149594, Difference CER: 0.21210338918345795
lang: vi, samples: 1077, CER: 0.5123390332518818, Difference CER: 0.4141346583971688
lang: hi, samples: 3151, CER: 0.4433347767492109, Difference CER: 0.306372811867595
lang: nan-tw, samples: 2317, CER: 0.7641467371011472, Difference CER: 0.18097757844493056
lang: id, samples: 3633, CER: 0.15998835407786902, Difference CER: 0.12512013060283947
lang: cy, samples: 5371, CER: 0.46837017267238346, Difference CER: 0.2825786961901451
lang: yo, samples: 999, CER: 0.7100514992546237, Difference CER: 0.15481400464063144
lang: sah, samples: 1455, CER: 0.5696483293386966, Difference CER: 0.34213265807681337
lang: mk, samples: 1097, CER: 0.3633377564531888, Difference CER: 0.2638160639529168
lang: cv, samples: 1288, CER: 0.5136879769244236, Difference CER: 0.2560624251025429
lang: myv, samples: 479, CER: 0.4441728174014146, Difference CER: 0.25828875786921446
lang: da, samples: 2405, CER: 0.3323597890538349, Difference CER: 0.2629458684051959
lang: lv, samples: 6738, CER: 0.34039794018912223, Difference CER: 0.24378256153241187
lang: kmr, samples: 3900, CER: 0.4001037734079973, Difference CER: 0.169977863006498
lang: tk, samples: 545, CER: 0.5953345982069979, Difference CER: 0.23303181767779912
lang: nn-NO, samples: 370, CER: 0.38739263901776333, Difference CER: 0.23898330177900332
lang: ha, samples: 661, CER: 0.3961181418968935, Difference CER: 0.10301374821255638
lang: he, samples: 260, CER: 0.7651224921633984, Difference CER: 0.3737074083544803
lang: dyu, samples: 59, CER: 0.5455594488660915, Difference CER: -0.06883710434112844
lang: gn, samples: 855, CER: 0.5436678887771803, Difference CER: 0.17064681853064745
lang: lij, samples: 694, CER: 0.4594077375161419, Difference CER: 0.16396474304831565
lang: hsb, samples: 444, CER: 0.5364972753606619, Difference CER: 0.31292411666604203
lang: pa-IN, samples: 487, CER: 0.5484373916344527, Difference CER: 0.11910272512323433
lang: el, samples: 1696, CER: 0.38105151153019773, Difference CER: 0.2859149702640159
lang: zgh, samples: 159, CER: 1.0, Difference CER: 0.0
lang: as, samples: 551, CER: 0.6162622996828861, Difference CER: 0.2606034932750207
lang: sq, samples: 472, CER: 0.44938032259231714, Difference CER: 0.20108797838476486
lang: ko, samples: 338, CER: 1.0, Difference CER: 0.7286135754490389
lang: ga-IE, samples: 517, CER: 0.5355736705711854, Difference CER: 0.09817055611874492
lang: cnh, samples: 763, CER: 0.47912514136497514, Difference CER: 0.005714919342429958
lang: sat, samples: 147, CER: 0.40596308869026315, Difference CER: -0.5940369113097368
lang: rm-vallader, samples: 462, CER: 0.39101991164453215, Difference CER: 0.1913176080009029
lang: or, samples: 670, CER: 0.8464969512872611, Difference CER: -0.1535030487127389
lang: mdf, samples: 104, CER: 0.466300445510711, Difference CER: 0.1636534248636496
lang: af, samples: 62, CER: 0.45330709681185294, Difference CER: 0.28916139814182135
lang: ig, samples: 4, CER: 0.6701285963382738, Difference CER: 0.10868297733217092
lang: sc, samples: 232, CER: 0.44202589637943673, Difference CER: 0.09915654171955035
lang: tig, samples: 169, CER: 1.0, Difference CER: 0.039992025079691684
lang: te, samples: 49, CER: 0.8905341862696007, Difference CER: 0.4086612634876574
lang: ps, samples: 199, CER: 0.488391538686537, Difference CER: 0.13565255246387803
lang: am, samples: 205, CER: 1.0, Difference CER: 0.1377434745095969
lang: ast, samples: 162, CER: 0.28144477608953195, Difference CER: 0.14675671093636652
lang: os, samples: 50, CER: 0.6279852022226734, Difference CER: 0.1452089402221578
lang: lo, samples: 33, CER: 1.0, Difference CER: 0.0
lang: az, samples: 33, CER: 0.5481673876531203, Difference CER: 0.424782749677839
lang: ti, samples: 4, CER: 1.0, Difference CER: 0.0
lang: vot, samples: 6, CER: 0.43847483085183314, Difference CER: 0.01979380268853953
lang: nhi, samples: 5, CER: 0.6312414467253177, Difference CER: 0.22568775310710798
lang: yi, samples: 6, CER: 1.0, Difference CER: 0.09213035535913638
lang: tw, samples: 9, CER: 0.6558037450160406, Difference CER: 0.16412324733187883
average CER: 0.47777548634450767
Source code
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/whisper-conv