FireRedVAD-CoreML

Core ML conversion of FireRedVAD Stream-VAD for real-time voice activity detection on Apple platforms (iOS 16+ / macOS 13+). Converted from the original PyTorch model by FireRedTeam/FireRedVAD.

Model Description

  • Original model: FireRedVAD by Xiaohongshu (小红书) FireRedTeam
  • Architecture: DFSMN (Deep Feedforward Sequential Memory Network) — 8 DFSMN blocks + 1 DNN layer
  • Variant: Stream-VAD (causal, lookahead=0), suitable for real-time streaming
  • Parameters: ~568K (extremely lightweight)
  • Model size: 2.2 MB (FP32)
  • Input: 80-dim log-Mel filterbank features (16kHz, 25ms frame, 10ms shift)
  • Output: Speech probability [0, 1] per frame
  • Language support: 100+ languages, 20+ Chinese dialects
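The feature geometry above follows standard 25 ms window / 10 ms hop framing at 16 kHz; a pure-Python sketch of the frame-count arithmetic (`num_frames` is an illustrative helper, not part of the model API):

```python
# 16 kHz audio, 25 ms window (400 samples), 10 ms hop (160 samples)
SAMPLE_RATE = 16_000
WIN = int(0.025 * SAMPLE_RATE)  # 400 samples per frame
HOP = int(0.010 * SAMPLE_RATE)  # 160 samples between frame starts

def num_frames(num_samples: int) -> int:
    """Frames produced by sliding a 25 ms window with a 10 ms hop."""
    if num_samples < WIN:
        return 0
    return 1 + (num_samples - WIN) // HOP

# One second of audio yields 98 frames, each an 80-dim log-Mel vector,
# so the model input `feat` would be shaped [1, 98, 80].
print(num_frames(SAMPLE_RATE))  # 98
```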

Performance

Results from the FLEURS-VAD-102 benchmark (102 languages, 9,443 audio clips):

| Metric      | FireRedVAD | Silero-VAD | TEN-VAD | FunASR-VAD | WebRTC-VAD |
|-------------|-----------:|-----------:|--------:|-----------:|-----------:|
| AUC-ROC     | 99.60      | 97.99      | 97.81   | -          | -          |
| F1 Score    | 97.57      | 95.95      | 95.19   | 90.91      | 52.30      |
| False Alarm | 2.69%      | 9.41%      | 15.47%  | 44.03%     | 2.83%      |
| Miss Rate   | 3.62%      | 3.95%      | 2.95%   | 0.42%      | 64.15%     |

Core ML Model Specification

Inputs

| Name              | Shape           | Type    | Description                                      |
|-------------------|-----------------|---------|--------------------------------------------------|
| feat              | [1, 1..512, 80] | Float32 | Log-Mel filterbank features (dynamic time axis)  |
| cache_0 ~ cache_7 | [1, 128, 19]    | Float32 | FSMN lookback cache, one for each of the 8 layers |

Outputs

| Name                      | Type    | Description                                                              |
|---------------------------|---------|--------------------------------------------------------------------------|
| probs                     | Float32 | Speech probability, shape [1, T, 1]                                      |
| new_cache_0 ~ new_cache_7 | Float32 | Updated lookback caches; feed back as cache_0 ~ cache_7 on the next call |

  • Minimum deployment target: iOS 16 / macOS 13
  • Compute units: CPU + Neural Engine
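The streaming I/O contract above can be sketched in plain Python, with zero-initialized nested lists standing in for `MLMultiArray`s and keys mirroring the Core ML input names (`zero_caches` is an illustrative helper, not part of the generated model API):

```python
# One [1, 128, 19] zero cache per DFSMN block, keyed like the Core ML inputs.
NUM_LAYERS, CHANNELS, LOOKBACK = 8, 128, 19

def zero_caches():
    """Fresh lookback caches for the start of a stream."""
    return {
        f"cache_{i}": [[[0.0] * LOOKBACK for _ in range(CHANNELS)]]
        for i in range(NUM_LAYERS)
    }

caches = zero_caches()
# Per chunk: pass `feat` plus all eight caches to the model, then replace
# each cache_i with the returned new_cache_i before the next chunk.
print(sorted(caches))  # ['cache_0', 'cache_1', ..., 'cache_7']
```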

Conversion

Converted from PyTorch using coremltools via the export script in FireRedASR2S. The Stream-VAD variant was selected for its causal (no lookahead) property, making it suitable for real-time streaming applications.
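The authoritative export script is the one in FireRedASR2S; a generic coremltools sketch of this kind of conversion, where the wrapper module, example inputs, and `RangeDim` bounds are assumptions mirroring the I/O tables above, might look like:

```python
import torch
import coremltools as ct

# `StreamVADWrapper` is a hypothetical torch.nn.Module wrapping Stream-VAD so
# that forward(feat, cache_0, ..., cache_7) returns (probs, new_cache_0, ...).
wrapped = StreamVADWrapper().eval()

example_inputs = [torch.zeros(1, 100, 80)] + [
    torch.zeros(1, 128, 19) for _ in range(8)
]
traced = torch.jit.trace(wrapped, example_inputs)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="feat", shape=(1, ct.RangeDim(1, 512), 80))]
    + [ct.TensorType(name=f"cache_{i}", shape=(1, 128, 19)) for i in range(8)],
    minimum_deployment_target=ct.target.iOS16,
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
mlmodel.save("FireRedVAD.mlpackage")
```

The `RangeDim(1, 512)` on the time axis is what gives the Core ML model its dynamic `[1, 1..512, 80]` input shape.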

Usage

import CoreML

// Load model
let model = try FireRedVAD(configuration: .init())

// Initialize caches (8 layers x [1, 128, 19])
var caches: [MLMultiArray] = (0..<8).map { _ in
    let cache = try! MLMultiArray(shape: [1, 128, 19], dataType: .float32)
    // MLMultiArray contents are not guaranteed to be zeroed, so clear explicitly.
    for i in 0..<cache.count { cache[i] = 0 }
    return cache
}

// Run a chunk of T feature frames through the model
let input = FireRedVADInput(
    feat: fbankFeatures,       // [1, T, 80]
    cache_0: caches[0], cache_1: caches[1],
    cache_2: caches[2], cache_3: caches[3],
    cache_4: caches[4], cache_5: caches[5],
    cache_6: caches[6], cache_7: caches[7]
)
let output = try model.prediction(input: input)
let speechProb = output.probs  // [1, T, 1]

// Carry the updated caches into the next chunk
caches = [
    output.new_cache_0, output.new_cache_1,
    output.new_cache_2, output.new_cache_3,
    output.new_cache_4, output.new_cache_5,
    output.new_cache_6, output.new_cache_7
]

For a complete implementation with feature extraction, CMVN normalization, and speech state machine, see FireRedASRKit.
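A typical speech state machine over the per-frame probabilities is hysteresis with a minimum duration. A minimal Python sketch, where the 0.6/0.4 thresholds and the 3-frame minimum are illustrative assumptions, not FireRedASRKit's actual values:

```python
def segments(probs, on=0.6, off=0.4, min_speech=3):
    """Hysteresis over per-frame speech probabilities.

    Enters speech when prob > `on`, leaves when prob < `off`, and drops
    runs shorter than `min_speech` frames. Returns (start, end) frame pairs.
    """
    out, start = [], None
    for i, p in enumerate(probs):
        if start is None:
            if p > on:
                start = i
        elif p < off:
            if i - start >= min_speech:
                out.append((start, i))
            start = None
    if start is not None and len(probs) - start >= min_speech:
        out.append((start, len(probs)))
    return out

print(segments([0.1, 0.7, 0.8, 0.9, 0.5, 0.3, 0.1, 0.9, 0.2]))  # [(1, 5)]
```

The two thresholds prevent the detector from flickering on probabilities that hover near a single cutoff.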

License

Apache 2.0, following the original FireRedVAD license.
