Kokoro CoreML (HAR-Optimized)
High-performance Kokoro TTS CoreML conversion with Apple Neural Engine (ANE) optimized HAR decoder buckets.
This repository contains precompiled .mlpackage models for fast on-device speech synthesis on Apple platforms.
Based on the open-source project https://github.com/mattmireles/kokoro-coreml
Included Models
Duration Model (Stage 1)
kokoro_duration.mlpackage
Handles variable-length text and predicts phoneme durations + intermediate features.
HAR Decoder Buckets (Stage 2, ANE Optimized)
Fixed-size audio synthesis models:
- KokoroDecoder_HAR_1s.mlpackage
- KokoroDecoder_HAR_2s.mlpackage
- KokoroDecoder_HAR_3s.mlpackage
- KokoroDecoder_HAR_5s.mlpackage
- KokoroDecoder_HAR_8s.mlpackage
- KokoroDecoder_HAR_10s.mlpackage
- KokoroDecoder_HAR_15s.mlpackage
- KokoroDecoder_HAR_20s.mlpackage
- KokoroDecoder_HAR.mlpackage
Decoder-Only Variants
- kokoro_decoder_only_3s.mlpackage
- kokoro_decoder_only_5s.mlpackage
- kokoro_decoder_only_10s.mlpackage
F0 / Feature Variants
- kokoro_f0n_3s.mlpackage
- kokoro_f0n_5s.mlpackage
- kokoro_f0n_10s.mlpackage
Vocoder Variants
- KokoroVocoder.mlpackage
- KokoroVocoder_asr64_f0128.mlpackage
- KokoroVocoder_asr80_f0160.mlpackage
- KokoroVocoder_asr96_f0192.mlpackage
- KokoroVocoder_asr128_f0256.mlpackage
- KokoroVocoder_asr160_f0320.mlpackage
- KokoroVocoder_asr200_f0400.mlpackage
Experimental / Alternative
- kokoro_synthesizer_3s.mlpackage
- kokoro_synthesizer_3s_nolstm.mlpackage
- StyleTTS2_iSTFTNet_Decoder.mlpackage
Architecture
This CoreML conversion uses a two-stage pipeline to support Kokoro's dynamic operations while maximizing ANE performance.
Stage 1: Duration Model (CPU/GPU)
Input: Variable-length text (ct.RangeDim)
Process: Transformer + LSTM duration prediction
Output: Phoneme durations + intermediate features
Compute: CPU / GPU
Why not the ANE?
- LSTM layers are not ANE-compatible
- The text input has a dynamic shape
Stage 2: HAR Decoder (ANE Optimized)
Input:
- Features from duration model
- Alignment matrix (built client-side)
Process: Vocoder synthesis using the iSTFTNet architecture
Output: 24 kHz waveform audio
Compute: Apple Neural Engine
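As a rough illustration of the client-side alignment step, the sketch below expands per-phoneme durations into a 0/1 alignment matrix padded to a fixed frame count. The function name `build_alignment`, the phonemes-by-frames layout, and the zero-padding of unused frames are assumptions made for this example. The actual tensor layout expected by the HAR decoder buckets should be checked against the model's input description.

```python
def build_alignment(durations, total_frames):
    """Expand per-phoneme frame counts into a [num_phonemes][total_frames] 0/1 matrix.

    durations    -- predicted frame count for each phoneme (non-negative ints)
    total_frames -- fixed frame capacity of the chosen bucket (hypothetical parameter)
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    if sum(durations) > total_frames:
        raise ValueError("utterance does not fit in this bucket")

    matrix = [[0.0] * total_frames for _ in durations]
    frame = 0
    for i, d in enumerate(durations):
        # Phoneme i owns the next d consecutive frames.
        for t in range(frame, frame + d):
            matrix[i][t] = 1.0
        frame += d
    # Frames past sum(durations) stay all-zero (padding).
    return matrix


alignment = build_alignment([2, 1, 3], total_frames=8)
```

Each used frame column contains exactly one 1.0, so multiplying phoneme-level features by this matrix repeats every phoneme's features for its predicted number of frames.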
Key Innovations
- HAR Processing: Harmonic/phase separation for ANE efficiency
- Fixed-size Buckets: Avoid CoreML dynamic shape issues
- Client-side Alignment: Swift/Python builds the alignment matrix
- On-demand Model Loading: Memory optimized
- MIL Graph Patching: CoreML compatibility fixes
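The fixed-size bucket idea can be sketched as picking the smallest bucket that fits the predicted audio length. The bucket lengths below come from the `KokoroDecoder_HAR_*s.mlpackage` file names in this repository. The helper names `select_bucket` and `bucket_filename`, and the choice to return `None` when no bucket is long enough, are assumptions for illustration rather than the project's actual selection logic.

```python
from bisect import bisect_left

# Bucket lengths in seconds, taken from the KokoroDecoder_HAR_*s file names.
HAR_BUCKETS_S = (1, 2, 3, 5, 8, 10, 15, 20)


def select_bucket(duration_s, available=HAR_BUCKETS_S):
    """Return the smallest available bucket (in seconds) that fits duration_s.

    Returns None when no available bucket is long enough, in which case the
    caller would need to split the text into shorter chunks.
    """
    buckets = sorted(available)
    i = bisect_left(buckets, duration_s)
    return buckets[i] if i < len(buckets) else None


def bucket_filename(bucket_s):
    """Map a bucket length to its model package name."""
    return f"KokoroDecoder_HAR_{bucket_s}s.mlpackage"


bucket = select_bucket(2.4)          # smallest bucket that holds 2.4 s of audio
package = bucket_filename(bucket)
```

Passing a reduced `available` tuple models the case where only some buckets are present on the device: the next larger one is chosen instead.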
Performance
Runs on ANE (HAR Models)
- Conv1D
- ConvTranspose1D
- LeakyReLU
- Element-wise ops
Result: ~17× faster than real-time synthesis
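To make the speed figure concrete, the small calculation below converts a real-time speedup into expected synthesis time. It takes the approximate 17× figure and the 24 kHz output rate from this README at face value. The function names are hypothetical, and actual timings will depend on the device and bucket used.

```python
SAMPLE_RATE_HZ = 24_000   # output sample rate stated in this README
SPEEDUP = 17.0            # approximate figure quoted above, not a measurement made here


def synthesis_time_s(audio_s, speedup=SPEEDUP):
    """Estimated wall-clock seconds needed to synthesize audio_s seconds of speech."""
    return audio_s / speedup


def num_samples(audio_s, sample_rate=SAMPLE_RATE_HZ):
    """Number of waveform samples in audio_s seconds of output."""
    return round(audio_s * sample_rate)


estimate = synthesis_time_s(10.0)   # a 10 s bucket would take roughly 0.6 s
samples = num_samples(10.0)         # 240000 samples at 24 kHz
```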
Runs on CPU/GPU (Duration Model)
- LSTM layers
- Transformer attention
- AdaLayerNorm
- Dynamic shape processing
Production Optimizations
- Bucket auto-selection
- ~200MB per loaded model
- Warm-up optimization
- Graceful bucket fallback
- Memory cleanup during idle
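One way to combine on-demand loading with idle cleanup is a small least-recently-used cache, sketched below. The class `ModelCache`, its `loader` callback, and the `max_loaded` and `max_idle_s` parameters are hypothetical names for this example. In a Python client the loader would typically wrap something like `coremltools.models.MLModel(path)`, but that wiring is an assumption and is left out so the sketch stays self-contained.

```python
import time
from collections import OrderedDict


class ModelCache:
    """Keeps at most max_loaded models in memory, evicting the least recently used."""

    def __init__(self, loader, max_loaded=2, clock=time.monotonic):
        self._loader = loader          # callable: model name -> loaded model
        self._max_loaded = max_loaded
        self._clock = clock            # injectable for testing
        self._models = OrderedDict()   # name -> (model, last_used_time)

    def get(self, name):
        """Return the model for name, loading it on first use."""
        if name in self._models:
            model, _ = self._models.pop(name)
        else:
            model = self._loader(name)
        # Re-insert at the end so this entry counts as most recently used.
        self._models[name] = (model, self._clock())
        while len(self._models) > self._max_loaded:
            self._models.popitem(last=False)   # drop least recently used
        return model

    def evict_idle(self, max_idle_s):
        """Drop models unused for more than max_idle_s seconds; return their names."""
        now = self._clock()
        stale = [n for n, (_, used) in self._models.items() if now - used > max_idle_s]
        for n in stale:
            del self._models[n]
        return stale

    def loaded(self):
        """Names of currently loaded models, least recently used first."""
        return list(self._models)
```

With roughly 200MB per loaded model, a small `max_loaded` bounds peak memory, and calling `evict_idle` from a periodic timer releases buckets that have not been used recently.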
Downloading
Because each .mlpackage is a directory rather than a single file, download the repository with the Hugging Face CLI:
huggingface-cli download <username>/<repo> --local-dir .
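The same download can be done from Python with `snapshot_download` from the `huggingface_hub` package, which also allows fetching only the packages you need. The sketch below is a minimal example under stated assumptions: `bucket_patterns` and `download_models` are hypothetical helper names, the repository id is left as a parameter because it is not given here, and the glob patterns assume the packages sit at the repository root.

```python
def bucket_patterns(bucket_s):
    """Glob patterns for the duration model plus one HAR decoder bucket."""
    return [
        "kokoro_duration.mlpackage/*",
        f"KokoroDecoder_HAR_{bucket_s}s.mlpackage/*",
    ]


def download_models(repo_id, local_dir=".", bucket_s=None):
    """Download the whole repo, or only one bucket when bucket_s is given.

    repo_id is the "<username>/<repo>" identifier of this model repository.
    """
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    patterns = bucket_patterns(bucket_s) if bucket_s is not None else None
    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=patterns,
    )
```

Restricting the download to a single bucket keeps the initial fetch small; other buckets can be downloaded later with another call.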
License: MIT