---
license: mit
---

Kokoro CoreML (HAR-Optimized)

High-performance Kokoro TTS CoreML conversion with Apple Neural Engine (ANE) optimized HAR decoder buckets.

This repository contains precompiled .mlpackage models for fast on-device speech synthesis on Apple platforms.


Based on the open-source project https://github.com/mattmireles/kokoro-coreml.

📦 Included Models

🧠 Duration Model (Stage 1)

  • kokoro_duration.mlpackage

Handles variable-length text and predicts phoneme durations + intermediate features.


🔊 HAR Decoder Buckets (Stage 2 – ANE Optimized)

Fixed-size audio synthesis models:

  • KokoroDecoder_HAR_1s.mlpackage
  • KokoroDecoder_HAR_2s.mlpackage
  • KokoroDecoder_HAR_3s.mlpackage
  • KokoroDecoder_HAR_5s.mlpackage
  • KokoroDecoder_HAR_8s.mlpackage
  • KokoroDecoder_HAR_10s.mlpackage
  • KokoroDecoder_HAR_15s.mlpackage
  • KokoroDecoder_HAR_20s.mlpackage
  • KokoroDecoder_HAR.mlpackage

πŸ” Decoder-Only Variants

  • kokoro_decoder_only_3s.mlpackage
  • kokoro_decoder_only_5s.mlpackage
  • kokoro_decoder_only_10s.mlpackage

🎛 F0 / Feature Variants

  • kokoro_f0n_3s.mlpackage
  • kokoro_f0n_5s.mlpackage
  • kokoro_f0n_10s.mlpackage

🔊 Vocoder Variants

  • KokoroVocoder.mlpackage
  • KokoroVocoder_asr64_f0128.mlpackage
  • KokoroVocoder_asr80_f0160.mlpackage
  • KokoroVocoder_asr96_f0192.mlpackage
  • KokoroVocoder_asr128_f0256.mlpackage
  • KokoroVocoder_asr160_f0320.mlpackage
  • KokoroVocoder_asr200_f0400.mlpackage

🧪 Experimental / Alternative

  • kokoro_synthesizer_3s.mlpackage
  • kokoro_synthesizer_3s_nolstm.mlpackage
  • StyleTTS2_iSTFTNet_Decoder.mlpackage

πŸ“ Architecture

This CoreML conversion uses a two-stage pipeline to support Kokoro's dynamic operations while maximizing ANE performance.


Stage 1 – Duration Model (CPU/GPU)

Input: Variable-length text (ct.RangeDim)
Process: Transformer + LSTM duration prediction
Output: Phoneme durations + intermediate features
Compute: CPU / GPU

Why CPU?

  • LSTM layers are not ANE-compatible
  • Text processing requires dynamic shapes

Stage 2 – HAR Decoder (ANE Optimized)

Input:

  • Features from duration model
  • Alignment matrix (built client-side)

Process:
Vocoder synthesis using iSTFTNet architecture

Output:
24 kHz waveform audio

Compute:
Apple Neural Engine
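
The alignment matrix listed as a Stage 2 input is built client-side from the durations predicted in Stage 1. A minimal NumPy sketch of one plausible convention (the function name and the hard one-hot layout are assumptions; the actual Swift/Python implementation may differ):

```python
import numpy as np

def build_alignment(durations, num_frames):
    """Expand per-phoneme durations (in frames) into a hard one-hot
    alignment matrix of shape (num_phonemes, num_frames).

    Row i is 1.0 over the frame span assigned to phoneme i, so multiplying
    phoneme features by this matrix repeats each feature vector for the
    duration of its phoneme.
    """
    alignment = np.zeros((len(durations), num_frames), dtype=np.float32)
    frame = 0
    for i, d in enumerate(durations):
        end = min(frame + int(d), num_frames)
        alignment[i, frame:end] = 1.0
        frame = end
    return alignment

# Example: 3 phonemes lasting 2, 3, and 1 frames in a 6-frame bucket.
A = build_alignment([2, 3, 1], num_frames=6)
```

Because the decoder buckets are fixed-size, `num_frames` would correspond to the selected bucket length, with any unused tail left as zeros.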


🚀 Key Innovations

  • HAR Processing – Harmonic/phase separation for ANE efficiency
  • Fixed-size Buckets – Avoid CoreML dynamic shape issues
  • Client-side Alignment – Swift/Python builds alignment matrix
  • On-demand Model Loading – Memory optimized
  • MIL Graph Patching – CoreML compatibility fixes
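
On-demand model loading can be as simple as a lazy cache keyed by bucket name, with an explicit eviction hook for idle cleanup. A sketch under assumed names (`loader` stands in for whatever CoreML loading call the host app uses, e.g. constructing an `MLModel` from an `.mlpackage` path):

```python
class BucketCache:
    """Lazily load decoder buckets and release them when idle."""

    def __init__(self, loader):
        self._loader = loader   # callable: bucket name -> loaded model
        self._models = {}

    def get(self, bucket_name):
        # Load on first use only; each resident model costs ~200 MB.
        if bucket_name not in self._models:
            self._models[bucket_name] = self._loader(bucket_name)
        return self._models[bucket_name]

    def evict_all(self):
        # Called during idle periods to free memory.
        self._models.clear()

# Usage with a stub loader:
cache = BucketCache(lambda name: f"<model {name}>")
model = cache.get("KokoroDecoder_HAR_3s")
```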

⚡ Performance

Runs on ANE (HAR Models)

  • Conv1D
  • ConvTranspose1D
  • LeakyReLU
  • Element-wise ops

Result: ~17× faster than real-time synthesis


Runs on CPU/GPU (Duration Model)

  • LSTM layers
  • Transformer attention
  • AdaLayerNorm
  • Dynamic shape processing

🧠 Production Optimizations

  • Bucket auto-selection
  • ~200 MB per loaded model
  • Warm-up optimization
  • Graceful bucket fallback
  • Memory cleanup during idle
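
Bucket auto-selection picks the smallest fixed-size decoder that fits the predicted audio length. A minimal sketch using the bucket sizes listed above (the exact selection and fallback policy is an assumption; longer clips would need chunking across buckets):

```python
# Bucket durations in seconds, matching the KokoroDecoder_HAR_*s packages.
BUCKETS = [1, 2, 3, 5, 8, 10, 15, 20]

def select_bucket(duration_s):
    """Return the smallest bucket that holds `duration_s` seconds of audio,
    falling back to the largest bucket when the clip exceeds all of them."""
    for b in BUCKETS:
        if duration_s <= b:
            return b
    return BUCKETS[-1]
```

For example, a 2.4-second utterance would run on the 3 s decoder, with the trailing 0.6 s of output discarded.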

📥 Downloading

Because an .mlpackage is a directory rather than a single file, download with the Hugging Face CLI:

huggingface-cli download <username>/<repo> --local-dir .
