---
license: mit
---

Kokoro CoreML (HAR-Optimized)

High-performance Kokoro TTS CoreML conversion with Apple Neural Engine (ANE) optimized HAR decoder buckets.

This repository contains precompiled .mlpackage models for fast on-device speech synthesis on Apple platforms.


Based on the open-source project https://github.com/mattmireles/kokoro-coreml.

📦 Included Models

🧠 Duration Model (Stage 1)

  • kokoro_duration.mlpackage

Handles variable-length text and predicts phoneme durations + intermediate features.


🔊 HAR Decoder Buckets (Stage 2 – ANE Optimized)

Fixed-size audio synthesis models:

  • KokoroDecoder_HAR_1s.mlpackage
  • KokoroDecoder_HAR_2s.mlpackage
  • KokoroDecoder_HAR_3s.mlpackage
  • KokoroDecoder_HAR_5s.mlpackage
  • KokoroDecoder_HAR_8s.mlpackage
  • KokoroDecoder_HAR_10s.mlpackage
  • KokoroDecoder_HAR_15s.mlpackage
  • KokoroDecoder_HAR_20s.mlpackage
  • KokoroDecoder_HAR.mlpackage

πŸ” Decoder-Only Variants

  • kokoro_decoder_only_3s.mlpackage
  • kokoro_decoder_only_5s.mlpackage
  • kokoro_decoder_only_10s.mlpackage

🎛 F0 / Feature Variants

  • kokoro_f0n_3s.mlpackage
  • kokoro_f0n_5s.mlpackage
  • kokoro_f0n_10s.mlpackage

🔊 Vocoder Variants

  • KokoroVocoder.mlpackage
  • KokoroVocoder_asr64_f0128.mlpackage
  • KokoroVocoder_asr80_f0160.mlpackage
  • KokoroVocoder_asr96_f0192.mlpackage
  • KokoroVocoder_asr128_f0256.mlpackage
  • KokoroVocoder_asr160_f0320.mlpackage
  • KokoroVocoder_asr200_f0400.mlpackage

🧪 Experimental / Alternative

  • kokoro_synthesizer_3s.mlpackage
  • kokoro_synthesizer_3s_nolstm.mlpackage
  • StyleTTS2_iSTFTNet_Decoder.mlpackage

πŸ“ Architecture

This CoreML conversion uses a two-stage pipeline to support Kokoro's dynamic operations while maximizing ANE performance.


Stage 1 – Duration Model (CPU/GPU)

Input: Variable-length text (ct.RangeDim)
Process: Transformer + LSTM duration prediction
Output: Phoneme durations + intermediate features
Compute: CPU / GPU

Why CPU?

  • LSTM layers are not ANE-compatible
  • Text processing requires dynamic shapes

Stage 2 – HAR Decoder (ANE Optimized)

Input:

  • Features from duration model
  • Alignment matrix (built client-side)

Process:
Vocoder synthesis using iSTFTNet architecture

Output:
24 kHz waveform audio

Compute:
Apple Neural Engine
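
The alignment matrix listed as a Stage 2 input is built client-side from the durations predicted in Stage 1. A minimal NumPy sketch of one plausible convention (the function name and the hard one-hot layout are assumptions; the actual Swift/Python implementation may differ):

```python
import numpy as np

def build_alignment(durations, num_frames):
    """Expand per-phoneme durations (in frames) into a hard one-hot
    alignment matrix of shape (num_phonemes, num_frames).

    Row i is 1.0 over the frame span assigned to phoneme i, so multiplying
    phoneme features by this matrix repeats each feature vector for the
    duration of its phoneme.
    """
    alignment = np.zeros((len(durations), num_frames), dtype=np.float32)
    frame = 0
    for i, d in enumerate(durations):
        end = min(frame + int(d), num_frames)
        alignment[i, frame:end] = 1.0
        frame = end
    return alignment

# Example: 3 phonemes lasting 2, 3, and 1 frames in a 6-frame bucket.
A = build_alignment([2, 3, 1], num_frames=6)
```

Because the decoder buckets are fixed-size, `num_frames` would correspond to the selected bucket length, with any unused tail left as zeros.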


🚀 Key Innovations

  • HAR Processing – Harmonic/phase separation for ANE efficiency
  • Fixed-size Buckets – Avoid CoreML dynamic shape issues
  • Client-side Alignment – Swift/Python builds alignment matrix
  • On-demand Model Loading – Memory optimized
  • MIL Graph Patching – CoreML compatibility fixes
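
On-demand model loading can be as simple as a lazy cache keyed by bucket name, with an explicit eviction hook for idle cleanup. A sketch under assumed names (`loader` stands in for whatever CoreML loading call the host app uses, e.g. constructing an `MLModel` from an `.mlpackage` path):

```python
class BucketCache:
    """Lazily load decoder buckets and release them when idle."""

    def __init__(self, loader):
        self._loader = loader   # callable: bucket name -> loaded model
        self._models = {}

    def get(self, bucket_name):
        # Load on first use only; each resident model costs ~200 MB.
        if bucket_name not in self._models:
            self._models[bucket_name] = self._loader(bucket_name)
        return self._models[bucket_name]

    def evict_all(self):
        # Called during idle periods to free memory.
        self._models.clear()

# Usage with a stub loader:
cache = BucketCache(lambda name: f"<model {name}>")
model = cache.get("KokoroDecoder_HAR_3s")
```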

⚡ Performance

Runs on ANE (HAR Models)

  • Conv1D
  • ConvTranspose1D
  • LeakyReLU
  • Element-wise ops

Result: ~17× faster than real-time synthesis


Runs on CPU/GPU (Duration Model)

  • LSTM layers
  • Transformer attention
  • AdaLayerNorm
  • Dynamic shape processing

🧠 Production Optimizations

  • Bucket auto-selection
  • ~200 MB per loaded model
  • Warm-up optimization
  • Graceful bucket fallback
  • Memory cleanup during idle
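
Bucket auto-selection picks the smallest fixed-size decoder that fits the predicted audio length. A minimal sketch using the bucket sizes listed above (the exact selection and fallback policy is an assumption; longer clips would need chunking across buckets):

```python
# Bucket durations in seconds, matching the KokoroDecoder_HAR_*s packages.
BUCKETS = [1, 2, 3, 5, 8, 10, 15, 20]

def select_bucket(duration_s):
    """Return the smallest bucket that holds `duration_s` seconds of audio,
    falling back to the largest bucket when the clip exceeds all of them."""
    for b in BUCKETS:
        if duration_s <= b:
            return b
    return BUCKETS[-1]
```

For example, a 2.4-second utterance would run on the 3 s decoder, with the trailing 0.6 s of output discarded.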

📥 Downloading

Because an .mlpackage is a directory rather than a single file, download with the Hugging Face CLI:

huggingface-cli download <username>/<repo> --local-dir .
