Bark Small - GGUF

Suno Bark (MIT license) converted to GGUF for native C++ inference with CrispASR.

Model details

  • Architecture: 3-stage hierarchical transformer (semantic → coarse → fine) + EnCodec decoder
  • Parameters: ~300M total across 3 GPT-2 sub-models
  • Output: 24 kHz mono PCM
  • Languages: 13 languages with pre-trained speaker prompts
  • German speakers: v2/de_speaker_0 through v2/de_speaker_9
  • License: MIT

Quantization table

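| File                 | Quant | Size   | Quality            |
|----------------------|-------|--------|--------------------|
| bark-small-f16.gguf  | F16   | 809 MB | Reference          |
| bark-small-q8_0.gguf | Q8_0  | 435 MB | Near-lossless      |
| bark-small-q4_k.gguf | Q4_K  | 235 MB | Good for real-time |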

Each variant packs the three sub-models (text/semantic, coarse acoustic, fine acoustic) and the EnCodec decoder into a single GGUF file, so no companion model is needed.
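
To put the table's sizes in perspective, the short sketch below computes each variant's size relative to the F16 reference. The sizes are copied from the table; the helper name `size_ratios` is only illustrative.

```python
# File sizes in MB, copied from the quantization table above.
SIZES_MB = {"F16": 809, "Q8_0": 435, "Q4_K": 235}


def size_ratios(sizes, reference="F16"):
    """Return each variant's size as a fraction of the reference size."""
    ref = sizes[reference]
    return {name: mb / ref for name, mb in sizes.items()}


ratios = size_ratios(SIZES_MB)
for name, ratio in ratios.items():
    print(f"{name}: {SIZES_MB[name]} MB, {ratio:.0%} of F16")
```

By this measure Q8_0 is roughly 54% and Q4_K roughly 29% of the F16 file size.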

Usage with CrispASR

# Auto-download and synthesize
crispasr --backend bark -m auto --tts "Hello, how are you today?" --tts-output hello.wav

# With a specific quantization
crispasr --backend bark -m bark-small-q4_k.gguf --tts "The quick brown fox" --tts-output fox.wav

# With a German speaker prompt (when supported)
crispasr --backend bark -m bark-small-q8_0.gguf --tts "Hallo Welt" --voice v2/de_speaker_3 --tts-output hallo.wav
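
When calling CrispASR from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the flags shown in the commands above; `build_tts_command` is a hypothetical helper, not part of CrispASR, and it assumes the flags behave the same when passed this way.

```python
import shlex


def build_tts_command(text, output, model="auto", voice=None):
    """Assemble a crispasr TTS argument list (hypothetical helper)."""
    cmd = ["crispasr", "--backend", "bark", "-m", model, "--tts", text]
    if voice is not None:
        # Speaker prompt, e.g. "v2/de_speaker_3"
        cmd += ["--voice", voice]
    cmd += ["--tts-output", output]
    return cmd


cmd = build_tts_command(
    "Hallo Welt",
    "hallo.wav",
    model="bark-small-q8_0.gguf",
    voice="v2/de_speaker_3",
)
# Print a shell-quoted version for inspection or logging.
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once `crispasr` is on the `PATH`.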

Conversion

Produced with:

python models/convert-bark-to-gguf.py --output bark-small-f16.gguf
crispasr-quantize bark-small-f16.gguf bark-small-q8_0.gguf q8_0
crispasr-quantize bark-small-f16.gguf bark-small-q4_k.gguf q4_k
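
After conversion, a quick sanity check is to read the fixed-size GGUF header. The sketch below assumes GGUF version 2 or later, where the header is the magic bytes `GGUF`, a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count. `read_gguf_header` is an illustrative helper, and the demo writes a synthetic header to a temporary file so the snippet runs without a model.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Read magic, version, tensor count and metadata KV count from a GGUF file."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


# Demo with a synthetic header (the counts here are made up, not Bark's).
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 5, 7))
header = read_gguf_header(tmp.name)
os.remove(tmp.name)
print(header)
```

Pointing `read_gguf_header` at `bark-small-f16.gguf` and at a quantized file should report the same tensor count if quantization kept every tensor, though that is an expectation rather than something this card states.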

Architecture details

Stage 1: Semantic model

  • GPT-2 (12 layers, 768-d) generating semantic tokens from text
  • BERT WordPiece tokenizer (119547 vocab)
  • Output: up to 768 semantic tokens

Stage 2: Coarse acoustic model

  • GPT-2 (12 layers, 1024-d) converting semantic → coarse EnCodec codes
  • Alternates codebook 0/1 prediction
  • Output: 2 × ~384 coarse tokens
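
Because the coarse stage alternates between codebook 0 and codebook 1, its flat output has to be split back into two per-codebook sequences before the fine stage. A minimal sketch, assuming the stream is interleaved one timestep at a time (c0, c1, c0, c1, ...) and ignoring any per-codebook token-id offsets the real model may apply:

```python
def deinterleave_coarse(flat_tokens, n_codebooks=2):
    """Split an interleaved token stream into one list per codebook."""
    if len(flat_tokens) % n_codebooks != 0:
        raise ValueError("token count must be a multiple of n_codebooks")
    # Every n-th token, starting at offset k, belongs to codebook k.
    return [flat_tokens[k::n_codebooks] for k in range(n_codebooks)]


# Toy stream: even positions are codebook 0, odd positions are codebook 1.
flat = [10, 500, 11, 501, 12, 502]
codebook0, codebook1 = deinterleave_coarse(flat)
print(codebook0)  # [10, 11, 12]
print(codebook1)  # [500, 501, 502]
```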

Stage 3: Fine acoustic model

  • Non-causal GPT-2 (12 layers, 1024-d)
  • Fills codebooks 2-7 from codebooks 0-1
  • Output: 8 codebooks × 384 timesteps
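
The data flow of this stage can be sketched as a loop that appends one codebook at a time. This assumes each new codebook is predicted from all codebooks filled so far, across every timestep at once; `dummy_predictor` is a stand-in invented for illustration and bears no relation to the real network's output.

```python
N_CODEBOOKS = 8  # total EnCodec codebooks
N_COARSE = 2     # codebooks 0-1 come from the coarse stage


def fill_fine(coarse, predict_codebook):
    """Return 8 codebooks: the 2 coarse ones plus 6 predicted ones."""
    if len(coarse) != N_COARSE:
        raise ValueError("expected exactly two coarse codebooks")
    codes = [list(row) for row in coarse]
    for k in range(N_COARSE, N_CODEBOOKS):
        # Predict codebook k for all timesteps from the codebooks known so far.
        codes.append(predict_codebook(k, codes))
    return codes


def dummy_predictor(k, known):
    """Placeholder for the fine model: NOT a real network."""
    return [(sum(column) + k) % 1024 for column in zip(*known)]


coarse = [[1, 2, 3], [4, 5, 6]]
codes = fill_fine(coarse, dummy_predictor)
print(len(codes), len(codes[0]))  # 8 3
```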

EnCodec decoder

  • 8-codebook RVQ (1024 entries each)
  • SEANet CNN decoder with ELU activation
  • Upsample ratios [8, 5, 4, 2] → 24 kHz
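
The upsample ratios determine how many audio samples each codebook timestep expands to, which in turn gives the frame rate and the clip length. A short worked calculation, assuming one timestep corresponds to one decoder frame:

```python
from math import prod

UPSAMPLE_RATIOS = [8, 5, 4, 2]  # from the decoder description above
SAMPLE_RATE_HZ = 24_000

# Each timestep is upsampled by the product of the ratios.
samples_per_frame = prod(UPSAMPLE_RATIOS)
frames_per_second = SAMPLE_RATE_HZ / samples_per_frame


def duration_seconds(n_timesteps):
    """Audio length produced by n_timesteps codebook frames."""
    return n_timesteps * samples_per_frame / SAMPLE_RATE_HZ


print(samples_per_frame)      # 320
print(frames_per_second)      # 75.0
print(duration_seconds(384))  # 5.12
```

Under this assumption, the 384 timesteps listed for the fine stage correspond to about 5.12 seconds of 24 kHz audio.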

Credits
