Octen-Embedding-8B - GGUF Quantizations

GGUF quantizations of Octen/Octen-Embedding-8B, converted using llama.cpp b8110.

Octen-Embedding-8B is a fine-tune of Qwen/Qwen3-Embedding-8B and is ranked #1 on the RTEB leaderboard.

Quantized by tex8, a platform building AI-native web solutions and cloud services.

Available Quantizations

File | Quant | Size | Description
Octen-Embedding-8B-Q4_K_M.gguf | Q4_K_M | 4.0 GB | Good balance of size and quality
Octen-Embedding-8B-Q6_K.gguf | Q6_K | 6.5 GB | High quality, moderate size
Octen-Embedding-8B-Q8_0.gguf | Q8_0 | 8.0 GB | Near-lossless, recommended

All quantizations were created with --leave-output-tensor and --token-embedding-type F16 to preserve embedding quality.
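
To fetch a specific quantization programmatically, the huggingface_hub library can be used. This is a minimal sketch (assumes the huggingface_hub package is installed; file names match the table above):

from huggingface_hub import hf_hub_download

# Download one quantization into the local Hugging Face cache
model_path = hf_hub_download(
    repo_id="tex8/Octen-Embedding-8B-GGUF",
    filename="Octen-Embedding-8B-Q8_0.gguf",
)
print(model_path)  # path to the cached GGUF file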

Usage with llama.cpp

# Qwen3-Embedding-based models use last-token pooling
llama-embedding \
  -m Octen-Embedding-8B-Q8_0.gguf \
  --pooling last \
  -p "Your text here"

Usage with llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="Octen-Embedding-8B-Q8_0.gguf",
    embedding=True,    # return embeddings instead of generating text
    n_gpu_layers=-1,   # offload all layers to the GPU if available
    n_ctx=2048,
)

result = llm.create_embedding("Your text here")
embedding = result['data'][0]['embedding']  # 4096-dim vector
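
For retrieval-style use, embeddings are typically compared by cosine similarity. Here is a minimal sketch reusing the llm object above (numpy is assumed to be installed; note that Qwen3-Embedding-based models typically expect queries to carry a task-instruction prefix, so check the upstream model card for the exact prompt format):

import numpy as np

def embed(text):
    # create_embedding returns one embedding per input under result['data']
    return np.array(llm.create_embedding(text)['data'][0]['embedding'])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("What is the capital of France?")
docs = [
    embed("Paris is the capital of France."),
    embed("Mitochondria produce most of a cell's ATP."),
]

# Higher cosine similarity = more semantically similar
print([cosine(query, d) for d in docs])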

Conversion Command

# Step 1: Convert to F16
python convert_hf_to_gguf.py Octen/Octen-Embedding-8B \
  --outfile Octen-Embedding-8B-f16.gguf \
  --outtype f16

# Step 2: Quantize
llama-quantize \
  --leave-output-tensor \
  --token-embedding-type F16 \
  Octen-Embedding-8B-f16.gguf \
  Octen-Embedding-8B-Q8_0.gguf Q8_0
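
To sanity-check a quantization, you can compare its embeddings against the F16 reference. This is a rough sketch using llama-cpp-python (assumes both GGUF files are present locally; a cosine similarity close to 1.0 for representative inputs suggests the quantization is near-lossless):

import numpy as np
from llama_cpp import Llama

def embed_with(path, text):
    # Load a model and embed a single string
    llm = Llama(model_path=path, embedding=True, n_ctx=2048, verbose=False)
    return np.array(llm.create_embedding(text)['data'][0]['embedding'])

text = "GGUF quantization sanity check."
ref = embed_with("Octen-Embedding-8B-f16.gguf", text)
q8  = embed_with("Octen-Embedding-8B-Q8_0.gguf", text)

sim = float(np.dot(ref, q8) / (np.linalg.norm(ref) * np.linalg.norm(q8)))
print(f"cosine similarity vs F16: {sim:.4f}")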