Octen-Embedding-8B - GGUF Quantizations

GGUF quantizations of Octen/Octen-Embedding-8B, converted using llama.cpp b8110.

Octen-Embedding-8B is a fine-tune of Qwen/Qwen3-Embedding-8B and is ranked #1 on the RTEB leaderboard.

Quantized by tex8, a platform building AI-native web solutions and cloud services.

Available Quantizations

File | Quant | Size | Description
Octen-Embedding-8B-Q4_K_M.gguf | Q4_K_M | 4.0 GB | Good balance of size and quality
Octen-Embedding-8B-Q6_K.gguf | Q6_K | 6.5 GB | High quality, moderate size
Octen-Embedding-8B-Q8_0.gguf | Q8_0 | 8.0 GB | Near-lossless, recommended

All quantizations were created with --leave-output-tensor and --token-embedding-type F16 to preserve embedding quality.
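
To fetch a specific quantization programmatically, the huggingface_hub library can be used. This is a minimal sketch (assumes the huggingface_hub package is installed; file names match the table above):

from huggingface_hub import hf_hub_download

# Download one quantization into the local Hugging Face cache
model_path = hf_hub_download(
    repo_id="tex8/Octen-Embedding-8B-GGUF",
    filename="Octen-Embedding-8B-Q8_0.gguf",
)
print(model_path)  # path to the cached GGUF file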

Usage with llama.cpp

# Qwen3-Embedding-based models use last-token pooling
llama-embedding \
  -m Octen-Embedding-8B-Q8_0.gguf \
  --pooling last \
  -p "Your text here"

Usage with llama-cpp-python

from llama_cpp import Llama

llm = Llama(
    model_path="Octen-Embedding-8B-Q8_0.gguf",
    embedding=True,    # return embeddings instead of generating text
    n_gpu_layers=-1,   # offload all layers to the GPU if available
    n_ctx=2048,
)

result = llm.create_embedding("Your text here")
embedding = result['data'][0]['embedding']  # 4096-dim vector
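
For retrieval-style use, embeddings are typically compared by cosine similarity. Here is a minimal sketch reusing the llm object above (numpy is assumed to be installed; note that Qwen3-Embedding-based models typically expect queries to carry a task-instruction prefix, so check the upstream model card for the exact prompt format):

import numpy as np

def embed(text):
    # create_embedding returns one embedding per input under result['data']
    return np.array(llm.create_embedding(text)['data'][0]['embedding'])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("What is the capital of France?")
docs = [
    embed("Paris is the capital of France."),
    embed("Mitochondria produce most of a cell's ATP."),
]

# Higher cosine similarity = more semantically similar
print([cosine(query, d) for d in docs])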

Conversion Command

# Step 1: Convert to F16
python convert_hf_to_gguf.py Octen/Octen-Embedding-8B \
  --outfile Octen-Embedding-8B-f16.gguf \
  --outtype f16

# Step 2: Quantize
llama-quantize \
  --leave-output-tensor \
  --token-embedding-type F16 \
  Octen-Embedding-8B-f16.gguf \
  Octen-Embedding-8B-Q8_0.gguf Q8_0
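
To sanity-check a quantization, you can compare its embeddings against the F16 reference. This is a rough sketch using llama-cpp-python (assumes both GGUF files are present locally; a cosine similarity close to 1.0 for representative inputs suggests the quantization is near-lossless):

import numpy as np
from llama_cpp import Llama

def embed_with(path, text):
    # Load a model and embed a single string
    llm = Llama(model_path=path, embedding=True, n_ctx=2048, verbose=False)
    return np.array(llm.create_embedding(text)['data'][0]['embedding'])

text = "GGUF quantization sanity check."
ref = embed_with("Octen-Embedding-8B-f16.gguf", text)
q8  = embed_with("Octen-Embedding-8B-Q8_0.gguf", text)

sim = float(np.dot(ref, q8) / (np.linalg.norm(ref) * np.linalg.norm(q8)))
print(f"cosine similarity vs F16: {sim:.4f}")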