# Octen-Embedding-8B - GGUF Quantizations
GGUF quantizations of Octen/Octen-Embedding-8B, converted using llama.cpp b8110.
Octen-Embedding-8B is a fine-tune of Qwen/Qwen3-Embedding-8B, ranked #1 on the RTEB Leaderboard.
Quantized by tex8, a platform building AI-native web solutions and cloud services.
## Available Quantizations
| File | Quant | Size | Description |
|---|---|---|---|
| Octen-Embedding-8B-Q4_K_M.gguf | Q4_K_M | 4.0 GB | Good balance of size and quality |
| Octen-Embedding-8B-Q6_K.gguf | Q6_K | 6.5 GB | High quality, moderate size |
| Octen-Embedding-8B-Q8_0.gguf | Q8_0 | 8.0 GB | Near-lossless, recommended |
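To grab a single quant without cloning the whole repository, `huggingface_hub` can download one file at a time. A minimal sketch; the `repo_id` below is a placeholder, substitute this repository's actual id:

```python
# Sketch: download one quant file with huggingface_hub.
# NOTE: the repo_id is a placeholder -- replace it with this card's actual repository id.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="tex8/Octen-Embedding-8B-GGUF",      # placeholder repo id
    filename="Octen-Embedding-8B-Q8_0.gguf",     # any file from the table above
)
print(model_path)  # local path inside the Hugging Face cache
```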
All quantizations were created with --leave-output-tensor and --token-embedding-type F16 to preserve embedding quality.
## Usage with llama.cpp
```bash
llama-embedding \
  -m Octen-Embedding-8B-Q8_0.gguf \
  --pooling last \
  -p "Your text here"
```
## Usage with llama-cpp-python
```python
from llama_cpp import Llama

llm = Llama(
    model_path="Octen-Embedding-8B-Q8_0.gguf",
    embedding=True,
    n_gpu_layers=-1,
    n_ctx=2048,
)

result = llm.create_embedding("Your text here")
embedding = result['data'][0]['embedding']  # 4096-dim vector
```
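The returned vectors can be used directly for similarity search. A small sketch that reuses the `llm` instance above and ranks two illustrative documents against a query by cosine similarity:

```python
# Sketch: rank a few documents against a query by cosine similarity.
# Reuses the `llm` instance created above; the texts are illustrative.
import numpy as np

def embed(text: str) -> np.ndarray:
    vec = llm.create_embedding(text)["data"][0]["embedding"]
    return np.asarray(vec, dtype=np.float32)

query = embed("How do I quantize a GGUF model?")
docs = {
    "quantize": embed("llama-quantize converts an F16 GGUF into a smaller quantized file."),
    "cooking": embed("Bring the water to a boil before adding the pasta."),
}

for name, vec in docs.items():
    score = float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec)))
    print(f"{name}: {score:.4f}")
```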
## Conversion Command
```bash
# Step 1: Convert to F16
python convert_hf_to_gguf.py Octen/Octen-Embedding-8B \
  --outfile Octen-Embedding-8B-f16.gguf \
  --outtype f16

# Step 2: Quantize
llama-quantize \
  --leave-output-tensor \
  --token-embedding-type F16 \
  Octen-Embedding-8B-f16.gguf \
  Octen-Embedding-8B-Q8_0.gguf Q8_0
```
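A quick way to sanity-check a quant is to compare its embeddings against the F16 reference on a few texts; cosine similarity close to 1.0 suggests the quantization preserved the embedding geometry. A sketch using llama-cpp-python (file names match the commands above):

```python
# Sketch: compare embeddings from the F16 reference and a quantized file.
# Similarities near 1.0 suggest the quant preserves embedding quality.
import numpy as np
from llama_cpp import Llama

def embed_with(path: str, text: str) -> np.ndarray:
    llm = Llama(model_path=path, embedding=True, n_ctx=2048, verbose=False)
    return np.asarray(llm.create_embedding(text)["data"][0]["embedding"], dtype=np.float32)

text = "GGUF quantization of an embedding model."
ref = embed_with("Octen-Embedding-8B-f16.gguf", text)
q8 = embed_with("Octen-Embedding-8B-Q8_0.gguf", text)

cos = float(ref @ q8 / (np.linalg.norm(ref) * np.linalg.norm(q8)))
print(f"cosine(F16, Q8_0) = {cos:.4f}")
```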