---
license: mit
language:
- en
base_model:
- inclusionAI/Ling-mini-base-2.0-20T
pipeline_tag: text-generation
library_name: transformers
tags:
- moe
---
# Ring-mini-sparse-2.0-exp
<p align="center">
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
</p>
<p align="center">🤗 <a href="https://huggingface.co/inclusionAI/Ring-mini-sparse-2.0-exp">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/inclusionAI/Ring-mini-sparse-2.0-exp">ModelScope</a></p>
## Introduction
We are excited to announce the official release of Ring-mini-sparse-2.0-exp. This model employs a Mixture of Block Attention (MoBA) architecture, delivering highly efficient inference without compromising performance. It inherits from [Ling-mini-base-2.0](https://huggingface.co/inclusionAI/Ling-mini-base-2.0-20T) and was continually trained on an additional 100B tokens. The MoBA-based model performs on par with standard-attention models of the same size (e.g., Ring-mini-2.0). Furthermore, by applying YaRN-based 4× window extrapolation, we extend the context length to 128K tokens, delivering superior inference speed on tasks with long inputs and outputs.
<div style="display: flex; justify-content: center;">
<div style="text-align: center;">
<img src="https://mdn.alipayobjects.com/huamei_9mcypc/afts/img/PIoSTKEzmsEAAAAAU5AAAAgADlCHAQFr/original" width="800">
<p style="margin-top: 8px; font-size: 14px;"><strong>Figure 1:</strong> The Model Architecture of Ring-mini-sparse-2.0-exp</p>
</div>
</div>
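To make the MoBA idea concrete, here is a minimal, illustrative sketch of block-wise attention selection at a single decoding step: cached keys are grouped into fixed-size blocks, each block is summarized by mean pooling, and the query attends only to the top-scoring blocks. The function name and the `block_size` and `top_k` values are assumptions for illustration, not the model's actual configuration.
```python
import torch

def moba_block_select(q, k_cache, block_size=128, top_k=8):
    """Pick which KV blocks a single decoding query attends to.

    Hypothetical helper: block_size and top_k are illustrative defaults,
    not the model's actual configuration.
    """
    seq_len, head_dim = k_cache.shape
    n_blocks = seq_len // block_size
    # Summarize each block of keys with mean pooling.
    block_keys = (
        k_cache[: n_blocks * block_size]
        .view(n_blocks, block_size, head_dim)
        .mean(dim=1)
    )
    # Score every block against the query and keep only the top-k, so
    # attention runs over top_k * block_size tokens instead of the full cache.
    scores = block_keys @ q  # [n_blocks]
    return scores.topk(min(top_k, n_blocks)).indices

q = torch.randn(128)             # current query vector (head_dim = 128)
k = torch.randn(4096, 128)       # cached keys for 4,096 past tokens
print(moba_block_select(q, k))   # indices of the 8 blocks this query reads
```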
## Evaluation
To comprehensively assess the reasoning capability of our model, we conducted evaluations on five challenging benchmarks spanning mathematics, coding, and science, comparing it with Ring-mini-2.0, Qwen3-8B-Thinking, and GPT-OSS-20B-Medium. The MoBA architecture demonstrates comparable performance to full softmax attention models.
<div style="display: flex; justify-content: center;">
<div style="text-align: center;">
<img src="https://mdn.alipayobjects.com/huamei_9mcypc/afts/img/Yr7eRreHNNUAAAAAWfAAAAgADlCHAQFr/original" width="100%">
<p style="margin-top: 8px; font-size: 14px;"><strong>Figure 2:</strong> Model Performance Comparison </p>
</div>
</div>
## Highly Sparse, High-Speed Generation
Ring-mini-sparse-2.0-exp achieves high inference efficiency by combining highly sparse attention with a Mixture-of-Experts (MoE) architecture. Unlike the MoBA variant used in Kimi, our approach shares the same KV block selection across all heads within a GQA group, which reduces the total number of KV tokens each query head must retrieve from the KV cache during decoding. During 64K-context decoding, only 8,192 key-value (KV) tokens are activated per query, cutting KV-cache retrieval overhead by 87.5% relative to full attention and delivering up to a 3× inference speedup over Ring-mini-2.0. This design significantly lowers the computational cost of high-concurrency, reasoning-intensive workloads while maintaining competitive performance. Additionally, with YaRN extrapolation the model extends its context window to 128K tokens, achieving up to a 2× relative speedup in long-input scenarios compared to Ring-mini-2.0 (full softmax attention).
<div style="text-align: center;">
<p align="center">
<img src="https://mdn.alipayobjects.com/huamei_9mcypc/afts/img/iL_eTZP-FVEAAAAATOAAAAgADlCHAQFr/original" width="500">
</p>
<p style="margin-top: 8px; font-size: 14px;"><strong>Figure 3:</strong> Inference speedup ratios of Ring-mini-sparse-2.0-exp compared to Ring-mini-2.0.</p>
</div>
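The 87.5% figure follows directly from the numbers above, as the back-of-envelope sketch below shows; the group-size comment is purely illustrative, since the actual GQA configuration is not stated here.
```python
# Back-of-envelope for the decoding KV budget quoted above.
context_len  = 64 * 1024   # 65,536 cached KV tokens at a 64K context
activated_kv = 8_192       # KV tokens each query actually attends to
print(f"KV retrieval reduced by {1 - activated_kv / context_len:.1%}")  # 87.5%

# Because block selection is shared across a GQA group, the chosen blocks
# are gathered from the cache once per group rather than once per query
# head. With a hypothetical 4 query heads per KV head, that is 4x fewer
# gather operations for the same attention result.
```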
## Quickstart
### 🤗 Hugging Face Transformers
Installation requirements:
```shell
pip install flash-attn==2.6.3
pip install transformers==4.56.1
```
Here is a code snippet to show you how to use the chat model with `transformers`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-mini-sparse-2.0-exp"

# Load the model with the custom MoBA attention code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompts = [
    "Give me a short introduction to large language models."
]

# Render each prompt with the chat template, appending the generation prompt.
input_texts = []
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    input_texts.append(text)
print(input_texts)

# Left-pad so batched generation aligns all prompts at the right edge.
model_inputs = tokenizer(
    input_texts,
    return_tensors="pt",
    return_token_type_ids=False,
    padding=True,
    padding_side="left",
).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192,
    do_sample=False,
)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print("*" * 30)
print(responses)
print("*" * 30)
```
### 🚀 SGLang
#### Environment Preparation
We have submitted a PR to the official SGLang repository; it will be merged in a later release. For now, prepare the environment as follows. First, install the community version of SGLang and the required packages:
```shell
pip install sglang==0.5.3 sgl-kernel==0.3.15 torch==2.8.0 torchvision==0.23.0 torchao
```
Then install our patched SGLang wheel:
```shell
git clone https://github.com/inclusionAI/Ring-V2.git
pip install Ring-V2/moba/whls/sglang-0.5.3.post1-py3-none-any.whl --no-deps --force-reinstall
```
#### Run Inference
Our model is now supported by SGLang. You can launch the server with the following command:
- Start server:
```shell
python -m sglang.launch_server \
    --model-path <model_path> \
    --trust-remote-code \
    --tp-size 4 \
    --disable-radix-cache \
    --chunked-prefill-size 0 \
    --attention-backend moba
```
- Client:
```shell
curl -s http://localhost:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "auto", "temperature": 0.6, "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}]}'
```
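Since the server exposes an OpenAI-compatible API, the same request can be issued from Python. This is a minimal sketch assuming the `openai` client package and SGLang's default port 30000; substitute whatever port your server actually binds.
```python
# Minimal OpenAI-compatible client sketch; port 30000 is SGLang's default
# and an assumption here, so match it to your --port setting.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="auto",
    temperature=0.6,
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(resp.choices[0].message.content)
```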
More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).