---
base_model:
- openai/whisper-large-v3
base_model_relation: quantized
pipeline_tag: automatic-speech-recognition
language:
- en
- fr
- de
- es
- it
- pt
- nl
- ru
- zh
- ja
- ko
---

# Elastic model: whisper-large-v3

## Overview

----

ElasticModels are models produced by TheStage AI ANNA (Automated Neural Networks Accelerator). ANNA lets you control model size, latency, and quality with a single slider by routing different compression algorithms to different layers. For each model, we provide a series of optimized variants:

- **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
- **L**: Near-lossless model, with less than 1% accuracy degradation on the corresponding benchmarks.
- **M**: Faster model, with less than 1.5% accuracy degradation.
- **S**: The fastest model, with less than 2% accuracy degradation.
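
To illustrate how the tiers trade accuracy for compression, the sketch below picks the first tier, in S, M, L, XL order, whose stated degradation bound fits a given budget. The `pick_mode` helper and the `DEGRADATION_BOUND_PCT` table are hypothetical and not part of the ElasticModels SDK; the bounds are copied from the list above, and XL is assumed to be 0% because it is described as mathematically equivalent.

```python
# Hypothetical helper, not part of the ElasticModels SDK.
# Stated upper bounds on accuracy degradation (percent), in S, M, L, XL order.
DEGRADATION_BOUND_PCT = [("S", 2.0), ("M", 1.5), ("L", 1.0), ("XL", 0.0)]


def pick_mode(max_degradation_pct: float) -> str:
    """Return the first tier (S to XL) whose stated bound fits the budget."""
    for mode, bound in DEGRADATION_BOUND_PCT:
        if bound <= max_degradation_pct:
            return mode
    # Only reached for negative budgets, which no tier can satisfy.
    raise ValueError("max_degradation_pct must be non-negative")


print(pick_mode(1.7))  # M
```

The returned string is meant to be passed as the `mode` argument when loading a model.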

Models can be accessed via the TheStage AI Python SDK, ElasticModels, or deployed as Docker containers with REST API endpoints (see the Serving with Docker Image section).

## Installation

---

### System Requirements

| **Property** | **Value** |
| --- | --- |
| **GPU** | H100, L40s, B200, RTX 5090, RTX 4090 |
| **Python Version** | 3.10-3.12 |
| **CPU** | Intel/AMD x86_64 |
| **CUDA Version** | 12.9+ |
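
Before installing, it can help to confirm that the interpreter falls inside the supported range. The following is a minimal sketch using only the standard library; `python_supported` is a hypothetical helper, and the bounds are taken from the table above. It does not check the GPU or CUDA version.

```python
import sys

# Supported range from the requirements table: Python 3.10 through 3.12.
MIN_PY, MAX_PY = (3, 10), (3, 12)


def python_supported(version=None):
    """Return True if the (major, minor) version is inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PY <= major_minor <= MAX_PY


current = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {current} supported: {python_supported()}")
```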


### TheStage AI Access token setup

Install the TheStage AI CLI and set up your API token:

```bash
pip install thestage
thestage config set --api-token <YOUR_ACCESS_TOKEN>
```

### ElasticModels installation

Install the TheStage Elastic Models package:

```bash
pip install 'thestage-elastic-models[nvidia,cudnn]' \
    --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

To run on the Nvidia Blackwell architecture, install the package as follows:

```bash
pip install 'thestage-elastic-models[blackwell,cudnn]' \
    --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install -U --pre torch \
    --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U --pre torchvision \
    --index-url https://download.pytorch.org/whl/nightly/cu128
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

## Usage example

----

Elastic Models provides the same interface as Hugging Face Transformers. Here is an example of how to use the whisper-large-v3 model:

```python
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForSpeechSeq2Seq

# A Hugging Face token is currently required: the original
# weights are used for some of the layers, along with the
# original model configuration
model_name = "openai/whisper-large-v3"
hf_token = ''
device = torch.device("cuda")

# Create model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as in the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
  {
    "role": "system",
    "content": "You are a search bot, answer on user text queries."
  },
  {
    "role": "user",
    "content": prompt
  }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Validate answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```


## Quality Benchmarks

------------

We used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we ran the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.


### Quality Benchmark Results

| **Metric/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
| --- | --- | --- | --- | --- | --- |


## Datasets

-------

- **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
- **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
- **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
- **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

## Metrics

----------

- **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.
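
As a concrete illustration of the metric, here is a minimal sketch of exact-match accuracy. The `accuracy` function and the sample answers are hypothetical and are not taken from `lm_eval`.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up multiple-choice answers: three of four predictions match.
print(accuracy(["B", "A", "D", "C"], ["B", "A", "C", "C"]))  # 0.75
```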


## Latency Benchmarks

-----

We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.


### Latency Benchmark Results

Tokens per second for different model sizes on various GPUs.

| **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
| --- | --- | --- | --- | --- | --- |
| **H100** | 224 | N/A | N/A | 236 | N/A |
| **L40s** | 202 | N/A | N/A | 187 | 56 |
| **B200** | 199 | N/A | N/A | N/A | N/A |
| **GeForce RTX 4090** | 249 | N/A | N/A | N/A | 53 |
| **GeForce RTX 3090** | 201 | N/A | N/A | N/A | N/A |
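
To read the table as relative speedups, divide each optimized TPS value by the `Original` TPS on the same GPU. The sketch below does this for the two rows that report an `Original` baseline; the numbers are copied from the table, and the `speedups` helper is hypothetical.

```python
# Tokens per second copied from the table above. Only rows that
# report an "Original" baseline can be turned into a speedup.
TPS = {
    "L40s": {"S": 202, "XL": 187, "Original": 56},
    "GeForce RTX 4090": {"S": 249, "Original": 53},
}


def speedups(row):
    """Speedup of each optimized size relative to the original model."""
    baseline = row["Original"]
    return {
        size: round(tps / baseline, 2)
        for size, tps in row.items()
        if size != "Original"
    }


for gpu, row in TPS.items():
    print(gpu, speedups(row))
```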


## Benchmarking Methodology

----

The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the average latency was calculated.

> **Algorithm summary:**
> 1. Load the whisper-large-v3 model with the specified size (S, M, L, XL, original).
> 2. Move the model to the GPU.
> 3. Prepare a sample input prompt.
> 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
>    - Synchronize the GPU to flush any previous operations.
>    - Record the start time.
>    - Generate the text using the model.
>    - Synchronize the GPU again.
>    - Record the end time and calculate the time to first token (TTFT) and tokens per second (TPS) for that iteration.
> 5. Calculate the average TTFT and TPS over all iterations.
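
The measurement loop above can be sketched as follows. This is a simplified illustration, not the actual benchmarking script: `stream_tokens` is a hypothetical callable that returns an iterator of generated tokens, and `sync` stands in for the GPU barrier (on a CUDA device one would pass `torch.cuda.synchronize`). A fake token stream is used so the sketch runs without a GPU.

```python
import time
from statistics import mean


def benchmark(stream_tokens, n_iters=10, sync=lambda: None):
    """Return the average TTFT (seconds) and TPS over n_iters runs."""
    ttfts, tps_values = [], []
    for _ in range(n_iters):
        sync()  # flush any previously queued work
        start = time.perf_counter()
        first_token_at, n_tokens = None, 0
        for _token in stream_tokens():
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_tokens += 1
        sync()  # wait for generation to finish
        end = time.perf_counter()
        if n_tokens == 0:
            raise RuntimeError("the stream produced no tokens")
        ttfts.append(first_token_at - start)
        tps_values.append(n_tokens / (end - start))
    return mean(ttfts), mean(tps_values)


def fake_stream(n_tokens=20, delay=0.001):
    """Stand-in for a model: yields one token every `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield i


avg_ttft, avg_tps = benchmark(fake_stream, n_iters=3)
print(f"TTFT: {avg_ttft * 1000:.2f} ms, TPS: {avg_tps:.0f}")
```

Synchronizing before starting and after stopping the timer matters because GPU work is queued asynchronously; without the barriers, the timer could stop before generation has actually finished.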


## Serving with Docker Image

------------

For serving on Nvidia GPUs, we provide ready-to-use Docker containers with OpenAI-compatible API endpoints.
With these containers, you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers.
You can also use the container to run inference through the TheStage AI platform.

### Prebuilt image from ECR

| **GPU** | **Docker image name** |
| --- | --- |
| H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
| B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

Pull the Docker image for your Nvidia GPU and start the inference container:

```bash
docker pull <IMAGE_NAME>
```
```bash
docker run --rm -ti \
  --name serving_thestage_model \
  -p 8000:80 \
  -e AUTH_TOKEN=<AUTH_TOKEN> \
  -e MODEL_REPO=openai/whisper-large-v3 \
  -e MODEL_SIZE=<MODEL_SIZE> \
  -e MODEL_BATCH=<MAX_BATCH_SIZE> \
  -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
  -e THESTAGE_AUTH_TOKEN=<THESTAGE_ACCESS_TOKEN> \
  -v /mnt/hf_cache:/root/.cache/huggingface \
  <IMAGE_NAME>
```

| **Parameter**              | **Description**                                                                                      |
|----------------------------|------------------------------------------------------------------------------------------------------|
| `<MODEL_SIZE>`             | Available: S, M, L, XL.                                                                              |
| `<MAX_BATCH_SIZE>`         | Maximum batch size to process in parallel.                                                           |
| `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token.                                                                         |
| `<THESTAGE_ACCESS_TOKEN>`  | TheStage token generated on the platform (Profile -> Access tokens).                                 |
| `<AUTH_TOKEN>`             | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
| `<IMAGE_NAME>`             | Image name which you have pulled.                                                                    |

## Invocation

------

You can invoke the endpoint using curl as follows:

```bash
curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
    -H 'Authorization: Bearer 123' \
    -H 'Content-Type: application/json' \
    -H "X-Model-Name: whisper-large-v3-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged" \
    -d '{
        "messages":[{"role":"user","content":"Define AI"}]
    }'
```

Or using the OpenAI Python client:

```python
from openai import OpenAI

BASE_URL = "http://<your_ip>/v1"
API_KEY  = "123"
MODEL    = "whisper-large-v3-<MODEL_SIZE>-bs<MAX_BATCH_SIZE>-paged"

client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL,
    default_headers={"X-Model-Name": MODEL}
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Define AI"}
    ]
)

print(response.choices[0].message.content)
```

## Endpoint Parameters

-------------

### Method

> **POST** `/v1/chat/completions`

### Header Parameters

> `Authorization`: `string`
>
> Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.

> `Content-Type`: `string`
>
> Must be set to `application/json`.

> `X-Model-Name`: `string`
>
> Specifies the model to use for generation. Format: `whisper-large-v3-<size>-bs<batch_size>`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.

### Input Body

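> `messages`: `array`
>
> The conversation as a list of message objects, each with a `role` and a text `content` field. The user message carries the input text prompt.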
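
The header and body parameters can be assembled as in the sketch below, which uses only the standard library and does not send anything. `build_chat_request` is a hypothetical helper. The model name follows the format described under Header Parameters; because the invocation examples append `-paged` to that name, the suffix is exposed as an optional argument rather than assumed.

```python
import json


def build_chat_request(auth_token, size, batch_size, prompt, name_suffix=""):
    """Build headers and a JSON body for POST /v1/chat/completions."""
    model_name = f"whisper-large-v3-{size}-bs{batch_size}{name_suffix}"
    headers = {
        "Authorization": f"Bearer {auth_token}",  # must match AUTH_TOKEN
        "Content-Type": "application/json",
        "X-Model-Name": model_name,
    }
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
    return headers, body


headers, body = build_chat_request("123", "S", 4, "Define AI", name_suffix="-paged")
print(headers["X-Model-Name"])  # whisper-large-v3-S-bs4-paged
print(body)
```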


## Deploy on Modal

-----------------------

For more details, see the [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html) tutorial.

### Clone modal serving code

```shell
git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/examples/modal
```

### Configuration of environment variables

Set your environment variables in `modal_serving.py`:

```python
# modal_serving.py

ENVS = {
    "MODEL_REPO": "openai/whisper-large-v3",
    "MODEL_BATCH": "4",
    "THESTAGE_AUTH_TOKEN": "",
    "HUGGINGFACE_ACCESS_TOKEN": "",
    "PORT": "80",
    "PORT_HEALTH": "80",
    "HF_HOME": "/cache/huggingface",
}
```
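
Since the token fields in the template above are blank, a quick check before deploying can catch a forgotten value. The sketch below is a hypothetical helper, not part of `modal_serving.py`; treating exactly these two keys as required is an assumption based on them being empty in the template.

```python
# Assumption: these are the entries that must be filled in before deploying.
REQUIRED_KEYS = ("THESTAGE_AUTH_TOKEN", "HUGGINGFACE_ACCESS_TOKEN")


def missing_envs(envs, required=REQUIRED_KEYS):
    """Return the required keys that are absent or blank in envs."""
    return [key for key in required if not envs.get(key, "").strip()]


example_envs = {
    "MODEL_REPO": "openai/whisper-large-v3",
    "THESTAGE_AUTH_TOKEN": "",
    "HUGGINGFACE_ACCESS_TOKEN": "hf_example",  # placeholder value
}
print(missing_envs(example_envs))  # ['THESTAGE_AUTH_TOKEN']
```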

### Configuration of GPUs

Set your desired GPU type and autoscaling variables in `modal_serving.py`:

```python
# modal_serving.py

@app.function(
    image=image,
    gpu="B200",
    min_containers=8,
    max_containers=8,
    timeout=10000,
    ephemeral_disk=600 * 1024,
    volumes={"/opt/project/.cache": HF_CACHE},
    startup_timeout=60*20
)
@modal.web_server(
    80,
    label="openai/whisper-large-v3-test",
    startup_timeout=60*20
)
def serve():
    pass
```

### Run serving

```shell
modal serve modal_serving.py
```


## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai