---
base_model:
- openai/whisper-large-v3
base_model_relation: quantized
pipeline_tag: text-generation
language:
- en
- fr
- de
- es
- it
- pt
- nl
- ru
- zh
- ja
- ko
---

# Elastic model: whisper-large-v3

## Overview

----

ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA lets you control model size, latency and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:

- **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
- **L**: Near lossless model, with less than 1% degradation obtained on corresponding benchmarks.
- **M**: Faster model, with accuracy degradation less than 1.5%.
- **S**: The fastest model, with accuracy degradation less than 2%.

Models can be accessed via TheStage AI Python SDK: ElasticModels, or deployed as Docker containers with REST API endpoints (see Deploy section).

## Installation

---

### System Requirements

| **Property**| **Value** |
| --- | --- |
| **GPU** | H100, L40s, B200, RTX 5090, RTX 4090 |
| **Python Version** | 3.10-3.12 |
| **CPU** | Intel/AMD x86_64 |
| **CUDA Version** | 12.9+ |

### TheStage AI Access token setup

Install TheStage AI CLI and set up your API token:

```bash
pip install thestage
thestage config set --api-token <YOUR_API_TOKEN>
```

### ElasticModels installation

Install TheStage Elastic Models package:

```bash
pip install 'thestage-elastic-models[nvidia,cudnn]' \
  --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

To run on the Nvidia Blackwell architecture, install the package as follows:

```bash
pip install 'thestage-elastic-models[blackwell,cudnn]' \
  --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install -U --pre torch \
  --index-url https://download.pytorch.org/whl/nightly/cu128
pip install -U --pre torchvision \
  --index-url https://download.pytorch.org/whl/nightly/cu128
pip install --force-reinstall --no-deps nvidia-cudnn-frontend==1.18.0
```

## Usage example

----

Elastic Models provides the same interface as HuggingFace Transformers. Here is an example of how to use the whisper-large-v3 model:

```python
import torch
from transformers import AutoTokenizer
# The import must match the class used below; this assumes
# elastic_models.transformers exposes AutoModelForSpeechSeq2Seq.
from elastic_models.transformers import AutoModelForSpeechSeq2Seq

# An HF token is currently required because we use the original
# weights for some layers, as well as the model configuration.
model_name = "openai/whisper-large-v3"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as in the transformers library
prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer on user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]

chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs = inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

# Keep only the newly generated tokens (drop the prompt)
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Validate answer
print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
```

## Quality Benchmarks

------------

We have used the `lm_eval` library to validate the models. For each model size (S, M, L, XL), we have run the following tasks: MMLU, PIQA, Arc Challenge, Winogrande.
![Quality Benchmarking]()

### Quality Benchmark Results

| **Metric/Model Size**| **S**| **M**| **L**| **XL**| **Original** |
| --- | --- | --- | --- | --- | --- |

## Datasets

-------

- **MMLU**: Measures model performance on a diverse set of multiple-choice questions covering various academic subjects, testing general knowledge and reasoning.
- **PIQA**: Evaluates physical commonsense reasoning by asking the model to choose the most plausible solution to everyday physical problems.
- **Arc Challenge**: Assesses scientific and factual reasoning using challenging multiple-choice questions from the AI2 Reasoning Challenge dataset.
- **Winogrande**: Tests commonsense understanding and pronoun resolution through sentences requiring the model to identify the correct referent.

## Metrics

----------

- **Accuracy**: The proportion of model predictions that exactly match the correct answers across evaluation tasks.

## Latency Benchmarks

-----

We measured TPS (tokens per second) for each model size using 100 input tokens and 300 output tokens.

![Latency Benchmarking]()

### Latency Benchmark Results

Tokens per second for different model sizes on various GPUs.

| **GPU/Model Size**| **S**| **M**| **L**| **XL**| **Original** |
| --- | --- | --- | --- | --- | --- |
| **H100** | 224 | N/A | N/A | 236 | N/A |
| **L40s** | 202 | N/A | N/A | 187 | 56 |
| **B200** | 199 | N/A | N/A | N/A | N/A |
| **GeForce RTX 4090** | 249 | N/A | N/A | N/A | 53 |
| **GeForce RTX 3090** | 201 | N/A | N/A | N/A | N/A |

## Benchmarking Methodology

----

The benchmarking was performed on a single GPU with a batch size of 1. Each model was run for 10 iterations, and the per-iteration TTFT (time to first token) and TPS were averaged.

> **Algorithm summary:**
> 1. Load the whisper-large-v3 model with the specified size (S, M, L, XL, original).
> 2. Move the model to the GPU.
> 3. Prepare a sample prompt for text generation.
> 4. Run the model for a number of iterations (e.g., 10) and measure the time taken for each iteration. On each iteration:
>    - Synchronize the GPU to flush any previous operations.
>    - Record the start time.
>    - Generate the text using the model.
>    - Synchronize the GPU again.
>    - Record the end time and calculate the TTFT and TPS for that iteration.
> 5. Calculate the average TTFT and TPS over all iterations.

## Serving with Docker Image

------------

For serving with Nvidia GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints. With these containers you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers. You can also use this container to run inference through TheStage AI platform.

### Prebuilt image from ECR

| **GPU** | **Docker image name** |
| --- | --- |
| H100, L40s | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-nvidia-24.09b` |
| B200, RTX 5090 | `public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.1.7.post0-llm-blackwell-24.09b` |

Pull the Docker image for your Nvidia GPU and start the inference container:

```bash
docker pull <IMAGE_NAME>
```

```bash
docker run --rm -ti \
  --name serving_thestage_model \
  -p 8000:80 \
  -e AUTH_TOKEN=<AUTH_TOKEN> \
  -e MODEL_REPO=openai/whisper-large-v3 \
  -e MODEL_SIZE=<MODEL_SIZE> \
  -e MODEL_BATCH=<MODEL_BATCH> \
  -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
  -e THESTAGE_AUTH_TOKEN=<THESTAGE_AUTH_TOKEN> \
  -v /mnt/hf_cache:/root/.cache/huggingface \
  <IMAGE_NAME>
```

| **Parameter** | **Description** |
| --- | --- |
| `<MODEL_SIZE>` | Available: S, M, L, XL. |
| `<MODEL_BATCH>` | Maximum batch size to process in parallel. |
| `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
| `<THESTAGE_AUTH_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |
| `<AUTH_TOKEN>` | Token for endpoint authentication. You can set it to any random string; it must match the value used by the client. |
| `<IMAGE_NAME>` | Name of the image you pulled. |

## Invocation

------

You can invoke the endpoint using CURL as follows:

```bash
curl -X POST 'http://127.0.0.1:8000/v1/chat/completions' \
  -H 'Authorization: Bearer 123' \
  -H 'Content-Type: application/json' \
  -H "X-Model-Name: whisper-large-v3-<size>-bs<batch_size>-paged" \
  -d '{
    "messages":[{"role":"user","content":"Define AI"}]
  }'
```

Or using the OpenAI Python client:

```python
from openai import OpenAI

BASE_URL = "http://<your_url>/v1"
API_KEY = "123"  # must match AUTH_TOKEN set at container startup
MODEL = "whisper-large-v3-<size>-bs<batch_size>-paged"

client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL,
    default_headers={"X-Model-Name": MODEL}
)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Define AI"}
    ]
)

print(response.choices[0].message.content)
```

## Endpoint Parameters

-------------

### Method

> **POST** `/v1/chat/completions`

### Header Parameters

> `Authorization`: `string`
>
> Bearer token for authentication. Should match the `AUTH_TOKEN` set during container startup.

> `Content-Type`: `string`
>
> Must be set to `application/json`.

> `X-Model-Name`: `string`
>
> Specifies the model to use for generation. Format: `whisper-large-v3-<size>-bs<batch_size>`, where `<size>` is one of `S`, `M`, `L`, `XL`, `original` and `<batch_size>` is the maximum batch size configured during container startup.

### Input Body

> `messages`: `array`
>
> The chat messages, each an object with `role` and `content`. The input text prompt goes in the `content` of a `user` message.
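The header and body rules above can be sketched as a small helper that assembles a request before it is sent. This is only an illustration, not part of TheStage tooling: `model_header_value` and `build_request` are hypothetical names, and the header value is assumed to be the base model name followed by a size and a `bs`-prefixed batch size, which is our reading of the format described above. The CURL example also carries a `-paged` suffix, so check the exact string against your deployment.

```python
import json

# Sizes listed in the X-Model-Name description; "original" is the unoptimized model.
VALID_SIZES = ("S", "M", "L", "XL", "original")


def model_header_value(size: str, batch: int, base: str = "whisper-large-v3") -> str:
    """Build an X-Model-Name value, assuming the form <base>-<size>-bs<batch>."""
    if size not in VALID_SIZES:
        raise ValueError(f"size must be one of {VALID_SIZES}, got {size!r}")
    if batch < 1:
        raise ValueError("batch must be a positive integer")
    return f"{base}-{size}-bs{batch}"


def build_request(auth_token: str, size: str, batch: int, prompt: str):
    """Return (headers, body) for POST /v1/chat/completions."""
    headers = {
        # Must match the AUTH_TOKEN passed to the container at startup.
        "Authorization": f"Bearer {auth_token}",
        "Content-Type": "application/json",
        "X-Model-Name": model_header_value(size, batch),
    }
    # The body carries a list of chat messages, as in the CURL example.
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]})
    return headers, body


headers, body = build_request("123", "S", 4, "Define AI")
print(headers["X-Model-Name"])
print(body)
```

Validating the size and batch values on the client side surfaces a malformed model name before the request reaches the container. The returned headers and body can be passed to any HTTP client.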
## Deploy on Modal

-----------------------

For more details, see the tutorial [Modal deployment](https://docs.thestage.ai/tutorials/source/modal_thestage.html).

### Clone modal serving code

```shell
git clone https://github.com/TheStageAI/ElasticModels.git
cd ElasticModels/examples/modal
```

### Configuration of environment variables

Set your environment variables in `modal_serving.py`:

```python
# modal_serving.py

ENVS = {
    "MODEL_REPO": "openai/whisper-large-v3",
    "MODEL_BATCH": "4",
    "THESTAGE_AUTH_TOKEN": "",
    "HUGGINGFACE_ACCESS_TOKEN": "",
    "PORT": "80",
    "PORT_HEALTH": "80",
    "HF_HOME": "/cache/huggingface",
}
```

### Configuration of GPUs

Set your desired GPU type and autoscaling options in `modal_serving.py`:

```python
# modal_serving.py

@app.function(
    image=image,
    gpu="B200",
    min_containers=8,
    max_containers=8,
    timeout=10000,
    ephemeral_disk=600 * 1024,
    volumes={"/opt/project/.cache": HF_CACHE},
    startup_timeout=60*20
)
@modal.web_server(
    80,
    label="openai/whisper-large-v3-test",
    startup_timeout=60*20
)
def serve():
    pass
```

### Run serving

```shell
modal serve modal_serving.py
```

## Links

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai