Based on the paper *SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot* (arXiv:2301.00774).
This repo contains model files for the llama2.c 15M TinyStories model, optimized for NM-vLLM, a high-throughput serving engine for compressed LLMs.
This model was pruned with SparseGPT using llm-compressor.
Install llm-compressor:
```shell
pip install llmcompressor
```
```python
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

hf_model_stub = "Xenova/llama2.c-stories15M"
calibration_dataset = "open_platypus"
output_directory = f"{hf_model_stub.split('/')[-1]}-pruned_50.2of4-uncompressed"

# Load the dense base model.
model = SparseAutoModelForCausalLM.from_pretrained(
    hf_model_stub, torch_dtype="auto", device_map="auto"
)

# SparseGPT recipe: prune the decoder layers to 50% sparsity in a 2:4 pattern,
# updating layers sequentially to compensate for accumulated pruning error.
recipe = """
test_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      mask_structure: "2:4"
      targets: ['re:model.layers.\d*$']
"""

# Apply the recipe in one shot, calibrating on the open_platypus dataset.
oneshot(
    model=model,
    dataset=calibration_dataset,
    recipe=recipe,
    output_dir=output_directory,
)

# Save uncompressed weights; pruned values are stored as explicit zeros.
model.save_pretrained(output_directory, save_compressed=False)
```
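The recipe's `mask_structure: "2:4"` requests semi-structured sparsity: in every contiguous group of four weights, at least two are zeroed, which matches the 50% sparsity target in a pattern that sparse GPU kernels can accelerate. The following is a minimal illustrative sketch of that mask pattern using plain magnitude pruning with a hypothetical `prune_2_of_4` helper; it is not llm-compressor's implementation, which additionally uses SparseGPT's second-order weight updates to correct for pruning error.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in each group of four.

    Illustrates the 2:4 mask pattern only (hypothetical helper);
    SparseGPT also adjusts the surviving weights after masking.
    """
    w = weights.reshape(-1, 4).copy()
    # Column indices of the two smallest |w| within each group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.01, 0.6]])
pruned = prune_2_of_4(w)
# Each group of four keeps only its two largest-magnitude entries:
# [[0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, 0.6]]
```

Every row of a 2:4-pruned weight matrix therefore carries exactly 50% zeros, regardless of the weight values, which is what distinguishes it from unstructured 50% sparsity.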
For further support, and discussion of these models and AI in general, join Neural Magic's Slack community.