Model Card for zoha28/it-ticket-generator-Qwen2.5-7B-v1

Introduction

This model converts emails, chats, and voice transcripts into structured JSON tickets for IT helpdesk communications. It takes in these varied inputs and compiles the information into a validated ticket JSON with a title, summary, category, priority, impact, contact details, labels, and optional reproduction steps as applicable. The model addresses a common workplace problem: key information gets lost in lengthy email threads, and tickets never get created due to time constraints. This solution saves time, ensures consistent ticket quality, and can be integrated with various ticketing platforms. The task is well suited to an LLM because it requires combing through multiple channels and media of varying length and extracting the necessary information for ease of access and reproducibility. While general-purpose LLMs can summarize conversations, they often produce inconsistent priority classifications and struggle to emit the structured JSON schema required by ticketing systems.

Key Features

  • Processes emails, chat messages, and voice transcripts
  • Generates schema-validated JSON tickets
  • Vendor-agnostic output
  • Extracts reproduction steps, impact assessment, and appropriate priority categorization

Main Results:

  • 91.3% schema validity on gold test set (target: >85%)
  • 100% priority classification accuracy (target: >75%)
  • 2.83/3 information coverage (target: >2.5)

Training Data

For this project, I built two datasets. The first was a combined evaluation set created by merging two Kaggle datasets: the KameronB Synthetic IT Call-Center Tickets and the Console-AI IT Support Tickets. I cleaned and normalized both so they followed the same Jira-style schema, and I kept this entire combined set strictly for evaluation and cross-domain testing. Separately, I built a Gold dataset by generating structured tickets and then manually reviewing and correcting every field. This Gold set was much smaller but represented the "ideal" version of a ticket, so it became my training dataset. The Gold data was split 80/10/10 into training/validation/test sets using a fixed seed (42) for reproducibility. The combined Kaggle dataset was used to test how well the model generalized beyond the curated Gold examples.
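A minimal sketch of the split described above, assuming the Gold dataset is stored as JSONL and loaded with the Hugging Face datasets library (the file name and loading details are illustrative, not the project's actual preprocessing code):

from datasets import load_dataset

# Load the curated Gold dataset (file name is illustrative)
gold = load_dataset("json", data_files="eval_it_gold.jsonl", split="train")

# 80/10/10 train/validation/test split with the fixed seed (42)
split = gold.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_set = split["train"]    # 80%
val_set = holdout["train"]    # 10%
test_set = holdout["test"]    # 10%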

Data Sources

1. Custom Eval-IT-Gold Dataset (114 samples)

  • Human-validated chat/email/ASR transcripts
  • Categories: account_access, device_issue, network_vpn, admin_request, onboarding_setup, software_install
  • Difficulty-balanced scenarios for comprehensive evaluation
  • Created using few-shot generation with schema validation

2. Combined Dataset (1500 samples)

  • KameronB/synthetic-it-callcenter-tickets (1,000 samples)
    Phone call transcripts with resolution notes.
    Includes fields such as content, category, subcategory, agent assignments, and close_notes.
    Uses stratified sampling by category, priority, and difficulty.

  • Console-AI/IT-helpdesk-synthetic-tickets (500 samples)
    Email- and chat-based tickets with priority levels.
    Provides subject lines, descriptions, categories, and priorities.
    Contains shorter, more concise inputs compared to the KameronB dataset.
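As described in the Training Data section, both Kaggle sources were normalized into one Jira-style layout before evaluation. A minimal sketch of that kind of mapping, assuming the raw records are plain dicts; the source-side field names follow the descriptions above, while the helper functions and unified keys are illustrative:

def normalize_kameronb(row: dict) -> dict:
    # Phone-call transcript with resolution notes (content, category, subcategory, close_notes, ...)
    return {
        "input_text": row["content"],
        "category": row["category"],
        "priority": row.get("priority"),
        "source": "call",
    }

def normalize_console_ai(row: dict) -> dict:
    # Email/chat ticket (subject, description, category, priority)
    return {
        "input_text": f'{row["subject"]}\n{row["description"]}',
        "category": row["category"],
        "priority": row["priority"],
        "source": "email_or_chat",
    }

# kameronb_rows / console_ai_rows: iterables of raw records loaded from the two Kaggle datasets
combined = [normalize_kameronb(r) for r in kameronb_rows] + \
           [normalize_console_ai(r) for r in console_ai_rows]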

Methodology

Model Selection

Three models were evaluated: Llama-3.2-1B-Instruct, Qwen2.5-7B-Instruct, and SmolLM3-3B.

Qwen2.5-7B-Instruct produced the highest-quality outputs, with superior structured-output understanding and content extraction.

Training Approach: LoRA

For this task, I chose LoRA as my fine-tuning method. Since the goal was to improve the model's ability to generate consistent, schema-aligned IT tickets and not to teach it new domain knowledge, LoRA was a good fit. It allows the model to learn formatting patterns and structural behaviors without updating the full set of parameters, which keeps training lightweight and stable. In earlier coursework, LoRA showed strong performance gains with minimal catastrophic forgetting, and it preserved the general language capabilities that are still important for summarizing unstructured ticket descriptions. The main challenge is maintaining a strong prompt foundation, but once the schema prompt was carefully designed, LoRA provided an efficient and reliable way to refine the model's structured-output behavior.

LoRA Configuration

| Hyperparameter | Value | Rationale |
|---|---|---|
| Rank (r) | 16 | Baseline capacity for pattern learning |
| Alpha | 32 | Scaling factor = 2r for stable training |
| Dropout | 0.05 | Minimal regularization to prevent underfitting |
| Learning Rate | 0.001 | Fast convergence on formatting task |
| Epochs | 1 | Single pass sufficient for schema learning |
| Batch Size | 1 | With gradient accumulation = 8 |
| Trainable Parameters | 5.04M | 0.066% of base model (7.6B) |
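A minimal sketch of this configuration using peft and transformers; the target modules and any training-argument choices beyond the table above are assumptions, not details confirmed by the original training script:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# LoRA settings matching the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections only
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Training settings matching the table above
training_args = TrainingArguments(
    output_dir="qwen2.5-7b-it-ticket-lora",
    learning_rate=1e-3,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)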

Hyperparameter Optimization

Tested three configurations:

| Config | r | alpha | dropout | LR | Epochs | Final Loss | Training Time |
|---|---|---|---|---|---|---|---|
| Baseline | 16 | 32 | 0.05 | 0.001 | 1 | 1.2585 | 8.37s |
| Higher Capacity | 32 | 64 | 0.05 | 0.0005 | 1 | 1.3106 | 8.10s |
| Conservative | 16 | 32 | 0.10 | 0.0001 | 2 | 1.5971 | 15.45s |

Result: The baseline configuration achieved the lowest final training loss and was used for the released adapters.

Evaluation

I evaluated the model on four benchmarks designed to capture schema fidelity, real-world generalization, and overall model stability after fine-tuning. The first was my Gold Test Set, which contains hand-validated tickets and is the best indicator of how well the model learned the target Jira-style schema. The second was the Combined Dataset, which merges two real Kaggle IT-ticket datasets; it serves as a cross-domain generalization test of whether a model trained only on curated data can still perform well on noisier, mixed-format inputs. These first two benchmarks were scored on priority accuracy and schema validity. To check for catastrophic forgetting, I also evaluated on RACE, a commonly used reading-comprehension benchmark, and GSM8K, which measures mathematical reasoning. Together, these benchmarks cover structured-output performance, robustness to domain shift, and retention of general capabilities.

For comparison models, I chose two open-source LLMs: SmolLM3-3B and Llama-3.1-8B-Instruct, since both are reasonably strong instruction-tuned models in the same parameter range and are commonly used as baselines for structured-output tasks. I also included the base Qwen2.5-7B-Instruct model to isolate how much improvement comes from LoRA fine-tuning versus the underlying architecture.

Overall, my fine-tuned model outperformed all comparison models on the Gold Test Set, especially on schema validity and priority accuracy, while maintaining comparable scores on RACE and GSM8K. This shows that the LoRA adapters successfully improved structured-output behavior without causing major degradation in general reasoning performance.

Benchmark Tasks

1. Gold Test Set

| Metric | Pre-Training | Post-Training | Target | Status |
|---|---|---|---|---|
| Schema Validity | 82.61% | 91.30% | >85% | Success |
| Priority Accuracy | 89.47% | 100% | >75% | Success |
| Info Coverage (0-3) | 2.65 | 2.83 | >2.5 | Success |

2. Combined Dataset (Generalization Test)

| Metric | Pre-Training | Post-Training |
|---|---|---|
| Schema Validity | 42.33% | 47.33% |
| Info Coverage | 1.86 | 1.95 |

3. RACE (General Capabilities)

  • Measures catastrophic forgetting after fine-tuning
  • Pre-training: 48.0% accuracy
  • Post-training: 45.3% accuracy
  • Change: -2.7 percentage points (minimal degradation)

4. GSM8K (General Capabilities)

  • Measures mathematical reasoning and checks for catastrophic forgetting
  • Pre-training (strict EM): 77.2%
  • Post-training (strict EM): 82.2%
  • Change: +5.0 percentage points

Performance Analysis

Gold Test Set (Training Domain):

  • 91.3% schema validity shows strong learning of the JSON ticket structure
  • 100% priority accuracy indicates perfect classification on this dataset
  • 2.83/3 info coverage shows the model consistently fills required fields

Combined Dataset (Cross-Domain Generalization):

  • Lower validity (47.3%) reflects expected domain shift
  • Modest improvement (42.3% → 47.3%) shows learning without overfitting to the Gold dataset

Catastrophic Forgetting:

  • RACE accuracy dropped by only 2.7 percentage points, a minimal change
  • Model retained general language capabilities while learning structured output

Evaluation Metrics Explained

  • Schema Validity (Binary): Validates that all required fields are present with correct types (see the sketch after this list)
  • Information Coverage (0-3 Likert):
    • 0: Missing multiple required fields
    • 1: Only a few required fields present
    • 2: Missing only optional fields
    • 3: All required and relevant optional fields present
  • Priority Accuracy (Exact Match): Classification on {Low, Medium, High}
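A minimal sketch of how schema validity and priority accuracy can be computed from raw model outputs; the required-field list mirrors the schema documented below, and the helpers are illustrative rather than the exact evaluation code:

import json

REQUIRED_FIELDS = {"title": str, "summary": str, "category": str, "priority": str,
                   "impact": str, "contact": dict, "labels": list}

def is_valid_ticket(raw: str) -> bool:
    """Binary schema validity: the output must parse as JSON and contain
    every required field with the expected type."""
    try:
        ticket = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(ticket, dict) and all(
        isinstance(ticket.get(field), expected) for field, expected in REQUIRED_FIELDS.items()
    )

def priority_accuracy(outputs, gold_priorities):
    """Exact-match accuracy of the predicted priority against {Low, Medium, High} labels."""
    hits = 0
    for raw, gold in zip(outputs, gold_priorities):
        try:
            pred = json.loads(raw).get("priority")
        except json.JSONDecodeError:
            pred = None
        hits += int(pred == gold)
    return hits / len(gold_priorities)

Schema validity is then the fraction of generated outputs for which is_valid_ticket returns True; information coverage is scored against the 0-3 rubric above.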

Model Comparison Across Benchmarks

| Model | Gold Test Set (All Metrics) | Combined Dataset (All Metrics) | RACE Accuracy | GSM8K Accuracy |
|---|---|---|---|---|
| Qwen2.5-7B + LoRA | Schema: 91.3%, Coverage: 2.83/3.0, Priority: 100% | Schema: 47.3%, Coverage: 1.95/3.0 | 45.3% | 82.2% |
| Qwen2.5-7B (Base) | Schema: 42.3%, Coverage: 1.86/3.0, Priority: 89.5% | Schema: 39.1%, Coverage: 1.78/3.0 | 48.0% | 77.2% |
| SmolLM3-3B | Schema: 39.1%, Coverage: 1.78/3.0, Priority: 77.8% | Schema: 19.3%, Coverage: 1.39/3.0 | 37.0% | 73.9% |
| Mistral-7B-Instruct | Schema: 40.6%, Coverage: 1.75/3.0, Priority: 85.3% | Schema: 27%, Coverage: 1.55/3.0 | 47.5% | 70.1% |

The LoRA fine-tuned model shows the strongest performance across all ticket-specific metrics, with large gains in schema validity and information coverage and perfect priority accuracy compared to both the base Qwen2.5-7B model and SmolLM3-3B. On general-capability benchmarks, the LoRA model maintains competitive RACE performance and even improves over the base model on GSM8K, indicating that structured-output fine-tuning did not degrade broader reasoning ability. Overall, these results show that targeted LoRA training substantially improves the model's ability to generate clean, schema-aligned IT tickets while preserving general task performance relative to the comparison models.

Usage and Intended Uses

This model is designed to convert unstructured IT helpdesk inputs into clean, structured JSON tickets that follow a Jira-style schema. Its intended use cases include IT operations, automated ticket triage, and any workflow where user messages need to be normalized into consistent, machine-readable fields. Rather than providing new factual knowledge, the model focuses on formatting, field extraction, and schema compliance. The recommended loading method is attaching the LoRA adapters to the base Qwen2.5-7B model, which keeps the adapters flexible for continued finetuning or integration into a RAG-style pipeline.

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load LoRA adapters
model = PeftModel.from_pretrained(
    base_model,
    "zoha28/it-ticket-generator-Qwen2.5-7B-v1"
)

print("Model loaded with LoRA adapters")
print(f"Device: {model.device}")

Prompt Format

The prompt follows a simple question-answer structure where the model is instructed to convert an unstructured user message into a minified JSON object matching a predefined schema. Each prompt contains: (1) the schema reference, (2) the raw user input, and (3) an "A:" prefix where the model outputs the JSON.

# Schema used in all prompts
schema = "{'title': str, 'summary': str, 'repro_steps': list (optional), 'category': str, 'priority': str, 'impact': str, 'contact': str, 'labels': list}"

# Raw, unstructured user message (illustrative example)
user_communication = "Hi IT, I can't connect to the VPN since this morning and I have a client call at 2 pm."

# Example formatted prompt
prompt = f"""
Q: Given the user's input, produce ONLY a minified JSON object matching this {schema}.
Input: {user_communication}

A:
"""

Expected Output Format

The model is trained to return a minified JSON object that follows a fixed helpdesk-ticket schema, so that outputs are consistent, parseable, and aligned with downstream ticket-automation workflows. Each response includes fields such as title, summary, category, priority, and impact.

JSON Schema

All outputs conform to this validated structure:

{
  "title": "string (โ‰ค90 characters)",
  "summary": "2-4 sentence description of the issue",
  "repro_steps": ["step 1", "step 2", "step 3"] or null,
  "category": "category_value",
  "priority": "Low|Medium|High",
  "impact": "description of user/team/organization impact",
  "contact": {
    "name": "string",
    "email": "string",
    "phone": "string (optional)"
  },
  "labels": ["tag1", "tag2", "tag3"]
}

Field Specifications

Required Fields:

  • title (string, max 90 chars): Concise issue description
  • summary (string, 2-4 sentences): Core problem description
  • category (string): One of predefined categories
  • priority (string): Low | Medium | High
  • impact (string): User/team/organization impact description
  • contact (object): User contact information
  • labels (array): Relevant categorization tags

Optional Fields:

  • repro_steps (array or null): Numbered reproduction steps
    • Required for technical issues (device, network, software)
    • Null for administrative requests (password reset, access request)
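The field specifications above can also be expressed as a formal JSON Schema and checked with the jsonschema library. This encoding is an illustrative interpretation of the rules listed here (for example, the 2-4 sentence summary constraint is not machine-checkable), not the project's own validator:

from jsonschema import Draft7Validator

TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "maxLength": 90},
        "summary": {"type": "string"},
        "repro_steps": {"type": ["array", "null"], "items": {"type": "string"}},
        "category": {"type": "string"},
        "priority": {"enum": ["Low", "Medium", "High"]},
        "impact": {"type": "string"},
        "contact": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
                "phone": {"type": "string"},
            },
            "required": ["name", "email"],
        },
        "labels": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "summary", "category", "priority", "impact", "contact", "labels"],
}

_validator = Draft7Validator(TICKET_SCHEMA)

def schema_errors(ticket: dict) -> list:
    """Return a human-readable message for every way a parsed ticket violates the schema."""
    return [err.message for err in _validator.iter_errors(ticket)]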

Example 1 (Minified):

{"title":"Network Connection Issue","summary":"Network connectivity issues have been affecting all departments for over an hour. Users are unable to access internal systems and require immediate resolution.","category":"network_vpn","priority":"High","impact":"org_wide","repro_steps":["All departments report widespread network connectivity issues.","Website and internal services are inaccessible."],"contact":{"name":"IT Support","email":"[email protected]"},"labels":["network_vpn","connectivity_issues","network_outage"]}

Example 1: Network Issue (Expanded)

{
  "title": "Network Connection Issue",
  "summary": "Network connectivity issues affecting all departments for over an hour. Seeking immediate resolution and mitigation steps.",
  "category": "network_vpn",
  "priority": "High",
  "impact": "org_wide",
  "repro_steps": [
    "All departments report network connectivity issues.",
    "Website and internal services are inaccessible."
  ],
  "contact": {"name": "IT Support", "email": "[email protected]"},
  "labels": ["network_vpn", "connectivity_issues", "network_outage"]
}

Example 2: Account Access Issue

{
  "title": "Admin Account Login Issue",
  "summary": "Unable to log into admin account despite receiving new credentials via email.",
  "category": "Account Access",
  "priority": "high",
  "impact": "Prevents access to critical tools for ongoing tasks and meetings.",
  "repro_steps": [
    "Received new login credentials via email",
    "Attempted to log in using provided details",
    "Received error message: 'Invalid username or password'",
    "Tried resetting password through portal",
    "Failed to resolve issue on multiple devices"
  ],
  "contact": {"name": "ITUser", "email": "[email protected]", "phone": "+1-555-1234"},
  "source": "chat",
  "labels": ["admin-login", "finance-tools"]
}

Limitations

While the model performs strongly on the curated Gold dataset, it is noticeably sensitive to domain shift, with schema validity dropping to 47% on the combined Kaggle dataset. This suggests the model generalizes best to ticket formats and field names that closely match its training distribution, and may require additional fine-tuning in organizations with different ticket templates or terminology. Because the model relies heavily on a clear prompt and an explicit schema, vague or inconsistently formatted inputs may lead to incomplete or misaligned JSON outputs. The model also struggles with very long inputs where extended email threads or chat transcripts may exceed the effective context window and cause loss of important details. Although GSM8K performance improved after fine-tuning, the slight drop on RACE shows that some general reasoning abilities can shift during LoRA training, even if the overall impact is small.

Bias, Risks, and Ethical Considerations

Data Privacy

  • May process sensitive or personal information
  • No built-in PII redaction or anonymization
  • Requires external privacy controls

Bias Potential

  • Training data may contain biased priority/impact assessments
  • Organizations should monitor for unintended bias

Human Oversight Required

  • Model should not be deployed without human review
  • Should not be used unsupervised in production workloads
  • Critical incidents require immediate human attention

Acknowledgments

Base Model:

  • Qwen/Qwen2.5-7B-Instruct

External Datasets Used:

  • KameronB/synthetic-it-callcenter-tickets (Kaggle)
  • Console-AI/IT-helpdesk-synthetic-tickets (Kaggle)

Comparison Models:

  • SmolLM3-3B
  • Mistral-7B-Instruct

Benchmarks:

  • RACE
  • GSM8K

Tools / Libraries:

  • Hugging Face Transformers, PEFT, and PyTorch
