Model Card for zoha28/it-ticket-generator-Qwen2.5-7B-v1

Introduction

This model converts emails, chats, and voice transcripts into structured JSON tickets for IT helpdesk communications. It takes in these varied inputs and compiles the information into a validated ticket JSON with a title, summary, category, priority, impact, contact details, labels, and optional reproduction steps as applicable. The model addresses a common workplace problem: key information gets lost in lengthy email threads, and tickets never get created due to time constraints. This solution saves time, ensures consistent ticket quality, and can be integrated with various ticketing platforms. The task is well suited to an LLM because it requires combing through multiple channels and media of varying length and extracting the necessary information for ease of access and reproducibility. While general-purpose LLMs can summarize conversations, they often produce inconsistent priority classifications and struggle to emit the structured JSON schema required by ticketing systems.

Key Features

  • Processes emails, chat messages, and voice transcripts
  • Generates schema-validated JSON tickets
  • Vendor-agnostic output
  • Extracts reproduction steps, impact assessment, and appropriate priority categorization

Main Results:

  • 91.3% schema validity on gold test set (target: >85%)
  • 100% priority classification accuracy (target: >75%)
  • 2.83/3 information coverage (target: >2.5)

Training Data

For this project, I built two datasets. The first was a combined evaluation set created by merging two Kaggle datasets: the KameronB Synthetic IT Call-Center Tickets and the Console-AI IT Support Tickets. I cleaned and normalized both so they followed the same Jira-style schema, and I kept this entire combined set strictly for evaluation and cross-domain testing. Separately, I built a Gold dataset by generating structured tickets and then manually reviewing and correcting every field. This Gold set was much smaller but represented the "ideal" version of a ticket, so it became my training dataset. The Gold data was split 80/10/10 into training/validation/test sets using a fixed seed (42) for reproducibility. The combined Kaggle dataset was used to test how well the model generalized beyond the curated Gold examples.
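A minimal sketch of the split described above, assuming the Gold dataset is stored as JSONL and loaded with the Hugging Face datasets library (the file name and loading details are illustrative, not the project's actual preprocessing code):

from datasets import load_dataset

# Load the curated Gold dataset (file name is illustrative)
gold = load_dataset("json", data_files="eval_it_gold.jsonl", split="train")

# 80/10/10 train/validation/test split with the fixed seed (42)
split = gold.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_set = split["train"]    # 80%
val_set = holdout["train"]    # 10%
test_set = holdout["test"]    # 10%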

Data Sources

1. Custom Eval-IT-Gold Dataset (114 samples)

  • Human-validated chat/email/ASR transcripts
  • Categories: account_access, device_issue, network_vpn, admin_request, onboarding_setup, software_install
  • Difficulty-balanced scenarios for comprehensive evaluation
  • Created using few-shot generation with schema validation

2. Combined Dataset (1500 samples)

  • KameronB/synthetic-it-callcenter-tickets (1,000 samples)
    Phone call transcripts with resolution notes.
    Includes fields such as content, category, subcategory, agent assignments, and close_notes.
    Uses stratified sampling by category, priority, and difficulty.

  • Console-AI/IT-helpdesk-synthetic-tickets (500 samples)
    Email- and chat-based tickets with priority levels.
    Provides subject lines, descriptions, categories, and priorities.
    Contains shorter, more concise inputs compared to the KameronB dataset.
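As described in the Training Data section, both Kaggle sources were normalized into one Jira-style layout before evaluation. A minimal sketch of that kind of mapping, assuming the raw records are plain dicts; the source-side field names follow the descriptions above, while the helper functions and unified keys are illustrative:

def normalize_kameronb(row: dict) -> dict:
    # Phone-call transcript with resolution notes (content, category, subcategory, close_notes, ...)
    return {
        "input_text": row["content"],
        "category": row["category"],
        "priority": row.get("priority"),
        "source": "call",
    }

def normalize_console_ai(row: dict) -> dict:
    # Email/chat ticket (subject, description, category, priority)
    return {
        "input_text": f'{row["subject"]}\n{row["description"]}',
        "category": row["category"],
        "priority": row["priority"],
        "source": "email_or_chat",
    }

# kameronb_rows / console_ai_rows: iterables of raw records loaded from the two Kaggle datasets
combined = [normalize_kameronb(r) for r in kameronb_rows] + \
           [normalize_console_ai(r) for r in console_ai_rows]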

Methodology

Model Selection

Three models were evaluated: Llama-3.2-1B-Instruct, Qwen2.5-7B-Instruct, and SmolLM3-3B.

Qwen2.5-7B-Instruct produced the highest-quality outputs, with superior structured-output understanding and content extraction.

Training Approach: LoRA

For this task, I chose LoRA as my fine-tuning method. Since the goal was to improve the model's ability to generate consistent, schema-aligned IT tickets and not to teach it new domain knowledge, LoRA was a good fit. It allows the model to learn formatting patterns and structural behaviors without updating the full set of parameters, which keeps training lightweight and stable. In earlier coursework, LoRA showed strong performance gains with minimal catastrophic forgetting, and it preserved the general language capabilities that are still important for summarizing unstructured ticket descriptions. The main challenge is maintaining a strong prompt foundation, but once the schema prompt was carefully designed, LoRA provided an efficient and reliable way to refine the model's structured-output behavior.

LoRA Configuration

| Hyperparameter | Value | Rationale |
|---|---|---|
| Rank (r) | 16 | Baseline capacity for pattern learning |
| Alpha | 32 | Scaling factor = 2r for stable training |
| Dropout | 0.05 | Minimal regularization to prevent underfitting |
| Learning Rate | 0.001 | Fast convergence on formatting task |
| Epochs | 1 | Single pass sufficient for schema learning |
| Batch Size | 1 | With gradient accumulation = 8 |
| Trainable Parameters | 5.04M | 0.066% of base model (7.6B) |
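A minimal sketch of this configuration using peft and transformers; the target modules and any training-argument choices beyond the table above are assumptions, not details confirmed by the original training script:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# LoRA settings matching the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections only
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Training settings matching the table above
training_args = TrainingArguments(
    output_dir="qwen2.5-7b-it-ticket-lora",
    learning_rate=1e-3,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)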

Hyperparameter Optimization

Tested three configurations:

| Config | r | alpha | dropout | LR | Epochs | Final Loss | Training Time |
|---|---|---|---|---|---|---|---|
| Baseline | 16 | 32 | 0.05 | 0.001 | 1 | 1.2585 | 8.37s |
| Higher Capacity | 32 | 64 | 0.05 | 0.0005 | 1 | 1.3106 | 8.10s |
| Conservative | 16 | 32 | 0.10 | 0.0001 | 2 | 1.5971 | 15.45s |

Result: The baseline configuration achieved the lowest final training loss and was used for the released adapters.

Evaluation

I evaluated the model on four benchmarks designed to capture schema fidelity, real-world generalization, and overall model stability after fine-tuning. The first was my Gold Test Set, which contains hand-validated tickets and is the best indicator of how well the model learned the target Jira-style schema. The second was the Combined Dataset, which merges two real Kaggle IT-ticket datasets; it serves as a cross-domain generalization test of whether a model trained only on curated data can still perform well on noisier, mixed-format inputs. These first two benchmarks were scored on priority accuracy and schema validity. To check for catastrophic forgetting, I also evaluated on RACE, a commonly used reading-comprehension benchmark, and GSM8K, which measures mathematical reasoning. Together, these benchmarks cover structured-output performance, robustness to domain shift, and retention of general capabilities.

For comparison models, I chose two open-source LLMs: SmolLM3-3B and Llama-3.1-8B-Instruct, since both are reasonably strong instruction-tuned models in the same parameter range and are commonly used as baselines for structured-output tasks. I also included the base Qwen2.5-7B-Instruct model to isolate how much improvement comes from LoRA fine-tuning versus the underlying architecture.

Overall, my fine-tuned model outperformed all comparison models on the Gold Test Set, especially on schema validity and priority accuracy, while maintaining comparable scores on RACE and GSM8K. This shows that the LoRA adapters successfully improved structured-output behavior without causing major degradation in general reasoning performance.

Benchmark Tasks

1. Gold Test Set

| Metric | Pre-Training | Post-Training | Target | Status |
|---|---|---|---|---|
| Schema Validity | 82.61% | 91.30% | >85% | Success |
| Priority Accuracy | 89.47% | 100% | >75% | Success |
| Info Coverage (0-3) | 2.65 | 2.83 | >2.5 | Success |

2. Combined Dataset (Generalization Test)

| Metric | Pre-Training | Post-Training |
|---|---|---|
| Schema Validity | 42.33% | 47.33% |
| Info Coverage | 1.86 | 1.95 |

3. RACE (General Capabilities)

  • Measures catastrophic forgetting after fine-tuning
  • Pre-training: 48.0% accuracy
  • Post-training: 45.3% accuracy
  • Change: -2.7 percentage points (minimal degradation)

4. GSM8K (General Capabilities)

  • Measures mathematical reasoning and checks for catastrophic forgetting
  • Pre-training (strict EM): 77.2%
  • Post-training (strict EM): 82.2%
  • Change: +5.0 percentage points

Performance Analysis

Gold Test Set (Training Domain):

  • 91.3% schema validity shows strong learning of the JSON ticket structure
  • 100% priority accuracy indicates perfect classification on this dataset
  • 2.83/3 info coverage shows the model consistently fills required fields

Combined Dataset (Cross-Domain Generalization):

  • Lower validity (47.3%) reflects expected domain shift
  • Modest improvement (42.3% → 47.3%) shows learning without overfitting to the Gold dataset

Catastrophic Forgetting:

  • RACE accuracy dropped by only 2.7 percentage points, a minimal change
  • Model retained general language capabilities while learning structured output

Evaluation Metrics Explained

  • Schema Validity (Binary): Validates that all required fields are present with correct types (see the sketch after this list)
  • Information Coverage (0-3 Likert):
    • 0: Missing multiple required fields
    • 1: Only a few required fields present
    • 2: Missing only optional fields
    • 3: All required and relevant optional fields present
  • Priority Accuracy (Exact Match): Classification on {Low, Medium, High}
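A minimal sketch of how schema validity and priority accuracy can be computed from raw model outputs; the required-field list mirrors the schema documented below, and the helpers are illustrative rather than the exact evaluation code:

import json

REQUIRED_FIELDS = {"title": str, "summary": str, "category": str, "priority": str,
                   "impact": str, "contact": dict, "labels": list}

def is_valid_ticket(raw: str) -> bool:
    """Binary schema validity: the output must parse as JSON and contain
    every required field with the expected type."""
    try:
        ticket = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(ticket, dict) and all(
        isinstance(ticket.get(field), expected) for field, expected in REQUIRED_FIELDS.items()
    )

def priority_accuracy(outputs, gold_priorities):
    """Exact-match accuracy of the predicted priority against {Low, Medium, High} labels."""
    hits = 0
    for raw, gold in zip(outputs, gold_priorities):
        try:
            pred = json.loads(raw).get("priority")
        except json.JSONDecodeError:
            pred = None
        hits += int(pred == gold)
    return hits / len(gold_priorities)

Schema validity is then the fraction of generated outputs for which is_valid_ticket returns True; information coverage is scored against the 0-3 rubric above.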

Model Comparison Across Benchmarks

| Model | Gold Test Set (All Metrics) | Combined Dataset (All Metrics) | RACE Accuracy | GSM8K Accuracy |
|---|---|---|---|---|
| Qwen2.5-7B + LoRA | Schema: 91.3%, Coverage: 2.83/3.0, Priority: 100% | Schema: 47.3%, Coverage: 1.95/3.0 | 45.3% | 82.2% |
| Qwen2.5-7B (Base) | Schema: 42.3%, Coverage: 1.86/3.0, Priority: 89.5% | Schema: 39.1%, Coverage: 1.78/3.0 | 48.0% | 77.2% |
| SmolLM3-3B | Schema: 39.1%, Coverage: 1.78/3.0, Priority: 77.8% | Schema: 19.3%, Coverage: 1.39/3.0 | 37.0% | 73.9% |
| Mistral-7B-Instruct | Schema: 40.6%, Coverage: 1.75/3.0, Priority: 85.3% | Schema: 27%, Coverage: 1.55/3.0 | 47.5% | 70.1% |

The LoRA fine-tuned model shows the strongest performance across all ticket-specific metrics, with large gains in schema validity and information coverage and perfect priority accuracy compared to both the base Qwen2.5-7B model and SmolLM3-3B. On general-capability benchmarks, the LoRA model maintains competitive RACE performance and even improves over the base model on GSM8K, indicating that structured-output fine-tuning did not degrade broader reasoning ability. Overall, these results show that targeted LoRA training substantially improves the model's ability to generate clean, schema-aligned IT tickets while preserving general task performance relative to the comparison models.

Usage and Intended Uses

This model is designed to convert unstructured IT helpdesk inputs into clean, structured JSON tickets that follow a Jira-style schema. Its intended use cases include IT operations, automated ticket triage, and any workflow where user messages need to be normalized into consistent, machine-readable fields. Rather than providing new factual knowledge, the model focuses on formatting, field extraction, and schema compliance. The recommended loading method is attaching the LoRA adapters to the base Qwen2.5-7B model, which keeps the adapters flexible for continued finetuning or integration into a RAG-style pipeline.

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load LoRA adapters
model = PeftModel.from_pretrained(
    base_model,
    "zoha28/it-ticket-generator-Qwen2.5-7B-v1"
)

print("Model loaded with LoRA adapters")
print(f"Device: {model.device}")

Prompt Format

The prompt follows a simple question-answer structure where the model is instructed to convert an unstructured user message into a minified JSON object matching a predefined schema. Each prompt contains: (1) the schema reference, (2) the raw user input, and (3) an "A:" prefix where the model outputs the JSON.

# Schema used in all prompts
schema = "{'title': str, 'summary': str, 'repro_steps': list (optional), 'category': str, 'priority': str, 'impact': str, 'contact': str, 'labels': list}"

# Raw, unstructured user message (illustrative example)
user_communication = "Hi IT, I can't connect to the VPN since this morning and I have a client call at 2 pm."

# Example formatted prompt
prompt = f"""
Q: Given the user's input, produce ONLY a minified JSON object matching this {schema}.
Input: {user_communication}

A:
"""

Expected Output Format

The model is trained to return a minified JSON object that follows a fixed helpdesk-ticket schema, so that outputs are consistent, parseable, and aligned with downstream ticket-automation workflows. Each response includes fields such as title, summary, category, priority, and impact.

JSON Schema

All outputs conform to this validated structure:

{
  "title": "string (โ‰ค90 characters)",
  "summary": "2-4 sentence description of the issue",
  "repro_steps": ["step 1", "step 2", "step 3"] or null,
  "category": "category_value",
  "priority": "Low|Medium|High",
  "impact": "description of user/team/organization impact",
  "contact": {
    "name": "string",
    "email": "string",
    "phone": "string (optional)"
  },
  "labels": ["tag1", "tag2", "tag3"]
}

Field Specifications

Required Fields:

  • title (string, max 90 chars): Concise issue description
  • summary (string, 2-4 sentences): Core problem description
  • category (string): One of predefined categories
  • priority (string): Low | Medium | High
  • impact (string): User/team/organization impact description
  • contact (object): User contact information
  • labels (array): Relevant categorization tags

Optional Fields:

  • repro_steps (array or null): Numbered reproduction steps
    • Required for technical issues (device, network, software)
    • Null for administrative requests (password reset, access request)
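The field specifications above can also be expressed as a formal JSON Schema and checked with the jsonschema library. This encoding is an illustrative interpretation of the rules listed here (for example, the 2-4 sentence summary constraint is not machine-checkable), not the project's own validator:

from jsonschema import Draft7Validator

TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "maxLength": 90},
        "summary": {"type": "string"},
        "repro_steps": {"type": ["array", "null"], "items": {"type": "string"}},
        "category": {"type": "string"},
        "priority": {"enum": ["Low", "Medium", "High"]},
        "impact": {"type": "string"},
        "contact": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
                "phone": {"type": "string"},
            },
            "required": ["name", "email"],
        },
        "labels": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "summary", "category", "priority", "impact", "contact", "labels"],
}

_validator = Draft7Validator(TICKET_SCHEMA)

def schema_errors(ticket: dict) -> list:
    """Return a human-readable message for every way a parsed ticket violates the schema."""
    return [err.message for err in _validator.iter_errors(ticket)]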

Example 1 (Minified):

{"title":"Network Connection Issue","summary":"Network connectivity issues have been affecting all departments for over an hour. Users are unable to access internal systems and require immediate resolution.","category":"network_vpn","priority":"High","impact":"org_wide","repro_steps":["All departments report widespread network connectivity issues.","Website and internal services are inaccessible."],"contact":{"name":"IT Support","email":"[email protected]"},"labels":["network_vpn","connectivity_issues","network_outage"]}

Example 1: Network Issue (Expanded)

{
  "title": "Network Connection Issue",
  "summary": "Network connectivity issues affecting all departments for over an hour. Seeking immediate resolution and mitigation steps.",
  "category": "network_vpn",
  "priority": "High",
  "impact": "org_wide",
  "repro_steps": [
    "All departments report network connectivity issues.",
    "Website and internal services are inaccessible."
  ],
  "contact": {"name": "IT Support", "email": "[email protected]"},
  "labels": ["network_vpn", "connectivity_issues", "network_outage"]
}

Example 2: Account Access Issue

{
  "title": "Admin Account Login Issue",
  "summary": "Unable to log into admin account despite receiving new credentials via email.",
  "category": "Account Access",
  "priority": "high",
  "impact": "Prevents access to critical tools for ongoing tasks and meetings.",
  "repro_steps": [
    "Received new login credentials via email",
    "Attempted to log in using provided details",
    "Received error message: 'Invalid username or password'",
    "Tried resetting password through portal",
    "Failed to resolve issue on multiple devices"
  ],
  "contact": {"name": "ITUser", "email": "[email protected]", "phone": "+1-555-1234"},
  "source": "chat",
  "labels": ["admin-login", "finance-tools"]
}

Limitations

While the model performs strongly on the curated Gold dataset, it is noticeably sensitive to domain shift, with schema validity dropping to 47% on the combined Kaggle dataset. This suggests the model generalizes best to ticket formats and field names that closely match its training distribution, and may require additional fine-tuning in organizations with different ticket templates or terminology. Because the model relies heavily on a clear prompt and an explicit schema, vague or inconsistently formatted inputs may lead to incomplete or misaligned JSON outputs. The model also struggles with very long inputs where extended email threads or chat transcripts may exceed the effective context window and cause loss of important details. Although GSM8K performance improved after fine-tuning, the slight drop on RACE shows that some general reasoning abilities can shift during LoRA training, even if the overall impact is small.

Bias, Risks, and Ethical Considerations

Data Privacy

  • May process sensitive or personal information
  • No built-in PII redaction or anonymization
  • Requires external privacy controls

Bias Potential

  • Training data may contain biased priority/impact assessments
  • Organizations should monitor for unintended bias

Human Oversight Required

  • Model should not be deployed without human review
  • Should not be used unsupervised in production workloads
  • Critical incidents require immediate human attention

Acknowledgments

Base Model:

  • Qwen/Qwen2.5-7B-Instruct

External Datasets Used:

  • KameronB/synthetic-it-callcenter-tickets (Kaggle)
  • Console-AI/IT-helpdesk-synthetic-tickets (Kaggle)

Comparison Models:

  • SmolLM3-3B
  • Mistral-7B-Instruct

Benchmarks:

  • RACE
  • GSM8K

Tools / Libraries:

  • Hugging Face Transformers, PEFT, and PyTorch
