# GameChartEvaluator (GCE4)
A neural network model for evaluating the quality of rhythm game charts relative to their corresponding music. The model predicts a quality score (0-1) indicating how well a chart synchronizes with the music.
## Model Architecture
The model uses an early fusion approach with dilated convolutions for temporal analysis:
- **Early Fusion**: Concatenates music and chart mel spectrograms along the channel dimension (80 + 80 = 160 channels)
- **Dilated Residual Encoder**: 4 residual blocks with increasing dilation rates (1, 2, 4, 8) capture multi-scale temporal context while preserving the ~11.6 ms frame resolution. This gives the model a receptive field of ~0.73s (63 frames), so each time-step's score depends on roughly 0.36s of context on either side.
- **Error-Sensitive Scoring Head**: Combines the average of the local scores with the worst 10% of scores using a learnable mixing parameter
```
Input: (B, 80, T) music_mels + (B, 80, T) chart_mels
        ↓ Concatenate
(B, 160, T)
        ↓ Conv1D Projection
(B, 128, T)
        ↓ Dilated ResBlocks × 4
(B, 128, T)
        ↓ Linear → Sigmoid (per-frame scores)
(B, T, 1)
        ↓ Error-Sensitive Pooling
(B,) final score
```
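For concreteness, here is a minimal PyTorch sketch of this architecture. It keeps the stated pieces (early fusion, dilations 1/2/4/8, per-frame sigmoid scores, mean/worst-10% pooling with a learnable mix), but the kernel sizes, activation, two-conv block layout, and pooling parameterization are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Residual block with dilated convs; kernel size 3 is an assumption."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3,
                               padding=dilation, dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, x):                       # (B, C, T) -> (B, C, T)
        h = self.conv2(self.act(self.conv1(x)))
        return self.act(x + h)                  # residual connection

class GCE4Sketch(nn.Module):
    def __init__(self, input_dim=80, d_model=128, n_layers=4, worst_frac=0.10):
        super().__init__()
        # Early fusion: music + chart mels concatenated -> 2 * input_dim channels.
        self.proj = nn.Conv1d(2 * input_dim, d_model, kernel_size=1)
        self.blocks = nn.Sequential(
            *[DilatedResBlock(d_model, 2 ** i) for i in range(n_layers)]  # 1, 2, 4, 8
        )
        self.head = nn.Linear(d_model, 1)
        self.mix_logit = nn.Parameter(torch.zeros(1))  # learnable mean/worst mixing
        self.worst_frac = worst_frac

    def predict_trace(self, music_mels, chart_mels):    # (B, 80, T) x2 -> (B, T)
        x = torch.cat([music_mels, chart_mels], dim=1)  # (B, 160, T)
        h = self.blocks(self.proj(x))                   # (B, 128, T)
        return torch.sigmoid(self.head(h.transpose(1, 2))).squeeze(-1)

    def forward(self, music_mels, chart_mels):          # -> (B,)
        s = self.predict_trace(music_mels, chart_mels)
        k = max(1, int(self.worst_frac * s.shape[-1]))
        worst = s.topk(k, dim=-1, largest=False).values.mean(dim=-1)
        alpha = torch.sigmoid(self.mix_logit)           # error-sensitive pooling
        return alpha * s.mean(dim=-1) + (1 - alpha) * worst
```

Pooling over the worst 10% of frames makes the final score sensitive to short misaligned passages that a plain mean would wash out.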
## Usage

```python
import torch
from gce4 import GameChartEvaluator

model = GameChartEvaluator.from_pretrained("JacobLinCool/gce4")
model.eval()

# Input: 80-band mel spectrograms
music_mels = torch.randn(1, 80, 1000)  # (batch, freq, time)
chart_mels = torch.randn(1, 80, 1000)

# Get the overall quality score (0-1)
with torch.no_grad():
    score = model(music_mels, chart_mels)
print(f"Quality Score: {score.item():.3f}")

# Get the per-frame quality trace for explainability
with torch.no_grad():
    trace = model.predict_trace(music_mels, chart_mels)
# trace shape: (batch, time)
```
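Since `predict_trace` returns one score per frame, it can localize where a chart goes wrong. A small illustrative follow-on (the per-frame duration is an assumption, consistent with the ~11.6 ms resolution discussed under Analysis):

```python
# Continues from the snippet above; 0.0116 s/frame is assumed (~11.6 ms).
worst_frame = trace.argmin(dim=1)   # frame index with the lowest local score
print(f"Lowest local score near t = {worst_frame.item() * 0.0116:.2f}s")
```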
## Input Specifications

- `music_mels`: `(Batch, 80, Time)` - Mel spectrogram of the music
- `chart_mels`: `(Batch, 80, Time)` - Mel spectrogram of synthesized chart audio (click sounds at note positions)

Both inputs should be normalized and share the same temporal dimension.
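For reference, here is a hedged sketch of how `chart_mels` might be produced from note timestamps. The click shape, sample rate, hop length, log compression, and normalization are all assumptions (chosen so one frame is ~11.6 ms, matching the resolution cited under Analysis); the actual synthesis used in training is not documented here:

```python
import torch
import torchaudio

SR, N_MELS, HOP = 22050, 80, 256  # assumed; 256 / 22050 ≈ 11.6 ms per frame

def chart_to_mels(note_times_s, num_frames):
    # Synthesize click audio: a short 5 ms burst at each note position.
    audio = torch.zeros(num_frames * HOP)
    click = torch.hann_window(int(0.005 * SR))
    for t in note_times_s:
        i = int(t * SR)
        n = min(len(click), max(0, len(audio) - i))
        audio[i:i + n] += click[:n]
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=SR, n_mels=N_MELS, hop_length=HOP
    )(audio)                                        # (80, T')
    mel = torch.log1p(mel)                          # log compression (assumed)
    mel = (mel - mel.mean()) / (mel.std() + 1e-8)   # normalization (assumed)
    return mel[:, :num_frames].unsqueeze(0)         # (1, 80, T)
```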
## Output

- `forward()`: `(Batch,)` - Single quality score per sample in range [0, 1]
- `predict_trace()`: `(Batch, Time)` - Per-frame quality scores for interpretability
## Model Configuration

| Parameter | Default | Description |
|---|---|---|
| `input_dim` | 80 | Mel spectrogram frequency bins |
| `d_model` | 128 | Hidden dimension |
| `n_layers` | 4 | Number of residual blocks |
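Assuming the constructor exposes these names directly (not verified against the released code), a custom configuration might look like:

```python
# Hypothetical constructor call; parameter names taken from the table above.
model = GameChartEvaluator(input_dim=80, d_model=128, n_layers=4)
```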
## Training

The model was trained to detect misaligned or poorly synchronized rhythm game charts by comparing clean music-chart pairs against versions with various synthetic corruptions (time shifts, random note placement, etc.).
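A minimal sketch of what such corruptions could look like, using the negative classes that appear in the evaluation below; the shift magnitudes are taken from the shift-detection table, everything else is an assumption:

```python
import random

def corrupt(note_times_s, duration_s, kind):
    """Generate a negative example from an aligned chart (illustrative)."""
    if kind == "shift":
        # Shift every note by a fixed offset (magnitudes from the eval table).
        mag = random.choice([0.01, 0.02, 0.03, 0.05, 0.10, 0.20, 0.30, 0.50])
        offset = mag if random.random() < 0.5 else -mag
        return [t + offset for t in note_times_s]
    if kind == "random":
        # Replace the chart with uniformly random note placements.
        return sorted(random.uniform(0, duration_s) for _ in note_times_s)
    # "mismatch" (pairing the music with another song's chart) is handled at
    # the dataset level rather than per-chart.
    return list(note_times_s)
```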
## Evaluation Results

Evaluation was performed on 2,204 test samples across a range of segment durations, with the model's severity parameter set to 0.56.
### Overall Accuracy by Segment Duration

Columns after *Overall* give per-class accuracy: *Positive* (correctly aligned pairs) and the three corruption types (*Shift*, *Random*, *Mismatch*).
| Duration | Overall | Positive | Shift | Random | Mismatch |
|---|---|---|---|---|---|
| 5s | 81.85% | 95.69% | 79.04% | 97.41% | 97.41% |
| 10s | 83.35% | 96.55% | 80.60% | 97.41% | 100.00% |
| 20s | 84.66% | 96.55% | 82.06% | 99.14% | 100.00% |
| 30s | 85.30% | 95.69% | 82.81% | 100.00% | 100.00% |
| 60s | 85.98% | 95.69% | 83.62% | 100.00% | 100.00% |
| 120s | 86.25% | 94.83% | 84.00% | 100.00% | 100.00% |
| 180s | 85.57% | 94.83% | 83.19% | 100.00% | 100.00% |
### Shift Detection by Offset (120s segments)
| Offset | Accuracy | Offset | Accuracy |
|---|---|---|---|
| -0.50s | 91.38% | +0.50s | 92.24% |
| -0.30s | 89.66% | +0.30s | 89.66% |
| -0.20s | 91.38% | +0.20s | 94.83% |
| -0.10s | 100.00% | +0.10s | 100.00% |
| -0.05s | 100.00% | +0.05s | 100.00% |
| -0.03s | 91.38% | +0.03s | 95.69% |
| -0.02s | 84.48% | +0.02s | 88.79% |
| -0.01s | 20.69% | +0.01s | 13.79% |
## Analysis
The performance pattern follows directly from the model's architectural constraints:
- **Resolution Limit (±0.01s)**: Accuracy drops sharply because a 10 ms shift is smaller than the model's temporal resolution (~11.6 ms per frame), so sub-frame timing differences are effectively impossible for the convolutional encoder to resolve.
- **Optimal Zone (±0.05s to ±0.20s)**: The model achieves 100% accuracy here. These shifts are large enough to be resolved but small enough to fit within the ~0.36s half-receptive field. The model can simultaneously "see" the music beat and the misaligned note, enabling a direct and precise comparison.
- **Field Boundary (±0.30s to ±0.50s)**: Accuracy dips slightly (to ~90%). A 0.50s shift often pushes the note outside the receptive field of its corresponding music beat. The model can no longer compare them directly; instead, it must rely on detecting "a note without a corresponding beat" or vice versa, which is a harder inference task (and prone to errors if the shift lands on a different valid beat).
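As a sanity check, the numbers above are mutually consistent under an assumed 256-sample hop at 22050 Hz (only the ~11.6 ms frame duration is actually stated):

```python
HOP, SR = 256, 22050                 # assumed hop length and sample rate
frame_s = HOP / SR                   # ≈ 0.0116 s per frame (~11.6 ms)
rf_frames = 63                       # receptive field quoted above
print(f"{frame_s * rf_frames:.2f}s total context")      # ≈ 0.73 s
print(f"{frame_s * (rf_frames // 2):.2f}s half-field")   # ≈ 0.36 s
```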