Overview

This is an experimental model trained to reconstruct a plausible chain of reasoning that connects a user-provided INSTRUCTION to a fixed SOLUTION while leaving the original answer unchanged. It focuses entirely on the route rather than rewriting the destination, producing stepwise “thinking” traces that align with the target output. The goal is to enable reasoning backfill for legacy or non-reasoning datasets, such as older chat logs or instruction corpora, where collecting thought processes directly is impractical. These traces can help bootstrap process-supervision signals, support teacher-style models, and make output behavior easier to audit.

I would love to try this with larger or more stylized models to see how feasible it is to produce stylized or limited-domain reasoning traces. If you’re a compute partner interested in scaling this line of work to larger backfill models, let’s talk.

Train setup

Base: Qwen/Qwen3-4B
Hardware: 1× H100
Epochs: 4 · Cosine schedule · 40 warmup steps
Optimizer: adamw_bnb_8bit · lr 2.5e-5

Intended Uses

Dataset augmentation: generate process-supervision style traces for older instruction pairs.
Teacher bootstrapping: seed trace-rich examples to train or distill teachers.
Analysis tooling: produce rationales for audit of solution adherence.
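
For the dataset-augmentation use case, a minimal sketch of attaching a generated trace to a legacy record might look like the following. It assumes the completion continues from the prefilled `<|thinking_start|>` marker and restates the solution after the end marker; the exact output layout, the helper names, and the `reasoning` / `solution_preserved` field names are illustrative assumptions, not part of the released model or datasets.

```python
import re

# Assumption: the end marker may appear as '<|thinking_end|>' or as
# '<|thinking_end>' (the spelling used in the system prompt), so accept both.
THINKING_END = re.compile(r"<\|thinking_end\|?>")
THINKING_START = "<|thinking_start|>"


def split_completion(completion: str) -> tuple[str, str]:
    """Split a raw completion into (trace, restated_solution)."""
    text = completion.strip()
    # The prompt may already end with the start marker; drop a repeated one.
    if text.startswith(THINKING_START):
        text = text[len(THINKING_START):]
    match = THINKING_END.search(text)
    if match is None:
        raise ValueError("no thinking end marker found in completion")
    return text[:match.start()].strip(), text[match.end():].strip()


def augment_record(record: dict, completion: str) -> dict:
    """Attach a reconstructed trace to a legacy instruction/solution pair."""
    trace, restated = split_completion(completion)
    return {
        **record,
        "reasoning": trace,
        # Flag records where the model drifted from the fixed solution.
        "solution_preserved": restated == record["solution"].strip(),
    }
```

Records with `solution_preserved` set to false are natural candidates to drop or regenerate, since the point of backfill is that the original answer stays fixed.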

Limitations

Traces are plausible reconstructions, not ground truth cognition.
The model can over-rationalize if the solution is underspecified.

Prompting · ChatML

Example Prompt

<|im_start|>system
Your role as an assistant involves thoroughly reconstructing a plausible reasoning process that leads from a user-provided INSTRUCTION to a user-provided SOLUTION. You must not alter the SOLUTION. Each step should include concrete decisions and validations, such as: interpreting the INSTRUCTION, extracting constraints, selecting an approach, justifying key choices, verifying that intermediate results remain consistent with the provided SOLUTION, refining any errors, and a final consistency check noting any residual ambiguities. Use domain-appropriate specifics and avoid filler. Do not introduce new facts that would change the SOLUTION. Now, given an INSTRUCTION and a SOLUTION, reconstruct the Thought and present the Solution per the above guidelines. Your reasoning should begin and end with '<|thinking_start|>' and '<|thinking_end>' respectively.
<|im_end|>
<|im_start|>user
<|instruction_start|>
What are some tips for reducing stress at work? Your response should contain at least 4 bullet points.
Use markdown bullets like "* point". Include the keyword "mindfulness" twice.
<|instruction_end|>

<|solution_start|>
‐ Practice mindfulness during breaks ...
‐ Prioritize tasks and set boundaries ...
‐ Incorporate mindfulness into routine activities ...
‐ Stay physically active ...
<|solution_end|>
<|im_end|>
<|im_start|>assistant
<|thinking_start|>
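
The prompt layout above can be assembled programmatically. The sketch below mirrors the example line for line; the function name is hypothetical, and the exact whitespace around the markers is an assumption read off the example rather than a confirmed detail of the training data.

```python
def build_backfill_prompt(system_prompt: str, instruction: str, solution: str) -> str:
    """Assemble the ChatML prompt, leaving the assistant turn open for the trace."""
    user_turn = (
        "<|instruction_start|>\n"
        f"{instruction.strip()}\n"
        "<|instruction_end|>\n"
        "\n"
        "<|solution_start|>\n"
        f"{solution.strip()}\n"
        "<|solution_end|>"
    )
    return (
        f"<|im_start|>system\n{system_prompt.strip()}\n<|im_end|>\n"
        f"<|im_start|>user\n{user_turn}\n<|im_end|>\n"
        # Prefill the assistant turn so generation starts inside the trace.
        "<|im_start|>assistant\n"
        "<|thinking_start|>\n"
    )
```

Pass the system prompt from the example unchanged as `system_prompt`; the solution text is inserted verbatim (apart from trimming surrounding whitespace) so the model sees exactly the answer it must preserve.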

Recommended Sampling

temperature: 0.7–1.0
top_p: 0.9
min_p: 0.05
max_tokens: as needed for trace length
stop: ["<|im_end|>"]

For tighter adherence, drop temperature toward 0.5–0.7.
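
As a rough sketch, the settings above can be kept in one helper so the looser and tighter variants stay consistent. The function name, the specific temperatures (0.8 and 0.6), and the `max_tokens` default are illustrative picks within the recommended ranges, not values from this card. Whether `min_p` is accepted depends on the inference backend.

```python
def sampling_params(strict: bool = False, max_tokens: int = 1024) -> dict:
    """Return sampling settings within the recommended ranges.

    The temperatures and the max_tokens default are illustrative assumptions.
    """
    return {
        # Lower temperature when tighter adherence to the solution is needed.
        "temperature": 0.6 if strict else 0.8,
        "top_p": 0.9,
        "min_p": 0.05,  # not every backend accepts min_p
        "max_tokens": max_tokens,
        "stop": ["<|im_end|>"],
    }
```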

Quantizations

GGUF: coming soon

Axolotl config

```yaml
base_model: Qwen/Qwen3-4B
hub_model_id: joeyzero/Qwen3-4B-Reasoning-Backfill-V0.1
hf_use_auth_token: true

load_in_8bit: false
load_in_4bit: false
strict: false

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2.5e-5
max_grad_norm: 1.0
bf16: auto
tf32: false

datasets:
  - path: joeyzero/OpenThought-144k-Backfill-0.2
    type: chat_template
    field_messages: messages
  - path: joeyzero/dolphin-r1-backfill-0.0.2
    type: chat_template
    field_messages: messages
chat_template: chatml
dataset_prepared_path: prepared_data2
output_dir: ./thinking-backfill-0.1.17

sequence_len: 1024
sample_packing: true
pad_to_sequence_len: true
xformers_attention:
flash_attention: true

warmup_steps: 40
save_steps: 0.5
weight_decay: 0.02

wandb_project: reasoning-backfill
wandb_name: reasoning-backfill-attempt-04
```
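
For orientation, the effective batch size implied by this config can be worked out directly. The sketch below uses the config values plus the single GPU listed under Train setup; the variable names are only for illustration.

```python
micro_batch_size = 2
gradient_accumulation_steps = 2
num_gpus = 1  # from the Train setup section (1× H100)
sequence_len = 1024

# Packed sequences consumed per optimizer step.
sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus

# With sample_packing and pad_to_sequence_len, each packed sequence holds up to
# sequence_len tokens, so this is an upper bound on tokens per optimizer step.
max_tokens_per_step = sequences_per_step * sequence_len

print(sequences_per_step, max_tokens_per_step)
```

This gives 4 packed sequences and at most 4096 tokens per optimizer step.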

Made by joeyzero · contributions and issues welcome.