ForgeWM: A Reproducible Training Recipe for Action-Controllable World Models

Train a real-time, playable Minecraft world model on 8 GPUs — keyboard & mouse control, fully open and reproducible.

Abstract

We present ForgeWM, an integration project that provides the first fully open and reproducible training recipe for the Matrix-Game 2 (MG2) lineage. We do not introduce new methods, models, or data. Instead, we connect MG2's game-native I2V backbone, GameFactory's open Minecraft gameplay data, and the Causal Forcing distillation paradigm into an end-to-end 4-stage pipeline that is trainable on 8 GPUs. MG2 open-sourced its weights but not its training data or training code, leaving the community unable to reproduce the MG2 recipe; ForgeWM fills this gap. At 4-step inference, our final model achieves generation quality comparable to MG2's distilled checkpoint while fixing key failure modes (underwater drift artifacts), and we release the complete training code, data, and checkpoints.

Key Contributions

Fully Reproducible

Complete training code, pre-encoded data (89 GB LMDB), and checkpoints (Stage 0 + Stage 3) are publicly available on Hugging Face.

4-Stage Distillation

Bidirectional SFT → Teacher-Forcing AR → Consistency Distillation → DMD, building on the Causal Forcing paradigm.

Game-Native Control

Discrete keyboard input (WASD) is injected via cross-attention and continuous mouse input via channel concatenation, following MG2's hybrid action module.

Open Data Pipeline

Trained entirely on GameFactory's GF-Minecraft (~70h of gameplay) with a balanced action distribution. No proprietary data needed.

Training Pipeline

4-stage progressive distillation: from a bidirectional diffusion model to a real-time causal generator.

Stage	Method	Steps	Note
Stage 0	Bidirectional SFT	4000	Domain adaptation
Stage 1	Teacher-Forcing Causal AR	10000	Teacher forcing
Stage 2	Causal Consistency Distillation	6000	Consistency distillation
Stage 3	DMD	2000	4-step real-time generation

# Full pipeline on 8 GPUs (~22K steps total)
torchrun --nproc_per_node=8 train.py --config configs/stage0_bid_sft.yaml
torchrun --nproc_per_node=8 train.py --config configs/stage1_teacher_forcing.yaml
torchrun --nproc_per_node=8 train.py --config configs/stage2_consistency_distillation.yaml
torchrun --nproc_per_node=8 train.py --config configs/stage3_dmd.yaml
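
The step budget quoted above can be sanity-checked with a short sketch. The stage names and step counts are taken from the pipeline description; `STAGES`, `total_steps`, and `share_of_budget` are hypothetical names used only for illustration and are not part of the released code.

```python
# Illustrative only: these names are hypothetical and not part of ForgeWM.
# Step counts are copied from the training pipeline description.
STAGES = [
    ("stage0_bid_sft", 4000),
    ("stage1_teacher_forcing", 10000),
    ("stage2_consistency_distillation", 6000),
    ("stage3_dmd", 2000),
]


def total_steps(stages):
    """Sum the training steps across all stages."""
    return sum(steps for _, steps in stages)


def share_of_budget(stages):
    """Return each stage's fraction of the total step budget."""
    total = total_steps(stages)
    return {name: steps / total for name, steps in stages}


if __name__ == "__main__":
    print("total steps:", total_steps(STAGES))
    for name, fraction in share_of_budget(STAGES).items():
        print(f"{name}: {fraction:.1%}")
```

The four stages sum to 22,000 steps, matching the "~22K steps total" comment, and the teacher-forcing stage accounts for the largest share of that budget.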

Architecture

I2V Conditioning (First-Frame Fidelity)

ForgeWM uses Matrix-Game 2's three-pathway image conditioning:

Channel concatenation: The first frame is VAE-encoded and concatenated with the noise latent along the channel axis (4-channel mask + 16-channel latent = 20 conditioning channels).
CLIP visual context: 257-token CLIP sequence injected via cross-attention at every transformer block.
Causal history: Previously generated clean frames cached in KV store during autoregressive rollout.
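
The channel arithmetic of the first pathway can be sketched as pure shape bookkeeping. The 4-channel mask, the 16-channel latent, and the 4x temporal compression come from this page. The 8x spatial compression, the 16-channel noise latent, and the `1 + (T - 1) // 4` frame count are assumptions made for illustration, as are all function names.

```python
# Shape bookkeeping only; no real VAE or model is involved.
MASK_CHANNELS = 4          # from the text
LATENT_CHANNELS = 16       # from the text
TEMPORAL_COMPRESSION = 4   # from the text (VAE temporal compression)
SPATIAL_COMPRESSION = 8    # ASSUMPTION: not stated on this page
NOISE_CHANNELS = 16        # ASSUMPTION: noise shares the latent channel count


def latent_grid(num_frames, height, width):
    """Latent (frames, height, width) for a clip, under the assumptions above."""
    latent_frames = 1 + (num_frames - 1) // TEMPORAL_COMPRESSION  # ASSUMPTION
    return (
        latent_frames,
        height // SPATIAL_COMPRESSION,
        width // SPATIAL_COMPRESSION,
    )


def conditioning_shape(num_frames, height, width):
    """Shape of the mask + first-frame latent conditioning tensor."""
    t, h, w = latent_grid(num_frames, height, width)
    return (MASK_CHANNELS + LATENT_CHANNELS, t, h, w)


def model_input_shape(num_frames, height, width):
    """Shape after concatenating the noise latent with the conditioning."""
    channels, t, h, w = conditioning_shape(num_frames, height, width)
    return (NOISE_CHANNELS + channels, t, h, w)


if __name__ == "__main__":
    print(conditioning_shape(57, 352, 640))
    print(model_input_shape(57, 352, 640))
```

Under these assumptions, a 57-frame clip at 352×640 gives 20 conditioning channels on a 15×44×80 latent grid.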

Action Injection (Hybrid Design)

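Keyboard (discrete): Cross-attention. Actions are embedded and injected into the latent tokens through cross-attention at each block.
Mouse (continuous): Sliding-window grouping plus channel concatenation. Per-frame mouse movements are grouped to match the VAE's 4x temporal compression, then concatenated along the channel axis.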
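
A minimal sketch of the mouse grouping step, assuming that latent frame `i` aligns with video frame `i * 4` and that its window covers the 4 video frames ending there, with the start of the clip padded by repeating the first action. The alignment, the padding rule, and the name `group_mouse_actions` are assumptions for illustration; the page only states that grouping follows the 4x temporal compression.

```python
def group_mouse_actions(mouse, compression=4):
    """Group per-frame (dx, dy) mouse deltas into one vector per latent frame.

    ASSUMPTIONS (not confirmed by the source): latent frame i aligns with
    video frame i * compression, its window covers the `compression` video
    frames ending there, and frames before the clip start reuse the first action.
    """
    if not mouse:
        return []
    num_latents = 1 + (len(mouse) - 1) // compression
    grouped = []
    for i in range(num_latents):
        end = i * compression  # video frame aligned with latent frame i
        window = []
        for frame in range(end - compression + 1, end + 1):
            dx, dy = mouse[max(frame, 0)]  # clamp to pad the clip start
            window.extend((dx, dy))
        grouped.append(window)
    return grouped


if __name__ == "__main__":
    # Nine video frames of synthetic mouse deltas: (i, 10 * i) at frame i.
    deltas = [(i, 10 * i) for i in range(9)]
    for index, features in enumerate(group_mouse_actions(deltas)):
        print(index, features)
```

Each latent frame then carries `2 * compression` mouse values, which is the per-frame feature vector that would be concatenated along the channel axis.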

Temporal: Block-wise Causal Attention

Frames are grouped into chunks of 3: attention is bidirectional within a chunk and strictly causal across chunks.
Sliding window (size=6): each chunk attends only to the 6 most recent frames, so generation length is unbounded while attention memory stays constant.
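
The attention pattern described above can be sketched as a boolean mask over frames. This sketch assumes the 6-frame window is counted back from the last frame of the current chunk, so it includes the current chunk itself; the page does not specify this, and `blockwise_causal_mask` is a hypothetical name.

```python
def blockwise_causal_mask(num_frames, chunk_size=3, window=6):
    """Return mask[i][j] == True when frame i may attend to frame j.

    ASSUMPTION: the window is counted back from the end of frame i's chunk,
    so it includes the current chunk (the source does not specify this).
    """
    mask = []
    for i in range(num_frames):
        # Last frame index of the chunk that contains frame i.
        chunk_end = (i // chunk_size + 1) * chunk_size - 1
        window_start = max(chunk_end - window + 1, 0)
        last_visible = min(chunk_end, num_frames - 1)
        mask.append(
            [window_start <= j <= last_visible for j in range(num_frames)]
        )
    return mask


if __name__ == "__main__":
    for row in blockwise_causal_mask(9):
        print("".join("#" if allowed else "." for allowed in row))
```

With chunks of 3 and a window of 6, every frame sees its own chunk in full (bidirectional within the chunk), at most one earlier chunk, and never a later chunk. No row has more than 6 visible frames, which is what keeps the cache size bounded during long rollouts.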

Results

ForgeWM (4-step DMD, Causal Forcing) vs Matrix-Game 2 (Self-Forcing distilled). Same reference frame, same action, 352×640.

Side-by-side video comparisons (MG2 vs ForgeWM), one per scene and action:

Forest: turn right
Plains: forward
Cave: forward
Desert: back
Rainy Sunset: random
Rainy Night: forward

Observations

Overall quality: ForgeWM largely reproduces MG2's generation quality. Temporal smoothness is slightly better, while texture detail is slightly weaker (smaller training set: ~70h vs MG2's proprietary data).
"Underwater" artifact fixed: MG2 drifts into ocean textures in rainy and dark scenes (over-represented in its proprietary data). ForgeWM does not exhibit this.
Action controllability: Both models respond correctly to keyboard/mouse. Causal Forcing preserves action fidelity through all 4 stages.
HUD stability: MG2's HUD gradually shrinks over rollout (resolution mismatch). ForgeWM does not show this artifact.

Comparison

Project	Base Model	Control	Paradigm	I2V	Open Data	Open Training Code
ForgeWM	Wan2.1-1.3B	Keyboard + Mouse	Causal Forcing	✓	✓	✓
Matrix-Game 2	Wan2.1-1.3B	Keyboard + Mouse	Self Forcing	✓	✗	✗
minWM	HY1.5 / Wan2.1	Camera pose	Causal Forcing	HY only	✓	✓
