ForgeWM

A Reproducible Training Recipe for Action-Controllable World Models

ForgeWM Team

Train a real-time, playable Minecraft world model on 8 GPUs — keyboard & mouse control, fully open and reproducible.

Abstract

We present ForgeWM, an integration project that provides the first fully open and reproducible training recipe for the Matrix-Game 2 (MG2) lineage. We do not introduce new methods, models, or data. Instead, we connect MG2's game-native I2V backbone, GameFactory's open Minecraft gameplay data, and the Causal Forcing distillation paradigm into an end-to-end 4-stage pipeline that is trainable on 8 GPUs. MG2 open-sourced its weights but not its training data or training code, leaving the community unable to reproduce the MG2 recipe; ForgeWM fills this gap. At 4-step inference, our final model achieves generation quality comparable to MG2's distilled checkpoint while fixing key failure modes (underwater drift artifacts), and we release the complete training code, data, and checkpoints.

Key Contributions

Fully Reproducible

Complete training code, pre-encoded data (89 GB LMDB), and checkpoints (Stage 0 + Stage 3) publicly available on HuggingFace.

4-Stage Distillation

Bidirectional SFT → Teacher-Forcing AR → Consistency Distillation → DMD, building on the Causal Forcing paradigm.

Game-Native Control

Discrete keyboard (WASD) via cross-attention + continuous mouse via channel concatenation — MG2's hybrid action module.

Open Data Pipeline

Trained entirely on GameFactory's GF-Minecraft (~70h gameplay) with balanced action distribution. No proprietary data needed.

Training Pipeline

4-stage progressive distillation: from a bidirectional diffusion model to a real-time causal generator.

  Stage 0. Bid. SFT (4000 steps): domain adaptation
  Stage 1. TF Causal AR (10000 steps): teacher forcing
  Stage 2. Causal CD (6000 steps): consistency distillation
  Stage 3. DMD (2000 steps): 4-step real-time

# Full pipeline on 8 GPUs (~22K steps total)
torchrun --nproc_per_node=8 train.py --config configs/stage0_bid_sft.yaml
torchrun --nproc_per_node=8 train.py --config configs/stage1_teacher_forcing.yaml
torchrun --nproc_per_node=8 train.py --config configs/stage2_consistency_distillation.yaml
torchrun --nproc_per_node=8 train.py --config configs/stage3_dmd.yaml
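
Because each stage starts from the previous stage's output, it can help to run the four commands in order and stop at the first failure. The following is a minimal sketch of such a wrapper, not part of the released code: the config paths and step counts are copied from this page, while `build_commands`, `total_steps`, and `run_pipeline` are hypothetical helper names.

```python
import subprocess

# (config path, training steps) per stage, copied from the pipeline above.
STAGES = [
    ("configs/stage0_bid_sft.yaml", 4000),
    ("configs/stage1_teacher_forcing.yaml", 10000),
    ("configs/stage2_consistency_distillation.yaml", 6000),
    ("configs/stage3_dmd.yaml", 2000),
]


def build_commands(nproc_per_node=8):
    """Return one torchrun argument list per stage, in training order."""
    return [
        ["torchrun", f"--nproc_per_node={nproc_per_node}",
         "train.py", "--config", config]
        for config, _ in STAGES
    ]


def total_steps():
    """Sum of the per-stage step counts."""
    return sum(steps for _, steps in STAGES)


def run_pipeline(dry_run=True):
    """Print each command; when dry_run is False, also execute it."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later stages never start after a failed one.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` launches them. `total_steps()` gives 22000, matching the "~22K steps" figure in the comment above.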

Architecture

I2V Conditioning (First-Frame Fidelity)

ForgeWM uses Matrix-Game 2's three-pathway image conditioning:

  1. Channel-concat: The first frame is VAE-encoded and concatenated with the noise along the channel axis (4-channel mask | 16-channel latent = 20 channels).
  2. CLIP visual context: 257-token CLIP sequence injected via cross-attention at every transformer block.
  3. Causal history: Previously generated clean frames cached in KV store during autoregressive rollout.
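
To make the channel arithmetic of the first pathway concrete, the sketch below computes the shape of the 20-channel conditioning tensor for a given clip size. The 4 and 16 channel counts come from the list above. The 4x temporal and 8x spatial compression factors are an assumption taken from the Wan2.1 video VAE, and the function names and the 81-frame clip length in the usage note are illustrative only.

```python
# Assumed VAE compression factors (Wan2.1 video VAE); not stated on this page.
TEMPORAL_STRIDE = 4
SPATIAL_STRIDE = 8

MASK_CHANNELS = 4     # from the text: 4-channel mask
LATENT_CHANNELS = 16  # from the text: 16-channel first-frame latent


def conditioning_shape(num_frames, height, width):
    """Return (channels, latent_frames, latent_h, latent_w)."""
    if (num_frames - 1) % TEMPORAL_STRIDE != 0:
        raise ValueError("num_frames is assumed to have the form 4k + 1")
    if height % SPATIAL_STRIDE or width % SPATIAL_STRIDE:
        raise ValueError("height and width must be divisible by 8")
    latent_frames = (num_frames - 1) // TEMPORAL_STRIDE + 1
    return (
        MASK_CHANNELS + LATENT_CHANNELS,
        latent_frames,
        height // SPATIAL_STRIDE,
        width // SPATIAL_STRIDE,
    )


def first_frame_mask(latent_frames):
    """Temporal mask: 1 where a conditioning frame is given, 0 elsewhere."""
    return [1] + [0] * (latent_frames - 1)
```

Under these assumptions, a hypothetical 81-frame clip at the 352×640 resolution used in the results gives `conditioning_shape(81, 352, 640) == (20, 21, 44, 80)`.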

Action Injection (Hybrid Design)
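
The hybrid design treats the two input types differently: discrete keyboard keys (WASD) enter through cross-attention, while continuous mouse motion is concatenated along the channel axis. As a rough illustration of the inputs only, not of the attention or concatenation layers, a per-frame action could be represented as below. The multi-hot layout, the key order, and the function names are assumptions made for this sketch and are not taken from the MG2 or ForgeWM code.

```python
KEYS = ("W", "A", "S", "D")  # key order is an arbitrary choice for this sketch


def encode_keyboard(pressed):
    """Multi-hot vector over WASD: the discrete part of the action."""
    unknown = set(pressed) - set(KEYS)
    if unknown:
        raise ValueError(f"unsupported keys: {sorted(unknown)}")
    return [1 if key in pressed else 0 for key in KEYS]


def encode_action(pressed, mouse_dx, mouse_dy):
    """Bundle the discrete keys with the continuous mouse deltas."""
    return {
        "keyboard": encode_keyboard(pressed),          # discrete branch
        "mouse": (float(mouse_dx), float(mouse_dy)),   # continuous branch
    }
```

For example, `encode_action({"W", "D"}, 0.5, -1)` yields `{"keyboard": [1, 0, 0, 1], "mouse": (0.5, -1.0)}`. Keeping the two parts separate mirrors the split between the cross-attention and channel-concatenation paths.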

Temporal: Block-wise Causal Attention
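
In block-wise causal attention, frames are grouped into blocks: a token may attend to every token in its own block and in earlier blocks, but not to later blocks. A minimal sketch of the resulting boolean mask follows. The function name and all three parameters are hypothetical, and the block size used by ForgeWM is not stated on this page.

```python
def block_causal_mask(num_frames, tokens_per_frame, frames_per_block):
    """Return mask[q][k], True where query token q may attend to key token k."""
    num_tokens = num_frames * tokens_per_frame

    def block_of(token):
        # Token index -> frame index -> block index.
        return (token // tokens_per_frame) // frames_per_block

    return [
        [block_of(k) <= block_of(q) for k in range(num_tokens)]
        for q in range(num_tokens)
    ]
```

With `block_causal_mask(4, 2, 2)`, the four tokens of the first block see only each other, while tokens of the second block see all eight tokens. Setting `frames_per_block=1` reduces this to frame-level causal attention.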

Results

ForgeWM (4-step DMD, Causal Forcing) vs Matrix-Game 2 (Self-Forcing distilled). Same reference frame, same action, 352×640.

[Video grid: MG2 and ForgeWM side by side for six scenes: Forest (Turn Right), Plains (Forward), Cave (Forward), Desert (Back), Rainy Sunset (Random), Rainy Night (Forward).]

Observations

Comparison

Project        Base Model      Control      Paradigm        I2V      Data Open  Train Code
ForgeWM        Wan2.1-1.3B     KB + Mouse   Causal Forcing  Yes      Yes        Yes
Matrix-Game 2  Wan2.1-1.3B     KB + Mouse   Self Forcing    Yes      No         No
minWM          HY1.5 / Wan2.1  Camera pose  Causal Forcing  HY only

Citation

@misc{forgewm2026,
  title={ForgeWM: A Reproducible Training Recipe for Action-Controllable World Models},
  author={ForgeWM Team},
  year={2026},
  url={https://github.com/asdfo123/ForgeWM}
}

Acknowledgements

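ForgeWM integrates work from multiple research groups: Matrix-Game 2 (game-native I2V backbone), GameFactory (GF-Minecraft gameplay data), and Causal Forcing (distillation paradigm).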

We also thank the authors of:

Contact

Xinye Li: leeasdfo123@gmail.com
