JEPA GTA5 World Generation

What I worked on

Planned to use JEPA (Joint-Embedding Predictive Architecture) to generate new world frames from a GTA5 driving dataset. Explored whether JEPA learns by masking rather than by autoregression, and how to pick target frames.

What I noticed

  • JEPA learns context-to-target prediction, similar to BERT's masked prediction, rather than next-step sequence prediction
  • Latent space can be manipulated to generate new variations
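The difference between masking-style and sequence-style prediction can be sketched by how context and target frames are chosen. This is a minimal illustration, not the dataset pipeline used here: the function names, the clip length of 8 frames, and the uniform random choice of targets are all assumptions made for the example.

```python
import random


def split_context_target(num_frames, num_targets, seed=0):
    """Masking-style split: hide `num_targets` frames, keep the rest as context."""
    if not 0 < num_targets < num_frames:
        raise ValueError("num_targets must be between 1 and num_frames - 1")
    rng = random.Random(seed)
    # Hypothetical policy: targets are drawn uniformly at random from the clip.
    target = sorted(rng.sample(range(num_frames), num_targets))
    hidden = set(target)
    context = [i for i in range(num_frames) if i not in hidden]
    return context, target


def autoregressive_pairs(num_frames):
    """Sequence-style pairs, for contrast: frames 0..t-1 predict frame t."""
    return [(list(range(t)), t) for t in range(1, num_frames)]


context, target = split_context_target(num_frames=8, num_targets=3, seed=0)
print("context frames:", context)
print("target frames:", target)
print("autoregressive pairs:", autoregressive_pairs(4))
```

In the masking split, targets can sit anywhere in the clip, including between context frames. In the autoregressive pairs, the target always comes after everything the model has seen.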

“Aha” Moment

JEPA learns latent representations by predicting the embeddings of target frames from context, not by reconstructing pixels or predicting pixel sequences.
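A toy sketch of where the loss is computed may make this concrete. Everything here is an assumption for illustration: the fixed random projections stand in for learned networks, the frames are 16 random numbers, and a single encoder is shared for brevity, whereas a real JEPA typically keeps a separate target encoder.

```python
import random


def make_projection(in_dim, out_dim, seed):
    """Fixed random linear map standing in for a learned network (toy assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


PIXELS, LATENT = 16, 4
encoder = make_projection(PIXELS, LATENT, seed=0)
predictor = make_projection(LATENT, LATENT, seed=1)

rng = random.Random(2)
context_frame = [rng.random() for _ in range(PIXELS)]  # flattened toy frame
target_frame = [rng.random() for _ in range(PIXELS)]

# JEPA-style objective: predict the target's embedding from the context's embedding.
predicted_latent = apply(predictor, apply(encoder, context_frame))
target_latent = apply(encoder, target_frame)
latent_loss = mse(predicted_latent, target_latent)

# Pixel-style objective, for contrast: compare frames directly. The context
# frame is used as a naive "prediction" because this sketch has no decoder.
pixel_loss = mse(context_frame, target_frame)

print(f"latent loss over {LATENT} dims: {latent_loss:.4f}")
print(f"pixel loss over {PIXELS} values: {pixel_loss:.4f}")
```

The latent loss never touches pixel values on the prediction side, so nothing in this objective forces the model to be able to draw a frame.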

What still feels messy

How to map latent manipulations to specific visual or motion changes.
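One possible way to probe this, offered as a sketch rather than an answer, is to estimate a direction in latent space as the difference between the mean latents of two labeled groups of frames, then shift a latent along it. The "turning" and "straight" groups, the 3-dimensional latents, and the function names are all made up for the example. Turning a shifted latent into a visible change would still need a decoder or a nearest-neighbor lookup, which is not shown.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def attribute_direction(latents_with, latents_without):
    """Difference of group means: a rough direction for one attribute."""
    mean_with = mean_vector(latents_with)
    mean_without = mean_vector(latents_without)
    return [a - b for a, b in zip(mean_with, mean_without)]


def shift(latent, direction, strength):
    """Move a latent along a direction by a chosen strength."""
    return [z + strength * d for z, d in zip(latent, direction)]


# Hypothetical latents for frames labeled "turning" and "straight".
turning = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
straight = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = attribute_direction(turning, straight)
print(direction)  # [2.0, 0.0, 0.0]
print(shift([0.5, 1.0, 2.0], direction, strength=0.5))  # [1.5, 1.0, 2.0]
```

Whether such a direction corresponds to a clean visual or motion change is exactly the open question; the sketch only shows how a candidate direction could be computed.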

Next step

Train a small JEPA variant on a subset of GTA5 frames to test how well it predicts masked target frames in latent space.