JEPA GTA5 World Generation
• 1 min read 1 min
What I worked on
Planned to use JEPA for generating new world frames from a GTA5 driving dataset. Explored whether JEPA learns through masking rather than autoregression and how to pick target frames.
What I noticed
- JEPA learns context-to-target prediction like BERT, not sequence prediction
- Latent space can be manipulated to generate new variations
”Aha” Moment
That JEPA focuses on learning latent representations through reconstruction, not by predicting pixel sequences.
What still feels messy
How to map latent manipulations to specific visual or motion changes.
Next step
Train a small JEPA variant on a subset of GTA5 frames to test reconstruction quality.