Google’s Genie model creates interactive 2D worlds from a single image

Following generative AI models for text, images, video, and audio, Google DeepMind's recently unveiled Genie model tackles a new frontier: interactive 2D worlds.

DeepMind’s Genie announcement page shows plenty of sample GIFs of simple platform-style games generated from static starting images (children’s sketches, real-world photographs, etc.) or even text prompts passed through Imagen 2. And while those slick-looking GIFs gloss over some major current limitations that are discussed in the full research paper, AI researchers are still excited about how Genie’s generalizable “foundational world modeling” could help supercharge machine learning going forward.

Under the hood

While Genie’s output looks similar at a glance to what might come from a basic 2D game engine, the model doesn’t actually draw sprites and code a playable platformer in the same way a human game developer might. Instead, the system treats its starting image (or images) as frames of a video and generates a best guess at what the entire next frame (or frames) should look like when given a specific input.
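That frame-by-frame process can be sketched as a simple loop. This is a hypothetical illustration only; the function and method names (`play`, `predict_next_frame`) are invented for this sketch and do not come from DeepMind's code:

```python
# Hypothetical sketch of Genie-style frame-by-frame interaction.
# The world model's job is to guess the entire next frame given
# everything generated so far plus the player's latest input.

def play(world_model, start_image, get_player_input, num_steps=100):
    """Generate an interactive session one predicted frame at a time."""
    frames = [start_image]           # a single starting image seeds the "video"
    for _ in range(num_steps):
        action = get_player_input()  # e.g., left/right/jump as a small integer
        # The model's best guess at the next frame, given the frame
        # history and the chosen input.
        next_frame = world_model.predict_next_frame(frames, action)
        frames.append(next_frame)
    return frames
```

The key point of the design is that there is no game engine underneath: every "physics" behavior the player sees is just the model's prediction of what pixels should come next.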

To establish that model, Genie started with 200,000 hours of public Internet gaming videos, which were filtered down to 30,000 hours of standardized video from “hundreds of 2D games.” The individual frames from those videos were then run through a 200-million-parameter tokenizer, converting them into discrete tokens that a machine learning algorithm could easily work with.
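Tokenizing means mapping continuous image data onto a small discrete vocabulary. The toy function below illustrates the general idea with a nearest-entry codebook lookup; it is a deliberately simplified stand-in, not DeepMind's actual video tokenizer:

```python
# Toy illustration of tokenization: map each continuous feature
# value to the index of its nearest entry in a small "codebook,"
# turning a frame into a sequence of discrete tokens.

def tokenize_frame(frame_features, codebook):
    """Return the codebook index closest to each feature value."""
    tokens = []
    for value in frame_features:
        nearest = min(range(len(codebook)),
                      key=lambda i: abs(codebook[i] - value))
        tokens.append(nearest)
    return tokens
```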

An image like this, generated via text prompt to an image generator, can serve as the starting point for Genie's world-building.
A sample of interactive movement enabled by Genie from the above starting image.

With a latent action model established to infer the unlabeled actions occurring between those video frames, Genie then generates a “dynamics model” that can take any number of arbitrary frames and latent actions and produce an educated guess about what the next frame should look like given any potential input. This final model ends up with 10.7 billion parameters trained on 942 billion tokens, though Genie’s results suggest that even larger models would perform better.

Previous work on generating similar interactive models using generative AI has relied on using “ground truth action labels” or text descriptions of training data to help guide their machine learning algorithms. Genie differentiates itself from that work in its ability to “train without action or text annotations,” inferring the latent actions behind a video using nothing but those hours of tokenized video frames.
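In very rough terms, inferring a latent action means asking: which of a small set of candidate actions best explains the observed change between two consecutive frames? The sketch below shows that idea on toy scalar "frames"; the names and the scalar setup are illustrative assumptions, not the Genie paper's method:

```python
# Toy sketch of latent action inference with no action labels:
# given two consecutive frames, pick the candidate action whose
# assumed effect best matches the observed frame-to-frame change.

def infer_latent_action(prev_frame, next_frame, candidate_effects):
    """Return the index of the candidate effect closest to the observed change."""
    observed_change = next_frame - prev_frame
    return min(range(len(candidate_effects)),
               key=lambda i: abs(candidate_effects[i] - observed_change))
```

Because the action labels are never given, the model must discover a consistent, small vocabulary of such actions purely from how the video frames change over time.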

“The ability to generalize to such significantly [out-of-distribution] inputs underscores the robustness of our approach and the value of training on large-scale data, which would not have been feasible with real actions as input,” the Genie team wrote in its research paper.
