Google’s research scientists have published a paper on the company’s new GameNGen technology, an AI game engine that generates each new frame in real time based on player input. It kind of sounds like Frame Generation gone mad, in that everything is generated by AI, including visual effects, enemy movement, and more.
AI generating an entire game in real time is impressive, even more so when GameNGen uses its tech to recreate a playable version of id Software’s iconic Doom. This makes sense when you realize that getting Doom to run on lo-fi devices, high-tech gadgets, and even organic material is a rite of passage.
Watching it in action, you can spot some of the issues that come with AI generating everything (random artifacts, weird animation), but it’s important to remember that everything on screen is being generated and built around you in real time as you move, strafe, and fire shotgun blasts at demons.
As expected, the underlying AI model was trained on Doom gameplay, with the game played repeatedly by AI agents simulating various skill levels and playstyles. The result is impressive, to be sure. However, the game runs at 20 FPS, so latency and performance improvements are still needed before GameNGen could be considered a viable way to play a game.
What makes this a significant breakthrough for generative AI is how the image stays consistent between frames, something AI has struggled with when animating physical objects and characters. Each frame is generated separately, without any underlying physics calculations or traditional rendering. GameNGen achieves a notable improvement by conditioning each new frame on the sequence of preceding frames and the player’s input. Here’s the official description of what it does and how it works.
GameNGen is the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.
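To make the “next frame prediction conditioned on past frames and actions” idea more concrete, here is a minimal, purely illustrative Python sketch of that auto-regressive loop. The function name, context length, and frame resolution are assumptions for illustration only; the real system runs a diffusion model on a TPU rather than the stub shown here.

```python
import numpy as np

# Illustrative stand-in for GameNGen's diffusion model: the real model generates
# the next frame conditioned on a window of past frames and player actions.
# Here we just return a blank frame so the control flow is runnable.
def predict_next_frame(past_frames, past_actions, height=240, width=320):
    # (Hypothetical) a diffusion model would iteratively denoise the next frame
    # here, conditioned on encodings of past_frames and past_actions.
    return np.zeros((height, width, 3), dtype=np.uint8)

CONTEXT_LEN = 64  # assumed context window of past frames/actions
frames = [np.zeros((240, 320, 3), dtype=np.uint8)]  # initial frame
actions = []

# Auto-regressive game loop: each generated frame is fed back in as context,
# together with the player's latest input, to produce the frame after it.
for step in range(100):
    player_action = "MOVE_FORWARD"  # would come from keyboard/controller input
    actions.append(player_action)
    next_frame = predict_next_frame(frames[-CONTEXT_LEN:], actions[-CONTEXT_LEN:])
    frames.append(next_frame)
```

Because every frame is conditioned on the ones before it, small errors can compound over time, which is why the paper highlights conditioning augmentations for keeping long play sessions stable.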
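For context on the PSNR figure quoted above: peak signal-to-noise ratio is a standard measure of per-pixel reconstruction error, computed from the mean squared error between a reference frame and a predicted frame. The snippet below is a generic illustration of that calculation, not code from the paper; the frame size and noise level are arbitrary.

```python
import numpy as np

def psnr(reference: np.ndarray, prediction: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - prediction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# Example with a random 8-bit frame plus mild noise; a score around 29-30 dB is
# roughly the error level you'd see from fairly aggressive JPEG compression.
ground_truth = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
noise = np.random.randint(-15, 16, ground_truth.shape, dtype=np.int16)
predicted = np.clip(ground_truth.astype(np.int16) + noise, 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ground_truth, predicted):.1f} dB")
```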