World Emulation via Neural Network

I converted a forest near my apartment into an interactive neural world. Click here to explore the world in your browser:

By “neural world” I mean the entire thing is a neural network that generates new images based on previous images and controls. There is no level geometry. No code for lighting and shadows. No scripted animation. Just a neural network in a loop.

By “in your web browser” I mean this world runs locally, in your web browser. Once the world has loaded, you can continue exploring even in Airplane Mode.

So, why bother creating a world this way? There are some interesting conceptual reasons (I’ll get to them later), but my main goal was just to outdo a prior post.

See, three years ago I trained a neural network to mimic gameplay videos from YouTube.

Mimicking a 2D video game world was cute, but ultimately kind of pointless;
existing video games already exist and we can already emulate them just fine.

The wonderful, unique, exciting property of neural worlds is that they can be constructed from any video file, not just screen recordings of old video games.
My previous post didn’t really get this across.

So for this post, to demonstrate what makes neural networks truly special,
I wanted to train a neural network on gameplay videos of the actual world.

Recording data

I began this project by walking through a forest, recording videos on my phone using a customized app that also recorded the motion of my phone.

I collected ~15 minutes of video and motion recordings. I’ve visualized motion as a “walking” control stick on the left and a “looking” control stick on the right.

Back at home, I transferred the recordings to my laptop and shuffled them into a list of (previous frame + control → next frame) pairs, just like my previous game-emulation dataset.
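To make the pairing step concrete, here’s a minimal sketch of how a dataset like this could be assembled, assuming the video has already been decoded into per-frame arrays and the motion log has been aligned to frames (the function and variable names are illustrative, not from my actual code):

```python
import random

def build_pairs(frames, controls, context=1):
    """Turn an aligned recording into (previous frame + control -> next frame) pairs.

    frames:   list of H x W x 3 image arrays, one per video frame
    controls: list of control vectors derived from the phone's motion,
              aligned one-to-one with frames
    """
    pairs = []
    for t in range(context, len(frames)):
        prev = frames[t - context:t]   # conditioning frame(s)
        ctrl = controls[t]             # motion leading into frame t
        pairs.append((prev, ctrl, frames[t]))
    random.shuffle(pairs)              # mix moments from across the walk
    return pairs
```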

Now, all I needed to do was train a neural network to mimic the behavior of these input→output pairs. I already had working code from my previous game-emulation project,
so I tried rerunning that code to establish a baseline.

Training baselines

Applying my previous game-emulation-via-neural-network recipe to this new dataset produced, regrettably, a sort of interactive forest-flavored soup.

My neural network couldn’t predict the actual next frame accurately, and it couldn’t make up new details fast enough to compensate, so the resulting world collapsed even if I gave it a running start by initializing from real video frames:

Undaunted, I started work on a new version of the neural world training code.

Upgrading the training recipe

I upgraded the recipe to help my network understand the real-world video.

  1. More control information: I upgraded the network input from simple controls to more-informative 3D (6DoF) controls.
  2. More memory: I upgraded from a single input frame to 32 input frames, with older frames stored at lower resolution (see the sketch after this list).
  3. Multiple scales: instead of using a fixed resolution, I restructured my network to process inputs at multiple resolutions.
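As a rough illustration of the memory upgrade, here’s a sketch of how past frames could be packed into multi-resolution buffers, keeping recent frames sharp and shrinking older ones (the buffer shapes and split are approximations of the recipe, not the exact code):

```python
import torch
import torch.nn.functional as F

def build_memory(past_frames: torch.Tensor) -> dict:
    """Pack the last 32 frames into multi-resolution memory buffers.

    past_frames: (32, 3, 192, 256), ordered oldest -> newest.
    The full history is kept only as thumbnails; recent frames are also
    kept at higher resolutions, so the network sees a long time window
    without paying full-resolution cost for every frame.
    """
    return {
        "coarse": F.interpolate(past_frames, size=(3, 4)),         # 32 x 3 x 3 x 4
        "medium": F.interpolate(past_frames[-8:], size=(48, 64)),  # 8 x 3 x 48 x 64
        "fine":   past_frames[-4:],                                 # 4 x 3 x 192 x 256
    }
```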

These upgrades let me stave off soupification enough to get a half-baked demo:

This was significant progress. Unfortunately, the world was still pretty melty,
so I started work on a second batch of improvements (more daunted this time).

Upgrading the training recipe, again

I left the inputs/outputs the same this time and focused on incremental improvements to the procedure. Here’s a mercifully-abbreviated montage:

The biggest jumps in quality came from:

  1. Making the network bigger: I added even more layers of neural-network processing, while striving to maintain a somewhat-playable FPS.
  2. Picking a better training objective: I adjusted training to put less emphasis on detail prediction and more emphasis on detail generation (see the loss sketch after this list).
  3. Training for longer: I trained the network on a subset of the video frames for longer, to try and get the best results.
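Here’s a rough sketch of what that prediction-vs-generation trade-off can look like, assuming an L1 reconstruction term plus an adversarial term scored by a discriminator (the weighting and GAN formulation are illustrative, not my actual training code):

```python
import torch
import torch.nn.functional as F

def generator_loss(pred, target, fake_logits, l1_weight=0.2):
    """Combined objective for the frame generator.

    pred, target: generated and real next frames, shape (B, 3, H, W)
    fake_logits:  discriminator scores for the generated frames
    The L1 term rewards *predicting* the true next frame; the adversarial
    term rewards *generating* plausible detail. Shrinking l1_weight shifts
    emphasis from prediction toward generation.
    """
    l1 = (pred - target).abs().mean()
    adv = F.softplus(-fake_logits).mean()   # non-saturating GAN loss
    return l1_weight * l1 + adv
```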

Here is a summary of the final forest-world recipe.

  • Dataset: 22,814 frames (30 FPS video + timestamped poses) captured on an iPhone 13 Pro at Marymoor Park’s Audubon Bird Loop.
  • Inputs: 3×4-element relative pose, 2-element gravity-relative roll/pitch, relative time delta, and a valid/augmented bit.
    Past-frame TCHW memory buffers (32×3×3×4, 8×3×48×64, 4×3×192×256).
    Four U(0,1) single-channel noise tensors, one for each spatial scale.
  • Model: asymmetric (decoder-heavy) 4-scale UNet with a reduced-size full-resolution decoder block.
    5M trainable parameters, 1 GFLOP per generated 192×256 frame.
  • Training: AdamW with constant LR + SWA, L1 + adversarial loss, stability fixes from the game-emulation recipe, ~100 GPU-hours (~$100 USD).
  • Inference: control-conditioned sequential autoregression with a 60 FPS cap; preprocessing in JS, network running in ONNX Runtime Web’s WebGL backend.
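To show what “control-conditioned sequential autoregression” means in practice, here’s a minimal Python sketch of the inference loop using onnxruntime; the real demo runs in the browser via ONNX Runtime Web’s WebGL backend, and the model file, input/output names, and shapes below are placeholders rather than the actual interface:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical model file and I/O names, for illustration only.
session = ort.InferenceSession("forest_world.onnx")

# Recent-frame memory buffer; older, lower-resolution buffers omitted for brevity.
memory = np.zeros((4, 3, 192, 256), dtype=np.float32)

def step(control: np.ndarray) -> np.ndarray:
    """Generate the next frame from the current memory and a control vector."""
    global memory
    noise = np.random.rand(1, 1, 192, 256).astype(np.float32)
    (frame,) = session.run(None, {
        "memory":  memory[None],    # (1, 4, 3, 192, 256)
        "control": control[None],   # (1, control_dim)
        "noise":   noise,
    })
    # Drop the oldest frame, append the newly generated one, and repeat:
    # the network's output becomes its next input.
    memory = np.concatenate([memory[1:], frame[:1]], axis=0)
    return frame
```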

Whew. So, let’s return to the original question:
why bother? Why go through so much work to get a low-resolution neural world of a single forest trail? Why not make a stabler, higher-resolution demo using traditional video game techniques?

There are two ways to create a world

Traditional game worlds are like paintings. You create beautiful worlds by layering keystrokes on an empty canvas. Every detail in a traditional video game is there only because an artist painted it.

Neural worlds are created differently. To create this neural forest, I walked through a real forest and recorded it with the device in my hands. Every lifelike detail in the final world is there because my phone recorded it.

If traditional game worlds were paintings, then neural worlds would be photographs.
Information is transferred from sensor to screen, without the need for human intervention.

Admittedly, at the time of this post, neural worlds are a lot like very old photos: the images produced by early cameras were not realistic either.

The exciting part was that cameras reduced realistic-image-creation from an artistic problem to a technological one.
As cameras improved, so did photography. Photographs became more accurate to reality, while paintings remained the same.

In the future, neural worlds may have trees that bend in the wind, lilypads that bob in the rain, and birds that sing to each other, not because an artist painted them in, but automatically, because those things exist in the real world and a tool was able to record them.

The tools for creating neural worlds could become as convenient as cameras are today, letting us create worlds the way a digital camera creates an image or video: with the push of a button.

If neural worlds become as lifelike, affordable, and comprehensible as photos are today, then narrative arrangements of neural worlds could become a creative medium in their own right, as different from today’s video games as photographs were from paintings. I think that would be exciting!


Neural networks that model the world have been called “world models” by many smart people; a classic example is Comma’s. If you’re a programmer interested in creating your own world models, I recommend looking at DIAMOND and Diffusion Forcing.

Compared to serious “Foundation World Models” with billions of parameters, this forest world is a toy.
Still, it would be fun to make more worlds and keep improving the recipe.
Let me know if you have any suggestions for a location near Seattle.
