RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

We generate large, explorable 3D scenes from a text description using only pretrained 2D diffusion models.

Abstract

We introduce RealmDreamer, a technique for generating general, forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats using state-of-the-art text-to-image generators, lifting their samples into 3D and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model conditioned on samples from the inpainting model, which provides rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique requires no training on any scene-specific dataset and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.




Results


RGB and depth renders for scenes including bear, bedroom, bust, boat, lavender, living room, piano, resolute, astronaut, car, and bathroom. Example prompt: "A bear sitting in a classroom with a hat on, realistic, 4k image, high detail"

Inpainting priors are great for occlusion reasoning

Using text-conditioned 2D diffusion models for 3D scene generation is tricky given the lack of 3D consistency across different samples. We mitigate this by leveraging 2D inpainting priors as novel-view estimators instead. By rendering an incomplete 3D model and inpainting the unknown regions, we can generate consistent 3D scenes, as sketched below.
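As a concrete illustration, the snippet below shows how an off-the-shelf 2D inpainting diffusion model can fill the unobserved regions of a rendered novel view. This is a minimal sketch: the Stable Diffusion inpainting pipeline from the `diffusers` library is our illustrative choice (not necessarily the paper's exact model), and the file names `render.png` and `mask.png` are assumed to come from rendering the incomplete 3D model.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Illustrative choice of inpainting model; the paper's exact checkpoint may differ.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

# `render.png` is a novel view rendered from the incomplete 3D model;
# `mask.png` is white where the render has no content (occluded / unseen regions).
render = Image.open("render.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))

prompt = "A bear sitting in a classroom with a hat on, realistic, 4k image, high detail"
inpainted = pipe(prompt=prompt, image=render, mask_image=mask).images[0]
inpainted.save("inpainted_view.png")
```

The inpainted view respects the visible content of the render while hallucinating plausible content only where the mask indicates missing regions, which is what makes it a useful novel-view estimator.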

Image to 3D

We show that our technique can generate 3D scenes from a single image. This is a challenging task as it requires the model to hallucinate the missing geometry and texture in the scene. We do not require training on any scene-specific dataset.

Input image + prompt: "The Brandenburg Gate in Berlin, large stone gateway with a series of columns and a sculpture of a chariot and horses on top, clear sky, 4k image, photorealistic"

Input image + prompt: "A minimal conference room, with a long table, a screen on the wall and a whiteboard, 4k image, photorealistic, sharp"

How?

Step 1: Generate a Prototype

We start by generating a cheap 2D prototype of the 3D scene from the text description using a pretrained text-to-image generator. We lift the image's content into 3D with a monocular depth estimator, then compute the occlusion volume. This serves as the initialization for a 3D Gaussian Splatting (3DGS) representation, as sketched below.
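For intuition, here is a minimal sketch of the lifting step: given the generated image and a monocular depth estimate for it, each pixel is unprojected with a simple pinhole camera model to seed a point cloud that can initialize the splats. The input files `prototype.png` and `depth.npy` are placeholders for whichever generator and depth estimator are used, and the occlusion-volume computation is omitted.

```python
import numpy as np
from PIL import Image

def unproject_to_points(image: np.ndarray, depth: np.ndarray, fov_deg: float = 60.0):
    """Unproject an RGB image with per-pixel depth into a 3D point cloud
    using a pinhole camera centered at the origin."""
    h, w, _ = image.shape
    fx = fy = 0.5 * w / np.tan(np.deg2rad(fov_deg) / 2.0)
    cx, cy = w / 2.0, h / 2.0

    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z

    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)  # (H*W, 3) positions
    colors = image.reshape(-1, 3) / 255.0                 # (H*W, 3) RGB in [0, 1]
    return points, colors

# Assumed inputs (placeholders): the text-to-image sample and its monocular depth map.
image = np.asarray(Image.open("prototype.png").convert("RGB"))
depth = np.load("depth.npy")
points, colors = unproject_to_points(image, depth)
# `points` and `colors` can then seed the means and colors of the initial 3D Gaussians.
```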

Step 2: Inpaint Missing Regions

The initialized 3D scene is incomplete and contains missing regions. To fill them in, we leverage a 2D inpainting diffusion model and optimize the splats to match its output over multiple views. An additional depth-distillation loss on the sampled images ensures that the inpainted regions are geometrically plausible. The loop below illustrates this stage.
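The sketch below shows one way this optimization could be structured. The helpers passed in (`renderer`, `inpainter`, `depth_model`, `sample_camera`) are hypothetical placeholders for the splat renderer, the 2D inpainting diffusion model, the depth prior, and the camera sampler; only the loss structure is meant to be illustrative, not the paper's exact objective.

```python
import torch

def inpainting_stage(splats, optimizer, prompt,
                     renderer, inpainter, depth_model, sample_camera,
                     num_iters=2000, lambda_depth=0.1):
    """Hypothetical sketch of the multi-view inpainting optimization.

    renderer(splats, cam)      -> rgb [3,H,W], depth [H,W], unseen_mask [H,W]
    inpainter(rgb, mask, text) -> inpainted rgb [3,H,W]
    depth_model(rgb)           -> depth [H,W] predicted from the inpainted sample
    sample_camera()            -> a random camera inside the viewing region
    """
    for _ in range(num_iters):
        cam = sample_camera()
        rgb, depth, unseen_mask = renderer(splats, cam)

        with torch.no_grad():
            target_rgb = inpainter(rgb, unseen_mask, prompt)  # 2D inpainting prior
            target_depth = depth_model(target_rgb)            # depth prior on the sample

        # Photometric loss only on previously unseen regions keeps known content fixed.
        rgb_loss = (unseen_mask * (rgb - target_rgb).abs()).mean()

        # Depth distillation: push rendered depth toward the depth prior
        # (a robust scale/shift alignment would normally be applied first).
        depth_loss = (depth - target_depth).abs().mean()

        loss = rgb_loss + lambda_depth * depth_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```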

Step 3: Refine the Scene

Finally, we refine the 3D model to improve the cohesion between the inpainted regions and the prototype using a vanilla text-to-image diffusion model. A sharpness filter applied to its samples yields more detailed supervision (see the snippet below).
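The sharpening step can be as simple as an unsharp mask applied to each diffusion sample before it supervises the splats. The snippet below is an illustrative sketch using PIL's built-in filter; the exact filter and parameters used in the paper may differ, and `diffusion_sample.png` is a placeholder for a sample from the text-to-image model.

```python
from PIL import Image, ImageFilter

def sharpen_sample(sample: Image.Image, radius: float = 2.0, percent: int = 150) -> Image.Image:
    """Apply an unsharp mask so refinement targets emphasize fine detail."""
    return sample.filter(ImageFilter.UnsharpMask(radius=radius, percent=percent, threshold=3))

# A sample from the vanilla text-to-image model, conditioned on the scene prompt,
# is sharpened before being used as a target when finetuning the 3D Gaussians.
sample = Image.open("diffusion_sample.png").convert("RGB")
target = sharpen_sample(sample)
target.save("refinement_target.png")
```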

Related Work

Many prior works have influenced our technique:

There is also concurrent work that tackles scene generation or uses inpainting models for similar applications: