3D Photography using a 2D Image and AI!
-An article by Lakshya, Computer Vision Researcher, Sally Robotics.
Some time ago, I was browsing the Machine Learning subreddit when I came across a mention of the beautiful paper by Meng-Li Shih, Shih-Yang Su, Johannes Kopf and Jia-Bin Huang titled “3D Photography using Context-aware Layered Depth Inpainting”. Here is the paper link.
Using the model from the paper, I was able to turn the above photograph into an mp4 file that shows the image as a 3D photo and performs a dolly zoom.
The model converts a single RGB image into a 3D photograph by first estimating depth (the authors used external tools for that, like MiDaS) and then rendering images from a chosen point of view (PoV).
One of the best things about the model is that it stores the result as a 3D mesh file that can be loaded into a standard graphics engine on edge devices, which allows for quick rendering. Let’s learn how they create these novel views.
Novel View Representation
This section gives a brief overview of the various ways we can generate and store novel views.
Light fields render photorealistic novel views using multiple input images. Here is the idea behind light fields. (Taken from this page)
There are some good research papers on this topic, some of which were cited in the paper. Here are a few:
Steven J Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F Cohen. The lumigraph. In SIGGRAPH, volume 96, pages 43–54, 1996.
Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 31–42, 1996.
Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, 2001.
Multi Plane Images (MPI)
This representation stores multiple layers of RGB-α images (RGBA color model) at fixed depths. It is especially advantageous for semi-reflective or semi-transparent surfaces.
The main disadvantage of this method is that sloped surfaces are not reproduced well unless an excessive number of planes is used. This is because of depth discretization.
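To make the layered idea concrete, here is a minimal sketch of how a stack of RGBA planes can be flattened into one image with the standard back-to-front "over" operator. It is only an illustration: the function name `composite_mpi` and the toy planes are my own, and the per-plane warping needed to render a new viewpoint is left out.

```python
import numpy as np

def composite_mpi(layers):
    """Composite RGBA planes ordered back-to-front with the 'over' operator.

    layers: float array of shape (D, H, W, 4) with values in [0, 1];
            layers[0] is the farthest plane.
    returns: (H, W, 3) RGB image.
    """
    out = np.zeros(layers.shape[1:3] + (3,), dtype=np.float64)
    for plane in layers:  # walk from the farthest plane to the nearest
        rgb, alpha = plane[..., :3], plane[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out

# Toy example (hypothetical data): an opaque red background plane and a
# half-transparent blue plane in front of it, each 1x1 pixel.
planes = np.array([
    [[[1.0, 0.0, 0.0, 1.0]]],
    [[[0.0, 0.0, 1.0, 0.5]]],
])
image = composite_mpi(planes)
```

Because every plane sits at one fixed depth, a sloped surface has to be split across many planes, which is where the discretization problem comes from.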
Layered Depth Images (LDI)
An LDI is similar to a regular 4-connected image, except that every position in the pixel lattice can hold any number of pixels, from zero to many.
Each LDI pixel stores a color and a depth value. Unlike the original LDI work, this representation also stores the local connectivity of pixels explicitly (a novelty): each pixel stores a pointer to at most one direct neighbor in each of the four cardinal directions (left/top/right/bottom).
LDI pixels are 4-connected like normal image pixels within the smooth regions but do not have neighbors across depth discontinuities. Here is an LDI with depth discontinuity.
LDI representations are great for 3D photography because:
They can handle an arbitrary number of layers.
They are sparse, i.e. memory and storage efficient, and can be converted into a lightweight textured mesh that renders fast.
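The connectivity idea described above can be sketched as a small data structure. This is my own illustration and not the authors' code: the names `LDIPixel`, `build_single_layer_ldi` and `disconnect` are hypothetical, and each lattice position simply holds a Python list of pixels.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "top": (0, -1), "bottom": (0, 1)}
OPPOSITE = {"left": "right", "right": "left", "top": "bottom", "bottom": "top"}

@dataclass(eq=False)
class LDIPixel:
    x: int
    y: int
    color: Tuple[int, int, int]
    depth: float
    # At most one neighbor per cardinal direction; None means "no neighbor".
    neighbors: Dict[str, Optional["LDIPixel"]] = field(
        default_factory=lambda: {d: None for d in DIRECTIONS}, repr=False)

def build_single_layer_ldi(colors, depths):
    """Lift an H x W image into a trivial, fully 4-connected LDI."""
    h, w = len(depths), len(depths[0])
    # Every lattice position holds a list, so it can later hold 0..n pixels.
    grid = [[[LDIPixel(x, y, colors[y][x], depths[y][x])] for x in range(w)]
            for y in range(h)]
    for y in range(h):
        for x in range(w):
            pixel = grid[y][x][0]
            for name, (dx, dy) in DIRECTIONS.items():
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    pixel.neighbors[name] = grid[ny][nx][0]
    return grid

def disconnect(pixel, direction):
    """Cut the link between a pixel and its neighbor in both directions."""
    other = pixel.neighbors[direction]
    if other is not None:
        pixel.neighbors[direction] = None
        other.neighbors[OPPOSITE[direction]] = None

# A 1x3 image whose last pixel is much farther away (a depth discontinuity).
grey = (128, 128, 128)
ldi = build_single_layer_ldi([[grey, grey, grey]], [[1.0, 1.0, 5.0]])
disconnect(ldi[0][1][0], "right")
```

After the call to `disconnect`, the first two pixels are still linked to each other, while the third one has no neighbor across the discontinuity.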
Now that we know how novel views can be represented, let’s see why this method works best and why LDI was the representation of choice.
Facebook 3D Photos and various other methods
The “naïve” methods mentioned above come from stereo magnification and its recent variants, which use a fronto-parallel (parallel to the front view) multi-plane image (MPI) representation. The downsides of these methods are that they produce artifacts on sloped surfaces and are costly to render.
Facebook 3D photos changed that by using an LDI representation. As mentioned, it is compact due to its sparsity and can easily be converted into a lightweight mesh for fast rendering.
Facebook 3D photos synthesize the color and depth in occluded (obstructed) regions using heuristics that are optimized for fast runtime. The downside of this algorithm is that it produces overly smooth results and cannot extrapolate textures and structures, as shown in the previous image.
Here is an example from Facebook 3D image. If you look closely you can see the overly smooth results near the edges in the 3D generated view.
The authors wanted to create novel views that were photorealistic. They achieved this by introducing changes to the LDI representation.
In most of the recent learning-based methods, the LDIs used are in a sense “rigid”: every pixel in the image has the same (fixed and predetermined) number of layers. They store the nearest surface in the first layer, the second-nearest in the next layer, etc. This is problematic because, across depth discontinuities (in occluded views), the content within a layer changes abruptly, which destroys locality in the receptive fields of convolution kernels.
In neural networks, each neuron receives input from some number of locations in the previous layer. In a fully connected layer, each neuron receives input from every element of the previous layer. In a convolutional layer, neurons receive input from only a restricted subarea of the previous layer. This input area of a neuron is called its receptive field.
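As a quick illustration of how the receptive field grows with depth, here is a small sketch that applies the usual recurrence for stacked convolution layers (each layer adds (kernel - 1) times the accumulated stride). The function name `receptive_field` is my own, and dilation and padding effects are ignored.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of stacked conv layers.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size

# Three stacked 3x3 convolutions with stride 1 see a 7x7 input window.
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
```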
The modified LDI can handle situations with arbitrary depth complexity. This is done by explicitly storing connectivity across pixels in the representation.
Novelty introduces a problem
The custom LDI described above cannot be fed directly into a convolutional neural network (CNN). The exact method is explained later, but the crux is that the problem is divided into many local sub-problems, each of which is just an inpainting task. The input to each local sub-problem is image-like, so a CNN can be applied. These sub-problems are solved iteratively.
Each local sub-problem breaks the LDI at a depth discontinuity and performs inpainting there, and the result is fused back into the LDI. This leads to a recursive algorithm that processes the LDI until all the sub-problems are solved.
Image Inpainting and Depth Inpainting
The task is to fill the missing regions in images with plausible content. A common approach is to complete the missing regions by transferring content from the known regions of the image.
Thanks to the progress of CNNs, learning-based methods are now actively used for inpainting. Several architectures have been proposed to better handle holes with irregular shapes, as well as two-stage methods with structure-content disentanglement, e.g., predicting structure (say, contours/edges in the missing regions) followed by content completion conditioned on the predicted structure.
The inpainting model made by the authors follows a similar two-stage method but with two key differences:
Unlike existing inpainting algorithms, where the hole and the available context are static, here inpainting is applied locally around each depth discontinuity with adaptive hole and context regions.
In addition to inpainting the color image, the method also inpaints the depth values as well as the depth discontinuities in the missing regions.
Depth inpainting has applications in filling missing depth values where commodity-grade depth cameras fail (e.g., transparent/reflective/distant surfaces) or performing image editing tasks such as object removal on stereo images. The goal of these algorithms, however, is to inpaint the depth of the visible surfaces. In contrast, the authors’ focus is on recovering the depth of the hidden surface.
Now that we understand the basics, let’s understand the process.
The method described in the paper takes an RGB-D image as input. When only a color image is available, the depth channel comes from a CNN-based single-image depth estimation network.
For a single color image, we obtain a depth estimate through a pre-trained depth estimation model like MegaDepth or MiDaS, or from a depth sensor such as Kinect.
For this project, MiDaS was used. The resulting RGB-D image is then preprocessed automatically.
Overview of the process
The input is an RGB-D image. We initialize a trivial LDI, which uses a single layer everywhere and is fully 4-connected. In a pre-process, we detect major depth discontinuities and group them into simple connected depth edges. These form the basic units of our algorithm.
In the core part of the algorithm, we iteratively select a depth edge for inpainting. We disconnect the LDI pixels across the edge and only consider the background pixels of the edge for inpainting. We extract a local context region from the “known” side of the edge and generate a synthesis region on the unknown side.
The synthesis region is a contiguous 2D region of new pixels, whose color and depth values are generated from the given context using a learning-based method. Once inpainted, the synthesis region is merged back into the LDI.
This is repeated until all depth edges have been treated.
We start by normalizing the depth channel of the input by mapping the min and max disparity values (i.e. 1/depth) to 0 and 1 respectively. All parameters related to spatial dimensions below are tuned for images with 1024 pixels along the longer dimension and should be adjusted proportionally for images of different sizes.
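The normalization step can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions: the function names, the `eps` guard against division by zero, and the helper that rescales a spatial parameter relative to the 1024-pixel reference size are all hypothetical.

```python
import numpy as np

def normalize_disparity(depth, eps=1e-8):
    """Map disparity (1/depth) linearly so that its min is 0 and its max is 1."""
    disparity = 1.0 / np.maximum(depth, eps)  # eps avoids division by zero
    d_min, d_max = disparity.min(), disparity.max()
    return (disparity - d_min) / max(d_max - d_min, eps)

def scale_for_size(value_at_1024, height, width):
    """Rescale a spatial parameter tuned for a 1024-pixel longer side."""
    return value_at_1024 * max(height, width) / 1024.0

depth = np.array([[1.0, 2.0], [4.0, 8.0]])
norm = normalize_disparity(depth)
```

In this toy depth map the nearest pixel (depth 1) gets disparity 1 and the farthest pixel (depth 8) gets disparity 0.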
Then we lift the image onto an LDI, i.e., we create a single layer everywhere and connect every LDI pixel to its four cardinal neighbors. Since our goal is to inpaint the occluded parts of the scene, we need to find depth discontinuities, because these are the places where the existing content needs to be extended. In most depth maps produced by stereo methods (dual camera cell phones) or depth estimation networks, discontinuities are blurred across multiple pixels (see part (c) of the figure below), making it difficult to localize them precisely. We therefore sharpen the depth maps using a bilateral median filter (see part (d) of the figure below), with a 7×7 window size and σ_(spatial)=4.0, σ_(intensity)=0.5.
Bilateral median filters are described in the following research paper: Ziyang Ma, Kaiming He, Yichen Wei, Jian Sun, En-hua Wu. Constant time weighted median filtering for stereo matching and beyond. In Proceedings of the 2013 IEEE International Conference on Computer Vision, pages 49–56, 2013.
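To give a feel for what such a filter does, here is a naive sketch of a bilateral (weighted) median filter. It is not the constant-time algorithm from the cited paper and not the authors' implementation: it simply computes, for every pixel, the weighted median of its window, with weights that fall off with spatial distance and with difference in value. The function name and the toy input are my own.

```python
import numpy as np

def bilateral_median_filter(img, window=7, sigma_spatial=4.0, sigma_intensity=0.5):
    """Naive weighted median filter with bilateral weights (slow, for illustration)."""
    r = window // 2
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    spatial = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma_spatial ** 2)).ravel()
    out = np.empty_like(img, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + window, x:x + window].ravel()
            # Neighbors with a similar value to the center pixel weigh more.
            similarity = np.exp(-(patch - img[y, x]) ** 2 / (2.0 * sigma_intensity ** 2))
            weights = spatial * similarity
            order = np.argsort(patch, kind="stable")
            cumulative = np.cumsum(weights[order])
            # First sorted value whose cumulative weight reaches half the total.
            idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
            out[y, x] = patch[order[idx]]
    return out

# Toy input (hypothetical): a vertical edge blurred over several pixels.
ramp = np.clip((np.arange(9) - 2) / 4.0, 0.0, 1.0)
blurred_edge = np.tile(ramp, (9, 1))
sharpened = bilateral_median_filter(blurred_edge)
```

A useful property of a median-type filter is that every output value is one of the values already present in the window, so it never invents new in-between depths the way an averaging filter does.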
After sharpening the depth map, we find discontinuities by thresholding the disparity difference between neighboring pixels. This results in many spurious responses, such as isolated speckles and short segments dangling off longer edges (see part (e) of the above figure).
This is cleaned up as follows:
Create a binary map by labeling depth discontinuities as 1 (and others as 0).
Use connected component analysis to merge adjacent discontinuities into a collection of “linked depth edges”. To avoid merging edges at junctions, we separate them based on the local connectivity of the LDI.
Remove short segments (<10 pixels), including both isolated and dangling ones.
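The cleanup steps above can be sketched as follows. This is a simplified illustration rather than the authors' procedure: the threshold value, the use of 8-connectivity, and the function names are my own assumptions, and the sketch neither separates edges at junctions nor trims dangling segments; it only drops connected components shorter than 10 pixels.

```python
from collections import deque

import numpy as np

def find_discontinuities(disparity, threshold):
    """Binary map: 1 where a pixel differs from its right or bottom neighbor."""
    mask = np.zeros(disparity.shape, dtype=bool)
    mask[:, :-1] |= np.abs(np.diff(disparity, axis=1)) > threshold
    mask[:-1, :] |= np.abs(np.diff(disparity, axis=0)) > threshold
    return mask.astype(np.uint8)

def link_depth_edges(mask, min_length=10):
    """Group discontinuity pixels into connected components, dropping short ones."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    edges = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy, sx] or seen[sy, sx]:
                continue
            component, queue = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while queue:  # breadth-first search over 8-connected neighbors
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
            if len(component) >= min_length:
                edges.append(component)
    return edges

# Toy disparity map (hypothetical): a vertical depth edge plus one speckle.
disparity = np.full((12, 12), 0.2)
disparity[:, 6:] = 0.8
disparity[2, 1] = 0.9
binary_map = find_discontinuities(disparity, threshold=0.04)
depth_edges = link_depth_edges(binary_map)
```

In this toy example the speckle produces a 3-pixel component that is discarded, and the single remaining depth edge runs down column 5.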
The final edges (see (f) part in the above image) form the basic unit of our iterative inpainting procedure.
Context and Synthesis Regions
The inpainting algorithm now operates on one of the previously computed depth edges at a time. The goal is to synthesize new color and depth content in the adjacent occluded region.
The first step is to disconnect the LDI pixels across the discontinuity. This produces two silhouettes of pixels, called the foreground (green) and background (red) silhouettes. Only the background silhouette requires inpainting, since it is the background content that we want to extend.
The next step is to generate a synthesis region, a contiguous region of new pixels (marked in red in part (c)). These are essentially just 2D pixel coordinates at this point. These pixels are initialized with color and depth values using a simple iterative flood-fill like algorithm. It starts by stepping from all silhouette pixels one step in the direction where they are disconnected. These pixels form the initial synthesis region.
It then iteratively expands (for 40 iterations) all pixels of the region by stepping left/right/up/down and adding any pixels that have not been visited before. In each iteration, we expand the context and synthesis regions alternately, so a pixel belongs to only one of the two regions. Additionally, we do not step back across the silhouette, so the synthesis region remains strictly in the occluded part of the image, as shown in the above figure.
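The alternating expansion can be sketched on a plain 2D grid. This is a simplification under my own assumptions: the real method follows LDI connection links for the context region, whereas here both regions are just sets of (row, column) coordinates, and all names (`grow_once`, `expand_regions`, the toy silhouettes) are hypothetical.

```python
NEIGHBORS = ((0, -1), (0, 1), (-1, 0), (1, 0))  # left, right, up, down

def grow_once(region, frontier, forbidden, shape):
    """Add one ring of 4-neighbors to `region`; return the newly added pixels."""
    h, w = shape
    added = set()
    for y, x in frontier:
        for dy, dx in NEIGHBORS:
            p = (y + dy, x + dx)
            if (0 <= p[0] < h and 0 <= p[1] < w
                    and p not in region and p not in forbidden):
                added.add(p)
    region |= added
    return added

def expand_regions(background_silhouette, foreground_silhouette, step, shape,
                   iterations=40):
    """Alternately expand the synthesis and context regions around one depth edge."""
    h, w = shape
    foreground = set(foreground_silhouette)
    context = set(background_silhouette)
    # Initial synthesis region: one step across the disconnected link.
    synthesis = set()
    for y, x in background_silhouette:
        p = (y + step[0], x + step[1])
        if 0 <= p[0] < h and 0 <= p[1] < w:
            synthesis.add(p)
    syn_front, ctx_front = set(synthesis), set(context)
    for _ in range(iterations):
        # Synthesis never enters the context, so it cannot step back
        # across the background silhouette.
        syn_front = grow_once(synthesis, syn_front, context, shape)
        # Context halts at the silhouette and never enters the synthesis region.
        ctx_front = grow_once(context, ctx_front, synthesis | foreground, shape)
    return context, synthesis

# Toy example: a vertical edge between column 3 (background) and column 4.
shape = (5, 9)
background = [(y, 3) for y in range(5)]
foreground = [(y, 4) for y in range(5)]
context, synthesis = expand_regions(background, foreground, (0, 1), shape,
                                    iterations=2)
```

With two iterations, the synthesis region covers columns 4 to 6 and the context region covers columns 1 to 3, and the two sets never overlap.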
The learning-based method for inpainting the synthesis region is described in the next section. Similar techniques were previously used for filling holes in images. One important difference from this work is that those image holes were always fully surrounded by known content, which constrained the synthesis. In our case, however, the inpainting is performed on a connected layer of LDI pixels, and it should only be constrained by surrounding pixels that are directly connected to it. Any other region in the LDI, for example on another foreground or background layer, is entirely irrelevant for this synthesis unit and should not constrain or influence it in any way.
This behaviour is achieved by explicitly defining a context region (the blue part in the second figure above) for the synthesis. The inpainting networks only consider the content in the context region and do not see any other parts of the LDI. The context region is generated using a similar flood-fill like algorithm. One difference, however, is that this algorithm selects actual LDI pixels and follows their connection links, so the context region expansion halts at silhouettes. Synthesis is better for larger context regions.
In practice, the silhouette pixels may not align well with the actual occluding boundaries due to imperfect depth estimation. To tackle this issue, the authors dilate the synthesis region near the depth edge by 5 pixels, which erodes the context region correspondingly.
Context Aware color and depth inpainting
Given the context and synthesis regions, the next goal is to synthesize color and depth values. Even though the synthesis is performed on an LDI, the extracted context and synthesis regions are locally like images (because we disconnected the LDI), so we can use standard network architectures designed for images.
One straightforward approach is to inpaint the color image and depth map independently. The inpainted depth map, however, may not be well-aligned with respect to the inpainted color.
This can be addressed by dividing the inpainting problem into three tasks, each handled by its own sub-network:
Edge inpainting network
Color inpainting network
Depth inpainting network
First, given the context edges as input, we use the edge inpainting network to predict the depth edges in the synthesis regions, producing the inpainted edges. Performing this step first helps infer the structure (in terms of depth edges) that can be used for constraining the content prediction (the color and depth values). We take the concatenated inpainted edges and context color as input and use the color inpainting network to produce inpainted color. We perform the depth inpainting similarly.
For the edge inpainting network, the model used was EdgeConnect, along with its hyper-parameters. For the depth and color inpainting networks, we use a standard U-Net architecture with partial convolution. More about partial convolution can be found in this paper.
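For readers unfamiliar with partial convolution, here is a minimal single-channel sketch of the general technique (as I understand it from the partial convolution literature), not the authors' network: the convolution only looks at valid pixels, the response is rescaled by how many pixels in the window were valid, and the mask is updated so that holes shrink layer by layer. The function name and toy data are my own, and no padding is used.

```python
import numpy as np

def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution without padding.

    image, mask: (H, W) arrays; mask is 1 for known pixels and 0 for holes.
    returns: (output, updated_mask), both of shape (H - kh + 1, W - kw + 1).
    """
    kh, kw = kernel.shape
    h, w = image.shape
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((oh, ow))
    new_mask = np.zeros((oh, ow))
    for y in range(oh):
        for x in range(ow):
            m = mask[y:y + kh, x:x + kw]
            valid = m.sum()
            if valid > 0:
                patch = image[y:y + kh, x:x + kw] * m  # ignore hole pixels
                # Rescale to compensate for the missing inputs.
                out[y, x] = (kernel * patch).sum() * (kernel.size / valid) + bias
                new_mask[y, x] = 1.0  # at least one valid input: now known
    return out, new_mask

# Toy example: a constant image with its top row missing and a mean kernel.
image = np.full((3, 3), 2.0)
mask = np.ones((3, 3))
mask[0, :] = 0.0
kernel = np.full((3, 3), 1.0 / 9.0)
filled, updated_mask = partial_conv2d(image, mask, kernel)
```

Thanks to the rescaling, the mean over the six known pixels is recovered even though a third of the window is a hole.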
Many implementation details, such as the losses and the training procedure, are provided in the supplementary material by the authors themselves.
Applying the inpainting model once is not sufficient as there are still holes. Applying it more than twice does the trick.
Now we have the resulting LDI, which can be converted into a 3D mesh for other uses. Let’s look at a comparison of results and at some of the results I created from my own images.
Comparison of Result with various MPI-based methods
Comparison of results with Facebook 3D
The authors were kind enough to make their code public and also provide a demo on Google Colab. Please visit their webpage for more information at https://shihmengli.github.io/3D-Photo-Inpainting/
Do take a peek in their files!
Results from the Demo
The three results above are from artworks; the ones below are from non-artwork images.