4.2  Depth Image Based Rendering

4.2.1  Rendering using 3D image warping

In this section, we describe the 3D image warping technique, which enables the rendering of a synthetic image using a reference texture image and a corresponding depth image [56]. Let us consider a 3D point at homogeneous coordinates Pw = (Xw, Yw, Zw, 1)^T, captured by two cameras and projected onto the reference and synthetic image planes at pixel positions p1 = (x1, y1, 1)^T and p2 = (x2, y2, 1)^T, respectively (see Figure 4.2).

Figure 4.2 Two projection points p1 and p2 of a point Pw.

The orientation and location of camera i (with i ∈ {1,2}) are described by the rotation matrix Ri and the translation vector ti = -RiCi, where Ci denotes the coordinates of the camera center. This allows us to define the pixel positions p1 and p2 in both image planes by

\[
\lambda_1 p_1 = [K_1 \mid 0_3]
\begin{bmatrix} R_1 & -R_1 C_1 \\ 0_3^T & 1 \end{bmatrix} P_w
= K_1 R_1 \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} - K_1 R_1 C_1,
\tag{4.1}
\]

\[
\lambda_2 p_2 = [K_2 \mid 0_3]
\begin{bmatrix} R_2 & -R_2 C_2 \\ 0_3^T & 1 \end{bmatrix} P_w
= K_2 R_2 \begin{pmatrix} X_w \\ Y_w \\ Z_w \end{pmatrix} - K_2 R_2 C_2,
\tag{4.2}
\]
where K1 and K2 represent the 3 × 3 intrinsic parameter matrices of the corresponding cameras, and λ1 and λ2 are the homogeneous scaling factors. Assuming that the reference “Camera 1” is located at the coordinate-system origin (C1 = 0_3) and looks along the Z-direction (R1 = I_{3×3}), the scaling factor λ1 reduces in this particular case to λ1 = Zw.

From Equation (4.1), the 3D position of the original point Pw in the Euclidean domain can be written as

\[
(X_w, Y_w, Z_w)^T = (K_1 R_1)^{-1} (\lambda_1 p_1 + K_1 R_1 C_1).
\tag{4.3}
\]

Finally, we obtain the synthetic pixel position p2 by substituting Equation (4.3) into Equation (4.2), so that

\[
\lambda_2 p_2 = K_2 R_2 (K_1 R_1)^{-1} (\lambda_1 p_1 + K_1 R_1 C_1) - K_2 R_2 C_2.
\tag{4.4}
\]

Assuming again that “Camera 1” is located at the origin of the world coordinate system and looking in the Z-direction, we rewrite the warping equation as

\[
\lambda_2 p_2 = K_2 R_2 K_1^{-1} Z_w p_1 - K_2 R_2 C_2.
\tag{4.5}
\]

Equation (4.5) constitutes the 3D image warping equation [56] that enables the synthesis of the virtual view from a reference texture view and a corresponding depth image.
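As a concrete illustration, Equation (4.5) can be evaluated per pixel with a few matrix products. The NumPy sketch below uses hypothetical camera parameters (focal length, principal point, and baseline are invented for the example, not taken from the text):

```python
import numpy as np

def warp_pixel(p1, Zw, K1, K2, R2, C2):
    """Warp reference pixel p1 = (x1, y1, 1)^T with depth Zw into the
    virtual view using the 3D image warping equation (4.5):
    lambda2 * p2 = K2 R2 K1^{-1} Zw p1 - K2 R2 C2."""
    p1 = np.asarray(p1, dtype=float)
    lam_p2 = K2 @ R2 @ np.linalg.inv(K1) @ (Zw * p1) - K2 @ R2 @ C2
    return lam_p2 / lam_p2[2]  # normalize by the homogeneous scale lambda2

# Hypothetical setup: identical intrinsics, virtual camera translated
# 0.1 units along the X-axis (pure sideways baseline, no rotation).
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R2 = np.eye(3)
C2 = np.array([0.1, 0.0, 0.0])
p2 = warp_pixel([320, 240, 1], Zw=2.0, K1=K, K2=K, R2=R2, C2=C2)
```

For this sideways translation the pixel shifts horizontally by the classical disparity f·b/Zw = 500 · 0.1 / 2 = 25 pixels, which the sketch reproduces.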

One issue of the previously described method is that input pixels p1 of the reference view are usually not mapped onto a point p2 at an integer pixel position. To obtain an integer pixel position, a simple heuristic technique is to map the sub-pixel coordinate p2 to the nearest integer pixel position p̂2, with p̂2 = (x̂2, ŷ2, 1) = (⌊x2 + 0.5⌋, ⌊y2 + 0.5⌋, 1). Additionally, to avoid undefined pixels in the rendered image, so-called pixel splats, i.e., surfaces that cover multiple destination pixels, can be employed. In practice, a pixel splat simply enlarges the footprint of the destination pixel p2, such that not only the color of p̂2 is defined, but also that of some pixels surrounding p̂2. A second complication is that multiple original pixels can be projected onto the same pixel position in the virtual view. For example, a foreground pixel can occlude a background pixel in the rendered view, which results in overlapping pixels. Additionally, some regions in the virtual view are not visible from the original viewpoint, which results in holes in the virtual image (see Figure 4.3). The aforementioned issues are addressed with a rendering technique based on triangular meshes, which is presented in the next section.
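The rounding and overlap handling above can be sketched for one scanline as follows. The function name is illustrative, and resolving overlaps with a per-pixel depth buffer (z-buffer) is a common choice assumed here, not necessarily the exact implementation of the text:

```python
import numpy as np

def forward_warp_row(xs2, z2, colors, width):
    """Scatter warped pixels onto an integer destination grid.
    xs2: sub-pixel x-coordinates in the virtual view, z2: their depths,
    colors: their colors. Sub-pixel positions are rounded to the nearest
    integer; when several source pixels land on the same destination
    pixel, the closest one (smallest depth) wins the overlap."""
    out = np.full(width, -1.0)            # -1 marks holes (undefined pixels)
    zbuf = np.full(width, np.inf)         # per-pixel depth buffer
    for x2, z, c in zip(xs2, z2, colors):
        xi = int(np.floor(x2 + 0.5))      # nearest-integer pixel position
        if 0 <= xi < width and z < zbuf[xi]:
            zbuf[xi] = z                  # foreground occludes background
            out[xi] = c
    return out
```

Destination pixels left at -1 are exactly the holes discussed above; a pixel splat would write the color to a small neighborhood around `xi` instead of a single pixel.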


Figure 4.3 A virtual view consists of visible, overlapped and undefined pixels.

4.2.2  Rendering using triangular meshes

To avoid rendering artifacts, a natural approach is to employ a mesh of triangles. The idea is to triangulate the reference depth image so that each triangle locally approximates the object surface. In our implementation, the depth image is triangulated such that two micro-triangles per pixel are employed. For each triangle vertex in the reference image, the corresponding position of the warped vertex is calculated using Equation (4.4). Finally, a rasterization procedure converts the triangle-based geometric description of the warped image into a bitmap or raster image (see Figure 4.4). For an efficient implementation, it can be noticed that adjacent triangles share two common vertices. Therefore, only one new warped-vertex position needs to be computed per pixel; the two remaining vertex positions can be reused from neighboring triangles.
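A minimal sketch of the index generation for such a per-pixel micro-triangulation is given below (the function name and the row-major vertex layout are our own assumptions). Because neighboring triangles reference shared grid vertices, each vertex position needs to be warped only once:

```python
import numpy as np

def micro_triangles(width, height):
    """Index buffer for a per-pixel micro-triangulation: each pixel cell
    (x, y) is split into two triangles over its four corner vertices.
    Vertices are indexed row-major into the warped-vertex array, so
    adjacent triangles automatically share warped vertices."""
    tris = []
    for y in range(height - 1):
        for x in range(width - 1):
            v00 = y * width + x      # top-left corner of the pixel cell
            v10 = v00 + 1            # top-right
            v01 = v00 + width        # bottom-left
            v11 = v01 + 1            # bottom-right
            tris.append((v00, v10, v01))   # upper-left micro-triangle
            tris.append((v10, v11, v01))   # lower-right micro-triangle
    return np.array(tris)
```

A (width × height) depth image thus yields 2·(width-1)·(height-1) micro-triangles, which illustrates the very large triangle count mentioned below.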

Figure 4.4 Stages of a micro-triangular mesh rendering technique: first, each triangle vertex in the reference image is warped and, second, each triangle is rasterized to produce the output image.

While such a technique leads to high-quality image rendering, a disadvantage is the very large number of micro-triangles, which involves a high computational complexity.

4.2.3  Rendering using relief texture mapping

As previously highlighted, the 3D image warping and the mesh-based rendering techniques suffer either from a low rendering quality, or a high computational complexity. To circumvent these issues, we now present an alternative rendering technique derived from the relief texture mapping algorithm [77]. Initially, relief texture mapping was developed for rendering uneven surfaces of synthetic computer graphics objects. The guiding principle of the relief texture mapping is to factorize the 3D image warping equation into a combination of simpler 2D texture mapping operations. In this section, we employ a similar approach, albeit adapted to the multi-view geometry framework.

Let us now factorize the warping function so that the equation is decomposed into a sequence of simple 2D texture mapping operations. From Equation (4.5), it can be written

\[
\frac{\lambda_2}{Z_w} p_2 = K_2 R_2 K_1^{-1} \left( p_1 - \frac{K_1 C_2}{Z_w} \right).
\tag{4.6}
\]

Analyzing this factorized equation, we observe that the first factor K2R2K1^{-1} is a 3 × 3 matrix that corresponds to a homography transform between the two images. In practice, a homography transform between two images is implemented using a planar texture mapping algorithm. The advantage of using such a transformation is that a hardware implementation of planar texture mapping is available in any standard GPU.
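To illustrate, applying the homography factor H = K2R2K1^{-1} amounts to a single matrix product followed by a perspective divide, the operation GPUs perform natively during planar texture mapping. The camera values below are hypothetical:

```python
import numpy as np

# Hypothetical cameras: identical intrinsics, virtual camera rotated
# by 5 degrees about the Y-axis.
K1 = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
K2 = K1.copy()
theta = np.deg2rad(5.0)
R2 = np.array([[ np.cos(theta), 0, np.sin(theta)],
               [ 0,             1, 0            ],
               [-np.sin(theta), 0, np.cos(theta)]])

H = K2 @ R2 @ np.linalg.inv(K1)      # the 3x3 homography factor

def map_pixel(H, p):
    """Planar texture mapping of one homogeneous pixel position."""
    q = H @ np.asarray(p, dtype=float)
    return q / q[2]                  # perspective divide

p = map_pixel(H, [320, 240, 1])      # principal point shifts by f*tan(theta)
```

Note that for a purely rotated virtual camera (C2 = 0) the second factor of Equation (4.6) vanishes and the whole warp reduces to this depth-independent homography.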

Let us now analyze the second factor of the factorized equation, i.e., (p1 - K1C2/Zw). This term projects the input pixel p1 onto an intermediate point pi = (xi, yi, 1)^T, which can be defined as

\[
\lambda_i p_i = p_1 - \frac{K_1 C_2}{Z_w},
\tag{4.7}
\]

where λi defines a homogeneous scaling factor. It can be seen that this last operation performs the translation of the reference pixel p1 to the intermediate pixel pi. The translation vector can be expressed in homogeneous coordinates by

\[
\lambda_i \begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix}
= \begin{pmatrix} x_1 - t_1 \\ y_1 - t_2 \\ 1 - t_3 \end{pmatrix},
\quad \text{with } (t_1, t_2, t_3)^T = \frac{K_1 C_2}{Z_w}.
\tag{4.8}
\]

Written in Euclidean coordinates, the intermediate pixel position is defined by

\[
x_i = \frac{x_1 - t_1}{1 - t_3}, \qquad y_i = \frac{y_1 - t_2}{1 - t_3}.
\tag{4.9}
\]

It can be seen that the mapping from Equations (4.8) and (4.9) basically involves a 2D texture mapping operation, which can be further decomposed into a sequence of two 1D transformations. In practice, these two 1D transformations are performed first, along rows, and second, along columns. This class of warping methods is known as scanline algorithms [105]. An advantage of this supplementary decomposition is that a simpler 1D texture mapping algorithm can be employed (as opposed to 2D texture mapping algorithms). Hence, 1D pixel re-sampling can now be performed easily.

Let us now provide details about the pixel re-sampling procedure. Consider two consecutive pixels in the input reference depth image (see Figure 4.5). Following Equation (4.9), both pixels are first shifted horizontally in the intermediate image. Because both pixels are mapped onto sub-pixel positions, it is necessary to calculate and interpolate pixel values at integer pixel positions. A similar process is repeated column-wise to obtain the intermediate image.

Figure 4.5 Two 1D pixel resampling operations. (a) In a first step, two consecutive pixels are horizontally shifted and re-sampled. (b) Next, re-sampled pixels are then projected onto the final intermediate image by performing vertical pixel shift followed by a pixel re-sampling procedure.

The pseudocode of the pixel re-sampling algorithm is summarized in Algorithm 3. In this pseudocode, the procedure LinearInterpolationColor(xiprev, xi, x) performs a linear interpolation between color[xiprev] and color[xi] at integer position x. Similarly, the procedure LinearInterpolationDepth(xiprev, xi, x) performs a linear interpolation between depth[xiprev] and depth[xi] at integer position x.

Algorithm 3 1D horizontal pixel mapping and re-sampling
Require: Camera parameters K1 and C2, input row depth[.] and color[.].
Ensure: Output re-sampled row depth_out[.] and color_out[.].
  procedure AskXIntermediate(x1, Zw)
    return xi = (x1 - t1)/(1 - t3) with (t1, t2, t3)^T = (K1C2)/Zw
  end procedure
  colorprev = color[0]; Zwprev = depth[0]
  xiprev = AskXIntermediate(0, Zwprev)
  for (x1 = 1; x1 < width; x1++) do
    Zw = depth[x1]
    xi = AskXIntermediate(x1, Zw)
    for (u = ⌈xiprev⌉; u ≤ xi; u++) do
      color_out[u] = LinearInterpolationColor(xiprev, xi, u)
      depth_out[u] = LinearInterpolationDepth(xiprev, xi, u)
    end for
    xiprev = xi; Zwprev = Zw
  end for
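For clarity, the horizontal pass of Algorithm 3 can also be written as runnable code. The NumPy sketch below follows the same structure; function and variable names are illustrative, and interpolation weights are taken between consecutive intermediate positions:

```python
import numpy as np

def resample_row(color, depth, K1C2, width):
    """Horizontal 1D mapping and re-sampling of one image row (a sketch
    of Algorithm 3, not the exact thesis code). K1C2 is the 3-vector
    K1 @ C2. Each source pixel x1 maps to the intermediate position
    x_i = (x1 - t1) / (1 - t3), with (t1, t2, t3)^T = K1C2 / Zw; colors
    and depths are linearly interpolated at the integer positions
    covered by consecutive source pixels."""
    def x_intermediate(x1, Zw):
        t1, _, t3 = K1C2 / Zw
        return (x1 - t1) / (1.0 - t3)

    color_out = np.zeros(width)
    depth_out = np.zeros(width)
    xi_prev = x_intermediate(0, depth[0])
    c_prev, z_prev = color[0], depth[0]
    for x1 in range(1, len(color)):
        xi = x_intermediate(x1, depth[x1])
        for u in range(int(np.ceil(xi_prev)), int(np.floor(xi)) + 1):
            if 0 <= u < width and xi > xi_prev:
                w = (u - xi_prev) / (xi - xi_prev)  # interpolation weight
                color_out[u] = (1.0 - w) * c_prev + w * color[x1]
                depth_out[u] = (1.0 - w) * z_prev + w * depth[x1]
        xi_prev, c_prev, z_prev = xi, color[x1], depth[x1]
    return color_out, depth_out
```

With C2 = 0 the mapping is the identity and the row passes through unchanged, which provides a simple sanity check; the vertical pass applies the same routine column-wise.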

The advantages of the relief texture algorithm are twofold. First, the described decomposition of the 3D image warping equation factorizes the equation into a sequence of two simpler 2D texture mapping operations. As a beneficial result, the implementation of the relief texture algorithm can be realized on a standard GPU, so that the rendering can be executed efficiently. Second, the relief texture algorithm allows an accurate re-sampling of pixels, leading to an accurate rendering of fine textures. However, one disadvantage of the relief texture rendering technique is that the algorithm uses an intermediate image of arbitrary size, so that its computational complexity depends on the virtual camera position. More specifically, we have seen that the relief texture algorithm synthesizes an intermediate image (described by the factor p1 - K1C2/Zw in Equation (4.6)) prior to performing the planar texture mapping operation (described by the factor K2R2K1^{-1} in Equation (4.6)). Because the intermediate pixel position pi depends on the arbitrary position of the virtual camera centered at C2, the pixel pi may be projected at an arbitrary position in the intermediate image. For example, intermediate pixels may be projected at negative pixel coordinates. A solution for handling intermediate pixels with negative coordinates is to use a larger intermediate image whose coordinate system is translated accordingly. Therefore, the relief texture algorithm requires an intermediate image of larger size. In practice, the size of the intermediate image is determined by calculating the positions of the four image corners for the minimum and maximum depth values (Equation (4.6)).

4.2.4  Rendering using inverse mapping

A. Previous work

Let us now introduce an alternative technique to forward mapping, called inverse mapping. Prior to presenting the algorithm, we discuss aspects of the previously described approaches and of related work.

In the previous sections, three different image rendering algorithms were presented: 3D image warping, triangular mesh-based rendering and relief texture mapping. Each of these algorithms features some advantages and disadvantages. First, the 3D image warping provides a low complexity method for rendering images. However, the 3D image warping projects the source pixels onto the destination image grid at sub-pixel positions. As a result, the pixels in the synthetic image are not correctly re-sampled according to the integer pixel grid, resulting in re-sampling artifacts. Second, the triangular mesh-based rendering technique renders high-quality images at the expense of a high computational complexity. Finally, the proposed variant of the relief texture algorithm simultaneously features a high-quality rendering and a rendering-equation decomposition well suited for GPU architectures. However, the computational complexity varies with respect to the virtual camera position. Note that each of these rendering algorithms is defined as a mapping from the source image coordinates p1 to the destination image coordinates p2. Therefore, they are referred to as forward mapping. An alternative formulation is to define the rendering algorithm as a function of the destination image to the source image coordinates. Such a technique is usually referred to as inverse mapping. In particular, inverse mapping (back) projects destination pixels onto the source image grid. The color of the synthetic destination pixel is then interpolated from the pixels surrounding the calculated source pixel position. Therefore, the advantage of an inverse mapping rendering technique is that all pixels of the destination image are correctly defined and no holes can be distinguished in the rendered image. Figure 4.6 illustrates the image rendering process using inverse mapping.


Figure 4.6 An inverse mapping function (back) projects a destination pixel onto the source image grid. The color of the destination pixel is then interpolated from the four neighboring pixels in the source image.

One of the earliest attempts at rendering images using inverse mapping is based on a collection of uncalibrated and calibrated views [49]. The algorithm searches for the color of the destination pixel that is the most consistent across the views. Using the geometry of multiple views, this is performed by searching for the depth value of the destination pixel such that all corresponding pixel colors are consistent. This most consistent color is finally used as the color of the destination pixel. Because this method requires a search over depth values, the technique is computationally expensive. To avoid such an expensive search, the Sprite with Depth algorithm has been introduced [89]. The Sprite with Depth algorithm decomposes the source images into multiple planar sprite regions with known depth values and warps each sprite to the position of the destination image, thereby providing the per-pixel depth (in the destination image). As a result, an expensive depth search is avoided. However, as mentioned in [89], because the source image is decomposed into multiple planar sprites, the Sprite with Depth algorithm can only handle 3D scenes with smoothly varying surfaces and short baseline distances between the cameras.

B. Rendering algorithm using inverse mapping

Bringing together ideas from both the Sprite with Depth algorithm and rendering from a collection of views, we propose an inverse mapping rendering algorithm that (a) allows a simple and accurate re-sampling of synthetic pixels, and (b) enables multiple source images to be combined easily, such that occluded regions can be correctly handled. It should be noted that property (a) is not fulfilled by the first presented 3D forward mapping algorithm, whereas the other two alternatives require complex re-sampling procedures. Property (b) is realized by scanning the interpolated view, so that the technique is forced to handle occluded regions by interpolating them from the available views. The new algorithm can be summarized as follows.
Step 1: The depth map of the source image is warped to the position of the destination image (forward mapping of the depth image). This can be achieved using the 3D image warping equation (see Section 4.2.1), defined as
\[
\lambda_2 d_2 = K_2 R_2 K_1^{-1} Z_{1w} d_1 - K_2 R_2 C_2,
\]

where d1 and d2 are pixel coordinates in the source and destination depth images, respectively, and λ2 is a positive scaling factor. The depth value Z1w is given by the pixel value at coordinate d1 in the source depth image. As defined in Section 4.2.1, (K2, R2, C2) and K1 correspond to the camera parameters of the destination and source images, respectively. Note that possible occluded holes are not padded at this stage.

Step 2: To avoid undefined pixels resulting from the forward mapping algorithm, a dilation operation is carried out on the rendered depth image. Next, two erosion operations, which slightly reduce the delineation of foreground objects, are performed. This last step ensures that blended pixels at object borders are not classified as foreground pixels. Such a rendering artifact is described in detail in Section 4.4.1.
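The morphological clean-up of Step 2 can be sketched with plain 3 × 3 grey-scale operators. Whether the maximum or the minimum corresponds to "foreground" depends on the depth convention (near = small vs. near = large values), so the operators below are generic; the function names and the edge-padding choice are our own assumptions:

```python
import numpy as np

def dilate(d, it=1):
    """3x3 grey-scale dilation: each pixel takes the maximum of its
    neighborhood, filling small gaps left by forward-mapping the depth map."""
    for _ in range(it):
        p = np.pad(d, 1, mode='edge')
        d = np.max([p[i:i + d.shape[0], j:j + d.shape[1]]
                    for i in range(3) for j in range(3)], axis=0)
    return d

def erode(d, it=1):
    """3x3 grey-scale erosion: each pixel takes the minimum of its
    neighborhood, slightly shrinking object delineations."""
    for _ in range(it):
        p = np.pad(d, 1, mode='edge')
        d = np.min([p[i:i + d.shape[0], j:j + d.shape[1]]
                    for i in range(3) for j in range(3)], axis=0)
    return d

# Step 2 as described: one dilation, then two erosions, e.g.
# depth_clean = erode(dilate(depth_warped, 1), 2)
```

A single dilation pass already fills isolated one-pixel holes, while the two subsequent erosions pull the object borders back, as described above.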
Step 3: Next, for each defined pixel d2 of the destination depth image, a corresponding 3D world point (X2w,Y 2w,Z2w)T is calculated using the back-projection operation defined by Equation (2.20)
\[
\begin{pmatrix} X_{2w} \\ Y_{2w} \\ Z_{2w} \end{pmatrix}
= C_2 + \lambda R_2^{-1} K_2^{-1} d_2,
\]

where λ is defined as

\[
\lambda = \frac{Z_{2w} - C_{2z}}{r_3},
\quad \text{with }
\begin{pmatrix} r_1 \\ r_2 \\ r_3 \end{pmatrix} = R_2^{-1} K_2^{-1} d_2
\ \text{ and }\
C_2 = \begin{pmatrix} C_{2x} \\ C_{2y} \\ C_{2z} \end{pmatrix}.
\]

Here, the depth value Z2w is defined by the pixel value at coordinate d2 in the destination depth image. Thus, the depth value Z2w is provided by the first step of the algorithm that warps the source depth image into the destination depth image.
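Step 3 can be written compactly in code. The sketch below follows the back-projection formulas above under the camera conventions of Section 4.2.1; the function name is our own:

```python
import numpy as np

def back_project(d2, Z2w, K2, R2, C2):
    """Back-project destination pixel d2 = (x2, y2, 1)^T with depth Z2w
    into a 3D world point (Step 3 of the inverse-mapping algorithm):
    P = C2 + lambda * R2^{-1} K2^{-1} d2, with
    lambda = (Z2w - C2z) / r3 and (r1, r2, r3)^T = R2^{-1} K2^{-1} d2."""
    r = np.linalg.inv(R2) @ np.linalg.inv(K2) @ np.asarray(d2, dtype=float)
    lam = (Z2w - C2[2]) / r[2]       # scale such that the Z-coordinate is Z2w
    return C2 + lam * r
```

By construction the returned point has Z-coordinate Z2w, i.e., the warped depth value produced by Step 1.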

Step 4: Finally, the calculated 3D point (X2w,Y 2w,Z2w)T is projected onto the source texture image by employing
\[
\lambda p_1 =
\underbrace{[K_1 \mid 0_3]
\begin{bmatrix} R_1 & -R_1 C_1 \\ 0_3^T & 1 \end{bmatrix}}_{\text{projection matrix } M_1}
\begin{pmatrix} X_{2w} \\ Y_{2w} \\ Z_{2w} \\ 1 \end{pmatrix},
\]

such that the color of the destination pixel p2 can be interpolated from the pixels surrounding p1 in the source image. The possible holes in the image are finally padded using a technique described in the following section.
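The interpolation from the pixels surrounding p1 is a standard bilinear lookup over the four neighboring source pixels (cf. Figure 4.6). A minimal sketch, with a hypothetical function name and image borders clamped as an assumption:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Fetch the color at sub-pixel position (x, y) in the source image
    by bilinear interpolation of the four surrounding pixels (Step 4)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, img.shape[1] - 1)    # clamp at the image border
    y1 = min(y0 + 1, img.shape[0] - 1)
    fx, fy = x - x0, y - y0               # fractional offsets
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bot = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bot
```

Because every destination pixel performs such a lookup, all pixels of the rendered image are defined, which is precisely the advantage of inverse mapping noted above.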

The proposed inverse mapping rendering technique exhibits three advantages. First, because an inverse mapping procedure is employed, destination pixels can be accurately interpolated, thereby rendering high-quality virtual images. Second, we have seen that “Step 1” of the algorithm involves a forward mapping of the depth image. Intuitively, this processing step suffers from the usual problems associated with rendering using forward mapping. However, it should be noted that the depth image represents the surface of objects, so that the corresponding depth signal mainly features low-frequency components. As a result, forward mapping of depth images yields limited rendering artifacts when compared with forward mapping of texture images. Third, the color of occluded pixels can be inferred at “Step 3” by projecting the 3D world point (X2w, Y2w, Z2w)^T onto multiple source image planes covering all regions of the video scene. We describe this occlusion-handling scheme in detail in Section 4.4.3.

All rendering algorithms presented up to this point suffer from occlusion artifacts, such as holes that cannot be reconstructed in the computed images, or overlapping pixels that cannot be interpreted. This means that additional occlusion processing is required, which is presented in the following two sections.