4.2 Depth Image Based Rendering
4.2.1 Rendering using 3D image warping
In this section, we describe the 3D image warping technique, which enables the rendering of a synthetic image, using a reference texture image and a corresponding depth image [56]. Let us consider a 3D point at homogeneous coordinates P_{w} = (X_{w},Y _{w},Z_{w},1)^{T }, captured by two cameras and projected onto the reference and synthetic image plane at pixel positions p_{1} = (x_{1},y_{1},1)^{T } and p_{2} = (x_{2},y_{2},1)^{T }, respectively (see Figure 4.2).
The orientation and location of camera i (with i ∈ {1,2}) are described by the rotation matrix R_{i} and the translation vector t_{i} = −R_{i}C_{i}, where C_{i} denotes the coordinates of the camera center. This allows us to define the pixel positions p_{1} and p_{2} in both image planes by

λ_{1} p_{1} = K_{1} [R_{1} | t_{1}] P_{w},  (4.1)

λ_{2} p_{2} = K_{2} [R_{2} | t_{2}] P_{w},  (4.2)
where K_{1}, K_{2} represent the 3 × 3 intrinsic parameter matrices of the corresponding cameras and λ_{1}, λ_{2} are the homogeneous scaling factors. Assuming that the reference “Camera 1” is located at the coordinate-system origin (C_{1} = 0_{3}) and looks along the Z-direction (R_{1} = I_{3×3}), the scaling factor λ_{1} in this particular case equals λ_{1} = Z_{w}. From Equation (4.1), the 3D position of the original point P_{w} in the Euclidean domain can be written as
(X_{w}, Y_{w}, Z_{w})^{T} = Z_{w} K_{1}^{−1} p_{1}.  (4.3)
Finally, we obtain the synthetic pixel position p_{2} by substituting Equation (4.3) into Equation (4.2), so that
λ_{2} p_{2} = Z_{w} K_{2} R_{2} K_{1}^{−1} p_{1} + K_{2} t_{2}.  (4.4)
Writing the translation vector of the virtual camera as t_{2} = −R_{2}C_{2}, we rewrite the warping equation into
λ_{2} p_{2} = Z_{w} K_{2} R_{2} K_{1}^{−1} p_{1} − K_{2} R_{2} C_{2}.  (4.5)
Equation (4.5) constitutes the 3D image warping equation [56] that enables the synthesis of the virtual view from a reference texture view and a corresponding depth image.
One issue of the previously described method is that input pixels p_{1} of the reference view are usually not mapped onto a point p_{2} at an integer pixel position. To obtain an integer pixel position, a simple heuristic technique is to map the sub-pixel coordinate p_{2} to the nearest integer pixel position p̂_{2} = (x̂_{2}, ŷ_{2}, 1)^{T} = (⌊x_{2} + 0.5⌋, ⌊y_{2} + 0.5⌋, 1)^{T}. Additionally, to avoid undefined pixels in the rendered image, so-called pixel splats, i.e., surfaces that cover multiple destination pixels, can be employed. In practice, a pixel splat simply enlarges the footprint of the destination pixel p_{2}, such that not only the color of p̂_{2} is defined, but also that of some pixels surrounding p̂_{2}. A second complication is that multiple original pixels can be projected onto the same pixel position in the virtual view. For example, a foreground pixel can occlude a background pixel in the rendered view, which results in overlapping pixels. Additionally, some regions in the virtual view are not visible from the original viewpoint, which results in holes in the virtual image (see Figure 4.3). The aforementioned issues are addressed with a rendering technique based on triangular meshes, which is presented in the next section.
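As an illustrative sketch, the forward warping of Equation (4.5) together with nearest-integer rounding and a pixel splat can be written as follows. The camera parameters, the baseline, and the splat radius below are assumptions chosen for the example, not values from the text:

```python
import numpy as np

def warp_pixel(p1, z_w, K1, K2, R2, C2):
    """Forward-warp a reference pixel p1 = (x1, y1, 1)^T with depth z_w
    via lambda2 * p2 = z_w * K2 R2 K1^-1 p1 - K2 R2 C2."""
    q = z_w * K2 @ R2 @ np.linalg.inv(K1) @ p1 - K2 @ R2 @ C2
    return q[:2] / q[2]          # Euclidean pixel coordinates (x2, y2)

def splat(image, color, x2, y2, radius=1):
    """Round to the nearest integer pixel and enlarge the footprint so that
    neighbouring destination pixels are also covered (pixel splat)."""
    xi, yi = int(np.floor(x2 + 0.5)), int(np.floor(y2 + 0.5))
    h, w = image.shape[:2]
    for v in range(max(0, yi - radius), min(h, yi + radius + 1)):
        for u in range(max(0, xi - radius), min(w, xi + radius + 1)):
            image[v, u] = color
    return xi, yi

# Illustrative cameras (assumed): identical intrinsics, virtual camera
# translated 10 cm along the X-axis, no rotation.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R2 = np.eye(3)
C2 = np.array([0.1, 0.0, 0.0])
p1 = np.array([320.0, 240.0, 1.0])

x2, y2 = warp_pixel(p1, z_w=2.0, K1=K, K2=K, R2=R2, C2=C2)
canvas = np.zeros((480, 640))
splat(canvas, 1.0, x2, y2)
```

With these parameters the central pixel lands 25 pixels to the left in the virtual view, and the splat fills a 3 × 3 footprint around the rounded position.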
4.2.2 Rendering using triangular meshes
To avoid rendering artifacts, a natural approach is to employ a mesh of triangles. The idea is to triangulate the reference depth image so that each triangle locally approximates the object surface. In our implementation, the depth image is triangulated such that two micro-triangles per pixel are employed. For each triangle vertex in the reference image, the corresponding position of the warped vertex is calculated using Equation (4.4). Finally, a rasterization procedure is performed that converts the triangle-based geometric description of the warped image into a bitmap or raster image (see Figure 4.4). For an efficient implementation, it can be noted that adjacent triangles share two common vertices. Therefore, only one warped-vertex position per pixel needs to be computed; the remaining vertex positions are available from neighboring pixels.

While such a technique leads to high-quality image rendering, a disadvantage is the very large number of micro-triangles, which entails a high computational complexity.
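The per-pixel triangulation described above can be sketched as follows; the row-major vertex indexing and the particular diagonal split of each pixel quad are illustrative assumptions:

```python
def triangulate_depth_grid(width, height):
    """Split every pixel quad of a depth image into two micro-triangles.
    Vertices are pixel centres indexed row-major; quad (x, y) yields the
    triangles (top-left, bottom-left, top-right) and
    (top-right, bottom-left, bottom-right)."""
    triangles = []
    for y in range(height - 1):
        for x in range(width - 1):
            tl = y * width + x       # top-left vertex index
            tr = tl + 1              # top-right
            bl = tl + width          # bottom-left
            br = bl + 1              # bottom-right
            triangles.append((tl, bl, tr))
            triangles.append((tr, bl, br))
    return triangles

# A tiny 4x3 depth image: (4-1)*(3-1) quads, two micro-triangles each.
tris = triangulate_depth_grid(4, 3)
```

Because consecutive triangles reuse vertex indices, only one new warped-vertex position has to be computed per pixel, as noted in the text.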
4.2.3 Rendering using relief texture mapping
As previously highlighted, the 3D image warping and the mesh-based rendering techniques suffer from either a low rendering quality or a high computational complexity. To circumvent these issues, we now present an alternative rendering technique derived from the relief texture mapping algorithm [77]. Initially, relief texture mapping was developed for rendering the uneven surfaces of synthetic computer graphics objects. The guiding principle of relief texture mapping is to factorize the 3D image warping equation into a combination of simpler 2D texture mapping operations. In this section, we employ a similar approach, albeit adapted to the multiview geometry framework.
Let us now factorize the warping function so that the equation is decomposed into a sequence of simple 2D texture mapping operations. From Equation (4.5), we can write
λ_{2} p_{2} = K_{2} R_{2} K_{1}^{−1} (Z_{w} p_{1} − K_{1} C_{2}).  (4.6)
Analyzing this factorized equation, it can be observed that the first factor K_{2}R_{2}K_{1}^{−1} is a 3 × 3 matrix that corresponds to a homography transform between two images. In practice, a homography transform between two images is implemented using a planar texture mapping algorithm. The advantage of using such a transformation is that a hardware implementation of this function is available on a standard GPU.
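The factorization can be checked numerically: applying the homography to the translated point gives the same result as the unfactored warping equation. All camera parameters below are illustrative assumptions:

```python
import numpy as np

# Illustrative intrinsics and pose of the virtual camera (assumptions).
K1 = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
K2 = np.array([[480.0, 0, 310], [0, 480.0, 250], [0, 0, 1]])
theta = 0.1                                   # small rotation about the Y-axis
R2 = np.array([[np.cos(theta), 0, np.sin(theta)],
               [0, 1, 0],
               [-np.sin(theta), 0, np.cos(theta)]])
C2 = np.array([0.2, 0.05, -0.1])
p1 = np.array([100.0, 200.0, 1.0])
z_w = 3.0

H = K2 @ R2 @ np.linalg.inv(K1)               # the 3x3 homography factor

unfactored = z_w * H @ p1 - K2 @ R2 @ C2      # warping equation, expanded form
factored = H @ (z_w * p1 - K1 @ C2)           # factorized form of Eq. (4.6)
```

Both expressions agree term by term because K_{2}R_{2}K_{1}^{−1} · K_{1}C_{2} = K_{2}R_{2}C_{2}.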
Let us now analyze the second factor of the factorized equation, i.e., (Z_{w} p_{1} − K_{1} C_{2}). This term projects the input pixel p_{1} onto an intermediate point p_{i} = (x_{i},y_{i},1)^{T }, which can be defined as
λ_{i} p_{i} = Z_{w} p_{1} − K_{1} C_{2},  (4.7)
where λ_{i} defines a homogeneous scaling factor. It can be seen that this operation translates the reference pixel p_{1} to the intermediate pixel p_{i}. Denoting (t_{x}, t_{y}, t_{z})^{T} = K_{1} C_{2}, the translation vector can be expressed in homogeneous coordinates by

p_{i} − p_{1} = ( (t_{z} x_{1} − t_{x}) / (Z_{w} − t_{z}),  (t_{z} y_{1} − t_{y}) / (Z_{w} − t_{z}),  0 )^{T}.  (4.8)

Written in Euclidean coordinates, the intermediate pixel position is defined by

x_{i} = (Z_{w} x_{1} − t_{x}) / (Z_{w} − t_{z}),  y_{i} = (Z_{w} y_{1} − t_{y}) / (Z_{w} − t_{z}).  (4.9)
It can be seen that the mapping from Equations (4.8) and (4.9) basically involves a 2D texture mapping operation, which can be further decomposed into a sequence of two 1D transformations. In practice, these two 1D transformations are performed first, along rows, and second, along columns. This class of warping methods is known as scanline algorithms [105]. An advantage of this supplementary decomposition is that a simpler 1D texture mapping algorithm can be employed (as opposed to 2D texture mapping algorithms). Hence, 1D pixel resampling can now be performed easily.
Let us now provide details about the pixel resampling procedure. Consider two consecutive pixels in the input reference depth image (see Figure 4.5). Following Equation (4.9), both pixels are first shifted horizontally in the intermediate image. Because both pixels are mapped onto subpixel positions, it is necessary to calculate and interpolate pixel values at integer pixel positions. A similar process is repeated columnwise to obtain the intermediate image.
The pseudocode of the pixel resampling algorithm is summarized in Algorithm 3. In this pseudocode, the procedure LinearInterpolationColor(x_{i}^{prev}, x_{i}, x) performs a linear interpolation between the colors color[x_{i}^{prev}] and color[x_{i}] at integer position x. Similarly, the procedure LinearInterpolationDepth(x_{i}^{prev}, x_{i}, x) performs a linear interpolation between the depth values depth[x_{i}^{prev}] and depth[x_{i}] at integer position x.
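A minimal sketch of the row-wise resampling pass is given below, in the spirit of Algorithm 3; the exact bookkeeping of the algorithm is not reproduced, and monotonically increasing warped positions are assumed:

```python
import math

def resample_scanline(xs, colors, depths, width):
    """Resample one scanline after a horizontal shift: xs holds the warped
    sub-pixel positions of the source pixels (assumed increasing); colour
    and depth are linearly interpolated at every integer position lying
    between two consecutive warped pixels."""
    out_color = [None] * width
    out_depth = [None] * width
    for i in range(1, len(xs)):
        x_prev, x_cur = xs[i - 1], xs[i]
        if x_cur <= x_prev:
            continue                      # skip degenerate spans
        x = math.ceil(x_prev)             # first integer position covered
        while x <= x_cur:
            if 0 <= x < width:
                w = (x - x_prev) / (x_cur - x_prev)
                out_color[x] = (1 - w) * colors[i - 1] + w * colors[i]
                out_depth[x] = (1 - w) * depths[i - 1] + w * depths[i]
            x += 1
    return out_color, out_depth

# Two warped pixels at sub-pixel positions 0.5 and 2.5: the integer
# positions 1 and 2 receive interpolated colour and depth values.
color, depth = resample_scanline([0.5, 2.5], [0.0, 10.0], [1.0, 2.0], 4)
```

The same loop, run column-wise on the row-resampled image, completes the two-pass scanline decomposition described above.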
The advantages of the relief texture algorithm are twofold. First, the described decomposition factorizes the 3D image warping equation into a sequence of simpler texture mapping operations. As a beneficial result, the relief texture algorithm can be implemented on a standard GPU, so that the rendering can be executed efficiently. Second, the relief texture algorithm allows an accurate resampling of pixels, leading to an accurate rendering of fine textures. However, one disadvantage of the relief texture rendering technique is that the algorithm uses an intermediate image of arbitrary size, which makes its computational complexity dependent on the virtual camera position. More specifically, we have seen that the relief texture algorithm synthesizes an intermediate image (described by the factor (Z_{w} p_{1} − K_{1} C_{2}) in Equation (4.6)) prior to performing the planar texture mapping operation (described by the factor K_{2}R_{2}K_{1}^{−1} in Equation (4.6)). Because the intermediate pixel position p_{i} depends on the arbitrary position of the virtual camera centered at C_{2}, the pixel p_{i} may also be projected at an arbitrary position in the intermediate image. For example, intermediate pixels may be projected at negative pixel coordinates. A solution for handling intermediate pixels with negative coordinates is to use a large intermediate image whose coordinate system is translated. Therefore, the relief texture algorithm requires an intermediate image of larger size. In practice, the size of the intermediate image is determined by calculating the positions of the four image corners for the minimum and maximum depth values (Equation (4.6)).
4.2.4 Rendering using inverse mapping
A. Previous work
Let us now introduce an alternative technique to forward mapping, called inverse mapping. Prior to presenting the algorithm, we discuss relevant aspects of the previously mentioned approaches and of related work.
In the previous sections, three different image rendering algorithms were presented: 3D image warping, triangular mesh-based rendering and relief texture mapping. Each of these algorithms features some advantages and disadvantages. First, 3D image warping provides a low-complexity method for rendering images. However, 3D image warping projects the source pixels onto the destination image grid at sub-pixel positions. As a result, the pixels in the synthetic image are not correctly resampled according to the integer pixel grid, resulting in resampling artifacts. Second, the triangular mesh-based rendering technique renders high-quality images at the expense of a high computational complexity. Finally, the proposed variant of the relief texture algorithm simultaneously features high-quality rendering and a rendering-equation decomposition well suited for GPU architectures. However, its computational complexity varies with the virtual camera position. Note that each of these rendering algorithms is defined as a mapping from the source image coordinates p_{1} to the destination image coordinates p_{2}; they are therefore referred to as forward mapping. An alternative formulation is to define the rendering algorithm as a function from the destination image to the source image coordinates. Such a technique is usually referred to as inverse mapping. In particular, inverse mapping (back-)projects destination pixels onto the source image grid. The color of the synthetic destination pixel is then interpolated from the pixels surrounding the calculated source pixel position. Therefore, the advantage of an inverse mapping rendering technique is that all pixels of the destination image are correctly defined and no holes appear in the rendered image. Figure 4.6 illustrates the image rendering process using inverse mapping.
One of the earliest attempts at rendering images using inverse mapping is based on a collection of uncalibrated and calibrated views [49]. The algorithm searches for the color of the destination pixel that is most consistent across the views. Using the geometry of multiple views, this is performed by searching for the depth value of the destination pixel such that all pixel colors are consistent. This most consistent color is finally used as the color of the destination pixel. Because this method requires a search over depth values, the technique is computationally expensive. To avoid such an expensive search, the Sprite with Depth algorithm has been introduced [89]. The Sprite with Depth algorithm decomposes the source images into multiple planar sprite regions with known depth values and warps each sprite to the position of the destination image, thereby providing the per-pixel depth (in the destination image). As a result, an expensive depth search is avoided. However, as mentioned in [89], because the source image is decomposed into multiple planar sprites, the Sprite with Depth algorithm can only handle 3D scenes with smoothly varying surfaces and short baseline distances between the cameras.
B. Rendering algorithm using inverse mapping
Bringing together ideas from both the Sprite with Depth algorithm and rendering from a collection of views, we propose an inverse mapping rendering algorithm that (a) allows a simple and accurate resampling of synthetic pixels, and (b) enables multiple source images to be easily combined, such that occluded regions can be correctly handled. It should be noted that property (a) is not fulfilled by the first presented 3D forward mapping algorithm, whereas the other two alternatives require complex resampling procedures. Property (b) is realized by scanning the interpolated view, so that the technique is forced to handle occluded regions by interpolating them from the available views. The new algorithm can be summarized as follows.
• Step 1: The depth map of the source image is warped to the position of the destination image (forward mapping of the depth image). This can be achieved using the 3D image warping equation (see Section 4.2.1), defined as

λ_{2} d_{2} = Z_{1w} K_{2} R_{2} K_{1}^{−1} d_{1} − K_{2} R_{2} C_{2},  (4.10)

where d_{1} and d_{2} are depth pixel coordinates in the source and destination depth images, respectively, and λ_{2} is a positive scaling factor. The depth value Z_{1w} is given by the pixel value at coordinate d_{1} in the source depth image. As defined in Section 4.2.1, (K_{2},R_{2},C_{2}) and (K_{1}) correspond to the camera parameters of the destination and source images, respectively. Note that possible occluded holes are not padded at this stage.
• Step 2: To avoid undefined pixels resulting from the forward mapping algorithm, a dilation operation is carried out on the rendered depth image. Next, two erosion operations, which slightly reduce the delineation of foreground objects, are performed. This last step ensures that blended pixels at object borders are not classified as foreground pixels. Such a rendering artifact is described in detail in Section 4.4.1.
• Step 3: Next, for each defined pixel d_{2} of the destination depth image, a corresponding 3D world point (X_{2w},Y _{2w},Z_{2w})^{T } is calculated using the back-projection operation defined by Equation (2.20):

(X_{2w}, Y_{2w}, Z_{2w})^{T} = λ R_{2}^{−1} K_{2}^{−1} d_{2} + C_{2},  (4.11)

where λ is defined such that the Z-coordinate of the back-projected point equals the warped depth value, i.e.,

λ = (Z_{2w} − C_{2z}) / (e_{3}^{T} R_{2}^{−1} K_{2}^{−1} d_{2}),  (4.12)

with e_{3} = (0,0,1)^{T} and C_{2z} the third coordinate of C_{2}. Here, the depth value Z_{2w} is given by the pixel value at coordinate d_{2} in the destination depth image. Thus, the depth value Z_{2w} is provided by the first step of the algorithm, which warps the source depth image into the destination depth image.
• Step 4: Finally, the calculated 3D point (X_{2w},Y _{2w},Z_{2w})^{T } is projected onto the source texture image by employing

λ_{1} p_{1} = K_{1} (X_{2w}, Y_{2w}, Z_{2w})^{T},  (4.13)

such that the color of the destination pixel p_{2} can be interpolated from the pixels surrounding p_{1} in the source image. The possible holes in the image are finally padded using a technique described in the following section.
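Steps 3 and 4 can be sketched for a single destination pixel as follows. The camera configuration is an illustrative assumption (shared intrinsics, 10 cm baseline, no rotation), with the source camera at the origin as in Section 4.2.1:

```python
import numpy as np

def back_project(d2, Z2w, K2, R2, C2):
    """Step 3: back-project destination pixel d2 = (x2, y2, 1)^T with warped
    depth Z2w into a 3D world point; the scaling factor is chosen so that
    the point's world Z-coordinate equals Z2w."""
    ray = R2.T @ np.linalg.inv(K2) @ d2       # viewing ray through the pixel
    lam = (Z2w - C2[2]) / ray[2]
    return lam * ray + C2

def project_to_source(P, K1):
    """Step 4: project the 3D point onto the source image plane (source
    camera at the origin, looking along the Z-direction)."""
    q = K1 @ P
    return q[:2] / q[2]

def bilinear(img, x, y):
    """Interpolate the source colour from the four surrounding pixels."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    a, b = x - x0, y - y0
    return ((1 - a) * (1 - b) * img[y0, x0] + a * (1 - b) * img[y0, x0 + 1]
            + (1 - a) * b * img[y0 + 1, x0] + a * b * img[y0 + 1, x0 + 1])

# Illustrative cameras (assumed): identical intrinsics, baseline along X.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P = back_project(np.array([295.0, 240.0, 1.0]), 2.0, K, np.eye(3),
                 np.array([0.1, 0.0, 0.0]))
x1, y1 = project_to_source(P, K)    # source position to sample the colour from
```

Because the back-projected position (x1, y1) is generally a sub-pixel coordinate, the destination colour is obtained by interpolation in the source image, which is exactly what makes the inverse mapping resampling accurate.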
The proposed inverse mapping rendering technique exhibits three advantages. First, because an inverse mapping procedure is employed, destination pixels can be accurately interpolated, thereby rendering high-quality virtual images. Second, we have seen that “Step 1” of the algorithm involves a forward mapping of the depth image. Intuitively, this processing step suffers from the usual problems associated with rendering using forward mapping. However, it should be noted that the depth image represents the surface of objects, so that the corresponding depth signal mainly features low-frequency components. As a result, forward mapping of depth images introduces limited rendering artifacts when compared to forward mapping of texture images. Third, the color of occluded pixels can be inferred at “Step 4” by projecting the 3D world point (X_{2w},Y _{2w},Z_{2w})^{T } onto multiple source image planes covering all regions of the video scene. We describe this occlusion-handling scheme in detail in Section 4.4.3.
All rendering algorithms presented up to this point suffer from occlusion artifacts, such as holes that cannot be reconstructed in the computed images, or overlapping pixels that cannot be disambiguated. This means that additional occlusion processing is required, which is presented in the following two sections.