3.4 Multiple-view depth estimation
3.4.1 Depth estimation for multiview video coding
The previous section presented a method that estimates disparity images using two rectified views. Similar approaches [43, 107] were recently investigated to create depth maps for 3DTV systems. However, in those applications, a pair of images was used to compute the disparity image. In multiview video, multiple images are available, so that an efficient depth-estimation algorithm should employ all available views. More specifically, the disadvantages of a pairwise depth-estimation algorithm are threefold. First, although multiple views of the scene are available, only two rectified views are employed simultaneously, which degrades performance. Second, because depth images are estimated pairwise, consistency of depth estimates across the views is not enforced. Third, multiple preprocessing (image rectification) and postprocessing (disparity-to-depth conversion and image de-rectification) procedures are necessary.
To circumvent these three disadvantages, we propose an approach that employs multiple views and estimates the depth using a correlation table similar to the Disparity Space Image (DSI). As before, the depth is calculated along each scanline using the dynamic programming algorithm from the previous section. Two novel constraints are introduced that enforce smooth variations of depth across scanlines and across the views, thereby addressing the aforementioned disadvantages. Additionally, no conversion from disparity to depth values is required.
Let us now describe the proposed multiview depth-estimation algorithm. The algorithm calculates a depth image corresponding to a selected reference view k_{r} in two steps. In a first step, the 3D world coordinate (X,Y,Z_{c})^{T } of a pixel p_{kr} = (x_{kr},y_{kr},1)^{T } from the reference view is calculated using a candidate depth value Z_{c}. The 3D world coordinate can be calculated using the backprojection relation (see Equation (2.20)) and written as
 (X, Y, Z_{c})^{T} = C_{kr} + λ R_{kr}^{-1} K_{kr}^{-1} p_{kr},  with λ the scaling factor chosen such that the depth of the resulting 3D point equals the candidate Z_{c},    (3.15) 
where C_{kr}, R_{kr}, K_{kr} correspond to the parameters of the selected reference camera k_{r}. It should be emphasized that the 3D point (X,Y,Z_{c})^{T } is described in the world coordinate system that is shared by all cameras. In a second step, the 3D point is projected onto each neighboring view k, yielding a corresponding pixel (x_{k},y_{k},1)^{T } in each view:
 λ_{k} (x_{k}, y_{k}, 1)^{T} = P_{k} (X, Y, Z_{c}, 1)^{T},  with λ_{k} a homogeneous scaling factor,    (3.16) 
where P_{k} represents the camera parameters of the neighboring camera k. These two steps are repeated for several depth candidates Z_{c}, so that a corresponding similarity can be measured and stored in a correlation table T(x,y,Z_{c}). The algorithm is illustrated by Figure 3.13. It can be noted that this correlation table is similar to the DSI data structure. In practice, an entry of the correlation table T(x,y,Z_{c}) is calculated as the sum of the correlation measures over all available views, specified by
 T(x, y, Z_{c}) = Σ_{k=1, k≠kr}^{N} c( I_{kr}(x, y), I_{k}(x_{k}, y_{k}) ),    (3.17) 

where c(·, ·) denotes the correlation measure between two color samples, e.g., the sum of absolute color differences, and N the number of views.
Using a WTA optimization strategy, the depth value that yields the most consistent (or similar) color across the views is selected as the final estimated depth value. The pseudocode of the algorithm using a WTA optimization is summarized in Algorithm 2.
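As an illustration, the two-step procedure and the WTA selection can be sketched for a single reference pixel as follows. This is a minimal numerical sketch, not the thesis implementation: the function names, the toy pinhole cameras (with a unit bottom row in K, so that the back-projection scale equals the camera depth), and the absolute-difference correlation measure are all illustrative assumptions.

```python
import numpy as np

def backproject(p, K, R, C, z_c):
    """Back-project pixel p = (x, y, 1)^T of the reference view to the 3D
    world point whose camera depth equals the candidate z_c (first step)."""
    ray = R.T @ np.linalg.inv(K) @ p      # viewing ray in world coordinates
    return C + z_c * ray                  # bottom row (0, 0, 1) of K => scale = depth

def project(P, K, R, C):
    """Project world point P into view (K, R, C) (second step)."""
    q = K @ (R @ (P - C))
    return q[:2] / q[2]

def wta_depth(p_ref, cams, sample_color, z_candidates):
    """Winner-takes-all depth for one reference pixel. cams[0] is the
    reference camera; sample_color(k, x, y) stands in for a (bilinear)
    color lookup in view k. Returns the candidate Z_c that minimizes the
    summed absolute color difference, i.e. the minimum over one column
    of the correlation table T(x, y, Z_c)."""
    K_r, R_r, C_r = cams[0]
    ref_color = sample_color(0, p_ref[0], p_ref[1])
    best_z, best_cost = None, np.inf
    for z_c in z_candidates:
        P = backproject(p_ref, K_r, R_r, C_r, z_c)
        cost = 0.0
        for k, (K_k, R_k, C_k) in enumerate(cams[1:], start=1):
            x_k, y_k = project(P, K_k, R_k, C_k)
            cost += abs(sample_color(k, x_k, y_k) - ref_color)
        if cost < best_cost:
            best_cost, best_z = cost, z_c
    return best_z
```

In a complete implementation, sample_color would perform a bilinear lookup in the neighboring images, and the per-candidate costs would be stored in the table T(x, y, Z_c) for the subsequent optimization rather than minimized on the fly.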
In the previous section, we have seen that the WTA optimization yields inaccurate depth estimates and noisy depth images. Therefore, we have experimented earlier with a dynamic programming optimization strategy to enforce smooth variations of the depth, which has led to streaking artifacts. In the next section, we impose two additional constraints to obtain more accurate depth estimates.
3.4.2 Two-pass optimization strategy
To enforce smooth depth variations within the image and between the images, we present two novel constraints: (1) smooth variations of depth values across image scanlines and (2) consistent depth values across the views. To combine these constraints with the dynamic programming algorithm, we employ two penalty costs, namely an inter-line penalty cost and an inter-view penalty cost, which both extend the cost function E(D) previously defined by Equation (3.11). The proposed algorithm estimates a depth image by minimizing two different objective cost functions in two passes. In a first pass, a depth image is estimated for each view with the constraint that consecutive image scanlines should show smooth depth variations. The purpose of this pass is to generate an initialization depth image for each view without streaking artifacts. We denote the objective function for one view at the first pass as
 E_{pass 1}(D) = E(D) + E_{smooth 1}(D),    (3.18) 
where E_{smooth 1}(D) is defined in the following paragraphs. This computation is repeated for each view. In a second pass, a depth image is re-estimated for each view with the constraint that an object should have consistent world depth values in all depth images estimated during the first pass. The aim of this second pass is to refine the initialization depth images from the first pass such that the final depth images are consistent across the views. We denote the objective function at the second pass as
 E_{pass 2}(D) = E(D) + E_{smooth 1}(D) + E_{smooth 2}(D),    (3.19) 
where E_{smooth 2}(D) is also defined in the sequel.
Pass 1. Consistent depth values across scanlines.
To enforce smooth variations of depth values across scanlines, we introduce an inter-line penalty cost. Practically, the objective cost function integrating the inter-line penalty cost can be written as
 E_{pass 1}(D) = E(D) + E_{smooth 1}(D),  with  E_{smooth 1}(D) = κ_{l} Σ_{x} | Z_{kr}(x, y) − Z_{kr}(x, y − 1) |,    (3.20) 
where D is the vector of depth values Z_{kr}(x,y) along a scanline, Z_{kr}(x,y) is the estimated depth at position (x,y) in the reference image with index k_{r}, and κ_{l} is the inter-line penalty parameter (typically a scalar). The index of the reference view k_{r} varies between 1 ≤ k_{r} ≤ N, for an N-view data set.
Pass 2. Consistent depth values across the views.
In a second pass, the depth is refined for each view by incorporating into the objective cost function a penalty cost E_{smooth 2}(D) for inconsistent depth values across the views (see Figure 3.14). Similar to the first constraint, this inter-view penalty cost, which complements the inter-line penalty cost in the objective function, can be written as

 E_{smooth 2}(D) = κ_{v} Σ_{x} Σ_{k=1, k≠kr}^{N} | Z_{kr}(x, y) − Z_{k}(x_{k}, y_{k}) |,    (3.21) 

where κ_{v} denotes the inter-view penalty parameter and Z_{k}(x_{k},y_{k}) the depth estimated during the first pass for view k at the pixel position (x_{k},y_{k}) corresponding to (x,y).
Admissible path in the DSI.
To optimize the previous objective functions E_{pass 1}(D) and E_{pass 2}(D), the dynamic programming algorithm is employed. In Section 3.3.3, we have seen that the optimal path in the DSI is derived from the ordering constraint. However, the ordering constraint cannot be employed in the extended case of multiple-view depth estimation, because no left or right camera position is assumed. We therefore employ admissible edges that enable an arbitrary increase or decrease of depth values. The admissible edges in the DSI are portrayed by Figure 3.15.
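The per-scanline optimization of the first pass can be sketched as a Viterbi-style dynamic program over the depth layers, where the admissible edges permit an arbitrary jump between layers at each pixel. This is a simplified sketch under stated assumptions: the penalties are applied on quantization indices rather than metric depth values, prev_line holds the depth indices selected on the previous scanline, and the parameter names are illustrative.

```python
import numpy as np

def dp_scanline(corr, prev_line, kappa_s=1.0, kappa_l=1.0):
    """Minimize, along one scanline, a data term corr[x, i] = T(x, y, Z_i)
    plus an intra-line smoothness penalty kappa_s * |i - j| between
    neighboring pixels and an inter-line penalty kappa_l * |i - prev_line[x]|
    against the previous scanline (pass-1 constraint). For pass 2, an
    inter-view penalty against the first-pass depth maps of the other views
    could be added to `unary` in the same way."""
    w, n_d = corr.shape
    layers = np.arange(n_d)
    # Unary cost: data term plus inter-line penalty against the previous line.
    unary = corr + kappa_l * np.abs(layers[None, :] - prev_line[:, None])
    # Transition cost between any two layers: admissible edges allow any jump.
    trans = kappa_s * np.abs(layers[:, None] - layers[None, :])
    cost = np.zeros((w, n_d))
    back = np.zeros((w, n_d), dtype=int)
    cost[0] = unary[0]
    for x in range(1, w):
        total = cost[x - 1][:, None] + trans   # total[i, j]: come from layer i
        back[x] = np.argmin(total, axis=0)
        cost[x] = unary[x] + np.min(total, axis=0)
    # Backtrack the minimum-cost admissible path through the DSI.
    path = np.empty(w, dtype=int)
    path[-1] = int(np.argmin(cost[-1]))
    for x in range(w - 1, 0, -1):
        path[x - 1] = back[x, path[x]]
    return path
```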
3.4.3 Depth sampling of the 3D space
For 3D sampling, it should be noted that there exists a dependency between the depth of an object and its corresponding resolution in the image. For example, far and near objects appear small and large in the image, respectively. A near object can therefore be observed at a higher resolution than a far object, so that the depth resolution of a near object should be higher than that of a far object. In this context, we employ the plenoptic sampling framework [12], whose idea is to apply a non-linear quantization of the depth range between a minimum and a maximum depth, Z_{min} and Z_{max} (see Figure 3.16).
More specifically, the depth Z_{i} for a quantization index i can be written as
 Z_{i} = 1 / ( 1/Z_{max} + i · Δ ),    (3.23) 
with
 Δ = ( 1/Z_{min} − 1/Z_{max} ) / ( N_{d} − 1 ),    (3.24) 
where N_{d} corresponds to the maximum number of depth values/layers. Alternatively, the quantization index i of a depth value Z_{i} can be written as
 i = ( 1/Z_{i} − 1/Z_{max} ) / Δ.    (3.25) 
The depth of a pixel can therefore be efficiently represented and stored using the corresponding quantization index i. When 256 possible depth values are employed, the depth quantization index can be represented using an 8-bit pixel format and stored as a standard 8-bits-per-pixel image.
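The scheme amounts to a uniform quantization of the inverse depth 1/Z, which can be sketched as follows; the function names and the convention that index 0 maps to Z_max (and index N_d − 1 to Z_min) are illustrative assumptions.

```python
def index_to_depth(i, z_min, z_max, n_d=256):
    # Linear quantization of inverse depth: near objects (small Z) receive
    # a finer depth resolution than far ones.
    delta = (1.0 / z_min - 1.0 / z_max) / (n_d - 1)
    return 1.0 / (1.0 / z_max + i * delta)

def depth_to_index(z, z_min, z_max, n_d=256):
    # Inverse mapping, rounded to the nearest of the n_d admissible layers.
    delta = (1.0 / z_min - 1.0 / z_max) / (n_d - 1)
    return round((1.0 / z - 1.0 / z_max) / delta)
```

With n_d = 256, the returned index fits an 8-bit pixel, so a depth map can be stored as an ordinary grayscale image.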
3.4.4 Depth image postprocessing
Having calculated a depth image for each view, an optional task is to apply a noise-reduction algorithm as a last postprocessing step. Typically, this noise-reduction stage noticeably improves the quality of depth images at a limited computational cost. For example, based on the assumption that depth values vary smoothly over object surfaces, a simple noise-reduction algorithm consists of filtering noisy pixels that have low spatial correlation with neighboring pixels. However, the assumption of smoothly varying depth values is invalid at object borders (edges). Therefore, the challenge is to filter noisy pixels (outliers) while preserving the accuracy of depth pixels along object borders. To implement this concept, we propose in this section a depth-image denoising algorithm that actively supports smooth regions in the depth image while preserving sharp depth discontinuities along the object borders.
The proposed noise-reduction algorithm proceeds in two stages. At the first stage, a color segmentation is applied to the texture image that corresponds to the considered depth image. The intuitive idea underlying this first stage is that the resulting segments, which delineate the object contours, correspond to portions of the object surfaces. Because object surfaces appear as smooth regions in the depth image, the corresponding depth-image segments can be approximated by a piecewise-linear function [97]. Figure 3.17 shows an example of a color-segmented image obtained using the watershed segmentation algorithm [84].
At the second stage, a piecewise-linear function that models the object surface is estimated for each image segment. To calculate the linear-function parameters of each segment, we apply a robust estimation method, i.e., the RANdom SAmple Consensus (RANSAC) algorithm [30]. The advantage of this algorithm is that it can estimate parameters even if the data is disturbed by a large number of outliers, i.e., noisy depth pixels. Because we assume that the data fits a piecewise-linear function, three pixels are sufficient to calculate a plane and thus the corresponding parameters. The RANSAC algorithm first randomly selects three pixels, which are used to compute candidate parameters of the linear (planar) function. The algorithm then counts the number of pixels that fit this model (inliers). This procedure is repeated several times and, finally, the parameters with the largest support are selected as the parameters for the considered image segment (see a one-dimensional example in Figure 3.18). The robust RANSAC estimation of piecewise-linear function parameters is then performed for all image segments. The advantages of the depth-image postprocessing step are twofold. First, because the depth is accurately estimated, especially at object borders, a high-quality image can be rendered (this property of image rendering will be discussed in Chapter 4). Second, the resulting depth image presents smoothly varying properties and thus can be encoded efficiently. Note that this property will be exploited in Chapter 6 for the coding of depth images.
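The per-segment RANSAC plane fit can be sketched as follows, assuming a plane parameterization z = a·x + b·y + c over the pixel coordinates of one segment; the tolerance, iteration count, and the final least-squares refit on the inliers are illustrative choices rather than the parameters used in the thesis.

```python
import numpy as np

def ransac_plane(xs, ys, zs, iters=200, tol=0.1, rng=None):
    """Fit z = a*x + b*y + c robustly: repeatedly solve a plane through
    three random samples and keep the model with the most inliers."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(xs)
    A_full = np.column_stack([xs, ys, np.ones(n)])
    best_params, best_count = None, -1
    for _ in range(iters):
        idx = rng.choice(n, size=3, replace=False)
        try:
            params = np.linalg.solve(A_full[idx], zs[idx])  # plane through 3 samples
        except np.linalg.LinAlgError:
            continue                                        # degenerate (collinear) sample
        count = int(np.sum(np.abs(A_full @ params - zs) < tol))
        if count > best_count:
            best_count, best_params = count, params
    if best_params is None:
        raise ValueError("no valid sample found")
    # Least-squares polish on the inliers of the best model (optional refinement).
    mask = np.abs(A_full @ best_params - zs) < tol
    best_params, *_ = np.linalg.lstsq(A_full[mask], zs[mask], rcond=None)
    return best_params
```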
Figure 3.18 Two samples are randomly selected to compute a candidate model indicated by the solid line. Input data close to the line candidate is considered as inliers (black dots between dotted lines). Fig. 3.18(a) and Fig. 3.18(b) show two candidate models with a small and a high number of inliers, respectively. The model with the largest number of inliers is selected.

3.4.5 Experimental results
In this section, we present experimental results evaluating the accuracy of the depth-estimation algorithm from the previous section. We especially focus on the challenging problem of estimating depth images from textureless and colorless multiview images. For this reason, we employ the multiview sequences “Breakdancers”, “Ballet” and “Livingroom”, which feature these properties. Note that because the calibration parameters are necessary for multiview depth estimation, depth images for “Teddy” and “Cones” are not estimated (no parameters available). For each presented depth image, the parameters are empirically determined and not specifically adapted to each data set. We first provide an objective evaluation and, second, proceed by subjectively assessing the depth-image quality. Finally, we compare the quality of the calculated depth images with two alternative methods based on a two-dimensional optimization strategy.
A. Objective evaluation
For the objective evaluation of the depth-image quality, we employ an image-rendering method that synthesizes virtual views using the calculated depth image. This technique is based on the notion that the quality of the rendered image depends on the accuracy of the estimated depth image. Specifically, the more accurate a depth image is, the higher the rendering quality becomes. In our experiments, the synthesis of virtual images is carried out using the relief-texture rendering algorithm, which will be presented in Chapter 4. In the presented experiments, we use a postprocessing step as defined in Section 3.4.4. To evaluate the quality of rendered images, a synthetic image is rendered at the same position and orientation as a selected reference camera^{4}. By comparing the synthetic image I_{s} with the reference image I_{r}, the Peak Signal-to-Noise Ratio (PSNR_{rs}) can be calculated by
 PSNR_{rs} = 10 · log_{10}( 255^2 / MSE ),    (3.26) 
where the Mean Squared Error (MSE) is computed by the following equation:
 MSE = ( 1 / (w · h) ) Σ_{x=1}^{w} Σ_{y=1}^{h} ( I_{r}(x, y) − I_{s}(x, y) )^2,    (3.27) 
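A minimal sketch of this quality measure, assuming 8-bit images with a peak value of 255:

```python
import numpy as np

def psnr_rs(i_r, i_s, peak=255.0):
    """PSNR between a reference image I_r and a synthesized image I_s:
    mean squared error over all w*h pixels, then the peak-signal ratio
    expressed in decibels."""
    diff = i_r.astype(np.float64) - i_s.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:              # identical images
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```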
with w and h corresponding to the width and the height of the images, respectively. Note that the PSNR_{rs} metric has shown some limitations for evaluating image quality, because it does not always match human subjective perception. However, it is the only simple image-quality metric that has been widely accepted and used by the research community (it is also adopted in key literature [50] on rendering). Besides computing this metric, we will also comment on the subjective depth-image quality. The differences between the objective (PSNR_{rs}) and subjective rendering quality will be illustrated by Figure 6.11 in Chapter 6.
The discussed results are compared in a relative sense to the conventional algorithm addressed earlier in this chapter. Until now, results on multiview depth estimation in the literature have typically been presented through subjective discussion. To avoid the ambiguity of such a discussion, we have presented here a well-defined measurement for estimating the quality of depth images. This indirect method, based on rendering-quality evaluation, gives a less precise evaluation of the depth quality than a direct method based on a comparison with, e.g., a ground-truth depth image. However, calibrated ground-truth multiview depth images are presently not available to the research community. Additionally, it may be argued that the image-rendering quality is more important than the quality of the depth images, because depth images are not visualized by the user (as opposed to the rendered images).
Table 3.1 Image-rendering quality resulting from two depth-estimation algorithms: (1) the two-view depth estimation and (2) the multiple-view depth estimation. Both methods are based on dynamic programming optimization. 
Let us now discuss the rendering-quality measurements summarized in Table 3.1. First, it can be seen that the proposed algorithm, which uses multiple views to estimate depth images, consistently outperforms the two-view depth-estimation algorithm. For example, an image-quality improvement of 2.9 dB, 3.2 dB and 0.3 dB can be observed for the “Breakdancers”, “Ballet” and “Livingroom” sequences, respectively. The lower improvement for the “Livingroom” sequence is mainly due to an inaccurate texture rendering of the scene, and especially an inaccurate rendering of the background tree. Considering the applications of 3DTV and free-viewpoint video, the contributed two-pass algorithm clearly enables high-quality rendering, owing to the reconstruction of a high-quality depth signal.
B. Subjective evaluation
Figures 3.19, 3.20 and 3.21 enable a discussion on the subjective quality of the depth images estimated using the proposed two-pass algorithm. We first assess the quality of the depth images prior to the postprocessing step. First, it can be observed that the two-pass algorithm leads to more consistent depth values across the lines, so that the annoying streaking artifacts are reduced, leading to a significant perceptual improvement. For example, observing Figure 3.19(b), it can be noted that the two-pass algorithm leaves few streaking artifacts in the image. Similar conclusions can be drawn by observing Figure 3.20(b) and Figure 3.21(b), albeit with some remaining fuzziness around the objects. Second, it can be noted that the depth of colorless and textureless regions can be accurately estimated. For example, the depth values of the textureless walls in Figure 3.19(b) and Figure 3.21(b) are correctly estimated and do not suffer from streaking artifacts. Additionally, the depth of very thin objects, such as the flower stalk, can be correctly estimated (see Figure 3.19(b)). Such an accurate result is obtained by using multiple views simultaneously and by enforcing consistent depth values across the views. However, several estimation errors occur at object boundaries, which correspond to depth discontinuities (e.g., borders of the dancer in Figure 3.20(b)). Let us now assess the quality of the depth images after the postprocessing step. As expected, it can be seen in Figures 3.19(c), 3.20(c) and 3.21(c) that the postprocessed depth images depict piecewise-linear regions delineated by sharp boundaries. For example, Figure 3.19(c) shows that the depth along the border of the table is accurately delineated. Similar conclusions can be drawn by comparing Figure 3.20(b) with Figure 3.20(c) and Figure 3.21(b) with Figure 3.21(c). 
Thus, the introduced postprocessing step clearly removes noisy pixels while preserving the edges along objects in the depth image.
C. Subjective comparisons with alternative methods
Next, we compare the quality of the calculated depth images with two alternative methods based on a two-dimensional optimization strategy (see Section 3.3.3). Note that only a few alternative contributions exist that employ the complex “Breakdancers” and “Ballet” multiview sequences. We have found only two proposals that employ either both sequences [109], or one sequence only (“Breakdancers”) [16].
The first alternative proposal is based on a modified plane-sweeping algorithm [16]. By subjectively comparing Figure 3.20(c) and Figure 3.22(a), it can be seen that the method proposed in this thesis results in a slightly more accurate depth image^{5}. For example, the competing method generates some depth artifacts in the occluded region along the arm of the breakdancer. Additionally, the proposed method enables an accurate depth estimation along the delineation between the background wall and the rightmost person (see Figure 3.20(c)). In contrast, the competing method does not recover the discussed depth discontinuity (along the delineation of the person).
The second alternative proposal [109] performs an over-segmentation of the texture image and calculates the depth of each segment using a belief-propagation algorithm. For the “Breakdancers” sequence, it can be seen that both methods lead to a similar depth-image quality. For example, the depth discontinuity between the breakdancer and the background wall is accurately estimated in both cases (see Figure 3.20(c) and Figure 3.22(b)). However, for the “Ballet” sequence, it can be noticed that the proposed method yields a less accurate depth image. For example, the depth discontinuity along the foreground person is better calculated by the alternative proposal (see Figure 3.22(c)).
Figure 3.19 (a) View 3 of the “Livingroom” sequence. Estimated depth images (b) prior to postprocessing and (c) after the postprocessing step.

Figure 3.20 (a) View 3 of the “Breakdancers” sequence. Estimated depth images (b) prior to postprocessing and (c) after the postprocessing step.

Figure 3.21 (a) View 3 of the “Ballet” sequence. Estimated depth images (b) prior to postprocessing and (c) after the postprocessing step.

Figure 3.22 Depth images calculated using the algorithms based on (a) plane-sweeping [16] and (b), (c) over-segmentation [109].

^{4}This selected reference view is left out from the data set for rendering.
^{5}The discussed alternative method [16] presents the depth image of a different viewpoint and time. However, the properties of the sequence do not vary over time and across the views, so that a subjective comparison is still possible.