3.3 Previous work on depth estimation
The problem of depth estimation has been intensively investigated in the computer-vision research community [85]. The author of [85] distinguishes three essential components that constitute most depth-estimation algorithms: (1) the matching criterion or cost function, (2) the support of the matching function, and (3) the disparity calculation/optimization strategy.
3.3.1 Matching criterion
The first elementary component is the matching criterion, which measures the correlation or similarity between pixels. Common criteria include, among others, the Sum of Absolute Differences (SAD), the Sum of Squared Differences (SSD), the Cross-Correlation (CC), and the Normalized Cross-Correlation (NCC). However, as highlighted in the literature [20], the selection of the matching cost function does not have a significant impact on the quality of the correspondence matching. In our experiments, we employ the Sum of Absolute Differences (SAD) as the matching cost function. The SAD metric is calculated for the three color channels and the resulting three SAD values are simply added. However, it is known [20] that the three color channels are correlated, so that employing three color channels does not triple the matching accuracy. Thus, using the three color channels (instead of one luminance channel only) does not significantly improve the quality of the estimated depth images.
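As an illustration, the per-pixel, three-channel SAD cost can be sketched as follows (a minimal NumPy sketch; the function name `sad_cost` and its parameters are our own choices, and the 5 × 5 window matches the experiments of Section 3.3.4):

```python
import numpy as np

def sad_cost(left, right, x, y, d, half=2):
    """Sum of Absolute Differences between the window centered at (x, y)
    in the left image and the window shifted by disparity d in the right
    image. The costs of the three color channels are simply added."""
    wl = left[y - half:y + half + 1, x - half:x + half + 1, :]
    wr = right[y - half:y + half + 1, x - d - half:x - d + half + 1, :]
    # Cast to a wider integer type to avoid uint8 overflow in the difference.
    return int(np.abs(wl.astype(np.int32) - wr.astype(np.int32)).sum())
```

Note that the cast to a wider integer type is needed because typical 8-bit image data would otherwise wrap around when subtracted.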
3.3.2 Support of the matching function
The second elementary component of a depth estimation algorithm is the support of the matching cost function. Typically, matching functions can be classified using their respective support area and image-content adaptiveness. These include single-pixel windows [5], square windows, adaptive windows [41] and shiftable windows [32, 42, 98]. Typically, to obtain a reliable matching metric, a large support region should be used. However, whereas using a large window provides a reliable matching support at an object surface with smoothly varying depth, an unreliable support is obtained at object boundaries. To circumvent this problem, an appealing approach [7, 97] decomposes the image into a sufficiently large number of object segments (using for example a color-segmentation technique) and assumes that a single depth value is computed for each segment. Therefore, as the segmentation would follow the shape of objects, an advantage of such a segment-based approach is that the depth is more accurately estimated at the boundaries of objects.
3.3.3 Optimization strategy
The third and most important component of depth estimation is the calculation or optimization strategy. We distinguish three classes of disparity optimization strategies: local, one-dimensional and two-dimensional optimization.
Local optimization calculates the disparity of each pixel independently, using only the matching cost of the selected pixel. Local optimizations typically yield accurate disparity estimates in textured regions. However, large textureless regions tend to produce fuzzy disparity estimates (this case corresponds to the difficulty D1 in Section 3.2.3).
One-dimensional optimization. This second strategy estimates the disparity by minimizing a cost function calculated over one horizontal image scanline [6]. The idea is to exploit the coherence of disparity values along the horizontal scanline, so that smooth spatial variations of disparity estimates are enforced.
Two-dimensional strategy. This third strategy simultaneously optimizes the disparity estimates in the horizontal and vertical direction. To this end, 2D energy minimization algorithms, such as graph-cut [10, 47] or belief propagation [96], can be employed. For example, one proposal [109] segments the texture image into a large number of small segments (over-segmentation) and calculates a single depth for each segment. To this end, the texture segments are employed to construct a Markov Random Field, with the depth values being the states and the texture segments representing the nodes. To calculate and update the depth value of each node, a loopy belief propagation method is employed [96]. A similar method [16] first segments the input images, and then calculates an initial depth image using a modified plane sweeping algorithm [16]. Using this initial depth image, the algorithm iteratively calculates the depth of each segment with a greedy-search algorithm [97]. Whereas two-dimensional optimization strategies usually calculate smooth depth images, one major disadvantage is their computational complexity, which hampers implementation in practical systems.
We now detail two different optimization techniques: a local “Winner-Take-All” optimization and a one-dimensional optimization. This one-dimensional optimization technique will be further developed in the second part of this chapter such that multiple views can be employed simultaneously.
A. Local optimization
A simple local optimization strategy is the “Winner-Take-All” (WTA). The WTA optimization selects for each pixel (x,y) the disparity d that minimizes an objective cost function c(x, x − d), thus
 d(x,y) = arg min_{d} c(x, x − d), (3.6)
where c(x, x − d) corresponds to one of the functions SAD, SSD, CC or NCC. This optimization strategy is called “Winner-Take-All” because the result of the cost function is not shared with neighboring pixels ^{2}. Thus, the minimum cost (the “Winner”) yields (“Take-All”) the final disparity value. As discussed in Section 3.2.3, the optimization of Equation (3.6) yields fuzzy disparity estimates in textureless regions. To detect unreliable matches, a cross-checking procedure can be employed. The cross-checking procedure can be decomposed into two steps. First, the disparity d^{r} with respect to the right image I_{2} (see Figure 3.2) is estimated. Second, the disparity d^{l} with respect to the left image I_{1} is calculated. This can be written as
 d^{r}(x,y) = arg min_{d} c(I_{1}(x,y), I_{2}(x − d, y)),  d^{l}(x,y) = arg min_{d} c(I_{2}(x,y), I_{1}(x + d, y)). (3.7)
The match is then classified as reliable if the following condition holds:
 |d^{r}(x,y) − d^{l}(x − d^{r}(x,y), y)| ≤ 1. (3.8)
Unreliable pixels not satisfying the above condition can afterwards be filled with the disparity value of a neighboring reliable match.
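The WTA selection of Equation (3.6) and the cross-checking test of Equation (3.8) can be sketched as follows (a hypothetical NumPy implementation; `costs` is assumed to be a precomputed per-pixel, per-disparity cost volume, and pixels failing the test are marked with −1):

```python
import numpy as np

def wta_disparity(costs):
    """Winner-Take-All: costs[y, x, d] -> minimum-cost disparity per pixel."""
    return np.argmin(costs, axis=2)

def cross_check(d_left, d_right, max_diff=1):
    """Left-right consistency test: each left-image disparity must map, via
    its own value, to a right-image pixel whose disparity agrees within
    max_diff. Inconsistent pixels are marked -1 (unreliable)."""
    h, w = d_left.shape
    out = d_left.copy()
    for y in range(h):
        for x in range(w):
            xr = x - d_left[y, x]          # corresponding right-image column
            if xr < 0 or abs(int(d_left[y, x]) - int(d_right[y, xr])) > max_diff:
                out[y, x] = -1             # failed the consistency condition
    return out
```

Unreliable pixels (marked −1) could then be filled from a neighboring reliable match, as described above.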
B. One-dimensional optimization
Several contributions have considered the use of a one-dimensional optimization to estimate depth images [8, 17]. In this section, we outline a slightly modified version of the method described in [6]. The idea of the one-dimensional optimization is to consider sequences of pixel disparities and to identify a sequence of disparities along an image scanline that minimizes a global energy function for each scanline. The one-dimensional optimization approach has two important advantages. First, it increases the reliability of the matching procedure by enforcing smooth variations of disparity estimates. As a result, fuzzy disparity estimates in textureless regions are reduced. Second, the algorithm allows unmatched pixels, thereby facilitating an easy handling of occluded pixels.
The optimization strategy consists of calculating the vector of disparities D along a scanline that minimizes the objective energy function E(D) defined as
 E(D) = E_{data}(D) + E_{smooth}(D), (3.9)
where each term is defined as follows. First, the vector D consists of the sequence of w disparity estimates along the scanline (assuming a scanline of w pixels), which is denoted as D = (d_{1},d_{2},...,d_{w}). It should be noted that the disparity of an occluded pixel cannot be calculated and is thus labeled as “undefined” ^{3}, i.e., d_{i} = ϕ. Second, the term E_{data}(D) represents the matching costs of all pixels along the scanline with defined disparities, which is given by
 E_{data}(D) = Σ_{x∈S} c(x, x − d_{x}), (3.10)
where c(x, x − d_{x}) represents the cost of the matching function, e.g., the SAD matching function, and S = {x | d_{x} ≠ ϕ, 1 ≤ x ≤ w}. Third, the term E_{smooth}(D) ensures that (a) disparity estimates vary smoothly in textureless regions and that (b) fast disparity transitions are allowed at object borders, i.e., at pixel intensity variations. The term E_{smooth}(D) can be written as
 E_{smooth}(D) = E_{smooth_1}(D) − κ_{r}|S|, (3.11)
with κ_{r} > 0 being a constant parameter corresponding to a matching reward bonus, which leads to a negative cost contribution for each matched pixel. The function E_{smooth_1}(D) is defined as
 E_{smooth_1}(D) = Σ_{x∈S_{smooth}} s(d_{x}, d_{x_next}), with
 s(d_{x}, d_{x_next}) = 0, if d_{x} = d_{x_next},
 s(d_{x}, d_{x_next}) = κ_{d1} − B_{1}(x) − B_{2}(x), if |d_{x} − d_{x_next}| = 1,
 s(d_{x}, d_{x_next}) = κ_{d2} − B_{1}(x) − B_{2}(x), if |d_{x} − d_{x_next}| > 1, (3.12)
where κ_{d1}, κ_{d2} are control parameters used to penalize disparity variations along the scanline. The computation runs over the set S_{smooth} = {x | d_{x} ≠ ϕ, d_{x_next} ≠ ϕ, 1 ≤ x ≤ w}, where x_{next} denotes the pixel position with defined disparity immediately following the current pixel position x (x ∈ S). The terms B_{1}(x), B_{2}(x) are discontinuity bonuses that decrease the cost at an intensity discontinuity within the scanlines of the left and right images, respectively.
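Under the definitions above, evaluating E(D) for a given disparity vector can be sketched as follows (a hypothetical Python sketch; the parameter values and the `cost` table are illustrative, and occluded pixels are represented by `None` in place of ϕ):

```python
UNDEF = None  # stands in for phi, the disparity of an occluded pixel

def energy(D, cost, k_r=20.0, k_d1=4.0, k_d2=8.0, B1=None, B2=None):
    """Evaluate E(D) = E_data(D) + E_smooth(D) for one scanline.
    cost[x][d] holds the matching cost c(x, x - d); B1/B2 are optional
    discontinuity-bonus functions (defaulting to zero everywhere)."""
    B1 = B1 or (lambda x: 0.0)
    B2 = B2 or (lambda x: 0.0)
    S = [x for x, d in enumerate(D) if d is not UNDEF]   # matched pixels
    e_data = sum(cost[x][D[x]] for x in S)
    e_smooth = -k_r * len(S)                             # matching reward
    for i in range(len(S) - 1):
        x, x_next = S[i], S[i + 1]
        delta = abs(D[x] - D[x_next])
        if delta == 1:
            e_smooth += k_d1 - B1(x) - B2(x)
        elif delta > 1:
            e_smooth += k_d2 - B1(x) - B2(x)
    return e_data + e_smooth
```

Note how a disparity vector containing `None` entries simply contributes no data cost and no reward for those pixels, which is how unmatched (occluded) pixels are accommodated.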
Example.
We show how to calculate the global objective function defined by Equation (3.9), assuming that a vector D was provided. In the illustration from Figure 3.5, we consider a (short) scanline of 8 pixels with their associated matching costs. The costs E_{data}(D) and E_{smooth}(D) are computed by
 (3.13) 
assuming for simplicity that B_{1}(x) = B_{2}(x) = 0 for 1 ≤x ≤w.

Disparity space image.
To estimate the vector of disparities D and minimize the energy function defined by Equation (3.9), dynamic programming can be employed. The implementation is based on constructing a correlation table, called the Disparity Space Image (DSI) [8]. The DSI structure is a 2D table of d_{max} rows and w columns that contains the matching costs of all pixels (along a scanline) at all possible disparity values. Additionally, the smoothness term is integrated into the DSI as a transition cost between the nodes of the DSI. Considering the DSI structure as a graph, matching costs and transition costs correspond to nodes and edges, respectively.

Using dynamic programming, the calculation of the optimal disparity sequence that minimizes Equation (3.9) can then be performed as follows. First, all entries of the DSI structure are initialized using the matching function, i.e., DSI(x,d) = c(x, x − d). Second, a 2D table of cumulated costs T(x,d) is generated, using the following relation
 T(x,d) = DSI(x,d) + min_{x_{a}∈S_{xa}, d_{a}∈S_{da}} ( T(x − x_{a}, d − d_{a}) + c_{t}(d_{a}) ), (3.14)
(with c_{t}(d_{a}) the transition cost of a disparity change d_{a}, derived from the smoothness term)
where S_{xa} and S_{da} define the “admissible” or “legal” disparity transitions between two consecutive pixels; both S_{xa} and S_{da} are sets of integers. In the graph interpretation, this corresponds to admissible transitions between nodes. For example, Figure 3.6 illustrates a DSI with the admissible transitions S_{xa} = {1} and S_{da} = {−1, 0, 1}. Using the table of cumulated costs, the minimum-cost sequence can finally be traced back. The algorithm that finds the optimal sequence D is summarized in Algorithm 1.
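The recursion of Equation (3.14) with the admissible transitions S_{xa} = {1} and S_{da} = {−1, 0, 1} can be sketched as follows (a simplified sketch: a single constant `trans_cost` stands in for the full smoothness term, and occlusion handling is omitted):

```python
import numpy as np

def dp_scanline(dsi, trans_cost=4.0):
    """Minimum-cost path through a Disparity Space Image (one scanline).
    dsi[d, x] holds the matching cost; admissible transitions are
    S_xa = {1} and S_da = {-1, 0, 1}, each disparity change costing
    trans_cost."""
    d_max, w = dsi.shape
    T = np.full((d_max, w), np.inf)      # cumulated costs
    back = np.zeros((d_max, w), dtype=int)
    T[:, 0] = dsi[:, 0]
    for x in range(1, w):
        for d in range(d_max):
            for da in (-1, 0, 1):        # admissible disparity changes S_da
                dp = d - da
                if 0 <= dp < d_max:
                    c = T[dp, x - 1] + (trans_cost if da != 0 else 0.0)
                    if c + dsi[d, x] < T[d, x]:
                        T[d, x] = c + dsi[d, x]
                        back[d, x] = dp
    # trace the minimum-cost sequence back from the last column
    D = [int(np.argmin(T[:, w - 1]))]
    for x in range(w - 1, 0, -1):
        D.append(int(back[D[-1], x]))
    return D[::-1]
```

The nested loops make the complexity visible: O(w · d_max · |S_da|) per scanline, which explains why a one-dimensional optimization remains far cheaper than the two-dimensional strategies discussed above.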
Constraint path.
We now define an admissible path in the DSI that can handle (a) fast disparity variations and (b) occluded pixels. To constrain the admissible edges in the DSI, a solution is to employ an ordering constraint. Such an ordering constraint states that if an object appears to the left of another object in the left image, it also appears to the left of that object in the right image (as defined by Bobick et al. [8]). From this observation, it can be inferred that left-occluded pixels correspond to a disparity increase in the graph, while right-occluded pixels correspond to a disparity decrease (see Figure 3.7(a)). Additionally, it should be noted that the DSI edges should allow unmatched pixels to handle occluded pixels. For example, let us consider Figure 3.7(b), which depicts the DSI table at the top, and the corresponding left and right image scanlines and disparity sequence D at the bottom.
In Figure 3.7(b), the sequence D consists of the disparities d_{1} and d_{2}, corresponding to the disparity of the background and foreground objects, respectively. To handle left-occluded pixels, it can be seen that the disparity transition between d_{1} and d_{2} should allow unmatched pixels. Therefore, left-occluded pixels, giving an increase of disparity, can be handled by skipping pixels in the left image scanline. This is in accordance with the diagonal gap in the DSI. Conversely, right-occluded pixels, i.e., a decrease of disparity, can be handled by skipping right-occluded pixels, corresponding to a vertical gap in the DSI. Using such admissible edges in the DSI, the algorithm can handle large occluded regions by allowing occluded pixels to remain unmatched. Therefore, the appropriate admissible edges correspond to the path illustrated by Figure 3.8. Transition costs are indicated as defined by Equation (3.11).

3.3.4 Experimental results
This section presents experimental results evaluating the accuracy of the reviewed disparity-estimation algorithms. It should be noted that the disparity-optimization strategy has a significant impact on the quality of the disparity images. Therefore, we have experimented with several different optimization strategies. More specifically, the presented experiments are performed using three different optimization strategies, with the following characteristics:
- an optimization strategy based on the WTA algorithm,
- a WTA algorithm extended by a cross-checking procedure (see Equation (3.7)), and
- an optimization strategy based on dynamic programming, as described in Section 3.3.3.
All optimization strategies use a fixed matching function (Sum of Absolute Differences) and a fixed support of the matching function (5 × 5 centered square window). Each presented disparity image is estimated using fixed parameters. The functions B_{1}(x), B_{2}(x) define a discontinuity bonus B_{1}(x) = B_{2}(x) = 75 if two consecutive pixels show an intensity variation |I(x,y) − I(x + 1,y)| > 10. We employed the following set of multiview images: “Teddy”, “Cones”, “Breakdancers”, “Ballet” and “Livingroom” (see Appendix 9.2 for a description of the multiview images data set). For the multiview images “Breakdancers”, “Ballet” and “Livingroom”, the two employed views were first rectified using the method described in Section 2.4.2. Let us now discuss the resulting estimated disparity images for the three different optimization strategies.
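The discontinuity-bonus rule used in these experiments can be sketched as follows (a minimal Python sketch; the function name and argument layout are our own):

```python
def discontinuity_bonus(scanline, x, threshold=10, bonus=75):
    """Bonus used in the experiments: grant a bonus of 75 when two
    consecutive pixels differ by more than 10 intensity levels,
    i.e., at a likely object border along the scanline."""
    if abs(int(scanline[x]) - int(scanline[x + 1])) > threshold:
        return bonus
    return 0
```

The bonus lowers the smoothness penalty of Equation (3.11) exactly where the image intensity suggests an object border, so that fast disparity transitions become cheap at those positions.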
WTA.
First, it can be observed that the “Winner-Take-All” strategy provides an accurate disparity estimation in textured regions. For example, the disparity is correctly estimated in the center region of Figure 3.9(b) and Figure 3.9(f). However, the disadvantages of the WTA strategy are as follows. Looking at the top-right part of Figure 3.9(f), it can be observed that the disparity in textureless regions cannot be accurately estimated. For the sequences “Ballet”, “Breakdancers” and “Livingroom”, which show large textureless regions, the estimated disparity images can be considered as highly inaccurate. These inaccurate depth images are therefore not included in this thesis, as this algorithmic proposal does not give any satisfactory results. A second disadvantage is that the disparity cannot be estimated accurately along object borders, where occluded pixels are typically located. For example, the disparity along the left edges of the cones in Figure 3.9(b) cannot be estimated.
WTA extended by crosschecking.
To avoid unreliable disparity estimates, WTA was extended with cross-checking. Considering Figure 3.9(c) and Figure 3.9(g), it can be seen that the cross-checking procedure detects unreliable disparity estimates (shown as blue/black). However, one disadvantage of the procedure is that too many estimates are classified as unreliable. Similar to the WTA experiment without cross-checking, the disparity of the multiview images “Breakdancers”, “Ballet” and “Livingroom” cannot be estimated; these results are therefore also not included in this thesis. In the next section, we propose an algorithm that estimates more accurate depth images, and the corresponding calculated depth images will be presented there.
One-dimensional optimization.
The one-dimensional dynamic programming optimization provides more reliable disparity estimates than WTA. For example, when observing Figure 3.9(d) and Figure 3.9(h), it can be seen that the disparity images correspond to the expected surfaces of objects in the scene. Furthermore, occluded pixels along the object borders are also detected and correctly handled (see for example the left side of the teddy bear). Additionally, the disparity for the multiview images “Ballet”, “Breakdancers” and “Livingroom” can be more accurately estimated. For example, observing the disparity of Figure 3.10, it can be seen that the disparities of the checkerboard, the lamp and the tree were correctly estimated. Similar observations can be made by analyzing Figure 3.11 and Figure 3.12. However, all disparity images suffer from scanline streaking artifacts. Such artifacts result from optimizing the disparities of each scanline independently.
Figure 3.10 (a) View 3 of the “Livingroom” multiview images data set. (b) Estimated disparity image using the dynamic programming algorithm.

Figure 3.11 (a) View 3 of the “Ballet” multiview images data set. (b) Estimated disparity image using the dynamic programming algorithm.

Figure 3.12 (a) View 3 of the “Breakdancers” multiview images data set. (b) Estimated disparity image using the dynamic programming algorithm.

3.3.5 Intermediate conclusions
The previous experiments lead to a few intermediate conclusions. First, it can be noted that the quality of disparity images significantly depends on the characteristics of the input multiview images. For example, the “Teddy” and “Cones” multiview images show large colorful and textured regions. As a result, accurate disparity images can be estimated. In contrast, the colorless and textureless characteristics of the “Breakdancers” and “Ballet” multiview images do not allow an accurate estimation of disparity. To obtain more accurate disparity images for this type of sequences, a one-dimensional optimization strategy was presented. However, while this strategy estimates more accurate disparity values in some regions, significant streaking artifacts are introduced at other locations. To increase the accuracy of the estimation, we explore an increasingly popular approach that employs multiple views of the scene. This approach will be addressed in the next section.