3.3  Previous work on depth estimation


The problem of depth estimation has been intensively investigated in the computer-vision research community [85]. The author distinguishes three essential components, constituting most of the depth-estimation algorithms: (1) the matching criterion or cost function, (2) the support of the matching function and (3) the disparity calculation/optimization strategy.

3.3.1  Matching criterion


The first elementary component corresponds to the matching criterion that measures correlation or similarity between pixels. These include, among others, the Sum of Absolute Differences (SAD), the Sum of Squared Differences (SSD), the Cross Correlation (CC), and the Normalized Cross-Correlation (NCC). However, as highlighted in the literature [20], the selection of the matching cost function does not have a significant impact on the quality of the correspondence matching. We employ in our experiments the Sum of Absolute Differences (SAD) as a matching cost function. This SAD metric is calculated for the three color channels and the resulting three SAD values are simply added. However, it is known [20] that the three color channels are correlated, so that employing three color channels does not triple the matching accuracy. Thus, using the three color channels (instead of one luminance channel only) does not significantly improve the quality of the estimated depth images.

3.3.2  Support of the function


The second elementary component of a depth estimation algorithm is the support of the matching cost function. Typically, matching functions can be classified using their respective support area and image-content adaptiveness. These include single-pixel windows [5], square windows, adaptive windows [41] and shiftable windows [324298]. Typically, to obtain a reliable matching metric, a large region support should be used. However, whereas using a large window provides a reliable matching support at an object surface with smoothly varying depth, an unreliable support is obtained at object boundaries. To circumvent this problem, an appealing approach [797] decomposes the image into a sufficiently large number of object segments (using for example a color-segmentation technique) and assumes that a single depth value is computed for each segment. Therefore, as the segmentation would follow the shape of objects, an advantage of such a segment-based approach is that the depth is more accurately estimated at the boundaries of objects.

3.3.3  Optimization strategy


The third and most important component of depth estimation is the calculation or optimization strategy. We distinguish three classes of disparity optimization strategies: local, one-dimensional and two-dimensional optimization.

Local optimization calculates independently the disparity of each pixel using the single matching cost of the selected pixel. Local optimizations typically yield accurate disparity estimates in textured regions. However, large texture-less regions tend to produce fuzzy disparity estimates (this case corresponds to the difficulty D1 in Section 3.2.3).

One-dimensional optimization. This second strategy estimates the disparity by minimizing a cost function calculated over one horizontal image scanline [6]. The idea followed is to exploit the coherence of disparity values along the horizontal scanline, so that smooth spatial variations of disparity estimates are enforced.

Two-dimensional strategy. This third strategy simultaneously optimizes the disparity estimates in the horizontal and vertical direction. To this end, 2D energy minimization algorithms, such as graph-cut [1047] or belief propagation [96], can be employed. For example, one proposal [109] segments the texture image into a large number of small segments (over-segmentation) and calculates a single depth for each segment. To this end, the texture segments are employed to construct a Markov Random Field, with the depth values being the states and the texture segments representing the nodes. To calculate and update the depth value of each node, a loopy belief propagation method is employed [96]. A similar method [16] first segments the input images, and then calculates an initial depth image using a modified plane sweeping algorithm [16]. Using this initial depth image, the algorithm iteratively calculates the depth of each segment with a greedy-search algorithm [97]. Whereas two-dimensional optimization strategies usually calculate smooth depth images, one major disadvantage is their computational complexity that hampers implementation in practical systems.

We now detail two different optimization techniques: a local “Winner-Take-All” optimization and a one-dimensional optimization. This one-dimensional optimization technique will be further developed in the second part of this chapter such that multiple views can be employed simultaneously.

3.3.3.0  A. Local optimization


A simple local optimization strategy is the “Winner-Take-All” (WTA). The WTA optimization consists of selecting for each pixel (x,y) the disparity d that minimizes an objective cost function c(x,x-d), thus
d (x ) = d = arg min c(x, x- d˜),
        x      d˜
(3.6)

where c(x,x -d) corresponds to one of the functions SAD, SSD, CC or NCC. This optimization strategy is called “Winner-Take-All” because the result of the cost function is not shared with neighboring pixels 2. Thus, the minimum cost (the “Winner”) yields (“Take-All”) the final disparity value. As discussed in Section 3.2.3, the optimization of Equation (3.6) yields fuzzy disparity estimates in texture-less regions. To detect unreliable matches, a cross-checking procedure can be employed. The cross-checking procedure can be decomposed into two steps. First, the disparity dr with respect to the right image I2 (see Figure 3.2) is estimated. Second, the disparity dl with respect to the left image I1 is calculated. This can be written as

dr(x) = arg min c(x,x - ˜d) and
 l             ˜d      ˜
d (x ) = arg min ˜dc(x+ d,x).
(3.7)

The match is then classified as reliable if the following condition holds:

dl(x- dr(x)) = dr(x).
(3.8)

Unreliable pixels not satisfying the above condition, can afterwards be pad-ded by the disparity value of a neighboring defined match.

3.3.3.0  B. One-dimensional optimization


Several contributions have considered the use of a one-dimensional optimization to estimate depth images [817]. In this section, we outline a slightly modified version of the method described in [6]. The idea of the one-dimensional optimization is to consider sequences of pixel disparities and to identify a sequence of disparities along an image scanline that minimizes a global energy function for each scanline. The one-dimensional optimization approach has two important advantages. First, it increases the reliability of the matching procedure by enforcing smooth variations of disparity estimates. As a result, fuzzy disparity estimates in texture-less regions are reduced. Second, the algorithm allows unmatched pixels, thereby facilitating an easy handling of occluded pixels.

The optimization strategy consists of calculating the vector of disparities D along a scanline that minimizes the objective energy function E(D) defined as

E (D ) = E   (D )+ E      (D),
          data        smooth
(3.9)

where each term is defined as follows. First, the vector D consists of the sequence of w disparity estimates along the scanline (assuming a scanline of w pixels), which is denoted as D = (d1,d2,...,dw). It should be noted that the disparity of an occluded pixel cannot be calculated and is thus labeled as “undefined” 3, i.e., di = ϕ. Second, the term Edata(D) represents the matching costs of all pixels along a scanline with defined disparities, defined as

           ∑
Edata(D ) =    c(x, x- dx),
           x∈S
(3.10)

where c(x,x -dx) represents the cost of the matching function, e.g., the SAD matching function, and S = {x|dx≠ϕ,1 ≤x ≤w}. Third, the term Esmooth(D) ensures that (a) disparity estimates vary smoothly in texture-less regions and that (b) fast disparity transitions are allowed at object borders, i.e., at pixel intensity variations. The term Esmooth(D) can be written as

             {
Esmooth(D ) =   E∑smoothx1(D )    if |dx - dxnext| ≥ 2,
                  x∈S - κr      else,
(3.11)

with κr > 0 being a constant parameter corresponding to a matching reward bonus leading to a negative cost. The newly function Esmooth_1(D) is defined as

                 ∑
Esmoothx1(D ) =        κd1|dx - dxnext|+ κd2 - B1 (x)- B2 (x - dx),
              x∈Ssmooth
(3.12)

where κd1, κd2 are control parameters used to penalize disparity variations along the scanline. The computation runs over Ssmooth with an extra constraint Ssmooth = {x|dx≠ϕ,dxnext≠ϕ,1 ≤x ≤w}, and xnext is the pixel position with defined disparity immediately following the current pixel position x (x ∈S). The terms B1(x), B2(x) are discontinuity bonuses that decrease the cost at an intensity discontinuity within the scanlines of the left and right images.

Example.

We show how to calculate the global objective function defined by Equation (3.9), assuming that a vector D was provided. In the illustration from Figure 3.5, we consider a (short) scanline of 8 pixels with their associated matching costs. The costs Edata(D) and Esmooth(D) are computed by
Edata(D ) = 25+ 27 + 24+ 20 + 23,
E     (D ) = - κ - κ +  (κ  |11- 15 |+ κ  )- κ ,
 smooth          r   r     d1            d2    r
(3.13)

assuming for simplicity that B1(x) = B2(x) = 0 for 1 ≤x ≤w.


PIC
Figure 3.5 Example of the cost calculation defined by Equation (3.9).


Disparity space image.

To estimate the vector of disparity D and minimize the energy function defined by Equation (3.9), dynamic programming can be employed. The implementation is based on constructing a correlation table, called the Disparity Space Image (DSI) [8]. The DSI structure is a 2D table, of dmax rows and w columns that contains the matching cost of all pixels (along a scanline) at all possible disparity values. Additionally, the smoothness term is integrated into the DSI as a transition cost between each node of the DSI. Considering the DSI structure as a graph, matching costs and transition costs correspond to node and edges, respectively.

PIC
Figure 3.6 DSI structure with all matching costs for all disparity values. Here, a unity disparity change between two consecutive pixels is admissible.


Using dynamic programming, the calculation of the optimal disparity sequence that minimizes Equation (3.9) can be then performed as follows. First, all entries of the DSI structure are initialized using the matching function, i.e., DSI(x,d) = c(x,x -dx). Second, a 2D table of cumulated costs T(x,d) is generated, using the following relation

T(x,d) = DSI (x,d) +     min    T (x+ xp,d + dp),
                     xp∈Sxa,dp∈Sda
(3.14)

where Sxa and Sda define the “admissible” or “legal” disparity variations between two consecutive pixels, with Sxa and Sda are both a set of integer numbers. In a graph, this corresponds to admissible transitions between nodes. For example, Figure 3.6 illustrates a DSI with the admissible transitions Sxa = {-1} and Sda = {-1,0,1}. Using the array of cumulated costs, the minimum cost sequence can be finally traced back. The algorithm that finds the optimal sequence D is summarized in Algorithm 1.



Algorithm 1 Dynamic programming optimization for a scanline y
Require:  2D arrays PrecTable(x,d), T(x,d) and DSI(x,d)   Initialize DSI(x,d)
  for d ∈ [0;dmaxdo
  T(0,d) = DSI(0,d);PrecTable(0,d) = (-1,d)

  end for
  for (x = 1;x ≤w;x + +) do
  for (d = 1;d ≤dmax;d + +) do
  (xp_o,dp_o) = arg minxp∈Sxa,dp∈SdaT(x + xp,d + dp)
  PrecTable(x,d) = (xp_o,dp_o)
  T(x,d) = DSI(x,d) + T(xp_o,dp_o)

  end for

  end for
  d = arg min˜d ∈{0,…,d max}T(w - 1,d˜ )
  while (x ≥ 0) do
  Store computed disparity: Disparity(x,y) = d
  (xp,dp) = PrecTable(x,d)
  x = xp; d = dp

  end while

Constraint path.

We now define an admissible path in the DSI that can handle (a) fast disparity variations and (b) occluded pixels. To constrain the admissible edge in the DSI, a solution is to employ an ordering constraint. Such an ordering constraint states that if an object A appears at the left of an object B in a left image, A also appears at the left of B in the right image (as defined by Bobick et al. [8]). From this observation, it can be inferred that left-occluded pixels correspond to a disparity increase in the graph while right-occluded pixels correspond to a disparity decrease (see Figure 3.7(a)). Additionally, it should be noted that the DSI edges should allow unmatched pixels to handle occluded pixels. For example, let us consider Figure 3.7(b) that depicts the DSI table at the top, and the corresponding left and right image scanlines and disparity sequence D at the bottom.
PIC
(a)
PIC
(b)

Figure 3.7 (a) Scanning the image from left to right, left- and right-occluded pixels correspond to an increase and decrease of disparity, respectively. (b) Simple DSI (top) with the corresponding scanline profile (bottom) of a left and right image. The admissible disparity transitions correspond to a diagonal gap for left-occluded pixels and a vertical gap for right-occluded pixels.


In Figure 3.7(b), the sequence D consists of the disparity d1 and d2, corresponding to the disparity of the background and foreground objects, respectively. To handle left-occluded pixels, it can be seen that the disparity transition between d1 and d2, should allow unmatched pixels. Therefore, left-occluded pixels, giving a decrease of disparity, can be handled by skipping pixels in the left image scanline. This is in accordance with the diagonal gap in the DSI. At the opposite, right-occluded pixels, i.e., decrease of disparity, can be handled by skipping right-occluded pixels, corresponding to a vertical gap in the DSI. Using such admissible edges in the DSI, the algorithm can handle large occluded regions by allowing occluded pixels to remain unmatched. Therefore, the appropriate admissible edges correspond to the path illustrated by Figure 3.8. Transition costs are indicated as defined by Equation (3.11).


PIC
Figure 3.8 Admissible edges in the DSI that model left- and right-occluded pixels. Transitions L → C correspond to a diagonal gap to handle left-occluded pixels. Transitions R → C correspond to a vertical gap to handle right-occluded pixels. Transitions M → C correspond to small disparity variations for matching visible pixels.


3.3.4  Experimental results


This section presents experimental results evaluating the accuracy of the reviewed disparity-estimation algorithms. It should be noted that the dis-parity-optimization strategy has a significant impact on the quality of the disparity images. Therefore, we have experimented with several different optimization strategies. More specifically, the presented experiments are performed using three different optimization strategies, with the following characteristics:
  1. an optimization strategy based on a WTA algorithm and,
  2. a WTA algorithm extended by a cross-checking procedure (see Equation (3.7)), and
  3. an optimization strategy based on dynamic programming as described in Section 3.3.3.

All optimization strategies use a fixed matching function (Sum of Absolute Differences) and a fixed support of the matching function (5 × 5 centered squared window). Each presented disparity image is estimated using fixed parameters. The functions B1(x), B2(x) define a discontinuity bonus B1(x) = B2(x) = -75 if two consecutive pixels show the intensity variations |(I(x,y) -I(x + 1,y))|> 10. We employed the following set of multi-view images “Teddy”, “Cones”, “Breakdancers”, “Ballet” and “Livingroom” (see Appendix 9.2 for a description of the multi-view images data set). For the multi-view images “Breakdancers”, “Ballet” and “Livingroom”, the two employed views were first rectified using the method described in Section 2.4.2. Let us now discuss the resulting estimated disparity images for the three different optimization strategies.

WTA.

First, it can be observed that the ”Winner-Take-All” strategy provides an accurate disparity estimation in textured regions. For example, the disparity is correctly estimated in the center region of Figure 3.9(b)) and Figure 3.9(f). However, the disadvantages of the WTA strategy are as follows. Looking at the top-right part of Figure 3.9(f), it can observed that the disparity in texture-less regions cannot be accurately estimated. For the sequences “Ballet”,“Breakdancers” and “Livingroom” showing large texture-less regions, the estimated disparity images can be considered as highly inaccurate. These inaccurate depth images are therefore not included in this thesis, as this algorithmic proposal does not give any satisfactory results. A second disadvantage is that the disparity cannot be estimated accurately along object borders where occluded pixels are typically located. For example, the disparity along the left edges of the cones in Figure 3.9(b) cannot be estimated.

WTA extended by cross-checking.

To avoid unreliable disparity estimates, WTA was extented with cross-checking. Considering Figure 3.9(c) and Figure 3.9(g), it can be seen that the cross-checking procedure detects unreliable disparity estimates (shown as blue/black). However, one disadvantage of the procedure is that a too large number of estimates are classified as unreliable. Similar to the WTA experiment without cross-checking, the disparity of the multi-view images “Breakdancers”, “Ballet” and “Livingroom” cannot be estimated and are therefore also not included in this thesis. In the next section, we propose an algorithm that estimates more accurate depth images of which the corresponding calculated depth images will be presented.

One-dimensional optimization.

The one-dimensional dynamic programming optimization provides more reliable disparity estimates with respect to WTA. For example, when observing Figure 3.9(d) and Figure 3.9(h), it can be seen that disparity images correspond to the expected surface of objects in the scene. Furthermore, occluded pixels along the object borders are also detected and correctly handled (see for example the left side of the teddy bear). Additionally, the disparity for the multi-view images “Ballet”,“Breakdancers” and “Livingroom” can be more accurately estimated. For example, observing the disparity of Figure 3.10, it can be seen that the disparities of the checkerboard, the lamp and the tree were correctly estimated. Similar observations can be made by analyzing Figure 3.11 and Figure 3.12. However, all disparity images suffer from scanline streaking artifacts. Such artifacts result from the independent optimization of disparities for each scanline.
PIC
(a) Original “Cones” image
PIC
(b) “WTA”
PIC
(c) “WTA” with cross-checking
PIC
(d) Dynamic programming
PIC
(e) Original “Teddy” image
PIC
(f) “WTA”
PIC
(g) “WTA” with cross-checking
PIC
(h) Dynamic programming

Figure 3.9 Estimated disparity images using the “Teddy” and “Cones” multi-view images.



PIC
(a)
PIC
(b)

Figure 3.10 (a) View 3 of the “Livingroom” multi-view images data set. (b) Estimated disparity image using the dynamic programming algorithm.



PIC
(a)
PIC
(b)

Figure 3.11 (a) View 3 of the “Ballet” multi-view images data set. (b) Estimated disparity image using the dynamic programming algorithm.



PIC
(a)
PIC
(b)

Figure 3.12 (a) View 3 of the “Breakdancers” multi-view images data set. (b) Estimated disparity image using the dynamic programming algorithm.


3.3.5  Intermediate conclusions


The previous experiments lead to a few intermediate conclusions. First, it can noted that the quality of disparity images significantly depends on the characteristics of the input multi-view images. For example, the “Teddy” and “Cones” multi-view images show large colorful and textured regions. As a result, accurate disparity images can be estimated. In contrast with this, the colorless and texture-less characteristics of the “Breakdancers” and “Ballet” multi-view images do not allow an accurate estimation of disparity. To obtain more accurate disparity images for this type of sequences, a one-dimensional optimization strategy was presented. However, while this strategy estimates more accurate disparity values in some regions, significant streaking artifacts are introduced at some other locations. In order to increase the accuracy of the estimation, we explore an increasingly popular approach, employing multiple views of the scene. This approach will addressed in the next section.

2Sharing the result of the cost function can be done to enforce spatially consistent disparity values

3In this case, the symbol ϕ does not represent an empty set but an undefined element.