3.2  Two-view depth estimation


3.2.1  Two-view geometry


We now describe the geometry that defines the relationship between two corresponding pixels and the depth of a 3D point. Let us consider a 3D Euclidean point (X,Y,Z)^T captured by two cameras and the two corresponding projected pixels p1 and p2. Using the camera parameters (see Equation (2.18)), the pixel positions p1 = (x1,y1,1)^T and p2 = (x2,y2,1)^T can be written as
\[
\lambda_i \, \mathbf{p}_i \;=\; K_i R_i \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \;-\; K_i R_i C_i
\qquad \text{with } i \in \{1,2\}.
\tag{3.1}
\]


Figure 3.2 Two aligned cameras capturing rectified images can be used to estimate depth by triangulation.


The previous relation can be simplified by considering the restricted case of two rectified views, so that the following assumptions can be made. First, without loss of generality, the world coordinate system is chosen such that it coincides with the coordinate system of camera 1 (see Figure 3.2), so that C1 = 0_3 and R1 = I_{3×3}. Second, because the images are rectified, both rotation matrices are equal: R1 = R2 = I_{3×3}. Third, camera 2 is located on the X axis: C2 = (tx2, 0, 0)^T. Fourth and finally, both cameras are identical, so that the internal camera parameter matrices are equal, leading to

\[
K \;=\; K_1 \;=\; K_2 \;=\;
\begin{pmatrix}
f & 0 & o_x \\
0 & f & o_y \\
0 & 0 & 1
\end{pmatrix}.
\tag{3.2}
\]

Equation (3.1) can now be rewritten as

\[
\lambda_1 \begin{pmatrix} x_1 \\ y_1 \\ 1 \end{pmatrix} = K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}
\qquad\text{and}\qquad
\lambda_2 \begin{pmatrix} x_2 \\ y_2 \\ 1 \end{pmatrix} = K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} - K \begin{pmatrix} t_{x_2} \\ 0 \\ 0 \end{pmatrix}.
\tag{3.3}
\]

By combining both previous relations, and noting that the third rows of Equation (3.3) imply λ1 = λ2 = Z, it can be derived that

\[
\begin{pmatrix} x_2 \\ y_2 \end{pmatrix} =
\begin{pmatrix} x_1 - \dfrac{f \cdot t_{x_2}}{Z} \\[4pt] y_1 \end{pmatrix}.
\tag{3.4}
\]

Equation (3.4) provides the relationship between two corresponding pixels and the depth Z of the 3D point (for the simplified case of two rectified views). The quantity f·tx2/Z is typically called the disparity. In practice, the disparity corresponds to the parallax motion of objects¹, i.e., the apparent motion of objects observed from a moving viewpoint. This can be illustrated by the example of a viewer sitting in a moving train: the motion parallax of the foreground grass along the train tracks is larger than that of a tree far away in the background.

It can be noted that the disparity is inversely proportional to the depth, so that a small disparity value corresponds to a large depth. To emphasize the difference between both quantities, the terms disparity and depth are used distinctively throughout this thesis.
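As a purely illustrative example (the focal length, baseline, and disparity values below are assumptions chosen for illustration and are not used elsewhere in this thesis), the inverse relation follows directly from Equation (3.4), since the disparity is d = x1 - x2:

\[
d = \frac{f \cdot t_{x_2}}{Z}
\quad\Longleftrightarrow\quad
Z = \frac{f \cdot t_{x_2}}{d}.
\]

For instance, with assumed values f = 1000 pixels and tx2 = 0.1 m, a disparity of 50 pixels corresponds to Z = 1000 · 0.1 / 50 = 2 m, whereas halving the disparity to 25 pixels doubles the depth to Z = 4 m.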

The reason that we emphasize this difference so explicitly is that we will exploit it in the remainder of this thesis. Typically, a depth image is estimated by first calculating a disparity image from two rectified images and afterwards converting this disparity into depth values. In the second part of this chapter, we show that such a two-stage computation can be circumvented by directly estimating the depth using an alternative technique based on an appropriate geometric formulation of the framework.

3.2.2  Simple depth estimation algorithm


Based on the previously described geometric model, a simple depth estimation algorithm can be detailed as follows. Let us consider a left and a right rectified image, denoted by I1 and I2, respectively. To perform depth estimation, it is necessary to establish the point-correspondence (p1, p2) for each pixel. Selecting the pixel p1 as a reference, a simple strategy consists of searching along the epipolar line for the pixel p2 that corresponds to p1. Because the images are rectified, this search is performed along horizontal raster scanlines. To limit the search area, a maximum disparity value dmax is defined. The similarity between pixels p1 and p2 is measured using a matching block (window) W surrounding the pixels (see Figure 3.3).

Figure 3.3 The disparity is estimated by searching for the most similar block in the second image I2 along a one-dimensional horizontal epipolar line.


Employing the Sum of Absolute Differences (SAD) as a similarity measure for block comparison, the disparity d of a pixel at position (x,y) in view I1 can be written as

\[
d(x,y) \;=\; \underset{0 \,\le\, \tilde{d} \,\le\, d_{\max}}{\arg\min}
\sum_{(i,j) \in W} \bigl| I_1(x+i,\, y+j) \;-\; I_2(x+i-\tilde{d},\, y+j) \bigr|.
\tag{3.5}
\]

The previous operation is repeated for each pixel, so that a dense disparity map is obtained. To convert the obtained disparity map into a depth image, Equation (3.4) can be employed. Because the technique relies on a simple block-matching procedure, its main attraction is computational simplicity. However, this simple technique often yields inaccurate disparity estimates. For example, a change of illumination across the views introduces matching ambiguities. Besides this, further difficulties arise, which we now discuss in more detail.
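For illustration only, the following Python/NumPy sketch implements the block matching of Equation (3.5) and the subsequent disparity-to-depth conversion of Equation (3.4). The function names, the window size, the value of dmax, and the camera constants are assumptions made for this sketch and are not taken from this chapter; an optimized implementation would avoid the exhaustive per-pixel search.

import numpy as np

def sad_disparity(I1, I2, d_max, half_window=3):
    # Dense disparity estimation by SAD block matching (Equation (3.5)).
    # I1, I2: rectified grayscale images as 2-D NumPy arrays of equal size.
    h, w = I1.shape
    b = half_window
    disparity = np.zeros((h, w), dtype=np.int32)
    for y in range(b, h - b):
        for x in range(b, w - b):
            block1 = I1[y - b:y + b + 1, x - b:x + b + 1].astype(np.float64)
            best_d, best_sad = 0, np.inf
            # Search along the horizontal epipolar line, limited by d_max
            # and by the image border.
            for d in range(0, min(d_max, x - b) + 1):
                block2 = I2[y - b:y + b + 1, x - d - b:x - d + b + 1].astype(np.float64)
                sad = np.abs(block1 - block2).sum()
                if sad < best_sad:
                    best_sad, best_d = sad, d
            disparity[y, x] = best_d
    return disparity

def disparity_to_depth(disparity, f, t_x2):
    # Convert disparity (pixels) to depth via Z = f * t_x2 / d (Equation (3.4)).
    # Zero disparity is mapped to infinite depth.
    d = disparity.astype(np.float64)
    return np.where(d > 0, f * t_x2 / np.maximum(d, 1e-9), np.inf)

For example, with the assumed constants f = 1000 pixels and tx2 = 0.1 m, calling disparity_to_depth(sad_disparity(I1, I2, d_max=64), f=1000, t_x2=0.1) would return a depth map in meters.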

3.2.3  Difficulties of the described model


The calculation of depth using point-correspondences is an ill-posed problem in many situations, which are summarized below.
D1: Texture-less regions. To estimate the disparity, a similarity or correlation measure is employed. However, when the disparity is estimated over a region of constant color, all candidate disparity values yield an (almost) equal matching score. In practice, this results in unreliable disparity estimates.
D2: Occluded regions. As illustrated by Figure 3.4, some regions in the scene may not be visible from both selected viewpoints; such regions are called occluded regions. Because no point-correspondences can be detected in occluded regions, no depth information can be calculated there.
D3: Contrast changes across the views. When capturing two images with two different cameras, the contrast settings and illumination may differ. This results in different intensity levels across the views, which yields unreliable matches.


Figure 3.4 Some regions visible in the image I1 are occluded in the image I2, which prevents the detection of point-correspondences.


To address the previously discussed issues, we now review prior work on depth estimation from the literature.

¹ Assuming a translational motion.