3.2 Two-view depth estimation
3.2.1 Two-view geometry
We now describe the geometry that defines the relationship between two corresponding pixels and the depth of a 3D point. Let us consider a 3D Euclidean point (X,Y,Z)^{T} captured by two cameras and the two corresponding projected pixels p_{1} and p_{2}. Using the camera parameters (see Equation (2.18)), the pixel positions p_{1} = (x_{1},y_{1},1)^{T} and p_{2} = (x_{2},y_{2},1)^{T} can be written as
\[
\begin{cases}
\lambda_{1}\, p_{1} = K_{1} R_{1} \big( (X,Y,Z)^{T} - C_{1} \big) \\
\lambda_{2}\, p_{2} = K_{2} R_{2} \big( (X,Y,Z)^{T} - C_{2} \big)
\end{cases}
\tag{3.1}
\]
where \lambda_{1} and \lambda_{2} denote the nonzero scaling factors of the homogeneous pixel coordinates.

The previous relation can be simplified by considering the restricted case of two rectified views, so that the following assumptions can be made. First, without loss of generality, the world coordinate system is selected such that it coincides with the coordinate system of camera 1 (see Figure 3.2). In this case, it follows that C_{1} = 0_{3} and R_{1} = I_{3×3}. Second, because the images are rectified, both rotation matrices are equal: R_{1} = R_{2} = I_{3×3}. Third, camera 2 is located on the X axis: C_{2} = (t_{x2},0,0)^{T}. Fourth and finally, both cameras are identical, so that the internal camera parameter matrices are equal, leading to
\[
K_{1} = K_{2} = K =
\begin{pmatrix}
f & 0 & o_{x} \\
0 & f & o_{y} \\
0 & 0 & 1
\end{pmatrix}.
\tag{3.2}
\]
Equation (3.1) can now be rewritten as
\[
\begin{cases}
Z\, p_{1} = K\, (X,Y,Z)^{T} \\
Z\, p_{2} = K \big( (X,Y,Z)^{T} - (t_{x2},0,0)^{T} \big)
\end{cases}
\tag{3.3}
\]
where the scaling factors \lambda_{1} and \lambda_{2} both equal the depth Z.
By combining both previous relations, it can be derived that
\[
p_{2} = p_{1} - \left( \frac{f \cdot t_{x2}}{Z},\, 0,\, 0 \right)^{T}
\tag{3.4}
\]
Equation (3.4) provides the relationship between two corresponding pixels and the depth Z of the 3D point (for the simplified case of two rectified views). The quantity f ⋅ t_{x2}∕Z is typically called the disparity. In practice, the disparity corresponds to the parallax motion of objects^{1}, i.e., the motion of objects observed from a moving viewpoint. This can be illustrated by the example of a viewer sitting in a moving train: the motion parallax of the foreground grass along the train tracks is larger than that of a tree far away in the background.
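To make this relationship concrete, consider a numerical example (the focal length and baseline values below are illustrative assumptions, not taken from this thesis):

```latex
% Disparity for assumed values f = 1000 px and t_{x2} = 0.1 m:
d = \frac{f \cdot t_{x2}}{Z},
\qquad
Z = 2\,\mathrm{m} \;\Rightarrow\; d = \frac{1000 \cdot 0.1}{2} = 50\ \mathrm{px},
\qquad
Z = 20\,\mathrm{m} \;\Rightarrow\; d = \frac{1000 \cdot 0.1}{20} = 5\ \mathrm{px}.
```

A tenfold increase in depth reduces the disparity tenfold, matching the train example: nearby objects exhibit a large parallax motion, while distant objects barely move.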
It can be noted that the disparity is inversely proportional to the depth, so that a small disparity value corresponds to a large depth. To emphasize the difference between both quantities, the following two terms are used distinctly throughout this thesis.
 Disparity image/map: an image that stores the disparity value of every pixel.
 Depth image/map: an image that stores the depth value of every pixel.
The reason that we emphasize this difference so explicitly is that we exploit it in the remainder of this thesis. Typically, a depth image is estimated by first calculating a disparity image from two rectified images and then converting the disparity values into depth values. In the second part of this chapter, we show that such a two-stage computation can be circumvented by directly estimating the depth, using an alternative technique based on an appropriate geometric formulation of the framework.
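The conversion step of the two-stage pipeline mentioned above can be sketched in a few lines. This is a minimal illustration of Equation (3.4) rearranged as Z = f ⋅ t_{x2}∕d; the function name and the numerical values are our own assumptions, not part of the thesis.

```python
import numpy as np

def disparity_to_depth(disparity, f, t_x2, eps=1e-6):
    """Convert a disparity map into a depth map via Z = f * t_x2 / d.

    disparity : 2-D array of disparity values (in pixels)
    f         : focal length (in pixels)
    t_x2      : camera baseline along the X axis (in meters)
    """
    # A disparity of zero corresponds to a point at infinity; clamp it
    # to avoid a division by zero.
    return (f * t_x2) / np.maximum(disparity, eps)

# Illustrative values (assumed): f = 1000 px, baseline t_x2 = 0.1 m.
depth = disparity_to_depth(np.array([[50.0, 5.0]]), f=1000.0, t_x2=0.1)
# Small disparities map to large depths: 50 px -> 2 m, 5 px -> 20 m.
```

Note that the conversion is element-wise, so it applies unchanged to a full-resolution disparity map.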
3.2.2 Simple depth estimation algorithm
Based on the previously described geometric model, a simple depth estimation algorithm can be detailed as follows. Let us consider a left and a right rectified image, denoted by I_{1} and I_{2}, respectively. To perform depth estimation, it is necessary to establish the point-correspondence (p_{1},p_{2}) for each pixel. Selecting pixel p_{1} as a reference, a simple strategy consists of searching along the epipolar line for the pixel p_{2} that corresponds to p_{1}. Because the images are rectified, this search is performed along horizontal raster scanlines. To limit the search area, a maximum disparity value d_{max} is defined. The similarity between pixels p_{1} and p_{2} is measured using a matching block (window) W surrounding the pixels (see Figure 3.3).

Employing the Sum of Absolute Differences (SAD) as a similarity measure for block comparison, the disparity d of a pixel at position (x,y) in view I_{1} can be written as
\[
d(x,y) = \underset{d' \in \{0,\ldots,d_{max}\}}{\arg\min} \;
\sum_{(i,j) \in W} \left| I_{1}(x+i,\, y+j) - I_{2}(x+i-d',\, y+j) \right|
\tag{3.5}
\]
The previous operation is repeated for each pixel, so that a dense disparity map is obtained. To convert the obtained disparity map into a depth image, Equation (3.4) should be employed. The attractiveness of this block-matching procedure is its computational simplicity. However, this simple technique results in inaccurately estimated disparity values. For example, a change of illumination across the views introduces ambiguities. Besides this, other difficulties arise that we now discuss in more detail.
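The block-matching search of Equation (3.5) can be sketched as follows. This is a minimal winner-takes-all implementation under our own assumptions: the function name and default window size are illustrative, and border pixels are handled crudely by edge padding and wrap-around shifting.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sad_block_matching(I1, I2, d_max, half_window=2):
    """Dense disparity estimation for two rectified grayscale images.

    For every pixel of the reference image I1, the SAD score is evaluated
    for each candidate disparity in [0, d_max] along the horizontal
    scanline, and the disparity with the lowest score is kept.
    """
    h, w = I1.shape
    pad = half_window
    I1p = np.pad(I1.astype(np.float64), pad, mode='edge')
    I2p = np.pad(I2.astype(np.float64), pad, mode='edge')
    disparity = np.zeros((h, w), dtype=np.int64)
    best = np.full((h, w), np.inf)
    for d in range(d_max + 1):
        # Candidate correspondence I2(x - d, y); np.roll wraps around at
        # the image border, so estimates near the left edge are unreliable.
        diff = np.abs(I1p - np.roll(I2p, d, axis=1))
        # Sum the absolute differences over the matching window W.
        sad = sliding_window_view(diff, (2 * pad + 1, 2 * pad + 1)).sum(axis=(2, 3))
        improved = sad < best
        disparity[improved] = d
        best[improved] = sad[improved]
    return disparity
```

On a synthetic pair in which I2 is a pure horizontal shift of I1, the estimated disparity equals the true shift for pixels away from the image borders, where the edge padding and wrap-around take effect.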
3.2.3 Difficulties of the described model
The calculation of depth using point-correspondences is an ill-posed problem in many situations, which are summarized below.
 D1: Textureless regions. To estimate the disparity, a similarity or correlation measure is employed. However, when the disparity is estimated over a region of constant color, the correlation function yields a nearly constant matching score for all candidate disparity values, so that no single disparity stands out. In practice, this results in unreliable disparity estimates.
 D2: Occluded regions. As illustrated by Figure 3.4, some regions of the scene are visible from only one of the two selected viewpoints; such regions are called occluded regions. Because no point-correspondences can be established for occluded pixels, no depth information can be calculated there.
 D3: Contrast changes across the views. When capturing two images with two different cameras, the contrast settings and illumination may differ. This results in different intensity levels across the views, which yields unreliable matches.

To address the previously discussed issues, we now review previous work on depth estimation from the literature.
^{1} Assuming a translational camera motion.