2.2 Pinhole camera model
In this section, we describe the image acquisition process known as the pinhole camera model, which serves as a basis throughout this thesis. More specifically, we first discuss the model that integrates the internal or intrinsic camera parameters, such as the focal length and the lens distortion. Second, we extend this simple camera model to integrate the external or extrinsic camera parameters, corresponding to the position and orientation of the camera.
2.2.1 Intrinsic camera parameters
The pinhole camera model defines the geometric relationship between a 3D point and its corresponding 2D projection onto the image plane. When using a pinhole camera model, this geometric mapping from 3D to 2D is called a perspective projection. We denote the center of the perspective projection (the point at which all rays intersect) as the optical center or camera center, and the line perpendicular to the image plane passing through the optical center as the optical axis (see Figure 2.2). Additionally, the intersection of the image plane with the optical axis is called the principal point. The pinhole camera model describing the perspective projection of 3D points onto the image plane can be formulated as follows.
A. Perspective projection using homogeneous coordinates
Let us consider a camera with the optical axis collinear with the Z_{cam} axis and the optical center located at the origin of a 3D coordinate system (see Figure 2.2).

The projection of a 3D world point (X,Y,Z)^T onto the image plane at pixel position (u,v)^T can be written as
\[
u = f\,\frac{X}{Z}, \qquad v = f\,\frac{Y}{Z}, \tag{2.3}
\]
where f denotes the focal length. To avoid this nonlinear division operation, the previous relation can be reformulated within the projective geometry framework as
\[
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} =
\begin{pmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}, \tag{2.5}
\]
where λ = Z is the homogeneous scaling factor.
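The following Python fragment sketches Equations (2.3) and (2.5) numerically; the focal length and the 3D point are illustrative, assumed values rather than quantities from this chapter.

```python
import numpy as np

# Illustrative values (not taken from the text).
f = 800.0                     # focal length in pixels
X, Y, Z = 0.2, -0.1, 4.0      # 3D point in the camera coordinate system

# Direct nonlinear form of Equation (2.3).
u_direct = f * X / Z
v_direct = f * Y / Z

# Homogeneous form of Equation (2.5):
# lambda * (u, v, 1)^T = diag(f, f, 1) * (X, Y, Z)^T.
M = np.diag([f, f, 1.0])
p_h = M @ np.array([X, Y, Z])
u, v = p_h[:2] / p_h[2]       # divide by the scaling factor lambda = Z
```

Both formulations yield the same pixel position; the homogeneous form has the advantage of being a purely linear matrix operation.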
B. Principal-point offset
Most current imaging systems define the origin of the pixel coordinate system at the top-left pixel of the image. However, it was previously assumed that the origin of the pixel coordinate system corresponds to the principal point (o_x,o_y)^T, located at the center of the image (see Figure 2.3(a)). A conversion of coordinate systems is thus necessary. Using homogeneous coordinates, the principal-point position can be readily integrated into the projection matrix. The perspective projection equation now becomes
\[
\lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} =
\begin{pmatrix} f & 0 & o_x \\ 0 & f & o_y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}. \tag{2.6}
\]
C. Image-sensor characteristics
To derive the relation described by Equation (2.6), it was implicitly assumed that the pixels of the image sensor are square, i.e., the aspect ratio is 1:1 and the pixels are not skewed. However, both assumptions may not always be valid. First, for example, the NTSC TV system defines non-square pixels with an aspect ratio of 10:11. In practice, the pixel aspect ratio is often provided by the image-sensor manufacturer. Second, pixels can potentially be skewed, especially when the image is acquired by a frame grabber. In this particular case, the pixel grid may be skewed due to an inaccurate synchronization of the pixel-sampling process. Both imperfections of the imaging system can be taken into account in the camera model using the parameters η and τ, which model the pixel aspect ratio and the skew of the pixels, respectively (see Figure 2.3(b)). The projection mapping can now be updated as
\[
\lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} =
\begin{pmatrix} f\eta & f\tau & o_x \\ 0 & f & o_y \\ 0 & 0 & 1 \end{pmatrix}
[\, I_3 \;|\; 0_3 \,]\, P = K\,[\, I_3 \;|\; 0_3 \,]\, P, \tag{2.7}
\]
with P = (X,Y,Z,1)^T being a 3D point defined in homogeneous coordinates. In practice, when employing recent digital cameras, it can be safely assumed that pixels are square (η = 1) and non-skewed (τ = 0). The projection matrix that incorporates the intrinsic parameters is denoted by K throughout this thesis. The all-zero column vector is denoted by 0_3.
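As a sketch of Equation (2.7), the fragment below builds the intrinsic matrix K with an aspect ratio η and skew τ and projects a homogeneous 3D point; all numeric values are illustrative assumptions.

```python
import numpy as np

# Illustrative intrinsic parameters (assumed values).
f, eta, tau = 800.0, 1.0, 0.0        # focal length, aspect ratio, skew
ox, oy = 320.0, 240.0                # principal point

K = np.array([[f * eta, f * tau, ox],
              [0.0,     f,       oy],
              [0.0,     0.0,     1.0]])

# [I_3 | 0_3]: selects the first three coordinates of the homogeneous point.
I0 = np.hstack([np.eye(3), np.zeros((3, 1))])

P = np.array([0.2, -0.1, 4.0, 1.0])  # homogeneous 3D point P = (X, Y, Z, 1)^T

p_h = K @ I0 @ P                     # lambda * (x, y, 1)^T
x, y = p_h[:2] / p_h[2]
```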
Figure 2.3 (a) The image (x,y) and camera (u,v) coordinate system. (b) Nonideal image sensor with nonsquare, skewed pixels.

D. Radial lens distortion
Real camera lenses typically suffer from nonlinear lens distortion. In practice, radial lens distortion causes straight lines to be mapped to curved lines. As seen in Figure 2.4, radial lens distortion is most visible near the image borders, where the radial distance is large. A standard technique to model radial lens distortion can be described as follows.
Figure 2.4 Real camera lenses suffer from radial lens distortion that causes straight lines to be bent. Pixel grid of an (a) undistorted and (b) distorted image.

Let (x_u,y_u)^T and (x_d,y_d)^T be the corrected and the measured (distorted) pixel positions, respectively. The relation between an undistorted and a distorted pixel can be modeled with a polynomial function and written as
\[
\begin{pmatrix} x_u \\ y_u \end{pmatrix} =
\begin{pmatrix} o_x \\ o_y \end{pmatrix}
+ L(r_d) \begin{pmatrix} x_d - o_x \\ y_d - o_y \end{pmatrix}, \tag{2.8}
\]
where
\[
L(r_d) = 1 + k_1\, r_d^2, \qquad
r_d = \sqrt{(x_d - o_x)^2 + (y_d - o_y)^2}. \tag{2.9}
\]
When k_1 = 0, it can be noted that x_u = x_d and y_u = y_d, which corresponds to the absence of radial lens distortion.
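The forward model of Equations (2.8)-(2.9) can be sketched as follows; the distortion coefficient k_1 and the distortion center are illustrative, assumed values.

```python
import numpy as np

# Illustrative distortion parameters (assumed values).
k1 = 1.0e-6
ox, oy = 320.0, 240.0

def undistort_point(x_d, y_d):
    # Equations (2.8)-(2.9): corrected position from the distorted one.
    r_d2 = (x_d - ox) ** 2 + (y_d - oy) ** 2   # squared radial distance r_d^2
    L = 1.0 + k1 * r_d2                        # L(r_d) = 1 + k1 * r_d^2
    return ox + (x_d - ox) * L, oy + (y_d - oy) * L

x_u, y_u = undistort_point(420.0, 240.0)
```

A pixel on the horizontal axis through the distortion center is displaced purely radially; with k1 = 0 the function reduces to the identity.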
It should be noted that Equation (2.8) provides the corrected pixel position as a function of the distorted pixel position. However, to generate an undistorted image, it is more convenient to express the function L(r) in terms of the undistorted pixel position. This technique is usually known as the inverse mapping method. Inverse mapping consists of scanning each pixel of the output image and resampling and interpolating the correct pixel from the input image. To perform an inverse mapping, the radial lens distortion model must be inverted, which can be done as follows. First, similar to the second part of Equation (2.9), we define
\[
r_u = \sqrt{(x_u - o_x)^2 + (y_u - o_y)^2}. \tag{2.10}
\]
Then, taking the norm of Equation (2.8), it can be derived that
\[
r_u = L(r_d)\, r_d, \tag{2.11}
\]
which is equivalent to
\[
r_d = \frac{r_u}{L(r_d)}. \tag{2.12}
\]
When taking into account Equation (2.9), this equation can be rewritten as a cubic polynomial:
\[
r_d^3 + \frac{1}{k_1}\, r_d - \frac{r_u}{k_1} = 0. \tag{2.13}
\]
The inverted lens distortion function can be derived by substituting Equation (2.12) into Equation (2.8) and developing it from the right-hand side:
\[
\begin{pmatrix} x_d \\ y_d \end{pmatrix} =
\begin{pmatrix} o_x \\ o_y \end{pmatrix}
+ \frac{1}{L(r_d)} \begin{pmatrix} x_u - o_x \\ y_u - o_y \end{pmatrix}, \tag{2.14}
\]
where r_d can be calculated by solving the cubic polynomial of Equation (2.13). Writing Equation (2.13) in the form r_d^3 + p r_d - q = 0, with p = 1/k_1 and q = r_u/k_1, the polynomial can be solved using Cardano's method, by first calculating the discriminant Δ defined as Δ = q^2 + (4/27)p^3. Depending on the sign of the discriminant, three cases are possible.
If Δ > 0, then the equation has one real root r_{d1} defined as
\[
r_{d1} = \sqrt[3]{\frac{q + \sqrt{\Delta}}{2}} + \sqrt[3]{\frac{q - \sqrt{\Delta}}{2}}. \tag{2.15}
\]
If Δ < 0, then the equation has three real roots r_{dk} defined by
\[
r_{dk} = 2\sqrt{\frac{-p}{3}}\,
\cos\!\left(\frac{1}{3}\arccos\!\left(\frac{-3q}{2p}\sqrt{\frac{-3}{p}}\right)
- \frac{2\pi k}{3}\right), \tag{2.16}
\]
for k = {0,1,2}, where an appropriate solution r_{dk} should be selected such that 0 < r_{dk} < r_u. However, only a single radius corresponds to the practical solution, so the second case Δ < 0 should not be encountered. The third case, Δ = 0, is also impractical. In practice, we have indeed observed that these second and third cases never occur.
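The Δ > 0 case of Equation (2.15) can be sketched as follows; the value of k_1 and the undistorted radius are illustrative assumptions.

```python
import numpy as np

def distorted_radius(r_u, k1):
    # Solve r_d^3 + p*r_d - q = 0 with p = 1/k1, q = r_u/k1 (Cardano).
    p = 1.0 / k1
    q = r_u / k1
    delta = q ** 2 + (4.0 / 27.0) * p ** 3   # discriminant, > 0 for k1 > 0
    s = np.sqrt(delta)
    # Single real root of Equation (2.15); np.cbrt handles negative operands.
    return np.cbrt((q + s) / 2.0) + np.cbrt((q - s) / 2.0)

k1 = 1.0e-6                          # illustrative distortion coefficient
r_d = distorted_radius(101.0, k1)    # hypothetical undistorted radius r_u
```

The recovered radius must satisfy Equation (2.11), r_u = r_d L(r_d), which provides a simple consistency check on the root.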
As an example, Figure 2.5 depicts a distorted image and the corresponding corrected image obtained using the inverse mapping method, with Δ > 0.
Figure 2.5 (a) Distorted image. (b) Corresponding corrected image using the inverse mapping method.

Estimation of the distortion parameters The discussed lens-distortion correction method requires knowledge of the lens parameters, i.e., k_1 and (o_x,o_y)^T. The distortion parameters can be estimated by minimizing a cost function that measures the curvature of lines in the distorted image. To measure this curvature, a practical solution is to detect feature points belonging to the same line on a calibration rig, e.g., a checkerboard calibration pattern (see Figure 2.5). Points belonging to the same line in the distorted image form a bent curve instead of a straight line [99]. By measuring the deviation of this curve from the theoretical straight-line model, the distortion parameters can be calculated.
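The estimation principle can be sketched as follows: k_1 is found by minimizing the straightness residual of feature points after undistortion. The point data, the true k_1, the fixed distortion center, and the search grid are all hypothetical, and a simple grid search stands in for a proper nonlinear minimization.

```python
import numpy as np

center = np.array([320.0, 240.0])   # assumed, fixed distortion centre
k1_true = 1.0e-6                    # hypothetical ground-truth coefficient

def undistort(pts, k1):
    # Forward model of Eq. (2.8): distorted -> undistorted positions.
    d = pts - center
    r2 = (d ** 2).sum(axis=1, keepdims=True)
    return center + d * (1.0 + k1 * r2)

def distort(pts, k1, iters=50):
    # Fixed-point inversion of the forward model (only to synthesise data).
    d = pts - center
    out = d.copy()
    for _ in range(iters):
        r2 = (out ** 2).sum(axis=1, keepdims=True)
        out = d / (1.0 + k1 * r2)
    return center + out

# Collinear feature points, as observed through a distorting lens.
xs = np.linspace(100.0, 540.0, 9)
line = np.column_stack([xs, 0.25 * xs + 50.0])
observed = distort(line, k1_true)

def cost(k1):
    # Sum of squared residuals of a straight-line fit (curvature measure).
    pts = undistort(observed, k1)
    resid = np.polyfit(pts[:, 0], pts[:, 1], 1, full=True)[1]
    return float(resid[0])

grid = np.linspace(0.0, 2.0e-6, 201)
k1_hat = grid[int(np.argmin([cost(k) for k in grid]))]
```

With noise-free synthetic data the residual vanishes at the true coefficient, so the grid search recovers k_1 up to the grid resolution; with real detections, a continuous minimizer would be used instead.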
2.2.2 Extrinsic parameters
As opposed to the intrinsic parameters that describe internal characteristics of the camera (focal length, radial lens distortion parameters), the extrinsic parameters indicate the external position and orientation of the camera in the 3D world. Mathematically, the position and orientation of the camera are defined by a 3 × 1 vector C and a 3 × 3 rotation matrix R (see Figure 2.6).

To obtain the pixel position p = (x,y,1)^T of a 3D-world homogeneous point P, the 3D point is first translated by the camera position C, so that the camera center coincides with the world coordinate origin, and is subsequently rotated by R. This can be mathematically written as
\[
\lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} =
K\, R \left( \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} - C \right). \tag{2.17}
\]
Alternatively, when combining matrices, Equation (2.17) can be reformulated as
\[
\lambda\, p = K\,[\, R \;|\; -RC \,]\, P. \tag{2.18}
\]
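The fragment below sketches Equation (2.18) numerically; the rotation, camera center, and intrinsic values are illustrative assumptions.

```python
import numpy as np

# Illustrative intrinsic matrix (assumed values).
f, ox, oy = 800.0, 320.0, 240.0
K = np.array([[f, 0.0, ox],
              [0.0, f, oy],
              [0.0, 0.0, 1.0]])

# Illustrative camera pose: rotation about the Y axis, offset centre.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
C = np.array([0.5, 0.0, -2.0])

# 3x4 projection matrix of Equation (2.18): K [R | -RC].
M = K @ np.hstack([R, (-R @ C)[:, None]])

P = np.array([0.0, 0.0, 3.0, 1.0])   # homogeneous world point
p_h = M @ P                          # lambda * (x, y, 1)^T
x, y = p_h[:2] / p_h[2]
```

The combined matrix form gives the same result as first translating by C and then rotating by R, as in Equation (2.17).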
Back-projection of a 2D point to 3D
Previously, the process of projecting a 3D point onto the 2D image plane was described. We now present how a 2D point can be back-projected into 3D space and derive the corresponding coordinates. Considering a 2D point p in an image, there exists a collection of 3D points that are projected onto this same point p. This collection of 3D points constitutes a ray connecting the camera center C = (C_x,C_y,C_z)^T and p = (x,y,1)^T. From Equation (2.18), the ray P(λ) associated with a pixel p = (x,y,1)^T can be defined as
\[
P(\lambda) = C + \lambda\, R^{-1} K^{-1}\, p, \tag{2.19}
\]
where λ is the positive scaling factor defining the position of the 3D point along the ray. In the case that Z is known, the coordinates X and Y can be obtained by calculating λ using the relation
\[
\lambda = \frac{Z - C_z}{\left(R^{-1} K^{-1}\, p\right)_z}, \tag{2.20}
\]
where (R^{-1}K^{-1}p)_z denotes the third component of the back-projected ray direction.
The back-projection operation is important for depth estimation and image rendering, which are extensively addressed later in this thesis. For depth estimation, an assumption is made for the value of Z and the corresponding 3D point is calculated. With an iterative procedure, an appropriate depth value is then selected from a set of assumed depth candidates.
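Back-projection via Equations (2.19)-(2.20) can be sketched as follows; the camera pose, intrinsics, pixel, and depth are illustrative assumptions.

```python
import numpy as np

# Illustrative camera (assumed values): axis-aligned, centre behind origin.
f, ox, oy = 800.0, 320.0, 240.0
K = np.array([[f, 0.0, ox], [0.0, f, oy], [0.0, 0.0, 1.0]])
R = np.eye(3)
C = np.array([0.0, 0.0, -1.0])

p = np.array([360.0, 220.0, 1.0])            # pixel in homogeneous coordinates
d = np.linalg.inv(R) @ np.linalg.inv(K) @ p  # ray direction R^-1 K^-1 p

Z = 3.0                                      # assumed known depth
lam = (Z - C[2]) / d[2]                      # Equation (2.20)
P = C + lam * d                              # Equation (2.19): point on the ray
```

Projecting P back through K R (P - C) returns the original pixel, confirming that the recovered 3D point lies on the viewing ray of p.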
2.2.3 Coordinate system conversion
Sometimes, coordinate systems need to be converted to obtain a more efficient computation procedure. We now present two methods that transform the projection matrix, so that new coordinate systems can be employed, and we provide applications of these methods.
A. Changing the image coordinate system
The definition of the coordinate system in 3D image processing is not uniformly chosen. For example, the calibration parameters of the MPEG test sequences “Breakdancers” and “Ballet” [1] assume a left-handed coordinate system. However, a right-handed coordinate system is usually employed in the literature. Therefore, we outline a method for converting the image coordinate system.
Typically, pixel coordinates are defined such that the origin of the 2D image coordinate system is located at the top left of the image. In this case, the x and y axes point horizontally to the right and vertically downward, respectively (convention 1). However, an alternative convention is to locate the origin of the image coordinate system at the bottom left, with the y image axis pointing vertically upward. To transform the image coordinate system, it is necessary to flip the y image axis and translate the origin along the y image axis. This can be performed using the matrix denoted B_1 (see Equation (2.21)). Additionally, one can distinguish two possible conventions for defining the orientation of the 3D world axes: either a left-handed or a right-handed coordinate system can be adopted. The conversion of a left-handed to a right-handed coordinate system can be performed by flipping the Y world axis^1 (matrix B_2). By concatenating the two conversion matrices B_1 and B_2 with the original projection matrix M, one obtains the converted projection matrix
\[
M' = B_1\, M\, B_2 =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & h-1 \\ 0 & 0 & 1 \end{pmatrix}
M
\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \tag{2.21}
\]
where h corresponds to the height of the image and M = K[R | -RC] denotes the original 3 × 4 projection matrix. The converted projection matrix is then defined in an image coordinate system following the convention 1 notation and in a right-handed world coordinate system. Finally, it should be noted that the conversion of the image coordinate system is achieved by modifying the intrinsic parameters, while the conversion of the world coordinate system is achieved by transforming the extrinsic parameters. This is the reason why the conversion matrices B_1 and B_2 appear as left and right factors in Equation (2.21).
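The conversion of Equation (2.21) can be sketched as follows; the image height, the intrinsic parameters, and the test point are illustrative, and the camera is placed at the world origin for simplicity.

```python
import numpy as np

h = 480                                    # assumed image height in pixels

# B1: flip and translate the y image axis (0-indexed pixel rows).
B1 = np.array([[1.0, 0.0, 0.0],
               [0.0, -1.0, h - 1.0],
               [0.0, 0.0, 1.0]])
# B2: flip the Y world axis (left- to right-handed conversion).
B2 = np.diag([1.0, -1.0, 1.0, 1.0])

# Illustrative projection matrix: camera at the origin, axis-aligned.
f, ox, oy = 800.0, 320.0, 240.0
K = np.array([[f, 0.0, ox], [0.0, f, oy], [0.0, 0.0, 1.0]])
M = K @ np.hstack([np.eye(3), np.zeros((3, 1))])

M_conv = B1 @ M @ B2

# A point projecting to (x, y) under M projects to (x, h - 1 - y) under
# M_conv once the Y coordinate of the world point is negated.
```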
B. Changing the world coordinate system
Such a conversion is used for respecifying depth images in a new world coordinate system. This conversion involves the calculation of the position of a 3D point specified in another camera coordinate system and the projection of this 3D point onto the corresponding image plane. The modification of the location and orientation of the world coordinate system is performed in a way similar to the above-described method. Figure 2.7 illustrates the definition of two world coordinate systems.
The modification of the world coordinate system involves the simultaneous conversion of the projection matrix and of the coordinates of the 3D point. Considering a 3D world point P and a camera defined by a projection matrix with intrinsic and extrinsic parameters K, R and C, the coordinate-system conversion can be carried out in two steps. First, the projection matrix is specified in the new world coordinate system, where only the position and orientation of the camera, i.e., the extrinsic parameters, need to be modified. The extrinsic parameters are converted using the position C_n and orientation R_n of the new coordinate system, defined with respect to the original coordinate system. Second, the position of the 3D point P is specified in the new coordinate system. The coordinate-system conversion can be written as
\[
\lambda\, p = \left( K\,[\, R \;|\; -RC \,]
\begin{pmatrix} R_n^{T} & C_n \\ 0_3^{T} & 1 \end{pmatrix} \right)
\left( \begin{pmatrix} R_n & -R_n C_n \\ 0_3^{T} & 1 \end{pmatrix} P \right),
\]
where the first factor is the projection matrix expressed in the new world coordinate system, the second factor is the 3D point expressed in the new world coordinate system, p represents the projected pixel position, and the all-zero column vector is denoted by 0_3. Since the product of the two middle matrices is the identity, the projected pixel position is unchanged by the conversion.
^1 For clarity, the image coordinate axes are labeled in lower case and the world coordinate axes are labeled in upper case.
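The two-step conversion can be sketched as follows: the projection matrix is multiplied on the right by the new-to-original transform and the 3D point by its inverse, leaving the projected pixel position unchanged. The pose R_n, C_n and all other numeric values are illustrative assumptions.

```python
import numpy as np

# Illustrative camera (assumed values).
f, ox, oy = 800.0, 320.0, 240.0
K = np.array([[f, 0.0, ox], [0.0, f, oy], [0.0, 0.0, 1.0]])
R = np.eye(3)
C = np.array([0.5, 0.0, -2.0])
M = K @ np.hstack([R, (-R @ C)[:, None]])   # original projection matrix

# Hypothetical pose of the new world coordinate system in the original one.
a = np.deg2rad(30.0)
R_n = np.array([[np.cos(a), -np.sin(a), 0.0],
                [np.sin(a), np.cos(a), 0.0],
                [0.0, 0.0, 1.0]])
C_n = np.array([1.0, 2.0, 0.0])

# 4x4 homogeneous transform between the two systems, and its inverse.
T = np.eye(4)
T[:3, :3] = R_n.T
T[:3, 3] = C_n
T_inv = np.linalg.inv(T)

P = np.array([0.0, 0.0, 3.0, 1.0])   # point in the original system
M_new = M @ T                         # projection matrix in the new system
P_new = T_inv @ P                     # the same point in the new system

p1 = M @ P                            # original projection
p2 = M_new @ P_new                    # projection after conversion
```

Because T and its inverse cancel, both projections give the same pixel; this is exactly the simultaneous-conversion property stated above.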