2.2  Pinhole camera model

In this section, we describe the image acquisition process known as the pinhole camera model, which is used as a basis throughout this thesis. More specifically, we first discuss the model that integrates the internal or intrinsic camera parameters, such as the focal length and the lens distortion. Second, we extend this simple camera model to integrate the external or extrinsic camera parameters corresponding to the position and orientation of the camera.

2.2.1  Intrinsic camera parameters

The pinhole camera model defines the geometric relationship between a 3D point and its corresponding 2D projection onto the image plane. When using a pinhole camera model, this geometric mapping from 3D to 2D is called a perspective projection. We denote the center of the perspective projection (the point in which all rays intersect) as the optical center or camera center, and the line perpendicular to the image plane passing through the optical center as the optical axis (see Figure 2.2). Additionally, the intersection point of the image plane with the optical axis is called the principal point. The pinhole camera that models a perspective projection of 3D points onto the image plane can be described as follows.

A. Perspective projection using homogeneous coordinates

Let us consider a camera with the optical axis being collinear to the Zcam-axis and the optical center being located at the origin of a 3D coordinate system (see Figure 2.2).

Figure 2.2 The ideal pinhole camera model describes the relationship between a 3D point (X,Y,Z)T and its corresponding 2D projection (u,v) onto the image plane.

The projection of a 3D world point (X,Y,Z)^T onto the image plane at pixel position (u,v)^T can be written as

\[
u = \frac{Xf}{Z} \quad \text{and} \quad v = \frac{Yf}{Z},
\]

where f denotes the focal length. To avoid such a non-linear division operation, the previous relation can be reformulated using the projective geometry framework, as

\[
(\lambda u, \lambda v, \lambda)^T = (Xf, Yf, Z)^T.
\tag{2.4}
\]
This relation can then be expressed in matrix notation as

\[
\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
= \begin{pmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix},
\tag{2.5}
\]

where λ = Z is the homogeneous scaling factor.

B. Principal-point offset

Most current imaging systems define the origin of the pixel coordinate system at the top-left pixel of the image. However, it was previously assumed that the origin of the pixel coordinate system corresponds to the principal point (ox,oy)^T, located at the center of the image (see Figure 2.3(a)). A conversion of coordinate systems is thus necessary. Using homogeneous coordinates, the principal-point position can be readily integrated into the projection matrix. The perspective projection equation now becomes
\[
\lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
= \begin{pmatrix} f & 0 & o_x & 0 \\ 0 & f & o_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}.
\tag{2.6}
\]

C. Image-sensor characteristics

To derive the relation described by Equation (2.6), it was implicitly assumed that the pixels of the image sensor are square, i.e., that the aspect ratio is 1 : 1 and that pixels are not skewed. However, both assumptions may not always be valid. First, for example, an NTSC TV system defines non-square pixels with an aspect ratio of 10 : 11. In practice, the pixel aspect ratio is often provided by the image-sensor manufacturer. Second, pixels can potentially be skewed, especially in the case that the image is acquired by a frame grabber. In this particular case, the pixel grid may be skewed due to an inaccurate synchronization of the pixel-sampling process. Both previously mentioned imperfections of the imaging system can be taken into account in the camera model using the parameters η and τ, which model the pixel aspect ratio and the skew of the pixels, respectively (see Figure 2.3(b)). The projection mapping can now be updated as
\[
\lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
= \begin{pmatrix} f & \tau & o_x & 0 \\ 0 & \eta f & o_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
= [\,K \;\; 0_3\,] P,
\tag{2.7}
\]

with P = (X,Y,Z,1)^T being a 3D point defined in homogeneous coordinates. In practice, when employing recent digital cameras, it can be safely assumed that pixels are square (η = 1) and non-skewed (τ = 0). The projection matrix that incorporates the intrinsic parameters is denoted by K throughout this thesis. The all-zero vector is denoted by 0_3.
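As an illustration, the intrinsic projection mapping above can be sketched in a few lines of NumPy. The function names and numeric values are illustrative, not part of the thesis:

```python
import numpy as np

def intrinsic_matrix(f, ox, oy, eta=1.0, tau=0.0):
    """Build the 3x3 intrinsic matrix K with focal length f, principal
    point (ox, oy), pixel aspect ratio eta and skew tau."""
    return np.array([[f,   tau,     ox],
                     [0.0, eta * f, oy],
                     [0.0, 0.0,     1.0]])

def project(K, P):
    """Project a homogeneous 3D point P = (X, Y, Z, 1)^T to pixel (x, y)."""
    q = np.hstack([K, np.zeros((3, 1))]) @ P   # (λx, λy, λ)^T with λ = Z
    return q[:2] / q[2]                        # perspective division

K = intrinsic_matrix(f=800.0, ox=320.0, oy=240.0)  # square, non-skewed pixels
x, y = project(K, np.array([0.5, -0.25, 2.0, 1.0]))
```

Note that the perspective division by the homogeneous scaling factor λ = Z is deferred to the last line, exactly as in the matrix formulation.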


Figure 2.3 (a) The image (x,y) and camera (u,v) coordinate systems. (b) Non-ideal image sensor with non-square, skewed pixels.

D. Radial lens distortion

Real camera lenses typically suffer from non-linear lens distortion. In practice, radial lens distortion causes straight lines to be mapped onto curved lines. As seen in Figure 2.4, the radial lens distortion is more visible near the image borders, where the radial distance is large. A standard technique to model the radial lens distortion can be described as follows.

Figure 2.4 Real camera lenses suffer from radial lens distortion that causes straight lines to be bent. Pixel grid of an (a) undistorted and (b) distorted image.

Let (xu,yu)T and (xd,yd)T be the corrected and the measured distorted pixel positions, respectively. The relation between an undistorted and distorted pixel can be modeled with a polynomial function and can be written as

\[
\begin{pmatrix} x_u - o_x \\ y_u - o_y \end{pmatrix}
= L(r_d) \begin{pmatrix} x_d - o_x \\ y_d - o_y \end{pmatrix},
\tag{2.8}
\]

where

\[
L(r_d) = 1 + k_1 r_d^2 \quad \text{and} \quad r_d^2 = (x_d - o_x)^2 + (y_d - o_y)^2.
\tag{2.9}
\]

In the case k1 = 0, it can be noted that xu = xd and yu = yd, which corresponds to the absence of radial lens distortion.
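The forward correction of Equations (2.8) and (2.9) can be sketched as follows, assuming the distortion parameters k1 and (ox,oy)^T are known; the function name is illustrative:

```python
import numpy as np

def undistort_point(xd, yd, k1, ox, oy):
    """Map a measured distorted pixel to its corrected position by applying
    the radial factor L(r_d) = 1 + k1 * r_d^2 around the principal point."""
    rd2 = (xd - ox) ** 2 + (yd - oy) ** 2   # squared radial distance r_d^2
    L = 1.0 + k1 * rd2                      # distortion factor L(r_d)
    xu = ox + L * (xd - ox)
    yu = oy + L * (yd - oy)
    return xu, yu

# k1 = 0 leaves the pixel untouched, matching the absence of distortion
assert undistort_point(100.0, 80.0, 0.0, 64.0, 48.0) == (100.0, 80.0)
```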

It should be noted that Equation (2.8) provides the corrected pixel position as a function of the distorted pixel position. However, to generate an undistorted image, it is more convenient to express the function L(r) in terms of the undistorted pixel position. This technique is usually known as the inverse mapping method. The inverse mapping technique consists of scanning each pixel in the output image and re-sampling and interpolating the correct pixel from the input image. To perform an inverse mapping, the inversion of the radial lens distortion model is necessary, which can be described as follows. First, similar to the second part of Equation (2.9), we define

\[
r_u^2 = (x_u - o_x)^2 + (y_u - o_y)^2.
\tag{2.10}
\]

Then, taking the squared norm of both sides of Equation (2.8), it can be derived that

\[
(x_u - o_x)^2 + (y_u - o_y)^2 = L(r_d)^2 \left( (x_d - o_x)^2 + (y_d - o_y)^2 \right),
\tag{2.11}
\]

which is equivalent to

\[
r_u = L(r_d) \cdot r_d.
\tag{2.12}
\]

When taking into account Equation (2.9), this equation can be rewritten as a cubic polynomial:

\[
r_d^3 + \frac{1}{k_1} r_d - \frac{r_u}{k_1} = 0.
\tag{2.13}
\]

The inverted lens-distortion function can be derived by substituting Equation (2.12) into Equation (2.8) and solving for the distorted coordinates:

\[
\begin{pmatrix} x_d - o_x \\ y_d - o_y \end{pmatrix}
= \frac{r_d}{r_u} \begin{pmatrix} x_u - o_x \\ y_u - o_y \end{pmatrix},
\tag{2.14}
\]

where rd can be calculated by solving the cubic polynomial function of Equation (2.13). This polynomial can be solved using Cardano’s method, by first calculating the discriminant Δ defined as Δ = q^2 + (4/27)p^3, where p = 1/k1 and q = -ru/k1. Depending on the sign of the discriminant, three cases are possible.

If Δ > 0, then the equation has one real root rd1 defined as

\[
r_{d1} = \sqrt[3]{\frac{-q + \sqrt{\Delta}}{2}} + \sqrt[3]{\frac{-q - \sqrt{\Delta}}{2}}.
\tag{2.15}
\]

If Δ < 0, then the equation has three real roots rdk defined by

\[
r_{dk} = 2 \sqrt{-\frac{p}{3}} \cos\left( \frac{\arccos\left( \frac{-q}{2} \sqrt{\frac{-27}{p^3}} \right) + 2k\pi}{3} \right),
\tag{2.16}
\]

for k = {0,1,2}, where an appropriate solution rdk should be selected such that rdk > 0 and rdk < ru. However, only a single radius corresponds to the practical solution, so the second case Δ < 0 should not be encountered. The third case, Δ = 0, is also impractical. In practice, we have indeed noticed that these second and third cases never occur.
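For the practical case Δ > 0, the inversion can be sketched in NumPy as follows; the single real root of the cubic of Equation (2.13) is evaluated with Cardano’s formula, and the inverted mapping then rescales the radial displacement. The function names are illustrative:

```python
import numpy as np

def distort_radius(ru, k1):
    """Solve r_d^3 + r_d/k1 - r_u/k1 = 0 (Equation (2.13)) for the single
    real root, assuming the discriminant is positive."""
    p = 1.0 / k1
    q = -ru / k1
    delta = q ** 2 + (4.0 / 27.0) * p ** 3
    assert delta > 0, "only the one-real-root case occurs in practice"
    s = np.sqrt(delta)
    # Cardano's formula; np.cbrt handles the negative operand correctly
    return np.cbrt((-q + s) / 2.0) + np.cbrt((-q - s) / 2.0)

def distort_point(xu, yu, k1, ox, oy):
    """Inverse mapping: distorted position as a function of the undistorted
    one, by scaling the radial displacement with r_d / r_u."""
    ru = np.hypot(xu - ox, yu - oy)
    if ru == 0.0:
        return xu, yu                  # the principal point is a fixed point
    rd = distort_radius(ru, k1)
    xd = ox + (rd / ru) * (xu - ox)
    yd = oy + (rd / ru) * (yu - oy)
    return xd, yd
```

When generating an undistorted image, `distort_point` is evaluated for every output pixel and the input image is re-sampled at the returned position.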

As an example, Figure 2.5 depicts a distorted image and the corresponding corrected image obtained using the inverse mapping method, with Δ > 0.


Figure 2.5 (a) Distorted image. (b) Corresponding corrected image obtained using the inverse mapping method.

Estimation of the distortion parameters

The discussed lens-distortion correction method requires knowledge of the lens parameters, i.e., k1 and (ox,oy)^T. The distortion parameters can be estimated by minimizing a cost function that measures the curvature of lines in the distorted image. To measure this curvature, a practical solution is to detect feature points belonging to the same line on a calibration rig, e.g., a checkerboard calibration pattern (see Figure 2.5). The points belonging to the same line form a bent curve in the distorted image instead of a straight line [99]. By measuring the deviation of this curve from the theoretical straight-line model, the distortion parameters can be calculated.
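One possible form of such a curvature cost is sketched below: for each group of feature points detected on the same calibration line, a least-squares line is fitted and the residual is accumulated. Using the smallest singular value of the centered point set as the straightness measure is our illustration here, not necessarily the formulation of [99]:

```python
import numpy as np

def straightness_cost(lines):
    """Sum of squared distances between feature points and their
    least-squares line, accumulated over all detected calibration lines;
    the cost is zero for perfectly straight (undistorted) lines."""
    cost = 0.0
    for pts in lines:                   # pts: (N, 2) array of collinear features
        centered = pts - pts.mean(axis=0)
        # the smallest singular value measures the deviation from the
        # best-fit line through the centroid
        cost += np.linalg.svd(centered, compute_uv=False)[-1] ** 2
    return cost
```

In a full calibration, the candidate parameters k1 and (ox,oy)^T would be applied to the detected points before evaluating this cost, and a standard numerical optimizer would minimize it.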

2.2.2  Extrinsic parameters

As opposed to the intrinsic parameters that describe the internal properties of the camera (focal length, radial lens distortion parameters), the extrinsic parameters indicate the external position and orientation of the camera in the 3D world. Mathematically, the position and orientation of the camera are defined by a 3 × 1 vector C and by a 3 × 3 rotation matrix R (see Figure 2.6).

Figure 2.6 The relationship between the camera and world coordinate system is defined by the camera center C and the rotation R of the camera.

To obtain the pixel position p = (x,y,1)^T of a 3D homogeneous world point P, the point is first translated, such that the camera center coincides with the world coordinate origin, and subsequently rotated. This can be mathematically written as

\[
\lambda p = [\,K \;|\; 0_3\,]
\begin{bmatrix} R & 0_3 \\ 0_3^T & 1 \end{bmatrix}
\begin{bmatrix} I_3 & -C \\ 0_3^T & 1 \end{bmatrix} P.
\tag{2.17}
\]

Alternatively, when combining matrices, Equation (2.17) can be reformulated as

\[
\lambda p = [\,K \;|\; 0_3\,]
\begin{bmatrix} R & -RC \\ 0_3^T & 1 \end{bmatrix} P
= KR \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} - KRC.
\tag{2.18}
\]

Back-projection of a 2D point to 3D

Previously, the process of projecting a 3D point onto the 2D image plane was described. We now present how a 2D point can be back-projected to 3D space and derive the corresponding coordinates. Considering a 2D point p in an image, there exists a collection of 3D points that are projected onto the same point p. This collection of 3D points constitutes a ray connecting the camera center C = (Cx,Cy,Cz)^T and p = (x,y,1)^T. From Equation (2.18), the ray P(λ) associated with a pixel p = (x,y,1)^T can be defined as
\[
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}
= \underbrace{C + \lambda R^{-1} K^{-1} p}_{\text{ray } P(\lambda)},
\tag{2.19}
\]

where λ is the positive scaling factor defining the position of the 3D point on the ray. If Z is known, it is possible to obtain the coordinates X and Y by calculating λ using the relation

\[
\lambda = \frac{Z - C_z}{z_3}, \quad \text{where} \quad (z_1, z_2, z_3)^T = R^{-1} K^{-1} p.
\tag{2.20}
\]

The back-projection operation is important for depth estimation and image rendering, which will be extensively addressed later in this thesis. For depth estimation, this would mean that an assumption is made for the value of Z and the corresponding 3D point is calculated. With an iterative procedure, an appropriate depth value is selected from a set of assumed depth candidates.
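The back-projection relation above can be sketched as follows: the ray direction is obtained from the pixel, the scaling factor λ from the known depth Z, and the 3D point by walking along the ray. The function name and numeric values are illustrative:

```python
import numpy as np

def back_project(K, R, C, p, Z):
    """Recover the 3D point at known depth Z on the ray through
    pixel p = (x, y, 1)^T, back-projected from the camera center C."""
    d = np.linalg.inv(R) @ np.linalg.inv(K) @ p   # ray direction (z1, z2, z3)^T
    lam = (Z - C[2]) / d[2]                       # scaling factor λ
    return C + lam * d                            # point on the ray P(λ)

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)
C = np.array([0.0, 0.0, -2.0])
P = back_project(K, R, C, np.array([420.0, 190.0, 1.0]), Z=2.0)
```

In a depth-estimation loop, this function would be evaluated for each assumed depth candidate Z, and the best candidate retained.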

2.2.3  Coordinate system conversion

Sometimes, coordinate systems need to be converted to obtain a more efficient computation procedure. Let us now propose two methods that transform the projection matrix, so that new coordinate systems can be employed. We also provide applications of these methods.

A. Changing the image coordinate system

The definition of the coordinate system in 3D image processing is not uniformly chosen. For example, the calibration parameters of the MPEG test sequences “Breakdancers” and “Ballet” [1] assume a left-handed coordinate system. However, a right-handed coordinate system is usually employed in the literature. Therefore, we outline a method for converting the image coordinate system.

Typically, pixel coordinates are defined such that the origin of the 2D image coordinate system is located at the top left of the image. In this case, the x and y axes point horizontally to the right and vertically downward, respectively (convention 1). However, an alternative convention is to locate the origin of the image coordinate system at the bottom left, with the y image axis pointing vertically upwards. To transform the image coordinate system, it is necessary to flip the y image axis and translate the origin along the y image axis. This can be performed using the matrix denoted B1 (see Equation (2.21)). Additionally, one can distinguish two possible conventions for defining the orientation of the 3D world axes: either a left-handed or a right-handed coordinate system can be adopted. The conversion of a left-handed to a right-handed coordinate system can be performed by flipping the Y world axis¹ (matrix B2). By concatenating the two conversion matrices B1 and B2 with the original projection matrix, one can obtain the converted projection matrix

\[
\lambda p = \underbrace{\underbrace{\begin{pmatrix} 1 & 0 & 0 \\ 0 & -1 & h-1 \\ 0 & 0 & 1 \end{pmatrix}}_{B_1}
[\,K \;|\; 0_3\,] \begin{bmatrix} R & -RC \\ 0_3^T & 1 \end{bmatrix}
\underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{B_2}}_{\text{converted projection matrix}} P,
\tag{2.21}
\]

where h corresponds to the height of the image. The obtained converted projection matrix is then defined in an image coordinate system following convention 1 and in a right-handed world coordinate system. Finally, it should be noted that the conversion of the image coordinate system is achieved by modifying the intrinsic parameters, while the conversion of the world coordinate system is obtained by transforming the extrinsic parameters. This is the reason why the conversion matrices B1 and B2 are placed as left and right terms in Equation (2.21).

B. Changing the world coordinate system

A conversion is used for re-specifying depth images into a new world coordinate system. This conversion involves the calculation of the position of a 3D point specified in another camera coordinate system and the projection of this 3D point onto the other image plane. The modification of the location and orientation of the world coordinate system is performed in a way similar to the above-described method. Figure 2.7 illustrates the definition of two world coordinate systems.

Figure 2.7 A 3D point P can be defined in two different world coordinate systems. Defining P in a new coordinate system involves (1) the conversion of its 3D coordinates and (2) the conversion of the extrinsic parameters.

The modification of the world coordinate system involves the simultaneous conversion of the projection matrix and the coordinates of the 3D point. Considering a 3D world point P and a camera defined with a projection matrix with intrinsic and extrinsic parameters K, R and C, the coordinate-system conversion can be carried out in two steps. First, specify the projection matrix in a new world coordinate system, where only the position and orientation of the camera, i.e., the extrinsic parameters, should be modified. The extrinsic parameters are converted using the position Cn and orientation Rn of the new coordinate system defined with respect to the original coordinate system. Second, specify the position of the 3D point P in the new coordinate system. The coordinate-system conversion can be written as

\[
\lambda p = [\,K \;|\; 0_3\,]
\underbrace{\begin{bmatrix} R & -RC \\ 0_3^T & 1 \end{bmatrix}
\begin{bmatrix} R_n^T & C_n \\ 0_3^T & 1 \end{bmatrix}}_{\text{converted extrinsic parameters}}
\underbrace{\begin{bmatrix} R_n & -R_n C_n \\ 0_3^T & 1 \end{bmatrix} P}_{\text{converted 3D position}},
\tag{2.22}
\]

where p represents the projected pixel position and the all-zero vector is denoted by 0_3.
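Expanding the matrix products of Equation (2.22) gives the converted extrinsic parameters R' = R Rn^T and C' = Rn(C − Cn), and the converted point P' = Rn(P − Cn). A small NumPy sketch (the function name is illustrative) makes it easy to check that the camera-space coordinates, and hence the projection, are unchanged by the conversion:

```python
import numpy as np

def convert_world_coordinates(R, C, Rn, Cn, P):
    """Express the extrinsic parameters (R, C) and a 3D point P in a new
    world coordinate system with origin Cn and orientation Rn, following
    the factorization of Equation (2.22)."""
    R_new = R @ Rn.T          # converted rotation
    C_new = Rn @ (C - Cn)     # converted camera center
    P_new = Rn @ (P - Cn)     # converted 3D position of the point
    return R_new, C_new, P_new

# 90-degree rotation about the Z-axis as the new world orientation
Rn = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
R, C = np.eye(3), np.array([1.0, 2.0, 3.0])
Cn, P = np.array([0.5, 0.0, 1.0]), np.array([4.0, 5.0, 6.0])
R2, C2, P2 = convert_world_coordinates(R, C, Rn, Cn, P)
# the camera-space coordinates R(P - C) are invariant under the conversion
```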

1For clarity, the image coordinate axes are labeled in lower case and the world coordinate axes are labeled in upper case.