4.1  Introduction


This section starts by reviewing image rendering techniques and their advantages and disadvantages with respect to 3D-TV and free-viewpoint video applications. For each technique, we consider the constituent stages of the multi-view video processing chain: acquisition, transmission and rendering of multi-view video. Afterwards, we describe two multi-view video systems, relying either on a strong calibration (Section 2.3) or a weak calibration (Section 2.4) of the multi-view setup. It is shown that both forms of calibration enable efficient acquisition and rendering. However, in the last part of this introduction, it will become clear that the compression of multi-view video can be performed more efficiently with a strongly calibrated system. In the previous chapter, calibrated depth images were estimated for each view. Hence, the depth images and the calibration parameters can be beneficially employed for rendering synthetic images. Later in this thesis, the same rendering techniques are integrated into the coding loop to exploit inter-view redundancy.

4.1.1  Review of image rendering techniques


Synthetic images can be rendered using either texture-only images, or a combination of a 3D geometric model and texture images. Rendering techniques can therefore be classified according to the level of detail of their geometric description of the scene [92].

Rendering without geometry.

A first class of techniques uses multiple views (N-texture) of the scene but does not require any geometric description of it. For example, Light Field [50] and Lumigraph [36] record a 3D scene by capturing all rays traversing a 3D volume using a large array of cameras. Each pixel of the virtual view is then interpolated from the captured rays, using the positions of the input and virtual cameras. However, such a rendering technique requires capturing the video scene with a large camera array, i.e., oversampling the scene. For 3D-TV and free-viewpoint video applications, such an oversampling is inefficient because a large number of video streams needs to be transmitted.
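
To make the ray-interpolation idea concrete, the following minimal sketch (in Python, with names of our own choosing) blends the two cameras of a dense 1D array that are nearest to the virtual viewpoint. Real Light Field and Lumigraph renderers resample individual rays using a two-plane parameterization rather than blending whole images, so this is only an illustration of the principle.

    import numpy as np

    def interpolate_view_1d(images, cam_positions, virtual_pos):
        """Blend the two cameras nearest to the virtual viewpoint.

        A strong simplification of Light Field / Lumigraph rendering:
        the scene is assumed to be densely sampled by a 1D camera
        array, so whole images of the two closest cameras are simply
        blended linearly.
        """
        cam_positions = np.asarray(cam_positions, dtype=float)
        # Index of the camera immediately left of the virtual position.
        i = int(np.clip(np.searchsorted(cam_positions, virtual_pos) - 1,
                        0, len(cam_positions) - 2))
        # Blending weight from the distances to the two neighbours.
        span = cam_positions[i + 1] - cam_positions[i]
        alpha = (virtual_pos - cam_positions[i]) / span
        return (1.0 - alpha) * images[i] + alpha * images[i + 1]

    # Example: three synthetic 'views' captured along a line.
    views = [np.full((2, 2, 3), v, dtype=float) for v in (0.0, 0.5, 1.0)]
    print(interpolate_view_1d(views, [0.0, 1.0, 2.0], 0.5))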

Rendering with geometry.

A second class of techniques involves a geometric description of the scene for rendering. One approach uses polygons to represent the geometry of the scene, i.e., the surface of objects [19]. Virtual images are then synthesized using a planar-texture mapping algorithm (see Section 2.4.2). However, the acquisition of 3D polygons is a difficult task. An alternative method is to associate a depth image with the texture image. Using a depth image, new views can subsequently be rendered using a Depth Image Based Rendering (DIBR) algorithm. DIBR algorithms include, among others, Layered Depth Images [89], view morphing [87], point clouds [106] and image warping [56]. We refer to the technique of associating one depth image with each texture image [38, 82, 110] as the N-depth/N-texture representation format. The two advantages of the N-depth/N-texture approach are that (1) the data format features a compact representation, and (2) high-quality views can be rendered. For these reasons, this data format provides a suitable representation for 3D-TV and free-viewpoint video applications. Consequently, the N-depth/N-texture representation format was adopted in this thesis.
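
As an illustration of the N-depth/N-texture format, the following sketch pairs each texture image with its per-pixel depth map and camera parameters. The container and its field names are our own, hypothetical choices, not a data structure defined in this thesis.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class CalibratedView:
        """One entry of an N-depth/N-texture multi-view sequence."""
        texture: np.ndarray  # H x W x 3 colour image
        depth: np.ndarray    # H x W per-pixel depth, in world units
        K: np.ndarray        # 3 x 3 intrinsic matrix
        R: np.ndarray        # 3 x 3 rotation (world -> camera)
        t: np.ndarray        # translation vector (world -> camera)

    # An N-view sequence is then simply a list of N such views:
    # sequence = [CalibratedView(tex_i, depth_i, K_i, R_i, t_i)
    #             for i in range(N)]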

4.1.2  Acquisition and rendering using N-depth/N-texture


Two possible approaches can be distinguished for rendering images using an N-depth/N-texture representation format.

Weakly calibrated cameras.

A first approach [25] enables the synthesis of intermediate views along a chain of cameras. The algorithm estimates the epipolar geometry between each pair of successive cameras and rectifies the images pairwise (see Figure 4.1(a)). Disparity images are then estimated for each pair of cameras (see Section 3.2). Next, synthetic views are interpolated using an algorithm similar to View Morphing [87], as sketched below. This approach of working along a chain of cameras has two advantages. First, the acquisition is flexible, because the epipolar geometry can be estimated without any calibration device [37]. Second, a real-time rendering implementation [25] based on a standard Graphics Processing Unit (GPU) is feasible and has been demonstrated. However, this rendering technique does not enable the user to navigate freely within the 3D scene. Instead, the viewer can only select a viewpoint located on the poly-line connecting all camera centers along the chain of cameras.
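
For illustration, the following simplified sketch (our own names and conventions; integer splatting, no hole filling) interpolates a virtual view between a rectified pair. On rectified images the disparity is purely horizontal, so a left pixel at column x maps to approximately column x - alpha * d for a virtual camera at fractional position alpha between the pair.

    import numpy as np

    def interpolate_rectified(left, disparity, alpha):
        """Forward-map the left image to a view between a rectified pair.

        A pixel at column x in the left view appears near column
        x - alpha * d(x) in a virtual view at fractional position
        alpha (0 = left camera, 1 = right camera). Simplified sketch:
        nearest-pixel splatting, no occlusion or hole handling.
        """
        h, w = disparity.shape
        virtual = np.zeros_like(left)
        for y in range(h):
            for x in range(w):
                xv = int(round(x - alpha * disparity[y, x]))
                if 0 <= xv < w:
                    virtual[y, xv] = left[y, x]
        return virtual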

Strongly calibrated cameras.

An alternative method [110] employs a similar video capturing system composed of multiple cameras. As opposed to the above approach, the cameras are fully calibrated prior to the capture session (see Figure 4.1(b)), so that a depth image can subsequently be estimated for each view. Given the depth information, view synthesis can then be performed with 3D image warping techniques (a sketch follows Figure 4.1). Compared with a weakly calibrated multi-view setup, a first advantage of this approach is that it enables the user to navigate freely within the 3D scene. A second advantage is that the compression of multi-view video can be performed more efficiently by employing all camera parameters, an aspect that we discuss in the next section.

Figure 4.1 (a) For a weakly calibrated setup, the captured images are rectified pairwise. Disparity estimation and view interpolation between a pair of cameras are carried out on the rectified images. (b) For a strongly calibrated setup, the position and orientation of each camera are known. Depth images can therefore be estimated for each view, and a 3D image warping algorithm can be used to synthesize virtual views.
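
As an illustration of the 3D image warping step, the following sketch back-projects one pixel into world coordinates using its depth and the source calibration, and re-projects it into the virtual camera. It assumes the pinhole model lambda * m = K (R X + t); the function and variable names are our own.

    import numpy as np

    def warp_pixel(m_src, depth, K1, R1, t1, K2, R2, t2):
        """Warp one source pixel (x, y) into a virtual camera.

        The pixel is back-projected to the 3D world point X using its
        depth and the source calibration (K1, R1, t1), then
        re-projected into the virtual camera (K2, R2, t2).
        """
        m = np.array([m_src[0], m_src[1], 1.0])
        # Back-project: X = R1^T (depth * K1^-1 m - t1)
        X = R1.T @ (depth * (np.linalg.inv(K1) @ m) - t1)
        # Re-project into the virtual view and dehomogenize.
        m2 = K2 @ (R2 @ X + t2)
        return m2[:2] / m2[2]

Applying this mapping to every pixel of the source image constitutes a forward mapping; as discussed below, the rounding to integer destination positions is what produces the staircase artifacts and holes.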


4.1.3  Compression of N-depth/N-texture


To compress multi-view images efficiently, the redundancy between neighboring views should be exploited. One approach for doing so is to employ image rendering in a predictive fashion. More specifically, the rendering procedure can be employed to predict, or approximate, the view captured by the camera to be predicted, as sketched below. In this section, we describe several rendering algorithms that enable such a prediction of views. The integration of the rendering engine into the corresponding coding algorithm is discussed in Chapter 5.
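
Schematically, this predictive use of rendering can be captured as follows; the sketch is a simplification of our own making, not the coding scheme of Chapter 5. The encoder transmits the reference view, its depth, and the residual between the captured and the rendered view; the decoder repeats the rendering step and adds the residual back.

    import numpy as np

    def view_residual(captured, rendered):
        """Residual between a captured view and its rendered prediction.

        In a predictive scheme, only this residual (plus the reference
        texture and depth used for rendering) needs to be encoded.
        """
        return captured.astype(np.int16) - rendered.astype(np.int16)

    def reconstruct_view(rendered, residual):
        """Decoder side: repeat the rendering, add the residual back."""
        return np.clip(rendered.astype(np.int16) + residual,
                       0, 255).astype(np.uint8)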

Weakly calibrated cameras.

To predict an image using weakly calibrated cameras, the disparity image can be employed. However, as shown in Section 3.2, disparity images are estimated pairwise, i.e., between a left and a right texture image. Consequently, the disparity image and the corresponding left texture image enable only the prediction of the right image; it is not possible to predict any other view (see the sketch below).
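
This pairwise limitation can be made explicit with a small sketch (our names and conventions; integer disparities, no occlusion or hole handling): the left image and its disparity map splat onto the right view, and onto that view only.

    import numpy as np

    def predict_right(left, disparity):
        """Predict the right view of a rectified pair, and only that view.

        Each left pixel (y, x) is splatted to (y, x - d). Because the
        disparity d is a pairwise quantity, no other viewpoint can be
        predicted this way.
        """
        h, w = disparity.shape
        ys, xs = np.mgrid[0:h, 0:w]
        xr = xs - disparity.astype(int)
        valid = (xr >= 0) & (xr < w)
        pred = np.zeros_like(left)
        # Overlapping targets are overwritten: no occlusion handling.
        pred[ys[valid], xr[valid]] = left[ys[valid], xs[valid]]
        return pred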

Strongly calibrated cameras.

Alternatively, using a strongly calibrated camera setup, a depth image can be calculated for each view. As opposed to a disparity image, which only relates a left and a right image pairwise, a depth image specifies 3D-world coordinates that are shared by all views. Therefore, it is possible to synthesize a virtual view at the position of any selected camera, thereby enabling the prediction of multiple views from one reference view only, as illustrated below. Considering the problem of compression, a strongly calibrated multi-view acquisition system is therefore highly favorable.
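
Schematically, and assuming a dense warping routine warp_view() such as a per-pixel application of the warp sketched after Figure 4.1, one reference view and its depth map suffice to predict every other camera of the setup; the function below is our own illustrative shorthand.

    def predict_all_views(ref_texture, ref_depth, ref_cam, target_cams,
                          warp_view):
        """Predict every captured view from a single reference view.

        ref_cam and each element of target_cams are (K, R, t) tuples;
        warp_view() is a dense depth-image-based warping routine. One
        depth map thus provides the geometry for all predictions.
        """
        return [warp_view(ref_texture, ref_depth, ref_cam, cam)
                for cam in target_cams]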

Whereas image rendering has been an active field of research for multimedia applications [86], limited work has focused on image rendering within a multi-view coding framework. Two recent approaches have employed either a direct projection of pixels onto the virtual image plane [110], or a method known as point-cloud rendering [61]. Both techniques are similar to the 3D image warping algorithm, so that they also suffer from the known staircase artifacts and holes (undefined pixels). An accurate prediction of views requires an accurate image rendering algorithm. To perform predictive coding of views, the desired rendering algorithm should:

- render views accurately, i.e., avoid staircase artifacts and holes of undefined pixels;
- handle occluded regions correctly, for example by combining multiple source images;
- feature a low computational complexity, e.g., by allowing an implementation on a standard GPU.

In this chapter, we describe two Depth Image Based Rendering algorithms: the 3D image warping algorithm [56] and a mesh-based rendering technique. Next, we propose a variant of the relief texture mapping algorithm [77] for rendering. More specifically, we express relief texture mapping in an alternative formulation that better fits the camera calibration framework. We show that such a formulation enables the execution of the algorithm on a standard GPU, thereby significantly reducing the processing time. Because these three rendering methods are defined as a mapping from the source image to the synthetic destination image, they are referred to as forward mapping techniques. To circumvent the usual problems associated with forward mapping (hole artifacts), we propose a new alternative, called an inverse mapping rendering technique. This technique has the advantage of being simple, while at the same time accurately re-sampling synthetic pixels. Additionally, such an inverse mapping technique can easily combine multiple source images, such that occluded regions can be handled correctly. Therefore, we continue in Section 4.4 by presenting multiple techniques for properly handling occlusions. The chapter closes by evaluating the quality of synthetic images obtained with the described methods. A rough sketch of the inverse mapping idea is given below.
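
To give a first impression of inverse mapping before its detailed treatment later in this chapter, the following rough sketch (our own simplification; it assumes that a depth map is available at the virtual viewpoint, for instance obtained by forward-warping the source depth) back-projects every destination pixel and looks it up in the source texture, so that no destination pixel remains undefined.

    import numpy as np

    def inverse_map(src_tex, virt_depth, K_s, R_s, t_s, K_v, R_v, t_v):
        """Rough sketch of inverse mapping.

        Each *destination* pixel is back-projected using the depth
        known at the virtual viewpoint (K_v, R_v, t_v) and looked up
        in the source view (K_s, R_s, t_s). Every destination pixel
        receives a value, so the holes of forward mapping are avoided
        by construction.
        """
        h, w = virt_depth.shape
        Kv_inv = np.linalg.inv(K_v)
        out = np.zeros_like(src_tex)
        for y in range(h):
            for x in range(w):
                # Back-project the destination pixel to a world point.
                m = np.array([x, y, 1.0])
                X = R_v.T @ (virt_depth[y, x] * (Kv_inv @ m) - t_v)
                # Project into the source camera and sample it.
                ms = K_s @ (R_s @ X + t_s)
                u = int(round(ms[0] / ms[2]))
                v = int(round(ms[1] / ms[2]))
                if 0 <= u < src_tex.shape[1] and 0 <= v < src_tex.shape[0]:
                    out[y, x] = src_tex[v, u]
        return out

Because the lookup position is computed per destination pixel, sub-pixel positions can be resampled accurately (e.g., bilinearly) instead of the nearest-neighbour lookup used in this sketch.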