1.3  Three-dimensional video system layout


Because the previously discussed applications rely on multiple views of the scene, the 3D video technologies enabling them do not exclude each other and can be integrated into a single 3D video system. Specifically, several 3D video systems have been introduced to enable 3D-TV or free-viewpoint video applications. These systems can be divided into three classes according to the amount of 3D geometry they employ.

A first class of 3D video systems is based on multiple texture views of the video scene, called the N-texture representation format [55]. The N-texture approach forms the basis of the emerging Multi-view Video Coding (MVC) standard currently developed by the Joint Video Team (JVT) [102]. Because of the significant amount of data to be stored, the main challenge for the MVC standard is to define efficient coding and decoding tools.¹ To this end, a number of H.264/MPEG-4 AVC coding tools have been proposed and evaluated within the MVC framework. A first coding tool exploits the similarity between the views by multiplexing the captured views and encoding the resulting video stream with a modified H.264/MPEG-4 AVC encoder [26, 59]. A second coding tool equalizes the inter-view illumination to compensate for mismatches across the views captured by different cameras [45]. The latest description of the standard can be found in the Joint Draft 8.0 on Multi-view Video Coding [102]. One advantage of the N-texture representation format is that no 3D geometric description of the scene is required, which allows a simple video-processing chain at the encoder. However, this representation format involves a high-complexity decoder, for the following reason. A multi-view display supports a varying number of views at its input, which makes it impractical to prepare these views prior to transmission. Instead, intermediate views should be interpolated from the transmitted reference views at the decoder, where the display characteristics are known. Obtaining high-quality interpolated views requires a 3D geometric description of the scene, and therefore computationally expensive calculations at the receiver side.
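To give an impression of how such inter-view illumination equalization can operate, the following Python sketch fits a global scale and offset between the luminance of two views, which can then be applied to the reference view before inter-view prediction. This is a deliberately simplified, hypothetical illustration (the function names and the frame-global fit are our assumptions); the actual MVC tool operates block-wise inside the H.264/MPEG-4 AVC coding loop.

```python
import numpy as np

def estimate_illumination_model(ref_view, tgt_view):
    """Least-squares fit of tgt ~ scale * ref + offset over the whole frame.

    Hypothetical, frame-global simplification of inter-view illumination
    compensation; MVC applies such compensation on a block basis.
    """
    ref = ref_view.astype(np.float64).ravel()
    tgt = tgt_view.astype(np.float64).ravel()
    scale = np.mean((ref - ref.mean()) * (tgt - tgt.mean())) / np.var(ref)
    offset = tgt.mean() - scale * ref.mean()
    return scale, offset

def compensate_view(ref_view, scale, offset):
    """Map the reference view's luminance onto the target view's range."""
    out = scale * ref_view.astype(np.float64) + offset
    return np.clip(out, 0, 255).astype(np.uint8)
```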

A second class of 3D video systems relies on a partial 3D geometric description of the scene [27]. The scene geometry is typically described by a depth map, or depth image, that specifies for each pixel the distance between the corresponding point in the 3D world and the camera. Such a depth image is usually estimated from two images by identifying corresponding pixels in the multiple views, i.e., point correspondences that represent the same 3D scene point. Using depth images, new views can subsequently be rendered or synthesized by a Depth Image Based Rendering (DIBR) algorithm. Here, the term DIBR refers to a class of rendering algorithms that use depth and texture images simultaneously to synthesize virtual images. For a 3D-TV application, it is assumed that the scene is observed from a narrow field of view (short baseline distance between cameras). As a result, a combination of only one texture and one depth video sequence is sufficient to provide an appropriate rendering quality (1-depth/1-texture). The 1-depth/1-texture approach was recently standardized in Part 3 of the MPEG-C video specification [9]. This system is illustrated in Figure 1.3. However, for a video scene with rich 3D geometry, rendered virtual views typically show occluded regions that were not covered by the reference camera.


Figure 1.3 Overview of a 3D-TV video system that encodes and transmits one texture video along with one depth video (1-depth/1-texture). Note that the depth signal can be estimated using multiple texture camera views, of which only one view is transmitted.
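
The DIBR principle can be made concrete with a small sketch. Assuming rectified, parallel cameras, a pixel at depth Z shifts horizontally by the disparity d = f·B/Z when the camera is translated by the baseline B (f is the focal length in pixels). The Python sketch below forward-warps a texture image to such a virtual viewpoint; all names are illustrative assumptions, and practical DIBR algorithms add sub-pixel accuracy and hole filling.

```python
import numpy as np

def dibr_warp(texture, depth, focal_length, baseline):
    """Forward-warp `texture` to a virtual camera shifted by `baseline`.

    Minimal sketch assuming rectified, parallel cameras, so the warp
    reduces to a horizontal shift of disparity d = f * B / Z per pixel.
    """
    h, w = depth.shape
    depth = np.maximum(depth, 1e-3)           # guard against zero depth
    virtual = np.zeros_like(texture)          # black pixels = holes
    z_buffer = np.full((h, w), np.inf)
    disparity = np.round(focal_length * baseline / depth).astype(int)
    for y in range(h):
        for x in range(w):
            xv = x - disparity[y, x]          # sign depends on shift direction
            # z-buffering: keep the surface closest to the camera
            if 0 <= xv < w and depth[y, x] < z_buffer[y, xv]:
                z_buffer[y, xv] = depth[y, x]
                virtual[y, xv] = texture[y, x]
    return virtual
```

Pixels to which no reference pixel maps remain black; these holes are precisely the occluded regions mentioned above.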


A third class of 3D video systems addresses the occlusion problem by combining the two aforementioned classes (N-texture and 1-depth/1-texture), using one depth image for each texture image, i.e., N-depth/N-texture [110] (see Figure 1.4). This approach has multiple advantages. First, as previously highlighted, the problem of occluded regions can be addressed by combining multiple reference images that together cover all regions seen by the virtual camera. Second, the N-depth/N-texture representation format is compatible with different types of multi-view displays supporting a variable number of views. More specifically, because 3D geometry data is transmitted to the decoder, an arbitrary number of synthetic views matching the display characteristics can be interpolated. A final advantage is that the N-depth/N-texture representation format provides a natural extension of the 1-depth/1-texture representation format. Therefore, this approach allows a gradual transition from an already standardized technique (MPEG-C Part 3) to the next generation of 3D video systems. Because of these advantages, we have adopted the N-depth/N-texture representation format as the basis for all experiments and studies throughout this thesis.



Figure 1.4 The N-depth/N-texture representation format combines N texture views with N depth views.
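
As a minimal sketch of the first advantage, the fragment below merges two reference views that have been warped to the same virtual viewpoint (e.g., with a DIBR warp as sketched above), filling the occlusion holes of one view with pixels from the other. The names are our assumptions, and practical systems usually apply depth-weighted blending rather than this hard selection.

```python
import numpy as np

def merge_warped_views(warped_a, warped_b, holes_a):
    """Fill occlusion holes of one warped view from a second warped view.

    `holes_a` is a boolean (height, width) mask marking pixels of
    `warped_a` that received no texture during warping.
    """
    merged = warped_a.copy()
    merged[holes_a] = warped_b[holes_a]
    return merged
```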


¹ Video coding standards only define the decoding procedure and the corresponding bit stream, but do not specify the encoding algorithms.