5.1  Introduction


Current 3D video systems can be coarsely divided into three classes. The first class focuses on the 3D-TV application and relies on the 1-depth/1-texture 3D video representation format. Specifically, by combining one texture video and one depth video, synthetic views can be rendered using an image rendering algorithm and visualized on a stereoscopic display. The compression and transmission of the 1-depth/1-texture approach was recently standardized by Part 3 of the MPEG-C video specifications [2]. As already mentioned in the introductory chapter, the 1-depth/1-texture format relies on a single texture video combined with a corresponding depth video signal. However, for a video scene with rich 3D geometry, rendered virtual views typically show occluded regions that were not captured by the reference camera. We have seen in the previous chapter that occluded regions can be accurately rendered by combining multiple source images. Therefore, a second class of 3D video systems is based on multiple texture views of the video scene, called the N-texture representation format [55]. The N-texture approach forms the basis for the Multi-view Video Coding (MVC) standard currently developed by the Joint Video Team (JVT) [102] of ITU-T and ISO/IEC MPEG. However, the MVC standard does not involve the transmission of depth sequences, thereby reducing the quality of the rendered images. For this reason, a third class of 3D video systems is based on the N-depth/N-texture 3D video representation format. In October 2007, the JVT started an ad-hoc group specifically focusing on the compression of N-depth/N-texture 3D video [101]. Our work in this chapter originates from the period 2006-2007.
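To make the rendering step of the 1-depth/1-texture format concrete, the sketch below warps a texture image to a nearby virtual camera using the classical rectified-camera relation disparity = f·B/Z. This is a minimal illustration, not the rendering algorithm of this thesis; the function name and the calibration inputs `focal` and `baseline` are hypothetical.

```python
import numpy as np

def warp_to_virtual_view(texture, depth, focal, baseline):
    """Warp one texture image to a nearby virtual view using its depth map.

    Rectified-camera simplification: every pixel shifts horizontally by
    disparity = focal * baseline / depth (depth assumed > 0). Depth ordering
    is ignored for brevity; a full renderer would resolve overlaps.
    """
    h, w = depth.shape
    virtual = np.zeros_like(texture)  # unfilled pixels remain black
    disparity = np.round(focal * baseline / depth).astype(int)
    for y in range(h):
        for x in range(w):
            x_new = x - disparity[y, x]
            if 0 <= x_new < w:
                virtual[y, x_new] = texture[y, x]
    return virtual
```

The pixels that remain unfilled in `virtual` correspond precisely to the occluded regions discussed above, which is why combining multiple source images improves the rendering quality.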

Let us now discuss the problem of N-depth/N-texture multi-view video coding in more detail. A major problem when dealing with N-texture and N-depth video signals is the large amount of data to be encoded, decoded and rendered. For example, an independent transmission of 8 views of the “Breakdancers” sequence, using an H.264/MPEG-4 AVC encoder, requires about 10 Mbit/s and 1.7 Mbit/s at a PSNR of 40 dB for the texture and depth data, respectively^1. Given this resolution and the limited frame rate of the experimental sequences, it can be directly concluded that the bit rate is relatively high. Therefore, coding algorithms that enable an efficient compression of both multi-view depth and texture video are necessary. In a typical multi-view acquisition system, the acquired views are highly correlated. As a result, a coding gain can be obtained by exploiting the inter-view dependency between neighboring cameras.
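A back-of-the-envelope calculation illustrates the magnitude of these rates. Assuming, for the sake of the example, that the quoted figures are aggregates over all 8 views, the compressed stream is roughly two orders of magnitude smaller than the raw video:

```python
# Hypothetical sanity check of the rates quoted above; assumes the quoted
# 10 Mbit/s (texture) and 1.7 Mbit/s (depth) are totals over all 8 views.
views, width, height, fps = 8, 1024, 768, 15

raw_rate = views * width * height * fps * 24 / 1e6  # 24 bpp RGB, in Mbit/s
coded_rate = 10 + 1.7                               # texture + depth, Mbit/s
print(f"raw: {raw_rate:.0f} Mbit/s")                # ~2265 Mbit/s
print(f"coded: {coded_rate} Mbit/s, "
      f"~{raw_rate / coded_rate:.0f}x compression")
```

Even so, 11.7 Mbit/s for a single scene at 15 frames per second remains high for transmission, which motivates exploiting the inter-view redundancy.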

In this chapter, we propose a multi-view coding algorithm that is based on two different approaches for the predictive coding of views, so as to handle varying multi-view acquisition conditions (small or large baseline distance) and varying quality of the depth images. The first predictive coding tool is based on a block-based disparity-compensated prediction scheme (the difference between motion and disparity compensation is further elaborated in the next section). The advantages of this tool are that it does not require a geometric description of the scene and that it yields a high coding efficiency for multi-view sequences with a small baseline distance. The second predictive coding tool is based on a View Synthesis Prediction (VSP) algorithm that synthesizes an image as captured by the predicted camera [51, 54]. The advantage of the VSP algorithm is that it enables the prediction of views for large-baseline sequences. The proposed encoder employs both approaches and adaptively selects the most appropriate prediction scheme, using a rate-distortion criterion for an optimal prediction-mode selection. To evaluate the efficiency of the VSP predictive coding tool, we have integrated VSP into an H.264/MPEG-4 AVC encoder. We particularly emphasize that previous contributions on multi-view video coding have mainly focused on multi-view texture video. The problem of multi-view depth video coding has hardly been investigated, however, and therefore forms an interesting topic.
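The prediction-mode selection can be illustrated with the standard Lagrangian decision rule J = D + λR used in H.264-style encoders. The sketch below is a simplified, hypothetical version of such a decision (the function names and candidate interface are illustrative), not the exact integration described later in this chapter:

```python
import numpy as np

def lagrange_multiplier(qp):
    # Widely used H.264 heuristic for the mode-decision Lagrange multiplier.
    return 0.85 * 2 ** ((qp - 12) / 3)

def select_prediction_mode(block, candidates, lam):
    """Choose the predictor minimizing the Lagrangian cost J = D + lam * R.

    `candidates` maps a mode name to (prediction_block, estimated_rate),
    e.g. one entry from the block-based disparity-compensated search and
    one from view synthesis prediction (VSP).
    """
    best_mode, best_cost = None, float("inf")
    for mode, (pred, rate) in candidates.items():
        dist = np.sum((block.astype(float) - pred.astype(float)) ** 2)  # SSD
        cost = dist + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode
```

For every block, the encoder evaluates both candidate predictors and keeps whichever yields the lower cost J, so that, intuitively, small-baseline content favors disparity compensation while large-baseline content favors VSP.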

This chapter is organized as follows. We commence with a short survey of previous multi-view texture coding approaches. In the same section, the performance of the surveyed multi-view coding algorithms is evaluated for both texture and depth compression. Afterwards, Section 5.3 describes the integration of the two predictive coding algorithms into an H.264/MPEG-4 AVC encoder, i.e., block-based disparity-compensated prediction and VSP, the latter being detailed in Section 5.3.3. Experimental results are provided in Section 5.4 and the chapter concludes with Section 5.5.

^1 The resolution of the “Breakdancers” sequence is 1024 × 768 pixels and the frame rate is 15 frames per second. The compression is performed using an H.264/MPEG-4 AVC encoder with the Main profile.