5.2  Multi-view video coding tools


In this section, we provide an overview of multi-view coding approaches, some of which have been investigated within the scope of the MVC standard [14], and we compare their coding performance.

5.2.1  Disparity-compensated prediction


Generally, it is accepted that multi-view images are highly correlated, so that a coding gain can be obtained [44]. To exploit this inter-view correlation, a typical approach is block-based disparity-compensated prediction. Similar to block-based motion-compensated prediction, block-based disparity-compensated prediction estimates, for each block, a displacement/disparity vector between a predicted view and a main reference view. Disparity-compensated prediction was adopted earlier in the MPEG-2 multi-view profile [76]. The disparity-compensated prediction scheme assumes that the displacement of objects across the views follows a translational motion model. For a short baseline camera distance, this assumption is usually valid. However, for a wide baseline multi-view sequence, the motion of objects cannot always be accurately approximated by a simple translation. In such a case, the temporal correlation between two consecutive frames may be higher than the spatial correlation between two neighboring views.

To evaluate the temporal and spatial inter-view correlation, a statistical analysis of the block-matching step has been carried out [44]. In this analysis, the block-matching step refers to finding the block that is most similar to the selected reference block, either in the neighboring views or in the time-preceding frames. If the best-matching block is located in a time-preceding frame, the block is temporally predicted; alternatively, if it is located in a neighboring view, the block is spatially predicted. Experimental results of the statistical analysis in [44] show that the image blocks are temporally predicted in 63.7% and 87.2% of the cases for the “Breakdancers” and “Ballet” sequences, respectively.
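For illustration, the block-matching step of such an analysis can be sketched as follows. This is a minimal full-search matcher with a small block size and search range chosen for brevity; it is not the exact configuration used in [44]:

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def best_match(block, ref, y, x, search=4):
    """Full search in `ref` around position (y, x); returns (min SAD, offset)."""
    h, w = block.shape
    best = (float("inf"), (0, 0))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy <= ref.shape[0] - h and 0 <= xx <= ref.shape[1] - w:
                best = min(best, (sad(block, ref[yy:yy + h, xx:xx + w]), (dy, dx)))
    return best

def classify_blocks(cur, prev_t, neigh_v, bs=8):
    """Count blocks best predicted temporally vs. from the neighboring view."""
    temporal = spatial = 0
    for y in range(0, cur.shape[0] - bs + 1, bs):
        for x in range(0, cur.shape[1] - bs + 1, bs):
            blk = cur[y:y + bs, x:x + bs]
            cost_t, _ = best_match(blk, prev_t, y, x)   # time-preceding frame
            cost_v, _ = best_match(blk, neigh_v, y, x)  # neighboring view
            if cost_t <= cost_v:
                temporal += 1
            else:
                spatial += 1
    return temporal, spatial
```

Running such a classifier over a whole sequence yields percentages of temporally versus spatially predicted blocks, analogous to the 63.7%/87.2% figures quoted above.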
As a result, depending on the properties of the multi-view sequence, in particular the baseline camera distance, it may be more appropriate to exploit the temporal correlation than the spatial inter-view correlation. Therefore, we propose that a prediction structure which adequately exploits both the temporal and the spatial inter-view correlation should be employed.

5.2.2  Prediction structures


Let us now discuss several prediction structures that exploit both the temporal and spatial inter-view correlations. To adequately exploit both correlations, several multi-view coding structures have been explored within the MVC framework [60].

A. Simulcast coding


To perform multi-view coding, a first straightforward solution, called simulcast coding, consists of independently encoding the multiple views. The prediction structure of a simulcast coding algorithm is illustrated by Figure 5.1(a), where each view is independently coded and compression exploits only the temporal redundancy. One advantage of simulcast coding is that standard video encoders can be used for multi-view coding. However, simulcast coding does not exploit the correlation between the views. Because of its simplicity, simulcast coding is typically employed as a reference for comparisons of coding performance. For example, an evaluation procedure based on a simulcast compression of the views was initially proposed within the MVC framework as a coding-performance anchor [1].

B. Hybrid motion- and disparity-compensated prediction structure


As a second approach, it has been proposed [26, 59, 63] to simultaneously exploit the temporal and inter-view correlations. To this end, a selected view is predicted from a time-preceding frame or from a neighboring view, using motion- or disparity-compensated prediction, respectively. Therefore, by employing multiple reference frames, temporal and inter-view correlations can be simultaneously exploited. The prediction structure of this hybrid motion- and disparity-compensated prediction structure is illustrated by Figure 5.1(b). In the following, we denote this Motion- and Disparity-Compensated prediction structure as an “MaDC prediction structure”. In practice, at least two reference frames should be employed and stored in the Decoded Picture Buffer (DPB) of an H.264/MPEG-4 AVC encoder. It can thus be noted that the H.264/MPEG-4 AVC architecture offers the advantage of enabling view prediction from multiple reference frames. Additionally, by employing the H.264/MPEG-4 AVC rate-distortion optimization, the optimal (in a rate-distortion sense) reference frame can be selected for each image block. Similar prediction structures employing multiple references have recently been reported within the MVC framework [60]. One of these prediction structures also exploits the spatial and temporal redundancy by performing a hierarchical bi-directional prediction of the views (hierarchical B-pictures) [59]. This prediction structure was then integrated into the MVC reference software [13] and constitutes an essential coding tool of the MVC software.
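The per-block reference selection can be sketched as a Lagrangian decision of the kind used in H.264/MPEG-4 AVC rate-distortion optimization. The candidate names, cost values, and the value of the Lagrange multiplier below are purely illustrative:

```python
def rd_select_reference(candidates, lam=10.0):
    """
    Pick the reference minimising the Lagrangian cost J = D + lambda * R.

    `candidates` maps a reference name (e.g. "temporal" for the
    time-preceding frame, "inter-view" for the neighboring view) to a
    (distortion, rate_bits) pair: the distortion of the best match and
    the bits needed to signal the reference index and the motion or
    disparity vector.
    """
    return min(candidates,
               key=lambda r: candidates[r][0] + lam * candidates[r][1])

# Hypothetical costs: inter-view prediction matches slightly better but
# costs more bits to signal; the chosen reference flips with lambda.
candidates = {"temporal": (1200, 6), "inter-view": (1100, 14)}
```

With `lam=10.0` the inter-view reference wins (1100 + 140 < 1200 + 60), whereas a larger multiplier, which penalizes the signalling overhead more, favors the temporal reference. This mirrors the observation above that the multiple-reference overhead can cancel the prediction gain.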



Figure 5.1 (a) Simulcast prediction structure and (b) MaDC prediction structure for multi-view coding.


C. Experimental comparison of prediction structures


We now present an experimental comparison of the previously introduced prediction structures, using two multi-view sequences encoded as texture and depth. We start by discussing the rate-distortion curves of the texture multi-view sequences, shown in Figure 5.2. For the coding experiments, we have employed the open-source H.264/MPEG-4 AVC encoder x264 [4]. The CABAC arithmetic coding algorithm was enabled for all experiments and the motion-search range was set to 32 × 32 pixels. The Group Of Pictures (GOP) size was set to 25 frames and the GOP structure was defined as IBBP. Finally, the number of reference frames was set to 2 and the first 25 frames of the sequences were used for compression.

First, for the “Breakdancers” sequence, it can be observed that the MaDC prediction structure slightly outperforms the simulcast prediction structure. For example, the MaDC prediction structure yields a quality improvement of 0.2 dB at a bit rate of 700 kbit/s per view with respect to simulcast coding. However, for the “Ballet” sequence, no coding gain is observed; instead, a small loss occurs. This result emphasizes that the temporal correlation between consecutive frames is more significant than the inter-view correlation between neighboring views (as highlighted by the statistical analysis discussed in Section 5.2.1), so that the overhead incurred by the use of multiple reference frames (e.g., a reference-frame index for each block) decreases the coding performance. For comparison, the MVC reference software yields a coding improvement of 0.25 dB and 0.05 dB for the “Breakdancers” and “Ballet” sequences, respectively [62]. Therefore, the presented MaDC prediction structure yields a coding efficiency similar to that of the MVC reference software.
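The encoder settings listed above can be approximated with an x264 command line along the following lines. This is a sketch, not the exact invocation used for the experiments: the input file name and resolution are placeholders, and the flag names correspond to recent x264 builds (CABAC is enabled by default there):

```shell
# Sketch of an x264 invocation approximating the reported settings.
x264 --input-res 1024x768 --fps 25 \
     --keyint 25 --min-keyint 25 \
     --bframes 2 --b-adapt 0 \
     --ref 2 --merange 32 \
     --frames 25 \
     -o view0.264 view0_1024x768.yuv
# --keyint 25  : GOP size of 25 frames
# --bframes 2  : fixed IBBP-style GOP structure (with --b-adapt 0)
# --ref 2      : two reference frames in the DPB
# --merange 32 : motion-search range of 32 pixels
# --frames 25  : encode only the first 25 frames
```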
Additionally, although significant coding improvements can be obtained for multi-view sequences with short baseline camera distances [60], the experimental results show that neither the MVC encoder nor the MaDC prediction structure yields significant coding improvements for wide baseline multi-view sequences. As confirmed by the coding experiments reported to the Joint Video Team [62], this conclusion holds especially for the wide baseline multi-view sequence “Ballet”.

Let us now consider the compression of multi-view depth sequences. Considering the rate-distortion curves of the depth multi-view sequences “Breakdancers” and “Ballet”, it can be observed that the MaDC prediction structure does not yield a coding improvement over simulcast coding (see Figure 5.3). For such wide baseline sequences, the depth image sequences show a high temporal correlation. As a result, a large percentage of macroblocks can be encoded using a SKIP coding mode, in which the inter-view correlation is not exploited at all: the depth signal is so stable over time that the transmission of a macroblock can be skipped and its contents are simply copied from the previous frame. This is particularly true for depth video, since depth signals have a smooth nature. For comparison with the MVC reference software, a coding improvement of 0.3 dB was obtained by the MVC software for the “Breakdancers” depth sequence [62]. However, similar to the presented coding structure, the MVC encoder yields no coding improvement (and even a slight degradation) for the “Ballet” depth sequence.
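A minimal sketch of this SKIP behavior is shown below, assuming a simple per-macroblock identity test against the co-located block of the previous frame. Real encoders base the SKIP decision on a rate-distortion cost and also allow motion-compensated SKIP; this toy version only conveys why temporally stable depth maps cost almost no bits:

```python
import numpy as np

def encode_skip_map(cur, prev, bs=16, thresh=0):
    """
    Mark macroblocks whose content is (near-)identical to the co-located
    block of the previous frame; these are signalled as SKIP and are
    simply copied at the decoder, costing almost no bits.
    """
    H, W = cur.shape
    skip = np.zeros((H // bs, W // bs), dtype=bool)
    for by in range(H // bs):
        for bx in range(W // bs):
            a = cur[by*bs:(by+1)*bs, bx*bs:(bx+1)*bs].astype(int)
            b = prev[by*bs:(by+1)*bs, bx*bs:(bx+1)*bs].astype(int)
            skip[by, bx] = np.abs(a - b).max() <= thresh
    return skip

def decode_with_skip(prev, skip, coded_blocks, bs=16):
    """Reconstruct a frame: SKIP blocks are copied from `prev`,
    the remaining blocks come from the (toy) coded bit stream."""
    out = prev.copy()
    for (by, bx), blk in coded_blocks.items():
        out[by*bs:(by+1)*bs, bx*bs:(bx+1)*bs] = blk
    return out
```

For a smooth, static depth map, nearly all entries of the skip map are true and only a handful of blocks need to be transmitted.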



Figure 5.2 RD-curves of the texture sequences “Breakdancers” (a) and “Ballet” (b), encoded using the simulcast and MaDC prediction structures.




Figure 5.3 RD-curves of the depth sequences “Breakdancers” (a) and “Ballet” (b), encoded using the simulcast and MaDC prediction structures.


5.2.3  Performance bounds


In the previous section, the rate-distortion efficiency of motion- and disparity-compensated coding of multi-view video has been investigated. However, the theoretical bound for the coding of multi-view video has not been addressed, so that the theoretically attainable gain in coding performance is not known. For the case of single-view video coding, a mathematical framework that establishes the coding-performance bound of motion-compensated predictive coding was proposed in [34]. This framework was later extended to evaluate the theoretical rate-distortion bounds of motion- and disparity-compensated coding of multi-view video [31]. In the discussed study, a matrix of pictures of M views and K temporally successive pictures is defined. Here, the parameter M indicates the Group of Views (GOV) size, for which the inter-view similarity between M neighboring views is exploited. For example, the case M=1 corresponds to a simulcast compression of the views. To evaluate the rate-distortion bounds, the impact of various temporal GOP and spatial GOV sizes is explored. The results of the analysis show that the rate-distortion gain significantly depends on the properties of the multi-view sequence. Specifically, when the inter-view redundancy is exploited across 8 views (M=8), a theoretical coding gain ranging from 0.01 to 0.08 bit/pixel/camera can be obtained. Therefore, the coding gain can vary significantly, but for practical cases, such as a GOP size of 8 or larger, the coding gain is rather limited. For the multi-view sequences investigated in this thesis, a limited theoretical coding gain of 0.01 and 0.02 bit/pixel/camera is obtained at 40 dB for the “Breakdancers” and “Ballet” sequences, respectively². Additionally, the analysis highlights that exploiting the inter-view redundancy across a varying number of views (M=4 or M=8) does not significantly alter the coding-performance bound. Instead, the temporal redundancy constitutes the most important factor in the theoretical coding-performance bound for wide baseline camera distances.
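To give some intuition for these per-pixel figures, a gain expressed in bit/pixel/camera can be converted into a bit-rate gain. The resolution and frame rate below are assumptions made for illustration only:

```python
def gain_kbit_per_s(bits_per_pixel, width, height, fps):
    """Convert a per-pixel rate gain into a bit-rate gain per camera (kbit/s)."""
    return bits_per_pixel * width * height * fps / 1000.0

# Assuming a 1024 x 768 sequence played at 25 Hz, a theoretical gain of
# 0.01 bit/pixel/camera corresponds to roughly 197 kbit/s per camera.
saving = gain_kbit_per_s(0.01, 1024, 768, 25)
```

Whether such a saving is worthwhile clearly depends on the total bit rate per view at the operating point considered.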

5.2.4  Coding efficiency versus random access and decoding complexity


In the previous section, the compression efficiency of existing multi-view coding structures was investigated in a straightforward fashion. However, coding efficiency is not the only important aspect for consideration. A second important feature is the ability for users to randomly access views in the encoded bit stream. In practice, considering a free-viewpoint video system, this feature enables a user to switch viewpoints at interactive rates, i.e., at the frame rate of the video. A third aspect is the complexity of the multi-view video decoder, whether implemented on a regular computer or as an embedded system within a (low-cost) consumer electronics product. Therefore, the design of a free-viewpoint video system is driven by three aspects: (1) compression efficiency, (2) random access, i.e., low-delay access to a desired view, and (3) low complexity of the multi-view decoder. These three system aspects should be balanced at design time. Given their importance, we discuss them in more detail in this section.

A. Decoding complexity versus low-delay access.

Because switching viewpoints should be performed at the frame rate of the video, the decoding of the desired views should be completed within one frame period (e.g., within 1/25 second). Therefore, to keep the decompression delay reasonably low, the number of coding dependencies should be limited. Although current monocular video coding standards do provide some random-access capabilities (the insertion of intra-coded frames), these were designed for robustness and fast channel switching and, as such, are insufficient for the free-viewpoint video application. Specifically, depending on the processing capabilities of the decoder and the periodicity of the inserted intra-coded frames³, a simulcast compression of the views may not enable the user to switch viewpoints at the frame rate of the video. One approach that allows a user to switch viewpoints at the frame rate of the video is to decode all views prior to the user selection of the view. However, this approach involves the parallel decompression of all views and all of their reference frames. For a simulcast prediction structure, this implies the instantiation of one H.264/MPEG-4 AVC decoder for each view and thus a high computational complexity. Hence, the multi-view video player should be able to handle multiple H.264/MPEG-4 AVC decoders simultaneously, or be able to decode the user-selected view and all of its reference frames within the delay tolerated by the user. Alternatively, a prior decompression of views encoded using an MaDC prediction structure requires the decompression of the user-selected view/frame and all of its reference frames. Therefore, the complexity of such a free-viewpoint video decompression system grows as O(M), where M corresponds to the number of views and their reference views/frames. In the case the system cannot handle the parallel decompression of all views, the desired view is rendered with delays.
Examples of coding scenarios will be provided in Section C below.
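The decoding workload for accessing a given view can be made explicit by computing the dependency closure of the requested frame over the prediction structure. The frame names and the toy prediction graph in the usage example are hypothetical:

```python
def decode_set(frame, refs):
    """
    Return the set of frames that must be decoded before `frame` can be
    displayed. `refs` maps each frame to the list of frames it is
    predicted from (an empty list for intra-coded frames); the closure
    is found by a simple graph walk over the prediction dependencies.
    """
    needed, stack = set(), [frame]
    while stack:
        f = stack.pop()
        if f not in needed:
            needed.add(f)
            stack.extend(refs.get(f, []))
    return needed

# Toy prediction structure: a temporal chain in a main view, plus a
# secondary view "S2" disparity-predicted from main-view frame "P2".
refs = {"I0": [], "P1": ["I0"], "P2": ["P1"], "S2": ["P2"]}
```

Here, accessing "S2" requires decoding four frames, whereas the intra-coded "I0" requires only itself; the size of this set is precisely what grows with the number of dependent views and reference frames.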

B. Compression efficiency versus random access.

Typically, random access is obtained by reducing the coding dependencies between encoded frames, e.g., by periodically inserting intra-coded frames. Hence, there exists a trade-off between coding efficiency and random-access capabilities, and a coding structure should be selected that balances both. To provide random access to an arbitrary view while still exploiting the temporal and inter-view correlations, we propose to use predefined main views as reference frames from which neighboring secondary views are predicted (see Figure 5.4). A closer inspection of the proposed coding structure reveals that the temporal correlation is exploited only by the two main views, while the secondary views exploit only the spatial inter-view correlation. Consequently, by exploiting an appropriate mixture of temporal and inter-view predictions, views along the chain of cameras can be randomly accessed. However, this random access comes at the expense of a loss in coding efficiency. Thus, the number of reference main views should be selected such that the compression efficiency is not dramatically reduced. The number of reference main views can range from 1 to N-1, where N corresponds to the number of views; the exact number depends on the tolerated rendering latency and on the expected compression efficiency. Additionally, the illustrative example presented in this chapter shows that the number of reference main views also significantly depends on the properties of the sequences.

C. Scenarios for selecting prediction structures.

From the previous discussion, four scenarios for selecting a suitable prediction structure can be distinguished. The first scenario consists of encoding a short baseline multi-view sequence while tolerating a high delay for accessing the views. In this scenario, an MVC encoder or an MaDC prediction structure should be employed, because they feature a high coding performance. The second scenario consists of encoding a short baseline multi-view sequence while requiring a low delay for accessing a desired view. This scenario implies that only a limited rendering delay is tolerated by the viewer, so that a complex prediction structure should not be employed. To obtain low-delay access, the proposed coding structure from Figure 5.4 can be employed. However, to balance the low-delay access against the coding efficiency, a significant number of main views should be employed, e.g., N/3, where N corresponds to the number of views. In the case the compression performance should be preserved, the number of main views can be as high as N. The third scenario consists of encoding a wide baseline multi-view sequence while tolerating a high delay for rendering. In this case, a high delay tolerated by the viewer does not necessarily imply that using a complex prediction structure is beneficial. For example, the MVC encoder and the MaDC prediction structure could not improve the coding efficiency for the wide baseline sequence “Ballet”, when compared to a simulcast coding of the views (see Figure 5.2 and the JVT input document [62]). Thus, for simplicity, simulcast compression may be preferred in this scenario. The fourth scenario consists of encoding a wide baseline multi-view sequence while not tolerating a high latency for accessing and rendering images. In this last case, the proposed prediction structure from Figure 5.4 should be employed, because it uses a limited number of coding dependencies.
Furthermore, depending on the desired coding efficiency, the system designer can select the number of reference main views, such that the coding efficiency and the low-delay access are appropriately balanced.
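The four scenarios can be condensed into a small selector sketch. The returned labels and the N/3 hint merely restate the discussion above; they are not a normative design rule:

```python
def select_structure(short_baseline, low_delay_required):
    """
    Map the four scenarios (baseline distance x delay tolerance) to a
    prediction-structure recommendation. N denotes the number of views.
    """
    if short_baseline and not low_delay_required:
        # Scenario 1: inter-view prediction pays off, delay is acceptable.
        return "MVC/MaDC"
    if short_baseline and low_delay_required:
        # Scenario 2: proposed main/secondary structure with many main views.
        return "main/secondary (N_main ~ N/3)"
    if not short_baseline and not low_delay_required:
        # Scenario 3: inter-view prediction gains little; keep it simple.
        return "simulcast"
    # Scenario 4: wide baseline and low delay; few coding dependencies.
    return "main/secondary (few main views)"
```

Such a decision function makes explicit that the baseline distance governs whether inter-view prediction is worthwhile at all, while the delay tolerance governs how many coding dependencies can be afforded.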

In this chapter, we provide an illustrative example, where two reference main views are used. However, even this simple case will clearly illustrate the trade-off between sequence properties and compression efficiency, as our experimental results will show later in this chapter.


Figure 5.4 Coding structure that allows motion- and disparity-compensated prediction with random access to an arbitrary view. Main views exploit the temporal correlation, while the secondary views exploit the inter-view correlation.


²In [31], the original MPEG sequences were down-sampled, such that a comparison with the results presented within MVC is not directly possible.

³The period of the inserted intra-coded frames corresponds to the Group Of Pictures (GOP) size.