5.3  View Synthesis Prediction (VSP) for N-depth/N-texture coding

In this section, we propose a novel multi-view encoder, based on H.264/MPEG-4 AVC, that employs two different view-prediction algorithms in parallel. For simplicity, the discussion concentrates on texture coding, which is also the order in which the algorithm was developed. However, the design and experiments showed that virtually all descriptions apply equally to the compression of multi-view depth sequences. The reader should therefore interpret the text as being applicable to both depth and texture. At the end of this section, a contribution specific to depth signals is appended.

5.3.1  Predictive coding of views

The first inter-view prediction technique is the disparity-compensated prediction scheme. A major advantage of disparity-compensated prediction is that a coding gain is obtained when the baseline distance between cameras is small. Additionally, the disparity-compensated predictor does not rely on the geometry of multiple views, so that camera calibration parameters are not required. However, we have shown in Section 5.2.2 that the disparity-compensated prediction scheme does not always yield a coding gain, especially for wide-baseline camera settings. A reason for this is that the translational motion model, employed by the block-based motion-compensated scheme, is not sufficiently accurate to predict the apparent displacement of objects at different depths.
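As an illustration of this first technique, the following sketch performs a block-based disparity search by minimizing the sum of absolute differences (SAD). The horizontal-only search, the block size, and the search range are illustrative assumptions, not the exact encoder settings:

```python
# Minimal sketch of block-based disparity-compensated prediction (DCP),
# assuming a purely horizontal disparity search between rectified views.
import numpy as np

def dcp_predict_block(ref, cur, by, bx, bsize=8, max_disp=16):
    """Find the horizontal disparity that best predicts one block of 'cur' from 'ref'."""
    block = cur[by:by+bsize, bx:bx+bsize].astype(np.int32)
    best_d, best_sad = 0, np.inf
    for d in range(-max_disp, max_disp + 1):
        x = bx + d
        if x < 0 or x + bsize > ref.shape[1]:
            continue  # candidate block falls outside the reference frame
        cand = ref[by:by+bsize, x:x+bsize].astype(np.int32)
        sad = np.abs(block - cand).sum()  # translational model: one shift per block
        if sad < best_sad:
            best_sad, best_d = sad, d
    return best_d, best_sad
```

Because a single translational shift is assigned per block, objects at different depths inside one block cannot all be compensated correctly, which is exactly the limitation noted above for wide baselines.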

The second alternative, i.e., a View Synthesis Prediction (VSP) scheme, is based on a view-synthesis algorithm that renders an image as seen by the predicted camera [5154]. The advantage of view-synthesis prediction is that the views can be better predicted, even when the baseline distance between the reference and predicted cameras is large, thereby yielding a high compression ratio. However, as opposed to the previous approach, the multi-camera acquisition system needs to be fully calibrated prior to the capture session, and the scheme relies on a reasonably accurate depth image. Additionally, because depth estimation is a complicated task, the depth images may be inaccurately estimated, thereby reducing the view-prediction quality. Since VSP is new in the discussion, we further concentrate on its integration into a multi-view video encoder based on H.264/MPEG-4 AVC.
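To illustrate the principle of view synthesis, the sketch below forward-warps a reference view into the predicted camera under the simplifying assumption of rectified cameras, where warping reduces to a per-pixel horizontal shift d = f·B/Z derived from the depth. The focal length `f` and baseline `baseline` are hypothetical values, not calibration data from the text:

```python
# Minimal sketch of view-synthesis prediction for rectified cameras:
# each reference pixel is shifted by the disparity d = f*B/Z implied by
# its depth. 'f' and 'baseline' are illustrative, assumed parameters.
import numpy as np

def synthesize_view(ref, depth, f=500.0, baseline=0.1):
    """Forward-warp the reference view into the predicted camera position."""
    h, w = ref.shape
    synth = np.zeros_like(ref)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = int(round(f * baseline / depth[y, x]))  # disparity in pixels
            xt = x + d
            if 0 <= xt < w:
                synth[y, xt] = ref[y, x]  # last write wins (no z-buffer here)
                filled[y, xt] = True
    return synth, filled  # unfilled pixels correspond to disocclusions
```

Pixels left unmarked in `filled` are disoccluded regions that the reference view cannot predict; an inaccurate depth map shifts pixels to the wrong target positions, which directly degrades the prediction quality, as stated above.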

5.3.2  Incorporating VSP into H.264/MPEG-4 AVC

Important requirements of the view-prediction algorithm are that (a) it should be robust against depth images which are inaccurately estimated, and (b) a high compression ratio should be obtained for various baseline distances between cameras. As discussed above, both presented view-prediction algorithms have their limitations and cannot be used under variable capturing conditions. Therefore, our novel strategy is to use both algorithms selectively on an image-block basis, depending on their current coding performance.

An attractive approach to improving coding efficiency is to integrate both prediction techniques, i.e., View Synthesis Prediction (VSP) and Disparity-Compensated Prediction (DCP), and then select the best prediction for each block. It should be noted that the two prediction techniques employ different reference frames: DCP uses the main view, whereas VSP applies the synthesized image. Thus, each block of a frame can be predicted from one of the reference frames stored in the reference-frame buffer (also known as the Decoded Picture Buffer (DPB)). A disadvantage of such a multi-view encoder is that VSP generates a prediction image that does not necessarily minimize the prediction residual error, so that the complete scheme would not yield the minimum residual error either. Consequently, the adopted system concept is modified as follows.

At the refinement stage, the search for matching blocks is performed in a region of limited size, e.g., 32×32 pixels. In contrast to this, the disparity between two views in the “Ballet” sequence can be as high as 50 pixels. Figure 5.5 portrays an overview of the resulting proposed coding architecture.

Figure 5.5 Architecture of an H.264/MPEG-4 AVC encoder that adaptively employs a block-based disparity-compensated prediction or view-synthesis prediction followed by a prediction refinement. The main view and the synthesized view are denoted Ref and W(Ref), respectively.
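The refinement stage mentioned above can be sketched as a small block search around the position suggested by the synthesized frame, correcting residual misalignment caused by imperfect depth or calibration. The search radius below is illustrative; the 32×32 region cited in the text corresponds to a radius of 16 pixels in each direction:

```python
# Minimal sketch of the prediction-refinement stage: starting from the
# block position given by the synthesized view, search a small window for
# the best match. Window radius and block size are illustrative choices.
import numpy as np

def refine_prediction(synth, cur, by, bx, bsize=8, radius=4):
    """Search a (2*radius+1)^2 neighborhood of (by, bx) in the synthesized frame."""
    block = cur[by:by+bsize, bx:bx+bsize].astype(np.int32)
    best = (0, 0, np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > synth.shape[0] or x + bsize > synth.shape[1]:
                continue  # candidate outside the synthesized frame
            sad = np.abs(block - synth[y:y+bsize, x:x+bsize].astype(np.int32)).sum()
            if sad < best[2]:
                best = (dy, dx, sad)
    return best  # (vertical offset, horizontal offset, SAD)
```

Because the synthesized frame has already compensated the bulk of the inter-view disparity, this small window suffices even though the raw disparity in "Ballet" can reach 50 pixels.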

There are multiple advantages to using an H.264/MPEG-4 AVC encoder. First, because the H.264/MPEG-4 AVC standard allows each macroblock to be encoded using different coding modes, occluded regions in the predicted view can be efficiently compressed. More specifically, occluded pixels cannot always be predicted with sufficient accuracy. In this case, the algorithm encodes an occluded macroblock in intra-mode. Alternatively, when the prediction accuracy of occluded pixels is sufficient, the macroblock is encoded in inter-mode. Summarizing, the flexibility of H.264/MPEG-4 AVC provides a suitable framework for handling the compression of occlusions. Second, when the depth image is not estimated accurately, the VSP coding mode is inefficient and will simply not be selected. Hence, the criterion automatically leads to the correct coding-mode selection. Third, the prediction mode of each image block is implicitly specified by the reference-frame index, so that no additional information has to be transmitted. Specifically, as previously highlighted, the VSP and DCP coding algorithms employ two different reference frames: the synthesized image and the main reference view, respectively. Because these reference frames are stored in the DPB, an image block can be predicted using either the first (DCP) or the second (VSP) frame buffer. The selection of the prediction scheme is therefore indicated to the decoder by the index of the reference frame, which is an integral part of the H.264/MPEG-4 AVC standard. Fourth, the selection of the prediction tool is performed for each image block using a rate-distortion criterion. In practice, this criterion can be based on an R-D optimized coding-mode selection, as implemented in a standard H.264/MPEG-4 AVC encoder. Thus, the H.264/MPEG-4 AVC standard offers sufficient flexibility in providing the appropriate coding modes for optimal matching with the varying prediction accuracies of VSP and DCP.
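The block-wise choice between the two reference frames in the DPB can be sketched with a simple Lagrangian cost J = D + λ·R. The SSD distortion, the per-index rate values, and the multiplier `lam` below are illustrative placeholders for the R-D machinery of a real H.264/MPEG-4 AVC encoder:

```python
# Minimal sketch of R-D-based reference selection: each candidate reference
# frame in the DPB yields a prediction for the block, and the one with the
# lowest Lagrangian cost J = D + lambda*R is signalled via its reference
# index. Distortion, rate, and lambda models are illustrative assumptions.
import numpy as np

def select_reference(block, predictions, rates, lam=10.0):
    """predictions: {ref_idx: predicted block}; rates: {ref_idx: bits to signal it}."""
    best_idx, best_cost = None, np.inf
    for idx, pred in predictions.items():
        dist = ((block.astype(np.int32) - pred.astype(np.int32)) ** 2).sum()  # SSD
        cost = dist + lam * rates[idx]
        if cost < best_cost:
            best_idx, best_cost = idx, cost
    return best_idx, best_cost
```

The selected index is exactly the reference-frame index already carried by the H.264/MPEG-4 AVC bitstream, which is why no extra signalling is needed: the decoder recovers the DCP/VSP decision from the index alone.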

5.3.3  Multi-view depth coding aspects when using VSP

A multi-view depth image sequence has correlation properties similar to those of a multi-view texture sequence. More specifically, because each depth image represents the depth of the same video scene, all depth images are correlated across the views. This parallels the inter-view correlation of multi-view texture discussed in the previous section, and it motivates why the same coding concept for exploiting inter-view correlation can be applied equally to texture and depth coding.

As a consequence of the above, a coding gain for depth signals can be obtained by exploiting the inter-view correlation of the depth data. Similar to multi-view texture coding, the multi-view depth coding algorithm employs the two view-prediction algorithms, i.e., the DCP and VSP schemes. The most efficient prediction method is then selected for each image block using a rate-distortion criterion. As opposed to recent contributions in the literature, the presented solution is the first to employ a view-synthesis prediction scheme for encoding multi-view depth sequences.

A further benefit of this approach is that depth signals are relatively smooth and can thus be accurately predicted using a view-synthesis algorithm. In this respect, the predictive coding of depth signals is typically more efficient than that of multi-view texture signals, since view-synthesis prediction of texture signals may not always be able to accurately predict fine texture. A final advantage is that the predictive coding of depth views does not require the transmission of side information, but instead performs the prediction of neighboring depth views using the depth main view only. Figure 5.6 shows an overview of the described depth-coding architecture. This figure is nearly identical to the coding block diagram for texture, with one major difference: to perform view-synthesis prediction, the depth compression requires only camera parameters as input, whereas the texture compression needs both camera parameters and a reference depth image. In the depth coding loop, the reference main view enables the simultaneous prediction of multiple secondary views without any side information such as motion vectors. Let us now evaluate the results of the DCP- and VSP-based coding algorithms for depth and texture multi-view signals.

Figure 5.6 Architecture of the extended H.264/MPEG-4 AVC encoder that adaptively employs the previously encoded depth image Dt-1 or the corresponding warped image W(Dt-1) as reference frames.
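The claim that depth-view prediction needs only camera parameters can be illustrated as follows: the depth image itself supplies the geometry, so it can be forward-warped into a neighboring view with no auxiliary input. As before, rectified cameras and hypothetical `f` and `baseline` values are assumed:

```python
# Minimal sketch of warping a depth map into a neighboring view: the
# disparity of each pixel is computed from the depth value itself, so no
# separate geometry signal is required. 'f' and 'baseline' are assumed.
import numpy as np

def warp_depth(depth, f=500.0, baseline=0.1):
    """Forward-warp a depth map into the neighboring view using its own values."""
    h, w = depth.shape
    warped = np.full((h, w), np.inf)
    for y in range(h):
        for x in range(w):
            d = int(round(f * baseline / depth[y, x]))  # disparity from the depth itself
            xt = x + d
            if 0 <= xt < w:
                warped[y, xt] = min(warped[y, xt], depth[y, x])  # nearer surface wins
    return warped  # np.inf marks disoccluded pixels in the target view
```

This is the structural difference visible between Figures 5.5 and 5.6: the texture loop needs a reference depth image to drive the warp, whereas the depth loop warps the main depth view directly.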