1.4  Multi-view acquisition, compression and rendering problems addressed in this thesis


1.4.1  Acquisition problem


In a multi-view system, multiple cameras capture the same scene. To capture a 3D representation of the scene (using the N-depth/N-texture representation format), a signal is required that describes the geometric features in three dimensions. As previously discussed, the depth of a pixel can be calculated by triangulating corresponding pixels across the views. By assigning a depth value to each pixel of the captured texture image and combining those depth values into an image, a depth image is created.
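For a rectified camera pair, this triangulation reduces to a simple relation between disparity and depth. The following is a minimal sketch of that conversion; the focal length and baseline values are illustrative assumptions, not parameters of any particular setup used in this thesis.

    # Minimal sketch (assumption: a rectified camera pair with focal length f in
    # pixels and baseline b in meters; the numbers below are purely illustrative).
    def depth_from_disparity(disparity_px, focal_px=1000.0, baseline_m=0.05):
        """Triangulate a metric depth value from a pixel disparity: Z = f * b / d."""
        if disparity_px <= 0:
            return float("inf")  # zero disparity corresponds to a point at infinity
        return focal_px * baseline_m / disparity_px

    # Example: a point-correspondence displaced by 20 pixels between the two views
    print(depth_from_disparity(20.0))  # 2.5 meters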

Corresponding pixels across the views are known as point-correspondences. Hence, in this thesis, the three-dimensional multi-view video acquisition step corresponds to the task of calculating multiple depth images by estimating point-correspondences across the multiple camera views. The calculation of depth by estimating point-correspondences is an ill-posed problem in many situations. For example, a change of illumination across the views increases the signal ambiguity when identifying potentially corresponding points. Additionally, with multiple cameras, internal settings such as contrast can vary from camera to camera, so that corresponding pixels show dissimilar intensity values. This results in an unreliable identification of the point-correspondences and thus in inaccurate depth values. Furthermore, specular reflection occurs when light is reflected in different directions with varying intensity. As a result, object surfaces appear differently depending on the viewpoint. Such a surface is known as a non-Lambertian surface. In this thesis, we assume that the 3D objects of the video scene do not change their appearance depending on the camera viewpoint, hence we assume Lambertian surfaces. Another problem that may occur in the scene is the appearance of texture-less regions. Specifically, it is difficult to identify corresponding pixels over a region of constant color, thereby resulting in inaccurate depth values. Moreover, in some cases, particular background regions may be visible from a given camera viewpoint but not from a different camera viewpoint. This problem is known as occlusion. In this case, point-correspondences cannot be identified across the views, so that the corresponding depth values cannot be calculated.
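To make the notion of point-correspondence estimation concrete, the following is a minimal sketch of block matching along a scanline of a rectified view pair, using a sum-of-absolute-differences cost; the window size and search range are illustrative assumptions, and this is not the depth estimation method developed in this thesis. The sketch also indicates why texture-less regions and photometric differences between cameras are problematic.

    import numpy as np

    # Minimal sketch of point-correspondence estimation between two rectified views,
    # using block matching with a sum-of-absolute-differences (SAD) cost. The window
    # size and search range are illustrative assumptions.
    def match_pixel(left, right, y, x, half_win=3, max_disp=64):
        """Return the disparity that minimizes the SAD cost for pixel (y, x) of the left view."""
        patch = left[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1].astype(np.float32)
        best_disp, best_cost = 0, np.inf
        for d in range(max_disp + 1):
            if x - d - half_win < 0:
                break  # candidate window would fall outside the right image
            cand = right[y - half_win:y + half_win + 1,
                         x - d - half_win:x - d + half_win + 1].astype(np.float32)
            cost = np.abs(patch - cand).sum()  # SAD is sensitive to illumination and contrast differences
            if cost < best_cost:
                best_cost, best_disp = cost, d
        return best_disp  # in a texture-less region, many disparities yield nearly the same cost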

In this thesis, we have concentrated on a few specific problems dealing with multi-view system design, rather than addressing all of the above. Our aim is to design a depth estimation sub-system that can be efficiently combined with the multi-view depth compression sub-system and the image rendering sub-system. Specifically, considering the multi-view video coding framework, the depth estimation sub-system should:

The last requirement is further detailed in the next section, since it is less evident than the first three requirements.

1.4.2  Compression problem


The transmission of an N-depth/N-texture multi-view video requires efficient compression algorithms. The intrinsic problem when dealing with a multitude of depth and texture video signals is the large amount of data to be transmitted. For example, an independent transmission of 8 views of a typical sequence such as the multi-view “Breakdancers” sequence requires about 10 Mbit/s for the texture and 1.7 Mbit/s for the depth data (at a PSNR of 40 dB)². This example illustrates a typical specification of a multi-view sequence and its compression system. In this thesis, we aim at designing a compression system with the following characteristics.

The above list does not imply that multi-view video cannot be applied at other resolutions, frame rates and bit rates. It is merely intended to provide a quick outline of the video systems that will be elaborated in this thesis.
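As a rough illustration of why efficient compression is indispensable, the following sketch compares the raw texture data rate of the eight “Breakdancers” views with the compressed rates quoted above; the 8-bit YUV 4:2:0 sampling format is an assumption made only for this illustration.

    # Back-of-the-envelope sketch of the raw data rate of the eight texture views,
    # assuming 8-bit YUV 4:2:0 sampling (12 bits per pixel); this sampling format is
    # an assumption made only for this illustration.
    views, width, height, fps = 8, 1024, 768, 15
    bits_per_pixel = 12  # 8-bit luma plus 4:2:0 subsampled chroma
    raw_mbit_s = views * width * height * fps * bits_per_pixel / 1e6
    print(f"raw texture rate: {raw_mbit_s:.0f} Mbit/s")  # about 1132 Mbit/s
    # versus roughly 10 + 1.7 Mbit/s after compression, as quoted above,
    # i.e. about two orders of magnitude of data reduction are required.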

Since many parts of the multi-view signal are correlated, this correlation can be exploited for compression. In a typical multi-view acquisition system, multiple cameras are employed to capture a single scene from different viewpoints. By capturing a single scene from varying viewpoints, the cameras capture objects with similar colors and textures across the views, thereby generating a highly correlated texture video. The correlation between the camera views is usually referred to as inter-view correlation, and it exists both for texture and depth signals. For each view, whether texture or depth, the succeeding frames are additionally correlated over time, which is called temporal correlation. Temporal correlation within a single view is already exploited by existing coding standards such as MPEG-2 and H.264/MPEG-4 AVC, which employ motion-compensated transform coding. Let us now further discuss the inter-view correlation. The inter-view correlation can be exploited for compression, for example, with a predictive coding technique, as neighboring views show most of the scene from a slightly different viewpoint. However, an accurate prediction of views is a difficult task. For example, illumination and contrast change from view to view and, additionally, the system has to deal with occlusions, which constitute new information that is not correlated with a neighboring view. This is why inter-view prediction is thoroughly addressed in a dedicated chapter focusing on predictive inter-view coding.
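As an illustration of such predictive exploitation of inter-view correlation, the following is a minimal sketch of disparity-compensated prediction of a single block from a neighboring view; the block size and search range are illustrative assumptions, and the sketch is not the inter-view coding scheme developed later in this thesis.

    import numpy as np

    # Minimal sketch of disparity-compensated inter-view prediction of one block:
    # a block of the current view is predicted from a horizontally shifted block of a
    # neighboring (reference) view, and only the residual would then be coded. The
    # block size and search range are illustrative assumptions.
    def predict_block(current, reference, y, x, block=16, max_disp=32):
        """Find the horizontal disparity that best predicts the block at (y, x)."""
        target = current[y:y + block, x:x + block].astype(np.float32)
        best_d, best_sad = 0, np.inf
        for d in range(min(max_disp, x) + 1):
            candidate = reference[y:y + block, x - d:x - d + block].astype(np.float32)
            sad = np.abs(target - candidate).sum()
            if sad < best_sad:
                best_sad, best_d = sad, d
        prediction = reference[y:y + block, x - best_d:x - best_d + block].astype(np.float32)
        residual = target - prediction  # occluded content shows up as a large residual
        return best_d, residual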

In the compression system for multi-view video, we have concentrated on the following problems.

1.4.3  Rendering problem


In general, rendering involves the read-out and presentation process of images. In a multi-view coding system, image rendering refers to the process of generating synthetic images.

In our case, synthetic images are rendered by combining the multiple texture images with their corresponding depth images. Over the past decade, image rendering has been an active field of research for multimedia applications [86]. However, limited work has addressed the problem of image rendering in a multi-view video coding framework. Specifically, it will be shown in this thesis that image rendering can be integrated into the multi-view coding algorithm by employing the rendering procedure in a predictive fashion. When doing so, the quality of the rendered images should match the quality of the non-synthetic images. This implies that the rendering process should be of high quality, dealing with occlusions and the change of scene-sampling position when going from one view to the other. The latter aspect requires accurate texture pixel re-sampling. Summarizing, and more broadly speaking, there are two key aspects for rendering: visualization and compression. Both aspects can be simultaneously addressed when aiming at high-quality rendering. In this thesis, the rendering problem is discussed using the following requirements.
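To make the warping and occlusion-handling aspects concrete, the following is a minimal sketch of depth-image-based rendering for a rectified camera pair; the camera parameters and the simple z-buffer handling are illustrative assumptions, not the rendering algorithm developed in this thesis.

    import numpy as np

    # Minimal sketch of depth-image-based rendering for a rectified camera pair:
    # every texture pixel is shifted by the disparity implied by its depth value, and a
    # z-buffer keeps the closest surface when several pixels map to the same location.
    # Depth values are assumed strictly positive; the camera parameters are illustrative.
    def render_view(texture, depth, focal_px=1000.0, baseline_m=0.05):
        h, w = depth.shape
        synth = np.zeros_like(texture)
        zbuf = np.full((h, w), np.inf)
        for y in range(h):
            for x in range(w):
                d = int(round(focal_px * baseline_m / depth[y, x]))  # disparity in pixels
                xt = x - d
                if 0 <= xt < w and depth[y, x] < zbuf[y, xt]:  # keep the closest surface
                    zbuf[y, xt] = depth[y, x]
                    synth[y, xt] = texture[y, x]
        return synth  # pixels that remain empty are occlusion holes, to be filled separately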

² The resolution and frame rate of the “Breakdancers” sequence are 1024 × 768 pixels and 15 frames per second, respectively.