In a multi-view system, multiple cameras capture the same scene. In order to obtain a 3D representation of the scene (using the N-depth/N-texture representation format), a signal is required that describes the geometric features in three dimensions. As previously discussed, the depth of a pixel can be calculated by triangulating corresponding pixels across the views. By assigning a depth value to each pixel of the captured texture image and combining those depth values into an image, a depth image is created.
Corresponding pixels across the views are known as point-correspondences. Hence, in this thesis, the three-dimensional multi-view video acquisition step corresponds to the task of calculating multiple depth images by estimating point-correspondences across the multiple camera views. The calculation of depth by estimating point-correspondences is an ill-posed problem in many situations. For example, a change of illumination across the views increases the signal ambiguity when identifying potentially corresponding points. Additionally, with multiple cameras, internal settings such as contrast can differ, so that corresponding pixels show dissimilar intensity values. This results in an unreliable identification of the point-correspondences and thus in inaccurate depth values. Furthermore, specular reflection occurs when light is reflected in different directions with varying intensity, so that object surfaces appear differently depending on the viewpoint. Such a surface is known as a non-Lambertian surface. In this thesis, we assume that the 3D objects of the video scene do not change their appearance depending on the camera viewpoint, hence we assume Lambertian surfaces. Another problem that may occur in the scene is the appearance of texture-less regions. Specifically, it is difficult to identify corresponding pixels over a region of constant color, thereby resulting in inaccurate depth values. Moreover, in some cases, particular background regions may be visible from a given camera viewpoint but not from a different camera viewpoint. This problem is known as occlusion. In this case, it is not possible to identify point-correspondences across the views, so that the depth values (of the point-correspondences) cannot be calculated.
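The triangulation step described above can be illustrated with a minimal sketch. Assuming a rectified camera pair, the standard relation between depth Z, focal length f, baseline b and disparity d is Z = f·b/d; the function and parameter names below are illustrative and not taken from this thesis.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth of a point from its disparity between two rectified views.

    Standard triangulation relation Z = f * b / d. A non-positive disparity
    corresponds to an unmatched (e.g., occluded) pixel, for which no depth
    can be calculated -- the ill-posed cases discussed in the text.
    """
    if disparity_px <= 0:
        raise ValueError("unmatched pixel: disparity must be positive")
    return focal_length_px * baseline_m / disparity_px

# A pixel shifted by 20 px between cameras 10 cm apart (focal length 1000 px)
# lies at 1000 * 0.10 / 20 = 5 metres from the cameras:
depth = depth_from_disparity(20.0, 1000.0, 0.10)
```

Note that the depth error grows as the disparity shrinks, which is why distant, texture-less, or occluded regions yield the inaccurate depth values mentioned above.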
In this thesis, we have concentrated on a few specific problems dealing with multi-view system design, rather than addressing all of the above. Our aim is to design a depth estimation sub-system that can be efficiently combined with the multi-view depth compression sub-system and the image rendering sub-system. Specifically, considering the multi-view video coding framework, the depth estimation sub-system should:
- calculate an accurate depth of individual pixels,
- estimate depth images that are smooth within object surfaces, so that a high compression ratio (of the depth images) can be obtained,
- present sharp discontinuities along object borders so that a high rendering quality can be obtained,
- yield consistent depth images across the views so that the multi-view compression sub-system can exploit the inter-view redundancy (between depth views).
The last requirement is further detailed in the next section, since it is less evident than the first three requirements.
The transmission of an N-depth/N-texture multi-view video requires efficient compression algorithms. The intrinsic problem when dealing with a multitude of depth and texture video signals is the large amount of data to be transmitted. For example, an independent transmission of 8 views of a typical sequence such as the multi-view “Breakdancers” sequence requires about 10 Mbit/s for the texture and 1.7 Mbit/s for the depth data (at a PSNR of 40 dB)2. This example illustrates the typical specifications of a multi-view sequence and its compression system. In this thesis, we aim at designing a compression system with the following characteristics.
- Spatial resolution: HD-ready to HD, i.e., 1000–1920 pixels per line and 768–1080 lines.
- Frame rate: 15 to 30 frames per second.
- Number of views: 2 to 10, depending on the application.
- Bit rate for depth: 10 to 50% of the total bit rate, depending on the desired rendering quality.
The above list does not imply that multi-view video cannot be applied to other resolutions, frame rates and bit rates. It is merely intended to provide a quick outline of the video systems that will be elaborated in this thesis.
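As a rough consistency check of the example figures above, the following sketch relates them to the depth bit-rate range in the list. Note that reading the cited 10 Mbit/s and 1.7 Mbit/s as aggregate rates for the texture and depth streams is our assumption; the variable names are illustrative.

```python
# Figures from the "Breakdancers" example (assumed to be aggregate rates).
texture_mbps = 10.0   # texture bit rate of the example
depth_mbps = 1.7      # depth bit rate of the example

total_mbps = texture_mbps + depth_mbps      # combined rate to transmit
depth_share = depth_mbps / total_mbps       # fraction spent on depth

print(total_mbps)                   # combined texture + depth rate
print(round(100 * depth_share, 1))  # depth share of the total, in percent
```

The resulting depth share of about 14.5% indeed falls within the 10–50% range stated in the list above.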
As many parts of the multi-view signal are correlated, this aspect can be exploited for compression. In a typical multi-view acquisition system, multiple cameras are employed to capture a single scene from different viewpoints. Because they capture the same scene, the cameras record objects with similar colors and textures across the views, thereby generating highly correlated texture videos. The correlation between the camera views is usually referred to as inter-view correlation. This correlation exists for both texture and depth signals. For each view, whether texture or depth, the succeeding frames are also correlated over time, which is called temporal correlation. Temporal correlation within a single view is already exploited by existing coding standards such as MPEG-2 and H.264/MPEG-4 AVC, which employ motion-compensated transform coding. Let us now further discuss the inter-view correlation. The inter-view correlation can be exploited for compression, for example with a predictive coding technique, since neighboring views show most of the scene from a slightly different viewpoint. However, an accurate prediction of views is a difficult task. For example, illumination and contrast change from view to view, and the system additionally has to deal with occlusions, which constitute new information that is not correlated with a neighboring view. For this reason, predictive inter-view coding is thoroughly addressed in a dedicated chapter.
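To make the idea of inter-view prediction concrete, the sketch below searches a neighboring view for the horizontally shifted block that best predicts a block of the current view, i.e., a disparity-compensated analogue of motion compensation. This is a toy illustration under the rectified-camera assumption; the function and array names are invented, not taken from any coding standard.

```python
import numpy as np

def best_disparity(block, neighbor_strip, max_disp):
    """Find the horizontal shift (disparity) into the neighboring view that
    best predicts `block`, minimising the sum of absolute differences (SAD).
    Illustrative sketch of disparity-compensated inter-view prediction."""
    h, w = block.shape
    best_d, best_sad = 0, float("inf")
    for d in range(max_disp + 1):
        candidate = neighbor_strip[:, d:d + w]
        if candidate.shape[1] < w:      # shift runs off the strip
            break
        sad = np.abs(block.astype(int) - candidate.astype(int)).sum()
        if sad < best_sad:
            best_d, best_sad = d, sad
    return best_d, best_sad

# Toy data: the neighboring view contains the current block shifted by 5 px,
# so the search recovers disparity 5 with a zero prediction residual.
strip = np.tile(np.arange(16), (4, 1))
d, sad = best_disparity(strip[:, 5:9], strip, 8)
```

In a real encoder only the disparity and the (ideally small) residual are transmitted; occlusions and illumination changes are precisely the cases where the residual stays large.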
In the compression system for multi-view video, we have concentrated on the following problems.
- Efficient compression for variable baseline-distances between cameras. The baseline-distance refers to the distance between cameras of the multi-view acquisition setup. Multi-view video captured by an acquisition setup with a large baseline-distance between the cameras exhibits limited inter-view correlation. Therefore, an encoder is desired that adapts to the various baseline-distances of the acquisition setup. One approach, exploited in this thesis, is to selectively predict regions of the multi-view images in accordance with the baseline-distance.
- Low-delay random access to a selected view. An interesting feature of a free-viewpoint video system is the ability for users to quickly access an arbitrarily selected view, i.e., random access to a selected view. To enable a user to render an arbitrarily selected view of the video scene with an acceptable response time (computer-graphics experts call this an interactive frame rate), low-delay random access is necessary. However, temporal and inter-view predictive coding create dependencies between frames and views during the encoding process, which deteriorate the random-access capability. Therefore, we adopt as a design criterion a coding system that facilitates random access with a reasonable response time.
- Efficient compression of depth images. To render synthetic views on a remote display, an efficient transmission and thus compression of the depth images is necessary. Previous work on depth image coding has used transform-coding algorithms derived from JPEG-2000 and MPEG encoders. However, transform coders show a significant shortcoming in representing edges without deterioration at low bit rates. Perceptually, such coders generate ringing artifacts along edges that lead to errors in the rendered images. Therefore, we aim at a depth image encoder that preserves the edges, so that a high-quality rendering can be obtained. A solution explored in this thesis is to exploit the special characteristics of depth images: smooth regions delineated by sharp edges.
- Efficient joint compression of texture and depth data such that the rendering quality is maximized. To perform an efficient compression of the N-depth/N-texture multi-view video streams, an efficient joint compression of both the texture and the depth data is necessary. Previously, the compression of such a data set has been addressed by coding each of the signals individually. However, the influence of texture and depth compression on the 3D rendering was not incorporated in the coding experiments, so that the trade-off in rendering quality was not well considered. We therefore aim at a method that optimally distributes the bit rate over the texture and the depth images, such that the 3D rendering quality is maximized.
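The last requirement, joint bit allocation over texture and depth, can be realised in its simplest form as an exhaustive search over candidate rate pairs under a total budget. The sketch below illustrates this; all rate-quality numbers are invented for illustration and do not correspond to measured results in this thesis.

```python
# Hypothetical operating points per signal: (bit rate in Mbit/s,
# contribution to rendered quality in dB). Numbers are invented.
texture_points = [(2.0, 32.0), (4.0, 36.0), (6.0, 38.0)]
depth_points = [(0.5, 1.0), (1.0, 2.5), (2.0, 3.0)]

def best_allocation(budget_mbps):
    """Pick the texture/depth rate pair that maximises a (toy, additive)
    rendering-quality measure while staying within the total bit-rate
    budget. Returns (quality, texture_rate, depth_rate)."""
    best = None
    for rt, qt in texture_points:
        for rd, qd in depth_points:
            if rt + rd <= budget_mbps:
                quality = qt + qd   # toy model: contributions simply add
                if best is None or quality > best[0]:
                    best = (quality, rt, rd)
    return best
```

For a 6 Mbit/s budget this search selects 4 Mbit/s for texture and 2 Mbit/s for depth. A practical system would replace the additive quality model with the measured rendering distortion, since texture and depth errors interact in the synthesized view.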
In general, rendering involves the read-out and presentation process of images. In a multi-view coding system, image rendering refers to the process of generating synthetic images.
In our case, synthetic images are rendered by combining the multiple texture images with their corresponding depth images. Over the past decade, image rendering has been an active field of research for multimedia applications. However, limited work has addressed the problem of image rendering within a multi-view video coding framework. Specifically, it will be shown in this thesis that image rendering can be integrated into the multi-view coding algorithm by employing the rendering procedure in a predictive fashion. When doing so, the quality of the rendered images should match that of the other, non-synthetic images. This implies that the rendering process should be of high quality, dealing with occlusions and with the change of scene-sampling position when going from one view to the other. The latter aspect requires accurate re-sampling of texture pixels. Summarizing, and more broadly speaking, there are two key aspects of rendering: visualization and compression. Both aspects can be addressed simultaneously when aiming at high-quality rendering. In this thesis, the rendering problem is discussed using the following requirements:
- when used for prediction within the multi-view coding system, the rendering algorithm should synthesize high-quality images through accurate pixel re-sampling, and
- the rendering algorithm should correctly handle occluded pixels, so that even occluded pixels can be efficiently compressed at low bit rate.
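The rendering step underlying both requirements can be sketched in one dimension. Assuming rectified cameras, each pixel is shifted by its disparity d = f·b/Z into the synthetic view; pixels competing for the same target position are resolved by keeping the nearest one (z-buffering), and unfilled positions are disocclusions. The names and the 1-D simplification are ours, for illustration only.

```python
def warp_row(texture_row, depth_row, focal_px, baseline_m):
    """Render one image row of a synthetic view from texture and depth.

    Each source pixel is shifted left by its disparity d = f * b / Z.
    When two pixels map to the same target, the closer one (smaller Z)
    wins (z-buffering); targets that receive no pixel stay None and mark
    disoccluded regions that must be filled by another view.
    """
    out = [None] * len(texture_row)
    zbuf = [float("inf")] * len(texture_row)
    for x, (color, z) in enumerate(zip(texture_row, depth_row)):
        d = int(round(focal_px * baseline_m / z))
        tx = x - d
        if 0 <= tx < len(out) and z < zbuf[tx]:
            out[tx], zbuf[tx] = color, z
    return out

# Four pixels, the right two lying on a nearer surface (Z = 5 vs Z = 10):
row = warp_row(["a", "b", "c", "d"], [10, 10, 5, 5], 100, 0.1)
```

In this toy case the nearer surface occludes pixel "b" in the synthetic view, and the right-hand positions remain None: exactly the occluded pixels that, per the requirement above, must still be handled correctly and compressed efficiently.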
2The resolution and frame rate of the “Breakdancers” sequence are 1024 × 768 pixels and 15 frames per second, respectively.