Using View Interpolation for Low Bit-Rate Video
Richard J. Radke, Peter J. Ramadge, Sanjeev R. Kulkarni, and Tomio Echigo.
IEEE International Conference on Image Processing, Thessaloniki, Greece, October 2001.
Summary
The transmission and reconstruction of video in wireless multimedia poses a much more difficult problem than in a wired setting. Three main issues complicate matters: relatively limited bandwidth (so video data must be reduced in both frame size and frame rate), limited power at the client (so reconstruction algorithms must be simple), and high bit error rates (so robust error correction is required). At high bit rates, block-based motion-compensation approaches to lossy video coding (e.g. MPEG) have proven difficult to improve upon. At very low bit rates, however, the perceptual quality of the reconstructed frames can degrade because few bits are available to encode the residuals between each "real" block of pixels and the block used as its predictor. In this paper, we demonstrate that in some situations, perceptual quality can be better maintained by synthesizing "virtual" images of a scene that match frames from a source video clip. We use this algorithm to interpolate video frames in the time domain, constructing an approximation of the original video from a small amount of information. The algorithm is well suited to the bandwidth and complexity limitations characteristic of wireless multimedia channels. Since the approach is based on estimating functions of the underlying camera motion parameters, it can capture relationships between image correspondences that extend across many (perhaps hundreds of) video frames. Each interpolated image can be rendered using only a few tens of bytes of side information, and the rendering process itself has low computational requirements. We present experimental results demonstrating that for certain types of video, our algorithm can give a significant perceptual improvement over MPEG-4 coded video at the same low bit rate (e.g. 47 kbps).
Our approach is particularly amenable to representing computer-generated video, for which the correspondence and camera motion information required for view synthesis is readily available at render time. The advantage of our method is the use of view morphing, rather than block-based motion compensation, to synthesize intermediate views between reference frames. As we demonstrate, adjacent reference frames need not be temporally close or visually similar. Consequently, in theory, the reference frames can be taken hundreds of frames apart (much further apart, for example, than typical I frames in an MPEG video). We note that while there has been some work on using "virtual views" for video coding, these approaches are generally aimed at the small-baseline case of compressing video teleconferencing data. We emphasize that this scheme is meant to augment, not to supplant, standard video coding algorithms. When a segment of video that is suitable for view interpolation is encountered, the server could transmit the low-overhead side information within the framework of a standard video data stream. A "smart" receiver equipped with our algorithm could take advantage of this information to render the video segment at higher quality, while a "normal" receiver would ignore the side information and produce standard-quality video.
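The low complexity of the rendering step comes from the fact that, once the two reference (anchor) frames have been warped into the geometry of the virtual view using the transmitted correspondence information, the virtual frame is just a per-pixel weighted average. The function below is a minimal sketch of that final blending step, not the authors' implementation; the parameter `t` (the normalized position of the virtual frame between the anchors) is an assumed interface, and the warping itself is taken as already done.

```python
import numpy as np

def blend_warped_anchors(warped0: np.ndarray, warped1: np.ndarray,
                         t: float) -> np.ndarray:
    """Blend two anchor frames that have already been warped into the
    geometry of the virtual view. t in [0, 1] is the virtual frame's
    normalized position between the anchors (hypothetical interface)."""
    assert 0.0 <= t <= 1.0
    blended = (1.0 - t) * warped0.astype(np.float64) \
              + t * warped1.astype(np.float64)
    # Round back to 8-bit pixel values.
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```

Because only multiplies and adds on pixel intensities are involved (no frequency-domain operations), this step is cheap enough for a power-limited client.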
Experimental Results
We applied our interpolation method to a 180-frame, single-shot, 320 × 240 test sequence captured with a digital video camera. The first and last frames (Figure 1) were designated as anchor frames, and correspondence between them was initialized using 66 user-selected matching points and 28 control line segments. The remaining 178 frames were designated as "virtual" frames to be interpolated. The camera motion is roughly linear, though the speed is not uniform. From Figure 1 we can see that the perspective difference between the anchor frames is substantial, and that a block from an intermediate frame would probably have a poor (i.e. high-MSE) match in either of the anchor frames.
Figure 1. Frames 0 and 179 from the test video sequence.
Figure 2 illustrates the original, interpolated, and luminance difference frames for the 90th frame of the test sequence (the difference image has been enhanced to darken the lighter pixels for visibility). From the difference image we can see that the interpolated image aligns quite well with the original frame of video. The errors around edges are largely due to the blurriness of the virtual images introduced by several steps of image resampling. The other major artifacts are the black regions around the borders of the interpolated image, which correspond to areas of the virtual frame visible in neither of the anchor frames. In this example, these areas are not too large and could be filled in by the type of error-concealment algorithms devised for other video compression schemes. The total size of the information required to interpolate is roughly 35.4 KB: 18 KB for the two JPEG-coded anchor frames, 11 KB for the compressed correspondence information, and 36 bytes per virtual frame for its rendering parameters (about 6.4 KB for all 178). Clearly the number of interpolated frames has a negligible effect on the size of the transmitted data. For video in which the camera moves slowly along an approximately piecewise linear path, we therefore expect reasonable performance from a small amount of side information. The total bit rate in this example is 47.22 kbps. The mean PSNR over the entire sequence is 30.2 dB. The PSNR could be increased by reducing the blurriness of the virtual images through postprocessing or by reducing the number of explicit image resampling steps.
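The budget above is easy to check by hand. The sketch below redoes the arithmetic, treating the quoted KB figures as decimal kilobytes and assuming 30 fps for the 180-frame clip (neither convention is stated explicitly, so both are assumptions), and includes the standard PSNR definition for 8-bit images used in such comparisons.

```python
import numpy as np

# Side-information budget for the 180-frame sequence (values from the
# text; "KB" taken as 1000 bytes, 30 fps assumed).
anchor_bytes  = 18_000   # two JPEG-coded anchor frames
corresp_bytes = 11_000   # compressed correspondence information
per_frame_b   = 36       # rendering parameters per virtual frame
n_virtual     = 178

total_bytes = anchor_bytes + corresp_bytes + n_virtual * per_frame_b
total_kb    = total_bytes / 1000                      # ~35.4 KB

duration_s   = 180 / 30.0                             # assumed frame rate
bitrate_kbps = total_bytes * 8 / duration_s / 1000    # ~47.2 kbps

def psnr(orig: np.ndarray, recon: np.ndarray) -> float:
    """Standard PSNR for 8-bit images, in dB."""
    mse = np.mean((orig.astype(np.float64) - recon.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

The per-frame cost (36 bytes) is what makes the scheme insensitive to the number of interpolated frames: doubling the virtual-frame count adds only a few kilobytes to a budget dominated by the anchor frames and correspondence data.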
Figure 2. Left: original frame 90. Middle: interpolated frame 90. Right: luminance difference.
For comparison, we constructed an MPEG-1 video constrained to have the same file size as the data required for view interpolation. At this bit rate, the MPEG blocking and compression artifacts are severe, especially in high-detail, perceptually significant areas of the image. In contrast, the virtual image is well defined and relatively sharper in these areas. This can be seen in the close-ups shown in Figure 3. We are currently investigating comparisons with video codecs targeted at low bit rates, e.g. MPEG-4.
Figure 3. Left: close-up, original frame 90. Middle: close-up, interpolated frame 90. Right: close-up, MPEG-1 video, frame 90.
For web viewing, we present a side-by-side comparison of the original video with the interpolated video. The black segments around the borders of the interpolated video are regions of the intermediate images seen in neither of the anchor frames.
For comparison, here is the original video side-by-side with its MPEG-1 counterpart. Note the severity of the blocking artifacts. To be fair, an MPEG-4 video would likely have better visual fidelity at the same bit rate, and we are currently investigating such comparisons. We also expect that our reconstruction algorithm has lower computational requirements than MPEG variants, since only simple weighted averages of pixel intensities are required, as opposed to frequency-domain operations.