Proceedings Article | 30 January 2003
KEYWORDS: Video, Motion models, Video compression, Image compression, Motion detection, Motion estimation, Image segmentation, Video surveillance, 3D modeling, Chromium
The movement of objects in video sequences comprises a type of spatiotemporal redundancy that can be reduced mathematically to facilitate video compression. This observation holds particularly in the case of periodic motion, for example, bipedal or quadrupedal locomotion or repetitive gestures. Previously published motion detection techniques were based on optical flow, interframe differences represented in terms of transform coefficient perturbations, or changes in eigenvalues between frames in a video sequence. However, such methods have drawbacks that include sensitivity to noise, burdensome computational requirements (e.g., floating point operations), and prohibitive instability in the presence of spatial or temporal interframe discontinuities.
In this paper, we discuss several techniques of motion detection in two-dimensional images of three-dimensional scenes: pointwise tracking of constant-intensity pixels, region-based vector field characterization of apparent motion, and correlation-based detection. In the last category is a technique called Interframe Similarity Matrices (ISMs). ISMs were developed and successfully applied by Yacoob, Black, and Davis to address the challenging problem of detecting human and animal motion in surveillance video sequences. In particular, given an N-frame video sequence, an N×N-element interframe correlation matrix can be constructed and Fourier-transformed to obtain an N/2-element power spectrum of interframe periodicities. Different actions (e.g., walking vs. running) and various actors (e.g., quadruped versus human) tend to be characterized by distinct spatiotemporal spectra, and can often be distinguished from one another.
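The ISM construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, Pearson correlation is assumed as the interframe similarity measure, and the reduction from the N×N matrix to an N/2-element spectrum (mean-subtract each row, take its discrete Fourier transform, average the power over rows) is one plausible choice among several. The test sequence is synthetic.

```python
import cmath
import math


def correlation(a, b):
    """Pearson correlation between two frames given as flat pixel lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den if den > 0 else 0.0


def similarity_matrix(frames):
    """N x N matrix whose (i, j) entry is the correlation of frames i and j."""
    n = len(frames)
    return [[correlation(frames[i], frames[j]) for j in range(n)]
            for i in range(n)]


def power_spectrum(matrix):
    """Row-averaged DFT power, keeping the first N/2 frequency bins.

    Assumption: each row is mean-subtracted, transformed with a direct
    O(N^2) DFT, and the per-row power is averaged over all rows.
    """
    n = len(matrix)
    half = n // 2
    spectrum = [0.0] * half
    for row in matrix:
        mean = sum(row) / n
        centered = [v - mean for v in row]
        for k in range(half):
            coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                        for t, v in enumerate(centered))
            spectrum[k] += abs(coeff) ** 2 / n
    return spectrum


def dominant_period(spectrum, n_frames):
    """Period, in frames, of the strongest non-DC frequency bin."""
    peak = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
    return n_frames / peak


# Synthetic sequence: one bright pixel cycling through 4 positions, 16 frames.
frames = [[1.0 if p == t % 4 else 0.0 for p in range(4)] for t in range(16)]
ism = similarity_matrix(frames)
spectrum = power_spectrum(ism)
period = dominant_period(spectrum, len(frames))
```

For this synthetic input the motion repeats every 4 frames, so the spectral peak falls in the bin corresponding to a 4-frame period. On real video, each frame would be a flattened image region rather than a 4-pixel list, and a fast Fourier transform would replace the direct DFT.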
Since each spectrum can be computed from a sequence of small image regions, it is possible to represent interframe motion by a pixel tagging technique, thus implementing detection, segmentation, and representation. If there are K objects with M pixels per frame having B bits per pixel (bpp) in N frames of a compressed video sequence, and each object is segmented into a region represented by a P-bit tag, then increased compression results if NKP < NKMB, i.e., P < MB. Implementational discussion concerns efficient algorithms for tagging of motion-containing regions, to decrease the representational overhead to several bits per region. We also discuss motion encoding in a compressed format for purposes of efficient extraction of motion parameters from a compressed image, which can support efficient object recognition in high-resolution compressed image sequences.
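The bit-count comparison can be made concrete with a short sketch. A P-bit tag stands in for the M·B bits of pixel data that an object occupies in each frame, so the tagged representation is smaller whenever N·K·P is less than N·K·M·B. The function names and the numeric values below are illustrative assumptions, not figures from the paper.

```python
def raw_bits(n_frames, k_objects, m_pixels, b_bpp):
    """Bits needed to store every object's pixels in every frame: N*K*M*B."""
    return n_frames * k_objects * m_pixels * b_bpp


def tagged_bits(n_frames, k_objects, p_tag_bits):
    """Bits needed when each object region is replaced by a P-bit tag: N*K*P."""
    return n_frames * k_objects * p_tag_bits


def tagging_compresses(m_pixels, b_bpp, p_tag_bits):
    """True when the tag is smaller than the region it replaces (P < M*B)."""
    return p_tag_bits < m_pixels * b_bpp


# Hypothetical values: 30 frames, 2 objects, 64 pixels each at 8 bpp, 16-bit tags.
raw = raw_bits(30, 2, 64, 8)
tagged = tagged_bits(30, 2, 16)
ratio = raw / tagged
```

Because N and K appear on both sides, the outcome depends only on how P compares with M·B, which is why reducing the tag overhead to a few bits per region matters.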