1. Introduction

The ability to quantify the visual quality of an image or video is a crucial step for any system that processes digital media. Algorithms for image quality assessment (IQA) and video quality assessment (VQA) aim to estimate the quality of a distorted image/video in a manner that agrees with the quality judgments reported by human observers. Over the last few decades, numerous IQA algorithms have been developed and shown to perform reasonably well on various image-quality databases. Therefore, a natural approach to VQA is to apply an existing IQA algorithm to each frame of the video and to pool the per-frame results across time. A key advantage of this approach is that it is intuitive, easily implemented, and computationally efficient. However, such a frame-by-frame IQA approach often fails to correlate with the subjective ratings of quality.1,2

1.1. General Approaches to VQA

One reason frame-by-frame IQA performs less well for VQA is that it ignores temporal information, which is important for video quality due to temporal effects such as temporal masking and motion perception.3,4 Many researchers have incorporated temporal information into their VQA algorithms by supplementing frame-by-frame IQA with a model of temporal masking and/or temporal weighting.5–8 For example, in Refs. 6 and 7, motion weighting and temporal derivatives have been used to extend structural similarity (SSIM)9 and visual information fidelity (VIF)10 for VQA. Modern VQA algorithms often estimate video quality by extracting and comparing visual/quality features from localized space-time regions or groups of video frames. For example, in Refs. 11 and 12, video quality is estimated based on spatial gradients, color information, and the interaction of contrast and motion from spatiotemporal blocks; motion-based temporal pooling is employed to yield the quality estimate. In Ref. 
4, video quality is estimated via measures of spatial quality, temporal quality, and spatiotemporal quality for groups of video frames via a three-dimensional (3-D) Gabor filter-bank; the spatial and temporal components are combined into an overall estimate of quality. In Ref. 13, spatial edge features and motion characteristics in localized space-time regions are used to estimate quality. Furthermore, it is known that the subjective assessment of video quality is time-varying,14 and this temporal variation can strongly influence the overall quality ratings.15,16 Models of VQA that consider these effects have been proposed in Refs. 16 to 19. For example, in Ref. 19, Ninassi et al. measured temporal variations of spatial visual distortions in a short-term pooling for groups of frames through a mechanism of visual attention; the global video quality score is estimated via a long-term pooling. In Ref. 16, Seshadrinathan et al. proposed a hysteresis temporal pooling model of spatial quality values by studying the relation between time-varying quality scores and the final quality score assigned by human subjects.

1.2. A Different Approach to VQA: Analysis of Spatiotemporal Slices

Traditional analyses of temporal variation in VQA tend to formulate methods to compute spatial distortion of a standalone frame,5,7 of local space-time regions,12,13 or of groups of adjacent frames4,19 and then measure the changes of spatial distortion over time. An alternative approach, which is the technique we adopt in this paper, is to use spatiotemporal slices (as illustrated in Fig. 1), which allow one to analyze longer-term temporal variations.20,21 In the context of general motion analysis, Ngo et al.21 stated that analyzing the visual patterns of spatiotemporal slices could characterize the changes of motion over time and describe the motion trajectories of different moving objects. 
Inspired by this result, in this paper, we present an algorithm that estimates quality based on the differences between the spatiotemporal slices of the reference and distorted videos. As shown in Fig. 1(a), a video can be envisaged as a rectangular cuboid in which two of the sides represent the spatial dimensions, and the third side represents the time dimension. If one takes slices of the cuboid from front-to-back, then the extracted slices correspond to normal video frames. However, it is also possible to take slices of the cuboid from other directions (e.g., from left-to-right or top-to-bottom) to extract images that contain spatiotemporal information, hereafter called the STS images. As shown in Fig. 1(b), if the cuboid is sliced vertically (left-to-right or right-to-left), then the extracted slices represent time along one dimension and vertical space along the other dimension, hereafter called the vertical STS images. If the cuboid is sliced horizontally (top-to-bottom or bottom-to-top), then the extracted slices represent time along one dimension and horizontal space along the other dimension, hereafter called the horizontal STS images. Figure 2 shows examples of STS images from some typical videos. At one extreme, if the video contains no changes across time (e.g., no motion, as in a static video), then the STS images will contain only horizontal lines [see Fig. 2(a)] or only vertical lines [see Fig. 2(b)]. In both Figs. 2(a) and 2(b), the perfect temporal relationship in the video content manifests as perfect spatial relationship along the dimension that corresponds to time in the STS images. At the other extreme, if the video is rapidly changing (e.g., each frame contains vastly different content), the STS images will appear as random patterns. In both Figs. 2(c) and 2(d), the randomness of temporal content in the video manifests as spatially random pixels along the dimension that corresponds to time in the STS images. 
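The slicing just described is easy to express directly on a video stored as a (frames, height, width) array. The numpy-based sketch below is purely illustrative (the paper specifies no implementation), and the static-video check mirrors the straight-line observation above.

```python
import numpy as np

def extract_sts_images(video):
    """Slice a (T, H, W) video cuboid into spatiotemporal-slice images.

    Front-to-back slices are ordinary frames; fixing a column x gives a
    vertical STS image (time x vertical space), and fixing a row y gives
    a horizontal STS image (time x horizontal space).
    """
    T, H, W = video.shape
    vertical = [video[:, :, x] for x in range(W)]    # W images, each T x H
    horizontal = [video[:, y, :] for y in range(H)]  # H images, each T x W
    return vertical, horizontal

# A static video (no motion): every STS image has identical rows along
# the time dimension, i.e., straight lines.
static = np.tile(np.random.rand(1, 32, 48), (10, 1, 1))
v_slices, h_slices = extract_sts_images(static)
assert len(v_slices) == 48 and v_slices[0].shape == (10, 32)
assert np.allclose(v_slices[0], v_slices[0][0])
```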
The STS images for normal videos [Figs. 2(e) and 2(f)] are generally well structured due to the joint spatiotemporal relationship of neighboring pixels and the smooth frame-to-frame transition. The STS images have been effectively used in a model of human visual-motion sensing,22 in energy models of motion perception,23 and in video motion analysis.20,21 Here, we argue that the temporal variation of spatial distortion is exhibited as spatiotemporal dissimilarity in the STS images, and thus, these STS images can also be used to estimate video quality. To illustrate this, Fig. 3 shows sample STS images from a reference video (reference STS image) and from a distorted video (distorted STS image), where some dissimilar regions are clearly visible in the close-ups. As we will demonstrate, by quantifying the spatiotemporal dissimilarity between the reference and distorted STS images, it is possible to estimate video quality. Figure 4 shows sample STS images from two distorted videos of the LIVE video database24 and the normalized absolute difference images between the reference and distorted STS images. The associated quality estimates are computed by applying peak SNR (PSNR)25 and the most apparent distortion (MAD) algorithm26 to each pair of reference and distorted STS images and by averaging the results across all STS images. The higher the PSNR-based value, the better the video quality; the lower the MAD-based value, the better the video quality. As seen from Fig. 4, these STS-based PSNR and MAD values show promise for VQA, whereas the frame-by-frame MAD fails to predict the qualities of these videos. However, it is important to note that, although PSNR and MAD show promise when applied to the STS images, neither PSNR nor MAD was designed for use with STS images. In particular, PSNR and MAD do not account for the responses of the human visual system (HVS) to temporal changes of spatial distortion. 
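As a concrete illustration of applying a still-image metric to STS images and averaging across all slices, here is a hedged numpy sketch using PSNR; the function name `psnr_over_sts` and the plain averaging are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def psnr(ref, dst, peak=255.0):
    # standard PSNR between two equal-size images
    mse = np.mean((ref.astype(np.float64) - dst.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def psnr_over_sts(ref_video, dst_video):
    """Average PSNR over every vertical and horizontal STS image pair of
    two (T, H, W) videos; name and averaging rule are illustrative."""
    T, H, W = ref_video.shape
    scores = [psnr(ref_video[:, :, x], dst_video[:, :, x]) for x in range(W)]
    scores += [psnr(ref_video[:, y, :], dst_video[:, y, :]) for y in range(H)]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (8, 16, 16)).astype(np.uint8)
noisy = np.clip(ref.astype(int) + rng.integers(-5, 6, ref.shape),
                0, 255).astype(np.uint8)
assert psnr_over_sts(ref, ref) == float('inf')   # identical videos
assert 0.0 < psnr_over_sts(ref, noisy) < float('inf')
```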
Consequently, these STS-based PSNR and MAD measures can yield predictions that correlate poorly with mean opinion score (MOS)/difference mean opinion score (DMOS). Thus, we propose an alternative method of quantifying degradation of the STS images via a measure of correlation and a model of motion perception.

1.3. Proposal and Contributions

In this paper, we propose a VQA algorithm that estimates video quality by measuring spatial distortion and spatiotemporal dissimilarity separately. To estimate perceived video quality degradation due to spatial distortion, both the detection-based strategy and the appearance-based strategy of our MAD algorithm are adapted and applied to groups of normal video frames. A simple model of temporal weighting using optical-flow motion estimation is employed to give greater weights to distortions in the slow-moving regions.5,18 To estimate spatiotemporal dissimilarity, we extend the models of Watson–Ahumada27 and Adelson–Bergen,23 which have been used to measure energy of motion in videos, to the STS images and measure the local variance of spatiotemporal neural responses. The spatiotemporal response is measured by filtering the STS image via one one-dimensional (1-D) spatial filter and one 1-D temporal filter.23,27 The overall estimate of perceived video quality degradation is given by a geometric mean of the spatial distortion and spatiotemporal dissimilarity values. We have named our algorithm according to its two main stages: the first stage estimates video quality degradation based on spatial distortion, and the second stage estimates video quality degradation based on the dissimilarity between spatiotemporal slice images. The final estimate of perceived video quality degradation is a combination of these two measures. The algorithm is an improved and extended version of our previous VQA algorithms presented in Refs. 28 and 29. We demonstrate the performance of this algorithm on various video-quality databases and compare it to some recent VQA algorithms. 
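The two-stage combination just described can be sketched as follows; using equal exponents (a plain geometric mean) is an assumption, since the text specifies only that a geometric mean of the two stage outputs is used.

```python
import math

def combine_stages(spatial_distortion, st_dissimilarity):
    """Overall degradation as the geometric mean of the spatial distortion
    value and the spatiotemporal dissimilarity value. Equal exponents are
    an assumption; the text states only that a geometric mean is used."""
    return math.sqrt(spatial_distortion * st_dissimilarity)

assert combine_stages(4.0, 9.0) == 6.0
assert combine_stages(0.0, 9.0) == 0.0  # zero distortion in one stage -> 0
```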
We also analyze the performance of the algorithm on different types of distortion by measuring its performance on each subset of videos. The major contributions of this paper are as follows. First, we provide a simple yet effective extension of our MAD algorithm for use in VQA. Specifically, we show how to apply MAD’s detection- and appearance-based strategies to groups of video frames and how to modify the combination to take into account temporal masking. This contribution is presented in the first stage of the algorithm. Second, we demonstrate that the spatiotemporal dissimilarity exhibited in the STS images can be used to effectively estimate video quality degradation. We specifically provide in the second stage of the algorithm a technique to quantify the spatiotemporal dissimilarity by measuring spatiotemporal correlation and by applying an HVS-based model to the STS images. Finally, we demonstrate that a combination of the measurements obtained from these two stages is able to estimate video quality quite accurately. This paper is organized as follows. In Sec. 2, we provide a brief review of current VQA algorithms. In Sec. 3, we describe details of the algorithm. In Sec. 4, we present and compare the results of applying the algorithm to different video databases. General conclusions are presented in Sec. 5.

2. Brief Review of Existing VQA Algorithms

In this section, we provide a brief review of current VQA algorithms. Following the classification specified in Ref. 30, current VQA methods can roughly be divided into four classes: (1) those that employ IQA on a frame-by-frame basis, (2) those that estimate quality based on differences between visual features of the reference and distorted videos, (3) those that estimate quality based on statistical differences between the reference and distorted videos, and (4) those that attempt to model one or more aspects of the HVS.

2.1. Frame-by-Frame IQA

As stated in Sec. 
1, the most straightforward technique to estimate video quality is to apply existing IQA algorithms on a frame-by-frame basis. These per-frame quality estimates can then be collapsed across time to predict an overall quality estimate of the video. It is common to find these frame-by-frame IQA algorithms used as a baseline for comparison,24,31 and some authors implement this technique as a part of their VQA algorithms.32,33 However, due to the lack of temporal information, this technique often fails to correlate with the perceived quality measurements obtained from human observers.

2.2. Algorithms Based on Visual Features

An approach commonly used in VQA is to extract spatial and temporal visual features of the videos and then estimate quality based on the changes of these features between the reference and distorted videos.11,12,34–40 One of the earliest approaches to feature-based VQA was proposed by Pessoa et al.34 Their VQA algorithm employs segmentation along with segment-type-specific error measures. Frames of the reference and distorted videos are first segmented into smooth, edge, and texture segments. Various pixel-based and edge-detection-based error measures are then computed between corresponding regions of the reference and distorted videos for both the luminance and chrominance components. The overall estimate of quality is computed via a weighted linear combination of logistic-normalized versions of these error measures, using segment-category-specific weights, collapsed across all segments and all frames. One of the most popular feature-based VQA algorithms, called the video quality metric (VQM), was developed by Pinson and Wolf.11,12 The VQM algorithm employs quality features that capture spatial, temporal, and color-based differences between the reference and distorted videos. The VQM algorithm consists of four sequential steps. The first step calibrates videos in terms of brightness, contrast, and spatial and temporal shifts. 
The second step breaks the videos into subregions of space and time, and then extracts a set of quality features for each subregion. The third step compares features extracted from the reference and distorted videos to yield a set of quality indicators. The last step combines these indicators into a video quality index. Okamoto et al.35 proposed a VQA algorithm that operates based on the distortion of edges in both space and time. Okamoto et al. employ three general features: (1) blurring in edge regions, which is quantified by using the average edge energy difference described in ANSI T1.801.03; (2) blocking artifacts, which are quantified based on the ratio of horizontal and vertical edge distortions to other edge distortions; and (3) the average local motion distortion, which is quantified based on the average difference between block-based motion measures of the reference and distorted frames. The overall video quality is estimated via a weighted average of these three features. In Ref. 36, Lee and Sim propose a VQA algorithm that operates under the assumption that visual sensitivity is greatest near edges and block boundaries. Accordingly, their algorithm applies both an edge-detection stage and a block-boundary detection stage to frames from the reference video to locate these regions. Separate measures of distortion for the edge regions and block regions are then computed between the reference and distorted frames. These two features are supplemented with a gradient-based distortion measure, and the overall estimate of quality is then obtained via a weighted linear sum of these three features averaged across all frames. In the context of packet-loss scenarios, Barkowsky et al.37 designed the TetraVQM algorithm by adding a model of temporal distortion awareness to the VQM algorithm. The key idea in TetraVQM is to estimate the temporal visibility of image areas and, therefore, weight the degradations in these areas based on their durations. 
TetraVQM employs block-based motion estimation to track image objects over time. The resulting motion vectors and motion-prediction errors are then used to estimate the temporal visibility, and this information is used to supplement VQM for estimating the overall quality. In Ref. 39, Engelke et al. demonstrated that significant improvements to VQM and TetraVQM can be realized by augmenting these techniques with information regarding visual saliency. Various features have also been combined via machine learning for improved VQA. In Ref. 8, Narwaria et al. proposed the temporal quality variation (TQV) algorithm, a low-complexity VQA algorithm that employs a machine-learning mechanism to determine the impact of the spatial and temporal factors as well as their interactions on the overall video quality. Spatial quality factors are estimated by a singular value decomposition (SVD)-based algorithm,41 and the temporal variation of spatial quality factors is used as a feature to estimate video quality.

2.3. Algorithms Based on Statistical Measurements

Another class of VQA algorithms has been proposed that estimate quality based on differences in statistical features of the reference and distorted videos.5–7 In Ref. 5, Wang et al. proposed the video structural similarity (VSSIM) index. VSSIM computes various SSIM9 indices at three different levels: the local region level, the frame level, and the video sequence level. At the local region level, the SSIM index of each region is computed for the luminance and chrominance components, with greater weight given to the luminance component. These SSIM indices are weighted by local luminance intensity to yield the frame-level SSIM index. Finally, at the sequence level, the frame SSIM index is weighted by global motion to yield an estimate of video quality. 
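The motion-based weighting common to these approaches can be illustrated generically. The sketch below is not VSSIM's exact rule, just the idea of pooling per-frame scores with motion-dependent weights; `motion_weighted_pool` and the stabilizing constant `c` are hypothetical names introduced here.

```python
import numpy as np

def motion_weighted_pool(frame_scores, motion_mags, c=1.0):
    """Generic motion-weighted temporal pooling of per-frame quality
    scores: frames with larger global motion receive larger weight.
    This illustrates the weighting idea only; actual algorithms (e.g.,
    VSSIM, speed SSIM) define their own weight functions."""
    w = np.asarray(motion_mags, dtype=float) + c  # c avoids zero weights
    s = np.asarray(frame_scores, dtype=float)
    return float(np.sum(w * s) / np.sum(w))

# equal per-frame scores pool to the same value regardless of motion
assert motion_weighted_pool([1.0, 1.0, 1.0], [0.0, 2.0, 5.0]) == 1.0
```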
Another extension of SSIM to VQA, called speed SSIM, was also proposed by Wang and Li.6 There, they augmented SSIM9 with an additional stage that employs Stocker and Simoncelli’s statistical model42 of visual speed perception. The speed perception model is used to derive a spatiotemporal importance weight function, which specifies a relative weighting at each spatial location and time instant. The overall estimate of video quality is obtained by using this weight function to compute a weighted average of SSIM over all space and time. In Ref. 7, Sheikh and Bovik augmented the VIF IQA algorithm10 for use in VQA. VIF estimates quality based on the amount of information that the distorted image provides about the reference image. VIF models images as realizations of a mixture of marginal Gaussian densities of wavelet subbands, and quality is then determined based on the mutual information between the subband coefficients of the reference and distorted images. To account for motion, V-VIF quantifies loss in motion information by measuring deviations in the spatiotemporal derivatives of the videos, the latter of which are estimated by using separable bandpass filters in space and time. Tao and Eskicioglu33 proposed a VQA algorithm that estimates quality based on SVD. Each frame of the reference and distorted videos is divided into blocks, and then the SVD is applied to each block. Differences in the SVDs of corresponding blocks of the reference and distorted frames, weighted by the edge-strength in each block, are used to generate a frame-level distortion estimate. Both luminance and chrominance SVD-based distortions are combined via a weighted sum. These combined frame-level estimates are then averaged across all frames to derive an overall estimate of video quality. Peng et al. proposed a motion-tuned and attention-guided VQA algorithm based on a space-time statistical texture representation of motion. 
To construct the space-time texture representation, the reference and distorted videos are filtered via a bank of 3-D Gaussian derivative filters at multiple scales and orientations. Differences in the energies within local regions of the filtered outputs between the reference and distorted videos are then computed along 13 different planes in space-time to define their temporal distortion measure. This temporal distortion measure is then combined with a model of visual saliency and multiscale SSIM43 (averaged across frames) to estimate quality.

2.4. Algorithms Based on Models of Human Vision

Another widely adopted approach to VQA is to estimate video quality via the use of various models of the HVS.4,44–55 One of the earliest VQA algorithms based on a vision model was developed by Lukas and Budrikis.44 Their technique employs a spatiotemporal visual filter that models visual threshold characteristics on uniform backgrounds. To account for nonuniform backgrounds, the model is supplemented with a masking function based on the spatial and temporal activities of the video. The digital video quality algorithm, developed by Watson et al.,49 also models visual thresholds to estimate video quality. The authors employ the concept of just noticeable differences (JNDs), which are computed via a discrete cosine transform (DCT)-based model of early vision. After sampling, cropping, and color conversion, each block of the videos is transformed to DCT coefficients, converted to local contrast, and filtered by a model of the temporal contrast sensitivity function. JNDs are then measured by dividing each DCT coefficient by its respective visual threshold. Contrast masking is estimated based on the differences between successive frames, and the masking-adjusted differences are pooled and mapped to a visual quality estimate. 
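The JND idea behind such DCT-based models can be illustrated with a toy sketch: transform a block to DCT coefficients and divide coefficient differences by per-frequency visibility thresholds. The orthonormal DCT-II matrix is standard, but the threshold surface below is a placeholder, not the temporal-CSF-derived thresholds of Watson et al.

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT-II matrix C, so dct2(X) = C @ X @ C.T."""
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(
        np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    C[0, :] = np.sqrt(1.0 / N)
    return C

def jnd_map(ref_block, dst_block, thresholds):
    """Coefficient-wise |DCT(ref) - DCT(dst)| in units of the visibility
    thresholds; values above 1 would count as visible (suprathreshold)."""
    C = dct_matrix(ref_block.shape[0])
    d = C @ (ref_block - dst_block).astype(float) @ C.T
    return np.abs(d) / thresholds

# placeholder thresholds that grow with spatial frequency
k = np.arange(8)
thresholds = 1.0 + 0.5 * (k[:, None] + k[None, :])
blk = np.random.rand(8, 8)
assert np.allclose(jnd_map(blk, blk, thresholds), 0.0)  # no distortion
```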
Other HVS-based approaches to VQA have employed various subband decompositions to model the spatiotemporal response properties of populations of visual neurons, which are assumed to underlie the multichannel nature of the HVS.4,45–47,53,55 These algorithms generally compute simulated neural responses to the reference and distorted videos and then estimate quality based on the extent to which these responses differ. The moving picture quality metric algorithm, proposed by Basso et al.,45 employs a spatiotemporal multichannel HVS model by using 17 spatial Gabor filters and two temporal filters on the luminance component. After contrast sensitivity and masking adjustments, distortion is measured within each subband and pooled to yield the quality estimate. The color MPQM algorithm, proposed by Lambrecht,46 extends and applies the MPQM algorithm to both luminance and chrominance components with a reduced number of filters for the chrominance components (nine spatial filters and one temporal filter). The normalization video fidelity metric algorithm, proposed by Lindh and Lambrecht,47 implements a visibility prediction model based on the Teo–Heeger gain-control model.56 Instead of using Gabor filters, the multichannel decomposition is performed by using the steerable pyramid with four scales and four orientations. An excitatory-inhibitory stage and a pooling stage are performed to yield a map of normalized responses. The distortion is measured based on the squared error between normalized response maps generated for the reference and the distorted videos. Masry et al.53 developed a VQA algorithm that employs a multichannel decomposition and a masking model implemented via a separable wavelet transform. A training step was performed on a set of videos and associated subjective quality scores to obtain the masking parameters. Later in Ref. 55, Li et al. 
utilized this algorithm as part of a VQA algorithm that measures and combines detail losses and additive impairments within each frame; optimal parameters were determined by training the algorithm on a subset of the LIVE video database.24 Seshadrinathan and Bovik4 proposed the motion-based video integrity evaluation (MOVIE) algorithm that estimates spatial quality, temporal quality, and spatiotemporal quality via a 3-D subband decomposition. MOVIE decomposes both the reference and distorted videos by using a 3-D Gabor filter-bank with 105 spatiotemporal subbands. The spatial component of MOVIE uses the outputs of the spatiotemporal Gabor filters and a model of contrast masking to capture spatial distortion. The temporal component of MOVIE employs optical-flow motion estimation to determine motion information, which is combined with the outputs of the spatiotemporal Gabor filters to capture temporal distortion. These spatial and temporal components are combined into an overall estimate of video quality.

2.5. Summary

In summary, although previous VQA algorithms have analyzed the effects of spatial and temporal interactions on video quality, none have estimated video quality based on spatiotemporal slices (STS images), which contain important spatiotemporal information on a longer time scale. Earlier related work was performed by Péchard et al.,57 where spatiotemporal tubes rather than slices were used for VQA. Their algorithm employs a segmentation to create spatiotemporal tubes, which are coherent in terms of motion and spatial activity. Similar to our STS images, the spatiotemporal tubes permit analysis of spatiotemporal information on a long time scale, and Péchard et al. demonstrated the superiority of their approach compared to other VQA algorithms on videos containing H.264 artifacts. 
In the following section, we describe our HVS-based VQA algorithm, which employs measures of both motion-weighted spatial distortion and spatiotemporal dissimilarity of the STS images to estimate perceived video quality degradation.

3. Algorithm

The algorithm estimates video quality degradation by using the luminance components of the reference and distorted videos in YUV color space. We represent the luminance component of the reference video as one cuboid and the luminance component of the distorted video as another. The algorithm employs a combination of both spatial and spatiotemporal analyses to estimate the perceived video quality degradation of the distorted video in comparison to the reference video. Figure 5 shows a block diagram of the algorithm, which measures spatial distortion and spatiotemporal dissimilarity separately via two main stages: a spatial distortion stage and a spatiotemporal dissimilarity stage.
Finally, the spatial distortion value and the spatiotemporal dissimilarity value are combined via a geometric mean to yield a single scalar that represents the overall perceived quality degradation of the video. The following subsections provide details of each stage of the algorithm.

3.1. Spatial Distortion

In the spatial distortion stage, we employ and extend our MAD algorithm,26 which was designed for still images, to measure spatial distortion in each group of frames (GOF) of the video. The MAD algorithm is composed of two separate strategies: (1) a detection-based strategy, which computes the perceived distortion due to visual detection, and (2) an appearance-based strategy, which computes the perceived distortion due to visual appearance changes. The perceived distortion due to visual detection is measured by using a masking-weighted block-based mean-squared error in the lightness domain. The perceived distortion due to visual appearance changes is measured by computing the average differences between the block-based log-Gabor statistics of the reference and distorted images. The MAD index of the distorted image is computed via a weighted geometric mean of these two distortion values, where the weight serves to adaptively combine the two strategies based on the overall level of distortion. As described in Ref. 26, for high-quality images, MAD should obtain its value mostly from the detection-based strategy, whereas for low-quality images, MAD should obtain its value mostly from the appearance-based strategy. Thus, an initial estimate of the quality level is required in order to determine the proper weighting of the two strategies. In Ref. 26, the detection-based distortion value served as this initial estimate, and thus the weight is a function of that value. The weighting function’s two free parameters were obtained after training on the A57 image database;58 see Ref. 26 for a complete description of the MAD algorithm. To extend MAD for use with video, we take the luminance components of the videos and perform the following steps (shown in Fig. 6) on each group of consecutive frames:
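MAD's adaptive combination can be sketched as below. The form of the weight (a logistic-like function of the detection-based value with two free parameters) follows Ref. 26, but the parameter values here should be treated as illustrative assumptions.

```python
def mad_combine(d_detect, d_appear, beta1=0.467, beta2=0.130):
    """Weighted geometric mean of the two MAD strategies. The weight
    alpha approaches 1 when the detection-based distortion is small
    (high quality: detection dominates) and decreases as distortion
    grows (appearance dominates). beta1/beta2 are illustrative values,
    not necessarily those trained on the A57 database."""
    alpha = 1.0 / (1.0 + beta1 * d_detect ** beta2)
    return d_detect ** alpha * d_appear ** (1.0 - alpha)

# a pristine image (zero detectable distortion) yields a zero index
assert mad_combine(0.0, 5.0) == 0.0
```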
The video frames are extracted from the luminance components of the reference and distorted videos and are then divided into groups of consecutive frames for both the reference and the distorted video; each frame is identified by its frame (time) index. The following subsections describe the details of each step.

3.1.1. Compute visible distortion map

We apply the detection-based strategy from Ref. 26 to all pairs of respective frames from the reference video and the distorted video. A block diagram of this detection-based strategy is provided in Fig. 7.

Detection-based strategy

As illustrated in Fig. 7, a preprocessing step is first performed by using the nonlinear luminance conversion and spatial contrast sensitivity function filtering. Then, models of luminance and contrast masking are used to compute a local distortion visibility map. Next, this map is weighted by local mean squared error (MSE) to yield a visible distortion map. The specific steps are given below (see Ref. 26 for additional details). First, to account for the nonlinear relationship between digital pixel values and physical luminance of typical display media, the video is converted to a perceived luminance video by raising an affine function of the pixel values to the power of the display gamma divided by 3; the parameters of this conversion are constants specific to the device on which the video is displayed, and their values for 8-bit pixel values and an sRGB display are given in Ref. 26. The division of the exponent by 3 attempts to take into account the nonlinear HVS response to luminance by converting luminance into perceived luminance (relative lightness). Next, the contrast sensitivity function (CSF) is applied by filtering both the reference frame and the error frame in the frequency domain: each frame is transformed via the discrete Fourier transform (DFT), multiplied point-by-point by the DFT-based version of the CSF, and inverse transformed; the CSF is defined by Eq. 
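The preprocessing can be sketched as below. The sRGB-style constants in the lightness conversion and the raised-Gaussian stand-in for the CSF surface are assumptions for illustration; the actual CSF and parameter values are those of Ref. 26.

```python
import numpy as np

def to_perceived_luminance(pixels, b=0.0, k=0.02874, gamma=2.2):
    """Map 8-bit pixel values to relative lightness via
    (b + k*V)**(gamma/3); constants are illustrative sRGB-style values."""
    return (b + k * np.asarray(pixels, dtype=float)) ** (gamma / 3.0)

def csf_filter(frame, csf_surface):
    """Frequency-domain CSF filtering: DFT, point-wise multiply by the
    CSF magnitude surface, inverse DFT (real part)."""
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * csf_surface))

H, W = 64, 64
fy = np.fft.fftfreq(H)[:, None]
fx = np.fft.fftfreq(W)[None, :]
radial = np.hypot(fy, fx)
# placeholder band-pass surface standing in for the CSF of Ref. 26
csf_surface = np.exp(-((radial - 0.15) ** 2) / (2 * 0.08 ** 2))

lum = to_perceived_luminance(np.random.randint(0, 256, (H, W)))
filtered = csf_filter(lum, csf_surface)
assert filtered.shape == (H, W)
```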
(3) in Ref. 26. To account for the fact that the presence of an image can reduce the detectability of distortions, MAD employs a simple spatial-domain measure of contrast masking. First, a local contrast map is computed for the reference frame in the lightness domain by dividing the frame into blocks (with 75% overlap between neighboring blocks) and then measuring the RMS contrast of each block. The RMS contrast of each block is computed as the minimum of the standard deviations of the block’s four subblocks divided by the mean of the block. This block size was chosen because it is large enough to accommodate division into reasonably sized subblocks (to avoid overestimating the contrast around edges), but small enough to yield decent spatial localization (see Appendix A in Ref. 26). This map measures the local RMS contrast in the reference frame and is thus independent of the distortions. Accordingly, we next compute a local contrast map for the error frame to account for the spatial distribution of the distortions in the distorted frame. The error frame is divided into blocks (with 75% overlap between blocks), and then the RMS contrast of each block is computed from the standard deviation of the block. A lightness threshold of 0.5 is employed to account for the fact that the HVS is relatively insensitive to changes in extremely dark regions. The local contrast maps are thus computed for both the reference frame and the error frame for every block, with 75% overlap between neighboring blocks. These two local contrast maps are used to compute a local distortion visibility map. The local distortion visibility map is then point-by-point multiplied by the local MSE to determine a visible distortion map, which is the output of the detection-based strategy. The visible distortion at each block location is the product of the local distortion visibility and the local MSE at that location. Note that in Ref. 
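The overlapping-block RMS contrast computation (minimum subblock standard deviation divided by block mean, 75% overlap) can be sketched as follows; the block size of 16 and step of 4 are illustrative choices, not necessarily those of Ref. 26.

```python
import numpy as np

def rms_contrast_map(frame, bs=16, step=4):
    """Local RMS contrast over bs x bs blocks with 75% overlap
    (step = bs/4): minimum subblock std divided by the block mean.
    bs = 16 is an illustrative block size."""
    H, W = frame.shape
    rows = range(0, H - bs + 1, step)
    cols = range(0, W - bs + 1, step)
    cmap = np.zeros((len(rows), len(cols)))
    h = bs // 2
    for bi, i in enumerate(rows):
        for bj, j in enumerate(cols):
            blk = frame[i:i + bs, j:j + bs]
            # min over the four subblocks avoids overestimating contrast
            # around strong edges
            min_std = min(blk[:h, :h].std(), blk[:h, h:].std(),
                          blk[h:, :h].std(), blk[h:, h:].std())
            m = blk.mean()
            cmap[bi, bj] = min_std / m if m > 0 else 0.0
    return cmap

# a uniform (flat) frame has zero RMS contrast everywhere
assert np.all(rms_contrast_map(np.full((32, 32), 50.0)) == 0.0)
```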
26, the visible distortion map is collapsed into a single scalar that represents the perceived distortion due to visual detection by pooling over all blocks. In the current paper, we do not collapse this map.

Apply to groups of video frames

A visible distortion map is computed from each frame of the reference video and the corresponding frame of the distorted video. The visible distortion maps computed from all frames in a given GOF are then combined via a point-by-point average to yield a GOF-based visible distortion map for that GOF.

3.1.2. Compute statistical difference map

As argued in Ref. 26, when the distortions in the image are highly suprathreshold, perceived distortion is better modeled by quantifying the extent to which the distortions degrade the appearance of the image’s subject matter. The appearance-based strategy measures local statistics of multiscale log-Gabor filter responses to capture changes in visual appearance. Figure 8 shows a block diagram of the appearance-based strategy used to compute a statistical difference map between the reference and the distorted frame.

Appearance-based strategy

The appearance-based strategy employs a computational neural model using a log-Gabor filter-bank (with five scales and four orientations), which implements both even-symmetric (cosine-phase) and odd-symmetric (sine-phase) filters. The even and odd filter outputs are then combined to yield magnitude-only subband values. Sets of log-Gabor subbands are computed for the reference frame and the distorted frame, where each subband is the same size as the frames. The standard deviation, skewness, and kurtosis are then computed for each block (with 75% overlap between blocks) of each log-Gabor subband of the reference frame and the distorted frame. 
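The point-by-point GOF averaging of per-frame maps is straightforward; the GOF size below is an arbitrary choice for illustration (the paper leaves the exact group length to its own definition).

```python
import numpy as np

def gof_average(per_frame_maps, gof_size=8):
    """Point-by-point average of per-frame maps within each group of
    frames (GOF). per_frame_maps: (T, h, w); returns (T//gof_size, h, w).
    gof_size = 8 is an arbitrary illustrative choice."""
    T = (per_frame_maps.shape[0] // gof_size) * gof_size
    grouped = per_frame_maps[:T].reshape(-1, gof_size,
                                         *per_frame_maps.shape[1:])
    return grouped.mean(axis=1)

maps = np.arange(16, dtype=float).reshape(16, 1, 1)
avg = gof_average(maps)
assert avg.shape == (2, 1, 1)
assert avg[0, 0, 0] == 3.5 and avg[1, 0, 0] == 11.5
```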
Let , , and denote the standard deviation, skewness, and kurtosis computed from block of subband . Let , , and denote the standard deviation, skewness, and kurtosis computed from block of subband . The statistical difference map is computed as the weighted combination of the differences in standard deviation, skewness, and kurtosis for all subbands. We denote as the statistical difference map, where the superscript is used to imply that the map is computed from the appearance-based strategy. Specifically, the statistical difference at the location of block is given by where the scale-specific weights (for the finest to coarsest scales, respectively) are chosen the same as in Ref. 26 to account for the HVS’s preference for coarse scales over fine scales (see Ref. 26 for more details).Note that in Ref. 26, the statistical difference map is collapsed into a single scalar that represents the perceived distortion due to visual appearance changes , which is computed via , where the summation is over all blocks. In the current paper, we do not collapse . Apply to groups of video framesLet denote the statistical difference map computed from the ’th frame of the reference video and the ’th frame of the distorted video. The statistical difference maps computed from all frames in the ’th GOF will be , where is the GOF index and is the number of GOFs in the video. These maps are combined via a point-by-point average to yield a GOF-based statistical difference map of the ’th GOF, which is denoted by . 3.1.3.Optical-flow motion estimationBoth the detection-based strategy and the appearance-based strategy were designed for still images. They do not account for the effects of motion on the visibility of distortion. One attribute of motion that affects the visibility of distortion in video is the speed of motion (or the magnitude of motion vectors). 
According to Wang et al.5 and Barkowsky et al.,18 the visibility of distortion is significantly reduced when the speed of motion is large. Conversely, the distortion in slow-moving regions is more visible than the distortion in fast-moving regions. To model this effect of motion, we measure the speed of motion in different regions of the video by using an optical flow algorithm. We specifically apply the optical flow method designed by Lucas and Kanade59 to the reference video to estimate motion vectors. The Lucas–Kanade method assumes that the displacement of the frame contents between two nearby frames is small and approximately constant within a neighborhood (window) of a point under consideration. Thus, the optical-flow motion vector can be assumed to be the same within a window centered at that point, and it is computed by solving the optical-flow equations using the least squares criterion. By using a window of size , for each pair of consecutive frames, we obtain two matrices of motion vectors, and , with respect to the vertical and horizontal directions. The motion magnitude matrix is then computed as . Each element in this matrix represents the motion magnitude of a region defined by an block in the frame. Let denote the motion magnitude matrix computed from the ’th video frame and its successive frame, where denotes the frame index and is the number of frames in the video. For the ’th GOF of the reference video, the motion magnitude matrices computed from all of its frames are averaged to yield an average motion magnitude matrix via Note that the sizes of and are both 64 times smaller than a regular frame because each value in these matrices represents the motion magnitude of an window in the regular frame.
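The windowed Lucas–Kanade computation can be sketched as follows (a Python sketch; the paper follows the standard method of Ref. 59, while the gradient approximation and non-overlapping 8×8 window handling here are simplifying assumptions):

```python
import numpy as np

def motion_magnitude(prev, curr, win=8):
    """Sketch of windowed Lucas-Kanade: solve the least-squares optical-flow
    equations in each win x win window, then return the motion-magnitude
    matrix sqrt(vx^2 + vy^2), which is 64x smaller than the frame for win=8."""
    Iy, Ix = np.gradient(prev)        # spatial derivatives of the reference frame
    It = curr - prev                  # temporal derivative between frames
    h, w = prev.shape
    mag = np.zeros((h // win, w // win))
    for i in range(h // win):
        for j in range(w // win):
            sl = (slice(i*win, (i+1)*win), slice(j*win, (j+1)*win))
            A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
            b = -It[sl].ravel()
            # least-squares solution of A v = b, with v = [vx, vy]
            v, *_ = np.linalg.lstsq(A, b, rcond=None)
            mag[i, j] = np.hypot(v[0], v[1])
    return mag
```

The per-frame matrices returned by such a routine would then be averaged over each GOF, as described above.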
We therefore resize the matrix to the size of the video frame by using nearest-neighbor interpolation to obtain the GOF-based motion magnitude map of the ’th GOF denoted by , where the superscript is used to imply that the map is computed from the motion magnitudes. 3.1.4.Combine maps and compute spatial distortion valueFor each GOF, we have computed the GOF-based visible distortion map , the GOF-based statistical difference map , and the GOF-based motion magnitude map . Now, we extend and apply Eq. (2) to respective regions of the visible distortion map and the statistical difference map to obtain the GOF-based most apparent distortion map. This map is then point-by-point weighted by the motion magnitude map to yield the spatial distortion map of the ’th GOF. We denote of size , the video frame size, as the spatial distortion map of the ’th GOF. Specifically, the value at of the spatial distortion map is computed via The division by accounts for the fact that the distortion in slow-moving regions is generally more visible than the distortion in fast-moving regions. When the value in the motion magnitude map is relatively large (i.e., the corresponding spatial region is fast-moving), the visible distortion value in is relatively small; when the value in the motion magnitude map is relatively small (i.e., the corresponding spatial region is slow-moving), the visible distortion value in is relatively large. When there is no motion in the region, the visible distortion is determined solely by and . Figure 9 shows examples of the first frame (a) and the last frame (b) of a specific GOF of video mc2_50fps.yuv from the LIVE video database.24 The visible distortion map (c), the statistical difference map (d), the motion magnitude map (e), and the spatial distortion map (f) computed for this GOF are also shown.
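The map combination can be sketched as follows. This is only an illustration: the MAD-style geometric blend of the two maps and the (1 + motion) divisor, which leaves the value unchanged when there is no motion, are assumptions standing in for the paper's actual equations (not reproduced in this extraction).

```python
import numpy as np

def spatial_distortion_map(d_detect, d_appear, motion_mag, alpha=0.5):
    """Hedged sketch: blend the detection-based map d_detect and the
    appearance-based map d_appear with a MAD-style adaptive weight alpha,
    then attenuate by motion magnitude. The blend rule and the motion
    divisor are assumptions; the paper extends Eq. (2) of Ref. 26."""
    combined = (d_detect ** alpha) * (d_appear ** (1.0 - alpha))
    return combined / (1.0 + motion_mag)   # zero motion leaves the value unchanged
```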
As seen from the visible distortion map (c) and the statistical difference map (d), at the regions of high visible distortion level (e.g., the train and the numbers in the calendar), the spatial distortion map is weighted more by the statistical difference map. At the regions of low visible distortion level (e.g., the wall background), the spatial distortion map is weighted more by the visible distortion map. As also seen from Figs. 9(c) and 9(d), the region corresponding to the train at the bottom of the frames is more heavily distorted than the other regions. However, due to the fast movement of the train, which is reflected in the bottom of the motion magnitude map (e), the visibility of distortion is reduced, making this region less bright in the spatial distortion map (f). To estimate the spatial distortion value of each GOF, we compute the RMS value of the spatial distortion map. The RMS value of the map of size is given by where the superscript is used to remind readers that the value is computed from the normal frames with two dimensions and . The overall perceived spatial distortion value, denoted by , is computed as the arithmetic mean of all spatial distortion values via Here, is a single scalar that represents the overall perceived quality degradation of the video due to spatial distortion. The lower the value, the better the video quality. A value indicates that the distorted video is equal in quality to the reference video. 3.2.Spatiotemporal DissimilarityIn the distorted video, the distortion impacts not only the spatial relationship between neighboring pixels within the current frame, but also the transition between frames, which can be captured via the use of STS images. The difference between the STS images from the reference and distorted videos is referred to as the spatiotemporal dissimilarity in this paper.
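The STS images referred to above can be extracted from a luminance video cuboid with simple slicing; a minimal sketch, assuming the video is stored as a (frames, height, width) array (the axis ordering is an assumption):

```python
import numpy as np

def vertical_sts(video, x):
    """Vertical STS image at column x: spatial information (height) runs
    vertically and temporal information (frames) runs horizontally."""
    return video[:, :, x].T          # shape (H, T)

def horizontal_sts(video, y):
    """Horizontal STS image at row y: spatial information (width) runs
    vertically and temporal information runs horizontally."""
    return video[:, y, :].T          # shape (W, T)
```

For a video of width W there are W vertical STS images and, for height H, H horizontal ones, matching the counts described in Sec. 3.2.1.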
If the spatiotemporal dissimilarity between the STS images is small, the distorted video has high quality relative to the reference video; if the spatiotemporal dissimilarity between the STS images is large, the distorted video has low quality relative to the reference video. Figure 10 depicts a block diagram of the spatiotemporal dissimilarity stage, which estimates the spatiotemporal dissimilarity between the reference and the distorted video via the following steps:
The following subsections describe the details of each step. 3.2.1.Extract the STS imagesThe reference video and the distorted video are converted to perceived luminance videos and , respectively, using Eq. (3). Let denote the vertical STS image of the video cuboid , where denotes the vertical slice (column) index and denotes the spatial width of the video (measured in pixels). As shown previously in Fig. 1, these vertical STS images contain temporal information in the horizontal direction and spatial information in the vertical direction. Thus, for a video containing frames, will be of size , where denotes the spatial height of the video (measured in pixels). There are such STS images . Similarly, let denote the horizontal STS image of the video cuboid , where denotes the horizontal slice (row) index and denotes the spatial height of the video. These horizontal STS images contain spatial information in the vertical direction and temporal information in the horizontal direction. Thus, for a video containing frames, will be of size , and there are such STS images . The STS images extracted from the reference video [, ] and the STS images extracted from the distorted video [, ] are then used to compute the spatiotemporal dissimilarity values. This procedure consists of two main steps: (1) compute the spatiotemporal correlation maps and (2) compute the spatiotemporal response difference maps. 3.2.2.Compute spatiotemporal correlation mapOne simple way to measure the spatiotemporal dissimilarity is by using the local linear correlation coefficients of the STS images extracted from the reference and the distorted videos. If the distorted video has perfect quality relative to the reference video, these two videos should have high correlation in the STS images; if the distorted video has low quality relative to the reference video, the spatiotemporal correlation will be low. Let denote the linear correlation coefficient computed from block of the two STS images and . 
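A sketch of this local correlation, including the clamping rule defined next in Eq. (17), is given below; the treatment of constant (zero-variance) blocks as perfectly similar is an assumption not stated in the text.

```python
import numpy as np

def st_correlation(block_ref, block_dst, tau=0.9):
    """Local spatiotemporal correlation of two co-located STS blocks:
    Pearson correlation, with rho >= tau (0.9 per the text) counted as
    perfect (1.0) and negative correlations clamped to 0 to mark
    dissimilarity (the clamp target 0 is an assumption)."""
    r = np.corrcoef(block_ref.ravel(), block_dst.ravel())[0, 1]
    if np.isnan(r):      # constant block: treated as perfectly similar (assumption)
        return 1.0
    if r >= tau:
        return 1.0
    if r < 0.0:
        return 0.0
    return float(r)
```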
We define the local spatiotemporal correlation coefficient of these two blocks as As shown in Eq. (17), if the two blocks are highly positively correlated, we set . The threshold value of 0.9 was chosen empirically so that a relatively high positive correlation () is still considered perfect by the algorithm. As we demonstrate in the online supplement to this paper,60 the performance of the algorithm is relatively robust to small changes in this threshold value. On the other hand, if the two blocks are negatively correlated, we set to reflect the dissimilarity between the two blocks. This process is performed on every block of size with 75% overlap between neighboring blocks, yielding a spatiotemporal correlation map denoted by between and . Similarly, we compute a spatiotemporal correlation map denoted by between and . Examples of the correlation maps are shown in Fig. 11(c). The brighter the maps, the higher the spatiotemporal correlation between corresponding regions of the two STS images. 3.2.3.Compute spatiotemporal response difference mapThe spatiotemporal correlation coefficient computed in Sec. 3.2.2 does not account for the HVS’s response to joint spatiotemporal characteristics of the video. Therefore, in addition to measuring the spatiotemporal correlation, we employ a computational HVS model that takes into account joint spatiotemporal perception based on the work of Watson and Ahumada in Ref. 27. This model applies separate 1-D filters to each dimension of the STS images to measure spatiotemporal responses. In Ref. 23, Adelson and Bergen used these spatiotemporal responses to measure energy of motion in a video. Here, we apply the model to the STS images and measure the differences of spatiotemporal responses to estimate video quality. Decompose STS images into spatiotemporally filtered imagesAs stated by Adelson and Bergen in Ref. 
23, the spatiotemporal information presented in the STS images can be captured via a set of spatiotemporally oriented filters. As suggested by Watson and Ahumada,27 these filters can be constructed from two sets of separate 1-D filters (spatial and temporal) with appropriate spatiotemporal characteristics. Following this suggestion, we employ a set of log-Gabor 1-D filters , , as the spatial filters, where the frequency response of each filter is given by where , , and denote the frequency response, center frequency, and bandwidth of the filter , respectively, and is the 1-D spatial frequency. The bandwidth is held constant for all scales to obtain constant filter shape. We specifically choose five scales and a filter bandwidth of approximately two octaves (). These filters are almost the same as the log-Gabor filters used in Ref. 26, but without the orientation information.The two temporal filters , , were selected following the Adelson–Bergen model.23 The impulse response at time instant of each filter is given by where and were chosen to approximate the temporal contrast sensitivity functions reported by Robson,61 which correspond to fast and slow motion, respectively.The STS images are filtered along the spatial dimension by each spatial filter and then along the temporal dimension by each temporal filter to yield a spatiotemporally filtered image, which represents modeled spatiotemporal neural responses. With five spatial filters and two temporal filters, each STS image yields 10 spatiotemporally filtered images. We denote and , and , as the spatiotemporally filtered images obtained by filtering the STS images and from the reference video via spatial filter and temporal filter . These filtered images are computed via where , , denotes the convolution along dimension .Similarly, we denote and as the spatiotemporally filtered images obtained by filtering the STS images and from the distorted video via spatial filter and temporal filter .
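The two 1-D filter families can be sketched as follows. The log-Gabor frequency response follows the standard form; the bandwidth ratio of 0.55 (roughly two octaves) and the Adelson–Bergen parameters k and n are assumptions, since the extraction above has lost the paper's numeric values.

```python
import math
import numpy as np

def log_gabor_1d(freqs, f0, sigma_ratio=0.55):
    """1-D log-Gabor frequency response
    G(f) = exp(-ln(f/f0)^2 / (2 ln(sigma_ratio)^2)); sigma_ratio ~ 0.55
    gives an approximately two-octave bandwidth (assumption)."""
    freqs = np.asarray(freqs, dtype=float)
    out = np.zeros_like(freqs)
    nz = freqs > 0                     # log-Gabor filters have no DC response
    out[nz] = np.exp(-np.log(freqs[nz] / f0) ** 2 /
                     (2.0 * math.log(sigma_ratio) ** 2))
    return out

def adelson_bergen(t, k=10.0, n=5):
    """Adelson-Bergen temporal impulse response
    h(t) = (k t)^n exp(-k t) [1/n! - (k t)^2/(n+2)!]; n = 3 and n = 5 give
    the faster and slower filters (k and the exact n values are assumptions)."""
    kt = k * np.asarray(t, dtype=float)
    return kt**n * np.exp(-kt) * (1.0 / math.factorial(n)
                                  - kt**2 / math.factorial(n + 2))
```

Each STS image would then be convolved with a spatial filter along its spatial axis and a temporal filter along its temporal axis to form the 10 filtered images per slice.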
Then, the spatiotemporal response differences and are defined as the absolute difference of the spatiotemporally filtered images via Although the proper technique of estimating video quality based on the response differences remains an open research question, as discussed next, we employ a simple yet effective measure based on the local standard deviation of the spatiotemporal response differences. Compute log of response difference mapWe compute the local mean and standard deviation of the spatiotemporal response differences in a block-based fashion. Let and denote the local mean and standard deviation computed from block of the response difference . Let and denote the local mean and standard deviation computed from block of the response difference . The adjusted standard deviation of block of the error-filtered image at spatial frequency index and temporal frequency index is given by where is a threshold value. When the mean value of block is small, there is no dissimilarity between the regions at the location of block in the STS images; when the mean value of block is large enough, the dissimilarity is approximately measured by the standard deviation of block in the response differences.This process is performed on every block of size with 75% overlap between neighboring blocks, yielding maps of adjusted standard deviation and . The log of response difference maps and are computed as a natural logarithm of a weighted sum of all the maps and , respectively, as follows: where the weights were chosen following Ref. 26 to account for the HVS’s preference for coarse scales over fine scales. The addition of 1 is to prevent the logarithm of zero, and is a scaling factor to enlarge the adjusted variance. Examples of the log of response difference maps are shown in Fig. 11(d). 
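The adjusted standard deviation and the log map just described can be sketched as below; the threshold tau and scaling factor zeta are unspecified in this extraction, so their values here are placeholders.

```python
import numpy as np

def adjusted_std(mean_map, std_map, tau=0.1):
    """Adjusted standard deviation: zero where the local mean of the
    response difference falls below the threshold tau (tau's value is an
    assumption), otherwise the local standard deviation."""
    return np.where(mean_map > tau, std_map, 0.0)

def log_response_difference(adj_std_maps, weights, zeta=10.0):
    """ln(1 + zeta * weighted sum of the adjusted-std maps over scales);
    the +1 prevents log(0), and zeta (value assumed here) enlarges the
    adjusted variance, as described in the text."""
    acc = sum(w * m for w, m in zip(weights, adj_std_maps))
    return np.log1p(zeta * acc)
```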
The brighter the maps, the greater the difference in spatiotemporal responses between corresponding regions of the two STS images.3.2.4.Compute spatiotemporal dissimilarity valueThe spatiotemporal correlation map and the log of response difference map are combined into a spatiotemporal dissimilarity map via a point-by-point multiplication. Let denote the RMS value of the spatiotemporal dissimilarity map of size , where is the column (vertical slice) index of the vertical STS images. Let denote the RMS value of the spatiotemporal dissimilarity map of size , where is the row (horizontal slice) index of the horizontal STS images. Specifically, these RMS values are computed as follows: where and are the spatial width and height of the video frame, respectively, and is the number of frames in the videos. The superscripts and are used to remind readers about the two dimensions of the STS images that are used to compute the values. The spatiotemporal dissimilarity value, denoted by , between the reference and the distorted video is given by Here, is a single scalar that represents the overall perceived video quality degradation due to spatiotemporal dissimilarity. The lower the value, the better the video quality. A value of indicates that the distorted video has perfect quality relative to the reference video. Figure 11 shows the correlation maps , the log of response difference maps , and the spatiotemporal dissimilarity maps computed from two pairs of specific horizontal STS images. The brighter values in the spatiotemporal dissimilarity maps in Fig. 11(e) denote the corresponding spatiotemporal regions of greater dissimilarity. As observed from the video mc2_50fps.yuv (LIVE), the spatial distortion occurs more frequently in the middle frames. These middle frames are also heavily distorted in nearly every spatial region. This fact is well-captured by the spatiotemporal dissimilarity map in Fig. 11(e) (left). As observed in Fig.
11(e) (left), the dissimilarity map is brighter in the middle and along the entire spatial dimension. In video PartyScene_dst_09.yuv (CSIQ), the spatial distortion that occurs in the center of the video is smaller than the distortion in the surrounding area. This fact is also reflected in the spatiotemporal dissimilarity map in Fig. 11(e) (right), where the spatiotemporal dissimilarity map shows brighter surrounding regions compared to the center regions across the temporal dimension. 3.3.Combine Spatial Distortion and Spatiotemporal Dissimilarity ValuesFinally, the overall estimate of perceived video quality degradation, denoted by , is computed from the spatial distortion value and the spatiotemporal dissimilarity value . Specifically, is computed as a geometric mean of and , which is given by Here, is a single scalar that represents the overall perceived quality degradation of the video. The smaller the value, the better the video quality. A value of indicates that the distorted video is equal in quality to the reference video. Note that the values of and occupy different ranges. Thus, the use of a geometric mean in Eq. (33) allows us to combine these values without the need for custom weights (which would be required when using an arithmetic mean). Other combinations are also possible, e.g., using a weighted geometric mean with possibly adaptive weights. However, our preliminary attempts to select such weights have not yielded significant improvements (see also Sec. 4.3.4). 4.ResultsIn this section, we analyze the performance of the algorithm in predicting subjective ratings of quality on three publicly available video-quality databases. We also compare the performance of with other quality assessment algorithms. 
4.1.Video Quality DatabasesTo evaluate the performance of and other quality assessment algorithms, we used the following three publicly available video-quality databases that have multiple types of distortion: 4.1.1.LIVE video databaseThe LIVE video database24 developed at the University of Texas at Austin contains 10 reference videos and 150 distorted videos (15 distorted versions of each reference video). All videos are in raw YUV420 format with a resolution of , in duration, and at frame rates of 25 or 50 fps. There are four distortion types in this database: MPEG-2 compression (MPEG-2), H.264 compression (H.264), simulated transmission of H.264-compressed bit-streams through error-prone IP networks (IPPL), and simulated transmission of H.264-compressed bit-streams through error-prone wireless networks (WLPL). Three or four levels of distortion are present for each distortion type. 4.1.2.IVPL video databaseThe IVPL HD video database62 developed at the Chinese University of Hong Kong consists of 10 reference videos and 128 distorted videos. All videos in this database are in raw YUV420 format with a resolution of , in duration, and at a frame rate of 25 fps. There are four types of distortion in this database: Dirac wavelet compression (DIRAC, three levels), H.264 compression (H.264, four levels), simulated transmission of H.264-compressed bit-streams through error-prone IP networks (IPPL, four levels), and MPEG-2 compression (MPEG-2, three levels). To reduce the computation time, we rescaled the videos to using FFMPEG software64 with its default configuration. 4.1.3.CSIQ video databaseThe CSIQ video database63 developed by the authors at Oklahoma State University consists of 12 reference videos and 216 distorted videos. All videos in this database are in raw YUV420 format with a resolution of , a duration of 10 s, and frame rates of 24, 25, 30, 50, and 60 fps.
Each reference video has 18 distorted versions with six types of distortion; each distortion type has three different levels. The distortion types consist of four video compression distortion types [Motion JPEG (MJPEG), H.264, HEVC, and wavelet compression using the SNOW codec64] and two transmission-based distortion types [packet-loss in a simulated wireless network (WLPL) and additive white Gaussian noise (AWGN)]. The experiment was conducted following the SAMVIQ testing protocol65 with 35 subjects. 4.2.Algorithms and Performance MeasuresWe compared with PSNR25 and recent full-reference video quality assessment algorithms for which code is publicly available (VQM,12 MOVIE,4 and TQV8) on the three video databases. PSNR was applied on a frame-by-frame basis, VQM and MOVIE were applied using their default implementations and settings, and TQV was applied using its original training parameters. For , we used a GOF size of . Before evaluating the performance of each algorithm on each video database, we applied a four-parameter logistic transform to the raw predicted scores, as recommended by the Video Quality Experts Group (VQEG) in Ref. 31. The four-parameter logistic transform is given by where denotes the raw predicted score and , , , and are free parameters that are selected to provide the best fit of the predicted scores to the subjective rating scores.Following VQEG recommendations in Ref. 31, we employed the Spearman rank-order correlation coefficient (SROCC) to measure prediction monotonicity, and employed the Pearson linear correlation coefficient (CC) and the root mean square error (RMSE) to measure prediction accuracy. The prediction consistency of each algorithm was measured by two additional criteria: the outlier ratio (OR)5 and the outlier distance (OD).26 OR is the ratio of the number of false scores predicted by the algorithm to the total number of scores.
A false score is defined as the transformed score lying outside the 95% confidence interval of the associated subjective score.5 In addition, OD indicates how far the outliers fall outside the confidence interval. The OD is measured by the total distance from all outliers to their closest edge points of the corresponding confidence interval.26 4.3.Overall PerformanceThe performance of each algorithm on each video database is shown in Table 1 in terms of the five criteria (SROCC, CC, RMSE, OR, and OD). The best-performing algorithm is bolded, and the second best-performing algorithm is italicized. These data indicate that is the best-performing algorithm on all three video databases in terms of all five evaluation criteria. The performances of and are also noteworthy. Table 1Performances of ViS3 and other algorithms on the three video databases. The best-performing algorithm is bolded and the second best-performing algorithm is italicized. Note that ViS3 is the best-performing algorithm on all three databases.
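The evaluation protocol of Sec. 4.2 can be sketched as follows. The exact parameterization of the four-parameter logistic and the availability of per-video confidence-interval half-widths are assumptions; Ref. 31 specifies the actual protocol.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic4(x, b1, b2, b3, b4):
    """A common four-parameter logistic of the VQEG type (the paper's
    exact parameterization is not reproduced in this extraction)."""
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / b4))

def fit_logistic(pred, dmos):
    """Fit the transform so raw predicted scores best match the DMOS."""
    p0 = [max(dmos), min(dmos), float(np.mean(pred)), float(np.std(pred)) or 1.0]
    params, _ = curve_fit(logistic4, pred, dmos, p0=p0, maxfev=20000)
    return logistic4(np.asarray(pred, dtype=float), *params)

def performance(pred_t, dmos, ci_half):
    """SROCC, CC, RMSE, outlier ratio (OR), and outlier distance (OD).
    pred_t holds logistic-transformed scores; ci_half holds the per-video
    95% confidence-interval half-widths of the subjective scores."""
    pred_t = np.asarray(pred_t, float)
    dmos = np.asarray(dmos, float)
    ci_half = np.asarray(ci_half, float)
    srocc = spearmanr(pred_t, dmos).correlation
    cc = pearsonr(pred_t, dmos)[0]
    rmse = float(np.sqrt(np.mean((pred_t - dmos) ** 2)))
    err = np.abs(pred_t - dmos)
    out = err > ci_half                        # false scores (outliers)
    od = float(np.sum(err[out] - ci_half[out]))
    return srocc, cc, rmse, float(out.mean()), od
```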
In terms of prediction monotonicity (SROCC), is the best-performing algorithm on all three databases. On the LIVE and CSIQ databases, and TQV are the two best-performing algorithms. On the IVPL database, and MOVIE are the two best-performing algorithms. A similar trend in performance is observed in terms of prediction accuracy (CC and RMSE). In terms of prediction consistency measured by OR, on the LIVE database, three algorithms (MOVIE, TQV, and ) have an OR of zero, which indicates that they do not yield any outliers. On the IVPL database, both and VQM have only one outlier. On the CSIQ database, and MOVIE are the two algorithms with the fewest outliers. In terms of OD, on the LIVE database, three algorithms (MOVIE, TQV, and ) have an OD of zero because they do not have any outliers. On the IVPL database, MOVIE and VQM have the smallest OD. Although, like VQM, yields only one outlier on the IVPL database, has a larger OD because this outlier lies farther from its confidence interval. This indicates that has a weakness on the IPPL distortion, to which the outlier belongs. Furthermore, on the CSIQ database, and TQV yield the smallest OD values. Observe that and yield different relative performances depending on the database. shows better predictions than on the LIVE and IVPL databases. However, shows better predictions than on the CSIQ database. Generally, shows higher SROCC and CC and lower RMSE, OR, and OD than either or alone. Nonetheless, it may be possible to combine and in an adaptive fashion for even better prediction performance, and such an adaptive combination remains an area for future research. The scatter-plots of logistic-transformed values versus DMOS on the three databases are shown in Fig. 12. The plots show a highly correlated trend between the logistic-transformed values and the DMOS values.
For all three databases, the predictions are homoscedastic; i.e., there are generally no subpopulations of videos/distortion types for which yields smaller or larger residual variance in the predictions. These residuals are used for an analysis of statistical significance in Sec. 4.3.3. 4.3.1.Performance on individual types of distortionWe measured the performance of and other algorithms on individual types of distortion for videos from the three databases. For this analysis, we applied the logistic transform function to all predicted scores of each database, then divided the transformed scores into separate subsets according to the distortion types, and then measured the performance criteria in terms of SROCC and CC for each subset. Table 2 shows the resulting SROCC and CC values. Table 2Performances of ViS3 and other quality assessment algorithms measured on different types of distortion on the three video databases. The best-performing algorithm is bolded and the second best-performing algorithm is italicized.
In general, VQM, MOVIE, and all perform well on the WLPL distortion; these three algorithms show competitive and consistent performance on the WLPL distortion for both the LIVE and CSIQ databases. For the H.264 compression distortion, and MOVIE perform well and consistently across all subsets of H.264 videos on all three databases. and MOVIE are also competitive on the MPEG-2 compression distortion and the IPPL distortion on both the LIVE and IVPL databases. In particular, on the LIVE database, has the best performance on the WLPL distortion; VQM and have the best performance on the IPPL distortion; , MOVIE, and TQV are the three best-performing algorithms on the H.264 compression distortion; and TQV and MOVIE are the two best-performing algorithms on the MPEG-2 compression distortion. The low performance of the algorithm on H.264 and MPEG-2 compression types in the LIVE video database is due to the outliers corresponding to specific videos, as shown in Fig. 13; the outliers are marked by the red square markers. For H.264, the outliers correspond to the video riverbed, in which the water’s movement significantly masks the blurring imposed by the compression. However, underestimates this masking and, thus, overestimates the DMOS. For MPEG-2, the sunflower seeds in the video sunflower generally impose significant masking of the MPEG-2 blocking artifacts. However, there are select frames in this video in which the blocking artifacts become highly visible (owing perhaps to failed motion compensation), yet does not accurately capture the visibility of these artifacts and, thus, underestimates the DMOS. These types of interactions between the videos and distortions are issues that certainly warrant future research. On the IVPL database, yields the best performance on three types of distortion (DIRAC, H.264, and MPEG-2); yields the second best performance on the IPPL distortion, on which MOVIE is the best-performing algorithm.
VQM and MOVIE are the second best-performing algorithms on the MPEG-2 distortion. PSNR, VQM, and MOVIE are also competitive on both the DIRAC and H.264 distortions. On the CSIQ database, TQV and are the two best-performing algorithms on the H.264 compression distortion; and MOVIE are the two best-performing algorithms on three types of distortion (WLPL, SNOW, and HEVC); MOVIE and TQV are the two best-performing algorithms on the MJPEG distortion. On the AWGN distortion, and TQV are competitive with PSNR, which is well known to perform well for white noise. Generally, excels on the H.264 compression distortion and the wavelet-based compression distortions (DIRAC, SNOW), and , VQM, and MOVIE excel on the WLPL distortion. also performs well on the MPEG-2, HEVC, and AWGN distortions. However, does not perform well on the MJPEG compression distortion compared to MOVIE and TQV. 4.3.2.Performance with different GOF sizesAs we mentioned in Sec. 3.1, for , the size of the GOF used in Eqs. (9), (11), and (12) is a user-selectable parameter (). The results presented in the previous subsection were obtained with a GOF size of . To investigate how the prediction performance varies with different GOF sizes, we computed SROCC and CC values for and using values of ranging from 4 to 16. The results of this analysis are listed in Table 3. Table 3Performances of ViS3 on the three video databases with different group of frames (GOF) sizes. Note that ViS3 is robust to changes in the GOF size on all three databases.
As shown in the upper portion of Table 3, the performance of tends to increase with larger values of . This trend may partially be attributable to the fact that a larger GOF size can give rise to a more accurate estimate of the motion and, thus, perhaps a more accurate account of the temporal masking. Nonetheless, as demonstrated in the lower portion of Table 3, is relatively robust to small changes in . The choice of generally provides good performance on all three databases. However, the optimal choice of remains an open research question. 4.3.3.Statistical significanceTo assess the statistical significance of differences in performances of and other algorithms, we used an F-test to compare the variances of the residuals (errors) of the algorithms’ predictions.66 If the distribution of residuals is sufficiently Gaussian, an F-test can be used to determine the probability that the residuals are drawn from different distributions and are thus statistically different. To determine whether the residuals of an algorithm have Gaussian distributions, we performed the Jarque–Bera (JB) test (see Ref. 58) on the residuals to measure the JBSTAT value. If the JBSTAT value is smaller than a critical value, then the distribution of residuals is significantly Gaussian. If the JBSTAT value is greater than the critical value, then the distribution of residuals is not Gaussian. The JB test results show that for the LIVE database, all the algorithms pass the JB test and their residuals have Gaussian distributions. On the IVPL database, only PSNR does not pass the JB test. On the CSIQ database, only VQM and pass the JB test. We performed an F-test with 95% confidence to compare the residual variances of the algorithms whose distributions of residuals are significantly Gaussian. If the variances are significantly different, we conclude that the two algorithms are significantly different. The smaller the variance of residuals, the better the prediction performance of the algorithm. 
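The significance procedure just described can be sketched as follows; the two-tailed construction of the F-test p-value and the use of the JB p-value at alpha = 0.05 as the normality criterion are assumptions consistent with the text.

```python
import numpy as np
from scipy import stats

def compare_residuals(res_a, res_b, alpha=0.05):
    """Sketch of the significance test: Jarque-Bera normality check on each
    residual set, then a two-tailed F-test on the residual variances.
    Returns None when either set fails the JB test (non-Gaussian residuals),
    else True/False for a significant difference in variances."""
    for r in (res_a, res_b):
        if stats.jarque_bera(r)[1] < alpha:    # p-value below alpha: not Gaussian
            return None
    f = np.var(res_a, ddof=1) / np.var(res_b, ddof=1)
    dfa, dfb = len(res_a) - 1, len(res_b) - 1
    p = 2.0 * min(stats.f.cdf(f, dfa, dfb), stats.f.sf(f, dfa, dfb))
    return bool(p < alpha)    # True: variances (and thus performances) differ
```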
Table 4 shows the F-test results between each pair of algorithms whose residuals passed the JB test. A "0" entry indicates that the residual variances of the two algorithms are not significantly different. A "+" sign indicates that the algorithm in the column has significantly smaller residual variance than the algorithm in the row and, therefore, better performance. A "−" sign indicates that the algorithm in the column has significantly larger residual variance than the algorithm in the row and, therefore, worse performance. Table 4 Statistical significance relationship between each pair of algorithms on the three video databases. A "0" entry indicates that the residual variances of the algorithm in the column and the algorithm in the row are not significantly different. A "+" sign indicates that the algorithm in the column has significantly smaller residual variance than the algorithm in the row. A "−" sign indicates that the algorithm in the column has significantly larger residual variance than the algorithm in the row.
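A table of this form can be built by applying a two-sided F-test to every (row, column) pair of residual sets; this is a sketch under the same SciPy assumption as above, and `significance_table` is an illustrative name rather than code from the paper:

```python
import numpy as np
from scipy import stats

def significance_table(residuals, alpha=0.05):
    """Build a pairwise +/0/- table from a dict mapping algorithm
    names to residual arrays. Entry [row][col] is '+' if the column
    algorithm has significantly smaller residual variance than the
    row algorithm, '-' if significantly larger, and '0' otherwise."""
    names = list(residuals)
    table = {row: {} for row in names}
    for row in names:
        for col in names:
            vr = np.var(residuals[row], ddof=1)
            vc = np.var(residuals[col], ddof=1)
            f = vc / vr
            dfc, dfr = len(residuals[col]) - 1, len(residuals[row]) - 1
            # two-sided p-value for the variance-ratio test
            p = 2 * min(stats.f.cdf(f, dfc, dfr), stats.f.sf(f, dfc, dfr))
            if p >= alpha:
                table[row][col] = "0"
            else:
                table[row][col] = "+" if vc < vr else "-"
    return table
```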
As seen from Table 4, on the LIVE database, the variance of residuals yielded by PSNR is significantly larger than the variances yielded by the other algorithms; therefore, PSNR is significantly worse than the other algorithms. The differences between the residuals of ViS3 and those of VQM, MOVIE, and TQV are not statistically significant. On the IVPL database, the variance of residuals yielded by TQV is significantly larger than the variances yielded by VQM, MOVIE, and ViS3; therefore, VQM, MOVIE, and ViS3 are significantly better than TQV on this database. On both the IVPL and CSIQ databases, the variance of residuals yielded by VQM is significantly larger than that yielded by ViS3; therefore, ViS3 is significantly better than VQM on these databases. Although ViS3 is not significantly better than MOVIE on any of the three databases, it should be noted that MOVIE is not significantly better than VQM on any of the three databases, whereas ViS3 is significantly better than VQM on the IVPL and CSIQ databases. Moreover, MOVIE requires more computation time than ViS3. Specifically, using a modern computer (Intel Quad Core at 2.66 GHz, 12 GB RAM DDR2 at 6400 MHz, Windows 7 Pro 64-bit, MATLAB® R2011b) to estimate the quality of a 10-s video (300 frames total), MOVIE requires considerably more time, whereas basic MATLAB® implementations of VQM and ViS3 are substantially faster (approximately 7 min for ViS3).

4.3.4. Summary, limitations, and future work

Through testing on various video-quality databases, we have demonstrated that ViS3 performs well in predicting video quality. It not only excels at VQA for whole databases with varying types and levels of distortion, but also performs well on videos with a specific type of distortion. Our performance evaluation demonstrates that ViS3 is either better than or statistically tied with current state-of-the-art VQA algorithms.
A statistical analysis also shows that ViS3 is significantly better than PSNR, VQM, and TQV in predicting the quality of videos from specific databases. Yet, ViS3 is not without its limitations. One important limitation concerns the potentially large memory requirements for long videos. The STS images of a long video can require a prohibitively large width or height for the dimension corresponding to time. In this case, one solution would be to divide the video into smaller chunks across time, with each chunk spanning at most 600 frames. The final result can then be estimated as the mean of the ViS3 values computed for each chunk. Another limitation of ViS3 is that it currently takes into account only the luminance component of the video. Further improvements may be realized by also considering degradations in chrominance. Another possible improvement might be realized by employing a more accurate pooling model of the spatiotemporal responses used in the spatiotemporal dissimilarity stage. Equation (33) gives the same weight to the spatial distortion and spatiotemporal dissimilarity values. However, it would seem possible to adaptively combine the two values in a way that more accurately reflects the visual contribution of each degradation to the overall quality degradation. Our preliminary attempts to select the weights based on the video motion magnitudes, the difference in motion, or the variance of spatial distortion have not yielded significant improvements. We are currently conducting a psychophysical study to better understand if and how the spatial distortion and spatiotemporal dissimilarity values should be adaptively combined. The incorporation of visual-attention modeling is another avenue for potential improvement. Some studies have shown that visual attention can be useful for quality assessment (e.g., Refs. 39, 67, and 68; see also Ref. 69).
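The chunking workaround for long videos described above can be sketched as follows; `score_fn` is a hypothetical stand-in for computing the ViS3 score of a single chunk:

```python
import numpy as np

def chunked_score(frames, score_fn, chunk_len=600):
    """Estimate the quality of a long video by splitting it into
    temporal chunks (bounding the time dimension of the STS images),
    scoring each chunk independently, and averaging the per-chunk
    scores (mean pooling)."""
    scores = [score_fn(frames[i:i + chunk_len])
              for i in range(0, len(frames), chunk_len)]
    return float(np.mean(scores))
```

Note that mean pooling across chunks is the simple strategy suggested above; temporal pooling models such as hysteresis (Ref. 16) could in principle replace the mean.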
One possible technique for incorporating such data into ViS3 would be to weight the maps generated during the computation of both the spatial distortion and the spatiotemporal dissimilarity based on estimates of visual gaze data or regions of interest in both space and time. Another interesting avenue of future research would be to compare the spatial distortion and spatiotemporal dissimilarity maps with gaze data to identify any existing relationships and, perhaps, determine techniques for predicting gaze data based on the STS images.

5. Conclusions

In this paper, we have presented a VQA algorithm, ViS3, that analyzes various two-dimensional space-time slices of the video to estimate perceived video quality degradation via two different stages. The first stage of the algorithm adaptively applies two strategies from the MAD algorithm to groups of video frames to estimate perceived quality degradation due to spatial distortion. An optical-flow-based weighting scheme is used to model the effect of motion on the visibility of distortion. The second stage of the algorithm measures spatiotemporal correlation and applies an HVS-based model to the STS images to estimate perceived quality degradation due to spatiotemporal dissimilarity. The overall estimate of perceived video quality degradation is given by the geometric mean of the two measurements obtained from the two stages. The algorithm has been shown to perform well in predicting the quality of videos from the LIVE database,24 the IVPL database,62 and the CSIQ database.63 Statistically significant improvements in predicting subjective ratings are achieved in comparison to a variety of existing VQA algorithms. The online supplement to this paper is available in Ref. 60.

Acknowledgments

This material is based upon work supported by the National Science Foundation Awards 0917014 and 1054612, and by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0015.

References

B. Girod, Digital Images and Human Vision, MIT Press, Cambridge, Massachusetts
(1993). Google Scholar
A. M. EskiciogluP. S. Fisher,
“Image quality measures and their performance,”
IEEE Trans. Commun., 43
(12), 2959
–2965
(1995). http://dx.doi.org/10.1109/26.477498 IECMBT 0090-6778 Google Scholar
B. A. Wandell, Foundations of Vision, Sinauer Associates, Sunderland, Massachusetts
(1995). Google Scholar
K. SeshadrinathanA. Bovik,
“Motion tuned spatio-temporal quality assessment of natural videos,”
IEEE Trans. Image Process., 19
(2), 335
–350
(2010). http://dx.doi.org/10.1109/TIP.2009.2034992 IIPRE4 1057-7149 Google Scholar
Z. WangL. LuA. C. Bovik,
“Video quality assessment based on structural distortion measurement,”
Signal Process.: Image Commun., 19
(2), 121
–132
(2004). http://dx.doi.org/10.1016/S0923-5965(03)00076-6 SPICEF 0923-5965 Google Scholar
Z. WangQ. Li,
“Video quality assessment using a statistical model of human visual speed perception,”
J. Opt. Soc. Am. A, 24
(12), B61
–B69
(2007). http://dx.doi.org/10.1364/JOSAA.24.000B61 JOAOD6 0740-3232 Google Scholar
H. R. SheikhA. C. Bovik,
“A visual information fidelity approach to video quality assessment,”
in First Int. Workshop on Video Processing and Quality Metrics for Consumer Electronics,
23
–25
(2005). Google Scholar
M. NarwariaW. LinA. Liu,
“Low-complexity video quality assessment using temporal quality variations,”
IEEE Trans. Multimedia, 14
(3), 525
–535
(2012). http://dx.doi.org/10.1109/TMM.2012.2190589 ITMUF8 1520-9210 Google Scholar
Z. Wanget al.,
“Image quality assessment: from error visibility to structural similarity,”
IEEE Trans. Image Process., 13
(4), 600
–612
(2004). http://dx.doi.org/10.1109/TIP.2003.819861 IIPRE4 1057-7149 Google Scholar
H. R. SheikhA. C. Bovik,
“Image information and visual quality,”
IEEE Trans. Image Process., 15
(2), 430
–444
(2006). http://dx.doi.org/10.1109/TIP.2005.859378 IIPRE4 1057-7149 Google Scholar
S. WolfM. Pinson,
“In-service performance metrics for MPEG-2 video systems,”
in Measurement Techniques of the Digital Age Technical Seminar,
12
–13
(1998). Google Scholar
M. H. PinsonS. Wolf,
“A new standardized method for objectively measuring video quality,”
IEEE Trans. Broadcast., 50
(3), 312
–322
(2004). http://dx.doi.org/10.1109/TBC.2004.834028 IETBAC 0018-9316 Google Scholar
Y. Wanget al.,
“Novel spatio-temporal structural information based video quality metric,”
IEEE Trans. Circuits Syst. Video Technol., 22
(7), 989
–998
(2012). http://dx.doi.org/10.1109/TCSVT.2012.2186745 ITCTEM 1051-8215 Google Scholar
D. E. Pearson,
“Variability of performance in video coding,”
in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing,
5
–8
(1997). Google Scholar
D. E. Pearson,
“Viewer response to time-varying video quality,”
Proc. SPIE, 3299 16
–25
(1998). http://dx.doi.org/10.1117/12.320109 PSISDG 0277-786X Google Scholar
K. SeshadrinathanA. C. Bovik,
“Temporal hysteresis model of time varying subjective video quality,”
in IEEE Int. Conf. on Acoustics, Speech and Signal Processing,
1153
–1156
(2011). Google Scholar
M. A. MasryS. S. Hemami,
“A metric for continuous quality evaluation of compressed video with severe distortions,”
Signal Process.: Image Commun., 19
(2), 133
–146
(2004). http://dx.doi.org/10.1016/j.image.2003.08.001 SPICEF 0923-5965 Google Scholar
M. Barkowskyet al.,
“Perceptually motivated spatial and temporal integration of pixel based video quality measures,”
in Welcome to Mobile Content Quality of Experience,
4:1
–4:7
(2007). Google Scholar
A. Ninassiet al.,
“Considering temporal variations of spatial visual distortions in video quality assessment,”
IEEE J. Sel. Topics Signal Process., 3
(2), 253
–265
(2009). http://dx.doi.org/10.1109/JSTSP.2009.2014806 1932-4553 Google Scholar
C. NgoT. PongH. Zhang,
“On clustering and retrieval of video shots through temporal slices analysis,”
IEEE Trans. Multimedia, 4
(4), 446
–458
(2002). http://dx.doi.org/10.1109/TMM.2002.802022 ITMUF8 1520-9210 Google Scholar
C. NgoT. PongH. Zhang,
“Motion analysis and segmentation through spatio-temporal slices processing,”
IEEE Trans. Image Process., 12
(3), 341
–355
(2003). http://dx.doi.org/10.1109/TIP.2003.809020 IIPRE4 1057-7149 Google Scholar
A. B. WatsonA. J. Ahumada,
“Model of human visual-motion sensing,”
J. Opt. Soc. Am. A, 2
(2), 322
–341
(1985). http://dx.doi.org/10.1364/JOSAA.2.000322 JOAOD6 0740-3232 Google Scholar
E. H. AdelsonJ. R. Bergen,
“Spatiotemporal energy models for the perception of motion,”
J. Opt. Soc. Am. A, 2
(2), 284
–299
(1985). http://dx.doi.org/10.1364/JOSAA.2.000284 JOAOD6 0740-3232 Google Scholar
K. Seshadrinathanet al.,
“Study of subjective and objective quality assessment of video,”
IEEE Trans. Image Process., 19
(6), 1427
–1441
(2010). http://dx.doi.org/10.1109/TIP.2010.2042111 IIPRE4 1057-7149 Google Scholar
“Objective video quality measurement using a peak-signal-to-noise-ratio (PSNR) full reference technique,”
(2001). Google Scholar
E. LarsonD. Chandler,
“Most apparent distortion: full-reference image quality assessment and the role of strategy,”
J. Electron. Imaging, 19
(1), 011006
(2010). http://dx.doi.org/10.1117/1.3267105 JEIME5 1017-9909 Google Scholar
A. B. WatsonA. J. Ahumada, A look at motion in the frequency domain, (1983). Google Scholar
P. V. VuC. T. VuD. M. Chandler,
“A spatiotemporal most-apparent-distortion model for video quality assessment,”
in IEEE Int. Conf. on Image Processing,
2505
–2508
(2011). Google Scholar
P. V. VuD. M. Chandler,
“Video quality assessment based on motion dissimilarity,”
in Seventh Int. Workshop on Video Processing and Quality Metrics for Consumer Electronics,
(2013). Google Scholar
S. Chikkeruret al.,
“Objective video quality assessment methods: a classification, review, and performance comparison,”
IEEE Trans. Broadcasting, 57
(2), 165
–182
(2011). http://dx.doi.org/10.1109/TBC.2011.2104671 IETBAC 0018-9316 Google Scholar
“Final report from the video quality experts group on the validation of objective models of video quality assessment, Phase II,”
(2003). Google Scholar
L. Luet al.,
“Full-reference video quality assessment considering structural distortion and no-reference quality evaluation of MPEG video,”
in IEEE Int. Conf. on Multimedia and Expo,
61
–64
(2002). Google Scholar
P. TaoA. M. Eskicioglu,
“Video quality assessment using M-SVD,”
Proc. SPIE, 6494 649408
(2007). http://dx.doi.org/10.1117/12.696142 PSISDG 0277-786X Google Scholar
A. Pessoaet al.,
“Video quality assessment using objective parameters based on image segmentation,”
SMPTE J., 108
(12), 865
–872
(1999). http://dx.doi.org/10.5594/J04308 SMPJDF 0036-1682 Google Scholar
J. Okamotoet al.,
“Proposal for an objective video quality assessment method that takes temporal and spatial information into consideration,”
Electron. Commun. Jpn., 89
(12), 97
–108
(2006). http://dx.doi.org/10.1002/(ISSN)1520-6424 ECOJAL 0424-8368 Google Scholar
S. O. LeeD. G. Sim,
“New full-reference visual quality assessment based on human visual perception,”
in Int. Conf. on Consumer Electronics,
1
–2
(2008). Google Scholar
M. Barkowskyet al.,
“Temporal trajectory aware video quality measure,”
IEEE J. Sel. Topics Signal Process., 3
(2), 266
–279
(2009). http://dx.doi.org/10.1109/JSTSP.2009.2015375 1932-4553 Google Scholar
A. BhatI. RichardsonS. Kannangara,
“A new perceptual quality metric for compressed video,”
in IEEE Int. Conf. on Acoustics, Speech and Signal Processing,
933
–936
(2009). Google Scholar
U. Engelkeet al.,
“Modelling saliency awareness for objective video quality assessment,”
in Second Int. Workshop on Quality of Multimedia Experience,
212
–217
(2010). Google Scholar
X. Guet al.,
“Region of interest weighted pooling strategy for video quality metric,”
Telecommun. Syst., 49
(1), 63
–73
(2012). http://dx.doi.org/10.1007/s11235-010-9353-8 TESYEV 1018-4864 Google Scholar
M. NarwariaW. Lin,
“Scalable image quality assessment based on structural vectors,”
in IEEE Int. Workshop on Multimedia Signal Processing,
1
–6
(2009). Google Scholar
A. A. StockerE. P. Simoncelli,
“Noise characteristics and prior expectations in human visual speed perception,”
Nat. Neurosci., 9
(4), 578
–585
(2006). http://dx.doi.org/10.1038/nn1669 NANEFN 1097-6256 Google Scholar
Z. WangE. SimoncelliA. Bovik,
“Multiscale structural similarity for image quality assessment,”
in Conf. Record of the Thirty-Seventh Asilomar Conf. on Signals, Systems and Computers,
1398
–1402
(2003). Google Scholar
F. LukasZ. Budrikis,
“Picture quality prediction based on a visual model,”
IEEE Trans. Commun., 30
(7), 1679
–1692
(1982). http://dx.doi.org/10.1109/TCOM.1982.1095616 IECMBT 0090-6778 Google Scholar
A. Bassoet al.,
“Study of MPEG-2 coding performance based on a perceptual quality metric,”
in Proc. of Picture Coding Symp. 1996,
263
–268
(1996). Google Scholar
C. J. van den Branden Lambrecht,
“Color moving pictures quality metric,”
in IEEE Int. Conf. on Image Process.,
885
–888
(1996). Google Scholar
P. LindhC. J. van den Branden Lambrecht,
“Efficient spatio-temporal decomposition for perceptual processing of video sequences,”
in IEEE Int. Conf. on Image Processing,
331
–334
(1996). Google Scholar
A. Hekstraet al.,
“PVQM—a perceptual video quality measure,”
Signal Process.: Image Commun., 17
(10), 781
–798
(2002). http://dx.doi.org/10.1016/S0923-5965(02)00056-5 SPICEF 0923-5965 Google Scholar
A. B. WatsonJ. HuJ. F. McGowan,
“Digital video quality metric based on human vision,”
J. Electron. Imaging, 10
(1), 20
–29
(2001). http://dx.doi.org/10.1117/1.1329896 JEIME5 1017-9909 Google Scholar
C. LeeO. Kwon,
“Objective measurements of video quality using the wavelet transform,”
Opt. Eng., 42
(1), 265
–272
(2003). http://dx.doi.org/10.1117/1.1523420 OPEGAR 0091-3286 Google Scholar
E. Onget al.,
“Video quality metric for low bitrate compressed videos,”
in IEEE Int. Conf. on Image Processing,
3531
–3534
(2004). Google Scholar
E. Onget al.,
“Colour perceptual video quality metric,”
in IEEE Int. Conf. on Image Processing,
III-1172-5
(2005). Google Scholar
M. MasryS. HemamiY. Sermadevi,
“A scalable wavelet-based video distortion metric and applications,”
IEEE Trans. Circuits Syst. Video Technol., 16
(2), 260
–273
(2006). http://dx.doi.org/10.1109/TCSVT.2005.861946 ITCTEM 1051-8215 Google Scholar
P. Ndjiki-NyaM. BarradoT. Wiegand,
“Efficient full-reference assessment of image and video quality,”
in IEEE Int. Conf. on Image Processing,
II-125
–II-128
(2007). Google Scholar
S. LiL. MaK. N. Ngan,
“Full-reference video quality assessment by decoupling detail losses and additive impairments,”
IEEE Trans. Circuits Syst. Video Technol., 22
(7), 1100
–1112
(2012). http://dx.doi.org/10.1109/TCSVT.2012.2190473 ITCTEM 1051-8215 Google Scholar
P. C. TeoD. J. Heeger,
“Perceptual image distortion,”
in IEEE Int. Conf. on Image Processing,
982
–986
(1994). Google Scholar
S. Péchardet al.,
“A new methodology to estimate the impact of H.264 artefacts on subjective video quality,”
in Third Int. Workshop on Video Processing and Quality Metrics,
(2007). Google Scholar
D. ChandlerS. Hemami,
“VSNR: a wavelet-based visual signal-to-noise ratio for natural images,”
IEEE Trans. Image Process., 16
(9), 2284
–2298
(2007). http://dx.doi.org/10.1109/TIP.2007.901820 IIPRE4 1057-7149 Google Scholar
B. D. LucasT. Kanade,
“An iterative image registration technique with an application to stereo vision,”
in 7th Int. Joint Conf. on Artificial Intelligence,
674
–679
(1981). Google Scholar
P. VuD. Chandler,
“Online supplement: ViS3: an algorithm for video quality assessment via analysis of spatial and spatiotemporal slices,”
(2013) http://vision.okstate.edu/vis3/ (accessed December 2013). Google Scholar
J. G. Robson,
“Spatial and temporal contrast-sensitivity functions of the visual system,”
J. Opt. Soc. Am., 56
(8), 1141
–1142
(1966). http://dx.doi.org/10.1364/JOSA.56.001141 JOSAAH 0030-3941 Google Scholar
Image & Video Processing Laboratory, The Chinese University of Hong Kong, “IVP subjective quality video database,”
(2012) http://ivp.ee.cuhk.edu.hk/research/database/subjective/index.shtml (accessed April 2012). Google Scholar
Laboratory of Computational Perception & Image Quality, Oklahoma State University, “CSIQ video database,”
(2013) http://vision.okstate.edu/csiq/ (accessed November 2012). Google Scholar
“Subjective quality of internet video codec phase II evaluations using SAMVIQ,”
(2005). Google Scholar
H. R. SheikhM. F. SabirA. C. Bovik,
“A statistical evaluation of recent full reference image quality assessment algorithms,”
IEEE Trans. Image Process., 15
(11), 3440
–3451
(2006). http://dx.doi.org/10.1109/TIP.2006.881959 IIPRE4 1057-7149 Google Scholar
U. EngelkeV. X. NguyenH. Zepernick,
“Regional attention to structural degradations for perceptual image quality metric design,”
in IEEE Int. Conf. on Acoustics, Speech and Signal Processing,
869
–872
(2008). Google Scholar
J. Youet al.,
“Perceptual quality assessment based on visual attention analysis,”
in Proc. of the 17th ACM Int. Conf. on Multimedia,
561
–564
(2009). Google Scholar
O. L. Meuret al.,
“Overt visual attention for free-viewing and quality assessment tasks: impact of the regions of interest on a video quality metric,”
Signal Process.: Image Commun., 25
(7), 547
–558
(2010). http://dx.doi.org/10.1016/j.image.2010.05.006 SPICEF 0923-5965 Google Scholar
BiographyPhong V. Vu received his BE in telecommunications engineering from the Posts and Telecommunications Institute of Technologies, Hanoi, Vietnam, in 2004. He is currently working toward his PhD degree in the School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, Oklahoma. His research interests include image and video processing, image and video quality assessment, and computational modeling of visual perception. Damon M. Chandler received his BS degree in biomedical engineering from Johns Hopkins University, Baltimore, Maryland, in 1998, and his MEng, MS, and PhD degrees in electrical engineering from Cornell University, Ithaca, New York, in 2000, 2004, and 2005, respectively. He is currently an associate professor in the School of Electrical and Computer Engineering at Oklahoma State University, Stillwater, Oklahoma, where he heads the Laboratory of Computational Perception and Image Quality. |