The human eye needs only a few snapshots to recognize an action lasting a few seconds, whereas an action recognition network requires hundreds of input frames per action. Processing a single sample therefore demands a large number of floating-point operations (roughly 16 to 100 GFLOPs), which hampers the deployment of graph convolutional network (GCN)-based action recognition methods when computational capability is restricted. A common strategy is to retain only a subset of the frames, but this discards important information contained in the dropped frames. Furthermore, key frames are typically selected independently of one another, without modeling their connections to the remaining frames. To solve these two problems, we propose a fusion sampling network that generates fused frames from which key frames are extracted. Temporal aggregation fuses adjacent, similar frames, thereby reducing information loss and redundancy, and self-attention is introduced to strengthen the long-term associations among key frames. Experimental results on three benchmark datasets show that the proposed method achieves performance competitive with state-of-the-art methods while using only 16.7% of the frames (∼50 out of 300 frames in total). On the NTU 60 dataset, with a single-channel input, the model requires 3.776 GFLOPs and 3.53 M parameters. This greatly reduces the computational cost that the large volume of data processed by action recognition would otherwise incur in practical applications.
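For illustration, the sketch below shows one way the two ideas named in the abstract could be combined: temporal aggregation that fuses adjacent, similar frames into a small set of key frames, followed by self-attention over those fused frames. This is a minimal assumption-based sketch, not the authors' implementation; all names (FusionSampler, num_key_frames, the cosine-similarity weighting) are illustrative choices.

```python
# Minimal sketch (NOT the authors' code) of temporal aggregation + self-attention
# for key-frame extraction; names and weighting scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionSampler(nn.Module):
    def __init__(self, feat_dim: int, num_key_frames: int = 50, num_heads: int = 4):
        super().__init__()
        self.num_key_frames = num_key_frames
        # Self-attention over the fused frames (long-term association).
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, feat_dim) -- per-frame skeleton features, T ~ 300.
        k = self.num_key_frames

        # Temporal aggregation: split the sequence into k contiguous segments and
        # fuse the frames inside each segment with weights given by their cosine
        # similarity to the segment mean, so near-duplicate frames are merged
        # rather than discarded.
        segments = torch.chunk(x, k, dim=1)                # k tensors, each (B, T/k, D)
        fused = []
        for seg in segments:
            center = seg.mean(dim=1, keepdim=True)         # (B, 1, D)
            w = F.cosine_similarity(seg, center, dim=-1)   # (B, T/k)
            w = torch.softmax(w, dim=-1).unsqueeze(-1)     # (B, T/k, 1)
            fused.append((w * seg).sum(dim=1))             # (B, D)
        fused = torch.stack(fused, dim=1)                  # (B, k, D)

        # Self-attention strengthens associations between distant key frames.
        out, _ = self.attn(fused, fused, fused)
        return out                                         # (B, k, D)


# Usage: 300 raw frames reduced to ~50 fused key frames (~16.7%).
sampler = FusionSampler(feat_dim=256, num_key_frames=50)
frames = torch.randn(2, 300, 256)
keys = sampler(frames)
print(keys.shape)  # torch.Size([2, 50, 256])
```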
Keywords: Convolution, Data modeling, Video, Bone, RGB color model, Neural networks, Cameras