Human action detection in videos can be applied in areas such as video surveillance, human-computer interaction, and device control. The task takes an image sequence, a three-dimensional input, and detects actions such as running or catching a ball.
Image credit: pxhere.com, CC0 Public Domain
Usually, convolutional neural networks (CNNs) are used for this task. However, they consider only spatiotemporal features, while employing frequency features could facilitate learning. A recent paper on arXiv.org proposes an end-to-end single-stage network that operates in the time-frequency domain.
A 3D-CNN and a 2D-CNN were used to extract time and frequency features, respectively. These features were then fused with an attention mechanism to obtain action patterns. The experiments demonstrate the superiority of the suggested approach over other state-of-the-art models and confirm the feasibility of action detection using frequency features.
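The paper's attention mechanism is more elaborate, but the general idea of fusing two branch outputs with softmax attention can be illustrated with a minimal numpy sketch (the feature dimension, the projection matrix, and the random inputs are invented for illustration, not taken from the paper):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(time_feat, freq_feat, w):
    """Fuse two feature vectors with scalar attention weights.

    time_feat, freq_feat: (C,) vectors standing in for the two branch outputs.
    w: (2, 2C) projection that scores each branch from the concatenation.
    """
    scores = w @ np.concatenate([time_feat, freq_feat])  # one score per branch
    alpha = softmax(scores)                              # weights sum to 1
    return alpha[0] * time_feat + alpha[1] * freq_feat

rng = np.random.default_rng(0)
C = 8                                   # hypothetical feature dimension
time_feat = rng.standard_normal(C)      # stand-in for 3D-CNN output
freq_feat = rng.standard_normal(C)      # stand-in for 2D-CNN output
w = rng.standard_normal((2, 2 * C))     # untrained projection, for illustration
fused = attention_fuse(time_feat, freq_feat, w)
```

Because the attention weights sum to one, each element of the fused vector is a convex combination of the corresponding elements of the two branch features.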
Currently, spatiotemporal features are embraced by most deep learning approaches for human action detection in videos; however, they neglect the important features in the frequency domain. In this work, we propose an end-to-end network that considers the time and frequency features simultaneously, named TFNet. TFNet holds two branches: one is the time branch, formed of a three-dimensional convolutional neural network (3D-CNN), which takes the image sequence as input to extract time features; the other is the frequency branch, extracting frequency features through a two-dimensional convolutional neural network (2D-CNN) from DCT coefficients. Finally, to obtain the action patterns, these two features are deeply fused under the attention mechanism. Experimental results on the JHMDB51-21 and UCF101-24 datasets demonstrate that our approach achieves remarkable performance for frame-mAP.
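The DCT coefficients fed to the frequency branch can be computed block-wise, much as in JPEG compression. Below is a minimal sketch of an orthonormal 2D DCT-II on an 8×8 block; the block size and orthonormal scaling are assumptions here, and the paper's exact preprocessing may differ:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (JPEG-style for n=8)."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    m = np.arange(n).reshape(1, -1)   # sample index
    d = np.cos(np.pi * (2 * m + 1) * k / (2 * n)) * np.sqrt(2 / n)
    d[0, :] = np.sqrt(1 / n)          # DC row gets the smaller scale factor
    return d

def block_dct2(block):
    """2D DCT of a square block, computed as D @ block @ D.T."""
    d = dct_matrix(block.shape[0])
    return d @ block @ d.T

# A constant block concentrates all energy in the DC coefficient.
coeffs = block_dct2(np.full((8, 8), 10.0))
```

With the orthonormal scaling, the basis matrix satisfies D·Dᵀ = I, so the transform is energy-preserving and trivially invertible, a convenient property when treating the coefficients as network inputs.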
Research paper: Li, C., Chen, H., Lu, J., Huang, Y., and Liu, Y., “Time and Frequency Network for Human Action Detection in Videos”, 2021. Link: https://arxiv.org/abs/2103.04680