Overall framework. First, feature vectors are constructed by concatenating the appearance features and motion features. Then the predictive uncertainty is estimated from the input and used to generate spatial-temporal attention. Finally, the prediction is made by dynamically combining the deterministic model and the probabilistic model, whose inputs are the original features and the attention-weighted features, respectively.
Abstract
Online action detection aims at detecting the ongoing action in a streaming video. In this paper, we propose an uncertainty-based spatial-temporal attention for online action detection. By explicitly modeling the distribution of model parameters, we extend the baseline models in a probabilistic manner. We then quantify the predictive uncertainty and use it to generate spatial-temporal attention that focuses on regions and frames with large mutual information. For inference, we introduce a two-stream framework that combines the baseline model and the probabilistic model based on the input uncertainty. We validate the effectiveness of our method on three benchmark datasets: THUMOS-14, TVSeries, and HDD. Furthermore, we demonstrate that our method generalizes better under different views and occlusions, and is more robust when training with small-scale data.
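The core idea above (sample the probabilistic model's parameters, decompose the predictive uncertainty into its mutual-information component, and turn that into attention over frames) can be illustrated with a minimal NumPy sketch. This is **not** the paper's implementation: the linear heads, sampled weights, and all variable names here are hypothetical stand-ins for the actual networks and features.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup (hypothetical shapes): T frames, D-dim concatenated
# appearance+motion features, C action classes, S parameter samples.
T, D, C, S = 8, 16, 4, 20
features = rng.normal(size=(T, D))
# Sampled parameters stand in for the probabilistic model's weight
# distribution; the real model samples from a learned posterior.
weight_samples = 0.1 * rng.normal(size=(S, D, C))

# Per-sample class probabilities: shape (S, T, C).
probs = softmax(np.einsum("td,sdc->stc", features, weight_samples))
mean_p = probs.mean(axis=0)

# Predictive entropy per frame (total uncertainty).
pred_entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)
# Expected entropy over samples (aleatoric part).
exp_entropy = -(probs * np.log(probs + 1e-12)).sum(-1).mean(0)
# Mutual information = epistemic part; non-negative by construction.
mutual_info = pred_entropy - exp_entropy  # shape (T,)

# Temporal attention emphasizing high-mutual-information frames;
# the paper applies the same idea spatially as well.
attention = softmax(mutual_info, axis=0)
weighted_features = attention[:, None] * features
```

In this sketch the attention weights sum to one over frames, so frames where the sampled models disagree most (high mutual information) contribute most to the attention-weighted features fed to the probabilistic stream.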
Results
Method
Citation
If our work helps your research, please consider citing the paper as follows: