Overall framework. First, feature vectors are constructed by concatenating the appearance features and motion features. Then the predictive uncertainty is estimated from the input and used to generate spatial-temporal attention. Finally, the prediction is made by dynamically combining the deterministic model and the probabilistic model, whose inputs are the original features and the attention-weighted features, respectively.
Abstract
Online action detection aims at detecting the ongoing action in a streaming video. In this paper, we propose an uncertainty-based spatial-temporal attention for online action detection. By explicitly modeling the distribution of model parameters, we extend the baseline models in a probabilistic manner. We then quantify the predictive uncertainty and use it to generate spatial-temporal attention that focuses on regions and frames with large mutual information. For inference, we introduce a two-stream framework that combines the baseline model and the probabilistic model based on the input uncertainty. We validate the effectiveness of our method on three benchmark datasets: THUMOS-14, TVSeries, and HDD. Furthermore, we demonstrate that our method generalizes better under different views and occlusions, and is more robust when training with small-scale data.
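The core idea above (sample the probabilistic model's parameters, decompose the predictive uncertainty into its mutual-information component, and turn that into attention over frames) can be illustrated with a minimal NumPy sketch. This is **not** the paper's implementation: the linear heads, sampled weights, and all variable names here are hypothetical stand-ins for the actual networks and features.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup (hypothetical shapes): T frames, D-dim concatenated
# appearance+motion features, C action classes, S parameter samples.
T, D, C, S = 8, 16, 4, 20
features = rng.normal(size=(T, D))
# Sampled parameters stand in for the probabilistic model's weight
# distribution; the real model samples from a learned posterior.
weight_samples = 0.1 * rng.normal(size=(S, D, C))

# Per-sample class probabilities: shape (S, T, C).
probs = softmax(np.einsum("td,sdc->stc", features, weight_samples))
mean_p = probs.mean(axis=0)

# Predictive entropy per frame (total uncertainty).
pred_entropy = -(mean_p * np.log(mean_p + 1e-12)).sum(-1)
# Expected entropy over samples (aleatoric part).
exp_entropy = -(probs * np.log(probs + 1e-12)).sum(-1).mean(0)
# Mutual information = epistemic part; non-negative by construction.
mutual_info = pred_entropy - exp_entropy  # shape (T,)

# Temporal attention emphasizing high-mutual-information frames;
# the paper applies the same idea spatially as well.
attention = softmax(mutual_info, axis=0)
weighted_features = attention[:, None] * features
```

In this sketch the attention weights sum to one over frames, so frames where the sampled models disagree most (high mutual information) contribute most to the attention-weighted features fed to the probabilistic stream.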
Results
Method
Citation
If our work helps your research, please consider citing the paper as follows: