Framework of the Uncertainty-Guided Probabilistic Transformer (UGPT). The input to our model is a video sequence. First, an atomic action localization module produces a coarse temporal segmentation of atomic actions. A CNN-based backbone then extracts features for each segment. After positional encoding is added, the extracted features are fed into the UGPT. Unlike the deterministic setting, the attention scores of our probabilistic Transformer are sampled from Gaussian distributions and trained with an NLL loss. The output embeddings of the Transformer are used to perform classification and to estimate the epistemic uncertainty, which in turn guides both training and inference.
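To make the probabilistic attention concrete, here is a minimal PyTorch sketch of a self-attention layer whose attention logits are drawn from learned Gaussians via the reparameterization trick. The class name, the separate `sq`/`sk` projections for the standard deviation, and the single-head layout are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ProbabilisticAttention(nn.Module):
    """Sketch of attention with Gaussian-sampled scores (illustrative, single-head)."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Extra projections (assumed names) that parameterize the log std-dev
        self.sq = nn.Linear(dim, dim)
        self.sk = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, n_samples=1):
        # x: [batch, time, dim]
        q, k, v = self.q(x), self.k(x), self.v(x)
        mu = q @ k.transpose(-2, -1) * self.scale                      # mean logits
        log_sigma = self.sq(x) @ self.sk(x).transpose(-2, -1) * self.scale
        sigma = log_sigma.exp().clamp(max=10.0)                        # std-dev of logits
        outs = []
        for _ in range(n_samples):
            eps = torch.randn_like(mu)                                 # reparameterization
            attn = torch.softmax(mu + sigma * eps, dim=-1)             # sampled attention
            outs.append(attn @ v)
        return torch.stack(outs)                                       # [samples, batch, time, dim]
```

Drawing several samples per input lets the variance across the resulting embeddings serve as an epistemic-uncertainty signal.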
Abstract
A complex action consists of a sequence of atomic actions that interact with each other over a relatively long period of time. This paper introduces a probabilistic model named Uncertainty-Guided Probabilistic Transformer (UGPT) for complex action recognition. The self-attention mechanism of a Transformer is used to capture the complex and long-term dynamics of complex actions. By explicitly modeling the distribution of the attention scores, we extend the deterministic Transformer to a probabilistic Transformer in order to quantify the uncertainty of the prediction. The model's prediction uncertainty is used to improve both training and inference. Specifically, we propose a novel training strategy that introduces a majority model and a minority model based on the epistemic uncertainty. During inference, the prediction is made jointly by both models through a dynamic fusion strategy. Our method is validated on benchmark datasets, including Breakfast Actions, MultiTHUMOS, and Charades. The experimental results show that our model achieves state-of-the-art performance with both sufficient and insufficient training data.
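The dynamic fusion idea above can be sketched as follows. This is a hedged NumPy illustration, not the paper's exact rule: it uses predictive entropy as a stand-in uncertainty measure and weights each model's class probabilities by the other model's uncertainty, so the less uncertain model dominates. The function names `entropy` and `dynamic_fusion` are my own.

```python
import numpy as np

def entropy(p):
    # Predictive entropy as a simple uncertainty proxy (illustrative choice)
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def dynamic_fusion(p_major, p_minor):
    """Combine two models' class probabilities, weighting each by the
    other's uncertainty so the more confident model gets more say."""
    u1, u2 = entropy(p_major), entropy(p_minor)
    w1 = u2 / (u1 + u2 + 1e-12)   # lower uncertainty -> higher weight
    w2 = u1 / (u1 + u2 + 1e-12)
    fused = w1[..., None] * p_major + w2[..., None] * p_minor
    return fused / fused.sum(axis=-1, keepdims=True)
```

Because the weights sum to one per sample, the fused output remains a valid probability distribution over classes.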
Results
Method
Citation
If our work helps your research, please consider citing the paper as follows: