Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition

Hongji Guo,  Hanjing Wang,  Qiang Ji

Rensselaer Polytechnic Institute

Framework of the Uncertainty-Guided Probabilistic Transformer (UGPT). The input of our model is a video (sequence). First, an atomic action localization module gives a coarse temporal segmentation of the atomic actions. Then a CNN-based backbone is used to extract features for each segment. After adding the positional encoding, the extracted features are fed into the UGPT. Different from the deterministic setting, the attention scores of our probabilistic Transformer are sampled from Gaussian distributions with an NLL loss. The output embeddings of the Transformer are used to perform the classification and to estimate the epistemic uncertainty, which is further utilized to guide both training and inference.

Abstract

A complex action consists of a sequence of atomic actions that interact with each other over a relatively long period of time. This paper introduces a probabilistic model named Uncertainty-Guided Probabilistic Transformer (UGPT) for complex action recognition. The self-attention mechanism of a Transformer is used to capture the complex and long-term dynamics of complex actions. By explicitly modeling the distribution of the attention scores, we extend the deterministic Transformer to a probabilistic Transformer in order to quantify the uncertainty of the prediction. The model prediction uncertainty is used to improve both training and inference. Specifically, we propose a novel training strategy by introducing a majority model and a minority model based on the epistemic uncertainty. During inference, the prediction is jointly made by both models through a dynamic fusion strategy. Our method is validated on benchmark datasets including Breakfast Actions, MultiTHUMOS, and Charades. The experimental results show that our model achieves state-of-the-art performance under both sufficient and insufficient data.
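To make the idea of modeling the distribution of attention scores concrete, here is a minimal single-head sketch in PyTorch that draws the scores from Gaussian distributions via the reparameterization trick. This is our own illustration, not the released implementation: the module and projection names (GaussianAttention, k_mu, k_logvar) are assumptions, and multi-head splitting, masking, and the NLL regularization term are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianAttention(nn.Module):
    """Single-head self-attention whose scores are sampled from Gaussians (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k_mu = nn.Linear(dim, dim)      # keys that produce the score means
        self.k_logvar = nn.Linear(dim, dim)  # keys that produce the score log-variances
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, tokens, dim), one token per atomic-action segment
        q = self.q(x)
        mu = torch.matmul(q, self.k_mu(x).transpose(-2, -1)) * self.scale
        logvar = torch.matmul(q, self.k_logvar(x).transpose(-2, -1)) * self.scale
        # Reparameterization: score = mu + sigma * eps keeps the sampling differentiable
        eps = torch.randn_like(mu)
        scores = mu + torch.exp(0.5 * logvar) * eps
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, self.v(x)), mu, logvar

Because the scores are stochastic, repeated forward passes over the same input give different predictions, which is what allows the epistemic uncertainty to be estimated downstream.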

Results

We evaluated our proposed UGPT on three benchmark datasets: Breakfast Actions, MultiTHUMOS, and Charades. Our method achieves state-of-the-art performance for complex action recognition.

Method

The input of our model is a video (sequence). First, an atomic action localization module is applied to provide a coarse temporal localization of the atomic actions. Then, we extract features for each segment with a CNN-based backbone. The features of a segment serve as a token for an atomic action and are fed to the Transformer. To keep the sequential order information, the positional encoding computed by Eq. (1) is added to the input tokens. After the positional encoding, the new tokens are fed to our probabilistic Transformer, which outputs high-level embeddings. Then, the output embeddings go through a linear classifier to produce probability vectors for classification. The probability vectors are fed into an uncertainty quantification module to generate the epistemic uncertainty (defined in Sec. 3.2.3) for each input. We train two models separately using two different uncertainty-weighted loss functions. One model assigns larger weights to low-uncertainty data to emphasize the majority of the data, which we refer to as the "majority model". The other assigns larger weights to high-uncertainty data to emphasize the minority of the data, which we refer to as the "minority model". In the end, we combine the two models dynamically to make the final prediction.
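The sketch below illustrates, under our own assumptions, how the uncertainty-guided training and the dynamic fusion at inference could be wired together. The function names (epistemic_uncertainty, uncertainty_weighted_ce, fused_prediction), the mutual-information estimate from repeated stochastic forward passes, the softmax-based sample weighting, and the inverse-uncertainty fusion weights are illustrative choices, not the paper's exact equations; model(x) is assumed to return class logits.

import torch
import torch.nn.functional as F

def epistemic_uncertainty(model, x, n_samples=10):
    # Spread of predictions across stochastic forward passes (the attention
    # sampling makes each pass different): predictive entropy minus expected
    # entropy, i.e. the mutual information, per input sample.
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_p = probs.mean(dim=0)
    predictive = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=-1)
    expected = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean(dim=0)
    return predictive - expected

def uncertainty_weighted_ce(logits, labels, u, emphasize_high_uncertainty, tau=1.0):
    # Per-sample weights from a softmax over (+/-) uncertainty: the "minority
    # model" emphasizes high-uncertainty samples, the "majority model" the rest.
    sign = 1.0 if emphasize_high_uncertainty else -1.0
    w = F.softmax(sign * u / tau, dim=0)
    losses = F.cross_entropy(logits, labels, reduction="none")
    return (w * losses).sum()

def fused_prediction(majority_model, minority_model, x, n_samples=10):
    # Dynamic fusion: weight each model by the inverse of the epistemic
    # uncertainty it reports for this particular input.
    u = torch.stack([epistemic_uncertainty(majority_model, x, n_samples),
                     epistemic_uncertainty(minority_model, x, n_samples)])
    w = 1.0 / (u + 1e-12)
    w = w / w.sum(dim=0, keepdim=True)
    p = torch.stack([F.softmax(majority_model(x), dim=-1),
                     F.softmax(minority_model(x), dim=-1)])
    return (w.unsqueeze(-1) * p).sum(dim=0)

In this sketch the majority model is trained with emphasize_high_uncertainty=False and the minority model with True, and at test time fused_prediction lets whichever model is more confident on a given input dominate the final probability vector.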

Citation

If our work helps your research, please consider citing the paper as follows:

@inproceedings{guo2022uncertainty,
  title={Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition},
  author={Guo, Hongji and Wang, Hanjing and Ji, Qiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={20052--20061},
  year={2022}
}