Human Gesture, Action, and Activity Recognition

1. Introduction

The Intelligent Systems Laboratory at RPI has long performed research related to human gesture, action, and activity recognition. Specifically, we have performed research in human body detection and tracking, 2D/3D body pose estimation, body landmark/part detection and tracking, body gesture recognition, and human event and complex activity recognition. These efforts have been supported by government agencies including DARPA, ARO, ONR, AFOSR, DOT, and NSF.

2. Our work

2.1 Human Action Recognition, Localization, and Synthesis

Uncertainty-Based Spatial-Temporal Attention for Online Action Detection
Hongji Guo, Zhou Ren, Yi Wu, Gang Hua, and Qiang Ji
ECCV 2022
[Project Page]
Online action detection aims to detect the ongoing action in a streaming video. In this paper, we propose uncertainty-based spatial-temporal attention for online action detection. By explicitly modeling the distribution of model parameters, we extend the baseline models in a probabilistic manner. We then quantify the predictive uncertainty and use it to generate spatial-temporal attention that focuses on regions and frames with large mutual information. For inference, we introduce a two-stream framework that combines the baseline model and the probabilistic model based on the input uncertainty.
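
The link between parameter uncertainty and attention can be illustrated with a small sketch. It assumes that class-probability vectors have already been sampled from the parameter distribution for each frame, and it normalizes the per-frame mutual information into attention weights. The function names, the toy data, and the normalization step are illustrative assumptions, not the paper's implementation.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def mutual_information(samples):
    """Epistemic uncertainty from sampled class-probability vectors.

    MI = H(mean of samples) - mean of H(sample)
    """
    n = len(samples)
    num_classes = len(samples[0])
    mean_p = [sum(s[c] for s in samples) / n for c in range(num_classes)]
    return entropy(mean_p) - sum(entropy(s) for s in samples) / n


def attention_weights(per_frame_samples):
    """Normalize per-frame mutual information into weights that sum to 1."""
    mi = [mutual_information(s) for s in per_frame_samples]
    total = sum(mi)
    if total <= 0.0:
        # No disagreement between samples anywhere: fall back to uniform.
        return [1.0 / len(mi)] * len(mi)
    return [m / total for m in mi]


# Hypothetical data: 2 frames, 2 parameter samples each, 2 action classes.
frames = [
    [[0.5, 0.5], [0.5, 0.5]],  # samples agree: zero mutual information
    [[1.0, 0.0], [0.0, 1.0]],  # samples disagree: high mutual information
]
weights = attention_weights(frames)
```

In this toy case all of the attention goes to the second frame, where the sampled models disagree about the class.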

Bayesian Adversarial Human Motion Synthesis
Rui Zhao, Hui Su, and Qiang Ji
CVPR 2020
We propose a generative probabilistic model for human motion synthesis with a hierarchy of three layers. At the bottom layer, we use a hidden semi-Markov model (HSMM), which explicitly models the spatial pose, temporal transitions, and speed variations in motion sequences. At the middle layer, the HSMM parameters are treated as random variables that are allowed to vary across data instances in order to capture large intra- and inter-class variations. At the top layer, hyperparameters define the prior distributions of the parameters, preventing the model from overfitting.
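
What distinguishes an HSMM from a plain HMM is that each hidden state carries an explicit duration distribution, which is how speed variations can be modeled. The sketch below samples only the hidden state sequence of a toy HSMM; pose emission and the Bayesian layers are omitted, and all parameter values and names are hypothetical.

```python
import random


def sample_hsmm_states(init, trans, durations, length, rng):
    """Sample `length` frames of hidden states from a toy HSMM.

    init:      initial state probabilities
    trans:     trans[i][j] is the probability of moving from state i to j
               (no self-transitions; dwell time is handled by `durations`)
    durations: durations[i] is a list of (dwell_frames, probability) pairs
    """
    states = list(range(len(init)))
    sequence = []
    state = rng.choices(states, weights=init)[0]
    while len(sequence) < length:
        values = [d for d, _ in durations[state]]
        probs = [p for _, p in durations[state]]
        dwell = rng.choices(values, weights=probs)[0]
        sequence.extend([state] * dwell)  # stay in the state for `dwell` frames
        state = rng.choices(states, weights=trans[state])[0]
    return sequence[:length]


# Hypothetical 3-state model.
init = [1.0, 0.0, 0.0]
trans = [
    [0.0, 0.7, 0.3],
    [0.5, 0.0, 0.5],
    [0.6, 0.4, 0.0],
]
durations = [
    [(2, 0.5), (3, 0.5)],
    [(1, 0.2), (4, 0.8)],
    [(5, 1.0)],
]
state_sequence = sample_hsmm_states(
    init, trans, durations, length=20, rng=random.Random(0)
)
```

Changing the duration table alone makes the same motion play out faster or slower, without touching the transition structure.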

Bayesian Graph Convolution LSTM for Skeleton Based Action Recognition
Rui Zhao, Kang Wang, Hui Su, Qiang Ji
ICCV 2019
We use graph convolution to extract structure-aware feature representations from pose data by exploiting the skeleton anatomy. A long short-term memory (LSTM) network is then used to capture the temporal dynamics of the data. Finally, the whole model is extended under the Bayesian framework to a probabilistic model in order to better capture the stochasticity and variation in the data.
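
As a rough illustration of graph convolution on a skeleton, the sketch below averages each joint's features with those of its anatomical neighbors and then applies a weight matrix. The 5-joint skeleton, the row-normalized adjacency with self-loops, and the identity weights are assumptions chosen for brevity; the LSTM and the Bayesian extension are not shown.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def normalized_adjacency(edges, num_joints):
    """Adjacency with self-loops, row-normalized so each row sums to 1."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0  # bones are undirected
    return [[v / sum(row) for v in row] for row in a]


def graph_conv(x, a_hat, w):
    """One graph-convolution layer without nonlinearity: A_hat @ X @ W."""
    return matmul(matmul(a_hat, x), w)


# Hypothetical skeleton: joint 1 (torso) connects to head, two arms, and hip.
edges = [(0, 1), (1, 2), (1, 3), (1, 4)]
a_hat = normalized_adjacency(edges, num_joints=5)
x = [[1.0, 2.0]] * 5              # identical 2-D features on every joint
w = [[1.0, 0.0], [0.0, 1.0]]      # identity weights, for illustration only
out = graph_conv(x, a_hat, w)
```

With identical input features and identity weights the output equals the input, which is a quick check that the normalization preserves feature scale.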

Bayesian Hierarchical Dynamic Model for Human Action Recognition
Rui Zhao, Hui Su, and Qiang Ji
CVPR 2019
We propose a probabilistic model called the Hierarchical Dynamic Model (HDM). Leveraging the Bayesian framework, the model parameters are allowed to vary across different data sequences, which increases the capacity of the model to adapt to intra-class variations in both the spatial and temporal extent of actions. Meanwhile, the generative learning process allows the model to preserve the distinctive dynamic pattern of each action class.

Spatio-temporal Deep Q-networks for Human Activity Localization
Wanru Xu, Jian Yu, Zhengjiang Miao, Lili Wan, and Qiang Ji
IEEE Transactions on Circuits and Systems for Video Technology, 2019
We propose a unified spatio-temporal deep Q-network (ST-DQN), consisting of a temporal Q-network and a spatial Q-network, to learn an optimized search strategy. Specifically, the spatial Q-network is a novel two-branch sequence-to-sequence deep Q-network, called TBSS-DQN.

Action Recognition and Localization with Spatial and Temporal Contexts
Wanru Xu, Zhengjiang Miao, Jian Yu, Qiang Ji
Neurocomputing 2019
We propose a principled dynamic model, called the spatio-temporal context model (STCM), to simultaneously locate and recognize actions. The STCM integrates various kinds of context, including the temporal context, which consists of the sequences before and after the action, and the spatial context surrounding the target. Meanwhile, a novel dynamic programming approach is introduced to accumulate evidence collected at a small set of candidates in order to detect the spatio-temporal location of the action effectively and efficiently.
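
The idea of accumulating evidence over a small set of candidates with dynamic programming can be sketched generically. The example below is a Viterbi-style search, not the paper's algorithm: each frame has a few candidates with a score and a 1-D position, and the best path maximizes the summed scores minus a penalty on position jumps. The candidate format and the penalty are assumptions.

```python
def best_candidate_path(frames, jump_penalty):
    """Pick one candidate per frame to maximize accumulated evidence.

    frames[t] is a list of (score, position) candidates for frame t.
    Returns (list of chosen candidate indices, total path score).
    """
    best = [score for score, _ in frames[0]]
    back = []
    for t in range(1, len(frames)):
        new_best, pointers = [], []
        for score, pos in frames[t]:
            # Value of reaching this candidate from each previous candidate.
            options = [
                best[k] - jump_penalty * abs(pos - prev_pos)
                for k, (_, prev_pos) in enumerate(frames[t - 1])
            ]
            k_star = max(range(len(options)), key=options.__getitem__)
            new_best.append(options[k_star] + score)
            pointers.append(k_star)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the best final candidate.
    end = max(range(len(best)), key=best.__getitem__)
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[end]


# Hypothetical candidates: (detector score, horizontal position).
frames = [
    [(0.9, 0.0), (0.5, 10.0)],
    [(0.6, 1.0), (0.7, 10.0)],
    [(0.8, 2.0), (0.1, 10.0)],
]
path, score = best_candidate_path(frames, jump_penalty=0.1)
```

The search costs time proportional to the number of frames times the square of the candidates per frame, which is why keeping the candidate set small matters.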

Context-based human event recognition
Xiaoyang Wang and Qiang Ji
We propose to exploit and model contexts at different levels to perform robust human event recognition.

Real-Time Action Recognition using HMM

In this work, we propose a hidden Markov model (HMM) for real-time human action recognition. We obtain human skeleton positions from a Kinect depth camera and its built-in software. Based on the skeleton information, the HMM models the transitions between the hidden states that define the action.
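
A common way to use HMMs for recognition is to train one model per action and pick the model with the highest likelihood for the observed sequence. The sketch below assumes that skeleton poses have already been quantized into discrete symbols, and the two action models and their parameters are hypothetical; it shows the scaled forward algorithm, not the lab's system.

```python
import math


def log_likelihood(model, obs):
    """Log-likelihood of a discrete observation sequence (forward algorithm)."""
    init, trans, emis = model
    n = len(init)
    alpha = [init[s] * emis[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n)) * emis[s][obs[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def classify(models, obs):
    """Return the name of the action model that best explains `obs`."""
    return max(models, key=lambda name: log_likelihood(models[name], obs))


# Hypothetical models: (initial probs, transition matrix, emission matrix).
left_to_right = [[0.6, 0.4], [0.0, 1.0]]
models = {
    "wave": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "sit": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}
label = classify(models, [0, 0, 1, 1])
```

Because the forward pass updates one frame at a time, the likelihoods can be maintained incrementally as frames arrive, which suits real-time use.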

2.2 Complex Human Activity Modeling and Recognition

Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition
Hongji Guo, Hanjing Wang, Qiang Ji
CVPR 2022
[Project Page]
In this work, we introduce the uncertainty-guided probabilistic Transformer (UGPT) for complex action recognition. The self-attention mechanism of a Transformer is used to capture the complex, long-term dynamics of complex actions. By explicitly modeling the distribution of the attention scores, we extend the deterministic Transformer to a probabilistic Transformer in order to quantify the uncertainty of the prediction. The model prediction uncertainty is used to improve both training and inference. Specifically, we propose a novel training strategy that introduces a majority model and a minority model based on the epistemic uncertainty. During inference, the prediction is made jointly by both models through a dynamic fusion strategy.
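
The summary does not spell out the fusion rule, so the sketch below shows only one plausible form of uncertainty-driven fusion: each model's class probabilities are weighted by the inverse of its own uncertainty for the current input. The function name, the inverse-uncertainty weighting, and the example numbers are assumptions, not the published method.

```python
def fuse_predictions(p_major, u_major, p_minor, u_minor, eps=1e-8):
    """Blend two class-probability vectors, trusting the less uncertain model.

    p_*: class probabilities from each model
    u_*: non-negative uncertainty of each model for this input
    """
    w_major = 1.0 / (u_major + eps)
    w_minor = 1.0 / (u_minor + eps)
    total = w_major + w_minor
    return [
        (w_major * a + w_minor * b) / total
        for a, b in zip(p_major, p_minor)
    ]


# Equal uncertainty: the result is a plain average of the two predictions.
fused_equal = fuse_predictions([0.8, 0.2], 0.5, [0.4, 0.6], 0.5)

# The minority model is far more certain here, so its prediction dominates.
fused_minor = fuse_predictions([0.8, 0.2], 1.0, [0.1, 0.9], 0.01)
```

Because the weights are recomputed for every input, the balance between the two models changes from sample to sample, which is the sense in which such a fusion is dynamic.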

Complex Activity Recognition Using Constrained DBN (GCDBN) in Sports and Surveillance Video
Eran Swears, Anthony Hoogs, Qiang Ji and Kim Boyer
CVPR 2014
We propose a novel structure learning solution that fuses the Granger causality statistic, a direct measure of temporal dependence, with the AdaBoost feature selection algorithm to automatically constrain the temporal links of a DBN in a discriminative manner. This approach enables us to completely define the DBN structure prior to parameter learning, which reduces computational complexity and provides a more descriptive structure.
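
To make the Granger causality statistic concrete, the sketch below computes a lag-1 version: it compares how well y[t] is predicted from y[t-1] alone against y[t-1] together with x[t-1], and reports the log ratio of the residual sums of squares. The single lag, the absence of an intercept (series assumed roughly zero-mean), and the synthetic data are simplifying assumptions; the AdaBoost selection step is not shown.

```python
import math
import random


def rss_restricted(y):
    """Residual sum of squares for the model y[t] = a * y[t-1]."""
    past, now = y[:-1], y[1:]
    a = sum(p * q for p, q in zip(past, now)) / sum(p * p for p in past)
    return sum((q - a * p) ** 2 for p, q in zip(past, now))


def rss_unrestricted(x, y):
    """Residual sum of squares for y[t] = a * y[t-1] + b * x[t-1]."""
    yp, xp, now = y[:-1], x[:-1], y[1:]
    syy = sum(v * v for v in yp)
    sxx = sum(v * v for v in xp)
    sxy = sum(u * v for u, v in zip(xp, yp))
    sy = sum(u * v for u, v in zip(yp, now))
    sx = sum(u * v for u, v in zip(xp, now))
    # Solve the 2x2 normal equations in closed form.
    det = syy * sxx - sxy * sxy
    a = (sy * sxx - sx * sxy) / det
    b = (sx * syy - sy * sxy) / det
    return sum((n - a * u - b * v) ** 2 for u, v, n in zip(yp, xp, now))


def granger_statistic(x, y):
    """log(RSS_restricted / RSS_unrestricted); larger means x helps predict y."""
    return math.log(rss_restricted(y) / rss_unrestricted(x, y))


# Synthetic example: y follows x with a one-step delay plus small noise.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0.0, 0.1) for t in range(1, 200)]
x_to_y = granger_statistic(x, y)
y_to_x = granger_statistic(y, x)
```

A large statistic from x to y and a small one from y to x suggests keeping only the temporal link from x to y, which is the kind of constraint that prunes a DBN structure before parameter learning.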

Modeling Temporal Interactions with Interval Temporal Bayesian Networks for Complex Activity Recognition
Yongmian Zhang, Yifan Zhang, Eran Swears, Natalia Larios, Ziheng Wang and Qiang Ji
TPAMI 2014
We introduce the interval temporal Bayesian network (ITBN), a novel graphical model that combines the Bayesian Network with the interval algebra to explicitly model the temporal dependencies among basic human actions over time intervals. Advanced machine learning methods are introduced to learn the ITBN model structure and parameters.
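
The interval algebra referred to here is usually Allen's interval algebra, which defines thirteen possible relations between two time intervals. The sketch below classifies the relation between two action intervals given as (start, end) pairs with start < end; the function names and example intervals are illustrative and not taken from the paper.

```python
INVERSE = {
    "before": "after",
    "meets": "met-by",
    "overlaps": "overlapped-by",
    "starts": "started-by",
    "during": "contains",
    "finishes": "finished-by",
}


def _forward_relation(a, b):
    """Return one of the six base relations of a to b, or None."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:
        return "before"
    if e1 == s2:
        return "meets"
    if s1 < s2 < e1 < e2:
        return "overlaps"
    if s1 == s2 and e1 < e2:
        return "starts"
    if s2 < s1 and e1 < e2:
        return "during"
    if s2 < s1 and e1 == e2:
        return "finishes"
    return None


def allen_relation(a, b):
    """Name the Allen relation that holds from interval a to interval b."""
    if a == b:
        return "equals"
    relation = _forward_relation(a, b)
    if relation is not None:
        return relation
    # Otherwise one of the base relations holds in the other direction.
    return INVERSE[_forward_relation(b, a)]


# Hypothetical action intervals in seconds.
reach = (0.0, 2.0)
grasp = (2.0, 3.0)
walk = (0.0, 5.0)
relations = [
    allen_relation(reach, grasp),
    allen_relation(grasp, walk),
    allen_relation(walk, grasp),
]
```

Relations such as these are what let a graphical model reason over how whole actions are arranged in time rather than over individual frames.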

3. Demos

Upper body gesture recognition

Gesture recognition for teaching mathematics

4. Activity/action recognition datasets

5. Related Publications