3D Human Pose Estimation Based on Multi-Input Multi-Output Convolutional Neural Network and Event Cameras: A Proof of Concept on the DHP19 Dataset

Manilii, A.; Lucarelli, L.; Rosati, R.; Romeo, L.; Mancini, A.; Frontoni, E.

doi:10.1007/978-3-030-68763-2_2

Human Pose Estimation (HPE) is one of the main research themes in computer vision. Despite the innovative methods introduced for frame processing, standard frame-based cameras still have several drawbacks, such as data redundancy and a fixed frame rate. Event-based cameras provide higher temporal resolution at lower memory and computational cost while preserving the information relevant for processing, and thus represent a new solution for real-time applications. This paper employs the DHP19 dataset, the first and, to date, the only dataset with HPE data recorded from Dynamic Vision Sensor (DVS) event-based cameras. Starting from the baseline single-input single-output (SISO) Convolutional Neural Network (CNN) model proposed in the literature, a novel multi-input multi-output (MIMO) CNN-based architecture is proposed to model two different single-camera views simultaneously. Experimental results show that the proposed MIMO approach outperforms the standard SISO model in terms of accuracy and training time.