MuHAVi: A Multicamera Human Action Video Dataset
1. Introduction
As part of the EPSRC-funded REASON project, a large body of human action video (called MuHAVi) has been collected using multiple cameras in a challenging environment (non-uniform background and night-time street-light illumination). The raw images in the dataset can be used by different types of human action recognition methods (depending on which image features they use), and also serve as a benchmark for evaluating robust object segmentation algorithms.

The dataset complements similar efforts such as the CMU Motion Database [1] and HumanEva [2], both mostly aimed at pose recovery with motion-capture ground truth. The closest dataset is IXMAS [3], another challenging multi-view collection. MuHAVi concentrates on CCTV-like views (at an angle and at some distance from the observed people) under real street-light illumination.

Manually annotated silhouettes have been produced specifically for evaluating silhouette-based human action recognition (SBHAR) methods. For action recognition algorithms that rely purely on human silhouettes, i.e. where other image properties such as color and intensity are not necessarily used, accurate silhouette data for each video frame is essential. Producing such silhouettes is not usually considered part of action recognition, but a lower-level problem in change detection and motion tracking. Hence, for researchers working at the recognition level, access to reliable manually annotated silhouettes is a major bonus: the comparison of action recognition algorithms is not distorted by possible differences in segmentation approaches.

Nevertheless, because the silhouettes are simply masks that define the foreground, appearance-based methods for action recognition can also be evaluated on this dataset. Furthermore, for researchers interested in object segmentation, the manually annotated silhouettes provide a useful ground truth for evaluating their algorithms. Finally, because the data is multi-camera, it may also be used by researchers working on 3D reconstruction, e.g. with space-carving methods [4], from which silhouettes can be generated by projection. In short, the dataset serves multiple valuable purposes.
Human Action Recognition Using Silhouette Histogram
Chaur-Heh Hsieh, *Ping S. Huang, and Ming-Da Tang
Department of Computer and Communication Engineering
Ming Chuan University
Taoyuan 333, Taiwan, ROC
*Department of Electronic Engineering
Ming Chuan University
Taoyuan 333, Taiwan, ROC
Proposed Method
The proposed system consists of four main processes, as shown in Figure 1. First, the human silhouette is extracted from the input video by background subtraction; the MuHAVi dataset can be used to provide silhouette data directly. Then, the extracted silhouette is mapped into three polar coordinate systems that characterize three parts of the human figure. The largest circle covers the motion of the whole body, while the other two circles capture the effect that the arms and legs have on the action silhouette; this is why their centres are placed between the shoulders and between the hips, respectively. Each polar coordinate system is quantized by partitioning it into cells of different radii and angles. By counting the number of silhouette pixels that fall into each cell at a particular frame, the silhouette histogram of that frame is obtained. Collecting the silhouette histograms over a sequence of frames yields a descriptor of the video clip, which is used to describe the human action. Based on this silhouette histogram descriptor, an action classifier is trained and then used to recognize the action of an input video clip.
Figure 1. Block diagram of the four main processes of the proposed system.
1- Silhouette Extraction
The MuHAVi dataset is used, as it provides manually annotated silhouettes for a range of action classes.
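When raw MuHAVi video frames are used instead of the provided annotations, the silhouette must first be obtained by background subtraction. The following minimal Python sketch (using OpenCV's MOG2 background model) illustrates one way this could be done; the function name, parameter values, and mask clean-up steps are illustrative assumptions rather than part of the original method.

import cv2

def extract_silhouettes(video_path):
    """Yield a binary silhouette mask for every frame of the video."""
    cap = cv2.VideoCapture(video_path)
    # Mixture-of-Gaussians background model; shadow detection enabled so
    # that shadow pixels (marked 127 by OpenCV) can be discarded below.
    bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                            detectShadows=True)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)
        # Keep only confident foreground pixels and clean up the mask with
        # morphological opening and closing.
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        yield mask
    cap.release()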
2- Polar Transform
To describe the human shape effectively, the Cartesian coordinates of the silhouette pixels are transformed into polar coordinates through the following equations:

r_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2}, \qquad \theta_i = \tan^{-1}\left(\frac{y_i - y_c}{x_i - x_c}\right)

where (x_i, y_i) is the coordinate of a silhouette pixel in the Cartesian coordinate system, (r_i, \theta_i) is the corresponding radius and angle in the polar coordinate system, and (x_c, y_c) is the centre of the silhouette. The centre of the silhouette can be calculated by

x_c = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad y_c = \frac{1}{N}\sum_{i=1}^{N} y_i

where N is the total number of silhouette pixels.
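As a minimal sketch, the mapping above can be implemented directly with NumPy; the function below assumes the silhouette is given as a binary mask (nonzero pixels are foreground) and is illustrative rather than the authors' implementation.

import numpy as np

def polar_coordinates(silhouette):
    """Return (r, theta) for every silhouette pixel, relative to the centroid."""
    ys, xs = np.nonzero(silhouette)       # row/column indices of foreground pixels
    xc, yc = xs.mean(), ys.mean()         # centroid (x_c, y_c) = mean of coordinates
    dx, dy = xs - xc, ys - yc
    r = np.sqrt(dx ** 2 + dy ** 2)        # radius r_i of each pixel
    theta = np.arctan2(dy, dx)            # angle theta_i in [-pi, pi]
    return r, theta, (xc, yc)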
Existing approaches often use a single polar coordinate system to describe the human posture. However, our investigation indicates that a single coordinate system is not sufficient to discriminate between postures with only small differences. In this work, we therefore use three polar coordinate systems (three circles), defined as:
C1: Circle that encloses the whole human body.
C2: Circle that encloses the upper part of the body.
C3: Circle that encloses the lower part of the body.
To illustrate the benefit of this design, consider two similar actions, wave1 (one-hand waving) and wave2 (two-hand waving), as shown in Figure 2. The silhouette histograms obtained from C1 and C2 are shown in Figure 3 and Figure 4, respectively. The two histograms from C1 are very similar, so the discriminability between the two action types is poor. In contrast, the two histograms from C2 show clearly superior discriminability.
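The discriminability illustrated in Figures 3 and 4 could be quantified with a histogram distance: a larger distance between the wave1 and wave2 histograms indicates better separability. The sketch below uses the chi-square distance; the histogram variable names are hypothetical.

import numpy as np

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two histograms (normalized internally)."""
    h1 = h1 / (h1.sum() + eps)
    h2 = h2 / (h2.sum() + eps)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

# Hypothetical usage: histograms of wave1 and wave2 computed from circle C1
# give a small distance (poor discriminability), whereas the corresponding
# histograms from circle C2 give a noticeably larger distance.
# d_C1 = chi_square_distance(hist_wave1_C1, hist_wave2_C1)
# d_C2 = chi_square_distance(hist_wave1_C2, hist_wave2_C2)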
3- Histogram Computation
The procedures for calculating the silhouette histogram can be organized into the following steps.
i- First, compute the centre of the human silhouette and divide the silhouette into an upper part and a lower part according to the centre position. Then compute the centres of the upper and lower silhouettes individually. These three centre positions are taken as the origins of the respective polar coordinate systems.
ii- Second, compute the heights of the human silhouettes in the sequence; these are used to determine the radius of C1. The radii of C2 and C3 are half of the C1 radius.
iii- Third, compute the three histograms separately for each human silhouette, one for each polar coordinate system.
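A minimal sketch of steps (i)-(iii) is given below. It assumes a binary silhouette mask, derives the C1 radius from the height of the current silhouette (the paper uses the heights computed over the sequence), and uses illustrative bin counts for the radial and angular cells.

import numpy as np

def silhouette_histogram(silhouette, n_r=5, n_theta=12):
    """Concatenated polar histograms for circles C1 (whole), C2 (upper), C3 (lower)."""
    ys, xs = np.nonzero(silhouette)
    xc, yc = xs.mean(), ys.mean()                    # whole-body centre (origin of C1)
    height = ys.max() - ys.min() + 1
    r1 = height / 2.0                                # radius of C1 (assumption: per-frame height)
    r23 = r1 / 2.0                                   # radii of C2 and C3 (half of C1)

    upper = ys < yc                                  # split the silhouette at the body centre
    origins = [
        (xc, yc, r1),                                # C1: whole body
        (xs[upper].mean(), ys[upper].mean(), r23),   # C2: upper part
        (xs[~upper].mean(), ys[~upper].mean(), r23), # C3: lower part
    ]

    hists = []
    for cx, cy, radius in origins:
        dx, dy = xs - cx, ys - cy
        r = np.sqrt(dx ** 2 + dy ** 2)
        theta = np.arctan2(dy, dx)
        # 2-D histogram over radial and angular cells; pixels outside the
        # circle fall outside the range and are not counted.
        h, _, _ = np.histogram2d(r, theta, bins=[n_r, n_theta],
                                 range=[[0, radius], [-np.pi, np.pi]])
        hists.append(h.ravel())
    return np.concatenate(hists)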
Human action recognition using shape and CLG-motion flow from multi-view image sequences
Mohiuddin Ahmad, Seong-Whan Lee
Introduction
Recognition of human actions from multi-view image sequences is very popular in the computer vision community, since it has applications in video surveillance and monitoring, human-computer interaction, model-based compression, augmented reality, and so on. Existing methods of human action recognition can be categorized according to the image properties they use, such as motion-based, shape-based, and gradient-based methods. Several human action recognition methods have been proposed in the last few decades; detailed surveys discuss the different methodologies of human action recognition, human movement analysis, and related topics. Based on these reviews, researchers use either human body shape information or motion information, with or without a body shape model, for action recognition. This approach can be considered a combination of shape- and motion-based representation that does not use any prior body shape model.

One standard approach to human action recognition is to extract a set of features from each frame of the image sequence and use these features to train classifiers and to perform recognition. It is therefore important to answer the following question: which features are robust for action recognition under critical conditions or in varying environments? There is usually no rigid syntax or well-defined structure available for human action recognition. Moreover, several sources of variability can affect recognition, such as variation in speed, viewpoint, size and shape of the performer, phase change of the action, and so on, and the motion of the human body is non-rigid in nature. These characteristics make human action recognition a challenging and sophisticated task. Considering the above circumstances, the following issues affect the development of action models and classifiers:
• The trajectory of an action differs between viewing directions, and some body parts (part of a hand, the lower part of a leg, part of the body, etc.) are occluded due to view changes, as shown in Fig. 6.
Fig. 6. Representation of human action using shape and motion sequences with multiple views: (a) multiple-view variation of an action; (b) shape sequences (walking, raising the right hand, and bowing); (c) motion sequences (walking, raising the right hand, and bowing); the motion distribution is different for each action.
• An action can be viewed as a series of silhouette images of the human body (Fig. 6(b)). The silhouette information involves no translation, rotation, or scaling. Moreover, the silhouette sequence of an action is invariant to speed.
• An action can also be viewed in terms of the motion or velocity of the body parts (Fig. 6(c)). A simple action involves the motion of a small number of body parts, while a complex action involves the motion of the whole body. The motion is non-rigid in nature.
• Human action depends on anthropometry, the manner of performing the action, phase variation (the starting and ending time of the action), scale variation of the action, and so on.
Proposed Method
Fig. 7. Flow diagram of the proposed method.
Fig. 7 shows a block diagram of the proposed method. In the preprocessing steps, the foreground is extracted using background modeling, shadow elimination, and morphological operations. From the foreground image, the velocity of an action is estimated using combined local-global (CLG) optical flow. Global shape-flow features are extracted from the silhouette image sequence; the shape flow represents the flow deviation and invariant moments. The modified Zernike moment, which is robust against noise and invariant to scale, rotation, and translation, is used to reduce noise and to normalize the action data spatially. Motion features are extracted with respect to the same center of mass (CM) of the corresponding silhouette image. The combined features are then fed to a multi-dimensional hidden Markov model (MDHMM). In the classification stage, an unknown sequence is matched against each model by computing the probability that the MDHMM could generate that sequence; the MDHMM with the highest probability is taken to have generated the sequence.
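As an illustration of the classification stage, the sketch below replaces the MDHMM with one standard Gaussian HMM per action class (using the hmmlearn library); the decision rule is the same as described above: the model with the highest likelihood for the unknown feature sequence determines the recognized action. The data layout and parameter values are assumptions, not the authors' implementation.

import numpy as np
from hmmlearn import hmm

def train_action_models(training_data, n_states=5):
    """training_data maps an action label to a list of (T, D) feature sequences."""
    models = {}
    for label, sequences in training_data.items():
        X = np.vstack(sequences)                     # stack all sequences of this class
        lengths = [len(s) for s in sequences]        # per-sequence lengths for hmmlearn
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=100)
        model.fit(X, lengths)                        # Baum-Welch training
        models[label] = model
    return models

def classify(models, sequence):
    """Return the label of the HMM with the highest log-likelihood for the sequence."""
    return max(models, key=lambda label: models[label].score(sequence))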