LEARNING SALIENCY FOR HUMAN ACTION RECOGNITION
When we look at a visual stimulus, certain areas stand out from their neighbouring areas and immediately grab our attention. A map that identifies such areas is called a visual saliency map. As humans can easily recognize actions when watching videos, having their saliency maps available might be beneficial for a fully automated action recognition system. In this thesis we look into ways of learning to predict visual saliency and how to use the learned saliency for action recognition. In the first phase, as opposed to approaches that use manually designed features for saliency prediction, we propose a few multilayer architectures for learning saliency features. First, we learn first-layer features in a two-layer architecture using an unsupervised learning algorithm. Second, we learn second-layer features in a two-layer architecture using supervision from recorded human gaze fixations. Third, we use a deep architecture that learns features at all layers using only supervision from recorded human gaze fixations. We show that the saliency prediction results we obtain are better than those obtained by approaches that use manually designed features. We also show that applying supervision at higher levels yields better saliency prediction results, i.e. the second approach outperforms the first, and the third outperforms the second. In the second phase we focus on how saliency can be used to localize areas that will be used for action classification. In contrast to manually designed action features, such as HOG/HOF, we learn the features using a fully supervised deep learning architecture. We show that our features in combination with the predicted saliency (from the first phase) outperform manually designed features. We further develop an SVM framework that uses the predicted saliency and learned action features to both localize (in terms of bounding boxes) and classify the actions.
We use saliency prediction as an additional cost in the SVM training and testing procedure when inferring the bounding box locations. We show that the approach in which saliency cost is added yields better action recognition results than the approach in which the cost is not added. The improvement is larger when the cost is added both in training and testing, rather than just in testing.
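To make the idea of a saliency term in bounding-box inference concrete, the following is a minimal sketch and not the formulation used in the thesis. It assumes that each candidate box already has a classifier score, that the saliency term is the mean predicted saliency inside the box, and that the two are combined with a hypothetical weight `lam`; the function names are illustrative only.

```python
import numpy as np

def integral_image(saliency):
    # Pad with a leading row and column of zeros so box sums need no edge cases.
    return np.pad(saliency, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def box_mean_saliency(ii, box):
    # box = (x0, y0, x1, y1), half-open: columns [x0, x1), rows [y0, y1).
    x0, y0, x1, y1 = box
    total = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    return total / ((x1 - x0) * (y1 - y0))

def best_box(saliency, boxes, class_scores, lam=1.0):
    # Assumed combination: classifier score plus lam times mean saliency in the box.
    ii = integral_image(saliency)
    combined = [s + lam * box_mean_saliency(ii, b)
                for b, s in zip(boxes, class_scores)]
    k = int(np.argmax(combined))
    return boxes[k], combined[k]

# Toy example: the salient region is the lower-right quadrant of a 4x4 map.
saliency = np.zeros((4, 4))
saliency[2:4, 2:4] = 1.0
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
class_scores = [0.5, 0.4]

box_with_saliency, score_with_saliency = best_box(saliency, boxes, class_scores, lam=1.0)
box_without_saliency, _ = best_box(saliency, boxes, class_scores, lam=0.0)
```

In this toy setting the saliency term changes which box is selected: with `lam=0.0` the box with the higher classifier score wins, while with `lam=1.0` the box covering the salient region wins.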