Attribute Learning for Image/Video Understanding
MetadataShow full item record
For the past decade computer vision research has achieved increasing success in visual recognition including object detection and video classification. Nevertheless, these achievements still cannot meet the urgent needs of image and video understanding. The recently rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. In particular, these types of media data usually contain very complex social activities of a group of people (e.g. YouTube video of a wedding reception) and are captured by consumer devices with poor visual quality. Thus it is extremely challenging to automatically understand such a high number of complex image and video categories, especially when these categories have never been seen before. One way to understand categories with no or few examples is by transfer learning which transfers knowledge across related domains, tasks, or distributions. In particular, recently lifelong learning has become popular which aims at transferring information to tasks without any observed data. In computer vision, transfer learning often takes the form of attribute learning. The key underpinning idea of attribute learning is to exploit transfer learning via an intermediatelevel semantic representations – attributes. The semantic attributes are most commonly used as a semantically meaningful bridge between low feature data and higher level class concepts, since they can be used both descriptively (e.g., ’has legs’) and discriminatively (e.g., ’cats have it but dogs do not’). Previous works propose many different attribute learning models for image and video understanding. However, there are several intrinsic limitations and problems that exist in previous attribute learning work. Such limitations discussed in this thesis include limitations of user-defined attributes, projection domain-shift problems, prototype sparsity problems, inability to combine multiple semantic representations and noisy annotations of relative attributes. To tackle these limitations, this thesis explores attribute learning on image and video understanding from the following three aspects. Firstly to break the limitations of user-defined attributes, a framework for learning latent attributes is present for automatic classification and annotation of unstructured group social activity in videos, which enables the tasks of attribute learning for understanding complex multimedia data with sparse and incomplete labels. We investigate the learning of latent attributes for content-based understanding, which aims to model and predict classes and tags relevant to objects, sounds and events – anything likely to be used by humans to describe or search for media. Secondly, we propose the framework of transductive multi-view embedding hypergraph label propagation and solve three inherent limitations of most previous attribute learning work, i.e., the projection domain shift problems, the prototype sparsity problems and the inability to combine multiple semantic representations. We explore the manifold structure of the data distributions of different views projected onto the same embedding space via label propagation on a graph. Thirdly a novel framework for robust learning is presented to effectively learn relative attributes from the extremely noisy and sparse annotations. Relative attributes are increasingly learned from pairwise comparisons collected via crowdsourcing tools which are more economic and scalable than the conventional laboratory based data annotation. However, a major challenge for taking a crowdsourcing strategy is the detection and pruning of outliers. We thus propose a principled way to identify annotation outliers by formulating the relative attribute prediction task as a unified robust learning to rank problem, tackling both the outlier detection and relative attribute prediction tasks jointly. In summary, this thesis studies and solves the key challenges and limitations of attribute learning in image/video understanding. We show the benefits of solving these challenges and limitations in our approach which thus achieves better performance than previous methods.
- Theses