FROM VISUAL SALIENCY TO VIDEO BEHAVIOUR UNDERSTANDING
In a world of ever-increasing amounts of video data, we are forced to abandon traditional methods of fully manual scene interpretation. Under such circumstances, some form of automation is highly desirable, but this can be a very open-ended problem of high complexity. Dealing with such large amounts of data is a non-trivial task that requires efficient, selective extraction of the parts of a scene which have the potential to develop a higher semantic meaning, alone or in combination with others. In particular, the types of video data in need of automated analysis tend to be outdoor scenes with high levels of activity generated by either foreground or background. Such dynamic scenes add considerable complexity to the problem, since we cannot rely on motion energy alone to detect regions of interest. Furthermore, the behaviour of these regions of motion can differ greatly, while still being highly dependent, both spatially and temporally, on the movement of other objects within the scene. Modelling these dependencies, whilst eliminating as much redundancy as possible from the feature extraction process, is the challenge addressed by this thesis.

In the first half, finding the right mechanism to extract and represent meaningful features from dynamic scenes with no prior knowledge is investigated. Meaningful or salient information is treated as the parts of a scene that stand out or seem unusual or interesting to us. The novelty of the work is that it is able to select salient scales in both space and time at which a particular spatio-temporal volume is considered interesting relative to the rest of the scene. By quantifying the temporal saliency values of regions of motion, it is possible to consider their importance in both the long and the short term. Variations in entropy over spatio-temporal scales are used to select a context-dependent measure of the local scene dynamics.
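To make the idea concrete, the following is a minimal sketch of entropy computed over increasing spatio-temporal scales. The function names, the cubic neighbourhoods, the histogram bin count, and the peak-entropy scale-selection rule are all illustrative assumptions, not the thesis's actual formulation:

```python
import numpy as np

def intensity_entropy(volume, bins=16):
    """Shannon entropy of the intensity histogram of a spatio-temporal volume.

    Assumes intensities lie in [0, 1]; bin count is an illustrative choice.
    """
    hist, _ = np.histogram(volume, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def entropy_over_scales(video, centre, scales):
    """Entropy of increasingly large neighbourhoods around one voxel.

    video  : (T, H, W) array of intensities in [0, 1]
    centre : (t, y, x) voxel of interest
    scales : half-widths defining cubic spatio-temporal neighbourhoods
    """
    t, y, x = centre
    values = []
    for s in scales:
        vol = video[max(t - s, 0):t + s + 1,
                    max(y - s, 0):y + s + 1,
                    max(x - s, 0):x + s + 1]
        values.append(intensity_entropy(vol))
    return np.array(values)

def salient_scale(video, centre, scales):
    """Pick the scale where entropy peaks, i.e. where the local intensity
    distribution is least predictable (one possible selection rule)."""
    curve = entropy_over_scales(video, centre, scales)
    return scales[int(np.argmax(curve))]
```

A completely uniform volume yields zero entropy, while textured or changing regions produce higher values, so tracking how the entropy curve varies with scale gives a context-dependent saliency signal of the kind described above.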
A method of quantifying temporal saliency is devised based on the variation of the entropy of the intensity distribution in a spatio-temporal volume over increasing scales. Entropy is used in preference to traditional filter methods, since the stability or predictability of the intensity distribution of a local spatio-temporal region over scales can be defined more robustly relative to the context of its neighbourhood, even for regions exhibiting high intensity variation due to strong texture. Results show that it is possible to extract both locally salient and globally salient temporal features from contrasting scenarios.

In the second part of the thesis, the focus shifts towards binding these spatio-temporally salient features together so that some semantic meaning can be inferred from their interaction. Interaction, in this sense, refers to any form of temporally correlated behaviour between salient regions of motion in a scene. Feature binding as a mechanism for interactive behaviour understanding is particularly important if we consider that regions of interest may not be significant individually, but represent much more semantically when considered in combination. Temporally correlated behaviour is identified and classified using accumulated co-occurrences of salient features at two levels. Firstly, co-occurrences are accumulated for spatio-temporally proximate salient features to form a local representation. At the next level, the co-occurrences of these locally bound spatio-temporal features are accumulated again in order to discover unusual behaviour in the scene. The novelty of this work is that no assumptions are made about whether interacting regions should be spatially proximate, and no prior knowledge of the scene topology is used. Results show that it is possible to detect unusual interactions between regions of motion which can visually infer higher levels of semantics.
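The co-occurrence accumulation step can be sketched as follows. This is an assumed, simplified rendering: events are reduced to (time, feature-type) pairs, the temporal window and the rarity-based unusualness score are hypothetical choices, and the thesis's actual two-level representation is richer than this:

```python
import numpy as np

def accumulate_cooccurrences(events, n_types, window):
    """Accumulate a co-occurrence matrix over temporally proximate salient events.

    events  : list of (time, feature_type) pairs for detected salient regions
    n_types : number of discrete feature types
    window  : two events co-occur if their times differ by at most `window`
              (note: no spatial proximity constraint is imposed)
    """
    cooc = np.zeros((n_types, n_types), dtype=int)
    events = sorted(events)
    for i, (t_i, f_i) in enumerate(events):
        for t_j, f_j in events[i + 1:]:
            if t_j - t_i > window:
                break  # events are time-sorted, so no later pair qualifies
            cooc[f_i, f_j] += 1
            cooc[f_j, f_i] += 1
    return cooc

def unusualness(cooc, pair):
    """Score a pair of feature types as unusual when its accumulated
    co-occurrence count is low relative to the matrix total."""
    total = cooc.sum()
    if total == 0:
        return 1.0
    return 1.0 - cooc[pair] / total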
In the final part of the thesis, a more specific investigation of human behaviour is addressed through the classification and detection of interactions between two human subjects. Here, further modifications are made to the feature extraction process in order to quantify the spatio-temporal saliency of a region of motion. These features are then grouped to find the people in the scene. A loose pose distribution model is then extracted for each person, and canonical correlation analysis is used to find salient correlations between the poses of two interacting people. The resulting canonical factors can be formed into trajectories and used for classification, with the Levenshtein distance then used to categorise them. The novelty of the work is that interactions do not have to be spatially connected or proximate in order to be recognised. Furthermore, the data used are outdoor scenes cluttered with non-stationary background. Results show that co-occurrence techniques have the potential to provide a more generalised, compact, and meaningful representation of dynamic interactive scene behaviour.
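The Levenshtein-based classification step might look like the sketch below, where the canonical-factor trajectories are assumed to have already been quantised into discrete symbol sequences; the nearest-neighbour decision rule and all names here are illustrative assumptions rather than the thesis's exact procedure:

```python
def levenshtein(a, b):
    """Edit distance between two symbol sequences, used to compare
    quantised canonical-factor trajectories of interacting pairs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def classify(trajectory, labelled):
    """Nearest-neighbour categorisation of a trajectory against labelled
    examples, given as (sequence, interaction_label) pairs."""
    return min(labelled, key=lambda item: levenshtein(trajectory, item[0]))[1]
```

Because edit distance tolerates insertions and deletions, sequences of different lengths (e.g. interactions unfolding at different speeds) can still be compared directly.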
Author: Hung, Hayley Shi Wen