What's in a Name? Intelligent Classification and Identification of Online Media Content
Abstract
The sheer amount of content on the Internet poses a number of challenges for content providers and users alike. The providers want to classify and identify user downloads for market research, advertising and legal purposes. From the user’s perspective it is increasingly difficult to find interesting content online, hence content personalisation and media recommendation is expected by the public. An especially important (and also technically challenging) case is when a downloadable item has no supporting description or meta-data, as in the case of (normally illegal) torrent downloads, which comprise 10 to 30 percent of the global traffic depending on the region. In this case, apart from its size, we have to rely entirely on the filename – which is often deliberately obfuscated – to identify or classify what the file really is.
The Hollywood movie industry is sufficiently motivated by this problem that it has invested significant research – through its company MovieLabs – to help understand more precisely what material is being illegally downloaded in order both to combat piracy and exploit the extraordinary opportunities for future sales and marketing. This thesis was inspired, and partly supported, by MovieLabs who recognised the limitations of their current purely data-driven algorithmic approach.
The research hypothesis is that, by extending state-of-the-art information retrieval (IR) algorithms and by developing an underlying causal Bayesian Network (BN) incorporating expert judgment and data, it is possible to improve on the accuracy of MovieLabs’s benchmark algorithm for identifying and classifying torrent names. In addition to identification and standard classification (such as whether the file is Movie, Soundtrack, Book, etc.) we consider the crucial orthogonal classifications of pornography and malware. The work in the thesis provides a number of novel extensions to the generic problem of classifying and personalising internet content based on minimal data and on validating the results in the absence of a genuine ‘oracle’.
The system developed in the thesis (called Toran) is extensively validated using a sample of torrents classified by a panel of 3 human experts and the MovieLabs system, divided into knowledge and validation sets of 2,500 and 479 records respectively. In the absence of an automated classification oracle, we established manually the true classification for the test set of 121 records in order to be able to compare Toran, the human panel (HP) and the MovieLabs system (MVL). The results show that Toran performs better than MVL for the key medium categories that contain most items, such as music, software, movies, TVs and other videos. Toran also has the ability to assess the risk of fakes and malware prior to download, and is on par or even surpasses human experts in this capability.
Authors
Dementiev, EugeneCollections
- Theses [3704]