Probabilistic retrieval models - relationships, context-specific application, selection and implementation
Abstract
Retrieval models are the core components of information retrieval systems: they determine the document
and query representations, as well as the document ranking schemes. TF-IDF, the binary
independence retrieval (BIR) model and language modelling (LM) are three of the most influential
contemporary models due to their stability and performance. The BIR model and LM
have probabilistic theory as their basis, whereas TF-IDF is viewed as a heuristic model whose
theoretical justification continues to fascinate researchers.
This thesis first investigates the parallel derivation of the BIR model, LM and the Poisson model
with respect to event spaces, relevance assumptions and ranking rationales. It establishes a bridge between
the BIR model and LM, and derives TF-IDF from the probabilistic framework.
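For orientation, the three models are commonly written in the following textbook forms. This is a sketch in assumed notation, not the derivation given in the thesis: N is the number of documents, n_t the number of documents containing term t, p_t and q_t the probabilities that t occurs in a relevant and a non-relevant document, and \lambda a mixture parameter whose convention varies between authors.

```latex
% TF-IDF term weight (one common variant)
w_{\text{TF-IDF}}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{n_t}

% BIR retrieval status value, summed over terms shared by document and query
\mathrm{RSV}_{\text{BIR}}(d, q) = \sum_{t \in d \cap q} \log \frac{p_t \, (1 - q_t)}{q_t \, (1 - p_t)}

% Query-likelihood language model with linear (Jelinek-Mercer) smoothing
P(q \mid d) = \prod_{t \in q} \bigl[ (1 - \lambda) \, P(t \mid d) + \lambda \, P(t \mid c) \bigr]
```

A well-known observation links the first two: with no relevance information, setting p_t to a constant 1/2 and estimating q_t by n_t / N reduces the BIR term weight to \log((N - n_t) / n_t), which behaves like an IDF weight.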
Then, the thesis presents the probabilistic logical modelling of the retrieval models. It demonstrates
various ways to estimate and aggregate probabilities, as well as alternative implementations for
non-probabilistic operators. Typical models have been implemented.
The next contribution concerns the use of context-specific frequencies, i.e., frequencies
counted over different element types or within different text scopes. The hypothesis
is that they can help to rank the elements in structured document retrieval. The thesis applies
context-specific frequencies to the term weighting schemes in these models, and the outcome is a
generalised retrieval model with regard to both element and document ranking.
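The idea of a context-specific frequency can be illustrated with a minimal sketch. It assumes a toy collection in which each document is a list of (element type, text) pairs; the data, the function names `context_frequencies` and `idf`, and the plain log(N / n_t) weighting are all illustrative assumptions, not the thesis's implementation.

```python
import math
from collections import Counter

# Hypothetical toy collection: each document is a list of (element_type, text) pairs.
docs = [
    [("title", "probabilistic retrieval"), ("section", "language models for retrieval")],
    [("title", "database systems"), ("section", "retrieval of structured data")],
    [("title", "language models"), ("section", "smoothing methods")],
]

def context_frequencies(docs, element_type=None):
    """Count, per term, the number of contexts that contain it.

    With element_type=None each whole document is one context (the classic
    document frequency); otherwise each element of that type is one context.
    Returns (frequency counter, number of contexts).
    """
    freq = Counter()
    n_contexts = 0
    for doc in docs:
        if element_type is None:
            contexts = [" ".join(text for _, text in doc)]
        else:
            contexts = [text for etype, text in doc if etype == element_type]
        for text in contexts:
            n_contexts += 1
            freq.update(set(text.split()))  # count each term once per context
    return freq, n_contexts

def idf(term, freq, n_contexts):
    """Plain log(N / n_t) weight; returns 0.0 for terms never seen."""
    n_t = freq.get(term, 0)
    return math.log(n_contexts / n_t) if n_t else 0.0

doc_freq, n_docs = context_frequencies(docs)               # document scope
title_freq, n_titles = context_frequencies(docs, "title")  # title scope only
```

In this toy data "retrieval" occurs in two of three documents but in only one of three titles, so `idf("retrieval", title_freq, n_titles)` is larger than `idf("retrieval", doc_freq, n_docs)`: the same term is weighted differently depending on the context in which frequencies are counted.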
The retrieval models behave differently on the same query set: for some queries one model
performs better, while for other queries another model is superior. Therefore, one idea to improve the
overall performance of a retrieval system is to choose, for each query, the model that is likely
to perform best. This thesis proposes and empirically explores a model selection method
based on the correlation between query features and query performance, which contributes to the
methodology of dynamically choosing a model.
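A correlation-based selection rule of this general kind might be sketched as follows. Everything here is an assumption made for illustration: the single query feature, the invented average-precision values, the model names, and the simple rule of splitting at the feature mean. It is not the selection method evaluated in the thesis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical training data: one feature value per query (e.g. query length)
# and the average precision of two models on each query. Values are invented.
feature = [1, 2, 3, 4, 5, 6]
ap = {
    "tfidf": [0.50, 0.45, 0.40, 0.30, 0.25, 0.20],
    "lm":    [0.30, 0.35, 0.40, 0.45, 0.50, 0.55],
}

def fit_selector(feature, ap, model_a, model_b):
    """Build a per-query chooser between model_a and model_b.

    The feature is correlated with the performance gap (model_a minus
    model_b); the sign of the correlation decides which model is preferred
    above the mean feature value.
    """
    gap = [a - b for a, b in zip(ap[model_a], ap[model_b])]
    r = pearson(feature, gap)
    threshold = sum(feature) / len(feature)

    def select(x):
        if r == 0:
            # No linear relationship: fall back to the better model on average.
            return model_a if sum(gap) >= 0 else model_b
        above = x > threshold
        # Positive r means model_a gains as the feature grows.
        return model_a if above == (r > 0) else model_b

    return select, r

select, r = fit_selector(feature, ap, "tfidf", "lm")
```

With this invented data the correlation is strongly negative, so `select` returns "tfidf" for feature values below the mean and "lm" above it. A realistic study would use held-out queries and several features rather than a single threshold.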
In summary, this thesis contributes a study of probabilistic models and their relationships,
the probabilistic logical modelling of retrieval models, the usage and effect of context-specific
frequencies in models, and the selection of retrieval models.
Authors
Wang, Jun