VSM using Apache Lucene

1/6/2024

Class Similarity

All Implemented Interfaces: Serializable
Direct Known Subclasses: DefaultSimilarity, SimilarityDelegator

public abstract class Similarity extends Object implements Serializable

Similarity defines the components of Lucene scoring. Overriding computation of these components is a convenient way to alter Lucene scoring.

Suggested reading: Introduction To Information Retrieval, Chapter 6.

The following describes how Lucene scoring evolves from underlying information retrieval models to (efficient) implementation. We first brief on the VSM score, then derive from it Lucene's Conceptual Scoring Formula, from which, finally, evolves Lucene's Practical Scoring Function (the latter is connected directly with Lucene classes and methods).

Lucene combines the Boolean model (BM) of Information Retrieval with the Vector Space Model (VSM) of Information Retrieval: documents "approved" by BM are scored by VSM.

In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension and the weights are Tf-idf values.

VSM does not require weights to be Tf-idf values, but Tf-idf values are believed to produce search results of high quality, and so Lucene uses Tf-idf. For completeness: for a given term t and document (or query) x, Tf(t,x) varies with the number of occurrences of term t in x (when one increases so does the other), and Idf(t) similarly varies with the inverse of the number of index documents containing term t.

The VSM score of document d for query q is the cosine similarity of the weighted vectors V(q) and V(d):

    cosine-similarity(q,d) = V(q) · V(d) / ( |V(q)| |V(d)| )

where V(q) · V(d) is the dot product of the weighted vectors, and |V(q)| and |V(d)| are their Euclidean norms.

Note: the above equation can be viewed as the dot product of the normalized weighted vectors, in the sense that dividing V(q) by its Euclidean norm normalizes it to a unit vector.

Lucene refines the VSM score for both search quality and usability:

- Normalizing V(d) to the unit vector is known to be problematic in that it removes all document length information. For some documents removing this information is probably fine, e.g. a document made by duplicating a certain paragraph 10 times, especially if that paragraph is made of distinct terms. But for a document which contains no duplicated paragraphs, this might be wrong. To avoid this problem, a different document length normalization factor is used, which normalizes to a vector equal to or larger than the unit vector: doc-len-norm(d).
- At indexing, users can specify that certain documents are more important than others by assigning a document boost. For this, the score of each document is also multiplied by its boost value: doc-boost(d).
- Lucene is field based, hence each query term applies to a single field, document length normalization is by the length of that field, and in addition to the document boost there are also document field boosts.
- The same field can be added to a document several times during indexing, and so the boost of that field is the product of the boosts of the separate additions (or parts) of that field within the document.
- At search time users can specify boosts for each query, sub-query, and query term, hence the contribution of a query term to the score of a document is multiplied by the boost of that query term: query-boost(q).
- A document may match a multi-term query without containing all the terms of that query (this is correct for some of the queries), and users can further reward documents matching more query terms through a coordination factor, which is usually larger when more terms are matched: coord-factor(q,d).

Under the simplifying assumption of a single field in the index, we get Lucene's Conceptual Scoring Formula:

    score(q,d) = coord-factor(q,d) · query-boost(q) · ( V(q) · V(d) / |V(q)| ) · doc-len-norm(d) · doc-boost(d)

The conceptual formula is a simplification in the sense that (1) terms and documents are fielded and (2) boosts are usually per query term rather than per query.

We now describe how Lucene implements this conceptual scoring formula, and derive from it Lucene's Practical Scoring Function.

For efficient score computation, some scoring components are computed and aggregated in advance, starting with the query-boost for the query (actually for each query term).
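To make the VSM score and the conceptual formula more concrete, here is a minimal Java sketch. It is not Lucene code: the class VsmSketch and all of its methods are hypothetical names chosen for illustration, term weights are assumed to be precomputed (for example Tf-idf values), and the boost, coordination and length-normalization factors are simply passed in as numbers instead of being derived from an index.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; none of these names are Lucene APIs. */
final class VsmSketch {

    /** Hypothetical helper: builds a sparse weight vector from parallel arrays. */
    static Map<String, Double> vec(String[] terms, double[] weights) {
        Map<String, Double> v = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            v.put(terms[i], weights[i]);
        }
        return v;
    }

    /** Dot product over the terms (dimensions) the two vectors share. */
    static double dot(Map<String, Double> a, Map<String, Double> b) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }

    /** Euclidean norm |V|. */
    static double norm(Map<String, Double> v) {
        return Math.sqrt(dot(v, v));
    }

    /** V(q) . V(d) / (|V(q)| |V(d)|); returns 0 when either vector has zero norm. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double denom = norm(q) * norm(d);
        return denom == 0.0 ? 0.0 : dot(q, d) / denom;
    }

    /**
     * Single-field conceptual score:
     * coordFactor * queryBoost * (V(q) . V(d) / |V(q)|) * docLenNorm * docBoost.
     * All four factors are assumed to be supplied by the caller.
     */
    static double conceptualScore(Map<String, Double> q, Map<String, Double> d,
                                  double coordFactor, double queryBoost,
                                  double docLenNorm, double docBoost) {
        double qNorm = norm(q);
        if (qNorm == 0.0) {
            return 0.0;
        }
        return coordFactor * queryBoost * (dot(q, d) / qNorm) * docLenNorm * docBoost;
    }
}
```

If docLenNorm is set to 1 / |V(d)| and the other three factors are 1, conceptualScore reduces to plain cosine similarity. This shows where the conceptual formula departs from VSM: the unit-vector normalization of the document is swapped for a separate length-normalization factor, and the boosts and coordination factor are multiplied in.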
