Friday, August 3, 2007

Term Frequency in the Term-Document Matrix

A term-document matrix is used in information retrieval systems; each row represents a term and each column represents a document. Each cell in the t x d matrix represents the importance of term t in document d. Sometimes the absolute term frequency is used for each cell, but absolute frequency can be misleading, primarily because documents and document collections vary in size. It is reasonable to expect a given term to occur more frequently in a long document in a collection than in a short one. At the same time, it is not reasonable to expect the frequency of a given term to be roughly the same in all documents of a given length, since it may be a common term for a document in one collection and a rare term for a document in another. Thus a relative term frequency, i.e. a term frequency count that has been adjusted to take account of document and collection size and characteristics, is used in retrieval computations.
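As a rough illustration of the absolute-frequency matrix described above, here is a minimal Python sketch. The two-document collection, the whitespace tokenization, and the function name term_document_matrix are all assumptions made for this example, not part of any particular retrieval system.

```python
from collections import Counter

# Hypothetical toy collection: one short and one longer document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat across the yard and the cat ran up the tree",
]

def term_document_matrix(documents):
    """Return (terms, matrix), where matrix[t][d] is the absolute
    frequency of terms[t] in documents[d]."""
    # Assumption: lowercase the text and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Rows are the sorted vocabulary of the whole collection.
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:>8}: {row}")
```

The longer second document gets higher raw counts for "the" and "cat" simply because it contains more words, which is the size effect that makes absolute frequency misleading.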
The objective of calculating term frequency is to establish the value of a term in the information retrieval system, so it is important to view frequencies in the context of a specific document collection. One fundamental way of doing this is to use the inverse document frequency weight.

Inverse Document Frequency (IDF) of a term is defined as

log2(N/dk) + 1 = log2(N) - log2(dk) + 1

where
N: the number of documents in the collection
dk : the number of documents containing the term k
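The IDF definition above can be sketched in a few lines of Python. The function name inverse_document_frequency and the collection size of 1024 documents are assumptions chosen so that the logarithms come out as whole numbers.

```python
import math

def inverse_document_frequency(n_docs, doc_freq):
    """IDF = log2(N / dk) + 1, with N = n_docs and dk = doc_freq."""
    # A term must occur in at least one and at most N documents.
    if doc_freq <= 0 or doc_freq > n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log2(n_docs / doc_freq) + 1

# Hypothetical collection of 1024 documents.
for dk in (1, 32, 1024):
    print(dk, inverse_document_frequency(1024, dk))
```

A term found in a single document gets the largest factor, while a term found in every document gets a factor of exactly 1, so its weight reduces to its raw frequency.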

The inverse document frequency weight (tf.idf) of term k in document i is obtained by multiplying the term frequency by the inverse document frequency:
wik = fik × (log2(N) - log2(dk) + 1)

wik : the weight of term k in the document i
fik : the absolute frequency of term k in the document i

Thus the weight of a term in a document is its frequency multiplied by a factor that depends logarithmically on the proportion of documents in the collection that contain the term.
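Putting the pieces together, the following is a minimal sketch of the weight wik = fik × (log2(N) - log2(dk) + 1) computed over a small collection. The four toy documents, the whitespace tokenization, and the function name tfidf_weights are assumptions for illustration only.

```python
import math
from collections import Counter

# Hypothetical collection of N = 4 short documents.
docs = [
    "apple banana apple",
    "banana cherry",
    "banana date",
    "banana apple",
]

def tfidf_weights(documents):
    """Return one dict per document mapping term k to its weight wik."""
    # fik: absolute frequency of each term in each document.
    counts = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(counts)
    # dk: number of documents containing term k (each document counts once).
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        weights.append({
            term: f * (math.log2(n_docs) - math.log2(doc_freq[term]) + 1)
            for term, f in c.items()
        })
    return weights

weights = tfidf_weights(docs)
for i, w in enumerate(weights):
    print(i, w)
```

In this toy collection "banana" appears in every document, so its factor is 1 and its weight equals its raw count, whereas "cherry" appears in only one document and is weighted three times its count.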
