I wrote a command-line utility that takes a term as an argument and outputs the term's df (docFreq) and its total number of occurrences in the index. The df is related to the size of the postings list, since the postings list has to encode a document ID for each document containing the term. The total number of occurrences is related to the size of the positions index entry for that term, since the positions index has to list the position of each occurrence. As an estimate, I assumed 1 byte per docID or position. This probably underestimates the size, since the data are stored as VInts, which can take 1-4 bytes each, as explained in endnote iv.
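To put numbers on that estimate, here is a small sketch that is separate from the utility described in this message. It computes how many bytes a VInt needs for a given value (7 payload bits per byte, so values below 2^28 fit in 1-4 bytes) and brackets the size of a term's postings and positions between the 1-byte and 4-byte assumptions. The class name, method names, and the sample df and occurrence counts are all hypothetical.

```java
// Sketch only: brackets the postings/positions size for one term.
// PostingSizeEstimate and the sample numbers are illustrative assumptions.
final class PostingSizeEstimate {

    // Number of bytes a VInt uses for a non-negative value:
    // each byte carries 7 bits of payload plus a continuation bit.
    static int vIntLength(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // Rough size in bytes for a list of entries at a fixed cost per entry.
    static long estimateBytes(long entries, int bytesPerEntry) {
        return entries * bytesPerEntry;
    }

    public static void main(String[] args) {
        long df = 1_000_000L;        // hypothetical docFreq of a common term
        long totalTf = 25_000_000L;  // hypothetical total occurrences

        System.out.println("postings:  " + estimateBytes(df, 1)
                + " to " + estimateBytes(df, 4) + " bytes");
        System.out.println("positions: " + estimateBytes(totalTf, 1)
                + " to " + estimateBytes(totalTf, 4) + " bytes");
        System.out.println("VInt length of 300: " + vIntLength(300));
    }
}
```

The 1-byte figure is the lower bound used in the estimate above; the 4-byte figure is the corresponding upper bound under the same per-entry model.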
The code:
reads the term from the args[] array
opens an IndexReader
gets the df by calling IndexReader.docFreq(term)
gets a TermDocs enumeration by calling IndexReader.termDocs(term)
iterates through the TermDocs and totals up the tf counts for each document
    while (termDocs.next()) {
        totalTf += termDocs.freq();
    }
The code needs some cleanup, but I could post it to JIRA if you think others might want to use it.
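For readers who want to try the summing step without an index at hand, the following is a minimal, self-contained sketch of it, not the utility itself. FreqCursor is a hypothetical stand-in that mirrors only the two TermDocs methods the loop calls, next() and freq(), and ArrayFreqCursor is an in-memory stand-in for illustration. Against a real index, the cursor would be the TermDocs returned by IndexReader.termDocs(term).

```java
// Sketch only: sums per-document term frequencies the way the loop above does.
// FreqCursor and ArrayFreqCursor are hypothetical stand-ins, not Lucene classes.
final class TermOccurrenceCounter {

    // Mirrors the two TermDocs methods the loop relies on.
    interface FreqCursor {
        boolean next(); // advance to the next document containing the term
        int freq();     // occurrences of the term in the current document
    }

    // In-memory cursor over a fixed list of per-document frequencies.
    static final class ArrayFreqCursor implements FreqCursor {
        private final int[] freqs;
        private int pos = -1;

        ArrayFreqCursor(int... freqs) {
            this.freqs = freqs.clone();
        }

        public boolean next() {
            return ++pos < freqs.length;
        }

        public int freq() {
            return freqs[pos];
        }
    }

    // Total occurrences across all documents the cursor visits.
    // A long is used because the sum can exceed the int range on a large index.
    static long totalTermFreq(FreqCursor cursor) {
        long total = 0;
        while (cursor.next()) {
            total += cursor.freq();
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical term appearing 3, 1, and 7 times in three documents.
        FreqCursor cursor = new ArrayFreqCursor(3, 1, 7);
        System.out.println("total tf = " + totalTermFreq(cursor));
    }
}
```

Keeping the sum in a separate method that takes the cursor as a parameter makes the loop easy to check against small hand-built inputs before pointing it at a large index.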
Posting/position list size computation
Hi Otis,