Posting/position list size computation

Hi Otis,

I wrote a command-line utility that takes a term as an argument and outputs the term's df (docFreq) and its total number of occurrences in the index. The df determines the size of the postings list, since the postings list has to encode a document ID for every document containing the term. The total number of occurrences determines the size of the term's entry in the positions index, since that entry has to list the position of each occurrence. As an estimate, I assumed 1 byte per docID or position. This probably underestimates the size, since the data are stored as VInts, which take 1-5 bytes each, as explained in endnote iv.
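
To make the 1-byte assumption concrete, here is a minimal sketch of how many bytes a value occupies in a VInt-style encoding (7 payload bits per byte, with the high bit marking a continuation), next to the 1-byte-per-entry estimate. The class and method names are hypothetical, and the df and occurrence counts in main are made-up sample numbers, not measurements from any index.

```java
// Sketch only: VInt byte counts and the 1-byte-per-entry size estimate.
final class VIntSizeEstimate {

    /** Number of bytes the VInt encoding of a non-negative int occupies. */
    static int vIntBytes(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        int bytes = 1;
        // Each byte carries 7 bits; keep going while higher bits remain.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** The estimate from the comment: 1 byte per docID plus 1 byte per position. */
    static long oneBytePerEntryEstimate(long docFreq, long totalOccurrences) {
        return docFreq + totalOccurrences;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384};
        for (int v : samples) {
            System.out.println(v + " -> " + vIntBytes(v) + " byte(s)");
        }
        // Hypothetical term: df = 1,000 and 25,000 total occurrences.
        System.out.println("estimate: " + oneBytePerEntryEstimate(1000, 25000) + " bytes");
    }
}
```

Any stored value above 127 already needs a second byte, which is why the 1-byte figure is best read as a lower bound.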

The code:

  • reads the term from the args[] array
  • opens an IndexReader
  • gets the df by calling IndexReader.docFreq(term)
  • gets a TermDocs enumeration by calling IndexReader.termDocs(term)
  • iterates through the TermDocs and totals up the tf counts for each document
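    TermDocs termDocs = reader.termDocs(term);  // reader is the open IndexReader
    while (termDocs.next()) {
        total_tf += termDocs.freq();  // tf of the term in the current document
    }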
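
The bookkeeping in the steps above can be sketched without Lucene. This is a minimal illustration, assuming a single term's postings are available as an in-memory map from docID to tf; the map is a hypothetical stand-in for what the TermDocs enumeration walks over, and the class name and sample values are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: df and total tf over an in-memory stand-in for one term's
// postings (docID -> tf). No Lucene classes are used here.
final class TermStatsSketch {

    /** df: the number of documents that contain the term. */
    static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    /** Total occurrences: the sum of the per-document tf values. */
    static long totalTf(Map<Integer, Integer> postings) {
        long total = 0;
        for (int tf : postings.values()) {
            total += tf;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        postings.put(3, 2);   // hypothetical: doc 3 contains the term twice
        postings.put(7, 1);
        postings.put(42, 5);
        System.out.println("df=" + docFreq(postings)
                + " total_tf=" + totalTf(postings));
    }
}
```

The total is kept in a long because the summed tf for a very common term in a large index could exceed the range of an int.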

The code needs some cleanup, but I could post it to JIRA if you think others might want to use it.
