Re: Posting/position list size computation
Hi Ahsan,
The stats on the size of the postings are for only one field. We have one field that contains all the OCR for a document (about 800K) and a number of very small metadata fields based on the MARC bibliographic metadata. Those other fields average 1KB or less. We are presently only searching against the OCR field, so all the calculations are for that one field.

The tool we use to get the total number of occurrences of a term is now part of lucene/contrib/. If you use it with the -t flag, it will count the total number of occurrences for the top N most frequent terms, which is a proxy for the size of the positions index. You can also specify the field as well as N.
java org.apache.lucene.misc.HighFreqTerms <index dir> [-t][number_terms] [field]
-t: include totalTermFreq
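
To make the "proxy" idea concrete: the positions list holds one entry per occurrence of a term, so adding up the total occurrence counts of the top N terms gives a rough count of position entries for those terms. Here is a minimal sketch of that arithmetic in plain Java, with no Lucene dependency. The class name PositionsEstimate, the method sumTopN, and the counts in main are all made up for illustration; they are not part of HighFreqTerms and not numbers from our index.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class PositionsEstimate {

    // Sum the total occurrence counts of the n most frequent terms.
    // Each occurrence corresponds to one entry in the positions list,
    // so the sum approximates the number of position entries for those terms.
    static long sumTopN(Map<String, Long> totalTermFreq, int n) {
        return totalTermFreq.values().stream()
                .sorted(Comparator.reverseOrder()) // largest counts first
                .limit(n)                          // keep only the top n terms
                .mapToLong(Long::longValue)
                .sum();
    }

    public static void main(String[] args) {
        // Hypothetical per-term totals, as you might copy them from the -t output.
        Map<String, Long> counts = new HashMap<>();
        counts.put("the", 5000L);
        counts.put("of", 3000L);
        counts.put("marc", 12L);

        System.out.println("position entries for top 2 terms: " + sumTopN(counts, 2));
    }
}
```

This counts entries, not bytes. The on-disk size also depends on how the positions are encoded, so treat the sum as a relative measure rather than an exact size.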
See:
http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contri...
Tom