WordDelimiterFilter Details
The following list gives the reasons why single letters and numbers are extremely frequent in the index, with examples of how WordDelimiterFilter breaks tokens into parts that are then searched as phrase queries:
1 Parsing of apostrophes
- can't => "can" "t"
- French: "l'art" => "l" "art" [i]
- 1900's => "1900" "s"
- O'Sullivan => "o" "sullivan"
2 Initials
- J.P. Morgan => "j" "p" "morgan"
3 Abbreviations
- I.B.M => "i" "b" "m"
4 Alphanumerics
- 7th => "7" "th"
- p2p => "p" "2" "p"
- p1314 => "p" "1314"
5 Punctuation in numbers
- 1,200 => "1" "200"
- 5.3 => "5" "3"
- 7:30 => "7" "30"
6 Hyphens
- 1950-56 => "1950" "56"
- Two-Party => "two" "party"
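The splits listed above can be approximated with a short Python sketch. It is not Lucene's implementation. It assumes that any non-alphanumeric character (apostrophe, period, comma, colon, hyphen) acts as a delimiter, that a change between letters and digits also splits the token, and that output is lowercased as in the O'Sullivan example. The names split_token and as_phrase_query are hypothetical, and other WordDelimiterFilter behavior, such as splitting on case changes, is left out.

```python
import re

# A part is a run of letters or a run of digits. Every other character is
# treated as a delimiter. Simplifying assumption, not Lucene's actual rules.
_PART = re.compile(r"[^\W\d_]+|\d+")


def split_token(token):
    """Return the lowercased letter runs and digit runs found in token."""
    return [part.lower() for part in _PART.findall(token)]


def as_phrase_query(token):
    """Render the parts as a quoted phrase, the way the split parts
    end up being searched together (hypothetical helper)."""
    return '"' + " ".join(split_token(token)) + '"'


if __name__ == "__main__":
    for example in ["can't", "l'art", "J.P. Morgan", "p2p", "1,200", "1950-56"]:
        print(example, "=>", split_token(example), "=>", as_phrase_query(example))
```

Running the script prints each example with its parts and the phrase they form, which makes it easy to see how many one-character parts ("t", "l", "j", "p", "1") come out of ordinary text.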
Endnotes
[i] There are some specific issues with tokenizing French. See:
- http://en.wikipedia.org/wiki/Elision_in_the_French_language
- http://mail-archives.apache.org/mod_mbox/lucene-java-user/200211.mbox/%3c5.1.0.14.0.20021121145130.00a97b70@mailbox.uottawa.ca%3e
- http://issues.apache.org/jira/browse/LUCENE-906
- http://blogs.msdn.com/correcteurorthographiqueoffice/archive/2005/12/07/500807.aspx
- http://en.wikipedia.org/wiki/Apostrophe_(mark)#Non-English_use
