WordDelimiterFilter Details

The following list gives the reasons why single letters and numbers are extremely frequent in the index, with examples of how WordDelimiterFilter breaks tokens into parts that are then searched as phrase queries:

1   Parsing of apostrophes

  1. can't  => "can" "t"
  2. French:  "l'art" =>  "l" "art" [i]
  3. 1900's => "1900"  "s"
  4. O'Sullivan => "o" "sullivan"

2   Initials

  1. J.P. Morgan => "j" "p" "morgan"

3   Abbreviations

  1. I.B.M => "i" "b" "m"

4   Alphanumerics

  1. 7th  =>  "7" "th"
  2. p2p => "p" "2" "p"
  3. p1314 => "p" "1314"

5   Punctuation in numbers

  1. 1,200 => "1" "200"
  2. 5.3    =>  "5" "3"
  3. 7:30 => "7" "30"

6   Hyphens

  1. 1950-56 => "1950" "56"
  2. Two-Party => "Two" "Party"
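
The splits listed above can be approximated with a short Python sketch. This is not Lucene's implementation and it ignores the filter's configurable options; it only assumes two rules, namely that a token is split at every non-alphanumeric character and at every letter/digit boundary. The function name split_like_word_delimiter is hypothetical, and the lowercasing step is an assumption made to mirror the lowercase output shown in most of the examples.

```python
import re

# Runs of letters OR runs of digits; everything else acts as a delimiter.
# [^\W\d_] matches any Unicode letter (a word character that is neither
# a digit nor an underscore).
_PART = re.compile(r"[^\W\d_]+|\d+")


def split_like_word_delimiter(token: str) -> list[str]:
    """Rough approximation of the splits listed above (not Lucene's code).

    Assumed rules: split on non-alphanumeric characters and on
    letter/digit transitions, then lowercase each part.
    """
    return [part.lower() for part in _PART.findall(token)]


if __name__ == "__main__":
    for example in ["can't", "l'art", "I.B.M", "p2p", "1,200", "1950-56"]:
        print(example, "=>", split_like_word_delimiter(example))
```

Each part produced this way becomes its own position in the token stream, so a query for "p2p" turns into a phrase query over the very common tokens "p", "2" and "p".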


Endnotes

[i] There are some specific issues with tokenizing French.  See: