Reply to comment

Yes, this may seem surprising, but there are a lot of words out there. Even so, billions of words from only 555,000 documents is surprising. For comparison, I recently indexed a ClueWeb09 SubsetB collection of 50 million mostly English pages and collected only around 200 million unique words that consisted solely of Latin letters and digits (i.e., I ignored words with non-ASCII characters). I suspect that n-gram sequences are the major contributor to the number of unique words in your case.
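
To illustrate how n-grams could inflate the count, here is a minimal Python sketch. It is not the indexing code used for either collection: the function names (`tokenize`, `count_unique_terms`), the lowercasing, and the toy documents are all assumptions made for the example. It keeps only words made of ASCII letters and digits, then counts distinct word n-grams up to a chosen length.

```python
import re

WORD_RE = re.compile(r"\w+")                # any run of word characters
ASCII_ALNUM_RE = re.compile(r"[A-Za-z0-9]+")  # Latin letters and digits only


def tokenize(text):
    """Return lowercased words, skipping any word with other characters."""
    return [
        w.lower()
        for w in WORD_RE.findall(text)
        if ASCII_ALNUM_RE.fullmatch(w)
    ]


def count_unique_terms(docs, max_n=1):
    """Count distinct word n-grams for n = 1..max_n across all docs."""
    seen = set()
    for doc in docs:
        tokens = tokenize(doc)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
    return len(seen)


if __name__ == "__main__":
    # Hypothetical toy documents, for illustration only.
    docs = ["the quick brown fox", "the lazy brown dog"]
    for max_n in (1, 2, 3):
        print(max_n, count_unique_terms(docs, max_n))
```

Even on two short documents, the number of distinct terms grows quickly once bigrams and trigrams are counted as terms, which is the effect I suspect in your index.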
