Ran into a data parsing error in MALLET, and the only recommendation I found was to blow away all non-ASCII characters, which... would've been a problem for this French dataset. Easy enough to instead keep only characters in the Latin Extended Unicode ranges, but how would a student know to do this? #MultilingualDH
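
For anyone wondering what that filter might look like, here is a minimal sketch in Python. It is not part of MALLET, only a preprocessing step you could run on the text before import. The function name `keep_latin` and the exact ranges (Basic Latin through Latin Extended-B, plus Latin Extended Additional) are my assumptions. Adjust them to your corpus.

```python
import re
import unicodedata

# Assumed ranges: Basic Latin through Latin Extended-B (U+0000-U+024F)
# plus Latin Extended Additional (U+1E00-U+1EFF). Everything else matches.
_OUTSIDE_LATIN = re.compile(r"[^\u0000-\u024F\u1E00-\u1EFF]")


def keep_latin(text: str, replacement: str = " ") -> str:
    """Replace characters outside the assumed Latin ranges (hypothetical helper)."""
    # NFC first, so "e" + combining acute becomes a single precomposed "é"
    # instead of losing its accent to the filter.
    text = unicodedata.normalize("NFC", text)
    return _OUTSIDE_LATIN.sub(replacement, text)


if __name__ == "__main__":
    print(keep_latin("Le cœur d'un été à Orléans 日本語", replacement=""))
```

One caveat: curly apostrophes and dashes (e.g. U+2019) sit in the General Punctuation block, outside these ranges, so they get replaced too. If your French text uses them, either add U+2000-U+206F to the pattern or convert them to straight quotes first.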