vansky/extended_penn_tokenizer

extended_penn_tokenizer

Fork of the Penn Treebank tokenizer

Original tokenizer written by Robert MacIntyre, University of Pennsylvania, late 1995
Original available at: http://www.cis.upenn.edu/~treebank/tokenizer.sed

Updated to:

  • fix 'comma in number' handling
  • fix open/close quote handling
  • generalize tokenization to documents with directional quotes
  • handle additional contractions
  • add an untokenizer that restores arbitrary tokenized documents to their original form
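
To make the items above concrete, here is a minimal Python sketch of the kind of behavior they describe: keeping commas inside numbers, telling opening quotes from closing quotes, accepting directional quotes, and splitting contractions. It is not the code in this repository (the original tokenizer is a sed script). The function name `sketch_tokenize` and every rule in it are assumptions made for illustration, and the rule set is far smaller than a real Treebank tokenizer's.

```python
import re


def sketch_tokenize(text):
    """Hypothetical, simplified Penn-Treebank-style tokenizer.

    Illustration only; this is NOT the repository's implementation.
    """
    # Directional quotes: map curly double quotes to Treebank `` and '',
    # and normalize the curly apostrophe so contractions are recognized.
    text = text.replace("\u201c", " `` ").replace("\u201d", " '' ")
    text = text.replace("\u2019", "'")

    # Straight double quotes: treat one at the start of the text or after
    # whitespace/an opening bracket as opening; any other one as closing.
    text = re.sub(r'(^|[\s(\[{])"', r'\1 `` ', text)
    text = text.replace('"', " '' ")

    # Commas: split them off unless they sit between two digits (1,000).
    text = re.sub(r"(?<!\d),|,(?!\d)", " , ", text)

    # Final period: split it off, allowing trailing quotes or brackets.
    text = re.sub(r"\.(?=[\s'\")\]}]*$)", " . ", text)

    # Contractions: can't -> ca n't, John's -> John 's, we'll -> we 'll.
    text = re.sub(r"(?i)(n't)\b", r" \1", text)
    text = re.sub(r"(?i)'(s|m|d|re|ve|ll)\b", r" '\1", text)

    return text.split()


if __name__ == "__main__":
    print(sketch_tokenize("She said, \u201cI can\u2019t pay 1,000 dollars.\u201d"))
```

In this sketch the comma rule uses lookarounds so that a comma is separated only when at least one neighbor is not a digit, which is one simple way to express the 'comma in number' case. Restoring the exact original text, as the untokenizer does, would need more information than this token list keeps (for example, which quote characters and spacing the input used).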
