English
Lexica
All lexica can be found here:
en-ambiguity-classes-simplified.xz
: ambiguity classes for part-of-speech tagging with simplified word forms.
en-ambiguity-classes-simplified-lowercase.xz
: ambiguity classes for part-of-speech tagging with simplified lowercase word forms.
en-brown-clusters-simplified-lowercase.xz
: Brown clusters with simplified lowercase word forms.
en-named-entity-gazetteers-simplified.xz
: gazetteers for named entity recognition with simplified word forms.
en-named-entity-gazetteers-simplified-lowercase.xz
: gazetteers for named entity recognition with simplified lowercase word forms.
en-stop-words-simplified-lowercase.xz
: stop words with simplified lowercase word forms.
en-word-embeddings-undigitalized.xz
: word embeddings with undigitalized word forms.
Models
All models can be found here:
en-pos.xz
: part-of-speech tagging.
en-ner.xz
: named entity recognition.
en-dep.xz
: dependency parsing.
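
The lexica and models listed above are all distributed as `.xz` archives. This page does not say what format the decompressed payload takes, so the sketch below only shows how such a file could be opened and inspected with Python's standard `lzma` module. The helper names `peek_xz` and `decompressed_size` are hypothetical and chosen for illustration, and the file name in the usage part is taken from the list above on the assumption that it has been downloaded to the working directory.

```python
import lzma
from pathlib import Path


def peek_xz(path, n=16):
    """Return the first n bytes of the decompressed payload."""
    with lzma.open(path, "rb") as stream:
        return stream.read(n)


def decompressed_size(path, chunk_size=1 << 20):
    """Count decompressed bytes by streaming, without loading the whole file."""
    total = 0
    with lzma.open(path, "rb") as stream:
        while chunk := stream.read(chunk_size):
            total += len(chunk)
    return total


if __name__ == "__main__":
    # Assumption: the archive was downloaded next to this script.
    archive = Path("en-stop-words-simplified-lowercase.xz")
    if archive.exists():
        print("leading bytes:", peek_xz(archive).hex())
        print("decompressed size:", decompressed_size(archive), "bytes")
    else:
        print(f"{archive} not found; download it first")
```

Looking at the leading bytes is a cheap way to tell whether the payload is plain text or a binary serialization before deciding how to load it.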
Models are trained on the following corpora.

| OntoNotes 5.0 | Sentences | Tokens | Names |
|---|---:|---:|---:|
| Broadcast conversations | 10,822 | 171,101 | 9,771 |
| Broadcast news | 10,344 | 206,029 | 19,670 |
| News magazines | 6,672 | 163,627 | 10,736 |
| Newswires | 34,438 | 875,800 | 77,496 |
| Religious texts | 21,418 | 296,432 | 0 |
| Telephone conversations | 8,963 | 85,444 | 2,021 |
| Web texts | 12,448 | 284,951 | 8,170 |
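
The per-genre OntoNotes 5.0 counts above can be totalled and turned into average sentence lengths with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table, while the dictionary name `ONTONOTES` and the two helper functions are hypothetical and not part of any released tool.

```python
# Per-genre OntoNotes 5.0 counts copied from the table above:
# genre -> (sentences, tokens, names)
ONTONOTES = {
    "Broadcast conversations": (10822, 171101, 9771),
    "Broadcast news": (10344, 206029, 19670),
    "News magazines": (6672, 163627, 10736),
    "Newswires": (34438, 875800, 77496),
    "Religious texts": (21418, 296432, 0),
    "Telephone conversations": (8963, 85444, 2021),
    "Web texts": (12448, 284951, 8170),
}


def totals(rows):
    """Column-wise sums over the (sentences, tokens, names) tuples."""
    return tuple(sum(column) for column in zip(*rows.values()))


def tokens_per_sentence(rows):
    """Average sentence length in tokens for each genre."""
    return {
        genre: tokens / sentences
        for genre, (sentences, tokens, _names) in rows.items()
    }


if __name__ == "__main__":
    sentences, tokens, names = totals(ONTONOTES)
    print(f"total: {sentences:,} sentences, {tokens:,} tokens, {names:,} names")
    # → total: 105,105 sentences, 2,083,384 tokens, 127,864 names
    for genre, average in tokens_per_sentence(ONTONOTES).items():
        print(f"{genre}: {average:.1f} tokens per sentence")
```

The same two helpers apply to the other corpora if their rows are entered in the same tuple layout, with the names column dropped or set to zero where the table has none.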

| English Web Treebank | Sentences | Tokens |
|---|---:|---:|
| Answers | 2,699 | 43,916 |
| Email | 2,983 | 44,168 |
| Newsgroup | 1,996 | 37,816 |
| Reviews | 2,915 | 44,337 |
| Weblog | 1,753 | 38,770 |

| MiPACQ | Sentences | Tokens |
|---|---:|---:|
| Clinical questions | 1,600 | 30,138 |
| Medpedia articles | 2,796 | 49,922 |
| Clinical notes | 8,383 | 113,164 |
| Pathological notes | 1,205 | 21,353 |

| SHARP | Sentences | Tokens |
|---|---:|---:|
| Seattle Group Health notes | 7,204 | 94,450 |
| Clinical notes | 6,807 | 93,914 |
| Stratified | 4,320 | 43,536 |
| Stratified SGH | 13,662 | 139,403 |

| THYME | Sentences | Tokens |
|---|---:|---:|
| Clinical / pathological notes | 26,661 | 387,943 |
| Brain cancer | 18,722 | 225,899 |