Decode
Command-Line
The following command runs the NLP pipeline for tokenization, part-of-speech tagging, morphological analysis, named entity recognition, and dependency parsing:
java edu.emory.mathcs.nlp.bin.NLPDecode -c <filename> -i <filepath> [-ie <string> -oe <string> -format <string> -threads <integer>]
-c <filename> : configuration filename (required)
-i <filepath> : input path (required)
-ie <string> : input file extension (default: *)
-oe <string> : output file extension (default: nlp)
-format <string> : format of the input data (raw|line|tsv; default: raw)
-threads <integer> : number of threads (default: 2)
- For command-line tools, replace java edu.emory.mathcs.nlp.bin.NLPDecode with bin/nlpdecode.
- -c specifies the configuration file (see configuration).
- -i specifies the input path, which points to either a file or a directory. When the path points to a file, only that file is processed. When it points to a directory, all files under that directory with the file extension -ie are processed.
- -ie specifies the input file extension. The default value * matches files with any extension. This option is used only when the input path -i points to a directory.
- -oe specifies the output file extension appended to each input filename. For each input file, a corresponding output file containing the NLP output is generated.
- -format specifies the format of the input file: raw, line, or tsv (see data format).
- -threads specifies the number of threads to be used. When multiple threads are used, each file is assigned to an individual thread.
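The interplay of -i, -ie, -oe, and -threads described above can be sketched in plain Java. This is only an illustration of the documented behavior, not the NLPDecode implementation: the class and method names (DecodeInputSketch, selectInputs, outputPath, processAll) are hypothetical, and the sketch assumes that a directory is scanned non-recursively, which the text does not state.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names come from NLP4J.
public class DecodeInputSketch {
    /** -i / -ie: a file is used as-is; a directory is filtered by extension ("*" matches any). */
    static List<Path> selectInputs(Path input, String inputExt) throws IOException {
        if (Files.isRegularFile(input)) return Collections.singletonList(input);
        // Assumption: only the top level of the directory is scanned.
        try (Stream<Path> files = Files.list(input)) {
            return files.filter(Files::isRegularFile)
                        .filter(p -> inputExt.equals("*")
                                  || p.getFileName().toString().endsWith("." + inputExt))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** -oe: the output extension is appended to the input filename. */
    static Path outputPath(Path input, String outputExt) {
        return input.resolveSibling(input.getFileName() + "." + outputExt);
    }

    /** -threads: each file is handed to the pool as one task, so a file is never split across threads. */
    static void processAll(List<Path> inputs, int threads, Consumer<Path> decodeOneFile)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Path p : inputs) pool.execute(() -> decodeOneFile.accept(p));
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    public static void main(String[] args) throws Exception {
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> inputs = selectInputs(input, "*");
        // Stand-in for decoding: print which output file each input would produce.
        processAll(inputs, 2, p -> System.out.println(p + " -> " + outputPath(p, "nlp")));
    }
}
```

Because the unit of work is a whole file, extra threads only help when the input path is a directory containing several files.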
Example
The following command takes nlp4j.txt and generates nlp4j.txt.nlp using config-decode-en.xml.
$ java -Xmx4g -XX:+UseConcMarkSweepGC edu.emory.mathcs.nlp.bin.NLPDecode -c config-decode-en.xml -i nlp4j.txt
Loading ambiguity classes
Loading word clusters
Loading word embeddings
Loading named entity gazetteers
Loading tokenizer
Loading part-of-speech tagger
Loading morphological analyzer
Loading named entity recognizer
Loading dependency parser
nlp4j.txt
- Use the -XX:+UseConcMarkSweepGC option for the JVM, which reduces memory usage by half.
- Use log4j.properties for the log4j configuration.
- The output file is generated in the tsv format (see data format).
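When the decoder is launched from another Java program rather than from a shell, the same invocation can be assembled as an argument list. The following is a minimal sketch using only the flags shown in this section; the class and method names (DecodeCommandSketch, buildCommand) are hypothetical, and actually starting the process assumes that NLP4J is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of NLP4J.
public class DecodeCommandSketch {
    /** Builds the argument list for the example command, adding an explicit -threads value. */
    static List<String> buildCommand(String config, String input, int threads) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx4g");
        // CMS was removed in JDK 14, so drop this flag on newer JVMs.
        cmd.add("-XX:+UseConcMarkSweepGC");
        cmd.add("edu.emory.mathcs.nlp.bin.NLPDecode");
        cmd.add("-c");
        cmd.add(config);
        cmd.add("-i");
        cmd.add(input);
        cmd.add("-threads");
        cmd.add(Integer.toString(threads));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("config-decode-en.xml", "nlp4j.txt", 2);
        System.out.println(String.join(" ", cmd));
        // To run it for real (requires NLP4J on the classpath):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```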
Configuration
Sample configuration files for decoding can be found here: config-decode-*.
<configuration>
    <tsv>
        <column index="0" field="form"/>
    </tsv>
    <lexica>
        <ambiguity_classes field="word_form_simplified_lowercase">en-ambiguity-classes-simplified-lowercase.xz</ambiguity_classes>
        <word_clusters field="word_form_simplified_lowercase">en-brown-clusters-simplified-lowercase.xz</word_clusters>
        <named_entity_gazetteers field="word_form_simplified">en-named-entity-gazetteers-simplified.xz</named_entity_gazetteers>
        <word_embeddings field="word_form_undigitalized">en-word-embeddings-undigitalized.xz</word_embeddings>
    </lexica>
    <models>
        <pos>en-pos.xz</pos>
        <ner>en-ner.xz</ner>
        <dep>en-dep.xz</dep>
    </models>
</configuration>
- <tsv>: see configuration#tsv. This does not need to be specified when raw or line is used.
- <lexica>: see configuration#lexica.
- <models> specifies the statistical model for each component (e.g., English models, NLPMode).
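To check which model file a configuration assigns to each component, the <models> section can be read with the JDK's built-in DOM parser. This is a minimal sketch, not how NLP4J itself loads its configuration; the class and method names (DecodeConfigSketch, readModels) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of NLP4J.
public class DecodeConfigSketch {
    /** Maps each child element of <models> (e.g. pos, ner, dep) to its model filename. */
    static Map<String, String> readModels(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> models = new LinkedHashMap<>();
        NodeList sections = doc.getElementsByTagName("models");
        if (sections.getLength() == 0) return models; // no <models> section present
        NodeList children = sections.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip whitespace text nodes between elements.
            if (child.getNodeType() == Node.ELEMENT_NODE)
                models.put(child.getNodeName(), child.getTextContent().trim());
        }
        return models;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration><models>"
                   + "<pos>en-pos.xz</pos><ner>en-ner.xz</ner><dep>en-dep.xz</dep>"
                   + "</models></configuration>";
        System.out.println(readModels(xml)); // {pos=en-pos.xz, ner=en-ner.xz, dep=en-dep.xz}
    }
}
```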