Decode

Command-Line

The following command runs the NLP pipeline for tokenization, part-of-speech tagging, morphological analysis, named entity recognition, and dependency parsing:

java edu.emory.mathcs.nlp.bin.NLPDecode -c <filename> -i <filepath> [-ie <string> -oe <string> -format <string> -threads <integer>]

-c       <filename> : configuration filename (required)
-i       <filepath> : input path (required)
-ie      <string>   : input file extension (default: *)
-oe      <string>   : output file extension (default: nlp)
-format  <string>   : format of the input data (raw|line|tsv; default: raw)
-threads <integer>  : number of threads (default: 2)
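
As a rough illustration of how the optional flags combine, the sketch below assembles a command that decodes a directory rather than a single file. The directory name `corpus`, the `txt` and `out` extension values, and the thread count are assumptions chosen for illustration; only the flag names come from the option list above.

```shell
CONFIG="config-decode-en.xml"   # configuration file passed to -c
INPUT_DIR="corpus"              # hypothetical input directory passed to -i

# Read only files with the (assumed) extension "txt", treat the input as
# "line" format, write outputs with the extension "out", and use 4 threads.
CMD="java edu.emory.mathcs.nlp.bin.NLPDecode -c $CONFIG -i $INPUT_DIR -ie txt -oe out -format line -threads 4"

# Print the assembled command for review before running it yourself.
echo "$CMD"
```

Printing the command first makes it easy to confirm the flag values before starting a long decoding run.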

Example

The following command takes nlp4j.txt and generates nlp4j.txt.nlp using config-decode-en.xml.

$ java -Xmx4g -XX:+UseConcMarkSweepGC edu.emory.mathcs.nlp.bin.NLPDecode -c config-decode-en.xml -i nlp4j.txt

Loading ambiguity classes
Loading word clusters
Loading word embeddings
Loading named entity gazetteers
Loading tokenizer
Loading part-of-speech tagger
Loading morphological analyzer
Loading named entity recognizer
Loading dependency parser

nlp4j.txt
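
A minimal sketch for checking the result of the run above, assuming the output file is written next to the input and named by appending the output extension (the `-oe` default, `nlp`) to the input filename, as the example description suggests:

```shell
INPUT="nlp4j.txt"          # file passed to -i
OE="nlp"                   # output extension (default value of -oe)
OUTPUT="${INPUT}.${OE}"    # expected output name: nlp4j.txt.nlp

# Report whether a non-empty output file exists at the expected location.
if [ -s "$OUTPUT" ]; then
    echo "decoded output written to $OUTPUT"
else
    echo "no output found at $OUTPUT" >&2
fi
```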

Configuration

Sample configuration files for decoding follow the naming pattern config-decode-*.

<configuration>
    <tsv>
        <column index="0" field="form"/>
    </tsv>

    <lexica>
        <ambiguity_classes field="word_form_simplified_lowercase">en-ambiguity-classes-simplified-lowercase.xz</ambiguity_classes>
        <word_clusters field="word_form_simplified_lowercase">en-brown-clusters-simplified-lowercase.xz</word_clusters>
        <named_entity_gazetteers field="word_form_simplified">en-named-entity-gazetteers-simplified.xz</named_entity_gazetteers>
        <word_embeddings field="word_form_undigitalized">en-word-embeddings-undigitalized.xz</word_embeddings>
    </lexica>

    <models>
        <pos>en-pos.xz</pos>
        <ner>en-ner.xz</ner>
        <dep>en-dep.xz</dep>
    </models>
</configuration>
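
To see at a glance which lexica and models a configuration refers to, a small shell helper can list every `.xz` name it contains. This is only a sketch: `list_resources` is a hypothetical helper, not part of the toolkit, and it assumes each resource name appears as element text on a single line, as in the sample above. It does not say whether those names resolve to files on disk or to resources bundled elsewhere, which this section does not state.

```shell
# Hypothetical helper: print every .xz resource named in a decode configuration.
list_resources() {
    # Match element text ending in ".xz" (the part between '>' and '<'),
    # then strip the surrounding angle brackets.
    grep -o '>[^<>]*\.xz<' "$1" | tr -d '<>'
}

CONFIG="config-decode-en.xml"   # configuration file to inspect

# Only run the listing when the configuration file is present.
if [ -f "$CONFIG" ]; then
    list_resources "$CONFIG"
fi
```

For the sample configuration above, this would print the four lexicon names followed by the three model names, one per line.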