NLP4J by emorynlp

Decode

Command-Line

The following command runs the NLP pipeline for tokenization, part-of-speech tagging, morphological analysis, named entity recognition, and dependency parsing:

java edu.emory.mathcs.nlp.bin.NLPDecode -c <filename> -i <filepath> [-ie <string> -oe <string> -format <string> -threads <integer>]

-c       <filename> : configuration filename (required)
-i       <filepath> : input path (required)
-ie      <string>   : input file extension (default: *)
-oe      <string>   : output file extension (default: nlp)
-format  <string>   : format of the input data (raw|line|tsv; default: raw)
-threads <integer>  : number of threads (default: 2)

For command-line tools, replace java edu.emory.mathcs.nlp.bin.NLPDecode with bin/nlpdecode.
-c specifies the configuration file (see configuration).
-i specifies the input path, which can point to either a file or a directory. When the path points to a file, only that file is processed. When the path points to a directory, all files under that directory with the extension given by -ie are processed.
-ie specifies the input file extension. The default value * implies files with any extension. This option is used only when the input path -i points to a directory.
-oe specifies the output file extension appended to each input filename. For each input file, an output file containing the NLP output is generated under that name (e.g., nlp4j.txt yields nlp4j.txt.nlp).
-format specifies the format of the input file: raw, line, or tsv (see data format).
-threads specifies the number of threads to be used. When multiple threads are used, each file is assigned to an individual thread.
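
The way -i, -ie, and -oe interact can be illustrated with a small stand-alone sketch. This is not NLP4J's own code: the class name DecodeInputs and its methods are hypothetical, and the sketch assumes that a directory is listed non-recursively and that an extension matches on the ".ext" suffix of the filename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: mirrors the documented -i / -ie / -oe behavior, not NLP4J's implementation.
final class DecodeInputs {
    // Returns the files that would be processed for a given -i path and -ie extension.
    static List<Path> resolveInputs(Path input, String inputExt) throws IOException {
        if (Files.isRegularFile(input)) {
            return List.of(input); // -ie is only used when -i points to a directory
        }
        // Assumption: only the files directly under the directory are considered.
        try (Stream<Path> files = Files.list(input)) {
            return files.filter(Files::isRegularFile)
                        .filter(p -> "*".equals(inputExt)
                                || p.getFileName().toString().endsWith("." + inputExt))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    // Appends the -oe extension to the input filename (nlp4j.txt -> nlp4j.txt.nlp).
    static Path outputFor(Path input, String outputExt) {
        return input.resolveSibling(input.getFileName() + "." + outputExt);
    }

    public static void main(String[] args) throws IOException {
        Path input = Path.of(args.length > 0 ? args[0] : ".");
        for (Path file : resolveInputs(input, "*")) {
            System.out.println(file + " -> " + outputFor(file, "nlp"));
        }
    }
}
```

Running the sketch against a directory prints each input file next to the output name that the default -oe value (nlp) would produce.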

Example

The following command takes nlp4j.txt and generates nlp4j.txt.nlp using config-decode-en.xml.

$ java -Xmx4g -XX:+UseConcMarkSweepGC edu.emory.mathcs.nlp.bin.NLPDecode -c config-decode-en.xml -i nlp4j.txt

Loading ambiguity classes
Loading word clusters
Loading word embeddings
Loading named entity gazetteers
Loading tokenizer
Loading part-of-speech tagger
Loading morphological analyzer
Loading named entity recognizer
Loading dependency parser

nlp4j.txt

Use the -XX:+UseConcMarkSweepGC JVM option, which reduces memory usage by about half.
Use log4j.properties for the log4j configuration.
The output file is generated in the tsv format (see data format).
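
When the decoder is launched from another Java program, the command line shown above can be assembled from the documented options. The following is a minimal sketch under stated assumptions: the class DecodeCommand is hypothetical, it uses only the flags listed in this section, and it assumes NLP4J is already on the classpath of the launched JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: builds the NLPDecode command line from the options documented above.
final class DecodeCommand {
    private static final Set<String> FORMATS = Set.of("raw", "line", "tsv");

    static List<String> build(String config, String input, String format, int threads) {
        if (!FORMATS.contains(format)) {
            throw new IllegalArgumentException("format must be raw, line, or tsv: " + format);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx4g");
        cmd.add("edu.emory.mathcs.nlp.bin.NLPDecode");
        cmd.add("-c");
        cmd.add(config);                      // configuration filename (required)
        cmd.add("-i");
        cmd.add(input);                       // input file or directory (required)
        cmd.add("-format");
        cmd.add(format);
        cmd.add("-threads");
        cmd.add(Integer.toString(threads));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("config-decode-en.xml", "nlp4j.txt", "raw", 2);
        // To actually launch the decoder, pass the list to new ProcessBuilder(cmd).
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping each flag and value as a separate list element avoids quoting problems when paths contain spaces.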

Configuration

Sample configuration files for decoding can be found here: config-decode-*.

<configuration>
    <tsv>
        <column index="0" field="form"/>
    </tsv>

    <lexica>
        <ambiguity_classes field="word_form_simplified_lowercase">en-ambiguity-classes-simplified-lowercase.xz</ambiguity_classes>
        <word_clusters field="word_form_simplified_lowercase">en-brown-clusters-simplified-lowercase.xz</word_clusters>
        <named_entity_gazetteers field="word_form_simplified">en-named-entity-gazetteers-simplified.xz</named_entity_gazetteers>
        <word_embeddings field="word_form_undigitalized">en-word-embeddings-undigitalized.xz</word_embeddings>
    </lexica>

    <models>
        <pos>en-pos.xz</pos>
        <ner>en-ner.xz</ner>
        <dep>en-dep.xz</dep>
    </models>
</configuration>

<tsv>: see configuration#tsv. This does not need to be specified when the raw or line format is used.
<lexica>: see configuration#lexica.
<models> specifies the statistical model for each component (e.g., english models, NLPMode).
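
As a quick sanity check of a configuration file, the entries under <models> can be listed with the JDK's built-in DOM parser. This is a sketch only: the class ConfigModels is hypothetical and it does not reflect how NLP4J itself reads the configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative only: lists component -> model filename from a decode configuration.
final class ConfigModels {
    static Map<String, String> readModels(String xml) {
        Map<String, String> models = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList sections = doc.getElementsByTagName("models");
            if (sections.getLength() == 0) {
                return models; // no <models> section present
            }
            NodeList children = sections.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes; keep only element children such as <pos>.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    models.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
        return models;
    }

    public static void main(String[] args) {
        String xml = "<configuration><models>"
                + "<pos>en-pos.xz</pos><ner>en-ner.xz</ner><dep>en-dep.xz</dep>"
                + "</models></configuration>";
        System.out.println(readModels(xml)); // prints {pos=en-pos.xz, ner=en-ner.xz, dep=en-dep.xz}
    }
}
```

A LinkedHashMap is used so that the components are reported in the order in which they appear in the file.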