Train

Command

The following command trains an NLP component:

java edu.emory.mathcs.nlp.bin.NLPTrain -mode <string> -c <filename> -t <filepath> [-d <filepath> -m <filename> -p <filename> -te <string> -de <string> -cv <int>]

-c  <filename> : configuration file (required)
-m  <filename> : output model file (optional)
-p  <filename> : previously trained model file (optional)
-t  <filepath> : training path (required)
-d  <filepath> : development path (optional)
-te   <string> : training file extension (default: *)
-de   <string> : development file extension (default: *)
-cv      <int> : # of cross-validation folds (default: 0)
-mode <string> : component mode (required: pos|ner|dep)
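
If -t and -d point to directories, -te and -de appear to select which files under those directories are read; for example, with a hypothetical directory layout:

$ java edu.emory.mathcs.nlp.bin.NLPTrain -mode pos -c config-train-sample.xml -t data/trn -te tsv -d data/dev -de tsv -m sample-pos.xz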

Example

The following command trains a dependency parsing model on sample-trn.tsv using the settings in config-train-sample.xml, evaluates each epoch on sample-dev.tsv, and saves the best model to sample-dep.xz. In the training log below, LAS and UAS are the labeled and unlabeled attachment scores on the development set; the epoch with the highest score (here, epoch 3) is the one saved as the best model.

$ java -Xmx1g -XX:+UseConcMarkSweepGC edu.emory.mathcs.nlp.bin.NLPTrain -mode dep -c config-train-sample.xml -t sample-trn.tsv -d sample-dev.tsv -m sample-dep.xz

AdaGrad Mini-batch
- Max epoch: 5
- Mini-batch: 1
- Learning rate: 0.02
- LOLS: fixed = 0, decaying rate = 0.95
- RDA: 1.0E-5
Training: 0
  0:    1: LAS = 22.22, UAS = 26.98, L =  34, SF =    1300, NZW =     1867, N/S =  15750
  0:    2: LAS = 34.92, UAS = 39.68, L =  34, SF =    1410, NZW =     4578, N/S =  18000
  0:    3: LAS = 38.89, UAS = 44.44, L =  34, SF =    1454, NZW =     6191, N/S =  21000
  0:    4: LAS = 37.30, UAS = 41.27, L =  34, SF =    1550, NZW =     7751, N/S =  42000
  0:    5: LAS = 37.30, UAS = 41.27, L =  34, SF =    1583, NZW =     8997, N/S =  63000
  0: Best: 38.89, epoch = 3
Saving the model
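
The same run can also be launched from Java by calling the tool's main method directly. This is a minimal sketch, assuming the NLPTrain class is accessible from your code; the file paths are the sample files above and should be adjusted to your setup.

import edu.emory.mathcs.nlp.bin.NLPTrain;

public class TrainSampleDEP
{
    public static void main(String[] args) throws Exception
    {
        // Equivalent to the command-line invocation above.
        NLPTrain.main(new String[]{
            "-mode", "dep",
            "-c"   , "config-train-sample.xml",
            "-t"   , "sample-trn.tsv",
            "-d"   , "sample-dev.tsv",
            "-m"   , "sample-dep.xz"});
    }
}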

Configuration

Sample configuration files for training are provided as config-train-*. A sample configuration looks like the following:

<configuration>
    <tsv>
        <column index="1" field="form"/>
        <column index="2" field="lemma"/>
        <column index="3" field="pos"/>
        <column index="4" field="feats"/>
        <column index="5" field="dhead"/>
        <column index="6" field="deprel"/>
        <column index="7" field="sheads"/>
        <column index="8" field="nament"/>
    </tsv>

    <lexica>
        <ambiguity_classes field="word_form_simplified_lowercase">en-ambiguity-classes-simplified-lowercase.xz</ambiguity_classes>
        <word_clusters field="word_form_simplified_lowercase">en-brown-clusters-simplified-lowercase.xz</word_clusters>
        <word_embeddings field="word_form_undigitalized">en-word-embeddings-undigitalized.xz</word_embeddings>
        <named_entity_gazetteers field="word_form_simplified">en-named-entity-gazetteers-simplified.xz</named_entity_gazetteers>
    </lexica>

    <optimizer>
        <algorithm>adagrad-mini-batch</algorithm>
        <l1_regularization>0.00001</l1_regularization>
        <learning_rate>0.02</learning_rate>
        <feature_cutoff>2</feature_cutoff>
        <lols fixed="0" decaying="0.95"/>
        <max_epochs>40</max_epochs>
        <batch_size>5</batch_size>
        <bias>0</bias>
    </optimizer>

    <feature_template>
        <feature f0="i:word_form"/>
        <feature f0="i+1:lemma"/>
        <feature f0="i-1:part_of_speech_tag"/>
        <feature f0="i_lmd:part_of_speech_tag"/>
        <feature f0="i-1:lemma" f1="i:lemma" f2="i+1:lemma"/>
    </feature_template>
</configuration>
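
The <tsv> block maps the columns of the tab-separated training and development files to fields; since the field indices above start at 1, column 0 presumably holds the token ID. A hypothetical two-token sentence in this layout might look as follows (the annotation values are illustrative only; fields are tab-separated in the real files but shown space-aligned here):

1    John     john    NNP    _    2    nsubj    _    U-PERSON
2    slept    sleep   VBD    _    0    root     _    O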
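
The <optimizer> block selects AdaGrad with mini-batches, an L1 (RDA) regularization term, a feature count cutoff, and LOLS roll-in scheduling, matching the parameters echoed at the start of the training log. As a rough illustration of the AdaGrad idea only (not the library's implementation), each weight's step size shrinks with the square root of its accumulated squared gradients:

// Illustrative AdaGrad update for a single weight vector (not NLP4J's code).
public final class AdaGradSketch
{
    private final double[] weights, sumSquaredGradients;
    private final double learningRate;   // e.g., 0.02 as in the configuration above

    public AdaGradSketch(int dimension, double learningRate)
    {
        this.weights             = new double[dimension];
        this.sumSquaredGradients = new double[dimension];
        this.learningRate        = learningRate;
    }

    /** Applies one AdaGrad step given the gradient of the loss w.r.t. each weight. */
    public void update(double[] gradient)
    {
        final double epsilon = 1e-8;   // avoids division by zero

        for (int i = 0; i < weights.length; i++)
        {
            sumSquaredGradients[i] += gradient[i] * gradient[i];
            weights[i] -= learningRate * gradient[i] / (Math.sqrt(sumSquaredGradients[i]) + epsilon);
        }
    }
}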