# Retraining

By default, OpenNMT saves a checkpoint every 5000 iterations and at the end of each epoch. To save checkpoints more or less frequently, use the -save_every and -save_every_epochs options, which define the number of iterations and epochs between saves.
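For example, the following commands adjust the checkpoint frequency (a minimal sketch reusing the demo dataset from the quickstart; the values are arbitrary):

```bash
# save an intermediate checkpoint every 1000 iterations
th train.lua -data data/demo-train.t7 -save_model demo -save_every 1000

# only save a checkpoint at the end of every second epoch
th train.lua -data data/demo-train.t7 -save_model demo -save_every_epochs 2
```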

There are several reasons one may want to train from a saved model with the -train_from option:

* continuing a stopped training
* continuing the training with a smaller batch size
* training a model on new data (incremental adaptation)
* starting a training from pre-trained parameters
* etc.

## Considerations

When training from an existing model, some settings cannot be changed:

* the model topology (layers, hidden size, etc.)
* the vocabularies

Exceptions

The -dropout, -fix_word_vecs_enc, and -fix_word_vecs_dec model options can be changed when retraining.
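For example, a retrained model can use a different dropout rate (a sketch; the checkpoint name follows the demo naming used below and the rate is arbitrary):

```bash
# retrain from a checkpoint with a lower dropout rate
th train.lua -data data/demo-train.t7 -save_model demo -train_from demo_checkpoint.t7 -dropout 0.2
```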

## Resuming a stopped training

It is common for a training to stop: a crash, a server reboot, a user action, etc. In this case, you may want to continue the training for more epochs by using the -continue flag. For example:

```bash
# start the initial training
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 50

# train for several epochs...

# need to reboot the server!

# continue the training from the last checkpoint
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -save_every 50 -train_from demo_checkpoint.t7 -continue
```

The -continue flag ensures that the training continues with the same configuration and optimization states. In particular, the following options are set to their last known value:

* -curriculum
* -decay
* -learning_rate_decay
* -learning_rate
* -max_grad_norm
* -min_learning_rate
* -optim
* -start_decay_at
* -start_decay_ppl_delta
* -start_epoch
* -start_iteration

Note

The -end_epoch value is not automatically set, as the user may want to continue the training for more epochs past the previous end.
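For example, to resume with the saved configuration while extending the training (a sketch; the target epoch is arbitrary):

```bash
# resume with the saved configuration, but train up to epoch 20
th train.lua -gpuid 1 -data data/demo-train.t7 -save_model demo -train_from demo_checkpoint.t7 -continue -end_epoch 20
```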

Additionally, the -continue flag retrieves from the previous training:

* the non-SGD optimizers states
* the random generator states
* the batch order (when continuing from an intermediate checkpoint)

## Training from pre-trained parameters

Another use case is to take a base model and train it further with new training options (in particular, the optimization method and the learning rate). Using -train_from without -continue will start a new training with parameters initialized from a pre-trained model.
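For example, a base model can be fine-tuned on new data with a different optimizer (a sketch; data/new-train.t7 is a hypothetical dataset that must be preprocessed with the original vocabularies, and the Adam learning rate is arbitrary):

```bash
# start a new training from pre-trained parameters with a new optimizer and learning rate
th train.lua -gpuid 1 -data data/new-train.t7 -save_model demo-adapted -train_from demo_checkpoint.t7 -optim adam -learning_rate 0.0002
```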