tools/learn_bpe.lua

learn_bpe.lua options:

  • -h [] (default: false)
    This help.
  • -md [] (default: false)
    Dump help in Markdown format.
  • -config (default: '')
    Load options from this file.
  • -save_config (default: '')
    Save options to this file.

BPE options

  • -size (default: 30000)
    The number of merge operations to learn.
  • -bpe_mode (accepted: suffix, prefix, both, none; default: suffix)
    Define the BPE mode. prefix: add a marker to the beginning of each word to learn prefix-oriented pair statistics; suffix: add a marker to the end of each word to learn suffix-oriented pair statistics, as in the original Python script; both: use both the prefix and the suffix marker; none: use neither.
  • -bpe_EOT_marker (default: )
    Marker used to mark the End of Token while applying BPE in mode 'prefix' or 'both'.
  • -bpe_BOT_marker (default: )
    Marker used to mark the Beginning of Token while applying BPE in mode 'suffix' or 'both'.
  • -save_bpe (required)
    Path to save the output model.
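
As a rough illustration of what -size and the suffix mode control, the following Python sketch learns a fixed number of merge operations from a word list. It is not the tool's implementation. The function name `learn_bpe`, the `eot_marker` parameter, and the marker string `"</w>"` are placeholders chosen for this example, and the marker is kept as a separate trailing symbol for simplicity.

```python
from collections import Counter


def learn_bpe(words, num_merges, eot_marker="</w>"):
    """Learn up to num_merges merge operations (illustrative sketch only).

    eot_marker is a placeholder end-of-token marker, not necessarily the
    value the tool uses by default.
    """
    # Suffix mode: every word becomes a tuple of characters plus the marker.
    vocab = Counter(tuple(word) + (eot_marker,) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = ["low", "low", "lower", "newest", "newest", "newest"]
merges = learn_bpe(corpus, num_merges=3)
```

With this toy corpus the first learned merge is the pair ("w", "e"), because it occurs four times across "lower" and "newest". In this sketch `num_merges` plays the role of -size.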

Tokenizer options

  • -tok_mode (accepted: conservative, aggressive, space; default: space)
    Define how aggressive the tokenization should be. space splits on spaces only.
  • -tok_joiner_annotate [] (default: false)
    Include joiner annotation using the -tok_joiner character.
  • -tok_joiner (default: )
    Character used to annotate joiners.
  • -tok_joiner_new [] (default: false)
    In -tok_joiner_annotate mode, the -tok_joiner character is an independent token.
  • -tok_case_feature [] (default: false)
    Generate case feature.
  • -tok_segment_case [] (default: false)
    Segment on case changes: splits AbC into Ab C so that case can be restored.
  • -tok_segment_alphabet (accepted: Tagalog, Hanunoo, Limbu, Yi, Hebrew, Latin, Devanagari, Thaana, Lao, Sinhala, Georgian, Kannada, Cherokee, Kanbun, Buhid, Malayalam, Han, Thai, Katakana, Telugu, Greek, Myanmar, Armenian, Hangul, Cyrillic, Ethiopic, Tagbanwa, Gurmukhi, Ogham, Khmer, Arabic, Oriya, Hiragana, Mongolian, Kangxi, Syriac, Gujarati, Braille, Bengali, Tamil, Bopomofo, Tibetan)
    Segment all letters from the indicated alphabet.
  • -tok_segment_numbers [] (default: false)
    Segment numbers into single digits.
  • -tok_segment_alphabet_change [] (default: false)
    Segment when the alphabet changes between two letters.
  • -tok_normalize_cmd (default: '')
    Command for on-the-fly corpus normalization. It should work in 'pipeline' mode.
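
The segmentation flags above can be pictured with a small Python sketch. It only approximates the described behaviour and is not the tokenizer's code. The function names are hypothetical, a plain space stands in for whatever separator or joiner the tool emits, the case rule is assumed to split at an ASCII lowercase-to-uppercase transition, and the alphabet of a letter is approximated by the first word of its Unicode character name.

```python
import re
import unicodedata


def segment_numbers(token):
    """Split runs of digits into single digits (cf. -tok_segment_numbers)."""
    return re.sub(r"(\d)(?=\d)", r"\1 ", token)


def segment_case(token):
    """Split where a lowercase letter is followed by an uppercase one
    (cf. -tok_segment_case). ASCII-only assumption for this sketch."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", token)


def script_of(ch):
    """Approximate the alphabet of a letter by the first word of its
    Unicode name, e.g. 'LATIN' or 'CYRILLIC'. Non-letters return None."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "").split(" ")[0]


def segment_alphabet_change(token):
    """Split between two adjacent letters from different alphabets
    (cf. -tok_segment_alphabet_change)."""
    out = []
    prev = None
    for ch in token:
        cur = script_of(ch)
        if out and prev and cur and cur != prev:
            out.append(" ")
        out.append(ch)
        prev = cur
    return "".join(out)


numbers = segment_numbers("2024")
cased = segment_case("AbC")
mixed = segment_alphabet_change("abc\u0434\u0435\u0444")
```

Here `numbers` is "2 0 2 4" and `cased` is "Ab C", matching the AbC example in the option description. `mixed` gets a space between the Latin letters and the Cyrillic letters.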

Logger options

  • -log_file (default: '')
    Output logs to a file under this path instead of stdout. If the file name ends with json, output structured JSON.
  • -disable_logs [] (default: false)
    If set, output nothing.
  • -log_level (accepted: DEBUG, INFO, WARNING, ERROR, NOERROR; default: INFO)
    Output logs at this level and above.
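
One way to read "this level and above" is as a threshold over the accepted values in the order they are listed. The Python sketch below assumes that ordering, with NOERROR treated as a threshold above ERROR. The names `LEVELS` and `should_log` are hypothetical and do not come from the tool.

```python
# Accepted -log_level values, assumed to be ordered from most to least verbose.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "NOERROR"]


def should_log(message_level, log_level="INFO"):
    """Return True if a message at message_level passes the log_level threshold."""
    return LEVELS.index(message_level) >= LEVELS.index(log_level)


shown = [lvl for lvl in ("DEBUG", "INFO", "WARNING", "ERROR") if should_log(lvl)]
```

Under this assumption the default INFO threshold hides DEBUG messages and keeps INFO, WARNING, and ERROR, while a NOERROR threshold would hide all four.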