Interface SentencepieceModel.TrainerSpecOrBuilder

All Superinterfaces:
com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>, com.google.protobuf.MessageLiteOrBuilder, com.google.protobuf.MessageOrBuilder
All Known Implementing Classes:
SentencepieceModel.TrainerSpec, SentencepieceModel.TrainerSpec.Builder
Enclosing class:
SentencepieceModel

public static interface SentencepieceModel.TrainerSpecOrBuilder extends com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>
  • Method Summary

    Modifier and Type
    Method
    Description
    getAcceptLanguage(int index)
    List of the languages this model can accept.
    com.google.protobuf.ByteString
    List of the languages this model can accept.
    int
    List of the languages this model can accept.
    List of the languages this model can accept.
    boolean
    Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
    int
    <s>
    optional string bos_piece = 46 [default = "<s>"];
    com.google.protobuf.ByteString
    optional string bos_piece = 46 [default = "<s>"];
    boolean
    Decomposes unknown pieces into UTF-8 bytes.
    float
    ///////////////////////////////////////////////////////////////// Training parameters.
    getControlSymbols(int index)
    ///////////////////////////////////////////////////////////////// Vocabulary management Defines control symbols used as an indicator to change the behavior of the decoder.
    com.google.protobuf.ByteString
    ///////////////////////////////////////////////////////////////// Vocabulary management Defines control symbols used as an indicator to change the behavior of the decoder.
    int
    ///////////////////////////////////////////////////////////////// Vocabulary management Defines control symbols used as an indicator to change the behavior of the decoder.
    ///////////////////////////////////////////////////////////////// Vocabulary management Defines control symbols used as an indicator to change the behavior of the decoder.
    long
    Clipping threshold to apply after adding noise.
    float
    Set these parameters if you need DP version of sentencepiece.
    boolean
    Whether to use DP version of sentencepiece.
    int
    </s>
    optional string eos_piece = 47 [default = "</s>"];
    com.google.protobuf.ByteString
    optional string eos_piece = 47 [default = "</s>"];
    boolean
    `vocab_size` is treated as hard limit.
    getInput(int index)
    ///////////////////////////////////////////////////////////////// General parameters Input corpus files.
    com.google.protobuf.ByteString
    getInputBytes(int index)
    ///////////////////////////////////////////////////////////////// General parameters Input corpus files.
    int
    ///////////////////////////////////////////////////////////////// General parameters Input corpus files.
    Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
    com.google.protobuf.ByteString
    Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
    ///////////////////////////////////////////////////////////////// General parameters Input corpus files.
    long
    Maximum size of sentences the trainer loads from `input` parameter.
    int
    The maximum sentence length in byte.
    int
    ///////////////////////////////////////////////////////////////// SentencePiece parameters which control the shapes of sentence piece.
    int
    Deprecated.
    com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated.
    Output model file prefix.
    com.google.protobuf.ByteString
    Output model file prefix.
    optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
    int
    Number of EM sub iterations.
    int
    Number of threads in the training.
    int
    <pad> (padding)
    optional string pad_piece = 48 [default = "<pad>"];
    com.google.protobuf.ByteString
    optional string pad_piece = 48 [default = "<pad>"];
    Defines the pre-tokenization delimiter.
    com.google.protobuf.ByteString
    Defines the pre-tokenization delimiter.
    Defines required characters.
    com.google.protobuf.ByteString
    Defines required characters.
    Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
    com.google.protobuf.ByteString
    Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
    int
    The size of seed sentencepieces.
    int
    Size of self-test samples, which are encoded in the model file.
    float
    In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
    boolean
    optional bool shuffle_input_sentence = 19 [default = true];
    boolean
    When `split_by_number` is true, put a boundary between number and non-number transition.
    boolean
    Uses Unicode script to split sentence pieces.
    boolean
    Use a white space to split sentence pieces.
    boolean
    Split all digits (0-9) into separate pieces.
    boolean
    Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
    int
    Deprecated.
    com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated.
    boolean
    Adds whitespace symbol (_) as a suffix instead of prefix.
    int
    ///////////////////////////////////////////////////////////////// Reserved special meta tokens.
    optional string unk_piece = 45 [default = "<unk>"];
    com.google.protobuf.ByteString
    optional string unk_piece = 45 [default = "<unk>"];
    Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
    com.google.protobuf.ByteString
    Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
    boolean
    use all symbols for vocab extraction.
    Defines user defined symbols.
    com.google.protobuf.ByteString
    Defines user defined symbols.
    int
    Defines user defined symbols.
    Defines user defined symbols.
    int
    Vocabulary size.
    boolean
    When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
    boolean
    Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
    boolean
    <s>
    boolean
    optional string bos_piece = 46 [default = "<s>"];
    boolean
    Decomposes unknown pieces into UTF-8 bytes.
    boolean
    ///////////////////////////////////////////////////////////////// Training parameters.
    boolean
    Clipping threshold to apply after adding noise.
    boolean
    Set these parameters if you need DP version of sentencepiece.
    boolean
    Whether to use DP version of sentencepiece.
    boolean
    </s>
    boolean
    optional string eos_piece = 47 [default = "</s>"];
    boolean
    `vocab_size` is treated as hard limit.
    boolean
    Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
    boolean
    Maximum size of sentences the trainer loads from `input` parameter.
    boolean
    The maximum sentence length in byte.
    boolean
    ///////////////////////////////////////////////////////////////// SentencePiece parameters which control the shapes of sentence piece.
    boolean
    Deprecated.
    com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated.
    boolean
    Output model file prefix.
    boolean
    optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
    boolean
    Number of EM sub iterations.
    boolean
    Number of threads in the training.
    boolean
    <pad> (padding)
    boolean
    optional string pad_piece = 48 [default = "<pad>"];
    boolean
    Defines the pre-tokenization delimiter.
    boolean
    Defines required characters.
    boolean
    Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
    boolean
    The size of seed sentencepieces.
    boolean
    Size of self-test samples, which are encoded in the model file.
    boolean
    In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
    boolean
    optional bool shuffle_input_sentence = 19 [default = true];
    boolean
    When `split_by_number` is true, put a boundary between number and non-number transition.
    boolean
    Uses Unicode script to split sentence pieces.
    boolean
    Use a white space to split sentence pieces.
    boolean
    Split all digits (0-9) into separate pieces.
    boolean
    Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
    boolean
    Deprecated.
    com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated.
    boolean
    Adds whitespace symbol (_) as a suffix instead of prefix.
    boolean
    ///////////////////////////////////////////////////////////////// Reserved special meta tokens.
    boolean
    optional string unk_piece = 45 [default = "<unk>"];
    boolean
    Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
    boolean
    use all symbols for vocab extraction.
    boolean
    Vocabulary size.
    boolean
    When creating the vocabulary file, defines whether or not to additionally output the score for each piece.

    Methods inherited from interface com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder

    getDefaultInstanceForType, getExtension, getExtension, getExtension, getExtension, getExtension, getExtension, getExtensionCount, getExtensionCount, getExtensionCount, hasExtension, hasExtension, hasExtension

    Methods inherited from interface com.google.protobuf.MessageLiteOrBuilder

    isInitialized

    Methods inherited from interface com.google.protobuf.MessageOrBuilder

    findInitializationErrors, getAllFields, getDescriptorForType, getField, getInitializationErrorString, getOneofFieldDescriptor, getRepeatedField, getRepeatedFieldCount, getUnknownFields, hasField, hasOneof
  • Method Details

    • getInputList

      List<String> getInputList()
      /////////////////////////////////////////////////////////////////
       General parameters
      
       Input corpus files.
        Trainer accepts the following two formats:
        A) Monolingual: plain text, one sentence per line.
        B) Bilingual:   TSV, source sentence <tab> target sentence
        When bilingual data is passed, a shared vocabulary model is built.
        Note that the input file must be a raw corpus, not a preprocessed corpus.
        Trainer only loads the first `input_sentence_size` sentences specified
        with this parameter.
       
      repeated string input = 1;
      Returns:
      A list containing the input.
    • getInputCount

      int getInputCount()
      /////////////////////////////////////////////////////////////////
       General parameters
      
       Input corpus files.
        Trainer accepts the following two formats:
        A) Monolingual: plain text, one sentence per line.
        B) Bilingual:   TSV, source sentence <tab> target sentence
        When bilingual data is passed, a shared vocabulary model is built.
        Note that the input file must be a raw corpus, not a preprocessed corpus.
        Trainer only loads the first `input_sentence_size` sentences specified
        with this parameter.
       
      repeated string input = 1;
      Returns:
      The count of input.
    • getInput

      String getInput(int index)
      /////////////////////////////////////////////////////////////////
       General parameters
      
       Input corpus files.
        Trainer accepts the following two formats:
        A) Monolingual: plain text, one sentence per line.
        B) Bilingual:   TSV, source sentence <tab> target sentence
        When bilingual data is passed, a shared vocabulary model is built.
        Note that the input file must be a raw corpus, not a preprocessed corpus.
        Trainer only loads the first `input_sentence_size` sentences specified
        with this parameter.
       
      repeated string input = 1;
      Parameters:
      index - The index of the element to return.
      Returns:
      The input at the given index.
    • getInputBytes

      com.google.protobuf.ByteString getInputBytes(int index)
      /////////////////////////////////////////////////////////////////
       General parameters
      
       Input corpus files.
        Trainer accepts the following two formats:
        A) Monolingual: plain text, one sentence per line.
        B) Bilingual:   TSV, source sentence <tab> target sentence
        When bilingual data is passed, a shared vocabulary model is built.
        Note that the input file must be a raw corpus, not a preprocessed corpus.
        Trainer only loads the first `input_sentence_size` sentences specified
        with this parameter.
       
      repeated string input = 1;
      Parameters:
      index - The index of the value to return.
      Returns:
      The bytes of the input at the given index.
    • hasInputFormat

      boolean hasInputFormat()
       Input corpus format:
       "text": one-sentence-per-line text format (default)
       "tsv":  sentence <tab> freq
       
      optional string input_format = 7;
      Returns:
      Whether the inputFormat field is set.
    • getInputFormat

      String getInputFormat()
       Input corpus format:
       "text": one-sentence-per-line text format (default)
       "tsv":  sentence <tab> freq
       
      optional string input_format = 7;
      Returns:
      The inputFormat.
    • getInputFormatBytes

      com.google.protobuf.ByteString getInputFormatBytes()
       Input corpus format:
       "text": one-sentence-per-line text format (default)
       "tsv":  sentence <tab> freq
       
      optional string input_format = 7;
      Returns:
      The bytes for inputFormat.
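
The "tsv" input format described above pairs each sentence with a frequency. A minimal sketch of reading one such line follows. `TsvCorpusLine` is a hypothetical helper, not part of this API, and it assumes the frequency is an integer in the last tab-separated column.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the SentencePiece API. */
final class TsvCorpusLine {
    private TsvCorpusLine() {}

    /** Parses one "sentence<tab>freq" line into a (sentence, frequency) pair. */
    static Map.Entry<String, Long> parse(String line) {
        // Assumption: the frequency is the text after the last tab.
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected sentence<tab>freq: " + line);
        }
        String sentence = line.substring(0, tab);
        long freq = Long.parseLong(line.substring(tab + 1).trim());
        return new SimpleEntry<>(sentence, freq);
    }
}
```

For example, `TsvCorpusLine.parse("hello world\t42")` yields the sentence `hello world` with frequency 42.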
    • hasModelPrefix

      boolean hasModelPrefix()
       Output model file prefix.
       <model_prefix>.model and <model_prefix>.vocab are generated.
       
      optional string model_prefix = 2;
      Returns:
      Whether the modelPrefix field is set.
    • getModelPrefix

      String getModelPrefix()
       Output model file prefix.
       <model_prefix>.model and <model_prefix>.vocab are generated.
       
      optional string model_prefix = 2;
      Returns:
      The modelPrefix.
    • getModelPrefixBytes

      com.google.protobuf.ByteString getModelPrefixBytes()
       Output model file prefix.
       <model_prefix>.model and <model_prefix>.vocab are generated.
       
      optional string model_prefix = 2;
      Returns:
      The bytes for modelPrefix.
    • hasModelType

      boolean hasModelType()
      optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
      Returns:
      Whether the modelType field is set.
    • getModelType

      SentencepieceModel.TrainerSpec.ModelType getModelType()
      optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
      Returns:
      The modelType.
    • hasVocabSize

      boolean hasVocabSize()
       Vocabulary size. 8k is the default size.
       
      optional int32 vocab_size = 4 [default = 8000];
      Returns:
      Whether the vocabSize field is set.
    • getVocabSize

      int getVocabSize()
       Vocabulary size. 8k is the default size.
       
      optional int32 vocab_size = 4 [default = 8000];
      Returns:
      The vocabSize.
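
For optional fields with a declared default, such as `vocab_size`, the `has` accessor reports whether a value was explicitly set, while the `get` accessor falls back to the default (here 8000) when it was not. The sketch below imitates that contract with a hypothetical stand-in class. It is not the generated code and only illustrates how the two accessors relate.

```java
/** Hypothetical stand-in mimicking has/get for one optional int field; not generated code. */
final class OptionalIntField {
    private final int defaultValue;
    private boolean set;
    private int value;

    OptionalIntField(int defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Records an explicit value, as a builder setter would. */
    void set(int newValue) {
        value = newValue;
        set = true;
    }

    /** Mirrors hasVocabSize(): true only after an explicit set. */
    boolean has() {
        return set;
    }

    /** Mirrors getVocabSize(): the explicit value, or the default when unset. */
    int get() {
        return set ? value : defaultValue;
    }
}
```

Callers that need to distinguish "left at the default" from "explicitly set to 8000" should check the `has` accessor rather than compare the value.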
    • getAcceptLanguageList

      List<String> getAcceptLanguageList()
       List of the languages this model can accept.
       Since the model is language-agnostic, this field is used as a reference.
       
      repeated string accept_language = 5;
      Returns:
      A list containing the acceptLanguage.
    • getAcceptLanguageCount

      int getAcceptLanguageCount()
       List of the languages this model can accept.
       Since the model is language-agnostic, this field is used as a reference.
       
      repeated string accept_language = 5;
      Returns:
      The count of acceptLanguage.
    • getAcceptLanguage

      String getAcceptLanguage(int index)
       List of the languages this model can accept.
       Since the model is language-agnostic, this field is used as a reference.
       
      repeated string accept_language = 5;
      Parameters:
      index - The index of the element to return.
      Returns:
      The acceptLanguage at the given index.
    • getAcceptLanguageBytes

      com.google.protobuf.ByteString getAcceptLanguageBytes(int index)
       List of the languages this model can accept.
       Since the model is language-agnostic, this field is used as a reference.
       
      repeated string accept_language = 5;
      Parameters:
      index - The index of the value to return.
      Returns:
      The bytes of the acceptLanguage at the given index.
    • hasSelfTestSampleSize

      boolean hasSelfTestSampleSize()
       Size of self-test samples, which are encoded in the model file.
       
      optional int32 self_test_sample_size = 6 [default = 0];
      Returns:
      Whether the selfTestSampleSize field is set.
    • getSelfTestSampleSize

      int getSelfTestSampleSize()
       Size of self-test samples, which are encoded in the model file.
       
      optional int32 self_test_sample_size = 6 [default = 0];
      Returns:
      The selfTestSampleSize.
    • hasEnableDifferentialPrivacy

      boolean hasEnableDifferentialPrivacy()
       Whether to use DP version of sentencepiece. Use it with TSV input format
       (requires precomputed word tab counts to work).
       
      optional bool enable_differential_privacy = 50 [default = false];
      Returns:
      Whether the enableDifferentialPrivacy field is set.
    • getEnableDifferentialPrivacy

      boolean getEnableDifferentialPrivacy()
       Whether to use DP version of sentencepiece. Use it with TSV input format
       (requires precomputed word tab counts to work).
       
      optional bool enable_differential_privacy = 50 [default = false];
      Returns:
      The enableDifferentialPrivacy.
    • hasDifferentialPrivacyNoiseLevel

      boolean hasDifferentialPrivacyNoiseLevel()
       Set these parameters if you need DP version of sentencepiece.
       Standard deviation of the noise to add.
       
      optional float differential_privacy_noise_level = 51 [default = 0];
      Returns:
      Whether the differentialPrivacyNoiseLevel field is set.
    • getDifferentialPrivacyNoiseLevel

      float getDifferentialPrivacyNoiseLevel()
       Set these parameters if you need DP version of sentencepiece.
       Standard deviation of the noise to add.
       
      optional float differential_privacy_noise_level = 51 [default = 0];
      Returns:
      The differentialPrivacyNoiseLevel.
    • hasDifferentialPrivacyClippingThreshold

      boolean hasDifferentialPrivacyClippingThreshold()
       Clipping threshold to apply after adding noise. All the words with
       frequency less than this value are dropped.
       
      optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];
      Returns:
      Whether the differentialPrivacyClippingThreshold field is set.
    • getDifferentialPrivacyClippingThreshold

      long getDifferentialPrivacyClippingThreshold()
       Clipping threshold to apply after adding noise. All the words with
       frequency less than this value are dropped.
       
      optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];
      Returns:
      The differentialPrivacyClippingThreshold.
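
The interplay of the noise level and the clipping threshold can be sketched as follows. This is an illustration under stated assumptions, not the trainer's implementation: it assumes zero-mean Gaussian noise with the given standard deviation, and `DpCountSketch` and its method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch of "add noise, then clip"; the real trainer may differ. */
final class DpCountSketch {
    private DpCountSketch() {}

    static Map<String, Double> noiseAndClip(
            Map<String, Long> counts, double noiseStd, long clippingThreshold, Random rng) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            // Assumption: Gaussian noise scaled by the configured standard deviation.
            double noisy = e.getValue() + noiseStd * rng.nextGaussian();
            // Words whose noisy frequency falls below the threshold are dropped.
            if (noisy >= clippingThreshold) {
                kept.put(e.getKey(), noisy);
            }
        }
        return kept;
    }
}
```

With a noise level of 0 the function reduces to plain frequency clipping, which makes the threshold's effect easy to check in isolation.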
    • hasCharacterCoverage

      boolean hasCharacterCoverage()
      /////////////////////////////////////////////////////////////////
       Training parameters.
      
       Uses characters which cover the corpus with the ratio of `chars_coverage`.
       This parameter determines the basic alphabet of the sentence pieces.
       The remaining fraction of characters (1.0 - `chars_coverage`) is treated as UNK.
       See also required_chars field.
       
      optional float character_coverage = 10 [default = 0.9995];
      Returns:
      Whether the characterCoverage field is set.
    • getCharacterCoverage

      float getCharacterCoverage()
      /////////////////////////////////////////////////////////////////
       Training parameters.
      
       Uses characters which cover the corpus with the ratio of `chars_coverage`.
       This parameter determines the basic alphabet of the sentence pieces.
       The remaining fraction of characters (1.0 - `chars_coverage`) is treated as UNK.
       See also required_chars field.
       
      optional float character_coverage = 10 [default = 0.9995];
      Returns:
      The characterCoverage.
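
One way to picture character coverage: count how often each character occurs, keep the most frequent characters until they account for the requested share of the corpus, and treat the rest as UNK. The sketch below follows that reading. `CharacterCoverageSketch` is a hypothetical name and the selection rule is an assumption, not the trainer's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of selecting an alphabet by coverage ratio. */
final class CharacterCoverageSketch {
    private CharacterCoverageSketch() {}

    /** Returns the code points kept in the alphabet; all others would map to UNK. */
    static Set<Integer> alphabet(String corpus, double coverage) {
        Map<Integer, Long> freq = new TreeMap<>();
        corpus.codePoints().forEach(cp -> freq.merge(cp, 1L, Long::sum));
        long total = corpus.codePointCount(0, corpus.length());

        // Most frequent first; the stable sort keeps code point order among ties.
        List<Map.Entry<Integer, Long>> sorted = new ArrayList<>(freq.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        Set<Integer> kept = new LinkedHashSet<>();
        long covered = 0;
        for (Map.Entry<Integer, Long> e : sorted) {
            if (covered >= coverage * total) {
                break; // requested share of the corpus is already covered
            }
            kept.add(e.getKey());
            covered += e.getValue();
        }
        return kept;
    }
}
```

With a coverage of 1.0 every character seen in the corpus is kept. Lower values drop the rarest characters first.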
    • hasInputSentenceSize

      boolean hasInputSentenceSize()
       Maximum number of sentences the trainer loads from the `input` parameter.
       Trainer simply loads the `input` files in sequence.
       It is better to shuffle the input corpus randomly.
       
      optional uint64 input_sentence_size = 11 [default = 0];
      Returns:
      Whether the inputSentenceSize field is set.
    • getInputSentenceSize

      long getInputSentenceSize()
       Maximum number of sentences the trainer loads from the `input` parameter.
       Trainer simply loads the `input` files in sequence.
       It is better to shuffle the input corpus randomly.
       
      optional uint64 input_sentence_size = 11 [default = 0];
      Returns:
      The inputSentenceSize.
    • hasShuffleInputSentence

      boolean hasShuffleInputSentence()
      optional bool shuffle_input_sentence = 19 [default = true];
      Returns:
      Whether the shuffleInputSentence field is set.
    • getShuffleInputSentence

      boolean getShuffleInputSentence()
      optional bool shuffle_input_sentence = 19 [default = true];
      Returns:
      The shuffleInputSentence.
    • hasMiningSentenceSize

      @Deprecated boolean hasMiningSentenceSize()
      Deprecated.
      com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated. See sentencepiece_model.proto;l=96
       Maximum size of sentences to make seed sentence pieces.
       Extended suffix array is constructed to extract frequent
       sub-strings from the corpus. This uses 20N working space,
       where N is the size of corpus.
       
      optional int32 mining_sentence_size = 12 [deprecated = true];
      Returns:
      Whether the miningSentenceSize field is set.
    • getMiningSentenceSize

      @Deprecated int getMiningSentenceSize()
      Deprecated.
      com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated. See sentencepiece_model.proto;l=96
       Maximum size of sentences to make seed sentence pieces.
       Extended suffix array is constructed to extract frequent
       sub-strings from the corpus. This uses 20N working space,
       where N is the size of corpus.
       
      optional int32 mining_sentence_size = 12 [deprecated = true];
      Returns:
      The miningSentenceSize.
    • hasTrainingSentenceSize

      @Deprecated boolean hasTrainingSentenceSize()
      Deprecated.
      com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated. See sentencepiece_model.proto;l=99
       Maximum size of sentences to train sentence pieces.
       
      optional int32 training_sentence_size = 13 [deprecated = true];
      Returns:
      Whether the trainingSentenceSize field is set.
    • getTrainingSentenceSize

      @Deprecated int getTrainingSentenceSize()
      Deprecated.
      com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated. See sentencepiece_model.proto;l=99
       Maximum size of sentences to train sentence pieces.
       
      optional int32 training_sentence_size = 13 [deprecated = true];
      Returns:
      The trainingSentenceSize.
    • hasSeedSentencepieceSize

      boolean hasSeedSentencepieceSize()
       The size of seed sentencepieces.
       `seed_sentencepiece_size` must be larger than `vocab_size`.
       
      optional int32 seed_sentencepiece_size = 14 [default = 1000000];
      Returns:
      Whether the seedSentencepieceSize field is set.
    • getSeedSentencepieceSize

      int getSeedSentencepieceSize()
       The size of seed sentencepieces.
       `seed_sentencepiece_size` must be larger than `vocab_size`.
       
      optional int32 seed_sentencepiece_size = 14 [default = 1000000];
      Returns:
      The seedSentencepieceSize.
    • hasShrinkingFactor

      boolean hasShrinkingFactor()
       In every EM sub-iteration, keeps the top
       `shrinking_factor` * `current sentencepieces size` with respect to
       the loss of the sentence piece. This value should be smaller than 1.0.
       
      optional float shrinking_factor = 15 [default = 0.75];
      Returns:
      Whether the shrinkingFactor field is set.
    • getShrinkingFactor

      float getShrinkingFactor()
       In every EM sub-iteration, keeps the top
       `shrinking_factor` * `current sentencepieces size` with respect to
       the loss of the sentence piece. This value should be smaller than 1.0.
       
      optional float shrinking_factor = 15 [default = 0.75];
      Returns:
      The shrinkingFactor.
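
To get a feel for how the shrinking factor interacts with the seed size and the target vocabulary size, the arithmetic can be sketched as below. It assumes each pruning round keeps `shrinking_factor` times the current number of pieces and never goes below the target size. The trainer's real stopping rule may differ, and `ShrinkingSketch` is a hypothetical name.

```java
/** Hypothetical arithmetic sketch; not the trainer's actual pruning loop. */
final class ShrinkingSketch {
    private ShrinkingSketch() {}

    /** Counts pruning rounds needed to shrink seedSize pieces down to vocabSize. */
    static int pruningRounds(int seedSize, int vocabSize, double shrinkingFactor) {
        if (!(shrinkingFactor > 0.0 && shrinkingFactor < 1.0)) {
            throw new IllegalArgumentException("shrinking_factor should be in (0, 1)");
        }
        if (seedSize <= vocabSize) {
            throw new IllegalArgumentException("seed size must be larger than vocab size");
        }
        int current = seedSize;
        int rounds = 0;
        while (current > vocabSize) {
            // Keep the top fraction, but never prune below the target size.
            current = Math.max(vocabSize, (int) (current * shrinkingFactor));
            rounds++;
        }
        return rounds;
    }
}
```

Under these assumptions, the defaults listed on this page (seed size 1000000, vocabulary size 8000, factor 0.75) take 17 rounds.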
    • hasMaxSentenceLength

      boolean hasMaxSentenceLength()
       The maximum sentence length in bytes. Sentences longer than
       `max_sentence_length` are simply ignored.
       Longer input tends to bring the following risks:
        * Overflow during EM training (unigram language model only)
        * Performance drop because of O(n log n) cost in BPE.
       
      optional int32 max_sentence_length = 18 [default = 4192];
      Returns:
      Whether the maxSentenceLength field is set.
    • getMaxSentenceLength

      int getMaxSentenceLength()
       The maximum sentence length in bytes. Sentences longer than
       `max_sentence_length` are simply ignored.
       Longer input tends to bring the following risks:
        * Overflow during EM training (unigram language model only)
        * Performance drop because of O(n log n) cost in BPE.
       
      optional int32 max_sentence_length = 18 [default = 4192];
      Returns:
      The maxSentenceLength.
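
Because the limit is expressed in bytes rather than characters, non-ASCII text reaches it sooner than its character count suggests. A minimal sketch of such a filter follows, assuming the length is measured on the UTF-8 encoding. `SentenceLengthFilter` is a hypothetical helper, not part of this API.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical pre-filter that mirrors the documented "ignore long sentences" rule. */
final class SentenceLengthFilter {
    private SentenceLengthFilter() {}

    static List<String> keepShortEnough(List<String> sentences, int maxSentenceLength) {
        return sentences.stream()
                // Assumption: length is counted in UTF-8 bytes.
                .filter(s -> s.getBytes(StandardCharsets.UTF_8).length <= maxSentenceLength)
                .collect(Collectors.toList());
    }
}
```

With a limit of 5, `hello` (5 bytes) is kept while `héllo` (6 bytes in UTF-8) is dropped.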
    • hasNumThreads

      boolean hasNumThreads()
       Number of threads in the training.
       
      optional int32 num_threads = 16 [default = 16];
      Returns:
      Whether the numThreads field is set.
    • getNumThreads

      int getNumThreads()
       Number of threads in the training.
       
      optional int32 num_threads = 16 [default = 16];
      Returns:
      The numThreads.
    • hasNumSubIterations

      boolean hasNumSubIterations()
       Number of EM sub iterations.
       
      optional int32 num_sub_iterations = 17 [default = 2];
      Returns:
      Whether the numSubIterations field is set.
    • getNumSubIterations

      int getNumSubIterations()
       Number of EM sub iterations.
       
      optional int32 num_sub_iterations = 17 [default = 2];
      Returns:
      The numSubIterations.
    • hasMaxSentencepieceLength

      boolean hasMaxSentencepieceLength()
      /////////////////////////////////////////////////////////////////
       SentencePiece parameters which control the shapes of sentence piece.
      
       Maximum length of sentencepiece.
       
      optional int32 max_sentencepiece_length = 20 [default = 16];
      Returns:
      Whether the maxSentencepieceLength field is set.
    • getMaxSentencepieceLength

      int getMaxSentencepieceLength()
      /////////////////////////////////////////////////////////////////
       SentencePiece parameters which control the shapes of sentence piece.
      
       Maximum length of sentencepiece.
       
      optional int32 max_sentencepiece_length = 20 [default = 16];
      Returns:
      The maxSentencepieceLength.
    • hasSplitByUnicodeScript

      boolean hasSplitByUnicodeScript()
       Uses Unicode script to split sentence pieces.
       When `split_by_unicode_script` is true, a sentence piece may not
       include multiple Unicode scripts, e.g. "F1" is not a valid piece.
       Exception: CJ characters (Hiragana/Katakana/Han) are all handled
       as one script type, since a Japanese word can consist of multiple scripts.
       This exception is always applied regardless of the accept-language
       parameter.
       
      optional bool split_by_unicode_script = 21 [default = true];
      Returns:
      Whether the splitByUnicodeScript field is set.
    • getSplitByUnicodeScript

      boolean getSplitByUnicodeScript()
       Uses Unicode script to split sentence pieces.
       When `split_by_unicode_script` is true, a sentence piece may not
       include multiple Unicode scripts, e.g. "F1" is not a valid piece.
       Exception: CJ characters (Hiragana/Katakana/Han) are all handled
       as one script type, since a Japanese word can consist of multiple scripts.
       This exception is always applied regardless of the accept-language
       parameter.
       
      optional bool split_by_unicode_script = 21 [default = true];
      Returns:
      The splitByUnicodeScript.
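
The single-script rule can be approximated with the JDK's own script tables. The sketch below uses `Character.UnicodeScript` and folds Hiragana, Katakana and Han into one class, as the description states. It is an approximation: the trainer's script classification may differ from the JDK's, and `ScriptCheckSketch` is a hypothetical name.

```java
/** Hypothetical check for "piece uses one Unicode script"; JDK tables, not the trainer's. */
final class ScriptCheckSketch {
    private ScriptCheckSketch() {}

    /** Maps Hiragana, Katakana and Han to a single script class. */
    private static Character.UnicodeScript normalize(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        switch (script) {
            case HIRAGANA:
            case KATAKANA:
            case HAN:
                return Character.UnicodeScript.HAN;
            default:
                return script;
        }
    }

    /** True when every code point of the piece falls into the same script class. */
    static boolean isSingleScript(String piece) {
        return piece.codePoints()
                .mapToObj(ScriptCheckSketch::normalize)
                .distinct()
                .count() <= 1;
    }
}
```

Here "F1" is rejected because the JDK classifies `F` as LATIN and `1` as COMMON, while a mixed Han and Hiragana word passes.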
    • hasSplitByNumber

      boolean hasSplitByNumber()
       When `split_by_number` is true, put a boundary between number and
       non-number transition. To treat "F1" as one token, set this flag
       to false.
       
      optional bool split_by_number = 23 [default = true];
      Returns:
      Whether the splitByNumber field is set.
    • getSplitByNumber

      boolean getSplitByNumber()
       When `split_by_number` is true, put a boundary between number and
       non-number transition. To treat "F1" as one token, set this flag
       to false.
       
      optional bool split_by_number = 23 [default = true];
      Returns:
      The splitByNumber.
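
The boundary rule can be illustrated with a small splitter that cuts wherever the text switches between digits and non-digits. This is a sketch of the described behavior only. `NumberBoundarySketch` is a hypothetical name, and it uses `Character.isDigit` as its notion of "number", which may not match the trainer's.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter placing a boundary at each digit/non-digit transition. */
final class NumberBoundarySketch {
    private NumberBoundarySketch() {}

    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Start a new part when the digit status differs from the previous char.
            if (current.length() > 0
                    && Character.isDigit(c) != Character.isDigit(text.charAt(i - 1))) {
                parts.add(current.toString());
                current.setLength(0);
            }
            current.append(c);
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```

With this rule "F1" becomes the two parts `F` and `1`, which is why the flag must be false for "F1" to survive as a single token.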
    • hasSplitByWhitespace

      boolean hasSplitByWhitespace()
       Use a white space to split sentence pieces.
       When `split_by_whitespace` is false, a piece may contain
       a white space in the middle, e.g., "in_the".
       
      optional bool split_by_whitespace = 22 [default = true];
      Returns:
      Whether the splitByWhitespace field is set.
    • getSplitByWhitespace

      boolean getSplitByWhitespace()
       Use a white space to split sentence pieces.
       When `split_by_whitespace` is false, a piece may contain
       a white space in the middle, e.g., "in_the".
       
      optional bool split_by_whitespace = 22 [default = true];
      Returns:
      The splitByWhitespace.
    • hasTreatWhitespaceAsSuffix

      boolean hasTreatWhitespaceAsSuffix()
       Adds whitespace symbol (_) as a suffix instead of prefix. e.g., _hello =>
       hello_. When `treat_whitespace_as_suffix` is true,
       NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end
       of sentence.
       
      optional bool treat_whitespace_as_suffix = 24 [default = false];
      Returns:
      Whether the treatWhitespaceAsSuffix field is set.
    • getTreatWhitespaceAsSuffix

      boolean getTreatWhitespaceAsSuffix()
       Adds whitespace symbol (_) as a suffix instead of prefix. e.g., _hello =>
       hello_. When `treat_whitespace_as_suffix` is true,
       NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end
       of sentence.
       
      optional bool treat_whitespace_as_suffix = 24 [default = false];
      Returns:
      The treatWhitespaceAsSuffix.
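
The difference between prefix and suffix placement of the whitespace symbol can be shown on a whole sentence. The sketch below writes the symbol as `_`, following the notation used in the description above, and assumes a dummy whitespace is added at the start (prefix mode) or the end (suffix mode). `WhitespaceMarkerSketch` is a hypothetical name, not part of this API.

```java
/** Hypothetical illustration of prefix vs. suffix whitespace markers. */
final class WhitespaceMarkerSketch {
    private WhitespaceMarkerSketch() {}

    static String mark(String sentence, boolean treatWhitespaceAsSuffix) {
        StringBuilder out = new StringBuilder();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty or blank sentence yields no words
            }
            if (treatWhitespaceAsSuffix) {
                out.append(word).append('_'); // hello_
            } else {
                out.append('_').append(word); // _hello
            }
        }
        return out.toString();
    }
}
```

For "hello world" this gives `_hello_world` in the default mode and `hello_world_` when the whitespace is treated as a suffix.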
    • hasAllowWhitespaceOnlyPieces

      boolean hasAllowWhitespaceOnlyPieces()
       Allows pieces that only contain whitespaces instead of appearing only as
       prefix or suffix of other pieces.
       
      optional bool allow_whitespace_only_pieces = 26 [default = false];
      Returns:
      Whether the allowWhitespaceOnlyPieces field is set.
    • getAllowWhitespaceOnlyPieces

      boolean getAllowWhitespaceOnlyPieces()
       Allows pieces that contain only whitespace, instead of whitespace
       appearing only as a prefix or suffix of other pieces.
       
      optional bool allow_whitespace_only_pieces = 26 [default = false];
      Returns:
      The allowWhitespaceOnlyPieces.
    • hasSplitDigits

      boolean hasSplitDigits()
       Split all digits (0-9) into separate pieces.
       
      optional bool split_digits = 25 [default = false];
      Returns:
      Whether the splitDigits field is set.
    • getSplitDigits

      boolean getSplitDigits()
       Split all digits (0-9) into separate pieces.
       
      optional bool split_digits = 25 [default = false];
      Returns:
      The splitDigits.
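As an illustration of what "split all digits into separate pieces" means for a single string, here is a small sketch. It is not the trainer's implementation (the real flag constrains which pieces may enter the vocabulary during training); the helper `splitDigits` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the split_digits constraint on one string.
final class SplitDigitsSketch {
    // Every ASCII digit (0-9) becomes its own piece; runs of
    // non-digit characters stay together.
    static List<String> splitDigits(String text) {
        List<String> pieces = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '0' && c <= '9') {
                if (run.length() > 0) {       // flush the pending non-digit run
                    pieces.add(run.toString());
                    run.setLength(0);
                }
                pieces.add(String.valueOf(c)); // the digit stands alone
            } else {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            pieces.add(run.toString());
        }
        return pieces;
    }
}
```

For example, `splitDigits("F123")` yields the four pieces `F`, `1`, `2`, `3`.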
    • hasPretokenizationDelimiter

      boolean hasPretokenizationDelimiter()
       Defines the pre-tokenization delimiter.
       When specified, no piece crossing this delimiter is included
       in the vocab. The delimiter string is then virtually ignored
       during training. This field allows constraints on the vocabulary
       selection. Note that this field is available in unigram mode.
       
      optional string pretokenization_delimiter = 53 [default = ""];
      Returns:
      Whether the pretokenizationDelimiter field is set.
    • getPretokenizationDelimiter

      String getPretokenizationDelimiter()
       Defines the pre-tokenization delimiter.
       When specified, no piece crossing this delimiter is included
       in the vocab. The delimiter string is then virtually ignored
       during training. This field allows constraints on the vocabulary
       selection. Note that this field is available in unigram mode.
       
      optional string pretokenization_delimiter = 53 [default = ""];
      Returns:
      The pretokenizationDelimiter.
    • getPretokenizationDelimiterBytes

      com.google.protobuf.ByteString getPretokenizationDelimiterBytes()
       Defines the pre-tokenization delimiter.
       When specified, no piece crossing this delimiter is included
       in the vocab. The delimiter string is then virtually ignored
       during training. This field allows constraints on the vocabulary
       selection. Note that this field is available in unigram mode.
       
      optional string pretokenization_delimiter = 53 [default = ""];
      Returns:
      The bytes for pretokenizationDelimiter.
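The constraint described for pretokenization_delimiter can be sketched as a candidate enumeration that never spans the delimiter. This is a simplified model under stated assumptions, not the unigram trainer: the helper `candidatePieces` and its `maxLen` parameter are hypothetical, and substrings are taken over UTF-16 code units for brevity.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical model: enumerate candidate pieces that do not cross a delimiter.
final class PretokenizationSketch {
    static Set<String> candidatePieces(String sentence, String delimiter, int maxLen) {
        // An empty delimiter (the field's default) means no constraint.
        String[] segments = delimiter.isEmpty()
                ? new String[] {sentence}
                : sentence.split(Pattern.quote(delimiter));
        Set<String> candidates = new LinkedHashSet<>();
        for (String segment : segments) {
            // Substrings are taken inside one segment only, so no candidate
            // crosses the delimiter and the delimiter itself never appears.
            for (int start = 0; start < segment.length(); start++) {
                int limit = Math.min(segment.length(), start + maxLen);
                for (int end = start + 1; end <= limit; end++) {
                    candidates.add(segment.substring(start, end));
                }
            }
        }
        return candidates;
    }
}
```

With the sentence `ab|cd` and the delimiter `|`, the candidates are `a`, `ab`, `b`, `c`, `cd`, `d`; neither `bc` nor anything containing `|` is produced.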
    • getControlSymbolsList

      List<String> getControlSymbolsList()
      /////////////////////////////////////////////////////////////////
       Vocabulary management
      
       Defines control symbols used as indicators to
       change the behavior of the decoder. <s> and </s> are pre-defined.
       This field can encode various meta information,
       such as a language indicator in a multilingual model.
       These symbols are not visible to users, but are visible to
       the decoder. Note that when the input sentence contains control symbols,
       they are not treated as one token, but are segmented into normal pieces.
       Control symbols must be inserted independently of the segmentation.
       
      repeated string control_symbols = 30;
      Returns:
      A list containing the controlSymbols.
    • getControlSymbolsCount

      int getControlSymbolsCount()
      /////////////////////////////////////////////////////////////////
       Vocabulary management
      
       Defines control symbols used as indicators to
       change the behavior of the decoder. <s> and </s> are pre-defined.
       This field can encode various meta information,
       such as a language indicator in a multilingual model.
       These symbols are not visible to users, but are visible to
       the decoder. Note that when the input sentence contains control symbols,
       they are not treated as one token, but are segmented into normal pieces.
       Control symbols must be inserted independently of the segmentation.
       
      repeated string control_symbols = 30;
      Returns:
      The count of controlSymbols.
    • getControlSymbols

      String getControlSymbols(int index)
      /////////////////////////////////////////////////////////////////
       Vocabulary management
      
       Defines control symbols used as indicators to
       change the behavior of the decoder. <s> and </s> are pre-defined.
       This field can encode various meta information,
       such as a language indicator in a multilingual model.
       These symbols are not visible to users, but are visible to
       the decoder. Note that when the input sentence contains control symbols,
       they are not treated as one token, but are segmented into normal pieces.
       Control symbols must be inserted independently of the segmentation.
       
      repeated string control_symbols = 30;
      Parameters:
      index - The index of the element to return.
      Returns:
      The controlSymbols at the given index.
    • getControlSymbolsBytes

      com.google.protobuf.ByteString getControlSymbolsBytes(int index)
      /////////////////////////////////////////////////////////////////
       Vocabulary management
      
       Defines control symbols used as indicators to
       change the behavior of the decoder. <s> and </s> are pre-defined.
       This field can encode various meta information,
       such as a language indicator in a multilingual model.
       These symbols are not visible to users, but are visible to
       the decoder. Note that when the input sentence contains control symbols,
       they are not treated as one token, but are segmented into normal pieces.
       Control symbols must be inserted independently of the segmentation.
       
      repeated string control_symbols = 30;
      Parameters:
      index - The index of the value to return.
      Returns:
      The bytes of the controlSymbols at the given index.
    • getUserDefinedSymbolsList

      List<String> getUserDefinedSymbolsList()
       Defines user-defined symbols.
       These symbols are added with an extremely high score
       so that they are always treated as one unique symbol in any context.
       A typical use of user_defined_symbols is as placeholders for named entities.
       
      repeated string user_defined_symbols = 31;
      Returns:
      A list containing the userDefinedSymbols.
    • getUserDefinedSymbolsCount

      int getUserDefinedSymbolsCount()
       Defines user-defined symbols.
       These symbols are added with an extremely high score
       so that they are always treated as one unique symbol in any context.
       A typical use of user_defined_symbols is as placeholders for named entities.
       
      repeated string user_defined_symbols = 31;
      Returns:
      The count of userDefinedSymbols.
    • getUserDefinedSymbols

      String getUserDefinedSymbols(int index)
       Defines user-defined symbols.
       These symbols are added with an extremely high score
       so that they are always treated as one unique symbol in any context.
       A typical use of user_defined_symbols is as placeholders for named entities.
       
      repeated string user_defined_symbols = 31;
      Parameters:
      index - The index of the element to return.
      Returns:
      The userDefinedSymbols at the given index.
    • getUserDefinedSymbolsBytes

      com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index)
       Defines user-defined symbols.
       These symbols are added with an extremely high score
       so that they are always treated as one unique symbol in any context.
       A typical use of user_defined_symbols is as placeholders for named entities.
       
      repeated string user_defined_symbols = 31;
      Parameters:
      index - The index of the value to return.
      Returns:
      The bytes of the userDefinedSymbols at the given index.
    • hasRequiredChars

      boolean hasRequiredChars()
       Defines required characters. Each UTF-8 character in this string is included
       in the character set regardless of the character_coverage value. Unlike
       user_defined_symbols, these characters have scores based on their frequency
       in the input sentences, and the model can form subwords using characters
       in this field.
       
      optional string required_chars = 36;
      Returns:
      Whether the requiredChars field is set.
    • getRequiredChars

      String getRequiredChars()
       Defines required characters. Each UTF-8 character in this string is included
       in the character set regardless of the character_coverage value. Unlike
       user_defined_symbols, these characters have scores based on their frequency
       in the input sentences, and the model can form subwords using characters
       in this field.
       
      optional string required_chars = 36;
      Returns:
      The requiredChars.
    • getRequiredCharsBytes

      com.google.protobuf.ByteString getRequiredCharsBytes()
       Defines required characters. Each UTF-8 character in this string is included
       in the character set regardless of the character_coverage value. Unlike
       user_defined_symbols, these characters have scores based on their frequency
       in the input sentences, and the model can form subwords using characters
       in this field.
       
      optional string required_chars = 36;
      Returns:
      The bytes for requiredChars.
    • hasByteFallback

      boolean hasByteFallback()
       Decomposes unknown pieces into UTF-8 bytes.
       
      optional bool byte_fallback = 35 [default = false];
      Returns:
      Whether the byteFallback field is set.
    • getByteFallback

      boolean getByteFallback()
       Decomposes unknown pieces into UTF-8 bytes.
       
      optional bool byte_fallback = 35 [default = false];
      Returns:
      The byteFallback.
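To make "decomposes unknown pieces into UTF-8 bytes" concrete, the sketch below turns a string into one byte piece per UTF-8 byte. The `<0xNN>` spelling of byte pieces is the convention SentencePiece is understood to use, but treat it here as an assumption; the helper `toBytePieces` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of byte fallback for an unknown piece.
final class ByteFallbackSketch {
    // Returns one piece per UTF-8 byte, spelled <0xNN> (assumed format).
    static List<String> toBytePieces(String unknown) {
        List<String> pieces = new ArrayList<>();
        for (byte b : unknown.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF so bytes >= 0x80 are formatted as unsigned.
            pieces.add(String.format("<0x%02X>", b & 0xFF));
        }
        return pieces;
    }
}
```

For example, U+3042 (HIRAGANA LETTER A) is three bytes in UTF-8, so it becomes the three pieces `<0xE3>`, `<0x81>`, `<0x82>` instead of a single unknown token.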
    • hasVocabularyOutputPieceScore

      boolean hasVocabularyOutputPieceScore()
       When creating the vocabulary file, defines whether or not to additionally
       output the score for each piece.
       
      optional bool vocabulary_output_piece_score = 32 [default = true];
      Returns:
      Whether the vocabularyOutputPieceScore field is set.
    • getVocabularyOutputPieceScore

      boolean getVocabularyOutputPieceScore()
       When creating the vocabulary file, defines whether or not to additionally
       output the score for each piece.
       
      optional bool vocabulary_output_piece_score = 32 [default = true];
      Returns:
      The vocabularyOutputPieceScore.
    • hasHardVocabLimit

      boolean hasHardVocabLimit()
       Treats `vocab_size` as a hard limit: training crashes if
       the model cannot produce a vocab of size `vocab_size`.
       When `hard_vocab_limit` is false, `vocab_size` is treated
       as a soft limit. Note that when model_type=char,
       hard_vocab_limit = false is always assumed.
       
      optional bool hard_vocab_limit = 33 [default = true];
      Returns:
      Whether the hardVocabLimit field is set.
    • getHardVocabLimit

      boolean getHardVocabLimit()
       Treats `vocab_size` as a hard limit: training crashes if
       the model cannot produce a vocab of size `vocab_size`.
       When `hard_vocab_limit` is false, `vocab_size` is treated
       as a soft limit. Note that when model_type=char,
       hard_vocab_limit = false is always assumed.
       
      optional bool hard_vocab_limit = 33 [default = true];
      Returns:
      The hardVocabLimit.
    • hasUseAllVocab

      boolean hasUseAllVocab()
       Uses all symbols for vocab extraction. This flag is valid
       if the model type is either CHAR or WORD.
       
      optional bool use_all_vocab = 34 [default = false];
      Returns:
      Whether the useAllVocab field is set.
    • getUseAllVocab

      boolean getUseAllVocab()
       Uses all symbols for vocab extraction. This flag is valid
       if the model type is either CHAR or WORD.
       
      optional bool use_all_vocab = 34 [default = false];
      Returns:
      The useAllVocab.
    • hasUnkId

      boolean hasUnkId()
      /////////////////////////////////////////////////////////////////
       Reserved special meta tokens.
       * An id of -1 means the token is not used.
       * unk_id must not be -1.
       Ids must start at 0 and be contiguous.
       
      optional int32 unk_id = 40 [default = 0];
      Returns:
      Whether the unkId field is set.
    • getUnkId

      int getUnkId()
      /////////////////////////////////////////////////////////////////
       Reserved special meta tokens.
       * An id of -1 means the token is not used.
       * unk_id must not be -1.
       Ids must start at 0 and be contiguous.
       
      optional int32 unk_id = 40 [default = 0];
      Returns:
      The unkId.
    • hasBosId

      boolean hasBosId()
       <s>
       
      optional int32 bos_id = 41 [default = 1];
      Returns:
      Whether the bosId field is set.
    • getBosId

      int getBosId()
       <s>
       
      optional int32 bos_id = 41 [default = 1];
      Returns:
      The bosId.
    • hasEosId

      boolean hasEosId()
       </s>
       
      optional int32 eos_id = 42 [default = 2];
      Returns:
      Whether the eosId field is set.
    • getEosId

      int getEosId()
       </s>
       
      optional int32 eos_id = 42 [default = 2];
      Returns:
      The eosId.
    • hasPadId

      boolean hasPadId()
       <pad> (padding)
       
      optional int32 pad_id = 43 [default = -1];
      Returns:
      Whether the padId field is set.
    • getPadId

      int getPadId()
       <pad> (padding)
       
      optional int32 pad_id = 43 [default = -1];
      Returns:
      The padId.
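The rules stated for the reserved ids (unk_id, bos_id, eos_id, pad_id) can be checked with a small validator. This is one reading of "must start at 0 and be contiguous", applied to the ids that are not disabled with -1; the helper `isValid` is hypothetical and may be stricter or looser than the library's own check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical validator for the reserved special-token ids.
final class SpecialIdsSketch {
    static boolean isValid(int unkId, int bosId, int eosId, int padId) {
        if (unkId == -1) {
            return false; // unk_id must always be defined
        }
        List<Integer> used = new ArrayList<>();
        for (int id : new int[] {unkId, bosId, eosId, padId}) {
            if (id != -1) {   // -1 disables the token
                used.add(id);
            }
        }
        Collections.sort(used);
        // After sorting, the enabled ids must be exactly 0, 1, 2, ...
        for (int i = 0; i < used.size(); i++) {
            if (used.get(i) != i) {
                return false;
            }
        }
        return true;
    }
}
```

The defaults documented here (unk_id = 0, bos_id = 1, eos_id = 2, pad_id = -1) pass this check, while a configuration such as unk_id = 0, bos_id = 1, eos_id = 3 fails because id 2 is skipped.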
    • hasUnkPiece

      boolean hasUnkPiece()
      optional string unk_piece = 45 [default = "<unk>"];
      Returns:
      Whether the unkPiece field is set.
    • getUnkPiece

      String getUnkPiece()
      optional string unk_piece = 45 [default = "<unk>"];
      Returns:
      The unkPiece.
    • getUnkPieceBytes

      com.google.protobuf.ByteString getUnkPieceBytes()
      optional string unk_piece = 45 [default = "<unk>"];
      Returns:
      The bytes for unkPiece.
    • hasBosPiece

      boolean hasBosPiece()
      optional string bos_piece = 46 [default = "<s>"];
      Returns:
      Whether the bosPiece field is set.
    • getBosPiece

      String getBosPiece()
      optional string bos_piece = 46 [default = "<s>"];
      Returns:
      The bosPiece.
    • getBosPieceBytes

      com.google.protobuf.ByteString getBosPieceBytes()
      optional string bos_piece = 46 [default = "<s>"];
      Returns:
      The bytes for bosPiece.
    • hasEosPiece

      boolean hasEosPiece()
      optional string eos_piece = 47 [default = "</s>"];
      Returns:
      Whether the eosPiece field is set.
    • getEosPiece

      String getEosPiece()
      optional string eos_piece = 47 [default = "</s>"];
      Returns:
      The eosPiece.
    • getEosPieceBytes

      com.google.protobuf.ByteString getEosPieceBytes()
      optional string eos_piece = 47 [default = "</s>"];
      Returns:
      The bytes for eosPiece.
    • hasPadPiece

      boolean hasPadPiece()
      optional string pad_piece = 48 [default = "<pad>"];
      Returns:
      Whether the padPiece field is set.
    • getPadPiece

      String getPadPiece()
      optional string pad_piece = 48 [default = "<pad>"];
      Returns:
      The padPiece.
    • getPadPieceBytes

      com.google.protobuf.ByteString getPadPieceBytes()
      optional string pad_piece = 48 [default = "<pad>"];
      Returns:
      The bytes for padPiece.
    • hasUnkSurface

      boolean hasUnkSurface()
       Encodes <unk> into U+2047 (DOUBLE QUESTION MARK),
       since this character is useful to both users and
       developers: it makes it easy to see that <unk> was emitted.
       
      optional string unk_surface = 44 [default = " \342\201\207 "];
      Returns:
      Whether the unkSurface field is set.
    • getUnkSurface

      String getUnkSurface()
       Encodes <unk> into U+2047 (DOUBLE QUESTION MARK),
       since this character is useful to both users and
       developers: it makes it easy to see that <unk> was emitted.
       
      optional string unk_surface = 44 [default = " \342\201\207 "];
      Returns:
      The unkSurface.
    • getUnkSurfaceBytes

      com.google.protobuf.ByteString getUnkSurfaceBytes()
       Encodes <unk> into U+2047 (DOUBLE QUESTION MARK),
       since this character is useful to both users and
       developers: it makes it easy to see that <unk> was emitted.
       
      optional string unk_surface = 44 [default = " \342\201\207 "];
      Returns:
      The bytes for unkSurface.
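The default shown above, " \342\201\207 ", is written with octal byte escapes. Decoding those bytes as UTF-8 gives U+2047 surrounded by single spaces, which the following sketch verifies; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Decodes the octal-escaped default of unk_surface into a Java String.
final class UnkSurfaceSketch {
    static String decodeDefaultUnkSurface() {
        // Octal 342 201 207 are the bytes 0xE2 0x81 0x87,
        // the UTF-8 encoding of U+2047.
        byte[] raw = {' ', (byte) 0342, (byte) 0201, (byte) 0207, ' '};
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

The decoded string has three characters: a space, U+2047, and a space. The byte-level accessor getUnkSurfaceBytes() is the natural choice when the raw five bytes are needed instead.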
    • hasTrainExtremelyLargeCorpus

      boolean hasTrainExtremelyLargeCorpus()
       Increases bit depth to allow unigram model training on large
       (>10M sentences) corpora. A side effect of enabling this flag
       is increased memory usage.
       
      optional bool train_extremely_large_corpus = 49 [default = false];
      Returns:
      Whether the trainExtremelyLargeCorpus field is set.
    • getTrainExtremelyLargeCorpus

      boolean getTrainExtremelyLargeCorpus()
       Increases bit depth to allow unigram model training on large
       (>10M sentences) corpora. A side effect of enabling this flag
       is increased memory usage.
       
      optional bool train_extremely_large_corpus = 49 [default = false];
      Returns:
      The trainExtremelyLargeCorpus.
    • hasSeedSentencepiecesFile

      boolean hasSeedSentencepiecesFile()
       Path to a seed sentencepieces file, with one tab-separated
       seed sentencepiece <tab> frequency per line.
       
      optional string seed_sentencepieces_file = 54 [default = ""];
      Returns:
      Whether the seedSentencepiecesFile field is set.
    • getSeedSentencepiecesFile

      String getSeedSentencepiecesFile()
       Path to a seed sentencepieces file, with one tab-separated
       seed sentencepiece <tab> frequency per line.
       
      optional string seed_sentencepieces_file = 54 [default = ""];
      Returns:
      The seedSentencepiecesFile.
    • getSeedSentencepiecesFileBytes

      com.google.protobuf.ByteString getSeedSentencepiecesFileBytes()
       Path to a seed sentencepieces file, with one tab-separated
       seed sentencepiece <tab> frequency per line.
       
      optional string seed_sentencepieces_file = 54 [default = ""];
      Returns:
      The bytes for seedSentencepiecesFile.
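A line of the seed file described above has the shape piece, tab, frequency. The sketch below parses one such line. It assumes the frequency is a whole number that fits in a `long`, which the description does not state; the helper `parseSeedLine` is hypothetical and is not the library's reader.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical parser for one "piece<tab>frequency" line of a seed file.
final class SeedLineSketch {
    static Map.Entry<String, Long> parseSeedLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0 || tab == line.length() - 1) {
            throw new IllegalArgumentException("expected piece<tab>frequency: " + line);
        }
        String piece = line.substring(0, tab);
        // Assumption: the frequency is an integer count.
        long frequency = Long.parseLong(line.substring(tab + 1).trim());
        return new AbstractMap.SimpleImmutableEntry<>(piece, frequency);
    }
}
```

A line without a tab, or with an empty piece or frequency, is rejected with an `IllegalArgumentException` so that malformed seed files fail early.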