Package com.google.genai.proto
Class SentencepieceModel.TrainerSpec
java.lang.Object
com.google.protobuf.AbstractMessageLite
com.google.protobuf.AbstractMessage
com.google.protobuf.GeneratedMessageV3
com.google.protobuf.GeneratedMessageV3.ExtendableMessage<SentencepieceModel.TrainerSpec>
com.google.genai.proto.SentencepieceModel.TrainerSpec
- All Implemented Interfaces:
SentencepieceModel.TrainerSpecOrBuilder, com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>, com.google.protobuf.Message, com.google.protobuf.MessageLite, com.google.protobuf.MessageLiteOrBuilder, com.google.protobuf.MessageOrBuilder, Serializable
- Enclosing class:
- SentencepieceModel
public static final class SentencepieceModel.TrainerSpec
extends com.google.protobuf.GeneratedMessageV3.ExtendableMessage<SentencepieceModel.TrainerSpec>
implements SentencepieceModel.TrainerSpecOrBuilder
TrainerSpec encodes various parameters for SentencePiece training. Next id: 55
Protobuf type com.google.genai.proto.TrainerSpec
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
static final class TrainerSpec encodes various parameters for SentencePiece training.
static enum Model type.
Nested classes/interfaces inherited from class com.google.protobuf.GeneratedMessageV3
com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<MessageT extends com.google.protobuf.GeneratedMessageV3.ExtendableMessage<MessageT>,BuilderT extends com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<MessageT, BuilderT>>, com.google.protobuf.GeneratedMessageV3.ExtendableMessage<MessageT extends com.google.protobuf.GeneratedMessageV3.ExtendableMessage<MessageT>>, com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<MessageT extends com.google.protobuf.GeneratedMessageV3.ExtendableMessage<MessageT>>, com.google.protobuf.GeneratedMessageV3.FieldAccessorTable -
Field Summary
Fields
Modifier and Type, Field, Description
static final int: one row for each *_FIELD_NUMBER constant listed under Field Details
static final com.google.protobuf.Parser<SentencepieceModel.TrainerSpec>: Deprecated.
-
Method Summary
Modifier and Type, Method, Description
boolean
getAcceptLanguage(int index) List of the languages this model can accept.
com.google.protobuf.ByteString getAcceptLanguageBytes(int index) List of the languages this model can accept.
int List of the languages this model can accept.
com.google.protobuf.ProtocolStringList List of the languages this model can accept.
boolean Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
int getBosId() <s>
optional string bos_piece = 46 [default = "<s>"];
com.google.protobuf.ByteString optional string bos_piece = 46 [default = "<s>"];
boolean Decomposes unknown pieces into UTF-8 bytes.
float Training parameters.
getControlSymbols(int index) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
com.google.protobuf.ByteString getControlSymbolsBytes(int index) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
int Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
com.google.protobuf.ProtocolStringList Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
static final com.google.protobuf.Descriptors.Descriptor
long Clipping threshold to apply after adding noise.
float Set these parameters if you need DP version of sentencepiece.
boolean Whether to use DP version of sentencepiece.
int getEosId() </s>
optional string eos_piece = 47 [default = "</s>"];
com.google.protobuf.ByteString optional string eos_piece = 47 [default = "</s>"];
boolean `vocab_size` is treated as hard limit.
getInput(int index) General parameters. Input corpus files.
com.google.protobuf.ByteString getInputBytes(int index) General parameters. Input corpus files.
int General parameters. Input corpus files.
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq
com.google.protobuf.ByteString Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq
com.google.protobuf.ProtocolStringList General parameters. Input corpus files.
long Maximum size of sentences the trainer loads from `input` parameter.
int The maximum sentence length in byte.
int SentencePiece parameters which control the shapes of sentence piece.
int Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated.
Output model file prefix.
com.google.protobuf.ByteString Output model file prefix.
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
int Number of EM sub iterations.
int Number of threads in the training.
int getPadId() <pad> (padding)
optional string pad_piece = 48 [default = "<pad>"];
com.google.protobuf.ByteString optional string pad_piece = 48 [default = "<pad>"];
com.google.protobuf.Parser<SentencepieceModel.TrainerSpec>
Defines the pre-tokenization delimiter.
com.google.protobuf.ByteString Defines the pre-tokenization delimiter.
Defines required characters.
com.google.protobuf.ByteString Defines required characters.
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
com.google.protobuf.ByteString Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
int The size of seed sentencepieces.
int Size of self-test samples, which are encoded in the model file.
int
float In every EM sub-iterations, keeps top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
boolean optional bool shuffle_input_sentence = 19 [default = true];
boolean When `split_by_number` is true, put a boundary between number and non-number transition.
boolean Uses Unicode script to split sentence pieces.
boolean Use a white space to split sentence pieces.
boolean Split all digits (0-9) into separate pieces.
boolean Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
int Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated.
boolean Adds whitespace symbol (_) as a suffix instead of prefix.
int getUnkId() Reserved special meta tokens.
optional string unk_piece = 45 [default = "<unk>"];
com.google.protobuf.ByteString optional string unk_piece = 45 [default = "<unk>"];
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
com.google.protobuf.ByteString Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
boolean use all symbols for vocab extraction.
getUserDefinedSymbols(int index) Defines user defined symbols.
com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index) Defines user defined symbols.
int Defines user defined symbols.
com.google.protobuf.ProtocolStringList Defines user defined symbols.
int Vocabulary size.
boolean When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
boolean Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
boolean hasBosId() <s>
boolean optional string bos_piece = 46 [default = "<s>"];
boolean Decomposes unknown pieces into UTF-8 bytes.
boolean Training parameters.
boolean Clipping threshold to apply after adding noise.
boolean Set these parameters if you need DP version of sentencepiece.
boolean Whether to use DP version of sentencepiece.
boolean hasEosId() </s>
boolean optional string eos_piece = 47 [default = "</s>"];
boolean `vocab_size` is treated as hard limit.
int hashCode()
boolean Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq
boolean Maximum size of sentences the trainer loads from `input` parameter.
boolean The maximum sentence length in byte.
boolean SentencePiece parameters which control the shapes of sentence piece.
boolean Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated.
boolean Output model file prefix.
boolean optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
boolean Number of EM sub iterations.
boolean Number of threads in the training.
boolean hasPadId() <pad> (padding)
boolean optional string pad_piece = 48 [default = "<pad>"];
boolean Defines the pre-tokenization delimiter.
boolean Defines required characters.
boolean Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
boolean The size of seed sentencepieces.
boolean Size of self-test samples, which are encoded in the model file.
boolean In every EM sub-iterations, keeps top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
boolean optional bool shuffle_input_sentence = 19 [default = true];
boolean When `split_by_number` is true, put a boundary between number and non-number transition.
boolean Uses Unicode script to split sentence pieces.
boolean Use a white space to split sentence pieces.
boolean Split all digits (0-9) into separate pieces.
boolean Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
boolean Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated.
boolean Adds whitespace symbol (_) as a suffix instead of prefix.
boolean hasUnkId() Reserved special meta tokens.
boolean optional string unk_piece = 45 [default = "<unk>"];
boolean Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
boolean use all symbols for vocab extraction.
boolean Vocabulary size.
boolean When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
final boolean
newBuilder(SentencepieceModel.TrainerSpec prototype)
parseDelimitedFrom(InputStream input)
parseDelimitedFrom(InputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry)
parseFrom(byte[] data)
parseFrom(byte[] data, com.google.protobuf.ExtensionRegistryLite extensionRegistry)
parseFrom(com.google.protobuf.ByteString data)
parseFrom(com.google.protobuf.ByteString data, com.google.protobuf.ExtensionRegistryLite extensionRegistry)
parseFrom(com.google.protobuf.CodedInputStream input)
parseFrom(com.google.protobuf.CodedInputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry)
parseFrom(InputStream input)
parseFrom(InputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry)
parseFrom(ByteBuffer data)
parseFrom(ByteBuffer data, com.google.protobuf.ExtensionRegistryLite extensionRegistry)
static com.google.protobuf.Parser<SentencepieceModel.TrainerSpec> parser()
void writeTo(com.google.protobuf.CodedOutputStream output)
Methods inherited from class com.google.protobuf.GeneratedMessageV3.ExtendableMessage
getAllFields, getAllFieldsRaw, getExtension, getExtension, getExtension, getExtension, getExtension, getExtension, getExtensionCount, getExtensionCount, getExtensionCount, getField, getRepeatedField, getRepeatedFieldCount, hasExtension, hasExtension, hasExtension, hasField
Methods inherited from class com.google.protobuf.GeneratedMessageV3
getDescriptorForType, getOneofFieldDescriptor, getUnknownFields, hasOneof
Methods inherited from class com.google.protobuf.AbstractMessage
findInitializationErrors, getInitializationErrorString, toString
Methods inherited from class com.google.protobuf.AbstractMessageLite
toByteArray, toByteString, writeDelimitedTo, writeTo
Methods inherited from interface com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder
getExtension, getExtension, getExtension, getExtension, getExtension, getExtension, getExtensionCount, getExtensionCount, getExtensionCount, hasExtension, hasExtension, hasExtension
Methods inherited from interface com.google.protobuf.MessageLite
toByteArray, toByteString, writeDelimitedTo, writeTo
Methods inherited from interface com.google.protobuf.MessageOrBuilder
findInitializationErrors, getAllFields, getDescriptorForType, getField, getInitializationErrorString, getOneofFieldDescriptor, getRepeatedField, getRepeatedFieldCount, getUnknownFields, hasField, hasOneof
-
Field Details
-
INPUT_FIELD_NUMBER
public static final int INPUT_FIELD_NUMBER
-
INPUT_FORMAT_FIELD_NUMBER
public static final int INPUT_FORMAT_FIELD_NUMBER
-
MODEL_PREFIX_FIELD_NUMBER
public static final int MODEL_PREFIX_FIELD_NUMBER
-
MODEL_TYPE_FIELD_NUMBER
public static final int MODEL_TYPE_FIELD_NUMBER
-
VOCAB_SIZE_FIELD_NUMBER
public static final int VOCAB_SIZE_FIELD_NUMBER
-
ACCEPT_LANGUAGE_FIELD_NUMBER
public static final int ACCEPT_LANGUAGE_FIELD_NUMBER
-
SELF_TEST_SAMPLE_SIZE_FIELD_NUMBER
public static final int SELF_TEST_SAMPLE_SIZE_FIELD_NUMBER
-
ENABLE_DIFFERENTIAL_PRIVACY_FIELD_NUMBER
public static final int ENABLE_DIFFERENTIAL_PRIVACY_FIELD_NUMBER
-
DIFFERENTIAL_PRIVACY_NOISE_LEVEL_FIELD_NUMBER
public static final int DIFFERENTIAL_PRIVACY_NOISE_LEVEL_FIELD_NUMBER
-
DIFFERENTIAL_PRIVACY_CLIPPING_THRESHOLD_FIELD_NUMBER
public static final int DIFFERENTIAL_PRIVACY_CLIPPING_THRESHOLD_FIELD_NUMBER
-
CHARACTER_COVERAGE_FIELD_NUMBER
public static final int CHARACTER_COVERAGE_FIELD_NUMBER
-
INPUT_SENTENCE_SIZE_FIELD_NUMBER
public static final int INPUT_SENTENCE_SIZE_FIELD_NUMBER
-
SHUFFLE_INPUT_SENTENCE_FIELD_NUMBER
public static final int SHUFFLE_INPUT_SENTENCE_FIELD_NUMBER
-
MINING_SENTENCE_SIZE_FIELD_NUMBER
public static final int MINING_SENTENCE_SIZE_FIELD_NUMBER
-
TRAINING_SENTENCE_SIZE_FIELD_NUMBER
public static final int TRAINING_SENTENCE_SIZE_FIELD_NUMBER
-
SEED_SENTENCEPIECE_SIZE_FIELD_NUMBER
public static final int SEED_SENTENCEPIECE_SIZE_FIELD_NUMBER
-
SHRINKING_FACTOR_FIELD_NUMBER
public static final int SHRINKING_FACTOR_FIELD_NUMBER
-
MAX_SENTENCE_LENGTH_FIELD_NUMBER
public static final int MAX_SENTENCE_LENGTH_FIELD_NUMBER
-
NUM_THREADS_FIELD_NUMBER
public static final int NUM_THREADS_FIELD_NUMBER
-
NUM_SUB_ITERATIONS_FIELD_NUMBER
public static final int NUM_SUB_ITERATIONS_FIELD_NUMBER
-
MAX_SENTENCEPIECE_LENGTH_FIELD_NUMBER
public static final int MAX_SENTENCEPIECE_LENGTH_FIELD_NUMBER
-
SPLIT_BY_UNICODE_SCRIPT_FIELD_NUMBER
public static final int SPLIT_BY_UNICODE_SCRIPT_FIELD_NUMBER
-
SPLIT_BY_NUMBER_FIELD_NUMBER
public static final int SPLIT_BY_NUMBER_FIELD_NUMBER
-
SPLIT_BY_WHITESPACE_FIELD_NUMBER
public static final int SPLIT_BY_WHITESPACE_FIELD_NUMBER
-
TREAT_WHITESPACE_AS_SUFFIX_FIELD_NUMBER
public static final int TREAT_WHITESPACE_AS_SUFFIX_FIELD_NUMBER
-
ALLOW_WHITESPACE_ONLY_PIECES_FIELD_NUMBER
public static final int ALLOW_WHITESPACE_ONLY_PIECES_FIELD_NUMBER
-
SPLIT_DIGITS_FIELD_NUMBER
public static final int SPLIT_DIGITS_FIELD_NUMBER
-
PRETOKENIZATION_DELIMITER_FIELD_NUMBER
public static final int PRETOKENIZATION_DELIMITER_FIELD_NUMBER
-
CONTROL_SYMBOLS_FIELD_NUMBER
public static final int CONTROL_SYMBOLS_FIELD_NUMBER
-
USER_DEFINED_SYMBOLS_FIELD_NUMBER
public static final int USER_DEFINED_SYMBOLS_FIELD_NUMBER
-
REQUIRED_CHARS_FIELD_NUMBER
public static final int REQUIRED_CHARS_FIELD_NUMBER
-
BYTE_FALLBACK_FIELD_NUMBER
public static final int BYTE_FALLBACK_FIELD_NUMBER
-
VOCABULARY_OUTPUT_PIECE_SCORE_FIELD_NUMBER
public static final int VOCABULARY_OUTPUT_PIECE_SCORE_FIELD_NUMBER
-
HARD_VOCAB_LIMIT_FIELD_NUMBER
public static final int HARD_VOCAB_LIMIT_FIELD_NUMBER
-
USE_ALL_VOCAB_FIELD_NUMBER
public static final int USE_ALL_VOCAB_FIELD_NUMBER
-
UNK_ID_FIELD_NUMBER
public static final int UNK_ID_FIELD_NUMBER
-
BOS_ID_FIELD_NUMBER
public static final int BOS_ID_FIELD_NUMBER
-
EOS_ID_FIELD_NUMBER
public static final int EOS_ID_FIELD_NUMBER
-
PAD_ID_FIELD_NUMBER
public static final int PAD_ID_FIELD_NUMBER
-
UNK_PIECE_FIELD_NUMBER
public static final int UNK_PIECE_FIELD_NUMBER
-
BOS_PIECE_FIELD_NUMBER
public static final int BOS_PIECE_FIELD_NUMBER
-
EOS_PIECE_FIELD_NUMBER
public static final int EOS_PIECE_FIELD_NUMBER
-
PAD_PIECE_FIELD_NUMBER
public static final int PAD_PIECE_FIELD_NUMBER
-
UNK_SURFACE_FIELD_NUMBER
public static final int UNK_SURFACE_FIELD_NUMBER
-
TRAIN_EXTREMELY_LARGE_CORPUS_FIELD_NUMBER
public static final int TRAIN_EXTREMELY_LARGE_CORPUS_FIELD_NUMBER
-
SEED_SENTENCEPIECES_FILE_FIELD_NUMBER
public static final int SEED_SENTENCEPIECES_FILE_FIELD_NUMBER
-
PARSER
Deprecated.
-
-
Method Details
-
getDescriptor
public static final com.google.protobuf.Descriptors.Descriptor getDescriptor()
-
getInputList
public com.google.protobuf.ProtocolStringList getInputList()
General parameters. Input corpus files. Trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. Trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
Specified by: getInputList in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: A list containing the input.
-
getInputCount
public int getInputCount()
General parameters. Input corpus files. Trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. Trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
Specified by: getInputCount in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The count of input.
-
getInput
General parameters. Input corpus files. Trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. Trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
Specified by: getInput in interface SentencepieceModel.TrainerSpecOrBuilder
Parameters: index - The index of the element to return.
Returns: The input at the given index.
-
getInputBytes
public com.google.protobuf.ByteString getInputBytes(int index)
General parameters. Input corpus files. Trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. Trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
Specified by: getInputBytes in interface SentencepieceModel.TrainerSpecOrBuilder
Parameters: index - The index of the value to return.
Returns: The bytes of the input at the given index.
-
hasInputFormat
public boolean hasInputFormat()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
optional string input_format = 7;
Specified by: hasInputFormat in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the inputFormat field is set.
-
getInputFormat
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
optional string input_format = 7;
Specified by: getInputFormat in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The inputFormat.
-
getInputFormatBytes
public com.google.protobuf.ByteString getInputFormatBytes()
Input corpus format: "text": one-sentence-per-line text format (default); "tsv": sentence <tab> freq.
optional string input_format = 7;
Specified by: getInputFormatBytes in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The bytes for inputFormat.
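The "tsv" format described above pairs each sentence with a frequency, separated by a tab. The following is a minimal sketch of how one such line could be split. `TsvLine` and `parse` are hypothetical names, not part of the generated API or of SentencePiece, and treating the last tab as the separator is an assumption made for illustration.

```java
/** Hypothetical helper (not part of the generated API): one line of a "tsv" corpus. */
final class TsvLine {
    final String sentence;
    final long freq;

    private TsvLine(String sentence, long freq) {
        this.sentence = sentence;
        this.freq = freq;
    }

    /**
     * Parses "sentence<TAB>freq". Assumption: the last tab on the line
     * separates the frequency from the sentence text.
     */
    static TsvLine parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected sentence<TAB>freq: " + line);
        }
        long freq = Long.parseLong(line.substring(tab + 1).trim());
        return new TsvLine(line.substring(0, tab), freq);
    }

    public static void main(String[] args) {
        TsvLine parsed = TsvLine.parse("hello world\t42");
        System.out.println(parsed.sentence + " occurs " + parsed.freq + " times");
    }
}
```

A line without a tab is rejected rather than silently given a default frequency, since the "text" format is the one intended for plain sentences.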
-
hasModelPrefix
public boolean hasModelPrefix()
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;
Specified by: hasModelPrefix in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the modelPrefix field is set.
-
getModelPrefix
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;
Specified by: getModelPrefix in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The modelPrefix.
-
getModelPrefixBytes
public com.google.protobuf.ByteString getModelPrefixBytes()
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;
Specified by: getModelPrefixBytes in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The bytes for modelPrefix.
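As a small illustration of the naming rule above, the sketch below derives the two output file names from a prefix. `ModelOutputs` and its methods are hypothetical helpers written for this example; only the `.model` and `.vocab` suffixes come from the field description.

```java
/** Hypothetical helper (not part of the generated API): output names for a model_prefix. */
final class ModelOutputs {
    private ModelOutputs() {}

    /** Returns "<model_prefix>.model". */
    static String modelFile(String modelPrefix) {
        return modelPrefix + ".model";
    }

    /** Returns "<model_prefix>.vocab". */
    static String vocabFile(String modelPrefix) {
        return modelPrefix + ".vocab";
    }

    public static void main(String[] args) {
        System.out.println(modelFile("out/spm"));
        System.out.println(vocabFile("out/spm"));
    }
}
```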
-
hasModelType
public boolean hasModelType()
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
Specified by: hasModelType in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the modelType field is set.
-
getModelType
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
Specified by: getModelType in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The modelType.
-
hasVocabSize
public boolean hasVocabSize()
Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];
Specified by: hasVocabSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the vocabSize field is set.
-
getVocabSize
public int getVocabSize()
Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];
Specified by: getVocabSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The vocabSize.
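The has/get pair for an optional field with a declared default separates presence from value: the getter falls back to the default (8000 here) when nothing was set, while the has-method reports whether a value was set explicitly. The sketch below mimics that behavior with a plain Java stand-in. `OptionalVocabSize` is a hypothetical class written for illustration and is not the generated code.

```java
/** Hypothetical stand-in (not the generated class) for an optional int32 with a default. */
final class OptionalVocabSize {
    /** Mirrors the declared default in "[default = 8000]". */
    static final int DEFAULT = 8000;

    private boolean present;
    private int value = DEFAULT;

    /** True only after an explicit set, even if the value equals the default. */
    boolean has() {
        return present;
    }

    /** Returns the explicit value, or the default when the field is unset. */
    int get() {
        return value;
    }

    void set(int newValue) {
        value = newValue;
        present = true;
    }

    void clear() {
        value = DEFAULT;
        present = false;
    }

    public static void main(String[] args) {
        OptionalVocabSize field = new OptionalVocabSize();
        System.out.println(field.has() + " " + field.get());
        field.set(32000);
        System.out.println(field.has() + " " + field.get());
    }
}
```

Callers that need to tell "left at the default" apart from "explicitly set to 8000" would check the has-method rather than compare the value.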
-
getAcceptLanguageList
public com.google.protobuf.ProtocolStringList getAcceptLanguageList()
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
Specified by: getAcceptLanguageList in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: A list containing the acceptLanguage.
-
getAcceptLanguageCount
public int getAcceptLanguageCount()
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
Specified by: getAcceptLanguageCount in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The count of acceptLanguage.
-
getAcceptLanguage
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
Specified by: getAcceptLanguage in interface SentencepieceModel.TrainerSpecOrBuilder
Parameters: index - The index of the element to return.
Returns: The acceptLanguage at the given index.
-
getAcceptLanguageBytes
public com.google.protobuf.ByteString getAcceptLanguageBytes(int index)
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;
Specified by: getAcceptLanguageBytes in interface SentencepieceModel.TrainerSpecOrBuilder
Parameters: index - The index of the value to return.
Returns: The bytes of the acceptLanguage at the given index.
-
hasSelfTestSampleSize
public boolean hasSelfTestSampleSize()
Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];
Specified by: hasSelfTestSampleSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the selfTestSampleSize field is set.
-
getSelfTestSampleSize
public int getSelfTestSampleSize()
Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];
Specified by: getSelfTestSampleSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The selfTestSampleSize.
-
hasEnableDifferentialPrivacy
public boolean hasEnableDifferentialPrivacy()
Whether to use the DP (differential privacy) version of sentencepiece. Use it with the TSV input format (requires precomputed word tab counts to work).
optional bool enable_differential_privacy = 50 [default = false];
Specified by: hasEnableDifferentialPrivacy in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the enableDifferentialPrivacy field is set.
-
getEnableDifferentialPrivacy
public boolean getEnableDifferentialPrivacy()
Whether to use the DP (differential privacy) version of sentencepiece. Use it with the TSV input format (requires precomputed word tab counts to work).
optional bool enable_differential_privacy = 50 [default = false];
Specified by: getEnableDifferentialPrivacy in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The enableDifferentialPrivacy.
-
hasDifferentialPrivacyNoiseLevel
public boolean hasDifferentialPrivacyNoiseLevel()
Set these parameters if you need the DP version of sentencepiece. Standard deviation of the noise to add.
optional float differential_privacy_noise_level = 51 [default = 0];
Specified by: hasDifferentialPrivacyNoiseLevel in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the differentialPrivacyNoiseLevel field is set.
-
getDifferentialPrivacyNoiseLevel
public float getDifferentialPrivacyNoiseLevel()
Set these parameters if you need the DP version of sentencepiece. Standard deviation of the noise to add.
optional float differential_privacy_noise_level = 51 [default = 0];
Specified by: getDifferentialPrivacyNoiseLevel in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The differentialPrivacyNoiseLevel.
-
hasDifferentialPrivacyClippingThreshold
public boolean hasDifferentialPrivacyClippingThreshold()
Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];
Specified by: hasDifferentialPrivacyClippingThreshold in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the differentialPrivacyClippingThreshold field is set.
-
getDifferentialPrivacyClippingThreshold
public long getDifferentialPrivacyClippingThreshold()
Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];
Specified by: getDifferentialPrivacyClippingThreshold in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The differentialPrivacyClippingThreshold.
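Taken together, the noise level and clipping threshold describe a two-step filter over word frequencies: add noise, then drop every word whose noisy frequency is below the threshold. The sketch below shows that sequence under stated assumptions. `DpCounts` and `noiseAndClip` are hypothetical names, the use of Gaussian noise is an assumption (the field description only gives a standard deviation), and the threshold is assumed to fit in the signed range of a Java long.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch (not the trainer's implementation): noise, then clip. */
final class DpCounts {
    private DpCounts() {}

    /**
     * Adds zero-mean noise with standard deviation noiseLevel to each count
     * (Gaussian is an assumption), then keeps only the words whose noisy
     * count is at least clippingThreshold.
     */
    static Map<String, Double> noiseAndClip(
            Map<String, Long> counts, float noiseLevel, long clippingThreshold, Random rng) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            double noisy = entry.getValue() + noiseLevel * rng.nextGaussian();
            if (noisy >= clippingThreshold) {
                kept.put(entry.getKey(), noisy);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("the", 100L);
        counts.put("rare", 2L);
        // A fixed seed keeps the demonstration repeatable.
        System.out.println(noiseAndClip(counts, 1.0f, 5L, new Random(0)).keySet());
    }
}
```

With a noise level of 0 the filter reduces to a plain frequency cutoff, which is a convenient way to check the clipping step in isolation.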
-
hasCharacterCoverage
public boolean hasCharacterCoverage()
Training parameters. Uses the characters that cover the corpus at the ratio `chars_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining 1.0 - `chars_coverage` fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];
Specified by: hasCharacterCoverage in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the characterCoverage field is set.
-
getCharacterCoverage
public float getCharacterCoverage()
Training parameters. Uses the characters that cover the corpus at the ratio `chars_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining 1.0 - `chars_coverage` fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];
Specified by: getCharacterCoverage in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The characterCoverage.
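One way to read the coverage ratio is as a cumulative cutoff over character frequencies: keep the most frequent characters until they account for the requested share of all character occurrences, and treat the rest as UNK. The sketch below illustrates that reading. `CharCoverage` and `alphabet` are hypothetical names, and the greedy most-frequent-first selection is an assumption about how the ratio is applied, not a statement of the trainer's exact procedure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch (not the trainer's implementation) of a coverage cutoff. */
final class CharCoverage {
    private CharCoverage() {}

    /** Returns the code points kept as the basic alphabet for the given coverage ratio. */
    static Set<Integer> alphabet(List<String> corpus, double coverage) {
        Map<Integer, Long> freq = new HashMap<>();
        long total = 0;
        for (String sentence : corpus) {
            for (int codePoint : sentence.codePoints().toArray()) {
                freq.merge(codePoint, 1L, Long::sum);
                total++;
            }
        }
        // Most frequent first; ties broken by code point so the result is deterministic.
        List<Map.Entry<Integer, Long>> sorted = new ArrayList<>(freq.entrySet());
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : Integer.compare(a.getKey(), b.getKey());
        });
        Set<Integer> kept = new LinkedHashSet<>();
        long covered = 0;
        for (Map.Entry<Integer, Long> entry : sorted) {
            if (covered >= coverage * total) {
                break; // the requested share is already covered; the rest maps to UNK
            }
            kept.add(entry.getKey());
            covered += entry.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>();
        corpus.add("aaaaaaaaab");
        System.out.println(alphabet(corpus, 0.5).size());
        System.out.println(alphabet(corpus, 1.0).size());
    }
}
```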
-
hasInputSentenceSize
public boolean hasInputSentenceSize()
Maximum size of sentences the trainer loads from `input` parameter. Trainer simply loads the `input` files in sequence. It is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];
Specified by: hasInputSentenceSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the inputSentenceSize field is set.
-
getInputSentenceSize
public long getInputSentenceSize()
Maximum size of sentences the trainer loads from `input` parameter. Trainer simply loads the `input` files in sequence. It is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];
Specified by: getInputSentenceSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The inputSentenceSize.
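The field is declared as uint64 but the getter returns a signed Java long, so values above Long.MAX_VALUE would appear negative if printed or compared naively. The sketch below shows how such a value can be handled with the unsigned helpers in java.lang.Long. `Uint64Fields` and its methods are hypothetical names introduced for illustration.

```java
/** Hypothetical helper (not part of the generated API) for uint64 values held in a long. */
final class Uint64Fields {
    private Uint64Fields() {}

    /** Renders the raw long as the unsigned decimal number it encodes. */
    static String describe(long rawUint64) {
        return Long.toUnsignedString(rawUint64);
    }

    /** Compares two raw uint64 values using unsigned ordering. */
    static boolean exceeds(long rawUint64, long rawLimit) {
        return Long.compareUnsigned(rawUint64, rawLimit) > 0;
    }

    public static void main(String[] args) {
        long raw = -1L; // all 64 bits set, i.e. the largest uint64 value
        System.out.println(describe(raw));
        System.out.println(exceeds(raw, Long.MAX_VALUE));
    }
}
```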
-
hasShuffleInputSentence
public boolean hasShuffleInputSentence()
optional bool shuffle_input_sentence = 19 [default = true];
Specified by: hasShuffleInputSentence in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the shuffleInputSentence field is set.
-
getShuffleInputSentence
public boolean getShuffleInputSentence()
optional bool shuffle_input_sentence = 19 [default = true];
Specified by: getShuffleInputSentence in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The shuffleInputSentence.
-
hasMiningSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated. See sentencepiece_model.proto;l=96
Maximum size of sentences to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];
Specified by: hasMiningSentenceSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: Whether the miningSentenceSize field is set.
-
getMiningSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated. See sentencepiece_model.proto;l=96
Maximum size of sentences to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];
Specified by: getMiningSentenceSize in interface SentencepieceModel.TrainerSpecOrBuilder
Returns: The miningSentenceSize.
-
hasTrainingSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated. See sentencepiece_model.proto;l=99
Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];- Specified by:
hasTrainingSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the trainingSentenceSize field is set.
-
getTrainingSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated. See sentencepiece_model.proto;l=99
Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];- Specified by:
getTrainingSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The trainingSentenceSize.
-
hasSeedSentencepieceSize
public boolean hasSeedSentencepieceSize()
The number of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];- Specified by:
hasSeedSentencepieceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the seedSentencepieceSize field is set.
-
getSeedSentencepieceSize
public int getSeedSentencepieceSize()
The number of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];- Specified by:
getSeedSentencepieceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The seedSentencepieceSize.
-
hasShrinkingFactor
public boolean hasShrinkingFactor()
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of each sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];- Specified by:
hasShrinkingFactorin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the shrinkingFactor field is set.
-
getShrinkingFactor
public float getShrinkingFactor()
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of each sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];- Specified by:
getShrinkingFactorin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The shrinkingFactor.
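The pruning arithmetic described for `shrinking_factor` can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the target size is an arbitrary parameter of the sketch, rounding by truncation is assumed, and the real trainer's stopping rule may differ.

```java
final class ShrinkSchedule {
    // Hypothetical helper: counts how many pruning steps are needed to bring
    // `seedSize` candidates down to at most `targetSize`, when every step
    // keeps the top `factor` share of the current candidates.
    static int stepsToTarget(long seedSize, double factor, long targetSize) {
        if (factor <= 0.0 || factor >= 1.0) {
            throw new IllegalArgumentException("factor must be in (0, 1)");
        }
        if (targetSize < 0) {
            throw new IllegalArgumentException("targetSize must be non-negative");
        }
        int steps = 0;
        long size = seedSize;
        while (size > targetSize) {
            // Truncation toward zero is an assumption of this sketch.
            size = (long) (size * factor);
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // Defaults from this page: 1,000,000 seed pieces, factor 0.75.
        // The target of 8,000 is an arbitrary example value.
        System.out.println(stepsToTarget(1_000_000L, 0.75, 8_000L));
    }
}
```

A factor of 1.0 or more would never shrink the candidate set, which is why the sketch rejects it, in line with the note that the value should be smaller than 1.0.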
-
hasMaxSentenceLength
public boolean hasMaxSentenceLength()
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks:
* Overflow during EM training (unigram language model only)
* Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];- Specified by:
hasMaxSentenceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the maxSentenceLength field is set.
-
getMaxSentenceLength
public int getMaxSentenceLength()
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks:
* Overflow during EM training (unigram language model only)
* Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];- Specified by:
getMaxSentenceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The maxSentenceLength.
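Because the limit is expressed in bytes rather than characters, a sentence in a multi-byte script reaches it sooner than its character count suggests. The sketch below shows one way to pre-check a corpus against the limit. It assumes UTF-8 encoded input, and the class and method names are hypothetical rather than part of this API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SentenceLengthFilter {
    // Keeps only the sentences whose UTF-8 encoding is at most `maxBytes`
    // long; longer sentences are dropped, mirroring the "simply ignored"
    // behaviour described for max_sentence_length.
    static List<String> keepShortEnough(List<String> sentences, int maxBytes) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            int byteLength = sentence.getBytes(StandardCharsets.UTF_8).length;
            if (byteLength <= maxBytes) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // 4192 is the default listed for max_sentence_length on this page.
        List<String> corpus = List.of("a short sentence", "another one");
        System.out.println(keepShortEnough(corpus, 4192).size());
    }
}
```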
-
hasNumThreads
public boolean hasNumThreads()Number of threads in the training.
optional int32 num_threads = 16 [default = 16];- Specified by:
hasNumThreadsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the numThreads field is set.
-
getNumThreads
public int getNumThreads()Number of threads in the training.
optional int32 num_threads = 16 [default = 16];- Specified by:
getNumThreadsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The numThreads.
-
hasNumSubIterations
public boolean hasNumSubIterations()Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];- Specified by:
hasNumSubIterationsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the numSubIterations field is set.
-
getNumSubIterations
public int getNumSubIterations()Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];- Specified by:
getNumSubIterationsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The numSubIterations.
-
hasMaxSentencepieceLength
public boolean hasMaxSentencepieceLength()
SentencePiece parameters which control the shapes of sentence pieces.
Maximum length of a sentencepiece.
optional int32 max_sentencepiece_length = 20 [default = 16];- Specified by:
hasMaxSentencepieceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the maxSentencepieceLength field is set.
-
getMaxSentencepieceLength
public int getMaxSentencepieceLength()
SentencePiece parameters which control the shapes of sentence pieces.
Maximum length of a sentencepiece.
optional int32 max_sentencepiece_length = 20 [default = 16];- Specified by:
getMaxSentencepieceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The maxSentencepieceLength.
-
hasSplitByUnicodeScript
public boolean hasSplitByUnicodeScript()
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece is not allowed to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];- Specified by:
hasSplitByUnicodeScriptin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitByUnicodeScript field is set.
-
getSplitByUnicodeScript
public boolean getSplitByUnicodeScript()
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece is not allowed to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];- Specified by:
getSplitByUnicodeScriptin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitByUnicodeScript.
-
hasSplitByNumber
public boolean hasSplitByNumber()
When `split_by_number` is true, puts a boundary at each transition between a number and a non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];- Specified by:
hasSplitByNumberin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitByNumber field is set.
-
getSplitByNumber
public boolean getSplitByNumber()
When `split_by_number` is true, puts a boundary at each transition between a number and a non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];- Specified by:
getSplitByNumberin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitByNumber.
-
hasSplitByWhitespace
public boolean hasSplitByWhitespace()
Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain whitespace in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];- Specified by:
hasSplitByWhitespacein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitByWhitespace field is set.
-
getSplitByWhitespace
public boolean getSplitByWhitespace()
Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain whitespace in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];- Specified by:
getSplitByWhitespacein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitByWhitespace.
-
hasTreatWhitespaceAsSuffix
public boolean hasTreatWhitespaceAsSuffix()
Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];- Specified by:
hasTreatWhitespaceAsSuffixin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the treatWhitespaceAsSuffix field is set.
-
getTreatWhitespaceAsSuffix
public boolean getTreatWhitespaceAsSuffix()
Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix will add the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];- Specified by:
getTreatWhitespaceAsSuffixin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The treatWhitespaceAsSuffix.
-
hasAllowWhitespaceOnlyPieces
public boolean hasAllowWhitespaceOnlyPieces()
Allows pieces that contain only whitespace, instead of whitespace appearing only as a prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];- Specified by:
hasAllowWhitespaceOnlyPiecesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the allowWhitespaceOnlyPieces field is set.
-
getAllowWhitespaceOnlyPieces
public boolean getAllowWhitespaceOnlyPieces()
Allows pieces that contain only whitespace, instead of whitespace appearing only as a prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];- Specified by:
getAllowWhitespaceOnlyPiecesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The allowWhitespaceOnlyPieces.
-
hasSplitDigits
public boolean hasSplitDigits()Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];- Specified by:
hasSplitDigitsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitDigits field is set.
-
getSplitDigits
public boolean getSplitDigits()Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];- Specified by:
getSplitDigitsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitDigits.
-
hasPretokenizationDelimiter
public boolean hasPretokenizationDelimiter()
Defines the pre-tokenization delimiter. When specified, no piece crossing this delimiter is included in the vocab. The delimiter string is then virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Specified by:
hasPretokenizationDelimiterin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the pretokenizationDelimiter field is set.
-
getPretokenizationDelimiter
Defines the pre-tokenization delimiter. When specified, no piece crossing this delimiter is included in the vocab. The delimiter string is then virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Specified by:
getPretokenizationDelimiterin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The pretokenizationDelimiter.
-
getPretokenizationDelimiterBytes
public com.google.protobuf.ByteString getPretokenizationDelimiterBytes()
Defines the pre-tokenization delimiter. When specified, no piece crossing this delimiter is included in the vocab. The delimiter string is then virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Specified by:
getPretokenizationDelimiterBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for pretokenizationDelimiter.
-
getControlSymbolsList
public com.google.protobuf.ProtocolStringList getControlSymbolsList()
Vocabulary management
Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can be used to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but are visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but are segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsListin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- A list containing the controlSymbols.
-
getControlSymbolsCount
public int getControlSymbolsCount()
Vocabulary management
Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can be used to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but are visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but are segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsCountin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The count of controlSymbols.
-
getControlSymbols
Vocabulary management
Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can be used to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but are visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but are segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the element to return.- Returns:
- The controlSymbols at the given index.
-
getControlSymbolsBytes
public com.google.protobuf.ByteString getControlSymbolsBytes(int index)
Vocabulary management
Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can be used to encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but are visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but are segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the value to return.- Returns:
- The bytes of the controlSymbols at the given index.
-
getUserDefinedSymbolsList
public com.google.protobuf.ProtocolStringList getUserDefinedSymbolsList()
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as placeholders for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsListin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- A list containing the userDefinedSymbols.
-
getUserDefinedSymbolsCount
public int getUserDefinedSymbolsCount()
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as placeholders for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsCountin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The count of userDefinedSymbols.
-
getUserDefinedSymbols
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as placeholders for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the element to return.- Returns:
- The userDefinedSymbols at the given index.
-
getUserDefinedSymbolsBytes
public com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index)
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as placeholders for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the value to return.- Returns:
- The bytes of the userDefinedSymbols at the given index.
-
hasRequiredChars
public boolean hasRequiredChars()
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Specified by:
hasRequiredCharsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the requiredChars field is set.
-
getRequiredChars
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Specified by:
getRequiredCharsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The requiredChars.
-
getRequiredCharsBytes
public com.google.protobuf.ByteString getRequiredCharsBytes()
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Specified by:
getRequiredCharsBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for requiredChars.
-
hasByteFallback
public boolean hasByteFallback()Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];- Specified by:
hasByteFallbackin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the byteFallback field is set.
-
getByteFallback
public boolean getByteFallback()Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];- Specified by:
getByteFallbackin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The byteFallback.
-
hasVocabularyOutputPieceScore
public boolean hasVocabularyOutputPieceScore()When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];- Specified by:
hasVocabularyOutputPieceScorein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the vocabularyOutputPieceScore field is set.
-
getVocabularyOutputPieceScore
public boolean getVocabularyOutputPieceScore()When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];- Specified by:
getVocabularyOutputPieceScorein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The vocabularyOutputPieceScore.
-
hasHardVocabLimit
public boolean hasHardVocabLimit()
`vocab_size` is treated as a hard limit: training crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, vocab_size is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];- Specified by:
hasHardVocabLimitin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the hardVocabLimit field is set.
-
getHardVocabLimit
public boolean getHardVocabLimit()
`vocab_size` is treated as a hard limit: training crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, vocab_size is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];- Specified by:
getHardVocabLimitin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The hardVocabLimit.
-
hasUseAllVocab
public boolean hasUseAllVocab()
Uses all symbols for vocab extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];- Specified by:
hasUseAllVocabin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the useAllVocab field is set.
-
getUseAllVocab
public boolean getUseAllVocab()
Uses all symbols for vocab extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];- Specified by:
getUseAllVocabin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The useAllVocab.
-
hasUnkId
public boolean hasUnkId()
Reserved special meta tokens.
* An id of -1 means the token is not used.
* unk_id must not be -1.
Ids must start with 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];- Specified by:
hasUnkIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the unkId field is set.
-
getUnkId
public int getUnkId()
Reserved special meta tokens.
* An id of -1 means the token is not used.
* unk_id must not be -1.
Ids must start with 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];- Specified by:
getUnkIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The unkId.
-
hasBosId
public boolean hasBosId()
<s>
optional int32 bos_id = 41 [default = 1];- Specified by:
hasBosIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the bosId field is set.
-
getBosId
public int getBosId()
<s>
optional int32 bos_id = 41 [default = 1];- Specified by:
getBosIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bosId.
-
hasEosId
public boolean hasEosId()
</s>
optional int32 eos_id = 42 [default = 2];- Specified by:
hasEosIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the eosId field is set.
-
getEosId
public int getEosId()
</s>
optional int32 eos_id = 42 [default = 2];- Specified by:
getEosIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The eosId.
-
hasPadId
public boolean hasPadId()
<pad> (padding)
optional int32 pad_id = 43 [default = -1];- Specified by:
hasPadIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the padId field is set.
-
getPadId
public int getPadId()
<pad> (padding)
optional int32 pad_id = 43 [default = -1];- Specified by:
getPadIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The padId.
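The constraints stated for the reserved ids (unk_id must not be -1, and ids start with 0 and are contiguous) can be checked before training. The sketch below is a literal reading of those two rules, with -1 taken to mean "disabled". The class and method names are hypothetical, and the real trainer may apply different or additional checks.

```java
import java.util.Arrays;

final class ReservedIdCheck {
    // Returns true when unk_id is enabled and the enabled ids (those not
    // equal to -1) are exactly 0, 1, ..., k-1 with no gaps or duplicates.
    static boolean isValid(int unkId, int bosId, int eosId, int padId) {
        if (unkId == -1) {
            return false; // unk_id must not be -1
        }
        int[] enabled = Arrays.stream(new int[] {unkId, bosId, eosId, padId})
                .filter(id -> id != -1)
                .sorted()
                .toArray();
        for (int i = 0; i < enabled.length; i++) {
            if (enabled[i] != i) {
                return false; // gap, duplicate, or id that does not start at 0
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Defaults from this page: unk_id 0, bos_id 1, eos_id 2, pad_id -1.
        System.out.println(isValid(0, 1, 2, -1));
    }
}
```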
-
hasUnkPiece
public boolean hasUnkPiece()optional string unk_piece = 45 [default = "<unk>"];- Specified by:
hasUnkPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the unkPiece field is set.
-
getUnkPiece
optional string unk_piece = 45 [default = "<unk>"];- Specified by:
getUnkPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The unkPiece.
-
getUnkPieceBytes
public com.google.protobuf.ByteString getUnkPieceBytes()optional string unk_piece = 45 [default = "<unk>"];- Specified by:
getUnkPieceBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for unkPiece.
-
hasBosPiece
public boolean hasBosPiece()optional string bos_piece = 46 [default = "<s>"];- Specified by:
hasBosPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the bosPiece field is set.
-
getBosPiece
optional string bos_piece = 46 [default = "<s>"];- Specified by:
getBosPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bosPiece.
-
getBosPieceBytes
public com.google.protobuf.ByteString getBosPieceBytes()optional string bos_piece = 46 [default = "<s>"];- Specified by:
getBosPieceBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for bosPiece.
-
hasEosPiece
public boolean hasEosPiece()optional string eos_piece = 47 [default = "</s>"];- Specified by:
hasEosPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the eosPiece field is set.
-
getEosPiece
optional string eos_piece = 47 [default = "</s>"];- Specified by:
getEosPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The eosPiece.
-
getEosPieceBytes
public com.google.protobuf.ByteString getEosPieceBytes()optional string eos_piece = 47 [default = "</s>"];- Specified by:
getEosPieceBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for eosPiece.
-
hasPadPiece
public boolean hasPadPiece()optional string pad_piece = 48 [default = "<pad>"];- Specified by:
hasPadPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the padPiece field is set.
-
getPadPiece
optional string pad_piece = 48 [default = "<pad>"];- Specified by:
getPadPiecein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The padPiece.
-
getPadPieceBytes
public com.google.protobuf.ByteString getPadPieceBytes()optional string pad_piece = 48 [default = "<pad>"];- Specified by:
getPadPieceBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for padPiece.
-
hasUnkSurface
public boolean hasUnkSurface()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character is useful for both users and developers: it makes it easy to see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];- Specified by:
hasUnkSurfacein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the unkSurface field is set.
-
getUnkSurface
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character is useful for both users and developers: it makes it easy to see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];- Specified by:
getUnkSurfacein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The unkSurface.
-
getUnkSurfaceBytes
public com.google.protobuf.ByteString getUnkSurfaceBytes()
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character is useful for both users and developers: it makes it easy to see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];- Specified by:
getUnkSurfaceBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for unkSurface.
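The default value of `unk_surface` is written with octal escapes, which can be hard to read. The sketch below decodes those bytes as UTF-8 to show that they spell U+2047 with one space on each side. The class and method names are hypothetical and only the JDK is used.

```java
import java.nio.charset.StandardCharsets;

final class UnkSurfaceDefault {
    // Rebuilds the default " \342\201\207 " from its raw bytes:
    // a space, the three octal-escaped bytes, and a trailing space.
    static String decodeDefault() {
        byte[] utf8 = {' ', (byte) 0342, (byte) 0201, (byte) 0207, ' '};
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String surface = decodeDefault();
        // Prints the code point of the middle character in hexadecimal.
        System.out.println(Integer.toHexString(surface.codePointAt(1)));
    }
}
```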
-
hasTrainExtremelyLargeCorpus
public boolean hasTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];- Specified by:
hasTrainExtremelyLargeCorpusin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the trainExtremelyLargeCorpus field is set.
-
getTrainExtremelyLargeCorpus
public boolean getTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];- Specified by:
getTrainExtremelyLargeCorpusin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The trainExtremelyLargeCorpus.
-
hasSeedSentencepiecesFile
public boolean hasSeedSentencepiecesFile()
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency pair per line.
optional string seed_sentencepieces_file = 54 [default = ""];- Specified by:
hasSeedSentencepiecesFilein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the seedSentencepiecesFile field is set.
-
getSeedSentencepiecesFile
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency pair per line.
optional string seed_sentencepieces_file = 54 [default = ""];- Specified by:
getSeedSentencepiecesFilein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The seedSentencepiecesFile.
-
getSeedSentencepiecesFileBytes
public com.google.protobuf.ByteString getSeedSentencepiecesFileBytes()
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency pair per line.
optional string seed_sentencepieces_file = 54 [default = ""];- Specified by:
getSeedSentencepiecesFileBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for seedSentencepiecesFile.
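The file layout described for `seed_sentencepieces_file` (one piece, a tab, and a frequency per line) can be sketched with a small parser. This is an illustration only: the class and method names are hypothetical, the frequency is assumed to be an integer, and blank lines are assumed to be skippable. The actual trainer may be stricter or more lenient.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SeedPiecesParser {
    // Parses lines of the form "<piece>\t<frequency>" into an ordered map.
    // The last tab on the line is used as the separator.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> seeds = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no entry
            }
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("missing tab separator: " + line);
            }
            String piece = line.substring(0, tab);
            long frequency = Long.parseLong(line.substring(tab + 1).trim());
            seeds.put(piece, frequency);
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Example lines are made up for illustration.
        System.out.println(parse(List.of("the\t120", "ing\t45")));
    }
}
```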
-
isInitialized
public final boolean isInitialized()- Specified by:
isInitializedin interfacecom.google.protobuf.MessageLiteOrBuilder- Overrides:
isInitializedin classcom.google.protobuf.GeneratedMessageV3.ExtendableMessage<SentencepieceModel.TrainerSpec>
-
writeTo
- Specified by:
writeToin interfacecom.google.protobuf.MessageLite- Overrides:
writeToin classcom.google.protobuf.GeneratedMessageV3- Throws:
IOException
-
getSerializedSize
public int getSerializedSize()- Specified by:
getSerializedSizein interfacecom.google.protobuf.MessageLite- Overrides:
getSerializedSizein classcom.google.protobuf.GeneratedMessageV3
-
equals
- Specified by:
equalsin interfacecom.google.protobuf.Message- Overrides:
equalsin classcom.google.protobuf.AbstractMessage
-
hashCode
public int hashCode()- Specified by:
hashCodein interfacecom.google.protobuf.Message- Overrides:
hashCodein classcom.google.protobuf.AbstractMessage
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(ByteBuffer data) throws com.google.protobuf.InvalidProtocolBufferException
- Throws:
com.google.protobuf.InvalidProtocolBufferException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(ByteBuffer data, com.google.protobuf.ExtensionRegistryLite extensionRegistry) throws com.google.protobuf.InvalidProtocolBufferException
- Throws:
com.google.protobuf.InvalidProtocolBufferException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(com.google.protobuf.ByteString data) throws com.google.protobuf.InvalidProtocolBufferException
- Throws:
com.google.protobuf.InvalidProtocolBufferException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(com.google.protobuf.ByteString data, com.google.protobuf.ExtensionRegistryLite extensionRegistry) throws com.google.protobuf.InvalidProtocolBufferException
- Throws:
com.google.protobuf.InvalidProtocolBufferException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(byte[] data) throws com.google.protobuf.InvalidProtocolBufferException
- Throws:
com.google.protobuf.InvalidProtocolBufferException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(byte[] data, com.google.protobuf.ExtensionRegistryLite extensionRegistry) throws com.google.protobuf.InvalidProtocolBufferException
- Throws:
com.google.protobuf.InvalidProtocolBufferException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(InputStream input) throws IOException
- Throws:
IOException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(InputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry) throws IOException
- Throws:
IOException
-
parseDelimitedFrom
public static SentencepieceModel.TrainerSpec parseDelimitedFrom(InputStream input) throws IOException
- Throws:
IOException
-
parseDelimitedFrom
public static SentencepieceModel.TrainerSpec parseDelimitedFrom(InputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry) throws IOException
- Throws:
IOException
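The parseDelimitedFrom overloads differ from parseFrom(InputStream) in that they expect each message to be preceded by its size, encoded as a base-128 varint, so several messages can share one stream. The following is a minimal, library-free sketch of that framing only. The class and method names (DelimitedFramingSketch, writeDelimited, readDelimited) are hypothetical and are not part of the protobuf API, and the payload here is an opaque byte array rather than a serialized TrainerSpec. Returning null at a clean end of stream is meant to mirror how parseDelimitedFrom is generally understood to behave.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration of varint length-prefixed framing; not protobuf code.
public class DelimitedFramingSketch {

    // Writes the payload length as a base-128 varint, then the payload itself.
    static void writeDelimited(byte[] payload, OutputStream out) throws IOException {
        int n = payload.length;
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80); // low 7 bits, continuation bit set
            n >>>= 7;
        }
        out.write(n);                     // final byte, continuation bit clear
        out.write(payload);
    }

    // Reads one frame. Returns null if the stream is already at its end.
    static byte[] readDelimited(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;                  // clean end of stream, no more frames
        }
        int length = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            if (b == -1) {
                throw new EOFException("truncated length prefix");
            }
            if (shift >= 32) {
                throw new IOException("length prefix too long");
            }
            length |= (b & 0x7F) << shift;
            shift += 7;
        }
        if (length < 0) {
            throw new IOException("negative frame length");
        }
        byte[] payload = in.readNBytes(length); // requires Java 11 or later
        if (payload.length != length) {
            throw new EOFException("truncated payload");
        }
        return payload;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeDelimited(new byte[3], out);
        writeDelimited(new byte[300], out);

        InputStream in = new ByteArrayInputStream(out.toByteArray());
        byte[] frame;
        while ((frame = readDelimited(in)) != null) {
            System.out.println(frame.length); // prints 3, then 300
        }
    }
}
```

With the real API, the equivalent loop would presumably call SentencepieceModel.TrainerSpec.parseDelimitedFrom(input) repeatedly until it returns null, with the stream having been produced by writeDelimitedTo on each message.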
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(com.google.protobuf.CodedInputStream input) throws IOException
- Throws:
IOException
-
parseFrom
public static SentencepieceModel.TrainerSpec parseFrom(com.google.protobuf.CodedInputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry) throws IOException
- Throws:
IOException
-
newBuilderForType
public SentencepieceModel.TrainerSpec.Builder newBuilderForType()
- Specified by:
newBuilderForType in interface com.google.protobuf.Message
- Specified by:
newBuilderForType in interface com.google.protobuf.MessageLite
-
newBuilder
public static SentencepieceModel.TrainerSpec.Builder newBuilder()
-
newBuilder
public static SentencepieceModel.TrainerSpec.Builder newBuilder(SentencepieceModel.TrainerSpec prototype)
-
toBuilder
public SentencepieceModel.TrainerSpec.Builder toBuilder()
- Specified by:
toBuilder in interface com.google.protobuf.Message
- Specified by:
toBuilder in interface com.google.protobuf.MessageLite
-
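getDefaultInstance
public static SentencepieceModel.TrainerSpec getDefaultInstance()
-
parser
public static com.google.protobuf.Parser<SentencepieceModel.TrainerSpec> parser()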
-
getParserForType
public com.google.protobuf.Parser<SentencepieceModel.TrainerSpec> getParserForType()
- Specified by:
getParserForType in interface com.google.protobuf.Message
- Specified by:
getParserForType in interface com.google.protobuf.MessageLite
- Overrides:
getParserForType in class com.google.protobuf.GeneratedMessageV3
-
getDefaultInstanceForType
public SentencepieceModel.TrainerSpec getDefaultInstanceForType()
- Specified by:
getDefaultInstanceForType in interface com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>
- Specified by:
getDefaultInstanceForType in interface com.google.protobuf.MessageLiteOrBuilder
- Specified by:
getDefaultInstanceForType in interface com.google.protobuf.MessageOrBuilder
-