Package com.google.genai.proto
Class SentencepieceModel.TrainerSpec.Builder
java.lang.Object
com.google.protobuf.AbstractMessageLite.Builder
com.google.protobuf.AbstractMessage.Builder<BuilderT>
com.google.protobuf.GeneratedMessageV3.Builder<BuilderT>
com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
com.google.genai.proto.SentencepieceModel.TrainerSpec.Builder
- All Implemented Interfaces:
SentencepieceModel.TrainerSpecOrBuilder, com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>, com.google.protobuf.Message.Builder, com.google.protobuf.MessageLite.Builder, com.google.protobuf.MessageLiteOrBuilder, com.google.protobuf.MessageOrBuilder, Cloneable
- Enclosing class:
- SentencepieceModel.TrainerSpec
public static final class SentencepieceModel.TrainerSpec.Builder
extends com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
implements SentencepieceModel.TrainerSpecOrBuilder
TrainerSpec encodes various parameters for SentencePiece training. Next id: 55
Protobuf type com.google.genai.proto.TrainerSpec
-
Method Summary
Modifier and Type    Method    Description
addAcceptLanguage(String value) List of the languages this model can accept.
addAcceptLanguageBytes(com.google.protobuf.ByteString value) List of the languages this model can accept.
addAllAcceptLanguage(Iterable<String> values) List of the languages this model can accept.
addAllControlSymbols(Iterable<String> values) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
addAllInput(Iterable<String> values) General parameters. Input corpus files.
addAllUserDefinedSymbols(Iterable<String> values) Defines user defined symbols.
addControlSymbols(String value) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
addControlSymbolsBytes(com.google.protobuf.ByteString value) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
addExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, List<Type>> extension, Type value)
General parameters. Input corpus files.
addInputBytes(com.google.protobuf.ByteString value) General parameters. Input corpus files.
addRepeatedField(com.google.protobuf.Descriptors.FieldDescriptor field, Object value)
addUserDefinedSymbols(String value) Defines user defined symbols.
addUserDefinedSymbolsBytes(com.google.protobuf.ByteString value) Defines user defined symbols.
build()
clear()
List of the languages this model can accept.
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
<s>
optional string bos_piece = 46 [default = "<s>"];
Decomposes unknown pieces into UTF-8 bytes.
Training parameters.
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
Clipping threshold to apply after adding noise.
Set these parameters if you need DP version of sentencepiece.
Whether to use DP version of sentencepiece.
</s>
optional string eos_piece = 47 [default = "</s>"];
clearExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, T> extension)
clearField(com.google.protobuf.Descriptors.FieldDescriptor field)
`vocab_size` is treated as hard limit.
General parameters. Input corpus files.
Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
Maximum size of sentences the trainer loads from `input` parameter.
The maximum sentence length in byte.
SentencePiece parameters which control the shapes of sentence piece.
Deprecated.
Output model file prefix.
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
Number of EM sub iterations.
Number of threads in the training.
clearOneof(com.google.protobuf.Descriptors.OneofDescriptor oneof)
<pad> (padding)
optional string pad_piece = 48 [default = "<pad>"];
Defines the pre-tokenization delimiter.
Defines required characters.
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
The size of seed sentencepieces.
Size of self-test samples, which are encoded in the model file.
In every EM sub-iterations, keeps top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
optional bool shuffle_input_sentence = 19 [default = true];
When `split_by_number` is true, put a boundary between number and non-number transition.
Uses Unicode script to split sentence pieces.
Use a white space to split sentence pieces.
Split all digits (0-9) into separate pieces.
Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
Deprecated.
Adds whitespace symbol (_) as a suffix instead of prefix.
Reserved special meta tokens.
optional string unk_piece = 45 [default = "<unk>"];
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
use all symbols for vocab extraction.
Defines user defined symbols.
Vocabulary size.
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
clone()
getAcceptLanguage(int index) List of the languages this model can accept.
com.google.protobuf.ByteString getAcceptLanguageBytes(int index) List of the languages this model can accept.
int List of the languages this model can accept.
com.google.protobuf.ProtocolStringList List of the languages this model can accept.
boolean Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
int getBosId() <s>
optional string bos_piece = 46 [default = "<s>"];
com.google.protobuf.ByteString optional string bos_piece = 46 [default = "<s>"];
boolean Decomposes unknown pieces into UTF-8 bytes.
float Training parameters.
getControlSymbols(int index) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
com.google.protobuf.ByteString getControlSymbolsBytes(int index) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
int Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
com.google.protobuf.ProtocolStringList Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
static final com.google.protobuf.Descriptors.Descriptor
com.google.protobuf.Descriptors.Descriptor
long Clipping threshold to apply after adding noise.
float Set these parameters if you need DP version of sentencepiece.
boolean Whether to use DP version of sentencepiece.
int getEosId() </s>
optional string eos_piece = 47 [default = "</s>"];
com.google.protobuf.ByteString optional string eos_piece = 47 [default = "</s>"];
boolean `vocab_size` is treated as hard limit.
getInput(int index) General parameters. Input corpus files.
com.google.protobuf.ByteString getInputBytes(int index) General parameters. Input corpus files.
int General parameters. Input corpus files.
Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
com.google.protobuf.ByteString Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
com.google.protobuf.ProtocolStringList General parameters. Input corpus files.
long Maximum size of sentences the trainer loads from `input` parameter.
int The maximum sentence length in byte.
int SentencePiece parameters which control the shapes of sentence piece.
int Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated.
Output model file prefix.
com.google.protobuf.ByteString Output model file prefix.
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
int Number of EM sub iterations.
int Number of threads in the training.
int getPadId() <pad> (padding)
optional string pad_piece = 48 [default = "<pad>"];
com.google.protobuf.ByteString optional string pad_piece = 48 [default = "<pad>"];
Defines the pre-tokenization delimiter.
com.google.protobuf.ByteString Defines the pre-tokenization delimiter.
Defines required characters.
com.google.protobuf.ByteString Defines required characters.
Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
com.google.protobuf.ByteString Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
int The size of seed sentencepieces.
int Size of self-test samples, which are encoded in the model file.
float In every EM sub-iterations, keeps top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
boolean optional bool shuffle_input_sentence = 19 [default = true];
boolean When `split_by_number` is true, put a boundary between number and non-number transition.
boolean Uses Unicode script to split sentence pieces.
boolean Use a white space to split sentence pieces.
boolean Split all digits (0-9) into separate pieces.
boolean Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
int Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated.
boolean Adds whitespace symbol (_) as a suffix instead of prefix.
int getUnkId() Reserved special meta tokens.
optional string unk_piece = 45 [default = "<unk>"];
com.google.protobuf.ByteString optional string unk_piece = 45 [default = "<unk>"];
Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
com.google.protobuf.ByteString Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
boolean use all symbols for vocab extraction.
getUserDefinedSymbols(int index) Defines user defined symbols.
com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index) Defines user defined symbols.
int Defines user defined symbols.
com.google.protobuf.ProtocolStringList Defines user defined symbols.
int Vocabulary size.
boolean When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
boolean Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
boolean hasBosId() <s>
boolean optional string bos_piece = 46 [default = "<s>"];
boolean Decomposes unknown pieces into UTF-8 bytes.
boolean Training parameters.
boolean Clipping threshold to apply after adding noise.
boolean Set these parameters if you need DP version of sentencepiece.
boolean Whether to use DP version of sentencepiece.
boolean hasEosId() </s>
boolean optional string eos_piece = 47 [default = "</s>"];
boolean `vocab_size` is treated as hard limit.
boolean Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
boolean Maximum size of sentences the trainer loads from `input` parameter.
boolean The maximum sentence length in byte.
boolean SentencePiece parameters which control the shapes of sentence piece.
boolean Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated.
boolean Output model file prefix.
boolean optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
boolean Number of EM sub iterations.
boolean Number of threads in the training.
boolean hasPadId() <pad> (padding)
boolean optional string pad_piece = 48 [default = "<pad>"];
boolean Defines the pre-tokenization delimiter.
boolean Defines required characters.
boolean Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
boolean The size of seed sentencepieces.
boolean Size of self-test samples, which are encoded in the model file.
boolean In every EM sub-iterations, keeps top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
boolean optional bool shuffle_input_sentence = 19 [default = true];
boolean When `split_by_number` is true, put a boundary between number and non-number transition.
boolean Uses Unicode script to split sentence pieces.
boolean Use a white space to split sentence pieces.
boolean Split all digits (0-9) into separate pieces.
boolean Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
boolean Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated.
boolean Adds whitespace symbol (_) as a suffix instead of prefix.
boolean hasUnkId() Reserved special meta tokens.
boolean optional string unk_piece = 45 [default = "<unk>"];
boolean Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
boolean use all symbols for vocab extraction.
boolean Vocabulary size.
boolean When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
final boolean
mergeFrom(com.google.protobuf.CodedInputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry)
mergeFrom(com.google.protobuf.Message other)
mergeUnknownFields(com.google.protobuf.UnknownFieldSet unknownFields)
setAcceptLanguage(int index, String value) List of the languages this model can accept.
setAllowWhitespaceOnlyPieces(boolean value) Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
setBosId(int value) <s>
setBosPiece(String value) optional string bos_piece = 46 [default = "<s>"];
setBosPieceBytes(com.google.protobuf.ByteString value) optional string bos_piece = 46 [default = "<s>"];
setByteFallback(boolean value) Decomposes unknown pieces into UTF-8 bytes.
setCharacterCoverage(float value) Training parameters.
setControlSymbols(int index, String value) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder.
setDifferentialPrivacyClippingThreshold(long value) Clipping threshold to apply after adding noise.
setDifferentialPrivacyNoiseLevel(float value) Set these parameters if you need DP version of sentencepiece.
setEnableDifferentialPrivacy(boolean value) Whether to use DP version of sentencepiece.
setEosId(int value) </s>
setEosPiece(String value) optional string eos_piece = 47 [default = "</s>"];
setEosPieceBytes(com.google.protobuf.ByteString value) optional string eos_piece = 47 [default = "</s>"];
setExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, List<Type>> extension, int index, Type value)
setExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, Type> extension, Type value)
setHardVocabLimit(boolean value) `vocab_size` is treated as hard limit.
General parameters. Input corpus files.
setInputFormat(String value) Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
setInputFormatBytes(com.google.protobuf.ByteString value) Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
setInputSentenceSize(long value) Maximum size of sentences the trainer loads from `input` parameter.
setMaxSentenceLength(int value) The maximum sentence length in byte.
setMaxSentencepieceLength(int value) SentencePiece parameters which control the shapes of sentence piece.
setMiningSentenceSize(int value) Deprecated.
setModelPrefix(String value) Output model file prefix.
setModelPrefixBytes(com.google.protobuf.ByteString value) Output model file prefix.
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];
setNumSubIterations(int value) Number of EM sub iterations.
setNumThreads(int value) Number of threads in the training.
setPadId(int value) <pad> (padding)
setPadPiece(String value) optional string pad_piece = 48 [default = "<pad>"];
setPadPieceBytes(com.google.protobuf.ByteString value) optional string pad_piece = 48 [default = "<pad>"];
Defines the pre-tokenization delimiter.
setPretokenizationDelimiterBytes(com.google.protobuf.ByteString value) Defines the pre-tokenization delimiter.
setRepeatedField(com.google.protobuf.Descriptors.FieldDescriptor field, int index, Object value)
setRequiredChars(String value) Defines required characters.
setRequiredCharsBytes(com.google.protobuf.ByteString value) Defines required characters.
setSeedSentencepiecesFile(String value) Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
setSeedSentencepiecesFileBytes(com.google.protobuf.ByteString value) Path to a seed sentencepieces file, with one tab-separated seed sentencepiece <tab> frequency per line.
setSeedSentencepieceSize(int value) The size of seed sentencepieces.
setSelfTestSampleSize(int value) Size of self-test samples, which are encoded in the model file.
setShrinkingFactor(float value) In every EM sub-iterations, keeps top `shrinking_factor` * `current sentencepieces size` with respect to the loss of the sentence piece.
setShuffleInputSentence(boolean value) optional bool shuffle_input_sentence = 19 [default = true];
setSplitByNumber(boolean value) When `split_by_number` is true, put a boundary between number and non-number transition.
setSplitByUnicodeScript(boolean value) Uses Unicode script to split sentence pieces.
setSplitByWhitespace(boolean value) Use a white space to split sentence pieces.
setSplitDigits(boolean value) Split all digits (0-9) into separate pieces.
setTrainExtremelyLargeCorpus(boolean value) Increase bit depth to allow unigram model training on large (>10M sentences) corpora.
setTrainingSentenceSize(int value) Deprecated.
setTreatWhitespaceAsSuffix(boolean value) Adds whitespace symbol (_) as a suffix instead of prefix.
setUnkId(int value) Reserved special meta tokens.
setUnknownFields(com.google.protobuf.UnknownFieldSet unknownFields)
setUnkPiece(String value) optional string unk_piece = 45 [default = "<unk>"];
setUnkPieceBytes(com.google.protobuf.ByteString value) optional string unk_piece = 45 [default = "<unk>"];
setUnkSurface(String value) Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
setUnkSurfaceBytes(com.google.protobuf.ByteString value) Encodes <unk> into U+2047 (DOUBLE QUESTION MARK), since this character can be useful both for user and developer.
setUseAllVocab(boolean value) use all symbols for vocab extraction.
setUserDefinedSymbols(int index, String value) Defines user defined symbols.
setVocabSize(int value) Vocabulary size.
setVocabularyOutputPieceScore(boolean value) When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
Methods inherited from class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder
addExtension, addExtension, clearExtension, clearExtension, getAllFields, getExtension, getExtension, getExtension, getExtension, getExtension, getExtension, getExtensionCount, getExtensionCount, getExtensionCount, getField, getFieldBuilder, getRepeatedField, getRepeatedFieldBuilder, getRepeatedFieldCount, hasExtension, hasExtension, hasExtension, hasField, newBuilderForField, setExtension, setExtension, setExtension, setExtension
Methods inherited from class com.google.protobuf.GeneratedMessageV3.Builder
getOneofFieldDescriptor, getUnknownFields, hasOneof
Methods inherited from class com.google.protobuf.AbstractMessage.Builder
findInitializationErrors, getInitializationErrorString, mergeFrom, mergeFrom, mergeFrom, mergeFrom, mergeFrom, mergeFrom, mergeFrom, mergeFrom, mergeFrom, toString
Methods inherited from class com.google.protobuf.AbstractMessageLite.Builder
mergeDelimitedFrom, mergeDelimitedFrom, mergeFrom
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder
getExtension, getExtension, getExtension, getExtension, getExtension, getExtension, getExtensionCount, getExtensionCount, getExtensionCount, hasExtension, hasExtension, hasExtension
Methods inherited from interface com.google.protobuf.Message.Builder
mergeDelimitedFrom, mergeDelimitedFrom
Methods inherited from interface com.google.protobuf.MessageLite.Builder
mergeFrom
Methods inherited from interface com.google.protobuf.MessageOrBuilder
findInitializationErrors, getAllFields, getField, getInitializationErrorString, getOneofFieldDescriptor, getRepeatedField, getRepeatedFieldCount, getUnknownFields, hasField, hasOneof
-
Method Details
-
getDescriptor
public static final com.google.protobuf.Descriptors.Descriptor getDescriptor()
-
clear
- Specified by:
clear in interface com.google.protobuf.Message.Builder
- Specified by:
clear in interface com.google.protobuf.MessageLite.Builder
- Overrides:
clear in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
getDescriptorForType
public com.google.protobuf.Descriptors.Descriptor getDescriptorForType()
- Specified by:
getDescriptorForType in interface com.google.protobuf.Message.Builder
- Specified by:
getDescriptorForType in interface com.google.protobuf.MessageOrBuilder
- Overrides:
getDescriptorForType in class com.google.protobuf.GeneratedMessageV3.Builder<SentencepieceModel.TrainerSpec.Builder>
-
getDefaultInstanceForType
- Specified by:
getDefaultInstanceForType in interface com.google.protobuf.GeneratedMessageV3.ExtendableMessageOrBuilder<SentencepieceModel.TrainerSpec>
- Specified by:
getDefaultInstanceForType in interface com.google.protobuf.MessageLiteOrBuilder
- Specified by:
getDefaultInstanceForType in interface com.google.protobuf.MessageOrBuilder
-
build
- Specified by:
build in interface com.google.protobuf.Message.Builder
- Specified by:
build in interface com.google.protobuf.MessageLite.Builder
-
buildPartial
- Specified by:
buildPartial in interface com.google.protobuf.Message.Builder
- Specified by:
buildPartial in interface com.google.protobuf.MessageLite.Builder
-
clone
- Specified by:
clone in interface com.google.protobuf.Message.Builder
- Specified by:
clone in interface com.google.protobuf.MessageLite.Builder
- Overrides:
clone in class com.google.protobuf.GeneratedMessageV3.Builder<SentencepieceModel.TrainerSpec.Builder>
-
setField
public SentencepieceModel.TrainerSpec.Builder setField(com.google.protobuf.Descriptors.FieldDescriptor field, Object value)
- Specified by:
setField in interface com.google.protobuf.Message.Builder
- Overrides:
setField in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
clearField
public SentencepieceModel.TrainerSpec.Builder clearField(com.google.protobuf.Descriptors.FieldDescriptor field)
- Specified by:
clearField in interface com.google.protobuf.Message.Builder
- Overrides:
clearField in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
clearOneof
public SentencepieceModel.TrainerSpec.Builder clearOneof(com.google.protobuf.Descriptors.OneofDescriptor oneof)
- Specified by:
clearOneof in interface com.google.protobuf.Message.Builder
- Overrides:
clearOneof in class com.google.protobuf.GeneratedMessageV3.Builder<SentencepieceModel.TrainerSpec.Builder>
-
setRepeatedField
public SentencepieceModel.TrainerSpec.Builder setRepeatedField(com.google.protobuf.Descriptors.FieldDescriptor field, int index, Object value)
- Specified by:
setRepeatedField in interface com.google.protobuf.Message.Builder
- Overrides:
setRepeatedField in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
addRepeatedField
public SentencepieceModel.TrainerSpec.Builder addRepeatedField(com.google.protobuf.Descriptors.FieldDescriptor field, Object value)
- Specified by:
addRepeatedField in interface com.google.protobuf.Message.Builder
- Overrides:
addRepeatedField in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
setExtension
public <Type> SentencepieceModel.TrainerSpec.Builder setExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, Type> extension, Type value)
- Overrides:
setExtension in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
setExtension
public <Type> SentencepieceModel.TrainerSpec.Builder setExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, List<Type>> extension, int index, Type value)
- Overrides:
setExtension in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
addExtension
public <Type> SentencepieceModel.TrainerSpec.Builder addExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, List<Type>> extension, Type value)
- Overrides:
addExtension in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
clearExtension
public <T> SentencepieceModel.TrainerSpec.Builder clearExtension(com.google.protobuf.GeneratedMessage.GeneratedExtension<SentencepieceModel.TrainerSpec, T> extension)
- Overrides:
clearExtension in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
mergeFrom
- Specified by:
mergeFrom in interface com.google.protobuf.Message.Builder
- Overrides:
mergeFrom in class com.google.protobuf.AbstractMessage.Builder<SentencepieceModel.TrainerSpec.Builder>
-
mergeFrom
-
isInitialized
public final boolean isInitialized()
- Specified by:
isInitialized in interface com.google.protobuf.MessageLiteOrBuilder
- Overrides:
isInitialized in class com.google.protobuf.GeneratedMessageV3.ExtendableBuilder<SentencepieceModel.TrainerSpec,SentencepieceModel.TrainerSpec.Builder>
-
mergeFrom
public SentencepieceModel.TrainerSpec.Builder mergeFrom(com.google.protobuf.CodedInputStream input, com.google.protobuf.ExtensionRegistryLite extensionRegistry) throws IOException
- Specified by:
mergeFrom in interface com.google.protobuf.Message.Builder
- Specified by:
mergeFrom in interface com.google.protobuf.MessageLite.Builder
- Overrides:
mergeFrom in class com.google.protobuf.AbstractMessage.Builder<SentencepieceModel.TrainerSpec.Builder>
- Throws:
IOException
-
getInputList
public com.google.protobuf.ProtocolStringList getInputList()
General parameters.
Input corpus files. The trainer accepts the following two formats:
A) Monolingual: plain text, one sentence per line.
B) Bilingual: TSV, source sentence <tab> target sentence.
When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Specified by:
getInputList in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
A list containing the input.
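As a rough illustration of the two corpus layouts described for `input`, the sketch below classifies a raw line as monolingual or bilingual by looking for a tab. The class `CorpusLine` and its methods are hypothetical names made up for this example; they are not part of this package or of the SentencePiece trainer, whose own parsing may differ.

```java
import java.util.List;

/** Hypothetical helper for illustration only; not part of com.google.genai.proto. */
public final class CorpusLine {
    public final String source;
    public final String target; // null for monolingual lines

    private CorpusLine(String source, String target) {
        this.source = source;
        this.target = target;
    }

    /** Splits on the first tab; a line without a tab is treated as monolingual. */
    public static CorpusLine parse(String rawLine) {
        int tab = rawLine.indexOf('\t');
        if (tab < 0) {
            return new CorpusLine(rawLine, null);
        }
        return new CorpusLine(rawLine.substring(0, tab), rawLine.substring(tab + 1));
    }

    public boolean isBilingual() {
        return target != null;
    }

    public static void main(String[] args) {
        // One line in format A (monolingual) and one in format B (bilingual TSV).
        for (String line : List.of("Hello world.", "Hello world.\tBonjour le monde.")) {
            CorpusLine parsed = parse(line);
            if (parsed.isBilingual()) {
                System.out.println("bilingual: " + parsed.source + " | " + parsed.target);
            } else {
                System.out.println("monolingual: " + parsed.source);
            }
        }
    }
}
```

Splitting on the first tab only is an assumption of this sketch; it keeps any further tabs inside the target sentence.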
-
getInputCount
public int getInputCount()
General parameters.
Input corpus files. The trainer accepts the following two formats:
A) Monolingual: plain text, one sentence per line.
B) Bilingual: TSV, source sentence <tab> target sentence.
When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Specified by:
getInputCount in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
The count of input.
-
getInput
General parameters.
Input corpus files. The trainer accepts the following two formats:
A) Monolingual: plain text, one sentence per line.
B) Bilingual: TSV, source sentence <tab> target sentence.
When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Specified by:
getInput in interface SentencepieceModel.TrainerSpecOrBuilder
- Parameters:
index - The index of the element to return.
- Returns:
The input at the given index.
-
getInputBytes
public com.google.protobuf.ByteString getInputBytes(int index)
General parameters.
Input corpus files. The trainer accepts the following two formats:
A) Monolingual: plain text, one sentence per line.
B) Bilingual: TSV, source sentence <tab> target sentence.
When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;
- Specified by:
getInputBytes in interface SentencepieceModel.TrainerSpecOrBuilder
- Parameters:
index - The index of the value to return.
- Returns:
The bytes of the input at the given index.
-
setInput
///////////////////////////////////////////////////////////////// General parameters Input corpus files. Trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence When bilingual data is passed, shared vocabulary model is built. Note that the input file must be raw corpus, not a preprocessed corpus. Trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;- Parameters:
index- The index to set the value at.value- The input to set.- Returns:
- This builder for chaining.
-
addInput
///////////////////////////////////////////////////////////////// General parameters Input corpus files. Trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence When bilingual data is passed, shared vocabulary model is built. Note that the input file must be raw corpus, not a preprocessed corpus. Trainer only loads the first `input_sentence_size` sentences specified with this parameter.
repeated string input = 1;- Parameters:
value- The input to add.- Returns:
- This builder for chaining.
-
addAllInput
General parameters. Input corpus files. The trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer loads only the first `input_sentence_size` sentences from the files specified with this parameter.
repeated string input = 1;- Parameters:
values- The input to add.- Returns:
- This builder for chaining.
-
clearInput
General parameters. Input corpus files. The trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer loads only the first `input_sentence_size` sentences from the files specified with this parameter.
repeated string input = 1;- Returns:
- This builder for chaining.
-
addInputBytes
General parameters. Input corpus files. The trainer accepts the following two formats: A) Monolingual: plain text, one sentence per line. B) Bilingual: TSV, source sentence <tab> target sentence. When bilingual data is passed, a shared vocabulary model is built. Note that the input file must be a raw corpus, not a preprocessed corpus. The trainer loads only the first `input_sentence_size` sentences from the files specified with this parameter.
repeated string input = 1;- Parameters:
value- The bytes of the input to add.- Returns:
- This builder for chaining.
-
hasInputFormat
public boolean hasInputFormat()Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
optional string input_format = 7;- Specified by:
hasInputFormatin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the inputFormat field is set.
-
getInputFormat
Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
optional string input_format = 7;- Specified by:
getInputFormatin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The inputFormat.
-
getInputFormatBytes
public com.google.protobuf.ByteString getInputFormatBytes()Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
optional string input_format = 7;- Specified by:
getInputFormatBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for inputFormat.
-
setInputFormat
Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
optional string input_format = 7;- Parameters:
value- The inputFormat to set.- Returns:
- This builder for chaining.
-
clearInputFormat
Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
optional string input_format = 7;- Returns:
- This builder for chaining.
-
setInputFormatBytes
public SentencepieceModel.TrainerSpec.Builder setInputFormatBytes(com.google.protobuf.ByteString value) Input corpus format: "text": one-sentence-per-line text format (default) "tsv": sentence <tab> freq
optional string input_format = 7;- Parameters:
value- The bytes for inputFormat to set.- Returns:
- This builder for chaining.
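To illustrate the "tsv" format (sentence <tab> freq) described for input_format, here is a minimal sketch of reading such lines. It is not the trainer's actual loader: the class name `TsvCorpusSketch` and the method `parse` are hypothetical, and summing the frequencies of repeated sentences is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TsvCorpusSketch {
  /**
   * Parses "sentence<TAB>freq" lines into an insertion-ordered map.
   * Assumption: a sentence that appears more than once has its frequencies summed.
   */
  static Map<String, Long> parse(List<String> lines) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (String line : lines) {
      // The frequency is taken to be the text after the last tab on the line.
      int tab = line.lastIndexOf('\t');
      if (tab < 0) {
        throw new IllegalArgumentException("expected sentence<TAB>freq: " + line);
      }
      String sentence = line.substring(0, tab);
      long freq = Long.parseLong(line.substring(tab + 1).trim());
      counts.merge(sentence, freq, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Prints {hello world=5, good morning=1}
    System.out.println(parse(List.of("hello world\t3", "good morning\t1", "hello world\t2")));
  }
}
```

With the default "text" format no such parsing is needed, since every line is one sentence.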
-
hasModelPrefix
public boolean hasModelPrefix()Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;- Specified by:
hasModelPrefixin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the modelPrefix field is set.
-
getModelPrefix
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;- Specified by:
getModelPrefixin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The modelPrefix.
-
getModelPrefixBytes
public com.google.protobuf.ByteString getModelPrefixBytes()Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;- Specified by:
getModelPrefixBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for modelPrefix.
-
setModelPrefix
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;- Parameters:
value- The modelPrefix to set.- Returns:
- This builder for chaining.
-
clearModelPrefix
Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;- Returns:
- This builder for chaining.
-
setModelPrefixBytes
public SentencepieceModel.TrainerSpec.Builder setModelPrefixBytes(com.google.protobuf.ByteString value) Output model file prefix. <model_prefix>.model and <model_prefix>.vocab are generated.
optional string model_prefix = 2;- Parameters:
value- The bytes for modelPrefix to set.- Returns:
- This builder for chaining.
-
hasModelType
public boolean hasModelType()optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];- Specified by:
hasModelTypein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the modelType field is set.
-
getModelType
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];- Specified by:
getModelTypein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The modelType.
-
setModelType
public SentencepieceModel.TrainerSpec.Builder setModelType(SentencepieceModel.TrainerSpec.ModelType value) optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];- Parameters:
value- The modelType to set.- Returns:
- This builder for chaining.
-
clearModelType
optional .com.google.genai.proto.TrainerSpec.ModelType model_type = 3 [default = UNIGRAM];- Returns:
- This builder for chaining.
-
hasVocabSize
public boolean hasVocabSize()Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];- Specified by:
hasVocabSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the vocabSize field is set.
-
getVocabSize
public int getVocabSize()Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];- Specified by:
getVocabSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The vocabSize.
-
setVocabSize
Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];- Parameters:
value- The vocabSize to set.- Returns:
- This builder for chaining.
-
clearVocabSize
Vocabulary size. 8k is the default size.
optional int32 vocab_size = 4 [default = 8000];- Returns:
- This builder for chaining.
-
getAcceptLanguageList
public com.google.protobuf.ProtocolStringList getAcceptLanguageList()List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Specified by:
getAcceptLanguageListin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- A list containing the acceptLanguage.
-
getAcceptLanguageCount
public int getAcceptLanguageCount()List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Specified by:
getAcceptLanguageCountin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The count of acceptLanguage.
-
getAcceptLanguage
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Specified by:
getAcceptLanguagein interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the element to return.- Returns:
- The acceptLanguage at the given index.
-
getAcceptLanguageBytes
public com.google.protobuf.ByteString getAcceptLanguageBytes(int index) List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Specified by:
getAcceptLanguageBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the value to return.- Returns:
- The bytes of the acceptLanguage at the given index.
-
setAcceptLanguage
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Parameters:
index- The index to set the value at.value- The acceptLanguage to set.- Returns:
- This builder for chaining.
-
addAcceptLanguage
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Parameters:
value- The acceptLanguage to add.- Returns:
- This builder for chaining.
-
addAllAcceptLanguage
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Parameters:
values- The acceptLanguage to add.- Returns:
- This builder for chaining.
-
clearAcceptLanguage
List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Returns:
- This builder for chaining.
-
addAcceptLanguageBytes
public SentencepieceModel.TrainerSpec.Builder addAcceptLanguageBytes(com.google.protobuf.ByteString value) List of the languages this model can accept. Since the model is language-agnostic, this field is used as a reference.
repeated string accept_language = 5;- Parameters:
value- The bytes of the acceptLanguage to add.- Returns:
- This builder for chaining.
-
hasSelfTestSampleSize
public boolean hasSelfTestSampleSize()Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];- Specified by:
hasSelfTestSampleSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the selfTestSampleSize field is set.
-
getSelfTestSampleSize
public int getSelfTestSampleSize()Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];- Specified by:
getSelfTestSampleSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The selfTestSampleSize.
-
setSelfTestSampleSize
Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];- Parameters:
value- The selfTestSampleSize to set.- Returns:
- This builder for chaining.
-
clearSelfTestSampleSize
Size of self-test samples, which are encoded in the model file.
optional int32 self_test_sample_size = 6 [default = 0];- Returns:
- This builder for chaining.
-
hasEnableDifferentialPrivacy
public boolean hasEnableDifferentialPrivacy() Whether to use the differentially private (DP) version of SentencePiece. Use it with the TSV input format (requires precomputed word <tab> count pairs to work).
optional bool enable_differential_privacy = 50 [default = false];- Specified by:
hasEnableDifferentialPrivacyin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the enableDifferentialPrivacy field is set.
-
getEnableDifferentialPrivacy
public boolean getEnableDifferentialPrivacy() Whether to use the differentially private (DP) version of SentencePiece. Use it with the TSV input format (requires precomputed word <tab> count pairs to work).
optional bool enable_differential_privacy = 50 [default = false];- Specified by:
getEnableDifferentialPrivacyin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The enableDifferentialPrivacy.
-
setEnableDifferentialPrivacy
Whether to use the differentially private (DP) version of SentencePiece. Use it with the TSV input format (requires precomputed word <tab> count pairs to work).
optional bool enable_differential_privacy = 50 [default = false];- Parameters:
value- The enableDifferentialPrivacy to set.- Returns:
- This builder for chaining.
-
clearEnableDifferentialPrivacy
Whether to use the differentially private (DP) version of SentencePiece. Use it with the TSV input format (requires precomputed word <tab> count pairs to work).
optional bool enable_differential_privacy = 50 [default = false];- Returns:
- This builder for chaining.
-
hasDifferentialPrivacyNoiseLevel
public boolean hasDifferentialPrivacyNoiseLevel() Set these parameters if you need the DP version of SentencePiece. Standard deviation of the noise to add.
optional float differential_privacy_noise_level = 51 [default = 0];- Specified by:
hasDifferentialPrivacyNoiseLevelin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the differentialPrivacyNoiseLevel field is set.
-
getDifferentialPrivacyNoiseLevel
public float getDifferentialPrivacyNoiseLevel() Set these parameters if you need the DP version of SentencePiece. Standard deviation of the noise to add.
optional float differential_privacy_noise_level = 51 [default = 0];- Specified by:
getDifferentialPrivacyNoiseLevelin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The differentialPrivacyNoiseLevel.
-
setDifferentialPrivacyNoiseLevel
Set these parameters if you need the DP version of SentencePiece. Standard deviation of the noise to add.
optional float differential_privacy_noise_level = 51 [default = 0];- Parameters:
value- The differentialPrivacyNoiseLevel to set.- Returns:
- This builder for chaining.
-
clearDifferentialPrivacyNoiseLevel
Set these parameters if you need the DP version of SentencePiece. Standard deviation of the noise to add.
optional float differential_privacy_noise_level = 51 [default = 0];- Returns:
- This builder for chaining.
-
hasDifferentialPrivacyClippingThreshold
public boolean hasDifferentialPrivacyClippingThreshold()Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];- Specified by:
hasDifferentialPrivacyClippingThresholdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the differentialPrivacyClippingThreshold field is set.
-
getDifferentialPrivacyClippingThreshold
public long getDifferentialPrivacyClippingThreshold()Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];- Specified by:
getDifferentialPrivacyClippingThresholdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The differentialPrivacyClippingThreshold.
-
setDifferentialPrivacyClippingThreshold
Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];- Parameters:
value- The differentialPrivacyClippingThreshold to set.- Returns:
- This builder for chaining.
-
clearDifferentialPrivacyClippingThreshold
Clipping threshold to apply after adding noise. All the words with frequency less than this value are dropped.
optional uint64 differential_privacy_clipping_threshold = 52 [default = 0];- Returns:
- This builder for chaining.
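The two differential privacy parameters above (a noise standard deviation and a clipping threshold applied after the noise) can be pictured with the following sketch. It is only an illustration under stated assumptions: the noise is taken to be Gaussian, noisy counts are rounded to the nearest integer, and the names `DpCountsSketch` and `privatize` are hypothetical. The real trainer may differ in all of these respects.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

final class DpCountsSketch {
  /**
   * Adds noise with standard deviation noiseLevel to every word count, then drops
   * words whose noisy frequency is less than clippingThreshold.
   * Assumptions: Gaussian noise, rounding to the nearest long.
   */
  static Map<String, Long> privatize(
      Map<String, Long> counts, float noiseLevel, long clippingThreshold, Random rng) {
    Map<String, Long> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Long> entry : counts.entrySet()) {
      double noisy = entry.getValue() + rng.nextGaussian() * noiseLevel;
      long rounded = Math.round(noisy);
      if (rounded >= clippingThreshold) {
        kept.put(entry.getKey(), rounded);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new LinkedHashMap<>();
    counts.put("the", 100L);
    counts.put("rare", 2L);
    // With a noise level of 0 only the clipping step has an effect: prints {the=100}
    System.out.println(privatize(counts, 0f, 5L, new Random(42)));
  }
}
```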
-
hasCharacterCoverage
public boolean hasCharacterCoverage() Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining (1.0 - `character_coverage`) fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];- Specified by:
hasCharacterCoveragein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the characterCoverage field is set.
-
getCharacterCoverage
public float getCharacterCoverage() Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining (1.0 - `character_coverage`) fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];- Specified by:
getCharacterCoveragein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The characterCoverage.
-
setCharacterCoverage
Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining (1.0 - `character_coverage`) fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];- Parameters:
value- The characterCoverage to set.- Returns:
- This builder for chaining.
-
clearCharacterCoverage
Training parameters. Uses characters which cover the corpus with the ratio of `character_coverage`. This parameter determines the basic alphabet of sentence pieces. The remaining (1.0 - `character_coverage`) fraction of characters is treated as UNK. See also the required_chars field.
optional float character_coverage = 10 [default = 0.9995];- Returns:
- This builder for chaining.
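The idea behind character_coverage can be sketched as follows: count every character in the corpus, keep the most frequent ones until the requested fraction of all character occurrences is covered, and treat the rest as UNK. This is a simplified illustration, not the trainer's code. The names `CharacterCoverageSketch` and `alphabet` are hypothetical, and details such as whitespace handling and tie-breaking by code point are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CharacterCoverageSketch {
  /** Returns the code points kept in the alphabet, most frequent first. */
  static Set<Integer> alphabet(String corpus, double coverage) {
    Map<Integer, Long> freq = new HashMap<>();
    long total = 0;
    for (int i = 0; i < corpus.length(); ) {
      int cp = corpus.codePointAt(i);
      freq.merge(cp, 1L, Long::sum);
      total++;
      i += Character.charCount(cp);
    }
    // Sort by descending count; ties are broken by code point (an assumption).
    List<Map.Entry<Integer, Long>> sorted = new ArrayList<>(freq.entrySet());
    sorted.sort((a, b) -> {
      int byCount = Long.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : Integer.compare(a.getKey(), b.getKey());
    });
    Set<Integer> kept = new LinkedHashSet<>();
    long covered = 0;
    for (Map.Entry<Integer, Long> entry : sorted) {
      if (covered >= coverage * total) {
        break; // the requested share of occurrences is already covered
      }
      kept.add(entry.getKey());
      covered += entry.getValue();
    }
    return kept;
  }

  public static void main(String[] args) {
    // 8 x 'a', 1 x 'b', 1 x 'z': a coverage of 0.85 keeps 'a' (97) and 'b' (98).
    System.out.println(alphabet("aaaaaaaabz", 0.85)); // prints [97, 98]
  }
}
```

In this toy corpus the character 'z' falls outside the alphabet and would be mapped to UNK.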
-
hasInputSentenceSize
public boolean hasInputSentenceSize() Maximum number of sentences the trainer loads from the `input` parameter. The trainer simply loads the `input` files in sequence, so it is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];- Specified by:
hasInputSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the inputSentenceSize field is set.
-
getInputSentenceSize
public long getInputSentenceSize() Maximum number of sentences the trainer loads from the `input` parameter. The trainer simply loads the `input` files in sequence, so it is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];- Specified by:
getInputSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The inputSentenceSize.
-
setInputSentenceSize
Maximum number of sentences the trainer loads from the `input` parameter. The trainer simply loads the `input` files in sequence, so it is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];- Parameters:
value- The inputSentenceSize to set.- Returns:
- This builder for chaining.
-
clearInputSentenceSize
Maximum number of sentences the trainer loads from the `input` parameter. The trainer simply loads the `input` files in sequence, so it is better to shuffle the input corpus randomly.
optional uint64 input_sentence_size = 11 [default = 0];- Returns:
- This builder for chaining.
-
hasShuffleInputSentence
public boolean hasShuffleInputSentence()optional bool shuffle_input_sentence = 19 [default = true];- Specified by:
hasShuffleInputSentencein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the shuffleInputSentence field is set.
-
getShuffleInputSentence
public boolean getShuffleInputSentence()optional bool shuffle_input_sentence = 19 [default = true];- Specified by:
getShuffleInputSentencein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The shuffleInputSentence.
-
setShuffleInputSentence
optional bool shuffle_input_sentence = 19 [default = true];- Parameters:
value- The shuffleInputSentence to set.- Returns:
- This builder for chaining.
-
clearShuffleInputSentence
optional bool shuffle_input_sentence = 19 [default = true];- Returns:
- This builder for chaining.
-
hasMiningSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated. See sentencepiece_model.proto;l=96. Maximum number of sentences used to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];- Specified by:
hasMiningSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the miningSentenceSize field is set.
-
getMiningSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.mining_sentence_size is deprecated. See sentencepiece_model.proto;l=96. Maximum number of sentences used to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];- Specified by:
getMiningSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The miningSentenceSize.
-
setMiningSentenceSize
Deprecated. Maximum number of sentences used to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];- Parameters:
value- The miningSentenceSize to set.- Returns:
- This builder for chaining.
-
clearMiningSentenceSize
Deprecated. Maximum number of sentences used to make seed sentence pieces. An extended suffix array is constructed to extract frequent substrings from the corpus. This uses 20N working space, where N is the size of the corpus.
optional int32 mining_sentence_size = 12 [deprecated = true];- Returns:
- This builder for chaining.
-
hasTrainingSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated. See sentencepiece_model.proto;l=99. Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];- Specified by:
hasTrainingSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the trainingSentenceSize field is set.
-
getTrainingSentenceSize
Deprecated. com.google.genai.proto.TrainerSpec.training_sentence_size is deprecated. See sentencepiece_model.proto;l=99. Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];- Specified by:
getTrainingSentenceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The trainingSentenceSize.
-
setTrainingSentenceSize
Deprecated. Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];- Parameters:
value- The trainingSentenceSize to set.- Returns:
- This builder for chaining.
-
clearTrainingSentenceSize
Deprecated. Maximum number of sentences used to train sentence pieces.
optional int32 training_sentence_size = 13 [deprecated = true];- Returns:
- This builder for chaining.
-
hasSeedSentencepieceSize
public boolean hasSeedSentencepieceSize()The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];- Specified by:
hasSeedSentencepieceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the seedSentencepieceSize field is set.
-
getSeedSentencepieceSize
public int getSeedSentencepieceSize()The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];- Specified by:
getSeedSentencepieceSizein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The seedSentencepieceSize.
-
setSeedSentencepieceSize
The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];- Parameters:
value- The seedSentencepieceSize to set.- Returns:
- This builder for chaining.
-
clearSeedSentencepieceSize
The size of seed sentencepieces. `seed_sentencepiece_size` must be larger than `vocab_size`.
optional int32 seed_sentencepiece_size = 14 [default = 1000000];- Returns:
- This builder for chaining.
-
hasShrinkingFactor
public boolean hasShrinkingFactor() In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];- Specified by:
hasShrinkingFactorin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the shrinkingFactor field is set.
-
getShrinkingFactor
public float getShrinkingFactor() In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];- Specified by:
getShrinkingFactorin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The shrinkingFactor.
-
setShrinkingFactor
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];- Parameters:
value- The shrinkingFactor to set.- Returns:
- This builder for chaining.
-
clearShrinkingFactor
In every EM sub-iteration, keeps the top `shrinking_factor` * `current sentencepieces size` pieces with respect to the loss of the sentence piece. This value should be smaller than 1.0.
optional float shrinking_factor = 15 [default = 0.75];- Returns:
- This builder for chaining.
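The interplay of seed_sentencepiece_size, shrinking_factor and vocab_size can be worked through with a small arithmetic sketch. It only computes how the number of candidate pieces would shrink step by step and performs no training. The names `ShrinkingScheduleSketch` and `schedule` are hypothetical, and stopping exactly at vocab_size is a simplifying assumption that may not match the real trainer.

```java
import java.util.ArrayList;
import java.util.List;

final class ShrinkingScheduleSketch {
  /**
   * Lists the candidate-set sizes obtained by repeatedly keeping the top
   * shrinkingFactor * currentSize pieces until vocabSize is reached.
   */
  static List<Integer> schedule(int seedSize, int vocabSize, float shrinkingFactor) {
    if (!(shrinkingFactor > 0f && shrinkingFactor < 1f)) {
      throw new IllegalArgumentException("shrinking_factor should be between 0 and 1.0");
    }
    if (seedSize <= vocabSize) {
      throw new IllegalArgumentException("seed_sentencepiece_size must be larger than vocab_size");
    }
    List<Integer> sizes = new ArrayList<>();
    int current = seedSize;
    sizes.add(current);
    while (current > vocabSize) {
      // Assumption: never shrink below the target vocabulary size.
      current = Math.max(vocabSize, (int) (current * (double) shrinkingFactor));
      sizes.add(current);
    }
    return sizes;
  }

  public static void main(String[] args) {
    System.out.println(schedule(100, 50, 0.75f)); // prints [100, 75, 56, 50]
    // Number of pruning steps with the documented defaults (1000000 seeds, 8000 pieces, 0.75).
    System.out.println(schedule(1_000_000, 8_000, 0.75f).size() - 1);
  }
}
```

A smaller shrinking_factor reaches the target in fewer steps but discards more candidates per step.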
-
hasMaxSentenceLength
public boolean hasMaxSentenceLength() The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks: * Overflow during EM training (unigram language model only) * Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];- Specified by:
hasMaxSentenceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the maxSentenceLength field is set.
-
getMaxSentenceLength
public int getMaxSentenceLength() The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks: * Overflow during EM training (unigram language model only) * Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];- Specified by:
getMaxSentenceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The maxSentenceLength.
-
setMaxSentenceLength
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks: * Overflow during EM training (unigram language model only) * Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];- Parameters:
value- The maxSentenceLength to set.- Returns:
- This builder for chaining.
-
clearMaxSentenceLength
The maximum sentence length in bytes. Sentences longer than `max_sentence_length` are simply ignored. Longer input tends to bring the following risks: * Overflow during EM training (unigram language model only) * Performance drop because of the O(n log n) cost in BPE.
optional int32 max_sentence_length = 18 [default = 4192];- Returns:
- This builder for chaining.
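Because max_sentence_length is measured in bytes rather than characters, non-ASCII sentences reach the limit sooner than their character count suggests. The following sketch shows such a filter. It assumes UTF-8 encoding, and the names `SentenceLengthFilterSketch` and `keepShortEnough` are hypothetical rather than part of this API.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

final class SentenceLengthFilterSketch {
  /** Keeps sentences whose UTF-8 byte length is not larger than maxSentenceLength. */
  static List<String> keepShortEnough(List<String> sentences, int maxSentenceLength) {
    return sentences.stream()
        .filter(s -> s.getBytes(StandardCharsets.UTF_8).length <= maxSentenceLength)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // "abc" is 3 bytes; the three CJK characters below take 9 bytes in UTF-8.
    List<String> sentences = List.of("abc", "\u65e5\u672c\u8a9e");
    System.out.println(keepShortEnough(sentences, 5)); // prints [abc]
  }
}
```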
-
hasNumThreads
public boolean hasNumThreads()Number of threads in the training.
optional int32 num_threads = 16 [default = 16];- Specified by:
hasNumThreadsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the numThreads field is set.
-
getNumThreads
public int getNumThreads()Number of threads in the training.
optional int32 num_threads = 16 [default = 16];- Specified by:
getNumThreadsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The numThreads.
-
setNumThreads
Number of threads in the training.
optional int32 num_threads = 16 [default = 16];- Parameters:
value- The numThreads to set.- Returns:
- This builder for chaining.
-
clearNumThreads
Number of threads in the training.
optional int32 num_threads = 16 [default = 16];- Returns:
- This builder for chaining.
-
hasNumSubIterations
public boolean hasNumSubIterations()Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];- Specified by:
hasNumSubIterationsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the numSubIterations field is set.
-
getNumSubIterations
public int getNumSubIterations()Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];- Specified by:
getNumSubIterationsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The numSubIterations.
-
setNumSubIterations
Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];- Parameters:
value- The numSubIterations to set.- Returns:
- This builder for chaining.
-
clearNumSubIterations
Number of EM sub iterations.
optional int32 num_sub_iterations = 17 [default = 2];- Returns:
- This builder for chaining.
-
hasMaxSentencepieceLength
public boolean hasMaxSentencepieceLength() SentencePiece parameters that control the shape of sentence pieces. Maximum length of a sentence piece.
optional int32 max_sentencepiece_length = 20 [default = 16];- Specified by:
hasMaxSentencepieceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the maxSentencepieceLength field is set.
-
getMaxSentencepieceLength
public int getMaxSentencepieceLength() SentencePiece parameters that control the shape of sentence pieces. Maximum length of a sentence piece.
optional int32 max_sentencepiece_length = 20 [default = 16];- Specified by:
getMaxSentencepieceLengthin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The maxSentencepieceLength.
-
setMaxSentencepieceLength
SentencePiece parameters that control the shape of sentence pieces. Maximum length of a sentence piece.
optional int32 max_sentencepiece_length = 20 [default = 16];- Parameters:
value- The maxSentencepieceLength to set.- Returns:
- This builder for chaining.
-
clearMaxSentencepieceLength
SentencePiece parameters that control the shape of sentence pieces. Maximum length of a sentence piece.
optional int32 max_sentencepiece_length = 20 [default = 16];- Returns:
- This builder for chaining.
-
hasSplitByUnicodeScript
public boolean hasSplitByUnicodeScript() Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece is not allowed to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];- Specified by:
hasSplitByUnicodeScriptin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitByUnicodeScript field is set.
-
getSplitByUnicodeScript
public boolean getSplitByUnicodeScript() Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece is not allowed to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];- Specified by:
getSplitByUnicodeScriptin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitByUnicodeScript.
-
setSplitByUnicodeScript
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece is not allowed to include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];- Parameters:
value- The splitByUnicodeScript to set.- Returns:
- This builder for chaining.
-
clearSplitByUnicodeScript
Uses Unicode script to split sentence pieces. When `split_by_unicode_script` is true, a sentence piece may not include multiple Unicode scripts, e.g. "F1" is not a valid piece. Exception: CJ characters (Hiragana/Katakana/Han) are all handled as one script type, since a Japanese word can consist of multiple scripts. This exception is always applied regardless of the accept-language parameter.
optional bool split_by_unicode_script = 21 [default = true];- Returns:
- This builder for chaining.
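The single-script rule described for `split_by_unicode_script` can be illustrated with a small standalone check. This is only a sketch built on the JDK's `Character.UnicodeScript`; the class and method names are hypothetical, and the handling of scripts such as COMMON or INHERITED in the real trainer may differ.

```java
final class UnicodeScriptSketch {
    // Maps Hiragana and Katakana onto Han so the three count as one script type.
    // (Assumption: this mirrors the "CJ exception" described in the field comment.)
    static Character.UnicodeScript normalize(Character.UnicodeScript script) {
        if (script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA) {
            return Character.UnicodeScript.HAN;
        }
        return script;
    }

    // Returns true when every code point of the piece has the same normalized script.
    static boolean isSingleScript(String piece) {
        Character.UnicodeScript first = null;
        for (int i = 0; i < piece.length(); ) {
            int cp = piece.codePointAt(i);
            Character.UnicodeScript script = normalize(Character.UnicodeScript.of(cp));
            if (first == null) {
                first = script;
            } else if (script != first) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // "F" is LATIN and "1" is COMMON, so the piece mixes scripts.
        System.out.println(isSingleScript("F1"));
        // Han followed by Hiragana is treated as one script type.
        System.out.println(isSingleScript("\u98DF\u3079\u308B"));
    }
}
```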
-
hasSplitByNumber
public boolean hasSplitByNumber() When `split_by_number` is true, a boundary is placed at every transition between number and non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];- Specified by:
hasSplitByNumberin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitByNumber field is set.
-
getSplitByNumber
public boolean getSplitByNumber() When `split_by_number` is true, a boundary is placed at every transition between number and non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];- Specified by:
getSplitByNumberin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitByNumber.
-
setSplitByNumber
When `split_by_number` is true, a boundary is placed at every transition between number and non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];- Parameters:
value- The splitByNumber to set.- Returns:
- This builder for chaining.
-
clearSplitByNumber
When `split_by_number` is true, a boundary is placed at every transition between number and non-number. To treat "F1" as one token, set this flag to false.
optional bool split_by_number = 23 [default = true];- Returns:
- This builder for chaining.
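The boundary rule for `split_by_number` can be pictured with the following standalone sketch. It is not the trainer's code; the class name is hypothetical and `Character.isDigit` stands in for whatever notion of "number" the trainer applies.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitByNumberSketch {
    // Splits text wherever a digit is followed by a non-digit or vice versa.
    static List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                // A change between digit and non-digit closes the current chunk.
                if (Character.isDigit(prev) != Character.isDigit(c)) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(c);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split("F1"));
        System.out.println(split("abc123def"));
    }
}
```

With the flag set to false, no such boundary would be placed and "F1" could survive as a single candidate piece.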
-
hasSplitByWhitespace
public boolean hasSplitByWhitespace() Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain whitespace in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];- Specified by:
hasSplitByWhitespacein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitByWhitespace field is set.
-
getSplitByWhitespace
public boolean getSplitByWhitespace() Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain whitespace in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];- Specified by:
getSplitByWhitespacein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitByWhitespace.
-
setSplitByWhitespace
Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain whitespace in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];- Parameters:
value- The splitByWhitespace to set.- Returns:
- This builder for chaining.
-
clearSplitByWhitespace
Uses whitespace to split sentence pieces. When `split_by_whitespace` is false, a piece may contain whitespace in the middle, e.g., "in_the".
optional bool split_by_whitespace = 22 [default = true];- Returns:
- This builder for chaining.
-
hasTreatWhitespaceAsSuffix
public boolean hasTreatWhitespaceAsSuffix() Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix adds the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];- Specified by:
hasTreatWhitespaceAsSuffixin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the treatWhitespaceAsSuffix field is set.
-
getTreatWhitespaceAsSuffix
public boolean getTreatWhitespaceAsSuffix() Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix adds the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];- Specified by:
getTreatWhitespaceAsSuffixin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The treatWhitespaceAsSuffix.
-
setTreatWhitespaceAsSuffix
Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix adds the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];- Parameters:
value- The treatWhitespaceAsSuffix to set.- Returns:
- This builder for chaining.
-
clearTreatWhitespaceAsSuffix
Adds the whitespace symbol (_) as a suffix instead of a prefix, e.g., _hello => hello_. When `treat_whitespace_as_suffix` is true, NormalizerSpec::add_dummy_prefix adds the dummy whitespace to the end of the sentence.
optional bool treat_whitespace_as_suffix = 24 [default = false];- Returns:
- This builder for chaining.
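The prefix versus suffix placement of the whitespace symbol can be sketched as follows. The class name is hypothetical, the marker is written as `_` to match the notation used in the description, and the word splitting is a simplification rather than the library's normalizer.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceMarkerSketch {
    // Stand-in for the whitespace symbol, written as "_" as in the description.
    static final String MARKER = "_";

    // Attaches the marker to each word, either in front of it or behind it.
    static List<String> mark(String sentence, boolean asSuffix) {
        List<String> marked = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty input yields one empty token from split()
            }
            marked.add(asSuffix ? word + MARKER : MARKER + word);
        }
        return marked;
    }

    public static void main(String[] args) {
        System.out.println(mark("hello world", false));
        System.out.println(mark("hello world", true));
    }
}
```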
-
hasAllowWhitespaceOnlyPieces
public boolean hasAllowWhitespaceOnlyPieces()Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];- Specified by:
hasAllowWhitespaceOnlyPiecesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the allowWhitespaceOnlyPieces field is set.
-
getAllowWhitespaceOnlyPieces
public boolean getAllowWhitespaceOnlyPieces()Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];- Specified by:
getAllowWhitespaceOnlyPiecesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The allowWhitespaceOnlyPieces.
-
setAllowWhitespaceOnlyPieces
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];- Parameters:
value- The allowWhitespaceOnlyPieces to set.- Returns:
- This builder for chaining.
-
clearAllowWhitespaceOnlyPieces
Allows pieces that only contain whitespaces instead of appearing only as prefix or suffix of other pieces.
optional bool allow_whitespace_only_pieces = 26 [default = false];- Returns:
- This builder for chaining.
-
hasSplitDigits
public boolean hasSplitDigits()Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];- Specified by:
hasSplitDigitsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the splitDigits field is set.
-
getSplitDigits
public boolean getSplitDigits()Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];- Specified by:
getSplitDigitsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The splitDigits.
-
setSplitDigits
Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];- Parameters:
value- The splitDigits to set.- Returns:
- This builder for chaining.
-
clearSplitDigits
Split all digits (0-9) into separate pieces.
optional bool split_digits = 25 [default = false];- Returns:
- This builder for chaining.
-
hasPretokenizationDelimiter
public boolean hasPretokenizationDelimiter() Defines the pre-tokenization delimiter. When specified, pieces crossing this delimiter are not included in the vocab, and the delimiter string is virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Specified by:
hasPretokenizationDelimiterin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the pretokenizationDelimiter field is set.
-
getPretokenizationDelimiter
Defines the pre-tokenization delimiter. When specified, pieces crossing this delimiter are not included in the vocab, and the delimiter string is virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Specified by:
getPretokenizationDelimiterin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The pretokenizationDelimiter.
-
getPretokenizationDelimiterBytes
public com.google.protobuf.ByteString getPretokenizationDelimiterBytes() Defines the pre-tokenization delimiter. When specified, pieces crossing this delimiter are not included in the vocab, and the delimiter string is virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Specified by:
getPretokenizationDelimiterBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for pretokenizationDelimiter.
-
setPretokenizationDelimiter
Defines the pre-tokenization delimiter. When specified, pieces crossing this delimiter are not included in the vocab, and the delimiter string is virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Parameters:
value- The pretokenizationDelimiter to set.- Returns:
- This builder for chaining.
-
clearPretokenizationDelimiter
Defines the pre-tokenization delimiter. When specified, pieces crossing this delimiter are not included in the vocab, and the delimiter string is virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Returns:
- This builder for chaining.
-
setPretokenizationDelimiterBytes
public SentencepieceModel.TrainerSpec.Builder setPretokenizationDelimiterBytes(com.google.protobuf.ByteString value) Defines the pre-tokenization delimiter. When specified, pieces crossing this delimiter are not included in the vocab, and the delimiter string is virtually ignored during training. This field allows constraints on the vocabulary selection. Note that this field is available in unigram mode.
optional string pretokenization_delimiter = 53 [default = ""];- Parameters:
value- The bytes for pretokenizationDelimiter to set.- Returns:
- This builder for chaining.
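The effect of `pretokenization_delimiter` on candidate pieces can be pictured with a small sketch: substrings are enumerated only inside delimiter-separated segments, so no candidate spans the delimiter and the delimiter itself never appears in a candidate. The class, method, and `maxLen` parameter are hypothetical, and brute-force substring enumeration is a simplification of how a unigram trainer seeds its vocabulary.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

final class PretokenizationSketch {
    // Collects all substrings up to maxLen characters that stay within one segment.
    static Set<String> candidates(String text, String delimiter, int maxLen) {
        Set<String> pieces = new LinkedHashSet<>();
        // An empty delimiter means the field is unset: treat the text as one segment.
        String[] segments = delimiter.isEmpty()
                ? new String[] {text}
                : text.split(Pattern.quote(delimiter));
        for (String segment : segments) {
            for (int start = 0; start < segment.length(); start++) {
                int limit = Math.min(segment.length(), start + maxLen);
                for (int end = start + 1; end <= limit; end++) {
                    pieces.add(segment.substring(start, end));
                }
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(candidates("ab|cd", "|", 4));
        System.out.println(candidates("abcd", "", 4));
    }
}
```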
-
getControlSymbolsList
public com.google.protobuf.ProtocolStringList getControlSymbolsList() Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsListin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- A list containing the controlSymbols.
-
getControlSymbolsCount
public int getControlSymbolsCount() Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsCountin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The count of controlSymbols.
-
getControlSymbols
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the element to return.- Returns:
- The controlSymbols at the given index.
-
getControlSymbolsBytes
public com.google.protobuf.ByteString getControlSymbolsBytes(int index) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Specified by:
getControlSymbolsBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the value to return.- Returns:
- The bytes of the controlSymbols at the given index.
-
setControlSymbols
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Parameters:
index- The index to set the value at.value- The controlSymbols to set.- Returns:
- This builder for chaining.
-
addControlSymbols
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Parameters:
value- The controlSymbols to add.- Returns:
- This builder for chaining.
-
addAllControlSymbols
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Parameters:
values- The controlSymbols to add.- Returns:
- This builder for chaining.
-
clearControlSymbols
Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Returns:
- This builder for chaining.
-
addControlSymbolsBytes
public SentencepieceModel.TrainerSpec.Builder addControlSymbolsBytes(com.google.protobuf.ByteString value) Vocabulary management. Defines control symbols used as an indicator to change the behavior of the decoder. <s> and </s> are pre-defined. This field can encode various meta information, including a language indicator in a multilingual model. These symbols are not visible to users, but visible to the decoder. Note that when the input sentence contains control symbols, they are not treated as one token, but segmented into normal pieces. Control symbols must be inserted independently from the segmentation.
repeated string control_symbols = 30;- Parameters:
value- The bytes of the controlSymbols to add.- Returns:
- This builder for chaining.
-
getUserDefinedSymbolsList
public com.google.protobuf.ProtocolStringList getUserDefinedSymbolsList() Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsListin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- A list containing the userDefinedSymbols.
-
getUserDefinedSymbolsCount
public int getUserDefinedSymbolsCount() Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsCountin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The count of userDefinedSymbols.
-
getUserDefinedSymbols
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the element to return.- Returns:
- The userDefinedSymbols at the given index.
-
getUserDefinedSymbolsBytes
public com.google.protobuf.ByteString getUserDefinedSymbolsBytes(int index) Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Specified by:
getUserDefinedSymbolsBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Parameters:
index- The index of the value to return.- Returns:
- The bytes of the userDefinedSymbols at the given index.
-
setUserDefinedSymbols
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Parameters:
index- The index to set the value at.value- The userDefinedSymbols to set.- Returns:
- This builder for chaining.
-
addUserDefinedSymbols
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Parameters:
value- The userDefinedSymbols to add.- Returns:
- This builder for chaining.
-
addAllUserDefinedSymbols
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Parameters:
values- The userDefinedSymbols to add.- Returns:
- This builder for chaining.
-
clearUserDefinedSymbols
Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Returns:
- This builder for chaining.
-
addUserDefinedSymbolsBytes
public SentencepieceModel.TrainerSpec.Builder addUserDefinedSymbolsBytes(com.google.protobuf.ByteString value) Defines user-defined symbols. These symbols are added with an extremely high score so that they are always treated as one unique symbol in any context. A typical use of user_defined_symbols is as a placeholder for named entities.
repeated string user_defined_symbols = 31;- Parameters:
value- The bytes of the userDefinedSymbols to add.- Returns:
- This builder for chaining.
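The "always treated as one unique symbol" behavior of user-defined symbols can be approximated with a pre-splitting pass that isolates each symbol before any further segmentation. This is only an illustration: the real mechanism is the high score mentioned above, and the class name, method name, and `<NAME>` placeholder are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class UserDefinedSymbolSketch {
    // Splits text so that every occurrence of a user-defined symbol becomes its own chunk.
    static List<String> isolate(String text, List<String> symbols) {
        List<String> chunks = new ArrayList<>();
        StringBuilder plain = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Pick the longest symbol that starts at position i, if any.
            String match = null;
            for (String symbol : symbols) {
                if (!symbol.isEmpty() && text.startsWith(symbol, i)
                        && (match == null || symbol.length() > match.length())) {
                    match = symbol;
                }
            }
            if (match != null) {
                if (plain.length() > 0) {
                    chunks.add(plain.toString());
                    plain.setLength(0);
                }
                chunks.add(match);
                i += match.length();
            } else {
                plain.append(text.charAt(i));
                i++;
            }
        }
        if (plain.length() > 0) {
            chunks.add(plain.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(isolate("call <NAME> now", List.of("<NAME>")));
    }
}
```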
-
hasRequiredChars
public boolean hasRequiredChars() Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Specified by:
hasRequiredCharsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the requiredChars field is set.
-
getRequiredChars
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Specified by:
getRequiredCharsin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The requiredChars.
-
getRequiredCharsBytes
public com.google.protobuf.ByteString getRequiredCharsBytes() Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Specified by:
getRequiredCharsBytesin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bytes for requiredChars.
-
setRequiredChars
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Parameters:
value- The requiredChars to set.- Returns:
- This builder for chaining.
-
clearRequiredChars
Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Returns:
- This builder for chaining.
-
setRequiredCharsBytes
public SentencepieceModel.TrainerSpec.Builder setRequiredCharsBytes(com.google.protobuf.ByteString value) Defines required characters. Each UTF-8 character in this string is included in the character set regardless of the character_coverage value. Unlike user_defined_symbols, these characters have scores based on their frequency in the input sentences, and the model can form subwords using characters in this field.
optional string required_chars = 36;- Parameters:
value- The bytes for requiredChars to set.- Returns:
- This builder for chaining.
-
hasByteFallback
public boolean hasByteFallback()Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];- Specified by:
hasByteFallbackin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the byteFallback field is set.
-
getByteFallback
public boolean getByteFallback()Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];- Specified by:
getByteFallbackin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The byteFallback.
-
setByteFallback
Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];- Parameters:
value- The byteFallback to set.- Returns:
- This builder for chaining.
-
clearByteFallback
Decomposes unknown pieces into UTF-8 bytes.
optional bool byte_fallback = 35 [default = false];- Returns:
- This builder for chaining.
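Decomposing an unknown piece into UTF-8 bytes, as `byte_fallback` describes, might look like the sketch below. The class name is hypothetical and the `<0xNN>` spelling of each byte piece is an assumption made for illustration; only the decomposition into UTF-8 bytes is taken from the field description.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ByteFallbackSketch {
    // Turns a piece into one pseudo-piece per UTF-8 byte.
    static List<String> toBytePieces(String unknownPiece) {
        List<String> bytePieces = new ArrayList<>();
        for (byte b : unknownPiece.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative Java bytes print as two hex digits.
            bytePieces.add(String.format("<0x%02X>", b & 0xFF));
        }
        return bytePieces;
    }

    public static void main(String[] args) {
        // U+00E9 is encoded as the two bytes C3 A9 in UTF-8.
        System.out.println(toBytePieces("\u00E9"));
    }
}
```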
-
hasVocabularyOutputPieceScore
public boolean hasVocabularyOutputPieceScore()When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];- Specified by:
hasVocabularyOutputPieceScorein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the vocabularyOutputPieceScore field is set.
-
getVocabularyOutputPieceScore
public boolean getVocabularyOutputPieceScore()When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];- Specified by:
getVocabularyOutputPieceScorein interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The vocabularyOutputPieceScore.
-
setVocabularyOutputPieceScore
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];- Parameters:
value- The vocabularyOutputPieceScore to set.- Returns:
- This builder for chaining.
-
clearVocabularyOutputPieceScore
When creating the vocabulary file, defines whether or not to additionally output the score for each piece.
optional bool vocabulary_output_piece_score = 32 [default = true];- Returns:
- This builder for chaining.
-
hasHardVocabLimit
public boolean hasHardVocabLimit() `vocab_size` is treated as a hard limit. Training crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];- Specified by:
hasHardVocabLimitin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the hardVocabLimit field is set.
-
getHardVocabLimit
public boolean getHardVocabLimit() `vocab_size` is treated as a hard limit. Training crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];- Specified by:
getHardVocabLimitin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The hardVocabLimit.
-
setHardVocabLimit
`vocab_size` is treated as a hard limit. Training crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];- Parameters:
value- The hardVocabLimit to set.- Returns:
- This builder for chaining.
-
clearHardVocabLimit
`vocab_size` is treated as a hard limit. Training crashes if the model cannot produce a vocab of size `vocab_size`. When `hard_vocab_limit` is false, `vocab_size` is treated as a soft limit. Note that when model_type=char, hard_vocab_limit = false is always assumed.
optional bool hard_vocab_limit = 33 [default = true];- Returns:
- This builder for chaining.
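The difference between a hard and a soft vocabulary limit can be summarized in a few lines. This is a sketch of the semantics described above, not the trainer's logic; the class, the method, and the `achievable` parameter (the number of pieces the corpus can actually yield) are hypothetical.

```java
final class VocabLimitSketch {
    // Returns the vocabulary size to use, or throws when a hard limit cannot be met.
    static int resolveVocabSize(int requested, int achievable,
                                boolean hardVocabLimit, boolean charModel) {
        // For model_type=char the limit is always treated as soft.
        boolean effectiveHard = hardVocabLimit && !charModel;
        if (achievable < requested) {
            if (effectiveHard) {
                throw new IllegalStateException(
                        "cannot produce a vocabulary of size " + requested);
            }
            return achievable; // soft limit: settle for what is available
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(resolveVocabSize(100, 80, false, false));
        System.out.println(resolveVocabSize(100, 200, true, false));
    }
}
```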
-
hasUseAllVocab
public boolean hasUseAllVocab() Uses all symbols for vocab extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];- Specified by:
hasUseAllVocabin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the useAllVocab field is set.
-
getUseAllVocab
public boolean getUseAllVocab() Uses all symbols for vocab extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];- Specified by:
getUseAllVocabin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The useAllVocab.
-
setUseAllVocab
Uses all symbols for vocab extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];- Parameters:
value- The useAllVocab to set.- Returns:
- This builder for chaining.
-
clearUseAllVocab
Uses all symbols for vocab extraction. This flag is valid if the model type is either CHAR or WORD.
optional bool use_all_vocab = 34 [default = false];- Returns:
- This builder for chaining.
-
hasUnkId
public boolean hasUnkId() Reserved special meta tokens. * -1 means the token is not used. * unk_id must not be -1. IDs must start at 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];- Specified by:
hasUnkIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the unkId field is set.
-
getUnkId
public int getUnkId() Reserved special meta tokens. * -1 means the token is not used. * unk_id must not be -1. IDs must start at 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];- Specified by:
getUnkIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The unkId.
-
setUnkId
Reserved special meta tokens. * -1 means the token is not used. * unk_id must not be -1. IDs must start at 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];- Parameters:
value- The unkId to set.- Returns:
- This builder for chaining.
-
clearUnkId
Reserved special meta tokens. * -1 means the token is not used. * unk_id must not be -1. IDs must start at 0 and be contiguous.
optional int32 unk_id = 40 [default = 0];- Returns:
- This builder for chaining.
-
hasBosId
public boolean hasBosId()<s>
optional int32 bos_id = 41 [default = 1];- Specified by:
hasBosIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- Whether the bosId field is set.
-
getBosId
public int getBosId()<s>
optional int32 bos_id = 41 [default = 1];- Specified by:
getBosIdin interfaceSentencepieceModel.TrainerSpecOrBuilder- Returns:
- The bosId.
-
setBosId
<s>
optional int32 bos_id = 41 [default = 1];- Parameters:
value- The bosId to set.- Returns:
- This builder for chaining.
-
clearBosId
<s>
optional int32 bos_id = 41 [default = 1];- Returns:
- This builder for chaining.
-
hasEosId
public boolean hasEosId()
</s>
optional int32 eos_id = 42 [default = 2];
- Specified by:
hasEosId in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the eosId field is set.
-
getEosId
public int getEosId()
</s>
optional int32 eos_id = 42 [default = 2];
- Specified by:
getEosId in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The eosId.
-
setEosId
public SentencepieceModel.TrainerSpec.Builder setEosId(int value)
</s>
optional int32 eos_id = 42 [default = 2];
- Parameters:
value - The eosId to set.
- Returns:
- This builder for chaining.
-
clearEosId
public SentencepieceModel.TrainerSpec.Builder clearEosId()
</s>
optional int32 eos_id = 42 [default = 2];
- Returns:
- This builder for chaining.
-
hasPadId
public boolean hasPadId()
<pad> (padding)
optional int32 pad_id = 43 [default = -1];
- Specified by:
hasPadId in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the padId field is set.
-
getPadId
public int getPadId()
<pad> (padding)
optional int32 pad_id = 43 [default = -1];
- Specified by:
getPadId in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The padId.
-
setPadId
public SentencepieceModel.TrainerSpec.Builder setPadId(int value)
<pad> (padding)
optional int32 pad_id = 43 [default = -1];
- Parameters:
value - The padId to set.
- Returns:
- This builder for chaining.
-
clearPadId
public SentencepieceModel.TrainerSpec.Builder clearPadId()
<pad> (padding)
optional int32 pad_id = 43 [default = -1];
- Returns:
- This builder for chaining.
-
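The rule stated for the reserved ids (unk_id must not be -1; ids start at 0 and are contiguous) can be checked before a spec is built. The sketch below is one reading of that rule, not the trainer's own validation: it assumes ids set to -1 are disabled and that the remaining ids must be exactly 0..n-1. The class ReservedIdCheck and its method are hypothetical names.

```java
import java.util.Arrays;

// Sketch of the reserved-id rule under one interpretation:
// ids equal to -1 are disabled, and the enabled ids must be exactly 0..n-1.
final class ReservedIdCheck {
    static boolean isValid(int unkId, int bosId, int eosId, int padId) {
        if (unkId == -1) {
            return false; // unk_id must not be -1
        }
        int[] enabled = Arrays.stream(new int[] {unkId, bosId, eosId, padId})
                .filter(id -> id != -1) // drop disabled tokens
                .sorted()
                .toArray();
        for (int i = 0; i < enabled.length; i++) {
            if (enabled[i] != i) {
                return false; // negative, duplicated, or leaves a gap
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid(0, 1, 2, -1)); // the documented defaults
        System.out.println(isValid(1, 2, 3, -1)); // ids do not start at 0
    }
}
```

With the documented defaults (unk_id = 0, bos_id = 1, eos_id = 2, pad_id = -1) the check passes; enabling padding under this reading would mean assigning it the next free id, 3.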
hasUnkPiece
public boolean hasUnkPiece()
optional string unk_piece = 45 [default = "<unk>"];
- Specified by:
hasUnkPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the unkPiece field is set.
-
getUnkPiece
public String getUnkPiece()
optional string unk_piece = 45 [default = "<unk>"];
- Specified by:
getUnkPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The unkPiece.
-
getUnkPieceBytes
public com.google.protobuf.ByteString getUnkPieceBytes()
optional string unk_piece = 45 [default = "<unk>"];
- Specified by:
getUnkPieceBytes in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The bytes for unkPiece.
-
setUnkPiece
public SentencepieceModel.TrainerSpec.Builder setUnkPiece(String value)
optional string unk_piece = 45 [default = "<unk>"];
- Parameters:
value - The unkPiece to set.
- Returns:
- This builder for chaining.
-
clearUnkPiece
public SentencepieceModel.TrainerSpec.Builder clearUnkPiece()
optional string unk_piece = 45 [default = "<unk>"];
- Returns:
- This builder for chaining.
-
setUnkPieceBytes
public SentencepieceModel.TrainerSpec.Builder setUnkPieceBytes(com.google.protobuf.ByteString value)
optional string unk_piece = 45 [default = "<unk>"];
- Parameters:
value - The bytes for unkPiece to set.
- Returns:
- This builder for chaining.
-
hasBosPiece
public boolean hasBosPiece()
optional string bos_piece = 46 [default = "<s>"];
- Specified by:
hasBosPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the bosPiece field is set.
-
getBosPiece
public String getBosPiece()
optional string bos_piece = 46 [default = "<s>"];
- Specified by:
getBosPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The bosPiece.
-
getBosPieceBytes
public com.google.protobuf.ByteString getBosPieceBytes()
optional string bos_piece = 46 [default = "<s>"];
- Specified by:
getBosPieceBytes in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The bytes for bosPiece.
-
setBosPiece
public SentencepieceModel.TrainerSpec.Builder setBosPiece(String value)
optional string bos_piece = 46 [default = "<s>"];
- Parameters:
value - The bosPiece to set.
- Returns:
- This builder for chaining.
-
clearBosPiece
public SentencepieceModel.TrainerSpec.Builder clearBosPiece()
optional string bos_piece = 46 [default = "<s>"];
- Returns:
- This builder for chaining.
-
setBosPieceBytes
public SentencepieceModel.TrainerSpec.Builder setBosPieceBytes(com.google.protobuf.ByteString value)
optional string bos_piece = 46 [default = "<s>"];
- Parameters:
value - The bytes for bosPiece to set.
- Returns:
- This builder for chaining.
-
hasEosPiece
public boolean hasEosPiece()
optional string eos_piece = 47 [default = "</s>"];
- Specified by:
hasEosPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the eosPiece field is set.
-
getEosPiece
public String getEosPiece()
optional string eos_piece = 47 [default = "</s>"];
- Specified by:
getEosPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The eosPiece.
-
getEosPieceBytes
public com.google.protobuf.ByteString getEosPieceBytes()
optional string eos_piece = 47 [default = "</s>"];
- Specified by:
getEosPieceBytes in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The bytes for eosPiece.
-
setEosPiece
public SentencepieceModel.TrainerSpec.Builder setEosPiece(String value)
optional string eos_piece = 47 [default = "</s>"];
- Parameters:
value - The eosPiece to set.
- Returns:
- This builder for chaining.
-
clearEosPiece
public SentencepieceModel.TrainerSpec.Builder clearEosPiece()
optional string eos_piece = 47 [default = "</s>"];
- Returns:
- This builder for chaining.
-
setEosPieceBytes
public SentencepieceModel.TrainerSpec.Builder setEosPieceBytes(com.google.protobuf.ByteString value)
optional string eos_piece = 47 [default = "</s>"];
- Parameters:
value - The bytes for eosPiece to set.
- Returns:
- This builder for chaining.
-
hasPadPiece
public boolean hasPadPiece()
optional string pad_piece = 48 [default = "<pad>"];
- Specified by:
hasPadPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the padPiece field is set.
-
getPadPiece
public String getPadPiece()
optional string pad_piece = 48 [default = "<pad>"];
- Specified by:
getPadPiece in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The padPiece.
-
getPadPieceBytes
public com.google.protobuf.ByteString getPadPieceBytes()
optional string pad_piece = 48 [default = "<pad>"];
- Specified by:
getPadPieceBytes in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The bytes for padPiece.
-
setPadPiece
public SentencepieceModel.TrainerSpec.Builder setPadPiece(String value)
optional string pad_piece = 48 [default = "<pad>"];
- Parameters:
value - The padPiece to set.
- Returns:
- This builder for chaining.
-
clearPadPiece
public SentencepieceModel.TrainerSpec.Builder clearPadPiece()
optional string pad_piece = 48 [default = "<pad>"];
- Returns:
- This builder for chaining.
-
setPadPieceBytes
public SentencepieceModel.TrainerSpec.Builder setPadPieceBytes(com.google.protobuf.ByteString value)
optional string pad_piece = 48 [default = "<pad>"];
- Parameters:
value - The bytes for padPiece to set.
- Returns:
- This builder for chaining.
-
hasUnkSurface
public boolean hasUnkSurface()
Renders <unk> as U+2047 (DOUBLE QUESTION MARK) by default, so that both users and developers can easily see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Specified by:
hasUnkSurface in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the unkSurface field is set.
-
getUnkSurface
public String getUnkSurface()
Renders <unk> as U+2047 (DOUBLE QUESTION MARK) by default, so that both users and developers can easily see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Specified by:
getUnkSurface in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The unkSurface.
-
getUnkSurfaceBytes
public com.google.protobuf.ByteString getUnkSurfaceBytes()
Renders <unk> as U+2047 (DOUBLE QUESTION MARK) by default, so that both users and developers can easily see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Specified by:
getUnkSurfaceBytes in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The bytes for unkSurface.
-
setUnkSurface
public SentencepieceModel.TrainerSpec.Builder setUnkSurface(String value)
Renders <unk> as U+2047 (DOUBLE QUESTION MARK) by default, so that both users and developers can easily see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Parameters:
value - The unkSurface to set.
- Returns:
- This builder for chaining.
-
clearUnkSurface
public SentencepieceModel.TrainerSpec.Builder clearUnkSurface()
Renders <unk> as U+2047 (DOUBLE QUESTION MARK) by default, so that both users and developers can easily see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Returns:
- This builder for chaining.
-
setUnkSurfaceBytes
public SentencepieceModel.TrainerSpec.Builder setUnkSurfaceBytes(com.google.protobuf.ByteString value)
Renders <unk> as U+2047 (DOUBLE QUESTION MARK) by default, so that both users and developers can easily see that <unk> was emitted.
optional string unk_surface = 44 [default = " \342\201\207 "];
- Parameters:
value - The bytes for unkSurface to set.
- Returns:
- This builder for chaining.
-
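The default of unk_surface is written with octal byte escapes, which can obscure that it is simply U+2047 with a space on either side. The sketch below decodes those three bytes as UTF-8 to confirm the code point; the class and method names are hypothetical and use only the JDK.

```java
import java.nio.charset.StandardCharsets;

final class UnkSurfaceDefault {
    // Decodes the octal-escaped default " \342\201\207 " as UTF-8.
    static String decodeDefault() {
        // \342 \201 \207 are the octal forms of the bytes 0xE2 0x81 0x87.
        byte[] utf8 = {' ', (byte) 0342, (byte) 0201, (byte) 0207, ' '};
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String surface = decodeDefault();
        // Index 1 is the character between the two spaces.
        System.out.printf("U+%04X%n", surface.codePointAt(1)); // prints "U+2047"
    }
}
```

The decoded string has three characters (space, U+2047, space), so a custom value passed to setUnkSurface may want to keep the surrounding spaces if the same visual separation is desired.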
hasTrainExtremelyLargeCorpus
public boolean hasTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];
- Specified by:
hasTrainExtremelyLargeCorpus in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the trainExtremelyLargeCorpus field is set.
-
getTrainExtremelyLargeCorpus
public boolean getTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];
- Specified by:
getTrainExtremelyLargeCorpus in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The trainExtremelyLargeCorpus.
-
setTrainExtremelyLargeCorpus
public SentencepieceModel.TrainerSpec.Builder setTrainExtremelyLargeCorpus(boolean value)
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];
- Parameters:
value - The trainExtremelyLargeCorpus to set.
- Returns:
- This builder for chaining.
-
clearTrainExtremelyLargeCorpus
public SentencepieceModel.TrainerSpec.Builder clearTrainExtremelyLargeCorpus()
Increases bit depth to allow unigram model training on large (>10M sentences) corpora. A side effect of enabling this flag is increased memory usage.
optional bool train_extremely_large_corpus = 49 [default = false];
- Returns:
- This builder for chaining.
-
hasSeedSentencepiecesFile
public boolean hasSeedSentencepiecesFile()
Path to a seed sentencepieces file containing one seed sentencepiece and its frequency per line, separated by a tab (sentencepiece <tab> frequency).
optional string seed_sentencepieces_file = 54 [default = ""];
- Specified by:
hasSeedSentencepiecesFile in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- Whether the seedSentencepiecesFile field is set.
-
getSeedSentencepiecesFile
public String getSeedSentencepiecesFile()
Path to a seed sentencepieces file containing one seed sentencepiece and its frequency per line, separated by a tab (sentencepiece <tab> frequency).
optional string seed_sentencepieces_file = 54 [default = ""];
- Specified by:
getSeedSentencepiecesFile in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The seedSentencepiecesFile.
-
getSeedSentencepiecesFileBytes
public com.google.protobuf.ByteString getSeedSentencepiecesFileBytes()
Path to a seed sentencepieces file containing one seed sentencepiece and its frequency per line, separated by a tab (sentencepiece <tab> frequency).
optional string seed_sentencepieces_file = 54 [default = ""];
- Specified by:
getSeedSentencepiecesFileBytes in interface SentencepieceModel.TrainerSpecOrBuilder
- Returns:
- The bytes for seedSentencepiecesFile.
-
setSeedSentencepiecesFile
public SentencepieceModel.TrainerSpec.Builder setSeedSentencepiecesFile(String value)
Path to a seed sentencepieces file containing one seed sentencepiece and its frequency per line, separated by a tab (sentencepiece <tab> frequency).
optional string seed_sentencepieces_file = 54 [default = ""];
- Parameters:
value - The seedSentencepiecesFile to set.
- Returns:
- This builder for chaining.
-
clearSeedSentencepiecesFile
public SentencepieceModel.TrainerSpec.Builder clearSeedSentencepiecesFile()
Path to a seed sentencepieces file containing one seed sentencepiece and its frequency per line, separated by a tab (sentencepiece <tab> frequency).
optional string seed_sentencepieces_file = 54 [default = ""];
- Returns:
- This builder for chaining.
-
setSeedSentencepiecesFileBytes
public SentencepieceModel.TrainerSpec.Builder setSeedSentencepiecesFileBytes(com.google.protobuf.ByteString value)
Path to a seed sentencepieces file containing one seed sentencepiece and its frequency per line, separated by a tab (sentencepiece <tab> frequency).
optional string seed_sentencepieces_file = 54 [default = ""];
- Parameters:
value - The bytes for seedSentencepiecesFile to set.
- Returns:
- This builder for chaining.
-
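The seed file layout described for seed_sentencepieces_file (one sentencepiece, a tab, then its frequency on each line) can be sanity-checked before the path is handed to the builder. The sketch below is a hypothetical helper, not part of this API. It assumes the frequency is numeric and parses it as a double, and it assumes blank lines can be skipped; the two sample lines are made up.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class SeedPiecesSketch {
    // Parses lines of the form "piece<TAB>frequency" (hypothetical helper).
    static List<Map.Entry<String, Double>> parse(List<String> lines) {
        List<Map.Entry<String, Double>> seeds = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("missing tab: " + line);
            }
            String piece = line.substring(0, tab);
            // Assumption: the frequency column is a plain number.
            double frequency = Double.parseDouble(line.substring(tab + 1).trim());
            seeds.add(new AbstractMap.SimpleEntry<String, Double>(piece, frequency));
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Made-up sample content, in place of reading a real file.
        List<String> lines = Arrays.asList("\u2581the\t1200", "ing\t87");
        for (Map.Entry<String, Double> seed : parse(lines)) {
            System.out.println(seed.getKey() + " -> " + seed.getValue());
        }
    }
}
```

In practice the lines would come from the file whose path is passed to setSeedSentencepiecesFile, for example via java.nio.file.Files.readAllLines.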
setUnknownFields
public final SentencepieceModel.TrainerSpec.Builder setUnknownFields(com.google.protobuf.UnknownFieldSet unknownFields)
- Specified by:
setUnknownFields in interface com.google.protobuf.Message.Builder
- Overrides:
setUnknownFields in class com.google.protobuf.GeneratedMessageV3.Builder<SentencepieceModel.TrainerSpec.Builder>
-
mergeUnknownFields
public final SentencepieceModel.TrainerSpec.Builder mergeUnknownFields(com.google.protobuf.UnknownFieldSet unknownFields)
- Specified by:
mergeUnknownFields in interface com.google.protobuf.Message.Builder
- Overrides:
mergeUnknownFields in class com.google.protobuf.GeneratedMessageV3.Builder<SentencepieceModel.TrainerSpec.Builder>
-