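I want to share a little each day about the technology LLMs are built on, from the ground up, and keep every post under a three-minute read, so that nobody feels overloaded but everyone still grows a bit each day.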
SentencePiece is a tokenizer that supports both the Unigram language model approach (see AI說書 - 從0開始 - 300 | Unigram Language Model Tokenization 展示) and Byte Pair Encoding (BPE). It does not need a pre-tokenizer and can be trained directly on raw text.
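To see why no pre-tokenizer is needed, it helps to know that SentencePiece treats a space as an ordinary symbol, written as "▁" (U+2581), so the original text can be rebuilt from the pieces alone. The snippet below is a simplified illustration of that idea in plain Python, not the library's implementation; the helper names and the segmentation are made up for the example.

```python
# Simplified illustration only; this is not how SentencePiece is implemented.
WS = "\u2581"  # "▁", the visible marker SentencePiece uses for a space


def escape_whitespace(text):
    """Prefix the text with the marker and replace every space with it."""
    return WS + text.replace(" ", WS)


def restore_text(pieces):
    """Join the pieces and turn the markers back into spaces."""
    return "".join(pieces).replace(WS, " ").lstrip(" ")


escaped = escape_whitespace("Subword tokenizers break text.")

# A hypothetical segmentation, chosen by hand for illustration
pieces = [WS + "Sub", "word", WS + "token", "izers", WS + "break", WS + "text", "."]

print(escaped)
print(restore_text(pieces))
```

Because the word boundaries live inside the pieces themselves, no separate whitespace-splitting step has to run before or after tokenization.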
The demo below starts by loading the required dependencies:
import sentencepiece as spm
import random
Next, define the sample text and save it to a file:
basic_corpus = [
    "Subword tokenizers break text sequences into subwords.",
    "This sentence is another part of the corpus.",
    "Tokenization is the process of breaking text down into smaller units.",
    "These smaller units can be words, subwords, or even individual characters.",
    "Transformer models often use subword tokenization.",
]
# Generate a larger corpus by repeating sentences from the basic corpus
corpus = [random.choice(basic_corpus) for _ in range(10000)]
# Write one sentence per line so the file can be used as training input
with open('large_corpus.txt', 'w') as f:
    for sentence in corpus:
        f.write(sentence + '\n')
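Before training, it can be useful to check what the file actually contains. The sketch below uses a hypothetical helper, `corpus_stats`, that is not part of the original demo. It counts the lines, the distinct sentences, and the distinct characters in a text file. Since the corpus is built from only five sentences, it offers few candidate subwords, which is presumably why a small vocabulary size is used for training.

```python
import os
from collections import Counter


def corpus_stats(path):
    """Return (total lines, distinct sentences, distinct characters) for a text file."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    sentence_counts = Counter(lines)   # how often each sentence occurs
    chars = set("".join(lines))        # every character that appears at least once
    return len(lines), len(sentence_counts), len(chars)


# Only runs when the corpus file from the step above is present
if os.path.exists("large_corpus.txt"):
    print(corpus_stats("large_corpus.txt"))
```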
Then configure the tokenizer and train it:
spm.SentencePieceTrainer.train(input = 'large_corpus.txt', model_prefix = 'm', vocab_size = 88)
Now inspect the result:
sp = spm.SentencePieceProcessor()
sp.load('m.model')
tokens = sp.encode_as_pieces("Subword tokenizers break text sequences into subwords.")
print(tokens)
The result is: