I'd like to share one small piece of "building LLMs from the ground up" every day, and keep each post to under three minutes of reading, so there's no pressure and yet everyone still grows a little each day.
To recap, the previous posts gave us a set of helpers: load_clean_sentences, to_vocab, trim_vocab, update_dataset, and save_clean_sentences. Now we bring those pieces together into a complete vocabulary-reduction pipeline; a sketch of the helpers comes first, then the pipeline itself.
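If you don't have the earlier posts at hand, below is a minimal sketch of what those helpers could look like. It is my reconstruction, assuming the .pkl files store plain Python lists of space-separated sentence strings; the versions from the earlier posts may differ in detail.

from collections import Counter
from pickle import load, dump

def load_clean_sentences(filename):
    # Load a pickled list of cleaned sentences from disk
    return load(open(filename, 'rb'))

def save_clean_sentences(sentences, filename):
    # Pickle the updated sentences back to disk
    dump(sentences, open(filename, 'wb'))
    print('Saved: %s' % filename)

def to_vocab(lines):
    # Count how often each token appears across the dataset
    vocab = Counter()
    for line in lines:
        vocab.update(line.split())
    return vocab

def trim_vocab(vocab, min_occurrence):
    # Keep only tokens that appear at least min_occurrence times
    return set(tok for tok, cnt in vocab.items() if cnt >= min_occurrence)

def update_dataset(lines, vocab):
    # Replace every out-of-vocabulary token with the marker 'unk'
    new_lines = []
    for line in lines:
        tokens = [tok if tok in vocab else 'unk' for tok in line.split()]
        new_lines.append(' '.join(tokens))
    return new_lines

With those helpers in place, here is the combined pipeline for the English dataset: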
# Load English dataset
filename = 'English.pkl'
lines = load_clean_sentences(filename)
# Calculate vocabulary
vocab = to_vocab(lines)
print('English Vocabulary: %d' % len(vocab))
# Reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New English Vocabulary: %d' % len(vocab))
# Mark out of vocabulary words
lines = update_dataset(lines, vocab)
# Save updated dataset
filename = 'english_vocab.pkl'
save_clean_sentences(lines, filename)
# Spot check
for i in range(20):
print("line", i, ":", lines[i])
Running this prints the full English vocabulary size, the reduced size after trimming to words seen at least 5 times, and then the first 20 updated lines as a spot check.
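To make the "mark out of vocabulary words" step concrete, here is a tiny toy run using the sketched helpers above; the data is made up purely for illustration.

# Toy demonstration of trimming and OOV marking (hypothetical data)
toy = ['the cat sat', 'the dog sat', 'a rare aardvark']
vocab = to_vocab(toy)              # Counter: the=2, sat=2, all others=1
vocab = trim_vocab(vocab, 2)       # keeps only {'the', 'sat'}
print(update_dataset(toy, vocab))  # ['the unk sat', 'the unk sat', 'unk unk unk']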
That takes care of the English dataset; now we repeat the same steps for the French one:
# Load French dataset
filename = 'French.pkl'
lines = load_clean_sentences(filename)
# Calculate vocabulary
vocab = to_vocab(lines)
print('French Vocabulary: %d' % len(vocab))
# Reduce vocabulary
vocab = trim_vocab(vocab, 5)
print('New French Vocabulary: %d' % len(vocab))
# Mark out of vocabulary words
lines = update_dataset(lines, vocab)
# Save updated dataset
filename = 'french_vocab.pkl'
save_clean_sentences(lines, filename)
# Spot check
for i in range(20):
print("line", i, ":", lines[i])
As before, this prints the full and trimmed French vocabulary sizes, followed by the first 20 updated lines as a spot check.
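Since the English and French blocks above differ only in filenames and labels, they could be folded into one reusable function. This is a refactor sketch of my own, not code from the original series:

def reduce_vocabulary(in_file, out_file, lang_name, min_occurrence=5):
    # Load a cleaned dataset, trim its vocabulary, mark OOV words, and save
    lines = load_clean_sentences(in_file)
    vocab = to_vocab(lines)
    print('%s Vocabulary: %d' % (lang_name, len(vocab)))
    vocab = trim_vocab(vocab, min_occurrence)
    print('New %s Vocabulary: %d' % (lang_name, len(vocab)))
    lines = update_dataset(lines, vocab)
    save_clean_sentences(lines, out_file)
    return lines

# One call per dataset replaces the two blocks above
reduce_vocabulary('English.pkl', 'english_vocab.pkl', 'English')
reduce_vocabulary('French.pkl', 'french_vocab.pkl', 'French')

Keeping min_occurrence as a parameter also makes it easy to experiment with a different cut-off per language.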