我想要一天分享一點「LLM從底層堆疊的技術」,並且每篇文章長度控制在三分鐘以內,讓大家不會壓力太大,但是又能夠每天成長一點。
目前我們已經有資料集在 AI說書 - 從0開始 - 103 ,必要的清理函數在 AI說書 - 從0開始 - 104 ,現在把它們湊在一起,如下:
# load English data
filename = 'europarl-v7.fr-en.en'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('English data: sentences = %d, min = %d, max = %d' % (len(sentences), minlen, maxlen))
cleanf = clean_lines(sentences)
filename = 'English.pkl'
outfile = open(filename, 'wb')
pickle.dump(cleanf, outfile)
outfile.close()
print(filename, " saved")
輸出結果為:
上述是針對英文資料集的作法,以下針對法文資料集重做一遍:
# load French data
filename = 'europarl-v7.fr-en.fr'
doc = load_doc(filename)
sentences = to_sentences(doc)
minlen, maxlen = sentence_lengths(sentences)
print('French data: sentences = %d, min = %d, max = %d' % (len(sentences), minlen, maxlen))
cleanf = clean_lines(sentences)
filename = 'French.pkl'
outfile = open(filename, 'wb')
pickle.dump(cleanf, outfile)
outfile.close()
print(filename," saved")
輸出結果為: