I want to share a small piece of "LLM technology, built up from the bottom of the stack" every day, keeping each article within a three-minute read, so that readers feel no pressure yet still grow a little each day.
First, a list of the materials currently on hand:
Next, start preparing the dataset:
!curl -L https://raw.githubusercontent.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition/master/Chapter08/gutenberg.org_cache_epub_4280_pg4280.html --output "gutenberg.org_cache_epub_4280_pg4280.html"
from bs4 import BeautifulSoup

# Read the downloaded Project Gutenberg HTML file and parse it
with open("gutenberg.org_cache_epub_4280_pg4280.html", 'r', encoding='utf-8') as file:
    file_contents = file.read()
soup = BeautifulSoup(file_contents, 'html.parser')
Next, clean the text:
import re
from nltk.tokenize import sent_tokenize  # run nltk.download('punkt') once beforehand
text = soup.get_text()                    # strip HTML tags, keep only the plain text
text = re.sub(r'\s+', ' ', text).strip()  # collapse all whitespace into single spaces
sentences = sent_tokenize(text)           # split the text into sentences
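As an optional sanity check (my own addition, not part of the original flow), it can help to eyeball the sentence list before building the training pairs:

# Illustrative check only: confirm the cleaning and sentence splitting look reasonable
print(f"Total sentences: {len(sentences)}")
print(sentences[:2])  # inspect the first two sentences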
There are a few points to note here:
However, OpenAI has strict requirements for the Prompt and Completion format of fine-tuning data, so the data is processed as follows:
import json

prompt_separator = " ->"
completion_ending = "\n"

# Pair consecutive sentences: the current sentence (plus the separator) becomes the prompt,
# and the next sentence (with a leading space and trailing newline) becomes the completion.
data = []
for i in range(len(sentences) - 1):
    data.append({"prompt": sentences[i] + prompt_separator,
                 "completion": " " + sentences[i + 1] + completion_ending})

# Write one JSON object per line (JSONL), the format expected for fine-tuning data
with open('kant_prompts_and_completions.json', 'w') as f:
    for line in data:
        f.write(json.dumps(line) + '\n')
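To verify the output matches the convention above, a small check like the following (a sketch I am adding, not part of the original notebook) reads back the first record:

import json

# Read back the first JSONL record and confirm the prompt/completion conventions
with open('kant_prompts_and_completions.json', 'r') as f:
    first = json.loads(f.readline())

assert first["prompt"].endswith(" ->")
assert first["completion"].startswith(" ") and first["completion"].endswith("\n")
print(first)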