I want to share a little of the "technology that LLMs are built on, from the ground up" every day, keeping each post to a three-minute read, so nobody feels pressured yet everyone grows a bit each day.
A quick recap of the material we have on hand so far:
Today, let's take a first look at the basic information we can inspect after Tokenization:
# `tokens` is the token list produced in the earlier tokenization step
unique_tokens = set(tokens)   # deduplicate to get the distinct tokens
print(len(unique_tokens))     # number of unique tokens
print(unique_tokens)
The result:
Next, we convert the tokens into embeddings:
from gensim.models import Word2Vec

# Train a Word2Vec model on the single token list:
# 300-dimensional vectors, keep every word (min_count=1), track loss while training
model = Word2Vec([tokens], compute_loss=True, vector_size=300, min_count=1)
model.save("descartes_word2vec.model")
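Since compute_loss=True was passed above, a quick sanity check, just a minimal sketch reusing the same model object, can read back the training loss that gensim tracked (the absolute number means little on a tiny corpus, but a finite value confirms training actually ran):

# Cumulative training loss, recorded because compute_loss=True was set
print(model.get_latest_training_loss())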
The key passage from the original text:
- Vocabulary is a list of all the unique words the model has learned from. Each word is related to a specific index in the model’s embedding matrix
- Word vectors (embeddings) are the actual word vectors the model learns during training, stored in a matrix in which each row represents a word in the vocabulary
- The saved model doesn’t include the original training data (the text you used to train it). It only saves what it learned from the data (word vectors), not the data itself
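To make those three points concrete, here is a minimal sketch (assuming the file saved above and a gensim 4.x API) that reloads descartes_word2vec.model and inspects the vocabulary and the embedding matrix; note that nothing in the loaded object contains the original sentences:

from gensim.models import Word2Vec

# Reload the model saved earlier; only the learned parameters come back
model = Word2Vec.load("descartes_word2vec.model")

# Vocabulary: each word maps to a row index in the embedding matrix
print(len(model.wv.key_to_index))   # vocabulary size
print(model.wv.key_to_index)        # {word: index, ...}

# Embedding matrix: one 300-dimensional row per vocabulary word
print(model.wv.vectors.shape)       # (vocab_size, 300)

# Look up the learned vector for any word in the vocabulary
some_word = next(iter(model.wv.key_to_index))
print(model.wv[some_word])          # its 300-dimensional embedding

The key_to_index mapping and the shape of wv.vectors line up directly with the first two bullets, and the loaded object holds only these learned weights, not the training text itself.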