I want to share a little of "LLM technology built up from the bottom of the stack" every day, keeping each article under three minutes to read, so that there is no pressure and yet everyone can still grow a bit each day.
Continuing from AI說書 - 從0開始 - 287 | Tokenizer 重要性範例之資料準備, we now run the tokenization:
import gensim
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# Read the corpus and replace newlines with spaces
# (assumes the NLTK 'punkt' data is already downloaded, as in the data-preparation step)
sample = open("text.txt", "r")
s = sample.read()
f = s.replace("\n", " ")

# Split the text into sentences, then into lowercase word tokens
data = []
for i in sent_tokenize(f):
    temp = []
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)

# Creating Skip Gram model (sg = 1 selects skip-gram instead of CBOW)
model2 = gensim.models.Word2Vec(data, min_count = 1, vector_size = 512, window = 5, sg = 1)
print(model2)
Here window = 5 limits the distance between the current word and the predicted word within a sentence, and print(model2) reports a one-line summary of the trained model.
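To look a bit deeper than that summary, here is a minimal inspection sketch (this assumes the gensim 4.x API, where the vectors live under model2.wv; these inspection calls are not part of the original example):

# Inspect the trained skip-gram model (gensim 4.x API assumed)
print(len(model2.wv))                       # vocabulary size
print(model2.wv.index_to_key[:10])          # the 10 most frequent tokens
vec = model2.wv[model2.wv.index_to_key[0]]  # embedding of the most frequent token
print(vec.shape)                            # (512,), matching vector_size = 512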
To check how well the embeddings turned out, we write a small function that computes the cosine similarity between two words:
def similarity(word1, word2):
    cosine = False
    try:
        a = model2.wv[word1]          # in gensim 4.x the word vectors live under .wv
        cosine = True
    except KeyError:
        print(word1, ":[unk] key not found in dictionary")
    try:
        b = model2.wv[word2]
    except KeyError:
        cosine = False
        print(word2, ":[unk] key not found in dictionary")
    if cosine == True:
        # cosine similarity computed manually from the dot product and the two norms
        dot = np.dot(a, b)
        norma = np.linalg.norm(a)
        normb = np.linalg.norm(b)
        cos = dot / (norma * normb)
        # cross-checked with scikit-learn's cosine_similarity on (1, 512) matrices
        aa = a.reshape(1, 512)
        ba = b.reshape(1, 512)
        cos_lib = cosine_similarity(aa, ba)
    if cosine == False:
        cos_lib = 0
    return cos_lib
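A quick usage sketch follows; the two query words are only placeholders, and whether they exist in the vocabulary depends entirely on the contents of text.txt:

# hypothetical query words -- replace with lowercase tokens that actually occur in text.txt
print(similarity("freedom", "liberty"))   # a 1x1 array such as [[0.23]] when both words are found
print(similarity("freedom", "qwerty"))    # prints the [unk] message and returns 0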