I want to share a little of the technology that makes up an LLM, built up from the bottom of the stack, every day, and keep each article to under three minutes of reading time, so that nobody feels overwhelmed, yet everyone can still grow a little each day.
A review of the material currently at hand:
After the Embedding step, the Cosine similarity between two words can be computed as follows:
import numpy as np
import pandas as pd
from gensim import matutils

# `model` is the Word2Vec model trained in the earlier embedding step.
words = ["method", "reason", "truth", "rightly", "science", "seeking"]
data = []
for i in range(len(words)):
    for j in range(len(words)):
        word1 = words[i]
        word2 = words[j]
        # Skip pairs where either word is missing from the vocabulary.
        if word1 not in model.wv or word2 not in model.wv:
            print(f"One or both words ('{word1}', '{word2}') are not in the model's vocabulary.")
            continue
        vec1 = model.wv[word1]
        vec2 = model.wv[word2]
        # Cosine similarity: dot product of the two unit-length vectors.
        similarity = np.dot(matutils.unitvec(vec1), matutils.unitvec(vec2))
        # Cosine distance is 1 minus the similarity.
        distance = 1 - similarity
        data.append({'word1': word1, 'word2': word2, 'distance': distance})

df = pd.DataFrame(data)
display(df)  # display() is available in Jupyter/IPython environments
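As a quick sanity check, gensim also exposes the same cosine similarity directly. A minimal sketch, assuming the same `model` object as above:

# Assumes the same trained `model` as in the snippet above.
# KeyedVectors.similarity returns the cosine similarity between two words.
print(model.wv.similarity("method", "reason"))
print(1 - model.wv.similarity("method", "reason"))  # the corresponding cosine distance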
The result is:
The Cosine similarity used above is computed as:
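$$
\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}
$$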
Its value always lies between -1 and 1.
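A minimal standalone sketch (plain NumPy, no trained model needed) illustrating those bounds; the vectors here are made-up examples:

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms,
    # equivalent to the unit-vector dot product used above.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            # same direction     ->  1.0
print(cosine_similarity(a, -a))                           # opposite direction -> -1.0
print(cosine_similarity(a, np.array([-3.0, 0.0, 1.0])))   # orthogonal         ->  0.0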