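I want to share a little of "the technology LLMs are built on, from the bottom of the stack up" every day, and keep each post under a three-minute read, so that nobody feels too much pressure but everyone can still grow a little each day.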
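- AI Storytelling - Starting from 0 - 338 | Embedding Based Search: Dataset Description
- AI Storytelling - Starting from 0 - 339 | Embedding Based Search: Dataset Preparation
- AI Storytelling - Starting from 0 - 340 | Embedding Based Search: Dataset Encoding
- AI Storytelling - Starting from 0 - 341 | Embedding Based Search: Running the Embedding and Saving It
- AI Storytelling - Starting from 0 - 342 | Embedding Based Search: Data Cleaning
- AI Storytelling - Starting from 0 - 343 | Embedding Based Search: K-Means Clustering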
To visualize the data, we run t-SNE dimensionality reduction, going from 1536 dimensions down to 2:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib
import matplotlib.pyplot as plt

# `matrix` (the embedding matrix) and `df` (with its `Cluster` column) are
# assumed to have been built in the earlier posts of this series.
# Reduce the 1536-dimensional embeddings to 2 dimensions.
tsne = TSNE(n_components = 2, perplexity = 15, random_state = 42, init = "random", learning_rate = 200)
vis_dims2 = tsne.fit_transform(matrix)

x = [x for x, y in vis_dims2]
y = [y for x, y in vis_dims2]

# Plot each cluster in its own color and mark the cluster mean with an "x".
for category, color in enumerate(["purple", "green", "red", "blue"]):
    xs = np.array(x)[df.Cluster == category]
    ys = np.array(y)[df.Cluster == category]
    plt.scatter(xs, ys, color = color, alpha = 0.3)

    avg_x = xs.mean()
    avg_y = ys.mean()
    plt.scatter(avg_x, avg_y, marker = "x", color = color, s = 100)

plt.title("Clusters identified visualized in language 2d using t-SNE")
The result is:

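To try the same reduction without the dataset from the earlier posts, here is a minimal sketch on synthetic data. The names `matrix_demo`, `labels`, and `vis_demo` are hypothetical stand-ins for illustration only; the four random blobs are made up and are not the embeddings used above. It only checks that t-SNE maps 1536 dimensions to 2 and that per-cluster means can be taken with a boolean mask, as in the plotting loop.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Hypothetical stand-in for the embedding matrix:
# 4 blobs of 50 points each in 1536 dimensions (200 rows in total).
centers = rng.normal(size=(4, 1536))
matrix_demo = np.vstack([c + 0.1 * rng.normal(size=(50, 1536)) for c in centers])

# Cluster labels play the role of df.Cluster in the post.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(matrix_demo)

# Same t-SNE settings as in the post: 1536 dimensions down to 2.
tsne = TSNE(n_components=2, perplexity=15, random_state=42,
            init="random", learning_rate=200)
vis_demo = tsne.fit_transform(matrix_demo)

print(matrix_demo.shape, "->", vis_demo.shape)

# Per-cluster size and mean position in the 2-D space,
# i.e. where the "x" markers would be drawn.
for category in range(4):
    mask = labels == category
    print(category, int(mask.sum()), vis_demo[mask].mean(axis=0))
```

The exact 2-D coordinates depend on the scikit-learn version, so only the shapes are meaningful here; t-SNE preserves local neighborhoods rather than global distances, so the gaps between clusters in the plot should not be read as true distances.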