The Key to Unlocking Every Possibility: An 8K Text Embedding Model

Updated 2024/10/30 | Published 2023/10/31 | 20 min read

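Introduction:

Checking whether two long texts say the same thing has been a long-standing problem. Without a good embedding model it takes a great deal of manual effort: someone has to read each long document and classify it by hand. This article shows how an embedding model addresses that pain point and offers some ideas for extended applications, so the payoff for the time invested is high. I hope you come away with plenty of new application ideas!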

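Technical overview:

An embedding model computes a feature vector [a₁, a₂, ..., a_n] from a piece of text. Each component a₁ through a_n can be thought of as the length of the text's projection onto one of the learned feature directions. By comparing the embedding vectors of two texts, you can quickly tell whether their content is similar.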
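To make the comparison step concrete, here is a minimal sketch of cosine similarity on toy vectors. The three-dimensional vectors are illustrative values chosen by hand, not output from the model, and the helper name `cosine_similarity` is an assumption for this example.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional "embeddings" (hand-picked, not model output)
doc_a = np.array([1.0, 2.0, 3.0])
doc_b = np.array([2.0, 4.0, 6.0])   # points in the same direction as doc_a
doc_c = np.array([3.0, 0.0, -1.0])  # orthogonal to doc_a

same_direction_score = cosine_similarity(doc_a, doc_b)
orthogonal_score = cosine_similarity(doc_a, doc_c)
print(same_direction_score, orthogonal_score)
```

Vectors pointing the same way score close to 1 regardless of their length, while orthogonal vectors score close to 0, which is why the measure works for texts of different sizes.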

The text embedding model introduced here is based on the BERT architecture and uses the symmetric bidirectional variant of Attention with Linear Biases (ALiBi), which allows training on shorter sequences and testing on longer ones.

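It was trained with a sequence length of 512 and extrapolates to a sequence length of 8192 (or even longer), which makes it especially useful when you need to process long documents.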
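The following sketch shows why ALiBi can extrapolate in length. It assumes the symmetric variant penalizes attention scores by slope times the absolute distance |i - j|, with the per-head slopes 2^(-8k/n) from the ALiBi paper for a head count n that is a power of two. The function names are hypothetical, and this is not the model's actual implementation.

```python
import numpy as np


def alibi_slopes(num_heads: int) -> np.ndarray:
    """Per-head slopes 2^(-8k/n) for k = 1..n (n assumed to be a power of two)."""
    k = np.arange(1, num_heads + 1)
    return 2.0 ** (-8.0 * k / num_heads)


def symmetric_alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Bias of shape (heads, seq_len, seq_len) with entries -slope * |i - j|."""
    positions = np.arange(seq_len)
    distance = np.abs(positions[:, None] - positions[None, :])
    return -alibi_slopes(num_heads)[:, None, None] * distance[None, :, :]


# The bias depends only on relative distance, so a longer sequence
# simply extends the same pattern instead of needing new parameters.
bias_short = symmetric_alibi_bias(seq_len=4, num_heads=8)
bias_long = symmetric_alibi_bias(seq_len=16, num_heads=8)
print(bias_short[0])
```

Because no learned position embedding is tied to a maximum length, the bias for a long input contains the short-input bias as its top-left corner.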

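Use cases include long-document retrieval, semantic textual similarity, text reranking, recommendation, and RAG and LLM-based generative search, among many other applications.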
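As one illustration of the retrieval use case, the sketch below ranks documents by cosine similarity to a query. The vectors are toy values standing in for embeddings, and `rank_by_similarity` is a hypothetical helper, not part of any library.

```python
import numpy as np


def rank_by_similarity(query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 3):
    """Return (document index, cosine score) pairs, best match first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:top_k]  # indices of the highest scores
    return [(int(i), float(scores[i])) for i in order]


# Toy stand-ins for document embeddings, one row per document
doc_vectors = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.7, 0.7, 0.0],
        [0.0, 0.0, 1.0],
    ]
)
query_vector = np.array([1.0, 0.1, 0.0])

results = rank_by_similarity(query_vector, doc_vectors, top_k=2)
print(results)
```

In a real pipeline the rows of `doc_vectors` would come from encoding each document once and caching the result, so only the query needs to be encoded at search time.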

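The model's backbone was pretrained on the C4 dataset and then further trained on Jina AI's collection of more than 400 million positive and negative sentence pairs. These pairs come from a variety of domains and were thoroughly cleaned and carefully selected. With a standard size of 137 million parameters, the model enables fast inference; running inference on a single GPU is recommended.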

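A simple trial:

Testing sentences with very similar meaning

We expect a high cosine similarity score.

(To keep the demo short, the tests here use single English sentences; you can just as well pass in two long articles to compare.)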

How is the weather today? vs. What is the current weather like today?

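Both sentences ask about today's weather, and the cosine similarity is a high 0.934 (cosine similarity ranges from -1 to 1; higher means more similar).

The output also shows that the embedding vector has 768 dimensions.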

from transformers import AutoModel
from numpy.linalg import norm
import numpy as np

cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)  # trust_remote_code is needed to use the encode method
embeddings = model.encode(["How is the weather today?", 
                           "What is the current weather like today?"])

print(np.shape(embeddings[0]))
print(cos_sim(embeddings[0], embeddings[1]))

output:

(768,)
0.9341313

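Testing sentences with unrelated meaning

We expect a low cosine similarity score.

(As before, single English sentences are used to keep the demo short; two long articles work the same way.)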

I drew a dog. vs. The storm is coming.

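These two sentences express completely unrelated meanings, and the computed score is lower. With this model, a score below roughly 0.7 suggests the two sentences have very little to do with each other.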

embeddings = model.encode(["I drew a dog.", 
                           "The storm is coming."])
print(cos_sim(embeddings[0], embeddings[1]))

output:

0.6212821

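Building a recommender system:

The Keras Embedding layer in the example below only accepts integer indices. If your input is text, such as movie titles and genre descriptions, you can encode it with model.encode instead, turning each movie description into a 768-dimensional feature vector, and then take its dot product with the User_Id embedding. Adapting the example below this way, you can build a product recommender system. As for the dataset, you need to collect users' preferences somehow, for example through questionnaires.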

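from transformers import AutoModel

# Named text_model so it does not clash with the Keras `model` defined below
text_model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)  # trust_remote_code is needed to use the encode method
# movie_descriptions is a placeholder: a list of strings, one description per movie
movie_embeddings = text_model.encode(movie_descriptions)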

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow import keras
from keras.layers import Input, Embedding, Flatten, Dot, Dense
from keras.models import Model
from zipfile import ZipFile
from pathlib import Path

# Download the data from http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
# This example uses the ratings.csv file
movielens_data_file_url = (
    "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
)
movielens_zipped_file = keras.utils.get_file(
    "ml-latest-small.zip", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"

# Extract the archive only if the data folder does not exist yet
if not movielens_dir.exists():
    with ZipFile(movielens_zipped_file, "r") as zip_file:
        # Unzip into the Keras datasets folder
        print("Extracting all the files now...")
        zip_file.extractall(path=keras_datasets_path)
        print("Done!")

ratings_file = movielens_dir / "ratings.csv"

# Load the movie rating data
data = pd.read_csv(ratings_file)
print("Columns of data: ", data.columns)

# Build contiguous integer indices for users and movies
user_ids = data["userId"].unique()
movie_ids = data["movieId"].unique()

user_id_map = {user_id: i for i, user_id in enumerate(user_ids)}
movie_id_map = {movie_id: i for i, movie_id in enumerate(movie_ids)}

data["user"] = data["userId"].map(user_id_map)
data["movie"] = data["movieId"].map(movie_id_map)

# Split into training and test sets
train, test = train_test_split(data, test_size=0.2, random_state=42)

# Build the model
EMBEDDING_SIZE = 768
user_input = Input(shape=(1,))
user_embedding = Embedding(len(user_ids), EMBEDDING_SIZE)(user_input)
user_embedding = Flatten()(user_embedding)

movie_input = Input(shape=(1,))
movie_embedding = Embedding(len(movie_ids), EMBEDDING_SIZE)(movie_input)
movie_embedding = Flatten()(movie_embedding)

dot_product = Dot(axes=1)([user_embedding, movie_embedding])
model = Model(inputs=[user_input, movie_input], outputs=dot_product)

model.compile(optimizer="adam", loss="mean_squared_error")

# Train the model
model.fit(
    [train["user"], train["movie"]],
    train["rating"],
    epochs=5,
    batch_size=64,
    validation_data=([test["user"], test["movie"]], test["rating"]),
)

# Recommend movies for one user
user_index = 1  # internal user index (a value of data["user"]), not the raw userId

movies_watched = data[data["user"] == user_index]["movie"].unique()
unwatched_movies = data[~data["movie"].isin(movies_watched)]["movie"].unique()

# Repeat the user index once per candidate movie
user_index_array = np.full(len(unwatched_movies), user_index)
unwatched_movies_array = np.array(unwatched_movies)

predictions = model([user_index_array, unwatched_movies_array])

# Map internal movie indices back to the original MovieLens movieId values
movie_recommendations = pd.DataFrame(
    {
        "movieId": movie_ids[unwatched_movies],
        "predicted_rating": np.ravel(predictions),
    }
)
movie_recommendations = movie_recommendations.sort_values(
    by="predicted_rating", ascending=False
)
top_10_recommendations = movie_recommendations.head(10)

print(top_10_recommendations)
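The Keras example above learns movie embeddings from integer IDs. The sketch below illustrates the variation described at the start of this section, where movie vectors are fixed text embeddings and only the user vectors are learned. Everything here is an assumption for illustration: the movie vectors are random stand-ins for model.encode output, the ratings are synthetic, and the dimension is shrunk from 768 to 16 so the demo runs instantly.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_movies, dim = 5, 8, 16  # toy sizes; real text embeddings have 768 dims

# Stand-in for text embeddings of movie descriptions: fixed unit-length vectors
movie_vectors = rng.normal(size=(num_movies, dim))
movie_vectors /= np.linalg.norm(movie_vectors, axis=1, keepdims=True)

# Synthetic ratings generated from hidden user tastes (made up for this demo)
hidden_user_vectors = rng.normal(size=(num_users, dim))
ratings = hidden_user_vectors @ movie_vectors.T


def mse(user_vectors: np.ndarray) -> float:
    """Mean squared error between dot-product predictions and the ratings."""
    return float(np.mean((user_vectors @ movie_vectors.T - ratings) ** 2))


# Learn user vectors by gradient descent while the movie vectors stay frozen
user_vectors = np.zeros((num_users, dim))
initial_loss = mse(user_vectors)
learning_rate = 0.5
for _ in range(200):
    error = user_vectors @ movie_vectors.T - ratings   # shape (users, movies)
    # Gradient of each user's mean squared error with respect to that user's vector
    gradient = 2.0 * error @ movie_vectors / num_movies
    user_vectors -= learning_rate * gradient
final_loss = mse(user_vectors)

print(initial_loss, final_loss)
```

Freezing the movie side means a new movie can be scored for every user as soon as its description is encoded, without retraining; the trade-off is that the model can no longer adapt movie vectors to rating data.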

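Using reinforcement learning:

The goal is a machine that understands human speech. When it hears a specific concept, the words are converted into an embedding vector and used as input, and after reinforcement learning the machine starts to carry out the corresponding action. For example, tell Asurada "open the booster" or "let's enter the Zero Zone together," and Asurada will use the environment data to intelligently prepare the supporting measures for the command, so that the protagonist Hayato Kazami can fire the booster with confidence.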

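state: environment data plus Hayato Kazami's voice command (speech to text, text to embedding: "open the booster")
action: open the booster, along with other measures coordinated with the environment
reward: high instantaneous speed without hitting the guardrail; fast cornering

state: environment data plus Hayato Kazami's voice command (speech to text, text to embedding: "fan-lift cornering")
action: perform fan-lift cornering, along with other measures coordinated with the environment
reward: high instantaneous speed without hitting the guardrail; fast cornering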
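The state described above can be sketched as a concatenation of environment features and a command embedding. This is only an illustration under stated assumptions: `toy_command_embedding` is a hypothetical hash-based stand-in for model.encode, the three environment features and the action names are invented, and the policy is a plain epsilon-greedy choice over linear scores rather than a trained agent.

```python
import hashlib

import numpy as np

EMBED_DIM = 8  # toy size; the real model produces 768 dimensions
ACTIONS = ["open_booster", "fan_lift_cornering", "do_nothing"]


def toy_command_embedding(text: str) -> np.ndarray:
    """Hypothetical stand-in for model.encode: a deterministic unit vector per string."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    vec = np.random.default_rng(seed).normal(size=EMBED_DIM)
    return vec / np.linalg.norm(vec)


def build_state(env_features: np.ndarray, command: str) -> np.ndarray:
    """State = environment data followed by the embedded voice command."""
    return np.concatenate([env_features, toy_command_embedding(command)])


def choose_action(state, weights, epsilon, rng) -> int:
    """Epsilon-greedy: explore with probability epsilon, else take the best linear score."""
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(weights @ state))


# Invented environment features: speed, distance to guardrail, track curvature
env = np.array([0.9, 0.3, 0.5])
state = build_state(env, "open the booster")

policy_weights = np.zeros((len(ACTIONS), state.size))  # untrained policy
action_index = choose_action(state, policy_weights, epsilon=0.0, rng=np.random.default_rng(0))
print(state.shape, ACTIONS[action_index])
```

During training, the reward (for example high speed with no guardrail contact) would be used to update `policy_weights`, so that different command embeddings come to favor different actions.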

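Papers that could make use of this technique (not limited to these):

The reflection process includes a reflection tag that checks relevance, and the embedding model here can serve that check.

In DALL-E 3, you can compute the cosine distance between the text before and after upsampling; once the distance converges, enough description of the image has been added.

Likewise, the distilled supervised fine-tuning step can use the embedding model to decide when the distillation loop can stop.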
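The stopping idea in the last two items can be sketched as a convergence check on successive embeddings. The function names, the tolerance of 0.01, and the two-dimensional toy vectors are all assumptions for illustration; real inputs would be embeddings of each text revision.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """One minus the cosine similarity of two 1-D vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def first_converged_step(embeddings, tolerance: float = 0.01):
    """Index of the first revision that barely differs from the previous one, or None."""
    for step in range(1, len(embeddings)):
        if cosine_distance(embeddings[step - 1], embeddings[step]) < tolerance:
            return step
    return None


# Toy embeddings of four successive revisions of a text (hand-picked values)
revision_embeddings = [
    np.array([1.0, 0.0]),
    np.array([0.6, 0.8]),
    np.array([0.5, 0.86]),
    np.array([0.5, 0.87]),
]

stop_at = first_converged_step(revision_embeddings)
print(stop_at)
```

The right tolerance depends on the model and the task, so it is worth calibrating on a few examples where you already know whether two revisions are effectively the same.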

Other embedding models:

V2 (Based on JinaBert, 8K Seq)

jina-embeddings-v2-small-en: 33 million parameters.
jina-embeddings-v2-base-en: 137 million parameters (you are here).
jina-embeddings-v2-large-en: 435 million parameters (releasing soon).

Citation:

@article{DBLP:journals/corr/abs-2108-12409,
  author       = {Ofir Press and
                  Noah A. Smith and
                  Mike Lewis},
  title        = {Train Short, Test Long: Attention with Linear Biases Enables Input
                  Length Extrapolation},
  journal      = {CoRR},
  volume       = {abs/2108.12409},
  year         = {2021},
  url          = {https://arxiv.org/abs/2108.12409},
  eprinttype    = {arXiv},
  eprint       = {2108.12409},
  timestamp    = {Thu, 02 Sep 2021 14:42:29 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2108-12409.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

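If you are passionate about AI and do not want to waste time while learning, I can share quite a few ideas from my own experience. Feel free to leave a message in the Facebook group.

If you would rather talk directly over Zoom and have your questions answered on the spot, you can also book a time (1 hour) at the link below.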

https://calendly.com/universe_ai/free_appointment
