AI說書 - 從0開始 - 68 (AI Storytelling: Starting from 0, Part 68)

Learn AI 不 BI

Published in 三分鐘學AI

Updated 2024/07/01. Published 2024/07/01. Reading time: about 5 minutes.

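I want to share a little of "the technology that LLMs are stacked up from, starting at the bottom layer" each day, and keep every article under three minutes long, so that nobody feels too much pressure yet everyone can still grow a little every day.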

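Look closely at the Decoder's Multi-Head Attention box in AI說書 - 從0開始 - 66 and you will notice one line left unconnected. It does have a meaning: it is left dangling because it connects to the Encoder. The full picture is as follows: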

[Figure: the full Transformer, with the Decoder's Multi-Head Attention connected to the Encoder]

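This Attention block in the Decoder is called Cross Attention, to distinguish it from the Encoder's Self-Attention.
In this Cross Attention mechanism, two elements come from the Encoder (blue circles) and one element comes from the Decoder itself (green circle).
The next figure shows how the Cross Attention mechanism works.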

The Cross Attention mechanism works as follows:

[Figure: Cross Attention computation for the first output token]

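The Begin Token passes through Masked Attention to produce a vector, which is then multiplied by W^q to become the vector q.
a1, a2, a3 are each multiplied by W^k and W^v to become k¹, k², k³ and v¹, v², v³. Note that this part is the Encoder's territory.
q is multiplied (dot product) with each of k¹, k², k³, and the results are normalized to give a′₁, a′₂, a′₃.
a′₁, a′₂, a′₃ are multiplied by v¹, v², v³ respectively, and the products are summed to give v.
This is the procedure for the first token.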
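
The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration rather than the book's or any library's implementation: the dimensions, random weights, and function names are made up for the example, and the "normalization" step is assumed to be the scaled softmax used in the standard Transformer.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_attention_step(decoder_vec, encoder_outputs, W_q, W_k, W_v):
    """One Cross Attention step for a single Decoder position (hypothetical helper)."""
    q = decoder_vec @ W_q            # query comes from the Decoder side
    K = encoder_outputs @ W_k        # keys k1, k2, k3 come from the Encoder
    V = encoder_outputs @ W_v        # values v1, v2, v3 come from the Encoder
    scores = K @ q                   # dot product of q with every key
    # Assumption: normalization = softmax with the usual 1/sqrt(d_k) scaling.
    weights = softmax(scores / np.sqrt(q.shape[-1]))
    v = weights @ V                  # weighted sum of the values
    return v, weights

rng = np.random.default_rng(0)
d_model = 4                                      # toy size, chosen for illustration
encoder_outputs = rng.normal(size=(3, d_model))  # stands in for a1, a2, a3
begin_vec = rng.normal(size=d_model)             # Masked Attention output for the Begin Token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

v, weights = cross_attention_step(begin_vec, encoder_outputs, W_q, W_k, W_v)
print(weights)   # three attention weights, one per Encoder position
print(v.shape)   # (4,)
```

The three weights sum to 1, so v is a convex combination of the Encoder's value vectors.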

The second token is handled as follows:

[Figure: Cross Attention computation for the second output token]

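The Begin Token and 「機」 pass through Masked Attention to produce a vector, which is then multiplied by W^q to become the vector q′.
a1, a2, a3 are each multiplied by W^k and W^v to become k¹, k², k³ and v¹, v², v³. Note that this part is the Encoder's territory.
q′ is multiplied (dot product) with each of k¹, k², k³, and the results are normalized to give a new set of weights a′₁, a′₂, a′₃.
a′₁, a′₂, a′₃ are multiplied by v¹, v², v³ respectively, and the products are summed to give v′.
This is the procedure for the second token.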
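
Comparing the two walkthroughs, only the query changes between tokens, while the keys and values from the Encoder stay the same. The sketch below makes that explicit by stacking both Decoder positions into one matrix. The names and sizes are illustrative assumptions, and the normalization is again assumed to be the scaled softmax.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise numerically stable softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d_model = 4                                      # toy size, chosen for illustration
encoder_outputs = rng.normal(size=(3, d_model))  # stands in for a1, a2, a3
# Row 0: Masked Attention output for the Begin Token.
# Row 1: Masked Attention output for the position after 「機」.
decoder_vecs = rng.normal(size=(2, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Encoder side: computed once and reused for every Decoder position.
K = encoder_outputs @ W_k
V = encoder_outputs @ W_v

# Decoder side: one query per position (rows are q and q').
Q = decoder_vecs @ W_q

weights = softmax_rows(Q @ K.T / np.sqrt(d_model))  # shape (2, 3)
outputs = weights @ V                               # rows are v and v'
print(weights.sum(axis=1))  # each row of weights sums to 1
print(outputs.shape)        # (2, 4)
```

Because K and V depend only on the Encoder output, an implementation can cache them for the whole generation and recompute only the query for each new token.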

Finally, here are a few beautifully put passages from the textbook (Transformers for Natural Language Processing and Computer Vision, 2024), as a summary of the key points:

The multi-head attention sublayer 2 also only attends to the positions up to the current position the Transformer is predicting to avoid seeing the sequence it must predict.
The multi-head attention sublayer 2 draws information from the encoder by taking encoder (K, V) into account during the dot-product attention operations. This sublayer also draws information from the masked multi-head attention sublayer 1 (masked attention) by also taking sublayer 1 (Q) into account during the dot-product attention operations.
The linear layer produces an output sequence with a linear function that varies per model but relies on the standard method: y = wx + b.
At the top layer of the decoder, the transformer will reach the output layer, which will map the outputs of the model to the size of the vocabulary to produce the raw logits of the prediction.
The raw logits of the output can go through a softmax function, apply the values obtained to the tokens in the vocabulary, and choose the best probable token for the task requested, or apply sampling functions.
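
As a rough illustration of the last three passages, the sketch below maps one Decoder output vector to vocabulary-sized logits with y = wx + b, applies softmax, and then picks a token either greedily or by sampling. The vocabulary size, weights, and variable names are invented for the example and do not come from the book.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(2)
d_model, vocab_size = 4, 6            # toy sizes, chosen for illustration

x = rng.normal(size=d_model)          # Decoder output for the current position
W = rng.normal(size=(d_model, vocab_size))
b = np.zeros(vocab_size)

logits = x @ W + b                    # linear layer: y = wx + b, one raw logit per token
probs = softmax(logits)               # probabilities over the vocabulary

greedy_token = int(np.argmax(probs))                   # pick the most probable token
sampled_token = int(rng.choice(vocab_size, p=probs))   # or sample from the distribution

print(probs.round(3))
print(greedy_token, sampled_token)
```

Greedy selection always returns the same token for the same logits, whereas sampling can return any token with non-zero probability.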

The Transformer produces an output sequence of only one element at a time.
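
The "one element at a time" behaviour can be pictured as a loop that feeds everything generated so far back in and appends a single new token per iteration. In the sketch below, `toy_next_token_logits` is a hypothetical stand-in for a full Decoder pass (it simply favours the next token id), and the token ids are invented for the example.

```python
import numpy as np

BEGIN, END = 0, 4      # hypothetical ids for the Begin and End tokens
VOCAB_SIZE = 5

def toy_next_token_logits(generated):
    """Stand-in for a Decoder pass: puts all the weight on (last id + 1), capped at END."""
    logits = np.zeros(VOCAB_SIZE)
    logits[min(generated[-1] + 1, END)] = 1.0
    return logits

def generate(max_len=10):
    generated = [BEGIN]
    while len(generated) < max_len:
        # Each iteration sees the whole prefix and emits exactly one new token.
        next_token = int(np.argmax(toy_next_token_logits(generated)))
        generated.append(next_token)
        if next_token == END:
            break
    return generated

tokens = generate()
print(tokens)  # [0, 1, 2, 3, 4]
```

In a real model the stand-in would be replaced by Masked Attention, Cross Attention over the Encoder output, and the output layer, but the outer loop keeps this shape.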
