TextToSpeech-Word Embedding

賴靖融

Published in AI之路有你有我

Published 2024/05/28, updated 2024/05/28, 17 min read

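I. Introduction

Last time we saw that Word Embedding replaces dictionary-index representations of words with word vectors, and that these vectors carry a certain amount of semantic information. Today, let's look at how a Word Embedding is actually trained.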

II. Common Word Embedding Techniques

1. Word2Vec

Word2Vec, proposed by Google, comes in two model variants: CBOW (Continuous Bag of Words) and Skip-gram.

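CBOW: predicts the target word from its context. It trains quickly and tends to do well on frequent words.
Skip-gram: predicts the context from the target word. It is slower to train, but usually represents rare words better and holds up well even with relatively little training data.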

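2. GloVe

GloVe (Global Vectors for Word Representation) is a Word Embedding method from Stanford based on a co-occurrence matrix. It trains its vectors from statistics on how often words co-occur across the entire corpus.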
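To make "co-occurrence statistics" concrete, here is a minimal sketch that tallies how often two tokens appear within a fixed window of each other. It covers only the counting step, not GloVe's training objective, and the function name `count_cooccurrences`, the window size and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter

def count_cooccurrences(tokenized_sentences, window_size=2):
    """Count how often each (word, neighbor) pair appears within window_size tokens."""
    counts = Counter()
    for sentence in tokenized_sentences:
        for i, word in enumerate(sentence):
            start = max(0, i - window_size)
            end = min(len(sentence), i + window_size + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own neighbor
                    counts[(word, sentence[j])] += 1
    return counts

# Hypothetical two-sentence corpus, already tokenized
toy_corpus = [["人", "是", "動物"], ["貓", "是", "動物"]]
cooc = count_cooccurrences(toy_corpus, window_size=1)
print(cooc[("是", "動物")])
```

Because every pair is counted from both sides, the resulting table is symmetric: `cooc[(a, b)]` equals `cooc[(b, a)]`.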

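3. FastText

FastText is an improved version of Word2Vec proposed by Facebook. It considers not only whole words but also the character n-grams inside each word, which lets the model handle words it has never seen (OOV, out-of-vocabulary).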
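As a rough illustration of what "character n-grams inside a word" means, the sketch below wraps a word in `<` and `>` boundary markers and slices out every substring of a given length. This is not FastText's implementation (the real library also hashes the n-grams into a fixed number of buckets, which is omitted here); `char_ngrams` and its parameters are names made up for this example, and the 3-to-6 default range is an assumption.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of word, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

# Only 3-grams, to keep the output short
print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word still shares many of these pieces with known words, so a vector can be assembled for it from the n-gram vectors.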

This post focuses on Word2Vec.

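III. Word2Vec

Word2Vec has two training modes: Skip-gram and Continuous Bag of Words (CBOW).

Skip-Gram

The goal of Skip-Gram is to predict the context words from a given center word. Concretely, given one word, the model tries to predict the words within a certain range before and after it.

For example, in the sentence "人是動物" ("humans are animals"), the center word "是" ("are") has the context words "人" ("humans") and "動物" ("animals").

Continuous Bag of Words (CBOW)

The goal of CBOW is to predict the center word from the given context words. Concretely, given a set of context words, the model tries to predict the word in the middle.

For example, in the sentence "人是動物", the context words "人" and "動物" are used to predict the center word "是".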
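The CBOW setup just described can be sketched as a small sample generator: for each position it collects the surrounding words as the input and the word itself as the target. The rest of this post implements Skip-Gram, so this is only a side illustration; `generate_cbow_samples` is a hypothetical helper, not part of the code used later.

```python
def generate_cbow_samples(tokens, window_size=1):
    """Return (context_words, center_word) samples for one tokenized sentence."""
    samples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        context = left + right
        if context:  # skip one-word sentences, which have no context
            samples.append((context, center))
    return samples

for context, center in generate_cbow_samples(["人", "是", "動物"]):
    print(context, "->", center)
```

The middle sample is exactly the example above: the context ["人", "動物"] is used to predict "是". Skip-Gram simply flips each sample around, pairing the center word with each context word in turn.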

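In actual training, taking Skip-Gram as the example: for the sentence "人是一種從動物進化的生物" ("humans are a kind of creature that evolved from animals"), if we use "人" as the center word and set the window to 2, we get the two pairs [人, 是] and [人, 一種] to train on. Let me walk through it step by step: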

import jieba

# Prepare the training data
sentences = [
    "臺灣鐵路已放棄興建的路線",
    "人是一種從動物進化的生物",
    "你是英文系的，可以幫我翻譯一下嗎？"
]
# Tokenize each sentence into a list of words
tokenized_sentences = [list(jieba.cut(sentence)) for sentence in sentences]
print(tokenized_sentences)

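In the previous post we stopped here and handed everything over to a library. Now let's see what that library was doing for us.

At this point we have the tokenized sentences: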

['臺灣', '鐵路', '已', '放棄', '興建', '的', '路線']

['人', '是', '一種', '從', '動物', '進化', '的', '生物']

['你', '是', '英文系', '的', '，', '可以', '幫', '我', '翻譯', '一下', '嗎', '？']

Next, we need to build a vocabulary from these tokens:

from collections import Counter

# Flatten the list of sentences into a single list of words
words = [word for sentence in tokenized_sentences for word in sentence]
# Collapse duplicates and count word frequencies
word_counts = Counter(words)
# Sort by frequency (most frequent first)
vocab = sorted(word_counts, key=word_counts.get, reverse=True)
# Assign an index to each word
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}
vocab_size = len(vocab)

After this step, our vocabulary looks like this:

 0: '的'     1: '是'      2: '臺灣'    3: '鐵路'    4: '已'     5: '放棄'
 6: '興建'   7: '路線'    8: '人'      9: '一種'   10: '從'    11: '動物'
12: '進化'  13: '生物'   14: '你'     15: '英文系' 16: '，'    17: '可以'
18: '幫'    19: '我'     20: '翻譯'   21: '一下'   22: '嗎'    23: '？'
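One thing to keep in mind is that this vocabulary only covers words that appeared in the three sentences, so looking up any other word raises a KeyError. A common workaround is to reserve an index for unknown words. The sketch below shows one way this might look; `UNK_TOKEN`, `build_vocab` and `encode` are hypothetical names and are not used in the rest of this post.

```python
from collections import Counter

UNK_TOKEN = "<unk>"  # hypothetical placeholder for out-of-vocabulary words

def build_vocab(tokenized_sentences):
    """Map words to indices by descending frequency, reserving index 0 for UNK_TOKEN."""
    counts = Counter(w for s in tokenized_sentences for w in s)
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: i for i, word in enumerate([UNK_TOKEN] + ordered)}

def encode(tokens, mapping):
    """Convert tokens to indices, falling back to the UNK index for unseen words."""
    unk_idx = mapping[UNK_TOKEN]
    return [mapping.get(tok, unk_idx) for tok in tokens]

demo_vocab = build_vocab([["人", "是", "動物"], ["貓", "是", "動物"]])
# "狗" never appeared in the toy corpus, so it maps to index 0
print(encode(["狗", "是", "動物"], demo_vocab))
# [0, 1, 2]
```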

Next we design the training-data generator, which has to follow the Skip-Gram training scheme:

import numpy as np

def generate_training_data(corpus, word_to_idx, window_size=2):
    """Build Skip-Gram (center, context) index pairs from tokenized sentences."""
    training_data = []
    for sentence in corpus:
        sentence_indices = [word_to_idx[word] for word in sentence]
        for center_pos, center_word in enumerate(sentence_indices):
            # Visit every position within window_size of the center word
            for w in range(-window_size, window_size + 1):
                context_pos = center_pos + w
                # Skip positions outside the sentence and the center word itself
                if context_pos < 0 or context_pos >= len(sentence_indices) or w == 0:
                    continue
                training_data.append((center_word, sentence_indices[context_pos]))
    return np.array(training_data)

training_data = generate_training_data(tokenized_sentences, word_to_idx)
print(training_data)

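The final output looks like this:

[[ 2 3] [ 2 4] [ 3 2] [ 3 4] ...... [22 20] [22 21] [22 23] [23 21] [23 22]]

As described earlier, each row pairs the index of a center word with the index of one word inside the defined window_size.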

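We also need a negative-sample generator, which is there to make training more efficient. The idea is simple. Under the standard loss (a softmax over the whole vocabulary), every update would in theory have to score the center word against every word in the vocabulary, which is inefficient once the vocabulary gets large. Negative sampling avoids this: when computing the loss we no longer score every word, but randomly draw a few words that are treated as unrelated. The model's goal becomes maximizing the similarity between the center word and the positive sample while minimizing its similarity to the negative samples. The implementation is as follows: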

import numpy as np
import torch

def get_negative_samples(batch_size, num_neg_samples, vocab_size):
    # Draw word indices uniformly at random, with replacement
    neg_samples = np.random.choice(vocab_size, size=(batch_size, num_neg_samples), replace=True)
    return torch.tensor(neg_samples, dtype=torch.long)
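The generator above draws indices uniformly, so it can occasionally pick the true context word or the center word itself as a "negative". The original Word2Vec work instead samples from the word-frequency distribution raised to the 3/4 power. The sketch below shows roughly how that, plus excluding the positive pair, might look using only the standard library. It is an alternative for comparison, not the method used in this post, and `build_sampling_weights`, `sample_negatives` and the toy counts are hypothetical.

```python
import random
from collections import Counter

def build_sampling_weights(word_counts, power=0.75):
    """Return (words, weights) with each weight proportional to count ** power."""
    words = list(word_counts)
    weights = [word_counts[w] ** power for w in words]
    return words, weights

def sample_negatives(words, weights, exclude, k=5, rng=random):
    """Draw k negatives by weight, redrawing any candidate found in exclude.

    Assumes at least one word is outside exclude, otherwise this never returns.
    """
    negatives = []
    while len(negatives) < k:
        candidate = rng.choices(words, weights=weights, k=1)[0]
        if candidate not in exclude:
            negatives.append(candidate)
    return negatives

toy_counts = Counter({"的": 3, "是": 2, "人": 1, "動物": 1, "生物": 1})
toy_words, toy_weights = build_sampling_weights(toy_counts)
# Negatives for the pair (center="人", context="是"); neither may be drawn
print(sample_negatives(toy_words, toy_weights, exclude={"人", "是"}, k=3))
```

Raising counts to the 3/4 power flattens the distribution a little, so very frequent words such as "的" are still drawn more often than rare ones, but less overwhelmingly so than with raw frequencies.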

That finally wraps up data preprocessing and training-data preparation. Next we can design the neural network itself:

import torch
import torch.nn as nn

class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Two separate tables: one for center words, one for context (and negative) words
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, center_words, context_words, neg_samples):
        center_embeds = self.center_embeddings(center_words)      # (batch, dim)
        context_embeds = self.context_embeddings(context_words)   # (batch, dim)
        neg_embeds = self.context_embeddings(neg_samples)         # (batch, num_neg, dim)

        # Dot product of each context vector with its center vector
        pos_scores = torch.bmm(context_embeds.unsqueeze(1), center_embeds.unsqueeze(2)).squeeze()

        # Raw dot products with the negatives. They are not negated here because
        # negative_sampling_loss already applies the minus sign via logsigmoid(-neg_scores);
        # negating in both places would push negatives closer instead of apart.
        neg_scores = torch.bmm(neg_embeds, center_embeds.unsqueeze(2)).squeeze()

        return pos_scores, neg_scores

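As you can see, the network has two parts: an Embedding layer for the center word and an Embedding layer for the other (context) words. The two are independent and share no weights. (nn.Embedding takes vocabulary indices directly as input, so there is no need to convert the input into a vocab_size-dimensional one-hot vector first.) They output the center word's vector and the other words' vectors respectively. As explained at the start, Skip-Gram trains by predicting the other words from the center word.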
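To see why an index lookup can stand in for a one-hot input, the small sketch below multiplies a one-hot vector by a weight matrix and compares the result with simply picking the corresponding row. It uses plain Python lists rather than PyTorch, and the 4-word, 3-dimensional matrix is made-up data for illustration.

```python
def one_hot(index, size):
    """Return a one-hot list of length size with 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_times_matrix(one_hot_vec, weight):
    """Multiply a one-hot row vector by a (vocab_size x dim) weight matrix."""
    dim = len(weight[0])
    return [
        sum(one_hot_vec[row] * weight[row][col] for row in range(len(weight)))
        for col in range(dim)
    ]

# Hypothetical embedding table: 4 words, 3 dimensions each
toy_weight = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
toy_idx = 2
via_matmul = one_hot_times_matrix(one_hot(toy_idx, len(toy_weight)), toy_weight)
via_lookup = toy_weight[toy_idx]
print(via_matmul == via_lookup)
# True
```

The multiplication just selects one row, so an embedding layer can skip the one-hot vector and index the row directly, which is far cheaper when the vocabulary is large.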

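So inside forward, we take the dot product of the center word's vector with the context word's vector (implemented as a batched matrix multiplication) and use the result as the score. We run the same computation on the negative samples mentioned above to get the negative scores, and return both. The loss function then has to maximize the positive scores and minimize the negative scores. It looks like this: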

import torch.nn.functional as F

def negative_sampling_loss(pos_scores, neg_scores):
    pos_loss = -F.logsigmoid(pos_scores).mean()
    neg_loss = -F.logsigmoid(-neg_scores).mean()
    return pos_loss + neg_loss
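Written out for a single training pair, the function above computes the expression below. Here $s^{+}$ stands for the positive score, $s^{-}_{k}$ for the $K$ negative scores (taken to be raw dot products), and $\sigma$ for the sigmoid. The symbols are introduced here only to restate the code.

```latex
\mathcal{L} = -\log \sigma\left(s^{+}\right) \;-\; \frac{1}{K} \sum_{k=1}^{K} \log \sigma\left(-s^{-}_{k}\right)
```

The original negative-sampling formulation sums over the negatives instead of averaging them, which only changes the second term by the constant factor $1/K$.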

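We use the logsigmoid function, which squashes each score into the range 0 to 1 with a sigmoid and takes the log in a single, numerically stable step. A loss built this way reflects similarity well and is easy to optimize with gradient descent.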
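For a feel of how this loss behaves, here is a small standard-library sketch that evaluates the same expression on hand-picked scores. `log_sigmoid` and `toy_loss` are stand-ins written for this example (they mirror, but are not, the PyTorch functions), and the scores are made-up numbers assumed to be raw dot products.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))

def toy_loss(pos_score, neg_scores):
    """-log sigmoid(pos) plus the mean of -log sigmoid(-neg) over the negatives."""
    pos_loss = -log_sigmoid(pos_score)
    neg_loss = -sum(log_sigmoid(-s) for s in neg_scores) / len(neg_scores)
    return pos_loss + neg_loss

# Desired situation: positive pair scores high, negatives score low
good = toy_loss(pos_score=4.0, neg_scores=[-4.0, -3.0])
# Opposite situation: positive pair scores low, negatives score high
bad = toy_loss(pos_score=-4.0, neg_scores=[4.0, 3.0])
print(good < bad)
# True
```

The loss is close to zero in the first case and large in the second, which is what pushes the model toward high scores for real pairs and low scores for sampled ones.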

Next comes everyone's favorite part: training.

import torch
import torch.optim as optim

embedding_dim = 100
model = Word2Vec(vocab_size, embedding_dim)

optimizer = optim.SGD(model.parameters(), lr=0.01)

num_epochs = 100
num_neg_samples = 5
for epoch in range(num_epochs):
    total_loss = 0
    # One update per (center, context) pair, i.e. a batch size of 1
    for center, context in training_data:
        center_tensor = torch.tensor([center], dtype=torch.long)
        context_tensor = torch.tensor([context], dtype=torch.long)
        # Fresh negative samples for every pair
        neg_samples = get_negative_samples(1, num_neg_samples, vocab_size)

        optimizer.zero_grad()
        pos_scores, neg_scores = model(center_tensor, context_tensor, neg_samples)
        loss = negative_sampling_loss(pos_scores, neg_scores)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    # Loss summed over all pairs in this epoch
    print(f"Epoch {epoch+1}, Loss: {total_loss}")
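The loop above updates the model on one pair at a time. Since forward is already written with a batch dimension, several pairs could in principle be fed at once. The sketch below shows only the batching side, using the standard library; `iter_batches` is a hypothetical helper and the pairs are toy data. Turning each batch into tensors and calling the model is left out, and has not been tested against the training code in this post.

```python
import random

def iter_batches(pairs, batch_size, seed=None):
    """Yield shuffled lists of (center, context) pairs, at most batch_size each."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

# A few (center, context) index pairs, like the rows of training_data
toy_pairs = [(2, 3), (2, 4), (3, 2), (3, 4), (3, 5)]
for batch in iter_batches(toy_pairs, batch_size=2, seed=0):
    centers = [center for center, _ in batch]
    contexts = [context for _, context in batch]
    print(centers, contexts)
```

Shuffling each epoch keeps pairs from the same sentence from always being seen in the same order; the last batch may be smaller than batch_size.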

We can also define test functions as needed:

import torch
from sklearn.metrics.pairwise import cosine_similarity

def get_word_vector(word):
    # Look up the trained center-word embedding as a (1, embedding_dim) NumPy array
    word_idx = word_to_idx[word]
    word_tensor = torch.tensor([word_idx], dtype=torch.long)
    return model.center_embeddings(word_tensor).detach().numpy()

def find_similar_words(word, top_n=5):
    # Rank every other vocabulary word by cosine similarity to the query word
    word_vec = get_word_vector(word)
    similarities = []
    for other_word in vocab:
        if other_word == word:
            continue
        other_vec = get_word_vector(other_word)
        sim = cosine_similarity(word_vec, other_vec)[0][0]
        similarities.append((other_word, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]

similar_words = find_similar_words('人', top_n=3)
print(similar_words)

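After 100 epochs of training, the top 3 words most related to [人] were:

('進化', 0.2668535), ('興建', 0.16363981), ('生物', 0.10142076)

Well... for the sake of the demo the data is still just the three sentences from the previous post, but the results are at least somewhat related. (The embeddings are randomly initialized, so the exact numbers will differ from run to run.)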
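The scores above are cosine similarities, which measure the angle between two vectors and ignore their length. A minimal hand-written version, using only the standard library, might look like this; `cosine` is a stand-in written for illustration, not the scikit-learn function used above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

On that scale from -1 to 1, a top score of about 0.27 is only a weak relationship, which fits how little data the model was trained on.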

I will upload the code to GitHub, so anyone interested can try it out themselves.

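IV. Conclusion

This post walked through the logic of Word2Vec and the full process of building it. In practice it is still much easier to just use a library, but once you understand the architecture, the parameters those libraries expose make a lot more sense. I hope that including all of the code and listing the outputs helps anyone who was unsure about the process. The next post is planned to cover audio reconstruction, so stay tuned.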
