
[Towards Graph Neural Networks (GNN)] Part 5: Building a GNN Model for Node Classification on the Cora Dataset

In the previous post we implemented a custom message passing class in PyTorch. In this post we will build the full GNN model, run a node classification task on the Cora dataset, and compare how the different model variants perform. If you have not read the earlier posts, you can find them here:

[Towards Graph Neural Networks (GNN)] Part 1: Basic Elements and Applications of Graph Data

[Towards Graph Neural Networks (GNN)] Part 2: Concepts and Implementation of Graph Structures with PyTorch

[Towards Graph Neural Networks (GNN)] Part 3: The Core of Graph Neural Networks: The Message Passing Mechanism

[Towards Graph Neural Networks (GNN)] Part 4: Implementing the Message Passing Mechanism of a Graph Neural Network


Downloading the Cora dataset with PyTorch Geometric

from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures
dataset = Planetoid(root='data/Planetoid', name='Cora', transform=NormalizeFeatures())

The Cora dataset is a standard benchmark widely used in graph neural network (GNN) research. It consists of machine learning papers, each labeled with one of seven topics such as Neural Networks, Reinforcement Learning, and Probabilistic Methods. Its main characteristics are:

  1. Nodes: each node represents a scientific paper.
  2. Edges: if one paper cites another, there is an edge between the two papers. This turns the collection of papers into a citation network.
  3. Features: each node (paper) has a feature vector, a sparse binary bag-of-words vector. Each dimension indicates whether a particular word appears in that paper.
  4. Labels: every paper is labeled with one of seven topics, e.g. Neural Networks, Reinforcement Learning, Probabilistic Methods, and so on.
  5. Task: given the nodes, the citation edges, and the node features, the dataset supports node classification: predict which of the seven topics each paper belongs to.

A quick look at the Cora dataset

print()
print(f'Dataset: {dataset}:')
print('======================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

graph = dataset[0] # Get the first graph object.

print()
print(graph)
print('===========================================================================================================')

# Gather some statistics about the graph.
print(f'Number of nodes: {graph.num_nodes}')
print(f'Number of edges: {graph.num_edges}')
print(f'Average node degree: {graph.num_edges / graph.num_nodes:.2f}')
print(f'Number of training nodes: {graph.train_mask.sum()}')
print(f'Training node label rate: {int(graph.train_mask.sum()) / graph.num_nodes:.2f}')
print(f'Has isolated nodes: {graph.has_isolated_nodes()}')
print(f'Has self-loops: {graph.has_self_loops()}')
print(f'Is undirected: {graph.is_undirected()}')

Output:

Dataset: Cora():
======================
Number of graphs: 1
Number of features: 1433
Number of classes: 7

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
===========================================================================================================
Number of nodes: 2708
Number of edges: 10556
Average node degree: 3.90
Number of training nodes: 140
Training node label rate: 0.05
Has isolated nodes: False
Has self-loops: False
Is undirected: True

  1. Number of graphs: the dataset contains a single graph. The whole dataset is one large graph whose nodes are papers and whose edges are citation relations.
  2. Number of features: each node (paper) has 1433 features. These are binary bag-of-words features indicating whether each of 1433 words appears in the paper.
  3. Number of classes: there are 7 classes, each corresponding to the research topic of a paper.
  4. Numbers of nodes and edges:
  • Nodes: 2708, i.e. 2708 papers.
  • Edges: 10556 entries in edge_index; since the graph is stored as undirected, each citation link appears in both directions.
  5. Average node degree: about 3.90, meaning each paper is connected (it cites or is cited by) to roughly 3.9 other papers on average (double-checked in the sketch after this list).
  6. Training nodes:
  • Number of training nodes: 140, so only 140 papers have labels that are used to train the model.
  • Training label rate: about 0.05, i.e. roughly 5% of the papers are used for training; a fairly small fraction, with the rest reserved for validation and testing.
  7. Other graph properties:
  • Has isolated nodes: False, every node is connected to at least one other node.
  • Has self-loops: False, no node has an edge pointing to itself.
  • Is undirected: True, citations are treated as bidirectional: if paper A cites paper B, an edge from B to A is also stored.
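
A couple of these numbers can be double-checked straight from the Data object. A minimal sketch, assuming graph and dataset are loaded as above: torch.bincount counts how many papers fall into each of the 7 classes, and torch_geometric.utils.degree recovers each node's degree from edge_index.

import torch
from torch_geometric.utils import degree

# Label distribution over the 7 classes (one count per class index 0..6).
print(torch.bincount(graph.y, minlength=dataset.num_classes))

# Degree of each node, computed from the edge list; its mean should match
# the "Average node degree" reported above (about 3.90).
deg = degree(graph.edge_index[0], num_nodes=graph.num_nodes)
print(deg.mean())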

Implementing the GNN model

import torch
import torch.nn as nn

class myGNN(torch.nn.Module):
    def __init__(self, layer_num, input_dim, hidden_dim, output_dim, aggr='mean', **kwargs):
        super(myGNN, self).__init__()
        self.layer_num = layer_num

        # Encoder: project the raw features into the hidden dimension.
        self.encoder = nn.Linear(input_dim, hidden_dim)

        # You can use any message passing layer you like here, such as GCN, GAT, ......
        # NN_MessagePassingLayer is the custom layer implemented in Part 4.
        self.mp_layer = NN_MessagePassingLayer(input_dim=hidden_dim, hidden_dim=hidden_dim,
                                               output_dim=hidden_dim, aggr=aggr)

        # Decoder: map the hidden representation to one score per class.
        self.decoder = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.encoder(x)
        for i in range(self.layer_num):
            x = self.mp_layer(x, edge_index)
        node_out = self.decoder(x)
        return node_out

The structure of this nn.Module can be split into three main parts, each playing a key role in the graph neural network:

1. Encoder: the encoder projects the input features (here, the 1433-dimensional bag-of-words vectors) into the hidden dimension used by the rest of the network, giving the later layers a representation that is easier to learn from.

2. Message passing layer: the message passing layer is the core of the GNN and is responsible for exchanging and aggregating information between nodes. Any message passing scheme can be plugged in here, such as GCN (Graph Convolutional Network) or GAT (Graph Attention Network). Its job is to use the connections between nodes to update each node's feature representation.

3. Decoder: the decoder maps the representation produced by message passing down to the target output dimension, so that the final output matches the task, e.g. one score per class for node classification (see the quick shape check after this list).
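
A quick sanity check of the three stages; a minimal sketch, assuming the NN_MessagePassingLayer class from Part 4 is defined in the same scope and the Cora graph is loaded as above:

# Hypothetical shape check: with Cora, x is [2708, 1433] and the output
# should be [2708, 7], one score per class for every node.
model = myGNN(layer_num=2, input_dim=dataset.num_features,
              hidden_dim=16, output_dim=dataset.num_classes)
out = model(graph.x, graph.edge_index)
print(out.shape)  # expected: torch.Size([2708, 7])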

---

Next come the experiments: several model variants and a comparison of their results.

GNN model

Using the GNN model defined above, training does not go well: test accuracy is only about 0.30, and the training loss comes down very slowly, still sitting around 0.7 at the end.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch_geometric.nn as geom_nn
from IPython.display import Javascript, display  # Restrict height of output cell.

# Define the myGNN model
class myGNN(torch.nn.Module):
    def __init__(self, layer_num, input_dim, hidden_dim, output_dim, aggr='mean'):
        super(myGNN, self).__init__()
        self.layer_num = layer_num
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.mp_layer = geom_nn.GCNConv(hidden_dim, hidden_dim, aggr=aggr)
        self.decoder = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.encoder(x)
        for i in range(self.layer_num):
            x = self.mp_layer(x, edge_index)
            x = F.relu(x)  # Optional: Apply a non-linear activation
        x = self.decoder(x)
        return x

# Initialize the model and optimizer
model = myGNN(layer_num=2, input_dim=dataset.num_features, hidden_dim=16, output_dim=dataset.num_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

# Training function
def train():
    model.train()
    optimizer.zero_grad()  # Clear gradients.
    out = model(graph.x, graph.edge_index)  # Perform a single forward pass.
    loss = criterion(out[graph.train_mask], graph.y[graph.train_mask])  # Compute the loss solely based on the training nodes.
    loss.backward()  # Derive gradients.
    optimizer.step()  # Update parameters based on gradients.
    return loss

# Test function
def test():
    model.eval()
    out = model(graph.x, graph.edge_index)
    pred = out.argmax(dim=1)  # Use the class with highest probability.
    test_correct = pred[graph.test_mask] == graph.y[graph.test_mask]  # Check against ground-truth labels.
    test_acc = int(test_correct.sum()) / int(graph.test_mask.sum())  # Derive ratio of correct predictions.
    return test_acc

# Training and test loop
for epoch in range(1, 101):
    loss = train()
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')

test_acc = test()
print(f'Test Accuracy: {test_acc:.4f}')

# Visualize the output
model.eval()
out = model(graph.x, graph.edge_index)
visualize(out, color=graph.y)
display(Javascript('''google.colab.output.setIframeHeight(0, true, {maxHeight: 300})'''))
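
Note that the visualize helper used at the end is not defined in this post. A minimal sketch of one possible implementation, assuming a t-SNE projection of the node outputs as in the PyTorch Geometric tutorial notebooks:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(h, color):
    # Project the node representations to 2D and color each point by its class label.
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())
    plt.figure(figsize=(10, 10))
    plt.xticks([])
    plt.yticks([])
    plt.scatter(z[:, 0], z[:, 1], s=70, c=color.detach().cpu().numpy(), cmap='Set2')
    plt.show()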

Output:

Epoch: 001, Loss: 1.9620
Epoch: 002, Loss: 1.9599
Epoch: 003, Loss: 1.9578
Epoch: 004, Loss: 1.9557
Epoch: 005, Loss: 1.9536
Epoch: 006, Loss: 1.9517
Epoch: 007, Loss: 1.9499
Epoch: 008, Loss: 1.9484
Epoch: 009, Loss: 1.9472
Epoch: 010, Loss: 1.9463
Epoch: 070, Loss: 1.1654
Epoch: 071, Loss: 1.1427
Epoch: 072, Loss: 1.1213
Epoch: 073, Loss: 1.0998
Epoch: 074, Loss: 1.0781
Epoch: 075, Loss: 1.0575
Epoch: 076, Loss: 1.0385
Epoch: 077, Loss: 1.0197
Epoch: 078, Loss: 1.0012
Epoch: 079, Loss: 0.9838
Epoch: 080, Loss: 0.9677
Epoch: 081, Loss: 0.9528
Epoch: 082, Loss: 0.9386
Epoch: 083, Loss: 0.9234
Epoch: 084, Loss: 0.9078
Epoch: 085, Loss: 0.8938
Epoch: 086, Loss: 0.8806
Epoch: 087, Loss: 0.8677
Epoch: 088, Loss: 0.8559
Epoch: 089, Loss: 0.8421
Epoch: 090, Loss: 0.8272
Epoch: 091, Loss: 0.8136
Epoch: 092, Loss: 0.8029
Epoch: 093, Loss: 0.7931
Epoch: 094, Loss: 0.7788
Epoch: 095, Loss: 0.7640
Epoch: 096, Loss: 0.7530
Epoch: 097, Loss: 0.7432
Epoch: 098, Loss: 0.7308
Epoch: 099, Loss: 0.7168
Epoch: 100, Loss: 0.7069
Test Accuracy: 0.3050

As expected, the visualization confirms that the classification is not very good, so let's tweak the model a bit!

class myGNN(torch.nn.Module):
    def __init__(self, layer_num, input_dim, hidden_dim, output_dim, aggr='mean'):
        super(myGNN, self).__init__()
        self.layer_num = layer_num
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.bn_layers = nn.ModuleList([nn.BatchNorm1d(hidden_dim) for _ in range(layer_num)])
        self.mp_layers = nn.ModuleList([geom_nn.GCNConv(hidden_dim, hidden_dim, aggr=aggr) for _ in range(layer_num)])
        self.decoder = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.encoder(x)
        for i in range(self.layer_num):
            x = self.mp_layers[i](x, edge_index)
            x = F.relu(x)
            x = self.bn_layers[i](x)
        x = self.decoder(x)
        return x
  • Add batch normalization: normalizing the output of each message passing step makes the propagation more stable and helps the model converge faster.
  • Independent message passing layers via nn.ModuleList: instead of reusing one message passing layer several times, each layer is now a separate module, so every layer can learn its own weights and capture different features (see the parameter-count sketch below).
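
Since each message passing layer now has its own weights, the number of trainable parameters grows with layer_num. A quick way to check; a minimal sketch using the class defined above:

model = myGNN(layer_num=2, input_dim=dataset.num_features,
              hidden_dim=16, output_dim=dataset.num_classes)
print(sum(p.numel() for p in model.parameters()))  # total number of trainable parameters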

Output after the adjustments:

Epoch: 001, Loss: 2.0951
Epoch: 002, Loss: 1.8160
Epoch: 003, Loss: 1.5728
Epoch: 004, Loss: 1.4217
Epoch: 005, Loss: 1.2931
Epoch: 006, Loss: 1.1803
Epoch: 007, Loss: 1.0787
Epoch: 008, Loss: 0.9838
Epoch: 009, Loss: 0.9088
Epoch: 010, Loss: 0.8384
Epoch: 070, Loss: 0.0025
Epoch: 071, Loss: 0.0024
Epoch: 072, Loss: 0.0024
Epoch: 073, Loss: 0.0023
Epoch: 074, Loss: 0.0023
Epoch: 075, Loss: 0.0022
Epoch: 076, Loss: 0.0022
Epoch: 077, Loss: 0.0021
Epoch: 078, Loss: 0.0021
Epoch: 079, Loss: 0.0021
Epoch: 080, Loss: 0.0020
Epoch: 081, Loss: 0.0020
Epoch: 082, Loss: 0.0019
Epoch: 083, Loss: 0.0019
Epoch: 084, Loss: 0.0019
Epoch: 085, Loss: 0.0019
Epoch: 086, Loss: 0.0018
Epoch: 087, Loss: 0.0018
Epoch: 088, Loss: 0.0018
Epoch: 089, Loss: 0.0018
Epoch: 090, Loss: 0.0017
Epoch: 091, Loss: 0.0017
Epoch: 092, Loss: 0.0017
Epoch: 093, Loss: 0.0017
Epoch: 094, Loss: 0.0017
Epoch: 095, Loss: 0.0017
Epoch: 096, Loss: 0.0016
Epoch: 097, Loss: 0.0016
Epoch: 098, Loss: 0.0016
Epoch: 099, Loss: 0.0016
Epoch: 100, Loss: 0.0016
Test Accuracy: 0.3560

The training loss now goes down nicely, but performance on the test set is still poor, only 0.3560, which points to overfitting. Let's add dropout to address it.
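
As an aside, overfitting like this is easier to catch during training if we also track accuracy on the validation nodes that Planetoid provides. A minimal sketch, mirroring the test() function above but using graph.val_mask:

@torch.no_grad()
def validate():
    model.eval()
    out = model(graph.x, graph.edge_index)
    pred = out.argmax(dim=1)
    # Compare predictions against ground truth on the validation nodes only.
    val_correct = pred[graph.val_mask] == graph.y[graph.val_mask]
    return int(val_correct.sum()) / int(graph.val_mask.sum())

# e.g. call validate() every few epochs inside the training loop and tune
# hyperparameters when validation accuracy stops improving.

Now, the model with dropout added: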

class myGNN(torch.nn.Module):
    def __init__(self, layer_num, input_dim, hidden_dim, output_dim, dropout_rate=0.55, aggr='mean'):
        super(myGNN, self).__init__()
        self.layer_num = layer_num
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.bn_layers = nn.ModuleList([nn.BatchNorm1d(hidden_dim) for _ in range(layer_num)])
        self.mp_layers = nn.ModuleList([geom_nn.GCNConv(hidden_dim, hidden_dim, aggr=aggr) for _ in range(layer_num)])
        self.dropout = dropout_rate
        self.decoder = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.encoder(x)
        for i in range(self.layer_num):
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
            x = self.bn_layers[i](x)
            x = self.mp_layers[i](x, edge_index)
        x = self.decoder(x)
        return x

Output:

Epoch: 001, Loss: 1.9891
Epoch: 002, Loss: 1.9465
Epoch: 003, Loss: 1.9275
Epoch: 004, Loss: 1.9061
Epoch: 005, Loss: 1.8830
Epoch: 006, Loss: 1.8660
Epoch: 007, Loss: 1.8757
Epoch: 008, Loss: 1.8021
Epoch: 009, Loss: 1.8021
Epoch: 010, Loss: 1.7374
Epoch: 070, Loss: 0.3411
Epoch: 071, Loss: 0.3400
Epoch: 072, Loss: 0.3191
Epoch: 073, Loss: 0.4393
Epoch: 074, Loss: 0.2978
Epoch: 075, Loss: 0.3700
Epoch: 076, Loss: 0.3445
Epoch: 077, Loss: 0.3263
Epoch: 078, Loss: 0.4001
Epoch: 079, Loss: 0.3582
Epoch: 080, Loss: 0.3403
Epoch: 081, Loss: 0.3091
Epoch: 082, Loss: 0.3415
Epoch: 083, Loss: 0.3156
Epoch: 084, Loss: 0.2650
Epoch: 085, Loss: 0.3692
Epoch: 086, Loss: 0.2819
Epoch: 087, Loss: 0.2802
Epoch: 088, Loss: 0.2628
Epoch: 089, Loss: 0.1985
Epoch: 090, Loss: 0.2510
Epoch: 091, Loss: 0.2499
Epoch: 092, Loss: 0.3362
Epoch: 093, Loss: 0.2442
Epoch: 094, Loss: 0.3921
Epoch: 095, Loss: 0.2608
Epoch: 096, Loss: 0.3405
Epoch: 097, Loss: 0.2851
Epoch: 098, Loss: 0.2690
Epoch: 099, Loss: 0.2424
Epoch: 100, Loss: 0.2656
Test Accuracy: 0.6230

Test accuracy clearly improves, to 0.623.

Visualizing the output at this point, the classes look noticeably better separated than before, so let's keep tuning.

Recall that in

[Towards Graph Neural Networks (GNN)] Part 3: The Core of Graph Neural Networks: The Message Passing Mechanism

we discussed the aggregation step that happens before the node update: add usually performs better than mean or max. So here we switch the aggregation to add, and also raise the dropout rate by 0.05, to 0.6.

class myGNN(torch.nn.Module):
    def __init__(self, layer_num, input_dim, hidden_dim, output_dim, dropout_rate=0.6, aggr='add'):
        super(myGNN, self).__init__()
        self.layer_num = layer_num
        self.encoder = nn.Linear(input_dim, hidden_dim)
        self.bn_layers = nn.ModuleList([nn.BatchNorm1d(hidden_dim) for _ in range(layer_num)])
        self.mp_layers = nn.ModuleList([geom_nn.GCNConv(hidden_dim, hidden_dim, aggr=aggr) for _ in range(layer_num)])
        self.dropout = dropout_rate
        self.decoder = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, edge_index):
        x = self.encoder(x)
        for i in range(self.layer_num):
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
            x = self.bn_layers[i](x)
            x = self.mp_layers[i](x, edge_index)
        x = self.decoder(x)
        return x

Output:

Epoch: 001, Loss: 2.0578
Epoch: 002, Loss: 1.8797
Epoch: 003, Loss: 1.8410
Epoch: 004, Loss: 1.7337
Epoch: 005, Loss: 1.6006
Epoch: 006, Loss: 1.5471
Epoch: 007, Loss: 1.4120
Epoch: 008, Loss: 1.4017
Epoch: 009, Loss: 1.3204
Epoch: 010, Loss: 1.2137
Epoch: 070, Loss: 0.2363
Epoch: 071, Loss: 0.1732
Epoch: 072, Loss: 0.2351
Epoch: 073, Loss: 0.1176
Epoch: 074, Loss: 0.1425
Epoch: 075, Loss: 0.2506
Epoch: 076, Loss: 0.2288
Epoch: 077, Loss: 0.1546
Epoch: 078, Loss: 0.1162
Epoch: 079, Loss: 0.2172
Epoch: 080, Loss: 0.1823
Epoch: 081, Loss: 0.1976
Epoch: 082, Loss: 0.1133
Epoch: 083, Loss: 0.1148
Epoch: 084, Loss: 0.1102
Epoch: 085, Loss: 0.1629
Epoch: 086, Loss: 0.2070
Epoch: 087, Loss: 0.2446
Epoch: 088, Loss: 0.1316
Epoch: 089, Loss: 0.1847
Epoch: 090, Loss: 0.1238
Epoch: 091, Loss: 0.1610
Epoch: 092, Loss: 0.1585
Epoch: 093, Loss: 0.0948
Epoch: 094, Loss: 0.1809
Epoch: 095, Loss: 0.1859
Epoch: 096, Loss: 0.0982
Epoch: 097, Loss: 0.1741
Epoch: 098, Loss: 0.2341
Epoch: 099, Loss: 0.1564
Epoch: 100, Loss: 0.1871
Test Accuracy: 0.7330

Test accuracy rises markedly, to about 0.73, and in the visualization each class is more clearly separated, with only a few scattered misclassified nodes. As a baseline model, this shows that the GNN approach works on this task.

Summary

In this post we built and trained a graph neural network (GNN) with PyTorch and used the Cora dataset for a node classification task. By improving the architecture step by step (adding batch normalization, giving each message passing layer its own weights, and tuning the dropout rate and the aggregation function) we raised the classification accuracy substantially. The final experiments show that a well-tuned GNN handles graph-structured data effectively. See you in the next post!
