Developing Multithreaded Python/PyTorch Programs in an NVIDIA CUDA Environment

Posted in: Programming

Updated 2024/11/09. Published 2024/11/06.

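On a machine with a physical multi-core GPU installed, a multithreaded Python program lets the CPU and the GPU each do the work that suits it, both at the same time. The CPU side usually handles data processing and user-interface management, while the GPU side carries out the heavy numerical computation. Because both kinds of work have to proceed simultaneously, they are commonly run in separate threads.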

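CPU-side multithreaded program

Suppose we want to perform many matrix multiplications on both the CPU and the GPU at the same time. The thread class for the CPU side can be written like this.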

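import threading
import numpy as np

class cpuThread(threading.Thread):
    """Thread that multiplies two NumPy matrices `count` times on the CPU."""
    def __init__(self, x, y, count):
        super().__init__()
        self.x = x
        self.y = y
        self.ans = x          # placeholder until run() stores a result
        self.count = count

    def run(self):
        # Repeat the same multiplication; only the last result is kept
        for _ in range(self.count):
            self.ans = np.matmul(self.x, self.y)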

Next, let's look at how long the CPU side takes when it runs on its own.

import time

np.random.seed(0)
matrixA = np.random.rand(1000, 1000).astype('float32')
matrixB = np.random.rand(1000, 1000).astype('float32')

beginTime = time.time()
runCPU = cpuThread(matrixA, matrixB, 1000)
runCPU.start()
runCPU.join()

# Print the time the computation took
print('CPU execution time:', time.time() - beginTime)
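
The measurement above is taken around start() and join(). To see how long each thread spends in its own work, which becomes useful once several threads run together, one option is to record the elapsed time inside run(). The following is a minimal sketch using only the standard library. `TimedThread` and its `work` argument are hypothetical names that do not appear in the code above, and the summation is only a stand-in workload.

```python
import threading
import time


class TimedThread(threading.Thread):
    """Hypothetical helper: runs work() and records how long it took."""

    def __init__(self, work):
        super().__init__()
        self.work = work        # any zero-argument callable
        self.result = None      # return value of work()
        self.elapsed = None     # seconds spent inside run()

    def run(self):
        begin = time.perf_counter()
        self.result = self.work()
        self.elapsed = time.perf_counter() - begin


# Stand-in workload; in this article it would be the matmul loop
worker = TimedThread(lambda: sum(range(100_000)))
worker.start()
worker.join()
print('result:', worker.result, 'elapsed seconds:', worker.elapsed)
```

For a GPU thread, a time recorded this way covers only the queuing of the work unless the workload itself calls torch.cuda.synchronize(), because CUDA kernel launches are asynchronous.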

GPU-side multithreaded program

The GPU-side thread is much like the CPU one. The main difference is that it uses the "torch" library instead of "numpy".

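import torch

class gpuThread(threading.Thread):
    """Thread that multiplies two torch tensors `count` times on their device."""
    def __init__(self, x, y, count):
        super().__init__()
        self.x = x
        self.y = y
        self.count = count
        self.ans = x          # placeholder until run() stores a result

    def run(self):
        # The tensors decide where the work runs: CUDA tensors run on the GPU
        for _ in range(self.count):
            self.ans = torch.matmul(self.x, self.y)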

And here is the time taken by the GPU side running on its own.

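device = torch.device('cuda')  # assumes a CUDA-capable GPU is present

np.random.seed(0)
matrixA = np.random.rand(1000, 1000).astype('float32')
matrixB = np.random.rand(1000, 1000).astype('float32')
# Copy the input matrices to the GPU before timing starts
tensorA = torch.tensor(matrixA).to(device)
tensorB = torch.tensor(matrixB).to(device)

beginTime = time.time()
runGPU = gpuThread(tensorA, tensorB, 1000)
runGPU.start()
runGPU.join()
# CUDA launches are asynchronous: wait for all queued GPU work to finish
torch.cuda.synchronize()
# Print the time the computation took
print('GPU execution time:', time.time() - beginTime)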

On a Jetson Orin Nano, the GPU takes roughly 2 to 3 seconds, while the same computation on the CPU takes more than 15 seconds.
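
These timings can be turned into a rough throughput estimate. Multiplying two n×n matrices takes about 2n³ floating-point operations (n³ multiplications and roughly as many additions). The sketch below applies that to the runs above. The 2.5 s and 15 s values are illustrative picks from the approximate figures just quoted, not additional measurements, so the results are ballpark numbers only.

```python
n = 1000          # matrix dimension used above
count = 1000      # number of multiplications per run

flops_total = 2 * n**3 * count   # about 2e12 floating-point operations

gpu_seconds = 2.5    # assumed midpoint of the "2 to 3 seconds" figure
cpu_seconds = 15.0   # lower bound of the "more than 15 seconds" figure

gpu_gflops = flops_total / gpu_seconds / 1e9
cpu_gflops = flops_total / cpu_seconds / 1e9
speedup = cpu_seconds / gpu_seconds

print(f'GPU: about {gpu_gflops:.0f} GFLOP/s')
print(f'CPU: at most about {cpu_gflops:.0f} GFLOP/s')
print(f'Speedup: at least about {speedup:.0f}x')
```

Because the CPU figure is only a lower bound on the time, the CPU throughput is an upper bound and the speedup is a lower bound.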

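Running the CPU and GPU threads at the same time

Of course, the two programs above can also be combined so that the CPU and the GPU work at the same time. While it runs, "jtop" can be used to observe how the computational load on the Jetson Orin Nano changes.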

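device = torch.device('cuda')  # assumes a CUDA-capable GPU is present

np.random.seed(0)
matrixA = np.random.rand(1000, 1000).astype('float32')
matrixB = np.random.rand(1000, 1000).astype('float32')
# Copy the input matrices to the GPU before timing starts
tensorA = torch.tensor(matrixA).to(device)
tensorB = torch.tensor(matrixB).to(device)

beginTime = time.time()
runCPU = cpuThread(matrixA, matrixB, 1000)
runGPU = gpuThread(tensorA, tensorB, 1000)
# Start both threads before joining either, so they run concurrently
runCPU.start()
runGPU.start()
runCPU.join()
runGPU.join()
# CUDA launches are asynchronous: wait for all queued GPU work to finish
torch.cuda.synchronize()
# Print the time the computation took
print('CPU combined with GPU execution time:', time.time() - beginTime)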
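
If the two threads really overlap, the combined time should be close to the slower of the two individual runs rather than to their sum. The sketch below illustrates the idea with time.sleep standing in for both workloads. That substitution is an assumption made purely for illustration: sleep releases Python's global interpreter lock, much as NumPy's BLAS-backed matmul and asynchronous CUDA launches typically let the other thread proceed. The durations are arbitrary.

```python
import threading
import time

slow_seconds = 0.3   # stand-in for the CPU thread
fast_seconds = 0.2   # stand-in for the GPU thread

slow = threading.Thread(target=time.sleep, args=(slow_seconds,))
fast = threading.Thread(target=time.sleep, args=(fast_seconds,))

begin = time.perf_counter()
slow.start()
fast.start()
slow.join()
fast.join()
elapsed = time.perf_counter() - begin

print('combined:', elapsed)
print('sum of both:', slow_seconds + fast_seconds)
```

Applied to the real run, the same comparison means checking the combined time against the separate CPU and GPU timings measured earlier.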