Q-learning Framework: A Python Example

Hank吳

Updated 2025/07/22 | Published 2025/07/22 | 20 min read

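This time, let's put the concepts of Q-learning into practice with a simple Python example. We will simulate a very small "treasure hunt" game environment and let the agent learn how to find the treasure.

Q-learning Framework Python Example: A Simple Treasure Hunt Game 💎

In this example, the agent moves around a 3x3 grid world. The goal is to find the treasure while avoiding the trap.

Environment setup:

* Grid size: 3x3

* Starting position: top-left corner (0, 0)

* Treasure: bottom-right corner (2, 2); reaching it gives a reward of +10

* Trap: for example (1, 1); falling in gives a penalty of -10

* Ordinary move: a penalty of -1 per move (to encourage finding the treasure quickly)

* Hitting a wall: the agent stays where it is and receives a penalty of -5 (as implemented in step below)

* Actions: up, down, left, right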
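To make the reward scheme concrete, the sketch below adds up the undiscounted reward along two hand-picked paths. The names `GRID`, `REWARDS`, `MOVES`, and `path_return` are illustrative and not part of the article's code, and wall hits are not handled because both paths stay inside the grid.

```python
# Illustrative sketch: total (undiscounted) reward along a fixed action sequence.
GRID = [
    ["S", "-", "-"],
    ["-", "T", "-"],
    ["-", "-", "G"],
]
REWARDS = {"G": 10, "T": -10, "-": -1}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def path_return(actions, start=(0, 0)):
    """Sum the rewards collected by following `actions` from `start`.

    Stops early at the treasure or the trap.
    Assumes every move stays in bounds.
    """
    row, col = start
    total = 0
    for name in actions:
        d_row, d_col = MOVES[name]
        row, col = row + d_row, col + d_col
        cell = GRID[row][col]
        total += REWARDS.get(cell, -1)
        if cell in ("G", "T"):
            break
    return total


safe_path = ["right", "right", "down", "down"]  # goes around the trap
trap_path = ["down", "right"]                   # walks into (1, 1)

print(path_return(safe_path))  # 7
print(path_return(trap_path))  # -11
```

Four moves is the shortest route from (0, 0) to (2, 2), so 7 is the best undiscounted total the agent can collect from the start.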

1. Defining the Environment

First, we need to define this simple environment.

import numpy as np


class GridWorld:
    def __init__(self):
        self.grid = np.array([
            ['S', '-', '-'],
            ['-', 'T', '-'],
            ['-', '-', 'G']
        ])
        self.state_space_size = self.grid.size  # 3x3 = 9 states
        self.action_space_size = 4  # up, down, left, right
        self.current_position = (0, 0)  # agent's starting position
        self.rewards = {
            'G': 10,   # Goal (treasure)
            'T': -10,  # Trap
            '-': -1    # Empty cell (ordinary move)
        }
        self.actions = {
            0: (-1, 0),  # up (row - 1)
            1: (1, 0),   # down (row + 1)
            2: (0, -1),  # left (col - 1)
            3: (0, 1)    # right (col + 1)
        }
        self.state_map = self._create_state_map()

    def _create_state_map(self):
        # Map each (row, col) coordinate to a single integer state ID
        state_map = {}
        idx = 0
        for r in range(self.grid.shape[0]):
            for c in range(self.grid.shape[1]):
                state_map[(r, c)] = idx
                idx += 1
        return state_map

    def get_state_id(self, position):
        return self.state_map[position]

    def get_position_from_id(self, state_id):
        # Reverse mapping, handy for display
        for pos, s_id in self.state_map.items():
            if s_id == state_id:
                return pos
        return None

    def reset(self):
        self.current_position = (0, 0)
        return self.get_state_id(self.current_position)

    def step(self, action_id):
        current_row, current_col = self.current_position
        dr, dc = self.actions[action_id]
        new_row, new_col = current_row + dr, current_col + dc

        # Check the grid boundaries
        if not (0 <= new_row < self.grid.shape[0] and 0 <= new_col < self.grid.shape[1]):
            # Hit a wall: stay in place and take a -5 penalty
            reward = -5
            is_done = False
            next_state_pos = self.current_position
        else:
            next_state_pos = (new_row, new_col)
            cell_type = self.grid[next_state_pos]
            # Defaults to -1 for cells without an entry (e.g. the start cell 'S')
            reward = self.rewards.get(cell_type, -1)
            # The episode ends on reaching the treasure or the trap
            is_done = (cell_type == 'G' or cell_type == 'T')

        self.current_position = next_state_pos
        next_state_id = self.get_state_id(self.current_position)
        return next_state_id, reward, is_done

    def render(self):
        # Work on a copy so the original grid is not modified
        display_grid = np.copy(self.grid).astype(object)
        r, c = self.current_position
        display_grid[r, c] = 'A'  # the agent's position
        for row in display_grid:
            print(' '.join(row))
        print("-" * 10)
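The `state_map` built above numbers the cells in row-major order, so each ID equals `row * n_cols + col`. As a side note, NumPy's `np.ravel_multi_index` and `np.unravel_index` compute the same mapping in both directions. The sketch below is standalone and does not change `GridWorld`; `SHAPE` and the variable names are illustrative.

```python
import numpy as np

SHAPE = (3, 3)  # same grid size as the article's GridWorld

# Forward mapping: (row, col) -> state ID, in row-major order
trap_id = int(np.ravel_multi_index((1, 1), SHAPE))
goal_id = int(np.ravel_multi_index((2, 2), SHAPE))

# Reverse mapping: state ID -> (row, col)
goal_pos = tuple(int(i) for i in np.unravel_index(goal_id, SHAPE))

print(trap_id, goal_id, goal_pos)  # 4 8 (2, 2)

# The same IDs come out of the closed-form expression row * n_cols + col
for row in range(SHAPE[0]):
    for col in range(SHAPE[1]):
        assert row * SHAPE[1] + col == np.ravel_multi_index((row, col), SHAPE)
```

For a 3x3 grid the dictionary lookup in the article is perfectly adequate; the closed form mainly helps when the grid grows or when the reverse lookup is called often.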

2. Implementing the Q-learning Algorithm

Now let's implement the core logic of Q-learning.

import random


def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
    # Initialize the Q-table
    # Q_table has shape (number of states x number of actions)
    Q_table = np.zeros((env.state_space_size, env.action_space_size))
    rewards_per_episode = []

    for episode in range(episodes):
        current_state_id = env.reset()  # reset the environment to the start state
        done = False
        total_reward = 0

        while not done:
            # 1. Choose an action (epsilon-greedy policy)
            if random.uniform(0, 1) < epsilon:
                # Explore: pick a random action
                action_id = random.randrange(env.action_space_size)
            else:
                # Exploit: pick the action with the highest Q-value
                action_id = np.argmax(Q_table[current_state_id, :])

            # 2. Take the action and observe the result
            next_state_id, reward, done = env.step(action_id)
            total_reward += reward

            # 3. Update the Q-value (the core formula)
            # Q(s, a) = Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            # Best Q-value of the next state (0 if the next state is terminal)
            if done:
                max_q_next_state = 0
            else:
                max_q_next_state = np.max(Q_table[next_state_id, :])

            # Q-value update
            Q_table[current_state_id, action_id] += alpha * (
                reward + gamma * max_q_next_state - Q_table[current_state_id, action_id]
            )

            current_state_id = next_state_id  # move on to the next state

            # Optionally render the environment here to watch the process (very slow)
            # env.render()

        # Decay the exploration rate once per episode
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        rewards_per_episode.append(total_reward)

        if (episode + 1) % 100 == 0:
            print(f"Episode {episode + 1}/{episodes}, total reward: {total_reward:.2f}, epsilon: {epsilon:.2f}")

    print("\nTraining finished! Final Q-table:")
    # Show each state ID together with its (row, col) position for readability
    for state_id in range(env.state_space_size):
        pos = env.get_position_from_id(state_id)
        print(f"State {pos} (ID: {state_id}): {Q_table[state_id, :]}")

    return Q_table, rewards_per_episode
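The update rule inside the training loop can be traced by hand. The sketch below pulls it out into a small standalone helper, `td_update` (a hypothetical name, not part of the article's code), and applies it to three transitions using the article's default alpha = 0.1 and gamma = 0.99. The starting Q-values are chosen only for illustration.

```python
def td_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.99, done=False):
    """Return the updated Q(s, a) after one Q-learning step."""
    # A terminal next state contributes no future value
    target = reward + (0.0 if done else gamma * max_q_next)
    return q_sa + alpha * (target - q_sa)


# Stepping into the treasure for the first time: Q starts at 0, reward is +10
q_goal_move = td_update(0.0, reward=10, max_q_next=0.0, done=True)
print(round(q_goal_move, 4))  # 1.0

# The same transition a second time: the estimate moves further toward 10
q_goal_move = td_update(q_goal_move, reward=10, max_q_next=0.0, done=True)
print(round(q_goal_move, 4))  # 1.9

# An ordinary move (reward -1) into the cell next to the treasure,
# whose best action is currently valued at 1.9
q_prev_move = td_update(0.0, reward=-1, max_q_next=q_goal_move)
print(round(q_prev_move, 4))  # 0.0881
```

The last line shows how value flows backward: once the move into the treasure has a positive estimate, the move leading up to it becomes positive too, even though its own reward is -1.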

3. Running Training and Testing the Policy

# Create an environment instance
env = GridWorld()

# Run Q-learning training (more episodes so the agent learns more thoroughly)
Q_table, rewards = q_learning(env, episodes=2000)

# --- Test the learned policy ---
print("\n--- Testing the learned policy ---")
test_env = GridWorld()
current_state_id = test_env.reset()
test_done = False
test_total_reward = 0
step_count = 0

# Human-readable names for the actions
action_names = {0: "up", 1: "down", 2: "left", 3: "right"}

print("Starting position:")
test_env.render()

while not test_done and step_count < 20:  # cap the number of steps to avoid an infinite loop
    # Pick the best action (no more exploration)
    action_id = np.argmax(Q_table[current_state_id, :])
    print(f"Current state: {test_env.get_position_from_id(current_state_id)}, chosen action: {action_names[action_id]}")

    current_state_id, reward, test_done = test_env.step(action_id)
    test_total_reward += reward
    step_count += 1
    test_env.render()

    if test_done:
        cell_type = test_env.grid[test_env.current_position]
        if cell_type == 'G':
            print("Found the treasure! 🎊")
        elif cell_type == 'T':
            print("Fell into the trap! ☠️")
        print(f"Test finished! Total reward: {test_total_reward}, total steps: {step_count}")
        break

if not test_done:
    print(f"Test finished without reaching a terminal cell (possibly stuck in a loop or out of steps). Total reward: {test_total_reward}, total steps: {step_count}")

# Plot the reward trend across episodes
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(rewards)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Total reward per episode during Q-learning training")
plt.grid(True)
plt.show()
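Per-episode rewards tend to be noisy while the agent is still exploring, so the raw curve plotted above can be hard to read. One common option, not used in the original script, is to smooth it with a moving average. The sketch below assumes a helper named `moving_average` (hypothetical) and runs on a toy list rather than real training output.

```python
import numpy as np


def moving_average(values, window=50):
    """Average each run of `window` consecutive values (no padding)."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


# Toy input, only to show the shape of the result
toy_rewards = [1, 2, 3, 4, 5]
smoothed = moving_average(toy_rewards, window=3)
print(smoothed)       # [2. 3. 4.]
print(len(smoothed))  # 3, i.e. len(values) - window + 1
```

With the variables from the script, `plt.plot(moving_average(rewards))` would draw the smoothed curve. The window of 50 is an arbitrary choice and may need tuning.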

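Notes on the example:

* The GridWorld class:

  * __init__: initializes the grid, the state space, the action space, the rewards, and the mapping from action IDs to moves.

  * _create_state_map: converts (row, col) coordinates into a single integer ID so they can be used to index the Q-table.

  * reset(): puts the agent back at the starting position and returns its state ID.

  * step(action_id): describes how the environment changes after the agent takes an action. It returns the new state ID, the reward received, and whether the episode has ended.

  * render(): prints the current grid, marking the agent's position with 'A'.

* The q_learning function:

  * Q_table initialization: creates a zero matrix with one row per state (9) and one column per action (4).

  * Parameters: episodes (number of training episodes), alpha (learning rate), gamma (discount factor), and epsilon (exploration rate, which starts high to encourage early exploration), plus epsilon_decay and epsilon_min, which control how epsilon shrinks.

  * ε-greedy policy: at every time step, the agent picks a random action with probability epsilon (exploration) and the action with the highest current Q-value with probability 1 - epsilon (exploitation).

  * Q-value update formula: this is the heart of Q-learning!

    * reward + gamma * max_q_next_state: the "new" value we expect. max_q_next_state is the current estimate of the largest cumulative future reward obtainable from the next state s' if the agent always picks the best action from there on.

    * Q_table[current_state_id, action_id]: the old Q-value.

    * alpha * [...]: the difference between the new and old values, multiplied by the learning rate, which determines how far this update moves the Q-value.

  * epsilon decay: after each episode, epsilon is multiplied by epsilon_decay (never dropping below epsilon_min), so the agent gradually shifts from exploring to exploiting what it has learned.

* Running and testing:

  * We run the q_learning function to train the agent.

  * After training, we run one test episode in which the agent only picks the action with the highest Q-value (pure exploitation) and watch whether it finds the treasure.

  * Finally, we plot the total reward of every episode. As training progresses, the reward the agent collects usually increases, which indicates that it has learned a better policy.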
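The epsilon decay mentioned in the notes above is easy to quantify. The sketch below counts how many per-episode decays it takes for epsilon to fall from 1.0 to the floor of 0.01 with the article's default decay factor of 0.995. The function name `episodes_until_min` is illustrative and not part of the original code.

```python
def episodes_until_min(epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
    """Count the per-episode decays needed before epsilon reaches epsilon_min."""
    episodes = 0
    while epsilon > epsilon_min:
        # Same decay rule as in q_learning
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
        episodes += 1
    return episodes


print(episodes_until_min())  # 919
```

In a 2000-episode run with these defaults, epsilon therefore shrinks over roughly the first 919 episodes and stays at 0.01 for the rest, so the later episodes are almost entirely greedy.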

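Although this example is simple, it demonstrates the core mechanics of Q-learning in full: building the Q-table, ε-greedy exploration, and, most importantly, the Q-value update formula. I hope it helps you better understand how Q-learning works.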
