36/100 Gradient Boosting Machines (XGBoost, LightGBM) 🚀 Frequent winners of machine learning competitions: efficient and accurate!

AI Era Series (1), Machine Learning Trilogy: 🔹 Part 1: "Machine Learning: Setting Sail for AI Intelligence"

36/100 Week 4: Supervised Learning (Classification)


________________________________________

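✅ What is a gradient boosting machine (GBM)?

Core concept:

A gradient boosting machine (GBM) is an ensemble learning method built by adding many weak models (typically shallow decision trees) one after another. Each new tree is fit to the errors left by the ensemble so far, so the combined model becomes steadily more accurate.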
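
In symbols, this "stacking" is usually written as an additive model. The notation below (F_m for the ensemble after m trees, h_m for the m-th tree, η for the learning rate, L for the loss) is the common textbook form and is an assumption here, not something defined in this article:

```latex
\begin{align}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_i^{(m)} &= -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}} \\
F_m(x) &= F_{m-1}(x) + \eta \, h_m(x), \qquad h_m \text{ fit to } \big\{(x_i, r_i^{(m)})\big\}_{i=1}^{n}
\end{align}
```

For the squared-error loss L(y, F) = (y - F)^2 / 2, the negative gradient r_i is exactly the residual y_i - F_{m-1}(x_i), which is why the method is often described as "fitting the residuals".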

________________________________________

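✅ Representative algorithms

👉 XGBoost, LightGBM, and CatBoost are the most widely used gradient boosting libraries today, and all three aim to improve both predictive accuracy and computational efficiency. XGBoost emphasizes regularization and parallel computation. LightGBM targets large datasets and high-dimensional features, and uses a leaf-wise tree growth strategy to save resources. CatBoost is particularly good at handling categorical features and avoids the problems caused by excessive one-hot encoding. Each has its own strengths, and all three are staples of practical modeling.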
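
As a rough illustration, the three libraries can be tried side by side through their scikit-learn-style regressors. This is only a sketch: the synthetic data, the parameter values, and the choice of test RMSE as the yardstick are assumptions, and any library that is not installed is simply skipped.

```python
import importlib

import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

# (module name, estimator class, constructor arguments)
candidates = [
    ("xgboost", "XGBRegressor",
     dict(n_estimators=100, max_depth=4, learning_rate=0.1, random_state=0)),
    ("lightgbm", "LGBMRegressor",
     dict(n_estimators=100, num_leaves=15, learning_rate=0.1,
          random_state=0, verbose=-1)),
    ("catboost", "CatBoostRegressor",
     dict(iterations=100, depth=4, learning_rate=0.1, random_seed=0,
          verbose=0, allow_writing_files=False)),
]

results = {}
for module_name, class_name, params in candidates:
    try:
        module = importlib.import_module(module_name)
    except (ImportError, OSError):
        print(f"{module_name}: not available, skipped")
        continue
    model = getattr(module, class_name)(**params)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[module_name] = float(np.sqrt(np.mean((y_test - pred) ** 2)))

for name, rmse in results.items():
    print(f"{name}: test RMSE = {rmse:.4f}")
```

On a small, clean dataset like this the three usually land close together; the differences described above tend to show up on larger data or data with many categorical columns.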

________________________________________

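✅ Basic workflow

1️⃣ Build a weak model (a small decision tree)

2️⃣ Compute the prediction errors (the residuals; more generally, the negative gradient of the loss)

3️⃣ Train the next tree specifically to correct those residuals

4️⃣ Keep adding trees until the preset number of iterations is reached or the error stops improving noticeably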
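
The four steps above can be sketched from scratch with NumPy. This is a minimal illustration, not how XGBoost or LightGBM are implemented: it assumes a single feature, squared-error loss, one-split "stumps" as the weak model, and a constant (the mean) as the starting prediction. The helper names fit_stump, predict_stump, and gradient_boost are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split on x that minimises squared error on the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=200, learning_rate=0.1, tol=1e-6):
    prediction = np.full_like(y, y.mean(), dtype=float)    # step 1: start from a trivial model
    stumps = []
    history = [np.mean((y - prediction) ** 2)]             # training MSE after each round
    for _ in range(n_rounds):
        residual = y - prediction                          # step 2: residuals
        stump = fit_stump(x, residual)                     # step 3: fit a weak model to them
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        history.append(np.mean((y - prediction) ** 2))
        if history[-2] - history[-1] < tol:                # step 4: stop when gains are negligible
            break
    return y.mean(), stumps, history

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

base, stumps, history = gradient_boost(x, y)
print(f"MSE before boosting: {history[0]:.4f}, after {len(stumps)} rounds: {history[-1]:.4f}")
```

The learning rate shrinks each stump's contribution, so many small corrections accumulate instead of one large one. Production libraries usually apply the stopping rule to a held-out validation set rather than to the training error used here.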

________________________________________

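✅ Advantages

✨ Strong predictive performance: especially well suited to complex data

✨ Built-in feature ranking: feature importance scores show which features matter most

✨ Good overfitting control: XGBoost and LightGBM include built-in regularization (L1 and L2)

✨ Support for missing values and categorical features (CatBoost is especially strong on categorical features)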

________________________________________

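✅ Disadvantages

⚠ Training takes relatively long because trees are built one after another (though XGBoost and LightGBM are heavily optimized)

⚠ Many hyperparameters that need tuning (with grid search or random search)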
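
A random search over a few common hyperparameters might look like the sketch below. It uses scikit-learn's GradientBoostingRegressor as a stand-in so that it depends only on scikit-learn; XGBRegressor follows the same estimator interface and could be swapped in. The search ranges and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, size=300)

# Candidate values to sample from (illustrative ranges).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
    "subsample": [0.7, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,                                  # try 8 random combinations
    cv=3,                                      # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.4f}")
```

Random search samples a fixed number of combinations instead of trying all 54 in this grid, which is usually the cheaper starting point when there are many hyperparameters.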

________________________________________

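✅ Use cases (very popular!)

✅ Financial risk scoring (credit, fraud detection)

✅ E-commerce recommendation systems

✅ Medical prediction models

✅ A top choice in Kaggle competitions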

________________________________________

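✅ Quick Python implementation (using XGBoost)

The example below shows how to use XGBoost to build a simple "financial risk scoring" regression model. First, we synthesize a set of simulated customer data and compute a "true" risk score from the features. Next, we split the data into training and test sets and train an XGBRegressor. After training, we evaluate the model on the test set, print RMSE, MAE, and R², and plot the feature importances to see which variables influence the risk score most. The workflow is a worked example of applying XGBoost to financial risk modeling.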

# 1. Load packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
import matplotlib.pyplot as plt

# 2. Synthesize customer data
np.random.seed(42)
n_samples = 5000
age = np.random.randint(21, 70, size=n_samples)  # age
income = np.random.normal(50000, 15000, size=n_samples).clip(10000, 200000)  # annual income
loan_amount = np.random.normal(20000, 8000, size=n_samples).clip(1000, 100000)  # loan amount
credit_history = np.random.randint(1, 31, size=n_samples)  # length of credit history (years)
num_dependents = np.random.randint(0, 6, size=n_samples)  # number of dependents

# 3. Synthesize a "true" risk score in [0, 1] (for illustration only)
risk_score = (
    0.6 * (loan_amount / income)              # loan-to-income ratio
    + 0.2 * (1 / credit_history)              # longer credit history means lower risk
    + 0.1 * (num_dependents / 5)              # share of the maximum number of dependents
    + np.random.normal(0, 0.02, n_samples)    # a little random noise
)
risk_score = np.clip(risk_score, 0, 1)  # clamp to [0, 1]

# Build the DataFrame
df = pd.DataFrame({
    'age': age,
    'income': income,
    'loan_amount': loan_amount,
    'credit_history': credit_history,
    'num_dependents': num_dependents,
    'risk_score': risk_score
})

# 4. Split into training and test sets
X = df.drop('risk_score', axis=1)
y = df['risk_score']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Build and train the XGBoost regression model
model = XGBRegressor(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    objective='reg:squarederror',
    random_state=42
)
model.fit(X_train, y_train)

# 6. Predict on the test set and evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Model evaluation metrics:")
print(f" RMSE: {rmse:.4f}")
print(f" MAE: {mae:.4f}")
print(f" R²: {r2:.4f}")

# 7. Plot feature importances
plt.figure(figsize=(6, 4))
importances = model.feature_importances_
feat_names = X.columns
plt.barh(feat_names, importances)
plt.xlabel("Importance")
plt.title("Feature Importances for Risk Scoring")
plt.tight_layout()
plt.show()

Output:

Model evaluation metrics:

RMSE: 0.0258

MAE: 0.0199

R²: 0.9754
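
For readers who want to see what these three numbers measure, the metrics can be computed by hand with NumPy. The helper name regression_metrics and the four-point toy data are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))             # root of the mean squared error
    mae = np.mean(np.abs(errors))                    # mean absolute error
    ss_res = np.sum(errors ** 2)                     # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

y_true = [0.2, 0.4, 0.6, 0.8]
y_hat = [0.25, 0.35, 0.6, 0.9]
rmse, mae, r2 = regression_metrics(y_true, y_hat)
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
# prints: RMSE=0.0612  MAE=0.0500  R2=0.9250
```

Read this way, the RMSE of 0.0258 reported above sits only slightly above the noise standard deviation of 0.02 that was injected when the risk score was synthesized, so the model has little remaining error left to remove.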

________________________________________

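✅ Key takeaways

Feature: Description

🎯 Strong performance: suited to large datasets and high-dimensional features, with good predictive power

🎯 Automatic feature selection: built-in feature importance evaluation

🎯 Strong overfitting control: supports regularization and tree pruning

🎯 Competition workhorse: a regular on Kaggle and AI competition leaderboards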

________________________________________

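📌 One sentence to remember:

XGBoost and LightGBM stand for "fast, accurate, and ruthless": when the data gets complex and the features multiply, reach for these two first and your odds of winning go up considerably!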