Overfitting is a common problem in machine learning: a model performs very well on its training data but poorly on unseen test or new data. It typically happens when the model is too complex and learns the noise and idiosyncrasies of the training set, so it fails to generalize to new datasets. An overfitted model may reach very high training accuracy, but that number does not reflect its true performance. To avoid overfitting you can apply several strategies, such as collecting more data, using regularization, simplifying the model architecture, and applying cross-validation. These methods help the model learn the genuine patterns in the data, improving the accuracy and reliability of its future predictions.
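Before diving into diagnostics, here is a minimal, self-contained sketch of what overfitting looks like in code. The synthetic sine data and the degree-15 polynomial are arbitrary choices for illustration: the over-flexible model scores near-perfectly on the data it memorized and far worse on held-out data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(60, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.3, size=60)  # noisy target
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
# A degree-15 polynomial has enough capacity to fit the noise, not just the signal
poly_model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
poly_model.fit(X_tr, y_tr)
print(f"train R^2: {poly_model.score(X_tr, y_tr):.2f}")  # close to 1.0
print(f"test  R^2: {poly_model.score(X_te, y_te):.2f}")  # much lower: overfitting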
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Assume df_analysis is our preprocessed dataset; the first 891 rows are the labeled training data
df_train = df_analysis[0:891]
X = df_train.drop(columns=['Survived'])  # all feature columns
y = df_train['Survived']  # target variable 'Survived'
# Split the dataset
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.7, random_state=42)
# Build a random forest model
model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
model.fit(train_X, train_y)
# Predict on the training and test sets
train_pred = model.predict(train_X)
test_pred = model.predict(test_X)
# Compute the accuracies
train_accuracy = accuracy_score(train_y, train_pred)
test_accuracy = accuracy_score(test_y, test_pred)
print(f"訓練集準確率: {train_accuracy:.2f}")
print(f"測試集準確率: {test_accuracy:.2f}")
"""
Training accuracy: 0.92
Test accuracy: 0.81
"""
As a rule of thumb, if the gap between training and test accuracy exceeds roughly 5-10 percentage points, you should suspect overfitting (a quick numeric check is sketched after this list).
When to use train/test accuracy comparison: a fast first check for any supervised model, whenever there is enough data for a holdout split.
Advantages of train/test accuracy comparison: simple, cheap, and easy to interpret.
Disadvantages of train/test accuracy comparison: the result depends on a single random split, so it can be noisy, and part of the data is never used for training.
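A minimal sketch of that heuristic check, reusing train_accuracy and test_accuracy from above (the 0.05 threshold is just the rule of thumb, not a hard rule):
gap = train_accuracy - test_accuracy
if gap > 0.05:
    print(f"Accuracy gap {gap:.2f} exceeds 5 percentage points; the model may be overfitting")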
from sklearn.model_selection import cross_val_score
# Evaluate the model with 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"交叉驗證準確率: {cv_scores}")
print(f"交叉驗證平均準確率: {cv_scores.mean():.2f}")
print(f"交叉驗證標準差: {cv_scores.std():.2f}")
"""
Cross-validation accuracies: [0.80446927 0.81460674 0.86516854 0.81460674 0.85955056]
Mean cross-validation accuracy: 0.83
Cross-validation standard deviation: 0.03
"""
When to use cross-validation: small to medium datasets where a single split would be unreliable, and whenever you compare models or tune hyperparameters (a per-fold train/validation comparison is sketched below).
Advantages of cross-validation: every sample is used for both training and validation, and the standard deviation across folds indicates how stable the score is.
Disadvantages of cross-validation: the training cost grows by a factor of k, and plain k-fold splitting is not appropriate for time-series data.
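Cross-validation can also expose overfitting directly. A sketch comparing the average training and validation scores per fold via cross_validate, reusing model, X, and y from above:
from sklearn.model_selection import cross_validate
# return_train_score=True also records each fold's training accuracy
cv_results = cross_validate(model, X, y, cv=5, scoring='accuracy', return_train_score=True)
print(f"Mean train accuracy: {cv_results['train_score'].mean():.2f}")
print(f"Mean validation accuracy: {cv_results['test_score'].mean():.2f}")
# A large gap between the two again suggests overfitting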
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
# Plot the learning curve
train_sizes, train_scores, test_scores = learning_curve(
    model, X, y, cv=5, n_jobs=-1, train_sizes=np.linspace(0.1, 1.0, 10))
train_scores_mean = np.mean(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
plt.plot(train_sizes, train_scores_mean, label='Training Accuracy')
plt.plot(train_sizes, test_scores_mean, label='Validation Accuracy')
plt.xlabel('Training Size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
If the training error is low but the validation error is high, and the validation error does not fall noticeably as more data is added, the model is overfitting.
Example reading of the curves:
Training accuracy: stays high, well above the validation curve, across all training sizes.
Validation accuracy: clearly lower, and it flattens out as more data is added.
Model state: overfitting (high variance).
Conclusion: more data alone will not close the gap; reduce model complexity or add regularization.
When to use learning curves: diagnosing whether a model suffers from high bias or high variance, and estimating whether collecting more data is worthwhile (a variant that shades the fold-to-fold variance is sketched below).
Advantages of learning curves: visual and intuitive; they separate underfitting from overfitting.
Disadvantages of learning curves: costly, since the model is retrained many times at different training sizes; interpretation can be ambiguous on noisy data.
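The fold-to-fold variance is informative too. A small sketch that shades one standard deviation around each curve, reusing train_sizes, train_scores, and test_scores from above:
train_scores_std = np.std(train_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
# Shaded bands show how much the scores vary across the 5 folds
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.2)
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.2)
plt.plot(train_sizes, train_scores_mean, label='Training Accuracy')
plt.plot(train_sizes, test_scores_mean, label='Validation Accuracy')
plt.legend()
plt.show()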
# Build a simpler model by reducing max_depth
simple_model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42)
simple_model.fit(train_X, train_y)
# Predict again on the training and test sets
train_pred_simple = simple_model.predict(train_X)
test_pred_simple = simple_model.predict(test_X)
# Compute the simplified model's accuracies
train_accuracy_simple = accuracy_score(train_y, train_pred_simple)
test_accuracy_simple = accuracy_score(test_y, test_pred_simple)
print(f"簡化模型訓練集準確率: {train_accuracy_simple:.2f}")
print(f"簡化模型測試集準確率: {test_accuracy_simple:.2f}")
Check whether the simplified model's test error drops. If the test error falls while the training error rises slightly, the previous model was likely overfitting (a small max_depth sweep is sketched after this list).
When to simplify the model: when the diagnostics above suggest the model is more complex than the data supports.
Advantages of simplifying the model: attacks the cause of overfitting directly, and gives faster training and inference.
Disadvantages of simplifying the model: simplifying too far leads to underfitting, and finding the right complexity requires tuning.
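A minimal sketch for locating a reasonable complexity, sweeping max_depth and reusing train_X/test_X from above (the candidate depths are arbitrary):
for depth in [2, 4, 6, 8, 12]:
    m = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    m.fit(train_X, train_y)
    # Watch where test accuracy peaks while the train/test gap stays small
    print(f"max_depth={depth}: train={m.score(train_X, train_y):.2f}, "
          f"test={m.score(test_X, test_y):.2f}")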
Note: in Chinese sources, regularization (正則化) is sometimes translated as 正規化, but normalization is also commonly translated as 正規化; the two are different things.
Common regularization methods
Summary
L1 regularization (Lasso): penalty term λ * Σ|w| (λ times the sum of the absolute values of the weights), added to the loss function.
When to use: high-dimensional data where many features may be irrelevant; L1 can drive some coefficients exactly to zero, so it also performs feature selection.
from sklearn.linear_model import Lasso
from sklearn.datasets import fetch_california_housing  # load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
# Load the dataset (California housing replaces the removed Boston dataset)
data = fetch_california_housing()
X = data.data
y = data.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Apply L1 regularization (Lasso)
lasso = Lasso(alpha=0.1)  # alpha is the regularization strength
lasso.fit(X_train, y_train)
# Predict and evaluate
y_pred = lasso.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Selected Features: {lasso.coef_}")
After adding the regularization term, if the test error falls while the training error rises slightly, the overfitting has been reduced.
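A small follow-up sketch making L1's feature-selection effect explicit, counting how many coefficients Lasso drove to zero (reusing lasso from above):
import numpy as np
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {lasso.coef_.size} coefficients were shrunk to exactly zero")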
L2 regularization (Ridge): penalty term λ * Σw² (λ times the sum of the squared weights), added to the loss function.
When to use: when all features are expected to contribute; L2 shrinks coefficients smoothly without eliminating any, and it also helps with multicollinearity.
from sklearn.linear_model import Ridge
# Apply L2 regularization (Ridge)
ridge = Ridge(alpha=0.1)  # alpha is the regularization strength
ridge.fit(X_train, y_train)
# Predict and evaluate
y_pred = ridge.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Model Coefficients: {ridge.coef_}")
After adding the regularization term, if the test error falls while the training error rises slightly, the overfitting has been reduced.
Elastic Net: combined penalty term λ1 * Σ|w| + λ2 * Σw², mixing the L1 and L2 penalties.
When to use: many correlated features; pure L1 tends to pick one feature arbitrarily from each correlated group, while Elastic Net keeps groups together.
from sklearn.linear_model import ElasticNet
# Apply Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio controls the mix of L1 and L2
elastic_net.fit(X_train, y_train)
# Predict and evaluate
y_pred = elastic_net.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Model Coefficients: {elastic_net.coef_}")
After adding the regularization term, if the test error falls while the training error rises slightly, the overfitting has been reduced.
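A quick sketch of what l1_ratio does in practice, counting zeroed coefficients as the mix shifts from mostly L2 to mostly L1 (the candidate values are arbitrary; reuses the standardized X_train/y_train from above):
import numpy as np
for r in [0.1, 0.5, 0.9]:
    en = ElasticNet(alpha=0.1, l1_ratio=r).fit(X_train, y_train)
    # Higher l1_ratio means more L1 behaviour, hence sparser coefficients
    print(f"l1_ratio={r}: {int(np.sum(en.coef_ == 0))} zero coefficients")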
When to use regularization: models with many features relative to the number of samples, or any model whose raw capacity is higher than the data supports.
Advantages of regularization: keeps the model's capacity while penalizing complexity; L1 additionally performs feature selection.
Disadvantages of regularization: introduces a strength hyperparameter (alpha/λ) that must be tuned, and it works best when features are scaled.
L1 regularization (Lasso)
from sklearn.linear_model import Lasso
from sklearn.datasets import fetch_california_housing  # load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the dataset
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Lasso (L1 regularization)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Lasso MSE: {mse}")
L2 regularization (Ridge)
from sklearn.linear_model import Ridge
# Ridge (L2 regularization)
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Ridge MSE: {mse}")
Elastic Net
from sklearn.linear_model import ElasticNet
# Elastic Net
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X_train, y_train)
y_pred = elastic_net.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Elastic Net MSE: {mse}")
from sklearn.model_selection import cross_val_score
# Cross-validate a Ridge model
ridge = Ridge(alpha=0.1)
scores = cross_val_score(ridge, X, y, cv=5, scoring='neg_mean_squared_error')
# Compute the average MSE (the scores are negative MSE)
average_mse = -scores.mean()
print(f"Cross-Validated MSE: {average_mse}")
from keras.models import Sequential
from keras.layers import Dense
# A simpler network architecture
model = Sequential([
Dense(64, input_dim=X_train.shape[1], activation='relu'),
Dense(32, activation='relu'),
Dense(1, activation='linear')
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))
from keras.callbacks import EarlyStopping
# Configure early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Train with early stopping enabled
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])
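A brief usage note: the callback records when it fired, so after fit returns you can confirm that training really did stop early:
# stopped_epoch stays 0 if the patience threshold was never reached
print(f"Early stopping fired at epoch: {early_stopping.stopped_epoch}")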
from keras.layers import Dropout
# Add Dropout layers to the model
model = Sequential([
Dense(64, input_dim=X_train.shape[1], activation='relu'),
Dropout(0.5),
Dense(32, activation='relu'),
Dropout(0.5),
Dense(1, activation='linear')
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))
from sklearn.feature_selection import SelectKBest, f_regression
# Select the K best features by univariate F-test (K must not exceed the number of features)
selector = SelectKBest(f_regression, k=5)
X_new = selector.fit_transform(X, y)
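A follow-up sketch showing which columns survived the selection, via the selector's boolean mask and per-feature scores:
mask = selector.get_support()  # True for each retained feature column
print(f"Kept feature indices: {[i for i, kept in enumerate(mask) if kept]}")
print(f"F-scores: {selector.scores_}")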
from sklearn.ensemble import RandomForestRegressor
# Use a random forest model with limited depth
rf = RandomForestRegressor(n_estimators=100, max_depth=5)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# Evaluate
mse = mean_squared_error(y_test, y_pred)
print(f"Random Forest MSE: {mse}")
Suppose you are using a polynomial regression model and want to add regularization:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
# Polynomial regression with Ridge regularization
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
ridge = Ridge(alpha=0.1)
ridge.fit(X_poly, y_train)
# Apply the same polynomial expansion to the test set before evaluating
X_test_poly = poly.transform(X_test)
print(f"Polynomial Ridge MSE: {mean_squared_error(y_test, ridge.predict(X_test_poly))}")
from keras.optimizers import Adam
# Use a learning-rate decay strategy
optimizer = Adam(learning_rate=0.01, decay=1e-6)
model.compile(optimizer=optimizer, loss='mean_squared_error')
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))
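Note that newer Keras versions removed the decay constructor argument; a sketch of the equivalent using a learning-rate schedule (the decay_steps/decay_rate values here are illustrative):
from keras.optimizers.schedules import ExponentialDecay
# Decays the learning rate by a factor of 0.9 every 10,000 steps
lr_schedule = ExponentialDecay(initial_learning_rate=0.01,
                               decay_steps=10000, decay_rate=0.9)
optimizer = Adam(learning_rate=lr_schedule)
model.compile(optimizer=optimizer, loss='mean_squared_error')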
Increase the value of the regularization parameter (alpha) to put a stronger constraint on model complexity and shrink overly large weights, making the model smoother.
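A minimal sketch of that tuning direction, sweeping alpha upward on the Ridge model from above and watching the test MSE (the candidate values are arbitrary):
for a in [0.1, 1.0, 10.0, 100.0]:
    ridge_a = Ridge(alpha=a).fit(X_train, y_train)
    # Larger alpha means smaller weights; too large, and the model underfits
    print(f"alpha={a}: test MSE = {mean_squared_error(y_test, ridge_a.predict(X_test)):.3f}")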
from sklearn.preprocessing import StandardScaler
# Standardize the data (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# Define the parameter grid
param_grid = {'alpha': [0.01, 0.1, 1, 10]}
# Tune alpha with grid search
ridge = Ridge()
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
# Retrieve the best parameters
best_params = grid_search.best_params_
print(f"Best Params: {best_params}")