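While building a system for weekly, monthly, and yearly stock return distributions across 9 countries, I ran into a practical problem: real market data is not clean.
Take the UK market as an example (Yahoo Finance data):
- Some stocks are quoted in pence (GBp)
- Some stocks are quoted in pounds (GBP)
- Some small caps show 100x unit flips in their historical data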
What does this cause?
If left unhandled, weekly returns can show values like:
+9,800%
-99%
+4,500%
These are not market moves. They are data unit errors.
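To make the arithmetic concrete, here is a minimal sketch of how a single pence/pound flip produces returns of this size. The price series is made up for illustration (a hypothetical stock trading near 1.50 GBP that is reported in pence for one week), and `simple_returns` is a helper name introduced here, not part of the original pipeline.

```python
# Toy weekly closes for a hypothetical stock trading near 1.50 GBP.
# In the third week the feed reports the price in pence (x100), then flips back.
closes = [1.50, 1.52, 150.0, 1.49, 1.51]

def simple_returns(prices):
    """Period-over-period simple returns: p[t] / p[t-1] - 1."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]

for r in simple_returns(closes):
    print(f"{r:+.1%}")
```

The second return comes out at roughly +9,800% and the third at roughly -99%, even though the underlying price barely moved.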
Problem scale
UK universe: 1,559 symbols
Files with residual near-100x scale errors: 189 (about 12%)
I did not attempt a perfect repair. Instead, I took the following approach:
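Solution (robust rather than perfect)
1️⃣ File-level exclusion
Any symbol that shows multiple near-100x scale inconsistencies
→ is excluded from the statistics entirely
Rationale:
- Distribution statistics care about the overall shape
- Excluding this 12% of high-risk small caps does not affect the median or the structure of the distribution
- But it greatly reduces tail contamination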
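2️⃣ Return clipping (Winsorization)
Weekly returns clipped to:
[-80%, +300%]
Monthly returns clipped to:
[-90%, +500%]
Yearly returns clipped to:
[-95%, +2000%]
The goal is not to alter the data, but to:
keep a single extreme value from distorting histogram bin counts.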
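The clipping rule above can be sketched in a few lines. The bounds are the ones stated in the text, expressed as decimal returns; the names `CLIP_BOUNDS` and `clip_returns` are illustrative, not taken from the original code. Note that this is fixed-bound clipping rather than percentile-based winsorization.

```python
# Clipping bounds per horizon, as decimal returns (-0.80 means -80%).
CLIP_BOUNDS = {
    "weekly": (-0.80, 3.00),
    "monthly": (-0.90, 5.00),
    "yearly": (-0.95, 20.00),
}

def clip_returns(returns, horizon):
    """Clamp every return into the [lo, hi] range for the given horizon."""
    lo, hi = CLIP_BOUNDS[horizon]
    return [min(max(r, lo), hi) for r in returns]

weekly = [0.013, 97.68, -0.99, 0.02]
print(clip_returns(weekly, "weekly"))
```

Values inside the bounds pass through unchanged, so only the extreme observations are touched.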
Results
After exclusion:
- UK monthly return samples: 77,002
- Yearly return samples: 6,589
- Smooth distribution tails
- Stable median and quantiles
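Key takeaway
This is not a data repair problem.
It is a question of:
how to build a robust statistical system on imperfect data.
When you compare distributions across countries,
data governance matters more than the algorithm.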
How to Build Robust Weekly Return Distributions from Noisy Stock Market Data
Keywords targeted:
- weekly return distribution
- stock return histogram
- winsorizing stock returns
- cross-sectional return analysis
- financial data quality
- UK stock data pence vs pound
Opening
When computing cross-sectional stock return distributions across multiple countries, I discovered that approximately 12% of the symbols in my UK universe (189 of 1,559, mostly small caps) contain persistent unit-scale inconsistencies.
Instead of trying to perfectly repair the data, I implemented a robust exclusion and winsorization framework.
Here’s how.
Step 1: Detect Near-100x Scale Errors
Some UK stocks flip between GBp and GBP pricing in historical datasets.
This produces artificial 100x jumps.
These are detected via near-factor analysis (±25% tolerance around 100x).
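One way to implement this check is sketched below. It assumes the test is applied to the ratio of consecutive prices and that "±25% around 100x" means a ratio between 75 and 125 (or its inverse); both are my reading of the description rather than confirmed details, and `near_100x_jumps` is a hypothetical helper name.

```python
def near_100x_jumps(prices, factor=100.0, tol=0.25):
    """Return indices t where prices[t] / prices[t-1] is near factor or 1/factor."""
    lo, hi = factor * (1.0 - tol), factor * (1.0 + tol)
    hits = []
    for t in range(1, len(prices)):
        prev, curr = prices[t - 1], prices[t]
        if prev <= 0 or curr <= 0:
            continue  # skip missing or invalid quotes
        ratio = curr / prev
        # Flag both upward (x100) and downward (/100) flips.
        if lo <= ratio <= hi or lo <= 1.0 / ratio <= hi:
            hits.append(t)
    return hits

# Toy series with one week reported in pence instead of pounds.
print(near_100x_jumps([1.50, 1.52, 150.0, 1.49, 1.51]))
```

A single mis-scaled week is flagged twice, once on the way up and once on the way back down, which is worth keeping in mind when counting inconsistencies per symbol.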
Step 2: Exclude Persistent Residual Symbols
Symbols with repeated scale inconsistencies are excluded entirely.
UK:
- 1559 total
- 189 excluded (~12%)
This dramatically stabilizes histogram shape.
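The exclusion step itself can be very small. In this sketch, `jump_counts` maps each symbol to the number of flagged near-100x jumps in its history; the tickers are invented, and the `min_hits=3` threshold is an assumption, since the post only says "repeated" inconsistencies.

```python
def split_universe(jump_counts, min_hits=3):
    """Split symbols into (kept, excluded) by their count of flagged jumps."""
    kept = sorted(s for s, n in jump_counts.items() if n < min_hits)
    excluded = sorted(s for s, n in jump_counts.items() if n >= min_hits)
    return kept, excluded

# Hypothetical symbols and counts, for illustration only.
jump_counts = {"AAA.L": 0, "BBB.L": 6, "CCC.L": 2, "DDD.L": 3}
kept, excluded = split_universe(jump_counts)
print("kept:", kept)
print("excluded:", excluded)
```

Excluding at the symbol level trades a little coverage for simplicity: no attempt is made to decide which individual prices are in the wrong unit.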
Step 3: Winsorize Returns
Weekly returns clipped to [-80%, +300%]
Monthly to [-90%, +500%]
Yearly to [-95%, +2000%]
This prevents tail contamination from distorting:
- mean
- quantiles
- histogram bins
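A toy comparison shows how much a single unit-error return can move a summary statistic. The return values below are invented for illustration, and the bounds are the weekly ones from Step 3.

```python
from statistics import mean, median

# Five ordinary weekly returns plus one unit-error artifact (+9,800%).
raw = [-0.05, -0.02, 0.00, 0.01, 0.03, 98.0]

lo, hi = -0.80, 3.00  # weekly clipping bounds
clipped = [min(max(r, lo), hi) for r in raw]

print(f"mean   raw={mean(raw):.3f}  clipped={mean(clipped):.3f}")
print(f"median raw={median(raw):.3f}  clipped={median(clipped):.3f}")
```

In this sample the mean drops by more than an order of magnitude after clipping, while the median is identical before and after, which is why the mean and the far tails are where clipping matters most.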
Result
- 77,000+ monthly return samples (UK)
- Stable quantile estimates
- Clean cross-market comparability
Conclusion
Perfect data cleaning is not always necessary.
Robust statistical architecture is often enough.
If you're building global return distributions, focus on these before optimizing your models:
- Data governance
- Tail control
- Symbol-level quality screening