簡介 C++ 多執行緒效能殺手「False Sharing」與兩種解法

發佈於計算機

2025/10/31 更新2025/10/31 發佈閱讀 18 分鐘

在撰寫多執行緒 (Multi-threading) 程式時，我們都希望能榨乾 CPU 的每一分效能。我們本能地會將資料分開，讓每個執行緒（Thread）處理各自的任務，並期望它們能完美地平行運作。但有時會發現：

即使兩個執行緒存取的是完全不同的變數，執行速度卻像是在互相拖累。

這個拖累效能的原因，很可能就是「False Sharing 」，有人翻成「偽共享」。針對這個問題，本文會分享我最近使用 C++ 時，用到的兩種應對方式。

本文將分成四部分：

說明什麼是 False Sharing
使用 thread_local (C++11) 避免 False Sharing
使用 alignas 和 std::hardware_destructive_interference_size (C++17) 避免 False Sharing
測試

1. 什麼是 False Sharing (偽共享)？

因為 CPU 存取 RAM 的速度較慢，所以 CPU 常在核心旁邊設置多層高速快取 (L1, L2, L3 Cache)。當 CPU 需要一個變數時，它不會只從 RAM 中讀取那 1 個位元組 (byte) 或 8 個位元組，而是會一次性地抓取一個資料塊，放進存取速度較快的快取中 (這個資料塊叫做快取行 Cache Line，在 x86 CPU 上的大小通常是 64 bytes。)，只在必要的時候才寫回 RAM。

看個簡單的範例，八個 thread，八個 counter。每個 thread 負責將一個 counter 從 0 加到 700 萬。

constexpr size_t NUM_THREADS = 8;
constexpr long long NUM_ITERATIONS = 7000000;

struct Counter {
    // 使用 atomic 避免被 compiler 優化，較易重現 false sharing
    std::atomic<long long> value{0};
};

// 一個 long long 佔 8 bytes，8 個 counter 64 bytes，剛好能塞進同一條 cache line。
Counter counters[NUM_THREADS];

void thread_function(int thread_id) {
    for (long long i = 0; i < NUM_ITERATIONS; ++i) {
        // 注意，每個執行緒只存取自己的counter
        counters[thread_id].value.fetch_add(1);
    }
}

當這 8 個執行緒在 8 個不同的 CPU 核心上同時執行 counters[thread_id]++ 時：

Core 0 寫入 counters[0] 時，它取得了整條 Cache Line。
Core 1 準備寫入 counters[1]，但發現這條 Cache Line 正在被 Core 0 佔用。它必須等待。
Core 0 很快又想要寫 counters[0]，但它發現 Cache Line 被 Core 1 搶走

兩個核心描述起來就這麼混亂，何況是八個核心。這種資料在多個核心之間不斷地來回傳輸的情況叫做「快取乒乓效應 (Cache Ping-Ponging)」，會使效能急遽下降，甚至比單執行緒還慢。

硬要塞一張圖...

在舊型 CPU 上，Core 0 甚至可能會先把資料寫回 RAM，然後 Core 1 再從 RAM 讀取，這樣會更慢。

在現代 CPU 上，通常會直接在快取間傳輸，沒那麼慢，但還是比沒 False Sharing 慢很多。

False Sharing 的觀念大概是這樣，細節我也沒很懂，再講下去就誤人子弟了。

2. 使用 thread_local (C++11) 避免 False Sharing

使用 thread_local 關鍵字宣告的變數，每個執行緒都擁有一份獨立的複本。這些複本位於 Thread Local Storage (TLS) 中，它們在記憶體中的位址是完全分開的，幾乎保證不會在同一條快取行上，因此能避免 False Sharing。

使用 thread_local 修改後的程式碼：

// 只修改 thread_function
void thread_function(int thread_id) {
    thread_local long long local_counter = counters[thread_id].value;
    for (long long i = 0; i < NUM_ITERATIONS; ++i) {
        ++local_counter;
    }
    counters[thread_id].value = local_counter;
}

因為宣告 local_counter 使用了 thread_local 關鍵字，所以這是每個 thread 自己私有的計數器，八個 counter 在記憶體中的位址是完全分開的，幾乎能確保不會有 False Sharing (不會被讀到同一條 Cache Line)。

3. 使用 alignas 和 std::hardware_destructive_interference_size (C++17) 避免 False Sharing

另個方法是，透過手動填充 (Padding) 來增加 counter 需要的記憶體大小，雖然浪費記憶體，但能避免八個 counter 被讀到同一條 Cache Line，換來的是較好的效能。

std::hardware_destructive_interference_size 的值是 C++ 標準庫提供的建議值，代表的是當前硬體上，為避免 False Sharing 而應分隔變數的最小 byte。可以是 64，但也允許更大的值，如果有兩種不同 Cache Line 大小的 CPU，std::hardware_destructive_interference_size 可以被設為兩者中較大或更能保證隔離的值。使用 std::hardware_destructive_interference_size 的程式會比直接使用 64 的可移植性好。

如果是 C++17之前程式，因為沒支援 std::hardware_destructive_interference_size，可以事先從系統讀取：

$ getconf LEVEL1_DCACHE_LINESIZE
64

然後程式中直接使用這個數字 (我的電腦是 64 bytes)。

使用 alignas 和 std::hardware_destructive_interference_size 後的程式碼：

constexpr size_t NUM_THREADS = 8;
constexpr long long NUM_ITERATIONS = 7000000;

// 可以直接寫從 LEVEL1_DCACHE_LINESIZE 拿到的數字
struct alignas(std::hardware_destructive_interference_size) AlignedCounter {
    std::atomic<long long> value{0};
    // 編譯器會自動在此處填充 (padding)
    // 例如 56 bytes，使其總大小達到 64 bytes
};

AlignedCounter aligned_counters[NUM_THREADS];

void thread_function(int thread_id) {
    for (long long i = 0; i < NUM_ITERATIONS; ++i) {
        aligned_counters[thread_id].value.fetch_add(1);
    }
}

由於每個 counter 至少 64 bytes，他們的位址間隔一定超過 64 bytes，因此不會被讀進同一條 Cache Line，便能避免 False Sharing。

4. 實驗

直接看完整程式：

#include <iostream>
#include <thread>
#include <vector>
#include <atomic>
#include <chrono>
#include <iomanip>
#include <new>
#include <functional>

constexpr size_t NUM_THREADS = 8;
constexpr long long NUM_ITERATIONS = 7000000;

using Clock = std::chrono::high_resolution_clock;

struct Counter {
    std::atomic<long long> value{0};
};

struct alignas(std::hardware_destructive_interference_size) AlignedCounter {
    std::atomic<long long> value{0};
};

Counter fs_counters[NUM_THREADS];         // for test 1: false sharing
Counter tls_counters[NUM_THREADS];        // for test 2: thread local storage
AlignedCounter pad_counters[NUM_THREADS]; // for test 3: size padding

void run_test(const std::string& test_name, std::function<void(int)> thread_body) {
    std::vector<std::thread> threads;

    std::cout << test_name << " ";

    auto start_time = Clock::now();

    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.emplace_back(thread_body, i);
    }

    for (auto& t : threads) {
        t.join();
    }

    auto end_time = Clock::now();
    std::chrono::duration<double, std::milli> elapsed = end_time - start_time;

    std::cout << "time: " << std::fixed << std::setprecision(3)
              << elapsed.count() << " ms" << std::endl;
}

int main() {
    std::cout << "std::hardware_destructive_interference_size): "
              << std::hardware_destructive_interference_size
              << " bytes\n";

    std::cout << "sizeof(Counter): " << sizeof(Counter) << " bytes\n"
              << "sizeof(AlignedCounter): " << sizeof(AlignedCounter) << " bytes\n";

    std::cout << std::endl;

    run_test("test 1. False Sharing", [](int thread_id) {
        for (long long i = 0; i < NUM_ITERATIONS; ++i) {
            fs_counters[thread_id].value.fetch_add(1);
        }
    });

    run_test("test 2. thread_local", [](int thread_id) {
        thread_local long long local_counter = tls_counters[thread_id].value;

        for (long long i = 0; i < NUM_ITERATIONS; ++i) {
            ++local_counter;
        }

        tls_counters[thread_id].value = local_counter;
    });

    run_test("test 3. Padding", [](int thread_id) {
        for (long long i = 0; i < NUM_ITERATIONS; ++i) {
            pad_counters[thread_id].value.fetch_add(1);
        }
    });

    // check counters
    std::cout << "\nfs: ";
    for(int i = 0; i < NUM_THREADS; ++i) std::cout << pad_counters[i].value.load() << " ";

    std::cout << "\ntls: ";
    for(int i = 0; i < NUM_THREADS; ++i) std::cout << tls_counters[i].value.load() << " ";

    std::cout << "\npad: ";
    for(int i = 0; i < NUM_THREADS; ++i) std::cout << pad_counters[i].value.load() << " ";

    std::cout << std::endl;

    return 0;
}

執行結果：

$ ./test
std::hardware_destructive_interference_size): 64 bytes
sizeof(Counter): 8 bytes
sizeof(AlignedCounter): 64 bytes

test 1. False Sharing time: 767.460 ms
test 2. thread_local time: 4.606 ms
test 3. Padding time: 45.192 ms

fsh: 7000000 7000000 7000000 7000000 7000000 7000000 7000000 7000000
tls: 7000000 7000000 7000000 7000000 7000000 7000000 7000000 7000000
pad: 7000000 7000000 7000000 7000000 7000000 7000000 7000000 7000000

測試 1 會 false sharing 所以一定是最慢，測試 2 是用 thread_local 最快，還算符合直覺。

至於測試 2 比測試 3 快的原因，我猜是因為用了 thread_local ，程式較容易被編譯器優化成盡量在暫存器 Register 中計算，而測試 3 的 AlignedCounter 裡面是 std::atomic，猜想會暗示編譯器盡量不要優化到暫存器 (存猜測) 。 Register 比 Cache 更快，所以測試 2 才會最快。

留言

留言分享你的想法！