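Even if everything in this article has already been priced in, it is offered only as a reference for readers who want to understand the industry and the technology. It does not constitute investment advice.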
Nvidia’s H100 is 7-die packaged on CoWoS-S. In the center is the H100 GPU ASIC, which has a die size of 814 mm².
Surrounding it are 6 stacks of HBM memory. The HBM configuration changes between SKUs, but the H100 SXM version uses HBM3 with 5 active stacks of 16GB each, for 80GB of total memory. The H100 NVL will have two packages with 6 active stacks of HBM on each package.
In cases where there are only 5 active HBM stacks, the remaining site can hold dummy silicon, which is there to provide structural support for the chip. These dies sit atop a silicon interposer that is not clearly visible in the picture. This silicon interposer sits on an ABF package substrate.
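The capacity arithmetic above can be sketched in a few lines of Python. The per-stack capacity and the number of stack sites come from the text; the function name `package_memory_gb` is only illustrative, and the six-stack figure is a plain arithmetic upper bound, not a quoted product specification.

```python
# Figures from the text; names are illustrative, not from any vendor tool.
GB_PER_STACK = 16   # HBM3 stack capacity on the H100 SXM
STACK_SITES = 6     # HBM positions around the GPU die


def package_memory_gb(active_stacks: int, gb_per_stack: int = GB_PER_STACK) -> int:
    """Total HBM capacity of one package for a given number of active stacks."""
    if not 0 <= active_stacks <= STACK_SITES:
        raise ValueError("active_stacks must be between 0 and 6")
    return active_stacks * gb_per_stack


h100_sxm_gb = package_memory_gb(5)    # 5 active stacks plus 1 dummy die
all_sites_gb = package_memory_gb(6)   # arithmetic bound if all 6 sites were active

print(h100_sxm_gb, all_sites_gb)
```

This makes the 80GB figure easy to reconcile with a package that physically carries six memory sites: one of them contributes no capacity.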
The main number-crunching component of Nvidia’s GPUs is the processor die itself, fabricated on a customized TSMC process node called “4N.” It is fabricated in TSMC’s Fab 18 in Tainan, Taiwan, sharing the same facilities as TSMC N5 and N4 process nodes.
TSMC’s utilization rates for the N5 process node fell below 70% due to massive weakness in PC, smartphone, and non-AI related datacenter chips. Nvidia has had no problem securing additional wafer supply.
In fact, Nvidia has ordered a large number of wafers for H100 GPUs and NVSwitch that started production immediately, well before they are required for shipping chips. These wafers will sit at TSMC’s die bank until the downstream supply chain has enough capacity to package these wafers into completed chips.
Basically, Nvidia is soaking up some of TSMC’s low utilization and getting a bit of a pricing benefit because Nvidia has committed to purchase the finished product further down the road.
A wafer bank, also known as a die bank, is a practice in the semiconductor industry where partially processed or completed wafers are stored until they are needed by the customer.
TSMC will help their customers by keeping these wafers on their own books almost fully processed. This practice allows TSMC and its customers to maintain financial flexibility. As they are only partially processed, wafers held in the wafer bank are not considered finished goods, but instead are classified as work in progress (WIP). It is only when these wafers are fully completed that TSMC can recognise revenue and transfer ownership of these wafers to their customers.
This helps customers dress up their balance sheet so it appears that inventory levels are under control. For TSMC, the benefit is that it can help keep their utilization rates higher which supports margins. Then, as the customer needs more inventory, these wafers can be fully completed by a few final processing steps and then delivered to the customer at the normal sales price or even a slight discount.
The High Bandwidth Memory around the GPU is the next major component. HBM supply is also limited, but ramping. HBM is vertically stacked DRAM dies connected via Through Silicon Vias (TSVs) and bonded using thermocompression bonding (TCB); hybrid bonding will be required for higher stack counts in the future. The DRAM dies sit on a base logic die that acts as a controller. Typically, modern HBM has 8 layers of memory and 1 base logic die, but we will see products with 12+1 layer HBM soon, for example AMD’s MI300X and Nvidia’s upcoming H100 refresh.
Interestingly, it was AMD that pioneered HBM, despite Nvidia and Google being the highest volume users today. In 2008, AMD predicted that the continual scaling of memory bandwidth to match gaming GPU performance would need more and more power, which would have to be diverted away from the GPU logic and would therefore detract from GPU performance. AMD partnered with SK Hynix and other companies in the supply chain (such as Amkor) to find a memory solution that would deliver high bandwidth with lower power. This resulted in the development of HBM in 2013 by SK Hynix.
SK Hynix first shipped HBM in 2015 for AMD’s Fiji series of gaming GPUs which was 2.5D packaged by Amkor. This was followed up with the Vega series in 2017, which used HBM2. However, HBM wasn’t much of a game-changer for gaming GPU performance. With no clear performance benefits coupled with higher cost, AMD returned to using GDDR for its gaming cards after Vega. Today, top of the line gaming GPUs from Nvidia and AMD are still using cheaper GDDR6.
However, AMD was somewhat correct with their initial prediction: scaling memory bandwidth has proven to be a problem for GPUs, just that it is a problem mostly for datacenter GPUs. With consumer gaming GPUs, Nvidia and AMD have turned to large caches for the frame buffer, enabling them to stay with much lower bandwidth GDDR memory.
As we have detailed in the past, inference and training workloads are memory intensive. The exponential rise in the number of parameters in AI models is pushing model size to terabytes for weights alone. Therefore, AI accelerator performance is bottlenecked by the ability to store and retrieve training and inference data from memory: a problem often known as the memory wall.
To address this, leading-edge datacenter GPUs are co-packaged with High Bandwidth Memory (HBM). Nvidia released their first HBM GPU, the P100, in 2016. HBM tackles the memory wall by finding a middle ground between conventional DDR memory and on-chip cache, trading capacity for bandwidth. Much higher bandwidth is achieved by drastically increasing pin counts to reach a 1024-bit-wide memory bus per HBM stack, which is 16x that of DDR5 at 64 bits per DIMM. At the same time, power is kept in check with drastically lower energy per bit transferred (pJ/bit). This is achieved through much shorter trace lengths, which measure in millimeters for HBM versus centimeters for GDDR and DDR.
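The bus-width comparison above can be checked with a short Python sketch. The two bus widths are from the text; the helper `peak_bandwidth_gb_s` and the 1 GT/s rate used in the example are assumptions chosen only to show how width translates into bandwidth at an equal per-pin data rate.

```python
# Bus widths from the text; the helper name and example rate are illustrative.
HBM_BUS_BITS = 1024   # per HBM stack
DDR5_BUS_BITS = 64    # per DDR5 DIMM


def peak_bandwidth_gb_s(bus_bits: int, gt_per_s: float) -> float:
    """Peak bandwidth in GB/s: bits per transfer times transfers per second, over 8."""
    return bus_bits * gt_per_s / 8


width_ratio = HBM_BUS_BITS // DDR5_BUS_BITS

# At the same (hypothetical) per-pin rate, the bandwidth ratio equals the width ratio.
hbm_at_1gt = peak_bandwidth_gb_s(HBM_BUS_BITS, 1.0)
ddr5_at_1gt = peak_bandwidth_gb_s(DDR5_BUS_BITS, 1.0)

print(width_ratio, hbm_at_1gt, ddr5_at_1gt)
```

The point of the sketch is that HBM gets its bandwidth from width rather than from pushing each pin faster, which is also why it needs the dense, short connections that only an interposer can provide.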
Today, many HPC-facing chip companies are enjoying the fruit of AMD’s efforts. Ironically, AMD’s rival Nvidia has perhaps stood to benefit the most as the highest volume user of HBM.
As the pioneer of HBM, SK Hynix is the leader with the most advanced technology roadmap. SK Hynix started production of HBM3 in June 2022 and is currently the only supplier shipping HBM3 in volume, with over 95% market share; this is what most H100 SKUs are using. The maximum HBM configuration today is the 8-layer 16GB HBM3 module. SK Hynix is producing 12-layer 24GB HBM3 with a data rate of 5.6 GT/s for the AMD MI300X and the Nvidia H100 refresh.
The main challenge with HBM is packaging and stacking the memory, which is what SK Hynix has excelled at, having accumulated the strongest process flow knowledge. In a future post, we will also detail how SK Hynix’s 2 key packaging innovations are beginning to ramp up and will displace one key equipment vendor in the current HBM process.
Samsung is next behind Hynix and expects to ship HBM3 in the second half of 2023. We believe these parts are designed for both Nvidia and AMD GPUs. Samsung currently has a big deficit in volume to SK Hynix, but they are hot on the trail and are investing hugely to catch up and become number 1 in HBM market share, just as they are in standard memory. We hear they are cutting favorable deals with some of the accelerator firms to try to capture more share.
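As a rough check on the HBM3 figures quoted above, the following Python sketch derives per-die capacity and per-stack peak bandwidth. The layer counts, capacities, 5.6 GT/s data rate and 1024-bit bus are from the text; the function names are illustrative, and the bandwidth is a theoretical peak, not a measured number.

```python
# Inputs from the text; helper names are illustrative assumptions.
HBM_BUS_BITS = 1024   # bus width per HBM stack


def per_die_gb(stack_gb: float, layers: int) -> float:
    """Capacity contributed by each DRAM layer in a stack."""
    return stack_gb / layers


def stack_bandwidth_gb_s(gt_per_s: float, bus_bits: int = HBM_BUS_BITS) -> float:
    """Theoretical peak bandwidth of one stack in GB/s."""
    return bus_bits * gt_per_s / 8


die_8hi = per_die_gb(16, 8)       # 8-layer 16GB stack
die_12hi = per_die_gb(24, 12)     # 12-layer 24GB stack
bw_12hi = stack_bandwidth_gb_s(5.6)

print(die_8hi, die_12hi, round(bw_12hi, 1))
```

Both configurations work out to the same capacity per DRAM layer, so the extra capacity of the 12-layer part comes from stacking more dies rather than from denser dies, which is consistent with packaging and stacking being the main challenge.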
They have shown off their 12-layer HBM as well as future Hybrid Bonded HBM. One interesting aspect of Samsung’s HBM-4 roadmap is that they want to make the logic/periphery on an in-house FinFET node. This shows the potential advantage they have of having logic and DRAM foundry in house.
Micron is the furthest behind. Micron was more heavily invested in Hybrid Memory Cube (HMC) technology. This was a competing technology to HBM with a very similar concept that developed around the same time. However, the ecosystem around HMC was closed, making it difficult for IP to be developed around HMC. Furthermore, there were some technical deficiencies. Adoption for HBM was much higher so HBM won out to become the industry standard for 3D stacked DRAM.
It was only in 2018 that Micron started to pivot away from HMC and invest in an HBM roadmap. This is why Micron is the furthest behind. They are still stuck on HBM2E (which SK Hynix started mass producing in the middle of 2020) and cannot even manufacture top bin HBM2E successfully.
In their most recent earnings call Micron made some bold statements about their HBM roadmap: they believe they will go from laggard to leader with their HBM3E in 2024. HBM3E is expected to start to ship in Q3/Q4 for Nvidia’s next generation GPU.
“Our ramp of HBM3, actually the sort of the next generation of HBM3, which is a much higher level of performance, bandwidth and lower power than what is in production in HBM3 today in the industry. That product, our industry-leading product, will be ramping in volume starting CQ1 of 2024, and will be meaningful in revenue for fiscal year '24 and then substantially larger in 2025, even from those 2024 levels. And we will - we are targeting a very robust share in HBM, higher than our natural supply share for DRAM in the industry.”
Sumit Sadana, Micron Chief Business Officer
The statement that they will have a higher market share in HBM than their general DRAM market share is very bold. Given they still struggle to manufacture top bin HBM2E at high volume, we find it hard to believe Micron’s claim that they will ship leading edge HBM3 in early 2024 and even be the first to HBM3E. It feels to us as though Micron is trying to change the narrative away from being an AI loser, despite dramatically lower memory content per Nvidia GPU server versus an Intel/AMD CPU server.
All our channel checks see SK Hynix remaining strongest at new-generation technologies and Samsung trying very hard to catch back up with huge supply increases, a bold roadmap, and cutting deals.
The next bottleneck is CoWoS capacity. CoWoS (Chip on Wafer on Substrate) is a “2.5D” packaging technology from TSMC where multiple active silicon dies (the usual configuration is logic and HBM stacks) are integrated on a passive silicon interposer. The interposer acts as a communication layer for the active die on top. The interposer and active silicon are then attached to a packaging substrate which contains the I/O to place on the system PCB.
HBM and CoWoS are complementary. The high pad count and short trace length requirements of HBM necessitate 2.5D advanced packaging technologies like CoWoS to enable dense, short connections that can't be done on a PCB or even a package substrate. CoWoS is the mainstream packaging technology that offers the highest interconnection density and largest package size at reasonable cost. As almost all HBM systems are currently packaged on CoWoS, and all advanced AI accelerators use HBM, the corollary is that virtually all leading-edge data center GPUs are packaged on CoWoS by TSMC. Baidu does have some advanced accelerators packaged with Samsung on its version of the technology.
While 3D packaging technologies such as TSMC's SoIC enable stacking dies directly on top of logic, this does not make sense for HBM due to thermals and cost. SoIC sits at a different order of magnitude in interconnect density and is better suited to expanding on-chip cache with die stacking, as seen with AMD's 3D V-Cache solution. AMD’s Xilinx was also the first user of CoWoS many years ago, for combining multiple FPGA chiplets together.
While there are some other applications that use CoWoS, like networking (some of which is adopted for networking GPU clusters, like Broadcom’s Jericho3-AI), supercomputing, and FPGAs, the vast majority of CoWoS demand comes from AI. In other parts of the semiconductor supply chain, weakness in other major end markets means there is plenty of slack to absorb the enormous pickup in GPU demand. CoWoS and HBM, by contrast, are already majority AI-facing technologies, so all slack was already absorbed in Q1. With GPU demand exploding, these are the parts of the supply chain that just cannot keep up and are bottlenecking GPU supply.
TSMC has been getting ready for more packaging demand but probably did not expect this wave of generative AI demand to come so quickly. In June, TSMC announced they have opened their Advanced Backend Fab 6 in Zhunan. This fab has an area of 14.3 hectares, which would be enough cleanroom space for potentially 1 million wafers per year of 3D Fabric capacity. This includes not only CoWoS but also SoIC and InFO technologies. Interestingly, this fab is larger than the rest of TSMC’s packaging fabs combined. While this is just cleanroom space and far from fully tooled to actually provide that much capacity, it’s clear that TSMC is getting ready in anticipation of more demand for its advanced packaging solutions.
What does help a little is there is slack in Wafer Level Fan-Out packaging capacity (primarily used for smartphone SoCs), and some of this can be repurposed in some CoWoS process steps. In particular, there are some overlapping processes such as deposition, plating, back grinding, molding, placement, and RDL formation. We will go through the CoWoS process flow and all the firms who see positive demand due to it in a follow-up piece. There are meaningful shifts in the equipment supply chain.
While there are other 2.5D packaging technologies from Intel, Samsung and OSATs (like FOEB from ASE), CoWoS is the only one being used in high volume, given TSMC is by far the most dominant foundry for AI accelerators. Even Intel Habana’s accelerators are fabricated and packaged by TSMC. However, some customers are seeking alternatives to TSMC, which we will discuss below.
There are a few variants of CoWoS but the original CoWoS-S remains the only configuration in high volume production. This is the classic configuration as described above: logic die + HBM die are connected via a silicon-based interposer with TSVs. The interposer then sits on an organic package substrate.
An enabling technology for silicon interposers is a technique called “reticle stitching”. Chips generally have a maximum size of 26mm x 33mm because the slit/scan field of lithography tools maxes out at that size. With the GPU die alone approaching this limit and the need to fit HBM around it too, interposers need to be large and will go well beyond this reticle limit. TSMC addresses this with reticle stitching, which allows them to pattern interposers at multiples of the reticle limit (as of now up to 3.5x with AMD MI300).
CoWoS-R uses an organic substrate with redistribution layers (RDLs) instead of a silicon interposer. This is a lower cost variant that sacrifices I/O density due to using an organic RDL instead of a silicon-based interposer. As we have detailed, AMD’s MI300 was originally designed on CoWoS-R, but we believe that, due to warpage and thermal stability concerns, AMD had to use CoWoS-S instead.
CoWoS-L is expected to ramp later this year and utilises an RDL interposer but contains an active and/or passive silicon bridge used for die-to-die interconnect that is embedded inside the interposer. This is TSMC’s equivalent to Intel’s EMIB packaging technology. This will allow for larger package sizes as silicon interposers are getting harder to scale. The MI300 CoWoS-S may be near the limit for a single silicon interposer.
It will be far more economical for even larger designs to go with CoWoS-L. TSMC is working on a CoWoS-L super carrier interposer at 6x reticle size. For CoWoS-S, they have not mentioned anything beyond a 4x reticle. This is because of the fragility of the silicon interposer. This silicon interposer is only 100 microns thick and is at risk of delaminating or cracking during the process flow as interposers scale to larger sizes.
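To put the reticle multiples in this section into absolute terms, here is a small Python sketch. The 26mm x 33mm field, the 3.5x, 4x and 6x multiples, and the 814 figure for the H100 die are from the text (with the die size read as mm²); the function name is illustrative, and the computed areas are nominal multiples of the reticle field, not published interposer dimensions.

```python
# Reticle field and multiples from the text; names are illustrative.
RETICLE_W_MM = 26
RETICLE_H_MM = 33
RETICLE_AREA_MM2 = RETICLE_W_MM * RETICLE_H_MM   # single-exposure limit

H100_DIE_MM2 = 814   # die size quoted earlier, read as mm^2


def interposer_area_mm2(reticle_multiple: float) -> float:
    """Nominal interposer area for a given multiple of the reticle limit."""
    return reticle_multiple * RETICLE_AREA_MM2


mi300_area = interposer_area_mm2(3.5)    # largest stitched CoWoS-S cited
cowos_s_max = interposer_area_mm2(4)     # largest CoWoS-S multiple mentioned
cowos_l_target = interposer_area_mm2(6)  # CoWoS-L super carrier target
h100_fraction = H100_DIE_MM2 / RETICLE_AREA_MM2

print(RETICLE_AREA_MM2, mi300_area, cowos_s_max, cowos_l_target, round(h100_fraction, 3))
```

The last value shows how close the H100 die alone sits to the single-reticle limit, which is why the interposer carrying it plus the HBM stacks has to be stitched from multiple exposures.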
https://www.semianalysis.com/p/ai-capacity-constraints-cowos-and