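The much-discussed Z-Image was developed by Alibaba Group's Tongyi Lab (released under the Tongyi-MAI organization). The lightweight model released so far, Z-Image-Turbo, performs at a high level in hardware requirements, generation speed, and output quality. Its Apache-2.0 license permits both open-source and commercial use, so it is likely to have a real influence on the ecosystem. In an era when foundation models keep getting larger, Z-Image lands a heavy punch, and some have called it the "DeepSeek moment for image generation models".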
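In this article I share hands-on tests of Z-Image-Turbo's capabilities, compare it with several popular open-source models, and briefly cover the key techniques behind it, so that readers get a better sense of its potential.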
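Hardware requirements and generation speed
Z-Image-Turbo has only 6 billion parameters. Combined with quantization, it can potentially run on a consumer GPU with 16 GB of VRAM (for example, an RTX 5060 Ti), which makes it quite accessible. Thanks to distillation, about 8 steps are enough to produce good-quality images (most models need 20 to 30), so a single image can potentially be generated within a few seconds.
Benchmark: hardware usage and speed
My setup is a local machine with an RTX 4090 GPU running ComfyUI. I measured the time to generate one image and the peak GPU VRAM usage at different image sizes.
Z-Image-Turbo uses the official workflow (bf16). The comparison target is FLUX-1 [schnell], another fast variant, also with its official workflow (12 billion parameters, fp8). Both run 8 steps (FLUX-1 [schnell] normally uses 4, but I matched the step count for a fair comparison).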
Speed comparison:
- 1024x1024 (1 MP): Z-Image-Turbo 4.4 s, FLUX-1 [schnell] 10.8 s
- 1536x1024 (1.5 MP): Z-Image-Turbo 8.1 s, FLUX-1 [schnell] 14.8 s
- 1920x1088 (2 MP): Z-Image-Turbo 11.7 s, FLUX-1 [schnell] 17.4 s
Hardware usage comparison (peak VRAM): Z-Image-Turbo about 20 GiB, FLUX-1 [schnell] about 21 GiB.
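To make the speed gap easier to read, the timings listed above can be turned into speedup ratios. This is just arithmetic on the reported numbers; the dictionary keys and the function name are my own labels.

```python
# Seconds per image as reported in the list above (RTX 4090, 8 steps each).
timings = {
    "1024x1024": {"z_image_turbo": 4.4, "flux1_schnell": 10.8},
    "1536x1024": {"z_image_turbo": 8.1, "flux1_schnell": 14.8},
    "1920x1088": {"z_image_turbo": 11.7, "flux1_schnell": 17.4},
}


def speedup(row):
    """How many times faster Z-Image-Turbo is than FLUX-1 [schnell]."""
    return row["flux1_schnell"] / row["z_image_turbo"]


speedups = {size: round(speedup(row), 2) for size, row in timings.items()}
for size, ratio in speedups.items():
    print(f"{size}: {ratio:.2f}x")
```

The ratio comes out at about 2.45x for the smallest size and about 1.49x for the largest, so the advantage narrows as resolution grows.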
Quality comparison:
![Prompt: A hyper-realistic extreme close-up portrait of an elderly woman with deep wrinkles and wise eyes, natural lighting, shot on 35mm Kodak Portra 400 film. Every pore and vellus hair is visible on her skin. Soft sunlight hitting the side of her face, creating a gentle chiaroscuro effect. Background is a blurred rustic kitchen. High texture, raw, authentic, unpolished, depth of field. (左:Z-Image-Turbo 右:FLUX-1 [schnell])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F96d4ef3a-d1aa-426c-9295-58271155a2f7.jpg&width=740&sign=dMYs62Z5_1F041pKTbWSOaBfAkOVVnHziGpaWxaMU68)
Prompt: A hyper-realistic extreme close-up portrait of an elderly woman with deep wrinkles and wise eyes, natural lighting, shot on 35mm Kodak Portra 400 film. Every pore and vellus hair is visible on her skin. Soft sunlight hitting the side of her face, creating a gentle chiaroscuro effect. Background is a blurred rustic kitchen. High texture, raw, authentic, unpolished, depth of field. (Left: Z-Image-Turbo; right: FLUX-1 [schnell])
![Prompt: A breathtaking anime landscape painting in the style of Studio Ghibli. A solitary young traveler with a large backpack stands on a mossy cliff edge overlooking a vast, ancient valley filled with overgrown fantasy ruins and distant waterfalls. Enormous, fantastical cloud formations at sunset, casting warm orange, purple, and deep blue hues across the scene. Gentle wind blowing through long grass. Watercolor texture background, hand-painted feel, highly detailed vegetation. No text, no signs, no letters, no speech bubbles present in the image. (左:Z-Image-Turbo 右:FLUX-1 [schnell])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2Fd97da098-2a4f-4397-a65b-b3a5f923c45d.jpg&width=740&sign=GikXIZouLIq2H8x29KCj2vwVvs6PmHkO48Y_TpSSAnA)
Prompt: A breathtaking anime landscape painting in the style of Studio Ghibli. A solitary young traveler with a large backpack stands on a mossy cliff edge overlooking a vast, ancient valley filled with overgrown fantasy ruins and distant waterfalls. Enormous, fantastical cloud formations at sunset, casting warm orange, purple, and deep blue hues across the scene. Gentle wind blowing through long grass. Watercolor texture background, hand-painted feel, highly detailed vegetation. No text, no signs, no letters, no speech bubbles present in the image. (Left: Z-Image-Turbo; right: FLUX-1 [schnell])
![Prompt: A retro pop art illustration in the style of Roy Lichtenstein. A dramatic comic book panel showing a crying woman with blonde hair. Thick black outlines, bold primary colors (red, yellow, blue). The entire image has a visible vintage halftone dot print texture. The aesthetic is graphic, bold, and textured like old newsprint. (左:Z-Image-Turbo 右:FLUX-1 [schnell])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F20ef6e3e-e00e-4ff4-b21e-ebd01fb87859.jpg&width=740&sign=rxeGOhlkiGWvCanV2WuJR7MdT1w3_ux8MzflM3pph5c)
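Prompt: A retro pop art illustration in the style of Roy Lichtenstein. A dramatic comic book panel showing a crying woman with blonde hair. Thick black outlines, bold primary colors (red, yellow, blue). The entire image has a visible vintage halftone dot print texture. The aesthetic is graphic, bold, and textured like old newsprint. (Left: Z-Image-Turbo; right: FLUX-1 [schnell])
Summary: The two models use about the same amount of VRAM, but Z-Image-Turbo has a clear speed advantage. By the timings above it is roughly 1.5x to 2.5x as fast as FLUX-1 [schnell], with the largest gap at the smallest image size. Small images (1024x1024) come out in a few seconds, which suits quickly trying different prompts during early ideation. On quality, Z-Image-Turbo clearly wins, with a large gap in the richness of detail and texture.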
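Technique: S3-DiT architecture reduces the parameter count
Z-Image proposes the S3-DiT (Scalable Single-Stream Diffusion Transformer) architecture, which effectively reduces the number of parameters the model needs.
For the diffusion stage of image generation, the most popular model design today is DiT (Diffusion Transformer). This architecture has to decide how the Transformer handles two modalities at once, image latent tokens and text tokens, and there are two approaches: single-stream and double-stream.
Single-stream concatenates the image and text tokens and shares one set of QKVO and MLP weights between them. Double-stream builds two independent channels, one for image and one for text, each with its own QKVO and MLP weights, and the two exchange information only in the attention step. The advantage of single-stream is that it needs fewer weights, while double-stream is generally considered to give better image-text understanding.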
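As a rough illustration of why sharing weights saves parameters, the sketch below counts the weights of one Transformer block under each design. It is a simplification under stated assumptions: it counts only the Q, K, V, O projections and a two-layer MLP, ignores biases, norms, and modulation layers, and uses a hypothetical hidden size that is not taken from Z-Image's actual configuration.

```python
def block_params(hidden, mlp_ratio=4, streams=1):
    """Approximate weight count of one Transformer block.

    Counts only the Q, K, V, O projections and a two-layer MLP,
    ignoring biases, norms, and modulation layers.
    """
    attention = 4 * hidden * hidden            # Q, K, V, O
    mlp = 2 * hidden * (mlp_ratio * hidden)    # up- and down-projection
    return streams * (attention + mlp)


hidden = 3072  # hypothetical width, not Z-Image's real configuration
single = block_params(hidden, streams=1)  # image and text tokens share weights
double = block_params(hidden, streams=2)  # one weight set per modality

print(f"single-stream block: {single / 1e6:.1f}M weights")
print(f"double-stream block: {double / 1e6:.1f}M weights")
```

Under these assumptions a double-stream block carries twice the weights of a single-stream block of the same width, which is the saving the paragraph above describes.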
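Some models mix the two (for example FLUX-1 and FLUX-2), and some use only double-stream (for example SD3). Z-Image goes fully single-stream, showing that a model can still generate excellent images with far fewer parameters, which also lowers the hardware requirements.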
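Technique: D-DMD distillation algorithm reduces the number of steps
Z-Image proposes the D-DMD (Decoupled Distribution Matching Distillation) algorithm to achieve high-quality generation in few steps.
DMD is a common distillation algorithm for diffusion models. The conventional view is that DMD succeeds because the student model simply matches the distribution of the teacher model. In practice, however, it often has to rely on CFG (Classifier-Free Guidance) to work, which is a contradiction between theory and implementation.
The Z-Image team found that DMD actually contains two independent mechanisms: CFG Augmentation (CA) and Distribution Matching (DM). CA is the real "engine": it carries CFG's decision pattern into the model and drives the conversion from many steps to few. DM is the "regulator": it corrects artifacts in the generated output and keeps training stable.
By decoupling the two, Z-Image can design a separate optimization schedule for each (decoupled schedules). For example, CA can focus on generating detail while DM handles global image-quality correction. The team shows that this division of labor produces finer, artifact-free images in very few steps (such as 8) than conventional methods.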
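For reference, the CFG operation that this section keeps referring to is the standard guidance formula: the sampler combines an unconditional prediction and a text-conditioned prediction, and extrapolates toward the conditioned one. The sketch below shows only that generic formula on toy numbers; it is not the D-DMD training procedure, and the function name and values are my own.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance on two predictions of equal length.

    Starts from the unconditional prediction and moves toward (and,
    for scale > 1, past) the text-conditioned prediction.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy stand-ins for a model's two predictions at one sampling step.
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]

print(cfg_combine(uncond, cond, 1.0))  # same as cond: no extra guidance
print(cfg_combine(uncond, cond, 4.0))  # extrapolates beyond cond
```

Because regular CFG needs both predictions at every step, a student that has absorbed the guided behavior during distillation can skip that extra work at inference time, which fits the "engine" role described for CA above.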
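Image quality, instruction following, and English text rendering
This section runs a broader set of tests on Z-Image-Turbo and compares it with another recent open-source model, FLUX-2 [dev] (official workflow, 32 billion parameters, bf16). Both models generate 1024x1024 images. Z-Image-Turbo uses 8 steps and FLUX-2 [dev] uses 20.
Benchmark: detail, texture, composition, lighting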
![Prompt: A candid, hyper-realistic photographic portrait of an elderly fisherman's face. His skin is deeply weathered, with every wrinkle, sun spot, and coarse pore visible. He has a messy white beard. Natural side lighting from a window (Rembrandt lighting) sculpts his face, highlighting the rough texture. Film grain aesthetic. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F094e04b9-32fc-40cf-9279-dbf8a7889dfe.jpg&width=740&sign=YsNuZgSTFkrBAJHjdhnOyqq-ncNiK8SCvNnImUWAFVg)
Prompt: A candid, hyper-realistic photographic portrait of an elderly fisherman's face. His skin is deeply weathered, with every wrinkle, sun spot, and coarse pore visible. He has a messy white beard. Natural side lighting from a window (Rembrandt lighting) sculpts his face, highlighting the rough texture. Film grain aesthetic. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
![Prompt: A symmetrical interior photograph of a massive, historic library hall. Towering wooden bookshelves packed with old books line both walls, creating leading lines that draw the eye down a long central aisle to a large arched window at the far end. Natural light beams (god rays) stream through the window, illuminating dust motes in the air. Wide-angle lens, deep focus. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F2b510123-67a2-4390-b86e-cc5fb7d5911f.jpg&width=740&sign=KtUo4uc05xhiHm7gOxlg918g3JZzPdKUL0eibJy1Bi0)
Prompt: A symmetrical interior photograph of a massive, historic library hall. Towering wooden bookshelves packed with old books line both walls, creating leading lines that draw the eye down a long central aisle to a large arched window at the far end. Natural light beams (god rays) stream through the window, illuminating dust motes in the air. Wide-angle lens, deep focus. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
![Prompt: A photograph of a luxurious still life scene on a polished dark wood table. A heavy cut crystal whiskey glass filled with amber liquid and a single large ice cube sits next to an antique engraved silver pocket watch and an old leather-bound book. Studio lighting highlights the light refraction through the glass and ice, and the metallic reflections on the silver. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F48abcecf-f0bd-429f-8cf3-92640018fd36.jpg&width=740&sign=wVsosz5BcC0AUbMqfYduAswVy8XkhKPN3FB3qllc-1w)
Prompt: A photograph of a luxurious still life scene on a polished dark wood table. A heavy cut crystal whiskey glass filled with amber liquid and a single large ice cube sits next to an antique engraved silver pocket watch and an old leather-bound book. Studio lighting highlights the light refraction through the glass and ice, and the metallic reflections on the silver. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
![Prompt: A highly detailed microscopic view of a single, intricate snowflake crystal resting on a rough, wool glove. Macro photography, extreme shallow depth of field. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2Ff8da579a-d32a-404e-8530-9688a024b9c7.jpg&width=740&sign=s3W2AJZM5jiY1_b2zonrhSeG0kWt3YQiJ1e64X8Lc1M)
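Prompt: A highly detailed microscopic view of a single, intricate snowflake crystal resting on a rough, wool glove. Macro photography, extreme shallow depth of field. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
Summary: Across these tests, both models produce rich detail and texture, and composition and lighting are satisfying in both, so the choice mostly comes down to aesthetic preference and there is no clear winner. The gap in hardware requirements and generation speed is huge, though, so unless you have specific aesthetic requirements, Z-Image-Turbo is still the one I would recommend.
Note: At 32 billion parameters, FLUX-2 [dev] is several times the size of the 6-billion-parameter Z-Image-Turbo. On an RTX 4090 it probably cannot be loaded into VRAM all at once, so generation is very slow, taking more than a minute per image.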
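A back-of-the-envelope estimate supports the note above. The sketch counts only the memory for the diffusion model's weights at a given precision; the text encoder, VAE, and activations would come on top, and the 24 GB figure is the RTX 4090's nominal memory size. The helper name is my own.

```python
GIB = 1024 ** 3
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gib(params_billion, dtype):
    """GiB needed just to hold the weights, nothing else."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / GIB


RTX_4090_VRAM_GIB = 24  # nominal memory of the card used in these tests

for name, size in [("Z-Image-Turbo", 6), ("FLUX-2 [dev]", 32)]:
    need = weight_gib(size, "bf16")
    verdict = "fits" if need < RTX_4090_VRAM_GIB else "does not fit"
    print(f"{name}: {need:.1f} GiB in bf16, {verdict} in {RTX_4090_VRAM_GIB} GiB")
```

By this estimate the bf16 weights of a 6-billion-parameter model take about 11 GiB, while those of a 32-billion-parameter model take about 60 GiB, far more than the card holds, so part of the model has to be offloaded or swapped during generation.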
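Interestingly, the two models also carry different Eastern and Western "cultures". After running several more sets, my impression is that Z-Image-Turbo is more likely to generate East Asian faces, so the distribution of what a model generates does depend somewhat on who built it.
Benchmark: art styles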
![Prompt: An Impressionist oil painting in the style of Claude Monet. A woman in a flowing white dress holding a parasol strolls through a lush, flowering garden pathway near a riverbank. Loose, visible brushstrokes, thick impasto texture on the canvas, dappled sunlight filtering through leaves, soft pastel color palette, atmospheric haze without sharp lines. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F3e46d3e9-5f2b-41d4-8176-af43dc2b05e3.jpg&width=740&sign=a2_mhMNYjpVHmBCPi3QTj-eiZJv0mzv4lWu3GVLegSM)
Prompt: An Impressionist oil painting in the style of Claude Monet. A woman in a flowing white dress holding a parasol strolls through a lush, flowering garden pathway near a riverbank. Loose, visible brushstrokes, thick impasto texture on the canvas, dappled sunlight filtering through leaves, soft pastel color palette, atmospheric haze without sharp lines. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
Both produce an oil painting, but in different styles. FLUX-2 [dev] is closer to Monet's Impressionist technique, with soft light, a natural feel, fleeting changes of light and shadow, and loose, free brushstrokes.
![Prompt: A screenshot from a 1990s Japanese anime film. A lone samurai sits under a large ancient tree on a hill, looking at a distant valley. Hand-drawn cel-shading art style, distinct bold black outlines, flat color blocks with hard shadow edges, painted watercolor background layers, slight film grain. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F91f69673-68d1-4a69-a942-29fb2b911167.jpg&width=740&sign=7Rh9F4r6ZDqBl0ucLx0-Z4imciHk4mCN195unijAn5A)
Prompt: A screenshot from a 1990s Japanese anime film. A lone samurai sits under a large ancient tree on a hill, looking at a distant valley. Hand-drawn cel-shading art style, distinct bold black outlines, flat color blocks with hard shadow edges, painted watercolor background layers, slight film grain. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
FLUX-2 [dev] wins this one as well. It better matches the look of a 1990s Japanese anime screenshot, with the distinct flat color blocks and hard shadow edges of hand-drawn cel animation. Z-Image-Turbo gets pulled away by the "watercolor background" keyword and ends up closer to an illustration or picture-book style.
![Prompt: A retro Pop Art comic book panel. A close-up of a vintage fighter pilot in a cockpit wearing a helmet and goggles. Bold black outlines, flat primary colors (red, yellow, blue), visible halftone dot patterns (Ben-Day dots) across the shadows and the helmet reflection, graphic novel aesthetic. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F56afbeaa-4e3b-4bda-966b-c64c0fb14597.jpg&width=740&sign=9J2iMrN6OqRFarq4Bmm3Nm8-q5ac5PKfRwa69UdBVQU)
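Prompt: A retro Pop Art comic book panel. A close-up of a vintage fighter pilot in a cockpit wearing a helmet and goggles. Bold black outlines, flat primary colors (red, yellow, blue), visible halftone dot patterns (Ben-Day dots) across the shadows and the helmet reflection, graphic novel aesthetic. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
This set tests the understanding of Pop Art. Both show large color blocks and dots, but FLUX-2 [dev] follows the primary-color requirement (red, yellow, blue) more closely, and its dots, which cover almost the whole image, feel more like Pop Art.
Summary: Overall, FLUX-2 [dev] understands different art styles somewhat better. Precise style control with Z-Image-Turbo may require more carefully crafted prompts.
Benchmark: pose and action control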
![Prompt: A portrait of a futuristic robot with a vibrant crimson body and glowing cyan eyes, holding up a single, bright yellow flower. Photorealistic render, studio lighting. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F59d28cbb-f5f8-4423-9e12-9c355e363a5f.jpg&width=740&sign=-B4ScMCgqSevK6QYjjwTU-VifrZ5rFB5B_qy0gcyPOg)
Prompt: A portrait of a futuristic robot with a vibrant crimson body and glowing cyan eyes, holding up a single, bright yellow flower. Photorealistic render, studio lighting. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
Z-Image-Turbo's thumb and index finger do not actually grip the flower. FLUX-2 [dev] gets the pose more right, but the thumb is cut off.
![Prompt: A professional female dancer is frozen mid-air in a perfect split leap (saut de basque). Her body is angled toward the viewer, arms extended elegantly. Shot outdoors at sunset, golden hour lighting, detailed costume. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2Feac3d866-b15d-44ef-8985-36f339e0292d.jpg&width=740&sign=p9jokSg6xtoLvymBG4vJCFlP1sRwIZWDEl4tzETLdPg)
Prompt: A professional female dancer is frozen mid-air in a perfect split leap (saut de basque). Her body is angled toward the viewer, arms extended elegantly. Shot outdoors at sunset, golden hour lighting, detailed costume. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
Both split leaps are well done. Z-Image-Turbo better satisfies the request that the body be angled slightly toward the viewer.
![Prompt: A close-up shot of two people high-fiving in a crowded, dimly lit stadium. The woman is shouting in excitement, while the man is smiling calmly. Cinematic lighting, hyper-realistic photo. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F5c46ffe2-2441-42fb-a8b0-e309f74a98e6.jpg&width=740&sign=iKTSdUvray-JnAsCLUmNa9s5s9P274cEqSwPF7_RkSU)
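Prompt: A close-up shot of two people high-fiving in a crowded, dimly lit stadium. The woman is shouting in excitement, while the man is smiling calmly. Cinematic lighting, hyper-realistic photo. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
Both satisfy the expression requirements for the woman (shouting in excitement) and the man (smiling calmly). In the high-five itself, however, Z-Image-Turbo's woman also has her left hand raised, which is an obvious error.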
![Prompt: A cinematic shot of a couple dancing a passionate Argentine Tango in a dimly lit, smoky ballroom. The woman wears a flowing red dress with a high slit, her leg is elegantly hooked around the man's leg (gancho move). The man in a sharp black suit holds her firmly. Dramatic rim lighting highlights their silhouettes and the motion blur of the dress. Intense emotional connection. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F32007afb-c595-4bb0-9ab2-9ce4d6b5d1bc.jpg&width=740&sign=WOBqYSwgF3GG7h_I6elzu0YeWwETBR2U_lzbDDiqSig)
Prompt: A cinematic shot of a couple dancing a passionate Argentine Tango in a dimly lit, smoky ballroom. The woman wears a flowing red dress with a high slit, her leg is elegantly hooked around the man's leg (gancho move). The man in a sharp black suit holds her firmly. Dramatic rim lighting highlights their silhouettes and the motion blur of the dress. Intense emotional connection. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
The request was for the woman to hook her leg around the man's leg (gancho move). Neither image shows an actual hook, so neither meets the requirement.
Summary: Z-Image-Turbo is still weak at fine-grained pose control and may need outside help to compensate, such as LoRAs for specific poses.
Benchmark: spatial reasoning
![Prompt: A photorealistic nature photograph of a precariously balanced stone cairn (rock stack) on a pebble beach at sunset. The stack consists of exactly seven river stones of varying sizes and shapes, arranged vertically from the largest at the bottom to the smallest at the top, appearing to defy gravity with tiny contact points. The lighting is golden hour light, casting long shadows. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2Fc716430d-c0e1-4a50-ab35-052349639b65.jpg&width=740&sign=Ibirgoys177nkk7H-Umu5U_Dl-MPIzeyPVhl3jbJrsA)
Prompt: A photorealistic nature photograph of a precariously balanced stone cairn (rock stack) on a pebble beach at sunset. The stack consists of exactly seven river stones of varying sizes and shapes, arranged vertically from the largest at the bottom to the smallest at the top, appearing to defy gravity with tiny contact points. The lighting is golden hour light, casting long shadows. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
The request was for exactly 7 stacked stones of varying sizes and shapes. Both models generate more than 7 stones, though Z-Image-Turbo does meet the "varying shapes" requirement.
![Prompt: A photograph from above of dozens of cute, fluffy cats arranged to form the letters C, A, and T. The cats should look happy and playful, full color, high detail, photographic. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F7e1a4be5-11ae-465e-b78a-ec1612681d20.jpg&width=740&sign=oru3_E6flY6RPLk3sNY72NqNPhBo-38ypFJW5RLEKdQ)
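Prompt: A photograph from above of dozens of cute, fluffy cats arranged to form the letters C, A, and T. The cats should look happy and playful, full color, high detail, photographic. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
This prompt asks for a group of cute, fluffy cats arranged in the shape of "CAT". FLUX-2 [dev] barely manages the requested shape. Z-Image-Turbo instead tortures the cats into deformed shapes, with an innocent, idle cat off to the side staring at you.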
![Prompt: A precise top-down photograph (flat lay) of a professional snooker table ready for the start of a game. The triangle pack of 15 red balls is tightly arranged just behind the pink ball. The yellow, green, and brown balls are perfectly aligned on their spots along the baulk line. The blue ball is in the exact center of the table, and the black ball is on its spot near the top cushion. All positions must be standard and accurate. Even studio lighting. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F6f838ace-4e4e-4aa7-b5fb-10afaf6d0f48.jpg&width=740&sign=noMjiMVvWuV95Yxb1fHlHxb9b_PfZ0qhb_7_FIgExT8)
Prompt: A precise top-down photograph (flat lay) of a professional snooker table ready for the start of a game. The triangle pack of 15 red balls is tightly arranged just behind the pink ball. The yellow, green, and brown balls are perfectly aligned on their spots along the baulk line. The blue ball is in the exact center of the table, and the black ball is on its spot near the top cushion. All positions must be standard and accurate. Even studio lighting. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
The request was for the standard snooker layout at the start of a frame. Both models clearly fail.
Summary: Z-Image-Turbo's spatial reasoning is quite weak. Improvement probably depends on later versions adding more control. For example, Z-Image-Edit may be released later and could allow more precise control of physical space by taking an input image.
Benchmark: English text rendering
![Prompt: A full-body shot of a young woman wearing a dark blue graphic T-shirt. The chest features a large, stylized retro-futuristic emblem where the text "Z-Image vs FLUX-2" is clearly displayed in glowing chrome gradient letters with cyan and magenta neon outlines, surrounded by digital circuit patterns. Golden hour natural light. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F8f68479e-d61d-4a14-baf2-c76b63d63e8c.jpg&width=740&sign=sp6UWqjsEUKBnoRZVAN8K_VvMUSmAoccyZB2SOzhBDU)
Prompt: A full-body shot of a young woman wearing a dark blue graphic T-shirt. The chest features a large, stylized retro-futuristic emblem where the text "Z-Image vs FLUX-2" is clearly displayed in glowing chrome gradient letters with cyan and magenta neon outlines, surrounded by digital circuit patterns. Golden hour natural light. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
Both render the text on the shirt correctly. FLUX-2 [dev] looks more natural (the lettering deforms with the fabric on the body), while Z-Image-Turbo's text looks more like it was pasted straight onto the shirt, which comes across as stiff.
![Prompt: A vibrant, detailed four-panel sequential comic strip (manga style) displayed side-by-side. The panels feature a young hero and a small robot in a futuristic setting. Panel 1 (Top Left): The hero shouts. Speech bubble must clearly read: "Target acquired!" Panel 2 (Top Right): The robot replies. Speech bubble must clearly read: "Confirmed." Panel 3 (Bottom Left): The hero leaps. Caption box must clearly read: "ACTION!" Panel 4 (Bottom Right): The action concludes. Sound effect on the panel must clearly read: "BAM!" (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2Fabe158c9-cef8-482f-9454-8697b7139244.jpg&width=740&sign=RlEn1sCsP-6U4tlCqQxE5CJRC2qdqWHiOESK-MNEv0c)
Prompt: A vibrant, detailed four-panel sequential comic strip (manga style) displayed side-by-side. The panels feature a young hero and a small robot in a futuristic setting. Panel 1 (Top Left): The hero shouts. Speech bubble must clearly read: "Target acquired!" Panel 2 (Top Right): The robot replies. Speech bubble must clearly read: "Confirmed." Panel 3 (Bottom Left): The hero leaps. Caption box must clearly read: "ACTION!" Panel 4 (Bottom Right): The action concludes. Sound effect on the panel must clearly read: "BAM!" (Left: Z-Image-Turbo; right: FLUX-2 [dev])
Both render the text of this simple four-panel comic correctly. Only the drawing style differs.
![Prompt: A photorealistic, top-down close-up shot of a vintage newspaper front page lying on a wooden table. The newspaper is named "THE DAILY CHRONICLE" in a large, gothic black font at the very top. Below the name, a smaller line reads "Vol. 105 - Issue 42 - Saturday, December 6, 2025 - Price $2.00". The main headline is huge, bold, and capitalized, clearly reading: "ARTIFICIAL INTELLIGENCE SOLVES PHYSICS". Below the headline, the text is divided into three distinct vertical columns: 1. Left Column: Has a sub-header "Market Reaction". The text paragraph below it must be legible and read: "Stocks jumped fifty points today as tech companies announced major breakthroughs. Investors are celebrating the new era of generative computing speed." 2. Center Column: Has a sub-header "The New Model". The text below reads: "Scientists claim the new Flux-2 and Z-Image architectures have surpassed human capability in visual rendering tasks and text synthesis accuracy." 3. Right Column: Has a small box titled "Weather Forecast". Inside the box, a list reads: "London: Rainy 10C", "New York: Sunny 15C", "Tokyo: Cloudy 18C", "Taipei: Humid 25C". At the very bottom right, a small footer text reads: "Printed in High Resolution. Copyright 2025." Lighting is even and natural, ensuring every letter is sharp and readable. High texture paper grain. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2F10663140-9212-4260-abaf-fd8e2e200ab7.jpg&width=740&sign=21hTYrzNTDbVuxyoZ7Wb7-6fBKE-fm3ENAWaBOH6nAI)
Prompt: A photorealistic, top-down close-up shot of a vintage newspaper front page lying on a wooden table. The newspaper is named "THE DAILY CHRONICLE" in a large, gothic black font at the very top. Below the name, a smaller line reads "Vol. 105 - Issue 42 - Saturday, December 6, 2025 - Price $2.00". The main headline is huge, bold, and capitalized, clearly reading: "ARTIFICIAL INTELLIGENCE SOLVES PHYSICS". Below the headline, the text is divided into three distinct vertical columns: 1. Left Column: Has a sub-header "Market Reaction". The text paragraph below it must be legible and read: "Stocks jumped fifty points today as tech companies announced major breakthroughs. Investors are celebrating the new era of generative computing speed." 2. Center Column: Has a sub-header "The New Model". The text below reads: "Scientists claim the new Flux-2 and Z-Image architectures have surpassed human capability in visual rendering tasks and text synthesis accuracy." 3. Right Column: Has a small box titled "Weather Forecast". Inside the box, a list reads: "London: Rainy 10C", "New York: Sunny 15C", "Tokyo: Cloudy 18C", "Taipei: Humid 25C". At the very bottom right, a small footer text reads: "Printed in High Resolution. Copyright 2025." Lighting is even and natural, ensuring every letter is sharp and readable. High texture paper grain. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
This set generates a newspaper front page. Both layouts match the request, and larger type such as the headline and sub-headers is stable. Small text, however, breaks down (for example "announced" in Z-Image-Turbo's lower-left box) or repeats (for example "capability in visual" in Z-Image-Turbo's lower-middle box).
![Prompt: A complex, dense horizontal infographic visualization titled "BLUEPRINT FOR A SUSTAINABLE FUTURE CITY 2050" at the top center. The overall aesthetic is a futuristic technical blueprint with glowing blue and green lines on a dark background. The central element is a stylized isometric map of a futuristic green city with vertical gardens and flying vehicles, labeled "CENTRAL HUB: ECO-METROPOLIS". Connected to this central hub by data streams are four surrounding detailed data panels: Top Left Panel: Titled "RENEWABLE ENERGY MIX". Shows a pie chart displaying percentages: "Solar 45%", "Wind 30%", "Hydro 25%", with corresponding icons. Top Right Panel: Titled "URBAN MOBILITY FLOW". Shows a vertical bar chart comparing transport modes, with tall bars labeled "Public Transit (High)" and "Active Transport (Walk/Cycle)", and a short bar for "Private EVs (Low)". Bottom Left Panel: Titled "CIRCULAR ECONOMY LOOP". Shows a four-stage circular flow diagram with arrows and icons: "DESIGN FOR LONGEVITY" -> "RESPONSIBLE CONSUMPTION" -> "RESOURCE RECOVERY (Recycle)" -> "RE-MANUFACTURING". Bottom Right Panel: Titled "GREEN INFRASTRUCTURE GOALS". Shows an icon of a large tree canopy over buildings with a large data readout: "TARGET: 50% Urban Canopy Cover by 2050" and a smaller text "CO2 Sequestration Max.". Many connecting lines, subtle grid patterns, and digital interface elements fill the background. (左:Z-Image-Turbo 右:FLUX-2 [dev])](https://resize-image.vocus.cc/resize?norotation=true&quality=80&url=https%3A%2F%2Fimages.vocus.cc%2Fc35e6b38-3d27-4e92-9993-c5ce35a3c06b.jpg&width=740&sign=IIxQg9Yo3OYUq6Sc3mk1wbhNS8U3cwtyIwUu-6JT_tQ)
Prompt: A complex, dense horizontal infographic visualization titled "BLUEPRINT FOR A SUSTAINABLE FUTURE CITY 2050" at the top center. The overall aesthetic is a futuristic technical blueprint with glowing blue and green lines on a dark background. The central element is a stylized isometric map of a futuristic green city with vertical gardens and flying vehicles, labeled "CENTRAL HUB: ECO-METROPOLIS". Connected to this central hub by data streams are four surrounding detailed data panels: Top Left Panel: Titled "RENEWABLE ENERGY MIX". Shows a pie chart displaying percentages: "Solar 45%", "Wind 30%", "Hydro 25%", with corresponding icons. Top Right Panel: Titled "URBAN MOBILITY FLOW". Shows a vertical bar chart comparing transport modes, with tall bars labeled "Public Transit (High)" and "Active Transport (Walk/Cycle)", and a short bar for "Private EVs (Low)". Bottom Left Panel: Titled "CIRCULAR ECONOMY LOOP". Shows a four-stage circular flow diagram with arrows and icons: "DESIGN FOR LONGEVITY" -> "RESPONSIBLE CONSUMPTION" -> "RESOURCE RECOVERY (Recycle)" -> "RE-MANUFACTURING". Bottom Right Panel: Titled "GREEN INFRASTRUCTURE GOALS". Shows an icon of a large tree canopy over buildings with a large data readout: "TARGET: 50% Urban Canopy Cover by 2050" and a smaller text "CO2 Sequestration Max.". Many connecting lines, subtle grid patterns, and digital interface elements fill the background. (Left: Z-Image-Turbo; right: FLUX-2 [dev])
This set generates a complex infographic. Both roughly follow the requested layout, but Z-Image-Turbo gets many details wrong. For example, the bar chart at the top right should have 3 bars, not 5. As before, neither model is stable enough on small text.
Summary: Z-Image-Turbo handles somewhat larger English text and simple layouts well, but it cannot yet reliably generate small text or more complex layouts and charts.
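Multilingual capability
The official claim is that Z-Image renders English and Chinese text reliably. Looking at its technical architecture, though, it should also have broader multilingual understanding and text rendering ability, so here I test how far its abilities go beyond English.
Benchmark: multilingual understanding
In each of the tests below, one prompt is translated into 8 languages (English, Simplified Chinese, Traditional Chinese, Japanese, Korean, Spanish, German, Russian), and each version is fed to the model separately with the same fixed seed to reduce variables. The goal is to test the model's understanding and stability across languages.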
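The test procedure described above can be sketched as a small harness. This is an illustration of the method only: `generate` is a hypothetical callable standing in for whatever actually produces the image (in my case a ComfyUI workflow), and the stub, the shortened prompts, and the seed value are placeholders so the sketch runs without a model.

```python
def run_language_sweep(prompts, generate, seed=42):
    """Render one image per language, reusing the same seed each time.

    `prompts` maps a language name to its translated prompt.
    `generate` is a hypothetical callable: (prompt, seed) -> image.
    """
    return {lang: generate(prompt, seed) for lang, prompt in prompts.items()}


def fake_generate(prompt, seed):
    """Stub generator so the sketch runs without a model."""
    return f"image(seed={seed}, prompt={prompt!r})"


# Shortened placeholder prompts, not the full prompts used in the tests.
prompts = {
    "English": "A majestic goldfish floating above a pond",
    "Traditional Chinese": "一條宏偉的金魚漂浮在池塘上方",
}

results = run_language_sweep(prompts, fake_generate, seed=42)
for lang, image in results.items():
    print(lang, "->", image)
```

Holding the seed constant means the starting noise is the same for every language, so differences between the outputs are more likely to come from how the prompt was understood than from random variation.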

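Prompt (translated from the Traditional Chinese version): A majestic, oversized Ranchu goldfish floating above an ethereal pond, with Mount Fuji and blooming cherry blossoms in the distance. Ukiyo-e woodblock print style, deep indigo and vermilion, highly detailed waves and fish scales.
This tests nouns (Ranchu goldfish, Mount Fuji, cherry blossoms) and style (ukiyo-e, deep indigo and vermilion). English, Simplified Chinese, Traditional Chinese, and Japanese are highly consistent with each other. The Korean result seems to show a different goldfish breed, the Spanish and German results differ from the rest in color palette, and the Russian result stands out with a vividly colored goldfish.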

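Prompt (translated from the Traditional Chinese version): A young man with his eyes closed, in a state of deep contemplation and concentration. His index finger lightly touches his chin. A single warm spotlight shines from above, casting stark shadows. The surroundings are completely dark.
This tests facial expression and pose (eyes closed, contemplation and concentration, index finger touching the chin) and lighting and environment (spotlight, completely dark surroundings). The English, Simplified Chinese, and Traditional Chinese results are very similar. Japanese, Spanish, and Russian are also broadly consistent apart from the subject's face. German does not meet the "completely dark surroundings" requirement. Korean misses both "eyes closed" and "index finger touching the chin", and its camera angle is clearly different from the rest.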

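Prompt (translated from the Traditional Chinese version): A close-up portrait of an elderly female scholar wearing a black turtleneck sweater. Shot from an extreme low angle, looking up from below. High texture detail, very shallow depth of field.
This tests framing (close-up portrait) and camera technique (extreme low angle, very shallow depth of field). Again, English, Simplified Chinese, and Traditional Chinese stay consistent. The Japanese camera angle is slightly off, Spanish does not show the "extreme low angle", Russian does not show the "very shallow depth of field", and Korean turns into a painted portrait.
Summary: These tests show that Z-Image-Turbo does have a degree of multilingual understanding. For the most precise and stable instruction following, however, English, Simplified Chinese, and Traditional Chinese are the best choices. Japanese is barely acceptable, and other languages may be less stable.
Benchmark: Chinese text rendering
In the tests below, the same prompt is given once in Traditional Chinese and once in Simplified Chinese (with the same fixed seed) to check whether there is a capability gap between the two.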

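Prompt (translated from Chinese; quoted strings are kept in the original): A night photograph of a busy Asian city street after rain. The focus is a huge, slightly worn neon sign hanging on the second-floor exterior wall of a building. Glowing pink neon tubes bend to spell out the large characters "歡迎光臨", and directly below, bright blue neon tubes spell "本店二十四小時營業". Some exposed wires and transformers are visible around the tubes. The wet asphalt and the umbrellas held by passersby strongly reflect the sign's pink and blue glow. The background shows bustling crowds, blurred taillights of traffic, and the flickering Chinese signs of other shops. (Left: Traditional Chinese; right: Simplified Chinese)
For the Chinese characters on the neon sign, the Traditional Chinese output is very good apart from a slightly distorted "營", and the Simplified Chinese output is entirely correct.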

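Prompt (translated from Chinese; quoted strings are kept in the original): A high-quality graphic design poster in a minimalist style (Swiss Style). The background is a clean off-white, decorated with a few abstract, bold red circles and black-line geometric shapes. The layout strictly follows a grid system, and the text is crisp and sharp: 1. At the top of the poster is a huge, bold black sans-serif title: "二零二五亞洲國際平面設計藝術大展". 2. Below the title is a smaller, elegant gray subtitle: "探索視覺的邊界與無限可能". 3. At the very bottom of the poster are two neatly aligned lines of small red text: "台北市立美術館 一月一日至一月三十日". Emphasis on negative space, vector-graphic look, high-quality print finish. (Left: Traditional Chinese; right: Simplified Chinese)
Even on this simple graphic design, both Traditional and Simplified Chinese have flaws. The Traditional Chinese output has many wrong characters or characters that turned into Simplified forms, and even large text such as "二零二五亞洲國際平面設計藝術大展" has missing or extra characters. The Simplified Chinese output renders individual characters more stably, but it also fails to follow the request exactly, even in the large title.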

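Prompt (translated from Chinese; quoted strings are kept in the original): A close-up, top-down macro photograph of an open, ancient Chinese thread-bound book. The paper is yellowed and textured. The text is set in traditional vertical columns, written in clearly legible black regular script (kaishu). The text on the page must clearly read: "夫天地者,萬物之逆旅也;光陰者,百代之過客也。而浮生若夢,為歡幾何?古人秉燭夜遊,良有以也。況陽春召我以煙景,大塊假我以文章。會桃花之芳園,序天倫之樂事。群季俊秀,皆為惠連;吾人詠歌,獨慚康樂。幽賞未已,高談轉清。開瓊筵以坐花,飛羽觴而醉月。不有佳詠,何伸雅懷?" The characters must be clear and moderately spaced, and must strictly follow right-to-left vertical alignment. Soft, warm library lighting. (Left: Traditional Chinese; right: Simplified Chinese)
This prompt asks for a longer passage of classical Chinese. In both Traditional and Simplified Chinese, many characters are garbled (especially those with more strokes), and problems with sentence order, omissions, and repetition are obvious.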

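Prompt (translated from Chinese; quoted strings are kept in the original): A lavishly decorated traditional Chinese hall scene full of festive Lunar New Year atmosphere. The background is a richly textured vermilion wall and a dark wooden door frame hung with many red lanterns. The absolute focus of the image is a huge pair of golden calligraphy couplets hanging in the center: the right scroll clearly reads "恭喜發財" and the left scroll clearly reads "萬事如意". Below the couplets is a rosewood altar table covered with assorted New Year goods, stacked gold ingots, and blooming peonies. Warm, festive lighting fills the whole scene. (Left: Traditional Chinese; right: Simplified Chinese)
This turned out to be an interesting case. I expected "恭喜發財" and "萬事如意", two very common New Year greetings, to be easy to generate, but neither the Traditional nor the Simplified Chinese output fully matches what was specified.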

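Prompt (translated from Chinese; quoted strings are kept in the original): A documentary-style street photograph of a busy, slightly chaotic narrow alley in an Asian city. Between the old buildings hang many handmade cloth strips and banners with frayed edges. They look as if they were made from old bedsheets or coarse cloth, with huge Traditional Chinese characters roughly handwritten in black paint or ink. The clearest banners in the image read as follows: a large banner stretched across the street reads "六四天安門"; a vertical banner hanging by the wall reads "台灣獨立"; another banner fluttering in the wind reads "法輪大法好"; and smaller banners read "言論自由" and "勿忘歷史". Pedestrians pass along the street, and the light is overcast natural light, with a realistic, grainy feel. (Left: Traditional Chinese; right: Simplified Chinese)
This tests politically sensitive terms (the banners refer to the June Fourth Tiananmen incident, Taiwan independence, Falun Dafa, freedom of speech, and remembering history). Judging from the results, this open-source model allows quite a lot of freedom: the team has not "blocked" specific terms or facts the way DeepSeek does.
Summary: Z-Image-Turbo renders short Chinese text reliably, with Simplified slightly better than Traditional, but longer or more complex instructions still need work.
Benchmark: text rendering in other languages
In the two tests below, one prompt is again translated into 6 languages (Japanese, Korean, Russian, French, Hindi, Thai), and each version is fed to the model separately (with the same fixed seed) to test how well it renders text in these languages.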

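Prompt (translated from the Traditional Chinese version): A cozy cafe counter. The chalkboard menu has "美味的咖啡" ("Delicious coffee") handwritten in white chalk. In front is a latte with latte art.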

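Prompt (translated from the Traditional Chinese version): An outdoor event venue. A huge banner hangs above the entrance, reading "歡迎來到音樂節" ("Welcome to the music festival"). The background is a clear blue sky.
Summary: These two tests are on the easy side, but the results are still quite impressive: roughly 90% of the rendered text is correct. Z-Image-Turbo clearly has some multilingual rendering ability, and it could be useful where full accuracy is not required, such as generating a street scene in a particular country for atmosphere.
Technique: Qwen3 brings native multilingual capability
Z-Image's text encoder is the Qwen3 LLM, rather than the text side of a conventional vision-language model.
In earlier diffusion models (such as SD 1.5 and SDXL), the text encoder was usually CLIP or OpenCLIP. These models can align image and text features, but their language understanding is inherently limited: they essentially support only English, and the input length is capped (for example at 77 tokens). As a result, users had to write prompts in a special "incantation" syntax, and it was hard to describe complex logic in natural language.
Recent models have moved to stronger LLMs or VLMs, which greatly improves understanding of long sentences and details and brings native multilingual support. FLUX-2 adopted Mistral-3, Qwen-Image adopted Qwen2.5-VL, and Z-Image uses Alibaba's own Qwen3.
Interestingly, Mistral-3 and Qwen2.5-VL are both dual-modality VLMs, while Qwen3 is a text-only LLM. I am not sure why Z-Image made this choice.
Conclusion: an open-source model worth watching
Z-Image-Turbo's biggest strengths are its low hardware requirements and fast generation. Image quality and detail are also strong, which makes it well suited to rapid experimentation and concept art. Fine-grained control over art style, pose, and physical space still has room to improve. Z-Image-Turbo understands multiple languages, with English and Chinese the most stable. It also has some multilingual text rendering ability, usable in simpler scenarios.
The tests in this article show that Z-Image-Turbo still has many shortcomings. Given its open-source release and cost advantage, though, it has a good chance of thriving in the open-source community, where fine-tuned models and LoRAs can fill in the current gaps. Add the possible release of Z-Image-Edit, and its next steps are well worth following.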



















