A New SOTA in Image Generation: Tongyi's Qwen-Image, a Disruptor in the Open-Source World

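The Tongyi team has open-sourced its first image model, Qwen-Image, a 20B-parameter MMDiT model that reportedly rivals GPT-4o in complex text rendering and image editing. It is released under the Apache 2.0 license and is free for commercial use. (The image editing mode has not been released yet.)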

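Key highlights:

Strong text generation and layout: Qwen-Image excels at complex text rendering, supporting multi-line layouts, paragraph-level text generation, and fine-grained text detail. It produces highly realistic output in both English and Chinese.
Precise, consistent image editing: Thanks to an enhanced multi-task joint training strategy, Qwen-Image maintains strong contextual consistency during image editing, so that modified content blends naturally into the original image.
Leading performance across benchmarks: On multiple public benchmarks, Qwen-Image achieves state-of-the-art (SOTA) results on both image generation and editing tasks, demonstrating its strength as a foundation model for image generation.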

Try Qwen-Image online for free: https://kontextflux.io/image-models/qwen-image

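Qwen-Image was evaluated on a range of public benchmarks, and the official results show it outperforming existing models across tasks. For general image generation, the model was tested on GenEval, DPG, and OneIG-Bench. For image editing, it was evaluated on GEdit, ImgEdit, and GSO. For text rendering, Qwen-Image stands out on LongText-Bench, ChineseWord, and TextCraft, and is especially strong at Chinese text generation. This consistent lead across diverse benchmarks positions Qwen-Image as a top-tier image generation model that combines broad general capability with precise text rendering.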

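Beyond the official benchmark comparisons, the third-party Artificial Analysis Image Arena Leaderboard also ranks Qwen-Image.

Across all image generation models (open-source and closed-source), Qwen-Image is roughly on par with Flux Kontext pro, Imagen 3.0, and Ideogram 3.0.

Compared against open-source models only, Qwen-Image does achieve SOTA results.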

Using the open-source model

The Qwen-Image code and model weights are available across GitHub, Hugging Face, and ModelScope.

ComfyUI has already added support for Qwen-Image:

Workflow: JSON workflow
Docs: ComfyUI Native Workflow Example

Running locally

transformers>=4.51.3 (Supporting Qwen2.5-VL)
Install the latest version of diffusers
system requirements: 24GB GPU memory and 64GB+ RAM
pip install git+https://github.com/huggingface/diffusers
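
As a rough sanity check on the memory figures above, the sketch below estimates the size of the weights alone for a 20B-parameter model at two precisions. It is back-of-the-envelope arithmetic rather than a measurement: it ignores the text encoder, the VAE, and activations, and the helper name `weight_memory_gib` is made up for illustration.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for model weights only, in GiB."""
    return num_params * bytes_per_param / 1024**3


PARAMS_20B = 20e9  # the "20B" figure quoted for Qwen-Image

bf16_gib = weight_memory_gib(PARAMS_20B, 2)  # bfloat16: 2 bytes per parameter
fp32_gib = weight_memory_gib(PARAMS_20B, 4)  # float32: 4 bytes per parameter

print(f"bf16: {bf16_gib:.1f} GiB, fp32: {fp32_gib:.1f} GiB")
```

At bfloat16 the weights alone come to roughly 37 GiB, which is more than a 24GB card holds. That may be why the requirement also lists 64GB+ of system RAM, although the source does not say so.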

from diffusers import DiffusionPipeline
import torch

model_name = "Qwen/Qwen-Image"

# Load the pipeline
if torch.cuda.is_available():
    torch_dtype = torch.bfloat16
    device = "cuda"
else:
    torch_dtype = torch.float32
    device = "cpu"

pipe = DiffusionPipeline.from_pretrained(model_name, torch_dtype=torch_dtype)
pipe = pipe.to(device)

positive_magic = {
    "en": "Ultra HD, 4K, cinematic composition.", # for english prompt
    "zh": "超清，4K，电影级构图" # for chinese prompt
}

# Generate image
prompt = '''A coffee shop entrance features a chalkboard sign reading "Qwen Coffee 😊 $2 per cup," with a neon light beside it displaying "通义千问". Next to it hangs a poster showing a beautiful Chinese woman, and beneath the poster is written "π≈3.1415926-53589793-23846264-33832795-02384197".'''

negative_prompt = " " # Recommended if you don't use a negative prompt.


# Generate with different aspect ratios
aspect_ratios = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1104),
    "3:4": (1104, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}

width, height = aspect_ratios["16:9"]

image = pipe(
    prompt=prompt + " " + positive_magic["en"],  # space keeps the suffix from fusing with the prompt
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.Generator(device=device).manual_seed(42)  # use the selected device so the CPU fallback also works
).images[0]

image.save("example.png")
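
The script above hard-codes the English quality suffix. A minimal sketch of choosing the suffix from the prompt's dominant language follows. The character-counting heuristic and the helper names (`is_cjk`, `pick_magic_lang`, `with_magic`) are assumptions for illustration, not part of the Qwen-Image release.

```python
POSITIVE_MAGIC = {
    "en": "Ultra HD, 4K, cinematic composition.",  # for English prompts
    "zh": "超清，4K，电影级构图",  # for Chinese prompts
}


def is_cjk(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def pick_magic_lang(prompt: str) -> str:
    """Hypothetical heuristic: 'zh' if CJK characters outnumber ASCII letters."""
    cjk = sum(1 for ch in prompt if is_cjk(ch))
    latin = sum(1 for ch in prompt if ch.isascii() and ch.isalpha())
    return "zh" if cjk > latin else "en"


def with_magic(prompt: str) -> str:
    """Append the quality suffix matching the prompt's dominant language."""
    lang = pick_magic_lang(prompt)
    separator = "，" if lang == "zh" else " "
    return prompt.rstrip() + separator + POSITIVE_MAGIC[lang]


english_prompt = 'A neon sign displaying "通义千问" beside a coffee shop'
chinese_prompt = "一只戴着围巾的小猫坐在窗边"
print(with_magic(english_prompt))
print(with_magic(chinese_prompt))
```

An English prompt that quotes a few Chinese characters, like the coffee shop example, still gets the English suffix under this heuristic.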

Showcase

I asked it to create a website illustration summarizing Qwen-Image's capabilities:

A movie poster titled "The Power of Qwen-Image". The first row is the main title in a bold, modern font: "QWEN-IMAGE: THE FUTURE OF IMAGING". The second row, directly below, reads "Witness Unparalleled Text Rendering and Precise Image Editing". The third row states "Starring: Superior Chinese & English Text Generation". The fourth row reads "Director: The 20B MMDiT Architecture". The central visual features a sleek, futuristic computer (representing the 20B MMDiT model) from which radiant colors, whimsical creatures, and dynamic, swirling patterns explosively emerge, symbolizing its generative power. Emerging from the digital energy are clear, realistic depictions of its capabilities: a shop sign with the Chinese text "云存储", a book cover with the English text "The Silent Patient", and a traditional Chinese couplet with elegant calligraphy. The background transitions from dark, cosmic tones into a luminous, dreamlike expanse, evoking a digital fantasy realm. At the bottom edge, the text "Powered by State-of-the-Art Cross-Benchmark Performance" appears in a bold, modern sans-serif font with a glowing, slightly transparent effect. The overall style blends sci-fi surrealism with graphic design flair—sharp contrasts, vivid color grading, and layered visual depth—reminiscent of visionary concept art and digital matte painting. 32K resolution, ultra-detailed, masterpiece.

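However, I found that with the same prompt, when the image size deviates too far from a square, the text becomes less clear, as if the letters were stuck together. For example, changing the aspect ratio to 3:1 with the same prompt produces this (see "Witness" in the second row and "Architecture" in the fourth row):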
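
One possible workaround is to snap a requested size to the nearest resolution listed in the local script. This assumes the degradation comes from leaving those resolutions, which the observation above suggests but does not prove. The preset table is copied from the script, and the helper name `nearest_preset` is hypothetical.

```python
import math

# Resolutions listed in the local-run script.
ASPECT_RATIOS = {
    "1:1": (1328, 1328),
    "16:9": (1664, 928),
    "9:16": (928, 1664),
    "4:3": (1472, 1104),
    "3:4": (1104, 1472),
    "3:2": (1584, 1056),
    "2:3": (1056, 1584),
}


def nearest_preset(target_w: int, target_h: int) -> tuple:
    """Return (name, (width, height)) of the preset closest in aspect ratio.

    Ratios are compared on a log scale so that wide and tall targets
    are treated symmetrically.
    """
    target = math.log(target_w / target_h)

    def distance(name: str) -> float:
        w, h = ASPECT_RATIOS[name]
        return abs(math.log(w / h) - target)

    name = min(ASPECT_RATIOS, key=distance)
    return name, ASPECT_RATIOS[name]


name, (width, height) = nearest_preset(3, 1)
print(name, width, height)
```

For a 3:1 request this falls back to the 16:9 preset, the widest one in the table.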

Of course, most cases still turn out quite well:

A miniature raccoon explorer made of wool wearing all kinds of equipment, walking through dry grass, the whole world is made of felt textile

Some official examples:

Tests by users on Twitter show that other languages, such as Japanese, also work:

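Practical applications and outlook

Qwen-Image's capabilities, especially its advances in complex text rendering, make it more than an image generation tool: it can serve as a creative platform across many industries.

Applications in design and content creation

Advertising and marketing: For ad designers, producing an image with a specific slogan, brand name, and legible product information used to be a real challenge. Qwen-Image handles multi-line text, multiple fonts, and mixed Chinese-English rendering, which can substantially shorten the production cycle for ad posters and product promo images.
Game development: In-game UI elements, road signs, posters, and certain props often need to contain text. With Qwen-Image, developers can quickly generate texture assets with accurate text, without complex post-processing.
Education and publishing: Teachers and publishers can use Qwen-Image to generate teaching illustrations or posters with clear charts, headings, and body text. For example, it could generate an infographic explaining the concept of "deep learning" with all of the text rendered accurately.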

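The potential of image editing

Although Qwen-Image's image editing mode has not been released yet, its underlying architecture already shows strong contextual understanding. Once the feature ships, the potential is considerable:

Precise replacement and modification: Imagine selecting the text in an image and replacing it with any new content, with the font, lighting, and style matching the original. This would change image-editing workflows significantly.
Content personalization: In e-commerce, it could quickly generate personalized product images with different customers' names. On social media, it could make it easy to change an image's slogan or caption to suit different audiences.
Seamless blending: Whether adding new objects to an existing scene or applying stylistic adjustments, the multi-task joint training is intended to keep the edited image consistent with the original and natural-looking.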

AI Arena

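In addition to open-sourcing the model, the team launched AI Arena, an open benchmarking platform built on the Elo rating system, to comprehensively evaluate Qwen-Image's general image generation ability and compare it objectively with state-of-the-art closed-source models. AI Arena provides a fair, transparent, and dynamic evaluation environment that supports continuous comparison of different models. In each round, the system presents two anonymous images generated from the same prompt, and users compare the pair and vote. The votes update the personal and global leaderboards in real time through the Elo algorithm, giving a data-driven assessment of model performance. AI Arena is now open to the public.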
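
To make the rating step concrete, here is a minimal sketch of a standard Elo update after a single pairwise vote. The source does not describe AI Arena's actual implementation. The K-factor of 32, the 400-point scale, and the function names are conventional textbook choices assumed here for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one vote.

    k = 32 is an assumed, conventional K-factor, not AI Arena's setting.
    """
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Two models start level; users prefer model A's image in this round.
new_a, new_b = update_elo(1500.0, 1500.0, a_wins=True)
print(new_a, new_b)
```

With equal starting ratings the expected score is 0.5, so the winner gains half of K and the loser gives up the same amount. An upset win against a higher-rated model moves the ratings further.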

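Free online platforms:

Qwen Chat: Tongyi's intelligent chat platform; select Image Generation when chatting. It is occasionally slow.
Hugging Face: tends to be slow
qwen-image: 20 credits on sign-up
wavespeed: 50 free generations on sign-up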