AlphaGRPO

Teaser overview. AlphaGRPO uses GRPO with DVReward to activate reasoning and self-reflective refinement in native unified multimodal models. The T2I examples show improved composition and inference-time correction of fine-grained attributes, while the editing examples show transfer to GEdit without editing-task training; the radar chart summarizes gains across generation and editing benchmarks.

TL;DR

AlphaGRPO applies GRPO to AR-Diffusion Native Unified Multimodal Models without an additional cold-start stage. With Decompositional Verifiable Reward (DVReward), it supports multimodal generation tasks such as Reasoning T2I and Self-Reflective Refinement through stable, interpretable supervision from open-source MLLMs.

GRPO for Unified Models

First to introduce GRPO training to AR-Diffusion Native Unified Models. By optimizing the model's latent generative and reflective behaviors, AlphaGRPO supports multimodal generation tasks such as Reasoning T2I and Self-Reflective Refinement.

DVReward

A fine-grained reward mechanism that decomposes user prompts into atomic verifiable questions across semantic alignment and visual fidelity, providing stable supervision via MLLM confidence scoring.

Strong Generalization

Consistent improvements on T2I benchmarks (GenEval, TIIF-Bench) and editing tasks (GEdit) — without training on editing tasks. Inference-time SRR further boosts performance up to +5.8%.

Abstract

We propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Native Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs.

To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks.

Motivation: Two Key Findings

Explicit Error-Seeking Activates Latent Reasoning

When asked to verify if a generated image is correct, unified models exhibit confirmation bias — frequently asserting that outputs fulfill the original intent. However, switching to reflect mode (explicitly told to find mistakes) successfully breaks this loop, activating the model's understanding capability to identify errors.

Comparison of verification vs reflection behaviors in unified multimodal models

Asking Questions Yields Discriminative Rewards

Holistic scalar scoring (e.g., VIEScore) assigns identical scores to images with clear quality differences — acting as a "black box" that smooths over semantic discrepancies. In contrast, question-based probing via token logits produces highly discriminative signals (0.592 vs. 0.914).

Comparison of score-based vs question-based reward signals

Method

Overview of AlphaGRPO framework and DVReward mechanism

AlphaGRPO and DVReward. AlphaGRPO optimizes Reasoning T2I and Self-Reflective Refinement under unified multimodal trajectories, while DVReward decomposes user requests into atomic verifiable questions for reliable reward computation.

Unified Trajectory Formulation

We conceptualize multimodal generation as a hybrid trajectory τ = (y, z₁ → z₀) that concatenates autoregressive reasoning tokens with a continuous visual diffusion path. Both Reasoning T2I and Self-Reflective Refinement share this unified formulation, enabling end-to-end GRPO optimization across text and image modalities.

DVReward: Decompositional Verifiable Reward

DVReward transforms reward computation from black-box scoring into transparent verification: (1) An LLM decomposes user requests into atomic semantic and quality questions with physical visual grounding; (2) A pre-trained MLLM verifies each question using confidence scores derived from Yes/No token probabilities; (3) The final reward is the geometric mean of semantic and quality scores.

False-Positive Rectification: In self-reflective refinement, we enforce that trajectories failing to improve over the initial image receive the group minimum reward, preventing false-positive optimization signals.

Main Results

Paper table of AlphaGRPO results on text-to-image benchmarks — **Text-to-image benchmarks.** RT2I denotes the AlphaGRPO variant trained on the reasoning text-to-image task. Inf. SRR denotes inference-time self-reflective refinement, which further improves generation quality without changing the visualization task.

Paper table of AlphaGRPO results on GEdit-Bench-EN — **GEdit-Bench-EN.** These results evaluate editing transfer performance. AlphaGRPO improves GEdit scores even though the model is not trained on editing tasks.

Key Experimental Findings

Low-resolution RL generalizes to high resolution.

Although AlphaGRPO is optimized at 512px, it improves 1024px generation results, indicating learned semantic alignment rather than pixel memorization.

SRR training transfers beyond refinement.

Self-reflective refinement training also improves downstream T2I, matching or surpassing direct RT2I optimization on several metrics.

Inference-time SRR boosts benchmark scores.

Test-time self-reflection further raises performance, including +4.5 on TIIF-Bench Short and +3.4 on GenEval over one-pass AlphaGRPO.

Editing transfer emerges without editing training.

Without editing-task training, RT2I improves GEdit by +0.33 and SRR reaches +0.52 over BAGEL, showing transfer to instruction-based editing.

Reward Ablation

Comparison of DVReward against existing reward models on SD3.5M and BAGEL, evaluated on TIIF-Bench and GenEval — **Reward model comparison across diffusion models.** We compare our **DVReward** against existing reward signals (PickScore, VIEScore, HPSv3, UnifiedReward) on two distinct diffusion-based generators: **SD3.5M** (pure DiT) and **BAGEL** (Unified Multimodal Model). DVReward consistently matches or surpasses prior rewards on TIIF-Bench (Short / Long) and GenEval, showing the reward design generalizes across different diffusion models. Unlike holistic scalar rewards that collapse a prompt into a single number, DVReward decomposes each prompt into verifiable sub-questions and aggregates per-question scores, providing a finer-grained training signal.

Qualitative Results

The first two panels are T2I visualizations. RT2I and SRR indicate the RL training task used to obtain each model variant, not a separate visualization setting. The final panel shows image editing examples.

T2I Samples After RT2I Training

T2I qualitative results for the model trained with reasoning text-to-image RL

T2I Samples After SRR Training

T2I qualitative results for the model trained with self-reflective refinement RL

Image Editing Results

BibTeX

@inproceedings{huang2026alphagrpo,
  title={AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in Unified Multimodal Models via Decompositional Verifiable Reward},
  author={Huang, Runhui and Wu, Jie and Yang, Rui and Liu, Zhe and Zhao, Hengshuang},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}