
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

yeonghyeon3 2025. 5. 28. 02:00


Terminologies

  • Single-image super-resolution (SISR): The task of reconstructing a high-resolution (HR) image from a low-resolution (LR) image. It can be modeled as $p(x_H|x_L)$.
  • Autoregressive model: A model that predicts the next value from its previous predictions.
  • Vision-language model (VLM): A model that produces an output from an image input together with a text input.
  • Ill-posed problem: A problem whose inverse is not well defined. The forward problem has a clear solution, but the inverse problem has many solutions or none at all.
  • Reinforcement learning with human feedback (RLHF): A reinforcement learning method for directly optimizing a language model with human feedback (link).

TL;DR - Vision-language powered infinite image magnification

  • Chain-of-Zoom (CoZ) is proposed to magnify images beyond the trained regime.
  • CoZ repeatedly reuses a backbone super-resolution (SR) model with multi-scale-aware text prompts.

Image from "Chain-of-Zoom" official GitHub repository (https://github.com/bryanswkim/Chain-of-Zoom) same figure is shown in the manuscript with Figure 1.


Method

Key considerations (copy-pasted from the paper)

  • An ill-posed problem: A single LR image can correspond to a multitude of plausible HR images.
  • Trained super-resolution models suffer from a significant limitation: they are inherently upper-bounded by their training configuration. Learned restoration functions are tightly coupled to the specific scale and degradation seen during training.
  • Intermediate scale-state modeling to bridge the gap between a low-resolution (LR) input and a high-resolution (HR) target image.

Intermediate scale-state modeling

The image generation process can be viewed as a sequence $(x_{0}, x_{1}, \cdots, x_{n})$. The first element of the sequence is the LR image $x_{L}$ ($x_{0}:=x_{L}$) and the last element is the HR image $x_{H}$ ($x_{n}:=x_{H}$). The HR image is then produced via the following process. Note that the image dimension changes by the scale ratio $s$ at each step, i.e., $d_{i}=s\,d_{i-1}$ (for example, with $s=2$ each step doubles the spatial size).

 

$$ x_{L} \in \mathbb{R}^{d_{0}} \dashrightarrow x_{i} \in \mathbb{R}^{d_{i}} \dashrightarrow x_{H} \in \mathbb{R}^{d_{n}}$$

 

This autoregressive process can be summarized in a single equation under the Markov assumption.

$$p(x_{0}, x_{1}, \cdots, x_{n}) = p(x_{0}) \prod_{i=1}^{n} p(x_{i}|x_{i-1})$$
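
The whole chain can be sketched as a simple loop that reuses one SR backbone. The snippet below is a minimal sketch, assuming a hypothetical `sr_step` callable that stands in for the backbone SR model (it maps $x_{i-1}$ to $x_{i}$, enlarging each side by the scale ratio $s$); it is not the authors' implementation.

```python
# Minimal sketch of the scale-autoregressive chain (hypothetical helpers).

def chain_of_zoom(x_lr, sr_step, n_steps):
    """Autoregressively zoom: x_0 := x_L, ..., x_n := x_H under the Markov assumption."""
    states = [x_lr]              # x_0, the LR input
    x = x_lr
    for _ in range(n_steps):     # each pass reuses the same SR backbone
        x = sr_step(x)           # one sample from p(x_i | x_{i-1})
        states.append(x)
    return states                # (x_0, x_1, ..., x_n); states[-1] is x_H
```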

 

The text prompt extraction process, shown in Figure 4.

 

 High-frequency details disappear through the Markov chain process. These missing details are supplemented by a text prompt extracted by a VLM. To reduce VLM hallucinations, the authors use two resolutions as input: $x_{i-1}$ (the current state) and $x_{i-2}$ (the prior, coarser state).

$$p_{\phi}(c_{i}|x_{i-1}, x_{i-2})$$

 

 By applying this second-order autoregressive (AR-2) modeling, the equations are updated from the first-order Markov assumption. You can skip the details of these equations and focus on the concept of this work.

$$p(x_i \mid x_{i-1}, x_{i-2}) = \int p(x_i \mid x_{i-1}, x_{i-2}, c_i) \, p(c_i \mid x_{i-1}, x_{i-2}) \, dc_i$$
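
In practice the integral over $c_i$ can be approximated by sampling a single prompt and conditioning the SR step on it. The sketch below is illustrative only; `vlm_extract_prompt` and `sr_step` are hypothetical callables, and in this sketch the coarser state $x_{i-2}$ influences the output only through the extracted prompt.

```python
# Sketch of one AR-2 zoom step with a multi-scale-aware prompt
# (hypothetical helpers; not the authors' implementation).

def zoom_step_ar2(x_prev, x_prev2, vlm_extract_prompt, sr_step):
    # c_i ~ p_phi(c_i | x_{i-1}, x_{i-2}): prompt from the current and coarser states
    prompt = vlm_extract_prompt(x_prev, x_prev2)
    # x_i ~ p(x_i | x_{i-1}, x_{i-2}, c_i): SR conditioned on the image and the prompt
    x_next = sr_step(x_prev, prompt)
    return x_next, prompt
```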

Reinforcement learning with human feedback (RLHF) via Group Relative Policy Optimization (GRPO)

 

An accurate text prompt is highly helpful for reconstructing a magnified image patch into an SR image. However, extracting text from the magnified patch is challenging due to the sparsity of its pixel information (i.e., insufficient information). To address this, the paper adopts GRPO to fine-tune the VLM for accurate text extraction. The GRPO reward is structured with three terms, as shown in Figure 4; a rough sketch of how they combine follows the list below.

  • Critic preference reward: Reward based on a human preference score (the authors describe this as 'human aesthetic and semantic preference')
  • Phrase-exclusion reward: Reward given when no blacklisted phrase appears (e.g., 'first image', 'second image', ...)
  • Repetition penalty: Penalty for repeated $n$-grams (groups of consecutive words); this helps diversify the generated text
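
Below is a rough sketch of how these three terms could be combined into a scalar reward for GRPO fine-tuning. The helper functions, blacklist, and weights are illustrative assumptions, not the paper's exact formulation; the critic preference score is assumed to come from a separate learned critic.

```python
import re

BLACKLIST = ["first image", "second image"]   # example blacklisted phrases

def phrase_exclusion_reward(text: str) -> float:
    """1.0 if no blacklisted phrase appears in the prompt, else 0.0."""
    lowered = text.lower()
    return 0.0 if any(p in lowered for p in BLACKLIST) else 1.0

def repetition_penalty(text: str, n: int = 3) -> float:
    """Fraction of repeated n-grams; higher means more repetitive text."""
    tokens = re.findall(r"\w+", text.lower())
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def total_reward(text: str, critic_score: float,
                 w: tuple = (1.0, 1.0, 1.0)) -> float:
    """Combine critic preference, phrase exclusion, and repetition penalty."""
    return (w[0] * critic_score
            + w[1] * phrase_exclusion_reward(text)
            - w[2] * repetition_penalty(text))
```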

Training objective

Next zoom ($x_i$) prediction (parameter $\theta$): Maximizing the likelihood (or log-likelihood), equivalent to minimizing the mean-squared error (MSE)

 The next state $x_{i}$ is predicted from its previous states $x_{i-1}$ and $x_{i-2}$ together with the text prompt $c_{i}$. The prompt $c_{i}$ is the result of the multi-scale-aware prompt extraction from $x_{i-1}$ and $x_{i-2}$, and must be extracted before predicting $x_{i}$. The predictive distribution can be summarized as follows:

$$p(x_i \mid x_{i-1}, x_{i-2}, c_{i}) := \mathcal{N}\!\left(x_i; f_{\theta}(x_{i-1}, x_{i-2}, c_{i}), \sigma^{2}I\right)$$

 

 MSE is widely adopted for optimizing models over continuous pixel values. This paper also adopts MSE to optimize $\theta$, the parameters used to predict $x_{i}$; taking the log of the Gaussian likelihood above makes the equivalence explicit.

$$\log p(x_i \mid x_{i-1}, x_{i-2}, c_{i}) = -\frac{1}{2\sigma^{2}} \left\| x_{i} - f_{\theta}(x_{i-1}, x_{i-2}, c_{i}) \right\|^{2} + C$$
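
As a minimal PyTorch-style sketch (with `f_theta` as a placeholder for the SR network), maximizing this Gaussian log-likelihood is the same, up to the constant $C$ and the scale $1/(2\sigma^{2})$, as minimizing the MSE:

```python
import torch.nn.functional as F

def sr_loss(f_theta, x_i, x_prev, x_prev2, c_i):
    """MSE between the target x_i and f_theta(x_{i-1}, x_{i-2}, c_i)."""
    pred = f_theta(x_prev, x_prev2, c_i)   # predicted next zoom state
    return F.mse_loss(pred, x_i)           # equals -log p up to scale and constant
```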

 

Next text ($c_i$) prediction (parameter $\phi$): Maximizing the log-likelihood, equivalent to minimizing the negative log-likelihood (cross-entropy)

 Unlike image data, text has a discrete representation, so the natural loss for this discrete distribution is cross-entropy. First, the probability of predicting $c_{i}$ can be written in autoregressive form:

$$p_{\phi}(c_{i} \mid x_{i-1}, x_{i-2}) = \prod_{t=1}^{T_{i}} p_{\phi}(c_{i, t} \mid x_{i-1}, x_{i-2}, c_{i,<t})$$

 

 The loss function of the text extraction model (VLM), derived from this autoregressive form, is shown below.

$$\mathcal{L}_{VLM}^{i} = -\log p_{\phi}(c_{i} \mid x_{i-1}, x_{i-2}) = -\sum_{t=1}^{T_{i}} \log p_{\phi}(c_{i, t} \mid x_{i-1}, x_{i-2}, c_{i,<t})$$
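
In token-level terms this is the standard cross-entropy loss. A minimal sketch, assuming `logits` of shape $(T_i, V)$ produced by the VLM given $(x_{i-1}, x_{i-2}, c_{i,<t})$ and `target_ids` holding the ground-truth prompt tokens:

```python
import torch
import torch.nn.functional as F

def vlm_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Average of -log p_phi(c_{i,t} | x_{i-1}, x_{i-2}, c_{i,<t}) over the T_i tokens."""
    return F.cross_entropy(logits, target_ids)
```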


Experimental results

Reconstruction quality


Results of text extraction


Quantitative measures

 

Please refer to the image quality measures below:

Measure    Paper
NIQE       Anish Mittal et al., IEEE SPL 2013
MUSIQ      Junjie Ke et al., ICCV 2021
MANIQA     Sidi Yang et al., CVPR 2022 Workshop
CLIP-IQA   Jianyi Wang et al., AAAI 2023
