Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: different resolutions offer complementary inductive biases, with low-resolution views excelling at global semantic recognition and high-resolution views being essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy for harnessing this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representations. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families, primarily DINOv2, while also demonstrating successful generalization to contrastive models such as SigLIP2.
Our goal is to extract a universal, scale-robust visual representation from a frozen Vision Foundation Model (VFM). The core of MuRF is motivated by a fundamental property of visual perception: low resolutions capture global context for robust recognition, while high resolutions provide fine-grained detail for precise refinement.
Overview of Multi-Resolution Fusion (MuRF). We process an image pyramid through a frozen VFM, upsample the resulting feature maps to a common spatial resolution, and concatenate them channel-wise.
MuRF builds a feature pyramid directly from the input space. We resize the image to a set of scaling factors and pass each view through a frozen VFM encoder.
Crucially, these patch-level feature maps are upsampled to a target spatial resolution and concatenated along the channel dimension. This creates a single, unified tensor that is spatially rich, semantically deep, and explicitly preserves the orthogonal signals of both macro and micro views without the destructive interference caused by mean-pooling or summation.
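The resize-encode-upsample-concatenate pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration, assuming a frozen backbone (e.g. DINOv2 ViT-B/14) that returns patch tokens of shape (B, N, C); the scale set and interpolation mode follow common practice and are not necessarily the authors' exact settings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def murf_features(image, backbone, scales=(0.5, 1.0, 1.5), patch=14):
    """Fuse frozen-VFM patch features extracted at several input resolutions."""
    # Target grid: the patch layout of the original-resolution image.
    target_hw = (image.shape[-2] // patch, image.shape[-1] // patch)
    feats = []
    for s in scales:
        # Resize each side to a multiple of the patch size.
        h = max(patch, int(image.shape[-2] * s) // patch * patch)
        w = max(patch, int(image.shape[-1] * s) // patch * patch)
        view = F.interpolate(image, size=(h, w), mode="bilinear", align_corners=False)
        tokens = backbone(view)                                  # (B, N, C) patch tokens
        fmap = tokens.transpose(1, 2).reshape(
            tokens.shape[0], -1, h // patch, w // patch)         # (B, C, h', w')
        feats.append(F.interpolate(fmap, size=target_hw,
                                   mode="bilinear", align_corners=False))
    # Channel-wise concatenation keeps every scale's signal intact
    # (no mean-pooling or summation).
    return torch.cat(feats, dim=1)                               # (B, C*len(scales), H', W')
```

Channel concatenation (rather than averaging) is what preserves the complementary low- and high-resolution signals for the downstream head.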
The resulting multi-resolution representation is inherently task-agnostic. We adapt it to various domains by attaching lightweight, task-specific heads while keeping the heavy VFM backbone completely frozen.
We evaluate MuRF on dense prediction using the challenging ADE20K and PASCAL VOC benchmarks. Our baseline for comparison is a linear probing setup utilizing a frozen DINOv2-ViT-B/14 encoder with features extracted from a single input resolution.
By effectively combining the global context from low resolutions with the fine-grained details from high resolutions, MuRF produces a feature map inherently better suited for dense prediction. While training the linear head on MuRF representations takes approximately 1.3× longer than the single-resolution counterpart, it yields a significant performance boost over strong single-scale baselines.
ADE20K
PASCAL VOC
Qualitative semantic segmentation results. MuRF effectively merges holistic scene understanding with precise object boundaries compared to individual single-scale inputs.
Table 1 (Segmentation): Performance in mIoU (%) on ADE20K and PASCAL VOC (higher is better). MuRF significantly outperforms single-scale baselines. Rel. Improv measures relative improvement over the high-resolution DINOv2 baseline.
| Method | Architecture | ADE20K (mIoU ↑) | PASCAL VOC (mIoU ↑) |
|---|---|---|---|
| OpenCLIP | ViT-G/14 | 39.3 | 71.4 |
| MAE | ViT-H/14 | 33.3 | 67.6 |
| DINO | ViT-B/8 | 31.8 | 66.4 |
| iBOT | ViT-L/16 | 44.6 | 82.3 |
| *DINOv2 Baselines* | | | |
| Low Resolution | ViT-B/14 | 40.6 | 71.2 |
| Medium Resolution | ViT-B/14 | 45.5 | 78.9 |
| High Resolution | ViT-B/14 | 46.1 | 82.5 |
| MuRF (Ours) | ViT-B/14 | 47.4 | 83.5 |
| Rel. Improv. over High Resolution | | +2.8% | +1.2% |
We evaluate MuRF on metric depth estimation to test both in-domain learning capability on NYU Depth V2 and zero-shot transfer capability on SUN RGB-D. Our evaluation employs two standard linear probing configurations: Lin. 1 (utilizing features from the final transformer layer concatenated with the [CLS] token) and Lin. 4 (same approach with additional concatenation of tokens from layers l = {3, 6, 9, 12}).
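The two probing inputs can be sketched as follows, assuming per-layer outputs are exposed as (CLS, patch-token) pairs keyed by 1-based layer index (an interface assumption for illustration; e.g. DINOv2 exposes intermediate layers, but the exact wiring here is hypothetical):

```python
import torch

def probe_features(layers, mode="lin1"):
    """layers: dict {layer_idx: (cls (B, C), patches (B, N, C))}.
    Lin. 1 uses only the final layer; Lin. 4 also uses layers {3, 6, 9, 12}."""
    idxs = [12] if mode == "lin1" else [3, 6, 9, 12]
    parts = []
    for i in idxs:
        cls, patches = layers[i]
        # Broadcast the [CLS] token over all patch positions and concatenate.
        cls_map = cls[:, None, :].expand_as(patches)
        parts.append(torch.cat([patches, cls_map], dim=-1))  # (B, N, 2C)
    return torch.cat(parts, dim=-1)                          # (B, N, 2C * len(idxs))
```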
NYU Depth V2 (In-domain)
SUN RGB-D (Zero-shot)
Qualitative depth estimation results. We compare single-scale DINOv2 predictions at 0.5×, 1.0×, and 1.5× input resolutions with our MuRF fusion. By aggregating multi-resolution features, MuRF better preserves global scene structure while sharpening local geometry, producing smoother and more accurate depth maps.
As shown below, the model using our MuRF representation achieves substantially lower error rates across all configurations. This confirms that the fusion of multi-scale features allows the prediction head to better reason about both the overall scene geometry (captured by low resolutions) and precise object boundaries (captured by high resolutions) simultaneously.
| Method | Arch. | NYU Depth V2, Lin. 1 (RMSE ↓) | NYU Depth V2, Lin. 4 (RMSE ↓) | SUN RGB-D, Lin. 1 (RMSE ↓) | SUN RGB-D, Lin. 4 (RMSE ↓) |
|---|---|---|---|---|---|
| Low Resolution (0.5×) | ViT-B/14 | 0.423 | 0.408 | 0.463 | 0.439 |
| Medium Resolution (1.0×) | ViT-B/14 | 0.389 | 0.373 | 0.432 | 0.416 |
| High Resolution (1.5×) | ViT-B/14 | 0.394 | 0.380 | 0.445 | 0.426 |
| MuRF (Ours) | ViT-B/14 | 0.361 | 0.358 | 0.419 | 0.407 |
| Rel. Improv. over 1.0× baseline | | +7.0% | +4.0% | +2.9% | +2.2% |
We compare multi-scale feature fusion (MuRF) against multi-layer aggregation (Lin. 3). While MuRF excels at capturing fine-grained structural details (in-domain NYU Depth V2), intermediate Transformer layers retain robust semantic abstractions (zero-shot SUN RGB-D).
Crucially, these approaches are complementary. Combining both methodologies yields the highest overall performance, demonstrating that spatial scaling and layer aggregation offer orthogonal benefits.
Table 5: Linear probing comparison for depth estimation reporting RMSE. We evaluate on NYU Depth V2 and SUN RGB-D. Lin. 1 utilizes only the final layer. Lin. 3 utilizes layers {4, 8, 12}.
| Method | Resolutions | Layers | NYU Depth V2 (RMSE ↓) | SUN RGB-D (RMSE ↓) |
|---|---|---|---|---|
| Lin. 1 | 1.0× | 12 | 0.389 | 0.432 |
| MuRF (Ours) | {0.5, 1.0, 1.5}× | 12 | 0.361 | 0.419 |
| Lin. 3 | 0.5× | {4, 8, 12} | 0.412 | 0.443 |
| Lin. 3 | 1.0× | {4, 8, 12} | 0.376 | 0.418 |
| Lin. 3 | 1.5× | {4, 8, 12} | 0.380 | 0.428 |
| Lin. 3 + MuRF | {0.5, 1.0, 1.5}× | {4, 8, 12} | 0.357 | 0.409 |
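The Lin. 3 + MuRF combination amounts to concatenating features along both axes at once. A minimal sketch, where `extract` is a hypothetical helper (not part of the paper's code) that returns a feature map for a given scale and layer, already upsampled to a shared spatial grid:

```python
import torch

def fuse_scales_and_layers(extract, scales=(0.5, 1.0, 1.5), layers=(4, 8, 12)):
    """Concatenate (B, C, H', W') maps over every (scale, layer) pair."""
    maps = [extract(s, l) for s in scales for l in layers]
    # Channels grow as C * len(scales) * len(layers).
    return torch.cat(maps, dim=1)
```

Because the two concatenations are independent, the channel budget is the only coupling between spatial scaling and layer aggregation, consistent with their benefits being orthogonal.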
We integrate MuRF into a Multimodal Large Language Model (MLLM) framework for Visual Question Answering (VQA). To assess how MuRF benefits MLLMs, we apply it to multiple variants of LLaVA 1.5, swapping the original CLIP vision encoder for DINOv2 and SigLIP2.
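One plausible wiring (an assumption about the integration, not the authors' exact code) fuses patch tokens from the two resolutions channel-wise, then maps them into the language model's embedding space with a LLaVA-style two-layer MLP projector:

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """LLaVA-style two-layer MLP mapping fused visual tokens to LLM embeddings."""
    def __init__(self, vis_dim=768 * 2, llm_dim=4096):  # two fused resolutions, ViT-B width
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, fused_tokens):    # (B, N, vis_dim) fused patch tokens
        return self.mlp(fused_tokens)   # (B, N, llm_dim), fed to the LLM as visual tokens
```

Only the channel width of the projector input changes relative to a single-resolution encoder, so the LLM side of the pipeline is untouched.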
Table 3: Performance across major VQA benchmarks. Equipping the MLLM with MuRF strongly improves multimodal understanding capacity regardless of whether DINOv2 or SigLIP2 is used.
| Vision Encoder | Res. | MME Percept. | MME Cogn. | Bias | V* | RW | MR | GQA | MMB | POPE |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (official LLaVA) | 336 | 1511.4 | 347.1 | 16.2 | 50.3 | 56.1 | 26.5 | 62.0 | 195.4 | 86.9 |
| DINOv2 | 336 | 1291.6 | 278.6 | 17.3 | 38.7 | 52.3 | 26.2 | 62.1 | 172.4 | 87.1 |
| DINOv2 + MuRF (Ours) | 224+336 | 1357.1 (+65.5) | 366.4 (+87.8) | 17.7 (+0.4) | 40.3 (+1.6) | 53.6 (+1.3) | 26.1 (-0.1) | 62.4 (+0.3) | 173.1 (+0.7) | 87.1 (0.0) |
| CLIP+DINOv2 | 336 | 1403.4 | 243.9 | 15.8 | 48.7 | 53.6 | 31.5 | 62.2 | 194.2 | 86.4 |
| CLIP+DINOv2 + MuRF (Ours) | 224+336 | 1471.2 (+67.8) | 281.4 (+37.5) | 16.3 (+0.5) | 48.2 (-0.5) | 56.7 (+3.1) | 31.8 (+0.3) | 62.9 (+0.7) | 198.8 (+4.6) | 87.4 (+1.0) |
| SigLIP2 | 384 | 1529.3 | 355.4 | 19.4 | 44.0 | 58.2 | 33.1 | 64.1 | 211.7 | 87.1 |
| SigLIP2 + MuRF (Ours) | 256+384 | 1545.7 (+16.4) | 371.4 (+16.0) | 19.7 (+0.3) | 42.9 (-1.1) | 58.4 (+0.2) | 33.3 (+0.2) | 64.5 (+0.4) | 216.9 (+5.2) | 86.7 (-0.4) |
We validate MuRF in a training-free setting on the MVTec AD 2 benchmark, a challenging industrial inspection dataset.
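For intuition, training-free anomaly scoring on fused patch features can be done PatchCore-style: each test patch is scored by its distance to the nearest normal patch in a memory bank built from anomaly-free images. This is an illustrative recipe under that assumption, not necessarily the paper's exact scoring rule.

```python
import torch

def anomaly_map(test_feats, memory_bank):
    """test_feats: (N, C) fused MuRF patch features of one test image;
    memory_bank: (M, C) patch features collected from normal images."""
    d = torch.cdist(test_feats, memory_bank)   # (N, M) pairwise L2 distances
    # A patch far from every normal patch is likely anomalous.
    return d.min(dim=1).values                 # per-patch anomaly score
```

No parameters are learned, so the quality of the score rests entirely on the fused representation.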
Qualitative anomaly detection comparison on MVTec AD 2.
Table (Anomaly Detection): Anomaly detection performance (AU-PRO0.05 in %) on the MVTec AD 2 dataset. MuRF demonstrates state-of-the-art results on the TESTpriv,mix subset, showcasing its robustness in a challenging training-free scenario. “Training” indicates whether the method involves parameter tuning within a neural network. Bold indicates the best performance, and underline indicates the second best.
| Method | Training? | TESTpriv | TESTpriv,mix |
|---|---|---|---|
| PatchCore | × | 62.3 | 52.6 |
| SuperAD | × | 61.2 | 59.3 |
| RoBiS | ✓ | 67.3 | 59.7 |
| MuRF (Ours) | × | 66.0 | 62.3 (↑+2.6) |
@misc{zou2026murfunlockingmultiscalepotential,
title={MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models},
author={Bocheng Zou and Mu Cai and Mark Stanley and Dingfu Lu and Yong Jae Lee},
year={2026},
eprint={2603.25744},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.25744},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for granting access to their models, and the open-source community behind projects such as Alpaca and Vicuna.