MuRF

Unlocking the Multi-Scale Potential of Vision Foundation Models

University of Wisconsin-Madison

🔥[NEW!] We propose MuRF, a simple yet universally effective strategy to improve the performance of VFMs at inference time.

Abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families: primarily DINOv2, with successful generalization to contrastive models such as SigLIP2.

Approach

Our goal is to extract a universal, scale-robust visual representation from a frozen Vision Foundation Model (VFM). The core of MuRF is motivated by a fundamental property of visual perception: low resolutions capture global context for robust recognition, while high resolutions provide fine-grained detail for precise refinement.

MuRF Architecture Pipeline

Overview of Multi-Resolution Fusion (MuRF). We process an image pyramid through a frozen VFM, upsample the resulting feature maps to a common spatial resolution, and concatenate them channel-wise.

Multi-Resolution Feature Fusion

MuRF builds a feature pyramid directly from the input space. We resize the image to a set of scaling factors and pass each view through a frozen VFM encoder.

Crucially, these patch-level feature maps are upsampled to a target spatial resolution and concatenated along the channel dimension. This creates a single, unified tensor that is spatially rich, semantically deep, and explicitly preserves the orthogonal signals of both macro and micro views without the destructive interference caused by mean-pooling or summation.
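The fusion step above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `dummy_vfm` is a hypothetical stand-in that mean-pools each patch into one feature vector (a real frozen VFM would produce richer patch tokens), and nearest-neighbor resizing stands in for whatever interpolation the real pipeline uses.

```python
import numpy as np

def dummy_vfm(image, patch=14):
    """Stand-in for a frozen VFM encoder: mean-pools each patch x patch
    window into one feature vector. A real model would emit richer C-dim
    patch tokens, but the fusion logic downstream is identical."""
    H, W, C = image.shape
    h, w = H // patch, W // patch
    x = image[:h * patch, :w * patch]
    return x.reshape(h, patch, w, patch, C).mean(axis=(1, 3))  # (h, w, C)

def resize_nearest(x, size):
    """Nearest-neighbor resize of an (H, W, C) array to (size, size, C)."""
    H, W, _ = x.shape
    rows = np.arange(size) * H // size
    cols = np.arange(size) * W // size
    return x[rows][:, cols]

def murf_features(image, scales=(0.5, 1.0, 1.5), patch=14):
    """MuRF: encode each resized view with the frozen encoder, upsample all
    patch-feature maps to the finest grid, and concatenate channel-wise."""
    maps = []
    for s in scales:
        side = max(1, round(image.shape[0] * s / patch)) * patch  # divisible by patch
        maps.append(dummy_vfm(resize_nearest(image, side), patch))
    target = max(m.shape[0] for m in maps)
    return np.concatenate([resize_nearest(m, target) for m in maps], axis=-1)
```

For a 224×224 input with scales {0.5, 1.0, 1.5} and patch size 14, the three views yield 8×8, 16×16, and 24×24 feature grids, so the fused tensor is 24×24 with three times the channel width — concatenation keeps each view's signal in its own channel slice rather than averaging them together.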

Task-Specific Adaptation

The resulting multi-resolution representation is inherently task-agnostic. We adapt it to various domains by attaching lightweight, task-specific heads while keeping the heavy VFM backbone completely frozen:

  • Dense Prediction: For tasks like semantic segmentation and depth estimation, we apply a simple convolutional head (e.g., a 1×1 convolution) to project the concatenated channels directly to the target output space.
  • Multimodal LLMs: For Visual Question Answering (VQA), the MuRF tensor acts as the visual token sequence, passed through a lightweight perception module into the language model's word embedding space to enable multi-scale reasoning.
  • Unsupervised Anomaly Detection: We utilize a training-free nearest-neighbor approach, building memory banks across layer-resolution pairs and averaging the resulting distance scores to leverage the strengths of all views.
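For the dense-prediction case, the head really is this small: a 1×1 convolution over an (H, W, C) map is just a per-pixel linear projection of the channel vector. The shapes and weights below are illustrative (a random 9-channel fused map, 150 classes as in ADE20K), not trained parameters.

```python
import numpy as np

def conv1x1_head(fused, weight, bias):
    """A 1x1 convolution is a per-pixel linear map over channels:
    (H, W, C) @ (C, K) + (K,) -> (H, W, K) class logits."""
    return fused @ weight + bias

rng = np.random.default_rng(0)
fused = rng.standard_normal((24, 24, 9))      # illustrative MuRF tensor
weight = rng.standard_normal((9, 150)) * 0.1  # 150 classes, e.g. ADE20K
bias = np.zeros(150)
logits = conv1x1_head(fused, weight, bias)    # (24, 24, 150)
pred = logits.argmax(axis=-1)                 # per-pixel class map
```

Because the backbone stays frozen, only this tiny projection is trained per task.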

Semantic Segmentation

We evaluate MuRF on dense prediction using the challenging ADE20K and PASCAL VOC benchmarks. Our baseline for comparison is a linear probing setup utilizing a frozen DINOv2-ViT-B/14 encoder with features extracted from a single input resolution.

By effectively combining the global context from low resolutions with the fine-grained details from high resolutions, MuRF produces a feature map inherently better suited for dense prediction. While training the linear head on MuRF representations takes approximately 1.3× longer than the single-resolution counterpart, it yields a significant performance boost over strong single-scale baselines.

ADE20K Segmentation Comparison

ADE20K

PASCAL VOC Segmentation Comparison

PASCAL VOC

Qualitative semantic segmentation results. MuRF effectively merges holistic scene understanding with precise object boundaries compared to individual single-scale inputs.

Quantitative Analysis


Table 1 (Segmentation): Performance in mIoU (%) on ADE20K and PASCAL VOC (higher is better). MuRF significantly outperforms single-scale baselines. Rel. Improv measures relative improvement over the high-resolution DINOv2 baseline.

| Method | Architecture | ADE20K (mIoU ↑) | PASCAL VOC (mIoU ↑) |
|---|---|---|---|
| OpenCLIP | ViT-G/14 | 39.3 | 71.4 |
| MAE | ViT-H/14 | 33.3 | 67.6 |
| DINO | ViT-B/8 | 31.8 | 66.4 |
| iBOT | ViT-L/16 | 44.6 | 82.3 |
| *DINOv2 baselines* | | | |
| Low Resolution | ViT-B/14 | 40.6 | 71.2 |
| Medium Resolution | ViT-B/14 | 45.5 | 78.9 |
| High Resolution | ViT-B/14 | 46.1 | 82.5 |
| MuRF (Ours) | ViT-B/14 | 47.4 | 83.5 |
| Relative Improvement | | +2.8% | +1.2% |

Depth Estimation

We evaluate MuRF on metric depth estimation to test both in-domain learning capability on NYU Depth V2 and zero-shot transfer capability on SUN RGB-D. Our evaluation employs two standard linear probing configurations: Lin. 1 (utilizing features from the final transformer layer concatenated with the [CLS] token) and Lin. 4 (same approach with additional concatenation of tokens from layers l = {3, 6, 9, 12}).
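Assembling the probe input for these two configurations amounts to token concatenation. A minimal sketch, assuming per-layer patch tokens of shape (N, C) and a (C,) [CLS] token per layer (the dictionaries here are hypothetical containers, not a real model API):

```python
import numpy as np

def probe_input(patch_tokens, cls_tokens, layers):
    """Build the linear-probe input: for each chosen layer, concatenate its
    patch tokens with that layer's [CLS] token, then stack layers channel-wise.
    Lin. 1 corresponds to layers=(12,); Lin. 4 to layers=(3, 6, 9, 12)."""
    parts = []
    for l in layers:
        tokens = patch_tokens[l]                              # (N, C)
        cls = np.broadcast_to(cls_tokens[l], tokens.shape)    # (N, C)
        parts.append(np.concatenate([tokens, cls], axis=-1))  # (N, 2C)
    return np.concatenate(parts, axis=-1)                     # (N, 2C * len(layers))
```

Lin. 4 therefore quadruples the probe's input width relative to Lin. 1 while leaving the frozen backbone untouched.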

NYUd Depth Estimation Comparison

NYU Depth V2 (In-domain)

SUN RGB-D Depth Estimation Comparison

SUN RGB-D (Zero-shot)

Qualitative depth estimation results. We compare single-scale DINOv2 predictions at 0.5×, 1.0×, and 1.5× input resolutions with our MuRF fusion. By aggregating multi-resolution features, MuRF better preserves global scene structure while sharpening local geometry, producing smoother and more accurate depth maps.

Quantitative Analysis

As shown below, the model using our MuRF representation achieves substantially lower error rates across all configurations. This confirms that the fusion of multi-scale features allows the prediction head to better reason about both the overall scene geometry (captured by low resolutions) and precise object boundaries (captured by high resolutions) simultaneously.

| Method | Arch. | NYU Depth V2, Lin. 1 (RMSE ↓) | NYU Depth V2, Lin. 4 (RMSE ↓) | SUN RGB-D, Lin. 1 (RMSE ↓) | SUN RGB-D, Lin. 4 (RMSE ↓) |
|---|---|---|---|---|---|
| Low Resolution (0.5×) | ViT-B/14 | 0.423 | 0.408 | 0.463 | 0.439 |
| Medium Resolution (1.0×) | ViT-B/14 | 0.389 | 0.373 | 0.432 | 0.416 |
| High Resolution (1.5×) | ViT-B/14 | 0.394 | 0.380 | 0.445 | 0.426 |
| MuRF (Ours) | ViT-B/14 | 0.361 | 0.358 | 0.419 | 0.407 |
| Rel. Improv. over 1.0× baseline | | +7.0% | +4.0% | +2.9% | +2.2% |

Analysis of Resolution and Feature Concatenation

We compare multi-scale feature fusion (MuRF) against multi-layer aggregation (Lin. 3). While MuRF excels at capturing fine-grained structural details (in-domain NYUd), intermediate Transformer layers retain robust semantic abstractions (zero-shot SUN RGB-D).

Crucially, these approaches are complementary. Combining both methodologies yields the highest overall performance, demonstrating that spatial scaling and layer aggregation offer orthogonal benefits.


Table 5: Linear probing comparison for depth estimation reporting RMSE. We evaluate on NYU Depth V2 and SUN RGB-D. Lin. 1 utilizes only the final layer. Lin. 3 utilizes layers {4, 8, 12}.

| Method | Resolutions | Layers | NYU Depth V2 (RMSE ↓) | SUN RGB-D (RMSE ↓) |
|---|---|---|---|---|
| Lin. 1 | 1.0× | 12 | 0.389 | 0.432 |
| MuRF (Ours) | {0.5, 1.0, 1.5}× | 12 | 0.361 | 0.419 |
| Lin. 3 | 0.5× | {4, 8, 12} | 0.412 | 0.443 |
| Lin. 3 | 1.0× | {4, 8, 12} | 0.376 | 0.418 |
| Lin. 3 | 1.5× | {4, 8, 12} | 0.380 | 0.428 |
| Lin. 3 + MuRF | {0.5, 1.0, 1.5}× | {4, 8, 12} | 0.357 | 0.409 |

Visual Question Answering (MLLMs)

We integrate MuRF into a Multimodal Large Language Model (MLLM) framework for Visual Question Answering (VQA). To assess how MuRF supports MLLMs, we apply it to multiple variants of LLaVA-1.5, swapping the original CLIP vision encoder for DINOv2 and SigLIP2.
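The interface between MuRF and the language model is a simple one, sketched below: the fused (H, W, C) tensor is flattened into H·W visual tokens, and a lightweight projection maps each token into the LLM's word-embedding space. The function name and shapes are illustrative, not the actual LLaVA module.

```python
import numpy as np

def perception_module(murf_feats, proj):
    """Flatten the (H, W, C) MuRF tensor into H*W visual tokens and project
    each one into the language model's word-embedding dimension."""
    H, W, C = murf_feats.shape
    return murf_feats.reshape(H * W, C) @ proj  # (H*W, d_model)
```

Because the scales are fused channel-wise before flattening, the LLM sees one token sequence of the usual length, with each token carrying multi-scale evidence for the same spatial location.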


Table 3: Performance across major VQA benchmarks. Equipping the MLLM with MuRF strongly improves multimodal understanding capacity regardless of whether DINOv2 or SigLIP2 is used.

| Vision Encoder | Res. | MME Percept. | MME Cogn. | Bias | V* | RW | MR | GQA | MMB | POPE |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP (official LLaVA) | 336 | 1511.4 | 347.1 | 16.2 | 50.3 | 56.1 | 26.5 | 62.0 | 195.4 | 86.9 |
| DINOv2 | 336 | 1291.6 | 278.6 | 17.3 | 38.7 | 52.3 | 26.2 | 62.1 | 172.4 | 87.1 |
| + MuRF (Ours) | 224+336 | 1357.1 (+65.5) | 366.4 (+87.8) | 17.7 (+0.4) | 40.3 (+1.6) | 53.6 (+1.3) | 26.1 (-0.1) | 62.4 (+0.3) | 173.1 (+0.7) | 87.1 (0.0) |
| CLIP+DINOv2 | 336 | 1403.4 | 243.9 | 15.8 | 48.7 | 53.6 | 31.5 | 62.2 | 194.2 | 86.4 |
| + MuRF (Ours) | 224+336 | 1471.2 (+67.8) | 281.4 (+37.5) | 16.3 (+0.5) | 48.2 (-0.5) | 56.7 (+3.1) | 31.8 (+0.3) | 62.9 (+0.7) | 198.8 (+4.6) | 87.4 (+1.0) |
| SigLIP2 | 384 | 1529.3 | 355.4 | 19.4 | 44.0 | 58.2 | 33.1 | 64.1 | 211.7 | 87.1 |
| + MuRF (Ours) | 256+384 | 1545.7 (+16.4) | 371.4 (+16.0) | 19.7 (+0.3) | 42.9 (-1.1) | 58.4 (+0.2) | 33.3 (+0.2) | 64.5 (+0.4) | 216.9 (+5.2) | 86.7 (-0.4) |

Unsupervised Anomaly Detection

We validate MuRF in a training-free setting on the MVTec AD 2 benchmark, a challenging industrial inspection dataset.
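The training-free scoring described above (memory banks per layer-resolution pair, averaged nearest-neighbor distances) reduces to a few array operations. This is a minimal sketch with toy feature vectors; in practice the banks hold MuRF patch features extracted from defect-free training images.

```python
import numpy as np

def nn_scores(test_feats, bank):
    """Per-patch anomaly score: Euclidean distance from each test patch
    feature (N, C) to its nearest neighbor in a memory bank (M, C) of
    nominal (defect-free) patch features."""
    d2 = ((test_feats[:, None, :] - bank[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    return np.sqrt(d2.min(axis=1))

def murf_anomaly_scores(feats_per_view, banks_per_view):
    """Average the NN-distance scores across layer-resolution pairs, so each
    view contributes to the final map without any parameter tuning."""
    return np.mean(
        [nn_scores(f, banks_per_view[k]) for k, f in feats_per_view.items()],
        axis=0,
    )
```

A patch that matches the bank exactly scores zero; defective patches land far from every nominal feature and receive large scores, and averaging over views suppresses spurious distances that appear at only one scale.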

Anomaly detection comparison on MVTec AD 2

Qualitative anomaly detection comparison on MVTec AD 2.

Quantitative Analysis


Table (Anomaly Detection): Anomaly detection performance (AU-PRO0.05 in %) on the MVTec AD 2 dataset. MuRF demonstrates state-of-the-art results on the TESTpriv,mix subset, showcasing its robustness in a challenging training-free scenario. “Training” indicates whether the method involves parameter tuning within a neural network. Bold indicates the best performance, and underline indicates the second best.

| Method | Training? | TESTpriv | TESTpriv,mix |
|---|---|---|---|
| PatchCore | × | 62.3 | 52.6 |
| SuperAD | × | 61.2 | <u>59.3</u> |
| RoBiS | ✓ | **67.3** | 59.7 |
| MuRF (Ours) | × | <u>66.0</u> | **62.3** (↑+2.6) |

BibTeX

```bibtex
@misc{zou2026murfunlockingmultiscalepotential,
      title={MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models},
      author={Bocheng Zou and Mu Cai and Mark Stanley and Dingfu Lu and Yong Jae Lee},
      year={2026},
      eprint={2603.25744},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.25744},
}
```

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and the open-source projects Alpaca and Vicuna.