NeurIPS 2025 | Cracking Closed-Source Multimodal Models: A Novel Adversarial Attack via Optimal Feature Alignment

Introduction

In recent years, Multimodal Large Language Models (MLLMs) have made remarkable breakthroughs, showing strong capabilities in visual understanding, cross-modal reasoning, and image captioning.

However, with wider deployment in real-world scenarios, security risks have become a growing concern.

Research indicates that MLLMs inherit adversarial vulnerabilities from their visual encoders, making them susceptible to adversarial examples — small, imperceptible changes that can cause incorrect outputs or leak sensitive information.

In large-scale deployments, this can lead to serious consequences.

---

The Challenge: Transferability of Attacks

Transferability refers to whether adversarial examples crafted on accessible surrogate models remain effective against other models, especially closed-source ones.

When targeting powerful commercial systems such as GPT‑4o or Claude‑3.7, existing methods often fail because:

  • They focus only on global features (e.g., CLIP `[CLS]` token).
  • They ignore rich local feature information from image patch tokens.

This insufficient feature alignment results in poor cross-model transferability.

---

Our Solution: FOA‑Attack

FOA‑Attack (Feature Optimal Alignment Attack) is a targeted transfer adversarial attack framework designed to align features optimally at both global and local levels.

Key ideas:

  • Global-level alignment using cosine similarity loss.
  • Local-level alignment via clustering + Optimal Transport (OT) for precise matching.
  • Dynamic Weight Integration to balance contributions from multiple surrogate models.
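
Putting these pieces together, a plausible form of the overall objective (our notation; the trade-off weight λ, the perturbation budget ε, and the number of surrogates M are illustrative, not the paper's exact formulation) is:

```latex
\min_{\|\delta\|_\infty \le \epsilon}\;
\sum_{i=1}^{M} w_i(T)\,
\Big[\, \mathcal{L}^{(i)}_{\mathrm{global}}(x+\delta,\, x_{\mathrm{tgt}})
      + \lambda\, \mathcal{L}^{(i)}_{\mathrm{local}}(x+\delta,\, x_{\mathrm{tgt}}) \,\Big]
```

where δ is the perturbation added to the clean image x, x_tgt is the target image, and w_i(T) are the dynamic ensemble weights described in the method section below.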

---

Performance:

Experiments show that FOA‑Attack outperforms state-of-the-art methods on both open-source and closed-source MLLMs, including a 75.1% attack success rate against the commercial GPT‑4o.


---

Research Background

MLLMs such as GPT‑4o, Claude‑3.7, and Gemini‑2.0 integrate language and vision to achieve high performance in tasks like visual question answering.

Categories of adversarial attacks:

  • Untargeted: Cause the model to produce any incorrect output.
  • Targeted: Force the model to produce a specific, attacker-chosen output.

In black‑box scenarios (especially with closed-source, commercial models), achieving efficient targeted transfer attacks is difficult because:

  • Adversarial examples must deceive unknown models without access to their architecture or parameters.
  • Existing methods show low success rates on cutting-edge closed-source MLLMs.

---

Motivation & Analysis

In Transformer-based visual encoders (e.g., CLIP):

  • `[CLS]` token captures macro-level semantics but loses important details.
  • Patch tokens encode local details (fine-grained shape, texture, environment context).
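
For concreteness, here is a minimal sketch of how both feature types can be read off a CLIP-style vision encoder with Hugging Face `transformers` (the checkpoint name is only an example; the paper's surrogate encoders may differ):

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

NAME = "openai/clip-vit-large-patch14"   # example checkpoint, not necessarily the paper's
encoder = CLIPVisionModel.from_pretrained(NAME).eval()
processor = CLIPImageProcessor.from_pretrained(NAME)

def extract_features(image):
    """Return the global [CLS] feature and the local patch-token features."""
    pixels = processor(images=image, return_tensors="pt").pixel_values
    hidden = encoder(pixel_values=pixels).last_hidden_state   # (1, 1 + N, D)
    cls_feat = hidden[:, 0]      # global [CLS] token, shape (1, D)
    patch_feats = hidden[:, 1:]  # local patch tokens, shape (1, N, D)
    return cls_feat, patch_feats
```

In the attack itself, the encoder is run on the perturbed pixel tensor directly so that gradients can flow back to the perturbation.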

Problem:

Over-reliance on global `[CLS]` features reduces the semantic authenticity of adversarial examples and limits their transferability.

Since closed-source models often use different encoder designs, these global-only adversarial examples perform poorly.

Solution:

FOA‑Attack leverages dual-dimensional alignment:

  • Global cosine similarity for macro semantics.
  • Local OT matching via clustering for fine details.

---

Method Overview

1. Global Coarse-Grained Alignment

Align `[CLS]` token features between adversarial sample and target image:

  • Extract global `[CLS]` features X (adversarial example) and Y (target image) from each surrogate visual encoder.
  • Use a cosine similarity loss to maximize the macro-level semantic match.
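
As a rough sketch (the function and variable names below are ours, not the paper's code), this coarse-grained objective can be written as a cosine-distance loss between the two `[CLS]` features:

```python
import torch
import torch.nn.functional as F

def global_alignment_loss(cls_adv: torch.Tensor, cls_tgt: torch.Tensor) -> torch.Tensor:
    """Coarse-grained alignment: cosine distance between [CLS] features.

    cls_adv: [CLS] feature of the adversarial image, shape (B, D)
    cls_tgt: [CLS] feature of the target image,      shape (B, D)
    Minimizing this loss maximizes the macro-level semantic match.
    """
    return (1.0 - F.cosine_similarity(cls_adv, cls_tgt, dim=-1)).mean()
```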

---

2. Local Fine-Grained Alignment

Align patch token features:

  • Use K-means clustering to extract semantic regions (e.g., “elephant’s head”, “forest ground”).
  • Model the matching between the two sets of clustered local features as an Optimal Transport (OT) problem, solved with the Sinkhorn algorithm.
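
A minimal sketch of this step, assuming uniform cluster weights and an entropy-regularized Sinkhorn solver (the cluster count `k`, regularization `eps`, and iteration counts are illustrative choices, not values from the paper):

```python
import torch
import torch.nn.functional as F

def kmeans_centroids(patches: torch.Tensor, k: int = 8, iters: int = 10) -> torch.Tensor:
    """Cluster patch tokens (N, D) into k centroids: assignments are found
    without gradients, then centroids are recomputed differentiably."""
    with torch.no_grad():
        cent = patches[torch.randperm(patches.size(0))[:k]].clone()
        for _ in range(iters):
            assign = torch.cdist(patches, cent).argmin(dim=1)      # (N,)
            for j in range(k):
                mask = assign == j
                if mask.any():
                    cent[j] = patches[mask].mean(dim=0)
    groups = [patches[assign == j].mean(dim=0) if (assign == j).any() else cent[j]
              for j in range(k)]
    return torch.stack(groups)                                     # (k, D)

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropy-regularized OT plan for a (k, k) cost matrix with uniform marginals."""
    k = cost.size(0)
    K = torch.exp(-cost / eps)
    a = b = torch.full((k,), 1.0 / k, device=cost.device)
    u = v = torch.ones(k, device=cost.device) / k
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return torch.diag(u) @ K @ torch.diag(v)

def local_alignment_loss(patch_adv: torch.Tensor, patch_tgt: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Fine-grained alignment: OT cost between clustered patch features."""
    c_adv = kmeans_centroids(patch_adv, k)
    c_tgt = kmeans_centroids(patch_tgt, k)
    cost = 1.0 - F.cosine_similarity(c_adv.unsqueeze(1), c_tgt.unsqueeze(0), dim=-1)  # (k, k)
    plan = sinkhorn_plan(cost)
    return (plan.detach() * cost).sum()   # gradient reaches the patches through the cost matrix
```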

---

3. Dynamic Ensemble Model Weighting

Prevent bias toward easily optimized models:

  • Measure the learning speed S_i(T) of each surrogate model during optimization.
  • Adjust weights accordingly: slower learning → higher weight.
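
A hedged sketch of the idea (the exact learning-speed statistic S_i(T) and the temperature `tau` below are our assumptions, not the paper's formula): each surrogate whose loss is still decreasing slowly gets a larger share of the total gradient.

```python
import torch

def dynamic_weights(losses_now, losses_prev, tau: float = 1.0) -> torch.Tensor:
    """Re-weight surrogate models by their learning speed.

    losses_now / losses_prev: per-model alignment losses at steps T and T-1.
    A model whose loss drops quickly is already easy to fool, so it gets a
    smaller weight; slowly improving models get larger weights.
    """
    now = torch.as_tensor(losses_now, dtype=torch.float32)
    prev = torch.as_tensor(losses_prev, dtype=torch.float32)
    speed = (prev - now) / prev.clamp(min=1e-8)   # relative learning speed per model
    return torch.softmax(-speed / tau, dim=0)     # slower learning -> higher weight

# Example: surrogate 0 learns fast, surrogate 1 slowly -> surrogate 1 gets more weight.
w = dynamic_weights([0.2, 0.8], [0.9, 0.9])
```

These weights then scale each surrogate's global and local losses when they are summed into the total attack objective sketched earlier.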

---

Why This Matters

Dual-dimensional alignment + dynamic weighting:

  • Improves semantic authenticity of adversarial examples.
  • Significantly increases transferability to closed-source MLLMs.


---

Experimental Results

Open-Source Models


Table 1: Higher ASR (attack success rate) and AvgSim on Qwen2.5-VL, LLaVA, and Gemma compared to baselines.

---

Closed-Source Models


Table 2: Against GPT-4o, Claude-3.7, Gemini-2.0:

  • FOA‑Attack achieves 75.1% ASR on GPT‑4o.

---

Reasoning-Enhanced Models


Table 3: Even reasoning-enhanced models such as GPT-o3 and Claude-3.7-thinking remain vulnerable.

---

Visualization


Figure 3: Original image, adversarial image, perturbation visualization.

---

Conclusion

FOA‑Attack shows that targeted adversarial examples can be made highly transferable by:

  • Aligning both global and local features.
  • Dynamically balancing multi-model ensembles.

This research:

  • Exposes vulnerabilities in current MLLM visual encoders.
  • Suggests new defense strategies (e.g., improving robustness for local features).

Next Steps:

Address efficiency and computational cost challenges to make FOA‑Attack even more practical.
