AAAI 2026 Oral | UTS and PolyU Break the “One-Size-Fits-All” Mold: How Federated Recommendation Achieves Personalized Image-Text Fusion

Honghao Wang

25 Nov 2025

Balancing Privacy & Personalization in Multimodal Recommendation Systems

As recommendation systems become increasingly multimodal, a central challenge is balancing data privacy with personalized image–text understanding.

A research team led by Prof. Guodong Long (University of Technology Sydney), in collaboration with Prof. Qiang Yang and Prof. Chengqi Zhang (The Hong Kong Polytechnic University), has proposed a new framework — FedVLR — to address this challenge.

This work, tackling multimodal fusion heterogeneity in federated environments, has been accepted as an Oral Presentation at AAAI 2026, a leading AI conference.

---

The New Normal: Multimodal Meets Federated Learning

Modern recommendation systems often use images and text to assist decisions.

When combined with Federated Learning — where data stays local to preserve privacy — complexity increases.

The Dilemma in Current Approaches

Privacy-first, feature-light: Skip multimodal processing and rely solely on ID-based features.
One-size-fits-all fusion: Assume all users prefer image–text in the same way.

Reality check:

Preferences vary. For clothing, visuals matter; for electronics, textual specs dominate. Capturing these variations in a federated setting — without seeing individual data — is tough.

---

FedVLR: Rethinking Multimodal Fusion

The team’s key insight: Restructure the decision flow by letting the server handle heavy preprocessing while offloading personalized fusion decisions to lightweight client-side routing.

Paper: https://arxiv.org/abs/2410.08478
Code: https://github.com/mtics/FedVLR

---

Pain Point: Multimodal in Data Silos

In centralized training, all interaction data is visible, so models can learn optimal fusion weights.

In federated learning, the server cannot see user behavior and must guess:

> For User A, is image more important than text?

Key Limitations

Computational bottlenecks: Clients often can’t run large vision–language models like CLIP.
No personalization: One global fusion rule ignores individual habits.

---

FedVLR Architecture: Server Prepares, Client Refines

FedVLR decouples feature extraction from preference fusion via a two-layer mechanism:

Layer 1: Server-Side “Multi-View Pre-fusion”

Heavy computation stays on the server.
Pre-trained vision–language models generate multiple candidate fusion views:
View A: Image-dominant
View B: Text-dominant
View C: Balanced
Like semi-finished dishes, these views give clients rich image–text content understanding without burdening their devices.

Layer 2: Client-Side Personalized Routing

A lightweight Mixture of Experts (MoE) router runs locally.
It uses private interaction history to compute personalized weights over the candidate views.
Processing stays on-device, so preferences never leave the client.
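
The two-layer split can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_views`, `route`), the fixed mixing ratios, and the dot-product scoring are all assumptions chosen for illustration. In the actual framework the views come from pre-trained vision–language models and the router is a learned MoE gate.

```python
import math

def make_views(image_vec, text_vec):
    """Server side: build three candidate fusion views for one item.

    The mixing ratios (0.8, 0.2, 0.5) are illustrative assumptions,
    not values taken from the paper.
    """
    def mix(alpha):
        return [alpha * i + (1.0 - alpha) * t for i, t in zip(image_vec, text_vec)]

    return {
        "image_dominant": mix(0.8),
        "text_dominant": mix(0.2),
        "balanced": mix(0.5),
    }

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(user_vec, views):
    """Client side: weight each view using a private user vector.

    A learned router would replace this dot-product scoring; the
    user vector never leaves the device in this sketch.
    """
    names = list(views)
    scores = [sum(u * v for u, v in zip(user_vec, views[n])) for n in names]
    weights = softmax(scores)
    fused = [
        sum(w * views[n][d] for w, n in zip(weights, names))
        for d in range(len(user_vec))
    ]
    return dict(zip(names, weights)), fused

# Hypothetical 2-d embeddings: the first axis is "visual", the second "textual".
views = make_views(image_vec=[1.0, 0.0], text_vec=[0.0, 1.0])
weights, fused = route(user_vec=[2.0, 0.0], views=views)
print(weights)
```

For a user vector that leans toward the visual axis, the image-dominant view receives the largest weight, which is the per-user behavior a single global fusion rule cannot provide.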

---

Engineering Benefits: Plug-and-Play Personalization

FedVLR is modular and easy to integrate into existing federated recommendation pipelines.

Advantages:

No heavy edge-side preprocessing
Seamless integration into existing federated methods such as FedAvg or FedNCF
Zero extra communication overhead
Strict privacy compliance
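
As a rough sketch of how such an integration might look, the snippet below runs standard FedAvg weighted averaging over shared parameters only, while anything under a `router.` prefix stays on the client. The prefix, the helper names, and the decision to keep router parameters local are assumptions made for illustration. The source states only that preferences never leave the client.

```python
def split_params(params, local_prefix="router."):
    """Separate parameters to upload from those kept on-device.

    `local_prefix` is a hypothetical naming convention for this sketch.
    """
    shared = {k: v for k, v in params.items() if not k.startswith(local_prefix)}
    local = {k: v for k, v in params.items() if k.startswith(local_prefix)}
    return shared, local

def fedavg(client_updates):
    """Weighted average of shared parameters (standard FedAvg rule).

    client_updates: list of (num_examples, {name: list_of_floats}).
    """
    total = sum(n for n, _ in client_updates)
    first = client_updates[0][1]
    averaged = {}
    for name, values in first.items():
        averaged[name] = [
            sum(n * update[name][d] for n, update in client_updates) / total
            for d in range(len(values))
        ]
    return averaged

# Two hypothetical clients; only the shared part of each model is uploaded.
client_a = {"item_emb": [0.0, 2.0], "router.w": [0.9, 0.1]}
client_b = {"item_emb": [4.0, 2.0], "router.w": [0.2, 0.8]}
shared_a, _ = split_params(client_a)
shared_b, _ = split_params(client_b)
global_params = fedavg([(1, shared_a), (3, shared_b)])
print(global_params)
```

Because the router parameters are filtered out before upload, the aggregation step is the same as in plain FedAvg, and the server receives no per-user fusion weights.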

---

Real-World Parallels: Cloud Power + Local Customization

Platforms like AiToEarn apply similar principles in content monetization: heavy cloud-based generation with lightweight local personalization.

By connecting generation, publishing, analytics, and rankings, they mirror FedVLR’s privacy–personalization synergy.

---

FedVLR in Action: Results & Validation

Experiments on Public E-commerce & Multimedia Datasets

Highlights:

Consistent gains in NDCG (Normalized Discounted Cumulative Gain) and HR (Hit Ratio) across baseline models.
Cold-start boost on sparse data: personalized fusion makes better use of limited interactions.
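
For readers unfamiliar with the two metrics, here is a minimal sketch of HR@K and NDCG@K. It assumes a leave-one-out setup with a single held-out target item per user, which is common in recommendation benchmarks but is an assumption here, not a detail confirmed by the source. The function names are illustrative.

```python
import math

def hit_ratio_at_k(ranked_items, target, k):
    """1.0 if the held-out item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with one relevant item, so the ideal DCG equals 1.

    A hit at 0-based position `rank` scores 1 / log2(rank + 2).
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)
    return 1.0 / math.log2(rank + 2)

# Hypothetical ranking for one user whose held-out item is "c".
ranking = ["a", "b", "c", "d"]
print(hit_ratio_at_k(ranking, "c", k=3), ndcg_at_k(ranking, "c", k=3))
```

HR only records whether the target made the cut, while NDCG also rewards placing it nearer the top. Reported scores are typically averaged over all users.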

---

Broader Impact: A Paradigm for Federated Foundation Models

In an era of limited edge computing power and increasingly powerful cloud models, the challenge is how to:

Benefit from cloud-scale knowledge
Preserve local privacy
Avoid expensive deployment costs

FedVLR offers a path:

> Cloud: General content understanding

> Edge: Private preference modeling

This lowers the communication and computation barriers, enabling complex multimodal and generative AI models to be used in privacy-sensitive contexts.

---

Conclusion

FedVLR is more than a model; it’s a deployable enhancement framework for federated multimodal recommendation.

By smartly dividing work between server-side preprocessing and client-side personalization, it:

Preserves privacy
Improves recommendation accuracy
Enables real-world deployments even on limited hardware

With its open-source release, the community can adapt and extend it to new applications — including AI-powered content creation platforms that demand both personalization and privacy.