SJTU, Tsinghua, Microsoft, and Shanghai AI Lab Release Data Analysis Agent Survey: LLMs as Data Analysts Let Data “Speak” for Itself

2025-10-28 00:01 — Jilin

The survey systematically reviews the overall evolution of large language models (LLMs) in data analysis.

---

Large Model Intelligence|Sharing

Source: Machine Heart

---

Background

Traditional data analysis typically relies on manual workflows: writing SQL, running Python scripts, and interpreting the results by hand.

Limitations of this traditional approach include:

  • High coupling between steps
  • Poor scalability
  • Difficulty handling dynamic, multimodal, and complex datasets

The rise of LLMs and intelligent agents shifts data analysis from rule execution to semantic understanding. This enables machines to:

  • Interpret intrinsic logic and data relationships
  • Flexibly perform querying, modeling, and reporting tasks
  • Handle diverse data modalities with higher adaptability

---

Key Paper & Resources

The survey, authored by teams from Shanghai Jiao Tong University, Tsinghua University, Microsoft Research Redmond, and Shanghai AI Lab, reviews the development of LLMs in data analysis along three lines:

  • The move from rule-based pipelines to intelligent collaboration
  • The move from single-modality to multimodal fusion
  • A proposed General Data Analyst Agent paradigm

---

The research identifies five technological shifts in LLM/Agent-based data analysis:

  • Literal → Semantic Reasoning: from just “seeing” data to understanding and reasoning about its semantics.
  • Closed Tools → Open Collaboration: the ability to call external APIs, access knowledge bases, and cooperate with multiple tools.
  • Closed Data → Open-Domain Analysis: the capacity to analyze unconstrained datasets beyond legacy system limits.
  • Static Workflows → Dynamic Generation: agents can auto-generate pipelines for higher efficiency.
  • Manual Agent Frameworks → Auto-Generated Frameworks: agents can self-construct workflows for specific tasks.

---

Figure 1: Technological evolution of LLMs in data analysis.

Figure 2: Overview of LLM/Agent-as-Data-Analyst techniques covering structured, semi-structured, unstructured, and heterogeneous data.

---

Evolution by Data Modality

1. Structured Data

  • Relational Data Analysis:
      • From NL2SQL to NL2Code and ModelQA
      • Focus: semantic alignment, schema linking, multi-step reasoning, and end-to-end table QA (e.g., TableGPT, ReAcTable)
  • Graph Data Analysis:
      • Using NL2GQL for query generation
      • Moving from code-level to semantic-level understanding and execution
      • Examples: R3-NL2GQL, GraphGPT
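
As a rough illustration of the NL2SQL ideas listed above, the following Python sketch performs a naive, keyword-based form of schema linking and then runs a query against an in-memory SQLite database. It is not taken from the survey, TableGPT, or ReAcTable: the `sales` table, the helper names, and `stub_llm` (a hypothetical stand-in that returns a canned query instead of calling a model) are all assumptions made for the example.

```python
import re
import sqlite3


def schema_of(conn):
    """Map each table name to its list of column names."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    return {t: [c[1] for c in conn.execute(f"PRAGMA table_info({t})")]
            for t in tables}


def link_schema(question, schema):
    """Naive schema linking: keep tables/columns whose names occur in the question."""
    words = set(re.findall(r"[a-z_]+", question.lower()))
    linked = {}
    for table, cols in schema.items():
        hits = [c for c in cols if c.lower() in words]
        if hits or table.lower() in words:
            linked[table] = hits
    return linked


def build_prompt(question, linked):
    """Assemble the text that would be sent to a model (format is illustrative)."""
    schema_text = "; ".join(f"{t}({', '.join(cols)})" for t, cols in linked.items())
    return f"Schema: {schema_text}\nQuestion: {question}\nSQL:"


def stub_llm(prompt):
    """Hypothetical stand-in for a model call; always returns the same query."""
    return "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"


# Toy data, invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 5.0), ("east", 2.5)])

question = "What is the total amount per region in sales?"
linked = link_schema(question, schema_of(conn))
prompt = build_prompt(question, linked)
rows = conn.execute(stub_llm(prompt)).fetchall()
print(linked)
print(rows)
```

In a real system the linking step would rely on semantic matching rather than exact word overlap, and the generated SQL would need validation before execution; the sketch only shows where those pieces sit in the flow.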

---

2. Semi-Structured Data

  • Markup Language Understanding:
      • Tasks: extraction (Evaporate), querying (XPath Agent), semantic comprehension (MarkupLM)
      • Shift from rule-based to LLM-based tree-structure modeling & hierarchical encoding
  • Semi-Structured Table Understanding:
      • Structure representation (ST-Raptor)
      • Model-driven conversion (TabFormer)
      • Table prompt compression (HySem)
      • Query reasoning (CoS)
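
To make the querying and hierarchical-encoding ideas above more concrete, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. It does not reproduce XPath Agent or MarkupLM; the sample document, the hard-coded `xpath` string (which an agent would presumably generate), and the `flatten` helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = """
<catalog>
  <book id="b1"><title>Tables</title><price>30</price></book>
  <book id="b2"><title>Graphs</title><price>45</price></book>
</catalog>
""".strip()

root = ET.fromstring(DOC)

# Querying: in an agent setting this expression would be produced by a model;
# here it is written by hand.
xpath = ".//book[@id='b2']/title"
titles = [el.text for el in root.findall(xpath)]


def flatten(el, prefix=""):
    """Hierarchical encoding: turn the tree into (path, text) pairs."""
    path = f"{prefix}/{el.tag}"
    pairs = []
    if el.text and el.text.strip():
        pairs.append((path, el.text.strip()))
    for child in el:
        pairs.extend(flatten(child, path))
    return pairs


encoded = flatten(root)
print(titles)
print(encoded)
```

The path-value pairs are one simple way to expose tree structure to a text-only model; learned encodings of the kind the survey describes would replace this step.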

---

3. Unstructured Data

  • Document Understanding:
      • OCR & RAG-powered analysis (ZenDB, QUEST)
      • Evolution from OCR templates to VLM-based methods (DocLLM, DocOwl2, DLAFormer)
      • Tasks: layout recognition, retrieval QA, summarization, multi-document reasoning
  • Chart Understanding:
      • Image parsing + language reasoning (ChartQA, Chart-of-Thought)
      • Applications: description generation, Q&A, visual reasoning
  • Video & 3D Model Analysis:
      • Temporal localization, action recognition, 3D semantic fusion (Video-LLaMA, LLMI3D)
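
The retrieval step behind retrieval QA can be sketched in a few lines of Python. The example below ranks text chunks against a question using bag-of-words cosine similarity; this is a deliberately simple stand-in chosen for the illustration, not the method used by ZenDB, QUEST, or any system named above, and the chunks and question are invented.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word and number tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = Counter(tokens(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokens(c))), reverse=True)
    return ranked[:k]


# Invented document chunks, e.g. the output of an OCR or parsing stage.
chunks = [
    "Revenue grew 12 percent in the third quarter.",
    "The office moved to a new building.",
    "Headcount stayed flat during the year.",
]
question = "How much did revenue grow in the third quarter?"
top = retrieve(question, chunks)
print(top)
```

In a full pipeline the retrieved chunk would be passed to a language model together with the question; production systems typically use dense embeddings rather than word counts for the similarity step.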

---

4. Heterogeneous Data

  • Cross-Modal Data Lake Integration for unified semantic queries and multimodal reasoning
  • Subtasks:
      • Modality alignment
      • NL-based retrieval interfaces
      • Agents for heterogeneous sources (HetAgent, XMODE)

---

Unique Perspective of the Survey

Unlike studies that focus on a single task, the survey covers all data modalities and the full analysis pipeline, and distills five core design principles for a General Data Analyst Agent.

Challenges ahead include:

  • Scalability
  • Robustness
  • Open-domain adaptability

---

Summary:

LLMs and intelligent agents are pushing data analysis toward a future in which machines act as collaborators rather than just tools. The General Data Analyst Agent outlined in the survey combines multimodal capabilities, open-domain readiness, and autonomous pipeline generation, aiming for a flexible and scalable approach to extracting insights from data.
