SJTU, Tsinghua, Microsoft, and Shanghai AI Lab Release Data Analysis Agent Survey: LLMs as Data Analysts Let Data “Speak” for Itself

Honghao Wang

28 Oct 2025 — 4 min read

2025-10-28 00:01 — Jilin

The survey systematically reviews the overall evolution of large language models (LLMs) in data analysis.

---

Source: Machine Heart

---

Background

Traditional data analysis typically relies on manual workflows: writing SQL, running Python scripts, and interpreting the results by hand.

Limitations of this traditional approach include:

High coupling between steps
Poor scalability
Difficulty handling dynamic, multimodal, and complex datasets

The rise of LLMs and intelligent agents shifts data analysis from rule execution to semantic understanding. This enables machines to:

Interpret intrinsic logic and data relationships
Flexibly perform querying, modeling, and reporting tasks
Handle diverse data modalities with higher adaptability

---

Key Paper & Resources

Paper Title: LLM/Agent-as-Data-Analyst
Paper Link: https://arxiv.org/abs/2509.23988
GitHub Project: https://github.com/weAIDB/awesome-data-llm

The survey, authored by teams from Shanghai Jiao Tong University, Tsinghua University, Microsoft Research Redmond, and Shanghai AI Lab, reviews the development of LLMs in data analysis, covering:

The move from rule-based pipelines to intelligent collaboration
The move from single-modality analysis to multimodal fusion
The proposal of a General Data Analyst Agent paradigm

---

Five Major Trends

The research identifies five technological shifts in LLM/Agent-based data analysis:

Literal → Semantic Reasoning
Moving from merely “seeing” data to understanding and reasoning about its semantics.
Closed Tools → Open Collaboration
Agents can call external APIs, access knowledge bases, and coordinate multiple tools.
Closed Data → Open-Domain Analysis
Agents can analyze open-domain data rather than only the data held inside a closed, predefined system.
Static Workflows → Dynamic Generation
Agents generate analysis pipelines on demand instead of following fixed workflows.
Manual Agent Frameworks → Auto-Generated Frameworks
Agent workflows are constructed automatically for a specific task rather than designed by hand.
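
The "open collaboration" and "dynamic generation" trends can be illustrated with a small sketch. This is not code from the survey: the tool names (`filter_rows`, `sum_field`), the `run_plan` helper, and the sample data are all hypothetical. The idea is only that a plan is data, so an agent could produce a different plan for each request while the executor stays the same.

```python
from typing import Any, Callable

# Hypothetical tool registry; names and behavior are illustrative only.
TOOLS: dict[str, Callable[..., Any]] = {
    "filter_rows": lambda rows, key, value: [r for r in rows if r[key] == value],
    "sum_field": lambda rows, field: sum(r[field] for r in rows),
}


def run_plan(data: Any, plan: list[tuple[str, dict]]) -> Any:
    """Run (tool_name, kwargs) steps in order, feeding each result forward."""
    result = data
    for tool_name, kwargs in plan:
        if tool_name not in TOOLS:
            raise KeyError(f"unknown tool: {tool_name}")
        result = TOOLS[tool_name](result, **kwargs)
    return result


sales = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 50},
]

# In a real agent this plan would be generated by the LLM for each request;
# it is hard-coded here so the sketch stays runnable.
plan = [
    ("filter_rows", {"key": "region", "value": "north"}),
    ("sum_field", {"field": "amount"}),
]

total = run_plan(sales, plan)
print(total)
```

Because the pipeline is an ordinary list, swapping in a different sequence of steps needs no change to `run_plan`, which is the property the "static → dynamic" shift relies on.
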

---

Figure 1: Technological evolution of LLMs in data analysis.

Figure 2: Overview of LLM/Agent-as-Data-Analyst techniques covering structured, semi-structured, unstructured, and heterogeneous data.

---

Evolution by Data Modality

1. Structured Data

Relational Data Analysis:
From NL2SQL to NL2Code and ModelQA
Focus: semantic alignment, schema linking, multi-step reasoning, and end-to-end table QA (e.g., TableGPT, ReAcTable)
Graph Data Analysis:
Using NL2GQL for query generation
Moving from code-level to semantic-level understanding and execution
Examples: R3-NL2GQL, GraphGPT

---

2. Semi-Structured Data

Markup Language Understanding:
Tasks: extraction (Evaporate), querying (XPath Agent), semantic comprehension (MarkupLM)
Shift from rule-based to LLM-based tree-structure modeling & hierarchical encoding
Semi-Structured Table Understanding:
Structure representation (ST-Raptor)
Model-driven conversion (TabFormer)
Table prompt compression (HySem)
Query reasoning (CoS)
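
The querying and hierarchical-encoding ideas for markup data can be sketched with Python's standard library. This is an illustrative assumption, not the method of any system named above: the sample XML and the `flatten` helper are hypothetical. The XPath expression stays within the subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """
<catalog>
  <item category="tool"><name>hammer</name><price>12.5</price></item>
  <item category="toy"><name>kite</name><price>8.0</price></item>
  <item category="tool"><name>saw</name><price>20.0</price></item>
</catalog>
""".strip()

root = ET.fromstring(DOC)

# Querying: an agent could emit an XPath expression like this one.
tool_names = [e.text for e in root.findall(".//item[@category='tool']/name")]
print(tool_names)


def flatten(elem: ET.Element, prefix: str = "") -> list[tuple[str, str]]:
    """Encode the tree as (path, text) pairs so the hierarchy survives in flat text."""
    path = f"{prefix}/{elem.tag}"
    pairs = []
    text = (elem.text or "").strip()
    if text:
        pairs.append((path, text))
    for child in elem:
        pairs.extend(flatten(child, path))
    return pairs


pairs = flatten(root)
for path, value in pairs:
    print(path, "=", value)
```

Path-value pairs are one simple way to hand a tree to a language model without losing the nesting, since each value carries its full ancestor path.
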

---

3. Unstructured Data

Document Understanding:
OCR & RAG-powered analysis (ZenDB, QUEST)
Evolution from OCR templates to VLM-based methods (DocLLM, DocOwl2, DLAFormer)
Tasks: layout recognition, retrieval QA, summarization, multi-document reasoning
Chart Understanding:
Image parsing + language reasoning (ChartQA, Chart-of-Thought)
Applications: description generation, Q&A, visual reasoning
Video & 3D Model Analysis:
Temporal localization, action recognition, 3D semantic fusion (Video-LLaMA, LLMI3D)
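
For the retrieval QA task on documents, the retrieval half of a RAG pipeline can be sketched as follows. This is a minimal illustration that assumes the document text has already been extracted (for example by OCR). It uses bag-of-words cosine similarity as a stand-in for the learned embeddings a real system would use, and the chunks, question, and helper names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks standing in for extracted document text.
chunks = [
    "Invoice 1042 was issued on 3 March and totals 480 euros.",
    "The warranty covers parts and labour for two years.",
    "Shipping takes five business days within the EU.",
]

question = "How long does the warranty last?"
top = retrieve(question, chunks)

# The retrieved chunk would be placed in the prompt sent to the LLM.
prompt = f"Context:\n{top[0]}\n\nQuestion: {question}"
print(prompt)
```

Only the selected chunk reaches the model, which is what lets this style of analysis scale to documents far longer than a context window.
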

---

4. Heterogeneous Data

Cross-Modal Data Lake Integration for unified semantic queries and multimodal reasoning
Subtasks:
Modality alignment
NL-based retrieval interfaces
Agents for heterogeneous sources (HetAgent, XMODE)
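
A unified interface over heterogeneous sources might look like the sketch below. It is a hypothetical design, not the architecture of HetAgent or XMODE: the class names, the `modality` attribute, and the sample data are assumptions. The routing is explicit here, whereas an agent would choose the source and write the request from a natural-language question.

```python
import sqlite3


class TableSource:
    """Relational source backed by an in-memory SQLite database."""

    modality = "table"

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        self.conn.executemany(
            "INSERT INTO sales VALUES (?, ?)",
            [("north", 120.0), ("south", 80.0)],
        )

    def query(self, request: str) -> list:
        # The request is SQL here; an agent would generate it.
        return self.conn.execute(request).fetchall()


class TextSource:
    """Text source; keyword matching stands in for semantic retrieval."""

    modality = "text"

    def __init__(self, notes: list[str]) -> None:
        self.notes = notes

    def query(self, request: str) -> list:
        return [n for n in self.notes if request.lower() in n.lower()]


class DataLake:
    """Routes each request to the source registered for its modality."""

    def __init__(self, sources: list) -> None:
        self.sources = {s.modality: s for s in sources}

    def ask(self, modality: str, request: str) -> list:
        if modality not in self.sources:
            raise ValueError(f"no source for modality: {modality}")
        return self.sources[modality].query(request)


notes = [
    "North region launched a new store in May.",
    "South region had a supplier delay.",
]
lake = DataLake([TableSource(), TextSource(notes)])

table_answer = lake.ask("table", "SELECT SUM(amount) FROM sales")
text_answer = lake.ask("text", "supplier")
print(table_answer, text_answer)
```

Keeping every source behind the same `query` method is what allows a single agent loop to combine answers from different modalities.
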

---

Unique Perspective of the Survey

Unlike studies that focus on a single task, this survey covers all data modalities and the full analysis pipeline, and it identifies five core design principles for a General Data Analyst Agent.

Challenges ahead include:

Scalability
Robustness
Open-domain adaptability

---
