SJTU, Tsinghua, Microsoft, and Shanghai AI Lab Release Data Analysis Agent Survey: LLMs as Data Analysts Let Data “Speak” for Itself
2025-10-28 00:01 — Jilin
The survey systematically reviews the overall evolution of large language models (LLMs) in data analysis.


---
Large Model Intelligence|Sharing
Source: Machine Heart
---
Background
Traditional data analysis typically relies on manual workflows: writing SQL, running Python scripts, and interpreting the results by hand.
Limitations of this traditional approach include:
- High coupling between steps
- Poor scalability
- Difficulty handling dynamic, multimodal, and complex datasets
The rise of LLMs and intelligent agents shifts data analysis from rule execution to semantic understanding. This enables machines to:
- Interpret intrinsic logic and data relationships
- Flexibly perform querying, modeling, and reporting tasks
- Handle diverse data modalities with higher adaptability
---
Key Paper & Resources
- Paper Title: LLM/Agent-as-Data-Analyst
- Paper Link: https://arxiv.org/abs/2509.23988
- GitHub Project: https://github.com/weAIDB/awesome-data-llm
The survey, authored by teams from Shanghai Jiao Tong University, Tsinghua University, Microsoft Research Redmond, and Shanghai AI Lab, reviews the development of LLMs in data analysis along three lines:
- From rule-based pipelines to intelligent collaboration
- From single-modality to multimodal fusion
- Toward a proposed General Data Analyst Agent paradigm

---
Five Major Trends
The research identifies five technological shifts in LLM/Agent-based data analysis:
- Literal → Semantic Reasoning: from just “seeing” data to understanding and reasoning about its semantics.
- Closed Tools → Open Collaboration: the ability to call external APIs, access knowledge bases, and cooperate with multiple tools.
- Closed Data → Open-Domain Analysis: the capacity to analyze unconstrained datasets beyond legacy system limits.
- Static Workflows → Dynamic Generation: agents can auto-generate pipelines for higher efficiency.
- Manual Agent Frameworks → Auto-Generated Frameworks: agents can self-construct workflows for specific tasks.
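The "Static Workflows → Dynamic Generation" shift can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the survey's method: `plan_pipeline` is a hypothetical stand-in for an LLM planner, and the tool names and sample rows are invented for the example.

```python
from typing import Callable, Dict, List

Rows = List[dict]

# Hypothetical tools; each takes a list of row dicts and returns a new list.
def drop_missing(rows: Rows) -> Rows:
    return [r for r in rows if all(v is not None for v in r.values())]

def sort_by_sales(rows: Rows) -> Rows:
    return sorted(rows, key=lambda r: r["sales"], reverse=True)

def top_two(rows: Rows) -> Rows:
    return rows[:2]

TOOLS: Dict[str, Callable[[Rows], Rows]] = {
    "drop_missing": drop_missing,
    "sort_by_sales": sort_by_sales,
    "top_two": top_two,
}

def plan_pipeline(task: str) -> List[str]:
    """Stand-in for an LLM planner: maps a task description to tool names."""
    steps = ["drop_missing"]
    if "top" in task.lower():
        steps += ["sort_by_sales", "top_two"]
    return steps

def run_pipeline(steps: List[str], rows: Rows) -> Rows:
    """Execute the generated plan step by step, rejecting unknown tools."""
    for name in steps:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        rows = TOOLS[name](rows)
    return rows
```

The pipeline is data (a list of tool names) rather than fixed code, so a planner can emit a different sequence for each task while the executor stays unchanged.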
---

Figure 1: Technological evolution of LLMs in data analysis.

Figure 2: Overview of LLM/Agent-as-Data-Analyst techniques covering structured, semi-structured, unstructured, and heterogeneous data.
---
Evolution by Data Modality
1. Structured Data
- Relational Data Analysis:
  - From NL2SQL to NL2Code and ModelQA
  - Focus: semantic alignment, schema linking, multi-step reasoning, and end-to-end table QA (e.g., TableGPT, ReAcTable)
- Graph Data Analysis:
  - Using NL2GQL for query generation
  - Moving from code-level to semantic-level understanding and execution
  - Examples: R3-NL2GQL, GraphGPT
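To make schema linking and NL2SQL concrete, here is a rough sketch. It assumes a naive lexical linker and a stubbed `generate_sql` function in place of a real model call; the schema, sample rows, and function names are all hypothetical and are not taken from any of the systems named above.

```python
import re
import sqlite3

# Hypothetical schema: table name -> column names.
SCHEMA = {"orders": ["id", "customer", "amount"], "products": ["sku", "price"]}

def link_schema(question, schema):
    """Naive lexical schema linking: keep tables and columns named in the question."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    linked = {}
    for table, columns in schema.items():
        hits = [c for c in columns if c in tokens]
        if hits or table in tokens:
            linked[table] = hits or columns
    return linked

def generate_sql(question, linked):
    """Stub standing in for an LLM call that would receive the question
    and the linked schema as its prompt."""
    if "orders" in linked and "amount" in linked["orders"]:
        return ("SELECT customer, SUM(amount) FROM orders "
                "GROUP BY customer ORDER BY customer")
    raise ValueError("no template for this question")

def answer(question):
    """Link the schema, generate SQL, and run it on an in-memory sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "ann", 10.0), (2, "bob", 5.0), (3, "ann", 7.5)],
    )
    sql = generate_sql(question, link_schema(question, SCHEMA))
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows
```

Pruning the schema before generation keeps the prompt small and reduces the chance that the model references an irrelevant table, which is one common motivation for the linking step.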
---
2. Semi-Structured Data
- Markup Language Understanding:
  - Tasks: extraction (Evaporate), querying (XPath Agent), semantic comprehension (MarkupLM)
  - Shift from rule-based to LLM-based tree-structure modeling & hierarchical encoding
- Semi-Structured Table Understanding:
  - Structure representation (ST-Raptor)
  - Model-driven conversion (TabFormer)
  - Table prompt compression (HySem)
  - Query reasoning (CoS)
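Querying markup can be illustrated with a small sketch in which an agent proposes an XPath expression and a standard parser executes it. This is an assumption-laden toy: `propose_xpath` is a hypothetical stub for the model, the XML document is invented, and it does not reproduce how XPath Agent itself works. It uses only the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Invented sample document for the illustration.
DOC = """
<catalog>
  <book id="b1"><title>Data Lakes</title><price>30</price></book>
  <book id="b2"><title>Graph Queries</title><price>45</price></book>
</catalog>
""".strip()

def propose_xpath(question):
    """Stub for the model: maps a question to an XPath expression."""
    if "b2" in question:
        return "./book[@id='b2']/title"
    return "./book/title"

def run_xpath(xml_text, xpath):
    """Parse the document and return the text of every matching element."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(xpath)]

def ask(question):
    return run_xpath(DOC, propose_xpath(question))
```

Separating query proposal from execution means the model never has to read the whole tree; it only needs enough structural context to write a valid path.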
---
3. Unstructured Data
- Document Understanding:
  - OCR & RAG-powered analysis (ZenDB, QUEST)
  - Evolution from OCR templates to VLM-based methods (DocLLM, DocOwl2, DLAFormer)
  - Tasks: layout recognition, retrieval QA, summarization, multi-document reasoning
- Chart Understanding:
  - Image parsing + language reasoning (ChartQA, Chart-of-Thought)
  - Applications: description generation, Q&A, visual reasoning
- Video & 3D Model Analysis:
  - Temporal localization, action recognition, 3D semantic fusion (Video-LLaMA, LLMI3D)
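The retrieval half of retrieval QA over documents can be sketched as follows. The sketch swaps in bag-of-words cosine similarity for the learned embeddings a real system would likely use, and the text chunks are invented sample sentences, so treat it as a shape of the idea rather than an implementation of any system listed above.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (a simple substitute for learned embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]

# Invented document chunks, e.g. as they might come out of an OCR step.
CHUNKS = [
    "Revenue grew 12 percent in the third quarter.",
    "The company opened two offices in Europe.",
    "Headcount remained flat during the year.",
]
```

In a full pipeline the retrieved chunks would be placed in the model's prompt together with the question; only the ranking step is shown here.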
---
4. Heterogeneous Data
- Cross-Modal Data Lake Integration for unified semantic queries and multimodal reasoning
- Subtasks:
  - Modality alignment
  - NL-based retrieval interfaces
  - Agents for heterogeneous sources (HetAgent, XMODE)
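One way to picture an agent working across heterogeneous sources is a router that sends each requested field to the store that holds it. The sketch below is hypothetical throughout: the routing table, the fictional city record, and the function names are assumptions made for illustration, and it is not a description of HetAgent or XMODE.

```python
import json
import sqlite3

def build_sources():
    """Create two made-up sources: a relational table and a JSON document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
    conn.execute("INSERT INTO cities VALUES ('Springfield', 120000)")
    doc = json.dumps({"Springfield": {"mayor": "A. Example"}})
    return conn, doc

def query_table(conn, city):
    """Structured source: look up the population with SQL."""
    row = conn.execute(
        "SELECT population FROM cities WHERE name = ?", (city,)
    ).fetchone()
    return row[0] if row else None

def query_json(doc, city):
    """Semi-structured source: look up the mayor in a JSON record."""
    return json.loads(doc).get(city, {}).get("mayor")

def route(field):
    """Hypothetical router: decide which source holds a requested field."""
    return {"population": "table", "mayor": "json"}[field]

def answer(city, fields, conn, doc):
    """Dispatch each field to its source and merge the results."""
    handlers = {
        "table": lambda: query_table(conn, city),
        "json": lambda: query_json(doc, city),
    }
    return {f: handlers[route(f)]() for f in fields}
```

A real agent would infer the routing from source metadata or from the model itself; the fixed dictionary here only marks where that decision sits.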
---
Unique Perspective of the Survey
Unlike single-task studies, this survey covers all data modalities and the full analysis pipeline, and it identifies five core design principles for a General Data Analyst Agent.
Challenges ahead include:
- Scalability
- Robustness
- Open-domain adaptability
---
Summary:
LLMs and intelligent agents are pushing data analysis toward a future where machines are not just tools, but collaborators and thinkers. The General Data Analyst Agent outlined in the paper combines multimodal capabilities, open-domain readiness, and autonomous pipeline generation, offering a flexible and scalable approach to extracting insights.