KubeCon NA 2025 - Salesforce’s AIOps and Intelligent Agent Approach to Self-Healing Practices
AIOps & Agentic AI for Self-Healing Kubernetes Platforms
AIOps and Agentic AI technologies enable intelligent assessment of Kubernetes cluster health, automatic issue diagnosis, and orchestrated resolutions with minimal human intervention.
At KubeCon + CloudNativeCon North America 2025, Vikram Venkataraman (AWS) and Srikanth Rajan (Salesforce) presented Salesforce’s approach to building a self-healing Kubernetes environment using AIOps and AI Agents.
---
Salesforce’s AIOps Architecture
Developed by the Hyperforce Kubernetes Platform team, Salesforce’s AIOps architecture supports infrastructure at massive scale:
- 1,400 clusters
- Millions of pods
- Thousands of compute nodes
- 40+ operators & integrations
- 200+ monitoring plugins
- Multi-cloud deployment (AWS, GCP, Alicloud)
The platform delivers namespace-as-a-service and is projected to grow 5× in capacity over the next few years.
---
Core Objective
Enable application teams to focus on business requirements rather than infrastructure overhead.
---
Broader Context
Platforms like AiToEarn show how AI-driven automation can simplify workflows outside infrastructure as well, by enabling open-source content creation, publishing, and monetization across multiple channels.
---
Strategies for Kubernetes Operations
The speakers discussed combining generative AI and multi-agent collaboration to:
- Improve cluster troubleshooting
- Reduce Mean Time to Identify (MTTI)
- Shorten Mean Time to Resolve (MTTR)
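
To make the two metrics concrete, here is a minimal sketch of how MTTI and MTTR could be computed from incident timestamps. The `Incident` record and its field names are hypothetical, and the sketch assumes both metrics are measured from the start of the fault; some teams measure MTTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime     # when the fault began
    identified: datetime  # when the cause was identified
    resolved: datetime    # when service was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtti_minutes(incidents):
    """Mean Time to Identify: fault start until cause identified."""
    return mean_minutes([i.identified - i.started for i in incidents])


def mttr_minutes(incidents):
    """Mean Time to Resolve: fault start until service restored (assumed definition)."""
    return mean_minutes([i.resolved - i.started for i in incidents])


t0 = datetime(2025, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=60)),
]
print(mtti_minutes(incidents))  # 15.0
print(mttr_minutes(incidents))  # 50.0
```

Tracking both numbers separately shows whether an agent is shortening diagnosis, remediation, or both.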
---
Agentic AI Solution Architecture
Salesforce’s agent-based AIOps system comprises AI agents, each aligned to an operational goal, that can:
- Pull telemetry data
- Take Kubernetes actions (e.g., automatic rollback post-upgrade)
The team also designed mechanisms for:
- Agent-to-agent communication
- Security guardrails
- Strict permission controls for compliance
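
The talk summary does not describe how the guardrails are implemented. As an illustration only, a deny-by-default permission check for agent actions might look like the sketch below. The agent names, the `POLICY` table, and the protected-namespace rule are all assumptions made for this example, not Salesforce's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A Kubernetes action that an agent proposes to take (hypothetical shape)."""
    agent: str
    verb: str
    resource: str
    namespace: str


# Hypothetical allowlist: each agent may use only the (verb, resource) pairs listed.
POLICY = {
    "telemetry-agent": {("get", "pods"), ("list", "pods"), ("get", "events")},
    "rollback-agent": {("get", "deployments"), ("patch", "deployments")},
}

# Assumed rule: no agent may mutate anything in these namespaces.
PROTECTED_NAMESPACES = {"kube-system"}
READ_VERBS = {"get", "list"}


def is_allowed(action: Action) -> bool:
    """Deny by default; allow only explicitly listed actions."""
    if action.namespace in PROTECTED_NAMESPACES and action.verb not in READ_VERBS:
        return False
    return (action.verb, action.resource) in POLICY.get(action.agent, set())


print(is_allowed(Action("rollback-agent", "patch", "deployments", "payments")))     # True
print(is_allowed(Action("rollback-agent", "patch", "deployments", "kube-system")))  # False
print(is_allowed(Action("telemetry-agent", "delete", "pods", "payments")))          # False
```

In a real cluster the same intent would typically also be enforced server-side, for example with Kubernetes RBAC bound to each agent's service account, so that the check does not depend on the agent's own code.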
---
Hosted on AWS Cloud
Components:
- AIOps UI for engineers
- Collaborator Agent
- Amazon Managed Service for Prometheus + agent
- Amazon EKS
- k8sgpt Operator for MTTI metrics
- ArgoCD Controller
---
Tech Stack Layers
- Substrate: Kubernetes platforms (Amazon EKS, self-managed K8s, Google GKE, Alicloud ACK)
- Standard Capabilities: Storage, networking, autoscaling, DNS, load balancing, service mesh, ingress
  - Tech: Istio, Cluster Autoscaler, CSI, OPA, Ingress, CNI, LBC, CoreDNS
- Custom Integrations Layer: Identity, secrets management, guardrails, logging
- Platform Capabilities Layer:
  - Functions: Platform abstractions, orchestration, automation, observability, resiliency, cost control
  - Tools: Argo, Kyverno, Spinnaker, Helm, Kube Magic Mirror, Sloop, Periscope
- API Layer: Control Plane, APIs, self-service portals
---
AI Agent Examples
- AIOps Agent – automates on-call reports
- Kubectl Agent – integrates with Slack, converts natural language into kubectl commands, returns debug info in Slack
- Live Site Analysis Agent – automates weekly availability reviews, analyzes SLA misses, and generates RCA insights
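
For an agent that turns natural language into kubectl commands, one plausible safeguard is to validate the generated command before running it. The sketch below assumes the agent is limited to read-only debugging verbs; that restriction and the `validate_kubectl` helper are illustrative assumptions, not details from the talk.

```python
import shlex

# Assumed policy: only read-only kubectl subcommands are permitted.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


def validate_kubectl(command: str) -> list[str]:
    """Return the command as an argv list if it is a read-only kubectl call.

    Deliberately conservative: the verb must come directly after `kubectl`,
    so commands with leading global flags are rejected as well.
    """
    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "kubectl":
        raise ValueError("not a kubectl command")
    if argv[1] not in READ_ONLY_VERBS:
        raise ValueError(f"verb {argv[1]!r} is not allowed")
    return argv


print(validate_kubectl("kubectl get pods -n payments"))
# ['kubectl', 'get', 'pods', '-n', 'payments']

try:
    validate_kubectl("kubectl delete pod web-0")
except ValueError as err:
    print(err)  # verb 'delete' is not allowed
```

The function returns an argv list rather than a string so that a caller can pass it to `subprocess.run` without a shell, which keeps shell metacharacters in model output from being interpreted.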
---
Progressive Autonomy in AI Integration
Approach:
- Human-in-the-loop – Early implementation phase to ensure safety & accuracy
- Incremental autonomy – Gradually expanding agent independence as trust grows
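
The two phases above can be sketched as a small decision function. The three autonomy levels and the rule that high-risk actions always need approval are assumptions chosen for illustration; the talk summary does not specify how autonomy is graded.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Hypothetical autonomy levels, from least to most independent."""
    SUGGEST_ONLY = 0       # agent proposes, human executes
    APPROVAL_REQUIRED = 1  # agent executes after human approval
    AUTONOMOUS = 2         # agent executes low-risk actions on its own


def next_step(level: Autonomy, high_risk: bool, approved: bool = False) -> str:
    """Return 'suggest', 'await_approval', or 'execute' for a proposed action."""
    if level == Autonomy.SUGGEST_ONLY:
        return "suggest"
    # Assumed rule: high-risk actions need a human even at full autonomy.
    needs_human = level == Autonomy.APPROVAL_REQUIRED or high_risk
    if needs_human and not approved:
        return "await_approval"
    return "execute"


print(next_step(Autonomy.APPROVAL_REQUIRED, high_risk=False))                 # await_approval
print(next_step(Autonomy.APPROVAL_REQUIRED, high_risk=False, approved=True))  # execute
print(next_step(Autonomy.AUTONOMOUS, high_risk=False))                        # execute
print(next_step(Autonomy.AUTONOMOUS, high_risk=True))                         # await_approval
```

Raising an agent's level then becomes a configuration change that can be made per agent or per action type as trust grows.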
---
Relevance to Broader Accelerator Platforms
The AiToEarn ecosystem shows how similar scalability concepts apply to creative workflows:
- AI-generated content creation
- Publishing across Douyin, Kwai, WeChat, YouTube, LinkedIn, X (Twitter)
- Analytics and AI model rankings
---
Roadmap
Salesforce’s AIOps team aims to:
- Automate 80% of manual tasks via agents
- Create a knowledge graph to unify system information
- Apply AI for advanced performance troubleshooting
---
Key Takeaway:
Whether managing thousands of Kubernetes clusters or orchestrating AI-powered creative production, intelligent multi-agent automation frees humans to focus on strategy and innovation, while machines handle scale, diagnosis, and repetitive tasks.