Dexmal ForceLing Open-Sources Dexbotic: A “Transformers” Library for Embodied Intelligence

📦 One-Stop Open-Source VLA Toolbox – Dexbotic

The Challenge of Real-World AI Actions

AI can now write code and create art — but why is something as simple as asking an AI to twist open a bottle cap still so difficult?

Opening a bottle cap requires three abilities working together in real time:

  • Eyes – Vision (V): recognize the object. Which one is the bottle? Where is its cap? What is the cap’s texture?
  • Brain – Language (L): understand the instruction. What does “twist open the cap” mean? Clockwise or counterclockwise? How much force?
  • Body – Action (A): execute precisely. At what angle should the fingers grip the cap? How much torque should be applied?

These three capabilities must work seamlessly together — that’s the job of the VLA (Vision–Language–Action) model, the core of embodied intelligence.

---

The State of VLA Research – Like Deep Learning in 2015

Modern VLA research feels like deep learning circa 2015:

  • Rapid algorithmic innovation (e.g., OpenVLA, RT-2, Pi0)
  • But engineering environments are fragmented

Problems faced by researchers:

  • Multiple environments: PyTorch + LLaMA2 here, JAX + PaLI there, TensorFlow + custom VLM elsewhere
  • Inconsistent datasets and loading scripts
  • Unfair benchmark settings — different training epochs, learning rates, architectures
  • Difficulty upgrading models: codebases are deeply tied to outdated VLMs (usually LLaMA2) rather than newer models such as Qwen2.5 or LLaMA3

The result: a lot of duplicated engineering work.

---

Lessons from NLP & CV Tooling

Other AI fields handle standardization better:

  • Frameworks like PyTorch, TensorFlow
  • Toolkits like MMDetection (CV) & Transformers (NLP)

Example: load BERT in a couple of lines, without touching its internals:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # downloads the weights and builds the model

---

Introducing Dexbotic – An Open-Source VLA Framework

Dexbotic by Dexmal 原力灵机 is:

  • Open-source
  • PyTorch-based
  • Designed to end “reinvent-the-wheel” VLA research

Previously, Dexmal released RoboChallenge — the world’s first large-scale real-robot benchmark platform.

Now, Dexbotic aims to solve the “lack of training standards” problem.

---

🚀 What Dexbotic Delivers

Three core improvements:

  • Unified Framework – run multiple mainstream VLA algorithms in one environment
  • Unified Data Format (Dexdata) – standard media + metadata storage
  • Stronger Pretrained Models – latest architecture (Qwen2.5-based)

---

1️⃣ Unified Framework

Supports Pi0, OpenVLA-OFT, CogACT — all switchable via one line:

class MyExp(BaseExp):
    model = "CogACT"  # Switch from Pi0 to CogACT

Model abstraction:

  • Vision-Language Model (VLM) – perception + understanding
  • Action Expert – execution strategy

Benefit: Swap VLM or Action Expert without rewriting pipelines.
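
To make that split concrete, here is a minimal, framework-agnostic sketch of the VLM + Action Expert composition in plain PyTorch. The class names, dimensions, and toy modules are illustrative assumptions, not Dexbotic's actual API:

import torch
import torch.nn as nn

# Toy stand-ins: in practice the VLM could be a Qwen2.5-based DexboticVLM and
# the Action Expert a policy head such as a diffusion or flow-matching module.
class ToyVLM(nn.Module):
    """Perception + understanding: encodes image + instruction into fused features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(3 * 32 * 32, feat_dim)
        self.text_proj = nn.Embedding(1000, feat_dim)

    def forward(self, image, instruction_ids):
        img_feat = self.image_proj(image.flatten(1))            # (B, D)
        txt_feat = self.text_proj(instruction_ids).mean(dim=1)  # (B, D)
        return img_feat + txt_feat                               # fused features

class ToyActionExpert(nn.Module):
    """Execution strategy: maps fused features to a chunk of 7-DoF actions."""
    def __init__(self, feat_dim=256, action_dim=7, chunk=8):
        super().__init__()
        self.head = nn.Linear(feat_dim, action_dim * chunk)
        self.action_dim, self.chunk = action_dim, chunk

    def forward(self, features):
        return self.head(features).view(-1, self.chunk, self.action_dim)

class VLAPolicy(nn.Module):
    """Composition: either component can be swapped without touching the other."""
    def __init__(self, vlm, action_expert):
        super().__init__()
        self.vlm, self.action_expert = vlm, action_expert

    def forward(self, image, instruction_ids):
        return self.action_expert(self.vlm(image, instruction_ids))

policy = VLAPolicy(ToyVLM(), ToyActionExpert())
actions = policy(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 12)))
print(actions.shape)  # torch.Size([2, 8, 7])

Because the policy depends only on the two interfaces, swapping in a different backbone or action head is a constructor change rather than a pipeline rewrite.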

---

2️⃣ Unified Data Format: Dexdata

Consistency eliminates loader headaches:

  • MP4 for videos
  • JSONL for per-frame metadata (robot states, instructions)

Advantages:

  • Compact storage via video compression (cuts size by >50%)
  • Train any supported VLA algorithm without format conversions (see the loader sketch below)
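
As referenced above, here is a minimal loader sketch for one such episode, pairing decoded MP4 frames with per-frame JSONL records. It assumes OpenCV for video decoding; the file names and metadata keys ("state", "instruction") are hypothetical placeholders, not the actual Dexdata schema:

import json
import cv2  # pip install opencv-python

def load_episode(video_path="episode_0001.mp4", meta_path="episode_0001.jsonl"):
    """Pair each decoded video frame with its per-frame metadata record."""
    # One JSON object per line: robot state, instruction, etc. (keys are illustrative)
    with open(meta_path) as f:
        records = [json.loads(line) for line in f]

    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)  # H x W x 3 array, BGR channel order
    cap.release()

    assert len(frames) == len(records), "video and metadata must be frame-aligned"
    return [
        {"image": frame, "state": rec.get("state"), "instruction": rec.get("instruction")}
        for frame, rec in zip(frames, records)
    ]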

---

3️⃣ Stronger Pretrained Models

DexboticVLM: a new base VLM built from scratch on Qwen2.5

Retrained Pi0, CogACT, OpenVLA-OFT, MemoryVLA → better benchmarks

Example – SimplerEnv Tasks:

  • CogACT: 51.3% → 69.5% (+18.2 percentage points)
  • OpenVLA-OFT: 30.2% → 76.4% (+46.2 percentage points)

Example – CALVIN Long-Horizon Tasks:

  • CogACT: 3.25 steps → 4.06 steps (+25%)

---

🌍 Simulation vs Real Robots

Real robot tests on UR5e, Franka, ALOHA, ARX5:

High success rate tasks:

  • UR5e placing plates – 100%
  • ALOHA stacking bowls – 90%
  • ARX5 searching green box – 80%

Low success rate tasks:

  • Tearing paper, pouring fries – 20–40%

Reason: Physical factors like friction & deformation.

---

Three-Layer Architecture

  • Data Layer – standard formats
  • Model Layer – supports all major VLAs
  • Experiment Layer – minimal researcher friction

Cloud + Local Ready:

  • Alibaba Cloud PAI, Volcano Engine for large-scale training
  • Local GPU ready — single RTX 4090 runs most models

---

Experiment-Centric Config System

No more YAML hassle — use Python class inheritance:

# Base config
class BaseExp:
    model = "DexboticVLM"
    lr = 1e-4
    epochs = 100

# Experiment override
class MyExp(BaseExp):
    lr = 5e-5  # Change only what’s needed

This follows the Open–Closed Principle: base configs stay closed to modification, while each experiment extends them with safe, targeted overrides.
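
A quick sanity check of the pattern in plain Python (nothing Dexbotic-specific): only the overridden field changes, and everything else is inherited from the base experiment.

exp = MyExp()
print(exp.model, exp.lr, exp.epochs)
# DexboticVLM 5e-05 100  (only lr differs from BaseExp)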

---

🔮 Designed for Future "Full-Body Control"

Current split:

  • Manipulation – robotic arms
  • Navigation – movement

Dexbotic supports both:

  • Manipulation: Pi0, CogACT, OpenVLA-OFT
  • Navigation: MUVLA

Vision: Unified training for robots that can walk and work.

---

⚙️ Open Source Hardware – DOS-W1

Low-cost open robotic arms:

  • Consumer-grade motors/sensors
  • Public design files → accessible for labs

Paired with:

  • Dexbotic (software brain)
  • RoboChallenge (testing arena)

---

📅 Announcement

On October 23 at 19:00, a Dexmal founding member will host a livestream about Dexbotic.

Scan the QR code in the original post for details.

---

For AI-centric content sharing and monetization:

  • The AiToEarn official site (AiToEarn官网) integrates AI content creation, distribution, and analytics
  • It publishes across Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X
  • The AiToEarn documentation (AiToEarn文档) covers research demo publishing and monetization, a fit for robotics and open-source workflows

Much as Dexbotic standardizes VLA research, AiToEarn standardizes AI content production pipelines, letting researchers share VLA experiments, benchmarks, and demos globally within minutes.

---

Summary:

Dexbotic =

  • Unified Framework
  • Unified Data Format
  • Latest Pretrained Models
  • Cloud + Local Ready
  • Future-Proof Architecture

> Links:

> 🔗 Official Site

> 🔗 Tech Report

> 🔗 GitHub Repo

> 🔗 Hugging Face Models

---
