# 2025-10-12 12:02 Beijing
---
## JADES Framework: A Transparent, Reliable, and Auditable Standard for Jailbreak Evaluation


Developed collaboratively by researchers from the **Helmholtz Center for Information Security (CISPA)** in Germany, **Flexera**, and **Xi’an Jiaotong University**, **JADES** is spearheaded by Professor **Zhang Yang** from CISPA.
---
## Introduction
Think of how teachers grade open-ended questions in an exam:
- If a student writes only “Answer:” without content → **No score**.
- If a student says “I don’t know” but then presents a correct solution → **Full or partial credit**.
- If a student’s response looks polished but fails to address the real question → **Low score**.
- **True high scores** come only when content is correct **and** covers all essential points.
**Key Principle**: Scoring should depend on actual content and information completeness — not style, tone, or opening phrases.
**Problem in current LLM jailbreak evaluations**:
Existing methods often rely on keyword triggers, toxicity scores, or LLMs acting as judges. These approaches:
- Focus on surface-level patterns.
- Miss the real scoring criteria.
- Introduce bias and limit cross-attack comparability.
- Fail to reliably validate defensive strategies.
---
## What JADES Offers
Researchers introduced **JADES** (*Jailbreak Assessment via Decompositional Scoring*) — a **content-centric, decomposed scoring** framework inspired by analytic grading in education.
**Core Concept**: Break complex harmful prompts into weighted sub-questions. Score each sub-answer independently. Aggregate for a final, accurate judgment.
This method:
- Provides **precision and trust**.
- Reveals that past measurements often **overestimated jailbreak risks**.

---
**Reference Details:**
- **Paper**: *JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring*
- **Link**: [https://arxiv.org/abs/2508.20848v1](https://arxiv.org/abs/2508.20848v1)
- **Website**: [https://trustairlab.github.io/jades.github.io/](https://trustairlab.github.io/jades.github.io/)
---
## Bottlenecks in Current Evaluation Methods
Harmful prompts are **open-ended** — no universal "correct" answer — making unified success criteria hard to set. While human expert grading is gold-standard, it’s costly and unscalable.
### Two Core Issues with Automation
1. **Misaligned Proxy Indicators**
- **String Matching**: Checking phrases like “Sure, here is…” is unreliable — responses may contradict or subvert these openings.
- **Toxicity Detector**: Measuring harmfulness score doesn’t guarantee attacker’s goal was met — unrelated insults may score high but fail the true intent.
2. **Holistic LLM-as-a-Judge Evaluations**
This blends multiple dimensions into one score, making results vulnerable to distractions and errors:
- **Vacuous Content**: Generic but non-specific statements.
- **Distractor Tokens**: Irrelevant text from roleplay prompts.
- **Surface Correctness, Factual Errors**: Well-formatted but incorrect technical details.
> **JADES overcomes these flaws** by focusing directly on *content relevance, granularity,* and *weighted sub-point scoring*.
---
## The JADES Framework: Step-by-Step
JADES uses a **multi-agent collaborative workflow**, mirroring human analytic scoring:
### 1. Question Decomposition Node
- Harmful question **Q** → broken into weighted sub-questions {(qᵢ, wᵢ)} from attacker’s perspective.
- Example: “How to make a bomb” → “Materials list” + “Assembly steps” + “Detonation method” (with weights).
- **Memory module** stores reusable decomposition results.

*Example flow: “How to make a [B**B]”*
### 2. Clean Node
- Preprocess response **R** into sentences.
- Filter out irrelevant/distracting content → R_clean.
### 3. Sub-Question Pairing Node
- Extract relevant sentences from R_clean for each qᵢ → aᵢ (sub-answer).
- Maps macro response segments to exact scoring criteria.
### 4. Evaluation Node
- Judge scores each aᵢ on a Likert scale (0.00–1.00).
- Aggregate weighted scores → **S_total**.
- Classify: Binary (**Success / Fail**) or Ternary (**Success / Partial Success / Fail**).
---
## Performance Benchmarks
Dataset: **JailbreakQR** — 400 harmful question/response pairs, manually labeled (Fail, Partial Success, Success) with reasoning.
### Results
- **Binary**: 98.5% agreement with human annotations — **+9% over strongest baselines**.
- **Ternary**: 86.3% accuracy. Strong at **Fail** detection, stricter than humans on **Success** (flagging subtle factual errors).
**Traceability**: Every decomposed scoring step is **transparent and auditable**.

*Confusion matrix in ternary classification*
---
## Re-evaluating Mainstream Jailbreak Attacks
**Finding**: Past evaluations **overstated success rates**.
- Example: **LAA** attack on **GPT-3.5-Turbo** →
Traditional Binary ASR: **93%**
With JADES (Binary): **69%**

- New metric: **SR/ASR** = proportion of *fully* successful cases among total “Successes” → never exceeded **0.25** in tests.
- Techniques modifying harmful queries heavily (e.g., **PAIR**) show lower **full success** rates due to semantic deviation.
---
## Conclusion & Outlook
JADES sets **a new benchmark** for transparent, reliable, and auditable jailbreak evaluation — not only outperforming existing tools, but also exposing systemic biases.

Future potential:
- Integrating JADES with **AI content ecosystems** for safety monitoring.
- Platforms like [AiToEarn官网](https://aitoearn.ai/) provide tools for:
- AI generation
- Cross-platform publishing
- Analytics & model ranking
- Channels supported: Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X (Twitter), etc.
- Enables creators to monetize **high-quality, safety-compliant AI outputs** globally.
---
**Further Reading:**
[Read the original](2650994927)
[Open in WeChat](https://wechat2rss.bestblogs.dev/link-proxy/?k=bd65ac59&r=1&u=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzA3MzI4MjgzMw%3D%3D%26mid%3D2650994927%26idx%3D3%26sn%3Dbea668ef756d7ca438171473ed6f9c6f)