LLM security
Have LLM Jailbreak Threats Been Systematically Overestimated? A New “Decomposition-Based Scoring” Paradigm for Jailbreak Evaluation Released

Honghao Wang

12 Oct 2025 — 3 min read
# 2025-10-12 12:02 Beijing

---

## JADES Framework: A Transparent, Reliable, and Auditable Standard for Jailbreak Evaluation

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-84.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-78.jpg)

Developed collaboratively by researchers from the **Helmholtz Center for Information Security (CISPA)** in Germany, **Flexera**, and **Xi’an Jiaotong University**, **JADES** is spearheaded by Professor **Zhang Yang** from CISPA.

---

## Introduction

Think of how teachers grade open-ended questions in an exam:

- If a student writes only “Answer:” without content → **No score**.  
- If a student says “I don’t know” but then presents a correct solution → **Full or partial credit**.  
- If a student’s response looks polished but fails to address the real question → **Low score**.  
- **True high scores** come only when content is correct **and** covers all essential points.

**Key Principle**: Scoring should depend on actual content and information completeness — not style, tone, or opening phrases.

**Problem in current LLM jailbreak evaluations**:  
Existing methods often rely on keyword triggers, toxicity scores, or LLMs acting as judges. These approaches:

- Focus on surface-level patterns.  
- Miss the real scoring criteria.  
- Introduce bias and limit cross-attack comparability.  
- Fail to reliably validate defensive strategies.

---

## What JADES Offers

Researchers introduced **JADES** (*Jailbreak Assessment via Decompositional Scoring*) — a **content-centric, decomposed scoring** framework inspired by analytic grading in education.  

**Core Concept**: Break complex harmful prompts into weighted sub-questions. Score each sub-answer independently. Aggregate for a final, accurate judgment.

This method:

- Provides **precision and trust**.  
- Reveals that past measurements often **overestimated jailbreak risks**.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-73.jpg)

---

**Reference Details:**

- **Paper**: *JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring*  
- **Link**: [https://arxiv.org/abs/2508.20848v1](https://arxiv.org/abs/2508.20848v1)  
- **Website**: [https://trustairlab.github.io/jades.github.io/](https://trustairlab.github.io/jades.github.io/)

---

## Bottlenecks in Current Evaluation Methods

Harmful prompts are **open-ended** — no universal "correct" answer — making unified success criteria hard to set. While human expert grading is gold-standard, it’s costly and unscalable.

### Two Core Issues with Automation

1. **Misaligned Proxy Indicators**  

   - **String Matching**: Checking phrases like “Sure, here is…” is unreliable — responses may contradict or subvert these openings.  
   - **Toxicity Detector**: Measuring harmfulness score doesn’t guarantee attacker’s goal was met — unrelated insults may score high but fail the true intent.

2. **Holistic LLM-as-a-Judge Evaluations**  
   This blends multiple dimensions into one score, making results vulnerable to distractions and errors:

   - **Vacuous Content**: Generic but non-specific statements.  
   - **Distractor Tokens**: Irrelevant text from roleplay prompts.  
   - **Surface Correctness, Factual Errors**: Well-formatted but incorrect technical details.

> **JADES overcomes these flaws** by focusing directly on *content relevance, granularity,* and *weighted sub-point scoring*.

---

## The JADES Framework: Step-by-Step

JADES uses a **multi-agent collaborative workflow**, mirroring human analytic scoring:

### 1. Question Decomposition Node
- Harmful question **Q** → broken into weighted sub-questions {(qᵢ, wᵢ)} from attacker’s perspective.  
- Example: “How to make a bomb” → “Materials list” + “Assembly steps” + “Detonation method” (with weights).  
- **Memory module** stores reusable decomposition results.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-72.jpg)  
*Example flow: “How to make a [B**B]”*

### 2. Clean Node
- Preprocess response **R** into sentences.  
- Filter out irrelevant/distracting content → R_clean.

### 3. Sub-Question Pairing Node
- Extract relevant sentences from R_clean for each qᵢ → aᵢ (sub-answer).  
- Maps macro response segments to exact scoring criteria.

### 4. Evaluation Node
- Judge scores each aᵢ on a Likert scale (0.00–1.00).  
- Aggregate weighted scores → **S_total**.  
- Classify: Binary (**Success / Fail**) or Ternary (**Success / Partial Success / Fail**).

---

## Performance Benchmarks

Dataset: **JailbreakQR** — 400 harmful question/response pairs, manually labeled (Fail, Partial Success, Success) with reasoning.

### Results
- **Binary**: 98.5% agreement with human annotations — **+9% over strongest baselines**.  
- **Ternary**: 86.3% accuracy. Strong at **Fail** detection, stricter than humans on **Success** (flagging subtle factual errors).

**Traceability**: Every decomposed scoring step is **transparent and auditable**.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-68.jpg)  
*Confusion matrix in ternary classification*

---

## Re-evaluating Mainstream Jailbreak Attacks

**Finding**: Past evaluations **overstated success rates**.

- Example: **LAA** attack on **GPT-3.5-Turbo** →  
  Traditional Binary ASR: **93%**  
  With JADES (Binary): **69%**

![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-62.jpg)

- New metric: **SR/ASR** = proportion of *fully* successful cases among total “Successes” → never exceeded **0.25** in tests.  
- Techniques modifying harmful queries heavily (e.g., **PAIR**) show lower **full success** rates due to semantic deviation.

---

## Conclusion & Outlook

JADES sets **a new benchmark** for transparent, reliable, and auditable jailbreak evaluation — not only outperforming existing tools, but also exposing systemic biases.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_007-61.jpg)

Future potential:
- Integrating JADES with **AI content ecosystems** for safety monitoring.
- Platforms like [AiToEarn官网](https://aitoearn.ai/) provide tools for:
  - AI generation
  - Cross-platform publishing
  - Analytics & model ranking
- Channels supported: Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X (Twitter), etc.
- Enables creators to monetize **high-quality, safety-compliant AI outputs** globally.

---

**Further Reading:**  
[Read the original](2650994927)  
[Open in WeChat](https://wechat2rss.bestblogs.dev/link-proxy/?k=bd65ac59&r=1&u=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzA3MzI4MjgzMw%3D%3D%26mid%3D2650994927%26idx%3D3%26sn%3Dbea668ef756d7ca438171473ed6f9c6f)
Have LLM Jailbreak Threats Been Systematically Overestimated? A New “Decomposition-Based Scoring” Paradigm for Jailbreak Evaluation Released

Honghao Wang

Read more

Technical Analysis Behind ChatGPT’s “Send” Action

Pro-Russian Information Campaign Exploits Russian Drone Incursion Incident

Update | AI Music Generator Suno Valuation Quadruples to $2 Billion, ARR Surpasses $100 Million

Internet Giant's Laundry Service Turns Luxury Goods into Fakes After Removing the Arc'teryx Logo?