Nanyang Technological University Reveals Complete Collapse of AI “Operational Safety” — Simple Disguises Can Fool All Models

Nanyang Technological University Reveals Complete Collapse of AI “Operational Safety” — Simple Disguises Can Fool All Models

Author Information

  • First Author: Jingdi Lei — PhD student at Nanyang Technological University, focusing on large language models (LLMs), particularly model reasoning, post-training, and alignment.
  • Corresponding Author: Soujanya Poria — Associate Professor, School of Electrical and Electronic Engineering, Nanyang Technological University.
  • Other Co-authors: From Walled AI Labs, Singapore’s Infocomm Media Development Authority (IMDA), and Lambda Labs.

---

When We Talk About AI Safety — What Does It Really Mean?

Is it violence, bias, or ethics? These are crucial, yes — but for AI in real-world business operations, there’s another critical yet overlooked safety risk:

> Your carefully designed “legal advice” chatbot eagerly providing medical diagnoses.

This is not just off-topic.

It is unsafe behavior.

Introducing: Operational Safety

Researchers from NTU and collaborating institutions propose Operational Safety — a concept that redefines the boundaries of AI safety in specific scenarios.

image

---

Paper at a Glance

---

Core Insight: Boundary-Crossing is a Safety Risk

Most AI safety work focuses on content filtering.

This paper elevates safety to role fidelity — ensuring AI remains within its defined task boundaries.

> If an AI cannot stick to its designated responsibilities, it is unsafe in practice — even if its outputs are otherwise “clean.”

Operational safety is therefore a necessary (but not sufficient) condition for general AI safety.

image

---

OffTopicEval — Benchmarking Operational Safety

To quantify operational safety, the team created OffTopicEval, the first benchmark focused on this risk.

It measures whether a model knows when to say “No” — regardless of knowledge or capability.

Features of the Benchmark

21 chatbot scenarios with clearly defined responsibilities. Tests include:

  • Direct OOD (Out-of-Domain) Questions
  • Obvious domain violations.
  • Adaptive OOD Questions
  • Disguised as in-domain but actually OOD (easy for humans to spot).
  • In-domain Questions
  • Evaluate appropriate refusal rates without over-refusing.

Dataset Size:

  • 210,000+ OOD samples
  • 3,000+ in-domain samples
  • Languages: English, Chinese, Hindi (different grammatical structures)

---

Evaluation Findings — The Harsh Truth

Models tested: GPT, LLaMA, Qwen, and others.

Result: Almost all fail operational safety tests.

Key Issues Identified:

  • Disguise Defeats Defenses:
  • Refusal rates drop by ~44% when questions are modestly disguised.
  • Gemma‑3 (27B) & Qwen‑3 (235B): drops over 70%.
  • Cross-lingual Weakness:
  • Issues exist in all languages, showing fundamental design limitations.

---

Operational Safety as a Deployment Standard

For customer service, law, healthcare, finance, and other professional domains, operational safety must be part of pre-launch evaluations.

Sustainability Link:

Platforms like AiToEarn官网 and its 开源工具 allow AI creators to integrate:

  • Generation
  • Cross-platform publishing
  • Analytics
  • Model ranking

This makes safe + monetizable AI possible across channels.

image
image

---

Risk Acceleration After First Failure

The study found:

Once a model is successfully deceived once, its refusal rate for simple OOD queries drops by >50%.

image

Example Scenario:

A bank chatbot starts recommending risky financial products just because a user rephrased a question.

In regulated industries — this is a catastrophic threat.

---

Regaining AI’s Professional Integrity

The paper doesn’t stop at exposing the problem — it tests solutions.

Attempted Interventions

  • Prompt-based steering
  • Activation steering
  • Parameter steering
  • → Both activation & parameter steering failed to strongly improve boundary adherence.

Effective Prompting Strategies

  • P-ground:
  • After receiving question, instruct model to forget the question and refocus on system prompt.
  • Q-ground:
  • Model rewrites user query into a minimal distilled form, then answers only that.

Results:

  • P-ground ↑ Llama-3.3 (70B) operational safety by 41%
  • Q-ground ↑ Qwen-3 (30B) by 27%
  • → Shows simple, retraining-free prompts can boost professional scope retention.
image

---

Summary — A Safety Manifesto

Key Takeaways:

  • Safety ≠ just content filtering — Role fidelity matters.
  • Boundary-crossing = risk — Models must reject out-of-scope queries confidently.
  • Operational safety = prerequisite for serious deployments.

The paper calls for:

  • New evaluation paradigms that reward limitation awareness
  • AI agents that are trustworthy, principled, and scope-bound

Industry Impact

As AI embeds into regulated, mission-critical systems, frameworks like OffTopicEval and platforms like AiToEarn官网 will be essential.

They support safe, effective, and monetizable AI deployment — keeping operational discipline front and center.

Read more