AI Safety

Nanyang Technological University Reveals Complete Collapse of AI “Operational Safety” — Simple Disguises Can Fool All Models

Honghao Wang

17 Oct 2025 — 4 min read

Author Information

First Author: Jingdi Lei — PhD student at Nanyang Technological University, focusing on large language models (LLMs), particularly model reasoning, post-training, and alignment.
Corresponding Author: Soujanya Poria — Associate Professor, School of Electrical and Electronic Engineering, Nanyang Technological University.
Other Co-authors: From Walled AI Labs, Singapore’s Infocomm Media Development Authority (IMDA), and Lambda Labs.

---

When We Talk About AI Safety — What Does It Really Mean?

Is it violence, bias, or ethics? These are crucial, yes — but for AI in real-world business operations, there’s another critical yet overlooked safety risk:

> Your carefully designed “legal advice” chatbot eagerly providing medical diagnoses.

This is not just off-topic.

It is unsafe behavior.

Introducing: Operational Safety

Researchers from NTU and collaborating institutions propose Operational Safety — a concept that redefines the boundaries of AI safety in specific scenarios.

---

Paper at a Glance

Title: OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
Paper: https://arxiv.org/pdf/2509.26495
Code: https://github.com/declare-lab/OffTopicEval
Dataset: https://huggingface.co/datasets/declare-lab/OffTopicEval

---

Core Insight: Boundary-Crossing is a Safety Risk

Most AI safety work focuses on content filtering.

This paper elevates safety to role fidelity — ensuring AI remains within its defined task boundaries.

> If an AI cannot stick to its designated responsibilities, it is unsafe in practice — even if its outputs are otherwise “clean.”

Operational safety is therefore a necessary (but not sufficient) condition for general AI safety.

---

OffTopicEval — Benchmarking Operational Safety

To quantify operational safety, the team created OffTopicEval, the first benchmark focused on this risk.

It measures whether a model knows when to say “No” — regardless of knowledge or capability.

Features of the Benchmark

21 chatbot scenarios with clearly defined responsibilities. Tests include:

Direct OOD (Out-of-Domain) Questions
Obvious domain violations.
Adaptive OOD Questions
Disguised as in-domain but actually OOD (easy for humans to spot).
In-domain Questions
Evaluate appropriate refusal rates without over-refusing.

Dataset Size:

210,000+ OOD samples
3,000+ in-domain samples
Languages: English, Chinese, Hindi (different grammatical structures)

---

Evaluation Findings — The Harsh Truth

Models tested: GPT, LLaMA, Qwen, and others.

Result: Almost all fail operational safety tests.

Key Issues Identified:

Disguise Defeats Defenses:
Refusal rates drop by ~44% when questions are modestly disguised.
Gemma‑3 (27B) & Qwen‑3 (235B): drops over 70%.
Cross-lingual Weakness:
Issues exist in all languages, showing fundamental design limitations.

---

Operational Safety as a Deployment Standard

For customer service, law, healthcare, finance, and other professional domains, operational safety must be part of pre-launch evaluations.

Sustainability Link:

Platforms like AiToEarn官网 and its 开源工具 allow AI creators to integrate:

Generation
Cross-platform publishing
Analytics
Model ranking

This makes safe + monetizable AI possible across channels.

---

Risk Acceleration After First Failure

The study found:

Once a model is successfully deceived once, its refusal rate for simple OOD queries drops by >50%.

Example Scenario:

A bank chatbot starts recommending risky financial products just because a user rephrased a question.

In regulated industries — this is a catastrophic threat.

---

Regaining AI’s Professional Integrity

The paper doesn’t stop at exposing the problem — it tests solutions.

Attempted Interventions

Prompt-based steering
Activation steering
Parameter steering
→ Both activation & parameter steering failed to strongly improve boundary adherence.

Effective Prompting Strategies

P-ground:
After receiving question, instruct model to forget the question and refocus on system prompt.
Q-ground:
Model rewrites user query into a minimal distilled form, then answers only that.

Results:

P-ground ↑ Llama-3.3 (70B) operational safety by 41%
Q-ground ↑ Qwen-3 (30B) by 27%
→ Shows simple, retraining-free prompts can boost professional scope retention.

---

Summary — A Safety Manifesto

Key Takeaways:

Safety ≠ just content filtering — Role fidelity matters.
Boundary-crossing = risk — Models must reject out-of-scope queries confidently.
Operational safety = prerequisite for serious deployments.

The paper calls for:

New evaluation paradigms that reward limitation awareness
AI agents that are trustworthy, principled, and scope-bound

Industry Impact

As AI embeds into regulated, mission-critical systems, frameworks like OffTopicEval and platforms like AiToEarn官网 will be essential.

They support safe, effective, and monetizable AI deployment — keeping operational discipline front and center.