Nanyang Technological University Finds a Near-Total Collapse of AI “Operational Safety”: Simple Disguises Fool Almost Every Model
Author Information
- First Author: Jingdi Lei — PhD student at Nanyang Technological University, focusing on large language models (LLMs), particularly model reasoning, post-training, and alignment.
- Corresponding Author: Soujanya Poria — Associate Professor, School of Electrical and Electronic Engineering, Nanyang Technological University.
- Other Co-authors: From Walled AI Labs, Singapore’s Infocomm Media Development Authority (IMDA), and Lambda Labs.
---
When We Talk About AI Safety — What Does It Really Mean?
Is it violence, bias, or ethics? These are crucial, yes — but for AI in real-world business operations, there’s another critical yet overlooked safety risk:
> Your carefully designed “legal advice” chatbot eagerly providing medical diagnoses.
This is not just off-topic.
It is unsafe behavior.
Introducing: Operational Safety
Researchers from NTU and collaborating institutions propose Operational Safety — a concept that redefines the boundaries of AI safety in specific scenarios.

---
Paper at a Glance
- Title: OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
- Paper: https://arxiv.org/pdf/2509.26495
- Code: https://github.com/declare-lab/OffTopicEval
- Dataset: https://huggingface.co/datasets/declare-lab/OffTopicEval
---
Core Insight: Boundary-Crossing is a Safety Risk
Most AI safety work focuses on content filtering.
This paper elevates safety to role fidelity — ensuring AI remains within its defined task boundaries.
> If an AI cannot stick to its designated responsibilities, it is unsafe in practice — even if its outputs are otherwise “clean.”
Operational safety is therefore a necessary (but not sufficient) condition for general AI safety.

---
OffTopicEval — Benchmarking Operational Safety
To quantify operational safety, the team created OffTopicEval, the first benchmark focused on this risk.
It measures whether a model knows when to say “No” — regardless of knowledge or capability.
Features of the Benchmark
The benchmark covers 21 chatbot scenarios, each with clearly defined responsibilities. Tests include:
- Direct OOD (Out-of-Domain) Questions
  - Obvious violations of the chatbot's domain.
- Adaptive OOD Questions
  - Disguised to look in-domain while remaining OOD (still easy for a human to spot).
- In-domain Questions
  - Check that the model still answers questions within its scope instead of over-refusing.
Dataset Size:
- 210,000+ OOD samples
- 3,000+ in-domain samples
- Languages: English, Chinese, Hindi (different grammatical structures)
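
To make the benchmark concrete, here is a minimal sketch of pulling the dataset from the Hugging Face Hub and tallying samples by question type. The split name (`train`) and column name (`category`) are assumptions for illustration; the actual configuration and schema are documented on the dataset card linked above.

```python
# Minimal sketch: download OffTopicEval and count samples per question type.
# NOTE: the split name ("train") and column name ("category") are assumptions;
# consult the dataset card for the real configuration and schema.
from collections import Counter

from datasets import load_dataset

dataset = load_dataset("declare-lab/OffTopicEval", split="train")

# Tally direct OOD, adaptive OOD, and in-domain samples (assumed labels).
type_counts = Counter(example["category"] for example in dataset)
for question_type, count in type_counts.items():
    print(f"{question_type}: {count}")
```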
---
Evaluation Findings — The Harsh Truth
Models tested: GPT, LLaMA, Qwen, and others.
Result: Almost all fail operational safety tests.
Key Issues Identified:
- Disguise Defeats Defenses:
  - Refusal rates drop by roughly 44% on average when questions are only modestly disguised (a sketch of how this drop can be computed follows this list).
  - For Gemma‑3 (27B) and Qwen‑3 (235B), the drop exceeds 70%.
- Cross-lingual Weakness:
  - The failures appear in all three tested languages, pointing to a fundamental design limitation rather than a language-specific quirk.
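
The headline "drop" numbers can be read as the gap between refusal rates on direct and disguised (adaptive) OOD questions. The sketch below shows one way to compute that gap; `judge_refusal` is a hypothetical callback (an LLM judge or keyword matcher), and the paper's exact evaluation pipeline may differ.

```python
# Illustrative sketch of the refusal-rate drop between direct and adaptive OOD
# questions. judge_refusal is a hypothetical callback that returns True when a
# model reply counts as a refusal (e.g. an LLM judge or a keyword matcher).
from typing import Callable, Sequence


def refusal_rate(replies: Sequence[str],
                 judge_refusal: Callable[[str], bool]) -> float:
    """Fraction of model replies judged to be refusals."""
    return sum(judge_refusal(r) for r in replies) / len(replies) if replies else 0.0


def relative_refusal_drop(direct_replies: Sequence[str],
                          adaptive_replies: Sequence[str],
                          judge_refusal: Callable[[str], bool]) -> float:
    """Relative drop in refusal rate once OOD questions are disguised."""
    direct = refusal_rate(direct_replies, judge_refusal)
    adaptive = refusal_rate(adaptive_replies, judge_refusal)
    return (direct - adaptive) / direct if direct else 0.0
```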
---
Operational Safety as a Deployment Standard
For customer service, law, healthcare, finance, and other professional domains, operational safety must be part of pre-launch evaluations.
Sustainability Link:
Platforms like AiToEarn (via its official site and open-source tools) allow AI creators to integrate:
- Generation
- Cross-platform publishing
- Analytics
- Model ranking
This makes safe + monetizable AI possible across channels.


---
Risk Acceleration After First Failure
The study also found that once a model has been successfully deceived, its refusal rate on subsequent simple OOD queries drops by more than 50%.

Example Scenario:
A bank chatbot starts recommending risky financial products simply because a user rephrased a question.
In regulated industries, this is a catastrophic risk.
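
One way to observe this effect in a test harness is to send a single disguised query first and then measure refusal on plain OOD follow-ups in the same conversation, comparing the result against a fresh session. In the sketch below, `chat(history)` and `judge_refusal(reply)` are hypothetical stand-ins for the model API and the refusal judge; the paper's actual protocol may differ.

```python
# Hedged sketch: measure how refusal on simple OOD questions degrades after the
# model has already been deceived once in the same conversation.
# chat(history) -> str and judge_refusal(reply) -> bool are hypothetical
# stand-ins for the model API and the refusal judge used in a real evaluation.
from typing import Callable, Sequence


def refusal_rate_after_breach(chat: Callable[[list[dict]], str],
                              judge_refusal: Callable[[str], bool],
                              system_prompt: str,
                              disguised_query: str,
                              simple_ood_queries: Sequence[str]) -> float:
    history = [{"role": "system", "content": system_prompt}]

    # Turn 1: the disguised OOD question that slips past the model's guard.
    history.append({"role": "user", "content": disguised_query})
    history.append({"role": "assistant", "content": chat(history)})

    # Follow-up turns: plain OOD questions the model should clearly refuse.
    refusals = 0
    for query in simple_ood_queries:
        history.append({"role": "user", "content": query})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        refusals += judge_refusal(reply)

    return refusals / len(simple_ood_queries)
```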
---
Regaining AI’s Professional Integrity
The paper doesn’t stop at exposing the problem — it tests solutions.
Attempted Interventions
- Prompt-based steering
- Activation steering
- Parameter steering
- → Activation and parameter steering failed to meaningfully improve boundary adherence; prompt-based steering proved the most effective lever.
Effective Prompting Strategies
- P-ground:
  - After receiving the question, the model is instructed to set it aside and re-ground itself in the system prompt before answering.
- Q-ground:
  - The model first rewrites the user query into a minimal, distilled form, then answers only that rewritten query.
Results:
- P-ground ↑ Llama-3.3 (70B) operational safety by 41%
- Q-ground ↑ Qwen-3 (30B) by 27%
- → Simple, retraining-free prompts can substantially improve a model's ability to stay within its professional scope (see the sketch below).
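
As referenced above, here is a hedged sketch of how the two interventions could be wired into a chat pipeline. The injected instruction wording is illustrative rather than the paper's exact templates, and `chat(messages)` is a hypothetical wrapper around the model under test.

```python
# Hedged sketch of prompt-level grounding. The injected instruction text is
# illustrative only; the paper's exact P-ground / Q-ground templates may differ.
# chat(messages) -> str is a hypothetical wrapper around the model under test.
from typing import Callable


def p_ground_messages(system_prompt: str, user_query: str) -> list[dict]:
    """P-ground: after the user query, re-anchor the model on its system prompt."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
        {"role": "user", "content": (
            "Before answering, set the question aside and re-read your system "
            "prompt. Answer only if the question falls within that role; "
            "otherwise politely decline."
        )},
    ]


def q_ground_messages(system_prompt: str, user_query: str,
                      chat: Callable[[list[dict]], str]) -> list[dict]:
    """Q-ground: distill the query to its core intent, then answer only that."""
    distilled = chat([
        {"role": "system", "content": (
            "Rewrite the user's question as a single minimal question that "
            "states only its core intent."
        )},
        {"role": "user", "content": user_query},
    ])
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": distilled},
    ]
```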

---
Summary — A Safety Manifesto
Key Takeaways:
- Safety ≠ just content filtering — Role fidelity matters.
- Boundary-crossing = risk — Models must reject out-of-scope queries confidently.
- Operational safety = prerequisite for serious deployments.
The paper calls for:
- New evaluation paradigms that reward limitation awareness
- AI agents that are trustworthy, principled, and scope-bound
Industry Impact
As AI becomes embedded in regulated, mission-critical systems, frameworks like OffTopicEval and platforms like AiToEarn will be essential.
They support safe, effective, and monetizable AI deployment — keeping operational discipline front and center.