Research Team Status
- Michael W. Mahoney (Research Scientist)
- N. Benjamin Erichson (Research Scientist)
- Zhipeng Wei (Postdoc)
- Collaborations with other universities: Yue Dong, UCR
Project Goals
- Objective: This project asks a concrete safety question: Do LLMs refuse harmful objectives that become apparent only after reasoning over a long context?
- This quarter, we continued to study this question through compositional reasoning attacks, where harmful requests are split into semantically incomplete fragments and embedded across long contexts. The final user query is neutral, so the model must retrieve the relevant fragments, compose their meaning, infer the implied harmful objective, and decide whether to refuse.
- Our goal is to understand the safety gap between harmful requests that models directly see and harmful objectives that models infer. Across 15 state-of-the-art models, we find that models often refuse explicit harmful requests, but fail to refuse when the same objectives must be reconstructed through long-context reasoning. This exposes a severe safety vulnerability in current LLMs and motivates our analysis of whether failures arise from retrieval difficulty, reasoning complexity, context length, or post-reconstruction safety judgment, as well as mitigation directions such as increased reasoning effort and safety-aware prompting.
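The prompt structure described above (fragments scattered across a long context, followed by a neutral query) can be sketched as follows, using benign fragments only. This is a minimal illustration under stated assumptions: the function name `build_prompt`, the filler paragraphs, and the order-preserving random placement are hypothetical and are not taken from the project's actual pipeline.

```python
import random


def build_prompt(fragments, filler, query, seed=0):
    """Scatter fragments across filler paragraphs and append a neutral query.

    Hypothetical helper: fragments keep their relative order, and each one is
    placed in a distinct gap between (or around) the filler paragraphs.
    """
    rng = random.Random(seed)
    # There are len(filler) + 1 gaps; pick one distinct gap per fragment.
    # random.sample raises ValueError if there are more fragments than gaps.
    slots = sorted(rng.sample(range(len(filler) + 1), len(fragments)))
    slot_to_fragment = dict(zip(slots, fragments))

    parts = []
    for i in range(len(filler) + 1):
        if i in slot_to_fragment:
            parts.append(slot_to_fragment[i])
        if i < len(filler):
            parts.append(filler[i])

    context = "\n\n".join(parts)
    # The final query is neutral: it never states the composed objective.
    return f"{context}\n\n{query}"


if __name__ == "__main__":
    # Benign example: the composed meaning is a harmless recipe step.
    fragments = ["Note A: the first item is flour.", "Note B: combine it with water."]
    filler = [f"Unrelated paragraph {i}." for i in range(5)]
    print(build_prompt(fragments, filler, "Using the notes above, what should I do?"))
```

In this sketch, varying `seed` or the number of filler paragraphs changes where the fragments land and how long the context is, which is one simple way to probe sensitivity to fragment placement and context length.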
Accomplishments
- We strengthened the empirical and conceptual foundation of the project. In response to reviewer feedback, we clarified the distinction between model reasoning capability and reasoning effort, revised terminology to avoid ambiguity, and refined the paper's framing around compositional reasoning rather than general model capability.
- We also ran several additional experiments to better isolate the mechanism behind the observed safety failures. A benign baseline study showed that models can reconstruct distributed information with high accuracy in non-harmful settings, suggesting that the failures we observe are not primarily retrieval or reconstruction failures, but failures of safety judgment after successful composition. We further validated the reliability of our automated safety evaluation using multiple independent judge models, finding strong agreement across evaluators.
- In addition, we evaluated mitigation directions. Safety-oriented system prompts improved refusal behavior, but introduced model-dependent over-refusal and did not fully eliminate failures. Analysis of chain-of-thought traces showed that increasing reasoning effort can shift models from failing to recognize harmful intent toward safer refusals, supporting the view that these failures reflect an activation gap rather than a complete absence of safety capability.
- Finally, we expanded the scope of evaluation and positioning. We tested additional benchmarks and fragment distribution patterns, compared our threat model more carefully against related work on long-context and reasoning-based attacks, and clarified that compositional attacks differ from direct retrieval, prompt-level obfuscation, and many-shot jailbreaks because harmful intent emerges only through multi-step reasoning over distributed context.
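The judge-agreement check mentioned above can be illustrated with a short sketch. The report does not state which agreement statistic was used; Cohen's kappa over pairs of judges is an assumption here, and the function names and the verdict format (one label per evaluated response, per judge) are hypothetical.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length lists of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both judges used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def pairwise_kappa(verdicts):
    """Map each pair of judge names to the kappa of their verdict lists."""
    return {
        (j1, j2): cohen_kappa(verdicts[j1], verdicts[j2])
        for j1, j2 in combinations(sorted(verdicts), 2)
    }


if __name__ == "__main__":
    # Illustrative labels only (1 = unsafe response, 0 = safe or refusal).
    verdicts = {
        "judge_a": [1, 1, 0, 0],
        "judge_b": [1, 1, 0, 1],
        "judge_c": [1, 1, 0, 0],
    }
    for pair, kappa in pairwise_kappa(verdicts).items():
        print(pair, round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example, refusal) dominates the evaluation set.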
Publications and Presentations