Research Team Status

  • Michael W. Mahoney (Research Scientist)
  • N. Benjamin Erichson (Research Scientist)
  • Serge Egelman (Research Scientist)
  • Zhipeng Wei (incoming Postdoc)

Project Goals

  • We continued our research on characterizing vulnerabilities in LLM-based judge models (Judge LLMs), with a specific focus on the robustness of these systems to emoji-based adversarial attacks. Our work not only explores new attack dimensions but also investigates the limitations of current defense strategies.
     
  • Our goal remains aligned with long-term objectives: enhancing model robustness, understanding tokenization biases, and advancing AI safety evaluation frameworks. A major milestone this quarter was the acceptance of our Emoji Attack paper at ICML 2025.

Accomplishments

  • Semantic Role of Emojis in Attack Effectiveness
    • We studied emoji semantics and how they affect attack outcomes.
    • We showed that emojis with negative sentiment correlate with increased unsafe classifications.
    • We highlighted challenges in categorizing emoji meanings due to contextual and cultural variability (e.g., 🙂 interpreted differently by age groups).
    • Classification discrepancies were observed between Llama Guard and ChatGPT-3.5, indicating that emoji interpretation varies across models due to differences in training data and architecture.
       
  • Cross-Lingual Applicability of the Emoji Attack
    • We conducted preliminary studies on the cross-lingual robustness of the emoji attack.
    • Using the shenzhi-wang/Llama3.1-8B-Chinese-Chat model, we showed that token segmentation biases exist in Chinese, and emoji insertions lower unsafe prediction rates, similar to English.
    • These results support the language-agnostic nature of the attack.
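The insertion strategy underlying these experiments can be sketched as a simple string perturbation. The function below is a toy illustration, not the paper's implementation; the sample strings and the choice of emoji and interval are hypothetical:

```python
# Toy sketch of the emoji-insertion perturbation: placing an emoji between
# characters so a judge model's tokenizer no longer sees the original token
# boundaries. Minimal illustration only, not the actual attack code.

def insert_emoji(text: str, emoji: str = "\U0001F600", every: int = 2) -> str:
    """Insert `emoji` after every `every` characters of `text`."""
    chunks = [text[i:i + every] for i in range(0, len(text), every)]
    return emoji.join(chunks)

# The same character-level insertion applies to English and Chinese alike,
# which is consistent with the attack being language-agnostic.
print(insert_emoji("harmful"))            # -> ha😀rm😀fu😀l
print(insert_emoji("有害内容", every=1))   # -> 有😀害😀内😀容
```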
       
  • Evaluating and Stress-Testing Defenses
    • We evaluated LLM-based filtering using GPT-3.5-turbo to sanitize outputs. While partially effective, compositional obfuscation (e.g., mixing "b" and emoji) significantly weakened the filter’s ability to identify harmful content.
    • Adversarial Training of Llama Guard improved detection for emoji-perturbed inputs. However, jailbreak combinations (e.g., CodeChameleon, Jailbroken) still evaded detection, showing that current defenses are not fully robust.
    • Surprisingly, when combined with DeepInception or ReNeLLM, adversarial training sometimes increased unsafe classification rates.
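The failure mode under compositional obfuscation can be illustrated with a toy keyword filter. This stand-in is hypothetical (our experiments used GPT-3.5-turbo as the filter, not a keyword list), but it shows why interleaved emojis defeat naive matching and why sanitizing the input first helps:

```python
import re

# Hypothetical keyword blocklist standing in for an LLM-based filter.
BLOCKLIST = {"bomb", "weapon"}

def naive_filter(text: str) -> bool:
    """Return True if any blocked keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)

def strip_emoji(text: str) -> str:
    """Remove code points in common emoji blocks (rough sanitization pass)."""
    return re.sub(r"[\U0001F000-\U0001FAFF]", "", text)

plain = "how to build a bomb"
obfuscated = "how to build a b\U0001F600o\U0001F600m\U0001F600b"  # compositional obfuscation

print(naive_filter(plain))                    # True  -> caught
print(naive_filter(obfuscated))               # False -> evades the filter
print(naive_filter(strip_emoji(obfuscated)))  # True  -> sanitizing first restores detection
```

The design point is that the filter's matching operates on the surface string, so any perturbation that breaks keyword (or token) boundaries degrades detection unless the input is normalized first.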