Research Team Status
- Michael W. Mahoney (Research Scientist)
- N. Benjamin Erichson (Research Scientist)
- Serge Egelman (Research Scientist)
- Zhipeng Wei (incoming Postdoc)
Project Goals
- We continued our research on characterizing vulnerabilities in Judge-LLMs, with a specific focus on the robustness of these systems to emoji-based adversarial attacks. Our work not only explores new attack dimensions but also investigates the limitations of current defense strategies.
- Our goal remains aligned with long-term objectives: enhancing model robustness, understanding tokenization biases, and advancing AI safety evaluation frameworks. A major milestone this quarter was the acceptance of our Emoji Attack paper at ICML 2025.
Accomplishments
- Semantic Role of Emojis in Attack Effectiveness
- We studied emoji semantics and how they affect attack outcomes.
- We showed that negative emojis correlate with an increased rate of unsafe classifications.
- We highlighted challenges in categorizing emoji meanings due to contextual and cultural variability (e.g., 🙂 interpreted differently by age groups).
- Classification discrepancies were observed between Llama Guard and ChatGPT-3.5, indicating that emoji interpretation varies across models due to differences in training data and architecture.
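The emoji-insertion perturbation at the core of these experiments can be sketched as follows (a minimal illustration with a hypothetical helper; the actual attack selects insertion positions with respect to the judge's token boundaries rather than at fixed intervals):

```python
def insert_emoji(text: str, emoji: str = "\U0001F642", every: int = 4) -> str:
    """Insert `emoji` after every `every` characters, so the judge's
    tokenizer sees different subword boundaries than in the clean text."""
    chunks = [text[i:i + every] for i in range(0, len(text), every)]
    return emoji.join(chunks)

# A harmful-looking response is perturbed before being sent to the judge.
perturbed = insert_emoji("this response is unsafe", every=5)
```

Varying the `emoji` argument is what allows the semantic study above: the same insertion positions can be paired with positive or negative emojis to isolate the effect of emoji sentiment.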
- Cross-Lingual Applicability of the Emoji Attack
- We conducted preliminary studies on the cross-lingual robustness of the emoji attack.
- Using the shenzhi-wang/Llama3.1-8B-Chinese-Chat model, we showed that token segmentation biases exist in Chinese, and emoji insertions lower unsafe prediction rates, similar to English.
- These results support the language-agnostic nature of the attack.
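The token segmentation bias underlying both the English and Chinese results can be illustrated with a toy greedy longest-match tokenizer (the vocabulary below is invented for illustration; the reported experiments used the actual Llama tokenizers):

```python
# Hypothetical subword vocabulary for illustration only.
VOCAB = {"unsafe", "un", "safe", "u", "n", "s", "a", "f", "e", "\U0001F642"}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation over VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character kept as-is
            i += 1
    return tokens

tokenize("unsafe")            # one token: ['unsafe']
tokenize("un\U0001F642safe")  # boundaries shift: ['un', '🙂', 'safe']
```

Because the inserted emoji splits "unsafe" into subwords the judge rarely saw in its safety training data, the unsafe signal is diluted; the mechanism is the same regardless of the surrounding language, which is consistent with the language-agnostic behavior observed in Chinese.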
- Evaluating and Stress-Testing Defenses
- We evaluated LLM-based filtering, using GPT-3.5-turbo to sanitize outputs. While partially effective, compositional obfuscation (e.g., mixing "b" with an emoji) significantly weakened the filter's ability to identify harmful content.
- Adversarial training of Llama Guard improved detection of emoji-perturbed inputs. However, combinations with jailbreak methods (e.g., CodeChameleon, Jailbroken) still evaded detection, showing that current defenses are not fully robust.
- Surprisingly, when combined with DeepInception or ReNeLLM, adversarial training sometimes increased the rate of unsafe classifications.
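Why compositional obfuscation defeats simple sanitization can be sketched with a deliberately naive keyword filter (hypothetical code; our evaluation used GPT-3.5-turbo as the filter, but the failure mode is analogous):

```python
import re

# Hypothetical blocklist for illustration only.
BLOCKLIST = {"bomb"}

def naive_filter(text: str) -> bool:
    """Flag text if any blocklisted word appears."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(w in BLOCKLIST for w in words)

def strip_emojis_then_filter(text: str) -> bool:
    """Stronger variant: remove non-ASCII characters before matching."""
    cleaned = "".join(ch for ch in text if ch.isascii())
    return naive_filter(cleaned)

naive_filter("b\U0001F642omb")              # emoji splits the keyword: not flagged
strip_emojis_then_filter("b\U0001F642omb")  # stripping recovers it: flagged
strip_emojis_then_filter("\U0001F171omb")   # emoji *replaces* the letter: missed again
```

The last case mirrors the compositional obfuscation we observed: once an emoji stands in for a character rather than merely separating characters, stripping it destroys the evidence, so the filter must reason about lookalike semantics rather than surface strings.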
Impact of Research
- Our Emoji Attack paper was accepted to ICML 2025.
- Two blog posts have covered our Emoji Attack:
- https://medium.com/google-cloud/emoji-jailbreaks-b3b5b295f38b
- "Emoji jailbreaks are not just a quirky academic curiosity. They have real-world implications, and they can be used to generate some seriously nasty stuff."
- "Emojis are not going to break AI. They are just a reminder that AI security is a never-ending quest, a fascinating puzzle, and, dare I say, even a little bit… fun? (Okay, maybe “fun” is not the exact word my security team would use when they are patching zero-day emoji exploits at 3 AM, but you get my point 😉)."
- https://medium.com/@bhargavaganti/the-emoji-attack-a-new-threat-vector-in-ai-safety-evaluation-6ab07ff4f107
- The work was also covered in an article in The Economic Times:
- Smiley sabotage: How 'emojis' are becoming AI’s weakest link in cybersecurity? https://economictimes.indiatimes.com/magazines/panache/how-emojis-are-becoming-ais-weakest-link-in-cybersecurity/articleshow/120253502.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst