Research Team Status

  • Michael W. Mahoney (Research Scientist)
  • N. Benjamin Erichson (Research Scientist)
  • Serge Egelman (Research Scientist)
  • Zhipeng Wei (incoming Postdoc)

Project Goals

  • We continued our research on characterizing vulnerabilities in LLM-based judge models (Judge LLMs), with a specific focus on the robustness of these systems to emoji-based adversarial attacks. Our work not only explores new attack dimensions but also investigates the limitations of current defense strategies.
     
  • Our goal remains aligned with long-term objectives: enhancing model robustness, understanding tokenization biases, and advancing AI safety evaluation frameworks. A major milestone this quarter was the acceptance of our Emoji Attack paper at ICML 2025.

Accomplishments

  • Semantic Role of Emojis in Attack Effectiveness
    • We studied emoji semantics and how they affect attack outcomes.
    • We showed that emojis with negative sentiment correlate with increased unsafe classifications.
    • We highlighted challenges in categorizing emoji meanings due to contextual and cultural variability (e.g., 🙂 interpreted differently by age groups).
    • Classification discrepancies were observed between Llama Guard and ChatGPT-3.5, indicating that emoji interpretation varies across models due to differences in training data and architecture.
       
  • Cross-Lingual Applicability of the Emoji Attack
    • We conducted preliminary studies on the cross-lingual robustness of the emoji attack.
    • Using the shenzhi-wang/Llama3.1-8B-Chinese-Chat model, we showed that token segmentation biases exist in Chinese, and emoji insertions lower unsafe prediction rates, similar to English.
    • These results support the language-agnostic nature of the attack.
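The insertion strategy underlying these experiments can be sketched as a simple string perturbation. The function below is a toy illustration, not the paper's implementation; the sample strings and the choice of emoji and interval are hypothetical:

```python
# Toy sketch of the emoji-insertion perturbation: placing an emoji between
# characters so a judge model's tokenizer no longer sees the original token
# boundaries. Minimal illustration only, not the actual attack code.

def insert_emoji(text: str, emoji: str = "\U0001F600", every: int = 2) -> str:
    """Insert `emoji` after every `every` characters of `text`."""
    chunks = [text[i:i + every] for i in range(0, len(text), every)]
    return emoji.join(chunks)

# The same character-level insertion applies to English and Chinese alike,
# which is consistent with the attack being language-agnostic.
print(insert_emoji("harmful"))            # -> ha😀rm😀fu😀l
print(insert_emoji("有害内容", every=1))   # -> 有😀害😀内😀容
```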
       
  • Evaluating and Stress-Testing Defenses
    • We evaluated LLM-based filtering using GPT-3.5-turbo to sanitize outputs. While partially effective, compositional obfuscation (e.g., mixing "b" and emoji) significantly weakened the filter’s ability to identify harmful content.
    • Adversarial Training of Llama Guard improved detection for emoji-perturbed inputs. However, jailbreak combinations (e.g., CodeChameleon, Jailbroken) still evaded detection, showing that current defenses are not fully robust.
    • Surprisingly, when combined with DeepInception or ReNeLLM, adversarial training sometimes increased unsafe classification rates.
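The failure mode under compositional obfuscation can be illustrated with a toy keyword filter. This stand-in is hypothetical (our experiments used GPT-3.5-turbo as the filter, not a keyword list), but it shows why interleaved emojis defeat naive matching and why sanitizing the input first helps:

```python
import re

# Hypothetical keyword blocklist standing in for an LLM-based filter.
BLOCKLIST = {"bomb", "weapon"}

def naive_filter(text: str) -> bool:
    """Return True if any blocked keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)

def strip_emoji(text: str) -> str:
    """Remove code points in common emoji blocks (rough sanitization pass)."""
    return re.sub(r"[\U0001F000-\U0001FAFF]", "", text)

plain = "how to build a bomb"
obfuscated = "how to build a b\U0001F600o\U0001F600m\U0001F600b"  # compositional obfuscation

print(naive_filter(plain))                    # True  -> caught
print(naive_filter(obfuscated))               # False -> evades the filter
print(naive_filter(strip_emoji(obfuscated)))  # True  -> sanitizing first restores detection
```

The design point is that the filter's matching operates on the surface string, so any perturbation that breaks keyword (or token) boundaries degrades detection unless the input is normalized first.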