2025 Q2 | Science of Security Virtual Organization

2025 Q2

Improving Malware Classifiers with Plausible Novel Samples

Research Team Status

Names of researchers and position
(e.g. Research Scientist, PostDoc, Student (Undergrad/Masters/PhD))
Skyler Grandel, PhD student
Dung Thuy "Judy" Nguyen, PhD student
Kailani "Cai" Lemieux-Mack, PhD student
Yifan Zhang, PhD student
Preston Robinette, PhD student
Eli Jiang, undergraduate student
Evelyn Guo, undergraduate student

Any new collaborations with other universities/researchers?
- Worked with CMU on a recent reverse engineering/decompilation project. They provided raw data concerning human comprehension of decompilation enhanced with AI models.

Project Goals

What is the current project goal?
- This quarter, we have been addressing the robustness and generalizability of machine learning models, which will ultimately contribute to the enhancement of neural network-based malware classifiers. We previously reported advancements in machine unlearning and purification techniques. This quarter, we investigated a new approach to domain generalization based on an interpolative style transfer technique that enables clients in a federated learning scenario to improve model performance while retaining data privacy. In the context of malware classification, this can enable federated learning scenarios in which different clients own different subsets of malware and can transfer between them -- in turn, providing a basis for synthesizing plausible novel samples for improving classification defense.
How does the current goal factor into the long-term goal of the project?
- The overall goal of the project is to improve neural malware classifiers through the consideration of new malware classes and families. Domain shift is a key issue: accurate detection as adversaries advance requires the ability to foresee what "tomorrow's malware" will look like. This quarter's developments contribute another approach to improving the generalizability and robustness of machine learning classifiers, specifically by improving transfer between domains, which in turn can be applied to shifting between properties of malware samples.

Accomplishments

Address whether project milestones were met. If milestones were not met, explain why, and what are the next steps.
- We are on track to meet the Year 2 milestones. We previously reported PBP and MalMixer as part of Year 1's effort to develop malware augmentation techniques. We also previously reported machine unlearning advancements. The current quarter's work entails important advancements in model robustness and generalizability, which are critical for advancing malware classification.
What is the contribution to foundational cybersecurity research? Was there something discovered or confirmed?
- Domain generalization is an important problem in machine learning due to domain shift -- that new unseen samples may contain properties that do not match anything previously seen during the training of a model. While it is possible to retrain a model on new data, this requires substantial labeling effort and large amounts of computational resources. Further, the time spent waiting to retrain a model may mean missing important in-the-wild samples that do not match an existing domain. Previous techniques make critical assumptions about the distribution of training samples among clients and about the number of clients that participate in a federated learning scenario. Our approach involves the creation of a straightforward vector of statistics of each domain within each client, which can be shared among clients to enable transfer between clients without sharing client data (and while retaining reasonable training performance). This enhances the robustness of the global model while better maintaining privacy among clients.
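
The idea of sharing per-domain statistics rather than raw data can be illustrated with a minimal sketch. This is not the project's implementation: it assumes the statistics are per-dimension means and standard deviations of feature vectors, and it substitutes a simple AdaIN-style renormalization for the transfer step. The function names (`domain_style_vector`, `interpolate_styles`, `transfer_style`) and the toy numbers are hypothetical.

```python
from statistics import fmean, pstdev


def domain_style_vector(features):
    """Summarize one domain as (per-dimension means, per-dimension stds).

    Assumption: `features` is a list of equal-length feature vectors.
    Only this summary would leave the client, never the samples.
    """
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


def interpolate_styles(styles, weights):
    """Blend several shared style summaries with normalized weights."""
    total = sum(weights)
    norm = [w / total for w in weights]
    dims = len(styles[0][0])
    means = [sum(w * s[0][d] for w, s in zip(norm, styles)) for d in range(dims)]
    stds = [sum(w * s[1][d] for w, s in zip(norm, styles)) for d in range(dims)]
    return means, stds


def transfer_style(features, target_style, eps=1e-8):
    """Re-normalize local features so their statistics match the target style."""
    src_means, src_stds = domain_style_vector(features)
    tgt_means, tgt_stds = target_style
    return [
        [
            (x - m) / (s + eps) * ts + tm
            for x, m, s, ts, tm in zip(row, src_means, src_stds, tgt_stds, tgt_means)
        ]
        for row in features
    ]


# Toy usage: one client restyles its local features toward a blend of two
# summaries received from other clients (hypothetical values).
local = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
shared_a = ([0.0, 0.0], [1.0, 1.0])
shared_b = ([2.0, 4.0], [3.0, 1.0])
mixed = interpolate_styles([shared_a, shared_b], [1.0, 1.0])
restyled = transfer_style(local, mixed)
```

In this sketch, a client only ever receives the small `(means, stds)` summaries from other clients, which is the property the paragraph above relies on for privacy. Whether such summaries leak information in practice depends on the feature extractor and is outside the scope of the example.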
Impact of research
- Internal to the university (coursework/curriculum)
  - None new to report.
- External to the university (transition to industry/government (local/federal); patents, start-ups, software, etc.)
  - None to report.
- Any acknowledgements, awards, or references in media?
  - None to report.

Publications and presentations

Add publication reference in the publications section below. An author's copy or final version should be added in the report file(s) section. This is for NSA's review only.
Optionally, upload technical presentation slides that may go into greater detail. For NSA's review only.

Lead PI:

Kevin Leach

Co-PI(s):

Taylor Johnson

Report Materials

Publications

PARDON: Privacy-Aware and Robust Federated Domain Generalization