2025 Q2 | Science of Security Virtual Organization

Leveraging Machine Learning for Binary Software Understanding

Research Team Status

Names of researchers and position
(e.g. Research Scientist, PostDoc, Student (Undergrad/Masters/PhD))
- Yan Shoshitaishvili - Lead PI, Associate Professor
- Adam Doupé - Co-I, Associate Professor
- Chitta Baral - Co-I, Professor
- Zion Basque - PhD Student
- Yibo Liu - PhD Student
- Ati Priya Bajaj - PhD Student
- Chang Zhu - PhD Student
- Divij Handa - PhD Student
- William Gibbs - PhD Student
- Michael Tompkins - PhD Student
- Jayakrishna Vadayath - PhD Student
Any new collaborations with other universities/researchers?
- None.

Project Goals

What is the current project goal?
- Task 2 (Option Year 1): Higher-level decompilation abstraction. The focus here is to abstract the binary software beyond the decompiled code into human-level representations.
  - Task 2.1: Code to Human Description
  - Task 2.2: Translating Decompiled Code
  - Task 2.3: Code to High Level Structural Representations
How does the current goal factor into the long-term goal of the project?
- Long-Term Goal: Achieving binary software understanding, in order to make identifying security issues much easier and cheaper.
- Task 2 builds upon the foundations created by Task 1 by working towards describing code in natural language, translating it into a variety of programming languages, and lifting it to more abstract structural representations such as flow graphs or state transition diagrams.

Accomplishments

Address whether project milestones were met. If milestones were not met, explain why, and what are the next steps.
What is the contribution to foundational cybersecurity research? Was there something discovered or confirmed?
Impact of research
- Internal to the university (coursework/curriculum)
- External to the university (transition to industry/government (local/federal); patents, start-ups, software, etc.)
- Any acknowledgements, awards, or references in media?

Recompilable Decompilation:

March - July 2025: The goal of this project is to make angr's decompiled code recompilable: the decompiled code should not only compile successfully, but the resulting binary should also faithfully reproduce the original functionality. We validate this by trying to achieve byte equivalence between the recompiled binary and the original.
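
As a rough illustration of what a byte-equivalence check involves, the sketch below compares the machine-code bytes of one function taken from the original binary and from the recompiled binary. It is a minimal sketch, not the project's pipeline: the names ByteDiff and compare_function_bytes are hypothetical, and extracting the function bytes from each binary is assumed to have happened elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ByteDiff:
    equivalent: bool               # True only if both byte strings are identical
    first_mismatch: Optional[int]  # offset of the first differing byte, if any
    matching_bytes: int            # number of positions where both sides agree


def compare_function_bytes(original: bytes, recompiled: bytes) -> ByteDiff:
    """Compare the machine code of one function from two builds (hypothetical helper)."""
    pairs = list(zip(original, recompiled))
    matching = sum(a == b for a, b in pairs)
    first = next((i for i, (a, b) in enumerate(pairs) if a != b), None)
    if first is None and len(original) != len(recompiled):
        # One side is a strict prefix of the other: they diverge where the shorter ends.
        first = min(len(original), len(recompiled))
    return ByteDiff(first is None, first, matching)


if __name__ == "__main__":
    # Illustrative bytes only; the second build differs in one byte.
    print(compare_function_bytes(b"\x55\x48\x89\xe5\xc3", b"\x55\x48\x89\xec\xc3"))
```

Reporting the first mismatching offset, rather than only a yes/no answer, makes it easier to map a difference back to a specific instruction when investigating why a function failed to match.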

Decompiled code typically does not recompile out of the box because it does not conform to the C syntax rules expected by compilers like GCC. We have developed a preliminary pipeline that attempts to recompile the decompiled code and verify the functionality of the recompiled binary.

As discussed during our last update, we have an updated implementation that attempts to recompile all functions. We are currently analyzing the common errors encountered when recompiling decompiled code from both stripped binaries and binaries with debug information.
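
One way such an error analysis could be organized is to group compiler diagnostics by message after masking the quoted identifiers, so that errors differing only in a variable or type name fall into the same bucket. The sketch below assumes GCC's usual "file:line:col: error: message" diagnostic format; the function name categorize_errors and the sample messages are illustrative assumptions, not output from our pipeline.

```python
import re
from collections import Counter

# GCC reports errors as "<file>:<line>:<col>: error: <message>".
ERROR_RE = re.compile(r"^(?P<file>[^:\n]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.+)$")
# Quoted tokens (ASCII or typographic quotes) vary per function; mask them so
# that errors of the same kind are counted together.
QUOTED_RE = re.compile(r"[‘'`][^’'`]*[’']")


def categorize_errors(stderr_text: str) -> Counter:
    """Count error diagnostics by normalized message (hypothetical helper)."""
    counts: Counter = Counter()
    for line in stderr_text.splitlines():
        match = ERROR_RE.match(line)
        if match:
            counts[QUOTED_RE.sub("<ID>", match.group("msg"))] += 1
    return counts


if __name__ == "__main__":
    # Made-up diagnostics for illustration.
    sample = "\n".join([
        "f.c:3:5: error: unknown type name 'my_struct_t'",
        "f.c:9:12: error: unknown type name 'other_t'",
        "f.c:14:1: error: expected ';' before '}' token",
    ])
    for message, count in categorize_errors(sample).most_common():
        print(count, message)
```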

Software Reconstruction and Collaborative Reverse Engineering

January - March 2025: The research investigates collaborative dynamics in software reconstruction within reverse engineering (RE), focusing on human factors in the recovery and recompilation phases. Unlike traditional RE, which is often an individual effort, this study explores reconstruction as a team-driven process, particularly in large-scale projects like video game recovery. The research analyzes the methodologies and workflows used by the video game community, a highly active and diverse group that engages in cross-platform and multi-language software reconstruction.

[Image Link: Software Reconstruction and Collaborative Reverse Engineering]

April - June 2025: To ground this investigation, we conducted a survey targeting experienced reverse engineers involved in collaborative reconstruction projects. The survey collected both qualitative and quantitative data on contributors’ challenges, workflows, and tools. By examining the lived experiences of practitioners, we gained valuable insights into knowledge sharing, role distribution, decision-making processes, and the use of ML-based tools and CI/CD pipelines. The responses also shed light on common obstacles in the recompilation phase, such as achieving byte-level equivalence, managing toolchain compatibility, and handling missing or undocumented components.

Focusing on the highly active video game reverse engineering community, this research uncovers not only the technical strategies but also the social infrastructure that enables distributed collaboration—ranging from onboarding newcomers to resolving conflicts and coordinating large teams.

The results contribute to a deeper understanding of RE as a social and collaborative process. Findings from the survey inform recommendations for improving tool support, promoting sustainable project practices, and fostering effective team communication. Ultimately, this work supports the broader goal of software preservation by redefining RE through the lens of collaborative reconstruction.

TYGR: TYGR was accepted to USENIX Security 2024. We presented the paper at the conference and have open-sourced the tool: https://github.com/sefcom/TYGR

AI-Assisted Reverse Engineering and Decompilers

We are currently preparing a resubmission of this work, with the aim of improving the outcomes the broader community sees when using AI in reverse engineering. Specifically, we are focusing on data analysis to identify ways people can better prepare to use AI, particularly large language models (LLMs), during reverse engineering.

More findings can be found in our paper draft, which is available upon request. Our tool, developed from this paper, is also available at https://github.com/mahaloz/DAILA.

Rust Decompilation

March - July 2025: Our research on Rust decompilation aims to develop a Rust decompiler on top of the angr C/C++ decompiler to generate Rust pseudocode. We have finished a prototype Rust decompiler called Oxidizer, which is able to produce Rust pseudocode close to the original Rust source code. Recently we have made improvements to struct memory layout recovery and type recovery. The updated decompilation of our manually crafted running example is shown below:

[Image Link: Rust Decompilation]

Our contributions are:

- We systematically study the Rust-specific decompilation issues that arise when decompiling Rust binaries using state-of-the-art C decompilers. We also identify which techniques should be implemented in a Rust decompiler to solve these issues.
- We design a new decompilation pipeline and incorporate multiple techniques to address these decompilation issues. Future researchers can improve existing techniques or introduce new techniques based on our Rust decompiler prototype.
- We thoroughly study the effectiveness of Oxidizer by evaluating it on uutils coreutils, a Rust rewrite of the GNU Coreutils suite, and on real-world Rust malware samples.

We are currently evaluating Oxidizer and preparing a new user study to show how effectively it helps analysts with real-world reverse engineering tasks.

Annotating Decompiled Code with LLM for Fuzzing

We are currently working on a fuzzing project in which LLMs read decompiled code and annotate it with feedback for the fuzzer. The annotations guide the fuzzer's exploration using the techniques described in the paper “Ijon: Exploring deep state spaces via fuzzing” by Aschermann et al.

On source code, this approach yields great results, and we believe we can apply it to binary code as well by using the latest results from decompilation, static binary rewriting, and LLMs.
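
To make the idea of annotation-based feedback concrete, the sketch below mimics the value-maximization style of annotation described in the Ijon paper: the annotated program reports an internal value, and the fuzzer treats an input as interesting whenever that value exceeds the best seen so far. This is a toy model under stated assumptions, not our implementation. The class MaxFeedback, the function run_target, and the slot name "player_x" are hypothetical, and in our setting the choice of which value to annotate would come from the LLM rather than being hard-coded.

```python
from typing import Dict, Hashable


class MaxFeedback:
    """Tracks the best value seen per annotation slot (maximization-style feedback)."""

    def __init__(self) -> None:
        self.best: Dict[Hashable, int] = {}

    def report(self, slot: Hashable, value: int) -> bool:
        """Record a value; return True if it beats the previous best for this slot."""
        previous = self.best.get(slot)
        if previous is None or value > previous:
            self.best[slot] = value
            return True
        return False


def run_target(data: bytes, feedback: MaxFeedback) -> bool:
    """Hypothetical target: 'R' moves right, 'L' moves left; the annotation exposes x."""
    x = 0
    interesting = False
    for byte in data:
        if byte == ord("R"):
            x += 1
        elif byte == ord("L"):
            x = max(0, x - 1)
        # Annotation point: report the internal state the fuzzer should maximize.
        interesting |= feedback.report("player_x", x)
    return interesting


if __name__ == "__main__":
    feedback = MaxFeedback()
    for candidate in (b"RR", b"RL", b"RRR"):
        # A fuzzer would keep the candidate in its corpus only when this is True.
        print(candidate, run_target(candidate, feedback))
```

Plain code coverage would not distinguish these inputs, since they all exercise the same branches; the reported value is what lets the fuzzer prefer inputs that make progress in the program's state.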

Publications and presentations

Add publication reference in the publications section below. An author's copy or final version should be added in the report file(s) section. This is for NSA's review only.
Optionally, upload technical presentation slides that may go into greater detail. For NSA's review only.

No new published papers since last quarterly report.

Lead PI:

Yan Shoshitaishvili

Co-PI(s):

Adam Doupé