2025 Q1 | Science of Security Virtual Organization

Leveraging Machine Learning for Binary Software Understanding

Research Team Status

Names of researchers and position
(e.g. Research Scientist, PostDoc, Student (Undergrad/Masters/PhD))
- Yan Shoshitaishvili - Lead PI, Associate Professor
- Adam Doupé - Co-I, Associate Professor
- Chitta Baral - Co-I, Professor
- Zion Basque - PhD Student
- Yibo Liu - PhD Student
- Ati Priya Bajaj - PhD Student
- Chang Zhu - PhD Student
- Divij Handa - PhD Student
- William Gibbs - PhD Student
- Michael Tompkins - PhD Student
Any new collaborations with other universities/researchers?
- None

Project Goals

What is the current project goal?
- Task 2 (Option Year 1): Higher-level decompilation abstraction. The focus here is to abstract the binary software beyond the decompiled code into human-level representations.
  - Task 2.1: Code to Human Description
  - Task 2.2: Translating Decompiled Code
  - Task 2.3: Code to High Level Structural Representations
How does the current goal factor into the long-term goal of the project?
- Long-Term Goal: Achieving binary software understanding, in order to make identifying security issues much easier and cheaper.
- Task 2 builds upon the foundations created by Task 1 by working towards describing code in natural language, translating it into a variety of programming languages, and lifting it to more abstract structural representations such as flow graphs or state transition diagrams.

Accomplishments

Address whether project milestones were met. If milestones were not met, explain why, and what are the next steps.
What is the contribution to foundational cybersecurity research? Was there something discovered or confirmed?
Impact of research
- Internal to the university (coursework/curriculum)
- External to the university (transition to industry/government (local/federal); patents, start-ups, software, etc.)
- Any acknowledgements, awards, or references in media?

Recompilable Decompilation:

January - March 2025: The goal of this project is to make angr's decompiled code recompilable, so that the decompiled code not only compiles successfully but the resulting binary also exhibits the intended behavior. A key focus is verifying that the recompiled binaries faithfully reproduce the original functionality. We perform this validation by trying to achieve byte equivalence.
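A byte-equivalence check of this kind can be sketched in a few lines of Python. This is only an illustration, not the project's actual validation code: the report does not say at what granularity bytes are compared, so the sketch assumes the bytes of an original function and its recompiled counterpart have already been extracted, and the helper names `first_mismatch` and `byte_equivalent` are hypothetical.

```python
from typing import Optional


def first_mismatch(original: bytes, recompiled: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if the inputs are identical."""
    for offset, (a, b) in enumerate(zip(original, recompiled)):
        if a != b:
            return offset
    if len(original) != len(recompiled):
        # One input is a strict prefix of the other, so they diverge
        # at the end of the shorter one.
        return min(len(original), len(recompiled))
    return None


def byte_equivalent(original: bytes, recompiled: bytes) -> bool:
    """True when the two byte sequences match exactly."""
    return first_mismatch(original, recompiled) is None
```

Reporting the first mismatching offset, rather than only a yes/no answer, makes it easier to locate the instruction where the recompiled code starts to diverge.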

Decompiled code typically does not recompile out of the box because it does not conform to the C syntax rules expected by compilers like GCC. We have developed a preliminary pipeline that attempts to recompile the decompiled code and verify the functionality of the recompiled binary.

As discussed in our last update, recompiling each function individually makes exact recovery of function prototypes and data types (or, in some cases, the presence of debug information) essential, yet that level of precision may not always be necessary to achieve recompilation. We are therefore updating our recompilation pipeline to recompile all decompiled functions in an object file together. This keeps the function prototypes of caller and callee functions consistent with one another; inconsistencies between them accounted for many of the earlier errors. With this change, we expect to see the “real” recompilation errors, such as those involving global data accesses, compiler intrinsics, and some compiler syntax errors. The next step is to link these recompiled object files.
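The idea of compiling all functions of an object file together can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline itself: it assumes each decompiled function is available as a prototype string plus a body string, and `build_translation_unit` and `gcc_compile_command` are hypothetical helpers. The only external tool referenced is the standard `gcc -c` invocation, which compiles a source file to an object file without linking.

```python
from typing import Dict, List, Tuple


def build_translation_unit(functions: Dict[str, Tuple[str, str]]) -> str:
    """Combine decompiled functions into one C source string.

    `functions` maps a function name to (prototype, body), where the
    prototype has no trailing semicolon and the body includes its braces.
    """
    lines: List[str] = []
    # Forward-declare every function first, so each caller is compiled
    # against the same prototype that its callee is defined with.
    for name in sorted(functions):
        prototype, _ = functions[name]
        lines.append(f"{prototype};")
    lines.append("")
    # Then emit the definitions themselves.
    for name in sorted(functions):
        prototype, body = functions[name]
        lines.append(f"{prototype}\n{body}\n")
    return "\n".join(lines)


def gcc_compile_command(source_path: str, object_path: str) -> List[str]:
    """Build the argument list for compiling one source file to an object file."""
    return ["gcc", "-c", source_path, "-o", object_path]


if __name__ == "__main__":
    decompiled = {
        "main": ("int main(void)", "{\n    return helper(2);\n}"),
        "helper": ("int helper(int x)", "{\n    return x + 1;\n}"),
    }
    print(build_translation_unit(decompiled))
    print(" ".join(gcc_compile_command("unit.c", "unit.o")))
```

Because every prototype appears once at the top of the file, a mismatch between how a caller uses a function and how that function is defined surfaces as a single compiler diagnostic instead of as separate per-function failures.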

Software Reconstruction and Collaborative Reverse Engineering

January - March 2025: The research investigates collaborative dynamics in software reconstruction within reverse engineering (RE), focusing on human factors in the recovery and recompilation phases. Unlike traditional RE, which is often an individual effort, this study explores reconstruction as a team-driven process, particularly in large-scale projects like video game recovery. The research analyzes the methodologies and workflows used by the video game community, a highly active and diverse group that engages in cross-platform and multi-language software reconstruction.

[Image Link: Software Reconstruction and Collaborative Reverse Engineering]

By studying these projects, the research aims to uncover the technical and social aspects of collaboration in RE, including knowledge sharing, role distribution, and decision-making. Additionally, it examines challenges in recompilation, such as preserving software functionality, handling missing dependencies, and improving tool support for team-based workflows. Insights from this study will contribute to understanding RE as a collaborative effort, inform the development of better reconstruction tools, and support the broader goal of software preservation. Ultimately, this research redefines RE beyond solo efforts, emphasizing teamwork in tackling complex software recovery challenges.

TYGR: TYGR was accepted to USENIX Security 2024. We presented the paper at the conference and have open-sourced the tool: https://github.com/sefcom/TYGR

AI-Assisted Reverse Engineering and Decompilers

January - March 2025: We have submitted our recent work, REaLLM, which studies how humans, decompilers, and LLMs interact when reverse engineering software, to CCS 2025. The work sheds light on effective strategies for human-LLM teaming in software reverse engineering. A shortlist of the more interesting findings:

- The best strategy for LLM use is quick, frequent, and not deeply involved. Within a function, the top performers used an LLM only once, but they used it at least once in many places.
- Interestingly, greater LLM use on large functions gave worse results than on small functions.
- Novices approached expert level on some functions with the assistance of LLMs.

More findings can be found in our paper draft, which is available upon request. Our tool, developed from this paper, is also available at https://github.com/mahaloz/DAILA.

Rust Decompilation

January - March 2025: Our research on Rust decompilation aims to develop a Rust decompiler, built on top of the C/C++ decompiler angr, that generates Rust pseudocode. We have finished a prototype Rust decompiler called Oxidizer, which is able to produce Rust pseudocode close to the original Rust source code. Oxidizer’s decompilation of our manually crafted running example is shown below:

[Image Link: Rust Decompilation]

Our contributions are:

- We systematically study the Rust-specific issues that arise when decompiling Rust binaries with state-of-the-art C decompilers. We also identify which techniques a Rust decompiler should implement to solve these issues.
- We designed a new decompilation pipeline and incorporated multiple techniques to address these decompilation issues. Future researchers can improve existing techniques or introduce new ones based on our Rust decompiler prototype.
- We thoroughly study the effectiveness of Oxidizer by evaluating it on uutils coreutils, a Rust rewrite of the GNU Coreutils suite, and on real-world Rust malware samples.

We are currently preparing a user study to evaluate how effective Oxidizer is at helping analysts with real-world reverse engineering tasks.

Publications and presentations

Add publication reference in the publications section below. An author's copy or final should be added in the report file(s) section. This is for NSA's review only.
Optionally, upload technical presentation slides that may go into greater detail. For NSA's review only.

No new published papers since last quarterly report.

Lead PI:

Yan Shoshitaishvili

Co-PI(s):

Adam Doupé