Formal characterization of the memory used by a program is an important basis for security analyses, compositional verification, and identification of noninterference.
However, soundly proving memory usage requires operating on the assembly level due to the semantic gap between high-level languages and the code that processors actually execute.
Automated methods, such as model checking, would not be able to handle many interesting functions due to the undecidability of memory usage.
Fully-interactive methods do not scale well either.
Sound control flow recovery (CFR) is also important for binary decompilation, verification, patching, and security analysis.
It lifts raw unstructured data into a form that allows reasoning over behavior and semantics.
However, doing so requires interpreting the behavior of the program when indirect or dynamic control flow exists, creating a recursive dependency.
This dissertation tackles the first property with two contributions that perform proof generation combined with interactive theorem proving in a semi-automated manner:
an untrusted tool extracts as much information as it can from the functions under test and then generates all the necessary proofs to be completed in a theorem prover.
The first, Floyd-style approach still requires significant manual effort but provides good flexibility and ensures no paths are analyzed more than once.
In contrast, the second, Hoare-style approach sacrifices some flexibility and avoidance of repeated path evaluation in order to achieve much greater automation.
However, neither approach can handle the dynamic control flow caused by indirect branching.
The second property is handled by the second set of contributions of this dissertation.
These two contributions provide fully-automated methods of recovering control flow from binaries even in the presence of indirect branching.
When such dynamic control flow cannot be overapproximatively resolved, it is clearly noted in the resultant output.
In the first approach to control flow recovery, a structured memory representation allows for general analysis of control flow in the presence of indirection, gaining scalability by utilizing context-free function analysis.
It supports various aliasing conditions via the usage of nondeterminism, with multiple output states potentially being produced from a given input state.
The second approach adds function context and abstract interpretation-inspired modeling of the C++ exception handling (EH) application binary interface (ABI), allowing for the discovery of previously-unknown paths while maintaining or increasing automation. / Doctor of Philosophy / Modern computer programs are so complicated that individual humans cannot manually check all but the smallest programs to make sure they are correct and secure.
This is even worse if you want to reduce the trusted computing base (TCB), the stuff that you have to assume is working right in order to say a program will execute correctly.
The TCB includes your computer itself, but also whatever tools were used to take the programs written by programmers and transform them into a form suitable for running on a computer. Such tools are often called compilers.
One method of reducing the TCB is to examine the lowest-level representation of that program, the assembly or even machine code that is actually run by your computer.
This poses unique challenges, because operating on such a low level means you do not have a lot of the structure that a more abstract, higher-level representation provides.
Also, sometimes you want to formally state things about a program's behavior; that is, say things about what it does with a high degree of confidence based on mathematical principles.
You may also want to verify that one or more of those statements are true.
If you want to be detailed about that behavior, you may need to know all of the chunks, or regions, in random-access memory (RAM) that are used by that program.
RAM, henceforth referred to as just ``memory'', is your computer's first place of storage for the information used by running programs.
This is distinct from long-term storage devices like hard disk drives (HDDs) or solid-state drives (SSDs), which programs do not normally have direct access to.
Unfortunately, there is no one single approach that can automatically determine with absolute certainty for all cases the exact regions of memory that are read or written.
This is called undecidability, and means that you need to approximate those memory regions a lot of the time if you want to have a significant degree of automation.
An underapproximation, an approach that only gives you some of the regions, is not useful for formal statements as it might miss out on some behavior; it is unsound.
This means that you need an overapproximation, an approach that is guaranteed to give you at least the regions read or written.
Therefore, the first contribution of this dissertation is a preliminary approach to such an overapproximation.
This approach is based on the work of Robert L. Floyd, focusing on the direct control flow (where the steps of a program go) in an individual function (structured program component).
It still requires a lot of user effort, including having to manually specify the regions in memory that were possibly used and do a lot of work to prove that those regions are (overapproximatively) correct, so our tests were limited in scope.
The second contribution automated a lot of the manual work done for the first approach.
It is based on the work of Charles Antony Richard Hoare, who developed a verification approach focusing on the syntax (the textual form) of programs.
This contribution produces what we call formal memory usage certificates (FMUCs), which are formal statements that the regions of memory they describe are the only ones possibly affected by the functions under test.
These statements also come with proofs, which for our work are like scripts used to verify that the things the FMUCs assert about the corresponding functions can be shown to be true given the assumptions our FMUCs have.
Sometimes those proofs are incomplete, though, such as when there is a loop (repeated bit of code) in a function under test or one function calls (executes) another.
In those cases, a user has to finish the proof, in the first case by weakening (removing information from) the FMUC's statements about the loop and in the second by composing, or combining, the FMUCs of the two functions.
Additionally, this second approach cannot handle dynamic control flow.
Such control flow occurs when the low-level instructions a program uses to move to another place in that program do not have a pre-stored location to go to.
Instead, that location is supplied as the program is running.
This is opposed to direct control flow, where the place to go to is hard-coded into the program when it is compiled.
The tool also cannot not deal with aliasing, which is when different state parts (value-holding components) of a program contain the same value and that value is used as the numeric address or identifier of a location in memory.
Specifically, it cannot deal with potential aliasing, when there is not enough information available to determine if the state parts alias or not.
Because of that, we had to add extra assumptions to the FMUCs that limited them to those cases where ambiguous memory-referencing state parts referred to separate memory locations.
Finally, it specifically requires assembly as input; you cannot directly supply a binary to it.
This is also true of the first contribution.
Because of this, we were able to test on more functions than before, but not a lot more.
Not being able to deal with dynamic control flow is a big problem, as almost all programs use it.
For example, when a function reaches its end, it has to figure out where to return to based on the current state of the program (in the previous contribution, this was done manually).
This means that control flow recovery (CFR) is very important for many applications, including decompilation (converting a program back into a higher-level form), patching (updating a program in place without modifying the original code and recompiling it), and low-level analysis or verification in general.
However, as you may have noticed from earlier in this paragraph, in order to deal with such dynamic control flow you need to figure out what the possible destinations are for the individual control flow transfers.
That can require knowing where you came from in the program, which means that analysis of dynamic control flow requires context (in this context, information previously obtained in the program).
Even worse, it is another undecidable problem that requires overapproximation.
To soundly recover control flow, we developed Hoare graphs (HGs), the third contribution of this dissertation.
HGs use memory models that take the form of forests, or collections of tree data structures.
A single tree represents a region in memory that may have multiple symbolic references, or abstract representations of a value.
The children of the tree represent regions used in the program that are enclosed within their parent tree elements.
Now, instead of assuming that all ambiguous memory regions are separate, we can use them under various aliasing conditions.
We have also implemented support for some forms of dynamic control flow.
Those that are not supported are clearly marked in the resultant HG.
No user interaction is required even when loops are present thanks to a methodology that automatically reduces the amount of information present at a re-executed instruction until the information stabilizes.
Function composition is also automatic now thanks to a method that treats each function as its own context in a safe and automated way, reducing memory consumption of our tool and allowing larger programs to be examined.
In the process we did lose the ability to deal with recursion (functions that call themselves or call other functions that call back to the original), though.
Lastly, we provided the ability to directly load binaries into the tool, no external disassembly (converting machine code into human-readable instructions) needed.
This all allowed much greater testing than before, with applications to multiple programs and program libraries.
The fourth and final contribution of this dissertation iterates on the HG work by narrowing focus to the concept of exceptional control flow.
Specifically, it models the kind of exception handling used by C++ programs.
This is important as, if you want to explore a program's behavior, you need to know all the places it goes to.
If you use a tool that does not model exception handling, you may end up missing paths of execution caused by unwinding.
This is when an exception is thrown and propagates up through the program's current stack of function calls, potentially reaching programmer-supplied handling for that exception.
Despite this, commonplace tools for static, low-level program analysis do not model such unwinding.
The control flow graph (CFG) produced by our exception-aware tool are called exceptional interprocedural control flow graphs (EICFGs).
These provide information about the exceptions being thrown and what paths they take in the program when they are thrown.
Additional improvements are a better methodology for handling dynamic control flow as well adding back in support for recursion.
All told, this allowed us to explore even more programs than ever before.
Identifer | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/114054 |
Date | 07 March 2023 |
Creators | Bockenek, Joshua Alexander |
Contributors | Electrical and Computer Engineering, Ravindran, Binoy, Verbeek, Freek, Jung, Changhee, Hsiao, Michael S., Plassmann, Paul E. |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Language | English |
Detected Language | English |
Type | Dissertation |
Format | ETD, application/pdf |
Rights | Creative Commons Attribution 4.0 International, http://creativecommons.org/licenses/by/4.0/ |
Page generated in 0.0023 seconds