Introduction
In this publication, we delve into Hex-Rays microcode and explore techniques for manipulating the generated CTree to deobfuscate and annotate decompiled code. The final section includes a practical example demonstrating how to annotate a custom import table for malware analysis.
This guide is meant to help reverse engineers and malware analysts better understand the internal structures used during IDA's function decompilation. We advise keeping an eye on the Hex-Rays SDK that can be found under IDA PRO’s plugins directory, all the structures discussed below are sourced from it.
Architecture
Hex-Rays decompiles a function through a multistage process starting with the disassembled code of a function:
-
Assembly code to microcode:
It does a conversion of the assembly instructions that are stored in aninsn_t
structure to microcode instructions represented by aminsn_t
structure -
CTree generation:
From the optimized microcode, Hex-Rays generates the Abstract Syntax Tree(AST), its nodes are either statements (cinsn_t
) or expressions (cexpr_t
); note that bothcinsn_t
andcexpr_t
inherit from thecitem_t
structure
Microcode
Microcode is an intermediate language (IL) used by Hex-Rays, generated by lifting the assembly code of a binary. This has multiple advantages, one of which is that it is processor-independent.
The following screenshot displays the assembly and decompiled code, alongside its microcode extracted using Lucid, a tool that facilitates microcode visualization.
We can access the MBA (microcode block array) through the cfunc_t
structure of a decompiled function with the MBA field.
Tip: we get the cfunc_t
of a decompiled function with the ida_hexrays.decompile
.
mba_t
is an array of micro blocks mblock_t
, the first block represents the entry point of the function and the last one represents the end. Micro blocks (mblock_t
) are structured in a double linked list, we can access the next / previous block with nextb
/prevb
fields respectively. Each mblock_t
includes a double linked list of microcode instructions minsn_t
, accessed by the field head
for the first instruction of the block and tail
for the last instruction of the block. The mblock_t
structure is depicted in the following code snippet.
class mblock_t{//...public:mblock_t *nextb; ///< next block in the doubly linked listmblock_t *prevb; ///< previous block in the doubly linked listuint32 flags; ///< combination of \ref MBL_ bitsea_t start; ///< start addressea_t end; ///< end addressminsn_t *head; ///< pointer to the first instruction of the blockminsn_t *tail; ///< pointer to the last instruction of the blockmba_t *mba;
A microcode instruction minsn_t
is a double linked list, each microcode instruction contains 3 operands: left, right, and destination. We can access the next/previous microcode instruction of the same block with next
/prev
fields; the opcode field is an enumeration (mcode_t
) of all the microinstruction opcodes, for example, the m_mov
enum represents the mov
opcode.
class minsn_t{//...public:mcode_t opcode; ///< instruction opcode enumerationint iprops; ///< combination of \ref IPROP_ bitsminsn_t *next; ///< next insn in doubly linked list. check also nexti()minsn_t *prev; ///< prev insn in doubly linked list. check also previ()ea_t ea; ///< instruction addressmop_t l; ///< left operandmop_t r; ///< right operandmop_t d; ///< destination operand//...enum mcode_t{m_nop = 0x00, // nop // no operationm_stx = 0x01, // stx l, {r=sel, d=off} // store register to memorym_ldx = 0x02, // ldx {l=sel,r=off}, d // load register from memorym_ldc = 0x03, // ldc l=const, d // load constantm_mov = 0x04, // mov l, d // movem_neg = 0x05, // neg l, d // negatem_lnot = 0x06, // lnot l, d // logical not//...};
Each operand is of type mop_t
, depending on the type (accessed with the t
field) it can hold registers, immediate values, and even nested microcode instructions. As an example, the following is the microcode of a function with multiple nested instructions:
class mop_t{public:/// Operand type.mopt_t t;union{mreg_t r; // mop_r register numbermnumber_t *nnn; // mop_n immediate valueminsn_t *d; // mop_d result (destination) of another instructionstkvar_ref_t *s; // mop_S stack variableea_t g; // mop_v global variable (its linear address)int b; // mop_b block number (used in jmp,call instructions)mcallinfo_t *f; // mop_f function call informationlvar_ref_t *l; // mop_l local variablemop_addr_t *a; // mop_a variable whose address is takenchar *helper; // mop_h helper function namechar *cstr; // mop_str utf8 string constant, user representationmcases_t *c; // mop_c casesfnumber_t *fpc; // mop_fn floating point constantmop_pair_t *pair; // mop_p operand pairscif_t *scif; // mop_sc scattered operand info};#...}/// Instruction operand typestypedef uint8 mopt_t;const mopt_tmop_z = 0, ///< nonemop_r = 1, ///< register (they exist until MMAT_LVARS)mop_n = 2, ///< immediate number constantmop_str = 3, ///< immediate string constant (user representation)#...
The microcode generation progresses through various maturity levels, also referred to as optimization levels. The initial level, MMAT_GENERATED
, involves the direct translation of assembly code into microcode. The final optimization level before generating the CTree is MMAT_LVARS
.
enum mba_maturity_t{MMAT_ZERO, ///< microcode does not existMMAT_GENERATED, ///< generated microcodeMMAT_PREOPTIMIZED, ///< preoptimized pass is completeMMAT_LOCOPT, ///< local optimization of each basic block is complete.///< control flow graph is ready too.MMAT_CALLS, ///< detected call argumentsMMAT_GLBOPT1, ///< performed the first pass of global optimizationMMAT_GLBOPT2, ///< most global optimization passes are doneMMAT_GLBOPT3, ///< completed all global optimization. microcode is fixed now.MMAT_LVARS, ///< allocated local variables};
Microcode traversal example
The following Python code is used as an example of how to traverse and print the microcode instructions of a function, it traverses the microcode generated at the first maturity level (MMAT_GENERATED
).
import idaapiimport ida_hexraysimport ida_linesMCODE = sorted([(getattr(ida_hexrays, x), x) for x in filter(lambda y: y.startswith('m_'), dir(ida_hexrays))])def get_mcode_name(mcode):"""Return the name of the given mcode_t."""for value, name in MCODE:if mcode == value:return namereturn Nonedef parse_mop_t(mop):if mop.t != ida_hexrays.mop_z:return ida_lines.tag_remove(mop._print())return ''def parse_minsn_t(minsn):opcode = get_mcode_name(minsn.opcode)ea = minsn.eatext = hex(ea) + " " + opcodefor mop in [minsn.l, minsn.r, minsn.d]:text += ' ' + parse_mop_t(mop)print(text)def parse_mblock_t(mblock):minsn = mblock.headwhile minsn and minsn != mblock.tail:parse_minsn_t(minsn)minsn = minsn.nextdef parse_mba_t(mba):for i in range(0, mba.qty):mblock_n = mba.get_mblock(i)parse_mblock_t(mblock_n)def main():func = idaapi.get_func(here()) # Gets the function at the current cursormaturity = ida_hexrays.MMAT_GENERATEDmbr = ida_hexrays.mba_ranges_t(func)hf = ida_hexrays.hexrays_failure_t()ida_hexrays.mark_cfunc_dirty(func.start_ea)mba = ida_hexrays.gen_microcode(mbr, hf, None, ida_hexrays.DECOMP_NO_WAIT, maturity)parse_mba_t(mba)if __name__ == '__main__':main()
The script's output is presented below: on the left, the printed microcode in the console, and on the right, the assembly code by IDA:
CTree
In this section, we'll dive into the core elements of Hex-Rays CTree structure, then proceed to a practical example demonstrating how to annotate a custom import table of malware that loads APIs dynamically.
For a better understanding, we will be leveraging the following plugin (hrdevhelper) that allows us to view the CTree nodes in IDA as a graph.
citem_t
is an abstract class that is the base for both cinsn_t
and cexpr_t
, it holds common info like the address, item type and label while also featuring constants like is_expr
, contains_expr
that can be used to know the type of the object:
struct citem_t{ea_t ea = BADADDR; ///< address that corresponds to the item. may be BADADDRctype_t op = cot_empty; ///< item typeint label_num = -1; ///< label number. -1 means no label. items of the expression///< types (cot_...) should not have labels at the final maturity///< level, but at the intermediate levels any ctree item///< may have a label. Labels must be unique. Usually///< they correspond to the basic block numbers.mutable int index = -1; ///< an index in cfunc_t::treeitems.///< meaningful only after print_func()//...
The item type accessed with the op
field indicates the type of the node, expression nodes are prefixed with cot_
and the statements nodes are prefixed with cit_
, example cot_asg
indicates that the node is an assignment expression while cit_if
indicates that the node is a condition (if) statement.
Depending on the type of the statement node, a cinsn_t
can have a different attribute for example if the item type is cit_if
we can access the detail of the condition node through the cif
field, as seen in the below snippet, cinsn_t
is implemented using a union. Note that a cblock_t
is a block statement which is a list of cinsn_t
statements, we can find this type for example at the beginning of a function or after a conditional statement.
struct cinsn_t : public citem_t{union{cblock_t *cblock; ///< details of block-statementcexpr_t *cexpr; ///< details of expression-statementcif_t *cif; ///< details of if-statementcfor_t *cfor; ///< details of for-statementcwhile_t *cwhile; ///< details of while-statementcdo_t *cdo; ///< details of do-statementcswitch_t *cswitch; ///< details of switch-statementcreturn_t *creturn; ///< details of return-statementcgoto_t *cgoto; ///< details of goto-statementcasm_t *casm; ///< details of asm-statement};//...
In the example below, the condition node of type cit_if
has two child nodes: the left one is of type cit_block
which represents the "True" branch and the right is the condition to evaluate, which is a call to a function, a third child is missing as the condition does not have a "False" branch.
The following is a graph showcasing the statement node cit_if
Find the associated decompilation for the above CTree:
The same logic applies to expressions nodes cexpr_t
, depending on the node type, different attributes are available, as an example, a node of type cot_asg
has children nodes accessible with the fields x
and y
.
struct cexpr_t : public citem_t{union{cnumber_t *n; ///< used for \ref cot_numfnumber_t *fpc; ///< used for \ref cot_fnumstruct{union{var_ref_t v; ///< used for \ref cot_varea_t obj_ea; ///< used for \ref cot_obj};int refwidth; ///< how many bytes are accessed? (-1: none)};struct{cexpr_t *x; ///< the first operand of the expressionunion{cexpr_t *y; ///< the second operand of the expressioncarglist_t *a;///< argument list (used for \ref cot_call)uint32 m; ///< member offset (used for \ref cot_memptr, \ref cot_memref)///< for unions, the member number};union{cexpr_t *z; ///< the third operand of the expressionint ptrsize; ///< memory access size (used for \ref cot_ptr, \ref cot_memptr)};};//...
Finally the cfunc_t
structure holds information related to the decompiled function, the function address, the microcode block array, and the CTree accessed with the entry_ea
, mba
and body
fields respectively.
struct cfunc_t{ea_t entry_ea; ///< function entry addressmba_t *mba; ///< underlying microcodecinsn_t body; ///< function body, must be a block//...
CTree traversal example
The provided Python code serves as a mini recursive visitor of a CTree, note that it does not handle all node types, the last section will describe how to use the Hex-Rays built-in visitor class ctree_visitor_t
. To begin, we obtain the cfunc
of the function using ida_hexrays.decompile
and access its CTree via the body
field.
Next, we check if the node(item) is an expression or a statement. Finally, we can parse the type through the op
field and explore its child nodes.
import idaapiimport ida_hexraysOP_TYPE = sorted([(getattr(ida_hexrays, x), x) for x in filter(lambda y: y.startswith('cit_') or y.startswith('cot_'), dir(ida_hexrays))])def get_op_name(op):"""Return the name of the given mcode_t."""for value, name in OP_TYPE:if op == value:return namereturn Nonedef explore_ctree(item):print(f"item address: {hex(item.ea)}, item opname: {item.opname}, item op: {get_op_name(item.op)}")if item.is_expr():if item.op == ida_hexrays.cot_asg:explore_ctree(item.x) # left sideexplore_ctree(item.y) # right sideelif item.op == ida_hexrays.cot_call:explore_ctree(item.x)for a_item in item.a: # call parametersexplore_ctree(a_item)elif item.op == ida_hexrays.cot_memptr:explore_ctree(item.x)else:if item.op == ida_hexrays.cit_block:for i_item in item.cblock: # list of statement nodesexplore_ctree(i_item)elif item.op == ida_hexrays.cit_expr:explore_ctree(item.cexpr)elif item.op == ida_hexrays.cit_return:explore_ctree(item.creturn.expr)def main():cfunc = ida_hexrays.decompile(here())ctree = cfunc.bodyexplore_ctree(ctree)if __name__ == '__main__':main()
Displayed below is the output of the traversal script executed on the start
function of a BLISTER sample:
Practical example: annotating the custom import table of a malware sample
Now that we've gained insights into the architecture and structures of the generated CTree, let's delve into a practical application and explore how to automate the annotation of a custom import table of malware.
Hex-Rays provides a utility class ctree_visitor_t
that can be used to traverse and modify the CTree, two important virtual methods to know are:
visit_insn
: to visit a statementvisit_expr
: to visit an expression
For this example, the same BLISTER sample is used; after locating the function that gets Windows APIs addresses by hash at address 0x7FF8CC3B0926(in the .rsrc section), adding the enumeration to the IDB and applying the enum type to its parameter, we create a class that inherits from ctree_visitor_t
, as we are interested in expressions, we will be overriding visit_expr
only.
The idea is to locate a cot_call
node(1) of the function that resolves APIs by passing the obj_ea
address of node’s first child to the function idc.get_name
which will return the function name.
if expr.op == idaapi.cot_call:if idc.get_name(expr.x.obj_ea) == self.func_name:#...
Next retrieve the enum of the hash by accessing the right parameter of the call node(2), in our case parameter 3.
carg_1 = expr.a[HASH_ENUM_INDEX]api_name = ida_lines.tag_remove(carg_1.cexpr.print1(None)) # Get API name
The next step is to locate the variable that has been assigned the address value of the WinAPI function. To do that we first need to locate the cot_asg
node(3), parent of the call node by using the find_parent_of
method under cfunc.body
of the decompiled function.
asg_expr = self.cfunc.body.find_parent_of(expr) # Get node parent
Finally, we can access the first child node(4) under the cot_asg
node, which is of type cot_var
and get the current variable name, the Hex-Rays API ida_hexrays.rename_lvar
is used to rename the new variable with the Windows API name taken from the enum parameter.
This process can ultimately save a significant amount of time for an analyst. Instead of spending time on relabeling variables, they can direct their attention to the core functionality. An understanding of how CTrees work can contribute to the development of more effective plugins, enabling the handling of more complex obfuscations.
For a complete understanding and context of the example, please find the entire code below:
import idaapiimport ida_hexraysimport idcimport ida_linesimport randomimport stringHASH_ENUM_INDEX = 2def generate_random_string(length):letters = string.ascii_lettersreturn "".join(random.choice(letters) for _ in range(length))class ctree_visitor(ida_hexrays.ctree_visitor_t):def __init__(self, cfunc):ida_hexrays.ctree_visitor_t.__init__(self, ida_hexrays.CV_FAST)self.cfunc = cfuncself.func_name = "sub_7FF8CC3B0926"# API resolution function namedef visit_expr(self, expr):if expr.op == idaapi.cot_call:if idc.get_name(expr.x.obj_ea) == self.func_name:carg_1 = expr.a[HASH_ENUM_INDEX]api_name = ida_lines.tag_remove(carg_1.cexpr.print1(None)) # Get API nameexpr_parent = self.cfunc.body.find_parent_of(expr) # Get node parent# find asg nodewhile expr_parent.op != idaapi.cot_asg:expr_parent = self.cfunc.body.find_parent_of(expr_parent)if expr_parent.cexpr.x.op == idaapi.cot_var:lvariable_old_name = (expr_parent.cexpr.x.v.getv().name) # get name of variableida_hexrays.rename_lvar(self.cfunc.entry_ea, lvariable_old_name, api_name) # rename variablereturn 0def main():cfunc = idaapi.decompile(idc.here())v = ctree_visitor(cfunc)v.apply_to(cfunc.body, None)if __name__ == "__main__":main()
Conclusion
Concluding our exploration into Hex-Rays microcode and CTree generation, we've gained practical techniques for navigating the complexities of malware obfuscation. The ability to modify Hex-Rays pseudo code allows us to cut through obfuscation like Control Flow Obfuscation, remove dead code, and many more. The Hex-Rays C++ SDK emerges as a valuable resource, offering well-documented guidance for future reference.
We hope that this guide will be helpful to fellow researchers and any avid learner, please find all the scripts in our research repository.