MCRIT: The MinHash-based Code Relationship & Investigation Toolkit


After another 3 years of hiatus, I’m at a point where I feel that I may have interesting things to share and write about again! :)

Around this time last year, my PhD work finally concluded with the publication of my thesis, “Classification, Characterization, and Contextualization of Windows Malware using Static Behavior and Similarity Analysis”. The thesis includes projects I had been working on continually since 2016, most prominently Malpedia but also ApiScout, whose idea and parts of its development I described in previous blog posts.

MCRIT

The final part of my PhD thesis was centered on the development of the “MinHash-based Code Relationship & Investigation Toolkit (MCRIT)”, a one-to-many code similarity framework, and its application to the Malpedia data set. Ever since BinDiff and Diaphora, the one-to-one diffing use case has been nicely covered, but I always felt that efficient one-to-many function search on moderately sized collections (thousands to tens of thousands of binaries) was missing from the malware analyst’s toolbox, at least with respect to free and openly available tools.

I’m happy to say that in the meantime, MCRIT has grown from a pure research evaluation prototype into a practically usable framework. It is available on GitHub and was publicly released in April 2023 at Botconf (slides, paper, video).

If you want to try it out yourself, there is now a Dockerized version that simplifies the setup procedure to a git clone and docker-compose up.

You can also check out the documentation to get an overview of its capabilities.

Status Quo

The current state of MCRIT already provides everything needed to use it for malware identification, lineage analysis, and also label transfer between multiple binaries/samples at the same time.

Pictures say more than words, so let me reuse some content from my Botconf presentation in the following example. Let’s have a look at how MCRIT could have been used to investigate the public attribution of Wannacry to Lazarus.

Example: Wannacry

The data management in MCRIT is structured around (malware and benign) families, which contain samples, which in turn contain code in the form of functions. Internally, MCRIT uses the SMDA disassembler, which I had originally designed and optimized for processing memory dumps but which achieves similarly good results on regular unmapped binaries. Working towards the MCRIT release, we extended SMDA with various symbol parsing capabilities, e.g. for Delphi and Go as well as DWARF, which will be beneficial in the long run as more reference code data sets are made available.
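To make the family / sample / function hierarchy a bit more tangible, here is a minimal sketch in Python. The class and field names are my own illustration and not MCRIT's actual storage schema; the SHA256 and the second offset are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    offset: int       # start address of the function within the sample
    name: str = ""    # label, if symbols or manual annotation provide one


@dataclass
class Sample:
    sha256: str
    functions: list[Function] = field(default_factory=list)


@dataclass
class Family:
    name: str                  # e.g. "win.wannacry"
    is_library: bool = False   # benign reference code vs. malware
    samples: list[Sample] = field(default_factory=list)

    def function_count(self) -> int:
        # Total number of functions across all samples of this family.
        return sum(len(sample.functions) for sample in self.samples)


# Hypothetical usage: one family with a single sample and two functions.
family = Family("win.wannacry")
family.samples.append(
    Sample("<sha256>", [Function(0x402560), Function(0x401000)])
)
print(family.function_count())
```

The nesting mirrors how results are later presented: matches are found on function level and then aggregated upwards to samples and families.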

Now, assuming you have already fed data into the system, you may see something like this under the family page:

screenshot

From here, you can search and filter, and by clicking the flask symbol you can create matching jobs, which typically let you run a single binary against your whole collection in a matter of seconds.

In our example, we used a win.wannacry sample from February 2017 (which was later linked to Lazarus) and matched it against the whole collection of my demo instance - which currently sits at around 7,000 samples and about 8,000,000 functions. This took about 15 seconds on this system (8 cores, 32 GB RAM) and produced a total of 64,000 matches into 177 of the 450 functions in our reference sample.

At the family/sample level of the result, we can see that while other win.wannacry samples naturally match best, some other known Lazarus families show up as well (win.joanap, win.badcall).
But for our example, we are interested in a particular match.

screenshot

Filtering down all matched functions to those that only point into a few other families, we can quickly isolate a few interesting relations:

screenshot
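The filtering idea itself is simple and can be sketched in a few lines of Python. This is only an illustration of the principle, not the code behind the web interface; the function name, the `max_families` parameter, and the toy match list (offsets and `fam.*` names) are made up for this example.

```python
from collections import defaultdict


def rare_matches(matches, own_family, max_families=2):
    """Keep functions whose matches point into only a few foreign families.

    matches: iterable of (function_offset, matched_family_name) pairs.
    """
    families_per_function = defaultdict(set)
    for offset, family in matches:
        # Matches into the sample's own family are expected and not interesting here.
        if family != own_family:
            families_per_function[offset].add(family)
    return {
        offset: sorted(families)
        for offset, families in families_per_function.items()
        if len(families) <= max_families
    }


# Toy input: 0x1000 matches one foreign family, 0x2000 matches three,
# and 0x3000 matches only the sample's own family.
matches = [
    (0x1000, "fam.own"), (0x1000, "fam.a"),
    (0x2000, "fam.a"), (0x2000, "fam.b"), (0x2000, "fam.c"),
    (0x3000, "fam.own"),
]
result = rare_matches(matches, own_family="fam.own", max_families=2)
print(result)
```

Functions that match into many families tend to be shared library or boilerplate code, so a low family count is a cheap heuristic for surfacing the relations worth a manual look.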

Apart from a (smaller) function that strongly matches other Lazarus code from Operation Blockbuster, the function directly below it in the screenshot is the one referenced by Neel Mehta from Google in his famous attribution-hinting tweet, which links another Lazarus family called win.contopee to win.wannacry.

The web interface of MCRIT has built-in functionality for basic visual diffing, shown here for the function at offset 0x402560:

screenshot

Here, blue basic blocks have a matching position-independent code hash (PicHash), meaning that apart from memory locations (which are wildcarded in the calculation) they are fully identical at the byte level. Green and yellow (not shown) blocks are still very similar, while red indicates multiple changes in instructions beyond an internal threshold.
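As a rough illustration of the wildcarding idea, consider the following Python sketch. It is a simplified toy and not the actual PicHash implementation in MCRIT or SMDA: the input format (raw instruction bytes plus the byte spans that hold addresses), the use of SHA-256, and the truncation to 16 hex characters are all assumptions made for this example.

```python
import hashlib


def pic_hash(instructions):
    """Hash a basic block while ignoring bytes that encode memory locations.

    instructions: iterable of (raw_bytes, address_spans), where address_spans
    is a list of (start, length) ranges inside raw_bytes that hold an address.
    """
    normalized = bytearray()
    for raw, spans in instructions:
        masked = bytearray(raw)
        for start, length in spans:
            # Wildcard the address bytes so relocated code hashes identically.
            masked[start:start + length] = b"\x00" * length
        normalized += masked
    return hashlib.sha256(bytes(normalized)).hexdigest()[:16]


# Two x86 blocks that differ only in the absolute address they read from.
block_a = [
    (b"\x55", []),                              # push ebp
    (b"\x8b\xec", []),                          # mov ebp, esp
    (b"\xa1\x00\x10\x40\x00", [(1, 4)]),        # mov eax, [0x401000]
]
block_b = [
    (b"\x55", []),
    (b"\x8b\xec", []),
    (b"\xa1\x00\x20\x50\x00", [(1, 4)]),        # mov eax, [0x502000]
]
# Same layout, but one instruction was changed.
block_c = [
    (b"\x55", []),
    (b"\x8b\xe5", []),                          # mov esp, ebp
    (b"\xa1\x00\x10\x40\x00", [(1, 4)]),
]

print(pic_hash(block_a) == pic_hash(block_b))
print(pic_hash(block_a) == pic_hash(block_c))
```

In this toy model, the first two blocks would be shown in blue, while any real instruction change produces a different hash and has to be caught by a fuzzier comparison instead.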

I think with some further smoothing of the workflow and filtering options added, MCRIT may become a helpful tool for similar cases in the future.

Next Steps

To make MCRIT more immediately useful to others, I am currently busy compiling a collection of benign reference code.
On the one hand, this will aid in filtering out common library functions during the matching procedure (another distinctive feature of MCRIT), so that matching results become even more expressive.
On the other hand, those labels become available for import into other analysis tools, for example through the already available IDA Pro plugin:

screenshot