Fully-Automated String Decryption and Data Leakage Detection using Hybrid Code Analysis (HCA)

    Introduction

    In June of this year we demonstrated our generic instrumentation engine with Opfake.C (MD5: 001a42a555b4bd39bf6ecd8b11441870) and showed how easy it is to hook calls to local methods matching certain method signatures (see this blogpost). In this concrete case, we log all invokes of static methods that take a String as input parameter and return a String, e.g.
     
    public static String method(String s)

    We refer to this type of method signature as a "DecryptString" signature. Often, mildly sophisticated malware stores its Strings in encrypted form in order to hinder pattern-based matching by static analysis AV engines. Typically, the authors of such malware do not put in the effort to implement complex decryption algorithms and instead use simple techniques, such as substitution-based ciphers.
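
    As a rough illustration of the signature check, the following sketch tests whether a method is static, takes exactly one String and returns a String. It uses plain Java reflection on a made-up Sample class; the class and method names are hypothetical, and this is not how Joe Sandbox Mobile matches methods inside an Android package.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class DecryptStringSignature {

    /** True if m is static, takes exactly one String and returns a String. */
    static boolean matches(Method m) {
        Class<?>[] params = m.getParameterTypes();
        return Modifier.isStatic(m.getModifiers())
                && m.getReturnType() == String.class
                && params.length == 1
                && params[0] == String.class;
    }

    /** Collects the names of all declared methods of clazz that match. */
    static List<String> candidates(Class<?> clazz) {
        List<String> names = new ArrayList<>();
        for (Method m : clazz.getDeclaredMethods()) {
            if (matches(m)) {
                names.add(m.getName());
            }
        }
        return names;
    }

    // Hypothetical stand-in for an obfuscated class; not taken from Opfake.C.
    static final class Sample {
        public static String decode(String s) { return s; }        // matches
        public String instanceMethod(String s) { return s; }       // not static
        public static int length(String s) { return s.length(); }  // wrong return type
    }

    public static void main(String[] args) {
        System.out.println(candidates(Sample.class));
    }
}
```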

    Usually, the encrypted strings are spread throughout the entire package and need to be decrypted quickly on-the-fly. The decrypted payload is usually a class or method name used to look up class objects and method objects for reflective invokes, or often a hidden C&C URL. Also, samples using encrypted strings usually try to encrypt all possible strings, so we can assume there will be a lot of "DecryptString" method calls at runtime. So we had an idea: what if we record all I/O Strings of all invokes matching the "DecryptString" method signature, build a character-based "conversion map" from them, and use that map to try to decrypt other, non-executed invokes of the same method? And if that succeeds, can we build behavior signatures off of that data? After all, combining dynamic analysis results with static analysis to obtain behavior data is what Hybrid Code Analysis (HCA) is all about. Let's get to work.


    Building Input/Output Character Maps

    The first step was to extend our engine to build input/output character maps for all runtime invokes of methods matching the "DecryptString" method signature described above. Of course, the character maps need to take overloaded method names into account so that we can reliably attribute the logged data to each individual method. Also, we only consider an input/output pair if both Strings have the same length and their characters differ. In the case of Opfake.C, there is really only one method that gives good results and which is used heavily to decrypt Strings. Here is the calculated Input/Output Character Map for "public static String mkfkejkpu.mkfkejkpu.mkfkejkpu(String s)" based on 400+ observed runtime calls:
     
    Input  Output    Input  Output    Input  Output
    n      0         +      a         Z      m
    9      1         8      A         0      N
    C      2         2      B         s      n
    R      3         @      b         3      o
    ;      4         ]      c         l      O
    ,      5         o      C         F      p
    E      6         <      d         Y      P
    i      7         A      D         p      q
    M      8         .      e         Q      R
    7      -         B      E         e      r
    K      (         k      f         t      s
    *      )         h      F         z      S
    U      *         H      g         ^      T
    4      ,         w      G         j      t
    :      .         r      h         1      u
    ?      /         x      H         X      U
    b      :         -      i         D      V
    g      ?         u      I         J      v
    `      [         _      j         L      w
    V      _         m      J         S      W
    G      }         N      K         c      X
    W      +         P      k         a      x
    O      <         6      l         >      y
    )      =         v      L         y      Y
    f      >         5      M         (      Z
                                      [      z
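
    How such a map could be collected can be sketched as follows. This is a minimal illustration, not the actual engine code: the CharMapRecorder class, the signature key format and the two sample pairs are assumptions (the pairs were constructed from the table above, they are not real logged calls). Discarding a method whose characters map inconsistently is a design choice of the sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class CharMapRecorder {
    // One character map per full method signature, so overloads are kept apart.
    private final Map<String, Map<Character, Character>> maps = new HashMap<>();
    // Signatures where one input character was seen with two different outputs.
    private final Set<String> inconsistent = new HashSet<>();

    /** Records the input and return value of one observed runtime invoke. */
    void record(String signature, String input, String output) {
        if (input == null || output == null) return;
        // Keep only pairs of equal length whose content actually changed.
        if (input.length() != output.length() || input.equals(output)) return;
        Map<Character, Character> map = maps.computeIfAbsent(signature, k -> new HashMap<>());
        for (int i = 0; i < input.length(); i++) {
            Character previous = map.putIfAbsent(input.charAt(i), output.charAt(i));
            if (previous != null && previous.charValue() != output.charAt(i)) {
                inconsistent.add(signature); // not a plain per-character mapping
            }
        }
    }

    /** Returns the learned map, or an empty map if nothing usable was learned. */
    Map<Character, Character> mapFor(String signature) {
        if (inconsistent.contains(signature) || !maps.containsKey(signature)) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(maps.get(signature));
    }

    public static void main(String[] args) {
        // Hypothetical key format for the method from the table above.
        String sig = "mkfkejkpu.mkfkejkpu.mkfkejkpu(String)";
        CharMapRecorder recorder = new CharMapRecorder();
        recorder.record(sig, "rjjF", "http");  // constructed from the table
        recorder.record(sig, "s.j", "net");    // constructed from the table
        recorder.record(sig, "abc", "abcd");   // ignored: different length
        System.out.println(recorder.mapFor(sig));
    }
}
```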

    Wow! :-) With the exception of a few characters (like the number "9"), we have an almost complete table of the main human-readable ASCII characters. Also, the conversion map does not seem to be a simple substitution cipher such as "ROT-13" or the like. Before we check whether the character map produces good results on other, non-executed invokes of the same method, let us take a look at what a typical non-executed code sequence looks like:


    As we can see above, without reverse engineering the "Decryption"-method mkfkejkpu and implementing some custom decryption algorithm, it will not be possible to understand what is going on there. Using some data flow analysis for the parameter (which is easy) and our previously calculated character map, it is possible for Joe Sandbox Mobile to fully automatically decrypt Strings for these calls, even though the code is never executed. This is what the results look like for the same code sequence:


    Aha! The code seems to be part of a routine that is building a C&C URL http://m-l1g.net/q.php that is probably used to post some data. Scrolling down a bit, we find this code sequence in the same method:


    which confirms our assumption that an HTTP based request will be executed (the reflective invoke happens shortly after). The "synthetically" (or heuristically) calculated return values are marked as "Synthetic Return" instead of "Return", as usual.
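
    Applying a learned character map to the String argument of a non-executed invoke can be sketched as below. The SyntheticDecryptor class is hypothetical, the map holds only five entries copied from the table above, and the encrypted input "rjjFb??" was constructed by inverting those entries for illustration; it is not taken from the sample. Returning an empty result when a character is unknown is an assumption of this sketch, since the learned map may have gaps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class SyntheticDecryptor {

    /** Translates each character through the map; empty if any character is unknown. */
    static Optional<String> decrypt(String encrypted, Map<Character, Character> charMap) {
        StringBuilder plain = new StringBuilder(encrypted.length());
        for (char c : encrypted.toCharArray()) {
            Character mapped = charMap.get(c);
            if (mapped == null) {
                return Optional.empty(); // gap in the learned map, do not guess
            }
            plain.append(mapped.charValue());
        }
        return Optional.of(plain.toString());
    }

    /** A few input/output entries copied from the character map table. */
    static Map<Character, Character> sampleMap() {
        Map<Character, Character> map = new HashMap<>();
        map.put('r', 'h');
        map.put('j', 't');
        map.put('F', 'p');
        map.put('b', ':');
        map.put('?', '/');
        return map;
    }

    public static void main(String[] args) {
        // Constructed input: the inverse of "http://" under sampleMap().
        System.out.println(decrypt("rjjFb??", sampleMap()).orElse("<incomplete>"));
        // 'Z' is not in sampleMap(), so no synthetic value is produced.
        System.out.println(decrypt("rjZ", sampleMap()).orElse("<incomplete>"));
    }
}
```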

    Creating Behavior Signatures based on Synthetic Strings

    Because the decryption mechanism is applied fully automatically at every non-executed invoke of the same method, we were able to understand the entire payload of Opfake.C. Using that data, we built a proof-of-concept signature that detects SMS sending code, even if the code isn't executed and the lookup Strings reside in the package fully encrypted. Here is the code sequence:


    ... and here is the signature:


    The signature matches if the Strings "android.telephony.SmsManager" and "sendTextMessage" appear together with a reflective invoke within the same code context. Of course, the signature offers a "Source" link to quickly jump to the relevant code location. Besides the signature above, we came up with two more signatures that help to quickly get an overview of the decrypted strings in the package, especially if decrypted Strings appear in the same code context as a reflective invoke (a good indicator for hidden payload):



    See the "Uses an encrypted string to lookup and invoke a method via reflection" Signature for "payload hiding" code locations and the "Probably tries to hide strings using a DecryptString routine" signature for a full list.
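
    The matching condition of the SMS signature described above can be approximated in a few lines. The CodeContext type below is a hypothetical stand-in for whatever the engine uses to group the Strings and invokes of one code location; only the two lookup Strings come from the text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SmsReflectionSignature {

    /** Hypothetical model of one code context (e.g. one method body). */
    static final class CodeContext {
        final Set<String> strings;          // real and synthetic return values
        final boolean hasReflectiveInvoke;  // e.g. a call to Method.invoke

        CodeContext(Set<String> strings, boolean hasReflectiveInvoke) {
            this.strings = strings;
            this.hasReflectiveInvoke = hasReflectiveInvoke;
        }
    }

    /** Matches if both lookup Strings and a reflective invoke share one context. */
    static boolean matches(CodeContext ctx) {
        return ctx.hasReflectiveInvoke
                && ctx.strings.contains("android.telephony.SmsManager")
                && ctx.strings.contains("sendTextMessage");
    }

    public static void main(String[] args) {
        Set<String> decrypted = new HashSet<>(
                Arrays.asList("android.telephony.SmsManager", "sendTextMessage"));
        System.out.println(matches(new CodeContext(decrypted, true)));   // all three conditions hold
        System.out.println(matches(new CodeContext(decrypted, false)));  // no reflective invoke
    }
}
```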

    Detecting Sensitive Information Leakage

    Besides the really cool "auto-decryption" feature that we added to Joe Sandbox Mobile, we also added a second signature that detects if sensitive phone information is possibly being leaked. As outlined in the Chuli.A blogpost from August, we have been creating signatures that are more context-aware and work on dynamic session data, such as critical phone-identifying information being leaked. In that post we showed how sensitive phone information was being posted in a base64-encoded format as part of HTTP POST parameters. Posting data to a PHP file on a C&C server is not new, and a lot of malware encrypts its payload instead of relying on simple encodings alone. In the case of Opfake.C, the malware authors decided to encrypt sensitive phone information using the AES cipher algorithm. Here is the relevant code location:


    In the figure above, we see an AES cipher instance being initialized.


    Shortly after the initialization code, we see a call to Cipher.doFinal with a String that contains sensitive phone information, such as the IMEI/IMSI. In the new version of Joe Sandbox Mobile, whenever a Cipher encrypts a payload that contains sensitive phone information, the following signature triggers:


    The "Leaked:" part of the comment indicates which sensitive phone information has been identified, and a quick entry point to the relevant code location is provided as well. Of course, implementing this signature would not have been possible without context-awareness (the session information) and full parameter data of the runtime invoke.
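
    The core check behind such a signature can be sketched as a comparison between the plaintext handed to Cipher.doFinal and the identifiers known for the analysis session. The CipherLeakSignature class, the field labels and the IMEI/IMSI values below are made-up placeholders; how Joe Sandbox Mobile actually stores session information is not described here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CipherLeakSignature {

    /**
     * Returns the labels of all session values found in the plaintext that
     * was logged as the argument of a Cipher.doFinal call.
     */
    static List<String> leakedFields(byte[] plaintext, Map<String, String> sessionInfo) {
        String text = new String(plaintext, StandardCharsets.UTF_8);
        List<String> leaked = new ArrayList<>();
        for (Map.Entry<String, String> entry : sessionInfo.entrySet()) {
            String value = entry.getValue();
            if (value != null && !value.isEmpty() && text.contains(value)) {
                leaked.add(entry.getKey());
            }
        }
        return leaked;
    }

    public static void main(String[] args) {
        // Placeholder identifiers of the (hypothetical) analysis device.
        Map<String, String> session = new LinkedHashMap<>();
        session.put("IMEI", "000000000000000");
        session.put("IMSI", "001010123456789");

        // Placeholder payload as it might be passed to doFinal before encryption.
        byte[] payload = "id=000000000000000&op=test".getBytes(StandardCharsets.UTF_8);
        System.out.println("Leaked: " + leakedFields(payload, session));
    }
}
```

    Because the check runs on the parameter of the runtime invoke, it sees the data before it is encrypted, which is why the strength of the cipher does not matter for detection.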

    Conclusion

    In this blogpost we demonstrated the power of Hybrid Code Analysis (HCA) that combines dynamic and static analysis in Joe Sandbox Mobile. Using HCA, it was possible to understand how Strings are decrypted in Opfake.C and re-apply the learned character mapping to other encrypted Strings on non-executed invokes (essentially "simulating" a decryption). That way, it was possible to understand the full payload of Opfake.C and create intelligent behavior signatures. Furthermore, we outlined that context-awareness and parameter-level instrumentation, as implemented in Joe Sandbox Mobile, can open doors to more complex signatures that detect data leakage.