The Role of Shared Code Analysis (Similarity Analysis) in Malware Analysis

As the volume of malware keeps growing, an approach called shared code analysis, or similarity analysis, can save malware researchers a great deal of reverse engineering work.

Shared code analysis is an approach to comparing two malware samples by estimating the percentage of precompilation source code they share.
There are four measures for identifying similarity between malware samples:
1- Instruction sequence based similarity (x86 assembly instructions).
2- String based similarity.
3- IAT (Import Address Table) based similarity.
4- Dynamic API call based similarity (you can collect the API calls a sample makes from logs).
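
As an illustration of the string based measure, here is a minimal Python sketch that turns a sample's raw bytes into a set of printable strings. The function name `extract_strings` and the minimum length of 5 are assumptions chosen for this example, not something prescribed by the approach; the other measures would need a disassembler, a PE parser, or sandbox logs instead.

```python
import re


def extract_strings(data: bytes, min_len: int = 5) -> set:
    """Return the set of printable ASCII strings of at least min_len characters.

    Hypothetical helper for illustration: it mimics what the Unix `strings`
    tool does, so each sample becomes a set of features that can be compared.
    """
    # Bytes 0x20 to 0x7e are the printable ASCII characters.
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {match.decode("ascii") for match in re.findall(pattern, data)}


# Made-up bytes standing in for the contents of a binary file.
sample = b"\x00\x01CreateFileA\x00\xffabc\x00http://example.test\x00"
print(sorted(extract_strings(sample)))
```

The short run `abc` is dropped because it is below the minimum length, which helps filter out random byte sequences that only look like text.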

Benefits of the shared code analysis approach:
- Determine a new malware sample's code similarity to thousands of previously seen malware samples.
- Identify new malware families based on shared code.
- Visualize malware relationships to learn the most common techniques that threat actors use (this is important when building ML-based malware detectors).
- Reduce the amount of manual reverse engineering work.

How does shared code analysis work?
You can measure the similarity between malware samples using the Jaccard index.
Jaccard index: compares the members of two sets to see which members are shared and which are distinct. It is a measure of similarity between the two sets, ranging from 0% (nothing in common) to 100% (identical sets).

To compute the similarity with the Jaccard index, use the following equation:
J(A, B) = |A ∩ B| / |A ∪ B| × 100
For example, take the two sets A = {1, 2, 3, 4, 5} and B = {2, 9, 8, 7, 10, 5}.
The intersection is A ∩ B = {2, 5}, which has 2 members, and the union is A ∪ B = {1, 2, 3, 4, 5, 7, 8, 9, 10}, which has 9 members, so:
J(A, B) = |A ∩ B| / |A ∪ B| × 100 = (2 / 9) × 100 = 22.2%
The two sets are therefore 22.2% similar.
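
The same calculation can be sketched in a few lines of Python using the built-in set type. The function name `jaccard` is just a label chosen for this example; in practice the sets would hold features extracted from two malware samples (strings, imports, API calls) rather than numbers.

```python
def jaccard(a: set, b: set) -> float:
    """Return the Jaccard index of two sets as a fraction between 0 and 1."""
    if not a and not b:
        # Assumption for this sketch: two empty sets count as 0 similarity.
        return 0.0
    return len(a & b) / len(a | b)


A = {1, 2, 3, 4, 5}
B = {2, 9, 8, 7, 10, 5}
print(f"{jaccard(A, B) * 100:.1f}%")  # prints 22.2%
```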

“To scale malware similarity comparisons, we need to use randomized comparison approximation algorithms. [An approach] known as minhash serves this purpose beautifully. The minhash method allows us to compute the Jaccard index using approximation to avoid computing similarities between non-similar malware samples.”
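
To make the idea concrete, here is a minimal, illustrative minhash sketch in Python. It is not the code from the book: the helper names (`_seeded_hash`, `minhash_signature`, `estimate_jaccard`), the use of SHA-1 with a seed prefix to simulate many hash functions, and the signature length of 256 are all assumptions made for this example. For each simulated hash function, the signature keeps the smallest hash value over a sample's features; the fraction of positions where two signatures agree approximates the Jaccard index of the underlying sets.

```python
import hashlib


def _seeded_hash(feature: str, seed: int) -> int:
    """Hash a feature with a given seed; each seed acts as a separate hash function."""
    digest = hashlib.sha1(f"{seed}:{feature}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(features: set, num_hashes: int = 256) -> list:
    """Return the minhash signature of a non-empty set of string features."""
    return [
        min(_seeded_hash(feature, seed) for feature in features)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Estimate the Jaccard index as the fraction of matching signature slots."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Made-up feature sets standing in for two samples; the true Jaccard index is 0.5.
sample_a = {f"feature_{i}" for i in range(0, 300)}
sample_b = {f"feature_{i}" for i in range(100, 400)}

estimate = estimate_jaccard(minhash_signature(sample_a), minhash_signature(sample_b))
print(f"estimated similarity: {estimate:.2f}")
```

The signatures have a fixed length no matter how many features a sample has, so they are cheap to store and compare. Longer signatures give a more accurate estimate at the cost of more hashing work.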

References:
For more details about similarity using the Jaccard index and the MinHash algorithm, you can visit the following link:

Code that implements this approach (Jaccard index and MinHash algorithm) can be found in Chapter 5 of the book Malware Data Science.

Malware Data Science book:

