
Anti-Malware Application for the Android System

 







Agenda:

1-The architecture of the anti-malware application

2-Detecting malware based on a hash of the application

  • How to detect based on hash
  • How to update the Firebase database automatically, using the Twitter API, with recently detected malware hashes

3-Detecting based on deep learning

  • Dataset
  • DL model
  • Integrating the DL model with Android using a Flask server

4-Conclusion


Introduction


0xbyte is an anti-malware application built on two detection techniques: detection based on the hash of an application, and detection based on the application's permissions using deep learning. The project combines two programming languages, Python and Java.

The project is available on GitHub:

https://github.com/M-khalifa1/Anti-malware-detection-app.


1-The architecture of the anti-malware application

As shown in the image below, the application is built on two techniques. The first detects malware by the app's hash; the hash database is updated every 48 hours with recent malware hashes extracted from tweets collected through the Twitter API. The second is based on deep learning: it classifies an app's permissions to predict whether the application is malicious or normal.
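
The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the decision order, not the application's code: `classify_app`, `known_malware_hashes` and `predict` are hypothetical names, and the sample data is made up. An empty permission string is treated as benign, which mirrors how the Flask endpoint later in this post answers 0 for empty input.

```python
from typing import Callable, Iterable, Set


def classify_app(
    app_hashes: Iterable[str],
    permissions: str,
    known_malware_hashes: Set[str],
    predict: Callable[[str], int],
) -> str:
    """Return 'malicious' or 'benign' for one installed application."""
    # Stage 1: any hash (MD5, SHA1 or SHA256) found in the database is a hit.
    for h in app_hashes:
        if h.upper() in known_malware_hashes:
            return "malicious"
    # Stage 2: no hash matched, so fall back to the permission classifier.
    # An app with no permissions is not sent to the model.
    if permissions and predict(permissions) == 1:
        return "malicious"
    return "benign"


# Hypothetical data, for illustration only.
known = {"D41D8CD98F00B204E9800998ECF8427E"}


def always_benign(perms: str) -> int:
    return 0


print(classify_app(["d41d8cd98f00b204e9800998ecf8427e"], "INTERNET", known, always_benign))
```

The hash check runs first because it is a cheap lookup; the model is consulted only when no hash matches.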



2-Detecting malware based on a hash of the application

This technique extracts the MD5, SHA1 and SHA256 hashes of every application by passing the package name and the hash type, then compares each hash with the Firebase database to determine whether the application is malicious or benign. In the code below, the hash is computed over the signing certificate that the package manager returns for the package.

//Java code
//the two functions below extract a hash for an application, given its package name and the hash type
public String GetSignHashesStr(String pakName, String type) {
   try {
       PackageInfo packageInfo = getPackageManager().getPackageInfo(pakName, PackageManager.GET_SIGNATURES);
       Signature[] signs = packageInfo.signatures;
       Signature sign = signs[0];
       String signStr = EncryptionSign(sign.toByteArray(), type);
       signStr = signStr.toUpperCase();
       return signStr;
   } catch (PackageManager.NameNotFoundException e) {
       e.printStackTrace();
   }
   return "";
}

public static String EncryptionSign(byte[] byteStr, String type) {
   StringBuilder hexDigest = new StringBuilder();
   try {
       //type is the algorithm name: "MD5", "SHA1" or "SHA256"
       MessageDigest messageDigest = MessageDigest.getInstance(type);
       byte[] byteArray = messageDigest.digest(byteStr);
       //format every byte as two lower-case hex characters
       for (byte b : byteArray) {
           hexDigest.append(String.format("%02x", b));
       }
   } catch (NoSuchAlgorithmException e) {
       e.printStackTrace();
   }
   return hexDigest.toString();
}
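
For checking database entries off-device, the same three digests can be computed with Python's standard `hashlib`. This is a minimal sketch, not part of the project: the helper name `digests` is an assumption, and it upper-cases the result only because `GetSignHashesStr` above does.

```python
import hashlib


def digests(data: bytes) -> dict:
    """Return upper-case MD5, SHA1 and SHA256 hex digests of `data`."""
    return {
        name: hashlib.new(name, data).hexdigest().upper()
        for name in ("md5", "sha1", "sha256")
    }


# Digest of empty input, used here only as a well-known reference value.
print(digests(b"")["md5"])  # D41D8CD98F00B204E9800998ECF8427E
```

Pass the raw bytes you want to fingerprint, for example the contents of a file read in binary mode.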

However, the first technique raises a problem: the hash database has to be kept up to date.

How can the Firebase database be updated automatically, using the Twitter API, with recently detected malware hashes?

Malware researchers sometimes publish information about recently discovered malware on Twitter, and these posts usually include the malware's hash. We need to add those hashes to the Firebase database. We therefore developed a script that updates the database automatically: it collects tweets (through stweet, an unofficial Twitter library), analyzes them, and extracts the hashes from every tweet. The code below demonstrates this technique. To keep the database up to date, schedule it to run every 48 hours with Task Scheduler on Windows or a cron job on Linux.

# run this script every 48 hours using Task Scheduler on Windows or a cron job on Linux
# its purpose is to update the anti-malware application's database automatically with recent malware hashes
import pandas as pd
import re
import firebase_admin
from firebase_admin import credentials
from firebase_admin import db
# stweet is an unofficial library for collecting data from Twitter
import stweet as st

# search Twitter for each phrase in "thislist" and collect the tweets that contain it
thislist = ["android md5", "android sha1", "android sha 256"]
for x in thislist:
    search_tweets_task = st.SearchTweetsTask(
        all_words=x
    )
    tweets_collector = st.CollectorTweetOutput()

    # every matching tweet is also written to Hashes_from_twitter.csv
    st.TweetSearchRunner(
        search_tweets_task=search_tweets_task,
        tweet_outputs=[tweets_collector, st.CsvTweetOutput("Hashes_from_twitter.csv")]
    ).run()

    tweets = tweets_collector.get_scrapped_tweets()
    print(x)
#########################################################################

# download the service-account JSON key for the application's Firebase project; it is used to authenticate
cred = credentials.Certificate("anydb-ae15a-firebase-adminsdk-lqe4b-8117b97e99.json")
# the URL of your Firebase database
firebase_admin.initialize_app(cred, {
    'databaseURL': 'https://anydb-ae15a-default-rtdb.firebaseio.com/'
})

refmd5 = db.reference('md5')
refsha1 = db.reference('sha1')
refsha2 = db.reference('sha256')

# read the tweets collected from Twitter, stored in Hashes_from_twitter.csv
df = pd.read_csv("Hashes_from_twitter.csv")

# regular expressions that match a standalone MD5 (32), SHA1 (40) or SHA256 (64) hex string
targets = [
    (refmd5, r'(?i)(?<![a-z0-9])[a-f0-9]{32}(?![a-z0-9])'),
    (refsha1, r'(?i)(?<![a-z0-9])[a-f0-9]{40}(?![a-z0-9])'),
    (refsha2, r'(?i)(?<![a-z0-9])[a-f0-9]{64}(?![a-z0-9])'),
]

for ref, pattern in targets:
    # fetch the stored hashes once, to check whether a hash is already in the database
    stored = str(ref.get()).upper()
    seen = set()
    for t in df.full_text:
        # a tweet may contain several hashes of the same type, so handle each one separately
        for found in re.findall(pattern, str(t)):
            # upper case, because the Android app upper-cases hashes before querying
            h = found.upper()
            if h in stored or h in seen:
                print("found")
                continue
            # insert each new hash in its own record under child "0",
            # the key the Android app queries with orderByChild("0")
            ref.push().child("0").set(h)
            seen.add(h)
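
To see what the three regular expressions actually match, here is a small self-contained check on a made-up tweet. The helper `hash_pattern` is a hypothetical name; the pattern itself is the one used in the script above. The lookbehind and lookahead are what stop the 32- and 40-character patterns from matching inside a longer SHA256 string.

```python
import re


def hash_pattern(length: int):
    # N hex characters that are not part of a longer alphanumeric run.
    return re.compile(r"(?i)(?<![a-z0-9])[a-f0-9]{%d}(?![a-z0-9])" % length)


MD5_RE, SHA1_RE, SHA256_RE = hash_pattern(32), hash_pattern(40), hash_pattern(64)

# Made-up tweet text for illustration.
tweet = (
    "New Android banker sample md5: d41d8cd98f00b204e9800998ecf8427e "
    "sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
)

md5_hits = MD5_RE.findall(tweet)
sha1_hits = SHA1_RE.findall(tweet)
sha256_hits = SHA256_RE.findall(tweet)
print(md5_hits)          # ['d41d8cd98f00b204e9800998ecf8427e']
print(sha1_hits)         # []
print(len(sha256_hits))  # 1
```

Without the lookarounds, the MD5 pattern would also report the first 32 characters of the SHA256 value as a separate hash.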

The detection function in the Android application combines the two detection approaches mentioned above: the first is based on the hash, and the second on deep learning.

Note: replace the IP address and port number in the code with those of your own Flask server, which the application uses to interact with the deep learning model.


public void Detection() {
//the code below sets up the layout and shows the malware count
 setContentView(R.layout.activity_main);
 TextView t1 = findViewById(R.id.textView1);
 t1.setText("Malware : " + malware_count);
//first approach: detection based on hash
//to detect malware by hash, the hash of each application is extracted via the package manager

 PackageManager manager = getPackageManager();
 apps = manager.getInstalledPackages(0);
 for (PackageInfo packageInfo : apps) {
  ApplicationInfo applicationInfo = packageInfo.applicationInfo;

//system applications are skipped; we only care about user-installed applications
//the next two lines check whether the app is a system app
  boolean isSystemApp = ((applicationInfo.flags & ApplicationInfo.FLAG_SYSTEM) != 0);
  if (isSystemApp == false) {
//extract the hashes of the application
   String MD5Hash = GetSignHashesStr(packageInfo.packageName, "MD5");
   String SHAHash = GetSignHashesStr(packageInfo.packageName, "SHA1");
   String SHA256Hash = GetSignHashesStr(packageInfo.packageName, "SHA256");

//search Firebase to check whether the application's MD5 is listed as malware
   DatabaseReference database = FirebaseDatabase.getInstance().getReference().child("md5");
   Query query1 = database.orderByChild("0").equalTo(MD5Hash);
   query1.addListenerForSingleValueEvent(new ValueEventListener() {
    @Override
    public void onDataChange(@NonNull @NotNull DataSnapshot snapshot) {
     if (snapshot.exists()) {
      setContentView(R.layout.activity_main);
      TextView t1 = findViewById(R.id.textView1);
      malware_count += 1;
      t1.setText("Malware : " + malware_count);
      Malcious_app_list.add(packageInfo.packageName);
     }

//otherwise, search Firebase to check whether the application's SHA1 is listed as malware
     else {
      DatabaseReference database = FirebaseDatabase.getInstance().getReference().child("sha1");
      Query query1 = database.orderByChild("0").equalTo(SHAHash);
      query1.addListenerForSingleValueEvent(new ValueEventListener() {
       @Override
       public void onDataChange(@NonNull @NotNull DataSnapshot snapshot) {
        if (snapshot.exists()) {
         Log.e(" Sha1 checker", "exist" + packageInfo.packageName);
         setContentView(R.layout.activity_main);
         TextView t1 = findViewById(R.id.textView1);
         stadet += packageInfo.packageName + "\n";
         Malcious_app_list.add(packageInfo.packageName);
         malware_count += 1;
         t1.setText("Malware : " + malware_count);
        }
//otherwise, search Firebase to check whether the application's SHA256 is listed as malware

        else {
         DatabaseReference database = FirebaseDatabase.getInstance().getReference().child("sha256");
         Query query1 = database.orderByChild("0").equalTo(SHA256Hash);
         query1.addListenerForSingleValueEvent(new ValueEventListener() {
          @Override
          public void onDataChange(@NonNull @NotNull DataSnapshot snapshot) {
           if (snapshot.exists()) {
            Log.e("Sha 256checker", "exist" + packageInfo.packageName);
            setContentView(R.layout.activity_main);
            TextView t1 = findViewById(R.id.textView1);
            malware_count += 1;
            t1.setText("Malware : " + malware_count);
            Malcious_app_list.add(packageInfo.packageName);
           }

//second approach
//detection based on deep learning, by classifying the permissions of the application
           else {

            try {
//createPackageContext gives access to another application's raw / asset files
//the package name is passed so that the permissions can be read from that application's manifest
             AssetManager assetManager = createPackageContext(packageInfo.packageName, 0).getAssets();
             XmlResourceParser xml = assetManager.openXmlResourceParser("AndroidManifest.xml");
             int eventType = xml.next();
             while (eventType != XmlPullParser.END_DOCUMENT) {
              if (eventType == XmlPullParser.START_DOCUMENT) {
              } else if (eventType == XmlPullParser.START_TAG) {
               String tag = xml.getName();

               if (TAG_ITEM1.equals(tag)) {  //uses-permission
                String attrValue = xml.getAttributeValue("http://schemas.android.com/apk/res/android", "name");
                if (!attrValue.contains("permission.") && attrValue.contains("vending.")) {
                 String[] partsVend = attrValue.split("vending.");
                 String partVe = partsVend[1];
                 sumPermVend += partVe;

                } else if (attrValue.contains("permission.")) {
                 String[] parts = attrValue.split("permission.");
                 String part2 = parts[1];

                 String sendd = part2;
                 sumPerm += sendd;
                }
               }

               if (TAG_ITEM2.equals(tag)) {  //uses-permission-sdk-23
                String attrValue = xml.getAttributeValue("http://schemas.android.com/apk/res/android", "name");
                String[] parts = attrValue.split("permission.");
                String part2 = parts[1];

                String sendd2 = part2;
                sumPermSdk += sendd2;
               }

              }
              eventType = xml.next();

             }

             xml.close();
            } catch (PackageManager.NameNotFoundException | IOException | XmlPullParserException ignore) {
            }
            String perm = sumPerm + sumPermSdk + sumPermVend;
            Log.d("TAG", sumPerm + sumPermSdk + sumPermVend);

            sumPerm = "";
            sumPermSdk = "";
            sumPermVend = "";
//sleep 10 seconds between requests from the application to the DL model through the Flask server
            SystemClock.sleep(10000);
//HTTP client used to exchange data between the Java app and the Python Flask server
            OkHttpClient okHttpClient = new OkHttpClient();
//the permissions of the application are sent to the deep learning model in the form field "n1"
            RequestBody formbody = new FormBody.Builder().add("n1", perm).build();
//this is the IP of my laptop, where the DL model runs; replace it with your own
            Request request = new Request.Builder().url("http://192.168.1.3:5000/").post(formbody).build();
            okHttpClient.newCall(request).enqueue(new Callback() {
             @Override
             public void onFailure(@NotNull Call call, @NotNull IOException e) {
             }
//response callback: receives the result of the prediction for the permissions
             @Override
             public void onResponse(@NotNull Call call, @NotNull Response response) throws IOException {
//read the body here, on OkHttp's background thread, rather than on the UI thread
              final String Result_of_prediction = response.body().string();
              runOnUiThread(new Runnable() {

               @Override
               public void run() {
//a prediction of "[1]" means the application may be malicious
                cmp = Result_of_prediction.equals("[1]");
                if (cmp == true) {
                 setContentView(R.layout.activity_main);
                 TextView t1 = findViewById(R.id.textView1);
                 Malcious_app_list.add(packageInfo.packageName);
                 malware_count += 1;
                 t1.setText("Malware : " + malware_count);
                }
               }
              });
             }
            });


           }
          }

          @Override
          public void onCancelled(@NonNull @NotNull DatabaseError error) {

          }
         });
        }
       }

       @Override
       public void onCancelled(@NonNull @NotNull DatabaseError error) {

       }
      });
     }
    }

    @Override
    public void onCancelled(@NonNull @NotNull DatabaseError error) {
    }
   });
  }//end of if
 }// end of for loop
} //end of function

3-Detecting based on deep learning

Dataset

We use the permissions of an application as the features for detecting malware. We collected permissions from malicious Android applications and from benign Android applications, and built a labeled dataset from them: permissions from malicious applications are labeled 1, and permissions from benign applications are labeled 0.

The dataset consists of 837 rows: 549 are labeled 1 and 288 are labeled 0.
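
The class balance is worth keeping in mind when reading the accuracy printed during training. As a quick arithmetic check on the counts above (nothing here comes from the project's code), a classifier that always answers 1 would already be right on roughly two thirds of the rows:

```python
# Row counts reported for the dataset.
malicious, benign = 549, 288
total = malicious + benign

# Share of each class; the larger share is the accuracy of always
# predicting that class.
malicious_share = malicious / total
benign_share = benign / total
print(total, round(malicious_share, 3), round(benign_share, 3))  # 837 0.656 0.344
```
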
The AndroidManifest.xml file contains information about an Android application, including its permissions and its components, such as activities, services, broadcast receivers and content providers.

The script below helped us extract the permissions from the AndroidManifest.xml of every application.


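# the code below collects the dataset from an application's "AndroidManifest.xml"
# it expects a decoded (plain-text) manifest, not the binary XML stored inside an APK


import xml.etree.ElementTree as ET

# android: attributes are namespaced in the parsed tree
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

root = ET.parse("AndroidManifest.xml").getroot()
permissions = root.findall("uses-permission")
for perm in permissions:
    name = perm.attrib.get(ANDROID_NS + "name", "")
    # keep only the part after "permission.", e.g. "INTERNET";
    # names without "permission." are skipped instead of raising an IndexError
    if "permission." in name:
        print(name.split("permission.", 1)[1])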
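
To try the extraction without a real APK, the same idea can be run on an inline manifest. This is a sketch under assumptions: the manifest text is made up, and `permission_names` is a hypothetical helper. It shows that ElementTree exposes `android:name` under a namespaced key, and that names without `permission.` (here the vending entry) are skipped by this version.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# A tiny, made-up manifest used only for this example.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <uses-permission android:name="android.permission.INTERNET" />
  <uses-permission android:name="android.permission.READ_SMS" android:maxSdkVersion="22" />
  <uses-permission android:name="com.android.vending.BILLING" />
</manifest>"""


def permission_names(manifest_xml: str) -> list:
    root = ET.fromstring(manifest_xml)
    names = []
    for node in root.findall("uses-permission"):
        value = node.attrib.get(ANDROID_NS + "name", "")
        # Keep only the part after "permission.".
        if "permission." in value:
            names.append(value.split("permission.", 1)[1])
    return names


print(permission_names(MANIFEST))  # ['INTERNET', 'READ_SMS']
```
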


Deep learning model

We used a deep learning model built with TensorFlow to perform binary classification on the permissions of the application.

import csv
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
import os
import time

# read dataset
df = pd.read_csv(r'dataset.csv')
# print the number of rows labeled 1 and the number labeled 0
print((df.label == 1).sum())
print((df.label == 0).sum())
from collections import Counter


# Count unique words
def counter_word(text_col):
    count = Counter()
    for perm in text_col.values:
        for word in perm.split():
            count[word] += 1
    return count


counter = counter_word(df.perm)
num_unique_words = len(counter)
# train on the whole dataset (no rows are held out)
train_size = int(df.shape[0])
train_df = df[:train_size]
train_sentences = train_df.perm.to_numpy()
train_labels = train_df.label.to_numpy()

# Tokenize
from tensorflow.keras.preprocessing.text import Tokenizer

# vectorize a text corpus by turning each text into a sequence of integers
tokenizer = Tokenizer(num_words=num_unique_words)
tokenizer.fit_on_texts(train_sentences)  # fit only to training

# each word has unique index
word_index = tokenizer.word_index
train_sequences = tokenizer.texts_to_sequences(train_sentences)
# Pad the sequences to have the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Max number of words in a sequence
max_length = 50
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding="post", truncating="post")

# Create LSTM model
from tensorflow.keras import layers


# Embedding: https://www.tensorflow.org/tutorials/text/word_embeddings
# Turns positive integers (indexes) into dense vectors of fixed size. (other approach could be one-hot-encoding)


# Word embeddings give us a way to use an efficient, dense representation in which similar words have
# a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a
# dense vector of floating point values (the length of the vector is a parameter you specify).


model = keras.models.Sequential()
model.add(layers.Embedding(num_unique_words, 32, input_length=max_length))

# The layer will take as input an integer matrix of size (batch, input_length),
# and the largest integer (i.e. word index) in the input should be no larger than num_words (vocabulary size).
# Now model.output_shape is (None, input_length, 32), where `None` is the batch dimension.


model.add(layers.LSTM(64, dropout=0.1))
model.add(layers.Dense(1, activation="sigmoid"))
model.summary()

loss = keras.losses.BinaryCrossentropy(from_logits=False)
optim = keras.optimizers.Adam(learning_rate=0.001)
metrics = ["accuracy"]
model.compile(loss=loss, optimizer=optim, metrics=metrics)

model.fit(train_padded, train_labels, epochs=15)


# prediction
def test():
    # read the permissions that the Flask server wrote for the current request
    col_names = ['perm']
    df1 = pd.read_csv(r'Responds_from_APP.csv', names=col_names)
    eva_sentences = df1.perm.to_numpy()
    # encode and pad them exactly like the training data
    eva_sequences = tokenizer.texts_to_sequences(eva_sentences)
    eva_padded = pad_sequences(eva_sequences, maxlen=max_length, padding="post", truncating="post")
    predictions = model.predict(eva_padded)
    # a sigmoid output above 0.5 is classified as malicious (1), otherwise benign (0)
    predictions = [1 if p > 0.5 else 0 for p in predictions]
    print(eva_sentences)
    print(predictions)
    return predictions
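
The encode-and-pad step used for both training and prediction can be illustrated without TensorFlow. The sketch below is a simplified stand-in, not the Keras implementation: `build_word_index` and `to_padded` are hypothetical helpers that split on whitespace only, whereas the Keras `Tokenizer` also applies its own character filters. It mimics `padding="post"` and `truncating="post"`.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded(texts, word_index, max_length):
    """Encode texts as integer lists, truncated or zero-padded at the end."""
    rows = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[:max_length]
        rows.append(seq + [0] * (max_length - len(seq)))
    return rows


# Made-up permission strings for illustration.
samples = ["internet send_sms", "internet"]
index = build_word_index(samples)
print(to_padded(samples, index, max_length=4))  # [[1, 2, 0, 0], [1, 0, 0, 0]]
```

Index 0 is reserved for padding, which is why word indices start at 1.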

Integrating the DL model with Android using a Flask server


# this is the code for the Flask server that receives permissions from the application
# it runs in the same script as the model above, which imports csv and defines test()
import flask
from flask import Flask, redirect, url_for, request
app = flask.Flask(__name__)
@app.route('/', methods=['GET', 'POST'])
def Request_Respond():
    # "value" receives the permissions from the application
    # and is written to the "Responds_from_APP.csv" file
    value = request.form['n1']
    with open('Responds_from_APP.csv', 'w+', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([value])
    print(value)
    # prediction is skipped if "value" is empty,
    # which means the application has no permissions
    if value != "":
        # permissions were received from the application, so predict whether they are
        # malicious or normal, then return the result to the application
        Result_of_prediction = test()
        print(Result_of_prediction)
        return str(Result_of_prediction)
    return str(0)
app.run(host="0.0.0.0", port=5000, debug=True)
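
To exercise the endpoint without the Android app, a request with the same form field can be built from Python's standard library. This is a sketch under assumptions: `build_prediction_request` and `is_malicious` are hypothetical helpers, the address is a placeholder, and the permission string is made up. The actual send is left commented out because it needs the server to be running.

```python
from urllib import parse, request


def build_prediction_request(permissions: str, base_url: str) -> request.Request:
    """Build a form-encoded POST with the permissions in field "n1"."""
    body = parse.urlencode({"n1": permissions}).encode("utf-8")
    return request.Request(base_url, data=body, method="POST")


def is_malicious(response_text: str) -> bool:
    # The server returns str(predictions), so a positive result is "[1]".
    return response_text.strip() == "[1]"


# Placeholder address: replace with the host and port of your Flask server.
req = build_prediction_request("INTERNETSEND_SMS", "http://127.0.0.1:5000/")
print(req.get_method(), req.data)

# Uncomment to send it once the server is running:
# with request.urlopen(req, timeout=30) as resp:
#     print(is_malicious(resp.read().decode("utf-8")))
```
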

4-Conclusion

This project was developed as an R&D effort, and I enjoyed the experiment. I want to thank my friends who contributed to this project (Amr, Abdallah and Mina), and everyone who helped me.
If you have any questions, please leave a comment or get in touch with me by email: Mahmoud.khalifa@ieee.org.

