Use machine learning to detect malicious domain names

By Lex Crielaars, Chief Technology Officer at SMT.


In the current security landscape, cyber-attacks are getting harder to detect. The amount of data that needs to be searched keeps growing, and the attacks are becoming more complex. Classic detection methods rely mainly on static searches such as blacklists and fixed patterns. This leaves little flexibility for finding false or malicious data in the available machine data. Machine learning techniques add value by providing new insights and higher detection rates for these attacks.



Attacks often start with the infection of an endpoint within an organization. An endpoint can be a PC or a laptop, but also a phone, tablet, security camera or even a printer. An infected endpoint is also called a slave or a zombie, which means it is no longer under the control of the organization but of the hacker. Many infected endpoints together form a network, called a botnet, that in turn communicates with a Command & Control server, also known as a C2 server.

With malware, a hacker tries to infect as many endpoints as possible and to make the botnet as large as possible. The larger the botnet, the more bandwidth and computing power it has. This bandwidth is then used to cause as much damage as possible, for example with DDoS attacks on websites. The computing power can be used for mining cryptocurrencies such as bitcoin. Slaves are also used to infect other endpoints, which happens quite often with ransomware. After all, the damage and the threat are much greater if a complete company is shut down instead of just a few endpoints.

For this reason, it is important to quickly detect communication between the infected endpoints and the C2 server(s). Hackers conceal the communication behind (seemingly legitimate) domain names that are changed frequently to make detection more difficult. Botnets can hold millions of slaves, so the domain names for communication with the C2 server are automatically generated, registered and put into use. A slave can infect an entire network within just a few hours, so a quick recovery is crucial. Proxy and DNS logs are monitored to see which websites the endpoints connect to. The question remains: how are legitimate websites distinguished from algorithmically generated ones? Recognizing illegitimate domain names is essential for discovering the characteristics of typical malware communication.
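To make the idea of automatically generated domain names concrete, here is a minimal sketch of a domain generation algorithm. It is not taken from any real botnet: the function name toy_dga, the seed value and the hashing scheme are all assumptions chosen for illustration. The point it shows is that malware and C2 operator can compute the same list of domain names independently, as long as they share a seed and the current date.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, tld: str = ".com") -> list[str]:
    """Hypothetical DGA: derive `count` domain names from a shared seed and a date."""
    domains = []
    for i in range(count):
        # Seed, date and counter together determine each name.
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).digest()
        # Map the first 12 hash bytes onto the letters a-z to form a label.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:12])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    # "example-seed" is a made-up value; a real botnet would embed its own.
    for name in toy_dga("example-seed", date(2019, 7, 1)):
        print(name)
```

Because the output changes every day, a static blacklist of yesterday's names says nothing about today's, which is why the detection has to work on the characteristics of the names themselves.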


The solution

The algorithmically generated domain names are technically similar to those of legitimate websites: they have all the characteristics of real domain names, such as the length and the ratio between vowels and consonants. Using machine learning, a data model can be trained to determine which domain names are legitimate and which are not. Algorithms from known botnets are used to generate large numbers of domain names, which are used to train the data model.
At the same time, the model is trained with a list of legitimate domain names. This way, the training set is a combination of real and generated domain names. One part of this set is used to train the data model and the other part to test the model for accuracy and reliability.
The data model can now be used to evaluate unknown domain names and determine whether each one is legitimate or generated. The data model is of course also applied to domain names of botnets that it was not trained on. Research has shown that more than half of those domain names are still recognized because of the shared characteristics. With this specific data model, for example, we can recognize 40% of the WannaCry domain names without having used the WannaCry algorithm for training.
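The workflow described above (combine legitimate and generated names, train on one part, test on the other) can be sketched in a few lines of Python. The article does not say which algorithm or features the data model uses, so this sketch swaps in a simple character-bigram naive Bayes classifier as a stand-in. The name lists are placeholders as well: a handful of well-known names stands in for a real list of legitimate domains, and seeded random strings stand in for the output of known botnet algorithms.

```python
import math
import random
from collections import Counter

# Placeholder data, NOT a real training set (see the assumptions above).
LEGIT = ["google", "wikipedia", "youtube", "facebook", "amazon", "twitter",
         "linkedin", "microsoft", "netflix", "github", "stackoverflow",
         "wordpress", "instagram", "dropbox", "mozilla", "adobe", "paypal",
         "spotify", "reddit", "apple"]
rng = random.Random(42)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
GENERATED = ["".join(rng.choice(ALPHABET) for _ in range(rng.randint(8, 14)))
             for _ in range(len(LEGIT))]


def bigrams(name):
    padded = f"^{name}$"  # mark the start and end of the label
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def train(samples):
    """Count bigrams per class; samples is a list of (name, label) pairs."""
    counts = {"legit": Counter(), "generated": Counter()}
    for name, label in samples:
        counts[label].update(bigrams(name))
    return counts


def classify(counts, name, vocab_size=28 * 28):
    """Return the class with the higher Laplace-smoothed log-likelihood."""
    scores = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scores[label] = sum(math.log((counter[bg] + 1) / (total + vocab_size))
                            for bg in bigrams(name))
    return max(scores, key=scores.get)


data = [(n, "legit") for n in LEGIT] + [(n, "generated") for n in GENERATED]
rng.shuffle(data)
split = int(0.7 * len(data))  # one part trains the model, the other tests it
train_set, test_set = data[:split], data[split:]

model = train(train_set)
correct = sum(classify(model, name) == label for name, label in test_set)
accuracy = correct / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

A toy set like this says nothing about real-world detection rates. A meaningful evaluation needs far larger lists and, as described above, a separate test against botnet families that were left out of training.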


The added value

Detecting endpoints that communicate with C2 servers is an important security use case for organizations. Any endpoint can become infected, and timely detection of an infection is crucial. Does malware or ransomware spread through the entire company, or does the company manage to stop it in time? An in-house data model offers real-time protection and is trained on the specific characteristics of the customer.

Would you like to know more about the possibilities for your organization? Contact our specialists or download the free white paper “Operationalize Machine Learning To Detect Malicious Domain Names” below.



July 2019