And we’re back. If you’re new here, make sure to check out Part 1 of this series first.
Today I’m going to talk about datasets. Subjectively speaking, 70-80% of the work in an AI project is spent searching for, curating and creating the datasets used to train your model. The more thoughtful you are in this part of the process, the better your model will be in terms of efficiency and accuracy. I know it’s tempting to throw as much data as possible at the wall and let the algorithms figure it out, but I don’t agree with that approach. Don’t let the power of AI turn you into a lazy neanderthal! If you can reduce noise in the early stages, it will pay dividends later in computational cost and speed. For Traffic Classification, I had some rules in mind about what kind of data should be used. Here are my thoughts:
- Eliminate Noise – Any data that can easily be duplicated, faked or spoofed will be disregarded. This includes IP addresses, application ports and MAC addresses. IP addresses are dynamic, so they are not dependable. I wouldn’t use these data points for traditional classification methods, so why would I use them for an AI model?
- Less is More – Use the minimal amount of data necessary to make a prediction. This means trying different combinations of data points and multiple payload sizes and comparing accuracy. If I can get the same performance analyzing just the first 100 bytes of a packet payload instead of the full 1500-byte Ethernet MTU, that’s a win.
- VPN and Encrypted Traffic – Most non-technical consumers do not connect to a VPN or even know what a VPN is. Let’s address the bigger market first and think about the fringe cases later. For now we will not address this type of traffic.
- Prediction Categories – The dataset will focus on three types of traffic/applications: Gaming, Voice/VoIP and Video Chat. If we cannot classify traffic into any of these three categories with high probability, we will label it Other. We may still include other prediction classes to increase the resolution of the final prediction, but in reality we only care about predicting those three classes accurately.
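To make the first rule concrete, here is a minimal sketch of stripping spoofable identifiers from a flow record before it reaches training. The field names and example values are my own illustration, not from any specific dataset:

```python
# Hypothetical flow record; field names are illustrative only.
FLOW = {
    "src_ip": "10.0.0.5",        # easily spoofed -> dropped
    "dst_port": 3074,            # easily spoofed -> dropped
    "mac": "aa:bb:cc:dd:ee:ff",  # easily spoofed -> dropped
    "iat_mean_ms": 12.4,         # timing features survive the cut
    "pkt_len_mean": 210.0,
    "payload_head": b"\x17\x03\x03",
}

# Fields that can be duplicated, faked or spoofed.
SPOOFABLE = {"src_ip", "dst_ip", "src_port", "dst_port", "mac"}

def clean(flow):
    """Drop easily spoofed identifiers before the flow reaches the model."""
    return {k: v for k, v in flow.items() if k not in SPOOFABLE}
```

The same filter runs once over the whole dataset at preprocessing time, so the model never even sees the noisy columns.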
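The Other fallback can be expressed as a simple confidence threshold on the model's class probabilities. The 0.80 cut-off below is an illustrative number, not a value from the article:

```python
CLASSES = ["gaming", "voip", "video_chat"]
THRESHOLD = 0.80  # illustrative confidence cut-off, tune on validation data

def label(probs, threshold=THRESHOLD):
    """Map class probabilities to a final label, falling back to
    'other' when no class clears the confidence threshold."""
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best] if probs[best] >= threshold else "other"
```

With this scheme, extra prediction classes can be added to sharpen the model's resolution while the final output still collapses to the three classes we actually care about, plus Other.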
Usually the hardest part is finding the dataset you need. Luckily for us, there is a decent one online to start with from the University of New Brunswick (1). Kaggle is also a good place to check, as they post datasets online for the competitions they run. It is also fairly easy to create your own dataset by using Wireshark to capture your own network traffic. We can use this method to augment what we’re missing. In my next article, I’ll do a deeper dive into how I pre-processed the data to fit my needs. It will get more technical and may include some code, so stay tuned!
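Once you have your own Wireshark capture, pulling out just the first 100 bytes of each packet is straightforward. Below is a sketch that parses the classic pcap format (the 24-byte global header followed by 16-byte per-record headers) with nothing but the standard library; the function names and the synthetic capture are mine, not the article's:

```python
import struct
from io import BytesIO

PCAP_MAGIC_LE = 0xA1B2C3D4  # classic pcap magic, written little-endian

def make_pcap(packets):
    """Build a minimal classic pcap file in memory (for the demo below)."""
    # Global header: magic, version 2.4, thiszone, sigfigs, snaplen, linktype.
    out = struct.pack("<IHHiIII", PCAP_MAGIC_LE, 2, 4, 0, 0, 65535, 1)
    for data in packets:
        # Record header: ts_sec, ts_usec, incl_len, orig_len.
        out += struct.pack("<IIII", 0, 0, len(data), len(data)) + data
    return out

def read_payload_heads(pcap_bytes, max_bytes=100):
    """Yield the first `max_bytes` of each captured packet from a classic
    little-endian pcap file."""
    f = BytesIO(pcap_bytes)
    f.read(24)  # skip the global header
    while True:
        hdr = f.read(16)
        if len(hdr) < 16:
            break
        _, _, incl_len, _ = struct.unpack("<IIII", hdr)
        yield f.read(incl_len)[:max_bytes]
```

For example, a 1400-byte packet comes out truncated to 100 bytes while a 40-byte packet passes through whole, which is exactly the comparison the Less is More rule calls for.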
(1) Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, “Characterization of Encrypted and VPN Traffic Using Time-Related Features”, in Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy.