If you’ve made it this far, congratulations. I have not bored you away yet. Today I will talk about the structure of the Traffic Classification dataset and some of the techniques I will use to pre-process it so it can be fed into a neural network. In general, neural networks take “Tensors” as input (see example above). The goal for this project is to take the current dataset and convert it into a 2-D tensor format.
Matrices and One Hot Encoding
You can think of each row in the Tensor as a network packet and each column as attributes of that packet i.e. [protocol, length, payload]. Any data that is not numerical will have to be “one hot encoded“. In this case, the protocol can be either TCP or UDP. So an encoding will look like [1, 0, length, payload] for TCP packets and [0, 1, length, payload] for UDP packets. This will ensure the protocol attribute can be processed mathematically during training.
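To make the encoding concrete, here is a minimal sketch using pandas (the tiny dataframe and its values are hypothetical, just to show the shape of the output):

```python
import pandas as pd

# Hypothetical mini-dataset with a categorical Protocol column
df = pd.DataFrame({'Protocol': ['TCP', 'UDP', 'TCP'],
                   'Length': [60, 1400, 52]})

# get_dummies replaces Protocol with one numeric column per category
encoded = pd.get_dummies(df, columns=['Protocol'])
print(encoded.columns.tolist())
# ['Length', 'Protocol_TCP', 'Protocol_UDP']
```

A TCP row comes out as Protocol_TCP=1, Protocol_UDP=0, and a UDP row as the reverse, which is exactly the [1, 0] / [0, 1] pattern described above.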
Wireshark Packet Captures (pcapng)
Wireshark has multiple ways to export data, including exporting to a .CSV file, which can then be opened by Microsoft Excel or a similar application (e.g. Google Sheets, LibreOffice Calc). Here the important attributes are Protocol and Length. As I discussed in the previous articles, Source and Destination IP addresses do not reliably contribute to the classification of a packet: they are highly dynamic and can easily be faked or spoofed. Time might be useful if we consider the delta between packets, but I'll table that idea for now. Info does not immediately look useful either, so I will disregard it as well.
The RAW packet capture also contains the Payload, but I couldn't find an easy way to extract that data to the .CSV file. My only alternative was to "follow the stream" and copy/paste the RAW data into a .TXT file. Each row represents a packet, and its contents are the RAW payload in HEX format [0-f].
As you can see, the first line in the .TXT file contains the same Payload data as the first packet in the Wireshark capture. Here is what the final sheet looks like after I copied all the Payload information from the .TXT file into the .CSV file.
Manipulating .CSV files with Pandas, Numpy and Python
Now that we have a base .CSV file to work with, we can start manipulating the data into a 2-D Numpy array that can be turned into a Tensor and then fed into our AI model. First we’ll use Pandas to convert the .CSV file to a Pandas dataframe.
import pandas as pd
import numpy as np
payload_size = 100
dataset_length = 0
payload_size_after_conversion = int(payload_size/2)
# convert .CSV to Pandas dataframe
gaming_dataset = pd.read_csv('Gaming_Dataset.csv')
dataset_length = len(gaming_dataset)
Then we can take this Pandas dataframe and “One Hot Encode” the Protocol values (UDP/TCP) using the pd.get_dummies() function, saving it to our new Numpy array.
# Initialize empty numpy array and start copying over data
# rows = length of .CSV, columns = (Protocol=2 + Length=1 + Payload=Size/2)
array = np.empty([dataset_length, (2+1+payload_size_after_conversion)])
# copy Protocol column after 'one hot encoding'
array[:, 0:2] = pd.get_dummies(gaming_dataset['Protocol'])
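One detail worth noting: pd.get_dummies() orders its output columns alphabetically, which is why TCP lands in column 0 and UDP in column 1. A quick sanity check on a toy Protocol series (since the real CSV isn't shown here):

```python
import pandas as pd

# get_dummies sorts the category columns alphabetically, so with TCP/UDP
# traffic, column 0 is always TCP and column 1 is always UDP
dummies = pd.get_dummies(pd.Series(['UDP', 'TCP', 'UDP']))
print(dummies.columns.tolist())  # ['TCP', 'UDP']
```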
Then we need to copy the Length values over to the Numpy Array.
# copy Length column
array[:, 2:3] = gaming_dataset[['Length']]
Then, before we copy the Payload values over, we need to pad them with zeros so they are all the same length. The size I chose initially is 100, so if a Payload is longer than that, I'll truncate it at 100.
# Padding function: pad with zeros or truncate to the specified length
def padding(string, length=100):
    if len(string) > length:
        string = string[:length]
    elif len(string) < length:
        string = string.ljust(length, '0')
    return string
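To make the behavior concrete, here is the same pad-or-truncate logic inlined on a short hypothetical payload:

```python
# Same pad-or-truncate logic as the padding() function above,
# applied to a hypothetical 4-digit HEX payload with length=8
payload = '1a2b'
length = 8
padded = payload[:length] if len(payload) > length else payload.ljust(length, '0')
print(padded)  # '1a2b0000'
```

A payload shorter than the target gets '0' characters appended on the right; a longer one is simply cut off.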
Then I will split each payload into pairs like '00' or 'FF' and convert each pair to a decimal value. This is a way to reduce or compress the data, which lowers our computation cost.
# split a HEX string into an array of byte pairs ['00', 'FF', ...]
def hexstring_to_bytes(string):
    return [string[i:i+2] for i in range(0, len(string), 2)]

# convert an array of byte pairs to decimal values 0-255
def bytes_to_decimal(byte_array):
    return [int(b, 16) for b in byte_array]
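Here is what those two steps do to a short hypothetical payload of three bytes:

```python
# Hypothetical 3-byte payload: bytes '00', 'ff' and '1a'
hex_payload = '00ff1a'
pairs = [hex_payload[i:i+2] for i in range(0, len(hex_payload), 2)]
values = [int(p, 16) for p in pairs]
print(pairs)   # ['00', 'ff', '1a']
print(values)  # [0, 255, 26]
```

Two HEX characters collapse into one number, which is why a 100-character payload string becomes 50 columns in the array.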
Here is the for loop where we apply all those functions to the Payload values in the Pandas dataframe and copy the results into the Numpy array. Hopefully we will still be able to maintain a high level of accuracy with this compression, but we will need to run it through the model first to find out.
# Pad or Truncate Payload to specified size, split Payload HEX into Bytes and
# convert to decimal. Then add to final array for neural network processing.
for i in range(dataset_length):
    payload = gaming_dataset.at[i, 'Payload']
    payload = padding(payload, payload_size)
    payload = hexstring_to_bytes(payload)
    payload = bytes_to_decimal(payload)
    array[i, 3:3+payload_size_after_conversion] = payload
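Once the loop finishes, the Numpy array can be handed to a framework's tensor constructor. As a sketch (assuming PyTorch here, purely as an illustration, since the article hasn't committed to a framework yet):

```python
import numpy as np
import torch  # assumption: PyTorch; the series hasn't picked a framework yet

# Toy stand-in for the preprocessed dataset: 4 packets, 53 columns
# (2 protocol + 1 length + 50 payload values)
array = np.zeros((4, 2 + 1 + 50), dtype=np.float32)

tensor = torch.from_numpy(array)  # zero-copy conversion to a 2-D tensor
print(tensor.shape)  # torch.Size([4, 53])
```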
I hope the code was easy to follow and understand. If you have any questions, feel free to leave comments below. If you were able to keep up then you are super awesome! In the next article I'll talk about which AI model we will feed this data into and why. Stay tuned!