Whether your data is in packet format (.pcap) or a network flow format (NetFlow), there is probably tons of it. With more network surface area comes more logs. But what out of those logs is useful? Network intrusion detection systems produce high-dimensionality datasets (i.e., tons of rows and columns), and feature selection is essential for reducing noise and bias prior to training. Identifying which features are useful can also help you reduce overhead and spend.

Feature Selection Approaches

Several feature selection approaches exist: supervised, semi-supervised, and unsupervised. The main difference between these approaches lies in how the classifier's outcome is considered during selection.


[Figure: Different feature selection methods]


  • Supervised feature selection methods select features based on the classifier outcome; that is, how the classifier performs.

  • Unsupervised feature selection, in contrast, ignores the outcome during selection. It does not require class label information, which is useful when the ground truth is unknown (i.e., the data cannot be labeled as benign or malicious), but it can make it more difficult to draw conclusions about the original features.

  • Semi-supervised feature selection combines elements of the supervised and unsupervised techniques above and applies when only part of the data is labeled.

Feature Selection vs. Feature Extraction

Dimensionality reduction, or reducing the number of features, is performed through two methods: feature extraction and feature selection. Feature extraction, like principal component analysis (PCA), creates new features based on the original data, but these new higher-level features are hard to understand and make it difficult to draw conclusions about the original data. Feature selection, by contrast, does not create new features; it selects a subset of the existing features, which leads to higher interpretability. Both reduce the number of features to ultimately process.
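To make the contrast concrete, here is a minimal sketch comparing the two with scikit-learn. The synthetic dataset and the number of components and features kept are illustrative assumptions, not recommendations:

```python
# Feature extraction (PCA) vs. feature selection (SelectKBest).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a network dataset: 500 samples, 10 features.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

# Extraction: each output column is a weighted mix of ALL 10 inputs,
# so it no longer maps back to any single original feature.
X_pca = PCA(n_components=4).fit_transform(X)

# Selection: each output column IS one of the original 10 features,
# so the result stays interpretable.
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)
X_sel = selector.transform(X)
print(selector.get_support(indices=True))  # which originals were kept
```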

Supervised Feature Selection Approaches

Filter methods, wrapper methods, and embedded methods are the three primary supervised mechanisms to eliminate features that are not relevant for classification.

  • Filter methods are often statistical techniques applied to a feature to determine its correlation with the label or outcome. Various filter methods exist, such as mutual information or information gain, Pearson's correlation coefficient, and analysis of variance (ANOVA).

  • Wrapper methods, like Recursive Feature Elimination, rely on classifier performance to determine feature value. That is, wrapper methods determine a subset of features that optimize how the classifier performs based on a metric, like accuracy. As you might imagine, exhaustive searches like this are slow.

  • Finally, embedded methods, like the feature importances built into a random forest, combine filter and wrapper methods to reduce the computational time required by traditional wrapper methods.

Filter Methods

Filter methods use statistical techniques to determine how closely a feature correlates with the label or outcome. They rank the features and use the highly ranked ones to train and test a classifier. Because filter methods are more computationally efficient, they are a good fit for security applications, which deal with the storage and retrieval of high-dimensionality datasets, and they are the better choice when training time is a priority. Various filter methods exist, such as mutual information or information gain, Pearson's correlation coefficient, and analysis of variance (ANOVA).

Filter Method Example
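Below is a minimal sketch of a filter method: scoring every feature with mutual information and keeping the top five via scikit-learn's SelectKBest. The synthetic dataset and the choice of k=5 are assumptions for illustration; with real traffic you would score your actual flow or packet features against the benign/malicious label.

```python
# Filter method: rank features by mutual information with the label,
# then keep only the highest-scoring ones.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a flow dataset: 1,000 flows, 20 features,
# only 5 of which actually carry signal.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=42)

# Score every feature against the label and keep the top 5.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Indices of the retained features; with real data, map these back
# to column names (e.g., hypothetical bytes_in, duration, dst_port).
print(selector.get_support(indices=True))
print(X_selected.shape)  # (1000, 5)
```

Because the scores are computed independently of any classifier, this step stays fast even on very wide datasets.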

Wrapper Methods

Wrapper methods rely on classifier performance to determine useful features instead of determining important features agnostically. The classifier is wrapped in an algorithm that searches the feature space for the subset of features that yields the highest classifier performance rather than a general set of features. Wrapper methods perform feature selection using a search strategy, a predictor, and an evaluation function. The predictor is treated as a black box and its performance is used as the objective function. Search algorithms, such as recursive feature elimination, sequential feature selection algorithms, and genetic algorithms, are then employed to determine a subset of features that maximizes classification performance.

Wrapper Method Example
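Below is a minimal sketch of a wrapper method: recursive feature elimination (RFE) wrapped around a logistic regression classifier in scikit-learn. The synthetic dataset and the choice of estimator are assumptions; any classifier that exposes coefficients or feature importances would work.

```python
# Wrapper method: RFE repeatedly fits the classifier, ranks features
# by the magnitude of the fitted coefficients, drops the weakest one,
# and refits until only the requested number remain.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=42)

# The classifier being "wrapped" by the search algorithm.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=5, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of selected features
print(rfe.ranking_)   # 1 = selected; higher = eliminated earlier
```

Note the cost: every elimination round refits the classifier, which is why wrapper methods are slow on high-dimensionality data.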

Embedded Methods

Embedded methods attempt to reduce the computational time required by traditional wrapper methods. Also called intrinsic methods, embedded methods include feature selection as part of the training process itself, without splitting the data into training and testing sets. They streamline the exhaustive search for correlated features used in wrapper methods while still optimizing a metric, usually accuracy, of a specific classifier. The result is a hybrid, two-stage approach that attempts to combine filter and wrapper methods and gain the benefits of both.

Embedded Method Example
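Below is a minimal sketch of an embedded method: a random forest computes feature importances as a side effect of training, and scikit-learn's SelectFromModel keeps the features whose importance clears a threshold. The synthetic dataset and the median threshold are assumptions for illustration.

```python
# Embedded method: feature selection falls out of training the model
# itself, rather than from a separate search loop.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=42)

# Importances are computed while the forest trains.
forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Keep only features whose importance is at or above the median.
selector = SelectFromModel(forest, threshold="median")
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))
print(X_selected.shape)  # roughly half the features survive "median"
```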