# Anomaly Detection in Machine Learning: Classification Algorithms vs Anomaly Detection

**Introduction**

An anomaly simply means something unusual or abnormal. We encounter anomalies in daily life: suspicious activity by an end-user on a network, or a piece of equipment malfunctioning. Sometimes it is vital to detect such anomalies to prevent a disaster. For example, detecting a malicious user can prevent online fraud, and detecting malfunctioning equipment can prevent system failure.

**Classification Algorithms vs Anomaly Detection:**

Machine learning gives us many techniques to classify things into classes; for classification problems we have algorithms such as logistic regression and support vector machines. But these algorithms fail when we must separate anomalous from non-anomalous examples. In a typical classification problem we have a roughly comparable number of positive and negative examples. Suppose we have to decide whether a vehicle is a car or not. Then the share of cars in our dataset may be 40-60%, and likewise the share of non-car examples, so we train our model on a good amount of both positive and negative examples. In an anomaly detection problem, by contrast, we have far fewer positive (anomalous) examples than negative (non-anomalous) examples. The positives may make up less than 5% or even 1% of the data (that is, after all, what makes them anomalous). In such a case a classification algorithm cannot be trained well on the positive examples. This is where the anomaly detection algorithm comes to the rescue.

**Anomaly Detection Algorithm:**

The anomaly detection algorithm is based on probability distributions. Here we use the Gaussian (normal) distribution to model our data. It is a bell-shaped function, written as

Ɲ(µ, σ²)

µ = mean

σ² = variance

σ = standard deviation

If **x** follows a Gaussian distribution with mean µ and variance σ², then

**x** ~ Ɲ(µ, σ²)

The symbol '**~**' is a tilde, read as "**x** is distributed as".

µ, the mean, describes the center of the curve.

σ, the standard deviation, describes the width of the curve.

The full density function is

p(x; µ, σ²) = (1 / (σ·sqrt(2π))) · exp(−(x − µ)² / (2σ²))
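As a quick sanity check, the density formula above can be computed directly (a minimal sketch; the function name is my own):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of x under a Gaussian with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The curve peaks at the mean, where the density is 1 / (sigma * sqrt(2*pi)):
print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))  # ≈ 0.3989
```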

*Image Source: Wikipedia*

Suppose we have a dataset X of m examples and n features:

X = [ x_11 x_12 x_13 ….. x_1n
      x_21 x_22 x_23 ….. x_2n
      ⋮
      x_m1 x_m2 x_m3 ….. x_mn ]

Here x_ij means the j-th feature of the i-th example.

**To find mean µ :**

We do a column-wise summation and divide by the number of examples, m. We get a row matrix of dimension (1 × n).

µ = (1/m) * Σ_{i=1}^{m} x_i ; (here x_i is the i-th example, i.e. the i-th row of X)

= [ µ_1 µ_2 µ_3 ………. µ_n ]

**To find variance σ²:**

σ² = (1/m) * Σ_{i=1}^{m} (x_i − µ)² ; (the square is taken element-wise; x_i is the i-th row of X)

In Octave/MATLAB syntax the above expression can be written as

σ² = sum((X − µ).^2) * (1/m)

= [ σ²_1 σ²_2 σ²_3 …. σ²_n ]

Here we get σ² as a row matrix of dimension (1 × n).
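The two column-wise estimates above map directly to NumPy (a sketch with made-up numbers, assuming NumPy is available):

```python
import numpy as np

# Toy dataset: m = 4 examples, n = 2 features (illustrative values only).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

m = X.shape[0]
mu = X.sum(axis=0) / m                    # column-wise mean, shape (n,)
sigma2 = ((X - mu) ** 2).sum(axis=0) / m  # column-wise variance, shape (n,)

print(mu)      # [ 2.5 25. ]
print(sigma2)  # [  1.25 125.  ]
```

This matches the Octave/MATLAB one-liner: the sum runs down the columns, and broadcasting subtracts µ from every row of X.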

**Algorithm:**

**Training algorithm**

1. Choose features x_i that you think are indicative of anomalies.
2. Split the non-anomalous examples: 60% for the training set, 20% for the cross-validation set, and 20% for the test set.
3. Split the anomalous examples: 50% for the cross-validation set and 50% for the test set.
4. Calculate µ and σ² on the training set.
5. Take an arbitrary value as the threshold ɛ.
6. For every example x in the cross-validation set, calculate P(x) as follows:

   P(x) = Π_{j=1}^{n} p(x_j; µ_j, σ_j²)

   = Π_{j=1}^{n} (1 / (σ_j·sqrt(2π))) · exp(−(x_j − µ_j)² / (2σ_j²))

   = p(x_1; µ_1, σ_1²) * p(x_2; µ_2, σ_2²) * p(x_3; µ_3, σ_3²) * … * p(x_n; µ_n, σ_n²)

7. If P(x) is less than the threshold ɛ, flag x as an anomaly; otherwise not.
8. Calculate the F1 score for the current ɛ.
9. Repeat steps 5-8 for different values of ɛ and select the ɛ with the highest F1 score.

After training, we have µ, σ², and ɛ.
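The whole training loop above can be sketched in Python (the function names are my own; this assumes NumPy and a labeled cross-validation set where y = 1 marks an anomaly):

```python
import numpy as np

def fit_gaussian(X_train):
    """Estimate per-feature mean and variance from (non-anomalous) training data."""
    mu = X_train.mean(axis=0)
    sigma2 = X_train.var(axis=0)  # (1/m) * sum((X - mu)^2), column-wise
    return mu, sigma2

def p_of_x(X, mu, sigma2):
    """P(x): product over features of the univariate Gaussian densities."""
    d = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return d.prod(axis=1)

def f1(y_true, y_pred):
    """F1 score from labels and predictions (1 = anomaly)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, steps=1000):
    """Try many thresholds between min and max P(x); keep the one with best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), steps):
        score = f1(y_cv, (p_cv < eps).astype(int))
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Synthetic demo: normal data near the origin, a few far-out anomalies.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))  # non-anomalous only
X_cv = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
                  rng.normal(6.0, 0.5, size=(5, 2))])
y_cv = np.array([0] * 40 + [1] * 5)

mu, sigma2 = fit_gaussian(X_train)
eps, score = select_epsilon(p_of_x(X_cv, mu, sigma2), y_cv)
print(eps, score)  # the 5 planted anomalies are far from the bulk, so F1 should be high
```

The linear grid of candidate ɛ values is one simple choice; scanning on a log scale can work better when the densities span many orders of magnitude.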

**Algorithm to check whether a new example is an anomaly or not**

- Calculate P(x) for the new example x as follows:

  P(x) = Π_{j=1}^{n} p(x_j; µ_j, σ_j²)

  = Π_{j=1}^{n} (1 / (σ_j·sqrt(2π))) · exp(−(x_j − µ_j)² / (2σ_j²))

  = p(x_1; µ_1, σ_1²) * p(x_2; µ_2, σ_2²) * p(x_3; µ_3, σ_3²) * … * p(x_n; µ_n, σ_n²)

- If P(x) is less than the threshold ɛ, it is an anomaly; otherwise not.
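With the learned µ, σ², and ɛ in hand, the check on a new example is short (a sketch; the parameter values below are made up for illustration, not learned from real data):

```python
import numpy as np

def is_anomaly(x_new, mu, sigma2, eps):
    """Flag x_new as anomalous when P(x) falls below the threshold eps."""
    d = np.exp(-((x_new - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return d.prod() < eps

# Hypothetical learned parameters for two features:
mu = np.array([0.0, 0.0])
sigma2 = np.array([1.0, 1.0])
eps = 1e-4

print(is_anomaly(np.array([0.1, -0.2]), mu, sigma2, eps))  # False: near the mean
print(is_anomaly(np.array([6.0, 6.0]), mu, sigma2, eps))   # True: deep in the tail
```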

**F1 Score:**

The F1 score is an error metric for skewed data, i.e. data where one of the classes (positive or negative) is significantly larger than the other (e.g. more than 90% of examples).

The F1 score is given by:

F1 = 2 * (precision * recall) / (precision + recall)

Where,

precision = True Positives / (True Positives + False Positives)

recall = True Positives / (True Positives + False Negatives)

True Positive => When actual output is +ve and your algorithm predicts +ve

False Positive => When actual output is -ve and your algorithm predicts +ve

True Negative => When actual output is -ve and your algorithm predicts -ve

False Negative => When actual output is +ve and your algorithm predicts -ve

A good algorithm has both high precision and high recall. The F1 score tells us how well our algorithm works.

**The higher the F1 score, the better.**
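Plugging the definitions together (the counts below are made up for illustration):

```python
def f1_score(tp, fp, fn):
    """F1 score computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Say a detector catches 8 of 10 true anomalies (tp=8, fn=2) with 4 false alarms (fp=4):
# precision = 8/12 ≈ 0.667, recall = 8/10 = 0.8
print(round(f1_score(8, 4, 2), 3))  # 0.727
```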

*Image source: Wikipedia*

**When to use Anomaly Detection:**

- When we have a very small number of positive (anomalous) examples.
- When we have many different types of anomalies, so it is difficult to learn from the +ve examples what anomalies look like.

**When to use Supervised Learning:**

- When we have a large number of both +ve and -ve examples.
- When we have enough +ve examples to get a sense of what new +ve examples will look like.

**Choosing what features to use:**

The selection of features affects how well your anomaly detection algorithm works. Select features that are indicative of anomalies. The features you select should be approximately Gaussian. To check whether a feature is Gaussian or not, plot its histogram; the plot should be bell-shaped.

If a feature is not Gaussian, transform it into something more Gaussian using one of these functions:

- log(x)
- log(x + 1)
- log(x + c), for some constant c
- sqrt(x)
- x^(1/3)
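To see which transform helps, apply each candidate to the feature and re-plot its histogram (a sketch with made-up skewed values; the plotting call assumes matplotlib and is shown commented out):

```python
import numpy as np

# A right-skewed feature, e.g. request counts (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 40.0, 150.0])

candidates = {
    "log(x + 1)": np.log(x + 1),
    "sqrt(x)": np.sqrt(x),
    "x^(1/3)": np.cbrt(x),
}

for name, xt in candidates.items():
    print(name, xt.round(2))

# To eyeball Gaussian-ness, plot a histogram of each transform, e.g.:
# import matplotlib.pyplot as plt; plt.hist(xt, bins=30); plt.show()
```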
