Thursday 5 March 2015

Anomaly Detection - Machine Learning

Anomaly Detection Algorithm (Unsupervised Learning)

Anomaly detection is an algorithm that uses a large dataset of healthy examples to learn the typical feature values of a good example. It then uses the learnt statistics to compare against a new example and decide whether that example is an anomaly relative to the commonly witnessed specimens.

As an analyst, one example where you may use this algorithm is to detect anomalous visitors to your website by looking at factors like number of visits per day, number of posts in the forum and typing speed.
Another example is a server room or data center, where you may predict system failure by detecting anomalous activity in features like CPU load, number of disk accesses per second, memory usage and CPU load per unit of network traffic.


The above two figures show a surface plot of a Gaussian variable dependent on two parameters x1 and x2.

Data Representation

The dataset should be a .csv file or a similarly formatted text file, with one example per row and one feature per column.
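As a rough illustration of the expected layout, the sketch below writes a tiny made-up dataset to a temporary .csv file and reads it back. The numbers and the temporary file name are placeholders of mine, and csvread is used only for this demonstration; the script in this post reads the file with load instead.

```matlab
% Hypothetical data: 4 healthy examples (rows) with 3 features (columns).
data = [0.52 110 2.1;
        0.48 105 1.9;
        0.55 120 2.3;
        0.50 108 2.0];

fname = [tempname() '.csv'];   % throwaway file name, for illustration only
dlmwrite(fname, data, ',');    % one comma-separated line per example
x = csvread(fname);            % read the numeric matrix back
delete(fname);                 % clean up the temporary file

[l, b] = size(x);              % l examples, b features
fprintf('%d examples with %d features\n', l, b);
```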

The algorithm assumes that your data-points have a normal (Gaussian) distribution.
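More precisely, the code in this post treats every feature as its own independent Gaussian, which is an extra assumption on top of normality. Using my own notation (m training examples, n features; the code calls these l and b), the fitted model can be written as:

```latex
\mu_j = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}_j, \qquad
\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}_j - \mu_j\right)^2, \qquad
p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}}
       \exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)
```

A new point x is flagged as an anomaly when p(x) falls below the chosen threshold.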
The Matlab code is provided below and is pretty much plug and play. You can also download the code from my GitHub page. Any suggestions and contributions will be appreciated.

A few things to keep in mind...
  • The code can handle any number of features but, as always, they have to be numerically describable.
  • The algorithm asks you to choose a normalized threshold parameter, i.e. a value between 0 and 1. As a rule of thumb, the more particular you are about the spread of your data, the higher your threshold should be. In other words, if only a very narrow margin of error is allowed in your healthy examples, choose a very high threshold. It is sort of like a quality scale where 0 signifies the worst quality and 1 the best.
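To get a feel for the threshold, the sketch below scores a few points against a made-up single-feature model and applies two different thresholds. The mean, variance, points and threshold values are all illustrative assumptions. Keep in mind that the score is a probability density rather than a probability, so its scale depends on how spread out the features are.

```matlab
mu = 10; sigma2 = 4;                 % assumed mean and variance of one feature
pts = [10 12 15 18];                 % hypothetical new data-points

% Gaussian density of each point under the assumed model
p = (1 ./ sqrt(2*pi*sigma2)) .* exp(-((pts - mu).^2) ./ (2*sigma2));

low_thr  = 0.001;                    % lenient threshold (assumed value)
high_thr = 0.05;                     % stricter threshold (assumed value)

flag_low  = p < low_thr;             % logical vector, true = anomaly
flag_high = p < high_thr;

% columns: point, density, flagged at low_thr, flagged at high_thr
disp([pts; p; flag_low; flag_high]');
```

Raising the threshold flags more points as anomalies, which matches the rule of thumb above: a tighter tolerance calls for a higher threshold.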
The code...
####################################
% anomaly detection
fprintf('welcome to the anomaly detection module\n');
fprintf('the dataset that you will use to train the algorithm');
fprintf(' must only have non-anomalous examples \n');
da=input('please mention the dataset file in single quotes: \n');
x=load(da);
[l,b]=size(x);   % l examples (rows), b features (columns)
fprintf('your dataset has %i healthy examples with %i features \n',l,b);
fprintf('training...\n');
mus=zeros(2,b);  % row 1: feature means, row 2: feature variances

for j=1:b
    mus(1,j)=sum(x(:,j))/l;          % mean of feature j (column j)
    mus(2,j)=var(mus(1,j),x(:,j)');  % variance of feature j, see var.m below
end
fprintf('training complete.\n');
% the new data-point must be a row vector with one value per feature
inp=input('please enter new datapoint for checking if it is anomalous :\n');
% named threshold so that the built-in constant eps is not overwritten
threshold=input('please enter normalized threshold probability :\n');
p=prob(inp,mus);

if p<threshold
    fprintf('this data-point is an anomaly\n');
else
    fprintf('this data-point is healthy\n');
end


###############################

Save the following function as 'var.m' in your working directory.

#####################
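function x=var(a,b)
% mean squared deviation of the values in vector b from the mean a
% note: while this file is on the path it shadows the built-in var
c=0;
for i=1:length(b)
    c=c+(b(i)-a)^2;   % accumulate the squared deviations
end
x=(1/length(b))*c;

end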
###############

Save the following function as 'prob.m' in your working directory.

####################
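function p=prob(a,b)
% a: 1-by-n data-point, b: 2-by-n matrix (row 1 means, row 2 variances)
p=1;
c=zeros(1,length(a));
for i=1:length(a)
    % Gaussian density of feature i; 0.3989 is roughly 1/sqrt(2*pi)
    c(1,i)= (0.3989*(1/sqrt(b(2,i))))*exp(-((a(1,i)-b(1,i))^2)/(2*b(2,i)));
end
for j=1:length(a)
    p=p*c(1,j);   % product of the per-feature densities
end
end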
##################
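For a quick non-interactive check of the same idea, here is a minimal vectorized sketch. The toy data, the threshold value and the names score, p_typical and p_odd are placeholders of mine, not part of the code above. It deliberately avoids calling var so that it does not clash with the var.m file from this post.

```matlab
% Toy training set: 5 healthy examples (rows), 2 features (columns).
x = [1.0 20; 1.2 22; 0.9 19; 1.1 21; 0.8 18];

mu     = mean(x, 1);                          % per-feature mean
sigma2 = mean(bsxfun(@minus, x, mu).^2, 1);   % per-feature variance (divide by m)

% Product of independent per-feature Gaussian densities for a row vector pt
score = @(pt) prod(exp(-((pt - mu).^2) ./ (2*sigma2)) ./ sqrt(2*pi*sigma2));

threshold = 0.01;                             % assumed value
p_typical = score([1.0 20]);                  % close to the training data
p_odd     = score([3.0 40]);                  % far from the training data

fprintf('typical: p = %g, anomaly = %d\n', p_typical, p_typical < threshold);
fprintf('odd:     p = %g, anomaly = %d\n', p_odd, p_odd < threshold);
```

The point near the training data scores well above the threshold, while the distant point scores far below it and would be reported as an anomaly.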


