Saturday, 28 March 2015

Machine Learning-Artificial Neural Networks

Artificial Neural Network - With Back-propagation Algorithm(Supervised Learning)

Artificial neural networks  are basically a bunch of computer code that tries to learn from given examples.
A skilled practitioner of machine learning can use artificial neural networks (ANN's) to teach a computer to do virtually anything. Many respectable researchers in the field believe to be the panacea algorithm; the one algorithm for solving every hard problem (in a certain- narrow sense of-course !)
image taken from the web


Neural networks have been around for quite some time now but they  still are a topic on which massive amounts of research is ongoing.

The whole idea behind a neural net is to mimic the brain. The argument is that,  like an average Joe who can master multiple languages and skills during his lifetime using the same set of neurons in his brains (it's not as straightforward as that- but what the hell) the same algorithm which tries to mimic the biological neurons should be able to learn anything if used properly.

ANN based applications are ubiquitous now-a-days and it is an important tool to to add to your repertoire as an analyst.

Writing code for an ANN is relatively straightforward; using it effectively on the other hand is somewhat of an art. It takes time , patience and practice.

I have provided below the python code (Yes python, I guess it was time to switch from MATLAB)  for an artificial neural network which trains using the back-propagation algorithm.  The code is plug and play but as a user, you are expected to know how to use a neural network. The finer points have been mentioned below. You can also download the code from my Github page. As always, contributions to the code are welcome and appreciated.


Additional Requirements- Numpy (I use the Spyder IDE)


The code:
##########################################################################3

import numpy as np

def sgm(x,derivative=False):
if not derivative:
return 1/(1+np.exp(-x))
else:
out = sgm(x)
return out*(1.0-out)

def linear(x,derivative=False):
if not derivative:
return x
else:
return 1.0

def guassian(x,derivative=False):
if not derivative:
return np.exp(-x**2)
else:
return -2*x*np.exp(-x**2)

def tanh(x,derivative=False):
if not derivative:
return np.tanh(x)
else:
return (1.0-np.tanh(x)**2)


class backPropagationNetwork:
layerCount = 0
shape = None
weights = []
layerTransferFunc = []

# constructor function for the class
def __init__(self, layerSize,layerTransferFunc = None):
# layerSize is the Architecture of the NN as (4,3,1)
self.layerCount = len(layerSize)-1 # input layers is just a buffer
self.shape = layerSize

# for the forward pass maybe.
self._layerInput = []
self._layerOutput = []
self._previousWeightDelta = []


for(l1,l2) in zip(layerSize[:-1],layerSize[1:]):
self.weights.append(np.random.normal(scale=0.1,size=(l2,l1+1)))
# add for each weight matrix a matrix in _previousWeightDelta for previous values
self._previousWeightDelta.append(np.zeros(shape=(l2,l1+1)))

if layerTransferFunc is None:
layerTransferFunc = []
for i in range(self.layerCount):
if i == self.layerCount - 1:
layerTransferFunc.append(linear)
else:
layerTransferFunc.append(sgm)
else:
if len(layerTransferFunc) != len(layerSize):
raise ValueError("Incompatible no of transfer functions.")
elif layerTransferFunc[0] is not None:
raise ValueError("no transfer functions for input layer.")
else:
layerTransferFunc = layerTransferFunc[1:]

self.layerTransferFunc = layerTransferFunc


# forward run/pass
def run(self,X):

# no of training examples
m = X.shape[0]

# initialize/ clear out the input and output list from previous run
self._layerInput = []
self._layerOutput = []

# Forward pass
for i in range(self.layerCount):
if i == 0:
layerInput = self.weights[0].dot(np.vstack([X.T, np.ones([1,m])]))            # vstack(a,b) stacks matrix/vector b below matrix/vector a
else:
layerInput = self.weights[i].dot(np.vstack([self._layerOutput[-1], np.ones([1,m])]))

self._layerInput.append(layerInput)
self._layerOutput.append(self.layerTransferFunc[i](layerInput))

return self._layerOutput[-1].T


def trainEpoch(self,X,Y,trainingRate ,momentum ):
# trains the network for one epoch
delta = []
m = X.shape[0]

# forward pass before we can compute the gradient by back propagation
self.run(X)


for i in reversed(range(self.layerCount)):                                             # reverse as the backpropogation work in reverse order

if i == self.layerCount-1:                                                         # if this is for the preactivation at the output
outputDelta = self._layerOutput[i] - Y.T                                      # this is also the gradient at output if we take the least square error function
error = np.sum(outputDelta**2)                                                # sum of all the elements along all dimensions
delta.append(outputDelta*self.layerTransferFunc[i](self._layerInput[i],True))                  # '*' operator is for coordinate wise multiplication
else:
deltaPullback = self.weights[i+1].T.dot(delta[-1])  # this is the gradient at the activation of the hidden layer (i+1), note that i = 0
 # is for hidden layer 1.
delta.append(deltaPullback[:-1,:]*self.layerTransferFunc[i](self._layerInput[i],True))         # this is the gradient at the preactivation at hidden layer (i+1)

for i in range(self.layerCount):
deltaIndex = self.layerCount - 1 - i # delta[0] is preactivation at output and so on in backward direction
if i == 0:
layerOutput = np.vstack([X.T,np.ones([1,m])]) # for W0 the delta (preactivation) is input layers
else:
layerOutput = np.vstack([self._layerOutput[i-1],np.ones([1,self._layerOutput[i-1].shape[1]])]) # _layerOutput[0] contains the activation of the hidden layer 1 and so for Wi we need _layerOutput[i-1]

weightDelta = np.sum(layerOutput[None,:,:].transpose(2,0,1)*delta[deltaIndex][None,:,:].transpose(2,1,0),axis=0)

weightDelta = trainingRate * weightDelta + momentum*self._previousWeightDelta[i]
self.weights[i] -= weightDelta
self._previousWeightDelta[i] = weightDelta

return error # incase useful




if __name__ == "__main__":
 
    print ("welcome to neural network")
    data_file=raw_input("please mention the name of your training set file\n")
 
    d=np.genfromtxt(str(data_file),delimiter=',')
    z=np.shape(d)
    print ("your data set has %i datapoints a\n"%z[0])
    print ("And there are %i columns\n"%z[1])
    nx=int(raw_input("how many of the %i columns are inputs ?\n"%z[1]))
    ny=z[1]-nx
    x=[]
    y=[]
    for p in range(z[0]):
        x.append(d[p,0:nx])
        y.append(d[p,nx:])
    X = np.array(x)
    Y= np.array(y)
    net_def=[int(nx)]
    print net_def
    qwe=raw_input("how many epochs ?\n")
    hlc=int(raw_input("how many hidden layer do you want ?\n"))
    print int(hlc)
    for n in range(int(hlc)):
        a=raw_input("how many units in layer %s\n"%(n+2))
        net_def.append(int(a))
    net_def.append(ny)
    m=float(raw_input("please define momentum:\n"))
    alph=float(raw_input("please define learning rate:\n"))
    print("You can choose from a host of transfer functions :-sigmoid(sgm),linear(linear),gaussian(gaussian),tanh(tanh)\n")
    tff=str(raw_input("choose on of the transfer functions(mentioned in paranthesis above)\n"))
    tup=tuple(net_def)
    layerTransferFunc = [None]
    for n in range(len(tup)-1):
        layerTransferFunc.append(str(tff))
    bpn = backPropagationNetwork(tup)
    maxIteration = int(qwe)
    minError = 1e-5
 
    for i in range(maxIteration+1):
     
     
        err = bpn.trainEpoch(X,Y,momentum=float(m),trainingRate=float(alph))
     
        if i%100 == 0:
            print "iteration {0}\t error: {1:0.6f}".format(i, err)
        if err <= minError:
            print "Minimum error reached as iteration {0}".format(i)
            break
    Ycomputed = bpn.run(X)
    for i in range(np.shape(X)[0]):
        print "Input: {0}\n Output: {1}".format(X[i],Ycomputed[i])#tup
    print(bpn.shape)
    print(bpn.weights)
 


########################################################################



How to use : This is the art part.

First of all, your data should be munged properly on a spreadsheet with normalized decimal values and binary tags and subsequently be saved as a .csv file or a text file.

Next, as always, all your initial columns must represent the inputs to the neural network and the rest have to represent the corresponding output (remember that we are doing supervised learning). Every row is one data point i.e. one training example. The program will ask you for these when you run it and based on that, it will decide the number of nodes in the input and output layer.

It will then ask you about the number of hidden layers and the number of nodes in each layer.
You will also be prompted to define the learning rate and momentum; both of which are values between 0 and 1.

You will also be asked to decide the number of epochs . Epochs is just a fancy way of depicting the number of times the algorithm uses your entire dataset to train the network. It is usually in the order of millions.

The learning rate is somewhat analogous to the size of steps that you want to take while trying to solve an optimization problem and the momentum can be intuited as a coefficient that decides the amount of impact that the already learnt parameters will have on the subsequent learning steps. You may also need to decide if you have sufficient data. In machine learning , more data is almost never a bad thing (if you are willing to compromise on the computational cost front).

Deciding the network architecture (number of hidden layers and their nodes) and cleverly choosing inputs and outputs along with fine-tuning your learning rate and momentum parameters may take some time and experimentation. It is all about finding the right balance . After the error converges to a level of you satisfaction, you can use the network for doing further prediction/classification/regression etc.

All you have to do is declare your new input vector as a numpy array (lets call it "R") and write the folloing in the command window :-
#################
Y=bpn.run(R)
print Y
###################

you can also print the weight matrix by using the following command:-
##########
print weights
###########

Acknowledgements:
I must thank the following two people who unknowingly helped me create this post:
This Amazing Person's Youtube tutorial and
This like minded fellow programmer's Github page
                                         

Monday, 23 March 2015

Paralytic Analytics-Epidemics

How To Create An Epidemic -Using Logic and Psychology 


                                                     THE DEVIL IS IN THE DETAILS

Most people think that the job is of an analyst is  just to passively analyse the information that he gets and produce useful insights using that. Well that is only half the truth. Sometimes you may be asked to proactively exploit your understanding of a system to create the desired effect. 

In his book The Tipping Point: How Little Things Can Make a Big Difference by Malcom Gladwell the author has describe how epidemics, in every sense of the word are created. Epidemics ranging right from the flu to a fashion or opinion epidemic. He, in very vivid detail describes what it takes to make something go viral; and not just in the context of the internet.

Here is a resipy of a strategy for creating a useful epidemic.
Lets look at it from the eyes of a Mathematician and Game theorist.

The best way to learn is by example. Suppose you the owner of a alternate energy solutions company which specializes in solar micro-grid based solutions. Now like any other business, your main aim is to make the maximum possible profits. And there are four main ways of making a lot of money while selling something.
Either...


  • you should have a monopoly over the sale of the goods/services that you are selling or
  • your product should be the best in the market in which case it can be difficult to offer a competitive price , or 
  • you have to sell something extremely precious and it's price should not be regulated by any organisation, that means that you cant sell oil or gold but one of the many viable options is art. But sadly you ended up in the energy business so that is off the table. And finally
  • you should sell something in huge volumes with a small margin for profit. That way you increase your probability of being liked by a larger crowd of people and the probability of repeat customers because of the competitive pricing that you are able to offer.

Well , you may have the best product in the market , in which case you don't need to read further but we are looking for a way of creating a successful business independent of the kind of product that you are offering. Of course , product quality does matter in the long run, but that is not the only factor.

That narrows down your options to just one- maximizing your sale volume.
So how to maximize sale volume you ask?
Well, the answer to that is simple (but not trivial); first o fall you need find a target customer base.

Your target customer/client base is a group should have the following properties:


  • They should be one of the majority sections of the society.(Because you want a lot of clients)
  • Should be able to afford your services/product and will benefit from it.

That narrows it down to everybody starting from the middle class and above who lives in a house and even other companies that have office buildings of any sort. Now one other thing that you need to be aware of is the limitations of your product. you cant go around making false promises. And you need to tell your clients about it in the most strategic way. Something on the lines of "No, no Sir, your wife is not fat at all, she is just Not Thin".

One obvious client base in your target city for business lives in the suburbs and housing colonies. So basically you are looking for similar people (socially and economically) who live in close proximity.

Now, the "game" begins...

The book Tipping point describes three main factors that are important for triggering an epidemic

1. "The Law of The Few". Unlike what most of you may be thinking, epidemics depend on a select group of very few people. They are of the following 3 types :-


  • The Connectors: These are the people in a community who know large numbers of people and who are in the habit of making introductions. A connector is essentially the social equivalent of a computer network hub. They usually know people across an array of social, cultural, professional, and economic circles, and make a habit of introducing people who work or live in different circles. They are people who "link us up with the world...people with a special gift for bringing the world together". They are "a handful of people with a truly extraordinary knack for making friends and acquaintances".Malcolm Gladwell characterizes these individuals as having social networks of over one hundred people. Gladwell attributes the social success of Connectors to the fact that "their ability to span many different worlds is a function of something intrinsic to their personality, some combination of curiosity, self-confidence, sociability, and energy".
  • The Mavens: These are "information specialists", or "people we rely upon to connect us with new information". They accumulate knowledge, especially about the marketplace, and know how to share it with others. Gladwell says "A Maven is someone who wants to solve other people's problems, generally by solving his own". According to Gladwell, Mavens start "word-of-mouth epidemics" due to their knowledge, social skills, and ability to communicate. As Malcolm Gladwell states, "Mavens are really information brokers, sharing and trading what they know".
  • The salesmen: These are "persuaders", charismatic people with powerful negotiation skills. They tend to have an indefinable trait that goes beyond what they say, which makes others want to agree with them.

2. The stickiness factor. The specific content of a message that renders its impact memorable. This may include the use of catch phrases or slogans. Something on the lines of the Tata Sky catch phrase , "Isko laga dala to life jhingalala"


3. The power of context. Human behavior is sensitive to and strongly influenced by its environment. As Malcolm Gladwell says, "Epidemics are sensitive to the conditions and circumstances of the times and places in which they occur". This will be better understood as you read further.

Malcom Gladwell is not the only one teaching about how you can make things more popular instantly. An interesting way of looking at this methodology can also be seen in one of the lectures on "Learning:Near misses, Felicity Conditions" by MIT's Patrick H Winston in his artificial intelligence lecture series. He proposes the method of inducing felicity to any product using the five factors represented on a star.(The 5 S's)



Here is a breakdown of the five features that your product/idea (basically anything that you are trying to sell) should have:


  • Symbol: This is as simple as it looks. Your product/idea should be represented by a visual handle like a brand icon that reminds people of your product/idea every time they see it. This accounts for a vast majority of the crowd who are visual thinkers.
  • Slogan: This is again a sensory handle for the people to be reminded of your product/idea . The slogan you create should have the stickiness factor in it, ergo, once people hear it, they should not be able to take it off their mind. This is where Gladwell and professor Winston meet.
  • Salient Thought: Well this is a bit tricky to understand and implement, but when you think about the salient feature of your product/idea , it usually refers to the feature that really sticks out without it being directly mentioned. It basically pertains to the purpose of the particular endeavor. For example; what is the salient thought of a Fort ? It is security during a siege. This needs to be clearly conveyed by your selling pitch.
  • Surprise: This pertains to the uniqueness of your idea/product. The surprise factor usually demands a deviation from the norm in such a way that it has never been seen or expected before. All this may take a certain level of showmanship on your part. And finally
  • Story: See, people just LOVE a story. Studies have shown a heightened retention of stated facts in an average person if they are presented to them in form of a story. Stories are sort of a mnemonic device for remembering facts.

So, there you have it, if you can execute the above five steps perfectly, it is virtually guaranteed that your product/idea will be widely appreciated.

And now you may ask that all this may be good in theory, but how do you really go about doing all this.
To understand that, we come back to our solar energy solutions example.

Most of the features suggested by Prof. Winston may be handled by marketing and or consulting firms, but we will now see what can be done on a human level for ensuring success in your business.

The first thing you need to arrange for are the salesmen . They are the ones that will be doing most of the legwork for you.
Once you have your guy and I am trusting your judgement of character for choosing one, you move onto the next step.
You have to find the connectors and the mavens and that to simultaneously. This is when you put your salesmen to work.

Here is how you identify your connectors:
They are the famous people who live in your targeted suburb. They may be the heads of the community, local political leaders and pseudo-celebrities. But keep in mind that they cannot be too different from your target audience, they just have to be respected members of the society whose abode is frequented by very many. These people will form the focal point of your expansion campaign.

It is these people that you need to sell your product to first and do so in such a way that it is amply conspicuous . For example, although it is normal practice to just place solar panels on the roof of the building and be done with it but in this case you may need to make an exception. The installation (even if temporary) need to be visible for miles around. Maybe you can install the solar panels on elevated platforms on the roof for no added cost for the period of you epidemic campaign and later "upgrade" to the more safer mode of just placing the panels on the roof floor. The purpose of this " peacock display " is to force the visitors to the connector's place to talk and ask about their recent acquisition. This will help to trigger the first kind of epidemic that you need to generate - The word of mouth epidemic.

Now you need to find your mavens:
It is preferable that your connectors double as mavens, i.e. they display a keen knowledge about the benefits of the recent addition to their home/office. But you don't need to worry about that because people like the feeling that they have made a good decision after it becomes irreversible. So you can rely on your connectors for becoming maven who have expert advice to share which was provided by your salesmen in the first place.

The second principle of psychology that you can exploit for your epidemic campaign id the principle of social proof and authority. So apart from the newly wise connectors , you need to find people who hold a certain sway over their neighbors and society by virtue of their technical knowledge about your product. These may be engineers , owners of technical businesses or related to anything even tangentially related to technology and not necessarily your technology. These people are prone to give free advice as a tribute to their astute technical knowledge. you need to first convince these people about the greatness of your product, because these are the people to whom their neighbor goes for advise after the visit to the connector's place to validate his prospective investment in your product.

It is like so :- if a doctor (assume he is a pervert) asks you to strip in-front of him for the medical checkup though you know that it is only a bruise on your hand and does not warrant a full body inspection, you will still consider the idea of consenting to strip because after all , he is a doctor, and knows better, so you better strip or die because of the mysterious prognosis he may not find if you don't strip. So you depend on his authority and trust him to have only the best of intentions. So, you believe anything he says.

This is what you are trying to achieve through converting the techies in the target crowd into clients.
The other psychological weapon that you are using is of social proof. "If all the important people are buying this product , then it must be awesome. So I am going to buy it too".

The other socio-psychological effect that is desirable is that of forced stickiness and it also borrows concepts form the concept of power of concept. This you can achieve by conducting an aggressive yet cheap ad campaign wherein people see you ads everywhere they go. This may include inserting flyers in the daily newspaper that everybody subscribes to and sticking them in places like building elevators and anywhere else that you can imagine. This will help boost your stickiness factor by exposing people to your brand symbol and slogan again and again.

This is how you influence a large section of society using a very few people. After this it is just about following the savage civilization approach. You move from suburb to suburb and from city to city , sucking up clients and leaving behind a trail of profit in your wake.




So there you have it, a full blown cook book approach to creating your own epidemics !

Understand, that these strategies are universally applicable with slight adjustments with respect to context.


Thursday, 5 March 2015

Anomaly Detection - Machine Learning

Anomaly Detection Algorithm (Supervised Learning)

Anomaly detection is an algorithm that uses a huge dataset of healthy examples to learn about the features of the average good example and then uses the metrics learnt to compare with a new example and decide as to whether the new example is an anomaly to the commonly witnessed specimens.

As an analyst , one example where you may use this algorithm is to detect anomalous visitors to your website by looking at factors like number of visits per day, number of posts in the forum and typing speed. 
Another example may be in a server room/ data center where you may predict system failure by detecting anomalous activity in features like CPU load, number of disc accesses per second, memory usage and CPU load per unit network traffic.


The above two figures show a surface plot of a Gaussian variable dependent on two parameters x1 and x2.

Data Representation

The dataset should be in the form of a .csv file or a similarly syntaxed text file.

The algorithm assumes that your data-points have a normal (Gaussian) distribution.
The Matlab code has been provided below. It is pretty much just plug and play. You can also download the code from my Github page. Any suggestions and contributions will be duly appreciated.

A few things to keep in mind...
  • The code can handle any number of features but as always , they have to be numerically describable.
  • The algorithm asks you to choose a normalized threshold parameter i.e. a value between 0 and one. The general rule of thumb is that the more particular you are about the spread of your data, the higher should be your threshold. In other words, if a very narrow margin of error is allowed in your healthy examples, then you should choose a very high value of threshold. It is sort of like a quality scale where 0 being the worst quality and 1 signifying the best quality.
The code...
####################################
% anomaly detection
fprintf('welcome to the anomaly detection module\n');
fprintf('the dataset that you will use to train the algorithm');
fprintf(' must only have non anomalous examples \n');
da=input('please mention the dataset file in single quotes: \n');
x=load(da);
s=size(x);
l=s(1,1);
b=s(1,2);
fprintf('your dataset has %i healthy examples with %i features \n',l,b);
fprintf('training...\n');
mus=zeros(2,b);

for i=1:l
    for j=1:b
        mus(1,j)=sum(x(j,:))/l;
        mus(2,j)=(1/l)*var(sum(x(j,:))/l,x(j,:));
    end
        






end
fprintf('training complete.\n');
inp=input('please enter new datapoint for checking if it is anomalous :\n');
eps=input('please enter normalized threashold probability :\n');
p=prob(inp,mus);

if p<eps
    fprintf('this dat-point is an anomaly\n');
end

if p>eps
    fprintf('this data-point is healthy\n');
end


###############################

Save the following function as 'var.m' in your working directory.

#####################
function x=var(a,b)
for i=1:length(b)
    c=(b(1,i)-a)^2;
end


x=(1/length(b))*c;

end
###############

Save the following function as 'prob.m' in your working directory.

####################
function p=prob(a,b)
p=1;
c=zeros(1,length(a));
for i=1:length(b)
    c(1,i)= (0.3989*(1/sqrt(b(2,i))))*exp(-((a(1,i)-b(1,i))^2)/2*b(2,i));
end
for j=1:length(a)
    p=p*c(1,j);
end
end
##################



Tuesday, 3 March 2015

K-Nearest neighbors Algorithm - Machine Learning

K-Nearest Neighbors Algorithm-Vector style ( supervised learning)


The K-nearest neighbor algorithm is a supervised learning algorithm where the computer learns how to classify new data-points based on the training examples that you give it.

This algorithm can handle any number of categories unlike the logistic regression classifier .

Let us see what it does using an example...
Suppose you have to classify data into multiple categories based on any number of features. For illustrative purposes, we shall look at just two features.
The figure shows clusters of four categories (red ,green ,blue and brown) and their centroids plotted against two arbitrary features.

Now we need to find out which category the new point belongs to. This can simply be done by comparing the vector component of the new point on the centroid vectors of the four clusters. The cluster centroid on which the new point's vector subtends the largest component can safely be assumed to be the parent category.

For this algorithm, I am giving the Matlab code below. You can also download from my Github page.

Data format

  • Your training data can be in the form of a .csv or a similarly syntaxed text file.
  • Your complete dataset should consist of features which are NUMERICALLY DESCRIBABLE and only the last column of your dataset must contain the decimal category label.
  • You should know the total number of categories that your dataset depicts.
The code...


###############################################
% k nearest neighbours vector style

fprintf(' welcome to the vector style k nearest neighbour module\n');
da=input('please mention the dataset file in single quotes :\n');
d=load(da);
s=size(d);
l=s(1,1);
b=s(1,2);
fprintf('there are %i datapoints with %i dimensions\n',l,b-1);

K=input('how many categories are depicted by your dataset?');
x=d;

group=zeros(1,b-1);
centroids=zeros(b-1,K);
 temp_avg=zeros(b-1,K);
 count=0;
%determining the centroid vector for each class
for j=1:K
for i=1:l
    if x(i,b)==j
         
        temp_avg(:,j)=temp_avg(:,j)+x(i,1:b-1)';
        count=count+1;
        group(1,:)=(1/count).*temp_avg(:,j);
        centroids(:,j)=group;
        
    
    
    
    
    
    
        
    end
end
end

fprintf('the following are the centroids of the %i categories\n',K);
disp(centroids);
test=input('you can now classify new vectors.\n please enter the new vector :');
component=zeros(1,K);
for i=1:K
    component(1,i)=magnitude(test)*cosine(test,centroids(:,i));
end

 [max_value, index] = max(component(:));
 y=index;
 fprintf('the vector that you want to test belongs to category : %i\n;',y); 



##################################################

save the following function as 'magnitude.m' in the working directory.

############################################
function x = magnitude(a)
x=sqrt(sum(a.^2));
end

##########################################

save the following function as 'cosine.m' in the working directory

####################
function x=cosine(a,b)

c=sqrt(sum(a.^2));
d=sqrt(sum(b.^2));
x=dot(a,b)/(c*d);

end

###################

K-means clustering -Machine Learning

K-Means algorithm for clustering (unsupervised learning)

The k-means clustering algorithm is an unsupervised learning algorithm where the computer tries to segregate an unlabeled dataset into clusters of similar objects.

The number of clusters are user-defined.

Let us understand how it works using an example...

Suppose you are an analyst for a T-shirt company and you want to decide the average measurements for different size categories like extra small, small, large, extra-large etc.
All you have is the following data.
Based on this given data, you can intuitively make the following clusters.

This can be accomplished using the K-means clustering algorithm.
Apart from segregating the unlabeled dataset into clusters, the algorithm also gives you the centroid of the each cluster so that you can get describe an "average member" of the cluster.

I have provided the Matlab code below . you can also download it from my Github page.

The code is plug and play and can handle any dimensional vectors.

Please note:

  • Your data should be in a .csv or a similarly syntaxed text file
  • All your features should be numerically describable; something like so...



The code...


############################################
% unsupervised k means algorithm%

fprintf('welcome to unsupervised clustering module (k-means)\n');
da=input('please enter the name of the datafile in single quotes\n');
d=load(da);
s=size(d);
l=s(1,1);
b=s(1,2);
fprintf('there are %i datapoints with %i dimensions\n',l,b);
K=input('how many clusters do you want to create ?\n');
%ite=input('how many iterations do you want to run?\n');
centroids=rand(K,b);
%x=normalize(d);
x=d;
dist_mat=zeros(l,K);
clusterlabel=zeros(l,1);
newcent=zeros(K,b);
while 1
    
    newcent=centroids;
for i=1:l
    for j=1:K
        dist_mat(i,j)=euclid_dist(x(i,:),centroids(j,:));%find the euclidian distance between the chosen point and all the randoly initialized centroids
        
        
    end
  
    
   
end


 % disp(dist_mat);
 h=zeros(1,K);
 %assign each data point to nearest centroid cluster
 for i=1:l
     h=dist_mat(i,:);
     %disp(h);
     [max_value, index] = min(h(:));
     clusterlabel(i,1)=index;
 end
 disp(clusterlabel);
 temp_avg=zeros(K,b);
 count=0;
 %update all the centroids
 for i=1:l
     
     for j=1:K
         if clusterlabel(i,1)==j
             temp_avg(j,:)=temp_avg(j,:)+x(i,:);
             count=count+1;
             centroids=(1/count).*temp_avg;
             
         end
         
         
     end
     
     
     
 end
 if newcent-centroids==0
     break;
 end
end
fprintf(' the final centroids are :\n');
 disp(centroids);
 fprintf('\n');
 fprintf('the following clusters were formed...\n');
 for i=1:K
    fprintf(' cluster %i\n',i);
    for j=1:l
       if clusterlabel(j,1)==i
           fprintf('data-poiint %i \t',j);
           disp(x(j,:));
       end
    end
        
        
        
 end
     
     
##########################################


you will also need the following function; save it as 'euclid_dist.m' in your working directory

###############
function c = euclid_dist(a,b)
e=((a-b).*(a-b)).^0.5;
c=sum(e);
end
##################

Sunday, 1 March 2015

Logistic regression-Machine Learning

Logistic Regression-Classification Algorithm

Logistic regression is a technique for classifying points of dataset into two separate categories. It is a kind of supervised algorithm where you make the computer learn how to differentiate between  two categories using a huge set of examples and based on that the computer should be able to classify a new data point into one of the two categories.

In all its technicalities, Logistic regression is a lot like linear regression in the sence that you decide the membership of a data-point based on a multitude of factors.

Let us try to understand this using an example. Suppose you want to classify tumors based on their size and cell density as to whether they are malignant or benign. Then you draw a graph based on the data that you have. You may get something like this...

The above graph depict malignant (red) and benign (blue) tumors based on the tumor size and cell density.
The green curve is a decision boundary that separates the two categories. Our job as an analyst is to find a way of creating the most suitable boundary curve.  Also, unlike this two dimensional figure, the factors affecting our categorization can be many more, which means that you may need a multi-dimensional space to represent this kind of data. Although it is not possible to do so graphically, maths can handle an infinite number of dimensions. That is when Logistic regression comes into picture.


As an analyst you do not need to worry though, Below I have provided Matlab code for doing this with a dataset with an arbitrary number of dimensions.

All you need to do is to format your data in a .csv or a text format where all the initial columns numerically depict the factors affecting your classification decision and the last column must be classification based on binary numbers .i.e. if the tumor is malignant, label it '1' or if the tumor is benign, label it '0'.

After that all you need to do is execute the code.

There are two things you need to keep in mind before using this algorithm.

  • It can handle classification tasks where there are only two categories
  • All data should be represented in numerical format in the same base except the output data. That has to be binary.
If you want to use this algorithm for solving problems that have more than one classification categories, you can still use this algorithm by the one against all approach. In that case you need to choose one category at a time and compare it to all the others as the binary complement of that category.
Something like this...

Using the code

The algorithm uses gradient descent to optimize its parameters.

This means that while running, will be prompted to define the learning rate and the number of iterations that you want to tun your algorithm for.

Typically, the learning rate is of the order 0.01 but it is recommended that you experiment with it. also, data sets of different sizes may require different number of iterations. Typically , the number of iterations are double the number of data points in your dataset.
NOTE: if your results are blowing up to infinity, that means your learning rate is too big.

How to interpret the results?

The output of the algorithm will be a vector 'T'. You have to dot product this vector (except it's first term) with your new feature vector which needs to be classified. Then add the first term of the 'T' vector to this dot product result. Your answer will be a number between 0 and 1. Suppose that you get 0.7. This means that there is a 70% chance that the given data point belongs to the category labeled 1.


Data representation

Here is the Matlab code...
You can also download this code from my Github Page
#####################################################
%%logistic regression
% regularization pending
fprintf('welcome to logistic regression');
fprintf('\n');
data=input(' mention the name of data set in single quotes \n');
d=load(data);
disp(d);
s=size(d);
l=s(1,1);
b=s(1,2);
fprintf('number of data points = %i',s(1,1));
fprintf('\n');
fprintf('the number of features is = %i',s(1,2)-1);
X=d(:,1:s(1,2)-1);
X=normalize(X);
y=d(:,s(1,2));
%disp(X);
theta=rand(s(1,2),1);
temp=zeros(s(1,2),1);
fprintf('\n');
alpha=input('please define learning rate');
n=input('define number of iterations');
fprintf('\n');
fprintf('running logistic regression...');
x=[ones(s(1,1),1),X];
for i=1:n
    
    
    for j=1:s(1,2)
       % fprintf('%i',theta(j,1));
        temp(j)=(theta(j)-(alpha*(1/l)*(regderivative(x,y,theta,j))))
        theta(j)=temp(j);
        
        
    end
   
end
T=theta;
######################################################

You will also need the folloing functions. 
Save the following function as 'regderivative.m' in the working directory.


#########################################################
function z=regderivative(x,y,theta,j)


z=0;
for i=1:length(y)
    z=z+(sigmoid(x(i,:)*theta)-y(i))*(x(i,j));
end

end
########################################################

Save this as 'normalize.m' in the working directory

########################################################
function z=normalize(x)


s=size(x);
l=s(1,1);
b=s(1,2);
for i=1:b
    for j=1:l
        x(j,i)=((x(j,i)-mean(x(:,i))))/std(x(:,i));
    end
end

z=x;
end
############################################################

save this function as 'sigmoid.m' in your working directory

############################################
function y=sigmoid(x)
y = 1/(1-exp(-x))
#############################################

Linear Regression-Machine Learning

Multi-variate linear regression
Regression in most contexts is just a fancy way of interpolation. It is a frequently used in estimation and prediction scenarios. It is one of the favorite tools of a data analyst.

Let’s see how it works by looking at some examples...
Suppose if you have the following data.

It is a 2 dimensional plot depicting the correlation between the population of a city and the profits a chain of retail stores makes per year.
But, what if using this data, you wanted to know the profit (predicted approximate) a given company would make for a city that has a population of 37,890?

This is a very realistic problem where the decision of whether or not to open a retail outlet in that city would depend upon your prediction as an analyst. So how would you give a reasonable estimate?

Let’s see…

If there was some way of fitting a straight line through the data points such that in order to predict for any given population, you would just need to see the ‘Y’ co-ordinate of that point; that would be awesome!

Something like so…

But, how to decide which line would be the best? After all, there is a possibility of drawing infinite such lines!

This is when linear regression comes to the rescue. The whole idea behind linear regression is to be able to draw such a line that it “best fits” the data that you have in order to give more reliable predictions for any new piece of data. One way of doing this is by trying to fit a line in you data in such a way that the Euclidian vertical distance of the line from all your data points is minimized.


Suppose that the red points depict your data, you want to draw the green line in such a way that the combined length of the blue lines is minimized. This algorithm is also known as the least squares method.

So far so good. But what about the case when there are more than one factor that decide the overall profits of the retail chain?
Factors like shop size, number of items sold,  gender ratio in the area, literacy rate( it might be a book store) etc. In theory, there can be infinite such factors that contribute to the sales figures. 

What I have demonstrated so far can be called univariate linear regression where the output  depends only on one factor. Multivariate linear regression is the case when there is more than one (numerically describable) factor which affects the output.

In such scenarios, you cannot just plot a hyper-dimensional graph and try to fit a straight line but you can definitely simulate that mathematically and get the right results.



I am going to give the MATLAB code for this over here but you can also download it from my github page.

Any improvements and contribution to the code is greatly appreciated.


So, suppose that you have some data using which you want to predict for more incoming data, well here is how you can use the following matlab code for doing your regression analysis...


The code first.
#######################################################################
%%multivariate leniar regression
fprintf('welcome to  multivariate linear regression');
fprintf('\n');
data=input('please enter the name of  your data fie in single brackets \n');

d=load(data);
%the data set must have all the begining columns as features and the last column as output

disp(d);
s=size(d);
l=s(1,1);
fprintf('number of data points = %i',s(1,1));
fprintf('\n');
fprintf('the number of features is = %i',s(1,2)-1);
X=d(:,1:s(1,2)-1);
X=normalize(X);
y=d(:,s(1,2));
%disp(x)
%disp(y)
theta=rand(s(1,2),1);
temp=zeros(s(1,2),1);
fprintf('\n');
alpha=input('please define learning rate');
n=input('define number of iterations');
fprintf('\n');
fprintf('running linear regression...');
x=[ones(s(1,1),1),X];
%disp(x)
%disp(x*theta);
for i=1:n
    
    
    for j=1:s(1,2)
        fprintf('%i',theta(j,1));
        temp(j)=(theta(j)-(alpha*(1/l)*(derivative(x,y,theta,j))))
        theta(j)=temp(j);
        
        
    end
   
end
T=theta;
 ##################################################################


You also need the following function. save it as 'derivative.m' in the same directory as the previous code.


######################################
function z=derivative(x,y,theta,j)

z=0;
for i=1:length(y)
    z=z+(x(i,:)*theta-y(i))*(x(i,j));
end

end
#####################################
This is the second function that the code needs to run. save it as 'normalize.m'


##########################################
function z=normalize(x)


s=size(x);
l=s(1,1);
b=s(1,2);
for i=1:b
    for j=1:l
        x(j,i)=((x(j,i)-mean(x(:,i))))/std(x(:,i));
    end
end

z=x;
end
############################################


The algorithm uses gradient descent to optimize its parameters.

This means that while running, will be prompted to define the learning rate and the number of iterations that you want to tun your algorithm for.

Typically, the learning rate is of the order 0.01 but it is recommended that you experiment with it. also, data sets of different sizes may require different number of iterations. Typically , the number of iterations are double the number of data points in your dataset.

NOTE: if your results are blowing up to infinity, that means your learning rate is too big.

How to format input data?
The input data should typically be a .csv file or a similarly syntaxed text file in which all the columns except the last one represent the features on which your prediction depends and the last column denotes the corresponding result.

How to interpret the result?

The output of the code will be a vector 'T' which is the parameter vector that you need to multiply(dot product) with your new feature vector to predict the output. keep in mind that the first element of 'T' must be multiplied with unity and the rest should be multiplied with your new feature vector.