Sunday, 5 April 2015

K-Means Clustering - Machine Learning

K-Means Clustering in Python (Unsupervised Learning)

K-means clustering is an unsupervised learning algorithm that takes an unlabeled dataset and divides it into a user-defined number of clusters. Each cluster consists of data points that are more similar to each other than to the members of the other clusters.
It can be thought of as a crude form of pattern recognition.
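As a rough illustration of the idea, here is a minimal sketch of a single k-means iteration in NumPy: every point is tagged with its nearest centroid, and then every centroid moves to the mean of the points tagged with it. The toy points and the starting centroids are made up for this example and are not part of the program further down.

```python
import numpy as np

# Toy data (made up): two obvious groups in 2-D.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
# Hypothetical starting centroids, one per cluster (K = 2).
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])

# Assignment step: Euclidean distance from every point to every centroid.
diff = points[:, None, :] - centroids[None, :, :]   # shape (6, 2, 2)
dist = np.sqrt((diff ** 2).sum(axis=2))             # shape (6, 2)
labels = dist.argmin(axis=1)                        # nearest centroid per point

# Update step: each centroid becomes the mean of the points tagged with it.
new_centroids = np.array([points[labels == j].mean(axis=0)
                          for j in range(len(centroids))])

print(labels)
print(new_centroids)
```

Repeating these two steps until the centroids stop moving is the whole algorithm.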
The algorithm can handle any number of features per data point, provided that all of them are numerical rather than nominal. Categorical data needs to be converted into binary (0/1) tags first, one tag per category.
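One way to do that conversion, sketched here with pandas' get_dummies on a made-up table (the column names and values are purely illustrative):

```python
import pandas as pd

# Made-up example: one numerical feature and one categorical feature.
df = pd.DataFrame({"height": [1.2, 3.4, 2.2],
                   "colour": ["red", "blue", "red"]})

# Replace the categorical column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["colour"], dtype=int)

print(encoded)
```

The resulting table contains only numbers, so it can be clustered like any other numerical dataset.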

Playing around with the number of clusters and plotting the result for visual verification is usually good practice.
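A numerical companion to eyeballing the plots is to compare the within-cluster sum of squares for several values of K and look for the point where adding clusters stops helping much. The sketch below is only an illustration under assumptions: the kmeans helper is a compact stand-in written for this example (not the program further down), the data is randomly generated, and the numbers are printed rather than plotted.

```python
import numpy as np

def kmeans(data, k, seed=0, max_iter=100):
    """Plain k-means; returns (centroids, labels). Illustrative helper only."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members):              # an empty cluster keeps its centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Made-up data: three blobs of 30 points each.
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                  for c in ([0, 0], [5, 5], [0, 5])])

wcss = {}
for k in range(1, 6):
    centroids, labels = kmeans(data, k)
    # Sum of squared distances from each point to its own centroid.
    wcss[k] = float(((data - centroids[labels]) ** 2).sum())
    print(k, round(wcss[k], 2))
```

Because the starting centroids are random, a single run per K can land in a poor solution; trying a few seeds per K is a reasonable precaution.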

The Python code is provided below. You can also download it, along with all of my other code, from my GitHub page. As always, any improvement or contribution will be appreciated.
Note: You will need the numpy and pandas libraries installed before running the code. The program itself is plug and play.

The code...

###################################################################



"""K means clustering"""
import pandas as pd
import numpy as np

# written for Python 3
if __name__ == "__main__":

    print("Welcome to the k-means clustering package. This program will help you\n"
          "create a user-defined number (K) of clusters of similar data points.\n"
          "The clusters are created based on a crude technique of pattern recognition.\n"
          "This algorithm only works on numerical data.\n"
          "So let's begin...")
    # load dataset
    da = input("Please mention the name of your data file (only CSV files are accepted)\n")
    kay = input("How many clusters would you like to create?\n")
    k = int(kay)
    dat = pd.read_csv(da)
    d = dat.to_numpy(dtype=float)
    m, n = d.shape  # m data points, n features

    # normalization: rescale every feature to the range [0, 1]
    low = d.min(axis=0)
    norm = d.max(axis=0) - low
    norm[norm == 0] = 1.0  # constant columns: avoid division by zero
    nd = (d - low) / norm

    # random starting centroids inside the normalized range
    cent = np.random.rand(k, n)

    clustertag = np.zeros(m, dtype=int)
    while True:
        # assignment step: tag each point with its nearest centroid
        for i in range(m):
            sqdif = np.power(nd[i, :] - cent, 2)
            dist = np.sqrt(sqdif.sum(axis=1))  # Euclidean distance to each centroid
            clustertag[i] = np.argmin(dist)

        # update step: move each centroid to the mean of its points
        newcent = cent.copy()
        for i in range(k):
            members = nd[clustertag == i]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                newcent[i, :] = members.mean(axis=0)

        # stop once the centroids no longer move
        if np.allclose(newcent, cent):
            break
        cent = newcent

    # convert the centroids back to the original feature scale
    lom = cent * norm + low
    print("The centroids of the %i clusters are...\n" % k)
    print(lom)

##############################################

What the input should look like: a CSV file with a header row, one row per data point and one numerical column per feature.
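For illustration, a file of that shape could be produced as follows. The column names and values are hypothetical, and the CSV text is built in memory here rather than written to disk.

```python
import io
import pandas as pd

# Hypothetical sample data: 6 data points, 2 numerical features.
sample = pd.DataFrame({"feature_1": [1.0, 1.5, 1.0, 8.0, 9.0, 8.5],
                       "feature_2": [1.0, 2.0, 1.5, 8.0, 8.5, 9.0]})

# index=False keeps the row index out of the CSV, so it is not
# mistaken for an extra feature when the data is read back.
csv_text = sample.to_csv(index=False)
print(csv_text)

# Reading it back with pd.read_csv gives a purely numerical table.
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded.shape)
```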
What the output would be like...

The output will be a matrix with K rows, each representing the centroid of one cluster. You can also think of each row as the average feature vector of that particular cluster.
