Mathalytics - Where Math Meets Analytics: K-means clustering - Machine Learning

K-Means algorithm for clustering (unsupervised learning)

The k-means clustering algorithm is an unsupervised learning algorithm in which the computer tries to partition an unlabeled dataset into clusters of similar objects.

The number of clusters, K, is chosen by the user.
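In symbols, "similar" is usually made precise as follows. The notation here is an assumption and not part of the original post: x_i are the data points, C_j is the set of points in cluster j, and mu_j is the centroid of that cluster. K-means tries to make the within-cluster sum of squared distances small:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each iteration alternates between assigning every point to its nearest centroid and recomputing every centroid as the mean of its points. Neither step can increase J, which is why the procedure settles down after a finite number of iterations.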

Let us understand how it works using an example...

Suppose you are an analyst for a T-shirt company and you want to decide the average measurements for different size categories such as extra-small, small, large, and extra-large.

All you have is the following data.

Based on this data, you can intuitively form the following clusters.

This can be accomplished using the K-means clustering algorithm.
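To make "forming clusters" concrete, here is a minimal sketch of the assignment step alone. The measurements and the two starting centroids are made up for illustration and are not the data from this post.

```matlab
% Hypothetical data: one row per person, columns are two body measurements.
x = [150 50; 152 52; 154 54; 180 80; 182 82; 184 84];
% Assumed starting centroids (here simply the first and last data points).
centroids = [150 50; 184 84];

n = size(x, 1);
K = size(centroids, 1);
labels = zeros(n, 1);
for i = 1:n
    % Difference between point i and every centroid (K-by-d matrix).
    diffs = centroids - repmat(x(i, :), K, 1);
    % Euclidean distance from point i to each centroid.
    dists = sqrt(sum(diffs.^2, 2));
    % The label is the index of the nearest centroid.
    [~, labels(i)] = min(dists);
end
disp(labels');
```

With these made-up numbers the first three points end up in cluster 1 and the last three in cluster 2.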

Apart from segregating the unlabeled dataset into clusters, the algorithm also gives you the centroid of each cluster, so that you can describe an "average member" of the cluster.
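As a small illustration of what a centroid is, the sketch below averages the members of each cluster. The data and the cluster labels are hypothetical and assumed to be already known.

```matlab
% Hypothetical data: one row per person, columns are two body measurements.
x = [150 50; 152 52; 154 54; 180 80; 182 82; 184 84];
% Assumed cluster assignments for the six rows above.
labels = [1; 1; 1; 2; 2; 2];

K = 2;
centroids = zeros(K, size(x, 2));
for j = 1:K
    % The centroid is the column-wise mean of the points in cluster j.
    centroids(j, :) = mean(x(labels == j, :), 1);
end
disp(centroids);
```

Here the "average member" of cluster 1 is [152 52] and that of cluster 2 is [182 82].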

I have provided the MATLAB code below. You can also download it from my GitHub page.

The code is plug-and-play and can handle vectors of any dimension.

Please note:

Your data should be in a .csv file or a similarly formatted text file.
All your features should be numeric; something like so...
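One way to produce a file in this shape is sketched below. The file name sample_data.csv and the numbers are assumptions made up for illustration; the point is only the layout: one row per data point, one column per feature, values separated by commas.

```matlab
% Hypothetical measurements: one row per data point, one column per feature.
data = [150 50; 152 52; 154 54; 180 80; 182 82; 184 84];

% Write a comma-separated text file (the file name is an assumption).
dlmwrite('sample_data.csv', data);

% Read it back with load, which accepts comma-delimited numeric text.
d = load('sample_data.csv');
fprintf('there are %i datapoints with %i dimensions\n', size(d, 1), size(d, 2));
```

The resulting file contains plain lines such as 150,50 with no header row and no text columns.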

The code...

############################################

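% unsupervised k-means algorithm
fprintf('welcome to unsupervised clustering module (k-means)\n');
da=input('please enter the name of the datafile in single quotes\n');
d=load(da);                % numeric matrix: one row per data point, one column per feature
s=size(d);
l=s(1,1);                  % number of data points
b=s(1,2);                  % number of dimensions (features)
fprintf('there are %i datapoints with %i dimensions\n',l,b);
K=input('how many clusters do you want to create ?\n');
%ite=input('how many iterations do you want to run?\n');
% start from K distinct, randomly chosen data points; rand(K,b) lies in [0,1]
% and can leave clusters empty when the data is not normalized
idx=randperm(l);
centroids=d(idx(1:K),:);
%x=normalize(d);
x=d;
dist_mat=zeros(l,K);       % distance from every point to every centroid
clusterlabel=zeros(l,1);   % index of the cluster each point belongs to
newcent=zeros(K,b);        % copy of the centroids from the previous iteration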

while 1
    newcent=centroids;   % remember the centroids from the previous iteration
    % find the Euclidean distance between every point and every current centroid
    for i=1:l
        for j=1:K
            dist_mat(i,j)=euclid_dist(x(i,:),centroids(j,:));
        end
    end
    % disp(dist_mat);
    % assign each data point to the cluster of its nearest centroid
    for i=1:l
        h=dist_mat(i,:);
        %disp(h);
        [min_value, index] = min(h(:));
        clusterlabel(i,1)=index;
    end
    disp(clusterlabel);
    % update each centroid to the mean of the points assigned to it
    temp_avg=zeros(K,b);
    count=zeros(K,1);    % number of points in each cluster
    for i=1:l
        j=clusterlabel(i,1);
        temp_avg(j,:)=temp_avg(j,:)+x(i,:);
        count(j)=count(j)+1;
    end
    for j=1:K
        if count(j)>0    % an empty cluster keeps its previous centroid
            centroids(j,:)=temp_avg(j,:)/count(j);
        end
    end
    % stop once the centroids no longer change
    if isequal(newcent,centroids)
        break;
    end
end

fprintf(' the final centroids are :\n');
disp(centroids);
fprintf('\n');
fprintf('the following clusters were formed...\n');
for i=1:K
    fprintf(' cluster %i\n',i);
    for j=1:l
        if clusterlabel(j,1)==i
            fprintf('data-point %i \t',j);
            disp(x(j,:));
        end
    end
end

##########################################

You will also need the following function; save it as 'euclid_dist.m' in your working directory.

###############

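function c = euclid_dist(a,b)
% Euclidean distance between two row vectors of equal length:
% square root of the sum of squared differences
c=sqrt(sum((a-b).^2));
end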

##################
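A quick sanity check helps confirm that a distance helper returns what you expect. The sketch below uses two hypothetical anonymous functions, so it runs on its own without any file on the path. It compares the Euclidean distance with the sum of absolute differences (Manhattan distance) on a 3-4-5 right triangle.

```matlab
% Euclidean distance: square root of the sum of squared differences.
euclid = @(a, b) sqrt(sum((a - b).^2));
% Manhattan distance: sum of absolute differences, shown for comparison.
manhattan = @(a, b) sum(abs(a - b));

p = [0 0];
q = [3 4];
fprintf('Euclidean: %g, Manhattan: %g\n', euclid(p, q), manhattan(p, q));
```

This prints a Euclidean distance of 5 and a Manhattan distance of 7. The two measures can pick different nearest centroids, so it is worth checking which one your helper computes.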


Tuesday, 3 March 2015
