K-Means algorithm for clustering (unsupervised learning)
The k-means clustering algorithm is an unsupervised learning algorithm where the computer tries to segregate an unlabeled dataset into clusters of similar objects.
The number of clusters is user-defined.
Let us understand how it works using an example...
Suppose you are an analyst for a T-shirt company and you want to decide the average measurements for different size categories such as extra-small, small, large, extra-large, etc.
All you have is the following data.
Based on this given data, you can intuitively make the following clusters.
This can be accomplished using the K-means clustering algorithm.
Apart from segregating the unlabeled dataset into clusters, the algorithm also gives you the centroid of each cluster, so that you can describe an "average member" of that cluster.
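To make the "average member" idea concrete, here is a minimal sketch. The three measurement rows and the name cluster_points are made up for illustration and are not from the T-shirt data above. The centroid is just the mean of each feature over the members of one cluster.

```matlab
% Hypothetical cluster of three T-shirts: [chest_cm, length_cm] (made-up values)
cluster_points = [ 96 68;
                  100 70;
                  104 72];

% The centroid is the column-wise mean of the cluster's members
centroid = mean(cluster_points, 1);
disp(centroid);
```

The resulting row vector has one entry per feature, so it can be read as the measurements of a typical T-shirt in that size category.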
I have provided the MATLAB code below. You can also download it from my GitHub page.
The code is plug and play and can handle vectors of any dimension.
Please note:
- Your data should be in a .csv or a similarly formatted text file
- All your features should be numerically describable; something like so...
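As a rough illustration of such a file, the sketch below writes a small, made-up numeric matrix to a comma-separated file and reads it back. The values, the variable name sample and the temporary file name are assumptions for the example only. It uses dlmwrite and dlmread so that the delimiter is explicit.

```matlab
% Made-up measurements for illustration only: one row per data point,
% one column per feature (here: chest and length in cm)
sample = [ 86 62;
           88 63;
           96 68;
           98 69;
          108 74;
          110 75];

datafile = [tempname() '.csv'];     % hypothetical file in the temp folder
dlmwrite(datafile, sample, ',');    % write plain comma-separated numbers
d = dlmread(datafile, ',');         % read them back as a numeric matrix

[l, b] = size(d);
fprintf('there are %i datapoints with %i dimensions\n', l, b);
delete(datafile);                   % clean up the temporary file
```

Note that the file contains only numbers: no header row and no text columns.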
The code...
############################################
% unsupervised k means algorithm%
fprintf('welcome to unsupervised clustering module (k-means)\n');
da=input('please enter the name of the datafile in single quotes\n');
d=load(da);
s=size(d);
l=s(1,1);
b=s(1,2);
fprintf('there are %i datapoints with %i dimensions\n',l,b);
K=input('how many clusters do you want to create ?\n');
%ite=input('how many iterations do you want to run?\n');
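centroids=d(randperm(l,K),:);%initialize the centroids at K distinct, randomly chosen data points so they lie within the range of the data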
%x=normalize(d);
x=d;
dist_mat=zeros(l,K);
clusterlabel=zeros(l,1);
newcent=zeros(K,b);
while 1
newcent=centroids;%remember the centroids from the previous pass to test for convergence
for i=1:l
for j=1:K
dist_mat(i,j)=euclid_dist(x(i,:),centroids(j,:));%find the Euclidean distance between the chosen point and each of the current centroids
end
end
% disp(dist_mat);
h=zeros(1,K);
%assign each data point to nearest centroid cluster
for i=1:l
h=dist_mat(i,:);
%disp(h);
[min_value, index] = min(h(:));%index of the closest centroid
clusterlabel(i,1)=index;
end
disp(clusterlabel);
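temp_avg=zeros(K,b);
count=zeros(K,1);
%update all the centroids: accumulate a separate sum and member count for each cluster
for i=1:l
j=clusterlabel(i,1);
temp_avg(j,:)=temp_avg(j,:)+x(i,:);
count(j)=count(j)+1;
end
for j=1:K
if count(j)>0 %an empty cluster keeps its previous centroid
centroids(j,:)=temp_avg(j,:)/count(j);
end
end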
if isequal(newcent,centroids) %stop once the centroids no longer move
break;
end
end
fprintf(' the final centroids are :\n');
disp(centroids);
fprintf('\n');
fprintf('the following clusters were formed...\n');
for i=1:K
fprintf(' cluster %i\n',i);
for j=1:l
if clusterlabel(j,1)==i
fprintf('data-point %i \t',j);
disp(x(j,:));
end
end
end
##########################################
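For a quick sanity check of the two steps (assign each point to its nearest centroid, then move each centroid to the mean of its members), here is a compact sketch on made-up toy data. The data, the fixed starting centroids and the names dist2, members and old_centroids are assumptions chosen so that the result is reproducible. It is not a replacement for the script above.

```matlab
% Toy data (made up): two well-separated groups in 2-D
x = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];
K = 2;
centroids = x([1 4], :);    % fixed starting centroids, for reproducibility
[l, b] = size(x);

while true
    old_centroids = centroids;

    % Assignment step: squared Euclidean distance from every point
    % to every centroid, then pick the nearest centroid per point
    dist2 = zeros(l, K);
    for j = 1:K
        diffs = x - repmat(centroids(j, :), l, 1);
        dist2(:, j) = sum(diffs.^2, 2);
    end
    [min_dist, clusterlabel] = min(dist2, [], 2);

    % Update step: each centroid becomes the mean of its own members
    for j = 1:K
        members = x(clusterlabel == j, :);
        if ~isempty(members)
            centroids(j, :) = mean(members, 1);
        end
    end

    % Stop once the centroids no longer move
    if isequal(old_centroids, centroids)
        break;
    end
end

disp(centroids);
disp(clusterlabel');
```

On this toy data the first three points should end up in cluster 1 and the last three in cluster 2, with each centroid at the mean of its group.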
You will also need the following function; save it as 'euclid_dist.m' in your working directory.
###############
function c = euclid_dist(a,b)
e=(a-b).*(a-b);%squared difference in each dimension
c=sqrt(sum(e));%Euclidean distance: square root of the summed squares
end
##################
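A handy way to check a distance function is the 3-4-5 right triangle. The sketch below uses made-up points and illustrative variable names. It also computes the sum of absolute differences (the Manhattan distance), which is what you get if the square root is taken per dimension before summing, so it is easy to see that the two measures differ.

```matlab
a = [0 0];
b = [3 4];

% Euclidean distance: square root of the sum of squared differences
euclid = sqrt(sum((a - b).^2));

% Manhattan (city-block) distance: sum of absolute differences
manhattan = sum(abs(a - b));

fprintf('euclidean = %g, manhattan = %g\n', euclid, manhattan);
```

For these two points the Euclidean distance is 5 and the Manhattan distance is 7.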