Monday, 13 April 2015

Paralytic Analytics-Statistics of old age

Will you be a sane 80 year old?



I came across this question on quora:

What is the statistical likelihood that a man in his early 70's will comfortably live ten more years, with all mental faculties present?

Let us look at the answer...
You know what? I am just gonna copy paste my answer on the website here...

Assumption: I am assuming that you are an Indian.
The Statistics:

India population -1,282,741,906 (1.28 billion) As of March 28, 2015.
Sex Ratio:
Current Sex Ratio of India 2015 -943 females for every 1,000 males
That means you have a 51.45% chance of being a male.

Age consideration:
India has the following age distribution-
age 65 and above-5.8% (male 34,133,175/female 37,810,599) (2014 est.)
This means that you have a of 2.75% of being a male over the age of 65.

According to @ Facts and figures about mental illness website@ three% of the population faces disability due to mental illness and around 20% of adults are affected by some form of mental disorder every year.

So, lets be a tad bit pessimistic (this sort of also factors in your old age) and say that you have a 20% chance of going senile. The good news is that there is an 80% chance that you wont go senile.

So factoring in all this we calculate:
1,282,741,906 *0.5145*0.0275*0.8=14519355.63
probability=14519355.63/ 1,282,741,906 =0.0113= 1.13%

So there you have it!
There is less than 1.13% chance that you will grow up to be a sane 80 year old.
Please bear in mind that this too is a very generous estimate. A more accurate assessment would depend on other unmentioned factors like religion, social status, income, number of children and their gender, your region of residence, crime rate, lifestyle choices etc. All these factors mostly contribute towards making your chances even slimmer.

Hey! I am not pessimistic; it's just that the answer to your question was almost bound to be a tad bit depressing.
But, look at the brighter side...

The life expectancy of an average Indian is 66.21 years. Which means you are actually really lucky to have crossed the age of 70! So the possibility that you will reach 80 is quite low but if your luck and current age were to be considered as any sort of factors here, then I would say that you are definitely one of the exceptions to the rule.

Remember what they say about life: It's a short trip! Better make it a good one.
Good luck for the rest of your life.

Sunday, 5 April 2015

K Means Clustering -Machine Learning

K-Means Clustering in Python (Unsupervised Learning)

The K-means clustering algorithm is a class of unsupervised learning algorithm that takes an unlabeled dataset and divides it into a user defined number of clusters. These clusters consists of data-points which are more similar to each other than the members of the other cluster.
It can be thought of as a crude form of pattern recognition.
Image taken from the web
 The algorithm can handle any number of features for each data-point but with the condition that all of them are numerical values rather than nominal. Categorical data needs to be converted into binary tags.

Usually playing around with the number of clusters and plotting them for visual verification is a good practice.

The python code has been provided below. You can also download it and all of my other code from my Github page. As always, any improvement or contribution will be appreciated. 
Note: You will need to have pre-installed numpy and pandas libraries before running the code . The program itself is plug and play

The code...

###################################################################



"""K means clustering"""
import pandas as pd
import numpy as np

if __name__=="__main__":
    
    print("welcome to the k means clustering package. this program will help you\n\
    create a user defined number(K) of clusters of similar data points\n\
    the clusters are created based on a crude techineqe of pattern recognition\n\
    This algorithm only works on numerical data\n\
    So lets begin...")
    ####load dataset
    da=raw_input("please mentation the name of your data file (only csv file formats accepted)\n")
    kay=raw_input("how many clusters would you like to create?\n")
    k=int(kay)
    dat=pd.read_csv(da);
    d=np.mat(dat);
    n=d.shape[1]
    m=d.shape[0]
    
    a=1
    b=-1
    cento=np.mat(np.random.rand(k,n))
    cent=a+(b-a)*cento
    
   
    norm=np.mat(np.zeros(shape=(1,n)))
    
    
    #normalization
    for i in range(d.shape[1]):
        norm[0,i]=max(d[:,i])-min(d[:,i])
    nd=np.mat(np.zeros(shape=(m,n)))
    for i in range(n):
        for j in range(m):
            nd[j,i]=(max(d[:,i])-d[j,i])/norm[:,i]
    """for l in range(len(nor)):
        norm[:,l]=nor[l]
    #kMeans(nd,k,cent,distMeas=distEclud)
    m = d.shape[0]"""
    
    clustertag=np.mat(np.zeros((m,1)))
    meansumk=np.mat(np.zeros((1,n)))    
    newcent=np.mat(np.zeros((k,n)))
    box=np.mat(np.zeros((k,n)))
    #print norm,nd
    while 1:
       
        
        
        for i in range(m):
            
            box[:,:]=d[i,:]
            #print box
            sqdif =np.power((box-cent),2)
            root=np.power(sqdif,0.5)
            rsum=root.sum(axis=1)
            indexleast=np.argsort(rsum,axis=0) 
            clustertag[i,0]=indexleast[0]
            print indexleast
        
        
            
        for i in range(k):
            count=0
            for j in range(m):
                if clustertag[j,0]==i:
                    meansumk[0,:]=meansumk[0,:]+d[j,:]
                    
                    count+=1
                    newcent[i,:]=meansumk/count
        cent=newcent
        if newcent.all()-cent.all()==0:
            break
    
    lom=np.mat(np.zeros((k,n)))
    print ("The centroids of the %i clusters are...\n"%k)
    for h in range(k):
        lom[h,:]=np.multiply(norm[0,:],cent[h,:])
    print lom

##############################################

What the input should look like
What the output would be like...

The output will be a matrix of K number of rows representing the centroid of the each cluster. You can also think of it as an average feature vector of that particular cluster.

K Nearest Neighbor Algorithm -Machine Learning

Classical K-Nearest Neighbor Algorithm in Python

The K-nearest neighbor algorithm is a supervised learning algorithm used to classify data-points based on multiple features. It can handle both qualitative and quantitative data. The core logic of the algorithm is to compare the new data point to all the existing data-points in it's database and assign it to the group which it most resembles. Apart from the dataset, the algorithm also requires the user to define a parameter 'K'. To understand why this parameter is needed, one must have at-least a rough idea of the inner workings of the algorithm.
Image taken from the web


What the algorithm basically does is it sorts every data-point in it's database according to decreasing degree of similarity to the new data-point. Now, this sorted list may have points from every class of the data-set; so, to give a "best guess" answer of the classification problem, it takes a vote. For this vote it chooses the top K items in the list and assigns the new data-point to the class which has a majority in this population. A good value of K usually depends on the size of the population and for large data-sets, it is enough to choose a K at around 10% of the total population.


It's application may range anywhere from segregating customers of a supermarket into different categories to writing software for solving captcha autonomously.
Example of a captcha (taken from the web)
The python code for this algorithm has been given below (I use the Spyder IDE).You can also download it from my Github page. Any contribution and  improvement to the code is duly welcome. The code is plug and play.

Note: Apart from the standard python packages, you will also need numpy and pandas to use this code.


The code...
############################################################################
import numpy as np
import operator
import pandas as pd




if __name__ == "__main__":
print ("Welcome to K-nearest neighbor supervised learning classifier.\n\
All you will need to do is give your dataset and your unclassified feature set and this program will classify it for you\n\
Please keep in mind that the data must be in a csv file or else you will have to modify the source code\n\
\n\
so lets begin...")

da=[]
mother_file=raw_input("please mention the name of your mother file\n")
da = pd.read_csv(mother_file)
d=np.mat(da)
dataset=d[:,0:-1]
labels=d[:,-1]


inx=raw_input("please enter your input parameters in the form of a list ")

x=np.mat(inx)
if x.shape[1]!=dataset.shape[1]:
print ("invalid input\n please run the program again")

kay=raw_input("how many nearest neighbours(k) would you like to count the majority from\n")
k=int(kay)
siz=dataset.shape[0]
bl=np.zeros([dataset.shape[0],dataset.shape[1]])
for i in range(len(bl)):
bl[i,:]=x
difmat=bl-dataset

sqdifmat=np.square(difmat.astype(float))
root=np.sqrt(sqdifmat)
distances = root.sum(axis=1)

sortedDistIndicies = np.argsort(distances,axis=0)
classCount={}

for i in range(k):

voteIlabel = labels[sortedDistIndicies[i]]
classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
print sortedClassCount[0][0]

##############################################################

How to format the data...
For the code to work without you needing to poke around the script itself, your data needs to be stored in a .csv file. The given code can handle alphanumeric designation of the class but its feature description needs to be strictly numeric.
Something like so...
The code can handle an arbitrary number of features and does a better job than the Matlab code seen before on this blog.
Remember, the input that you give to the algorithm must be in the form of a list.