Multi-variate linear regression
Regression, in most contexts, is just a fancy way of interpolating. It is frequently used in estimation and prediction scenarios and is one of a data analyst's favorite tools.
Let’s see how it works
by looking at some examples...
Suppose you have the following data.
It is a two-dimensional plot depicting the correlation between the population of a city and the profits a chain of retail stores makes there per year.
But what if, using this data, you wanted an approximate prediction of the profit a given company would make in a city with a population of 37,890?
This is a very realistic problem: the decision of whether or not to open a retail outlet in that city would depend on your prediction as an analyst. So how would you give a reasonable estimate?
Let’s see…
If there were some way of fitting a straight line through the data points, then to predict for any given population you would just need to read off the 'Y' coordinate of the corresponding point on the line; that would be awesome!
Something like so…
But how do you decide which line is best? After all, it is possible to draw infinitely many such lines!
This is when linear regression comes to the rescue. The whole idea behind linear regression is to draw a line that "best fits" the data you have, in order to give more reliable predictions for any new piece of data. One way of doing this is to fit a line to your data in such a way that the sum of the squared vertical distances between the line and your data points is minimized.
Suppose that the red points depict your data; you want to draw the green line in such a way that the combined (squared) length of the blue segments is minimized. This approach is known as the least squares method.
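To make the idea concrete, here is a minimal sketch of a least squares line fit in Python/NumPy. The population and profit numbers below are made up purely for illustration; they are not from any real dataset.

```python
import numpy as np

# Toy data: population (say, in units of 10,000 people) vs. yearly
# profit (say, in units of $10,000). Values are purely illustrative.
population = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
profit     = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# np.polyfit with degree 1 solves the least squares problem for a
# straight line, returning its slope and intercept.
slope, intercept = np.polyfit(population, profit, 1)

# Predicting for a new population value means reading off the 'Y'
# coordinate of the fitted line at that point.
new_population = 3.789
predicted_profit = slope * new_population + intercept
print(round(predicted_profit, 3))
```

Once the slope and intercept are known, every prediction is just one multiplication and one addition.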
So far so good. But what about the case when more than one factor decides the overall profits of the retail chain?
Factors like shop size, number of items sold, gender ratio in the area, literacy rate (it might be a book store), etc. In theory, there can be infinitely many such factors contributing to the sales figures.
What I have demonstrated so far can be called univariate linear regression, where the output depends on only one factor. Multivariate linear regression is the case where more than one (numerically describable) factor affects the output.
In such scenarios, you cannot just plot a hyper-dimensional graph and fit a straight line by eye, but you can definitely simulate the process mathematically and get the right results.
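Mathematically, "fitting a line" in many dimensions just means finding one weight per feature plus an intercept, and a prediction is a dot product. Here is a sketch in Python/NumPy; the feature values and parameters are hypothetical numbers chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical feature vector: shop size, items sold, literacy rate.
features = np.array([2.5, 40.0, 0.9])

# Hypothetical parameter vector: intercept first, then one weight
# per feature.
theta = np.array([1.0, 0.5, 0.02, 3.0])

# Prepend a 1 so the intercept is handled by the same dot product.
x = np.concatenate(([1.0], features))
prediction = x @ theta  # 1*1.0 + 2.5*0.5 + 40*0.02 + 0.9*3.0
print(prediction)
```

This is exactly the structure the regression code below learns: one parameter vector whose first entry pairs with a constant 1.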
I am going to give the MATLAB code for this over here, but you can also download it from my github page. Any improvements and contributions to the code are greatly appreciated.
So, suppose you have some data with which you want to make predictions for new incoming data; here is how you can use the following MATLAB code to do your regression analysis...
The code first.
#######################################################################
%% multivariate linear regression
fprintf('welcome to multivariate linear regression');
fprintf('\n');
data = input('please enter the name of your data file in single quotes \n');
d = load(data);
% the data set must have all the beginning columns as features and the
% last column as output
disp(d);
s = size(d);
l = s(1,1);           % number of data points
k = s(1,2);           % number of columns (features + output)
fprintf('number of data points = %i', l);
fprintf('\n');
fprintf('the number of features is = %i', k-1);
X = d(:, 1:k-1);
X = normalize(X);     % mean-normalize the features (see normalize.m below)
y = d(:, k);
theta = rand(k, 1);   % random initial parameters
temp = zeros(k, 1);
fprintf('\n');
alpha = input('please define learning rate ');
n = input('define number of iterations ');
fprintf('\n');
fprintf('running linear regression...\n');
x = [ones(l,1), X];   % prepend a column of ones for the intercept term
for i = 1:n
    % compute the update for every parameter first, then apply them all
    % at once, so that the parameters are updated simultaneously
    for j = 1:k
        temp(j) = theta(j) - alpha*(1/l)*derivative(x, y, theta, j);
    end
    theta = temp;
end
T = theta

% partial derivative of the squared-error cost with respect to theta(j)
function z = derivative(x, y, theta, j)
    z = 0;
    for i = 1:length(y)
        z = z + (x(i,:)*theta - y(i)) * x(i,j);
    end
end
#####################################
This is the second function that the code needs to run. Save it as 'normalize.m' (note that on newer MATLAB releases this shadows the built-in normalize function, which is fine here since our version does the same job).
##########################################
function z = normalize(x)
    % mean-normalize each column: subtract its mean, divide by its
    % standard deviation
    s = size(x);
    l = s(1,1);
    b = s(1,2);
    for i = 1:b
        % compute the column statistics once, before modifying the
        % column; otherwise later entries would be normalized against
        % a partially normalized column
        m  = mean(x(:,i));
        sd = std(x(:,i));
        for j = 1:l
            x(j,i) = (x(j,i) - m) / sd;
        end
    end
    z = x;
end
############################################
The algorithm uses gradient descent to optimize its parameters.
This means that, while running, you will be prompted to define the learning rate and the number of iterations that you want to run the algorithm for.
Typically, the learning rate is of the order of 0.01, but it is recommended that you experiment with it. Also, data sets of different sizes may require different numbers of iterations; as a rule of thumb, try roughly double the number of data points in your dataset.
NOTE: if your results are blowing up to infinity, your learning rate is too big.
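The update the MATLAB loop performs can be sketched in Python/NumPy as follows. This is a batch gradient descent step on the squared-error cost; the variable names and the toy data are mine, not from the code above.

```python
import numpy as np

def gradient_descent(x, y, alpha, n_iters):
    """Batch gradient descent for linear regression.

    x: (m, k) matrix whose first column is all ones (intercept term)
    y: (m,) vector of outputs
    Returns the fitted parameter vector theta.
    """
    m, k = x.shape
    theta = np.zeros(k)
    for _ in range(n_iters):
        # Gradient of the squared-error cost for all parameters at once;
        # every parameter is updated simultaneously.
        gradient = x.T @ (x @ theta - y) / m
        theta -= alpha * gradient
    return theta

# Tiny noise-free illustration: y = 2 + 3*x1, so gradient descent
# should recover theta close to [2, 3].
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x = np.column_stack([np.ones_like(x1), x1])
y = 2 + 3 * x1
theta = gradient_descent(x, y, alpha=0.1, n_iters=5000)
print(theta)
```

With a learning rate that is too large the iterates diverge instead of converging, which is exactly the "blowing up to infinity" symptom mentioned above.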
How to format input data?
The input data should typically be a .csv file or a similarly formatted text file in which all columns except the last represent the features on which your prediction depends, and the last column holds the corresponding result.
How to interpret the result?
The output of the code is a vector 'T': the parameter vector that you need to multiply (dot product) with your new feature vector to predict the output. Keep in mind that the first element of 'T' must be multiplied with unity (the intercept term) and the rest with your new feature vector. Also remember that the code normalizes the features, so a new feature vector must first be normalized using the same column means and standard deviations as the training data.
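As a sketch of that prediction step in Python/NumPy (all numbers here, including the vector T, are hypothetical and chosen only to show the mechanics):

```python
import numpy as np

# Hypothetical training features, used only to get the column means
# and standard deviations that new data must be normalized with.
train_features = np.array([[1.0, 10.0],
                           [2.0, 20.0],
                           [3.0, 30.0],
                           [4.0, 40.0]])
mu    = train_features.mean(axis=0)   # column means from training data
sigma = train_features.std(axis=0)    # column standard deviations

# Hypothetical fitted parameter vector: intercept first.
T = np.array([5.0, 1.5, -0.5])

# Normalize the new feature vector with the TRAINING statistics,
# prepend a 1 for the intercept, then take the dot product with T.
new_features = np.array([2.5, 25.0])
normalized   = (new_features - mu) / sigma
prediction   = np.concatenate(([1.0], normalized)) @ T
print(prediction)
```

Skipping the normalization step would feed raw feature values to parameters that were fitted on normalized ones, giving nonsense predictions.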