Wednesday, September 11, 2013

Regression analysis - No26

Today I am going to explain regression analysis.
Before I talk about the analysis method, I would like to start with the concept.
This will be challenging for me, but I'll do my best to explain it as simply as I can.
As I've already mentioned in a previous post, regression analysis focuses on the causal relationship
between an independent variable (explanatory variable) and a dependent variable (response variable).
What does this mean?
To answer this question, we can't go further without a correlation test.
Once you've run a correlation analysis,
you get a correlation coefficient that judges the strength of the relationship between two variables.
The correlation coefficient ranges from -1 to +1, indicating a negative or positive relationship.
However, this result doesn't explain the exact causality between the two variables.
By getting the regression coefficient, on the other hand,
we can figure out how much of the variability in
the response variable y is explained by the explanatory variable x.

Let's assume there are two sample data sets:
Type 1 shows a positive correlation and Type 2 shows a negative correlation, and in each case
we can get a measurable value called the correlation coefficient.
In regression analysis, we use a similar index to gauge the strength of the relationship.
It is called R-squared, and its value is the square of the correlation coefficient.

> cor(Temperature, Discomfort_Index)
[1] 0.9963963
> cor(Temperature, Sales_Long_trousers)
[1] -0.8676816
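The first `cor()` value above can be reproduced by hand. Below is a small Python sketch (my own illustration, not part of the original R session) that computes the Pearson correlation coefficient for the Type 1 data given later in this post; squaring it gives the R-squared value that appears in the `summary()` output further down.

```python
import math

# Type 1 data from the post: temperature vs. discomfort index
temperature = [26, 27, 28, 29, 30, 31, 32, 35]
discomfort = [24.1, 25.0, 26.0, 27.0, 28.0, 28.9, 29.9, 33.9]

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(temperature, discomfort)
print(round(r, 7))       # ≈ 0.9963963, matching cor(Temperature, Discomfort_Index)
print(round(r ** 2, 4))  # ≈ 0.9928, the R-squared value
```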



Then how can we get a measurable causality between Temperature and Discomfort_Index in the Type 1 case? The answer is regression analysis.
As I always tell you, there are two components: the response variable and the explanatory variable. In Type 1, temperature is the explanatory variable and the discomfort index is the response variable.

The explanatory variable and the response variable don't have to be single factors; each side can involve multiple factors. For example, a discomfort index can be affected not only by temperature but also by humidity, geographic region, etc.

If there is only one explanatory variable, we call it a simple regression model; if there is more than one, we call it a multiple regression model.

Additionally, if the response variable can be explained by a linear relationship with the explanatory variables, we call it a "linear regression model"; otherwise we call it a "nonlinear regression model".

The goal of a linear regression model is to set up a best-fit line equation in terms of the given variables x and y. Therefore we have to set up an equation model first.

Y = a*X + b + e
(b is the intercept, a is the slope, and e (the Greek letter epsilon) is the error term)

If you have multiple explanatory variables, then your model looks like this:
Y = a*X1 + b*X2 + c + e

Then how do we find the fitted regression line?
Statistically, this line is calculated by the method of least squares.
The linear regression line is the best-fit line that minimizes the sum of the squared vertical distances between each point and the line.
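For simple linear regression, the least-squares slope and intercept have closed-form solutions: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). Here is a Python sketch (my own illustration, not part of the original R session) applying those formulas to the data used in this post:

```python
# Data from the post
temperature = [26, 27, 28, 29, 30, 31, 32, 35]
discomfort = [24.1, 25.0, 26.0, 27.0, 28.0, 28.9, 29.9, 33.9]

def least_squares(x, y):
    """Closed-form least-squares fit: slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

slope, intercept = least_squares(temperature, discomfort)
print(round(slope, 3), round(intercept, 1))  # 1.067 -3.9, the same coefficients lm() reports
```

The numbers agree with the R `lm()` output shown later in this post.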

Once you get the best-fit line, you can predict a Y value from a value of X.
In other words, you can predict the discomfort index when the temperature is 35 without experimental data.
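Using the coefficients that `lm()` reports later in this post (intercept -3.9, slope 1.067), such a prediction is just one multiply and one add. A small Python illustration (my own sketch):

```python
INTERCEPT = -3.9  # from the lm() output later in the post
SLOPE = 1.067

def predict_discomfort(temp):
    """Predict the discomfort index from temperature with the fitted line."""
    return INTERCEPT + SLOPE * temp

print(predict_discomfort(35))  # about 33.4, with no new experiment needed
```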

Now, let's try this with sample data using R commands.

1) Prepare the data and plot it
    I have two data sets: temperature data and discomfort-index data.
    My goal is to find the best-fit line between the two variables.

> Temperature
[1] 26 27 28 29 30 31 32 35
> Discomfort_Index
[1] 24.1 25.0 26.0 27.0 28.0 28.9 29.9 33.9



2) Linear regression can be executed with this simple command
    The important thing is: in correlation analysis, the order of the variables doesn't matter;
    however, in regression analysis, the response variable must come first (on the left of the `~`).
    As you can see below, you get two important values:
    the intercept (-3.9) and the slope (1.067).

> lm(Discomfort_Index~Temperature)

Call:
lm(formula = Discomfort_Index ~ Temperature)

Coefficients:
(Intercept)  Temperature
     -3.900        1.067

3) Plotting the regression line on the previous graph
     There are two ways you can do it:
     since you already know the intercept and slope, you can add the line yourself,
     or you can do it with a simple command.

> abline(a=-3.900, b=1.067, col="red")

4) Getting detailed information to determine whether your regression model is suitable
    You can review more useful information to check the model's suitability.
    In a few words:
    We see a summary of the residuals at the beginning. A residual is calculated by subtracting the predicted value of the response variable from the observed value.
    Further down, the regression coefficients are given, and R-squared is also given, so we can estimate that the relationship between temperature and discomfort index is very strong; the model explains as much as 99.28% of the variability.
    Lastly, at the bottom, the F-statistic is given, so we can test the model's suitability. (In this case, we have statistical evidence to reject the null hypothesis and accept the alternative hypothesis.)
  H0 : This regression model is not suitable (the slope is zero).
 
> summary(lm(Discomfort_Index~Temperature))

Call:
lm(formula = Discomfort_Index ~ Temperature)

Residuals:
     Min       1Q   Median       3Q      Max
-0.35126 -0.15861 -0.01597  0.12668  0.44706

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)  
(Intercept)     -3.90000    1.10803   -3.52   0.0125 *
Temperature   1.06723    0.03709   28.77  1.17e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2861 on 6 degrees of freedom
Multiple R-squared:  0.9928,    Adjusted R-squared:  0.9916

F-statistic:   828 on 1 and 6 DF,  p-value: 1.167e-07
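The key numbers in the summary above can be verified by hand. The Python sketch below (my own illustration, not part of the original R session) refits the line, computes the residuals, R-squared, and the F-statistic, and reproduces the values R reports:

```python
temperature = [26, 27, 28, 29, 30, 31, 32, 35]
discomfort = [24.1, 25.0, 26.0, 27.0, 28.0, 28.9, 29.9, 33.9]

n = len(temperature)
mx = sum(temperature) / n
my = sum(discomfort) / n
slope = sum((x - mx) * (y - my) for x, y in zip(temperature, discomfort)) / \
        sum((x - mx) ** 2 for x in temperature)
intercept = my - slope * mx

# Residuals: observed response minus fitted response
residuals = [y - (intercept + slope * x) for x, y in zip(temperature, discomfort)]

sse = sum(e ** 2 for e in residuals)          # residual sum of squares
sst = sum((y - my) ** 2 for y in discomfort)  # total sum of squares
r_squared = 1 - sse / sst
f_stat = (sst - sse) / (sse / (n - 2))        # F with 1 and n-2 degrees of freedom

print(round(min(residuals), 5), round(max(residuals), 5))  # -0.35126 0.44706, as in the summary
print(round(r_squared, 4))                                 # 0.9928
print(round(f_stat))                                       # 828
```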


This post was very long and tough. Anyway, I hope it helps your understanding.








