DK's R self study: Standard Deviation - No12

In the last two posts, we reviewed the meaning of basic statistics such as the quantile, median, and mean. Among them, I think the average is one of the most common statistics, and we use it very often in our daily lives.
For example, the average math test score of your class or the average height of your class.
However, the average alone can't tell us how the individual values are spread around it.

Let's assume that there are two classes and their mathematics test results are as follows.

> classA <- c(80, 90, 75, 70, 80, 85, 80)

> classB <- c(100, 100, 100, 100, 55, 50, 55)

Now we can calculate the average of each class.

> mean(classA)

[1] 80

> mean(classB)

[1] 80

As you can see, the two classes have the same average.

Can you say that the students of the two classes have a similar educational attainment?

I don't think so, because the scores of classB are split into two groups far from the average: four students scored 100, while three scored 55 or below.

In other words, the scores of classB lie farther from the average than those of classA.

As a result, I can tell that classA is much more consistent than classB in terms of scores.
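To make "distance from the average" concrete, here is a minimal sketch that subtracts each class average from the scores. The names devA and devB are only for illustration and are not used elsewhere in this post.

```r
# Scores of the two classes from the example
classA <- c(80, 90, 75, 70, 80, 85, 80)
classB <- c(100, 100, 100, 100, 55, 50, 55)

# Signed distance of each score from its class average (80 in both cases)
devA <- classA - mean(classA)
devB <- classB - mean(classB)

print(devA)  # 0 10 -5 -10 0 5 0
print(devB)  # 20 20 20 20 -25 -30 -25

# Largest distance from the average in each class
print(max(abs(devA)))  # 10
print(max(abs(devB)))  # 30
```

No score in classA is more than 10 points from the average, while every score in classB is at least 20 points away.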

How can we express this insight as a number?

Standard deviation is the answer to this question.

Standard deviation (represented by the symbol sigma, σ) shows how much variation, or dispersion, exists from the average. It is the square root of the variance (sigma squared, σ²).

Let me show you the mathematical formulas for the variance and the standard deviation.

$\begin{align} \operatorname{Var}(X) &= \operatorname{Cov}(X, X) \\ &= \operatorname{E}\left[(X - \mu) (X - \mu)\right] \\ &= \operatorname{E}\left[(X - \mu)^2 \right] \end{align}$

$\sigma = \sqrt{\operatorname E[(X - \mu)^2]}= \sqrt{\operatorname E[X^2]-(\operatorname E[X])^2}.$
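The two forms of σ in the formula above can be checked numerically. This is only a sketch, assuming we treat the seven scores of classB as the whole population, so the expectation E[·] becomes a plain average over n = 7 values. The names mu, pop_var, pop_var_alt and pop_sd are made up for this illustration.

```r
classB <- c(100, 100, 100, 100, 55, 50, 55)

mu <- mean(classB)                    # E[X]

# First form: E[(X - mu)^2]
pop_var <- mean((classB - mu)^2)

# Second form: E[X^2] - (E[X])^2
pop_var_alt <- mean(classB^2) - mu^2

# Both forms should agree (up to floating-point rounding)
print(all.equal(pop_var, pop_var_alt))  # TRUE

# Sigma is the square root of the variance
pop_sd <- sqrt(pop_var)
print(pop_sd)
```

Note that R's built-in functions divide by n - 1 instead of n, so their results are slightly larger than the population values computed here.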

Of course, R provides simple functions to get the variance and the standard deviation

[ var(), sd() ]

but if you are familiar with the mathematical formulas above, you can also calculate them yourself.

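Now, the variance of classB can be calculated in two ways: with var() or by hand. Note that var() computes the sample variance, so the sum of squared distances is divided by n - 1 (here 6) rather than by n.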

> classB

[1] 100 100 100 100 55 50 55

> var(classB)

[1] 625

> ((100-80)^2+(100-80)^2+(100-80)^2+(100-80)^2+(55-80)^2+(50-80)^2+(55-80)^2)/6

[1] 625
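The hand calculation above can be written more compactly with R's vectorized arithmetic. The sketch below wraps it in a small helper; manual_var is a hypothetical name chosen for this example, not a built-in function. The divisor length(x) - 1 corresponds to the 6 used above.

```r
# Hypothetical helper: sample variance written out from its definition
manual_var <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}

classA <- c(80, 90, 75, 70, 80, 85, 80)
classB <- c(100, 100, 100, 100, 55, 50, 55)

print(manual_var(classB))                               # 625
print(all.equal(manual_var(classA), var(classA)))       # TRUE
print(all.equal(sqrt(manual_var(classA)), sd(classA)))  # TRUE
```

Because the helper takes any numeric vector, the same check works for classA without typing out each score.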

As you can guess, the square root of the variance is the standard deviation, and R provides a simple function, sd(), to get it directly.

> sqrt(var(classB))

[1] 25

> sd(classB)

[1] 25

Now we can guess which class has the lower standard deviation.
As you can see below, the numbers confirm what we suspected from looking at the scores.

A low standard deviation indicates that the data points tend to be very close to the mean; high standard deviation indicates that the data points are spread out over a large range of values (Wikipedia)

> sd(classA)
[1] 6.454972
> sd(classB)
[1] 25

I think standard deviation is very important for understanding basic statistics.

I will use σ instead of sd to represent the standard deviation in future lessons.


Saturday, January 19, 2013
