Correlation

Suppose that the average life span of smokers, by the number of packs smoked per week, is:

Packs Per Week    Life Span (years)
1                 72
2                 70
3                 69
5                 68

The least squares regression line for these data is approximately:

y = 72.34 - 0.94x

We define the first residual to be the difference between the first observed life span and the first estimated life span:

72 - (72.34 - 0.94(1)) = 0.60

The second residual is:

70 - (72.34 - 0.94(2)) = -0.46

the third is:

69 - (72.34 - 0.94(3)) = -0.52

and the fourth is:

68 - (72.34 - 0.94(5)) = 0.36

In general, the residual for the i-th data point is

$y_i - \hat{y}_i = y_i - (a + b x_i)$
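The residual formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names `least_squares_line` and `residuals` are made up here, and the intercept a and slope b are assumed to come from the standard closed-form least-squares formulas.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes the sum of squared residuals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Sxy and Sxx in their "computational" forms
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = sxy / sxx                      # slope
    a = sum_y / n - b * sum_x / n      # intercept: mean(y) - b * mean(x)
    return a, b


def residuals(xs, ys, a, b):
    """Residual for each point: observed y minus the value predicted by y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Data from the table: packs per week and life span
packs = [1, 2, 3, 5]
life_span = [72, 70, 69, 68]

a, b = least_squares_line(packs, life_span)
res = residuals(packs, life_span, a, b)
print(f"line: y = {a:.2f} + ({b:.2f})x")
print("residuals:", [round(e, 2) for e in res])
```

A useful check on any least-squares fit is that the residuals sum to zero, up to rounding.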

Coefficient of determination: $r^2$

The coefficient of determination, $r^2$, indicates how well a straight line describes the data.  $r^2$ has the following properties:

Properties of the Coefficient of Determination

1. $r^2$ is between 0 and 1, inclusive.

2. If $r^2 = 1$ then all points lie on a line.  (perfectly linear)

3. If $r^2 = 0$ then the regression line is useless for predicting y values: it does no better than predicting the mean of y for every x.

Construction

To compute $r^2$, do the following:

1. Compute the sum of the squares of the residuals:  SSResid

2. Compute $\sum y^2$ and $(\sum y)^2$.  We say that

$SSTo = \sum y^2 - (\sum y)^2 / n$

3. Compute

$1 - SSResid / SSTo$

This is $r^2$.
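The three steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the helper names `fit_line` and `coefficient_of_determination` are hypothetical, and the line is assumed to be the least-squares fit (for an arbitrary line, 1 - SSResid/SSTo is not guaranteed to lie between 0 and 1).

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x (assumed fitting method)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


def coefficient_of_determination(xs, ys):
    """Compute r^2 = 1 - SSResid / SSTo for the least-squares line through the data."""
    a, b = fit_line(xs, ys)
    n = len(ys)
    # Step 1: sum of the squares of the residuals
    ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Step 2: total sum of squares, SSTo = sum(y^2) - (sum(y))^2 / n
    ss_to = sum(y * y for y in ys) - sum(ys) ** 2 / n
    # Step 3: the coefficient of determination
    return 1 - ss_resid / ss_to


# Data from the table: packs per week and life span
packs = [1, 2, 3, 5]
life_span = [72, 70, 69, 68]
r2_table = coefficient_of_determination(packs, life_span)

# Points that lie exactly on the line y = 1 + 2x
r2_perfect = coefficient_of_determination([0, 1, 2], [1, 3, 5])

print(f"r^2 for the table: {r2_table:.3f}")
print(f"r^2 for points on a line: {r2_perfect:.3f}")
```

The second call illustrates property 2: when every point lies on a line, all residuals are zero and the result is 1.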

Multiplying $r^2$ by 100% gives the percentage of the observed variation in y that is attributable to the linear relationship.

Correlation:   r

If we want to know not only whether x and y are linearly related, but also whether the relationship is positive or negative (b > 0 or b < 0), and we want the measure to be unitless, we compute Pearson's correlation coefficient r.  It has the same sign as the slope b, and

$r = \pm\sqrt{r^2}$

that is, the square of the correlation coefficient is equal to the coefficient of determination.

• If r < 0 then x and y are negatively correlated.

• If r > 0 then x and y are positively correlated.
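As an illustration, r can be computed directly from the data. The sketch below uses the standard product-moment formula, r = Sxy / sqrt(Sxx * Syy), which the text does not spell out, so treat it as an assumption; the function name `pearson_r` is also made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient via r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Data from the table: packs per week and life span
packs = [1, 2, 3, 5]
life_span = [72, 70, 69, 68]

r = pearson_r(packs, life_span)
print(f"r = {r:.3f}")            # the sign shows the direction of the relationship
print(f"r squared = {r * r:.3f}")  # matches the coefficient of determination
```

Because the units of x and y cancel in the ratio, r is unitless: rescaling packs to cigarettes or years to months leaves it unchanged.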

As a rule of thumb, we say that the correlation is

1. strong if |r| > 0.8,

2. moderate if 0.5 < |r| ≤ 0.8, and

3. weak otherwise.
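The rule of thumb above translates into a small helper. This is only a sketch: the function name `correlation_strength` is hypothetical, the middle band is labeled "moderate", and a value of exactly 0.8 is placed in that middle band, which is an assumption about how the boundary should be treated.

```python
def correlation_strength(r):
    """Classify a correlation coefficient by the size of |r| (rule-of-thumb cutoffs)."""
    size = abs(r)
    if size > 0.8:
        return "strong"
    if size > 0.5:       # 0.5 < |r| <= 0.8; the 0.8 boundary is an assumption
        return "moderate"
    return "weak"


for value in (-0.94, 0.65, 0.2):
    print(value, correlation_strength(value))
```

Only the magnitude of r matters here; the sign says whether the relationship is positive or negative, not how strong it is.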

Correlation does not imply causation.  For example, there may be a strong correlation between gray hair and wrinkles, but having gray hair does not cause wrinkles; both tend to come with age.