Thursday, 2 December 2010

Reading 11, 2.4 Limitations of Correlation

The section numbers below refer to the CFA textbook.

2 Correlation Analysis

 2.4 Limitations of Correlation Analysis

Linear vs Nonlinear Association

Correlation measures linear association between two variables, linear meaning “A straight-line relationship, as opposed to a relationship that cannot be graphed as a straight line.”
NB - Two variables can have a strong nonlinear relation and still have a very low correlation.
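A small sketch of this point in Python. The data is made up for illustration (it is not from the CFA text): y is determined exactly by x, but through a parabola, so the linear correlation comes out at essentially zero. The helper name pearson is just a label chosen here.

```python
import math

def pearson(xs, ys):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: a perfect, but nonlinear, relationship (y = x squared).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_nonlinear = pearson(xs, ys)
print(round(r_nonlinear, 4))
```

The positive and negative halves of the parabola cancel each other out in the covariance, which is why a strong relationship can still show a correlation near zero.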

The following is an example of a graph with a STRONG NONLINEAR RELATIONSHIP


(Trends in the intensity of copper use in Japan since 1960. Juan Ignacio Guzmán, Takashi Nishiyama, John E. Tilton)

Outliers

Correlation may also be an unreliable measure when outliers are present in one or both of the series. In the post of December 2, 2010, a scatter graph based on BHP data was presented. The calculated correlation for that data set was 0.54.
Reviewing the graph, 2009 appears to be an outlier, as it is the only year where Revenue & Earnings are not relatively close on the graph. The year 2009 is removed from the data set below to see the impact on the correlation coefficient. It is predicted that the correlation will increase.

 


Practical example – BHP revenue vs Earnings per ordinary share - Correlation coefficient WITHOUT outlier.

The year 2009, as a perceived outlier, has been removed from the data set. The result below is that the correlation increases significantly.
You can check your answers for other examples using the following web site: http://easycalculation.com/statistics/correlation.php

1. Data

Year                                               2010      2008      2007      2006
Revenue (US$m)                                   52,798    59,473    47,473    39,099
Earnings per ordinary share, diluted (US cents)   227.8     274.8     228.9     172.4

2. Calculation

Deviations are measured from the four-year averages shown in the table.

Year      Revenue (US$m)   EPS (US cents)   Cross product   Squared deviations Revenue   Squared deviations EPS
2010      52,798           227.80               5,634.23                 9,531,112.56                     3.33
2008      59,473           274.80             476,641.86                95,301,525.06                 2,383.88
2007      47,473           228.90              -6,545.42                 5,007,525.06                     8.56
2006      39,099           172.40             568,524.51               112,609,238.06                 2,870.28
Average   49,710.75        225.975
Sum                                         1,044,255.17               222,449,401                    5,266

Covariance
Sum of cross products      1,044,255.17
(N-1)                      3
Answer                     348,085.0583

Variance                   Revenue        EPS
Sum squared deviations     222,449,401    5,266
(N-1)                      3              3
Answer                     74,149,800     1,755
Standard deviation         8,611          41.9

Correlation coefficient
1. Covariance                                   348,085.058
2. Standard deviation x Standard deviation      360,775.26
Answer (1/2)                                    0.964825
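The calculation above can be reproduced with a few lines of Python. This is only a sketch using the four data points from the Data table. The variable names are chosen here for illustration.

```python
import math

# BHP data with 2009 removed (from the Data table above).
years = [2010, 2008, 2007, 2006]
revenue = [52798, 59473, 47473, 39099]  # US$m
eps = [227.8, 274.8, 228.9, 172.4]      # US cents, diluted

n = len(revenue)
mean_rev = sum(revenue) / n
mean_eps = sum(eps) / n

# Sums of cross products and squared deviations from the means.
cross = sum((r - mean_rev) * (e - mean_eps) for r, e in zip(revenue, eps))
ss_rev = sum((r - mean_rev) ** 2 for r in revenue)
ss_eps = sum((e - mean_eps) ** 2 for e in eps)

# Sample covariance and standard deviations use (N-1) as the divisor.
covariance = cross / (n - 1)
sd_rev = math.sqrt(ss_rev / (n - 1))
sd_eps = math.sqrt(ss_eps / (n - 1))

correlation = covariance / (sd_rev * sd_eps)

print("Averages:", mean_rev, round(mean_eps, 3))
print("Covariance:", round(covariance, 4))
print("Standard deviations:", round(sd_rev), round(sd_eps, 1))
print("Correlation:", round(correlation, 6))
```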


Conclusion
Always check whether a computed sample correlation changes greatly when a few outliers are removed.
In this example, removing 2009 moved the correlation from 0.54 to 0.96, an almost perfect correlation!

But one must also use judgment to determine whether those outliers contain information about the two variables’ relationship (and should thus be included in the correlation analysis) or contain no information (and should thus be excluded).
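One way to run this check is to drop each observation in turn and see how far the correlation moves. The sketch below uses made-up numbers (not the BHP data), and the function names pearson and leave_one_out are just labels chosen here. It only flags influential points. Whether to exclude them is still the judgment call described above.

```python
import math

def pearson(xs, ys):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

def leave_one_out(xs, ys):
    """Return {index: correlation computed with that observation removed}."""
    result = {}
    for i in range(len(xs)):
        result[i] = pearson(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    return result

# Hypothetical data: five points close to a straight line, plus one outlier at the end.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 0.5]

r_full = pearson(xs, ys)
loo = leave_one_out(xs, ys)

# The observation whose removal changes the correlation the most.
most_influential = max(loo, key=lambda i: abs(loo[i] - r_full))

print("All points:", round(r_full, 3))
for i, r in loo.items():
    print("Without point", i, ":", round(r, 3))
print("Most influential point:", most_influential)
```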
Correlation does not imply causation. You might just be lucky!
Even if two variables are highly correlated, one does not necessarily cause the other in the sense that certain values of one variable bring about the occurrence of certain values of the other. Furthermore, correlations can be spurious in the sense of misleadingly pointing towards associations between variables.

Spurious correlation

The term spurious correlation refers to:
1) correlation between two variables that reflects chance relationships in a particular data set,
2) correlation induced by a calculation that mixes each of two variables with a third, and
3) correlation between two variables arising not from a direct relation between them but from their relation to a third variable.
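The second kind can be illustrated with a small simulation. This is a sketch on randomly generated numbers, under assumptions made here for illustration: x and y are drawn independently, and both are then divided by the same third variable z (think of two unrelated quantities both expressed "per share"). The two ratios come out strongly correlated even though x and y are not.

```python
import math
import random

def pearson(xs, ys):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

random.seed(11)  # fixed seed so the run is repeatable
n = 2000

# x and y are generated independently, so they have no real relationship.
x = [random.gauss(10, 1) for _ in range(n)]
y = [random.gauss(10, 1) for _ in range(n)]
# z is a third, unrelated variable used as a common divisor.
z = [random.uniform(1, 10) for _ in range(n)]

r_raw = pearson(x, y)
r_ratio = pearson([a / c for a, c in zip(x, z)],
                  [b / c for b, c in zip(y, z)])

print("Correlation of x and y:", round(r_raw, 3))
print("Correlation of x/z and y/z:", round(r_ratio, 3))
```

The high correlation between the ratios comes almost entirely from the shared divisor z, not from any link between x and y.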



