The references refer to the CFA textbook.
2 Correlation Analysis
Linear vs Non Linear association
Correlation measures linear association between two variables, linear meaning “A straight-line relationship, as opposed to a relationship that cannot be graphed as a straight line.”
NB - Two variables can have a strong nonlinear relation and still have a very low correlation.
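A minimal sketch of this point, using made-up data rather than anything from the textbook: y is fully determined by x (y = x squared), yet the Pearson correlation is essentially zero because the relationship is not a straight line. The helper name `pearson` is just an illustrative choice.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: a perfect U-shaped (nonlinear) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_nonlinear = pearson(xs, ys)
print("correlation of x and x^2:", r_nonlinear)
```

The positive and negative cross products cancel out, so the linear measure sees no association even though knowing x tells you y exactly.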
The following is an example of a graph with a strong nonlinear relationship.
(Figure: Trends in the intensity of copper use in Japan since 1960. Juan Ignacio Guzmán, Takashi Nishiyama, John E. Tilton)
Outliers
Correlation may also be an unreliable measure when outliers are present in one or both of the series. In the post of December 2, 2010, the following scatter graph was presented, based on BHP data. The calculated correlation for this data set was 0.54.
Reviewing the graph, 2009 appears to be an outlier: it is the only year in which Revenue and Earnings are not relatively close on the graph. Below, 2009 is removed from the data set to see the impact on the correlation coefficient. The prediction is that the correlation will increase.
Practical example – BHP revenue vs Earnings per ordinary share - Correlation coefficient WITHOUT outlier.
The year 2009, as a perceived outlier, has been removed from the data set. The result below is that the correlation increases significantly, from 0.54 to about 0.96.
You can check your answers for other examples using the following web site: http://easycalculation.com/statistics/correlation.php
1. Data |
Year | 2010 | 2008 | 2007 | 2006 |
Revenue US $m | 52798 | 59473 | 47473 | 39099 |
Earnings per ordinary share (diluted) (US cents) | 227.8 | 274.8 | 228.9 | 172.4 |
2. Calculation |
Year | Revenue US $m | EPS (US cents) | Cross product | Squared deviations Revenue | Squared deviations EPS |
2010 | 52798 | 227.80 | 5,634.23 | 9,531,112.56 | 3.33 |
2008 | 59473 | 274.80 | 476,641.86 | 95,301,525.06 | 2,383.88 |
2007 | 47473 | 228.90 | -6,545.42 | 5,007,525.06 | 8.56 |
2006 | 39099 | 172.40 | 568,524.51 | 112,609,238.06 | 2,870.28 |
Average | 49710.75 | 225.975 |
Covariance | Sum | 1,044,255.18 | 222,449,401 | 5,266 |
(N-1) | 3 |
Answer | 348,085.0583 |
Variance | Sum squared deviations | 222,449,401 | 5,266 |
(N-1) | 3 | 3 |
Answer | 74,149,800 | 1,755 |
Standard deviation | 8,611 | 41.9 |
Correlation coefficient | 1. Covariance | 348,085.058 |
2. Standard deviation X Standard deviation | 360,775.26 |
Answer (1/2) | 0.964825 |
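The worksheet steps can be reproduced in a few lines of Python. This is only a sketch that follows the same route as the table (deviations from the four-year averages, covariance and variances divided by N-1, then covariance over the product of the standard deviations); the variable names are illustrative.

```python
import math

# Four-year data set with 2009 removed (from the Data table).
revenue = [52798, 59473, 47473, 39099]   # US $m: 2010, 2008, 2007, 2006
eps = [227.8, 274.8, 228.9, 172.4]       # US cents, diluted

n = len(revenue)
mean_rev = sum(revenue) / n
mean_eps = sum(eps) / n

# Deviations of each year from the averages of these four years.
dev_rev = [v - mean_rev for v in revenue]
dev_eps = [v - mean_eps for v in eps]

# Sample covariance: sum of cross products divided by N-1.
covariance = sum(a * b for a, b in zip(dev_rev, dev_eps)) / (n - 1)

# Sample variances and standard deviations.
var_rev = sum(a * a for a in dev_rev) / (n - 1)
var_eps = sum(b * b for b in dev_eps) / (n - 1)
sd_rev = math.sqrt(var_rev)
sd_eps = math.sqrt(var_eps)

# Correlation coefficient = covariance / (sd_rev * sd_eps).
r = covariance / (sd_rev * sd_eps)

print(f"averages: {mean_rev:.2f}, {mean_eps:.3f}")
print(f"covariance: {covariance:,.4f}")
print(f"standard deviations: {sd_rev:,.1f}, {sd_eps:.1f}")
print(f"correlation: {r:.6f}")
```

The deviations must be taken from the averages of the four remaining years, not from the averages of the original five-year series.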
Conclusion
Determine whether a computed sample correlation changes greatly by removing a few outliers.
In this example, removing 2009 moved the correlation from a moderate 0.54 to a very strong 0.96, although with only four observations left the estimate should be treated with caution.
But one must also use judgment to determine whether those outliers contain information about the two variables’ relationship (and should thus be included in the correlation analysis) or contain no information (and should thus be excluded).
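One way to check how sensitive a correlation is to individual observations is to recompute it with each point left out in turn. The sketch below does this on a small made-up data set (not the BHP figures) in which the last point breaks an otherwise near-linear pattern. It only flags which point moves the result most; whether that point should be excluded remains a judgment call.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: roughly y = x, except for the last observation.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 1.0]

r_full = pearson(xs, ys)

# Recompute the correlation with each observation removed in turn.
r_leave_one_out = []
for i in range(len(xs)):
    xs_i = xs[:i] + xs[i + 1:]
    ys_i = ys[:i] + ys[i + 1:]
    r_leave_one_out.append(pearson(xs_i, ys_i))

# The observation whose removal changes the correlation the most.
changes = [abs(r_i - r_full) for r_i in r_leave_one_out]
most_influential = changes.index(max(changes))

print(f"all points: r = {r_full:.3f}")
for i, r_i in enumerate(r_leave_one_out):
    print(f"without point {i}: r = {r_i:.3f}")
print("most influential point:", most_influential)
```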
Correlation does not imply causation. You might just be lucky!
Even if two variables are highly correlated, one does not necessarily cause the other in the sense that certain values of one variable bring about the occurrence of certain values of the other. Furthermore, correlations can be spurious in the sense of misleadingly pointing towards associations between variables.
Spurious correlation
The term spurious correlation refers to:
1) correlation between two variables that reflects chance relationships in a particular data set,
2) correlation induced by a calculation that mixes each of two variables with a third, and
3) correlation between two variables arising not from a direct relation between them but from their relation to a third variable.
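The second case can be illustrated with a small simulation. This is a sketch under assumed conditions, not an example from the textbook: x, y and z are drawn independently (the ranges, sample size and seed are arbitrary choices), so x and y are essentially uncorrelated, but dividing both by the same z produces two ratios that are strongly correlated.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    ss_x = sum((a - mean_x) ** 2 for a in xs)
    ss_y = sum((b - mean_y) ** 2 for b in ys)
    return cross / math.sqrt(ss_x * ss_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Three independent series (hypothetical): x and y vary little, z varies a lot.
x = [rng.uniform(9, 11) for _ in range(n)]
y = [rng.uniform(9, 11) for _ in range(n)]
z = [rng.uniform(1, 3) for _ in range(n)]

# Correlation of the raw, unrelated series.
r_raw = pearson(x, y)

# Correlation after mixing each series with the same third variable.
x_over_z = [a / c for a, c in zip(x, z)]
y_over_z = [b / c for b, c in zip(y, z)]
r_ratio = pearson(x_over_z, y_over_z)

print(f"corr(x, y)     = {r_raw:.3f}")
print(f"corr(x/z, y/z) = {r_ratio:.3f}")
```

The high second figure comes entirely from the shared divisor, which is why ratios built on a common denominator deserve extra scrutiny before their correlation is interpreted.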