Variance, Covariance and Correlation

To set expectations up front, this is a decidedly non-technical discussion of what is, in fact, a very technical, but very fundamental, topic in applied statistics. For the technical details, let me refer you to my favorite textbook on the topic:

Cohen J, Cohen P, West SG, Aiken LS. 2003. Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

And if you are interested, Wikipedia has lots of formulas for you to geek out on. We’re also referring here to sample variance and covariance, not population parameters.

Understanding variance, covariance, and correlation is fundamental to making sense of the vast majority of analyses we do as researchers. Variance (along with probability theory, but that’s another blog post) is one of the building blocks for making sense of causal relationships and, more importantly, the strength of those causal relationships. Yes, we want to know whether X affects Y, but really we want to know how much of a change in X causes how much of a change in Y. To do that, we need to understand variance.


Variance

Let’s start by creating a data frame with three normally distributed variables, each with a different standard deviation.

library(tidyverse)
set.seed(08022003)
sd5.df <- tibble(x = rnorm(1000, sd = .5), sd = .5)  # tibble() replaces the deprecated data_frame()
sd1.df <- tibble(x = rnorm(1000, sd = 1), sd = 1)
sd15.df <- tibble(x = rnorm(1000, sd = 1.5), sd = 1.5)
my.df <- bind_rows(sd5.df, sd1.df, sd15.df)  # Bind our variables into a new data frame
my.df$sd <- as.factor(my.df$sd)  # Convert the 'sd' variable into a factor

The easiest way to understand variance is with a graphic, so let’s create a box plot showing the different variances.

my.boxplot <- ggplot(my.df, aes(x = sd, y = x, color = sd)) +
  geom_boxplot()
my.boxplot + theme_minimal() + 
  xlab("Standard Deviation (sd)") +
  ggtitle("Box plot of three random variables with different variances")

Here we have my favorite diagnostic tool, the box plot, which is a handy way to see the dispersion of a given random variable.

Dispersion is just what variance is. Technically, the sample variance is the sum of the squared differences between each observation and the mean of the variable, divided by n - 1 (one less than the number of observations), so it is roughly the average squared distance from the mean. What variance tells you is how spread out the observations of your variable are from the mean value of that variable. The higher the variance, the more spread out the observations are.
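To make the definition concrete, here is a minimal sketch that calculates a sample variance by hand and compares it to R's built-in var(). The numbers in x are made up purely for illustration, and note that var() divides by n - 1 rather than n.

```r
# A small made-up sample (hypothetical values, for illustration only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
deviations <- x - mean(x)  # Distance of each observation from the mean
manual.var <- sum(deviations^2) / (n - 1)  # Sum of squared deviations over n - 1

manual.var  # 4.571429
var(x)      # Same value from the built-in function
```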

When you take the square root of the variance, you get the variable’s standard deviation. What makes the standard deviation so handy is that it puts the variance into the same units as the variable itself (more on that later).

In the box plot above, I’ve generated three random continuous variables (1,000 observations each [n = 1,000]), with an expected mean of 0 but with three different standard deviations, .5, 1, and 1.5. As you can see, as the standard deviation gets larger, the ‘box’ gets larger, and the ‘whiskers’ of the plot also go farther out. (The line in the center of each box is the median, which for a symmetric distribution like this one sits very close to the mean of 0.) The higher the variance (standard deviation), the more spread out the observations are from the mean value.
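If you want to check the numbers behind the plot, the sketch below simulates three variables with the same target standard deviations and compares the sample variance and standard deviation of each. It uses base R only, and the object names (draws, sample.var, sample.sd) are my own. Because the draws are random, the sample values will land close to, but not exactly on, .5, 1, and 1.5.

```r
set.seed(08022003)
# Three simulated variables with the same target standard deviations as above
draws <- list(sd.5  = rnorm(1000, sd = .5),
              sd1   = rnorm(1000, sd = 1),
              sd1.5 = rnorm(1000, sd = 1.5))

sample.var <- sapply(draws, var)  # Sample variance of each variable
sample.sd  <- sapply(draws, sd)   # Sample standard deviation of each variable
round(rbind(sample.var, sample.sd), 3)

# The standard deviation is just the square root of the variance
all.equal(sqrt(sample.var), sample.sd)  # TRUE
```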

Here’s the most important thing to understand about variance–we have to understand the variance of X and Y if we are to understand the covariance between X and Y.


Covariance

If variance measures how dispersed the observations of a single variable are, covariance measures the extent to which two variables vary together. In effect, covariance is a measure of the linear relationship between two variables. For variables measured on the same scales, the larger the absolute value of the covariance, the stronger the relationship.

Covariances can be positive (both variables move in the same direction), negative (the variables move in opposite directions), or, in the case of no linear relationship, zero.
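Here is a minimal sketch of the three cases, using a hand-rolled covariance function checked against R's cov(). The helper manual.cov and the tiny data vectors are hypothetical, made up so the arithmetic is easy to follow.

```r
# Sample covariance by hand: average cross-product of deviations (over n - 1)
manual.cov <- function(a, b) {
  sum((a - mean(a)) * (b - mean(b))) / (length(a) - 1)
}

x      <- c(1, 2, 3, 4, 5)
y.pos  <- c(2, 4, 5, 4, 5)  # Tends to rise as x rises
y.neg  <- c(5, 4, 5, 4, 2)  # Tends to fall as x rises
y.zero <- c(4, 2, 6, 2, 4)  # No linear trend with x

manual.cov(x, y.pos)   # 1.5
manual.cov(x, y.neg)   # -1.5
manual.cov(x, y.zero)  # 0

cov(x, y.pos)  # The built-in function gives the same 1.5
```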

I’ve generated a new dataset (code is below) with three random continuous variables, x1, x2, and y. I’ve purposely set the covariance between x1 and x2 to be zero, meaning no relationship. If you take a look at the scatter plot below of x1 and x2, it seems pretty clear that there is no relationship between the two variables…

library(MASS)
set.seed(08022003)
# We start by creating a defined covariance matrix
cov.matrix <- matrix(c(.5,0,1.5, 
                       0,1,0,
                       1.5,0,15),
                     nrow = 3, ncol = 3,
                     dimnames = list(c("x1", "x2", "y")))
# Now we generate our simulated data
cv.df <- mvrnorm(n = 1000,  # Number of observations
                 mu = c(0, 0, 0),  # Variable means
                 Sigma = cov.matrix,  # Covariance matrix
                 tol = .1,  # Tolerance used when checking that Sigma is positive definite
                 empirical = TRUE)  # Make the sample means and covariances match mu and Sigma exactly
cv.df <- data.frame(cv.df)
# Let's make a scatterplot of x1 and x2
x1x2.scatterplot <- ggplot(cv.df, aes(x = x1, y = x2)) +
  geom_point(shape=1)
x1x2.scatterplot + theme_minimal()

In the same data though, I’ve purposely set the covariance of x1 and y to be equal to 1.5. Covariances aren’t bounded: the value of the covariance depends on the units of measurement and the spread of the two variables, not on their means (more on this later when we get to correlation). Take a look at the following scatter plot of x1 and y…

x1y.scatterplot <- ggplot(cv.df, aes(x = x1, y = y)) +
  geom_point(shape=1)
x1y.scatterplot + theme_minimal()

Here we see what seems like a general positive linear trend. As x1 increases, so does y.

Covariance is one of the most important concepts in statistics, and most analyses involve estimating some type of covariance structure. Unfortunately, covariance by itself has some interpretational limitations, so we need its close cousin, correlation.


Correlation

Correlation takes the covariance and makes it meaningful. Take a look again at the scatter plot of x1 and y. The x1 scale goes from -3 to 3. The y values though go from -15 to over 10. x1 and y are on very different scales of measurement. For example, x1 might be in inches, while y might be in pounds.

Covariance tells us the nature of the relationship (positive or negative), but not the degree. The value of the covariance in our data is 1.5, but it could just as easily be 15,000. The value of the covariance is a function of the scale of measurement of x1 and y. The larger the units (and so the spread) of x1 and y, the larger the covariance, even when the underlying relationship is unchanged.
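A quick sketch of that scale dependence, using made-up numbers: rescaling x (say, converting meters to centimeters) multiplies the covariance by the same factor, while shifting x by a constant leaves it alone. The correlation does not change in either case.

```r
# Hypothetical data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cov(x, y)         # 1.5
cov(x * 100, y)   # 150: same relationship, bigger units
cov(x + 1000, y)  # 1.5: adding a constant changes nothing

# The correlation is the same no matter the units
all.equal(cor(x, y), cor(x * 100, y))  # TRUE
```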

This means that to interpret the covariance we need to get x1 and y on a similar scale of measurement; we need to take an apples and oranges comparison and make it into an apples to apples comparison. We do that by dividing the covariance of x1 and y by the product of their standard deviations (if you want to geek out on the formula, Wikipedia isn’t bad). This puts x1 and y on an equivalent, unit-free scale of measurement.

In the dataset that we constructed, we set the variance of x1 to be .5, and the variance of y to be 15. This gave us values with very different minimum and maximum values. We set the covariance of x1 and y to be 1.5.

Yes, software does it for us, but we can just as easily calculate the correlation ourselves, because remember that the square root of variance is the standard deviation (usually represented by the Greek letter sigma, σ). So, our formula for correlation (usually represented by the Greek letter rho, ρ) is…

ρ(x1, y) = Cov(x1, y) / (σx1 × σy)
ρ(x1, y) = 1.5 / (√.5 × √15)
ρ(x1, y) = .548

If we let R do the work for us, we get the same value…

myCor.matrix <- cor(cv.df, method = c("pearson"))  
round(myCor.matrix, 3)
##       x1 x2     y
## x1 1.000  0 0.548
## x2 0.000  1 0.000
## y  0.548  0 1.000
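The hand calculation can also be reproduced in R. This is a small sketch that plugs in the variances and covariance we chose for the simulation, then uses cov2cor() from base R to convert the whole covariance matrix at once. The object names here (cov.x1y, var.x1, var.y, rho, sigma) are my own.

```r
cov.x1y <- 1.5  # Covariance of x1 and y
var.x1  <- .5   # Variance of x1
var.y   <- 15   # Variance of y

# Covariance divided by the product of the standard deviations
rho <- cov.x1y / (sqrt(var.x1) * sqrt(var.y))
round(rho, 3)  # 0.548

# cov2cor() applies the same scaling to every cell of a covariance matrix
sigma <- matrix(c(.5, 0, 1.5,
                  0, 1, 0,
                  1.5, 0, 15),
                nrow = 3, ncol = 3,
                dimnames = list(c("x1", "x2", "y"), c("x1", "x2", "y")))
round(cov2cor(sigma), 3)
```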

The correlation gives us an easily interpretable measure of the strength of the relationship. Correlations always take a value between -1 and 1. A correlation of -1 would be a perfect negative relationship: every increase in x1 comes with a proportional decrease in y (think about our scatterplot and all of the data points falling in a perfect straight line sloping downward). A correlation of 1 is a perfect positive relationship (x1 and y move in the same direction, with all points on an upward-sloping line), and 0 is no linear relationship.
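As a small illustration with made-up data, any exact straight-line relationship gives a correlation of 1 or -1, regardless of how steep the line is. The variable names below are hypothetical.

```r
x <- 1:10

y.neg <- 3 - 2 * x    # Exact straight line, sloping down
y.pos <- 5 + 0.5 * x  # Exact straight line, sloping up (and much flatter)

cor(x, y.neg)  # -1
cor(x, y.pos)  # 1
```

The slope only changes the covariance; the correlation just tells us how tightly the points hug the line.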

Determining whether the relationship is ‘strong’ or ‘statistically significant’ is for another post. But, for most social science research, Cohen (1988) provides a handy guide:

  • Small: .1 - .29
  • Medium: .3 - .49
  • Large: ≥ .5
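If you find yourself applying these cutoffs a lot, a small helper can do the labeling. This is just a sketch: cohen.label is a hypothetical function of my own, it works on the absolute value of the correlation, and the "below small" label for values under .1 is my own addition and not part of Cohen's guide.

```r
# Label correlations using the cutoffs listed above (hypothetical helper)
cohen.label <- function(r) {
  cut(abs(r),
      breaks = c(0, .1, .3, .5, Inf),  # Intervals are closed on the left
      labels = c("below small", "small", "medium", "large"),
      right = FALSE)
}

as.character(cohen.label(c(0.548, -0.35, 0.12, 0.05)))
# "large" "medium" "small" "below small"
```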

Key Takeaway

We’ve just scratched the surface of understanding variance, covariance, and correlation. I would be remiss though if I didn’t mention the oft-cited comment that correlation does not imply causation. As we’ve talked about before, establishing causality requires three necessary conditions. Correlation is just an estimated value in your data–it’s a piece of the puzzle to understanding the relationship between two variables, but it’s just a piece. Also, a low correlation coefficient does not necessarily mean that there is no meaningful relationship between x and y–the relationship might be non-linear, but that’s for another post!
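To see how a low correlation can hide a real relationship, here is a minimal sketch with made-up data where y is completely determined by x, yet the correlation is zero because the relationship is a curve and not a line.

```r
x <- -5:5
y <- x^2  # y depends entirely on x, but not linearly

cor(x, y)  # 0
```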