Assistant Professor | Bloch School of Management

Plan of the day

  • Quiz review
  • Homework questions
  • Project review and update
  • Dealing with factor variables
  • Interaction effects

Quiz results…

  • Median: 8

  • Mean: 7.9

Let's talk about the homework…

So what's bugging you about the project?

GO BRONCOS, BEAT THE (Los Angeles???) CHARGERS!!!

# Load our packages and pull in the experiment data from the course site
library(tidyverse)
library(readr)
dsom.ds <- read_csv("http://www.drbanderson.com/data/DSOM5522Fall2017Experiment.csv")
# Convert the tibble that read_csv() returns into a plain data frame
experiment.df <- as.data.frame(dsom.ds)

A categorical variable takes on a discrete value based on assignment by the researcher. The numerical value itself has no meaning—it is simply a placeholder to denote assignment of the observation to a given condition.

Categorical variables can be dichotomous…

Value  Category
0      Male
1      Female

Or can take on a large number of possible assignments…

Value  Category
0      Brown Hair
1      Black Hair
2      Red Hair

You can also give categorical variables a particular order, but that's for another class :)
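Here's a quick sketch of how that looks in R, using made-up hair color data (the variable and values are hypothetical); the numeric codes are arbitrary, and the labels carry the meaning:

# Hypothetical data: the numbers are just placeholders for the categories
hair <- factor(c(0, 1, 2, 1, 0),
               levels = c(0, 1, 2),
               labels = c("Brown Hair", "Black Hair", "Red Hair"))
table(hair)  # counts per category: 2 Brown, 2 Black, 1 Red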

One handy technique is to convert continuous variables into categorical variables. We're going to do that with our RespAge variable in the dataset.

Let's start by getting an idea of the range of our respondents' ages:

summary(experiment.df$RespAge)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.00   36.00   39.00   39.01   42.00   53.00

Now we're going to apply our ranges to set up three categories of respondents by age demographic:

experiment.df <- experiment.df %>%
  mutate(Demographic = cut(RespAge, c(25,35,40,55),
                           labels=c("Millennials", "Gen Y", "Gen X")))
# cut() uses right-closed intervals by default: (25,35], (35,40], and (40,55]
# Now let's get a breakdown of how many respondents are in each demographic...
table(experiment.df$Demographic)
## 
## Millennials       Gen Y       Gen X 
##         121         290         232

Categorical variables, often called factor variables, can be very helpful for taking otherwise messy or noisy data and making them more meaningful. Visualizations are a great way to display data according to category.

Let's take a look at how creativity breaks down by demographic:

ggplot(experiment.df, aes(x=Demographic, y=Creativity, fill=Demographic)) +
  geom_boxplot() +
  labs(title = "Boxplot of Creativity of Respondents by Demographic",
       y = "Mean Level of Creativity",
       x = "")

Or how about a histogram…

ggplot(experiment.df, aes(x=Creativity)) +
  geom_histogram(aes(fill=Demographic)) +
  facet_wrap( ~ Demographic, ncol=1) +
  labs(title = "Creativity of Respondents by Demographic",
       y = "Count of Creativity",
       x = "Level of Creativity (4 = Mean Creativity)")

Categorical (factor) variables are also easy to include in GLMs…

happiness.logit <- glm(Happiness ~ Demographic, 
                       data = experiment.df, family = "binomial")
summary(happiness.logit)

## 
## Call:
## glm(formula = Happiness ~ Demographic, family = "binomial", data = experiment.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9162   0.5895   0.5895   0.6287   0.9097  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.6685     0.1921   3.480 0.000501 ***
## DemographicGen Y   0.8526     0.2456   3.471 0.000518 ***
## DemographicGen X   0.9936     0.2628   3.781 0.000156 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 647.38  on 642  degrees of freedom
## Residual deviance: 631.35  on 640  degrees of freedom
## AIC: 637.35
## 
## Number of Fisher Scoring iterations: 4

The kicker, though, is that interpreting factor variables requires us to know the reference category. The reference category is the one coded 0 on the dummy variables R creates behind the scenes:

levels(experiment.df$Demographic)
## [1] "Millennials" "Gen Y"       "Gen X"

The default behavior is to treat the first (left-most) level as the reference. You can change this behavior with the relevel() function, but we don't need to do that here.
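Just for reference, the call would look something like this; it's purely illustrative, and we'll keep Millennials as the reference:

# Illustrative only: this would make Gen X the reference category instead
experiment.df$Demographic <- relevel(experiment.df$Demographic, ref = "Gen X")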

Ok, let's make sense of the results…

library(broom)
happiness.logit.coef <- tidy(happiness.logit, conf.int = TRUE)
happiness.logit.coef <- happiness.logit.coef %>% 
  mutate_if(is.numeric, funs(round(., 3)))
happiness.logit.coef
##               term estimate std.error statistic p.value conf.low conf.high
## 1      (Intercept)    0.668     0.192     3.480   0.001    0.299     1.054
## 2 DemographicGen Y    0.853     0.246     3.471   0.001    0.369     1.334
## 3 DemographicGen X    0.994     0.263     3.781   0.000    0.480     1.512

\(\ln\left[\frac{p}{1-p}\right]=\alpha+\beta_1 x_1+\beta_2 x_2+\epsilon\)


Our intercept in this case is the predicted log odds of \(Happiness\) when the respondent is a Millennial (Demographic = 0). We can think about it as…

0 = Millennial

1 = Gen Y

2 = Gen X
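As a quick sanity check, we can convert the intercept from log odds back to a probability with the inverse logit, plogis() in base R:

# Inverse logit of the intercept: the predicted probability of Happiness for Millennials
plogis(0.6685)  # about 0.66, matching the predicted probability we'll compute shortly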

The intercept isn't the baseline, like we interpreted in our experiment model. It's the reference. We interpret the other estimates relative to the reference category:

happiness.logit.coef
##               term estimate std.error statistic p.value conf.low conf.high
## 1      (Intercept)    0.668     0.192     3.480   0.001    0.299     1.054
## 2 DemographicGen Y    0.853     0.246     3.471   0.001    0.369     1.334
## 3 DemographicGen X    0.994     0.263     3.781   0.000    0.480     1.512

The DemographicGen Y estimate is the log odds of Happiness for Gen Y'ers minus the log odds of Happiness for Millennials (that is, the log of the odds ratio).


The DemographicGen X estimate is the log odds of Happiness for Gen X'ers minus the log odds of Happiness for Millennials.
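Because each estimate is a log odds ratio, exponentiating the coefficients turns them into plain odds ratios, which are often easier to talk about:

# Express the coefficients as odds ratios relative to the Millennials reference
# e.g., exp(0.8526) is about 2.3, so Gen Y's odds of Happiness are roughly
# 2.3 times the odds for Millennials
exp(coef(happiness.logit))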

The interpretation of a categorical variable in an OLS regression model is a little bit different…

creativity.ols <- lm(Creativity ~ Demographic, data = experiment.df)
tidy(creativity.ols)
##               term   estimate std.error statistic       p.value
## 1      (Intercept)  4.5206612 0.1010919 44.718312 4.339525e-199
## 2 DemographicGen Y -0.2206612 0.1203479 -1.833527  6.718840e-02
## 3 DemographicGen X -0.6284198 0.1246981 -5.039529  6.080664e-07

The DemographicGen Y estimate is the predicted value of Creativity for Gen Y'ers minus the predicted value of Creativity for Millennials.

The DemographicGen X estimate is the predicted value of Creativity for Gen X'ers minus the predicted value of Creativity for Millennials.
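A quick way to convince yourself of this: with a single factor as the predictor, the OLS intercept is just the mean of Creativity for Millennials, and each coefficient is the difference between a group's mean and that reference mean. A sketch to check:

# Group means of Creativity; the intercept should match the Millennials' mean,
# and each Demographic coefficient should match the difference from that mean
experiment.df %>%
  group_by(Demographic) %>%
  summarise(mean.creativity = mean(Creativity))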

Once again, the log odds (or odds) aren't all that intuitive. Let's convert these to probabilities, and we'll start by creating a vector of Demographic values that mimics our categories:

# Note here that we are passing the LEVELS of the categorical to our data frame!
predictor.values <- tibble(Demographic = c("Millennials", "Gen Y", "Gen X"))
predictor.values
## # A tibble: 3 x 1
##   Demographic
##         <chr>
## 1 Millennials
## 2       Gen Y
## 3       Gen X

Now let's create our predicted values by passing the set of possible Demographic values that we just created to our estimated model:

happiness.predict <- broom::augment(x = happiness.logit, 
                                    newdata = predictor.values, type.predict = "response")
happiness.predict
##   Demographic   .fitted    .se.fit
## 1 Millennials 0.6611570 0.04302876
## 2       Gen Y 0.8206897 0.02252647
## 3       Gen X 0.8405172 0.02403733

Now let's make a plot, and we'll start by creating approximate 95% confidence intervals (the predicted probability plus or minus 1.96 standard errors):

happiness.predict <- happiness.predict %>%
  mutate(Pr_y = 100 * .fitted,
         lower.ci = 100 * (.fitted - (1.96 * .se.fit)),
         upper.ci = 100 * (.fitted + (1.96 * .se.fit))) %>%
  dplyr::select(Demographic, Pr_y, lower.ci, upper.ci)
happiness.predict
##   Demographic     Pr_y lower.ci upper.ci
## 1 Millennials 66.11570 57.68207 74.54934
## 2       Gen Y 82.06897 77.65378 86.48415
## 3       Gen X 84.05172 79.34041 88.76304

And now for the plot…

predict.plot <- ggplot(happiness.predict, aes(Demographic, Pr_y, colour = Demographic)) + 
  geom_point() +
  geom_errorbar(aes(ymin = lower.ci, ymax = upper.ci)) +
  scale_x_discrete(limits=c("Millennials", 
                            "Gen Y", 
                            "Gen X")) + 
  labs(title = "Probability of Happiness on a Task by Demographic",
       subtitle = "Older Workers are Happier Workers?",
       y = "Probability of Happiness",
       x = "") + 
  scale_colour_discrete(guide = FALSE)
predict.plot

Ok, let's kick it up a notch with an interaction effect.

In an interaction effect, we are saying that the effect (or odds, etc.) of \(x\) on \(y\) varies in the presence of another variable, \(m\).

The \(m\) in this case provides context.

The concept of interaction effects, or moderation, is one of the insanely powerful tools we have at our disposal to give a fine-grained perspective on our analyses.

They are also challenging to perform correctly, interpret correctly, and visualize correctly :)

The basic logic to calculate an interaction effect is to multiply \(x\) and \(m\) together, creating a new variable, \(xm\).

Then in your model, you include \(x\), \(m\), and the new \(xm\).

The coefficient you are looking for in determining if an interaction effect is present is the one for \(xm\).

Instead of using the hand multiplication approach, we are going to use a shortcut in R that handles the multiplication under the hood. Let's start with an interaction of a continuous variable, RespExper, with our Demographic categorical variable.
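For the record, in R's formula syntax x*m is shorthand for x + m + x:m, so the two specifications below are equivalent. A generic sketch with placeholder names (y, x, m, and df are hypothetical):

# x:m is the product (interaction) term; x*m expands to x + m + x:m
glm(y ~ x + m + x:m, data = df, family = "binomial")
glm(y ~ x*m, data = df, family = "binomial")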

With interaction effects, though, it's really helpful to mean-center the continuous variable(s): centering makes each lower-order coefficient interpretable as the effect at the mean of the other variable. Let's do that first:

# Subtract the mean so that 0 represents an average level of experience
experiment.df <- experiment.df %>%
  mutate(RespExper.center = (RespExper - mean(RespExper)))

Now let's specify our model. What we are suggesting is that the relationship (in log odds) between RespExper and Happiness varies across the Demographic of the respondent.

happiness.logit.demo <- glm(Happiness ~ RespExper.center*Demographic,
                            data = experiment.df, family = "binomial")
tidy(happiness.logit.demo)
##                                term  estimate std.error statistic
## 1                       (Intercept) 0.7355937 0.2165199 3.3973496
## 2                  RespExper.center 0.1062160 0.1501101 0.7075870
## 3                  DemographicGen Y 0.8140833 0.2673218 3.0453310
## 4                  DemographicGen X 0.8676217 0.2825404 3.0707882
## 5 RespExper.center:DemographicGen Y 0.1464776 0.2054146 0.7130829
## 6 RespExper.center:DemographicGen X 0.1520271 0.2258966 0.6729945
##        p.value
## 1 0.0006804198
## 2 0.4792017769
## 3 0.0023242437
## 4 0.0021349454
## 5 0.4757944825
## 6 0.5009508094

Interpreting interaction coefficients isn't exactly straightforward. I do want to draw your attention, though, to the p-values in the output above…


In our data, the interaction effects aren't statistically different from zero. That means the relationship (in log odds) between the respondent's years of experience and happiness doesn't vary across the three demographic categories.

To put it another way, we would expect the same probability curve of experience on happiness irrespective of the age range (by our categories) of our respondents.
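If you'd rather test the two interaction terms jointly instead of eyeballing two p-values, one option is a likelihood ratio test against the main-effects-only model. A sketch, assuming the objects we've already created:

# Fit the model without the interaction, then compare the two models
happiness.logit.main <- glm(Happiness ~ RespExper.center + Demographic,
                            data = experiment.df, family = "binomial")
anova(happiness.logit.main, happiness.logit.demo, test = "Chisq")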

The easiest way to make sense of an interaction is a visualization. This gets a little tricky though, because RespExper.center is a continuous variable, which means that it takes on a bunch of different values itself.

So we're going to trade some flexibility in our code for a few shortcuts in R.

install.packages("visreg")

library(visreg)
visreg(happiness.logit.demo, "RespExper.center", by = "Demographic", 
       scale = "response", gg = TRUE,
       ylab = "Probability of Happiness",
       xlab = "Mean Centered Years of Experience (Mean = 0)")

Now, there are also continuous variable by continuous variable interactions…

experiment.df <- experiment.df %>%
  mutate(RespAge.center = (RespAge - mean(RespAge)))

happiness.logit.cont <- glm(Happiness ~ RespExper.center*RespAge.center,
                            data = experiment.df, family = "binomial")
tidy(happiness.logit.cont)
##                              term    estimate  std.error statistic
## 1                     (Intercept) 1.423476656 0.10451981 13.619205
## 2                RespExper.center 0.211288779 0.09238293  2.287098
## 3                  RespAge.center 0.076827821 0.02712916  2.831928
## 4 RespExper.center:RespAge.center 0.004784687 0.02017426  0.237168
##        p.value
## 1 3.079032e-42
## 2 2.219013e-02
## 3 4.626830e-03
## 4 8.125265e-01

Note that the interaction term here also isn't statistically significant. Now let's take a look at a plot. Continuous by continuous interactions require us to visualize the relationship as a three-dimensional image, called a perspective plot.

visreg2d(happiness.logit.cont, "RespExper.center", "RespAge.center", plot.type="persp",
         scale = "response",
         ylab = "\n Mean Centered Age \n (Mean = 0)",
         xlab = "\n Mean Centered Years of \n Experience (Mean = 0)",
         zlab = "\n Probability of Happiness")
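If the 3-D surface is hard to read, visreg2d can also draw the same information as a flat, filled color map; a minimal sketch:

# Same surface as a 2-D filled ("image") plot, visreg2d's default type
visreg2d(happiness.logit.cont, "RespExper.center", "RespAge.center",
         plot.type = "image", scale = "response")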