- Quiz review
- Homework questions
- Project review and update
- Dealing with factor variables
- Interaction effects

Assistant Professor | Bloch School of Management


Quiz results…

Median: 8

Mean: 7.9

Let's talk about the homework…

So what's bugging you about the project?

GO BRONCOS, BEAT THE (Los Angeles???) CHARGERS!!!

library(tidyverse)
library(readr)

dsom.ds <- read_csv("http://www.drbanderson.com/data/DSOM5522Fall2017Experiment.csv")
experiment.df <- as.data.frame(dsom.ds)

Let's start by talking about categorical variables.

A categorical variable takes on a discrete value based on assignment by the researcher. The numerical value itself has no meaning—it is simply a placeholder to denote assignment of the observation to a given condition.

Categorical variables can be dichotomous…

Value | Category |
---|---|
0 | Male |
1 | Female |

Or can take on a nearly infinite number of assignments…

Value | Category |
---|---|
0 | Brown Hair |
1 | Black Hair |
2 | Red Hair |
… | … |

You can also give categorical variables a particular order, but that's for another class :)
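To make the idea concrete, here's a minimal sketch (with made-up values, not our dataset) of coding a dichotomous categorical variable as a factor in R:

```r
# Hypothetical example: raw 0/1 codes become a labeled factor variable
gender <- factor(c(0, 1, 1, 0, 1),
                 levels = c(0, 1),
                 labels = c("Male", "Female"))

# The labels carry the meaning; the underlying numbers are just placeholders
table(gender)
```

R stores the assignments internally as integers, but tables and models display the labels.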

One handy technique is to convert a continuous variable into a categorical variable. We're going to do that with the `RespAge` variable in the dataset.

Let's start by getting an idea of the range of respondents' ages:

summary(experiment.df$RespAge)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.00   36.00   39.00   39.01   42.00   53.00

Now we're going to apply our ranges to set up three categories of respondents by age demographic:

experiment.df <- experiment.df %>%
  mutate(Demographic = cut(RespAge, c(25, 35, 40, 55),
                           labels = c("Millennials", "Gen Y", "Gen X")))

# Now let's get a breakdown of how many respondents are in each demographic...
table(experiment.df$Demographic)

## 
## Millennials       Gen Y       Gen X 
##         121         290         232

Categorical variables, often called factor variables, can be very helpful to take otherwise messy or noisy data and make them more meaningful. Visualizations are a great way to help display data according to the category.

Let's take a look at how `Creativity` breaks down by demographic:

ggplot(experiment.df, aes(x = Demographic, y = Creativity, fill = Demographic)) +
  geom_boxplot() +
  labs(title = "Boxplot of Creativity of Respondents by Demographic",
       y = "Mean Level of Creativity",
       x = "")

Or how about a histogram…

ggplot(experiment.df, aes(x = Creativity)) +
  geom_histogram(aes(fill = Demographic)) +
  facet_wrap( ~ Demographic, ncol = 1) +
  labs(title = "Creativity of Respondents by Demographic",
       y = "Count of Creativity",
       x = "Level of Creativity (4 = Mean Creativity)")

Categorical (factor) variables are also easily modeled in GLM models…

happiness.logit <- glm(Happiness ~ Demographic, data = experiment.df, family = "binomial")
summary(happiness.logit)

## 
## Call:
## glm(formula = Happiness ~ Demographic, family = "binomial", data = experiment.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9162   0.5895   0.5895   0.6287   0.9097  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.6685     0.1921   3.480 0.000501 ***
## DemographicGen Y   0.8526     0.2456   3.471 0.000518 ***
## DemographicGen X   0.9936     0.2628   3.781 0.000156 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 647.38  on 642  degrees of freedom
## Residual deviance: 631.35  on 640  degrees of freedom
## AIC: 637.35
## 
## Number of Fisher Scoring iterations: 4

The kicker though is that interpreting factor variables requires us to know the *reference category*. The reference category is the category assigned the **0** value:

levels(experiment.df$Demographic)

## [1] "Millennials" "Gen Y" "Gen X"

The default behavior is to assign the *left-most* category to be equal to zero. You can change this behavior with the `relevel` function, but we don't need to do that.
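Just for illustration, here's a quick sketch of what `relevel` would do (we're not applying this to our model):

```r
# Hypothetical sketch: switching the reference category of a factor
demo <- factor(c("Millennials", "Gen Y", "Gen X"),
               levels = c("Millennials", "Gen Y", "Gen X"))
levels(demo)  # "Millennials" is the left-most level, so it is the reference

demo.releveled <- relevel(demo, ref = "Gen X")
levels(demo.releveled)  # now "Gen X" comes first and becomes the reference
```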

OK, let's make sense of the results…

library(broom)

happiness.logit.coef <- tidy(happiness.logit, conf.int = TRUE)
happiness.logit.coef <- happiness.logit.coef %>%
  mutate_if(is.numeric, funs(round(., 3)))
happiness.logit.coef

##               term estimate std.error statistic p.value conf.low conf.high
## 1      (Intercept)    0.668     0.192     3.480   0.001    0.299     1.054
## 2 DemographicGen Y    0.853     0.246     3.471   0.001    0.369     1.334
## 3 DemographicGen X    0.994     0.263     3.781   0.000    0.480     1.512

Our intercept in this case is the predicted log odds of \(Happiness\) when the respondent is a Millennial (`Demographic` = 0). We can think about it as…

0 = Millennial

1 = Gen Y

2 = Gen X

The intercept isn't the baseline, like we interpreted in our experiment model. It's the *reference*. We interpret the other estimates relative to the reference category:

happiness.logit.coef

##               term estimate std.error statistic p.value conf.low conf.high
## 1      (Intercept)    0.668     0.192     3.480   0.001    0.299     1.054
## 2 DemographicGen Y    0.853     0.246     3.471   0.001    0.369     1.334
## 3 DemographicGen X    0.994     0.263     3.781   0.000    0.480     1.512

The `DemographicGenY` estimate is the log odds of `Happiness` for Gen Y'ers **minus** the log odds of `Happiness` for Millennials. Equivalently, it is the log of the odds ratio: the *odds* of `Happiness` for Gen Y'ers divided by the *odds* for Millennials.


The `DemographicGenX` estimate is the log odds of `Happiness` for Gen X'ers **minus** the log odds of `Happiness` for Millennials. Again, that's the log of the odds ratio: the *odds* for Gen X'ers divided by the *odds* for Millennials.
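As a sanity check, we can convert these log odds to probabilities by hand with `plogis` (R's inverse logit), plugging in the rounded estimates from the table above:

```r
# Probability of Happiness for the reference category (Millennials):
plogis(0.668)          # intercept alone, roughly 0.66

# Probability for Gen Y'ers: intercept plus the Gen Y estimate
plogis(0.668 + 0.853)  # roughly 0.82

# Probability for Gen X'ers: intercept plus the Gen X estimate
plogis(0.668 + 0.994)  # roughly 0.84
```

These match the fitted probabilities we'll compute with `augment` below.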

The interpretation of a categorical variable in an OLS regression model is a little bit different…

creativity.ols <- lm(Creativity ~ Demographic, data = experiment.df)
tidy(creativity.ols)

##               term   estimate std.error statistic       p.value
## 1      (Intercept)  4.5206612 0.1010919 44.718312 4.339525e-199
## 2 DemographicGen Y -0.2206612 0.1203479 -1.833527  6.718840e-02
## 3 DemographicGen X -0.6284198 0.1246981 -5.039529  6.080664e-07

The `DemographicGenY` estimate is the predicted value of `Creativity` for Gen Y'ers **minus** the predicted value of `Creativity` for Millennials.

The `DemographicGenX` estimate is the predicted value of `Creativity` for Gen X'ers **minus** the predicted value of `Creativity` for Millennials.
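We can see why the "minus" interpretation works with a toy example (fabricated numbers, not our dataset): with a factor predictor, the OLS intercept is the mean of the reference group, and each coefficient is a group mean minus that reference mean.

```r
# Fabricated toy data: group A has mean 2, group B has mean 5
toy <- data.frame(group = factor(rep(c("A", "B"), each = 3)),
                  y     = c(1, 2, 3, 4, 5, 6))

fit <- lm(y ~ group, data = toy)
coef(fit)
# (Intercept) = mean of A = 2; groupB = mean(B) - mean(A) = 5 - 2 = 3
```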

Once again, the log odds (or odds) aren't all that intuitive. Let's convert these to probabilities, and we'll start by creating the vector of values of `Demographic` that mimics our categories:

# Note here that we are passing the LEVELS of the categorical to our data frame!
predictor.values <- data_frame(Demographic = c("Millennials", "Gen Y", "Gen X"))
predictor.values

## # A tibble: 3 x 1
##   Demographic
##         <chr>
## 1 Millennials
## 2       Gen Y
## 3       Gen X

Now let's create our predicted values by passing the vector of possible `Demographic` values that we just created to our estimated model:

happiness.predict <- broom::augment(x = happiness.logit, newdata = predictor.values,
                                    type.predict = "response")
happiness.predict

##   Demographic   .fitted    .se.fit
## 1 Millennials 0.6611570 0.04302876
## 2       Gen Y 0.8206897 0.02252647
## 3       Gen X 0.8405172 0.02403733

Now let's make a plot, and we'll start by creating our confidence intervals:

happiness.predict <- happiness.predict %>%
  mutate(Pr_y     = 100 * .fitted,
         lower.ci = 100 * (.fitted - (1.96 * .se.fit)),
         upper.ci = 100 * (.fitted + (1.96 * .se.fit))) %>%
  dplyr::select(Demographic, Pr_y, lower.ci, upper.ci)
happiness.predict

##   Demographic     Pr_y lower.ci upper.ci
## 1 Millennials 66.11570 57.68207 74.54934
## 2       Gen Y 82.06897 77.65378 86.48415
## 3       Gen X 84.05172 79.34041 88.76304

And now for the plot…

predict.plot <- ggplot(happiness.predict, aes(Demographic, Pr_y, colour = Demographic)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower.ci, ymax = upper.ci)) +
  scale_x_discrete(limits = c("Millennials", "Gen Y", "Gen X")) +
  labs(title = "Probability of Happiness on a Task by Demographic",
       subtitle = "Older Workers are Happier Workers?",
       y = "Probability of Happiness",
       x = "") +
  scale_colour_discrete(guide = FALSE)
predict.plot

OK, let's kick it up a notch with an interaction effect.

In an interaction effect, we are saying that the effect (or odds, etc.) of \(x\) on \(y\) varies in the presence of another variable, \(m\).

The \(m\) in this case provides context.

The concept of interaction effects, or moderation, is one of the most powerful tools we have at our disposal for getting a fine-grained perspective on our analyses.

They are also challenging to perform correctly, interpret correctly, and visualize correctly :)

The basic logic to calculate an interaction effect is to multiply \(x\) and \(m\) together, creating a new variable, \(xm\).

Then in your model, you include \(x\), \(m\), and the new \(xm\).

The coefficient you are looking for in determining if an interaction effect is present is the one for \(xm\).
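Here's a small sketch (with simulated data, not our dataset) showing that R's `*` shorthand is the same as including \(x\), \(m\), and the hand-multiplied \(xm\):

```r
# Simulated data with a true interaction effect built in
set.seed(42)
x <- rnorm(100)
m <- rnorm(100)
y <- 1 + 2 * x + 3 * m + 4 * (x * m) + rnorm(100)

# The shorthand x * m expands to x + m + x:m under the hood
fit.short <- lm(y ~ x * m)
fit.hand  <- lm(y ~ x + m + I(x * m))

# Identical coefficient estimates either way
cbind(coef(fit.short), coef(fit.hand))
```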

Instead of using the hand multiplication approach, we are going to use a shortcut in R that handles the multiplication under the hood. Let's start with an interaction of a continuous variable, `RespExper`, with our `Demographic` categorical variable.

With interaction effects though, it's **really** helpful to mean center the continuous variable(s), so let's do that first:

experiment.df <- experiment.df %>%
  mutate(RespExper.center = (RespExper - mean(RespExper)))

Now let's specify our model. What we are suggesting is that the log odds for the relationship between `RespExper` and `Happiness` vary across the `Demographic` of the respondent.

happiness.logit.demo <- glm(Happiness ~ RespExper.center * Demographic,
                            data = experiment.df, family = "binomial")
tidy(happiness.logit.demo)

##                                term  estimate std.error statistic      p.value
## 1                       (Intercept) 0.7355937 0.2165199 3.3973496 0.0006804198
## 2                  RespExper.center 0.1062160 0.1501101 0.7075870 0.4792017769
## 3                  DemographicGen Y 0.8140833 0.2673218 3.0453310 0.0023242437
## 4                  DemographicGen X 0.8676217 0.2825404 3.0707882 0.0021349454
## 5 RespExper.center:DemographicGen Y 0.1464776 0.2054146 0.7130829 0.4757944825
## 6 RespExper.center:DemographicGen X 0.1520271 0.2258966 0.6729945 0.5009508094

Interpreting interaction coefficients isn't exactly straightforward. I do want to draw your attention though to the p-values…


In our data, the interaction effects aren't statistically different from zero. That means the log odds of happiness as a function of the respondent's years of experience don't vary across the three demographic categories.

To put it another way, we would expect the same probability curve of experience on happiness irrespective of the age range (by our categories) of our respondents.

The easiest way to make sense of an interaction is a visualization. This gets a little tricky though, because `RespExper.center` is a continuous variable, which means that it takes on a **bunch** of different values itself.

Plotting these by hand takes some work, so we're going to trade flexibility in our code for some shortcuts in R.

install.packages("visreg")

library(visreg)

visreg(happiness.logit.demo, "RespExper.center", by = "Demographic",
       scale = "response", gg = TRUE,
       ylab = "Probability of Happiness",
       xlab = "Mean Centered Years of Experience (Mean = 0)")

Now, there are also continuous variable by continuous variable interactions…

experiment.df <- experiment.df %>%
  mutate(RespAge.center = (RespAge - mean(RespAge)))

happiness.logit.cont <- glm(Happiness ~ RespExper.center * RespAge.center,
                            data = experiment.df, family = "binomial")
tidy(happiness.logit.cont)

##                              term    estimate  std.error statistic       p.value
## 1                     (Intercept) 1.423476656 0.10451981 13.619205  3.079032e-42
## 2                RespExper.center 0.211288779 0.09238293  2.287098  2.219013e-02
## 3                  RespAge.center 0.076827821 0.02712916  2.831928  4.626830e-03
## 4 RespExper.center:RespAge.center 0.004784687 0.02017426  0.237168  8.125265e-01

Now let's take a look at a plot. Continuous by continuous interactions require us to visualize the relationship as a three-dimensional image. This is called a *perspective* plot.

visreg2d(happiness.logit.cont, "RespExper.center", "RespAge.center",
         plot.type = "persp", scale = "response",
         ylab = "\n Mean Centered Age \n (Mean = 0)",
         xlab = "\n Mean Centered Years of \n Experience (Mean = 0)",
         zlab = "\n Probability of Happiness")