Predicted probabilities with rstanarm

· by Brian Anderson · Read in about 14 min · (2931 words) ·

I’m a big fan of logit models. In management research and management practice, logit models help us to answer the kind of questions we typically ask…what is the probability of \(y\) happening, if \(x\) exists or if \(x\) changes?

As I turn almost exclusively to Bayesian inference in my work, I thought it would be helpful to post how I go about calculating and visualizing predicted probabilities.

For this example, I’m going to use a modified dataset I use often in class. The data is a classic 2x2 between-subjects experimental design, with a binary outcome, Happiness, and two experimental conditions, EntManager and TaskSpecificity. We also have covariates on the participants: Age, Gender, and years of work experience (WorkExperience).

Our research question is how much happiness with completing a task improves if I have an entrepreneurial manager leading me, and if I’m provided an outline of how to perform that task. For parsimony, I’m going to skip the interaction effect between the two experimental conditions.

library(tidyverse)
experiment.df <- read_csv("https://www.drbanderson.com/data/Experiment.csv")
experiment.df
## # A tibble: 643 x 7
##      Obs Gender   Age WorkExperience Happiness EntManager TaskSpecificity
##    <dbl>  <dbl> <dbl>          <dbl>     <dbl>      <dbl>           <dbl>
##  1     1      0    39              0         1          0               1
##  2     2      1    39              4         1          1               1
##  3     3      1    33              4         0          0               0
##  4     4      0    44              5         1          1               1
##  5     5      0    40              3         1          1               0
##  6     6      1    36              3         1          0               0
##  7     7      1    29              3         0          0               1
##  8     8      0    37              2         0          0               1
##  9     9      1    42              4         1          1               1
## 10    10      0    34              1         0          1               0
## # ... with 633 more rows

Wrangling

I am a big fan of mean-centering continuous predictors in a model, because it makes the intercept meaningful. I also like working with binary or categorical predictors, like Gender in this data, as factor variables. So let’s do a little wrangling…

# Center the continuous predictors and convert Gender to a labeled factor
experiment.df <- experiment.df %>% 
  mutate(cenAge = Age - mean(Age),
         cenExp = WorkExperience - mean(WorkExperience)) %>% 
  mutate(Gender = factor(Gender, labels = c("Male", "Female")))

Specify The Model

I’m a big fan of rstanarm. The rstanarm package uses rstan as its back end, but keeps the model specification very similar to base R functions like glm.

One of the advantages of rstanarm is its default weakly informative priors. One of the best parts of Bayesian inference is the ability to account for what we already know, or suspect, about a model. This understanding makes up the prior, and under Bayesian inference, we use the prior, in conjunction with our data, to update our beliefs after running a model…

\[\text{Posterior} \propto \text{Prior} \times \text{Likelihood}\]

In the case of, say, the false-positive rate of a medical test, it often makes sense to use a strong, well-informed prior. For many applications where we do not have a strong understanding, a weakly informative prior typically makes more sense. A weakly informative prior penalizes estimates that are unlikely to exist in reality, without being so strong that it rules out estimates within the range of plausible values.
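If we did want to bring stronger information to bear, rstanarm also lets us set priors explicitly instead of relying on the defaults. Here is a minimal sketch, with purely illustrative scale values (model.custom is a hypothetical name, not a model I fit in this post):

library(rstanarm)

# Illustrative only: a normal(0, 2.5) prior shrinks implausibly large
#  log-odds coefficients toward zero without ruling out plausible effects
model.custom <- stan_glm(Happiness ~ EntManager + TaskSpecificity + cenAge + cenExp + Gender,
                         family = binomial(link = "logit"),
                         prior = normal(0, 2.5),
                         prior_intercept = normal(0, 5),
                         data = experiment.df)

prior_summary(model.custom)  # Inspect the priors actually applied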

While I really like the ability to incorporate prior information, what I like most about Bayesian inference is that I can actually test the hypothesis I want to test.

In this model, I can actually test whether having an entrepreneurial manager and task specificity increases happiness. Despite a very common misconception, that is something frequentist statistics just cannot do: a p-value gives the probability of observing data at least as extreme as ours if the null hypothesis were true, not the probability that our hypothesis is true.

Ok, let’s specify our model. I’m going to fit the model using rstanarm’s default weakly informative priors, a common seed value to make the MCMC draws replicable, and a binomial likelihood with a logit link function.

library(rstanarm)
model.seed <- 1

model <- stan_glm(Happiness ~ EntManager + TaskSpecificity + cenAge + cenExp + Gender,
                  family = binomial(link = "logit"),
                  seed = model.seed, data = experiment.df)
## 
## SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000562 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 5.62 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.772802 seconds (Warm-up)
## Chain 1:                0.645267 seconds (Sampling)
## Chain 1:                1.41807 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 8.4e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.84 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.719464 seconds (Warm-up)
## Chain 2:                0.786937 seconds (Sampling)
## Chain 2:                1.5064 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 8.5e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.85 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.777165 seconds (Warm-up)
## Chain 3:                0.743296 seconds (Sampling)
## Chain 3:                1.52046 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 8.5e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.85 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
## Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.805661 seconds (Warm-up)
## Chain 4:                0.841959 seconds (Sampling)
## Chain 4:                1.64762 seconds (Total)
## Chain 4:

Ok, let’s take a look at the model.

model
## stan_glm
##  family:       binomial [logit]
##  formula:      Happiness ~ EntManager + TaskSpecificity + cenAge + cenExp + 
##     Gender
##  observations: 643
##  predictors:   6
## ------
##                 Median MAD_SD
## (Intercept)     0.4    0.2   
## EntManager      1.3    0.2   
## TaskSpecificity 0.8    0.2   
## cenAge          0.1    0.0   
## cenExp          0.2    0.1   
## GenderFemale    0.3    0.2   
## 
## Sample avg. posterior predictive distribution of y:
##          Median MAD_SD
## mean_PPD 0.8    0.0   
## 
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg

I’ll leave it to the rstanarm vignette for interpretation of the results. Right now though, let’s focus on getting the predicted probabilities out of the model, which, from my perspective, is the most useful part of the analysis.

One thing I am going to do is create a dataframe of tidy model results. This comes in handy for presenting results in a paper, and I’ll use the broom package to extract the parameter estimates and store them in a dataframe. I’m also going to round the estimates to two decimal places because, let’s face it, that’s really all we need in practice!

library(broom)
model.results <- tidy(model) %>% 
  mutate_if(is.numeric, ~ round(.x, 2))

model.results
## # A tibble: 6 x 3
##   term            estimate std.error
##   <chr>              <dbl>     <dbl>
## 1 (Intercept)         0.44      0.17
## 2 EntManager          1.34      0.22
## 3 TaskSpecificity     0.77      0.21
## 4 cenAge              0.08      0.03
## 5 cenExp              0.22      0.09
## 6 GenderFemale        0.3       0.23
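This is also where the “test the hypothesis you want to test” point from earlier becomes concrete. The posterior draws are available directly from the fitted model, so a directional test is just a proportion of draws. A quick sketch (posterior.draws is an illustrative name):

# Each row is one posterior draw; each column is a model parameter
posterior.draws <- as.data.frame(model)

# The posterior probability that each coefficient is positive, that is, 
#  the probability, given the data and priors, that the condition 
#  increases the log odds of happiness
mean(posterior.draws$EntManager > 0)
mean(posterior.draws$TaskSpecificity > 0)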

Helper Functions

Because I use these models a lot, I’ve written a series of helper functions. Ultimately I’ll turn these into a package, but for now, I’ll just put them below.

The first function calculates the predicted probabilities across four experimental conditions (I’ll explain more in a bit). The second function creates a table with the median predicted probability by condition. The last function creates a ggplot visualization of the posterior predicted probability distributions over the four conditions.

We are going to use the posterior_linpred function in rstanarm to calculate the predicted probability distribution. The posterior_linpred function returns posterior draws from a linear predictor, and I’ve found it to be invaluable, particularly when working with logit models.
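Before wrapping posterior_linpred in a function, it helps to see what a single call returns. A minimal sketch for one condition (one.condition is an illustrative name; transform = TRUE puts the draws on the probability scale):

# One row of predictor values: entrepreneurial manager, task specificity, 
#  Male, at the sample means of Age and Experience (centered values of zero)
one.condition <- data.frame(EntManager = 1, TaskSpecificity = 1,
                            cenAge = 0, cenExp = 0, Gender = "Male")

pp <- posterior_linpred(model, newdata = one.condition, transform = TRUE)
dim(pp)      # One row per posterior draw, one column per newdata row
median(pp)   # The median predicted probability for this condition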

Calculating predicted probabilities

# Define a function to get the predicted probabilities for each condition. The 
#  function takes in the fitted model, and then iterates through the four 
#  experimental conditions, holding the covariates constant, and using the 
#  same model seed as the fitted model. The function returns four dataframes, 
#  which are combined and then transformed into a long-form dataframe that is 
#  easier to use with ggplot and other tidyverse functions.
getProbabilities <- function(x) {
  problist <- list()
  for (i in 1:4) {
    df <- data.frame(posterior_linpred(x, newdata = conditions.df[i, ], 
                                       seed = model.seed, re.form = NA, transform = TRUE))
    problist[[i]] <- df
  }
  
  df <- do.call(cbind, problist)
  colnames(df) <- conditions
  
  df %>% 
    gather(Condition, Probability) %>% 
    group_by(Condition) 
}

Median predicted probability

# Define a function to get the median predicted probability for each condition, 
#  and arrange them in the same order as the conditions vector. We also construct 
#  a 90% uncertainty interval around each predicted median.
createProbTable <- function(x) {
  x %>% 
    summarize(MedianProbability = median(Probability),
              ui.low = quantile(Probability, .05),
              ui.high = quantile(Probability, .95)) %>% 
    arrange(match(Condition, conditions))
}

Plot of posterior predicted probability

# Define a function to plot the probability distribution for each condition on a 
#  single ggplot object, including the median probability of each condition as a 
#  dashed line. The function takes in as inputs the predicted probabilities (x) 
#  and the median predicted probabilities (y).
plotProbabilities <- function(x, y) {
  # Note that the `x = Probability` refers to the x axis in the ggplot object; 
  #  the first 'x' passes the predicted probability dataframes
  ggplot(data = x, aes(x = Probability, fill = Condition)) +
    geom_density(alpha = .3) + 
    geom_vline(data = y, aes(xintercept = MedianProbability,  color = Condition),
               linetype = "dashed", size = 1) +
    scale_x_continuous(labels = scales::percent) + 
    labs(y = "Pr. Density") + 
    theme_bw() + 
    theme(legend.position = "bottom", legend.title = element_blank()) + 
    guides(fill = guide_legend(nrow = 2))
} 

Predicted Probabilities

There are a couple of steps to the next process. The first one is simply defining a vector of conditions that we will use to display the predicted probabilities. Because the design is a 2x2, and both predictors are binary, I define four total conditions representing the four possible combinations a participant could have been assigned to in the experiment.

conditions <- c("Conservative Manager/No Task Specificity",
                "Conservative Manager/Task Specificity",
                "Entrepreneurial Manager/No Task Specificity",
                "Entrepreneurial Manager/Task Specificity")

Next comes defining a new dataframe. The posterior_linpred function follows the same convention as R’s predict function: it can take a new dataframe as input. We want to predict the probability of happiness at each of the four possible experimental conditions. But because our initial model contains covariates, we need to include these as well.

This is where mean-centering the continuous predictors comes in handy, as does factoring the categorical variable, Gender…

conditions.df <- data.frame(EntManager = c(0, 0, 1, 1),
                            TaskSpecificity = c(0, 1, 0, 1),
                            Gender = "Male",
                            cenAge = mean(experiment.df$cenAge),
                            cenExp = mean(experiment.df$cenExp))

conditions.df
##   EntManager TaskSpecificity Gender        cenAge       cenExp
## 1          0               0   Male -2.301813e-17 1.098384e-16
## 2          0               1   Male -2.301813e-17 1.098384e-16
## 3          1               0   Male -2.301813e-17 1.098384e-16
## 4          1               1   Male -2.301813e-17 1.098384e-16

What we have above is a dataframe with four sets of conditions. The EntManager and TaskSpecificity conditions vary based on the four possible scenarios. But the Gender, Age, and Experience variables are the same for each row. We set the Gender variable at our baseline (Male), and Age and Experience at their mean values (very close to zero, since they were mean-centered). This means our predicted probabilities align well with the log odds our model summary gives us. Each condition is the predicted probability for a Male respondent, at the sample mean level of Age and Work Experience. It would be just as easy to change the factor variable to Female, or set our covariates to any value in the sample we wanted.
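For example, predicting for Female participants is a one-line change (conditions.female.df is just an illustrative name):

# The same four conditions, but for Female participants at the sample 
#  means of Age and Experience
conditions.female.df <- conditions.df %>% 
  mutate(Gender = "Female")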

Now we pass our fitted model to the getProbabilities helper function, which uses the conditions vector and the conditions.df dataframe we just defined. getProbabilities returns the full posterior distribution of predicted probabilities for each condition, and createProbTable summarizes those distributions into median predicted probabilities with 90% uncertainty intervals.

# Call the getProbabilities function to calculate the predicted probabilities.
model.prob <- getProbabilities(model)

# Call the createProbTable function to create the predicted probabilities table.
model.prob.table <- createProbTable(model.prob)
model.prob.table %>% 
  mutate_if(is.numeric, ~ 100 * round(.x, 2)) # Convert to percentages
## # A tibble: 4 x 4
##   Condition                                MedianProbabili… ui.low ui.high
##   <chr>                                               <dbl>  <dbl>   <dbl>
## 1 Conservative Manager/No Task Specificity               61     54      67
## 2 Conservative Manager/Task Specificity                  77     71      83
## 3 Entrepreneurial Manager/No Task Specifi…               86     81      89
## 4 Entrepreneurial Manager/Task Specificity               93     90      95

So in our baseline condition with a conservative manager and no task specificity, an average participant had a 61% probability of being happy with the task, with a 90% uncertainty interval ranging from 54% to 67%.

But in our condition with an entrepreneurial manager and task specificity, an average participant had a 93% probability of being happy, with a 90% uncertainty interval ranging from 90% to 95%.
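Because every condition’s draws come from the same posterior sample, we can also summarize the difference between conditions directly, rather than eyeballing two intervals. A sketch (prob.diff and Difference are illustrative names; row i of each condition corresponds to the same posterior draw, so per-draw differences are meaningful):

# Posterior distribution of the difference in predicted probability 
#  between the best condition and the baseline
prob.diff <- model.prob %>% 
  mutate(draw = row_number()) %>%  # model.prob is grouped by Condition
  ungroup() %>% 
  spread(Condition, Probability) %>% 
  mutate(Difference = `Entrepreneurial Manager/Task Specificity` -
           `Conservative Manager/No Task Specificity`)

median(prob.diff$Difference)                        # Point summary
quantile(prob.diff$Difference, probs = c(.05, .95)) # 90% uncertainty interval

Given the medians above, we would expect this difference to center somewhere around 30 percentage points.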

Bayesian inference makes describing the model results intuitive, useful, and easy!


Plot Probabilities

Median predicted probabilities are helpful, but the power of Bayesian inference lies in thinking about model parameters not as fixed point estimates but as random parameters that take on a range of possible values.

The best way to make sense of the uncertainty and variance in the predicted probabilities is visualizing the posterior distribution of those probabilities, across the four possible experimental conditions. Here we are going to pass our model.prob dataframe of predicted probabilities, along with our model.prob.table dataframe with the median predicted probability, to our plotProbabilities helper function.

# Call the plotProbabilities function and set the plot titles
model.prob.plot <- plotProbabilities(model.prob, model.prob.table) +
  labs(title = "Predicting Happiness",
       subtitle = "Dashed line: Median predicted probability",
       x = "Probability of Feeling Happy")
model.prob.plot

As we’ve already discussed, the highest probability of happiness occurs in the entrepreneurial manager/task specificity condition. But this condition also has the narrowest uncertainty interval. We can say there is effectively zero posterior probability that this condition has no effect on happiness (a predicted probability of 50%, a coin flip), and also little chance that the probability of being happy drops below 80%. But in our baseline condition, we see that the uncertainty interval is much wider, reflecting less confidence (greater uncertainty) in our median estimate.
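We can also put numbers on those visual claims directly from the predicted probability draws. A quick sketch:

# For each condition, the share of posterior draws in which the predicted 
#  probability of happiness falls below 50% (a coin flip) or below 80%
model.prob %>% 
  summarize(PrBelow50 = mean(Probability < .50),
            PrBelow80 = mean(Probability < .80))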

This straightforward interpretation shows the power of Bayesian inference for these types of models, and it is something frequentist confidence intervals, despite popular lore, just cannot provide.


Wrap-up

My example was a 2x2, but the logic and functions are easily extended to more complicated models. It is also straightforward to adjust the functions for continuous outcomes and continuous predictors, but that’s for another post!