Complex LDV Models

ENT5587B - Research Design & Theory Testing II

Brian S. Anderson, Ph.D.
Assistant Professor
Department of Global Entrepreneurship & Innovation
andersonbri@umkc.edu

• How goes it?
• IRB Applications & eProtocol
• LDV Model Review
• Multinomial Logit
• Count Models
• Lab 2 Feb: LDV Model Assessment

Before we get in to our content, lets learn a new skill—how to create your own dataset.

Being able to create your own dataset is ridiculously handy. It lets you evaluate models, test predictions, conduct Monte Carlo simulations, and teach stats :)

R is very flexible when it comes to generating random numbers that you can combine into datasets. For example, we can create a (pseudo) random number with 1,000 observations that is normally distributed…

library(tidyverse)
set.seed(123)
my.df <- data.frame(x.normal = rnorm(1000, mean = 0, sd = 1))
normal.plot <- ggplot(my.df, aes(x.normal)) +
geom_density((aes(colour = "Our Variable"))) +
stat_function(fun = dnorm, (aes(colour = "Standard Normal"))) +
scale_colour_manual("", values = c("darkblue", "firebrick")) +
labs(title = "Density Function of 'X' With Normal Distribution Overlay",
x = "",
y = "Distribution Density")
normal.plot 

Typically though we want to create an entire dataset with a correlation structure that we’ve pre-determined.

Lets start with what we want our ending dataset to look like…

We’re going to be demonstrating logit, multinomial logit, and count models today, so we need three different dependent variables. Lets also create two predictor variables as well. So our dataset needs:

• y1 = Dichotomous (0/1)
• y2 = Nominal (0/1/2)
• y3 = Interval (0 -> n)
• x1 = Continuous
• x2 = Continuous

Now lets specify our covariance matrix. We’re doing this because the R function we’ll use to generate the dataset calls for a covariance matrix, but we can effectively make correlations by setting the variance of the variable to 1, remember?

Now, there is really no such thing as a correlation between nominal variables or between nominal variables and continuous variables. For these variables, we are interested in their association. But for our purposes, it’s easier to just think about the associations we wish to model as covariances/correlations.

So, lets build our matrix…

cov.matrix <- matrix(c(1,0,0,.2,.3,
0,.8,0,.4,.5,
0,0,7,.6,.7,
.2,.4,.6,1,0,
.3,.5,.7,0,1),
nrow = 5, ncol = 5,
dimnames = list(c("y1", "y2", "y3", "x1", "x2")))
cov.matrix
##    [,1] [,2] [,3] [,4] [,5]
## y1  1.0  0.0  0.0  0.2  0.3
## y2  0.0  0.8  0.0  0.4  0.5
## y3  0.0  0.0  7.0  0.6  0.7
## x1  0.2  0.4  0.6  1.0  0.0
## x2  0.3  0.5  0.7  0.0  1.0

Now, lets generate the dataset using the MASS library and the mvrnorm function.

library(MASS)
set.seed(123)
my.ds = mvrnorm(n = 1000,  # Number of observations
mu = c(.5, 1.5, 10, 0, 0),  # Variable means
Sigma = cov.matrix,  # Covariance matrix
tol = .1,  # Ensures a positive definite matrix
empirical = TRUE)  # Set our matrix as the empirical values
# Now lets check the covariance matrix for the new dataset
myCov.matrix <- cov(my.ds, method = c("pearson"))
round(myCov.matrix,1)
##     y1  y2  y3  x1  x2
## y1 1.0 0.0 0.0 0.2 0.3
## y2 0.0 0.8 0.0 0.4 0.5
## y3 0.0 0.0 7.0 0.6 0.7
## x1 0.2 0.4 0.6 1.0 0.0
## x2 0.3 0.5 0.7 0.0 1.0

We’re not done yet though. Each of the variables we created are continuous, and that wasn’t what we wanted for each of our dependent variables. Now, there are a variety of other ways to do this, but when I just need to keep the associations close to what I specified earlier, I just keep things simple with mutate.

library(dplyr)
myData.df <- as.data.frame(my.ds) %>%
dplyr::select(y1:x2) %>%
mutate(y1 = ifelse(y1 > .5, 1, 0)) %>%
mutate(y2 = ifelse(y2 < .5, 0,
ifelse(y2 > 1.5, 2, 1))) %>%
mutate(y3 = round(y3, 0)) %>%
mutate(y3 = ifelse(y3 > 12, 0, y3)) %>%
mutate(y3 = ifelse(y3 > 0, (13-y3), y3))
# Lets convert y2 to a factor variable to show this feature
myData.df$y2 <- factor(myData.df$y2, labels = c("Failed", "AngelFunding", "VCFunding"))
head(myData.df, 5)          
##   y1        y2 y3          x1         x2
## 1  1 VCFunding  1  0.01792598  0.7534946
## 2  1 VCFunding  2 -0.47772158  0.4333164
## 3  1 VCFunding  6  0.80987533  0.2537901
## 4  0 VCFunding  1  0.37706793 -0.1173286
## 5  1    Failed  0  1.25713777 -1.8283768

Data set tutorial over, but before we move on, lets talk about probabilities…

Probabilities produce—usually—the most useful of insights to researchers and to practitioners yet are frequently overlooked for their continuous variable/effect size cousins in our research.

Consider entrepreneurial failure.

The 80/90/95/96% statistic gets floated—and repeated—often, although the true percentage of startup/business failures is largely unknowable.

What we can agree on is that the majority of startups will fail, and it would be really valuable to know what causes entrepreneurial failure.

Wait a second, why would an experimental design be decidely unethical to research this question?

So we have to deal with observational data. Lets leave aside the endogeneity problem for right now, and think about how we can tease out the causes of entrepreneurial failure.

The easiest way to think about entrepreneurial failure is as a binary (0 or 1) outcome—the startup failed during the sampling window, or it didn’t.

With a binary outcome, our DV is censored. Censored variables include any nominal variable where the outcome takes on a discrete value, is truncated in some way, or is otherwise “cut off”.

Oh, another way of looking at failure is with a hazard model, but that’s for another class!

With our discrete outcome then, we must use a Limited Dependent Variable (LDV) estimator. In a LDV model, the residuals are not normally distributed, so OLS estimators are inappropriate.

To predict entrepreneurial failure, we’re going to use a logit model.

$$ln\left[\frac{p}{1-p}\right]=\alpha+{\beta}{x}+\varepsilon$$

• ln = Natural logarithm
• p = Probability that event y occurs (y = 1)
• p/(1-p) = Odds ratio
• ln[p/(1-p)] = log odds ratio, or ‘logit’ transformation of the odds ratio

A logit model is simply a non-linear transformation of the general linear model. In this case, the logit transformation is the link function.

Here’s the material point…

With a logit model, we’re really interested in the probability that an event (our Y = 1 condition) will occur, as a function of a change in our independent variable(s).

The estimate of $${\beta}$$ in a logit model isn’t the change in the value of Y as X changes, because Y can only take on a value of 0 or 1. So a logit model tells you the change in the odds ratio of Y occurring as a function of the level of X.

We use the odds ratio to construct the probability.

Lets roll!

myData.df$y1 <- factor(myData.df$y1,  # Convert y1 to a factor variable
labels = c("Survived", "Failed"))
logit.model <- glm(y1 ~ x1 + x2,
data = myData.df,
family = "binomial")
summary(logit.model)
##
## Call:
## glm(formula = y1 ~ x1 + x2, family = "binomial", data = myData.df)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.8524  -1.1081   0.5755   1.0989   1.8473
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.008828   0.065952   0.134    0.894
## x1          0.371658   0.067964   5.468 4.54e-08 ***
## x2          0.487082   0.069366   7.022 2.19e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1386.3  on 999  degrees of freedom
## Residual deviance: 1303.0  on 997  degrees of freedom
## AIC: 1309
##
## Number of Fisher Scoring iterations: 4

We’ll make use of quite a bit of this output, but for right now, we can say that the log odds of failure occurring increases by .371 and by .487 for each one unit increase in x1 and x2 respectively.

Pretty easy to understand, right?

Ok, the log odds are the natural log of the odds, and the odds is the probability that an event will occur divided by the probability that the event will not occur.

$$ln\left[\frac{p}{1-p}\right]$$

The difference between the odds for a given value of the predictor and another value is called the odds ratio, which is what our model displayed as the coefficient estimate.

BTW, not understanding that the coefficient is the log odds ratio is one of the things Wiersema and Bowen (2009) were complaining about, and it’s a big reason why the paper you critiqued during Lab was incorrect.

To convert from the log odds to just the odds, we exponent the coefficient estimate.

exp(coef(logit.model))
## (Intercept)          x1          x2
##    1.008867    1.450137    1.627560

This is easier to understand. Because our predictors are continuous, we interpret the odds ratio as the change in the odds of failure occurring for every one unit increase in the predictor.

So in this case, the odds of failure occurring is 1.450137:1 for every one unit change in x1. We can convert the odds to a probability and say that the probability of failure for an average x1 is 59%.

myProb <- 1.450137/(1.450137 + 1)
round(myProb,2)
## [1] 0.59

Quick note, but REALLY important…

If you see an odds ratio below 1.0—0.68:1 for example—that means that the odds of what you are predicting to happen (when the DV = 1) went down with an increase in your independent variable. In this example, the probability of y = 1 is now only 40% for an average x1.

myOdds <- 0.68/(0.68 + 1)
round(myOdds,2)
## [1] 0.4

If the odds ratio was 1.0, you would have an equivalent probability (a 50/50 chance) of the DV = 1 condition occurring.

You can observe a negative coefficient estimate in a logit model. A negative coefficient implies that the log odds of y = 1 decreases as x increases by one unit (e.g., log(0.68) = -0.386).

Ok, so the odds are helpful, but we really want to understand the probability of failure.

The kicker though is that the probability of the event happening is based on a particular value of the predictor.

Imagine that x1 is the startup’s burn rate. At low values of x1, the business is burning through very little cash, so the probability of failure is likely to be pretty low. As the burn rate increases—x1 gets bigger—the probability of failure is also likely to get bigger.

So, to really understand how x1 relates to failure, we need to get the predicted probabilities across the range of values of x1 in our dataset.

Remember though that this is a multivariate model, so we’ll calculate the predicted probabilities for x1 holding constant the odds of failure at the mean level of x2.

Lets start by getting the range of x1 and the mean of x2. We’ll create a new tibble with our range of x1 incremented by .1.

library(dplyr)
myRange.vector <- myData.df %>%
summarise(min.x1 = round(min(x1),1),
max.x1 = round(max(x1),1),
mean.x2 = mean(x2))
myRange.vector
##   min.x1 max.x1       mean.x2
## 1   -3.2    3.4 -9.381493e-18

Now we’ll create a new tibble for our predicted probabilities based on our range of x1 and the mean value of x2.

# Create a new dataframe, and start adding variables
myRange.df <- data.frame(x1 = seq(from = myRange.vector$min.x1, to = myRange.vector$max.x1, by = .1))
# Note the different syntax here to add a variable to myPredicted.df
myRange.df$x2 <- myRange.vector$mean.x2

Now lets run our model again, get the predicted values (called fit), standard errors, and calculate the confidence intervals.

myPredicted.df <- cbind(myRange.df,
predict(logit.model, newdata = myRange.df,
# Now lets calculate the confidence intervals.
# We're going to use the 'plogis' function to covert the values to probabilities.
myPredicted.df <- within(myPredicted.df, {
PredProb <- plogis(fit)
lower.ci <- plogis(fit - (1.96 * se.fit))
upper.ci <- plogis(fit + (1.96 * se.fit))
})
head(myPredicted.df, 5)
##     x1            x2       fit    se.fit residual.scale  upper.ci
## 1 -3.2 -9.381493e-18 -1.180479 0.2272658              1 0.3240901
## 2 -3.1 -9.381493e-18 -1.143313 0.2207707              1 0.3294657
## 3 -3.0 -9.381493e-18 -1.106148 0.2142943              1 0.3348944
## 4 -2.9 -9.381493e-18 -1.068982 0.2078383              1 0.3403761
## 5 -2.8 -9.381493e-18 -1.031816 0.2014047              1 0.3459108
##    lower.ci  PredProb
## 1 0.1643900 0.2349660
## 2 0.1713593 0.2417125
## 3 0.1785556 0.2485898
## 4 0.1859802 0.2555968
## 5 0.1936338 0.2627322

We can take a look at our results with a tibble, and take advantage of some mutations…

viewPredicted.df <- myPredicted.df %>%
dplyr::select(x1, PredProb, lower.ci, upper.ci) %>%
mutate(upper.ci = 100*round(upper.ci, 4),
lower.ci = 100*round(lower.ci, 4),
PredProb = 100*round(PredProb, 4))
head(viewPredicted.df, 10)
##      x1 PredProb lower.ci upper.ci
## 1  -3.2    23.50    16.44    32.41
## 2  -3.1    24.17    17.14    32.95
## 3  -3.0    24.86    17.86    33.49
## 4  -2.9    25.56    18.60    34.04
## 5  -2.8    26.27    19.36    34.59
## 6  -2.7    27.00    20.15    35.15
## 7  -2.6    27.74    20.96    35.71
## 8  -2.5    28.49    21.80    36.28
## 9  -2.4    29.25    22.65    36.86
## 10 -2.3    30.03    23.53    37.44

The best way to explore our probabilities though is with a plot…

library(ggplot2)
library(ggthemes)
PredProb.plot <- ggplot(viewPredicted.df, aes(x = x1, y = PredProb)) +
geom_line() +
geom_ribbon(aes(ymin = lower.ci, ymax = upper.ci), alpha = .2) +
labs(title = "Probability of Entrepreneurial Failure \nAs Burn Rate Increases",
x = "Burn Rate",
y = "Predicted Probablility (%)")
PredProb.plot + theme_economist()

Now that we understand the different probabilities, we can also take a look at the marginal effects.

The marginal effect is a measure of the change in the probability of Y (failure in our model) as a function of a change in a given set of values of the predictor. The average marginal effect then is just the mean expected change in probability for each unit change in X.

As Wiersema and Bowen (2009) noted though, the marginal effect is a function of the change in the value of X. So the marginal effect then is not a constant for continuous variables.

There’s some debate as to the usefulness of marginal effects. We’ll spend more time with MEs when we get to moderation, but there is a handy R package that calculates the AME for you (otherwise you need to use calculus!).

install.packages("mfx")
library(mfx)
ame.model <- logitmfx(formula = y1 ~ x1 + x2, data = myData.df, atmean = FALSE)
ame.model
## Call:
## logitmfx(formula = y1 ~ x1 + x2, data = myData.df, atmean = FALSE)
##
## Marginal Effects:
##       dF/dx Std. Err.      z     P>|z|
## x1 0.085447  0.016614 5.1430 2.703e-07 ***
## x2 0.111983  0.017578 6.3706 1.883e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So for each one unit increase in x1, on average we expect an 8.5 (100*.085) increase in the probability of failure holding the odds of x2 constant, which roughly corresponds to our picture.

Ok, we’re not done yet with our logit model. Unlike in OLS, there is no such thing as the amount of explained variance—the R2—in the dependent variable, because we’re predicting probabilities, not variance.

LDV models will often display so-called psuedo-R2 values, HOWEVER, as Wiersema and Bowen (2009) noted they differ widely in their assumptions, and still don’t really capture the same thing as R2 in OLS.

There is, however, a global test of model fit for most LDV models analogous to an F test in OLS, called the likelihood ratio test.

The likelihood ratio test statistic follows a Chi2 distribution with d.f. equal to the number of predictors in the model. The statistic itself is the difference between the deviance residuals of a null model from our estimated model.

# We'll use a tibble to store our variables.
logit.model.fit <- tibble(LikelihoodRatio = with(logit.model,
null.deviance - deviance),
df = with(logit.model, df.null - df.residual),
pvalue = with(logit.model,
pchisq(null.deviance - deviance,
df.null - df.residual,
lower.tail = FALSE)))
logit.model.fit
## # A tibble: 1 × 3
##   LikelihoodRatio    df     pvalue
##             <dbl> <int>      <dbl>
## 1        83.29759     2 8.1688e-19

So quiz…how do we interpret this result?

Next up, multinomial logit.

Multinomial logit involves dependent variables that take on discrete values, but the order doesn’t matter.

levels(myData.df$y2) ## [1] "Failed" "AngelFunding" "VCFunding" In our data, y2 takes on the values of 0, 1, and 2 for whether the business failed in the first year, received Angel Funding, or received VC Funding. In our data, and in multinomial logit, the order the levels doesn’t matter–they are independent. If the order does matter, then an ordered logit model is more appropriate, but that’s for another class. For our multinomial model, we’re going to use the nnet package. install.packages("nnet") Lets specify our model with y2 as the DV, and x1 and x2 as predictors. First though we need to set a reference level for our y2 variable. The reference level provides our point of comparison for the odds ratios/probabilities. myData.df$y2 <- relevel(myData.df$y2, ref = "Failed") library(nnet) multinomial.model <- multinom(y2 ~ x1 + x2, data = myData.df) summary(multinomial.model) ## # weights: 12 (6 variable) ## initial value 1098.612289 ## iter 10 value 733.188105 ## final value 731.616110 ## converged ## Call: ## multinom(formula = y2 ~ x1 + x2, data = myData.df) ## ## Coefficients: ## (Intercept) x1 x2 ## AngelFunding 2.171355 0.8385041 1.342870 ## VCFunding 2.287509 1.8490431 2.470239 ## ## Std. Errors: ## (Intercept) x1 x2 ## AngelFunding 0.1921911 0.1367924 0.1523777 ## VCFunding 0.1924488 0.1575186 0.1766762 ## ## Residual Deviance: 1463.232 ## AIC: 1475.232 So this gets a little complicated, and also note that the package doesn’t provide p-values by default—we won’t make those today, but it’s pretty straightforward. You can think about the coefficient estimate as the change in the log odds of y2 occurring for every one unit change in x1 (or x2) in comparison to the reference level. So for x1 for example, we would say that the log odds of Angel Funding occurring increase by .839 for every one unit change in x1 relative to the odds of Failure. Similarly, we would say that the log odds of VC Funding increase by 1.849 for every one unit change in x1 relative to the odds of Failure. Calculating confidence intervals and plots then proceed as with the logit model, although you’ve got two sets of equations to deal with now. Really though, give it a shot! Next up, count models. As you read in Blevins et al (2016), count models are increasingly popular in management research. What makes count models interesting—at least to me—is that the DV is usually often of theoretical and practical importance. Take for example patent activity, and lets assume our variable, x3, represents the number of patents filed. hist(myData.df$y3)  # Griffin, this one is for you!

##

Not surprisingly, most companies didn’t have any patent filings (y3 = 0). Some had a lot (y3 = 12). What we’re interested in is whether our predictors—x1 and x2—increase patent count.

If I’m a manager, I want to know if investing in x1 and x2 helps boost my patent filings. Theoretically, I’m interested in productive resources and resource combinations that relate to innovation productivity.

The challenge though, as Blevins et al (2016) noted, is that while such models are easy to conceive and to specify, modeling them correctly is challenging, let alone interpreting the model output.

The most substantial challenge is the issue of dispersion, which is a close cousin to variance. Dispersion is a big deal in count models because the most common estimator, Poisson models, is very sensitive to the dispersion of the dependent variable.

Go back and take a look at the histogram of y3. It’s easy to see that the counts are not spread equally across the observed values. We also see that the most common observed value is zero—no patent activity at all. The unequal distribution and the significant number of zeros both cause problems for regular Poisson regression.

This is really a more advanced topic than we have time for, and I’d refer you to UCLA’s excellent page on the topic. What I do want to highlight though is evaluating the difference between Poisson models and the common model used model to deal with overdispersion—violating the assumption that $$E[Y]=\mu$$—the negative binomial model.

Side note–as Blevins et al (2016) discussed, many times when researchers switched from Poisson to negative binomial the most appropriate method was likely zero-inflated Poission. That is really a different discussion, but it’s worth familiarizing yourself with this option if you are going to model count-based outcomes.

We’re going to make use of the odTest function in the pscl package.

install.packages("pscl")
library(pscl)
possionOD.model <- glm.nb(y3 ~ x1 + x2,
data = myData.df)
odTest(possionOD.model)
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic =  163.0489 p-value = < 2.2e-16

This test compares a Poisson model to a negative binomial model to evaluate whether it’s appropriate the relax the assumption that $$E[Y]=\mu$$, which must be true for a Poisson model to produce consistent estimates.

Rejecting the null hypothesis provides statistical evidence of over-dispersion, and that your data violates the $$E[Y]=\mu$$ assumption. As such, an alternative to classic Poisson is preferred.

Purely FWIW, I prefer the likelihood ratio test over the classic ratio comparison discussed in Blevins et al (2016), but they’re very close cousins of each other.

Now lets estimate our negative binomial model.

negativebinomal.model <- glm.nb(y3 ~ x1 + x2, data = myData.df)
summary(negativebinomal.model)
##
## Call:
## glm.nb(formula = y3 ~ x1 + x2, data = myData.df, init.theta = 3.774289464,
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.60279  -0.93497  -0.07497   0.60898   2.19143
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.11804    0.02455  45.540  < 2e-16 ***
## x1          -0.17167    0.02437  -7.045 1.86e-12 ***
## x2          -0.19567    0.02451  -7.983 1.43e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.7743) family taken to be 1)
##
##     Null deviance: 1347.8  on 999  degrees of freedom
## Residual deviance: 1235.9  on 997  degrees of freedom
## AIC: 4343
##
## Number of Fisher Scoring iterations: 1
##
##
##               Theta:  3.774
##           Std. Err.:  0.439
##
##  2 x log-likelihood:  -4335.029

A negative binomial model also makes use of a log transformation to linearize its parameters and render consistent estimates.

That means the coefficient estimates are expressed in log units, and specifically as the expected log count. So in our model, for every one unit increase in x1, we expect the log count of y3 to decrease by -.172.

There is a lot more to these types of models, well beyond what we will be able to cover, but I would encourage you to learn more!

Wrap-up.

Lab 2 Feb – LDV Model Assessment

Seminar 6 Feb – Moderation