The basics of panel data

ENT5587B - Research Design & Theory Testing II

Brian S. Anderson, Ph.D.
Assistant Professor
Department of Global Entrepreneurship & Innovation
andersonbri@umkc.edu

© 2017 Brian S. Anderson

• Reviews due!
• Paper revision and reviewer responses due 24 April
• Data collection discussion
• Panel data basics
• Lab this afternoon – Panel data assessment
• Seminar 17 April – Panel data and causal effects

This is a tough topic. I’m not kidding.

For the next three weeks we will, maybe, scratch the surface. We’re going to cover the basics, which should help you to make sense of papers that use panel data.

If you think—and I think most of you are in this boat—that you will investigate multilevel research questions, let me point you to two very good resources.

The Wooldridge book is a classic text, and you should buy it. You’ll see references to this book across several different literatures. The Heiss book is new—and free to read!—and leverages the Wooldridge text within an R framework. Think of Wooldridge as the theory and the proofs, and Heiss as the ‘how-to’ in R.

First up, some definitions…

Panel, multilevel, longitudinal, and hierarchical data are, for our purposes, the same thing. You have observations of some lower-level object, $$t$$, nested within a higher-order entity, $$i$$.

So $$x_{it}$$ is the observation of $$x$$ for entity $$i$$ at time $$t$$.

REALLY REALLY REALLY IMPORTANT NOTE

In the psychology literature (and I sometimes do this myself), you will see the lower level represented as $$i$$ and the higher as $$j$$, as in the $$i^{th}$$ observation of the $$j^{th}$$ entity, which is sometimes written as $$x_{ij}$$ (the inverse order of the preceding slide). I’m going to use the $$i,t$$ notation because it’s closer to how economists (and most strategy people) present panel data, and the way I learned it.

The difference generally has to do with whether the lower order entity is a temporal observation (econ/strategy), or is an entity within another entity (OB/psych).

THE MATH IS THE SAME, but watch out for confusing your $$i's$$, $$j's$$, and $$t's$$, because order does matter!
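To make the subscripts concrete, here's a toy long-format data frame in R (the values are made up): each row is one $$i,t$$ pair, so $$x_{it}$$ is just the row matching a particular entity and time period.

```r
# A toy long-format panel: each row is one (i, t) pair
panel <- data.frame(
  i = c(1, 1, 2, 2),       # entity (e.g., firm)
  t = c(1, 2, 1, 2),       # time period
  x = c(5.0, 5.5, 9.0, 8.5)
)

# x_{it} for i = 2, t = 1 is just the matching row:
panel$x[panel$i == 2 & panel$t == 1]
## [1] 9
```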

Let's also bust a common myth in our field right now. You do NOT need special software to analyze panel data. Period dot.

You do, however, need to use different kinds of estimation techniques that account for the nested structure of the data.
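To demonstrate the point—this is my sketch, not the course data—plain `lm()` in base R can estimate a fixed-effects model just by adding entity dummies (the least squares dummy variable, or LSDV, approach). Here's an example using the built-in `ChickWeight` data, which is an ordinary panel: repeated weight measurements ($$t$$) nested within individual chicks ($$i$$).

```r
# ChickWeight ships with base R: weight measurements (t) nested within
# chicks (i) -- an ordinary panel, no special software required
data(ChickWeight)

# A fixed-effects ("within") specification is just lm() plus entity dummies
fe <- lm(weight ~ Time + factor(Chick), data = ChickWeight)
coef(fe)["Time"]   # within-chick growth rate, in grams per day
```

Dedicated packages like plm or lme4 do this more efficiently and conveniently, but the estimator itself is nothing exotic.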

Side note…

Time series data is not quite the same thing. It could be, in the sense that you might have multiple $$i's$$ in a data set. But you could also just have one $$i$$ and you are trying to model the $$t$$ observations for that one $$i$$. For example, think about analyzing the S&P 500 Index over time—there is one $$i$$, the index, but tens of thousands of $$t's$$, the daily value of the index.

In the hard sciences, they work a lot with time series data. In the social sciences, we deal mostly with panel data. The analytic techniques are generally comparable, but the language used can be a little different. We’re going to focus on panel data in this class, but we’ll do a quick time series example.

$$y_{it}=\alpha+\beta{x_{it}}+\mu_{i}+\epsilon_{it}$$

We will be working with this equation a lot.

• $$y$$ = Value of the $$t^{th}$$ observation of $$y$$ for the $$i^{th}$$ entity
• $$\alpha$$ = The value of $$y$$ when $$x$$ equals zero across all $$i, t's$$
• $$\beta$$ = The expected change in $$y$$ for a one-unit change in $$x$$ for an average $$i$$ across time $$t$$. Note that this interpretation gets more complicated, quickly.
• $$x$$ = Value of the $$t^{th}$$ observation of $$x$$ for the $$i^{th}$$ entity
• $$\mu$$ = The portion of the disturbance term unique to $$i$$ that is constant over time $$t$$.
• $$\epsilon$$ = The variance in the $$t^{th}$$ observation of $$y$$ for the $$i^{th}$$ entity that is not explained by the variance in the $$t^{th}$$ observation of $$x$$ for the $$i^{th}$$ entity or $$\mu_{i}$$.
$$y_{it}=\alpha+\beta{x_{it}}+\mu_{i}+\epsilon_{it}$$
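One way to internalize this equation is to simulate from it. A sketch with made-up parameter values ($$\alpha = 2$$, $$\beta = 0.5$$):

```r
set.seed(1)
n.i <- 100; n.t <- 5                # 100 entities observed over 5 periods
i   <- rep(1:n.i, each = n.t)       # entity index
alpha <- 2; beta <- 0.5

mu  <- rnorm(n.i)[i]                # entity-specific, constant over t
x   <- rnorm(n.i * n.t)
eps <- rnorm(n.i * n.t)             # idiosyncratic disturbance
y   <- alpha + beta * x + mu + eps

# Because x was drawn independently of mu here, even pooled OLS
# recovers beta:
coef(lm(y ~ x))["x"]                # close to the true beta of 0.5
```

When $$x$$ correlates with $$\mu_i$$—the usual case with real data—pooled OLS is biased, which is exactly where the warning below comes in.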

WARNING

Note that failure to understand the assumptions underlying this equation is, as Antonakis et al. (2010) rightly noted, the reason why the vast majority—I’d say greater than 80%—of panel studies in our literature are misspecified.

END WARNING

You, however, will know better right from day one.

Thank me later :)

Let's start with a time series example.

In a time series model, we’re generally interested in analyzing the pattern of $$t's$$ for a given $$i$$. It is one-dimensional data, similar to cross-sectional data, although in this case it’s one $$i$$ and many $$t's$$, as opposed to one $$t$$ for many $$i's$$.

Generally we assume that observations close together will correlate more than observations farther apart. Just as with any sample, the greater the observations of $$t$$, the more accurate our estimation of this temporal dependency.
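That temporal dependency is easy to see by simulation—a sketch using base R's `arima.sim()` to generate an autocorrelated (AR(1)) series:

```r
# Simulate an AR(1) series and check that nearby observations correlate
# more strongly than distant ones (arima.sim is in base R's stats package)
set.seed(123)
z <- arima.sim(model = list(ar = 0.8), n = 1000)

cor(z[-1], z[-1000])        # lag-1 correlation: close to 0.8
cor(z[-(1:10)], z[1:990])   # lag-10 correlation: much weaker (~ 0.8^10)
```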

While it’s interesting to fit time series parameters, ultimately the value of time series modeling is prediction/forecasting. For example, trying to forecast the magnitude of future earthquakes based on previous earthquakes.

Or trying to forecast the weather. Or trying to forecast tomorrow’s value of a stock price :)

Let's get some data. I’ve got the daily closing stock price for Illinois Tool Works from 3 Jan 2000 through 31 Dec 2009.

Where did I get it do you ask? Right here.

library(readr)
library(tidyverse)
# my.ds holds the raw import (the download link was on the original slide)
my.df <- my.ds
head(my.df, 3)
## # A tibble: 3 × 2
##        Date ClosingPrice
##       <chr>        <dbl>
## 1 03jan2000      31.9375
## 2 04jan2000      30.5000
## 3 05jan2000      30.8750

Take a look at the Date variable. It’s actually being stored as a string right now. That’s not going to work for us. Frequently when dealing with secondary data sources we need to manipulate the data to make it usable for analyses.

class(my.df$Date)
## [1] "character"

Fortunately, the lubridate package makes it easy to work with dates. We don’t have enough time to go through all of it, but I’d encourage you to take a look at this tutorial.

library(lubridate)
my.df$Date <- dmy(my.df$Date)
class(my.df$Date)
## [1] "Date"
head(my.df, 3)
## # A tibble: 3 × 2
##         Date ClosingPrice
##       <date>        <dbl>
## 1 2000-01-03      31.9375
## 2 2000-01-04      30.5000
## 3 2000-01-05      30.8750

Ok, the first thing we need to do is to tell R that we’ve got time series data. For this, we’re going to use the xts package.

library(xts)
my.df.ts <- xts(my.df$ClosingPrice, order.by = my.df$Date)
colnames(my.df.ts) <- "ClosingPrice"

When working with time series data, we generally focus on visualizations. So let's take a look at Illinois Tool Works' stock over the ten-year period.

plot.xts(my.df.ts, main = "Illinois Tool Works Daily Closing Stock Price")

Ok, now let's do some useful things with this. We’re going to need the forecast package for this.

This isn’t even time series 101, BTW, and is a very rudimentary analysis. Generally, you want to use ARIMA models for time series, which explicitly model the autoregressive component in the data.

This is for a different (semester-length) class, but let's just do an example.

library(forecast)
my.ts.2009 <- my.df.ts['2009']  # Subset our time series to just 2009
arima.model <- auto.arima(my.ts.2009)  # Use an auto-fitting ARIMA model
plot(forecast(arima.model),
main = "Illinois Tool Works 2009 Stock Price ARIMA Forecast")  # Make the plot

The dark grey is an 80% confidence level around the stock’s predicted value. The light grey is a 95% band. Notice that the bands get larger as time progresses.
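Those bands are stored on the forecast object itself, if you want the numbers rather than the plot. A self-contained sketch of the same idea on a simulated, stock-price-like series (the toy random walk is my stand-in for the ITW data):

```r
library(forecast)

set.seed(7)
toy <- ts(cumsum(rnorm(100)))           # a random walk, stock-price-like
fc  <- forecast(auto.arima(toy), h = 20)

fc$mean     # point forecasts, 20 steps ahead
fc$lower    # lower bounds (one column each for the 80% and 95% levels)
fc$upper    # upper bounds
```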

A big part of what we want to understand with time series models is auto-correlation—the extent to which an observation at time $$t$$ correlates with an observation at time $$t-1$$. Auto-correlation is at the heart of time series modeling, our ability to make reasonable forecasts, and why we need to use special estimators (auto-correlation violates the i.i.d. assumption).

acf(my.ts.2009, lag.max=100,
main = "Illinois Tool Works 2009 Stock Price Autocorrelation") 

As is common with stock prices, particularly for large, established firms with low volatility, daily stock prices are highly correlated. The dotted line is for the $$\alpha$$ = .05 level of significance. We have to go back more than 70 days to see no statistically significant auto-correlation.
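That dotted line is computable by hand: for white noise, the approximate two-sided 5% bound on the sample autocorrelation is ±1.96/√n. Assuming roughly 252 trading days in 2009 (the exact count may differ by a day or two):

```r
n <- 252                   # approximate number of trading days in a year
qnorm(0.975) / sqrt(n)     # the +/- bound acf() draws at alpha = .05
## [1] 0.1234661
```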

Again, we didn’t even come close to scratching the surface here, and time series data can be a real beast to model well.

If you think you are interested in this kind of approach, let's talk offline about getting you additional training.

Ok, so let's move on to full panel data.

We’re going to use a different dataset for this one. It’s a 20-year panel of publicly traded companies in the U.S. (1995-2014) with sales greater than $50MM annually (non-inflation adjusted).

Where did I get it do you ask? Right here.

First up, let's get the data.

my.panel.ds <- read_csv("http://a.web.umkc.edu/andersonbri/Panel.csv")
my.panel.df <- my.panel.ds

Open up the dataset and take a look…

View(my.panel.df)

Take a look at a couple of things:

• The panel is unbalanced—firms ($$i's$$) don’t necessarily have the same number of observations ($$t's$$)
• Often you will see missing values (NA) or 0.00 values, which may or may not indicate that the firm had a zero value on that variable for that year

Now take a look at this…

header.df <- subset(my.panel.df, select = c(gvkey, fyear, conm))
head(header.df, 5)
## # A tibble: 5 × 3
##    gvkey fyear                         conm
##    <chr> <int>                        <chr>
## 1 001062  2007 ASA GOLD AND PRECIOUS METALS
## 2 001062  2008 ASA GOLD AND PRECIOUS METALS
## 3 001062  2009 ASA GOLD AND PRECIOUS METALS
## 4 001177  1995                    AETNA INC
## 5 001177  1996                    AETNA INC

Something unique to Compustat data is the necessity to convert the gvkey variable from a string to numeric. In R, we do that with the as.numeric function.

my.panel.df$gvkey <- as.numeric(my.panel.df$gvkey)

You’re going to want to remember that if you work with Compustat data. Trust me :)

Critical to panel data analysis is that there must be a unique combination of $$i$$ and $$t$$ to identify each observation…

Valid: 1062 ($$i$$); 2007 ($$t$$) and 1062 ($$i$$); 2008 ($$t$$)
Invalid: 1062 ($$i$$); 2007 ($$t$$) and 1062 ($$i$$); 2007 ($$t$$)

The reason this is important is because with panel data, we are looking at two different effects of $$x$$ on $$y$$.
• The within effect, which is the average effect of $$x$$ on $$y$$ for a given (average) $$i$$ across time $$t$$
• The between effect, which is the average effect of $$x$$ on $$y$$ across $$i's$$ over time $$t$$

If we couldn’t uniquely identify each observation, there would be a perfectly collinear combination, and the estimator wouldn’t know which one was the “right” one to use to estimate each effect.

Think about it this way… We can decompose any $$x_{it}$$ into the following:

$$x_{it}=\gamma_i+\tau_{it}$$

For each $$x_{it}$$, there is going to be a between-component, $$\gamma$$, that never changes over time for each $$i$$ in the sample (the firm, for example). But there is also going to be a within-component, $$\tau$$, that can change for each $$\gamma$$ over time (the firm’s sales, for example).

We’re going to get into within and between effects a lot more next week, although understanding the within and the between is central to panel data estimators. For right now, let's just focus on setting up our model.

Panel data isn’t always secondary financial data. Panels could be…

• Repeated measures in an experiment
• Students nested within classrooms
• Employees nested within managers
• Top management teams nested within firms
• Firms nested within industries
• Counties nested within states

One thing though with secondary panel data is that, most often, the data violates the normality assumption, which can bias our parameter estimates. Let's do a QQ plot of the revt variable, which is the firm’s annual reported revenue in millions of USD.

qqnorm(my.panel.df$revt, main = "QQ Plot of Firm Revenue")
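The between/within decomposition is two lines of base R. A toy sketch (made-up numbers, not the Compustat panel):

```r
# Toy panel: two firms (i) observed over three years each
i <- c(1, 1, 1, 2, 2, 2)
x <- c(10, 12, 14, 20, 19, 21)

gamma <- ave(x, i)     # between-component: each firm's mean of x
tau   <- x - gamma     # within-component: deviation from the firm mean

gamma
## [1] 12 12 12 20 20 20
tau
## [1] -2  0  2  0 -1  1

# gamma + tau recovers x exactly
all.equal(gamma + tau, x)
## [1] TRUE
```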

For a reminder, this is what a QQ plot looks like for a normally distributed variable…

Often you will employ variable transformations with secondary financial data. The most common is taking the natural log of the variable…

my.panel.df$log.revt <- log(my.panel.df$revt)
qqnorm(my.panel.df$log.revt, main = "QQ Plot of Natural Log Firm Revenue")
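One caveat (my note, not on the original slides): revenue values of zero or below break the natural log, so check for them before transforming.

```r
# log() of zero is -Inf in R; log() of a negative number warns and
# returns NaN -- either will wreck downstream estimation
log(0)
## [1] -Inf

# One common workaround is log1p(), i.e., log(1 + x), which maps 0 to 0:
log1p(0)
## [1] 0

# Or simply count the offending rows first, e.g.:
# sum(my.panel.df$revt <= 0, na.rm = TRUE)
```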