Visualizing panel data

· by Brian Anderson · Read in about 13 min · (2620 words) ·

I ended up creating some of these visualizations for a recent paper submission, and thought they would be helpful. As we move from an era of researcher (who used some statistics) to data scientist, visualizing data and data structures is a critical skill set.

Many entrepreneurship and strategy scholars use panel/multilevel data, but visualizing that data is challenging. The problem is that panel/multilevel data often nests repeated measures of a variable (\(j\)) nested within some entity (\(i\)). As opposed to a simple cross-sectional relationship, when we try to visualize panel data in, say, a scatterplot, we get a mass of data points that ignores that these variables are nested within an entity. This likely gives a biased (not to mention messy) picture of the data.

To illustrate, I pulled a sample of publicly traded firms from the Compustat database. Compustat is a subscription service, so I can’t post the data online. You’ll be able though to follow-along after I load the data. Here though is the code I used…

library(readr)
vis.ds <- read_csv("filename.csv", col_types = cols(datadate = col_date(format = "%Y%m%d"),
                                                    fyear = col_integer(), gvkey = col_integer()))

Tidy it up

Now lets Tidy the data up a bit, and we’re going to eliminate outliers with greater than 50% of sales spent on R&D activity, or any negative R&D Intensity values. We’re also going to calculate a profitability metric, Return on Sales, and drop those firms with +/- 50% ROS. These are purely arbitrary decisions, but helps make the data a little easier to manage.

library(tidyverse)
library(lubridate)
vis.df <- vis.ds %>%
  rename(TickerSymbol = tic,
         NetIncome = ni,
         Revenue = revt,
         RND = xrd,
         SICCode = sic) %>%
  mutate(Year = year(datadate),
         ROS = 100 * (NetIncome/Revenue),
         RNDIntensity = 100 * (RND / Revenue)) %>%
  select(TickerSymbol, Year, SICCode, RNDIntensity, ROS) %>%
  filter(RNDIntensity <= 50 & RNDIntensity >= 0) %>%
  filter(ROS >= -50 & ROS <= 50)
head(vis.df, 10)
## # A tibble: 10 x 5
##    TickerSymbol  Year SICCode RNDIntensity   ROS
##    <chr>        <dbl>   <dbl>        <dbl> <dbl>
##  1 AORGB         1997    3576         6.58  8.70
##  2 AORGB         1998    3576         7.74 -1.37
##  3 AORGB         1999    3576         8.46  4.97
##  4 AORGB         2000    3576        10.1   5.45
##  5 AORGB         2001    3576        13.2  -6.75
##  6 AORGB         2002    3576        11.5   3.96
##  7 AORGB         2003    3576        14.5   2.30
##  8 AORGB         2004    3576        17.4   5.87
##  9 ALOT          1998    3577         6.45  2.38
## 10 ALOT          1999    3577         7.07  1.19

Now lets add a factor variable around the SIC code, with labels…

library(forcats)
vis.df <- vis.df %>%
  mutate(SICCode = factor(SICCode)) %>%
  mutate(SICCode = fct_recode(SICCode, 
            "3570 - Computer & Office Eqpmt" = "3570",
            "3571 - Electronic Computers" = "3571",
            "3572 - Computer Storage Devices" = "3572",
            "3575 - Computer Terminals" = "3575",
            "3576 - Computer Communications Eqpmt" = "3576",
            "3577 - Computer Peripheral Eqpmt" = "3577",
            "3578 - Calculating & Accounting Machines" = "3578",
            "3579 - Office Machines" = "3579"))

head(vis.df, 10)
## # A tibble: 10 x 5
##    TickerSymbol  Year SICCode                            RNDIntensity   ROS
##    <chr>        <dbl> <fct>                                     <dbl> <dbl>
##  1 AORGB         1997 3576 - Computer Communications Eq…         6.58  8.70
##  2 AORGB         1998 3576 - Computer Communications Eq…         7.74 -1.37
##  3 AORGB         1999 3576 - Computer Communications Eq…         8.46  4.97
##  4 AORGB         2000 3576 - Computer Communications Eq…        10.1   5.45
##  5 AORGB         2001 3576 - Computer Communications Eq…        13.2  -6.75
##  6 AORGB         2002 3576 - Computer Communications Eq…        11.5   3.96
##  7 AORGB         2003 3576 - Computer Communications Eq…        14.5   2.30
##  8 AORGB         2004 3576 - Computer Communications Eq…        17.4   5.87
##  9 ALOT          1998 3577 - Computer Peripheral Eqpmt           6.45  2.38
## 10 ALOT          1999 3577 - Computer Peripheral Eqpmt           7.07  1.19

What’s in the data?

I selected the revenue, net income, and research and development expenses for all publicly traded firms in the IT manufacturing industry (SIC 35XX) from 1997 - 2016.

We’ve got…

n_distinct(vis.df$TickerSymbol)
## [1] 295

…different ticker symbols across…

n_distinct(vis.df$Year)
## [1] 20

…years in…

n_distinct(vis.df$SICCode)
## [1] 8

…different industry sectors (4 digit SIC code).


Classic visualizations

This is simply a scatter plot with a line of best fit of the relationship between R&D Intensity and Return on Sales:

library(ggplot2)
scatter.plot <- ggplot(vis.df, aes(y = ROS, x = RNDIntensity)) + 
  geom_point() + 
  geom_smooth(method = "lm", se =TRUE) + 
  labs(title = "Relationship Between R&D Intensity and Firm Profitability", 
       subtitle = "IT Manufacturing Businesses: 1997-2016", 
       y = "Return on Sales (%)", 
       x = "Research & Development Intensity (R&D / Sales [%])")
scatter.plot

Ok, so that’s a lot of data points. The line of best fit suggests that firms that spend more on R&D as a percentage of sales are also less profitable, which makes some sense, because R&D is booked as an expense.

Besides being a bit ugly, the problem with this plot is that it ignores differences over time and by industry.


New visualizations

Lets take a closer look to see how the average level of R&D Intensity has changed over time for each of the industries in our sample. The idea is to make use of an aggregation technique to better understand the trend(s) in the data.

# Start by aggregrating the data by SIC Code and by Year across the average R&D Intensity
summary.df <- vis.df %>%
  group_by(SICCode, Year) %>%
  summarize(mean.RND = mean(RNDIntensity, na.rm = TRUE),
            mean.ROS = mean(ROS, na.rm = TRUE))

# Now we pass the aggegrated data to a time series plot
my.ts.plot <- ggplot(summary.df, aes(x = Year)) +
  geom_line(aes(y = mean.RND, colour = SICCode)) + 
  scale_x_continuous(breaks = seq(min(summary.df$Year), max(summary.df$Year), 1)) +
  theme(axis.text.x = element_text(angle = 90, vjust=0.5)) + 
  scale_color_discrete(name = "IT Manufacturing Sectors by SIC Code") + 
  labs(title = "Changes in R&D Intensity by IT Manufacturing Sectors (1997-2016)",
       y = "Research & Development Intensity (R&D / Sales [%])", 
       x = "")
my.ts.plot

There would appear to be substantial variance in R&D Intensity across industries and across time, which is likely distorting the simple scatter plot. We could take the next step and create a visualization of the scatterplot between ROS and R&D Intensity by industry:

facet.plot <- ggplot(vis.df, aes(y = ROS, x = RNDIntensity)) + 
  geom_point() + 
  geom_smooth(method = "lm", se =TRUE) +   
  facet_wrap(~SICCode) + 
  theme(strip.text.x = element_text(size = 8)) + 
  labs(title = "Relationship Between R&D Intensity and Firm Profitability",
       subtitle = "By IT Manufacturing Sectors (1997-2016)", 
       y = "Return on Sales (%)", 
       x = "Research & Development Intensity (R&D / Sales [%])")
facet.plot

Simply based on the above plots, it would seem that time and industry sector may be meaningful moderator on the relationship between R&D Intensity and firm profitability, or merits inclusion as a covariate in a confirmatory test of the hypothesis. Regardless, it gives us a better picture of the nested structure of the data.

Just as an illustration, we can show that both time and industry effects play a role in estimating the relationship between the R&D Intensity and Return on Sales.

We start by estimating a pooled model…

pooled.model <- lm(ROS ~ RNDIntensity, data = vis.df)
summary(pooled.model)
## 
## Call:
## lm(formula = ROS ~ RNDIntensity, data = vis.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.999  -7.390   1.955   8.837  58.097 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.27867    0.56525   7.569 5.51e-14 ***
## RNDIntensity -0.57365    0.03618 -15.857  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.15 on 2164 degrees of freedom
## Multiple R-squared:  0.1041, Adjusted R-squared:  0.1037 
## F-statistic: 251.4 on 1 and 2164 DF,  p-value: < 2.2e-16

Here we observe a -0.57 change in ROS for each percentage increase in R&D Intensity. That matches well with our pooled scatterplot earlier.

Lets take a look at a model that includes year and industry as fixed effects:

# We're going to start by converting Year to a factor variable
factor.df <- vis.df %>%
  mutate(Year = factor(Year))

fixed.model <- lm(ROS ~ RNDIntensity + SICCode + Year, data = factor.df)
summary(fixed.model)
## 
## Call:
## lm(formula = ROS ~ RNDIntensity + SICCode + Year, data = factor.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.872  -7.138   2.095   8.923  60.465 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                       9.92656    2.15212
## RNDIntensity                                     -0.69610    0.04260
## SICCode3571 - Electronic Computers               -5.75221    1.81885
## SICCode3572 - Computer Storage Devices           -5.57420    1.67979
## SICCode3575 - Computer Terminals                -12.78515    2.60629
## SICCode3576 - Computer Communications Eqpmt      -1.42589    1.65152
## SICCode3577 - Computer Peripheral Eqpmt          -5.60919    1.61643
## SICCode3578 - Calculating & Accounting Machines  -7.26595    1.91462
## SICCode3579 - Office Machines                    -7.48284    2.42651
## Year1998                                          0.75653    1.83594
## Year1999                                          0.72072    1.86186
## Year2000                                          1.84997    1.92552
## Year2001                                         -4.47751    1.98679
## Year2002                                         -2.32280    1.99372
## Year2003                                          0.71261    2.01578
## Year2004                                          1.73408    2.02772
## Year2005                                          1.82933    2.04164
## Year2006                                         -0.28454    2.07874
## Year2007                                          0.54803    2.09362
## Year2008                                         -0.63454    2.15521
## Year2009                                         -1.74292    2.16344
## Year2010                                          2.12560    2.18157
## Year2011                                          1.89130    2.18027
## Year2012                                          0.05232    2.20918
## Year2013                                          1.20335    2.20430
## Year2014                                          0.94409    2.26639
## Year2015                                          1.50004    2.36051
## Year2016                                          1.53667    2.39429
##                                                 t value Pr(>|t|)    
## (Intercept)                                       4.612 4.21e-06 ***
## RNDIntensity                                    -16.340  < 2e-16 ***
## SICCode3571 - Electronic Computers               -3.163 0.001586 ** 
## SICCode3572 - Computer Storage Devices           -3.318 0.000921 ***
## SICCode3575 - Computer Terminals                 -4.905 1.00e-06 ***
## SICCode3576 - Computer Communications Eqpmt      -0.863 0.388025    
## SICCode3577 - Computer Peripheral Eqpmt          -3.470 0.000531 ***
## SICCode3578 - Calculating & Accounting Machines  -3.795 0.000152 ***
## SICCode3579 - Office Machines                    -3.084 0.002070 ** 
## Year1998                                          0.412 0.680330    
## Year1999                                          0.387 0.698722    
## Year2000                                          0.961 0.336779    
## Year2001                                         -2.254 0.024320 *  
## Year2002                                         -1.165 0.244126    
## Year2003                                          0.354 0.723736    
## Year2004                                          0.855 0.392543    
## Year2005                                          0.896 0.370348    
## Year2006                                         -0.137 0.891138    
## Year2007                                          0.262 0.793531    
## Year2008                                         -0.294 0.768466    
## Year2009                                         -0.806 0.420548    
## Year2010                                          0.974 0.329995    
## Year2011                                          0.867 0.385788    
## Year2012                                          0.024 0.981106    
## Year2013                                          0.546 0.585186    
## Year2014                                          0.417 0.677042    
## Year2015                                          0.635 0.525189    
## Year2016                                          0.642 0.521068    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.95 on 2138 degrees of freedom
## Multiple R-squared:  0.1387, Adjusted R-squared:  0.1278 
## F-statistic: 12.75 on 27 and 2138 DF,  p-value: < 2.2e-16

That gives a much different picture. We could also explore an interaction effect between R&D Intensity and industry sector, controlling for year. By the way, the default output will be messy. We could tidy it up a bit, but I think the default output helps to make the point about the advantage of the visualization.

interaction.model <- lm(ROS ~ RNDIntensity*SICCode + Year, data = factor.df)
summary(interaction.model)
## 
## Call:
## lm(formula = ROS ~ RNDIntensity * SICCode + Year, data = factor.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.582  -6.979   2.126   8.963  62.324 
## 
## Coefficients:
##                                                              Estimate
## (Intercept)                                                   3.77965
## RNDIntensity                                                 -0.06192
## SICCode3571 - Electronic Computers                           -3.79398
## SICCode3572 - Computer Storage Devices                       -0.88581
## SICCode3575 - Computer Terminals                             -4.02063
## SICCode3576 - Computer Communications Eqpmt                   8.62285
## SICCode3577 - Computer Peripheral Eqpmt                       0.77758
## SICCode3578 - Calculating & Accounting Machines              -0.30853
## SICCode3579 - Office Machines                                 0.67860
## Year1998                                                      0.79826
## Year1999                                                      0.72281
## Year2000                                                      1.77274
## Year2001                                                     -4.46746
## Year2002                                                     -2.27034
## Year2003                                                      0.86501
## Year2004                                                      1.93811
## Year2005                                                      1.99015
## Year2006                                                     -0.15419
## Year2007                                                      0.70491
## Year2008                                                     -0.65150
## Year2009                                                     -1.53766
## Year2010                                                      2.26664
## Year2011                                                      2.03031
## Year2012                                                      0.25336
## Year2013                                                      1.38646
## Year2014                                                      1.09518
## Year2015                                                      1.60648
## Year2016                                                      1.58249
## RNDIntensity:SICCode3571 - Electronic Computers              -0.29275
## RNDIntensity:SICCode3572 - Computer Storage Devices          -0.49852
## RNDIntensity:SICCode3575 - Computer Terminals                -1.23562
## RNDIntensity:SICCode3576 - Computer Communications Eqpmt     -0.83872
## RNDIntensity:SICCode3577 - Computer Peripheral Eqpmt         -0.66337
## RNDIntensity:SICCode3578 - Calculating & Accounting Machines -0.80205
## RNDIntensity:SICCode3579 - Office Machines                   -0.99565
##                                                              Std. Error
## (Intercept)                                                     2.68719
## RNDIntensity                                                    0.17428
## SICCode3571 - Electronic Computers                              2.88713
## SICCode3572 - Computer Storage Devices                          2.61393
## SICCode3575 - Computer Terminals                                4.11081
## SICCode3576 - Computer Communications Eqpmt                     2.69918
## SICCode3577 - Computer Peripheral Eqpmt                         2.47158
## SICCode3578 - Calculating & Accounting Machines                 2.75556
## SICCode3579 - Office Machines                                   3.42065
## Year1998                                                        1.82580
## Year1999                                                        1.85097
## Year2000                                                        1.91444
## Year2001                                                        1.97520
## Year2002                                                        1.98232
## Year2003                                                        2.00502
## Year2004                                                        2.01709
## Year2005                                                        2.03053
## Year2006                                                        2.06750
## Year2007                                                        2.08293
## Year2008                                                        2.14325
## Year2009                                                        2.15601
## Year2010                                                        2.17038
## Year2011                                                        2.16899
## Year2012                                                        2.19944
## Year2013                                                        2.19389
## Year2014                                                        2.25444
## Year2015                                                        2.34757
## Year2016                                                        2.38138
## RNDIntensity:SICCode3571 - Electronic Computers                 0.21520
## RNDIntensity:SICCode3572 - Computer Storage Devices             0.20691
## RNDIntensity:SICCode3575 - Computer Terminals                   0.63832
## RNDIntensity:SICCode3576 - Computer Communications Eqpmt        0.18863
## RNDIntensity:SICCode3577 - Computer Peripheral Eqpmt            0.19165
## RNDIntensity:SICCode3578 - Calculating & Accounting Machines    0.27028
## RNDIntensity:SICCode3579 - Office Machines                      0.35239
##                                                              t value
## (Intercept)                                                    1.407
## RNDIntensity                                                  -0.355
## SICCode3571 - Electronic Computers                            -1.314
## SICCode3572 - Computer Storage Devices                        -0.339
## SICCode3575 - Computer Terminals                              -0.978
## SICCode3576 - Computer Communications Eqpmt                    3.195
## SICCode3577 - Computer Peripheral Eqpmt                        0.315
## SICCode3578 - Calculating & Accounting Machines               -0.112
## SICCode3579 - Office Machines                                  0.198
## Year1998                                                       0.437
## Year1999                                                       0.391
## Year2000                                                       0.926
## Year2001                                                      -2.262
## Year2002                                                      -1.145
## Year2003                                                       0.431
## Year2004                                                       0.961
## Year2005                                                       0.980
## Year2006                                                      -0.075
## Year2007                                                       0.338
## Year2008                                                      -0.304
## Year2009                                                      -0.713
## Year2010                                                       1.044
## Year2011                                                       0.936
## Year2012                                                       0.115
## Year2013                                                       0.632
## Year2014                                                       0.486
## Year2015                                                       0.684
## Year2016                                                       0.665
## RNDIntensity:SICCode3571 - Electronic Computers               -1.360
## RNDIntensity:SICCode3572 - Computer Storage Devices           -2.409
## RNDIntensity:SICCode3575 - Computer Terminals                 -1.936
## RNDIntensity:SICCode3576 - Computer Communications Eqpmt      -4.446
## RNDIntensity:SICCode3577 - Computer Peripheral Eqpmt          -3.461
## RNDIntensity:SICCode3578 - Calculating & Accounting Machines  -2.967
## RNDIntensity:SICCode3579 - Office Machines                    -2.825
##                                                              Pr(>|t|)    
## (Intercept)                                                  0.159708    
## RNDIntensity                                                 0.722391    
## SICCode3571 - Electronic Computers                           0.188953    
## SICCode3572 - Computer Storage Devices                       0.734732    
## SICCode3575 - Computer Terminals                             0.328155    
## SICCode3576 - Computer Communications Eqpmt                  0.001421 ** 
## SICCode3577 - Computer Peripheral Eqpmt                      0.753091    
## SICCode3578 - Calculating & Accounting Machines              0.910862    
## SICCode3579 - Office Machines                                0.842764    
## Year1998                                                     0.662004    
## Year1999                                                     0.696203    
## Year2000                                                     0.354561    
## Year2001                                                     0.023812 *  
## Year2002                                                     0.252216    
## Year2003                                                     0.666206    
## Year2004                                                     0.336738    
## Year2005                                                     0.327141    
## Year2006                                                     0.940557    
## Year2007                                                     0.735079    
## Year2008                                                     0.761175    
## Year2009                                                     0.475803    
## Year2010                                                     0.296441    
## Year2011                                                     0.349346    
## Year2012                                                     0.908302    
## Year2013                                                     0.527477    
## Year2014                                                     0.627166    
## Year2015                                                     0.493849    
## Year2016                                                     0.506424    
## RNDIntensity:SICCode3571 - Electronic Computers              0.173861    
## RNDIntensity:SICCode3572 - Computer Storage Devices          0.016064 *  
## RNDIntensity:SICCode3575 - Computer Terminals                0.053030 .  
## RNDIntensity:SICCode3576 - Computer Communications Eqpmt     9.18e-06 ***
## RNDIntensity:SICCode3577 - Computer Peripheral Eqpmt         0.000548 ***
## RNDIntensity:SICCode3578 - Calculating & Accounting Machines 0.003036 ** 
## RNDIntensity:SICCode3579 - Office Machines                   0.004766 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.86 on 2131 degrees of freedom
## Multiple R-squared:  0.1519, Adjusted R-squared:  0.1384 
## F-statistic: 11.23 on 34 and 2131 DF,  p-value: < 2.2e-16

Hmmm…certainly looks like there could be something there, but lets face it, that model isn’t all that easy to make sense of. At least it’s worth looking into more, and something the visualization helped us see quickly and easily. That’s the power of the picture.


Visualizations should do two things…

  1. Inform the researcher about the data, relationships in the data, and help to guide analytical choices.
  2. Help the reader make sense of the analyses and the data

Pictures are, truly, worth a thousand words. I know I’m trying to find ways to use them more in my own research, and I hope other entrepreneurship researchers will do the same.