Data wrangling is a critically important skill for entrepreneurship (and management) researchers.
Data wrangling takes data that is messy, unstructured, coded strangely, and otherwise not very usable and turns it into data that you can use to run analyses, build models, and construct visualizations.
Importing the data
One of the great developments of the past decade is the public availability of millions of datasets. It’s easy today to simply access datasets over the Internet and dynamically construct tidy data using a variety of software packages (my preference is R). This also has important implications for data and code sharing, which increases the replicability of research. Stick to flat .csv files, share them online (OSF is my favorite, or GitHub), and let other scholars take a look.
For this post I’m going to use data on major safety and security events among U.S. public transit agencies from 2014-2016. As a side note, I really enjoy data.gov and related sites that make it easy to access unique datasets for teaching!
Let's start by loading our data…
library(tidyverse)

# This is just a shortcut I use when accessing data online...I use a .ds object name
# to refer to the 'original data', and from there create different .df objects for
# analyses. Importantly though, both objects are dataframes, they just have different
# naming conventions that I use.
safety.ds <- read_csv("https://data.transportation.gov/api/views/k8v5-iehp/rows.csv?accessType=DOWNLOAD")
The data is multilevel (multiple observations by transit agency over multiple years), but I want to collapse/aggregate the data so that I have one observation of a transit agency per row, with the 2014-2016 observations collapsed for each transit agency. For analysis purposes, it likely makes more sense to keep the multilevel structure, but for this post, I’m going to focus on summarizing/aggregating.
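Before collapsing, it can help to confirm the panel structure. A quick sketch (assuming `safety.ds` has been read in as above):

```r
# Count how many incident records fall in each year of the panel.
safety.ds %>%
  count(Year)
```

If the data really are 2014-2016 incident records, we'd expect three rows, one per year.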
One of the great things about R and constructing dataframes dynamically is that for big datasets with potentially hundreds (or thousands) of columns (variables), we can easily subset the data and focus only on the variables we want. In looking at the data dictionary, I want to focus on a few key variables:
- Property Damage
- Total Injuries
- Total Fatalities
- Number of Transit Vehicles Involved
To identify the multilevel structure, we use…
- 5 Digit NTD ID (numerical ID for each agency)
- Year (2014-2016)
Let's start by going back into our original safety.ds dataframe and selecting only the variables we need. I am a big believer in using the tidyverse, and the functions listed here are part of the dplyr package.
# Note that the quotation marks are necessary because some of the column (variable)
# names have spaces. We'll take care of that in a minute. I'm going to create a
# new safety.df object from the master safety.ds object. Technically not necessary,
# but it is how I like to keep track of my analyses. It also makes it easy to keep
# one 'master' version in memory, and create as many other 'working' data
# frames from that 'master' version as I might need.
safety.df <- safety.ds %>%
  select("5 Digit NTD ID", Year, Agency, "Property Damage", "Total Injuries",
         "Total Fatalities", "Number Of Transit Vehicles Involved")

# Next we're going to remove the spaces in the column names, which makes it easier
# to refer to the variables
names(safety.df) <- gsub(" ", "", names(safety.df))

# Now let's take a look at our dataframe
head(safety.df, 10)
## # A tibble: 10 x 7
##    `5DigitNTDID`  Year Agency PropertyDamage TotalInjuries TotalFatalities
##            <int> <int> <chr>  <chr>                  <int>           <int>
##  1         40078  2014 Cobb … $1500.00                   4               0
##  2         40078  2014 Cobb … $5000.00                   1               0
##  3         40078  2014 Cobb … $1500.00                   3               0
##  4         40078  2014 Cobb … $2500.00                   2               0
##  5         40078  2014 Cobb … $2500.00                   6               0
##  6         30034  2015 Maryl… $0.00                      1               0
##  7         60008  2015 Metro… $25000.00                 NA              NA
##  8         40078  2014 Cobb … $5000.00                   1               0
##  9         50066  2014 Chica… $0.00                      1               0
## 10         50066  2014 Chica… $0.00                      1               0
## # ... with 1 more variable: NumberOfTransitVehiclesInvolved <int>
Let's take a closer look at how R is storing our data with the str function…
str(safety.df, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame': 23583 obs. of 7 variables:
##  $ 5DigitNTDID                    : int 40078 40078 40078 40078 40078 30034 60008 40078 50066 50066 ...
##  $ Year                           : int 2014 2014 2014 2014 2014 2015 2015 2014 2014 2014 ...
##  $ Agency                         : chr "Cobb County Department of Transportation" "Cobb County Department of Transportation" "Cobb County Department of Transportation" "Cobb County Department of Transportation" ...
##  $ PropertyDamage                 : chr "$1500.00" "$5000.00" "$1500.00" "$2500.00" ...
##  $ TotalInjuries                  : int 4 1 3 2 6 1 NA 1 1 1 ...
##  $ TotalFatalities                : int 0 0 0 0 0 0 NA 0 0 0 ...
##  $ NumberOfTransitVehiclesInvolved: int 1 1 1 1 1 NA NA NA NA NA ...
We have our first problem. R is storing the PropertyDamage variable as a string (character) object, and we need it to be numeric for our analyses. R stored it as a character because in the original data, the amount of property damage includes a $ before the number, which we need to remove:
## chr [1:23583] "$1500.00" "$5000.00" "$1500.00" "$2500.00" "$2500.00" ...
So let's take care of that…
# We've got two steps here---the first step is to remove the `$`, and in the second
# step we need to change the `PropertyDamage` column to numeric
safety.df <- safety.df %>%
  mutate(PropertyDamage = as.numeric(gsub("\\$", "", PropertyDamage)))

# Now let's check out our column type...
str(safety.df$PropertyDamage)
## num [1:23583] 1500 5000 1500 2500 2500 0 25000 5000 0 0 ...
Much better :)
Renaming columns (variables)
One small note about R—column names generally cannot start with a number, so the 5DigitNTDID column violates a tidy naming convention. So let's just rename it quickly…
safety.df <- safety.df %>% rename(AgencyID = "5DigitNTDID")
That said, while the number convention is an R consideration, it’s always best to use column (variable) names understandable by a real person. It helps you as the researcher, it helps coauthors keep track of what you (and they) are doing, and makes it easy to share with others. You might know what the
PD14TAXYT variable means, but calling it
PropertyDamage instead makes everyone’s life easier :).
Before we aggregate our data, we need to deal with a missing data problem…
safety.df %>%
  filter(AgencyID == 1, Year == 2014) %>%
  select(PropertyDamage, TotalInjuries, TotalFatalities)
## # A tibble: 59 x 3
##    PropertyDamage TotalInjuries TotalFatalities
##             <dbl>         <int>           <int>
##  1             0.             1               0
##  2          3783.             1               0
##  3          1675.            NA              NA
##  4          3533.             1               0
##  5             0.             1               0
##  6          9989.             1               0
##  7           946.            NA              NA
##  8             0.             1               0
##  9          1920.             1               0
## 10         12131.             1               0
## # ... with 49 more rows
Here we have the records for the King County Department of Transportation (AgencyID = 1) for 2014, using the filter function. We see instances of a safety incident that caused property damage, but no injuries or fatalities. This happens quite a bit in our data. What to do about it is a judgment call.
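Just how often does this happen? A quick count (a sketch, run against the pre-aggregation safety.df):

```r
# Tally incident records where both injury and fatality fields are missing.
safety.df %>%
  summarise(missing_both = sum(is.na(TotalInjuries) & is.na(TotalFatalities)),
            total_records = n())
```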
For this data, I’m going to show one approach. Ultimately we’re going to aggregate this data, but the missing values would drop a number of observations when we aggregate, and we want to preserve these observations.
One approach would be to set TotalInjuries and TotalFatalities equal to 0 if the observation is currently NA. Our assumption is that the NA for injuries and fatalities means that nobody was hurt in the incident. This is where subject matter knowledge is very, very helpful, because our assumption could be completely wrong.
So along with our assumption and any change in the data, I'm also going to create a new variable to flag whether the safety incident resulted in property damage only. This way I create a marker variable to indicate when I changed a value, which is a good practice. I could use this variable as a control variable, or might not need it at all. But I have it, it's easy to create, and it keeps track of what I did. This is a reason why using tools like markdown (also available for Stata!) is a much better approach to applied data analysis than point-and-click statistics with context menus.
# The dplyr `if_else` function is INSANELY handy! I'm going to set the value of
# a new variable, 'PropertyOnly', equal to 1 if both TotalInjuries and
# TotalFatalities are NA (missing).
safety.df <- safety.df %>%
  mutate(PropertyOnly = if_else(is.na(TotalInjuries) & is.na(TotalFatalities), 1, 0))

safety.df %>%
  filter(AgencyID == 1, Year == 2014) %>%
  select(PropertyDamage, TotalInjuries, TotalFatalities, PropertyOnly)
## # A tibble: 59 x 4
##    PropertyDamage TotalInjuries TotalFatalities PropertyOnly
##             <dbl>         <int>           <int>        <dbl>
##  1             0.             1               0           0.
##  2          3783.             1               0           0.
##  3          1675.            NA              NA           1.
##  4          3533.             1               0           0.
##  5             0.             1               0           0.
##  6          9989.             1               0           0.
##  7           946.            NA              NA           1.
##  8             0.             1               0           0.
##  9          1920.             1               0           0.
## 10         12131.             1               0           0.
## # ... with 49 more rows
Now, we're going to replace our NA values with zeros. Note that the NumberOfTransitVehiclesInvolved variable also records an NA when there were no transit vehicles involved. We could use similar logic to deal with this problem as with the property-only approach, but for simplicity I'll just set those values equal to 0 as well.
# Here we use the `replace_na` function to set each of our selected columns equal to 0
# if the function sees an NA value for that observation.
safety.df <- safety.df %>%
  replace_na(list(TotalInjuries = 0, TotalFatalities = 0,
                  NumberOfTransitVehiclesInvolved = 0))
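As a side note, dplyr's coalesce() (which returns the first non-missing value) is an equivalent way to write the same replacement—just a sketch of the alternative, not meant to be run in addition to replace_na above:

```r
# coalesce(x, 0L) keeps x where present and substitutes 0 where x is NA.
# (Alternative to replace_na above---run one or the other, not both.)
safety.df <- safety.df %>%
  mutate(TotalInjuries = coalesce(TotalInjuries, 0L),
         TotalFatalities = coalesce(TotalFatalities, 0L),
         NumberOfTransitVehiclesInvolved = coalesce(NumberOfTransitVehiclesInvolved, 0L))
```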
Now let's take another look at King County…
safety.df %>%
  filter(AgencyID == 1, Year == 2014) %>%
  select(PropertyDamage, TotalInjuries, TotalFatalities, PropertyOnly)
## # A tibble: 59 x 4
##    PropertyDamage TotalInjuries TotalFatalities PropertyOnly
##             <dbl>         <dbl>           <dbl>        <dbl>
##  1             0.            1.              0.           0.
##  2          3783.            1.              0.           0.
##  3          1675.            0.              0.           1.
##  4          3533.            1.              0.           0.
##  5             0.            1.              0.           0.
##  6          9989.            1.              0.           0.
##  7           946.            0.              0.           1.
##  8             0.            1.              0.           0.
##  9          1920.            1.              0.           0.
## 10         12131.            1.              0.           0.
## # ... with 49 more rows
Aggregating and summarizing data
Now let's do some data aggregation. I'm going to make use of dplyr again, and its group_by and summarise functions. I'm planning on collapsing the data, summing the observations by agency over the three-year period. But I also might want to keep track of how many incidents each agency reported over the period. That might be an important control variable down the road, so I'm going to create a new TotalIncidents variable.
# I start by grouping the data by AgencyID, so that each created column
# reflects all of the safety incidents for that particular Agency.
safety.df <- safety.df %>%
  group_by(AgencyID) %>%
  summarise(PropertyDamage = sum(PropertyDamage),
            TotalInjuries = sum(TotalInjuries),
            TotalFatalities = sum(TotalFatalities),
            TotalVehicles = sum(NumberOfTransitVehiclesInvolved),
            TotalPropertyOnly = sum(PropertyOnly),
            TotalIncidents = n())  # The `n()` shortcut counts the obs. by agency
Now let's see what our wrangled, tidy, and aggregated data look like. For this I'm going to use a function called skim (from the skimr package), which gives us a nice summary display of our data…

skim(safety.df)
## Skim summary statistics
##  n obs: 425
##  n variables: 7
##
## Variable type: integer
##        variable missing complete   n     mean       sd p0   p25   p50
##        AgencyID       0      425 425 47270.84 26895.57  1 30026 40192
##  TotalIncidents       0      425 425    55.49   176.44  1     2     7
##    p75  p100     hist
##  60111 99423 ▅▃▃▆▇▁▁▆
##     25  1999 ▇▁▁▁▁▁▁▁
##
## Variable type: numeric
##           variable missing complete   n      mean       sd p0   p25   p50
##     PropertyDamage       4      421 425 392014.72 858671.8  0 28822 96500
##    TotalFatalities       0      425 425      1.75    10.74  0     0     0
##      TotalInjuries       0      425 425     74.98    257.9  0     3     8
##  TotalPropertyOnly       0      425 425      11.7     34.6  0     0     2
##      TotalVehicles       0      425 425     44.17   125.64  0     2     7
##    p75  p100     hist
##  3e+05 7e+06 ▇▁▁▁▁▁▁▁
##      1   207 ▇▁▁▁▁▁▁▁
##     31  2900 ▇▁▁▁▁▁▁▁
##      6   337 ▇▁▁▁▁▁▁▁
##     23  1230 ▇▁▁▁▁▁▁▁
A cool thing about skim is that it gives you columns for missing, complete, and total number of observations per variable. There are four missing values for the PropertyDamage variable. But we can also see from the p0 column that the observed range of the PropertyDamage variable starts at zero, and we can check that with the summary function:

summary(safety.df$PropertyDamage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##       0   28822   96500  392015  296776 7045131       4
So we've got some options, including just dropping these observations from the analysis, but let's take a quick look and see if there is some type of pattern to them. For that, I'll use the filter function with is.na:
safety.df %>% filter(is.na(PropertyDamage))
## # A tibble: 4 x 7
##   AgencyID PropertyDamage TotalInjuries TotalFatalities TotalVehicles
##      <int>          <dbl>         <dbl>           <dbl>         <dbl>
## 1    50052             NA            7.              0.           14.
## 2    50066             NA         2062.             39.          849.
## 3    90015             NA          480.              7.          338.
## 4    90090             NA            4.              0.            4.
## # ... with 2 more variables: TotalPropertyOnly <dbl>, TotalIncidents <int>
Well, it's pretty clear that these agencies did have accidents, including property-only accidents, over the period. This is a time for a subject matter expert, which I am not. So let's just exclude those observations from our analysis for right now…
safety.df <- safety.df %>% drop_na()
It's always a good idea to get a quick visual check of the variables and their distributions, particularly for continuous variables. The skim function also comes with a handy feature—a histogram for each variable:
# The `skim` function returns a tidy dataframe, which is a really nice feature of
# the package. We can use our other wrangling skills to filter and display a
# tibble with each variable and its histogram:
skim(safety.df) %>%
  filter(stat == "hist") %>%
  select(variable, formatted) %>%
  rename(histogram = formatted)
## # A tibble: 7 x 2
##   variable          histogram
##   <chr>             <chr>
## 1 AgencyID          ▅▃▃▆▇▁▁▆
## 2 PropertyDamage    ▇▁▁▁▁▁▁▁
## 3 TotalInjuries     ▇▁▁▁▁▁▁▁
## 4 TotalFatalities   ▇▁▁▁▁▁▁▁
## 5 TotalVehicles     ▇▁▁▁▁▁▁▁
## 6 TotalPropertyOnly ▇▁▁▁▁▁▁▁
## 7 TotalIncidents    ▇▁▁▁▁▁▁▁
The pattern isn't an error—leaving aside AgencyID (which is just an identifier), the other variables are heavily right-skewed: most of the mass sits at or near zero, with a long tail of larger values.
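A quick numeric check of that skew (a sketch, using the aggregated safety.df): when a variable's mean sits far above its median, the distribution has a long right tail.

```r
# Compare means and medians---large gaps indicate heavy right skew.
safety.df %>%
  summarise(mean_injuries = mean(TotalInjuries),
            median_injuries = median(TotalInjuries),
            mean_damage = mean(PropertyDamage, na.rm = TRUE),
            median_damage = median(PropertyDamage, na.rm = TRUE))
```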
With a larger histogram, it’s easy to see the large number of zeros…
# I'm going to use `TotalInjuries` as an example...
ggplot(safety.df, aes(TotalInjuries)) +
  geom_histogram()
So what to do?
There are no definitive guidelines for how to handle data with a lot of zeros. When I am deciding what to do with data like this, my first consideration is whether zero is meaningful in the context of the data. In this case, zero TotalInjuries is meaningful (and a good thing!)—the transit agency reported no incidents between 2014 and 2016 in which a person got hurt.
The second consideration is the research question, and whether zero is an important factor in the model. For example, if I want to model Total Injuries as a count (TotalInjuries as the dependent variable), then I may not want to transform the variable, but I would likely need to use a negative binomial model because of the overdispersion.
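As a rough sketch of what that count model could look like—the predictors here are purely illustrative (not necessarily the right covariates), and this assumes the MASS package is installed:

```r
# A minimal negative binomial sketch (assumes the MASS package is installed).
# Calling MASS::glm.nb directly avoids masking dplyr::select by loading MASS.
# The predictors are illustrative only; the real specification depends on
# the research question.
nb.fit <- MASS::glm.nb(TotalInjuries ~ TotalVehicles + TotalIncidents,
                       data = safety.df)
summary(nb.fit)
```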
But if I am not concerned about how each additional injury might change an outcome, and want to consider the pattern of the relationship, then a log transformation might be an option. At the very least, log transformations can help with an initial visualization of a relationship, before settling on a specific modeling approach.
But because you can't take the log of zero (or of a negative number), a common practice is to add a small constant to the variable for scaling, which I'll do here. I'm interested in visualizing the relationship between Total Injuries and Property Damage (which also has a lot of zeros), so I'll do the same transformation with the PropertyDamage variable:
# Note that adding the constant does technically change the interpretation of
# the fitted parameter (e.g., beta). One option is to use a very small value
# relative to the data, and there are formulas for this. In this case, I'm
# just going to use .001, which makes the change in beta's interpretation trivial.
safety.df <- safety.df %>%
  mutate(log.TotalInjuries = log(TotalInjuries + .001),
         log.PropertyDamage = log(PropertyDamage + .001))
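As a side note, base R also provides log1p(), which computes log(1 + x) accurately for small x; it's a common alternative when adding a constant of 1 (rather than .001) is acceptable for the scale of the data. A sketch (the log1p.* column names are just my own naming choice):

```r
# log1p(x) is equivalent to log(x + 1), but numerically more accurate near zero.
safety.df <- safety.df %>%
  mutate(log1p.TotalInjuries = log1p(TotalInjuries),
         log1p.PropertyDamage = log1p(PropertyDamage))
```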
I am a BIG believer that we need more visualizations in our research reports and papers. In entrepreneurship research, we seem to have gotten away from the power of a picture to tell a story.
I think part of the reason is that researchers are trained mostly to look at model output and to search for asterisks, which has the perverse result of encouraging and enabling p-hacking.
I think that initial visualizations help inform future model building decisions. It’s not optional, it’s a must do. It also serves as a “check” on your model—if the picture looks positive and linear, and the model you fit gives a negative coefficient estimate, it’s probably worth a closer look at your model!
So I’m going to do a quick visualization of the relationship between Property Damage and Total Injuries, and I’m going to use the log transformation of each variable.
I also treat these plots as exploratory, so rather than use a linear trend (line of best fit), I much prefer a loess smoothing trend, letting the data speak for itself.
ggplot(safety.df, aes(x = log.TotalInjuries, y = log.PropertyDamage)) +
  geom_point() +   # This layer provides our scatter plot
  geom_smooth()    # This layer fits our loess trend
This plot tells us that among incidents with injuries, there is a monotonic positive trend—as injuries go up, so too does property damage, which makes sense.
But we also see a large number of agencies who reported incidents with property damage, but no injuries. That’s going to be an important consideration in a future model, because it will likely bias the estimated relationship between Total Injuries and Property Damage. Just how many observations are we talking about? Well, we can wrangle that information too…
safety.df %>% count(PropertyDamage > 0, TotalInjuries == 0)
## # A tibble: 4 x 3
##   `PropertyDamage > 0` `TotalInjuries == 0`     n
##   <lgl>                <lgl>                <int>
## 1 FALSE                FALSE                   12
## 2 FALSE                TRUE                     2
## 3 TRUE                 FALSE                  378
## 4 TRUE                 TRUE                    29
Looking at the TRUE/TRUE row, we see 29 agencies that reported incidents with property damage but no injuries.
Our table above, and our plot, also give us a clue about outliers that we need to consider. We have 2 agencies that reported incidents but had no property damage and no injuries (and another 12 with injuries but no property damage). This is where a subject matter expert comes in—why would an agency report a safety or security incident where nobody gets hurt and there was no damage?
Well, let's take a look at another variable, TotalFatalities…
safety.df %>% count(PropertyDamage == 0, TotalInjuries == 0, TotalFatalities > 0)
## # A tibble: 7 x 4
##   `PropertyDamage == 0` `TotalInjuries == 0` `TotalFatalities > 0`     n
##   <lgl>                 <lgl>                <lgl>                 <int>
## 1 FALSE                 FALSE                FALSE                   253
## 2 FALSE                 FALSE                TRUE                    125
## 3 FALSE                 TRUE                 FALSE                    28
## 4 FALSE                 TRUE                 TRUE                      1
## 5 TRUE                  FALSE                FALSE                    11
## 6 TRUE                  FALSE                TRUE                      1
## 7 TRUE                  TRUE                 FALSE                     2
If I understood the documentation correctly, a suicide gets reported as a safety/security incident, which might explain an agency reporting fatalities but no damage and no injuries.
I'm not going to get into the model implications and choices (that's another post!) based on these findings. What is safe to say is that a quick plot with three lines of code gave us substantial insight into the dynamics of our data, and into the critical factors we would need to consider going forward if we wanted to build and fit a model.
So just for comparison, I’ll filter our dataset to visualize the relationship between Total Injuries and Property Damage only among those agencies that reported both injuries and damage…
# A helpful feature of the tidyverse is that usually you can combine several
# elements of the tidyverse within the chain (piping -- %>%). Here I use the
# dplyr::filter function within my ggplot.
ggplot(data = filter(safety.df, PropertyDamage > 0 & TotalInjuries > 0),
       aes(x = log.TotalInjuries, y = log.PropertyDamage)) +
  geom_point() +   # This layer provides our scatter plot
  geom_smooth()    # This layer fits our loess trend
Not surprisingly, we see a pretty strong positive trend between the number of injuries and the amount of property damage.
Data wrangling is an essential skill for all applied researchers. Along with taking messy data and making it useful, being an effective wrangler provides critical insights to your data and how to model it effectively. It’s a skill worth investing in!