Data Wrangling

· by Brian Anderson · Read in about 18 min · (3623 words) ·

Data wrangling is a critically important skill for entrepreneurship (and management) researchers.

Data wrangling takes data that is messy, unstructured, coded strangely, and otherwise not very usable and turns it into data that you can use to run analyses, build models, and construct visualizations.

Being a believer in Tidy Data, I’m going to use a set of tools from the tidyverse to walk through how to take a messy dataset and make it useful.

Importing the data

One of the great developments of the past decade is the public availability of millions of datasets. It’s easy today to simply access datasets over the Internet and dynamically construct tidy data using a variety of software packages (my preference is R). This also has important implications for data and code sharing, which increases the replicability of research. Stick to flat .csv files, share them online (OSF is my favorite, or GitHub), and let other scholars take a look.

For this post I’m going to use data on major safety and security events among U.S. public transit agencies from 2014-2016. As a side note, I really enjoy data.gov and related sites that make it easy to access unique datasets for teaching!

Let’s start by loading our data…

library(tidyverse)
# This is just a shortcut I use when accessing data online...I use a .ds object name
##  to refer to the 'original data', and from there create different .df objects for 
##  analyses. Importantly though, both objects are dataframes; they just have different 
##  naming conventions that I use.

safety.ds <- read_csv("https://data.transportation.gov/api/views/k8v5-iehp/rows.csv?accessType=DOWNLOAD")

So it’s a good-sized dataset, with over 23,000 observations and 93 columns (variables). And while technically tidy, it’s not exactly friendly, so we get to exercise our data wrangling skills.
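
If you want to confirm the size for yourself, here is a quick check (a small sketch, nothing fancy):

# A quick check of the dimensions of the imported data (rows, then columns)
dim(safety.ds)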

The data is multilevel (multiple incident observations per transit agency over multiple years), but I want to collapse/aggregate the data so that each transit agency is a single row, with its 2014-2016 incidents combined. For analysis purposes, it likely makes more sense to keep the multilevel structure, but for this post, I’m going to focus on summarizing/aggregating.
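
One quick way to see that structure (a small sketch, not part of the wrangling below) is to count incident rows by reporting year:

# Counting incident rows per year shows the repeated observations per agency over time
safety.ds %>% 
  count(Year)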


Selecting variables

One of the great things about R and constructing dataframes dynamically is that for big datasets with potentially hundreds (or thousands) of columns (variables), we can easily subset the data and focus on only the variables we want. In looking at the data dictionary, I want to focus on a few key variables:

  • Property Damage
  • Total Injuries
  • Total Fatalities
  • Number of Transit Vehicles Involved

To identify the multilevel structure, we use…

  • 5 Digit NTD ID (numerical ID for each agency)
  • Year (2014-2016)

Let’s start by going back to our original safety.ds dataframe and selecting only the variables we need. I am a big believer in the tidyverse, and the functions used here come from the dplyr package.

# Note that the quotation marks are necessary because some of the column (variable)
##  names have spaces.  We'll take care of that in a minute.  I'm going to create a 
##  new safety.df object from the master safety.ds object.  Technically not necessary
##  but it is how I like to keep track of my analyses.  It also makes it easy to keep 
##  one 'master' version in memory, and create as many other 'working' data
##  frames from that 'master' version as I might need.

safety.df <- safety.ds %>% 
  select("5 Digit NTD ID", Year, Agency, "Property Damage", "Total Injuries", 
         "Total Fatalities", "Number Of Transit Vehicles Involved")

# Next we're going to remove the spaces in the column names, which makes it easier to refer
#  to the variables
names(safety.df) <- gsub(" " , "", names(safety.df))

# Now let's take a look at our dataframe
head(safety.df, 10)
## # A tibble: 10 x 7
##    `5DigitNTDID`  Year Agency PropertyDamage TotalInjuries TotalFatalities
##            <int> <int> <chr>  <chr>                  <int>           <int>
##  1         40078  2014 Cobb … $1500.00                   4               0
##  2         40078  2014 Cobb … $5000.00                   1               0
##  3         40078  2014 Cobb … $1500.00                   3               0
##  4         40078  2014 Cobb … $2500.00                   2               0
##  5         40078  2014 Cobb … $2500.00                   6               0
##  6         30034  2015 Maryl… $0.00                      1               0
##  7         60008  2015 Metro… $25000.00                 NA              NA
##  8         40078  2014 Cobb … $5000.00                   1               0
##  9         50066  2014 Chica… $0.00                      1               0
## 10         50066  2014 Chica… $0.00                      1               0
## # ... with 1 more variable: NumberOfTransitVehiclesInvolved <int>
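
As an aside, if you prefer snake_case names, the janitor package offers a one-step alternative to the gsub() approach above (assuming the package is installed). I’m sticking with the no-space names here, but the sketch would look like this:

# An alternative (not run here): janitor::clean_names() converts all of the column 
##  names in one step, producing lower-case, snake_case-style names.
# safety.df <- safety.df %>% janitor::clean_names()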

Let’s take a closer look at how R is storing our data with the str() function:

str(safety.df, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    23583 obs. of  7 variables:
##  $ 5DigitNTDID                    : int  40078 40078 40078 40078 40078 30034 60008 40078 50066 50066 ...
##  $ Year                           : int  2014 2014 2014 2014 2014 2015 2015 2014 2014 2014 ...
##  $ Agency                         : chr  "Cobb County Department of Transportation" "Cobb County Department of Transportation" "Cobb County Department of Transportation" "Cobb County Department of Transportation" ...
##  $ PropertyDamage                 : chr  "$1500.00" "$5000.00" "$1500.00" "$2500.00" ...
##  $ TotalInjuries                  : int  4 1 3 2 6 1 NA 1 1 1 ...
##  $ TotalFatalities                : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ NumberOfTransitVehiclesInvolved: int  1 1 1 1 1 NA NA NA NA NA ...

We have our first problem. R is storing the PropertyDamage variable as a string (character) object, and we need it to be numeric for our analyses. R stored it as a character because in the original data, the amount of property damage includes a $ before the number, which we need to remove:

str(safety.df$PropertyDamage)
##  chr [1:23583] "$1500.00" "$5000.00" "$1500.00" "$2500.00" "$2500.00" ...

So let’s take care of that…

# We've got two steps here---the first step is to remove the `$`, and in the second 
#  step we need to change the `PropertyDamage` column to numeric

safety.df <- safety.df %>% 
  mutate(PropertyDamage = as.numeric(gsub("\\$", "", PropertyDamage)))

# Now let's check out our column type...
str(safety.df$PropertyDamage)
##  num [1:23583] 1500 5000 1500 2500 2500 0 25000 5000 0 0 ...

Much better :)
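
A side note: readr’s parse_number() would have handled this in one step on the original character column—it strips non-numeric characters such as $ and , before converting. A sketch of that alternative (not what I used above):

# An equivalent sketch using readr::parse_number() on the original column
safety.ds %>% 
  transmute(PropertyDamage = parse_number(`Property Damage`)) %>% 
  head()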


Renaming columns (variables)

One small note about R—column names generally cannot start with a number, so the 5DigitNTDID column violates a tidy naming convention. So let’s just rename it quickly…

safety.df <- safety.df %>% 
  rename(AgencyID = "5DigitNTDID")

That said, while the number convention is an R consideration, it’s always best to use column (variable) names understandable by a real person. It helps you as the researcher, it helps coauthors keep track of what you (and they) are doing, and makes it easy to share with others. You might know what the PD14TAXYT variable means, but calling it PropertyDamage instead makes everyone’s life easier :).


Missing data

Before we aggregate our data, we need to deal with a missing data problem…

safety.df %>% 
  filter(AgencyID == 1, Year == 2014) %>% 
  select(PropertyDamage, TotalInjuries, TotalFatalities)
## # A tibble: 59 x 3
##    PropertyDamage TotalInjuries TotalFatalities
##             <dbl>         <int>           <int>
##  1             0.             1               0
##  2          3783.             1               0
##  3          1675.            NA              NA
##  4          3533.             1               0
##  5             0.             1               0
##  6          9989.             1               0
##  7           946.            NA              NA
##  8             0.             1               0
##  9          1920.             1               0
## 10         12131.             1               0
## # ... with 49 more rows

Here we have the records for the King County Department of Transportation (AgencyID = 1) for 2014, using the filter function. We see incidents that caused property damage but have NA recorded for injuries and fatalities. This happens quite a bit in our data. What to do about it is a judgment call.

For this data, I’m going to show one approach. Ultimately we’re going to aggregate this data, but the missing values would drop a number of observations when we aggregate, and we want to preserve these observations.

One approach would be to set TotalInjuries and TotalFatalities equal to 0 if the observation is currently NA. Our assumption is that an NA for injuries and fatalities means that nobody was hurt in the incident. This is where subject matter knowledge is very, very helpful, because our assumption could be completely wrong.
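
Before making that change, it’s worth a quick look at how many rows the assumption touches (a small illustrative check):

# Counting incident rows that are missing both the injury and fatality counts
safety.df %>% 
  summarise(MissingBoth = sum(is.na(TotalInjuries) & is.na(TotalFatalities)))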

So along with our assumption and any change to the data, I’m also going to create a new variable to flag whether the safety incident resulted in property damage only. This way I create a marker variable to indicate when I changed a value, which is a good practice. I could use this variable as a control variable, or I might not need it at all. But I have it, it’s easy to create, and it keeps track of what I did. This is a reason why using tools like Markdown (also available for Stata!) is a much better approach to applied data analysis than point-and-click statistics with context menus.

# The dplyr 'if_else' function is INSANELY handy! I'm going to set the value of 
##  a new variable, 'PropertyOnly', equal to 1 if both TotalInjuries and 
##  TotalFatalities are NA (missing).

safety.df <- safety.df %>% 
  mutate(PropertyOnly = if_else(is.na(TotalInjuries) & is.na(TotalFatalities), 1, 0))

safety.df %>% 
  filter(AgencyID == 1, Year == 2014) %>% 
  select(PropertyDamage, TotalInjuries, TotalFatalities, PropertyOnly)
## # A tibble: 59 x 4
##    PropertyDamage TotalInjuries TotalFatalities PropertyOnly
##             <dbl>         <int>           <int>        <dbl>
##  1             0.             1               0           0.
##  2          3783.             1               0           0.
##  3          1675.            NA              NA           1.
##  4          3533.             1               0           0.
##  5             0.             1               0           0.
##  6          9989.             1               0           0.
##  7           946.            NA              NA           1.
##  8             0.             1               0           0.
##  9          1920.             1               0           0.
## 10         12131.             1               0           0.
## # ... with 49 more rows

Now, we’re going to replace our NA values with zeros. Note that the NumberOfTransitVehiclesInvolved variable also records an NA when there were no transit vehicles involved. We could apply the same flagging logic we used for the property-only incidents, but for simplicity I’ll just set those values equal to 0 as well.

# Here we use the `replace_na` function to set each of our selected columns equal to 0
##  if the function sees an NA value for that observation.

safety.df <- safety.df %>% 
  replace_na(list(TotalInjuries = 0,
                  TotalFatalities = 0,
                  NumberOfTransitVehiclesInvolved = 0))

Now let’s take another look at King County…

safety.df %>% 
  filter(AgencyID == 1, Year == 2014) %>% 
  select(PropertyDamage, TotalInjuries, TotalFatalities, PropertyOnly)
## # A tibble: 59 x 4
##    PropertyDamage TotalInjuries TotalFatalities PropertyOnly
##             <dbl>         <dbl>           <dbl>        <dbl>
##  1             0.            1.              0.           0.
##  2          3783.            1.              0.           0.
##  3          1675.            0.              0.           1.
##  4          3533.            1.              0.           0.
##  5             0.            1.              0.           0.
##  6          9989.            1.              0.           0.
##  7           946.            0.              0.           1.
##  8             0.            1.              0.           0.
##  9          1920.            1.              0.           0.
## 10         12131.            1.              0.           0.
## # ... with 49 more rows

Aggregating and summarizing data

Now let’s do some data aggregation. I’m going to make use of dplyr again and its group_by and summarise functions.

I’m planning on collapsing the data, summing up the observations by agency over the three-year period. But I also might want to keep track of how many incidents each agency reported over the period. That might be an important control variable down the road, so I’m going to create a new TotalIncidents variable.

# I start by grouping the data by AgencyID, so that each created column
##  reflects all of the safety incidents for that particular agency.

safety.df <- safety.df %>% 
  group_by(AgencyID) %>%  
  summarise(PropertyDamage = sum(PropertyDamage),
            TotalInjuries = sum(TotalInjuries),
            TotalFatalities = sum(TotalFatalities),
            TotalVehicles = sum(NumberOfTransitVehiclesInvolved),
            TotalPropertyOnly = sum(PropertyOnly),
            TotalIncidents = n())  # The `n()` shortcut counts the obs. by agency
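
A quick sanity check on the collapse (a sketch): after summarising, each AgencyID should appear exactly once.

# Sanity check---after aggregating, no AgencyID should appear more than once
safety.df %>% 
  count(AgencyID) %>% 
  filter(n > 1)  # we expect an empty tibble here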

Now let’s see what our wrangled, tidy, and aggregated data look like. For this I’m going to use a function called skim, from the skimr package, which gives us a nice summary display of our data…

library(skimr)
skim(safety.df)
## Skim summary statistics
##  n obs: 425 
##  n variables: 7 
## 
## Variable type: integer 
##        variable missing complete   n     mean       sd p0   p25   p50
##        AgencyID       0      425 425 47270.84 26895.57  1 30026 40192
##  TotalIncidents       0      425 425    55.49   176.44  1     2     7
##    p75  p100     hist
##  60111 99423 ▅▃▃▆▇▁▁▆
##     25  1999 ▇▁▁▁▁▁▁▁
## 
## Variable type: numeric 
##           variable missing complete   n      mean        sd p0   p25   p50
##     PropertyDamage       4      421 425 392014.72 858671.8   0 28822 96500
##    TotalFatalities       0      425 425      1.75     10.74  0     0     0
##      TotalInjuries       0      425 425     74.98    257.9   0     3     8
##  TotalPropertyOnly       0      425 425     11.7      34.6   0     0     2
##      TotalVehicles       0      425 425     44.17    125.64  0     2     7
##    p75  p100     hist
##  3e+05 7e+06 ▇▁▁▁▁▁▁▁
##      1   207 ▇▁▁▁▁▁▁▁
##     31  2900 ▇▁▁▁▁▁▁▁
##      6   337 ▇▁▁▁▁▁▁▁
##     23  1230 ▇▁▁▁▁▁▁▁

A cool thing about skim is that it gives you columns for the missing, complete, and total number of observations per variable. There are four missing values for the PropertyDamage variable. We can also see from the p0 column that the observed range of PropertyDamage starts at zero, and we can check that with the summary function…

summary(safety.df$PropertyDamage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0   28822   96500  392015  296776 7045131       4

So we’ve got some options, including just dropping these observations from the analysis, but let’s take a quick look and see if there is some type of pattern to these observations. For that, I’ll use the filter function.

safety.df %>% 
  filter(is.na(PropertyDamage))
## # A tibble: 4 x 7
##   AgencyID PropertyDamage TotalInjuries TotalFatalities TotalVehicles
##      <int>          <dbl>         <dbl>           <dbl>         <dbl>
## 1    50052             NA            7.              0.           14.
## 2    50066             NA         2062.             39.          849.
## 3    90015             NA          480.              7.          338.
## 4    90090             NA            4.              0.            4.
## # ... with 2 more variables: TotalPropertyOnly <dbl>, TotalIncidents <int>

Well, it’s pretty clear that these agencies did have accidents, including property-only accidents, over the period. This is a time for a subject matter expert, which I am not. So let’s just exclude those observations from our analysis for right now…

safety.df <- safety.df %>%
  drop_na()
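
For reference, drop_na() with no arguments removes any row with an NA in any column; since PropertyDamage is the only column with missing values here, a more targeted filter would do the same thing (a sketch of the equivalent):

# An equivalent, more explicit alternative (not run here)---drop only on PropertyDamage
# safety.df <- safety.df %>% 
#   filter(!is.na(PropertyDamage))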

Checking variables

It’s always a good idea to get a quick visual check of the variables and their distributions, particularly for continuous variables.

The skim function also comes with a handy feature—a histogram for each variable:

# The `skim` function returns a tidy dataframe, which is a really nice feature of 
##  the package.  We can use our other wrangling skills to filter and display a 
##  tibble with each variable and its histogram:

skim(safety.df) %>% 
  filter(stat == "hist") %>% 
  select(variable, formatted) %>% 
  rename(histogram = formatted)
## # A tibble: 7 x 2
##   variable          histogram
##   <chr>             <chr>    
## 1 AgencyID          ▅▃▃▆▇▁▁▆ 
## 2 PropertyDamage    ▇▁▁▁▁▁▁▁ 
## 3 TotalInjuries     ▇▁▁▁▁▁▁▁ 
## 4 TotalFatalities   ▇▁▁▁▁▁▁▁ 
## 5 TotalVehicles     ▇▁▁▁▁▁▁▁ 
## 6 TotalPropertyOnly ▇▁▁▁▁▁▁▁ 
## 7 TotalIncidents    ▇▁▁▁▁▁▁▁

The pattern isn’t an error—leaving aside AgencyID (which is just an identifier), the other variables are heavily right-skewed, with the bulk of the observations piled up at or near zero for each variable.

With a larger histogram, it’s easy to see the large number of zeros…

# I'm going to use `TotalInjuries` as an example...
ggplot(safety.df, aes(TotalInjuries)) + 
  geom_histogram()

So what to do?

There are no definitive guidelines for how to handle data with a lot of zeros. When I am deciding on what to do with data like this, my first consideration is whether zero is meaningful in the context of the data. In this case, zero TotalInjuries is meaningful (and a good thing!)—the transit agency reported no incidents between 2014 and 2016 in which a person got hurt.

The second consideration is the research question, and whether zero is an important factor in the model. For example, if I want to model change in Total Injuries in terms of counts (TotalInjuries as the dependent variable), then I may not want to transform the variable, but I would likely need to use a negative binomial model because of the overdispersion.
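
As a sketch of what that count-model option could look like—the predictor here is purely illustrative, and I’m not fitting this model in this post:

# A minimal sketch of the count-model option. Using PropertyDamage as the predictor
##  is illustrative only; MASS::glm.nb fits a negative binomial regression to 
##  account for the overdispersion in the TotalInjuries counts.
nb.fit <- MASS::glm.nb(TotalInjuries ~ PropertyDamage, data = safety.df)
summary(nb.fit)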

But if I am not concerned about how each additional injury might change an outcome, and want to consider the pattern of the relationship, then a log transformation might be an option. At the very least, log transformations can help with an initial visualization of a relationship, before settling on a specific modeling approach.

But because you can’t take the log of zero (or of a negative number), a common practice is to add a small constant to the variable for scaling, which I’ll do here. I’m interested in visualizing the relationship between Total Injuries and Property Damage (which also has a lot of zeros), so I’ll do the same transformation with the PropertyDamage variable.

# Note that adding the constant does technically change the interpretation of 
##  the fitted parameter (e.g., beta).  One option is to use a very small value
##  relative to the data, and there are formulas for this. In this case, I'm 
##  just going to use .001, which makes the change in beta's interpretation trivial.

safety.df <- safety.df %>% 
  mutate(log.TotalInjuries = log(TotalInjuries + .001),
         log.PropertyDamage = log(PropertyDamage + .001))
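
A related option worth knowing (not what I used above): base R’s log1p() computes log(1 + x) and so handles zeros directly, though the +1 offset is much larger relative to small counts than the .001 constant I used.

# An alternative sketch using log1p(), which computes log(1 + x)
safety.df %>% 
  mutate(log1p.TotalInjuries = log1p(TotalInjuries)) %>% 
  select(TotalInjuries, log1p.TotalInjuries) %>% 
  head()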

Visualizing relationships

I am a BIG believer that we need more visualizations in our research reports and papers. In entrepreneurship research, we seem to have gotten away from the power of a picture to tell a story.

I think part of the reason is that researchers are trained mostly to look at model output and to search for asterisks, which has the perverse result of encouraging and enabling p-hacking.

I think that initial visualizations help inform future model-building decisions. It’s not optional; it’s a must-do. A plot also serves as a “check” on your model—if the picture looks positive and linear, and the model you fit gives a negative coefficient estimate, it’s probably worth a closer look at your model!

So I’m going to do a quick visualization of the relationship between Property Damage and Total Injuries, and I’m going to use the log transformation of each variable.

I also treat these plots as exploratory, so rather than use a linear trend (line of best fit), I much prefer a loess smoothing trend, letting the data speak for itself.

ggplot(safety.df, aes(x = log.TotalInjuries, y = log.PropertyDamage)) + 
  geom_point() +  # This option provides our scatter plot
  geom_smooth()  # This option fits our loess trend

This plot tells us that among agencies reporting injuries, there is a monotonic positive trend—as injuries go up, so too does property damage, which makes sense.

But we also see a large number of agencies that reported incidents with property damage but no injuries. That’s going to be an important consideration in a future model, because it will likely bias the estimated relationship between Total Injuries and Property Damage. Just how many observations are we talking about? Well, we can wrangle that information too…

safety.df %>% 
  count(PropertyDamage > 0, TotalInjuries == 0)
## # A tibble: 4 x 3
##   `PropertyDamage > 0` `TotalInjuries == 0`     n
##   <lgl>                <lgl>                <int>
## 1 FALSE                FALSE                   12
## 2 FALSE                TRUE                     2
## 3 TRUE                 FALSE                  378
## 4 TRUE                 TRUE                    29

Under the TRUE and TRUE condition, we see 29 agencies that reported incidents with property damage but no injuries.

Our table above, and our plot, also give us a clue about outliers that we need to consider. We have 12 agencies that reported incidents with injuries but no property damage, and 2 agencies that reported incidents with neither property damage nor injuries. This is where a subject matter expert comes in—why would an agency report a safety or security incident where nobody gets hurt and there was no damage?

Well, let’s take a look at another variable, TotalFatalities…

safety.df %>% 
  count(PropertyDamage == 0, TotalInjuries == 0, TotalFatalities > 0)
## # A tibble: 7 x 4
##   `PropertyDamage == 0` `TotalInjuries == 0` `TotalFatalities > 0`     n
##   <lgl>                 <lgl>                <lgl>                 <int>
## 1 FALSE                 FALSE                FALSE                   253
## 2 FALSE                 FALSE                TRUE                    125
## 3 FALSE                 TRUE                 FALSE                    28
## 4 FALSE                 TRUE                 TRUE                      1
## 5 TRUE                  FALSE                FALSE                    11
## 6 TRUE                  FALSE                TRUE                      1
## 7 TRUE                  TRUE                 FALSE                     2

If I understood the documentation correctly, a suicide gets reported as a safety/security incident, which might explain the handful of agencies reporting fatalities with no injuries, or fatalities with no property damage.

I’m not going to get into the model implications and choices (that’s another post!) based on these findings. What is safe to say is that a quick plot with three lines of code gave us substantial insight into the dynamics of our data, and into critical factors that we would need to consider going forward if we wanted to build and fit a model.

So just for comparison, I’ll filter our dataset to visualize the relationship between Total Injuries and Property Damage only among those agencies that reported both injuries and damage…

# A helpful feature of the tidyverse is that usually you can combine several
##  elements of the tidyverse within the chain (piping, %>%). Here I use the 
##  dplyr::filter function within my ggplot.

ggplot(data = filter(safety.df, PropertyDamage > 0 & TotalInjuries > 0), 
       aes(x = log.TotalInjuries, y = log.PropertyDamage)) + 
  geom_point() +  # This option provides our scatter plot
  geom_smooth()  # This option fits our loess trend

Not surprisingly, we see a pretty strong positive trend between the number of injuries and the amount of property damage.
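
If we want to put a rough number on that trend (a quick sketch, not a model), a simple correlation between the logged variables among these agencies does the job:

# A rough quantification of the trend among agencies reporting both damage and injuries
safety.df %>% 
  filter(PropertyDamage > 0 & TotalInjuries > 0) %>% 
  summarise(r = cor(log.TotalInjuries, log.PropertyDamage))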


Key takeaway…

Data wrangling is an essential skill for all applied researchers. Along with turning messy data into something useful, being an effective wrangler provides critical insights into your data and how to model it effectively. It’s a skill worth investing in!