Tidy data

· by Brian Anderson · Read in about 14 min · (2916 words) ·

UPDATE 26 Nov 2018

As it turns out, the raw data from Data.gov got an update itself, interestingly becoming tidier since I wrote this post in Nov of 2017. The result is that the code doesn’t really follow the data any longer, so it looks like I’ll need to write a new Tidy Data post :)

What is Un-Tidy Data?

Un-tidy data is data that you normally get out of Qualtrics, survey platforms, Compustat, Data.gov, etc. There is no universal way to characterize it, other then by saying that Un-Tidy Data generally is not yet fit to run statistical analyses or visualizations.

Let’s take a look at a dataset from Data.gov.

Now, I have very little idea just what is in the data, but this is the description:

The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent more than 7 million discharges or 60 percent of total Medicare IPPS discharges.

Here is what I do know:we aren’t able to make much use of the data after pulling it off of data.gov.

Let’s get started. Oh, I chose this one because it has over 160,000 observations. That makes using MS Excel very impractical, as I’ll explain later.

library(tidyverse)
library(knitr)
library(readr)
payments.ds <- read_csv("https://data.cms.gov/api/views/97k6-zzx3/rows.csv?accessType=DOWNLOAD")

Let’s take a look at what we’ve got:

head(payments.ds, 10)
## # A tibble: 10 x 12
##    `DRG Definition` `Provider Id` `Provider Name` `Provider Stree…
##    <chr>                    <dbl> <chr>           <chr>           
##  1 039 - EXTRACRAN…         10001 SOUTHEAST ALAB… 1108 ROSS CLARK…
##  2 039 - EXTRACRAN…         10005 MARSHALL MEDIC… 2505 U S HIGHWA…
##  3 039 - EXTRACRAN…         10006 ELIZA COFFEE M… 205 MARENGO STR…
##  4 039 - EXTRACRAN…         10011 ST VINCENT'S E… 50 MEDICAL PARK…
##  5 039 - EXTRACRAN…         10016 SHELBY BAPTIST… 1000 FIRST STRE…
##  6 039 - EXTRACRAN…         10023 BAPTIST MEDICA… 2105 EAST SOUTH…
##  7 039 - EXTRACRAN…         10029 EAST ALABAMA M… 2000 PEPPERELL …
##  8 039 - EXTRACRAN…         10033 UNIVERSITY OF … 619 SOUTH 19TH …
##  9 039 - EXTRACRAN…         10039 HUNTSVILLE HOS… 101 SIVLEY RD   
## 10 039 - EXTRACRAN…         10040 GADSDEN REGION… 1007 GOODYEAR A…
## # ... with 8 more variables: `Provider City` <chr>, `Provider
## #   State` <chr>, `Provider Zip Code` <dbl>, `Hospital Referral Region
## #   Description` <chr>, `Total Discharges` <dbl>, `Average Covered
## #   Charges` <dbl>, `Average Total Payments` <dbl>, `Average Medicare
## #   Payments` <dbl>



What do we want to have?

Ok, so let’s understand a bit more about what we are hoping for with Tidy Data.

The basic tenants are this:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table


Why do we care?

So the paper goes in to more explanation on it, but this is why I think you should care about Tidy Data:

  • It’s easier to use (especially if you haven’t touched the data in a while)
  • It’s easier to share with coauthors
  • It’s easier to document what you did and why you did it
  • Most statistical packages expect data in this format anyway


Why shouldn’t I just clean it up in MS Excel?

The short answer is that you could, and I used to do it that way. I have since learned the error of my ways, after having to pull an R&R at JBV because I had made a mistake when I ‘cleaned’ the data using Excel, didn’t document it, and couldn’t remember what I had done. When it came time to try to reanalyze the data for the revision, it became clear that I had made some type of early stage mistake, but I couldn’t duplicate it. The challenge with using MS Excel is that it doesn’t keep any kind of record for why you did, or did not, change.

The nice thing about using R and R Studio (although you could also do it with a .do file in Stata) is that when you use tools like Markdown it makes it VERY easy to keep track of what you did. That helps you, your coauthors, and the readers of your papers when you share your code.

As you will also find, data cleaning with code is also—typically—a lot faster than hand-jamming in MS Excel, because there are a lot of packages and functions that make life easy for you.



What are some common violations of Tidy Data?

The paper goes into more detail, but here are the things that I come across most often in downloading datasets that require some work before they are useful:

  • Numerical values being stored as strings because of a character (somewhere)
  • Extraneous header rows
  • Poorly labeled columns (variables)
  • Data stored in wide format, or some crazy long/wide format
  • Nested data that does not uniquely identify an observation as a combination of some \(i\), \(j\) combination


Variable names

So let’s take look again at the raw hospital payments data, and we can see where our difficulties are. Let’s start with the variable names:

names(payments.ds)
##  [1] "DRG Definition"                      
##  [2] "Provider Id"                         
##  [3] "Provider Name"                       
##  [4] "Provider Street Address"             
##  [5] "Provider City"                       
##  [6] "Provider State"                      
##  [7] "Provider Zip Code"                   
##  [8] "Hospital Referral Region Description"
##  [9] "Total Discharges"                    
## [10] "Average Covered Charges"             
## [11] "Average Total Payments"              
## [12] "Average Medicare Payments"

Now, these names aren’t bad in the sense that we can easily make sense of just what they are. The problem though is that they also have spaces, which means that we aren’t going to be able to make use of them in a function.

Go ahead, try to run this code:

library(tidyverse)
kc_records.df <- payments.ds %>%
  filter(Provider Zip Code == 64108)

Now, strictly speaking dplyr can handle columns with spaces in the names by using ’’ marks, but not all R functions can, and Stata just goes hard broke. So the first thing to do is to get rid of the spaces in the column names. You could do this by hand, by calling each variable name and then reassigning:

library(tidyverse)
payments.ds <- payments.ds %>%
  rename(DRGDefinition = 'DRG Definition')

That process, however, takes a lot of time, because we need to write a line of code for all 12 variables. Using the gsub function makes it a lot easier to handle them all at one time, with one line of code.

# Note that if you wanted to put another placeholder in the column name, say a '_', then just 
#  put that in place of the ''.  
names(payments.ds) <- gsub(' ', '', names(payments.ds))

Now let’s take a look:

names(payments.ds)
##  [1] "DRGDefinition"                    
##  [2] "ProviderId"                       
##  [3] "ProviderName"                     
##  [4] "ProviderStreetAddress"            
##  [5] "ProviderCity"                     
##  [6] "ProviderState"                    
##  [7] "ProviderZipCode"                  
##  [8] "HospitalReferralRegionDescription"
##  [9] "TotalDischarges"                  
## [10] "AverageCoveredCharges"            
## [11] "AverageTotalPayments"             
## [12] "AverageMedicarePayments"

And now we can reference our variable names in a variety of functions:

kc_records.df <- payments.ds %>%
  filter(ProviderZipCode == 64108)
head(kc_records.df, 10)
## # A tibble: 10 x 12
##    DRGDefinition ProviderId ProviderName ProviderStreetA… ProviderCity
##    <chr>              <dbl> <chr>        <chr>            <chr>       
##  1 065 - INTRAC…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  2 069 - TRANSI…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  3 074 - CRANIA…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  4 101 - SEIZUR…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  5 189 - PULMON…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  6 190 - CHRONI…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  7 191 - CHRONI…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  8 192 - CHRONI…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
##  9 193 - SIMPLE…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
## 10 194 - SIMPLE…     260048 TRUMAN MEDI… 2301 HOLMES STR… KANSAS CITY 
## # ... with 7 more variables: ProviderState <chr>, ProviderZipCode <dbl>,
## #   HospitalReferralRegionDescription <chr>, TotalDischarges <dbl>,
## #   AverageCoveredCharges <dbl>, AverageTotalPayments <dbl>,
## #   AverageMedicarePayments <dbl>


Numbers stored as characters

Let’s take a look at how R classified the columns. We’re going to make use of the ‘glimpse’ function which is part of the tibble package, which is part of the tidyverse.

glimpse(payments.ds)
## Observations: 163,065
## Variables: 12
## $ DRGDefinition                     <chr> "039 - EXTRACRANIAL PROCEDUR...
## $ ProviderId                        <dbl> 10001, 10005, 10006, 10011, ...
## $ ProviderName                      <chr> "SOUTHEAST ALABAMA MEDICAL C...
## $ ProviderStreetAddress             <chr> "1108 ROSS CLARK CIRCLE", "2...
## $ ProviderCity                      <chr> "DOTHAN", "BOAZ", "FLORENCE"...
## $ ProviderState                     <chr> "AL", "AL", "AL", "AL", "AL"...
## $ ProviderZipCode                   <dbl> 36301, 35957, 35631, 35235, ...
## $ HospitalReferralRegionDescription <chr> "AL - Dothan", "AL - Birming...
## $ TotalDischarges                   <dbl> 91, 14, 24, 25, 18, 67, 51, ...
## $ AverageCoveredCharges             <dbl> 32963.07, 15131.85, 37560.37...
## $ AverageTotalPayments              <dbl> 5777.24, 5787.57, 5434.95, 5...
## $ AverageMedicarePayments           <dbl> 4763.73, 4976.71, 4453.79, 4...

Ok, so we’ve got a problem. If we look at, say, AverageCoveredCharges, we can see that R is storing it as a character, but it should just be a number:

summary(payments.ds$AverageCoveredCharges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2459   15947   25246   36134   43233  929119

The reason is that the raw data included a “$” symbol in each value:

head(payments.ds$AverageCoveredCharges, 5)
## [1] 32963.07 15131.85 37560.37 13998.28 31633.27

So the “$” sign has to go. Note that we have the same problem with the AverageTotalPayments and AverageMedicarePayments variables, so we should take care of these too.

There are several ways to do this, and we could write a function to loop through any column where we see a dollar sign and make a conversion. For our purposes here though, I’m just going to keep it old school using the gsub function to drop the ‘$’, and then the as.numeric function to coerce the column from a character to a numeric value.

# Note the '\\' before the dollar sign--this one is necessary for the dollar sign since 
#  it is a special, special character
payments.ds <- payments.ds %>%
  mutate(AverageCoveredCharges = as.numeric(gsub('\\$', '', AverageCoveredCharges)), 
         AverageTotalPayments = as.numeric(gsub('\\$', '', AverageTotalPayments)), 
         AverageMedicarePayments = as.numeric(gsub('\\$', '', AverageMedicarePayments)))

Ok, so now what does our data look like:

glimpse(payments.ds)
## Observations: 163,065
## Variables: 12
## $ DRGDefinition                     <chr> "039 - EXTRACRANIAL PROCEDUR...
## $ ProviderId                        <dbl> 10001, 10005, 10006, 10011, ...
## $ ProviderName                      <chr> "SOUTHEAST ALABAMA MEDICAL C...
## $ ProviderStreetAddress             <chr> "1108 ROSS CLARK CIRCLE", "2...
## $ ProviderCity                      <chr> "DOTHAN", "BOAZ", "FLORENCE"...
## $ ProviderState                     <chr> "AL", "AL", "AL", "AL", "AL"...
## $ ProviderZipCode                   <dbl> 36301, 35957, 35631, 35235, ...
## $ HospitalReferralRegionDescription <chr> "AL - Dothan", "AL - Birming...
## $ TotalDischarges                   <dbl> 91, 14, 24, 25, 18, 67, 51, ...
## $ AverageCoveredCharges             <dbl> 32963.07, 15131.85, 37560.37...
## $ AverageTotalPayments              <dbl> 5777.24, 5787.57, 5434.95, 5...
## $ AverageMedicarePayments           <dbl> 4763.73, 4976.71, 4453.79, 4...



Nested data

Take another look at our data:

head(payments.ds, 10)
## # A tibble: 10 x 12
##    DRGDefinition ProviderId ProviderName ProviderStreetA… ProviderCity
##    <chr>              <dbl> <chr>        <chr>            <chr>       
##  1 039 - EXTRAC…      10001 SOUTHEAST A… 1108 ROSS CLARK… DOTHAN      
##  2 039 - EXTRAC…      10005 MARSHALL ME… 2505 U S HIGHWA… BOAZ        
##  3 039 - EXTRAC…      10006 ELIZA COFFE… 205 MARENGO STR… FLORENCE    
##  4 039 - EXTRAC…      10011 ST VINCENT'… 50 MEDICAL PARK… BIRMINGHAM  
##  5 039 - EXTRAC…      10016 SHELBY BAPT… 1000 FIRST STRE… ALABASTER   
##  6 039 - EXTRAC…      10023 BAPTIST MED… 2105 EAST SOUTH… MONTGOMERY  
##  7 039 - EXTRAC…      10029 EAST ALABAM… 2000 PEPPERELL … OPELIKA     
##  8 039 - EXTRAC…      10033 UNIVERSITY … 619 SOUTH 19TH … BIRMINGHAM  
##  9 039 - EXTRAC…      10039 HUNTSVILLE … 101 SIVLEY RD    HUNTSVILLE  
## 10 039 - EXTRAC…      10040 GADSDEN REG… 1007 GOODYEAR A… GADSDEN     
## # ... with 7 more variables: ProviderState <chr>, ProviderZipCode <dbl>,
## #   HospitalReferralRegionDescription <chr>, TotalDischarges <dbl>,
## #   AverageCoveredCharges <dbl>, AverageTotalPayments <dbl>,
## #   AverageMedicarePayments <dbl>

Look specifically at the DRGDefinition column. According to the website, this column is:

Code and description identifying the DRG. DRGs are a classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.

If you look across the data, you will see that the data reports the diagnostic codes for each hospital provider for the year 2011, including the total number of discharges for that condition, and the financial information, etc.

In a sense, we have nested data. From one perspective, we have a list of hospitals, and set of DRGs for each hospital.

We can think about these as each hospital, \(j\), containing \(i\) diagnostics. Right now though R doesn’t know that this nesting structure exists, nor does it know that we can group the DRGs together.

We will cover nesting another time, but right now, storing the DRGDefinition as a character variable doesn’t help us much. One solution is to convert DRGs to a factor variable. Factors are a very useful way to store data, particularly for modeling and for visualizations.

So we’ll convert DRG to a factor:

payments.ds <- payments.ds %>%
  mutate(DRGDefinition = as.factor(DRGDefinition))

Lastly, it’s always a good idea to sort, or arrange your data according the nesting structure you specify:

payments.ds <- payments.ds %>%
  arrange(ProviderId, DRGDefinition)



Doing useful things with Tidy Data

There are no shortage of things we could do with this data, now that we’ve got it Tidy. To start with, we can create a new data frame with only a few pieces of information we might want:

payments_subset.df <- as_data_frame(payments.ds) %>%
  select(ProviderId, DRGDefinition, ProviderState, TotalDischarges, AverageTotalPayments)
head(payments_subset.df, 10)
## # A tibble: 10 x 5
##    ProviderId DRGDefinition ProviderState TotalDischarges AverageTotalPay…
##         <dbl> <fct>         <chr>                   <dbl>            <dbl>
##  1      10001 039 - EXTRAC… AL                         91            5777.
##  2      10001 057 - DEGENE… AL                         38            4895.
##  3      10001 064 - INTRAC… AL                         84           10260.
##  4      10001 065 - INTRAC… AL                        169            6542.
##  5      10001 066 - INTRAC… AL                         33            4596.
##  6      10001 069 - TRANSI… AL                         37            4134.
##  7      10001 074 - CRANIA… AL                         13            4876 
##  8      10001 101 - SEIZUR… AL                         27            4667.
##  9      10001 176 - PULMON… AL                         33            6020.
## 10      10001 177 - RESPIR… AL                         21           11635.

For example, we can take a look at the top five DRG diagnostic by total discharges in each state.

# First we are going to create a collapsed dataframe of the total number of discharges
#  by diagnostic for all providers in each state.  We'll make use of this again.
topdiagnostics.df <- payments_subset.df %>%
  group_by(ProviderState, DRGDefinition) %>%
  summarise(TotalDischarges = sum(TotalDischarges)) 
# Now we get the top five diagnostics for each state.
state_topdiagnostic.df <- topdiagnostics.df %>%
  group_by(ProviderState) %>%
  top_n(5, TotalDischarges) %>%
  arrange(ProviderState, TotalDischarges)
head(state_topdiagnostic.df, 10)
## # A tibble: 10 x 3
## # Groups:   ProviderState [2]
##    ProviderState DRGDefinition                             TotalDischarges
##    <chr>         <fct>                                               <dbl>
##  1 AK            392 - ESOPHAGITIS, GASTROENT & MISC DIGE…             210
##  2 AK            292 - HEART FAILURE & SHOCK W CC                      217
##  3 AK            871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV…             255
##  4 AK            194 - SIMPLE PNEUMONIA & PLEURISY W CC                320
##  5 AK            470 - MAJOR JOINT REPLACEMENT OR REATTAC…             648
##  6 AL            690 - KIDNEY & URINARY TRACT INFECTIONS …            4598
##  7 AL            292 - HEART FAILURE & SHOCK W CC                     4646
##  8 AL            871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV…            4773
##  9 AL            392 - ESOPHAGITIS, GASTROENT & MISC DIGE…            5771
## 10 AL            470 - MAJOR JOINT REPLACEMENT OR REATTAC…            8411



Tidy visualizations

The great thing about Tidy Data is the ability to easily create visualizations. We have state-level data, so let’s do a visualization of the total discharges in DRG category 470 - Major Joint Replacements by state.

library(ggplot2)

# We will start by creating a dataframe with just DRG category 470 for each state.
state_data.df <- state_topdiagnostic.df %>%
  filter(DRGDefinition == '470 - MAJOR JOINT REPLACEMENT OR REATTACHMENT OF LOWER EXTREMITY W/O MCC') %>% 
  ungroup()  # Added this on 26 Nov 2018 to clear a rendering error

# The raw count of Total Discharges might be valuable, but we'll go ahead and normalize
#  the value by converting Total Discharges to a z-score
state_data.df <- as_data_frame(state_data.df) %>%
  mutate(TotalDischarges_z = round((TotalDischarges - mean(TotalDischarges)) / sd(TotalDischarges), 2)) %>%
  # Now we create a flag if the state is above or below the mean
  mutate(TotalDischarges_flag = ifelse(TotalDischarges_z < 0, "below", "above")) %>%
  # Now sort the data based on z-score
  arrange(TotalDischarges_z) %>%
  # Lastly convert state to a factor variable
  mutate(ProviderState = factor(ProviderState, levels = ProviderState))

# Now we make a distribution chart
ggplot(state_data.df, aes(x = ProviderState, y = TotalDischarges_z, 
                          label = TotalDischarges_z)) + 
  geom_bar(stat = 'identity', aes(fill = TotalDischarges_flag), width = .5) +
  scale_fill_manual(name = "Total Discharges", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("above" = "dodgerblue4", "below" = "darkorange3")) + 
  labs(title = "Major Joint Replacement or Reattachement of Lower Extremity", 
       subtitle = "Standardized Total Discharges By State", 
       y = "Standardized Total Discharges (Mean = 0)", x = "") + 
  theme(axis.text.y = element_text(margin = margin(2))) + 
  coord_flip()



The bottom line is that Tidy Data is friendly data, and friendly data makes data science possible. So remember, always be Tidy!