Assistant Professor | Bloch School of Management

Three statisticians go hunting. One shoots, and misses far left. Another shoots, and misses far right.

The third one says, “We got it!”

"While it is easy to lie with statistics, it is even easier to lie without them."

Frederick Mosteller

The purpose of statistics is to help us answer questions.

Help is the operative word—unless you’re a mathematician (and even then), we don’t operate in a world of absolutes.

This means that before we get started you need to ban from your lexicon the phrase “This proves that…”

At best, statistics improves inference—our ability to draw conclusions based on logic, available evidence, and causal reasoning.

At worst, statistics gone wrong gets people killed.

While statistics helps with our decision-making, it is far from a panacea.

So be humble.

Ok, dramatic introduction over. Let’s do some data science!

## Plan of the day

• Slack
• R, RStudio, and R Markdown
• Lies, damn lies, and statistics

Assumptions of this course…

There is significant heterogeneity in quantitative skills and prior statistics coursework.

This means that we’ll be doing a mixture of common denominator instruction and specialized coursework to better challenge each student.

Your career objectives center on general managerial progression and not in a purely technical (e.g., statistician/data analyst) role.

That means we’ll be focusing more on how to use statistics to answer business problems AND how to direct your team to answer those questions with the appropriate analytic tools.

There is no possible way to cover all foundational statistics in this course.

That means we’ll be focusing on an 80% solution—going deep with a single core topic (the general linear model) and its assumptions that should help answer the majority of business questions that you have.

## Learning objectives

• Lead and design statistical analysis projects.

• Conduct basic statistical analyses and perform hypothesis tests.

• Correctly interpret and critically evaluate analyses produced by others.

• Present statistical analyses in a clear and compelling way.

## Assessments

So to help maximize our learning, I’m a big fan of assess early, assess often, and assess using different mechanisms…

• Homeworks

• Quizzes

• Group practice deliverables

• Final project

• Office hours

• Slack messaging

• Materials and readings (drbanderson.com, WSJ & HBR)

• Classroom participation and attendance

• QUESTION QUESTION QUESTION!!!

• You can interrupt at any time—promise :)

• If you don’t understand something, or if I have confused the hell out of you, you are REQUIRED to ask for clarification. Seriously. I mean it.

• I love ‘But what about…’!

## How I like to communicate…

If you are not comfortable writing programs in R, Python, Java, and C++, you are going to seriously struggle in this class.

Nah…just kidding :)

But, R is a programming language, and you are going to be "writing code" to do your analyses. If you've never done anything like this before, it might seem a little intimidating.

Trust me—you'll be a whiz by the time we're done, and you'll have confidence in yourself!

This is the part where one of you asks me "Why don't you teach stats with Microsoft Excel?"

## Learning to love your inner-geek and use R

R really is becoming the standard for data science around the world, and finding the answer to a problem in R is usually just a Google away. Seriously. Google is your friend.

Some resources though that will help specifically for this class…

• R: The engine of our fun!

• RStudio: A handy 'wrapper' for R that makes it really easy to use

• R Markdown: What you will use to take notes, turn in homework, etc.

• R Markdown Cheat Sheet: A handy reference guide

We're off like a herd of turtles!

## Fire up RStudio…

R by itself is pretty powerful, and we're only going to scratch the surface of what it can do (case in point, I created this slide deck in R. Yes, I am that big of a geek.).

What makes it even more powerful is that we can access literally thousands of add-ons to R, called packages, that extend R's capabilities. Let's start with two packages that we're going to use a lot. Note that once you install a package, you don't have to install it again unless you re-install R.

install.packages("tidyverse")
install.packages("apaTables")

Ok, now create an R Markdown document. Click File –> New File –> R Markdown...

You will love markdown, promise!

We will be using markdown documents every day, for all deliverables (except quizzes), and I highly recommend using them as your primary note taking solution. The reason is that your code and your notes will all be in the same place, and it makes it INSANELY easy to share with your team.

We'll explore a lot more with markdown later. For now, delete everything you see below the YAML header, which is the block of lines at the top between the --- markers.
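For reference, a freshly created YAML header looks something like this (the title and author here are placeholders—RStudio fills yours in from the New File dialog):

```yaml
---
title: "My Notes"
author: "Your Name"
output: html_document
---
```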

You now have a clean markdown file to work with.

## Data science is all about relationships…

In your markdown document, click on Code –> Insert Chunk.

What you're seeing is one of the coolest things about markdown. Inside of the chunk, you can execute your code and see the results by pressing the green arrow that looks like a play button.

What makes markdown so powerful is that all of your analyses and results are contained within a single text file. It's not proprietary in any way, and virtually any computer can read the file. That makes data science with R easily reproducible, which is the ability for another analyst with the same data to draw the same conclusions as you using the same type of analysis (and ideally code).

Ok, let's start by getting data into R. There are a lot of ways to do this, but I'm going to focus on loading data from .csv files, which is a common data format.

You're going to tell R to load a package (the one we installed earlier), and then use a function called read_csv. R will go to my website, pull down the data, and store it in something called a data frame, which we are calling IQ.df.

library(tidyverse)
IQ.df <- read_csv("http://www.drbanderson.com/data/IQ.csv")

With statistics, we’re trying to make sense of what has happened (descriptive) in order to make a reasonably accurate guess of what will happen (predictive).

Statistical analysis is ALWAYS retrospective. We can’t analyze anything until AFTER it’s happened.

So if our goal is to make a good guess at what will happen—to understand that if X occurs or changes Y occurs or changes—we need to have a good handle on causal inference.

Necessary & Sufficient Conditions For Causality…

$$X \rightarrow Y$$

• Temporal sequencing (X happens before Y)

• The relationship must not be spurious (the effect of X on Y could not have occurred by random chance)

• There must be no other possible explanations (there isn’t another factor, Z, influencing the X → Y relationship)
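That third condition is the one that bites hardest in practice. Here's a minimal sketch in base R (simulated data, entirely hypothetical—not from the course) showing how a lurking factor Z can make X and Y correlate even though neither causes the other:

```r
# A hidden factor Z drives both X and Y, so X and Y correlate
# even though there is no X -> Y causal effect.
set.seed(42)
n <- 5000
z <- rnorm(n)          # the lurking factor
x <- z + rnorm(n)      # X is caused by Z
y <- z + rnorm(n)      # Y is caused by Z

cor(x, y)              # a substantial correlation appears anyway

# Once we control for Z, the apparent effect of X on Y vanishes
coef(summary(lm(y ~ x + z)))
```

If you only observed X and Y, you'd see a correlation around .5 and might wrongly infer causality; the regression coefficient on x, once z is in the model, is statistically indistinguishable from zero.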

So, how likely is it that in the real world we’ll be able to satisfy all of those conditions?

Outside of experimental physics (and even then, there is error), it’s basically impossible to establish true, unequivocal, causal relationships.

So, the method we use to conduct our research and analyses matters for maximizing the causal understanding we may infer.

Causal inference is our ability to make a confident—but not perfect—statement of causality.

Why does causal inference matter? Let's go back to our data and take a look. Remember with markdown, anytime you want to write some code, you need to put it in a chunk (Code –> Insert Chunk). Your markdown document can have as many chunks as you want.

head(IQ.df, 5)
## # A tibble: 5 x 2
##      II    IQ
##   <int> <int>
## 1    52    97
## 2    64    79
## 3    39   103
## 4    64   108
## 5    50   123

Wow. Conservatives are smarter than liberals.

library(apaTables)
apa.cor.table(IQ.df, show.conf.interval = FALSE)
##
##
## Means, standard deviations, and correlations
##
##
##   Variable M      SD    1
##   1. II    50.01  10.00
##
##   2. IQ    100.00 15.00 .24**
##
##
## Note. * indicates p < .05; ** indicates p < .01.
## M and SD are used to represent mean and standard deviation, respectively.
## 
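If you'd rather not load apaTables, base R's cor.test() reports the same correlation along with a p-value and a 95% confidence interval. Shown here on simulated stand-in data so the code is self-contained—run it on IQ.df$II and IQ.df$IQ to check the table above:

```r
# Base-R alternative to the APA table: correlation with a p-value
# and confidence interval. The data below are simulated stand-ins
# with roughly the same means/SDs as the course dataset.
set.seed(1)
ii <- rnorm(100, mean = 50, sd = 10)
iq <- 100 + 0.3 * (ii - 50) + rnorm(100, sd = 15)
cor.test(ii, iq)
```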

Let’s look at a picture.

Pictures don’t lie.

x <- IQ.df$IQ
y <- IQ.df$II
plot(x, y, xlab = "IQ", ylab = "Ideological Identification [100 = Max Conservative]",
main = "Conservatives Are Smarter Than Liberals [N = 5,162]")
abline(lm(y~x), col="red")
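For what it's worth, the same plot can be drawn with ggplot2, which ships as part of the tidyverse we loaded earlier (this is an alternative sketch, not what the slide used):

```r
# ggplot2 version of the same scatterplot with a fitted regression line
# (assumes IQ.df has been loaded as shown above)
library(ggplot2)
ggplot(IQ.df, aes(x = IQ, y = II)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "red") +
  labs(x = "IQ",
       y = "Ideological Identification [100 = Max Conservative]",
       title = "Conservatives Are Smarter Than Liberals [N = 5,162]")
```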

When you saw the statistic and saw the graph for the first time, what was your reaction?

This is a rhetorical question, by the way.

Even trained scientists see data in a way that tends to match their worldview.

We are human after all. Promise.

The scientific method exists to help protect us from our own biases.

We use the scientific method to help us find the "objective" truth, whether we like it or not.

Sorry to disappoint—or fear not—depending on your perspective.

I made that dataset up. It’s completely fake, promise.

I have no idea what the relationship is between political ideology and IQ. I do, of course, have my own opinion (but you’ll never know!).

The point is that whether you are running physics experiments or marketing experiments, understanding the scientific method and what it means for data science is a powerful tool.

The critical skill isn't the ability to write code. The critical skills are the ability to…

• Ask insightful—and valuable—questions

• Collect data that helps to answer those questions

• Analyze that data to derive the most meaningful answer

• Communicate that answer to influence and shape decision making

Anybody can write code (seriously). How you add value is by being a data scientist, not by being a programmer.

We are going to talk a lot about the assumptions made by the analyses you run, by the measures you are using, and by the way the data was collected.

My goal isn't to make you rock-star programmers and stats-nerds, it's to build your ability to think statistically. To understand the strengths and limitations of data science.

So what is the single most valuable skill for a data scientist?

Easy…humility.