Introduction to Data Science Independent Study

Objective

This independent study covers the basics of causal inference, linear modeling, and research design from both frequentist and Bayesian perspectives.

At the end of the term, you should be able to…

  1. Design and execute research projects that maximize causal inference.
  2. Conduct statistical analyses appropriate for the research design employed, perform hypothesis tests, and correctly interpret analytical results.
  3. Critically analyze research designs and statistical analyses produced by others, AND make recommendations for how the design and analysis could be improved.
  4. Communicate your research design, analytical approach, and empirical results in a clear and compelling way using a study pre-registration document.

There are 13 sessions in the course, each covering a different topical area and corresponding to one of our weekly meetings. For each session, you will complete the readings and a homework assignment in advance of our meeting. You are welcome—and encouraged—to collaborate on the homework assignments. During our meeting, we will review the homework assignment and discuss the readings and concepts from that session.



Resources

In addition to the books, blogs, and other resources listed below, all of the readings are available through the UMKC Library, Google Scholar, or the direct links in this document (for off-campus access, I encourage you to use the UMKC VPN service). Please remember to bring a laptop to each in-person session.

Books

The following books are not required, but they are highly recommended, in the sense that they should (and will) become a valuable part of your scholarly library. I will not use these books directly in our class discussions, but I may refer you to them for additional information.

  • Angrist, J., Pischke, J.-S., 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, Princeton, NJ.
  • Cohen, J., Cohen, P., West, S.G., Aiken, L.S., 2003. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Erlbaum, Mahwah, NJ.
  • Cook, T.D., Campbell, D.T., 1979. Quasi-Experimentation: Design and Analysis Issues for Field Settings. Houghton Mifflin, Boston, MA.
  • Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B., 2013. Bayesian Data Analysis (3rd Edition). CRC Press, Boca Raton, FL.
  • Morgan, S.L., Winship, C., 2007. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, New York.
  • Pearl, J., 2018. The Book of Why: The New Science of Cause and Effect. Basic Books, New York, NY.
  • Pearl, J., 2009. Causality: Models, Reasoning, and Inference, 2nd Ed. Cambridge University Press, New York, NY.
  • Sword, H., 2017. Air & Light & Time & Space: How Successful Academics Write. Harvard University Press, Cambridge, MA.
  • Sword, H., 2012. Stylish Academic Writing. Harvard University Press, Cambridge, MA.
  • Wooldridge, J.M., 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA.

Platform

We will be using RStudio Cloud as our technology platform. Each session has an associated project in the course workspace, including R Markdown files, data, and other supporting material. There is no need to download and install R or RStudio on your computer.

One of the great features of RStudio Cloud is its integration with DataCamp. I encourage you to explore DataCamp’s free courses, and if you like them, a subscription is a great value!
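
To give you a sense of the workflow, here is a minimal sketch of the kind of R code you will run inside a session’s R Markdown file. The file name (session_data.csv) and variable names (outcome, predictor) are hypothetical placeholders; each session’s project supplies its own data and instructions.

    # Read a (hypothetical) data file included in the session project
    dat <- read.csv("session_data.csv")

    # Fit a simple linear regression of the outcome on one predictor
    fit <- lm(outcome ~ predictor, data = dat)

    # Inspect coefficients, standard errors, and fit statistics
    summary(fit)

    # 95% confidence intervals for the coefficients
    confint(fit)

Everything above uses base R, so it runs in the browser-based RStudio Cloud project without installing anything on your own machine.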



Sessions

Session 1—Ethics & Data Science

  • Science isn’t broken
  • P-hacking and multiple comparisons
  • Bad statistics and theoretical looseness
  • McShane, B.B., Gal, D., 2016. Blinding us to the obvious? The effect of statistical training on the evaluation of evidence. Manage. Sci. 62, 1707–1718.
  • McShane, B.B., Gal, D., Gelman, A., Robert, C., Tackett, J.L., 2017. Abandon statistical significance. Working paper.
  • Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., Wagenmakers, E.-J., 2016. The fallacy of placing confidence in confidence intervals. Psychon. Bull. Rev. 23, 103–123.
  • Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M., Epskamp, S., Wicherts, J.M., 2016. The prevalence of statistical reporting errors in psychology (1985–2013). Behav Res 48, 1205–1226.
  • Wasserstein, R.L., Lazar, N.A., 2016. The ASA’s statement on p-values: Context, process, and purpose. Am. Stat. 70, 129–133.

Session 2—Ethics & Publishing

  • We found only one-third of published psychology research is reliable – now what?
  • Trouble at the lab
  • Aguinis, H., Ramani, R.S., Alabduljader, N., 2017. What you see is what you get? Enhancing methodological transparency in management research. Academy of Management Annals Forthcoming, 1–62.
  • Bettis, R.A., 2012. The search for asterisks: Compromised statistical tests and flawed theories. Strat. Mgmt. J. 33, 108–113.
  • Butler, N., Delaney, H., Spoelstra, S., 2017. The gray zone: Questionable research practices in the business school. Acad. Manage. Learning & Education 16, 94–109.
  • Gelman, A., 2013. Garden of forking paths. Working Paper 1–17.
  • Gelman, A., Carlin, J., 2014. Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspect. Psychol. Sci. 9, 641–651.
  • Ioannidis, J.P.A., 2016. Why most clinical research is not useful. PLoS Med 13, e1002049.
  • Ioannidis, J.P.A., Stuart, M.E., Brownlee, S., Strite, S.A., 2017. How to survive the medical misinformation mess. Eur. J. Clin. Invest. Forthcoming, 1–8.
  • John, L.K., Loewenstein, G., Prelec, D., 2012. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532.
  • O’Boyle, E.H., Banks, G.C., Gonzalez-Mule, E., 2017. The Chrysalis Effect: How ugly initial results metamorphosize into beautiful articles. J. Manage. 43, 376–399.
  • Simmons, J.P., Nelson, L.D., Simonsohn, U., 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366.

Session 3—Replicable & Reproducible Research

  • Tidy data
  • Bergh, D.D., Sharp, B.M., Aguinis, H., Li, M., 2017. Is there a credibility crisis in strategic management research? Evidence on the reproducibility of study findings. Strategic Organization Forthcoming, 1–21.
  • Bettis, R.A., Helfat, C.E., Shaver, J.M., 2016. The necessity, logic, and forms of replication. Strat. Mgmt. J. 37, 2193–2203.
  • Ethiraj, S.K., Gambardella, A., Helfat, C.E., 2017. Improving data availability: A new SMJ initiative. Strat. Mgmt. J. 38, 2145–2146.
  • Lindsay, D.S., 2017. Sharing data and materials in psychological science. Psychol. Sci. Forthcoming, 1–4.
  • Marwick, B., Boettiger, C., Mullen, L., 2018. Packaging data analytical work reproducibly using R (and friends) (No. e3192v2). PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3192v2
  • Patil, P., Peng, R.D., Leek, J.T., 2016. What should researchers expect when they replicate studies? A statistical view of replicability in Psychological Science. Perspect. Psychol. Sci. 11, 539–544.
  • Schweinsberg et al., 2016. The pipeline project: Pre-publication independent replications of a single laboratory’s research pipeline. J. Exp. Soc. Psychol. 66, 55–67.
  • van ’t Veer, A.E., Giner-Sorolla, R., 2016. Pre-registration in social psychology—A discussion and suggested template. J. Exp. Soc. Psychol. 67, 2–12.

Session 4—Causal Inference

  • Establishing causality
  • Causal language
  • Endogeneity: An inconvenient truth
  • Antonakis, J., Bendahan, S., Jacquart, P., Lalive, R., 2010. On making causal claims: A review and recommendations. Leadersh. Q. 21, 1086–1120.
  • Hildebrand, T., Puri, M., Rocholl, J., 2016. Adverse incentives in crowdfunding. Manage. Sci. 63, 587–608.
  • Gelman, A., Imbens, G., 2013. Why ask why? Forward causal inference and reverse causal questions. Working Paper Series. https://doi.org/10.3386/w19614
  • Loken, E., Gelman, A., 2017. Measurement error and the replication crisis. Science 355, 584–585.
  • Rubin, D.B., 2005. Causal inference using potential outcomes. J. Am. Stat. Assoc. 100, 322–331.
  • Westfall, J., Yarkoni, T., 2016. Statistically controlling for confounding constructs is harder than you think. PLoS One 11, e0152719.

Session 5—Hypothesis Testing

  • Null hypothesis testing
  • Anderson, S.F., Kelley, K., Maxwell, S.E., 2017. Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychol. Sci. Forthcoming.
  • Gelman, A., 2017. The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Working Paper 1–11.
  • Gelman, A., Weakliem, D., 2009. Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. Am. Sci. 97, 310–316.
  • Gelman, A., Carlin, J., 2014. Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspect. Psychol. Sci. 9, 641–651.

Session 6—Bayesian Inference

  • Dienes, Z., Mclatchie, N., 2018. Four reasons to prefer Bayesian analyses over significance testing. Psychon. Bull. Rev. 25, 207–218.
  • Etz, A., Gronau, Q.F., Dablander, F., Edelsbrunner, P.A., Baribault, B., 2018. How to become a Bayesian in eight easy steps: An annotated reading list. Psychon. Bull. Rev. 25, 219–234.
  • Etz, A., Vandekerckhove, J., 2018. Introduction to Bayesian inference for psychology. Psychon. Bull. Rev. 25, 5–34.
  • Gelman, A., Shalizi, C.R., 2013. Philosophy and the practice of Bayesian statistics. Br. J. Math. Stat. Psychol. 66, 8–38.
  • Kruschke, J.K., 2010. What to believe: Bayesian methods for data analysis. Trends Cogn. Sci. 14, 293–300.
  • Kruschke, J.K., Liddell, T.M., 2018. Bayesian data analysis for newcomers. Psychon. Bull. Rev. 25, 155–177.

Session 7—Sampling & Missing Data

  • Certo, S.T., Busenbark, J.R., Woo, H.-S., Semadeni, M., 2016. Sample selection bias and Heckman models in strategic management research. Strat. Mgmt. J. 37, 2639–2657.
  • Clougherty, J.A., Duso, T., Muck, J., 2016. Correcting for self-selection based endogeneity in management research: Review, recommendations and simulations. Organizational Research Methods 19, 286–347.
  • Curran, P.J., Hussong, A.M., 2009. Integrative data analysis: The simultaneous analysis of multiple data sets. Psychol. Methods 14, 81–100.
  • Newman, D.A., 2014. Missing data. Organizational Research Methods 17, 372–411.
  • Stuart, E.A., Azur, M., Frangakis, C., Leaf, P., 2009. Multiple imputation with large data sets: A case study of the Children’s Mental Health Initiative. Am. J. Epidemiol. 169, 1133–1139.

Session 8—Variables & Measures

  • Measurement error as an endogeneity problem
  • That pesky alpha
  • Achtenhagen, L., Naldi, L., Melin, L., 2010. “Business Growth”—Do practitioners and scholars really talk about the same thing? Entrepreneurship Theory and Practice 34, 289–316.
  • Edwards, J.R., Bagozzi, R.P., 2000. On the nature and direction of relationships between constructs and measures. Psychol. Methods 5, 155–174.
  • Podsakoff, P.M., MacKenzie, S.B., Podsakoff, N.P., Lee, J.Y., 2003. The mismeasure of man(agement) and its implications for leadership research. Leadersh. Q. 14, 615–656.

Session 9—Model Building vs. Model Testing

  • Hambrick, D.C., 2007. The field of management’s devotion to theory: Too much of a good thing? Acad. Manage. J. 50, 1346–1352.
  • Gelman, A., Hill, J., Yajima, M., 2012. Why we (usually) don’t have to worry about multiple comparisons. J. Res. Educ. Eff. 5, 189–211.
  • Ragins, B.R., 2012. Editor’s comments: Reflections on the craft of clear writing. Acad. Manage. Rev. 37, 493–501.
  • Sutton, R.I., Staw, B.M., 1995. What theory is not. Adm. Sci. Q. 40, 371–384.
  • Weick, K.E., 1995. What theory is not, theorizing is. Adm. Sci. Q. 40, 385–390.

Session 10—Simple Regression & Correlations

  • Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N., Altman, D.G., 2016. Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur. J. Epidemiol. 31, 337–350.
  • Kelley, K., Maxwell, S.E., 2003. Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. Psychol. Methods 8, 305–321.
  • Kelley, K., Preacher, K.J., 2012. On effect size. Psychol. Methods 17, 137–152.
  • Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., Wagenmakers, E.-J., 2016. The fallacy of placing confidence in confidence intervals. Psychon. Bull. Rev. 23, 103–123.
  • Nimon, K.F., Oswald, F.L., 2013. Understanding the results of multiple linear regression. Organizational Research Methods 16, 650–674.

Session 11—Experiments & Making Comparisons

  • Carter, S.P., Greenberg, K., Walker, M.S., 2017. The impact of computer usage on academic performance: Evidence from a randomized trial at the United States Military Academy. Econ. Educ. Rev. 56, 118–132.
  • Chandler, J.J., Paolacci, G., 2017. Lie for a dime: When most prescreening responses are honest but most study participants are impostors. Social Psychological and Personality Science Forthcoming, 1–9.
  • Doyen, S., Klein, O., Pichon, C.-L., Cleeremans, A., 2012. Behavioral priming: It’s all in the mind, but whose mind? PLoS One 7, e29081.
  • Imai, K., Tingley, D., Yamamoto, T., 2013. Experimental designs for identifying causal mechanisms. J. R. Stat. Soc. Ser. A Stat. Soc. 176, 5–51.
  • Kagan, E., Leider, S., Lovejoy, W.S., 2017. Ideation–execution transition in product development: An experimental analysis. Manage. Sci. Forthcoming. https://doi.org/10.1287/mnsc.2016.2709
  • Shepherd, D.A., Patzelt, H., Berry, C.M., 2017. Why didn’t you tell me? Voicing concerns over objective information about a project’s flaws. J. Manage. Forthcoming, 1–27.

Session 12—Linear Modeling

  • Interpreting logistic regression, Part 1
  • Interpreting logistic regression, Part 2
  • Bernerth, J.B., Aguinis, H., 2016. A critical review and best-practice recommendations for control variable usage. Pers. Psychol. 69, 229–283.
  • Spector, P.E., Brannick, M.T., 2011. Methodological urban legends: The misuse of statistical control variables. Organizational Research Methods 14, 287–305.
  • Wiersema, M.F., Bowen, H.P., 2009. The use of limited dependent variable techniques in strategy research: Issues and methods. Strat. Mgmt. J. 30, 679–692.

Session 13—Bayesian Linear Modeling

  • Gelman, A., Shalizi, C.R., 2013. Philosophy and the practice of Bayesian statistics. Br. J. Math. Stat. Psychol. 66, 8–38.
  • Matzke, D., Boehm, U., Vandekerckhove, J., 2018. Bayesian inference for psychology, part III: Parameter estimation in nonstandard models. Psychon. Bull. Rev. 25, 77–101.
  • Merkle, E.C., Wang, T., 2018. Bayesian latent variable models for the analysis of experimental psychology data. Psychon. Bull. Rev. 25, 256–270.
  • Rouder, J.N., Haaf, J.M., Vandekerckhove, J., 2018. Bayesian inference for psychology, Part IV: Parameter estimation and Bayes factors. Psychon. Bull. Rev. 25, 102–113.



Three statisticians go hunting. The first one shoots, and misses far right. The second one shoots, and misses far left. The third one says, “We got it!”