Sunday, March 25, 2018

To Code or Not to Code (in intro statistics)?

Last week we wrapped Stats 60/Psych 10, which was the first time I have ever taught such a course.  One of the goals of the course was for the students to develop enough data analysis skill in R to be able to go off and do their own analyses, and it seems that we were fairly successful in this.  To quantify our performance I used data from an entrance survey (which asked about previous programming experience) and an exit survey (which asked about self-rated R skill on a 1-7 scale).  Here are the data from the exit survey, separated by whether the students had any previous programming experience:
This shows us that there are now about fifty Stanford undergrads who had never programmed before and who now feel that they have at least moderate R ability (3 or above).  Some comments on the survey question "What were your favorite aspects of the course?" also reflected this (these are all from people who had never programmed before):

  • The emphasis on learning R was valuable because I feel that I've gained an important skill that will be useful for the rest of my college career.
  • I feel like I learned a valuable skill on how to use R
  • Gradually learning and understanding coding syntax in R
  • Finally getting code right in R is a very rewarding feeling
  • Sense of accomplishment I got from understanding the R material on my own
At the same time, there was a substantial contingent of the class that did not like the coding component.  This was evident in some comments on the survey question "What were your least favorite aspects of the course?":
  • R coding. It is super difficult to learn as a person with very little coding background, and made this class feel like it was mostly about figuring out code rather than about absorbing and learning to apply statistics.
  • My feelings are torn on R. I understand that it's a useful skill & plan to continue learning it after the course (yay DataCamp), but I also found it extremely frustrating & wouldn't have sought it out to learn on my own.
  • I had never coded before, nor have I ever taken a statistics course. For me, trying to learn these concepts together was difficult. I felt like I went into office hours for help on coding, rather than statistical concepts.
One of the major challenges of the quarter system is that we only have 10 weeks to cover a substantial amount of material, which has left me asking myself whether it is worth it to teach students to analyze data in R, or whether I should instead use one of the newer open-source graphical statistics packages, such as JASP or Jamovi.  The main pro that I see of moving to a graphical package is that the students could spend more time focusing on statistical concepts, and less time trying to understand R programming constructs like pipes and ggplot aesthetics that have little to do with statistics per se.   On the other hand, there are several reasons that I decided to teach the course using R in the first place:
  • Many of the students in the class come from humanities departments where they would likely never have a chance to learn coding.  I consider computational literacy (including coding) to be essential for any student today (regardless of whether they are from sciences or the humanities), and this course provides those students with a chance to acquire at least a bit of skill and hopefully inspires curiosity to learn more.
  • Analyzing data by pointing and clicking is inherently non-reproducible, and one of the important aspects of the course was to focus the students on the importance of reproducible research practices (e.g. by having them submit RMarkdown notebooks for the problem sets and final project). 
  • A big part of working with real data is wrangling the data into a form where the statistics can actually be applied.  Without the ability to code, this becomes much more difficult.
  • The course focuses a lot on simulation and randomization, and I'm not sure that the interactive packages will be useful for instilling these concepts.
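To make the last point concrete, here is a minimal sketch of the kind of randomization exercise I have in mind: shuffle the group labels many times and ask how often the shuffled difference in means is at least as large as the observed one. It is written with Python's standard library rather than the R we used in class, and the scores, the function name `permutation_test`, and its parameters are all made up for illustration; treat it as a sketch under those assumptions, not as course material.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration; returns (observed_diff, p_value).
    """
    rng = random.Random(seed)  # fixed seed so the exercise is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_shuffles):
        # Shuffling the pooled scores breaks any link between score and group
        rng.shuffle(pooled)
        shuffled_diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(shuffled_diff) >= abs(observed):
            extreme += 1

    # Add 1 to numerator and denominator so the estimate is never exactly zero
    p_value = (extreme + 1) / (n_shuffles + 1)
    return observed, p_value


# Made-up scores, for illustration only
treatment = [12.1, 14.3, 13.8, 15.2, 14.9, 13.5]
control = [11.0, 12.2, 11.8, 12.9, 11.5, 12.4]

observed, p_value = permutation_test(treatment, control)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```

The point of having students write the loop themselves is that the p-value stops being a number a package hands back and becomes a simple proportion they have counted.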
I'm interested to hear your thoughts about this tradeoff: Is it better for the students to walk away with some R skill but less conceptual statistical knowledge, or greater conceptual knowledge without the ability to implement it in code?  Please leave your thoughts in the comments below.


  1. I would say definitely code, making it part of the learning process.
    Students might be scared when they see code and think they might be evaluated on coding, but unfortunately they tend to discard the coding part if it's not an explicit part of the evaluation, and it's a pity.
    So I would make it clear from the start that coding is part of the training, and functional to improve the learning, and to express what was learned.

  2. thanks Daniele! That is how I tried to do it this time - though it's clear from some of the comments that we didn't do a good enough job integrating the coding content and statistical content.

  3. > trying to learn these concepts together was difficult

    I don't think teaching coding and statistics in one bundle is a profitable way to learn either. I totally agree that both need to be learned, and not coding these days is a kind of illiteracy. But just as I would not teach English and statistics in the same course, I wouldn't do that with coding and statistics. Either make basic coding skills a prerequisite for the statistics course, or teach statistical concepts in an abstract way without a prerequisite and then later get to the applied stuff in a course that does require coding.

    By the way, I learned to code in SPSS, then R. I only started making sense of R past Stack Overflow copy/paste after I learned Python, and I only made sense of code structure after I learned Haskell. Yes, doing stats provided the motivation to learn to code, but no, R is not a good language to learn coding concepts in.

    1. thanks for the thoughts and the interesting analogy. requiring coding as a prereq is a non-starter for this course.

    2. Then consider teaching statistics first. Learning works best when the novelty is confined to a single dimension. Or break the course into separate blocks, and teach one (either statistics or coding) in isolation first.

  4. Unfortunately, the most positive evaluations go to easy-A classes that do not push students. Down the road you'll meet lots of students who say: that class was so hard, but I learnt statistics and how to do the things I needed for this really cool project. You can't judge a new car when you drive out of the dealership; try asking in 3 years. ...

  5. The following thoughts were emailed by E.-J. Wagenmakers and are being posted with his consent:

    Dear Russ,
    My thoughts:
    1. I love R, and I have no problems whatsoever with teaching students
    R. Every student should learn how to code in R. It does seem to me
    that it is a separate skill from stats, but then again, if you teach
    stats through lots of simulations, there is no going around R.
    2. "Analyzing data by pointing and clicking is inherently
    non-reproducible" -> I respectfully disagree. The JASP files store not
    only the annotated output, but also store the input options that gave
    rise to that output. So I can send you a jasp file and you will know
    *exactly* what I've done.
    3. For me, as an R programmer, conducting an analysis in R takes
    several orders of magnitude longer than doing it in JASP.
    4. With JASP I know I am not making some sort of stupid programming mistake! This means I don't have to debug my code. This is a clear advantage of GUIs that is missing from your list. Consider advising a student. You either get an analysis output from JASP, or from the student's own R code. Which result do you trust?
    5. Two years later, will the students still program in R? Or will they
    be sick of having to load the package you need for an ANOVA ("wait, we also needed to code some stuff as factors, right?"). Or will they have forgotten most of it?
    6. I don't think the choice between R and JASP is an either-or situation.
    7. Let me stress again how much I like R, and how much I appreciate
    the courses that teach R. It is a fabulous tool and, in fact, JASP
    would not have been possible without R and its packages. Currently the JASP team is
    working to achieve complete synergy with R; this is under development but we are excited about the possibilities for the future.

  6. Hi there! (Disclaimer: mostly anecdata to follow)

    As a post-bach RA in a clinical neuroscience lab aiming to pursue a PhD in Psych/Neuroscience, I wish that I had gotten coding experience during undergrad. I majored in Psychobiology (no coding required for my major) at UCLA, and I feel like I'm playing catch up since coding is pretty essential for fMRI paradigms/processing pipelines/analysis.

    After consulting with my roommates (both previous Econ majors who now work in the business side of tech), they believe that coding/data wrangling is also critical for success in their futures, so regardless of the field, learning R would be valuable to all of us. We're currently enrolled in online coding classes (Python, SQL, Javascript) through UCLA's extension program.

    Personally, I think that integrating stats and coding is parsimonious. I also learned SPSS first in undergrad, then R post grad. There were definitely headbanging moments while learning R, but I wish I could've begun to learn it in my first psych stats course despite only having 10 weeks to do so.

    Best wishes,
    Sarah C.

  7. I strongly believe that the best way to give people an intuitive understanding of statistics is by getting them to simulate data and then run analysis. If you do this, you see a light go on as they get, for the first time, what a p-value is and why p-hacking is serious. It's also good for getting a grip on just how seriously sample size affects power. Therefore I'd go for teaching via R, but it's all about how you do that. I'd say that if beginners are being exposed to ggplot2 and pipes, then you're doing it wrong. The R needs to be as simple as is compatible with the statistics you are learning, with exercises that they can work through one line at a time: then they can run the code, look at the output, tweak the code, see what happens. If they are natural coders, you can point them to books that will explain how to write elegant scripts, but I think for ease of understanding, clunky scripts can be better. I'm trying to develop teaching scripts myself and always interested in feedback: see

    1. Thanks Dorothy! I agree regarding ggplot - its API has a seriously steep learning curve, though it is quite powerful once you learn how to use it. regarding pipes, I think there are things that are made so easy by the tidyverse (e.g. grouping and summarizing) that would be a lot more work without it. I was also swayed by the arguments here:

  8. I would not base any decisions on these kinds of data, even tentatively. Self-efficacy is a terrible predictor of performance (Dunning-Kruger, etc.). Both the quantitative and qualitative data reflect the impact of prior knowledge on confidence, which is a very different thing from the hypothesis that coding with R enhances statistics learning outcomes. Also, if we care about diversification of STEM fields for which statistics is an entry point, orienting a course toward coding will give an increased confidence advantage to people with prior computing backgrounds at the expense of those without. Those individuals with these backgrounds are vastly more likely to be white, male, and from upper SES ranges. Those without are the people whom we struggle to bring into computationally intensive fields.

    Sorry to be a downer.

    1. Thanks David, great points. Just a couple of comments:
      - I realize that self-rated skill is not a great predictor of actual performance, but in some ways I am just as interested in self-efficacy here as I am in actual skill. No one is going to become a highly skilled R programmer after just 10 weeks, but if they feel like they have some base skill on which they can build then they are more likely to pursue additional training in the future.
      - Your points about diversity are well taken, but there is another way to view it: we are offering a course that introduces coding to a set of people who otherwise would likely never have encountered it (since this class is heavily populated with individuals from arts and humanities). If we can get some of them interested in using statistics and computing in their home disciplines (which was the goal of our class project), then I would consider that a win because it would almost certainly be reaching some of those people who had self-selected against computational majors. I have to look at the bright side...

    2. My first question would be about demographics: not only majors (Liberal Arts, etc.) but also elective or non-elective, grad or undergrad. What are students hoping to get out of the course? Statistics and programming are complementary skills for a wide variety of jobs. A self-rating might correlate with what students hoped to gain from electing the course. I would also not assume that SES or class correlates more strongly with learning than choice and other motivations about what the students hope to get out of the course, so offering basic skills and collecting self-ratings can indicate an appreciation of both fields, which they may pursue further. A worthy statistical objective.

  9. hi russ,

    just thought i'd comment on this:

    > Analyzing data by pointing and clicking is inherently non-reproducible

    i don't think this is true. in jamovi all the options you use to run an analysis, the data used, and the results are all bundled together in the same file. you can click on an earlier analysis, and see the options which were used. better, you can send a file to a colleague, they can open the file, and see what options you used. this makes reproducibility so simple, and people do it without even realising.

    i'd also mention that jamovi includes an 'R syntax mode', where it will output the equivalent R syntax. this lets you copy/paste it into R studio or the like. in this way, jamovi is a great way to ease people into using R.

    check it out