Tuesday, April 17, 2018

How can one do reproducible science with limited resources?

When I visit other universities to talk, we often end up having free-form discussions about reproducibility at some point during the visit.  During a recent such discussion, one of the students raised a question that regularly comes up in various guises. Imagine you are a graduate student who desperately wants to do fMRI research, but your mentor doesn’t have a large grant to support your study.  You cobble together funds to collect a dataset of 20 subjects performing your new cognitive task, and you wish to identify the whole-brain activity pattern associated with the task. Then you happen to read “Scanning the Horizon”, which points out that a study with only 20 subjects is not even sufficiently powered to find the activation expected from a coarse comparison of motor activity to rest, much less to find the subtle signature of a complex cognitive process.  What are you to do?
In these discussions, I often make a point that is statistically correct but personally painful to our hypothetical student:  The likelihood of such a study identifying a true positive result if it exists is very low, and the likelihood of any positive results being false is high (as outlined by Button et al., 2013), even if the study was fully pre-registered and there is no p-hacking.  In the language of clinical trials, this study is futile, in the sense that it is highly unlikely to achieve its aims. In fact, such a study is arguably unethical, since the (however minuscule) risks of participating in the study are not offset by any potential benefit to the subject or to society.  This raises a dilemma: How are students with limited access to research funding supposed to gain experience in an expensive area of research and test their ideas against nature?
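
To make the point about power and false positives concrete, here is a rough sketch in R. The effect size, the pre-study odds, and the use of a single one-sample t-test are all illustrative assumptions rather than values taken from the papers mentioned above; a real whole-brain analysis applies a far stricter corrected threshold, so its power would be lower still. The formula for positive predictive value (PPV) is the one given by Button et al. (2013).

```r
# Power of a single one-sample t-test with 20 subjects.
# delta = 0.3 (in SD units) is an assumed effect size, for illustration only.
pw <- power.t.test(n = 20, delta = 0.3, sd = 1,
                   sig.level = 0.05, type = "one.sample")$power

# Positive predictive value (Button et al., 2013):
#   PPV = (power * R) / (power * R + alpha)
# where R is the pre-study odds that a tested effect is real.
ppv <- function(power, prior_odds, alpha = 0.05) {
  (power * prior_odds) / (power * prior_odds + alpha)
}

# Assumed pre-study odds of 1:4 (one in five tested effects is real).
prior_odds <- 0.25

cat("Power with n = 20:", round(pw, 2), "\n")
cat("PPV at that power:", round(ppv(pw, prior_odds), 2), "\n")
cat("PPV at 80% power: ", round(ppv(0.80, prior_odds), 2), "\n")
```

Changing the assumed effect size or pre-study odds shows how quickly the chance that a "significant" result is real drops as power falls.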

I have struggled with how to answer these questions over the last few years.  I certainly wouldn't want to suggest that only students from well-funded labs or institutions should be able to do the science that they want to do.  But at the same time, giving students a pass on futile studies will have a dangerous influence, since many of those studies will be submitted for publication and will thus increase the number of false reports (positive or negative) in the literature.  As Tal Yarkoni said in his outstanding “Big Correlations in Little Studies” paper:
Consistently running studies that are closer to 0% power than to 80% power is a sure way to ensure a perpetual state of mixed findings and replication failures.
Thus, I don’t think that the answer is to say that it’s OK to run underpowered studies.  In thinking about this issue, I’ve come up with a few possible ways to address the challenge.

1) "if you can’t answer the question you love, love the question you can"

In an outstanding reflection published last year in the Journal of Neuroscience, Nancy Kanwisher said the following in the context of her early work on face perception:
I had never worked on face perception because I considered it to be a special case, less important than the general case of object perception. But I needed to stop messing around and discover something, so I cultivated an interest in faces. To paraphrase Stephen Stills, if you can’t answer the question you love, love the question you can.
In the case of fMRI, one way to find a question that you can answer is to look at shared datasets.  There is now a huge variety of shared data available from resources including OpenfMRI/OpenNeuro, FCP/INDI, ADNI, the Human Connectome Project, and OASIS, just to name a few. If  a relevant dataset is not available openly but you know of a paper where someone has reported such a dataset, you can also contact those authors and ask whether they would be willing to share their data (often with an agreement of coauthorship). An example of this from our lab is a recent paper by Mac Shine (published in Network Neuroscience), in which he contacted the authors of two separate papers with relevant datasets and asked them to share the data. Both agreed, and the results came together into a nice package.  These were pharmacological fMRI studies that would not have even been possible within my lab, so the sharing of data really did open up a new horizon for us.

Another alternative is to do a meta-analysis, either based on data available from sites like Neurosynth or Neurovault, or by requesting data directly from researchers.  As an example, a student in one of my graduate classes did a final project in which he requested the data underlying meta-analyses published by two other groups, and then combined these to perform a composite meta-analysis, which was ultimately published.  

2) Focus on cognitive psychology and/or computational models for now

One of my laments regarding the training of cognitive neuroscientists in today’s climate is that their training is generally tilted much more strongly towards the neuroscience side (and particularly focused on neuroimaging methods), at the expense of training in good old-fashioned cognitive psychology.  As should be clear from many of my writings, I think that a solid training in cognitive psychology is essential in order to do good cognitive neuroscience; certainly just as important as knowing how to properly analyze fMRI data. Increasingly, this means thinking about computational models for cognitive processes.  Spending your graduate years focusing on designing cognitive studies and building computational models of them will put you in an outstanding position to get a good postdoc in a neuroimaging lab that has the resources to support the kind of larger neuroimaging studies that are now required for reproducibility. I’ve had a couple of people from pure cognitive psychology backgrounds enter my lab as postdocs, and their NIH fellowship applications were both funded on the first try, because the case for additional training in neuroscience was so clear.  Once you become skilled at cognition and (especially) computation, imaging researchers will be chomping at the bit to work with you (I know I would!). In the meantime you can also start to develop chops at neuroimaging analysis using shared data as outlined in #1 above.

3) Team up

The field of genetics went through a similar reckoning with underpowered studies more than a decade ago, and the standard in that field is now for large genome-wide association studies which often include tens of thousands of subjects.  They also usually include tens of authors on each paper, because amassing such large samples requires more resources than any one lab can possess. This strategy has started to appear in neuroimaging through the ENIGMA consortium, which has brought together data from many different imaging labs to do imaging genetics analyses.  If there are other labs working on similar problems, see if you can team up with them to run a larger study; you will likely have to make compromises, but a reproducible study is worth it (cf. #1 above).

4) Think like a visual neuroscientist

This one won’t work for every question, but in some cases it’s possible to focus your investigation on a much smaller number of individuals who are characterized much more thoroughly; instead of collecting an hour of data each on 20 people, collect 4 hours of data per person on 5 people.  This is the standard approach in visual neuroscience, where studies will often have just a few subjects who have been studied in great detail, sometimes with many hours of scanning per individual (e.g. see any of the recent papers from Jack Gallant’s lab for examples of this strategy). Under this strategy you don’t use standard group statistics, but instead present the detailed results from each individual; if they are consistent enough across the individuals then this might be enough to convince reviewers, though the farther you get from basic sensory/motor systems (where the variance between individuals is expected to be relatively low) the harder it will be to convince them.  It is essential to keep in mind that this kind of analysis does not allow one to generalize beyond the sample of individuals who were included in the study, so any resulting papers will be necessarily limited in the conclusions they can draw.

5) Carpe noctem

At some imaging centers, the scanning rates become drastically lower during off hours, such that the funds that would buy 20 hours of scanning during prime time might stretch to buy 50 or more hours late at night.  A well known case is the Midnight Scan Club at Washington University, which famously used cheap late night scan time to characterize the brains of ten individuals in detail. Of course, scanning in the middle of the night raises all sorts of potential issues about sleepiness in the scanner (as well as in the control room), so it shouldn’t be undertaken without thoroughly thinking through how to address those issues, but it has been a way that some labs have been able to stretch thin resources much further.  I don’t want this to be taken as a suggestion that students be forced to work both day and night; scanning into the wee hours should never be forced upon a student who doesn’t want to do it, and the rest of their work schedule should be reorganized so that they are not literally working day and night.

I hope these ideas are useful.  If you have other ideas, please leave them in the comments section below!

(PS: Thanks to Pat Bissett and Chris Gorgolewski for helpful comments on a draft of this piece!)

Sunday, March 25, 2018

To Code or Not to Code (in intro statistics)?

Last week we wrapped Stats 60/Psych 10, which was the first time I have ever taught such a course.  One of the goals of the course was for the students to develop enough data analysis skill in R to be able to go off and do their own analyses, and it seems that we were fairly successful in this.  To quantify our performance I used data from an entrance survey (which asked about previous programming experience) and an exit survey (which asked about self-rated R skill on a 1-7 scale).  Here are the data from the exit survey, separated by whether the students had any previous programming experience:
This shows us that there are now about fifty Stanford undergrads who had never programmed before and who now feel that they have at least moderate R ability (3 or above).  Some comments on the survey question "What were your favorite aspects of the course?" also reflected this (these are all from people who had never programmed before):

  • The emphasis on learning R was valuable because I feel that I've gained an important skill that will be useful for the rest of my college career.
  • I feel like I learned a valuable skill on how to use R
  • Gradually learning and understanding coding syntax in R
  • Finally getting code right in R is a very rewarding feeling
  • Sense of accomplishment I got from understanding the R material on my own
At the same time, there was a substantial contingent of the class that did not like the coding component.  This was evident in some comments on the survey question "What were your least favorite aspects of the course?":
  • R coding. It is super difficult to learn as a person with very little coding background, and made this class feel like it was mostly about figuring out code rather than about absorbing and learning to apply statistics.
  • My feelings are torn on R. I understand that it's a useful skill & plan to continue learning it after the course (yay DataCamp), but I also found it extremely frustrating & wouldn't have sought it out to learn on my own.
  • I had never coded before, nor have I ever taken a statistics course. For me, trying to learn these concepts together was difficult. I felt like I went into office hours for help on coding, rather than statistical concepts.
One of the major challenges of the quarter system is that we only have 10 weeks to cover a substantial amount of material, which has left me asking myself whether it is worth it to teach students to analyze data in R, or whether I should instead use one of the newer open-source graphical statistics packages, such as JASP or Jamovi.  The main pro that I see of moving to a graphical package is that the students could spend more time focusing on statistical concepts, and less time trying to understand R programming constructs like pipes and ggplot aesthetics that have little to do with statistics per se.   On the other hand, there are several reasons that I decided to teach the course using R in the first place:
  • Many of the students in the class come from humanities departments where they would likely never have a chance to learn coding.  I consider computational literacy (including coding) to be essential for any student today (regardless of whether they are from sciences or the humanities), and this course provides those students with a chance to acquire at least a bit of skill and hopefully inspires curiosity to learn more.
  • Analyzing data by pointing and clicking is inherently non-reproducible, and one of the important aspects of the course was to focus the students on the importance of reproducible research practices (e.g. by having them submit RMarkdown notebooks for the problem sets and final project). 
  • A big part of working with real data is wrangling the data into a form where the statistics can actually be applied.  Without the ability to code, this becomes much more difficult.
  • The course focuses a lot on simulation and randomization, and I'm not sure that the interactive packages will be useful for instilling these concepts.
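
As an illustration of the last point, a randomization test takes only a few lines of R, and every step of the logic is visible to the student. This is a minimal sketch with simulated data; the group sizes, means, and variable names are made up for the example and are not taken from the course materials.

```r
set.seed(1)  # make the simulation reproducible

# Simulated scores for two hypothetical groups of 30 students each
group <- rep(c("A", "B"), each = 30)
score <- c(rnorm(30, mean = 100, sd = 15), rnorm(30, mean = 110, sd = 15))

# Difference in group means (B minus A)
mean_diff <- function(score, group) {
  mean(score[group == "B"]) - mean(score[group == "A"])
}
observed <- mean_diff(score, group)

# Null distribution: shuffle the group labels many times
null_diffs <- replicate(5000, mean_diff(score, sample(group)))

# Two-sided p-value: proportion of shuffles at least as extreme as observed
p_value <- mean(abs(null_diffs) >= abs(observed))
cat("Observed difference:", round(observed, 2), " p =", p_value, "\n")
```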
I'm interested to hear your thoughts about this tradeoff: Is it better for the students to walk away with some R skill but less conceptual statistical knowledge, or greater conceptual knowledge without the ability to implement it in code?  Please leave your thoughts in the comments below.

Monday, January 22, 2018

Defaults in R can make debugging incredibly hard for beginners

I am teaching a new undergraduate statistics class at Stanford, and an important part of the course is teaching students to run their own analyses using R/RStudio.  Most of the students have never coded before, and debugging turns out to be one of the major challenges. Working with students over the last few days I have found that a couple of the default features in R can combine to make debugging very difficult on occasion.  Changing these defaults could have a big impact on new users' early learning experiences.

One of the datasets that we use is the NHANES dataset via the NHANES library.  Over the last few days several students have experienced very strange problems, where the NHANES data frame doesn’t contain the appropriate data, even after restarting R and reloading the NHANES library.  It turns out that this is due to several “features” in R:
  • Users are asked when exiting whether to save the workspace image, and the default is to save it.
  • The global workspace (saved in ~/.RData) is by default automatically loaded upon starting R.
  • When a package is loaded that contains a data object, this object is masked by any object in the global workspace with the same name.  

Here is an example.  First I load the NHANES library, and check that the NHANES data frame contains the appropriate data.

> library(NHANES)
> head(NHANES)
     ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3    Education MaritalStatus    HHIncome HHIncomeMid
1 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
2 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
3 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
4 51625  2009_10   male   4       0-9        49 Other  <NA>         <NA>          <NA> 20000-24999       22500
5 51630  2009_10 female  49     40-49       596 White  <NA> Some College   LivePartner 35000-44999       40000
6 51638  2009_10   male   9       0-9       115 White  <NA>         <NA>          <NA> 75000-99999       87500

Now let’s say that I accidentally set NHANES to some other value:

> NHANES <- NA
> NHANES
[1] NA

Now I quit RStudio, clicking the default “Save” option to save the workspace, and then restart RStudio. I get a message telling me that the workspace was loaded, and I see that my altered version of the NHANES variable still exists.  I would think that reloading the NHANES library should fix this, but this is what happens:

> library(NHANES)

Attaching package: ‘NHANES’

The following object is masked _by_ ‘.GlobalEnv’:

    NHANES

> NHANES
[1] NA

That is, objects in the global environment take precedence over newly loaded objects.  Someone who didn't know how to parse that message would have no idea that this loading operation is having no effect.  The only way to rid ourselves of this broken variable is either to restart R after removing ~/.RData, or to remove the variable from the global workspace:

> rm(NHANES, envir = globalenv())
> library(NHANES)
> head(NHANES)
     ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3    Education MaritalStatus    HHIncome HHIncomeMid
1 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
2 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
3 51624  2009_10   male  34     30-39       409 White  <NA>  High School       Married 25000-34999       30000
4 51625  2009_10   male   4       0-9        49 Other  <NA>         <NA>          <NA> 20000-24999       22500
5 51630  2009_10 female  49     40-49       596 White  <NA> Some College   LivePartner 35000-44999       40000
6 51638  2009_10   male   9       0-9       115 White  <NA>         <NA>          <NA> 75000-99999       87500
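
The same masking behavior can be reproduced and diagnosed without the NHANES package. The sketch below uses the built-in mtcars data frame (from the datasets package, which is attached by default) as a stand-in, and assumes it is run at the top level of an R session. find() reports every place on the search path where a name is defined, and the pkg::name syntax reaches the packaged copy even while it is masked.

```r
# Accidentally overwrite a packaged dataset in the global workspace
mtcars <- NA

# find() lists where the name is defined; the global environment
# comes first on the search path, so its copy wins
where_masked <- find("mtcars")
print(where_masked)

# The packaged copy is still reachable with an explicit namespace
print(is.data.frame(datasets::mtcars))

# Removing the global copy unmasks the packaged one
rm(mtcars, envir = globalenv())
where_clean <- find("mtcars")
print(where_clean)
print(is.data.frame(mtcars))
```

Running find() on a misbehaving object is a quick first check for students: if ".GlobalEnv" shows up alongside the package, a stale workspace copy is the likely culprit.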

This seems like a combination of really problematic default behaviors to me: automatically saving and then loading the global workspace by default, and masking objects loaded from libraries with objects in the workspace.  Together they have resulted in hours of unnecessary confusion and frustration for my students, at exactly the point in their learning curve where it is most problematic to do so.

I have one simple suggestion for the R developers: Please turn off automatic loading of the workspace by default.  It would be as simple as changing the default on one radio box, and it would potentially save new users lots of time and frustration.

Until that happens, beginning R users should do the following:

  • Under the Preferences panel (the General tab in RStudio), unselect the “Restore .RData into workspace on startup” option.  
  • I would also recommend setting the “Save workspace to .RData on exit” preference to “Never”, since I find that I generally only restart R when I want the entire workspace cleared out, so this option will never be of use to me.

Friday, December 2, 2016

The NIH should stop penalizing collaborative research

The National Institutes of Health (NIH) just put out its most recent strategic plan for research in behavioral and social sciences, which outlines four directions for behavioral/social research in the future (integrating neuroscience, better measurement, digital interventions, and large-scale data-intensive science).  All of these require collaboration between researchers across multiple domains, and indeed Collins and Riley point out the need for more "transdisciplinary" research in the behavioral and social sciences.  Given the strong trend towards transdisciplinary work over the last couple of decades, one would think that the NIH would do whatever it can to help remove barriers to the kinds of collaborations that are often necessary to make transdisciplinary science work.  Instead, collaborative work across institutions is actively penalized by the way that grants are awarded and administered.  A simple change to this could greatly smooth the ability for researchers across different institutions to collaborate, which is often necessary in order to bring together the best researchers across different scientific disciplines.

To explain the situation, first let's think about how one would administer a collaborative grant in the ideal world.  Let's say Professor Smith is a biologist at University X studying cancer, and Professor Jones is a computer scientist at University Y who has a new method for statistical analysis of cancer cells.  They decide to write a grant proposal together, and each of them develops a budget to pay for the people or materials necessary to do the research (let's say $150,000/year for Smith and $100,000/year for Jones).  The grant gets a very good priority score from the reviewers, and the agency decides to fund it.  In an ideal world, the agency would then send $150,000 to University X and $100,000 to University Y, and each would be treated as separate accounts from the standpoint of financial administration, even if their scientific progress would be judged as a whole.

At some agencies (for example, for collaborative grants from the National Science Foundation), this is how it works. However, for nearly all regular grants at the NIH, the entire grant gets awarded to the lead institution, and then this institution must dole out the money to the collaborators via subawards.  This might sound like no big deal, but it causes significant problems in two different ways:

The first problem has to do with "indirect costs" (also known as "overhead"), which are the funds that universities receive for hosting the grant; they are meant to pay for all of the administrative and physical overhead related to a research project.  The overhead rates for federal grants are negotiated between each institution and the federal government; for example, at Stanford the negotiated rate is 57%.  This means that if the grant was awarded by NIH to Dr. Smith at a university where the rate was 50%, then NIH would send the entire $250,000 in "direct costs" plus $125,000 in "indirect costs" to University X. In the situation above, University X would then create a subaward to University Y, and send them the $100,000 for Dr. Jones's part of the research.  But what about the indirect costs?  In the best-of-all-worlds model, each institution would take its proportion of the indirect costs directly. In the NIH model, what happens is that the subaward must include both the direct and indirect costs for University Y, which both must come out of the direct costs given to University X; that is, the subaward amount would be $150,000 ($100,000 in direct costs plus $50,000 in indirect costs).  This penalizes researchers because it means that they will generally get about 1/3 less direct funds for work to be done on a subaward than work done directly from the primary grant, since the indirect costs (usually around 50%) for the subrecipient have to come out of the direct costs of the main grant.  If grant funds were unlimited then this wouldn't be a problem, but many grant mechanisms have explicit caps on the amount that can be requested.  

In addition to the reduced budget due to treating subaward indirect costs as direct costs in the main budget, there is also an added extra expense due to "double dipping" of indirect costs.  When the primary institution computes its indirect costs, it is allowed to charge indirect costs on the first $25K of the subaward; this means that NIH ends up spending an extra ~$12.5K in indirect costs on each subaward.  This is presumably meant to cover the administrative budget of managing the subcontract, but it is another extra cost that arises for collaborative grants due to the NIH system.
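
The arithmetic behind these two penalties can be written out in a few lines of R. This is only a sketch using the round numbers from the Smith and Jones example, with a 50% indirect rate assumed at both universities; actual negotiated rates and budget rules vary by institution and funding mechanism.

```r
# Illustrative values from the Smith/Jones example
indirect_rate <- 0.50    # assumed rate at both universities
sub_direct    <- 100000  # Jones's direct costs at University Y

# Under the NIH model the subaward carries its own indirect costs,
# and the whole amount is charged to the primary award's direct costs
sub_total <- sub_direct * (1 + indirect_rate)

# Fraction of each subaward dollar that goes to indirect costs
# rather than to research
lost_fraction <- indirect_rate / (1 + indirect_rate)

# "Double dipping": the primary institution also charges its own
# indirect rate on the first $25,000 of the subaward
extra_indirect <- 25000 * indirect_rate

cat("Subaward charged to primary direct costs:",
    format(sub_total, big.mark = ","), "\n")
cat("Share of subaward not available for research:",
    round(lost_fraction, 3), "\n")
cat("Extra indirect costs paid by NIH:",
    format(extra_indirect, big.mark = ","), "\n")
```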

There is a second way that the NIH model makes collaboration harder, which is the greatly increased administrative burden for subaward management for grants lasting more than a year (as they almost always do).  When an investigator receives an NIH grant directly, the university treats the grant as lasting the entire period; that is, the researcher can spend the money continuously over the grant period.  If they don't spend the entire budget they can automatically carry over the leftover funds to the next year (as long as this amount isn't too much), and the university will also usually allow them to spend a bit of the next year's money before it arrives, since it's guaranteed to show up.  For subawards, the accounting works differently. Every year the primary recipient generates a new subaward, which can't happen until after the primary award for that year has been received and processed.  Then this new subaward has to be processed and given a new account number by the recipient's university. In addition, it is common for the lead school to not allow automatic carry-forward of unspent funds between years, and sometimes they require any unused funds to be relinquished and then re-awarded in the new year's funds.  All of these processes take time, which means that the subaward recipient is often left hanging without funding for periods of time, particularly at the end of the yearly grant period.  This is a pretty minimal cost compared to the actual cost described above, but it ends up taking a substantial amount of time away from doing research.

Why can't the NIH adopt a process like the one used for collaborative grants at NSF, in which the money goes directly to each institution separately and indirect costs are split proportionately?  This would be a way in which NIH could really put its money where its mouth is regarding collaborative transdisciplinary research.  

UPDATE: Vince Calhoun pointed out to me that the indirect costs in the subcontract do not actually count against the modular budget cap.  According to the NIH Guide on budget development: "Consortium F&A costs are NOT included as part of the direct cost base when determining whether the application can use the modular format (direct costs < $250,000 per year), or determining whether prior approval is needed to submit an application (direct costs $500,000 or more for any year)...NOTE: This policy does not apply to applications submitted in response to RFAs or in response to other funding opportunity announcements including specific budgetary limits." Thus, while this addresses the specific issue of modular budgets, it doesn't really help with the many funding opportunities that include specific budget caps, which covers nearly all of the grants that my lab applies for.

Thursday, September 1, 2016

Why preregistration no longer makes me nervous

In a recent presidential column in the APS Observer, Susan Goldin-Meadow lays out her concerns about preregistration.  She has two main concerns:

  • The first is the fear that preregistration will stifle discovery. Science isn’t just about testing hypotheses — it’s also about discovering hypotheses grounded in phenomena that are worthy of study. Aren’t we supposed to let the data guide us in our exploration? How can we make new discoveries if our studies need to be catalogued before they are run?
  • The second concern is that preregistration seems like it applies only to certain types of studies — experimental studies done in the lab under controlled conditions. What about observational research, field research, and research with uncommon participants, to name just a few that might not fit neatly into the preregistration script?

She makes the argument that there are two stages of scientific practice, and that pre-registration is only appropriate for one of them:
The first stage is devoted to discovering phenomena, describing them appropriately (i.e., figuring out which aspects of the phenomenon define it and are essential to it), and exploring the robustness and generality of the phenomenon. Only after this step has been taken (and it is not a trivial one) should we move on to exploring causal factors — mechanisms that precede the phenomenon and are involved in bringing it about, and functions that follow the phenomenon and lead to its recurrence….Preregistration is appropriate for Stage 2 hypothesis-testing studies, but it is hard to reconcile with Stage 1 discovery studies.

I must admit that I started out with exactly the same concerns about pre-registration.  I was worried that it would stifle discovery, and lead to turnkey science that would never tell us anything new. However, I no longer believe that.  It’s become clear to me that pre-registration is just as useful at the discovery phase as at the hypothesis-testing phase, because it helps keep us from fooling ourselves.  For discovery studies, we have adopted a strategy of pre-registering whatever details we can; in some cases this might just be the sample size, sampling strategy, and the main outcome of interest.  In these cases we will almost certainly do analyses beyond these, but having pre-registered these details gives us and others more faith in the results from the planned analyses; it also helps us more clearly distinguish between a priori and ad hoc analysis decisions (i.e., we can’t tell ourselves “we would have planned to do that analysis”); if it’s not pre-registered, then it’s treated through the lens of discovery, and thus not really believed until it’s replicated or otherwise validated.  In the future, in our publications we will be very clear about which results arose from pre-registered analyses and which were unplanned discovery analyses; I am hopeful that by helping more clearly distinguish between these two kinds of analyses, the move to pre-registration will make all of our science better.

I would also argue that the phase of “exploring the robustness and generality of the phenomenon”, which Goldin-Meadow assigns to the unregistered discovery phase, is exactly the phase in which pre-registration is most important. Imagine how many hours of graduate student time and gallons of tears could have been saved if this strategy had been used in the initial studies of ego depletion or facial feedback.  In our lab, it is now standard to perform a pre-registered replication before we believe any new behavioral phenomenon; it’s been interesting to see how many of them fall by the wayside.  In some cases we simply can’t do a replication due to the size or nature of the study; in these cases, we register whatever we can up front, and we try to reserve a separate validation dataset for testing of whatever results come from our initial discovery set.  You can see an example of this in our recent online study of self-regulation.
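
One simple way to reserve a validation dataset is to fix the split before looking at any outcomes. The following R sketch is only an illustration of that idea, not a description of our lab's actual pipeline; the participant IDs, sample size, and 50/50 split are hypothetical.

```r
set.seed(2016)  # fix the seed so the split itself is reproducible

# Hypothetical participant IDs; in practice these come from the dataset
ids <- sprintf("sub-%03d", 1:200)

# Randomly assign half of the participants to a held-out validation set
validation_ids <- sample(ids, size = length(ids) / 2)
discovery_ids  <- setdiff(ids, validation_ids)

# Exploratory analyses use only discovery_ids; the validation set is
# opened once, to test the specific results found in discovery
cat(length(discovery_ids), "discovery and",
    length(validation_ids), "validation participants\n")
```

The seed and the split rule are exactly the kind of detail that can be written into the pre-registration before any data are examined.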

I’m glad that this discussion is going on in the open, because I think a lot of my colleagues in the field share concerns similar to those expressed by Goldin-Meadow.  I hope that the examples of many successful labs now using pre-registration will help convince them that it really is a road to better science.

Wednesday, August 24, 2016

Interested in the Poldrack Lab for graduate school?

Update:  The Poldrack Lab will be accepting new graduate students for 2019.

This is the time of year when I start getting lots of emails asking whether I am accepting new grad students for next year.  The answer is almost always going to be yes (unless I am moving, and I don’t plan on doing that again for a long time!), because I am always on the lookout for new superstars to join the lab.  If you are interested, here are some thoughts and tips that I hope will help make you more informed about the process.  These are completely my own opinions, and some of them may be totally inaccurate regarding other PIs or graduate programs, so please take them for what they are worth and no more.

Which program should I apply to? I am affiliated with three graduate programs at Stanford: Psychology, Neuroscience, and Biomedical Informatics. In choosing a program, there are several important differences:

  • Research: While most of these programs are fairly flexible, there are generally some expectations regarding the kind of research you will do, depending on the specific program.  For example, if you are joining the BMI program then your work is expected to have at least some focus on novel data analysis or informatics methods, whereas if you are joining Psychology your work is expected to make some contact with psychological function. Having said that, most of what we do in our lab could be done by a student in any of these programs.
  • Coursework: Perhaps the biggest difference between programs is the kind of courses you are required to take. Each program has a set of core requirements.  In psychology, you will take a number of core courses in different areas of psychology (cognitive, neuroscience, social, affective, developmental).  In the neuroscience program you will take a set of core modules spanning different areas of neuroscience (including one on cognitive neuroscience that Justin Gardner and I teach), whereas in BMI you take core courses around informatics-related topics.  In each program you will also take elective courses (often outside the department) that establish complementary core knowledge that is important for your particular research; for example, you can take courses in our world-class statistics department regardless of which program you enroll in. One way to think about this is:  What do I want to learn about that is outside of my specific content area? Take a look at the core courses in each program and see which ones interest you the most.
  • First-year experience: In Psychology, students generally jump straight into a specific lab (or a collaboration between labs), and spend their first year doing a first-year project that they present to their area meeting at the end of the year. In Neuroscience and BMI, students do rotations in multiple labs in their first year, and are expected to pick a lab by the end of their first year. 
  • Admissions: All of these programs are highly selective, but each differs in the nature of its admissions process.  At one end of the spectrum is the Psychology admissions process, where initial decisions for who to interview are made by the combined faculty within each area of the department.  At the other end is the Neuroscience program, where initial decisions are made by an admissions committee.  As a generalization, I would say that the Psychology process is better for candidates whose interests and experience fit very closely with a specific PI or set of PIs, whereas the committee process caters towards candidates who may not have settled on a specific topic or PI.
  • Career positioning: I think that the specific department that one graduates from matters a lot less than people think it does.  For example, I have been in psychology departments that have hired people with PhDs in physics, applied mathematics, and computer science. I think that the work that you do and the skills that you acquire ultimately matter a lot more than the name of the program that is listed on your diploma.  

What does it take to get accepted? There are always more qualified applicants than there are spots in our graduate programs, and there is no way to guarantee admission to any particular program.  On the flipside, there are also no absolute requirements: A perfect GRE score and a 4.0 GPA are great, but we look at the whole picture, and other factors can sometimes outweigh a weak GRE score or GPA.  There are a few factors that are particularly important for admission to my lab:

  • Research experience: It is very rare for someone to be accepted into any of the programs I am affiliated with at Stanford without significant research experience.  Sometimes this can be obtained as an undergraduate, but more often successful applicants to our program have spent at least a year working as a research assistant in an active research laboratory.  There are a couple of important reasons for this.  First, we want you to understand what you are getting into; many people have rosy ideas of what it’s like to be a scientist, which can fall away pretty quickly in light of the actual experience of doing science.  Spending some time in a lab helps you make sure that this is how you want to spend your life. In addition, it provides you with someone who can write a recommendation letter that speaks very directly to your potential as a researcher.  Letters are a very important part of the admissions process, and the most effective letters are those that go into specific detail about your abilities, aptitude, and motivation.
  • Technical skills: The research that we do in my lab is highly technical, requiring knowledge of computing systems, programming, and math/statistics.  I would say that decent programming ability is a pretty firm prerequisite for entering my lab; once you enter the lab I want you to be able to jump directly into doing science, and this just can’t happen if you have to spend a year teaching yourself how to program from scratch. More generally, we expect you to be able to pick up new technical topics easily; I don’t expect students to necessarily show up knowing how a reinforcement learning model works, but I expect them to be able to go and figure it out on their own by reading the relevant papers and then implement it on their own. The best way to demonstrate programming ability is to show a specific project that you have worked on. This could be an open source project that you have contributed to, or a project that you did on the side for fun (for example, mine your own social media feed, or program a cognitive task and measure how your own behavior changes from day to day). If you don’t currently know how to program, see my post on learning to program from scratch, and get going!
  • Risk taking and resilience: If we are doing interesting science then things are going to fail, and we have to learn from those failures and move on.  I want to know that you are someone who is willing to go out on a limb to try something risky, and can handle the inevitable failures gracefully.  Rather than seeing a statement of purpose that only lists all of your successes, I find it very useful to also know about risks you have taken (be they physical, social, or emotional), challenges you have faced, failures you have experienced, and most importantly what you learned from all of these experiences.
What is your lab working on? The ongoing work in my lab is particularly broad, so if you want to be in a lab that is deeply focused on one specific question then my lab is probably not the right place for you.  There are a few broad questions that encompass much of the work that we are doing:
  • How can neuroimaging inform the structure of the mind?  My general approach to this question is outlined in my Annual Review chapter with Tal Yarkoni.  Our ongoing work on this topic is using large-scale behavioral studies (both in-lab and online) and imaging studies to characterize the underlying structure of the concept of “self-regulation” as it is used across multiple areas of psychology.  This work also ties into the Cognitive Atlas project, which aims to formally characterize the ontology of psychological functions and their relation to cognitive tasks. Much of the work in this domain is discovery-based and data-driven, in the sense that we aim to discover structure using multivariate analysis techniques rather than testing specific existing theories.
  • How do brains and behavior change over time?  We are examining this at several different timescales. First, we are interested in how experience affects value-based choices, and particularly how the exertion of cognitive control or response inhibition can affect representations of value (Schonberg et al., 2014). Second, we are studying dynamic changes in both resting state and task-related functional connectivity over the seconds/minutes timescale (Shine et al., 2016), in order to relate network-level brain function to cognition.  Third, we are mining the MyConnectome data and other large datasets to better understand how brain function changes over the weeks/months timescale (Shine et al., 2016; Poldrack et al., 2015).
  • How can we make science better?  Much of our current effort is centered on developing frameworks for improving the reproducibility and transparency of science.  We have developed the OpenfMRI and Neurovault projects to help researchers share data, and our Center for Reproducible Neuroscience is currently developing a next-generation platform for analysis and sharing of neuroimaging data.  We have also developed the Experiment Factory infrastructure for performing large-scale online behavioral testing.  We are also trying to do our best to make our own science as reproducible as possible; for example, we now pre-register all of our studies, and for discovery studies we try when possible to validate the results using a held-out validation sample.

These aren’t the only topics we study, and we are always looking for new and interesting extensions to our ongoing work, so if you are interested in other topics then it’s worth inquiring to see if they would fit with the lab’s interests.   At present, roughly half of the lab is engaged in basic cognitive neuroscience questions, and the other half is engaged in questions related to data analysis/sharing and open science.  This can make for some interesting lab meetings, to say the least. 

What kind of adviser am I? Different advisers have different philosophies, and it’s important to be sure that you pick an adviser whose style is right for you.  I would say that the most important characteristic of my style is that I aim to foster independent thinking in my trainees.  Publishing papers is important, but not as important as developing one’s ability to conceive novel and interesting questions and ask them in a rigorous way. This means that beyond the first year project, I don’t generally hand my students problems to work on; rather, I expect them to come up with their own questions, and then we work together to devise the right experiments to test them.  Another important thing to know is that I try to motivate by example, rather than by command.  I rarely breathe down my trainees’ necks about getting their work done, because I work on the assumption that they will work at least as hard as I work without prodding.  On the other hand, I’m fairly hands-on in the sense that I still love to get deep in the weeds of experimental design and analysis code.  I would also add that I am highly amenable to joint mentorship with other faculty.

If you have further questions about our lab, please don’t hesitate to contact me by email.  Unfortunately I don’t have time to discuss ongoing research with everyone who is interested in applying, but I try to do my best to answer specific questions about our lab’s current and future research interests. 

Sunday, August 21, 2016

The principle of assumed error

I’m going to be talking at the Neurohackweek meeting in a few weeks, giving an overview of issues around reproducibility in neuroimaging research.  In putting together my talk, I have been thinking about what general principles I want to convey, and I keep coming back to the quote from Richard Feynman in his 1974 Caltech commencement address: “The first principle is that you must not fool yourself and you are the easiest person to fool.”  In thinking about how we can keep from fooling ourselves, I have settled on a general principle, which I am calling the “principle of assumed error” (I doubt this is an original idea, and I would be interested to hear about relevant prior expressions of it).  The principle is that whenever one finds something using a computational analysis that fits with one’s predictions or seems like a “cool” finding, one should assume that it’s due to an error in the code rather than reflecting reality.  Having made this assumption, one should then do everything one can to find out what kind of error could have resulted in the effect.  This is really no different from the strategy that experimental scientists use (in theory), in which upon finding an effect they test every conceivable confound in order to rule them out as a cause of the effect.  However, I find that this kind of thinking is much less common in computational analyses. Instead, when something “works” (i.e. gives us an answer we like) we run with it, whereas when the code doesn’t give us a good answer then we dig around for different ways to do the analysis that give a more satisfying answer.  Because confirmation bias makes us more likely to accept errors that fit our hypotheses than those that do not, this procedure is guaranteed to increase the overall error rate of our research.
If this sounds a lot like p-hacking, that’s because it is; as Gelman & Loken pointed out in their Garden of Forking Paths paper, one doesn't have to be on an explicit fishing expedition in order to engage in practices that inflate error due to data-dependent analysis choices and confirmation bias.  Ultimately I think that the best solution to this problem is to always reserve a validation dataset to confirm the results of any discovery analyses, but before one burns their only chance at such a validation, it’s important to make sure that the analysis has been thoroughly vetted.

Having made the assumption that there is an error, how does one go about finding it?  I think that standard software testing approaches offer a bit of help here, but in general it’s going to be very difficult to find complex algorithmic errors using basic unit tests.  Instead, there are a couple of strategies that I have found useful for diagnosing errors.

Parameter recovery
If your model involves estimating parameters from data, it can be very useful to generate data with known values of those parameters and test whether the estimates match the known values.  For example, I recently wrote a Python implementation of the EZ-diffusion model, which is a simple model for estimating diffusion model parameters from behavioral data.  In order to make sure that the model is correctly estimating these parameters, I generated simulated data using parameters randomly sampled from a reasonable range (using the rdiffusion function from the rtdists R package), and then estimated the correlation between the parameters used to generate the data and the model estimates. I set an arbitrary threshold of 0.9 for the correlation between the estimated and actual parameters; since there will be some noise in the data, we can't expect them to match exactly, but this seems close enough to consider successful.  I set up a test using pytest, and then added CircleCI automated testing for my Github repo (which automatically runs the software tests any time a new commit is pushed to the repo)1. This shows how we can take advantage of software testing tools to do parameter recovery tests to make sure that our code is operating properly.  I would argue that whenever one implements a new model fitting routine, this is the first thing that should be done.
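
As a rough illustration of what such a parameter recovery test can look like, here is a minimal sketch. It does not use the EZ-diffusion model or rtdists; to stay self-contained it swaps in a simple linear model fit with numpy, and all of the function names (simulate_linear, fit_linear, parameter_recovery) and the simulation settings are hypothetical choices made for this example. Only the 0.9 correlation threshold comes from the description above.

```python
import numpy as np

def simulate_linear(intercept, slope, n_obs, noise_sd, rng):
    """Generate data from y = intercept + slope * x + Gaussian noise."""
    x = rng.uniform(0, 1, size=n_obs)
    y = intercept + slope * x + rng.normal(0, noise_sd, size=n_obs)
    return x, y

def fit_linear(x, y):
    """Estimate (intercept, slope) by ordinary least squares."""
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree comes first
    return intercept, slope

def parameter_recovery(n_datasets=100, n_obs=200, noise_sd=1.0, seed=0):
    """Correlate the generating parameters with their estimates."""
    rng = np.random.default_rng(seed)
    # sample the "true" parameters from an (assumed) reasonable range
    true_params = rng.uniform(-2, 2, size=(n_datasets, 2))
    est_params = np.array([
        fit_linear(*simulate_linear(b0, b1, n_obs, noise_sd, rng))
        for b0, b1 in true_params
    ])
    # one correlation per parameter: [intercept, slope]
    return [np.corrcoef(true_params[:, j], est_params[:, j])[0, 1]
            for j in range(2)]

def test_parameter_recovery():
    # pytest collects functions whose names start with test_
    for r in parameter_recovery():
        assert r > 0.9
```

Saved in a file such as test_recovery.py (a hypothetical name), this would be picked up by running pytest in the same directory. For a real model, the simulate and fit functions would be replaced by the data generator and the fitting routine under test.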

Imposing the null hypothesis
Another approach is to generate data for which the null hypothesis is true, and make sure that the results come out as expected under the null.  This is a good way to protect one from cases where the error results in an overly optimistic result (e.g. as I discussed here previously). One place I have found this particularly useful is in checking to make sure that there is no data peeking when doing classification analysis.  In this example (Github repo here), I show how one can use random shuffling of labels to test whether a classification procedure is illegally peeking at test data during classifier training. In the following function, there is an error in which the classifier is trained on all of the data, rather than just the training data in each fold:

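def cheating_classifier(X,y):
    # skf (an iterable of train/test index pairs) and knn (the classifier) are assumed to be defined earlier
    pred=numpy.zeros(len(y))
    for train,test in skf:
        knn.fit(X,y) # this is training on the entire dataset!
        pred[test]=knn.predict(X[test])
    return numpy.mean(pred==y)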

Fit to a dataset with a true relation between the features and the outcome variable, this classifier predicts the outcome with about 80% accuracy.  In comparison, the correct procedure (separating training and test data):

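def crossvalidated_classifier(X,y):
    # skf and knn are assumed to be defined earlier, as above
    pred=numpy.zeros(len(y))
    for train,test in skf:
        knn.fit(X[train],y[train]) # train only on the training data for this fold
        pred[test]=knn.predict(X[test])
    return numpy.mean(pred==y)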

predicts the outcome with about 68% accuracy.  How would we know that the former is incorrect?  What we can do is to perform the classification repeatedly, each time shuffling the labels.  This is basically making the null hypothesis true, and thus accuracy should be at chance (which in this case is 50% because there are two outcomes with equal frequency).  We can assess this using the following:

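def shuffle_test(X,y,clf,nperms=10000):
    acc=[] # accuracy for each shuffle
    y_shuf=y.copy() # shuffle a copy so that the original labels are untouched

    for i in range(nperms):
        numpy.random.shuffle(y_shuf) # break any relation between labels and features
        acc.append(clf(X,y_shuf)) # clf is one of the classifier functions above
    return acc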

This shuffles the data 10,000 times and assesses classifier accuracy.  When we do this with the crossvalidated classifier, we see that accuracy is now about 51% - close enough to chance that we can feel comfortable that our procedure is not biased.  However, when we submit the cheating classifier to this procedure, we see mean accuracy of about 69%; thus, our classifier will exhibit substantial classification accuracy even when there is no true relation between the labels and the features, due to overfitting of noise in the test data.
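
For readers who want to try the shuffling check without the original repo, here is a self-contained sketch of the same idea. It is not the code from the repo: it uses a small hand-written k-nearest-neighbor classifier in numpy as a stand-in for a library classifier, pure-noise features, random (unstratified) folds, and only 200 shuffles to keep it fast. All names and settings here (knn_predict, make_folds, classify, the sample size, k=5) are assumptions made for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training points (labels are 0/1)."""
    # squared Euclidean distances, shape (n_test, n_train)
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)

def make_folds(n, n_folds, rng):
    """Randomly split indices into folds; returns a list of (train, test) pairs."""
    folds = np.array_split(rng.permutation(n), n_folds)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(n_folds)]

def classify(X, y, folds, cheat=False):
    """Cross-validated accuracy; with cheat=True the classifier sees all the data."""
    pred = np.zeros(len(y), dtype=int)
    for train, test in folds:
        if cheat:
            pred[test] = knn_predict(X, y, X[test])  # test points are in the training set!
        else:
            pred[test] = knn_predict(X[train], y[train], X[test])
    return np.mean(pred == y)

def shuffle_test(X, y, folds, cheat, nperms, rng):
    """Mean accuracy over repeated label shuffles (i.e. with the null made true)."""
    return np.mean([classify(X, rng.permutation(y), folds, cheat)
                    for _ in range(nperms)])

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 5))    # features carry no information about the labels
y = np.repeat([0, 1], n // 2)  # two equally frequent outcomes, so chance is 50%
folds = make_folds(n, 4, rng)

null_honest = shuffle_test(X, y, folds, cheat=False, nperms=200, rng=rng)
null_cheat = shuffle_test(X, y, folds, cheat=True, nperms=200, rng=rng)
print("honest: %.2f, cheating: %.2f" % (null_honest, null_cheat))
```

In this sketch the honest procedure should land near chance, while the cheating version should sit clearly above it: each test point is also in the training set, so it is always its own nearest neighbor and its own label is one of the five votes.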

Randomization is not perfect; in particular, one needs to make sure that the samples are exchangeable under the null hypothesis.  This will generally be true when the samples were acquired through random sampling, but can fail when there is structure in the data (e.g. when the samples are individual subjects, but some sets of subjects are related). However, it’s often a very useful strategy when this assumption holds.

I’d love to hear other ideas about how to implement the principle of assumed error for computational analyses.  Please leave your comments below!

1 This should have been simple, but I hit some snags that point to just how difficult it can be to build truly reproducible analysis workflows. Running the code on my Mac, I found that my tests passed (i.e. the correlation between the estimated parameters using EZ-diffusion and the actual parameters used to generate the data was > 0.9), confirming that my implementation seemed to be accurate. However, when I ran it on CircleCI (which runs the code within an Ubuntu Linux virtual machine), the tests failed, showing much lower correlations between estimated and actual values. Many things differed between the two systems, but my hunch was that it was due to the R code that was used to generate the simulated data (since the EZ diffusion model code is quite simple). I found that when I updated my Mac to the latest version of the rtdists package used to generate the data, I reproduced the poor results that I had seen on the CircleCI test. (It turns out that the parameterization of the function that I was using had changed, leading to bad results with the previous function call.) My interim solution was to simply install the older version of the package as part of my CircleCI setup; having done this, the CircleCI tests now pass as well.