Monday, April 18, 2016

How folksy is psychology? The linguistic history of cognitive ontologies

I just returned from a fabulous meeting on Rethinking the Taxonomy of Psychology, hosted by Mike Anderson, Tim Bayne, and Jackie Sullivan.  I think that in another life I must have been a philosopher, because I always have so much fun hanging out with them, and this time was no different.  In particular, the discussions at this meeting moved from simply talking about whether there is a problem with our ontology (which is old hat at this point) to specifically how we can think about using neuroscience to revise the ontology.  I was particularly excited to see all of the interest from a group of young philosophers whose work is spanning philosophy and cognitive neuroscience, who I am counting on to keep the field moving forward!

I have long made the point that the conceptual structure of current psychology is not radically different from that of William James in the 19th century.  This seems plausible on its face if you look at some of the section headings from his 1890 Principles of Psychology:
  • “To How Many Things Can We Attend At Once?”
  • “The Varieties Of Attention.”
  • “The Improvement Of Discrimination By Practice”
  • “The Perception Of Time.”
  • “Accuracy Of Our Estimate Of Short Durations”
  • “To What Cerebral Process Is The Sense Of Time Due?”
  • “Forgetting.”
  • “The Neural Process Which Underlies Imagination”
  • “Is Perception Unconscious Inference?”
  • “How The Blind Perceive Space.”
  • “Emotion Follows Upon The Bodily Expression In The Coarser Emotions At Least.”
  • “No Special Brain-Centres For Emotion”
  • “Action After Deliberation”
Beyond the sometimes flowery language, these are all topics that one could imagine being topics of research papers today, but for my talk I wanted to see if there was more direct evidence that the psychological ontology has changed less over time (and is thus more "folksy") than ontologies in other sciences.   To address this, I did a set of analyses that looked at the linguistic history of terms in the contemporary psychological ontology (as defined in the Cognitive Atlas) as compared to terms from contemporary biology (as enshrined in the Gene Ontology).  I started (with a bit of help from Vanessa Sochat) by examining the proportion of terms from the Cognitive Atlas that were present in James' Principles (from the full text available here).  This showed that 22.9% of the terms in our current ontology were present in James's text (some examples are: goal, deductive reasoning, effort, false memory, object perception, visual attention, task set, anxiety, mental imagery, unconscious perception, internal speech, primary memory, theory of mind, judgment).
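As a rough illustration of this kind of term-matching (not the code actually used for the analysis), the sketch below counts how many terms from a list appear as whole words or phrases in a text. The function names, the toy term list, and the toy sentence are all assumptions made for the example; the real analysis used the full Cognitive Atlas term list and the full text of the book.

```python
import re


def term_present(term, text):
    """Return True if `term` occurs in `text` as a whole word or phrase.

    Matching is case-insensitive, and the words of a multi-word term may be
    separated by any whitespace (including line breaks in the source text).
    """
    words = [re.escape(w) for w in term.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def proportion_present(terms, text):
    """Return (proportion of terms found, list of the terms found)."""
    if not terms:
        raise ValueError("terms must not be empty")
    found = [t for t in terms if term_present(t, text)]
    return len(found) / len(terms), found


# Toy data, invented for illustration only.
toy_terms = ["visual attention", "effort", "working memory", "anxiety"]
toy_text = "The effort of visual\nattention may produce anxiety."

proportion, found_terms = proportion_present(toy_terms, toy_text)
print(f"{proportion:.1%} of terms present: {found_terms}")
```

Whole-word matching is a design choice here: without the word boundaries, a term like "effort" would also be counted when the text only contains "effortless".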

How does this compare to biology?  To ask this, I obtained two biology textbooks published around the same time as James' Principles (T. H. Huxley's Course of Elementary Instruction in Practical Biology from 1892, and T. J. Parker's Lessons in Elementary Biology from 1893), which are both available in full text from Google Books.  In each of these books I assessed the presence of each term from the Gene Ontology, separately for each of the GO subdomains (biological processes, molecular functions, and cellular components).  Here are the results:

GO subdomain (no. of terms)      Huxley        Parker        Overlap
biological process (28,566)      0.09% (26)    0.1% (32)     20
molecular functions (10,057)     0             0             -
cellular components (3,903)      1.05% (41)    1.01% (40)    25

The percentages of overlap are much lower, perhaps not surprisingly since the number of GO terms is so much larger than the number of Cognitive Atlas terms.  But even the absolute numbers are substantially lower, and there is not one mention of any of the GO molecular functions (striking but completely unsurprising, since molecular biology would not be developed for many more decades).

These results were interesting, but it could be that they are specific to these particular books, so I generalized the analysis using the Google N-Gram corpus, which indexes the presence of individual words and phrases across more than 3 million books.  Using a python package that accesses the ngram viewer API, I estimated the presence of all of the Cognitive Atlas terms as well as randomly selected subsets of each of the GO subdomains in the English literature between 1800 and 2000; I'm planning to rerun the analysis on the full corpus using the downloaded version of the N-grams corpus, but using this API required throttling that prevented me from using the full sets of GO terms.  Here are the results for the Cognitive Atlas:

It is difficult to imagine stronger evidence that the ontology of psychology is relying on pre-scientific concepts; around 80% of the one-word terms in the ontology were already in use in 1800! Compare this to the Gene Ontology terms (note that there were not enough single-word molecular function terms to get a reasonable estimate):

It's clear that while a few of the terms in these ontologies were in use prior to the development of the biosciences, the proportion is much smaller than what one sees for psychology. In my talk, I laid out two possibilities arising from this:

  1. Psychology has special access to its ontology that obviates the need for a rejection of folk concepts
  2. Psychology is due for a conceptual revolution that will leave behind at least some of our current concepts
My guess is that the truth lies somewhere in between these.  The discussions that we had at the meeting in London provided some good ideas about how to conceptualize the kinds of changes that neuroscience might drive us to make to this ontology. Perhaps the biggest question to come out of the meeting was whether a data-driven approach can ever overcome the fact that the data were collected from experiments that are based on the current ontology. I am guessing that it can (given, e.g. the close relations between brain activity present in task and rest), but this remains one of the biggest questions to be answered.  Fortunately there seems to be lots of interest and I'm looking forward to great progress on these questions in the next few years.
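The n-gram analysis described earlier in this post can be sketched roughly as follows. This is only an illustration under stated assumptions: it supposes that a table of yearly frequencies has already been obtained for each term, and it treats a term as "in use" from the first year its frequency exceeds a threshold. The function names, the threshold, and the toy frequencies are hypothetical and are not real N-gram values.

```python
def first_year_in_use(yearly_freq, min_freq=0.0):
    """Return the earliest year whose frequency exceeds `min_freq`, or None."""
    years = [year for year, freq in yearly_freq.items() if freq > min_freq]
    return min(years) if years else None


def proportion_in_use(term_freqs, year, min_freq=0.0):
    """Fraction of terms that first appeared at or before `year`."""
    if not term_freqs:
        raise ValueError("term_freqs must not be empty")
    n_in_use = 0
    for freqs in term_freqs.values():
        first = first_year_in_use(freqs, min_freq)
        if first is not None and first <= year:
            n_in_use += 1
    return n_in_use / len(term_freqs)


# Invented frequencies for illustration only (term -> {year: frequency}).
toy_freqs = {
    "attention": {1800: 2e-5, 1900: 3e-5, 2000: 4e-5},
    "memory": {1800: 5e-5, 1900: 5e-5, 2000: 6e-5},
    "apoptosis": {1800: 0.0, 1900: 0.0, 2000: 1e-6},
    "ribosome": {1800: 0.0, 1900: 0.0, 2000: 2e-6},
}

# Cumulative proportion of terms in use at each sampled year.
curve = {year: proportion_in_use(toy_freqs, year) for year in (1800, 1900, 2000)}
print(curve)
```

Plotting such a curve separately for each ontology gives the kind of comparison discussed above: a curve that starts high in 1800 indicates a vocabulary that largely predates the science.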

Friday, February 26, 2016

Reproducibility and quantitative training in psychology

We had a great Town Hall Meeting of our department earlier this week, which was focused on issues around reproducibility, which Mike Frank has already discussed in his blog.  A number of the questions that were raised by both faculty and graduate students centered around training, and this has gotten many of us thinking about how we should update our quantitative training to address these concerns.  Currently the graduate statistics course is fairly standard, covering basic topics in probability and statistics including basic probability theory, sampling distributions, null hypothesis testing, general(ized) linear models (regression, ANOVA), and mixed models, with exercises done primarily using R.  While many of these topics remain essential for psychologists and neuroscientists, it's equally clear that there are a number of other topics that we might want to cover that are highly relevant to issues of reproducibility:

  • the statistics of reproducibility (e.g., implications of power for predictive validity; Ioannidis, 2005)
  • Bayesian estimation and inference
  • bias/variance tradeoffs and regularization
  • generalization and cross-validation
  • model-fitting and model comparison
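To make the first item in the list above concrete, here is a minimal sketch of how statistical power affects the positive predictive value (PPV) of a significant finding, in the spirit of Ioannidis (2005). It assumes a single test with no bias term, and it expresses the prior as the probability that a tested hypothesis is true; the function name and the example numbers are illustrative assumptions.

```python
def positive_predictive_value(power, alpha, prior):
    """Probability that a significant result reflects a true effect.

    power: probability of detecting a true effect (1 - beta)
    alpha: false positive rate of the test
    prior: probability that a tested hypothesis is true
    """
    for name, value in (("power", power), ("alpha", alpha), ("prior", prior)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    total = true_positives + false_positives
    if total == 0.0:
        raise ValueError("no significant results are possible with these inputs")
    return true_positives / total


# Same alpha and prior, different power (illustrative values).
for power in (0.8, 0.2):
    ppv = positive_predictive_value(power=power, alpha=0.05, prior=0.1)
    print(f"power={power:.1f}: PPV={ppv:.2f}")
```

With the same significance threshold and prior, lowering power from 0.8 to 0.2 roughly halves the share of significant findings that are true, which is the core of the argument linking underpowered studies to poor reproducibility.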
There are also a number of topics that are clearly related to reproducibility but fall more squarely under the topic of "software hygiene":
  • data management
  • code validation and testing
  • version control
  • reproducible workflows (e.g., virtualization/containerization)
  • literate programming
I would love to hear your thoughts about what a 21st century graduate statistics course in psychology/neuroscience should cover; please leave comments below!

Wednesday, December 9, 2015

Reproducible analysis in the MyConnectome project

Today our paper describing the MyConnectome project was published in Nature Communications.  This paper is unlike any that I have ever worked on before (and probably ever will again), as it reflects analyses of data collected on myself over the course of 18 months from 2012-2014.  A lot has been said already about what the results might or might not mean.  What I want to discuss here is the journey that ultimately led me to develop a reproducible shared analysis platform for the study.

Data collection was completed in April 2014, shortly before I moved to the Bay Area, and much of that summer was spent analyzing the data.  As I got deeper into the analyses, it became clear that we needed a way to efficiently and automatically reproduce the entire set of analyses.  For example, there were a couple of times during the data analysis process when my colleagues at Wash U updated their preprocessing strategy, which meant that I had to rerun all of the statistical analyses that relied upon those preprocessed data. This ultimately led me to develop a python package that implements all of the statistical analyses (which use a mixture of python, R, and **cough** MATLAB) and provides a set of wrapper scripts to run them.  This package made it fairly easy for me to rerun the entire set of statistical analyses on my machine by executing a single script, and provided me with confidence that I could reproduce any of the results that went into the paper.
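The "single script" idea can be sketched along the following lines. This is not the MyConnectome package itself, only a hypothetical minimal driver: the function name and the placeholder steps are assumptions, and in a real mixed-language pipeline each command would instead point at an analysis script (for example an R script run through `Rscript`).

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each (name, command) step in order and return the completed names.

    Stops with a RuntimeError at the first step that exits with a nonzero
    status, so that later analyses never run on stale or missing inputs.
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step '{name}' failed:\n{result.stderr}")
        completed.append(name)
    return completed


# Placeholder steps; each one just runs a trivial Python command.
placeholder_steps = [
    ("timeseries", [sys.executable, "-c", "print('timeseries placeholder')"]),
    ("reports", [sys.executable, "-c", "print('report placeholder')"]),
]

finished = run_pipeline(placeholder_steps)
print("completed steps:", finished)
```

Failing fast is the important design choice: when an upstream input changes, the whole chain is rerun from one entry point, and a broken step is reported instead of silently producing partial results.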

The next question was: Can anyone else (including myself at some later date) reproduce the results?  I had performed the analyses on my Mac laptop using a fairly complex software stack involving many different R and python packages, using a fairly complex set of imaging, genomic, metabolomic, and behavioral data.  (The imaging and -omics data had been preprocessed on large clusters at the Texas Advanced Computing Center (TACC) and Washington University; I didn’t attempt to generalize this part of the workflow).  I started by trying to replicate the analyses on a Linux system; identifying all of the necessary dependencies was an exercise in patience, as the workflow would break at increasingly later points in the process.  Once I had the workflow running, the first analyses showed very different results between the platforms; after the panic subsided (fortunately this happened before the paper was submitted!), I tracked the problem down to the R forecast package on Linux versus Mac (code to replicate issue available here).  It turned out that the auto.arima() function (which is the workhorse of our time series analyses) returned substantially different results on Linux and Mac platforms if the Y variable was not scaled (due apparently to a bug on the Linux side), but very close results when the Y variable was scaled. Fortunately, the latest version of the forecast package (6.2) gives identical results across Linux and Mac regardless of scaling, but the experience showed just how fragile our results can be when we rely upon complex black-box analysis software, and how we shouldn't take cross-platform reproducibility for granted (see here for more on this issue in the context of MRI analysis).
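A cross-platform check of this kind can be sketched as below. The sketch is an assumption-laden illustration rather than the replication code mentioned above: it shows z-scoring a variable before model fitting (the scaling that made the two platforms agree) and a tolerance-based comparison of summary statistics saved from two machines. All names and numbers are hypothetical.

```python
import math


def zscore(values):
    """Scale values to zero mean and unit (sample) standard deviation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        raise ValueError("cannot scale a constant series")
    return [(v - mean) / sd for v in values]


def compare_results(results_a, results_b, rel_tol=1e-6, abs_tol=1e-9):
    """Return the names of statistics that differ between two result sets.

    A statistic counts as a mismatch if it is missing from either set or if
    the two values are not equal within the given tolerances.
    """
    mismatches = []
    for name in sorted(set(results_a) | set(results_b)):
        if name not in results_a or name not in results_b:
            mismatches.append(name)
        elif not math.isclose(results_a[name], results_b[name],
                              rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(name)
    return mismatches


# Invented outputs from the same analysis run on two platforms.
mac_results = {"ar1_coef": 0.412, "aic": 153.2}
linux_results = {"ar1_coef": 0.412000000001, "aic": 171.9}

print("scaled series:", zscore([1.0, 2.0, 3.0]))
print("statistics that differ:", compare_results(mac_results, linux_results))
```

Comparing within a tolerance rather than for exact equality matters because small floating-point differences between platforms are expected, whereas a large discrepancy like the one described above signals a real problem.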

Having generalized the analyses to a second platform, the next logical step was to generalize it to any machine.  After discussing the options with a number of people in the open science community, the two most popular candidates were provisioning of a virtual machine (VM) using Vagrant, or creating a Docker container.  I ultimately chose to go with the Vagrant solution, primarily because it was substantially easier; in principle you simply set up a Vagrantfile that describes all of the dependencies, and type “vagrant up”.    Of course, this “easy” solution took many hours to actually implement successfully because it required reconstruction of all of the dependencies that I had taken for granted on the other systems, but once it was done we had a system that allows anyone to recreate the full set of statistical analyses exactly on their system, which is available at

A final step was to provide a straightforward way for people to view the complex set of results.  Our visualization guru, Vanessa Sochat, developed a flask application that provides a front end to all of the HTML reports generated by the various analyses, as well as a results browser that allows one to browse the 38,363 statistical tests that were computed for the project.  This browser is available locally if one installs and runs the VM, and is also accessible publicly from
Dashboard for analyses

Browser for timeseries analysis results

We have released code and data with papers in the past, but this is the first paper I have ever published that attempts to include a fully reproducible snapshot of the statistical analyses.  I learned a number of lessons in the process of doing this:
  1. The development of a reproducible workflow saved me from publishing a paper with demonstrably irreproducible results, due to the OS-specific software bug mentioned above.  This in itself makes the entire process worthwhile from my standpoint.
  2. Converting a standard workflow to a fully reproducible workflow is difficult. It took many hours of work beyond the standard analyses in order to develop a working VM with all of the analyses automatically run; that doesn’t even count the time that went into developing the browser. Had I started the work within a virtual machine from the beginning, it would have been much easier, but still would require extra work beyond that needed for the basic analyses.
  3. Ensuring longevity of a working pipeline is even harder.  The week before the paper was set to be published I tried a fresh install of the VM to make sure it was still working.  It wasn’t.  The problem was simple (miniconda had changed the name of its installation directory), and highlighted a significant flaw in our strategy, which was that we had not specified software versions in our VM provisioning.  I hope that we can add that in the future, but for now, we have to keep our eyes out for the disruptive effects of software updates.
I look forward to your comments and suggestions about how to better implement reproducible workflows in the future, as this is one of the major interests of our Center for Reproducible Neuroscience.
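One way to address the version problem raised in the third lesson is to record the exact versions of the packages an analysis depends on, so that they can be pinned when the environment is rebuilt. The sketch below is a minimal illustration using Python's standard `importlib.metadata` module (Python 3.8 or later); the function name and the package list are assumptions, and tools such as `pip freeze` or `conda env export` serve the same purpose for a whole environment.

```python
from importlib import metadata


def pinned_requirements(package_names):
    """Return ("name==version" lines, names that are not installed)."""
    pinned = []
    missing = []
    for name in package_names:
        try:
            pinned.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            missing.append(name)
    return pinned, missing


# Hypothetical dependency list; replace with the packages a project uses.
pinned, missing = pinned_requirements(["numpy", "pandas", "scipy"])
print("\n".join(pinned))
if missing:
    print("not installed:", ", ".join(missing))
```

Saving the resulting lines alongside the code at the time the analyses are run gives a record of the environment that produced the results, which a provisioning script can later install from.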

Sunday, November 1, 2015

Are good science and great storytelling compatible?

Chris Chambers has a piece in the Guardian ("Are we finally getting serious about fixing science?") discussing a recent report about reproducibility from the UK Academy of Medical Sciences, based on a meeting held earlier this year in London. A main theme of the piece is that scientists need to focus more on doing good science and less on "storytelling":
Some time in 1999, as a 22 year-old fresh into an Australian PhD programme, I had my first academic paper rejected. “The results are only moderately interesting”, chided an anonymous reviewer. “The methods are solid but the findings are not very important”, said another. “We can only publish the most novel studies”, declared the editor as he frogmarched me and my boring paper to the door.
I immediately asked my supervisor where I’d gone wrong. Experiment conducted carefully? Tick. No major flaws? Tick. Filled a gap in the specialist literature? Tick. Surely it should be published even if the results were a bit dull? His answer taught me a lesson that is (sadly) important for all life scientists. “You have to build a narrative out of your results”, he said. “You’ve got to give them a story”. It was a bombshell. “But the results are the results!” I shouted over my coffee. “Shouldn’t we just let the data tell their own story?” A patient smile. “That’s just not how science works, Chris.”
He was right, of course, but perhaps it’s the way science should work. 

None of us in the reproducibility community would dispute that the overselling of results in service of high-profile publications is problematic, and I doubt that Chambers really believes that our papers should just be data dumps presented without context or explanation.  But by likening the creation of a compelling narrative about one's results to "selling cheap cars", this piece goes too far.  Great science is not just about generating reproducible results and "letting the data tell their own story"; it should also give us deeper insights into how the world works, and those insights are fundamentally built around and expressed through narratives, because humans are story-telling animals.    We have all had the experience of sitting through a research talk that involved lots of data and no story, and it's a painful experience; this speaks to the importance of solid narrative in our communication of scientific ideas.

Narrative becomes even more important when we think about conveying our science to the public. Non-scientists are not in a position to "let the data speak to them" because most of them don't speak the language of data; instead, they speak the language of human narrative. It is only by abstracting away from the data to come up with narratives such as "memory is not like a videotape recorder" or "self-control relies on the prefrontal cortex" that we can bring science to the public in a way that can actually have impact on behavior and policy.

I think it would be useful to stop conflating scientific storytelling with "embellishing and cherry-picking".   Great storytelling (be it spoken or written) is just as important to the scientific enterprise as great methods, and we shouldn't let our zeal for the latter eclipse the importance of the former.

Wednesday, August 26, 2015

New course on decision making: Seeking feedback

I am currently developing a new course on the psychology of decision making that I will teach at Stanford in the Spring Quarter of 2016. I've looked at the various textbooks on this topic and I'm not particularly happy with any of them, so I am rolling my own syllabus and will use readings from the primary literature.  I have developed a draft syllabus and would love to get feedback: Are there important topics that I am missing?  Different readings that I should consider?  Topics I should consider dropping?  Please leave comments with your suggestions, or email me at!

Part 1: What is a decision? 

1. Varieties of decision making (overview of course)

Part 2: Normative decision theory: How an optimal system should make decisions

2. axiomatic approach from economics
- TBD reading on expected utility theory

3. Bayesian decision theory
Körding, K. P. (2007). Decision Theory: What “Should” the Nervous System Do? Science, 318(5850), 606–610.

4. Information accumulation
Smith & Ratcliff, 2004, Psychology and neurobiology of simple decisions.  TINS.

Part 3: Psychology: How humans make decisions

5. Anomalies: the ascendence of psychology and behavioral economics
Kahneman, D. (2003). A perspective on judgment and choice. American Psychologist,
58, 697-720

6. Judgment: Anchoring and adjustment
Chapman, G.B. & Johnson, E.J. (2002). Incorporating the irrelevant: Anchors in
judgment of belief and value

7. Heuristics: availability, representativeness
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases.
Science, 185, 1124-1131. 

8. Risk and uncertainty: Risk perception, risk attitudes
Slovic, P. (1987). Perception of risk. Science, 236, 280-285

9. Prospect theory 
Kahneman, D. & Tversky A. (1984). Choices, values, and frames. American
Psychologist, 39, 341–350.

10. Framing, endowment effects, and applications of prospect theory
Kahneman, D., Knetsch, J.L., & Thaler, R.H. (1991). The endowment effect, loss
aversion, and status quo bias. Journal of Economic Perspectives, 5, 193-206.

11. Varieties of utility
Kahneman, Wakker, & Sarin (1997). Back to Bentham: Explorations of experienced utility.  Quarterly Journal of Economics.

12. Intertemporal choice and self-control
Mischel, W., Shoda, Y., & Rodriguez, M.L. (1989). Delay of gratification in children. Science, 244, pp. 933-938.

13. Emotion and decision making
Rottenstreich, Y. & Hsee, C.K. (2001). Money, kisses and electric shocks: On the
affective psychology of risk. Psychological Science, 12, 185-190.

14. Social decision making and game theory

Part 4: Neuroscience of decision making

15. Neuroscience of simple decisions
Sugrue, Corrado, & Newsome (2005). Choosing the greater of two goods: neural currencies for valuation and decision making. Nature Reviews Neuroscience.

16. Neuroscience of Value-based decision making
Rangel et al., 2008, A framework for studying the neurobiology of value-based decision making

17. Reinforcement learning and dopamine, wanting/liking
Schultz, Montague, and Dayan (1997) A neural substrate of prediction and reward

18. Decision making in simple organisms
Reading TBD (C. elegans, snails, slime mold, etc)

Part 5: Ethical issues

19. Free will
Roskies (2006) Neuroscientific challenges to free will and responsibility.
Shadlen & Roskies (2012). The neurobiology of decision-making and responsibility: reconciling mechanism and mindedness. 

Sunday, March 16, 2014

532 days of self-examination

Last Tuesday we collected the final MRI scan and blood draw for the MyConnectome project.

Here I am emerging from the scanner on March 11, after the 104th MRI scan (photo: Mei-Yen Chen)
The study started on 9/25/2012 and ended on 3/11/2014, for a total of 532 days.  During this time we performed 104 MRI scanning sessions and 48 blood draws.  After excluding some of the data for quality control we are left with:

  • 88 resting state fMRI scans
  • 20 diffusion weighted imaging scans
  • 13 anatomical (T1/T2-weighted) scans
  • 18 breath-holding fMRI scans
  • 15 N-back task scans
  • 9 dot-motion/stop signal task scans
  • 8 object localizer scans
  • 5 language localizer scans
  • 4 spatial working memory localizer scans
Now comes the fun part, which is analyzing all of these data (which, with imaging and genomics data together, comprise more than 3 TB of data).  Fortunately we have the awesome computing resources of the Texas Advanced Computing Center, along with the computing resources at Washington University in St. Louis where our collaborators are also analyzing the data.  

We will be presenting some of the early results at the Organization for Human Brain Mapping meeting in Hamburg in June and the ICON meeting in Brisbane in July, and hope to have an initial paper submitted later this year.  More soon!

Thursday, December 19, 2013

A discussion of causal inference on fMRI data

My facebook page was recently home to an interesting discussion about causal inference that was spurred by a recent article that I retweeted:

Effective connectivity or just plumbing? Granger Causality estimates highly reliable maps of venous drainage.

What follows is the discussion that ensued.  I have moved it to my blog so that additional people can participate, and so that the discussion will be openly accessible (after removing a few less relevant comments).  Please feel free to comment with additional thoughts about the discussion.

  • Peter Bandettini Agree! Latencies due to veins 1 to 4 sec. Latencies due to neurons < 300 ms. Vein latency variation over space dominates.
  • Russ Poldrack it's amazing how we have known for a good while now that Granger causality is useless for fMRI, yet people continue to use it
  • Peter Bandettini There are ways to look through the plumbing.
  • Tal Yarkoni My impression is that as bad as the assumptions you have to make for GCM may be, the assumptions for DCM are still worse. The emerging trend seems to be towards a general skepticism of causal modeling of observational imaging data, not a shift from one causal method to another.
  • Micah Allen In terms of hemodynamics, I'm not aware of any assumptions different from those in all canonical HRF models. DCM uses a slightly updated version of the balloon model. Further, unlike GCM the parameters estimated by DCM have been repeatedly validated using multimodal imaging. Although I will bring this up at the next Methods meeting.
  • Russ Poldrack Tal - I don't agree regarding blanket skepticism about causal inference from observational data. It is definitely limited by some strong assumptions, but under those assumptions it works well, at least on simulated data (as shown e.g. by the recent work from Ramsey and Glymour). But I do agree regarding general skepticism about DCM - not so much about the theory, which I think is mostly reasonable, but about the application to fMRI (especially by people who just punch the button in SPM without really understanding what they are doing). I thought that the Lohmann paper did a good job of laying out many of the problems, and that Karl et al.'s response was pretty much gobbledygook.
  • Vince Calhoun I do not think even GCM can be considered useless for fMRI...however it is critical to keep the assumptions (what are the parameters capturing) and the data (what causes hemodynamic delay, etc) in mind in evaluating its utility and especially in interpreting the results (e.g. does this approach enable us to 'see through' the delay or just to characterize it). The post analysis (single or group) is critical as well and will often float or sink the boat. Every method out there has lots of holes in it, but in the right hands most of them can be informative.
  • Tal Yarkoni Russ, I think the question is whether you think those strong assumptions are ever *realistic*... E.g., looking back at the Ramsey & Glymour sims, my impression is that it's kind of an artificial situation in that (a) the relevant ROIs have already been selected (typically one wouldn't have much basis for knowing that only *these* 11 ROIs are relevant to the causal model!), (b) the focus is on ROI-level dynamics when it's not at all clear that connectivity between ROIs is a good proxy for neuronal dynamics (I understand this is a practical assumption made of necessity, but we don't actually know if it's a reasonable one!), and (c) I think the way error is quantified is misleading--if you tell me that you only have 3 mislabeled edges out of 60, that doesn't sound like much, but consider the implication, which could be, e.g., that now we think ACC directs activity in IPC rather than vice versa--which from a theoretical standpoint could completely change our understanding, and might very well falsify one's initial hypothesis!

    And this still ignores all of the standard criticisms of causal modeling of observational data, which still apply to neuroimaging data--e.g., the missing variable problem, the labeling problem (if you define the boundaries of the nodes differently, you will get very different results), the likelihood that there's heterogeneity in causal relationships across subjects, and so on. So personally I'm very skeptical that once you get beyond toy examples with a very small number of nodes (e.g., visual cortex responds to a stimulus and causes prefrontal changes at some later time), it's possible to say much of anything *positive* about causal dynamics in a complex, brain-wide network. That said, I do agree with Vince that any method can have utility when handled carefully. E.g., a relatively benign use of causal methods may be to determine whether a hypothesized model is completely *incompatible* with the data--which is a safer kind of inference. But that's almost never how these techniques are used in practice...
  • Tyler Davis Pragmatically, if you take all of the nonsense that happens with SEM in the behavioral literature, and then multiply by a factor of two to account for the nonsense due to mismodeling/misinterpreting hemodynamics, it makes it tough to believe an fMRI study with causal modeling unless you know the authors are good at math/know what they are doing or you were inclined to believe the results already.
  • Russ Poldrack We are using a measurement method that intrinsically is at least an order of magnitude from the real causal mechanisms, so it would be folly to have too much faith in the resulting causal inferences. But to the degree that we can validate them externally, the results could still be useful. For me the proof is always in the pudding!
  • Anastasia Christakou What *is* the pudding? If you have a biological hypothesis and your model-driven analysis is in line with it, what more can you do to "prove" it? Are you talking about *independent* external validation?
  • Russ Poldrack the pudding in this case is some kind of external validation - for example, if you can show me that graphical model/DCM/GCM parameters are believably heritable, or that model parameters from fMRI are predictive of results in another domain (e.g., EEG/MEG, behavior).
  • Micah Allen Tal Yarkoni I don't really find the ROI/model-selection problem with DCM overly troubling, although I do see where you are coming from. As Karl is fond of saying, it's your job as the researcher to motivate the model space as it's the hypothesis you are testing. The DCM is only valid in light of that constraint. I find it a bit unfortunate that there is this divergence between the two schools of thought; obviously in cases where you have no clue what the relevant hypothesis should be, tools like graph theory and functional correlations can be an excellent starting point. By definition, the validity of any DCM depends upon the validity of interpreting the experimentally-induced mass-univariate activations. DCM is built to assess specific directional hypotheses between regions activated by some experimental manipulation - it is unsurprising that if the selected task activates brain regions non-specifically or the ROIs are extracted haphazardly then the DCMs are equally invalid. But this isn't an indictment of DCM, it's rather a failure to motivate a relevant model space. If working in unclear or murky territory, Karl is the first to say that a graph theoretical or connectomic approach can be a first step towards motivating a DCM. This is all circumstantial to the basic issue that GCM just plain gets the direction of connections wrong. 

    I've read quite a bit of the DCM and GCM literature, and I actually do agree with you that many of the early papers are plagued by extremely toy examples. The end result was also always the same, that everything connected to everything. This is part of why the best DCM papers are those based on computational models, such as Hanneke den Ouden's papers on motor learning, where the estimated parameters and winning model families are themselves of interest. DCM is extremely well equipped to assess the evidence that say, motor cortex updates premotor with the probability of a stimulus being a face or a house. This can't be said for any other method - DCM is fundamentally built to assess brain-behavior relationships in this way. All the connectomics in the world won't tell you much beyond "these brain regions are really important hubs and you probably shouldn't knock them out". To go the rest of the way you need a hypothesis driven methodology. 

    That being said, a previous limitation has simply been that it's incredibly frustrating and time consuming to estimate interesting models with more than 2 or 3 nodes. This has resulted in many small sample, 'toy' DCM papers that are generally not very interesting. The new post-hoc model optimization routines largely ameliorate these practical concerns - in my upcoming paper I easily estimate models with all 6-7 nodes activated by my mass-univariate analysis. I have a colleague estimating similar models in 600 scans from the HCP database. This means we will begin to see more intuitive brain-behavior correlations. 

    As for the debate between Lohmann and Friston - Lohmann's critique is just plain factually wrong on the details of the post-hoc procedure, and on several other details. Further Lohmann seems to fundamentally misunderstand the goal of model development and model selection. So I'm not really convinced by that. 

    DCM requires strong hypotheses, which is both its strength and weakness, and fits extremely well with exploratory data-driven methods that actually work as advertised (unlike GCM). We've not even gotten into DCM for MEG/EEG (on which I am not an expert). The neural mass models there are extremely fascinating, going far beyond modelling mere data features to actually encapsulating a model of the underlying neurophysiological brain dynamics underlying observed responses. DCM for fMRI is itself at best just a starting point for actual neural modelling in M/EEG. 

    Finally, the greatest strength of DCM is the Bayesian model selection procedure. Don't like the canonical HRF? Substitute your own HRF and compare model evidences. Want to combine DTI, MEG, and LFP? DCM can do that.
  • Russ Poldrack Micah - what about Lohmann's comments regarding model fit? I read Karl's response as saying that it doesn't matter if your model only accounts for 0.1% of the variance, as long as it beats another model in the model selection procedure. That seems mathematically correct but scientifically wrongheaded. FWIW We gave up on DCM soon after starting when we found that the models generally fit the data really poorly.
  • Micah Allen Were you doing DCM for fMRI or MEG? In general I've been told that the MEG variance explained tends to be far higher than in the fMRI DCM - close to 90% in some cases, compared to 20-60% in the best case for fMRI. An important caveat with DCM for fMRI is that it is not ideal for fast, purely event-related designs. If you have a fast stimulus driving factor and a slow (e.g. attention) factor you should see variance explained in the 20-60% range. In my paradigm I only get about 6-12%, likely because I have two fast alternating events (stimulus intensity and probability).

    On the model fit question, I think this comes down to what the variational free energy parameter is actually optimizing. I am not an expert on the calculus behind VFE, but I've been to the DCM workshops and tried to understand the theory as much as my limited calc background allows. Essentially, VFE is a measure that weights the fit of the model (how well the model predicts the data, in a Bayesian posterior sense) by the model complexity. I found this bit of Karl's response helpful:

    "Generally speaking, a model with a poor fit can have more
    evidence than a model with a good fit. For example, if you give a
    model pure measurement noise, a good model should properly identify
    that there is noise and no signal. This model will be better than a
    model that tries to (over) fit noisy fluctuations."

    So the model evidence accounts for fit, but goes beyond it. As Karl points out, a model with perfect fit can be a very poor model, so the VFE is an attempt to balance these things. Beyond that all I can say is that I was taught to always look at the variance explained as a diagnostic step - generally the rule of thumb here is that if your variance explained is really low, you probably made a mistake somewhere or the paradigm is very poorly optimized for DCM. I think in a practical sense, Lohmann has a point here, as there are diagnostic tools (like spm_dcm_explore) but no real guidelines for using them (when to decide you've mucked up the DCM). I think strictly speaking the VFE does the job, but only in the (perhaps too ideal) world where the models make sense.

    I found this blog really useful for understanding why DCM sometimes fails to give a good variance explained for event related paradigms:
    What follows is a rough but hopefully didactically useful introduction into the ...
  • Russ Poldrack this was with fMRI data (with Marta Isabel Garrido) - but it was with fast ER designs, which are pretty much all we do in my lab, so that might explain part of the poor fit. Karl's point about penalizing model complexity is totally reasonable, but I'm not sure that it really has anything to do with VFE per se - VFE is just one of many ways to penalize for model complexity. (of course, it's Bayesian, so it must be better)
  • Daniel Handwerker The other trouble with the current DCM approaches is that they rely heavily on assumptions of what hemodynamic responses should be - including using the balloon model in a way it was never intended to be used. As part of a commentary on hemodynamic variation (Neuroimage 2012), I ran a small side analysis that showed how a modest, but very believable, difference in the size of the post-peak undershoot can flip the estimated direction of causality using DCM. This was a bit of a toy analysis, but it highlights a real concern about how assumptions in building low-level parts of SPM's DCM model can really affect results.
  • Micah Allen Daniel, I'm not sure I'd agree that is a limitation. The nice thing about DCM is that, since it is built around a Bayesian hypothesis testing architecture, any competing HRF can be substituted for another and the resulting model estimates compared. So you could easily run a DCM with your HRF vs the canonical - if yours wins it would be a good argument for updating the stock model. The HRF part of DCM is totally modular, so a power user should find it easy to substitute a competing model (or multiple competing models). This point was made repeatedly at the DCM workshop in Paris last year.
  • Tal Yarkoni Micah, I think you're grossly underestimating the complexities involved. Take the problem of selecting ROIs: you say it's incumbent on the researcher to correctly interpret the mass univariate results. But what does that mean in practice? Estimation uncertainty at any given voxel is high; if you choose an ROI based on passing some p value threshold, you will often have regions in which the confidence interval's lower bound is very close to zero. With small samples, you will routinely miss regions that are actually quite important at the population level, and routinely include other regions that are non-zero but probably should not be included. If you use very large samples, on the other hand, then everything is statistically significant, so now you have the problem of deciding which parts of the brain should be part of your causal model and which shouldn't based on some other set of criteria. If you take what you just said to its conclusion, people probably shouldn't fit DCM models to small datasets period, since the univariate associations are known to be shaky.

    Even if you settle on a set of ROIs, the problem of definition is not just one of avoiding "haphazard" selection; e.g., Steve Smith's work in NeuroImage nicely showed that even relatively little mixing between different nodes' timeseries will seriously muck up estimation--and that's with better behaved network modeling techniques, in a case where we know what the discrete nodes are a priori (because it's a simulation), and using network structures that are much more orderly than real data are likely to be (Smith et al's networks are very sparse). In the real world you typically don't know any of this; for example, in task-related activation maps, the entire dorsal medial surface of the brain is often active to some degree, and there is no good reason to, say, split it into 4 different nodes based on one threshold versus 2 nodes at another--even though this is the kind of choice that can produce wildly different results in a structural model.

    As for the hemodynamic issue: the problem is not so much that the canonical HRF is wrong (though of course we know it often is--and systematically so, in that it differs reliably across brain regions)--it's that you compound all of its problems when your modeling depends on deconvolved estimates. It's one thing to say that there is X amount of activation at a given brain region when you know that your model is likely to be wrong to some degree, and quite another to say that you can estimate complex causal interactions between 12 different nodes when the stability of that model depends on the same HRF fitting the data well in all cases.

    As to the general idea that DCM depends on strong hypotheses; this sounds great in principle, but the problem is that there are so many degrees of freedom available to the user that it is rarely clear what constitutes a disconfirmation of the hypothesis versus a "mistake" in one's initial assumptions. Of course to some degree this is a problem when doing research of any kind, but it's grossly compounded when the space of models is effectively infinite and the relationship between the model space and the hypothesis space is quite loose (in the sense that there are many, many, many network models that would appear consistent with just about any theoretical story one can tell).

    Mind you, this is an empirical question, at least in the sense that one could presumably quantify the effect of various "trivial" choices on DCM results. Take the model you mention with 6-7 nodes: I would love to know what happens if you systematically: (a) fit models that subsample from those nodes; (b) add nodes from other ROIs that didn't meet the same level of stringency; (c) define ROIs based on different statistical criteria (remember that surviving correction is a totally arbitrary criterion as it's sample size dependent); (d) randomly vary the edges (for all of the above). The prediction here is that there should be a marked difference in terms of model fit between the models you favor and the models generated by randomly permuting some of these factors--or, that when other models fit better, they should be theoretically consistent in terms of interpretation with the favored model. Is this generally the case? Has anyone demonstrated this?
  • Rajeev Raizada I read the above discussion just now, and found it very interesting indeed. I have never tried playing with DCM, but I have worked in neural modeling in the computational neuroscience / neural circuits sense. In general, I am skeptical of complex and sophisticated models, especially models which congratulate themselves on their complexity and sophistication, or, in the case of some of the models mentioned immediately above, models which congratulate themselves on the sophistication of their self-quantification of complexity. 

    A question for the DCM/GCM-aficionados out there: is there a single instance of such approaches generating a novel insight into brain function which was later independently validated by a different method? It looks to me as though there are a lot of instances of such approaches producing interesting-looking outputs which seem reasonable given what we think we know about the brain. But the road to hell in science is paved with interesting and reasonable-sounding stories.

    This last line is partly troll-bait, but I'll throw it out there anyway as I think it might be a valid comparison: are DCM/GCM approaches a bit like the evolutionary psychology of fMRI? Sources of interesting just-so stories?
  • Micah Allen Out of the office at dinner now so I'm afraid I must leave this debate for now. You make excellent points, Tal, though I'm not sure I agree with the picture you paint of fMRI effects being quite so arbitrary. At my former center we stuck to an N = 30 guideline and found this to be an excellent compromise, neither under nor over powered. In my oddball paradigm I get extremely anatomically constrained activations that fit well with the literature. I extract 6mm spherical VOIs from each peak. This seems like a pretty reasonable way to characterize the overall effect, but I do think the analysis you suggest would be interesting in any context, DCM or otherwise. Sorry so brief - on my phone at dinner!
  • Micah Allen Rajeev Raizada a good question but there is a lot of ongoing intracortical and multimodal validation work being done with DCM. Check out Rosalyn Moran's recent papers 
  • Rajeev Raizada Sounds interesting. Are you talking about this paper, which finds that boosting acetylcholine increases the size of the mismatch negativity? An interesting result, but it's not entirely clear to me that the theoretical overlay of predictive-coding and DCM adds a great deal, or that the empirical result adds much support to the theory. After all, this paper from 2001 already showed the converse result (reduced ACh gives reduced MMN), and there is a ton of evidence showing that ACh increases cortical signal-to-noise. However, I may be focussing on a different paper than you were referring to.
  • Daniel Handwerker As Tal notes, the issue with hemodynamic variation is that it varies quite a bit around the brain and there is no existing model in a Bayesian or any other framework that can solve this degree of variation. This isn't a problem if one's analysis is robust to much of the variation, but the deconvolution step in DCM amplifies its sensitivity to the selected hemodynamic model. If two brain regions having slightly different relative undershoots is enough to make the model fail, then when can we trust that it works?

    Another way of saying that a power user can make their own hemodynamic model is to say that the model that is default with the SPM DCM software doesn't work in real-world situations, but that doesn't preclude someone in the future from making a model that does work. This might be true, but it does little to increase confidence in the accuracy of current DCM studies. 

    As others have noted, simpler causality measures often have their assumptions more out in the open and someone can design a study that is robust to those assumptions. If a study was designed well enough that it could work within the complex assumptions of DCM, I can't think of a situation where it wouldn't also meet the assumptions of simpler approaches.
  • Micah Allen No, the option is there for those who want to show that there is a better model. The canonical balloon HRF enjoys a great deal of robustness, which is why it's the function of choice in nearly all neuroimaging packages. Further, DCM has been shown to be very robust to lag differences of up to 2 seconds. I hope you guys are not out there rejecting DCM papers on these basic principles without at least reading the actual validation papers behind it. Your toy example shows that DCM is sensitive to the HRF of choice, not that the canonical is an inappropriate model. If there is that big of a smoking gun problem with DCM, I'd probably suggest you publish to that effect to prevent a lot of people from wasting time!
  • Micah Allen Just to try to close on a positive note, I find this review by Stephen Smith to be exceptionally balanced, cutting through much of the hyperbole surrounding different connectivity methods. I think he really does a great job pointing out the natural strengths and weaknesses of each of the available approaches, advocating for a balanced viewpoint. Also don't miss the excellent footnote where he describes views from 'scientist A and scientist B' - pretty obvious who he's referring to there.
  • Marta Isabel Garrido This is one of the best validation papers showing robustness of DCM for fMRI (and also how GC fails terribly)
  • Daniel Handwerker Micah, when I say that a fundamental assumption of the current implementation of DCM fails under very realistic conditions to such an extent that it will flip the result, improving DCM is not my only option. The other option is to not personally use DCM and treat DCM findings which rest on these fundamental assumptions with a good bit of skepticism. If I had a specific project that would clearly benefit from the DCM approach, I might reassess my stance, but until then, it's the job of DCM users to make the method/interpretations more robust.
    I'll also note that the robustness of the canonical HRF is analysis dependent. My 2004 Neuroimage paper showed that it's robust for finding significance maps using a GLM, but it is less robust if you're taking magnitude estimates into a group analysis. Still, the robustness of the canonical HRF in linear regression based studies has been shown to be pretty good. That doesn't mean the same model is robust when used as part of a completely different analysis method. Any causality measure is probably going to be more sensitive to GLM shape compared to intravoxel statistical approaches. My review that I mentioned earlier is a published example showing how DCM can fail. It's an example rather than a full examination of the method's limitations, but it's enough to cause me some concern.
  • Nikolaus Kriegeskorte this is a fascinating thread. i'm left with three thoughts.
    (1) the paper by webb et al. does not seem to invalidate granger causality inferences based on comparisons between different experimental conditions. it would be good to hear Alard Roebroeck's take on these issues. (2) it would be good to have an fMRI causality modelling competition in which several simulated data sets (with known but unrevealed ground truth and realistic complexities) are analysed by multiple groups with multiple techniques. (3) the only thing that puts the proof in the pudding for me is prediction. in decoding analyses, for example, the use of independent test sets (crossvalidation) ensures that incorrect assumptions make us more likely to accept the null hypothesis. what is the equivalent of this in the realm of causal analyses?
  • Russ Poldrack Niko - agreed, the only reasonable use of Granger causality analysis with fMRI that I have seen is the work by Roebroeck that showed differences between conditions in G-causality within the same region, which mostly obviates the issues regarding HRFs (unless you think the latency is changing with amplitude). if only I was FB friends with Alard! And I second both your call for a competition (though the devil is in the details of the simulated data) and the ultimate utility of crossvalidation.
  • Jack Van Horn Niko: something like your item #2 as a challenge for the OHBM Hackathon. #justsayin
  • Nikolaus Kriegeskorte jack, that would be great. for the next meeting, we have an organising team with a plan already. could consider this for hawaii. it would be good to hear opinions from Klaas and Karl (whom i'm missing on facebook), Alard Roebroeck, and Stephen Smith -- as well as the commentators above.
  • Nikolaus Kriegeskorte Russ Poldrack is right that the devil is in the details of the simulated data. each method may prevail when its model matches the one that generated the data. it will therefore be key to include realistic complications of the type described by tal, which none of the methods address. the goal could be to decide about each of a set of causal claims that either do or do not hold in the data-generating ground-truth model.
  • Nikolaus Kriegeskorte I'd like to hear Rik Henson's take on this entire thread.
  • Micah Allen This has been a very informative thread for me - wish Facebook posts could be storified as this is an important debate. I would love to see it formally continued at OHBM. I will try to get Karl's opinion on these issues at the next methods meeting. Over beers last night several members of that group expressed that some of this validation was underway, and that DCM is undergoing significant revision to address many of these issues. Still it's clear that there has not been enough discussion. Also as a side note I agree that the GCM paper in my blog post does not necessarily implicate between condition differences. I think the answer to some of these issues is inevitably going to come down to experimental design, as probably both GCM and DCM can be more or less robust to vascular confounds depending on the timing and nature of your paradigm. It would be nice to know exactly what those constraints are.
  • Jack Van Horn This brings to mind early 1990's concerns about the effects that vasculature and draining veins, in particular, had on measured regional BOLD activation. This was considered a real issue by some and a potential deal breaker for the future of fMRI over PET. But others felt that the HRF actually saved the day since it would be modulated by neural activity under cognitive task conditions. The issue kind of got swept under the rug for the last 15 years or so. Interesting that this is now re-emerging in the realm of functional connectivity, resting-state, and notably Granger causality. Could it be that auto-correlative Granger modeling is doing exactly what it is supposed to do? And perhaps it has simply been our poor understanding of its implications relative to actual hemodynamics that is finally catching up to us? I look forward to seeing the further discussions.
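
The exchange above about model fit versus model evidence (Karl's remark that a model with a poorer fit can have more evidence than one with a better fit) can be illustrated with a small sketch. This is not DCM and not variational free energy: as an assumption for illustration, it uses the BIC approximation to the log evidence for ordinary least-squares models, which is just one of the many ways to penalize complexity that Russ mentions. The function name `bic_log_evidence` and the choice of Legendre regressors are hypothetical, picked only to make the example self-contained.

```python
import numpy as np


def bic_log_evidence(y, X):
    """BIC approximation to the log evidence of a Gaussian linear model.

    Returns (approximate log evidence, R^2). This is a stand-in for a
    complexity-penalized score; it is NOT the free energy used by DCM.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    sigma2 = rss / n  # maximum-likelihood estimate of the noise variance
    # Gaussian log-likelihood evaluated at the ML parameters
    loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    # Penalty counts k regression weights plus the noise variance
    log_evidence = loglik - 0.5 * (k + 1) * np.log(n)
    tss = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - rss / tss
    return log_evidence, r2


rng = np.random.default_rng(0)
n = 200
y = rng.standard_normal(n)  # pure measurement noise, no signal at all
t = np.linspace(-1.0, 1.0, n)

X_null = np.ones((n, 1))  # "there is only noise": intercept only
X_flex = np.polynomial.legendre.legvander(t, 10)  # intercept + 10 smooth regressors

ev_null, r2_null = bic_log_evidence(y, X_null)
ev_flex, r2_flex = bic_log_evidence(y, X_flex)

print(f"null model:     R^2 = {r2_null:.3f}, approx. log evidence = {ev_null:.1f}")
print(f"flexible model: R^2 = {r2_flex:.3f}, approx. log evidence = {ev_flex:.1f}")
```

Because the flexible model nests the null model, it always explains at least as much variance, yet on pure noise the complexity penalty should leave it with the lower approximate evidence. That is the sense in which evidence "accounts for fit, but goes beyond it"; it does not settle the separate question raised in the thread of whether a winning model that explains very little variance is scientifically useful.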