Thursday, December 19, 2013

A discussion of causal inference on fMRI data

My Facebook page was recently home to an interesting discussion about causal inference, spurred by an article that I retweeted:

Effective connectivity or just plumbing? Granger Causality estimates highly reliable maps of venous drainage.

What follows is the discussion that ensued.  I have moved it to my blog so that additional people can participate, and so that the discussion will be openly accessible (after removing a few less relevant comments).  Please feel free to comment with additional thoughts about the discussion.

  • Peter Bandettini Agree! Latencies due to veins: 1 to 4 sec. Latencies due to neurons: < 300 ms. Vein latency variation over space dominates.
  • Russ Poldrack it's amazing how we have known for a good while now that Granger causality is useless for fMRI, yet people continue to use it
  • Peter Bandettini There are ways to look through the plumbing.
  • Tal Yarkoni My impression is that as bad as the assumptions you have to make for GCM may be, the assumptions for DCM are still worse. The emerging trend seems to be towards a general skepticism of causal modeling of observational imaging data, not a shift from one causal method to another.
  • Micah Allen In terms of hemodynamics, I'm not aware of any assumptions different from those in all canonical HRF models. DCM uses a slightly updated version of the balloon model. Further, unlike GCM the parameters estimated by DCM have been repeatedly validated using multimodal imaging. I will bring this up at the next Methods meeting.
  • Russ Poldrack Tal - I don't agree regarding blanket skepticism about causal inference from observational data. It is definitely limited by some strong assumptions, but under those assumptions it works well, at least on simulated data (as shown e.g. by the recent work from Ramsey and Glymour). But I do agree regarding general skepticism about DCM - not so much about the theory, which I think is mostly reasonable, but about the application to fMRI (especially by people who just punch the button in SPM without really understanding what they are doing). I thought that the Lohmann paper did a good job of laying out many of the problems, and that Karl et al.'s response was pretty much gobbledygook.
  • Vince Calhoun I do not think even GCM can be considered useless for fMRI...however it is critical to keep the assumptions (what are the parameters capturing) and the data (what causes hemodynamic delay, etc) in mind in evaluating its utility and especially in interpreting the results (e.g. does this approach enable us to 'see through' the delay or just to characterize it). The post analysis (single or group) is critical as well and will often float or sink the boat. Every method out there has lots of holes in it, but in the right hands most of them can be informative.
  • Tal Yarkoni Russ, I think the question is whether you think those strong assumptions are ever *realistic*... E.g., looking back at the Ramsey & Glymour sims, my impression is that it's kind of an artificial situation in that (a) the relevant ROIs have already been selected (typically one wouldn't have much basis for knowing that only *these* 11 ROIs are relevant to the causal model!), (b) the focus is on ROI-level dynamics when it's not at all clear that connectivity between ROIs is a good proxy for neuronal dynamics (I understand this is a practical assumption made of necessity, but we don't actually know if it's a reasonable one!), and (c) I think the way error is quantified is misleading--if you tell me that you only have 3 mislabeled edges out of 60, that doesn't sound like much, but consider the implication, which could be, e.g., that now we think ACC directs activity in IPC rather than vice versa--which from a theoretical standpoint could completely change our understanding, and might very well falsify one's initial hypothesis!

    And this still ignores all of the standard criticisms of causal modeling of observational data, which still apply to neuroimaging data--e.g., the missing variable problem, the labeling problem (if you define the boundaries of the nodes differently, you will get very different results), the likelihood that there's heterogeneity in causal relationships across subjects, and so on. So personally I'm very skeptical that once you get beyond toy examples with a very small number of nodes (e.g., visual cortex responds to a stimulus and causes prefrontal changes at some later time), it's possible to say much of anything *positive* about causal dynamics in a complex, brain-wide network. That said, I do agree with Vince that any method can have utility when handled carefully. E.g., a relatively benign use of causal methods may be to determine whether a hypothesized model is completely *incompatible* with the data--which is a safer kind of inference. But that's almost never how these techniques are used in practice...
  • Tyler Davis Pragmatically, if you take all of the nonsense that happens with SEM in the behavioral literature, and then multiply by a factor of two to account for the nonsense due to mismodeling/misinterpreting hemodynamics, it makes it tough to believe an fMRI study with causal modeling unless you know the authors are good at math/know what they are doing or you were inclined to believe the results already.
  • Russ Poldrack We are using a measurement method that intrinsically is at least an order of magnitude from the real causal mechanisms, so it would be folly to have too much faith in the resulting causal inferences. But to the degree that we can validate them externally, the results could still be useful. For me the proof is always in the pudding!
  • Anastasia Christakou What *is* the pudding? If you have a biological hypothesis and your model-driven analysis is in line with it, what more can you do to "prove" it? Are you talking about *independent* external validation?
  • Russ Poldrack the pudding in this case is some kind of external validation - for example, if you can show me that graphical model/DCM/GCM parameters are believably heritable, or that model parameters from fMRI are predictive of results in another domain (e.g., EEG/MEG, behavior).
  • Micah Allen Tal Yarkoni I don't really find the ROI/model-selection problem with DCM overly troubling, although I do see where you are coming from. As Karl is fond of saying, it's your job as the researcher to motivate the model space as it's the hypothesis you are testing. The DCM is only valid in light of that constraint. I find it a bit unfortunate that there is this divergence between the two schools of thought; obviously in cases where you have no clue what the relevant hypothesis should be, tools like graph theory and functional correlations can be an excellent starting point. By definition, the validity of any DCM depends upon the validity of interpreting the experimentally-induced mass-univariate activations. DCM is built to assess specific directional hypotheses between regions activated by some experimental manipulation - it is unsurprising that if the selected task activates brain regions non-specifically or the ROIs are extracted haphazardly then the DCMs are equally invalid. But this isn't an indictment of DCM; it's rather a failure to motivate a relevant model space. If working in unclear or murky territory, Karl is the first to say that a graph theoretical or connectomic approach can be a first step towards motivating a DCM. This is all circumstantial to the basic issue that GCM just plain gets the direction of connections wrong.

    I've read quite a bit of the DCM and GCM literature, and I actually do agree with you that many of the early papers are plagued by extremely toy examples. The end result was also always the same, that everything connected to everything. This is part of why the best DCM papers are those based on computational models, such as Hanneke den Ouden's papers on motor learning, where the estimated parameters and winning model families are themselves of interest. DCM is extremely well equipped to assess the evidence that say, motor cortex updates premotor with the probability of a stimulus being a face or a house. This can't be said for any other method - DCM is fundamentally built to assess brain-behavior relationships in this way. All the connectomics in the world won't tell you much beyond "these brain regions are really important hubs and you probably shouldn't knock them out". To go the rest of the way you need a hypothesis driven methodology. 

    That being said, a previous limitation has simply been that it's incredibly frustrating and time consuming to estimate interesting models with more than 2 or 3 nodes. This has resulted in many small sample, 'toy' DCM papers that are generally not very interesting. The new post-hoc model optimization routines largely ameliorate these practical concerns - in my upcoming paper I easily estimate models with all 6-7 nodes activated by my mass-univariate analysis. I have a colleague estimating similar models in 600 scans from the HCP database. This means we will begin to see more intuitive brain-behavior correlations.

    As for the debate between Lohmann and Friston - Lohmann's critique is just plain factually wrong on the details of the post-hoc procedure, and on several other details. Further Lohmann seems to fundamentally misunderstand the goal of model development and model selection. So I'm not really convinced by that. 

    DCM requires strong hypotheses, which is both its strength and its weakness, and fits extremely well with exploratory data-driven methods that actually work as advertised (unlike GCM). We've not even gotten into DCM for MEG/EEG (on which I am not an expert). The neural mass models there are extremely fascinating, going far beyond modelling mere data features to actually encapsulating a model of the neurophysiological dynamics underlying observed responses. DCM for fMRI is itself at best just a starting point for actual neural modelling in M/EEG.

    Finally, the greatest strength of DCM is the Bayesian model selection procedure. Don't like the canonical HRF? Substitute your own HRF and compare model evidences. Want to combine DTI, MEG, and LFP? DCM can do that.
  • Russ Poldrack Micah - what about Lohmann's comments regarding model fit? I read Karl's response as saying that it doesn't matter if your model only accounts for 0.1% of the variance, as long as it beats another model in the model selection procedure. That seems mathematically correct but scientifically wrongheaded. FWIW, we gave up on DCM soon after starting when we found that the models generally fit the data really poorly.
  • Micah Allen Were you doing DCM for fMRI or MEG? In general I've been told that the MEG variance explained tends to be far higher than in the fMRI DCM - close to 90% in some cases, compared to 20-60% in the best case for fMRI. An important caveat with DCM for fMRI is that it is not ideal for fast, purely event-related designs. If you have a fast stimulus driving factor and a slow (e.g. attention) factor you should see variance explained in the 20-60% range. In my paradigm I only get about 6-12%, likely because I have two fast alternating events (stimulus intensity and probability).

    On the model fit question, I think this comes down to what the variational free energy parameter is actually optimizing. I am not an expert on the calculus behind VFE, but I've been to the DCM workshops and tried to understand the theory as much as my limited calc background allows. Essentially, VFE is a measure that weights the fit of the model (how well the model predicts the data, in a Bayesian posterior sense) by the model complexity. I found this bit of Karl's response helpful:

    "Generally speaking, a model with a poor fit can have more
    evidence than a model with a good fit. For example, if you give a
    model pure measurement noise, a good model should properly identify
    that there is noise and no signal. This model will be better than a
    model that tries to (over) fit noisy fluctuations."

    So the model evidence accounts for fit, but goes beyond it. As Karl points out, a model with perfect fit can be a very poor model, so the VFE is an attempt to balance these things. Beyond that all I can say is that I was taught to always look at the variance explained as a diagnostic step - generally the rule of thumb here is that if your variance explained is really low, you probably made a mistake somewhere or the paradigm is very poorly optimized for DCM. I think in a practical sense, Lohmann has a point here, as there are diagnostic tools (like spm_dcm_explore) but no real guidelines for using them (when to decide you've mucked up the DCM). I think strictly speaking the VFE does the job, but only in the (perhaps too ideal) world where the models make sense.

    I found this blog really useful for understanding why DCM sometimes fails to give a good variance explained for event-related paradigms.
  • Russ Poldrack this was with fMRI data (with Marta Isabel Garrido) - but it was with fast ER designs, which are pretty much all we do in my lab, so that might explain part of the poor fit. Karl's point about penalizing model complexity is totally reasonable, but I'm not sure that it really has anything to do with VFE per se - VFE is just one of many ways to penalize for model complexity (of course, it's Bayesian, so it must be better!)
  • Daniel Handwerker The other trouble with the current DCM approaches is that they rely heavily on assumptions of what hemodynamic responses should be - including using the balloon model in a way it was never intended to be used. As part of a commentary on hemodynamic variation (Neuroimage 2012), I ran a small side analysis that showed how a modest, but very believable, difference in the size of the post-peak undershoot can flip the estimated direction of causality using DCM. This was a bit of a toy analysis, but it highlights a real concern about how assumptions in building low-level parts of SPM's DCM model can really affect results.
  • Micah Allen Daniel, I'm not sure I'd agree that is a limitation. The nice thing about DCM is since it is built around a Bayesian hypothesis testing architecture, any competing HRF can be substituted for another and the resulting model estimates compared. So you could easily run a DCM with your HRF vs the canonical - if yours wins it would be a good argument for updating the stock model. The HRF part of DCM is totally modular, so a power user should find it easy to substitute a competing model (or multiple competing models). This point was made repeatedly at the DCM workshop in Paris last year.
  • Tal Yarkoni Micah, I think you're grossly underestimating the complexities involved. Take the problem of selecting ROIs: you say it's incumbent on the researcher to correctly interpret the mass univariate results. But what does that mean in practice? Estimation uncertainty at any given voxel is high; if you choose an ROI based on passing some p value threshold, you will often have regions in which the confidence interval's lower bound is very close to zero. With small samples, you will routinely miss regions that are actually quite important at the population level, and routinely include other regions that are non-zero but probably should not be included. If you use very large samples, on the other hand, then everything is statistically significant, so now you have the problem of deciding which parts of the brain should be part of your causal model and which shouldn't based on some other set of criteria. If you take what you just said to its conclusion, people probably shouldn't fit DCM models to small datasets period, since the univariate associations are known to be shaky.

    Even if you settle on a set of ROIs, the problem of definition is not just one of avoiding "haphazard" selection; e.g., Steve Smith's work in NeuroImage nicely showed that even relatively little mixing between different nodes' timeseries will seriously muck up estimation--and that's with better behaved network modeling techniques, in a case where we know what the discrete nodes are a priori (because it's a simulation), and using network structures that are much more orderly than real data are likely to be (Smith et al's networks are very sparse). In the real world you typically don't know any of this; for example, in task-related activation maps, the entire dorsal medial surface of the brain is often active to some degree, and there is no good reason to, say, split it into 4 different nodes based on one threshold versus 2 nodes at another--even though this is the kind of choice that can produce wildly different results in a structural model.

    As for the hemodynamic issue: the problem is not so much that the canonical HRF is wrong (though of course we know it often is--and systematically so, in that it differs reliably across brain regions)--it's that you compound all of its problems when your modeling depends on deconvolved estimates. It's one thing to say that there is X amount of activation at a given brain region when you know that your model is likely to be wrong to some degree, and quite another to say that you can estimate complex causal interactions between 12 different nodes when the stability of that model depends on the same HRF fitting the data well in all cases.

    As to the general idea that DCM depends on strong hypotheses; this sounds great in principle, but the problem is that there are so many degrees of freedom available to the user that it is rarely clear what constitutes a disconfirmation of the hypothesis versus a "mistake" in one's initial assumptions. Of course to some degree this is a problem when doing research of any kind, but it's grossly compounded when the space of models is effectively infinite and the relationship between the model space and the hypothesis space is quite loose (in the sense that there are many, many, many network models that would appear consistent with just about any theoretical story one can tell).

    Mind you, this is an empirical question, at least in the sense that one could presumably quantify the effect of various "trivial" choices on DCM results. Take the model you mention with 6-7 nodes: I would love to know what happens if you systematically: (a) fit models that subsample from those nodes; (b) add nodes from other ROIs that didn't meet the same level of stringency; (c) define ROIs based on different statistical criteria (remember that surviving correction is a totally arbitrary criterion as it's sample size dependent); (d) randomly vary the edges (for all of the above). The prediction here is that there should be a marked difference in terms of model fit between the models you favor and the models generated by randomly permuting some of these factors--or, that when other models fit better, they should be theoretically consistent in terms of interpretation with the favored model. Is this generally the case? Has anyone demonstrated this?
  • Rajeev Raizada I read the above discussion just now, and found it very interesting indeed. I have never tried playing with DCM, but I have worked in neural modeling in the computational neuroscience / neural circuits sense. In general, I am skeptical of complex and sophisticated models, especially models which congratulate themselves on their complexity and sophistication, or, in the case of some of the models mentioned immediately above, models which congratulate themselves on the sophistication of their self-quantification of complexity. 

    A question for the DCM/GCM-aficionados out there: is there a single instance of such approaches generating a novel insight into brain function which was later independently validated by a different method? It looks to me as though there are a lot of instances of such approaches producing interesting-looking outputs which seem reasonable given what we think we know about the brain. But the road to hell in science is paved with interesting and reasonable-sounding stories.

    This last line is partly troll-bait, but I'll throw it out there anyway as I think it might be a valid comparison: are DCM/GCM approaches a bit like the evolutionary psychology of fMRI? Sources of interesting just-so stories?
  • Micah Allen Out of the office at dinner now so I'm afraid I must leave this debate for now. You make some excellent points, Tal, though I'm not sure I agree with the picture you paint of fMRI effects being quite so arbitrary. At my former center we stuck to an N = 30 guideline and found this to be an excellent compromise, neither under- nor over-powered. In my oddball paradigm I get extremely anatomically constrained activations that fit well with the literature. I extract 6mm spherical VOIs from each peak. This seems like a pretty reasonable way to characterize the overall effect, but I do think the analysis you suggest would be interesting in any context, DCM or otherwise. Sorry so brief - on my phone at dinner!
  • Micah Allen Rajeev Raizada a good question but there is a lot of ongoing intracortical and multimodal validation work being done with DCM. Check out Rosalyn Moran's recent papers 
  • Rajeev Raizada Sounds interesting. Are you talking about this paper, which finds that boosting acetylcholine increases the size of the mismatch negativity? An interesting result, but it's not entirely clear to me that the theoretical overlay of predictive-coding and DCM adds a great deal, or that the empirical result adds much support to the theory. After all, this paper from 2001 already showed the converse result (reduced ACh gives reduced MMN), and there is a ton of evidence showing that ACh increases cortical signal-to-noise. However, I may be focussing on a different paper than you were referring to.
  • Daniel Handwerker As Tal notes, the issue with hemodynamic variation is that it varies quite a bit around the brain and there is no existing model in a Bayesian or any other framework that can solve this degree of variation. This isn't a problem if one's analysis is robust to much of the variation, but the deconvolution step in DCM amplifies its sensitivity to the selected hemodynamic model. If two brain regions having slightly different relative undershoots is enough to make the model fail, when can we trust that it works?

    Another way of saying that a power user can make their own hemodynamic model is to say that the model that is default with the SPM DCM software doesn't work in real-world situations, but that doesn't preclude someone in the future from making a model that does work. This might be true, but it does little to increase confidence in the accuracy of current DCM studies. 

    As others have noted, simpler causality measures often have their assumptions more out in the open and someone can design a study that is robust to those assumptions. If a study was designed well enough that it could work within the complex assumptions of DCM, I can't think of a situation where it wouldn't also meet the assumptions of simpler approaches.
  • Micah Allen No, the option is there for those who want to show that there is a better model. The canonical balloon HRF enjoys a great deal of robustness, which is why it's the function of choice in nearly all neuroimaging packages. Further, DCM has been shown to be very robust to lag differences of up to 2 seconds. I hope you guys are not out there rejecting DCM papers on these basic principles without at least reading the actual validation papers behind it. Your toy example shows that DCM is sensitive to the HRF of choice, not that the canonical is an inappropriate model. If there is that big of a smoking-gun problem with DCM, I'd probably suggest you publish to that effect to prevent a lot of people from wasting time!
  • Micah Allen Just to try to close on a positive note, I find this review by Stephen Smith to be exceptionally balanced, cutting through much of the hyperbole surrounding different connectivity methods. I think he really does a great job pointing out the natural strengths and weaknesses of each of the available approaches, advocating for a balanced viewpoint. Also don't miss the excellent footnote where he describes views from 'scientist A and scientist B' - pretty obvious who he's referring to there.
  • Marta Isabel Garrido This is one of the best validation papers showing robustness of DCM for fMRI (and also how GC fails terribly)
  • Daniel Handwerker Micah, when I say that a fundamental assumption of the current implementation of DCM fails under very realistic conditions to such an extent that it will flip the result, improving DCM is not my only option. The other option is to not personally use DCM and treat DCM findings which rest on these fundamental assumptions with a good bit of skepticism. If I had a specific project that would clearly benefit from the DCM approach, I might reassess my stance, but until then, it's the job of DCM users to make the method/interpretations more robust.
    I'll also note that the robustness of the canonical HRF is analysis dependent. My 2004 Neuroimage paper showed that it's robust for finding significance maps using a GLM, but it is less robust if you're taking magnitude estimates into a group analysis. Still, the robustness of the canonical HRF in linear regression based studies has been shown to be pretty good. That doesn't mean the same model is robust when used as part of a completely different analysis method. Any causality measure is probably going to be more sensitive to GLM shape compared to intravoxel statistical approaches. My review that I mentioned earlier ( ) is a published example showing how DCM can fail. It's an example rather than a full examination of the method's limitations, but it's enough to cause me some concern.
  • Nikolaus Kriegeskorte this is a fascinating thread. i'm left with three thoughts.
    (1) the paper by webb et al. does not seem to invalidate granger causality inferences based on comparisons between different experimental conditions. it would be good to hear Alard Roebroeck's take on these issues. (2) it would be good to have an fMRI causality modelling competition in which several simulated data sets (with known but unrevealed ground truth and realistic complexities) are analysed by multiple groups with multiple techniques. (3) the only thing that puts the proof in the pudding for me is prediction. in decoding analyses, for example, the use of independent test sets (crossvalidation) ensures that incorrect assumptions make us more likely to accept the null hypothesis. what is the equivalent of this in the realm of causal analyses?
  • Russ Poldrack Niko - agreed, the only reasonable use of Granger causality analysis with fMRI that I have seen is the work by Roebroeck that showed differences between conditions in G-causality within the same region, which mostly obviates the issues regarding HRFs (unless you think the latency is changing with amplitude). if only I was FB friends with Alard! And I second both your call for a competition (though the devil is in the details of the simulated data) and the ultimate utility of crossvalidation.
  • Jack Van Horn Niko: something like your item #2 as a challenge for the OHBM Hackathon.#justsayin
  • Nikolaus Kriegeskorte jack, that would be great. for the next meeting, we have an organising team with a plan already. could consider this for hawaii. it would be good to hear opinions from Klaas and Karl (whom i'm missing on facebook), Alard Roebroeck, and Stephen Smith -- as well as the commentators above.
  • Nikolaus Kriegeskorte Russ Poldrack is right that the devil is in the details of the simulated data. each method may prevail when its model matches the one that generated the data. it will therefore be key to include realistic complications of the type described by tal, which none of the methods address. the goal could be to decide about each of a set of causal claims that either do or do not hold in the data-generating ground-truth model.
  • Nikolaus Kriegeskorte I'd like to hear Rik Henson's take on this entire thread.
  • Micah Allen This has been a very informative thread for me - wish Facebook posts could be storified as this is an important debate. I would love to see it formally continued at OHBM. I will try to get Karl's opinion on these issues at the next methods meeting. Over beers last night several members of that group expressed that some of this validation was underway, and that DCM is undergoing significant revision to address many of these issues. Still it's clear that there has not been enough discussion. Also as a side note I agree that the GCM paper in my blog post does not necessarily implicate between-condition differences. I think the answer to some of these issues is inevitably going to come down to experimental design, as probably both GCM and DCM can be more or less robust to vascular confounds depending on the timing and nature of your paradigm. It would be nice to know exactly what those constraints are.
  • Jack Van Horn This brings to mind early 1990's concerns about the effects that vasculature and draining veins, in particular, had on measured regional BOLD activation. This was considered a real issue by some and a potential deal breaker for the future of fMRI over PET. But others felt that the HRF actually saved the day since it would be modulated by neural activity under cognitive task conditions. The issue kind of got swept under the rug for the last 15 years or so. Interesting that this is now re-emerging in the realm of functional connectivity, resting-state, and notably Granger causality. Could it be that auto-correlative Granger modeling is doing exactly what it is supposed to do? And perhaps it has simply been our poor understanding of its implications relative to actual hemodynamics that is finally catching up to us? I look forward to seeing the further discussions.
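
To make the vascular-latency confound discussed above concrete, here is a minimal simulation sketch. It is not from the discussion, and all parameter values (lag sizes, HRF shapes, TR) are illustrative assumptions on my part: two "neuronal" time series in which x drives y are convolved with gamma-shaped HRFs whose peaks differ by a few seconds, as they plausibly could between regions with different venous drainage, and then sampled at a typical TR. A simple lag-based Granger comparison then points in the wrong direction:

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.1          # 100 ms neuronal time bins
n = 6000          # a 10-minute run

# Neuronal ground truth: x drives y with a ~100 ms lag.
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

def hrf(peak_s):
    # Gamma-shaped response peaking at peak_s seconds (illustrative shape).
    t = np.arange(0, 30, dt)
    h = (t / peak_s) ** 3 * np.exp(3 - 3 * t / peak_s)
    return h / h.sum()

# Put the *driver* under slow venous drainage and the *target* under
# faster vasculature, then sample at TR = 2 s (every 20th bin).
bold_x = np.convolve(x, hrf(6.0))[:n][::20]
bold_y = np.convolve(y, hrf(3.0))[:n][::20]

def lagmat(v, p):
    # Columns v[t-1] ... v[t-p], aligned with targets v[p:].
    return np.column_stack([v[p - k: len(v) - k] for k in range(1, p + 1)])

def rss(X, z):
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    return np.sum((z - X @ beta) ** 2)

def granger_f(src, dst, p=2):
    # F statistic: does src's past improve prediction of dst beyond
    # dst's own past?
    z = dst[p:]
    Xr = np.column_stack([np.ones(len(z)), lagmat(dst, p)])
    Xf = np.column_stack([Xr, lagmat(src, p)])
    r0, r1 = rss(Xr, z), rss(Xf, z)
    return ((r0 - r1) / p) / (r1 / (len(z) - Xf.shape[1]))

# The vascular lag difference makes the driven region look like the driver.
print(granger_f(bold_x, bold_y) < granger_f(bold_y, bold_x))
```

Under these assumptions the Granger statistic favors the reverse of the true neuronal direction. A real analysis of course involves noise, preprocessing, and many more modeling choices, so this is only a sketch of the confound, not a claim about any particular study.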


  1. Thanks for the fascinating discussion.

    As shown by Lionel Barnett and Anil Seth in a recent paper, the problem is not the HRF per se, but the HRF together with downsampling.
    Furthermore, the problem of generating a good ground truth is a tough one.
    We have developed a method for retrieving and deconvolving the HRF for each voxel even at rest, and this seems to improve the performance of GC.

    We used it also to get whole-brain GC maps from deconvolved data, which can be compared to the paper by Webb et al.

    I wrote to Olivier David asking for the data from his PLOS Biology 2008 paper, to see whether I can get GC to perform better.

  2. Just wanted to try and wrap up my thoughts on this debate, as I never really responded to Tal's comments, which I found very useful. In general, I am sure you can find examples where DCM for fMRI may perform poorly; the analysis you suggest would inevitably point out places where it is robust and places where it is not. It certainly should not be surprising that iterating design and analysis alters the outcome. I think in cases where the HRF performs poorly, the winning model is likely to have an extremely poor variance explained. I don't buy into their skepticism that the assumptions of the HRF don't scale to more complex models. We already have reasonably robust evidence that they do, and I think it's not enough just to be skeptical - I can see no reason in principle why the example with 2 nodes is any different than 6 nodes. Even the GLM can be a poor model in some cases; the point isn't to produce a perfect method but rather a good compromise between model utility and accuracy. With good experimental design I think DCM is largely robust to the small differences. I do agree with Tal that a large scale data mining approach iterating all possible variables would be interesting, but I don't agree with his totally arbitrary definition of biologically relevant activations. I get that if you add subjects infinitely the entire brain activates, but that doesn't say much about biological interpretability as you've obviously overpowered the design. Solid designs with reasonable Ns typically report relatively constrained activation profiles that fit anatomy quite well. So I'd welcome the exhaustive approach he recommends, but I am also not sure what we would really learn beyond 'the way you do things changes what you get'. There is a side of science that sees this as a drawback, but I agree with the viewpoint that this is where the responsibility of the scientist to understand their method comes into play.

    In general it comes down to two (sadly) competing views of science. On one, the goal is to get endless amounts of data, very generally defined, and to apply high-powered machine learning and data mining routines to let the data 'reveal the truth'. The other more classical view advocates careful model development and hypothesis testing, on the notion that good theories will produce robust results without the need for endless data, and understanding that many of the findings produced by that approach are just rich descriptions that make few (if any) predictions about behavior. I think the best result comes from fusing these two views, letting exploratory data mining inform our hypotheses and using a tightly hypothesized approach to bang out principles. To the extent that our theories make strong, testable predictions, we should be able to tease out and replicate effects reasonably well. From my point of view and experience, predictive coding makes extremely powerful predictions about the brain and behavior that are increasingly being borne out by dynamic causal modelling. I appreciate the power of the data mining approach but in the end I doubt we're ready to remove the ever fallible human from the process ;)

  3. Russ, you really need to raise the character limit on comments; 4,096 ain't enough--this isn't Twitter! ;)

    Micah, I don't really think it's fair to summarize people's concerns here by saying that "you can find examples where DCM for fMRI may perform poorly". A better way to summarize them is that the space of possible models is so unimaginably huge, our understanding of what the brain is doing is so poor, and there are so many choices a researcher needs to make that are essentially arbitrary, that it's almost inconceivable that there could be a method for reliably evaluating the (observational) evidence for causal models. This wouldn't be such a terrible thing if there was a serious movement to try and validate the methods before putting them into use, but in practice there doesn't seem to be one. The kind of simulations that Lohmann et al did, for instance, strike me as absolutely critical, and are the kind of thing one should do *before* a method becomes widely applied, not several years later.

    To respond to your specific comments, I'm certainly not saying that the definition of nodes in the model is totally arbitrary; what I'm saying is that *much of it* is arbitrary, and that there is essentially no reason to think that just by blind luck we will ever get it right (not *perfectly* right; I mean even close to right). Let's take again your data as an example. You say your model has 6 nodes because those 6 survived group-level correction. But surely you agree that if your sample size were twice as large, many more regions would survive correction. And if your sample size were halved, fewer would. This is not up for debate; it's a necessary consequence of the null hypothesis significance testing framework. If we take your argument seriously that DCM should only be applied to models constrained by the results of a mass univariate analysis where regions activate in response to experimental manipulations, what are we now to make of this? Is it really reasonable to say that the 6-node model you selected happens to be the right one, and is a decent approximation to the truth, even though we know full well that a doubling of sample size would have given you perhaps a 15-node model, and a halving might have perhaps given you a 4-node model? And that every time you re-run the experiment, you're likely to get somewhat different nodes? (1/4; continued below.)

  4. Note that this is exactly the same omitted variable issue that presents such trenchant problems for structural models in *any* domain; imaging is not special. The key point, which is almost invariably ignored, is that a given causal interpretation of a model is only sensible if one is willing to assume that all and only the relevant nodes are included in the network. In behavioral research this is already almost always an untenable assumption because it isn't hard to think of confounding variables for almost any measured variable that is supposed to carry causal influence. But it gets much worse when you're talking about neuroimaging, where the brain is a very dense causal system and we know for a fact that most of the nodes selected for DCM analysis correlate quite strongly with many other voxels or ROIs that were left out (typically for arbitrary reasons). To put it in perspective, this is kind of like constructing a simple causal model positing that the positive correlation between food intake and body mass is mediated by fat consumption, and then when someone pops up and politely interjects that fat consumption is heavily confounded with carb consumption, protein consumption, and any number of other things, simply saying "well sure, but we didn't model those things, so oh well" and continuing along. When you build a model that includes, say, ACC and DLPFC, but not inferior parietal cortex, this is only an okay approximation if you have strong reason to think that IPC is not actually doing a lot of the work you're attributing to ACC or DLPFC (with which it will be strongly correlated) AND you also know that including IPC wouldn't fundamentally change the causal dynamics between the other regions (and frankly, I don't know how you'd know that).

    Even if you set the omitted variable issue aside--which, again, is not a small one, and for many people is a deal-breaker even in the behavioral literature--you still haven't really addressed the issue of node definition: if you haven't quantified the impact of, say, including or dropping an extra 10% of the voxels in each ROI based on a slight juggling of the threshold (which, again, you can't argue is anything other than arbitrary!), what reason do I have to believe your results? Again, it's not that this is some far-fetched hypothetical concern that could, under just the right circumstances, present some small problem. We have both theoretical reasons and empirical reasons to suspect that very small perturbations of the nodes will have very large effects on the model. It's not an acceptable defense to say that, well, it's the researcher's job to get things right. Surely it's also the researcher's job to show the people evaluating the research that the results don't completely fall apart when very minor, and essentially arbitrary, parameters are tweaked. And again, it's not really up for debate that there *is* a large amount of arbitrariness, because as I understand it you're defining the nodes using spheres centered around the mass univariate activation peaks, and I'm quite sure that if you look at the confidence intervals of all your voxels, you will find that the peak voxels would have been quite different with a different sample (plus the diameter of the sphere will make a big difference too). (2/4)

  5. Mind you, all this still depends on yet another very strong, and in my view largely indefensible, assumption, which is that a brain region that plays an important causal role in the network has to show a univariate difference between experimental conditions. Frankly this strikes me as wishful thinking. For one thing, many absolutely crucial causal effects may be driven by very transient changes in psychological state space that don't translate into robust BOLD signals (i.e., they would be invisible to fMRI). We have no reason to think that fMRI is in a position to capture all, or even most, of the causal dynamics crucial to understanding what a network is doing, do we? Moreover, each voxel contains hundreds of thousands or millions of neurons, and a typical ROI may contain hundreds of millions. We know from electrophysiological studies that it is entirely typical to find interdigitated assemblies of neurons with quite different response properties within roughly the same chunk of tissue; it is not at all a stretch to think that a particular region might show no net change in BOLD signal even though vitally important causal interactions are taking place (e.g., two different populations of prefrontal neurons are involved in recurrent activation of different kinds of sensory representations in posterior cortex). Note that I'm not arguing that all of the signals must sum to a net of exactly zero when comparing two experimental conditions. I'm saying that as far as I'm aware, we have no good reason at all to expect anything like a strong linear relationship between the causal centrality of a region and its activation strength in a mass univariate analysis. And we have even less reason to expect that such a relationship would manifest in a readily interpretable way--which is what you require when you say that it is the researcher's job to look at their experimental results and pick out the nodes that matter. I think this is giving researchers entirely too much credit--not to mention some very extraordinary powers of apprehension.

    This of course all entirely leaves aside the Lohmann et al critique, which in my view carries exactly the force Lohmann et al suggested, and still hasn't been rebutted in a convincing way. It's no good to say that it's the researcher's job to select the right models to test (which was the gist of Friston's response) if there's no evidence that the method is capable of discriminating between models reliably to begin with. What Lohmann et al showed was that DCM cannot reliably distinguish families of models that are neuroscientifically plausible from those that are wildly implausible; in view of that result, why should anyone expect that DCM can discriminate between two models that *are* both neuroscientifically plausible--a discrimination that can only be more difficult? When you say that "With good experimental design I think DCM is largely robust to the small differences", what are you basing that on? The assertion flies in the face of the Lohmann simulations, all of the considerations I raised above, and of many people's experience tweaking DCM models for a very long time to try to get a stable result. It isn't the kind of statement the purveyors and consumers of a sophisticated causal modeling technique that rests on numerous strong assumptions should just take for granted--especially when it's so easy to empirically test. Or, to put it this way: how do you (and I mean you personally) actually *know* that your results are robust, if you haven't done the work to test their resilience to small changes, or compared them to what you would get by generating models at random? Isn't this a giant leap of faith? Isn't it your responsibility to first verify for yourself that your results don't disappear when you tweak key parameters before publishing them? (3/4)

  6. Lastly, I think it's kind of unfair to characterize this as a debate between people on the machine learning side who favor exploratory data analysis but don't really care about hypotheses, and people who think deeply about hypothesis testing but are willing to take a couple of steps beyond the data. It's a nice rhetorical move, because it allows you to write off half of the field as people with philosophical differences, but I don't think this is about philosophical differences at all. One can think deeply about theoretical models and still recognize that just because a model makes sense in your head, and produces results you like when explicitly compared with several other models, does not mean that it approximates the truth in any meaningful sense. This is not a philosophical claim, it's an empirical one that can be directly tested (and has been tested). A major problem with much of the DCM literature (and the causal modeling literature more generally) is that there is very little emphasis on validation beyond what makes theoretical sense to the authors--which isn't really validation at all. *That* is the substantive criticism you see from many people in the above thread. I imagine that when you look at your model--which DCM reports has better evidence than at least a few other alternatives--you see an elegant theoretical story supported by strong evidence, and you probably view concerns about the quality of the underlying data as minor annoyances that aren't your responsibility, and that future studies can address. The problem is that many other people who look at your results will immediately think of all the extremely strong assumptions you have to make in order to believe the results but that are almost certain to be false, and which you could have easily tested but didn't.

    The key point is that this really *isn't* some deep philosophical divide between two incommensurable camps; it's something that causal modeling proponents have the power to address. All it would take to bridge whatever divide exists is for people in the causal modeling camp to get more serious about external validation. This means two things, as far as I'm concerned: (a) testing the assumptions underlying DCM in cases where it's straightforward to do so (e.g., by simply repeating your analyses several times with slightly different parameter choices); and (b) as Niko said, actually predicting novel observations in an unbiased way (i.e., showing that the DCM model with the highest evidence also consistently predicts new observations we care about better than models with lower evidence). If you can show me in your papers that your results aren't entirely dependent on parameter choices that were clearly not under your control in any meaningful sense, and that your method is robust against very simple validity threats (e.g., random models having better evidence, as in Lohmann et al), then I will be happy to take them seriously. But if you can't or aren't willing to provide that kind of support, then I don't see why I (or anyone else) should put any stock in the findings given all of the enormous assumptions involved. (4/4)

    1. I recall from Friston's response to Lohmann that he pointed out the simulation is largely based on a misunderstanding of the aims and mechanics of Bayesian model selection. By using this procedure you assume that all models have equal prior probability a priori. Your argument is that the selection failed to discriminate plausible from obviously implausible models; if a model is so obviously implausible then you should specify a different prior for it, in which case it would be rejected. You also seem to be ignoring Friston's argument (which I have directly observed to be true) that with increasing model space you have increasing model dilution, and a higher difficulty of selecting the best model. Lastly I hope I haven't given you the idea that I've just put my data into a model and pushed some buttons. My convictions come from working in a non-FIL group that does almost exclusively MMN/oddball work. I've seen in 3 different datasets from three different colleagues across EEG, MEG, and fMRI the exact same outcome - namely that oddballs elicit activation in primary sensory, salience, and middle/inferior frontal regions; and second, that the comparison of deviants to standards elicits a massive change in frontal vs all backwards connections. This is what I mean by having a theory that makes strong directional hypotheses.

      The oddball task is extensively computationally validated, with a rich multimodal literature, and is thus very well suited for DCM. I just don't want you to get the idea that I am blindly advocating DCM because I happened to get a nice result in one analysis. I am aware that DCM can also be a fragile thing and is probably not ready for mass consumption or fluffy social neuroscience 'models'. I also simultaneously agree with the underlying neural complexity you advocate, and believe that it is largely irrelevant to fMRI. I've run some 300 fMRI scans personally over 4 different experiments and see time and again that mostly what we can resolve is large-scale interactions between the major macroscopic networks (salience, dmn, control, sensory). I'm skeptical of anything beyond this level of inference, which is why I say fMRI is just a basic starting point for more advanced modelling. Anyway, I am not being dismissive, your points are well taken and I will get back to you more fully on this issue. If anything the debate itself has greatly enhanced my understanding of what is at stake here.

      Happy holidays,
      Micah (2/2)

  7. Hey Tal-

    Great response. I'm on vacation now in Italy so it will be a while before I can formulate a full response. To be clear, I don't think anyone wants to argue against the validation approach you suggest (hard to argue against more data or analysis), but I just can't help but feel like things are not quite as arbitrary as you suggest. I think there are some cases where we have much stronger priors - but you argue that these more basic examples are too 'toy' to be interesting. I guess we agree then that fMRI isn't much use for actual biophysical modelling or complex paradigms, but we don't agree on what can be done at the lower, more conservative end of the scale. I do just want to touch on the issue of statistical significance. On the one hand I'm not a big fan of NHST, and like DCM largely because it introduces a Bayesian model selection procedure to users who might otherwise not encounter it.

    But it seems like an abuse of statistics to suggest that, because our retained activations change drastically with the statistical threshold, the underlying phenomena are just as arbitrary. Statistical thresholds are just conventions, and I don't see why (if using NHST) we should be increasing or decreasing thresholds in the arbitrary way you suggest, given that the consensus seems to be approximately pFWE < 0.05 (peak or cluster). I'm also not sure how much farther we can really continue this debate without pulling out some data, because there is a lot of assertion of opinion on both sides. You say that most of my activations probably have CIs overlapping zero; I say that my activations provide almost localizer-specific regional activations of all a priori expected areas. We're talking about a very simple high/low somatosensory stimulation with deviants and standards, not some theory of mind task. In pilots, the group-level results are almost spatially identical to the subject-level results. There is very little variation in this response and it fits quite well with the literature. So I think we have quite strong priors to take a reasonable approximation of those areas into a causal model. (1/2)

  8. Oh damn, my comments went in backwards. That is what I get for arguing on the internet first thing in the morning :(

  9. Last thought - it is very easy to download about 500 task-related motor activation data sets from HCP and do DCM on them. Why don't we collaborate to do so and run the exact analysis you are proposing? I have a colleague who did this in 300 participants viewing the Heider-Simmel illusion and showed that modulation of forward vs backward connections by more complex stimuli was extremely consistent across the population. One drawback however is that with DCM we cannot compare models with different VOI data. Anyway, the data is out there for us to play with if you like...

  10. Hi this is Anil Seth. What an excellent debate and I hope I can add a few quick thoughts of my own since this is an issue close to my heart (no pun intended re vascular confounds).

    First, back to the Webb et al paper. They indeed show that a vascular confound may affect GC-FMRI but only in the resting state and given suboptimal TR and averaging over diverse datasets. Indeed I suspect that their autoregressive models may be poorly fit so that the results rather reflect a sort-of mental chronometry a la Menon, rather than GC per se.

    In any case the more successful applications of GC-fMRI are those that compare experimental conditions or correlate GC with some behavioural variable (see e.g. Wen et al.). In these cases hemodynamic and vascular confounds may subtract out.

    Interpreting findings like these means remembering that GC is a description of the data (i.e. DIRECTED FUNCTIONAL connectivity) and is not a direct claim about the underlying causal mechanism (e.g. like DCM, which is a measure of EFFECTIVE connectivity). Therefore (model light) GC and (model heavy) DCM are to a large extent asking and answering different questions, and to set them in direct opposition is to misunderstand this basic point. Karl, Ros Moran, and I make these points in a recent review.

    Of course both methods are complex and 'garbage in, garbage out' applies: naive application of either is likely to be misleading or worse. Indeed the indirect nature of fMRI BOLD means that causal inference will be very hard. But this doesn't mean we shouldn't try. We need to move to network descriptions in order to get beyond the neo-phrenology of functional localization. And so I am pleased to see recent developments in both DCM and GC for fMRI. For the latter, with Barnett and Chorley I have shown that GC-fMRI is INVARIANT to hemodynamic convolution given fast sampling and low noise. This counterintuitive finding defuses a major objection to GC-fMRI and has been established both in theory and in a range of simulations of increasing biophysical detail. With the development of low-TR multiband sequences, this means there is renewed hope for GC-fMRI in practice, especially when executed in an appropriate experimental design. Barnett and I have also just released a major new GC software package which avoids separate estimation of full and reduced AR models, removing a serious source of bias afflicting previous approaches.
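    A rough sketch of the invariance claim (my own toy simulation, not Anil's software; the coupling strengths and the gamma-shaped kernel are made up for illustration): a bivariate VAR in which x drives y is smoothed by the same kernel, and a simple least-squares GC estimate still recovers the x → y asymmetry, assuming fast sampling and low noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bivariate VAR(1) in which x drives y at a one-sample lag.
T = 5000
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.standard_normal()

# Convolve both series with the same toy gamma-like kernel as a crude
# stand-in for hemodynamic smoothing (not a calibrated HRF).
k = np.arange(20.0)
hrf = k ** 3 * np.exp(-k)
hrf /= hrf.sum()
bx = np.convolve(x, hrf)[:T]
by = np.convolve(y, hrf)[:T]

def gc(src, dst, p=5):
    """Granger causality src -> dst: log ratio of residual variances of
    the reduced (dst history only) vs full (dst + src history) AR(p) fits."""
    Y = dst[p:]
    dst_lags = np.column_stack([dst[p - j: len(dst) - j] for j in range(1, p + 1)])
    src_lags = np.column_stack([src[p - j: len(src) - j] for j in range(1, p + 1)])
    def resid_var(X):
        beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return np.var(Y - X @ beta)
    return np.log(resid_var(dst_lags) /
                  resid_var(np.column_stack([dst_lags, src_lags])))

# The x -> y asymmetry survives convolution with a shared kernel.
print(gc(bx, by), gc(by, bx))
```

In practice the invariance holds only in the regime Anil describes (fast sampling, low noise, shared kernel); region-specific kernels or slow TRs reintroduce the confound.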

    Overall I am hopeful that we can move beyond premature rejection of promising methods on the grounds they fail when applied without appropriate data or sufficient care. This applies to both GC and fMRI. These are hard problems but we will get there.

  11. I am a bit concerned about how GC was computed in that paper:
    - they computed pairwise GC, which, especially for datasets composed of many short time series, will result in a high number of false positives.
    - why did they consider only regions with a high asymmetry in GC? A zero GC balance could result from no interaction or from high outflow plus high inflow, which are quite different scenarios.

    Using our approach to retrieve and deconvolve HRF from resting state, I plotted the HRF shape on the brain:

    It looks like the regions with higher, slower, and wider HRFs are those that you identify as GC sinks (venous overlap), and this would explain why GC maps obtained after deconvolution (and with a conditioned approach) are different from the ones they find (Wu et al. in the references of the paper).

  12. Excellent discussion, with some great arguments, a number of insights for me, but also a deja vu or two.

    Concerning the initial post: I wonder about the argumentation... Showing that a method has high sensitivity for A (here: large vasculature) does not in any way establish it lacks the specificity to pick up B (here: neuronal interactions). You could also use almost any ICA variant on the same data to find the same (or even better?) vasculature maps. Does that establish that ICA cannot be used for rs-fMRI analysis to find neuronal resting state networks? Obviously (to me, at least): no, it doesn’t establish that.

    From a historical perspective, when we first worked on Granger Causality Mapping (GCM) around 2004 (thanks for your appreciation for the work, Russ!), we wanted to answer pretty much the following questions:

    i) What could GCM add, why do it at all?
    Our best answer: It adds structural model exploration that can (should?) precede confirmatory connectivity analysis, such as DCM. Confirmatory analysis selects between hypotheses or models containing a few areas (and for fMRI each hypothesis contains the same areas). It runs the risk of selecting one that compares poorly with many non-tested models that contain 1 or (many) more crucial missing regions (the missing region problem). Moreover: a causal mapping approach can correctly identify the boundaries between differently interacting nodes, which might have been lumped together based on a thresholded GLM, and can identify nodes that did not show up in a GLM at all. We have re-iterated this point since, and after that (and after Karl's kind agreeing reply), methods to allow DCM to test models with more than 3-4 regions have started to appear. I am happy to now see FIL lab affiliates (read: Micah) say things like 'a previous limitation has simply been that it's incredibly frustrating and time consuming to estimate interesting models with more than 2 or 3 nodes' and 'DCM requires strong hypothesis, which is both it's strength and weakness, and fits extremely well with exploratory data-driven methods', agreeing with our original point almost 10 years ago. Likewise it is interesting to see in this discussion (e.g.: Tal's posts) that the missing region problem is still one of the chief concerns scientists have in using confirmatory approaches.

    ii) Can we pick up 10-100ms neuronal (conduction and processing) delays through 1000’s ms hemodynamics delays and in the presence of remote vasculature with 100’s ms delays?
    Our best answer came from first testing this in simulation and was, perhaps surprisingly: yes. However, anything with more delay is picked up with more statistical power (more sensitivity; see Figure 3 in our paper). Shorter TRs (as in the HCP data) increase power altogether and separate long and short delays more effectively. From that fact, and from seeing early 'raw' GCMs abundant in large vasculature, as expected, we devised a strategy to filter out vasculature: 'instantaneously-filtered' GCMs. In short: short-delay components (such as neuronal interactions) show up both in instantaneous correlations and in G-causal (delayed) terms; large-delay components (such as vasculature) show up almost exclusively in the G-causal terms. So, filter out delayed components that do not have an instantaneous correlation term. That proved very effective. Conversely: if you really want to look at vascular delays, unfiltered maps are a very sensitive instrument, which is what the cited PLoS ONE study seems to have done. No surprise here. And, if it is the study target, mapping out large vasculature can be very useful, e.g. as a danger-zone map for other methods.
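    The filtering idea can be sketched in a few lines (a toy illustration with made-up signals and an arbitrary threshold, not the actual GCM implementation): G-causal links between region pairs with negligible zero-lag correlation are zeroed out, removing long-delay "venous" links while keeping short-delay ones.

```python
import numpy as np

def instantaneous_filter(gc_map, series, thresh=0.1):
    """Zero out G-causal links between region pairs whose zero-lag
    correlation is negligible. The 0.1 threshold is arbitrary, for
    illustration only."""
    corr = np.corrcoef(series)  # zero-lag correlation matrix
    return np.where(np.abs(corr) >= thresh, gc_map, 0.0)

# Toy signals: region 1 lags region 0 by one sample ("neuronal"),
# region 2 lags region 0 by ten samples ("venous drainage").
rng = np.random.default_rng(1)
base = np.convolve(rng.standard_normal(2010), np.ones(3) / 3)[:2000]
s0 = base[10:1990]
s1 = base[9:1989] + 0.1 * rng.standard_normal(1980)
s2 = base[0:1980] + 0.1 * rng.standard_normal(1980)

# Pretend an unfiltered GCM found both outgoing links from region 0.
gc_map = np.array([[0.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
filtered = instantaneous_filter(gc_map, np.vstack([s0, s1, s2]))
# The short-delay link survives; the long-delay "vein" link is removed.
print(filtered[0, 1], filtered[0, 2])
```

The short-delay pair stays correlated at lag zero because the smoothed source signal is autocorrelated over a few samples, while a ten-sample delay destroys the zero-lag correlation entirely; that asymmetry is what the filter exploits.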

  13. iii) How do we deal with the varying HRF confounds?
    Our best answer: Always (at the very least) look for experimental modulation of G-causality. Period. We have made and re-made this point many times, but it is still often missed. This also lets experimental manipulations drive causal understanding, as they should, rather than just assumptions added to statistics that come from the data. Model-based deconvolution can be an additional tool here but, as we discussed, this is yet another model-based endeavor, and inaccurate HRF models can lead to new errors rather than remove old ones (see Daniel's posts above). Incidentally, we wrote this work mainly as a reply to Olivier David's study (cited by Marta above) to argue there are a few gross oversights in that study: first, they did not use experimental modulation in applying G-causality and, second, when they added the deconvolution model of DCM to G-causality it performed perfectly against the gold standard, unlike DCM, which underperformed, even on 3 regions (Marta: why cite this study to say GC does not perform well!?). Interestingly, even in deconvolution modeling, one can pit more or less parametric models against each other. As Pedro Valdes-Sosa, Anil Seth, and I summarized, various AR and ARMA models (used for G-causality) have invariances that theoretically render them unaffected by things like varying HRF convolution kernels. Anil's nice work subsequently showed this can indeed be the case for simple AR models if one samples fast enough (see Anil's post above).
    I second (third, fourth?) Niko’s call for a causal analysis competition. And, yes, the generating model must be realistic in the detail that we know and agree on, whether we a priori believe it will be relevant detail or not. Not being close to the estimating model, as Niko posed, is one important aspect. Actually containing the aspects we *know* to be present in the brain (such as 10ms conduction delays) rather than assuming these will not matter anyway, is another important aspect. To illustrate: when Steve Smith showed me the manuscript of the network modeling approaches paper, I told him I thought the conclusion that the sensitivity of G-causality methods is low is misleading. Steve used the DCM model for fMRI to generate the data, which does not have any neuronal delays. It can’t, it doesn’t use delays (in fMRI). The model necessarily puts any observed delays in the hemodynamics in the inversion. It has been designed to do that. So, if you generate data with it, you are generating data with only instantaneous dependencies. It is obvious connectivity models tuned to detect instantaneous dependencies or that use them in their causal modeling will do well on these simulations. And that those tuned to delayed interactions will (correctly) show nothing. I told Steve I expected the results to be different if, instead, he would use the neuronal DCM model for EEG/MEG with suitable neuronal delays (then chained with the DCM hemodynamic model, of course). Steve said he did not think it would make a difference. I still think it will, and I think Anil’s paper has shown as much with an even more complex neuronal simulation model with delays.
    I, for one, am still simultaneously critical and enthusiastic about both GCM and DCM. The methods currently seem to be converging as each is taking up the strengths of the other. I agree external validation is important and should be a new norm: test the same (set of) model(s) with different modalities over several studies. But still both will be potentially very useful if used carefully and quite dangerous if used indiscriminately. Just like any other even moderately complex method. The thing is, I think that if we chose not to use them at all we would lose some of our best current instruments to help understand the wildly complex multiscale non-linearly coupled network that is the brain…

  14. Btw: Niko, to your comment on cross-validation of connectivity model fits on independent test sets: this has been done by Jason Smith and Barry Horwitz. Their switching linear dynamic system essentially learns a different AR connectivity model for each experimental condition from the training data. Then it predicts the experimental condition at each time point in the unseen test data. This has two more important side implications for the discussion. 1) Experimental design and modulation win again. 2) The prediction level on the test set indeed works as a global measure of fit immune to overfitting. We could accept, say, 80% correct as a minimum to accept any model if we were so inclined.
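      The scheme can be sketched roughly as follows (a toy stand-in for the switching linear dynamic system, with hypothetical coupling matrices): fit one AR model per condition on training data, then label an unseen test run by whichever condition's model predicts it best.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(A, T):
    """Simulate a 2-node VAR(1) with coupling matrix A."""
    x = np.zeros((T, 2))
    for t in range(1, T):
        x[t] = A @ x[t - 1] + 0.5 * rng.standard_normal(2)
    return x

# Two 'conditions' with the coupling direction flipped (made-up values).
A_cond = {0: np.array([[0.5, 0.0], [0.6, 0.5]]),
          1: np.array([[0.5, 0.6], [0.0, 0.5]])}

def fit_ar(x):
    """Least-squares fit of x[t] = A x[t-1] + noise."""
    B, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
    return B.T

# Learn one connectivity model per condition from training data.
models = {c: fit_ar(simulate(A, 2000)) for c, A in A_cond.items()}

def classify(x):
    """Assign a test run to the condition whose AR model predicts it best."""
    errs = {c: np.mean((x[1:] - x[:-1] @ A.T) ** 2) for c, A in models.items()}
    return min(errs, key=errs.get)

test0 = simulate(A_cond[0], 500)
test1 = simulate(A_cond[1], 500)
print(classify(test0), classify(test1))
```

Out-of-sample prediction accuracy of this kind is exactly the overfitting-immune global fit measure described above: a model family that cannot beat chance on held-out runs has no claim to have captured the condition-dependent connectivity.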