Thursday, December 19, 2013

A discussion of causal inference on fMRI data

My Facebook page was recently home to an interesting discussion about causal inference that was spurred by a recent article that I retweeted:

Effective connectivity or just plumbing? Granger Causality estimates highly reliable maps of venous drainage.

What follows is the discussion that ensued.  I have moved it to my blog so that additional people can participate, and so that the discussion will be openly accessible (after removing a few less relevant comments).  Please feel free to comment with additional thoughts about the discussion.

  • Peter Bandettini Agree! Latencies due to veins 1 to 4 sec. Latencies due to neurons < 300 ms. Vein latency variation over space dominates.
  • Russ Poldrack it's amazing how we have known for a good while now that Granger causality is useless for fMRI, yet people continue to use it
  • Peter Bandettini There are ways to look through the plumbing.
  • Tal Yarkoni My impression is that as bad as the assumptions you have to make for GCM may be, the assumptions for DCM are still worse. The emerging trend seems to be towards a general skepticism of causal modeling of observational imaging data, not a shift from one causal method to another.
  • Micah Allen In terms of hemodynamics, I'm not aware of any assumptions different from those in all canonical HRF models. DCM uses a slightly updated version of the balloon model. Further, unlike GCM the parameters estimated by DCM have been repeatedly validated using multimodal imaging. Although I will bring this up at the next Methods meeting.
  • Russ Poldrack Tal - I don't agree regarding blanket skepticism about causal inference from observational data. It is definitely limited by some strong assumptions, but under those assumptions it works well, at least on simulated data (as shown e.g. by the recent work from Ramsey and Glymour). But I do agree regarding general skepticism about DCM - not so much about the theory, which I think is mostly reasonable, but about the application to fMRI (especially by people who just punch the button in SPM without really understanding what they are doing). I thought that the Lohmann paper did a good job of laying out many of the problems, and that Karl et al.'s response was pretty much gobbledygook.
  • Vince Calhoun I do not think even GCM can be considered useless for fMRI...however it is critical to keep the assumptions (what are the parameters capturing) and the data (what causes hemodynamic delay, etc) in mind in evaluating its utility and especially in interpreting the results (e.g. does this approach enable us to 'see through' the delay or just to characterize it). The post analysis (single or group) is critical as well and will often float or sink the boat. Every method out there has lots of holes in it, but in the right hands most of them can be informative.
  • Tal Yarkoni Russ, I think the question is whether you think those strong assumptions are ever *realistic*... E.g., looking back at the Ramsey & Glymour sims, my impression is that it's kind of an artificial situation in that (a) the relevant ROIs have already been selected (typically one wouldn't have much basis for knowing that only *these* 11 ROIs are relevant to the causal model!), (b) the focus is on ROI-level dynamics when it's not at all clear that connectivity between ROIs is a good proxy for neuronal dynamics (I understand this is a practical assumption made of necessity, but we don't actually know if it's a reasonable one!), and (c) I think the way error is quantified is misleading--if you tell me that you only have 3 mislabeled edges out of 60, that doesn't sound like much, but consider the implication, which could be, e.g., that now we think ACC directs activity in IPC rather than vice versa--which from a theoretical standpoint could completely change our understanding, and might very well falsify one's initial hypothesis!

    And this still ignores all of the standard criticisms of causal modeling of observational data, which still apply to neuroimaging data--e.g., the missing variable problem, the labeling problem (if you define the boundaries of the nodes differently, you will get very different results), the likelihood that there's heterogeneity in causal relationships across subjects, and so on. So personally I'm very skeptical that once you get beyond toy examples with a very small number of nodes (e.g., visual cortex responds to a stimulus and causes prefrontal changes at some later time), it's possible to say much of anything *positive* about causal dynamics in a complex, brain-wide network. That said, I do agree with Vince that any method can have utility when handled carefully. E.g., a relatively benign use of causal methods may be to determine whether a hypothesized model is completely *incompatible* with the data--which is a safer kind of inference. But that's almost never how these techniques are used in practice...
  • Tyler Davis Pragmatically, if you take all of the nonsense that happens with SEM in the behavioral literature, and then multiply by a factor of two to account for the nonsense due to mismodeling/misinterpreting hemodynamics, it makes it tough to believe an fMRI study with causal modeling unless you know the authors are good at math/know what they are doing or you were inclined to believe the results already.
  • Russ Poldrack We are using a measurement method that intrinsically is at least an order of magnitude from the real causal mechanisms, so it would be folly to have too much faith in the resulting causal inferences. But to the degree that we can validate them externally, the results could still be useful. For me the proof is always in the pudding!
  • Anastasia Christakou What *is* the pudding? If you have a biological hypothesis and your model-driven analysis is in line with it, what more can you do to "prove" it? Are you talking about *independent* external validation?
  • Russ Poldrack the pudding in this case is some kind of external validation - for example, if you can show me that graphical model/DCM/GCM parameters are believably heritable, or that model parameters from fMRI are predictive of results in another domain (e.g., EEG/MEG, behavior).
  • Micah Allen Tal Yarkoni I don't really find the ROI/model-selection problem with DCM overly troubling, although I do see where you are coming from. As Karl is fond of saying, it's your job as the researcher to motivate the model space as it's the hypothesis you are testing. The DCM is only valid in light of that constraint. I find it a bit unfortunate that there is this divergence between the two schools of thought; obviously in cases where you have no clue what the relevant hypothesis should be, tools like graph theory and functional correlations can be an excellent starting point. By definition, the validity of any DCM depends upon the validity of interpreting the experimentally-induced mass-univariate activations. DCM is built to assess specific directional hypotheses between regions activated by some experimental manipulation - it is unsurprising that if the selected task activates brain regions non-specifically or the ROIs are extracted haphazardly then the DCMs are equally invalid. But this isn't an indictment of DCM, it's rather a failure to motivate a relevant model space. If working in unclear or murky territory, Karl is the first to say that a graph theoretical or connectomic approach can be a first step towards motivating a DCM. This is all circumstantial to the basic issue that GCM just plain gets the direction of connections wrong.

    I've read quite a bit of the DCM and GCM literature, and I actually do agree with you that many of the early papers are plagued by extremely toy examples. The end result was also always the same, that everything connected to everything. This is part of why the best DCM papers are those based on computational models, such as Hanneke den Ouden's papers on motor learning, where the estimated parameters and winning model families are themselves of interest. DCM is extremely well equipped to assess the evidence that say, motor cortex updates premotor with the probability of a stimulus being a face or a house. This can't be said for any other method - DCM is fundamentally built to assess brain-behavior relationships in this way. All the connectomics in the world won't tell you much beyond "these brain regions are really important hubs and you probably shouldn't knock them out". To go the rest of the way you need a hypothesis driven methodology. 

    That being said, a previous limitation has simply been that it's incredibly frustrating and time consuming to estimate interesting models with more than 2 or 3 nodes. This has resulted in many small sample, 'toy' DCM papers that are generally not very interesting. The new post-hoc model optimization routines largely ameliorate these practical concerns - in my upcoming paper I easily estimate models with all 6-7 nodes activated by my mass-univariate analysis. I have a colleague estimating similar models in 600 scans from the HCP database. This means we will begin to see more intuitive brain-behavior correlations.

    As for the debate between Lohmann and Friston - Lohmann's critique is just plain factually wrong on the details of the post-hoc procedure, and on several other details. Further Lohmann seems to fundamentally misunderstand the goal of model development and model selection. So I'm not really convinced by that. 

    DCM requires strong hypotheses, which is both its strength and weakness, and fits extremely well with exploratory data-driven methods that actually work as advertised (unlike GCM). We've not even gotten into DCM for MEG/EEG (on which I am not an expert). The neural mass models there are extremely fascinating, going far beyond modelling mere data features to actually encapsulating a model of the neurophysiological brain dynamics underlying observed responses. DCM for fMRI is itself at best just a starting point for actual neural modelling in M/EEG.

    Finally, the greatest strength of DCM is the Bayesian model selection procedure. Don't like the canonical HRF? Substitute your own HRF and compare model evidences. Want to combine DTI, MEG, and LFP? DCM can do that.
  • Russ Poldrack Micah - what about Lohmann's comments regarding model fit? I read Karl's response as saying that it doesn't matter if your model only accounts for 0.1% of the variance, as long as it beats another model in the model selection procedure. That seems mathematically correct but scientifically wrongheaded. FWIW We gave up on DCM soon after starting when we found that the models generally fit the data really poorly.
  • Micah Allen Were you doing DCM for fMRI or MEG? In general I've been told that the MEG variance explained tends to be far higher than in the fMRI DCM - close to 90% in some cases, compared to 20-60% in the best case for fMRI. An important caveat with DCM for fMRI is that it is not ideal for fast purely event related designs. If you have a fast stimulus driving factor and a slow (e.g. attention) factor you should see variance explained in the 20-60% range. In my paradigm I only get about 6-12%, likely because I have two fast alternating events (stimulus intensity and probability).

    On the model fit question, I think this comes down to what the variational free energy parameter is actually optimizing. I am not an expert on the calculus behind VFE, but I've been to the DCM workshops and tried to understand the theory as much as my limited calc background allows. Essentially, VFE is a measure that weights the fit of the model (how well the model predicts the data, in a Bayesian posterior sense) by the model complexity. I found this bit of Karl's response helpful:

    "Generally speaking, a model with a poor fit can have more
    evidence than a model with a good fit. For example, if you give a
    model pure measurement noise, a good model should properly identify
    that there is noise and no signal. This model will be better than a
    model that tries to (over) fit noisy fluctuations."

    So the model evidence accounts for fit, but goes beyond it. As Karl points out, a model with perfect fit can be a very poor model, so the VFE is an attempt to balance these things. Beyond that all I can say is that I was taught to always look at the variance explained as a diagnostic step - generally the rule of thumb here is that if your variance explained is really low, you probably made a mistake somewhere or the paradigm is very poorly optimized for DCM. I think in a practical sense, Lohmann has a point here, as there are diagnostic tools (like spm_dcm_explore) but no real guidelines for using them (when to decide you've mucked up the DCM). I think strictly speaking the VFE does the job, but only in the (perhaps too ideal) world where the models make sense.

    I found this blog really useful for understanding why DCM sometimes fails to give a good variance explained for event related paradigms:
    What follows is a rough but hopefully didactically useful introduction into the ...
  • Russ Poldrack this was with fMRI data (with Marta Isabel Garrido) - but it was with fast ER designs, which are pretty much all we do in my lab, so that might explain part of the poor fit. Karl's point about penalizing model complexity is totally reasonable, but I'm not sure that it really has anything to do with VFE per se - VFE is just one of many ways to penalize for model complexity. (of course, it's Bayesian, so it must be better)
  • Daniel Handwerker The other trouble with the current DCM approaches is that they rely heavily on assumptions of what hemodynamic responses should be - including using the balloon model in a way it was never intended to be used. As part of a commentary on hemodynamic variation (Neuroimage 2012), I ran a small side analysis that showed how a modest, but very believable, difference in the size of the post-peak undershoot can flip the estimated direction of causality using DCM. This was a bit of a toy analysis, but it highlights a real concern about how assumptions in building low-level parts of SPM's DCM model can really affect results.
  • Micah Allen Daniel, I'm not sure I'd agree that is a limitation. The nice thing about DCM is that, since it is built around a Bayesian hypothesis testing architecture, any competing HRF can be substituted for another and the resulting model estimates compared. So you could easily run a DCM with your HRF vs the canonical - if yours wins it would be a good argument for updating the stock model. The HRF part of DCM is totally modular, so a power user should find it easy to substitute a competing model (or multiple competing models). This point was made repeatedly at the DCM workshop in Paris last year.
  • Tal Yarkoni Micah, I think you're grossly underestimating the complexities involved. Take the problem of selecting ROIs: you say it's incumbent on the researcher to correctly interpret the mass univariate results. But what does that mean in practice? Estimation uncertainty at any given voxel is high; if you choose an ROI based on passing some p value threshold, you will often have regions in which the confidence interval's lower bound is very close to zero. With small samples, you will routinely miss regions that are actually quite important at the population level, and routinely include other regions that are non-zero but probably should not be included. If you use very large samples, on the other hand, then everything is statistically significant, so now you have the problem of deciding which parts of the brain should be part of your causal model and which shouldn't based on some other set of criteria. If you take what you just said to its conclusion, people probably shouldn't fit DCM models to small datasets period, since the univariate associations are known to be shaky.

    Even if you settle on a set of ROIs, the problem of definition is not just one of avoiding "haphazard" selection; e.g., Steve Smith's work in NeuroImage nicely showed that even relatively little mixing between different nodes' timeseries will seriously muck up estimation--and that's with better behaved network modeling techniques, in a case where we know what the discrete nodes are a priori (because it's a simulation), and using network structures that are much more orderly than real data are likely to be (Smith et al's networks are very sparse). In the real world you typically don't know any of this; for example, in task-related activation maps, the entire dorsal medial surface of the brain is often active to some degree, and there is no good reason to, say, split it into 4 different nodes based on one threshold versus 2 nodes at another--even though this is the kind of choice that can produce wildly different results in a structural model.

    As for the hemodynamic issue: the problem is not so much that the canonical HRF is wrong (though of course we know it often is--and systematically so, in that it differs reliably across brain regions)--it's that you compound all of its problems when your modeling depends on deconvolved estimates. It's one thing to say that there is X amount of activation at a given brain region when you know that your model is likely to be wrong to some degree, and quite another to say that you can estimate complex causal interactions between 12 different nodes when the stability of that model depends on the same HRF fitting the data well in all cases.

    As to the general idea that DCM depends on strong hypotheses; this sounds great in principle, but the problem is that there are so many degrees of freedom available to the user that it is rarely clear what constitutes a disconfirmation of the hypothesis versus a "mistake" in one's initial assumptions. Of course to some degree this is a problem when doing research of any kind, but it's grossly compounded when the space of models is effectively infinite and the relationship between the model space and the hypothesis space is quite loose (in the sense that there are many, many, many network models that would appear consistent with just about any theoretical story one can tell).

    Mind you, this is an empirical question, at least in the sense that one could presumably quantify the effect of various "trivial" choices on DCM results. Take the model you mention with 6-7 nodes: I would love to know what happens if you systematically: (a) fit models that subsample from those nodes; (b) add nodes from other ROIs that didn't meet the same level of stringency; (c) define ROIs based on different statistical criteria (remember that surviving correction is a totally arbitrary criterion as it's sample size dependent); (d) randomly vary the edges (for all of the above). The prediction here is that there should be a marked difference in terms of model fit between the models you favor and the models generated by randomly permuting some of these factors--or, that when other models fit better, they should be theoretically consistent in terms of interpretation with the favored model. Is this generally the case? Has anyone demonstrated this?
  • Rajeev Raizada I read the above discussion just now, and found it very interesting indeed. I have never tried playing with DCM, but I have worked in neural modeling in the computational neuroscience / neural circuits sense. In general, I am skeptical of complex and sophisticated models, especially models which congratulate themselves on their complexity and sophistication, or, in the case of some of the models mentioned immediately above, models which congratulate themselves on the sophistication of their self-quantification of complexity. 

    A question for the DCM/GCM-aficionados out there: is there a single instance of such approaches generating a novel insight into brain function which was later independently validated by a different method? It looks to me as though there are a lot of instances of such approaches producing interesting-looking outputs which seem reasonable given what we think we know about the brain. But the road to hell in science is paved with interesting and reasonable-sounding stories.

    This last line is partly troll-bait, but I'll throw it out there anyway as I think it might be a valid comparison: are DCM/GCM approaches a bit like the evolutionary psychology of fMRI? Sources of interesting just-so stories?
  • Micah Allen Out of the office at dinner now so I'm afraid I must leave this debate for now. You make excellent points, Tal, though I'm not sure I agree with the picture you paint of fMRI effects being quite so arbitrary. At my former center we stuck to an N = 30 guideline and found this to be an excellent compromise, neither under nor over powered. In my oddball paradigm I get extremely anatomically constrained activations that fit well with the literature. I extract 6mm spherical VOIs from each peak. This seems like a pretty reasonable way to characterize the overall effect, but I do think the analysis you suggest would be interesting in any context, DCM or otherwise. Sorry so brief - on my phone at dinner!
  • Micah Allen Rajeev Raizada a good question but there is a lot of ongoing intracortical and multimodal validation work being done with DCM. Check out Rosalyn Moran's recent papers 
  • Rajeev Raizada Sounds interesting. Are you talking about this paper, which finds that boosting acetylcholine increases the size of the mismatch negativity? An interesting result, but it's not entirely clear to me that the theoretical overlay of predictive-coding and DCM adds a great deal, or that the empirical result adds much support to the theory. After all, this paper from 2001 already showed the converse result (reduced ACh gives reduced MMN), and there is a ton of evidence showing that ACh increases cortical signal-to-noise. However, I may be focussing on a different paper than you were referring to.
  • Daniel Handwerker As Tal notes, the issue with hemodynamic variation is that it varies quite a bit around the brain and there is no existing model in a Bayesian or any other framework that can solve this degree of variation. This isn't a problem if one's analysis is robust to much of the variation, but the deconvolution step in DCM amplifies its sensitivity to the selected hemodynamic model. If two brain regions having slightly different relative undershoots is enough to make the model fail, then when can we trust that it works?

    Another way of saying that a power user can make their own hemodynamic model is to say that the model that is default with the SPM DCM software doesn't work in real-world situations, but that doesn't preclude someone in the future from making a model that does work. This might be true, but it does little to increase confidence in the accuracy of current DCM studies. 

    As others have noted, simpler causality measures often have their assumptions more out in the open and someone can design a study that is robust to those assumptions. If a study was designed well enough that it could work within the complex assumptions of DCM, I can't think of a situation where it wouldn't also meet the assumptions of simpler approaches.
  • Micah Allen No, the option is there for those who want to show that there is a better model. The canonical balloon HRF enjoys a great deal of robustness, which is why it's the function of choice in nearly all neuroimaging packages. Further, DCM has been shown to be very robust to lag differences of up to 2 seconds. I hope you guys are not out there rejecting DCM papers on these basic principles without at least reading the actual validation papers behind it. Your toy example shows that DCM is sensitive to the HRF of choice, not that the canonical is an inappropriate model. If there is that big of a smoking gun problem with DCM, I'd probably suggest you publish to that effect to prevent a lot of people from wasting time!
  • Micah Allen Just to try to close on a positive note, I find this review by Stephen Smith to be exceptionally balanced, cutting through much of the hyperbole surrounding different connectivity methods. I think he really does a great job pointing out the natural strengths and weaknesses of each of the available approaches, advocating for a balanced viewpoint. Also don't miss the excellent footnote where he describes views from 'scientist A and scientist B' - pretty obvious who he's referring to there.
  • Marta Isabel Garrido This is one of the best validation papers showing robustness of DCM for fMRI (and also how GC fails terribly)
  • Daniel Handwerker Micah, when I say that a fundamental assumption of the current implementation of DCM fails under very realistic conditions to such an extent that it will flip the result, improving DCM is not my only option. The other option is to not personally use DCM and treat DCM findings which rest on these fundamental assumptions with a good bit of skepticism. If I had a specific project that would clearly benefit from the DCM approach, I might reassess my stance, but until then, it's the job of DCM users to make the method/interpretations more robust.
    I'll also note that the robustness of the canonical HRF is analysis dependent. My 2004 Neuroimage paper showed that it's robust for finding significance maps using a GLM, but it is less robust if you're taking magnitude estimates into a group analysis. Still, the robustness of the canonical HRF in linear regression based studies has been shown to be pretty good. That doesn't mean the same model is robust when used as part of a completely different analysis method. Any causality measure is probably going to be more sensitive to GLM shape compared to intravoxel statistical approaches. My review that I mentioned earlier is a published example showing how DCM can fail. It's an example rather than a full examination of the method's limitations, but it's enough to cause me some concern.
  • Nikolaus Kriegeskorte this is a fascinating thread. i'm left with three thoughts.
    (1) the paper by webb et al. does not seem to invalidate granger causality inferences based on comparisons between different experimental conditions. it would be good to hear Alard Roebroeck's take on these issues. (2) it would be good to have an fMRI causality modelling competition in which several simulated data sets (with known but unrevealed ground truth and realistic complexities) are analysed by multiple groups with multiple techniques. (3) the only thing that puts the proof in the pudding for me is prediction. in decoding analyses, for example, the use of independent test sets (crossvalidation) ensures that incorrect assumptions make us more likely to accept the null hypothesis. what is the equivalent of this in the realm of causal analyses?
  • Russ Poldrack Niko - agreed, the only reasonable use of Granger causality analysis with fMRI that I have seen is the work by Roebroeck that showed differences between conditions in G-causality within the same region, which mostly obviates the issues regarding HRFs (unless you think the latency is changing with amplitude). if only I was FB friends with Alard! And I second both your call for a competition (though the devil is in the details of the simulated data) and the ultimate utility of crossvalidation.
  • Jack Van Horn Niko: something like your item #2 as a challenge for the OHBM Hackathon. #justsayin
  • Nikolaus Kriegeskorte jack, that would be great. for the next meeting, we have an organising team with a plan already. could consider this for hawaii. it would be good to hear opinions from Klaas and Karl (whom i'm missing on facebook), Alard Roebroeck, and Stephen Smith -- as well as the commentators above.
  • Nikolaus Kriegeskorte Russ Poldrack is right that the devil is in the details of the simulated data. each method may prevail when its model matches the one that generated the data. it will therefore be key to include realistic complications of the type described by tal, which none of the methods address. the goal could be to decide about each of a set of causal claims that either do or do not hold in the data-generating ground-truth model.
  • Nikolaus Kriegeskorte I'd like to hear Rik Henson's take on this entire thread.
  • Micah Allen This has been a very informative thread for me - wish Facebook posts could be storified as this is an important debate. I would love to see it formally continued at OHBM. I will try to get Karl's opinion on these issues at the next methods meeting. Over beers last night several members of that group expressed that some of this validation was underway, and that DCM is undergoing significant revision to address many of these issues. Still it's clear that there has not been enough discussion. Also as a side note I agree that the GCM paper in my blog post does not necessarily implicate between condition differences. I think the answer to some of these issues is inevitably going to come down to experimental design, as probably both GCM and DCM can be more or less robust to vascular confounds depending on the timing and nature of your paradigm. It would be nice to know exactly what those constraints are.
  • Jack Van Horn This brings to mind early 1990's concerns about the effects that vasculature and draining veins, in particular, had on measured regional BOLD activation. This was considered a real issue by some and a potential deal breaker for the future of fMRI over PET. But others felt that the HRF actually saved the day since it would be modulated by neural activity under cognitive task conditions. The issue kind of got swept under the rug for the last 15 years or so. Interesting that this is now re-emerging in the realm of functional connectivity, resting-state, and notably Granger causality. Could it be that auto-correlative Granger modeling is doing exactly what it is supposed to do? And perhaps it has simply been our poor understanding of its implications relative to actual hemodynamics that is finally catching up to us? I look forward to seeing the further discussions.
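
For readers who want to see the plumbing argument from the top of the thread in action, here is a minimal sketch in Python. It is a toy under loud assumptions, not anyone's published analysis: the sampling interval is taken to be 1 s, region X drives region Y with a neuronal lag too short to resolve at that sampling rate, and the vascular latencies (3 s for X, 1 s for Y) are modeled as pure delays rather than as a realistic HRF. The function name `granger_log_ratio` and all parameter values are made up for illustration.

```python
import numpy as np


def granger_log_ratio(source, target, order=4):
    """Log ratio of residual variances: target's own past vs. own past plus source's past."""
    n = len(target)
    y = target[order:]
    # Column k holds the series lagged by k samples, aligned with y.
    own = np.column_stack([target[order - k:n - k] for k in range(1, order + 1)])
    other = np.column_stack([source[order - k:n - k] for k in range(1, order + 1)])
    ones = np.ones((len(y), 1))

    def resid_var(design):
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        return np.var(y - design @ beta)

    restricted = np.hstack([ones, own])
    full = np.hstack([ones, own, other])
    return np.log(resid_var(restricted) / resid_var(full))


rng = np.random.default_rng(0)
n = 2000

# Assumed ground truth: X drives Y within the same 1 s sample.
neural_x = rng.standard_normal(n)
neural_y = 0.8 * neural_x + 0.2 * rng.standard_normal(n)

# Assumed vascular latencies in samples: the driver X happens to drain more slowly.
delay_x, delay_y = 3, 1
start = max(delay_x, delay_y)  # drop samples contaminated by np.roll wrap-around
bold_x = np.roll(neural_x, delay_x)[start:] + 0.1 * rng.standard_normal(n - start)
bold_y = np.roll(neural_y, delay_y)[start:] + 0.1 * rng.standard_normal(n - start)

gc_x_to_y = granger_log_ratio(bold_x, bold_y)  # the true neural direction
gc_y_to_x = granger_log_ratio(bold_y, bold_x)  # the direction implied by the plumbing
print(f"GC X->Y: {gc_x_to_y:.3f}   GC Y->X: {gc_y_to_x:.3f}")
```

In this toy the Granger measure favors Y->X, the opposite of the simulated neural direction, simply because the measured Y signal arrives earlier. Whether real data behave this badly depends on how large the vascular latency differences are relative to the effects of interest, which is exactly what the thread is arguing about.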
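
The exchange about model fit versus model evidence can also be made concrete with a small sketch. It uses BIC as a stand-in complexity penalty, which is an assumption on my part: BIC is not the variational free energy that SPM computes, but it illustrates the same trade-off in the quoted passage, namely that on pure noise a flexible model fits better yet should still lose the comparison. The helper `fit_polynomial` and the choice of polynomial models are hypothetical.

```python
import numpy as np


def fit_polynomial(t, y, degree):
    """Least-squares polynomial fit; returns (R^2, BIC) with a Gaussian likelihood."""
    design = np.vander(t, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    n, k = len(y), degree + 1
    r2 = 1.0 - rss / float(np.sum((y - y.mean()) ** 2))
    # Lower BIC is better: the first term rewards fit, the second penalizes parameters.
    bic = n * np.log(rss / n) + k * np.log(n)
    return r2, bic


rng = np.random.default_rng(1)
t = np.linspace(-1.0, 1.0, 200)
y = rng.standard_normal(200)  # pure measurement noise, no signal at all

r2_null, bic_null = fit_polynomial(t, y, degree=0)   # "there is only noise"
r2_flex, bic_flex = fit_polynomial(t, y, degree=10)  # chases the noisy fluctuations

print(f"null model:     R^2 = {r2_null:.3f}, BIC = {bic_null:.1f}")
print(f"flexible model: R^2 = {r2_flex:.3f}, BIC = {bic_flex:.1f}")
```

The flexible model explains more variance but has the worse (higher) BIC, so the comparison picks the model that explains essentially nothing. That is the sense in which a poorly fitting model can win on evidence, and also why variance explained remains worth checking as a separate diagnostic, as discussed above.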