Decision makers need more than just a description of evidence – they need to know what the evidence means for their decisions. They need to know how confident they can be, whether the evidence is strong enough to act on, and what they should realistically expect if they invest in a program or scale it to a new context. BASIE (Bayesian Interpretation of Estimates) was designed to answer these questions directly, using criteria relevant to the decision maker’s goals, in plain terms that support a yes, no, or not-yet decision.
BASIE answers questions like:
BASIE works equally well whether the estimate comes from a randomized trial, a comparison group design, or a descriptive analysis of group differences or trends over time.
Given all the evidence, BASIE produces probability statements — how likely is it that this program truly works? how likely is the true effect to exceed a meaningful threshold? — that map directly onto your decision criteria.
Getting clear, direct answers to these questions is exactly what decision makers need. Yet the standard statistical methods used in most evaluation reports (significance tests and p-values) were not designed for this purpose. Statistical significance tells you only that a random error as large as your estimate would be unlikely if the true effect were zero. As we will see in the Picking Winners example, that is quite different from telling you whether the program really worked, whether a difference between groups is real, or whether a change over time is meaningful. It is also based on an arbitrary threshold, unrelated to the goals of the decision maker.
Work through the tabs from left to right, or jump directly to what you need.
| Tab | | What it covers |
|---|---|---|
| Picking Winners | → | A worked example showing how a local foundation’s standard approach to picking grant winners goes wrong — and how BASIE fixes it. |
| Try It Yourself | → | Explore the i3 estimates interactively. The Interpret the i3 Estimates sub-tab shows Bayesian re-interpretations alongside the original frequentist results. The Apply BASIE to Your Work sub-tab describes how BASIE can be tailored to your organization’s data and decisions. |
| Methods | → | Technical background on the BASIE framework, the statistical model, and links to the underlying research. |
Foundations, government agencies, and other funders face the same high-stakes question: of all the programs and interventions we could support, which ones are most likely to make a real difference? When rigorous evaluation evidence is available, the natural impulse is to follow the numbers — to fund the programs with the best-looking results. But reading evaluation findings correctly turns out to be harder than it looks.
Consider a local foundation that wants to improve math outcomes in its community. Foundation staff draw on 44 rigorous evaluations of math interventions supported by the U.S. Department of Education’s Investing in Innovation (i3) program. They set two explicit criteria for what “works”: a program should be very unlikely to produce a negative effect, and more likely than not to improve math scores by at least 0.10 standard deviations — a threshold they consider educationally meaningful.
Typically, foundation staff would apply a straightforward filter: select programs with estimates that are positive, statistically significant, and larger than 0.10 standard deviations. This filter yields five candidates, with impact estimates of 0.40, 0.22, 0.18, 0.16, and 0.13 standard deviations. The 0.40 estimate looks especially compelling — nearly twice as large as the next best result.
This looks like a solid, evidence-based shortlist. But is it as good as it seems?
This procedure has a conceptually subtle but consequential flaw. When winners are selected using noisy performance estimates, the winners’ true quality is almost always lower than their measured performance. Part of what makes them look best is luck. This is the winner’s curse, and it shows up across many fields:
To quantify how serious this is, we can use a simulation grounded in the i3 evidence itself. A meta-analysis of the 44 evaluations reveals that only about 8% of i3 programs have a true effect larger than 0.10 SD — a needle-in-a-haystack problem. When genuinely effective programs are rare, a filter that selects partly on luck picks up mostly hay.
To assess inferential errors, we need to compare inferences to the truth. For example, calculating a Type M (magnitude) error, the factor by which an estimate overstates the true effect, requires comparing an estimate of an intervention’s effect to its true effect. This is impossible with real data because true effects are unknown, but it is straightforward in a simulation, where true effects are generated and then estimated subject to random error.
Data inputs. The simulation uses the actual standard errors from the 44 i3 evaluations. The distribution of true effects is anchored to a random-effects meta-analysis of those evaluations, which yields μ = 0.014 and τ = 0.062 — implying that only about 8% of true effects exceed 0.10 SD.
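The 8% figure can be checked directly from the two meta-analytic parameters. The short sketch below assumes the true effects follow a normal distribution with mean μ and standard deviation τ, as a random-effects meta-analysis typically does; the variable names are illustrative.

```python
from statistics import NormalDist

MU, TAU = 0.014, 0.062   # meta-analytic mean and SD of true effects (from the text)
THRESHOLD = 0.10         # "meaningful" effect size in SD units (from the text)

# Assumption: true effects are normally distributed across programs.
true_effects = NormalDist(mu=MU, sigma=TAU)

share_above = 1.0 - true_effects.cdf(THRESHOLD)   # share with a meaningful true effect
share_negative = true_effects.cdf(0.0)            # share with a negative true effect

print(f"Share of true effects above {THRESHOLD:.2f} SD: {share_above:.1%}")
print(f"Share of true effects below zero: {share_negative:.1%}")
```

The first number comes out at roughly 8%, matching the needle-in-a-haystack description.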
Simulation steps (repeated across many replications):
Key result. Under the meta-analytic scenario, the average maximum selected estimate is 0.52 SD while the average corresponding true effect is only 0.08 SD — a 6.5× Type M error. The probability that a selected program’s true effect is negative never exceeds 10%, so the goal of avoiding harm is roughly met. But fewer than 43% of selected programs have a true effect exceeding the 0.10 SD threshold for meaningful improvement.
Sensitivity analyses. Three additional scenarios were examined with wider distributions of true effects (τ = 0.10, 0.15, and 0.20). Type M errors decline as true effect variability increases (because genuinely large effects become more common), but remain substantial across all scenarios. The qualitative conclusion — that significance-filtered shortlists are systematically misleading — is robust to the choice of prior distribution.
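A simulation of this kind can be sketched in a few lines. The version below is a simplified illustration, not the code behind the reported results: the standard errors are hypothetical stand-ins for the real i3 values, and the function and key names are invented for this example, so the output will not reproduce the 6.5× figure exactly.

```python
import random
from statistics import mean

def simulate_filter(mu=0.014, tau=0.062, n_programs=44, threshold=0.10,
                    n_reps=2000, seed=1):
    """Apply the significance filter to simulated portfolios; summarize the picks."""
    rng = random.Random(seed)
    # HYPOTHETICAL standard errors, evenly spaced from 0.03 to 0.25 SD.
    ses = [0.03 + 0.22 * j / (n_programs - 1) for j in range(n_programs)]
    best_est, best_true, picked_truths = [], [], []
    for _ in range(n_reps):
        truths = [rng.gauss(mu, tau) for _ in ses]                     # true effects
        ests = [t + rng.gauss(0.0, se) for t, se in zip(truths, ses)]  # noisy estimates
        # Keep estimates above the threshold (hence positive) with z > 1.96.
        picks = [(e, t) for e, t, se in zip(ests, truths, ses)
                 if e > threshold and e / se > 1.96]
        if picks:
            e_max, t_max = max(picks)   # the best-looking candidate and its true effect
            best_est.append(e_max)
            best_true.append(t_max)
            picked_truths.extend(t for _, t in picks)
    return {
        "share_reps_with_pick": len(best_est) / n_reps,
        "mean_best_estimate": mean(best_est),
        "mean_best_true": mean(best_true),
        "share_picks_above_threshold":
            sum(t > threshold for t in picked_truths) / len(picked_truths),
    }

# Meta-analytic scenario plus the three wider sensitivity scenarios from the text.
results = {tau: simulate_filter(tau=tau) for tau in (0.062, 0.10, 0.15, 0.20)}
for tau, r in results.items():
    ratio = r["mean_best_estimate"] / r["mean_best_true"]
    print(f"tau={tau:.3f}  best estimate={r['mean_best_estimate']:.2f}  "
          f"true effect={r['mean_best_true']:.2f}  ratio={ratio:.1f}x")
```

The gap between the best estimate and its true effect is the winner’s curse; swapping in the actual standard errors would be the first step toward matching the reported numbers.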
BASIE (Bayesian Interpretation of Estimates) replaces “Is this statistically significant?” with questions that directly support decision making:
BASIE uses prior evidence about how program effects are typically distributed in a domain and combines it with the current study’s estimates to produce probability statements that map directly onto the foundation’s decision criteria. The results are dramatically better:
| | Standard approach | BASIE |
|---|---|---|
| Magnitude error (best candidate) | 6.5× | 1.1× |
| Does the best pick truly win? | No — true effect of “best” pick is smaller than the median pick | Yes — apparent winner truly is the winner |
| Selected candidates exceed threshold | 37–43% of the time | 57–72% of the time |
| Knows when to say no? | Selects a candidate in 97% of simulations — but often shouldn’t | Selects a candidate in only 53% of simulations |
The figure below shows what happens when BASIE reinterprets the five apparent winners. The tan bars show the original frequentist estimates; the dark gold bars show the Bayesian estimates, which are “shrunk” toward more realistic values by accounting for the overall distribution of effects in the i3 portfolio.
The winner’s curse is systematic, not accidental — it affects any significance-filtered shortlist. Useful next steps for a foundation in this position might include:
You’re now in the role of foundation staff, using BASIE to reinterpret the 44 i3 evaluation findings. Use the views below to compare intervention effects under both a frequentist and a Bayesian approach. You can change the cutoff that defines a meaningful effect and assess how sensitive the results are to the choice of prior distribution.
Prior distributions play a central role in Bayesian interpretation. Learn more →
What you’ve just seen is a simplified demonstration using published i3 evaluation data. A custom BASIE implementation is built around your organization’s own evidence — your data, your decision criteria, your stakeholders — whether your evidence comes from impact evaluations, descriptive comparisons, or trend analyses over time.
Whether you need a one-time analysis, a reusable framework, or a full evaluation approach built around Bayesian reasoning, a solution can be designed to fit your team, your data, and your decisions.
Interested in working together?
Contact John Deke to discuss what a tailored BASIE implementation could look like for your organization.
BASIE (Bayesian Interpretation of Estimates) is a framework for applying hierarchical Bayesian models to portfolios of estimates — whether from randomized trials, quasi-experimental designs, or descriptive analyses of group differences or trends — producing posterior probabilities and predictive intervals that directly support evidence-based decisions. It was developed by John Deke and Mariel Finucane. The motivating example used throughout — a local foundation evaluating 44 i3 math interventions — is described in detail on the Picking Winners tab.
The model is a two-level normal-normal hierarchy, a direct generalization of the eight schools model introduced by Rubin (1981). For J studies with impact estimates yj and known standard errors σj:
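In the standard notation for a hierarchy of this kind, the model can be written as follows. Two conventions are assumed here rather than stated in the text: the second argument of each normal distribution is a variance, and β is a rate parameter.

```latex
\begin{aligned}
y_j \mid \theta_j &\sim \mathcal{N}\!\left(\theta_j,\ \sigma_j^2\right), \qquad j = 1, \dots, J \\
\theta_j \mid \mu, \tau &\sim \mathcal{N}\!\left(\mu,\ \tau^2\right) \\
\mu &\sim \mathcal{N}\!\left(m_0,\ s_0^2\right) \\
\tau &\sim \mathrm{Gamma}(2,\ \beta) \quad \text{or} \quad p(\tau) \propto 1
\end{aligned}
```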
Here θj is the true causal effect for study j, μ is the mean of true effects across the population of studies, and τ is the standard deviation of true effects. The prior on μ is always zero-centered (m0 = 0); the slider controls s0, the prior standard deviation. All priors on μ are proper. For τ, the Gamma family is used because it is defined only on the positive real line (enforcing τ ≥ 0) and, with shape parameter k = 2, places relatively little mass near zero — avoiding the unrealistic assumption that all true effects are identical. An improper flat prior on τ is also available; in that case τ is estimated entirely from the data with no regularization toward any particular level of heterogeneity.
Bayesian interpretation requires combining evidence from the current study with prior information about the plausible range of effects. To understand why, consider what we are trying to accomplish: we want to know the probability that a program’s true effect is meaningful. An estimate alone cannot answer this question — it reflects both random error and genuine program effects. To calculate the probability that the estimate reflects a meaningful effect, we need information about how common meaningful effects are in general. That information is captured by the prior distribution.
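To make this concrete, the sketch below combines a single estimate with a fixed normal prior and reads off the two probabilities the foundation cares about. It is a simplification: the prior mean and SD are the meta-analytic values quoted on the Picking Winners tab, treated as known, whereas the full model treats them as uncertain. The standard error of 0.15 is hypothetical, and the function name is illustrative.

```python
from statistics import NormalDist

def posterior_given_normal_prior(estimate, se, prior_mean, prior_sd):
    """Conjugate normal update: posterior distribution of one true effect."""
    prior_prec = 1.0 / prior_sd**2   # precision (1 / variance) of the prior
    data_prec = 1.0 / se**2          # precision of the study estimate
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return NormalDist(post_mean, post_var**0.5)

# Estimate of 0.40 SD with a HYPOTHETICAL standard error of 0.15 SD.
post = posterior_given_normal_prior(estimate=0.40, se=0.15,
                                    prior_mean=0.014, prior_sd=0.062)

p_meaningful = 1.0 - post.cdf(0.10)   # P(true effect > 0.10 SD | data)
p_negative = post.cdf(0.0)            # P(true effect < 0 | data)

print(f"Posterior mean: {post.mean:.3f} SD")
print(f"P(effect > 0.10): {p_meaningful:.2f}")
print(f"P(effect < 0):    {p_negative:.2f}")
print("More likely than not to be meaningful:", p_meaningful > 0.5)
```

Under these assumptions, an estimate that is statistically significant on its own still falls short of the "more likely than not" criterion once the rarity of large effects is taken into account.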
The choice of prior should be grounded in evidence and logic, not chosen arbitrarily or to favor a predetermined conclusion. A well-chosen, transparent prior is far preferable to avoiding the question — frequentist significance tests also embed implicit assumptions about the plausibility of effects, but do so without stating them openly. Making prior assumptions explicit is a feature of Bayesian analysis, not a weakness.
The most important source of prior information for interpreting any individual estimate is the collection of all the other estimates in the portfolio. This is the central insight behind the hierarchical normal model: each study’s estimate informs our understanding of the population of true effects, and that collective understanding in turn shapes the interpretation of each individual estimate. This process — called partial pooling or shrinkage — pulls extreme estimates toward more realistic values, correcting for the winner’s curse illustrated in the Picking Winners tab.
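For a single study, and conditional on μ and τ, this shrinkage has a closed form under the normal-normal model. The weight symbol B_j is introduced here for illustration and does not appear elsewhere in the text:

```latex
\mathbb{E}\left[\theta_j \mid y_j, \mu, \tau\right]
  = (1 - B_j)\, y_j + B_j\, \mu,
\qquad
B_j = \frac{\sigma_j^2}{\sigma_j^2 + \tau^2}
```

A noisy estimate (σj large relative to τ) has B_j near 1 and is pulled almost all the way to μ, while a precise estimate has B_j near 0 and stays close to yj. In the full model μ and τ are themselves uncertain, so this expression is averaged over their posterior.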
In addition to learning from the data, we place a “prior on the prior” — called a hyper-prior — that reflects general background knowledge about effect sizes in the domain. The sliders in the prior panel control these hyper-priors.
The hyper-prior on μ is always centered at zero — reflecting the assumption that, before seeing the data, positive and negative effects are equally plausible. This is also a form of pre-registered skepticism that prevents the prior from being chosen in a way that favors a desired result. Once the data are observed, the posterior for μ can move substantially away from zero if the estimates collectively point in that direction.
The four options for the standard deviation of this hyper-prior reflect different background assumptions about how large effects are likely to be, informed by meta-analyses showing that variation in effects across social programs is real but not enormous:
τ governs how much true effects vary across studies. A small τ means programs have similar effects; a large τ means there is substantial heterogeneity. The Gamma family is a natural choice for τ for three reasons. First, it is defined only on positive values, as required since standard deviations must be non-negative. Second, with shape parameter k = 2, it places relatively little weight near zero, avoiding the unrealistic assumption that all programs have identical true effects. Third, the Stan Prior Choice Recommendations (Gelman et al.) explicitly recommend Gamma(2, 0) as a “boundary-avoiding” prior for hierarchical scale parameters, because it keeps the posterior mode away from zero while still allowing it to be close to zero if the data warrant. BASIE uses Gamma(2, β) with β > 0, which gives proper (integrable) priors with this same boundary-avoiding shape.
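The boundary-avoiding shape is easy to see numerically. The sketch below assumes β is a rate parameter (Stan's convention, under which β = 0 gives the improper density proportional to τ) and uses a hypothetical value of β purely for illustration.

```python
import math

def gamma2_pdf(tau, beta):
    """Density of a Gamma(shape=2, rate=beta) prior at tau >= 0."""
    return beta**2 * tau * math.exp(-beta * tau)

beta = 20.0              # HYPOTHETICAL rate; not one of BASIE's actual slider settings
prior_mode = 1.0 / beta  # (shape - 1) / rate
prior_mean = 2.0 / beta  # shape / rate

# Trapezoid rule on [0, 5] to confirm the prior is proper (integrates to about 1).
step = 0.0005
grid = [i * step for i in range(10_001)]
dens = [gamma2_pdf(t, beta) for t in grid]
total_mass = step * (sum(dens) - 0.5 * (dens[0] + dens[-1]))

print(f"density at tau = 0:  {gamma2_pdf(0.0, beta):.3f}")
print(f"density at the mode: {gamma2_pdf(prior_mode, beta):.3f} (tau = {prior_mode:.3f})")
print(f"total mass:          {total_mass:.4f}")
```

The density is exactly zero at τ = 0 and rises linearly just above it, which is what keeps the posterior mode off the boundary.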
The five options, described in terms of their implications for the spread of true effects:
We encourage users to move the sliders and observe how results change. Robustness to prior choice is itself informative — if the conclusions are similar across a range of reasonable priors, the evidence is strong. If they change substantially, the data alone may not be sufficient for confident conclusions, and the choice of prior deserves careful attention.
Inference uses a grid-based sampler. The marginal posterior of τ given the data, with μ integrated out analytically, is evaluated on a fine grid (1,000 points from 0.001 to 40), and τ is sampled from this discrete approximation. Conditional on each τ draw, μ is sampled from its conjugate normal posterior, and each θj is then sampled from its conjugate normal posterior conditional on (μ, τ). This approach follows the algorithm described in Deke & Finucane (2019). The default uses 10,000 posterior draws.
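A minimal sketch of a grid sampler of this type is shown below. It is not the BASIE implementation: it assumes a flat prior on τ, reads the default 0.22 as the prior standard deviation of μ (the text does not say whether it is an SD or a variance), and runs on hypothetical estimates and standard errors with a narrower τ grid suited to effects in SD units.

```python
import math
import random

def basie_grid_sampler(y, se, m0=0.0, s0=0.22, tau_min=0.001, tau_max=40.0,
                       n_grid=1000, n_draws=10_000, seed=0):
    """Draw (tau, mu, theta) from a normal-normal hierarchy with a flat prior on tau."""
    rng = random.Random(seed)
    step = (tau_max - tau_min) / (n_grid - 1)
    grid = [tau_min + i * step for i in range(n_grid)]
    log_post, mu_hat, mu_var = [], [], []
    for tau in grid:
        w = [1.0 / (s**2 + tau**2) for s in se]   # marginal precision of each y_j
        prec = 1.0 / s0**2 + sum(w)               # posterior precision of mu given tau
        m = (m0 / s0**2 + sum(wj * yj for wj, yj in zip(w, y))) / prec
        # log p(tau | y) up to a constant, with mu integrated out analytically.
        lp = (-0.5 * math.log(prec)
              + 0.5 * sum(math.log(wj) for wj in w)
              - 0.5 * sum(wj * (yj - m)**2 for wj, yj in zip(w, y))
              - 0.5 * (m - m0)**2 / s0**2)
        log_post.append(lp)
        mu_hat.append(m)
        mu_var.append(1.0 / prec)
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]   # unnormalized grid weights
    draws = []
    for i in rng.choices(range(n_grid), weights=weights, k=n_draws):
        tau = grid[i]                                    # 1. tau from the grid
        mu = rng.gauss(mu_hat[i], math.sqrt(mu_var[i]))  # 2. mu given tau
        theta = []
        for yj, sj in zip(y, se):                        # 3. each theta_j given mu, tau
            prec_j = 1.0 / sj**2 + 1.0 / tau**2
            mean_j = (yj / sj**2 + mu / tau**2) / prec_j
            theta.append(rng.gauss(mean_j, math.sqrt(1.0 / prec_j)))
        draws.append((tau, mu, theta))
    return draws

# HYPOTHETICAL portfolio of eight estimates and standard errors (SD units).
y = [0.40, 0.22, 0.05, -0.03, 0.10, -0.08, 0.02, 0.13]
se = [0.15, 0.08, 0.06, 0.07, 0.09, 0.10, 0.05, 0.05]
draws = basie_grid_sampler(y, se, tau_max=1.0)   # narrower grid: an assumption here

theta_first = [theta[0] for _, _, theta in draws]
post_mean_first = sum(theta_first) / len(theta_first)
p_above = sum(t > 0.10 for t in theta_first) / len(theta_first)
print(f"Raw estimate 0.40 -> posterior mean {post_mean_first:.2f}, "
      f"P(effect > 0.10) = {p_above:.2f}")
```

Because every draw is independent, posterior probabilities such as P(θj > 0.10) are simple proportions of draws, which is what makes the decision-oriented summaries straightforward to compute.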
The default prior for μ is N(0, 0.22). The default prior for τ is uninformative (flat), consistent with the Rubin (1981) eight schools model — the degree of shrinkage is determined entirely by the data. Users can impose informative Gamma priors on τ using the slider in the prior panel.
BASIE was developed by John Deke and Mariel Finucane. The pre-loaded estimates are drawn from the i3 evaluation portfolio summarized in Goodson et al. (2024).