BASIE Demo

BASIE bridges the gap from evidence to decision making

Evidence → BASIE → Decision Making

Decision makers need more than just a description of evidence – they need to know what the evidence means for their decisions. They need to know how confident they can be, whether the evidence is strong enough to act on, and what they should realistically expect if they invest in a program or scale it to a new context. BASIE (Bayesian Interpretation of Estimates) was designed to answer these questions directly, using criteria relevant to the decision maker’s goals, in plain terms that support a yes, no, or not-yet decision.

BASIE answers questions like:

“Did the program really work?”
“Are outcomes truly different between groups?”
“Did outcomes improve over time?”

BASIE works equally well whether the estimate comes from a randomized trial, a comparison group design, or a descriptive analysis of group differences or trends over time.

Given all the evidence, BASIE produces probability statements that map directly onto your decision criteria: How likely is it that this program truly works? How likely is the true effect to exceed a meaningful threshold?

Traditional statistical methods answer a question nobody asks

Decision makers need clear, direct answers to these questions. Yet the standard statistical tools in most evaluation reports (significance tests and p-values) were not designed to provide them. Statistical significance tells you only that, if the true effect were zero, a random error as large as your estimate would be unlikely. As the Picking Winners example shows, that is quite different from telling you whether the program really worked, whether a difference between groups is real, or whether a change over time is meaningful. Significance also rests on an arbitrary threshold that is unrelated to the decision maker’s goals.

A guide to this demo

Work through the tabs from left to right, or jump directly to what you need.

Picking Winners: A worked example showing how a local foundation’s standard approach to picking grant winners goes wrong, and how BASIE fixes it.
Try It Yourself: Explore the i3 estimates interactively. The Interpret the i3 Estimates sub-tab shows Bayesian re-interpretations alongside the original frequentist results. The Apply BASIE to Your Work sub-tab describes how BASIE can be tailored to your organization’s data and decisions.
Methods: Technical background on the BASIE framework, the statistical model, and links to the underlying research.

Picking Winners: The Challenge Every Funder Faces

Foundations, government agencies, and other funders face the same high-stakes question: of all the programs and interventions we could support, which ones are most likely to make a real difference? When rigorous evaluation evidence is available, the natural impulse is to follow the numbers — to fund the programs with the best-looking results. But reading evaluation findings correctly turns out to be harder than it looks.

Consider a local foundation that wants to improve math outcomes in its community. Foundation staff draw on 44 rigorous evaluations of math interventions supported by the U.S. Department of Education’s Investing in Innovation (i3) program. They set two explicit criteria for what “works”: a program should be very unlikely to produce a negative effect, and more likely than not to improve math scores by at least 0.10 standard deviations — a threshold they consider educationally meaningful.

44 rigorous evaluations · 5 apparent winners · 0.40 SD standout estimate

1. The Standard Approach

Typically, foundation staff would apply a straightforward filter: select programs with estimates that are positive, statistically significant, and larger than 0.10 standard deviations. This filter yields five candidates, with impact estimates of 0.40, 0.22, 0.18, 0.16, and 0.13 standard deviations. The 0.40 estimate looks especially compelling — nearly twice as large as the next best result.

Figure 1. Frequentist impact estimates for all 44 i3 evaluations, sorted by magnitude. Gold bars are programs that meet the foundation’s selection criteria (statistically significant and effect > 0.10 SD). Asterisks (*) mark statistically significant estimates.

This looks like a solid, evidence-based shortlist. But is it as good as it seems?

2. The Problem: The Winner’s Curse

This procedure has a conceptually subtle but consequential flaw. When winners are selected using noisy performance estimates, the winners’ true quality is almost always lower than their measured performance. Part of what makes them look best is luck. This is the winner’s curse, and it shows up across many fields:

Sports Illustrated cover jinx: Athletes who make the cover have exceptional recent performance — partly skill, partly luck. Subsequent performance regresses toward their true level.
Promising trials bias: Treatments selected for confirmatory trials based on early promising results routinely show smaller effects when confirmed.
The replication crisis: Landmark findings in social science have repeatedly failed to replicate at their original magnitude, partly because the studies that became famous were selected for large, significant results.

To quantify how serious this is, we can use a simulation grounded in the i3 evidence itself. A meta-analysis of the 44 evaluations reveals that only about 8% of i3 programs have a true effect larger than 0.10 SD — a needle-in-a-haystack problem. When genuinely effective programs are rare, a filter that selects partly on luck picks up mostly hay.

6.5× magnitude error of the standard approach. Simulations show that the best-looking “winning” estimate averages 0.52 SD, but the true effect behind it averages only 0.08 SD. Across simulations, fewer than half of selected programs truly meet the foundation’s criteria.

🔬 Simulation methodology (technical details)

To assess inferential errors, we need to compare inferences to the truth. For example, calculating a Type M (magnitude) error, the factor by which an estimate overstates the true effect, requires comparing an estimate of an intervention’s effect to its true effect. This is impossible with real data because true effects are unknown, but it is straightforward in a simulation, where true effects are generated and then estimated subject to random error.

Data inputs. The simulation uses the actual standard errors from the 44 i3 evaluations. The distribution of true effects is anchored to a random-effects meta-analysis of those evaluations, which yields μ = 0.014 and τ = 0.062 — implying that only about 8% of true effects exceed 0.10 SD.
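The "about 8%" figure follows directly from the normal distribution of true effects. A minimal check, assuming only the μ and τ values quoted above (the variable names are illustrative):

```python
from statistics import NormalDist

# Distribution of true effects implied by the meta-analysis (values from the text).
true_effects = NormalDist(mu=0.014, sigma=0.062)

# Share of programs whose true effect exceeds the 0.10 SD threshold.
share_meaningful = 1 - true_effects.cdf(0.10)

# Share of programs whose true effect is negative.
share_negative = true_effects.cdf(0.0)

print(f"P(true effect > 0.10) = {share_meaningful:.3f}")  # about 0.08
print(f"P(true effect < 0)    = {share_negative:.3f}")
```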

Simulation steps (repeated across many replications):

Draw 44 true effects θ_j from N(μ, τ²), using the meta-analytic parameter estimates.
For each true effect, generate a noisy impact estimate ŷ_j ~ N(θ_j, s_j²), using the actual i3 standard errors s_j.
Calculate t-statistics and apply the foundation’s selection rule: flag estimates where t ≥ 1.96 and ŷ_j ≥ 0.10.
Record the minimum, median, and maximum selected estimate alongside their corresponding true effects θ_j.
Compare the average selected estimates to the average true effects to compute Type M errors.
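The steps above can be sketched in a few lines of Python. This is an illustration rather than the authors' code: the 44 actual i3 standard errors are not listed here, so `HYPOTHETICAL_SES` is a made-up stand-in, and the sketch tracks only the maximum selected estimate. Its numbers will therefore differ from the results reported below.

```python
import random

# Hypothetical standard errors: the 44 actual i3 values are not listed in the text.
HYPOTHETICAL_SES = [0.03 + 0.005 * i for i in range(44)]

def simulate(ses, mu=0.014, tau=0.062, n_reps=2000, seed=0):
    """Return (mean top selected estimate, mean true effect behind it,
    share of replications in which at least one program is selected)."""
    rng = random.Random(seed)
    top_estimates, top_truths = [], []
    for _ in range(n_reps):
        selected = []
        for s in ses:
            theta = rng.gauss(mu, tau)   # true effect drawn from N(mu, tau^2)
            y_hat = rng.gauss(theta, s)  # noisy estimate drawn from N(theta, s^2)
            # The foundation's rule: significant and at least 0.10 SD.
            if y_hat / s >= 1.96 and y_hat >= 0.10:
                selected.append((y_hat, theta))
        if selected:
            y_max, theta_max = max(selected)  # largest selected estimate and its true effect
            top_estimates.append(y_max)
            top_truths.append(theta_max)
    mean_est = sum(top_estimates) / len(top_estimates)
    mean_truth = sum(top_truths) / len(top_truths)
    return mean_est, mean_truth, len(top_estimates) / n_reps

mean_est, mean_truth, share_selecting = simulate(HYPOTHETICAL_SES)
type_m = mean_est / mean_truth  # Type M error for the top pick
print(f"top estimate {mean_est:.2f}, true effect {mean_truth:.2f}, Type M {type_m:.1f}x")
```

Fixing the seed makes a run reproducible. Changing `tau` mimics the sensitivity scenarios with wider distributions of true effects.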

Key result. Under the meta-analytic scenario, the average maximum selected estimate is 0.52 SD while the average corresponding true effect is only 0.08 SD — a 6.5× Type M error. The probability that a selected program’s true effect is negative never exceeds 10%, so the goal of avoiding harm is roughly met. But fewer than 43% of selected programs have a true effect exceeding the 0.10 SD threshold for meaningful improvement.

Sensitivity analyses. Three additional scenarios were examined with wider distributions of true effects (τ = 0.10, 0.15, and 0.20). Type M errors decline as true effect variability increases (because genuinely large effects become more common), but remain substantial across all scenarios. The qualitative conclusion, that significance-filtered shortlists are systematically misleading, is robust to the assumed distribution of true effects.

3. The Solution: BASIE

BASIE (Bayesian Interpretation of Estimates) replaces “Is this statistically significant?” with questions that directly support decision making:

What is the probability this program causes harm?
What is the probability the effect exceeds our meaningful threshold?

BASIE uses prior evidence about how program effects are typically distributed in a domain and combines it with the current study’s estimates to produce probability statements that map directly onto the foundation’s decision criteria. The results are dramatically better:

Measure	Standard approach	BASIE
Magnitude error (best candidate)	6.5×	1.1×
Does the best pick truly win?	No: the true effect of the “best” pick is smaller than that of the median pick	Yes: the apparent winner truly is the winner
Selected candidates exceed threshold	37–43% of the time	57–72% of the time
Knows when to say no?	Selects a candidate in 97% of simulations, but often shouldn’t	Selects a candidate in only 53% of simulations

BASIE Applied to the Five Candidates

The figure below shows what happens when BASIE reinterprets the five apparent winners. The tan bars show the original frequentist estimates; the dark gold bars show the Bayesian estimates, which are “shrunk” toward more realistic values by accounting for the overall distribution of effects in the i3 portfolio.

Figure 2. Frequentist estimates (tan) vs. Bayesian estimates (dark gold) for the five programs selected by the standard approach. Bayesian estimates are shrunk substantially toward more plausible values.

No candidate clearly clears 50%. The honest answer may be: none of these programs meet the bar. The closest calls — DEV104 and VALID21 — hover right around 49–50%, which is essentially a coin flip on whether these programs truly meet the foundation’s threshold. That ambiguity is a little disappointing now — but it prevents a much bigger disappointment later. (Results this close to 50% are also subject to small amounts of Monte Carlo variability; refresh the page to see how much they move.)
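To make the probability statements concrete, here is a simplified sketch of how an estimate can be turned into the two quantities the foundation cares about. It is not the demo's computation: it plugs in the meta-analytic μ = 0.014 and τ = 0.062 as if they were known, where BASIE averages over their uncertainty. The standard error of 0.15 is hypothetical, and the function name `interpret` is illustrative.

```python
from statistics import NormalDist

def interpret(estimate, se, mu=0.014, tau=0.062, threshold=0.10):
    """Plug-in posterior for one true effect under a normal-normal model."""
    shrink = se**2 / (se**2 + tau**2)  # weight placed on the portfolio mean
    post_mean = shrink * mu + (1 - shrink) * estimate
    post_sd = ((1 - shrink) * se**2) ** 0.5
    posterior = NormalDist(post_mean, post_sd)
    return {
        "posterior_mean": post_mean,
        "p_harm": posterior.cdf(0.0),                  # P(true effect < 0)
        "p_meaningful": 1 - posterior.cdf(threshold),  # P(true effect > threshold)
    }

# The 0.40 estimate comes from the text; se=0.15 is a hypothetical value.
result = interpret(estimate=0.40, se=0.15)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The noisier the estimate relative to the spread of true effects, the closer `shrink` is to 1 and the more the posterior mean is pulled toward the portfolio mean.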

4. What This Means for Grant Making

The winner’s curse is systematic, not accidental — it affects any significance-filtered shortlist. Useful next steps for a foundation in this position might include:

Revisit the decision criteria. Accepting a smaller effect size may be warranted if lower-cost options are also on the table.
Commission a local evaluation. Effect sizes vary across contexts. A well-designed local study, combined with qualitative assessment of fit, can yield more targeted evidence.
Invest in program development. If few existing programs have large true effects, funding earlier-stage development work could improve the pool of candidates for future evaluation.

What does the evidence really mean?

You’re now in the role of foundation staff, using BASIE to reinterpret the 44 i3 evaluation findings. Use the views below to compare intervention effects under frequentist and Bayesian approaches. You can change the cutoff that defines a meaningful effect and assess how sensitive the results are to the prior distributions.

Foundation’s meaningful threshold (effect size): The foundation set this at 0.10 SD — adjust to explore how conclusions change if they reconsidered this criterion.

Prior distributions play a central role in Bayesian interpretation.

Adjust prior assumptions
Prior on μ: N(0, 0.10), N(0, 0.20), N(0, 0.40), N(0, 1.0)
Prior on τ: G(2, 16), G(2, 8), G(2, 4), G(2, 2), Uninformative

View as: Vertical bars, Forest plot, Table
Show: Point estimate, Probability
Which estimate: Frequentist, Bayesian (posterior mean), Both

Results can also be sorted and filtered. Leave filter bounds blank for no limit.

Bringing BASIE to Your Organization

What you’ve just seen is a simplified demonstration using published i3 evaluation data. A custom BASIE implementation is built around your organization’s own evidence — your data, your decision criteria, your stakeholders — whether your evidence comes from impact evaluations, descriptive comparisons, or trend analyses over time.

What a custom implementation can include

Models calibrated to your domain. Prior distributions and threshold criteria are set to reflect what’s known about effect sizes in your specific program area — whether the evidence base comes from experimental, quasi-experimental, or descriptive studies.
Integration with your evaluation workflows. Bayesian results are produced alongside your existing evaluation outputs, formatted for your reporting context and stakeholder audience.
More sophisticated hierarchical models. Extensions to the base model can account for site-level variation, subgroup effects, multiple outcomes, or multi-arm study designs.
Interactive dashboards for decision makers. Stakeholders can adjust assumptions, explore sensitivity, and move between probability statements and effect sizes — without needing to understand the underlying model.
Transparent sensitivity analyses. Results are presented with explicit documentation of how conclusions change under different prior assumptions, making the role of judgment visible and auditable.

Whether you need a one-time analysis, a reusable framework, or a full evaluation approach built around Bayesian reasoning, a solution can be designed to fit your team, your data, and your decisions.

Interested in working together?

Contact John Deke to discuss what a tailored BASIE implementation could look like for your organization.

Methods

BASIE (Bayesian Interpretation of Estimates) is a framework for applying hierarchical Bayesian models to portfolios of estimates — whether from randomized trials, quasi-experimental designs, or descriptive analyses of group differences or trends — producing posterior probabilities and predictive intervals that directly support evidence-based decisions. It was developed by John Deke and Mariel Finucane. The motivating example used throughout — a local foundation evaluating 44 i3 math interventions — is described in detail on the Picking Winners tab.

The statistical model

The model is a two-level normal-normal hierarchy, a direct generalization of the eight schools model introduced by Rubin (1981). For J studies with impact estimates y_j and known standard errors σ_j:

y_j ∼ N(θ_j, σ_j²)    [observation model]
θ_j ∼ N(μ, τ²)    [population model]
μ ∼ N(m₀, s₀²)    [prior on mean effect]
τ ∼ Gamma(k, β)    [prior on SD of effects]

Here θ_j is the true causal effect for study j, μ is the mean of true effects across the population of studies, and τ is the standard deviation of true effects. The prior on μ is always zero-centered (m₀ = 0); the slider controls s₀, the prior standard deviation. All priors on μ are proper. For τ, the Gamma family (with shape k and rate β) is used because it is defined only on the positive real line (enforcing τ > 0) and, with shape parameter k = 2, places relatively little mass near zero, which avoids the unrealistic assumption that all true effects are identical. An improper flat prior on τ is also available; in that case τ is estimated entirely from the data with no regularization toward any particular level of heterogeneity.

Plausible priors precede persuasive posteriors

Bayesian interpretation requires combining evidence from the current study with prior information about the plausible range of effects. To understand why, consider what we are trying to accomplish: we want to know the probability that a program’s true effect is meaningful. An estimate alone cannot answer this question — it reflects both random error and genuine program effects. To calculate the probability that the estimate reflects a meaningful effect, we need information about how common meaningful effects are in general. That information is captured by the prior distribution.

Priors should be evidence-based, not arbitrary

The choice of prior should be grounded in evidence and logic, not chosen arbitrarily or to favor a predetermined conclusion. A well-chosen, transparent prior is far preferable to avoiding the question — frequentist significance tests also embed implicit assumptions about the plausibility of effects, but do so without stating them openly. Making prior assumptions explicit is a feature of Bayesian analysis, not a weakness.

The other estimates are the main source of prior information

The most important source of prior information for interpreting any individual estimate is the collection of all the other estimates in the portfolio. This is the central insight behind the hierarchical normal model: each study’s estimate informs our understanding of the population of true effects, and that collective understanding in turn shapes the interpretation of each individual estimate. This process — called partial pooling or shrinkage — pulls extreme estimates toward more realistic values, correcting for the winner’s curse illustrated in the Picking Winners tab.
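For readers who want the algebra, the standard conditional result for the normal-normal model makes the shrinkage explicit. It uses the notation of the statistical model above; the shrinkage weight B_j is introduced here only for illustration.

```latex
\theta_j \mid y_j, \mu, \tau \;\sim\;
  N\!\left( (1 - B_j)\, y_j + B_j\, \mu,\; (1 - B_j)\, \sigma_j^2 \right),
\qquad
B_j = \frac{\sigma_j^2}{\sigma_j^2 + \tau^2}.
```

When a study is noisy relative to the spread of true effects (σ_j large compared with τ), B_j is close to 1 and the estimate is pulled strongly toward μ. A precise study keeps most of its own estimate.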

In addition to learning from the data, we place a “prior on the prior” — called a hyper-prior — that reflects general background knowledge about effect sizes in the domain. The sliders in the prior panel control these hyper-priors.

The hyper-prior on μ (mean of true effects)

The hyper-prior on μ is always centered at zero — reflecting the assumption that, before seeing the data, positive and negative effects are equally plausible. This is also a form of pre-registered skepticism that prevents the prior from being chosen in a way that favors a desired result. Once the data are observed, the posterior for μ can move substantially away from zero if the estimates collectively point in that direction.

The four options for the standard deviation of this hyper-prior reflect different background assumptions about how large effects are likely to be, informed by meta-analyses showing that variation in effects across social programs is real but not enormous:

N(0, 0.10²): Very skeptical. The middle 80% of the prior falls between −0.13 and +0.13 SD. Appropriate when evidence strongly suggests effects in this domain are small.
N(0, 0.20²): Moderately skeptical. The middle 80% of the prior falls between −0.26 and +0.26 SD. Consistent with meta-analyses of educational and social program evaluations.
N(0, 0.40²): Less skeptical. The middle 80% of the prior falls between −0.51 and +0.51 SD. Appropriate for domains with greater effect size variability.
N(0, 1.0²): Very diffuse. The middle 80% of the prior falls between −1.28 and +1.28 SD, which places substantial prior weight on large effects. Not typically supported by systematic reviews of social program evaluations, but included as a sensitivity and bounding analysis.
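The middle-80% ranges for these options can be reproduced from the normal quantile function. A small check (the dictionary name `half_widths` is illustrative):

```python
from statistics import NormalDist

# 90th percentile of the standard normal: the middle 80% spans +/- this many SDs.
z80 = NormalDist().inv_cdf(0.90)

# Half-width of the middle 80% interval for each prior SD option on mu.
half_widths = {s0: z80 * s0 for s0 in (0.10, 0.20, 0.40, 1.0)}

for s0, hw in half_widths.items():
    print(f"N(0, {s0}^2): middle 80% of the prior runs from {-hw:.2f} to +{hw:.2f} SD")
```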

The hyper-prior on τ (SD of true effects)

τ governs how much true effects vary across studies. A small τ means programs have similar effects; a large τ means there is substantial heterogeneity. The Gamma family is a natural choice for τ for three reasons: (1) it is defined only on positive values, as required for a standard deviation; (2) with shape parameter k = 2, it places relatively little weight near zero, avoiding the unrealistic assumption that all programs have identical true effects; and (3) the Stan Prior Choice Recommendations (Gelman et al.) explicitly recommend Gamma(2, 0) as a “boundary-avoiding” prior for hierarchical scale parameters, because it keeps the posterior mode away from zero while still allowing it to be close to zero if the data warrant. BASIE uses Gamma(2, β) with β > 0, which produces proper (integrable) priors with this same boundary-avoiding shape.

The five options, described in terms of their implications for the spread of true effects:

Gamma(2, 16): Very little variation expected. Prior mode at τ ≈ 0.06 SD; middle 80% roughly 0.03–0.24 SD. Appropriate when prior evidence suggests effects are highly consistent across programs.
Gamma(2, 8): Little variation expected. Prior mode at τ ≈ 0.13 SD; middle 80% roughly 0.07–0.49 SD.
Gamma(2, 4): Moderate variation expected. Prior mode at τ ≈ 0.25 SD; middle 80% roughly 0.13–0.97 SD.
Gamma(2, 2): Substantial variation expected. Prior mode at τ ≈ 0.50 SD; middle 80% roughly 0.27–1.94 SD.
Uninformative (flat) [default]: τ is estimated entirely from the data with no prior constraint. Consistent with the Rubin (1981) eight schools model. Appropriate when you want the evidence itself to determine the degree of shrinkage without any prior nudge toward a particular level of heterogeneity.
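The mode and central range implied by each Gamma(2, β) option can be checked numerically. The sketch below assumes β is a rate parameter, which is consistent with the modes listed above, and uses the closed-form CDF available for shape 2. The helper names are illustrative.

```python
import math

def gamma2_cdf(x, rate):
    """CDF of a Gamma(shape=2, rate) distribution (closed form for shape 2)."""
    return 1.0 - math.exp(-rate * x) * (1.0 + rate * x)

def gamma2_quantile(p, rate, lo=0.0, hi=100.0):
    """Invert the CDF by bisection on [lo, hi]."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if gamma2_cdf(mid, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

summary = {}
for rate in (16, 8, 4, 2):
    mode = 1.0 / rate  # (shape - 1) / rate with shape = 2
    summary[rate] = (mode, gamma2_quantile(0.10, rate), gamma2_quantile(0.90, rate))

for rate, (mode, q10, q90) in summary.items():
    print(f"Gamma(2, {rate}): mode {mode:.3f}, middle 80% from {q10:.3f} to {q90:.3f}")
```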

A note on sensitivity

We encourage users to move the sliders and observe how results change. Robustness to prior choice is itself informative — if the conclusions are similar across a range of reasonable priors, the evidence is strong. If they change substantially, the data alone may not be sufficient for confident conclusions, and the choice of prior deserves careful attention.

Posterior computation

Inference proceeds in three stages. First, μ is integrated out analytically, and the resulting marginal posterior of τ given the data is evaluated on a fine grid (1,000 points from 0.001 to 40); τ is then sampled from this discrete approximation. Second, conditional on each τ draw, μ is sampled from its conjugate normal posterior. Third, conditional on each (μ, τ) pair, each θ_j is sampled from its conjugate normal posterior. This grid-based approach follows the algorithm described in Deke & Finucane (2019). The default uses 10,000 posterior draws.
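A compact sketch of this kind of grid sampler is shown below. It is an illustration under stated assumptions, not the demo's implementation. The grid spacing is not specified above, so the sketch assumes log spacing and weights each grid point by its cell width. The estimates and standard errors are hypothetical, and it uses 2,000 draws to keep the run short.

```python
import math
import random

def sample_posterior(y, se, m0=0.0, s0=0.2, gamma_rate=None,
                     n_grid=1000, n_draws=2000, seed=1):
    """Draw theta vectors from the normal-normal hierarchy via a tau grid."""
    rng = random.Random(seed)
    # Log-spaced tau grid from 0.001 to 40 (the spacing is an assumption).
    lo, hi = math.log(0.001), math.log(40.0)
    grid = [math.exp(lo + (hi - lo) * i / (n_grid - 1)) for i in range(n_grid)]

    def mu_posterior(tau):
        # Conjugate posterior of mu given tau, with each theta_j integrated out.
        prec = [1.0 / (s * s + tau * tau) for s in se]
        var = 1.0 / (1.0 / s0**2 + sum(prec))
        mean = var * (m0 / s0**2 + sum(p * yj for p, yj in zip(prec, y)))
        return mean, var, prec

    log_w = []
    for tau in grid:
        mean, var, prec = mu_posterior(tau)
        # log p(y | tau) up to a constant: p(mu) p(y | mu, tau) / p(mu | tau, y) at mu = mean.
        ll = 0.5 * math.log(var) - 0.5 * (mean - m0) ** 2 / s0**2
        ll += sum(0.5 * math.log(p) - 0.5 * p * (yj - mean) ** 2
                  for p, yj in zip(prec, y))
        # Gamma(2, rate) density is proportional to tau * exp(-rate * tau); flat if None.
        lp = 0.0 if gamma_rate is None else math.log(tau) - gamma_rate * tau
        # Adding log(tau) accounts for log-spaced cells, whose width grows with tau.
        log_w.append(ll + lp + math.log(tau))
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]

    draws = []
    for tau in rng.choices(grid, weights=weights, k=n_draws):
        mean, var, _ = mu_posterior(tau)
        mu = rng.gauss(mean, math.sqrt(var))
        theta = []
        for yj, s in zip(y, se):
            v = 1.0 / (1.0 / s**2 + 1.0 / tau**2)  # conditional posterior variance
            theta.append(rng.gauss(v * (yj / s**2 + mu / tau**2), math.sqrt(v)))
        draws.append(theta)
    return draws

# Hypothetical portfolio of estimates and standard errors (not the i3 data).
y = [0.40, 0.22, 0.10, 0.03, -0.02, -0.08]
se = [0.15, 0.08, 0.06, 0.05, 0.07, 0.10]
draws = sample_posterior(y, se)
post_mean = [sum(d[j] for d in draws) / len(draws) for j in range(len(y))]
p_meaningful = [sum(d[j] > 0.10 for d in draws) / len(draws) for j in range(len(y))]
```

Passing `gamma_rate=8`, for example, swaps the flat prior on τ for a Gamma(2, 8) prior. Comparing `post_mean` with `y` shows how far each estimate is shrunk.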

The default prior for μ is N(0, 0.2²). The default prior for τ is uninformative (flat), consistent with the Rubin (1981) eight schools model — the degree of shrinkage is determined entirely by the data. Users can impose informative Gamma priors on τ using the slider in the prior panel.

Acknowledgments

BASIE was developed by John Deke and Mariel Finucane. The pre-loaded estimates are drawn from the i3 evaluation portfolio summarized in Goodson et al. (2024).

References

Deke, J., & Finucane, M. (2019). Moving Beyond Statistical Significance: The BASIE (Bayesian Interpretation of Estimates) Framework for Interpreting Findings from Impact Evaluations. OPRE Report 2019-35. Office of Planning, Research, and Evaluation, ACF, U.S. DHHS. acf.gov
Goodson, B. D., Harvill, E., Sarna, M., Brown, K., & McCormick, R. (2024). Federal Efforts Towards Investing in Innovation through the i3 Fund: A Summary of Grantmaking and Evidence-Building. NCEE 2024002. National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. ies.ed.gov
IES BASIE Framework and Toolkit. ies.ed.gov
Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4), 377–401.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC. stat.columbia.edu
Gelman, A. (2022). Hey, check this out, it’s really cool: A Bayesian framework for interpreting findings from impact evaluations. Statistical Modeling, Causal Inference, and Social Science (blog). statmodeling.stat.columbia.edu
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. doi.org
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19. doi.org