BASIE Workbench by John Deke

Traditional statistical methods answer a question nobody asks

Decision makers want data to help answer questions like:

“Did the program really work?” “Are outcomes truly different between groups?” “Did outcomes improve over time?”

Those are the right questions! But the standard statistical methods that researchers report — like statistical significance — were never designed to answer them.

Statistical significance tells you only that random error alone is unlikely to have produced an estimate as large as yours. As we will see with The PIP Foundation, that is importantly different from telling you whether the program really worked, whether the difference between groups is real, or whether the change over time is meaningful.

It’s also based on an arbitrary threshold, unrelated to the goals of the decision maker.

BASIE (Bayesian Interpretation of Estimates) provides tailored, flexible answers to the right questions: given all the evidence, what is the probability this program truly produces meaningful results, that the outcomes are truly different between groups, or that outcomes improved over time?

A guide to this tool

Work through the tabs from left to right, or jump directly to what you need.

The PIP Foundation: A worked example showing how a fictional foundation’s standard approach to picking grant winners goes wrong, and how BASIE fixes it. Start here if you’re new to BASIE.
Enter Estimates: Load your own estimates. Upload a CSV with impact estimates and standard errors, or explore the pre-loaded i3 evaluation data from the PIP Foundation example.
Interpret Estimates: See Bayesian re-interpretations of your estimates side by side with the original frequentist results. Includes a forest plot, posterior summary table, and probability statements for each study.
Predict: Project what a new study from the same portfolio of evidence would likely find, a forward-looking answer to “what should we expect if we tried this again?”
About BASIE Workbench: Technical background on the BASIE framework, the statistical model, and links to the underlying research.

Need something more powerful?

This tool is a working demonstration of what BASIE can do with a simple model and a standard dataset. It is designed to make the core concepts tangible and to give you a genuine, hands-on sense of the Bayesian approach — not just a description of it.

More advanced implementations can go considerably further:

  • Models tailored to your organization’s specific domain, decision rules, and threshold criteria
  • Integration with your own evaluation data pipelines or reporting workflows
  • More sophisticated hierarchical models that account for site-level variation, subgroup effects, or multi-arm designs
  • Interactive dashboards that let stakeholders explore uncertainty and tradeoffs directly
  • Sensitivity analyses that make the role of prior assumptions transparent and auditable

Whether you need a one-time analysis, a reusable tool, or a full evaluation framework built around Bayesian reasoning, we can design something that fits your team, your data, and your decisions.

Interested in a custom solution? Contact John Deke to discuss what a tailored BASIE implementation could look like for your organization.

The PIP Foundation: A Motivating Example

The Philanthropy in Perpetuity (PIP) Foundation is a fictional organization facing a very real challenge: how do you choose which programs to fund when you have dozens of rigorous evaluations to draw from, and the results are noisy?

PIP wants to fund math interventions in school districts. They have set two explicit criteria for what “works”: a program should be very unlikely to produce a negative effect, and more likely than not to improve math scores by at least 0.10 standard deviations — a threshold they consider educationally meaningful. Their evidence base is 44 rigorous evaluations of programs funded by the U.S. Department of Education’s Investing in Innovation (i3) program.

44 rigorous evaluations · 5 apparent winners · 0.40 SD standout estimate

1. The Standard Approach

PIP’s staff apply a straightforward filter: select programs with estimates that are positive, statistically significant, and larger than 0.10 standard deviations. This filter yields five candidates, with impact estimates of 0.40, 0.22, 0.18, 0.16, and 0.13 standard deviations. The 0.40 estimate looks especially compelling — nearly twice as large as the next best result.

Figure 1. Frequentist impact estimates for all 44 i3 evaluations, sorted by magnitude. Gold bars are programs that meet PIP’s selection criteria (statistically significant and effect > 0.10 SD). Asterisks (*) mark statistically significant estimates.

This looks like a solid, evidence-based shortlist. But is it as good as it seems?

2. The Problem: The Winner’s Curse

PIP’s procedure has a conceptually subtle but consequential flaw. When winners are selected using noisy performance estimates, the winners’ true quality is almost always lower than their measured performance. Part of what made them look best was luck. This is the winner’s curse, and it shows up across many fields.

To quantify how serious this is for PIP, we can use a simulation grounded in the i3 evidence itself. A meta-analysis of the 44 evaluations reveals that only about 8% of i3 programs have a true effect larger than 0.10 SD — a needle-in-a-haystack problem. When genuinely effective programs are rare, a filter that selects partly on luck picks up mostly hay.

6.5×: the magnitude error of the standard approach. Simulations show that “winning” estimates average 0.52 SD, but the true effects behind those winners average only 0.08 SD. Across simulations, fewer than half of selected programs truly meet PIP’s criteria.

3. The Solution: BASIE

BASIE (Bayesian Interpretation of Estimates) replaces “Is this statistically significant?” with questions that directly support decision making: How likely is it that the program truly works? How likely is it that the effect exceeds a threshold the decision maker actually cares about?

BASIE uses prior evidence about how program effects are typically distributed in a domain and combines it with the current study’s estimates to produce probability statements that map directly onto PIP’s decision criteria. The results are dramatically better:

Standard approach vs. BASIE:

  • Magnitude error (best candidate): 6.5× vs. 1.1×
  • Does the best pick truly win? Standard approach: no; the true effect of the “best” pick is smaller than that of the median pick. BASIE: yes; the apparent winner truly is the winner.
  • Selected candidates exceed the threshold: 37–43% of the time vs. 57–72% of the time
  • Knows when to say no? The standard approach selects a candidate in 97% of simulations, often when it shouldn’t; BASIE selects a candidate in only 53% of simulations.

BASIE Applied to PIP’s Five Candidates

The figure below shows what happens when BASIE reinterprets the five apparent winners. The grey bars show the original frequentist estimates; the teal bars show the Bayesian estimates, which are “shrunk” toward more realistic values by accounting for the overall distribution of effects in the i3 portfolio.

Figure 2. Frequentist estimates (grey) vs. Bayesian estimates (teal) for the five programs selected by PIP’s standard approach. Bayesian estimates are shrunk substantially toward more plausible values.
No candidate clears 50%. The honest answer may be: none of these programs meet the bar. This is a little disappointing now — but it prevents a much bigger disappointment later.
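The shrinkage behind Figure 2 follows from the standard conjugate normal update. As a minimal sketch: given one study’s estimate y with standard error se, and a portfolio prior θ ∼ N(μ0, τ²), the posterior for the true effect is again normal. The μ0 and τ values below are illustrative placeholders, not the tool’s fitted i3 values:

```python
import math

def shrink(y, se, mu0=0.0, tau=0.06):
    """Conjugate normal update for one study's true effect theta, given its
    estimate y (standard error se) and a portfolio prior N(mu0, tau^2).
    mu0 and tau here are illustrative placeholders, not fitted values."""
    w = tau**2 / (tau**2 + se**2)      # weight on the data; the rest goes to the prior
    post_mean = w * y + (1 - w) * mu0
    post_sd = math.sqrt(w) * se        # sqrt of tau^2 * se^2 / (tau^2 + se^2)
    return post_mean, post_sd

def prob_exceeds(y, se, threshold=0.10, **kw):
    """P(theta > threshold) under the conjugate normal posterior."""
    m, s = shrink(y, se, **kw)
    return 0.5 * math.erfc((threshold - m) / (s * math.sqrt(2)))

m, s = shrink(0.40, 0.15)   # the standout 0.40 SD estimate, with an assumed se
print(f"shrunken estimate: {m:.2f} SD; P(theta > 0.10) = {prob_exceeds(0.40, 0.15):.2f}")
```

When the prior says large true effects are rare, even a 0.40 SD estimate with a sizable standard error shrinks to a much smaller posterior mean, and the probability of clearing the 0.10 SD bar can fall below 50%.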

4. What This Means for Grant Making

The winner’s curse is systematic, not accidental: it affects any significance-filtered shortlist. A foundation in PIP’s position can protect itself by replacing the significance filter with probability statements that map directly onto its decision criteria, and by accepting that sometimes the honest answer is to fund nothing this round.

What the evaluations found

The pre‑loaded estimates are the 44 i3 evaluations from the PIP Foundation motivating example (see the first tab). Upload your own CSV to replace them. Using your own data? Not required, but please consider dropping me a line at jdeke73@gmail.com; I’d love to know how you’re using BASIE Workbench.

CSV must include Estimate and StdError columns. Optional: id, Description.


What does the evidence really mean?

Traditional methods tell you how the estimates look — BASIE tells you what they mean. Use the views below to compare frequentist estimates with Bayesian posteriors, and to see the probability that each true effect exceeds a threshold that matters for your decisions.
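Given posterior draws of the true effects, the threshold probabilities shown on this tab reduce to counting draws. A minimal sketch (the function name and the placeholder draws below are illustrative, not the tool’s internals):

```python
import numpy as np

def threshold_prob(theta_draws, threshold):
    """Fraction of posterior draws in which each study's true effect exceeds
    a decision threshold. theta_draws has shape (n_draws, n_studies)."""
    return (np.asarray(theta_draws) > threshold).mean(axis=0)

# Illustrative posterior draws for two studies (placeholder numbers, not i3 results):
rng = np.random.default_rng(0)
draws = rng.normal(loc=[0.06, 0.02], scale=0.04, size=(10_000, 2))
print(threshold_prob(draws, 0.10))
```

The same one-liner works for any threshold a decision maker cares about, which is what makes the probability statements tailored rather than tied to a fixed significance cutoff.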

⚙︎  Adjust prior assumptions
Prior on μ: N(0, .10) · N(0, .20) · N(0, .40) · N(0, 1.0)
Prior on τ: Gamma(2, 16) · Gamma(2, 8) · Gamma(2, 4) · Gamma(2, 2) · Uninformative

What is the true impact likely to be in another study like those in this analysis?

Studies measure program impacts with error. This prediction sets that measurement error aside and asks: if we could observe the true effect in a comparable study—not just one study’s noisy estimate of it—what would we expect to find?

Enter a threshold to calculate the probability that the next true effect exceeds it.
⚙︎  Adjust prior assumptions
Prior on μ: N(0, .10) · N(0, .20) · N(0, .40) · N(0, 1.0)
Prior on τ: Gamma(2, 16) · Gamma(2, 8) · Gamma(2, 4) · Gamma(2, 2) · Uninformative
Upload estimates on the Enter Estimates tab to see predictions.

About this tool

BASIE Workbench is a browser-based implementation of the BASIE (Bayesian Interpretation of Estimates) framework, developed by John Deke and Mariel Finucane. It applies a hierarchical Bayesian model to a portfolio of impact estimates, producing posterior probabilities and predictive intervals that directly support evidence-based decisions. The motivating example used throughout this tool — the fictional PIP Foundation evaluating 44 i3 math interventions — is described in detail on the PIP Foundation tab.

The statistical model

The model is a two-level normal-normal hierarchy, a direct generalization of the eight schools model introduced by Rubin (1981). For J studies with impact estimates yj and known standard errors σj:

yj ∼ N(θj, σj²)    [observation model]
θj ∼ N(μ, τ²)    [population model]
μ ∼ N(m0, s0²)    [prior on mean effect]
τ ∼ Gamma(k, β)    [prior on SD of effects]

Here θj is the true causal effect for study j, μ is the mean of true effects across the population of studies, and τ is the standard deviation of true effects. The prior on μ is always zero-centered (m0 = 0); the slider controls s0, the prior standard deviation. All priors on μ are proper. For τ, the Gamma family is used because it is defined only on the positive real line (enforcing τ ≥ 0) and, with shape parameter k = 2, places relatively little mass near zero — avoiding the unrealistic assumption that all true effects are identical. An improper flat prior on τ is also available; in that case τ is estimated entirely from the data with no regularization toward any particular level of heterogeneity.

Posterior computation

Inference marginalizes over μ analytically, conditional on τ. The marginal posterior of τ given the data is evaluated on a fine grid (1,000 points from 0.001 to 40), and τ is sampled from this discrete approximation. Conditional on each τ draw, μ is sampled from its conjugate normal posterior, and each θj is then sampled from its conjugate normal posterior given (μ, τ). This grid-based approach follows the algorithm described in Deke & Finucane (2019). The default uses 10,000 posterior draws.
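The grid-based algorithm above can be sketched in a few dozen lines. This is an illustrative implementation under stated assumptions, not the Workbench’s actual code: in particular, the Gamma prior on τ is treated as shape k with rate β, and the usage example’s standard errors are hypothetical:

```python
import numpy as np

def basie_posterior(y, sigma, m0=0.0, s0=0.2, k=2.0, beta=2.0,
                    n_draws=10_000, seed=0):
    """Grid-based sampler for the two-level normal-normal model.
    A sketch of the algorithm described in the text, not the tool's exact code.
    Assumption: Gamma(k, beta) is parameterized as shape k, rate beta."""
    rng = np.random.default_rng(seed)
    y, sigma = np.asarray(y, float), np.asarray(sigma, float)
    grid = np.linspace(0.001, 40.0, 1000)        # grid over tau

    # log p(tau | y), with mu integrated out analytically given tau
    log_post = np.empty(grid.size)
    for i, tau in enumerate(grid):
        v = sigma**2 + tau**2                    # Var(y_j | mu, tau)
        prec = 1.0 / s0**2 + np.sum(1.0 / v)     # posterior precision of mu
        b = m0 / s0**2 + np.sum(y / v)
        quad = np.sum(y**2 / v) + m0**2 / s0**2 - b**2 / prec
        log_lik = -0.5 * (np.sum(np.log(v)) + np.log(prec) + quad)
        log_prior = (k - 1.0) * np.log(tau) - beta * tau   # Gamma, up to a constant
        log_post[i] = log_lik + log_prior

    p = np.exp(log_post - log_post.max())
    tau_draws = rng.choice(grid, size=n_draws, p=p / p.sum())

    # Conditional on each tau draw: conjugate normal for mu, then each theta_j
    v = sigma[None, :]**2 + tau_draws[:, None]**2
    prec = 1.0 / s0**2 + np.sum(1.0 / v, axis=1)
    mu_hat = (m0 / s0**2 + np.sum(y[None, :] / v, axis=1)) / prec
    mu_draws = rng.normal(mu_hat, 1.0 / np.sqrt(prec))

    w = tau_draws[:, None]**2 / v                # shrinkage weight on the data
    theta_mean = w * y[None, :] + (1.0 - w) * mu_draws[:, None]
    theta_draws = rng.normal(theta_mean, np.sqrt(w) * sigma[None, :])
    return mu_draws, tau_draws, theta_draws

# Usage with the five candidate estimates and hypothetical standard errors:
y = [0.40, 0.22, 0.18, 0.16, 0.13]
sigma = [0.15, 0.10, 0.09, 0.08, 0.07]
mu_d, tau_d, theta_d = basie_posterior(y, sigma)
```

Because each conditional step is conjugate, the only numerical approximation is the discrete grid over τ, which is why a fine grid substitutes for a general-purpose MCMC sampler here.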

The default priors, μ ∼ N(0, 0.2²) and τ ∼ Gamma(2, 2), were chosen to reflect the empirical distribution of true effects observed in the i3 evaluation portfolio: most true effects are small, and fewer than 10% exceed 0.10 standard deviations.

The predictive distribution

The Predict tab displays the posterior predictive distribution for the true effect in a new study drawn from the same population — that is, θnew ∼ N(μ, τ²) marginalized over the joint posterior of (μ, τ). This is not the predictive distribution for a new estimate ynew, which would add an additional layer of sampling variance. The distinction matters: the predictive interval for the true effect answers the decision-relevant question of what impact a comparable program would actually have, setting aside measurement error.
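Given joint posterior draws of (μ, τ), the predictive draws are one more sampling step. A minimal sketch; the placeholder draws below are illustrative values, not fitted i3 results:

```python
import numpy as np

def predict_true_effect(mu_draws, tau_draws, rng=None):
    """Draws of the TRUE effect in a new comparable study:
    theta_new ~ N(mu, tau^2), marginalized over the posterior of (mu, tau).
    A new *estimate* y_new would add the new study's sampling variance on top."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(mu_draws, tau_draws)

# Illustrative posterior draws (placeholder values, not fitted results):
rng = np.random.default_rng(1)
mu_draws = rng.normal(0.02, 0.01, 10_000)
tau_draws = np.abs(rng.normal(0.05, 0.01, 10_000))
theta_new = predict_true_effect(mu_draws, tau_draws)
print(f"P(theta_new > 0.10) = {np.mean(theta_new > 0.10):.2f}")
```

Pairing each μ draw with its matching τ draw propagates the uncertainty about the population itself into the prediction, rather than plugging in point estimates.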

Data requirements

Upload a CSV with at minimum two columns: Estimate (the point estimate) and StdError (its standard error). Optional columns id and Description are used for labeling. All estimates should be in comparable units (e.g., standardized effect sizes). Rows with missing or non-positive standard errors are dropped automatically.
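The parsing rules above amount to a short validation routine. This is a sketch of the documented behavior using only the standard library, not the tool’s actual parser:

```python
import csv
import io

def load_estimates(text):
    """Parse BASIE Workbench-style CSV text. Sketch of the documented rules:
    Estimate and StdError are required columns; id and Description are optional;
    rows with missing or non-positive standard errors are dropped."""
    reader = csv.DictReader(io.StringIO(text))
    if not {"Estimate", "StdError"} <= set(reader.fieldnames or []):
        raise ValueError("CSV must include Estimate and StdError columns")
    rows = []
    for i, row in enumerate(reader, start=1):
        try:
            est, se = float(row["Estimate"]), float(row["StdError"])
        except (TypeError, ValueError):
            continue                  # missing or malformed values: drop row
        if se <= 0:
            continue                  # non-positive standard error: drop row
        rows.append({"id": row.get("id") or str(i),
                     "Description": row.get("Description") or "",
                     "Estimate": est, "StdError": se})
    return rows

sample = "Estimate,StdError\n0.40,0.15\n0.22,0\n0.18,0.09\n"
print(load_estimates(sample))
```

In the sample above, the second row is silently dropped because its standard error is zero, matching the drop rule described in the text.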

Acknowledgments

BASIE was developed by John Deke and Mariel Finucane. BASIE Workbench is a browser-based demonstration tool built by John Deke. The pre-loaded i3 evaluation estimates are drawn from the IES-funded evaluation portfolio described in Deke & Finucane (2019).

References

  1. Deke, J. & Finucane, M. (2019). Moving Beyond Statistical Significance: The BASIE (Bayesian Interpretation of Estimates) Framework for Interpreting Findings from Impact Evaluations. OPRE Report 2019-35. Office of Planning, Research, and Evaluation, ACF, U.S. DHHS. acf.gov
  2. IES BASIE Framework and Toolkit. ies.ed.gov
  3. Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4), 377–401.
  4. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC. stat.columbia.edu
  5. Gelman, A. (2022). Hey, check this out — it’s really cool: A Bayesian framework for interpreting findings from impact evaluations. Statistical Modeling, Causal Inference, and Social Science (blog). statmodeling.stat.columbia.edu
  6. Wasserstein, R. L. & Lazar, N. A. (2016). The ASA’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. doi.org
  7. Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19. doi.org