Decision makers want data to help answer questions like:

- Did the program really work?
- Is the difference between groups real?
- Is the change over time meaningful?
Those are the right questions! But the standard statistical methods that researchers report — like statistical significance — were never designed to answer them.
Statistical significance tells you that a random error as large as your estimate is unlikely. As we will see with The PIP Foundation, that’s importantly different from telling you whether the program really worked, the difference between groups is real, or the change over time is meaningful.
It’s also based on an arbitrary threshold, unrelated to the goals of the decision maker.
BASIE (Bayesian Interpretation of Estimates) provides tailored, flexible answers to the right questions: given all the evidence, what is the probability this program truly produces meaningful results, that the outcomes are truly different between groups, or that outcomes improved over time?
Work through the tabs from left to right, or jump directly to what you need.
| Tab | | What it does |
|---|---|---|
| The PIP Foundation | → | A worked example showing how a fictional foundation’s standard approach to picking grant winners goes wrong — and how BASIE fixes it. Start here if you’re new to BASIE. |
| Enter Estimates | → | Load your own estimates. Upload a CSV with impact estimates and standard errors, or explore the pre-loaded i3 evaluation data from the PIP Foundation example. |
| Interpret Estimates | → | See Bayesian re-interpretations of your estimates side by side with the original frequentist results. Includes a forest plot, posterior summary table, and probability statements for each study. |
| Predict | → | Project what a new study from the same portfolio of evidence would likely find — a forward-looking answer to “what should we expect if we tried this again?” |
| About BASIE Workbench | → | Technical background on the BASIE framework, the statistical model, and links to the underlying research. |
This tool is a working demonstration of what BASIE can do with a simple model and a standard dataset. It is designed to make the core concepts tangible and to give you a genuine, hands-on sense of the Bayesian approach — not just a description of it.
More advanced implementations can go considerably further.
Whether you need a one-time analysis, a reusable tool, or a full evaluation framework built around Bayesian reasoning, we can design something that fits your team, your data, and your decisions.
The Philanthropy in Perpetuity (PIP) Foundation is a fictional organization facing a very real challenge: how do you choose which programs to fund when you have dozens of rigorous evaluations to draw from, and the results are noisy?
PIP wants to fund math interventions in school districts. They have set two explicit criteria for what “works”: a program should be very unlikely to produce a negative effect, and more likely than not to improve math scores by at least 0.10 standard deviations — a threshold they consider educationally meaningful. Their evidence base is 44 rigorous evaluations of programs funded by the U.S. Department of Education’s Investing in Innovation (i3) program.
PIP’s staff apply a straightforward filter: select programs with estimates that are positive, statistically significant, and larger than 0.10 standard deviations. This filter yields five candidates, with impact estimates of 0.40, 0.22, 0.18, 0.16, and 0.13 standard deviations. The 0.40 estimate looks especially compelling — nearly twice as large as the next best result.
This looks like a solid, evidence-based shortlist. But is it as good as it seems?
PIP’s procedure has a conceptually subtle but consequential flaw. When winners are selected using noisy performance estimates, the winners’ true quality is almost always lower than their measured performance. Part of what made them look best was luck. This is the winner’s curse, and it shows up across many fields.
To quantify how serious this is for PIP, we can use a simulation grounded in the i3 evidence itself. A meta-analysis of the 44 evaluations reveals that only about 8% of i3 programs have a true effect larger than 0.10 SD — a needle-in-a-haystack problem. When genuinely effective programs are rare, a filter that selects partly on luck picks up mostly hay.
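The mechanism behind this simulation can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual simulation: the distribution of true effects, the common standard error, and the function name are all illustrative assumptions chosen so that only a minority of programs truly exceed 0.10 SD.

```python
import numpy as np

def winners_curse_demo(n_programs=44, n_reps=2000, seed=0):
    """Simulate significance-filtered selection from a noisy portfolio and
    return the average gap between winners' estimates and their true effects.
    The effect distribution and standard error below are illustrative."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_reps):
        # Hypothetical portfolio: true effects mostly small, so only a
        # minority exceed 0.10 SD (the rare-needle setting described above).
        true = rng.normal(0.03, 0.06, n_programs)
        se = 0.06                                  # assumed common standard error
        est = rng.normal(true, se)                 # noisy evaluation estimates
        # The filter: positive, statistically significant, larger than 0.10 SD.
        sel = (est > 0.10) & (est / se > 1.96)
        if sel.any():
            gaps.append(est[sel].mean() - true[sel].mean())
    return float(np.mean(gaps))

# Winners' estimates overstate their true effects on average:
print(f"average overstatement among winners: {winners_curse_demo():.3f} SD")
```

Because selection conditions on a large estimate, the noise term among selected programs is positive on average — exactly the winner's curse.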
BASIE (Bayesian Interpretation of Estimates) replaces “Is this statistically significant?” with questions that directly support decision making:

- What is the probability this program’s true effect is negative?
- What is the probability its true effect exceeds a meaningful threshold, such as 0.10 standard deviations?
BASIE uses prior evidence about how program effects are typically distributed in a domain and combines it with the current study’s estimates to produce probability statements that map directly onto PIP’s decision criteria. The results are dramatically better:
| | Standard approach | BASIE |
|---|---|---|
| Magnitude error (best candidate) | 6.5× | 1.1× |
| Does the best pick truly win? | No — true effect of “best” pick is smaller than the median pick | Yes — apparent winner truly is the winner |
| Selected candidates exceed threshold | 37–43% of the time | 57–72% of the time |
| Knows when to say no? | Selects a candidate in 97% of simulations — but often shouldn’t | Selects a candidate in only 53% of simulations |
The figure below shows what happens when BASIE reinterprets the five apparent winners. The grey bars show the original frequentist estimates; the teal bars show the Bayesian estimates, which are “shrunk” toward more realistic values by accounting for the overall distribution of effects in the i3 portfolio.
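The shrinkage shown in the figure can be illustrated with a simplified normal-normal posterior-mean calculation. This sketch fixes the hyperparameters at illustrative values (mean 0.03, spread 0.06, standard error 0.08), whereas the app estimates them from the full portfolio; the function name and all numbers besides the five estimates are assumptions.

```python
import numpy as np

def shrink(estimate, std_error, mu=0.03, tau=0.06):
    """Posterior mean of a true effect in a normal-normal model with FIXED
    hyperparameters mu (mean of true effects) and tau (their spread).
    Illustrative only; the app estimates mu and tau from all 44 studies."""
    precision_data = 1.0 / std_error**2
    precision_prior = 1.0 / tau**2
    weight = precision_data / (precision_data + precision_prior)
    # Weighted average of the noisy estimate and the portfolio mean.
    return weight * estimate + (1 - weight) * mu

# The five apparent winners from the significance filter:
for y in [0.40, 0.22, 0.18, 0.16, 0.13]:
    print(f"estimate {y:.2f} -> shrunken {shrink(y, std_error=0.08):.2f}")
```

The noisier the estimate relative to the portfolio's spread of true effects, the more it is pulled toward the portfolio mean — which is why the outlying 0.40 estimate shrinks the most in absolute terms.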
The winner’s curse is systematic, not accidental — it affects any significance-filtered shortlist. Useful next steps for a foundation in PIP’s position might include:

- Re-interpreting the shortlist with BASIE (Interpret Estimates tab), making explicit each candidate’s probability of truly exceeding the 0.10 SD threshold.
- Loading the foundation’s own impact estimates (Enter Estimates tab).
- Projecting what a new study from the same portfolio would likely find (Predict tab).
The pre-loaded estimates are the 44 i3 evaluations from the PIP Foundation motivating example (see first tab). Upload your own CSV to replace them. If you’re working with your own data, please consider dropping me a line at jdeke73@gmail.com — I’d love to know how you’re using BASIE Workbench.
CSV must include Estimate and StdError columns. Optional: id, Description.
Traditional methods tell you how the estimates look — BASIE tells you what they mean. Use the views below to compare frequentist estimates with Bayesian posteriors, and to see the probability that each true effect exceeds a threshold that matters for your decisions.
Studies measure program impacts with error. This prediction sets that measurement error aside and asks: if we could observe the true effect in a comparable study—not just one study’s noisy estimate of it—what would we expect to find?
BASIE Workbench is a browser-based implementation of the BASIE (Bayesian Interpretation of Estimates) framework, developed by John Deke and Mariel Finucane. It applies a hierarchical Bayesian model to a portfolio of impact estimates, producing posterior probabilities and predictive intervals that directly support evidence-based decisions. The motivating example used throughout this tool — the fictional PIP Foundation evaluating 44 i3 math interventions — is described in detail on the PIP Foundation tab.
The model is a two-level normal-normal hierarchy, a direct generalization of the eight schools model introduced by Rubin (1981). For J studies with impact estimates yj and known standard errors σj:

    yj | θj ~ Normal(θj, σj²)
    θj | μ, τ ~ Normal(μ, τ²)

with hyperpriors on μ and τ as described below.
Here θj is the true causal effect for study j, μ is the mean of true effects across the population of studies, and τ is the standard deviation of true effects. The prior on μ is always zero-centered (m0 = 0); the slider controls s0, the prior standard deviation. All priors on μ are proper. For τ, the Gamma family is used because it is defined only on the positive real line (enforcing τ ≥ 0) and, with shape parameter k = 2, places relatively little mass near zero — avoiding the unrealistic assumption that all true effects are identical. An improper flat prior on τ is also available; in that case τ is estimated entirely from the data with no regularization toward any particular level of heterogeneity.
Inference proceeds by marginalizing over μ analytically, conditional on τ. The marginal posterior of τ given the data is evaluated on a fine grid (1,000 points from 0.001 to 40), and τ is sampled from this discrete approximation. Conditional on each τ draw, μ is sampled from its conjugate normal posterior, and each θj is then sampled from its conjugate normal posterior given (μ, τ). This grid-based approach follows the algorithm described in Deke & Finucane (2019). The default uses 10,000 posterior draws.
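The grid-based algorithm can be sketched in NumPy. This is a minimal re-implementation under stated assumptions, not the app's code: it assumes the rate parameterization of the Gamma(2, 2) prior on τ, a default prior standard deviation s0 = 0.2 for μ, and the function name is hypothetical.

```python
import numpy as np

def basie_grid_sampler(y, sigma, s0=0.2, n_draws=10_000, seed=0):
    """Grid-based sampler for the two-level normal-normal hierarchy.
    Assumes mu ~ Normal(0, s0^2) and tau ~ Gamma(shape=2, rate=2)."""
    rng = np.random.default_rng(seed)
    y, sigma = np.asarray(y, float), np.asarray(sigma, float)
    grid = np.linspace(0.001, 40, 1000)               # tau grid

    # Log marginal posterior of tau, with mu integrated out analytically.
    V = grid[:, None]**2 + sigma[None, :]**2          # tau^2 + sigma_j^2
    P = 1/s0**2 + (1/V).sum(axis=1)                   # posterior precision of mu
    b = (y/V).sum(axis=1)
    log_post = (-0.5*np.log(2*np.pi*V).sum(axis=1)
                - 0.5*np.log(s0**2 * P)
                - 0.5*((y**2/V).sum(axis=1) - b**2/P)
                + np.log(grid) - 2*grid)              # Gamma(2, 2) log prior
    w = np.exp(log_post - log_post.max())
    tau = rng.choice(grid, size=n_draws, p=w/w.sum())

    # Conditional on each tau draw, sample mu and then each theta_j from
    # their conjugate normal posteriors.
    V = tau[:, None]**2 + sigma[None, :]**2
    P = 1/s0**2 + (1/V).sum(axis=1)
    mu = rng.normal((y/V).sum(axis=1)/P, 1/np.sqrt(P))
    prec = 1/sigma[None, :]**2 + 1/tau[:, None]**2
    mean = (y[None, :]/sigma[None, :]**2 + mu[:, None]/tau[:, None]**2)/prec
    theta = rng.normal(mean, 1/np.sqrt(prec))
    return tau, mu, theta
```

Because μ integrates out in closed form, the only numerical approximation is the one-dimensional grid over τ; everything downstream is exact conjugate sampling.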
The default priors — μ ~ N(0, 0.2²) and τ ~ Gamma(2, 2) — were chosen to reflect the empirical distribution of true effects observed in the i3 evaluation portfolio: most true effects are small, and fewer than 10% exceed 0.10 standard deviations.
The Predict tab displays the posterior predictive distribution for the true effect in a new study drawn from the same population — that is, θnew ~ N(μ, τ²) marginalized over the joint posterior of (μ, τ). This is not the predictive distribution for a new estimate ynew, which would add an additional layer of sampling variance. The distinction matters: the predictive interval for the true effect answers the decision-relevant question of what impact a comparable program would actually have, setting aside measurement error.
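Given joint posterior draws of (μ, τ), the predictive distribution for a new true effect is one draw of θnew per (μ, τ) pair. The sketch below uses stand-in values for the posterior draws (not output from the app's sampler) purely to show the mechanics and the decision-relevant summaries.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for joint posterior draws of (mu, tau); in the app these come
# from the grid-based sampler. The values here are illustrative only.
mu_draws = rng.normal(0.03, 0.01, 10_000)
tau_draws = np.abs(rng.normal(0.06, 0.01, 10_000))

# Posterior predictive for the TRUE effect in a new study: one theta_new
# per (mu, tau) draw, i.e. N(mu, tau^2) marginalized over the posterior.
theta_new = rng.normal(mu_draws, tau_draws)

# Decision-relevant summaries, matching PIP's two criteria.
print("P(theta_new < 0)    =", (theta_new < 0).mean())
print("P(theta_new > 0.10) =", (theta_new > 0.10).mean())
```

Sampling a new estimate ynew would instead add a further rng.normal(theta_new, se_new) step, widening the interval by the new study's measurement error.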
Upload a CSV with at minimum two columns: Estimate (the point estimate) and StdError (its standard error). Optional columns id and Description are used for labeling. All estimates should be in comparable units (e.g., standardized effect sizes). Rows with missing or non-positive standard errors are dropped automatically.
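A loader matching this format might look as follows. This is a hedged sketch, not the app's code: the function name `load_estimates` and the example rows are invented for illustration; only the column names and the drop rule come from the description above.

```python
import io
import pandas as pd

def load_estimates(csv_source):
    """Read a CSV of impact estimates. Estimate and StdError are required;
    id and Description are optional. Rows with missing or non-positive
    standard errors are dropped, per the rule described above."""
    df = pd.read_csv(csv_source)
    missing = {"Estimate", "StdError"} - set(df.columns)
    if missing:
        raise ValueError(f"CSV is missing required columns: {sorted(missing)}")
    df = df.dropna(subset=["Estimate", "StdError"])
    return df[df["StdError"] > 0].reset_index(drop=True)

# Illustrative input; the third row is dropped because its SE is zero.
example = io.StringIO(
    "id,Estimate,StdError,Description\n"
    "1,0.40,0.08,Example program A\n"
    "2,0.05,0.06,Example program B\n"
    "3,0.12,0,Bad row\n"
)
print(load_estimates(example))
```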
BASIE was developed by John Deke and Mariel Finucane. BASIE Workbench is a browser-based demonstration tool built by John Deke. The pre-loaded i3 evaluation estimates are drawn from the IES-funded evaluation portfolio described in Deke & Finucane (2019).