[Google Scholar] [arXiv]

Working Papers

The balancing act in causal inference. Eli Ben-Michael, Avi Feller, David A. Hirshberg, and José Zubizarreta. [abstract; pre-print]

The idea of covariate balance is at the core of causal inference. Inverse propensity weights play a central role because they are the unique set of weights that balance the covariate distributions of different treatment groups. We discuss two broad approaches to estimating these weights: the more traditional one, which fits a propensity score model and then uses the reciprocal of the estimated propensity score to construct weights, and the balancing approach, which estimates the inverse propensity weights essentially by the method of moments, finding weights that achieve balance in the sample. We review ideas from the causal inference, sample surveys, and semiparametric estimation literatures, with particular attention to the role of balance as a sufficient condition for robust inference. We focus on the inverse propensity weighting and augmented inverse propensity weighting estimators for the average treatment effect given strong ignorability and consider generalizations for a broader class of problems including policy evaluation and the estimation of individualized treatment effects.
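The balancing property described above can be written as a one-line identity. This is a sketch in notation assumed here rather than taken from the paper: T is a binary treatment indicator, X the covariates, e(X) = P(T = 1 | X) > 0 the propensity score, f any integrable function of the covariates, and w_i a weight for unit i.

```latex
\mathbb{E}\!\left[\frac{T}{e(X)}\, f(X)\right]
  = \mathbb{E}\!\left[\frac{\mathbb{E}[T \mid X]}{e(X)}\, f(X)\right]
  = \mathbb{E}\!\left[f(X)\right],
\qquad
\text{sample analogue:}\quad
\frac{1}{n}\sum_{i=1}^{n} T_i\, w_i\, f(X_i) \;\approx\; \frac{1}{n}\sum_{i=1}^{n} f(X_i).
```

Under this notation, the modeling approach plugs in w_i = 1/ê(X_i) from a fitted propensity score, while the balancing approach chooses the w_i so that the sample analogue holds, exactly or approximately, for a chosen set of functions f.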

Estimating racial disparities in emergency general surgery. Eli Ben-Michael, Avi Feller, Rachel Kelz, and Luke Keele. [abstract; pre-print]

Research documents that Black patients experience worse general surgery outcomes than white patients in the United States. In this paper, we focus on an important but less-examined category: the surgical treatment of emergency general surgery (EGS) conditions, which refers to medical emergencies where the injury is "endogenous," such as a burst appendix. Our goal is to assess racial disparities for common outcomes after EGS treatment using an administrative database of hospital claims in New York, Florida, and Pennsylvania, and to understand the extent to which differences are attributable to patient-level risk factors versus hospital-level factors. To do so, we use a class of linear weighting estimators that re-weight white patients to have a similar distribution of baseline characteristics as Black patients. This framework nests many common approaches, including matching and linear regression, but offers important advantages over these methods in terms of controlling imbalance between groups, minimizing extrapolation, and reducing computation time. Applying this approach to the claims data, we find that disparity estimates that adjust for the admitting hospital are substantially smaller than estimates that adjust for patient baseline characteristics only, suggesting that hospital-specific factors are important drivers of racial disparities in EGS outcomes.

Augmented balancing weights as linear regression. David Bruns-Smith, Oliver Dukes, Avi Feller, and Betsy Ogburn. [abstract; pre-print]

We provide a novel characterization of augmented balancing weights, also known as Automatic Debiased Machine Learning (AutoDML). These estimators combine outcome modeling with balancing weights, which estimate inverse propensity score weights directly. When the outcome and weighting models are both linear in some (possibly infinite) basis, we show that the augmented estimator is equivalent to a single linear model with coefficients that combine the original outcome model coefficients and OLS; in many settings, the augmented estimator collapses to OLS alone. We then extend these results to specific choices of outcome and weighting models. We first show that the combined estimator that uses (kernel) ridge regression for both outcome and weighting models is equivalent to a single, undersmoothed (kernel) ridge regression; this also holds when considering asymptotic rates. When the weighting model is instead lasso regression, we give closed-form expressions for special cases and demonstrate a "double selection" property. Finally, we generalize these results to linear estimands via the Riesz representer. Our framework "opens the black box" on these increasingly popular estimators and provides important insights into estimation choices for augmented balancing weights.
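One simple special case shows how an augmented estimator can collapse to OLS. The derivation below is a sketch under assumed notation, not the paper's general result: X is the n × p covariate matrix for the source units (with XᵀX invertible), Y their outcomes, x̄ the target covariate mean, β̂ any fitted linear outcome coefficients, and w a vector of weights.

```latex
\hat{\mu}_{\text{aug}}
  = \bar{x}^{\top}\hat{\beta} + w^{\top}\!\left(Y - X\hat{\beta}\right)
  = w^{\top}Y + \left(\bar{x} - X^{\top}w\right)^{\top}\hat{\beta}.
% If the weights balance exactly, X^T w = \bar{x}, the second term vanishes.
% The minimum-norm exact balancing weights are
w = X\left(X^{\top}X\right)^{-1}\bar{x}
\quad\Longrightarrow\quad
\hat{\mu}_{\text{aug}} = w^{\top}Y
  = \bar{x}^{\top}\left(X^{\top}X\right)^{-1}X^{\top}Y
  = \bar{x}^{\top}\hat{\beta}_{\text{OLS}}.
```

In this special case the outcome model coefficients β̂ drop out entirely, whatever they were, and the estimate is the OLS prediction at the target mean.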

Distribution-free assessment of population overlap in observational studies. Lihua Lei, Alexander D’Amour, Peng Ding, Avi Feller, and Jasjeet Sekhon. [abstract; pre-print]

Overlap in baseline covariates between treated and control groups, also known as positivity or common support, is a common assumption in observational causal inference. Assessing this assumption is often ad hoc, however, and can give misleading results. For example, the common practice of examining the empirical distribution of estimated propensity scores is heavily dependent on model specification and has poor uncertainty quantification. In this paper, we propose a formal statistical framework for assessing the extrema of the population propensity score; e.g., the propensity score lies in [0.1, 0.9] almost surely. We develop a family of upper confidence bounds, which we term O-values, for this quantity. We show these bounds are valid in finite samples so long as the observations are independent and identically distributed, without requiring any further modeling assumptions on the data generating process. Finally, we demonstrate this approach using benchmark observational studies, showing how to build our proposed method into the observational causal inference workflow.

Using multiple outcomes to improve the Synthetic Control Method. Liyang Sun, Eli Ben-Michael, and Avi Feller. [abstract; pre-print]

When there are multiple outcome series of interest, Synthetic Control analyses typically proceed by estimating separate weights for each outcome. In this paper, we instead propose estimating a common set of weights across outcomes, by balancing either a vector of all outcomes or an index or average of them. Under a low-rank factor model, we show that these approaches lead to lower bias bounds than separate weights, and that averaging leads to further gains when the number of outcomes grows. We illustrate this via simulation and in a re-analysis of the impact of the Flint water crisis on educational outcomes.


Statistical methods to estimate the impact of gun policy on gun violence. Eli Ben-Michael, Mitchell Doucette, Avi Feller, Alexander McCourt, and Elizabeth Stuart. In Gun Violence: Statistical Issues, edited by C. Loeffler, L. Xue, and J. Rosenberger. [abstract; pre-print]

Gun violence is a critical public health and safety concern in the United States. There is considerable variability in policy proposals meant to curb gun violence, ranging from increasing gun availability to deter potential assailants (e.g., concealed carry laws or arming school teachers) to restricting access to firearms (e.g., universal background checks or banning assault weapons). Many studies use state-level variation in the enactment of these policies in order to quantify their effect on gun violence. In this paper, we discuss the policy trial emulation framework for evaluating the impact of these policies, and show how to apply this framework to estimating impacts via difference-in-differences and synthetic controls when there is staggered adoption of policies across jurisdictions, estimating the impacts of right-to-carry laws on violent crime as a case study.

Towards representation learning for weighting problems in design-based causal inference. Oscar Clivio, Avi Feller, and Chris Holmes. Uncertainty in Artificial Intelligence (UAI). [abstract]

Reweighting a distribution to minimize a distance to a target distribution is a powerful and flexible strategy for estimating a wide range of causal effects, but can be challenging in practice because optimal weights typically depend on knowledge of the underlying data generating process. In this paper, we focus on design-based weights, which do not incorporate outcome information; prominent examples include prospective cohort studies, survey weighting, and the weighting portion of augmented weighting estimators. In such applications, we explore the central role of representation learning in finding desirable weights in practice. Unlike the common approach of assuming a well-specified representation, we highlight the error due to the choice of a representation and outline a general framework for finding suitable representations that minimize this error. Building on recent work that combines balancing weights and neural networks, we propose an end-to-end estimation procedure that learns a flexible representation, while retaining promising theoretical properties. We show that this approach is competitive in a range of common causal inference tasks.

Continuous treatment effects with surrogate outcomes. Zhenghao Zeng, David Arbour, Avi Feller, Raghavendra Addanki, Ryan Rossi, Ritwik Sinha, and Edward Kennedy. International Conference on Machine Learning (ICML). [abstract; pre-print]

In many real-world causal inference applications, the primary outcomes (labels) are often partially missing, especially if they are expensive or difficult to collect. If the missingness depends on covariates (i.e., missingness is not completely at random), analyses based on fully-observed samples alone may be biased. Incorporating surrogates, which are fully observed post-treatment variables related to the primary outcome, can improve estimation in this case. In this paper, we study the role of surrogates in estimating continuous treatment effects and propose a doubly robust method to efficiently incorporate surrogates in the analysis, which uses both labeled and unlabeled data and does not suffer from the above selection bias problem. Importantly, we establish asymptotic normality of the proposed estimator and show possible improvements on the variance compared with methods that solely use labeled data. Extensive simulations show our methods enjoy appealing empirical performance.

Addressing missing data due to COVID-19: Two early childhood case studies. Avi Feller, Maia Connors, Christina Weiland, John Easton, Stacy Ehrlich Loewe, John Francis, Sarah Kabourek, Diana Leyva, Anna Shapiro, and Gloria Yeomans-Maldonado. Journal of Research on Educational Effectiveness. Accepted. [abstract; pre-print]

One part of COVID-19’s staggering impact on education has been to suspend or fundamentally alter ongoing education research projects. This paper addresses how to analyze the simple but fundamental example of a multi-cohort study in which student assessment data for the final cohort are missing because schools were closed, learning was virtual, and/or assessments were canceled or inconsistently collected due to COVID-19. We argue that current best-practice recommendations for addressing missing data may fall short in such studies because the assumptions that underpin these recommendations are violated. We then provide a new, simple decision-making framework for empirical researchers facing this situation and provide two empirical examples of how to apply this framework drawn from early childhood studies, one a cluster randomized trial and the other a descriptive longitudinal study. Based on this framework and the assumptions required to address the missing data, we advise against the standard recommendation of adjusting for missing outcomes (e.g., via imputation or weighting). Instead, changing the target quantity by restricting to fully-observed cohorts or by pivoting to focusing on an alternative outcome may be more appropriate.

Improving the estimation of site-specific effects and their distribution in multisite trials. JoonHo Lee, Jonathan Che, Sophia Rabe-Hesketh, Avi Feller, and Luke Miratrix. Journal of Educational and Behavioral Statistics. Accepted. [abstract; pre-print]

In multisite trials, statistical goals often include obtaining individual site-specific treatment effects, determining their rankings, and examining their distribution across multiple sites. This paper explores two strategies for improving inferences related to site-specific effects: (a) semiparametric modeling of the prior distribution using Dirichlet process mixture (DPM) models to relax the normality assumption, and (b) using estimators other than the posterior mean, such as the constrained Bayes or triple-goal estimators, to summarize the posterior. We conduct a large-scale simulation study, calibrated to multisite trials common in education research. We then explore the conditions and degrees to which these strategies and their combinations succeed or falter in limited-data environments. We find that the average reliability of within-site effect estimates is crucial for determining effective estimation strategies. In settings with low-to-moderate data informativeness, flexible DPM models perform no better than the simple parametric Gaussian model coupled with a posterior summary method tailored to a specific inferential goal. DPM models outperform Gaussian models only in select high-information settings, indicating considerable sensitivity to the level of cross-site information available in the data. We discuss the implications of our findings for balancing trade-offs associated with shrinkage for the design and analysis of future multisite randomized experiments.


Temporal aggregation for the Synthetic Control Method. Liyang Sun, Eli Ben-Michael, and Avi Feller. AEA Papers & Proceedings 114: 1–5. [abstract; pre-print]

The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit with panel data. Two challenges arise with higher frequency data (e.g., monthly versus yearly): (1) achieving excellent pre-treatment fit is typically more challenging; and (2) overfitting to noise is more likely. Aggregating data over time can mitigate these problems but can also destroy important signal. In this paper, we bound the bias for SCM with disaggregated and aggregated outcomes and give conditions under which aggregating tightens the bounds. We then propose finding weights that balance both disaggregated and aggregated series.

Randomization tests for peer effects in group formation experiments. Guillaume Basse, Peng Ding, Avi Feller, and Panos Toulis. Econometrica. 92(2): 567–590. [abstract; published; pre-print]

Measuring the effect of peers on individuals' outcomes is a challenging problem, in part because individuals often select peers who are similar in both observable and unobservable ways. Group formation experiments avoid this problem by randomly assigning individuals to groups and observing their responses; for example, do first‐year students have better grades when they are randomly assigned roommates who have stronger academic backgrounds? In this paper, we propose randomization‐based permutation tests for group formation experiments, extending classical Fisher Randomization Tests to this setting. The proposed tests are justified by the randomization itself, require relatively few assumptions, and are exact in finite samples. This approach can also complement existing strategies, such as linear‐in‐means models, by using a regression coefficient as the test statistic. We apply the proposed tests to two recent group formation experiments.
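For readers unfamiliar with the classical Fisher Randomization Test that this paper extends, the following is a minimal sketch for a completely randomized two-arm experiment. It is not the group formation procedure from the paper; the function name and the toy outcomes are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def fisher_randomization_test(outcomes, treated_idx):
    """Exact two-sided p-value for the sharp null of no effect on any unit.

    Assumes a completely randomized design in which every assignment with
    the same number of treated units is equally likely.
    """
    n = len(outcomes)
    n_treated = len(treated_idx)

    def diff_in_means(idx):
        idx = set(idx)
        treated = [outcomes[i] for i in idx]
        control = [outcomes[i] for i in range(n) if i not in idx]
        return sum(treated) / len(treated) - sum(control) / len(control)

    observed = abs(diff_in_means(treated_idx))
    # Under the sharp null the outcomes are fixed, so the test statistic can
    # be recomputed for every assignment the design could have produced.
    draws = [abs(diff_in_means(idx)) for idx in combinations(range(n), n_treated)]
    # Small tolerance so that ties with the observed value are counted.
    return sum(d >= observed - 1e-12 for d in draws) / len(draws)


# Hypothetical data: six units, the first three treated.
outcomes = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
treated_idx = (0, 1, 2)
p_value = fisher_randomization_test(outcomes, treated_idx)
print(p_value)
```

With these toy numbers, only two of the 20 possible assignments give a difference in means as extreme as the observed one, so the p-value is 2/20. Enumerating all assignments is feasible only for small experiments; larger ones would typically sample assignments at random instead.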


Interpretable sensitivity analysis for balancing weights. Dan Soriano, Eli Ben-Michael, Peter Bickel, Avi Feller, and Sam Pimentel. Journal of the Royal Statistical Society (Series A). 186(4): 707–721. [abstract; published; pre-print]

Assessing sensitivity to unmeasured confounding is an important step in observational studies, which typically estimate effects under the assumption that all confounders are measured. In this paper, we develop a sensitivity analysis framework for balancing weights estimators, an increasingly popular approach that solves an optimization problem to obtain weights that directly minimize covariate imbalance. In particular, we adapt a sensitivity analysis framework using the percentile bootstrap for a broad class of balancing weights estimators. We prove that the percentile bootstrap procedure can, with only minor modifications, yield valid confidence intervals for causal effects under restrictions on the level of unmeasured confounding. We also propose an amplification (a mapping from a one-dimensional sensitivity analysis to a higher dimensional sensitivity analysis) to allow for interpretable sensitivity parameters in the balancing weights framework. We illustrate our method through extensive real data examples.

Variation in impacts of letters of recommendation on college admissions decisions: Approximate balancing weights for treatment effect heterogeneity in observational studies. Eli Ben-Michael, Avi Feller, and Jesse Rothstein. Annals of Applied Statistics. 17(4): 2843–2864. [abstract; published; pre-print]

In a pilot program during the 2016-17 admissions cycle, the University of California, Berkeley invited many applicants for freshman admission to submit letters of recommendation. We use this pilot as the basis for an observational study of the impact of submitting letters of recommendation on subsequent admission, with the goal of estimating how impacts vary across pre-defined subgroups. Understanding this variation is challenging in observational studies, however, because estimated impacts reflect both actual treatment effect variation and differences in covariate balance across groups. To address this, we develop balancing weights that directly optimize for "local balance" within subgroups while maintaining global covariate balance between treated and control units. We then show that this approach has a dual representation as a form of inverse propensity score weighting with a hierarchical propensity score model. In the UC Berkeley pilot study, our proposed approach yields excellent local and global balance, unlike more traditional weighting methods, which fail to balance covariates within subgroups. We find that the impact of letters of recommendation increases with the predicted probability of admission, with mixed evidence of differences for under-represented minority applicants.

Is it who you are or where you are? Accounting for compositional differences in cross-site treatment effect variation. Benjamin Lu, Eli Ben-Michael, Avi Feller, and Luke Miratrix. Journal of Educational and Behavioral Statistics. 48(4): 420-453. [abstract; published; pre-print]

In multisite trials, learning about treatment effect variation across sites is critical for understanding where and for whom a program works. Unadjusted comparisons, however, capture "compositional" differences in the distributions of unit-level features as well as "contextual" differences in site-level features, including possible differences in program implementation. Our goal in this paper is to adjust site-level estimates for differences in the distribution of observed unit-level features by re-weighting (or "transporting") each site to a common distribution. This allows us to make "apples to apples" comparisons across sites, parcelling out the amount of cross-site effect variation explained by systematic differences in populations served. To do so we develop a framework for transporting effects using approximate balancing weights, where the weights are chosen to directly optimize unit-level covariate balance between each site and the target distribution. We first develop our approach for the general setting of transporting the effect of a single-site trial. We then extend our method to multisite trials, assess its performance via simulation, and use it to analyze a series of multisite trials of adult education and vocational training programs. In our application, we find that distributional differences are potentially masking cross-site variation. Our method is available in the balancer R package.

Using supervised learning to estimate inequality in the size and persistence of income shocks. David Bruns-Smith, Avi Feller, and Emi Nakamura. FAccT ’23: 2023 ACM Conference on Fairness, Accountability, and Transparency. [abstract; published]

Household responses to income shocks are important drivers of financial fragility, the evolution of wealth inequality, and the effectiveness of fiscal and monetary policy. Traditional approaches to measuring the size and persistence of income shocks are based on restrictive econometric models that impose strong homogeneity across households and over time. In this paper, we propose a more flexible, machine learning framework for estimating income shocks that allows for variation across all observable features and time horizons. First, we propose non-parametric estimands for shocks and shock persistence. We then show how to estimate these quantities by using off-the-shelf supervised learning tools to approximate the conditional expectation of future income given present information. We solve this income prediction problem in a large Icelandic administrative dataset, and then use the estimated shocks to document several features of labor income risk in Iceland that are not captured by standard economic income models.

Balancing weights for causal inference. Eric R. Cohn, Eli Ben-Michael, Avi Feller, and José Zubizarreta. Handbook of Matching and Weighting Adjustments for Causal Inference (Chapman and Hall/CRC). [abstract; published]

In this chapter, the authors introduce the balancing approach to weighting for covariate balance and causal inference. They begin by providing a framework for causal inference in observational studies, including typical assumptions necessary for the identification of average treatment effects. The authors motivate the task of finding weights that balance covariates and unify a variety of methods from the literature. They discuss several implementation and design choices for finding balancing weights in practice and examine the trade-offs of these choices using an example from the canonical LaLonde data. An alternative approach for IPW weights instead directly targets the balancing property by seeking weights that balance covariates in the sample at hand. Arguably, the most important design choice in the balancing approach is the choice of the model class M over which imbalance is minimized. A common approach to estimating treatment effects in observational studies involves least squares linear regression.

Multilevel calibration weighting for survey data. Eli Ben-Michael, Avi Feller, and Erin Hartman. Political Analysis. 1–19. [abstract; published; pre-print]

In the November 2016 U.S. presidential election, many state-level public opinion polls, particularly in the Upper Midwest, incorrectly predicted the winning candidate. One leading explanation for this polling miss is that the precipitous decline in traditional polling response rates led to greater reliance on statistical methods to adjust for the corresponding bias—and that these methods failed to adjust for important interactions between key variables like educational attainment, race, and geographic region. Finding calibration weights that account for important interactions remains challenging with traditional survey methods: raking typically balances the margins alone, while post-stratification, which exactly balances all interactions, is only feasible for a small number of variables. In this paper, we propose multilevel calibration weighting, which enforces tight balance constraints for marginal balance and looser constraints for higher-order interactions. This incorporates some of the benefits of post-stratification while retaining the guarantees of raking. We then correct for the bias due to the relaxed constraints via a flexible outcome model; we call this approach "double regression with post-stratification." We use these tools to re-assess a large-scale survey of voter intention in the 2016 U.S. presidential election, finding meaningful gains from the proposed methods. The approach is available in the multical R package.
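The contrast between raking and post-stratification can be seen in a two-variable toy example. The sketch below is not the multilevel calibration method or the multical package; the function names, the cell labels, and all counts and shares are hypothetical.

```python
def cell_shares(counts):
    """Convert cell counts into shares that sum to one."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def poststratification_weights(sample_counts, population_shares):
    """Weight each cell so its weighted share equals its population share."""
    sample = cell_shares(sample_counts)
    return {cell: population_shares[cell] / sample[cell] for cell in sample}


def raking_weights(sample_counts, population_shares, n_iter=200):
    """Iterative proportional fitting: match each margin in turn."""
    sample = cell_shares(sample_counts)
    weighted = dict(sample)
    for _ in range(n_iter):
        for axis in (0, 1):  # 0 = education margin, 1 = region margin
            target, current = {}, {}
            for cell in weighted:
                level = cell[axis]
                target[level] = target.get(level, 0.0) + population_shares[cell]
                current[level] = current.get(level, 0.0) + weighted[cell]
            weighted = {
                cell: share * target[cell[axis]] / current[cell[axis]]
                for cell, share in weighted.items()
            }
    return {cell: weighted[cell] / sample[cell] for cell in sample}


# Hypothetical survey: education and region are strongly associated in the
# sample (odds ratio 16) but independent in the population.
sample_counts = {
    ("no_college", "A"): 40, ("no_college", "B"): 10,
    ("college", "A"): 10, ("college", "B"): 40,
}
population_shares = {
    ("no_college", "A"): 0.3, ("no_college", "B"): 0.3,
    ("college", "A"): 0.2, ("college", "B"): 0.2,
}

sample = cell_shares(sample_counts)
rake_w = raking_weights(sample_counts, population_shares)
post_w = poststratification_weights(sample_counts, population_shares)
raked_shares = {cell: rake_w[cell] * sample[cell] for cell in sample}
for cell in sample:
    print(cell, round(raked_shares[cell], 3), population_shares[cell])
```

In this toy example the raked weights reproduce both margins but leave the weighted share of the ("no_college", "A") cell far from its population value, because raking preserves the sample's interaction between the two variables. Post-stratification matches every cell, which becomes infeasible once the number of cells grows large relative to the sample.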

Estimating the effects of a California gun control program with Multitask Gaussian Processes. Eli Ben-Michael, David Arbour, Avi Feller, Alex Franks, and Steven Raphael. Annals of Applied Statistics. 17(2): 985-1016. [abstract; published; pre-print]

Gun violence is a critical public safety concern in the United States. In 2006, California implemented a unique firearm monitoring program, the Armed and Prohibited Persons System (APPS), to address gun violence in the state. The APPS program first identifies those firearm owners who become prohibited from owning one, due to federal or state law, then confiscates their firearms. Our goal is to assess the effect of APPS on California murder rates using annual, state-level crime data across the U.S. for the years before and after the introduction of the program. To do so, we adapt a nonparametric Bayesian approach, multitask Gaussian processes (MTGPs), to the panel data setting. MTGPs allow for flexible and parsimonious panel data models that nest many existing approaches and allow for direct control over both dependence across time and dependence across units as well as natural uncertainty quantification. We extend this approach to incorporate non-Normal outcomes, auxiliary covariates, and multiple outcome series, which are all important in our application. We also show that this approach has attractive Frequentist properties, including a representation as a weighting estimator with separate weights over units and time periods. Applying this approach, we find that the increased monitoring and enforcement from the APPS program substantially decreased homicides in California. We also find that the effect on murder is driven entirely by declines in gun-related murder with no measurable effect on non-gun murder. Estimated costs per murder avoided are substantially lower than conventional estimates of the value of a statistical life, suggesting a very high benefit-cost ratio for this enforcement effort.

Hospital quality risk standardization via approximate balancing weights. Luke Keele, Eli Ben-Michael, Avi Feller, Rachel Kelz, and Luke Miratrix. Annals of Applied Statistics. 17(2): 901-928. [abstract; published; pre-print]

Comparing outcomes across hospitals, often to identify underperforming hospitals, is a critical task in health services research. However, naive comparisons of average outcomes, such as surgery complication rates, can be misleading because hospital case mixes differ—a hospital’s overall complication rate may be lower simply because the hospital serves a healthier population overall. In this paper we develop a method of "direct standardization" where we reweight each hospital patient population to be representative of the overall population and then compare the weighted averages across hospitals. Adapting methods from survey sampling and causal inference, we find weights that directly control for imbalance between the hospital patient mix and the target population, even across many patient attributes. Critically, these balancing weights can also be tuned to preserve sample size for more precise estimates. We also derive principled measures of statistical uncertainty and use outcome modeling and Bayesian shrinkage to increase precision and account for variation in hospital size. We demonstrate these methods using claims data from Pennsylvania, Florida, and New York, estimating standardized hospital complication rates for general surgery patients. We conclude with a discussion of how to detect low performing hospitals.
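The case-mix problem and the idea of direct standardization can be illustrated with a single risk factor. This is a minimal sketch using exact stratum weights, which is the simplest special case and not the approximate balancing weights developed in the paper; the function name and all patient data are hypothetical.

```python
def directly_standardized_rate(hospital_patients, population_shares):
    """Reweight a hospital's patients so its case mix matches the population.

    hospital_patients: list of (stratum, outcome) pairs, outcome 1 = complication.
    population_shares: stratum -> share in the overall patient population.
    Assumes every population stratum appears at least once in the hospital.
    """
    n = len(hospital_patients)
    counts = {}
    for stratum, _ in hospital_patients:
        counts[stratum] = counts.get(stratum, 0) + 1
    # Each patient's weight is population share over hospital share of the stratum.
    weights = [population_shares[s] / (counts[s] / n) for s, _ in hospital_patients]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, hospital_patients))
    return weighted_sum / sum(weights)


def make_patients(n_low, low_events, n_high, high_events):
    """Build a hypothetical patient list from counts per risk stratum."""
    low = [("low", 1)] * low_events + [("low", 0)] * (n_low - low_events)
    high = [("high", 1)] * high_events + [("high", 0)] * (n_high - high_events)
    return low + high


# Both hospitals have the same complication rate within each stratum
# (5% low risk, 20% high risk) but very different case mixes.
hospital_a = make_patients(n_low=80, low_events=4, n_high=20, high_events=4)
hospital_b = make_patients(n_low=20, low_events=1, n_high=80, high_events=16)
population_shares = {"low": 0.5, "high": 0.5}

raw_a = sum(y for _, y in hospital_a) / len(hospital_a)
raw_b = sum(y for _, y in hospital_b) / len(hospital_b)
std_a = directly_standardized_rate(hospital_a, population_shares)
std_b = directly_standardized_rate(hospital_b, population_shares)
print(raw_a, raw_b, std_a, std_b)
```

The raw rates differ (8% versus 17%) only because hospital B treats more high-risk patients; after standardizing to a common case mix the two rates coincide. Exact stratification like this breaks down with many patient attributes, which is the setting the abstract's balancing weights are meant to address.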


Weak Separation in Mixture Models and Implications for Principal Stratification. Nhat Ho, Avi Feller, Evan Greif, Luke Miratrix, and Natesh Pillai. AISTATS: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics. PMLR 151:5416-5458. [abstract; published; pre-print]

Principal stratification is a popular framework for addressing post-randomization complications, often in conjunction with finite mixture models for estimating the causal effects of interest. Unfortunately, standard estimators of mixture parameters, like the MLE, are known to exhibit pathological behavior. We study this behavior in a simple but fundamental example, a two-component Gaussian mixture model in which only the component means and variances are unknown, and focus on the setting in which the components are weakly separated. In this case, we show that the asymptotic convergence rate of the MLE is quite poor, such as O(n^{-1/6}) or even O(n^{-1/8}). We then demonstrate via both theoretical arguments and extensive simulations that the MLE behaves like a threshold estimator in finite samples, in the sense that the MLE can give strong evidence that the means are equal when the truth is otherwise. We also explore the behavior of the MLE when the MLE is non-zero, showing that it is difficult to estimate both the sign and magnitude of the means in this case. We provide diagnostics for all of these pathologies and apply these ideas to re-analyzing two randomized evaluations of job training programs, JOBS II and Job Corps. Our results suggest that the corresponding maximum likelihood estimates should be interpreted with caution in these cases.

Outcome Assumptions and Duality Theory for Balancing Weights. David Bruns-Smith and Avi Feller. AISTATS: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. PMLR 151:11037-11055. [abstract; published; pre-print]

We study balancing weight estimators, which reweight outcomes from a source population to estimate missing outcomes in a target population. These estimators minimize the worst-case error by making an assumption about the outcome model. In this paper, we show that this outcome assumption has two immediate implications. First, we can replace the minimax optimization problem for balancing weights with a simple convex loss over the assumed outcome function class. Second, we can replace the commonly-made overlap assumption with a more appropriate quantitative measure, the minimum worst-case bias. Finally, we show conditions under which the weights remain robust when our assumptions on the outcomes are wrong.

Problems with evidence assessment in COVID-19 health policy impact evaluation: a systematic review of study design and evidence strength. Noah Haber, Emma Clark-Deelder, Avi Feller, et al. BMJ Open. 12(1), e053820. [abstract; published]

Introduction. Assessing the impact of COVID-19 policy is critical for informing future policies. However, there are concerns about the overall strength of COVID-19 impact evaluation studies given the circumstances for evaluation and concerns about the publication environment.
Methods. We included studies that were primarily designed to estimate the quantitative impact of one or more implemented COVID-19 policies on direct SARS-CoV-2 and COVID-19 outcomes. After searching PubMed for peer-reviewed articles published on 26 November 2020 or earlier and screening, all studies were reviewed by three reviewers first independently and then to consensus. The review tool was based on previously developed and released review guidance for COVID-19 policy impact evaluation.
Results. After 102 articles were identified as potentially meeting inclusion criteria, we identified 36 published articles that evaluated the quantitative impact of COVID-19 policies on direct COVID-19 outcomes. Nine studies were set aside because the study design was considered inappropriate for COVID-19 policy impact evaluation (n=8 pre/post; n=1 cross-sectional), and 27 articles were given a full consensus assessment. 20/27 met criteria for graphical display of data, 5/27 for functional form, 19/27 for timing between policy implementation and impact, and only 3/27 for concurrent changes to the outcomes. Only 4/27 were rated as overall appropriate. Including the 9 studies set aside, reviewers found that only four of the 36 identified published and peer-reviewed health policy impact evaluation studies passed a set of key design checks for identifying the causal impact of policies on COVID-19 outcomes.
Discussion. The reviewed literature directly evaluating the impact of COVID-19 policies largely failed to meet key design criteria for inference of sufficient rigour to be actionable by policy-makers. More reliable evidence review is needed to both identify and produce policy-actionable evidence, alongside the recognition that actionable evidence is often unlikely to be feasible.

A graph-theoretic approach to randomization tests of causal effects under general interference. David Puelz, Guillaume Basse, Avi Feller, and Panos Toulis. Journal of the Royal Statistical Society (Series B). 84: 174-204. [abstract; published; pre-print]

Measuring the effect of peers on individual outcomes is a challenging problem, in part because individuals often select peers who are similar in both observable and unobservable ways. Group formation experiments avoid this problem by randomly assigning individuals to groups and observing their responses; for example, do first-year students have better grades when they are randomly assigned roommates who have stronger academic backgrounds? Standard approaches for analyzing these experiments, however, are heavily model-dependent and generally fail to exploit the randomized design. In this paper, we extend methods from randomization-based testing under interference to group formation experiments. The proposed tests are justified by the randomization itself, require relatively few assumptions, and are exact in finite samples. First, we develop procedures that yield valid tests for arbitrary group formation designs. Second, we derive sufficient conditions on the design such that the randomization test can be implemented via simple random permutations. We apply this approach to two recent group formation experiments.


Synthetic controls under staggered adoption. Eli Ben-Michael, Avi Feller, and Jesse Rothstein. Journal of the Royal Statistical Society (Series B). 84: 351–381. [abstract; published; pre-print]

Staggered adoption of policies by different units at different times creates promising opportunities for observational causal inference. Estimation remains challenging, however, and common regression methods can give misleading results. A promising alternative is the synthetic control method (SCM), which finds a weighted average of control units that closely balances the treated unit's pre-treatment outcomes. In this paper, we generalize SCM, originally designed to study a single treated unit, to the staggered adoption setting. We first bound the error for the average effect and show that it depends on both the imbalance for each treated unit separately and the imbalance for the average of the treated units. We then propose "partially pooled" SCM weights to minimize a weighted combination of these measures; approaches that focus only on balancing one of the two components can lead to bias. We extend this approach to incorporate unit-level intercept shifts and auxiliary covariates. We assess the performance of the proposed method via extensive simulations and apply our results to the question of whether teacher collective bargaining leads to higher school spending, finding minimal impacts. We implement the proposed method in the augsynth R package.
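As a rough illustration of the basic synthetic control idea described above (a weighted average of control units that balances one treated unit's pre-treatment outcomes), the sketch below finds non-negative weights that sum to one by projected gradient descent. It is not the partially pooled estimator and not the augsynth implementation; the function names, the fixed step size, and the toy data are assumptions made for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold from the last index that passes
    return [max(x - theta, 0.0) for x in v]


def scm_weights(treated_pre, controls_pre, steps=5000):
    """Minimize squared pre-treatment imbalance over simplex weights.

    controls_pre[j][t] is the pre-treatment outcome of control j at time t.
    """
    n_controls = len(controls_pre)
    n_periods = len(treated_pre)
    # Step size 1/L, with L bounded above by the sum of squared control outcomes.
    lipschitz_bound = sum(x * x for row in controls_pre for x in row)
    step = 1.0 / lipschitz_bound
    w = [1.0 / n_controls] * n_controls
    for _ in range(steps):
        synthetic = [sum(w[j] * controls_pre[j][t] for j in range(n_controls))
                     for t in range(n_periods)]
        residual = [synthetic[t] - treated_pre[t] for t in range(n_periods)]
        grad = [sum(controls_pre[j][t] * residual[t] for t in range(n_periods))
                for j in range(n_controls)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(n_controls)])
    return w


# Hypothetical toy data: the treated unit is exactly the average of the first two controls.
treated_pre = [1.0, 2.0, 3.0]
controls_pre = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [5.0, 5.0, 5.0]]
weights = scm_weights(treated_pre, controls_pre)
```

In this toy case the weights should concentrate on the first two controls. With real data the pre-treatment fit is typically imperfect, which is the setting the imbalance measures in the abstract are meant to address.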

Challenges with evaluating education policy using panel data during and after the COVID-19 pandemic. Avi Feller and Elizabeth Stuart. Journal of Research on Educational Effectiveness. 2021. 14(3): 668–675. [abstract; published]

Panel data methods, which include difference-in-differences and comparative interrupted time series, have become increasingly common in education policy research. The key idea is to use variation across time and space (e.g., school districts) to estimate the effects of policy or programmatic changes that happen in some localities but not others. In this commentary we highlight some specific challenges for panel or longitudinal studies of K-12 education interventions during and following the COVID-19 pandemic. Our goal is to help researchers think through the underlying issues and assumptions, and to help consumers of those studies assess their validity.

The Augmented Synthetic Control Method. Eli Ben-Michael, Avi Feller, and Jesse Rothstein. Journal of the American Statistical Association. 116(536): 1789–1803. [abstract; published; pre-print]

The synthetic control method (SCM) is a popular approach for estimating the impact of a treatment on a single unit in panel data settings. The "synthetic control" is a weighted average of control units that balances the treated unit's pre-treatment outcomes as closely as possible. A critical feature of the original proposal is to use SCM only when the fit on pre-treatment outcomes is excellent. We propose Augmented SCM to extend SCM to settings where such pre-treatment fit is infeasible. Analogous to bias correction for inexact matching, Augmented SCM uses an outcome model to estimate the bias due to imperfect pre-treatment fit and then de-biases the original SCM estimate. Our main proposal, which uses ridge regression as the outcome model, directly controls pre-treatment fit while minimizing extrapolation from the convex hull. We bound the bias of this approach under a linear factor model and show how regularization helps to avoid over-fitting to noise. We demonstrate gains from Augmented SCM with extensive simulation studies and apply this framework to estimate the impact of the 2012 Kansas tax cuts on economic growth. We implement the proposed method in the new augsynth R package.

Impact Evaluation of Coronavirus Disease 2019 Policy: A Guide to Common Design Issues. Noah Haber, Emma Clarke-Deelder, Joshua Salomon, Avi Feller, and Elizabeth Stuart. American Journal of Epidemiology. 190(11): 2474-2486. [abstract; published; pre-print]

Policy responses to COVID-19, particularly those related to non-pharmaceutical interventions, are unprecedented in scale and scope. Researchers and policymakers are striving to understand the impact of these policies on a variety of outcomes. Policy impact evaluations always require a complex combination of circumstance, study design, data, statistics, and analysis. Beyond the issues that are faced for any policy, evaluation of COVID-19 policies is complicated by additional challenges related to infectious disease dynamics and lags, lack of direct observation of key outcomes, and a multiplicity of interventions occurring on an accelerated time scale. In this paper, we (1) introduce the basic suite of policy impact evaluation designs for observational data, including cross-sectional analyses, pre/post, interrupted time-series, and difference-in-differences analysis, (2) demonstrate key ways in which the requirements and assumptions underlying these designs are often violated in the context of COVID-19, and (3) provide decision-makers and reviewers a conceptual and graphical guide to identifying these key violations. The overall goal of this paper is to help policy-makers, journal editors, journalists, researchers, and other research consumers understand and weigh the strengths and limitations of evidence that is essential to decision-making.

A trial emulation approach for policy evaluations with group-level longitudinal data. Eli Ben-Michael, Avi Feller, and Elizabeth Stuart. Epidemiology. 32(4): 533–540. [abstract; published; pre-print]

To limit the spread of the novel coronavirus, governments across the world implemented extraordinary physical distancing policies, such as stay-at-home orders, and numerous studies aim to estimate their effects. Many statistical and econometric methods, such as difference-in-differences, leverage repeated measurements and variation in timing to estimate policy effects, including in the COVID-19 context. While these methods are less common in epidemiology, epidemiologic researchers are well accustomed to handling similar complexities in studies of individual-level interventions. "Target trial emulation" emphasizes the need to carefully design a non-experimental study in terms of inclusion and exclusion criteria, covariates, exposure definition, and outcome measurement -- and the timing of those variables. We argue that policy evaluations using group-level longitudinal ("panel") data need to take a similar careful approach to study design, which we refer to as "policy trial emulation." This is especially important when intervention timing varies across jurisdictions; the main idea is to construct target trials separately for each "treatment cohort" (states that implement the policy at the same time) and then aggregate. We present a stylized analysis of the impact of state-level stay-at-home orders on total coronavirus cases. We argue that estimates from panel methods -- with the right data and careful modeling and diagnostics -- can help add to our understanding of many policies, though doing so is often challenging.

Overlap in observational studies with high-dimensional covariates. Alex D’Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Journal of Econometrics. 221: 644-654. [abstract; published]

Causal inference in observational settings typically rests on a pair of identifying assumptions: (1) unconfoundedness and (2) covariate overlap, also known as positivity or common support. Investigators often argue that unconfoundedness is more plausible when many covariates are included in the analysis. Less discussed is the fact that covariate overlap is more difficult to satisfy in this setting. In this paper, we explore the implications of overlap in high-dimensional observational studies, arguing that this assumption is stronger than investigators likely realize. Our main innovation is to frame (strict) overlap in terms of bounds on a likelihood ratio, which allows us to leverage and expand on existing results from information theory. In particular, we show that strict overlap bounds discriminating information (e.g., Kullback-Leibler divergence) between the covariate distributions in the treated and control populations. We use these results to derive explicit bounds on the average imbalance in covariate means under strict overlap and a range of assumptions on the covariate distributions. Importantly, these bounds grow tighter as the dimension grows large, and converge to zero in some cases. We examine how restrictions on the treatment assignment and outcome processes can weaken the implications of certain overlap assumptions, but at the cost of stronger requirements for unconfoundedness. Taken together, our results suggest that adjusting for high-dimensional covariates does not necessarily make causal identification more plausible.
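The likelihood-ratio framing of strict overlap follows from Bayes' rule. The notation below is assumed for illustration and may differ from the paper's: $e(x) = P(T = 1 \mid X = x)$ is the propensity score, $\pi = P(T = 1)$ is the marginal treatment probability, and $\eta \in (0, 1/2]$ is the strict overlap bound.

```latex
\[
  \frac{p(x \mid T = 1)}{p(x \mid T = 0)}
  = \frac{e(x)}{1 - e(x)} \cdot \frac{1 - \pi}{\pi},
  \qquad
  \eta \le e(x) \le 1 - \eta
  \;\Longrightarrow\;
  \frac{\eta}{1 - \eta} \cdot \frac{1 - \pi}{\pi}
  \;\le\;
  \frac{p(x \mid T = 1)}{p(x \mid T = 0)}
  \;\le\;
  \frac{1 - \eta}{\eta} \cdot \frac{1 - \pi}{\pi}.
\]
```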


Flexible sensitivity analysis for observational studies without observable implications. Alex Franks, Alex D’Amour, and Avi Feller. Journal of the American Statistical Association. 115(532): 1730-1746. [abstract; published; pre-print]

A fundamental challenge in observational causal inference is that assumptions about unconfoundedness are not testable from data. Assessing sensitivity to such assumptions is therefore important in practice. Unfortunately, some existing sensitivity analysis approaches inadvertently impose restrictions that are at odds with modern causal inference methods, which emphasize flexible models for observed data. To address this issue, we propose a framework that allows (1) flexible models for the observed data and (2) clean separation of the identified and unidentified parts of the sensitivity model. Our framework extends an approach from the missing data literature, known as Tukey’s factorization, to the causal inference setting. Under this factorization, we can represent the distributions of unobserved potential outcomes in terms of unidentified selection functions that posit a relationship between treatment assignment and unobserved potential outcomes. The sensitivity parameters in this framework are easily interpreted, and we provide heuristics for calibrating these parameters against observable quantities. We demonstrate the flexibility of this approach in two examples, where we estimate both average treatment effects and quantile treatment effects using Bayesian nonparametric models for the observed data.

Bayesian sensitivity analysis for offline policy evaluation. Jongbin Jung, Ravi Shroff, Avi Feller, and Sharad Goel. AIES: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 64–70. [abstract; published; pre-print]

On a variety of complex decision-making tasks, from doctors prescribing treatment to judges setting bail, machine learning algorithms have been shown to outperform expert human judgments. One complication, however, is that it is often difficult to anticipate the effects of algorithmic policies prior to deployment, as one generally cannot use historical data to directly observe what would have happened had the actions recommended by the algorithm been taken. A common strategy is to model potential outcomes for alternative decisions assuming that there are no unmeasured confounders (i.e., to assume ignorability). But if this ignorability assumption is violated, the predicted and actual effects of an algorithmic policy can diverge sharply. In this paper we present a flexible Bayesian approach to gauge the sensitivity of predicted policy outcomes to unmeasured confounders. In particular, and in contrast to past work, our modeling framework easily enables confounders to vary with the observed covariates. We demonstrate the efficacy of our method on a large dataset of judicial actions, in which one must decide whether defendants awaiting trial should be required to pay bail or can be released without payment.


Assessing treatment effect variation in observational studies: Results from a data challenge. Carlos Carvalho, Avi Feller, Jared Murray, Spencer Woody, and David Yeager. Observational Studies. 5: 21-35. [abstract; published; pre-print]

A growing number of methods aim to assess the challenging question of treatment effect variation in observational studies. This special section of "Observational Studies" reports the results of a workshop conducted at the 2018 Atlantic Causal Inference Conference designed to understand the similarities and differences across these methods. We invited eight groups of researchers to analyze a synthetic observational data set that was generated using a recent large-scale randomized trial in education. Overall, participants employed a diverse set of methods, ranging from matching and flexible outcome modeling to semiparametric estimation and ensemble approaches. While there was broad consensus on the topline estimate, there were also large differences in estimated treatment effect moderation. This highlights the fact that estimating varying treatment effects in observational studies is often more challenging than estimating the average treatment effect alone. We suggest several directions for future work arising from this workshop.

Identifying and estimating principal causal effects in a multi-site trial of Early College High Schools. Lo-Hua Yuan, Avi Feller, and Luke Miratrix. Annals of Applied Statistics. 13(3): 1348-1369. [abstract; published; pre-print]

Randomized trials are often conducted with separate randomizations across multiple sites such as schools, voting districts, or hospitals. These sites can differ in important ways, including the site’s implementation quality, local conditions, and the composition of individuals. An important question in practice is whether---and under what assumptions---researchers can leverage this cross-site variation to learn more about the intervention. We address these questions in the principal stratification framework, which describes causal effects for subgroups defined by post-treatment quantities. We show that researchers can estimate certain principal causal effects via the multi-site design if they are willing to impose the strong assumption that the site-specific effects are independent of the site-specific distribution of stratum membership. We motivate this approach with a multi-site trial of the Early College High School Initiative, a unique secondary education program with the goal of increasing high school graduation rates and college enrollment. Our analyses corroborate previous studies suggesting that the initiative had positive effects for students who would have otherwise attended a low-quality high school, although power is limited.

Exact conditional randomization tests for causal effects under interference. Guillaume Basse, Avi Feller, and Panos Toulis. Biometrika. 106(2): 487-494. [abstract; published; pre-print]

Many important causal questions involve interactions between units, also known as interference, such as interactions between individuals in households, students in schools, and firms in markets. Standard methods often break down in this setting. Permuting individual-level treatment assignments, for example, does not generally permute the treatment exposures of interest, such as spillovers, which depend on both the treatment assignment and the interference structure. One approach is to restrict the randomization test to a subset of units and assignments such that permuting the treatment assignment vector also permutes the treatment exposures, thus emulating the classical Fisher randomization test under no interference. Existing tests, however, can only leverage limited information in the structure of interference, which can lead to meaningful loss in power and introduce computational challenges. In this paper, we introduce the concept of a conditioning mechanism, which provides a framework for constructing valid and powerful randomization tests under general forms of interference. We describe our framework in the context of two-stage randomized designs and apply this approach to an analysis of a randomized evaluation of an intervention targeting student absenteeism in the School District of Philadelphia. We show meaningful improvements over existing methods, both in terms of computation and statistical power.
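For reference, the classical Fisher randomization test under no interference, which the abstract uses as its point of comparison, can be sketched as follows. This is only the textbook baseline for a completely randomized design, not the conditioning-mechanism procedure from the paper; the function name, the difference-in-means statistic, and the toy data are illustrative assumptions.

```python
import random


def fisher_randomization_test(outcomes, treatment, n_draws=2000, seed=0):
    """Monte Carlo p-value for the sharp null of no effect on any unit."""

    def diff_in_means(z):
        treated = [y for y, zi in zip(outcomes, z) if zi == 1]
        control = [y for y, zi in zip(outcomes, z) if zi == 0]
        return sum(treated) / len(treated) - sum(control) / len(control)

    rng = random.Random(seed)
    observed = abs(diff_in_means(treatment))
    permuted = list(treatment)
    extreme = 0
    for _ in range(n_draws):
        # Under the sharp null, outcomes are fixed; only the assignment varies.
        rng.shuffle(permuted)
        if abs(diff_in_means(permuted)) >= observed - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the observed assignment counts.
    return (extreme + 1) / (n_draws + 1)


# Hypothetical toy data: four treated units followed by four control units.
outcomes = [5.1, 4.8, 6.0, 5.5, 3.9, 4.1, 3.5, 4.0]
treatment = [1, 1, 1, 1, 0, 0, 0, 0]
p_value = fisher_randomization_test(outcomes, treatment)
```

Shuffling the assignment vector is valid here because each unit's exposure is just its own treatment. Under interference, as the abstract notes, permuting assignments does not generally permute the exposures of interest, so this simple recipe no longer applies directly.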

Decomposing treatment effect variation. Peng Ding, Avi Feller, and Luke Miratrix. Journal of the American Statistical Association. 114(525): 304-317. [abstract; published; pre-print]

Understanding and characterizing treatment effect variation in randomized experiments has become essential for going beyond the "black box" of the average treatment effect. Nonetheless, traditional statistical approaches often ignore or assume away such variation. In the context of a randomized experiment, this paper proposes a framework for decomposing overall treatment effect variation into a systematic component that is explained by observed covariates, and a remaining idiosyncratic component. Our framework is fully randomization-based, with estimates of treatment effect variation that are fully justified by the randomization itself. Our framework can also account for noncompliance, which is an important practical complication. We make several key contributions. First, we show that randomization-based estimates of systematic variation are very similar in form to estimates from fully-interacted linear regression and two stage least squares. Second, we use these estimators to develop an omnibus test for systematic treatment effect variation, both with and without noncompliance. Third, we propose an R²-like measure of treatment effect variation explained by covariates and, when applicable, noncompliance. Finally, we assess these methods via simulation studies and apply them to the Head Start Impact Study, a large-scale randomized experiment.


Analyzing two-stage experiments in the presence of interference. Guillaume Basse and Avi Feller. Journal of the American Statistical Association. 113(521): 41-55. [abstract; published; pre-print]

Two-stage randomization is a powerful design for estimating treatment effects in the presence of interference; that is, when one individual's treatment assignment affects another individual's outcomes. Our motivating example is a two-stage randomized trial evaluating an intervention to reduce student absenteeism in the School District of Philadelphia. In that experiment, households with multiple students were first assigned to treatment or control; then, in treated households, one student was randomly assigned to treatment. Using this example, we highlight key considerations for analyzing two-stage experiments in practice. Our first contribution is to address additional complexities that arise when household sizes vary; in this case, researchers must decide between assigning equal weight to households or equal weight to individuals. We propose unbiased estimators for a broad class of individual- and household-weighted estimands, with corresponding theoretical and estimated variances. Our second contribution is to connect two common approaches for analyzing two-stage designs: linear regression and randomization inference. We show that, with suitably chosen standard errors, these two approaches yield identical point and variance estimates, which is somewhat surprising given the complex randomization scheme. Finally, we explore options for incorporating covariates to improve precision. We confirm our analytic results via simulation studies and apply these methods to the attendance study, finding substantively meaningful spillover effects.
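The choice between household and individual weighting can be illustrated with a small calculation. The sketch below works with hypothetical, known unit-level effects purely to show how the two estimands differ when household sizes vary; it does not implement the paper's estimators, and the function names and numbers are assumptions.

```python
def household_weighted_average(effects_by_household):
    """Average of household means: every household counts equally."""
    household_means = [sum(h) / len(h) for h in effects_by_household]
    return sum(household_means) / len(household_means)


def individual_weighted_average(effects_by_household):
    """Pooled mean: every individual counts equally."""
    pooled = [e for h in effects_by_household for e in h]
    return sum(pooled) / len(pooled)


# Hypothetical effects: a two-student household and a four-student household.
effects_by_household = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]
hh_avg = household_weighted_average(effects_by_household)    # 2.5
ind_avg = individual_weighted_average(effects_by_household)  # 3.0
```

The two averages coincide when all households are the same size or when effects do not vary with household size; otherwise the individual-weighted version tilts toward larger households.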

Reducing absences at scale by targeting parents’ misbeliefs. Todd Rogers and Avi Feller. Nature Human Behaviour. 2(5): 335-342. [abstract; published; pre-print]

Student attendance is critical to educational success, and is increasingly the focus of educators, researchers, and policymakers. We report the first randomized experiment examining interventions targeting student absenteeism (N=28,080). Parents of high-risk, K-12 students received one of three personalized information treatments repeatedly throughout the school year. The most effective versions reduced chronic absenteeism by 10%, partly by correcting parents' biased beliefs about their students' total absences. The intervention reduced student absences comparably across grade levels, and reduced absences among untreated cohabiting students in treated households. This intervention is easy to scale and is more than an order of magnitude more cost effective than current absence-reduction best practices. Educational interventions that inform and empower parents, like those reported here, can complement more intensive student-focused absenteeism interventions.

Information, knowledge and attitudes: An Evaluation of the taxpayer receipt. Lucy Barnes, Avi Feller, Jake Haselswerdt, and Ethan Porter. Journal of Politics. 80(2): 701-706. [abstract; published; pre-print]
Media: Washington Post op-ed

To better understand the relationship between information and political knowledge, we evaluate an ambitious government initiative: the nationwide dissemination of "taxpayer receipts," or personalized, itemized accounts of government spending, by the UK government in Fall 2014. In coordination with the British tax authorities, we embedded a survey experiment in a nationally representative panel. We find that citizens became more knowledgeable about government spending because of our encouragement to read their receipt. Although baseline levels of political knowledge are indeed low, our findings indicate that individuals are capable of learning and retaining complex political information. However, even as citizens became more knowledgeable, we uncover no evidence that their attitudes toward government and redistribution changed concomitantly. The acquisition and retention of new information does not necessarily change attitudes. Our results have implications for citizens' capacity to learn and research on the relationship between knowledge and attitudes.

New Findings on Impact Variation from the Head Start Impact Study: Informing the Scale-up of Early Childhood Programs. Pamela Morris, Maia Connors, Allison Friedman-Krauss, Dana McCoy, Avi Feller, Lindsay Page, Howard Bloom, and Hirokazu Yoshikawa. AERA Open. 4(2): 1-16. [abstract; published]

This article synthesizes findings from a reanalysis of data from the Head Start Impact Study with a focus on impact variation. This study addressed whether the size of Head Start's impacts on children’s access to center-based and high-quality care and their school readiness skills varied by child characteristics, geographic location, and the experiences of children in the control group. Across multiple sets of analyses based on new, innovative statistical methods, findings suggest that the topline Head Start Impact Study results of Head Start's average impacts mask substantial variation in its effectiveness and that one key source of that variation was in the counterfactual experiences and the context of Head Start sites (as well as the more typically examined child characteristics; e.g., children's dual language learner status). Implications are discussed for the future of Head Start and further research, as well as the scale-up of other early childhood programs, policies, and practices.

The Millennium Villages Project: a retrospective observational end-line evaluation. Shira Mitchell, Andrew Gelman, Rebecca Ross, Joyce Chen, Sehrish Bari, Uyen Kim Huynh, Matthew Harris, Sonia Sachs, Elizabeth Stuart, Avi Feller, Susanna Makela, Alan Zaslavsky, Lucy McClellan, Seth Ohemeng-Dapaah, Patricia Namakula, Cheryl Palm, and Jeffrey Sachs. Lancet Global Health. 6(5): e500–e513. [abstract; published]

Background: The Millennium Villages Project (MVP) was a ten-year, multi-sector, rural development project implemented in ten sites in ten sub-Saharan African countries to achieve the Millennium Development Goals (MDGs). This paper summarizes statistical analyses of survey and on-site spending data for the end-line evaluation of the MVP, including estimates of (1) project impacts, (2) MDG and project-specified target attainment, and (3) on-site spending.
Methods: To estimate project impacts, we retrospectively selected comparison villages, matching the project villages on possible confounding variables. At end-line, we collected cross-sectional survey data in both the project and comparison villages. Using these data, as well as survey and on-site spending data collected in the project villages during implementation, we estimated: (1) project impacts as differences in outcomes between project and matched comparison villages; (2) target attainment as differences between project outcomes and pre-specified targets; and (3) on-site spending as reported expenditures by communities, donors, governments, and the project (on-site).
Findings: Averaged across the ten project sites, we found that: (1) impact estimates on 30 out of 40 outcomes met the conventional criterion of statistical significance (p-value < 0.05). On all 30, estimated impacts were favorable. We found particularly substantial impacts in agriculture and health, in which some project outcomes were roughly one standard deviation better than their comparisons. (2) One-third of the targets were met in the project sites. (3) Total on-site spending decreased from $132 per capita in the first half to $109 per capita in the second half of the project.
Interpretation: Of all the sectors, the strongest estimated impacts were on health outcomes, suggesting support for the project's health systems strengthening approach.


Bounding, an accessible method for estimating principal causal effects, examined and explained. Luke Miratrix, Jane Furey, Avi Feller, Todd Grindal, and Lindsay Page. Journal of Research on Educational Effectiveness. 11(1): 133-162. [abstract; published; pre-print]

Estimating treatment effects for subgroups defined by post-treatment behavior (i.e., estimating causal effects in a principal stratification framework) can be technically challenging and heavily reliant on strong assumptions. We investigate an alternate path: using bounds to identify ranges of possible effects that are consistent with the data. This simple approach relies on weak assumptions yet can result in policy-relevant findings. Further, covariates can be used to sharpen bounds, as we show. Via simulation, we demonstrate which types of covariates are maximally beneficial. We conclude with an analysis of a multi-site experimental study of Early College High Schools. When examining the program's impact on students completing the ninth grade "on-track" for college, we find little impact for ECHS students who would otherwise attend a high quality high school, but substantial effects for those who would not. This suggests potential benefit in expanding these programs in areas primarily served by lower quality schools.

Principal score methods: Assumptions, extensions, and practical considerations. Avi Feller, Fabrizia Mealli, and Luke Miratrix. Journal of Educational and Behavioral Statistics. 42(6): 726-758. [abstract; published; pre-print]

Researchers addressing post-treatment complications in randomized trials often turn to principal stratification to define relevant assumptions and quantities of interest. One approach for the subsequent estimation of causal effects in this framework is to use methods based on the "principal score," the conditional probability of belonging to a certain principal stratum given covariates. These methods typically assume that stratum membership is as good as randomly assigned given these covariates. We clarify the key assumption in this context, known as Principal Ignorability, and argue that versions of this assumption are quite strong in practice. We describe these concepts in terms of both one- and two-sided noncompliance and propose a novel approach for researchers to "mix and match" Principal Ignorability assumptions with alternative assumptions, such as the exclusion restriction. Finally, we apply these ideas to a randomized evaluation of a job training program and a randomized evaluation of an early childhood education program. Overall, applied researchers should acknowledge that principal score methods, while useful tools, rely on assumptions that are typically hard to justify in practice.

Algorithmic decision making and the cost of fairness. Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. KDD: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. [abstract; published]
Media: Washington Post op-ed

Algorithms are now regularly used to decide whether defendants awaiting trial are too dangerous to be released back into the community. In some cases, black defendants are substantially more likely than white defendants to be incorrectly classified as high risk. To mitigate such disparities, several techniques recently have been proposed to achieve algorithmic fairness. Here we reformulate algorithmic fairness as constrained optimization: the objective is to maximize public safety while satisfying formal fairness constraints designed to reduce racial disparities. We show that for several past definitions of fairness, the optimal algorithms that result require detaining defendants above race-specific risk thresholds. We further show that the optimal unconstrained algorithm requires applying a single, uniform threshold to all defendants. The unconstrained algorithm thus maximizes public safety while also satisfying one important understanding of equality: that all individuals are held to the same standard, irrespective of their race. Because the optimal constrained and unconstrained algorithms in general differ, reducing racial disparities is at odds with improving public safety. By examining data from Broward County, Florida, we show that this trade-off can be large in practice. We focus on the problem of designing algorithms for pretrial release decisions, but the principles we discuss apply to other domains, and also to human decision makers carrying out structured decision rules.


Compared to what? Variation in the impacts of early childhood education by alternative care type. Avi Feller, Todd Grindal, Luke Miratrix, and Lindsay Page. Annals of Applied Statistics. 2016. 10(3): 1245-1285. [abstract; published]
Media: Huffington Post op-ed; IRLE brief on Head Start research at UC Berkeley

Early childhood education research often compares a group of children who receive the intervention of interest to a group of children who receive care in a range of different care settings. In this paper, we estimate differential impacts of an early childhood intervention by alternative care setting, using data from the Head Start Impact Study, a large-scale randomized evaluation. To do so, we utilize a Bayesian principal stratification framework to estimate separate impacts for two types of Compliers: those children who would otherwise be in other center-based care when assigned to control and those who would otherwise be in home-based care. We find strong, positive short-term effects of Head Start on receptive vocabulary for those Compliers who would otherwise be in home-based care. By contrast, we find no meaningful impact of Head Start on vocabulary for those Compliers who would otherwise be in other center-based care. Our findings suggest that alternative care type is a potentially important source of variation in early childhood education interventions.

Randomization inference for treatment effect variation. Peng Ding, Avi Feller, and Luke Miratrix. Journal of the Royal Statistical Society (Series B). 78(3): 655-671. [abstract; published; pre-print]

Applied researchers are increasingly interested in whether and how treatment effects vary in randomized evaluations, especially variation that is not explained by observed covariates. We propose a model-free approach for testing for the presence of such unexplained variation. To use this randomization-based approach, we must address the fact that the average treatment effect, which is generally the object of interest in randomized experiments, actually acts as a nuisance parameter in this setting. We explore potential solutions and advocate for a method that guarantees valid tests in finite samples despite this nuisance. We also show how this method readily extends to testing for heterogeneity beyond a given model, which can be useful for assessing the sufficiency of a given scientific theory. We finally apply our method to the National Head Start impact study, which is a large-scale randomized evaluation of a Federal preschool programme, finding that there is indeed significant unexplained treatment effect variation.

Discouraged by peer excellence: Exposure to exemplary peer performance causes quitting. Todd Rogers and Avi Feller. Psychological Science. 27(3): 365-374. [abstract; published]
Media: NPR Hidden Brain/Science Friday, Education Week, Science Daily

People are exposed to exemplary peer performances often (and sometimes by design in interventions). In two studies, we showed that exposure to exemplary peer performances can undermine motivation and success by causing people to perceive that they cannot attain their peers’ high levels of performance. It also causes de-identification with the relevant domain. We examined such discouragement by peer excellence by exploiting the incidental exposure to peers’ abilities that occurs when students are asked to assess each other’s work. Study 1 was a natural experiment in a massive open online course that employed peer assessment (N = 5,740). Exposure to exemplary peer performances caused a large proportion of students to quit the course. Study 2 explored underlying psychological mechanisms in an online replication (N = 361). Discouragement by peer excellence has theoretical implications for work on social judgment, social comparison, and reference bias and has practical implications for interventions that induce social comparisons.

2015 and earlier

Principal Stratification: A tool for understanding variation in program effects across endogenous subgroups. Lindsay Page, Avi Feller, Todd Grindal, Luke Miratrix, and Marie-Andree Somers. American Journal of Evaluation. 2015. 36(4): 514-531. [abstract; published]

Increasingly, researchers are interested in questions regarding treatment-effect variation across partially or fully latent subgroups defined not by pretreatment characteristics but by postrandomization actions. One promising approach to address such questions is principal stratification. Under this framework, a researcher defines endogenous subgroups, or principal strata, based on post-randomization behaviors under both the observed and the counterfactual experimental conditions. These principal strata give structure to such research questions and provide a framework for determining estimation strategies to obtain desired effect estimates. This article provides a nontechnical primer to principal stratification. We review selected applications to highlight the breadth of substantive questions and methodological issues that this method can inform. We then discuss its relationship to instrumental variables analysis to address binary noncompliance in an experimental context and highlight how the framework can be generalized to handle more complex posttreatment patterns. We emphasize the counterfactual logic fundamental to principal stratification and the key assumptions that render analytic challenges more tractable. We briefly discuss technical aspects of estimation procedures, providing a short guide for interested readers.

Hierarchical models for causal effects. Avi Feller and Andrew Gelman. Emerging Trends in the Social and Behavioral Sciences. 2015. [abstract; published]

Hierarchical models play three important roles in modeling causal effects: (i) accounting for data collection, such as in stratified and split-plot experimental designs; (ii) adjusting for unmeasured covariates, such as in panel studies; and (iii) capturing treatment effect variation, such as in subgroup analyses. Across all three areas, hierarchical models, especially Bayesian hierarchical modeling, offer substantial benefits over classical, non-hierarchical approaches. After discussing each of these topics, we explore some recent developments in the use of hierarchical models for causal inference and conclude with some thoughts on new directions for this research area.

Genome-wide profiling of chromosome interactions in Plasmodium falciparum characterizes nuclear architecture and reconfigurations associated with antigenic variation. Jacob Lemieux, Sue Kyes, Thomas Otto, Avi Feller, Richard Eastman, Robert Pinches, Matthew Berriman, Xin-zhuan Su, and Chris Newbold. Molecular Microbiology. 2013. 90(3): 519–537. [abstract; published]

Spatial relationships within the eukaryotic nucleus are essential for proper nuclear function. In Plasmodium falciparum, the repositioning of chromosomes has been implicated in the regulation of the expression of genes responsible for antigenic variation, and the formation of a single, peri-nuclear nucleolus results in the clustering of rDNA. Nevertheless, the precise spatial relationships between chromosomes remain poorly understood, because, until recently, techniques with sufficient resolution have been lacking. Here we have used chromosome conformation capture and second-generation sequencing to study changes in chromosome folding and spatial positioning that occur during switches in var gene expression. We have generated maps of chromosomal spatial affinities within the P. falciparum nucleus at 25 Kb resolution, revealing a structured nucleolus, an absence of chromosome territories, and confirming previously identified clustering of heterochromatin foci. We show that switches in var gene expression do not appear to involve interaction with a distant enhancer, but do result in local changes at the active locus. These maps reveal the folding properties of malaria chromosomes, validate known physical associations, and characterize the global landscape of spatial interactions. Collectively, our data provide critical information for a better understanding of gene expression regulation and antigenic variation in malaria parasites.

Red state/blue state divisions in the 2012 presidential election. Avi Feller, Andrew Gelman, and Boris Shor. The Forum. 2013. 10(4): 127–131. [abstract; published]
New York Times version: “Red versus blue in a new light.” Nov. 12, 2012. With Andrew Gelman.

The so-called "red/blue paradox" is that rich individuals are more likely to vote Republican but rich states are more likely to support the Democrats. Previous research argued that this seeming paradox could be explained by comparing rich and poor voters within each state – the difference in the Republican vote share between rich and poor voters was much larger in low-income, conservative, middle-American states like Mississippi than in high-income, liberal, coastal states like Connecticut. We use exit poll and other survey data to assess whether this was still the case for the 2012 Presidential election. Based on this preliminary analysis, we find that, while the red/blue paradox is still strong, the explanation offered by Gelman et al. no longer appears to hold. We explore several empirical patterns from this election and suggest possible avenues for resolving the questions posed by the new data.

Genome wide adaptations of Plasmodium falciparum in response to lumefantrine selective drug pressure. Leah Mwai, Abdi Diriye, Victor Masseno, Steven Muriithi, Theresa Feltwell, Jennifer Musyoki, Jacob Lemieux, Avi Feller, Gunnar Mair, Kevin Marsh, Chris Newbold, Alexis Nzila, and Celine Carret. PLoS ONE. 2012. 7(2): e31623. [abstract; published]

The combination therapy of the Artemisinin-derivative Artemether (ART) with Lumefantrine (LM) (Coartem®) is an important malaria treatment regimen in many endemic countries. Resistance to Artemisinin has already been reported, and it is feared that LM resistance (LMR) could also evolve quickly. Therefore molecular markers which can be used to track Coartem® efficacy are urgently needed. Often, stable resistance arises from initial, unstable phenotypes that can be identified in vitro. Here we have used the Plasmodium falciparum multidrug resistant reference strain V1S to induce LMR in vitro by culturing the parasite under continuous drug pressure for 16 months. The initial IC50 (inhibitory concentration that kills 50% of the parasite population) was 24 nM. The resulting resistant strain V1SLM, obtained after culture for an estimated 166 cycles under LM pressure, grew steadily in 378 nM of LM, corresponding to 15 times the IC50 of the parental strain. However, after two weeks of culturing V1SLM in drug-free medium, the IC50 returned to that of the initial, parental strain V1S. This transient drug tolerance was associated with major changes in gene expression profiles: using the PFSANGER Affymetrix custom array, we identified 184 differentially expressed genes in V1SLM. Among those are 18 known and putative transporters including the multidrug resistance gene 1 (pfmdr1), the multidrug resistance associated protein and the V-type H+ pumping pyrophosphatase 2 (pfvp2) as well as genes associated with fatty acid metabolism. In addition we detected a clear selective advantage provided by two genomic loci in parasites grown under LM drug pressure, suggesting that all, or some of those genes contribute to development of LM tolerance; they may prove useful as molecular markers to monitor P. falciparum LM susceptibility.

In vivo profiles show continuous variation between 2 cellular populations. Jacob Lemieux*, Avi Feller*, Chris Holmes, and Chris Newbold (* indicates equal contribution). Proceedings of the National Academy of Sciences. 2009. 106 (27), E71–E72. [abstract; published]

In this reply, we argue that seemingly discrete variation in the authors' published gene expression profiles is actually a combination of continuous variation and technical biological errors in the original experiment.

Statistical estimation of cell-cycle progression and lineage commitment in Plasmodium falciparum reveals a homogeneous pattern of transcription in ex vivo culture. Jacob Lemieux*, Natalia Gomez-Escobar*, Avi Feller*, Celine Carret, Alfred Amambua-Ngwa, Robert Pinches, Felix Daya, Sue Kyes, David Conway, Chris Holmes, and Chris Newbold (* indicates equal contribution). Proceedings of the National Academy of Sciences. 2009. 106(18): 7559–7564. [abstract; published]

We have cultured Plasmodium falciparum directly from the blood of infected individuals to examine patterns of mature-stage gene expression in patient isolates. Analysis of the transcriptome of P. falciparum is complicated by the highly periodic nature of gene expression because small variations in the stage of parasite development between samples can lead to an apparent difference in gene expression values. To address this issue, we have developed statistical likelihood-based methods to estimate cell cycle progression and commitment to asexual or sexual development lineages in our samples based on microscopy and gene expression patterns. In cases subsequently matched for temporal development, we find that transcriptional patterns in ex vivo culture display little variation across patients with diverse clinical profiles and closely resemble transcriptional profiles that occur in vitro. These statistical methods, available to the research community, assist in the design and interpretation of P. falciparum expression profiling experiments where it is difficult to separate true differential expression from cell-cycle dependent expression. We reanalyze an existing dataset of in vivo patient expression profiles and conclude that previously observed discrete variation is consistent with the commitment of a varying proportion of the parasite population to the sexual development lineage.


Comment on ‘Causal Inference Using Invariant Prediction: Identification and Confidence Intervals’ by Peters, Buehlmann, and Meinshausen. Peng Ding and Avi Feller. Journal of the Royal Statistical Society (Series B). 2016. 78(5): 994-995. [published]

Comment on ‘How to find an appropriate clustering for mixed type variables with application to socio-economic stratification’ by Hennig and Liao. Avi Feller and Edo Airoldi. Journal of the Royal Statistical Society (Series C). 2013. 62(3): 347–348. [published]