Repository: ebenmichael/augsynth Branch: master Commit: 65c5a6f34f4e Files: 83 Total size: 373.9 KB Directory structure: gitextract_wwvwxodd/ ├── .Rbuildignore ├── .gitignore ├── .travis.yml ├── DESCRIPTION ├── LICENSE ├── NAMESPACE ├── R/ │ ├── augsynth.R │ ├── augsynth_pre.R │ ├── cv.R │ ├── data.R │ ├── eligible_donors.R │ ├── fit_synth.R │ ├── format.R │ ├── globalVariables.R │ ├── highdim.R │ ├── inference.R │ ├── multi_outcomes.R │ ├── multi_synth_qp.R │ ├── multisynth_class.R │ ├── outcome_models.R │ ├── outcome_multi.R │ ├── ridge.R │ ├── ridge_lambda.R │ └── time_regression_multi.R ├── README.md ├── data/ │ └── kansas.rda ├── data-raw/ │ ├── clean_kansas.R │ └── kansas_longer2.dta ├── man/ │ ├── augsynth-package.Rd │ ├── augsynth.Rd │ ├── augsynth_multiout.Rd │ ├── check_data_stag.Rd │ ├── conformal_inf.Rd │ ├── conformal_inf_linear.Rd │ ├── conformal_inf_multiout.Rd │ ├── get_nona_donors.Rd │ ├── jackknife_se_single.Rd │ ├── kansas.Rd │ ├── make_V_matrix.Rd │ ├── multisynth.Rd │ ├── plot.augsynth.Rd │ ├── plot.augsynth_multiout.Rd │ ├── plot.multisynth.Rd │ ├── plot.summary.augsynth.Rd │ ├── plot.summary.augsynth_multiout.Rd │ ├── plot.summary.multisynth.Rd │ ├── predict.augsynth.Rd │ ├── predict.augsynth_multiout.Rd │ ├── predict.multisynth.Rd │ ├── print.augsynth.Rd │ ├── print.augsynth_multiout.Rd │ ├── print.multisynth.Rd │ ├── print.summary.augsynth.Rd │ ├── print.summary.augsynth_multiout.Rd │ ├── print.summary.multisynth.Rd │ ├── rdirichlet_b.Rd │ ├── rmultinom_b.Rd │ ├── rwild_b.Rd │ ├── single_augsynth.Rd │ ├── summary.augsynth.Rd │ ├── summary.augsynth_multiout.Rd │ ├── summary.multisynth.Rd │ ├── time_jackknife_plus.Rd │ └── time_jackknife_plus_multiout.Rd ├── pkg.Rproj ├── tests/ │ ├── testthat/ │ │ ├── test_augsynth_pre.R │ │ ├── test_format.R │ │ ├── test_general.R │ │ ├── test_lambda.R │ │ ├── test_load_data.R │ │ ├── test_multiple_outcomes.R │ │ ├── test_multisynth.R │ │ ├── test_multisynth_covariates.R │ │ ├── test_outcome_models.R │ │ ├── test_time_cohort.R │ │ └── test_unbalanced_multisynth.R │ └── testthat.R └── vignettes/ ├── .gitignore ├── multi-outcomes-vignette.Rmd ├── multisynth-vignette.Rmd ├── multisynth-vignette.md ├── singlesynth-vignette.Rmd └── singlesynth-vignette.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .Rbuildignore ================================================ ^data-raw$ ^Meta$ ^doc$ ^\.travis\.yml$ ^pkg.Rproj$ figure$ cache$ ================================================ FILE: .gitignore ================================================ Meta doc inst/doc ## Files # Emacs autosave files *~ \#*# # Don't put data in the repo *.csv *.feather # R stuff *.Rout *.Rhistory *.RData *.Rapp.history # Mac stuff *.DS_store # C++ stuff *.o *.so *.dll test.R *-vignette.pdf ================================================ FILE: .travis.yml ================================================ # R for travis: see documentation at https://docs.travis-ci.com/user/languages/r language: r r: - 3.5.1 sudo: false cache: packages warnings_are_errors: false r_binary_packages: - dplyr - magrittr - ggplot2 - glmnet - plyr - kableExtra ================================================ FILE: DESCRIPTION ================================================ Package: augsynth Title: The Augmented Synthetic Control Method Version: 0.2.0 Authors@R: person("Eli", "Ben-Michael", email = "ebenmichael@berkeley.edu", role = c("aut", "cre")) Description: A package implementing the Augmented Synthetic Controls Method. Depends: R (>= 3.5.0) Imports: dplyr, tidyr, magrittr, ggplot2, MASS, LiblineaR, Formula, Matrix, osqp, rlang, purrr, FNN Remotes: susanathey/MCPanel License: MIT + file LICENSE Encoding: UTF-8 LazyData: true RoxygenNote: 7.2.3 Suggests: testthat, CausalImpact, keras, gsynth, knitr, rmarkdown, softImpute, MCPanel, glmnet, randomForest, kableExtra, ggrepel VignetteBuilder: knitr ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2018 Elijahu Ben-Michael Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: NAMESPACE ================================================ # Generated by roxygen2: do not edit by hand S3method(plot,augsynth) S3method(plot,augsynth_multiout) S3method(plot,multisynth) S3method(plot,summary.augsynth) S3method(plot,summary.augsynth_multiout) S3method(plot,summary.multisynth) S3method(predict,augsynth) S3method(predict,augsynth_multiout) S3method(predict,multisynth) S3method(print,augsynth) S3method(print,augsynth_multiout) S3method(print,multisynth) S3method(print,summary.augsynth) S3method(print,summary.augsynth_multiout) S3method(print,summary.multisynth) S3method(summary,augsynth) S3method(summary,augsynth_multiout) S3method(summary,multisynth) export(augsynth) export(augsynth_multiout) export(multisynth) export(rdirichlet_b) export(rmultinom_b) export(rwild_b) export(single_augsynth) import(dplyr) import(tidyr) importFrom(ggplot2,aes) importFrom(graphics,plot) importFrom(magrittr,"%>%") importFrom(purrr,reduce) importFrom(stats,coef) importFrom(stats,delete.response) importFrom(stats,formula) importFrom(stats,lm) importFrom(stats,model.frame) importFrom(stats,model.matrix) importFrom(stats,na.omit) importFrom(stats,poly) importFrom(stats,predict) importFrom(stats,sd) importFrom(stats,terms) importFrom(stats,update) importFrom(utils,capture.output) ================================================ FILE: R/augsynth.R ================================================ ################################################################################ ## Main functions for single-period treatment augmented synthetic controls Method ################################################################################ #' Fit Augmented SCM #' #' @param form outcome ~ treatment | auxillary covariates #' @param unit Name of unit column #' @param time Name of time column #' @param t_int Time of intervention #' @param data Panel data as dataframe #' @param progfunc What function to use to impute control outcomes #' ridge=Ridge regression (allows for standard errors), #' none=No outcome model, #' en=Elastic Net, RF=Random Forest, GSYN=gSynth, #' mcp=MCPanel, #' cits=Comparitive Interuppted Time Series #' causalimpact=Bayesian structural time series with CausalImpact #' @param scm Whether the SCM weighting function is used #' @param fixedeff Whether to include a unit fixed effect, default F #' @param cov_agg Covariate aggregation functions, if NULL then use mean with NAs omitted #' @param ... optional arguments for outcome model #' #' @return augsynth object that contains: #' \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' \item{"mhat"}{Outcome model estimate} #' \item{"data"}{Panel data as matrices} #' } #' @export single_augsynth <- function(form, unit, time, t_int, data, progfunc = "ridge", scm=T, fixedeff = FALSE, cov_agg=NULL, ...) { call_name <- match.call() form <- Formula::Formula(form) unit <- enquo(unit) time <- enquo(time) ## format data outcome <- terms(formula(form, rhs=1))[[2]] trt <- terms(formula(form, rhs=1))[[3]] wide <- format_data(outcome, trt, unit, time, t_int, data) synth_data <- do.call(format_synth, wide) treated_units <- data %>% filter(!!trt == 1) %>% distinct(!!unit) %>% pull(!!unit) control_units <- data %>% filter(!(!!unit %in% treated_units)) %>% distinct(!!unit) %>% arrange(!!unit) %>% pull(!!unit) ## add covariates if(length(form)[2] == 2) { Z <- extract_covariates(form, unit, time, t_int, data, cov_agg) } else { Z <- NULL } # fit augmented SCM augsynth <- fit_augsynth_internal(wide, synth_data, Z, progfunc, scm, fixedeff, ...) # add some extra data augsynth$data$time <- data %>% distinct(!!time) %>% arrange(!!time) %>% pull(!!time) augsynth$call <- call_name augsynth$t_int <- t_int augsynth$weights <- matrix(augsynth$weights) rownames(augsynth$weights) <- control_units return(augsynth) } #' Internal function to fit augmented SCM #' @param wide Data formatted from format_data #' @param synth_data Data formatted from foramt_synth #' @param Z Matrix of auxiliary covariates #' @param progfunc outcome model to use #' @param scm Whether to fit SCM #' @param fixedeff Whether to de-mean synth #' @param V V matrix for Synth, default NULL #' @param ... Extra args for outcome model #' #' @noRd #' fit_augsynth_internal <- function(wide, synth_data, Z, progfunc, scm, fixedeff, V = NULL, ...) { n <- nrow(wide$X) t0 <- ncol(wide$X) ttot <- t0 + ncol(wide$y) if(fixedeff) { demeaned <- demean_data(wide, synth_data) fit_wide <- demeaned$wide fit_synth_data <- demeaned$synth_data mhat <- demeaned$mhat } else { fit_wide <- wide fit_synth_data <- synth_data mhat <- matrix(0, n, ttot) } if (is.null(progfunc)) { progfunc = "none" } progfunc = tolower(progfunc) ## fit augsynth if(progfunc == "ridge") { # Ridge ASCM augsynth <- do.call(fit_ridgeaug_formatted, list(wide_data = fit_wide, synth_data = fit_synth_data, Z = Z, V = V, scm = scm, ...)) } else if(progfunc == "none") { ## Just SCM augsynth <- do.call(fit_ridgeaug_formatted, c(list(wide_data = fit_wide, synth_data = fit_synth_data, Z = Z, ridge = F, scm = T, V = V, ...))) } else { ## Other outcome models progfuncs = c("ridge", "none", "en", "rf", "gsyn", "mcp", "cits", "causalimpact", "seq2seq") if (progfunc %in% progfuncs) { augsynth <- fit_augsyn(fit_wide, fit_synth_data, progfunc, scm, ...) } else { stop("progfunc must be one of 'EN', 'RF', 'GSYN', 'MCP', 'CITS', 'CausalImpact', 'seq2seq', 'None'") } } augsynth$mhat <- mhat + cbind(matrix(0, nrow = n, ncol = t0), augsynth$mhat) augsynth$data <- wide augsynth$data$Z <- Z augsynth$data$synth_data <- synth_data augsynth$progfunc <- progfunc augsynth$scm <- scm augsynth$fixedeff <- fixedeff augsynth$extra_args <- list(...) if(progfunc == "ridge") { augsynth$extra_args$lambda <- augsynth$lambda } else if(progfunc == "gsyn") { augsynth$extra_args$r <- ncol(augsynth$params$factor) augsynth$extra_args$CV <- 0 } ##format output class(augsynth) <- "augsynth" return(augsynth) } #' Get prediction of ATT or average outcome under control #' @param object augsynth object #' @param att If TRUE, return the ATT, if FALSE, return imputed counterfactual #' @param ... Optional arguments #' #' @return Vector of predicted post-treatment control averages #' @export predict.augsynth <- function(object, att = F, ...) { # if ("att" %in% names(list(...))) { # att <- list(...)$att # } else { # att <- F # } augsynth <- object X <- augsynth$data$X y <- augsynth$data$y comb <- cbind(X, y) trt <- augsynth$data$trt mhat <- augsynth$mhat m1 <- colMeans(mhat[trt==1,,drop=F]) resid <- (comb[trt==0,,drop=F] - mhat[trt==0,drop=F]) y0 <- m1 + t(resid) %*% augsynth$weights if(att) { return(colMeans(comb[trt == 1,, drop = F]) - c(y0)) } else { rnames <- rownames(y0) y0_vec <- c(y0) names(y0_vec) <- rnames return(y0_vec) } } #' Print function for augsynth #' @param x augsynth object #' @param ... Optional arguments #' @export print.augsynth <- function(x, ...) { augsynth <- x ## straight from lm cat("\nCall:\n", paste(deparse(augsynth$call), sep="\n", collapse="\n"), "\n\n", sep="") ## print att estimates tint <- ncol(augsynth$data$X) ttotal <- tint + ncol(augsynth$data$y) att_post <- predict(augsynth, att = T)[(tint + 1):ttotal] cat(paste("Average ATT Estimate: ", format(round(mean(att_post),3), nsmall = 3), "\n\n", sep="")) } #' Plot function for augsynth #' @importFrom graphics plot #' #' @param x Augsynth object to be plotted #' @param inf Boolean, whether to get confidence intervals around the point estimates #' @param cv If True, plot cross validation MSE against hyper-parameter, otherwise plot effects #' @param ... Optional arguments #' @export plot.augsynth <- function(x, inf = T, cv = F, ...) { # if ("se" %in% names(list(...))) { # se <- list(...)$se # } else { # se <- T # } augsynth <- x if (cv == T) { errors = data.frame(lambdas = augsynth$lambdas, errors = augsynth$lambda_errors, errors_se = augsynth$lambda_errors_se) p <- ggplot2::ggplot(errors, ggplot2::aes(x = lambdas, y = errors)) + ggplot2::geom_point(size = 2) + ggplot2::geom_errorbar( ggplot2::aes(ymin = errors, ymax = errors + errors_se), width=0.2, size = 0.5) p <- p + ggplot2::labs(title = bquote("Cross Validation MSE over " ~ lambda), x = expression(lambda), y = "Cross Validation MSE", parse = TRUE) p <- p + ggplot2::scale_x_log10() # find minimum and min + 1se lambda to plot min_lambda <- choose_lambda(augsynth$lambdas, augsynth$lambda_errors, augsynth$lambda_errors_se, F) min_1se_lambda <- choose_lambda(augsynth$lambdas, augsynth$lambda_errors, augsynth$lambda_errors_se, T) min_lambda_index <- which(augsynth$lambdas == min_lambda) min_1se_lambda_index <- which(augsynth$lambdas == min_1se_lambda) p <- p + ggplot2::geom_point( ggplot2::aes(x = min_lambda, y = augsynth$lambda_errors[min_lambda_index]), color = "gold") p + ggplot2::geom_point( ggplot2::aes(x = min_1se_lambda, y = augsynth$lambda_errors[min_1se_lambda_index]), color = "gold") + ggplot2::theme_bw() } else { plot(summary(augsynth, ...), inf = inf) } } #' Summary function for augsynth #' @param object augsynth object #' @param inf Boolean, whether to get confidence intervals around the point estimates #' @param inf_type Type of inference algorithm. Options are #' \itemize{ #' \item{"conformal"}{Conformal inference (default)} #' \item{"jackknife+"}{Jackknife+ algorithm over time periods} #' \item{"jackknife"}{Jackknife over units} #' } #' @param linear_effect Boolean, whether to invert the conformal inference hypothesis test to get confidence intervals for a linear-in-time treatment effect: intercept + slope * time #' @param ... Optional arguments for inference, for more details for each `inf_type` see #' \itemize{ #' \item{"conformal"}{`conformal_inf`} #' \item{"jackknife+"}{`time_jackknife_plus`} #' \item{"jackknife"}{`jackknife_se_single`} #' } #' @export summary.augsynth <- function(object, inf = T, inf_type = "conformal", linear_effect = F, ...) { augsynth <- object summ <- list() t0 <- ncol(augsynth$data$X) t_final <- t0 + ncol(augsynth$data$y) if(inf) { if(inf_type == "jackknife") { att_se <- jackknife_se_single(augsynth) } else if(inf_type == "jackknife+") { att_se <- time_jackknife_plus(augsynth, ...) } else if(inf_type == "conformal") { att_se <- conformal_inf(augsynth, ...) # get CIs for linear treatment effects if(linear_effect) { att_linear <- conformal_inf_linear(augsynth, ...) } } else { stop(paste(inf_type, "is not a valid choice of 'inf_type'")) } att <- data.frame(Time = augsynth$data$time, Estimate = att_se$att[1:t_final]) if(inf_type == "jackknife") { att$Std.Error <- att_se$se[1:t_final] att_avg_se <- att_se$se[t_final + 1] } else { att_avg_se <- NA } att_avg <- att_se$att[t_final + 1] if(inf_type %in% c("jackknife+", "nonpar_bs", "t_dist", "conformal")) { att$lower_bound <- att_se$lb[1:t_final] att$upper_bound <- att_se$ub[1:t_final] } if(inf_type == "conformal") { att$p_val <- att_se$p_val[1:t_final] } } else { t0 <- ncol(augsynth$data$X) t_final <- t0 + ncol(augsynth$data$y) att_est <- predict(augsynth, att = T) att <- data.frame(Time = augsynth$data$time, Estimate = att_est) att$Std.Error <- NA att_avg <- mean(att_est[(t0 + 1):t_final]) att_avg_se <- NA } summ$att <- att if(inf) { if(inf_type %in% c("jackknife+")) { summ$average_att <- data.frame(Value = "Average Post-Treatment Effect", Estimate = att_avg, Std.Error = att_avg_se) summ$average_att$lower_bound <- att_se$lb[t_final + 1] summ$average_att$upper_bound <- att_se$ub[t_final + 1] summ$alpha <- att_se$alpha } if(inf_type == "conformal") { # summ$average_att$p_val <- att_se$p_val[t_final + 1] # summ$average_att$lower_bound <- att_se$lb[t_final + 1] # summ$average_att$upper_bound <- att_se$ub[t_final + 1] # summ$alpha <- att_se$alpha if(linear_effect) { summ$average_att <- data.frame( Value = c("Average Post-Treatment Effect", "Treatment Effect Intercept", "Treatment Effect Slope"), Estimate = c(att_avg, att_linear$est_int, att_linear$est_slope), Std.Error = c(att_avg_se, NA, NA), p_val = c(att_se$p_val[t_final + 1], NA, NA), lower_bound = c(att_se$lb[t_final + 1], att_linear$ci_int[1], att_linear$ci_slope[1]), upper_bound = c(att_se$ub[t_final + 1], att_linear$ci_int[2], att_linear$ci_slope[2]) ) } else { summ$average_att <- data.frame( Value = c("Average Post-Treatment Effect"), Estimate = att_avg, Std.Error = att_avg_se, p_val = att_se$p_val[t_final + 1], lower_bound = att_se$lb[t_final + 1], upper_bound = att_se$ub[t_final + 1] ) } summ$alpha <- att_se$alpha } } else { summ$average_att <- data.frame(Value = "Average Post-Treatment Effect", Estimate = att_avg, Std.Error = att_avg_se) } summ$t_int <- augsynth$t_int summ$call <- augsynth$call summ$l2_imbalance <- augsynth$l2_imbalance summ$scaled_l2_imbalance <- augsynth$scaled_l2_imbalance if(!is.null(augsynth$covariate_l2_imbalance)) { summ$covariate_l2_imbalance <- augsynth$covariate_l2_imbalance summ$scaled_covariate_l2_imbalance <- augsynth$scaled_covariate_l2_imbalance } ## get estimated bias if(tolower(augsynth$progfunc) == "ridge") { mhat <- augsynth$ridge_mhat w <- augsynth$synw } else { mhat <- augsynth$mhat w <- augsynth$weights } trt <- augsynth$data$trt m1 <- colMeans(mhat[trt==1,,drop=F]) if(tolower(augsynth$progfunc) == "none" | (!augsynth$scm)) { summ$bias_est <- NA } else { summ$bias_est <- m1 - t(mhat[trt==0,,drop=F]) %*% w } summ$inf_type <- if(inf) inf_type else "None" class(summ) <- "summary.augsynth" return(summ) } #' Print function for summary function for augsynth #' @param x summary object #' @param ... Optional arguments #' @export print.summary.augsynth <- function(x, ...) { summ <- x ## straight from lm cat("\nCall:\n", paste(deparse(summ$call), sep="\n", collapse="\n"), "\n\n", sep="") t_final <- nrow(summ$att) ## distinction between pre and post treatment att_est <- summ$att$Estimate t_total <- length(att_est) t_int <- summ$att %>% filter(Time <= summ$t_int) %>% nrow() att_pre <- att_est[1:(t_int-1)] att_post <- att_est[t_int:t_total] out_msg <- "" # print out average post treatment estimate att_post <- summ$average_att$Estimate[1] se_est <- summ$att$Std.Error if(summ$inf_type == "jackknife") { se_avg <- summ$average_att$Std.Error out_msg <- paste("Average ATT Estimate (Jackknife Std. Error): ", format(round(att_post,3), nsmall=3), " (", format(round(se_avg,3)), ")\n") inf_type <- "Jackknife over units" } else if(summ$inf_type == "conformal") { p_val <- summ$average_att$p_val[1] out_msg <- paste("Average ATT Estimate (p Value for Joint Null): ", format(att_post, digits = 3), " (", format(p_val, digits = 2), ")\n") inf_type <- "Conformal inference" if("Treatment Effect Slope" %in% summ$average_att$Value) { lowers <- summ$average_att$lower_bound[2:3] uppers <- summ$average_att$upper_bound[2:3] out_msg_line2 <- paste0("Confidence intervals for linear-in-time treatment effects (Intercept + Slope * Time)\n", "\tIntercept: [", format(lowers[1], digits = 3), ",", format(uppers[1], digits = 3), "]\n", "\tSlope: [", format(lowers[2], digits = 3), ",", format(uppers[2], digits = 3), "]\n") out_msg <- paste0(out_msg, out_msg_line2) } } else if(summ$inf_type == "jackknife+") { out_msg <- paste("Average ATT Estimate: ", format(round(att_post,3), nsmall=3), "\n") inf_type <- "Jackknife+ over time periods" } else { out_msg <- paste("Average ATT Estimate: ", format(round(att_post,3), nsmall=3), "\n") inf_type <- "None" } out_msg <- paste(out_msg, "L2 Imbalance: ", format(round(summ$l2_imbalance,3), nsmall=3), "\n", "Percent improvement from uniform weights: ", format(round(1 - summ$scaled_l2_imbalance,3)*100), "%\n\n", sep="") if(!is.null(summ$covariate_l2_imbalance)) { out_msg <- paste(out_msg, "Covariate L2 Imbalance: ", format(round(summ$covariate_l2_imbalance,3), nsmall=3), "\n", "Percent improvement from uniform weights: ", format(round(1 - summ$scaled_covariate_l2_imbalance,3)*100), "%\n\n", sep="") } out_msg <- paste(out_msg, "Avg Estimated Bias: ", format(round(mean(summ$bias_est), 3),nsmall=3), "\n\n", "Inference type: ", inf_type, "\n\n", sep="") cat(out_msg) if(summ$inf_type == "jackknife") { out_att <- summ$att[t_int:t_final,] %>% select(Time, Estimate, Std.Error) } else if(summ$inf_type == "conformal") { out_att <- summ$att[t_int:t_final,] %>% select(Time, Estimate, lower_bound, upper_bound, p_val) names(out_att) <- c("Time", "Estimate", paste0((1 - summ$alpha) * 100, "% CI Lower Bound"), paste0((1 - summ$alpha) * 100, "% CI Upper Bound"), paste0("p Value")) } else if(summ$inf_type == "jackknife+") { out_att <- summ$att[t_int:t_final,] %>% select(Time, Estimate, lower_bound, upper_bound) names(out_att) <- c("Time", "Estimate", paste0((1 - summ$alpha) * 100, "% CI Lower Bound"), paste0((1 - summ$alpha) * 100, "% CI Upper Bound")) } else { out_att <- summ$att[t_int:t_final,] %>% select(Time, Estimate) } out_att %>% mutate_at(vars(-Time), ~ round(., 3)) %>% print(row.names = F) } #' Plot function for summary function for augsynth #' @param x Summary object #' @param inf Boolean, whether to plot confidence intervals #' @param ... Optional arguments #' @export plot.summary.augsynth <- function(x, inf = T, ...) { summ <- x # if ("inf" %in% names(list(...))) { # inf <- list(...)$inf # } else { # inf <- T # } p <- summ$att %>% ggplot2::ggplot(ggplot2::aes(x=Time, y=Estimate)) if(inf) { if(all(is.na(summ$att$lower_bound))) { p <- p + ggplot2::geom_ribbon(ggplot2::aes(ymin=Estimate-2*Std.Error, ymax=Estimate+2*Std.Error), alpha=0.2) } else { p <- p + ggplot2::geom_ribbon(ggplot2::aes(ymin=lower_bound, ymax=upper_bound), alpha=0.2) } } p + ggplot2::geom_line() + ggplot2::geom_vline(xintercept=summ$t_int, lty=2) + ggplot2::geom_hline(yintercept=0, lty=2) + ggplot2::theme_bw() } #' augsynth #' #' @description A package implementing the Augmented Synthetic Controls Method #' @docType package #' @name augsynth-package #' @importFrom magrittr "%>%" #' @importFrom purrr reduce #' @import dplyr #' @import tidyr #' @importFrom stats terms #' @importFrom stats formula #' @importFrom stats update #' @importFrom stats delete.response #' @importFrom stats model.matrix #' @importFrom stats model.frame #' @importFrom stats na.omit NULL ================================================ FILE: R/augsynth_pre.R ================================================ ################################################################################ ## Main function for the augmented synthetic controls Method ################################################################################ #' Fit Augmented SCM #' @param form outcome ~ treatment | auxillary covariates #' @param unit Name of unit column #' @param time Name of time column #' @param data Panel data as dataframe #' @param t_int Time of intervention (used for single-period treatment only) #' @param ... Optional arguments #' \itemize{ #' \item Single period augsynth with/without multiple outcomes #' \itemize{ #' \item{"progfunc"}{What function to use to impute control outcomes: Ridge=Ridge regression (allows for standard errors), None=No outcome model, EN=Elastic Net, RF=Random Forest, GSYN=gSynth, MCP=MCPanel, CITS=CITS, CausalImpact=Bayesian structural time series with CausalImpact, seq2seq=Sequence to sequence learning with feedforward nets} #' \item{"scm"}{Whether the SCM weighting function is used} #' \item{"fixedeff"}{Whether to include a unit fixed effect, default F } #' \item{"cov_agg"}{Covariate aggregation functions, if NULL then use mean with NAs omitted} #' } #' \item Multi period (staggered) augsynth #' \itemize{ #' \item{"relative"}{Whether to compute balance by relative time} #' \item{"n_leads"}{How long past treatment effects should be estimated for} #' \item{"n_lags"}{Number of pre-treatment periods to balance, default is to balance all periods} #' \item{"alpha"}{Fraction of balance for individual balance} #' \item{"lambda"}{Regularization hyperparameter, default = 0} #' \item{"force"}{Include "none", "unit", "time", "two-way" fixed effects. Default: "two-way"} #' \item{"n_factors"}{Number of factors for interactive fixed effects, default does CV} #' } #' } #' #' @return augsynth object that contains: #' \itemize{ #' \item{"weights"}{weights} #' \item{"data"}{Panel data as matrices} #' } #' @export #' augsynth <- function(form, unit, time, data, t_int=NULL, ...) { call_name <- match.call() form <- Formula::Formula(form) unit_quosure <- enquo(unit) time_quosure <- enquo(time) ## format data outcome <- terms(formula(form, rhs=1))[[2]] trt <- terms(formula(form, rhs=1))[[3]] # check for multiple outcomes multi_outcome <- length(outcome) != 1 ## get first treatment times trt_time <- data %>% group_by(!!unit_quosure) %>% filter(!all(!!trt == 0)) %>% summarise(trt_time = min((!!time_quosure)[(!!trt) == 1])) %>% mutate(trt_time = replace_na(as.numeric(trt_time), Inf)) num_trt_years <- sum(is.finite(unique(trt_time$trt_time))) if(multi_outcome & num_trt_years > 1) { stop("augsynth is not currently implemented for more than one outcome and more than one treated unit") } else if(num_trt_years > 1) { message("More than one treatment time found. Running multisynth.") if("progfunc" %in% names(list(...))) { warning("`progfunc` is not an argument for multisynth, so it is ignored") } return(multisynth(form, !!enquo(unit), !!enquo(time), data, ...)) } else { if (is.null(t_int)) { t_int <- trt_time %>% filter(is.finite(trt_time)) %>% summarise(t_int = max(trt_time)) %>% pull(t_int) } if(!multi_outcome) { message("One outcome and one treatment time found. Running single_augsynth.") return(single_augsynth(form, !!enquo(unit), !!enquo(time), t_int, data = data, ...)) } else { message("Multiple outcomes and one treatment time found. Running augsynth_multiout.") return(augsynth_multiout(form, !!enquo(unit), !!enquo(time), t_int, data = data, ...)) } } } ================================================ FILE: R/cv.R ================================================ drop_time_t <- function(wide_data, Z, t_drop) { new_wide_data <- list() new_wide_data$trt <- wide_data$trt if (is.list(wide_data$X)) { # TODO } else { new_wide_data$X <- wide_data$X[, -t_drop, drop = F] new_wide_data$y <- cbind(wide_data$X[, t_drop, drop = F], wide_data$y) X0 <- new_wide_data$X[new_wide_data$trt == 0,, drop = F] x1 <- matrix(colMeans(new_wide_data$X[new_wide_data$trt == 1,, drop = F]), ncol=1) y0 <- new_wide_data$y[new_wide_data$trt == 0,, drop = F] y1 <- colMeans(new_wide_data$y[new_wide_data$trt == 1,, drop = F]) new_synth_data <- list() new_synth_data$Z0 <- t(X0) new_synth_data$X0 <- t(X0) new_synth_data$Z1 <- x1 new_synth_data$X1 <- x1 return(list(wide_data = new_wide_data, synth_data = new_synth_data, Z = Z)) } } drop_time_and_refit <- function(wide_data, Z, t_drop, progfunc, scm, fixedeff, ...) { new_data <- drop_time_t(wide_data, Z, t_drop) new_ascm <- do.call(fit_augsynth_internal, c(list(wide = new_data$wide, synth_data = new_data$synth_data, Z = new_data$Z, progfunc = progfunc, scm = scm, fixedeff = fixedeff, ...))) return(new_ascm) } cv_internal <- function(wide_data, Z, progfunc, scm, fixedeff, lambdas, holdout_periods, ...) { X <- wide_data$X lambda_error_vals <- vapply(lambdas, function(lambda){ errors <- apply(holdout_periods, 1, function(t_drop){ new_ascm <- drop_time_and_refit(wide_data, Z, t_drop, progfunc, scm, fixedeff, lambda = lambda, ...) err <- sum((predict(new_ascm, att = T)[(ncol(X)-length(t_drop)+1):ncol(X)])^2) err }) lambda_error <- mean(errors) lambda_error_se <- sd(errors) / sqrt(length(errors)) c(lambda_error, lambda_error_se) }, numeric(2)) return(list(lambda_errors = lambda_error_vals[1,], lambda_errors_se = lambda_error_vals[2,])) } cv_ridge <- function(wide_data, synth_data, Z, progfunc, scm, fixedeff, how = 'time', holdout_length = 1, lambdas = NULL, lambda_min_ratio = 1e-8, n_lambda = 20, lambda_max = NULL, min_1se = T, V = NULL, ...) { X <- wide_data$X trt <- wide_data$trt if (is.null(lambdas)) { if(is.null(lambda_max)) { X_cent <- apply(X, 2, function(x) x - mean(x[trt==0])) X_c <- X_cent[trt==0,,drop=FALSE] t0 <- ncol(X_c) if(is.null(V)) { V <- diag(rep(1, t0)) } else if(is.vector(V)) { V <- diag(V) } else if(ncol(V) == 1 & nrow(V) == t0) { V <- diag(c(V)) } else if(ncol(V) == t0 & nrow(V) == 1) { V <- diag(c(V)) } else if(nrow(V) == t0) { } else { stop("`V` must be a vector with t0 elements or a t0xt0 matrix") } X_c <- X_c %*% V if(!is.null(Z)) { Z_cent <- apply(Z, 2, function(x) x - mean(x[trt==0])) Z_c <- Z_cent[trt==0,,drop=FALSE] Xc_hat <- Z_c %*% solve(t(Z_c) %*% Z_c) %*% t(Z_c) %*% X_c res_c <- X_c - Xc_hat X_c <- res_c } lambda_max <- svd(X_c)$d[1] ^ 2 } lambdas <- create_lambda_list(lambda_max, lambda_min_ratio, n_lambda) } if (how == 'time') { period_starts <- 1:(ncol(X) - holdout_length) if (holdout_length == 1) { holdout_periods <- matrix(period_starts, nrow = length(period_starts), ncol = 1) } else { holdout_periods <- t(vapply(period_starts, function(t) t:(t+holdout_length-1), numeric(holdout_length))) } results <- cv_internal(wide_data, Z, progfunc, scm, fixedeff, lambdas, holdout_periods, ...) lambda <- choose_lambda(lambdas, results$lambda_errors, results$lambda_errors_se, min_1se) return(list(lambda = lambda, lambdas = lambdas, lambda_errors = results$lambda_errors, lambda_errors_se = results$lambda_errors_se)) } } ================================================ FILE: R/data.R ================================================ #' Economic indicators for US states from 1990-2016 #' #' #' @format A dataframe with 5250 rows and 32 variables: #' \describe{ #' \item{fips}{FIPS code for each state} #' \item{year}{Year of measurement} #' \item{qtr}{Quarter (1-4) of measurement} #' \item{state}{Name of State} #' \item{gdp}{Gross State Product (millions of $) Values before 2005 are linearly interpolated between years} #' \item{revenuepop}{State and local revenue per capita} #' \item{rev_state_total}{State total general revenue (millions of $)} #' \item{rev_local_total}{Local total general revenue (millions of $)} #' \item{popestimate}{Population estimate} #' \item{qtrly_estabs_count}{Count of establishments for a given quarter} #' \item{month1_emplvl, month2_emplvl, month3_emplvl}{ Employment level for first, second, and third months of a given quarter} #' \item{total_qtrly_wages}{Total wages for a givne quarter} #' \item{taxable_qtrly_wage}{Taxable wages for a given quarter} #' \item{avg_wkly_wage}{Average weekly wage for a given quarter} #' \item{year_qtr}{Year and quarter combined into one continuous variable} #' \item{treated}{Whether the state passed tax cuts before the given year and quareter} #' \item{lngdpcapita}{Natural log of GDP per capita} #' \item{emplvlcapita}{Average employment level per capita} #' \item{Xcapita}{Per capita value of X} #' \item{abb}{State abbreviation} #' } "kansas" ================================================ FILE: R/eligible_donors.R ================================================ ############################################################################## ## Code to get eligible donor units based on covariates ############################################################################## get_donors <- function(X, y, trt, Z, time_cohort, n_lags, n_leads, how = "knn", exact_covariates = NULL, ...) { # first get eligible donors by treatment time donors <- get_eligible_donors(trt, time_cohort, n_leads) # get donors with no NA values nona_donors <- get_nona_donors(X, y, trt, n_lags, n_leads, time_cohort) donors <- lapply(1:length(donors), function(j) donors[[j]] & nona_donors[[j]]) # if Z isn't NULL, futher restrict the donors by matching if(!is.null(Z)) { if(ncol(Z) != 0) { donors <- get_matched_donors(trt, Z, donors, how, exact_covariates, ...) } } return(donors) } get_eligible_donors <- function(trt, time_cohort, n_leads) { # get treatment times if(time_cohort) { grps <- unique(trt[is.finite(trt)]) } else { grps <- trt[is.finite(trt)] } J <- length(grps) # only allow weights on donors treated after n_leads donors <- lapply(1:J, function(j) trt > n_leads + grps[j]) return(donors) } #' Get donors that don't have missing outcomes where treated units have outcomes get_nona_donors <- function(X, y, trt, n_lags, n_leads, time_cohort) { n <- length(trt) # find na treatment times fulldat <- cbind(X, y) is_na <- is.na(fulldat[is.finite(trt), , drop = F]) # aggregate by time cohort if(time_cohort) { grps <- unique(trt[is.finite(trt)]) # if doing a time cohort, convert the boolean mask finite_trt <- trt[is.finite(trt)] is_na <- t(sapply(grps, function(tj) apply(is_na[finite_trt == tj, , drop = F], 2, all))) } else { grps <- trt[is.finite(trt)] } not_na <- !is.na(fulldat) J <- length(grps) lapply(1:J, function(j) { idxs <- max(grps[j] - n_lags + 1, 1):min(grps[j] + n_leads, ncol(fulldat)) isna_j <- is_na[j, idxs] apply(not_na[, idxs, drop = F][, !isna_j, drop = F], 1, all) }) -> donors return(donors) } get_matched_donors <- function(trt, Z, donors, how, exact_covariates = NULL, k = NULL, ...) { J <- sum(is.finite(trt)) trt_idx <- which(is.finite(trt)) if(is.null(exact_covariates)) { if(how == "exact") { return( lapply(1:J, function(j) donors[[j]] & apply(t(Z) == Z[trt_idx[j], ], 2, all) ) ) } else if(how == "knn") { return(get_knn_donors(trt, Z, donors, k)) } else { stop("Option for exact matching must be in ('exact', 'knn')") } } else { if(how == "exact") { return( lapply(1:J, function(j) donors[[j]] & apply(t(Z) == Z[trt_idx[j], exact_covariates], 2, all) ) ) } else if(how == "knn") { donors <- lapply(1:J, function(j) { donors[[j]] & apply(t(Z[, exact_covariates, drop = F]) == Z[trt_idx[j],exact_covariates], 2, all) } ) approx_covs <- which(!colnames(Z) %in% exact_covariates) if(length(approx_covs != 0)) { return(get_knn_donors(trt, Z[, approx_covs, drop = F], donors, k)) } else { return(donors) } } else { stop("Option for exact matching must be in ('exact', 'knn')") } } } get_knn_donors <- function(trt, Z, donors, k) { if(is.null(k)) { stop("Number of neighbors for knn not selected, please choose k.") } # knn matching within time cohort trt_idxs <- which(is.finite(trt)) lapply(1:length(trt_idxs), function(j) { idx <- trt_idxs[j] # idxs for treated units treated at time tj Z_tj <- Z[idx, , drop = F] # get donors for treated cohort donors_tj <- donors[[j]] Z_donors_tj <- Z[donors_tj, , drop = F] # check that k is less than the number of donors # if not, warn and set k to be the number of donors - 1 if(k >= nrow(Z_donors_tj)) { warning(paste("Number of potential donor units is less than", "the number of required matches,", "returning all donors as matches")) return(donors_tj) } # do knn matching nn <- FNN::get.knnx(data = Z_donors_tj, query = Z_tj, k = k) # keep track of which indices these are donors_j <- logical(length(donors_tj)) true_idx <- which(donors_tj)[nn$nn.index[1, ]] donors_j[true_idx] <- TRUE return(donors_j) }) -> matches names(matches) <- trt_idxs return(matches) } ================================================ FILE: R/fit_synth.R ================================================ ####################################################### # Helper scripts to fit synthetic controls to simulations ####################################################### #' Make a V matrix from a vector (or null) make_V_matrix <- function(t0, V) { if(is.null(V)) { V <- diag(rep(1, t0)) } else if(is.vector(V)) { if(length(V) != t0) { stop(paste("`V` must be a vector with", t0, "elements or a", t0, "x", t0, "matrix")) } V <- diag(V) } else if(ncol(V) == 1 & nrow(V) == t0) { V <- diag(c(V)) } else if(ncol(V) == t0 & nrow(V) == 1) { V <- diag(c(V)) } else if(nrow(V) == t0) { } else { stop(paste("`V` must be a vector with", t0, "elements or a", t0, "x", t0, "matrix")) } return(V) } #' Fit synthetic controls on outcomes after formatting data #' @param synth_data Panel data in format of Synth::dataprep #' @param V Matrix to scale the obejctive by #' @noRd #' @return \itemize{ #' \item{"weights"}{Synth weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' } fit_synth_formatted <- function(synth_data, V = NULL) { t0 <- dim(synth_data$Z0)[1] ## if no is supplied, set equal to 1 V <- make_V_matrix(t0, V) weights <- synth_qp(synth_data$X1, t(synth_data$X0), V) l2_imbalance <- sqrt(sum((synth_data$Z0 %*% weights - synth_data$Z1)^2)) ## primal objective value scaled by least squares difference for mean uni_w <- matrix(1/ncol(synth_data$Z0), nrow=ncol(synth_data$Z0), ncol=1) unif_l2_imbalance <- sqrt(sum((synth_data$Z0 %*% uni_w - synth_data$Z1)^2)) scaled_l2_imbalance <- l2_imbalance / unif_l2_imbalance return(list(weights=weights, l2_imbalance=l2_imbalance, scaled_l2_imbalance=scaled_l2_imbalance)) } #' Solve the synth QP directly #' @param X1 Target vector #' @param X0 Matrix of control outcomes #' @param V Scaling matrix #' @noRd synth_qp <- function(X1, X0, V) { Pmat <- X0 %*% V %*% t(X0) qvec <- - t(X1) %*% V %*% t(X0) n0 <- nrow(X0) A <- rbind(rep(1, n0), diag(n0)) l <- c(1, numeric(n0)) u <- c(1, rep(1, n0)) settings = osqp::osqpSettings(verbose = FALSE, eps_rel = 1e-8, eps_abs = 1e-8) sol <- osqp::solve_osqp(P = Pmat, q = qvec, A = A, l = l, u = u, pars = settings) return(sol$x) } ================================================ FILE: R/format.R ================================================ ################################################################################ ## Scripts to format panel data into matrices ################################################################################ #' Format "long" panel data into "wide" program evaluation matrices #' @param outcome Name of outcome column #' @param trt Name of treatment column #' @param unit Name of unit column #' @param time Name of time column #' @param t_int Time of intervention #' @param data Panel data as dataframe #' @noRd #' @return \itemize{ #' \item{"X"}{Matrix of pre-treatment outcomes} #' \item{"trt"}{Vector of treatment assignments} #' \item{"y"}{Matrix of post-treatment outcomes} #' } format_data <- function(outcome, trt, unit, time, t_int, data) { ## pre treatment outcomes X <- data %>% filter(!!time < t_int) %>% select(!!unit, !!time, !!outcome) %>% spread(!!time, !!outcome) %>% select(-!!unit) %>% as.matrix() ## post treatment outcomes y <- data %>% filter(!!time >= t_int) %>% select(!!unit, !!time, !!outcome) %>% spread(!!time, !!outcome) %>% select(-!!unit) %>% as.matrix() ## treatment status trt <- data %>% select(!!unit, !!trt) %>% group_by(!!unit) %>% summarise(trt = max(!!trt)) %>% ungroup() %>% pull(trt) return(list(X=X, trt=trt, y=y)) } #' Format "long" panel data into "wide" program evaluation matrices #' @param outcomes Vectors of names of outcome columns #' @param trt Name of treatment column #' @param unit Name of unit column #' @param time Name of time column #' @param t_int Time of intervention #' @param data Panel data as dataframe #' @noRd #' @return \itemize{ #' \item{"X"}{List of matrices of pre-treatment outcomes} #' \item{"trt"}{Vector of treatment assignments} #' \item{"y"}{List of matrices of post-treatment outcomes} #' } format_data_multi <- function(outcomes, trt, unit, time, t_int, data) { lapply(outcomes, function(outcome) format_data(outcome, trt, unit, time, t_int, data) ) -> formats # X <- simplify2array(lapply(formats, function(x) x$X)) # y <- simplify2array(lapply(formats, function(x) x$y)) # X <- lapply(formats, function(x) t(na.omit(t(x$X)))) X <- lapply(formats, `[[`, "X") y <- lapply(formats, function(x) t(na.omit(t(x$y)))) trt <- formats[[1]]$trt return(list(X = X, trt = trt, y = y)) } #' Format "long" panel data into "wide" program evaluation matrices with staggered adoption #' @param outcome Name of outcome column #' @param trt Name of treatment column #' @param unit Name of unit column #' @param time Name of time column #' @param data Panel data as dataframe #' @noRd #' @return \itemize{ #' \item{"X"}{Matrix of pre-treatment outcomes} #' \item{"trt"}{Vector of treatment assignments} #' \item{"y"}{Matrix of post-treatment outcomes} #' } format_data_stag <- function(outcome, trt, unit, time, data) { # arrange data by time first data <- data %>% arrange(!!time) ## get first treatment times trt_time <- data %>% group_by(!!unit) %>% summarise(trt_time=(!!time)[(!!trt) == 1][1]) %>% mutate(trt_time=replace_na(as.numeric(trt_time), Inf)) t_int <- trt_time %>% filter(is.finite(trt_time)) %>% summarise(t_int=max(trt_time)) %>% pull(t_int) ## ## boolean mask of available data for treatment groups ## mask <- data %>% inner_join(trt_time %>% ## filter(is.finite(trt_time))) %>% ## filter(!!time < t_int) %>% ## mutate(trt=1-!!trt) %>% ## select(!!unit, !!time, trt_time, trt) %>% ## spread(!!time, trt) %>% ## group_by(trt_time) %>% ## summarise_all(list(max)) %>% ## arrange(trt_time) %>% ## select(-trt_time, -!!unit) %>% ## as.matrix() ## boolean mask of available data for treatment groups mask <- data %>% inner_join(trt_time %>% filter(is.finite(trt_time)), by = rlang::as_name(unit)) %>% filter(!!time < t_int) %>% mutate(trt=1-!!trt) %>% select(!!unit, !!time, trt_time, trt) %>% spread(!!time, trt) %>% ## arrange(!!unit) %>% select(-trt_time, -!!unit) %>% as.matrix() # outcomes as a matrix Xy <- data %>% select(!!unit, !!time, !!outcome) %>% spread(!!time, !!outcome) %>% select(-!!unit) %>% as.matrix() pre_times <- data %>% filter(!!time < t_int) %>% distinct(!!time) %>% pull(!!time) post_times <- data %>% filter(!!time >= t_int) %>% distinct(!!time) %>% pull(!!time) X <- Xy[, as.character(pre_times), drop = F] y <- Xy[, as.character(post_times), drop = F] if(nrow(X) != nrow(y)) { stop("There are not the same number of units after the last unit is treated as before the last unit is treated") } t_vec <- data %>% pull(!!time) %>% unique() %>% sort() trt <- sapply(trt_time$trt_time, function(x) which(t_vec == x)-1) %>% as.numeric() %>% replace_na(Inf) units <- data %>% filter(!!time < t_int) %>% select(!!unit, !!time, !!outcome) %>% spread(!!time, !!outcome) %>% pull(!!unit) return(list(X=X, trt=trt, y=y, mask=mask, time = t_vec, units=units)) } #' Format program eval matrices into synth form #' #' @param X Matrix of pre-treatment outcomes #' @param trt Vector of treatment assignments #' @param y Matrix of post-treatment outcomes #' @noRd #' @return List with data formatted as Synth::dataprep format_synth <- function(X, trt, y) { synth_data <- list() ## pre-treatment values as covariates synth_data$Z0 <- t(X[trt==0,,drop=F]) ## average treated units together synth_data$Z1 <- as.matrix((colMeans(X[trt==1,,drop=F])), ncol=1) ## combine everything together also synth_data$Y0plot <- t(cbind(X[trt==0,,drop=F], y[trt==0,,drop=F])) synth_data$Y1plot <- as.matrix(colMeans( cbind(X[trt==1,,drop=F], y[trt==1,,drop=F])), ncol=1) ## predictors are pre-period outcomes synth_data$X0 <- synth_data$Z0 synth_data$X1 <- synth_data$Z1 return(synth_data) } #' Remove unit means #' @param wide_data X, y, trt #' @param synth_data List with data formatted as Synth::dataprep #' @noRd demean_data <- function(wide_data, synth_data) { # pre treatment means means <- rowMeans(wide_data$X) new_wide_data <- list() new_X <- wide_data$X - means trt <- wide_data$trt new_wide_data$X <- new_X new_wide_data$y <- wide_data$y - means new_wide_data$trt <- trt new_synth_data <- list() new_synth_data$X0 <- t(new_X[trt == 0,, drop = FALSE]) new_synth_data$Z0 <- new_synth_data$X0 new_synth_data$X1 <- as.matrix((colMeans(new_X[trt==1,,drop = F])), ncol = 1) new_synth_data$Z1 <- new_synth_data$X1 # estimate post-treatment as pre-treatment means mhat <- replicate(ncol(wide_data$X) + ncol(wide_data$y), means) return(list(wide = new_wide_data, synth_data = new_synth_data, mhat = mhat)) } #' Helper function to extract covariate matrix from data #' @param form Formula as outcome ~ treatment | covariates #' @param unit Name of unit column #' @param time Name of time column #' @param t_int Time of intervention #' @param data Panel data as dataframe #' @param cov_agg Covariate aggregation function #' @noRd extract_covariates <- function(form, unit, time, t_int, data, cov_agg) { ## if no aggregation functions, use the mean (omitting NAs) if(is.null(cov_agg)) { cov_agg <- c(function(x) mean(x, na.rm=T)) } cov_form <- update(formula(delete.response(terms(form, rhs=2, data=data))), ~. - 1) ## ensure that there is no intercept ## pull out relevant covariates and aggregate pre_data <- data %>% filter(!! (time) < t_int) model.matrix(cov_form, model.frame(cov_form, pre_data, na.action=NULL) ) %>% data.frame() %>% mutate(unit=pull(pre_data, !!unit)) %>% group_by(unit) %>% summarise_all(cov_agg) -> Z # recombine with any missing units and convert to matrix data %>% distinct(!!unit) %>% rename(unit = !!unit) %>% left_join(Z, by = "unit") %>% arrange(unit) %>% select(-unit) %>% as.matrix() -> Z if(nrow(distinct(data, !!unit)) != nrow(Z)) { stop("Some units missing all covariate data") } # check if any covariates have no variation Zsds <- apply(Z, 2, sd) if(any(Zsds == 0)) { zero_covs <- paste(colnames(Z)[Zsds == 0], collapse = ", ") stop(paste("The following covariates have no variation across units:", zero_covs)) } return(Z) } #' Check that we can actually run multisynth on the data #' @param wide Output of format_data_stag #' @param fixedeff Whether to include a unit fixed effect #' @param n_leads How long past treatment effects should be estimated for, default is number of post treatment periods for last treated unit #' @param n_lags Number of pre-treatment periods to balance, default is to balance all periods check_data_stag <- function(wide, fixedeff, n_leads, n_lags) { # If there are less than 5 pre-treatment outcomes, give a warning less_5 <- wide$units[wide$trt < 5] warn_msg <- "" if(length(less_5) != 0) { warn_msg <- paste0( warn_msg, "The following units have less than 5 pre-treatment outcomes: (", paste(less_5, collapse = ","), "). Be cautious!" ) } # check if there are any always treated units always_trt <- wide$units[wide$trt == 0] # If including a fixed effect, check that there is more than one pretreatment # outcome for each unit n1 <- wide$units[wide$trt == 1] err_msg <- "" if(length(always_trt) != 0) { err_msg <- paste0( err_msg, "The following units are always treated and should be removed: (", paste(always_trt, collapse = ","), ")\n") } if(length(n1) != 0 & fixedeff) { if(nchar(err_msg) > 0) { err_msg <- paste0(err_msg, " Also: ") } err_msg <- paste0( err_msg, "You are including a fixed effect with `fixedeff = T`, but the ", "following units only have one pre-treatment outcome: (", paste(n1, collapse = ","), "). Either remove these units or set `fixedeff = F`.\n" ) } # check if there are never treated units if(max(wide$trt) < ncol(wide$X) + ncol(wide$y)) { if(nchar(err_msg) > 0) { err_msg <- paste0(err_msg, " Also: ") } err_msg <- paste0( err_msg, "All units are eventually treated. The last treatment time is ", wide$time[max(wide$trt)], ". To run multisynth, remove all periods after this time. ", "Units treated at this time will be considered 'never treated' in the ", "narrowed sample.\n" ) } if(nchar(warn_msg) > 0) { warning(warn_msg) } if(nchar(err_msg) > 0) { stop(err_msg) } } ================================================ FILE: R/globalVariables.R ================================================ utils::globalVariables(c("time", "val", "post", "weight", ".", "Time", "Estimate", "Std.Error", "Level", "last_time", "is_avg", "label", "Outcome", "unit", "obs", "lambdas", "errors_se", "upper_bound", "lower_bound")) ================================================ FILE: R/highdim.R ================================================ ################################################################################ ## Methods to use flexible outcome models ################################################################################ ##### Augmented SCM with general outcome models #' Use zero weights, do nothing but output everything in the right way #' @param synth_data Panel data in format of Synth::dataprep #' @noRd #' @return \itemize{ #' \item{"weights"}{Synth weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' } fit_zero_weights <- function(synth_data) { ## Imbalance is uniform weights imbalance uni_w <- matrix(1/ncol(synth_data$Z0), nrow=ncol(synth_data$Z0), ncol=1) unif_l2_imbalance <- sqrt(sum((synth_data$Z0 %*% uni_w - synth_data$Z1)^2)) scaled_l2_imbalance <- 1 return(list(weights=rep(0, ncol(synth_data$Z0)), l2_imbalance=unif_l2_imbalance, scaled_l2_imbalance=scaled_l2_imbalance)) } #' Fit E[Y(0)|X] and for each post-period and balance pre-period #' #' @param wide_data Output of `format_ipw` #' @param synth_data Output of `synth_data` #' @param fit_progscore Function to fit prognostic score #' @param fit_weights Function to fit synth weights #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' \item{"mhat"}{Outcome model estimate} #' } fit_augsyn_formatted <- function(wide_data, synth_data, fit_progscore, fit_weights, ...) { X <- wide_data$X y <- wide_data$y trt <- wide_data$trt ## fit prognostic scores fitout <- do.call(fit_progscore, list(X=X, y=y, trt=trt, ...)) ## fit synth syn <- fit_weights(synth_data) syn$params <- fitout$params syn$mhat <- fitout$y0hat return(syn) } #' Fit outcome model and balance pre-period #' @param wide_data Output of `format_ipw` #' @param synth_data Output of `synth_data` #' @param progfunc What function to use to impute control outcomes #' EN=Elastic Net, RF=Random Forest, GSYN=gSynth, #' Comp=softImpute, MCP=MCPanel, CITS=CITS #' CausalImpact=Bayesian structural time series with CausalImpact #' seq2seq=Sequence to sequence learning with feedforward nets #' @param scm Whether the SCM weighting function is used #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' \item{"mhat"}{Outcome model estimate} #' } fit_augsyn <- function(wide_data, synth_data, progfunc=c("EN", "RF", "GSYN", "MCP","CITS", "CausalImpact", "seq2seq"), scm=T, ...) { ## prognostic score and weight functions to use progfunc = tolower(progfunc) if(progfunc == "en") { progf <- fit_prog_reg } else if(progfunc == "rf") { progf <- fit_prog_rf } else if(progfunc == "gsyn"){ progf <- fit_prog_gsynth } else if(progfunc == "mcp"){ progf <- fit_prog_mcpanel } else if(progfunc == "cits") { progf <- fit_prog_cits } else if(progfunc == "causalimpact") { progf <- fit_prog_causalimpact } else if(progfunc == "seq2seq"){ progf <- fit_prog_seq2seq } else { stop("progfunc must be one of 'EN', 'RF', 'GSYN', 'MCP', 'CITS', 'CausalImpact', 'seq2seq'") } if(scm) { weightf <- fit_synth_formatted } else { ## still fit synth even if none ## TODO: This is a dumb wasteful hack weightf <- fit_zero_weights } return(fit_augsyn_formatted(wide_data, synth_data, progf, weightf, ...)) } ### Combine synth and gsynth by balancing pre-period residuals #' Fit outcome model and balance residuals #' #' @param wide_data Output of `format_data` #' @param synth_data Output of `format_synth` #' @param fit_progscore Function to fit prognostic score #' @param fit_weights Function to fit synth weights #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' \item{"mhat"}{Outcome model estimate} #' } fit_residaug_formatted <- function(wide_data, synth_data, fit_progscore, fit_weights, ...) { X <- wide_data$X y <- wide_data$y trt <- wide_data$trt ## fit prognostic scores fitout <- do.call(fit_progscore, list(X=X, y=y, trt=trt, ...)) y0hat <- fitout$y0hat ## get residuals ctrl_resids <- fitout$params$ctrl_resids trt_resids <- fitout$params$trt_resids ## replace outcomes with pre-period residuals t0 <- dim(X)[2] synth_data$Z0 <- ctrl_resids[1:t0, ] synth_data$Z1 <- as.matrix(trt_resids[1:t0]) ## fit synth weights syn <- fit_weights(synth_data) syn$params <- fitout$params ## return predicted values for treatment and control syn$mhat <- y0hat return(syn) } #' Fit outcome model and balance residuals #' #' @param wide_data Output of `format_data` #' @param synth_data Output of `format_synth` #' @param progfunc What function to use to impute control outcomes #' GSYN=gSynth, MCP=MCPanel, #' CITS=Comparative interrupted time series #' CausalImpact=Bayesian structural time series with CausalImpact #' @param weightfunc What function to use to fit weights #' SCM=Vanilla Synthetic Controls #' NONE=No reweighting, just outcome model #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' \item{"mhat"}{Outcome model estimate} #' } fit_residaug <- function(wide_data, synth_data, progfunc=c("GSYN", "MCP", "CITS", "CausalImpact"), weightfunc=c("SC","ENT", "SVD", "NONE"), ...) { ## prognostic score and weight functions to use if(progfunc == "GSYN"){ progf <- fit_prog_gsynth } else if(progfunc == "MCP"){ progf <- fit_prog_mcpanel } else if(progfunc == "CITS") { progf <- fit_prog_cits } else if(progfunc == "CausalImpact") { progf <- fit_prog_causalimpact } else { stop("progfunc must be one of 'GSYN', 'MCP', 'CITS', 'CausalImpact'") } ## weight function to use if(weightfunc == "SCM") { weightf <- fit_synth_formatted } else if(weightfunc == "NONE") { ## still fit synth even if none ## TODO: This is a dumb wasteful hack weightf <- fit_synth_formatted } else { stop("weightfunc must be one of 'SCM', 'NONE'") } return(fit_residaug_formatted(wide_data, synth_data, progf, weightf, ...)) } ================================================ FILE: R/inference.R ================================================ ################################################################################ ## Code for inference ################################################################################ #' Jackknife+ algorithm over time #' @param ascm Fitted `augsynth` object #' @param alpha Confidence level #' @param conservative Whether to use the conservative jackknife+ procedure #' @return List that contains: #' \itemize{ #' \item{"att"}{Vector of ATT estimates} #' \item{"heldout_att"}{Vector of ATT estimates with the time period held out} #' \item{"se"}{Standard error, always NA but returned for compatibility} #' \item{"lb"}{Lower bound of 1 - alpha confidence interval} #' \item{"ub"}{Upper bound of 1 - alpha confidence interval} #' \item{"alpha"}{Level of confidence interval} #' } time_jackknife_plus <- function(ascm, alpha = 0.05, conservative = F) { wide_data <- ascm$data synth_data <- ascm$data$synth_data n <- nrow(wide_data$X) n_c <- dim(synth_data$Z0)[2] Z <- wide_data$Z t0 <- dim(synth_data$Z0)[1] tpost <- ncol(wide_data$y) t_final <- dim(synth_data$Y0plot)[1] jack_ests <- lapply(1:t0, function(tdrop) { # drop unit i new_data <- drop_time_t(wide_data, Z, tdrop) # refit new_ascm <- do.call(fit_augsynth_internal, c(list(wide = new_data$wide, synth_data = new_data$synth_data, Z = new_data$Z, progfunc = ascm$progfunc, scm = ascm$scm, fixedeff = ascm$fixedeff), ascm$extra_args)) # get ATT estimates and held out error for time t # t0 is prediction for held out time est <- predict(new_ascm, att = F)[(t0 +1):t_final] est <- c(est, mean(est)) err <- c(colMeans(wide_data$X[wide_data$trt == 1, tdrop, drop = F]) - predict(new_ascm, att = F)[t0]) list(err, rbind(est + abs(err), est - abs(err), est + err, est)) }) # get errors and jackknife distribution held_out_errs <- vapply(jack_ests, `[[`, numeric(1), 1) jack_dist <- vapply(jack_ests, `[[`, matrix(0, nrow = 4, ncol = tpost + 1), 2) out <- list() att <- predict(ascm, att = T) out$att <- c(att, mean(att[(t0 + 1):t_final])) # held out ATT out$heldout_att <- c(held_out_errs, att[(t0 + 1):t_final], mean(att[(t0 + 1):t_final])) # out$se <- rep(NA, 10 + tpost) if(conservative) { qerr <- stats::quantile(abs(held_out_errs), 1 - alpha) out$lb <- c(rep(NA, t0), apply(jack_dist[4,,], 1, min) - qerr) out$ub <- c(rep(NA, t0), apply(jack_dist[4,,], 1, max) + qerr) } else { out$lb <- c(rep(NA, t0), apply(jack_dist[2,,], 1, stats::quantile, alpha / 2)) out$ub <- c(rep(NA, t0), apply(jack_dist[1,,], 1, stats::quantile, 1 - alpha / 2)) } # shift back to ATT scale y1 <- predict(ascm, att = F) + att y1 <- c(y1, mean(y1[(t0 + 1):t_final])) shifted_lb <- y1 - out$ub shifted_ub <- y1 - out$lb out$lb <- shifted_lb out$ub <- shifted_ub out$alpha <- alpha return(out) } #' Drop time period from pre-treatment data #' @param wide_data (X, y, trt) #' @param Z Covariates matrix #' @param t_drop Time to drop #' @noRd drop_time_t <- function(wide_data, Z, t_drop) { new_wide_data <- list() new_wide_data$trt <- wide_data$trt new_wide_data$X <- wide_data$X[, -t_drop, drop = F] new_wide_data$y <- cbind(wide_data$X[, t_drop, drop = F], wide_data$y) X0 <- new_wide_data$X[new_wide_data$trt == 0,, drop = F] x1 <- matrix(colMeans(new_wide_data$X[new_wide_data$trt == 1,, drop = F]), ncol=1) y0 <- new_wide_data$y[new_wide_data$trt == 0,, drop = F] y1 <- colMeans(new_wide_data$y[new_wide_data$trt == 1,, drop = F]) new_synth_data <- list() new_synth_data$Z0 <- t(X0) new_synth_data$X0 <- t(X0) new_synth_data$Z1 <- x1 new_synth_data$X1 <- x1 return(list(wide_data = new_wide_data, synth_data = new_synth_data, Z = Z)) } #' Conformal inference procedure to compute p-values and point-wise confidence intervals #' @param ascm Fitted `augsynth` object #' @param alpha Confidence level #' @param stat_func Function to compute test statistic #' @param type Either "iid" for iid permutations or "block" for moving block permutations; default is "block" #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param grid_size Number of grid points to use when inverting the hypothesis test #' @return List that contains: #' \itemize{ #' \item{"att"}{Vector of ATT estimates} #' \item{"heldout_att"}{Vector of ATT estimates with the time period held out} #' \item{"se"}{Standard error, always NA but returned for compatibility} #' \item{"lb"}{Lower bound of 1 - alpha confidence interval} #' \item{"ub"}{Upper bound of 1 - alpha confidence interval} #' \item{"p_val"}{p-value for test of no post-treatment effect} #' \item{"alpha"}{Level of confidence interval} #' } conformal_inf <- function(ascm, alpha = 0.05, stat_func = NULL, type = "iid", q = 1, ns = 1000, grid_size = 50) { wide_data <- ascm$data synth_data <- ascm$data$synth_data n <- nrow(wide_data$X) n_c <- dim(synth_data$Z0)[2] Z <- wide_data$Z t0 <- dim(synth_data$Z0)[1] tpost <- ncol(wide_data$y) t_final <- dim(synth_data$Y0plot)[1] # grid of nulls att <- predict(ascm, att = T) post_att <- att[(t0 +1):t_final] post_sd <- sqrt(mean(post_att ^ 2)) # iterate over post-treatment periods to get pointwise CIs vapply(1:tpost, function(j) { # fit using t0 + j as a pre-treatment period and get reisduals new_wide_data <- wide_data new_wide_data$X <- cbind(wide_data$X, wide_data$y[, j, drop = TRUE]) if(tpost > 1) { new_wide_data$y <- wide_data$y[, -j, drop = FALSE] } else { # set the post period has to be *something* new_wide_data$y <- matrix(1, nrow = n, ncol = 1) } # make a grid around the estimated ATT grid <- seq(att[t0 + j] - 2 * post_sd, att[t0 + j] + 2 * post_sd, length.out = grid_size) compute_permute_ci(new_wide_data, ascm, grid, 1, alpha, type, q, ns, stat_func) }, numeric(3)) -> cis # test a null post-treatment effect new_wide_data <- wide_data new_wide_data$X <- cbind(wide_data$X, wide_data$y) new_wide_data$y <- matrix(1, nrow = n, ncol = 1) null_p <- compute_permute_pval(new_wide_data, ascm, 0, ncol(wide_data$y), type, q, ns, stat_func) out <- list() att <- predict(ascm, att = T) out$att <- c(att, mean(att[(t0 + 1):t_final])) # out$se <- rep(NA, t_final) # out$sigma <- NA out$lb <- c(rep(NA, t0), cis[1, ], NA) out$ub <- c(rep(NA, t0), cis[2, ], NA) out$p_val <- c(rep(NA, t0), cis[3, ], null_p) out$alpha <- alpha return(out) } #' Conformal inference procedure to compute a confidence interval for a linear in time effect #' @param ascm Fitted `augsynth` object #' @param alpha Confidence level #' @param stat_func Function to compute test statistic #' @param type Either "iid" for iid permutations or "block" for moving block permutations; default is "iid" #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param grid_size Number of grid points to use when inverting the hypothesis test #' @return List that contains: #' \itemize{ #' \item{"att"}{Vector of ATT estimates} #' \item{"heldout_att"}{Vector of ATT estimates with the time period held out} #' \item{"se"}{Standard error, always NA but returned for compatibility} #' \item{"lb"}{Lower bound of 1 - alpha confidence interval} #' \item{"ub"}{Upper bound of 1 - alpha confidence interval} #' \item{"p_val"}{p-value for test of no post-treatment effect} #' \item{"alpha"}{Level of confidence interval} #' } conformal_inf_linear <- function(ascm, alpha = 0.05, stat_func = NULL, type = "iid", q = 1, ns = 1000, grid_size = 50) { wide_data <- ascm$data synth_data <- ascm$data$synth_data n <- nrow(wide_data$X) n_c <- dim(synth_data$Z0)[2] Z <- wide_data$Z t0 <- dim(synth_data$Z0)[1] tpost <- ncol(wide_data$y) t_final <- dim(synth_data$Y0plot)[1] # grid of nulls att <- predict(ascm, att = T) post_att <- att[(t0 +1):t_final] post_second <- sqrt(mean(post_att^2)) # grid for slope # use ols to get pilot estimate ts <- 1:tpost lm_out <- summary(lm(post_att ~ ts))$coefficients # grid for intercept grid_int <- seq(lm_out[1,1] - 2 * post_second, lm_out[1,1] + 2 * post_second, length.out = grid_size) if(tpost == 2) { warning(paste0("There are 2 post-treatment time periods, so a linear model has a perfect fit. A confidence interval for the slope may not be reasonable here.")) grid_slope <- seq(lm_out[2,1] - abs(lm_out[2,1]), lm_out[2,1] + abs(lm_out[2,1]), length.out = grid_size) } else if(tpost <= 1) { stop("There is only one post-treatment time period, so an intercept and a slope cannot be computed.") } else { grid_slope <- seq(lm_out[2,1] - 4 * lm_out[2,2] * sqrt(tpost), lm_out[2,1] + 4 * lm_out[2,2] * sqrt(tpost), length.out = grid_size) } # test a null post-treatment effect new_wide_data <- wide_data new_wide_data$X <- cbind(wide_data$X, wide_data$y) new_wide_data$y <- matrix(1, nrow = n, ncol = 1) null_p <- compute_permute_pval(new_wide_data, ascm, 0, ncol(wide_data$y), type, q, ns, stat_func) # confidence interval for linear in time treatment effects cis <- compute_permute_ci_linear(new_wide_data, ascm, grid_int, grid_slope, ncol(wide_data$y), alpha, type, q, ns, stat_func) return(cis) } #' Compute conformal test statistics #' @param wide_data List containing pre- and post-treatment outcomes and outcome vector #' @param ascm Fitted `augsynth` object #' @param h0 Null hypothesis to test #' @param post_length Number of post-treatment periods #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param stat_func Function to compute test statistic #' #' @return List that contains: #' \itemize{ #' \item{"resids"}{Residuals after enforcing the null} #' \item{"test_stats"}{Permutation distribution of test statistics} #' \item{"stat_func"}{Test statistic function} #' } #' @noRd compute_permute_test_stats <- function(wide_data, ascm, h0, post_length, type, q, ns, stat_func) { # format data new_wide_data <- wide_data t0 <- ncol(wide_data$X) - post_length tpost <- t0 + post_length # adjust outcomes for null new_wide_data$X[wide_data$trt == 1,(t0 + 1):tpost ] <- new_wide_data$X[wide_data$trt == 1,(t0 + 1):tpost] - h0 X0 <- new_wide_data$X[new_wide_data$trt == 0,, drop = F] x1 <- matrix(colMeans(new_wide_data$X[new_wide_data$trt == 1,, drop = F]), ncol=1) new_synth_data <- list() new_synth_data$Z0 <- t(X0) new_synth_data$X0 <- t(X0) new_synth_data$Z1 <- x1 new_synth_data$X1 <- x1 # fit synth with adjusted data and get residuals new_ascm <- do.call(fit_augsynth_internal, c(list(wide = new_wide_data, synth_data = new_synth_data, Z = wide_data$Z, progfunc = ascm$progfunc, scm = ascm$scm, fixedeff = ascm$fixedeff), ascm$extra_args)) resids <- predict(new_ascm, att = T)[1:tpost] # permute residuals and compute test statistic if(is.null(stat_func)) { stat_func <- function(x) (sum(abs(x) ^ q) / sqrt(length(x))) ^ (1 / q) } if(type == "iid") { test_stats <- sapply(1:ns, function(x) { reorder <- sample(resids) stat_func(reorder[(t0 + 1):tpost]) }) } else { ## increment time by one step and wrap test_stats <- sapply(1:tpost, function(j) { reorder <- resids[(0:tpost -1 + j) %% tpost + 1] stat_func(reorder[(t0 + 1):tpost]) }) } return(list(resids = resids, test_stats = test_stats, stat_func = stat_func)) } #' Compute conformal p-value #' @param wide_data List containing pre- and post-treatment outcomes and outcome vector #' @param ascm Fitted `augsynth` object #' @param h0 Null hypothesis to test #' @param post_length Number of post-treatment periods #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param stat_func Function to compute test statistic #' #' @return Computed p-value #' @noRd compute_permute_pval <- function(wide_data, ascm, h0, post_length, type, q, ns, stat_func) { t0 <- ncol(wide_data$X) - post_length tpost <- t0 + post_length out <- compute_permute_test_stats(wide_data, ascm, h0, post_length, type, q, ns, stat_func) mean(out$stat_func(out$resids[(t0 + 1):tpost]) <= out$test_stats) } #' Compute conformal p-value #' @param wide_data List containing pre- and post-treatment outcomes and outcome vector #' @param ascm Fitted `augsynth` object #' @param grid Set of null hypothesis to test for inversion #' @param post_length Number of post-treatment periods #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param stat_func Function to compute test statistic #' #' @return (lower bound of interval, upper bound of interval, p-value for null of 0 effect) #' @noRd compute_permute_ci <- function(wide_data, ascm, grid, post_length, alpha, type, q, ns, stat_func) { # make sure 0 is in the grid grid <- c(grid, 0) ps <-sapply(grid, function(x) { compute_permute_pval(wide_data, ascm, x, post_length, type, q, ns, stat_func) }) c(min(grid[ps >= alpha]), max(grid[ps >= alpha]), ps[grid == 0]) } #' Compute conformal confidence interval for a linear model for effects #' int + slope * time #' @param wide_data List containing pre- and post-treatment outcomes and outcome vector #' @param ascm Fitted `augsynth` object #' @param grid_int Set of null hypothesis values for the intercept #' @param grid_slope Set of null hypothesis values for the slope #' @param post_length Number of post-treatment periods #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param stat_func Function to compute test statistic #' #' @return (lower bound of interval, upper bound of interval, p-value for null of 0 effect) #' @noRd compute_permute_ci_linear <- function(wide_data, ascm, grid_int, grid_slope, post_length, alpha, type, q, ns, stat_func) { # make sure 0 is in both grids # grid_int <- c(grid_int, 0) # grid_slope <- c(grid_slope, 0) # make the combined grid grid_comb <- expand.grid(grid_int, grid_slope) grid_comb$p_val <-apply(grid_comb, 1, function(x) { compute_permute_pval(wide_data, ascm, x[1] + x[2] * (1:post_length), post_length, type, q, ns, stat_func) }) ci_int <- c(min(grid_comb[grid_comb$p_val >= alpha, 1]), max(grid_comb[grid_comb$p_val >= alpha, 1])) ci_slope <- c(min(grid_comb[grid_comb$p_val >= alpha, 2]), max(grid_comb[grid_comb$p_val >= alpha, 2])) int_slope_est <- as.numeric(grid_comb[which.max(grid_comb$p_val), 1:2]) return(list(est_int = int_slope_est[1], ci_int = ci_int, est_slope = int_slope_est[2], ci_slope = ci_slope)) } #' Jackknife+ algorithm over time #' @param ascm Fitted `augsynth` object #' @param alpha Confidence level #' @param conservative Whether to use the conservative jackknife+ procedure #' @return List that contains: #' \itemize{ #' \item{"att"}{Vector of ATT estimates} #' \item{"heldout_att"}{Vector of ATT estimates with the time period held out} #' \item{"se"}{Standard error, always NA but returned for compatibility} #' \item{"lb"}{Lower bound of 1 - alpha confidence interval} #' \item{"ub"}{Upper bound of 1 - alpha confidence interval} #' \item{"alpha"}{Level of confidence interval} #' } time_jackknife_plus_multiout <- function(ascm_multi, alpha = 0.05, conservative = F) { wide_data <- ascm_multi$data data_list <- ascm_multi$data_list n <- nrow(wide_data$X) k <- length(data_list$X) t0 <- min(sapply(data_list$X, ncol)) tpost <- max(sapply(data_list$y, ncol)) t_final <- t0 + tpost Z <- wide_data$Z jack_ests <- lapply(1:t0, function(tdrop) { # drop unit i new_data_list <- drop_time_t_multiout(data_list, Z, tdrop) # refit new_ascm <- do.call(fit_augsynth_multiout_internal, c(list(wide_list = new_data_list, combine_method = ascm_multi$combine_method, Z = data_list$Z, progfunc = ascm_multi$progfunc, scm = ascm_multi$scm, fixedeff = ascm_multi$fixedeff, outcomes_str = ascm_multi$outcomes), ascm_multi$extra_args)) # get ATT estimates and held out error for time t # t0 is prediction for held out time est <- predict(new_ascm, att = F)[(t0 +1):t_final, , drop = F] est <- rbind(est, colMeans(est)) # err <- c(colMeans(wide_data$X[wide_data$trt == 1, # tdrop, # drop = F]) - # predict(new_ascm, att = F)[t0]) err <- c(predict(new_ascm, att = T)[t0, , drop = F]) list(err, t(t(est) + abs(err)), t(t(est) - abs(err)), t(t(est) + err), est) }) # get errors and jackknife distribution held_out_errs <- matrix(vapply(jack_ests, `[[`, numeric(k), 1), nrow = k) jack_dist_high <- vapply(jack_ests, `[[`, matrix(0, nrow = tpost + 1, ncol = k), 2) jack_dist_low <- vapply(jack_ests, `[[`, matrix(0, nrow = tpost + 1, ncol = k), 3) jack_dist_cons <- vapply(jack_ests, `[[`, matrix(0, nrow = tpost + 1, ncol = k), 4) out <- list() att <- predict(ascm_multi, att = T) out$att <- rbind(att, colMeans(att[(t0 + 1):t_final, , drop = F])) # held out ATT out$heldout_att <- rbind(t(held_out_errs), att[(t0 + 1):t_final, , drop = F], colMeans(att[(t0 + 1):t_final, , drop = F])) if(conservative) { qerr <- apply(abs(held_out_errs), 1, stats::quantile, 1 - alpha, type = 1) out$lb <- rbind(matrix(NA, nrow = t0, ncol = k), t(t(apply(jack_dist_cons, 1:2, min)) - qerr)) out$ub <- rbind(matrix(NA, nrow = t0, ncol = k), t(t(apply(jack_dist_cons, 1:2, max)) + qerr)) } else { out$lb <- rbind(matrix(NA, nrow = t0, ncol = k), apply(jack_dist_low, 1:2, stats::quantile, alpha, type = 1)) out$ub <- rbind(matrix(NA, nrow = t0, ncol = k), apply(jack_dist_high, 1:2, stats::quantile, 1 - alpha, type = 1)) } # shift back to ATT scale y1 <- predict(ascm_multi, att = F) + att y1 <- rbind(y1, colMeans(y1[(t0 + 1):t_final, , drop = F])) shifted_lb <- y1 - out$ub shifted_ub <- y1 - out$lb out$lb <- shifted_lb out$ub <- shifted_ub out$alpha <- alpha return(out) } #' Drop time period from pre-treatment data #' @param wide_data (X, y, trt) #' @param Z Covariates matrix #' @param t_drop Time to drop #' @noRd drop_time_t_multiout <- function(data_list, Z, t_drop) { new_data_list <- list() new_data_list$trt <- data_list$trt new_data_list$X <- lapply(data_list$X, function(x) x[, -t_drop, drop = F]) new_data_list$y <- lapply(1:length(data_list$y), function(k) { cbind(data_list$X[[k]][, t_drop, drop = F], data_list$y[[k]]) }) return(new_data_list) } #' Conformal inference procedure to compute p-values and point-wise confidence intervals #' @param ascm Fitted `augsynth` object #' @param alpha Confidence level #' @param stat_func Function to compute test statistic #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param grid_size Number of grid points to use when inverting the hypothesis test (default is 1, so only to test joint null) #' @return List that contains: #' \itemize{ #' \item{"att"}{Vector of ATT estimates} #' \item{"heldout_att"}{Vector of ATT estimates with the time period held out} #' \item{"se"}{Standard error, always NA but returned for compatibility} #' \item{"lb"}{Lower bound of 1 - alpha confidence interval} #' \item{"ub"}{Upper bound of 1 - alpha confidence interval} #' \item{"p_val"}{p-value for test of no post-treatment effect} #' \item{"alpha"}{Level of confidence interval} #' } conformal_inf_multiout <- function(ascm_multi, alpha = 0.05, stat_func = NULL, type = "iid", q = 1, ns = 1000, grid_size = 1, lin_h0 = NULL) { wide_data <- ascm_multi$data data_list <- ascm_multi$data_list n <- nrow(wide_data$X) k <- length(data_list$X) t0 <- min(sapply(data_list$X, ncol)) tpost <- max(sapply(data_list$y, ncol)) t_final <- t0 + tpost # grid of nulls att <- predict(ascm_multi, att = T) post_att <- att[(t0 +1):t_final,, drop = F] post_sd <- apply(post_att, 2, function(x) sqrt(mean(x ^ 2, na.rm = T))) # iterate over post-treatment periods to get pointwise CIs vapply(1:tpost, function(j) { # fit using t0 + j as a pre-treatment period and get residuals new_data_list <- data_list new_data_list$X <- lapply(1:k, function(i) { Xi <- cbind(data_list$X[[i]], data_list$y[[i]][, j, drop = TRUE]) colnames(Xi) <- c(colnames(data_list$X[[i]]), colnames(data_list$y[[i]])[j]) Xi }) if(tpost > 1) { new_data_list$y <- lapply(1:k, function(i) { data_list$y[[i]][, -j, drop = FALSE] }) } else { # set the post period has to be *something* new_data_list$y <- lapply(1:k, function(i) { x <- matrix(1, nrow = n, ncol = 1) colnames(x) <- max(as.numeric(colnames(data_list$y[[i]]))) + 1 x }) } # make a grid around the estimated ATT if(is.null(lin_h0)) { grid <- lapply(1:k, function(i) { seq(att[t0 + j, i] - 2 * post_sd[i], att[t0 + j, i] + 2 * post_sd[i], length.out = grid_size) }) } else { grid <- seq(min(att[t0 + j, ]) - 2 * max(post_sd), max(att[t0 + j, ]) + 2 * max(post_sd), length.out = grid_size) } if(grid_size > 1) { compute_permute_ci_multiout(new_data_list, ascm_multi, grid, 1, alpha, type, q, ns, lin_h0, stat_func) } else { rbind(matrix(0, ncol = k, nrow = 2), compute_permute_pval_multiout(new_data_list, ascm_multi, numeric(k), 1, type, q, ns, stat_func)) } }, matrix(0, ncol = k, nrow=3)) -> cis # # test a null post-treatment effect new_data_list <- data_list new_data_list$X <- lapply(1:k, function(i) { Xi <- cbind(data_list$X[[i]], data_list$y[[i]]) colnames(Xi) <- c(colnames(data_list$X[[i]]), colnames(data_list$y[[i]])) Xi }) # set post treatment to be *something* new_data_list$y <- lapply(1:k, function(i) { data_list$y[[i]][, 1, drop = FALSE] }) null_p <- compute_permute_pval_multiout(new_data_list, ascm_multi, numeric(k), tpost, type, q, ns, stat_func) if(is.null(lin_h0)) { grid <- lapply(1:k, function(i) { seq(min(att[(t0 + 1):tpost, i]) - 4 * post_sd[i], max(att[(t0 + 1):tpost, i]) + 4 * post_sd[i], length.out = grid_size) }) } else { grid <- seq(min(att[t0 + 1, ]) - 3 * max(post_sd), max(att[t0 + 1, ]) + 3 * max(post_sd), length.out = grid_size) } null_ci <- compute_permute_ci_multiout(new_data_list, ascm_multi, grid, tpost, alpha, type, q, ns, lin_h0, stat_func) out <- list() att <- predict(ascm_multi, att = T) out$att <- rbind(att, apply(att[(t0 + 1):t_final, , drop = F], 2, mean)) out$lb <- rbind(matrix(NA, nrow = t0, ncol = k), t(matrix(cis[1, ,], nrow = k)), # rep(NA, k) null_ci[1,] ) colnames(out$lb) <- ascm_multi$outcomes out$ub <- rbind(matrix(NA, nrow = t0, ncol = k), t(matrix(cis[2, ,], nrow = k)), # rep(NA, k) null_ci[2,] ) colnames(out$ub) <- ascm_multi$outcomes out$p_val <- rbind(matrix(NA, nrow = t0, ncol = k), t(matrix(cis[3, ,], nrow = k)), # rep(null_p, k) null_ci[3,]) colnames(out$p_val) <- ascm_multi$outcomes out$alpha <- alpha return(out) } #' Compute conformal test statistics #' @param wide_data List containing pre- and post-treatment outcomes and outcome vector #' @param ascm Fitted `augsynth` object #' @param h0 Null hypothesis to test #' @param post_length Number of post-treatment periods #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param stat_func Function to compute test statistic #' #' @return List that contains: #' \itemize{ #' \item{"resids"}{Residuals after enforcing the null} #' \item{"test_stats"}{Permutation distribution of test statistics} #' \item{"stat_func"}{Test statistic function} #' } #' @noRd compute_permute_test_stats_multiout <- function(data_list, ascm_multi, h0, post_length, type, q, ns, stat_func) { # format data new_data_list <- data_list t0 <- ncol(data_list$X[[1]]) - post_length tpost <- t0 + post_length k <- length(data_list$X) # adjust outcomes for null for(i in 1:k) { new_data_list$X[[k]][data_list$trt == 1,(t0 + 1):tpost ] <- new_data_list$X[[k]][data_list$trt == 1,(t0 + 1):tpost] - h0[i] } # fit synth with adjusted data and get residuals new_ascm <- do.call(fit_augsynth_multiout_internal, c(list(wide_list = new_data_list, combine_method = ascm_multi$combine_method, Z = data_list$Z, progfunc = ascm_multi$progfunc, scm = ascm_multi$scm, fixedeff = ascm_multi$fixedeff, outcomes_str = ascm_multi$outcomes), ascm_multi$extra_args)) resids <- predict(new_ascm, att = T)[1:tpost, , drop = F] # permute residuals and compute test statistic if(is.null(stat_func)) { stat_func <- function(x) { x <- na.omit(x) (sum(abs(x) ^ q) / sqrt(length(x))) ^ (1 / q) } } if(type == "iid") { test_stats <- sapply(1:ns, function(x) { idxs <- sample(1:nrow(resids)) reorder <- resids[idxs, , drop = F] apply(reorder[(t0 + 1):tpost, ,drop = F], 2, stat_func) }) } else { ## increment time by one step and wrap test_stats <- sapply(0:(tpost - 1), function(j) { reorder <- resids[(0:(tpost -1) + j) %% tpost + 1, ,drop = F] if(!all(dim(reorder) == dim(resids))) { stop("Error in block resampling") } apply(reorder[(t0 + 1):tpost, , drop = F], 2, stat_func) }) } return(list(resids = resids, test_stats = matrix(test_stats, nrow = k), stat_func = stat_func)) } #' Compute conformal p-value #' @param wide_data List containing pre- and post-treatment outcomes and outcome vector #' @param ascm Fitted `augsynth` object #' @param h0 Null hypothesis to test #' @param post_length Number of post-treatment periods #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param stat_func Function to compute test statistic #' #' @return Computed p-value #' @noRd compute_permute_pval_multiout <- function(data_list, ascm_multi, h0, post_length, type, q, ns, stat_func) { t0 <- ncol(data_list$X[[1]]) - post_length tpost <- t0 + post_length out <- compute_permute_test_stats_multiout(data_list, ascm_multi, h0, post_length, type, q, ns, stat_func) k <- length(data_list$X) comb_stat <- mean(apply(out$resids[(t0 + 1):tpost, , drop = F], 2, out$stat_func), na.rm = TRUE) comb_test_stats <- apply(out$test_stats, 2, mean, na.rm = TRUE) # if(h0 == 0) { # hist(comb_test_stats) # abline(v = comb_stat) # print(mean(comb_stat <= comb_test_stats)) # print(1 - mean(comb_stat > comb_test_stats)) # } 1 - mean(comb_stat > comb_test_stats) } #' Compute conformal p-value #' @param wide_data List containing pre- and post-treatment outcomes and outcome vector #' @param ascm Fitted `augsynth` object #' @param grid Set of null hypothesis to test for inversion #' @param post_length Number of post-treatment periods #' @param type Either "iid" for iid permutations or "block" for moving block permutations #' @param q The norm for the test static `((sum(x ^ q))) ^ (1/q)` #' @param ns Number of resamples for "iid" permutations #' @param stat_func Function to compute test statistic #' #' @return (lower bound of interval, upper bound of interval, p-value for null of 0 effect) #' @noRd compute_permute_ci_multiout <- function(data_list, ascm_multi, grid, post_length, alpha, type, q, ns, lin_h0 = NULL, stat_func) { # make sure 0 is in the grid if(is.null(lin_h0)) { grid <- lapply(grid, function(x) c(x, 0)) k <- length(grid) # get all combinations of grid grid <- expand.grid(grid) grid_low <- NULL } else { k <- length(lin_h0) # keep track of low dimensional grid grid_low <- c(grid, 0) # transform into high dimensional grid with linear hypothesis grid <- sapply(lin_h0, function(x) x * grid_low) } ps <- apply(grid, 1, function(x) { compute_permute_pval_multiout(data_list, ascm_multi, x, post_length, type, q, ns, stat_func) }) sapply(1:k, function(i) c(min(grid[ps >= alpha, i]), max(grid[ps >= alpha, i]), ps[apply(grid == 0, 1, all)])) } #' Drop unit i from data #' @param wide_data (X, y, trt) #' @param Z Covariates matrix #' @param i Unit to drop #' @noRd drop_unit_i <- function(wide_data, Z, i) { new_wide_data <- list() new_wide_data$trt <- wide_data$trt[-i] new_wide_data$X <- wide_data$X[-i,, drop = F] new_wide_data$y <- wide_data$y[-i,, drop = F] X0 <- new_wide_data$X[new_wide_data$trt == 0,, drop = F] x1 <- matrix(colMeans(new_wide_data$X[new_wide_data$trt == 1,, drop = F]), ncol=1) y0 <- new_wide_data$y[new_wide_data$trt == 0,, drop = F] y1 <- colMeans(new_wide_data$y[new_wide_data$trt == 1,, drop = F]) new_synth_data <- list() new_synth_data$Z0 <- t(X0) new_synth_data$X0 <- t(X0) new_synth_data$Z1 <- x1 new_synth_data$X1 <- x1 new_Z <- if(!is.null(Z)) Z[-i, , drop = F] else NULL return(list(wide_data = new_wide_data, synth_data = new_synth_data, Z = new_Z)) } #' Drop unit i from data #' @param wide_list (X, y, trt) #' @param Z Covariates matrix #' @param i Unit to drop #' @noRd drop_unit_i_multiout <- function(wide_list, Z, i) { new_wide_data <- list() new_wide_data$trt <- wide_list$trt[-i] new_wide_data$X <- lapply(wide_list$X, function(x) x[-i,, drop = F]) new_wide_data$y <- lapply(wide_list$y, function(x) x[-i,, drop = F]) new_Z <- if(!is.null(Z)) Z[-i, , drop = F] else NULL return(list(wide_list = new_wide_data, Z = new_Z)) } #' Estimate standard errors for single ASCM with the jackknife #' Do this for ridge-augmented synth #' @param ascm Fitted augsynth object #' #' @return List that contains: #' \itemize{ #' \item{"att"}{Vector of ATT estimates} #' \item{"se"}{Standard error estimate} #' \item{"lb"}{Lower bound of 1 - alpha confidence interval} #' \item{"ub"}{Upper bound of 1 - alpha confidence interval} #' \item{"alpha"}{Level of confidence interval} #' } jackknife_se_single <- function(ascm) { wide_data <- ascm$data synth_data <- ascm$data$synth_data n <- nrow(wide_data$X) n_c <- dim(synth_data$Z0)[2] Z <- wide_data$Z t0 <- dim(synth_data$Z0)[1] tpost <- ncol(wide_data$y) t_final <- dim(synth_data$Y0plot)[1] errs <- matrix(0, n_c, t_final - t0) # only drop out control units with non-zero weights nnz_weights <- numeric(n) nnz_weights[wide_data$trt == 0] <- round(ascm$weights, 3) != 0 # if more than one unit is treated, include them in the jackknife if(sum(wide_data$trt) > 1) { nnz_weights[wide_data$trt == 1] <- 1 } trt_idxs <- (1:n)[as.logical(nnz_weights)] # jackknife estimates ests <- vapply(trt_idxs, function(i) { # drop unit i new_data <- drop_unit_i(wide_data, Z, i) # refit new_ascm <- do.call(fit_augsynth_internal, c(list(wide = new_data$wide, synth_data = new_data$synth_data, Z = new_data$Z, progfunc = ascm$progfunc, scm = ascm$scm, fixedeff = ascm$fixedeff), ascm$extra_args)) # get ATT estimates est <- predict(new_ascm, att = T)[(t0 + 1):t_final] c(est, mean(est)) }, numeric(tpost + 1)) # convert to matrix ests <- matrix(ests, nrow = tpost + 1, ncol = length(trt_idxs)) ## standard errors se2 <- apply(ests, 1, function(x) (n - 1) / n * sum((x - mean(x, na.rm = T)) ^ 2)) se <- sqrt(se2) out <- list() att <- predict(ascm, att = T) out$att <- c(att, mean(att[(t0 + 1):t_final])) out$se <- c(rep(NA, t0), se) # out$sigma <- NA return(out) } #' Compute standard errors using the jackknife #' @param multisynth fitted multisynth object #' @param relative Whether to compute effects according to relative time #' @noRd jackknife_se_multi <- function(multisynth, relative=NULL, alpha = 0.05, att_weight = NULL) { ## get info from the multisynth object if(is.null(relative)) { relative <- multisynth$relative } n_leads <- multisynth$n_leads n <- nrow(multisynth$data$X) att <- predict(multisynth, att=T, att_weight = att_weight) outddim <- nrow(att) J <- length(multisynth$grps) ## drop each unit and estimate overall treatment effect jack_est <- vapply(1:n, function(i) { msyn_i <- drop_unit_i_multi(multisynth, i) pred <- predict(msyn_i[[1]], relative=relative, att=T, att_weight = att_weight) if(nrow(pred) < outddim) { pred <- rbind( pred[1:(nrow(pred)-1), ], matrix(NA, nrow=outddim-nrow(pred), ncol=ncol(pred)), pred[nrow(pred), ] ) } if(length(msyn_i[[2]]) != 0) { out <- matrix(NA, nrow=nrow(pred), ncol=(J+1)) out[,-(msyn_i[[2]]+1)] <- pred } else { out <- pred } out }, matrix(0, nrow=outddim,ncol=(J+1))) se2 <- apply(jack_est, c(1,2), function(x) (n-1) / n * sum((x - mean(x,na.rm=T))^2, na.rm=T)) lower_bound <- att - qnorm(1 - alpha / 2) * sqrt(se2) upper_bound <- att + qnorm(1 - alpha / 2) * sqrt(se2) return(list(att = att, se = sqrt(se2), lower_bound = lower_bound, upper_bound = upper_bound)) } #' Helper function to drop unit i and refit #' @param msyn multisynth_object #' @param i Unit to drop #' @noRd drop_unit_i_multi <- function(msyn, i) { n <- nrow(msyn$data$X) time_cohort <- msyn$time_cohort which_t <- (1:n)[is.finite(msyn$data$trt)] not_miss_j <- which_t %in% setdiff(which_t, i) # drop unit i from data drop_i <- msyn$data drop_i$X <- msyn$data$X[-i, , drop = F] drop_i$y <- msyn$data$y[-i, , drop = F] drop_i$trt <- msyn$data$trt[-i] drop_i$mask <- msyn$data$mask[not_miss_j,, drop = F] if(!is.null(msyn$data$Z)) { drop_i$Z <- msyn$data$Z[-i, , drop = F] } else { drop_i$Z <- NULL } long_df <- msyn$long_df unit <- colnames(long_df)[1] # make alphabetical, because the ith unit is the index in alphabetical ordering long_df <- long_df[order(long_df[, unit, drop = TRUE]),] ith_unit <- unique(long_df[,unit, drop = TRUE])[i] long_df <- long_df[long_df[,unit, drop = TRUE] != ith_unit,] # re-fit everything args_list <- list(wide = drop_i, relative = msyn$relative, n_leads = msyn$n_leads, n_lags = msyn$n_lags, nu = msyn$nu, lambda = msyn$lambda, V = msyn$V, force = msyn$force, n_factors = msyn$n_factors, scm = msyn$scm, time_w = msyn$time_w, lambda_t = msyn$lambda_t, fit_resids = msyn$fit_resids, time_cohort = msyn$time_cohort, long_df = long_df, how_match = msyn$how_match) msyn_i <- do.call(multisynth_formatted, c(args_list, msyn$extra_pars)) # check for dropped treated units/time periods if(time_cohort) { dropped <- which(!msyn$grps %in% msyn_i$grps) } else { dropped <- which(!not_miss_j) } return(list(msyn_i, dropped)) } #' Estimate standard errors for multi outcome ascm with jackknife #' @param ascm Fitted augsynth object #' @noRd jackknife_se_multiout <- function(ascm) { wide_data <- ascm$data wide_list <- ascm$data_list n <- nrow(wide_data$X) Z <- wide_data$Z # only drop out control units with non-zero weights nnz_weights <- numeric(n) nnz_weights[wide_data$trt == 0] <- round(ascm$weights, 3) != 0 trt_idxs <- (1:n)[as.logical(nnz_weights)] # jackknife estimates ests <- lapply(trt_idxs, function(i) { # drop unit i new_data <- drop_unit_i_multiout(wide_list, Z, i) # refit new_ascm <- do.call(fit_augsynth_multiout_internal, c(list(wide = new_data$wide, combine_method = ascm$combine_method, Z = new_data$Z, progfunc = ascm$progfunc, scm = ascm$scm, fixedeff = ascm$fixedeff, outcomes_str = ascm$outcomes), ascm$extra_args)) new_ascm$outcomes <- ascm$outcomes new_ascm$data_list <- ascm$data_list new_ascm$data$time <- ascm$data$time # get ATT estimates est <- predict(new_ascm, att = T) est <- est[as.numeric(rownames(est)) >= ascm$t_int,, drop = F] rbind(est, colMeans(est, na.rm = T)) }) ests <- simplify2array(ests) ## standard errors se2 <- apply(ests, c(1, 2), function(x) (n - 1) / n * sum((x - mean(x, na.rm = T)) ^ 2)) se <- sqrt(se2) out <- list() att <- predict(ascm, att = T) att_post <- colMeans(att[as.numeric(rownames(att)) >= ascm$t_int,, drop = F], na.rm = T) out$att <- rbind(att, att_post) t0 <- sum(as.numeric(rownames(att)) < ascm$t_int) out$se <- rbind(matrix(NA, t0, ncol(se)), se) out$sigma <- NA return(out) } #' Compute the weighted bootstrap distribution #' @param multisynth fitted multisynth object #' @param rweight Function to draw random weights as a function of n (e.g rweight(n)) #' @param relative Whether to compute effects according to relative time #' @noRd weighted_bootstrap_multi <- function(multisynth, rweight = rwild_b, n_boot = 1000, alpha = 0.05, att_weight = NULL, relative=NULL) { ## get info from the multisynth object if(is.null(relative)) { relative <- multisynth$relative } n <- nrow(multisynth$data$X) att <- predict(multisynth, att=T, att_weight = att_weight) outddim <- nrow(att) n1 <- sum(is.finite(multisynth$data$trt)) J <- length(multisynth$grps) # draw random weights to get bootstrap distribution bs_est <- vapply(1:n_boot, function(i) { Z <- rweight(n)# / sqrt(n1) predict(multisynth, att=T, att_weight = att_weight, bs_weight = Z) - sum(Z) / n1 * att }, matrix(0, nrow=outddim,ncol=(J+1))) se2 <- apply(bs_est, c(1,2), function(x) mean((x - mean(x))^2, na.rm=T)) bias <- apply(bs_est, c(1,2), function(x) mean(x, na.rm=T)) upper_bound <- att - apply(bs_est, c(1,2), function(x) quantile(x, alpha / 2, na.rm = T)) lower_bound <- att - apply(bs_est, c(1,2), function(x) quantile(x, 1 - alpha / 2, na.rm = T)) return(list(att = att, bias = bias, se = sqrt(se2), upper_bound = upper_bound, lower_bound = lower_bound)) } #' Bayesian bootstrap #' @param n Number of units #' @export rdirichlet_b <- function(n) { Z <- as.numeric(rgamma(n, 1, 1)) return(Z / sum(Z) * n) } #' Non-parametric bootstrap #' @param n Number of units #' @export rmultinom_b <- function(n) as.numeric(rmultinom(1, n, rep(1 / n, n))) #' Wild bootstrap (Mammen 1993) #' @param n Number of units #' @export rwild_b <- function(n) { sample(c(-(sqrt(5) - 1) / 2, (sqrt(5) + 1) / 2 ), n, replace = TRUE, prob = c((sqrt(5) + 1)/ (2 * sqrt(5)), (sqrt(5) - 1) / (2 * sqrt(5)))) } ================================================ FILE: R/multi_outcomes.R ================================================ #' Fit Augmented SCM with multiple outcomes #' @param form outcome ~ treatment | auxillary covariates #' @param unit Name of unit column #' @param time Name of time column #' @param t_int Time of intervention #' @param data Panel data as dataframe #' @param progfunc What function to use to impute control outcomes #' Ridge=Ridge regression (allows for standard errors), #' None=No outcome model, #' @param scm Whether the SCM weighting function is used #' @param fixedeff Whether to include a unit fixed effect, default F #' @param cov_agg Covariate aggregation functions, if NULL then use mean with NAs omitted #' @param combine_method How to combine outcomes: `concat` concatenates outcomes and `avg` averages them, default: 'avg' #' @param ... optional arguments for outcome model #' #' @return augsynth object that contains: #' \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' \item{"mhat"}{Outcome model estimate} #' \item{"data"}{Panel data as matrices} #' } #' @export augsynth_multiout <- function(form, unit, time, t_int, data, progfunc=c("Ridge", "None"), scm=T, fixedeff = FALSE, cov_agg=NULL, combine_method = "avg", ...) { call_name <- match.call() form <- Formula::Formula(form) unit <- enquo(unit) time <- enquo(time) ## format data outcome <- terms(formula(form, rhs=1))[[2]] trt <- terms(formula(form, rhs=1))[[3]] outcomes_str <- all.vars(outcome) outcomes <- sapply(outcomes_str, quo) # get outcomes as a list wide_list <- format_data_multi(outcomes, trt, unit, time, t_int, data) ## add covariates if(length(form)[2] == 2) { cov_form <- paste(deparse(terms(formula(form, rhs = 2))[[3]]), collapse = "") new_form <- as.formula(paste("~", cov_form)) Z <- extract_covariates(new_form, unit, time, t_int, data, cov_agg) } else { Z <- NULL } # only allow ridge augmentation if(! tolower(progfunc) %in% c("none", "ridge")) { stop(paste(progfunc, "is not a valid augmentation function with multiple outcomes. Only `none` or `ridge` are allowable options for `prog_func`")) } # fit augmented SCM augsynth <- fit_augsynth_multiout_internal(wide_list, combine_method, Z, progfunc, scm, fixedeff, outcomes_str, ...) # add some extra data augsynth$data$time <- data %>% distinct(!!time) %>% pull(!!time) augsynth$call <- call_name augsynth$t_int <- t_int augsynth$combine_method <- combine_method treated_units <- data %>% filter(!!trt == 1) %>% distinct(!!unit) %>% pull(!!unit) control_units <- data %>% filter(!(!!unit %in% treated_units)) %>% distinct(!!unit) %>% pull(!!unit) augsynth$weights <- matrix(augsynth$weights) rownames(augsynth$weights) <- control_units return(augsynth) } #' Internal function to fit augmented SCM with multiple outcomes #' @param wide_list List of matrices for each outcome formatted from format_data #' @param combine_method How to combine outcomes #' @param Z Matrix of auxiliary covariates #' @param progfunc outcome model to use #' @param scm Whether to fit SCM #' @param fixedeff Whether to de-mean synth #' @param ... Extra args for outcome model #' @noRd fit_augsynth_multiout_internal <- function(wide_list, combine_method, Z, progfunc, scm, fixedeff, outcomes_str, ...) { # combine into a matrix for fitting and balancing out <- combine_outcomes(wide_list, combine_method, fixedeff, ...) wide_bal <- out$wide_bal mhat <- out$mhat V <- out$V synth_data <- do.call(format_synth, wide_bal) # set Y1 and Y0plot to be raw concatenated outcomes X <- do.call(cbind, wide_list$X) y <- do.call(cbind, wide_list$y) trt <- wide_list$trt synth_data$Y0plot <- t(cbind(X, y)[trt == 0,, drop = F]) synth_data$Y1plot <- colMeans(cbind(X, y)[trt == 1,, drop = F]) augsynth <- fit_augsynth_internal(wide_bal, synth_data, Z, progfunc, scm, fixedeff = F, V = V, ...) # potentially add back in fixed effects augsynth$mhat <- mhat# + augsynth$mhat augsynth$data <- list(X = X, trt = trt, y = y, Z = Z) augsynth$data_list <- wide_list augsynth$outcomes <- outcomes_str # change fixedeff flag to match input (rather than fixedeff = F in fit_augsynth_internal) augsynth$fixedeff <- fixedeff ##format output class(augsynth) <- c("augsynth_multiout", "augsynth") return(augsynth) } #' Helper function to combine multiple outcomes into a single balance matrix #' @param wide_list List of lists of pre/post treatment data for each outcome #' @param combine_method How to combine outcomes #' @param fixedeff Whether to take out unit fixed effects or not #' @param nu Weighting between concatenated and averaged objectives #' @param ... Extra arguments for combination #' @noRd #' @return \itemize{ #' \item{"X"}{Matrix of combined pre-treatment outcomes} #' \item{"trt"}{Vector of treatment assignments} #' \item{"y"}{Matrix of combined post-treatment outcomes} #' } combine_outcomes <- function(wide_list, combine_method, fixedeff, nu = NULL, ...) { n_outs <- length(wide_list$X) n_units <- Map(nrow, wide_list$X) %>% Reduce(max, .) # take out unit fixed effects demean_j <- function(j) { means <- rowMeans(wide_list$X[[j]], na.rm = TRUE) new_wide_data <- list() new_X <- wide_list$X[[j]] - means new_y <- wide_list$y[[j]] - means new_wide_data$X <- new_X new_wide_data$y <- new_y new_wide_data$mhat_pre <- replicate(ncol(wide_list$X[[j]]), means) new_wide_data$mhat_post <- replicate(ncol(wide_list$y[[j]]), means) return(new_wide_data) } if(fixedeff) { new_wide_list <- lapply(1:n_outs, demean_j) wide_list$X <- lapply(new_wide_list, function(x) x$X) wide_list$y <- lapply(new_wide_list, function(x) x$y) mhat_pre <- lapply(new_wide_list, function(x) x$mhat_pre) mhat_post <- lapply(new_wide_list, function(x) x$mhat_post) } else { mhat_pre <- lapply( 1:n_outs, function(j) matrix(0, nrow = n_units, ncol = ncol(wide_list$X[[j]]))) mhat_post <- lapply( 1:n_outs, function(j) matrix(0, nrow = n_units, ncol = ncol(wide_list$y[[j]]))) } # combine outcomes if(combine_method == "concat") { # center X and scale by overall variance for outcome # X <- lapply(wide_list$X, function(x) t(t(x) - colMeans(x)) / sd(x)) wide_bal <- list(X = do.call(cbind, lapply(wide_list$X, function(x) t(na.omit(t(x))))), y = do.call(cbind, lapply(wide_list$y, function(x) t(na.omit(t(x))))), trt = wide_list$trt) # V matrix scales by inverse variance for outcome and number of periods V <- do.call(c, lapply(wide_list$X, function(x) rep(1 / (sqrt(nrow(na.omit(t(x)))) * sd(x[wide_list$trt == 0, , drop = F], na.rm=T)), nrow(na.omit(t(x)))))) # } else if(combine_method == "svd") { # wide_bal <- list(X = do.call(cbind, wide_list$X), # y = do.call(cbind, wide_list$y), # trt = wide_list$trt) # # first get the standard deviations of the outcomes to put on the same scale # sds <- do.call(c, # lapply(wide_list$X, # function(x) rep((sqrt(ncol(x)) * sd(x, na.rm=T)), ncol(x)))) # # do an SVD on centered and scaled outcomes # X0 <- wide_bal$X[wide_bal$trt == 0, , drop = FALSE] # X0 <- t((t(X0) - colMeans(X0)) / sds) # k <- if(is.null(k)) ncol(X0) else k # V <- diag(1 / sds) %*% svd(X0)$v[, 1:k, drop = FALSE] } else if(combine_method == "avg") { # average pre-treatment outcomes, dividing by standard deviation and removing missing values X_avg <- rowMeans(simplify2array(lapply(wide_list$X, function(x) (x - mean(x[wide_list$trt == 0,], na.rm = TRUE)) / sd(x[wide_list$trt == 0,], na.rm = TRUE))), dims = 2, na.rm = TRUE) # remove any time periods with NAs X_avg <- t(na.omit(t(X_avg))) wide_bal <- list(X = X_avg, y = rowMeans(simplify2array(wide_list$y), dims = 2, na.rm = TRUE), trt = wide_list$trt) V <- diag(ncol(wide_bal$X)) } else if(combine_method == "avg_concat") { # average pre-treatment outcomes, dividing by standard deviation and removing missing values # standardize the outcomes X_list_std<- lapply(wide_list$X,function(x) (x - mean(x[wide_list$trt == 0,], na.rm = TRUE)) / sd(x[wide_list$trt == 0,], na.rm = TRUE)) X_avg <- rowMeans(simplify2array(X_list_std), dims = 2, na.rm = TRUE) # remove any time periods with NAs X_avg <- t(na.omit(t(X_avg))) X_concat <- do.call(cbind, lapply(X_list_std, function(x) t(na.omit(t(x))))) # V matrix assigns weight nu to the averaged objective and (1 - nu) to the concatenated objective # V <- c(rep(sqrt(nu), ncol(X_avg)), # sqrt(1 - nu) / sqrt(n_outs) * do.call(c, # lapply(wide_list$X, # function(x) rep(1 / (sqrt(nrow(na.omit(t(x)))) * # sd(x[wide_list$trt == 0, , drop = F], na.rm=T)), # nrow(na.omit(t(x)))))) # ) V <- c(rep(sqrt(nu), ncol(X_avg)), rep(sqrt(1 - nu) / sqrt(n_outs), ncol(X_concat))) wide_bal <- list( X = cbind(X_avg, X_concat), y = do.call(cbind, lapply(wide_list$y, function(x) t(na.omit(t(x))))), trt = wide_list$trt ) } else { stop(paste("combine_method should be one of ('avg', 'concat', 'avg_concat'),", combine_method, " is not a valid combining option")) } mhat_pre <- do.call(cbind, mhat_pre) mhat_post <- do.call(cbind, mhat_post) mhat <- cbind(mhat_pre, mhat_post) return(list(wide_bal = wide_bal, mhat = mhat, V = V)) } #' Get prediction of ATT or average outcome under control #' @param object augsynth_multiout object #' @param ... Optional arguments, including \itemize{\item{"att"}{Whether to return the ATT or average outcome under control}} #' #' @return Vector of predicted post-treatment control averages #' @export predict.augsynth_multiout <- function(object, ...) { if ("att" %in% names(list(...))) { att <- list(...)$att } else { att <- F } # call augsynth predict pred <- NextMethod() # separate out by outcome n_outs <- length(object$data_list$X) max_t <- max(sapply(1:n_outs, function(k) ncol(object$data_list$X[[k]]) + ncol(object$data_list$y[[k]]))) pred_reshape <- matrix(NA, ncol = n_outs, nrow = max_t) colnames <- lapply(1:n_outs, function(k) colnames(cbind(object$data_list$X[[k]], object$data_list$y[[k]]))) rownames(pred_reshape) <- colnames[[which.max(sapply(colnames, length))]] colnames(pred_reshape) <- object$outcomes # get outcome names for predictions pre_outs <- do.call(c, sapply(1:n_outs, function(j) { rep(object$outcomes[j], ncol(object$data_list$X[[j]])) }, simplify = FALSE)) post_outs <- do.call(c, sapply(1:n_outs, function(j) { rep(object$outcomes[j], ncol(object$data_list$y[[j]])) }, simplify = FALSE)) # print(pred) # print(cbind(names(pred), c(pre_outs, post_outs))) pred_reshape[cbind(names(pred), c(pre_outs, post_outs))] <- pred return(pred_reshape) } #' Print function for augsynth #' @param x augsynth_multiout object #' @param ... Optional arguments #' @export print.augsynth_multiout <- function(x, ...) { ## straight from lm cat("\nCall:\n", paste(deparse(x$call), sep="\n", collapse="\n"), "\n\n", sep="") ## print att estimates att <- predict(x, att = T) att_post <- data.frame( colMeans(att[as.numeric(rownames(att)) >= x$t_int,, drop = F])) names(att_post) <- c("") cat("Average ATT Estimate:\n") print(att_post) cat("\n\n") } #' Summary function for augsynth #' @param object augsynth_multiout object #' @param inf whether or not to perform inference #' @param inf_typ Type of inference, default is "conformal" #' @param grid_size Grid to compute prediction intervals over, default is 1 and only p-values are computed #' @param ... Optional arguments, including \itemize{\item{"se"}{Whether to plot standard error}} #' @export summary.augsynth_multiout <- function(object, inf = T, inf_type = "conformal", grid_size = 1, ...) { summ <- list() if(inf) { if(inf_type == "conformal") { if(grid_size > 1) { cat(paste0("A grid size of ", grid_size, " will require ", grid_size, "^", length(object$outcomes), " = ", grid_size ^ length(object$outcomes), " evaluations. This could take a while...")) } att_se <- conformal_inf_multiout(object, grid_size = grid_size, ...) } else { stop("Only conformal inference is supported for multiple outcomes") } # if(inf_type == "jackknife") { # att_se <- jackknife_se_multiout(object) # } else if(inf_type == "jackknife+") { # att_se <- time_jackknife_plus_multiout(object, ...) # } else if(inf_type == "conformal") { # att_se <- conformal_inf_multiout(object, ...) # } else { # stop(paste(inf_type, "is not a valid choice of 'inf_type'")) # } t_final <- nrow(att_se$att) att_df <- data.frame(att_se$att[1:(t_final - 1),, drop=F]) names(att_df) <- object$outcomes att_df$Time <- object$data$time att_df <- att_df %>% gather(Outcome, Estimate, -Time) # if(inf_type == "jackknife") { # se_df <- data.frame(att_se$se[1:(t_final - 1),, drop=F]) # names(se_df) <- object$outcomes # se_df$Time <- object$data$time # se_df <- se_df %>% gather(Outcome, Std.Error, -Time) # att <- inner_join(att_df, se_df, by = c("Time", "Outcome")) # } else if(inf_type %in% c("conformal", "jackknife+")) { lb_df <- data.frame(att_se$lb[1:(t_final - 1),, drop=F]) names(lb_df) <- object$outcomes lb_df$Time <- object$data$time lb_df <- lb_df %>% gather(Outcome, lower_bound, -Time) ub_df <- data.frame(att_se$ub[1:(t_final - 1),, drop=F]) names(ub_df) <- object$outcomes ub_df$Time <- object$data$time ub_df <- ub_df %>% gather(Outcome, upper_bound, -Time) att <- inner_join(att_df, lb_df, by = c("Time", "Outcome")) %>% inner_join(ub_df, by = c("Time", "Outcome")) # if(inf_type == "conformal") { pval_df <- data.frame(att_se$p_val[1:(t_final - 1),, drop=F]) names(pval_df) <- object$outcomes pval_df$Time <- object$data$time pval_df <- pval_df %>% gather(Outcome, p_val, -Time) att <- inner_join(att, pval_df, by = c("Time", "Outcome")) # } # } if(grid_size == 1) { att <- att %>% mutate(lower_bound = NA, upper_bound = NA) } att_avg <- data.frame(att_se$att[t_final,, drop = F]) names(att_avg) <- object$outcomes att_avg <- gather(att_avg, Outcome, Estimate) # if(inf_type == "jackknife") { # att_avg_se <- data.frame(att_se$se[t_final,, drop = F]) # names(att_avg_se) <- object$outcomes # att_avg_se <- gather(att_avg_se, Outcome, Std.Error) # average_att <- inner_join(att_avg, att_avg_se, by="Outcome") # } else if(inf_type %in% c("conformal", "jackknife+")){ att_avg_lb <- data.frame(att_se$lb[t_final,, drop = F]) names(att_avg_lb) <- object$outcomes att_avg_lb <- gather(att_avg_lb, Outcome, lower_bound) att_avg_ub <- data.frame(att_se$ub[t_final,, drop = F]) names(att_avg_ub) <- object$outcomes att_avg_ub <- gather(att_avg_ub, Outcome, upper_bound) average_att <- inner_join(att_avg, att_avg_lb, by="Outcome") %>% inner_join(att_avg_ub, by = "Outcome") # if(inf_type == "conformal") { att_avg_pval <- data.frame(att_se$p_val[t_final,, drop = F]) names(att_avg_pval) <- object$outcomes att_avg_pval <- gather(att_avg_pval, Outcome, p_val) average_att <- inner_join(average_att, att_avg_pval, by = "Outcome") if(grid_size == 1) { average_att <- average_att %>% mutate(lower_bound = NA, upper_bound = NA) } # } # } else { # average_att <- gather(att_avg, Outcome, Estimate) # } } else { att_est <- predict(object, att = T) att_df <- data.frame(att_est) names(att_df) <- object$outcomes att_df$Time <- object$data$time att <- att_df %>% gather(Outcome, Estimate, -Time) att$Std.Error <- NA t_int <- min(sapply(object$data_list$X, ncol)) att_avg <- data.frame(t(colMeans( att_est[t_int:nrow(att_est),, drop = F]))) print(att_avg) names(att_avg) <- object$outcomes average_att <- gather(att_avg, Outcome, Estimate) average_att$Std.Error <- NA } # get average of all outcomes sds <- data.frame(Outcome = object$outcomes, sdo = sapply(object$data_list$X, function(x) sd(x[object$data_list$trt == 0,], na.rm = TRUE))) att %>% inner_join(sds, by = "Outcome") %>% mutate(Estimate = Estimate / sdo) %>% group_by(Time) %>% summarise(Estimate = mean(Estimate, na.rm = TRUE)) %>% mutate(Outcome = "Average") %>% bind_rows(att, .) -> att summ$att <- att summ$average_att <- average_att summ$t_int <- object$t_int summ$call <- object$call summ$l2_imbalance <- object$l2_imbalance summ$scaled_l2_imbalance <- object$scaled_l2_imbalance summ$inf_type <- inf_type ## get estimated bias if(object$progfunc == "Ridge") { mhat <- object$ridge_mhat w <- object$synw } else { mhat <- object$mhat w <- object$weights } trt <- object$data$trt m1 <- colMeans(mhat[trt==1,,drop=F]) summ$bias_est <- m1 - t(mhat[trt==0,,drop=F]) %*% w if(object$progfunc == "None" | (!object$scm)) { summ$bias_est <- NA } class(summ) <- "summary.augsynth_multiout" return(summ) } #' Print function for summary function for augsynth #' @param x summary.augsynth_multiout object #' @param ... Optional arguments #' @export print.summary.augsynth_multiout <- function(x, ...) { ## straight from lm cat("\nCall:\n", paste(deparse(x$call), sep="\n", collapse="\n"), "\n\n", sep="") att_est <- x$att$Estimate ## get pre-treatment fit by outcome imbal <- x$att %>% filter(Time < x$t_int) %>% group_by(Outcome) %>% summarise(Pre.RMSE = sqrt(mean(Estimate ^ 2, na.rm = TRUE))) cat(paste("Overall L2 Imbalance (Scaled):", format(round(x$l2_imbalance,3), nsmall=3), " (", format(round(x$scaled_l2_imbalance,3), nsmall=3), ")\n\n", # "Avg Estimated Bias: ", # format(round(mean(summ$bias_est), 3),nsmall=3), "\n\n", sep="")) cat("Average ATT Estimate:\n") print(inner_join(x$average_att, imbal, by = "Outcome")) cat("\n\n") } #' Plot function for summary function for augsynth #' @importFrom graphics plot #' @param x summary.augsynth_multiout object #' @param inf Boolean, whether to plot uncertainty intervals, default TRUE #' @param plt_avg Boolean, whether to plot the average of the outcomes, default FALSE #' @param ... Optional arguments for summary function #' #' @export plot.augsynth_multiout <- function(x, inf = T, plt_avg = F, ...) { plot(summary(x, ...), inf = inf, plt_avg = plt_avg) } #' Plot function for summary function for augsynth #' @param x summary.augsynth_multiout object #' @param inf Boolean, whether to plot uncertainty intervals, default TRUE #' @param plt_avg Boolean, whether to plot the average of the outcomes, default FALSE #' #' @export plot.summary.augsynth_multiout <- function(x, inf = F, plt_avg = F, ...) { if(plt_avg) { p <- x$att %>% ggplot2::ggplot(ggplot2::aes(x=Time, y=Estimate)) } else { p <- x$att %>% filter(Outcome != "Average") %>% ggplot2::ggplot(ggplot2::aes(x=Time, y=Estimate)) } if(inf) { if(x$inf_type == "jackknife") { p <- p + ggplot2::geom_ribbon(ggplot2::aes(ymin=Estimate-2*Std.Error, ymax=Estimate+2*Std.Error), alpha=0.2, data = . %>% filter(Outcome != "Average")) } else if(x$inf_type %in% c("conformal", "jackknife+")) { p <- p + ggplot2::geom_ribbon(ggplot2::aes(ymin=lower_bound, ymax=upper_bound), alpha=0.2, data = . %>% filter(Outcome != "Average")) } } p + ggplot2::geom_line() + ggplot2::geom_vline(xintercept=x$t_int, lty=2) + ggplot2::geom_hline(yintercept=0, lty=2) + ggplot2::facet_wrap(~ Outcome, scales = "free_y") + ggplot2::theme_bw() } ================================================ FILE: R/multi_synth_qp.R ================================================ ################################################################################ ## Solve the multisynth problem as a QP ################################################################################ #' Internal function to fit synth with staggered adoption with a QP solver #' @param X Matrix of pre-final intervention outcomes, or list of such matrices after transformations #' @param trt Vector of treatment levels/times #' @param mask Matrix with indicators for observed pre-intervention times for each treatment group #' @param n_leads Number of time periods after treatment to impute control values. #' For units treated at time T_j, all units treated after T_j + n_leads #' will be used as control values. If larger than the number of periods, #' only never never treated units (pure controls) will be used as comparison units #' @param n_lags Number of pre-treatment periods to balance, default is to balance all periods #' @param relative Whether to re-index time according to treatment date, default T #' @param nu Hyper-parameter that controls trade-off between overall and individual balance. #' Larger values of nu place more emphasis on individual balance. #' Balance measure is #' nu ||global|| + (1-nu) ||individual|| #' Default: 0 #' @param lambda Regularization hyper-parameter. Default, 0 #' @param time_cohort Whether to average synthetic controls into time cohorts #' @param norm_pool Normalizing value for pooled objective, default: number of treated units squared #' @param norm_sep Normalizing value for separate objective, default: number of treated units #' @param verbose Whether to print logs for osqp #' @param eps_rel Relative error tolerance for osqp #' @param eps_abs Absolute error tolerance for osqp #' @noRd #' @return \itemize{ #' \item{"weights"}{Matrix of unit weights} #' \item{"imbalance"}{Matrix of overall and group specific imbalance} #' \item{"global_l2"}{Imbalance overall} #' \item{"ind_l2"}{Matrix of imbalance for each group} #' } multisynth_qp <- function(X, trt, mask, Z = NULL, n_leads=NULL, n_lags=NULL, relative=T, nu=0, lambda=0, V = NULL, time_cohort = FALSE, donors = NULL, norm_pool = NULL, norm_sep = NULL, verbose = FALSE, eps_rel=1e-4, eps_abs=1e-4) { # if Z has no columns then set it to NULL if(!is.null(Z)) { if(ncol(Z) == 0) { Z <- NULL } } n <- if(typeof(X) == "list") dim(X[[1]])[1] else dim(X)[1] d <- if(typeof(X) == "list") dim(X[[1]])[2] else dim(X)[2] if(is.null(n_leads)) { n_leads <- d+1 } else if(n_leads > d) { n_leads <- d+1 } if(is.null(n_lags)) { n_lags <- d } else if(n_lags > d) { n_lags <- d } V <- make_V_matrix(n_lags, V) ## treatment times if(time_cohort) { grps <- unique(trt[is.finite(trt)]) which_t <- lapply(grps, function(tj) (1:n)[trt == tj]) # if doing a time cohort, convert the boolean mask mask <- unique(mask) } else { grps <- trt[is.finite(trt)] which_t <- (1:n)[is.finite(trt)] } J <- length(grps) if(is.null(norm_sep)) { norm_sep <- 1#J } if(is.null(norm_pool)) { norm_pool <- 1#J ^ 2 } n1 <- sapply(1:J, function(j) length(which_t[[j]])) # if no specific donors passed in, # then all donors treated after n_lags are eligible if(is.null(donors)) { donors <- get_eligible_donors(trt, time_cohort, n_leads) } ## handle X differently if it is a list if(typeof(X) == "list") { x_t <- lapply(1:J, function(j) colSums(X[[j]][which_t[[j]], mask[j,]==1, drop=F])) # Xc contains pre-treatment data for valid donor units Xc <- lapply(1:nrow(mask), function(j) X[[j]][donors[[j]], mask[j,]==1, drop=F]) # std dev of outcomes for first treatment time sdx <- sd(X[[1]][is.finite(trt)]) } else { x_t <- lapply(1:J, function(j) colSums(X[which_t[[j]], mask[j,]==1, drop=F])) # Xc contains pre-treatment data for valid donor units Xc <- lapply(1:nrow(mask), function(j) X[donors[[j]], mask[j,]==1, drop=F]) # std dev of outcomes sdx <- sd(X[is.finite(trt)]) } # get covariates for donors if(!is.null(Z)) { # scale covariates to have same variance as pure control outcomes Z_scale <- sdx * apply(Z, 2, function(z) (z - mean(z[!is.finite(trt)])) / sd(z[!is.finite(trt)])) z_t <- lapply(1:J, function(j) colSums(Z_scale[which_t[[j]], , drop = F])) Zc <- lapply(1:J, function(j) Z_scale[donors[[j]], , drop = F]) } else { z_t <- lapply(1:J, function(j) c(0)) Zc <- lapply(1:J, function(j) Matrix::Matrix(0, nrow = sum(donors[[j]]), ncol = 1)) } dz <- ncol(Zc[[1]]) # replace NA values with zero x_t <- lapply(x_t, function(xtk) tidyr::replace_na(xtk, 0)) Xc <- lapply(Xc, function(xck) apply(xck, 2, tidyr::replace_na, 0)) ## make matrices for QP n0s <- sapply(Xc, nrow) if(any(n0s == 0)) { stop("Some treated units have no possible donor units!") } n0 <- sum(n0s) const_mats <- make_constraint_mats(trt, grps, n_leads, n_lags, Xc, Zc, d, n1) Amat <- const_mats$Amat lvec <- const_mats$lvec uvec <- const_mats$uvec ## quadratic balance measures qvec <- make_qvec(Xc, x_t, z_t, nu, n_lags, d, V, norm_pool, norm_sep) Pmat <- make_Pmat(Xc, x_t, dz, nu, n_lags, lambda, d, V, norm_pool, norm_sep) ## Optimize settings <- do.call(osqp::osqpSettings, c(list(verbose = verbose, eps_rel = eps_rel, eps_abs = eps_abs))) out <- osqp::solve_osqp(Pmat, qvec, Amat, lvec, uvec, pars = settings) ## get weights total_ctrls <- n0 * J weights <- matrix(out$x[1:total_ctrls], nrow = n0) nj0 <- as.numeric(lapply(Xc, nrow)) nj0cumsum <- c(0, cumsum(nj0)) imbalance <- vapply(1:J, function(j) { dj <- length(x_t[[j]]) ndim <- min(dj, n_lags) c(numeric(d-ndim), x_t[[j]][(dj-ndim+1):dj] - t(Xc[[j]][,(dj-ndim+1):dj, drop = F]) %*% out$x[(nj0cumsum[j] + 1):nj0cumsum[j + 1]]) }, numeric(d)) avg_imbal <- rowMeans(t(t(imbalance))) Vsq <- t(V) %*% V global_l2 <- c(sqrt(t(avg_imbal[(d - n_lags + 1):d]) %*% Vsq %*% avg_imbal[(d - n_lags + 1):d])) / sqrt(d) avg_l2 <- mean(apply(imbalance, 2, function(x) c(sqrt(t(x[(d - n_lags + 1):d]) %*% Vsq %*% x[(d - n_lags + 1):d])))) ind_l2 <- sqrt(mean( apply(imbalance, 2, function(x) c(x[(d - n_lags + 1):d] %*% Vsq %*% x[(d - n_lags + 1):d]) / sum(x[(d - n_lags + 1):d] != 0)))) # pad weights with zeros for treated units and divide by number of treated units vapply(1:J, function(j) { weightj <- numeric(n) weightj[donors[[j]]] <- out$x[(nj0cumsum[j] + 1):nj0cumsum[j + 1]] weightj }, numeric(n)) -> weights weights <- t(t(weights) / n1) # manually enforce non-negativity constraint # (osqp solver only enforces constraint up to a tolerance) weights <- pmax(weights, 0) output <- list(weights = weights, imbalance = cbind(avg_imbal, imbalance), global_l2 = global_l2, ind_l2 = ind_l2, avg_l2 = avg_l2, V = V) if(!is.null(Z)) { # imbalance in auxiliary covariates z_t <- sapply(1:J, function(j) colMeans(Z[which_t[[j]], , drop = F])) imbal_z <- z_t - t(Z) %*% weights avg_imbal_z <- rowSums(t(t(imbal_z) * n1)) / sum(n1) global_l2_z <- sqrt(sum(avg_imbal_z ^ 2)) ind_l2_z <- sum(apply(imbal_z, 2, function(x) sqrt(sum(x ^ 2)))) imbal_z <- cbind(avg_imbal_z, imbal_z) rownames(imbal_z) <- colnames(Z) output$imbalance_aux <- imbal_z output$global_l2_aux <- global_l2_z output$ind_l2_aux <- ind_l2_z } return(output) } #' Create constraint matrices for multisynth QP #' @param trt Vector of treatment levels/times #' @param grps Treatment times #' @param n_leads Number of time periods after treatment to impute control values. #' @param n_lags Number of pre-treatment periods to balance #' @param Xc List of outcomes for possible comparison units #' @param d Max number of lagged outcomes #' @param n1 Vector of number of treated units per cohort #' @noRd #' @return #' \itemize{ #' \item{"Amat"}{Linear constraint matrix} #' \item{"lvec"}{Lower bounds for linear constraints} #' \item{"uvec"}{Upper bounds for linear constraints} #' } make_constraint_mats <- function(trt, grps, n_leads, n_lags, Xc, Zc, d, n1) { J <- length(grps) idxs0 <- trt > n_leads + min(grps) n0 <- sum(idxs0) ## sum to n1 constraint A1 <- do.call(Matrix::bdiag, lapply(1:(J), function(x) rep(1, n0))) A1 <- Matrix::bdiag(lapply(1:J, function(j) rep(1, nrow(Xc[[j]])))) Amat <- as.matrix(Matrix::t(A1)) Amat <- Matrix::rbind2(Matrix::t(A1), Matrix::Diagonal(nrow(A1))) dz <- ncol(Zc[[1]]) # constraints for transformed weights A_trans1 <- do.call(Matrix::bdiag, lapply(1:J, function(j) { dj <- ncol(Xc[[j]]) ndim <- min(dj, n_lags) max_dim <- min(d, n_lags) mat <- Xc[[j]][, (dj - ndim + 1):dj, drop = F] n0 <- nrow(mat) zero_mat <- Matrix::Matrix(0, n0, max_dim - ndim) Matrix::t(cbind(zero_mat, mat)) })) # sum of total number of pre-periods sum_tj <- min(d, n_lags) * J A_trans2 <- - Matrix::Diagonal(sum_tj) A_trans <- Matrix::cbind2( Matrix::cbind2(A_trans1, A_trans2), Matrix::Matrix(0, nrow = nrow(A_trans1), ncol = dz * J)) # constraints for transformed weights on auxiliary covariates A_transz <- Matrix::t(Matrix::bdiag(Zc)) A_transz <- Matrix::cbind2( Matrix::cbind2(A_transz, Matrix::Matrix(0, nrow = nrow(A_transz), ncol = sum_tj)), -Matrix::Diagonal(dz * J)) # add in zero columns for transformated weights Amat <- Matrix::cbind2(Amat, Matrix::Matrix(0, nrow = nrow(Amat), ncol = sum_tj + dz * J)) Amat <- Matrix::rbind2(Matrix::rbind2(Amat, A_trans), A_transz) lvec <- c(n1, # sum to n1 constraint numeric(nrow(A1)), # lower bound by zero numeric(sum_tj), # constrain transformed weights numeric(dz * J) # constrain transformed weights ) uvec <- c(n1, #sum to n1 constraint rep(Inf, nrow(A1)), numeric(sum_tj), # constrain transformed weights numeric(dz * J) # constrain transformed weights ) return(list(Amat = Amat, lvec = lvec, uvec = uvec)) } #' Make the vector in the QP #' @param Xc List of outcomes for possible comparison units #' @param x_t List of outcomes for treated units #' @param nu Hyperparameter between global and individual balance #' @param n_lags Number of lags to balance #' @param d Largest number of pre-intervention time periods #' @param V Scaling matrix #' @param norm_pool Normalizing value for pooled objective #' @param norm_sep Normalizing value for separate objective #' @noRd make_qvec <- function(Xc, x_t, z_t, nu, n_lags, d, V, norm_pool, norm_sep) { J <- length(x_t) Vsq <- t(V) %*% V qvec <- lapply(1:J, function(j) { dj <- length(x_t[[j]]) ndim <- min(dj, n_lags) max_dim <- min(d, n_lags) vec <- x_t[[j]][(dj - ndim + 1):dj] / ndim Vsq %*% c(numeric(max_dim - ndim), vec) }) avg_target_vec <- lapply(x_t, function(xtk) { dk <- length(xtk) ndim <- min(dk, n_lags) max_dim <- min(d, n_lags) c(numeric(max_dim - ndim), xtk[(dk - ndim + 1):dk]) }) %>% reduce(`+`) %*% Vsq qvec_avg <- rep(avg_target_vec, J) # qvec <- - (nu * qvec_avg / n_lags + (1 - nu) * reduce(qvec, c)) # qvec <- - (nu * qvec_avg / (J ^ 2 * n_lags) + # (1 - nu) * reduce(qvec, c) / J) qvec <- - (nu * qvec_avg / (norm_pool * n_lags * J ^ 2) + (1 - nu) * reduce(qvec, c) / (norm_sep * J)) qvec_avg_z <- z_t %>% reduce(`+`) qvec_avg_z <- rep(qvec_avg_z, J) # qvec_z <- - (nu * qvec_avg_z + (1 - nu) * reduce(z_t, c)) / length(z_t[[1]]) # qvec_z <- - (nu * qvec_avg_z / J ^2 + # (1 - nu) * reduce(z_t, c) / J) / length(z_t[[1]]) qvec_z <- - (nu * qvec_avg_z / (norm_pool * J ^ 2) + (1 - nu) * reduce(z_t, c) / (norm_sep * J)) / length(z_t[[1]]) total_ctrls <- lapply(Xc, nrow) %>% reduce(`+`) return(c(numeric(total_ctrls), qvec, qvec_z)) } #' Make the matrix in the QP #' @param Xc List of outcomes for possible comparison units #' @param x_t List of outcomes for treated units #' @param nu Hyperparameter between global and individual balance #' @param n_lags Number of lags to balance #' @param lambda Regularization hyperparameter #' @param d Largest number of pre-intervention time periods #' @param V Scaling matrix #' @param norm_pool Normalizing value for pooled objective #' @param norm_sep Normalizing value for separate objective #' @noRd make_Pmat <- function(Xc, x_t, dz, nu, n_lags, lambda, d, V, norm_pool, norm_sep) { J <- length(x_t) Vsq <- t(V) %*% V ndims <- vapply(1:J, function(j) min(length(x_t[[j]]), n_lags), numeric(1)) max_dim <- min(d, n_lags) total_dim <- sum(ndims) total_dim <- max_dim * J V1 <- Matrix::bdiag(lapply(ndims, function(ndim) Matrix::Diagonal(max_dim, 1 / ndim))) V1 <- Matrix::bdiag(lapply(ndims, function(ndim) Vsq / ndim)) tile_sparse <- function(j) { kronecker(Matrix::Matrix(1, nrow = j, ncol = j), Vsq) } tile_sparse_cov <- function(d, j) { kronecker(Matrix::Matrix(1, nrow = j, ncol = j), Matrix::Diagonal(d)) } V2 <- tile_sparse(J) / n_lags # Pmat <- nu * V2 + (1 - nu) * V1 # Pmat <- nu * V2 / J ^ 2 + (1 - nu) * V1 / J Pmat <- nu * V2 / (norm_pool * J ^ 2) + (1 - nu) * V1 / (norm_sep * J) V1_z <- Matrix::Diagonal(dz * J, 1 / dz) V2_z <- tile_sparse_cov(dz, J) / dz # Pmat_z <- nu * V2_z + (1 - nu) * V1_z # Pmat_z <- nu * V2_z / J ^ 2 + (1 - nu) * V1_z / J Pmat_z <- nu * V2_z / (norm_pool * J ^ 2) + (1 - nu) * V1_z / (norm_sep * J) # combine total_ctrls <- lapply(Xc, nrow) %>% reduce(`+`) Pmat <- Matrix::bdiag(Matrix::Matrix(0, nrow = total_ctrls, ncol = total_ctrls), Pmat, Pmat_z) I0 <- Matrix::bdiag(Matrix::Diagonal(total_ctrls), Matrix::Matrix(0, nrow = total_dim + dz * J, ncol = total_dim + dz * J)) return(Pmat + lambda * I0) } ================================================ FILE: R/multisynth_class.R ================================================ ################################################################################ ## Fitting, plotting, summarizing staggered synth ################################################################################ #' Fit staggered synth #' @param form outcome ~ treatment | weighting covariates | approximate matching covaraites | exact matching covariates #' \itemize{ #' \item{outcome}{Name of the outcome of interest} #' \item{treatment}{Name of the treatment assignment variable} #' \item{weighting covariates}{Auxiliary covariates to weight on} #' \item{approximate matching covariates}{Auxiliary covariates to approximately match one before weighting} #' \item{exact matching covariates}{Auxiliary covariates to exactly match on before weighting} #' } #' If covariates are time-varying, their average value before the first unit is treated will be used. This can be changed by supplying a custom aggregation function to cov_agg. #' @param unit Name of unit column #' @param time Name of time column #' @param data Panel data as dataframe #' @param n_leads How long past treatment effects should be estimated for, default is number of post treatment periods for last treated unit #' @param n_lags Number of pre-treatment periods to balance, default is to balance all periods #' @param nu Fraction of balance for individual balance #' @param lambda Regularization hyperparameter, default = 0 #' @param V Scaling matrix for synth optimization, default NULL is identity #' @param fixedeff Whether to include a unit fixed effect, default TRUE #' @param n_factors Number of factors for interactive fixed effects, setting to NULL fits with CV, default is 0 #' @param scm Whether to fit scm weights #' @param time_cohort Whether to average synthetic controls into time cohorts, default FALSE #' @param cov_agg Covariate aggregation function #' @param eps_abs Absolute error tolerance for osqp #' @param eps_rel Relative error tolerance for osqp #' @param verbose Whether to print logs for osqp #' @param ... Extra arguments #' #' @return multisynth object that contains: #' \itemize{ #' \item{"weights"}{weights matrix where each column is a set of weights for a treated unit} #' \item{"data"}{Panel data as matrices} #' \item{"imbalance"}{Matrix of treatment minus synthetic control for pre-treatment time periods, each column corresponds to a treated unit} #' \item{"global_l2"}{L2 imbalance for the pooled synthetic control} #' \item{"scaled_global_l2"}{L2 imbalance for the pooled synthetic control, scaled by the imbalance for unitform weights} #' \item{"ind_l2"}{Average L2 imbalance for the individual synthetic controls} #' \item{"scaled_ind_l2"}{Average L2 imbalance for the individual synthetic controls, scaled by the imbalance for unitform weights} #' \item{"n_leads", "n_lags"}{Number of post treatment outcomes (leads) and pre-treatment outcomes (lags) to include in the analysis} #' \item{"nu"}{Fraction of balance for individual balance} #' \item{"lambda"}{Regularization hyperparameter} #' \item{"scm"}{Whether to fit scm weights} #' \item{"grps"}{Time periods for treated units} #' \item{"y0hat"}{Pilot estimates of control outcomes} #' \item{"residuals"}{Difference between the observed outcomes and the pilot estimates} #' \item{"n_factors"}{Number of factors for interactive fixed effects} #' } #' @export multisynth <- function(form, unit, time, data, n_leads=NULL, n_lags=NULL, nu=NULL, lambda=0, V = NULL, fixedeff = TRUE, n_factors=0, scm=T, time_cohort = F, how_match = "knn", cov_agg = NULL, eps_abs = 1e-4, eps_rel = 1e-4, verbose = FALSE, ...) { call_name <- match.call() form <- Formula::Formula(form) unit <- enquo(unit) time <- enquo(time) ## format data outcome <- terms(formula(form, rhs=1))[[2]] trt <- terms(formula(form, rhs=1))[[3]] wide <- format_data_stag(outcome, trt, unit, time, data) check_data_stag(wide, fixedeff, n_leads, n_lags) force <- if(fixedeff) 3 else 2 # get covariates if(length(form)[2] == 2) { Z <- extract_covariates(form, unit, time, wide$time[min(wide$trt) + 1], data, cov_agg) } else if(length(form)[2] == 3) { app_form <- Formula::Formula(formula(form, rhs = 1:2)) Z_weight <- extract_covariates(app_form, unit, time, wide$time[min(wide$trt) + 1], data, cov_agg) exact_form <- Formula::Formula(formula(form, rhs = c(1,3))) Z_match<- extract_covariates(exact_form, unit, time, wide$time[min(wide$trt) + 1], data, cov_agg) Z <- cbind(Z_weight, Z_match) wide$match_covariates <- colnames(Z_match) } else if(length(form)[2] == 4) { if(time_cohort) { stop("Aggregating by time cohort and matching on covariates are not ", "implemented together. If matching then you cannot aggregate ", "by time cohort.") } weight_form <- Formula::Formula(formula(form, rhs = c(1,2))) Z_weight <- extract_covariates(weight_form, unit, time, wide$time[min(wide$trt) + 1], data, cov_agg) app_form <- Formula::Formula(formula(form, rhs = c(1,3))) Z_app <- extract_covariates(app_form, unit, time, wide$time[min(wide$trt) + 1], data, cov_agg) exact_form <- Formula::Formula(formula(form, rhs = c(1,4))) Z_exact <- extract_covariates(exact_form, unit, time, wide$time[min(wide$trt) + 1], data, cov_agg) Z <- cbind(Z_weight, Z_app, Z_exact) wide$exact_covariates <- colnames(Z_exact) wide$match_covariates <- c(colnames(Z_app), wide$exact_covariates) } else { Z <- NULL } wide$Z <- Z # if n_leads is NULL set it to be the largest possible number of leads # for the last treated unit if(is.null(n_leads)) { n_leads <- ncol(wide$y) } else if(n_leads > max(apply(1-wide$mask, 1, sum, na.rm = T)) + ncol(wide$y)) { n_leads <- max(apply(1-wide$mask, 1, sum, na.rm = T)) + ncol(wide$y) } ## if n_lags is NULL set it to the largest number of pre-treatment periods if(is.null(n_lags)) { n_lags <- ncol(wide$X) } else if(n_lags > ncol(wide$X)) { n_lags <- ncol(wide$X) } long_df <- data[c(quo_name(unit), quo_name(time), as.character(trt), as.character(outcome))] msynth <- multisynth_formatted(wide = wide, relative = T, n_leads = n_leads, n_lags = n_lags, nu = nu, lambda = lambda, V = V, force = force, n_factors = n_factors, scm = scm, time_cohort = time_cohort, time_w = F, lambda_t = 0, fit_resids = TRUE, eps_abs = eps_abs, eps_rel = eps_rel, verbose = verbose, long_df = long_df, how_match = how_match, ...) units <- data %>% arrange(!!unit) %>% distinct(!!unit) %>% pull(!!unit) rownames(msynth$weights) <- units if(scm) { ## Get imbalance for uniform weights on raw data ## TODO: Get rid of this stupid hack of just fitting the weights again with big lambda unif <- multisynth_qp(X=wide$X, ## X=residuals[,1:ncol(wide$X)], trt=wide$trt, mask=wide$mask, Z = Z[, ! colnames(Z) %in% wide$match_covariates, drop = F], n_leads=n_leads, n_lags=n_lags, relative=T, nu=0, lambda=1e10, V = V, time_cohort = time_cohort, donors = msynth$donors, eps_rel = eps_rel, eps_abs = eps_abs, verbose = verbose) ## scaled global balance ## msynth$scaled_global_l2 <- msynth$global_l2 / sqrt(sum(unif$imbalance[,1]^2)) msynth$scaled_global_l2 <- msynth$global_l2 / unif$global_l2 ## balance for individual estimates ## msynth$scaled_ind_l2 <- msynth$ind_l2 / sqrt(sum(unif$imbalance[,-1]^2)) msynth$scaled_ind_l2 <- msynth$ind_l2 / unif$ind_l2 } msynth$call <- call_name return(msynth) } #' Internal funciton to fit staggered synth with formatted data #' @param wide List containing data elements #' @param relative Whether to compute balance by relative time #' @param n_leads How long past treatment effects should be estimated for #' @param n_lags Number of pre-treatment periods to balance, default is to balance all periods #' @param nu Fraction of balance for individual balance #' @param lambda Regularization hyperparameter, default = 0 #' @param V Scaling matrix for synth optimization, default NULL is identity #' @param force c(0,1,2,3) what type of fixed effects to include #' @param n_factors Number of factors for interactive fixed effects, default does CV #' @param scm Whether to fit scm weights #' @param time_cohort Whether to average synthetic controls into time cohorts #' @param time_w Whether to fit time weights #' @param lambda_t Regularization for time regression #' @param fit_resids Whether to fit SCM on the residuals or not #' @param eps_abs Absolute error tolerance for osqp #' @param eps_rel Relative error tolerance for osqp #' @param verbose Whether to print logs for osqp #' @param long_df A long dataframe with 4 columns in the order unit, time, trt, outcome #' @param ... Extra arguments #' @noRd #' @return multisynth object multisynth_formatted <- function(wide, relative=T, n_leads, n_lags, nu, lambda, V, force, n_factors, scm, time_cohort, time_w, lambda_t, fit_resids, eps_abs, eps_rel, verbose, long_df, how_match, ...) { ## average together treatment groups ## grps <- unique(wide$trt) %>% sort() if(time_cohort) { grps <- unique(wide$trt[is.finite(wide$trt)]) } else { grps <- wide$trt[is.finite(wide$trt)] } J <- length(grps) ## fit outcome models if(time_w) { # Autoregressive model out <- fit_time_reg(cbind(wide$X, wide$y), wide$trt, n_leads, lambda_t, ...) y0hat <- out$y0hat residuals <- out$residuals params <- out$time_weights } else if(is.null(n_factors)) { out <- tryCatch({ fit_gsynth_multi(long_df, cbind(wide$X, wide$y), wide$trt, force=force) }, error = function(error_condition) { stop("Cannot run CV because there are too few pre-treatment periods.") }) y0hat <- out$y0hat params <- out$params n_factors <- ncol(params$factor) ## get residuals from outcome model residuals <- cbind(wide$X, wide$y) - y0hat } else if (n_factors != 0) { ## if number of factors is provided don't do CV out <- fit_gsynth_multi(long_df, cbind(wide$X, wide$y), wide$trt, r=n_factors, CV=0, force=force) y0hat <- out$y0hat params <- out$params ## get residuals from outcome model residuals <- cbind(wide$X, wide$y) - y0hat } else if(force == 0 & n_factors == 0) { # if no fixed effects or factors, just take out # control averages at each time point # time fixed effects from pure controls pure_ctrl <- cbind(wide$X, wide$y)[!is.finite(wide$trt), , drop = F] y0hat <- matrix(colMeans(pure_ctrl, na.rm = TRUE), nrow = nrow(wide$X), ncol = ncol(pure_ctrl), byrow = T) residuals <- cbind(wide$X, wide$y) - y0hat params <- NULL } else { ## take out pre-treatment averages fullmask <- cbind(wide$mask, matrix(0, nrow=nrow(wide$mask), ncol=ncol(wide$y))) out <- fit_feff(cbind(wide$X, wide$y), wide$trt, fullmask, force, time_cohort) y0hat <- out$y0hat residuals <- out$residuals params <- NULL } ## balance the residuals if(fit_resids) { if(time_w) { # fit scm on residuals after taking out unit fixed effects fullmask <- cbind(wide$mask, matrix(0, nrow=nrow(wide$mask), ncol=ncol(wide$y))) out <- fit_feff(cbind(wide$X, wide$y), wide$trt, fullmask, force, time_cohort) bal_mat <- lapply(out$residuals, function(x) x[,1:ncol(wide$X)]) } else if(typeof(residuals) == "list") { bal_mat <- lapply(residuals, function(x) x[,1:ncol(wide$X)]) } else { bal_mat <- residuals[,1:ncol(wide$X)] } } else { # if not balancing residuals, then take out control averages # for each time ctrl_avg <- matrix(colMeans(wide$X[!is.finite(wide$trt), , drop = F]), nrow = nrow(wide$X), ncol = ncol(wide$X), byrow = T) bal_mat <- wide$X - ctrl_avg bal_mat <- wide$X } if(scm) { # get eligible set of donor units based on covariates donors <- get_donors(wide$X, wide$y, wide$trt, wide$Z[, colnames(wide$Z) %in% wide$match_covariates, drop = F], time_cohort, n_lags, n_leads, how = how_match, exact_covariates = wide$exact_covariates, ...) # run separate synth for scaling sep_fit <- multisynth_qp(X=bal_mat, trt=wide$trt, mask=wide$mask, Z = wide$Z[, !colnames(wide$Z) %in% wide$match_covariates, drop = F], n_leads=n_leads, n_lags=n_lags, relative=relative, nu=0, lambda=lambda, V = V, time_cohort = time_cohort, donors = donors, eps_rel = eps_rel, eps_abs = eps_abs, verbose = verbose) # if no nu value is provided, use default based on # global and individual imbalance for separate synth if(is.null(nu)) { # select nu by triangle inequality ratio glbl <- sep_fit$global_l2 * sqrt(nrow(sep_fit$imbalance)) ind <- sep_fit$avg_l2 nu <- glbl / ind } msynth <- multisynth_qp(X=bal_mat, trt=wide$trt, mask=wide$mask, Z = wide$Z[, !colnames(wide$Z) %in% wide$match_covariates, drop = F], n_leads=n_leads, n_lags=n_lags, relative=relative, nu=nu, lambda=lambda, V = V, time_cohort = time_cohort, donors = donors, norm_pool = sep_fit$global_l2 ^ 2, norm_sep = sep_fit$ind_l2 ^ 2, eps_rel = eps_rel, eps_abs = eps_abs, verbose = verbose) } else { msynth <- list(weights = matrix(0, nrow = nrow(wide$X), ncol = J), imbalance=NA, global_l2=NA, ind_l2=NA) } ## put in data and hyperparams msynth$data <- wide msynth$relative <- relative msynth$n_leads <- n_leads msynth$n_lags <- n_lags msynth$nu <- nu msynth$lambda <- lambda msynth$scm <- scm msynth$time_cohort <- time_cohort msynth$grps <- grps msynth$y0hat <- y0hat msynth$residuals <- residuals msynth$n_factors <- n_factors msynth$force <- force ## outcome model parameters msynth$params <- params # more arguments msynth$scm <- scm msynth$time_w <- time_w msynth$lambda_t <- lambda_t msynth$fit_resids <- fit_resids msynth$extra_pars <- c(list(eps_abs = eps_abs, eps_rel = eps_rel, verbose = verbose), list(...)) msynth$long_df <- long_df msynth$how_match <- how_match msynth$donors <- donors ##format output class(msynth) <- "multisynth" return(msynth) } #' Get prediction of average outcome under control or ATT #' @param object Fit multisynth object #' @param att If TRUE, return the ATT, if FALSE, return imputed counterfactual #' @param att_weight Weights to place on individual units/cohorts when averaging #' @param bs_weight Weight to perturb units by for weighted bootstrap #' @param ... Optional arguments #' #' @return Matrix of predicted post-treatment control outcomes for each treated unit #' @export predict.multisynth <- function(object, att = F, att_weight = NULL, bs_weight = NULL, ...) { multisynth <- object relative <- T time_cohort <- multisynth$time_cohort if(is.null(relative)) { relative <- multisynth$relative } n_leads <- multisynth$n_leads d <- ncol(multisynth$data$X) n <- nrow(multisynth$data$X) fulldat <- cbind(multisynth$data$X, multisynth$data$y) ttot <- ncol(fulldat) grps <- multisynth$grps J <- length(grps) if(is.null(bs_weight)) { # bs_weight <- rep(1 / sqrt(sum(is.finite(multisynth$data$trt))), n) bs_weight <- rep(1, n) } if(time_cohort) { which_t <- lapply(grps, function(tj) (1:n)[multisynth$data$trt == tj]) mask <- unique(multisynth$data$mask) } else { which_t <- (1:n)[is.finite(multisynth$data$trt)] mask <- multisynth$data$mask } n1 <- sapply(1:J, function(j) length(which_t[[j]])) fullmask <- cbind(mask, matrix(0, nrow = J, ncol = (ttot - d))) ## estimate the post-treatment values to get att estimates mu1hat <- vapply(1:J, function(j) colMeans((bs_weight * fulldat)[which_t[[j]], , drop=FALSE]), numeric(ttot)) ## get average outcome model estimates and reweight residuals if(typeof(multisynth$y0hat) == "list") { mu0hat <- vapply(1:J, function(j) { y0hat <- colMeans( (bs_weight * multisynth$y0hat[[j]])[which_t[[j]], , drop=FALSE]) weightsj <- multisynth$weights[,j] * bs_weight resj <- multisynth$residuals[[j]][weightsj != 0,, drop = F] y0hat + t(resj) %*% weightsj[weightsj != 0] } , numeric(ttot) ) } else { mu0hat <- vapply(1:J, function(j) { y0hat <- colMeans( (bs_weight * multisynth$y0hat)[which_t[[j]], , drop=FALSE]) weightsj <- multisynth$weights[, j] * bs_weight resj <- multisynth$residuals[weightsj != 0,, drop = F] y0hat + t(resj) %*% weightsj[weightsj != 0] } , numeric(ttot) ) } tauhat <- mu1hat - mu0hat if(is.null(att_weight)) { att_weight <- rep(1, J) } ## re-index time if relative to treatment if(relative) { total_len <- min(d + n_leads, ttot + d - min(grps)) ## total length of predictions mu0hat <- vapply(1:J, function(j) { vec <- c(rep(NA, d-grps[j]), mu0hat[1:grps[j],j], mu0hat[(grps[j]+1):(min(grps[j] + n_leads, ttot)), j]) ## last row is post-treatment average c(vec, rep(NA, total_len - length(vec)), mean(mu0hat[(grps[j]+1):(min(grps[j] + n_leads, ttot)), j])) }, numeric(total_len +1 )) tauhat <- vapply(1:J, function(j) { vec <- c(rep(NA, d-grps[j]), tauhat[1:grps[j],j], tauhat[(grps[j]+1):(min(grps[j] + n_leads, ttot)), j]) ## last row is post-treatment average c(vec, rep(NA, total_len - length(vec)), mean(tauhat[(grps[j]+1):(min(grps[j] + n_leads, ttot)), j])) }, numeric(total_len +1 )) # re-index unit weights if they change over time if(is.null(dim(att_weight))) { if(J == 1) { att_weight <- matrix(replicate(total_len + 1, att_weight), ncol = 1) } else { att_weight <- t(replicate(total_len + 1, att_weight)) } } att_weight_new <- vapply(1:J, function(j) { vec <- c(rep(NA, d-grps[j]), att_weight[1:grps[j],j], att_weight[(grps[j]+1):(min(grps[j] + n_leads, ttot)), j]) ## last row is post-treatment average c(vec, rep(NA, total_len - length(vec)), mean(att_weight[(grps[j]+1):(min(grps[j] + n_leads, ttot)), j])) }, numeric(total_len +1 )) ## get the overall average estimate avg <- apply(mu0hat, 1, function(z) sum(n1 * z, na.rm=T) / sum(n1 * !is.na(z))) avg <- sapply(1:nrow(mu0hat), function(k) { sum(n1 * mu0hat[k,] * att_weight_new[k,], na.rm=T) / sum(n1 * (!is.na(mu0hat[k,])) * att_weight_new[k, ], na.rm = T) }) mu0hat <- cbind(avg, mu0hat) avg <- apply(tauhat, 1, function(z) sum(n1 * z, na.rm=T) / sum(n1 * !is.na(z))) avg <- sapply(1:nrow(mu0hat), function(k) { sum(n1 * tauhat[k,] * att_weight_new[k,], na.rm=T) / sum(n1 * (!is.na(tauhat[k,])) * att_weight_new[k, ], na.rm = T) }) tauhat <- cbind(avg, tauhat) } else { ## remove all estimates for t > T_j + n_leads vapply(1:J, function(j) c(mu0hat[1:min(grps[j]+n_leads, ttot),j], rep(NA, max(0, ttot-(grps[j] + n_leads)))), numeric(ttot)) -> mu0hat vapply(1:J, function(j) c(tauhat[1:min(grps[j]+n_leads, ttot),j], rep(NA, max(0, ttot-(grps[j] + n_leads)))), numeric(ttot)) -> tauhat ## only average currently treated units avg1 <- rowSums(t(fullmask) * mu0hat * n1) / rowSums(t(fullmask) * n1) avg2 <- rowSums(t(1-fullmask) * mu0hat * n1) / rowSums(t(1-fullmask) * n1) avg <- replace_na(avg1, 0) * apply(fullmask, 2, min) + replace_na(avg2,0) * apply(1-fullmask, 2, max) cbind(avg, mu0hat) -> mu0hat ## only average currently treated units avg1 <- rowSums(t(fullmask) * tauhat * n1) / rowSums(t(fullmask) * n1) avg2 <- rowSums(t(1-fullmask) * tauhat * n1) / rowSums(t(1-fullmask) * n1) avg <- replace_na(avg1, 0) * apply(fullmask, 2, min) + replace_na(avg2,0) * apply(1 - fullmask, 2, max) cbind(avg, tauhat) -> tauhat } if(att) { return(tauhat) } else { return(mu0hat) } } #' Print function for multisynth #' @param x multisynth object #' @param ... Optional arguments #' @export print.multisynth <- function(x, att_weight = NULL, ...) { multisynth <- x ## straight from lm cat("\nCall:\n", paste(deparse(multisynth$call), sep="\n", collapse="\n"), "\n\n", sep="") # print att estimates att_post <- predict(multisynth, att=T, att_weight = att_weight)[,1] att_post <- att_post[length(att_post)] cat(paste("Average ATT Estimate: ", format(round(mean(att_post),3), nsmall = 3), "\n\n", sep="")) } #' Plot function for multisynth #' @importFrom graphics plot #' @param x Augsynth object to be plotted #' @param inf_type Type of inference to perform: #' \itemize{ #' \item{bootstrap}{Wild bootstrap, the default option} #' \item{jackknife}{Jackknife} #' } #' @param inf Whether to compute and plot confidence intervals #' @param levels Which units/groups to plot, default is every group #' @param label Whether to label the individual levels #' @param weights Whether to plot the weights, default = FALSE #' @param ... Optional arguments #' @export plot.multisynth <- function(x, inf_type = "bootstrap", inf = T, levels = NULL, label = T, weights = FALSE, ...) { if(weights) { ever_trt <- x$data$units[is.finite(x$data$trt)] never_trt <- x$data$units[!is.finite(x$data$trt)] weights <- data.frame(x$weights) colnames(weights) <- ever_trt weights$unit <- factor(rownames(weights), levels = sort(rownames(weights), decreasing = TRUE)) # plotting the weights weights %>% tidyr::pivot_longer(-unit, names_to = "trt_unit", values_to = "weight") %>% ggplot2::ggplot(aes(x = trt_unit, y = unit, fill = round(weight, 3))) + ggplot2::geom_tile(color = "white", size=.5) + ggplot2::scale_fill_gradient(low = "white", high = "black", limits=c(-0.01,1.01)) + ggplot2::guides(fill = "none") + ggplot2::xlab("Treated Unit") + ggplot2::ylab("Donor Unit") + ggplot2::theme_bw() + ggplot2::theme(axis.ticks.x = ggplot2::element_blank(), axis.ticks.y = ggplot2::element_blank()) } else { plot(summary(x, inf_type = inf_type, ...), inf = inf, levels = levels, label = label) } } #' Summary function for multisynth #' @param object multisynth object #' @param inf_type Type of inference to perform: #' \itemize{ #' \item{bootstrap}{Wild bootstrap, the default option} #' \item{jackknife}{Jackknife} #' } #' @param ... Optional arguments #' #' @return summary.multisynth object that contains: #' \itemize{ #' \item{"att"}{Dataframe with ATT estimates, standard errors for each treated unit} #' \item{"global_l2"}{L2 imbalance for the pooled synthetic control} #' \item{"scaled_global_l2"}{L2 imbalance for the pooled synthetic control, scaled by the imbalance for unitform weights} #' \item{"ind_l2"}{Average L2 imbalance for the individual synthetic controls} #' \item{"scaled_ind_l2"}{Average L2 imbalance for the individual synthetic controls, scaled by the imbalance for unitform weights} #' \item{"n_leads", "n_lags"}{Number of post treatment outcomes (leads) and pre-treatment outcomes (lags) to include in the analysis} #' } #' @export summary.multisynth <- function(object, inf_type = "bootstrap", att_weight = NULL, ...) { multisynth <- object relative <- T n_leads <- multisynth$n_leads d <- ncol(multisynth$data$X) n <- nrow(multisynth$data$X) ttot <- d + ncol(multisynth$data$y) trt <- multisynth$data$trt time_cohort <- multisynth$time_cohort if(time_cohort) { grps <- unique(trt[is.finite(trt)]) which_t <- lapply(grps, function(tj) (1:n)[trt == tj]) } else { grps <- trt[is.finite(trt)] which_t <- (1:n)[is.finite(trt)] } # grps <- unique(multisynth$data$trt) %>% sort() J <- length(grps) # which_t <- (1:n)[is.finite(multisynth$data$trt)] times <- multisynth$data$time summ <- list() ## post treatment estimate for each group and overall # att <- predict(multisynth, relative, att=T) if(inf_type == "jackknife") { attse <- jackknife_se_multi(multisynth, relative, att_weight = att_weight, ...) } else if(inf_type == "bootstrap") { if(object$force == 2) { warning("Wild bootstrap without including a unit fixed effect ", "is likely to be very conservative!") } attse <- weighted_bootstrap_multi(multisynth, att_weight = att_weight, ...) } else { att <- predict(multisynth, relative, att=T, att_weight = att_weight) attse <- list(att = att, se = matrix(NA, nrow(att), ncol(att)), upper_bound = matrix(NA, nrow(att), ncol(att)), lower_bound = matrix(NA, nrow(att), ncol(att))) } if(relative) { att <- data.frame(cbind(c(-(d-1):min(n_leads, ttot-min(grps)), NA), attse$att)) if(time_cohort) { col_names <- c("Time", "Average", as.character(times[grps + 1])) } else { col_names <- c("Time", "Average", as.character(multisynth$data$units[which_t])) } names(att) <- col_names att %>% gather(Level, Estimate, -Time) %>% rename("Time"=Time) %>% mutate(Time=Time-1) -> att se <- data.frame(cbind(c(-(d-1):min(n_leads, ttot-min(grps)), NA), attse$se)) names(se) <- col_names se %>% gather(Level, Std.Error, -Time) %>% rename("Time"=Time) %>% mutate(Time=Time-1) -> se lower_bound <- data.frame(cbind(c(-(d-1):min(n_leads, ttot-min(grps)), NA), attse$lower_bound)) names(lower_bound) <- col_names lower_bound %>% gather(Level, lower_bound, -Time) %>% rename("Time"=Time) %>% mutate(Time=Time-1) -> lower_bound upper_bound <- data.frame(cbind(c(-(d-1):min(n_leads, ttot-min(grps)), NA), attse$upper_bound)) names(upper_bound) <- col_names upper_bound %>% gather(Level, upper_bound, -Time) %>% rename("Time"=Time) %>% mutate(Time=Time-1) -> upper_bound } else { att <- data.frame(cbind(times, attse$att)) names(att) <- c("Time", "Average", times[grps[1:J]]) att %>% gather(Level, Estimate, -Time) -> att se <- data.frame(cbind(times, attse$se)) names(se) <- c("Time", "Average", times[grps[1:J]]) se %>% gather(Level, Std.Error, -Time) -> se } summ$att <- inner_join(att, se, by = c("Time", "Level")) %>% inner_join(lower_bound, by = c("Time", "Level")) %>% inner_join(upper_bound, by = c("Time", "Level")) summ$relative <- relative summ$grps <- grps summ$call <- multisynth$call summ$global_l2 <- multisynth$global_l2 summ$scaled_global_l2 <- multisynth$scaled_global_l2 summ$ind_l2 <- multisynth$ind_l2 summ$scaled_ind_l2 <- multisynth$scaled_ind_l2 summ$n_leads <- multisynth$n_leads summ$n_lags <- multisynth$n_lags class(summ) <- "summary.multisynth" return(summ) } #' Print function for summary function for multisynth #' @param x summary object #' @param level Which unit/group to print results for, default is the overall average #' @param ... Optional arguments #' @export print.summary.multisynth <- function(x, level = "Average", ...) { summ <- x ## straight from lm cat("\nCall:\n", paste(deparse(summ$call), sep="\n", collapse="\n"), "\n\n", sep="") first_lvl <- summ$att %>% filter(Level != "Average") %>% pull(Level) %>% min() ## get ATT estimates for treatment level, post treatment if(summ$relative) { summ$att %>% filter(Time >= 0, Level==level) %>% rename("Time Since Treatment"=Time) -> att_est } else if(level == "average") { summ$att %>% filter(Time > first_lvl, Level=="Average") -> att_est } else { summ$att %>% filter(Time > level, Level==level) -> att_est } cat(paste("Average ATT Estimate (Std. Error): ", summ$att %>% filter(Level == level, is.na(Time)) %>% pull(Estimate) %>% round(3) %>% format(nsmall=3), " (", summ$att %>% filter(Level == level, is.na(Time)) %>% pull(Std.Error) %>% round(3) %>% format(nsmall=3), ")\n\n", sep="")) cat(paste("Global L2 Imbalance: ", format(round(summ$global_l2,3), nsmall=3), "\n", "Scaled Global L2 Imbalance: ", format(round(summ$scaled_global_l2,3), nsmall=3), "\n", "Percent improvement from uniform global weights: ", format(round(1-summ$scaled_global_l2,3)*100), "\n\n", "Individual L2 Imbalance: ", format(round(summ$ind_l2,3), nsmall=3), "\n", "Scaled Individual L2 Imbalance: ", format(round(summ$scaled_ind_l2,3), nsmall=3), "\n", "Percent improvement from uniform individual weights: ", format(round(1-summ$scaled_ind_l2,3)*100), "\t", "\n\n", sep="")) print(att_est, row.names=F) } #' Plot function for summary function for multisynth #' @importFrom ggplot2 aes #' #' @param x summary object #' @param inf Whether to plot confidence intervals #' @param levels Which units/groups to plot, default is every group #' @param label Whether to label the individual levels #' @param weights Whether to plot the weights, default = FALSE #' @param ... Optional arguments #' @export plot.summary.multisynth <- function(x, inf = T, levels = NULL, label = T, weights = FALSE, ...) { if(weights) { ever_trt <- x$data$units[is.finite(x$data$trt)] never_trt <- x$data$units[!is.finite(x$data$trt)] weights <- data.frame(x$weights) colnames(weights) <- ever_trt weights$unit <- factor(rownames(weights), levels = sort(rownames(weights), decreasing = TRUE)) # plotting the weights weights %>% tidyr::pivot_longer(-unit, names_to = "trt_unit", values_to = "weight") %>% ggplot2::ggplot(aes(x = trt_unit, y = unit, fill = round(weight, 3))) + ggplot2::geom_tile(color = "white", size=.5) + ggplot2::scale_fill_gradient(low = "white", high = "black", limits=c(-0.01,1.01)) + ggplot2::guides(fill = "none") + ggplot2::xlab("Treated Unit") + ggplot2::ylab("Donor Unit") + ggplot2::theme_bw() + ggplot2::theme(axis.ticks.x = ggplot2::element_blank(), axis.ticks.y = ggplot2::element_blank()) } summ <- x ## get the last time period for each level summ$att %>% filter(!is.na(Estimate), Time >= -summ$n_lags, Time <= summ$n_leads) %>% group_by(Level) %>% summarise(last_time = max(Time)) -> last_times if(is.null(levels)) levels <- unique(summ$att$Level) summ$att %>% inner_join(last_times) %>% filter(Level %in% levels) %>% mutate(label = ifelse(Time == last_time, Level, NA), is_avg = ifelse(("Average" %in% levels) * (Level == "Average"), "A", "B")) %>% ggplot2::ggplot(ggplot2::aes(x = Time, y = Estimate, group = Level, color = is_avg, alpha = is_avg)) + ggplot2::geom_line(size = 1) + ggplot2::geom_point(size = 1) -> p if(label) { p <- p + ggrepel::geom_label_repel(ggplot2::aes(label = label), nudge_x = 1, na.rm = T) } p <- p + ggplot2::geom_hline(yintercept = 0, lty = 2) if(summ$relative) { p <- p + ggplot2::geom_vline(xintercept = 0, lty = 2) + ggplot2::xlab("Time Relative to Treatment") } else { p <- p + ggplot2::geom_vline(aes(xintercept = as.numeric(Level)), lty = 2, alpha = 0.5, summ$att %>% filter(Level != "Average")) } ## add ses if(inf) { max_time <- max(summ$att$Time, na.rm = T) if(max_time == 0) { error_plt <- ggplot2::geom_errorbar clr <- "black" alph <- 1 } else { error_plt <- ggplot2::geom_ribbon clr <- NA alph <- 0.2 } if("Average" %in% levels) { p <- p + error_plt( ggplot2::aes(ymin=lower_bound, ymax=upper_bound), alpha = alph, color=clr, data = summ$att %>% filter(Level == "Average", Time >= 0)) } else { p <- p + error_plt( ggplot2::aes(ymin=lower_bound, ymax=upper_bound), data = . %>% filter(Time >= 0), alpha = alph, color = clr) } } p <- p + ggplot2::scale_alpha_manual(values=c(1, 0.5)) + ggplot2::scale_color_manual(values=c("#333333", "#818181")) + ggplot2::guides(alpha=F, color=F) + ggplot2::theme_bw() return(p) } ================================================ FILE: R/outcome_models.R ================================================ ################################################################################ ## Code to fit various outcome models ################################################################################ #' Use a separate regularized regression for each post period #' to fit E[Y(0)|X] #' @importFrom stats poly #' @importFrom stats coef #' #' @param X Matrix of covariates/lagged outcomes #' @param y Matrix of post-period outcomes #' @param trt Vector of treatment indicator #' @param alpha Mixing between L1 and L2, default: 1 (LASSO) #' @param lambda Regularization hyperparameter, if null then CV #' @param poly_order Order of polynomial to fit, default 1 #' @param type How to fit outcome model(s) #' \itemize{ #' \item{sep }{Separate outcome models} #' \item{avg }{Average responses into 1 outcome} #' \item{multi }{Use multi response regression in glmnet}} #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Regression parameters}} fit_prog_reg <- function(X, y, trt, alpha=1, lambda=NULL, poly_order=1, type="sep", ...) { if(!requireNamespace("glmnet", quietly = TRUE)) { stop("In order to fit an elastic net outcome model, you must install the glmnet package.") } extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters when using elastic net: ", paste(names(extra_params), collapse = ", ")) } X <- matrix(poly(matrix(X),degree=poly_order), nrow=dim(X)[1]) ## helper function to fit regression with CV outfit <- function(x, y) { if(is.null(lambda)) { lam <- glmnet::cv.glmnet(x, y, alpha=alpha, grouped=FALSE)$lambda.min } else { lam <- lambda } fit <- glmnet::glmnet(x, y, alpha=alpha, lambda=lam) return(as.matrix(coef(fit))) } if(type=="avg") { ## if fitting the average post period value, stack post periods together stacky <- c(y) stackx <- do.call(rbind, lapply(1:dim(y)[2], function(x) X)) stacktrt <- rep(trt, dim(y)[2]) regweights <- outfit(stackx[stacktrt==0,], stacky[stacktrt==0]) } else if(type=="sep"){ ## fit separate regressions for each post period regweights <- apply(as.matrix(y), 2, function(yt) outfit(X[trt==0,], yt[trt==0])) } else { ## fit multi response regression lam <- glmnet::cv.glmnet(X, y, family="mgaussian", alpha=alpha, grouped=FALSE)$lambda.min fit <- glmnet::glmnet(X, y, family="mgaussian", alpha=alpha, lambda=lam) regweights <- as.matrix(do.call(cbind, coef(fit))) } ## Get predicted values y0hat <- cbind(rep(1, dim(X)[1]), X) %*% regweights return(list(y0hat = y0hat, params = regweights)) } #' Use a separate random forest regression for each post period #' to fit E[Y(0)|X] #' #' @param X Matrix of covariates/lagged outcomes #' @param y Matrix of post-period outcomes #' @param trt Vector of treatment indicator #' @param avg Predict the average post-treatment outcome #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Regression parameters}} fit_prog_rf <- function(X, y, trt, avg=FALSE, ...) { if(!requireNamespace("randomForest", quietly = TRUE)) { stop("In order to fit a random forest outcome model, you must install the randomForest package.") } extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters when using random forest: ", paste(names(extra_params), collapse = ", ")) } ## helper function to fit RF outfit <- function(x, y) { fit <- randomForest::randomForest(x, y) return(fit) } if(avg | dim(y)[2] == 1) { ## if fitting the average post period value, stack post periods together stacky <- c(y) stackx <- do.call(rbind, lapply(1:dim(y)[2], function(x) X)) stacktrt <- rep(trt, dim(y)[2]) fit <- outfit(stackx[stacktrt==0,], stacky[stacktrt==0]) ## predict outcome y0hat <- matrix(predict(fit, X), ncol=1) ## keep feature importances imports <- randomForest::importance(fit) } else { ## fit separate regressions for each post period fits <- apply(as.matrix(y), 2, function(yt) outfit(X[trt==0,], yt[trt==0])) ## predict outcome y0hat <- lapply(fits, function(fit) as.matrix(predict(fit,X))) %>% bind_rows() %>% as.matrix() ## keep feature importances imports <- lapply(fits, function(fit) randomForest::importance(fit)) %>% bind_rows() %>% as.matrix() } return(list(y0hat=y0hat, params=imports)) } #' Use gsynth to fit factor model for E[Y(0)|X] #' #' @param X Matrix of covariates/lagged outcomes #' @param y Matrix of post-period outcomes #' @param trt Vector of treatment indicator #' @param r Number of factors to use (or start with if CV==1) #' @param r.end Max number of factors to consider if CV==1 #' @param force Fixed effects (0=none, 1=unit, 2=time, 3=two-way) #' @param CV Whether to do CV (0=no CV, 1=yes CV) #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Regression parameters}} fit_prog_gsynth <- function(X, y, trt, r=0, r.end=5, force=3, CV=1, ...) { if(!requireNamespace("gsynth", quietly = TRUE)) { stop("In order to fit generalized synthetic controls, you must install the gsynth package.") } extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters when using gSynth: ", paste(names(extra_params), collapse = ", ")) } df_x = data.frame(X, check.names=FALSE) df_x$unit = rownames(df_x) df_x$trt = rep(0, nrow(df_x)) df_x <- df_x %>% select(unit, trt, everything()) long_df_x = gather(df_x, time, obs, -c(unit,trt)) df_y = data.frame(y, check.names=FALSE) df_y$unit = rownames(df_y) df_y$trt = trt df_y <- df_y %>% select(unit, trt, everything()) long_df_y = gather(df_y, time, obs, -c(unit,trt)) long_df = rbind(long_df_x, long_df_y) transform(long_df, time = as.numeric(time)) transform(long_df, unit = as.numeric(unit)) gsyn <- gsynth::gsynth(data = long_df, Y = "obs", D = "trt", index = c("unit", "time"), force = force, CV = CV, r = r) t0 <- dim(X)[2] t_final <- t0 + dim(y)[2] n <- dim(X)[1] ## get predicted outcomes y0hat <- matrix(0, nrow=n, ncol=(t_final-t0)) y0hat[trt==0,] <- t(gsyn$Y.co[(t0+1):t_final,,drop=FALSE] - gsyn$est.co$residuals[(t0+1):t_final,,drop=FALSE]) y0hat[trt==1,] <- gsyn$Y.ct[(t0+1):t_final,] ## add treated prediction for whole pre-period gsyn$est.co$Y.ct <- gsyn$Y.ct ## control and treated residuals gsyn$est.co$ctrl_resids <- gsyn$est.co$residuals gsyn$est.co$trt_resids <- colMeans(cbind(X[trt==1,,drop=FALSE], y[trt==1,,drop=FALSE])) - rowMeans(gsyn$est.co$Y.ct) return(list(y0hat=y0hat, params=gsyn$est.co)) } #' Use Athey (2017) matrix completion panel data code #' #' @param X Matrix of covariates/lagged outcomes #' @param y Matrix of post-period outcomes #' @param trt Vector of treatment indicator #' @param unit_fixed Whether to estimate unit fixed effects #' @param time_fixed Whether to estimate time fixed effects #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Regression parameters}} fit_prog_mcpanel <- function(X, y, trt, unit_fixed=1, time_fixed=1, ...) { if(!requireNamespace("MCPanel", quietly = TRUE)) { stop("In order to fit matrix completion, you must install the MCPanel package.") } extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters when using MCPanel: ", paste(names(extra_params), collapse = ", ")) } ## create matrix and missingness matrix t0 <- dim(X)[2] t_final <- t0 + dim(y)[2] n <- dim(X)[1] fullmat <- cbind(X, y) maskmat <- matrix(1, nrow=nrow(fullmat), ncol=ncol(fullmat)) maskmat[trt==1, (t0+1):t_final] <- 0 ## estimate matrix mcp <- MCPanel::mcnnm_cv(fullmat, maskmat, to_estimate_u=unit_fixed, to_estimate_v=time_fixed) ## impute matrix imp_mat <- mcp$L + sweep(matrix(0, nrow=nrow(fullmat), ncol=ncol(fullmat)), 1, mcp$u, "+") + # unit fixed sweep(matrix(0, nrow=nrow(fullmat), ncol=ncol(fullmat)), 2, mcp$v, "+") # time fixed trtmat <- matrix(0, ncol=n, nrow=t_final) trtmat[t0:t_final, trt == 1] <- 1 ## get predicted outcomes y0hat <- imp_mat[,(t0+1):t_final,drop=FALSE] params <- mcp params$trt_resids <- colMeans(cbind(X[trt==1,,drop=FALSE], y[trt==1,,drop=FALSE])) - rowMeans(imp_mat[trt==1,,drop=FALSE]) params$ctrl_resids <- t(cbind(X[trt==0,,drop=FALSE], y[trt==0,,drop=FALSE]) - imp_mat[trt==0,,drop=FALSE]) params$Y.ct <- t(imp_mat[trt==1,,drop=FALSE]) return(list(y0hat=y0hat, params=params)) } #' Fit a Comparitive interupted time series #' to fit E[Y(0)|X] #' @importFrom stats lm #' @importFrom stats predict #' #' @param X Matrix of covariates/lagged outcomes #' @param y Matrix of post-period outcomes #' @param trt Vector of treatment indicator #' @param poly_order Order of time trend polynomial to fit, default 1 #' @param weights Weights to use in WLS, default is no weights #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Regression parameters}} fit_prog_cits <- function(X, y, trt, poly_order=1, weights=NULL, ...) { extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters when using CITS: ", paste(names(extra_params), collapse = ", ")) } ## combine back into a panel structure ids <- 1:nrow(X) t0 <- dim(X)[2] t_final <- t0 + dim(y)[2] n <- nrow(X) if(is.null(weights)) { weights <- rep(1, n) } pnl1 <- data.frame(X) colnames(pnl1) <- 1:(t0) pnl1 <- pnl1 %>% mutate(trt=trt, post=0, id=ids, weight=weights) %>% gather(time, val, -trt, -post, -id, -weight) %>% mutate(time=as.numeric(time)) pnl2 <- data.frame(y) colnames(pnl2) <- (t0+1):t_final pnl2 <- pnl2 %>% mutate(trt=trt, post=1, id=ids, weight=weights) %>% gather(time, val, -trt, -post, -id, -weight) %>% mutate(time=as.numeric(time)) pnl <- bind_rows(pnl1, pnl2) ## fit regression if(poly_order == "fixed") { fit <- pnl %>% filter(!((post==1) & (trt==1))) %>% ## filter out post-period treated outcomes lm(val ~ as.factor(id) + as.factor(time), ., weights = .$weight ) } else if(poly_order > 0) { fit <- pnl %>% filter(!((post==1) & (trt==1))) %>% ## filter out post-period treated outcomes lm(val ~ poly(time, poly_order) + post + trt + poly(time * trt, poly_order), ., weights = .$weight ) } else { fit <- pnl %>% filter(!((post==1) & (trt==1))) %>% ## filter out post-period treated outcomes lm(val ~ post + trt, ., weights = .$weight ) } ## get predicted post-period outcomes y0hat <- matrix(0, nrow=n, ncol=(t_final-t0)) y0hat[trt==0,] <- matrix(predict(fit, pnl %>% filter(post==1 & trt==0)), ncol=ncol(y)) y0hat[trt==1,] <- matrix(predict(fit, pnl %>% filter(post==1 & trt==1)), ncol=ncol(y)) params <- list() ## add treated prediction for whole pre-period params$Y.ct <- matrix(predict(fit, pnl %>% filter(trt==1), ncol=(ncol(X) + ncol(y)))) ## and control prediction ctrl_pred <- matrix(predict(fit, pnl %>% filter(trt==0)), ncol=(ncol(X) + ncol(y))) ## control and treated residuals params$ctrl_resids <- t(cbind(X[trt==0,,drop=FALSE], y[trt==0,,drop=FALSE])) - t(ctrl_pred) params$trt_resids <- colMeans(cbind(X[trt==1,,drop=FALSE], y[trt==1,,drop=FALSE])) - rowMeans(params$Y.ct) return(list(y0hat=y0hat, params=params)) } #' Fit a bayesian structural time series #' to fit E[Y(0)|X] #' #' @param X Matrix of covariates/lagged outcomes #' @param y Matrix of post-period outcomes #' @param trt Vector of treatment indicator #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Model parameters}} fit_prog_causalimpact <- function(X, y, trt, ...) { if(!requireNamespace("CausalImpact", quietly = TRUE)) { stop("In order to fit bayesian structural time series, you must install the CausalImpact package.") } extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters using Bayesian structural time series with CausalImpact: ", paste(names(extra_params), collapse = ", ")) } ## structure data accordingly ids <- 1:nrow(X) t0 <- dim(X)[2] t_final <- t0 + dim(y)[2] n <- nrow(X) comb <- cbind(X, y) imp_dat <- t(rbind(colMeans(comb[trt==1,,drop=F]), comb[trt==0,,drop=F])) ## get predicted post-period outcomes ## TODO: is this the way to use CausalImpact?? ci_func <- function(i) { ## fit causal impact using controls CausalImpact::CausalImpact(t(rbind(comb[i,], comb[-i,][trt[-i]==0,])), pre.period=c(1, t0), post.period=c(t0+1, t_final) )$series$point.pred } y0hat <- t(sapply(1:n, ci_func)) params <- list() ## add treated prediction for whole pre-period params$Y.ct <- t(y0hat[trt==1,,drop=F]) ## and control prediction ctrl_pred <- y0hat[trt==0,,drop=F] ## control and treated residuals params$ctrl_resids <- t(cbind(X[trt==0,,drop=FALSE], y[trt==0,,drop=FALSE])) - t(ctrl_pred) params$trt_resids <- colMeans(cbind(X[trt==1,,drop=FALSE], y[trt==1,,drop=FALSE])) - rowMeans(params$Y.ct) return(list(y0hat=y0hat[,(t0+1):t_final, drop=F], params=params)) } #' Fit a seq2seq model with a feedforward net #' to fit E[Y(0)|X] #' #' @param X Matrix of covariates/lagged outcomes #' @param y Matrix of post-period outcomes #' @param trt Vector of treatment indicator #' @param layers List of (n_hidden_units, activation function) pairs to define layers #' @param epochs Number of epochs for training #' @param patience Number of epochs to wait before early stopping #' @param val_split Proportion of control units to use for validation #' @param verbose Whether to print training progress #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Model parameters}} fit_prog_seq2seq <- function(X, y, trt, layers=list(c(50, "relu"), c(5, "relu")), epochs=500, patience=5, val_split=0.2, verbose=F, ...) { if(!requireNamespace("keras", quietly = TRUE)) { stop("In order to fit a neural network, you must install the keras package.") } extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters when building sequence to sequence learning with feedforward nets: ", paste(names(extra_params), collapse = ", ")) } ## structure data accordingly ids <- 1:nrow(X) t0 <- dim(X)[2] t_final <- t0 + dim(y)[2] n <- nrow(X) Xctrl <- X[trt==0,,drop=F] yctrl <- y[trt==0,,drop=F] ## create first layer model <- keras::keras_model_sequential() %>% keras::layer_dense(units = layers[[1]][1], activation = layers[[1]][2], input_shape = ncol(Xctrl)) ## add layers for(layer in layers[-1]) { model %>% keras::layer_dense(units = layer[1], activation = layer[2]) } ## output lyaer model %>% keras::layer_dense(units=ncol(yctrl)) ## compile model %>% keras::compile(optimizer="rmsprop", loss="mse", metrics=c("mae")) ## fit model learn <- model %>% keras::fit(x=Xctrl, y=yctrl, epochs=epochs, batch_size=nrow(Xctrl), validation_split=val_split, callbacks=list(keras::callback_early_stopping(patience=patience)), verbose=verbose) ## predict for everything y0hat <- model %>% predict(X) params=list(model=model, learn=learn) return(list(y0hat=y0hat, params=params)) } ================================================ FILE: R/outcome_multi.R ================================================ ################################################################################ ## Fitting outcome models for multiple treatment groups ################################################################################ #' Use gsynth to fit factor model with #' @importFrom utils capture.output #' @param long_df A long dataframe with 4 columns in the order unit, time, trt, outcome #' @param X Matrix of outcomes #' @param trt Vector of treatment status for each unit #' @param r Number of factors to use (or start with if CV==1) #' @param r.end Max number of factors to consider if CV==1 #' @param force Fixed effects (0=none, 1=unit, 2=time, 3=two-way) #' @param CV Whether to do CV (0=no CV, 1=yes CV) #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Regression parameters}} fit_gsynth_multi <- function(long_df, X, trt, r=0, force=3, CV=1) { if(!requireNamespace("gsynth", quietly = TRUE)) { stop("In order to fit generalized synthetic controls, you must install the gsynth package.") } ttot <- ncol(X) n <- nrow(X) labels <- colnames(long_df) gsyn <- gsynth::gsynth(data = long_df, Y = labels[4], D = labels[3], index = c(labels[1], labels[2]), force = force, CV = CV, r=r) y0hat <- matrix(0, nrow=n, ncol=ttot) y0hat[!is.finite(trt),] <- t(gsyn$Y.co - gsyn$est.co$residuals) y0hat[is.finite(trt),] <- t(gsyn$Y.ct) ## add treated prediction for whole pre-period gsyn$est.co$Y.ct <- gsyn$Y.ct return(list(y0hat=y0hat, params=gsyn$est.co)) } #' Get fixed effects from pre-treatment data for each level #' #' @param X Matrix of outcomes #' @param trt Vector of treatment status for each unit #' @param mask Matrix of treatment statuses #' @param force Fixed effects: 1="unit", 2="time", 3="two-way" #' @param time_cohort Boolean indicating whether to use time cohorts #' @noRd #' @return \itemize{ #' \item{y0hat }{Predicted outcome under control} #' \item{params }{Regression parameters}} fit_feff <- function(X, trt, mask, force, time_cohort) { ttot <- ncol(X) n <- nrow(X) # grps <- trt[is.finite(trt)] # iterate over treatment cohorts grps <- unique(trt[is.finite(trt)]) J <- length(grps) which_t <- (1:n)[is.finite(trt)] if(force %in% c(2,3)) { ## compute time fixed effects from pure controls time_eff <- matrix(colMeans(X[!is.finite(trt),, drop = F], na.rm = TRUE), nrow=nrow(X), ncol=ncol(X), byrow=T) } else { time_eff <- matrix(0, nrow = nrow(X), ncol = ncol(X)) } residuals <- X - time_eff y0hat <- time_eff if(force %in% c(1,3)) { ## compute unit fixed effects from pre-intervention outcomes unit_eff <- lapply(grps, function(tj) matrix( rowMeans(residuals[, 1:tj, drop = F], na.rm = TRUE), nrow=nrow(X), ncol=ncol(X))) residuals <- lapply(1:J, function(j) residuals - unit_eff[[j]]) y0hat <- unit_eff } if(force == 3) { y0hat <- lapply(unit_eff, function(ufj) time_eff + ufj) } # go from treatment cohorts to individuals if(force %in% c(1,3) & !time_cohort) { names(residuals) <- as.character(grps) residuals <- residuals[as.character(trt[is.finite(trt)])] names(y0hat) <- as.character(grps) y0hat <- y0hat[as.character(trt[is.finite(trt)])] } return(list(y0hat = y0hat, residuals = residuals)) } ================================================ FILE: R/ridge.R ================================================ ################################################################################ ## Ridge-augmented SCM ################################################################################ #' Ridge augmented weights (possibly with covariates) #' #' @param wide_data Output of `format_data` #' @param synth_data Output of `format_synth` #' @param Z Matrix of covariates, default is NULL #' @param lambda Ridge hyper-parameter, if NULL use CV #' @param ridge Include ridge or not #' @param scm Include SCM or not #' @param lambda_min_ratio Ratio of the smallest to largest lambda when tuning lambda values #' @param n_lambda Number of lambdas to consider between the smallest and largest lambda value #' @param lambda_max Initial (largest) lambda, if NULL sets it to be (1+norm(X_1-X_c))^2 #' @param holdout_length Length of conseuctive holdout period for when tuning lambdas #' @param min_1se If TRUE, chooses the maximum lambda within 1 standard error of the lambda that minimizes the CV error, if FALSE chooses the optimal lambda; default TRUE #' @param V V matrix for synth, default NULL #' @param residualize Whether to residualize auxiliary covariates or balance directly, default TRUE #' @param ... optional arguments for outcome model #' @noRd #' @return \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} #' \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} #' \item{"mhat"}{Outcome model estimate (zero in this case)} #' \item{"lambda"}{Value of the ridge hyperparameter} #' \item{"ridge_mhat"}{The ridge regression predictions (for estimating the bias)} #' \item{"synw"}{The synth weights(for estimating the bias)} #' \item{"lambdas"}{List of lambda values evaluated to tune ridge regression} #' \item{"lambda_errors"}{"The MSE associated with each lambda term in lambdas."} #' \item{"lambda_errors_se"}{"The SE of the MSE associated with each lambda term in lambdas."} #' } fit_ridgeaug_formatted <- function(wide_data, synth_data, Z=NULL, lambda=NULL, ridge=T, scm=T, lambda_min_ratio = 1e-8, n_lambda = 20, lambda_max = NULL, holdout_length = 1, min_1se = T, V = NULL, residualize = FALSE, ...) { extra_params = list(...) if (length(extra_params) > 0) { warning("Unused parameters in using ridge augmented weights: ", paste(names(extra_params), collapse = ", ")) } X <- wide_data$X y <- wide_data$y trt <- wide_data$trt lambda_errors <- NULL lambda_errors_se <- NULL lambdas <- NULL ## center outcomes X_cent <- apply(X, 2, function(x) x - mean(x[trt==0])) X_c <- X_cent[trt==0,,drop=FALSE] X_1 <- matrix(colMeans(X_cent[trt==1,,drop=FALSE]), nrow=1) y_cent <- apply(y, 2, function(x) x - mean(x[trt==0])) y_c <- y_cent[trt==0,,drop=FALSE] t0 <- ncol(X_c) V <- make_V_matrix(t0, V) # apply V matrix transformation X_c <- X_c %*% V X_1 <- X_1 %*% V new_synth_data <- synth_data ## if there are auxiliary covariates, use them if(!is.null(Z)) { ## center covariates Z_cent <- apply(Z, 2, function(x) x - mean(x[trt==0])) Z_c <- Z_cent[trt==0,,drop=FALSE] Z_1 <- matrix(colMeans(Z_cent[trt==1,,drop=FALSE]), nrow=1) if(residualize) { ## regress out covariates Xc_hat <- Z_c %*% solve(t(Z_c) %*% Z_c) %*% t(Z_c) %*% X_c X1_hat <- Z_1 %*% solve(t(Z_c) %*% Z_c) %*% t(Z_c) %*% X_c # take residuals res_t <- X_1 - X1_hat res_c <- X_c - Xc_hat X_c <- res_c X_1 <- res_t X_cent[trt == 0,] <- res_c X_cent[trt == 1,] <- res_t new_synth_data$Z1 <- t(res_t) new_synth_data$X1 <- t(res_t) new_synth_data$Z0 <- t(res_c) new_synth_data$X0 <- t(res_c) } else { # standardize covariates to be on the same scale as the outcomes sdz <- apply(Z_c, 2, sd) sdx <- sd(X_c) Z_c <- sdx * t(t(Z_c) / sdz) Z_1 <- sdx * Z_1 / sdz # concatenate X_c <- cbind(X_c, Z_c) X_1 <- cbind(X_1, Z_1) new_synth_data$Z1 <- t(X_1) new_synth_data$X1 <- t(X_1) new_synth_data$Z0 <- t(X_c) new_synth_data$X0 <- t(X_c) V <- diag(ncol(X_c)) } } else { new_synth_data$Z1 <- t(X_1) new_synth_data$X1 <- t(X_1) new_synth_data$Z0 <- t(X_c) new_synth_data$X0 <- t(X_c) } out <- fit_ridgeaug_inner(X_c, X_1, trt, new_synth_data, lambda, ridge, scm, lambda_min_ratio, n_lambda, lambda_max, holdout_length, min_1se) weights <- out$weights synw <- out$synw lambda <- out$lambda lambdas <- out$lambdas lambda_errors <- out$lambda_errors lambda_errors_se <- out$lambda_errors_se # add back in covariate weights if(!is.null(Z)) { if(residualize) { no_cov_weights <- weights ridge_w <- t(t(Z_1) - t(Z_c) %*% weights) %*% solve(t(Z_c) %*% Z_c) %*% t(Z_c) weights <- weights + t(ridge_w) } else { no_cov_weights <- NULL } } l2_imbalance <- sqrt(sum((synth_data$X0 %*% weights - synth_data$X1)^2)) ## primal objective value scaled by least squares difference for mean uni_w <- matrix(1/ncol(synth_data$X0), nrow=ncol(synth_data$X0), ncol=1) unif_l2_imbalance <- sqrt(sum((synth_data$X0 %*% uni_w - synth_data$X1)^2)) scaled_l2_imabalance <- l2_imbalance / unif_l2_imbalance ## no outcome model mhat <- matrix(0, nrow=nrow(y), ncol=ncol(y)) ridge_mhat <- mhat if(!is.null(Z)) { if(residualize) { ridge_mhat <- ridge_mhat + Z_cent %*% solve(t(Z_c) %*% Z_c) %*% t(Z_c) %*% y_c ## regress out covariates for outcomes yc_hat <- ridge_mhat[trt == 0,, drop = F] # take residuals of outcomes y_c <- y_c - yc_hat } else { X_cent <- cbind(X_cent, Z_cent) } } if(ridge) { ridge_mhat <- ridge_mhat + X_cent %*% solve(t(X_c) %*% X_c + lambda * diag(ncol(X_c))) %*% t(X_c) %*% y_c } output <- list(weights = weights, l2_imbalance = l2_imbalance, scaled_l2_imbalance = scaled_l2_imabalance, mhat = mhat, lambda = lambda, ridge_mhat = ridge_mhat, synw = synw, lambdas = lambdas, lambda_errors = lambda_errors, lambda_errors_se = lambda_errors_se) if(!is.null(Z)) { output$no_cov_weights <- no_cov_weights z_l2_imbalance <- sqrt(sum((t(Z_c) %*% weights - t(Z_1))^2)) z_unif_l2_imbalance <- sqrt(sum((t(Z_c) %*% uni_w - t(Z_1))^2)) z_scaled_l2_imbalance <- z_l2_imbalance / z_unif_l2_imbalance output$covariate_l2_imbalance <- z_l2_imbalance output$scaled_covariate_l2_imbalance <- z_scaled_l2_imbalance } return(output) } #' Helper function to fit ridge ASCM #' @param X_c Matrix of control lagged outcomes #' @param X_1 Vector of treated leagged outcomes #' @param trt Vector of treatment indicators #' @param synth_data Output of `format_synth` #' @param lambda Ridge hyper-parameter, if NULL use CV #' @param ridge Include ridge or not #' @param scm Include SCM or not #' @param lambda_min_ratio Ratio of the smallest to largest lambda when tuning lambda values #' @param n_lambda Number of lambdas to consider between the smallest and largest lambda value #' @param lambda_max Initial (largest) lambda, if NULL sets it to be (1+norm(X_1-X_c))^2 #' @param holdout_length Length of conseuctive holdout period for when tuning lambdas #' @param min_1se If TRUE, chooses the maximum lambda within 1 standard error of the lambda that minimizes the CV error, if FALSE chooses the optimal lambda; default TRUE #' @noRd #' @return \itemize{ #' \item{"weights"}{Ridge ASCM weights} #' \item{"lambda"}{Value of the ridge hyperparameter} #' \item{"synw"}{The synth weights(for estimating the bias)} #' \item{"lambdas"}{List of lambda values evaluated to tune ridge regression} #' \item{"lambda_errors"}{"The MSE associated with each lambda term in lambdas."} #' \item{"lambda_errors_se"}{"The SE of the MSE associated with each lambda term in lambdas."} #' } fit_ridgeaug_inner <- function(X_c, X_1, trt, synth_data, lambda, ridge, scm, lambda_min_ratio, n_lambda, lambda_max, holdout_length, min_1se) { lambda_errors <- NULL lambda_errors_se <- NULL lambdas <- NULL ## if SCM fit scm if(scm) { syn <- fit_synth_formatted(synth_data)$weights } else { ## else use uniform weights syn <- rep(1 / sum(trt == 0), sum(trt == 0)) } if(ridge) { if(is.null(lambda)) { cv_out <- cv_lambda(X_c, X_1, synth_data, trt, holdout_length, scm, lambda_max, lambda_min_ratio, n_lambda, min_1se) lambda <- cv_out$lambda lambda_errors <- cv_out$lambda_errors lambda_errors_se <- cv_out$lambda_errors_se lambdas <- cv_out$lambdas } # get ridge weights ridge_w <- t(t(X_1) - t(X_c) %*% syn) %*% solve(t(X_c) %*% X_c + lambda * diag(ncol(X_c))) %*% t(X_c) } else { ridge_w <- matrix(0, ncol = sum(trt == 0), nrow=1) } ## combine weights weights <- syn + t(ridge_w) return(list(weights = weights, synw = syn, lambda = lambda, lambdas = lambdas, lambda_errors = lambda_errors, lambda_errors_se = lambda_errors_se)) } #' Choose max lambda as largest eigenvalue of control X #' @param X_c matrix of control lagged outcomes #' @noRd #' @return max lambda get_lambda_max <- function(X_c) { svd(X_c)$d[1] ^ 2 } #' Create list of lambdas #' @param lambda_min_ratio Ratio of the smallest to largest lambda when tuning lambda values #' @param n_lambda Number of lambdas to consider between the smallest and largest lambda value #' @param lambda_max Initial (largest) lambda, if NULL sets it to be (1+norm(X_1-X_c))^2 #' @noRd #' @return List of lambdas create_lambda_list <- function(lambda_max, lambda_min_ratio, n_lambda) { scaler <- (lambda_min_ratio) ^ (1/n_lambda) lambdas <- lambda_max * (scaler ^ (seq(0:n_lambda) - 1)) return(lambdas) } #' Choose either the lambda that minimizes CV MSE or largest lambda within 1 se of min #' @param lambdas list of lambdas #' @param lambda_errors The MSE associated with each lambda term in lambdas. #' @param lambda_errors_se The SE of the MSE associated with each lambda #' @param min_1se If TRUE, chooses the maximum lambda within 1 standard error of the lambda that minimizes the CV error, if FALSE chooses the optimal lambda; default TRUE #' @noRd #' @return optimal lambda choose_lambda <- function(lambdas, lambda_errors, lambda_errors_se, min_1se) { # lambda with smallest error min_idx <- which.min(lambda_errors) min_error <- lambda_errors[min_idx] min_se <- lambda_errors_se[min_idx] lambda_min <- lambdas[min_idx] # max lambda with error within one se of min lambda_1se <- max(lambdas[lambda_errors <= min_error + min_se]) return(if(min_1se) lambda_1se else lambda_min) } #' Choose best lambda with CV #' @param X_c Matrix of control lagged outcomes #' @param X_1 Vector of treated leagged outcomes #' @param synth_data Output of `format_synth` #' @param trt Vector of treatment indicators #' @param holdout_length Length of conseuctive holdout period for when tuning lambdas #' @param scm Include SCM or not #' @param lambda_max Initial (largest) lambda, if NULL sets it to be (1+norm(X_1-X_c))^2 #' @param lambda_min_ratio Ratio of the smallest to largest lambda when tuning lambda values #' @param n_lambda Number of lambdas to consider between the smallest and largest lambda value #' @param min_1se If TRUE, chooses the maximum lambda within 1 standard error of the lambda #' @noRd #' @return \itemize{ #' \item{"lambda"}{Value of the ridge hyperparameter} #' \item{"lambdas"}{List of lambda values evaluated to tune ridge regression} #' \item{"lambda_errors"}{"The MSE associated with each lambda term in lambdas."} #' \item{"lambda_errors_se"}{"The SE of the MSE associated with each lambda term} #' } cv_lambda <- function(X_c, X_1, synth_data, trt, holdout_length, scm, lambda_max, lambda_min_ratio, n_lambda, min_1se) { if(is.null(lambda_max)) { lambda_max <- get_lambda_max(X_c) } lambdas <- create_lambda_list(lambda_max, lambda_min_ratio, n_lambda) lambda_out <- get_lambda_errors(lambdas, X_c, X_1, synth_data, trt, holdout_length, scm) lambda_errors <- lambda_out$lambda_errors lambda_errors_se <- lambda_out$lambda_errors_se lambda <- choose_lambda(lambdas, lambda_errors, lambda_errors_se, min_1se) return(list(lambda = lambda, lambda_errors = lambda_errors, lambda_errors_se = lambda_errors_se, lambdas = lambdas)) } ================================================ FILE: R/ridge_lambda.R ================================================ ################################################################################ ## Function to calculate error on different lambda values if using Ridge Augmented SCM ################################################################################ #' Get Lambda Errors #' @importFrom stats sd #' #' @param lambdas Vector of lambda values to compute errors for #' @param X_c Matrix of control group pre-treatment outcomes #' @param X_t Matrix of treatment group pre-treatment outcomes #' @param synth_data Output of `format_synth` #' @param trt Boolean vector of treatment assignments #' @param holdout_length Length of conseuctive holdout period for when tuning lambdas #' @param scm Include SCM or not #' @noRd #' @return List of lambda errors for each corresponding lambda in the lambdas parameter. get_lambda_errors <- function(lambdas, X_c, X_t, synth_data, trt, holdout_length=1, scm=T) { # vector that stores the sum MSE across all CV sets errors <- matrix(0, nrow = ncol(X_c) - holdout_length, ncol = length(lambdas)) lambda_errors = numeric(length(lambdas)) lambda_errors_se = numeric(length(lambdas)) for (i in 1:(ncol(X_c) - holdout_length)) { X_0 <- X_c[,-(i:(i + holdout_length - 1))] X_1 <- matrix(X_t[-(i:(i + holdout_length - 1))]) X_0v <- X_c[,i:(i + holdout_length - 1)] X_1v <- matrix(X_t[i:(i + holdout_length - 1)], ncol = 1) new_synth_data <- synth_data new_synth_data$Z1 <- X_1 new_synth_data$X1 <- X_1 new_synth_data$Z0 <- t(X_0) new_synth_data$X0 <- t(X_0) if(scm) { syn <- fit_synth_formatted(new_synth_data)$weights } else { syn <- rep(1/sum(trt==0), sum(trt==0)) } for (j in 1:length(lambdas)) { ridge_weights <- t(X_1 - t(X_0) %*% syn) %*% solve(t(X_0) %*% X_0 + lambdas[j] * diag(ncol(X_0))) %*% t(X_0) aug_weights <- syn + t(ridge_weights) error <- X_1v - t(X_0v) %*% aug_weights # take sum of errors across the holdout time periods error <- sum(error ^ 2) errors[i, j] <- error # lambda_errors[j] <- lambda_errors[j] + error } } lambda_errors <- apply(errors, 2, mean) lambda_errors_se <- apply(errors, 2, function(x) sd(x) / sqrt(length(x))) return(list(lambda_errors = lambda_errors, lambda_errors_se = lambda_errors_se)) } ================================================ FILE: R/time_regression_multi.R ================================================ ############################################################################## ## Outcome regression with multiple treated units ############################################################################## #' Fit a time regression #' @param X Matrix of outcomes #' @param trt Vector of treatment status for each unit #' @param n_leads How long past treatment effects should be estimated for #' @param reg_param Regularization hyperparameter #' @param lowlim Lower bound for coefs #' @param uplim upper bound for coefs #' @param ... Extra optimization hyperparameters #' @noRd #' @return \itemize{ #' \item{y0hat }{List of predicted outcome under control} #' \item{residuals }{List of residuals} #' \item{params }{Regression parameters}} fit_time_reg <- function(X, trt, n_leads, reg_param, lowlim = 0, uplim = 1, ...) { grps <- trt[is.finite(trt)] J <- length(grps) tmax <- max(trt[is.finite(trt)]) # fit QP reg_weights <- fit_time_reg_qp_(X, trt, n_leads, lowlim, uplim, reg_param, ...) # get predicted outcomes (repeated as a matrix) and residuals y0hat <- lapply(1:J, function(j) { # compute time fixed effects from pure controls time_eff <- matrix(colMeans(X[!is.finite(trt),]), nrow=nrow(X), ncol=ncol(X), byrow=T) Xj <- X - time_eff zero_mat <- matrix(0, nrow = nrow(X), ncol = (tmax - grps[j])) Xj <- cbind(zero_mat, Xj[, 1:grps[j], drop = F]) # take out pure control means y0hatj <- Xj %*% reg_weights[,j, drop = F] matrix(y0hatj, nrow=nrow(X), ncol=ncol(X)) + time_eff }) residuals <- lapply(1:J, function(j) X - y0hat[[j]]) return(list(y0hat = y0hat, residuals = residuals, time_weights = reg_weights)) } #' Fit a time regression #' @param X Matrix of outcomes #' @param trt Vector of treatment status for each unit #' @param n_leads How long past treatment effects should be estimated for #' @param reg_param Regularization hyperparameter #' @param lowlim Lower bound for coefs #' @param uplim upper bound for coefs #' @param ... Extra optimization hyperparameters #' @noRd #' @return reg_weights Fitted regression weights fit_time_reg_qp_ <- function(X, trt, n_leads, lowlim, uplim, reg_param, ...) { grps <- trt[is.finite(trt)] J <- length(grps) ttot <- ncol(X) max_trt <- max(grps) # get data in the right form data_mats <- collect_data(X, trt, n_leads) # create constraint matrices constraints <- make_constraints(J, grps, lowlim, uplim) # get components of QP Qmat <- get_Qmat(data_mats$pre_mats) pvec <- get_pvec(data_mats$pre_mats, data_mats$post_vecs) I0 <- get_regularization_matrix(J, max_trt, reg_param) # add in regularization # I0 <- Matrix::bdiag(reg_param1 * Matrix::Diagonal(max_trt), # reg_param2 * Matrix::Diagonal(J * max_trt)) Qmat <- Qmat + I0 # fit QP settings <- osqp::osqpSettings(verbose = FALSE, ...) out <- osqp::solve_osqp(Qmat, pvec, constraints$Amat, constraints$lvec, constraints$uvec, pars=settings) # collect as matrix # reg_weights <- matrix(out$x, ncol = J + 1) reg_weights <- matrix(out$x, ncol = J) # pooled <- reg_weights[,1] # add in common component # reg_weights <- reg_weights[, 1] + reg_weights[, -1] # reverse to calendar time # reg_weights <- reg_weights[nrow(reg_weights):1, ] return(reg_weights) } #' Organize data for the QP #' @param X Matrix of outcomes #' @param trt Vector of treatment status for each unit #' @param n_leads How long past treatment effects should be estimated for #' @noRd collect_data <- function(X, trt, n_leads) { grps <- trt[is.finite(trt)] J <- length(grps) ttot <- ncol(X) max_trt <- max(grps) # sapply(1:ncol(X), # function(tj) { # mean(X[trt >= tj]) # }) -> ctrl_means # X <- t(t(X) - ctrl_means) # get pre-treatment matrices lapply(grps, function(tj) { # donor unit pre tj outcomes idxs <- trt > tj + n_leads pre_mat <- cbind(#1, matrix(0, nrow = nrow(X), ncol = (max_trt - tj)), X[, 1:tj, drop = F]) # subtract out pure control means pre_mat <- t(t(pre_mat) - colMeans(pre_mat[!is.finite(trt),,drop = F])) # restrict to units that won't be treated w/in n_leads pre_mat[idxs,,drop = F] }) -> pre_mats # get post treatment averages lapply(grps, function(tj) { # avg of donor units post tj outcomes idxs <- trt > tj + n_leads donors <- rowMeans(X[, (tj + 1):(tj + n_leads), drop = F]) # subtract out pure control means donors <- donors - mean(donors[!is.finite(trt)]) # restrict to units that won't be treated w/in n_leads donors[idxs] }) -> post_vecs return(list(pre_mats = pre_mats, post_vecs = post_vecs)) } get_Qmat <- function(pre_mats) { #### matrix in QP cov_mats <- lapply(pre_mats, function(x) t(x) %*% x) # unit specific covariance matrices Qmat <- Matrix::bdiag(cov_mats) return(Qmat) } get_Qmat_pool <- function(pre_mats) { #### matrix in QP cov_mats <- lapply(pre_mats, function(x) t(x) %*% x) pooled_cov <- Reduce(`+`, cov_mats) cov_mats_bind <-do.call(rbind, cov_mats) # unit specific covariance matrices Qmat <- Matrix::bdiag(cov_mats) # pooling terms Qmat <- rbind(t(cov_mats_bind), Qmat) Qmat <- cbind(rbind(pooled_cov, cov_mats_bind), Qmat) return(Qmat) } get_pvec <- function(pre_mats, post_vecs) { # vector in QP lapply(1:length(pre_mats), function(j) { t(pre_mats[[j]]) %*% post_vecs[[j]] }) -> pvec_list pvec <- do.call(c, pvec_list) return(-2 * pvec) } get_pvec_pool <- function(pre_mats, post_vecs) { # vector in QP lapply(1:length(pre_mats), function(j) { t(pre_mats[[j]]) %*% post_vecs[[j]] }) -> pvec_list pvec_pool <- Reduce(`+`, pvec_list) pvec <- do.call(c, pvec_list) pvec <- c(pvec_pool, pvec) return(-2 * pvec) } make_constraints <- function(J, grps, lowlim, uplim) { tmax <- max(grps) # sum to 1 constraints A1 <- Matrix::t(Matrix::bdiag(lapply(1:J, function(j) c(0, rep(1, tmax))))) A1 <- Matrix::t(Matrix::bdiag(lapply(1:J, function(j) rep(1, tmax)))) l1 <- rep(1, J) # l1 <- rep(-Inf, J) u1 <- rep(1, J) # u1 <- rep(Inf, J) # upper lower limits diag_w_intercept <- Matrix::bdiag(list(0, Matrix::Diagonal(tmax)))[-1, ] A2 <- Matrix::bdiag(lapply(1:J, function(j) diag_w_intercept)) A2 <- Matrix::Diagonal(J * tmax) # make sure that only weighting times that exist l2 <- sapply(1:J, function(j) { c(rep(0, tmax - grps[j]), rep(lowlim, grps[j])) }) u2 <- sapply(1:J, function(j) { c(rep(0, tmax - grps[j]), rep(uplim, grps[j])) }) # combine Amat <- rbind(A1, A2) lvec <- c(l1, l2) uvec <- c(u1, u2) return(list(Amat = Amat, lvec = lvec, uvec = uvec)) } make_constraints_pool <- function(J, grps, lowlim, uplim) { tmax <- max(grps) # sum to 1 constraints A1 <- cbind(0, matrix(1, ncol = tmax, nrow = J), Matrix::t(Matrix::bdiag(lapply(1:J, function(j) c(0, rep(1, tmax)))))) l1 <- rep(1, J) # l1 <- rep(-Inf, J) u1 <- rep(1, J) # u1 <- rep(Inf, J) # upper lower limits diag_w_intercept <- Matrix::bdiag(list(0, Matrix::Diagonal(tmax)))[-1, ] pool_A2 <- do.call(rbind, lapply(1:J, function(j) diag_w_intercept)) A2 <- Matrix::bdiag(lapply(1:J, function(j) diag_w_intercept)) A2 <- cbind(pool_A2, A2) # restrict global intercept to 0 A2 <- rbind(c(1, numeric(ncol(A2) - 1)), A2) # make sure that only weighting times that exist l2 <- sapply(1:J, function(j) { c(rep(0, tmax - grps[j]), rep(lowlim, grps[j])) }) l2 <- c(0, l2) u2 <- sapply(1:J, function(j) { c(rep(0, tmax - grps[j]), rep(uplim, grps[j])) }) u2 <- c(0, u2) # l2 <- rep(lowlim, J * tmax) # u2 <- rep(uplim, J * tmax) # combine Amat <- rbind(A1, A2) lvec <- c(l1, l2) uvec <- c(u1, u2) return(list(Amat = Amat, lvec = lvec, uvec = uvec)) } get_regularization_matrix <- function(J, max_trt, reg_param) { single_reg_mat <- Matrix::bdiag(list(0, Matrix::Diagonal(max_trt))) I0 <- Matrix::bdiag(lapply(1:J,function(j) single_reg_mat)) I0 <- reg_param * Matrix::Diagonal(J * max_trt) return(reg_param * I0) } get_regularization_matrix_pool <- function(J, max_trt, reg_param1, reg_param2) { single_reg_mat <- Matrix::bdiag(list(0, Matrix::Diagonal(max_trt))) grp_reg_mats <- Matrix::bdiag(lapply(1:J,function(j) single_reg_mat)) I0 <- Matrix::bdiag(reg_param1 * single_reg_mat, reg_param2 * grp_reg_mats) return(I0) } ================================================ FILE: README.md ================================================ # augsynth: Augmented Synthetic Control Method [![Build Status](https://travis-ci.org/ebenmichael/augsynth.svg?branch=master)](https://travis-ci.org/ebenmichael/augsynth) [![Project Status: Active The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) ## Overview This package implements the Augmented Synthetic Control Method (ASCM). For a more detailed description of the main functionality check out: - [the vignette for simultaneous adoption](https://github.com/ebenmichael/augsynth/blob/master/vignettes/singlesynth-vignette.md) - [the vignette for staggered adoption](https://github.com/ebenmichael/augsynth/blob/master/vignettes/multisynth-vignette.md) ## Installation To install this package, first ensure that `devtools` is installed with ``` install.packages("devtools") ``` then install the package from GitHub with ``` devtools::install_github("ebenmichael/augsynth") ``` ## Basic usage To get started, use a panel dataset with an `outcome` measure, a `treatment` indicator, a `unit` indicator, a `time` variable, and an intervention time `t_int`. Then run ``` asyn <- augsynth(outcome ~ trt, unit, time, t_int, data) ``` ================================================ FILE: data-raw/clean_kansas.R ================================================ library(haven) library(tidyverse) kansas <- read_dta("kansas_longer2.dta") state_abb <- read_csv("us-state-ansi-fips.csv") %>% rename(fips = st, abb = stusps) %>% mutate(fips = as.numeric(fips)) %>% select(fips, abb) kansas <- kansas %>% rename(fips=Fips) %>% filter(year >= 1990, !is.na(fips), # filter out all of US fips != 11, # filter out DC # year_qtr >= 2005 | year_qtr == round(year_qtr) ) %>% # interpolate GDP mutate(year_qtr = year + qtr / 4 - 0.25, # combine year and quarter fips = as.integer(fips), # state id treated = 1 * (fips == 20) * (year_qtr >= 2012.25), gdp = ifelse((qtr == 1) | (year >= 2005), gdp, NA), popestimate = ifelse((qtr == 1), popestimate, NA)) %>% # interpolate GDP and population group_by(fips) %>% arrange(year_qtr) %>% mutate(gdp = approx(year_qtr, gdp, year_qtr)$y, popestimate = approx(year_qtr, popestimate, year_qtr)$y) %>% ungroup() %>% arrange(fips, year_qtr) %>% mutate(gdpcapita = gdp / popestimate * 1e6, lngdp = log(gdp), lngdpcapita = log(gdpcapita), revstatecapita = rev_state_total / popestimate * 1e6, revlocalcapita = rev_local_total / popestimate * 1e6, emplvl1capita = month1_emplvl / popestimate, emplvl2capita = month2_emplvl / popestimate, emplvl3capita = month3_emplvl / popestimate, emplvlcapita = (month1_emplvl + month2_emplvl + month3_emplvl) / (3 * popestimate), totalwagescapita = total_qtrly_wages / popestimate, taxwagescapita = taxable_qtrly_wages / popestimate, avgwklywagecapita = avg_wkly_wage, estabscapita = qtrly_estabs_count / popestimate) %>% filter(year_qtr <= 2016) %>% inner_join(state_abb) for (name in colnames(kansas)) { attributes(kansas[[name]])$label = NULL } ================================================ FILE: man/augsynth-package.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \docType{package} \name{augsynth-package} \alias{augsynth-package} \title{augsynth} \description{ A package implementing the Augmented Synthetic Controls Method } ================================================ FILE: man/augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth_pre.R \name{augsynth} \alias{augsynth} \title{Fit Augmented SCM} \usage{ augsynth(form, unit, time, data, t_int = NULL, ...) } \arguments{ \item{form}{outcome ~ treatment | auxillary covariates} \item{unit}{Name of unit column} \item{time}{Name of time column} \item{data}{Panel data as dataframe} \item{t_int}{Time of intervention (used for single-period treatment only)} \item{...}{Optional arguments \itemize{ \item Single period augsynth with/without multiple outcomes \itemize{ \item{"progfunc"}{What function to use to impute control outcomes: Ridge=Ridge regression (allows for standard errors), None=No outcome model, EN=Elastic Net, RF=Random Forest, GSYN=gSynth, MCP=MCPanel, CITS=CITS, CausalImpact=Bayesian structural time series with CausalImpact, seq2seq=Sequence to sequence learning with feedforward nets} \item{"scm"}{Whether the SCM weighting function is used} \item{"fixedeff"}{Whether to include a unit fixed effect, default F } \item{"cov_agg"}{Covariate aggregation functions, if NULL then use mean with NAs omitted} } \item Multi period (staggered) augsynth \itemize{ \item{"relative"}{Whether to compute balance by relative time} \item{"n_leads"}{How long past treatment effects should be estimated for} \item{"n_lags"}{Number of pre-treatment periods to balance, default is to balance all periods} \item{"alpha"}{Fraction of balance for individual balance} \item{"lambda"}{Regularization hyperparameter, default = 0} \item{"force"}{Include "none", "unit", "time", "two-way" fixed effects. Default: "two-way"} \item{"n_factors"}{Number of factors for interactive fixed effects, default does CV} } }} } \value{ augsynth object that contains: \itemize{ \item{"weights"}{weights} \item{"data"}{Panel data as matrices} } } \description{ Fit Augmented SCM } ================================================ FILE: man/augsynth_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multi_outcomes.R \name{augsynth_multiout} \alias{augsynth_multiout} \title{Fit Augmented SCM with multiple outcomes} \usage{ augsynth_multiout( form, unit, time, t_int, data, progfunc = c("Ridge", "None"), scm = T, fixedeff = FALSE, cov_agg = NULL, combine_method = "avg", ... ) } \arguments{ \item{form}{outcome ~ treatment | auxillary covariates} \item{unit}{Name of unit column} \item{time}{Name of time column} \item{t_int}{Time of intervention} \item{data}{Panel data as dataframe} \item{progfunc}{What function to use to impute control outcomes Ridge=Ridge regression (allows for standard errors), None=No outcome model,} \item{scm}{Whether the SCM weighting function is used} \item{fixedeff}{Whether to include a unit fixed effect, default F} \item{cov_agg}{Covariate aggregation functions, if NULL then use mean with NAs omitted} \item{combine_method}{How to combine outcomes: `concat` concatenates outcomes and `avg` averages them, default: 'avg'} \item{...}{optional arguments for outcome model} } \value{ augsynth object that contains: \itemize{ \item{"weights"}{Ridge ASCM weights} \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} \item{"mhat"}{Outcome model estimate} \item{"data"}{Panel data as matrices} } } \description{ Fit Augmented SCM with multiple outcomes } ================================================ FILE: man/check_data_stag.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/format.R \name{check_data_stag} \alias{check_data_stag} \title{Check that we can actually run multisynth on the data} \usage{ check_data_stag(wide, fixedeff, n_leads, n_lags) } \arguments{ \item{wide}{Output of format_data_stag} \item{fixedeff}{Whether to include a unit fixed effect} \item{n_leads}{How long past treatment effects should be estimated for, default is number of post treatment periods for last treated unit} \item{n_lags}{Number of pre-treatment periods to balance, default is to balance all periods} } \description{ Check that we can actually run multisynth on the data } ================================================ FILE: man/conformal_inf.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{conformal_inf} \alias{conformal_inf} \title{Conformal inference procedure to compute p-values and point-wise confidence intervals} \usage{ conformal_inf( ascm, alpha = 0.05, stat_func = NULL, type = "iid", q = 1, ns = 1000, grid_size = 50 ) } \arguments{ \item{ascm}{Fitted `augsynth` object} \item{alpha}{Confidence level} \item{stat_func}{Function to compute test statistic} \item{type}{Either "iid" for iid permutations or "block" for moving block permutations; default is "block"} \item{q}{The norm for the test static `((sum(x ^ q))) ^ (1/q)`} \item{ns}{Number of resamples for "iid" permutations} \item{grid_size}{Number of grid points to use when inverting the hypothesis test} } \value{ List that contains: \itemize{ \item{"att"}{Vector of ATT estimates} \item{"heldout_att"}{Vector of ATT estimates with the time period held out} \item{"se"}{Standard error, always NA but returned for compatibility} \item{"lb"}{Lower bound of 1 - alpha confidence interval} \item{"ub"}{Upper bound of 1 - alpha confidence interval} \item{"p_val"}{p-value for test of no post-treatment effect} \item{"alpha"}{Level of confidence interval} } } \description{ Conformal inference procedure to compute p-values and point-wise confidence intervals } ================================================ FILE: man/conformal_inf_linear.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{conformal_inf_linear} \alias{conformal_inf_linear} \title{Conformal inference procedure to compute a confidence interval for a linear in time effect} \usage{ conformal_inf_linear( ascm, alpha = 0.05, stat_func = NULL, type = "iid", q = 1, ns = 1000, grid_size = 50 ) } \arguments{ \item{ascm}{Fitted `augsynth` object} \item{alpha}{Confidence level} \item{stat_func}{Function to compute test statistic} \item{type}{Either "iid" for iid permutations or "block" for moving block permutations; default is "iid"} \item{q}{The norm for the test static `((sum(x ^ q))) ^ (1/q)`} \item{ns}{Number of resamples for "iid" permutations} \item{grid_size}{Number of grid points to use when inverting the hypothesis test} } \value{ List that contains: \itemize{ \item{"att"}{Vector of ATT estimates} \item{"heldout_att"}{Vector of ATT estimates with the time period held out} \item{"se"}{Standard error, always NA but returned for compatibility} \item{"lb"}{Lower bound of 1 - alpha confidence interval} \item{"ub"}{Upper bound of 1 - alpha confidence interval} \item{"p_val"}{p-value for test of no post-treatment effect} \item{"alpha"}{Level of confidence interval} } } \description{ Conformal inference procedure to compute a confidence interval for a linear in time effect } ================================================ FILE: man/conformal_inf_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{conformal_inf_multiout} \alias{conformal_inf_multiout} \title{Conformal inference procedure to compute p-values and point-wise confidence intervals} \usage{ conformal_inf_multiout( ascm_multi, alpha = 0.05, stat_func = NULL, type = "iid", q = 1, ns = 1000, grid_size = 1, lin_h0 = NULL ) } \arguments{ \item{alpha}{Confidence level} \item{stat_func}{Function to compute test statistic} \item{type}{Either "iid" for iid permutations or "block" for moving block permutations} \item{q}{The norm for the test static `((sum(x ^ q))) ^ (1/q)`} \item{ns}{Number of resamples for "iid" permutations} \item{grid_size}{Number of grid points to use when inverting the hypothesis test (default is 1, so only to test joint null)} \item{ascm}{Fitted `augsynth` object} } \value{ List that contains: \itemize{ \item{"att"}{Vector of ATT estimates} \item{"heldout_att"}{Vector of ATT estimates with the time period held out} \item{"se"}{Standard error, always NA but returned for compatibility} \item{"lb"}{Lower bound of 1 - alpha confidence interval} \item{"ub"}{Upper bound of 1 - alpha confidence interval} \item{"p_val"}{p-value for test of no post-treatment effect} \item{"alpha"}{Level of confidence interval} } } \description{ Conformal inference procedure to compute p-values and point-wise confidence intervals } ================================================ FILE: man/get_nona_donors.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/eligible_donors.R \name{get_nona_donors} \alias{get_nona_donors} \title{Get donors that don't have missing outcomes where treated units have outcomes} \usage{ get_nona_donors(X, y, trt, n_lags, n_leads, time_cohort) } \description{ Get donors that don't have missing outcomes where treated units have outcomes } ================================================ FILE: man/jackknife_se_single.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{jackknife_se_single} \alias{jackknife_se_single} \title{Estimate standard errors for single ASCM with the jackknife Do this for ridge-augmented synth} \usage{ jackknife_se_single(ascm) } \arguments{ \item{ascm}{Fitted augsynth object} } \value{ List that contains: \itemize{ \item{"att"}{Vector of ATT estimates} \item{"se"}{Standard error estimate} \item{"lb"}{Lower bound of 1 - alpha confidence interval} \item{"ub"}{Upper bound of 1 - alpha confidence interval} \item{"alpha"}{Level of confidence interval} } } \description{ Estimate standard errors for single ASCM with the jackknife Do this for ridge-augmented synth } ================================================ FILE: man/kansas.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/data.R \docType{data} \name{kansas} \alias{kansas} \title{Economic indicators for US states from 1990-2016} \format{ A dataframe with 5250 rows and 32 variables: \describe{ \item{fips}{FIPS code for each state} \item{year}{Year of measurement} \item{qtr}{Quarter (1-4) of measurement} \item{state}{Name of State} \item{gdp}{Gross State Product (millions of $) Values before 2005 are linearly interpolated between years} \item{revenuepop}{State and local revenue per capita} \item{rev_state_total}{State total general revenue (millions of $)} \item{rev_local_total}{Local total general revenue (millions of $)} \item{popestimate}{Population estimate} \item{qtrly_estabs_count}{Count of establishments for a given quarter} \item{month1_emplvl, month2_emplvl, month3_emplvl}{ Employment level for first, second, and third months of a given quarter} \item{total_qtrly_wages}{Total wages for a givne quarter} \item{taxable_qtrly_wage}{Taxable wages for a given quarter} \item{avg_wkly_wage}{Average weekly wage for a given quarter} \item{year_qtr}{Year and quarter combined into one continuous variable} \item{treated}{Whether the state passed tax cuts before the given year and quareter} \item{lngdpcapita}{Natural log of GDP per capita} \item{emplvlcapita}{Average employment level per capita} \item{Xcapita}{Per capita value of X} \item{abb}{State abbreviation} } } \usage{ kansas } \description{ Economic indicators for US states from 1990-2016 } \keyword{datasets} ================================================ FILE: man/make_V_matrix.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/fit_synth.R \name{make_V_matrix} \alias{make_V_matrix} \title{Make a V matrix from a vector (or null)} \usage{ make_V_matrix(t0, V) } \description{ Make a V matrix from a vector (or null) } ================================================ FILE: man/multisynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multisynth_class.R \name{multisynth} \alias{multisynth} \title{Fit staggered synth} \usage{ multisynth( form, unit, time, data, n_leads = NULL, n_lags = NULL, nu = NULL, lambda = 0, V = NULL, fixedeff = TRUE, n_factors = 0, scm = T, time_cohort = F, how_match = "knn", cov_agg = NULL, eps_abs = 1e-04, eps_rel = 1e-04, verbose = FALSE, ... ) } \arguments{ \item{form}{outcome ~ treatment | weighting covariates | approximate matching covaraites | exact matching covariates \itemize{ \item{outcome}{Name of the outcome of interest} \item{treatment}{Name of the treatment assignment variable} \item{weighting covariates}{Auxiliary covariates to weight on} \item{approximate matching covariates}{Auxiliary covariates to approximately match one before weighting} \item{exact matching covariates}{Auxiliary covariates to exactly match on before weighting} } If covariates are time-varying, their average value before the first unit is treated will be used. This can be changed by supplying a custom aggregation function to cov_agg.} \item{unit}{Name of unit column} \item{time}{Name of time column} \item{data}{Panel data as dataframe} \item{n_leads}{How long past treatment effects should be estimated for, default is number of post treatment periods for last treated unit} \item{n_lags}{Number of pre-treatment periods to balance, default is to balance all periods} \item{nu}{Fraction of balance for individual balance} \item{lambda}{Regularization hyperparameter, default = 0} \item{V}{Scaling matrix for synth optimization, default NULL is identity} \item{fixedeff}{Whether to include a unit fixed effect, default TRUE} \item{n_factors}{Number of factors for interactive fixed effects, setting to NULL fits with CV, default is 0} \item{scm}{Whether to fit scm weights} \item{time_cohort}{Whether to average synthetic controls into time cohorts, default FALSE} \item{cov_agg}{Covariate aggregation function} \item{eps_abs}{Absolute error tolerance for osqp} \item{eps_rel}{Relative error tolerance for osqp} \item{verbose}{Whether to print logs for osqp} \item{...}{Extra arguments} } \value{ multisynth object that contains: \itemize{ \item{"weights"}{weights matrix where each column is a set of weights for a treated unit} \item{"data"}{Panel data as matrices} \item{"imbalance"}{Matrix of treatment minus synthetic control for pre-treatment time periods, each column corresponds to a treated unit} \item{"global_l2"}{L2 imbalance for the pooled synthetic control} \item{"scaled_global_l2"}{L2 imbalance for the pooled synthetic control, scaled by the imbalance for unitform weights} \item{"ind_l2"}{Average L2 imbalance for the individual synthetic controls} \item{"scaled_ind_l2"}{Average L2 imbalance for the individual synthetic controls, scaled by the imbalance for unitform weights} \item{"n_leads", "n_lags"}{Number of post treatment outcomes (leads) and pre-treatment outcomes (lags) to include in the analysis} \item{"nu"}{Fraction of balance for individual balance} \item{"lambda"}{Regularization hyperparameter} \item{"scm"}{Whether to fit scm weights} \item{"grps"}{Time periods for treated units} \item{"y0hat"}{Pilot estimates of control outcomes} \item{"residuals"}{Difference between the observed outcomes and the pilot estimates} \item{"n_factors"}{Number of factors for interactive fixed effects} } } \description{ Fit staggered synth } ================================================ FILE: man/plot.augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \name{plot.augsynth} \alias{plot.augsynth} \title{Plot function for augsynth} \usage{ \method{plot}{augsynth}(x, inf = T, cv = F, ...) } \arguments{ \item{x}{Augsynth object to be plotted} \item{inf}{Boolean, whether to get confidence intervals around the point estimates} \item{cv}{If True, plot cross validation MSE against hyper-parameter, otherwise plot effects} \item{...}{Optional arguments} } \description{ Plot function for augsynth } ================================================ FILE: man/plot.augsynth_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multi_outcomes.R \name{plot.augsynth_multiout} \alias{plot.augsynth_multiout} \title{Plot function for summary function for augsynth} \usage{ \method{plot}{augsynth_multiout}(x, inf = T, plt_avg = F, ...) } \arguments{ \item{x}{summary.augsynth_multiout object} \item{inf}{Boolean, whether to plot uncertainty intervals, default TRUE} \item{plt_avg}{Boolean, whether to plot the average of the outcomes, default FALSE} \item{...}{Optional arguments for summary function} } \description{ Plot function for summary function for augsynth } ================================================ FILE: man/plot.multisynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multisynth_class.R \name{plot.multisynth} \alias{plot.multisynth} \title{Plot function for multisynth} \usage{ \method{plot}{multisynth}( x, inf_type = "bootstrap", inf = T, levels = NULL, label = T, weights = FALSE, ... ) } \arguments{ \item{x}{Augsynth object to be plotted} \item{inf_type}{Type of inference to perform: \itemize{ \item{bootstrap}{Wild bootstrap, the default option} \item{jackknife}{Jackknife} }} \item{inf}{Whether to compute and plot confidence intervals} \item{levels}{Which units/groups to plot, default is every group} \item{label}{Whether to label the individual levels} \item{weights}{Whether to plot the weights, default = FALSE} \item{...}{Optional arguments} } \description{ Plot function for multisynth } ================================================ FILE: man/plot.summary.augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \name{plot.summary.augsynth} \alias{plot.summary.augsynth} \title{Plot function for summary function for augsynth} \usage{ \method{plot}{summary.augsynth}(x, inf = T, ...) } \arguments{ \item{x}{Summary object} \item{inf}{Boolean, whether to plot confidence intervals} \item{...}{Optional arguments} } \description{ Plot function for summary function for augsynth } ================================================ FILE: man/plot.summary.augsynth_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multi_outcomes.R \name{plot.summary.augsynth_multiout} \alias{plot.summary.augsynth_multiout} \title{Plot function for summary function for augsynth} \usage{ \method{plot}{summary.augsynth_multiout}(x, inf = F, plt_avg = F, ...) } \arguments{ \item{x}{summary.augsynth_multiout object} \item{inf}{Boolean, whether to plot uncertainty intervals, default TRUE} \item{plt_avg}{Boolean, whether to plot the average of the outcomes, default FALSE} } \description{ Plot function for summary function for augsynth } ================================================ FILE: man/plot.summary.multisynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multisynth_class.R \name{plot.summary.multisynth} \alias{plot.summary.multisynth} \title{Plot function for summary function for multisynth} \usage{ \method{plot}{summary.multisynth}(x, inf = T, levels = NULL, label = T, weights = FALSE, ...) } \arguments{ \item{x}{summary object} \item{inf}{Whether to plot confidence intervals} \item{levels}{Which units/groups to plot, default is every group} \item{label}{Whether to label the individual levels} \item{weights}{Whether to plot the weights, default = FALSE} \item{...}{Optional arguments} } \description{ Plot function for summary function for multisynth } ================================================ FILE: man/predict.augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \name{predict.augsynth} \alias{predict.augsynth} \title{Get prediction of ATT or average outcome under control} \usage{ \method{predict}{augsynth}(object, att = F, ...) } \arguments{ \item{object}{augsynth object} \item{att}{If TRUE, return the ATT, if FALSE, return imputed counterfactual} \item{...}{Optional arguments} } \value{ Vector of predicted post-treatment control averages } \description{ Get prediction of ATT or average outcome under control } ================================================ FILE: man/predict.augsynth_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multi_outcomes.R \name{predict.augsynth_multiout} \alias{predict.augsynth_multiout} \title{Get prediction of ATT or average outcome under control} \usage{ \method{predict}{augsynth_multiout}(object, ...) } \arguments{ \item{object}{augsynth_multiout object} \item{...}{Optional arguments, including \itemize{\item{"att"}{Whether to return the ATT or average outcome under control}}} } \value{ Vector of predicted post-treatment control averages } \description{ Get prediction of ATT or average outcome under control } ================================================ FILE: man/predict.multisynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multisynth_class.R \name{predict.multisynth} \alias{predict.multisynth} \title{Get prediction of average outcome under control or ATT} \usage{ \method{predict}{multisynth}(object, att = F, att_weight = NULL, bs_weight = NULL, ...) } \arguments{ \item{object}{Fit multisynth object} \item{att}{If TRUE, return the ATT, if FALSE, return imputed counterfactual} \item{att_weight}{Weights to place on individual units/cohorts when averaging} \item{bs_weight}{Weight to perturb units by for weighted bootstrap} \item{...}{Optional arguments} } \value{ Matrix of predicted post-treatment control outcomes for each treated unit } \description{ Get prediction of average outcome under control or ATT } ================================================ FILE: man/print.augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \name{print.augsynth} \alias{print.augsynth} \title{Print function for augsynth} \usage{ \method{print}{augsynth}(x, ...) } \arguments{ \item{x}{augsynth object} \item{...}{Optional arguments} } \description{ Print function for augsynth } ================================================ FILE: man/print.augsynth_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multi_outcomes.R \name{print.augsynth_multiout} \alias{print.augsynth_multiout} \title{Print function for augsynth} \usage{ \method{print}{augsynth_multiout}(x, ...) } \arguments{ \item{x}{augsynth_multiout object} \item{...}{Optional arguments} } \description{ Print function for augsynth } ================================================ FILE: man/print.multisynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multisynth_class.R \name{print.multisynth} \alias{print.multisynth} \title{Print function for multisynth} \usage{ \method{print}{multisynth}(x, att_weight = NULL, ...) } \arguments{ \item{x}{multisynth object} \item{...}{Optional arguments} } \description{ Print function for multisynth } ================================================ FILE: man/print.summary.augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \name{print.summary.augsynth} \alias{print.summary.augsynth} \title{Print function for summary function for augsynth} \usage{ \method{print}{summary.augsynth}(x, ...) } \arguments{ \item{x}{summary object} \item{...}{Optional arguments} } \description{ Print function for summary function for augsynth } ================================================ FILE: man/print.summary.augsynth_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multi_outcomes.R \name{print.summary.augsynth_multiout} \alias{print.summary.augsynth_multiout} \title{Print function for summary function for augsynth} \usage{ \method{print}{summary.augsynth_multiout}(x, ...) } \arguments{ \item{x}{summary.augsynth_multiout object} \item{...}{Optional arguments} } \description{ Print function for summary function for augsynth } ================================================ FILE: man/print.summary.multisynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multisynth_class.R \name{print.summary.multisynth} \alias{print.summary.multisynth} \title{Print function for summary function for multisynth} \usage{ \method{print}{summary.multisynth}(x, level = "Average", ...) } \arguments{ \item{x}{summary object} \item{level}{Which unit/group to print results for, default is the overall average} \item{...}{Optional arguments} } \description{ Print function for summary function for multisynth } ================================================ FILE: man/rdirichlet_b.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{rdirichlet_b} \alias{rdirichlet_b} \title{Bayesian bootstrap} \usage{ rdirichlet_b(n) } \arguments{ \item{n}{Number of units} } \description{ Bayesian bootstrap } ================================================ FILE: man/rmultinom_b.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{rmultinom_b} \alias{rmultinom_b} \title{Non-parametric bootstrap} \usage{ rmultinom_b(n) } \arguments{ \item{n}{Number of units} } \description{ Non-parametric bootstrap } ================================================ FILE: man/rwild_b.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{rwild_b} \alias{rwild_b} \title{Wild bootstrap (Mammen 1993)} \usage{ rwild_b(n) } \arguments{ \item{n}{Number of units} } \description{ Wild bootstrap (Mammen 1993) } ================================================ FILE: man/single_augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \name{single_augsynth} \alias{single_augsynth} \title{Fit Augmented SCM} \usage{ single_augsynth( form, unit, time, t_int, data, progfunc = "ridge", scm = T, fixedeff = FALSE, cov_agg = NULL, ... ) } \arguments{ \item{form}{outcome ~ treatment | auxillary covariates} \item{unit}{Name of unit column} \item{time}{Name of time column} \item{t_int}{Time of intervention} \item{data}{Panel data as dataframe} \item{progfunc}{What function to use to impute control outcomes ridge=Ridge regression (allows for standard errors), none=No outcome model, en=Elastic Net, RF=Random Forest, GSYN=gSynth, mcp=MCPanel, cits=Comparitive Interuppted Time Series causalimpact=Bayesian structural time series with CausalImpact} \item{scm}{Whether the SCM weighting function is used} \item{fixedeff}{Whether to include a unit fixed effect, default F} \item{cov_agg}{Covariate aggregation functions, if NULL then use mean with NAs omitted} \item{...}{optional arguments for outcome model} } \value{ augsynth object that contains: \itemize{ \item{"weights"}{Ridge ASCM weights} \item{"l2_imbalance"}{Imbalance in pre-period outcomes, measured by the L2 norm} \item{"scaled_l2_imbalance"}{L2 imbalance scaled by L2 imbalance of uniform weights} \item{"mhat"}{Outcome model estimate} \item{"data"}{Panel data as matrices} } } \description{ Fit Augmented SCM } ================================================ FILE: man/summary.augsynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/augsynth.R \name{summary.augsynth} \alias{summary.augsynth} \title{Summary function for augsynth} \usage{ \method{summary}{augsynth}(object, inf = T, inf_type = "conformal", linear_effect = F, ...) } \arguments{ \item{object}{augsynth object} \item{inf}{Boolean, whether to get confidence intervals around the point estimates} \item{inf_type}{Type of inference algorithm. Options are \itemize{ \item{"conformal"}{Conformal inference (default)} \item{"jackknife+"}{Jackknife+ algorithm over time periods} \item{"jackknife"}{Jackknife over units} }} \item{linear_effect}{Boolean, whether to invert the conformal inference hypothesis test to get confidence intervals for a linear-in-time treatment effect: intercept + slope * time} \item{...}{Optional arguments for inference, for more details for each `inf_type` see \itemize{ \item{"conformal"}{`conformal_inf`} \item{"jackknife+"}{`time_jackknife_plus`} \item{"jackknife"}{`jackknife_se_single`} }} } \description{ Summary function for augsynth } ================================================ FILE: man/summary.augsynth_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multi_outcomes.R \name{summary.augsynth_multiout} \alias{summary.augsynth_multiout} \title{Summary function for augsynth} \usage{ \method{summary}{augsynth_multiout}(object, inf = T, inf_type = "conformal", grid_size = 1, ...) } \arguments{ \item{object}{augsynth_multiout object} \item{inf}{whether or not to perform inference} \item{grid_size}{Grid to compute prediction intervals over, default is 1 and only p-values are computed} \item{...}{Optional arguments, including \itemize{\item{"se"}{Whether to plot standard error}}} \item{inf_typ}{Type of inference, default is "conformal"} } \description{ Summary function for augsynth } ================================================ FILE: man/summary.multisynth.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/multisynth_class.R \name{summary.multisynth} \alias{summary.multisynth} \title{Summary function for multisynth} \usage{ \method{summary}{multisynth}(object, inf_type = "bootstrap", att_weight = NULL, ...) } \arguments{ \item{object}{multisynth object} \item{inf_type}{Type of inference to perform: \itemize{ \item{bootstrap}{Wild bootstrap, the default option} \item{jackknife}{Jackknife} }} \item{...}{Optional arguments} } \value{ summary.multisynth object that contains: \itemize{ \item{"att"}{Dataframe with ATT estimates, standard errors for each treated unit} \item{"global_l2"}{L2 imbalance for the pooled synthetic control} \item{"scaled_global_l2"}{L2 imbalance for the pooled synthetic control, scaled by the imbalance for unitform weights} \item{"ind_l2"}{Average L2 imbalance for the individual synthetic controls} \item{"scaled_ind_l2"}{Average L2 imbalance for the individual synthetic controls, scaled by the imbalance for unitform weights} \item{"n_leads", "n_lags"}{Number of post treatment outcomes (leads) and pre-treatment outcomes (lags) to include in the analysis} } } \description{ Summary function for multisynth } ================================================ FILE: man/time_jackknife_plus.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{time_jackknife_plus} \alias{time_jackknife_plus} \title{Jackknife+ algorithm over time} \usage{ time_jackknife_plus(ascm, alpha = 0.05, conservative = F) } \arguments{ \item{ascm}{Fitted `augsynth` object} \item{alpha}{Confidence level} \item{conservative}{Whether to use the conservative jackknife+ procedure} } \value{ List that contains: \itemize{ \item{"att"}{Vector of ATT estimates} \item{"heldout_att"}{Vector of ATT estimates with the time period held out} \item{"se"}{Standard error, always NA but returned for compatibility} \item{"lb"}{Lower bound of 1 - alpha confidence interval} \item{"ub"}{Upper bound of 1 - alpha confidence interval} \item{"alpha"}{Level of confidence interval} } } \description{ Jackknife+ algorithm over time } ================================================ FILE: man/time_jackknife_plus_multiout.Rd ================================================ % Generated by roxygen2: do not edit by hand % Please edit documentation in R/inference.R \name{time_jackknife_plus_multiout} \alias{time_jackknife_plus_multiout} \title{Jackknife+ algorithm over time} \usage{ time_jackknife_plus_multiout(ascm_multi, alpha = 0.05, conservative = F) } \arguments{ \item{alpha}{Confidence level} \item{conservative}{Whether to use the conservative jackknife+ procedure} \item{ascm}{Fitted `augsynth` object} } \value{ List that contains: \itemize{ \item{"att"}{Vector of ATT estimates} \item{"heldout_att"}{Vector of ATT estimates with the time period held out} \item{"se"}{Standard error, always NA but returned for compatibility} \item{"lb"}{Lower bound of 1 - alpha confidence interval} \item{"ub"}{Upper bound of 1 - alpha confidence interval} \item{"alpha"}{Level of confidence interval} } } \description{ Jackknife+ algorithm over time } ================================================ FILE: pkg.Rproj ================================================ Version: 1.0 RestoreWorkspace: No SaveWorkspace: No AlwaysSaveHistory: Default EnableCodeIndexing: Yes Encoding: UTF-8 AutoAppendNewline: Yes StripTrailingWhitespace: Yes BuildType: Package PackageUseDevtools: Yes PackageInstallArgs: --no-multiarch --with-keep.source PackageRoxygenize: rd,collate,namespace ================================================ FILE: tests/testthat/test_augsynth_pre.R ================================================ context("Testing that top level API runs the right functions") library(Synth) test_that("augsynth runs single_synth when there is a single treated unit", { data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "None", scm = T, t_int = 1975) syn_single <- single_augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "None", scm = T, t_int = 1975) expect_equal(syn$weights, syn_single$weights) }) test_that("augsynth finds the treated time when is a single treated unit", { data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "None", scm = T, t_int = 1975) syn2 <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "None", scm = T) expect_equal(syn$weights, syn2$weights, tolerance = 1e-6) # should work with out of order time as well syn_rev <- augsynth(gdpcap ~ trt, regionno, year, basque %>% arrange(desc(year)), progfunc = "None", scm = T) expect_equal(syn$weights, syn_rev$weights) expect_equal(predict(syn), predict(syn_rev)) }) test_that("augsynth runs single_synth when there is simultaneous adoption", { data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, !regionno %in% c(16, 17) ~0, regionno %in% c(16, 17) ~ 1)) %>% filter(regionno != 1) syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "None", scm = T, t_int = 1975) syn_single <- single_augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "None", scm = T, t_int = 1975) expect_equal(syn$weights, syn_single$weights) }) test_that("augsynth runs multisynth when there is staggered adoption", { data(basque) basque <- basque %>% mutate(trt = case_when((regionno == 17) & (year == 1975) ~ 1, (regionno == 16) & (year == 1980) ~ 1, TRUE ~ 0)) %>% filter(regionno != 1) syn <- augsynth(gdpcap ~ trt, regionno, year, basque, scm = T) syn_multi <- multisynth(gdpcap ~ trt, regionno, year, basque) expect_equal(syn$weights, syn_multi$weights, tolerance = 1e-5) }) test_that("augsynth with a single treated unit doesn't depend on unit order", { data(kansas) syn <- augsynth(lngdpcapita ~ treated | log(revstatecapita), abb, year_qtr, kansas, progfunc = "None", scm = T) syn2 <- augsynth(lngdpcapita ~ treated | log(revstatecapita), fips, year_qtr, kansas %>% arrange(desc(fips)), progfunc = "None", scm = T) expect_equal(predict(syn), predict(syn2)) asyn <- augsynth(lngdpcapita ~ treated | log(revstatecapita), fips, year_qtr, kansas, progfunc = "ridge", scm = T) asyn2 <- augsynth(lngdpcapita ~ treated | log(revstatecapita), fips, year_qtr, kansas %>% arrange(desc(fips)), progfunc = "ridge", scm = T) expect_equal(c(asyn$weights), c(asyn2$weights)) expect_equal(predict(asyn), predict(asyn2)) }) test_that("augsynth runs single_synth with progfunc = 'ridge' when there is a single treated unit and no progfunc is specified", { data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) syn <- augsynth(gdpcap ~ trt, regionno, year, basque, scm = T) syn_single <- single_augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "ridge", scm = T, t_int = 1975) expect_equal(syn$weights, syn_single$weights) }) ================================================ FILE: tests/testthat/test_format.R ================================================ context("Test data formatting") library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) test_that("format_data creates matrices with the right dimensions", { dat <- format_data(quo(gdpcap), quo(trt), quo(regionno), quo(year),1975, basque) test_dim <- function(obj, d) { expect_equivalent(dim(obj), d) } test_dim(dat$X, c(17, 20)) expect_equivalent(length(dat$trt), 17) test_dim(dat$y, c(17, 23)) } ) test_that("format_synth creates matrices with the right dimensions", { dat <- format_data(quo(gdpcap), quo(trt), quo(regionno), quo(year),1975, basque) syn_dat <- format_synth(dat$X, dat$trt, dat$y) test_dim <- function(obj, d) { expect_equivalent(dim(obj), d) } test_dim(syn_dat$Z0, c(20, 16)) test_dim(syn_dat$Z1, c(20, 1)) test_dim(syn_dat$Y0plot, c(43, 16)) test_dim(syn_dat$Y1plot, c(43, 1)) expect_equivalent(syn_dat$Z1, syn_dat$X1) expect_equivalent(syn_dat$Z0, syn_dat$X0) } ) test_that("multisynth throws errors when there aren't enough pre-treatment times", { basque2 <- basque %>% mutate(trt = case_when( regionno == 16 ~ 1, year >= 1975 & regionno == 17 ~ 1, TRUE ~ 0) ) %>% filter(regionno != 1) # error from always treated unit expect_warning( expect_error( multisynth(gdpcap ~ trt, regionno, year, basque2) ) ) basque2 <- basque %>% mutate(trt = case_when( regionno == 16 & year >= 1956 ~ 1, year >= 1975 & regionno == 17 ~ 1, TRUE ~ 0) ) %>% filter(regionno != 1) # error from one pre-treatment outcome and fixedeff = T expect_warning( expect_error(multisynth(gdpcap ~ trt, regionno, year, basque2)) ) # no error from one pre-treatment outcome and fixedeff = F, just warning expect_warning(multisynth(gdpcap ~ trt, regionno, year, basque2, fixedeff = F)) }) test_that("formatting for staggered adoption doesn't care about order of time in data", { basque2 <- basque %>% # slice(sample(1:n())) %>% mutate(trt = case_when((regionno == 17) & (year >= 1975) ~ 1, (regionno == 16) & (year >= 1980) ~ 1, TRUE ~ 0)) dat <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque2) # true treatment times true_trt <- c(1975, 1980) - min(basque$year) expect_equal(true_trt, sort(dat$trt[is.finite(dat$trt)])) basque2 <- basque %>% slice(sample(1:n())) %>% mutate(trt = case_when((regionno == 17) & (year >= 1975) ~ 1, (regionno == 16) & (year >= 1980) ~ 1, TRUE ~ 0)) dat <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque2) expect_equal(true_trt, sort(dat$trt[is.finite(dat$trt)])) }) test_that("augsynth exits with error message if there are no never treated units", { basque2 <- basque %>% # slice(sample(1:n())) %>% mutate(trt = case_when((regionno == 17) & (year >= 1975) ~ 1, (year >= 1997) ~ 1, TRUE ~ 0)) expect_error(augsynth(gdpcap ~ trt, regionno, year, basque2), "1996") }) ================================================ FILE: tests/testthat/test_general.R ================================================ context("Generally testing the workflow for augsynth") library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) test_that("SCM gives the right answer", { syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="None", scm=T, t_int=1975) ## average att estimate is as expected expect_equal(-.3686, mean(summary(syn, inf = F)$att$Estimate), tolerance=1e-4) ## level of balance is as expected expect_equal(.377, syn$l2_imbalance, tolerance=1e-3) } ) test_that("SCM finds the correct t_int and gives the right answer", { syn1 <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="None", scm=T) syn2 <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "None", scm = T, t_int = 1975) ## average att estimate is as expected expect_equal(mean(summary(syn1, inf = F)$att$Estimate), mean(summary(syn2, inf = F)$att$Estimate), tolerance=1e-4) ## level of balance is as expected expect_equal(syn1$l2_imbalance, syn2$l2_imbalance, tolerance=1e-3) } ) test_that("Ridge ASCM gives the right answer", { asyn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="Ridge", scm=T, lambda=8) ## average att estimate is as expected expect_equal(-.3696, mean(summary(asyn, inf = F)$att$Estimate), tolerance=1e-3) ## level of balance is as expected expect_equal(.373, asyn$l2_imbalance, tolerance=1e-3) } ) test_that("SCM after residualizing covariates gives the right answer", { covsyn_resid <- augsynth(gdpcap ~ trt | invest + popdens, regionno, year, basque, progfunc = "None", scm = T, residualize = T) ## average att estimate is as expected expect_equal(-.1443, mean(summary(covsyn_resid, inf = F)$att$Estimate), tolerance = 1e-3) ## level of balance is as expected expect_equal(.3720, covsyn_resid$l2_imbalance, tolerance=1e-3) # perfect auxiliary covariate balance expect_equal(0, covsyn_resid$covariate_l2_imbalance, tolerance=1e-3) } ) test_that("Ridge ASCM with covariates jointly gives the right answer", { covsyn_noresid <- augsynth(gdpcap ~ trt | invest + popdens, regionno, year, basque, progfunc = "None", scm = T, residualize = F) ## average att estimate is as expected expect_equal(-.3345, mean(summary(covsyn_noresid, inf = F)$att$Estimate), tolerance = 1e-3) ## level of balance is as expected expect_equal(0.659, covsyn_noresid$l2_imbalance, tolerance=1e-3) # covariate balance is as expected expect_equal(0.884, covsyn_noresid$covariate_l2_imbalance, tolerance=1e-3) } ) test_that("Ridge ASCM after residualizing covariates gives the right answer", { covascm_resid <- augsynth(gdpcap ~ trt | invest + popdens, regionno, year, basque, progfunc = "Ridge", scm = T, lambda = 1, residualize = T) ## average att estimate is as expected expect_equal(-.123, mean(summary(covascm_resid, inf = F)$att$Estimate), tolerance = 1e-3) ## level of balance is as expected expect_equal(.347, covascm_resid$l2_imbalance, tolerance=1e-3) # perfect auxiliary covariate balance expect_equal(0, covascm_resid$covariate_l2_imbalance, tolerance=1e-3) } ) test_that("Ridge ASCM with covariates jointly gives the right answer", { covascm_noresid <- augsynth(gdpcap ~ trt | invest + popdens, regionno, year, basque, progfunc = "Ridge", scm = T, lambda = 1, residualize = F) ## average att estimate is as expected expect_equal(-.267, mean(summary(covascm_noresid, inf = F)$att$Estimate), tolerance = 1e-3) ## level of balance is as expected expect_equal(0.419, covascm_noresid$l2_imbalance, tolerance=1e-3) # covariate balance is as expected expect_equal(0.084, covascm_noresid$covariate_l2_imbalance, tolerance=1e-3) } ) test_that("Warning given when inputting an unused argument", { expect_warning( augsynth(gdpcap ~ trt| invest + popdens, regionno, year, basque, progfunc="Ridge", scm=T, lambda=8, t_int = 1975, bad_param = "Unused input parameter"), ) }) ================================================ FILE: tests/testthat/test_lambda.R ================================================ context("Testing lambda tuning if ridge is true.") library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) test_that("Lambda sequence is generated correctly", { syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="Ridge", scm=T) lambdas <- syn$lambdas expect_equivalent(lambdas[length(lambdas)] / lambdas[1], 1e-8) expect_equivalent(lambdas[2] / lambdas[1], lambdas[3] / lambdas[2]) }) test_that("Smallest lambda is chosen", { syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="Ridge", scm=T, min_1se = F) expect_equivalent(syn$lambda, syn$lambdas[which.min(syn$lambda_errors)]) }) test_that("Largest lambda within 1 SE of minimum is chosen", { syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="Ridge", scm=T, min_1se = T) min_idx <- which.min(syn$lambda_errors) min_1se <- max(syn$lambdas[syn$lambda_errors <= syn$lambda_errors[min_idx] + syn$lambda_errors_se[min_idx]]) expect_equivalent(syn$lambda, min_1se) }) test_that("Max lambda is in list of returned lambas (optional parameters are going through)", { syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="Ridge", scm=T, lambda_max = 100) lambdas <- syn$lambdas expect_equivalent(lambdas[1], 100) }) ================================================ FILE: tests/testthat/test_load_data.R ================================================ context("Testing that we can load packaged data") test_that("kansas data loads", { expect_error(data(kansas), NA) }) ================================================ FILE: tests/testthat/test_multiple_outcomes.R ================================================ context("Generally testing the workflow for synth with multiple outcomes") library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1), gdpcap_sq = gdpcap ^ 2) %>% filter(regionno != 1) test_that("augsynth and augsynth_multiout are the same without augmentation", { syn1 <- augsynth_multiout(gdpcap + gdpcap_sq ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T) syn2 <- augsynth(gdpcap + gdpcap_sq ~ trt, regionno, year, basque, progfunc="None", scm=T) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=3e-4) # estimates are the same expect_equal(c(predict(syn1, att=F)), c(predict(syn2, att = F)), tolerance=5e-5) ## level of balance is same expect_equal(syn1$l2_imbalance, syn2$l2_imbalance, tolerance=1e-5) }) test_that("augsynth and augsynth_multiout are the same with ridge augmentation", { syn1 <- augsynth_multiout(gdpcap + gdpcap_sq ~ trt, regionno, year, 1975, basque, progfunc="Ridge", scm=T, lambda = 10) syn2 <- augsynth(gdpcap + gdpcap_sq ~ trt, regionno, year, basque, progfunc="Ridge", scm=T, lambda = 10) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=3e-4) # estimates are the same expect_equal(c(predict(syn1, att=F)), c(predict(syn2, att = F)), tolerance=5e-5) ## level of balance is same expect_equal(syn1$l2_imbalance, syn2$l2_imbalance, tolerance=1e-5) }) test_that("augsynth and augsynth_multiout are the same with fixed effects augmentation", { syn1 <- augsynth_multiout(gdpcap + gdpcap_sq ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T, fixedeff = T) syn2 <- augsynth(gdpcap + gdpcap_sq ~ trt, regionno, year, basque, progfunc="None", scm=T, fixedeff = T) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=3e-4) # estimates are the same expect_equal(c(predict(syn1, att=F)), c(predict(syn2, att = F)), tolerance=5e-5) ## level of balance is same expect_equal(syn1$l2_imbalance, syn2$l2_imbalance, tolerance=1e-5) }) test_that("single_augsynth and augsynth_multiout are the same for one outcome", { syn1 <- augsynth_multiout(gdpcap ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T, combine_method = "concat") syn2 <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="None", scm=T) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=3e-4) # estimates are the same expect_equal(c(predict(syn1, att=F)), unname(predict(syn2, att = F)), tolerance=5e-5) ## level of balance is same expect_equal(syn1$l2_imbalance, syn2$l2_imbalance, tolerance=1e-5) }) test_that("single_augsynth and augsynth_multiout are the same for one outcome with ridge augmentation",{ syn1 <- augsynth_multiout(gdpcap ~ trt, regionno, year, 1975, basque, progfunc="Ridge", scm=T, combine_method = "concat") syn2 <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="Ridge", scm=T) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=3e-4) # estimates are the same expect_equal(c(predict(syn1, att=F)), unname(predict(syn2, att = F)), tolerance=5e-5) ## level of balance is same expect_equal(syn1$l2_imbalance, syn2$l2_imbalance, tolerance=1e-5) }) test_that("single_augsynth and augsynth_multiout are the same for one outcome with fixed effect augmentation", { syn1 <- augsynth_multiout(gdpcap ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T, fixedeff = T, combine_method = "concat") syn2 <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="None", scm=T, fixedeff = T) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=3e-4) # estimates are the same expect_equal(c(predict(syn1, att=F)), unname(predict(syn2, att = F)), tolerance=5e-5) ## level of balance is same expect_equal(syn1$l2_imbalance, syn2$l2_imbalance, tolerance=1e-5) }) test_that("Averaging outcomes with augsynth_multiout gives correct results without fixed effects", { sds <- basque %>% filter(trt == 0, year < 1975) %>% summarise(across(c(gdpcap, gdpcap_sq), sd)) %>% rename(gdpcap_sd = gdpcap, gdpcap_sq_sd = gdpcap_sq) basque %>% bind_cols(sds) %>% mutate(avg = gdpcap / gdpcap_sd + gdpcap_sq / gdpcap_sq_sd, avg2 = gdpcap + gdpcap_sq) -> bas_avg syn1 <- augsynth_multiout(gdpcap + gdpcap_sq ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T, fixedeff = F, combine_method = "avg") syn2 <- augsynth(avg ~ trt, regionno, year, bas_avg, progfunc="None", scm=T, fixedeff = F) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=3e-4) }) test_that("Averaging outcomes with augsynth_multiout gives correct results with fixed effects", { sds <- basque %>% filter(trt == 0, year < 1975) %>% summarise(across(c(gdpcap, gdpcap_sq), sd)) %>% rename(gdpcap_sd = gdpcap, gdpcap_sq_sd = gdpcap_sq) basque %>% bind_cols(sds) %>% mutate(avg = gdpcap / gdpcap_sd + gdpcap_sq / gdpcap_sq_sd, avg2 = gdpcap + gdpcap_sq) -> bas_avg syn1 <- augsynth_multiout(gdpcap + gdpcap_sq ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T, fixedeff = T, combine_method = "avg") syn2 <- augsynth(avg ~ trt, regionno, year, bas_avg, progfunc="None", scm=T, fixedeff = T) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=1e-3) }) test_that("Concatenating outcomes with augsynth_multiout gives correct results without fixed effects", { sds <- basque %>% filter(trt == 0, year < 1975) %>% summarise(across(c(gdpcap, gdpcap_sq), sd)) %>% rename(gdpcap_sd = gdpcap, gdpcap_sq_sd = gdpcap_sq) basque %>% bind_cols(sds) %>% mutate(gdpcap = gdpcap / gdpcap_sd, gdpcap_sq = gdpcap_sq / gdpcap_sq_sd) %>% select(gdpcap, gdpcap_sq, trt, year, regionno) %>% pivot_longer(-c(regionno, year, trt)) %>% mutate(year = ifelse(name == "gdpcap", year, year - 0.5)) -> bas_cat syn1 <- augsynth_multiout(gdpcap + gdpcap_sq ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T, fixedeff = F, combine_method = "concat") syn2 <- augsynth(value ~ trt, regionno, year, bas_cat, progfunc="None", scm=T, fixedeff = F) # weights are the same expect_equal(c(syn1$weights), c(syn2$weights), tolerance=5e-4) }) ================================================ FILE: tests/testthat/test_multisynth.R ================================================ context("Generally testing the workflow for multisynth") library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) test_that("augsynth and multisynth give the same answer for a single treated unit and no augmentation", { syn <- single_augsynth(gdpcap ~ trt, regionno, year, 1975, basque, progfunc="None", scm=T, fixedeff = F) msyn <- multisynth(gdpcap ~ trt, regionno, year, basque, nu = 0, fixedeff = F, scm=T, eps_rel=1e-5, eps_abs=1e-5) # weights are the same-ish expect_equal(c(syn$weights), c(msyn$weights[-16]), tolerance=3e-4) # estimates are the same-ish pred_msyn <- predict(msyn, att=F)[,1] pred_msyn <- pred_msyn[-length(pred_msyn)] expect_equal(unname(predict(syn, att=F)), pred_msyn, tolerance=5e-5) ## level of balance is same-ish expected expect_equal(syn$l2_imbalance, msyn$avg_l2, tolerance=1e-5) } ) test_that("Pooling doesn't matter for a single treated unit", { nopool <- multisynth(gdpcap ~ trt, regionno, year, basque, nu = 0, scm=T, eps_rel=1e-5, eps_abs=1e-5) allpool <- multisynth(gdpcap ~ trt, regionno, year, basque, nu = 1, scm=T, eps_rel=1e-5, eps_abs=1e-5) # weights are the same expect_equal(nopool$weights, allpool$weights) # estimates are the same expect_equal(predict(nopool), predict(allpool)) ## level of balance is same-ish expected expect_equal(allpool$ind_l2, nopool$ind_l2) } ) test_that("Separate synth is the same as fitting separate synths", { basque2 <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, !regionno %in% c(16, 17) ~ 0, regionno %in% c(16, 17) ~ 1)) %>% filter(regionno != 1) basque2 %>% filter(regionno != 16) %>% single_augsynth(gdpcap ~ trt, regionno, year, 1975, ., progfunc="None", scm=T) -> scm17 basque2 %>% filter(regionno != 17) %>% single_augsynth(gdpcap ~ trt, regionno, year, 1975, ., progfunc="None", scm=T) -> scm16 msyn <- multisynth(gdpcap ~ trt, regionno, year, basque2, nu = 0, scm=T, eps_rel=1e-5, eps_abs=1e-5, fixedeff = F) # weights are the same-ish sscm_weights <- unname(c(scm17$weights)) mscm_weights <- unname(c(msyn$weights[-c(15, 16), 2])) expect_equal(sscm_weights, mscm_weights, tolerance=3e-2) expect_equal(rownames(scm17$weights), rownames(as.matrix(msyn$weights[-c(15, 16), 2]))) # expect_equal(c(scm16$weights), c(msyn$weights[-c(15, 16), 1]), tolerance=3e-2) # estimates are the same-ish pred_msyn <- predict(msyn, att=F) pred_msyn <- pred_msyn[-nrow(pred_msyn), ] expect_equal(unname(predict(scm17, att=F)), pred_msyn[, 3], tolerance=5e-3) expect_equal(unname(predict(scm16, att=F)), pred_msyn[, 2], tolerance=5e-3) } ) test_that("Limiting number of lags works", { basque2 <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, !regionno %in% c(16, 17) ~ 0, regionno %in% c(16, 17) ~ 1)) %>% filter(regionno != 1) expect_error( multisynth(gdpcap ~ trt, regionno, year, basque2, nu = 0, scm=T, eps_rel=1e-5, eps_abs=1e-5, n_lags =3), NA ) } ) test_that("L2 imbalance computed correctly", { basque2 <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, !regionno %in% c(16, 17) ~ 0, regionno %in% c(16, 17) ~ 1)) %>% filter(regionno != 1) msyn <- multisynth(gdpcap ~ trt, regionno, year, basque2, scm=T, eps_rel=1e-5, eps_abs=1e-5) glbl <- sqrt(mean(msyn$imbalance[,1]^2)) ind <- sqrt(mean( apply(msyn$imbalance[, -1], 2, function(x) sum(x ^ 2) / sum(x != 0)))) avg_ind <- mean(apply(msyn$imbalance[,-1, drop = F], 2, function(x) sqrt(sum(x ^ 2)))) expect_equal(glbl, msyn$global_l2) expect_equal(avg_ind, msyn$avg_l2) expect_equal(ind, msyn$ind_l2) }) test_that("V matrix is equivalent to hard thresholding", { basque2 <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, !regionno %in% c(16, 17) ~ 0, regionno %in% c(16, 17) ~ 1)) %>% filter(regionno != 1) V <- c(numeric(10), rep(1,5)) msyn1 <- multisynth(gdpcap ~ trt, regionno, year, basque2, nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8, n_lags = 15, V = V) msyn2 <- multisynth(gdpcap ~ trt, regionno, year, basque2, nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8, n_lags = 5) expect_equal(msyn1$weights, msyn2$weights, tolerance = 1e-5) expect_equal(msyn1$global_l2, msyn2$global_l2, tolerance = 1e-5) expect_equal(msyn1$avg_l2, msyn2$avg_l2, tolerance = 1e-5) } ) test_that("V matrix is the same for single and multi synth", { V <- exp(seq(log(1e-3), log(1), length.out = 20)) syn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "none", scm=T, V = V) msyn <- multisynth(gdpcap ~ trt, regionno, year, basque, scm=T, eps_rel=1e-8, eps_abs=1e-8, V = V, fixed = F, nu = 0) expect_equal(as.numeric(syn$weights), as.numeric(msyn$weights[-16, ]), tolerance = 1e-3) } ) test_that("multisynth doesn't depend on unit order", { basque2 <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, !regionno %in% c(16, 17) ~ 0, regionno %in% c(16, 17) ~ 1)) %>% filter(regionno != 1) msyn <- multisynth(gdpcap ~ trt, regionno, year, basque2, nu = 0, fixedeff = F, scm=T, eps_rel=1e-5, eps_abs=1e-5) msyn2 <- multisynth(gdpcap ~ trt, regionno, year, basque2 %>% arrange(desc(regionno)), nu = 0, fixedeff = F, scm=T, eps_rel=1e-5, eps_abs=1e-5) # weights are the same expect_equal(c(msyn$weights), c(msyn2$weights)) # estimates are the same expect_equal(predict(msyn), predict(msyn2)) } ) test_that("multisynth doesn't depend on time order", { basque2 <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, !regionno %in% c(16, 17) ~ 0, regionno %in% c(16, 17) ~ 1)) %>% filter(regionno != 1) msyn <- multisynth(gdpcap ~ trt, regionno, year, basque2, nu = 0, fixedeff = F, scm=T, eps_rel=1e-5, eps_abs=1e-5) msyn2 <- multisynth(gdpcap ~ trt, regionno, year, basque2 %>% arrange(desc(year)), nu = 0, fixedeff = F, scm=T, eps_rel=1e-5, eps_abs=1e-5) # weights are the same expect_equal(c(msyn$weights), c(msyn2$weights)) # estimates are the same expect_equal(predict(msyn), predict(msyn2)) } ) ================================================ FILE: tests/testthat/test_multisynth_covariates.R ================================================ context("Testing multisynth with covariates") set.seed(1011) library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when((regionno == 17) & (year >= 1975) ~ 1, (regionno == 16) & (year >= 1980) ~ 1, TRUE ~ 0)) %>% filter(regionno != 1) regions <- basque %>% distinct(regionno) %>% pull(regionno) test_that("Getting eligible donor units by exact matching works", { # binary variable to split on fake_bin <- sample(c(0, 1), length(regions), replace = T) basque %>% inner_join( data.frame(regionno = regions, Z = fake_bin) %>% mutate(Z = case_when(regionno == 17 ~ 0, regionno == 16 ~ 1, TRUE ~ Z) ), by = "regionno") -> basque2 msyn <- multisynth(gdpcap ~ trt | 0 | 0| Z, regionno, year, basque2, nu = 0, scm = T) # check that there is actually no weight on donors with different Z expect_equal(sum(msyn$weights[fake_bin == 1, 1]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 1, 2]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 2]), 1, tolerance = 1e-3) # again with fixed effect msyn <- multisynth(gdpcap ~ trt | 0 | 0 | Z, regionno, year, basque2, nu = 0, scm = T, fixedeff = T) # check that there is actually no weight on donors with different Z expect_equal(sum(msyn$weights[fake_bin == 1, 1]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 1, 2]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 2]), 1, tolerance = 1e-3) }) test_that("Getting eligible donor units by exact matching works with factors", { # binary variable to split on fake_fac <- sample(c(0, 1, 3), length(regions), replace = T) basque %>% inner_join( data.frame(regionno = regions, Z = fake_fac) %>% mutate( Z = case_when(regionno == 17 ~ 0, regionno == 16 ~ 1, TRUE ~ Z), Z = as.factor(Z) ), by = "regionno") -> basque2 msyn <- multisynth(gdpcap ~ trt | 0 | 0 | Z, regionno, year, basque2, nu = 0, scm = T) # check that there is actually no weight on donors with different Z expect_equal(sum(msyn$weights[fake_fac == 1, 1]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 0, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 3, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 1, 2]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 0, 2]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 3, 2]), 0, tolerance = 1e-4) # again with fixed effect msyn <- multisynth(gdpcap ~ trt | 0 | 0 |Z, regionno, year, basque2, nu = 0, scm = T, fixedeff = T, how_match = "exact") # check that there is actually no weight on donors with different Z expect_equal(sum(msyn$weights[fake_fac == 1, 1]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 0, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 3, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 1, 2]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 0, 2]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_fac == 3, 2]), 0, tolerance = 1e-4) }) test_that("K-NN finds the right number of neighbors", { # variables to match on Z <- matrix(rnorm(length(regions) * 3), ncol = 3) basque %>% inner_join( data.frame(regionno = regions, Z1 = Z[, 1], Z2 = Z[, 2], Z3 = Z[, 3]), by = "regionno") -> basque2 dat <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque2) k <- 3 donors <- get_eligible_donors(dat$trt, F, 100) knn_donors <- get_knn_donors(dat$trt, Z, donors, k) expect_true(all(sapply(knn_donors, sum) == k)) k <- 20 expect_warning(get_knn_donors(dat$trt, Z, donors, k)) }) test_that("Getting eligible donor units by knn matching works", { # variables to match on Z <- matrix(rnorm(length(regions) * 3), ncol = 3) basque %>% inner_join( data.frame(regionno = regions, Z1 = Z[, 1], Z2 = Z[, 2], Z3 = Z[, 3]), by = "regionno") -> basque2 # error if no k is supplied expect_error(multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3, regionno, year, basque2, scm = T, how_match = "knn"), "Number of neighbors for knn not selected, please choose k.") k <- 5 msyn <- multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3, regionno, year, basque2, scm = T, how_match = "knn", k = k) # check that at most k units recieve non-0 weight expect_lte(sum(msyn$weights[, 1] != 0), k) expect_lte(sum(msyn$weights[, 2] != 0), k) # again with fixed effect msyn <- multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3, regionno, year, basque2, scm = T, fixedeff = T, how_match = "knn", k = k) # check that all but k units recieve exactly 0 weight expect_lte(sum(msyn$weights[, 1] != 0), k) expect_lte(sum(msyn$weights[, 2] != 0), k) # without synth weights, weights are uniform k <- 2 unimatch <- multisynth(gdpcap ~ trt| 0 | Z1 + Z2 + Z3, regionno, year, basque2, scm = T, how_match = "knn", k = k, lambda = 1e10) expect_equal(unimatch$weights[unimatch$weights != 0 ], rep(1 / k, 2 * k)) # matching with more neighbors is worse unimatch2 <- multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3, regionno, year, basque2, scm = T, how_match = "knn", k = 2.5 * k, lambda = 1e10) trtZ <- Z[regions %in% c(16, 17),] imbal1 <- sqrt(sum(sapply(1:2, function(i) sum(unimatch$weights[,i] * (trtZ[i,] - Z) ^ 2 )))) imbal2 <- sqrt(sum(sapply(1:2, function(i) sum(unimatch2$weights[,i] * (trtZ[i,] - Z) ^ 2 )))) expect_lt(imbal1, imbal2) }) test_that("Getting eligible donor units by exact and knn matching works", { # binary variable to split on fake_bin <- sample(c(0, 1), length(regions), replace = T) # variables to match on Z <- matrix(rnorm(length(regions) * 3), ncol = 3) basque %>% inner_join( data.frame(regionno = regions, Z1 = Z[, 1], Z2 = Z[, 2], Z3 = Z[, 3], Z_bin = fake_bin) %>% mutate(Z_bin = case_when(regionno == 17 ~ 0, regionno == 16 ~ 1, TRUE ~ Z_bin)), by = "regionno") -> basque2 # error if no k is supplied expect_error(multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3 | Z_bin, regionno, year, basque2, scm = T, how_match = "knn"), "Number of neighbors for knn not selected, please choose k.") k <- 3 msyn <- multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3 | Z_bin, regionno, year, basque2, scm = T, how_match = "knn", k = k) # check that there is actually no weight on donors with different Z expect_equal(sum(msyn$weights[fake_bin == 1, 1]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 1, 2]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 2]), 1, tolerance = 1e-4) # check that at most k units recieve non-0 weight expect_lte(sum(msyn$weights[, 1] != 0), k) expect_lte(sum(msyn$weights[, 2] != 0), k) # again with fixed effect msyn <- multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3 | Z_bin, regionno, year, basque2, scm = T, fixedeff = T, how_match = "knn", k = k) # check that at most k units recieve non-0 weight expect_lte(sum(msyn$weights[, 1] != 0), k) expect_lte(sum(msyn$weights[, 2] != 0), k) # check that there is actually no weight on donors with different Z expect_equal(sum(msyn$weights[fake_bin == 1, 1]), 1, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 1]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 1, 2]), 0, tolerance = 1e-4) expect_equal(sum(msyn$weights[fake_bin == 0, 2]), 1, tolerance = 1e-4) k <- 3 # without synth weights, weights are uniform unimatch <- multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3 | Z_bin, regionno, year, basque2, scm = T, how_match = "knn", k = k, lambda = 1e10) expect_equal(unimatch$weights[unimatch$weights != 0 ], rep(1 / k, 2 * k)) # matching without exact gives better matches unimatch2 <- multisynth(gdpcap ~ trt | 0 | Z1 + Z2 + Z3, regionno, year, basque2, scm = T, how_match = "knn", k = k, lambda = 1e10) trtZ <- Z[regions %in% c(16, 17),] imbal1 <- sqrt(sum(sapply(1:2, function(i) sum(unimatch$weights[,i] * (trtZ[i,] - Z) ^ 2 )))) imbal2 <- sqrt(sum(sapply(1:2, function(i) sum(unimatch2$weights[,i] * (trtZ[i,] - Z) ^ 2 )))) expect_lt(imbal2, imbal1) }) test_that("An error is thrown if trying to match with time cohorts or the formula is wrong", { # binary variable to split on fake_bin <- sample(c(0, 1), length(regions), replace = T) # variables to match on Z <- matrix(rnorm(length(regions) * 3), ncol = 3) basque %>% inner_join( data.frame(regionno = regions, Z1 = Z[, 1], Z2 = Z[, 2], Z3 = Z[, 3], Z_bin = fake_bin) %>% mutate(Z_bin = case_when(regionno == 17 ~ 0, regionno == 16 ~ 1, TRUE ~ Z_bin)), by = "regionno") %>% mutate(trt = case_when((regionno == 17) & (year >= 1975) ~ 1, (regionno == 16) & (year >= 1975) ~ 1, TRUE ~ 0)) %>% filter(regionno != 1)-> basque2 expect_error(multisynth(gdpcap ~ trt | Z1 + Z2, regionno, year, basque2, time_cohort = T), NA) expect_error(multisynth(gdpcap ~ trt | Z1 + Z2 | 0 | Z_bin, regionno, year, basque2, time_cohort = T)) }) test_that("multisynth with covariates doesn't depend on unit or time order ", { data <- read.csv("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/WGWMAV/3UHTLP", sep="\t") data %>% filter(!State %in% c("DC", "WI"), year >= 1959, year <= 1997) %>% mutate(YearCBrequired = ifelse(is.na(YearCBrequired), Inf, YearCBrequired), cbr = 1 * (year >= YearCBrequired)) -> analysis_df data %>% select(State, year, agr, pnwht, purban, perinc, studteachratio) %>% group_by(State) %>% summarise(perinc_1959 = perinc[year == 1959], studteachratio_1959 = studteachratio[year == 1959]) %>% # filter to lower 48 where we have data filter(!State %in% c("AK", "HI")) -> cov_data analysis_df %>% inner_join(cov_data, by = "State") -> analysis_df_covs msyn <- multisynth(lnppexpend ~ cbr | perinc_1959 + studteachratio_1959, State, year, analysis_df_covs) msyn_rev_unit <- multisynth(lnppexpend ~ cbr | perinc_1959 + studteachratio_1959, State, year, analysis_df_covs %>% arrange(desc(State))) msyn_rev_time <- multisynth(lnppexpend ~ cbr | perinc_1959 + studteachratio_1959, State, year, analysis_df_covs %>% arrange(desc(year))) expect_equal(predict(msyn), predict(msyn_rev_time)) expect_equal(predict(msyn), predict(msyn_rev_unit)) }) ================================================ FILE: tests/testthat/test_outcome_models.R ================================================ context("Testing that augmenting synth with different models loads and runs") library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) test_that("Augmenting synth with glmnet runs", { if(!requireNamespace("glmnet", quietly = TRUE)) { ## should fail because glmnet isn't installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="EN", scm=T), "you must install the glmnet package") ## install glmnet install.packages("glmnet", repos = "http://cran.us.r-project.org") } ## should run because glmnet is installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="EN", scm=T), NA) } ) test_that("Augmenting synth with random forest runs", { if(!requireNamespace("randomForest", quietly = TRUE)) { ## should fail because randomForest isn't installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="RF", scm=T), "you must install the randomForest package") ## install randomForest install.packages("randomForest", repos = "http://cran.us.r-project.org") } ## should run because randomForest is installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="RF", scm=T), NA) } ) test_that("Augmenting synth with gsynth runs and produces the correct result", { if(!requireNamespace("gsynth", quietly = TRUE)) { ## should fail because gsynth isn't installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="GSYN", scm=T), "you must install the gsynth package") ## install gsynth install.packages("gsynth", repos = "http://cran.us.r-project.org") } ## should run because gsynth is installed expect_error( augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "GSYN", scm = T, CV = 0, r = 4), NA) asyn_gsyn <- augsynth(gdpcap ~ trt, regionno, year, basque, progfunc = "GSYN", scm = F, CV = 0, r = 4) expect_equal(summary(asyn_gsyn, inf = F)$average_att$Estimate, -0.1444637, tolerance=1e-4) } ) test_that("Augmenting synth with MCPanel runs", { if(!requireNamespace("MCPanel", quietly = TRUE)) { ## should fail because MCPanel isn't installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="MCP", scm=T), "you must install the MCPanel package") } else { ## should run because MCPanel is installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="MCP", scm=T), NA) } } ) test_that("Augmenting synth with CausalImpact runs", { if(!requireNamespace("CausalImpact", quietly = TRUE)) { ## should fail because CausalImpact isn't installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="CausalImpact", scm=T), "you must install the CausalImpact package") ## install CausalImpact install.packages("CausalImpact", repos = "http://cran.us.r-project.org") } ## should run because CausalImpact is installed expect_error(augsynth(gdpcap ~ trt, regionno, year, basque, progfunc="CausalImpact", scm=T), NA) } ) ================================================ FILE: tests/testthat/test_time_cohort.R ================================================ context("Test time cohort vs unit level analysis") library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when(year < 1975 ~ 0, regionno != 17 ~0, regionno == 17 ~ 1)) %>% filter(regionno != 1) test_that("multisynth at the unit level and time cohort level give the same answer for a single treated unit and no augmentation", { msyn_unit <- multisynth(gdpcap ~ trt, regionno, year, basque, nu = 0, time_cohort = F, scm = T, eps_rel = 1e-5, eps_abs = 1e-5) msyn_time <- multisynth(gdpcap ~ trt, regionno, year, basque, nu = 0, time_cohort = T, scm = T, eps_rel = 1e-5, eps_abs = 1e-5) # weights are the same-ish expect_equal(c(msyn_unit$weights), c(msyn_time$weights), tolerance=3e-2) # estimates are the same-ish expect_equal(c(predict(msyn_unit, att=F)), c(predict(msyn_time, att=F)), tolerance=5e-3) ## level of balance is same-ish expected expect_equal(msyn_unit$ind_l2, msyn_time$ind_l2, tolerance=1e-3) } ) test_that("multisynth at the time cohort level runs", { expect_error(msyn_time <- multisynth(gdpcap ~ trt, regionno, year, basque, time_cohort = T, scm = T), NA) } ) ================================================ FILE: tests/testthat/test_unbalanced_multisynth.R ================================================ context("Test multisynth for unbalanced panels") set.seed(1011) library(Synth) data(basque) basque <- basque %>% mutate(trt = case_when((regionno == 17) & (year >= 1975) ~ 1, (regionno == 16) & (year >= 1980) ~ 1, TRUE ~ 0)) %>% filter(regionno != 1) regions <- basque %>% distinct(regionno) %>% pull(regionno) test_that("Data formatting creates NAs correctly", { # drop a time period for unit 17 basque %>% filter(regionno != 17 | year != 1970) -> basque_mis dat_format <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque_mis) expect_true(is.na(dat_format$X[regions == 17, "1970"])) }) test_that("Non-NA donors are chosen correctly with missing pre-treatment", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(15, 17, 18) | year != 1970) -> basque_mis dat_format <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque_mis) n_lags <- ncol(dat_format$X) n_leads <- ncol(dat_format$y) donors <- get_nona_donors(dat_format$X, dat_format$y, dat_format$trt, n_lags, n_leads, F) expect_true(!all(donors[[1]][regions %in% c(15, 17, 18) ])) expect_true(all(donors[[1]][!regions %in% c(15, 17, 18) ])) expect_true(all(donors[[2]])) }) test_that("Non-NA donors are chosen correctly with missing post-treatment", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(15, 17, 18) | !year %in% c(1990)) -> basque_mis dat_format <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque_mis) n_lags <- ncol(dat_format$X) n_leads <- ncol(dat_format$y) donors <- get_nona_donors(dat_format$X, dat_format$y, dat_format$trt, n_lags, n_leads, F) expect_true(!all(donors[[1]][regions %in% c(15, 17, 18) ])) expect_true(all(donors[[1]][!regions %in% c(15, 17, 18) ])) expect_true(all(donors[[2]])) }) test_that("Non-NA donors are chosen correctly with missing pre- and post-treatment", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(15, 17, 18) | !year %in% c(1970, 1990)) -> basque_mis dat_format <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque_mis) n_lags <- ncol(dat_format$X) n_leads <- ncol(dat_format$y) donors <- get_nona_donors(dat_format$X, dat_format$y, dat_format$trt, n_lags, n_leads, F) expect_true(!all(donors[[1]][regions %in% c(15, 17, 18) ])) expect_true(all(donors[[1]][!regions %in% c(15, 17, 18) ])) expect_true(all(donors[[2]])) }) test_that("Non-NA donors are chosen correctly with missing pre- and post-treatment and not considering all leads and lags", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(15, 17, 18) | !year %in% c(1970, 1990)) -> basque_mis dat_format <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque_mis) n_lags <- ncol(dat_format$X) n_leads <- ncol(dat_format$y) donors <- get_nona_donors(dat_format$X, dat_format$y, dat_format$trt, 5, 5, F) expect_true(all(donors[[1]])) expect_true(all(donors[[2]])) }) test_that("Separate synth with missing treated unit time drops the time", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(17) | year != 1970) -> basque_mis msyn <- multisynth(gdpcap ~ trt, regionno, year, basque_mis, fixedeff = F, nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) msyn2 <- multisynth(gdpcap ~ trt, regionno, year, basque %>% filter(year != 1970), fixedeff = F, nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) expect_equal(msyn$weights[,2], msyn2$weights[,2], tolerance = 1e-6) }) test_that("Separate synth with missing control unit time drops control unit", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(18) | year != 1970) -> basque_mis msyn <- multisynth(gdpcap ~ trt, regionno, year, basque_mis, nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) msyn2 <- multisynth(gdpcap ~ trt, regionno, year, basque %>% filter(regionno != 18), nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) expect_equal(msyn$weights[-17,2], msyn2$weights[,2], tolerance = 1e-6) }) test_that("Separate synth with missing control unit only in post-treatment period drops control unit", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(18) | year < 1980) -> basque_mis dat_format <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque_mis) expect_true(nrow(dat_format$X) == nrow(dat_format$y)) msyn <- multisynth(gdpcap ~ trt, regionno, year, basque_mis, nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) msyn2 <- multisynth(gdpcap ~ trt, regionno, year, basque %>% filter(regionno != 18), nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) expect_equal(msyn$weights[-17,2], msyn2$weights[,2], tolerance = 1e-6) }) test_that("Separate synth with missing control unit only in pre-treatment period drops control unit", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(18) | year >= 1980) -> basque_mis dat_format <- format_data_stag(quo(gdpcap), quo(trt), quo(regionno), quo(year), basque_mis) expect_true(nrow(dat_format$X) == nrow(dat_format$y)) msyn <- multisynth(gdpcap ~ trt, regionno, year, basque_mis, nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) msyn2 <- multisynth(gdpcap ~ trt, regionno, year, basque %>% filter(regionno != 18), nu = 0, scm=T, eps_rel=1e-8, eps_abs=1e-8) expect_equal(msyn$weights[-17,2], msyn2$weights[,2], tolerance = 1e-6) }) test_that("Multisynth with unbalanced panels runs", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(15, 17) | year != 1970) -> basque_mis msyn <- multisynth(gdpcap ~ trt, regionno, year, basque_mis, scm=T, eps_rel=1e-8, eps_abs=1e-8) expect_error(summary(msyn), NA) }) test_that("Multisynth with unbalanced panels runs with missing post-treatment", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(15, 17) | year != 1990) -> basque_mis msyn <- multisynth(gdpcap ~ trt, regionno, year, basque_mis, scm=T, eps_rel=1e-8, eps_abs=1e-8) expect_error(summary(msyn), NA) }) test_that("Multisynth with unbalanced panels runs", { # drop a time period for unit 17 basque %>% filter(!regionno %in% c(15) | year != 1985) -> basque_mis msyn <- multisynth(gdpcap ~ trt, regionno, year, basque_mis, scm=T, eps_rel=1e-8, eps_abs=1e-8) expect_error(summary(msyn), NA) }) ================================================ FILE: tests/testthat.R ================================================ library(testthat) library(augsynth) test_check("augsynth") ================================================ FILE: vignettes/.gitignore ================================================ *.html *.R ================================================ FILE: vignettes/multi-outcomes-vignette.Rmd ================================================ --- output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Multi Outcomes AugSynth Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(kableExtra) ``` # `augsynth`: Estimating multiple outcome effects ### The data To demonstrate `augsynth` with multiple outcomes, we'll use data on the impact of personal income tax cuts in Kansas that comes with the `AugSynth` package. Our interest is in estimating the effect of income tax cuts on gross state product (GSP) per capita, wages, establishment counts, and other macroeconomic indicators. ```{r load_data, results="hide", warning=F, message=F} library(magrittr) library(dplyr) library(augsynth) data(kansas) ``` The `kansas` dataset contains the GSP per capita (the outcome measure) `lngdpcapita` for all 50 states from the first quarter of 1990 to the first quarter of 2016. To run `augsynth`, we need to include a treatment status column that indicates which region was treated and at what time. The table in `kansas` contains the column `treated` to denote this. In the original study, the second quarter of 2012 was the implementation of the tax cut in Kansas. ```{r treated_units} kansas %>% select(year, qtr, year_qtr, state, treated, gdp, lngdpcapita) %>% filter(state == "Kansas" & year_qtr >= 2012 & year_qtr < 2013) ``` ### Using the Synthetic Controls Method We will begin by running the synthetic controls method on GDP per capita, wages, and the number of establishments. To run the vanilla synthetic controls method using `augsynth`, set `progfunc` to `None` and `scm` to `TRUE`. #### Single outcomes First, we will examine each outcome variable separately, beginning with log GDP per capita `lngdpcapita`. ```{r lngdpcapita_syn} syn_lngdpcapita <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc="None", scm=T) summary(syn_lngdpcapita) ``` ```{r lngdpcapita_syn_plot} plot(syn_lngdpcapita) ``` Next we will examine the log total wages per capita. Since this column doesn't already exist in the dataframe, we will create a `lntotalwagescapita` column. ```{r lntotalwagescapita_syn} kansas$lntotalwagescapita <- log(kansas$totalwagescapita) syn_lntotalwagescapita <- augsynth(lntotalwagescapita ~ treated, fips, year_qtr, kansas, progfunc="None", scm=T) summary(syn_lntotalwagescapita) ``` ```{r lntotalwagescapita_syn_plot} plot(syn_lntotalwagescapita) ``` Lastly, we will examine the number of establishments per capita, `estabscapita`. ```{r estabscapita_syn} syn_estabscapita <- augsynth(estabscapita ~ treated, fips, year_qtr, kansas, progfunc="None", scm=T) summary(syn_estabscapita) ``` ```{r single_estabscapita_syn_plot} plot(syn_estabscapita) ``` #### Multiple outcomes Now we will combine our outcome variables into one study. To add more outcome variables, we add them to the LHS of the formula. ```{r multi_outcome_syn} syn_multi <- augsynth(lngdpcapita + lntotalwagescapita + estabscapita ~ treated, fips, year_qtr, kansas, progfunc="None", scm=T) summary(syn_multi) ``` ```{r multi_outcome_syn_plot} plot(syn_multi) ``` ### Using the Augmented Synthetic Controls Method We will now repeat the study using the Augmented Synthetic Controls Method with ridge regression. In ASCM, we first fit the SCM weights, then combine it with a ridge regression, thus setting `progfunc="Ridge", scm=T`. ```{r lngdpcapita_asyn} asyn_lngdpcapita <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc="Ridge", scm=T) summary(asyn_lngdpcapita) ``` ```{r lngdpcapita_asyn_plot} plot(asyn_lngdpcapita) ``` ```{r lntotalwagescapita_asyn} asyn_lntotalwagescapita <- augsynth(lntotalwagescapita ~ treated, fips, year_qtr, kansas, progfunc="Ridge", scm=T) summary(asyn_lntotalwagescapita) ``` ```{r lntotalwagescapita_asyn_plot} plot(asyn_lntotalwagescapita) ``` ```{r estabscapita_asyn} asyn_estabscapita <- augsynth(estabscapita ~ treated, fips, year_qtr, kansas, progfunc="Ridge", scm=T) summary(asyn_estabscapita) ``` ```{r single_estabscapita_asyn_plot} plot(asyn_estabscapita) ``` #### Multiple outcomes Now we will combine our outcome variables into one study. To add more outcome variables, we add them to the LHS of the formula. ```{r multi_outcome_asyn} asyn_multi <- augsynth(lngdpcapita + lntotalwagescapita + estabscapita ~ treated, fips, year_qtr, kansas, progfunc="Ridge", scm=T, lambda = 1e-4) summary(asyn_multi) ``` ```{r multi_outcome_asyn_plot} plot(asyn_multi) ``` ================================================ FILE: vignettes/multisynth-vignette.Rmd ================================================ --- output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MultiSynth Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(kableExtra) ``` # `augsynth`: Estimating treatment effects with staggered adoption ### The data To show the features of the `multisynth` function we will use data on the effects of states implementing mandatory collective bargaining agreements for public sector unions [(Paglayan, 2018)](https://onlinelibrary.wiley.com/doi/full/10.1111/ajps.12388) ```{r results="hide", warning=F, message=F} library(magrittr) library(dplyr) library(augsynth) ``` ```{r } data <- read.csv("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/WGWMAV/3UHTLP", sep="\t") ``` The dataset contains several important variables that we'll use: - `year`, `State`: The state and year of the measurement - `YearCBrequired`: The year that the state adopted mandatory collective bargaining - `lnppexpend`: Log per pupil expenditures in constant 2010 $ ```{r echo = F} data %>% filter(year == 1960) %>% select(year, State, YearCBrequired, lnppexpend) %>% head() %>% kable() %>% kable_styling(bootstrap_options =c("hover", "responsive")) ``` To run `multisynth`, we need to include a treatment status column that indicates which state is treated in a given year, we call this `cbr` below. We also restrict to the years 1959-1997 where we have yearly measurements of expenditures and drop Washington D.C. and Wisconsin from the analysis. ```{r } data %>% filter(!State %in% c("DC", "WI"), year >= 1959, year <= 1997) %>% mutate(YearCBrequired = ifelse(is.na(YearCBrequired), Inf, YearCBrequired), cbr = 1 * (year >= YearCBrequired)) -> analysis_df ``` ## Partially pooled SCM with an intercept To fit partially pooled synthetic controls, we need to give `multisynth` a formula of the form `outcome ~ treatment`, point it to the unit and time variables, and choose the level of partial pooling `nu`. Setting `nu = 0` fits a separate synthetic control for each treated unit and setting `nu = 1` fits fully pooled synthetic controls. If we don't set `nu`, `multisynth` will choose a heuristic value based on how well separate synthetic controls balance the overall average. By default, `multisynth` includes an intercept shift along with the weights; we can exclude the intercept shift by setting `fixedeff = F`. We can also set the number of pre-treatment time periods (lags) that we want to balance with the `n_lags` argument and the number of post-treatment time periods (leads) that we want to estimate with the `n_leads` argument. By default `multisynth` sets `n_lags` and `n_leads` to the number of pre-treatment and post-treatment periods for the last treated unit, respectively. ```{r } # with a choice of nu ppool_syn <- multisynth(lnppexpend ~ cbr, State, year, nu = 0.5, analysis_df) # with default nu ppool_syn <- multisynth(lnppexpend ~ cbr, State, year, analysis_df) print(ppool_syn$nu) ppool_syn ``` Using the `summary` function, we'll compute the treatment effects and standard errors and confidence intervals for all treated units as well as the average via the wild bootstrap. (This takes a bit of time so we'll store the output) We can also change the significant level associated with the confidence intervals by setting the `alpha` argument, by default `alpha = 0.05`. ```{r} ppool_syn_summ <- summary(ppool_syn) ``` We can then report the level of global and individual balance as well as estimates for the average. ```{r } ppool_syn_summ ``` `ppool_syn_summ$att` is a dataframe that contains all of the point estimates, standard errors, and lower/upper confidence limits. `Time = NA` denotes the effect averaged across the post treatment periods. ```{r echo = F} ppool_syn_summ$att %>% filter(Time >= 0) %>% head() %>% kable() %>% kable_styling(bootstrap_options =c("hover", "responsive")) ``` We can also visually display both the pre-treatment balance and the estimated treatment effects. ```{r ppool_syn_plot, fig.width=8, fig.height=4.5, fig.align="center", warning=F, message=F} plot(ppool_syn_summ) ``` And again we can hone in on the average effects. ```{r ppool_syn_plot_avg, fig.width=8, fig.height=4.5, fig.align="center", warning=F, message=F} plot(ppool_syn_summ, levels = "Average") ``` ### Collapsing into time cohorts We can also collapse treated units with the same treatment time into _time cohorts_, and find one synthetic control per time cohort by setting `time_cohort = TRUE`. When the number of distinct treatment times is much smaller than the number of treated units, this will run significantly faster. ```{r } # with default nu ppool_syn_time <- multisynth(lnppexpend ~ cbr, State, year, analysis_df, time_cohort = TRUE) print(ppool_syn_time$nu) ppool_syn_time ``` We can then compute effects for the overall average as well as for each treatment time cohort, rather than individual units. ```{r} ppool_syn_time_summ <- summary(ppool_syn_time) ppool_syn_time_summ ``` ```{r echo = F} ppool_syn_time_summ$att %>% filter(Time >= 0) %>% head() %>% kable() %>% kable_styling(bootstrap_options =c("hover", "responsive")) ``` Again we can plot the effects. ```{r ppool_syn_time_plot, fig.width=8, fig.height=4.5, fig.align="center", warning=F, message=F} plot(ppool_syn_time_summ) ``` ### Including auxiliary covariates We can also include an additional set of covariates to balance along with the pre-treatment outcomes. First, let's create a data frame with the values of some covariates in a few different years: ```{r cov_data} data %>% select(State, year, agr, pnwht, purban, perinc, studteachratio) %>% group_by(State) %>% summarise(perinc_1959 = perinc[year == 1959], studteachratio_1959 = studteachratio[year == 1959]) %>% # filter to lower 48 where we have data filter(!State %in% c("AK", "HI")) -> cov_data analysis_df %>% inner_join(cov_data, by = "State") -> analysis_df_covs ``` To include auxiliary covariates, we can add them in to the formula after `|`. This will balance the auxiliary covariates along with the pre-treatment outcomes simultanouesly. If the covariates vary during the pre-treatment periods, `multisynth` will use the average pre-treatment value. We can change this behavior by including our own custom aggregation function via the `cov_agg` argument. ```{r cov_syn} # with default nu ppool_syn_cov <- multisynth(lnppexpend ~ cbr | perinc_1959 + studteachratio_1959, State, year, analysis_df_covs) print(ppool_syn_cov$nu) ppool_syn_cov ``` Again we can compute effects, along with their standard errors and confidence intervals, and plot. ```{r} ppool_syn_cov_summ <- summary(ppool_syn_cov) ppool_syn_cov_summ ``` ```{r echo = F} ppool_syn_cov_summ$att %>% filter(Time >= 0) %>% head() %>% kable() %>% kable_styling(bootstrap_options =c("hover", "responsive")) ``` Again we can plot the effects. ```{r ppool_syn_cov_plot, fig.width=8, fig.height=4.5, fig.align="center", warning=F, message=F} plot(ppool_syn_cov_summ, levels = "Average") ``` ================================================ FILE: vignettes/multisynth-vignette.md ================================================ --- output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MultiSynth Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # `augsynth`: Estimating treatment effects with staggered adoption ### The data To show the features of the `multisynth` function we will use data on the effects of states implementing mandatory collective bargaining agreements for public sector unions [(Paglayan, 2018)](https://onlinelibrary.wiley.com/doi/full/10.1111/ajps.12388) ```r library(magrittr) library(dplyr) library(augsynth) ``` ```r data <- read.csv("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/WGWMAV/3UHTLP", sep="\t") ``` The dataset contains several important variables that we'll use: - `year`, `State`: The state and year of the measurement - `YearCBrequired`: The year that the state adopted mandatory collective bargaining - `lnppexpend`: Log per pupil expenditures in constant 2010 $
year State YearCBrequired lnppexpend
1960 AK 1970 8.325518
1960 AL NA 7.396177
1960 AR NA 7.385373
1960 AZ NA 7.947127
1960 CA 1976 8.185162
1960 CO NA 7.952833
To run `multisynth`, we need to include a treatment status column that indicates which state is treated in a given year, we call this `cbr` below. We also restrict to the years 1959-1997 where we have yearly measurements of expenditures and drop Washington D.C. and Wisconsin from the analysis. ```r data %>% filter(!State %in% c("DC", "WI"), year >= 1959, year <= 1997) %>% mutate(YearCBrequired = ifelse(is.na(YearCBrequired), Inf, YearCBrequired), cbr = 1 * (year >= YearCBrequired)) -> analysis_df ``` ## Partially pooled SCM with an intercept To fit partially pooled synthetic controls, we need to give `multisynth` a formula of the form `outcome ~ treatment`, point it to the unit and time variables, and choose the level of partial pooling `nu`. Setting `nu = 0` fits a separate synthetic control for each treated unit and setting `nu = 1` fits fully pooled synthetic controls. If we don't set `nu`, `multisynth` will choose a heuristic value based on how well separate synthetic controls balance the overall average. By default, `multisynth` includes an intercept shift along with the weights; we can exclude the intercept shift by setting `fixedeff = F`. We can also set the number of pre-treatment time periods (lags) that we want to balance with the `n_lags` argument and the number of post-treatment time periods (leads) that we want to estimate with the `n_leads` argument. By default `multisynth` sets `n_lags` and `n_leads` to the number of pre-treatment and post-treatment periods for the last treated unit, respectively. ```r # with a choice of nu ppool_syn <- multisynth(lnppexpend ~ cbr, State, year, nu = 0.5, analysis_df) # with default nu ppool_syn <- multisynth(lnppexpend ~ cbr, State, year, analysis_df) print(ppool_syn$nu) #> [1] 0.2606793 ppool_syn #> #> Call: #> multisynth(form = lnppexpend ~ cbr, unit = State, time = year, #> data = analysis_df) #> #> Average ATT Estimate: -0.011 ``` Using the `summary` function, we'll compute the treatment effects and standard errors and confidence intervals for all treated units as well as the average via the wild bootstrap. (This takes a bit of time so we'll store the output) We can also change the significant level associated with the confidence intervals by setting the `alpha` argument, by default `alpha = 0.05`. ```r ppool_syn_summ <- summary(ppool_syn) ``` We can then report the level of global and individual balance as well as estimates for the average. ```r ppool_syn_summ #> #> Call: #> multisynth(form = lnppexpend ~ cbr, unit = State, time = year, #> data = analysis_df) #> #> Average ATT Estimate (Std. Error): -0.011 (0.022) #> #> Global L2 Imbalance: 0.003 #> Scaled Global L2 Imbalance: 0.019 #> Percent improvement from uniform global weights: 98.1 #> #> Individual L2 Imbalance: 0.028 #> Scaled Individual L2 Imbalance: 0.096 #> Percent improvement from uniform individual weights: 90.4 #> #> Time Since Treatment Level Estimate Std.Error lower_bound upper_bound #> 0 Average -0.004281754 0.02231379 -0.04888183 0.03786032 #> 1 Average -0.010856856 0.02099299 -0.05423609 0.02939147 #> 2 Average 0.004378813 0.02268842 -0.04268354 0.04896627 #> 3 Average 0.001155346 0.02388535 -0.04846624 0.04464696 #> 4 Average -0.009305005 0.02529949 -0.06207289 0.03822153 #> 5 Average -0.016942988 0.02447144 -0.06935946 0.02695179 #> 6 Average -0.018505173 0.02507329 -0.07297111 0.02755436 #> 7 Average -0.003866657 0.02817460 -0.06047905 0.05013422 #> 8 Average -0.015835730 0.03141197 -0.08179055 0.04231137 #> 9 Average -0.031751350 0.02962989 -0.09168791 0.02202697 #> 10 Average -0.017839047 0.03314017 -0.08835499 0.04070061 ``` `ppool_syn_summ$att` is a dataframe that contains all of the point estimates, standard errors, and lower/upper confidence limits. `Time = NA` denotes the effect averaged across the post treatment periods.
Time Level Estimate Std.Error lower_bound upper_bound
0 Average -0.0042818 0.0223138 -0.0488818 0.0378603
1 Average -0.0108569 0.0209930 -0.0542361 0.0293915
2 Average 0.0043788 0.0226884 -0.0426835 0.0489663
3 Average 0.0011553 0.0238853 -0.0484662 0.0446470
4 Average -0.0093050 0.0252995 -0.0620729 0.0382215
5 Average -0.0169430 0.0244714 -0.0693595 0.0269518
We can also visually display both the pre-treatment balance and the estimated treatment effects. ```r plot(ppool_syn_summ) ```
plot of chunk ppool_syn_plot

plot of chunk ppool_syn_plot

And again we can hone in on the average effects. ```r plot(ppool_syn_summ, levels = "Average") ```
plot of chunk ppool_syn_plot_avg

plot of chunk ppool_syn_plot_avg

### Collapsing into time cohorts We can also collapse treated units with the same treatment time into _time cohorts_, and find one synthetic control per time cohort by setting `time_cohort = TRUE`. When the number of distinct treatment times is much smaller than the number of treated units, this will run significantly faster. ```r # with default nu ppool_syn_time <- multisynth(lnppexpend ~ cbr, State, year, analysis_df, time_cohort = TRUE) print(ppool_syn_time$nu) #> [1] 0.3939013 ppool_syn_time #> #> Call: #> multisynth(form = lnppexpend ~ cbr, unit = State, time = year, #> data = analysis_df, time_cohort = TRUE) #> #> Average ATT Estimate: -0.018 ``` We can then compute effects for the overall average as well as for each treatment time cohort, rather than individual units. ```r ppool_syn_time_summ <- summary(ppool_syn_time) ppool_syn_time_summ #> #> Call: #> multisynth(form = lnppexpend ~ cbr, unit = State, time = year, #> data = analysis_df, time_cohort = TRUE) #> #> Average ATT Estimate (Std. Error): -0.018 (0.024) #> #> Global L2 Imbalance: 0.005 #> Scaled Global L2 Imbalance: 0.018 #> Percent improvement from uniform global weights: 98.2 #> #> Individual L2 Imbalance: 0.038 #> Scaled Individual L2 Imbalance: 0.057 #> Percent improvement from uniform individual weights: 94.3 #> #> Time Since Treatment Level Estimate Std.Error lower_bound upper_bound #> 0 Average -0.0007756959 0.02443902 -0.04849731 0.04410082 #> 1 Average -0.0160616979 0.02455148 -0.06120905 0.03042719 #> 2 Average -0.0028471499 0.02521902 -0.05189710 0.04841170 #> 3 Average -0.0026721191 0.02742973 -0.05634460 0.05048728 #> 4 Average -0.0181312843 0.02798461 -0.07148111 0.03468573 #> 5 Average -0.0284898474 0.02644653 -0.07724091 0.02368573 #> 6 Average -0.0228343778 0.02673115 -0.07456646 0.02837584 #> 7 Average -0.0140789250 0.03200335 -0.07574649 0.04580312 #> 8 Average -0.0245472682 0.03276526 -0.08792999 0.03819451 #> 9 Average -0.0476922268 0.03221486 -0.11080383 0.01490279 #> 10 Average -0.0216121159 0.03235770 -0.08391841 0.03853317 ```
Time Level Estimate Std.Error lower_bound upper_bound
0 Average -0.0007757 0.0244390 -0.0484973 0.0441008
1 Average -0.0160617 0.0245515 -0.0612091 0.0304272
2 Average -0.0028471 0.0252190 -0.0518971 0.0484117
3 Average -0.0026721 0.0274297 -0.0563446 0.0504873
4 Average -0.0181313 0.0279846 -0.0714811 0.0346857
5 Average -0.0284898 0.0264465 -0.0772409 0.0236857
Again we can plot the effects. ```r plot(ppool_syn_time_summ) ```
plot of chunk ppool_syn_time_plot

plot of chunk ppool_syn_time_plot

### Including auxiliary covariates We can also include an additional set of covariates to balance along with the pre-treatment outcomes. First, let's create a data frame with the values of some covariates in a few different years: ```r data %>% select(State, year, agr, pnwht, purban, perinc, studteachratio) %>% group_by(State) %>% summarise(perinc_1959 = perinc[year == 1959], studteachratio_1959 = studteachratio[year == 1959]) %>% # filter to lower 48 where we have data filter(!State %in% c("AK", "HI")) -> cov_data analysis_df %>% inner_join(cov_data, by = "State") -> analysis_df_covs ``` To include auxiliary covariates, we can add them in to the formula after `|`. This will balance the auxiliary covariates along with the pre-treatment outcomes simultanouesly. If the covariates vary during the pre-treatment periods, `multisynth` will use the average pre-treatment value. We can change this behavior by including our own custom aggregation function via the `cov_agg` argument. ```r # with default nu ppool_syn_cov <- multisynth(lnppexpend ~ cbr | perinc_1959 + studteachratio_1959, State, year, analysis_df_covs) print(ppool_syn_cov$nu) #> [1] 0.2242633 ppool_syn_cov #> #> Call: #> multisynth(form = lnppexpend ~ cbr | perinc_1959 + studteachratio_1959, #> unit = State, time = year, data = analysis_df_covs) #> #> Average ATT Estimate: -0.019 ``` Again we can compute effects, along with their standard errors and confidence intervals, and plot. ```r ppool_syn_cov_summ <- summary(ppool_syn_cov) ppool_syn_cov_summ #> #> Call: #> multisynth(form = lnppexpend ~ cbr | perinc_1959 + studteachratio_1959, #> unit = State, time = year, data = analysis_df_covs) #> #> Average ATT Estimate (Std. Error): -0.019 (0.016) #> #> Global L2 Imbalance: 0.004 #> Scaled Global L2 Imbalance: 0.030 #> Percent improvement from uniform global weights: 97 #> #> Individual L2 Imbalance: 0.043 #> Scaled Individual L2 Imbalance: 0.155 #> Percent improvement from uniform individual weights: 84.5 #> #> Time Since Treatment Level Estimate Std.Error lower_bound upper_bound #> 0 Average -0.0002624529 0.02142663 -0.04534283 0.039477273 #> 1 Average -0.0156461424 0.01955742 -0.05138329 0.021858933 #> 2 Average 0.0069387257 0.01979857 -0.03246108 0.046934990 #> 3 Average -0.0106105517 0.02094953 -0.05241864 0.032678554 #> 4 Average -0.0194238312 0.02027608 -0.06026658 0.019006295 #> 5 Average -0.0209126517 0.02065713 -0.06053277 0.018478402 #> 6 Average -0.0212525401 0.02011174 -0.06076619 0.018093027 #> 7 Average -0.0276107046 0.02122581 -0.07010144 0.014753050 #> 8 Average -0.0278450111 0.02282095 -0.07360570 0.017305636 #> 9 Average -0.0354977043 0.02341366 -0.07998872 0.009126067 #> 10 Average -0.0341083505 0.02709654 -0.08591161 0.017937928 ```
Time Level Estimate Std.Error lower_bound upper_bound
0 Average -0.0002625 0.0214266 -0.0453428 0.0394773
1 Average -0.0156461 0.0195574 -0.0513833 0.0218589
2 Average 0.0069387 0.0197986 -0.0324611 0.0469350
3 Average -0.0106106 0.0209495 -0.0524186 0.0326786
4 Average -0.0194238 0.0202761 -0.0602666 0.0190063
5 Average -0.0209127 0.0206571 -0.0605328 0.0184784
Again we can plot the effects. ```r plot(ppool_syn_cov_summ, levels = "Average") ```
plot of chunk ppool_syn_cov_plot

plot of chunk ppool_syn_cov_plot

================================================ FILE: vignettes/singlesynth-vignette.Rmd ================================================ --- output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Single Outcome AugSynth Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(kableExtra) ``` # `augsynth`: The Augmented Synthetic Control Method ## Installation You can install `augsynth` from github using `devtools`. ```{r install, results="hide", message=F, eval=F} ## Install devtools if noy already installed install.packages("devtools", repos='http://cran.us.r-project.org') ## Install augsynth from github devtools::install_github("ebenmichael/augsynth") ``` ## Example: Effects of the 2012 Kansas Tax Cuts ### The data To show the usage and features of `augsynth`, we'll use data on the impact of personal income tax cuts in Kansas that comes with the `AugSynth` package. Our interest is in estimating the effect of income tax cuts on gross state product (GSP) per capita. ```{r load_data, results="hide", warning=F, message=F} library(magrittr) library(dplyr) library(augsynth) data(kansas) ``` The `kansas` dataset contains the GSP per capita (the outcome measure) `lngdpcapita` for all 50 states from the first quarter of 1990 to the first quarter of 2016. To run `augsynth`, we need to include a treatment status column that indicates which region was treated and at what time. The table in `kansas` contains the column `treated` to denote this. In the original study, the second quarter of 2012 was the implementation of the tax cut in Kansas. ```{r treated_units} kansas %>% select(year, qtr, year_qtr, state, treated, gdp, lngdpcapita) %>% filter(state == "Kansas" & year_qtr >= 2012 & year_qtr < 2013) ``` ### Synth Now to find a synthetic control using the entire series of pre-intervention outcomes (and no auxiliary covariates), we can use `augsynth`. To do so we just need to give `augsynth` a formula like `outcome ~ treatment`, tell it what the unit and time variables are, optionally provide when intervention took place (the code will automatically determine this if `t_int` is not provided), and specify that we don't want to fit an outcome model ```{r fit_synth, message=F, warning=F} library(augsynth) syn <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc = "None", scm = T) ``` We can then look at the ATT estimates for each post-intervention time period and overall. We'll also see the quality of the synthetic control fit measured by the L2 distance between Kansas and its synthetic control, and the percent improvement over uniform weights. By default, we'll also see pointwise confidence intervals using a [conformal inference procedure](https://arxiv.org/abs/1712.09089). ```{r summ_syn} summary(syn) ``` The default test statistic is the sum of the absolute treatment efects `function(x) sum(abs(x))`. We can change the test statistic via the `stat_func` argument. For instance, if we want to perform a one-way test against postive effects, we can set the test stastic to be the negative sum `function(x) -sum(x)`: ```{r summ_syn_neg} summary(syn, stat_func = function(x) -sum(x)) ``` Or if we want to priotize testing the average post-treatment effect, we can set it to be the absolute sum: ```{r summ_syn_sum} summary(syn, stat_func = function(x) abs(sum(x))) ``` It's easier to see this information visually. Below we plot the difference between Kansas and it's synthetic control. Before the tax cuts (to the left of the dashed line) we expect these to be close, and after the tax cuts we measure the effect (with point-wise confidence intervals). ```{r fig_syn, fig.width=8, fig.height=4.5, echo=T, fig.align="center"} plot(syn) ``` We can also compute point-wise confidence intervals using the [Jackknife+ procedure](https://arxiv.org/abs/1905.02928) by changing the `inf_type` argument, although this requires additional assumptions. ```{r fig_syn_plus, fig.width=8, fig.height=4.5, echo=T, fig.align="center"} plot(syn, inf_type = "jackknife+") ``` ### Augmenting synth with an outcome model In this example the pre-intervention synthetic control fit has an L2 imbalance of 0.083, about 20% of the imbalance between Kansas and the average of the other states. We can reduce this by _augmenting_ synth with ridge regression. To do this we change `progfunc` to `"Ridge"`. We can also choose the ridge hyper-parameter by setting `lambda`, while not specifying `lambda` will determine one through cross validation: ```{r fit_asynth, message=F, warning=F} asyn <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc = "Ridge", scm = T) ``` We can plot the cross-validation MSE when dropping pre-treatment time periods by setting `cv = T` in the `plot` function: ```{r fig_asyn_cv, fig.width=8, fig.height=4.5, echo=T, fig.align="center"} plot(asyn, cv = T) ``` By default, the CV procedure chooses the maximal value of `lambda` with MSE within one standard deviation of the minimal MSE. To instead choose the `lambda` that minizes the cross validation MSE, set `min_1se = FALSE`. We can look at the summary and plot the results. Now in the summary output we see an estimate of the overall bias of synth; we measure this with the average amount that augmentation changes the synth estimate. Notice that the estimates become somewhat larger in magnitude, and the standard errors are tighter. ```{r summ_asyn} summary(asyn) ``` ```{r fig_asyn, fig.width=8, fig.height=4.5, echo=T, fig.align="center"} plot(asyn) ``` There are also several auxiliary covariates. We can include these in the augmentation by fitting an outcome model using the auxiliary covariates. To do this we simply add the covariates into the formula after `|`. By default this will create time invariant covariates by averaging the auxiliary covariates over the pre-intervention period, dropping `NA` values. We can use a custom aggregation function by setting the `cov_agg` argument. Then the lagged outcomes and the auxiliary covariates are jointly balanced by SCM and the ridge outcome model includes both. ```{r fit_covsynth, message=F, warning=F} covsyn <- augsynth(lngdpcapita ~ treated | lngdpcapita + log(revstatecapita) + log(revlocalcapita) + log(avgwklywagecapita) + estabscapita + emplvlcapita, fips, year_qtr, kansas, progfunc = "ridge", scm = T) ``` Again we can look at the summary and plot the results. ```{r summ_cvsyn} summary(covsyn) ``` ```{r fig_covsyn, fig.width=8, fig.height=4.5, echo=T, fig.align="center"} plot(covsyn) ``` Now we can additionally fit ridge ASCM on the residuals, look at the summary, and plot the results. ```{r fit_covsynth_aug, message=F, warning=F} covsyn_resid <- augsynth(lngdpcapita ~ treated | lngdpcapita + log(revstatecapita) + log(revlocalcapita) + log(avgwklywagecapita) + estabscapita + emplvlcapita, fips, year_qtr, kansas, progfunc = "ridge", scm = T, lambda = asyn$lambda, residualize = T) ``` ```{r summ_cvsyn_resid} summary(covsyn_resid) ``` ```{r fig_covsyn_resid, fig.width=8, fig.height=4.5, echo=T, fig.align="center"} plot(covsyn_resid) ``` Finally, we can augment synth with many different outcome models. The simplest outcome model is a unit fixed effect model, which we can include by setting `fixedeff = T`. ```{r fit_desyn, message=F, warning=F} desyn <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc = "none", scm = T, fixedeff = T) ``` ```{r summ_desyn} summary(desyn) ``` ```{r fig_desyn, fig.width=8, fig.height=4.5, echo=T, fig.align="center"} plot(desyn) ``` We can incorproate other outcome models by changing the `progfunc`. Several outcome models are available, including, fitting the factor model directly with `gsynth`, general elastic net regression, bayesian structural time series estimation with `CausalImpact`, and matrix completion with `MCPanel`. For each outcome model you can supply an optional set of parameters, see documentation for details. ================================================ FILE: vignettes/singlesynth-vignette.md ================================================ --- output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Single Outcome AugSynth Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # `augsynth`: The Augmented Synthetic Control Method ## Installation You can install `augsynth` from github using `devtools`. ```r ## Install devtools if noy already installed install.packages("devtools", repos='http://cran.us.r-project.org') ## Install augsynth from github devtools::install_github("ebenmichael/augsynth") ``` ## Example: Effects of the 2012 Kansas Tax Cuts ### The data To show the usage and features of `augsynth`, we'll use data on the impact of personal income tax cuts in Kansas that comes with the `AugSynth` package. Our interest is in estimating the effect of income tax cuts on gross state product (GSP) per capita. ```r library(magrittr) library(dplyr) library(augsynth) data(kansas) ``` The `kansas` dataset contains the GSP per capita (the outcome measure) `lngdpcapita` for all 50 states from the first quarter of 1990 to the first quarter of 2016. To run `augsynth`, we need to include a treatment status column that indicates which region was treated and at what time. The table in `kansas` contains the column `treated` to denote this. In the original study, the second quarter of 2012 was the implementation of the tax cut in Kansas. ```r kansas %>% select(year, qtr, year_qtr, state, treated, gdp, lngdpcapita) %>% filter(state == "Kansas" & year_qtr >= 2012 & year_qtr < 2013) #> # A tibble: 4 x 7 #> year qtr year_qtr state treated gdp lngdpcapita #> #> 1 2012 1 2012 Kansas 0 143844 10.8 #> 2 2012 2 2012. Kansas 1 141518 10.8 #> 3 2012 3 2012. Kansas 1 138890 10.8 #> 4 2012 4 2013. Kansas 1 139603 10.8 ``` ### Synth Now to find a synthetic control using the entire series of pre-intervention outcomes (and no auxiliary covariates), we can use `augsynth`. To do so we just need to give `augsynth` a formula like `outcome ~ treatment`, tell it what the unit and time variables are, optionally provide when intervention took place (the code will automatically determine this if `t_int` is not provided), and specify that we don't want to fit an outcome model ```r library(augsynth) syn <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc = "None", scm = T) ``` We can then look at the ATT estimates for each post-intervention time period and overall. We'll also see the quality of the synthetic control fit measured by the L2 distance between Kansas and its synthetic control, and the percent improvement over uniform weights. By default, we'll also see pointwise confidence intervals using a [conformal inference procedure](https://arxiv.org/abs/1712.09089). ```r summary(syn) #> #> Call: #> single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), #> t_int = t_int, data = data, progfunc = "None", scm = ..2) #> #> Average ATT Estimate (p Value for Joint Null): -0.029 ( 0.328 ) #> L2 Imbalance: 0.083 #> Percent improvement from uniform weights: 79.5% #> #> Avg Estimated Bias: NA #> #> Inference type: Conformal inference #> #> Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value #> 2012.25 -0.018 -0.045 0.006 0.111 #> 2012.50 -0.041 -0.070 -0.015 0.022 #> 2012.75 -0.033 -0.062 -0.007 0.044 #> 2013.00 -0.019 -0.046 0.005 0.111 #> 2013.25 -0.029 -0.053 -0.005 0.044 #> 2013.50 -0.046 -0.073 -0.022 0.022 #> 2013.75 -0.032 -0.056 -0.010 0.022 #> 2014.00 -0.045 -0.074 -0.018 0.022 #> 2014.25 -0.043 -0.074 -0.014 0.022 #> 2014.50 -0.029 -0.061 0.000 0.044 #> 2014.75 -0.018 -0.053 0.011 0.144 #> 2015.00 -0.029 -0.066 0.005 0.078 #> 2015.25 -0.019 -0.051 0.010 0.122 #> 2015.50 -0.022 -0.056 0.007 0.111 #> 2015.75 -0.019 -0.055 0.013 0.189 #> 2016.00 -0.028 -0.067 0.008 0.100 ``` The default test statistic is the sum of the absolute treatment efects `function(x) sum(abs(x))`. We can change the test statistic via the `stat_func` argument. For instance, if we want to perform a one-way test against postive effects, we can set the test stastic to be the negative sum `function(x) -sum(x)`: ```r summary(syn, stat_func = function(x) -sum(x)) #> #> Call: #> single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), #> t_int = t_int, data = data, progfunc = "None", scm = ..2) #> #> Average ATT Estimate (p Value for Joint Null): -0.029 ( 0.159 ) #> L2 Imbalance: 0.083 #> Percent improvement from uniform weights: 79.5% #> #> Avg Estimated Bias: NA #> #> Inference type: Conformal inference #> #> Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value #> 2012.25 -0.018 -0.080 0.006 0.067 #> 2012.50 -0.041 -0.103 -0.015 0.022 #> 2012.75 -0.033 -0.095 -0.007 0.033 #> 2013.00 -0.019 -0.081 0.005 0.067 #> 2013.25 -0.029 -0.091 -0.005 0.033 #> 2013.50 -0.046 -0.108 -0.022 0.022 #> 2013.75 -0.032 -0.094 -0.010 0.022 #> 2014.00 -0.045 -0.107 -0.021 0.022 #> 2014.25 -0.043 -0.105 -0.014 0.022 #> 2014.50 -0.029 -0.091 0.000 0.033 #> 2014.75 -0.018 -0.080 0.011 0.078 #> 2015.00 -0.029 -0.091 0.005 0.056 #> 2015.25 -0.019 -0.081 0.007 0.078 #> 2015.50 -0.022 -0.084 0.007 0.067 #> 2015.75 -0.019 -0.081 0.013 0.111 #> 2016.00 -0.028 -0.090 0.008 0.067 ``` Or if we want to priotize testing the average post-treatment effect, we can set it to be the absolute sum: ```r summary(syn, stat_func = function(x) abs(sum(x))) #> #> Call: #> single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), #> t_int = t_int, data = data, progfunc = "None", scm = ..2) #> #> Average ATT Estimate (p Value for Joint Null): -0.029 ( 0.302 ) #> L2 Imbalance: 0.083 #> Percent improvement from uniform weights: 79.5% #> #> Avg Estimated Bias: NA #> #> Inference type: Conformal inference #> #> Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value #> 2012.25 -0.018 -0.045 0.006 0.111 #> 2012.50 -0.041 -0.070 -0.015 0.022 #> 2012.75 -0.033 -0.062 -0.007 0.044 #> 2013.00 -0.019 -0.046 0.005 0.111 #> 2013.25 -0.029 -0.053 -0.005 0.044 #> 2013.50 -0.046 -0.073 -0.022 0.022 #> 2013.75 -0.032 -0.056 -0.010 0.022 #> 2014.00 -0.045 -0.074 -0.018 0.022 #> 2014.25 -0.043 -0.074 -0.014 0.022 #> 2014.50 -0.029 -0.061 0.000 0.044 #> 2014.75 -0.018 -0.053 0.011 0.144 #> 2015.00 -0.029 -0.066 0.005 0.078 #> 2015.25 -0.019 -0.051 0.010 0.122 #> 2015.50 -0.022 -0.056 0.007 0.111 #> 2015.75 -0.019 -0.055 0.013 0.189 #> 2016.00 -0.028 -0.067 0.008 0.100 ``` It's easier to see this information visually. Below we plot the difference between Kansas and it's synthetic control. Before the tax cuts (to the left of the dashed line) we expect these to be close, and after the tax cuts we measure the effect (with point-wise confidence intervals). ```r plot(syn) ``` plot of chunk fig_syn We can also compute point-wise confidence intervals using the [Jackknife+ procedure](https://arxiv.org/abs/1905.02928) by changing the `inf_type` argument, although this requires additional assumptions. ```r plot(syn, inf_type = "jackknife+") ``` plot of chunk fig_syn_plus ### Augmenting synth with an outcome model In this example the pre-intervention synthetic control fit has an L2 imbalance of 0.083, about 20% of the imbalance between Kansas and the average of the other states. We can reduce this by _augmenting_ synth with ridge regression. To do this we change `progfunc` to `"Ridge"`. We can also choose the ridge hyper-parameter by setting `lambda`, while not specifying `lambda` will determine one through cross validation: ```r asyn <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc = "Ridge", scm = T) ``` We can plot the cross-validation MSE when dropping pre-treatment time periods by setting `cv = T` in the `plot` function: ```r plot(asyn, cv = T) ``` plot of chunk fig_asyn_cv By default, the CV procedure chooses the maximal value of `lambda` with MSE within one standard deviation of the minimal MSE. To instead choose the `lambda` that minizes the cross validation MSE, set `min_1se = FALSE`. We can look at the summary and plot the results. Now in the summary output we see an estimate of the overall bias of synth; we measure this with the average amount that augmentation changes the synth estimate. Notice that the estimates become somewhat larger in magnitude, and the standard errors are tighter. ```r summary(asyn) #> #> Call: #> single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), #> t_int = t_int, data = data, progfunc = "Ridge", scm = ..2) #> #> Average ATT Estimate (p Value for Joint Null): -0.040 ( 0.057 ) #> L2 Imbalance: 0.062 #> Percent improvement from uniform weights: 84.7% #> #> Avg Estimated Bias: 0.011 #> #> Inference type: Conformal inference #> #> Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value #> 2012.25 -0.022 -0.044 0.003 0.056 #> 2012.50 -0.047 -0.076 -0.018 0.022 #> 2012.75 -0.043 -0.071 -0.010 0.022 #> 2013.00 -0.030 -0.055 -0.004 0.033 #> 2013.25 -0.041 -0.067 -0.012 0.022 #> 2013.50 -0.059 -0.088 -0.030 0.022 #> 2013.75 -0.045 -0.073 -0.019 0.022 #> 2014.00 -0.058 -0.090 -0.026 0.022 #> 2014.25 -0.055 -0.091 -0.020 0.022 #> 2014.50 -0.041 -0.080 -0.006 0.033 #> 2014.75 -0.029 -0.068 0.006 0.056 #> 2015.00 -0.040 -0.082 0.000 0.056 #> 2015.25 -0.030 -0.066 0.002 0.056 #> 2015.50 -0.033 -0.072 0.003 0.056 #> 2015.75 -0.029 -0.071 0.010 0.056 #> 2016.00 -0.038 -0.087 0.004 0.056 ``` ```r plot(asyn) ``` plot of chunk fig_asyn There are also several auxiliary covariates. We can include these in the augmentation by fitting an outcome model using the auxiliary covariates. To do this we simply add the covariates into the formula after `|`. By default this will create time invariant covariates by averaging the auxiliary covariates over the pre-intervention period, dropping `NA` values. We can use a custom aggregation function by setting the `cov_agg` argument. Then the lagged outcomes and the auxiliary covariates are jointly balanced by SCM and the ridge outcome model includes both. ```r covsyn <- augsynth(lngdpcapita ~ treated | lngdpcapita + log(revstatecapita) + log(revlocalcapita) + log(avgwklywagecapita) + estabscapita + emplvlcapita, fips, year_qtr, kansas, progfunc = "ridge", scm = T) ``` Again we can look at the summary and plot the results. ```r summary(covsyn) #> #> Call: #> single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), #> t_int = t_int, data = data, progfunc = "ridge", scm = ..2) #> #> Average ATT Estimate (p Value for Joint Null): -0.061 ( 0.11 ) #> L2 Imbalance: 0.054 #> Percent improvement from uniform weights: 86.6% #> #> Covariate L2 Imbalance: 0.005 #> Percent improvement from uniform weights: 97.7% #> #> Avg Estimated Bias: 0.027 #> #> Inference type: Conformal inference #> #> Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value #> 2012.25 -0.021 -0.044 0.002 0.067 #> 2012.50 -0.047 -0.076 -0.014 0.033 #> 2012.75 -0.050 -0.083 -0.007 0.033 #> 2013.00 -0.045 -0.074 -0.012 0.033 #> 2013.25 -0.055 -0.088 -0.022 0.022 #> 2013.50 -0.071 -0.105 -0.033 0.022 #> 2013.75 -0.058 -0.091 -0.025 0.022 #> 2014.00 -0.081 -0.119 -0.037 0.022 #> 2014.25 -0.078 -0.121 -0.034 0.022 #> 2014.50 -0.065 -0.114 -0.021 0.033 #> 2014.75 -0.057 -0.110 -0.008 0.044 #> 2015.00 -0.075 -0.124 -0.022 0.033 #> 2015.25 -0.063 -0.106 -0.014 0.033 #> 2015.50 -0.067 -0.106 -0.019 0.022 #> 2015.75 -0.063 -0.101 -0.009 0.022 #> 2016.00 -0.078 -0.122 -0.019 0.022 ``` ```r plot(covsyn) ``` plot of chunk fig_covsyn Now we can additionally fit ridge ASCM on the residuals, look at the summary, and plot the results. ```r covsyn_resid <- augsynth(lngdpcapita ~ treated | lngdpcapita + log(revstatecapita) + log(revlocalcapita) + log(avgwklywagecapita) + estabscapita + emplvlcapita, fips, year_qtr, kansas, progfunc = "ridge", scm = T, lambda = asyn$lambda, residualize = T) ``` ```r summary(covsyn_resid) #> #> Call: #> single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), #> t_int = t_int, data = data, progfunc = "ridge", scm = ..2, #> lambda = ..3, residualize = ..4) #> #> Average ATT Estimate (p Value for Joint Null): -0.055 ( 0.288 ) #> L2 Imbalance: 0.067 #> Percent improvement from uniform weights: 83.4% #> #> Covariate L2 Imbalance: 0.000 #> Percent improvement from uniform weights: 100% #> #> Avg Estimated Bias: 0.006 #> #> Inference type: Conformal inference #> #> Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value #> 2012.25 -0.025 -0.046 -0.005 0.044 #> 2012.50 -0.051 -0.076 -0.026 0.011 #> 2012.75 -0.045 -0.070 -0.020 0.011 #> 2013.00 -0.044 -0.069 -0.019 0.011 #> 2013.25 -0.051 -0.077 -0.026 0.011 #> 2013.50 -0.069 -0.094 -0.044 0.011 #> 2013.75 -0.051 -0.077 -0.026 0.011 #> 2014.00 -0.069 -0.095 -0.040 0.011 #> 2014.25 -0.067 -0.097 -0.037 0.011 #> 2014.50 -0.053 -0.083 -0.024 0.011 #> 2014.75 -0.045 -0.075 -0.015 0.022 #> 2015.00 -0.064 -0.093 -0.034 0.011 #> 2015.25 -0.051 -0.076 -0.026 0.011 #> 2015.50 -0.059 -0.089 -0.034 0.011 #> 2015.75 -0.058 -0.087 -0.028 0.011 #> 2016.00 -0.074 -0.103 -0.044 0.011 ``` ```r plot(covsyn_resid) ``` plot of chunk fig_covsyn_resid Finally, we can augment synth with many different outcome models. The simplest outcome model is a unit fixed effect model, which we can include by setting `fixedeff = T`. ```r desyn <- augsynth(lngdpcapita ~ treated, fips, year_qtr, kansas, progfunc = "none", scm = T, fixedeff = T) ``` ```r summary(desyn) #> #> Call: #> single_augsynth(form = form, unit = !!enquo(unit), time = !!enquo(time), #> t_int = t_int, data = data, progfunc = "none", scm = ..2, #> fixedeff = ..3) #> #> Average ATT Estimate (p Value for Joint Null): -0.034 ( 0.319 ) #> L2 Imbalance: 0.082 #> Percent improvement from uniform weights: 55.1% #> #> Avg Estimated Bias: NA #> #> Inference type: Conformal inference #> #> Time Estimate 95% CI Lower Bound 95% CI Upper Bound p Value #> 2012.25 -0.022 -0.046 0.006 0.078 #> 2012.50 -0.046 -0.070 -0.013 0.022 #> 2012.75 -0.038 -0.062 -0.005 0.044 #> 2013.00 -0.024 -0.048 0.003 0.078 #> 2013.25 -0.033 -0.057 -0.006 0.044 #> 2013.50 -0.050 -0.074 -0.023 0.022 #> 2013.75 -0.035 -0.056 -0.010 0.022 #> 2014.00 -0.049 -0.073 -0.019 0.022 #> 2014.25 -0.047 -0.071 -0.014 0.022 #> 2014.50 -0.033 -0.057 0.000 0.056 #> 2014.75 -0.023 -0.047 0.010 0.122 #> 2015.00 -0.034 -0.061 0.004 0.078 #> 2015.25 -0.023 -0.047 0.007 0.100 #> 2015.50 -0.026 -0.053 0.007 0.100 #> 2015.75 -0.023 -0.050 0.012 0.144 #> 2016.00 -0.033 -0.066 0.008 0.089 ``` ```r plot(desyn) ``` plot of chunk fig_desyn We can incorproate other outcome models by changing the `progfunc`. Several outcome models are available, including, fitting the factor model directly with `gsynth`, general elastic net regression, bayesian structural time series estimation with `CausalImpact`, and matrix completion with `MCPanel`. For each outcome model you can supply an optional set of parameters, see documentation for details.