[
  {
    "path": ".gitignore",
    "content": "GlobalFactors/.Rproj.user\nGlobalFactors/.Rhistory\nGlobalFactors/.RData\nGlobalFactors/.Ruserdata\nAnalysis/Data/\nAnalysis/Objects/\nAnalysis/Figures/\nAnalysis/.Rproj.user\nAnalysis/.Rhistory\nAnalysis/.RData\nAnalysis/.Ruserdata\n.Rproj.user\n"
  },
  {
    "path": "Analysis/.gitignore",
    "content": ".Rhistory\n.RData\n.Rproj.user\nData/\nObjects/\nFigures/\nScribbles/\n"
  },
  {
    "path": "Analysis/0 - Functions.R",
    "content": "eb_prepare <- function(data, scale_alphas, overlapping) {\n  if (overlapping) {\n    data <- data %>%\n      group_by(region, characteristic) %>%\n      mutate(obs = n()) %>%\n      ungroup() %>%\n      filter(obs == max(obs)) %>%\n      select(-obs)\n  }\n  # Adjust for Beta\n  data <- data %>% \n    group_by(region, characteristic) %>%\n    mutate(\n      beta = cov(ret, mkt_vw_exc)/var(mkt_vw_exc),\n      ret_neu = (ret - mkt_vw_exc * beta)*100,\n      scaling_fct = sqrt(10^2/12) / sd(ret_neu),\n      ret_neu_scaled = ret_neu * scaling_fct\n    ) %>%\n    ungroup()\n  # Make Wide\n  data <- data %>% mutate(name_wide = str_c(characteristic, \"__\", region))\n  if(scale_alphas) {\n    data_wide <- data %>%\n      select(name_wide, eom, ret_neu_scaled) %>%\n      spread(key = name_wide, value = ret_neu_scaled)\n  } else {\n    data_wide <- data %>%\n      select(name_wide, eom, ret_neu) %>%\n      spread(key = name_wide, value = ret_neu)\n  }\n  # Return\n  list(\n    \"long\" = data,\n    \"wide\" = data_wide\n  )\n}\n\nblock_cluster_func <- function(cor_mat, cl_lables) {\n  cor_long <- cor_mat %>%\n    as_tibble(rownames = \"char1\") %>%\n    gather(-char1, key = \"char2\", value = \"cor\") %>%\n    separate(col = \"char1\", into = c(\"char1\", \"region1\"), sep = \"__\") %>%\n    separate(col = \"char2\", into = c(\"char2\", \"region2\"), sep = \"__\") %>%\n    left_join(cl_lables %>% select(characteristic, \"hcl1\"  = hcl_label), by = c(\"char1\"=\"characteristic\")) %>%\n    left_join(cl_lables %>% select(characteristic, \"hcl2\"  = hcl_label), by = c(\"char2\"=\"characteristic\")) %>%\n    rowwise() %>%\n    mutate(\n      hclreg1 = str_c(hcl1, \"__\", region1),\n      hclreg2 = str_c(hcl2, \"__\", region2)\n    ) %>%\n    select(-hcl1, -hcl2) %>%\n    mutate(hcl_pair = str_c(min(c(hclreg1, hclreg2)), \"_x_\", max(c(hclreg1, hclreg2)))) %>%\n    unite(col = \"name1\", char1, region1, sep = \"__\", remove = T) %>%\n    unite(col = \"name2\", 
char2, region2, sep = \"__\", remove = T) %>%\n    ungroup()\n  \n  cluster_wise_cor <- cor_long %>%\n    filter(name1 != name2) %>%  # Exclude cor(factor_i, factor_i)=1 \n    group_by(hcl_pair) %>%\n    summarise(\n      cor_avg = mean(cor)\n    ) \n  \n  cluster_block_cor <- cor_long %>%\n    left_join(cluster_wise_cor, by = \"hcl_pair\") %>% \n    mutate(cor_avg = if_else(name1 == name2, 1, cor_avg)) %>%  # ONLY IF SAME REGION!!\n    select(name1, name2, cor_avg) %>%\n    spread(key = name2, value = cor_avg)\n  \n  cbc_rows <- cluster_block_cor$name1  \n  cluster_block_cor <- cluster_block_cor %>% select(-name1) %>% as.matrix()\n  rownames(cluster_block_cor) <- cbc_rows\n  return(cluster_block_cor)\n}\n\n# Empirical Bayes ----------------\nemp_bayes <- function(data, cluster_labels, min_obs = 5 * 12, fix_alpha = F, bs_cov = F, cor_type = \"sample\", shrinkage = 0, layers = 3, bs_samples = 10000, seed, priors = NULL, sigma = NULL, plot = T) { # cor_type %in% c(\"sample\", \"block_clusters\")\n  set.seed(seed)\n  y_raw <- data$wide %>% select(-eom) %>% as.matrix()\n  obs <- y_raw %>% apply(2, function(x) sum(!is.na(x)))\n  y <- y_raw[, obs >= min_obs]\n  n_fcts <- ncol(y)\n  y_mean <- y %>% apply(2, mean, na.rm = T)  \n  \n  if (is.null(sigma)) {\n    if (bs_cov) {\n      bs_full <- y %>%\n        rsample::bootstraps(times = bs_samples) %>%\n        mutate(\n          res = splits %>% map(~.x %>% rsample::analysis() %>% apply(2, mean, na.rm = T) %>% as_tibble(rownames = \"characteristic\"))\n        ) %>%\n        select(-splits) %>%\n        unnest(res)\n      \n      bs_full_cov <- bs_full %>%\n        spread(key = characteristic, value = value) %>%\n        select(-id) %>%\n        cov()\n      \n      alpha_sd <- sqrt(diag(bs_full_cov)) \n      alpha_cor <- solve(diag(alpha_sd)) %*% bs_full_cov %*% solve(diag(alpha_sd))\n      colnames(alpha_cor) <- names(alpha_sd)\n      rownames(alpha_cor) <- names(alpha_sd)\n      \n    } else {\n      y_sd <- y 
%>% apply(2, sd, na.rm=T) \n      y_scor <- y %>% cor(use = \"complete.obs\")  \n      alpha_sd <- y_sd / sqrt(nrow(y))\n      alpha_cor <- y_scor\n    }\n    \n    # Apply Shrinkage\n    alpha_cor_shrunk <- alpha_cor * (1-shrinkage) + diag(n_fcts) * shrinkage\n    \n    # Correlation Block Adjustment\n    if (cor_type == \"sample\") {\n      alpha_cor_adj <- alpha_cor_shrunk\n    }\n    if (cor_type == \"block_clusters\") {\n      alpha_cor_adj <- alpha_cor_shrunk %>% block_cluster_func(cl_lables = cluster_labels)\n    }\n    \n    sigma <- diag(alpha_sd) %*% alpha_cor_adj %*% diag(alpha_sd)  # This is really the equivalent of sigma/T from the paper\n    colnames(sigma) <- colnames(alpha_cor_shrunk)\n    \n    print(str_c(\"Condition Number: Raw = \", round(kappa(alpha_cor_shrunk), 2), \", Adjusted = \", round(kappa(alpha_cor_adj), 2)))\n  } else {\n    alpha_sd <- sqrt(diag(sigma))\n    names(alpha_sd) <- colnames(sigma)\n  }\n  \n  # Cluster Membership\n  cm <- y_mean %>% \n    as_tibble(rownames = \"char_reg\") %>%\n    mutate(\n      characteristic = str_split(char_reg, \"__\", simplify = T)[, 1]\n    ) %>%\n    left_join(cluster_labels, by = \"characteristic\") \n  \n  m <- cm %>%\n    mutate(cm = 1) %>%\n    select(char_reg, hcl_label, cm) %>%\n    spread(key = hcl_label, value = cm) %>% \n    select(-char_reg) %>% \n    as.matrix()\n  m[is.na(m)] <- 0\n  mm <- m %*% t(m)\n  \n  n_cl <- ncol(m)\n  \n  # Signal Membership\n  z <- cm %>%\n    mutate(sm = 1) %>%\n    select(char_reg, characteristic, sm) %>%\n    spread(key = characteristic, value = sm) %>%\n    select(-char_reg) %>%\n    as.matrix()\n  \n  z[is.na(z)] <- 0\n  zz <- z %*% t(z)\n  \n  n_s <- ncol(z)\n  \n  # Starting Values\n  starting_values <- cm %>%\n    group_by(hcl_label, characteristic) %>%\n    summarise(\n      n_s = n(),\n      signal_mean = mean(value),\n      signal_sd = sd(value)\n    ) %>%\n    group_by(hcl_label) %>%\n    summarise(\n      n_c = sum(n_s),\n      cl_mean = 
mean(signal_mean),\n      cl_sd = sd(signal_mean),\n      cl_signal_within = mean(signal_sd)\n    ) %>%\n    ungroup() %>%\n    mutate(cl_sd = if_else(n_c == 1, 0, cl_sd)) %>%\n    summarise(\n      alpha_mean = mean(cl_mean),  \n      sd_cl_mean = if_else(condition = fix_alpha, sqrt(sum((cl_mean^2) / (n() - 1))), sd(cl_mean)),\n      sd_within_cl = mean(cl_sd),\n      sd_within_signal = mean(cl_signal_within)\n    )\n  \n  if (fix_alpha) {\n    sd_all <- sqrt(sum(y_mean^2) / (length(y_mean) - 1))\n  } else {\n    sd_all <- sd(y_mean)\n  }\n  \n  # Maximum Likelihood\n  omega_func <- function(layers, tau_c, tau_s, tau_w) {\n    if (layers == 1) {\n      a_omega <- diag(n_fcts) * tau_c^2                                        # All alphas are drawn from same distribution\n    }\n    if (layers == 2) {\n      a_omega <- diag(n_fcts) * tau_s^2 + mm * tau_c^2                       # All cluster alphas are drawn from the same distribution, could be done with signals as well  \n    }\n    if (layers == 3) {\n      a_omega <- diag(n_fcts) * tau_w^2 + zz * tau_s^2 + mm * tau_c^2      # Cluster distrib., signal distrib. 
factor distrib.\n    }\n    return(a_omega)\n  }\n  \n  # Choose between specifying prior parameters or finding them via EB\n  if (is.null(priors)) {\n    if (layers == 1) {\n      start_list <- list(\n        a = starting_values$alpha_mean, \n        tc = sd_all)\n      \n      mle_func <- function(a, tc) {\n        a_vec <- rep(a, n_fcts)\n        a_omega <- omega_func(layers = layers, tau_c = tc, tau_s = NULL, tau_w = NULL)\n        \n        a_cov <- sigma + a_omega  #  / t_mat\n        \n        -(mvtnorm::dmvnorm(x = y_mean, mean = a_vec, sigma = a_cov, log = T))  #  + dgamma(param[2], 2, 5, log = T)*sum(mm)/2  + dgamma(param[3], 2, 10, log = T)*140 \n      }\n    } \n    if (layers == 2) {\n      start_list <- list(\n        a = starting_values$alpha_mean, \n        tc = starting_values$sd_cl_mean, \n        ts = starting_values$sd_within_cl)\n      \n      mle_func <- function(a, tc, ts) {\n        a_vec <- rep(a, n_fcts)\n        a_omega <- omega_func(layers = layers, tau_c = tc, tau_s = ts, tau_w = NULL)\n        \n        a_cov <- sigma + a_omega  #  / t_mat\n        \n        -(mvtnorm::dmvnorm(x = y_mean, mean = a_vec, sigma = a_cov, log = T))  \n      }\n    }\n    if (layers == 3) {\n      start_list <- list(\n        a = starting_values$alpha_mean, \n        tc = starting_values$sd_cl_mean, \n        ts = starting_values$sd_within_cl,\n        tw = starting_values$sd_within_signal)\n      \n      mle_func <- function(a, tc, ts, tw) {\n        a_vec <- rep(a, n_fcts)\n        a_omega <- omega_func(layers = layers, tau_c = tc, tau_s = ts, tau_w = tw)\n        \n        a_cov <- sigma + a_omega  \n        \n        -(mvtnorm::dmvnorm(x = y_mean, mean = a_vec, sigma = a_cov, log = T))\n      }\n    }\n    \n    # Maximum likelihood estimation\n    for (k in 1:10) {\n      initial_params <- start_list %>% lapply(function(x) max(x+rnorm(1, mean = 0, sd = 0.01), 0)) # Max is just to ensure that variances are not negative, never in use\n      if (fix_alpha) 
{\n        (hyper_pars <- stats4::mle(minuslogl = mle_func, start = initial_params, lower = c(-Inf, 0, 0, 0)[1:length(start_list)], fixed = list(a = 0)))\n      } else {\n        (hyper_pars <- stats4::mle(minuslogl = mle_func, start = initial_params, lower = c(-Inf, 0, 0, 0)[1:length(start_list)]))\n      }\n      if (hyper_pars@details$convergence==0) break\n    }\n    # Check convergence\n    if (hyper_pars@details$convergence != 0) {\n      warning(\"MLE step did not converge!!!\")\n      return(NULL)\n    } \n    \n    mu <- hyper_pars@fullcoef[\"a\"]\n    tau_c <- hyper_pars@fullcoef[\"tc\"]\n    tau_s <- hyper_pars@fullcoef[\"ts\"]\n    tau_w <- hyper_pars@fullcoef[\"tw\"]\n  } else {\n    mu <- priors$alpha\n    tau_c <- priors$tau_c\n    tau_s <- priors$tau_s\n    tau_w <- priors$tau_w\n  }\n  \n  theta <- omega_func(layers = layers, tau_c = tau_c, tau_s = tau_s, tau_w = tau_w)\n  colnames(theta) <- rownames(theta) <- names(y_mean)\n  print(paste(\"Condition Number Omega =\", round(kappa(theta))))\n  \n  # Signal Posteriors ------------------\n  if (layers == 3) {\n    as_mean <- tau_s^2 * t(z) %*% solve(theta + sigma) %*% (y_mean - rep(mu, n_fcts))  # Posterior mean of the signal-level effect: tau_s^2 * Z'(Omega + Sigma)^-1 (y - mu)\n    as_cov <- tau_s^2 * diag(n_s) - tau_s^4 * t(z) %*% solve(theta + sigma) %*% z\n    as_sd <- sqrt(diag(as_cov))\n    colnames(as_mean) <- \"post_mean\"\n    \n    signal_summary <- as_mean %>% as_tibble(rownames = \"characteristic\") %>%\n      left_join(as_sd %>% as_tibble(rownames = \"characteristic\") %>% rename(\"post_sd\" = value), by = \"characteristic\")\n  }\n  \n  # Factor Posteriors ------------------\n  ai_cov <- solve(solve(theta) + solve(sigma))  # t_mat * solve(sigma)\n  ai_sd <- sqrt(diag(ai_cov))\n  ai_mean <- ai_cov %*% (solve(theta) %*% rep(mu, n_fcts) + solve(sigma) %*% y_mean) ## (t_mat * solve(sigma))\n  \n  rownames(ai_mean) <- names(y_mean)\n  colnames(ai_mean) <- \"post_mean\"\n  names(ai_sd) <- names(y_mean)\n  \n  factor_summary <- ai_mean %>% as_tibble(rownames = \"char_reg\") %>%\n    
left_join(ai_sd %>% as_tibble(rownames = \"char_reg\") %>% rename(\"post_sd\" = value), by = \"char_reg\") %>%\n    left_join(y_mean %>% as_tibble(rownames = \"char_reg\") %>% rename(\"ols_est\" = value), by = \"char_reg\") %>%\n    left_join(alpha_sd %>% as_tibble(rownames = \"char_reg\") %>% rename(\"ols_se\" = value), by = \"char_reg\") %>%\n    mutate(\n      characteristic = str_split(char_reg, \"__\", simplify = T)[, 1],\n      # characteristic = char_reg %>% str_extract(\".+[?=_{2}]\") %>% str_remove(\"__\")\n      p025 = post_mean - 1.96 * post_sd,\n      p975 = post_mean + 1.96 * post_sd\n    ) %>%\n    left_join(cluster_labels, by = \"characteristic\") %>%\n    mutate(\n      region = char_reg %>% str_extract(pattern = \"(?<=_{2}).+\")\n    ) %>%\n    select(char_reg, characteristic, hcl_label, region, everything()) \n  \n  # Output\n  if (is.null(priors)) {\n    comparison <- tibble(\n      estimate = c(\"alpha\", \"tau_c\", \"tau_s\", \"tau_w\")[1:(layers + 1)],\n      crude = drop(unlist(start_list)),\n      ml_est = c(mu, tau_c, tau_s, tau_w)[1:(layers + 1)]\n    )\n    if (fix_alpha) {\n      ml_se <- c(NA_real_, sqrt(diag(solve(hyper_pars@details$hessian))))\n    } else {\n      ml_se <- sqrt(diag(solve(hyper_pars@details$hessian)))\n    }\n    comparison$ml_se <- ml_se\n    \n    print(comparison)\n  }\n  \n  if (plot == T) {\n    list(\"factors\" = factor_summary) %>% eb_plots()\n  }\n  \n  ret_list <- list(\n    \"input\" = data,\n    # \"clusters\" = cluster_summary,\n    \"factors\" = factor_summary,\n    \"factor_mean\" = ai_mean,\n    \"factor_cov\" = ai_cov,\n    \"theta\" = theta,\n    \"sigma\" = sigma\n  )\n  if (is.null(sigma)) {\n    ret_list[[\"alpha_cor_raw\"]] <- alpha_cor_shrunk\n    ret_list[[\"alpha_cor_adj\"]] <- alpha_cor_adj\n  }\n  if (is.null(priors)) {\n    ret_list[[\"mle\"]] <- comparison\n  }\n  if (layers == 3) {\n    ret_list$signal <- signal_summary\n  }\n  return(ret_list)\n}\n\nfdr_sim <- function(t_low, a_vec, 
a_cov, n_sim = 10000, seed=1) {\n  set.seed(seed)\n  t_all <- a_vec / sqrt(diag(a_cov))\n  t_steps <- sort(t_all[t_all > t_low])\n  t_steps <- head(t_steps, -1)  # Don't include the last t-value (no significant)\n  # Simulated alphas\n  sims <- mvtnorm::rmvnorm(n = n_sim, mean = a_vec, sigma = a_cov)\n  # False Discovery as a Function of t-cutoff\n  t_steps %>% lapply(function(t) {\n    # Significant alphas under t-cutoff\n    sig <- (t_all >= t) \n    # False Discovery Rate\n    sims_fdr <- rowMeans(sign(sims[, sig]) != sign(a_vec[sig]))\n    # Output\n    tibble(t_cutoff = t, n_sig = sum(sig), fdr = mean(sims_fdr), fwr = mean(sims_fdr > 0))\n  }) %>% bind_rows()\n}\n\nfdr_fwer_rates <- function(t_cutoff, a_vec, a_cov, orig_sig = F, n_sim = 10000, seed=1) {\n  set.seed(seed)\n  # Simulate from full posterior\n  sims <- mvtnorm::rmvnorm(n = n_sim, mean = a_vec, sigma = a_cov)\n  if (orig_sig == T) {\n    orig_factors <- char_info %>% filter(significance == T) %>% pull(characteristic) %>% str_c(\"__world\")\n    sims <- sims[, match(x = orig_factors, table = rownames(a_vec))]\n    a_vec <- a_vec[orig_factors, ]\n    a_cov <- a_cov[orig_factors, orig_factors]\n  } \n  t_all <- a_vec / sqrt(diag(a_cov))\n  sig <- (t_all >= t_cutoff) \n  sig_sims <- sims[, sig]\n  false_discoveries <- sig_sims %>% apply(1, function(x) mean(x<0))\n  # FDR Distribution\n  fdr_dist <- tibble(\n    min = min(false_discoveries),\n    p025 = quantile(false_discoveries, 0.025),\n    p50 = quantile(false_discoveries, 0.5),\n    p975 = quantile(false_discoveries, 0.975),\n    max = max(false_discoveries),\n    mean = mean(false_discoveries),\n    sd = sd(false_discoveries)\n  )\n  fwer_dist <- tibble(\n    min = min(false_discoveries!=0),\n    p025 = quantile(false_discoveries!=0, 0.025),\n    p50 = quantile(false_discoveries!=0, 0.5),\n    p975 = quantile(false_discoveries!=0, 0.975),\n    max = max(false_discoveries!=0),\n    mean = mean(false_discoveries!=0),\n    sd = 
sd(false_discoveries!=0)\n  )\n  # Output\n  print(paste(\"Factors:\", length(t_all), \"- Sig:\", sum(sig)))\n  print(paste(\"Mean p-value:\", round(mean(1-pnorm(t_all[sig])), 6)))\n  fwer_fdr <- tibble(t_cutoff = t_cutoff, n_sig = sum(sig), fdr = mean(false_discoveries), fwer = mean(false_discoveries > 0))\n  list(\"fdr_dist\"=fdr_dist, \"fwer_dist\" = fwer_dist, \"fwer_fdr\"=fwer_fdr)\n}\n\n# True Factors \ntrue_factors <- function(t_cutoff, a_vec, a_cov, orig_sig = T, n_sim = 10000, seed=1) {\n  set.seed(seed)\n  post_vol <- sqrt(diag(a_cov))\n  # Simulate using all factors\n  sims <- mvtnorm::rmvnorm(n = n_sim, mean = a_vec, sigma = a_cov)\n  # Decide which factors to look at\n  if (orig_sig == T) {\n    orig_factors <- char_info %>% filter(significance == T) %>% pull(characteristic) %>% str_c(\"__world\")\n    orig_factors_match <- match(x = orig_factors, table = rownames(a_vec))\n    sims <- sims[, orig_factors_match]\n    post_vol <- post_vol[orig_factors_match]\n    a_vec <- a_vec[orig_factors_match]\n  }\n  sims <- 1:ncol(sims) %>% sapply(function(i) sims[, i] / post_vol[i])\n  true_factors_dist <- sims %>% apply(1, function(x) mean(x > t_cutoff)) \n  true_stat <- mean(a_vec/post_vol > t_cutoff)\n  # From point 4 in https://influentialpoints.com/Training/bootstrap_confidence_intervals-principles-properties-assumptions.htm\n  bc_ci <- function(stat, bootstraps, alpha=0.05) { # bias-corrected bootstrap confidence interval\n    # estimate bias in std. norm deviates\n    b <- qnorm((sum(bootstraps > stat)+sum(bootstraps==stat)/2)/length(bootstraps))  # Proportion of bootstrap samples above the \"population\" estimate. If unbiased, this proportion is 0.5 and b = 0. To handle discrete statistics, half of the sample at the population estimate is assumed to lie above\n    z <- qnorm(c(alpha/2,1-alpha/2)) # Std. norm. 
limits\n    p <- pnorm(z-2*b) # bias-correct & convert to proportions\n    \n    quantile(bootstraps,p=p) # Bias-corrected percentile lims.\n  } \n  bias_corrected <- bc_ci(stat = true_stat, bootstraps = true_factors_dist, alpha = 0.05)\n  tibble(\n    min = min(true_factors_dist),\n    p025 = quantile(true_factors_dist, 0.025),\n    p50 = quantile(true_factors_dist, 0.5),\n    p975 = quantile(true_factors_dist, 0.975),\n    max = max(true_factors_dist),\n    mean = mean(true_factors_dist),\n    sd = sd(true_factors_dist),\n    p025_bc = bias_corrected[1],\n    p975_bc = bias_corrected[2]\n  )\n}\n\n# Simulations for figure 2\nsim_mt_control <- function(sim_settings) {\n  # Cluster membership\n  m <- matrix(0, nrow = sim_settings$n, ncol = sim_settings$clusters)  # Cluster membership\n  j <- 0\n  for (i in 1:sim_settings$clusters) {\n    m[(j+1):(j + sim_settings$fct_pr_cl), i] <- 1\n    j <- j + sim_settings$fct_pr_cl\n  }\n  # Correlation Matrix\n  corr_mat <- m %*% t(m)\n  corr_mat[corr_mat == 0] <- sim_settings$corr_across\n  corr_mat[corr_mat == 1] <- sim_settings$corr_within\n  diag(corr_mat) <- 1\n  # Sigma\n  sigma <- sim_settings$se^2 * corr_mat\n  # Predefine variables\n  alpha_0_vec <- rep(sim_settings$alpha_0, sim_settings$n)\n  i_n <- diag(sim_settings$n)\n  # Simulation\n  search_grid <- expand.grid(\"tau_c\" = sim_settings$tau_c, \"tau_w\" = sim_settings$tau_w)\n  1:nrow(search_grid) %>% lapply(function(i) {\n    tau_c <- search_grid[i, \"tau_c\"]\n    tau_w <- search_grid[i, \"tau_w\"]\n    print(paste(\"Iteration\", i, \"out of\", nrow(search_grid)))\n    alpha_noise <- MASS::mvrnorm(n = sim_settings$n_sims, mu = rep(0, sim_settings$n), Sigma = sigma) # Preallocate alpha noise for efficiency\n    s <- 1:sim_settings$n_sims %>% lapply(function(s) {\n      omega <- m %*% t(m) * tau_c^2 + i_n * tau_w^2\n      alpha_c <- rnorm(sim_settings$clusters) * tau_c\n      alpha_w <- rnorm(sim_settings$n) * tau_w\n      alpha_true <- alpha_0_vec + m %*% alpha_c 
+ alpha_w\n      alpha_hat <- alpha_true + alpha_noise[s, ]\n      \n      post_var <- solve(solve(omega) + solve(sigma))\n      post_alpha <- post_var %*% (solve(omega) %*% alpha_0_vec + solve(sigma) %*% alpha_hat)\n      \n      eb <- tibble(\"type\" = \"eb\", \"true_alpha\" = drop(alpha_true), \"z\" = drop(post_alpha / sqrt(diag(post_var))), \"p\" = 2 * pnorm(abs(z), lower.tail = F))\n      ols <- tibble(\"type\" = \"ols\", \"true_alpha\" = drop(alpha_true), \"z\" = drop(alpha_hat / sqrt(diag(sigma))), \"p\" = 2 * pnorm(abs(z), lower.tail = F))\n      by <- tibble(\"type\" = \"by\", \"true_alpha\" = drop(alpha_true), \"z\" = ols$z)\n      by$p <- p.adjust(ols$p, method = \"BY\")\n      rbind(eb, ols, by) %>%\n        mutate(sig = z > 0 & p < 0.025) %>%\n        group_by(type) %>%\n        summarise(\n          sim = s,\n          n_disc = sum(sig),\n          true_disc = sum(sign(true_alpha[sig == T]) == sign(z[sig == T])),\n          false_disc = n_disc - true_disc\n        )\n    }) %>% bind_rows()\n    s %>% \n      group_by(type) %>%\n      mutate(fdp = if_else(n_disc == 0, 0, false_disc / n_disc)) %>%\n      summarise(\n        fdr = mean(fdp),\n        n_disc = mean(n_disc),\n        false_disc = mean(false_disc),\n        true_disc = mean(true_disc),\n        tau_c = tau_c,\n        tau_w = tau_w,\n        n = n()\n      ) %>%\n      mutate(true_disc_rate = true_disc / (sim_settings$n / 2))\n  }) %>% bind_rows()\n}\n\nmultiple_testing <- function(eb_all, eb_world = NULL) {\n  eb_all$factors %>%\n    bind_rows(eb_world$factors) %>%\n    mutate(\n      t_ols = ols_est/ols_se,\n      p_ols = 2*pnorm(abs(t_ols), lower.tail = F)\n    ) %>%\n    group_by(region) %>%\n    mutate(\n      n = n(),\n      p_bonf = p_ols %>% p.adjust(method = \"bonferroni\"),\n      p_holm = p_ols %>% p.adjust(method = \"holm\"),\n      p_bh = p_ols %>% p.adjust(method = \"BH\"),\n      p_by = p_ols %>% p.adjust(method = \"BY\")\n    ) %>%\n    select(n, region, char_reg, 
\"estimate\" = ols_est, \"statistic\" = t_ols, \"se\" = ols_se, starts_with(\"p_\")) %>%\n    gather(starts_with(\"p_\"), key = \"method\", value = \"p\") %>%\n    mutate(\n      method = method %>% str_remove(\"^p_\"),\n      mt_adj = case_when(\n        method == \"ols\" ~ \"None\",\n        method == \"bh\" ~ \"FDR\",\n        method == \"by\" ~ \"FDR\",\n        method == \"bonf\" ~ \"FWR\",\n        method == \"holm\" ~ \"FWR\"\n      ),\n      method = case_when(\n        method == \"ols\" ~ \"OLS\",\n        method == \"bh\" ~ \"BH\",\n        method == \"by\" ~ \"BY\",\n        method == \"bonf\" ~ \"Bonferroni\",\n        method == \"holm\" ~ \"Holm\",\n        TRUE ~ method\n      )\n    ) \n}\n\n# Bootstrap Tangency Portfolio --\n# BS Func\nbootstrap_tpf <- function(data, n_boots = 100, shorting = T, seed = 1) {\n  set.seed(seed)\n  if (shorting) {\n    boot_func <- function(splits, ...) {\n      df <- analysis(splits) %>% apply(2, function(x) x / sd(x)) %>% as.data.frame()\n      lm(rep(1, nrow(df)) ~ -1 + ., data = df) %>% \n        broom::tidy() %>% \n        mutate(weight = estimate / sum(estimate)) %>% \n        mutate(term = term %>% str_remove_all(\"`\")) %>%\n        select(term, weight)\n    }\n  } else {\n    boot_func <- function(splits, ...) 
{\n      df <- analysis(splits) %>% apply(2, function(x) x / sd(x))\n      glmnet::glmnet(y = rep(1, nrow(df)), x = df %>% as.matrix(), \n                     lambda = 0, lower.limits = 0, intercept = F) %>% \n        broom::tidy(return_zeros = T) %>% \n        filter(term != \"(Intercept)\") %>% \n        mutate(weight = estimate / sum(estimate)) %>%\n        select(term, weight)\n    }\n  } \n  \n  data %>% \n    bootstraps(times = n_boots, apparent = T) %>% # Apparent = T --> Generate original data \n    mutate(\n      coef = splits %>% map(.f = boot_func)\n    ) \n}\n\n# Full tpf\ntpf_cluster <- function(data, mkt_region, orig_sig, min_date, n_boots, shorting, seed) {\n  if (orig_sig) {\n    orig_sig_values <- T\n  } else {\n    orig_sig_values <- c(T, F)\n  }\n  market_ret <- regional_mkt_ret[region == mkt_region]\n  \n  cluster_pf <- data %>%\n    left_join(cluster_labels, by = \"characteristic\") %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n    filter(orig_sig %in% orig_sig_values) %>%\n    group_by(hcl_label, eom) %>%\n    summarise(\n      ret = mean(ret)\n    )\n  \n  tpf_data <- cluster_pf %>% \n    filter(eom >= min_date) %>% \n    spread(key = hcl_label, value = ret) %>%\n    left_join(market_ret %>% select(eom, market), by = c(\"eom\")) %>%\n    rename(Market = market)\n  \n  tpf_data %>% select(-eom) %>% bootstrap_tpf(n_boots = n_boots, shorting = shorting, seed = seed) %>% mutate(market_region = mkt_region)\n}\n\n# In-Sample / Out-of-Sample Functions\nprepare_is_oos <- function(input, min_obs, orig_group, ret_scaled, type, print=F) {  # ret_scaled in ('none', 'all', 'is') & type in ('pre', 'post', 'pre_post')  \n  data <- input %>%\n    select(characteristic, eom, ret, mkt_vw_exc) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig %in% orig_group) %>%\n    left_join(char_info %>% 
select(characteristic, sample_start, sample_end), by = \"characteristic\") %>%\n    mutate(\n      period = case_when(\n        year(eom) >= sample_start & year(eom) <= sample_end ~ \"is\",\n        type == \"pre\" & year(eom) < sample_start ~ \"oos\",\n        type == \"post\" & year(eom) > sample_end ~ \"oos\",\n        type == \"pre_post\" & (year(eom) < sample_start | year(eom) > sample_end) ~ \"oos\"\n      ),\n      ret = ret * 100,\n      mkt_vw_exc = mkt_vw_exc * 100\n    ) %>%\n    filter(!is.na(period))\n  \n  # Exclude factors with too few in-sample or out-of-sample observations\n  data_excl <- data %>%\n    group_by(characteristic) %>%\n    mutate(n_is = sum(period == \"is\"), n_oos = sum(period == \"oos\")) %>% \n    filter(n_is >= min_obs & n_oos >= min_obs)\n  \n  if (ret_scaled == \"none\") {\n    data_excl <- data_excl %>% mutate(ret_adj = ret)  # Keep the ret_adj column consistent with the other branches\n  }\n  \n  if (ret_scaled == \"all\") {\n    data_excl <- data_excl %>% \n      group_by(characteristic, period) %>%\n      mutate(\n        ret_neu = ret - cov(ret, mkt_vw_exc)/var(mkt_vw_exc) * mkt_vw_exc,\n        ret_adj = ret * (10/sqrt(12))/sd(ret_neu)) %>%\n      select(-ret_neu) %>%\n      ungroup()\n  }\n  if (ret_scaled == \"is\") {\n    is_vol <- data_excl %>%\n      filter(period == \"is\") %>%\n      group_by(characteristic) %>%\n      mutate(ret_neu = ret - cov(ret, mkt_vw_exc)/var(mkt_vw_exc) * mkt_vw_exc) %>%\n      summarise(\n        is_sd = sd(ret_neu)\n      )\n    \n    data_excl <- data_excl %>% \n      left_join(is_vol, by = \"characteristic\") %>%  # is_vol is keyed by characteristic only (region was dropped in the initial select)\n      group_by(characteristic, period) %>%\n      mutate(ret_adj = ret * (10/sqrt(12))/is_sd) %>%\n      ungroup() %>%\n      select(-is_sd)\n  }\n  \n  full <- data %>% summarise(fct_all = uniqueN(characteristic))\n  excl <- data_excl %>% summarise(fct_excl = uniqueN(characteristic))\n  \n  if (print) {\n    print(tibble(\"type\"=type, full, excl))\n  }\n  \n  return(data_excl)\n}\n\n# Economic Benefit of more Power\ntrading_on_significance <- function(posterior_is) {\n  pf_base 
<- posterior_is %>%\n    left_join(char_info %>% select(characteristic, significance, sample_end), by = \"characteristic\") %>%\n    filter(significance == 1 & est_date >= sample_end) %>%\n    group_by(est_date) %>%\n    mutate(\n      ols_p = pnorm(abs(ols_est / ols_se), lower.tail = F)*2,\n      by_p = p.adjust(ols_p, method = \"BY\")\n    ) \n  \n  print(pf_base %>%\n          summarise(\n            rr_eb = mean(p025 > 0),\n            rr_ols = mean(ols_p <= 0.05 & ols_est > 0),\n            rr_by = mean(by_p <= 0.05 & ols_est > 0),\n          ) %>%\n          gather(rr_eb, rr_ols, rr_by, key = \"type\", value = \"rr\") %>%\n          ggplot(aes(est_date, rr, colour = type)) +\n          geom_point() +\n          geom_line())\n  \n  positions <- pf_base %>%\n    ungroup() %>%\n    mutate(\n      position_year = year(est_date) + 1,\n      eb_sig = (p025 > 0),\n      by_sig = (by_p <= 0.05 & ols_est > 0)\n    ) %>%\n    select(position_year, characteristic, eb_sig, by_sig)\n  \n  candidate_factors <- regional_pfs %>%\n    mutate(year = year(eom)) %>%\n    left_join(positions, by = c(\"characteristic\", \"year\" = \"position_year\")) %>%\n    mutate(marg_sig = (eb_sig == T & by_sig == F)) %>%\n    gather(marg_sig, eb_sig, by_sig, key = \"type\", value = \"significant\")\n  \n  candidate_factors %>%\n    filter(significant == T) %>%\n    group_by(region, type, significant, eom) %>%\n    summarise(\n      n = n(),\n      ret = mean(ret),\n      mkt = mean(mkt_vw_exc)\n    )  \n}\n\n# Simulation according to specification for Harvey et al (2016)\nharvey_et_al_sim <- function(sim_settings, seed) {\n  set.seed(seed)\n  # Cluster membership\n  m <- matrix(0, nrow = sim_settings$n, ncol = sim_settings$cl)  # Cluster membership\n  j <- 0\n  for (i in 1:sim_settings$cl) {\n    m[(j+1):(j + sim_settings$fct_pr_cl), i] <- 1\n    j <- j + sim_settings$fct_pr_cl\n  }\n  mm <- m %*% t(m)\n  # Correlation Matrix\n  corr_mat <- mm\n  corr_mat[corr_mat == 0] <- 
sim_settings$corr_across\n  corr_mat[corr_mat == 1] <- sim_settings$corr_within\n  diag(corr_mat) <- 1\n  \n  # Average Correlation (Should be close to zero)\n  mean(corr_mat[lower.tri(corr_mat)])\n  \n  # Sigma\n  sigma <- sim_settings$se^2 * corr_mat\n  \n  # Predefine variables\n  alpha_0_vec <- rep(sim_settings$alpha_0, sim_settings$n)\n  i_n <- diag(sim_settings$n)\n  \n  # Simulation\n  sim_settings$tau_ws %>% lapply(function(tau_w) {\n    start <- proc.time()\n    alpha_noise <- MASS::mvrnorm(n = sim_settings$n_sims, mu = rep(0, sim_settings$n), Sigma = sigma) # Preallocate alpha noise for efficiency\n    tau_sim <- 1:sim_settings$n_sims %>% sapply(simplify = F, USE.NAMES = T, function(s) {\n      print(paste(\"Tau_w:\", tau_w, \"- Simulation\", s, \"out of\", sim_settings$n_sims))\n      # Simulate Alphas\n      alpha_c <- c(rep(sim_settings$ret, times = sim_settings$cl_true), rep(0, times = (sim_settings$cl - sim_settings$cl_true)))\n      alpha_w <- c(rnorm(sim_settings$n_true) * tau_w, rep(0, sim_settings$n - sim_settings$n_true))\n      alpha_true <- alpha_0_vec + m %*% alpha_c + alpha_w\n      alpha_hat <- as.vector(alpha_true + alpha_noise[s, ])\n      \n      # MLE Function\n      mle_func <- function(a, tc, tw) {\n        a_vec <- rep(a, sim_settings$n)\n        a_omega <- i_n * tw^2 + mm * tc^2\n        a_cov <- sigma + a_omega  #  / t_mat\n        \n        -(mvtnorm::dmvnorm(x = alpha_hat, mean = a_vec, sigma = a_cov, log = T))  \n      }\n      \n      # Starting Values\n      starting_values <- tibble(a = alpha_hat, cl = rep(1:sim_settings$cl, each = sim_settings$fct_pr_cl)) %>%\n        group_by(cl) %>%\n        summarise(\n          cl_mean = mean(a),\n          cl_sd = sd(a)\n        ) %>%\n        summarize(\n          crude_a0 = mean(cl_mean),\n          crude_tc = if_else(sim_settings$fix_alpha, sqrt(sum((cl_mean^2) / (n() - 1))), sd(cl_mean)),\n          crude_tw = mean(cl_sd)\n        )\n      start_list <- list(\n        a = 
starting_values$crude_a0, \n        tc = starting_values$crude_tc, \n        tw = starting_values$crude_tw)\n      \n      # Estimate Parameters\n      if (sim_settings$fix_alpha) {\n        (hyper_pars <- stats4::mle(minuslogl = mle_func, start = start_list, lower = c(-Inf, 0, 0), fixed = list(a = 0)))\n      } else {\n        (hyper_pars <- stats4::mle(minuslogl = mle_func, start = start_list, lower = c(-Inf, 0, 0)))\n      }\n      \n      # Check convergence\n      if (hyper_pars@details$convergence != 0) {\n        warning(\"MLE step did not converge!!!\")\n        return(NULL)\n      }\n      mu <- hyper_pars@fullcoef[\"a\"]\n      tc <- hyper_pars@fullcoef[\"tc\"]\n      tw <- hyper_pars@fullcoef[\"tw\"]\n      mle <- tibble(\n        s = rep(s, 3),\n        coef = c(\"a\", \"tc\", \"tw\"),\n        mle = c(mu, tc, tw),\n        crude = c(start_list$a, start_list$tc, start_list$tw)\n      )\n      print(mle)\n      \n      # Specify Posterior\n      omega <- i_n * tw^2 + mm * tc^2\n      post_cov <- solve(solve(omega) + solve(sigma))\n      post_alpha <- post_cov %*% (solve(omega) %*% alpha_0_vec + solve(sigma) %*% alpha_hat)\n      \n      list(\"alpha_true\" = alpha_true, \"alpha_hat\" = alpha_hat, \"post_alpha\" = post_alpha, \"post_cov\" = post_cov, \"mle\" = mle)\n    })\n    print(proc.time() - start)  # 5 iterations took 232.13/60 = 4 minutes  \n    return(tau_sim)\n  }) \n}\n\n\n# Single Factor TP --\nsr_func <- function(data, w) {\n  ret_vec <- data %>% colMeans()\n  cov_mat <- data %>% cov()\n  drop(w %*% ret_vec / sqrt(t(w) %*% cov_mat %*% w))\n}\n\nepo_tpf <- function(data, s) {\n  sd <- data %>% apply(2, sd)\n  cor <- data %>% cor()\n  ret_vec <- data %>% colMeans()\n  cor_shrunk <- diag(length(sd)) * s + cor * (1-s)\n  cov_shrunk <- diag(sd) %*% cor_shrunk %*% diag(sd)\n  drop((solve(cov_shrunk) %*% ret_vec) / sum(solve(cov_shrunk) %*% ret_vec))\n}\n\n# Prepare data for \nprepare_tpf_factors <- function(region, orig_sig_values, start, scale) 
{\n  mkt <- regional_mkt_ret %>% filter(region == !!region) %>% select(-region)\n  \n  tpf_factors <- eb_est[[region]]$input$long %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n    filter(orig_sig %in% orig_sig_values) %>%\n    group_by(eom) %>%\n    mutate(n = n()) %>%\n    ungroup() %>%\n    filter(eom >= start) %>%\n    select(characteristic, eom, n, ret)\n  \n  # Ensure that all factors have data\n  missing <- tpf_factors %>% filter(n != max(n)) %>% select(eom, n) %>% distinct()\n  if (nrow(missing) > 0) {\n    warning(\"UNBALANCED PANEL - SOME FACTORS ARE MISSING DATA!\")\n  }\n  tpf_factors <- tpf_factors %>% filter(n == max(n)) %>% select(-n)  \n  \n  # 1. Overall TPF\n  tpf_factors <- tpf_factors %>%\n    bind_rows(mkt %>% rename(\"ret\"=\"market\") %>% mutate(characteristic = \"market\") %>% filter(eom %in% tpf_factors$eom)) \n  \n  if (scale) {\n    tpf_factors <- tpf_factors %>%\n      group_by(characteristic) %>%\n      mutate(ret = ret * (0.1 / sqrt(12)) / sd(ret))\n  }\n  \n  tpf_factors_wide <- tpf_factors %>%\n    pivot_wider(names_from = characteristic, values_from = ret) %>%\n    select(-eom) \n  # Output\n  list(\"long\"=tpf_factors, \"wide\" = tpf_factors_wide)\n}\n\n# Optimal Shrinkage\noptimal_shrinkage <- function(data, k, epo_range = seq(0, 1, 0.1)) {\n  finance_kfold <- function(dates, k, horizon) {\n    # Helper Function\n    helper_eom_seq <- function(ends, horizon) {\n      all <- ends %>% lapply(function(d) seq.Date(from = ceiling_date(d, unit = \"month\") - months(horizon - 1), to = ceiling_date(d, unit = \"month\"), by = \"1 month\") - 1)\n      all_unique <- do.call(c, all) %>% unique()\n    }\n    # Split Dates into k groups\n    date_vec <- dates %>% unique() %>% sort()\n    n <- length(date_vec)\n    n_fold <- floor(length(date_vec) / k)\n    split <- split(date_vec, cut(seq_along(date_vec), k, labels = FALSE))\n    # Create train/test split\n    1:k %>% 
lapply(function(i) {\n      test_ends <- split[[i]]\n      test_dates <- test_ends %>% helper_eom_seq(horizon = horizon)\n      train_ends <- do.call(c, split[-i])\n      train_dates <- train_ends %>% helper_eom_seq(horizon = horizon)\n      # Avoid Data Leakage\n      train_dates <- train_dates[!(train_dates %in% test_dates)]\n      tibble(fold = i, train = list(train_dates), test = list(test_dates))\n    }) %>%\n      bind_rows()\n  }\n  \n  date_split <- unique(data$eom) %>% finance_kfold(k = k, horizon = 1)\n  cross_val <- 1:k %>% lapply(function(i)  {\n    test_dates <- date_split$test[[i]]\n    test <- data %>%\n      filter(eom %in% test_dates) %>%\n      pivot_wider(names_from = characteristic, values_from = ret) \n    test_eom <- test$eom\n    test <- test %>% select(-eom)\n    \n    train_dates <- date_split$train[[i]]\n    train <- data %>%\n      filter(eom %in% train_dates) %>%\n      pivot_wider(names_from = characteristic, values_from = ret) %>%\n      select(-eom) \n    \n    # Create Weights\n    nonneg_w <- glmnet::glmnet(y = rep(1, nrow(train)), x = train %>% as.matrix(), \n                               lambda = 0, lower.limits = 0, intercept = F) %>% \n      tidy(return_zeros = T) %>% \n      filter(term != \"(Intercept)\") %>% \n      mutate(\n        type = \"Non-Negative\",\n        weight = estimate / sum(estimate)\n      ) %>%\n      select(type, weight)\n    \n    epo_w <- epo_range %>% lapply(function(s) {\n      epo <- train %>% epo_tpf(s = s)\n      tibble(type = paste0(\"EPO s=\", s), weight = epo)\n    })\n    \n    # OOS Performance\n    c(list(nonneg_w), epo_w) %>% lapply(function(x) {\n      w <- x$weight\n      tibble(\n        type = unique(x$type),\n        eom = test_eom, \n        ret = drop(as.matrix(test) %*% w)\n      )\n    }) %>%\n      bind_rows() %>%\n      mutate(\n        i = i,\n        test_range = paste0(year(min(test_dates)), \"-\", year(max(test_dates)))\n      )\n  }) %>% bind_rows()\n  \n  cross_val_summary <- 
cross_val %>%\n    group_by(type) %>%\n    summarise(\n      ann_ret = mean(ret),\n      sd = sd(ret),\n      sr = ann_ret/sd\n    ) %>%\n    mutate(\n      type_overall = if_else(str_detect(type, \"EPO\"), \"EPO\", \"Non-Negative\"),\n      type = if_else(type == \"EPO s=0\", \"Unconstrained\", type),\n      type = type %>% factor(levels = c(\"Non-Negative\", \"Unconstrained\", paste0(\"EPO s=\", seq(0.1, 1, 0.1))))\n    ) \n  \n  print(cross_val_summary %>%\n          ggplot(aes(type, sr, group=type_overall)) +\n          geom_point() +\n          geom_path() +\n          labs(colour = \"Test Period:\", y = \"Monthly OOS SR of TPF\") +\n          theme(legend.position = \"top\", axis.title.x = element_blank(), axis.text.x = element_text(angle = 45, vjust=0.5)))\n  \n  opt_s_summary <- cross_val_summary %>% filter(type_overall == \"EPO\" & sr == max(sr)) %>% mutate(s = type %>% str_remove(\"EPO s=\") %>% as.numeric()) \n  print(paste0(\"Highest OOS SR: \", opt_s_summary %>% pull(sr) %>% round(2), \", Standard MVO: \", cross_val_summary %>% filter(type == \"Unconstrained\") %>% pull(sr) %>% round(2)))\n  \n  # Optimal shrinkage\n  opt_s_summary %>% pull(s)\n}\n\n\n# Table Functions --------------------------------------------------------\ntable_is_oos_ols <- function(is_oos_regs, is_post_regs) {\n  oos_us <- lm(oos ~ is, data = is_oos_regs %>% filter(region == \"us\"))\n  post_us <- lm(post ~ is, data = is_post_regs %>% filter(region == \"us\"))\n  \n  stargazer::stargazer(\n    post_us, oos_us,\n    title = \"OLS - Biased: $\\\\hat{\\\\alpha}_\\\\text{Out-of-Sample} = \\\\gamma_0 + \\\\gamma_1\\\\times\\\\hat{\\\\alpha}_\\\\text{In-Sample}$\", out.header=T,\n    no.space=T, digits=3, type='latex', single.row=F,\n    align = T, notes.align = \"l\",\n    omit.stat = c(\"adj.rsq\", \"f\", \"ser\"),\n    covariate.labels = c(\"Intercept\", \"$\\\\hat{\\\\alpha}_\\\\text{IS}$\"), dep.var.caption = \"Dependent Variable:\", dep.var.labels = 
c(\"$\\\\hat{\\\\alpha}_\\\\text{Post IS}$\", \"$\\\\hat{\\\\alpha}_\\\\text{Pre \\\\& Post IS}$\"),\n    notes.append=T, report = \"vc*t\", intercept.bottom = F)\n}\n\ntable_is_oos_nls <- function(nls_post, nls_oos) {\n  info <- list(\n    \"post\" = list(\n      \"dep\" = \"post\",\n      \"fit\" = nls_post\n    ),\n    \"oos\" = list(\n      \"dep\" = \"oos\",\n      \"fit\" = nls_oos\n    )\n  ) \n  \n  op <- info %>% sapply(simplify = F, USE.NAMES = T, function(x) {\n    # Create Fake Linear Model\n    fake_data <- tibble(y = rnorm(10), k0 = rnorm(10), kh = rnorm(10))\n    colnames(fake_data) <- c(x$dep, \"k0\", \"kh\")\n    lm_string <- paste0(x$dep, \"~k0 + kh -1\")\n    fake_lm = lm(lm_string, data = fake_data)\n    fake_x = c(\"k0\", \"kh\")\n    \n    # Generate various parts for output\n    sum_xx = summary(x$fit$nls_fit)\n    mat_xx = sum_xx$coefficients[1:2, ]\n    colnames(mat_xx) = c(\"coef\",\"se\", \"t\", \"p\")\n    indVarNames = rownames(mat_xx)\n    \n    # Generate coefficients, se, t-stat and p values \n    df_xx = as.data.frame(mat_xx)\n    vCoef = df_xx$coef; names(vCoef)=fake_x\n    vSE = df_xx$se; names(vSE)=fake_x\n    vT = df_xx$t; names(vT)=fake_x\n    vP = df_xx$p; names(vP)=fake_x\n    \n    formulaTxt = sum_xx$formula\n    nParameters = sum_xx$df[1]\n    nDF = sum_xx$df[2]\n    obs <- length(x$fit$nls_fit$m$resid())\n    n_fcts <- uniqueN(x$fit$nls_data$c)\n    rss = round(sum_xx$sigma, 3)\n    convTolerance = x$fit$nls_fit$m$conv()\n    list(\"lm\"=fake_lm, \"coef\" = vCoef, \"se\" = vSE, \"t\" = vT, \"p\" = vP, \"rss\" = rss, \"obs\" = obs, \"n_fcts\" = n_fcts)\n  })\n  \n  # Determine order\n  y1 <- \"post\"\n  y2 <- \"oos\"\n  \n  # Aesthetics \n  vTitle = \"NLS - Unbiased: $R_{i,t} = \\\\alpha_i + (\\\\kappa_0 + \\\\kappa_h \\\\times \\\\alpha_i)\\\\times 1_\\\\text{\\\\{Out-of-Sample\\\\}}$\"\n  vType = \"latex\"\n  # v_col_label = c(\"USA\", \"Developed\", \"Emerging\")\n  lines_obs <- c(\"Observations\", \n                 
sprintf(\"\\\\multicolumn{1}{r}{%s}\", prettyNum(op[[y1]]$obs, big.mark = \",\")), \n                 sprintf(\"\\\\multicolumn{1}{r}{%s}\", prettyNum(op[[y2]]$obs, big.mark = \",\")))\n  lines_fcts <- c(\"Factors\", sprintf(\"\\\\multicolumn{1}{r}{%s}\", op[[y1]]$n_fcts), sprintf(\"\\\\multicolumn{1}{r}{%s}\", op[[y2]]$n_fcts))\n  dep_var_options <- list(\"post\" = \"Post IS\", \"oos\" = \"Pre \\\\& Post IS\")\n  dep_var_lbls <- c(dep_var_options[[y1]], dep_var_options[[y2]])\n  \n  # Output\n  stargazer::stargazer(\n    op[[y1]]$lm, op[[y2]]$lm,\n    title = vTitle, out.header=T,\n    no.space=T, digits=3, type=vType, single.row=F,\n    align = T, notes.align = \"l\",\n    omit.stat = c(\"rsq\",\"adj.rsq\", \"f\", \"n\", \"ser\"),\n    covariate.labels = c(\"$\\\\kappa_0$\", \"$\\\\kappa_h$\"), dep.var.labels.include = T, dep.var.caption = \"OOS Period:\", dep.var.labels = dep_var_lbls,\n    add.lines=list(lines_fcts, lines_obs), report = \"vc*t\", intercept.bottom = F,\n    # notes=vNotes, notes.append=T,\n    coef=list(op[[y1]]$coef, op[[y2]]$coef), se=list(op[[y1]]$se, op[[y2]]$se), t=list(op[[y1]]$t, op[[y2]]$t), p=list(op[[y1]]$p, op[[y2]]$p)\n  )\n}\n\ntable_taus <- function(){\n  taus <- list(\n    list(\"USA\", \"us\"),\n    list(\"Developed\", \"developed\"),\n    list(\"Emerging\", \"emerging\"),\n    list(\"USA, Developed & Emerging\", \"all\"),\n    list(\"World\", \"world\"),\n    list(\"World ex. 
US\", \"world_ex_us\"),\n    list(\"USA - Mega\", \"us_mega\"),\n    list(\"USA - Large\", \"us_large\"),\n    list(\"USA - Small\", \"us_small\"),\n    list(\"USA - Micro\", \"us_micro\"),\n    list(\"USA - Nano\", \"us_nano\")\n  ) %>% lapply(function(x) {\n    eb_est[[x[[2]]]]$mle %>% \n      select(estimate, ml_est) %>%\n      spread(key = estimate, value = ml_est) %>%\n      mutate(sample = x[[1]])\n  }) %>% \n    bind_rows() %>%\n    select(sample, tau_c, tau_s, tau_w)\n  \n  tau_cap <- paste(\n    \"The table shows the tau parameters estimated by maximum likelihood.\",\n    \"$\\\\tau_c$ is the estimated dispersion in cluster alphas.\",\n    \"$\\\\tau_w$ is the estimated dispersion in factor alphas within a cluster.\",\n    \"$\\\\tau_s$ is the estimated dispersion in the alpha of the same factor in different regions.\"\n  )\n  \n  taus %>%\n    select(\"Sample\" = sample, \"$\\\\tau_c$\" = tau_c, \"$\\\\tau_w$\" = tau_s, \"$\\\\tau_s$\" = tau_w) %>% # Column labels follow the notation from eq 23, so tau_s and tau_w are deliberately swapped relative to the variable names\n    xtable(auto=T, digits = 2, caption = tau_cap) %>% \n    print(include.rownames = F, caption.placement = \"top\", sanitize.colnames.function = identity) \n}\n\n# Table - Factor Performance\ntable_factor_info <- function() {\n  table <- eb_est$all$factors %>%\n    mutate(p_zero = pnorm(q = 0, mean = post_mean, sd = post_sd)) %>%\n    select(characteristic, region, ols_est, \"eb_est\" = post_mean, p_zero) %>% \n    pivot_wider(names_from = region, values_from = c(ols_est, eb_est, p_zero), names_sep = \"_\") %>%\n    select(characteristic, ends_with(\"_us\"), ends_with(\"_developed\"), ends_with(\"_emerging\")) %>%\n    left_join(char_info %>% select(characteristic, significance), by = \"characteristic\") %>%\n    mutate(char_name = if_else(significance == 0, paste0(characteristic, \"*\"), characteristic)) %>%\n    select(-characteristic, -significance) %>%\n    select(char_name, everything()) %>%\n    arrange(ols_est_us) %>%\n    as.data.frame() \n  \n  data.frame(table[, 1:4], 
\"empty1\" = rep(\"\", nrow(table)), table[, 5:7], \"empty2\" = rep(\"\", nrow(table)), table[, 8:10]) %>%\n    xtable() %>%\n    print()\n}\n\ntable_economic_benefit <- function(sig_pfs) {\n  sig_regs <- c(\"us\", \"developed\", \"emerging\") %>% lapply(function(x) {\n    fit <- lm(ret ~ mkt, data = sig_pfs %>% filter(type == \"marg_sig\" & region == x) %>% mutate(ret = ret*100, mkt = mkt*100)) # Ensures alpha is in Percentages\n    nw <- fit %>% lmtest::coeftest(vcov = sandwich::NeweyWest(fit, lag = 6))\n    list(\"fit\"=fit, \"nw\"=nw)\n  })\n  \n  lines_obs <- c(\"Observations\", \n                 sprintf(\"\\\\multicolumn{1}{r}{%s}\", prettyNum(length(sig_regs[[1]]$fit$residuals), big.mark = \",\")), \n                 sprintf(\"\\\\multicolumn{1}{r}{%s}\", prettyNum(length(sig_regs[[2]]$fit$residuals), big.mark = \",\")), \n                 sprintf(\"\\\\multicolumn{1}{r}{%s}\", prettyNum(length(sig_regs[[3]]$fit$residuals), big.mark = \",\")))\n  lines_r2 <- c(\"Adjusted $R^2$\", \n                sprintf(\"\\\\multicolumn{1}{r}{%s}\", formatC(summary(sig_regs[[1]]$fit)$adj.r.squared, digits = 2, format = \"f\")), \n                sprintf(\"\\\\multicolumn{1}{r}{%s}\", formatC(summary(sig_regs[[2]]$fit)$adj.r.squared, digits = 2, format = \"f\")), \n                sprintf(\"\\\\multicolumn{1}{r}{%s}\", formatC(summary(sig_regs[[3]]$fit)$adj.r.squared, digits = 2, format = \"f\")))\n  \n  stargazer::stargazer(sig_regs[[1]]$nw, sig_regs[[2]]$nw, sig_regs[[3]]$nw, dep.var.labels.include = F, dep.var.caption = \"Region\",\n                       no.space = F, intercept.bottom = F, report = \"vc*t\", column.labels = c(\"US\", \"Developed ex. 
US\", \"Emerging\"),\n                       add.lines=list(lines_obs, lines_r2), covariate.labels = c(\"Alpha\", \"Market Beta\"), align=T, digits=2)\n}\n\n\n\n# PLOT FUNCTIONS ---------------------------------------------------------\ncluster_val <- function(cor, labels, op_format = \"pdf\") {\n  pairwise_cor <- cor %>%\n    as_tibble(rownames = \"char1\") %>%\n    gather(-char1, key = \"char2\", value = \"cor\") %>%\n    left_join(select(labels, characteristic, \"label1\" = hcl_label), by = c(\"char1\"=\"characteristic\")) %>%\n    left_join(select(labels, characteristic, \"label2\" = hcl_label), by = c(\"char2\"=\"characteristic\")) %>%\n    filter(char1 != char2) %>% \n    mutate(hcl_pair = str_c(label1, \"_\", label2)) %>%\n    group_by(hcl_pair) %>%\n    summarise(\n      n = n(),\n      cor_avg = mean(cor)\n    ) %>%\n    ungroup() %>%\n    separate(hcl_pair, c(\"hcl1\", \"hcl2\"), sep = \"_\") %>%\n    select(hcl1, hcl2, cor_avg) %>%\n    spread(key = hcl2, cor_avg) \n  \n  pairwise_cor_names <- pairwise_cor$hcl1\n  pairwise_cor <- pairwise_cor %>% select(-hcl1) %>% as.matrix()\n  rownames(pairwise_cor) <- pairwise_cor_names\n  \n  # Needs to Be saved as a functional\n  if (op_format == \"tex\") {\n    corrplot_cex <- list(tl = 0.8, number = 0.5)\n  }\n  if (op_format == \"pdf\") {\n    corrplot_cex <- list(tl = 0.7, number = 0.45)\n  }\n  function() {\n    par(xpd=TRUE)\n    pairwise_cor %>%\n      corrplot::corrplot(method = \"color\", addCoef.col = \"black\", type = \"lower\", mar = c(0, 0, 3, 0), tl.cex = corrplot_cex$tl,\n                         number.cex = corrplot_cex$number, tl.col = \"black\", col = colorRampPalette(c(colours_theme[2], \"white\", colours_theme[1]))(200))\n  }\n}\nplot_mt_eb_comp <- function(mt, eb_all, eb_us = NULL, eb_developed = NULL, eb_emerging = NULL, eb_world = NULL, mts = c(\"OLS\", \"Bonferroni\", \"BY\"), regs = c(\"us\", \"developed\", \"emerging\", \"world\"), se_methods, se_regions) {\n  mt_sub <- mt %>%\n    
mutate(method = method %>% factor(levels = c(\"OLS\", \"Bonferroni\", \"Holm\", \"BH\", \"BY\", \"EB - Region\", \"EB - Full\"))) %>%\n    filter(method %in% mts & region %in% regs)\n  \n  (t_cutoff <- mt_sub %>% \n      group_by(method, region) %>%\n      summarise(\n        t_cut = (min(abs(statistic)[p < 0.05]) + max(abs(statistic)[p > 0.05])) / 2\n      ))\n  \n  eb_comb <- bind_rows(\n    eb_all$factors %>% mutate(method = \"EB - All\"),\n    eb_us$factors %>% mutate(method = \"EB - Region\"),\n    eb_developed$factors %>% mutate(method = \"EB - Region\"),\n    eb_emerging$factors %>% mutate(method = \"EB - Region\"),\n    eb_world$factors %>% mutate(method = \"EB - Region\")\n  ) %>% mutate(\n    method = method %>% factor(levels = c(\"OLS\", \"Bonferroni\", \"Holm\", \"BH\", \"BY\", \"EB - Region\", \"EB - All\"))\n  )\n  \n  mt_table <- mt_sub %>%\n    mutate(characteristic = char_reg %>% str_remove(\"__.+\")) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == T) %>%\n    group_by(region, method, mt_adj) %>%\n    summarise(\n      n = n(),\n      significant = mean(p < 0.05 & estimate > 0), # Estimates also need to be positive\n      max_t_insig = max(abs(statistic[p > 0.05]))\n    ) \n  \n  eb_table <- eb_comb %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == T) %>%\n    mutate(ols_t = ols_est / ols_se) %>%\n    group_by(region, method) %>%\n    summarise(\n      mt_adj = \"Bayesian\",\n      n = n(),\n      significant = mean(p025 > 0),\n      max_t_insig = max(abs(ols_t[p025 < 0 & p975 > 0]))\n    )\n  \n  (comp_table <- mt_table %>% \n      bind_rows(eb_table)) \n  \n  # To install drlib, run: devtools::install_github(\"dgrtwo/drlib\")\n  if (FALSE) {\n    repl_plot <- comp_table %>%\n      mutate(\n        region_pretty = case_when(\n          region == \"us\" ~ \"US\",\n      
    region == \"developed\" ~ \"Developed Ex. US\",\n          region == \"emerging\" ~ \"Emerging\",\n          region == \"world\" ~ \"World\"\n        ),\n        region_pretty = region_pretty %>% factor(levels = c(\"US\", \"Developed Ex. US\", \"Emerging\", \"World\"))\n      ) %>%\n      filter((region == \"world\" & method == \"EB - All\") | region != \"world\") %>%\n      filter(method != \"Bonferroni\") %>%\n      # filter(!(region == \"world\" & method != \"Empirical Bayes\")) %>%\n      ggplot(aes(drlib::reorder_within(method, significant, region_pretty), significant, fill = method)) +\n      geom_col() +\n      drlib::scale_x_reordered() +\n      geom_text(aes(label = str_c(formatC(round(significant * 100, 2), digits = 2, format = \"f\"), \"%\")), nudge_y = 0.025, size = 3) +\n      facet_wrap(~region_pretty, nrow = 1, scales = \"free_x\") +\n      labs(x = \"Method\", fill = \"Multiple Testing Adj.\", y = \"Replication Rate (%)\") +\n      theme(legend.position = \"none\")\n  }\n  \n  repl_plot <- comp_table %>%\n    # filter((region == \"world\" & method == \"EB - All\") | region != \"world\") %>%\n    filter(method != \"Bonferroni\") %>%\n    group_by(method) %>%\n    # mutate(sort_var = significant[region == \"us\"]) %>%\n    mutate(\n      region_pretty = case_when(\n        region == \"us\" ~ \"US\",\n        region == \"developed\" ~ \"Developed Ex. US\",\n        region == \"emerging\" ~ \"Emerging\",\n        region == \"world\" ~ \"World\"\n      ),\n      region_pretty = region_pretty %>% factor(levels = c(\"US\", \"Developed Ex. 
US\", \"Emerging\", \"World\")),\n      method_pretty = case_when(\n        method == \"BY\" ~ \"Benjamini-Yekutieli\",\n        method == \"EB - Region\" ~ \"Empirical Bayes - Region\",\n        method == \"EB - All\" ~ \"Empirical Bayes - All\",\n        method == \"OLS\" ~ \"OLS\"\n      ),\n      method_pretty = method_pretty %>% factor(levels = c(\"OLS\", \"Benjamini-Yekutieli\", \"Empirical Bayes - Region\", \"Empirical Bayes - All\"))\n    ) %>%\n    ggplot(aes(method_pretty, significant*100, fill = method_pretty)) + #reorder(method_pretty, sort_var)\n    geom_col() +\n    geom_text(aes(label = str_c(formatC(round(significant * 100, 1), digits = 1, format = \"f\"), \"%\")), nudge_y = 2.5, size = 3) +\n    facet_grid(. ~ region_pretty, scales = \"free\", space='free') +\n    # facet_wrap(~region_pretty, nrow = 1, scales = \"free_x\") +\n    labs(x = \"Method\", fill = \"Multiple Testing Adj.\", y = \"Replication Rate (%)\") +\n    theme(legend.position = \"none\", axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10), axis.title.x = element_blank())\n  \n  \n  eb_overview <- eb_comb %>% \n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    mutate(\n      t_cut = 1.96,\n      type = case_when(\n        p025 > 0 & orig_sig == 1 ~ \"Replicated\",\n        p025 <= 0 & orig_sig == 1 ~ \"Not Replicated\",\n        orig_sig == 0 ~ \"Never Significant\"\n      )\n    ) %>% \n    select(region, method, type, char_reg, estimate = post_mean, t_cut, se = post_sd)\n  \n  \n  \n  mt_plot <- mt_sub %>%\n    mutate(characteristic = char_reg %>% str_remove(\"__.+\")) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    left_join(t_cutoff, by = c(\"region\", \"method\")) %>%\n    mutate(\n      # significant = if_else(p < 0.05 & estimate > 0, \"Significant\", \"Insignificant\"),\n      type = case_when(\n        p <= 0.05 & 
estimate > 0 & orig_sig == 1 ~ \"Replicated\",\n        (p > 0.05 | estimate <= 0) & orig_sig == 1 ~ \"Not Replicated\",\n        orig_sig == 0 ~ \"Never Significant\"\n      )\n    ) %>%\n    bind_rows(eb_overview) %>%\n    filter(region %in% se_regions & method %in% se_methods) %>%\n    mutate(\n      characteristic = char_reg %>% str_remove(\"__.+\"), # Drop the \"__<region>\" suffix, consistent with mt_table above\n      type = type %>% factor(levels = c(\"Replicated\", \"Not Replicated\", \"Never Significant\")),\n      method_pretty = case_when(\n        method == \"BY\" ~ \"Multiple Testing - Benjamini-Yekutieli\",\n        method == \"EB - Region\" ~ \"Empirical Bayes - US\",\n        method == \"EB - All\" ~ \"Empirical Bayes - Global\",\n        method == \"OLS\" ~ \"OLS\"\n      ),\n      method_pretty = method_pretty %>% factor(levels = c(\"OLS\", \"Multiple Testing - Benjamini-Yekutieli\", \"Empirical Bayes - US\", \"Empirical Bayes - Global\"))\n    ) %>%\n    group_by(characteristic) %>%\n    # mutate(sort_var = statistic[method == \"OLS\" & region == \"us\"]) %>%\n    mutate(sort_var = estimate[method == \"OLS\" & region == \"us\"]) %>%\n    group_by(region, method) %>%\n    mutate(\n      ols_rank = frank(sort_var),\n      repl_rate = sum(type == \"Replicated\") / sum(type %in% c(\"Replicated\", \"Not Replicated\"))\n    ) %>%\n    ggplot(aes(reorder(ols_rank, sort_var), estimate, colour = type, linetype = type)) +\n    geom_point() +\n    geom_text(aes(x = 35, y = 1.45, label = str_c(\"Replication Rate: \", formatC(round(repl_rate*100, 1), digits = 1, format = \"f\"), \"%\")), \n              colour = \"black\", size = 3, check_overlap = T) +\n    geom_errorbar(aes(ymin = estimate - t_cut * se, ymax = estimate + t_cut * se)) +\n    facet_wrap(~method_pretty, ncol = length(se_methods) / 2) + \n    coord_cartesian(ylim = c(-1, 1.5)) +\n    geom_hline(yintercept = 0, linetype = \"dashed\") +\n    guides(colour = guide_legend(override.aes = list(shape = NA))) + \n    labs(y = \"Monthly Alpha (%)\") +\n    theme(\n      axis.title.x = element_blank(),\n      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),\n      # text = element_text(size = 13),\n      legend.title = element_blank(),\n      legend.position = \"top\"\n    )\n  \n  list(\"mt\" = mt_plot, \"repl\" = repl_plot)\n}\n\nplot_fdr <- function(simulated_fdr) {\n  simulated_fdr %>%\n    gather(fdr, fwr, key = \"type\", value = \"rate\") %>%\n    mutate(type = type %>% str_to_upper()) %>%\n    ggplot(aes(t_cutoff, rate, colour = type)) +\n    geom_point() +\n    geom_hline(yintercept = 0.05, linetype = \"dashed\") + \n    geom_vline(xintercept = 1.96, linetype = \"dotted\") +\n    scale_y_continuous(breaks = c(0, 0.05, 0.25, 0.5, 0.75, 1)) +\n    scale_x_continuous(breaks = c(0, 1.96, 2.5, 5.0, 7.5, 10)) +\n    labs(x = \"Critical Value (t)\", y = \"Rate\", colour = \"Type:\") +\n    theme(\n      legend.position = \"top\"\n    )\n}\n\nplot_factor_post <- function(eb, orig_sig, cluster_order) {\n  if (orig_sig) {\n    orig_sig_values <- T\n  } else {\n    orig_sig_values <- c(T, F)\n  }\n  \n  eb$factors %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n    filter(orig_sig %in% orig_sig_values) %>%\n    group_by(hcl_label) %>%\n    mutate(\n      sort_var = median(post_mean) + post_mean / 1000000\n    ) %>%\n    mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n    ggplot(aes(reorder(characteristic, sort_var), post_mean, colour = hcl_label, shape = hcl_label)) +\n    geom_point() +\n    scale_shape_manual(values=1:13) +\n    geom_errorbar(aes(ymin = post_mean - 1.96 * post_sd, ymax = post_mean + 1.96 * post_sd)) +\n    geom_hline(yintercept = 0, linetype = \"dashed\") +\n    labs(y = \"Monthly Alpha with 95% Confidence Interval (%)\", colour = \"Cluster\", shape = \"Cluster\") +\n    theme(\n      axis.title.x = element_blank(),\n      axis.text.x = element_text(angle = 90, vjust = 0.5, hjust 
= 1)\n    )\n}\n\nplot_repl_region <- function(eb_all, cluster_order) {\n  eb_all$factors %>% \n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == 1) %>%\n    mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n    group_by(region, hcl_label) %>% \n    summarise(n = n(), repl_rate = mean(post_mean - 1.96 * post_sd > 0)) %>% \n    group_by(hcl_label) %>%\n    mutate(\n      sort_var = repl_rate[region == \"us\"] + n[region == \"us\"] / 1e6,\n      region_pretty = case_when(\n        region == \"us\" ~ \"USA\",\n        region == \"developed\" ~ \"Developed Ex. USA\",\n        region == \"emerging\" ~ \"Emerging\"\n      ),\n      region_pretty = region_pretty %>% factor(levels = c(\"USA\", \"Developed Ex. USA\", \"Emerging\"))\n    ) %>%\n    ggplot(aes(reorder(hcl_label, sort_var), repl_rate*100, fill = hcl_label)) +\n    geom_col() +\n    labs(y = \"Replication Rate (%)\") +\n    facet_wrap(~region_pretty, ncol = 1) +\n    theme(\n      legend.position = \"none\",\n      axis.title.x = element_blank(),\n      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)\n    )\n}\n\n# Figure 1 - Waterfall Graph\nplot_lit_comp <- function(eb_us, mt_res, eb_world, excl_insig=T) {\n  if (excl_insig) {\n    sig_group <- T\n  } else {\n    sig_group <- c(T, F)\n  }\n  raw_reg <- eb_us$input$long %>%\n    group_by(characteristic) %>%\n    nest() %>%\n    mutate(\n      raw_reg = data %>% map(~ lm(ret ~ 1, data = .x)),\n      tidied = raw_reg %>% map(tidy)\n    ) %>%\n    unnest(tidied) %>% \n    ungroup() \n  \n  raw_overall <- raw_reg %>%\n    summarise(repl_rate = mean(p.value < 0.05 & estimate > 0)) %>%\n    pull(repl_rate)\n  \n  raw_sig <- raw_reg %>% \n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == T) %>%\n    summarise(repl_rate = mean(p.value < 0.05 & estimate > 0)) %>%\n    
pull(repl_rate)\n  \n  capm <- mt_res %>% \n    filter(region == \"us\" & method %in% c(\"BY\", \"OLS\")) %>% \n    mutate(characteristic = char_reg %>% str_remove(\"__.+\")) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig %in% sig_group) %>%\n    group_by(method) %>% \n    summarise(repl_rate = mean(p < 0.05 & estimate > 0))\n  \n  eb_us_repl <- eb_us$factors %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig %in% sig_group) %>%\n    summarise(repl_rate = mean(p025 > 0)) %>%\n    pull(repl_rate)\n  \n  eb_global_repl <- eb_world$factors %>% \n    ungroup() %>% \n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig %in% sig_group) %>%\n    summarise(repl_rate = mean(post_mean - 1.96 * post_sd > 0)) %>% \n    pull(repl_rate)\n  \n  # Waterfall Graph\n  litterature_comp <- tribble(\n    ~ type, ~repl_rate,\n    \"hxz\", 0.35,\n    \"raw\", raw_overall,\n    \"raw_sig\", raw_sig,\n    \"alpha\", capm %>% filter(method == \"OLS\") %>% pull(repl_rate),\n    \"mt\", capm %>% filter(method == \"BY\") %>% pull(repl_rate),\n    \"eb_us\", eb_us_repl,\n    \"eb_global\", eb_global_repl\n  ) \n  if (excl_insig == F) {\n    litterature_comp <- litterature_comp %>% filter(type != \"raw_sig\")\n  } \n  litterature_comp <- litterature_comp %>%\n    mutate(\n      repl_rate = repl_rate * 100,\n      type = type %>% factor(levels = c(\"hxz\", \"raw\", \"raw_sig\", \"alpha\", \n                                        \"mt\", \"eb_us\", \"eb_global\")),\n      prev_repl_rate = dplyr::lag(repl_rate, default = 0),\n      impact = if_else(repl_rate > prev_repl_rate, \"Increase\", \"Decrease\"),\n      impact = impact %>% factor(levels = c(\"Increase\", \"Decrease\"))\n    ) %>%\n    setDT() \n  \n  w <- 0.3  #use to set width of bars\n  
l1 <- -3\n  inc <- -3 \n  col_top <- \"black\" # colours_theme[2]\n  col_bot <- \"black\" # colours_theme[1]\n  type <- litterature_comp$type\n  plot <- litterature_comp %>%\n    ggplot(aes(xmin = as.integer(type) - w, xmax = as.integer(type) + w, ymin = prev_repl_rate, ymax = repl_rate, fill = impact)) +\n    geom_rect() +\n    geom_segment(data = litterature_comp[1:(.N - 1)], aes(x = as.integer(type) + w, xend = as.integer(type) + w + 1, y = repl_rate, yend = repl_rate)) +\n    scale_x_discrete(limits = type) +\n    geom_text(aes(x = as.integer(type), y = repl_rate + if_else(sign(repl_rate - prev_repl_rate)!=0, sign(repl_rate - prev_repl_rate), 1) * 2, label = str_c(formatC(round(repl_rate, 1), digits = 1, format = \"f\"), \"%\"))) + \n    scale_fill_manual(values = (c(\"Increase\" = colours_theme[1], \"Decrease\" = colours_theme[2]))) +\n    labs(x = \"Implementation\", y = \"Replication Rate (%)\") +\n    coord_cartesian(ylim = c(0, 90), expand = FALSE, clip = \"off\") +\n    # HXZ\n    annotate(geom = \"text\", x = \"hxz\", y = l1, label = \"Hou, Xue, and\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"hxz\", y = l1 + inc*1, label = \"Zhang (2020)\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"hxz\", y = l1 + inc*2, label = \"Raw returns\", colour = col_bot, fontface = 1) +\n    # Our Raw\n    annotate(geom = \"text\", x = \"raw\", y = l1, label = \"Our sample\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"raw\", y = l1 + inc*1, label = \"Raw returns,\", colour = col_bot) +\n    annotate(geom = \"text\", x = \"raw\", y = l1 + inc*2, label = \"our methodology\", colour = col_bot) +\n    # Our Alpha\n    annotate(geom = \"text\", x = \"alpha\", y = l1, label = \"Our sample\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"alpha\", y = l1 + inc*1, label = \"CAPM alphas\", colour = col_bot) +\n    # Our MT\n    annotate(geom = \"text\", x = \"mt\", y = l1, label 
= \"Harvey, Liu, and\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"mt\", y = l1 + inc*1, label = \"Zhu (2016)\", colour = col_bot, fontface = 2) +\n    annotate(geom = \"text\", x = \"mt\", y = l1 + inc*2, label = \"Multiple testing\", colour = col_bot) +\n    annotate(geom = \"text\", x = \"mt\", y = l1 + inc*3, label = \"adjustment\", colour = col_bot) +\n    # Our EB-US\n    annotate(geom = \"text\", x = \"eb_us\", y = l1, label = \"Our Bayesian\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"eb_us\", y = l1 + inc*1, label = \"estimation\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"eb_us\", y = l1 + inc*2, label = \"US data\", colour = col_bot) +\n    # Our EB-US\n    annotate(geom = \"text\", x = \"eb_global\", y = l1, label = \"Our Bayesian\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"eb_global\", y = l1 + inc*1, label = \"estimation\", colour = col_top, fontface = 2) +\n    annotate(geom = \"text\", x = \"eb_global\", y = l1 + inc*2, label = \"Global data\", colour = col_bot) +\n    theme(\n      legend.title = element_blank(),\n      plot.margin = unit(c(1, 1, 4, 1), \"lines\"),\n      axis.title.x = element_blank(),\n      axis.text.x = element_blank(),\n      panel.grid.major.x = element_blank(),\n      panel.grid.minor.x = element_blank()) \n  \n  if (excl_insig) {\n    # Our Raw Significant only\n    plot <- plot +\n      annotate(geom = \"text\", x = \"raw_sig\", y = l1, label = \"Our sample\", colour = col_top, fontface = 2) +\n      annotate(geom = \"text\", x = \"raw_sig\", y = l1 + inc*1, label = \"Excl. 
factors\", colour = col_bot) +\n      annotate(geom = \"text\", x = \"raw_sig\", y = l1 + inc*2, label = \"never significant\", colour = col_bot)\n  }\n  return(plot)\n}\n\nplot_many_factors <- function() {\n  # The Power of Many Factors\n  many_factors_se <- eb_est$us$input$long %>%\n    select(\"char_reg\" = name_wide, ret_neu_scaled, mkt_vw_exc) %>%\n    mutate(\n      region = char_reg %>% str_extract(pattern = \"(?<=_{2}).+\"),\n      region = case_when(\n        region == \"us\" ~ \"USA\",\n        region == \"developed\" ~ \"Developed Ex. USA\",\n        region == \"emerging\" ~ \"Emerging\"\n      ),\n      region = region %>% factor(levels = c(\"USA\", \"Developed Ex. USA\", \"Emerging\"))\n    ) %>%\n    filter(!is.na(ret_neu_scaled)) %>%\n    group_by(char_reg, region) %>%\n    nest() %>%\n    mutate(\n      fit = data %>% map(~lm(ret_neu_scaled ~ mkt_vw_exc, data = .x)),  # They are market neutral by construction, but this takes care of the degrees of freedom adjustment \n      nw = fit %>% map(~ lmtest::coeftest(.x, vcov = sandwich::NeweyWest(.x)) %>% broom::tidy()),\n      df = fit %>% map_dbl(~ .x$df.residual)\n    ) %>%\n    unnest(nw) %>%\n    filter(term == \"(Intercept)\") %>%\n    rename(\"p_ols\" = p.value) %>%\n    # group_by(region) %>%\n    ungroup() %>%\n    mutate(\n      n = n(),\n      p_bonf = p_ols %>% p.adjust(method = \"bonferroni\"),\n      p_holm = p_ols %>% p.adjust(method = \"holm\"),\n      p_bh = p_ols %>% p.adjust(method = \"BH\"),\n      p_by = p_ols %>% p.adjust(method = \"BY\")\n    ) %>%\n    select(n, region, char_reg, estimate, statistic, df, \"se\" = std.error, starts_with(\"p_\")) %>%\n    gather(starts_with(\"p_\"), key = \"method\", value = \"p\") %>%\n    mutate(\n      method = method %>% str_remove(\"^p_\"),\n      mt_adj = case_when(\n        method == \"ols\" ~ \"None\",\n        method == \"bh\" ~ \"FDR\",\n        method == \"by\" ~ \"FDR\",\n        method == \"bonf\" ~ \"FWR\",\n        method == \"holm\" ~ 
\"FWR\"\n      ),\n      method = case_when(\n        method == \"ols\" ~ \"OLS\",\n        method == \"bh\" ~ \"BH\",\n        method == \"by\" ~ \"Benjamini-Yekutieli\",\n        method == \"bonf\" ~ \"Bonferroni\",\n        method == \"holm\" ~ \"Holm\",\n        TRUE ~ method\n      ),\n      method = method %>% factor(levels = c(\"OLS\", \"Bonferroni\", \"Holm\", \"BH\", \"Benjamini-Yekutieli\", \"EB - Region\", \"EB - Full\"))\n    ) \n  \n  (mf_t <- many_factors_se %>% \n      group_by(method) %>%\n      summarise(\n        t_140 = (min(abs(statistic)[p < 0.05]) + max(abs(statistic)[p > 0.05])) / 2\n      ))\n  \n  avg_se <- mean(many_factors_se$se)\n  \n  ols_ci <- tibble(\n    method = c(\"OLS\", \"Bonferroni\", \"Benjamini-Yekutieli\", \"Empirical Bayes\"), \n    t_1 = rep(1.96, 4)\n  ) %>%\n    left_join(mf_t, by = \"method\") %>%\n    gather(t_1, t_140, key = \"n_factors\", value = \"t\") %>%\n    mutate(\n      n_factors = n_factors %>% str_remove(\"t_\") %>% as.integer(),\n      p025 = -t * avg_se,\n      p975 = t * avg_se\n    ) %>%\n    filter(!(method == \"Empirical Bayes\" & n_factors == 140)) \n  \n  eb_ci <- eb_est$us$factors %>%\n    summarise(\n      eb_se = mean(post_sd)\n    ) %>%\n    transmute(\n      method = \"Empirical Bayes\",\n      n_factors = 140,\n      p025 = -1.96 * eb_se,\n      p975 = 1.96 * eb_se\n    )  \n  \n  comb_data <- bind_rows(ols_ci, eb_ci)\n  \n  comb_data %>%\n    ggplot(aes(n_factors, colour = method, linetype = method)) +\n    geom_line(aes(y = p025)) +\n    geom_line(aes(y = p975)) +\n    geom_ribbon(data=comb_data %>% filter(method == \"Empirical Bayes\"), \n                aes(x = n_factors, ymin=p025,ymax=p975), fill=colours_theme[3], alpha=0.2, inherit.aes = F) +\n    labs(x = \"Number of Factors\", y = \"Centered 95% Confidence Interval\", colour = \"Method\", linetype = \"Method\") +\n    scale_y_continuous(breaks=c(avg_se * 1.96, 0, -avg_se * 1.96),\n                       labels=c(expression(hat(alpha) + 
sigma[hat(alpha)] %*% t), expression(hat(alpha)), expression(hat(alpha) - sigma[hat(alpha)] %*% t))) +\n    scale_x_continuous(breaks = c(1, 140), expand=expansion(mult = c(0, 0.3), \n                                                            add = c(5, 0))) +\n    geom_dl(aes(label = method, y = p975), method = list(dl.trans(x = x + 0.2), \"last.points\", cex = 1)) +\n    theme(\n      text = element_text(size = 12),\n      axis.title.y = element_blank(),\n      legend.position = \"none\"\n    )\n}\n\n\n# Plot In-Sample vs. Out of Sample\nplot_is_oos <- function(ub_us, ub_dev, ub_emer) {\n  is_oos_split <- ub_us$ols_regs %>% \n    mutate(region = \"us\") %>%\n    bind_rows(\n      ub_dev$ols_regs %>% mutate(region = \"dev\"),\n      ub_emer$ols_regs %>% mutate(region = \"emer\")\n    ) %>%\n    mutate(\n      region = case_when(\n        region == \"us\" ~ \"USA\",\n        region == \"dev\" ~ \"Developed\",\n        region == \"emer\" ~ \"Emerging\"\n      ),\n      region = region %>% factor(levels = c(\"USA\", \"Developed\", \"Emerging\"))\n    ) %>%\n    rename(\"characteristic\" = c)\n  \n  cluster_plot <- is_oos_split %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == 1) %>%\n    left_join(cluster_labels, by = \"characteristic\") %>%\n    gather(is, post_sample, key = \"period\", value = \"estimate\") %>%\n    group_by(region, hcl_label, period) %>%\n    summarise(\n      alpha_mean = mean(estimate)\n    ) %>% \n    select(region, hcl_label, period, alpha_mean) %>%\n    spread(key = period, value = alpha_mean) %>%\n    mutate(nudge_y = -0.015) %>%\n    ggplot(aes(is, post_sample)) +\n    geom_point() +\n    ggrepel::geom_text_repel(aes(label = hcl_label), nudge_y = -0.015) +\n    geom_abline(intercept = 0, slope = 1, linetype = \"dotted\") +\n    # geom_smooth(method = \"lm\", se = F) +\n    facet_wrap(~region) +\n    labs(x = \"Monthly Alpha (%): In-Sample\", y = 
\"Monthly Alpha (%): Post Sample\")\n  \n  is_oos_split %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == 1) %>%\n    ggplot(aes(is, post_sample)) +\n    geom_point() +\n    geom_abline(intercept = 0, slope = 1, linetype = \"dotted\") +\n    geom_smooth(method = \"lm\", se = F, formula = \"y ~ x\") +\n    facet_wrap(~region, ncol = 1) +\n    labs(x = \"Monthly Alpha (%): In-Sample\", y = \"Monthly Alpha (%): Post Sample\")\n}\n\n\nplot_is_oos_factors <- function(is_oos_regions) {\n  is_oos_data <- is_oos_regions %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == 1) %>%\n    mutate(\n      region = case_when(\n        region == \"us\" ~ \"USA\",\n        region == \"developed\" ~ \"Developed Ex. USA\",\n        region == \"emerging\" ~ \"Emerging\"\n      ),\n      region = region %>% factor(levels = c(\"USA\", \"Developed Ex. 
USA\", \"Emerging\"))\n    ) %>%\n    select(region, characteristic, period, estimate) %>%\n    spread(key = period, value = estimate) \n  \n  is_oos_data %>%\n    group_by(region) %>%\n    nest() %>%\n    mutate(\n      fit = data %>% map(~lm(oos ~ is, data = .x)),\n      n = fit %>% map_dbl(~length(.x$residuals)),\n      tidied = fit %>% map(tidy)\n    ) %>%\n    unnest(tidied)\n  \n  is_oos_data %>% \n    ggplot(aes(is, oos)) + \n    geom_point() + \n    geom_smooth(method = \"lm\", se = F, formula = \"y ~ x\") +\n    facet_wrap(~region) +\n    geom_hline(yintercept = 0) +\n    geom_vline(xintercept = 0) +\n    coord_fixed() +\n    labs(x = \"Monthly Alpha (%): In-Sample\", y = \"Monthly Alpha (%): Out-of-Sample\")\n}\n\nplot_tpf <- function(tpf, cluster_order, ci_low = 0.05, ci_high = 0.95) {\n  orig <- tpf %>%\n    filter(id == \"Apparent\") %>%\n    select(coef) %>%\n    unnest(coef) %>%\n    rename(\"tpf_weight\" = weight)\n  \n  bs <- tpf %>%\n    filter(id != \"Apparent\") %>%\n    unnest(coef) %>%\n    group_by(term) %>%\n    summarise(\n      bs_mean = mean(weight),\n      bs_sd = sd(weight),\n      bs_se = bs_sd / sqrt(n()),\n      bs_low = weight %>% quantile(ci_low),\n      bs_high = weight %>% quantile(ci_high),\n      bs_prob_zero = mean(weight == 0)\n    ) %>%\n    left_join(orig, by = \"term\") %>%\n    mutate(bs_bias = bs_mean - tpf_weight)\n  print(paste0(\"Clusters with significantly positive TPF weight: \", sum(filter(bs, term != \"Market\")$bs_low>0)))\n  bs %>% \n    mutate(\n      term = term %>% factor(levels = c(cluster_order, \"Market\"))\n    ) %>%\n    ggplot(aes(reorder(term, tpf_weight), tpf_weight*100, fill = term)) +\n    geom_col() +\n    coord_flip() +\n    geom_errorbar(mapping = aes(ymin = bs_low*100, ymax = bs_high*100), width = 0.2, size = 0.2) +\n    labs(y = \"Weight in Tangency PF (%)\") +\n    theme(\n      axis.title.y = element_blank(),\n      legend.position = \"none\"\n    )\n}\n\nplot_tpf_region <- function(tpf_us, 
tpf_dev, tpf_emer, cluster_order, ci_low = 0.05, ci_high = 0.95) {\n  all <- bind_rows(tpf_us, tpf_dev, tpf_emer)\n  \n  orig <- all %>%\n    filter(id == \"Apparent\") %>%\n    select(market_region, coef) %>%\n    unnest(coef) %>%\n    rename(\"tpf_weight\" = weight)\n  \n  bs <- all %>%\n    filter(id != \"Apparent\") %>%\n    unnest(coef) %>%\n    group_by(market_region, term) %>%\n    summarise(\n      bs_mean = mean(weight),\n      bs_sd = sd(weight),\n      bs_se = bs_sd / sqrt(n()),\n      bs_low = weight %>% quantile(ci_low),\n      bs_high = weight %>% quantile(ci_high),\n      bs_prob_zero = mean(weight == 0)\n    ) %>%\n    left_join(orig, by = c(\"market_region\", \"term\")) %>%\n    mutate(bs_bias = bs_mean - tpf_weight)\n  \n  \n  bs %>% \n    group_by(term) %>% \n    mutate(\n      sort_var = tpf_weight[market_region == \"us\"],\n      region_pretty = case_when(\n        market_region == \"us\" ~ \"USA\",\n        market_region == \"developed\" ~ \"Developed Ex. USA\",\n        market_region == \"emerging\" ~ \"Emerging\"\n      ),\n      region_pretty = region_pretty %>% factor(levels = c(\"USA\", \"Developed Ex. 
USA\", \"Emerging\"))\n    ) %>%\n    mutate(term = term %>% factor(levels = c(cluster_order, \"Market\"))) %>%\n    ggplot(aes(reorder(term, sort_var), tpf_weight*100, fill = term)) +\n    geom_col() +\n    geom_errorbar(mapping = aes(ymin = bs_low*100, ymax = bs_high*100), width = 0.2, size = 0.2) +\n    labs(y = \"Weight in Tangency PF (%)\") +\n    theme(\n      axis.title.x = element_blank(),\n      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),\n      legend.position = \"none\"\n    ) +\n    facet_wrap(~region_pretty, ncol = 1)\n}\n\nplot_tpf_size <- function(tpf_size_samples, cluster_order, ci_low = 0.05, ci_high = 0.95) {\n  orig <- tpf_size_samples %>%\n    filter(id == \"Apparent\") %>%\n    select(size_grp, coef) %>%\n    unnest(coef) %>%\n    rename(\"tpf_weight\" = weight)\n  \n  bs <- tpf_size_samples %>%\n    filter(id != \"Apparent\") %>%\n    unnest(coef) %>%\n    group_by(size_grp, term) %>%\n    summarise(\n      bs_mean = mean(weight),\n      bs_sd = sd(weight),\n      bs_se = bs_sd / sqrt(n()),\n      bs_low = weight %>% quantile(ci_low),\n      bs_high = weight %>% quantile(ci_high),\n      bs_prob_zero = mean(weight == 0)\n    ) %>%\n    left_join(orig, by = c(\"size_grp\", \"term\")) %>%\n    mutate(bs_bias = bs_mean - tpf_weight)\n  \n  \n  bs %>% \n    group_by(term) %>% \n    mutate(\n      sort_var = tpf_weight[size_grp == \"mega\"],\n      size_grp_pretty = size_grp %>% str_to_title(),\n      size_grp_pretty = size_grp_pretty %>% factor(levels = c(\"Mega\", \"Large\", \"Small\", \"Micro\", \"Nano\"))\n    ) %>%\n    mutate(term = term %>% factor(levels = c(cluster_order, \"Market\"))) %>%\n    ggplot(aes(reorder(term, sort_var), tpf_weight*100, fill = term)) +\n    geom_col() +\n    geom_errorbar(mapping = aes(ymin = bs_low*100, ymax = bs_high*100), width = 0.2, size = 0.2) +\n    labs(y = \"Weight in Tangency PF (%)\") +\n    theme(\n      axis.title.x = element_blank(),\n      axis.text.x = element_text(angle = 45, 
vjust = 1, hjust = 1),\n      legend.position = \"none\"\n    ) +\n    facet_wrap(~size_grp_pretty, ncol = 1)\n}\n\nplot_over_time <- function(posterior_over_time, orig_sig, ols_incl, lb, bw) {\n  if (orig_sig) {\n    orig_sig_values <- T\n  } else {\n    orig_sig_values <- c(T, F)\n  }\n  \n  all_factors <- tibble(\"char_reg\" = rownames(posterior_over_time[[1]]$factor_mean)) %>%\n    mutate(characteristic = char_reg %>% str_remove(paste0(\"__\", ot_region))) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n    mutate(\n      selected_factors = (orig_sig %in% orig_sig_values) \n    )\n  i <- all_factors$selected_factors\n  \n  full_posterior <- posterior_over_time %>% lapply(function(eb_act) {\n    a <- eb_act$factor_mean[i]\n    a_cov <- eb_act$factor_cov[i, i]\n    n <- length(a)\n    w <- rep(1/n, n)\n    post_mean <- drop(t(a) %*% w)\n    post_sd <- drop(sqrt(t(w) %*% a_cov %*% w))\n    avg_ols <- mean(eb_act$factors$ols_est[i])\n    tibble(\"end_date\"= eb_act$end_date, n=n, post_mean, post_sd, avg_ols) \n  }) %>% bind_rows()\n  \n  # Black and white coloring\n  if (bw) {\n    col1 <- \"black\"\n    col2 <- \"grey35\"\n  } else {\n    col1 <- colours_theme[1]\n    col2 <- colours_theme[2]\n  }\n  \n  if (ols_incl) {\n    # Create OLS benchmarks\n    ols_bm <- seq.Date(from = as.Date(\"1959-12-31\"), to = settings$end_date, by = \"1 year\") %>% lapply(function(end_date) {\n      data <- regional_pfs[region == ot_region] %>% \n        filter(characteristic %in% all_factors$characteristic[i]) %>%\n        filter(eom >= settings$start_date & eom <= end_date) %>%\n        eb_prepare(\n          scale_alpha = settings$eb$scale_alpha, \n          overlapping = settings$eb$overlapping \n        )\n      \n      avg_alpha_full <- data$long %>%\n        group_by(characteristic) %>%\n        summarise(\n          n = n(),\n          alpha = mean(ret_neu_scaled)\n        ) %>%\n        ungroup() %>%\n     
   summarise(\n          end_date = end_date,\n          type = \"avg_alpha_full\",\n          alpha = mean(alpha)\n        )\n      \n      avg_alpha_st <- data$long %>%\n        filter(year(eom) > (year(end_date)-lb)) %>%\n        group_by(characteristic) %>%\n        mutate(\n          ret_neu_st = (ret - cov(ret, mkt_vw_exc)/var(mkt_vw_exc) * mkt_vw_exc)*100,\n          ret_neu_st = ret_neu_st / sd(ret_neu_st) * (10 / sqrt(12))\n        ) %>%\n        summarise(alpha = mean(ret_neu_st)) %>%\n        ungroup() %>%\n        summarise(\n          end_date = end_date,\n          type = \"avg_alpha_st\",\n          alpha = mean(alpha)\n        )\n      \n      alpha_avg <- data$long %>%\n        group_by(eom) %>%\n        summarise(\n          n = n(),\n          ret = mean(ret_neu_scaled),\n          mkt_vw_exc = unique(mkt_vw_exc)\n        ) %>%\n        ungroup() %>%\n        mutate(\n          ret_neu = ret - cov(ret, mkt_vw_exc) / var(mkt_vw_exc) * mkt_vw_exc\n        ) %>%\n        summarise(\n          end_date = end_date,\n          type = \"alpha_avg_full\",\n          alpha = mean(ret_neu)\n        )\n      \n      bind_rows(avg_alpha_full, avg_alpha_st, alpha_avg)\n    }) %>% bind_rows()\n    \n    ols_bm_wide <- ols_bm %>% spread(key = type, value = alpha)\n    \n    (plot_1 <- full_posterior %>%\n        left_join(ols_bm_wide, by = \"end_date\") %>%\n        ggplot(aes(end_date)) +\n        geom_point(aes(y = post_mean, colour=\"Average Posterior Alpha\", shape = \"Average Posterior Alpha\")) +\n        geom_point(aes(y = avg_alpha_full, colour=\"Average OLS Alpha\", shape = \"Average OLS Alpha\")) +\n        geom_errorbar(aes(ymin = post_mean + 1.96 * post_sd, ymax = post_mean - 1.96 * post_sd)) +\n        scale_colour_manual(name = \"Test\", values = c(\"Average Posterior Alpha\"=col1, \"Average OLS Alpha\"=col2)) +\n        scale_shape_manual(name = \"Test\", values = c(\"Average Posterior Alpha\" = 16, \"Average OLS Alpha\" = 17)) +\n        labs(y 
= \"Posterior Alpha with 95% CI (%)\") +\n        ylim(c(0, NA)) +\n        scale_x_date(breaks = seq.Date(as.Date(\"1960-12-31\"), as.Date(\"2020-12-31\"), by = \"10 years\"), date_labels = \"%Y-%m\") +\n        theme(\n          legend.title = element_blank(),\n          legend.position = \"top\",\n          axis.text.x = element_blank(),\n          axis.title.x = element_blank()\n        ))\n    \n    plot_2 <- full_posterior %>%\n      left_join(ols_bm_wide, by = \"end_date\") %>%\n      ggplot(aes(end_date, avg_alpha_st)) +\n      geom_col() +\n      labs(y = \"5-year Rolling Alpha (%)\") +\n      scale_x_date(breaks = seq.Date(as.Date(\"1960-12-31\"), as.Date(\"2020-12-31\"), by = \"10 years\"), date_labels = \"%Y-%m\") +\n      theme(\n        axis.title.x = element_blank()\n      )\n    \n    # print(full_posterior %>% left_join(ols_bm_wide, by = \"end_date\") %>% mutate(diff_pm = post_mean - lag(post_mean), diff_aaf = avg_alpha_full - lag(avg_alpha_full)) %>% filter(end_date != as.Date(\"1960-12-31\")) %>% summarise(sd_pm = sd(diff_pm), sd_aaf = sd(diff_aaf)))\n    \n    plot <- cowplot::plot_grid(plot_1, plot_2, ncol = 1, rel_heights = c(2, 1))\n  } else {\n    plot <- full_posterior %>%\n      ggplot(aes(end_date)) +\n      geom_point(aes(y = post_mean), colour=col1, shape = 16) +\n      geom_errorbar(aes(ymin = post_mean + 1.96 * post_sd, ymax = post_mean - 1.96 * post_sd)) +\n      labs(y = \"Posterior Alpha with 95% CI (%)\") +\n      ylim(c(0, NA)) +\n      scale_x_date(breaks = seq.Date(as.Date(\"1960-12-31\"), as.Date(\"2020-12-31\"), by = \"10 years\"), date_labels = \"%Y-%m\") +\n      theme(\n        legend.title = element_blank(),\n        legend.position = \"top\",\n        axis.title.x = element_blank()\n      )\n  }\n  print(full_posterior %>% mutate(ci_width = post_sd*1.96*2) %>% filter(end_date %in% c(as.Date(\"1960-12-31\"), settings$end_date)))\n  plot\n}\n\n\n\nplot_taus_over_time <- function(posterior_over_time_flex) {\n  data <- 
posterior_over_time_flex %>% lapply(function(x) {\n    x$mle %>% mutate(end_date = x$end_date)\n  }) %>% \n    bind_rows() %>%\n    filter(estimate != \"alpha\") %>%\n    mutate(estimate_pretty = if_else(estimate == \"tau_s\", \"tau_w\", estimate)) \n  \n  ymax <- max(data$ml_est)\n  \n  data %>%\n    ggplot(aes(end_date, ml_est, colour = estimate_pretty, linetype = estimate_pretty)) +\n    geom_line() +\n    scale_linetype_manual(values = c('tau_c' = \"solid\", 'tau_w' = \"longdash\"), name = '', labels = c(expression(tau[c]), expression(tau[w]))) +\n    scale_colour_manual(values = c('tau_c' = colours_theme[1], 'tau_w' = colours_theme[2]), name = '', labels = c(expression(tau[c]), expression(tau[w]))) +\n    scale_x_date(breaks = seq.Date(as.Date(\"1960-12-31\"), as.Date(\"2020-12-31\"), by = \"10 years\"), date_labels = \"%Y-%m\") +\n    labs(y = \"Maximum Likelihood Estimate (%)\") +\n    ylim(c(0, ymax)) +\n    theme(\n      legend.title = element_blank(),\n      legend.position = \"top\",\n      axis.title.x = element_blank()\n    )\n}\n\nplot_sim_fdr <- function(simulation) {\n  tau_w_names <- c(\n    `0.01` = expression(tau[w] ~ \"= 0.01%\"),\n    `0.2` = expression(tau[w] ~ \"= 0.20%\")\n  )\n  stat_labels <- c(\n    `False Discovery Rate` = expression(~\"False Discovery Rate\"),\n    `True Discovery Rate` = expression(~\"True Discovery Rate\"),\n    `True Discoveries` = expression(~\"True Discoveries\"),\n    `False Discoveries` = expression(~\"False Discoveries\")\n  )\n  \n  plot_data <- simulation %>%\n    gather(n_disc, fdr, true_disc, false_disc, true_disc_rate, key = \"stat\", value = \"number\") %>%\n    filter(stat %in% c(\"fdr\", \"true_disc\", \"false_disc\", \"true_disc_rate\")) %>%\n    mutate(\n      stat = case_when(\n        stat == \"fdr\" ~ \"False Discovery Rate\",\n        stat == \"true_disc\" ~ \"True Discoveries\",\n        stat == \"false_disc\" ~ \"False Discoveries\",\n        stat == \"true_disc_rate\" ~ \"True Discovery Rate\"\n 
     ),\n      stat = stat %>% factor(levels = c(\"False Discovery Rate\", \"True Discovery Rate\", \"True Discoveries\", \"False Discoveries\")),\n      type = case_when(\n        type == \"by\" ~ \"Benjamini and Yekutieli\",\n        type == \"ols\" ~ \"OLS\",\n        type == \"eb\" ~ \"Empirical Bayes\"\n      ),\n      type = type %>% factor(levels = c(\"OLS\", \"Benjamini and Yekutieli\", \"Empirical Bayes\")),\n      # tau_w_title = formatC(tau_w, digits = 2, format = \"f\"),\n      # tau_w_title = as.character(eval(bquote(tau[w] ~ \"=\" ~ .(tau_w_title)~ \"%\")))\n      tau_w_title = tau_w %>% factor(labels = tau_w_names),\n      stat_title = stat %>% factor(label = stat_labels)\n    ) \n  \n  fdr_plot <- plot_data %>% \n    filter(stat == \"False Discovery Rate\") %>%\n    ggplot(aes(tau_c, number, colour = type)) +\n    geom_point() +\n    geom_line() +\n    labs(x = \"tau_c (%)\", y = \"False Discovery Rate\", colour = \"Adjustment\") +\n    facet_wrap(stat~tau_w_title, labeller = label_bquote(tau[w] ~ \"=\" ~ .(tau_w_title)~ \"%\"))\n  \n  true_disc_rate <- plot_data %>% \n    filter(stat == \"True Discovery Rate\") %>%\n    ggplot(aes(tau_c, number, colour = type)) +\n    geom_point() +\n    geom_line() +\n    labs(x = \"tau_c (%)\", y = \"True Discovery Rate\", colour = \"Adjustment\") +\n    facet_wrap(stat~tau_w_title, ncol = 2) \n  \n  plot_data %>%\n    filter(stat %in% c(\"False Discovery Rate\", \"True Discovery Rate\")) %>%\n    group_by(stat) %>%\n    mutate(scale_max = max(number)) %>%\n    mutate(scale_min = min(number)) %>%\n    ggplot(aes(tau_c, number, colour = type, shape = type)) +\n    geom_point() +\n    geom_point(aes(y = scale_max), alpha = 0) +\n    geom_point(aes(y = scale_min), alpha = 0) +\n    geom_line() +\n    labs(x = bquote(bold(tau[c])~\"(%)\"), colour = \"Type:\", linetype = \"Type:\", shape = \"Type:\") +\n    facet_wrap(stat_title~tau_w_title, scales = \"free_y\", labeller = label_parsed) +\n    # 
facet_wrap(stat_title~tau_w_title) +\n    # facet_wrap(stat~tau_w_title, scales = \"free_y\", labeller = label_bquote(tau[w] ~ \"=\" ~ .(tau_w_title)~ \"%\")) +\n    theme(\n      axis.title.y = element_blank(),\n      axis.title.x = element_text(size = 12),\n      legend.position = \"top\"\n    )\n}\n\n\nplot_size_overall <- function(eb_size, flipped = F, text = F) {\n  size_repl <- eb_size %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == 1) %>%\n    group_by(size_grp) %>%\n    summarise(\n      repl_rate = mean(p025>0)\n    )\n  if (flipped) {\n    size_plot <- size_repl %>%\n      mutate(size_grp = size_grp %>% factor(levels = c(\"Nano\", \"Micro\", \"Small\", \"Large\", \"Mega\"))) %>%\n      ggplot(aes(x = size_grp, y = repl_rate*100)) +\n      geom_col(fill = colours_theme[1]) +\n      coord_flip() +\n      labs(y = \"Replication Rate (%)\") +\n      theme(axis.title.y = element_blank())\n    if (text) {\n      size_plot <- size_plot + geom_text(aes(label = str_c(formatC(round(repl_rate * 100, 1), digits = 1, format = \"f\"), \"%\")), nudge_y = 7, size = 5.5)\n    }\n  } else {\n    size_plot <- size_repl %>%\n      ggplot(aes(x = size_grp, y = repl_rate*100)) +\n      geom_col(fill = colours_theme[1]) +\n      labs(y = \"Replication Rate (%)\") +\n      theme(axis.title.x = element_blank())\n    if (text) {\n      size_plot <- size_plot + geom_text(aes(label = str_c(formatC(round(repl_rate * 100, 1), digits = 1, format = \"f\"), \"%\")), nudge_y = 2.5, size = 4)\n    }\n  }\n  size_plot\n}\n\nplot_size_clusters <- function(eb_size, cluster_order) {\n  overall <- eb_size %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == 1) %>%\n    group_by(size_grp) %>%\n    summarise(\n      overall_rr = mean(p025>0)\n    )\n  \n  cluster_rr <- eb_size %>%\n    left_join(char_info %>% 
select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    filter(orig_sig == 1) %>%\n    mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n    group_by(size_grp, hcl_label) %>%\n    summarise(\n      repl_rate = mean(p025>0)\n    ) %>%\n    group_by(hcl_label) %>%\n    mutate(sort_var = repl_rate[size_grp == \"Mega\"]) %>%\n    left_join(overall, by = \"size_grp\") %>%\n    mutate(size_title = str_c(size_grp, \" - Replication Rate: \", formatC(round(overall_rr * 100, 1), digits = 1, format = \"f\"), \"%\"))\n  \n  titles <- cluster_rr$size_title %>% unique()\n  title_order <- c(titles[str_detect(titles, \"Mega\")], titles[str_detect(titles, \"Large\")], titles[str_detect(titles, \"Small\")], \n                   titles[str_detect(titles, \"Micro\")], titles[str_detect(titles, \"Nano\")])\n  \n  cluster_rr %>%\n    mutate(size_title = size_title %>% factor(levels = title_order)) %>%\n    ggplot(aes(x = reorder(hcl_label, sort_var), y = repl_rate*100, fill = hcl_label)) +\n    geom_col() +\n    labs(y = \"Replication Rate (%)\") +\n    facet_wrap(~size_title, ncol = 1) +\n    theme(\n      axis.title.x = element_blank(),\n      legend.position = \"none\",\n      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 10)\n    )\n}\n\nplot_sign_test <- function(sign_test) {\n  sig <- sign_test %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>% \n    mutate(\n      type = case_when(\n        p <= 0.05 & orig_sig == 1 ~ \"Replicated\",\n        p > 0.05 & orig_sig == 1 ~ \"Not Replicated\",\n        orig_sig == 0 ~ \"Never Significant\"\n      ),\n      type = type %>% factor(levels = c(\"Replicated\", \"Not Replicated\", \"Never Significant\"))      \n      # p_value = if_else(p < 0.05, \"Significant\", \"Insignificant\"),\n      # p_value = p_value %>% factor(levels = c(\"Significant\", \"Insignificant\"))\n    ) \n  \n  sig_overall <- sig %>% 
\n    summarise(repl_rate = sum(type == \"Replicated\") / sum(type %in% c(\"Replicated\", \"Not Replicated\"))) %>% \n    pull(repl_rate)\n  \n  plot_sign_factors <- sig %>%\n    ggplot(aes(reorder(characteristic, pos_act), pos_act*100, fill = type)) +\n    geom_col() +\n    ylim(c(0, 100)) +\n    labs(y = \"Countries with Positive Return (%)\", fill = \"Bootstrapped p-Value:\") +\n    geom_text(aes(x = 18, y = 100, label = str_c(\"Replication Rate: \", round(sig_overall*100, 2), \"%\")), inherit.aes = F) +\n    # geom_text(aes(label = round(p, 2)), nudge_y = 1, size = 1.5) +\n    theme(\n      axis.title.x = element_blank(),\n      # legend.title = element_blank(),\n      legend.position = \"top\",\n      axis.text.x = element_text(size = 7, angle = 90, vjust = 0, hjust = 1),\n      text = element_text(size = 10)\n    )\n  \n  plot_pos <- sig %>%\n    left_join(cluster_labels, by = \"characteristic\") %>% \n    mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n    group_by(hcl_label) %>%\n    filter(type %in% c(\"Replicated\", \"Not Replicated\")) %>%\n    summarise(\n      repl_rate = sum(type == \"Replicated\") / sum(type %in% c(\"Replicated\", \"Not Replicated\")),\n      pos = mean(pos_act)\n    ) %>% \n    ggplot(aes(reorder(hcl_label, pos), pos*100, fill = hcl_label)) +\n    geom_col() +\n    labs(y = \"Countries with Positive Alpha (%)\") +\n    ylim((c(0, 100))) +\n    theme(\n      legend.position = \"none\",\n      axis.title.y = element_text(size=8),\n      axis.title.x = element_blank(),\n      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)\n    )\n  \n  plot_sig <- sig %>%\n    left_join(cluster_labels, by = \"characteristic\") %>% \n    mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n    group_by(hcl_label) %>% \n    summarise(repl_rate = sum(type == \"Replicated\") / sum(type %in% c(\"Replicated\", \"Not Replicated\"))) %>% \n    ggplot(aes(reorder(hcl_label, repl_rate), repl_rate*100, fill = 
hcl_label)) +\n    geom_col() +\n    labs(y = \"Sign Test Replication Rate (%)\") +\n    theme(\n      legend.position = \"none\",\n      axis.title.y = element_text(size=8),\n      axis.title.x = element_blank(),\n      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)\n    )\n  plot_sign_clusters <- cowplot::plot_grid(plot_pos, plot_sig, ncol = 1, labels = c(\"A\", \"B\"), label_y = 1, label_x = 0)\n  list(\"factors\" = plot_sign_factors, \"clusters\" = plot_sign_clusters)\n}\n\n# World ex. US versus US\nplot_int_cor <- function(eb_us, eb_world_ex_us) {\n  cor_data <- eb_us$input$long %>%\n    bind_rows(eb_world_ex_us$input$long) %>%\n    select(characteristic, region, eom, ret) %>%\n    spread(key = region, value = ret) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n    filter(!is.na(us) & !is.na(world_ex_us)) %>%\n    group_by(characteristic) %>%\n    summarise(\n      months = n(),\n      cor = cor(us, world_ex_us)\n    ) \n  \n  print(cor_data %>%\n          pull(cor) %>%\n          quantile())\n  \n  cor_data %>%\n    left_join(cluster_labels, by = \"characteristic\") %>%\n    group_by(hcl_label) %>%\n    summarise(\n      cor_avg = mean(cor)\n    ) %>%\n    ggplot(aes(reorder(hcl_label, cor_avg), cor_avg)) +\n    geom_col(fill = colours_theme[1]) +\n    labs(y = \"Correlation of US and World ex. US factor (Avg. 
within Cluster)\") +\n    coord_flip() +\n    theme(\n      axis.title.y = element_blank()\n    )\n}\n\nplot_world_vs_us <- function(eb_us, eb_world_ex_us) {\n  cor_data <- eb_us$input$long %>%\n    bind_rows(eb_world_ex_us$input$long) %>%\n    select(characteristic, region, eom, ret) %>%\n    spread(key = region, value = ret) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n    filter(!is.na(us) & !is.na(world_ex_us)) %>%\n    group_by(characteristic) %>%\n    summarise(\n      months = n(),\n      cor = cor(us, world_ex_us)\n    ) \n  \n  print(cor_data %>%\n          pull(cor) %>%\n          quantile())\n  \n  \n  \n  region_data <- eb_us$factors %>%\n    bind_rows(eb_world_ex_us$factors) %>%\n    select(characteristic, region, ols_est) %>%\n    spread(key = region, value = ols_est) %>%\n    left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") \n  \n  max_scale <- max(c(region_data$world_ex_us, region_data$us))\n  min_scale <- min(c(region_data$world_ex_us, region_data$us))\n  \n  fit_all <- lm(world_ex_us ~ us, data = region_data)\n  int <- fit_all$coefficients[1]\n  slp <- fit_all$coefficients[2]\n  r2 <- summary(fit_all)$r.squared\n  \n  eq <- substitute(italic(y) == a + b %.% italic(x)*\",\"~~italic(R)^2~\"=\"~r2, \n                   list(a = format(unname(int), digits = 2),\n                        b = format(unname(slp), digits = 2),\n                        r2 = format(r2, digits = 3)))\n  lbl <- as.character(as.expression(eq))\n  \n  t_int <- formatC(round(summary(fit_all)$coefficients[\"(Intercept)\", \"t value\"], 2), format='f', digits=2)\n  t_us <- formatC(round(summary(fit_all)$coefficients[\"us\", \"t value\"], 2), format='f', digits=2)\n  \n  region_data %>%\n    mutate(\n      orig_sig_pretty = if_else(orig_sig == 1, \"Studied\", \"Not Studied\"),\n      orig_sig_pretty = orig_sig_pretty %>% factor(levels = c(\"Studied\", \"Not 
Studied\"))\n    ) %>%\n    ggplot(aes(us, world_ex_us, colour = orig_sig_pretty, shape = orig_sig_pretty)) +\n    geom_point() +\n    geom_abline(intercept = 0, slope = 1, linetype = \"dotted\") +\n    # geom_abline(intercept = int, slope = slp) +\n    xlim(c(min_scale, max_scale)) +\n    ylim(c(min_scale, max_scale)) +\n    annotate(\"text\", label=lbl, parse=TRUE, x=min_scale, y=max_scale, hjust = 0) +\n    annotate(\"text\", label=paste0(\"        (\", t_int, \")   (\", t_us, \")\"), parse=F, x=min_scale, y=max_scale-0.07, hjust = 0, size = 3.3) +\n    labs(x = \"US Alpha (%)\", y = \"World Ex. US Alpha (%)\") + \n    theme(\n      legend.position = \"none\",\n      legend.title = element_blank()\n    )\n}\n\nplot_is_oos_post <- function(is_oos, type) { # type in c(\"GLS\", \"OLS\")\n  plot_list <- c(\"pre\", \"post\", \"pre_post\") %>% lapply(function(period) {\n    data <- is_oos[[period]]$regs\n    if (type == \"OLS\") {\n      fit <- lm(oos ~ is, data = data)\n      # Output\n      int <- unname(fit$coefficients[1])\n      int_se <- summary(fit)$coefficients[1, \"Std. Error\"]\n      slope <- unname(fit$coefficients[2])\n      slope_se <- summary(fit)$coefficients[2, \"Std. 
Error\"]\n      r2 <- summary(fit)$adj.r.squared # Adjusted R2\n      # Label\n      eq_lbl <- substitute(\n        italic(y) == a + b %.% italic(x)*\",\"~~italic(R)^2~\"=\"~r2,\n        list(\n          a = formatC(int, digits = 2, format = \"f\"),\n          b = formatC(slope, digits = 2, format = \"f\"),\n          r2 = formatC(r2, digits = 2, format = \"f\"))\n      )\n    }\n    if (type == \"GLS\") {\n      x <- cbind(rep(1, nrow(data)), data$is)\n      y <- data$oos\n      chars <- str_c(data$characteristic, \"__us\")\n      \n      gls_cov <- eb_est$us$factor_cov[chars, chars]\n      gls_est <- solve(t(x) %*% solve(gls_cov) %*% x) %*% t(x) %*% solve(gls_cov) %*% y\n      gls_res <- y - x %*% gls_est\n      gls_e_var <- 1/(nrow(x)-ncol(x)) * t(gls_res) %*% solve(gls_cov) %*% (gls_res)\n      gls_se <- sqrt(diag(drop(gls_e_var)* solve(t(x) %*% solve(gls_cov) %*% x)))\n      # Output\n      int <- gls_est[1, 1]\n      int_se <- gls_se[1]\n      slope <- gls_est[2, 1]\n      slope_se <- gls_se[2]\n      # Label (R2 doesn't really translate to GLS, because the mean prediction is no longer a good baseline)\n      eq_lbl <- substitute(\n        italic(y) == a + b %.% italic(x),\n        list(\n          a = formatC(int, digits = 2, format = \"f\"),\n          b = formatC(slope, digits = 2, format = \"f\"))\n      )\n    }\n    \n    min_y <- min(data$is, data$oos)\n    max_y <- max(data$is, data$oos)\n    \n    t_int <- formatC(round(int/int_se, 2), format='f', digits=2)\n    t_is <- formatC(round(slope/slope_se, 2), format='f', digits=2)\n    t_lbl <- paste0(\"       (\", t_int, \")   (\", t_is, \")\")\n    \n    data %>%\n        ggplot(aes(is, oos)) +\n        geom_point(colour = colours_theme[1]) +\n        ylim(c(min_y, max_y)) +\n        xlim(c(min_y, max_y)) +\n        geom_hline(yintercept = 0, linetype = \"solid\") +\n        geom_vline(xintercept = 0, linetype = \"solid\") +\n        geom_abline(slope = 1, intercept = 0, linetype = \"dotted\") +\n        
ggtitle(label = eq_lbl, subtitle = t_lbl) +\n        labs(x = \"In-Sample\", y = \"Out-of-Sample\")\n  })\n  plot_list\n}\n\nplot_is_oos_post_quad <- function(is_oos, type) { # type in c(\"OLS\", \"GLS\")\n  plot_list <- c(\"pre\", \"post\", \"pre_post\") %>% lapply(function(period) {\n    data <- is_oos[[period]]$regs\n    if (type == \"OLS\") {\n      fit <- lm(oos ~ is + I(is^2), data = data)\n      # Output\n      int <- unname(fit$coefficients[1])\n      int_se <- summary(fit)$coefficients[1, \"Std. Error\"]\n      is <- unname(fit$coefficients[2])\n      is_se <- summary(fit)$coefficients[2, \"Std. Error\"]\n      issq <- unname(fit$coefficients[3])\n      issq_se <- summary(fit)$coefficients[3, \"Std. Error\"]\n      r2 <- summary(fit)$adj.r.squared # Adjusted R2\n      # Label \n      eq_lbl <- substitute(\n        italic(y) == a + b %.% italic(x)* ~ s ~ c %.% italic(x)^2*\",\"~~italic(R)^2~\"=\"~r2,\n        list(\n          a = formatC(int, digits = 2, format = \"f\"),\n          b = formatC(is, digits = 2, format = \"f\"),\n          s = ifelse(sign(issq)==1, \"+\", \"-\"),\n          c = formatC(unname(abs(issq)), digits = 2, format = \"f\"),\n          r2 = formatC(r2, digits = 2, format = \"f\"))\n      )\n    }\n    if (type == \"GLS\") {\n      x <- cbind(rep(1, nrow(data)), data$is, data$is^2)\n      y <- data$oos\n      chars <- str_c(data$characteristic, \"__us\")\n      \n      gls_cov <- eb_est$us$factor_cov[chars, chars]\n      gls_est <- solve(t(x) %*% solve(gls_cov) %*% x) %*% t(x) %*% solve(gls_cov) %*% y\n      gls_res <- y - x %*% gls_est\n      gls_e_var <- 1/(nrow(x)-ncol(x)) * t(gls_res) %*% solve(gls_cov) %*% (gls_res)\n      gls_se <- sqrt(diag(drop(gls_e_var)* solve(t(x) %*% solve(gls_cov) %*% x)))\n      # Output\n      int <- gls_est[1, 1]\n      int_se <- gls_se[1]\n      is <- gls_est[2, 1]\n      is_se <- gls_se[2]\n      issq <- gls_est[3, 1]\n      issq_se <- gls_se[3]\n      # Label\n      eq_lbl <- substitute(\n        
italic(y) == a + b %.% italic(x)* ~ s ~ c %.% italic(x)^2,\n        list(\n          a = formatC(int, digits = 2, format = \"f\"),\n          b = formatC(is, digits = 2, format = \"f\"),\n          s = ifelse(sign(issq)==1, \"+\", \"-\"),\n          c = formatC(unname(abs(issq)), digits = 2, format = \"f\"))\n      )\n    }\n    \n    min_y <- min(data$is, data$oos)\n    max_y <- max(data$is, data$oos)\n    \n    t_int <- formatC(round(int/int_se, 2), format='f', digits=2)\n    t_is <- formatC(round(is/is_se, 2), format='f', digits=2)\n    t_issq <- formatC(round(issq/issq_se, 2), format='f', digits=2)\n    t_lbl <- paste0(\"      (\", t_int, \")    (\", t_is, \")         (\", t_issq, \")\")\n    \n    data %>%\n      ggplot(aes(is, oos)) +\n      geom_point(colour = colours_theme[1]) +\n      ylim(c(min_y, max_y)) +\n      xlim(c(min_y, max_y)) +\n      geom_hline(yintercept = 0, linetype = \"solid\") +\n      geom_vline(xintercept = 0, linetype = \"solid\") +\n      geom_abline(slope = 1, intercept = 0, linetype = \"dotted\") +\n      geom_smooth(method = \"loess\", span = 1, formula = \"y~x\") +\n      ggtitle(label = eq_lbl, subtitle = t_lbl) +\n      labs(x = \"In-Sample\", y = \"Out-of-Sample\")\n  })\n  plot_list\n}\n\n# Effect Size Plot\nplot_effects <- function(type, orig_sig, cluster_order) {  # type in c(\"ols\", \"eb\")\n  if (orig_sig) {\n    orig_sig_values <- T\n  } else {\n    orig_sig_values <- c(T, F)\n  }\n  if (type == \"ols\") {\n    alpha_est = \"ols_est\"\n  }\n  if (type == \"eb\") {\n    alpha_est = \"post_mean\"\n  }\n  (effect_world <- eb_est$world$factors %>%\n      left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n      filter(orig_sig %in% orig_sig_values) %>%\n      mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n      group_by(hcl_label) %>%\n      summarise(mean_alpha = mean(get(alpha_est))) %>%\n      ggplot(aes(reorder(hcl_label, mean_alpha), 
mean_alpha, fill = hcl_label)) + \n      geom_col() +\n      coord_flip() +\n      labs(y = paste(str_to_upper(type), \"Alpha Estimate (%)\")) +\n      theme(\n        axis.title.y = element_blank(),\n        legend.position = \"none\"\n      ))\n  \n  (effect_regions <- eb_est$all$factors %>%\n      left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n      filter(orig_sig %in% orig_sig_values) %>%\n      mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n      group_by(region, hcl_label) %>%\n      summarise(mean_alpha = mean(get(alpha_est))) %>%\n      group_by(hcl_label) %>%\n      mutate(\n        sort_var = mean_alpha[region == \"us\"],\n        region_pretty = case_when(\n          region == \"us\" ~ \"US\",\n          region == \"developed\" ~ \"Developed\",\n          region == \"emerging\" ~ \"Emerging\"\n        ),\n        region_pretty = region_pretty %>% factor(levels = c(\"US\", \"Developed\", \"Emerging\"))\n      ) %>%\n      ggplot(aes(reorder(hcl_label, sort_var), mean_alpha, fill = hcl_label)) + \n      geom_col() +\n      coord_flip() +\n      scale_y_continuous(breaks = seq(-0.2, 1, 0.2)) +\n      facet_wrap(~region_pretty, scales = \"free_x\") +\n      labs(y = paste(\"Average\", str_to_upper(type) ,\"Alpha (%)\")) +\n      theme(\n        axis.title.y = element_blank(),\n        legend.position = \"none\"\n      ))\n  \n  (effect_size <- eb_us_size %>%\n      left_join(char_info %>% select(characteristic, \"orig_sig\" = significance), by = \"characteristic\") %>%\n      filter(orig_sig %in% orig_sig_values) %>%\n      mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n      group_by(size_grp, hcl_label) %>%\n      summarise(mean_alpha = mean(get(alpha_est))) %>%\n      group_by(hcl_label) %>%\n      mutate(sort_var = mean_alpha[size_grp == \"Mega\"]) %>%\n      ggplot(aes(reorder(hcl_label, sort_var), mean_alpha, fill = hcl_label)) +\n      geom_col() 
+\n      coord_flip() +\n      scale_y_continuous(breaks = seq(0, 1.5, 0.50)) +\n      facet_wrap(~size_grp, nrow = 1, scales = \"free_x\") +\n      labs(y = paste(\"Average\", str_to_upper(type) ,\"Alpha (%)\")) +\n      theme(\n        axis.title.y = element_blank(),\n        legend.position = \"none\"\n      ))\n  \n  list(effect_world, effect_regions, effect_size)\n}\n\n# Replication Rate by Cluster\nplot_repl_cluster <- function(eb_factors, orig_sig, cluster_order) {\n  if (orig_sig) {\n    factor_subset <- eb_factors %>%\n      left_join(char_info %>% select(characteristic, significance), by = \"characteristic\") %>%\n      filter(significance == T)\n  } else {\n    factor_subset <- copy(eb_factors)\n  }\n  if (uniqueN(factor_subset$region) != 1) {\n    warning(\"!!!MULTIPLE REGIONS INCLUDED!!!\")\n  }\n  \n  factor_subset %>%\n    mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n    group_by(hcl_label) %>%\n    summarise(\n      n = n(),\n      repl_rate = mean(p025 > 0),\n      sort_var = repl_rate + n / 100000\n    )  %>%\n    ggplot(aes(reorder(hcl_label, sort_var), repl_rate * 100, fill = hcl_label)) +\n    geom_col() +\n    labs(y = \"Replication Rate (%)\") +\n    coord_flip() +\n    theme(\n      axis.title.y = element_blank(),\n      legend.position = \"none\"\n    )\n}\n\n# Plot Replication Rate as a Function of Tau - Benchmark against Harvey et al. 
(2016)\nplot_harvey <- function(harvey_base_res, harvey_worst_res, tau_ws, act_rr) {\n  mle_est_base <- harvey_base_res$sim[[1]] %>% \n    lapply(function(x) x$mle) %>% \n    bind_rows() %>%\n    mutate(type = \"baseline\")\n  \n  mle_est_worst <- harvey_worst_res$sim[[1]] %>% \n    lapply(function(x) x$mle) %>% \n    bind_rows() %>%\n    mutate(type = \"worst_case\")\n  \n  mle_est <- mle_est_base %>% bind_rows(mle_est_worst)\n  \n  mle_summary <- mle_est %>%\n    group_by(type, coef) %>%\n    summarise(\n      n = n(),\n      coef_mean = mean(mle)\n    )\n  \n  tc_harvey_base <- mle_summary %>% filter(type == \"baseline\" & coef == \"tc\") %>% pull(coef_mean)\n  tc_harvey_worst <- mle_summary %>% filter(type == \"worst_case\" & coef == \"tc\") %>% pull(coef_mean)\n  \n  # Replication Rate under alternative Tau's\n  m <- eb_est$us$factors %>%\n    select(characteristic, ols_est, hcl_label) %>%\n    mutate(cm = 1) %>%\n    select(characteristic, hcl_label, cm) %>%\n    spread(key = hcl_label, value = cm) %>% \n    select(-characteristic) %>% \n    as.matrix()\n  m[is.na(m)] <- 0\n  mm <- m %*% t(m)\n  \n  alpha_hat <- eb_est$us$factors %>% pull(ols_est)\n  alpha_0_vec <- rep(0, 153)\n  sigma <- eb_est$us$sigma\n  \n  tc_act <- eb_est$us$mle %>% filter(estimate == \"tau_c\") %>% pull(ml_est) %>% round(2)\n  \n  search_grid <- expand.grid(\n    tau_c = c(seq(0.15, 0.46, by = 0.01), tc_act, tc_harvey_base, tc_harvey_worst),\n    tau_w = tau_ws\n  )\n  \n  repl_by_tau <- 1:nrow(search_grid) %>% lapply(function(i) {\n    tw <- search_grid[i, \"tau_w\"]\n    tc <- search_grid[i, \"tau_c\"]\n    omega <- diag(153) * tw^2 + mm * tc^2\n    post_cov <- solve(solve(omega) + solve(sigma))\n    post_alpha <- post_cov %*% (solve(omega) %*% alpha_0_vec + solve(sigma) %*% alpha_hat)\n    \n    tibble(characteristic = eb_est$us$factors$characteristic, post_mean = drop(post_alpha), post_sd = sqrt(diag(post_cov))) %>%\n      left_join(char_info %>% select(characteristic, \"orig_sig\" 
= significance), by = \"characteristic\") %>%\n      filter(orig_sig == T) %>%\n      summarise(repl_rate = mean(post_mean - 1.96 * post_sd > 0)) %>%\n      mutate(tau_c = tc, tau_w = tw)\n  }) %>% bind_rows()\n  \n  # Set TW labels \n  n_tw <- length(tau_ws) \n  tau_w_names <- vector(mode = \"expression\", length = n_tw)\n  for (i in 1:n_tw) {\n    tau_w_names[i] <- c(bquote(bold(tau[w])  ~ \"=\" ~ .(unname(tau_ws[i])) ~ \"%\"))\n  }\n  names(tau_w_names) <- tau_ws\n  \n  # Generate Important Points\n  tau_points <- repl_by_tau %>%\n    filter(tau_c %in% c(tc_act, tc_harvey_base, tc_harvey_worst)) %>%\n    distinct() %>%\n    mutate(\n      tau_w_title = tau_w %>% factor(labels = tau_w_names),\n      type = case_when(\n        tau_c == tc_act ~ \"Estimated from Data\",\n        tau_c == tc_harvey_base ~ \"Harvey, Liu, and Zhu (2016): Baseline\",\n        tau_c == tc_harvey_worst ~ \"Harvey, Liu, and Zhu (2016): Conservative\",\n        TRUE ~ \"Other\"\n      ),\n      type = type %>% factor(levels = c(\"Harvey, Liu, and Zhu (2016): Conservative\", \"Harvey, Liu, and Zhu (2016): Baseline\", \"Estimated from Data\"))\n    )\n  print(tau_points)\n  \n  plot <- repl_by_tau %>%\n    mutate(\n      tau_w_title = tau_w %>% factor(labels = tau_w_names),\n      type = case_when(\n        tau_c == tc_act ~ \"Estimated from Data\",\n        tau_c == tc_harvey_base ~ \"Harvey, Liu, and Zhu (2016): Baseline\",\n        tau_c == tc_harvey_worst ~ \"Harvey, Liu, and Zhu (2016): Conservative\",\n        TRUE ~ \" \"\n      ),\n      type = type %>% factor(levels = c(\"Estimated from Data\", \"Harvey, Liu, and Zhu (2016): Baseline\", \"Harvey, Liu, and Zhu (2016): Conservative\", \" \"))\n    ) %>%\n    ggplot(aes(tau_c, repl_rate * 100)) +\n    geom_point(data = tau_points, aes(colour = type, shape = type, stroke = 1), size = 3) +\n    geom_line(alpha = 1, size = 0.6) +\n    geom_hline(yintercept = act_rr*100, linetype = \"dotted\") +\n    scale_x_continuous(breaks = seq(0.05, max(search_grid$tau_c), 0.05)) +\n    ylim(c(0, 100)) +\n    theme(legend.title = element_blank(), legend.position = \"top\") +\n    labs(y = \"Replication Rate (%)\", x = bquote(bold(tau[c])~\"(%)\"), colour = expression(tau[c]), shape = expression(tau[c]))\n  if (n_tw > 1) {\n    plot <- plot + facet_wrap(~tau_w_title, labeller = label_parsed)\n  }\n  return(plot)\n}\n\n# Single Factor TPF --\n# Plot TPF Factor: Cluster + Market\nplot_tpf_one_cluster <- function(data_wide, cluster_labels, s) {\n  mkt_sr <- mean(data_wide$market)/sd(data_wide$market)\n  one_cluster <- unique(cluster_labels$hcl_label) %>% lapply(function(c) {\n    cl_chars <- cluster_labels %>%\n      filter(characteristic %in% colnames(data_wide) & hcl_label == c) %>%\n      pull(characteristic)\n    # Cluster SR\n    cl <- data_wide %>% select(market, all_of(cl_chars))\n    w <- cl %>% epo_tpf(s = s)  \n    sr_all <- cl %>% sr_func(w = w)\n    # Average SR\n    sr_single_avg <- cl_chars %>% sapply(function(char) {\n      cl_sub <- data_wide %>% select(market, all_of(char))\n      w <- cl_sub %>% epo_tpf(s = s)\n      cl_sub %>% sr_func(w = w)\n    }) %>%\n      mean()\n    \n    tibble(hcl_label=c, sr_all=sr_all, sr_single_avg=sr_single_avg)\n  }) %>%\n    bind_rows()\n  \n  one_cluster %>%\n    mutate(hcl_label = hcl_label %>% factor(levels = cluster_order)) %>%\n    ggplot(aes(reorder(hcl_label, sr_all), sr_all, fill = hcl_label)) +\n    geom_col() +\n    coord_flip() +\n    geom_hline(yintercept = mkt_sr, linetype = \"dashed\") +\n    theme(legend.position = \"none\", axis.title.y = element_blank()) +\n    labs(y = \"Monthly Sharpe Ratio: Market + Cluster\")\n}\n\n# 
Plot TPF Factor: Exclude one cluster\nplot_tpf_excl_cl <- function(data_wide, cluster_labels, s) {\n  epo_w <- data_wide %>% epo_tpf(s = s)\n  full_sr <- data_wide %>% sr_func(w = epo_w)\n  \n  excl_one <- c(\"Market\", unique(cluster_labels$hcl_label)) %>% lapply(function(c) {\n    cl_chars <- cluster_labels %>%\n      filter(characteristic %in% colnames(data_wide) & hcl_label != c) %>%\n      pull(characteristic)\n    if (c != \"Market\") {\n      cl_chars <- c(cl_chars, \"market\")\n    }\n    # All minus Cluster SR\n    data <- data_wide %>% select(all_of(cl_chars))\n    w <- data %>% epo_tpf(s = s)  \n    sr <- data %>% sr_func(w = w)\n    \n    tibble(hcl_label=c, sr=sr)\n  }) %>%\n    bind_rows()\n  \n  excl_one %>%\n    mutate(hcl_label = hcl_label %>% factor(levels = c(cluster_order, \"Market\"))) %>%\n    ggplot(aes(reorder(hcl_label, -sr), (full_sr-sr) / full_sr, fill = hcl_label)) +\n    geom_col() +\n    coord_flip() +\n    theme(legend.position = \"none\", axis.title.y = element_blank()) +\n    labs(y = \"Percentage Drop in Monthly SR from Excluding Cluster\")\n}\n\n# Plot TPF Factor: Single Factor Importance\nplot_tpf_factor_imp <- function(data_wide, cluster_labels, s) {\n  epo_w <- data_wide %>% epo_tpf(s = s)\n  full_sr <- data_wide %>% sr_func(w = epo_w)\n  each_factor <- colnames(data_wide) %>% lapply(function(c) {\n    # SR excluding char\n    data <- data_wide %>% select(-all_of(c))\n    w <- data %>% epo_tpf(s = s)  \n    sr <- data %>% sr_func(w = w)\n    # Output\n    tibble(characteristic = c, sr = sr)\n  }) %>%\n    bind_rows()\n  \n  each_factor %>%\n    left_join(cluster_labels, by = \"characteristic\") %>%\n    mutate(\n      hcl_label = if_else(characteristic == \"market\", \"Market\", hcl_label),\n      hcl_label = hcl_label %>% factor(levels = c(cluster_order, \"Market\")),\n      drop = full_sr - sr,\n      drop_prop = drop/full_sr\n    ) %>%\n    arrange(drop_prop) %>%\n    tail(10) %>%\n    ggplot(aes(reorder(characteristic, 
drop_prop), drop_prop*100, fill = hcl_label)) +\n    geom_col() +\n    coord_flip() +\n    theme(axis.title.y = element_blank()) +\n    labs(y = \"Drop in TPF SR (% of full)\", fill = \"Cluster\")\n  \n  each_factor %>%\n    left_join(cluster_labels, by = \"characteristic\") %>%\n    mutate(\n      hcl_label = if_else(characteristic == \"market\", \"Market\", hcl_label),\n      hcl_label = hcl_label %>% factor(levels = c(cluster_order, \"Market\")),\n      drop = full_sr - sr,\n      drop_prop = drop/full_sr,\n      rank = frank(-drop_prop)\n    ) %>%\n    group_by(hcl_label) %>%\n    filter(characteristic != \"market\") %>%\n    # filter(drop_prop == max(drop_prop)) %>%\n    ggplot(aes(reorder(characteristic, drop_prop), drop_prop*100, fill = hcl_label)) +\n    geom_col() +\n    coord_flip() +\n    theme(axis.title.y = element_blank(), legend.position = \"none\") +\n    labs(y = \"Drop in TPF SR (% of full)\", fill = \"Cluster\") +\n    facet_wrap(~hcl_label, scales = \"free_y\")\n}\n\n# Plot TPF Factor: Single Factor Importance within Cluster\nplot_tpf_factor_imp_cluster <- function(data_wide, cluster_labels, s) {\n  within_cluster <- unique(cluster_labels$hcl_label) %>% lapply(function(c) {\n    cl_chars <- cluster_labels %>%\n      filter(characteristic %in% colnames(data_wide) & hcl_label == c) %>%\n      pull(characteristic)\n    # Full Cluster tpf\n    data <- data_wide %>% select(market, all_of(cl_chars))\n    w_all <- data %>% epo_tpf(s = s)  \n    sr_all <- data %>% sr_func(w = w_all)\n    # Individual\n    sr_chars <- cl_chars %>% lapply(function(char) {\n      sub <- data %>% select(-all_of(char))\n      w_sub <- sub %>% epo_tpf(s = s)  \n      sr_sub <- sub %>% sr_func(w = w_sub)\n      tibble(excl_char = char, sr = sr_sub)\n    }) %>% bind_rows()\n    \n    sr_chars %>% mutate(hcl_label = c, sr_all = sr_all)\n  }) %>%\n    bind_rows()\n  \n  within_cluster %>%\n      mutate(\n        drop = sr_all - sr,\n        drop_prop = drop/sr_all,\n        
hcl_label = hcl_label %>% factor(levels = cluster_order)\n      ) %>%\n      group_by(hcl_label) %>%\n      # filter(drop == max(drop)) %>%\n      ggplot(aes(reorder(excl_char, drop_prop), drop_prop*100, fill = hcl_label)) +\n      geom_col() +\n      coord_flip() +\n      theme(axis.title.y = element_blank(), legend.position = \"none\") +\n      labs(y = \"Drop in cluster TPF SR (% of full cluster)\", fill = \"Cluster\") +\n      facet_wrap(~hcl_label, scales = \"free\")\n}\n\n# Plot TPF Factor: The Evolution of the TPF\nplot_tpf_evolution <- function(data, data_wide, char_info, orig_sig_values, s) {\n  mkt_sr <- mean(data_wide$market) / sd(data_wide$market)\n  \n  years <- data %>% \n    filter(year(eom) > min(char_info$sample_end)) %>% \n    mutate(year = year(eom)) %>%\n    pull(year) %>%\n    unique()\n  \n  sr_over_time <- years %>% lapply(function(y) {\n    discovered_chars <- char_info %>% filter(sample_end <= y & significance %in% orig_sig_values) %>% pull(characteristic)\n    sub <- data_wide %>% select(all_of(discovered_chars), \"market\")  \n    w <- sub %>% epo_tpf(s = s) \n    # tibble(w = w, char = colnames(sub)) %>% left_join(char_info %>% select(\"char\"=characteristic, sample_end), by = \"char\") %>% arrange(-w)\n    # 2002: seasonality coincides with noa_at that also gets a large weight\n    \n    tibble(year = y, tpf_sr = sr_func(sub, w), n = ncol(sub))  \n  }) %>% bind_rows()\n  \n  sr_over_time <- sr_over_time %>%\n    bind_rows(tibble(year = min(years)-1, tpf_sr = mkt_sr, n = 0))\n  \n  (sr_plot <- sr_over_time %>%\n      ggplot(aes(year, tpf_sr)) +\n      geom_point() +\n      geom_line() +\n      ylim(c(0, NA)) +\n      # geom_hline(yintercept = full_sr) +\n      annotate(\"text\", x = 1971, \n               y = 0, label = \"Market\", colour='black') +\n      geom_segment(aes(x = 1971, y = 0.02, xend = 1971, yend = 0.1), size=0.1,arrow = arrow(length = unit(0.2, \"cm\"))) +\n      labs(y = \"Ex-Post Tangency SR\", x = \"Year\") +\n      
theme(axis.title.x = element_blank()) +\n      annotate(\"text\", x = 1972, \n               y = 0.34, label = \"Beta\", colour='black') +\n      geom_segment(aes(x = 1972, y = 0.31, xend = 1972, yend = 0.23), size=0.1,arrow = arrow(length = unit(0.2, \"cm\"))) +\n      annotate(\"text\", x = 1979, \n               y = 0.45, label = \"Earning-to-Price\", colour='black') +\n      geom_segment(aes(x = 1979, y = 0.42, xend = 1979, yend = 0.30), size=0.1,arrow = arrow(length = unit(0.2, \"cm\"))) +\n      annotate(\"text\", x = 1983, \n               y = 0.08, label = \"Earnings Momentum\", colour='black') +\n      geom_segment(aes(x = 1981, y = 0.10, xend = 1981, yend = 0.25), size=0.1,arrow = arrow(length = unit(0.2, \"cm\"))) +\n      annotate(\"text\", x = 1989, \n               y = 0.25, label = \"Price Momentum\", colour='black') +\n      geom_segment(aes(x = 1989, y = 0.27, xend = 1989, yend = 0.39), size=0.1,arrow = arrow(length = unit(0.2, \"cm\"))) +\n      annotate(\"text\", x = 1991, \n               y = 0.73, label = \"Operating Accruals\", colour='black') +\n      geom_segment(aes(x = 1991, y = 0.69, xend = 1991, yend = 0.6), size=0.1,arrow = arrow(length = unit(0.2, \"cm\"))) +\n      annotate(\"text\", x = 2002, \n               y = 0.9, label = \"Seasonality\", colour='black') +\n      geom_segment(aes(x = 2002, y = 0.87, xend = 2002, yend = 0.78), size=0.1,arrow = arrow(length = unit(0.2, \"cm\"))))\n  \n  n_plot <- sr_over_time %>%\n    ggplot(aes(year, n)) +\n    geom_point() +\n    geom_line() +\n    labs(y = \"Factor Discovered\", x = \"Year of Discovery\")\n  \n  list(\"plot\"=cowplot::plot_grid(sr_plot, n_plot, ncol = 1, rel_heights = c(0.7, 0.3)), \"data\"=sr_over_time)\n}\n\n# Plot performance over time\nplot_ts <- function(data, oos, alphas, scale, orig_sig, start = as.Date(\"1986-01-01\")) {\n  data[, region := case_when(\n    region == \"us\" ~ \"US\",\n    region == \"world_ex_us\" ~ \"World ex. 
US\"\n  )]\n  data <- data[eom >= start]\n  data <- cluster_labels[data, on = \"characteristic\"]\n  if (oos) {\n    data <- setDT(char_info)[, .(characteristic, sample_end)][data, on = .(characteristic)]\n    data <- data[year(eom) > sample_end][, sample_end := NULL]\n  }\n  if (orig_sig) {\n    data <- setDT(char_info)[, .(characteristic, significance)][data, on = .(characteristic)]\n    data <- data[significance==T][, significance := NULL]\n  }\n  y_axis <- paste0(\"Cumulative \", if_else(alphas==T, \"Alpha \", \"Excess Return \"), if_else(oos==T, \"(OOS)\", \"(IS)\"))\n  \n  agg <- data[, .(ret = mean(ret), mkt = mean(mkt_vw_exc)), by = .(region, eom)]\n  if (alphas) {\n    agg[, ret := ret - cov(ret,mkt)/var(mkt)*mkt, by = .(region)]\n  }\n  if (scale) {\n    agg[, ret := ret / (sd(ret)*sqrt(12)/0.1), by = .(region)]\n  }\n  agg %>% setorder(region, eom)\n  agg[, cumret_app := cumsum(ret), by = region]\n  plot <- agg %>%\n    ggplot(aes(eom, cumret_app, colour = region)) +\n    geom_line() +\n    labs(y = y_axis) +\n    theme(\n      legend.position = c(0.85, 0.35),\n      legend.title = element_blank(),\n      axis.title.x = element_blank()\n    )\n  # Table\n  tbl <- agg %>%\n    group_by(region) %>%\n    summarise(\n      n = n(),\n      meanret = mean(ret),\n      vol = sd(ret),\n      ret_vol = meanret/vol*sqrt(12),\n      t = meanret/(vol/sqrt(n))\n    ) %>%\n    mutate(meanret = meanret*12)\n  tbl %>% \n    select(region, ret_vol, t) %>%\n    pivot_longer(c(ret_vol, t)) %>%\n    mutate(\n      value = formatC(value, digits=2, format = \"f\"),\n      value = if_else(name == \"t\", paste0(\"(\", value, \")\"), value)\n    ) %>%\n    mutate(\n      region = if_else(name == \"t\", \"\", region)\n    ) %>%\n    select(-name) %>%\n    rename(\"Region\"=region, \"Full sample\"=value) %>%\n    xtable(align = \"llc\") %>%\n    print(include.rownames = F)\n  # Output\n  return(plot)\n}\n\n# Plot OOS performance of significant factors\nplot_sig_oos <- 
function(sig_oos_pfs, sig_type, cutoff_2012, first_date, leg_pos) {\n  full <- sig_oos_pfs %>%\n    filter(eom >= first_date) %>% \n    group_by(region, type, significant) %>%\n    mutate(\n      a = ret - cov(mkt,ret)/var(mkt)*mkt\n    )  %>%\n    summarise(\n      n = n(),\n      meanret = mean(ret),\n      sd = sd(ret),\n      sr = meanret/sd * sqrt(12),\n      alpha = mean(a),\n      resvol = sd(a),\n      ir = alpha/resvol*sqrt(12),\n      t_alpha = alpha/(resvol/sqrt(n))\n    ) %>%\n    mutate(alpha = alpha*12) %>%\n    filter(region %in% c(\"us\", \"world_ex_us\") & type == sig_type) %>%\n    setDT()\n  \n  post_harvey <- sig_oos_pfs %>%\n    filter(eom >= first_date) %>%\n    group_by(region, type, significant) %>%\n    mutate(\n      a = ret - cov(mkt,ret)/var(mkt)*mkt \n    )  %>%\n    filter(eom >= as.Date(\"2013-01-01\")) %>%\n    summarise(\n      n = n(),\n      meanret = mean(ret),\n      sd = sd(ret),\n      sr = meanret/sd * sqrt(12),\n      alpha = mean(a),\n      resvol = sd(a),\n      ir = alpha/resvol*sqrt(12),\n      t_alpha = alpha/(resvol/sqrt(n))\n    ) %>%\n    mutate(alpha = alpha*12) %>%\n    filter(region %in% c(\"us\", \"world_ex_us\") & type == sig_type) %>%\n    setDT()\n  \n  cumret <- sig_oos_pfs %>%\n    filter(eom >= first_date) %>%\n    group_by(region, type, significant) %>%\n    arrange(region, type, significant, eom) %>%\n    filter(type == sig_type & region %in% c(\"us\", \"world_ex_us\")) %>%\n    mutate(\n      alpha = ret - cov(mkt,ret)/var(mkt)*mkt,\n      alpha = alpha / (sd(alpha)*sqrt(12)/0.1),\n      cum_alpha = cumsum(alpha),\n      region_pretty = case_when(\n        region == \"us\" ~ \"U.S.\",\n        region == \"world_ex_us\" ~ \"World ex. 
U.S.\"\n      )\n    )\n  \n  # Figure \n  sig_oos <- cumret %>%\n      ggplot(aes(eom, cum_alpha, colour = region_pretty, linetype = region_pretty)) +\n      geom_line() +\n      labs(y = \"Cumulative Alpha\") +\n      theme(\n        axis.title.x = element_blank(),\n        legend.position = leg_pos,\n        legend.title = element_blank()\n      )\n  \n  # Table for caption\n  tbl <- rbind(\n    full %>% select(region, ir, t_alpha) %>% mutate(sample = \"Full sample\"),\n    post_harvey %>% select(region, ir, t_alpha) %>% mutate(sample = \"Post Harvey et al\")\n  ) %>%\n    pivot_longer(c(ir, t_alpha)) %>%\n    mutate(\n      value = formatC(value, digits=2, format = \"f\"),\n      value = if_else(name == \"t_alpha\", paste0(\"(\", value, \")\"), value)\n    ) %>%\n    pivot_wider(names_from = sample, values_from = value) %>%\n    mutate(\n      region = case_when(\n        region==\"us\" ~ \"IR: US\",\n        region == \"world_ex_us\" ~ \"IR: World ex. US\"\n      ),\n      region = if_else(name == \"t_alpha\", \"\", region)\n    ) %>%\n    select(-name) %>%\n    rename(\"Region\"=region)\n  \n  if (cutoff_2012) {\n    sig_oos <- sig_oos + geom_vline(xintercept = as.Date(\"2012-12-31\"), linetype = \"dotted\", alpha = 1)\n    tbl %>% xtable(align = \"llcc\") %>% print(include.rownames = F)\n  } else {\n    tbl %>% select(-`Post Harvey et al`) %>% xtable(align = \"llc\") %>% print(include.rownames = F)\n  }\n  # Output \n  return(sig_oos)\n}\n\n\n# EB Posterior checks\neb_plots <- function(eb, plot = \"shrinkage\") {\n  if (plot == \"cluster_distribution\") {\n    a <- eb$mle %>% filter(estimate == \"alpha\") %>% pull(ml_est)\n    tb <- eb$mle %>% filter(estimate == \"tau_bar\") %>% pull(ml_est)\n    \n    op <- data.frame(x=c(a-3*tb, a+3*tb)) %>% # data.frame(x=c(-1.2, 1.2)) %>%\n      ggplot(aes(x)) + \n      stat_function(fun=function(x) dnorm(x = x, mean = a, sd = tb)) +\n      labs(x = \"Alpha\", y = \"Density\", title = \"Population Cluster Distribution ~ 
N(a0, tau_bar^2)\")\n  }\n  if (plot == \"factor_distribution\") {\n    a <- eb$mle %>% filter(estimate == \"alpha\") %>% pull(ml_est)\n    tb <- eb$mle %>% filter(estimate == \"tau_bar\") %>% pull(ml_est)\n    tt <- eb$mle %>% filter(estimate == \"tau_tilde\") %>% pull(ml_est)\n    factor_sd <- sqrt(tb^2 + tt^2)\n    \n    op <- data.frame(x=c(a-3*factor_sd, a+3*factor_sd)) %>% # data.frame(x=c(-1.2, 1.2)) %>%\n      ggplot(aes(x)) + \n      stat_function(fun=function(x) dnorm(x = x, mean = a, sd = factor_sd)) +\n      labs(x = \"Alpha\", y = \"Density\", title = \"Population Factor Distribution ~ N(a0, tau_bar^2 + tau_tilde^2)\")\n  }\n  if (plot == \"factor\") {\n    op <- eb$factors %>%\n      ggplot(aes(reorder(characteristic, post_mean), post_mean)) +\n      geom_point() +\n      geom_errorbar(aes(ymin = p025, ymax = p975)) +\n      facet_wrap(~region) +\n      theme(\n        axis.title.x = element_blank(),\n        axis.text.x = element_text(size = 5, angle = 90, vjust = 0, hjust = 1),\n        text = element_text(size = 10)\n      )\n  }\n  if (plot == \"cluster\") {\n    op <- eb$clusters %>% \n      ggplot(aes(reorder(hcl_label, post_mean), post_mean)) +\n      geom_point() +\n      geom_errorbar(aes(ymin = post_mean - 1.96 * post_sd, ymax = post_mean + 1.96 * post_sd)) +\n      geom_hline(yintercept = 0, linetype = \"dotted\") +\n      labs(y = \"Posterior Distribution of Cluster Alpha\") +\n      theme(\n        axis.title.x = element_blank(),\n        axis.text.x = element_text(size = 8)\n      ) \n  }\n  if (plot == \"signal\") {\n    op <- eb$signal %>%\n      ggplot(aes(characteristic, post_mean)) +\n      geom_point() +\n      geom_errorbar(aes(ymin = post_mean - 1.96 * post_sd, ymax = post_mean + 1.96 * post_sd))\n  }\n  if (plot == \"factor_shrinkage\") {\n    op <- eb$factors %>%\n      select(characteristic, region, \"eb\" = post_mean, \"ols\" = ols_est) %>%\n      gather(eb, ols, key = \"type\", value = \"alpha\") %>%\n      
group_by(characteristic) %>%\n      mutate(sort_var = alpha[region == \"us\" & type == \"ols\"]) %>%\n      ggplot(aes(reorder(characteristic, sort_var), alpha, colour = region)) +\n      geom_point() +\n      facet_wrap(~type, ncol = 1) +\n      theme(\n        axis.title.x = element_blank(),\n        axis.text.x = element_text(size = 5, angle = 90, vjust = 0, hjust = 1)\n      )\n  }\n  if (plot == \"se\") {\n    op <- eb$factors %>% \n      mutate(ols_p025 = ols_est - 1.96 * ols_se) %>% \n      mutate(se_diff = (ols_se-post_sd) / ols_se) %>% \n      ggplot(aes(x = region, y = se_diff*100, colour = region)) +\n      geom_boxplot() +\n      expand_limits(y = 0) +\n      theme(\n        axis.title.x = element_blank(),\n        legend.position = \"none\"\n      ) +\n      labs(y = \"(SE_ols - SE_eb) / SE_ols * 100\")\n  }\n  if (plot == \"repl\") {\n    repl_table <- eb$factors %>%\n      group_by(region) %>%\n      summarise(\n        repl_eb = mean(p025 > 0),\n        repl_ols = mean(ols_est - 1.96*ols_se > 0)\n      )\n    print(repl_table)\n    \n    op <- eb$factors %>%\n      group_by(region, hcl_label) %>%\n      summarise(\n        mean_alpha = mean(post_mean),\n        rep_rate = mean(p025>0)\n      ) %>%\n      group_by(hcl_label) %>%\n      mutate(sort_var = rep_rate[region == \"us\"]) %>%\n      ggplot(aes(reorder(hcl_label, sort_var), rep_rate)) +\n      geom_col() +\n      facet_wrap(~region, ncol = 1) +\n      theme(\n        axis.title.x = element_blank(),\n        axis.text.x = element_text(size = 7)\n      )\n  }\n  if (plot == \"significance\") {\n    op <- bind_rows(\n      eb$factors %>% select(char_reg, characteristic, hcl_label, region, \"alpha\" = post_mean, \"se\" = post_sd) %>% mutate(type = \"hlm\"),\n      eb$factors %>% select(char_reg, characteristic, hcl_label, region, \"alpha\" = ols_est, \"se\" = ols_se) %>% mutate(type = \"ols\")\n    ) %>%\n      mutate(significant = alpha - 1.96 * se > 0) %>%\n      group_by(hcl_label) %>%\n   
   mutate(\n        sort_var = median(alpha[type == \"ols\" & region == \"US\"]),\n        type = case_when(\n          type == \"hlm\" ~ \"Empirical Bayes\",\n          type == \"ols\" ~ \"OLS\"\n        )\n      ) %>%\n      ggplot(aes(reorder(characteristic, sort_var), alpha, colour = significant)) +\n      geom_point() +\n      geom_errorbar(aes(ymin = alpha - 1.96*se, ymax = alpha + 1.96*se)) +\n      geom_hline(yintercept = 0, colour = \"black\") +\n      facet_wrap(region~type, ncol = 2) +\n      labs(y = \"Alpha\", colour = \"Significant\") +\n      theme(\n        axis.title.x = element_blank(),\n        axis.text.x = element_text(size = 5, angle = 90, vjust = 0, hjust = 1)\n      )\n  }\n  if (plot == \"shrinkage\") {\n    op <- eb$factors %>%\n      select(region, characteristic, \"hlm_alpha_mean\" = post_mean, \"ols_alpha\" = ols_est) %>%\n      gather(-characteristic, -region, key = \"type\", value = \"alpha\") %>%\n      group_by(characteristic, region) %>%\n      mutate(sort_var = sum(alpha * (type == \"ols_alpha\"))) %>%\n      group_by(region) %>%\n      mutate(\n        rank = frank(sort_var, ties.method = \"max\") / 2,\n        type = case_when(\n          type == \"hlm_alpha_mean\" ~ \"Empirical Bayes Posterior Mean\",\n          type == \"ols_alpha\" ~ \"OLS Estimate\"\n        )\n      ) %>%\n      ggplot(aes(rank, alpha, shape = type, colour = type, group = type)) +\n      geom_smooth(method = \"lm\", se = F, formula = \"y ~ x\") +\n      geom_point() +\n      facet_wrap(~region) +\n      theme(\n        axis.text.x = element_text(size = 7),\n        text = element_text(size = 10)\n      ) +\n      labs(x = \"Rank OLS Alpha\", y = \"Alpha\", colour = \"Type\", shape = \"Type\")\n  }\n  \n  if (plot == \"comparison\") {\n    op <- bind_rows(\n      eb$factors %>% select(char_reg, characteristic, hcl_label, region, \"alpha\" = post_mean, \"se\" = post_sd) %>% mutate(type = \"hlm\"),\n      eb$factors %>% select(char_reg, characteristic, 
hcl_label, region, \"alpha\" = ols_est, \"se\" = ols_se) %>% mutate(type = \"ols\")\n    ) %>%\n      group_by(hcl_label) %>%\n      mutate(\n        sort_var = median(alpha[type == \"ols\" & region == \"US\"]),\n        type = case_when(\n          type == \"hlm\" ~ \"Empirical Bayes\",\n          type == \"ols\" ~ \"OLS\"\n        )\n      ) %>%\n      ggplot(aes(reorder(characteristic, sort_var), alpha, colour = hcl_label)) +\n      geom_point() +\n      geom_errorbar(aes(ymin = alpha - 1.96*se, ymax = alpha + 1.96*se)) +\n      geom_hline(yintercept = 0, colour = \"black\") +\n      facet_wrap(region~type, ncol = 2) +\n      labs(y = \"Alpha\", colour = \"Cluster\") +\n      theme(\n        axis.title.x = element_blank(),\n        axis.text.x = element_text(size = 5, angle = 90, vjust = 0, hjust = 1)\n      )\n  }\n  if (plot == \"cluster_density\") {\n    op <- eb$clusters %>% \n      group_by(hcl_label) %>% \n      nest() %>% \n      mutate(randoms = data%>% map(~rnorm(1000, mean = .x$post_mean, sd = .x$post_sd))) %>% \n      unnest(randoms) %>% \n      ggplot(aes(x = randoms, fill = hcl_label)) + \n      geom_density(alpha = 0.5)  #  + facet_wrap(~hcl_label)\n  }\n  print(op)\n}\n"
  },
  {
    "path": "Analysis/1 - Prepare Data.R",
    "content": "# Prepare Support Data ---------------------------------\n# Market Returns\nmarket_returns <- fread(paste0(data_path, \"/market_returns.csv\"), colClasses = c(\"eom\"=\"character\"))\nmarket_returns[, eom := eom %>% as.Date(format = \"%Y-%m-%d\")]\nmarket_returns <- market_returns[, .(excntry, eom, mkt_vw_exc, stocks, me_lag1)]\nmarket_returns <- market_returns[\n  eom >= settings$start_date & eom <= settings$end_date &\n    !(excntry %in% settings$country_excl) &   \n    !(excntry == \"PER\" & eom == as.Date(\"1992-01-31\") & mkt_vw_exc >= 8900) &  # Huge outlier \n    !(excntry == \"VEN\" & eom == as.Date(\"2018-02-28\") & mkt_vw_exc < -1)]      # Something is clearly wrong\n\n# Labels \nchar_info <- readxl::read_xlsx(\"Factor Details.xlsx\",\n                               sheet = \"details\", range = \"A1:N300\") %>%\n  select(\"characteristic\"=abr_jkp, direction, significance, date_range = `in-sample period`, \"hxz_group\"=group) %>%\n  filter(!is.na(characteristic)) %>%\n  mutate(\n    direction = direction %>% as.integer,\n    sample_start = date_range %>% str_extract(\"^\\\\d+\") %>% as.integer(),\n    sample_end = date_range %>% str_extract(\"\\\\d+$\") %>% as.integer()\n  )\nbase_chars <- char_info$characteristic\n\n# Country Classification\ncountry_classification <- readxl::read_xlsx(\"Country Classification.xlsx\", \n                                            sheet = \"countries\", range = \"A1:C200\") %>%\n  select(excntry, msci_development, region) %>%\n  filter(!is.na(excntry)) %>%\n  setDT()\n\n# Regions\nregion_info <- tibble(\n  name = c(\"us\", \"developed\", \"emerging\", \"frontier\", \"world\", \"world_ex_us\"),\n  country_codes = list(\n    \"USA\",\n    country_classification[msci_development == \"developed\" & excntry != \"USA\"]$excntry,\n    country_classification[msci_development == \"emerging\"]$excntry,\n    country_classification[msci_development == \"frontier\"]$excntry,\n    country_classification$excntry,\n    
country_classification[excntry != \"USA\"]$excntry\n  ),\n  countries_min = c(1, rep(settings$countries_min, 3), 1, 3)\n)\n\n# Prepare Data --------------------------------------------------------\n# HML ----------------------\nhml <- fread(paste0(data_path, \"/hml.csv\"), colClasses = c(\"eom\"=\"character\"))\nhml[, eom := eom %>% as.Date(format = \"%Y-%m-%d\")]\n# Choose weighting\nhml[excntry == \"USA\", ret := case_when(\n  settings$weighting$us == \"vw\" ~ ret_vw,\n  settings$weighting$us == \"ew\" ~ ret_ew,\n  settings$weighting$us == \"vw_cap\" ~ ret_vw_cap\n)]\nhml[excntry != \"USA\", ret := case_when(\n  settings$weighting$global_ex_us == \"vw\" ~ ret_vw,\n  settings$weighting$global_ex_us == \"ew\" ~ ret_ew,\n  settings$weighting$global_ex_us == \"vw_cap\" ~ ret_vw_cap\n)]\n# Screens\nhml <- hml %>%\n  filter(\n    characteristic %in% base_chars,\n    eom >= settings$start_date & eom <= settings$end_date,\n    !is.na(ret) & n_stocks_min >= settings$n_stocks_min,  # Min Stocks is the big one, it removes close to 50% of the obs. with n_stocks_min>=10. 
Perhaps it's too stringent.\n    !(excntry %in% settings$country_excl)\n  ) %>%\n  select(-signal, -n_stocks_min)\n# Set direction as original study\nhml <- hml %>%\n  left_join(char_info %>% select(characteristic, direction), by = \"characteristic\") %>%\n  mutate(ret = ret * direction) %>%\n  select(-ret_vw, -ret_ew, -ret_vw_cap)\n# Ensure no Duplicates\nif(hml[, .N, by = .(characteristic, excntry, eom)][, max(N)] > 1) {\n  warning(\"THE DATA HAS DUPLICATES\")\n} \n\n# Regional Portfolios ------------------------------------------------\nregional_data <- function(data, countries, weighting, countries_min, months_min, size_grps = F) {\n  # Determine Country Weights\n  weights <- market_returns[, .(excntry, eom, mkt_vw_exc, \"country_weight\" = case_when(\n    weighting == \"market_cap\" ~ me_lag1,\n    weighting == \"stocks\" ~ as.double(stocks),\n    weighting == \"ew\" ~ 1)\n  )]\n  # Portfolio Return \n  pf <- data[excntry %in% countries]\n  pf <- weights[pf, on = .(excntry, eom)]\n  if (size_grps) {\n    pf <- pf[, .(\n      n = .N,\n      ret = sum(ret*country_weight) / sum(country_weight),\n      mkt_vw_exc = sum(mkt_vw_exc * country_weight) / sum(country_weight) \n    ), by = .(characteristic, size_grp, eom)]\n  } else {\n    pf <- pf[, .(\n      n = .N,\n      ret = sum(ret*country_weight) / sum(country_weight),\n      mkt_vw_exc = sum(mkt_vw_exc * country_weight) / sum(country_weight) \n    ), by = .(characteristic, eom)]\n  }\n  # Minimum Requirement: Countries\n  pf <- pf[n >= countries_min]\n  # Minimum Requirement: Months\n  pf[, months := .N, by = .(characteristic)]\n  pf <- pf[months >= months_min]\n  return(pf)\n}\nregional_pfs <- 1:nrow(region_info) %>% lapply(function(i) {\n  info <- region_info[i, ]\n  reg_pf <- hml %>% regional_data(countries = unlist(info$country_codes), weighting = settings$country_weighting,\n                                  countries_min = info$countries_min, months_min = settings$months_min)\n  reg_pf %>% mutate(region = 
info$name)\n}) %>% bind_rows() \n\n# Characteristic Managed Portfolios ----------------------\ncmp <- fread(paste0(data_path, \"/cmp.csv\"), colClasses = c(\"eom\"=\"character\"))\ncmp[, eom := eom %>% as.Date(format=\"%Y-%m-%d\")]\n\n# Screens\ncmp <- cmp %>%\n  rename(ret = ret_weighted) %>%\n  filter(\n    characteristic %in% base_chars,\n    eom >= settings$start_date & eom <= settings$end_date,\n    !is.na(ret) & signal_weighted != 0 & n_stocks >= settings$n_stocks_min * 2, \n    !(excntry %in% settings$country_excl)\n  ) %>%\n  select(-signal_weighted, -n_stocks)\n# Determine Direction\ncmp <- cmp %>%\n  left_join(char_info %>% select(characteristic, direction), by = \"characteristic\") %>%\n  mutate(ret = ret * direction)\n# Ensure no Duplicates\nif(cmp[, .N, by = .(characteristic, excntry, size_grp, eom)][, max(N)] > 1) {\n  warning(\"THE DATA HAS DUPLICATES\")\n} \n# Regional Portfolios\nregion_info_cmp <- region_info %>% filter(name == \"us\")\nregional_pfs_cmp <- 1:nrow(region_info_cmp) %>% lapply(function(i) {\n  info <- region_info_cmp[i, ]  # Index the filtered table, not region_info\n  reg_pf <- cmp %>% regional_data(countries = unlist(info$country_codes), weighting = settings$country_weighting,\n                                  countries_min = info$countries_min, months_min = settings$months_min, size_grps = T)\n  reg_pf %>% mutate(region = info$name)\n}) %>% bind_rows() \n\n# Regional Market Returns ---\nregional_mkt_ret <- 1:nrow(region_info) %>% lapply(function(i) {\n  info <- region_info[i, ]\n  mkt <- market_returns[excntry %in% unlist(info$country_codes), .(n = .N, market = sum(mkt_vw_exc * me_lag1) / sum(me_lag1)), by = eom]\n  mkt <- mkt[n >= info$countries_min][, n:= NULL]\n  mkt %>% mutate(region = info$name)\n}) %>% bind_rows() \n\nprint(paste(\"Total Characteristics:\", uniqueN(regional_pfs$characteristic)))\n"
  },
  {
    "path": "Analysis/2 - Determine Clusters.R",
    "content": "# Hierarchical Clustering ----------------------------------------------\nfactor_hcl <- function(cor_mat, linkage = \"ward.D\", k, direction_bars = T) {\n  dist_mat <- as.dist((1-cor_mat))\n  # dist_mat <- as.dist(sqrt((1-cor_mat)*2)) # With ward.D2 gives the same clusters\n  hcl <- dist_mat %>% \n    hclust(method=linkage)\n  print(str_c(\"Cophenetic Correlation between Dendrogram and Distance Matrix = \", \n              format(cor(cophenetic(hcl), dist_mat), digits = 2, nsmall = 2)))\n  \n  hcl_labels <- hcl %>%\n    cutree(k = k) %>%\n    as_tibble(rownames = \"characteristic\") %>%\n    setDT() %>% \n    setnames(c(\"characteristic\", \"hcl\"))\n\n  hcl_col <- rep(colours_theme[c(1, 2, 3, 4, 5, 7, 9, 11)], ceiling(k/8))[1:k]\n  \n  dend <- hcl %>% \n    as.dendrogram() %>%\n    dendextend::set(\"labels_col\", value = hcl_col, k=k) %>%\n    dendextend::set(\"branches_k_color\", value = hcl_col, k=k) %>%\n    dendextend::set(\"labels_cex\", value = 0.5) %>%\n    dendextend::set(\"branches_lty\", 1) %>%\n    dendextend::set(\"branches_lwd\", 0.2) \n  dend %>% plot(horiz=T)\n  \n  return_list <- list(\n    \"cor\" = cor_mat,\n    \"labels\" = hcl_labels,\n    \"dend\" = dend\n  )\n  \n  if (direction_bars) {\n    bar_colours <- tibble(\"characteristic\" = colnames(cor_mat)) %>%\n      left_join(char_info %>% select(characteristic, direction), by = \"characteristic\") %>% \n      mutate(col_dir = if_else(direction == 1, \"black\", \"white\"))\n    colored_bars(colors = bar_colours %>% select(col_dir), dend = dend, rowLabels = c(\"Long High\"), y_shift = 3, horiz = T)\n    return_list$bar_colours <- bar_colours\n  }\n  return(return_list)\n}\n\nhcl_input <- function(data, ret_type = \"alpha\", ...) 
{ # ret_type %in% c(\"raw\", \"alpha\")\n  data <- copy(data)  # Avoid modifying in place\n  if (ret_type == \"raw\") {\n    data[, ret_hcl := ret]\n  }\n  if (ret_type == \"alpha\") {\n    data[, ret_hcl := ret - mkt_vw_exc * cov(ret, mkt_vw_exc)/var(mkt_vw_exc), by = characteristic]\n  }\n  data %>%\n    select(characteristic, eom, ret_hcl) %>%\n    spread(key = characteristic, value = ret_hcl) %>%\n    select(-eom) %>%\n    cor(...)  \n}\n\n# US Clusters -----------\nclusters <- regional_pfs %>%\n  filter(region == settings$hcl$region & year(eom) >= settings$hcl$start_year) %>%\n  hcl_input(ret_type = settings$hcl$ret_type, method = settings$hcl$cor_method, use = \"pairwise.complete.obs\") %>%\n  factor_hcl(linkage = settings$hcl$linkage, k = settings$hcl$k, direction_bars = T)\n\n# Cluster Labels\nif (settings$weighting$us == \"vw_cap\" & settings$hcl$k == 13 & settings$hcl$region == \"us\" & settings$hcl$start_year == 1975) {\n  clusters$labels <- clusters$labels %>% \n    mutate(\n      hcl_label = case_when(\n        hcl == 1 ~ \"Low Leverage\",  \n        hcl == 2 ~ \"Investment\",     \n        hcl == 3 ~ \"Size\", \n        hcl == 4 ~ \"Value\",  \n        hcl == 5 ~ \"Quality\", \n        hcl == 6 ~ \"Low Risk\", \n        hcl == 7 ~ \"Debt Issuance\", \n        hcl == 8 ~ \"Seasonality\", \n        hcl == 9 ~ \"Accruals\",   \n        hcl == 10 ~ \"Profitability\",\n        hcl == 11 ~ \"Profit Growth\",\n        hcl == 12 ~ \"Short-Term Reversal\",\n        hcl == 13 ~ \"Momentum\",\n        TRUE ~ as.character(hcl))\n    ) \n} else {\n  clusters$labels <- clusters$labels %>% \n    mutate(hcl_label = hcl)\n}\n# Output\ncluster_labels <- clusters$labels %>%\n  select(-hcl)\n"
  },
  {
    "path": "Analysis/3 - Analysis.R",
    "content": "# Empirical Bayes Estimation --------------------------\n# search_list: c(regions, type, layers, size_grp)\nsearch_list <- list(\n  \"us\" = list(\"us\", \"hml\", 2), \n  \"developed\" = list(\"developed\", \"hml\", 2),\n  \"emerging\" = list(\"emerging\", \"hml\", 2),\n  \"all\" = list(c(\"us\", \"developed\", \"emerging\"), \"hml\", 3),\n  \"world\" = list(\"world\", \"hml\", 2),\n  \"world_ex_us\" = list(\"world_ex_us\", \"hml\", 2),\n  \"us_mega\" = list(\"us\", \"cmp\", 2, \"mega\"),\n  \"us_large\" = list(\"us\", \"cmp\", 2, \"large\"),\n  \"us_small\" = list(\"us\", \"cmp\", 2, \"small\"),\n  \"us_micro\" = list(\"us\", \"cmp\", 2, \"micro\"),\n  \"us_nano\" = list(\"us\", \"cmp\", 2, \"nano\")\n)\neb_est <- search_list %>% sapply(simplify = F, USE.NAMES = T, function(x) {\n  print(paste(\"Region:\", x[[1]]))\n  regions <- x[[1]]\n  if (x[[2]] == \"cmp\") {\n    base_data <- copy(regional_pfs_cmp) %>% filter(size_grp == x[[4]])\n  }\n  if (x[[2]] == \"hml\") {\n    base_data <- copy(regional_pfs)\n  }\n  # Prepare Data\n  data <- base_data %>% \n    filter(eom >= settings$start_date & eom <= settings$end_date) %>%\n    filter(region %in% regions) %>% \n    eb_prepare(\n      scale_alpha = settings$eb$scale_alpha, \n      overlapping = settings$eb$overlapping \n    )\n  # Run Empirical Bayes\n  op <- data %>% \n    emp_bayes( \n      cluster_labels = cluster_labels, \n      min_obs = settings$eb$min_obs,\n      fix_alpha = settings$eb$fix_alpha, \n      bs_cov = settings$eb$bs_cov, \n      layers = x[[3]], \n      shrinkage = settings$eb$shrinkage, \n      cor_type = settings$eb$cor_type, \n      bs_samples = settings$eb$bs_samples, \n      seed = settings$seed\n    )\n  # Output\n  return(op)\n})\n\n# Simulations EB vs. 
BY --------------\n# Simulations\nif (update_sim) {\n  # Values from Data\n  pairwise_cor <- eb_est$us$input$long %>%\n    select(characteristic, eom, ret_neu) %>%\n    spread(key = characteristic, value = ret_neu) %>%\n    select(-eom) %>%\n    cor(use = \"pairwise.complete.obs\") \n  \n  cor_value <- pairwise_cor %>%\n    as_tibble(rownames = \"char1\") %>%\n    gather(-char1, key = \"char2\", value = \"cor\") %>%\n    left_join(cluster_labels %>% select(characteristic, \"hcl1\" = hcl_label), by = c(\"char1\"=\"characteristic\")) %>%\n    left_join(cluster_labels %>% select(characteristic, \"hcl2\" = hcl_label), by = c(\"char2\"=\"characteristic\")) %>%\n    filter(char1 != char2) %>%\n    mutate(same_cluster = (hcl1 == hcl2)) %>%\n    group_by(same_cluster) %>%\n    summarise(avg_cor = mean(cor))\n  \n  # Time periods\n  med_months <- eb_est$us$input$long %>% group_by(characteristic) %>% summarise(n = n()) %>% pull(n) %>% median()\n  \n  data <- list(\n    yrs = round(med_months / 12),\n    cor_within = cor_value %>% filter(same_cluster == T) %>% pull(avg_cor) %>% round(digits = 2),\n    cor_across = cor_value %>% filter(same_cluster == F) %>% pull(avg_cor) %>% round(digits = 2)\n  )\n  # Simulation Settings\n  set.seed(settings$seed)\n  sim <- list(\n    \"alpha_0\" = 0,\n    \"t\" = 12*70,      # Median amount of data\n    \"clusters\" = 13,\n    \"fct_pr_cl\" = 10,\n    \"corr_within\" = 0.58, \n    \"corr_across\" = 0.02,\n    \"n_sims\" = 10000,\n    \"tau_c\" = c(0.01, seq(from = 0.05, to = 0.5, by = 0.05)),\n    \"tau_w\" = c(0.01, 0.2)\n  )\n  sim$se <- (10/sqrt(12))/sqrt(sim$t)\n  sim$n <- sim$clusters * sim$fct_pr_cl\n  \n  # Check settings are consistent with data [Alert if there is a significant difference]\n  if (abs(sim$t - data$yrs*12) > 12 | abs(sim$corr_within - data$cor_within) > 0.05 | abs(sim$corr_across - data$cor_across) > 0.05) {\n    warning(\"SIMULATION AND DATA VALUES ARE NOT CONSISTENT!\")\n    print(data)\n    
print(list(\"yrs\"=sim$t/12, \"corr_within\"=sim$corr_within, \"corr_across\"=sim$corr_across))\n  }\n  simulation <- sim_mt_control(sim_settings = sim)\n  simulation %>% saveRDS(file = paste0(object_path, \"/fdr_sim.RDS\"))\n} else {\n  simulation <- readRDS(file = paste0(object_path, \"/fdr_sim.RDS\"))\n}\n# False Discovery Rate\nmodel_fdr <- fdr_sim(t_low = 0, a_vec = eb_est$us$factor_mean, a_cov = eb_est$us$factor_cov, n_sim = 10000, seed = settings$seed)\n# Multiple Testing Adjustments\nmt <- multiple_testing(eb_all = eb_est$all, eb_world = eb_est$world)\n# Tangency Portfolios -----------------------------\n# Regions\ntpf_world <- eb_est$world$input$long %>% tpf_cluster(mkt_region = \"world\", orig_sig = T, min_date = settings$tpf$start$world, n_boots = settings$tpf$bs_samples, shorting = settings$tpf$shorting, seed = settings$seed)\ntpf_us <- eb_est$us$input$long %>% tpf_cluster(mkt_region = \"us\", orig_sig = T, min_date = settings$tpf$start$us, n_boots = settings$tpf$bs_samples, shorting = settings$tpf$shorting, seed = settings$seed)\ntpf_dev <- eb_est$developed$input$long %>% tpf_cluster(mkt_region = \"developed\", orig_sig = T, min_date = settings$tpf$start$developed, n_boots = settings$tpf$bs_samples, shorting = settings$tpf$shorting, seed = settings$seed)\ntpf_emer <- eb_est$emerging$input$long %>% tpf_cluster(mkt_region = \"emerging\", orig_sig = T, min_date = settings$tpf$start$emerging, n_boots = settings$tpf$bs_samples, shorting = settings$tpf$shorting, seed = settings$seed)\n# Size Groups\ntpf_size <- c(\"mega\", \"large\", \"small\", \"micro\", \"nano\") %>% lapply(function(x) {\n  eb_est[[paste0(\"us_\", x)]]$input$long %>% tpf_cluster(mkt_region = \"us\", orig_sig = T, min_date = settings$tpf$start$size_grps, n_boots = settings$tpf$bs_samples, shorting = settings$tpf$shorting, seed = settings$seed) %>%\n    mutate(size_grp = x)\n}) %>%\n  bind_rows()\n# Single Factor TPF\ntpf_factors <- prepare_tpf_factors(region = settings$tpf_factors$region, 
orig_sig_values = settings$tpf_factors$orig_sig, \n                                   start = settings$tpf_factors$start, scale = settings$tpf_factors$scale)\nopt_s <- tpf_factors$long %>% optimal_shrinkage(k = settings$tpf_factors$k)\n\n# Posterior Over time -----------------\not_region <- \"world\"\nif (update_post_over_time) {\n  for (fix_taus in c(T,F)) {\n    if (fix_taus) {\n      fixed_priors <- list(\n        \"alpha\" = eb_est[[ot_region]]$mle %>% filter(estimate == \"alpha\") %>% pull(ml_est), \n        \"tau_c\" = eb_est[[ot_region]]$mle %>% filter(estimate == \"tau_c\") %>% pull(ml_est), \n        \"tau_s\" = eb_est[[ot_region]]$mle %>% filter(estimate == \"tau_s\") %>% pull(ml_est)\n      )\n    } else {\n      fixed_priors <- NULL\n    }\n    \n    periods <- sort(unique(regional_pfs$eom))\n    periods <- periods[month(periods) == 12]  # Only estimate once per year\n    \n    time_chars <- regional_pfs %>% \n      filter(region == ot_region & eom <= as.Date(\"1960-12-31\")) %>% \n      group_by(characteristic) %>%\n      filter(n() >= settings$eb$min_obs) %>%\n      pull(characteristic) %>% \n      unique()\n    \n    posterior_over_time <- periods[periods >= as.Date(\"1960-12-31\")] %>% lapply(function(end_date) { \n      print(end_date)\n      # Prepare Data\n      data <- regional_pfs %>% \n        filter(characteristic %in% time_chars) %>%\n        filter(eom >= settings$start_date & eom <= end_date) %>%\n        filter(region == ot_region) %>% \n        eb_prepare(\n          scale_alpha = settings$eb$scale_alpha, \n          overlapping = settings$eb$overlapping \n        )\n      # Run Empirical Bayes\n      eb_act <- data %>% \n        emp_bayes( \n          cluster_labels = cluster_labels, \n          min_obs = settings$eb$min_obs,\n          fix_alpha = settings$eb$fix_alpha, \n          bs_cov = settings$eb$bs_cov, \n          layers = 2, \n          shrinkage = settings$eb$shrinkage, \n          cor_type = settings$eb$cor_type, \n          
 bs_samples = 1000, \n          priors = fixed_priors,\n          seed = settings$seed\n        )\n      eb_act$input <- NULL\n      eb_act$end_date <- end_date\n      return(eb_act)\n    }) \n    if (fix_taus) {\n      posterior_over_time %>% saveRDS(file = paste0(object_path, \"/posterior_over_time.RDS\"))\n    } else {\n      posterior_over_time %>% saveRDS(file = paste0(object_path, \"/posterior_over_time_flex.RDS\"))\n    }\n  }\n} \nposterior_over_time <- readRDS(file = paste0(object_path, \"/posterior_over_time.RDS\"))\nposterior_over_time_flex <- readRDS(file = paste0(object_path, \"/posterior_over_time_flex.RDS\"))\n\n# Size Dimension\neb_us_size <- c(\"mega\", \"large\", \"small\", \"micro\", \"nano\") %>% lapply(function(x) {\n  eb_est[[str_c(\"us_\", x)]]$factors %>% mutate(size_grp = x)\n}) %>% \n  bind_rows() %>%\n  mutate(\n    size_grp = str_to_title(size_grp),\n    size_grp = size_grp %>% factor(levels = c(\"Mega\", \"Large\", \"Small\", \"Micro\", \"Nano\"))\n  )\n\n# In-Sample / Out-of-Sample ------------------------------------\nis_oos <- c(\"pre\", \"post\", \"pre_post\") %>% sapply(simplify = F, USE.NAMES = T, function(t) {\n  data <- eb_est$us$input$long %>% prepare_is_oos(min_obs = 60, ret_scaled = \"all\", orig_group = T, type = t, print=T)\n  regs <- data %>%\n    group_by(characteristic, period, n_is, n_oos) %>%\n    nest() %>%\n    mutate(\n      fit = data %>% map(~lm(ret_adj ~ mkt_vw_exc, data = .x)),\n      tidied = fit %>% map(tidy)\n    ) %>% \n    unnest(tidied) %>%\n    filter(term == \"(Intercept)\") %>%\n    select(characteristic, period, n_is, n_oos, estimate) %>%\n    spread(key = period, value = estimate)\n  list(data=data, regs=regs)\n})\n\n# Economic Benefit of More Powerful Multiple Comparisons \nif (update_post_is) {\n  periods <- sort(unique(regional_pfs$eom))\n  periods <- periods[month(periods) == 12 & year(periods) >= 1959]  # Only estimate once per year\n  \n  posterior_is <- periods %>% lapply(function(end_date) { \n   
 print(paste(\"Date\", end_date, \"-\" , match(end_date, periods), \"out of\", length(periods)))\n    # Prepare Data\n    data <- regional_pfs %>% \n      filter(eom >= settings$start_date & eom <= end_date) %>%\n      filter(region == \"us\") %>% \n      eb_prepare(\n        scale_alpha = settings$eb$scale_alpha, \n        overlapping = settings$eb$overlapping \n      )\n    # Run Empirical Bayes\n    eb_act <- data %>% \n      emp_bayes( \n        cluster_labels = cluster_labels, \n        min_obs = settings$eb$min_obs,\n        fix_alpha = settings$eb$fix_alpha, \n        bs_cov = settings$eb$bs_cov, \n        layers = 2, \n        shrinkage = settings$eb$shrinkage, \n        cor_type = settings$eb$cor_type, \n        bs_samples = 1000,\n        seed = settings$seed\n      )\n    # Output \n    eb_act$factors %>% mutate(est_date = end_date)\n  }) %>% bind_rows()\n  posterior_is %>% saveRDS(file = paste0(object_path, \"/posterior_is.RDS\"))\n} else {\n  posterior_is <- readRDS(file = paste0(object_path, \"/posterior_is.RDS\"))\n}\nsig_oos_pfs <- posterior_is %>% trading_on_significance()\n\n# Harvey et al (2016) Simulation - Baseline ------------------------------\n# We use the baseline specification from table 5 - Panel A where the average correlation is 0 (the average correlation among factors in our data is 7%)\n# 1300 * (1-0.396)  The harvey et al numbers are m=1297 and m_true=783\nharvey_base <- list(\n  alpha_0 = 0,\n  t = 70*12,  \n  ret = 4.4 / 12,\n  vol = 10 / sqrt(12),\n  cl = 26,\n  cl_true = 16,\n  fct_pr_cl = 50,\n  corr_across = 0.02,\n  corr_within = 0.58,\n  tau_ws = c(0.21),         # We estimate it at 0.21\n  n_sims = 50,\n  fix_alpha = T\n)\nharvey_base$se <- harvey_base$vol / sqrt(harvey_base$t)\nharvey_base$n <- harvey_base$cl * harvey_base$fct_pr_cl\nharvey_base$n_true <- harvey_base$cl_true * harvey_base$fct_pr_cl\n\nif (update_harvey_baseline) {\n  harvey_base_res <- harvey_et_al_sim(sim_settings = harvey_base, seed = settings$seed)\n  
harvey_base_res <- list(\"settings\" = harvey_base, \"sim\" = harvey_base_res)\n  harvey_base_res %>% saveRDS(file = paste0(object_path, \"/harvey_res_baseline.RDS\"))\n} else {\n  harvey_base_res <- readRDS(file = paste0(object_path, \"/harvey_res_baseline.RDS\"))\n}\n\n# Harvey et al (2016) Simulation - Worst Case\n# We use the worst-case specification from table 5 - Panel B where the average correlation is 0 (the average correlation among factors in our data is 7%)\n# 2500 * (1-0.683) = 800  The Harvey et al numbers are m=2458 and m_true=779\nharvey_worst <- list(\n  alpha_0 = 0,\n  t = 70*12,                # Median number of years for US factor\n  ret = 4.4 / 12,\n  vol = 10 / sqrt(12),\n  cl = 50,\n  cl_true = 16,\n  fct_pr_cl = 50,\n  corr_across = 0.02,\n  corr_within = 0.58,\n  tau_ws = c(0.21),         # Same as our estimate of 0.21\n  n_sims = 50,\n  fix_alpha = T\n)\nharvey_worst$se <- harvey_worst$vol / sqrt(harvey_worst$t)\nharvey_worst$n <- harvey_worst$cl * harvey_worst$fct_pr_cl\nharvey_worst$n_true <- harvey_worst$cl_true * harvey_worst$fct_pr_cl\n\nif (update_harvey_worstcase) {\n  harvey_worst_res <- harvey_et_al_sim(sim_settings = harvey_worst, seed = settings$seed)\n  harvey_worst_res <- list(\"settings\" = harvey_worst, \"sim\" = harvey_worst_res)\n  harvey_worst_res %>% saveRDS(file = paste0(object_path, \"/harvey_res_worstcase.RDS\"))\n} else {\n  harvey_worst_res <- readRDS(file = paste0(object_path, \"/harvey_res_worstcase.RDS\"))\n}\n\n# Estimate parameters on OOS data -----------\nif (FALSE) {\n  reg <- \"us\"\n  # Prepare Data\n  data <- regional_pfs %>% \n    filter(eom >= settings$start_date & eom <= settings$end_date) %>%\n    filter(region==reg) %>%\n    left_join(char_info %>% select(characteristic, sample_start, sample_end), by = \"characteristic\") %>%\n    filter(year(eom) < sample_start | year(eom) > sample_end) %>%\n    eb_prepare(\n      scale_alpha = settings$eb$scale_alpha, \n      overlapping = 
settings$eb$overlapping # If we start in 1955-02-28 we lose 12 factors relative to starting in 1972-11-30\n    )\n  # Run Empirical Bayes\n  op <- data %>% \n    emp_bayes( \n      cluster_labels = cluster_labels, \n      min_obs = settings$eb$min_obs,\n      fix_alpha = settings$eb$fix_alpha, \n      bs_cov = settings$eb$bs_cov, \n      layers = 2, \n      shrinkage = settings$eb$shrinkage, \n      cor_type = settings$eb$cor_type, \n      bs_samples = settings$eb$bs_samples,\n      seed = settings$seed\n    )\n  # OOS-replication rate\n  op$factors %>%\n    left_join(char_info %>% select(characteristic, significance), by = \"characteristic\") %>%\n    filter(significance == 1) %>%\n    summarise(\n      n = n(),\n      sd_ols = sd(ols_est),\n      eb_rr = mean(p025 > 0),\n      eb_ols = mean(ols_est - 1.96*ols_se > 0)\n    )\n  # OOS Hyperparameters\n  op$mle %>% mutate(estimate = if_else(estimate == \"tau_s\", \"tau_w\", estimate))\n  # Full sample Tau's\n  eb_est[[reg]]$mle %>% mutate(estimate = if_else(estimate == \"tau_s\", \"tau_w\", estimate))\n  # Replication rate with OOS hyperparameters\n  repl_rate <- function(chars, alphas, sigma, alpha0, tau_c, tau_w, cluster_labels, char_info) {\n    # Alpha zero vector\n    alpha0_vec <- rep(alpha0, length(alphas))\n    # Signal Membership\n    cm <- tibble(characteristic = chars, \"alpha\"= alphas) %>% \n      left_join(cluster_labels, by = \"characteristic\") \n    m <- cm %>%\n      mutate(cm = 1) %>%\n      select(characteristic, hcl_label, cm) %>%\n      spread(key = hcl_label, value = cm) %>% \n      select(-characteristic) %>% \n      as.matrix()\n    m[is.na(m)] <- 0\n    mm <- m %*% t(m)\n    # Omega \n    omega <- diag(length(alphas)) * tau_w^2 + mm * tau_c^2\n    # Posterior\n    post_cov <- solve(solve(omega) + solve(sigma))\n    post_alpha <- post_cov %*% (solve(omega) %*% alpha0_vec + solve(sigma) %*% alphas)\n    # Replication Rate\n    tibble(\n      characteristic = chars,\n      alpha = 
drop(post_alpha),\n      se = sqrt(diag(post_cov))\n    ) %>%\n      left_join(char_info, by = \"characteristic\") %>%\n      filter(significance == T) %>%\n      summarise(\n        repl_rate = mean(alpha - 1.96*se > 0)\n      )\n  }\n  # Full Sample Hyper-parameters\n  repl_rate(\n    chars = eb_est[[reg]]$factors$characteristic, \n    alphas = eb_est[[reg]]$factors$ols_est, \n    sigma = eb_est[[reg]]$sigma,\n    alpha0 = eb_est[[reg]]$mle %>% filter(estimate == \"alpha\") %>% pull(ml_est),\n    tau_c = eb_est[[reg]]$mle %>% filter(estimate == \"tau_c\") %>% pull(ml_est),\n    tau_w = eb_est[[reg]]$mle %>% filter(estimate == \"tau_s\") %>% pull(ml_est),\n    cluster_labels = cluster_labels, \n    char_info = char_info\n  )\n  # OOS Hyper-parameters\n  repl_rate(\n    chars = eb_est[[reg]]$factors$characteristic, \n    alphas = eb_est[[reg]]$factors$ols_est, \n    sigma = eb_est[[reg]]$sigma,\n    alpha0 = op$mle%>% filter(estimate == \"alpha\") %>% pull(ml_est),\n    tau_c = op$mle %>% filter(estimate == \"tau_c\") %>% pull(ml_est),\n    tau_w = op$mle %>% filter(estimate == \"tau_s\") %>% pull(ml_est),\n    cluster_labels = cluster_labels, \n    char_info = char_info\n  )\n}\n"
  },
  {
    "path": "Analysis/4 - Output.R",
    "content": "# Determine Cluster Order\ncluster_order <- c(\"Accruals\", \"Debt Issuance\", \"Investment\", \"Short-Term Reversal\", \"Value\",\n                   \"Low Risk\", \"Quality\", \"Momentum\", \"Profitability\", \"Profit Growth\",\n                   \"Seasonality\", \"Size\", \"Low Leverage\")\n\n# Collect all output in list\noutput <- list(figures = list(), tables = list())\n\n# Headline Replication Rate\nheadline_rr <- eb_est$us$factors %>% left_join(char_info, by = \"characteristic\") %>% filter(significance == 1) %>% summarise(rr = mean(p025>0)) %>% pull(rr)\n\n# Figures --------------------------------------\n# HCL\noutput$figures$hcl_us <- function(tex = F) {\n  par(mar = c(3,2,1,10), cex = 1) \n  c <- 1.2\n  x <- 37/2\n  label_func <- function(x) unique(clusters$labels$hcl_label)[unique(clusters$labels$hcl_label) %>% str_detect(x)]\n  \n  clusters_tex <- copy(clusters$dend)\n  if (tex == T) {\n    labels(clusters_tex) <- labels(clusters_tex) %>% str_replace_all(\"_\", \"\\\\\\\\_\")\n  }\n  \n  clusters_tex %>% plot(horiz=T)\n  colored_bars(colors = clusters$bar_colours %>% select(col_dir), dend = clusters_tex, rowLabels = c(\"Long High\"), \n               y_shift = 11/2, horiz = T)\n  # Labels\n  if (settings$hcl$k == 13) {\n    text(x = x, y = 151, label_func(\"Short-Term Reversal\"), cex = c, col = colours_theme[5], adj = 0)   #gold-9\n    text(x = x, y = 141, label_func(\"Profitability\"), cex = c, col = colours_theme[4], adj = 0)  #lightgreen-7\n    text(x = x, y = 129, label_func(\"Low Risk\"), cex = c, col = colours_theme[3], adj = 0)  #purple-5\n    text(x = x, y = 112, label_func(\"Value\"), cex = c, col = colours_theme[2], adj = 0)       #orange-4\n    text(x = x, y = 90, label_func(\"Investment\"), cex = c, col = colours_theme[1], adj = 0)           #darkgreen-3\n    text(x = x, y = 73, label_func(\"Seasonality\"), cex = c, col = colours_theme[11], adj = 0)            #orange-4 \n    text(x = x, y = 63, label_func(\"Debt 
Issuance\"), cex = c, col = colours_theme[9], adj = 0)       #red-2\n    text(x = x, y = 57, label_func(\"Size\"), cex = c, col = colours_theme[7], adj = 0)            #blue-1  \n    text(x = x, y = 51, label_func(\"Accruals\"), cex = c, col = colours_theme[5], adj = 0)   #black-11\n    text(x = x, y = 43, label_func(\"Low Leverage\"), cex = c, col = colours_theme[4], adj = 0)            #lightgreen-7\n    text(x = x, y = 30, label_func(\"Profit Growth\"), cex = c, col = colours_theme[3], adj = 0)   #purple-5\n    text(x = x, y = 22, label_func(\"Momentum\"), cex = c, col = colours_theme[2], adj = 0)               #darkgreen-3\n    text(x = x, y = 12, label_func(\"Quality\"), cex = c, col = colours_theme[1], adj = 0)        #blue-1\n  }\n}\n\n# Cluster Validation\noutput$figures$hcl_us_val <- clusters$cor %>% cluster_val(labels = clusters$labels, op_format = \"pdf\")\n\n# Literature comparison\nprint(output$figures$lit_comp <- eb_est$us %>% plot_lit_comp(mt_res = mt, eb_world = eb_est$world, excl_insig=T))\n\n# Comparing Multiple Testing with Empirical Bayes\nc(output$figures$mt_factors, output$figures$mt_summary) %<-% plot_mt_eb_comp(\n  mt = mt, eb_all = eb_est$all, eb_us = eb_est$us, eb_developed = eb_est$developed, eb_world = eb_est$world,\n  eb_emerging = eb_est$emerging, mts = c(\"OLS\", \"Bonferroni\", \"BY\"), \n  regs = c(\"us\", \"developed\", \"emerging\", \"world\"), se_methods = c(\"OLS\", \"BY\", \"EB - Region\", \"EB - All\"),\n  se_regions = \"us\")\n\n# Replication Rate by Region\noutput$figures$gl_by_cluster <- plot_repl_region(eb_all = eb_est$all, cluster_order = cluster_order)\n\n# Global Factor Posterior\noutput$figures$gl_by_factor <- eb_est$world %>% plot_factor_post(orig_sig = T, cluster_order = cluster_order)\n\n# CI Many Factors\noutput$figures$ci_many_fcts <- plot_many_factors()\n\n# Tangency Portfolio - US\noutput$figures$tpf <- tpf_us %>% plot_tpf(cluster_order = cluster_order, ci_low = 0.05, ci_high = 0.95)\n\n# Tangency Portfolio - 
Regions\noutput$figures$tpf_regions <- plot_tpf_region(tpf_us = tpf_us, tpf_dev = tpf_dev, tpf_emer = tpf_emer, cluster_order = cluster_order, ci_low = 0.05, ci_high = 0.95)\n\n# Tangency Portfolio - Size Groups\noutput$figures$tpf_size <- tpf_size %>% plot_tpf_size(cluster_order = cluster_order, ci_low = 0.05, ci_high = 0.95)\n\n# Posterior over Time - Fixed Taus\noutput$figures$overtime <- posterior_over_time %>% plot_over_time(orig_sig = T, ols_incl = T, lb = 5, bw=F)\noutput$figures$overtime_bw <- posterior_over_time %>% plot_over_time(orig_sig = T, ols_incl = T, lb = 5, bw=T)\n\n# Posterior over Time - Flexible Taus\noutput$figures$overtime_flex <- posterior_over_time_flex %>% plot_over_time(orig_sig = T, ols_incl = F, lb=5, bw=F)\noutput$figures$overtime_flex_bw <- posterior_over_time_flex %>% plot_over_time(orig_sig = T, ols_incl = F, lb=5, bw=T)\n\n# Posterior over Time - Flexible Taus - Plot taus\noutput$figures$overtime_flex_taus <- posterior_over_time_flex %>% plot_taus_over_time() \n\n# By Size - Overall\noutput$figures$size_overall <- eb_us_size %>% plot_size_overall(flipped = T, text = F)\n\n# By Size - Clusters\noutput$figures$size_clusters <- eb_us_size %>% plot_size_clusters(cluster_order = cluster_order)\n\n# Model - False Discovery Rate\noutput$figures$model_fdr <- model_fdr %>% plot_fdr()\n\n# Simulation - False Discovery Rate\noutput$figures$sim_fdr <- simulation %>% plot_sim_fdr()\n\n# US versus world factor\noutput$figures$world_vs_us <- plot_world_vs_us(eb_us = eb_est$us, eb_world_ex_us = eb_est$world_ex_us)\n\n# In-sample vs. OOS and Post\nc(output$figures$is_pre, output$figures$is_post, output$figures$is_oos) %<-% plot_is_oos_post(is_oos = is_oos, type = \"GLS\") \n\n# In-sample vs. 
OOS and Post - quadratic\nc(output$figures$is_pre_quad, output$figures$is_post_quad, output$figures$is_oos_quad) %<-% plot_is_oos_post_quad(is_oos = is_oos, type = \"GLS\")\n\n# Effect Sizes\nc(output$figures$effect_world, output$figures$effect_regions, output$figures$effect_size) %<-% plot_effects(type = \"ols\", orig_sig = T, cluster_order = cluster_order)\n\n# Replication Rate by Cluster - US\noutput$figures$repl_cluster_us <- eb_est$us$factors %>% plot_repl_cluster(orig_sig = T, cluster_order = cluster_order)\n\n# Simulation benchmarked to Harvey et al (2016)\nif (eb_est$us$mle %>% filter(estimate == \"tau_s\") %>% pull(ml_est) %>% round(2) != 0.21) {\n  warning(\"Tau_w in Harvey et al simulation is inconsistent with data!!\")\n}\noutput$figures$sim_harvey <- plot_harvey(harvey_base_res = harvey_base_res, harvey_worst_res = harvey_worst_res, tau_ws = 0.21, act_rr = headline_rr)\noutput$figures$sim_harvey_robustness <- plot_harvey(harvey_base_res = harvey_base_res, harvey_worst_res = harvey_worst_res, tau_ws = c(0.1, 0.21, 0.3), act_rr = headline_rr)\n\n# TPF Single Factors \noutput$figures$tpf_factors_one_cluster <- tpf_factors$wide %>% plot_tpf_one_cluster(cluster_labels = cluster_labels, s = opt_s)\noutput$figures$tpf_factors_excl_one <- tpf_factors$wide %>% plot_tpf_excl_cl(cluster_labels = cluster_labels, s = opt_s)\noutput$figures$tpf_factors_imp <- tpf_factors$wide %>% plot_tpf_factor_imp(cluster_labels = cluster_labels, s = opt_s)\noutput$figures$tpf_factors_imp_cluster <- tpf_factors$wide %>% plot_tpf_factor_imp_cluster(cluster_labels = cluster_labels, s = opt_s)\ntpf_evol <- tpf_factors$long %>% plot_tpf_evolution(data_wide = tpf_factors$wide, char_info = char_info, orig_sig_values = settings$tpf_factors$orig_sig, s = opt_s)\noutput$figures$tpf_evolution <- tpf_evol$plot\n\n\n# Cumulative returns OOS - Marginally significant factors\noutput$figures$marg_sig_oos <- sig_oos_pfs %>% plot_sig_oos(sig_type = \"marg_sig\", cutoff_2012 = T, first_date = 
as.Date(\"1990-01-01\"), leg_pos = c(.85, .55)) # Also include table for caption\n\n# Cumulative returns OOS - EB significant factors\noutput$figures$eb_sig_oos <- sig_oos_pfs %>% plot_sig_oos(sig_type = \"eb_sig\", cutoff_2012 = F, first_date = as.Date(\"1990-01-01\"), leg_pos = c(.85, .40)) # Also include table for caption\n\n\n# Save Figures as Pictures -------------------------\nif (save_figures) {\n  # Base settings\n  output_fig <- function(path, name, format, width, height) {\n    file <- paste0(path, \"/\", name, \".eps\")  # Note: the format argument is currently unused; all figures are saved as .eps\n    ggsave(file=file, width = width, height = height, units = \"in\", dpi = 300)\n  }\n  \n  h <- 5\n  w <- 9\n  # For main text -----------------------------------------\n  # Figure 1 ------\n  output$figures$lit_comp\n  output_fig(path=output_path, name = \"fig1\", width = w + 1, height = h)\n  dev.off()\n  \n  output$figures$lit_comp + scale_fill_manual(values = c(\"black\", \"grey35\"))\n  output_fig(path=output_path, name = \"fig1_bw\", width = w + 1, height = h)\n  dev.off()\n  \n  # Figure 2 -----\n  output$figures$marg_sig_oos\n  output_fig(path=output_path, name = \"fig2\", width = w, height = h*2/3)\n  dev.off()\n  \n  output$figures$marg_sig_oos + scale_colour_manual(values = c(\"black\", \"black\"))\n  output_fig(path=output_path, name = \"fig2_bw\", width = w, height = h*2/3)\n  dev.off()\n  \n  # Figure 3 -----\n  output$figures$sim_fdr\n  output_fig(path=output_path, name = \"fig3\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$sim_fdr + scale_colour_manual(values = c(\"black\", \"black\", \"black\"))\n  output_fig(path=output_path, name = \"fig3_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  \n  # Figure 4 -----\n  output$figures$mt_factors + theme(text = element_text(size = 13), axis.text.x = element_blank())\n  output_fig(path=output_path, name = \"fig4\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$mt_factors + theme(text = 
element_text(size = 13), axis.text.x = element_blank()) + \n    scale_colour_manual(values = rep(\"black\", 3))\n  output_fig(path=output_path, name = \"fig4_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 5a -----\n  output$figures$size_overall + theme(\n    axis.title.x = element_text(size = 17),\n    axis.text.x = element_text(size = 17),\n    axis.text.y = element_text(size = 17))\n  output_fig(path=output_path, name = \"fig5a\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$size_overall + theme(\n    axis.title.x = element_text(size = 17),\n    axis.text.x = element_text(size = 17),\n    axis.text.y = element_text(size = 17)) + geom_col(fill=\"grey35\")\n  output_fig(path=output_path, name = \"fig5a_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 5b -----\n  output$figures$repl_cluster_us + theme(\n    axis.text.y = element_text(size = 13), \n    axis.text.x = element_text(size = 14), \n    text = element_text(size = 14))\n  output_fig(path=output_path, name = \"fig5b\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$repl_cluster_us + theme(\n    axis.text.y = element_text(size = 13), \n    axis.text.x = element_text(size = 14), \n    text = element_text(size = 14)) + scale_fill_manual(values = rep(\"grey35\", 13))\n  output_fig(path=output_path, name = \"fig5b_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 6 -----\n  output$figures$mt_summary + \n    theme(\n      text = element_text(size = 13),\n      axis.title.x = element_blank(), \n      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 9)\n    )\n  output_fig(path=output_path, name = \"fig6\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$mt_summary + \n    theme(\n      text = element_text(size = 13),\n      axis.title.x = element_blank(), \n      axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size = 9)\n   
 ) +\n    scale_fill_grey()\n  output_fig(path=output_path, name = \"fig6_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 7 -----\n  output$figures$world_vs_us + theme(text = element_text(size = 13)) \n  output_fig(path=output_path, name = \"fig7\", format = format, width = h, height = h)\n  dev.off()\n  \n  output$figures$world_vs_us + theme(text = element_text(size = 13)) + scale_colour_manual(values = c(\"black\", \"grey35\"))\n  output_fig(path=output_path, name = \"fig7_bw\", format = format, width = h, height = h)\n  dev.off()\n  \n  # Figure 8a -----\n  output$figures$is_pre + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  ) \n  output_fig(path=output_path, name = \"fig8a\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  output$figures$is_pre + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  )  + geom_point(colour = \"black\")\n  output_fig(path=output_path, name = \"fig8a_bw\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  # Figure 8b -----\n  output$figures$is_post  + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  ) \n  output_fig(path=output_path, name = \"fig8b\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  output$figures$is_post  + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  ) + geom_point(colour = \"black\")\n  output_fig(path=output_path, name = 
\"fig8b_bw\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  # Figure 8c -----\n  output$figures$is_oos  + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  ) \n  output_fig(path=output_path, name = \"fig8c\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  output$figures$is_oos  + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  )  + geom_point(colour = \"black\")\n  output_fig(path=output_path, name = \"fig8c_bw\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  # Figure 9 -----\n  output$figures$overtime\n  output_fig(path=output_path, name = \"fig9\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$overtime_bw\n  output_fig(path=output_path, name = \"fig9_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  \n  # Figure 10 -----\n  output$figures$sim_harvey + theme(\n    text = element_text(size = 12),\n    axis.text.x = element_text(size = 11),\n    axis.text.y = element_text(size = 11),\n    legend.text = element_text(size = 11)\n  )\n  output_fig(path=output_path, name = \"fig10\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$sim_harvey + theme(\n    text = element_text(size = 12),\n    axis.text.x = element_text(size = 11),\n    axis.text.y = element_text(size = 11),\n    legend.text = element_text(size = 11)\n  ) + scale_colour_manual(values = rep(\"black\", 3))\n  output_fig(path=output_path, name = \"fig10_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 11 -----\n  output$figures$gl_by_factor + theme(text = element_text(size = 13), legend.position = \"right\", axis.text.x = element_text(size = 5)) \n  
output_fig(path=output_path, name = \"fig11\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$gl_by_factor + theme(text = element_text(size = 13), legend.position = \"right\", axis.text.x = element_text(size = 5)) +\n    scale_colour_manual(values = rep(\"black\", 13))\n  output_fig(path=output_path, name = \"fig11_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  \n  # Figure 12 -----\n  output$figures$effect_size + theme(\n    axis.text.y = element_text(size = 13), \n    strip.text.x = element_text(size = 14),\n    axis.text.x = element_text(size = 12),\n    axis.title.x = element_text(size = 13)\n  )\n  output_fig(path=output_path, name = \"fig12a\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$effect_size + theme(\n    axis.text.y = element_text(size = 13), \n    strip.text.x = element_text(size = 14),\n    axis.text.x = element_text(size = 12),\n    axis.title.x = element_text(size = 13)\n  ) + scale_fill_manual(values = rep(\"grey35\", 13))\n  output_fig(path=output_path, name = \"fig12a_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 12b -----\n  output$figures$effect_regions + theme(\n    axis.text.y = element_text(size = 13), \n    strip.text.x = element_text(size = 13),\n    axis.text.x = element_text(size = 12),\n    axis.title.x = element_text(size = 13)\n  )\n  output_fig(path=output_path, name = \"fig12b\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$effect_regions + theme(\n    axis.text.y = element_text(size = 13), \n    strip.text.x = element_text(size = 13),\n    axis.text.x = element_text(size = 12),\n    axis.title.x = element_text(size = 13)\n  ) + scale_fill_manual(values = rep(\"grey35\", 13))\n  output_fig(path=output_path, name = \"fig12b_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 13 -----\n  output$figures$tpf + theme(text = element_text(size = 13), axis.text.x = 
element_text(size = 13), axis.text.y = element_text(size = 13))\n  output_fig(path=output_path, name = \"fig13\", format = format, width = w, height = h)\n  dev.off()\n  \n  output$figures$tpf + theme(text = element_text(size = 13), axis.text.x = element_text(size = 13), axis.text.y = element_text(size = 13)) +\n    scale_fill_manual(values = rep(\"grey35\", 14))\n  output_fig(path=output_path, name = \"fig13_bw\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure 14 -----\n  # Evolution of TPF\n  output$figures$tpf_evolution\n  output_fig(path=output_path, name = \"fig14\", format = format, width = w, height = h*0.8)\n  dev.off()\n  \n  output$figures$tpf_evolution\n  output_fig(path=output_path, name = \"fig14_bw\", format = format, width = w, height = h*0.8)\n  dev.off()\n  \n  \n  # Figure IA.1 ----- # NEED TO DO!!\n  if (FALSE) {\n    # Need to run everything with settings$weighting$us = \"vw\" and settings$weighting$global_ex_us=\"vw\"\n    output$figures$lit_comp\n    output_fig(path=output_path, name = \"figIA1\", format = format, width = w + 1, height = h)\n    dev.off()\n  }\n  \n  \n  # Figure IA.2 ------\n  output$figures$eb_sig_oos\n  output_fig(path=output_path, name = \"figIA2\", format = format, width = w, height = h*2/3)\n  dev.off()\n  \n  # Figure IA.3a -----\n  output$figures$is_pre_quad + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  ) \n  output_fig(path=output_path, name = \"figIA3a\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  # Figure IA.3b -----\n  output$figures$is_post_quad  + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  )  \n  output_fig(path=output_path, name = \"figIA3b\", format = 
format, width = w/3, height = w/3)\n  dev.off()\n  \n  # Figure IA.3c -----\n  output$figures$is_oos_quad  + theme(\n    text = element_text(size = 12), \n    plot.title = element_text(size = 10, vjust = -2), \n    plot.subtitle = element_text(size = 8, vjust = 0),\n    plot.margin = unit(c(0,0,0,0), \"cm\")\n  ) \n  output_fig(path=output_path, name = \"figIA3c\", format = format, width = w/3, height = w/3)\n  dev.off()\n  \n  \n  # Figure IA.4 -----\n  output$figures$overtime_flex + theme(text = element_text(size = 13))\n  output_fig(path=output_path, name = \"figIA4\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure IA.5 -----\n  output$figures$overtime_flex_taus + theme(text = element_text(size = 13), legend.text = element_text(size = 12))\n  output_fig(path=output_path, name = \"figIA5\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure IA.6 ----\n  output$figures$sim_harvey_robustness + theme(\n    text = element_text(size = 12),\n    axis.text.x = element_text(size = 11),\n    axis.text.y = element_text(size = 11),\n    legend.text = element_text(size = 11)\n  )\n  output_fig(path=output_path, name = \"figIA6\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure IA.7 ----\n  output$figures$gl_by_cluster  + theme(text = element_text(size = 13))\n  output_fig(path=output_path, name = \"figIA7\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure IA.8 ----\n  output$figures$size_clusters\n  output_fig(path=output_path, name = \"figIA8\", format = format, width = w, height = h+1)\n  dev.off()\n  \n  # Figure IA.9 ----\n  output$figures$tpf_regions  + theme(text = element_text(size = 13), axis.text.x = element_text(size = 13), axis.text.y = element_text(size = 13))\n  output_fig(path=output_path, name = \"figIA9\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure IA.10 ----\n  output$figures$tpf_size  + theme(\n    text = element_text(size = 13), \n    axis.text.x = 
element_text(size = 13), \n    axis.text.y = element_text(size = 11))\n  \n  output_fig(path=output_path, name = \"figIA10\", format = format, width = w, height = h*1.5)\n  dev.off()\n  \n  # Figure IA.11 ----\n  output$figures$tpf_factors_one_cluster\n  output_fig(path=output_path, name = \"figIA11\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure IA.12 ----\n  output$figures$tpf_factors_excl_one\n  output_fig(path=output_path, name = \"figIA12\", format = format, width = w, height = h)\n  dev.off()\n  \n  # Figure IA.13 ----\n  output$figures$tpf_factors_imp_cluster + theme(\n    axis.text.y = element_text(size = 6)\n  )\n  output_fig(path=output_path, name = \"figIA13\", format = format, width = w, height = h*1.5)\n  dev.off()\n  \n  # Figure IA.14 ----\n  output$figures$tpf_factors_imp + theme(\n    axis.text.y = element_text(size = 6)\n  )\n  output_fig(path=output_path, name = \"figIA14\", format = format, width = w, height = h*1.5)\n  dev.off()\n  \n  # Figure IA.15 (special, because not ggplot) ----\n  pdf(str_c(output_path, \"/figIA15.pdf\"), width = w, height = h*2+1.5)\n  output$figures$hcl_us()\n  dev.off()\n  \n  # Figure IA.16 (special, because not ggplot) ----\n  pdf(str_c(output_path, \"/figIA16.pdf\"), width = 6, height = 6)\n  output$figures$hcl_us_val()\n  dev.off()\n  \n  \n  \n  \n  # Other figures not included in paper --------\n  output$figures$ci_many_fcts + theme(axis.text.y = element_text(size = 13), text = element_text(size = 13))\n  output$figures$model_fdr + theme(text = element_text(size = 13))\n\n  # TABLES -----------------\n  # Estimated Taus\n  table_taus()\n  # Economic Benefit of more Power\n  sig_oos_pfs %>% table_economic_benefit()\n  # Factor Performance\n  table_factor_info()\n  \n  # Numbers mentioned in paper --------\n  # Bayesian Multiple Testing\n  bayes_sim <- 1000000\n  (fdr_196 <- fdr_fwer_rates(t_cutoff = 1.96, orig_sig = T, a_vec = eb_est$world$factor_mean, a_cov = eb_est$world$factor_cov, 
n_sim = bayes_sim, seed=settings$seed))\n  (fdr_278 <- fdr_fwer_rates(t_cutoff = 2.78, orig_sig = T, a_vec = eb_est$world$factor_mean, a_cov = eb_est$world$factor_cov, n_sim = bayes_sim, seed=settings$seed))\n  (true_factors_tbl <- true_factors(t_cutoff = 0, a_vec = eb_est$world$factor_mean, a_cov = eb_est$world$factor_cov, orig_sig = T, n_sim = bayes_sim, seed=settings$seed))\n  (rr_unc <- true_factors(t_cutoff = 1.96, a_vec = eb_est$world$factor_mean, a_cov = eb_est$world$factor_cov, orig_sig = T, n_sim = bayes_sim, seed=settings$seed))\n  \n  # Mentioned in introduction\n  paste0(\"Replication rate SE: \", round(rr_unc$sd*100, 2), \"%\")\n  paste0(\"Bayesian FDR: \", round(fdr_196$fdr_dist$mean*100, 2), \"%, with 95% CI of [\", round(fdr_196$fdr_dist$p025*100, 2), \"%, \", round(fdr_196$fdr_dist$p975*100, 2), \"%], SE: \", round(fdr_196$fdr_dist$sd*100, 2))\n  paste0(\"Bayesian FWER: \", round(fdr_196$fwer_dist$mean*100, 2), \"%, with SE of \", round(fdr_196$fwer_dist$sd*100, 2), \"%\")\n  paste0(\"Expected fraction of true factors: \", round(true_factors_tbl$mean*100, 2), \"%, with SE of \", round(true_factors_tbl$sd*100, 2), \"%\")\n  \n  \n  # BY cutoff\n  mt %>%\n    filter(method == \"BY\" & region == \"us\") %>%\n    mutate(sig = p<=0.05) %>%\n    group_by(sig) %>%\n    mutate(\n      max = max(abs(statistic)),\n      min = min(abs(statistic))\n    ) %>%\n    filter(abs(statistic)==max & sig==F | abs(statistic)==min & sig==T) %>%\n    ungroup() %>%\n    summarise(\n      by_cutoff = mean(abs(statistic))\n    ) %>% \n    print()\n  \n  # Change in Book equity factor\n  be_gr_us <- eb_est$us$factors %>% \n    filter(region == \"us\" & characteristic == \"be_gr1a\")\n  be_gr_all <- eb_est$all$factors %>% \n    filter(region == \"us\" & characteristic == \"be_gr1a\")\n  tibble(\n    characteristic = rep(\"be_gr1a\",2), \n    region = rep(\"US\", 2), \n    data = c(\"US\", \"Global\"), \n    post_mean = c(be_gr_us$post_mean, be_gr_all$post_mean), \n    post_vol 
= c(be_gr_us$post_sd, be_gr_all$post_sd),\n    t = post_mean / post_vol\n  ) %>% print()\n  \n  # IS / OOS \n  is_oos$post$regs %>%\n    ungroup() %>%\n    summarise(\n      is = mean(is),\n      post = mean(oos),\n      decline = post/is-1\n    ) %>% \n    print()\n  \n  c(\"pre\",\"post\",\"pre_post\") %>% lapply(function(x) {\n    is_oos[[x]]$regs %>% mutate(period = x)\n  }) %>%\n    bind_rows() %>%\n    group_by(period) %>%\n    summarise(\n      n = n(),\n      is = mean(is > 0),\n      oos = mean(oos > 0)\n    ) %>% \n    print()\n  \n  # Posterior over time width\n  posterior_over_time %>% plot_over_time(orig_sig = T, ols_incl = T, lb = 5)\n  \n  # Bayesian Multiple Testing\n  fdr_196$fdr_dist\n  fdr_278$fwer_fdr # FWER at t>2.78\n  true_factors_tbl\n  \n  # Replication rates in different size groups\n  eb_us_size %>% plot_size_overall(flipped = T, text = T)\n  \n  # Publication Bias\n  plot_harvey(harvey_base_res = harvey_base_res, harvey_worst_res = harvey_worst_res, tau_ws = 0.21, act_rr = headline_rr)\n  \n  # Correlations across size and region\n  eb_us_size %>%\n    select(characteristic, size_grp, ols_est) %>%\n    spread(key = size_grp, value = ols_est) %>%\n    summarise(\n      cor_mega_micro = cor(Mega, Micro, method = \"spearman\"),\n      cor_mega_nano = cor(Mega, Nano, method = \"spearman\")\n    ) %>%\n    print()\n  \n  eb_est$all$factors %>%\n    select(characteristic, region, ols_est) %>%\n    spread(key = region, value = ols_est) %>%\n    na.omit() %>%\n    summarise(\n      n = n(),\n      cor_us_dev = cor(us, developed, method = \"spearman\"),\n      cor_us_emer = cor(us, emerging, method = \"spearman\")\n    ) %>%\n    print()\n  \n  # TPF Evolution numbers\n  tpf_evol$data %>%\n    arrange(year) %>%\n    mutate(tpf_sr_l1 = dplyr::lag(tpf_sr)) %>%\n    filter(year %in% c(min(year), max(year), 2002, 1991)) %>% # char_info[characteristic %in% c(\"seas_2_5an\", \"oaccruals_at\")]\n    arrange(year) %>%\n    mutate(\n      new_inclusions = 
case_when(\n        year == 1971 ~ \"Market\",\n        year == 1991 ~ \"Accruals\",\n        year == 2002 ~ \"Seasonality\",\n        year == year(settings$end_date) ~ \"[All factors included]\",\n      )\n    ) %>%\n    print()\n  \n  # Average pairwise correlations\n  eb_est$us$input$long %>%\n    select(characteristic, eom, ret_neu) %>%\n    spread(key = characteristic, value = ret_neu) %>%\n    select(-eom) %>%\n    cor(use = \"pairwise.complete.obs\") %>%\n    as_tibble(rownames = \"char1\") %>%\n    gather(-char1, key = \"char2\", value = \"cor\") %>% \n    filter(char1 != char2) %>% \n    summarise(average_cor = mean(cor))\n  \n}"
  },
  {
    "path": "Analysis/Analysis.Rproj",
    "content": "Version: 1.0\n\nRestoreWorkspace: Default\nSaveWorkspace: Default\nAlwaysSaveHistory: Default\n\nEnableCodeIndexing: Yes\nUseSpacesForTab: Yes\nNumSpacesForTab: 2\nEncoding: UTF-8\n\nRnwWeave: Sweave\nLaTeX: pdfLaTeX\n"
  },
  {
    "path": "Analysis/README.md",
    "content": "## Overview\nThis repository contains code that is used for the paper [Is There a Replication Crisis in Finance?](https://onlinelibrary.wiley.com/doi/full/10.1111/jofi.13249) by Jensen, Kelly and Pedersen (2023). The code used to create the underlying dataset can be found at [https://github.com/bkelly-lab/GlobalFactor](https://github.com/bkelly-lab/GlobalFactor). Please cite Jensen, Kelly and Pedersen (2023) if using the code or data. \n\n## How To Run the Code\n\n1. Start with factor returns in each country. Obtain the files `hml.csv`, `cmp.csv`, and `market_returns.csv` using one of the following methods.\n\t1. Download the factor returns used in the paper [here](https://www.dropbox.com/sh/wcrjok1qyxtrasi/AABZ90GDCUvIzDzijt8Qoo3ha?dl=0). \n\t1. Download the latest version of the factor returns [here](https://www.dropbox.com/sh/xq278bryrj0qf9s/AABUTvTGok91kakyL07LKyQoa?dl=0). \n\t1. Generate the factor returns from scratch by following the steps in [GlobalFactors](https://github.com/bkelly-lab/ReplicationCrisis/tree/master/GlobalFactors); the files are the output from `portfolios.R`. \n2. Copy the code from this repository to a local folder. \n3. Open `main.R` in R.\n4. Ensure that the current working directory is the folder from step 2. To check this, write `getwd()` in the console. To change the working directory, use `setwd()`.\n5. Run `main.R`.\n\n## Outputs\n1. The console prints key numbers used in the paper as well as the paper tables in LaTeX format.\n2. If `save_figures=TRUE` (default), the folder in `output_path` will contain figures of the same format used in the paper. \n\n## Optional Settings \n1. `data_path` is the folder with the portfolio data from step 1 (default: current working directory/Data).\n2. `object_path` is a folder where R objects can be saved for faster iterations (default: current working directory/Objects).\n3. 
`output_path` is a folder where figures can be saved (default: current working directory/Figures).\n4. `save_figures` should be `TRUE` if you wish to save figures, otherwise `FALSE` (default: TRUE).\n5. `settings` controls settings for the analysis, including the start and end date, the portfolio weighting scheme, the cluster settings, the empirical Bayes settings, and the tangency portfolio settings (default: the settings in the paper).\n\n## Notes\n\nThe code is divided into five separate R scripts: `0 - Functions.R` contains the project functions, `1 - Prepare Data.R` prepares the data, `2 - Determine Clusters.R` finds statistical clusters based on return data, `3 - Analysis.R` analyzes the data, and `4 - Output.R` generates tables and figures based on the analysis. Everything is sourced from `main.R`, which also contains user-defined control variables.\n"
  },
  {
    "path": "Analysis/country_stats.R",
    "content": "library(xtable)\nlibrary(tidyverse)\nlibrary(data.table)\n\n# Data [output from SAS code, already screened with obs_main=1, primary_sec=1, exch_main=1]\ndata_path <- \"../../Data/Characteristics\"\ncountry_files <- list.files(data_path)\ncountries <- country_files %>% lapply(function(file) {\n  fread(paste0(data_path, \"/\", file), select = c(\"excntry\", \"id\", \"eom\", \"me\", \"size_grp\", \"ret_local\"))\n}) %>% rbindlist()\ncountries[, eom := eom %>% as.character() %>% lubridate::fast_strptime(format = \"%Y%m%d\") %>% as.Date()]\n# Aggregate by month\ncountry_info <- countries[!is.na(me) & !is.na(ret_local), .(\n  n = .N,\n  n_nano = sum(size_grp == \"nano\"),\n  n_mega = sum(size_grp == \"mega\"),\n  me = sum(me),\n  me_p50 = median(me)\n), by = .(excntry, eom)]\n# Country Classification\ncountry_classification <- readxl::read_xlsx(\"Country Classification.xlsx\", \n                                            sheet = \"countries\", range = \"A1:C200\") %>%\n  select(excntry, msci_development) %>%\n  filter(!is.na(excntry)) %>%\n  setDT()\ncountry_info <- country_classification[country_info, on = \"excntry\"]\n\n# Table\ntable_country <- function(country_info, info_date) {\n  tbl_caption <- paste(\"The table shows summary statistics by the country where a security is listed.\\\\\", \n                       \"We include common stocks that are the primary security of the underlying firm, traded on a standard exchange, with non-missing return and market equity data.\\\\\",\n                       \"\\\\textit{Country} is the ISO code of the underlying exchange country.\\\\\", \n                       \"For further information, see \\\\href{https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes}{https://en.wikipedia.org/wiki/List\\\\_of\\\\_ISO\\\\_3166\\\\_country\\\\_codes.}\",\n                       \"\\\\textit{MSCI} shows the MSCI classification of each country as of January 7th 2021.\", \n                       \"For the most recent 
classification, see \\\\href{https://www.msci.com/market-classification}{https://www.msci.com/market-classification}.\",\n                       \"\\\\textit{Start} is the first date with a valid observation.\",\n                       \"In the next 4 columns, the data is shown as of December 31st 2020.\\\\\",\n                       \"\\\\textit{Stocks} is the number of stocks available.\\\\\", \n                       \"\\\\textit{Mega stocks} is the number of stocks with a market cap above the 80th percentile of NYSE stocks.\\\\\",\n                       \"\\\\textit{Total Market Cap} is the aggregate market cap in million USD.\\\\\",\n                       \"\\\\textit{Median MC} is the median market cap in million USD.\")\n  \n  # Country summary\n  country_info[, start_date := min(eom), by = excntry]\n  country_stats <- country_info[eom==info_date]\n  countries_add <- country_info[!(excntry %in% country_stats$excntry)][, .(excntry, msci_development, eom = info_date, start_date, n=0, n_mega=0, n_nano=0, me=0, me_p50=0)] %>% unique()# Add countries without data by info_date\n  country_stats <- country_stats %>% rbind(countries_add) %>% arrange(-me) %>% select(-c(n_nano, eom))\n  \n  total <- country_stats %>%\n    ungroup() %>%\n    summarise(\n      excntry = \"All\",\n      msci_development = \"\",\n      start_date = NA,\n      n = sum(n),\n      n_mega = sum(n_mega),\n      me = sum(me),\n      me_p50 = NA_real_\n    )\n  \n  country_stats %>%\n    bind_rows(total) %>%\n    mutate(\n      n = n %>% prettyNum(big.mark = \",\", digits = 0),\n      n_mega = n_mega %>% prettyNum(big.mark=\",\", digits = 0),\n      start_date = as.character(start_date),\n      me = me %>% formatC(format = \"e\", digits = 2),\n      me_p50 = me_p50 %>% prettyNum(big.mark=\",\", digits = 0),\n      msci_development = msci_development %>% str_to_title(),\n      \" \"= ' ',\n    ) %>%\n    select(excntry, msci_development, start_date, ` `, everything()) %>%\n    rename(\"Country\" 
= excntry, \"MSCI\" = msci_development, \"Start\" = start_date, \"Stocks\" = n, \"Mega Stocks\" = n_mega,\n           \"Total Market Cap\" = me, \"Median MC\" = me_p50) %>%\n    xtable(digits = 0, align = \"lllllrrrr\", caption = tbl_caption) %>%\n    print(include.rownames = T, floating = FALSE, latex.environments = \"center\", hline.after=c(-1, 0), \n          tabular.environment = \"longtable\", size=\"\\\\fontsize{10pt}{12pt}\\\\selectfont\")\n}\n\n# Output for paper--------------------\n# Remember:\n# Only copy from \"& country ...\" and down.\n# In line 94: \n#  - Delete \"94\"\n#  - Delete \"NA\"\n#  - Add hline above and below total\n#  - Make \"All\" in \\\textbf{}\ncountry_info[!(excntry %in% c(\"ZWE\", \"VEN\"))] %>% table_country(info_date = as.Date(\"2020-12-31\"))  # We exclude Zimbabwe and Venezuela due to data issues\n\n# Nano Caps in the US\ncountry_info[eom == as.Date(\"2020-12-31\") & excntry == \"USA\", .(n, n_nano, nano_prop = n_nano / n)] %>% print()\n"
  },
  {
    "path": "Analysis/hxz_decomp.R",
    "content": "library(lubridate)\nlibrary(tidyverse)\nlibrary(data.table)\noptions(dplyr.summarise.inform = FALSE)\n\n# User Input -----------------------\n# Paths\ndata_path <- \"../../Data\" # Set to path with global characteristics data.\n# Start Date\nstart <- as.Date(\"2020-12-31\")\n\n# Data -----------------------------\n# Characteristics\nchar_info <- readxl::read_xlsx(\"Factor Details.xlsx\",\n                               sheet = \"details\", range = \"A1:N300\") %>%\n  filter(!is.na(abr_jkp)) %>%\n  select(\"characteristic\"=abr_jkp, direction, significance) %>%\n  mutate(direction = direction %>% as.numeric()) %>%\n  setDT()\nchars <- char_info$characteristic\n\n# NYSE Cutoff \nnyse_size_cutoffs <- fread(paste0(data_path, \"/nyse_cutoffs.csv\"), colClasses = c(\"eom\"=\"character\"))\nnyse_size_cutoffs[, eom := as.Date(eom, format = \"%Y%m%d\")]\n\n# Return Cutoffs\nret_cutoffs <- fread(paste0(data_path, \"/return_cutoffs.csv\"), colClasses = c(\"eom\"=\"character\"))\nret_cutoffs[, eom := as.Date(eom, format = \"%Y%m%d\")]\nret_cutoffs[, eom_lag1 := floor_date(eom, unit = \"month\") - 1]  # Because we use ret_exc_lead1m\n\n# Data\ndata <- fread(paste0(data_path, \"/Characteristics/usa.csv\"), \n              select = c(\"excntry\", \"id\", \"eom\", \"source_crsp\", \"comp_exchg\", \"crsp_exchcd\", \"size_grp\", \"ret_exc\", \"ret_exc_lead1m\", \"me\", chars), colClasses = c(\"eom\"=\"character\"))\ndata[, eom := as.Date(lubridate::fast_strptime(eom, format = \"%Y%m%d\"))]\n# ME CAP\ndata <- nyse_size_cutoffs[, .(eom, nyse_p80)][data, on = \"eom\"]\ndata[, me_cap := pmin(me, nyse_p80)][, nyse_p80 := NULL]\n# Screens\ndata <- data[!is.na(size_grp) & !is.na(me) & !is.na(ret_exc_lead1m)]\n\n# Winsorize Compustat Returns\ndata <- ret_cutoffs[, .(eom, \"p001\"=ret_exc_0_1, \"p999\"=ret_exc_99_9)][data, on = \"eom\"]\ndata[source_crsp == 0 & ret_exc > p999, ret_exc := p999]\ndata[source_crsp == 0 & ret_exc < p001, ret_exc := p001]\ndata[, c(\"p001\", 
\"p999\") := NULL]\ndata <- ret_cutoffs[, .(\"eom\" = eom_lag1, \"p001\"=ret_exc_0_1, \"p999\"=ret_exc_99_9)][data, on = \"eom\"]\ndata[source_crsp == 0 & ret_exc_lead1m > p999, ret_exc_lead1m := p999]\ndata[source_crsp == 0 & ret_exc_lead1m < p001, ret_exc_lead1m := p001]\ndata[, c(\"source_crsp\", \"p001\", \"p999\") := NULL]\n\n# Create 1 month separated observations\nreturns <- tidyr::crossing(\"id\" = unique(data$id), \"eom\" = unique(data$eom)) %>% setDT()\nreturns <- data[, .(id, eom, ret_exc)][returns, on = .(id, eom)]\nreturns[, start := min(eom[!is.na(ret_exc)]), by = id]\nreturns <- returns[eom >= start][, start := NULL]\nreturns[, last := floor_date(max(eom[!is.na(ret_exc)]), unit = \"month\") + months(12) - 1, by = id]  # To avoid lookahead bias, _horizon_ months after last obs\nreturns <- returns[eom <= last][, last := NULL]\nreturns %>% setorder(id, eom)\n\npf_func <- function(chars, pfs, bps, bp_min_n, min_stocks, horizon) {\n  # Realized Returns \n  ret_lead <- 1:horizon %>% lapply(function(h) {\n    if (h==1) {\n      r <- data[, .(id, eom, lead = 1, ret_exc = ret_exc_lead1m)][!is.na(ret_exc)]\n    } else {\n      r <- returns[, .(eom, lead = h, ret_exc = dplyr::lead(ret_exc, n = h)), by = id][!is.na(ret_exc)]\n    }\n    r[, eom_ret := ceiling_date(eom, unit = \"months\")+months(h)-1]\n  }) %>% rbindlist()\n  # Portfolios \n  chars %>% lapply(function(x) {\n    print(paste0(\"   \" , x, \": \", match(x, chars), \" out of \", length(chars)))\n    data[, var := as.double(get(x))]\n    sub <- data[!is.na(var), .(id, eom, var, size_grp, me, me_cap, crsp_exchcd, comp_exchg)]\n    # Portfolio Assignment\n    if (bps == \"nyse\") {\n      sub[, bp_stock := (crsp_exchcd == 1 & is.na(comp_exchg)) | (comp_exchg == 11 & is.na(crsp_exchcd))]\n    }\n    if (bps == \"non_mc\") {\n      sub[, bp_stock := (size_grp %in% c(\"mega\", \"large\", \"small\"))]\n    }\n    sub[, bp_n := sum(bp_stock), by = eom]\n    sub <- sub[bp_n >= bp_min_n] # require at least bp_min_n 
stocks for breakpoints\n    sub[, cdf := ecdf(var[bp_stock == T])(var), by = eom]\n    sub[, min_cdf := min(cdf), by = eom]\n    sub[cdf == min_cdf, cdf := 0.00000001] # To ensure that the lowest value is in portfolio 1   \n    sub[, pf := ceiling(cdf*pfs), by = eom]  \n    sub[pf == 0, pf := 1]  # Happens when non-bp stocks extend beyond bp stock range\n    # Align with returns\n    sub <- sub[, .(id, eom, me, me_cap, pf)][ret_lead, on = .(id, eom)][!is.na(pf)]\n    # Returns \n    pf_returns <- sub[, .(\n      characteristic = x,\n      n = .N,\n      ret_ew = mean(ret_exc),\n      ret_vw = sum(ret_exc * me) / sum(me),\n      ret_vw_cap = sum(ret_exc * me_cap) / sum(me_cap)\n    ), by = .(pf, eom_ret, lead)]\n    # HML\n    pf_returns[, .(\n      characteristic = x,\n      n_stocks_min = as.integer(min(n[pf==pfs], n[pf==1])),\n      ret_ew = ret_ew[pf == pfs] - ret_ew[pf == 1],\n      ret_vw = ret_vw[pf == pfs] - ret_vw[pf == 1],\n      ret_vw_cap = ret_vw_cap[pf == pfs] - ret_vw_cap[pf == 1]\n    ), by = .(eom_ret, lead)][!is.na(ret_ew) & n_stocks_min >= min_stocks]\n  }) %>% rbindlist()\n}\n\n# Output \nsystem.time(hml_nonmc3 <- chars %>% pf_func(pfs = 3, bps = \"non_mc\", bp_min_n = 5, min_stocks = 5, horizon = 12))  # 47 min\nsystem.time(hml_nyse10 <- chars %>% pf_func(pfs = 10, bps = \"nyse\", bp_min_n = 5, min_stocks = 5, horizon = 12))\n\nrr <- list(hml_nonmc3, hml_nyse10) %>% lapply(function(dt) {\n  repl_data <- char_info[dt, on = \"characteristic\"] %>%\n    filter(eom_ret <= start) %>%\n    mutate(\n      ret_ew = ret_ew*direction,\n      ret_vw = ret_vw*direction,\n      ret_vw_cap = ret_vw_cap*direction\n    ) %>%\n    pivot_longer(c(ret_ew, ret_vw, ret_vw_cap), names_to = \"type\", values_to = \"ret\")\n  \n  repl_data <- c(1, 6, 12) %>% lapply(function(h) {\n    repl_data %>%\n      filter(lead %in% 1:h) %>%\n      group_by(characteristic, eom_ret, type, significance) %>%\n      filter(n() == h) %>%\n      summarise(ret = mean(ret)) %>%\n      
ungroup() %>%\n      mutate(horizon = h)\n  }) %>% bind_rows()\n  \n  rr_func <- function(sample) {\n    sample %>%\n      group_by(characteristic, type, horizon) %>%\n      summarise(\n        t = mean(ret)/(sd(ret)/sqrt(n()))\n      ) %>%\n      group_by(type) %>%\n      summarise(\n        n = n(),\n        rr = mean(t > 1.96)\n      )\n  }\n  \n  new_factors <- c(\n    \"ret_3_1\", \"ret_9_1\",  \"ret_12_7\",  \"corr_1260d\",  \"rmax5_21d\", \"rmax5_rvol_21d\",\n    \"ni_be\", \"ocf_at\", \"ocf_at_chg1\", \"mispricing_perf\", \"mispricing_mgmt\",  \"qmj\",\n    \"qmj_prof\", \"qmj_growth\",  \"qmj_safety\")\n  \n  \n  # Our Baseline Raw Return\n  baseline <- repl_data %>%\n    filter(horizon == 1 & eom_ret <= start) %>%\n    rr_func() %>%\n    mutate(name = \"Baseline\")\n  # Difference in horizons\n  hor_diff <- repl_data %>%\n    rr_func() %>%\n    mutate(name = \"Three Horizons\")\n  # Difference in sample period\n  sample_diff <- repl_data %>%\n    filter(eom_ret >= as.Date(\"1967-01-01\") & eom_ret <= as.Date(\"2016-12-31\")) %>%\n    rr_func() %>%\n    mutate(name = \"Three Horizons, Shorter Sample\")\n  # Differences in Factors\n  fct_diff <- repl_data %>%\n    filter(eom_ret >= as.Date(\"1967-01-01\") & eom_ret <= as.Date(\"2016-12-31\")) %>%\n    filter(!(characteristic %in% new_factors)) %>%\n    rr_func() %>%\n    mutate(name = \"Three Horizons, Shorter Sample, Difference in Factors\")\n  bind_rows(baseline, sample_diff, hor_diff, fct_diff) %>%\n    arrange(type, name)\n})\n\n# Decomposition  \nterc_base <- rr[[1]] %>% filter(name == \"Baseline\" & type == \"ret_vw_cap\") %>% pull(rr)\nterc_base_vw <- rr[[1]] %>% filter(name == \"Baseline\" & type == \"ret_vw\") %>% pull(rr)\nterc_hor <- rr[[1]] %>% filter(name == \"Three Horizons\" & type == \"ret_vw\") %>% pull(rr)\nterc_sample <- rr[[1]] %>% filter(name == \"Three Horizons, Shorter Sample\" & type == \"ret_vw\") %>% pull(rr)\nterc_factors <- rr[[1]] %>% filter(name == \"Three Horizons, Shorter 
Sample, Difference in Factors\" & type == \"ret_vw\") %>% pull(rr)\ndec_factors <- rr[[2]] %>% filter(name == \"Three Horizons, Shorter Sample, Difference in Factors\" & type == \"ret_vw\") %>% pull(rr)\n\n# From vw_cap to vw\nterc_base-terc_base_vw\n# Multiple Horizons\nterc_base_vw-terc_hor\n# Shorter sample \nterc_hor-terc_sample\n# New Factors\nterc_sample-terc_factors\n# Deciles instead of terciles and change of BP\nterc_factors-dec_factors\n# Explained Difference\n(expl_rr <- terc_base-((terc_base-terc_base_vw)+(terc_base_vw-terc_hor)+(terc_hor-terc_sample)+(terc_sample-terc_factors)+(terc_factors-dec_factors)))\nexpl_rr-0.35\n"
  },
  {
    "path": "Analysis/main.R",
"content": "library(cowplot)\nlibrary(directlabels)\nlibrary(xtable)\nlibrary(zeallot)\nlibrary(dendextend)\nlibrary(RColorBrewer)\nlibrary(rsample)\nlibrary(lubridate)\nlibrary(tidyverse)\nlibrary(data.table)\noptions(dplyr.summarise.inform = FALSE)\n\n# How To --------------------------------------\n# Paths\n# - data_path:                Folder that contains market_returns.csv, hml.csv and cmp.csv generated from portfolio.R                \n# - object_path:              Folder to save objects to. Retrieved when update_*==F\n# - output_path:              Folder to save figures in. Not necessary if save_figures==F\n# Save\n# - save_figures:             Should figures be saved in output_path?\n# Update\n# - update_sim:               Simulations for figure 2 (Simulation Comparison of False Discovery Rates)\n# - update_post_over_time:    Posterior calculations for figure 8 (US Factor Alpha Posterior Distribution over Time)\n# - update_post_is:           Data for regression in table E.1 (The Economic Benefit of More Powerful Tests)\n# - update_harvey_baseline:   Data for figure 9\n# - update_harvey_worstcase:  Data for figure F.1\n\n# User Input -----------------------\n# Paths\ndata_path <- \"Data\" \nobject_path <- \"Objects\"\noutput_path <- \"Figures\"\n# Save\nsave_figures <- T\n# Update\nupdate_sim <- T              \nupdate_post_over_time <- T\nupdate_post_is <- T\nupdate_harvey_baseline <- T  \nupdate_harvey_worstcase <- T \n# Settings\nsettings <- list(\n  seed = 1,\n  start_date = as.Date(\"1925-12-31\"),     \n  end_date = as.Date(\"2022-12-31\"),       # Important that end_date <= Last_CRSP_UPDATE\n  country_excl = c(\"ZWE\", \"VEN\"),         # Countries are excluded due to data issues\n  weighting =  list(                      # Which weighting scheme to use? 
In c(\"ew\", \"vw\", \"vw_cap\")\n    us = 'vw_cap',\n    global_ex_us = 'vw_cap'\n  ),                   \n  n_stocks_min = 5,                       # Minimum number of stocks on each side of the portfolio\n  months_min = 5 * 12,                    # Minimum number of observations a factor needs to be included  \n  country_weighting = \"market_cap\",       # How to weight countries? In (\"market_cap\", \"stocks\", \"ew\")\n  countries_min = 3,                      # Minimum number of countries necessary in a regional portfolio\n  clusters = \"hcl\",                       # Which cluster method to use? In c(\"manual\", \"hcl\")\n  hcl = list(\n    ret_type = \"alpha\",                   # Which return to use in clustering: In c(\"raw\", \"alpha\")\n    cor_method = \"pearson\",               # Which cor method to base distance upon\n    linkage    = \"ward.D\",                # Which linkage method to use\n    k          = 13,                      # How many clusters to colour\n    region     = \"us\",                    # Region to use for clusters\n    start_year = 1975                     # Start year for cluster data   \n  ),\n  eb = list(\n    scale_alpha = T,\n    overlapping = F,\n    min_obs = 5 * 12,\n    fix_alpha = T, \n    bs_cov = T, \n    shrinkage = 0, \n    cor_type = \"block_clusters\", \n    bs_samples = 10000                    # Set to 10000 for paper\n  ),\n  tpf = list(\n    start = list(\n      world = as.Date(\"1952-01-01\"),\n      us = as.Date(\"1952-01-01\"),\n      developed = as.Date(\"1987-01-01\"),\n      emerging = as.Date(\"1994-01-01\"),\n      size_grps = as.Date(\"1963-01-01\")   # Dictated by start of nano-caps\n    ),\n    bs_samples = 10000,                   # Number of bootstrap samples [10,000 for paper]\n    shorting = F                          # Should shorting be allowed?\n  ),\n  tpf_factors = list(\n    region = \"us\",\n    orig_sig = T,                         # Only include originally significant factors: T, include 
all: c(T,F)\n    start = as.Date(\"1972-01-31\"),\n    scale = T,                            # Scale to ex-post volatility of 10%?\n    k = 5                                 # Number of Folds for cross-validation exercise\n  )\n)\n\n# Layout Settings ---------------\ntheme_set(theme_classic())\ncolours_theme <- c(\"#0C6291\", \"#A63446\", RColorBrewer::brewer.pal(8, \"Dark2\"), \n                   \"darkslategrey\", \"blue3\", \"red3\", \"purple2\", \"yellow2\", \"aquamarine\",\n                   \"grey\", \"salmon\", \"antiquewhite\", \"chartreuse\") \nscale_colour_discrete <- function(...) {\n  scale_colour_manual(..., values = colours_theme)\n}\nscale_fill_discrete <- function(...) {\n  scale_fill_manual(..., values = colours_theme)\n}\nscale_linetype_discrete <- function(...) {\n  scale_linetype_manual(..., values = c(\"solid\", \"longdash\", \"dotted\", \"dashed\", \"dotdash\", \"twodash\"))\n}\n\n# Run Scripts ----------\nsource(\"0 - Functions.R\", echo = T)\nsource(\"1 - Prepare Data.R\", echo = T)\nsource(\"2 - Determine Clusters.R\", echo = T)\nsource(\"3 - Analysis.R\", echo = T)\nsource(\"4 - Output.R\", echo = T)\n"
  },
  {
    "path": "GlobalFactors/CHANGELOG.md",
"content": "# CHANGELOG.md\nThis change log keeps track of changes to the underlying data set. In brackets, we highlight versions of importance. The version with _factor data set_ is the basis of the factor portfolios we upload at [https://jkpfactors.com/](https://jkpfactors.com/). The version with _paper data set_ is the basis of [Jensen, Kelly and Pedersen (2023)](https://onlinelibrary.wiley.com/doi/full/10.1111/jofi.13249).\n\n## 05-03-2025 [Factor data set]\n__Changes__:\n- Added 2024 data\n- Moved world_data_prelim from scratch to work folder\n- Updated market returns macro to add a capped value option\n- Corrected error in o-score calculation\n- Added end_date filter in macros saving daily and monthly returns \n\n## 11-03-2024 \n__Changes__:\n- Added 2023 data\n- Updated the country classification according to the latest MSCI market classification\n\n## 03-03-2023 \n__Changes__:\n- Added 2022 data\n- Added 'me' (market equity) and 'ret' (total return) and removed 'source_crsp' from daily return files\n\n__Impact__:\n- Replication rate: 83.2% \n\n## 30-06-2022 [Paper data set] \n\n__Changes__:\n- Changed name of \"Skewness\" cluster to \"Short-Term Reversal\"\n\n__Impact__:\n- Replication rate: 82.4% \n\n## 08-02-2022 \n\n__Changes__:\n- Fixed an error in the construction of intrinsic_value. Previously, we failed to scale intrinsic_value by market equity as done in Frankel and Lee (1998). We call the new characteristic ival_me and keep intrinsic_value in the data set. The alpha of the new factor based on ival_me is significantly different from zero, while the factor based on intrinsic_value is insignificant.\n\n__Impact__:\n- Replication rate: 82.4% (added 2020 data)\n\n\n## 16-11-2021 \n\n__Changes__:\n- Changed return cutoffs to depend on all stocks, instead of only stocks from CRSP.\n- Added monthly and daily returns to the output folder. \n- Changed the 'source' (character) column to 'source_crsp' (integer). 
source_crsp is 1 if CRSP is the return data source.\n- Changed the 'id' column from character to integer. For stocks from CRSP, the id is just their permno. For stocks from Compustat, the first digit is 1 if the stock is traded on a US exchange, 2 if it's traded on a Canadian exchange, and 3 otherwise. The next two digits are the IID from Compustat, and the remaining six digits are the gvkey.  \n- Adapted the primary_sec column such that all observations from CRSP have primary_sec=1. \n- Previously, we treated a zero return as a missing observation. Now, we have removed this screen, such that a zero return is treated like any other return. \n- Previously, we winsorized daily returns, market equity, and dollar volume, before creating characteristics based on daily stock market data. Now, we have removed this winsorization, and daily characteristics are based on the raw data. \n- Added the option to create daily factor returns in the portfolios.R code.\n- Added the option to create industry returns in the portfolios.R code.\n\n__Impact__:\n- Replication rate: 83.2%\n\n## 27-08-2021 \n\n__Changes__:\n- Fixed a bug regarding how daily delisting returns from CRSP are incorporated.\n- Added indfmt='FS' to the international accounting data. \n\n__Impact__:\n- Replication rate: 83.2%\n\n## 14-06-2021 \n\n__Changes__:\n\n- We changed the winsorization scheme. First, we removed the 0.01%/99.9% winsorization of market equity in all countries. Second, we removed the winsorization of returns from the CRSP database. For Compustat returns, we set returns above (below) the 99.9% (0.01%) percentile of CRSP returns in the same month to that level. In other words, we base our winsorization of Compustat data on CRSP data from the same month. \n- We made several changes to the code for easier usability. Notably, the updated `main.sas` file returns a zip folder called \"output\" in the scratch folder, which contains all data necessary to reproduce the results in the paper.  
\n\n__Impact__:\n\n- Replication rate: 83.2%\n- The revisions impacted all factors slightly, but the overall results are qualitatively very similar. \n\n## 02-19-2021 \n\n__Changes__:\n\n- Previously, we did not exclude securities that are only traded over the counter. In the new version of the data set, we include an indicator column \"exch_main\" to exclude non-standard exchanges. In the US, the main exchanges are AMEX, NASDAQ and NYSE. Outside of the US, we exclude over the counter exchanges, stock connect exchanges in China and cross-country exchanges such as BATS Chi-X Europe. The documentation includes a full list of the excluded exchanges.  \n- Included SIC, NAICS and GICS industry codes.\n\n__Impact__:\n\n- Replication rate: 84.0%. \n- Excluding non-standard exchanges mainly affected the US. By December 2019, the number of stocks in the US dropped from 5,256 to 4,102 (-22%) after adding the new 'exch_main' screen. The excluded securities are mainly tiny stocks traded over the counter, so the aggregate market cap only dropped by 2%. The change also mostly affected post-2000 data, because over the counter observations in Compustat are very rare before this point in time.    \n- The change had a small effect outside of the US, because of our 'primary_sec' screen. It's very rare for Compustat to identify a security traded on a non-standard exchange as the primary security of a firm. \n- Because the changes mainly affected tiny stocks, our results did not change much. Across the 153 factors in the US, Developed and Emerging regions, the change in posterior monthly alpha ranged from -0.06% to +0.07%.\n\n## 02-15-2021\n\n__Changes__:\n\n- A bug caused _ivol_ff3_21d_, _iskew_ff3_21d_, _ivol_hxz4_21d_ and _iskew_hxz4_21d_ to require 17 (ff3) and 18 (hxz) observations for a valid estimate. Consistent with our original intent, we now require at least 15 observations for a valid estimate.    \n\n__Impact__:\n\n- Replication Rate: 84.0%. 
\n- The changes had a negligible effect on the affected factors.\n\n## 02-01-2021 \n\n__Changes__:\n\n- Fixed a small bug in the bidask_hl() macro.\n- When creating asset pricing factors (FF and HXZ), we previously required at least 5 stocks in a sub-portfolio (e.g. small stocks with high BM) for the observation to be valid. This led to missing observations in the 1950s for small stocks with low bm. We lowered this requirement to at least 3 stocks. Furthermore, when creating asset pricing factors, we changed the breakpoints to be based on NYSE stocks in the US instead of non-microcap stocks. Outside of the US, breakpoints are still based on non-microcap stocks.\n  \n__Impact__:\n\n- Replication Rate: 84.0% \n- The _bidaskhl_21d_ factor changed slightly but is still significantly negative in all regions. The US factor IR changed from -0.11 to -0.09.\n- The change in asset pricing factors generally didn't affect the results much.\n\n## 01-25-2021\n__Changes__:\n\n- Changed residual momentum characteristics (resff3_12_1 & resff3_6_1) to be scaled with the standard deviation of residuals consistent with Blitz, Huij and Mertens (2011). \n- Fixed an error in creating _qmj_prof_. The issue was that _oaccruals_at_ used the value instead of the z-score of ranks. This effectively meant that accruals didn't impact the profitability score. \n- Fixed an error for annual seasonality characteristics (factor names starting with seas_ and ending with _an). There was a bug in the screening procedure which meant that the characteristic for one stock could use information from an unrelated stock. \n- Rounding issues when converting a .csv file to an Excel file caused the zero_trades_* variables to not have any decimals, which made the turnover tie-breaker ineffective.\n- Standardized unexpected earnings (niq_su) and sales (saleq_su) are computed as the actual value minus the expected value (standardized by the standard deviation of this change). 
Before, the expected value was computed as the mean yearly change over the last 8 quarters added to the last quarterly value. Now the expected value is the same mean yearly change, but added to the quarterly value 4 quarters ago, consistent with Jegadeesh and Livnat (2006).\n  \n__Impact__:\n\n- Replication Rate: 84.0%\n- The change to the residual momentum variables made them slightly weaker. As an example, the monthly OLS information ratio of the US resff3_12_1 factor dropped from 0.33 to 0.28. \n- The _qmj_prof_ change made _qmj_prof_ and _qmj_ slightly stronger. As an example, the monthly OLS information ratio of the US _qmj_prof_ factor increased from 0.16 to 0.22.  \n- The seasonality fix didn't have a large qualitative impact for the US factors, but did have a large positive effect outside of the US. As an example, the OLS IR of the developed market _seas_11_15an_ factor changed from -0.06 to 0.11.   \n- The zero_trades factors were missing in the developed market because of too few non-missing observations. The developed market zero trades factors are generally strong and the IR ranges from 0.07 to 0.20. Similarly, the Emerging market zero trades factors were slightly negative before. After, the factors are strong with IRs ranging from 0.14 to 0.17. The US market zero trades factors improved slightly. The IR of zero_trades_21d has the most notable increase, from 0.05 to 0.09.   \n- The standardized unexpected sales (saleq_su) variable went from a significant IR of 0.12 to an insignificant IR of 0.05. This explains the drop in the replication rate. On the other hand, niq_su increased from 0.11 to 0.19.\n\n## 01-15-2021 \n__Changes__:\n\n  - Base data set used in the first online version of Jensen, Kelly and Pedersen (2021).\n  \n__Impact__:\n\n- Replication Rate: 84.9%\n"
  },
  {
    "path": "GlobalFactors/Cluster Labels.csv",
    "content": "characteristic,cluster\nage,Low Leverage\naliq_at,Investment\naliq_mat,Low Leverage\nami_126d,Size\nat_be,Low Leverage\nat_gr1,Investment\nat_me,Value\nat_turnover,Quality\nbe_gr1a,Investment\nbe_me,Value\nbeta_60m,Low Risk\nbeta_dimson_21d,Low Risk\nbetabab_1260d,Low Risk\nbetadown_252d,Low Risk\nbev_mev,Value\nbidaskhl_21d,Low Leverage\ncapex_abn,Debt Issuance\ncapx_gr1,Investment\ncapx_gr2,Investment\ncapx_gr3,Investment\ncash_at,Low Leverage\nchcsho_12m,Value\ncoa_gr1a,Investment\ncol_gr1a,Investment\ncop_at,Quality\ncop_atl1,Quality\ncorr_1260d,Seasonality\ncoskew_21d,Seasonality\ncowc_gr1a,Accruals\ndbnetis_at,Seasonality\ndebt_gr3,Debt Issuance\ndebt_me,Value\ndgp_dsale,Quality\ndiv12m_me,Value\ndolvol_126d,Size\ndolvol_var_126d,Profitability\ndsale_dinv,Profit Growth\ndsale_drec,Profit Growth\ndsale_dsga,Profit Growth\nearnings_variability,Low Risk\nebit_bev,Profitability\nebit_sale,Profitability\nebitda_mev,Value\nemp_gr1,Investment\neq_dur,Value\neqnetis_at,Value\neqnpo_12m,Value\neqnpo_me,Value\neqpo_me,Value\nf_score,Profitability\nfcf_me,Value\nfnl_gr1a,Debt Issuance\ngp_at,Quality\ngp_atl1,Quality\ninv_gr1,Investment\ninv_gr1a,Investment\niskew_capm_21d,Short-Term Reversal\niskew_ff3_21d,Short-Term Reversal\niskew_hxz4_21d,Short-Term Reversal\nival_me,Value\nivol_capm_21d,Low Risk\nivol_capm_252d,Low Risk\nivol_ff3_21d,Low Risk\nivol_hxz4_21d,Low Risk\nkz_index,Seasonality\nlnoa_gr1a,Investment\nlti_gr1a,Seasonality\nmarket_equity,Size\nmispricing_mgmt,Investment\nmispricing_perf,Quality\nncoa_gr1a,Investment\nncol_gr1a,Debt Issuance\nnetdebt_me,Low Leverage\nnetis_at,Value\nnfna_gr1a,Debt Issuance\nni_ar1,Debt Issuance\nni_be,Profitability\nni_inc8q,Quality\nni_ivol,Low Leverage\nni_me,Value\nniq_at,Quality\nniq_at_chg1,Profit Growth\nniq_be,Profitability\nniq_be_chg1,Profit Growth\nniq_su,Profit Growth\nnncoa_gr1a,Investment\nnoa_at,Debt 
Issuance\nnoa_gr1a,Investment\no_score,Profitability\noaccruals_at,Accruals\noaccruals_ni,Accruals\nocf_at,Profitability\nocf_at_chg1,Profit Growth\nocf_me,Value\nocfq_saleq_std,Low Risk\nop_at,Quality\nop_atl1,Quality\nope_be,Profitability\nope_bel1,Profitability\nopex_at,Quality\npi_nix,Seasonality\nppeinv_gr1a,Investment\nprc,Size\nprc_highprc_252d,Momentum\nqmj,Quality\nqmj_growth,Quality\nqmj_prof,Quality\nqmj_safety,Quality\nrd_me,Size\nrd_sale,Low Leverage\nrd5_at,Low Leverage\nresff3_12_1,Momentum\nresff3_6_1,Momentum\nret_1_0,Short-Term Reversal\nret_12_1,Momentum\nret_12_7,Profit Growth\nret_3_1,Momentum\nret_6_1,Momentum\nret_60_12,Investment\nret_9_1,Momentum\nrmax1_21d,Low Risk\nrmax5_21d,Low Risk\nrmax5_rvol_21d,Short-Term Reversal\nrskew_21d,Short-Term Reversal\nrvol_21d,Low Risk\nsale_bev,Quality\nsale_emp_gr1,Profit Growth\nsale_gr1,Investment\nsale_gr3,Investment\nsale_me,Value\nsaleq_gr1,Investment\nsaleq_su,Profit Growth\nseas_1_1an,Profit Growth\nseas_1_1na,Momentum\nseas_11_15an,Seasonality\nseas_11_15na,Seasonality\nseas_16_20an,Seasonality\nseas_16_20na,Accruals\nseas_2_5an,Seasonality\nseas_2_5na,Investment\nseas_6_10an,Seasonality\nseas_6_10na,Low Risk\nsti_gr1a,Seasonality\ntaccruals_at,Accruals\ntaccruals_ni,Accruals\ntangibility,Low Leverage\ntax_gr1a,Profit Growth\nturnover_126d,Low Risk\nturnover_var_126d,Profitability\nz_score,Low Leverage\nzero_trades_126d,Low Risk\nzero_trades_21d,Low Risk\nzero_trades_252d,Low Risk\n"
  },
  {
    "path": "GlobalFactors/GlobalFactors.Rproj",
    "content": "Version: 1.0\n\nRestoreWorkspace: Default\nSaveWorkspace: Default\nAlwaysSaveHistory: Default\n\nEnableCodeIndexing: Yes\nUseSpacesForTab: Yes\nNumSpacesForTab: 2\nEncoding: UTF-8\n\nRnwWeave: Sweave\nLaTeX: pdfLaTeX\n"
  },
  {
    "path": "GlobalFactors/MD",
    "content": ""
  },
  {
    "path": "GlobalFactors/README.md",
"content": "## Overview\nThis repository contains code that creates a dataset of global stock returns and characteristics. The dataset was created for the paper [Is There a Replication Crisis in Finance?](https://onlinelibrary.wiley.com/doi/10.1111/jofi.13249) by Jensen, Kelly and Pedersen (2023). Please cite this paper if you are using the code or data: \n```\n@article{JensenKellyPedersen2023,\n\tauthor = {Jensen, Theis Ingerslev and Kelly, Bryan and Pedersen, Lasse Heje},\n\ttitle = {Is There a Replication Crisis in Finance?},\n\tjournal = {The Journal of Finance},\n\tvolume = {78},\n\tnumber = {5},\n\tpages = {2465-2518},\n\tyear = {2023}\n}\n```\nFollow this [link](https://www.dropbox.com/sh/61j1v0sieq9z210/AACdJ68fs5_eT_eJMunwMBWia?dl=0) for detailed documentation of the data sets.\n\n## How to Generate Global Stock Returns and Stock Characteristics\n_Note: The data created in this section can be downloaded directly from WRDS ([link](https://wrds-www.wharton.upenn.edu/pages/get-data/contributed-data-forms/global-factor-data/))_\n\nThe .sas files construct the stock-level characteristics and factor portfolio returns for all countries. The code requires a connection to WRDS servers. Below we outline our preferred approach to creating the dataset:\n\n1. Connect to the [SAS studio server hosted by WRDS](https://wrds-cloud.wharton.upenn.edu/SASStudio/index?locale=en_US).  \n2. Create a folder called _Global Data_ in your home directory and upload `main.sas`, `project_macros.sas`, `market_chars.sas`, `accounting_chars.sas`, `char_macros.sas`, and `ind_identification.sas` to this folder.\n3. Create an empty folder in your institution's scratch folder. The scratch folder is located at \"Server Files and Folders/Files/scratch/\\<institution name\\>\".\n4. Open `main.sas`. \n5. Replace line 8 with the path to the scratch folder created in step 3. \n6. Run `main.sas`.  \n\nFor step 6, we suggest that you make a background submit. 
To do so, locate the `main.sas` file in your _Global Data_ folder, right-click the file and select \"Background Submit\". Running the code will take around 24-48 hours. \nWhen the code has finished running, your scratch folder will contain a folder called _output_. This folder contains daily and monthly stock returns, daily and monthly market returns, market equity breakpoints from NYSE and return breakpoints to use for winsorization. \nIf `save_csv=1` on line 12, the folder will also hold the data country-by-country in .csv format. Importantly, these files only include the main observation of ordinary common stocks that are the primary securities of the underlying firm, and traded on a main exchange. To include all stocks, modify the 'save_main_data_csv()' macro. \n\nNote that the space in the scratch directory is shared across users in your institution. It's best practice to delete files after downloading what you need. Furthermore, files in the scratch directory are deleted after 7 days.\n\n## How to Generate Global Factor Returns\n_Note: The data created in this section can be downloaded directly from [JKPfactors.com](https://jkpfactors.com/)_\n\nThe file `portfolio.R` generates country level factor returns based on the dataset generated by `main.sas`:\n\n1. Run `main.sas`.\n2. Save country level .csv files by running the macro `save_main_data_csv()` in `main.sas`.\n3. Download the folder named _output_ to a local directory and unzip.\n4. Open `portfolio.R`.\n5. User defined inputs:\n\t1. Replace the variable `data_path` with the path to the folder where you have unzipped the content of _output_.\n\t2. Replace the variable `output_path` with the path to the folder where you wish to save the factor returns. \n\t3. Replace the variable `legacy_path` with the path to the folder where you wish to save earlier versions of the factor output. If you don't wish to save earlier versions, set the variable to `NULL`.\n\t4. 
The variable `countries` controls which countries to generate factor returns for. By default, it selects all the countries in `data_path`.\n\t5. The variable `chars` controls which characteristics to use for creating factor returns. By default, it is set to the 153 characteristics from Jensen, Kelly and Pedersen (2022). Any column in the characteristic dataset can be used as the basis of a factor. \n\t6. The variable `settings` controls the end date, the number of portfolios, which breakpoints to use, the data source, whether to winsorize Compustat returns, the minimum number of stocks required for breakpoints to be valid, and whether to create rank-weighted (characteristic-managed) portfolios in each of the 5 size groups. The default settings are the same as those used in Jensen, Kelly and Pedersen (2022).\n6. Run `portfolio.R`.\n\n### Output\n**Files**\n- `pfs.csv`: We sort stocks into 3 portfolios based on non-microcap breakpoints. Portfolio 1 (3) holds the stocks with the lowest (highest) value of the characteristic.\n- `hml.csv`: Long/short portfolios that are long stocks with high values of the underlying characteristic (portfolio 3 from pfs.csv) and short stocks with low values (portfolio 1 from pfs.csv). \n- `lms.csv`: Long/short portfolios based on hml.csv but with the signing convention used in Jensen, Kelly and Pedersen (2022). In particular, we sign factors so they are consistent with the literature. For example, we go long low asset growth stocks and short high asset growth stocks, because the literature generally finds that low asset growth stocks outperform. \n- `cmp.csv`: Rank-weighted (characteristic-managed) portfolios within mega, large, small, micro, and nano cap stocks in the US.\n- `market_returns.csv`: Monthly market returns in all the countries we cover.\n- `Country Factors`: Folder with lms.csv in country-by-country files for easier usability. 
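The long/short and signing conventions above can be sketched as follows (hypothetical numbers and a hypothetical `direction` flag for illustration only; the repository's actual implementation is in `portfolio.R`):

```python
# Hypothetical value-weighted returns for the three characteristic-sorted
# portfolios of one factor in one month (portfolio 1 = lowest values).
pf_ret = {1: 0.010, 2: 0.013, 3: 0.021}

# hml.csv convention: long portfolio 3 (high), short portfolio 1 (low).
hml = pf_ret[3] - pf_ret[1]

# lms.csv convention: flip the sign when the literature goes long the LOW
# characteristic value, e.g. asset growth. The direction flag here is an
# illustrative assumption, not a column in the real data.
direction = -1  # -1 for low-minus-high factors such as asset growth
lms = direction * hml

print(f"hml: {hml:.3f}, lms: {lms:.3f}")
```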
Countries are saved by their [ISO Alpha-3 codes](https://www.nationsonline.org/oneworld/country_code_list.htm).\n- `Regional Factors`: Folder with regional factors based on lms.csv and the method in Jensen, Kelly and Pedersen (2022).\n\n**Variables**\n- `excntry`: Country where the securities are listed, as an ISO Alpha-3 code.\n- `eom`: End-of-month date of the month used to calculate returns.\n- `characteristic`: Name of the characteristic; refer to Table J.1 in Jensen, Kelly and Pedersen (2022).\n- `region`: Region/MSCI country development classification of the included factors.\n- `size_grp`: Size group used to create the rank-weighted factors.\n- `pf`: Portfolio identifier.\n- `n`: Total number of stocks in the portfolio.\n- `n_stocks`: Total number of stocks in the long and short portfolios.\n- `n_stocks_min`: Minimum number of stocks across the long and short portfolios. For example, if the long portfolio has 10 stocks and the short portfolio has 40 stocks, n_stocks=50 and n_stocks_min=10.\n- `n_countries`: Number of countries included in the regional portfolio.\n- `signal` (pfs.csv): Median characteristic value in the portfolio.\n- `signal` (hml.csv, lms.csv): Difference between the median characteristic values in the long and short portfolios.\n- `signal_weighted`: Rank-weighted signal.\n- `ret_ew`: Return with equal weights.\n- `ret_vw`: Return with value weights.\n- `ret_vw_cap`: Return with capped value weights, as used in Jensen, Kelly and Pedersen (2022).\n- `ret_weighted`: Rank-weighted return.\n- `me_lag1`: Total market equity within a country at the beginning of the month.\n- `dolvol_lag1`: Total dollar volume traded within a country in the previous month.\n- `stocks`: Stocks included in the market portfolio.\n- `mkt_vw_lcl`: Value-weighted market return in local currency.\n- `mkt_ew_lcl`: Equally weighted market return in local currency.\n- `mkt_vw`: Value-weighted market return in USD.\n- `mkt_ew`: Equally weighted market return in USD.\n- `mkt_ew_exc`: Equally weighted market excess return in 
USD.\n- `mkt_vw_exc` (market_returns.csv): Value-weighted market excess return in USD.\n- `mkt_vw_exc` (Regional Factors): Market-cap-weighted average of the market excess returns in USD for the countries included in the regional portfolio.\n"
  },
  {
    "path": "GlobalFactors/accounting_chars.sas",
    "content": "***************************************************************************\n*                     Characteritics to Extract \n*************************************************************************** ; \n/* Pure Accounting Based Characteristics */\n%let acc_chars=\n\t/* Accounting Based Size Measures */\n\tassets sales book_equity net_income enterprise_value\n\t\n\t/* 1yr Growth */\n\tat_gr1 ca_gr1 nca_gr1 lt_gr1 cl_gr1 ncl_gr1 be_gr1 pstk_gr1 debt_gr1 sale_gr1 cogs_gr1 sga_gr1 opex_gr1\n\t\n\t/* 3yr Growth */\n\tat_gr3 ca_gr3 nca_gr3 lt_gr3 cl_gr3 ncl_gr3 be_gr3 pstk_gr3 debt_gr3 sale_gr3 cogs_gr3 sga_gr3 opex_gr3\n\t\n\t/* 1yr Growth Scaled by Assets */\n\tcash_gr1a inv_gr1a rec_gr1a ppeg_gr1a lti_gr1a intan_gr1a debtst_gr1a ap_gr1a\n\ttxp_gr1a debtlt_gr1a txditc_gr1a coa_gr1a col_gr1a cowc_gr1a ncoa_gr1a ncol_gr1a nncoa_gr1a\n\toa_gr1a ol_gr1a noa_gr1a fna_gr1a fnl_gr1a nfna_gr1a gp_gr1a ebitda_gr1a ebit_gr1a \n\tope_gr1a ni_gr1a nix_gr1a dp_gr1a ocf_gr1a fcf_gr1a nwc_gr1a\n\teqnetis_gr1a dltnetis_gr1a dstnetis_gr1a dbnetis_gr1a netis_gr1a fincf_gr1a eqnpo_gr1a\n\ttax_gr1a\n\tdiv_gr1a eqbb_gr1a eqis_gr1a eqpo_gr1a capx_gr1a\n\t\n\t/* 3yr Growth Scaled by Assets */\n\tcash_gr3a inv_gr3a rec_gr3a ppeg_gr3a lti_gr3a intan_gr3a debtst_gr3a ap_gr3a\n\ttxp_gr3a debtlt_gr3a txditc_gr3a coa_gr3a col_gr3a cowc_gr3a ncoa_gr3a ncol_gr3a nncoa_gr3a \n\toa_gr3a ol_gr3a fna_gr3a fnl_gr3a nfna_gr3a gp_gr3a ebitda_gr3a ebit_gr3a \n\tope_gr3a ni_gr3a nix_gr3a dp_gr3a ocf_gr3a fcf_gr3a nwc_gr3a\n\teqnetis_gr3a dltnetis_gr3a dstnetis_gr3a dbnetis_gr3a netis_gr3a fincf_gr3a eqnpo_gr3a\n\ttax_gr3a\n\tdiv_gr3a eqbb_gr3a eqis_gr3a eqpo_gr3a capx_gr3a\n\t\n\t/* Investment */\n\tcapx_at rd_at  \n\t\n \t/* Profitability */\n \tgp_sale ebitda_sale ebit_sale pi_sale ni_sale nix_sale ocf_sale fcf_sale  /* Profit Margins */\n \tgp_at ebitda_at ebit_at fi_at cop_at\t\t\t\t\t\t\t/* Return on Assets */ \n \tope_be ni_be nix_be ocf_be fcf_be\t\t\t\t\t\t\t\t/* Return on Book 
Equity */ \n \tgp_bev ebitda_bev ebit_bev fi_bev cop_bev\t\t\t\t\t\t/* Return on Invested Capital */  \n \tgp_ppen ebitda_ppen fcf_ppen\t\t\t\t\t\t\t\t\t/* Return on Physical Capital */\n \t\n \t/* Issuance */\n \tfincf_at netis_at eqnetis_at eqis_at dbnetis_at dltnetis_at dstnetis_at\n \t\n \t/* Equity Payout */\n \teqnpo_at eqbb_at div_at\n \t\n \t/* Accruals */\n \toaccruals_at oaccruals_ni taccruals_at taccruals_ni noa_at\n \t\n \t/* Capitalization/Leverage Ratios */\n\tbe_bev debt_bev cash_bev pstk_bev debtlt_bev debtst_bev\n\tdebt_mev pstk_mev debtlt_mev debtst_mev\n\t\n\t/* Financial Soundness Ratios */\n\tint_debtlt int_debt cash_lt inv_act rec_act  \t\t\t\n\tebitda_debt debtst_debt cl_lt debtlt_debt profit_cl ocf_cl\t\t\n\tocf_debt lt_ppen debtlt_be fcf_ocf\n\topex_at nwc_at\n\t\n\t/* Solvency Ratios */\n \tdebt_at debt_be ebit_int\n \t\n \t/* Liquidity Ratios */\n \tcash_cl caliq_cl ca_cl\n \tinv_days rec_days ap_days cash_conversion \n \t\n \t/* Activity/Efficiency Ratios */\n \tinv_turnover at_turnover rec_turnover ap_turnover\t\n \t\n \t/* Non-Recurring Items */\n \tspi_at xido_at nri_at\n \t\n\t/* Miscellaneous */\n\tadv_sale staff_sale rd_sale div_ni sale_bev sale_be sale_nwc tax_pi\n\t\n\t/* Balance Sheet Fundamentals to Market Equity */\n\tbe_me at_me cash_me\n\t\n\t/* Income Fundamentals to Market Equity */\n\tgp_me ebitda_me ebit_me ope_me ni_me nix_me sale_me ocf_me fcf_me cop_me\n\trd_me\n\t\n\t/* Equity Payout/Issuance to Market Equity */\n\tdiv_me eqbb_me eqis_me eqpo_me eqnpo_me eqnetis_me\n\t\n\t/* Debt Issuance to Market Enterprise Value */\n\tdltnetis_mev dstnetis_mev dbnetis_mev\n\t\n\t/* Firm Payout/Issuance to Market Enterprise Value */\n\tnetis_mev\n\t\n\t/* Balance Sheet Fundamentals to Market Enterprise Value */\n\tat_mev be_mev bev_mev ppen_mev cash_mev\n\t\n\t/* Income/CF Fundamentals to Market Enterprise Value */\n\tgp_mev ebitda_mev ebit_mev cop_mev sale_mev ocf_mev fcf_mev fincf_mev\n\t\n\t/* New Variables from HXZ */ 
\n\tni_inc8q ppeinv_gr1a lnoa_gr1a capx_gr1 capx_gr2 capx_gr3 sti_gr1a\n\tniq_at niq_at_chg1 niq_be niq_be_chg1 saleq_gr1 rd5_at\n\tdsale_dinv dsale_drec dgp_dsale dsale_dsga\n\tsaleq_su niq_su debt_me netdebt_me capex_abn inv_gr1 be_gr1a\n\top_at pi_nix op_atl1 gp_atl1 ope_bel1 cop_atl1\n\tat_be ocfq_saleq_std  \n\taliq_at aliq_mat tangibility\n\teq_dur f_score o_score z_score kz_index intrinsic_value ival_me\n\tsale_emp_gr1 emp_gr1 cash_at\n\tearnings_variability ni_ar1 ni_ivol\n\t\n\t/* New Variables not in HXZ */\n\tniq_saleq_std ni_emp sale_emp ni_at\n\tocf_at ocf_at_chg1\n\troeq_be_std roe_be_std\n\tgpoa_ch5 roe_ch5 roa_ch5 cfoa_ch5 gmar_ch5\n;\n\n%put ### In total %nwords(&acc_chars.) characteristics will be created ###;\n\n**********************************************************************************************************************\n*                                    MACRO - Add Helper Vars to Standardized Compustat Accounting \n*********************************************************************************************************************\nDescription: \n   The main functionality of this macro is to take the output of %compustat_accounting_data and add helper variables.\n   These helper variables have two main functionalities:\n\t1. They are used to expand coverage of a given variable \n\t2. 
They are used to create variables not directly available from the accounting statements\n   All added variables have a suffix of '_x'.\n;\n\n%macro add_helper_vars(data=, out=);\n\t/* First ensure that the gap between two dates is always one month */\n\tproc sql;\n\t\tcreate table __comp_dates1 as\n\t\tselect gvkey, curcd, min(datadate) as start_date, max(datadate) as end_date\n\t\tfrom &data.\n\t\tgroup by gvkey, curcd;\n\tquit;\n\t\n\t%expand(data=__comp_dates1, out=__comp_dates2, id_vars=gvkey, start_date=start_date, end_date=end_date, freq='month', new_date_name=datadate);\n\t\n\tproc sql;\n\t\tcreate table __helpers1 as \n\t\tselect a.gvkey, a.curcd, a.datadate, b.*, not missing(b.gvkey) as data_available\n\t\tfrom __comp_dates2 as a\n\t\tleft join &data. as b on a.gvkey=b.gvkey and a.curcd=b.curcd and a.datadate=b.datadate;\n\tquit;\n\t\n\tproc sort nodupkey data=__helpers1; by gvkey curcd datadate; run;\n\t\n\tdata __helpers2;\n\t\tset __helpers1;\n\t\tby gvkey curcd;\n\t\tretain count;\n\t\tif first.curcd then \n\t\t\tcount = 1;\n\t\telse\n\t\t\tcount = count+1;\n\trun;\n\t\n\t/* Create Helper Variables */\n\tdata &out.; \n\t\tset __helpers2;\n\t\tby gvkey curcd;\n\t\t\n\t\t/* Require Certain Variables to Be Positive */\n\t\tarray var_pos at sale revt dv che;\n\t\tdo over var_pos;\n\t\t\tif var_pos<0 then\n\t\t\t\tvar_pos = .;\n\t\tend;\n\n\t\t/* X Variables to Maximize Coverage */\n\t\t*Income Statement;\n\t\tsale_x\t\t\t= coalesce(sale, revt); /* in NA sale has better coverage in Global revt has better coverage. 
They are the same though*/\n\t\tgp_x \t\t\t= coalesce(gp, sale_x-cogs); /*Gross Profit*/\n\t\topex_x\t\t\t= coalesce(xopr, cogs+xsga); /* Operating Expenses */\n\t\tebitda_x \t\t= coalesce(ebitda, oibdp, sale_x-opex_x, gp_x-xsga); /*Operating Income Before Depreciation*/\n\t\tebit_x \t\t\t= coalesce(ebit, oiadp, ebitda_x-dp); /*Operating Income Before Depreciation*/\n\t\top_x            = ebitda_x + coalesce(xrd, 0);  /* Operating Profit ala Ball et al (2015)*/\n\t\tope_x\t\t\t= ebitda_x-xint; /* Operating Profit to Equity ala FF*/\n\t\tpi_x\t\t\t= coalesce(pi, ebit_x-xint+coalesce(spi,0)+coalesce(nopi,0));  /* Interest Income is included in NOPI*/\n\t\txido_x\t\t\t= coalesce(xido, xi+coalesce(do, 0));\n\t\tni_x\t\t\t= coalesce(ib, ni-xido_x, pi_x - txt - coalesce(mii, 0)); \n\t\tnix_x\t\t\t= coalesce(ni, ni_x+coalesce(xido_x, 0), ni_x + xi + do);\n\t\tfi_x\t\t\t= nix_x+xint;\t\t/* Firm income i.e. return to equity and debt holders */  \n\t\tdiv_x\t\t\t= coalesce(dvt, dv); /* See [1] */\n\t\t\n\t\t* Cash Flow Statement;\n\t\teqbb_x\t\t\t= sum(prstkc, purtshr);  /* Equity Buyback is mainly PRSTKC in NA and PURTSHR in GLOBAL. Using sum() means that any of the two inputs are allowed to be missing */\t\n\t\teqis_x\t\t\t= sstk;  /* Equity Issuance is SSTK which is common+preferred Stocks.*/\t\n\t\teqnetis_x\t\t= sum(eqis_x,-eqbb_x);  /* Net Equity Issuance. Using sum means that the variable will be computed as long as one of the inputs are non-missing */\n\t\teqpo_x\t\t\t= div_x+eqbb_x;  /* Net Equity Payout= Div+Buyback-Issuance*/\n\t\teqnpo_x\t\t\t= div_x-eqnetis_x;  /* Net Equity Payout= Div+Buyback-Issuance*/\n\t\tdltnetis_x\t\t= coalesce(sum(dltis,-dltr), ltdch, dif12(dltt)); /* Net Long Term Debt issuance. GLOBAL firms only have LTDCH, NA firms only have dltis and DLTR. If cash flow items are missing. 
Approximate by the change in long term book debt */\n\t\tif missing(dltis) and missing(dltr) and missing(ltdch) and count<=12 then\n\t\t\tdltnetis_x\t= .; \n\t\tdstnetis_x\t\t= coalesce(dlcch, dif12(dlc)); /* Prefer dlcch. If this is missing, approximate by the change in short term book debt */\n\t\tif missing(dlcch) and count<=12 then\n\t\t\tdstnetis_x\t= .; \n\t\tdbnetis_x\t\t= sum(dstnetis_x, dltnetis_x);\n\t\tnetis_x\t\t\t= eqnetis_x+dbnetis_x;  /* I require that both equity and debt issuance are available */\n\t\tfincf_x\t\t\t= coalesce(fincf, netis_x-dv+coalesce(fiao, 0)+coalesce(txbcof, 0)); /* https://wrds-web.wharton.upenn.edu/wrds/support/Data/_001Manuals%20and%20Overviews/_001Compustat/_001North%20America%20-%20Global%20-%20Bank/_000dataguide/index.cfm?_ga=2.179697184.565214677.1585224362-1543109641.1544636729 */\n\t\t\n\t\t* Balance Sheet Statement;\n\t\tdebt_x\t\t\t= sum(dltt, dlc); /* This greatly expands coverage */\n\t\tpstk_x\t\t\t= coalesce(pstkrv, pstkl, pstk); /* Value of Preferred Stock*/\n\t\tseq_x\t\t\t= coalesce(seq, ceq+coalesce(pstk_x, 0), at-lt);\n\t\tat_x            = coalesce(at, seq_x + dltt + coalesce(lct, 0) + coalesce(lo, 0) + coalesce(txditc, 0));  /* at is only available yearly in the beginning of the fundq data whereas seq and dltt is available on a quarterly basis from the beginning */  \n\t\tca_x\t\t\t= coalesce(act, rect+invt+che+aco); /* Current Assets */\n\t\tnca_x\t\t\t= at_x-ca_x; /* Non-Current Assets */\n\t\tcl_x\t\t\t= coalesce(lct, ap+dlc+txp+lco); /* Current Liabilities */\n\t\tncl_x\t\t\t= lt-cl_x;\n\t\t\n\t\tnetdebt_x\t\t= debt_x - coalesce(che, 0); /* Net Debt for calculating Enterprise Value */\n\t\ttxditc_x\t\t= coalesce(txditc, sum(txdb, itcb));\n\t\tbe_x\t\t\t= seq_x+coalesce(txditc_x, 0)-coalesce(pstk_x, 0); \n\t\tbev_x\t\t\t= coalesce(icapt+coalesce(dlc, 0)-coalesce(che, 0), netdebt_x+seq_x+coalesce(mib, 0));\n\t\t\n\t\tcoa_x\t\t\t= ca_x - che;  /* Operating (non cash) current assets */\n\t\tcol_x\t\t\t= 
cl_x-coalesce(dlc, 0);  /* Operating Current Liabilities */\n\t\tcowc_x\t\t\t= coa_x-col_x;  /* Current Operating Working Capital */\n\t\tncoa_x\t\t\t= at_x-ca_x-coalesce(ivao, 0); /* Non-Current Operating Assets */\n\t\tncol_x\t\t\t= lt-cl_x-dltt; /* Non-Current Operating Liabilities */\n\t\tnncoa_x\t\t\t= ncoa_x-ncol_x; /* Net Non-Current Operatng Assets */\n\t\tfna_x\t\t\t= coalesce(ivst,0)+coalesce(ivao,0); /* Financial Assets */\n\t\tfnl_x\t\t\t= debt_x+coalesce(pstk_x,0); /* Financial Liabilities */\n\t\tnfna_x\t\t\t= fna_x-fnl_x; /* Net Financial Assets */\n\t\toa_x\t\t\t= coa_x+ncoa_x; /* Operating Assets */\n\t\tol_x\t\t\t= col_x+ncol_x; /* Operating Liabilities */\n\t\tnoa_x\t\t\t= oa_x-ol_x; /* Net Operating Assets*/\n\t\tlnoa_x          = ppent+intan+ao-lo+dp;  /* Long-term NOA (from HXZ A.3.5)*/\n\t\t\n\t\tcaliq_x\t\t\t= coalesce(ca_x-invt, che+rect); /* Liquid current assets use for quick ratio*/\n\t\tnwc_x\t\t\t= ca_x-cl_x;\n\t\tppeinv_x        = ppegt + invt;  * Should be moved to create_chars as it is a helper variables;\n\t\taliq_x          = che + 0.75 * coa_x + 0.5 * (at_x - ca_x - coalesce(intan, 0));  /* From Ortiz-Molina and Phillips (2014). Don't subtract gdwl since this is already included intangibles */\n\t\t\n\t\tarray var_bs be_x bev_x;\n\t\tdo over var_bs;\n\t\t\tif var_bs<=0 then\n\t\t\t\tvar_bs\t= .;\n\t\tend;\n\t\t\n\t\t/* Accruals + OCF/FCF + Cash Based Operating Profit*/\n\t\toacc_x\t\t\t= coalesce(ni_x-oancf, dif12(cowc_x)+dif12(nncoa_x));\t/* Operating Accruals: Difference between Accounting Earnings and Operating Cash Flow. [2] */\n\t\ttacc_x\t\t\t= oacc_x+dif12(nfna_x);      /* Total Accruals = Accural Earnings - Cash Earnings = Change in Non-Cash Assets - Change in Liabilities. 
*/\n\t\tif count<=12 then do;\n\t\t\toacc_x \t\t= .;\n\t\t\ttacc_x\t\t= .;\n\t\tend;\n\t\tocf_x           = coalesce(oancf, ni_x-oacc_x, ni_x + dp - coalesce(wcapt, 0)); \n\t\tfcf_x\t\t\t= ocf_x-capx; /* Note that this does not include funds from financing activities */\n\t\tcop_x\t\t\t= ebitda_x + coalesce(xrd, 0) - oacc_x;  /* Cash Based Operating Profitability (Gerakos et al, 2016) add R&D while subtracting accruals */\n\t\t\n\t\tdrop count;\n\trun;\t\t\n\t\n\tproc delete data= __comp_dates1 __comp_dates2 __helpers1 __helpers2; run;\n%mend add_helper_vars;\n\n**********************************************************************************************************************\n*                                    MACRO - Compustat Accounting Data Standardized \n*********************************************************************************************************************\nDescription: \n   The main functionality of this macro is to create accounting datasets which are \n   comparable across frequency (quarterly/annual) and geography (North America/Global).\n   To make the data comparable across frequency, we modify the quarterly data by: \n   \t- Quarterize year-to-date variables.  \n   \t- Take the sum over the last 4 quarters for income and cash flow variables\n\t- Change name to be consistent with the annual data\n   To make the data comparable across geography, we modify the global data by:\n \t- Create columns available in the North American dataset in the global dataset.\n \t  If possible we infer the values from available columns. 
Otherwise we just set to missing.\n \t- If specified, we change all data to USD.\n;\n\n\n%macro standardized_accounting_data(coverage=, convert_to_usd=, me_data=, include_helpers_vars=1, start_date='01JAN1950'd);  /* Coverage in ('na', 'global', 'world'), convert_to_usd in (0,1), include_helpers_vars in (0,1)*/\n\t/* Compustat Accounting Vars to Extract */\n\t%let avars_inc =\n\t\tsale revt gp ebitda oibdp ebit oiadp  pi ib ni mii\n\t\tcogs xsga xopr xrd xad xlr dp xi do xido xint spi nopi txt\n\t\tdvt\n\t;\n\t%let avars_cf =\n\t\t/* Operating */\n\t\toancf ibc dpc xidoc capx ibc dpc wcapt\n\t\t\n\t\t/* Financing */\n\t\tfincf fiao txbcof ltdch dltis dltr dlcch purtshr prstkc sstk\n\t\tdv dvc\n\t;\n\t%let avars_bs =\n\t\t/* Assets */\n\t\tat act aco che invt rect ivao ivst ppent ppegt intan ao gdwl re\n\t\t\n\t\t/* Liabilities */\n\t\tlt lct dltt dlc txditc txdb itcb txp ap lco lo \n\t\tseq ceq pstkrv pstkl pstk mib icapt \n\t;\n\t\n\t* Variables in avars_other are not measured in currency units, and only available in annual data;\n\t%let avars_other = emp; \n\t\n\t%let avars=\t&avars_inc. &avars_cf. &avars_bs.; \t/* fdate and pdate. Unfortunately only available from 2007. RDQ is available further back for NA but should probably use 10K dates instead???*/\t\n\t\n\t%put INCOME STATEMENT: %nwords(&avars_inc.) || CASH FLOW STATEMENT: %nwords(&avars_cf.) || BALANCE SHEET: %nwords(&avars_bs.) 
|| OTHER: %nwords(&avars_other.);\n\t\n\tproc sql noprint;\n\t\tselect distinct lowcase(name) into :qvars_q separated by ' '\n\t\tfrom dictionary.columns\n\t\twhere libname='COMP' and memname in ('FUNDQ', 'G_FUNDQ') and memtype='DATA'\n\t\tand findw(lowcase(\"&avars.\"),substr(lowcase(name),1,length(name)-1))>0\n\t\tand name like \"%nrstr(%%)q\"; /* Quarterly names ending with q */\n\tquit;\n\t\n\tproc sql noprint;\n\t\tselect distinct lowcase(name) into :qvars_y separated by ' '\n\t\tfrom dictionary.columns\n\t\twhere libname='COMP' and memname in ('FUNDQ', 'G_FUNDQ') and memtype='DATA'\n\t\tand findw(lowcase(\"&avars.\"),substr(lowcase(name),1,length(name)-1))>0\n\t\tand name like \"%nrstr(%%)y\"; /* Quarterly names ending with y (ytd_variables)*/\n\tquit;\n\t\n\t%let qvars = &qvars_q. &qvars_y.;\n\t\n\t* In International but not in NA: dvtq dvty purtshr purtshry ltdch ltdchy    LOC?;\n\t\n\t%if &coverage. = 'global' or &coverage. = 'world' %then %do;\n\t\t%let aname = __gfunda;\n\t\t%let qname = __gfundq;\n\t\t%let compcond=indfmt in ('INDL', 'FS') and datafmt='HIST_STD' and popsrc='I' and consol='C';\n\t\tdata g_funda1; \n\t\t\tset comp.g_funda;\n\t\t\twhere &compcond. and datadate>=&start_date.; \n\t\t\tsource = 'GLOBAL'; format source $char6.;\n\t\t\t\n\t\t\t* Variables Not Available in G_FUNDA with Replacement;\n\t\t\tni \t\t\t= ib+coalesce(xi,0)+coalesce(do,0);   /*https://wrds-www.wharton.upenn.edu/pages/support/support-articles/compustat/global/ni-net-income-variable/?_ga=2.199276522.59544135.1581901991-1543109641.1544636729 */\n\t\t\t\n\t\t\t* Variables Not Available in G_FUNDA without Replacement;\n\t\t\tgp \t\t\t= .;\n\t\t\tpstkrv\t\t= .; /*captured by pstk_x*/\n\t\t\tpstkl\t\t= .; /*captured by pstk_x*/\n\t\t\titcb \t\t= .; /*Only used as substitute for txditc (deffered tax and investment credit)*/\n\t\t\txad\t\t\t= .;\n\t\t\ttxbcof\t\t= .; /* Part of FINCF but will automatically be set to zero*/\n\t\t\tkeep gvkey datadate indfmt curcd source &avars. 
&avars_other.;\n\t\trun;\n\t\tproc sql;\n\t\t\tcreate table &aname. as\n\t\t\tselect *\n\t\t\tfrom g_funda1\n\t\t\tgroup by gvkey, datadate\n\t\t\thaving count(*)=1 or (count(*)=2 and indfmt='INDL');  /* If the accouting report is available inb both an industrial and financial format, choose financial format (happens very rarely in the international data)*/ \n\t\t\t\n\t\t\talter table &aname.\n\t\t\tdrop indfmt;\n\t\tquit;\n\t\t\n\t\tdata g_fundq1;\n\t\t\tset comp.g_fundq;\n\t\t\twhere &compcond. and datadate>=&start_date.;\n\t\t\tsource = 'GLOBAL'; format source $char6.;\n\t\t\t\n\t\t\t* Variables Not Available in G_FUNDQ with Replacement;\n\t\t\tniq\t\t\t= ibq+coalesce(xiq, 0); /*Discontinued Operations is not available in g_fundq*/\n\t\t\tppegtq\t\t= ppentq+dpactq;  /* See [3] */\n\t\t\t\n\t\t\t* Variables Not Available in G_FUNDQ without Replacement;\n\t\t\ticaptq\t\t= .;\n\t\t\tniy\t\t\t= .;\n\t\t\ttxditcq\t\t= .;\n\t\t\ttxpq\t\t= .; \n\t\t\txidoq\t\t= .; \n\t\t\txidoy\t\t= .;\n\t\t\txrdq\t\t= .;\n\t\t\txrdy\t\t= .;\n\t\t\ttxbcofy\t\t= .; /* Part of FINCF but will automatically be set to zero*/\n\t\t\tkeep gvkey datadate indfmt fyr fyearq fqtr curcdq source &qvars.; \n\t\trun;\n\t\tproc sql;\n\t\t\tcreate table &qname. as\n\t\t\tselect *\n\t\t\tfrom g_fundq1\n\t\t\tgroup by gvkey, datadate\n\t\t\thaving count(*)=1 or (count(*)=2 and indfmt='INDL');  /* If the accouting report is available inb both an industrial and financial format, choose financial format (happens very rarely in the international data)*/ \n\t\t\t\n\t\t\talter table &qname.\n\t\t\tdrop indfmt;\n\t\tquit;\n\t%end;\n\t\n\t%if &coverage. = 'na' or &coverage. = 'world' %then %do;\n\t\t%let aname = __funda;\n\t\t%let qname = __fundq;\n\t\t%let compcond=indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C';\n\t\tdata &aname.; \n\t\t\tset comp.funda; \n\t\t\twhere &compcond. 
and datadate>=&start_date.;\n\t\t\tsource = 'NA'; format source $char6.;\n\t\t\tkeep gvkey datadate curcd source &avars. &avars_other.; \n\t\trun;\n\t\tdata &qname.; \n\t\t\tset comp.fundq; \n\t\t\twhere &compcond. and datadate>=&start_date.;\n\t\t\tsource = 'NA'; format source $char6.;\n\t\t\tkeep gvkey datadate fyr fyearq fqtr curcdq source &qvars.; \n\t\trun;\n\t%end;\t\n\t%if &coverage. = 'world' %then %do;\n\t\t%let aname = __wfunda;\n\t\t%let qname = __wfundq;\n\t\tdata &aname.;\n\t\t\tset __gfunda __funda;\n\t\trun;\n\t\tdata &qname.;\n\t\t\tset __gfundq __fundq;\n\t\trun;\n\t\t/*proc delete data= __gfunda __gfundq __funda __fundq; run;*/\n\t%end;\n\n\t/* If &convert_to_usd=1 then convert everything to USD otherwise keep as local currency*/\n\t%if &convert_to_usd.=1 %then %do;\n\t\t%compustat_fx(out=fx);\n\t\t\n\t\tproc sql;\n\t\t\tcreate table __tempa as \n\t\t\tselect a.*, b.fx\n\t\t\tfrom &aname. as a left join fx as b\n\t\t\ton a.datadate=b.date and a.curcd=b.curcdd;\n\t\t\t\n\t\t\tcreate table __tempq as \n\t\t\tselect a.*, b.fx\n\t\t\tfrom &qname. as a left join fx as b\n\t\t\ton a.datadate=b.date and a.curcdq=b.curcdd;\n\t\tquit;\n\t\t\n\t\tdata __compa1;\n\t\t\tset __tempa;\n\t\t\tarray var &avars.;\n\t\t\tdo over var;\n\t\t\t\tvar = var*fx;\n\t\t\tend;\n\t\t\tcurcd = 'USD';\n\t\t\tdrop fx;\n\t\trun;\n\t\t\n\t\tdata __compq1;\n\t\t\tset __tempq;\n\t\t\tarray var &qvars.;\n\t\t\tdo over var;\n\t\t\t\tvar = var*fx;\n\t\t\tend;\n\t\t\tcurcdq = 'USD';\n\t\t\tdrop fx;\n\t\trun; \n\t\t\n\t\tproc delete data=fx __tempa __tempq;\n\t%end;\n\t%else %do;\n\t\t*Rename Data;\n\t\tproc sql;\n\t\t\tcreate table __compa1 as select * from &aname.; \n\t\t\tcreate table __compq1 as select * from &qname.;\n\t\tquit;\n\t%end;\n\t\n\tproc delete data= &aname. 
&qname.;\n\t \n\t/* Change Quarterly Data to Be Comparable to Annual Data */\n\t%QUARTERIZE(inset=__compq1, outset=__compq2, idvar=gvkey fyr, fyear=fyearq, fqtr=fqtr); /* Quarterize the YTD flow accounting variables */\n\tproc sort data=__compq2 nodupkey; by gvkey fyr fyearq fqtr; run; /*THEIS: 0 Deleted in the US*/\n\t%macro ttm(var); (&var + lag1(&var) + lag2(&var) + lag3(&var)) %mend; /* Note that ttm will return missing if either of the lags are missing. This is the behavior we want. */\n\n\t%macro temp();\n\t/* Prepare quarterly data: if quarterly Compustat variable is missing, replace with quarterized version*/\n\tdata __compq3; \n\t\tset __compq2;\n\t\tby gvkey fyr fyearq fqtr;\n\t\t\n\t\t/* Replace missing &var.q with quarterized version &var.y_q*/\n\t\t%do i=1 %to %nwords(&qvars_y.); \n\t\t\t%let var_ytd = %scan(&qvars_y., &i, %str(' '));\n\t\t\t%let var = %sysfunc(prxchange(s/y$//, 1, &var_ytd.));\n\t\t\tif missing(&var.q) then\n\t\t\t\t&var.q = &var.y_q;\n\t\t\tdrop &var.y_q;\n\t\t%end;\t\n\t\t\n\t\t/* Create Quarterly Variables to Keep */\n\t\tni_qtr = ibq;\n\t\tsale_qtr = saleq;\n\t\tocf_qtr = coalesce(oancfq, ibq + dpq - coalesce(wcaptq, 0));\n\t\n\t\t*Cumulate Income/CF Items Over 4 Quarters (This should be made automatic at some point;\n\t\t%let yrl_vars = cogsq xsgaq xintq dpq txtq xrdq dvq spiq saleq revtq cogsq \n\t\t\txoprq oibdpq oiadpq ibq niq xidoq nopiq miiq piq xiq\n\t\t\txidocq capxq oancfq ibcq dpcq wcaptq\n\t\t\tprstkcq sstkq purtshrq\n\t\t\tdsq dltrq ltdchq dlcchq\n\t\t\tfincfq fiaoq txbcofq dvtq;\n\t\t\n\t\t%do i=1 %to %nwords(&yrl_vars.);\n\t\t\t%let var_yrl = %scan(&yrl_vars., &i, %str(' '));\n\t\t\t%let var_yrl_name = %sysfunc(prxchange(s/q$//, 1, &var_yrl.));  \n\t\t\t&var_yrl_name. = %ttm(&var_yrl.);\n\t\t\tif (gvkey^=lag3(gvkey) or fyr^=lag3(fyr) or curcdq^=lag3(curcdq) or %ttm(fqtr)^=10) then \n\t\t\t\t&var_yrl_name. = .;\n\t\t\tif missing(&var_yrl_name.) and fqtr=4 then\n\t\t\t\t&var_yrl_name. 
= &var_yrl_name.y;  * If financial quarter is 4, the ytd variable is yearly;\n\t\t\tdrop &var_yrl. &var_yrl_name.y;\n\t\t%end;\n\t\t\n\t\t* Rename All Other (Balance Sheet and CURCDQ) Items to Facilitate Merge (This should be made automatic at some point);\n\t\t%let bs_vars = seqq ceqq pstkq icaptq mibq gdwlq req \n\t\t\tatq actq invtq rectq ppegtq ppentq aoq acoq intanq cheq ivaoq ivstq \n\t\t\tltq lctq dlttq dlcq txpq apq lcoq loq txditcq txdbq;\n\t\t%do i=1 %to %nwords(&bs_vars.);\n\t\t\t%let var_bs = %scan(&bs_vars., &i, %str(' '));\n\t\t\t%let var_bs_name = %sysfunc(prxchange(s/q$//, 1, &var_bs.));  \n\t\t\trename &var_bs. = &var_bs_name.;\n\t\t%end;\n\t\trename curcdq = curcd;\n\t\t\n\trun;\n\t%mend;\n\t%temp();\n\t\n\t/* Ensure One Obs pr. Datadate */\n\tproc sort data=__compq3 nodupkey; by gvkey datadate fyr; run;\n\tdata __compq4; set __compq3; by gvkey datadate; if last.datadate; drop fyr fyearq fqtr; run; /*THEIS: US->1432 Observations are deleted in this step*/\t\n\t\n\t/* Add empty quarterly variables to annual data */\n\tdata __compa2;\n\t\tset __compa1;\n\t\tni_qtr = .;\n\t\tsale_qtr = .;\n\t\tocf_qtr = .;\n\trun;\n\t\n\t/* Add Market Equity at Fiscal End */\n\tproc sql;\n\t\tcreate table __me_data as \n\t\tselect distinct gvkey, eom, me_company as me_fiscal \n\t\tfrom &me_data.\n\t\twhere not missing(gvkey) and primary_sec=1 and not missing(me_company) and common=1 and obs_main=1 /* Notice, exch_main is not a requirement */\n\t\tgroup by gvkey, eom\n\t\thaving me_company=max(me_company);\n\tquit; \n\t\n\tproc sql;\n\t\tcreate table __compa3 as \n\t\tselect a.*, b.me_fiscal\n\t\tfrom __compa2 as a \n\t\tleft join __me_data as b\n\t\ton a.gvkey = b.gvkey and a.datadate = b.eom;\n\tquit;\n\t\n\tproc sql;\n\t\tcreate table __compq5 as \n\t\tselect a.*, b.me_fiscal\n\t\tfrom __compq4 as a \n\t\tleft join __me_data as b\n\t\ton a.gvkey = b.gvkey and a.datadate = b.eom;\n\tquit;\n\t\n\t/* Include Helper Variables */\n\t%if &include_helpers_vars.=1 %then 
%do;\n\t\t%let qdata = __compq6;\n\t\t%let adata = __compa4;\n\t\t%add_helper_vars(data = __compq5, out=&qdata.);\n\t\t%add_helper_vars(data = __compa3, out=&adata.);\n\t\tproc delete data=__compq5 __compa3; run;\n\t%end;\n\t%else %do;\n\t\t%let qdata = __compq5;\n\t\t%let adata = __compa3;\n\t%end;\n\t\n\t/* Output */\n\tproc sort data= &adata. out=acc_std_ann nodupkey; by gvkey datadate; run;\n\tproc sort data= &qdata. out=acc_std_qtr nodupkey; by gvkey datadate; run;\n\n\tproc delete data=__compq1 __compq2 __compq3 __compq4 __compa1 __compa2 &qdata. &adata.; run; \n%mend standardized_accounting_data;\n\n\n\n\n\n\n\n\n**********************************************************************************************************************\n*                  MACRO - Create Accounting Characteristics from Compustat Standardized Data \n*********************************************************************************************************************\nDescription: \n   The main functionality of this macro is to take the output from %standardized_accounting_data and create characteristics \n   that require accounting data.  \n   !!! When we are satisfied with the data, we should really include labels for all accounting characteristics !!!\n;\n%macro create_acc_chars(data=, out=, lag_to_public=, max_data_lag=, __keep_vars=, me_data=, suffix=);\n\t/* Helper Macros */\n\t%macro mean_year(var); mean(&var, lag12(&var)) %mend; /* Note that this function will still return a value even if one of the lags are missing. To change this do (var+lag12(var))/2. Could also consider if we should take the average of all observable observations? 
ie var+lag1(var)+lag2(var)...+lag12(var)*/\n\t\n\t%macro apply_to_lastq(x=, _qtrs=, func=); /* The Macro Below is a Generic way of creating lag from current to &n*/\n\t\t%let mv = &func.(&x.; \n\t\t%do _i=1 %to &_qtrs.-1;\n\t\t\t%let mv = &mv., lag%eval(&_i.*3)(&x.);\n\t\t%end;\n\t\t%let mv = &mv.);\n\t\t&mv.;\n\t%mend apply_to_lastq;\n\t\n\t%macro apply_to_lasty(x=, yrs=, func=); /* The Macro Below is a Generic way of creating ANNUAL lags from current to &n */\n\t\t%let mv = &func.(&x.; \n\t\t%do _i=1 %to &yrs.-1;\n\t\t\t%let mv = &mv., lag%eval(&_i.*12)(&x.);\n\t\t%end;\n\t\t%let mv = &mv.);\n\t\t&mv.;\n\t%mend apply_to_lasty;\n\t\n\t/* Start Procedure */\n\tproc sort data=&data. out=__chars3; by gvkey curcd datadate; run; /* deleted _chars1 and __chars2 steps*/\n\t\n\tdata __chars4;\n\t\tset __chars3;\n\t\tby gvkey curcd;\n\t\tretain count;\n\t\tif first.curcd then \n\t\t\tcount = 1;\n\t\telse\n\t\t\tcount = count+1;\n\trun;\n\t\n\t/* Create Accounting Characteristics */\n\tdata __chars5; \n\t\tset __chars4;\n\t\tby gvkey curcd;\n\t\t\n\t\t/* Accounting Based Size Measures */\n\t\tassets = at_x;\n\t\tsales = sale_x;\n\t\tbook_equity = be_x;\n\t\tnet_income = ni_x;\n\t\n\t\t/* Growth Characteristics */\n\t\t%let growth_vars =\n\t\t\tat_x ca_x nca_x \t\t\t\t\t\t\t\t\t/* Assets - Aggregated */\n\t\t\tlt cl_x ncl_x  \t\t\t\t\t\t\t\t\t/* Liabilities - Aggregated */\n\t\t\tbe_x pstk_x debt_x \t\t\t\t\t\t\t\t/* Financing Book Values */\n\t\t\tsale_x cogs xsga opex_x  \t\t\t\t\t\t/* Sales and Operating Costs */\n\t\t\tcapx invt\n\t\t;\t\n\t\t\n\t\t* 1yr Growth;\n\t\t%do i=1 %to %nwords(&growth_vars.);\n\t\t\t%let var_gr1 = %scan(&growth_vars., &i, %str(' '));\n\t\t\t%var_growth(var_gr=&var_gr1., horizon=12);\n\t\t%end;\n\t\t* 3yr Growth;\n\t\t%do i=1 %to %nwords(&growth_vars.);\n\t\t\t%let var_gr3 = %scan(&growth_vars., &i, %str(' '));\n\t\t\t%var_growth(var_gr=&var_gr3., horizon = 36);\n\t\t%end;\n\t\t\n\t\t/* Change Scaled by Asset Characteristics */\n\t\t%let 
ch_asset_vars = \n\t\t\tche invt rect ppegt ivao ivst intan \t\t\t\t/* Assets - Individual Items */\n\t\t\tdlc ap txp dltt txditc \t\t\t\t\t\t\t\t/* Liabilities - Individual Items*/\n\t\t\tcoa_x col_x cowc_x ncoa_x ncol_x nncoa_x \t\t\t/* Operating Assets/Liabilities */\n\t\t\toa_x ol_x\t\t\t\t\t\t\t\t\t/* Operating Assets/Liabilities */\n\t\t\tfna_x fnl_x nfna_x\t\t\t\t\t\t\t\t\t/* Financial Assets/Liabilities */\n\t\t\tgp_x ebitda_x ebit_x ope_x ni_x nix_x dp \t\t\t/* Income Statement*/\n\t\t\tfincf_x ocf_x fcf_x nwc_x \t\t\t\t\t\t\t/* Aggregated Cash Flow */\n\t\t\teqnetis_x dltnetis_x dstnetis_x dbnetis_x netis_x\t/* Financing Cash Flow */\n\t\t\teqnpo_x\n\t\t\ttxt\t\t\t\t\t\t\t\t\t\t\t\t\t/* Tax Change */\n\t\t\teqbb_x eqis_x div_x eqpo_x\t\t\t\t\t\t/* Financing Cash Flow */\n\t\t\tcapx be_x\n\t\t;\n\t\t* 1yr Change Scaled by Assets;\n\t\t%do i=1 %to %nwords(&ch_asset_vars.);\n\t\t\t%let var_gr1a = %scan(&ch_asset_vars., &i, %str(' '));\n\t\t\t%chg_to_assets(var_gra = &var_gr1a., horizon = 12);\n\t\t%end;\n\t\t\n\t\t* 3yr Change Scaled by Assets;\n\t\t%do i=1 %to %nwords(&ch_asset_vars.);\n\t\t\t%let var_gr3a = %scan(&ch_asset_vars., &i, %str(' '));\n\t\t\t%chg_to_assets(var_gra = &var_gr3a., horizon = 36);\n\t\t%end;\n\t\t\n\t\t/* Investment Measure */\t\n\t\tcapx_at\t\t\t= capx/at_x;  \n\t\trd_at\t\t\t= xrd/at_x;\n\t\t\n\t\t\n\t\t/* Non-Recurring Items */\n\t\tspi_at\t\t\t= spi/at_x;\n\t\txido_at\t\t\t= xido_x/at_x;\n\t\tnri_at\t\t\t= (spi+xido_x)/at_x;\t\t\t\t\t/* Non-Recurring Items */\n\t\t\n\t\t/*Profitability Ratios and Rates of Return*/\n\t\t* Profit Margins; \n\t\tgp_sale\t\t\t= gp_x/sale_x; \t\t\t\t\t\t\t/* Gross Profit Margin*/                                  \n\t\tebitda_sale\t\t= ebitda_x/sale_x; \t\t\t\t\t\t/* Operating Profit Margin before Depreciation */\n\t\tebit_sale\t\t= ebit_x/sale_x; \t\t\t\t\t\t/* Operating Profit Margin after Depreciation */                                 \n\t\tpi_sale\t\t\t= pi_x/sale_x;  \t\t\t\t\t\t/* Pretax Profit 
Margin */                                         \n\t\tni_sale\t\t\t= ni_x/sale_x;\t\t\t\t\t\t\t/* Net Profit Margin Before XI */\n\t\tnix_sale\t\t= nix_x/sale_x;\t\t\t\t\t\t\t/* Net Profit Margin */\n\t\tocf_sale\t\t= ocf_x/sale_x;  \t\t\t\t\t\t/* Cash Flow Margin */       \n\t\tfcf_sale\t\t= fcf_x/sale_x;\n\t\t\n\t\t* Return on Assets;\n\t\tgp_at\t\t\t= gp_x/at_x;\n\t\tebitda_at\t\t= ebitda_x/at_x;\n\t\tebit_at\t\t\t= ebit_x/at_x; \t\n\t\tfi_at\t\t\t= fi_x/at_x;\n\t\tcop_at\t\t\t= cop_x/at_x;\n\t\tni_at           = ni_x/at_x;\n\t\t\n\t\t* Return on Book Equity;\n\t\tope_be\t\t\t= ope_x/be_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\tni_be\t\t\t= ni_x/be_x; \t\t\t\t\t\t\t\t\n\t\tnix_be\t\t\t= nix_x/be_x;\n\t\tocf_be\t\t\t= ocf_x/be_x;\n\t\tfcf_be\t\t\t= fcf_x/be_x;\n\t\t\n\t\t* Return on Invested Book Capital;\n\t\tgp_bev\t\t\t= gp_x/bev_x;\n\t\tebitda_bev\t\t= ebitda_x/bev_x;\n\t\tebit_bev\t\t= ebit_x/bev_x; \t\t\t\t\t/* Pretax Return on Book Enterprise Value */\n\t\tfi_bev \t\t\t= fi_x/bev_x; \t\t\t\t\t\t/* ROIC */\n\t\tcop_bev\t\t\t= cop_x/bev_x;\t\t\t\t\t\t/* Cash Based Operating Profit to Invested Capital */\n\t\t\n\t\t* Return on Physical Capital;\n\t\tgp_ppen\t\t\t= gp_x/ppent;\n\t\tebitda_ppen\t\t= ebitda_x/ppent;\n\t\tfcf_ppen\t\t= fcf_x/ppent;\n\t\t\n\t\t* Issuance Variables;\n\t\tfincf_at\t\t= fincf_x / at_x;\n\t\tnetis_at\t\t= netis_x / at_x;\n\t\teqnetis_at\t\t= eqnetis_x / at_x;\n\t\teqis_at\t\t\t= eqis_x / at_x;\n\t\tdbnetis_at\t\t= dbnetis_x / at_x;\n\t\tdltnetis_at\t\t= dltnetis_x / at_x;\n\t\tdstnetis_at\t\t= dstnetis_x / at_x;\n\t\t\n\t\t/* Equity Payout */\n\t\teqnpo_at\t\t= eqnpo_x / at_x;\n\t\teqbb_at\t\t\t= eqbb_x / at_x;\n\t\tdiv_at\t\t\t= div_x / at_x;\n\t\t\n\t\t* Accruals;\n\t\toaccruals_at \t= oacc_x/at_x;\t\t\t\t\t\t\t/* Operating Accruals */\n\t\toaccruals_ni \t= oacc_x/abs(nix_x);\t\t\t\t\t/* Percent Operating Accruals */\n\t\ttaccruals_at \t= tacc_x/at_x;\t\t\t\t\t\t\t/* Total Accruals */\n\t\ttaccruals_ni \t= tacc_x/abs(nix_x);\t\t\t\t\t/* 
Percent Total Accruals */\n\t\tnoa_at\t\t\t= noa_x/lag12(at_x);\t\t\t\t\t/* Net Operating Asset to Total Assets*/\n\t\tif count <= 12 or lag12(at_x) <= 0 then do;\n\t\t\tnoa_at = .;\n\t\tend;\n\t\t\n\t\t/*Capitalization/Leverage Ratios Book*/\n\t\tbe_bev \t\t\t= be_x/bev_x;   \t\t\t\t\t\t/* Common Equity as % of Book Enterprise Value*/\n\t\tdebt_bev \t\t= debt_x/bev_x;  \t\t\t\t\t\t/* Total Debt as % of Book Enterprise Value*/\n\t\tcash_bev\t\t= che/bev_x;\t\t\t\t\t\t\t/* Cash and Short-Term Investments to Book Enterprise Value */\n\t\tpstk_bev\t\t= pstk_x/bev_x;\t\t\t\t\t\t\t/* Preferred Stock to Book Enterprise Value */\n\t\tdebtlt_bev\t\t= dltt/bev_x;    \t\t\t\t\t\t/* Long-term debt as % of Book Enterprise Value */\n\t\tdebtst_bev\t\t= dlc/bev_x;\t\t\t\t\t\t\t/* Short-term debt as % of Book Enterprise Value */\n\t\t\t\n\t\t/*Financial Soundness Ratios*/\n\t\tint_debt \t\t= xint/debt_x; \t\t\t\t\t\t\t/* Interest as % of average total debt*/\n\t\tint_debtlt \t\t= xint/dltt; \t\t\t\t\t\t\t/* Interest as % of average long-term debt*/\n\t\tebitda_debt \t= ebitda_x/debt_x; \t\t\t\t\t\t/* Ebitda to total debt*/\n\t\tprofit_cl\t\t= ebitda_x/cl_x; \t\t\t\t\t\t/* Profit before D&A to current liabilities*/\n\t\tocf_cl\t\t\t= ocf_x/cl_x; \t\t\t\t\t\t\t/* Operating cash flow to current liabilities*/\n\t\tocf_debt\t\t= ocf_x/debt_x;\t\t\t\t\t\t\t/* Operating cash flow to total debt*/\n\t\tcash_lt \t\t= che/lt; \t\t\t\t\t\t\t\t/* Cash balance to Total Liabilities*/\n\t\tinv_act \t\t= invt/act; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*inventory as % of current assets*/\n\t\trec_act \t\t= rect/act; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*receivables as % of current assets*/\n\t\tdebtst_debt \t= dlc/debt_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\t/*short-term debt as % of total debt*/\n\t\tcl_lt\t\t\t= cl_x/lt; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*current liabilities as % of total liabilities*/\n\t\tdebtlt_debt\t\t= dltt/debt_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*long-term debt as % of total 
debt*/\n\t\tlt_ppen\t\t\t= lt/ppent; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*total liabilities to total tangible assets*/\n\t\tdebtlt_be\t\t= dltt/be_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*long-term debt to book equity*/\n\t\topex_at\t\t\t= opex_x/at_x;\t\t\t\t\t\t\t\t\t\t\t\t\t/* Operating Leverage ala Novy-Marx (2011) */\n\t\tnwc_at\t\t\t= nwc_x/at_x;\n\t\tif ocf_x>0 then \n\t\t\tfcf_ocf \t= fcf_x/ocf_x;  \t\t\t\t\t\t\t\t\t\t\t\t\t/*Free Cash Flow/Operating Cash Flow*/\n\t\t\n\t\t/*Solvency Ratios*/\n\t\tdebt_at\t\t= debt_x/at_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*Debt-to-assets*/\n\t\tdebt_be\t\t= debt_x/be_x; \t\t\t\t\t\t\t\t\t\t\t\t\t/*debt to shareholders' equity ratio*/\t\n\t\tebit_int\t= ebit_x/xint; \t\t\t\t\t\t\t/*interest coverage ratio*/\n\t\t\n\t\t/*Liquidity Ratios*/\n\t\tinv_days \t\t\t= %mean_year(invt)/cogs * 365;\t\t/* Days Inventory Outstanding */\n\t\trec_days \t\t\t= %mean_year(rect)/sale_x * 365;\t/* Days Sales Outstanding */\n\t\tap_days \t\t\t= %mean_year(ap)/cogs * 365;\t\t/* Days Accounts Payable Outstanding */\n\t\tif count<=12 then do;\n\t\t\tarray var_liq inv_days rec_days ap_days;\n\t\t\tdo over var_liq;\n\t\t\t\tvar_liq=.;\n\t\t\tend;\n\t\tend;\n\t\tcash_conversion = inv_days + rec_days - ap_days; \t\t/* Cash Conversion Cycle*/\n\t\tif cash_conversion<0 then \n\t\t\tcash_conversion =.;\n\t\tif cl_x>0 then do;\n\t\t\tcash_cl\t\t= che/cl_x; \t\t\t\t\t\t\t\t\t\t\t/* Cash Ratio*/                                   \n\t\t\tcaliq_cl\t= caliq_x/cl_x;\t\t\t\t\t\t\t\t\t\t/* Quick Ratio (acid test)*/\n\t\t\tca_cl\t\t= ca_x/cl_x;\t\t\t\t\t\t\t\t\t\t/* Current Ratio*/\n\t\tend;\n\t\t\n\t\t/*Activity/Efficiency Ratios*/\n\t\tinv_turnover = cogs/%mean_year(invt);\t\t\t\t/* Inventory Turnover */\n\t\tat_turnover = sale_x/%mean_year(at_x);\t\t\t\t/* Asset Turnover */\n\t\trec_turnover = sale_x/%mean_year(rect);\t\t\t/* Receivables Turnover */\n\t\tap_turnover = (cogs+dif12(invt))/%mean_year(ap);\t/* Account Payables Turnover */\n\t\tif count<=12 then 
do;\n\t\t\tarray var_turn inv_turnover at_turnover rec_turnover ap_turnover;\n\t\t\tdo over var_turn;\n\t\t\t\tvar_turn=.;\n\t\t\tend;\n\t\tend;\n\t\t\n\t\t/*Miscellaneous Ratios*/\n\t\tadv_sale\t\t= xad/sale_x; \t\t\t\t\t\t\t\t\t\t\t\t/*advertising as % of sales*/\n\t\tstaff_sale\t\t= xlr/sale_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\t/*labor expense as % of sales*/\n\t\tsale_bev \t\t= sale_x/bev_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*sales per $ Book Enterprise Value*/\n\t    rd_sale\t\t\t= xrd/sale_x; \n\t\tsale_be \t\t= sale_x/be_x; \t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*sales per $ total stockholders' equity*/\n\t\tif coalesce(nix_x, ni_x)>0 then \n\t\t\tdiv_ni\t\t= div_x/coalesce(nix_x, ni_x); \t\t\t\t\t\t\t\t\t\t\t\t/*Dividend payout ratio. THEIS: I added ib as a possibility*/\t\n\t\tif nwc_x>0 then \n\t\t\tsale_nwc    = sale_x/nwc_x;\t\t\t\t\t\t\t\t\t\t\t\t\t\t/*sales per $ working capital*/\n\t\tif pi_x>0 then \n\t\t\ttax_pi\t\t= txt/pi_x; \t\t\t\t\t\t\t\t/*effective tax rate*/\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t/* NEW VARIABLES */\n\t\tcash_at = che / at_x;\n\t\tif at_x <= 0 then\n\t\t\tcash_at = .;\n\t\t* Employees based variables;\n\t\tni_emp = ni_x / emp;\n\t\tif emp <= 0 then ni_emp = .;\n\t\tsale_emp = sale_x / emp;\n\t\tif emp <= 0 then sale_emp = .;\n\t\tsale_emp_gr1 = sale_emp / lag12(sale_emp) - 1; /* Labor force efficiency */\n\t\tif count <= 12 or lag12(sale_emp) <= 0 then sale_emp_gr1 = .;\n\t\temp_gr1 = (emp - lag12(emp)) / (0.5 * emp + 0.5 * lag12(emp));\n\t\tif count <= 12 or emp_gr1 = 0 or (0.5 * emp + 0.5 * lag12(emp)) = 0 then emp_gr1 = .;\n\t\t\n\t\t* Number of Consecutive Earnings Increases;\n\t\tni_inc = ni_x > lag12(ni_x);\n\t\tif missing(ni_x) or missing(lag12(ni_x)) then\n\t\t\tni_inc = .;\n\t\tni_inc8q = 0;\t\n\t\tno_decrease = 1;\n\t\t%do q = 0 %to 7;\n\t\t\t%let ql = %sysevalf(&q.*3);\n\t\t\tif lag&ql.(ni_inc) = 1 and no_decrease = 1 then \n\t\t\t\tni_inc8q = ni_inc8q + 1;\n\t\t\telse \n\t\t\t\tno_decrease = 0;\n\t\t%end;\n\t\tn_ni_inc = %apply_to_lastq(x = not 
missing(ni_inc), _qtrs = 8, func = sum);\n\t\tif missing(ni_inc) or n_ni_inc ^= 8 or count < 33 then \n\t\t\tni_inc8q = .;\n\t\tdrop no_decrease n_ni_inc; \t\n\t\t\n\t\t* 1yr Change Scaled by Lagged Assets;\n\t\t%let ch_asset_lag_vars = \n\t\t\tnoa_x ppeinv_x\n\t\t;\n\t\t%do i=1 %to %nwords(&ch_asset_lag_vars.);\n\t\t\t%let var_gr1al = %scan(&ch_asset_lag_vars., &i, %str(' '));\n\t\t\t%let name_gr1al = %sysfunc(tranwrd(&var_gr1al, _x, %str()));  /* Remove '_x' from var name */\n\t\t\t&name_gr1al._gr1a = (&var_gr1al-lag12(&var_gr1al))/lag12(at_x);\n\t\t\tif count<=12 or lag12(at_x)<=0 then\n\t\t\t\t&name_gr1al._gr1a = .;\n\t\t%end;\n\t\t\n\t\t* 1yr Change Scaled by Average Assets;\n\t\t%let ch_asset_avg_vars = \n\t\t\tlnoa_x\n\t\t;\n\t\t%do i=1 %to %nwords(&ch_asset_avg_vars.);\n\t\t\t%let var_gr1aa = %scan(&ch_asset_avg_vars., &i, %str(' '));\n\t\t\t%let name_gr1aa = %sysfunc(tranwrd(&var_gr1aa, _x, %str()));  /* Remove '_x' from var name */\n\t\t\t&name_gr1aa._gr1a = (&var_gr1aa-lag12(&var_gr1aa))/(at_x + lag12(at_x));\n\t\t\tif count<=12 or (at_x + lag12(at_x))<=0 then\n\t\t\t\t&name_gr1aa._gr1a = .;\n\t\t%end;\n\t\t\n\t\t* CAPEX growth over 2 years;\n\t\t%var_growth(var_gr=capx, horizon=24);\n\t\t\t\n\t\t* Quarterly Profitability Measures;\n\t\tsaleq_gr1 = sale_qtr / lag12(sale_qtr) - 1;\n\t\tif count <= 12 or lag12(sale_qtr) < 0 then\n\t\t\tsaleq_gr1 = .;\n\t\tniq_be = ni_qtr / lag3(be_x);\n\t\tif count <= 3 or lag3(be_x) < 0 then\n\t\t\tniq_be = .;\n\t\tniq_at = ni_qtr / lag3(at_x);\n\t\tif count <= 3 or lag3(at_x) < 0 then\n\t\t\tniq_at = .;\n\t\tniq_be_chg1 = niq_be - lag12(niq_be);\n\t\tniq_at_chg1 = niq_at - lag12(niq_at);\n\t\tif count <= 12 then do;\n\t\t\tniq_be_chg1 = .;\n\t\t\tniq_at_chg1 = .;\n\t\tend;\n\t\t\n\t\t* R&D capital-to-assets;\n\t\trd5_at = (xrd + lag12(xrd)*0.8 + lag24(xrd)*0.6 + lag36(xrd)*0.4 + lag48(xrd)*0.2) / at_x;\n\t\tif count <= 48 or at_x <= 0 then\n\t\t\trd5_at = .;\n\t\t\t\n\t\t* Abarbanell and Bushee 
(1998);\n\t\t%chg_to_exp(var_ce = sale_x);\n\t\t%chg_to_exp(var_ce = invt);\n\t\t%chg_to_exp(var_ce = rect);\n\t\t%chg_to_exp(var_ce = gp_x);\n\t\t%chg_to_exp(var_ce = xsga);\n\t\t\n\t\tdsale_dinv = sale_ce - invt_ce;\n\t\tdsale_drec = sale_ce - rect_ce;\n\t\tdgp_dsale  = gp_ce - sale_ce;\n\t\tdsale_dsga = sale_ce - xsga_ce;\n\t\tdrop sale_ce invt_ce rect_ce gp_ce xsga_ce;\n\t\t\n\t\t* Earnings and Revenue 'Surprise';\n\t\t%standardized_unexpected(var=sale_qtr, qtrs = 8, qtrs_min = 6);\n\t\t%standardized_unexpected(var=ni_qtr, qtrs = 8, qtrs_min = 6);\n\t\t\t\n\t\t* Abnormal Corporate Investment;\n\t\t__capex_sale = capx / sale_x;\n\t\tif sale_x <= 0 then\n\t\t\t__capex_sale = .;\n\t\tcapex_abn = __capex_sale / ((lag12(__capex_sale) + lag24(__capex_sale) + lag36(__capex_sale)) / 3) - 1;\n\t\tif count <= 36 then \n\t\t\tcapex_abn = .;\n\t\tdrop __capex_sale;\n\t\t\n\t\t/* Profit scaled by lagged */\n\t\top_atl1 = op_x / lag12(at_x);\n\t\tif count <= 12 or lag12(at_x) <= 0 then\n\t\t\top_atl1 = .;\n\t\tgp_atl1 = gp_x / lag12(at_x);\n\t\tif count <= 12 or lag12(at_x) <= 0 then\n\t\t\tgp_atl1 = .;\n\t\tope_bel1 = ope_x / lag12(be_x);\n\t\tif count <= 12 or lag12(be_x) <= 0 then\n\t\t\tope_bel1 = .;\n\t\tcop_atl1 = cop_x / lag12(at_x);\n\t\tif count <= 12 or lag12(at_x) <= 0 then\n\t\t\tcop_atl1 = .;\n\t\t\t\n\t\t/* Profitability Measures*/\n\t\tpi_nix = pi_x / nix_x;\n\t\tif pi_x <= 0 or nix_x <= 0 then\n\t\t\tpi_nix = .;\n\t\tocf_at = ocf_x / at_x;\n\t\top_at = op_x / at_x;\n\t\tif at_x <= 0 then do;\n\t\t\tocf_at = .;\n\t\t\top_at = .;\n\t\tend;\n\t\tocf_at_chg1 = ocf_at - lag12(ocf_at);\n\t\tif count <= 12 then \n\t\t\tocf_at_chg1 = .;\n\t\t\n\t\t/* Book Leverage */\n\t\tat_be = at_x / be_x;\n\t\t\n\t\t\t/* Volatility Quarterly Items */\n\t\t__ocfq_saleq = ocf_qtr / sale_qtr;\n\t\t__niq_saleq = ni_qtr / sale_qtr;\n\t\tif sale_qtr <= 0 then do;\n\t\t\t__ocfq_saleq = .;\n\t\t\t__niq_saleq = .;\n\t\tend;\n\t\t__roeq = ni_qtr / be_x;\n\t\tif be_x <= 0 then \n\t\t\t__roeq 
= .;\n\t\t%volq(name = ocfq_saleq_std, var = __ocfq_saleq, qtrs = 16, qtrs_min = 8);\n\t\t%volq(name = niq_saleq_std, var = __niq_saleq, qtrs = 16, qtrs_min = 8);\n\t\t%volq(name = roeq_be_std, var = __roeq, qtrs = 20, qtrs_min = 12);\n\t\tdrop __ocfq_saleq __niq_saleq __roeq;\n\t\t\n\t\t/* Volatility Annual Items*/\n\t\t__roe = ni_x / be_x;\n\t\tif be_x <= 0 then\n\t\t\t__roe = .;\n\t\t%vola(name = roe_be_std, var = __roe, yrs = 5, yrs_min = 5);\n\t\tdrop __roe;\n\t\t\n\t\t/* Asset Tangibility */\n\t\ttangibility = (che + 0.715 * rect + 0.547 * invt + 0.535 * ppegt) / at_x;  \n\t\t\n\t\t* Earnings Smoothness;\n\t\t%earnings_variability(esm_h=5);\n\t\t\n\t\t* Asset Liquidity;\n\t\taliq_at = aliq_x / lag12(at_x);\n\t\tif count <= 12 or lag12(at_x) <= 0 then aliq_at = .;\n\t\t\n\t\t* Equity Duration Helper Variables;\n\t\t%equity_duration_cd(horizon=10, r=0.12, roe_mean=0.12, roe_ar1=0.57, g_mean=0.06, g_ar1=0.24);\n\t\t\n\t\t* Piotroski F-Score;\n\t\t%pitroski_f(name = f_score);\n\t\t\n\t\t* Ohlson (1980) O-score;\n\t\t%ohlson_o(name = o_score);\n\t\t\n\t\t* Altman (1968) Z-score;\n\t\t%altman_z(name = z_score);\n\t\t \n\t\t* Intrinsic ROE based value from Frankel and Lee (1998);\n\t\t%intrinsic_value(name = intrinsic_value, r=0.12);\n\t\t\n\t\t* Kaplan-Zingales Index;\n\t\t%kz_index(name=kz_index);\n\t\t\n\t\t* 5 year ratio change (For quality minus junk variables);\n\t\t%chg_var1_to_var2(name=gpoa_ch5, var1=gp_x, var2=at_x, horizon=60);\n\t\t%chg_var1_to_var2(name=roe_ch5, var1=ni_x, var2=be_x, horizon=60);\n\t\t%chg_var1_to_var2(name=roa_ch5, var1=ni_x, var2=at_x, horizon=60);\n\t\t%chg_var1_to_var2(name=cfoa_ch5, var1=ocf_x, var2=at_x, horizon=60);\n\t\t%chg_var1_to_var2(name=gmar_ch5, var1=gp_x, var2=sale_x, horizon=60);\n\t\t\n\t\t/* Delete Helper Variables */\n\t\tdrop count;\n\trun;\t\n\t\n\t/* Create earnings persistence */\n\t%earnings_persistence(out=earnings_pers, data=__chars5, __n=5, __min=5);\n\t\n\tproc sql;\n\t\tcreate table __chars6 as \n\t\tselect 
a.*, b.ni_ar1, b.ni_ivol\n\t\tfrom __chars5 as a left join earnings_pers as b\n\t\ton a.gvkey = b.gvkey and a.curcd=b.curcd and a.datadate=b.datadate;\n\tquit;\n\t\n\t/* Keep only dates with accounting data */\n\tdata __chars7;\n\t\tset __chars6;\n\t\twhere data_available=1;\n\trun;\n\t\n\t/* Expand by Public Availability */\n\t* Would be great to change start_date to filing_date or some derivative which is a function of fqtr;\n\tproc sort data=__chars7; by gvkey descending datadate; run;\n\tdata __chars8;\n\t\tset __chars7;\n\t\tby gvkey;\n\t\t\n\t\tstart_date = intnx('month', datadate, &lag_to_public.,'e'); format start_date YYMMDDN8.;\n\t\tnext_start_date = lag(start_date);\n\t\tif first.gvkey then \n\t\t\tnext_start_date=.;\n\t\tend_date = min(intnx('month', next_start_date, -1, 'e'), intnx('month', datadate, &max_data_lag., 'e')); format end_date YYMMDDN8.;\t\n\t\t\t\n\t\tdrop next_start_date;\t\n\trun;\n\t\n\t%expand(data=__chars8, out=__chars9, id_vars=gvkey, start_date=start_date, end_date=end_date, freq='month', new_date_name=public_date);\n\t\n\t/* Convert All Raw (non-scaled) Variables to USD [2]*/\n\t%compustat_fx(out=__fx);\n\tproc sql;\n\t\tcreate table __chars10 as \n\t\tselect a.*, b.fx \n\t\tfrom __chars9 as a left join __fx as b\n\t\ton a.curcd=b.curcdd and a.public_date=b.date;\n\tquit;\n\t\n\tdata __chars11;\n\t\tset __chars10;\n\t\tarray var_raw \n\t\t\tassets sales book_equity net_income;\n\t\tdo over var_raw;\n\t\t\tvar_raw = var_raw*fx;\n\t\tend;\n\t\tdrop curcd;\n\trun;\n\t\n\t/* Create Ratios using both Accounting and Market Values */\n\t* Note that valuation ratios are created at the company level;\n\tproc sql;\n\t\tcreate table __me_data1 as \n\t\tselect distinct gvkey, eom, me_company /* Include id for join with daily std */\n\t\tfrom &me_data.\n\t\twhere not missing(gvkey) and primary_sec=1 and not missing(me_company) and common=1 and obs_main=1 /* Notice, exch_main is not a requirement */\n\t\tgroup by gvkey, eom\n\t\thaving 
me_company=max(me_company);\n\tquit;\n\t\n\tproc sql; \n\t\tcreate table __chars12 as\n\t\tselect a.*, b.me_company\n\t\tfrom __chars11 as a left join __me_data1 as b\n\t\ton a.gvkey=b.gvkey and a.public_date=b.eom;\n\tquit;\n\t\n\tproc sort data=__chars12 out=__chars13 nodupkey; by gvkey public_date; run; /*THEIS: Global-> No duplicates US-> 3464 duplicate observations were deleted*/\n\n\tdata __chars14;\n\t\tset __chars13;\n\t\t/* Prepare Data */\n\t\tmev\t\t\t\t= me_company+netdebt_x*fx;\t\t\t  /* Enterprise Value (in Dollars) */\n\t\tmat             = at_x * fx - be_x * fx + me_company;  /* Market Asset Value */\n\t\tif mev <= 0 then mev = .;\n\t\tif me_company <= 0 then\tme_company = .;\n\t\tif mat <= 0 then mat = .;\n\t\t\n\t\t/* Characteristics Scaled by Market Equity */\n\t\t%let me_vars = at_x be_x debt_x netdebt_x che sale_x gp_x ebitda_x ebit_x ope_x ni_x nix_x cop_x  \n\t\t\tocf_x fcf_x div_x eqbb_x eqis_x eqpo_x eqnpo_x eqnetis_x\n\t\t\txrd;\n\t\t%do i=1 %to %nwords(&me_vars.);\n\t\t\t%let var_me = %scan(&me_vars., &i, %str(' '));\n\t\t\t%let name_me = %sysfunc(tranwrd(&var_me., _x, %str()));  /* Remove '_x' from var name */\n\t\t\t&name_me._me = (&var_me.*fx)/me_company;\n\t\t%end;\n\t\tival_me = (intrinsic_value*fx) / me_company;\n\t\t\n\t\t/* Characteristics Scaled by Market Enterprise Value */\n\t\t%let mev_vars = at_x bev_x ppent be_x che sale_x gp_x ebitda_x ebit_x ope_x ni_x nix_x cop_x ocf_x fcf_x\n\t\t\tdebt_x pstk_x dltt dlc dltnetis_x dstnetis_x dbnetis_x netis_x fincf_x;\n\t\t%do i=1 %to %nwords(&mev_vars.);\n\t\t\t%let var_mev = %scan(&mev_vars., &i, %str(' '));\n\t\t\t%let name_mev = %sysfunc(tranwrd(&var_mev., _x, %str()));  /* Remove '_x' from var name */\n\t\t\t&name_mev._mev = (&var_mev.*fx)/mev;\n\t\t%end;\n\t\t\n\t\t/* Characteristics Scaled by Market Assets */\n\t\taliq_mat = aliq_x * fx / lag12(mat);\n\t\tif gvkey ^= lag12(gvkey) then aliq_mat = .;\n\t\t\n\t\t/* Size Measure */\n\t\tenterprise_value = mev;\n\t\t\n\t\t/* Equity 
Duration */\n\t\teq_dur = ed_cd_w * fx / me_company + ed_constant * (me_company - ed_cd * fx) / me_company;\n\t\tif ed_err = 1 or eq_dur <= 0 then eq_dur = .;\n\trun;\n\t\n\t/* Format Output */\n\tproc sql noprint;\n\t\tselect name into :col_names separated by ' ' \n\t\tfrom dictionary.columns\n\t\twhere libname=upcase(\"work\") and memname = upcase(\"__chars14\");\n\tquit;\n\t\n\t\n\tdata __chars15;\n\t\tset __chars14;\n\t\t%do i=1 %to %nwords(&col_names.);\n\t\t\t%let old_name = %scan(&col_names., &i, %str(' '));\n\t\t\t%let new_name = %sysfunc(prxchange(s/xrd/rd/, 1, &old_name.));  /* Replace XRD with RD */\n\t\t\t%let new_name = %sysfunc(prxchange(s/xsga/sga/, 1, &new_name.));  /* Replace XSGA with SGA */\n\t\t\t%let new_name = %sysfunc(prxchange(s/dlc/debtst/, 1, &new_name.));  /* Replace DLC with DEBTST */\n\t\t\t%let new_name = %sysfunc(prxchange(s/dltt/debtlt/, 1, &new_name.));  /* Replace DLTT with DEBTLT */\n\t\t\t%let new_name = %sysfunc(prxchange(s/oancf/ocf/, 1, &new_name.));  /* Replace OANCF with OCF */\n\t\t\t%let new_name = %sysfunc(prxchange(s/ppegt/ppeg/, 1, &new_name.));  /* Replace PPEGT with PPEG */\n\t\t\t%let new_name = %sysfunc(prxchange(s/ppent/ppen/, 1, &new_name.));  /* Replace PPENT with PPEN */\n\t\t\t%let new_name = %sysfunc(prxchange(s/che/cash/, 1, &new_name.));  /* Replace CHE with CASH */\n\t\t\t%let new_name = %sysfunc(prxchange(s/invt/inv/, 1, &new_name.));  /* Replace INVT with INV */\n\t\t\t%let new_name = %sysfunc(prxchange(s/rect/rec/, 1, &new_name.));  /* Replace RECT with REC */\n\t\t\t%let new_name = %sysfunc(prxchange(s/txt/tax/, 1, &new_name.));  /* Replace TXT with TAX */\n\t\t\t%let new_name = %sysfunc(prxchange(s/ivao/lti/, 1, &new_name.));  /* Replace IVAO with LTI */\n\t\t\t%let new_name = %sysfunc(prxchange(s/ivst/sti/, 1, &new_name.));  /* Replace IVST with STI */\n\t\t\t%let new_name = %sysfunc(prxchange(s/sale_qtr/saleq/, 1, &new_name.));  /* Replace SALE_QTR with SALEQ */\n\t\t\t%let new_name = 
%sysfunc(prxchange(s/ni_qtr/niq/, 1, &new_name.));  /* Replace NI_QTR with NIQ */\n\t\t\t%let new_name = %sysfunc(prxchange(s/ocf_qtr/ocfq/, 1, &new_name.));  /* Replace OCF_QTR with OCFQ */\n\t\t\trename &old_name. = &new_name.;\n\t\t%end;\n\t\t\n\trun;\n\t\n\t* Reorder and Keep only Selected Columns;\n\tdata __chars16;\n\t\tretain source gvkey datadate public_date assets sales book_equity net_income enterprise_value;\n\t\tset __chars15;\n\t\tkeep source gvkey public_date datadate &__keep_vars.;\n\trun;\n\n\t* Add suffix if specified;\n\t%if %length(&suffix.)>0 %then %do;\n\t\tdata __chars16;\n\t\t\tset __chars16;\n\t\t\t%do i=1 %to %nwords(&__keep_vars.);\n\t\t\t\t%let var_x = %scan(&__keep_vars., &i, %str(' '));\n\t\t\t\trename &var_x. = &var_x.&suffix.;\n\t\t\t%end;\n\t\t\trename datadate=datadate&suffix.;\n\t\trun;\n\t%end;\n\t\n\tproc sort nodupkey data=__chars16 out=&out.; by gvkey public_date; run;\n\tproc delete data= __chars3 __chars4 __chars5 __chars6 __chars7 __chars8 \n\t__chars9 __chars10 __chars11 __chars12 __chars13 __chars14 __chars15 __chars16 __me_data1 __fx earnings_pers; run;\n%mend create_acc_chars;\n\n/* Combine Characteristics from Annual and Quarterly Data */\n%macro combine_ann_qtr_chars(out=, ann_data=, qtr_data=, __char_vars=, q_suffix=);\n\tproc sql;\n\t\tcreate table __acc_chars1 as \n\t\tselect a.*, b.*\n\t\tfrom &ann_data. as a left join &qtr_data. as b\n\t\ton a.gvkey=b.gvkey and a.public_date=b.public_date;\n\tquit;\n\t\n\t/* Substitute Annual Characteristic for Quarterly if Quarterly is more recent */\n\tdata __acc_chars2;\n\t\tset __acc_chars1;\n\t\t%do i=1 %to %nwords(&__char_vars.);\n\t\t\t%let ann_var = %scan(&__char_vars., &i.);\n\t\t\t%let qtr_var = &ann_var.&q_suffix.;\n\t\t\tif missing(&ann_var.) or (not missing(&qtr_var.) and datadate&q_suffix. > datadate) then /* Didn't include the first part before! */ \n\t\t\t\t&ann_var. 
= &qtr_var.;\n\t\t\tdrop &qtr_var.;\n\t\t%end;\n\t\tdrop datadate datadate&q_suffix.; /* We can no longer be sure which items accounting dates refer to */\n\trun;\n\t\n\tproc sort nodupkey data=__acc_chars2 out=&out; by gvkey public_date; run;\n\t\n\tproc delete data=__acc_chars1 __acc_chars2; run;\n%mend combine_ann_qtr_chars;\n"
  },
  {
    "path": "GlobalFactors/char_macros.sas",
"content": "/* MACROS USING COMPOSITE DATA */\n/* MACRO: MISPRICING_FACTORS \n- Based on the paper by Stambaugh and Yuan (2017)\n- Currently, the distress probability anomaly is not implemented.\n- I use fractional ranks, i.e. ranks from 0 to 1. I think they use absolute ranks in the paper.\n*/\n%macro mispricing_factors(out=, data=, min_stks=, min_fcts=3);\n\tproc sql;\n\t\tcreate table chars1 as \n\t\tselect id, eom, excntry, chcsho_12m, eqnpo_12m, oaccruals_at, noa_at, at_gr1, ppeinv_gr1a, \n\t\t\to_score, ret_12_1, gp_at, niq_at\n\t\tfrom &data.\n\t\twhere common=1 and primary_sec=1 and obs_main=1 and exch_main = 1 and not missing(ret_exc) and not missing(me)\n\t\torder by excntry, eom;\n\tquit;\n\t%let __vars = chcsho_12m eqnpo_12m oaccruals_at noa_at at_gr1 ppeinv_gr1a o_score ret_12_1 gp_at niq_at;\n\t%let __direction = -1 1 -1 -1 -1 -1 -1 1 1 1;\n\t%do i=1 %to 10;\n\t\t%let __v = %scan(&__vars., &i, %str(' '));\n\t\t%let __d = %scan(&__direction., &i, %str(' '));\n\t\t%if &__d. = 1 %then %do;\n\t\t\t%let __sort=;\n\t\t%end;\n\t\t%else %do;\n\t\t\t%let __sort=descending;\n\t\t%end;\n\t\tproc sql;\n\t\t\tcreate table __subset as \n\t\t\tselect *\n\t\t\tfrom chars1\n\t\t\tgroup by excntry, eom\n\t\t\thaving count(&__v.) >= &min_stks.;\n\t\tquit;\n\t\tproc rank data=__subset out = __ranks(keep=excntry id eom rank_&__v.) &__sort. ties=mean f;\n\t\t\tby excntry eom;\n\t\t\tvar &__v.;\n\t\t\tranks rank_&__v.;\n\t\trun;\n\t\tproc sql;\n\t\t\tcreate table chars%eval(&i.+1) as\n\t\t\tselect a.*, b.rank_&__v.\n\t\t\tfrom chars&i. as a left join __ranks as b\n\t\t\ton a.id=b.id and a.eom=b.eom;\n\t\tquit;\n\t%end;\n\tdata &out.;\n\t\tset chars11;\n\t\tmispricing_perf = mean(rank_o_score, rank_ret_12_1, rank_gp_at, rank_niq_at);\n\t\tif missing(rank_o_score) + missing(rank_ret_12_1) + missing(rank_gp_at) + missing(rank_niq_at) > &min_fcts. 
then\n\t\t\tmispricing_perf = .;\n\t\tmispricing_mgmt = mean(rank_chcsho_12m, rank_eqnpo_12m, rank_oaccruals_at, rank_noa_at, rank_at_gr1, rank_ppeinv_gr1a);\n\t\tif missing(rank_chcsho_12m) + missing(rank_eqnpo_12m) + missing(rank_oaccruals_at) + missing(rank_noa_at) + missing(rank_at_gr1) + missing(rank_ppeinv_gr1a) > &min_fcts. then\n\t\t\tmispricing_mgmt = .;\t\n\t\tkeep id eom mispricing_perf mispricing_mgmt;\n\trun;\n%mend;\n\n* MACRO: QUALITY MINUS JUNK\n- Based on the paper by Asness, Frazzini and Pedersen (2018)\n- I deviate slightly from the original paper in the variable construction\n- The clearest deviation is the way the growth variables are created.\n;\n%macro quality_minus_junk(out=, data=, min_stks=);\n\t/* Helper Macro */\n\t%macro z_ranks(out=, data=, var=, min=, sort=);\n\t\tproc sql;\n\t\t\tcreate table __subset as \n\t\t\tselect *\n\t\t\tfrom &data.\n\t\t\tgroup by excntry, eom\n\t\t\thaving count(&var.) >= &min.;\n\t\tquit;\n\t\tproc rank data=__subset out = __ranks(keep=excntry id eom rank_&var.) &sort. ties=mean;\n\t\t\tby excntry eom;\n\t\t\tvar &var.;\n\t\t\tranks rank_&var.;\n\t\trun;\n\t\tproc sql;\n\t\t\tcreate table &out. as \n\t\t\tselect excntry, id, eom, (rank_&var. - mean(rank_&var.)) / std(rank_&var.) as z_&var.\n\t\t\tfrom __ranks\n\t\t\twhere not missing(rank_&var.)\n\t\t\tgroup by excntry, eom;\n\t\tquit;\n\t\tproc delete data=__subset __ranks; run;\n\t%mend;\n\tproc sql;\n\t\tcreate table qmj1 as \n\t\tselect id, eom, excntry, coalesce(roeq_be_std*2, roe_be_std) as __evol, /* I multiply the quarterly measure by sqrt(4)=2 to reflect that quarterly measures are less volatile than the annual measure. Empirically, this seems to be a reasonable approximation although perhaps a slightly higher multiplier could be used e.g. 
2.5*/\n\t\t\tgp_at, ni_be, ni_at, ocf_at, gp_sale, oaccruals_at, gpoa_ch5, roe_ch5, roa_ch5, cfoa_ch5, \n\t\t\tgmar_ch5, betabab_1260d, debt_at, o_score, z_score\n\t\tfrom &data.\n\t\twhere common=1 and primary_sec=1 and obs_main=1 and exch_main=1 and not missing(ret_exc) and not missing(me)\n\t\torder by excntry, eom;\n\tquit;\n\t%let z_vars    = gp_at ni_be ni_at ocf_at gp_sale oaccruals_at\n\t                 gpoa_ch5  roe_ch5 roa_ch5 cfoa_ch5 gmar_ch5\n\t                 betabab_1260d debt_at o_score z_score __evol;  \n\t%let direction = 1 1 1 1 1 -1\n\t                 1 1 1 1 1\n\t                 -1 -1 -1 1 -1;\n\t%do i=1 %to 16;\n\t\t%let __v = %scan(&z_vars., &i, %str(' '));\n\t\t%let __d = %scan(&direction., &i, %str(' '));\n\t\t%if &__d. = 1 %then %do;\n\t\t\t%let __sort=;\n\t\t%end;\n\t\t%else %do;\n\t\t\t%let __sort=descending;\n\t\t%end;\n\t\t%z_ranks(out=__z, data=qmj1, var = &__v., min=&min_stks., sort=&__sort.);\n\t\tproc sql;\n\t\t\tcreate table qmj%eval(&i.+1) as\n\t\t\tselect a.*, b.z_&__v.\n\t\t\tfrom qmj&i. 
as a left join __z as b\n\t\t\ton a.id=b.id and a.eom=b.eom;\n\t\tquit;\n\t\t%if &i.>1 %then %do;\n\t\t\tproc delete data=qmj&i.; run;\n\t\t%end;\n\t%end;\n\tdata qmj18;\n\t\tset qmj17;\n\t\t__prof = mean(z_gp_at, z_ni_be, z_ni_at, z_ocf_at, z_gp_sale, z_oaccruals_at);\n\t\t__growth = mean(z_gpoa_ch5, z_roe_ch5, z_roa_ch5, z_cfoa_ch5, z_gmar_ch5);\n\t\t__safety = mean(z_betabab_1260d, z_debt_at, z_o_score, z_z_score, z___evol);\n\t\tkeep excntry id eom __prof __growth __safety;\n\trun;\n\t%z_ranks(out=__prof, data=qmj18, var = __prof, min=&min_stks., sort=);\n\t%z_ranks(out=__growth, data=qmj18, var = __growth, min=&min_stks., sort=);\n\t%z_ranks(out=__safety, data=qmj18, var = __safety, min=&min_stks., sort=);\n\tproc sql;\n\t\tcreate table qmj19 as \n\t\tselect a.excntry, a.id, a.eom, b.z___prof as qmj_prof, c.z___growth as qmj_growth, d.z___safety as qmj_safety\n\t\tfrom qmj18 as a \n\t\tleft join __prof as b on a.excntry=b.excntry and a.id=b.id and a.eom=b.eom\n\t\tleft join __growth as c on a.excntry=c.excntry and a.id=c.id and a.eom=c.eom\n\t\tleft join __safety as d on a.excntry=d.excntry and a.id=d.id and a.eom=d.eom;\n\tquit;\n\t/* QMJ SCORE! */\n\tdata qmj20;\n\t\tset qmj19;\n\t\t__qmj = (qmj_prof + qmj_growth + qmj_safety) / 3; * Missing if any of subcomponents are missing;\n\trun;\n\t%z_ranks(out=__qmj, data=qmj20, var=__qmj, min = &min_stks., sort=);\n\tproc sql;\n\t\tcreate table &out. 
as \n\t\tselect a.excntry, a.id, a.eom, a.qmj_prof, a.qmj_growth, a.qmj_safety, b.z___qmj as qmj\n\t\tfrom qmj20 as a left join __qmj as b \n\t\ton a.excntry=b.excntry and a.id=b.id and a.eom=b.eom;\n\tquit;\n\tproc delete data=qmj1 qmj17 qmj18 qmj19 qmj20; run;\n%mend;\n\n/* MACRO USING RETURN DATA -------------------------------------------------------*/\n* MACRO: BIDASK_HL -------------------------\n- Corwin-Schultz High-Low Bid-ask Estimator\n- Heavily inspired by Shane Corwin's code: http://sites.nd.edu/scorwin/files/2019/12/Sample-SAS-Program.pdf\n- Primary change: I adjust prices for stock splits\n- Arguments:\n\t* OUT: Output dataset containing estimates of average monthly bid-ask spread and return volatility\n\t* DATA: Input dataset with high and low prices\n\t* __min_obs: Minimum number of daily observations required to compute monthly estimates;\n%macro bidask_hl(out=, data=, __min_obs=);\n\tproc sql;\n\t\tcreate table __dsf1 as \n\t\tselect a.id, a.date, a.eom, a.bidask, a.tvol, \n\t\t\ta.prc / a.adjfct as prc, a.prc_high / a.adjfct as prc_high, a.prc_low / a.adjfct as prc_low /* Adjust price for stock splits! */\n\t\tfrom &data. 
as a left join scratch.market_returns_daily as b\n\t\ton a.excntry=b.excntry and a.date=b.date\n\t\twhere not missing(b.mkt_vw_exc)  /* This ensures that we look at trading days */\n\t\torder by id, date;\n\tquit;\n\t\n\t* Cleaning data;\n\tdata __dsf2(drop=prc_low_r prc_high_r);\n\t\tretain prc_low_r prc_high_r;\n\t\tset __dsf1;\n\t\tby id date eom;\n\t\t* Keep initial values;\n\t\tprc_low_in = prc_low;\n\t\tprc_high_in = prc_high;\n\t\thlreset = 0;\n\t\t* Initial Screens;\n\t\tif bidask = 1 or prc_low=prc_high or prc_low<=0 or prc_high<=0 or tvol=0 then do; \n\t\t\tprc_high = .;\n\t\t\tprc_low = .;\n\t\tend;\n\t\t/* Replace bad/missing price with previous day's range */\n\t\tif first.id then do;\n\t\t\tprc_low_r = .;\n\t\t\tprc_high_r = .;\n\t\tend;\n\t\t* Reset retained high and low prices;\n\t\tif 0<prc_low<prc_high then do;\n\t\t\tprc_low_r=prc_low; \n\t\t\tprc_high_r=prc_high;\n\t\tend;\n\t\t* Replace missing/bad high and low prices with retained values;\n\t\telse do;\n\t\t\t* Replace if within prior day's range;\n\t\t\tif prc_low_r <= prc <= prc_high_r then do;\n\t\t\t\tprc_low = prc_low_r;\n\t\t\t\tprc_high = prc_high_r;\n\t\t\t\thlreset = 1;\n\t\t\tend;\n\t\t\t* Replace if below prior day's range;\n\t\t\tif prc < prc_low_r then do;\n\t\t\t\tprc_low = prc;\n\t\t\t\tprc_high = prc_high_r - (prc_low_r-prc);\n\t\t\t\thlreset = 2;\n\t\t\tend;\n\t\t\t* Replace if above prior day's range;\n\t\t\tif prc > prc_high_r then do;\n\t\t\t\tprc_low = prc_low_r + (prc - prc_high_r);\n\t\t\t\tprc_high = prc;\n\t\t\t\thlreset = 3;\n\t\t\tend;\n\t\tend;\n\t\t/* Final data screen after H/L reset */\n\t\tif prc_low ^= 0 and prc_high/prc_low > 8 then do;\n\t\t\tprc_low = .;\n\t\t\tprc_high = .;\n\t\tend;\n\trun;\n\t\n\t/* Adjust for overnight returns */\n\tdata __dsf3;\n\t\tset __dsf2;\n\t\tretadj = 0;\n\t\tprc_low_t = prc_low;\n\t\tprc_high_t = prc_high;\n\t\tprc_low_l1 = lag(prc_low);\n\t\tprc_high_l1 = lag(prc_high);\n\t\tprc_l1 = lag(prc);\n\t\tif id ^= lag(id) then 
do;\n\t\t\tprc_low_l1 = .;\n\t\t\tprc_high_l1 = .;\n\t\t\tprc_l1 = .;\n\t\tend;\n\t\tif prc_l1<prc_low and prc_l1>0 then do; * Adjust when prior close is below current low;\n\t\t\tprc_high_t=prc_high-(prc_low-prc_l1); \n\t\t\tprc_low_t=prc_l1; \n\t\t\tretadj=1;\n\t \tend;\n\t\tif prc_l1>prc_high and prc_l1>0 then do; * Adjust when prior close is above current high;\n\t\t\tprc_high_t=prc_l1; \n\t\t\tprc_low_t=prc_low+(prc_l1-prc_high); \n\t\t\tretadj=2;\n\t\tend;\n\trun;\n\t\n\t/* Calculate daily high/low bid-ask spread*/\n\tdata __dsf4; \n\t\tset __dsf3;\n\t \tpi=constant('PI');\n\t \tk2 = sqrt(8/pi);\n\t \tconst = 3-2*sqrt(2);\n\t \tprc_high_2d=max(prc_high_t,prc_high_l1);\n\t \tprc_low_2d=min(prc_low_t,prc_low_l1);\n\t \tif prc_low_t>0 and prc_low_l1>0 then \n\t \t\tbeta = (log(prc_high_t/prc_low_t))**2+(log(prc_high_l1/prc_low_l1))**2;\n\t \tif prc_low_2d>0 then \n\t \t\tgamma = (log(prc_high_2d/prc_low_2d))**2;\n\t \talpha = (sqrt(2*beta)-sqrt(beta))/const - sqrt(gamma/const);\n\t \t* Calculate spread with missing set to zero;\n\t \tspread = 2*(exp(alpha)-1)/(1+exp(alpha));\n\t \tspread_0 = max(spread,0); * Set negative spread estimates to zero;\n\t \tif spread = . then \n\t \t\tspread_0 = .;\n\t \t* Calculate daily volatility;\n\t \tsigma = ((sqrt(beta/2)-sqrt(beta)))/(k2*const)+sqrt(gamma/(k2*k2*const));\n\t \tsigma_0 = max(sigma,0); * Set negative sigma estimates to zero;\n\t \tif sigma= . then \n\t \t\tsigma_0 = .;\n\trun;\n\t\n\t/* Monthly bid-ask estimates */\n\tproc sql;\n\t\tcreate table &out. 
as \n\t\tselect id, eom, mean(spread_0) as bidaskhl_21d, mean(sigma_0) as rvolhl_21d\n\t\tfrom __dsf4\n\t\tgroup by id, eom\n\t\thaving count(spread_0) > &__min_obs.;\n\tquit;\n\t\n\tproc delete data=__dsf1 __dsf2 __dsf3 __dsf4; run;\n%mend;\n\n\n* MACRO: SEASONALITY\n- Calculates annual and non-annual seasonality measures following Heston and Sadka (2008)\n- Specifically, calculate the average return over annual and non-annual lags within the specified\n  start and end dates\n- Within a given year, the annual lag is lag11 and the non-annual lags are lag0-lag10. \n- For return predictability, the information should be used to form portfolios at the end of lag0\n- Said differently, the seasonality variables should be lagged 1 period relative to returns\n;\n%macro seasonality(start_year=, end_year=);\n\t* Return over all lags;\n\t__all_ret = 0;\n\t__all_n = 0;\n\t%do i = %eval((&start_year.-1) * 12) %to %eval(&end_year. * 12 - 1);\n\t\t__all_ret = __all_ret + lag&i.(ret_x);\n\t\t__all_n = __all_n + 1;\n\t%end;\n\t* Return over annual lags;\n\t__an_ret = 0;\n\t__an_n = 0;\n\t%do i = %eval(&start_year.) %to &end_year.;\n\t\t%let __al = %eval(&i. * 12 - 1);  \n\t\t__an_ret = __an_ret + lag&__al.(ret_x);\n\t\t__an_n = __an_n + 1;\n\t%end;\n\t* Return over non-annual lags;\n\t__na_ret = __all_ret - __an_ret;\n\t__na_n = __all_n - __an_n;\n\t* Create Variables;\n\tseas_&start_year._&end_year.an = __an_ret / __an_n;\n\tseas_&start_year._&end_year.na = __na_ret / __na_n;\n\tif count < %eval(&end_year. 
* 12) then do;\n\t\tseas_&start_year._&end_year.an = .;\n\t\tseas_&start_year._&end_year.na = .;\t\n\tend;\n\tdrop __all_ret __all_n __an_ret __an_n __na_ret __na_n;\n%mend;\n\n/* MACRO USING ACCOUNTING DATA -------------------------------------------------------*/\n* Create Growth in Variable over horizon;\n%macro var_growth(var_gr=, horizon=); /* Horizon is in months */\n\t%let name_gr = %sysfunc(tranwrd(&var_gr., _x, %str()));  /* Remove '_x' from var name */\n\t%let name_gr = &name_gr._gr%sysevalf(&horizon./12);      /* Add gr and horizon in years to name */ \n\t&name_gr. = &var_gr./lag&horizon.(&var_gr.)-1;\n\tif count<=&horizon. or lag&horizon.(&var_gr.)<=0 then\n\t\t\t&name_gr. = .; \n%mend;\n\n* Change in Variable over Horizon Scaled by Assets;\n%macro chg_to_assets(var_gra=, horizon=); /* Horizon is in months */\n\t%let name_gra = %sysfunc(tranwrd(&var_gra., _x, %str()));  /* Remove '_x' from var name */\n\t%let name_gra = &name_gra._gr%sysevalf(&horizon./12);      /* Add gr and horizon in years to name */ \n\t%let name_gra = &name_gra.a;                                /* Add 'a' at the end */\n\t&name_gra. = (&var_gra.-lag&horizon.(&var_gra.))/at_x;\n\tif count<=&horizon. or at_x<=0 then\n\t\t\t&name_gra. = .; \n%mend;\n\n* Ratio Change;\n%macro chg_var1_to_var2(name=, var1=, var2=, horizon=);\n\t__x = &var1. / &var2.;\n\tif &var2. <= 0 then \n\t\t__x=.;\n\t&name. = (__x - lag&horizon.(__x));\n\tif count <= &horizon. then \n\t\t&name. = .;\n\tdrop __x;\n%mend;\n\n* Change to expectations (Abarbanell and Bushee, 1998);\n%macro chg_to_exp(var_ce=);\n\t%let name_ce = %sysfunc(tranwrd(&var_ce., _x, %str()));  /* Remove '_x' from var name */\n\t%let name_ce = &name_ce._ce;\n\t__expect = (lag12(&var_ce.) + lag24(&var_ce.))/2;\n\t&name_ce. = &var_ce. / (__expect) - 1;\n\tif count <= 24 or __expect <= 0 then\n\t\t&name_ce. 
= .;\n\tdrop __expect;\n%mend;\n\n* Standardized Unexpected Realization;\n* Uses the specification in Jegadeesh and Livnat (2006);\n%macro standardized_unexpected(var=, qtrs=, qtrs_min=);\n\t%let name = %sysfunc(tranwrd(&var., _x, %str()));  /* Remove '_x' from var name */\n\t%let name = &name._su;  \n\t__chg = &var. - lag12(&var.);\n\t__chg_mean = %apply_to_lastq(x = __chg, _qtrs = &qtrs., func = mean);\n\t__chg_std = %apply_to_lastq(x = __chg, _qtrs = &qtrs., func = std);\n\t__chg_n = %apply_to_lastq(x = not missing(__chg), _qtrs = &qtrs., func = sum);\n\tif __chg_n <= &qtrs_min. then do;\n\t\t__chg_mean = .;\n\t\t__chg_std = .;\n\tend;\n\t&name. = (&var. - (lag12(&var.) + lag3(__chg_mean) )) / lag3(__chg_std); /* This is the correct one*/\n\tif count <= %eval(12 + &qtrs.*3) then\n\t\t&name. = .;\n\tdrop __chg __chg_mean __chg_std __chg_n;\n%mend;\n\n* Volatility of Quarterly Data;\n%macro volq(name=, var=, qtrs=, qtrs_min=);\n\t__n = %apply_to_lastq(x = not missing(&var.), _qtrs = &qtrs., func = sum);\n\t&name. = %apply_to_lastq(x = &var., _qtrs = &qtrs., func = std);\n\tif count <= %eval((&qtrs.-1)*3) or __n < &qtrs_min. then\n\t\t&name. = .;\n\tdrop __n;\n%mend;\n\n* Volatility of Annual Data;\n%macro vola(name=, var=, yrs=, yrs_min=);\n\t__n = %apply_to_lasty(x = not missing(&var.), yrs = &yrs., func = sum);\n\t&name. = %apply_to_lasty(x = &var., yrs = &yrs., func = std);\n\tif count <= %eval((&yrs.-1)*12) or __n < &yrs_min. then\n\t\t&name. = .;\n\tdrop __n;\n%mend;\n\n* Earnings Smoothness;\n%macro earnings_variability(esm_h=);\n\t__roa = ni_x / lag12(at_x);\n\t__croa = ocf_x / lag12(at_x);\n\t__roa_n = %apply_to_lasty(x= not missing(__roa), yrs=&esm_h., func=sum); \n\t__croa_n = %apply_to_lasty(x= not missing(__croa), yrs=&esm_h., func=sum);\n\t__roa_std = %apply_to_lasty(x=__roa, yrs=&esm_h., func=std);\n\t__croa_std = %apply_to_lasty(x=__croa, yrs=&esm_h., func=std);\n\tearnings_variability = __roa_std / __croa_std;\n\tif count <= %eval(&esm_h. 
* 12) or __croa_std <= 0 or __roa_n < &esm_h. or __croa_n < &esm_h. then\n\t\tearnings_variability = .;\n\t drop __roa __croa  __roa_n __croa_n __roa_std __croa_std;\n%mend;\n\n/* Equity Duration: Forecast of Cash Distribution */\n%macro equity_duration_cd(horizon=, r=, roe_mean=, roe_ar1=, g_mean=, g_ar1=);\n\t* Create Initial Variables;\n\t__roe0 = ni_x / lag12(be_x);\n\t__g0 = sale_x / lag12(sale_x) - 1;\n\t__be0 = be_x;\n\tif count <= 12 or lag12(be_x) <= 1 then __roe0 = .; /* Use 1 million to avoid bad estimates from a small denominator */\n\tif count <= 12 or lag12(sale_x) <= 1 then __g0 = .; /* Use 1 million to avoid bad estimates from a small denominator */\n\t* Forecast Cash Distributions;\n\t%let roe_c = &roe_mean.*(1 - &roe_ar1.);\n\t%let g_c = &g_mean.*(1 - &g_ar1.);\n\t%do i = 1 %to &horizon.;\n\t\t%let j = %eval(&i.-1);\n\t\t__roe&i. = &roe_c. + &roe_ar1. * __roe&j.;\n\t\t__g&i. = &g_c. + &g_ar1. * __g&j.;\n\t\t__be&i. = __be&j. * (1 + __g&i.);\n\t\t__cd&i. = __be&j. * (__roe&i. - __g&i.);\n\t%end;\n\t* Create Duration Helper Variables;\n\ted_constant = &horizon. + (1 + &r.) / &r.;\n\ted_cd_w = 0;\n\ted_cd = 0;\n\ted_err = 0;\n\t%do t = 1 %to &horizon.;\n\t\ted_cd_w = ed_cd_w + &t. * __cd&t. / (1 + &r.)**&t.;\n\t\ted_cd = ed_cd + __cd&t. / (1 + &r.)**&t.;\n\t\tif __be&t. 
< 0 then ed_err = 1;\n\t%end;\n\tdrop __roe: __g: __be: __cd:; \n%mend;\n\n* Piotroski (2000) Fundamental Score;\n%macro pitroski_f(name=);\n\t__f_roa = ni_x / lag12(at_x);\n\tif count <= 12 or lag12(at_x) <= 0 then __f_roa = .;\n\t__f_croa = ocf_x / lag12(at_x);\n\tif count <= 12 or lag12(at_x) <= 0 then __f_croa = .;\n\t__f_droa = __f_roa - lag12(__f_roa);\n\tif count <= 12 then __f_droa = .;\n\t__f_acc = __f_croa - __f_roa;\n\t__f_lev = dltt / at_x - lag12(dltt / at_x);\n\tif count <= 12 or at_x <= 0 or lag12(at_x) <= 0 then __f_lev = .;\n\t__f_liq = ca_x / cl_x - lag12(ca_x / cl_x);\n\tif count <= 12 or cl_x <= 0 or lag12(cl_x) <= 0 then __f_liq = .;\n\t__f_eqis = eqis_x;\n\t__f_gm = gp_x / sale_x - lag12(gp_x / sale_x);\n\tif count <= 12 or sale_x <= 0 or lag12(sale_x) <= 0 then __f_gm = .;\n\t__f_aturn = sale_x / lag12(at_x) - lag12(sale_x) / lag24(at_x); \n\tif count <= 24 or lag12(at_x) <= 0 or lag24(at_x) <= 0 then __f_aturn = .; \n\t&name. = (__f_roa > 0) + (__f_croa > 0) + (__f_droa > 0) + (__f_acc > 0) +\n\t\t\t  (__f_lev < 0) + (__f_liq > 0) + (coalesce(__f_eqis, 0) = 0) +  /* Set __f_eqis to zero if missing. This greatly expands coverage and seems like a reasonable approximation */ \n\t\t\t  (__f_gm > 0) + (__f_aturn > 0);\n\t* Only allow __f_eqis to be missing;\t\t  \n\tif missing(__f_roa) or missing(__f_croa) or missing(__f_droa) or missing(__f_acc) or\n\t   missing(__f_lev) or missing(__f_liq) or missing(__f_gm) or missing(__f_aturn) then &name. 
= .;\n\tdrop __f_:;\n%mend;\n\n* Ohlson (1980) O-score;\n%macro ohlson_o(name=);\n\t* Create Helpers;\n\t__o_lat = log(at_x);\n\t__o_lev = debt_x / at_x;\n\t__o_wc = (ca_x - cl_x) / at_x;\n\t__o_roe = nix_x / at_x;\n\tif at_x <= 0 then do;\n\t\t__o_lat = .;\n\t\t__o_lev = .;\n\t\t__o_wc = .;\n\t\t__o_roe = .;\n\tend;\n\t__o_cacl = cl_x / ca_x;\n\tif ca_x <= 0 then __o_cacl = .;\n\t__o_ffo = (pi_x + dp) / lt;\n\tif lt <= 0 then __o_ffo = .;\n\t__o_neg_eq = lt > at_x;\n\tif missing(lt) or missing(at_x) then __o_neg_eq = .;\n\t__o_neg_earn = (nix_x < 0 and lag12(nix_x) < 0);\n\tif count <= 12 or missing(nix_x) or missing(lag12(nix_x)) then __o_neg_earn = .;\n\t__o_nich = (nix_x - lag12(nix_x)) / (abs(nix_x) + abs(lag12(nix_x)));\n\tif count <= 12 or (abs(nix_x) + abs(lag12(nix_x))) = 0 then __o_nich = .;\n\t* Create O-score;\n\t&name. = -1.32 - 0.407 * __o_lat + 6.03 * __o_lev - 1.43 * __o_wc\n\t          + 0.076 * __o_cacl - 1.72 * __o_neg_eq - 2.37 * __o_roe\n\t          - 1.83 * __o_ffo + 0.285 * __o_neg_earn - 0.52 * __o_nich;\n%mend;\n\n* Altman (1968) Z-score;\n%macro altman_z(name=);\n\t* Create Helpers;\n\t__z_wc = (ca_x - cl_x) / at_x;\n\t__z_re = re / at_x;\n\t__z_eb = ebitda_x / at_x;\n\t__z_sa = sale_x / at_x;\n\tif at_x <= 0 then do;\n\t\t__z_wc = .;\n\t\t__z_re = .;\n\t\t__z_eb = .;\n\t\t__z_sa = .;\n\tend;\n\t__z_me = me_fiscal / lt;\n\tif lt <= 0 then\t__z_me = .;\n\t* Create Temporary Z-score;\n\t&name. = 1.2 * __z_wc + 1.4 * __z_re + 3.3 * __z_eb + 0.6 * __z_me + 1.0 * __z_sa;\n\tdrop __z:;\n%mend;\n\n* Intrinsic ROE-based value from Frankel and Lee (1998);\n%macro intrinsic_value(name=, r=);\n\t__iv_po = div_x/nix_x;\n\tif nix_x <= 0 then\n\t\t__iv_po = div_x / (at_x * 0.06);\n\t__iv_roe = nix_x / ((be_x + lag12(be_x)) / 2);\n\tif count <= 12 or (be_x + lag12(be_x)) <= 0 then __iv_roe = .;\n\t__iv_be1 = (1 + (1 - __iv_po) * __iv_roe) * be_x;\n\t&name. = be_x + (__iv_roe - &r.) / (1 + &r.) * be_x \n\t\t\t\t\t+ (__iv_roe - &r.) / ((1 + &r.) * &r.) 
* __iv_be1; \n\t* If Intrinsic value is Non-Positive, set to missing;\n\tif &name. <= 0 then &name. = .;\n\tdrop __iv:;\n%mend;\n\n* Kaplan-Zingales Index;\n%macro kz_index(name=);\n\t* Create Helper Variables;\n\t__kz_cf = (ni_x + dp) / lag12(ppent);\n\tif count <= 12 or lag12(ppent)<=0 then __kz_cf = .;\n\t__kz_q = (at_x + me_fiscal - be_x) / at_x;\n\tif at_x <= 0 then __kz_q = .;\n\t__kz_db = debt_x / (debt_x + seq_x);\n\tif (debt_x + seq_x) = 0 then __kz_db = .;\n\t__kz_dv = div_x / lag12(ppent);\n\tif count <= 12 or lag12(ppent)<=0 then __kz_dv = .;\n\t__kz_cs = che / lag12(ppent);\n\tif count <= 12 or lag12(ppent)<=0 then __kz_cs = .;\n\t* Create Variable;\n\t&name. = - 1.002 * __kz_cf + 0.283 * __kz_q + 3.139 * __kz_db\n\t         - 39.368 * __kz_dv - 1.315 * __kz_cs;\n%mend;\n\n/* Earnings Predictability/Persistence */\n* I scale net income by total assets to account for issuance activity.;\n%macro earnings_persistence(out=, data=, __n=, __min=);\n\t%let __months = %eval(&__n. * 12);\n\tproc sort data=&data. out=__acc1; by gvkey curcd datadate; run;\n\tdata __acc2;\n\t\tset __acc1;\n\t\tby gvkey curcd;\n\t\tretain count;\n\t\tif first.curcd then \n\t\t\tcount = 1;\n\t\telse\n\t\t\tcount = count+1;\n\trun;\n\tdata __acc3;\n\t\tset __acc2;\n\t\t__ni_at = ni_x / at_x;\n\t\tif at_x <= 0 then\n\t\t\t__ni_at = .;\n\t\t__ni_at_l1 = lag12(__ni_at);\n\t\tif count<=12 then\n\t\t\t__ni_at_l1 =.;\t\n\trun;\n\tproc sql;\n\t\tcreate table __acc4 as \n\t\tselect gvkey, curcd, datadate, __ni_at, __ni_at_l1\n\t\tfrom __acc3\n\t\twhere not missing(__ni_at) and not missing(__ni_at_l1);\n\tquit;\n\t\n\tproc sql;\n\t\tcreate table month_ends as \n\t\tselect distinct datadate\n\t\tfrom __acc4\n\t\torder by datadate;\n\tquit;\n\t\n\t* Divide dates into __months groups;\n\tproc sql;\n\t\tcreate table dates_apply as \n\t\tselect *, mod(monotonic(), &__months.) as grp\n\t\tfrom month_ends;\n\tquit;\n\t\n\t* Helper macro: If first group, save &new. as &base., otherwise append &new. 
to &base.;\n\t%macro save_or_append(base=, new=);\n\t\t%if &__grp. = 0 %then %do;\n\t\t\tdata &base.; set &new.; run;\n\t\t%end;\n\t\t%else %do;\n\t\t\tproc append base=&base. data=&new.; run;\n\t\t%end;\n\t%mend;\n\t\n\t%do __grp=0 %to %eval(&__months. - 1); \n\t\t%put ############### GROUP %eval(&__grp.+1) out of &__months. ###############; \n\t\t* Prepare data;\n\t\tproc sql;\n\t\t\tcreate table calc_dates as\n\t\t\tselect a.datadate, b.datadate as calc_date\n\t\t\tfrom dates_apply as a left join dates_apply(where=(grp = &__grp.)) as b\n\t\t\ton a.datadate > intnx(\"year\", b.datadate, -&__n., \"e\") and a.datadate <= b.datadate and month(a.datadate) = month(b.datadate); /* month(*) ensures annual lags*/\n\t\tquit;\n\t\t\n\t\tproc sql;\n\t\t\tcreate table calc_data as \n\t\t\tselect a.*, b.calc_date\n\t\t\tfrom __acc4 as a left join calc_dates as b\n\t\t\ton a.datadate = b.datadate\n\t\t\twhere not missing(b.calc_date)  \n\t\t\tgroup by a.gvkey, a.curcd, b.calc_date\n\t\t\thaving count(*) >= &__min.\n\t\t\torder by a.gvkey, b.calc_date;\n\t\tquit;\n\t\t\n\t\tproc reg data=calc_data outest=__earn_pers1 edf NOPRINT;\n\t\t\tby gvkey curcd calc_date;\n\t\t\tmodel __ni_at=__ni_at_l1;\n\t\trun;\n\t\tproc sql;\n\t\t\tcreate table __earn_pers2 as \n\t\t\tselect gvkey, curcd, calc_date as datadate, __ni_at_l1 as ni_ar1, sqrt(_rmse_**2 * _edf_ / (_edf_ + 1)) as ni_ivol\n\t\t\tfrom __earn_pers1\n\t\t\twhere (_edf_ + 2) >= &__min.;\n\t\tquit;\n\t\t%save_or_append(base=op_ep, new=__earn_pers2);\n\t%end;\n\tproc sort data=op_ep out=&out. 
nodup; by gvkey curcd datadate; run; \n\tproc delete data= __acc1 __acc2 __acc3 __acc4  dates_apply calc_dates calc_data month_ends __earn_pers1 __earn_pers2 op_ep; run;\n%mend;\n\n/* MACRO - FIRM AGE */\n%macro firm_age(data=, out=);\n\t* CRSP first observation;\n\tproc sql;\n\t\tcreate table crsp_age1 as \n\t\tselect permco, min(date) as crsp_first format=YYMMDDN8.\n\t\tfrom crsp.msf\n\t\tgroup by permco;\n\tquit;\n\t\n\t* Compustat accounting first observation;\n\tproc sql;\n\t\tcreate table comp_acc_age1 as\n\t\tselect gvkey, datadate from comp.funda\n\t\touter union corr\n\t\tselect gvkey, datadate from comp.g_funda;\n\t\t\n\t\tcreate table comp_acc_age2 as\n\t\tselect gvkey, min(datadate) as comp_acc_first format=YYMMDDN8.\n\t\tfrom comp_acc_age1\n\t\tgroup by gvkey;\n\t\t\n\t\tupdate comp_acc_age2\n\t\tset comp_acc_first = intnx('year', comp_acc_first, -1, 'e');  /* When submitting an annual report, the firm must have existed for at least 1 year*/\n\tquit; \n\t\n\t* Compustat return first obs;\n\tproc sql;\n\t\tcreate table comp_ret_age1 as\n\t\tselect gvkey, datadate from comp.secm\n\t\touter union corr\n\t\tselect gvkey, datadate from comp.g_secd where monthend=1;\n\t\t\n\t\tcreate table comp_ret_age2 as\n\t\tselect gvkey, min(datadate) as comp_ret_first format=YYMMDDN8.\n\t\tfrom comp_ret_age1\n\t\tgroup by gvkey;\n\t\t\n\t\tupdate comp_ret_age2\n\t\tset comp_ret_first = intnx('year', comp_ret_first, -1, 'e');  /* When submitting an annual report, the firm must have existed for at least 1 year*/\n\tquit; \n\t\n\t* Add to Dataset;\n\tproc sql;\n\t\tcreate table comb1 as\n\t\tselect a.id, a.eom, min(b.crsp_first, c.comp_acc_first, d.comp_ret_first) as first_obs format=YYMMDDN8.\n\t\tfrom &data. as a \n\t\tleft join crsp_age1 as b \n\t\ton a.permco=b.permco\n\t\tleft join comp_acc_age2 as c\n\t\ton a.gvkey=c.gvkey\n\t\tleft join comp_ret_age2 as d\n\t\ton a.gvkey=d.gvkey;\n\t\n\t\tcreate table comb2 as \n\t\tselect *, min(eom) as first_alt format=YYMMDDN8. 
/* A few (0.3% of all obs) North American observations don't have observations in comp.secm so this is a fallback option*/\n\t\tfrom comb1\n\t\tgroup by id; \n\t\n\t\tcreate table comb3 as\n\t\tselect *, intck ('month', min(first_obs, first_alt), eom) as age\n\t\tfrom comb2;\n\t\t\n\t\talter table comb3\n\t\tdrop first_obs, first_alt;\n\tquit;\n\t* Output;\n\tproc sort data=comb3 out=&out.; by id eom; run;\n%mend;"
  },
  {
    "path": "GlobalFactors/ind_identification.sas",
    "content": "* MACRO: FF_IND_CLASS\n\tAdd variable matching 4-digit SIC identifiers to Fama-French industry identifiers\n\t   Arguments:\n\t   \tdata: name of input dataset that includes 4-digit SIC codes under name 'sic'\n\t   \tff_grps: number of industry portfolios for Fama-French identifiers\n\t   \tOUT: name of output dataset;\n%macro ff_ind_class(data=, ff_grps=, out=);\n\t%if &ff_grps = 38 %then %do;\n\t\tproc sql;\n\t\t\t/* French identifies \"Other\" as \"almost nothing\", so no firms are identified as \"other\" \n\t\t\thttps://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_38_ind_port.html*/\n\t\t\tcreate table &out. as\n\t\t\tselect *, \n\t\t\t\tcase\n\t\t\t\t\twhen 100 <= sic <= 999 then 1\n\t\t\t\t\twhen 1000 <= sic <= 1299 then 2 \n\t\t\t\t\twhen 1300 <= sic <= 1399 then 3\n\t\t\t\t\twhen 1400 <= sic <= 1499 then 4\n\t\t\t\t\twhen 1500 <= sic <= 1799 then 5\n\t\t\t\t\twhen 2000 <= sic <= 2099 then 6\n\t\t\t\t\twhen 2100 <= sic <= 2199 then 7\n\t\t\t\t\twhen 2200 <= sic <= 2299 then 8\n\t\t\t\t\twhen 2300 <= sic <= 2399 then 9\n\t\t\t\t\twhen 2400 <= sic <= 2499 then 10\n\t\t\t\t\twhen 2500 <= sic <= 2599 then 11\n\t\t\t\t\twhen 2600 <= sic <= 2661 then 12\n\t\t\t\t\twhen 2700 <= sic <= 2799 then 13\n\t\t\t\t\twhen 2800 <= sic <= 2899 then 14\n\t\t\t\t\twhen 2900 <= sic <= 2999 then 15\n\t\t\t\t\twhen 3000 <= sic <= 3099 then 16\n\t\t\t\t\twhen 3100 <= sic <= 3199 then 17\n\t\t\t\t\twhen 3200 <= sic <= 3299 then 18\n\t\t\t\t\twhen 3300 <= sic <= 3399 then 19\n\t\t\t\t\twhen 3400 <= sic <= 3499 then 20\n\t\t\t\t\twhen 3500 <= sic <= 3599 then 21\n\t\t\t\t\twhen 3600 <= sic <= 3699 then 22\n\t\t\t\t\twhen 3700 <= sic <= 3799 then 23\n\t\t\t\t\twhen 3800 <= sic <= 3879 then 24\n\t\t\t\t\twhen 3900 <= sic <= 3999 then 25\n\t\t\t\t\twhen 4000 <= sic <= 4799 then 26\n\t\t\t\t\twhen 4800 <= sic <= 4829 then 27\n\t\t\t\t\twhen 4830 <= sic <= 4899 then 28\n\t\t\t\t\twhen 4900 <= sic <= 4949 then 29\n\t\t\t\t\twhen 4950 <= sic <= 4959 then 
30\n\t\t\t\t\twhen 4960 <= sic <= 4969 then 31\n\t\t\t\t\twhen 4970 <= sic <= 4979 then 32\n\t\t\t\t\twhen 5000 <= sic <= 5199 then 33\n\t\t\t\t\twhen 5200 <= sic <= 5999 then 34\n\t\t\t\t\twhen 6000 <= sic <= 6999 then 35\n\t\t\t\t\twhen 7000 <= sic <= 8999 then 36 \n\t\t\t\t\twhen 9000 <= sic <= 9999 then 37\n\t\t\t\t\telse .\n\t\t\t\tend as ff38\n\t\t\tfrom &data.;\n\t\trun;\n\t\t%end;\n\t\t\n\t%else %if &ff_grps. = 49 %then %do;\n\t\tproc sql;\n\t\t\tcreate table &out. as\n\t\t\tselect *, \t\n\t\t\t\tcase\n\t\t\t\t\twhen sic = 2048 or 100 <= sic <= 299 or 700 <= sic <= 799 or 910 <= sic <= 919 then 1\n\t\t\t\t\twhen sic in (2095, 2098, 2099) or 2000 <= sic <= 2046 or 2050 <= sic <= 2063 or \n\t\t\t\t\t\t2070 <= sic <= 2079 or 2090 <= sic <= 2092 then 2\n\t\t\t\t\twhen sic in (2086, 2087, 2096, 2097) or 2064 <= sic <= 2068 then 3\n\t\t\t\t\twhen sic = 2080 or 2082 <= sic <= 2085 then 4\n\t\t\t\t\twhen 2100 <= sic <= 2199 then 5\n\t\t\t\t\twhen sic in (3732, 3930, 3931) or 920 <= sic <= 999 or 3650 <= sic <= 3652 or 3940 <= sic <= 3949 then 6\n\t\t\t\t\twhen sic in (7840, 7841, 7900, 7910, 7911, 7980) or 7800 <= sic <= 7833 or 7920 <= sic <= 7933 or \n\t\t\t\t\t\t7940 <= sic <= 7949 or 7990 <= sic <= 7999 then 7\n\t\t\t\t\twhen sic in (2770, 2771) or 2700 <= sic <= 2749 or 2780 <= sic <= 2799 then 8\n\t\t\t\t\twhen sic in (2047, 2391, 2392, 3160, 3161, 3229, 3260, 3262, 3263, 3269, 3230, 3231, 3750, 3751, 3800, 3860, \n\t\t\t\t\t\t3861, 3910, 3911, 3914, 3915, 3991, 3995) or 2510 <= sic <= 2519 or 2590 <= sic <= 2599 or \n\t\t\t\t\t\t2840 <= sic <= 2844 or 3170 <= sic <= 3172 or 3190 <= sic <= 3199 or 3630 <= sic <= 3639 or \n\t\t\t\t\t\t3870 <= sic <= 3873 or 3960 <= sic <= 3962 then 9\n\t\t\t\t\twhen sic in (3020, 3021, 3130, 3131, 3150, 3151) or 2300 <= sic <= 2390 or 3100 <= sic <= 3111 or \n\t\t\t\t\t\t 3140 <= sic <= 3149 or 3963 <= sic <= 3965 then 10\n\t\t\t\t\twhen 8000 <= sic <= 8099 then 11\n\t\t\t\t\twhen sic in (3693, 3850, 3851) or 3840 <= sic <= 
3849 then 12\n\t\t\t\t\twhen sic in (2830, 2831) or 2833 <= sic <= 2836 then 13\n\t\t\t\t\twhen 2800 <= sic <= 2829 or 2850 <= sic <= 2879 or 2890 <= sic <= 2899 then 14\n\t\t\t\t\twhen sic in (3031, 3041) or 3050 <= sic <= 3053 or 3060 <= sic <= 3099 then 15\n\t\t\t\t\twhen 2200 <= sic <= 2284 or 2290 <= sic <= 2295 or 2297 <= sic <= 2299 or 2393 <= sic <= 2395 or \n\t\t\t\t\t\t2397 <= sic <= 2399 then 16\n\t\t\t\t\twhen sic in (2660, 2661, 3200, 3210, 3211, 3240, 3241, 3261, 3264, 3280, 3281, 3446, 3996) or \n\t\t\t\t\t\t800 <= sic <= 899 or 2400 <= sic <= 2439 or 2450 <= sic <= 2459 or\t2490 <= sic <= 2499 or \n\t\t\t\t\t\t2950 <= sic <= 2952 or 3250 <= sic <= 3259 or 3270 <= sic <= 3275 or 3290 <= sic <= 3293 or \n\t\t\t\t\t\t3295 <= sic <= 3299 or 3420 <= sic <= 3429 or 3430 <= sic <= 3433 or 3440 <= sic <= 3442 or \n\t\t\t\t\t\t3448 <= sic <= 3452 or 3490 <= sic <= 3499 then 17\n\t\t\t\t\twhen 1500 <= sic <= 1511 or 1520 <= sic <= 1549 or 1600 <= sic <= 1799 then 18\n\t\t\t\t\twhen sic = 3300 or 3310 <= sic <= 3317 or 3320 <= sic <= 3325 or 3330 <= sic <= 3341 or 3350 <= sic <= 3357\n\t\t\t\t\t\tor 3360 <= sic <= 3379 or 3390 <= sic <= 3399 then 19\n\t\t\t\t\twhen sic in (3400, 3443, 3444) or 3460 <= sic <= 3479 then 20\n\t\t\t\t\twhen sic in (3538, 3585, 3586) or 3510 <= sic <= 3536 or 3540 <= sic <= 3569 or 3580 <= sic <= 3582 or \n\t\t\t\t\t\t3589 <= sic <= 3599 then 21\n\t\t\t\t\twhen sic in (3600, 3620, 3621, 3648, 3649, 3660, 3699) or 3610 <= sic <= 3613 or 3623 <= sic <= 3629 or \n\t\t\t\t\t\t3640 <= sic <= 3646 or 3690 <= sic <= 3692 then 22\n\t\t\t\t\twhen sic in (2296, 2396, 3010, 3011, 3537, 3647, 3694, 3700, 3710, 3711, 3799) or 3713 <= sic <= 3716 or \n\t\t\t\t\t\t3790 <= sic <= 3792 then 23\n\t\t\t\t\twhen sic in (3720, 3721, 3728, 3729) or 3723 <= sic <= 3725 then 24\n\t\t\t\t\twhen sic in (3730, 3731) or 3740 <= sic <= 3743 then 25\n\t\t\t\t\twhen sic = 3795 or 3760 <= sic <= 3769 or 3480 <= sic <= 3489 then 26\n\t\t\t\t\twhen 1040 <= sic <= 
1049 then 27\n\t\t\t\t\twhen 1000 <= sic <= 1039 or 1050 <= sic <= 1119 or 1400 <= sic <= 1499 then 28\n\t\t\t\t\twhen 1200 <= sic <= 1299 then 29\n\t\t\t\t\twhen sic in (1300, 1389) or 1310 <= sic <= 1339 or 1370 <= sic <= 1382 or 2900 <= sic <= 2912 or \n\t\t\t\t\t\t2990 <= sic <= 2999 then 30\n\t\t\t\t\twhen sic in (4900, 4910, 4911, 4939) or 4920 <= sic <= 4925 or 4930 <= sic <= 4932 or 4940 <= sic <= 4942 then 31\n\t\t\t\t\twhen sic in (4800, 4899) or 4810 <= sic <= 4813 or 4820 <= sic <= 4822 or 4830 <= sic <= 4841 or \n\t\t\t\t\t\t4880 <= sic <= 4892 then 32\n\t\t\t\t\twhen sic in (7020, 7021, 7200, 7230, 7231, 7240, 7241, 7250, 7251, 7395, 7500, 7600, 7620, 7622, 7623, 7640, \n\t\t\t\t\t\t7641) or 7030 <= sic <= 7033 or 7210 <= sic <= 7212 or 7214 <= sic <= 7217 or 7219 <= sic <= 7221 or \n\t\t\t\t\t\t7260 <= sic <= 7299 or 7520 <= sic <= 7549 or 7629 <= sic <= 7631 or 7690 <= sic <= 7699 or \n\t\t\t\t\t\t8100 <= sic <= 8499 or 8600 <= sic <= 8699 or 8800 <= sic <= 8899 or 7510 <= sic <= 7515 then 33\n\t\t\t\t\twhen sic in (3993, 7218, 7300, 7374, 7396, 7397, 7399, 7519, 8700, 8720, 8721) or 2750 <= sic <= 2759 or \n\t\t\t\t\t\t7310 <= sic <= 7342 or 7349 <= sic <= 7353 or 7359 <= sic <= 7369 or 7376 <= sic <= 7385 or \n\t\t\t\t\t\t7389 <= sic <= 7394 or 8710 <= sic <= 8713 or 8730 <= sic <= 8734 or 8740 <= sic <= 8748 or \n\t\t\t\t\t\t8900 <= sic <= 8911 or 8920 <= sic <= 8999 or 4220 <= sic <= 4229 then 34\n\t\t\t\t\twhen sic = 3695 or 3570 <= sic <= 3579 or 3680 <= sic <= 3689 then 35\n\t\t\t\t\twhen sic = 7375 or 7370 <= sic <= 7373 then 36\n\t\t\t\t\twhen sic in (3622, 3810, 3812) or 3661 <= sic <= 3666 or 3669 <= sic <= 3679 then 37\n\t\t\t\t\twhen sic = 3811 or 3820 <= sic <= 3827 or 3829 <= sic <= 3839 then 38\n\t\t\t\t\twhen sic in (2760, 2761) or 2520 <= sic <= 2549 or 2600 <= sic <= 2639 or 2670 <= sic <= 2699 or \n\t\t\t\t\t\t3950 <= sic <= 3955 then 39\n\t\t\t\t\twhen sic in (3220, 3221) or 2440 <= sic <= 2449 or 2640 <= sic <= 2659 or 3410 <= 
sic <= 3412 then 40\n\t\t\t\t\twhen sic in (4100, 4130, 4131, 4150, 4151, 4230, 4231, 4780, 4789) or 4000 <= sic <= 4013 or 4040 <= sic <= 4049 \n\t\t\t\t\t\tor 4110 <= sic <= 4121 or 4140 <= sic <= 4142 or 4170 <= sic <= 4173 or 4190 <= sic <= 4200 or \n\t\t\t\t\t\t4210 <= sic <= 4219 or 4240 <= sic <= 4249 or 4400 <= sic <= 4700 or 4710 <= sic <= 4712 or \n\t\t\t\t\t\t4720 <= sic <= 4749 or 4782 <= sic <= 4785 then 41\n\t\t\t\t\twhen sic in (5000, 5099, 5100) or 5010 <= sic <= 5015 or 5020 <= sic <= 5023 or 5030 <= sic <= 5060 or \n\t\t\t\t\t\t5063 <= sic <= 5065 or 5070 <= sic <= 5078 or 5080 <= sic <= 5088 or 5090 <= sic <= 5094 or \n\t\t\t\t\t\t5110 <= sic <= 5113 or 5120 <= sic <= 5122 or 5130 <= sic <= 5172 or 5180 <= sic <= 5182 or \n\t\t\t\t\t\t5190 <= sic <= 5199 then 42\n\t\t\t\t\twhen sic in (5200, 5250, 5251, 5260, 5261, 5270, 5271, 5300, 5310, 5311, 5320, 5330, 5331, 5334, 5900, 5999) or\n\t\t\t\t\t\t5210 <= sic <= 5231 or 5340 <= sic <= 5349 or 5390 <= sic <= 5400 or 5410 <= sic <= 5412 or \n\t\t\t\t\t\t5420 <= sic <= 5469 or 5490 <= sic <= 5500 or 5510 <= sic <= 5579 or 5590 <= sic <= 5700 or \n\t\t\t\t\t\t5710 <= sic <= 5722 or 5730 <= sic <= 5736 or 5750 <= sic <= 5799 or 5910 <= sic <= 5912 or \n\t\t\t\t\t\t5920 <= sic <= 5932 or 5940 <= sic <= 5990 or 5992 <= sic <= 5995 then 43 \n\t\t\t\t\twhen sic in (7000, 7213) or 5800 <= sic <= 5829 or 5890 <= sic <= 5899 or 7010 <= sic <= 7019 or \n\t\t\t\t\t\t7040 <= sic <= 7049 then 44\n\t\t\t\t\twhen sic = 6000 or 6010 <= sic <= 6036 or 6040 <= sic <= 6062 or 6080 <= sic <= 6082 or 6090 <= sic <= 6100 \n\t\t\t\t\t\tor 6110 <= sic <= 6113 or 6120 <= sic <= 6179 or 6190 <= sic <= 6199 then 45\n\t\t\t\t\twhen sic in (6300, 6350, 6351, 6360, 6361) or 6310 <= sic <= 6331 or 6370 <= sic <= 6379 or 6390 <= sic <= 6411 \n\t\t\t\t\t\tthen 46\n\t\t\t\t\twhen sic in (6500, 6510, 6540, 6541, 6610, 6611) or 6512 <= sic <= 6515 or 6517 <= sic <= 6532 or \n\t\t\t\t\t\t6550 <= sic <= 6553 or 6590 <= sic <= 6599 then 
47\n\t\t\t\t\twhen sic in (6700, 6798, 6799) or 6200 <= sic <= 6299 or 6710 <= sic <= 6726 or 6730 <= sic <= 6733 or \n\t\t\t\t\t\t6740 <= sic <= 6779 or 6790 <= sic <= 6795 then 48\n\t\t\t\t\twhen sic in (4970, 4971, 4990, 4991) or 4950 <= sic <= 4961 then 49\n\t\t\t\t\telse .\n\t\t\t\tend as ff49\n\t\t\t\tfrom &data.;\n\t\t\trun;\n\t\t\t%end;\n%mend;\n\n* MACRO: CRSP_INDUSTRY\n\tCreate daily historical SIC and NAICS industry identifiers dataset from CRSP data \n\t   Arguments:\n\t   \tout: name of output dataset;\n%macro crsp_industry(out=);\n\t/* Pull distinct date ranges and identifiers from CRSP datasets */\n\tproc sql;\n\t\tcreate table permno0 as\n\t\tselect distinct permno, permco, namedt, nameendt, siccd as sic, input(naics, 6.0) as naics/* NAICS codes can't have leading zeroes so this should be okay*/\n\t\tfrom crsp.dsenames\n\t\torder by permno, namedt, nameendt;\n\trun;\n\t\n\t/* Alter missing industry identifiers */\n\tdata permno1;\n\t\tset permno0;\n\t\tif missing(sic) then sic = -999;\n\t\tif sic = 0 then sic = -999;\n\t\tif missing(naics) then naics = -999;\n\trun;\n\t\n\t/* Find date distance for date ranges */\n\tdata permno2; \n\t\tset permno1;\n\t\tpermno_diff = intck('day', namedt, nameendt, 'd');\n\trun;\n\n\tproc sort data = permno2;\n\t\tby permno namedt nameendt;\n\trun;\n\t\n\t/* Create new rows between valid dates */\n\tdata permno3;\n\t\tset permno2;\n\t\toutput;\n\t\tn = 0;\n\t\tif permno_diff > 0 then do;\n\t\t\tdo until(n = permno_diff);\n\t\t\t\tnamedt = intnx('day', namedt, 1);\n\t\t\t\tn + 1;\n\t\t\t\toutput;\n\t\t\tend;\n\t\tend;\n\t\tdrop nameendt permno_diff n;\n\trun;\n\t\n\t/* Get ready for output */\n\tdata permno4;\n\t\tset permno3;\n\t\tif sic = -999 then sic = .;\n\t\tif naics = -999 then naics = .;\n\t\tdate = namedt;\n\t\tdrop namedt;\n\t\tformat date yymmddn8.;\n\trun;\n\t\t\n\tproc sort data= permno4 out= &out. 
nodup; by permno date; run;\n\t\n\tproc delete data = permno0 permno1 permno2 permno3 permno4; run;\n%mend; \n\n* MACRO: COMP_SIC_NAICS \n\tCreate a daily historical SIC and NAICS industry identifiers dataset using NA and global annual reports\n\t   Arguments: \n\t   \tOUT: name of output dataset;\n%macro comp_sic_naics(OUT =, ff_num =);\n\tproc sql;\n\t\t/* Retrieve NA identifiers */\n\t\tcreate table comp1 as\n\t\tselect distinct gvkey, datadate, sich as sic, naicsh as naics\n\t\tfrom COMP.FUNDA;\n\trun;\n\t\n\t/* Fix error of gvkey code 175650 */\n\tdata comp2;\n\t\tset comp1;\n\t\tif gvkey = \"175650\" and datadate = '31DEC2005'd and missing(naics) then delete;\n\trun;\n\t\t\n\tproc sql;\n\t\t/* Retrieve global identifiers */\n\t\tcreate table comp3 as\n\t\tselect distinct gvkey, datadate, sich as sic, naicsh as naics\n\t\tfrom COMP.G_FUNDA;\n\t\t\n\t\t/* Join global and NA data */\n\t\tcreate table comp4 as\n\t\tselect a.gvkey as gvkeya, a.datadate as datea, a.sic as sica, a.naics as naicsa, \n\t\t\tb.gvkey as gvkeyb, b.datadate as dateb, b.sic as sicb, b.naics as naicsb\n\t\tfrom comp2 as a full join comp3 as b\n\t\ton a.gvkey = b.gvkey and a.datadate = b.datadate;\n\trun;\n\t\n\t/* Coalesce NA and global */\n\tdata comp5;\n\t\tset comp4;\n\t\tgvkey = put(coalesce(gvkeya, gvkeyb), $z6.);\n\t\tdate  = coalesce(datea, dateb);\n\t\tsic   = coalesce(sica, sicb);\n\t\tnaics = coalesce(naicsa, naicsb);\n\t\tformat date yymmddn8.;\n\t\tdrop gvkeya gvkeyb datea dateb sica sicb naicsa naicsb;\n\trun;\n\t\n\t/* Sort descending*/\n\tproc sort data = comp5;\n\t\tby gvkey descending date;\n\trun;\n\t\n\t/* Add valid_to date in order to extend to daily observations */\n\tdata comp6;\n\t\tset comp5;\n\t\tby gvkey;\n\t\tvalid_to = intnx('day', lag(date), -1);\n\t\tif FIRST.gvkey then do;\n\t\t\tvalid_to = date;\n\t\tend;\n\t\tformat valid_to yymmddn8.;\n\trun;\n\t\n\t/* Re-sort */\n\tproc sort data = comp6;\n\t\tby gvkey date valid_to;\n\trun;\n\t\n\t/* Find date distance for 
date ranges */\n\tdata comp7; \n\t\tset comp6;\n\t\tcomp_diff = intck('day', date, valid_to, 'd');\n\trun;\n\n\tproc sort data = comp7;\n\t\tby gvkey date valid_to;\n\trun;\n\t\n\t/* Create new rows between valid dates */\n\tdata comp8;\n\t\tset comp7;\n\t\toutput;\n\t\tn = 0;\n\t\tif comp_diff > 0 and comp_diff ne . then do;\n\t\t\tdo until(n = comp_diff);\n\t\t\t\tdate = intnx('day', date, 1);\n\t\t\t\tn + 1;\n\t\t\t\toutput;\n\t\t\tend;\n\t\tend;\n\t\tdrop valid_to comp_diff n;\n\trun;\n\n\tproc sort data= comp8 out= &out. nodup; by gvkey date; run;\n\t\n\tproc delete data = comp1 comp2 comp3 comp4 comp5 comp6 comp7 comp8; run;\n%mend;\n\n* MACRO: COMP_HGICS\n\t Create a daily historical gics dataset from COMPUSTAT, either from the NA or global dataset \n\t   Arguments:\n\t   \tlib: COMPUSTAT library from which to pull historical gics data (CO_HGICS if NA, G_CO_HGICS if global)\n\t\tOUT: name of output dataset;\n%macro COMP_HGICS(lib =, out =);\n\t/* Pull historical gics data */\n\tproc sql;\n\t\tcreate table gic1 as\n\t\tselect distinct gvkey, indfrom, indthru, gsubind as gics\n\t\tfrom comp.&lib.\n\t\twhere not missing(gvkey);\n\trun;\n\t\n\tproc sort data = gic1;\n\t\tby gvkey indfrom;\n\trun;\n\t\n\t/* Alter missing gics */\n\tdata gic2;\n\t\tset gic1;\n\t\tby gvkey;\n\t\tif missing(gics) then gics = -999;\n\trun;\n\t\n\t/* Adjust indthru */\n\tdata gic3;\n\t\tset gic2;\n\t\tby gvkey indfrom indthru;\n\t\tif LAST.gvkey and indthru = . then indthru = today();\n\trun;\n\n\t/* Estimate difference between indfrom and indthru */\n\tdata gic4;\n\t\tset gic3;\n\t\tgic_diff = intck('days', indfrom, indthru);\n\trun;\n\t\n\tproc sort data = gic4;\n\t\tby gvkey indfrom indthru;\n\trun;\n\n\t/* Add rows to create daily data */\n\tdata gic5;\n\t\tset gic4;\n\t\tby gvkey;\n\t\toutput;\n\t\tn = 0;\n\t\tif gic_diff > 0 and gic_diff ne . 
then do;\n\t\t\tdo until(n = gic_diff);\n\t\t\t\tindfrom = intnx('day', indfrom, 1);\n\t\t\t\tn + 1;\n\t\t\t\toutput;\n\t\t\tend;\n\t\tend;\n\trun;\n\t\n\tdata gic6;\n\t\tset gic5;\n\t\tdate = indfrom;\n\t\tformat date yymmddn8.;\n\t\tdrop indfrom indthru gic_diff n;\n\trun;\n\t\n\tproc sort data= gic6 out=&OUT nodup; by gvkey date; run;\n\t\t\n\tproc delete data = gic1 gic2 gic3 gic4 gic5 gic6; run;\n%mend COMP_HGICS;\n\n/* MACRO: HGICS JOIN\n\tJoin NA and global daily historical gics data from COMPUSTAT \n\t   Argument: \n\t   \tOUT: name of output dataset */\n%macro HGICS_JOIN(out=);\n\t/* Construct NA and global historical gics data */\n\t%comp_hgics(lib = co_hgic, OUT = na_hgics);\n\t%comp_hgics(lib = g_co_hgic, OUT = g_hgics);\n\t\n\tproc sql;\n\t\tcreate table gjoin1 as\n\t\tselect a.gvkey as na_gvkey, a.gics as na_gics, a.date as na_date, b.gvkey as g_gvkey, b.gics \n\t\t\tas g_gics, b.date as g_date\n\t\tfrom na_hgics as a full join g_hgics as b \n\t\ton a.gvkey = b.gvkey and a.date = b.date;\n\t\n\t/* Coalesce NA and global */\n\tdata gjoin2;\n\t\tset gjoin1;\n\t\tgvkey = put(coalesce(na_gvkey, g_gvkey), $z6.);\n\t\tdate  = coalesce(na_date, g_date);\n\t\tgics  = coalesce(na_gics, g_gics);\n\t\tformat date yymmddn8.;\n\t\tdrop na_gvkey na_date na_gics g_gvkey g_date g_gics;\n\trun;\n\t\n\tproc sort data = gjoin2 out= &out nodup; by gvkey date; run;\n\t\n\tproc delete data = na_hgics g_hgics gjoin1 gjoin2; run;\n%mend;\n\n* MACRO: COMP_INDUSTRY\n\tJoin SIC and NAICS industry identifiers to GICS identifiers constructed from COMPUSTAT data\n\t   Arguments:\n\t   \tOUT: name of output dataset;\n%macro comp_industry(out=);\n\t/* Construct datasets */\n\t%hgics_join(out=comp_gics);\n\t%comp_sic_naics(out=comp_other);\n\t\n\t/* Join datasets */\n\tdata join1;\n\t\tmerge comp_gics comp_other;\n\t\tby gvkey date;\n\trun;\n\t\n\tproc sort data = join1 nodupkey;\n\t\tby gvkey date;\n\trun;\n\t\n\t/* Check for gaps in coverage */\n\tdata join2;\n\t\tset 
join1;\n\t\tby gvkey date;\n\t\tlagdate = lag(date);\n\t\tdate_1  = intnx('day', date, -1);\n\t\tgap     = 0;\n\t\tformat lagdate yymmddn8. date_1 yymmddn8.;\n\trun;\n\t\n\tdata join3;\n\t\tset join2;\n\t\tby gvkey date;\n\t\tif not FIRST.gvkey and lagdate ne date_1 then gap = 1;\n\trun;\n\t\n\t/* Create rows for gaps in coverage with all indicators as missing */\n\tproc sql;\n\t\tcreate table gap1 as\n\t\tselect *\n\t\tfrom join3\n\t\twhere gap = 1;\n\trun;\n\t\n\t/* Size of gap */\n\tdata gap2;\n\t\tset gap1;\n\t\tdiff = intck('days', lagdate, date);\n\trun;\n\t\n\t/* Add rows to create daily data */\n\tdata gap3;\n\t\tset gap2;\n\t\tby gvkey date;\n\t\toutput;\n\t\tn = 0;\n\t\tif gap = 1 then do;\n\t\t\tdo until(n = diff - 1);\n\t\t\t\tdate      = intnx('day', date, -1);\n\t\t\t\tgics      = .;\n\t\t\t\tsic       = .;\n\t\t\t\tnaics     = .;\n\t\t\t\tn + 1;\n\t\t\t\toutput;\n\t\t\tend;\n\t\tend;\n\t\tdrop lagdate date_1 gap diff n; \n\trun;\n\t\n\tproc sort data = gap3;\n\t\tby gvkey date;\n\trun;\n\t\n\t/* Join added rows to original daily data */\n\tdata joined1;\n\t\tmerge join1 gap3;\n\t\tby gvkey date;\n\trun;\n\t\n\tproc sort data = joined1 out= &out. nodup; by gvkey date; run;\t\n\t\n\tproc delete data = comp_gics comp_other join1 join2 join3 gap1 gap2 gap3 joined1; run;\n%mend;\n\n"
  },
  {
    "path": "GlobalFactors/main.sas",
    "content": "/* Clean working environment */\nproc delete data = _all_ ; run ; \n\n***************************************************************************\n* Manual Inputs\n*************************************************************************** ; \n* Assign scratch and project folder names;\n%let scratch_folder = /scratch/INSTITUTION/FOLDER; \n%let project_folder = ~/Global Data;\n* Set defaults;\n%let delete_temp = 1;  * Should temporary files be deleted?;\n%let save_csv = 1;     * Should the main data set be save country-by-country in a .csv format?;\n%let save_daily_ret = 1;   * Save daily stocks returns country-by-country in a .csv format?;\n%let save_monthly_ret = 1;   * Save monthly stocks returns  in a .csv format;\n%let end_date = '31DEC2024'd; * Date of last observation: CRSP data is only updated annually, so we keep this updating frequency for consistency. Should be incremented every time there's an update to the CRSP database);\n\n***************************************************************************\n* Libraries and Functions\n*************************************************************************** ; \n* Libraries;\noptions dlcreatedir;\nlibname scratch \"&scratch_folder.\"; \nlibname project \"&project_folder.\";\n\n* Project macros;\n%include \"&project_folder./project_macros.sas\";\n%include \"&project_folder./char_macros.sas\";\n%include \"&project_folder./market_chars.sas\";\n%include \"&project_folder./accounting_chars.sas\";\n%include \"&project_folder./ind_identification.sas\";\n\n*****************************************************************************\n* Create Return Data\n**************************************************************************** ; \n%prepare_comp_sf(freq=both);\n%clean_comp_msf(data=comp_msf); * Delete obvious data errors (work-in-progress);\n%prepare_crsp_sf(freq=d);\n%prepare_crsp_sf(freq=m);\n%combine_crsp_comp_sf(out_msf=world_msf1, out_dsf=scratch.world_dsf, crsp_msf=crsp_msf, comp_msf=comp_msf, 
crsp_dsf=crsp_dsf, comp_dsf=comp_dsf);\n\nproc delete data=comp_dsf crsp_dsf comp_msf crsp_msf; run;\n*****************************************************************************\n* Add Industry Codes \n***************************************************************************** ;\n%crsp_industry(out=crsp_ind);\n%comp_industry(out=comp_ind);\nproc sql;\n\tcreate table world_msf2 as\n\tselect a.*, b.gics as gics, coalesce(b.sic, c.sic) as sic, coalesce(b.naics, c.naics) as naics \n\tfrom world_msf1 as a\n\tleft join comp_ind as b on a.gvkey=b.gvkey and a.eom=b.date\n\tleft join crsp_ind as c on a.permco=c.permco and a.permno=c.permno and a.eom=c.date;\nquit;\nproc delete data=world_msf1 crsp_ind comp_ind; run; * Prefer COMPUSTAT to CRSP;\n\n* Add a column 'ff49' with Fama-French industry classification;\n%ff_ind_class(data=world_msf2, ff_grps=49, out=world_msf3); \n\n* Size cutoffs;\n%nyse_size_cutoffs(data=world_msf3, out=scratch.nyse_cutoffs);\n\n* Classify stocks into size groups;\nproc sql;\n\tcreate table scratch.world_msf as\n\tselect case \n\t\t\twhen missing(a.me) then ('')\n\t\t\twhen a.me >= b.nyse_p80 then 'mega'\n\t\t\twhen a.me >= b.nyse_p50 then 'large'\n\t\t\twhen a.me >= b.nyse_p20 then 'small'\n\t\t\twhen a.me >= b.nyse_p1 then 'micro'\n\t\t\telse 'nano'\n\t\tend as size_grp, a.*\n\tfrom world_msf3 as a left join scratch.nyse_cutoffs as b\n\ton a.eom=b.eom;\nquit;\nproc delete data=world_msf2 world_msf3; run;\n\n* Return cutoffs;\n%return_cutoffs(data=scratch.world_msf, freq=m, out=scratch.return_cutoffs, crsp_only=0);\n%return_cutoffs(data=scratch.world_dsf, freq=d, out=scratch.return_cutoffs_daily, crsp_only=0);\n\n*****************************************************************************\n* Market Returns\n**************************************************************************** ; \n%market_returns(out = scratch.market_returns, data=scratch.world_msf, freq=m, wins_comp=1, wins_data=scratch.return_cutoffs, 
cap_data=scratch.nyse_cutoffs);\n%market_returns(out = scratch.market_returns_daily, data=scratch.world_dsf, freq=d, wins_comp=1, wins_data=scratch.return_cutoffs_daily, cap_data=scratch.nyse_cutoffs);\n\n*****************************************************************************\n* Create Characteristics Based on Accounting Data\n**************************************************************************** ;\n%standardized_accounting_data(coverage='world', convert_to_usd=1, me_data = scratch.world_msf, include_helpers_vars=1, start_date='31DEC1949'd); \n%create_acc_chars(data=acc_std_ann, out=achars_world, lag_to_public=4, max_data_lag=18, __keep_vars=&acc_chars., me_data=scratch.world_msf, suffix=);\n%create_acc_chars(data=acc_std_qtr, out=qchars_world, lag_to_public=4, max_data_lag=18, __keep_vars=&acc_chars., me_data=scratch.world_msf, suffix=_qitem);\n%combine_ann_qtr_chars(out=scratch.acc_chars_world, ann_data=achars_world, qtr_data=qchars_world, __char_vars=&acc_chars., q_suffix=_qitem);\n\n*****************************************************************************\n* Create Characteristics Based on Monthly Market Data\n**************************************************************************** ;\n%market_chars_monthly(out=scratch.market_chars_m, data=scratch.world_msf, market_ret=scratch.market_returns, local_currency=0); \n\n* Free up space;\nproc datasets library=work kill nolist; quit;\n\n*****************************************************************************\n* Combine Returns, Accounting and Monthly Market Data\n**************************************************************************** ; \nproc sql;\n\tcreate table world_data_prelim as \n\tselect a.*, b.*, c.*\n\tfrom scratch.world_msf as a  \n\tleft join scratch.market_chars_m as b\n\ton a.id=b.id and a.eom=b.eom\n\tleft join scratch.acc_chars_world as c\n\ton a.gvkey=c.gvkey and a.eom=c.public_date;\n\n\talter table world_data_prelim \n\tdrop div_tot, div_cash, div_spc, public_date, 
source; \nquit;\n\n%if &delete_temp.=1 %then %do;\n\tproc delete data=\n\t\tscratch.market_chars_m scratch.acc_chars_world; \n\trun;\n%end;\n\n*****************************************************************************\n* Asset Pricing Factors\n**************************************************************************** ; \n* Create monthly and daily factors from FF3 and HXZ4;\n%ap_factors(out=scratch.ap_factors_daily, freq=d, sf=scratch.world_dsf, mchars=world_data_prelim, mkt=scratch.market_returns_daily, min_stocks_bp=10, min_stocks_pf=3);\t\n%ap_factors(out=scratch.ap_factors_monthly, freq=m, sf=scratch.world_msf, mchars=world_data_prelim, mkt=scratch.market_returns, min_stocks_bp=10, min_stocks_pf=3);\t\n\n*****************************************************************************\n* Factor based on combined data\n**************************************************************************** ;\n%firm_age(data=scratch.world_msf, out=scratch.firm_age);\n%mispricing_factors(out=scratch.mp_factors, data=world_data_prelim, min_stks=10, min_fcts=3);\t\n%market_beta(out=scratch.beta_60m, data=scratch.world_msf, fcts=scratch.ap_factors_monthly, __n=60, __min=36);\n%residual_momentum(out=scratch.resmom_ff3, data=scratch.world_msf, fcts=scratch.ap_factors_monthly, type=ff3, __n =36, __min=24, incl=12 6, skip=1 1);\n\n* Free up space;\nproc datasets library=work nolist;\n   save world_data_prelim;  /* Delete all work datasets except world_data_prelim */\nquit;\n\n\n\n*****************************************************************************\n* Create Characteristics Based on Daily Market Data\n**************************************************************************** ; \n%bidask_hl(out=scratch.corwin_schultz, data=scratch.world_dsf, __min_obs=10); \n%prepare_daily(data=scratch.world_dsf, fcts=scratch.ap_factors_daily); \n%roll_apply_daily(out=scratch.roll_21d, __n=1, __min=15, 
fcts=scratch.ap_factors_daily,  __month_ends=month_ends, sfx =_21d, \n\t\t\t\t  __stats= rvol rmax skew capm_ext ff3 hxz4 dimsonbeta zero_trades);\n%roll_apply_daily(out=scratch.roll_126d, __n=6, __min=60, fcts=scratch.ap_factors_daily,  __month_ends=month_ends, sfx =_126d, \n\t\t\t\t  __stats= zero_trades turnover dolvol ami);\n%roll_apply_daily(out=scratch.roll_252d, __n=12, __min=120, fcts=scratch.ap_factors_daily,  __month_ends=month_ends, sfx =_252d, \n\t\t\t\t  __stats= rvol capm downbeta zero_trades prc_to_high mktvol);\n%roll_apply_daily(out=scratch.roll_1260d, __n=60, __min=750, fcts=scratch.ap_factors_daily,  __month_ends=month_ends, sfx =_1260d, \n\t\t\t\t  __stats= mktcorr);\n%finish_daily_chars(out=scratch.market_chars_d);\n\n%if &delete_temp.=1 %then %do;\n\tproc delete data=\n\t\tscratch.corwin_schultz scratch.roll_21d scratch.roll_126d scratch.roll_252d scratch.roll_1260d scratch.ap_factors_daily scratch.ap_factors_monthly ; \n\trun;\n%end;\n\n* Free up space;\nproc datasets library=work nolist;\n   save world_data_prelim;  /* Delete all work datasets except world_data_prelim */\nquit;\n\n\n\n*****************************************************************************\n* Combine all characteristics and build final dataset\n**************************************************************************** ; \nproc sql;\n\tcreate table world_data3 as\n\tselect a.*, b.beta_60m, b.ivol_capm_60m, c.resff3_12_1, d.resff3_6_1, e.mispricing_mgmt, e.mispricing_perf, f.*, g.age\n\tfrom world_data_prelim as a \n\tleft join scratch.beta_60m as b on a.id=b.id and a.eom=b.eom\n\tleft join scratch.resmom_ff3_12_1 as c on a.id=c.id and a.eom=c.eom\n\tleft join scratch.resmom_ff3_6_1 as d on a.id=d.id and a.eom=d.eom\n\tleft join scratch.mp_factors as e on a.id=e.id and a.eom=e.eom\n\tleft join scratch.market_chars_d as f on a.id=f.id and a.eom=f.eom\n\tleft join scratch.firm_age as g on a.id=g.id and 
a.eom=g.eom;\nquit;\n\n* Add Quality minus Junk;\n%quality_minus_junk(out=scratch.qmj, data=world_data3, min_stks=10);\nproc sql;\n\tcreate table world_data4 as \n\tselect a.*, b.qmj, b.qmj_prof, b.qmj_growth, b.qmj_safety\n\tfrom world_data3 as a left join scratch.qmj as b\n\ton a.excntry=b.excntry and a.id=b.id and a.eom=b.eom;\nquit;\n\n* Reorder Variables;\ndata world_data5;\n\tretain id date eom source_crsp size_grp obs_main exch_main primary_sec gvkey iid permno permco excntry curcd fx \n\t\tcommon comp_tpci crsp_shrcd comp_exchg crsp_exchcd gics sic naics ff49\n\t\tadjfct shares me me_company prc prc_local dolvol ret ret_local ret_exc ret_lag_dif ret_exc_lead1m\n\t\tmarket_equity enterprise_value book_equity assets sales net_income; \n\tset world_data4;\nrun;\n\n* Delete Temporary Files;\n%if &delete_temp.=1 %then %do;\n\tproc delete data=\n\t\tworld_data_prelim\n\t\tscratch.beta_60m scratch.qmj scratch.resmom_ff3_12_1 scratch.resmom_ff3_6_1 \n\t\tscratch.mp_factors scratch.firm_age scratch.market_chars_d; \n\trun;\n%end;\n\n* Save combined data;\nproc sort data=world_data5 out=scratch.world_data nodup; by id eom; run;\n\n*****************************************************************************\n* Create Output in .csv Format for Download\n**************************************************************************** ; \n* Create Output Folder;\noptions dlcreatedir;\nlibname op \"&scratch_folder./output\"; \n\noption nonotes;\n* Small Files;\nproc export data=scratch.market_returns_daily\n    outfile=\"&scratch_folder./output/market_returns_daily.csv\"   \n    dbms=CSV\n    replace;\nrun;\nproc export data=scratch.market_returns\n    outfile=\"&scratch_folder./output/market_returns.csv\"   \n    dbms=CSV\n    replace;\nrun;\nproc export data=scratch.nyse_cutoffs\n    outfile=\"&scratch_folder./output/nyse_cutoffs.csv\"   \n    dbms=CSV\n    replace;\nrun;\nproc export data=scratch.return_cutoffs\n    outfile=\"&scratch_folder./output/return_cutoffs.csv\"   
\n    dbms=CSV\n    replace;\nrun;\nproc export data=scratch.return_cutoffs_daily\n    outfile=\"&scratch_folder./output/return_cutoffs_daily.csv\"   \n    dbms=CSV\n    replace;\nrun;\noption notes;\n\n* Save main data as .csv files by country;\n%if &save_csv.=1 %then %do;\n\t%save_main_data_csv(out=Characteristics, data=scratch.world_data, path=&scratch_folder./output, end_date=&end_date.); \n%end;\n* Save daily return data as .csv files by country;\n%if &save_daily_ret.=1 %then %do;\n\t%save_daily_ret_csv(out=Daily Returns, data=scratch.world_dsf, path=&scratch_folder./output, end_date=&end_date.);\n%end;\n* Save monthly return data as .csv files by country;\n%if &save_monthly_ret.=1 %then %do;\n\t%save_monthly_ret_csv(out=world_ret_monthly, data=scratch.world_msf, path=&scratch_folder./output, end_date=&end_date.);\n%end;\n* Delete Temporary Files;\n%if &delete_temp.=1 %then %do;\n\tproc delete data=\n\t\tscratch.market_returns_daily scratch.market_returns\n\t\tscratch.nyse_cutoffs scratch.return_cutoffs scratch.return_cutoffs_daily\n\t\tscratch.world_dsf scratch.world_msf scratch.world_data; \n\trun;\n%end;"
  },
  {
    "path": "GlobalFactors/market_chars.sas",
    "content": "* Market Chars: Monthly;\n%let monthly_chars=\n\t/* Market Based Size Variables */\n\tmarket_equity\n\t\n\t/* Total Dividend Paid to Market Equity */\n\tdiv1m_me div3m_me div6m_me div12m_me\n\t\n\t/* Special Dividend Paid to Market Equity */\n\tdivspc1m_me divspc12m_me\n\t\n\t/* Change in Shares Outstanding */\n\tchcsho_1m chcsho_3m chcsho_6m chcsho_12m\n\t\n\t/* Net Equity Payout */\n\teqnpo_1m eqnpo_3m eqnpo_6m eqnpo_12m\n\t\n\t/* Momentum/Reversal */\n\tret_1_0 ret_2_0 ret_3_0 ret_3_1 ret_6_0 ret_6_1 ret_9_0 ret_9_1 ret_12_0 ret_12_1 ret_12_7 ret_18_1 ret_24_1 ret_24_12\n\tret_36_1 ret_36_12 ret_48_1 ret_48_12 ret_60_1 ret_60_12 ret_60_36\n\t\n\t/* Seasonality */\n\tseas_1_1an seas_2_5an seas_6_10an seas_11_15an seas_16_20an\n\tseas_1_1na seas_2_5na seas_6_10na seas_11_15na seas_16_20na\n\t\n;\n\n%put ### In total %nwords(&monthly_chars.) monthly characteristics will be created ###;\n%macro market_chars_monthly(out=, data=, market_ret=, local_currency=);\t\n\t%if &local_currency=1 %then %do;\n\t\t%let ret_var = ret_local;\n\t%end;\n\t%if &local_currency=0 %then %do;\n\t\t%let ret_var = ret;\n\t%end;\n\t\n\t/* Helper macro: Apply function lag0 to lag &n. */\n\t%macro apply_to_lastn(x=, _n=, func=);\n\t\t%let mv = &func.(&x.; \n\t\t%do _i=1 %to &_n.-1;\n\t\t\t%let mv = &mv., lag&_i.(&x.);\n\t\t%end;\n\t\t%let mv = &mv.);\n\t\t&mv.;\n\t%mend apply_to_lastn;\n\t\n\t/* Get Important Variables */\n\tproc sql; \n\t\tcreate table __monthly_chars1 as\n\t\tselect a.id, a.date, a.eom, a.me, a.shares, a.adjfct, \n\t\t\ta.prc, a.ret, a.ret_local, a.&ret_var. as ret_x,  \n\t\t\ta.div_tot, a.div_cash, a.div_spc, a.dolvol,\n\t\t\ta.ret_lag_dif, (a.ret_local = 0) as ret_zero, \n\t\t\ta.ret_exc, b.mkt_vw_exc\t\t\t\t\t\t\t\t\t/* Currently, Excess return is in USD because we lack RF for most markets */\t\t\n\t\tfrom &data. as a left join &market_ret. 
as b\n\t\ton a.excntry=b.excntry and a.eom=b.eom\n\t\torder by a.id, a.eom;\n\tquit;\n\t\n\t* Ensure that there is a lag of 1 month between each obs;\n\tproc sql;\n\t\tcreate table __stock_coverage as \n\t\tselect id, min(eom) as start_date, max(eom) as end_date\n\t\tfrom __monthly_chars1\n\t\tgroup by id;\n\tquit;\n\t\n\t%expand(data=__stock_coverage, out=__full_range, id_vars=id, start_date=start_date, end_date=end_date, freq='month', new_date_name=eom);\n\t\n\tproc sql;\n\t\tcreate table __monthly_chars2 as\n\t\tselect a.id, a.eom, missing(b.id) as obs_miss,\n\t\t\tb.me, b.shares, b.adjfct, b.prc, b.ret, b.ret_local, b.ret_x, b.ret_lag_dif,\n\t\t\tb.div_tot, b.div_cash, b.div_spc, b.dolvol, b.ret_zero, b.ret_exc, b.mkt_vw_exc\n\t\tfrom __full_range as a left join __monthly_chars1 as b\n\t\ton a.id=b.id and a.eom=b.eom\n\t\torder by id, eom;\n\tquit;\n\t\n\t* Cummulative Return Index;\n\tdata __monthly_chars3;\n\t\tset __monthly_chars2;\n\t\tby id;\n\t\tretain ri_x;  /* Local or USD depending on &local_currency.*/\n\t\tretain ri;\t  /* USD */\n\t\tretain count;\n\t\tif first.id then do;\n\t\t\tri_x = sum(1, ret_x); /* Most will have missing return for the first observation. 
In that case this evaluates to 1 */\n\t\t\tri = sum(1, ret);\n\t\t\tcount = 1;\n\t\tend;\n\t\telse do;\n\t\t\tri_x = ri_x*sum(1, ret_x); /* By using sum instead of 1+ret missing returns are set to 0 */\n\t\t\tri = ri*sum(1, ret);\n\t\t\tcount = count+1;\n\t\tend;\n\trun;\n\t\n\t/* Set non-standard returns to missing */\n\tdata __monthly_chars4; \n\t\tset __monthly_chars3;\n\t\tret_miss = missing(ret_x) or ret_lag_dif^=1; /* We set returns with more than one month between price observations to missing (only impacts Compustat data) */\n\t\tif ret_miss = 1 then do;\n\t\t\tret_x = .;\n\t\t\tret=.;\n\t\t\tret_local=.;\n\t\t\tret_exc =.;\n\t\t\tmkt_vw_exc = .;\n\t\tend;\n\t\tdrop obs_miss ret_zero ret_lag_dif;\n\trun; \n\t\n\t/* Create variables */\n\tproc sort nodup data=__monthly_chars4; by id eom; run;\n\t%macro temp();\n\tdata __monthly_chars5;\n\t\tset __monthly_chars4; \n\t\tby id eom;\n\t\t/* Market Equity */\n\t\tmarket_equity = me;\n\t\t\n\t\t/* Dividend to Price */\n\t\t%let div_range = 1 3 6 12; /* 24, 36 */\n\t\t%do i=1 %to %sysfunc(countw(&div_range.));  \n\t\t\t%let n = %scan(&div_range., &i.);\n\t\t\tdiv_sum = %apply_to_lastn(x=div_tot*shares, _n=&n., func=sum);\n\t\t\tdiv&n.m_me = div_sum/me; \n\t\t\tif count < &n. then \n\t\t\t\tdiv&n.m_me = .;\n\t\t\tdrop div_sum;\n\t\t%end;\n\t\t\n\t\t/* Special Dividends */\n\t\t%let div_spc_range = 1 12;\n\t\t%do i=1 %to %sysfunc(countw(&div_spc_range.));  \n\t\t\t%let n = %scan(&div_spc_range., &i.);\n\t\t\tdiv_spc_sum = %apply_to_lastn(x=div_spc*shares, _n=&n., func=sum);\n\t\t\tdivspc&n.m_me = div_spc_sum/me; \n\t\t\tif count < &n. 
then \n\t\t\t\tdivspc&n.m_me = .;\n\t\t\tdrop div_spc_sum;\n\t\t%end;\n\t\t\n\t\t/* Change in Shares Outstanding (Market Based Proxy for Net Share Issuance)*/\n\t\t%let chcsho_lags = 1 3 6 12;\n\t\t%do i=1 %to %sysfunc(countw(&chcsho_lags.));  \n\t\t\t%let chcsho_lag = %scan(&chcsho_lags.,&i.);\n\t\t\tchcsho_&chcsho_lag.m = (shares*adjfct)/lag&chcsho_lag.(shares*adjfct)-1;\n\t\t\tif count <= &chcsho_lag. then \n\t\t\t\tchcsho_&chcsho_lag.m=.; \n\t\t%end;\n\t\t\n\t\t/* Net Equity Payout (Market based stock buyback+dividend-stock issuance)*/\n\t\t%let eqnpo_lags = 1 3 6 12;\n\t\t%do i=1 %to %sysfunc(countw(&eqnpo_lags.));  \n\t\t\t%let eqnpo_lag = %scan(&eqnpo_lags.,&i.);\n\t\t\teqnpo_&eqnpo_lag.m = log(ri/lag&eqnpo_lag.(ri))-log(me/lag&eqnpo_lag.(me));\n\t\t\tif count <= &eqnpo_lag. then \n\t\t\t\teqnpo_&eqnpo_lag.m=.; \n\t\t%end;\n\t\t\n\t\t\n\t\t/* Momentum/Reversal */\n\t\t%let from_lags = 1 2 3 3 6 6 9 9 12 12 12 18 24 24 36 36 48 48 60 60 60;\n\t\t%let to_lags   = 0 0 0 1 0 1 0 1 0  1  7  1  1 12  1 12 12  1  1 12 36;\n\t\t%do j=1 %to %sysfunc(countw(&from_lags.));  \n\t\t\t%let from = %scan(&from_lags., &j.);\n\t\t\t%let to = %scan(&to_lags., &j.);\n\t\t\tret_&from._&to. = lag&to.(ri_x)/lag&from.(ri_x)-1;\n\t\t\tif count <= &from. 
or missing(lag&to.(ret_x)) then /* Require the last return observation to be non-missing */\n\t\t\t\tret_&from._&to.=.; \n\t\t%end;\n\t\t\n\t\t/* Seasonality: Heston and Sadka (2008) */\n\t\t%seasonality(start_year=1, end_year=1);\n\t\t%seasonality(start_year=2, end_year=5);\n\t\t%seasonality(start_year=6, end_year=10);\n\t\t%seasonality(start_year=11, end_year=15);\n\t\t%seasonality(start_year=16, end_year=20);\n\t\t\n\t\t/* Drop Unnecessary Variables */\n\t\tdrop me shares adjfct prc ret ret_local ret_x\n\t\t\tdiv_tot div_cash div_spc dolvol ret_exc mkt_vw_exc ret_miss ri_x ri count;\n\trun;\n\t%mend;\n\t%temp();\n\t\n\tproc sort data=__monthly_chars5 out=&out.; by id eom; run;\n\t\n\tproc delete data=__stock_coverage __full_range __monthly_chars1 __monthly_chars2 __monthly_chars3 __monthly_chars4 __monthly_chars5; run;\n%mend;\n\n/* Calculate CAPM beta over a rolling window */\n%macro market_beta(out=, data=, fcts=, __n =, __min=); \n\tproc sql;\n\t\tcreate table __msf1 as \n\t\tselect a.id, a.eom, a.ret_exc, a.ret_lag_dif, b.mktrf\n\t\tfrom &data. as a left join &fcts. as b\n\t\ton a.excntry=b.excntry and a.eom=b.eom\n\t\twhere a.ret_local^=0 and not missing(a.ret_exc) and a.ret_lag_dif=1 and not missing(b.mktrf);\n\tquit;\t\n\t%winsorize_own(inset=__msf1, outset=__msf2, sortvar=eom, vars=ret_exc, perc_low=0.1, perc_high=99.9); /* Winsorize returns at 0.1% and 99.9% */\n\tproc sort data=__msf2; by id eom; run;\n\t\n\tproc sql;\n\t\tcreate table month_ends as \n\t\tselect distinct eom\n\t\tfrom __msf2\n\t\torder by eom;\n\tquit;\n\t\n\t* Divide data into __n groups;\n\tproc sql;\n\t\tcreate table dates_apply as \n\t\tselect *, mod(monotonic(), &__n.) as grp\n\t\tfrom month_ends;\n\tquit;\n\t\n\t* Helper macro: If first group, save &new. as &base. otherwise, append &new. to &base.;\n\t%macro save_or_append(base=, new=);\n\t\t%if &__grp. = 0 %then %do;\n\t\t\tdata &base.; set &new.; run;\n\t\t%end;\n\t\t%else %do;\n\t\t\tproc append base=&base. 
data=&new.; run;\n\t\t%end;\n\t%mend;\n\t\n\t%do __grp=0 %to %eval(&__n. - 1);\n\t\t%put ############### GROUP %eval(&__grp.+1) out of &__n. ###############; \n\t\t* Prepare data;\n\t\tproc sql;\n\t\t\tcreate table calc_dates as\n\t\t\tselect a.eom, b.eom as calc_date\n\t\t\tfrom dates_apply as a left join dates_apply(where=(grp = &__grp.)) as b\n\t\t\ton a.eom > intnx(\"month\", b.eom, -&__n., \"e\") and a.eom <= b.eom;\n\t\tquit;\n\t\t\n\t\tproc sql;\n\t\t\tcreate table calc_data as \n\t\t\tselect a.*, b.calc_date\n\t\t\tfrom __msf2 as a left join calc_dates as b\n\t\t\ton a.eom = b.eom\n\t\t\twhere not missing(b.calc_date)  \n\t\t\tgroup by a.id, b.calc_date\n\t\t\thaving count(*) >= &__min.\n\t\t\torder by a.id, b.calc_date;\n\t\tquit;\n\t\t\n\t\tproc reg data=calc_data outest=__capm1 edf NOPRINT;\n\t\t\tby id calc_date;\n\t\t\tmodel ret_exc=mktrf;\n\t\trun;\n\t\tproc sql;\n\t\t\tcreate table __capm2 as \n\t\t\tselect id, calc_date as eom, mktrf as beta_&__n.m, sqrt(_rmse_**2 * _edf_ / (_edf_ + 1)) as ivol_capm_&__n.m\n\t\t\tfrom __capm1\n\t\t\twhere (_edf_ + 2) >= &__min.;\n\t\tquit;\n\t\t%save_or_append(base=op_capm, new=__capm2);\n\t%end;\n\tproc sort data=op_capm out=&out. nodup; by id eom; run; \n\tproc delete data=op_capm; run;\n%mend;\n\n/* MACRO: RESIDUAL MOMENTUM --------------\n- Rolling regressions over &__n. months, used to calculate residual momentum. \n- Currently I have only implemented FF3 but I could easily extend it to CAPM and HXZ4\n- Residual momentum is typically calculated over a shorter horizon than that used to estimate parameters.\n  The number of months of return data to include is indicated by &incl. The number of months to skip within that period\n  is indicated by &skip. Both incl and skip can be lists, but they need to be of the same length.\n*/\n%macro residual_momentum(out=, data=, fcts=, type=, __n =, __min=, incl=, skip=); /* incl and skip can be lists but must have equal length. 
Type in (market, ff3, hxz4) */\n\tproc sql;\n\t\tcreate table __msf1 as \n\t\tselect a.id, a.eom, a.ret_exc, a.ret_lag_dif, b.mktrf, b.hml, b.smb_ff, b.roe, b.inv, b.smb_hxz\n\t\tfrom &data. as a left join &fcts. as b\n\t\ton a.excntry=b.excntry and a.eom=b.eom\n\t\twhere a.ret_local^=0 and not missing(a.ret_exc) and not missing(b.mktrf) and ret_lag_dif=1; \n\tquit;\t\n\t%winsorize_own(inset=__msf1, outset=__msf2, sortvar=eom, vars=ret_exc, perc_low=0.1, perc_high=99.9); /* Winsorize returns at 0.1% and 99.9% */\n\tproc sort data=__msf2; by id eom; run;\n\t\n\tproc sql;\n\t\tcreate table month_ends as \n\t\tselect distinct eom\n\t\tfrom __msf2\n\t\torder by eom;\n\tquit;\n\t\n\t* Divide data into __n groups;\n\tproc sql;\n\t\tcreate table dates_apply as \n\t\tselect *, mod(monotonic(), &__n.) as grp\n\t\tfrom month_ends;\n\tquit;\n\t\n\t* Helper macro: If first group, save &new. as &base. otherwise, append &new. to &base.;\n\t%macro save_or_append(base=, new=);\n\t\t%if &__grp. = 0 %then %do;\n\t\t\tdata &base.; set &new.; run;\n\t\t%end;\n\t\t%else %do;\n\t\t\tproc append base=&base. data=&new.; run;\n\t\t%end;\n\t%mend;\n\t\n\t%do __grp=0 %to %eval(&__n. - 1);\n\t\t%put ############### GROUP %eval(&__grp.+1) out of &__n. ###############; \n\t\t* Prepare data;\n\t\tproc sql;\n\t\t\tcreate table calc_dates as\n\t\t\tselect a.eom, b.eom as calc_date\n\t\t\tfrom dates_apply as a left join dates_apply(where=(grp = &__grp.)) as b\n\t\t\ton a.eom > intnx(\"month\", b.eom, -&__n., \"e\") and a.eom <= b.eom;\n\t\tquit;\n\t\t\n\t\tproc sql;\n\t\t\tcreate table calc_data as \n\t\t\tselect a.*, b.calc_date\n\t\t\tfrom __msf2 as a left join calc_dates as b\n\t\t\ton a.eom = b.eom\n\t\t\twhere not missing(b.calc_date)  \n\t\t\tgroup by a.id, b.calc_date\n\t\t\thaving count(*) >= &__min. 
\n\t\t\torder by a.id, b.calc_date;\n\t\tquit;\t\n\t\t\n\t\t* Fama and French (1993) 3 factor model;\n\t\t%if %sysfunc(find(&type., ff3)) >= 1 %then %do;\n\t\t\tproc reg data=calc_data(where=(not missing(hml) and not missing(smb_ff))) NOPRINT;\n\t\t\t\tby id calc_date;\n\t\t\t\tmodel ret_exc=mktrf smb_ff hml;\n\t\t\t\toutput out=__ff3_res1\n\t\t\t\t\t   residual=res;  \n\t\t\trun;\n\t\t\t%do __i=1 %to %nwords(&incl.);\n\t\t\t\t%let __in = %scan(&incl., &__i., %str(' '));\n\t\t\t\t%let __sk = %scan(&skip., &__i., %str(' '));\n\t\t\t\tproc sql;\n\t\t\t\t\tcreate table __ff3_res2 as \n\t\t\t\t\tselect *, (eom > intnx(\"month\", calc_date, -&__in., \"e\") and eom <= intnx(\"month\", calc_date, -&__sk., \"e\")) as incl  /* Incl=1 --> obs is included in momentum calculation */ \n\t\t\t\t\tfrom __ff3_res1 \n\t\t\t\t\tgroup by id, calc_date\n\t\t\t\t\thaving count(res) >= &__min. \n\t\t\t\t\torder by id, calc_date, eom;\n\t \t\t\t\n\t \t\t\t\tcreate table __ff3_res3 as \n\t \t\t\t\tselect id, calc_date as eom, mean(res) / std(res) as resff3_&__in._&__sk.\n\t \t\t\t\tfrom __ff3_res2\n\t \t\t\t\twhere incl = 1\n\t \t\t\t\tgroup by id, calc_date;\n\t \t\t\tquit;\n\t \t\t\t\n\t\t\t\t%save_or_append(base=op_&__in._&__sk., new=__ff3_res3);\n\t\t\t%end;\n\t\t%end;\n\t%end;\n\t/* Output */\n\t%do __i=1 %to %nwords(&incl.);\n\t\t%let __in = %scan(&incl., &__i., %str(' '));\n\t\t%let __sk = %scan(&skip., &__i., %str(' '));\n\t\tproc sort data=op_&__in._&__sk. out=&out._&__in._&__sk. 
nodup; by id eom; run; \n\t\tproc delete data=op_&__in._&__sk.; run; \n\t%end;\n%mend;\n\n/* MACRO FOR DAILY CHARS ----------------------------------------------\n\n*/\n%macro prepare_daily(data=, fcts=);\n\t/* Start timer */\n\t%let __prep_start = %sysfunc(datetime());\n\t/* Prepare stock level data */\n\tproc sql;\n\t\tcreate table dsf1 as \n\t\tselect a.excntry, a.id, a.date, a.eom, a.prc / a.adjfct as prc_adj, a.ret, a.ret_exc, a.dolvol as dolvol_d, a.shares, a.tvol,\n\t\t\t   b.mktrf, b.hml, b.smb_ff, b.roe, b.inv, b.smb_hxz,\n\t\t\t   a.ret_lag_dif, a.bidask,\n\t\t\t   sum(a.ret_local = 0) as zero_obs  /* Some firms have almost exclusively zero returns. These should be excluded */\n\t\tfrom &data. as a left join &fcts. as b\n\t\ton a.excntry = b.excntry and a.date = b.date\n\t\twhere not missing(b.mktrf) /* not missing mktrf ensures that we look at trading days */\n\t\tgroup by a.id, a.eom; \n\t\n\t\tupdate dsf1\n\t\tset ret_exc = .,\n\t\t    ret = .\n\t\twhere ret_lag_dif > 14;  /* Only use returns based on prices that are not more than two weeks old */\n\t\t\n\t\talter table dsf1\n\t\tdrop ret_lag_dif, bidask;\n\tquit;\n\tproc sort data=dsf1; by id date; run;\n\t\n\t* Create lead/lagged market returns (for Dimson beta);\n\tproc sql;\n\t\tcreate table mkt_lead_lag1 as \n\t\tselect excntry, date, intnx('month',date,0,'E') as eom format=YYMMDDN8., mktrf\n\t\tfrom &fcts.\n\t\torder by excntry, date desc;\n\tquit;\n\t\n\tdata mkt_lead_lag2;\n\t\tset mkt_lead_lag1;\n\t\tmktrf_ld1 = lag(mktrf);\n\t\tif excntry ^= lag(excntry) or eom ^= lag(eom) then mktrf_ld1 = .;  /* Eom condition is to avoid look-ahead bias */\n\trun;\n\t\n\tproc sort data=mkt_lead_lag2 out=mkt_lead_lag3; by excntry date; run;\n\t\n\tdata mkt_lead_lag4;\n\t\tset mkt_lead_lag3;\n\t\tmktrf_lg1 = lag(mktrf);\n\t\tif excntry ^= lag(excntry) then mktrf_lg1 = .;\n\trun;\n\t\n\t* Overlapping returns used to calculate correlation;\n\tdata corr_data;\n\t\tset 
dsf1;\n\t\tret_exc_3l = ret_exc + lag(ret_exc) + lag2(ret_exc);\n\t\tmkt_exc_3l = mktrf + lag(mktrf) + lag2(mktrf);\n\t\tif id ^= lag2(id) then do;\n\t\t\tret_exc_3l = .;\n\t\t\tmkt_exc_3l = .;\n\t\tend;\n\t\tkeep id eom zero_obs ret_exc_3l mkt_exc_3l;\n\trun;\n\t\n\t*Unique Month Ends;\n\tproc sql;\n\t\tcreate table month_ends as \n\t\tselect distinct eom\n\t\tfrom dsf1\n\t\torder by eom;\n\tquit;\n\t\n\t/* Stop timer */\n\tdata _null_;\n\t\tdur = datetime() - &__prep_start;\n\t\tput 30*'-' / ' PREPARING DAILY DATA TOOK:' dur time13.2 / 30*'-';\n\trun;\n%mend;\n\n* MACRO: ROLL APPLY DAILY -------------------------\n- Apply &__stats. functions to rolling windows of data.\n- The idea is to iteratively apply the functions to &__n. different splits of the data.\n  The output of each function is a stock id-eom pair plus the calculated characteristics\n- The currently implemented &__stats. are:\n\t* rvol, rmax, skew, capm, capm_ext, ff3, hxz4, dimsonbeta, downbeta, zero_trades, turnover, dolvol, ami, prc_to_high, mktcorr, mktvol\n- Arguments:\n\t* OUT: Output dataset in a long format with all the requested characteristics\n\t* ...;\n%macro roll_apply_daily(out=, __n=, __min=, fcts=, __month_ends=, sfx =,__stats=);  /* Create stats over rolling __n months. stats in (rvol, rmax, skew, capm, capm_ext, ff3, hxz4, dimsonbeta, downbeta, zero_trades, turnover, dolvol, ami, prc_to_high, mktcorr, mktvol) */\n\t/* Start timer */\n\t%let __roll_start = %sysfunc(datetime());\n\t* Divide data into __n groups;\n\tproc sql;\n\t\tcreate table dates_apply as \n\t\tselect *, mod(monotonic(), &__n.) as grp\n\t\tfrom &__month_ends.;\n\tquit;\n\t* Helper: If first group, save &new. as &base. otherwise, append &new. to &base.;\n\t%macro save_or_append(base=, new=);\n\t\t%if &__grp. = 0 %then %do;\n\t\t\tdata &base.; set &new.; run;\n\t\t%end;\n\t\t%else %do;\n\t\t\tproc append base=&base. 
data=&new.; run; \n\t\t%end;\n\t%mend;\n\t\n\t* Drop unnecessary columns for faster join;\n\tdata __input; set dsf1; run;\n\t%if %sysfunc(find(&__stats., ff3)) = 0 %then %do;\n\t\tproc sql;\n\t\t\talter table __input\n\t\t\tdrop hml, smb_ff;\n\t\tquit;\n\t%end;\n\t%if %sysfunc(find(&__stats., hxz4)) = 0 %then %do;\n\t\tproc sql;\n\t\t\talter table __input\n\t\t\tdrop roe, inv, smb_hxz;\n\t\tquit;\n\t%end;\n\t%if %sysfunc(find(&__stats., turnover)) = 0 and %sysfunc(find(&__stats., ami)) = 0 and %sysfunc(find(&__stats., zero_trades)) = 0 and %sysfunc(find(&__stats., dolvol)) = 0 %then %do;\n\t\tproc sql;\n\t\t\talter table __input\n\t\t\tdrop dolvol_d, shares, tvol;\n\t\tquit;\n\t%end;\n\t%if %sysfunc(find(&__stats., prc_to_high)) = 0 %then %do;\n\t\tproc sql;\n\t\t\talter table __input\n\t\t\tdrop prc_adj;\n\t\tquit;\n\t%end;\n\t\n\t* Apply __stats to each group;\n\t%do __grp=0 %to %eval(&__n. - 1); \n\t\t* Prepare data;\n\t\tproc sql;\n\t\t\tcreate table calc_dates as\n\t\t\tselect a.eom, b.eom as calc_date\n\t\t\tfrom dates_apply as a left join dates_apply(where=(grp = &__grp.)) as b\n\t\t\ton a.eom > intnx(\"month\", b.eom, -&__n., \"e\") and a.eom <= b.eom;\n\t\tquit;\n\t\t\n\t\t* Not necessary if mktcorr is the only stat;\n\t\t%if %nwords(&__stats.)>1 or %sysfunc(find(&__stats., mktcorr))=0 %then %do;\n\t\t\tproc sql;\n\t\t\t\t/* Used for volume variables */\n\t\t\t\tcreate table calc_data_raw as \n\t\t\t\tselect a.*, b.calc_date\n\t\t\t\tfrom __input as a left join calc_dates as b\n\t\t\t\ton a.eom = b.eom\n\t\t\t\twhere not missing(b.calc_date)  \n\t\t\t\torder by a.id, b.calc_date;\n\t\n\t\t\t\t/* Used for return variables */\n\t\t\t\tcreate table calc_data_screen as \n\t\t\t\tselect *\n\t\t\t\tfrom calc_data_raw\n\t\t\t\twhere not missing(ret_exc) and zero_obs < 10  /* We exclude stock-months with 10 or more zero returns */\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(ret_exc) >= &__min.;\n\t\t\tquit;\t\n\t\t%end;\n\t\t\n\t\t* Return 
Volatility;\n\t\t%if %sysfunc(find(&__stats., rvol)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __rvol as \n\t\t\t\tselect id, calc_date as eom, std(ret_exc) as rvol&sfx.\n\t\t\t\tfrom calc_data_screen\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(ret_exc) >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_rvol, new=__rvol);\n\t\t%end;\n\t\t\n\t\t* Maximum Return;\n\t\t%if %sysfunc(find(&__stats., rmax)) >= 1 %then %do;\n\t\t\tproc rank data= calc_data_screen out = __rmax1 descending;\n\t\t\t\tby id calc_date;\n\t\t\t\tvar ret;\n\t\t\t\tranks ret_rank;\n\t\t\trun;\n\t\t\tproc sql;\n\t\t\t\tcreate table __rmax2 as \n\t\t\t\tselect id, calc_date as eom, mean(ret) as rmax5&sfx., max(ret) as rmax1&sfx.\n\t\t\t\tfrom __rmax1\n\t\t\t\twhere ret_rank<=5\n\t\t\t\tgroup by id, calc_date;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_rmax, new=__rmax2);\n\t\t%end;\n\t\t\n\t\t* Return Skewness;\n\t\t%if %sysfunc(find(&__stats., skew)) >= 1 %then %do;\n\t\t\tproc means data=calc_data_screen skewness noprint;\n\t\t\t\tby id calc_date;\n\t\t\t\tvar ret_exc;\n\t\t\t\toutput out = __skew1\n\t\t\t\t\t   skewness = rskew&sfx.;\n\t\t\trun;\n\t\t\tproc sql;\n\t\t\t\tcreate table __skew2 as\n\t\t\t\tselect id, calc_date as eom, rskew&sfx.\n\t\t\t\tfrom __skew1\n\t\t\t\twhere _freq_ >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_skew, new=__skew2);\n\t\t%end;\n\t\t\n\t\t* Price-to-high;\n\t\t%if %sysfunc(find(&__stats., prc_to_high)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __prc_high as \n\t\t\t\tselect id, calc_date as eom, prc_adj / max(prc_adj) as prc_highprc&sfx.\n\t\t\t\tfrom calc_data_screen\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving date = max(date) and count(prc_adj) >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_prc_high, new=__prc_high);\n\t\t%end;\n\t\t\n\t\t* Amihud (2002);\n\t\t%if %sysfunc(find(&__stats., ami)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __ami as \n\t\t\t\tselect id, calc_date as eom, 
mean(abs(ret) / dolvol_d) *1e6 as ami&sfx.\n\t\t\t\tfrom calc_data_screen\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(dolvol_d) >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_ami, new=__ami);\n\t\t%end;\n\t\t\n\t\t* CAPM regression (beta + ivol);\n\t\t%if %sysfunc(find(&__stats., capm)) >= 1 and %sysfunc(find(&__stats., capm_ext)) = 0 %then %do;\n\t\t\tproc reg data=calc_data_screen outest=__capm1 edf NOPRINT;\n\t\t\t\tby id calc_date;\n\t\t\t\tmodel ret_exc=mktrf;\n\t\t\trun;\n\t\t\tproc sql;\n\t\t\t\tcreate table __capm2 as \n\t\t\t\tselect id, calc_date as eom, mktrf as beta&sfx., sqrt(_rmse_**2 * _edf_ / (_edf_ + 1)) as ivol_capm&sfx.\n\t\t\t\tfrom __capm1\n\t\t\t\twhere (_edf_ + 2) >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_capm, new=__capm2);\n\t\t%end;\n\t\t\n\t\t* CAPM regression extended (beta + ivol + iskew + coskewness);\n\t\t%if %sysfunc(find(&__stats., capm_ext)) >= 1 %then %do;\n\t\t\tproc reg data=calc_data_screen outest=__capm_ext1 edf NOPRINT;\n\t\t\t\tby id calc_date;\n\t\t\t\tmodel ret_exc=mktrf;\n\t\t\t\toutput out=__capm_ext_res\n\t\t\t\t\t   residual=res;  /* Including the output statement increases the time by a factor of 3. 
It's necessary to compute skewness */\n\t\t\trun;\n\t\t\t\n\t\t\tproc sql;\n\t\t\t\tcreate table __capm_ext2 as \n\t\t\t\tselect id, calc_date as eom, mktrf as beta&sfx., sqrt(_rmse_**2 * _edf_ / (_edf_ + 1)) as ivol_capm&sfx.\n\t\t\t\tfrom __capm_ext1\n\t\t\t\twhere (_edf_ + 2) >= &__min.;\n\t\t\tquit;\n\t\t\t\n\t\t\t* Idiosyncratic skewness;\n\t\t\tproc means data=__capm_ext_res skewness noprint;\n\t\t\t\tby id calc_date;\n\t\t\t\tvar res;\n\t\t\t\toutput out = __capm_ext_skew(where=(_freq_ >= &__min.))\n\t\t\t\t\t   skewness = iskew_capm&sfx.;\n\t\t\trun;\n\t\t\t\n\t\t\t* Coskewness;\n\t\t\tproc sql;\n\t\t\t\tcreate table __capm_ext_coskew1 as \n\t\t\t\tselect id, calc_date, res, mktrf - mean(mktrf) as mktrf_dm\n\t\t\t\tfrom __capm_ext_res\n\t\t\t\tgroup by id, calc_date;\n\t\t\t\t\n\t\t\t\tcreate table  __capm_ext_coskew2 as \n\t\t\t\tselect id, calc_date, mean(res * mktrf_dm**2) / (sqrt(mean(res**2)) * mean(mktrf_dm**2) ) as coskew&sfx.\n\t\t\t\tfrom __capm_ext_coskew1\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(res) >= &__min.;\n\t\t\tquit;\n\t\t\t\n\t\t\tproc sql;\n\t\t\t\tcreate table __capm_ext3 as \n\t\t\t\tselect a.*, b.iskew_capm&sfx., c.coskew&sfx.\n\t\t\t\tfrom __capm_ext2 as a\n\t\t\t\tleft join __capm_ext_skew as b on a.id=b.id and a.eom=b.calc_date\n\t\t\t\tleft join __capm_ext_coskew2 as c on a.id=c.id and a.eom=c.calc_date;\n\t\t\tquit;\n\t\t\t\n\t\t\t%save_or_append(base=op_capm_ext, new=__capm_ext3);\n\t\t%end;\n\t\t\n\t\t* Fama and French (1993) 3 factor model;\n\t\t%if %sysfunc(find(&__stats., ff3)) >= 1 %then %do;\n\t\t\tproc reg data=calc_data_screen(where=(not missing(hml) and not missing(smb_ff))) outest=__ff31 edf NOPRINT;\n\t\t\t\tby id calc_date;\n\t\t\t\tmodel ret_exc=mktrf smb_ff hml;\n\t\t\t\toutput out=__ff3_res\n\t\t\t\t\t   residual=res;  /* Including the output statement increases the time by a factor of 3. 
It's necessary to compute skewness */\n\t\t\trun;\n\t\t\t\n\t\t\tproc sql;\n\t\t\t\tcreate table __ff32 as \n\t\t\t\tselect id, calc_date as eom, sqrt(_rmse_**2 * _edf_ / (_edf_ + 1)) as ivol_ff3&sfx.\n\t\t\t\tfrom __ff31\n\t\t\t\twhere (_edf_ + 4) >= &__min.;\n\t\t\tquit;\n\t\t\t\n\t\t\t* Idiosyncratic skewness;\n\t\t\tproc means data=__ff3_res skewness noprint;\n\t\t\t\tby id calc_date;\n\t\t\t\tvar res;\n\t\t\t\toutput out = __ff3_skew(where=(_freq_ >= &__min.))\n\t\t\t\t\t   skewness = iskew_ff3&sfx.;\n\t\t\trun;\n\t\t\t\n\t\t\tproc sql;\n\t\t\t\tcreate table __ff33 as \n\t\t\t\tselect a.*, b.iskew_ff3&sfx.\n\t\t\t\tfrom __ff32 as a\n\t\t\t\tleft join __ff3_skew as b on a.id=b.id and a.eom=b.calc_date;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_ff3, new=__ff33);\n\t\t%end;\n\t\t\n\t\t* Hou, Xue and Zhang (2015) 4 factor model;\n\t\t%if %sysfunc(find(&__stats., hxz4)) >= 1 %then %do;\n\t\t\tproc reg data=calc_data_screen(where=(not missing(roe) and not missing(inv) and not missing(smb_hxz))) outest=__hxz41 edf NOPRINT;\n\t\t\t\tby id calc_date;\n\t\t\t\tmodel ret_exc=mktrf smb_hxz roe inv;\n\t\t\t\toutput out=__hxz4_res\n\t\t\t\t\t   residual=res;  /* Including the output statement increases the time by a factor of 3. 
It's necessary to compute skewness */\n\t\t\trun;\n\t\t\t\n\t\t\tproc sql;\n\t\t\t\tcreate table __hxz42 as \n\t\t\t\tselect id, calc_date as eom, sqrt(_rmse_**2 * _edf_ / (_edf_ + 1)) as ivol_hxz4&sfx.\n\t\t\t\tfrom __hxz41\n\t\t\t\twhere (_edf_ + 5) >= &__min.;\n\t\t\tquit;\n\t\t\t\n\t\t\t* Idiosyncratic skewness;\n\t\t\tproc means data=__hxz4_res skewness noprint;\n\t\t\t\tby id calc_date;\n\t\t\t\tvar res;\n\t\t\t\toutput out = __hxz4_skew(where=(_freq_ >= &__min.))\n\t\t\t\t\t   skewness = iskew_hxz4&sfx.;\n\t\t\trun;\n\t\t\t\n\t\t\tproc sql;\n\t\t\t\tcreate table __hxz43 as \n\t\t\t\tselect a.*, b.iskew_hxz4&sfx.\n\t\t\t\tfrom __hxz42 as a\n\t\t\t\tleft join __hxz4_skew as b on a.id=b.id and a.eom=b.calc_date;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_hxz4, new=__hxz43);\n\t\t%end;\n\t\t\n\t\t* Dimson beta;\n\t\t%if %sysfunc(find(&__stats., dimsonbeta)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __dimson1 as \n\t\t\t\tselect a.excntry, a.id, a.date, a.eom, a.ret_exc, a.mktrf, b.mktrf_lg1, b.mktrf_ld1\n\t\t\t\tfrom calc_data_screen as a left join mkt_lead_lag4 as b\n\t\t\t\ton a.excntry = b.excntry and a.date = b.date\n\t\t\t\twhere not missing(b.mktrf_lg1) and not missing(b.mktrf_ld1);\n\t\t\t\t\n\t\t\t\tcreate table __dimson2 as \n\t\t\t\tselect *\n\t\t\t\tfrom __dimson1\n\t\t\t\tgroup by id, eom\n\t\t\t\thaving count(*) >= (&__min. - 1);  /* Minus one to reflect that there is one less available observation due to the need to avoid look-ahead bias. */ \n\t\t\tquit;\n\t\t\t\n\t\t\tproc reg data=__dimson2 outest=__dimson3 edf NOPRINT;\n\t\t\t\tby id eom;\n\t\t\t\tmodel ret_exc=mktrf mktrf_lg1 mktrf_ld1;\n\t\t\trun;\n\t\t\t\n\t\t\tdata __dimson4;\n\t\t\t\tset __dimson3;\n\t\t\t\tbeta_dimson&sfx. 
= mktrf + mktrf_lg1 + mktrf_ld1;\t\n\t\t\t\tkeep id eom beta_dimson&sfx.;\n\t\t\trun;\n\t\t\t%save_or_append(base=op_dimson, new=__dimson4);\n\t\t%end;\n\t\t\n\t\t* Downside beta;\n\t\t%if %sysfunc(find(&__stats., downbeta)) >= 1 %then %do;\n\t\t\tproc reg data=calc_data_screen(where=(mktrf < 0)) outest=__downbeta1 edf NOPRINT;\n\t\t\t\tby id calc_date;\n\t\t\t\tmodel ret_exc=mktrf;\n\t\t\trun;\n\t\t\tproc sql;\n\t\t\t\tcreate table __downbeta2 as \n\t\t\t\tselect id, calc_date as eom, mktrf as betadown&sfx.\n\t\t\t\tfrom __downbeta1\n\t\t\t\twhere (_edf_ + 2) >= (&__min. / 2);  /* Use convention that we require half as many obs for downside beta */\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_downbeta, new=__downbeta2);\n\t\t%end;\n\t\t\n\t\t* Number of zero trades with turnover as tiebreaker;\n\t\t%if %sysfunc(find(&__stats., zero_trades)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __zero_trades1 as\n\t\t\t\tselect id, calc_date as eom, mean(tvol=0) * 21 as zero_trades, mean(tvol / (shares * 1e6)) as turnover\n\t\t\t\tfrom calc_data_raw\n\t\t\t\twhere not missing(tvol)\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(tvol) >= &__min.\n\t\t\t\torder by eom;\n\t\t\tquit;\n\t\t\tproc rank data=__zero_trades1(where=(not missing(zero_trades) and not missing(turnover))) out = __zero_trades2 descending ties=mean f; /* f means that we use fractional ranks i.e. between 0 and 1*/ \n\t\t\t\tby eom;\n\t\t\t\tvar turnover;\n\t\t\t\tranks rank_turnover;\n\t\t\trun;\n\t\t\tproc sql;\n\t\t\t\tcreate table __zero_trades3 as\n\t\t\t\tselect id, eom, zero_trades + rank_turnover / 100  as zero_trades&sfx. 
/* Divide by 100 to ensure that turnover only acts as a tie breaker (1/252*21=0.0833)*/\n\t\t\t\tfrom __zero_trades2;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_zero_trades, new=__zero_trades3);\n\t\t%end; \n\t\t\t\t\n\t\t* Turnover;\n\t\t%if %sysfunc(find(&__stats., turnover)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __turnover1 as \n\t\t\t\tselect id, date, calc_date, tvol / (shares * 1e6) as turnover_d  \n\t\t\t\tfrom calc_data_raw;\n\t\t\t\t\n\t\t\t\tcreate table __turnover2 as \n\t\t\t\tselect id, calc_date as eom, mean(turnover_d) as turnover&sfx., std(turnover_d) / (calculated turnover&sfx.) as turnover_var&sfx. \n\t\t\t\tfrom __turnover1\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(turnover_d) >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_turnover, new=__turnover2);\n\t\t%end;\n\t\t\n\t\t* Dollar Volume;\n\t\t%if %sysfunc(find(&__stats., dolvol)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __dolvol as \n\t\t\t\tselect id, calc_date as eom, mean(dolvol_d) as dolvol&sfx., std(dolvol_d) / (calculated dolvol&sfx.) as dolvol_var&sfx.\n\t\t\t\tfrom calc_data_raw\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(dolvol_d) >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_dolvol, new=__dolvol);\n\t\t%end;\n\t\t\n\t\t* Correlation to Market;\n\t\t%if %sysfunc(find(&__stats., mktcorr)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __corr_data1 as \n\t\t\t\tselect a.*, b.calc_date\n\t\t\t\tfrom corr_data as a left join calc_dates as b\n\t\t\t\ton a.eom = b.eom\n\t\t\t\twhere not missing(b.calc_date) and not missing(ret_exc_3l) and zero_obs < 10\n\t\t\t\torder by a.id, b.calc_date;\n\t\t\t\t\n\t\t\t\tcreate table __corr_data2 as \n\t\t\t\tselect *\n\t\t\t\tfrom __corr_data1\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(ret_exc_3l) >= &__min. 
and count(mkt_exc_3l) >= &__min.;\n\t\t\tquit;\n\t\t\tproc corr data = __corr_data2 outp=__corr1 noprint nomiss ;\n\t\t\t\tby id calc_date;\n\t\t\t\tvar ret_exc_3l mkt_exc_3l;\n\t\t\trun;\n\t\t\tproc sql;\n\t\t\t\tcreate table __corr2 as \n\t\t\t\tselect id, calc_date as eom, ret_exc_3l as corr&sfx.\n\t\t\t\tfrom __corr1\n\t\t\t\twhere _type_='CORR' and _name_ = 'mkt_exc_3l';\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_corr, new=__corr2);\n\t\t%end;\n\t\t\n\t\t* Market Volatility (separately for each stock);\n\t\t%if %sysfunc(find(&__stats., mktvol)) >= 1 %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __mktvol as \n\t\t\t\tselect id, calc_date as eom, std(mktrf) as __mktvol&sfx.\n\t\t\t\tfrom calc_data_screen\n\t\t\t\tgroup by id, calc_date\n\t\t\t\thaving count(ret_exc) >= &__min.;\n\t\t\tquit;\n\t\t\t%save_or_append(base=op_mktvol, new=__mktvol);\n\t\t%end;\n\t\t\n\t\t* NAME;\n\t\t%if %sysfunc(find(&__stats., NAME)) >= 1 %then %do;\n\t\t\t\n\t\t\t%save_or_append(base=, new=);\n\t\t%end;\n\t\t\n\t%end;\t\n\t* Make all observations into a dataset by transposing and appending; \n\tproc sql noprint;\n\t    select memname into :op_datasets separated by \" \"\n\t    from dictionary.tables\n\t    where lowcase(libname)=\"work\" and  prxmatch(\"/^op\\_/i\", memname) > 0;\n\tquit;\n\t\n\t* Initialize dataset to append to;\n\tdata &out.;\n\t\tformat id 9.0 eom YYMMDDN8. stat $char20. value 16.8;\n\t\tstop;  \n\trun;\n\t%do k=1 %to %nwords(&op_datasets.);\n\t\t%let __dt = %scan(&op_datasets., &k., %str(' '));\n\t\tproc sort data=&__dt.; by id eom; run;\n\t\tproc transpose data=&__dt. out=__op(rename=(col1=value)) name=stat ;\n\t\t\tby id eom;\n\t\trun;\n\t\tproc append base=&out. 
data=__op force; run; \n\t\tproc delete data=&__dt.; run;\n\t%end;\n\t/* Stop timer */\n\tdata _null_;\n\t\tdur = datetime() - &__roll_start;\n\t\tput 30*'-' / ' DAILY ROLL APPLY TOOK:' dur time13.2 / 30*'-';\n\trun;\n%mend;\n\n/* MACRO: FINISH DAILY CHARS*/\n%macro finish_daily_chars(out=);\n\t* Make bidask into a long format;\n\tproc transpose data=scratch.corwin_schultz out=bidask(rename=(col1=value)) name=stat;\n\t\tby id eom;\n\trun;\n\t* Combine all roll chars;\n\tdata daily_chars1; set scratch.roll_21d; run;\n\tproc append base=daily_chars1 data=scratch.roll_126d; run; \n\tproc append base=daily_chars1 data=scratch.roll_252d; run; \n\tproc append base=daily_chars1 data=scratch.roll_1260d; run;\n\tproc append base=daily_chars1 data=bidask force; run;\n\tproc sort data=daily_chars1 nodup; by id eom; run;\n\tproc transpose data = daily_chars1 out= daily_chars2(drop=_name_);\n\t\tby id eom;\n\t\tid stat;\n\t\tvar value;\n\trun;\n\tproc sql;\n\t\tcreate table daily_chars3 as \n\t\tselect *, corr_1260d * rvol_252d/__mktvol_252d as betabab_1260d, rmax5_21d / rvol_252d as rmax5_rvol_21d\n\t\tfrom daily_chars2;\n\t\t\n\t\talter table daily_chars3\n\t\tdrop __mktvol_252d;\n\tquit;\n\tproc sort data=daily_chars3 out=&out.; by id eom; run;\n%mend;\n\n\n\n"
  },
  {
    "path": "GlobalFactors/portfolios.R",
    "content": "library(lubridate)\nlibrary(tidyverse)\nlibrary(data.table)\n\n# How To --------------------\n# Paths\n# - data_path:    Set to path with global characteristics data and if daily_pf==T should also contain a folder with daily stock returns.\n# - output_path:  Set to desired output folder.\n# - legacy_path:  Set to folder if you want to maintain legacy version. If not, set to NULL.\n# Countries\n# - countries:    Choose the countries where portfolio returns are created. Default: All countries in data_path/Characteristics\n# Characteristics\n# - chars:        Characteristics to create portfolios from. Can be any column from the global characteristics dataset.\n# Portfolio Settings\n# - settings:     Choose how to create portfolios. For more information, see description for the portfolios() function\n\n# User Input -----------------------\n# Paths\ndata_path <- \"../../Data\"\noutput_path <- \"../../PaperFactors\"\nlegacy_path <- \"../../Legacy\"\n# Countries\ncountries <- list.files(path = paste0(data_path, \"/Characteristics\")) %>% str_remove(\".csv\")\n# Chars \nchars <- c(\n  \"age\",                 \"aliq_at\",             \"aliq_mat\",            \"ami_126d\",           \n  \"at_be\",               \"at_gr1\",              \"at_me\",               \"at_turnover\",        \n  \"be_gr1a\",             \"be_me\",               \"beta_60m\",            \"beta_dimson_21d\",    \n  \"betabab_1260d\",       \"betadown_252d\",       \"bev_mev\",             \"bidaskhl_21d\",       \n  \"capex_abn\",           \"capx_gr1\",            \"capx_gr2\",            \"capx_gr3\",           \n  \"cash_at\",             \"chcsho_12m\",          \"coa_gr1a\",            \"col_gr1a\",           \n  \"cop_at\",              \"cop_atl1\",            \"corr_1260d\",          \"coskew_21d\",         \n  \"cowc_gr1a\",           \"dbnetis_at\",          \"debt_gr3\",            \"debt_me\",            \n  \"dgp_dsale\",           \"div12m_me\",           
\"dolvol_126d\",         \"dolvol_var_126d\",    \n  \"dsale_dinv\",          \"dsale_drec\",          \"dsale_dsga\",          \"earnings_variability\",\n  \"ebit_bev\",            \"ebit_sale\",           \"ebitda_mev\",          \"emp_gr1\",            \n  \"eq_dur\",              \"eqnetis_at\",          \"eqnpo_12m\",           \"eqnpo_me\",           \n  \"eqpo_me\",             \"f_score\",             \"fcf_me\",              \"fnl_gr1a\",           \n  \"gp_at\",               \"gp_atl1\",             \"ival_me\",             \"inv_gr1\",            \n  \"inv_gr1a\",            \"iskew_capm_21d\",      \"iskew_ff3_21d\",       \"iskew_hxz4_21d\",     \n  \"ivol_capm_21d\",       \"ivol_capm_252d\",      \"ivol_ff3_21d\",        \"ivol_hxz4_21d\",      \n  \"kz_index\",            \"lnoa_gr1a\",           \"lti_gr1a\",            \"market_equity\",      \n  \"mispricing_mgmt\",     \"mispricing_perf\",     \"ncoa_gr1a\",           \"ncol_gr1a\",          \n  \"netdebt_me\",          \"netis_at\",            \"nfna_gr1a\",           \"ni_ar1\",             \n  \"ni_be\",               \"ni_inc8q\",            \"ni_ivol\",             \"ni_me\",              \n  \"niq_at\",              \"niq_at_chg1\",         \"niq_be\",              \"niq_be_chg1\",        \n  \"niq_su\",              \"nncoa_gr1a\",          \"noa_at\",              \"noa_gr1a\",           \n  \"o_score\",             \"oaccruals_at\",        \"oaccruals_ni\",        \"ocf_at\",             \n  \"ocf_at_chg1\",         \"ocf_me\",              \"ocfq_saleq_std\",      \"op_at\",              \n  \"op_atl1\",             \"ope_be\",              \"ope_bel1\",            \"opex_at\",            \n  \"pi_nix\",              \"ppeinv_gr1a\",         \"prc\",                 \"prc_highprc_252d\",   \n  \"qmj\",                 \"qmj_growth\",          \"qmj_prof\",            \"qmj_safety\",         \n  \"rd_me\",               \"rd_sale\",             \"rd5_at\",              \"resff3_12_1\", 
       \n  \"resff3_6_1\",          \"ret_1_0\",             \"ret_12_1\",            \"ret_12_7\",           \n  \"ret_3_1\",             \"ret_6_1\",             \"ret_60_12\",           \"ret_9_1\",            \n  \"rmax1_21d\",           \"rmax5_21d\",           \"rmax5_rvol_21d\",      \"rskew_21d\",          \n  \"rvol_21d\",            \"sale_bev\",            \"sale_emp_gr1\",        \"sale_gr1\",           \n  \"sale_gr3\",            \"sale_me\",             \"saleq_gr1\",           \"saleq_su\",           \n  \"seas_1_1an\",          \"seas_1_1na\",          \"seas_11_15an\",        \"seas_11_15na\",       \n  \"seas_16_20an\",        \"seas_16_20na\",        \"seas_2_5an\",          \"seas_2_5na\",         \n  \"seas_6_10an\",         \"seas_6_10na\",         \"sti_gr1a\",            \"taccruals_at\",       \n  \"taccruals_ni\",        \"tangibility\",         \"tax_gr1a\",            \"turnover_126d\",      \n  \"turnover_var_126d\",   \"z_score\",             \"zero_trades_126d\",    \"zero_trades_21d\",    \n  \"zero_trades_252d\"\n)\n# Portfolio settings\nsettings <- list(\n  end_date = as.Date(\"2023-12-31\"),\n  pfs = 3,\n  source = c(\"CRSP\", \"COMPUSTAT\"),                           \n  wins_ret = T,\n  bps = \"non_mc\",\n  bp_min_n = 10,\n  cmp = list(\n    us = T,\n    int = F\n  ),\n  signals = list(\n    us = F,\n    int = F,\n    standardize = T,\n    weight = \"vw_cap\"\n  ),\n  regional_pfs = list(\n    ret_type = \"vw_cap\",                    # Type of return to use for regional factors\n    country_excl = c(\"ZWE\", \"VEN\"),         # Countries are excluded due to data issues\n    country_weights = \"market_cap\",         # How to weight countries? 
In (\"market_cap\", \"stocks\", \"ew\")\n    stocks_min = 5,                         # Minimum number of stocks in each side of the portfolios\n    months_min = 5 * 12,                    # Minimum number of observations a factor needs to be included  \n    countries_min = 3                       # Minimum number of countries necessary in a regional portfolio\n  ), \n  daily_pf = T,\n  ind_pf = T\n)\n\n# Portfolio Function -------------\nportfolios <- function(\n  data_path,\n  excntry,\n  chars, \n  source = c(\"CRSP\", \"COMPUSTAT\"),       # Use data from \"CRSP\", \"Compustat\" or both: c(\"CRSP\", \"COMPUSTAT\")\n  wins_ret = T,                          # Should Compustat returns be winsorized at the 0.1% and 99.9% of CRSP returns? \n  pfs,                                   # Number of portfolios \n  bps,                                   # What should breakpoints be based on? Non-microcap stocks (\"non_mc\") or NYSE stocks (\"nyse\")\n  bp_min_n,                              # Minimum number of stocks used for breakpoints\n  cmp = F,                               # Create characteristics managed size portfolios?\n  signals = F,                           # Create portfolio signals?\n  signals_standardize = F,               # Map chars to [-0.5, +0.5]?,\n  signals_w = \"vw_cap\",                  # Weighting for signals: in c(\"ew\", \"vw\", \"vw_cap\")\n  nyse_size_cutoffs,                     # Data frame with NYSE size breakpoints\n  daily_pf = F,                          # Should daily returns be estimated?\n  ind_pf = F,                            # Should industry portfolio returns be estimated?\n  ret_cutoffs = NULL,                    # Data frame for monthly winsorization. Necessary when wins_ret=T\n  ret_cutoffs_daily = NULL               # Data frame for daily winsorization. 
Necessary when wins_ret=T and daily_pf=T\n) {\n  # Characteristic Data\n  data <- fread(paste0(data_path, \"/Characteristics/\", excntry, \".csv\"), select = c(\"id\", \"eom\", \"source_crsp\", \"comp_exchg\", \"crsp_exchcd\", \"size_grp\", \"ret_exc\", \"ret_exc_lead1m\", \"me\", \"gics\", \"ff49\", chars), colClasses = c(\"eom\"=\"character\"))\n  data[, eom := eom %>% lubridate::fast_strptime(format = \"%Y%m%d\") %>% as.Date()]\n  # ME CAP\n  data <- nyse_size_cutoffs[, .(eom, nyse_p80)][data, on = \"eom\"]\n  data[, me_cap := pmin(me, nyse_p80)][, nyse_p80 := NULL]\n  # Screens\n  if (length(source) == 1) {\n    if (source == \"CRSP\") {\n      data <- data[source_crsp == 1]\n    }\n    if (source == \"COMPUSTAT\") {\n      data <- data[source_crsp == 0]\n    }\n  }\n  data <- data[!is.na(size_grp) & !is.na(me) & !is.na(ret_exc_lead1m)] # The ret_exc_lead1m screen assumes that the investor knew at the beginning of the month that the security would delist. \n  # Daily Returns\n  if (daily_pf) {\n    daily <- fread(paste0(data_path, \"/Daily Returns/\", excntry, \".csv\"), colClasses = c(\"date\"=\"character\"), select = c(\"id\", \"date\", \"ret_exc\")); gc()\n    daily[, date := date %>% lubridate::fast_strptime(format = \"%Y%m%d\") %>% as.Date()]\n    daily[, eom_lag1 := floor_date(date, unit=\"month\")-1]\n  }\n  # Winsorize Returns?\n  if (wins_ret) {\n    data <- ret_cutoffs[, .(\"eom\" = eom_lag1, \"p001\"=ret_exc_0_1, \"p999\"=ret_exc_99_9)][data, on = \"eom\"]\n    data[source_crsp == 0 & ret_exc_lead1m > p999, ret_exc_lead1m := p999]\n    data[source_crsp == 0 & ret_exc_lead1m < p001, ret_exc_lead1m := p001]\n    data[, c(\"source_crsp\", \"p001\", \"p999\") := NULL]\n    if (daily_pf) {\n      daily[, year := year(date)]\n      daily[, month := month(date)]\n      daily <- ret_cutoffs_daily[, .(year, month, \"p001\"=ret_exc_0_1, \"p999\"=ret_exc_99_9)][daily, on = .(year, month)]\n      daily[id>99999 & ret_exc > p999, ret_exc := p999] # Only winsorize 
Compustat data, id for CRSP is 5 digits, Compustat is 9: source_crsp == 0 \n      daily[id>99999 & ret_exc < p001, ret_exc := p001]\n      daily[, c(\"p001\", \"p999\", \"year\", \"month\") := NULL]\n    }\n  }\n  # Standardize to [-0.5, +0.5] interval (for signals)\n  if (signals_standardize & signals) {\n    data[, (chars) := lapply(.SD, function(x) frank(x, ties.method = \"min\", na.last = \"keep\")), .SDcols = chars, by = eom]\n    data[, (chars) := lapply(.SD, as.numeric), .SDcols = chars]\n    data[, (chars) := lapply(.SD, function(x) x / max(x, na.rm=T) - 0.5), .SDcols = chars, by = eom]\n  }\n  # Industry Portfolios \n  if (ind_pf) {\n    ind_data <- data[!is.na(gics), .(eom, gics, excntry, ret_exc_lead1m, me, me_cap)]\n    # Get first 2 digits of GICS code for industry groups\n    ind_data[, gics := as.numeric(substr(ind_data$gics, 1, 2))]\n    ind_gics <- ind_data[, .(\n      n = .N,\n      ret_ew = mean(ret_exc_lead1m),\n      ret_vw = sum(ret_exc_lead1m * me) / sum(me),\n      ret_vw_cap = sum(ret_exc_lead1m * me_cap) / sum(me_cap) \n    ), by = .(gics, eom)][, excntry := str_to_upper(excntry)]\n    # Lead month to match using leaded returns\n    ind_gics[, eom := ceiling_date(eom+1, unit = \"month\")-1]\n    ind_gics <- ind_gics[n >= bp_min_n]\n    # Estimate industry portfolios by Fama-French portfolios for US data\n    if (excntry == \"usa\"){\n      ind_data <- data[!is.na(ff49), .(eom, ff49, ret_exc_lead1m, me, me_cap)]\n      ind_ff49 <- ind_data[, .(\n        n = .N,\n        ret_ew = mean(ret_exc_lead1m),\n        ret_vw = sum(ret_exc_lead1m * me) / sum(me),\n        ret_vw_cap = sum(ret_exc_lead1m * me_cap) / sum(me_cap) \n      ), by = .(ff49, eom)][, excntry := str_to_upper(excntry)]\n      ind_ff49[, eom := ceiling_date(eom+1, unit = \"month\")-1]\n      ind_ff49 <- ind_ff49[n >= bp_min_n]\n    }\n  }\n  # Prepare output list\n  output <- list()\n  # Apply Portfolio Function to Each Characteristic\n  char_pfs <- chars %>% lapply(function(x) 
{\n    op <- list()\n    print(paste0(\"   \" , x, \": \", match(x, chars), \" out of \", length(chars)))\n    data[, var := as.double(get(x))]\n    # Unless we need to compute signals, limit size of data\n    if(!signals) {\n      sub <- data[!is.na(var), .(id, eom, var, size_grp, ret_exc_lead1m, me, me_cap, crsp_exchcd, comp_exchg)]\n    } else {\n      sub <- data[!is.na(var)]\n    }\n    # Portfolio Assignment\n    if (bps == \"nyse\") {\n      sub[, bp_stock := (crsp_exchcd == 1 & is.na(comp_exchg)) | (comp_exchg == 11 & is.na(crsp_exchcd))]\n    }\n    if (bps == \"non_mc\") {\n      sub[, bp_stock := (size_grp %in% c(\"mega\", \"large\", \"small\"))]\n    }\n    sub[, bp_n := sum(bp_stock), by = eom]\n    sub <- sub[bp_n >= bp_min_n] # require at least 10 stocks for break points\n    if (nrow(sub) != 0) {\n      sub[, cdf := ecdf(var[bp_stock == T])(var), by = eom]\n      sub[, min_cdf := min(cdf), by = eom]\n      sub[cdf == min_cdf, cdf := 0.00000001] # To ensure that the lowest value is in portfolio 1   \n      sub[, pf := ceiling(cdf*pfs), by = eom]  \n      sub[pf == 0, pf := 1]  # Happens when non-bp stocks extend beyond bp stock range\n      # Returns \n      op$pf_returns <- sub[, .(\n        characteristic = x,\n        n = .N,\n        signal = median(var),\n        ret_ew = mean(ret_exc_lead1m),\n        ret_vw = sum(ret_exc_lead1m * me) / sum(me),\n        ret_vw_cap = sum(ret_exc_lead1m * me_cap) / sum(me_cap)\n      ), by = .(pf, eom)]\n      op$pf_returns[, eom := ceiling_date(eom+1, unit = \"month\")-1]  # Reflect the fact that returns are leaded\n      # Signals\n      if (signals) {\n        if (signals_w == \"ew\") {\n          sub[, w := 1/.N, by = .(pf, eom)]\n        }\n        if (signals_w == \"vw\") {\n          sub[, w := me / sum(me), .(pf, eom)]\n        }\n        if (signals_w == \"vw_cap\") {\n          sub[, w := me_cap / sum(me_cap), .(pf, eom)]\n        }\n        sub[, (chars) := lapply(.SD, function(x) if_else(is.na(x), 0, 
x)), .SDcols = chars]  # Set missing to median of 0\n        pf_signals <- sub[, lapply(.SD, function(x) sum(w * x)), .SDcols = chars, by = .(pf, eom)]\n        pf_signals[, characteristic := x]\n        pf_signals[, eom := ceiling_date(eom+1, unit = \"month\")-1]  # Reflect the fact that returns are leaded\n        op$signals <- pf_signals\n      }\n      # Daily Portfolios\n      if (daily_pf) {\n        # Keep weights constant throughout month\n        weights <- sub[, .(id, w_ew = 1/.N, w_vw = me/sum(me), w_vw_cap = me_cap/sum(me_cap)), by = .(eom, pf)]\n        daily_sub <- weights[daily, on = .(id, eom=eom_lag1)][!is.na(pf) & !is.na(ret_exc)]\n        op$pf_daily <- daily_sub[, .(\n          n = .N,\n          ret_ew = sum(w_ew*ret_exc),\n          ret_vw = sum(w_vw*ret_exc),\n          ret_vw_cap = sum(w_vw_cap*ret_exc)\n        ), by = .(pf, date)][, characteristic := x]\n      }\n      # Output\n      return(op)  \n    } \n  })\n  output$pf_returns <- char_pfs %>% lapply(function(x) x$pf_returns) %>% rbindlist()\n  if (daily_pf) {\n    output$pf_daily <- char_pfs %>% lapply(function(x) x$pf_daily) %>% rbindlist()\n  }\n  if (ind_pf) {\n    output$gics_returns <- ind_gics\n    if (excntry == \"usa\") {\n      output$ff49_returns <- copy(ind_ff49)\n    }\n  }\n  if (nrow(output$pf_returns) != 0) {\n    output$pf_returns[, excntry := str_to_upper(excntry)]\n    if (daily_pf) {\n      output$pf_daily[, excntry := str_to_upper(unique(output$pf_returns[, excntry]))]\n    }\n    if (signals) {\n      output$signals <- char_pfs %>% lapply(function(x) x$signals) %>% rbindlist()\n      output$signals[, excntry := str_to_upper(excntry)]\n    }\n  }\n  # Characteristic Managed Portfolios \n  if (cmp) {\n    output$cmp <- chars %>% lapply(function(x) {\n      print(paste0(\"   CMP - \" , x, \": \", match(x, chars), \" out of \", length(chars)))\n      data[, var := get(x)]\n      sub <- data[!is.na(var), .(eom, var, size_grp, ret_exc_lead1m)]\n      sub[, p_rank := 
frank(var, na.last=NA, ties.method = \"average\") / (.N + 1), by = .(size_grp, eom)] # Notice tie method\n      sub[, p_rank_dev := p_rank - mean(p_rank), by = .(size_grp, eom)]\n      sub[, weight := p_rank_dev / (sum(abs(p_rank_dev)) / 2), by = .(size_grp, eom)]  # 1 unit invested in each of the long and short pf\n      cmp <- sub[, .(\n        excntry = excntry,\n        characteristic = x,\n        n_stocks = .N,\n        ret_weighted = sum(ret_exc_lead1m * weight),\n        signal_weighted = sum(var * weight),\n        sd_var = sd(var)\n      ), by = .(size_grp, eom)]\n      cmp <- cmp[sd_var != 0][, sd_var := NULL]\n      cmp[, eom := ceiling_date(eom+1, unit = \"month\")-1]  # Reflect the fact that returns are leaded\n      return(cmp)\n    }) %>% rbindlist()\n    output$cmp[, excntry := str_to_upper(excntry)]\n  }\n  # Output\n  return(output)\n}\n\n# Extract Necessary Information --------------------\n# Factor Details\nchar_info <- readxl::read_xlsx(\"Factor Details.xlsx\", sheet = \"details\", range = \"A1:N300\") %>%\n  select(\"characteristic\"=abr_jkp, direction) %>%\n  filter(!is.na(characteristic)) %>%\n  mutate(direction = direction %>% as.integer) %>%\n  setDT()\n# Country Classification\ncountry_classification <- readxl::read_xlsx(\"Country Classification.xlsx\", \n                                            sheet = \"countries\", range = \"A1:I200\") %>%\n  select(excntry, msci_development, region) %>%\n  filter(!is.na(excntry) & !(excntry %in% settings$regional_pfs$country_excl)) %>%\n  setDT()\nregions <- tibble(\n  name = c(\"developed\", \"emerging\", \"frontier\", \"world\", \"world_ex_us\"),\n  country_codes = list(\n    country_classification[msci_development == \"developed\" & excntry != \"USA\"]$excntry,\n    country_classification[msci_development == \"emerging\"]$excntry,\n    country_classification[msci_development == \"frontier\"]$excntry,\n    country_classification$excntry,\n    country_classification[excntry != \"USA\"]$excntry\n  
),\n  countries_min = c(rep(settings$regional_pfs$countries_min, 3), 1, 3)\n)\n# Cluster Labels\ncluster_labels <- fread(\"Cluster Labels.csv\")\n# NYSE Cutoff \nnyse_size_cutoffs <- fread(paste0(data_path, \"/nyse_cutoffs.csv\"), colClasses = c(\"eom\"=\"character\"))\nnyse_size_cutoffs[, eom := as.Date(eom, format = \"%Y%m%d\")]\n# CRSP Return Cutoffs\nret_cutoffs <- fread(paste0(data_path, \"/return_cutoffs.csv\"), colClasses = c(\"eom\"=\"character\"))\nret_cutoffs[, eom := as.Date(eom, format = \"%Y%m%d\")]\nret_cutoffs[, eom_lag1 := floor_date(eom, unit = \"month\") - 1]  # Because we use ret_exc_lead1m\nif (settings$daily_pf) {\n  ret_cutoffs_daily <- fread(paste0(data_path, \"/return_cutoffs_daily.csv\")) \n}\n# Market \nmarket <- fread(paste0(data_path, \"/market_returns.csv\"), colClasses = c(\"eom\"=\"character\"))\nmarket[, eom := eom %>% as.Date(\"%Y%m%d\")]\nif (settings$daily_pf) {\n  market_daily <- fread(paste0(data_path, \"/market_returns_daily.csv\"), colClasses = c(\"date\"=\"character\"))\n  market_daily[, date := date %>% as.Date(\"%Y%m%d\")]\n}\n\n# Create Portfolios -----------------------\nportfolio_data <- countries %>% lapply(function(ex) {\n  print(paste0(ex, \": \", match(ex, countries), \" out of \", length(countries)))\n  portfolios(\n    data_path = data_path,\n    excntry = ex, \n    chars = chars, \n    source = settings$source, \n    wins_ret = settings$wins_ret, \n    pfs=settings$pfs, \n    bps=settings$bps, \n    bp_min_n=settings$bp_min_n, \n    cmp = if_else(ex == \"usa\", settings$cmp$us, settings$cmp$int),\n    signals=if_else(ex == \"usa\", settings$signals$us, settings$signals$int),\n    signals_standardize=settings$signals$standardize, \n    signals_w=settings$signals$weight, \n    nyse_size_cutoffs = nyse_size_cutoffs, \n    daily_pf = settings$daily_pf,\n    ind_pf = settings$ind_pf, \n    ret_cutoffs = ret_cutoffs,\n    ret_cutoffs_daily = ret_cutoffs_daily\n  )\n})\n\n# Daily Data\nif (settings$daily_pf) {\n  # Daily 
Portfolio Returns\n  pf_daily <- portfolio_data %>% lapply(function(x) x$pf_daily) %>% rbindlist() \n  pf_daily %>% setorder(excntry, characteristic, pf, date)\n  # Daily Long-Short Factors\n  hml_daily <- pf_daily[, .(\n    pfs = sum(pf == settings$pfs) + sum(pf == 1),\n    n_stocks = n[pf==settings$pfs] + n[pf==1],\n    n_stocks_min = as.integer(min(n[pf==settings$pfs], n[pf==1])),\n    ret_ew = ret_ew[pf==settings$pfs] - ret_ew[pf==1],\n    ret_vw = ret_vw[pf==settings$pfs] - ret_vw[pf==1],\n    ret_vw_cap = ret_vw_cap[pf==settings$pfs] - ret_vw_cap[pf==1]\n  ), .(excntry, characteristic, date)]\n  hml_daily <- hml_daily[pfs == 2][, pfs := NULL]\n  hml_daily %>% setorder(excntry, characteristic, date)\n  lms_daily <- char_info[hml_daily, on = \"characteristic\"]\n  resign_cols <- c(\"ret_ew\", \"ret_vw\", \"ret_vw_cap\")\n  lms_daily[, (resign_cols) := lapply(.SD, function(x) x*direction), .SDcols=resign_cols]\n}\n\n# Monthly Portfolio Returns\npf_returns <- portfolio_data %>% lapply(function(x) x$pf_returns) %>% rbindlist() \npf_returns <- pf_returns %>% select(excntry, characteristic, pf, eom, n, signal, ret_ew, ret_vw, ret_vw_cap)\npf_returns %>% setorder(excntry, characteristic, pf, eom)\n\n# GICS Returns \nif (settings$ind_pf) {\n  gics_returns <- portfolio_data %>% lapply(function(x) x$gics_returns) %>% rbindlist()\n  gics_returns %>% setorder(excntry, gics, eom)\n  ff49_returns <- portfolio_data[[which(countries == \"usa\")]]$ff49_returns\n  ff49_returns %>% setorder(excntry, ff49, eom)\n}\n\n# Create HML Returns\nhml_returns <- pf_returns[, .(\n  pfs = sum(pf == settings$pfs) + sum(pf == 1),\n  signal = signal[pf==settings$pfs] - signal[pf==1],\n  n_stocks = n[pf==settings$pfs] + n[pf==1],\n  n_stocks_min = min(n[pf==settings$pfs], n[pf==1]),\n  ret_ew = ret_ew[pf==settings$pfs] - ret_ew[pf==1],\n  ret_vw = ret_vw[pf==settings$pfs] - ret_vw[pf==1],\n  ret_vw_cap = ret_vw_cap[pf==settings$pfs] - ret_vw_cap[pf==1]\n), .(excntry, characteristic, 
eom)]\nhml_returns <- hml_returns[pfs == 2][, pfs := NULL]\nhml_returns %>% setorder(excntry, characteristic, eom)\n\n# Create Long-Short Factors [Sign Returns to be consistent with original paper]\nlms_returns <- char_info[hml_returns, on = \"characteristic\"]\nresign_cols <- c(\"signal\", \"ret_ew\", \"ret_vw\", \"ret_vw_cap\")\nlms_returns[, (resign_cols) := lapply(.SD, function(x) x*direction), .SDcols=resign_cols]\n\n# Extract Signals (TBD)\n\n# Extract CMP returns\ncmp_returns <- portfolio_data %>% lapply(function(x) x$cmp) %>% rbindlist() \ncmp_returns <- cmp_returns %>% select(excntry, characteristic, size_grp, eom, n_stocks, signal_weighted, ret_weighted)\n\n# Cluster portfolios ---------------\ncluster_pfs <- cluster_labels[lms_returns, on = .(characteristic)][, .(\n  n_factors = .N,\n  ret_ew = mean(ret_ew),\n  ret_vw = mean(ret_vw),\n  ret_vw_cap = mean(ret_vw_cap)\n), by = .(excntry, cluster, eom)]\nif (settings$daily_pf) {\n  cluster_pfs_daily <- cluster_labels[lms_daily, on = .(characteristic)][, .(\n    n_factors = .N,\n    ret_ew = mean(ret_ew),\n    ret_vw = mean(ret_vw),\n    ret_vw_cap = mean(ret_vw_cap)\n  ), by = .(excntry, cluster, date)]\n}\n\n# Regional Portfolios ------------------------------------------------\nregional_data <- function(data, mkt, date_col, char_col, countries, weighting, countries_min, periods_min, stocks_min) {\n  # Determine Country Weights\n  weights <- mkt[, .(excntry, get(date_col), mkt_vw_exc, \"country_weight\" = case_when(\n    weighting == \"market_cap\" ~ me_lag1,\n    weighting == \"stocks\" ~ as.double(stocks),\n    weighting == \"ew\" ~ 1)\n  )]\n  weights %>% setnames(old=\"V2\", new=\"date_col\")\n  # Portfolio Return \n  pf <- data[excntry %in% countries & n_stocks_min >= stocks_min] \n  pf %>% setnames(old=c(date_col, char_col), new = c(\"date_col\", \"char_col\"))\n  pf <- weights[pf, on = .(excntry, date_col)]\n  pf <- pf[!is.na(mkt_vw_exc), .(  \n    n_countries = .N,\n    direction = 
unique(direction),\n    ret_ew = sum(ret_ew*country_weight) / sum(country_weight),\n    ret_vw = sum(ret_vw*country_weight) / sum(country_weight),\n    ret_vw_cap = sum(ret_vw_cap*country_weight) / sum(country_weight),\n    mkt_vw_exc = sum(mkt_vw_exc * country_weight) / sum(country_weight) \n  ), by = .(char_col, date_col)]\n  # Minimum Requirement: Countries\n  pf <- pf[n_countries >= countries_min]\n  # Minimum Requirement: Months\n  pf[, periods := .N, by = .(char_col)]\n  pf <- pf[periods >= periods_min][, periods := NULL]\n  pf %>% setorder(char_col, date_col)\n  pf %>% setnames(old = c(\"date_col\", \"char_col\"), new = c(date_col, char_col))\n  return(pf)\n}\n# Regional Factors\nregional_pfs <- 1:nrow(regions) %>% lapply(function(i) {\n  info <- regions[i, ]\n  reg_pf <- lms_returns %>% regional_data(mkt=market, countries = unlist(info$country_codes), date_col = \"eom\", char_col = \"characteristic\", \n                                          weighting = settings$regional_pfs$country_weights,\n                                          countries_min = info$countries_min, periods_min = settings$regional_pfs$months_min, \n                                          stocks_min = settings$regional_pfs$stocks_min)\n  reg_pf %>% mutate(region = info$name) %>% select(region, characteristic, direction, eom, n_countries, ret_ew, ret_vw, ret_vw_cap, mkt_vw_exc)\n}) %>% rbindlist() \nif (settings$daily_pf) {\n  regional_pfs_daily <- 1:nrow(regions) %>% lapply(function(i) {\n    info <- regions[i, ]\n    reg_pf <- lms_daily %>% regional_data(mkt=market_daily, countries = unlist(info$country_codes), date_col = \"date\", char_col = \"characteristic\", \n                                          weighting = settings$regional_pfs$country_weights,\n                                          countries_min = info$countries_min, periods_min = settings$regional_pfs$months_min*21, \n                                          stocks_min = settings$regional_pfs$stocks_min)\n    
reg_pf %>% mutate(region = info$name) %>% select(region, characteristic, direction, date, n_countries, ret_ew, ret_vw, ret_vw_cap, mkt_vw_exc)\n  }) %>% rbindlist() \n}\n# Regional Cluster Portfolios\nregional_clusters <- 1:nrow(regions) %>% lapply(function(i) {\n  info <- regions[i, ]\n  reg_pf <- cluster_pfs %>% rename(\"n_stocks_min\"=n_factors) %>% mutate(direction = NA_real_) %>% # Hack to make the function applicable\n    regional_data(mkt=market, countries = unlist(info$country_codes), date_col = \"eom\", char_col = \"cluster\",\n                  weighting = settings$regional_pfs$country_weights,\n                  countries_min = info$countries_min, periods_min = settings$regional_pfs$months_min, \n                  stocks_min = 1)\n  reg_pf %>% mutate(region = info$name) %>% select(region, cluster, eom, n_countries, ret_ew, ret_vw, ret_vw_cap, mkt_vw_exc)\n}) %>% rbindlist() \nif (settings$daily_pf) {\n  regional_clusters_daily <- 1:nrow(regions) %>% lapply(function(i) {\n    info <- regions[i, ]\n    reg_pf <- cluster_pfs_daily %>% rename(\"n_stocks_min\"=n_factors) %>% mutate(direction = NA_real_) %>% # Hack to make the function applicable\n      regional_data(mkt=market_daily, countries = unlist(info$country_codes), date_col = \"date\", char_col = \"cluster\", \n                    weighting = settings$regional_pfs$country_weights,\n                    countries_min = info$countries_min, periods_min = settings$regional_pfs$months_min*21, \n                    stocks_min = 1)\n    reg_pf %>% mutate(region = info$name) %>% select(region, cluster, date, n_countries, ret_ew, ret_vw, ret_vw_cap, mkt_vw_exc)\n  }) %>% rbindlist() \n}\n\n# Save ----------------\nif(!is.null(legacy_path)) {\n  # Save Time Stamped Files\n  folder <- paste0(legacy_path, \"/Past Portfolios/\", Sys.Date())\n  dir.create(folder)\n  settings %>% saveRDS(file = paste0(folder, \"/settings.RDS\"))\n  market[eom <= settings$end_date] %>% fwrite(file = paste0(folder, 
\"/market_returns.csv\"))\n  market_daily[date <= settings$end_date] %>% fwrite(file = paste0(folder, \"/market_returns_daily.csv\"))\n  hml_returns[eom <= settings$end_date] %>% fwrite(file = paste0(folder, \"/hml.csv\"))\n  cmp_returns[eom <= settings$end_date] %>% fwrite(file = paste0(folder, \"/cmp.csv\"))\n  if (settings$daily_pf) {\n    lms_daily[date <= settings$end_date] %>% fwrite(file = paste0(folder, \"/lms_daily.csv\"))\n  }\n  if (settings$ind_pf) {\n    gics_returns[eom <= settings$end_date] %>% fwrite(file = paste0(folder, \"/industry_gics.csv\"))\n    if (nrow(ff49_returns) != 0) {\n      ff49_returns[eom <= settings$end_date] %>% fwrite(file = paste0(folder, \"/industry_ff49.csv\"))\n    }\n  }\n}\n# Save Most Recent Files\nmarket[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/market_returns.csv\"))\npf_returns[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/pfs.csv\"))\nhml_returns[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/hml.csv\"))\nlms_returns[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/lms.csv\"))\ncmp_returns[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/cmp.csv\"))\ncluster_pfs[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/clusters.csv\"))\nif (settings$daily_pf) {\n  market_daily[date <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/market_returns_daily.csv\"))\n  pf_daily[date <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/pfs_daily.csv\"))\n  hml_daily[date <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/hml_daily.csv\"))\n  lms_daily[date <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/lms_daily.csv\"))\n  cluster_pfs_daily[date <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/clusters_daily.csv\"))\n}\nif (settings$ind_pf) {\n  gics_returns[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/industry_gics.csv\"))\n  if 
(nrow(ff49_returns) != 0) {\n    ff49_returns[eom <= settings$end_date] %>% fwrite(file = paste0(output_path, \"/industry_ff49.csv\"))\n  }\n}\n\n# Regional Factors\nreg_folder <- paste0(output_path, \"/Regional Factors\")\nif (!dir.exists(reg_folder)) {\n  dir.create(reg_folder)\n}\nfor (reg in unique(regional_pfs$region)) {\n  regional_pfs[eom <= settings$end_date & region %in% reg] %>% fwrite(file = paste0(reg_folder, \"/\", str_to_sentence(reg), \".csv\"))\n}\nif (settings$daily_pf) {\n  reg_folder_daily <- paste0(output_path, \"/Regional Factors Daily\")\n  if (!dir.exists(reg_folder_daily)) {\n    dir.create(reg_folder_daily)\n  }\n  for (reg in unique(regional_pfs_daily$region)) {\n    regional_pfs_daily[date <= settings$end_date & region %in% reg] %>% fwrite(file = paste0(reg_folder_daily, \"/\", str_to_sentence(reg), \".csv\"))\n  }\n}\n# Regional Clusters\nreg_folder <- paste0(output_path, \"/Regional Clusters\")\nif (!dir.exists(reg_folder)) {\n  dir.create(reg_folder)\n}\nfor (reg in unique(regional_clusters$region)) {\n  regional_clusters[eom <= settings$end_date & region %in% reg] %>% fwrite(file = paste0(reg_folder, \"/\", str_to_sentence(reg), \".csv\"))\n}\nif (settings$daily_pf) {\n  reg_folder_daily <- paste0(output_path, \"/Regional Clusters Daily\")\n  if (!dir.exists(reg_folder_daily)) {\n    dir.create(reg_folder_daily)\n  }\n  for (reg in unique(regional_clusters_daily$region)) {\n    regional_clusters_daily[date <= settings$end_date & region %in% reg] %>% fwrite(file = paste0(reg_folder_daily, \"/\", str_to_sentence(reg), \".csv\"))\n  }\n}\n\n# Save Long/Short Factors by Country\ncnt_folder <- paste0(output_path, \"/Country Factors\")\nif (!dir.exists(cnt_folder)) {\n  dir.create(cnt_folder)\n}\nfor (exc in unique(lms_returns$excntry)) {\n  lms_returns[eom <= settings$end_date & excntry==exc] %>% fwrite(file = paste0(cnt_folder, \"/\", exc, \".csv\"))\n}\nif (settings$daily_pf) {\n  cnt_folder_daily <- paste0(output_path, \"/Country 
Factors Daily\")\n  if (!dir.exists(cnt_folder_daily)) {\n    dir.create(cnt_folder_daily)\n  }\n  for (exc in unique(lms_daily$excntry)) {\n    lms_daily[date <= settings$end_date & excntry==exc] %>% fwrite(file = paste0(cnt_folder_daily, \"/\", exc, \".csv\"))\n  }\n}\n\n# Save supplementary information\nnyse_size_cutoffs %>% fwrite(file = paste0(output_path, \"/nyse_cutoffs.csv\"))\nret_cutoffs %>% fwrite(file = paste0(output_path, \"/return_cutoffs.csv\"))\nif (settings$daily_pf) {\n  ret_cutoffs_daily %>% fwrite(file = paste0(output_path, \"/return_cutoffs_daily.csv\"))\n}\n"
  },
  {
    "path": "GlobalFactors/project_macros.sas",
"content": "*************************************************************\n*  HELPER MACROS\n************************************************************ ; \n* Winsorize_own: Flexible version of WRDS %winsorize macro\n\tThe only difference currently is the explicit specification of perc_low and perc_high.\n\tThis allows winsorizing at each end and also specifying winsorization below 1%\n;\n%macro winsorize_own(inset=, outset=, sortvar=, vars=, perc_low=1, perc_high=99, trim=0);\n\t/* List of all variables */\n\t%let vars = %sysfunc(compbl(&vars));\n\t%let nvars = %nwords(&vars);\n\t\n\t/* Display Output */\n\t%put ### START.;\n\t\n\t/* Trimming / Winsorization Options */\n\t%if &trim=0 %then %put ### Winsorization; %else %put ### Trimming;\n\t%put ### Number of Variables:  &nvars;\n\t%put ### List   of Variables:  &vars;\n\toptions nonotes;\n\t\n\t/* Ranking within &sortvar levels */\n\t%put ### Sorting... ;\n\tproc sort data=&inset; by &sortvar; run; \n\t\n\t/* 2-tail winsorization/trimming */\n\t%let var2 = %sysfunc(tranwrd(&vars,%str( ),%str(__ )))__;\n\t%let var_p1 = %sysfunc(tranwrd(&vars,%str( ),%str(__&perc_low )))__&perc_low ;\n\t%let var_p2 = %sysfunc(tranwrd(&vars,%str( ),%str(__&perc_high )))__&perc_high ;\n\t\n\t/* Theis: Handle naming if winsorization < 1%. In this case, digits cause problems */\n\t%let var_p1 = %sysfunc(prxchange(s/\\./_/, -1, &var_p1.));  /* Replace . with _ */\n\t%let var_p2 = %sysfunc(prxchange(s/\\./_/, -1, &var_p2.));  /* Replace . with _ */\n\t\n\t/* Calculate upper and lower percentiles */\n\tproc univariate data=&inset noprint;\n\t  by &sortvar;\n\t  var &vars;\n\t  output out=_perc pctlpts=&perc_low &perc_high pctlpre=&var2;\n\trun;\n\t\n\t%if &trim=1 %then \n\t  %let condition = %str(if myvars(i)>=perct2(i) or myvars(i)<=perct1(i) then myvars(i)=. );\n\t  %else %let condition = %str(myvars(i)=min(perct2(i),max(perct1(i),myvars(i))) );\n\t\n\t%if &trim=0 %then %put ### Winsorizing at &perc_low.% and &perc_high.%... 
;\n\t%else %put ### Trimming at &perc_low.% and &perc_high.%... ;\n\t\n\t/* Save output with trimmed/winsorized variables */\n\tdata &outset;\n\tmerge &inset (in=a) _perc;\n\tby &sortvar;\n\t  if a;\n\t  array myvars {&nvars} &vars;\n\t  array perct1 {&nvars} &var_p1;\n\t  array perct2 {&nvars} &var_p2;\n\t  do i = 1 to &nvars;\n\t    if not missing(myvars(i)) then\n\t    do;\n\t      &condition;\n\t    end;\n\t  end;\n\tdrop i &var_p1 &var_p2;\n\trun;\n\t\n\t/* House Cleaning */\n\tproc sql; drop table _perc; quit;\n\toptions notes;\n\t\n\t%put ### DONE . ; %put ;\n\n%mend winsorize_own;\n\n/* MACRO: RETURN CUTOFFS\n\tThe output of the macro is the 0.1%, 1%, 99% and 99.9% percentiles of excess returns which can later be used for winsorization\n\tThe reason why this procedure is necessary is that we output the data country-by-country, which makes across-country winsorization difficult.\n*/\n%macro return_cutoffs(data=, freq=, out=, crsp_only=); \n\t%if &freq.=m %then %do;\n\t\t%let date_var = eom;\n\t\t%let by_vars = eom;\n\t%end;\n\t%if &freq.=d %then %do;\n\t\t%let date_var = date;\n\t\t%let by_vars = year month;\n\t%end;\n\t\n\t%if &crsp_only.=1 %then %do;\n\t\tproc sort data=&data.(where=(source_crsp=1 and common=1 and obs_main=1 and exch_main=1 and primary_sec=1 and excntry ^= 'ZWE' and not missing(ret_exc))) out=base; by &date_var.; run; \n\t%end;\n\t%if &crsp_only.=0 %then %do;\n\t\tproc sort data=&data.(where=(common=1 and obs_main=1 and exch_main=1 and primary_sec=1 and excntry ^= 'ZWE' and not missing(ret_exc))) out=base; by &date_var.; run;\n\t%end;\n\t\n\t%if &freq.=d %then %do;\n\t\tdata base; \n\t\t\tset base;\n\t\t\tyear=year(date);\n\t\t\tmonth=month(date);\n\t\trun;\n\t%end;\n\t\n\t%let ret_types = ret ret_local ret_exc; \n\t%do i=1 %to %sysfunc(countw(&ret_types.));  \n\t\t%let ret_type = %scan(&ret_types., &i.);\n\t\tproc univariate data=base noprint;\n\t  \t\tby &by_vars.;\n\t  \t\tvar &ret_type.;\n\t  \t\toutput out=cutoffs n=n pctlpts=0.1 1 
99 99.9 pctlpre=&ret_type._;\n\t  \trun;\n\t  \t%if &i.=1 %then %do;\n\t  \t\tdata &out.; set cutoffs; run;\n\t  \t%end;\n\t  \t%else %do;\n\t  \t\t%if &freq.=m %then %do;\n\t  \t\t\tproc sql;\n\t\t  \t\t\tcreate table &out. as\n\t\t  \t\t\tselect a.*, b.&ret_type._0_1, b.&ret_type._1, b.&ret_type._99, b.&ret_type._99_9\n\t\t  \t\t\tfrom &out. as a\n\t\t  \t\t\tleft join cutoffs as b\n\t\t  \t\t\ton a.eom=b.eom;\n\t\t  \t\tquit;\n\t  \t\t%end;\n\t  \t\t%if &freq.=d %then %do;\n\t  \t\t\tproc sql;\n\t\t  \t\t\tcreate table &out. as\n\t\t  \t\t\tselect a.*, b.&ret_type._0_1, b.&ret_type._1, b.&ret_type._99, b.&ret_type._99_9\n\t\t  \t\t\tfrom &out. as a\n\t\t  \t\t\tleft join cutoffs as b\n\t\t  \t\t\ton a.year=b.year and a.month=b.month;\n\t\t  \t\tquit;\n\t  \t\t%end;\n\t  \t%end;\n\t%end;\n\tproc delete data= cutoffs base; run;\n%mend;\n\n/* MACRO: NYSE SIZE CUTOFFS\n\tUsed for determining size groups and me cap weights\n*/\n%macro nyse_size_cutoffs(data=, out=);\n\tproc sort data=&data.(where=(crsp_exchcd=1 and obs_main = 1 and exch_main = 1 and primary_sec = 1 and common = 1 and not missing(me))) \n\t\tout=nyse_stocks; \n\t\tby eom; \n\trun;\n\t\n\tproc means data=nyse_stocks noprint;\n\t\tby eom;\n\t\tvar me;\n\t\toutput out=&out.(drop=_type_ _freq_) N=n P1=nyse_p1 p20 = nyse_p20 P50=nyse_p50 p80 = nyse_p80;\n\trun;\n\tproc delete data=nyse_stocks; run;\n%mend;\n\n/* Flexible version of WRDS populate function which can also do daily frequency */\n%macro populate_own(inset=, outset=, datevar=, idvar=, datename=, forward_max=, period=); /* Period in ('day', 'month') */\n\t/* Start Macro*/\n\t%put ; %put ### START. Populating Data --> Note that duplicate idvar and datevar will be removed;\n\n\t/* nodupkey sort necessary */\n\tproc sort data=&inset. 
out=__temp nodupkey; by &idvar descending &datevar.; run;\n\toptions nonotes;\n\t\n\t/* Populate Dates */\n\t/* FORWARD_MAX is the Regular Periodicity or the Forward Population Intervals */\n\t%let nid = %nwords(&idvar.);\n\t%let id2 = %scan(&idvar.,&nid.,%str( ));\n\t\n\tdata &outset. ; format &datename. YYMMDDN8.; \n\t\tset __temp;\n\t\tby &idvar.;\n\t\t&datename. =&datevar.;\n\t\toutput;\n\t\tfollowing = lag(&datename.); \n\t\tforward_max = intnx('month', &datevar., &forward_max.,'e');\n\t\tif first.&id2 then\n\t\t \tfollowing = .;\n\t\t\n  \t\tn = intck(&period.,&datename., min(following, forward_max));\n  \t\tdo i=1 to n-1;\n\t   \t\t&datename. = intnx(&period.,&datename. ,1,\"E\"); output;\n\t  \tend;\n\t \t\n\t\tdrop following forward_max n i;\n\trun;\n\t\n\tproc sort data=&outset. nodupkey; by &idvar. &datevar. &datename.; run;\n\t\n\t/* House Cleaning */\n\tproc sql; drop table __temp; quit;\n\toptions notes;\n\t%put ### DONE . Dataset &OUTSET. with &period. Frequency Generated ; %put ;\n\n%MEND populate_own;\n\n/* Generic expand in a data set with a start date column and an end date column. */\n%macro expand(data=, out=, id_vars=, start_date=, end_date=, freq='day', new_date_name=date); /*freq in ('day', 'month')*/\n\tdata __expanded;\n\t   set &data.;\n\t   format &new_date_name. YYMMDDN8.;\n\t   do i = 0 to intck(&freq., &start_date., &end_date.);\n\t      &new_date_name. = intnx(&freq., &start_date., i, 'e');\n\t      output;\n\t   end;\n\t   drop &start_date. &end_date. i;\n\trun;\n\t\n\tproc sort data=__expanded out=&out nodupkey; by &id_vars. 
&new_date_name.; run;\n\tproc delete data=__expanded; run;\n%mend expand;\n\n/* USD to Foreign FX Conversion Rate from Compustat*/\n%macro compustat_fx(out=);\n\tdata usd_curcdd; \n\t\tcurcdd='USD';\n\t\tdatadate=input(put(19500101,8.),yymmdd8.);\n\t\tfx=1;\n\t\tformat datadate yymmddn8.;\n\trun;  /* comp.exrt_dly only starts in 1982 and since we convert to USD we know that the fx for USD is 1 */\n\t\n\tproc sql; \n\t\tcreate table __fx1 as\n\t\tselect distinct a.tocurd as  curcdd  , a.datadate,  b.exratd/a.exratd as fx /*fx is quoted as x/USD so to go from x to USD do x*fx*/\n\t\tfrom comp.exrt_dly a , comp.exrt_dly b\n\t\twhere a.fromcurd = 'GBP' and b.tocurd = 'USD' /*b.exratd is always from GBP to USD, a.exratd is from GBP to currency X*/\n\t\tand a.fromcurd = b.fromcurd and a.datadate = b.datadate;\n\tquit;\n\t\n\tdata __fx2; set __fx1 usd_curcdd; run; \n\t\n\tproc sort data = __fx2;  by curcdd descending datadate; run ; \n\t\n\t/* Carry forward fx observations in case gaps*/\n\tdata __fx3; format date YYMMDDN8.; \n\t\tset __fx2;\n\t\tby curcdd;\n\t\tdate = datadate;\n\t\toutput;\n\t\tfollowing = lag(date); \n\t\tif first.curcdd then\n\t\t \tfollowing = date+1;\n  \t\tn = following-date;\n  \t\tdo i=1 to n-1;\n\t   \t\tdate = date+1; output;\n\t  \tend;\n\t \t\n\t\tdrop datadate following n i;\n\trun;\n\t\n\tproc sort data=__fx3 out=&out nodupkey; by curcdd date; run;\n\t\n\tproc delete data=usd_curcdd __fx1 __fx2 __fx3; run;\n%mend compustat_fx; \n\n\n**********************************************************************************************************************\n*  MACRO - Add Primary Security from Compustat *\n**********************************************************************************************************************\nThe macro expects a data set with gvkey and iid. 
It then returns the same dataset with a column \"primary_sec\" which is 1 when a security is the primary security in the US, Canada or internationally.\t\nImportantly, a given company (gvkey) can potentially have up to 3 primary securities although it happens very rarely. In the full Compustat monthly dataset, of all gvkey-eom pairs we have\n\t0 primary securities:  1.82%\n  \t1 primary securities: 95.10%\n  \t2 primary securities:  3.08%\n  \t3 primary securities:  0.01%\n;\n%macro add_primary_sec(data=, out=, date_var=);\n\tproc sql;\n\t\tcreate table __prihistrow as \n\t\tselect gvkey, itemvalue as prihistrow, effdate, thrudate\n\t\tfrom comp.g_sec_history where item = 'PRIHISTROW';\n\tquit;\n\t\n\tproc sql;\n\t\tcreate table __prihistusa as \n\t\tselect gvkey, itemvalue as prihistusa, effdate, thrudate\n\t\tfrom comp.sec_history where item = 'PRIHISTUSA';\n\tquit;\n\t\n\tproc sql;\n\t\tcreate table __prihistcan as \n\t\tselect gvkey, itemvalue as prihistcan, effdate, thrudate\n\t\tfrom comp.sec_history where item = 'PRIHISTCAN';\n\tquit;\n\t\n\tproc sql;\n\t\tcreate table __header as \n\t\tselect gvkey, prirow, priusa, prican from comp.company\n\t\touter union corr\n\t\tselect gvkey, prirow, priusa, prican from comp.g_company;\n\tquit;\n\t\n\tproc sort data=__header nodupkey; by gvkey; run; /* Only one duplicate (gvkey=254381)*/\n\t\n\tproc sql;\n\t\tcreate table __data1 as \n\t\tselect distinct a.*, coalesce(b.prihistrow, e.prirow) as prihistrow, coalesce(c.prihistusa, e.priusa)  as prihistusa, coalesce(d.prihistcan, e.prican) as prihistcan /* Prefer historical primary sec indicator, if this is not available, use header. Distinct avoid around 400 dup observations due to gvkey=213098 having overlapping prihistrow in g_sec_history */\n\t\tfrom &data. 
as a \n\t\tleft join __prihistrow as b\n\t\t\ton a.gvkey=b.gvkey and a.&date_var.>=b.effdate and (a.&date_var.<=b.thrudate or missing(b.thrudate))\n\t\tleft join __prihistusa as c\n\t\t\ton a.gvkey=c.gvkey and a.&date_var.>=c.effdate and (a.&date_var.<=c.thrudate or missing(c.thrudate))\n\t\tleft join __prihistcan as d\n\t\t\ton a.gvkey=d.gvkey and a.&date_var.>=d.effdate and (a.&date_var.<=d.thrudate or missing(d.thrudate))\n\t\tleft join __header as e\n\t\t\ton a.gvkey=e.gvkey;\n\t\t\t\n\t\tcreate table __data2 as \n\t\tselect *, (not missing(iid) and (iid=prihistrow or iid=prihistusa or iid=prihistcan)) as primary_sec /* If the security id is identified as primary in either USA, CANADA or Rest of the World it is deemed a primary security */\n\t\tfrom __data1;\n\tquit;\n\t\n\tdata &out.; set __data2(drop = prihistrow prihistusa prihistcan);\n\t\n\tproc delete data=__prihistrow __prihistusa __prihistcan __header __data1 __data2; run;\n%mend add_primary_sec;\n\n/* MACRO - COMPUSTAT EXCHANGES\n\tThis macro returns a dataset that classifies individual exchanges by country and indicates whether each is a normal or special exchange\n*/\n%macro comp_exchanges(out=); \n\t/* Exchange Classification */\n\t/* What about 153? Could consider excluding 111 since it's the same securities traded on 110. 
This is just due to a rule in Thailand limiting the ownership of foreign investors in Thailand https://thaishares.com/nvdr/*/\n\t%let special_exchanges =\n\t\t\t(0, \n\t\t\t1, \n\t\t\t2, \n\t\t\t3, \n\t\t\t4, \n\t\t\t15, 16, 17, 18, 21, /* US exchanges not in NYSE, Amex and NASDAQ */\n\t\t\t13, \n\t\t\t19, \n\t\t\t20, \n\t\t\t127, \n\t\t\t150, \t\t/* AIAF Mercado De Renta Fija --> Spanish exchange for trading debt securities https://practiceguides.chambers.com/practice-guides/capital-markets-debt-2019/spain/1-debt-marketsexchanges */\n\t\t\t157, \n\t\t\t229, \n\t\t\t263, \n\t\t\t269, \n\t\t\t281, \n\t\t\t283, \n\t\t\t290, \n\t\t\t320, \n\t\t\t326, \n\t\t\t341, \n\t\t\t342, \n\t\t\t347, \n\t\t\t348, \n\t\t\t349,\t\t/* BATS Chi-X Europe --> Trades stocks from various european exchanges. Should we keep it?*/\n\t\t\t352)\t\t/* CHI-X Australia --> Only Trades securities listed on ASX (exchg=106). Should we keep it?*/\n\t;\n\n\t/* Determine Country of Exchange (Note that we assume that this is constant through time) */\n\tproc sql;\n\t\tcreate table __ex_country1 as\n\t\tselect distinct exchg, excntry from comp.g_security\n\t\touter union corr\n\t\tselect distinct exchg, excntry from comp.security;\n\t\t\n\t\tcreate table __ex_country2 as\n\t\tselect distinct exchg, \n\t\t\tcase \n\t\t\t\twhen count(excntry)>1 then 'multi_national' /*, calculated count > 1 as multi_national*/\n\t\t\t\telse excntry\n\t\t\tend as excntry\n\t\tfrom __ex_country1\n\t\twhere not missing(excntry) and not missing(exchg)\n\t\tgroup by exchg;\n\t\t\n\t\tcreate table __ex_country3 as\n\t\tselect a.*, b.exchgdesc\n\t\tfrom __ex_country2 as a left join comp.r_ex_codes as b\n\t\ton a.exchg=b.exchgcd;\n\t\t\n\t\tcreate table &out. as \n\t\tselect *, (excntry ^= 'multi_national' and exchg not in &special_exchanges.) 
as exch_main\n\t\tfrom __ex_country3;\n\tquit;\n\t\n\tproc delete data= __ex_country1 __ex_country2 __ex_country3; run;\n%mend comp_exchanges;\n\n**********************************************************************************************************************\n*                                    US - Data From CRSP\n********************************************************************************************************************* ; \n%macro prepare_crsp_sf(freq=m); /* freq in ('d', 'm') for daily and monthly. Returns crsp_msf if p=m and crsp_dsf if p=d*/\n\t/* CRSP with Company Information*/\n\tproc sql;\n\t\tcreate table __crsp_sf1 as\n\t\tselect a.permno, a.permco, a.date, (a.prc < 0) as bidask, abs(a.prc) as prc, a.shrout/1000 as shrout, calculated prc * calculated shrout as me,\n\t\t   a.ret, a.retx, a.cfacshr, a.vol, \n\t\t   case when a.prc > 0 and a.askhi > 0 then a.askhi else . end as prc_high,   /* Highest price when prc is not the bid-ask average: https://wrds-web.wharton.upenn.edu/wrds/query_forms/variable_documentation.cfm?vendorCode=CRSP&libraryCode=crspa&fileCode=dsf&id=askhi*/\n\t\t   case when a.prc > 0 and a.bidlo > 0 then a.bidlo else . 
end as prc_low,    /* Lowest price when prc is not the bid-ask average: https://wrds-web.wharton.upenn.edu/wrds/query_forms/variable_documentation.cfm?vendorCode=CRSP&libraryCode=crspa&fileCode=dsf&id=bidlo */\n\t\t   b.shrcd, b.exchcd, c.gvkey, c.liid as iid, /*http://www.crsp.org/products/documentation/crspccmlink-security-link-history*/\n\t\t   b.exchcd in (1, 2, 3) as exch_main\t\t\t\n\t\tfrom crsp.&freq.sf as a \n\t\tleft join crsp.&freq.senames as b\n\t\t   on a.permno=b.permno and a.date>=namedt and a.date<=b.nameendt\n\t\tleft join crsp.ccmxpf_lnkhist as c\n\t\t   on a.permno=c.lpermno and (a.date>=c.linkdt or missing(c.linkdt)) and \n\t\t   (a.date<=c.linkenddt or missing(c.linkenddt)) and c.linktype in ('LC', 'LU', 'LS');\n\tquit;\n\t\n\t/* Adjust trading volume following Gao and Ritter (2010) */\n\tproc sql;\n\t\tupdate __crsp_sf1\n\t\tset vol = \n\t\t\tcase \n\t\t\t\twhen date < '01FEB2001'd then vol / 2\n\t\t\t\twhen date <= '31DEC2001'd then vol / 1.8\n\t\t\t\twhen date < '31DEC2003'd then vol / 1.6\n\t\t\t\telse vol\n\t\t\tend\n\t\twhere exchcd = 3;\n\tquit;\n\t\n\t/* Add dividend and dollar volume */\n\tproc sort data=__crsp_sf1; by permno date; run;\n\t\n\tdata __crsp_sf2;\n\t\tset __crsp_sf1;\n\t\tby permno;\n\t\tdolvol = abs(prc) * vol;\n\t\tdiv_tot = (ret-retx)*lag(prc)*(cfacshr/lag(cfacshr)); /* The CFACSHR part is to put it on the per-share basis of the current date */\n\t\tif first.permno then\n\t\t\tdiv_tot=.;\n\trun;\n\t\n\t/* Incorporate Delisting Returns */\n\t%if &freq.=d %then %do;\n\t\tproc sql;\n\t\t\tcreate table __crsp_sf3 as\n\t\t\tselect a.*, b.dlret, b.dlstcd\n\t\t\tfrom __crsp_sf2 as a left join crsp.&freq.sedelist as b\n\t\t\ton a.permno=b.permno and a.date=b.dlstdt;\n\t\tquit;\n\t%end;\n\t%if &freq.=m %then %do;\n\t\tproc sql;\n\t\t\tcreate table __crsp_sf3 as\n\t\t\tselect a.*, b.dlret, b.dlstcd\n\t\t\tfrom __crsp_sf2 as a left join crsp.&freq.sedelist as b\n\t\t\ton a.permno=b.permno and year(a.date)=year(b.dlstdt) and 
month(a.date)=month(b.dlstdt);\n\t\tquit;\n\t%end;\n\t\n\tdata __crsp_sf4; \n\t\tset __crsp_sf3; \n\t\tif missing(dlret) and (dlstcd=500 or (520<=dlstcd<=584)) then dlret=-0.3; /* If the delisting return is missing and the delisting is performance related, set it to -30%. This is relevant to 155 observations only */\n\t\tif missing(ret) and not missing(dlret) then ret=0;\n\t\tret= (1+ret)*(1+coalesce(dlret, 0))-1; /* If dlret is missing, treat it as zero */\n\t\tdrop dlret dlstcd;\n\trun;\n\t\n\t* Excess Return;\n\t%if &freq.=d %then %let scale=21;\n\t%if &freq.=m %then %let scale=1;\n\t\n\tproc sql;\n\t\tcreate table __crsp_sf5 as \n\t\tselect a.*, a.ret-coalesce(b.t30ret, c.rf)/&scale. as ret_exc /* I prefer crsp.mcti but FF has monthly updates */\n\t\tfrom __crsp_sf4 as a \n\t\tleft join crsp.mcti as b \n\t\t\ton year(a.date)=year(b.caldt) and month(a.date)=month(b.caldt)\n\t\tleft join ff.factors_monthly as c \n\t\t\ton year(a.date)=year(c.date) and month(a.date)=month(c.date);\n\tquit;\n\t\n\t* Company Market Equity;\n\tproc sql;\n\t\tcreate table __crsp_sf6 as \n\t\tselect *, sum(me) as me_company\n\t\tfrom __crsp_sf5\n\t\tgroup by permco, date;\n\tquit;\n\t\n\t* Make volume comparable across daily and monthly set:  https://wrds-web.wharton.upenn.edu/wrds/query_forms/variable_documentation.cfm?vendorCode=CRSP&libraryCode=crspa&fileCode=dsf&id=vol;\n\t%if &freq.=m %then %do;\n\t\tproc sql;\n\t\t\tupdate __crsp_sf6\n\t\t\tset vol = vol*100,\n\t\t\t    dolvol = dolvol*100;\n\t\tquit;\n\t%end;\n\t\n\tproc sort nodupkey data=__crsp_sf6; by permno date; run; /* In the monthly file there are two duplicates (15075-20180131 and 86812-20190731); in the daily file, 13 obs */\n\tdata crsp_&freq.sf; set __crsp_sf6; run;\n\t\n\tproc delete data= __crsp_sf1 __crsp_sf2 __crsp_sf3 __crsp_sf4 __crsp_sf5 __crsp_sf6; run;\n%mend prepare_crsp_sf;\n\n**********************************************************************************************************************\n*                                    World - Data From 
Compustat\n********************************************************************************************************************* ; \n%macro prepare_comp_sf(freq=); /* freq in (d, m,both) */\n\t/* SECD has a lot of missing CSHOC. Therefore we use information from Accounting Statements. This is not a problem for g_secd*/\n\t%let comp_cond = indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C';\n\tproc sql; \n\t\tcreate table __firm_shares1 as \n\t\tselect gvkey, datadate, cshoq as csho_fund, ajexq as ajex_fund \n\t\tfrom comp.fundq where &comp_cond. and not missing(cshoq) and not missing(ajexq) \n\t\touter union corr\n\t\tselect gvkey, datadate, csho as csho_fund, ajex as ajex_fund \n\t\tfrom comp.funda where &comp_cond. and not missing(csho) and not missing(ajex);\n\tquit;\n\t%populate_own(inset=__firm_shares1, outset=__firm_shares2, datevar=datadate, idvar=gvkey, datename=ddate, forward_max=12, period = 'day');\n\n\tproc sql;\n\t\tcreate table __comp_dsf_na as\n\t\tselect a.gvkey, a.iid, a.datadate, a.tpci, a.exchg, a.prcstd, a.curcdd, a.prccd as prc_local, a.ajexdi,\n\t\t\tcase when a.prcstd^=5 then a.prchd else . end as prc_high_lcl,  /* Highest price when prc is not the bid-ask average: https://wrds-web.wharton.upenn.edu/wrds/query_forms/variable_documentation.cfm?vendorCode=COMP&libraryCode=compd&fileCode=secd&id=prchd */\n\t\t\tcase when a.prcstd^=5 then a.prcld else . 
end as prc_low_lcl,   /* Lowest price when prc is not the bid-ask average: https://wrds-web.wharton.upenn.edu/wrds/support/Data/_001Manuals%20and%20Overviews/_001Compustat/_001North%20America%20-%20Global%20-%20Bank/_000dataguide/?file_name=Data/_001Manuals%20and%20Overviews/_001Compustat/_001North%20America%20-%20Global%20-%20Bank/_000dataguide */\n\t\t\tcshtrd, coalesce(a.cshoc/1e6, b.csho_fund*b.ajex_fund/a.ajexdi) as cshoc,  /* Prefer cshoc but if missing choose shares outstanding from accounting statement adjusted for issuance activity*/\n\t\t   \t(a.prccd/a.ajexdi*a.trfd) as ri_local,\n\t\t   \ta.curcddv, a.div, a.divd, a.divsp /* Dividend Variables */\n\t\tfrom comp.secd as a left join __firm_shares2 as b\n\t\ton a.gvkey=b.gvkey and a.datadate=b.ddate;\n\t\t\n\t\t/* Adjust trading volume of NASDAQ stocks following Gao and Ritter (2010)*/\n\t\tupdate __comp_dsf_na\n\t\tset cshtrd = \n\t\t\tcase \n\t\t\t\twhen datadate < '01FEB2001'd then cshtrd / 2\n\t\t\t\twhen datadate <= '31DEC2001'd then cshtrd / 1.8\n\t\t\t\twhen datadate < '31DEC2003'd then cshtrd / 1.6\n\t\t\t\telse cshtrd\n\t\t\tend\n\t\twhere exchg = 14;\n\t\t\n\t\tcreate table __comp_dsf_global as\n\t\tselect gvkey, iid, datadate, tpci, exchg, prcstd, curcdd, \n\t\t\tprccd/qunit as prc_local, ajexdi, cshoc/1e6 as cshoc,\n\t\t\tcase when prcstd^=5 then prchd/qunit else . end as prc_high_lcl,  /* Highest price when prc is not the bid-ask average: https://wrds-web.wharton.upenn.edu/wrds/query_forms/variable_documentation.cfm?vendorCode=COMP&libraryCode=compd&fileCode=secd&id=prchd */\n\t\t\tcase when prcstd^=5 then prcld/qunit else . 
end as prc_low_lcl,   /* Lowest price when prc is not the bid-ask average: https://wrds-web.wharton.upenn.edu/wrds/support/Data/_001Manuals%20and%20Overviews/_001Compustat/_001North%20America%20-%20Global%20-%20Bank/_000dataguide/?file_name=Data/_001Manuals%20and%20Overviews/_001Compustat/_001North%20America%20-%20Global%20-%20Bank/_000dataguide */\n\t\t\tcshtrd, ((calculated prc_local)/ajexdi*trfd) as ri_local,\n\t\t\tcurcddv, div, divd, divsp\n\t\tfrom comp.g_secd;\n\t\t\n\t\tcreate table __comp_dsf1 as \n\t\tselect * from __comp_dsf_na \n\t\touter union corr\n\t\tselect * from __comp_dsf_global;\n\tquit;\n\t\n\t/* Add FX */\n\t%compustat_fx(out=fx); \n\tproc sql;\n\t\tcreate table __comp_dsf2 as \n\t\tselect a.*, b.fx as fx, c.fx as fx_div\n\t\tfrom __comp_dsf1 as a\n\t\tleft join fx as b\n\t\t\ton a.curcdd=b.curcdd and a.datadate=b.date\n\t\tleft join fx as c\n\t\t\ton a.curcddv=c.curcdd and a.datadate=c.date;\n\tquit;\n\t\n\tdata __comp_dsf3; \n\t\tset __comp_dsf2; \n\t\t/* Price Adjustment */\n\t\tprc = prc_local*fx;\n\t\tprc_high = prc_high_lcl*fx;\n\t\tprc_low = prc_low_lcl*fx;\n\t\tme = prc*cshoc;\n\t\tdolvol = cshtrd*prc;\n\t\tri = ri_local*fx;\n\t\n\t\t/* Dividend Adjustment (set to zero if missing) */\n\t\tdiv_tot = coalesce(div, 0)*fx_div;\n\t\tdiv_cash = coalesce(divd, 0)*fx_div;\n\t\tdiv_spc = coalesce(divsp, 0)*fx_div;\n\t\t\n\t\tdrop div divd divsp fx_div curcddv prc_high_lcl prc_low_lcl;\n\trun; \n\t\n\t* Create Daily, Monthly or Both Datasets;\n\t%if &freq. = m or &freq. = d %then %let iter_max = 1;\n\t%if &freq. = both %then %let iter_max = 2;\n\t\n\t%do iter=1 %to &iter_max.;\n\t\t%if &freq. = m or &freq. = d %then %let freq_use = &freq.;  /* Necessary because of the case where both daily AND monthly datasets are created */\n\t\t%if &freq. = both and &iter. 
= 2 %then %let freq_use = m;\n\t\t\n\t\t%if &freq_use.=m %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __comp_msf1 as \n\t\t\t\tselect *, intnx('month', datadate,0,'E') as eom format=YYMMDDN8., max(max(prc_high/ajexdi), max(prc/ajexdi))*ajexdi as prc_highm, min(min(prc_low/ajexdi), min(prc/ajexdi)) * ajexdi as prc_lowm,\n\t\t\t\t\tsum(div_tot/ajexdi)*ajexdi as div_totm, sum(div_cash/ajexdi)*ajexdi as div_cashm, sum(div_spc/ajexdi)*ajexdi as div_spcm,\n\t\t\t\t\tsum(cshtrd/ajexdi)*ajexdi as cshtrm, sum(dolvol) as dolvolm\n\t\t\t\tfrom __comp_dsf3\n\t\t\t\tgroup by gvkey, iid, calculated eom;\n\t\t\t\t\n\t\t\t\tcreate table __comp_msf2 as \n\t\t\t\tselect *\n\t\t\t\tfrom __comp_msf1(drop= cshtrd div_tot div_cash div_spc dolvol prc_high prc_low)\n\t\t\t\twhere not missing(prc_local) and not missing(curcdd) and prcstd in (3, 4, 10) /* This is rather important when choosing last.eom later. Otherwise we sometimes have an end-of-month observation that is a dividend distribution rather than a trading day */\n\t\t\t\torder by gvkey, iid, eom, datadate;\n\t\t\tquit;\n\t\t\t\n\t\t\tdata __comp_msf2; \n\t\t\t\tset __comp_msf2; \n\t\t\t\trename \n\t\t\t\t\tdiv_totm=div_tot \n\t\t\t\t\tdiv_cashm=div_cash \n\t\t\t\t\tdiv_spcm=div_spc\n\t\t\t\t\tdolvolm=dolvol\n\t\t\t\t\tprc_highm=prc_high\n\t\t\t\t\tprc_lowm=prc_low;\n\t\t\trun;\n\t\t\t\n\t\t\t/* Choose Last observation in Month */\n\t\t\tdata __comp_msf3;\n\t\t\t\tset __comp_msf2;\n\t\t\t\tby gvkey iid eom;\n\t\t\t\tif last.eom;\n\t\t\trun;\n\t\t\t\n\t\t\t/* Add Information from SECM */\n\t\t\tproc sql;\n\t\t\t\tcreate table __comp_secm1 as \n\t\t\t\tselect a.gvkey, a.iid, a.datadate, intnx('month', a.datadate,0,'E') as eom format=YYMMDDN8.,\n\t\t\t\t\ta.tpci, a.exchg, a.curcdm as curcdd, a.prccm as prc_local, a.prchm as prc_high, a.prclm as prc_low, a.ajexm as ajexdi, \n\t\t\t\t\tcoalesce(a.cshom/1e6, a.csfsm/1e3, a.cshoq, b.csho_fund*b.ajex_fund/a.ajexm) as cshoc, /* Notice again that I impute with shares from accounting 
statements in case it is missing*/\n\t\t\t\t\ta.dvpsxm, a.cshtrm, a.curcddvm,\n\t\t\t\t\ta.prccm/a.ajexm*a.trfm as ri_local,  /*ri_local = local return index [1]*/\n\t\t\t\t\tc.fx as fx, d.fx as fx_div\n\t\t\t\tfrom comp.secm as a \n\t\t\t\tleft join __firm_shares2 as b\n\t\t\t\t\ton a.gvkey=b.gvkey and a.datadate=b.ddate\n\t\t\t\tleft join fx as c\n\t\t\t\t\ton a.curcdm=c.curcdd and a.datadate=c.date\n\t\t\t\tleft join fx as d\n\t\t\t\t\ton a.curcddvm=d.curcdd and a.datadate=d.date;\n\t\t\t\t\t\n\t\t\t\tupdate __comp_secm1\n\t\t\t\tset cshtrm = \n\t\t\t\t\tcase \n\t\t\t\t\t\twhen datadate < '01FEB2001'd then cshtrm / 2\n\t\t\t\t\t\twhen datadate <= '31DEC2001'd then cshtrm / 1.8\n\t\t\t\t\t\twhen datadate < '31DEC2003'd then cshtrm / 1.6\n\t\t\t\t\t\telse cshtrm\n\t\t\t\t\tend\n\t\t\t\twhere exchg = 14;\n\t\t\tquit;\n\t\t\t\n\t\t\tdata __comp_secm2; \n\t\t\t\tset __comp_secm1; \n\t\t\t\t/* Price Adjustment */\n\t\t\t\tif curcdd='USD' then fx=1; \n\t\t\t\tprc = prc_local*fx;\n\t\t\t\tprc_high = prc_high*fx;\n\t\t\t\tprc_low = prc_low*fx;\n\t\t\t\tme = prc*cshoc;\n\t\t\t\tdolvol = cshtrm*prc;\n\t\t\t\tri = ri_local*fx;\n\t\t\t\n\t\t\t\t/* Dividend Adjustment*/\n\t\t\t\tif curcddvm='USD' then fx_div=1;\n\t\t\t\tdiv_tot = dvpsxm*fx_div;\n\t\t\t\tdiv_cash = .;  /* Not available in SECM*/\n\t\t\t\tdiv_spc = .;  /* Not available in SECM*/\n\t\t\t\t\n\t\t\t\tdrop dvpsxm fx_div curcddvm;\n\t\t\trun;\n\t\t\t\n\t\t\t%let common_vars=gvkey, iid, datadate, eom, tpci, exchg, curcdd, prc_local, prc_high, prc_low,\n\t\t\t\tajexdi, cshoc, ri_local, fx, prc, me, cshtrm, dolvol, ri, div_tot, div_cash, div_spc;\n\t\t\tproc sql;\n\t\t\t\tcreate table __comp_msf4 as \n\t\t\t\tselect &common_vars., prcstd, 'secd' as source from __comp_msf3\n\t\t\t\tunion\n\t\t\t\tselect &common_vars., 10 as prcstd, 'secm' as source from __comp_secm2;  /* This ensures that all observations are kept in the __returns step. 
Erroneous prcstd codes have been screened out */\n\t\t\t\t\n\t\t\t\tcreate table __comp_msf5 as\n\t\t\t\tselect *\n\t\t\t\tfrom __comp_msf4\n\t\t\t\tgroup by gvkey, iid, eom\n\t\t\t\thaving count(*)=1 | (count(*)=2 and source='secd'); /* If a security has an observation in both SECD and SECM, prefer the observation from SECD */\n\t\t\tquit; \n\t\t\t\n\t\t\tproc sort nodupkey data=__comp_msf5(drop=source) out=__comp_msf6; by gvkey iid eom; run; /* DUPLICATES should always be 0! */\n\t\t\tproc delete data=__comp_msf1 __comp_msf2 __comp_msf3 __comp_msf4 __comp_msf5 __comp_secm1 __comp_secm2; run;\n\t\t\t\n\t\t\t%let base=__comp_msf6;\n\t\t\t%let period = 'month';\n\t\t\t%let out = comp_msf;\n\t\t%end;\n\t\t%if &freq_use.=d %then %do;\n\t\t\t%let base=__comp_dsf3;\n\t\t\t%let period = 'day';\n\t\t\t%let out = comp_dsf;\n\t\t%end;\n\t\t\n\t\t/* Compute Returns */\n\t\tproc sort nodupkey data= &base. out=__comp_sf1; by gvkey iid datadate; run; /* Very important to know if there are any duplicates!! */\n \n\t\tdata __returns;\n\t\t\tset __comp_sf1(where = (not missing(ri) and prcstd in (3, 4, 10))); /* The screen is important, see [1] */\n\t\t\tby gvkey iid;\n\t\t\tret = ri/lag(ri)-1;\n\t\t\tret_local = ri_local/lag(ri_local)-1;\n\t\t\tret_lag_dif = intck(&period., lag(datadate), datadate);\n\t\t\tif first.iid then do;  \n\t\t\t\tret=.;\n\t\t\t\tret_local=.;\n\t\t\t\tret_lag_dif=.;\n\t\t\tend;\n\t\t\t/* Handle situations where the currency code changes */\n\t\t\tif first.iid=0 and curcdd^=lag(curcdd) then do;\n\t\t\t\tret_local = ret;\n\t\t\tend;\n\t\t\tkeep gvkey iid datadate ret ret_local ret_lag_dif;\n\t\trun;\n\n\t\t/* Handling Delisting */\n\t\tdata __sec_info; set comp.security comp.g_security; run; /* Combine SECURITY and G_SECURITY */\n\t\t\n\t\tdata __delist1;\n\t\t\tset __returns(where=(not missing(ret_local) and ret_local^=0)); /* Take the delisting date to be the last day of trading with a non-missing and non-zero return [2] */\n\t\t\tby gvkey iid datadate;\n\t\t\tif 
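Both the CRSP and the Compustat branches incorporate delisting returns the same way: the final return is compounded with the delisting return, and a missing delisting return is imputed as -30% when the delisting is performance related (otherwise zero). A small Python sketch of that compounding rule (the function name and `performance_delist` flag are illustrative; `None` stands in for a SAS missing value):

```python
def delisting_adjusted_return(ret, dlret=None, performance_delist=False):
    """Compound the last return with the delisting return.
    A missing delisting return is imputed as -30% for
    performance-related delistings, otherwise as 0."""
    if dlret is None:
        dlret = -0.3 if performance_delist else 0.0
    if ret is None:          # missing return with a known delisting return
        ret = 0.0
    return (1 + ret) * (1 + dlret) - 1
```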
last.iid;\n\t\trun;\n\t\t\n\t\tproc sql;\n\t\t\tcreate table __delist2 as \n\t\t\tselect a.gvkey, a.iid, a.datadate, b.secstat, b.dlrsni\n\t\t\tfrom __delist1 as a left join __sec_info as b\n\t\t\ton a.gvkey=b.gvkey and a.iid=b.iid;\n\t\t\t\n\t\t\tcreate table __delist3 as \n\t\t\tselect gvkey, iid, datadate as date_delist,\n\t\t\t\tcase when dlrsni in ('02', '03') then -0.3 else 0 end as dlret  \n\t\t\tfrom __delist2\n\t\t\twhere secstat='I';\n\t\tquit;\n\t\t\n\t\t* Incorporate Delisting Return;\n\t\tproc sql;\n\t\t\tcreate table __comp_sf2 as\n\t\t\tselect a.*, b.ret, b.ret_local, b.ret_lag_dif, c.date_delist, c.dlret\n\t\t\tfrom &base as a \n\t\t\tleft join __returns as b\n\t\t\t\ton a.gvkey=b.gvkey and a.iid=b.iid and a.datadate=b.datadate\n\t\t\tleft join __delist3 as c\n\t\t\t\ton a.gvkey=c.gvkey and a.iid=c.iid;\n\t\tquit;\n\t\t\n\t\tdata __comp_sf3;\n\t\t\tset __comp_sf2;\n\t\t\twhere datadate<=date_delist or missing(date_delist); /* In a sample of 104,377 this removes 1,434 obs*/\n\t\t\tif datadate=date_delist then do;\n\t\t\t\tret = (1+ret)*(1+dlret)-1;\n\t\t\t\tret_local = (1+ret_local)*(1+dlret)-1;\n\t\t\tend;\n\t\t\tdrop ri ri_local date_delist dlret;\n\t\trun;\n\t\t\n\t\t/* Excess Return */\n\t\t%if &freq_use.=d %then %let scale=21;\n\t\t%if &freq_use.=m %then %let scale=1;\n\t\t\n\t\tproc sql;\n\t\t\tcreate table __comp_sf4 as \n\t\t\tselect a.*, a.ret-coalesce(b.t30ret, c.rf)/&scale. 
as ret_exc /* I prefer crsp.mcti but FF has monthly updates */\n\t\t\tfrom __comp_sf3 as a \n\t\t\tleft join crsp.mcti as b \n\t\t\t\ton year(a.datadate)=year(b.caldt) and month(a.datadate)=month(b.caldt)\n\t\t\tleft join ff.factors_monthly as c \n\t\t\t\ton year(a.datadate)=year(c.date) and month(a.datadate)=month(c.date);\n\t\tquit;\n\t\t\n\t\t/* Add Exchange Information */\n\t\t%comp_exchanges(out=__exchanges);\n\t\tproc sql;\n\t\t\tcreate table __comp_sf5 as\n\t\t\tselect a.*, b.excntry, b.exch_main \n\t\t\tfrom __comp_sf4 as a left join __exchanges as b\n\t\t\ton a.exchg=b.exchg;\n\t\tquit;\n\t\t\n\t\t/* Add Primary Security Indicator */\n\t\t%add_primary_sec(data=__comp_sf5, out=__comp_sf6, date_var=datadate);\n\t\t\n\t\t/* Output */\n\t\tproc sort nodupkey data=__comp_sf6 out=&out.; by gvkey iid datadate; run;\n\t%end;\n\t\n\t\n\t/* proc freq data=comp_dsf; tables ret_lag_dif; run; *Check: 97.2% of non-missing ret_lag_dif are <=3. 98.8% are <=4; 99.75% are <=10. This might be a reasonable general cutoff for setting returns to 0 */\n\tproc delete data=__firm_shares1 __firm_shares2 fx \n\t\t__comp_dsf_na __comp_dsf_global __comp_dsf1 __comp_dsf2 __comp_dsf3\n\t\t__returns __sec_info __delist1 __delist2 __delist3\n\t\t__comp_sf1 __comp_sf2 __comp_sf3 __comp_sf4 __comp_sf5 __comp_sf6 __exchanges &base.; run;\n%mend prepare_comp_sf;\n\n/* COMBINE CRSP AND COMPUSTAT WITH CRSP PREFERENCE */\n%macro combine_crsp_comp_sf(out_msf=, out_dsf=, crsp_msf=, comp_msf=, crsp_dsf=, comp_dsf=);\n\t/* Monthly Files */\n\tproc sql;\n\t\tcreate table __msf_world1 as\n\t\tselect permno as id, PERMNO as permno, PERMCO as permco, GVKEY as gvkey, iid, 'USA' as excntry length=3, exch_main, (shrcd in (10, 11, 12)) as common, 1 as primary_sec,\n\t\t\tbidask, shrcd as crsp_shrcd, exchcd as crsp_exchcd, '' as comp_tpci, . 
as comp_exchg,\n\t\t\t'USD' as curcd, 1 as fx, date, intnx('month',date,0,'E') as eom format=YYMMDDN8., \n\t\t   \tcfacshr as adjfct, shrout as shares, me, me_company, prc, prc as prc_local, prc_high, prc_low, dolvol, vol as tvol, \n\t\t   \tRET as ret, ret as ret_local, ret_exc, 1 as ret_lag_dif, div_tot, . as div_cash, . as div_spc, 1 as source_crsp\n\t\tfrom &crsp_msf.\n\t\touter union corr\n\t\tselect case\twhen prxmatch(\"/W/\", iid) then input(cats('3', gvkey, substr(iid, 1, 2)), 9.0)\n\t\t\t\t    when prxmatch(\"/C/\", iid) then input(cats('2', gvkey, substr(iid, 1, 2)), 9.0)\n\t\t\t\t    else input(cats('1', gvkey, substr(iid, 1, 2)), 9.0)                            /* IID characters can only take 3 possible valued: http://zeerovery.nl/blogfiles/Comp-IID.pdf*/ \n\t\t\tend as id, . as permno, . as permco, gvkey, iid, excntry, exch_main, (tpci='0') as common, primary_sec,\n\t\t\t(prcstd = 4) as bidask, . as crsp_shrcd, . as crsp_exchcd, tpci as comp_tpci, exchg as comp_exchg, \n\t\t   \tcurcdd as curcd, fx, datadate as date, eom, \n\t\t   \tajexdi as adjfct, cshoc as shares, me, me as me_company, prc, prc_local, prc_high, prc_low, dolvol, cshtrm as tvol,\n\t\t   \tret_local, ret, ret_exc, ret_lag_dif, div_tot, div_cash, div_spc, 0 as source_crsp  \n\t\tfrom &comp_msf.;\n\tquit;\t\n\t\n\t/* Add Excess Return over Month t+1*/\n\tproc sort data=__msf_world1\t; by id descending eom; run;\n\tdata __msf_world2; \n\t\tset __msf_world1;\n\t\tret_exc_lead1m = lag(ret_exc);\n\t\tif lag(id)^=id or lag(ret_lag_dif)^=1 then\n\t\t\tret_exc_lead1m = .;\n\trun;\n\t\n\t/* Daily Files */\n\tproc sql;\n\t\tcreate table __dsf_world1 as\n\t\tselect permno as id, 'USA' as excntry length=3, exch_main, (shrcd in (10, 11, 12)) as common, 1 as primary_sec, \n\t\t\tbidask, 'USD' as curcd, 1 as fx, DATE as date format=YYMMDDN8., intnx('month',DATE,0,'E') as eom format=YYMMDDN8., \n\t\t   \tcfacshr as adjfct, shrout as shares, me, dolvol, vol as tvol, prc, prc_high, prc_low,\n\t\t   
\tret as ret_local, RET as ret, ret_exc, 1 as ret_lag_dif, 1 as source_crsp /* More memory efficient with an integer instead of a character vector */ \n\t\tfrom &crsp_dsf.\n\t\touter union corr\n\t\tselect case when prxmatch(\"/W/\", iid) then input(cats('3', gvkey, substr(iid, 1, 2)), 9.0)\n\t\t\t\t    when prxmatch(\"/C/\", iid) then input(cats('2', gvkey, substr(iid, 1, 2)), 9.0)\n\t\t\t\t    else input(cats('1', gvkey, substr(iid, 1, 2)), 9.0)                            /* IID characters can only take 3 possible values: http://zeerovery.nl/blogfiles/Comp-IID.pdf */ \n\t\t\tend as id, excntry, exch_main, (tpci='0') as common, primary_sec, \n\t\t\t(prcstd = 4) as bidask, curcdd as curcd, fx, datadate as date, intnx('month',datadate,0,'E') as eom format=YYMMDDN8., \n\t\t   \tajexdi as adjfct, cshoc as shares, me, dolvol, cshtrd as tvol, prc, prc_high, prc_low,\n\t\t   \tret_local, ret, ret_exc, ret_lag_dif, 0 as source_crsp  \n\t\tfrom &comp_dsf.;\n\tquit;\t\n\t\n\t/* Choose the main observation based on monthly data */\n\t* If there are multiple observations for the same GVKEY-IID, choose CRSP as the main observation;\n\tproc sql;\n\t\tcreate table __obs_main as \n\t\tselect id, gvkey, iid, eom, (count(gvkey) in (0, 1) or (count(gvkey)>1 and source_crsp=1)) as obs_main\n\t\tfrom __msf_world2\n\t\tgroup by gvkey, iid, eom;\n\t\t\n\t\tcreate table __msf_world3 as \n\t\tselect a.*, b.obs_main\n\t\tfrom __msf_world2 as a left join __obs_main as b\n\t\ton a.id = b.id and a.eom = b.eom;\n\t\n\t\tcreate table __dsf_world2 as \n\t\tselect a.*, b.obs_main\n\t\tfrom __dsf_world1 as a left join __obs_main as b\n\t\ton a.id = b.id and a.eom = b.eom;\n\tquit;\n\t\n\tproc sort data=__msf_world3 out=&out_msf. nodupkey; by id eom; run;\n\tproc sort data=__dsf_world2 out=&out_dsf. 
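The `id` construction in the unions above packs each Compustat security key into a single 9-digit integer: a leading digit chosen from the iid suffix letter (3 if the iid contains 'W', 2 if it contains 'C', 1 otherwise), followed by the 6-digit gvkey and the first two digits of the iid. A Python sketch of the same encoding (the function name is illustrative):

```python
def comp_security_id(gvkey: str, iid: str) -> int:
    """Numeric id from gvkey/iid, mirroring the prxmatch logic:
    prefix 3 for an iid containing 'W', 2 for 'C', 1 otherwise,
    then the 6-digit gvkey and the 2-digit iid stem."""
    if "W" in iid:
        prefix = "3"
    elif "C" in iid:
        prefix = "2"
    else:
        prefix = "1"
    return int(prefix + gvkey + iid[:2])
```

Because the prefix encodes the iid type, ids from the three groups can never collide with each other or with CRSP permnos (which are at most 5-6 digits).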
nodupkey; by id date; run;\n\t\n\tproc delete data= __msf_world1 __msf_world2 __msf_world3 __dsf_world1 __dsf_world2 __obs_main; run; \n%mend combine_crsp_comp_sf;\n\n* MACRO: CLEAN_COMP_MSF\n\t- Remove obvious Compustat data errors by setting return to missing.\n\t- Currently only implemented for monthly Compustat file, should be expanded to daily file\n;\n%macro clean_comp_msf(data=);\n\tproc sql;\n\t\tupdate &data. \n\t\tset ret=., ret_local=., ret_exc=.\n\t\twhere gvkey = '002137' and iid = '01C' and eom in ('31DEC1983'd, '31JAN1984'd);\n\t\t\n\t\tupdate &data. \n\t\tset ret=., ret_local=., ret_exc=.\n\t\twhere gvkey = '013633' and iid = '01W' and eom in ('28FEB1995'd);\n\tquit;\n%mend;\n\n* MACRO: MARKET RETURNS\n\tIf wins_comp=1, need to supply wins_data as well.\n;\n%macro market_returns(out=, data=, freq=m, wins_comp=1, wins_data=, cap_data=); \n\t%if &freq.=d %then %do;\n\t\t%let dt_col = date;\n\t\t%let max_date_lag = 14;\n\t%end;\n\t%if &freq.=m %then %do;\n\t\t%let dt_col = eom;\n\t\t%let max_date_lag = 1;\n\t%end;\n\t/* Create Index Data */\n\t\n\tproc sql;\n\t    create table updated_data as\n\t    select a.*, b.nyse_p80\n\t    from &data. as a\n\t    left join &cap_data. as b\n\t    on a.eom = b.eom;\n\tquit;\n\t\n\tdata updated_data;\n\t    set updated_data;\n\t    if me <= nyse_p80 then me_cap = me;\n\t    else me_cap = nyse_p80;\n\t    drop nyse_p80;\n\trun;\n\n\n\t\n\t/* Create Index Data */\n\tproc sql;\n\t\tcreate table __common_stocks1 as\n\t\tselect distinct source_crsp, id, date, eom, excntry, obs_main, exch_main, primary_sec, common, ret_lag_dif, me, me_cap, dolvol, ret, ret_local, ret_exc\n\t\tfrom updated_data\n\t\torder by id, &dt_col.;\n\tquit;\n\t\n\tdata __common_stocks2;\n\t\tset __common_stocks1;\n\t\tby id;\n\t\tme_lag1 = lag(me);\n\t\tme_cap_lag1 = lag(me_cap);\n\t\tdolvol_lag1 = lag(dolvol);\n\t\tif first.id then do;\n\t\t\tme_lag1 = .;\n\t\t\tme_cap_lag1 = .;\n\t\t\tdolvol_lag1 = .;\n\t\tend;\n\trun;\n\t\n\t%if &wins_comp. 
= 1 %then %do;\n\t\t%if &freq.=m %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __common_stocks3 as\n\t\t\t\tselect a.*, b.ret_exc_0_1, b.ret_exc_99_9, b.ret_0_1, b.ret_99_9, b.ret_local_0_1, b.ret_local_99_9\n\t\t\t\tfrom __common_stocks2 as a \n\t\t\t\tleft join &wins_data. as b\n\t\t\t\ton a.eom=b.eom;\n\t\t\tquit;\n\t\t%end;\n\t\t%if &freq.=d %then %do;\n\t\t\tproc sql;\n\t\t\t\tcreate table __common_stocks3 as\n\t\t\t\tselect a.*, b.ret_exc_0_1, b.ret_exc_99_9, b.ret_0_1, b.ret_99_9, b.ret_local_0_1, b.ret_local_99_9\n\t\t\t\tfrom __common_stocks2 as a \n\t\t\t\tleft join &wins_data. as b\n\t\t\t\ton year(a.date)=b.year and month(a.date)=b.month;\n\t\t\tquit;\n\t\t%end;\n\t\tproc sql;\n\t\t\t* Winsorize returns;\n\t\t\tupdate __common_stocks3 \n\t\t\tset ret = ret_99_9\n\t\t\twhere ret > ret_99_9 and source_crsp = 0 and not missing(ret);\t\t\n\t\t\tupdate __common_stocks3 \n\t\t\tset ret = ret_0_1\n\t\t\twhere ret < ret_0_1 and source_crsp = 0 and not missing(ret);\n\t\t\t* Winsorize local returns;\n\t\t\tupdate __common_stocks3 \n\t\t\tset ret_local = ret_local_99_9\n\t\t\twhere ret_local > ret_local_99_9 and source_crsp = 0 and not missing(ret_local);\t\t\n\t\t\tupdate __common_stocks3 \n\t\t\tset ret_local = ret_local_0_1\n\t\t\twhere ret_local < ret_local_0_1 and source_crsp = 0 and not missing(ret_local);\n\t\t\t* Winsorize excess returns;\n\t\t\tupdate __common_stocks3 \n\t\t\tset ret_exc = ret_exc_99_9\n\t\t\twhere ret_exc > ret_exc_99_9 and source_crsp = 0 and not missing(ret_exc);\t\t\n\t\t\tupdate __common_stocks3 \n\t\t\tset ret_exc = ret_exc_0_1\n\t\t\twhere ret_exc < ret_exc_0_1 and source_crsp = 0 and not missing(ret_exc);\n\t\t\t\n\t\t\talter table __common_stocks3\n\t\t\tdrop ret_exc_0_1, ret_exc_99_9, ret_0_1, ret_99_9, ret_local_0_1, ret_local_99_9;\n\t\tquit;\n\t\t\n\t%end;\n\t%if &wins_comp. 
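The winsorization step above clamps Compustat (`source_crsp = 0`) returns to the period's 0.1% and 99.9% percentile bounds supplied in `wins_data`, leaving missing returns untouched. The clamping rule itself reduces to (a sketch; the function name is illustrative and `None` stands in for a SAS missing value):

```python
def winsorize(x, lo, hi):
    """Clamp a return to the [lo, hi] winsorization bounds;
    missing values pass through unchanged."""
    if x is None:
        return None
    return min(max(x, lo), hi)
```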
= 0 %then %do;\n\t\tdata __common_stocks3; set __common_stocks2; run;\n\t%end;\n\tproc sql;\n\t\tcreate table mkt1 as\n\t\tselect excntry, &dt_col., \n\t\t\tcount(*) as stocks, \n\t\t\tsum(me_lag1) as me_lag1,\n\t\t\tsum(me_cap_lag1) as me_cap_lag1,\n\t\t\tsum(dolvol_lag1) as dolvol_lag1,\n\t\t\tsum(ret_local*me_lag1)/(calculated me_lag1) as mkt_vw_lcl,\n\t\t\tsum(ret_local*me_cap_lag1)/(calculated me_cap_lag1) as mkt_vw_cap_lcl,\n\t\t\tmean(ret_local) as mkt_ew_lcl,\n\t\t\tsum(ret*me_lag1)/(calculated me_lag1) as mkt_vw,\n\t\t\tsum(ret*me_cap_lag1)/(calculated me_cap_lag1) as mkt_vw_cap,\n\t\t\tmean(ret) as mkt_ew,\n\t\t\tsum(ret_exc*me_lag1)/(calculated me_lag1) as mkt_vw_exc,\n\t\t\tsum(ret_exc*me_cap_lag1)/(calculated me_cap_lag1) as mkt_vw_cap_exc,\n\t\t\tmean(ret_exc) as mkt_ew_exc\n\t\tfrom __common_stocks3\n\t\twhere obs_main = 1 and exch_main = 1 and primary_sec = 1 and common = 1 and ret_lag_dif <= &max_date_lag. and not missing(me_lag1) and not missing(ret_local) \n\t\tgroup by excntry, &dt_col.;\n\tquit;\n\t%if &freq.=m %then %do;\n\t\tdata &out.; set mkt1; run;\n\t%end;\n\t%if &freq.=d %then %do;\n\t\tproc sql;\n\t\t\tcreate table &out. as\n\t\t\tselect *\n\t\t\tfrom mkt1\n\t\t\tgroup by excntry, year(date), month(date)\n\t\t\thaving stocks / max(stocks) >= 0.25; /* With less than 25% of stocks trading, it's likely that the date is not an official trading date */\n\t\tquit;\n\t%end;\n\t\n\tproc delete data= __common_stocks1 __common_stocks2 __common_stocks3 mkt1; run; \n%mend;\n\n* MACRO: AP_FACTORS\n- This macro creates the factors from the 3-factor model of Fama and French (1993)\n  as well as the factors from the 4-factor model of Hou, Xue and Zhang (2015).\n  Factors other than the market and small-minus-big factors are created using an unconditional double sort on size and the underlying characteristic,\n  following the methodology of Fama and French (1993). 
Breakpoints are based on all non-microcap stocks within a country.\n  Arguments \n  \t- out: Name of output dataset\n  \t- freq: In (d, m) i.e. either daily or monthly\n  \t- sf: Dataset of &freq. stock returns\n  \t- mchars: Dataset of characteristics with monthly frequency\n  \t- mkt: Dataset with market returns. This is also used to identify trading days.\n  \t- min_stocks_bp: Minimum number of stocks used to create breakpoints. \n  \t- min_stocks_pf: Minimum number of stocks available at the beginning of a month to create a valid portfolio.\n;\n%macro ap_factors(out=, freq=, sf=, mchars=, mkt=, min_stocks_bp=, min_stocks_pf=);\n\t%if &freq.=d %then %do;\n\t\t/* Daily return data */\n\t\tproc sql;\n\t\t\tcreate table world_sf1 as \n\t\t\tselect excntry, id, date, eom, ret_exc\n\t\t\tfrom &sf.\n\t\t\twhere ret_lag_dif <= 5 and not missing(ret_exc); * Impose a maximum lag of 5 days between return observations;\n\t\tquit;\n\t\t%winsorize_own(inset=world_sf1, outset=world_sf2, sortvar=eom, vars=ret_exc, perc_low=0.1, perc_high=99.9); /* Winsorize returns at 0.1% (note: I winsorize by month) */\n\t\t%let __date_col = date;\n\t%end;\n\t%if &freq.=m %then %do;\n\t\t/* Monthly return data */\n\t\tproc sql;\n\t\t\tcreate table world_sf1 as \n\t\t\tselect excntry, id, eom, ret_exc\n\t\t\tfrom &sf.\n\t\t\twhere ret_lag_dif = 1 and not missing(ret_exc); \n\t\tquit;\n\t\t%winsorize_own(inset=world_sf1, outset=world_sf2, sortvar=eom, vars=ret_exc, perc_low=0.1, perc_high=99.9); /* Winsorize returns at 0.1% */\n\t\t%let __date_col = eom;\n\t%end;\n\t\n\t/* Sorting Variables */\n\tproc sql;\n\t\tcreate table base1 as \n\t\tselect id, eom, size_grp, excntry, me, market_equity, be_me, at_gr1, niq_be,\n\t\t\tsource_crsp, exch_main, obs_main, common, comp_exchg, crsp_exchcd, primary_sec, ret_lag_dif\n\t\tfrom &mchars.;\n\tquit;\n\t\n\tproc sort data=base1 out=base2; by id eom; run;\n\t\n\t%macro temp();\n\t/* Lag variables used at portfolio rebalancing */\n\t%let cols_lag = 
comp_exchg crsp_exchcd exch_main obs_main common primary_sec excntry size_grp me be_me at_gr1 niq_be;\n\tdata base3; \n\t\tset base2;\n\t\tby id eom;\n\t\t%do i=1 %to %nwords(&cols_lag.); \n\t\t\t%let col = %scan(&cols_lag., &i, %str(' '));\n\t\t\t&col._l = lag(&col.);\n\t\t\tif id ^= lag(id) or source_crsp ^= lag(source_crsp) or intck(\"month\", lag(eom), eom)^=1 then\n\t\t\t\t&col._l = .;\n\t\t\tdrop &col.;\n\t\t%end;\n\trun;\n\t%mend;\n\t%temp();\n\t\n\t/* Screens */\n\tproc sql;\n\t\tcreate table base4 as\n\t\tselect *,  case \n\t\t\t\t\twhen missing(size_grp_l) then ''\n\t\t\t\t\twhen size_grp_l in ('large', 'mega') then 'big'\n\t\t\t\t\telse 'small'\n\t\t\t\tend as size_pf\n\t\tfrom base3\n\t\twhere obs_main_l = 1 and exch_main_l = 1 and common_l = 1 and primary_sec_l = 1 and ret_lag_dif = 1 and not missing(me_l)\n\t\torder by excntry_l, size_grp_l, eom;\n\tquit;\n\t\n\t/* Factors: Three-by-two sort on var and size Fama-French Style*/\n\t%macro sort_ff_style(out=, char=, freq=, min_stocks_bp=, min_stocks_pf=);\n\t\t* Breakpoints (based on NYSE stocks in the US and non-microcap stocks outside of the US);\n\t\tproc sql;\n\t\t\tcreate table bp_stocks as\n\t\t\tselect *\n\t\t\tfrom base4\n\t\t\twhere ((size_grp_l in ('small', 'large', 'mega') and excntry_l ^= 'USA') or ((crsp_exchcd_l=1 or comp_exchg_l=11) and excntry_l = 'USA')) and not missing(&char._l)\n\t\t\torder by eom, excntry_l;\n\t\tquit;\n\t\t\n\t\tproc means data=bp_stocks noprint;\n\t\t\tby eom excntry_l;\n\t\t\tvar &char._l;\n\t\t\toutput out=bps(drop=_type_ _freq_) N=n P30 = bp_p30 P70=bp_p70;\n\t\trun;\n\t\t\n\t\t* Create weights by end of month;\n\t\tproc sql;\n\t\t\tcreate table weights1 as \n\t\t\tselect a.excntry_l, a.id, a.eom, a.size_pf, a.me_l,\n\t\t\t\tcase \n\t\t\t\t\twhen a.&char._l >= b.bp_p70 then 'high'\n\t\t\t\t\twhen a.&char._l >= b.bp_p30 then 'mid'\n\t\t\t\t\telse 'low'\n\t\t\t\tend as char_pf\n\t\t\tfrom base4 as a left join bps as b\n\t\t\ton a.excntry_l = b.excntry_l and a.eom = 
b.eom\n\t\t\twhere b.n >= &min_stocks_bp. and not missing(a.&char._l) and size_pf^='';\n\t\t\t\n\t\t\tcreate table weights2 as \n\t\t\tselect *, me_l / sum(me_l) as w\n\t\t\tfrom weights1\n\t\t\tgroup by excntry_l, size_pf, char_pf, eom\n\t\t\thaving count(*) >= &min_stocks_pf.;\n\t\tquit;\n\t\t\n\t\n\t\t* Match with return data;\n\t\tproc sql;\n\t\t\tcreate table returns as \n\t\t\tselect a.*, b.w, b.size_pf, b.char_pf\n\t\t\tfrom world_sf2 as a inner join weights2 as b\n\t\t\ton a.id = b.id and a.eom=b.eom and a.excntry=b.excntry_l;\n\t\tquit;\n\t\t\n\t\t* Create portfolio returns;\n\t\tproc sql;\n\t\t\tcreate table pfs1 as \n\t\t\tselect \"&char.\" as characteristic, excntry, size_pf, char_pf, &__date_col., sum(ret_exc*w) as ret_exc\n\t\t\tfrom returns\n\t\t\tgroup by excntry, size_pf, char_pf, &__date_col.;\n\t\tquit;\n\t\t\n\t\tproc sort data=pfs1 out = pfs2; by characteristic excntry &__date_col.; run;\n\t\tproc transpose data=pfs2 delimiter=_ out=pfs3(drop=_name_);\n\t\t\tby characteristic excntry &__date_col.;\n\t\t\tvar ret_exc;\n\t\t\tid size_pf char_pf;\n\t\trun;\n\t\t\n\t\tdata &out.;\n\t\t\tset pfs3;\n\t\t\tlms = (small_high + big_high) / 2 - (small_low + big_low) / 2;\n\t\t\tsmb = (small_high + small_mid + small_low) / 3 - (big_high + big_mid + big_low) / 3;\n\t\t\tkeep characteristic excntry &__date_col. 
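The `lms` and `smb` formulas in the data step above are the standard Fama-French combinations of the six portfolios from the 2x3 size/characteristic sort: long-minus-short is the average of the two high-characteristic portfolios minus the average of the two low ones, and small-minus-big is the average of the three small portfolios minus the average of the three big ones. A Python sketch (the dictionary keys mirror the transposed SAS column names; the function name is illustrative):

```python
def ff_factors(pf: dict) -> tuple:
    """Long-minus-short (lms) and small-minus-big (smb) factor
    returns from six size/characteristic portfolio returns."""
    lms = (pf["small_high"] + pf["big_high"]) / 2 \
        - (pf["small_low"] + pf["big_low"]) / 2
    smb = (pf["small_high"] + pf["small_mid"] + pf["small_low"]) / 3 \
        - (pf["big_high"] + pf["big_mid"] + pf["big_low"]) / 3
    return lms, smb
```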
lms smb;\n\t\trun;\n\t%mend;\n\t\n\t/* Create Individual Factors */\n\t%sort_ff_style(out=book_to_market, char=be_me, min_stocks_bp = &min_stocks_bp., min_stocks_pf = &min_stocks_pf.);\n\t%sort_ff_style(out=asset_growth, char=at_gr1, min_stocks_bp = &min_stocks_bp., min_stocks_pf = &min_stocks_pf.);\n\t%sort_ff_style(out=roeq, char=niq_be, min_stocks_bp = &min_stocks_bp., min_stocks_pf = &min_stocks_pf.);\n\t\n\t/* Fama and French (1993) */\n\tdata ff;\n\t\tset book_to_market;\n\t\trename lms = hml;\n\t\trename smb = smb_ff;\n\trun;\n\t\n\t/* Hou, Xue and Zhang (2015) */\n\tdata hxz1;\n\t\tset asset_growth roeq;\n\trun;\n\t\n\tproc transpose data=hxz1 out=hxz2;\n\t\tby characteristic excntry &__date_col.;\n\t\tvar lms smb;\n\trun;\n\t\n\tproc sort data=hxz2 out = hxz3; by excntry &__date_col.; run;\n\tproc transpose data=hxz3 delimiter=_ out=hxz4(drop=_name_);\n\t\tby excntry &__date_col.;\n\t\tvar col1;\n\t\tid characteristic _name_;\n\trun;\n\t\n\tdata hxz;\n\t\tset hxz4;\n\t\trename niq_be_lms = roe;\n\t\tsmb_hxz = (at_gr1_smb + niq_be_smb) / 2;\n\t\tinv = -at_gr1_lms;\n\trun;\n\t\n\t/* Factor Dataset */\n\tproc sql;\n\t\tcreate table &out. as \n\t\tselect a.excntry, a.&__date_col., a.mkt_vw_exc as mktrf, b.hml, b.smb_ff, c.roe, c.inv, c.smb_hxz\n\t\tfrom &mkt. as a \n\t\tleft join ff as b on a.excntry = b.excntry and a.&__date_col. = b.&__date_col.\n\t\tleft join hxz as c on a.excntry = c.excntry and a.&__date_col. = c.&__date_col.;\n\tquit;\n%mend;\n\n* MACRO: FILE DELETE;\n%macro file_delete(file);\n  %let rc= %sysfunc(filename(fref,&file));\n  %let rc= %sysfunc(fdelete(&fref));\n%mend;\n\n* MACRO: SAVE_MAIN_DATA_CSV\n- The macro saves the main data as separate .csv files by country. 
\n- By main data we mean data for common stocks that are the primary security of the underlying firm with non-missing return and lagged market equity \n  Arguments\n\t- out: Name of the output Zip file (will be saved in &path.)\n\t- data: should be the path to the main SAS dataset\n\t- path: path where data is stored. Should be a scratch directory\n;\n%macro save_main_data_csv(out=, data=, path=, end_date=);\n\t* Lagged me data;\n\tdata main_data1;\n\t\tset &data.;\n\t\tme_lag1 = lag(me);\n\t\tif id ^= lag(id) or intck(\"month\", lag(eom), eom)^=1 then\n\t\t\tme_lag1 = .;\n\trun;\n\t\n\t* Reorder Variables;\n\tdata main_data2;\n\t\tretain id date eom source_crsp size_grp obs_main exch_main primary_sec gvkey iid permno permco excntry curcd fx \n\t\t\tcommon comp_tpci crsp_shrcd comp_exchg crsp_exchcd\n\t\t\tadjfct shares me me_lag1; \n\t\tset main_data1;\n\trun;\n\t\n\t* Screen;\n\tproc sql;\n\t\tcreate table main_data3 as\n\t\tselect * \n\t\tfrom main_data2\n\t\twhere primary_sec = 1 and common = 1 and obs_main = 1 and exch_main = 1 and eom <= &end_date.; /*removed not missing(me_lag1) and ret_lag_dif = 1 and not missing(ret_exc)*/\n\tquit;\n\t\n\tproc sql noprint;\n\t\tselect distinct lowcase(excntry) into :countries separated by ' '\n\t\tfrom main_data3;\n\tquit;\n\t\n\t/* Create country .csv files */\n\t%do i=1 %to %nwords(&countries.);\n\t\t%let __c = %scan(&countries., &i., %str(' ')); /* Assign before use in %put below */\n\t\t%put ################ \"&path./&__c..csv\" ########################;\n\t\toption nonotes;\n\t\tproc export data=main_data3(where=(excntry = upcase(\"&__c.\")))\n\t\t    outfile=\"&path./&__c..csv\"   \n\t\t    dbms=CSV\n\t\t    replace;\n\t\trun;\n\t\toption notes;\n\t%end;\n\t\n\t\n\t* Zip file for easier download;\n\tods package (newzip) open nopf;\n\t%do i=1 %to %nwords(&countries.);\n\t\t%let __c = %scan(&countries., &i., %str(' '));\n\t\tods package (newzip) add file=\"&path./&__c..csv\";\n\t%end;\n\tods package (newzip) publish archive \n\t\tproperties 
(\n\t\t\tarchive_name=\"&out..zip\" \n\t\t\tarchive_path= \"&path.\"\n\t\t);\n\tods package(newzip) close;\n\t\n\t/* Delete intermediate .csv files */\n\t%do i=1 %to %nwords(&countries.);\n\t\t%let __c = %scan(&countries., &i., %str(' '));\n\t\t%file_delete(file=&path./&__c..csv);\n\t%end;\n\tproc delete data = main_data1 main_data2 main_data3; run;\n%mend;\n\n* MACRO: SAVE_DAILY_RET_CSV\n- The macro saves the daily return data as a separate .csv file country-by-country. \n  Arguments\n\t- out: Name of the output Zip file (will be saved in &path.)\n\t- data: should be the path to world_dsf\n\t- path: path where data is stored. Should be a scratch directory\n\t- end_date: restricts the output to dates up to and including end_date\n;\n%macro save_daily_ret_csv(out=, data=, path=, end_date=);\n\tdata daily; \n\t\tset &data.;\n\t\twhere date <= &end_date.;\n\t\tkeep excntry id date source_crsp me ret ret_exc; \n\trun;\n\tproc sql noprint;\n\t\tselect distinct lowcase(excntry) into :countries separated by ' '\n\t\tfrom daily;\n\tquit;\n\t/* Create country .csv files */\n\toption nonotes;\n\t%do i=1 %to %nwords(&countries.);\n\t\t%let __c = %scan(&countries., &i., %str(' '));\n\t\t%put ################ \"&path./&__c..csv\" ########################;\n\t\tproc export data=daily(where=(excntry = upcase(\"&__c.\")))\n\t\t    outfile=\"&path./&__c..csv\"   \n\t\t    dbms=CSV\n\t\t    replace;\n\t\trun;\n\t%end;\n\toption notes;\n\t\n\t* Zip file for easier download;\n\tods package (newzip) open nopf;\n\t%do i=1 %to %nwords(&countries.);\n\t\t%let __c = %scan(&countries., &i., %str(' '));\n\t\tods package (newzip) add file=\"&path./&__c..csv\";\n\t%end;\n\tods package (newzip) publish archive \n\t\tproperties (\n\t\t\tarchive_name=\"&out..zip\" \n\t\t\tarchive_path= \"&path.\"\n\t\t);\n\tods package(newzip) close;\n\t\n\t/* Delete intermediate .csv files */\n\t%do i=1 %to %nwords(&countries.);\n\t\t%let __c = %scan(&countries., &i., %str(' '));\n\t\t%file_delete(file=&path./&__c..csv);\n\t%end;\n\tproc delete data = daily; run;\n%mend;\n\n* MACRO: SAVE_MONTHLY_RET_CSV\n- The macro saves the monthly return data as a .csv across all countries. \n  Arguments\n\t- out: Name of the output Zip file (will be saved in &path.)\n\t- data: should be the path to world_msf\n\t- path: path where data is stored. Should be a scratch directory\n\t- end_date: restricts the output to dates up to and including end_date\n;\n%macro save_monthly_ret_csv(out=, data=, path=, end_date=);\n\tdata monthly; \n\t\tset &data.;\n\t\twhere eom <= &end_date.;\n\t\tkeep excntry id source_crsp eom me ret_exc ret ret_local; \n\trun;\n\tproc export data=monthly\n\t    outfile=\"&path./world_ret_monthly.csv\"   \n\t    dbms=CSV\n\t    replace;\n\trun;\n\t* Zip file for easier download;\n\tods package (newzip) open nopf;\n\tods package (newzip) add file=\"&path./world_ret_monthly.csv\";\n\tods package (newzip) publish archive \n\t\tproperties (\n\t\t\tarchive_name=\"&out..zip\" \n\t\t\tarchive_path= \"&path.\"\n\t\t);\n\tods package(newzip) close;\n\t* Delete intermediate .csv files;\n\t%file_delete(file=&path./world_ret_monthly.csv);\n\tproc delete data = monthly; run;\n%mend;\n\n\n/* Footnotes */*\n[1]: This screen ensures that the return is always computed between two days with a price. The ret_day_dif variable lets the user check that it is within a reasonable range. \n\t Compustat has some observations where prc and curcdd are missing while div and curcddiv are not missing. In other words, dividend ex-dates occurring outside trading days \n\t should not be used to compute returns. Further, the screen on prcstd ensures that returns are computed with non-stale prices. \n\t Specifically, prcstd in (3, 4, 10) ensures that the price is taken from observable market data. The main excluded prcstd is 5, which is \"No prices available, last actual price was carried forward\".\n\t This is prevalent in the G_SECD, where it accounts for 14.5% of the observations. 
\n[2]: Compustat also includes a security inactivation date called 'dldtei'. In a subsample of 300 inactive securities,\n\t the difference in months between the date I take to be the delisting date and dldtei is 0 for 67% and 1 for 14%.\n\t The rest of the observations are scattered from -432 to 317. I think this validates my choice, but it is not completely clear.\n;"
  },
  {
    "path": "README.md",
    "content": "## Overview\nThis repository contains the code used for the paper [Is There a Replication Crisis in Finance?](https://onlinelibrary.wiley.com/doi/10.1111/jofi.13249) by Jensen, Kelly and Pedersen (2023). Please cite this paper if you are using the code or data:\n```\n@article{JensenKellyPedersen2023,\n\tauthor = {Jensen, Theis Ingerslev and Kelly, Bryan and Pedersen, Lasse Heje},\n\ttitle = {Is There a Replication Crisis in Finance?},\n\tjournal = {The Journal of Finance},\n\tvolume = {78},\n\tnumber = {5},\n\tpages = {2465-2518},\n\tyear = {2023}\n}\n```\nFollow this [link](https://www.dropbox.com/sh/61j1v0sieq9z210/AACdJ68fs5_eT_eJMunwMBWia?dl=0) for detailed documentation of the data sets.\n\nThe code consists of the following two self-contained components:\n\n- [GlobalFactors](https://github.com/bkelly-lab/ReplicationCrisis/tree/master/GlobalFactors) is a folder with code that creates data sets of global stock returns, firm characteristics, and global long-short factors. __Note that the data can be downloaded without running the code__. The global stock returns and firm characteristics can be downloaded from WRDS ([link](https://wrds-www.wharton.upenn.edu/pages/get-data/contributed-data-forms/global-factor-data/)). The long-short factor returns used in the paper can be downloaded [here](https://www.dropbox.com/sh/wcrjok1qyxtrasi/AABZ90GDCUvIzDzijt8Qoo3ha?dl=0). In addition, we keep a folder with the latest versions of the factor returns and additional data such as the underlying portfolios, market returns, and industry returns ([link](https://www.dropbox.com/sh/xq278bryrj0qf9s/AABUTvTGok91kakyL07LKyQoa?dl=0)).\n\n- [Analysis](https://github.com/bkelly-lab/ReplicationCrisis/tree/master/Analysis) is a folder that contains the analysis in the paper, including all figures and tables. 
This folder takes the global factors as input (either the ones that can be downloaded or the ones that you construct yourself). \n\nSee also the website [https://JKPfactors.com/](https://JKPfactors.com/), where the most recent long-short factors can be downloaded using a simple drop-down menu. \n\n"
  }
]