## 1 India 2 0 0 13 0
## 2 Palestine 1 0 0 4 0
## 3 Vietnam 1 0 0 1 0
```
As you can see Japan has never lost or drawn against India, Palestine, or Vietnam so in the data there wouldn't have been any rows with "Loss" in the results column. With the function I created I was able to impute results that didn't exist and fill them in with 0s!
Let's check Japan's performance against our main rivals in the Asian Cup. Here I make the tables look a lot nicer with the options in the `kable` and `kableExtra` packages.
```r
results_jp_asia %>%
japan_versus(opponent %in% c("Iran", "Korea Republic", "Saudi Arabia"),
tournament == "AFC Asian Cup") %>%
knitr::kable(format = "html",
caption = "Japan vs. Historic Rivals in the Asian Cup") %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::add_header_above(c(" ", "Result" = 3, "Goals" = 2))
```
Japan vs. Historic Rivals in the Asian Cup
|
Result |
Goals |
| opponent |
Win |
Draw |
Loss |
Goals For |
Goals Against |
| Iran |
1 |
2 |
0 |
1 |
0 |
| Korea Republic |
0 |
2 |
1 |
2 |
4 |
| Saudi Arabia |
4 |
0 |
1 |
13 |
4 |
Now let's take a look at how Japan have historically played against the other teams in Group F of this year's Asian Cup.
```r
results_jp_asia %>%
japan_versus(opponent %in% c("Oman", "Uzbekistan", "Turkmenistan")) %>%
knitr::kable(format = "html",
caption = "Japan's Record vs. Group F Teams") %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::add_header_above(c(" ", "Result" = 3, "Goals" = 2))
```
Japan's Record vs. Group F Teams
|
Result |
Goals |
| opponent |
Win |
Draw |
Loss |
Goals For |
Goals Against |
| Oman |
8 |
3 |
0 |
19 |
4 |
| Uzbekistan |
6 |
3 |
1 |
28 |
9 |
We see no rows here for Turkmenistan. This is due to the fact that until just this past week Japan had **never** played against them in a friendly or competitive game!
# Conclusion
Although Japan's first game was quite horrible I'm hoping it'll wake the players and coaches out of their complacency and not underestimate our opponents in the next two games.
Japan
South Korea and Iran
thankfully south korea should be on the other side of the bracket and we would also only meet Iran in the semifinals (provided both teams finish top of their respective groups)
Japan could meet Australia in the Quarters but without Aaron Mooy they're a much weaker side as shown in their abject loss to Jordan in their opening match.
even with losing new star Nakajima, the fact that we can replace him with a player of the calibre of Takashi Inui and Hannover regular, Genki Haraguchi, stepping up from the bench shows how much Japanese football has progressed these past 25 years.
It's a changing of the guard for Japan but we've got quality players in Europe as well as some depth too with more young Japanese players headed to Europe from a young age
It was quite awe-inspiring seeing how the number of Japanese players playing for foreign clubs have been steadily increasing since the 1988 Asian Cup squad. Maybe that could be another idea for a visualization?
this tournament should be a first stepping stone for this new generation of players to make a big impact for the next world cup in 2022 so keep your eye out for this bunch of players!
================================================
FILE: Bundesliga 2018-2019/player_goal_contribution_matrix.Rmd
================================================
---
title: "Bundesliga"
author: "RN7"
date: "5/24/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# pkgs
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage, ggforce,
rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
## add_logo
```{r}
add_logo <- function(plot_path, logo_path, logo_position, logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
```
# Bundesliga
## webscrape soccerway
```{r}
url <- "https://us.soccerway.com/national/germany/bundesliga/20182019/regular-season/r47657/"
session <- bow(url)
team_links <- scrape(session) %>%
html_nodes("#page_competition_1_block_competition_tables_7_block_competition_league_table_1_table .large-link a") %>%
html_attr("href")
team_links_df <- team_links %>%
enframe(name = NULL) %>%
separate(value, c(NA, NA, NA, "team_name", "team_num"), sep = "/") %>%
mutate(link = glue("
https://us.soccerway.com/teams/germany/{team_name}/{team_num}/squad/"),
stat_link = glue("{link %>% str_replace('squad', 'statistics')}"))
# for each team link:
player_name_info <- function(session) {
player_name_info <- scrape(session) %>%
html_nodes("#page_team_1_block_team_squad_3-table .name.large-link") %>%
html_text()
}
num_goals_info <- function(session) {
num_goals_info <- scrape(session) %>%
html_nodes(".goals") %>%
html_text()
num_goals_info_clean <- num_goals_info[-1]
}
num_assists_info <- function(session) {
num_assists_info <- scrape(session) %>%
html_nodes(".assists") %>%
html_text()
num_assists_info_clean <- num_assists_info[-1]
}
team_goals_info <- function(session) {
team_goals_info <- scrape(session) %>%
html_nodes("tr.first:nth-child(6) > td:nth-child(2)") %>%
html_text()
}
# BIG FUNCTION
bundesliga_stats_info <- function(link, statlink) {
session <- bow(link)
session2 <- bow(statlink)
player_name <- player_name_info(session = session)
num_goals <- num_goals_info(session = session)
num_assists <- num_assists_info(session = session)
team_goals <- team_goals_info(session = session2)
resultados <- list(player_name, num_goals, num_assists, team_goals)
col_names <- c("name", "goals", "assists", "team_goals")
bundesliga_stats <- resultados %>%
reduce(cbind) %>%
as_tibble() %>%
set_names(col_names)
}
```
### all at once
```{r}
# ALL 18 TEAMS AT ONCE, WILL TAKE A WHILE:
bundesliga_goal_contribution_df_ALL <- map2(.x = team_links_df$link,
.y = team_links_df$stat_link,
~ bundesliga_stats_info(link = .x, statlink = .y))
bundesliga_goal_contribution_df <- bundesliga_goal_contribution_df_ALL %>%
set_names(team_links_df$team_name) %>%
bind_rows(.id = "team_name")
## save
saveRDS(bundesliga_goal_contribution_df, file = glue("{here::here()}/data/bundesliga_goal_contrib_df_soccerway.RDS"))
```
## clean
```{r}
bundesliga_goal_contribution_clean_df <- bundesliga_goal_contribution_df %>%
mutate_at(.vars = c("goals", "assists"),
~str_replace(., "-", "0") %>% as.numeric) %>%
mutate(team = team_name %>% str_replace_all(., "-", " ") %>% str_to_title,
total_goals = as.numeric(team_goals)) %>%
group_by(team) %>%
mutate(total_assists = sum(assists),
goal_contrib = goals/total_goals,
assist_contrib = assists/total_goals) %>%
ungroup() %>%
select(-team_name, -team_goals)
## save
saveRDS(bundesliga_goal_contribution_clean_df,
file = glue("{here::here()}/data/bundesliga_goal_contrib_clean_df.RDS"))
bundesliga_goal_contribution_clean_df <- readRDS(file = glue("{here::here()}/data/bundesliga_goal_contrib_clean_df.RDS"))
```
## plot
```{r fig.width = 10, fig.height = 8}
## Description text
desc_hazard <- "Hazard FC: With 16 goals and 15 assists Eden Hazard has been involved in the most goals for a team this season."
desc_vardymurray <- "Scoring 37.5% and 37.1% of their team's goals, Jamie Vardy and Glen Murray have proven to be talismans for their team yet again!"
desc_fraser <- "Another fantastic season from Ryan Fraser with 7 goals and 14 assists (one behind league-leader Hazard)"
## PLOT!
bundesliga_goal_contribution_clean_df %>%
ggplot(aes(assist_contrib, goal_contrib)) +
geom_point(data = bundesliga_goal_contribution_clean_df %>%
filter(goal_contrib < 0.225 | assist_contrib < 0.125),
color = "grey20", size = 4, alpha = 0.2) +
geom_point(data = bundesliga_goal_contribution_clean_df %>%
filter(goal_contrib > 0.225 | assist_contrib > 0.125),
color = "red", size = 4) +
geom_hline(yintercept = 0.225, color = "grey20", alpha = 0.4) +
geom_vline(xintercept = 0.125, color = "grey20", alpha = 0.4) +
geom_text_repel(data = bundesliga_goal_contribution_clean_df %>%
filter(goal_contrib > 0.225 | assist_contrib > 0.125),
aes(label = name, family = "Roboto Condensed", fontface = "bold"),
seed = 15, size = 5,
min.segment.length = 0, segment.color = "red",
point.padding = 0.5) +
# geom_mark_circle(aes(filter = name == "E. Hazard", label = "Eden Hazard",
# description = desc_hazard),
# label.family = "Roboto Condensed", label.fontsize = c(14, 10)) +
# geom_mark_hull(aes(filter = name %in% c("G. Murray", "J. Vardy"), label = "Vardy & Murray",
# description = desc_vardymurray),
# label.buffer = unit(20, "mm"), label.fontsize = c(14, 10),
# label.family = "Roboto Condensed") +
# geom_mark_circle(aes(filter = name == "R. Fraser", label = "Ryan Fraser",
# description = desc_fraser),
# label.buffer = unit(9.8, "mm"), label.fontsize = c(14, 10),
# label.family = "Roboto Condensed") +
scale_x_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3),
limits = c(0, 0.225)) +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5),
limits = c(0, 0.4)) +
labs(title = "Goal Contribution Matrix: Bundesliga (2018-2019 Season)",
subtitle = "Team Goal Involvement as Percentage of Total Club Goals and/or Assists.",
caption = glue("
Data: soccerway.com
By: @R_by_Ryo"),
x = "Percentage of Club Goals Assisted",
y = "Percentage of Club Goals Scored") +
theme_minimal() +
theme(text = element_text(family = "Roboto Condensed"),
title = element_text(size = 18),
plot.subtitle = element_text(size = 16),
plot.caption = element_text(size = 10),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.minor.x = element_blank()) -> bundesliga_goal_contribution_matrix
bundesliga_goal_contribution_matrix
```
## save
```{r}
ggsave(plot = bundesliga_goal_contribution_matrix,
"../Bundesliga 2018-2019/output/goal_contribution_matrix_plot_bundesliga.png",
height = 9, width = 11)
```
```{r}
plot_logo <- add_logo(
plot_path = "../Bundesliga 2018-2019/output/goal_contribution_matrix_plot_bundesliga.png",
logo_path = "https://upload.wikimedia.org/wikipedia/en/d/df/Bundesliga_logo_%282017%29.svg",
logo_position = "top right",
logo_scale = 13)
plot_logo
```
```{r}
image_write(image = plot_logo,
"../Bundesliga 2018-2019/output/goal_contribution_matrix_plot_logo_bundesliga.png")
```
================================================
FILE: Bundesliga 2019-2020/buli_age_utility.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "5/15/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# age-utility
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage, rvest,
glue, extrafont, showtext, ggrepel, magick,
ggforce, ggtext, lubridate,
cowplot, patchwork, rlang)
loadfonts(quiet = TRUE)
```
## team links function
```{r}
get_team_links <- function(season) {
url <- "https://www.transfermarkt.com/bundesliga/startseite/wettbewerb/L1/plus/?saison_id={season}"
session <- bow(url)
team_links <- scrape(session) %>%
html_nodes("#yw1 > table > tbody > tr > td.zentriert.no-border-rechts > a") %>%
html_attr("href")
# season <- enquo(season)
team_links_df <- team_links %>%
enframe(name = NULL) %>%
separate(value, c(NA, "team_name", NA, NA, "team_num", NA, NA), sep = "/") %>%
mutate(link = glue("https://www.transfermarkt.com/{team_name}/leistungsdaten/verein/{team_num}/reldata/%26{season}/plus/1"),
season = season)
}
```
```{r}
team_links_df <- get_team_links(season = 2019)
glimpse(team_links_df)
```
## scrape function
```{r}
buli_age_utility_info <- function(team_name, team_num, season, total_minutes) {
## get age + minutes played
session <- bow(glue::glue("https://www.transfermarkt.com/{team_name}/leistungsdaten/verein/{team_num}/plus/1?reldata=L1%26{season}/"))
# grab name from photo element instead
result_name <- scrape(session) %>%
html_nodes("#yw1 .bilderrahmen-fixed") %>%
html_attr("title")
# grab age
result_age <- scrape(session) %>%
html_nodes(".posrela+ .zentriert") %>%
html_text()
# grab minutes played in league
result_mins <- scrape(session) %>%
html_nodes("td.rechts") %>%
html_text()
## get length
session <- bow(glue::glue("https://www.transfermarkt.com/{team_name}/kader/verein/{team_num}/saison_id/{season}/plus/1"))
result_name2 <- scrape(session) %>%
html_nodes("#yw1 .bilderrahmen-fixed") %>%
html_attr("title")
result_bday <- scrape(session) %>%
html_nodes(".posrela+ .zentriert") %>%
html_text()
result_joinedteam <- scrape(session) %>%
html_nodes("td:nth-child(7)") %>%
html_text()
result_leaveteam <- scrape(session) %>%
html_nodes("td:nth-child(9)") %>%
html_text()
# place each vector into list
resultados <- list(result_name, result_age, result_mins)
col_name <- c("name", "age", "minutes")
results_comb <- resultados %>%
reduce(cbind) %>%
as_tibble() %>%
set_names(col_name)
## join + bday
resultados2 <- list(result_name2, result_bday,
result_joinedteam, result_leaveteam)
col_name2 <- c("name", "bday", "join", "leave")
results_comb2 <- resultados2 %>%
reduce(cbind) %>%
as_tibble() %>%
set_names(col_name2)
## combine BOTH
results_comb <- results_comb %>%
left_join(results_comb2)
# fix "strings" into proper formats, calculate % of minutes appeared
all_team_minutes <- results_comb %>%
mutate(age = as.numeric(age),
minutes = minutes %>%
str_replace("\\.", "") %>%
str_replace("'", "") %>%
as.numeric(),
min_perc = (minutes / total_minutes) %>% round(digits = 3),
bday = str_replace_all(bday, "\\(.*\\)", "") %>% mdy(),
join = join %>% mdy(),
join_age = interval(bday, join) / years(1),
age_now = interval(bday, Sys.Date()) / years(1)) %>%
filter(!is.na(minutes)) %>%
separate(name, into = c("first_name", "last_name"),
sep = " ", fill = "left") %>%
mutate(
fname = if_else(!is.na(first_name),
str_extract(first_name, "[A-Z]{1}"),
""),
player = case_when(
!is.na(first_name) ~ glue("{fname}. {last_name}"),
TRUE ~ last_name),
team_name = team_name,
team_name = case_when(
team_name == "fc-bayern-munchen" ~ "Bayern München",
team_name == "rasenballsport-leipzig" ~ "RB Leipzig",
team_name == "fc-schalke-04" ~ "Schalke 04",
team_name == "hertha-bsc" ~ "Hertha Berlin",
team_name == "1-fc-union-berlin" ~ "Union Berlin",
team_name == "fortuna-dusseldorf" ~ "Fortuna Düsseldorf",
team_name == "bayer-04-leverkusen" ~ "Bayer Leverkusen",
team_name == "sv-werder-bremen" ~ "Werder Bremen",
team_name == "borussia-monchengladbach" ~ "Borussia Mönchengladbach",
team_name == "tsg-1899-hoffenheim" ~ "Hoffenheim",
TRUE ~ str_replace_all(team_name, "-", " ") %>% stringr::str_to_title()),
season = season) %>%
arrange(desc(min_perc))
return(all_team_minutes)
}
```
```{r}
# ALL TEAMS AT ONCE, WILL TAKE A WHILE:
age_utility_df_ALL <- pmap(list(team_links_df$team_name,
team_links_df$team_num,
team_links_df$season,
2160),
.f = buli_age_utility_info)
buli_age_utility_df <- age_utility_df_ALL %>%
reduce(rbind)
## save
saveRDS(buli_age_utility_df, file = glue("{here::here()}/data/buli_age_utility_df_MD24_1920.RDS"))
```
## all team data
```{r}
buli_age_utility_df <- readRDS(file = here::here(glue::glue("data/buli_age_utility_df_MD24_1920.RDS")))
```
## bvb
```{r fig.width=15, fig.height=8}
bvb_df <- buli_age_utility_df %>%
filter(team_name == "Borussia Dortmund") %>%
mutate(player = case_when(player == "V. van" ~ "V. Van Dijk",
TRUE ~ as.character(player)),
join_age = case_when(player == "J. Weigl" ~ 18.9,
first_name == "Paco" ~ 23.9,
TRUE ~ join_age),
age_now = case_when(player == "J. Weigl" ~ 24.3,
first_name == "Paco" ~ 26.4,
TRUE ~ age_now))
combined_plot <- age_utility_plotter(df = bvb_df,
team_name = "Borussia Dortmund",
j_name = "Borussia Dortmund",
color_1 = "#D3D3D3", color_2 = "#beb500", #firebrick1 #009782
img_path = here::here("img/BorussiaDortmund.png"))
combined_plot
```
```{r}
ggsave(plot = combined_plot, "../Bundesliga 2019-2020/output/bvb_age_utility_22820.png",
height = 8, width = 15)
```
## rbl
```{r fig.width=15, fig.height=8}
rbl_df <- buli_age_utility_df %>%
filter(team_name == "RB Leipzig") %>%
mutate(player = case_when(player == "V. van" ~ "V. Van Dijk",
TRUE ~ as.character(player)),
join_age = case_when(player == "S. Ilsanker" ~ 26,
player == "M. Cunha" ~ 19,
player == "D. Demme" ~ 22,
TRUE ~ join_age),
age_now = case_when(player == "S. Ilsanker" ~ 30,
player == "M. Cunha" ~ 20,
player == "D. Demme" ~ 28,
TRUE ~ age_now))
combined_plot <- age_utility_plotter(df = rbl_df,
team_name = "RB Leipzig",
j_name = "RB Leipzig",
color_1 = "#001f47", color_2 = "#dd0741", #firebrick1 #009782
img_path = here::here("img/RBLeipzig.png"))
combined_plot
```
```{r}
ggsave(plot = combined_plot, "../Bundesliga 2019-2020/output/rbl_age_utility_22820.png",
height = 8, width = 15)
```
## Bayern Munich
```{r fig.width=15, fig.height=8}
bayern_df <- buli_age_utility_df %>%
filter(team_name == "Bayern München") %>%
mutate(player = case_when(player == "NA. Odriozola" ~ "A. Odriozola",
TRUE ~ as.character(player)),
join_age = case_when(player == "J. Zirkzee" ~ 18,
player == "M. Cunha" ~ 19,
player == "D. Demme" ~ 22,
TRUE ~ join_age))
combined_plot <- age_utility_plotter(df = bayern_df,
team_name = "Bayern München",
j_name = "Bayern Munich",
color_1 = "#0066b2", color_2 = "#dc052d", #firebrick1 #009782
img_path = here::here("img/Bayern.png"))
combined_plot
```
```{r}
ggsave(plot = combined_plot, "../Bundesliga 2019-2020/output/BayernMunich_age_utility_3120.png",
height = 8, width = 15)
```
================================================
FILE: Bundesliga 2019-2020/buli_dribbling_1920_hinrunde.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "1/16/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# pkgs
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage,
ggforce, ggtext,
rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
## add_logo
```{r}
add_logo <- function(plot_path, logo_path, logo_position,
logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
```
# Bundesliga data
- Save "Player Standard Stats" & "Squad Standard Stats" as a `.csv` file or whatever you prefer.
- https://fbref.com/en/comps/20/passing/Bundesliga-Stats
### player stats
```{r}
buli_player_dribbling_stats_hinrunde_raw <- read_csv(here::here("data/buli_player_dribbling_stats_hinrunde.csv"),
skip = 1, col_names = TRUE)
buli_player_dribbling_stats_hinrunde <- buli_player_dribbling_stats_hinrunde_raw %>%
select(-Matches, -Rk) %>%
select(Player, Pos, Squad, `90s`,
dribble_success = Succ,
dribble_attempt = Att,
dribble_perc = `Succ%`,
dribble_past = `#Pl`,
nutmegs = Megs,
dribble_tackled = Tkl,
dribble_against = Att_1,
tackle_success = `Tkl%`,
dribble_against_success = Past)
glimpse(buli_player_dribbling_stats_hinrunde)
```
```{r}
buli_player_dribbling_hinrunde_clean <- buli_player_dribbling_stats_hinrunde %>%
separate(Player, into = c("fullname", "allname"), sep = "\\\\") %>% separate(fullname, into = c("firstname", "lastname"),
sep = "\\s", extra = "merge",
remove = FALSE) %>%
mutate(fname = str_extract(firstname, "[A-Z]{1}")) %>%
## players like Fabinho listed without Tavares last name
mutate(player = if_else(is.na(lastname), firstname, lastname)) %>%
mutate(player = case_when(
!is.na(lastname) ~ glue("{fname}. {lastname}"),
TRUE ~ firstname)) %>%
mutate(
min = `90s` * 90,
dribble_successPer90 = (dribble_success / min) * 90,
dribble_attemptPer90 = (dribble_attempt / min) * 90,
dribble_pastPer90 = (dribble_past / min) * 90,
nutmegsPer90 = (nutmegs / min) * 90,
dribble_tackledPer90 = (dribble_tackled / min) * 90,
dribble_againstPer90 = (dribble_against / min) * 90,
dribble_against_successPer90 = (dribble_against_success / min) * 90) %>%
## keep fullname so can differentiate A. Gomes (Andre/Angel), etc.
select(-`90s`, -firstname, -lastname, -allname,
team_name = Squad, -fname, contains("dribble"),
nutmegs, nutmegsPer90)
glimpse(buli_player_dribbling_hinrunde_clean)
```
```{r}
## save
saveRDS(buli_player_dribbling_hinrunde_clean,
file = glue("{here::here()}/data/buli_player_dribbling_hinrunde_clean.RDS"))
buli_player_dribbling_hinrunde_clean <- readRDS(
file = glue("{here::here()}/data/buli_player_dribbling_hinrunde_clean.RDS"))
```
# plot
```{r}
buli_player_dribbling_hinrunde_clean %>%
filter(min >= 900,
dribble_attemptPer90 > 2) %>%
summarize(avg_dribblesper90 = median(dribble_attemptPer90),
avg_dribble_successper90 = median(dribble_successPer90),
avg_dribble_rate = median(dribble_perc))
buli_player_dribbling_hinrunde_clean %>%
filter(min >= 900,
dribble_attemptPer90 > 2,
Pos %in% c("FW", "FWMF")) %>%
summarize(avg_dribblesper90 = median(dribble_attemptPer90),
avg_dribble_successper90 = median(dribble_successPer90),
avg_dribble_rate = median(dribble_perc))
```
- when filtering for FW/Attacking MFs, F3 falls and PPA rises - fairly self-explanatory and reasonable
```{r}
bad_box <- data.frame(
xmin = -Inf, xmax = 3.36,
ymin = -Inf, ymax = 60.45)
chance_creation_box <- data.frame(
xmin = -Inf, xmax = 3.36,
ymin = 60.45, ymax = Inf)
midfield_progress_box <- data.frame(
xmin = 3.36, xmax = Inf,
ymin = -Inf, ymax = 60.45)
dual_box <- data.frame(
xmin = 3.36, xmax = Inf,
ymin = 60.45, ymax = Inf)
```
```{r}
buli_dribble_df <- buli_player_dribbling_hinrunde_clean %>%
filter(min >= 900,
dribble_attemptPer90 > 2) %>%
mutate(fillcol = case_when(
dribble_attemptPer90 < 3.36 & dribble_perc < 60.45 ~ "red",
dribble_attemptPer90 > 3.36 & dribble_perc < 60.45 ~ "orange",
dribble_attemptPer90 < 3.36 & dribble_perc > 60.45 ~ "yellow",
dribble_attemptPer90 > 3.36 & dribble_perc > 60.45 ~ "green",
TRUE ~ NA_character_))
```
```{r}
point_cols <- c("yellow" = "yellow", "red" = "red",
"orange" = "orange", "green" = "#228B22")
```
```{r}
bayern_desc <- "Both Kimmich & Alaba have been playing much more centrally this season (Center Midfield & Center Back respectively) the former only behind teammate Coutinho and Brandt in Passes into the Penalty Area per 90 with the latter leading the league with 9.83 Passes into the Final 3rd per 90. Thiago keeps things ticking in midfield with a 90.2% Pass Accuracy."
brandt_desc <- "Julian Brandt has become the primary ball progressor in this Dortmund side with his 2.99 PPA per 90 leading the league by a considerable margin. A lot of this comes from his throughballs (0.68 Through Balls per 90, 1st in the league). Along with his 7th and 4th ranking (within BVB) in Final Third Passes per 90 & xA per 90 respectively, it shows that he provides the incisive ball in between the build-up & the final pass. Indeed, his 1.45 KP per 90 is 4th in the team behind Hazard, Sancho, & Hakimi."
```
```{r fig.width = 14, fig.height = 10}
buli_dribbling_hinrunde_plot <- ggplot(
buli_dribble_df,
aes(x = dribble_attemptPer90, y = dribble_perc)) +
## area fills
geom_rect(data = chance_creation_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "yellow", alpha = 0.1) +
geom_rect(data = bad_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "red", alpha = 0.1) +
geom_rect(data = midfield_progress_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "orange", alpha = 0.2) +
geom_rect(data = dual_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "green", alpha = 0.1) +
## median reference lines
geom_hline(yintercept = 60.45, color = "grey20", alpha = 0.4) +
geom_vline(xintercept = 3.35, color = "grey20", alpha = 0.4) +
## player data
geom_point(aes(color = factor(fillcol)), size = 3) +
scale_color_manual(values = point_cols) +
## league average
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 7.35, y = 61, hjust = 0, color = "grey20",
label = "Average OP Passes into the Final Third per 90: 2.5") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 3.35, y = 105, hjust = 0, color = "grey20",
label = "Average OP Passes into Penalty Area per 90: 0.48") +
## area labels
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 0.25, y = 3.8,
hjust = 0, color = "#CCCC00", size = 6,
label = "Good Chance Creation") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 0.2,
hjust = 0, color = "orange", size = 6,
label = "Good Midfield Progression") +
annotate(
"text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 3.77,
hjust = 0, color = "#228B22", size = 6,
label = "Good Chance Creation\nGood Midfield Progression") +
## player labels
geom_text_repel(
data = buli_dribble_df %>%
filter(dribble_attemptPer90 > 2 | dribble_perc > 60.45,
!player %in% c("J. Sancho", "M. Rashica",
"A. Davies", "T. Alcántara")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold",
segment.color = factor(fillcol)),
min.segment.length = 0.3, seed = 15, size = 3.5,
#segment.color = "red",
point.padding = 0.6,
color = "grey20") +
# ## Bayern description
geom_mark_hull(
aes(filter = player == "J. Sancho",
label = "Jaden Sancho: Dribble Koenig",
description = "bayern_desc"),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
con.colour = "orange", color = "orange",
label.width = unit(70, 'mm'), label.buffer = unit(10, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#cce5cc") +
# ## Brandt description #7fbf7f #b2d8b2
geom_mark_hull(
aes(filter = player == "A. Davies",
label = "Alphonso Davies: Canada's ",
description = "brandt_desc"),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
con.colour = "#228B22", color = "#228B22",
label.width = unit(50, 'mm'), label.buffer = unit(60, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#cce5cc") +
# ## Brandt description #7fbf7f #b2d8b2
geom_mark_hull(
aes(filter = player == "T. Alcántara",
label = "Alcántara: Efficient Dribbler.",
description = "brandt_desc"),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
label.width = unit(90, 'mm'), label.buffer = unit(5, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#cce5cc") +
# ## Bayern description
geom_mark_hull(
aes(filter = player == "M. Rashica",
label = "Milot Rashica: Dribble Koenig",
description = "bayern_desc"),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
label.width = unit(70, 'mm'), label.buffer = unit(30, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#cce5cc") +
## Bayern player labels
# geom_text_repel(
# data = buli_player_dribbling_hinrunde_clean %>%
# filter(min >= 900,
# player %in% c("J. Kimmich",
# "D. Alaba",
# "T. Alcántara",
# "P. Coutinho")),
# aes(label = player, family = "Roboto Condensed",
# fontface = "bold"),
# seed = 15, size = 4.5, color = "red",
# min.segment.length = 0, segment.color = "red",
# point.padding = 0.5, nudge_x = 0.6) +
## scales
scale_x_continuous(labels = seq(0, 10, 0.5),
breaks = seq(0, 10, 0.5),
limits = c(1.25, 10),
expand = c(0.01, 0)) +
scale_y_continuous(labels = seq(0, 100, 10),
breaks = seq(0, 100, 10),
limits = c(30, 110),
expand = c(0.01, 0)) +
labs(title = "Dribble Volume vs. Dribble Success: Bundesliga (2019-2020)",
subtitle = glue("
Hinrunde | January 16th, 2020
Average (Median) | Minimum 900 Minutes Played | More than 2 dribble per 90"),
caption = glue("
Data: FBref | StatsBomb
Ryo Nakagawara, Twitter: @R_by_Ryo"),
x = "Dribbles Attempted per 90",
y = "Dribble Success Rate (%)") +
theme_minimal() +
theme(text = element_markdown(family = "Roboto Condensed"),
plot.title = element_markdown(size = 20),
plot.subtitle = element_markdown(size = 16),
plot.caption = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
buli_dribbling_hinrunde_plot
```
## save
```{r}
ggsave(plot = buli_dribbling_hinrunde_plot,
here::here("Bundesliga 2019-2020/output/buli_dribbling_1920_hinrunde.png"),
height = 10, width = 14)
```
```{r}
plot_logo <- add_logo(
plot_path = here::here("Bundesliga 2019-2020/output/buli_dribbling_1920_hinrunde.png"),
logo_path = here::here("../soccer_match_reports/img/Bundesliga_logo_(2017).svg"),
logo_position = "top right",
logo_scale = 18)
plot_logo
```
```{r}
image_write(image = plot_logo,
here::here("Bundesliga 2019-2020/output/buli_dribbling_1920_hinrunde_logo.png"))
```
================================================
FILE: Bundesliga 2019-2020/buli_goalkeepers_1920_hinrunde.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "1/18/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# pkgs
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage,
ggforce, ggtext,
rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
## add_logo
```{r}
add_logo <- function(plot_path, logo_path, logo_position,
logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
```
# Bundesliga data
- Save "Player Standard Stats" & "Squad Standard Stats" as a `.csv` file or whatever you prefer.
- https://fbref.com/en/comps/20/passing/Bundesliga-Stats
## advanced stats
```{r}
buli_player_goalkeeping_stats_hinrunde_raw <- read_csv(here::here("data/buli_player_goalkeeping_stats_hinrunde.csv"),
skip = 1, col_names = TRUE) %>%
select(-Matches, -Rk)
buli_player_goalkeeping_stats_hinrunde <- buli_player_goalkeeping_stats_hinrunde_raw %>%
rename_at(vars(15:17),
~ glue::glue("launch_{colnames(buli_player_goalkeeping_stats_hinrunde_raw)[15:17]}")) %>%
rename_at(vars(18:21),
~ glue::glue("pass_{colnames(buli_player_goalkeeping_stats_hinrunde_raw)[18:21]}")) %>%
rename_at(vars(22:24),
~ glue::glue("goalkick_{colnames(buli_player_goalkeeping_stats_hinrunde_raw)[22:24]}")) %>%
rename_at(vars(25:27),
~ glue::glue("crosses_{colnames(buli_player_goalkeeping_stats_hinrunde_raw)[25:27]}"))
buli_player_goalkeeping_stats_hinrunde <- buli_player_goalkeeping_stats_hinrunde %>%
rename_all(~str_replace_all(colnames(buli_player_goalkeeping_stats_hinrunde), "_[0-9]", "")) %>%
rename(PSxGper90 = `/90`)
glimpse(buli_player_goalkeeping_stats_hinrunde)
```
## regular stats
```{r}
buli_player_regular_goalkeeping_stats_hinrunde_raw <- read_csv(here::here("data/buli_player_regular_goalkeeping_stats_hinrunde.csv"), skip = 1, col_names = TRUE) %>%
select(-Matches, -Rk)
buli_player_regular_goalkeeping_stats_hinrunde <- buli_player_regular_goalkeeping_stats_hinrunde_raw %>%
separate(Player, into = c("fullname", "allname"),
sep = "\\\\") %>%
separate(fullname, into = c("firstname", "lastname"),
sep = "\\s", extra = "merge", remove = FALSE) %>%
mutate(fname = str_extract(firstname, "[A-Z]{1}")) %>%
## players like Fabinho listed without Tavares last name
mutate(player = if_else(is.na(lastname), firstname, lastname)) %>%
mutate(player = case_when(
!is.na(lastname) ~ glue("{fname}. {lastname}"),
TRUE ~ firstname)) %>%
select(-firstname, -fname, -lastname, -allname)
glimpse(buli_player_regular_goalkeeping_stats_hinrunde)
```
## clean
```{r}
buli_player_goalkeeping_hinrunde_clean <- buli_player_goalkeeping_stats_hinrunde %>%
separate(Player, into = c("fullname", "allname"), sep = "\\\\") %>% separate(fullname, into = c("firstname", "lastname"),
sep = "\\s", extra = "merge",
remove = FALSE) %>%
mutate(fname = str_extract(firstname, "[A-Z]{1}")) %>%
## players like Fabinho listed without Tavares last name
mutate(player = if_else(is.na(lastname), firstname, lastname)) %>%
mutate(player = case_when(
!is.na(lastname) ~ glue("{fname}. {lastname}"),
TRUE ~ firstname)) %>%
left_join(buli_player_regular_goalkeeping_stats_hinrunde) %>%
mutate(
min = `90s` * 90,
GSAA = PSxG - GA,
GSAAper90 = (GSAA / min) * 90,
`xSave%` = (SoTA - PSxG) / SoTA,
GSAA_perc = `Save%` - `xSave%`,
minPerSave = min / Saves) %>%
## GSAA %: (SoTA - PSxG) / SoTA
rename(team_name = Squad) %>%
## keep fullname so can differentiate A. Gomes (Andre/Angel), etc.
select(-`90s`, -firstname, -fname, -lastname, -allname)
glimpse(buli_player_goalkeeping_hinrunde_clean)
```
```{r}
## save
saveRDS(buli_player_goalkeeping_hinrunde_clean,
file = glue("{here::here()}/data/buli_player_goalkeeping_hinrunde_clean.RDS"))
buli_player_goalkeeping_hinrunde_clean <- readRDS(
file = glue("{here::here()}/data/buli_player_goalkeeping_hinrunde_clean.RDS"))
```
# plot
```{r}
buli_player_goalkeeping_hinrunde_clean %>%
filter(min >= 900) %>%
summarize(avg_psxgper90 = median(PSxGper90),
avg_GA = median(GA),
avg_GSAA = median(GSAA),
avg_SV_perc = median(`Save%`),
avg_xSV_perc = median(`xSave%`),
avg_minperSave = median(minPerSave),
avg_GSAA_perc = median(GSAA_perc))
buli_player_goalkeeping_hinrunde_clean %>%
filter(min >= 900) %>%
summarize(avg_psxgper90 = mean(PSxGper90),
avg_GA = mean(GA),
avg_GSAA = mean(GSAA))
```
```{r}
bad_box <- data.frame(
xmin = -Inf, xmax = 0,
ymin = -Inf, ymax = 28.86)
chance_creation_box <- data.frame(
xmin = -Inf, xmax = 0,
ymin = 28.86, ymax = Inf)
midfield_progress_box <- data.frame(
xmin = 0, xmax = Inf,
ymin = -Inf, ymax = 28.86)
dual_box <- data.frame(
xmin = 0, xmax = Inf,
ymin = 28.86, ymax = Inf)
```
```{r}
buli_player_goalkeeping_hinrunde_clean %>%
filter(min >= 900) %>%
ggplot(aes(x = GA, y = PSxGper90)) +
geom_point() +
geom_label_repel(aes(label = player))
```
```{r fig.width = 14, fig.height = 10}
buli_player_goalkeeping_hinrunde_clean %>%
filter(min >= 900) %>%
ggplot(aes(x = GA, y = GSAA)) +
geom_point(aes(size = `xSave%`), color = "red") +
geom_text_repel(aes(label = player)) +
geom_vline(xintercept = 23) +
geom_hline(yintercept = -1.2) +
scale_x_reverse() +
theme_minimal()
```
## Minutes / Save vs. GSAA
```{r}
buli_gk <- buli_player_goalkeeping_hinrunde_clean %>%
filter(min >= 900) %>%
mutate(fillcol = case_when(
GSAA_perc < 0 & minPerSave < 28.86 ~ "red",
GSAA_perc > 0 & minPerSave < 28.86 ~ "yellow",
GSAA_perc < 0 & minPerSave > 28.86 ~ "orange",
GSAA_perc > 0 & minPerSave > 28.86 ~ "green",
TRUE ~ NA_character_))
```
```{r}
point_cols <- c("yellow" = "yellow", "red" = "red",
"orange" = "orange", "green" = "#228B22")
```
```{r}
sommer_desc <- "Yann Sommer has been in excellent form and has helped push Die Fohlen to become genuine title challengers this season. The Swiss is tied 1st for least goals conceded, 1st in Goals Against per 90 (1.06), & boasts a 6.4 GSAA (best in the league by a considerable margin, over 4 goals compared to 2nd placed Gikiewicz)."
burki_desc <- "Roman Burki is having a tough time in goal with the worst Save % in the league(47.7%) and a dreadful -16 GSAA % showing how he has leaked a lot of goals that the post-shot model expected him to save. BVB's defense may also share the blame with 3 own goals and Bürki's league-worst 0.38 PSxG/Shot on Target can also be attributed to opponent strikers being very clinical."
pavlenka_desc <- "Jiri Pavlenka is giving up a league-worst 2.41 Goals Against per 90 but he is not helped by a defense that forces him into a save every 27 minutes on average. Although the post-shot model has him saving the highest percentage of shots faced in the league (67.7% xSaves) the Czech has a -10.8 GSAA %."
```
```{r fig.width = 14, fig.height = 10}
buli_goalkeeping_hinrunde_plot <- buli_gk %>%
ggplot(aes(x = GSAA_perc, y = minPerSave)) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 28.86) +
## area fills
geom_rect(data = chance_creation_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "orange", alpha = 0.1) +
geom_rect(data = bad_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "red", alpha = 0.1) +
geom_rect(data = midfield_progress_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "yellow", alpha = 0.2) +
geom_rect(data = dual_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "green", alpha = 0.1) +
## area labels
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = -0.2, y = 19,
hjust = 0, color = "red", size = 6,
label = "Busy & Saving Less Than Average GK") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = -0.2, y = 54,
hjust = 0, color = "orange", size = 6,
label = "Not Busy & Saving Less Than Average GK") +
annotate( # #7f7f00 #228B22 #CCCC00
"text", family = "Roboto Condensed", fontface = "bold",
x = 0.02, y = 54,
hjust = 0, color = "#228B22", size = 6,
label = "Not Busy & Saving More Than Average GK") +
annotate(
"text", family = "Roboto Condensed", fontface = "bold",
x = 0.02, y = 19,
hjust = 0, color = "#7f7f00", size = 6,
label = "Busy & Saving More Than Average GK") +
## avg minutes per save
annotate(
"text", family = "Roboto Condensed", fontface = "bold",
x = -0.2, y = 29.5,
hjust = 0, color = "grey20", size = 4,
label = "Average: 28.6 Minutes per Save") +
## player data
geom_point(aes(fill = factor(fillcol)),
shape = 21, size = 3) +
scale_fill_manual(values = point_cols,
guide = NULL) +
geom_text_repel(
data = buli_gk %>%
filter(!player %in% c("J. Pavlenka", "R. Bürki",
"Y. Sommer")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold",
segment.color = factor(fillcol)),
min.segment.length = 0.3, seed = 15, size = 5,
point.padding = 0.6, color = "grey20") +
# Bürki
geom_mark_hull(
aes(filter = player == "R. Bürki",
label = "Roman Bürki (Borussia Dortmund)",
description = burki_desc),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
con.colour = "orange", color = "orange",
label.width = unit(140, 'mm'), label.buffer = unit(4.95, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#ffe5cc") +
# Pavlenka
geom_mark_hull(
aes(filter = player == "J. Pavlenka",
label = "Jiri Pavlenka (Werder Bremen)",
description = pavlenka_desc),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
con.colour = "red", color = "red",
label.width = unit(140, 'mm'), label.buffer = unit(0.5, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#FFCCCC") +
# Sommer
geom_mark_hull(
aes(filter = player == "Y. Sommer",
label = "Yann Sommer (Borussia Mönchengladbach)",
description = sommer_desc),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
label.width = unit(120, 'mm'), label.buffer = unit(57, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#cce5cc") +
scale_x_continuous(labels = scales::percent,
breaks = seq(-0.25, 0.15, 0.05),
limits = c(-0.2, 0.12),
expand = c(0.01, 0)) +
scale_y_continuous(breaks = seq(20, 56, 10),
limits = c(18, 56),
expand = c(0.01, 0)) +
labs(title = "Goalkeeper Performance & Workload: Bundesliga (2019-2020)",
subtitle = glue::glue("
January 18th, 2020 | Hinrunde | Minimum 900 Minutes Played
GSAA %: Whether a GK has saved more/less than the post-shot model expected an average GK to save (Actual Save % - Expected Save %)"),
x = "Goals Saved Above Average Percentage (GSAA %)",
y = "Average Minutes per Save",
caption = "Data: FBref | StatsBomb") +
theme_minimal() +
theme(text = element_text(family = "Roboto Condensed"),
plot.title = element_markdown(size = 20),
plot.subtitle = element_markdown(size = 16),
plot.caption = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
buli_goalkeeping_hinrunde_plot
```
## save
```{r}
ggsave(plot = buli_goalkeeping_hinrunde_plot,
here::here("Bundesliga 2019-2020/output/buli_goalkeeping_1920_hinrunde.png"),
height = 10, width = 14)
```
```{r}
plot_logo <- add_logo(
plot_path = here::here("Bundesliga 2019-2020/output/buli_goalkeeping_1920_hinrunde.png"),
logo_path = here::here("../soccer_match_reports/img/Bundesliga_logo_(2017).svg"),
logo_position = "top right",
logo_scale = 18)
plot_logo
```
```{r}
image_write(image = plot_logo,
here::here("Bundesliga 2019-2020/output/buli_goalkeeping_1920_hinrunde_logo.png"))
```
================================================
FILE: Bundesliga 2019-2020/buli_progressive_passing_1920_hinrunde.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "1/14/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# pkgs
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage,
ggforce, ggtext,
rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
## add_logo
```{r}
add_logo <- function(plot_path, logo_path, logo_position,
logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
```
# Bundesliga data
- Save "Player Standard Stats" & "Squad Standard Stats" as a `.csv` file or whatever you prefer.
- https://fbref.com/en/comps/20/passing/Bundesliga-Stats
### player stats
```{r}
buli_player_passing_stats_hinrunde_raw <- read_csv(here::here("data/buli_player_passing_stats_hinrunde.csv"),
skip = 1, col_names = TRUE)
buli_player_passing_stats_hinrunde <- buli_player_passing_stats_hinrunde_raw %>%
rename_at(vars(14:16),
~ glue::glue("short_{colnames(buli_player_passing_stats_hinrunde_raw)[14:16]}")) %>%
rename_at(vars(17:19),
~ glue::glue("medium_{colnames(buli_player_passing_stats_hinrunde_raw)[17:19]}")) %>%
rename_at(vars(20:22),
~ glue::glue("long_{colnames(buli_player_passing_stats_hinrunde_raw)[20:22]}")) %>%
rename_all(~str_replace_all(colnames(buli_player_passing_stats_hinrunde), "_[0-9]", "")) %>%
select(-Matches, -Rk)
glimpse(buli_player_passing_stats_hinrunde)
```
```{r}
## save
saveRDS(buli_player_passing_stats_hinrunde,
file = glue("{here::here()}/data/buli_player_passing_stats_hinrunde.RDS"))
buli_player_passing_stats_hinrunde <- readRDS(
file = glue("{here::here()}/data/buli_player_passing_stats_hinrunde.RDS"))
```
```{r}
buli_player_passing_hinrunde_clean <- buli_player_passing_stats_hinrunde %>%
separate(Player, into = c("fullname", "allname"), sep = "\\\\") %>% separate(fullname, into = c("firstname", "lastname"),
sep = "\\s", extra = "merge",
remove = FALSE) %>%
mutate(fname = str_extract(firstname, "[A-Z]{1}")) %>%
## players like Fabinho listed without Tavares last name
mutate(player = if_else(is.na(lastname), firstname, lastname)) %>%
mutate(player = case_when(
!is.na(lastname) ~ glue("{fname}. {lastname}"),
TRUE ~ firstname)) %>%
mutate(min = `90s` * 90) %>%
mutate(KPper90 = (KP / min) * 90,
xAper90 = (xA / min) * 90,
finalthirdper90 = (`1/3` / min) * 90,
PPAper90 = (PPA / min) * 90,
CrsPAper90 = (CrsPA / min) * 90,
TBper90 = (TB / min) * 90) %>%
## keep fullname so can differentiate A. Gomes (Andre/Angel), etc.
select(-`90s`, -firstname, -lastname, -allname,
team_name = Squad, -fname)
glimpse(buli_player_passing_hinrunde_clean)
```
```{r}
## save
saveRDS(buli_player_passing_hinrunde_clean,
file = glue("{here::here()}/data/buli_player_passing_hinrunde_clean.RDS"))
buli_player_passing_hinrunde_clean <- readRDS(
file = glue("{here::here()}/data/buli_player_passing_hinrunde_clean.RDS"))
```
# plot
```{r}
buli_player_passing_hinrunde_clean %>%
filter(min >= 900) %>%
summarize(avg_f3per90 = median(finalthirdper90),
avg_PPAper90 = median(PPAper90))
buli_player_passing_hinrunde_clean %>%
filter(min >= 900,
Pos %in% c("FW", "FWMF")) %>%
summarize(avg_f3per90 = median(finalthirdper90),
avg_PPAper90 = median(PPAper90))
```
- when filtering for FW/Attacking MFs, F3 falls and PPA rises - fairly self-explanatory and reasonable
```{r}
bad_box <- data.frame(
xmin = -Inf, xmax = 2.5,
ymin = -Inf, ymax = 0.48)
chance_creation_box <- data.frame(
xmin = -Inf, xmax = 2.5,
ymin = 0.48, ymax = Inf)
midfield_progress_box <- data.frame(
xmin = 2.5, xmax = Inf,
ymin = -Inf, ymax = 0.48)
dual_box <- data.frame(
xmin = 2.5, xmax = Inf,
ymin = 0.48, ymax = Inf)
```
```{r}
buli_pass_df <- buli_player_passing_hinrunde_clean %>%
filter(min >= 900)
```
```{r}
bayern_desc <- "Both Kimmich & Alaba have been playing much more centrally this season (Center Midfield & Center Back respectively) the former only behind teammate Coutinho and Brandt in Passes into the Penalty Area per 90 with the latter leading the league with 9.83 Passes into the Final 3rd per 90. Thiago keeps things ticking in midfield with a 90.2% Pass Accuracy."
brandt_desc <- "Julian Brandt has become the primary ball progressor in this Dortmund side with his 2.99 PPA per 90 leading the league by a considerable margin. A lot of this comes from his throughballs (0.68 Through Balls per 90, 1st in the league). Along with his 7th and 4th ranking (within BVB) in Final Third Passes per 90 & xA per 90 respectively, it shows that he provides the incisive ball in between the build-up & the final pass. Indeed, his 1.45 KP per 90 is 4th in the team behind Hazard, Sancho, & Hakimi."
```
Kimmich:
Alaba:
Brandt: 0.68 TB per 90 (1st), 2.99 PPA per 90 (1st), 4.1 Final Third per 90 (30th), 0.19 xA per 90 (30th)
1.45 KP per 90 (47t league, 4th team) xA per 90 (4th), final 3rd (7th)
```{r fig.width = 14, fig.height = 10}
buli_progressive_pass_hinrunde_plot <- ggplot(buli_pass_df,
aes(x = finalthirdper90, y = PPAper90)) +
## area fills
geom_rect(data = chance_creation_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "yellow", alpha = 0.1) +
geom_rect(data = bad_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "red", alpha = 0.1) +
geom_rect(data = midfield_progress_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "orange", alpha = 0.2) +
geom_rect(data = dual_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "green", alpha = 0.1) +
## median reference lines
geom_hline(yintercept = 0.48, color = "grey20", alpha = 0.4) +
geom_vline(xintercept = 2.5, color = "grey20", alpha = 0.4) +
## player data
geom_point(color = "red", size = 3) +
## league average
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 2.6, y = 3.8, hjust = 0, color = "grey20",
label = "Average OP Passes into the Final Third per 90: 2.5") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 0.44, hjust = 0, color = "grey20",
label = "Average OP Passes into Penalty Area per 90: 0.48") +
## area labels
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 0.25, y = 3.8,
hjust = 0, color = "#CCCC00", size = 6,
label = "Good Chance Creation") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 0.2,
hjust = 0, color = "orange", size = 6,
label = "Good Midfield Progression") +
annotate(
"text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 3.77,
hjust = 0, color = "#228B22", size = 6,
label = "Good Chance Creation\nGood Midfield Progression") +
## player labels
geom_text_repel(
data = buli_player_passing_hinrunde_clean %>%
filter(min >= 900,
finalthirdper90 > 3.75 | PPAper90 > 1.2,
!player %in% c("J. Kimmich",
"D. Alaba",
"T. Alcántara",
"P. Coutinho",
"J. Brandt")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
min.segment.length = 0.3, seed = 15, size = 3.5,
segment.color = "red", point.padding = 0.6,
color = "grey20") +
## Bayern description
geom_mark_hull(
aes(filter = player %in% c("J. Kimmich",
"D. Alaba",
"T. Alcántara",
"P. Coutinho"),
label = "Bayern's Passing Maestros.",
description = bayern_desc),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
label.width = unit(170, 'mm'), label.buffer = unit(10, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#cce5cc") +
## Brandt description #7fbf7f #b2d8b2
geom_mark_hull(
aes(filter = player %in% c("J. Brandt"),
label = "Julian Brandt: BVB's Elite Ball Progressor.",
description = brandt_desc),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
label.width = unit(250, 'mm'), label.buffer = unit(0.5, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = "grey20", label.fill = "#cce5cc") +
## Bayern player labels
geom_text_repel(
data = buli_player_passing_hinrunde_clean %>%
filter(player %in% c("J. Kimmich",
"D. Alaba",
"T. Alcántara",
"P. Coutinho")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
seed = 15, size = 4.5, color = "red",
min.segment.length = 0, segment.color = "red",
point.padding = 0.5, nudge_x = 0.6) +
## scales
scale_x_continuous(labels = seq(0, 12, 1),
breaks = seq(0, 12, 1),
limits = c(0, 12),
expand = c(0.01, 0)) +
scale_y_continuous(labels = seq(0, 4, 0.5),
breaks = seq(0, 4, 0.5),
limits = c(0, 4),
expand = c(0.01, 0)) +
labs(title = "Progressive Passers: Bundesliga (2019-2020)",
subtitle = glue("
Hinrunde | January 15th, 2020
Average (Median) | Minimum 900 Minutes Played"),
caption = glue("
Data: FBref | StatsBomb
Ryo Nakagawara, Twitter: @R_by_Ryo"),
x = "Open Play Passes into Final Third per 90",
y = "Open Play Passes into Penalty Area per 90") +
theme_minimal() +
theme(text = element_markdown(family = "Roboto Condensed"),
plot.title = element_markdown(size = 20),
plot.subtitle = element_markdown(size = 16),
plot.caption = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
buli_progressive_pass_hinrunde_plot
```
Florian Neuhaus (12th PPA per 90), Levin Oztunali (8th PPA per 90, 37th xA per 90)
The clairvoyant crossers: Filip Kostic (1.38), Christian Gunter (1.0), Niko Gießelmann (0.8) Completed Crosses into Penalty Area per 90 (1st-3rd in league). Kostic also boasts an impressive 2.6 KP per 90 (4th in the league).
RB Leipzig will surely miss Diego Demme who played a huge part in their build up. 6th in the league (just behind teammate Upamecano) in Passes into Final 3rd per 90 and leads Leipzig with 0.3 Through Balls per 90.
## save
```{r}
ggsave(plot = buli_progressive_pass_hinrunde_plot,
here::here("Bundesliga 2019-2020/output/buli_progressive_pass_1920_hinrunde.png"),
height = 10, width = 14)
```
```{r}
plot_logo <- add_logo(
plot_path = here::here("Bundesliga 2019-2020/output/buli_progressive_pass_1920_hinrunde.png"),
logo_path = here::here("../soccer_match_reports/img/Bundesliga_logo_(2017).svg"),
logo_position = "top right",
logo_scale = 18)
plot_logo
```
```{r}
image_write(image = plot_logo,
here::here("Bundesliga 2019-2020/output/buli_progressive_pass_1920_hinrunde_logo.png"))
```
================================================
FILE: Bundesliga 2019-2020/buli_shot_quality_1920_hinrunde.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "1/15/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# pkgs
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage,
ggforce, ggtext,
rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
## add_logo
```{r}
add_logo <- function(plot_path, logo_path, logo_position,
logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
```
# Bundesliga data
- Save "Player Standard Stats" & "Squad Standard Stats" as a `.csv` file or whatever you prefer.
- https://fbref.com/en/comps/20/passing/Bundesliga-Stats
### player stats
```{r}
buli_player_shooting_stats_hinrunde_raw <- read_csv(here::here("data/buli_player_shooting_stats_hinrunde.csv"),
col_names = TRUE)
buli_player_shooting_stats_hinrunde <- buli_player_shooting_stats_hinrunde_raw %>%
select(-Matches, -Rk)
glimpse(buli_player_shooting_stats_hinrunde)
```
```{r}
## save
saveRDS(buli_player_shooting_stats_hinrunde,
file = glue("{here::here()}/data/buli_player_shooting_stats_hinrunde.RDS"))
buli_player_shooting_stats_hinrunde <- readRDS(
file = glue("{here::here()}/data/buli_player_shooting_stats_hinrunde.RDS"))
```
```{r}
buli_player_shooting_hinrunde_clean <- buli_player_shooting_stats_hinrunde %>%
separate(Player, into = c("fullname", "allname"), sep = "\\\\") %>% separate(fullname, into = c("firstname", "lastname"),
sep = "\\s", extra = "merge",
remove = FALSE) %>%
mutate(fname = str_extract(firstname, "[A-Z]{1}")) %>%
## players like Fabinho listed without Tavares last name
mutate(player = if_else(is.na(lastname), firstname, lastname)) %>%
mutate(player = case_when(
!is.na(lastname) ~ glue("{fname}. {lastname}"),
TRUE ~ firstname)) %>%
mutate(np_shots = Sh - PKatt,
np_SOT = SoT - PK) %>%
mutate(min = `90s` * 90) %>%
# mutate(Shotsper90 = round( (np_shots / min) * 90, 2),
# SOTper90 = round( (np_SOT / min) * 90, 2),
# npxGperShot = npxG / np_shots) %>%
## keep fullname so can differentiate A. Gomes (Andre/Angel), etc.
select(-`90s`, -firstname, -lastname, -allname,
team_name = Squad, -fname,
npxGperShot = `npxG/Sh`,
Shotsper90 = `Sh/90`)
glimpse(buli_player_shooting_hinrunde_clean)
```
```{r}
## save
saveRDS(buli_player_shooting_hinrunde_clean,
file = glue("{here::here()}/data/buli_player_shooting_hinrunde_clean.RDS"))
buli_player_shooting_hinrunde_clean <- readRDS(
file = glue("{here::here()}/data/buli_player_shooting_hinrunde_clean.RDS"))
```
# plot
```{r}
buli_player_shooting_hinrunde_clean %>%
filter(min >= 900,
Shotsper90 > 1) %>%
summarize(avg_Shotsper90 = median(Shotsper90),
avg_npxGperShot = median(npxGperShot))
buli_player_shooting_hinrunde_clean %>%
filter(min >= 900,
Shotsper90 > 1,
Pos %in% c("FW", "FWMF")) %>%
summarize(avg_Shotsper90 = median(Shotsper90),
avg_npxGperShot = median(npxGperShot))
```
- when filtering for FW/Attacking MFs, F3 falls and PPA rises - fairly self-explanatory and reasonable
```{r}
bad_box <- data.frame(
xmin = -Inf, xmax = 2.5,
ymin = -Inf, ymax = 0.48)
chance_creation_box <- data.frame(
xmin = -Inf, xmax = 2.5,
ymin = 0.48, ymax = Inf)
midfield_progress_box <- data.frame(
xmin = 2.5, xmax = Inf,
ymin = -Inf, ymax = 0.48)
dual_box <- data.frame(
xmin = 2.5, xmax = Inf,
ymin = 0.48, ymax = Inf)
```
```{r}
buli_shot_df <- buli_player_shooting_hinrunde_clean %>%
filter(min >= 900)
```
```{r}
bayern_desc <- "Both Kimmich & Alaba have been playing much more centrally this season (Center Midfield & Center Back respectively) the former only behind teammate Coutinho and Brandt in Passes into the Penalty Area per 90 with the latter leading the league with 9.83 Passes into the Final 3rd per 90. Thiago keeps things ticking in midfield with a 90.2% Pass Accuracy."
brandt_desc <- "Julian Brandt has become the primary ball progressor in this Dortmund side with his 2.99 PPA per 90 leading the league by a considerable margin. A lot of this comes from his throughballs (0.68 Through Balls per 90, 1st in the league). Along with his 7th and 4th ranking (within BVB) in Final Third Passes per 90 & xA per 90 respectively, it shows that he provides the incisive ball in between the build-up & the final pass. Indeed, his 1.45 KP per 90 is 4th in the team behind Hazard, Sancho, & Hakimi."
```
```{r fig.width = 14, fig.height = 10}
buli_shooting_hinrunde_plot <- ggplot(
buli_shot_df,
aes(x = Shotsper90, y = npxGperShot)) +
## area fills
geom_rect(data = chance_creation_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "yellow", alpha = 0.1) +
geom_rect(data = bad_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "red", alpha = 0.1) +
geom_rect(data = midfield_progress_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "orange", alpha = 0.2) +
geom_rect(data = dual_box,
aes(x = NULL, y = NULL,
xmin = xmin, xmax = xmax,
ymin = ymin, ymax = ymax),
fill = "green", alpha = 0.1) +
## median reference lines
geom_hline(yintercept = 0.1, color = "grey20", alpha = 0.4) +
geom_vline(xintercept = 2, color = "grey20", alpha = 0.4) +
## player data
geom_point(color = "red", size = 3) +
## league average
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 2.6, y = 3.8, hjust = 0, color = "grey20",
label = "Average OP Passes into the Final Third per 90: 2.5") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 0.44, hjust = 0, color = "grey20",
label = "Average OP Passes into Penalty Area per 90: 0.48") +
## area labels
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 0.25, y = 3.8,
hjust = 0, color = "#CCCC00", size = 6,
label = "Good Chance Creation") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 0.2,
hjust = 0, color = "orange", size = 6,
label = "Good Midfield Progression") +
annotate(
"text", family = "Roboto Condensed", fontface = "bold",
x = 9.2, y = 3.77,
hjust = 0, color = "#228B22", size = 6,
label = "Good Chance Creation\nGood Midfield Progression") +
## player labels
geom_text_repel(
data = buli_player_shooting_hinrunde_clean %>%
filter(min >= 900,
Shotsper90 > 2 | npxGperShot > 0.1,
!player %in% c("J. Kimmich",
"D. Alaba",
"T. Alcántara",
"P. Coutinho",
"J. Brandt")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
min.segment.length = 0.3, seed = 15, size = 3.5,
segment.color = "red", point.padding = 0.6,
color = "grey20") +
# ## Bayern description
# geom_mark_hull(
# aes(filter = player %in% c("J. Kimmich",
# "D. Alaba",
# "T. Alcántara",
# "P. Coutinho"),
# label = "Bayern's Passing Maestros.",
# description = "bayern_desc"),
# expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
# label.width = unit(170, 'mm'), label.buffer = unit(10, "mm"),
# label.family = "Roboto Condensed", label.fontsize = c(14, 12),
# label.colour = "grey20", label.fill = "#cce5cc") +
# ## Brandt description #7fbf7f #b2d8b2
# geom_mark_hull(
# aes(filter = player %in% c("J. Brandt"),
# label = "Julian Brandt: BVB's Elite Ball Progressor.",
# description = "brandt_desc"),
# expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
# label.width = unit(250, 'mm'), label.buffer = unit(0.5, "mm"),
# label.family = "Roboto Condensed", label.fontsize = c(14, 12),
# label.colour = "grey20", label.fill = "#cce5cc") +
## Bayern player labels
geom_text_repel(
data = buli_player_shooting_hinrunde_clean %>%
filter(min >= 900,
player %in% c("J. Kimmich",
"D. Alaba",
"T. Alcántara",
"P. Coutinho")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
seed = 15, size = 4.5, color = "red",
min.segment.length = 0, segment.color = "red",
point.padding = 0.5, nudge_x = 0.6) +
## scales
scale_x_continuous(labels = seq(0, 5, 0.5),
breaks = seq(0, 5, 0.5),
limits = c(0, 5),
expand = c(0.01, 0)) +
scale_y_continuous(labels = seq(0, 0.35, 0.1),
breaks = seq(0, 0.35, 0.1),
limits = c(0, 0.35),
expand = c(0.01, 0)) +
labs(title = "Shot Volume vs. Shot Quality: Bundesliga (2019-2020)",
subtitle = glue("
Hinrunde | January 15th, 2020
Average (Median) | Minimum 900 Minutes Played | More than 1 Shot per 90"),
caption = glue("
Data: FBref | StatsBomb
Ryo Nakagawara, Twitter: @R_by_Ryo"),
x = "Shots per 90",
y = "non-Penalty xG per Shot") +
theme_minimal() +
theme(text = element_markdown(family = "Roboto Condensed"),
plot.title = element_markdown(size = 20),
plot.subtitle = element_markdown(size = 16),
plot.caption = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12))
buli_shooting_hinrunde_plot
```
================================================
FILE: Bundesliga 2019-2020/goal_contrib_graph_1920_hinrunde.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "1/8/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# pkgs
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, scales, ggforce, ggtext,
rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
## add_logo
```{r}
add_logo <- function(plot_path, logo_path, logo_position,
logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
```
# Bundesliga data
## fbref
- Save "Player Standard Stats" & "Squad Standard Stats" as a `.csv` file or whatever you prefer.
- https://fbref.com/en/comps/20/stats/Bundesliga-Stats
### squad stats
```{r}
buli_squad_stats_hinrunde_raw <- read_csv(here::here("data/buli_squad_stats_hinrunde.csv"),
skip = 1, col_names = TRUE)
buli_squad_stats_hinrunde <- buli_squad_stats_hinrunde_raw %>%
rename_at(vars(12:16),
~ glue::glue("{colnames(buli_squad_stats_hinrunde_raw)[12:16]}_p90")) %>%
rename_at(vars(20:24),
~ glue::glue("{colnames(buli_squad_stats_hinrunde_raw)[20:24]}_p90"))
buli_squad_stats_hinrunde <- buli_squad_stats_hinrunde %>%
rename_at(vars(2:24), ~ glue::glue("{colnames(buli_squad_stats_hinrunde)[2:24]}_squad")) %>%
rename_at(vars(contains("_1")), ~ str_replace(., "_1", ""))
glimpse(buli_squad_stats_hinrunde)
```
```{r}
## save
saveRDS(buli_squad_stats_hinrunde, file = glue("{here::here()}/data/buli_squad_stats_hinrunde.RDS"))
buli_squad_stats_hinrunde <- readRDS(file = glue("{here::here()}/data/buli_squad_stats_hinrunde.RDS"))
```
### player stats
```{r}
buli_player_stats_hinrunde_raw <- read_csv(here::here("data/buli_player_stats_hinrunde.csv"),
skip = 1, col_names = TRUE)
buli_player_stats_hinrunde <- buli_player_stats_hinrunde_raw %>%
rename_at(vars(17:21),
~ glue::glue("{colnames(buli_player_stats_hinrunde_raw)[17:21]}_p90")) %>%
rename_at(vars(25:29),
~ glue::glue("{colnames(buli_player_stats_hinrunde_raw)[25:29]}_p90"))
buli_player_stats_hinrunde <- buli_player_stats_hinrunde %>%
rename_at(vars(contains("_1")), ~ str_replace(., "_1", "")) %>%
select(-Matches, -Rk)
glimpse(buli_player_stats_hinrunde)
```
```{r}
## save
saveRDS(buli_player_stats_hinrunde, file = glue("{here::here()}/data/buli_player_stats_hinrunde.RDS"))
buli_player_stats_hinrunde <- readRDS(file = glue("{here::here()}/data/buli_player_stats_hinrunde.RDS"))
```
## clean
```{r}
goal_contribution_clean_df <- buli_player_stats_hinrunde %>%
left_join(buli_squad_stats_hinrunde, by = "Squad") %>%
separate(Player, into = c("fullname", "allname"), sep = "\\\\") %>%
separate(fullname, into = c("firstname", "lastname"),
sep = "\\s", extra = "merge",
remove = FALSE) %>%
mutate(fname = str_extract(firstname, "[A-Z]{1}")) %>%
## players like Fabinho listed without Tavares last name
mutate(player = if_else(is.na(lastname), firstname, lastname)) %>%
mutate(player = case_when(
!is.na(lastname) ~ glue("{fname}. {lastname}"),
TRUE ~ firstname)) %>%
group_by(fullname) %>%
mutate(goal_contrib = Gls / Gls_squad,
assist_contrib = Ast / Gls_squad) %>%
ungroup() %>%
## keep fullname so can differentiate A. Gomes (Andre/Angel), etc.
select(player, fullname, Pos, Squad, Min,
Gls, Gls_squad, PK, PK_squad,
Ast, Ast_squad,
goal_contrib, assist_contrib,
npxG_p90, xA_p90, npxG, xA)
glimpse(goal_contribution_clean_df)
```
## plot
```{r}
goal_contribution_clean_df %>%
filter(Min >= 900,
Pos %in% c("FW", "FWMF")) %>%
summarize(avg_goal_contrib = median(goal_contrib),
avg_assist_contrib = median(assist_contrib))
```
### goals & assists
```{r fig.width = 14, fig.height = 10}
## Description text
desc_andersson <- "Union Berlin had a rought start to their maiden Bundesliga season but Andersson stepped up to scoring vitals goals in wins against fellow relegation candidates like Mainz and FC Köln"
desc_weghorst <- "With "
desc_nieder <- "Niederlechner is on course to beat his best Bundesliga season (11 Goals | 3 Assists with Freiburg in 16/17) by quite a margin with 8 goals and 6 assists so far. His total contribution (goals and assists) equals to half of Augburg's goals in the hinrunde."
desc_volland <- "Volland is the fulcrum of Leverkusen's bid for a consecutive appearance in the Champions League with 5 goals and 6 assists. He tallied up 0.45 xG per 90 (11th in the league) and 0.28 xA per 90 (10th) in the Hinrunde."
## PLOT!
goal_contribution_matrix <- goal_contribution_clean_df %>%
ggplot(aes(assist_contrib, goal_contrib)) +
geom_point(data = goal_contribution_clean_df %>%
filter(goal_contrib < 0.2 | assist_contrib < 0.15),
color = "grey20", size = 4, alpha = 0.2) +
geom_point(data = goal_contribution_clean_df %>%
filter(goal_contrib > 0.2 | assist_contrib > 0.15),
color = "red", size = 4) +
geom_hline(yintercept = 0.157, color = "grey20", alpha = 0.4) +
geom_vline(xintercept = 0.08, color = "grey20", alpha = 0.4) +
geom_text_repel(
data = goal_contribution_clean_df %>%
filter(goal_contrib > 0.2 | assist_contrib > 0.15,
!player %in% c("S. Andersson", "W. Weghorst",
"F. Niederlechner", "K. Volland")),
aes(label = player, family = "Roboto Condensed", fontface = "bold"),
seed = 15, size = 4,
min.segment.length = 0, segment.color = "red",
point.padding = 0.5) +
geom_mark_circle(
aes(filter = player == "S. Andersson",
label = "Sebastian Andersson",
description = desc_andersson),
label.width = unit(65, 'mm'), label.buffer = unit(5, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12)) +
geom_mark_hull(
aes(filter = player == "W. Weghorst",
label = "Wout Weghorst",
description = desc_weghorst),
label.width = unit(45, 'mm'), label.buffer = unit(5, "mm"), label.fontsize = c(14, 11),
label.family = "Roboto Condensed") +
geom_mark_hull(
aes(filter = player == "F. Niederlechner",
label = "Florian Niederlechner",
description = desc_nieder),
concavity = 1,
label.width = unit(55, 'mm'), label.buffer = unit(0.1, "mm"),
label.fontsize = c(14, 12), label.family = "Roboto Condensed") +
geom_mark_hull(
aes(filter = player == "K. Volland",
label = "Kevin Volland",
description = desc_volland),
concavity = 1,
label.width = unit(65, 'mm'), label.buffer = unit(40.5, "mm"),
label.fontsize = c(14, 12), label.family = "Roboto Condensed") +
scale_x_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3),
limits = c(0, 0.35)) +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5),
limits = c(0, 0.57)) +
labs(title = "Goal Involvement: Bundesliga (2019-2020 Season)",
subtitle = glue("
Goal Involvement (Goals or Assists) as Percentage of Total Club Goals
Hinrunde | January 13th, 2020"),
caption = glue("
Data: FBref | StatsBomb
Ryo Nakagawara, Twitter: @R_by_Ryo"),
x = "Percentage of Club Goals Assisted",
y = "Percentage of Club Goals Scored") +
theme_minimal() +
theme(text = element_markdown(family = "Roboto Condensed"),
plot.title = element_markdown(size = 20),
plot.subtitle = element_text(size = 16),
plot.caption = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.minor.x = element_blank())
goal_contribution_matrix
```
```{r fig.width = 14, fig.height = 10}
## Description text
desc_labels <- c(
"Union Berlin had a rought start to their maiden Bundesliga season but Andersson stepped up to scoring vitals goals in wins against fellow relegation candidates like Mainz and FC Köln",
"With ",
"Niederlechner is on course to beat his best Bundesliga season (11 Goals | 3 Assists with Freiburg in 16/17) by quite a margin with 8 goals and 6 assists so far. His total contribution (goals and assists) equals to half of Augburg's goals in the hinrunde.",
"Volland is the fulcrum of Leverkusen's bid for a consecutive appearance in the Champions League with 5 goals and 6 assists. He tallied up 0.45 xG per 90 (11th in the league) and 0.28 xA per 90 (10th) in the Hinrunde.")
## PLOT!
goal_contribution_matrix <- goal_contribution_clean_df %>%
ggplot(aes(assist_contrib, goal_contrib)) +
geom_point(data = goal_contribution_clean_df %>%
filter(goal_contrib < 0.2 | assist_contrib < 0.15),
color = "grey20", size = 4, alpha = 0.2) +
geom_point(data = goal_contribution_clean_df %>%
filter(goal_contrib > 0.2 | assist_contrib > 0.15),
color = "red", size = 4) +
geom_hline(yintercept = 0.157, color = "grey20", alpha = 0.4) +
geom_vline(xintercept = 0.08, color = "grey20", alpha = 0.4) +
geom_text_repel(
data = goal_contribution_clean_df %>%
filter(goal_contrib > 0.2 | assist_contrib > 0.15,
!player %in% c("S. Andersson", "W. Weghorst",
"F. Niederlechner", "K. Volland")),
aes(label = player, family = "Roboto Condensed", fontface = "bold"),
seed = 15, size = 4,
min.segment.length = 0, segment.color = "red",
point.padding = 0.5) +
geom_mark_circle(
aes(filter = player %in% c("S. Andersson", "W. Weghorst",
"F. Niederlechner", "K. Volland"),
description = desc_labels),
label.width = unit(65, 'mm'), label.buffer = unit(5, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12)) +
scale_x_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3),
limits = c(0, 0.35)) +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5),
limits = c(0, 0.57)) +
labs(title = "Goal Involvement: Bundesliga (2019-2020 Season)",
subtitle = glue("
Goal Involvement (Goals or Assists) as Percentage of Total Club Goals
Hinrunde | January 13th, 2020"),
caption = glue("
Data: FBref | StatsBomb
Ryo Nakagawara, Twitter: @R_by_Ryo"),
x = "Percentage of Club Goals Assisted",
y = "Percentage of Club Goals Scored") +
theme_minimal() +
theme(text = element_markdown(family = "Roboto Condensed"),
plot.title = element_markdown(size = 20),
plot.subtitle = element_text(size = 16),
plot.caption = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.minor.x = element_blank())
goal_contribution_matrix
```
#### save
```{r}
ggsave(plot = goal_contribution_matrix,
here::here("Premier League 2019-2020/output/goal_contribution_plot_1920_MD20.png"),
height = 9, width = 11)
```
```{r}
plot_logo <- add_logo(
plot_path = here::here("Premier League 2019-2020/output/goal_contribution_plot_1920_MD20.png"),
logo_path = "https://upload.wikimedia.org/wikipedia/en/f/f2/Premier_League_Logo.svg",
logo_position = "top right",
logo_scale = 6)
plot_logo
```
```{r}
image_write(image = plot_logo,
here::here("Premier League 2019-2020/output/goal_contribution_plot_1920_MD20_logo.png"))
```
### xG
```{r}
goal_contribution_clean_df %>%
filter(Min >= 900,
Pos %in% c("FW", "FWMF")) %>%
summarize(avg_npxg90 = median(npxG_p90),
avg_xA = median(xA_p90))
goal_contribution_clean_df %>%
filter(Min >= 900,
Pos %in% c("FW")) %>%
summarize(avg_npxg90 = median(npxG_p90),
avg_xA = median(xA_p90))
```
```{r}
xG_xA_per90 <- goal_contribution_clean_df
```
```{r}
xG_xA_per90 %>%
filter(Min >= 900,
Pos %in% c("FW", "FWMF")) %>%
summarize(avg_npxg90 = median(npxG_p90),
avg_xA = median(xA_p90))
```
- Lewa: 14 > 12 np 8.3 np-xG
- Muller: 0.32 xG per 90 (Kovac), 0.4 xG per 90 (Flick)
-- 512 mins Flick 0.53 xA per 90 (Kovac), 0.58 xA per 90 (Flick)
- Gnabry:
--
- Plea: 0.48 npxG per 90, 0.28 xA per 90
- Thuram: 0.39 npxG per 90, 0.27 xA per 90
- Embolo: 0.34 npxG per 90, 0.27 xA per 90
```{r}
desc_bayern <- "Despite Niko Kovac's poor tactics & his eventual sacking, Bayern maintained great attacking output mainly through Robert Lewandowski's great finishing (12 np Goals from 8.3 np xG under Kovac)! Thomas Müller, having been frozen out by Kovac, got back to form under new manager Hansi Flick (0.32 xG per 90 vs. 0.4 xG per 90 & 0.53 xA per 90 vs. 0.58 xA per 90). Serge Gnabry has been a consistent threat throughout while Phil Coutinho has been doing well after a slow start."
desc_gladbach <- "Although he has cooled off a bit from an electric 4 goals & 4 assists in the first 7 games, Alassane Plea still provides the most threat with a team-leading 0.48 np-xG per 90 & 0.28 xA per 90 (Patrick Herrmann with 0.54 xG per 90 & 0.32 xA per 90 just misses out having only played 790 minutes). Plea is supported by the new arrivals Breel Embolo & Marcus Thuram who complete this dynamic trident that excels on transition plays."
```
```{r fig.width = 12, fig.height = 10}
xG_xA_per90_plot <- xG_xA_per90 %>%
filter(Min >= 900) %>%
ggplot(aes(xA_p90, npxG_p90)) +
geom_point(data = xG_xA_per90 %>%
filter(xA_p90 < 0.15 | npxG_p90 < 0.3,
Min >= 900),
color = "grey20", size = 4, alpha = 0.2) +
geom_point(data = xG_xA_per90 %>%
filter(xA_p90 > 0.15 | npxG_p90 > 0.3,
Min >= 900),
color = "red", size = 4) +
geom_hline(yintercept = 0.3, alpha = 0.6) +
geom_vline(xintercept = 0.15, alpha = 0.6) +
## All player labels
geom_text_repel(
data = xG_xA_per90 %>%
filter(xA_p90 > 0.15 | npxG_p90 > 0.3,
!player %in% c("R. Lewandowski",
"S. Gnabry", "T. Müller",
"P. Coutinho",
"M. Thuram",
"B. Embolo", "A. Pléa",
"K. Volland",
"F. Niederlechner",
"T. Werner"),
Min >= 900),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
seed = 15, size = 3.5,
min.segment.length = 0, segment.color = "red",
point.padding = 0.5) +
## Separate player
geom_text(data = xG_xA_per90 %>%
filter(player %in% c("T. Werner",
"F. Niederlechner",
"K. Volland")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
size = 3.5, nudge_x = 0.01, hjust = 0) +
## Bayern player labels
geom_text_repel(
data = xG_xA_per90 %>%
filter(player %in% c("R. Lewandowski", "P. Coutinho",
"S. Gnabry", "T. Müller")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
seed = 15, size = 4.5, color = "red",
min.segment.length = 0, segment.color = "red",
point.padding = 0.5, nudge_y = 0.05) +
## Gladbach player labels
geom_text_repel(
data = xG_xA_per90 %>%
filter(player %in% c("M. Thuram",
"B. Embolo", "A. Pléa")),
aes(label = player, family = "Roboto Condensed",
fontface = "bold"),
seed = 15, size = 4.5, color = "#228B22",
min.segment.length = 0, segment.color = NA,
point.padding = 0.5) +
## Bayern description
geom_mark_hull(
aes(filter = player %in% c("R. Lewandowski",
"S. Gnabry",
"T. Müller",
"P. Coutinho"),
label = "Bayern's Attack Remains Strong.",
description = desc_bayern),
expand = unit(2.5, "mm"), con.cap = unit(0, "mm"),
label.width = unit(130, 'mm'), label.buffer = unit(5, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 12),
label.colour = c("red", "black")) +
## Gladbach description
geom_mark_hull(
aes(filter = player %in% c("M. Thuram",
"B. Embolo"),
label = "Gladbach's Attacking Trident",
description = desc_gladbach),
expand = unit(0.05, "mm"), con.cap = unit(0, "mm"),
label.width = unit(65, 'mm'), label.buffer = unit(49, "mm"),
label.family = "Roboto Condensed", label.fontsize = c(14, 11),
label.colour = c("#228B22", "black")) +
geom_mark_circle(
aes(filter = player == "A. Pléa"),
expand = unit(3, "mm"), label.width = unit(50, 'mm'),
label.buffer = unit(30, "mm"), label.fontsize = c(14, 11),
label.family = "Roboto Condensed",
label.colour = c("#228B22", "black")) +
## Gladbach desc.
annotate("segment",
x = 0.288, xend = 0.37,
y = 0.48, yend = 0.48) +
annotate("segment",
x = 0.37, xend = 0.37,
y = 0.48, yend = 0.348) +
## xG and xA league average
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 0.05, y = 0.94, hjust = 0, #color = "red",
label = "Average xA per 90: 0.15") +
annotate("text", family = "Roboto Condensed", fontface = "bold",
x = 0.62, y = 0.29, #color = "red",
label = "Average np-xG per 90: 0.3") +
scale_x_continuous(labels = seq(0, 0.9, 0.1),
breaks = seq(0, 0.9, 0.1),
limits = c(0, 0.65)) +
scale_y_continuous(labels = seq(0, 0.9, 0.1),
breaks = seq(0, 0.9, 0.1),
limits = c(0, 0.95)) +
labs(title = "Gladbach's Trident & Bayern's Attackers Lead the League in xG per 90 & xA per 90",
subtitle = glue("
Bundesliga (2019-2020) | Hinrunde | January 14th, 2020
Average (Median) for Midfielders/Forwards | Minimum 900 Minutes Played"),
caption = glue("
Data: FBref | StatsBomb
Ryo Nakagawara, Twitter: @R_by_Ryo"),
x = "Expected Assists (xA) per 90",
y = "non-Penalty Expected Goals (np-xG) per 90") +
theme_minimal() +
theme(text = element_text(family = "Roboto Condensed"),
plot.title = element_markdown(size = 18),
plot.subtitle = element_markdown(size = 16),
plot.caption = element_text(size = 14),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.minor.x = element_blank())
xG_xA_per90_plot
```
#### save
```{r}
ggsave(plot = xG_xA_per90_plot,
here::here("Bundesliga 2019-2020/output/xA_xG_per90_1920_hinrunde.png"),
height = 10, width = 12)
```
```{r}
plot_logo <- add_logo(
plot_path = here::here("Bundesliga 2019-2020/output/xA_xG_per90_1920_hinrunde.png"),
logo_path = here::here("../soccer_match_reports/img/Bundesliga_logo_(2017).svg"),
logo_position = "top right",
logo_scale = 14)
plot_logo
```
```{r}
image_write(image = plot_logo,
here::here("Bundesliga 2019-2020/output/xA_xG_per90_1920_hinrunde_logo.png"))
```
================================================
FILE: Champions League & Europa League 2019-2020/europa_league_eloRatings.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "9/1/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Packages
```{r}
pacman::p_load(tidyverse, scales, lubridate,
ggrepel, glue, extrafont,
polite, rvest)
loadfonts(quiet = TRUE)
```
# Elo Ratings 8.31.19
```{r}
## Elo Ratings from 8.31.19
elo_ratings_raw <- read.csv("http://api.clubelo.com/2019-08-31")
```
# Europa League
## webscrape
```{r}
url <- "https://en.wikipedia.org/wiki/2019%E2%80%9320_UEFA_Europa_League_group_stage"
session <- bow(url)
EL_teams <- scrape(session) %>%
html_nodes(".wikitable th+ td") %>%
html_text() %>%
as_tibble() %>%
rename(team_name = value) %>%
mutate(team_name = team_name %>% trimws())
country_league <- scrape(session) %>%
html_nodes("th+ td .thumbborder") %>%
html_attr("alt") %>%
as_tibble() %>%
rename(country_league = value)
EL_teams_clean <- EL_teams %>%
bind_cols(country_league)
```
## clean
```{r}
EL_teams_df <- EL_teams_clean %>%
mutate(team_name = team_name %>%
iconv(from = "UTF-8", to = "ASCII//TRANSLIT"),
elo_name = case_when(
team_name == "Qarabag" ~ "Karabakh Agdam",
team_name == "F91 Dudelange" ~ "Dudelange",
team_name == "Copenhagen" ~ "FC Kobenhavn",
team_name == "Malmo FF" ~ "Malmoe",
team_name == "Krasnodar" ~ "FC Krasnodar",
team_name == "Sporting CP" ~ "Sporting",
team_name == "PSV Eindhoven" ~ "PSV",
team_name == "Eintracht Frankfurt" ~ "Frankfurt",
team_name == "Standard Liege" ~ "Standard",
team_name == "Vitoria de Guimaraes" ~ "Guimaraes",
team_name == "CSKA Moscow" ~ "CSKA Moskva",
team_name == "Ludogorets Razgrad" ~ "Razgrad",
team_name == "Ferencvaros" ~ "Ferencvaros",
team_name == "VfL Wolfsburg" ~ "Wolfsburg",
team_name == "KAA Gent" ~ "Gent",
team_name == "Saint-Etienne" ~ "Saint-Etienne",
team_name == "Oleksandriya" ~ "Olexandriya",
team_name == "Borussia Monchengladbach" ~ "Gladbach",
team_name == "Istanbul Basaksehir" ~ "Bueyueksehir",
team_name == "Wolfsberger AC" ~ "Wolfsberg",
team_name == "Beşiktas" ~ "Besiktas",
team_name == "Wolverhampton Wanderers" ~ "Wolves",
team_name == "Manchester United" ~ "Man United",
team_name == "Astana" ~ "FK Astana",
team_name == "AZ" ~ "Alkmaar",
TRUE ~ team_name
)) %>%
## join Club info to Elo data
left_join(elo_ratings_raw, by = c("elo_name" = "Club")) %>%
janitor::clean_names() %>%
select(-level, -from, -to, -elo_name) %>%
## fill out group names per 4 rows
mutate(group = rep(LETTERS[1:12], each = 4, length.out = 48) %>%
as.factor()) %>%
## per group stats
group_by(group) %>%
mutate(avg_elo = mean(elo),
med_elo = median(elo),
max_elo = max(elo),
min_elo = min(elo)) %>%
ungroup() %>%
## manually change names
## manually spread out the labels
mutate(
team_name = case_when(
team_name == "Wolverhampton Wanderers" ~ "Wolves",
team_name == "PSV Eindhoven" ~ "PSV",
team_name == "Borussia Monchengladbach" ~ "Gladbach",
team_name == "Sporting CP" ~ "Sporting",
team_name == "Vitoria de Guimaraes" ~ "Vitoria S.C.",
TRUE ~ team_name),
elo_2 = case_when(
team_name == "CFR Cluj" ~ elo - 10,
team_name == "Sporting" ~ elo + 10,
team_name == "PSV" ~ elo - 10,
team_name == "Braga" ~ elo + 10,
team_name == "Besiktas" ~ elo - 10,
team_name == "AZ" ~ elo + 10,
team_name == "Partizan" ~ elo - 5,
TRUE ~ elo),
group = forcats::fct_reorder(group, med_elo))
```
## plot
```{r, fig.height = 7, fig.width=15}
EL_teams_df %>%
ggplot(aes(x = group, y = elo, group = group)) +
geom_segment(aes(x = group, xend = group,
y = max_elo, yend = min_elo,
group = group),
size = 2.25, color = "lightgrey") +
geom_point(aes(y = med_elo, group = group),
size = 4.5, color = "red") +
geom_text(aes(y = elo_2, label = team_name),
vjust = 0.5,
size = 4.5, family = "Roboto Condensed",
show.legend = FALSE) +
geom_curve(aes(x = 12.8, xend = 12.1,
y = 1675, yend = 1675),
arrow = arrow(length = unit(0.07, "inch")),
size = 0.5, color = "black", curvature = 0.35) +
geom_text(x = 12.85, y = 1675,
hjust = 0, color = "red",
family = "Roboto Condensed",
size = 4,
label = glue::glue("
Median ELO
(per group)")) +
scale_x_discrete(expand = expand_scale(add = c(0.5, 1.75))) +
labs(title = "Europa League (2019/2020) Elo Ratings",
subtitle = "Ratings as of August 31st, 2019",
x = "Group (Ordered by Median Elo Rating per Group)", y = "Elo Rating",
caption = "@R_by_Ryo Source: ClubElo.com") +
theme_minimal() +
theme(text = element_text(family = "Roboto Condensed"),
plot.title = element_text(size = 22),
plot.subtitle = element_text(size = 18),
plot.caption = element_text(size = 14, hjust = 0),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14))
```
```{r}
ggsave(filename = here::here(
"Champions League & Europa League 2019-2020/outputs/euroleague_eloPlot2.png"),
width = 15, height = 7)
```
# Champions League
## webscrape
```{r}
url2 <- "https://en.wikipedia.org/wiki/2019%E2%80%9320_UEFA_Champions_League_group_stage"
session2 <- bow(url2)
CL_teams <- scrape(session2) %>%
html_nodes("h3+ .wikitable .flagicon+ a") %>%
html_text() %>%
as_tibble() %>%
rename(team_name = value) %>%
mutate(team_name = team_name %>% trimws())
country_leagueCL <- scrape(session2) %>%
html_nodes("h3+ .wikitable .thumbborder") %>%
html_attr("alt") %>%
as_tibble() %>%
rename(country_league = value)
CL_teams_clean <- CL_teams %>%
bind_cols(country_leagueCL)
```
## clean
```{r}
CL_teams_df <- CL_teams_clean %>%
mutate(
team_name = team_name %>%
iconv(from = "UTF-8", to = "ASCII//TRANSLIT"),
elo_name = case_when(
team_name == "Paris Saint-Germain" ~ "Paris SG",
team_name == "Club Brugge" ~ "Brugge",
team_name == "Bayern Munich" ~ "Bayern",
team_name == "Tottenham Hotspur" ~ "Tottenham",
team_name == "Olympiacos" ~ "Olympiakos",
team_name == "Red Star Belgrade" ~ "Crvena Zvezda",
team_name == "Manchester City" ~ "Man City",
team_name == "Shakhtar Donetsk" ~ "Shakhtar",
team_name == "Atletico Madrid" ~ "Atletico",
team_name == "Bayer Leverkusen" ~ "Leverkusen",
team_name == "Lokomotiv Moscow" ~ "Lok Moskva",
team_name == "Red Bull Salzburg" ~ "Salzburg",
team_name == "Borussia Dortmund" ~ "Dortmund",
team_name == "Inter Milan" ~ "Inter",
team_name == "Slavia Prague" ~ "Slavia Praha",
team_name == "Zenit Saint Petersburg" ~ "Zenit",
TRUE ~ team_name)) %>%
## join Club info to Elo data
left_join(elo_ratings_raw, by = c("elo_name" = "Club")) %>%
janitor::clean_names() %>%
select(-level, -from, -to, -elo_name) %>%
## fill out group names per 4 rows
mutate(group = rep(LETTERS[1:12], each = 4, length.out = 32) %>%
as.factor()) %>%
## per group stats
group_by(group) %>%
mutate(avg_elo = mean(elo),
med_elo = median(elo),
max_elo = max(elo),
min_elo = min(elo)) %>%
ungroup() %>%
mutate(group = forcats::fct_reorder(group, med_elo),
elo_2 = case_when(
team_name == "RB Leipzig" ~ elo + 10,
team_name == "Benfica" ~ elo - 10,
team_name == "Shakhtar" ~ elo + 10,
team_name == "Atalanta" ~ elo - 10,
TRUE ~ elo))
```
## plot
```{r, fig.height = 7, fig.width=15}
CL_teams_df %>%
ggplot(aes(x = group, y = elo, group = group)) +
geom_segment(aes(x = group, xend = group,
y = max_elo, yend = min_elo,
group = group),
size = 2.25, color = "lightgrey") +
geom_point(aes(y = med_elo, group = group),
size = 4.5, color = "red") +
geom_text(aes(y = elo_2, label = team_name),
#nudge_y = 10,
vjust = 0.5,
size = 4.5, family = "Roboto Condensed",
show.legend = FALSE) +
geom_curve(aes(x = 8.8, xend = 8.1,
y = 1825, yend = 1825),
arrow = arrow(length = unit(0.07, "inch")),
size = 0.5, color = "black", curvature = 0.35) +
geom_text(x = 8.85, y = 1825,
hjust = 0, color = "red",
family = "Roboto Condensed",
size = 4,
label = glue::glue("
Median ELO
(per group)")) +
scale_x_discrete(expand = expand_scale(add = c(0.5, 1.35))) +
labs(title = "Champions League (2019/2020) Elo Ratings",
subtitle = "Ratings as of August 31st, 2019",
x = "Group (Ordered by Median Elo Rating per Group)",
y = "Elo Rating",
caption = "@R_by_Ryo Source: ClubElo.com") +
theme_minimal() +
theme(text = element_text(family = "Roboto Condensed"),
plot.title = element_text(size = 22),
plot.subtitle = element_text(size = 18),
plot.caption = element_text(size = 14, hjust = 0),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14))
```
```{r}
ggsave(filename = here::here(
"Champions League & Europa League 2019-2020/outputs/champleague_eloPlot2.png"),
width = 15, height = 7)
```
```{r}
top3_teams <- CL_teams_df %>%
group_by(group) %>%
arrange(desc(elo), .by_group = TRUE) %>%
mutate(group_rank = row_number()) %>%
filter(group_rank != 4) %>%
group_by(group) %>%
summarize(avg_top3_elo = mean(elo),
med_top3_elo = median(elo),
max_top3_elo = max(elo),
min_top3_elo = min(elo))
CL_top3_df <- CL_teams_df %>%
left_join(top3_teams, by = "group")
```
```{r, fig.height = 7, fig.width=15}
CL_top3_df %>%
mutate(group = forcats::fct_reorder(group, med_top3_elo)) %>%
ggplot(aes(x = group, y = elo, group = group)) +
geom_segment(aes(x = group, xend = group,
y = max_top3_elo, yend = min_top3_elo,
group = group),
size = 2.25, color = "lightgrey") +
geom_point(aes(y = med_top3_elo, group = group),
size = 4.5, color = "red") +
geom_text(aes(y = elo_2, label = team_name),
vjust = 0.5,
size = 4.5, family = "Roboto Condensed",
show.legend = FALSE) +
geom_segment(aes(x = 8.6, xend = 8.05,
y = 1840, yend = 1875),
arrow = arrow(length = unit(0.09, "inch")),
size = 0.5, color = "black") +
geom_text(x = 8.65, y = 1840,
hjust = 0, color = "red",
family = "Roboto Condensed",
size = 4,
label = glue::glue("
Median ELO
(per group)")) +
scale_x_discrete(expand = expand_scale(add = c(0.5, 1.35))) +
labs(title = "Champions League (2019/2020) Elo Ratings",
subtitle = "Ratings as of August 31st, 2019 | Median of Top Three Teams per Group",
x = "Group (Ordered by Median Elo Rating per Group)",
y = "Elo Rating",
caption = "@R_by_Ryo Source: ClubElo.com") +
theme_minimal() +
theme(text = element_text(family = "Roboto Condensed"),
plot.title = element_text(size = 22),
plot.subtitle = element_text(size = 18),
plot.caption = element_text(size = 14, hjust = 0),
axis.title = element_text(size = 16),
axis.text = element_text(size = 14))
```
```{r}
ggsave(filename = here::here(
"Champions League & Europa League 2019-2020/outputs/champleague_top3_eloPlot.png"),
width = 15, height = 7)
```
================================================
FILE: Copa America 2019/1-copa_america2019.md
================================================
Another summer and another edition of the Copa América! Along with the
Africa Cup of Nations, Nations League finals, the Women’s World Cup,
Under-21 European Championship AND the Gold Cup this is yet another
soccer-filled season after last year’s World Cup and the Asian Cup
earlier this year (I also did a blog post on these last two tournaments
which you can see [here (World
Cup)](https://ryo-n7.github.io/2018-06-29-visualize-worldcup/) and [here
(Asian Cup)](https://ryo-n7.github.io/2019-01-11-visualize-asian-cup/)).
There is so much football going on at once even I can’t keep up,
especially with the time difference! To not redo all the previous
visualizations with Copa América data I tried to find new sources of
data and other forms of visualizations to give some insight into the
players and teams competing to be the champion of South America. You can
find all the code I used in this blogpost here and you can also find
other soccer related data viz in my
[soccer\_ggplot](https://github.com/Ryo-N7/soccer_ggplots) Github repo.
The sections will go from a very macro-level view of the **historical
records** of the tournament, to the **squads** competing, the teams’
**match record** in the Copa América, and finally to a micro-level view
of various attacking players using **xG** statistics.
¡Vámonos!
Packages
--------
``` r
library(dplyr) ## data wrangling
library(tidyr) ## data wrangling
library(purrr) ## data wrangling and iteration
library(stringr) ## data wrangling
library(rvest) ## webscraping
library(polite) ## webscraping (Github only pkg)
library(ggplot2) ## plotting
library(scales) ## plotting scales
library(ggimage) ## images for flags
library(ggforce) ## plotting text labels
library(cowplot) ## plotting grid
library(glue) ## text
library(ggrepel) ## plotting text labels
library(magick) ## plotting
library(DT) ## tables
library(ggtextures) ## soccer ball emoji as geom_col()
library(extrafont) ## fonts: Roboto Condensed
loadfonts()
```
theme\_copaAmerica
------------------
I wanted to have all the plots in this blogpost to have a consistent
color theme. As the tournament is going to be held in Brazil, I went
with a color theme based on its flag with blue, yellow, and green being
the primary colors.
``` r
theme_copaAmerica <- function(
title.size = 24,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 14,
axis.text.x.size = 12,
axis.text.y.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(size = 0.5, color = "white"),
panel.grid.major.y = element_line(size = 0.5, color = "white"),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.ticks = element_line(color = "white")) {
## Theme:
theme(text = element_text(family = "Roboto Condensed", color = "white"),
plot.title = element_text(family = "Roboto Condensed", face = "bold",
size = title.size, color = "yellow"),
plot.subtitle = element_text(size = subtitle.size),
plot.caption = element_text(size = caption.size),
panel.background = element_rect(fill = "#009b3a"),
plot.background = element_rect(fill = "#002776"),
axis.text = element_text(size = axis.text.size, color = "white"),
axis.text.x = element_text(size = axis.text.x.size, color = "white"),
axis.text.y = element_text(size = axis.text.y.size, color = "white"),
axis.title = element_text(size = axis.title.size),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
panel.grid.major.x = panel.grid.major.x,
panel.grid.major.y = panel.grid.major.y,
panel.grid.minor.x = panel.grid.minor.x,
panel.grid.minor.y = panel.grid.minor.y,
strip.text = element_text(color = "yellow", face = "bold",
size = strip.text.size,
margin = margin(4.4, 4.4, 4.4, 4.4)),
strip.background = element_blank(),
axis.ticks = axis.ticks
)
}
```
Top Goal Scorers // Goleadores
------------------------------
For this plot I took the stats from the Spanish version of the Wikipedia
page as it had more content. I used `purrr::flatten_df()` to squish the
list output into a dataframe then set the names of each column using
`purrr::set_names()`.
``` r
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_top_scorers <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(95)") %>%
html_table() %>%
flatten_df() %>%
set_names(c("player", "country", "goals")) %>%
mutate(image = "https://www.emoji.co.uk/files/microsoft-emojis/activity-windows10/8356-soccer-ball.png")
```
``` r
glimpse(copa_top_scorers)
```
## Observations: 22
## Variables: 4
## $ player "Norberto Méndez", "Zizinho", "Lolo Fernández", "Sever...
## $ country "ARG Argentina", "BRA Brasil", "PER Perú", "URU Urugua...
## $ goals 17, 17, 15, 15, 13, 13, 13, 13, 13, 12, 12, 11, 11, 11...
## $ image "https://www.emoji.co.uk/files/microsoft-emojis/activi...
Like in the Asian Cup blogpost I use [Claus
Wilke](https://twitter.com/ClausWilke/)’s
[ggtextures](https://github.com/clauswilke/ggtextures) package to use
**soccer ball emoji** as the column image in the plot.
``` r
copa_goleadores_raw_plot <- copa_top_scorers %>%
head(5) %>%
ggplot(aes(x = reorder(player, goals), y = goals,
image = image)) +
geom_isotype_col(img_width = grid::unit(1, "native"), img_height = NULL,
ncol = NA, nrow = 1, hjust = 0, vjust = 0.5) +
geom_text(aes(label = goals, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(breaks = c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18),
expand = c(0, 0),
limits = c(0, 19)) +
labs(title = "Top Scorers of the Copa América",
subtitle = glue("
Most goals in a single tournament: 9
Humberto Maschio (Argentina), Javier Ambrois (Uruguay), Jair (Brazil)"),
y = "Number of Goals", x = NULL,
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
theme_copaAmerica(title.size = 26,
subtitle.size = 16,
caption.size = 12,
axis.text.size = 18,
axis.title.size = 18,
panel.grid.major.y = element_blank(),
axis.ticks = element_blank())
## Add flags to y-axis:
axis_image <- axis_canvas(copa_goleadores_raw_plot, axis = 'y') +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 16.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/1/1a/Flag_of_Argentina.svg",
y = 12.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/f/fe/Flag_of_Uruguay.svg",
y = 9, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/d/df/Flag_of_Peru_%28state%29.svg",
y = 5.25, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 1.5, scale = 1.8)
copa_goleadores_plot <- ggdraw(insert_yaxis_grob(copa_goleadores_raw_plot, axis_image, position = "left"))
copa_goleadores_plot
```

Most of these players aren’t ones you might recognize. The Copa América
used to be held a lot more regularly (and sometimes erratically) until
this century so players had a lot more opportunities to score goals. All
five of the players you see here played in the 1930s-1950s when there
was a tournament every one or two years. Out of currently active
players, Peruvian legend Paolo Guerrero has 11 goals along with Eduardo
Vargas (from Chile). (Edit: after the Chile - Japan game, Vargas is on
12…) Another player you might recognize that was actually tied with
Ademir for 5th place, along with three other players, was Gabriel
Batistuta (“Batigol”).
Winners of the Copa América
---------------------------
After grabbing the data from the Wikipedia page I used a variety of
functions to clean and reshape the dataset like `tidyr::separate()` to
split the number of occurences and the year.
``` r
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_campeones <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(10)") %>%
html_table() %>%
flatten_df()
copa_campeones_limpia <- copa_campeones %>%
janitor::clean_names() %>%
slice(1:8) %>%
select(1:4) %>%
set_names(c("team", "winners", "runners_up", "third_place")) %>%
separate(winners, into = c("Champions", "first_place_year"),
sep = " ", extra = "merge") %>%
separate(runners_up, into = c("Runners-up", "second_place_year"),
sep = " ", extra = "merge") %>%
separate(third_place, into = c("Third Place", "third_place_year"),
sep = " ", extra = "merge") %>%
mutate_all(list(~str_replace_all(., "–", "0"))) %>%
mutate_at(vars(contains("num")), funs(as.numeric)) %>%
gather(key = "key", value = "value", -team,
-first_place_year, -second_place_year, -third_place_year) %>%
mutate(key = as.factor(key),
value = as.numeric(value),
team = team %>% str_replace(., "[A-Z]{3}", "") %>% str_trim(.),
team = case_when(team == "Brasil" ~ "Brazil",
TRUE ~ team)) %>%
mutate(key = forcats::fct_relevel(key,
"Champions",
"Runners-up",
"Third Place")) %>%
arrange(key, desc(value)) %>%
mutate(team = forcats::as_factor(team),
order = row_number())
```
I also wanted to add flags to this plot but
`cowplot::insert_yaxis_grob()` is unfortunately not compatible with
facets. I used `stringr::str_wrap()` to format the subtitle nicely while
I used `glue::glue()` to avoid having the use ‘’ to create a new line
for the caption.
``` r
copa_ganadores_plot <- copa_campeones_limpia %>%
ggplot(aes(value, forcats::fct_rev(team), color = key)) +
geom_point(size = 10) + # 10
geom_text(aes(label = value),
size = 5, color = "black", # 5
family = "Roboto Condensed", fontface = "bold") +
scale_color_manual(values = c("Champions" = "#FFCC33",
"Runners-up" = "#999999",
"Third Place" = "#CC6600"),
guide = FALSE) +
scale_x_continuous(breaks = c(1, 5, 10, 15),
labels = c(1, 5, 10, 15),
limits = c(-1, 16)) +
labs(x = "Number of Occurrence", y = NULL,
title = "Most Successful Teams of the Copa América!",
subtitle = str_wrap("Ordered by number of Copa América(s) won. Argentina missed the chance to leapfrog Uruguay after consecutive final losses in the previous two tournaments!", width = 80),
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
facet_wrap(~key) +
theme_copaAmerica(subtitle.size = 14,
caption.size = 10)
copa_ganadores_plot
```

What’s surprising to note is that Pele never won a Copa América with
Brazil, although he did get Best Player and Top Scorer in the 1959
edition of the tournament. Even more bizarrely Diego Maradona has never
won it either! He didn’t play in either of the 1991 and 1993 editions
where Argentina won their 13th and 14th Copas.
Copa América Squad Profiles
---------------------------
We just looked at what happened in the past but who are the players
competing in the tournament this year? To take a quick look I
web-scraped the squads of each of the competing teams from Wikipedia.
I created a list of the `xpath`s for each of squads and using
`purrr::map()` I grabbed the data for each participating country. After
I got some meta-information about the country name and the group I
created a list-column that stores the squad data as a dataframe in its
own column. To explode this out I used `tidyr::unnest()` to reshape the
entire dataframe to have one row with all the data for each player in
every squad.
To get a clean dataset I use some `stringr::str_*()` functions to
properly format the character strings such as the player positions,
ages, date of births.
``` r
squads_df_clean <- squads_df_raw %>%
janitor::clean_names() %>%
select(-delete, squad_num = no,
position = pos, birth_age = date_of_birth_age) %>%
mutate(position = position %>% str_replace_all(., "[1-9]", ""),
birth_age = birth_age %>% str_extract_all(., pattern = "\\([^()]+\\)")) %>% unnest(birth_age) %>%
group_by(player) %>%
mutate(colnum = seq_along(player)) %>%
spread(key = colnum, value = birth_age) %>%
ungroup() %>%
select(everything(), dob = `1`, age = `2`) %>%
mutate(dob = dob %>% str_replace_all(., "[()]", "") %>% lubridate::as_date(),
age = age %>% str_extract(., "[0-9]+") %>% as.integer,
country = forcats::fct_relevel(country,
"Brazil", "Argentina", "Uruguay",
"Peru", "Qatar", "Chile",
"Venezuela", "Paraguay", "Japan",
"Bolivia", "Colombia", "Ecuador",
),
club = case_when(
club == "Barcelona" & country == "Ecuador" ~ "Barcelona (Ecuador)",
TRUE ~ club))
```
``` r
glimpse(squads_df_clean)
```
## Observations: 276
## Variables: 12
## $ name 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ group "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
## $ country Brazil, Brazil, Brazil, Brazil, Brazil, Brazil,...
## $ squad_num 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ position "GK", "DF", "DF", "DF", "MF", "DF", "FW", "MF",...
## $ player "Alisson", "Thiago Silva", "Miranda", "Marquinh...
## $ caps 36, 79, 57, 36, 36, 40, 3, 10, 29, 65, 49, 15, ...
## $ goals 0, 7, 3, 1, 0, 2, 1, 0, 16, 8, 14, 1, 7, 0, 0, ...
## $ club "Liverpool", "Paris Saint-Germain", "Internazio...
## $ country_league "England", "France", "Italy", "France", "Spain"...
## $ dob 1992-10-02, 1984-09-22, 1984-09-07, 1994-05-14...
## $ age 26, 34, 34, 25, 27, 33, 22, 22, 22, 30, 27, 28,...
### Age-histogram
Using this data I can plot a bunch of histograms:
``` r
age_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_age = median(age)) %>%
ungroup() %>%
ggplot(aes(x = age)) +
geom_histogram(fill = "red", binwidth = 1) +
geom_vline(aes(xintercept = median_age), size = 1.2) +
geom_label(aes(x = median_age, y = 8,
label = glue::glue("Median: {median_age}")),
nudge_x = 0.5, hjust = 0.1, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Age Distribution of Copa América squads",
subtitle = "Columns ordered Group A to Group C",
x = "Age", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0),
breaks = scales::pretty_breaks()) +
theme_copaAmerica(title.size = 22,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white")) +
facet_wrap(~country, ncol = 3)
age_country_plot
```

In terms of age, Japan have the youngest team with a median of 21, 4
years younger than the next youngest team, Qatar. The rest have a fairly
balanced spread of ages from 20 to early-mid 30s with most of the
medians hovering around 27 years of age. The reason for Japan’s
extremely young squad is due to the fact that the full-strength Japan
team has played in both the World Cup and the Asian Cup in the past
year. Along with the fact that the Tokyo Olympics are next year, it was
decided to use the invitation to the Copa América as a trial-by-fire for
the young stars of the future. Much like in a real Olympic squad, the
team contains three “overage” players in World Cup 2010/2014/2018
goalkeeper Eiji Kawashima, Premier League winner Shinji Okazaki, and
Getafe playmaker Gaku Shibasaki.
The oldest player will be Brazil captain Dani Alves at 36 with
Paraguay’s Oscar Cardozo only two weeks younger. On the other hand, the
youngest player is Japan’s 18-year old prodigy Takefusa Kubo, the
ex-Barcelona youth player who only just recently moved to Real Madrid!
In light of his transfer a lot of eyes will be on him to see if he can
produce some Captain Tsubasa-esque performances for a very inexperienced
Japan team gearing up for the Tokyo Olympics!
### Caps histogram
When considering the experience of a squad it’s not enough to look at
ages but one needs to look at the caps or appearances for the national
team as well.
``` r
caps_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_cap = median(caps)) %>%
ungroup() %>%
ggplot(aes(x = caps)) +
geom_histogram(fill = "red", binwidth = 5) +
geom_vline(aes(xintercept = median_cap), size = 1.25) +
geom_label(aes(x = median_cap, y = 15,
label = glue::glue("Median: {median_cap}")),
nudge_x = 0.5, hjust = 0.05, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Caps (Appearances) by Country",
subtitle = "Columns ordered Group A to Group C",
x = "Caps", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_copaAmerica(
title.size = 20,
subtitle.size = 14,
caption.size = 10,
axis.text.size = 10,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(color = "white", size = 0.25),
panel.grid.major.y = element_line(color = "white", size = 0.25),
panel.grid.minor.x = element_line(color = "white", size = 0.25),
panel.grid.minor.y = element_line(color = "white", size = 0.25)) +
facet_wrap(~country, ncol = 3)
caps_country_plot
```

The majority of Japan’s squad have 0 (ZERO) caps, with the
aforementioned three “overage” players taking up most of the proportion
of caps on the team. Bolivia are also taking a untested squad with 8 of
their players with 2 caps or less! Chile, Uruguay, and Argentina bring
their veterans with multiple players over or around 100 caps. From this
data I was surprised that Jefferson Farfan and Paolo Guerrero didn’t
have 100 caps by now…
The player with the most caps is Lionel Messi (130) followed closely by
Diego Godin (126), and Alexis Sanchez (124). On the other hand there are
29 players hopeful of making their first national team appearance at
this tournament with the majority (17 players) coming from Japan.
### Goal distribution
Next I looked at the distribution of goals scored by the midfielders and
strikers of each team. I found out about using
`ggplot2::position_nudge()` for slightly adjusting variables on a
discrete scale in similar fashion to the `nudge_y =` and `nudge_x =`
arguments most people might be familiar with from other geoms. I also
used `ggforce::geom_mark_hull()` to do some labelling.
``` r
goals_country_plot <- squads_df_clean %>%
filter(position %in% c("MF", "FW")) %>%
group_by(country) %>%
mutate(median = median(goals)) %>%
ungroup() %>%
ggplot(aes(x = goals, y = reorder(country, median))) +
ggridges::geom_density_ridges(fill = "red", color = "white", scale = 1.1) +
geom_point(aes(x = median, y = country), position = position_nudge(y = 0.25),
color = "yellow", size = 3) +
ggforce::geom_mark_hull(aes(filter = country == "Argentina" & goals == 67, label = "Lionel Messi: 67 goals"),
label.buffer = unit(15, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 55, label = "Luis Suarez: 55 goals"),
label.buffer = unit(5, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Japan" & goals == 50, label = "Shinji Okazaki: 50 goals"),
label.buffer = unit(2, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 46, label = "Edinson Cavani: 46 goals"),
label.buffer = unit(25, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Chile" & goals == 41, label = "Alexis Sanchez: 41 goals"),
label.buffer = unit(4, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
scale_x_continuous(limits = c(0, 73),
expand = c(0.01, 0.01),
breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70),
labels = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)) +
expand_limits(y = 13.5) +
labs(title = "Distribution of Goals Scored by Midfielders and Strikers",
subtitle = "Copa América 2019 squads, Yellow dot = Median goals",
x = "Goals", y = NULL,
caption = glue::glue("
Source: Wikipedia
Data from prior to start of tournament
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18)
goals_country_plot
```
## Picking joint bandwidth of 2.07

With a lot of these players being more defensively minded or new players
the distribution is heavily skewed but you can see little mounds showing
the top goalscorers for each country and see which countries have their
goalscorers spread out among multiple players such as Brazil, Qatar, and
Peru.
If you know your South American players you can take a good guess at who
are the top goal scorers for each nation. For Colombia the two outlying
mounds are obviously James Rodriguez and Falcao, for example.
Venezuela’s top scorer with 22 is Salomon Rondon and for Brazil, if not
for his injury, a lonesome mound would have appeared for Neymar with 60
goals!
### Player contribution by league
Now let’s check the player contribution to the squads at the Copa
América by league. I’m just going to use the country that the league is
from for simplicity’s sake. Originally I wanted to `left_join()` it with
a ‘country <> domestic league’ table but couldn’t find one and the
league names itself aren’t very meaningful or have awful sponsor names
that obfuscate the country of origin even further.
``` r
player_contrib_league_plot <- squads_df_clean %>%
group_by(country_league) %>%
summarize(n = n()) %>%
ungroup() %>%
ggplot(aes(y = n, x = reorder(country_league, n))) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 4.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(labels = c(0, 5, 10, 15, 20, 25),
breaks = c(0, 5, 10, 15, 20, 25),
limits = c(0, 30),
expand = c(0, 0)) +
labs(title = "Breakdown of Player Contributions by League",
subtitle = glue("
Shown as Country Name
Mexico (Liga MX) contributed 27 players to South American squads"),
x = "League (Country name)", y = "Number of players",
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 10,
axis.text.size = 14,
axis.text.y.size = 11,
axis.title.size = 16,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white"))
player_contrib_league_plot
```

The best of the best players from South American countries will move on
to Europe so the Argentinean league (Superliga Argentina) and the
Brazilian league (Brasileirão - Serie A) do not have as many players as
you might think and as a consequence, the top leagues of England, Spain,
and Italy contribute quite a bit! A lot of the better players but not
quite elite South American players might go to Mexico instead of a
lower-mid European league. With the growth of the MLS a fair number of
players ply their trade there as well.
We can take a more detailed look by creating a table of the proportion
of players from each squad coming from either a domestic league or any
other league. I had to do a lot of wrangling to get the proper output
for the table. After calculating the percentage of domestic players from
a country’s domestic league I added the full data back in. Then I had to
make sure that for each country, the country - domestic league country
was the first row in each of the country groups (so Bolivia - Bolivia,
Bolivia, China, Japan - Japan, Japan - England, etc.). By doing this I
can automatically `tidyr::fill()`-in the rest of the rows of that
country with the ‘percentage of players from domestic league stat’.
``` r
squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
filter(country == country_league) %>%
mutate(perc_from_domestic_league = percent(player_from_league / 23, accuracy = 0.1)) %>%
right_join(squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
ungroup()) %>%
mutate(first = case_when(
country == country_league ~ 1,
TRUE ~ 0)) %>%
arrange(country, desc(first)) %>%
fill(perc_from_domestic_league) %>%
group_by(country) %>%
mutate(perc_from_league = percent(player_from_league / 23, accuracy = 0.1),
country_league = glue::glue("{country_league} - league")) %>%
arrange(desc(player_from_league)) %>%
select(Country = country, `League (country name)` = country_league,
`Number of players from league` = player_from_league,
`Percentage of players from league` = perc_from_league,
`Percentage of players from domestic league` = perc_from_domestic_league) %>%
DT::datatable(rownames = FALSE)
```
## Joining, by = c("country", "country_league", "player_from_league")
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes

Three interesting facts I found:
- 30% of players on the Brazil squad play for an English team, most
out of any league - squad combination excluding domestic leagues.
- 100% of the Qatar squad play in their domestic league!
- Only one Uruguayan player (4.3%) plays in its domestic league.
### Player contribution by club
In the final plot for this section, I looked at the top 10 clubs
contributing the most players to the tournament. I used
`arrange(desc(n)) %>% slice()` instead of `top_n()` as there were too
many teams tied at 4 players. To set the team names inside the bars I
created a midpoint value `midval` that calculated a value half of the
number of players contributed so the labels were placed neatly.
``` r
player_contrib_club_plot <- squads_df_clean %>%
group_by(club) %>%
summarize(n = n()) %>%
mutate(club = club %>% forcats::as_factor() %>% forcats::fct_reorder(n),
midval = n / 2) %>%
arrange(desc(n)) %>%
slice(1:15) %>%
ggplot(aes(x = club, y = n)) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
geom_text(aes(y = midval, label = club,
family = "Roboto Condensed", fontface = "bold"),
size = 5, color = "white") +
coord_flip() +
scale_y_continuous(breaks = scales::pretty_breaks(),
expand = c(0, 0),
limits = c(0, 10.5)) +
labs(title = "Top 15 Clubs contributing the most players to the Copa América",
x = "Club", y = "Number of players",
caption = "Source: Wikipedia") +
theme_copaAmerica(
title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white")) +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
player_contrib_club_plot
```

With 100% of its players coming from the domestic league it’s not
surprise that the Qatari team, Al-Sadd, is contributing the most players
to the tournament. Tied with another Qatari team, Mexican club America
features 7 players yet none of them are Mexicans (2 Argentineans, 2
Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).
At first I thought Barcelona contributed 8 players until I realized the
Ecuadorian players were coming from the Ecuadorian team called
Barcelona…I had to go all the way back up to the beginning of this
section to fix that small peculiarity. As futbol came to South America
via European colonists and immigrants a lot of teams took up the names
and colors of the teams these Europeans were fond of. Other examples
include Liverpool F.C. (Montevideo, Uruguay), Arsenal de Sarandi (Buenos
Aires, Argentina), and Club Atletico Juventus (Sao Paulo, Brazil -
although they use the colors of Torino F.C.).
If you download the data and type in the code below you can see the
entire club-country list.
``` r
squads_df_clean %>%
group_by(club, country) %>%
summarize(n = n()) %>% View()
```
Match Records
-------------
Now that we got a good look at the composition of the teams, we can take
a look at how they’ve done at every Copa América.
The next code chunk mainly comes from PH Julien and his excellent Kaggle
kernel of [“A Journey Through The History of
Soccer”](https://www.kaggle.com/phjulien/a-journey-through-the-history-of-soccer/).
``` r
## grab football federation affiliations data
federation_files <- Sys.glob("../data/federation_affiliations/*")
df_federations = data.frame(country = NULL, federation = NULL)
for (f in federation_files) {
federation = basename(f)
content = read.csv(f, header=FALSE)
content <- cbind(content,federation=rep(federation, dim(content)[1]))
df_federations <- rbind(df_federations, content)
}
colnames(df_federations) <- c("country", "federation")
df_federations <- df_federations %>%
mutate(country = as.character(country) %>% str_trim(side = "both"))
results_raw <- readr::read_csv("../data/results.csv")
results_copa <- results_raw %>%
filter(tournament == "Copa América") %>%
rename(venue_country = country,
venue_city = city) %>%
mutate(match_num = row_number())
## combine with federation affiliations
results_copa_home <- results_copa %>%
left_join(df_federations,
by = c("home_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(home_federation = federation)
results_copa_away <- results_copa %>%
left_join(df_federations,
by = c("away_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(away_federation = federation)
## combine home-away
results_copa_cleaned <- results_copa_home %>%
full_join(results_copa_away)
```
Unfortunately, this data does not have **penalty** results as those
games are all counted as a draw (as technically that is the actual
result). Considering there a lot of cagey knock-out rounds that finish
in a penalty shoot-out (including the last two finals…) it is
unfortunate but that’s just the data you have sometimes. There is a way
to web-scrape all the Copa América results and assign Win-Lose to those
games that went to penalties but I’ll leave that for another time. Also,
there is no info on what stage of the tournament the match recorded is
in.
``` r
results_copa_cleaned <- results_copa_cleaned %>%
mutate(
home_federation = case_when(
home_team == "USA" ~ "Concacaf",
TRUE ~ home_federation),
away_federation = case_when(
away_team == "USA" ~ "Concacaf",
TRUE ~ away_federation)) %>%
select(-contains("federation"), -contains("venue"),
-neutral, date, home_team, home_score, away_team, away_score,
tournament, venue_city)
```
``` r
glimpse(results_copa_cleaned)
```
## Observations: 787
## Variables: 8
## $ date 1916-07-02, 1916-07-06, 1916-07-08, 1916-07-10, 19...
## $ home_team "Chile", "Argentina", "Brazil", "Argentina", "Brazi...
## $ away_team "Uruguay", "Chile", "Chile", "Brazil", "Uruguay", "...
## $ home_score 0, 6, 1, 1, 1, 0, 4, 4, 1, 4, 5, 1, 6, 2, 0, 3, 4, ...
## $ away_score 4, 1, 1, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 3, 2, 1, 1, ...
## $ tournament "Copa América", "Copa América", "Copa América", "Co...
## $ match_num 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ venue_city "Buenos Aires", "Buenos Aires", "Buenos Aires", "Bu...
Now that it’s nice and cleaned up I created a function that reshapes the
data so that it’s set from a certain team’s perspective with the “team”
argument. You can also set the function to look for only results against
a certain opponent by filling in the `versus` argument.
``` r
copaAmerica_resultados <- function(data, team, versus = NA) {
## team of interest: ex. 'Brazil'
team_var <- enquo(team)
todos_partidos <- data %>%
## filter only for results of team of interest
filter(home_team == !!team_var | away_team == !!team_var) %>%
## reshape columns to team vs. opponent
mutate(
opponent = case_when(
away_team != !!team_var ~ away_team,
home_team != !!team_var ~ home_team),
home_away = case_when(
home_team == !!team_var ~ "home",
away_team == !!team_var ~ "away"),
equipo_goals = case_when(
home_team == !!team_var ~ home_score,
away_team == !!team_var ~ away_score),
opp_goals = case_when(
home_team != !!team_var ~ home_score,
away_team != !!team_var ~ away_score)) %>%
## label results from team's perspective
mutate(
result = case_when(
equipo_goals > opp_goals ~ "Win",
equipo_goals < opp_goals ~ "Loss",
equipo_goals == opp_goals ~ "Draw")) %>%
mutate(result = result %>% forcats::as_factor() %>% forcats::fct_relevel(c("Win", "Draw", "Loss"))) %>%
select(-contains("score"), -contains("team"), -match_num) %>%
rename(Date = date, Tournament = tournament, `Venue` = venue_city, Opponent = opponent, `Home / Away` = home_away,
`Goals For` = equipo_goals, `Goals Against` = opp_goals, Result = result)
if (is.na(versus) | is.null(versus)) {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
## sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
## spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric)
missing_cols <- c("Win", "Draw", "Loss") %>%
map_dfr( ~tibble(!!.x := numeric()))
resultados_totalmente <- resultados_totalmente %>%
bind_rows(missing_cols) %>%
mutate(Win = if_else(is.na(Win), 0, Win),
Draw = if_else(is.na(Draw), 0, Draw),
Loss = if_else(is.na(Loss), 0, Loss)) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
} else {
## opponent: ex. 'Argentina'
todos_partidos <- todos_partidos %>%
filter(Opponent == versus)
if (nrow(todos_partidos) == 0) {
return(glue("{team} has never played {versus} at the Copa América!"))
} else {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
# sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
# spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
}
}
}
```
The output is either a dataframe of all the games a team has been
involved in as well as the record of the team against other teams in the
Copa América or a message saying that the team you picked has never
played against the opponent you picked.
### Japan
``` r
copaAmerica_resultados(data = results_copa_cleaned,
team = "Japan", versus = "Brazil")
```
## Japan has never played Brazil at the Copa América!
Oh… that’s right Japan has never played against Brazil at the Copa…
``` r
resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned, team = "Japan")
```
## Warning: Unknown levels in `f`: Win
``` r
resultados_japon[[2]] %>%
DT::datatable(caption = "Japan's record in the Copa América", rownames = FALSE)
```

Japan’s only previous journey to the Copa América was in the 1999
edition where they lost all 3 games. They were invited for the 2011
edition but withdrew due to the Tohoku Earthquake and were replaced by
Costa Rica. Japanese football has come a long way since 1999 but with a
young squad it will be a uphill battle to get 3 points against any of
their Group C opponents, Uruguay, Chile, and Ecuador.
### Colombia
``` r
resultados_colombia <- copaAmerica_resultados(data = results_copa_cleaned, team = "Colombia")
resultados_colombia[[2]] %>%
DT::datatable(caption = "Colombia's record in the Copa América", rownames = FALSE)
```

Despite a recent resurgence of the Colombia national team they have not
been able to match the feats of the 2001 side that won the Copa with
their best place finish since then coming 3rd in 2004. The 2001 team
were not only unbeaten but also did not concede a single goal throughout
the tournament!
### Superclasico Sudamericano: Brazil vs. Argentina
``` r
resultados_de_brazil <- copaAmerica_resultados(data = results_copa_cleaned,
team = "Brazil", versus = "Argentina")
resultados_de_brazil[[1]] %>%
DT::datatable(caption = "Brazil vs. Argentina in the Copa América", rownames = FALSE)
```

``` r
resultados_de_brazil[[2]] %>%
DT::datatable(caption = "Brazil vs. Argentina in the Copa América", rownames = FALSE)
```

Brazil does not have a good overall record vs. Argentina but they have
not lost against their rivals at the Copa América since the 1991
edition. The “draw” in 2004 was actually in the final where they won 4-2
on penalties.
What I found odd was that the Copa América seems to have a very low
priority to certain countries, especially Brazil who have repeatedly
sent their B or C teams to the tournament in favor of sending their best
team to other tournaments or resting star players. Funnily enough these
understrength Brazilian squads have actually won the entire tournament a
few times, most notably in 2007 against a full strength Argentina side
containing the likes of Zanetti, Riquelme, Cambiasso, Tevez, a young
Messi/Mascherano, Cambiasso, et al!
Player Profiles
---------------
After looking at the history of the competition and the composition of
the squads I examined the players and their form coming into the Copa
América. In recent years football analytics has really taken off and
there have been many strides made in creating more informative
statistics to assess players’ abilities, the most prominently being the
**xG** statistics. This is the first time I talk about **xG** in any
length/depth so this introduction is as much to solidify my
understanding as well as yours!
### What IS xG?
- **xG**: Quantitative measure (between 0 and 1) that assigns a
probability that a shot taken will result in a goal based on a
variety of variables and is used for evaluating the quality of
chances and predicting players’ and teams’ future performances.
- **xA**: Quantitative measure (between 0 and 1) that assigns a
probability that a given pass will result in an assist for a goal
based on a variety of variables.
Common variables used in the models that output xG statistics are the
distance and angle of a shot, the body part used, rebound, among others.
Similar to how you might assess your favorite striker’s chances of
scoring just as he is lining up to take a shot: Is the shot a header? Is
he trying to score from a cross in regular play or a corner kick? Are
there a crowd of defenders in front of him or is he one-on-one with the
goalkeeper? Etc. You might think **who** takes a shot would be a genuine
factor but in actuality it tells you a lot less about the chances of a
goal compared to the location of the shot.
Note that there isn’t a SINGLE xG model. You can check out a blog post
comparing different people’s xG models
[here](https://mackayanalytics.nl/2017/06/19/how-accurate-are-xg-models-ii-the-big-chance-dilemma/).
People and organizations (from Statsbomb to OptaPro) have their own
ideas about what **could** be the important variables in play and as
such it’s important to report from which source you got your data from
as the stats can differ between models. A few things xG does not factor
in are things like goalkeeper performance (someone pulling off
incredible saves or letting in a poor shot) and one must also consider
the fact that team style of play and the quality of a player’s
teammates. When judging players based on these stats it is important to
be aware of contextual factors like the team they play for, their
opponent, and the player’s position/role in the team.
From xG and xA more granular statistics such as xGChain and xGBuildup
were created to be able to dig a little deeper into who is contributing
to chance creation, I’ll introduce the latter two a bit later. As the
field has grown new statistics have popped up such as [Karun
Singh](https://twitter.com/karun1710/)’s “expected threat” or xT. You
can check out an introduction to xT from
[here](https://karun.in/blog/expected-threat.html).
Of course, these statistics only tell a part of the story and are
definitenly not the be-all-and-end-all. In the context of this current
blog post, these stats only tell the story about how these players did
for their club teams this past season rather than for their national
team. Even still it gives us a good idea of what kind of form these
players are in coming into this tournament.
You might also want to watch these Youtube videos by
[TifoFootball](https://www.youtube.com/channel/UCGYYNGmyhZ_kwBF_lqqXdAQ)
for a quick primer on [xG](https://www.youtube.com/watch?v=zSaeaFcm1SY)
and [xA](https://www.youtube.com/watch?v=1MdlkuzLdj4).
### understat data
For the data I used the website, `understat.com`. Their xG models were
created via training a neural network on a dataset consisting of over
100,000 shots using more than 10 different variables. Getting data from
`understat` has been made easy by Ewen Henderson’s `understatr` package
available from [Github](https://github.com/ewenme/understatr) (he’s also
the guy that made the [ghibli](https://github.com/ewenme/ghibli) color
palette!). I tried to pick a wide selection of attacking players but I
was also limited by the fact that `understat` only has data for
teams/players from six European leagues (Premier League, Bundesliga,
Serie A, La Liga, Ligue 1, and Russian Premier League).
For **Peru** I would have chosen Paolo Guerrero but as he plays in
Brazil now I went with Jefferson Farfan (who hasn’t played as many games
as the other players used for comparison unfortunately…). For **Chile**
I would pick Eduardo Vargas but he as doesn’t play for a team covered by
understat I went with Alexis Sanchez, who had a woeful season and only
played \~600 minutes despite appearing in \~20 league matches and later
added Arturo Vidal. For **Brazil** I included Neymar initially but since
he won’t actually be playing I’ll keep him for comparison’s sake but
also include Gabriel Jesus and Roberto Firmino who have been fighting
for the starting striker spot. Note that these two aren’t the ones
replacing Neymar **positionally**. In Neymar’s left-wing position I can
see David Neres or Phil Coutinho replacing him (Richarlison and Willian
mostly play on the right). (Edit: In the first match vs. Bolivia, David
Neres started off on the left while Richarlison played on the right,
Coutinho played just behind Bobby Firmino)
The other nation’s strikers/attacking midfielders don’t play for the six
European leagues covered by understat or like in Shinji Okazaki’s case
just did not play as many minutes/games during the season. To get the
data I created a list of the player codes and use `purrr::map()` to
iterate each through the `understatr::get_player_seasons_stats()`
function.
``` r
player_codes <- c(2097, 2099, 813, ## Messi, Neymar, Rondon
498, 4299, 696, ## Alexis, Farfan, Falcao
3294, 2098, 5543, ## Cavani, Suarez, G. Jesus
482, 1148, 2249, ## Bobby, Duvan, James
1089, 3553, 488, ## Cuadrado, Di Maria, Coutinho
222) ## Arturo Vidal
understat_data <- player_codes %>%
map(., ~ understatr::get_player_seasons_stats(.x)) %>%
reduce(bind_rows) %>%
select(-player_id, -position, -yellow, -red)
```
``` r
glimpse(understat_data)
```
## Observations: 83
## Variables: 15
## $ games 34, 36, 34, 33, 38, 17, 20, 30, 34, 33, 32, 36, 38...
## $ goals 36, 34, 37, 26, 43, 15, 19, 13, 24, 22, 11, 7, 8, ...
## $ shots 170, 196, 179, 158, 187, 55, 91, 105, 124, 95, 89,...
## $ time 2704, 2995, 2832, 2726, 3374, 1443, 1797, 2652, 30...
## $ xG 25.997169, 28.946281, 26.885174, 27.101910, 35.891...
## $ assists 13, 12, 9, 16, 18, 7, 13, 11, 12, 7, 7, 3, 2, 2, 0...
## $ xA 15.33516552, 15.10040562, 13.95513140, 15.87127814...
## $ key_passes 93, 87, 79, 77, 95, 43, 70, 91, 102, 52, 32, 23, 2...
## $ year 2018, 2017, 2016, 2015, 2014, 2018, 2017, 2016, 20...
## $ team_name "Barcelona", "Barcelona", "Barcelona", "Barcelona"...
## $ npg 32, 32, 31, 23, 38, 10, 15, 12, 19, 21, 11, 7, 8, ...
## $ npxG 22.280909, 25.973170, 21.682231, 21.899351, 31.432...
## $ xGChain 38.459877, 48.180634, 42.525045, 41.996866, 54.753...
## $ xGBuildup 10.6987990, 21.6344040, 18.1335122, 15.1963644, 19...
## $ player_name "Lionel Messi", "Lionel Messi", "Lionel Messi", "L...
As you can see the data consists of a row for each player and each year
(from the 2014/2015 season to the 2018/2019 season). I tried to mitigate
the fact that some players played a lot more minutes than others by
standardize everything to a ‘per 90 minutes’ value but this does have
its own disadvantages. These include the fact that players who play a
lot of minutes (as regular starting members) may not have as high ‘per
90’ stat even though their production with all these minutes might
suggest that they are consistently performing and producing at a high
level.
It’ll be a bit crowded (kind of like a spilt box of Skittles…) but let’s
check out the key metrics for all the players at once.
Note: npg = non-penalty goals, npxG = non-penalty goals xG
``` r
comparison_data <- understat_data %>%
filter(year == 2018) %>%
select(-games, -team_name, -year) %>%
rename(Shots = shots, KP = key_passes) %>%
gather(key = "key", value = "value", -player_name, -time) %>%
mutate(key = forcats::as_factor(key) %>%
forcats::fct_relevel(.,
"xG", "goals", "npxG", "npg",
"xA", "assists", "xGChain", "xGBuildup",
"Shots", "KP"))
comparison_strikers_plot <- comparison_data %>%
filter(key != "Shots", key != "KP",
key != "xGBuildup", key != "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
geom_vline(xintercept = 4.5, size = 2) +
geom_vline(xintercept = 5.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.26)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Comparison: Top attackers at the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_strikers_plot
```

As usual in these types of charts, Messi is leading a lot of the metrics
here and showing consistency too with having played the third highest
amount of minutes out of the selected players. It’s helpful to have the
xG/xA stats next to the actual goals/assists as it provides an
indication of whether the player in question is scoring shots that he
probabilistically should be scoring. When a player’s actual goal count
is higher than their xG stat this suggests that the player is
**“exceeding their xG”** or that they are scoring from shots that are
historically hard to score from. It can be seen as a marker of an elite
finisher as they are putting away chances from difficult situations
consistently. In terms of assists and xA Alexis Sanchez, who only played
about 600 minutes, looks a lot better than in reality due to the
aforementioned disadvantage of standardizing everything to a “per 90
minutes” value. Normally you would have a cut-off based on a certain
**minimum amount of minutes** but as I mentioned I was rather limited in
my choice of players.
A way to take a closer look at xG - Goals and xA - Assists is to use a
simple dot plot with a line going through the 45 degree angle. Those
below the line are underperforming relative to their xG or xA stat,
those over it are overachieving (“exceeding” their xG/xA stat) while
those just on the line are scoring or assisting right around what the
model expects the player to be. I use non-penalty xG below as penalties
have around \~0.75 xG (give or take a few percentage points depending on
the model) and can inflate the stats of those players who take a lot of
penalties and score them, especially if they weren’t the ones who earned
the penalty themselves.
``` r
expected_goal_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, npxG, xG, goals) %>%
mutate_at(c("npxG", "xG", "goals"), ~. / time * 90) %>%
ggplot(aes(x = npxG, y = goals, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
scale_x_continuous(limits = c(0, 1.1),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 1.3),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Expected vs. Actual Goals",
subtitle = "For select group of attacking players with data available from understat.com",
x = "Non-penalty xG per 90 minutes",
y = "Goals per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_goal_plot
```

Gabriel Jesus is quite clearly below the 45 degree line meaning that he
has been very poor at finishing chances (and/or incredibly unlucky).
After a poor World Cup where he scored 0 goals as a starter, he is
really going to have to step up to fill Neymar’s goalscoring boots for
this tournament. However, his build-up play for City has still been good
this past season and he has been scoring for Brazil in the friendlies
leading up to the tournament so it’s going to be a hard decision for
Tite to decide on who starts against Bolivia (edit: Firmino started and
contributed an assist while Jesus replaced him in the 65th minute).
``` r
expected_assists_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, xA, assists) %>%
mutate_at(c("xA", "assists"), ~. / time * 90) %>%
ggplot(aes(x = xA, y = assists, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
labs(title = "Expected vs. Actual Assists",
subtitle = "For select group of attacking players with data available from understat.com",
x = "xA per 90 minutes",
y = "Assists per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
scale_x_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_assists_plot
```

One thing to keep in mind is that xA does not take into account the
recipient of the assist pass. Even if the pass given had a high expected
assist value the receiving player still might not have the quality to
put it away through no fault of the passer. This might explain why most
of the players with a higher xA among this group don’t have the assists
to match. It can also be that these players are also the ones playing a
lot more minutes and the volume of chances they create just aren’t
translating to goals all the time.
### Key Passes, Shots, xGChain, and xGBuildup (per 90)
I separated “key passes” and “shots” as well as “xGChain” and
“xGBuildup” from the rest as these two sets were on a very different
scale.
``` r
kp_shots_plot <- comparison_data %>%
filter(key == "Shots" | key == "KP") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 6),
breaks = c(0, 1, 2, 3, 4, 5, 6),
labels = c(0, 1, 2, 3, 4, 5, 6)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = glue("
KP = Key Passes
For select group of attacking players with data available from understat.com"),
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
kp_shots_plot
```

- xGChain: Quantitative measure that is the combined sum of the xG of
every possession that ends in a shot that a player is involved in.
The same derived value is given to each of the players involved and
allows us to credit players for attacking contributions outside of
just shots (xG) and assists (xA).
- xGBuildup: Similar to xGChain but excluding shots and assists. This
is in response to xGChain values still being dominated by the xG and
xA from shots and assists, respectively.
``` r
xgbuildup_xgchain_plot <- comparison_data %>%
filter(key == "xGBuildup" | key == "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.55),
breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5),
labels = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
xgbuildup_xgchain_plot
```

Although Gabriel Jesus has been poor at finishing his chances as seen in
previous graphs, his xGChain and xGBuildup stat makes it clear that he
is still contributing to City’s attack outside of scoring goals himself
(not to mention all the defensive work he does as well).
For example below, the stats are able to clearly differentiate between
James, who is more of a playmaker, compared to Falcao and Duvan who are
traditional number 9s with his superior xGBuildup, xGChain, and Key
Passes values. For a more detailed overview on xGChain and xGBuildup
check out Statsbomb’s article
[here](https://statsbomb.com/2018/08/introducing-xgchain-and-xgbuildup/).
``` r
## keep colors for Colombians consistent with other plots
colombia_pal <- c("#000033", "#005300", "#009FFF", "#00FFBE")
comparison_colombia_plot <- comparison_data %>%
filter(!key %in% c("xG", "goals", "npxG", "npg", "xA", "assists"),
player_name %in% c("James Rodríguez", "Falcao", "Duván Zapata", "Juan Cuadrado")) %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.05, 0.05),
breaks = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4),
labels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4)) +
scale_fill_manual(values = colombia_pal, name = "Player") +
labs(title = "Comparison: Stars of Colombia",
subtitle = "KP: Key Passes",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 20,
subtitle.size = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_colombia_plot
```

There are lots of different ways to visualize this data, most famously
the radar charts created by [Statsbomb](http://statsbomb.com/). Using
the data you can also compare and evaluate players in many different
ways and using `understatr` package you could add a few more players
like Paulo Dybala, Miguel Almiron, and more! I could probably do a whole
other article using just this data but I’ll leave it here for now.
Conclusion
----------
Throughout this blog post I talked about some of the historical records,
squad compositions, match records, and finally the player profiles of
attacking players at this summer’s Copa América. Using the power of R it
is really easy to webscrape and visualize data in a way that is
informative and aesthetically pleasing. I wanted to finish this before
the tournament started but other life things got in the way as well as
the fact that the amount of content ballooned out of control (especially
the xG section) so I had to cut down a lot.
It’s been fun reading the articles on the Copa América website and
seeing how far my “intermediate-but-very-out-of-practice”-level of
Spanish can get me to understand the content, here is one that I
particularly liked reading: [14 Estadisticas de la Copa
América](https://copaamerica.com/es/noticias/a-14-dias-del-inicio-de-la-copa-14-estadisticas-de-la-copa-america).
With so many tournaments going on right now (and with the African Cup of
Nations starting in a few days) a lot of the news media is spread thin
right now but there are still some quality articles out there to read
about the Copa, like [this
article](https://www.bbc.com/sport/football/48600098) from BBC’s South
American football expert, Tim Vickery.
After the first round of games, a few points of discussion:
- After an extremely lacklustre performance vs. Colombia, how does
Argentina bounce back? What tactical changes need to be made?
- Qatar impressed against Paraguay but can they pull off a major upset
vs. Colombia?
- How will Japan line-up against Uruguay after a losing by a scoreline
that didn’t really do their performance justice? How will manager
Moriyasu balance experience vs. youth, will he start with veteran
Okazaki after Ueda’s numerous misses vs. Chile?
- Can Brazil earn a early ticket to the next round vs. Venezuela after
a clinical but not excellent performance vs. Bolivia? Will they be
able to keep up their streak of winning the Copa every time they
have hosted it?
Thanks for reading and…
¡Buena suerte a todos equipos!
================================================
FILE: Copa America 2019/COPY-2019-06-18-visualize-copa-america.md
================================================
---
layout: post
title: "Visualizing the Copa América: Historical Records, Squad Profiles, and Player Profiles with xG statistics!"
fb-img: https://i.imgur.com/Elzd0Sc.png
share-img: https://i.imgur.com/Elzd0Sc.png
tags: [brazil, messi, ggplot2, web-scraping, copa-america-2019, soccer, futbol, rvest, tidyverse]
---
Another summer and another edition of the Copa América! Along with the
Africa Cup of Nations, Nations League finals, the Women’s World Cup,
Under-21 European Championship AND the Gold Cup this is yet another
soccer-filled season after last year’s World Cup and the Asian Cup
earlier this year (I also did a blog post on these last two tournaments
which you can see [here (World
Cup)](https://ryo-n7.github.io/2018-06-29-visualize-worldcup/) and [here
(Asian Cup)](https://ryo-n7.github.io/2019-01-11-visualize-asian-cup/)).
There is so much football going on at once even I can’t keep up,
especially with the time difference! To not redo all the previous
visualizations with Copa América data I tried to find new sources of
data and other forms of visualizations to give some insight into the
players and teams competing to be the champion of South America. You can
find all the code I used in this blogpost here and you can also find
other soccer related data viz in my
[soccer\_ggplot](https://github.com/Ryo-N7/soccer_ggplots) Github repo.
The sections will go from a very macro-level view of the **historical
records** of the tournament, to the **squads** competing, the teams’
**match record** in the Copa América, and finally to a micro-level view
of various attacking players using **xG** statistics.
¡Vámonos!
Packages
--------
``` r
library(dplyr) ## data wrangling
library(tidyr) ## data wrangling
library(purrr) ## data wrangling and iteration
library(stringr) ## data wrangling
library(rvest) ## webscraping
library(polite) ## webscraping (Github only pkg)
library(ggplot2) ## plotting
library(scales) ## plotting scales
library(ggimage) ## images for flags
library(ggforce) ## plotting text labels
library(cowplot) ## plotting grid
library(glue) ## text
library(ggrepel) ## plotting text labels
library(magick) ## plotting
library(DT) ## tables
library(ggtextures) ## soccer ball emoji as geom_col()
library(extrafont) ## fonts: Roboto Condensed
loadfonts()
```
theme\_copaAmerica
------------------
I wanted to have all the plots in this blogpost to have a consistent
color theme. As the tournament is going to be held in Brazil, I went
with a color theme based on its flag with blue, yellow, and green being
the primary colors.
``` r
theme_copaAmerica <- function(
title.size = 24,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 14,
axis.text.x.size = 12,
axis.text.y.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(size = 0.5, color = "white"),
panel.grid.major.y = element_line(size = 0.5, color = "white"),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.ticks = element_line(color = "white")) {
## Theme:
theme(text = element_text(family = "Roboto Condensed", color = "white"),
plot.title = element_text(family = "Roboto Condensed", face = "bold",
size = title.size, color = "yellow"),
plot.subtitle = element_text(size = subtitle.size),
plot.caption = element_text(size = caption.size),
panel.background = element_rect(fill = "#009b3a"),
plot.background = element_rect(fill = "#002776"),
axis.text = element_text(size = axis.text.size, color = "white"),
axis.text.x = element_text(size = axis.text.x.size, color = "white"),
axis.text.y = element_text(size = axis.text.y.size, color = "white"),
axis.title = element_text(size = axis.title.size),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
panel.grid.major.x = panel.grid.major.x,
panel.grid.major.y = panel.grid.major.y,
panel.grid.minor.x = panel.grid.minor.x,
panel.grid.minor.y = panel.grid.minor.y,
strip.text = element_text(color = "yellow", face = "bold",
size = strip.text.size,
margin = margin(4.4, 4.4, 4.4, 4.4)),
strip.background = element_blank(),
axis.ticks = axis.ticks
)
}
```
Top Goal Scorers // Goleadores
------------------------------
For this plot I took the stats from the Spanish version of the Wikipedia
page as it had more content. I used `purrr::flatten_df()` to squish the
list output into a dataframe then set the names of each column using
`purrr::set_names()`.
``` r
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_top_scorers <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(95)") %>%
html_table() %>%
flatten_df() %>%
set_names(c("player", "country", "goals")) %>%
mutate(image = "https://www.emoji.co.uk/files/microsoft-emojis/activity-windows10/8356-soccer-ball.png")
```
``` r
glimpse(copa_top_scorers)
```
## Observations: 22
## Variables: 4
## $ player "Norberto Méndez", "Zizinho", "Lolo Fernández", "Sever...
## $ country "ARG Argentina", "BRA Brasil", "PER Perú", "URU Urugua...
## $ goals 17, 17, 15, 15, 13, 13, 13, 13, 13, 12, 12, 11, 11, 11...
## $ image "https://www.emoji.co.uk/files/microsoft-emojis/activi...
Like in the Asian Cup blogpost I use [Claus
Wilke](https://twitter.com/ClausWilke/)’s
[ggtextures](https://github.com/clauswilke/ggtextures) package to use
**soccer ball emoji** as the column image in the plot.
``` r
copa_goleadores_raw_plot <- copa_top_scorers %>%
head(5) %>%
ggplot(aes(x = reorder(player, goals), y = goals,
image = image)) +
geom_isotype_col(img_width = grid::unit(1, "native"), img_height = NULL,
ncol = NA, nrow = 1, hjust = 0, vjust = 0.5) +
geom_text(aes(label = goals, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(breaks = c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18),
expand = c(0, 0),
limits = c(0, 19)) +
labs(title = "Top Scorers of the Copa América",
subtitle = glue("
Most goals in a single tournament: 9
Humberto Maschio (Argentina), Javier Ambrois (Uruguay), Jair (Brazil)"),
y = "Number of Goals", x = NULL,
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
theme_copaAmerica(title.size = 26,
subtitle.size = 16,
caption.size = 12,
axis.text.size = 18,
axis.title.size = 18,
panel.grid.major.y = element_blank(),
axis.ticks = element_blank())
## Add flags to y-axis:
axis_image <- axis_canvas(copa_goleadores_raw_plot, axis = 'y') +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 16.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/1/1a/Flag_of_Argentina.svg",
y = 12.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/f/fe/Flag_of_Uruguay.svg",
y = 9, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/d/df/Flag_of_Peru_%28state%29.svg",
y = 5.25, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 1.5, scale = 1.8)
copa_goleadores_plot <- ggdraw(insert_yaxis_grob(copa_goleadores_raw_plot, axis_image, position = "left"))
copa_goleadores_plot
```

Most of these players aren’t ones you might recognize. The Copa América
used to be held a lot more regularly (and sometimes erratically) until
this century so players had a lot more opportunities to score goals. All
five of the players you see here played in the 1930s-1950s when there
was a tournament every one or two years. Out of currently active
players, Peruvian legend Paolo Guerrero has 11 goals along with Eduardo
Vargas (from Chile). (Edit: after the Chile - Japan game, Vargas is on
12…) Another player you might recognize that was actually tied with
Ademir for 5th place, along with three other players, was Gabriel
Batistuta (“Batigol”).
Winners of the Copa América
---------------------------
After grabbing the data from the Wikipedia page I used a variety of
functions to clean and reshape the dataset like `tidyr::separate()` to
split the number of occurences and the year.
``` r
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_campeones <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(10)") %>%
html_table() %>%
flatten_df()
copa_campeones_limpia <- copa_campeones %>%
janitor::clean_names() %>%
slice(1:8) %>%
select(1:4) %>%
set_names(c("team", "winners", "runners_up", "third_place")) %>%
separate(winners, into = c("Champions", "first_place_year"),
sep = " ", extra = "merge") %>%
separate(runners_up, into = c("Runners-up", "second_place_year"),
sep = " ", extra = "merge") %>%
separate(third_place, into = c("Third Place", "third_place_year"),
sep = " ", extra = "merge") %>%
mutate_all(list(~str_replace_all(., "–", "0"))) %>%
mutate_at(vars(contains("num")), funs(as.numeric)) %>%
gather(key = "key", value = "value", -team,
-first_place_year, -second_place_year, -third_place_year) %>%
mutate(key = as.factor(key),
value = as.numeric(value),
team = team %>% str_replace(., "[A-Z]{3}", "") %>% str_trim(.),
team = case_when(team == "Brasil" ~ "Brazil",
TRUE ~ team)) %>%
mutate(key = forcats::fct_relevel(key,
"Champions",
"Runners-up",
"Third Place")) %>%
arrange(key, desc(value)) %>%
mutate(team = forcats::as_factor(team),
order = row_number())
```
I also wanted to add flags to this plot but
`cowplot::insert_yaxis_grob()` is unfortunately not compatible with
facets. I used `stringr::str_wrap()` to format the subtitle nicely while
I used `glue::glue()` to avoid having the use ‘’ to create a new line
for the caption.
``` r
copa_ganadores_plot <- copa_campeones_limpia %>%
ggplot(aes(value, forcats::fct_rev(team), color = key)) +
geom_point(size = 10) + # 10
geom_text(aes(label = value),
size = 5, color = "black", # 5
family = "Roboto Condensed", fontface = "bold") +
scale_color_manual(values = c("Champions" = "#FFCC33",
"Runners-up" = "#999999",
"Third Place" = "#CC6600"),
guide = FALSE) +
scale_x_continuous(breaks = c(1, 5, 10, 15),
labels = c(1, 5, 10, 15),
limits = c(-1, 16)) +
labs(x = "Number of Occurrence", y = NULL,
title = "Most Successful Teams of the Copa América!",
subtitle = str_wrap("Ordered by number of Copa América(s) won. Argentina missed the chance to leapfrog Uruguay after consecutive final losses in the previous two tournaments!", width = 80),
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
facet_wrap(~key) +
theme_copaAmerica(subtitle.size = 14,
caption.size = 10)
copa_ganadores_plot
```

What’s surprising to note is that Pele never won a Copa América with
Brazil, although he did get Best Player and Top Scorer in the 1959
edition of the tournament. Even more bizarrely Diego Maradona has never
won it either! He didn’t play in either of the 1991 and 1993 editions
where Argentina won their 13th and 14th Copas.
Copa América Squad Profiles
---------------------------
We just looked at what happened in the past but who are the players
competing in the tournament this year? To take a quick look I
web-scraped the squads of each of the competing teams from Wikipedia.
I created a list of the `xpath`s for each of squads and using
`purrr::map()` I grabbed the data for each participating country. After
I got some meta-information about the country name and the group I
created a list-column that stores the squad data as a dataframe in its
own column. To explode this out I used `tidyr::unnest()` to reshape the
entire dataframe to have one row with all the data for each player in
every squad.
To get a clean dataset I use some `stringr::str_*()` functions to
properly format the character strings such as the player positions,
ages, date of births.
``` r
squads_df_clean <- squads_df_raw %>%
janitor::clean_names() %>%
select(-delete, squad_num = no,
position = pos, birth_age = date_of_birth_age) %>%
mutate(position = position %>% str_replace_all(., "[1-9]", ""),
birth_age = birth_age %>% str_extract_all(., pattern = "\\([^()]+\\)")) %>% unnest(birth_age) %>%
group_by(player) %>%
mutate(colnum = seq_along(player)) %>%
spread(key = colnum, value = birth_age) %>%
ungroup() %>%
select(everything(), dob = `1`, age = `2`) %>%
mutate(dob = dob %>% str_replace_all(., "[()]", "") %>% lubridate::as_date(),
age = age %>% str_extract(., "[0-9]+") %>% as.integer,
country = forcats::fct_relevel(country,
"Brazil", "Argentina", "Uruguay",
"Peru", "Qatar", "Chile",
"Venezuela", "Paraguay", "Japan",
"Bolivia", "Colombia", "Ecuador",
),
club = case_when(
club == "Barcelona" & country == "Ecuador" ~ "Barcelona (Ecuador)",
TRUE ~ club))
```
``` r
glimpse(squads_df_clean)
```
## Observations: 276
## Variables: 12
## $ name 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ group "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
## $ country Brazil, Brazil, Brazil, Brazil, Brazil, Brazil,...
## $ squad_num 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ position "GK", "DF", "DF", "DF", "MF", "DF", "FW", "MF",...
## $ player "Alisson", "Thiago Silva", "Miranda", "Marquinh...
## $ caps 36, 79, 57, 36, 36, 40, 3, 10, 29, 65, 49, 15, ...
## $ goals 0, 7, 3, 1, 0, 2, 1, 0, 16, 8, 14, 1, 7, 0, 0, ...
## $ club "Liverpool", "Paris Saint-Germain", "Internazio...
## $ country_league "England", "France", "Italy", "France", "Spain"...
## $ dob 1992-10-02, 1984-09-22, 1984-09-07, 1994-05-14...
## $ age 26, 34, 34, 25, 27, 33, 22, 22, 22, 30, 27, 28,...
### Age-histogram
Using this data I can plot a bunch of histograms:
``` r
age_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_age = median(age)) %>%
ungroup() %>%
ggplot(aes(x = age)) +
geom_histogram(fill = "red", binwidth = 1) +
geom_vline(aes(xintercept = median_age), size = 1.2) +
geom_label(aes(x = median_age, y = 8,
label = glue::glue("Median: {median_age}")),
nudge_x = 0.5, hjust = 0.1, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Age Distribution of Copa América squads",
subtitle = "Columns ordered Group A to Group C",
x = "Age", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0),
breaks = scales::pretty_breaks()) +
theme_copaAmerica(title.size = 22,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white")) +
facet_wrap(~country, ncol = 3)
age_country_plot
```

In terms of age, Japan have the youngest team with a median of 21, 4
years younger than the next youngest team, Qatar. The rest have a fairly
balanced spread of ages from 20 to early-mid 30s with most of the
medians hovering around 27 years of age. The reason for Japan’s
extremely young squad is due to the fact that the full-strength Japan
team has played in both the World Cup and the Asian Cup in the past
year. Along with the fact that the Tokyo Olympics are next year, it was
decided to use the invitation to the Copa América as a trial-by-fire for
the young stars of the future. Much like in a real Olympic squad, the
team contains three “overage” players in World Cup 2010/2014/2018
goalkeeper Eiji Kawashima, Premier League winner Shinji Okazaki, and
Getafe playmaker Gaku Shibasaki.
The oldest player will be Brazil captain Dani Alves at 36 with
Paraguay’s Oscar Cardozo only two weeks younger. On the other hand, the
youngest player is Japan’s 18-year old prodigy Takefusa Kubo, the
ex-Barcelona youth player who only just recently moved to Real Madrid!
In light of his transfer a lot of eyes will be on him to see if he can
produce some Captain Tsubasa-esque performances for a very inexperienced
Japan team gearing up for the Tokyo Olympics!
### Caps histogram
When considering the experience of a squad it’s not enough to look at
ages but one needs to look at the caps or appearances for the national
team as well.
``` r
caps_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_cap = median(caps)) %>%
ungroup() %>%
ggplot(aes(x = caps)) +
geom_histogram(fill = "red", binwidth = 5) +
geom_vline(aes(xintercept = median_cap), size = 1.25) +
geom_label(aes(x = median_cap, y = 15,
label = glue::glue("Median: {median_cap}")),
nudge_x = 0.5, hjust = 0.05, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Caps (Appearances) by Country",
subtitle = "Columns ordered Group A to Group C",
x = "Caps", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_copaAmerica(
title.size = 20,
subtitle.size = 14,
caption.size = 10,
axis.text.size = 10,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(color = "white", size = 0.25),
panel.grid.major.y = element_line(color = "white", size = 0.25),
panel.grid.minor.x = element_line(color = "white", size = 0.25),
panel.grid.minor.y = element_line(color = "white", size = 0.25)) +
facet_wrap(~country, ncol = 3)
caps_country_plot
```

The majority of Japan’s squad have 0 (ZERO) caps, with the
aforementioned three “overage” players taking up most of the proportion
of caps on the team. Bolivia are also taking a untested squad with 8 of
their players with 2 caps or less! Chile, Uruguay, and Argentina bring
their veterans with multiple players over or around 100 caps. From this
data I was surprised that Jefferson Farfan and Paolo Guerrero didn’t
have 100 caps by now…
The player with the most caps is Lionel Messi (130) followed closely by
Diego Godin (126), and Alexis Sanchez (124). On the other hand there are
29 players hopeful of making their first national team appearance at
this tournament with the majority (17 players) coming from Japan.
### Goal distribution
Next I looked at the distribution of goals scored by the midfielders and
strikers of each team. I found out about using
`ggplot2::position_nudge()` for slightly adjusting variables on a
discrete scale in similar fashion to the `nudge_y =` and `nudge_x =`
arguments most people might be familiar with from other geoms. I also
used `ggforce::geom_mark_hull()` to do some labelling.
``` r
goals_country_plot <- squads_df_clean %>%
filter(position %in% c("MF", "FW")) %>%
group_by(country) %>%
mutate(median = median(goals)) %>%
ungroup() %>%
ggplot(aes(x = goals, y = reorder(country, median))) +
ggridges::geom_density_ridges(fill = "red", color = "white", scale = 1.1) +
geom_point(aes(x = median, y = country), position = position_nudge(y = 0.25),
color = "yellow", size = 3) +
ggforce::geom_mark_hull(aes(filter = country == "Argentina" & goals == 67, label = "Lionel Messi: 67 goals"),
label.buffer = unit(15, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 55, label = "Luis Suarez: 55 goals"),
label.buffer = unit(5, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Japan" & goals == 50, label = "Shinji Okazaki: 50 goals"),
label.buffer = unit(2, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 46, label = "Edinson Cavani: 46 goals"),
label.buffer = unit(25, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Chile" & goals == 41, label = "Alexis Sanchez: 41 goals"),
label.buffer = unit(4, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
scale_x_continuous(limits = c(0, 73),
expand = c(0.01, 0.01),
breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70),
labels = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)) +
expand_limits(y = 13.5) +
labs(title = "Distribution of Goals Scored by Midfielders and Strikers",
subtitle = "Copa América 2019 squads, Yellow dot = Median goals",
x = "Goals", y = NULL,
caption = glue::glue("
Source: Wikipedia
Data from prior to start of tournament
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18)
goals_country_plot
```
## Picking joint bandwidth of 2.07

With a lot of these players being more defensively minded or new players
the distribution is heavily skewed but you can see little mounds showing
the top goalscorers for each country and see which countries have their
goalscorers spread out among multiple players such as Brazil, Qatar, and
Peru.
If you know your South American players you can take a good guess at who
are the top goal scorers for each nation. For Colombia the two outlying
mounds are obviously James Rodriguez and Falcao, for example.
Venezuela’s top scorer with 22 is Salomon Rondon and for Brazil, if not
for his injury, a lonesome mound would have appeared for Neymar with 60
goals!
### Player contribution by league
Now let’s check the player contribution to the squads at the Copa
América by league. I’m just going to use the country that the league is
from for simplicity’s sake. Originally I wanted to `left_join()` it with
a ‘country <> domestic league’ table but couldn’t find one and the
league names itself aren’t very meaningful or have awful sponsor names
that obfuscate the country of origin even further.
``` r
player_contrib_league_plot <- squads_df_clean %>%
group_by(country_league) %>%
summarize(n = n()) %>%
ungroup() %>%
ggplot(aes(y = n, x = reorder(country_league, n))) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 4.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(labels = c(0, 5, 10, 15, 20, 25),
breaks = c(0, 5, 10, 15, 20, 25),
limits = c(0, 30),
expand = c(0, 0)) +
labs(title = "Breakdown of Player Contributions by League",
subtitle = glue("
Shown as Country Name
Mexico (Liga MX) contributed 27 players to South American squads"),
x = "League (Country name)", y = "Number of players",
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 10,
axis.text.size = 14,
axis.text.y.size = 11,
axis.title.size = 16,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white"))
player_contrib_league_plot
```

The best of the best players from South American countries will move on
to Europe so the Argentinean league (Superliga Argentina) and the
Brazilian league (Brasileirão - Serie A) do not have as many players as
you might think and as a consequence, the top leagues of England, Spain,
and Italy contribute quite a bit! A lot of the better players but not
quite elite South American players might go to Mexico instead of a
lower-mid European league. With the growth of the MLS a fair number of
players ply their trade there as well.
We can take a more detailed look by creating a table of the proportion
of players from each squad coming from either a domestic league or any
other league. I had to do a lot of wrangling to get the proper output
for the table. After calculating the percentage of domestic players from
a country’s domestic league I added the full data back in. Then I had to
make sure that for each country, the country - domestic league country
was the first row in each of the country groups (so Bolivia - Bolivia,
Bolivia, China, Japan - Japan, Japan - England, etc.). By doing this I
can automatically `tidyr::fill()`-in the rest of the rows of that
country with the ‘percentage of players from domestic league stat’.
``` r
squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
filter(country == country_league) %>%
mutate(perc_from_domestic_league = percent(player_from_league / 23, accuracy = 0.1)) %>%
right_join(squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
ungroup()) %>%
mutate(first = case_when(
country == country_league ~ 1,
TRUE ~ 0)) %>%
arrange(country, desc(first)) %>%
fill(perc_from_domestic_league) %>%
group_by(country) %>%
mutate(perc_from_league = percent(player_from_league / 23, accuracy = 0.1),
country_league = glue::glue("{country_league} - league")) %>%
arrange(desc(player_from_league)) %>%
select(Country = country, `League (country name)` = country_league,
`Number of players from league` = player_from_league,
`Percentage of players from league` = perc_from_league,
`Percentage of players from domestic league` = perc_from_domestic_league) %>%
DT::datatable(rownames = FALSE)
```
## Joining, by = c("country", "country_league", "player_from_league")
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'glue'
## elements may not preserve their attributes

Three interesting facts I found:
- 30% of players on the Brazil squad play for an English team, most
out of any league - squad combination excluding domestic leagues.
- 100% of the Qatar squad play in their domestic league!
- Only one Uruguayan player (4.3%) plays in its domestic league.
### Player contribution by club
In the final plot for this section, I looked at the top 10 clubs
contributing the most players to the tournament. I used
`arrange(desc(n)) %>% slice()` instead of `top_n()` as there were too
many teams tied at 4 players. To set the team names inside the bars I
created a midpoint value `midval` that calculated a value half of the
number of players contributed so the labels were placed neatly.
``` r
player_contrib_club_plot <- squads_df_clean %>%
group_by(club) %>%
summarize(n = n()) %>%
mutate(club = club %>% forcats::as_factor() %>% forcats::fct_reorder(n),
midval = n / 2) %>%
arrange(desc(n)) %>%
slice(1:15) %>%
ggplot(aes(x = club, y = n)) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
geom_text(aes(y = midval, label = club,
family = "Roboto Condensed", fontface = "bold"),
size = 5, color = "white") +
coord_flip() +
scale_y_continuous(breaks = scales::pretty_breaks(),
expand = c(0, 0),
limits = c(0, 10.5)) +
labs(title = "Top 15 Clubs contributing the most players to the Copa América",
x = "Club", y = "Number of players",
caption = "Source: Wikipedia") +
theme_copaAmerica(
title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white")) +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
player_contrib_club_plot
```

With 100% of its players coming from the domestic league it’s not
surprise that the Qatari team, Al-Sadd, is contributing the most players
to the tournament. Tied with another Qatari team, Mexican club America
features 7 players yet none of them are Mexicans (2 Argentineans, 2
Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).
At first I thought Barcelona contributed 8 players until I realized the
Ecuadorian players were coming from the Ecuadorian team called
Barcelona…I had to go all the way back up to the beginning of this
section to fix that small peculiarity. As futbol came to South America
via European colonists and immigrants a lot of teams took up the names
and colors of the teams these Europeans were fond of. Other examples
include Liverpool F.C. (Montevideo, Uruguay), Arsenal de Sarandi (Buenos
Aires, Argentina), and Club Atletico Juventus (Sao Paulo, Brazil -
although they use the colors of Torino F.C.).
If you download the data and type in the code below you can see the
entire club-country list.
``` r
squads_df_clean %>%
group_by(club, country) %>%
summarize(n = n()) %>% View()
```
Match Records
-------------
Now that we got a good look at the composition of the teams, we can take
a look at how they’ve done at every Copa América.
The next code chunk mainly comes from PH Julien and his excellent Kaggle
kernel of [“A Journey Through The History of
Soccer”](https://www.kaggle.com/phjulien/a-journey-through-the-history-of-soccer/).
``` r
## grab football federation affiliations data
federation_files <- Sys.glob("../data/federation_affiliations/*")
df_federations = data.frame(country = NULL, federation = NULL)
for (f in federation_files) {
federation = basename(f)
content = read.csv(f, header=FALSE)
content <- cbind(content,federation=rep(federation, dim(content)[1]))
df_federations <- rbind(df_federations, content)
}
colnames(df_federations) <- c("country", "federation")
df_federations <- df_federations %>%
mutate(country = as.character(country) %>% str_trim(side = "both"))
results_raw <- readr::read_csv("../data/results.csv")
results_copa <- results_raw %>%
filter(tournament == "Copa América") %>%
rename(venue_country = country,
venue_city = city) %>%
mutate(match_num = row_number())
## combine with federation affiliations
results_copa_home <- results_copa %>%
left_join(df_federations,
by = c("home_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(home_federation = federation)
results_copa_away <- results_copa %>%
left_join(df_federations,
by = c("away_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(away_federation = federation)
## combine home-away
results_copa_cleaned <- results_copa_home %>%
full_join(results_copa_away)
```
Unfortunately, this data does not have **penalty** results as those
games are all counted as a draw (as technically that is the actual
result). Considering there a lot of cagey knock-out rounds that finish
in a penalty shoot-out (including the last two finals…) it is
unfortunate but that’s just the data you have sometimes. There is a way
to web-scrape all the Copa América results and assign Win-Lose to those
games that went to penalties but I’ll leave that for another time. Also,
there is no info on what stage of the tournament the match recorded is
in.
``` r
results_copa_cleaned <- results_copa_cleaned %>%
mutate(
home_federation = case_when(
home_team == "USA" ~ "Concacaf",
TRUE ~ home_federation),
away_federation = case_when(
away_team == "USA" ~ "Concacaf",
TRUE ~ away_federation)) %>%
select(-contains("federation"), -contains("venue"),
-neutral, date, home_team, home_score, away_team, away_score,
tournament, venue_city)
```
``` r
glimpse(results_copa_cleaned)
```
## Observations: 787
## Variables: 8
## $ date 1916-07-02, 1916-07-06, 1916-07-08, 1916-07-10, 19...
## $ home_team "Chile", "Argentina", "Brazil", "Argentina", "Brazi...
## $ away_team "Uruguay", "Chile", "Chile", "Brazil", "Uruguay", "...
## $ home_score 0, 6, 1, 1, 1, 0, 4, 4, 1, 4, 5, 1, 6, 2, 0, 3, 4, ...
## $ away_score 4, 1, 1, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 3, 2, 1, 1, ...
## $ tournament "Copa América", "Copa América", "Copa América", "Co...
## $ match_num 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ venue_city "Buenos Aires", "Buenos Aires", "Buenos Aires", "Bu...
Now that it’s nice and cleaned up I created a function that reshapes the
data so that it’s set from a certain team’s perspective with the “team”
argument. You can also set the function to look for only results against
a certain opponent by filling in the `versus` argument.
``` r
copaAmerica_resultados <- function(data, team, versus = NA) {
## team of interest: ex. 'Brazil'
team_var <- enquo(team)
todos_partidos <- data %>%
## filter only for results of team of interest
filter(home_team == !!team_var | away_team == !!team_var) %>%
## reshape columns to team vs. opponent
mutate(
opponent = case_when(
away_team != !!team_var ~ away_team,
home_team != !!team_var ~ home_team),
home_away = case_when(
home_team == !!team_var ~ "home",
away_team == !!team_var ~ "away"),
equipo_goals = case_when(
home_team == !!team_var ~ home_score,
away_team == !!team_var ~ away_score),
opp_goals = case_when(
home_team != !!team_var ~ home_score,
away_team != !!team_var ~ away_score)) %>%
## label results from team's perspective
mutate(
result = case_when(
equipo_goals > opp_goals ~ "Win",
equipo_goals < opp_goals ~ "Loss",
equipo_goals == opp_goals ~ "Draw")) %>%
mutate(result = result %>% forcats::as_factor() %>% forcats::fct_relevel(c("Win", "Draw", "Loss"))) %>%
select(-contains("score"), -contains("team"), -match_num) %>%
rename(Date = date, Tournament = tournament, `Venue` = venue_city, Opponent = opponent, `Home / Away` = home_away,
`Goals For` = equipo_goals, `Goals Against` = opp_goals, Result = result)
if (is.na(versus) | is.null(versus)) {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
## sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
## spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric)
missing_cols <- c("Win", "Draw", "Loss") %>%
map_dfr( ~tibble(!!.x := numeric()))
resultados_totalmente <- resultados_totalmente %>%
bind_rows(missing_cols) %>%
mutate(Win = if_else(is.na(Win), 0, Win),
Draw = if_else(is.na(Draw), 0, Draw),
Loss = if_else(is.na(Loss), 0, Loss)) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
} else {
## opponent: ex. 'Argentina'
todos_partidos <- todos_partidos %>%
filter(Opponent == versus)
if (nrow(todos_partidos) == 0) {
return(glue("{team} has never played {versus} at the Copa América!"))
} else {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
# sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
# spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
}
}
}
```
The output is either a dataframe of all the games a team has been
involved in as well as the record of the team against other teams in the
Copa América or a message saying that the team you picked has never
played against the opponent you picked.
### Japan
``` r
copaAmerica_resultados(data = results_copa_cleaned,
team = "Japan", versus = "Brazil")
```
## Japan has never played Brazil at the Copa América!
Oh… that’s right Japan has never played against Brazil at the Copa…
``` r
resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned, team = "Japan")
```
## Warning: Unknown levels in `f`: Win
``` r
resultados_japon[[2]] %>%
DT::datatable(caption = "Japan's record in the Copa América", rownames = FALSE)
```

Japan’s only previous journey to the Copa América was in the 1999
edition where they lost all 3 games. They were invited for the 2011
edition but withdrew due to the Tohoku Earthquake and were replaced by
Costa Rica. Japanese football has come a long way since 1999 but with a
young squad it will be a uphill battle to get 3 points against any of
their Group C opponents, Uruguay, Chile, and Ecuador.
### Colombia
``` r
resultados_colombia <- copaAmerica_resultados(data = results_copa_cleaned, team = "Colombia")
resultados_colombia[[2]] %>%
DT::datatable(caption = "Colombia's record in the Copa América", rownames = FALSE)
```

Despite a recent resurgence of the Colombia national team they have not
been able to match the feats of the 2001 side that won the Copa with
their best place finish since then coming 3rd in 2004. The 2001 team
were not only unbeaten but also did not concede a single goal throughout
the tournament!
### Superclasico Sudamericano: Brazil vs. Argentina
``` r
resultados_de_brazil <- copaAmerica_resultados(data = results_copa_cleaned,
team = "Brazil", versus = "Argentina")
resultados_de_brazil[[1]] %>%
DT::datatable(caption = "Brazil vs. Argentina in the Copa América", rownames = FALSE)
```

``` r
resultados_de_brazil[[2]] %>%
DT::datatable(caption = "Brazil vs. Argentina in the Copa América", rownames = FALSE)
```

Brazil does not have a good overall record vs. Argentina but they have
not lost against their rivals at the Copa América since the 1991
edition. The “draw” in 2004 was actually in the final where they won 4-2
on penalties.
What I found odd was that the Copa América seems to have a very low
priority to certain countries, especially Brazil who have repeatedly
sent their B or C teams to the tournament in favor of sending their best
team to other tournaments or resting star players. Funnily enough these
understrength Brazilian squads have actually won the entire tournament a
few times, most notably in 2007 against a full strength Argentina side
containing the likes of Zanetti, Riquelme, Cambiasso, Tevez, a young
Messi/Mascherano, Cambiasso, et al!
Player Profiles
---------------
After looking at the history of the competition and the composition of
the squads I examined the players and their form coming into the Copa
América. In recent years football analytics has really taken off and
there have been many strides made in creating more informative
statistics to assess players’ abilities, the most prominently being the
**xG** statistics. This is the first time I talk about **xG** in any
length/depth so this introduction is as much to solidify my
understanding as well as yours!
### What IS xG?
- **xG**: Quantitative measure (between 0 and 1) that assigns a
probability that a shot taken will result in a goal based on a
variety of variables and is used for evaluating the quality of
chances and predicting players’ and teams’ future performances.
- **xA**: Quantitative measure (between 0 and 1) that assigns a
probability that a given pass will result in an assist for a goal
based on a variety of variables.
Common variables used in the models that output xG statistics are the
distance and angle of a shot, the body part used, rebound, among others.
Similar to how you might assess your favorite striker’s chances of
scoring just as he is lining up to take a shot: Is the shot a header? Is
he trying to score from a cross in regular play or a corner kick? Are
there a crowd of defenders in front of him or is he one-on-one with the
goalkeeper? Etc. You might think **who** takes a shot would be a genuine
factor but in actuality it tells you a lot less about the chances of a
goal compared to the location of the shot.
Note that there isn’t a SINGLE xG model. You can check out a blog post
comparing different people’s xG models
[here](https://mackayanalytics.nl/2017/06/19/how-accurate-are-xg-models-ii-the-big-chance-dilemma/).
People and organizations (from Statsbomb to OptaPro) have their own
ideas about what **could** be the important variables in play and as
such it’s important to report from which source you got your data from
as the stats can differ between models. A few things xG does not factor
in are things like goalkeeper performance (someone pulling off
incredible saves or letting in a poor shot) and one must also consider
the fact that team style of play and the quality of a player’s
teammates. When judging players based on these stats it is important to
be aware of contextual factors like the team they play for, their
opponent, and the player’s position/role in the team.
From xG and xA more granular statistics such as xGChain and xGBuildup
were created to be able to dig a little deeper into who is contributing
to chance creation, I’ll introduce the latter two a bit later. As the
field has grown new statistics have popped up such as [Karun
Singh](https://twitter.com/karun1710/)’s “expected threat” or xT. You
can check out an introduction to xT from
[here](https://karun.in/blog/expected-threat.html).
Of course, these statistics only tell a part of the story and are
definitenly not the be-all-and-end-all. In the context of this current
blog post, these stats only tell the story about how these players did
for their club teams this past season rather than for their national
team. Even still it gives us a good idea of what kind of form these
players are in coming into this tournament.
You might also want to watch these Youtube videos by
[TifoFootball](https://www.youtube.com/channel/UCGYYNGmyhZ_kwBF_lqqXdAQ)
for a quick primer on [xG](https://www.youtube.com/watch?v=zSaeaFcm1SY)
and [xA](https://www.youtube.com/watch?v=1MdlkuzLdj4).
### understat data
For the data I used the website, `understat.com`. Their xG models were
created via training a neural network on a dataset consisting of over
100,000 shots using more than 10 different variables. Getting data from
`understat` has been made easy by Ewen Henderson’s `understatr` package
available from [Github](https://github.com/ewenme/understatr) (he’s also
the guy that made the [ghibli](https://github.com/ewenme/ghibli) color
palette!). I tried to pick a wide selection of attacking players but I
was also limited by the fact that `understat` only has data for
teams/players from six European leagues (Premier League, Bundesliga,
Serie A, La Liga, Ligue 1, and Russian Premier League).
For **Peru** I would have chosen Paolo Guerrero but as he plays in
Brazil now I went with Jefferson Farfan (who hasn’t played as many games
as the other players used for comparison unfortunately…). For **Chile**
I would pick Eduardo Vargas but he as doesn’t play for a team covered by
understat I went with Alexis Sanchez, who had a woeful season and only
played \~600 minutes despite appearing in \~20 league matches and later
added Arturo Vidal. For **Brazil** I included Neymar initially but since
he won’t actually be playing I’ll keep him for comparison’s sake but
also include Gabriel Jesus and Roberto Firmino who have been fighting
for the starting striker spot. Note that these two aren’t the ones
replacing Neymar **positionally**. In Neymar’s left-wing position I can
see David Neres or Phil Coutinho replacing him (Richarlison and Willian
mostly play on the right). (Edit: In the first match vs. Bolivia, David
Neres started off on the left while Richarlison played on the right,
Coutinho played just behind Bobby Firmino)
The other nation’s strikers/attacking midfielders don’t play for the six
European leagues covered by understat or like in Shinji Okazaki’s case
just did not play as many minutes/games during the season. To get the
data I created a list of the player codes and use `purrr::map()` to
iterate each through the `understatr::get_player_seasons_stats()`
function.
``` r
player_codes <- c(2097, 2099, 813, ## Messi, Neymar, Rondon
498, 4299, 696, ## Alexis, Farfan, Falcao
3294, 2098, 5543, ## Cavani, Suarez, G. Jesus
482, 1148, 2249, ## Bobby, Duvan, James
1089, 3553, 488, ## Cuadrado, Di Maria, Coutinho
222) ## Arturo Vidal
understat_data <- player_codes %>%
map(., ~ understatr::get_player_seasons_stats(.x)) %>%
reduce(bind_rows) %>%
select(-player_id, -position, -yellow, -red)
```
``` r
glimpse(understat_data)
```
## Observations: 83
## Variables: 15
## $ games 34, 36, 34, 33, 38, 17, 20, 30, 34, 33, 32, 36, 38...
## $ goals 36, 34, 37, 26, 43, 15, 19, 13, 24, 22, 11, 7, 8, ...
## $ shots 170, 196, 179, 158, 187, 55, 91, 105, 124, 95, 89,...
## $ time 2704, 2995, 2832, 2726, 3374, 1443, 1797, 2652, 30...
## $ xG 25.997169, 28.946281, 26.885174, 27.101910, 35.891...
## $ assists 13, 12, 9, 16, 18, 7, 13, 11, 12, 7, 7, 3, 2, 2, 0...
## $ xA 15.33516552, 15.10040562, 13.95513140, 15.87127814...
## $ key_passes 93, 87, 79, 77, 95, 43, 70, 91, 102, 52, 32, 23, 2...
## $ year 2018, 2017, 2016, 2015, 2014, 2018, 2017, 2016, 20...
## $ team_name "Barcelona", "Barcelona", "Barcelona", "Barcelona"...
## $ npg 32, 32, 31, 23, 38, 10, 15, 12, 19, 21, 11, 7, 8, ...
## $ npxG 22.280909, 25.973170, 21.682231, 21.899351, 31.432...
## $ xGChain 38.459877, 48.180634, 42.525045, 41.996866, 54.753...
## $ xGBuildup 10.6987990, 21.6344040, 18.1335122, 15.1963644, 19...
## $ player_name "Lionel Messi", "Lionel Messi", "Lionel Messi", "L...
As you can see the data consists of a row for each player and each year
(from the 2014/2015 season to the 2018/2019 season). I tried to mitigate
the fact that some players played a lot more minutes than others by
standardize everything to a ‘per 90 minutes’ value but this does have
its own disadvantages. These include the fact that players who play a
lot of minutes (as regular starting members) may not have as high ‘per
90’ stat even though their production with all these minutes might
suggest that they are consistently performing and producing at a high
level.
It’ll be a bit crowded (kind of like a spilt box of Skittles…) but let’s
check out the key metrics for all the players at once.
Note: npg = non-penalty goals, npxG = non-penalty goals xG
``` r
comparison_data <- understat_data %>%
filter(year == 2018) %>%
select(-games, -team_name, -year) %>%
rename(Shots = shots, KP = key_passes) %>%
gather(key = "key", value = "value", -player_name, -time) %>%
mutate(key = forcats::as_factor(key) %>%
forcats::fct_relevel(.,
"xG", "goals", "npxG", "npg",
"xA", "assists", "xGChain", "xGBuildup",
"Shots", "KP"))
comparison_strikers_plot <- comparison_data %>%
filter(key != "Shots", key != "KP",
key != "xGBuildup", key != "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
geom_vline(xintercept = 4.5, size = 2) +
geom_vline(xintercept = 5.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.26)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Comparison: Top attackers at the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_strikers_plot
```

As usual in these types of charts, Messi is leading a lot of the metrics
here and showing consistency too with having played the third highest
amount of minutes out of the selected players. It’s helpful to have the
xG/xA stats next to the actual goals/assists as it provides an
indication of whether the player in question is scoring shots that he
probabilistically should be scoring. When a player’s actual goal count
is higher than their xG stat this suggests that the player is
**“exceeding their xG”** or that they are scoring from shots that are
historically hard to score from. It can be seen as a marker of an elite
finisher as they are putting away chances from difficult situations
consistently. In terms of assists and xA Alexis Sanchez, who only played
about 600 minutes, looks a lot better than in reality due to the
aforementioned disadvantage of standardizing everything to a “per 90
minutes” value. Normally you would have a cut-off based on a certain
**minimum amount of minutes** but as I mentioned I was rather limited in
my choice of players.
A way to take a closer look at xG - Goals and xA - Assists is to use a
simple dot plot with a line going through the 45 degree angle. Those
below the line are underperforming relative to their xG or xA stat,
those over it are overachieving (“exceeding” their xG/xA stat) while
those just on the line are scoring or assisting right around what the
model expects the player to be. I use non-penalty xG below as penalties
have around \~0.75 xG (give or take a few percentage points depending on
the model) and can inflate the stats of those players who take a lot of
penalties and score them, especially if they weren’t the ones who earned
the penalty themselves.
``` r
expected_goal_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, npxG, xG, goals) %>%
mutate_at(c("npxG", "xG", "goals"), ~. / time * 90) %>%
ggplot(aes(x = npxG, y = goals, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
scale_x_continuous(limits = c(0, 1.1),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 1.3),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Expected vs. Actual Goals",
subtitle = "For select group of attacking players with data available from understat.com",
x = "Non-penalty xG per 90 minutes",
y = "Goals per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_goal_plot
```

Gabriel Jesus is quite clearly below the 45 degree line meaning that he
has been very poor at finishing chances (and/or incredibly unlucky).
After a poor World Cup where he scored 0 goals as a starter, he is
really going to have to step up to fill Neymar’s goalscoring boots for
this tournament. However, his build-up play for City has still been good
this past season and he has been scoring for Brazil in the friendlies
leading up to the tournament so it’s going to be a hard decision for
Tite to decide on who starts against Bolivia (edit: Firmino started and
contributed an assist while Jesus replaced him in the 65th minute).
``` r
expected_assists_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, xA, assists) %>%
mutate_at(c("xA", "assists"), ~. / time * 90) %>%
ggplot(aes(x = xA, y = assists, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
labs(title = "Expected vs. Actual Assists",
subtitle = "For select group of attacking players with data available from understat.com",
x = "xA per 90 minutes",
y = "Assists per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
scale_x_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_assists_plot
```

One thing to keep in mind is that xA does not take into account the
recipient of the assist pass. Even if the pass given had a high expected
assist value the receiving player still might not have the quality to
put it away through no fault of the passer. This might explain why most
of the players with a higher xA among this group don’t have the assists
to match. It can also be that these players are also the ones playing a
lot more minutes and the volume of chances they create just aren’t
translating to goals all the time.
### Key Passes, Shots, xGChain, and xGBuildup (per 90)
I separated “key passes” and “shots” as well as “xGChain” and
“xGBuildup” from the rest as these two sets were on a very different
scale.
``` r
kp_shots_plot <- comparison_data %>%
filter(key == "Shots" | key == "KP") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 6),
breaks = c(0, 1, 2, 3, 4, 5, 6),
labels = c(0, 1, 2, 3, 4, 5, 6)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = glue("
KP = Key Passes
For select group of attacking players with data available from understat.com"),
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
kp_shots_plot
```

- xGChain: Quantitative measure that is the combined sum of the xG of
every possession that ends in a shot that a player is involved in.
The same derived value is given to each of the players involved and
allows us to credit players for attacking contributions outside of
just shots (xG) and assists (xA).
- xGBuildup: Similar to xGChain but excluding shots and assists. This
is in response to xGChain values still being dominated by the xG and
xA from shots and assists, respectively.
``` r
xgbuildup_xgchain_plot <- comparison_data %>%
filter(key == "xGBuildup" | key == "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.55),
breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5),
labels = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
xgbuildup_xgchain_plot
```

Although Gabriel Jesus has been poor at finishing his chances as seen in
previous graphs, his xGChain and xGBuildup stat makes it clear that he
is still contributing to City’s attack outside of scoring goals himself
(not to mention all the defensive work he does as well).
For example below, the stats are able to clearly differentiate between
James, who is more of a playmaker, compared to Falcao and Duvan who are
traditional number 9s with his superior xGBuildup, xGChain, and Key
Passes values. For a more detailed overview on xGChain and xGBuildup
check out Statsbomb’s article
[here](https://statsbomb.com/2018/08/introducing-xgchain-and-xgbuildup/).
``` r
## keep colors for Colombians consistent with other plots
colombia_pal <- c("#000033", "#005300", "#009FFF", "#00FFBE")
comparison_colombia_plot <- comparison_data %>%
filter(!key %in% c("xG", "goals", "npxG", "npg", "xA", "assists"),
player_name %in% c("James Rodríguez", "Falcao", "Duván Zapata", "Juan Cuadrado")) %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.05, 0.05),
breaks = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4),
labels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4)) +
scale_fill_manual(values = colombia_pal, name = "Player") +
labs(title = "Comparison: Stars of Colombia",
subtitle = "KP: Key Passes",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 20,
subtitle.size = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_colombia_plot
```

There are lots of different ways to visualize this data, most famously
the radar charts created by [Statsbomb](http://statsbomb.com/). Using
the data you can also compare and evaluate players in many different
ways and using `understatr` package you could add a few more players
like Paulo Dybala, Miguel Almiron, and more! I could probably do a whole
other article using just this data but I’ll leave it here for now.
Conclusion
----------
Throughout this blog post I talked about some of the historical records,
squad compositions, match records, and finally the player profiles of
attacking players at this summer’s Copa América. Using the power of R it
is really easy to webscrape and visualize data in a way that is
informative and aesthetically pleasing. I wanted to finish this before
the tournament started but other life things got in the way as well as
the fact that the amount of content ballooned out of control (especially
the xG section) so I had to cut down a lot.
It’s been fun reading the articles on the Copa América website and
seeing how far my “intermediate-but-very-out-of-practice”-level of
Spanish can get me to understand the content, here is one that I
particularly liked reading: [14 Estadisticas de la Copa
América](https://copaamerica.com/es/noticias/a-14-dias-del-inicio-de-la-copa-14-estadisticas-de-la-copa-america).
With so many tournaments going on right now (and with the African Cup of
Nations starting in a few days) a lot of the news media is spread thin
right now but there are still some quality articles out there to read
about the Copa, like [this
article](https://www.bbc.com/sport/football/48600098) from BBC’s South
American football expert, Tim Vickery.
After the first round of games, a few points of discussion:
- After an extremely lacklustre performance vs. Colombia, how does
Argentina bounce back? What tactical changes need to be made?
- Qatar impressed against Paraguay but can they pull off a major upset
vs. Colombia?
- How will Japan line-up against Uruguay after a losing by a scoreline
that didn’t really do their performance justice? How will manager
Moriyasu balance experience vs. youth, will he start with veteran
Okazaki after Ueda’s numerous misses vs. Chile?
- Can Brazil earn a early ticket to the next round vs. Venezuela after
a clinical but not excellent performance vs. Bolivia? Will they be
able to keep up their streak of winning the Copa every time they
have hosted it?
Thanks for reading and…
¡Buena suerte a todos equipos!
================================================
FILE: Copa America 2019/copa_america2019.md
================================================
Another summer and another edition of the Copa América! Along with the
Africa Cup of Nations, Nations League finals, the Women’s World Cup,
Under-21 European Championship AND the Gold Cup this is yet another
soccer-filled season after last year’s World Cup and the Asian Cup
earlier this year (I also did a blog post on these last two tournaments
which you can see [here (World
Cup)](https://ryo-n7.github.io/2018-06-29-visualize-worldcup/) and [here
(Asian Cup)](https://ryo-n7.github.io/2019-01-11-visualize-asian-cup/)).
There is so much football going on at once even I can’t keep up,
especially with the time difference! To not redo all the previous
visualizations with Copa América data I tried to find new sources of
data and other forms of visualizations to give some insight into the
players and teams competing to be the champion of South America. You can
find all the code I used in this blogpost here and you can also find
other soccer related data viz in my
[soccer\_ggplot](https://github.com/Ryo-N7/soccer_ggplots) Github repo.
The sections will go from a very macro-level view of the **historical
records** of the tournament, to the **squads** competing, the teams’
**match record** in the Copa América, and finally to a micro-level view
of various attacking players using **xG** statistics.
¡Vámonos!
Packages
--------
``` r
library(dplyr) ## data wrangling
library(tidyr) ## data wrangling
library(purrr) ## data wrangling and iteration
library(stringr) ## data wrangling
library(rvest) ## webscraping
library(polite) ## webscraping (Github only pkg)
library(ggplot2) ## plotting
library(scales) ## plotting scales
library(ggimage) ## images for flags
library(ggforce) ## plotting text labels
library(cowplot) ## plotting grid
library(glue) ## text
library(ggrepel) ## plotting text labels
library(magick) ## plotting
library(DT) ## tables
library(ggtextures) ## soccer ball emoji as geom_col()
library(extrafont) ## fonts: Roboto Condensed
loadfonts()
```
theme\_copaAmerica
------------------
I wanted to have all the plots in this blogpost to have a consistent
color theme. As the tournament is going to be held in Brazil, I went
with a color theme based on its flag with blue, yellow, and green being
the primary colors.
``` r
theme_copaAmerica <- function(
title.size = 24,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 14,
axis.text.x.size = 12,
axis.text.y.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(size = 0.5, color = "white"),
panel.grid.major.y = element_line(size = 0.5, color = "white"),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.ticks = element_line(color = "white")) {
## Theme:
theme(text = element_text(family = "Roboto Condensed", color = "white"),
plot.title = element_text(family = "Roboto Condensed", face = "bold",
size = title.size, color = "yellow"),
plot.subtitle = element_text(size = subtitle.size),
plot.caption = element_text(size = caption.size),
panel.background = element_rect(fill = "#009b3a"),
plot.background = element_rect(fill = "#002776"),
axis.text = element_text(size = axis.text.size, color = "white"),
axis.text.x = element_text(size = axis.text.x.size, color = "white"),
axis.text.y = element_text(size = axis.text.y.size, color = "white"),
axis.title = element_text(size = axis.title.size),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
panel.grid.major.x = panel.grid.major.x,
panel.grid.major.y = panel.grid.major.y,
panel.grid.minor.x = panel.grid.minor.x,
panel.grid.minor.y = panel.grid.minor.y,
strip.text = element_text(color = "yellow", face = "bold",
size = strip.text.size,
margin = margin(4.4, 4.4, 4.4, 4.4)),
strip.background = element_blank(),
axis.ticks = axis.ticks
)
}
```
Top Goal Scorers // Goleadores
------------------------------
For this plot I took the stats from the Spanish version of the Wikipedia
page as it had more content. I used `purrr::flatten_df()` to squish the
list output into a dataframe then set the names of each column using
`purrr::set_names()`.
``` r
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_top_scorers <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(95)") %>%
html_table() %>%
flatten_df() %>%
set_names(c("player", "country", "goals")) %>%
mutate(image = "https://www.emoji.co.uk/files/microsoft-emojis/activity-windows10/8356-soccer-ball.png")
```
``` r
glimpse(copa_top_scorers)
```
## Observations: 22
## Variables: 4
## $ player "Norberto Méndez", "Zizinho", "Lolo Fernández", "Sever...
## $ country "ARG Argentina", "BRA Brasil", "PER Perú", "URU Urugua...
## $ goals 17, 17, 15, 15, 13, 13, 13, 13, 13, 12, 12, 11, 11, 11...
## $ image "https://www.emoji.co.uk/files/microsoft-emojis/activi...
Like in the Asian Cup blogpost I use [Claus
Wilke](https://twitter.com/ClausWilke/)’s
[ggtextures](https://github.com/clauswilke/ggtextures) package to use
**soccer ball emoji** as the column image in the plot.
``` r
copa_goleadores_raw_plot <- copa_top_scorers %>%
head(5) %>%
ggplot(aes(x = reorder(player, goals), y = goals,
image = image)) +
geom_isotype_col(img_width = grid::unit(1, "native"), img_height = NULL,
ncol = NA, nrow = 1, hjust = 0, vjust = 0.5) +
geom_text(aes(label = goals, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(breaks = c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18),
expand = c(0, 0),
limits = c(0, 19)) +
labs(title = "Top Scorers of the Copa América",
subtitle = glue("
Most goals in a single tournament: 9
Humberto Maschio (Argentina), Javier Ambrois (Uruguay), Jair (Brazil)"),
y = "Number of Goals", x = NULL,
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
theme_copaAmerica(title.size = 26,
subtitle.size = 16,
caption.size = 12,
axis.text.size = 18,
axis.title.size = 18,
panel.grid.major.y = element_blank(),
axis.ticks = element_blank())
## Add flags to y-axis:
axis_image <- axis_canvas(copa_goleadores_raw_plot, axis = 'y') +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 16.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/1/1a/Flag_of_Argentina.svg",
y = 12.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/f/fe/Flag_of_Uruguay.svg",
y = 9, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/d/df/Flag_of_Peru_%28state%29.svg",
y = 5.25, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 1.5, scale = 1.8)
copa_goleadores_plot <- ggdraw(insert_yaxis_grob(copa_goleadores_raw_plot, axis_image, position = "left"))
copa_goleadores_plot
```

Most of these players aren’t ones you might recognize. The Copa América
used to be held a lot more regularly (and sometimes erratically) until
this century so players had a lot more opportunities to score goals. All
five of the players you see here played in the 1930s-1950s when there
was a tournament every one or two years. Out of currently active
players, Peruvian legend Paolo Guerrero has 11 goals along with Eduardo
Vargas (from Chile). (Edit: after the Chile - Japan game, Vargas is on
12…) Another player you might recognize that was actually tied with
Ademir for 5th place, along with three other players, was Gabriel
Batistuta (“Batigol”).
Winners of the Copa América
---------------------------
After grabbing the data from the Wikipedia page I used a variety of
functions to clean and reshape the dataset like `tidyr::separate()` to
split the number of occurences and the year.
``` r
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_campeones <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(10)") %>%
html_table() %>%
flatten_df()
copa_campeones_limpia <- copa_campeones %>%
janitor::clean_names() %>%
slice(1:8) %>%
select(1:4) %>%
set_names(c("team", "winners", "runners_up", "third_place")) %>%
separate(winners, into = c("Champions", "first_place_year"),
sep = " ", extra = "merge") %>%
separate(runners_up, into = c("Runners-up", "second_place_year"),
sep = " ", extra = "merge") %>%
separate(third_place, into = c("Third Place", "third_place_year"),
sep = " ", extra = "merge") %>%
mutate_all(list(~str_replace_all(., "–", "0"))) %>%
mutate_at(vars(contains("num")), funs(as.numeric)) %>%
gather(key = "key", value = "value", -team,
-first_place_year, -second_place_year, -third_place_year) %>%
mutate(key = as.factor(key),
value = as.numeric(value),
team = team %>% str_replace(., "[A-Z]{3}", "") %>% str_trim(.),
team = case_when(team == "Brasil" ~ "Brazil",
TRUE ~ team)) %>%
mutate(key = forcats::fct_relevel(key,
"Champions",
"Runners-up",
"Third Place")) %>%
arrange(key, desc(value)) %>%
mutate(team = forcats::as_factor(team),
order = row_number())
```
I also wanted to add flags to this plot but
`cowplot::insert_yaxis_grob()` is unfortunately not compatible with
facets. I used `stringr::str_wrap()` to format the subtitle nicely while
I used `glue::glue()` to avoid having the use ‘’ to create a new line
for the caption.
``` r
copa_ganadores_plot <- copa_campeones_limpia %>%
ggplot(aes(value, forcats::fct_rev(team), color = key)) +
geom_point(size = 10) + # 10
geom_text(aes(label = value),
size = 5, color = "black", # 5
family = "Roboto Condensed", fontface = "bold") +
scale_color_manual(values = c("Champions" = "#FFCC33",
"Runners-up" = "#999999",
"Third Place" = "#CC6600"),
guide = FALSE) +
scale_x_continuous(breaks = c(1, 5, 10, 15),
labels = c(1, 5, 10, 15),
limits = c(-1, 16)) +
labs(x = "Number of Occurrence", y = NULL,
title = "Most Successful Teams of the Copa América!",
subtitle = str_wrap("Ordered by number of Copa América(s) won. Argentina missed the chance to leapfrog Uruguay after consecutive final losses in the previous two tournaments!", width = 80),
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
facet_wrap(~key) +
theme_copaAmerica(subtitle.size = 14,
caption.size = 10)
copa_ganadores_plot
```

What’s surprising to note is that Pele never won a Copa América with
Brazil, although he did get Best Player and Top Scorer in the 1959
edition of the tournament. Even more bizarrely Diego Maradona has never
won it either! He didn’t play in either of the 1991 and 1993 editions
where Argentina won their 13th and 14th Copas.
Copa América Squad Profiles
---------------------------
We just looked at what happened in the past but who are the players
competing in the tournament this year? To take a quick look I
web-scraped the squads of each of the competing teams from Wikipedia.
I created a list of the `xpath`s for each of squads and using
`purrr::map()` I grabbed the data for each participating country. After
I got some meta-information about the country name and the group I
created a list-column that stores the squad data as a dataframe in its
own column. To explode this out I used `tidyr::unnest()` to reshape the
entire dataframe to have one row with all the data for each player in
every squad.
To get a clean dataset I use some `stringr::str_*()` functions to
properly format the character strings such as the player positions,
ages, date of births.
``` r
squads_df_clean <- squads_df_raw %>%
janitor::clean_names() %>%
select(-delete, squad_num = no,
position = pos, birth_age = date_of_birth_age) %>%
mutate(position = position %>% str_replace_all(., "[1-9]", ""),
birth_age = birth_age %>% str_extract_all(., pattern = "\\([^()]+\\)")) %>% unnest(birth_age) %>%
group_by(player) %>%
mutate(colnum = seq_along(player)) %>%
spread(key = colnum, value = birth_age) %>%
ungroup() %>%
select(everything(), dob = `1`, age = `2`) %>%
mutate(dob = dob %>% str_replace_all(., "[()]", "") %>% lubridate::as_date(),
age = age %>% str_extract(., "[0-9]+") %>% as.integer,
country = forcats::fct_relevel(country,
"Brazil", "Argentina", "Uruguay",
"Peru", "Qatar", "Chile",
"Venezuela", "Paraguay", "Japan",
"Bolivia", "Colombia", "Ecuador",
),
club = case_when(
club == "Barcelona" & country == "Ecuador" ~ "Barcelona (Ecuador)",
TRUE ~ club))
```
``` r
glimpse(squads_df_clean)
```
## Observations: 276
## Variables: 12
## $ name 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,...
## $ group "A", "A", "A", "A", "A", "A", "A", "A", "A", "A...
## $ country Brazil, Brazil, Brazil, Brazil, Brazil, Brazil,...
## $ squad_num 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ position "GK", "DF", "DF", "DF", "MF", "DF", "FW", "MF",...
## $ player "Alisson", "Thiago Silva", "Miranda", "Marquinh...
## $ caps 36, 79, 57, 36, 36, 40, 3, 10, 29, 65, 49, 15, ...
## $ goals 0, 7, 3, 1, 0, 2, 1, 0, 16, 8, 14, 1, 7, 0, 0, ...
## $ club "Liverpool", "Paris Saint-Germain", "Internazio...
## $ country_league "England", "France", "Italy", "France", "Spain"...
## $ dob 1992-10-02, 1984-09-22, 1984-09-07, 1994-05-14...
## $ age 26, 34, 34, 25, 27, 33, 22, 22, 22, 30, 27, 28,...
### Age-histogram
Using this data I can plot a bunch of histograms:
``` r
age_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_age = median(age)) %>%
ungroup() %>%
ggplot(aes(x = age)) +
geom_histogram(fill = "red", binwidth = 1) +
geom_vline(aes(xintercept = median_age), size = 1.2) +
geom_label(aes(x = median_age, y = 8,
label = glue::glue("Median: {median_age}")),
nudge_x = 0.5, hjust = 0.1, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Age Distribution of Copa América squads",
subtitle = "Columns ordered Group A to Group C",
x = "Age", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0),
breaks = scales::pretty_breaks()) +
theme_copaAmerica(title.size = 22,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white")) +
facet_wrap(~country, ncol = 3)
age_country_plot
```

In terms of age, Japan have the youngest team with a median of 21, 4
years younger than the next youngest team, Qatar. The rest have a fairly
balanced spread of ages from 20 to early-mid 30s with most of the
medians hovering around 27 years of age. The reason for Japan’s
extremely young squad is due to the fact that the full-strength Japan
team has played in both the World Cup and the Asian Cup in the past
year. Along with the fact that the Tokyo Olympics are next year, it was
decided to use the invitation to the Copa América as a trial-by-fire for
the young stars of the future. Much like in a real Olympic squad, the
team contains three “overage” players in World Cup 2010/2014/2018
goalkeeper Eiji Kawashima, Premier League winner Shinji Okazaki, and
Getafe playmaker Gaku Shibasaki.
The oldest player will be Brazil captain Dani Alves at 36 with
Paraguay’s Oscar Cardozo only two weeks younger. On the other hand, the
youngest player is Japan’s 18-year old prodigy Takefusa Kubo, the
ex-Barcelona youth player who only just recently moved to Real Madrid!
In light of his transfer a lot of eyes will be on him to see if he can
produce some Captain Tsubasa-esque performances for a very inexperienced
Japan team gearing up for the Tokyo Olympics!
### Caps histogram
When considering the experience of a squad it’s not enough to look at
ages but one needs to look at the caps or appearances for the national
team as well.
``` r
caps_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_cap = median(caps)) %>%
ungroup() %>%
ggplot(aes(x = caps)) +
geom_histogram(fill = "red", binwidth = 5) +
geom_vline(aes(xintercept = median_cap), size = 1.25) +
geom_label(aes(x = median_cap, y = 15,
label = glue::glue("Median: {median_cap}")),
nudge_x = 0.5, hjust = 0.05, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Caps (Appearances) by Country",
subtitle = "Columns ordered Group A to Group C",
x = "Caps", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_copaAmerica(
title.size = 20,
subtitle.size = 14,
caption.size = 10,
axis.text.size = 10,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(color = "white", size = 0.25),
panel.grid.major.y = element_line(color = "white", size = 0.25),
panel.grid.minor.x = element_line(color = "white", size = 0.25),
panel.grid.minor.y = element_line(color = "white", size = 0.25)) +
facet_wrap(~country, ncol = 3)
caps_country_plot
```

The majority of Japan’s squad have 0 (ZERO) caps, with the
aforementioned three “overage” players taking up most of the proportion
of caps on the team. Bolivia are also taking a untested squad with 8 of
their players with 2 caps or less! Chile, Uruguay, and Argentina bring
their veterans with multiple players over or around 100 caps. From this
data I was surprised that Jefferson Farfan and Paolo Guerrero didn’t
have 100 caps by now…
The player with the most caps is Lionel Messi (130) followed closely by
Diego Godin (126), and Alexis Sanchez (124). On the other hand there are
29 players hopeful of making their first national team appearance at
this tournament with the majority (17 players) coming from Japan.
### Goal distribution
Next I looked at the distribution of goals scored by the midfielders and
strikers of each team. I found out about using
`ggplot2::position_nudge()` for slightly adjusting variables on a
discrete scale in similar fashion to the `nudge_y =` and `nudge_x =`
arguments most people might be familiar with from other geoms. I also
used `ggforce::geom_mark_hull()` to do some labelling.
``` r
goals_country_plot <- squads_df_clean %>%
filter(position %in% c("MF", "FW")) %>%
group_by(country) %>%
mutate(median = median(goals)) %>%
ungroup() %>%
ggplot(aes(x = goals, y = reorder(country, median))) +
ggridges::geom_density_ridges(fill = "red", color = "white", scale = 1.1) +
geom_point(aes(x = median, y = country), position = position_nudge(y = 0.25),
color = "yellow", size = 3) +
ggforce::geom_mark_hull(aes(filter = country == "Argentina" & goals == 67, label = "Lionel Messi: 67 goals"),
label.buffer = unit(15, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 55, label = "Luis Suarez: 55 goals"),
label.buffer = unit(5, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Japan" & goals == 50, label = "Shinji Okazaki: 50 goals"),
label.buffer = unit(2, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 46, label = "Edinson Cavani: 46 goals"),
label.buffer = unit(25, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Chile" & goals == 41, label = "Alexis Sanchez: 41 goals"),
label.buffer = unit(4, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
scale_x_continuous(limits = c(0, 73),
expand = c(0.01, 0.01),
breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70),
labels = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)) +
expand_limits(y = 13.5) +
labs(title = "Distribution of Goals Scored by Midfielders and Strikers",
subtitle = "Copa América 2019 squads, Yellow dot = Median goals",
x = "Goals", y = NULL,
caption = glue::glue("
Source: Wikipedia
Data from prior to start of tournament
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18)
goals_country_plot
```

With a lot of these players being more defensively minded or new players
the distribution is heavily skewed but you can see little mounds showing
the top goalscorers for each country and see which countries have their
goalscorers spread out among multiple players such as Brazil, Qatar, and
Peru.
If you know your South American players you can take a good guess at who
are the top goal scorers for each nation. For Colombia the two outlying
mounds are obviously James Rodriguez and Falcao, for example.
Venezuela’s top scorer with 22 is Salomon Rondon and for Brazil, if not
for his injury, a lonesome mound would have appeared for Neymar with 60
goals!
### Player contribution by league
Now let’s check the player contribution to the squads at the Copa
América by league. I’m just going to use the country that the league is
from for simplicity’s sake. Originally I wanted to `left_join()` it with
a ‘country <> domestic league’ table but couldn’t find one and the
league names itself aren’t very meaningful or have awful sponsor names
that obfuscate the country of origin even further.
``` r
player_contrib_league_plot <- squads_df_clean %>%
group_by(country_league) %>%
summarize(n = n()) %>%
ungroup() %>%
ggplot(aes(y = n, x = reorder(country_league, n))) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 4.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(labels = c(0, 5, 10, 15, 20, 25),
breaks = c(0, 5, 10, 15, 20, 25),
limits = c(0, 30),
expand = c(0, 0)) +
labs(title = "Breakdown of Player Contributions by League",
subtitle = glue("
Shown as Country Name
Mexico (Liga MX) contributed 27 players to South American squads"),
x = "League (Country name)", y = "Number of players",
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 10,
axis.text.size = 14,
axis.text.y.size = 11,
axis.title.size = 16,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white"))
player_contrib_league_plot
```

The best of the best players from South American countries will move on
to Europe so the Argentinean league (Superliga Argentina) and the
Brazilian league (Brasileirão - Serie A) do not have as many players as
you might think and as a consequence, the top leagues of England, Spain,
and Italy contribute quite a bit! A lot of the better players but not
quite elite South American players might go to Mexico instead of a
lower-mid European league. With the growth of the MLS a fair number of
players ply their trade there as well.
We can take a more detailed look by creating a table of the proportion
of players from each squad coming from either a domestic league or any
other league. I had to do a lot of wrangling to get the proper output
for the table. After calculating the percentage of domestic players from
a country’s domestic league I added the full data back in. Then I had to
make sure that for each country, the country - domestic league country
was the first row in each of the country groups (so Bolivia - Bolivia,
Bolivia, China, Japan - Japan, Japan - England, etc.). By doing this I
can automatically `tidyr::fill()`-in the rest of the rows of that
country with the ‘percentage of players from domestic league stat’.
``` r
squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
filter(country == country_league) %>%
mutate(perc_from_domestic_league = percent(player_from_league / 23, accuracy = 0.1)) %>%
right_join(squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
ungroup()) %>%
mutate(first = case_when(
country == country_league ~ 1,
TRUE ~ 0)) %>%
arrange(country, desc(first)) %>%
fill(perc_from_domestic_league) %>%
group_by(country) %>%
mutate(perc_from_league = percent(player_from_league / 23, accuracy = 0.1),
country_league = glue::glue("{country_league} - league")) %>%
arrange(desc(player_from_league)) %>%
select(Country = country, `League (country name)` = country_league,
`Number of players from league` = player_from_league,
`Percentage of players from league` = perc_from_league,
`Percentage of players from domestic league` = perc_from_domestic_league) %>%
head(10) %>%
knitr::kable(format = "html",
caption = "Breakdown of Player Contribution by League") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Breakdown of Player Contribution by League
|
Country
|
League (country name)
|
Number of players from league
|
Percentage of players from league
|
Percentage of players from domestic league
|
|
Qatar
|
Qatar - league
|
23
|
100.0%
|
100.0%
|
|
Bolivia
|
Bolivia - league
|
20
|
87.0%
|
87.0%
|
|
Japan
|
Japan - league
|
14
|
60.9%
|
60.9%
|
|
Ecuador
|
Ecuador - league
|
9
|
39.1%
|
39.1%
|
|
Paraguay
|
Paraguay - league
|
8
|
34.8%
|
34.8%
|
|
Brazil
|
England - league
|
7
|
30.4%
|
13.0%
|
|
Uruguay
|
Spain - league
|
6
|
26.1%
|
4.3%
|
|
Peru
|
Peru - league
|
6
|
26.1%
|
26.1%
|
|
Chile
|
Chile - league
|
6
|
26.1%
|
26.1%
|
|
Argentina
|
Argentina - league
|
5
|
21.7%
|
21.7%
|
Three interesting facts I found:
- 30% of players on the Brazil squad play for an English team, most
out of any league - squad combination excluding domestic leagues.
- 100% of the Qatar squad play in their domestic league!
- Only one Uruguayan player (4.3%) plays in its domestic league.
### Player contribution by club
In the final plot for this section, I looked at the top 10 clubs
contributing the most players to the tournament. I used
`arrange(desc(n)) %>% slice()` instead of `top_n()` as there were too
many teams tied at 4 players. To set the team names inside the bars I
created a midpoint value `midval` that calculated a value half of the
number of players contributed so the labels were placed neatly.
``` r
player_contrib_club_plot <- squads_df_clean %>%
group_by(club) %>%
summarize(n = n()) %>%
mutate(club = club %>% forcats::as_factor() %>% forcats::fct_reorder(n),
midval = n / 2) %>%
arrange(desc(n)) %>%
slice(1:15) %>%
ggplot(aes(x = club, y = n)) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
geom_text(aes(y = midval, label = club,
family = "Roboto Condensed", fontface = "bold"),
size = 5, color = "white") +
coord_flip() +
scale_y_continuous(breaks = scales::pretty_breaks(),
expand = c(0, 0),
limits = c(0, 10.5)) +
labs(title = "Top 15 Clubs contributing the most players to the Copa América",
x = "Club", y = "Number of players",
caption = "Source: Wikipedia") +
theme_copaAmerica(
title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white")) +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
player_contrib_club_plot
```

With 100% of its players coming from the domestic league it’s not
surprise that the Qatari team, Al-Sadd, is contributing the most players
to the tournament. Tied with another Qatari team, Mexican club America
features 7 players yet none of them are Mexicans (2 Argentineans, 2
Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).
At first I thought Barcelona contributed 8 players until I realized the
Ecuadorian players were coming from the Ecuadorian team called
Barcelona…I had to go all the way back up to the beginning of this
section to fix that small peculiarity. As futbol came to South America
via European colonists and immigrants a lot of teams took up the names
and colors of the teams these Europeans were fond of. Other examples
include Liverpool F.C. (Montevideo, Uruguay), Arsenal de Sarandi (Buenos
Aires, Argentina), and Club Atletico Juventus (Sao Paulo, Brazil -
although they use the colors of Torino F.C.).
If you download the data and type in the code below you can see the
entire club-country list.
``` r
squads_df_clean %>%
group_by(club, country) %>%
summarize(n = n()) %>% View()
```
Match Records
-------------
Now that we got a good look at the composition of the teams, we can take
a look at how they’ve done at every Copa América.
The next code chunk mainly comes from PH Julien and his excellent Kaggle
kernel of [“A Journey Through The History of
Soccer”](https://www.kaggle.com/phjulien/a-journey-through-the-history-of-soccer/).
``` r
## grab football federation affiliations data
federation_files <- Sys.glob("../data/federation_affiliations/*")
df_federations = data.frame(country = NULL, federation = NULL)
for (f in federation_files) {
federation = basename(f)
content = read.csv(f, header=FALSE)
content <- cbind(content,federation=rep(federation, dim(content)[1]))
df_federations <- rbind(df_federations, content)
}
colnames(df_federations) <- c("country", "federation")
df_federations <- df_federations %>%
mutate(country = as.character(country) %>% str_trim(side = "both"))
results_raw <- readr::read_csv("../data/results.csv")
results_copa <- results_raw %>%
filter(tournament == "Copa América") %>%
rename(venue_country = country,
venue_city = city) %>%
mutate(match_num = row_number())
## combine with federation affiliations
results_copa_home <- results_copa %>%
left_join(df_federations,
by = c("home_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(home_federation = federation)
results_copa_away <- results_copa %>%
left_join(df_federations,
by = c("away_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(away_federation = federation)
## combine home-away
results_copa_cleaned <- results_copa_home %>%
full_join(results_copa_away)
```
Unfortunately, this data does not have **penalty** results as those
games are all counted as a draw (as technically that is the actual
result). Considering there a lot of cagey knock-out rounds that finish
in a penalty shoot-out (including the last two finals…) it is
unfortunate but that’s just the data you have sometimes. There is a way
to web-scrape all the Copa América results and assign Win-Lose to those
games that went to penalties but I’ll leave that for another time. Also,
there is no info on what stage of the tournament the match recorded is
in.
``` r
results_copa_cleaned <- results_copa_cleaned %>%
mutate(
home_federation = case_when(
home_team == "USA" ~ "Concacaf",
TRUE ~ home_federation),
away_federation = case_when(
away_team == "USA" ~ "Concacaf",
TRUE ~ away_federation)) %>%
select(-contains("federation"), -contains("venue"),
-neutral, date, home_team, home_score, away_team, away_score,
tournament, venue_city)
```
``` r
glimpse(results_copa_cleaned)
```
## Observations: 787
## Variables: 8
## $ date 1916-07-02, 1916-07-06, 1916-07-08, 1916-07-10, 19...
## $ home_team "Chile", "Argentina", "Brazil", "Argentina", "Brazi...
## $ away_team "Uruguay", "Chile", "Chile", "Brazil", "Uruguay", "...
## $ home_score 0, 6, 1, 1, 1, 0, 4, 4, 1, 4, 5, 1, 6, 2, 0, 3, 4, ...
## $ away_score 4, 1, 1, 1, 2, 0, 0, 2, 0, 0, 0, 0, 0, 3, 2, 1, 1, ...
## $ tournament "Copa América", "Copa América", "Copa América", "Co...
## $ match_num 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ venue_city "Buenos Aires", "Buenos Aires", "Buenos Aires", "Bu...
Now that it’s nice and cleaned up I created a function that reshapes the
data so that it’s set from a certain team’s perspective with the “team”
argument. You can also set the function to look for only results against
a certain opponent by filling in the `versus` argument.
``` r
copaAmerica_resultados <- function(data, team, versus = NA) {
## team of interest: ex. 'Brazil'
team_var <- enquo(team)
todos_partidos <- data %>%
## filter only for results of team of interest
filter(home_team == !!team_var | away_team == !!team_var) %>%
## reshape columns to team vs. opponent
mutate(
opponent = case_when(
away_team != !!team_var ~ away_team,
home_team != !!team_var ~ home_team),
home_away = case_when(
home_team == !!team_var ~ "home",
away_team == !!team_var ~ "away"),
equipo_goals = case_when(
home_team == !!team_var ~ home_score,
away_team == !!team_var ~ away_score),
opp_goals = case_when(
home_team != !!team_var ~ home_score,
away_team != !!team_var ~ away_score)) %>%
## label results from team's perspective
mutate(
result = case_when(
equipo_goals > opp_goals ~ "Win",
equipo_goals < opp_goals ~ "Loss",
equipo_goals == opp_goals ~ "Draw")) %>%
mutate(result = result %>% forcats::as_factor() %>% forcats::fct_relevel(c("Win", "Draw", "Loss"))) %>%
select(-contains("score"), -contains("team"), -match_num) %>%
rename(Date = date, Tournament = tournament, `Venue` = venue_city, Opponent = opponent, `Home / Away` = home_away,
`Goals For` = equipo_goals, `Goals Against` = opp_goals, Result = result)
if (is.na(versus) | is.null(versus)) {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
## sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
## spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric)
missing_cols <- c("Win", "Draw", "Loss") %>%
map_dfr( ~tibble(!!.x := numeric()))
resultados_totalmente <- resultados_totalmente %>%
bind_rows(missing_cols) %>%
mutate(Win = if_else(is.na(Win), 0, Win),
Draw = if_else(is.na(Draw), 0, Draw),
Loss = if_else(is.na(Loss), 0, Loss)) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
} else {
## opponent: ex. 'Argentina'
todos_partidos <- todos_partidos %>%
filter(Opponent == versus)
if (nrow(todos_partidos) == 0) {
return(glue("{team} has never played {versus} at the Copa América!"))
} else {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
# sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
# spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
}
}
}
```
The output is either a dataframe of all the games a team has been
involved in as well as the record of the team against other teams in the
Copa América or a message saying that the team you picked has never
played against the opponent you picked.
### Japan
``` r
copaAmerica_resultados(data = results_copa_cleaned,
team = "Japan", versus = "Brazil")
```
## Japan has never played Brazil at the Copa América!
Oh… that’s right Japan has never played against Brazil at the Copa…
``` r
resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned, team = "Japan")
resultados_japon[[2]] %>%
knitr::kable(format = "html",
caption = "Japan's record in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Japan’s record in the Copa América
|
Date
|
Tournament
|
Venue
|
Opponent
|
Home / Away
|
Goals For
|
Goals Against
|
Result
|
|
1999-06-29
|
Copa América
|
Asunción
|
Peru
|
away
|
2
|
3
|
Loss
|
|
1999-07-02
|
Copa América
|
Asunción
|
Paraguay
|
away
|
0
|
4
|
Loss
|
|
1999-07-05
|
Copa América
|
Pedro Juan Caballero
|
Bolivia
|
away
|
1
|
1
|
Draw
|
Japan’s only previous journey to the Copa América was in the 1999
edition where they lost all 3 games. They were invited for the 2011
edition but withdrew due to the Tohoku Earthquake and were replaced by
Costa Rica. Japanese football has come a long way since 1999 but with a
young squad it will be a uphill battle to get 3 points against any of
their Group C opponents, Uruguay, Chile, and Ecuador.
### Colombia
``` r
resultados_colombia <- copaAmerica_resultados(data = results_copa_cleaned, team = "Colombia")
resultados_colombia[[2]] %>%
slice(87:92) %>%
knitr::kable(format = "html",
caption = "Colombia's record in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Colombia’s record in the Copa América
|
Date
|
Tournament
|
Venue
|
Opponent
|
Home / Away
|
Goals For
|
Goals Against
|
Result
|
|
2001-07-11
|
Copa América
|
Barranquilla
|
Venezuela
|
home
|
2
|
0
|
Win
|
|
2001-07-14
|
Copa América
|
Barranquilla
|
Ecuador
|
home
|
1
|
0
|
Win
|
|
2001-07-17
|
Copa América
|
Barranquilla
|
Chile
|
home
|
2
|
0
|
Win
|
|
2001-07-23
|
Copa América
|
Armenia
|
Peru
|
home
|
3
|
0
|
Win
|
|
2001-07-26
|
Copa América
|
Manizales
|
Honduras
|
home
|
2
|
0
|
Win
|
|
2001-07-29
|
Copa América
|
Bogotá
|
Mexico
|
home
|
1
|
0
|
Win
|
Despite a recent resurgence of the Colombia national team they have not
been able to match the feats of the 2001 side that won the Copa with
their best place finish since then coming 3rd in 2004. The 2001 team
were not only unbeaten but also did not concede a single goal throughout
the tournament!
### Superclasico Sudamericano: Brazil vs. Argentina
``` r
resultados_de_brazil <- copaAmerica_resultados(data = results_copa_cleaned,
team = "Brazil", versus = "Argentina")
resultados_de_brazil[[1]] %>%
knitr::kable(format = "html",
caption = "Brazil vs. Argentina in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Brazil vs. Argentina in the Copa América
|
Opponent
|
Win
|
Draw
|
Loss
|
Goals For
|
Goals Against
|
|
Argentina
|
9
|
8
|
15
|
38
|
52
|
``` r
resultados_de_brazil[[2]] %>%
tail(5) %>%
knitr::kable(format = "html",
caption = "Brazil vs. Argentina in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Brazil vs. Argentina in the Copa América
|
Date
|
Tournament
|
Venue
|
Opponent
|
Home / Away
|
Goals For
|
Goals Against
|
Result
|
|
1993-06-27
|
Copa América
|
Guayaquil
|
Argentina
|
home
|
1
|
1
|
Draw
|
|
1995-07-17
|
Copa América
|
Rivera
|
Argentina
|
home
|
2
|
2
|
Draw
|
|
1999-07-11
|
Copa América
|
Ciudad del Este
|
Argentina
|
home
|
2
|
1
|
Win
|
|
2004-07-25
|
Copa América
|
Lima
|
Argentina
|
away
|
2
|
2
|
Draw
|
|
2007-07-15
|
Copa América
|
Maracaibo
|
Argentina
|
home
|
3
|
0
|
Win
|
Brazil does not have a good overall record vs. Argentina but they have
not lost against their rivals at the Copa América since the 1993 edition
where they lost 5-6 on penalties in the Quarter Finals. The “draw” in
1995 was won on penalties while the “draw” in 2004 was actually in the
final where they won 4-2 on penalties.
What I found odd was that the Copa América seems to have a very low
priority to certain countries, especially Brazil who have repeatedly
sent their B or C teams to the tournament in favor of sending their best
team to other tournaments or resting star players. Funnily enough these
understrength Brazilian squads have actually won the entire tournament a
few times, most notably in 2007 against a full strength Argentina side
containing the likes of Zanetti, Riquelme, Cambiasso, Tevez, a young
Messi/Mascherano, Cambiasso, et al!
Player Profiles
---------------
After looking at the history of the competition and the composition of
the squads I examined the players and their form coming into the Copa
América. In recent years football analytics has really taken off and
there have been many strides made in creating more informative
statistics to assess players’ abilities, the most prominently being the
**xG** statistics. This is the first time I talk about **xG** in any
length/depth so this introduction is as much to solidify my
understanding as well as yours!
### What IS xG?
- **xG**: Quantitative measure (between 0 and 1) that assigns a
probability that a shot taken will result in a goal based on a
variety of variables and is used for evaluating the quality of
chances and predicting players’ and teams’ future performances.
- **xA**: Quantitative measure (between 0 and 1) that assigns a
probability that a given pass will result in an assist for a goal
based on a variety of variables.
Common variables used in the models that output xG statistics are the
distance and angle of a shot, the body part used, rebound, among others.
Similar to how you might assess your favorite striker’s chances of
scoring just as he is lining up to take a shot: Is the shot a header? Is
he trying to score from a cross in regular play or a corner kick? Are
there a crowd of defenders in front of him or is he one-on-one with the
goalkeeper? Etc. You might think **who** takes a shot would be a genuine
factor but in actuality it tells you a lot less about the chances of a
goal compared to the location of the shot.
Note that there isn’t a SINGLE xG model. You can check out a blog post
comparing different people’s xG models
[here](https://mackayanalytics.nl/2017/06/19/how-accurate-are-xg-models-ii-the-big-chance-dilemma/).
People and organizations (from Statsbomb to OptaPro) have their own
ideas about what **could** be the important variables in play and as
such it’s important to report from which source you got your data from
as the stats can differ between models. A few things xG does not factor
in are things like goalkeeper performance (someone pulling off
incredible saves or letting in a poor shot) and one must also consider
the fact that team style of play and the quality of a player’s
teammates. When judging players based on these stats it is important to
be aware of contextual factors like the team they play for, their
opponent, and the player’s position/role in the team.
From xG and xA more granular statistics such as xGChain and xGBuildup
were created to be able to dig a little deeper into who is contributing
to chance creation, I’ll introduce the latter two a bit later. As the
field has grown new statistics have popped up such as [Karun
Singh](https://twitter.com/karun1710/)’s “expected threat” or xT. You
can check out an introduction to xT from
[here](https://karun.in/blog/expected-threat.html).
Of course, these statistics only tell a part of the story and are
definitenly not the be-all-and-end-all. In the context of this current
blog post, these stats only tell the story about how these players did
for their club teams this past season rather than for their national
team. Even still it gives us a good idea of what kind of form these
players are in coming into this tournament.
You might also want to watch these Youtube videos by
[TifoFootball](https://www.youtube.com/channel/UCGYYNGmyhZ_kwBF_lqqXdAQ)
for a quick primer on [xG](https://www.youtube.com/watch?v=zSaeaFcm1SY)
and [xA](https://www.youtube.com/watch?v=1MdlkuzLdj4).
### understat data
For the data I used the website, `understat.com`. Their xG models were
created via training a neural network on a dataset consisting of over
100,000 shots using more than 10 different variables. Getting data from
`understat` has been made easy by Ewen Henderson’s `understatr` package
available from [Github](https://github.com/ewenme/understatr) (he’s also
the guy that made the [ghibli](https://github.com/ewenme/ghibli) color
palette!). I tried to pick a wide selection of attacking players but I
was also limited by the fact that `understat` only has data for
teams/players from six European leagues (Premier League, Bundesliga,
Serie A, La Liga, Ligue 1, and Russian Premier League).
For **Peru** I would have chosen Paolo Guerrero but as he plays in
Brazil now I went with Jefferson Farfan (who hasn’t played as many games
as the other players used for comparison unfortunately…). For **Chile**
I would pick Eduardo Vargas but he as doesn’t play for a team covered by
understat I went with Alexis Sanchez, who had a woeful season and only
played \~600 minutes despite appearing in \~20 league matches and later
added Arturo Vidal. For **Brazil** I included Neymar initially but since
he won’t actually be playing I’ll keep him for comparison’s sake but
also include Gabriel Jesus and Roberto Firmino who have been fighting
for the starting striker spot. Note that these two aren’t the ones
replacing Neymar **positionally**. In Neymar’s left-wing position I can
see David Neres or Phil Coutinho replacing him (Richarlison and Willian
mostly play on the right). (Edit: In the first match vs. Bolivia, David
Neres started off on the left while Richarlison played on the right,
Coutinho played just behind Bobby Firmino)
The other nation’s strikers/attacking midfielders don’t play for the six
European leagues covered by understat or like in Shinji Okazaki’s case
just did not play as many minutes/games during the season. To get the
data I created a list of the player codes and use `purrr::map()` to
iterate each through the `understatr::get_player_seasons_stats()`
function.
``` r
player_codes <- c(2097, 2099, 813, ## Messi, Neymar, Rondon
498, 4299, 696, ## Alexis, Farfan, Falcao
3294, 2098, 5543, ## Cavani, Suarez, G. Jesus
482, 1148, 2249, ## Bobby, Duvan, James
1089, 3553, 488, ## Cuadrado, Di Maria, Coutinho
222) ## Arturo Vidal
understat_data <- player_codes %>%
map(., ~ understatr::get_player_seasons_stats(.x)) %>%
reduce(bind_rows) %>%
select(-player_id, -position, -yellow, -red)
```
``` r
glimpse(understat_data)
```
## Observations: 83
## Variables: 15
## $ games 34, 36, 34, 33, 38, 17, 20, 30, 34, 33, 32, 36, 38...
## $ goals 36, 34, 37, 26, 43, 15, 19, 13, 24, 22, 11, 7, 8, ...
## $ shots 170, 196, 179, 158, 187, 55, 91, 105, 124, 95, 89,...
## $ time 2704, 2995, 2832, 2726, 3374, 1443, 1797, 2652, 30...
## $ xG 25.997169, 28.946281, 26.885174, 27.101910, 35.891...
## $ assists 13, 12, 9, 16, 18, 7, 13, 11, 12, 7, 7, 3, 2, 2, 0...
## $ xA 15.33516552, 15.10040562, 13.95513140, 15.87127814...
## $ key_passes 93, 87, 79, 77, 95, 43, 70, 91, 102, 52, 32, 23, 2...
## $ year 2018, 2017, 2016, 2015, 2014, 2018, 2017, 2016, 20...
## $ team_name "Barcelona", "Barcelona", "Barcelona", "Barcelona"...
## $ npg 32, 32, 31, 23, 38, 10, 15, 12, 19, 21, 11, 7, 8, ...
## $ npxG 22.280909, 25.973170, 21.682231, 21.899351, 31.432...
## $ xGChain 38.459877, 48.180634, 42.525045, 41.996866, 54.753...
## $ xGBuildup 10.6987990, 21.6344040, 18.1335122, 15.1963644, 19...
## $ player_name "Lionel Messi", "Lionel Messi", "Lionel Messi", "L...
As you can see the data consists of a row for each player and each year
(from the 2014/2015 season to the 2018/2019 season). I tried to mitigate
the fact that some players played a lot more minutes than others by
standardize everything to a ‘per 90 minutes’ value but this does have
its own disadvantages. These include the fact that players who play a
lot of minutes (as regular starting members) may not have as high ‘per
90’ stat even though their production with all these minutes might
suggest that they are consistently performing and producing at a high
level.
It’ll be a bit crowded (kind of like a spilt box of Skittles…) but let’s
check out the key metrics for all the players at once.
Note: npg = non-penalty goals, npxG = non-penalty goals xG
``` r
comparison_data <- understat_data %>%
filter(year == 2018) %>%
select(-games, -team_name, -year) %>%
rename(Shots = shots, KP = key_passes) %>%
gather(key = "key", value = "value", -player_name, -time) %>%
mutate(key = forcats::as_factor(key) %>%
forcats::fct_relevel(.,
"xG", "goals", "npxG", "npg",
"xA", "assists", "xGChain", "xGBuildup",
"Shots", "KP"))
comparison_strikers_plot <- comparison_data %>%
filter(key != "Shots", key != "KP",
key != "xGBuildup", key != "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
geom_vline(xintercept = 4.5, size = 2) +
geom_vline(xintercept = 5.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.26)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Comparison: Top attackers at the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_strikers_plot
```

As usual in these types of charts, Messi is leading a lot of the metrics
here and showing consistency too with having played the third highest
amount of minutes out of the selected players. It’s helpful to have the
xG/xA stats next to the actual goals/assists as it provides an
indication of whether the player in question is scoring shots that he
probabilistically should be scoring. When a player’s actual goal count
is higher than their xG stat this suggests that the player is
**“exceeding their xG”** or that they are scoring from shots that are
historically hard to score from. It can be seen as a marker of an elite
finisher as they are putting away chances from difficult situations
consistently. In terms of assists and xA Alexis Sanchez, who only played
about 600 minutes, looks a lot better than in reality due to the
aforementioned disadvantage of standardizing everything to a “per 90
minutes” value. Normally you would have a cut-off based on a certain
**minimum amount of minutes** but as I mentioned I was rather limited in
my choice of players.
A way to take a closer look at xG - Goals and xA - Assists is to use a
simple dot plot with a line going through the 45 degree angle. Those
below the line are underperforming relative to their xG or xA stat,
those over it are overachieving (“exceeding” their xG/xA stat) while
those just on the line are scoring or assisting right around what the
model expects the player to be. I use non-penalty xG below as penalties
have around \~0.75 xG (give or take a few percentage points depending on
the model) and can inflate the stats of those players who take a lot of
penalties and score them, especially if they weren’t the ones who earned
the penalty themselves.
``` r
expected_goal_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, npxG, xG, goals) %>%
mutate_at(c("npxG", "xG", "goals"), ~. / time * 90) %>%
ggplot(aes(x = npxG, y = goals, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
scale_x_continuous(limits = c(0, 1.1),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 1.3),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Expected vs. Actual Goals",
subtitle = "For select group of attacking players with data available from understat.com",
x = "Non-penalty xG per 90 minutes",
y = "Goals per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_goal_plot
```

Gabriel Jesus is quite clearly below the 45 degree line meaning that he
has been very poor at finishing chances (and/or incredibly unlucky).
After a poor World Cup where he scored 0 goals as a starter, he is
really going to have to step up to fill Neymar’s goalscoring boots for
this tournament. However, his build-up play for City has still been good
this past season and he has been scoring for Brazil in the friendlies
leading up to the tournament so it’s going to be a hard decision for
Tite to decide on who starts against Bolivia (edit: Firmino started and
contributed an assist while Jesus replaced him in the 65th minute).
``` r
expected_assists_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, xA, assists) %>%
mutate_at(c("xA", "assists"), ~. / time * 90) %>%
ggplot(aes(x = xA, y = assists, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
labs(title = "Expected vs. Actual Assists",
subtitle = "For select group of attacking players with data available from understat.com",
x = "xA per 90 minutes",
y = "Assists per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
scale_x_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_assists_plot
```

One thing to keep in mind is that xA does not take into account the
recipient of the assist pass. Even if the pass given had a high expected
assist value the receiving player still might not have the quality to
put it away through no fault of the passer. This might explain why most
of the players with a higher xA among this group don’t have the assists
to match. It can also be that these players are also the ones playing a
lot more minutes and the volume of chances they create just aren’t
translating to goals all the time.
### Key Passes, Shots, xGChain, and xGBuildup (per 90)
I separated “key passes” and “shots” as well as “xGChain” and
“xGBuildup” from the rest as these two sets were on a very different
scale.
``` r
kp_shots_plot <- comparison_data %>%
filter(key == "Shots" | key == "KP") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 6),
breaks = c(0, 1, 2, 3, 4, 5, 6),
labels = c(0, 1, 2, 3, 4, 5, 6)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = glue("
KP = Key Passes
For select group of attacking players with data available from understat.com"),
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
kp_shots_plot
```

- xGChain: Quantitative measure that is the combined sum of the xG of
every possession that ends in a shot that a player is involved in.
The same derived value is given to each of the players involved and
allows us to credit players for attacking contributions outside of
just shots (xG) and assists (xA).
- xGBuildup: Similar to xGChain but excluding shots and assists. This
is in response to xGChain values still being dominated by the xG and
xA from shots and assists, respectively.
``` r
xgbuildup_xgchain_plot <- comparison_data %>%
filter(key == "xGBuildup" | key == "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.55),
breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5),
labels = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
xgbuildup_xgchain_plot
```

Although Gabriel Jesus has been poor at finishing his chances as seen in
previous graphs, his xGChain and xGBuildup stat makes it clear that he
is still contributing to City’s attack outside of scoring goals himself
(not to mention all the defensive work he does as well).
For example below, the stats are able to clearly differentiate between
James, who is more of a playmaker, compared to Falcao and Duvan who are
traditional number 9s with his superior xGBuildup, xGChain, and Key
Passes values. For a more detailed overview on xGChain and xGBuildup
check out Statsbomb’s article
[here](https://statsbomb.com/2018/08/introducing-xgchain-and-xgbuildup/).
``` r
## keep colors for Colombians consistent with other plots
colombia_pal <- c("#000033", "#005300", "#009FFF", "#00FFBE")
comparison_colombia_plot <- comparison_data %>%
filter(!key %in% c("xG", "goals", "npxG", "npg", "xA", "assists"),
player_name %in% c("James Rodríguez", "Falcao", "Duván Zapata", "Juan Cuadrado")) %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.05, 0.05),
breaks = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4),
labels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4)) +
scale_fill_manual(values = colombia_pal, name = "Player") +
labs(title = "Comparison: Stars of Colombia",
subtitle = "KP: Key Passes",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 20,
subtitle.size = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_colombia_plot
```

There are lots of different ways to visualize this data, most famously
the radar charts created by [Statsbomb](http://statsbomb.com/). Using
the data you can also compare and evaluate players in many different
ways and using `understatr` package you could add a few more players
like Paulo Dybala, Miguel Almiron, and more! I could probably do a whole
other article using just this data but I’ll leave it here for now.
Conclusion
----------
Throughout this blog post I talked about some of the historical records,
squad compositions, match records, and finally the player profiles of
attacking players at this summer’s Copa América. Using the power of R it
is really easy to webscrape and visualize data in a way that is
informative and aesthetically pleasing. I wanted to finish this before
the tournament started but other life things got in the way as well as
the fact that the amount of content ballooned out of control (especially
the xG section) so I had to cut down a lot.
It’s been fun reading the articles on the Copa América website and
seeing how far my “intermediate-but-very-out-of-practice”-level of
Spanish can get me to understand the content, here is one that I
particularly liked reading: [14 Estadisticas de la Copa
América](https://copaamerica.com/es/noticias/a-14-dias-del-inicio-de-la-copa-14-estadisticas-de-la-copa-america).
With so many tournaments going on right now (and with the African Cup of
Nations starting in a few days) a lot of the news media is spread thin
right now but there are still some quality articles out there to read
about the Copa, like [this
article](https://www.bbc.com/sport/football/48600098) from BBC’s South
American football expert, Tim Vickery.
After the first round of games, a few points of discussion:
- After an extremely lacklustre performance vs. Colombia, how does
Argentina bounce back? What tactical changes need to be made?
- Qatar impressed against Paraguay but can they pull off a major upset
vs. Colombia?
- How will Japan line-up against Uruguay after a losing by a scoreline
that didn’t really do their performance justice? How will manager
Moriyasu balance experience vs. youth, will he start with veteran
Okazaki after Ueda’s numerous misses vs. Chile?
- Can Brazil earn a early ticket to the next round vs. Venezuela after
a clinical but not excellent performance vs. Bolivia? Will they be
able to keep up their streak of winning the Copa every time they
have hosted it?
Thanks for reading and…
¡Buena suerte a todos equipos!
================================================
FILE: Copa America 2019/copa_america2019.rmd
================================================
---
title: "Visualizing the Copa América: Historical Records, Squad Profiles, and Player Profiles with xG statistics!"
always_allow_html: yes
output:
md_document:
variant: markdown_github
---
```{r setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage, ggforce, ggtextures, DT,
cowplot, rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
Another summer and another edition of the Copa América! Along with the Africa Cup of Nations, Nations League finals, the Women's World Cup, Under-21 European Championship AND the Gold Cup this is yet another soccer-filled season after last year's World Cup and the Asian Cup earlier this year (I also did a blog post on these last two tournaments which you can see [here (World Cup)](https://ryo-n7.github.io/2018-06-29-visualize-worldcup/) and [here (Asian Cup)](https://ryo-n7.github.io/2019-01-11-visualize-asian-cup/)). There is so much football going on at once even I can't keep up, especially with the time difference! To not redo all the previous visualizations with Copa América data I tried to find new sources of data and other forms of visualizations to give some insight into the players and teams competing to be the champion of South America. You can find all the code I used in this blogpost here and you can also find other soccer related data viz in my [soccer_ggplot](https://github.com/Ryo-N7/soccer_ggplots) Github repo.
The sections will go from a very macro-level view of the __historical records__ of the tournament, to the __squads__ competing, the teams' __match record__ in the Copa América, and finally to a micro-level view of various attacking players using __xG__ statistics.
¡Vámonos!
## Packages
```{r eval=FALSE, message=FALSE, warning=FALSE}
library(dplyr) ## data wrangling
library(tidyr) ## data wrangling
library(purrr) ## data wrangling and iteration
library(stringr) ## data wrangling
library(rvest) ## webscraping
library(polite) ## webscraping (Github only pkg)
library(ggplot2) ## plotting
library(scales) ## plotting scales
library(ggimage) ## images for flags
library(ggforce) ## plotting text labels
library(cowplot) ## plotting grid
library(glue) ## text
library(ggrepel) ## plotting text labels
library(magick) ## plotting
library(DT) ## tables
library(ggtextures) ## soccer ball emoji as geom_col()
library(extrafont) ## fonts: Roboto Condensed
loadfonts()
```
## theme_copaAmerica
I wanted to have all the plots in this blogpost to have a consistent color theme. As the tournament is going to be held in Brazil, I went with a color theme based on its flag with blue, yellow, and green being the primary colors.
```{r}
theme_copaAmerica <- function(
title.size = 24,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 14,
axis.text.x.size = 12,
axis.text.y.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(size = 0.5, color = "white"),
panel.grid.major.y = element_line(size = 0.5, color = "white"),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.ticks = element_line(color = "white")) {
## Theme:
theme(text = element_text(family = "Roboto Condensed", color = "white"),
plot.title = element_text(family = "Roboto Condensed", face = "bold",
size = title.size, color = "yellow"),
plot.subtitle = element_text(size = subtitle.size),
plot.caption = element_text(size = caption.size),
panel.background = element_rect(fill = "#009b3a"),
plot.background = element_rect(fill = "#002776"),
axis.text = element_text(size = axis.text.size, color = "white"),
axis.text.x = element_text(size = axis.text.x.size, color = "white"),
axis.text.y = element_text(size = axis.text.y.size, color = "white"),
axis.title = element_text(size = axis.title.size),
axis.line.x = element_blank(),
axis.line.y = element_blank(),
panel.grid.major.x = panel.grid.major.x,
panel.grid.major.y = panel.grid.major.y,
panel.grid.minor.x = panel.grid.minor.x,
panel.grid.minor.y = panel.grid.minor.y,
strip.text = element_text(color = "yellow", face = "bold",
size = strip.text.size,
margin = margin(4.4, 4.4, 4.4, 4.4)),
strip.background = element_blank(),
axis.ticks = axis.ticks
)
}
```
## Top Goal Scorers // Goleadores
For this plot I took the stats from the Spanish version of the Wikipedia page as it had more content. I used `purrr::flatten_df()` to squish the list output into a dataframe then set the names of each column using `purrr::set_names()`.
```{r, eval=FALSE}
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_top_scorers <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(95)") %>%
html_table() %>%
flatten_df() %>%
set_names(c("player", "country", "goals")) %>%
mutate(image = "https://www.emoji.co.uk/files/microsoft-emojis/activity-windows10/8356-soccer-ball.png")
```
```{r, eval=FALSE, echo=FALSE}
saveRDS(copa_top_scorers, file = here::here("data/copa_top_scorers.RDS"))
```
```{r, echo=FALSE}
copa_top_scorers <- readRDS(file = here::here("data/copa_top_scorers.RDS"))
```
```{r}
glimpse(copa_top_scorers)
```
Like in the Asian Cup blogpost I use [Claus Wilke](https://twitter.com/ClausWilke/)'s [ggtextures](https://github.com/clauswilke/ggtextures) package to use __soccer ball emoji__ as the column image in the plot.
```{r, fig.height = 8, fig.width = 10}
copa_goleadores_raw_plot <- copa_top_scorers %>%
head(5) %>%
ggplot(aes(x = reorder(player, goals), y = goals,
image = image)) +
geom_isotype_col(img_width = grid::unit(1, "native"), img_height = NULL,
ncol = NA, nrow = 1, hjust = 0, vjust = 0.5) +
geom_text(aes(label = goals, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(breaks = c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18),
expand = c(0, 0),
limits = c(0, 19)) +
labs(title = "Top Scorers of the Copa América",
subtitle = glue("
Most goals in a single tournament: 9
Humberto Maschio (Argentina), Javier Ambrois (Uruguay), Jair (Brazil)"),
y = "Number of Goals", x = NULL,
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
theme_copaAmerica(title.size = 26,
subtitle.size = 16,
caption.size = 12,
axis.text.size = 18,
axis.title.size = 18,
panel.grid.major.y = element_blank(),
axis.ticks = element_blank())
## Add flags to y-axis:
axis_image <- axis_canvas(copa_goleadores_raw_plot, axis = 'y') +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 16.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/1/1a/Flag_of_Argentina.svg",
y = 12.5, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/f/fe/Flag_of_Uruguay.svg",
y = 9, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/commons/d/df/Flag_of_Peru_%28state%29.svg",
y = 5.25, scale = 1.8) +
draw_image("https://upload.wikimedia.org/wikipedia/en/0/05/Flag_of_Brazil.svg",
y = 1.5, scale = 1.8)
copa_goleadores_plot <- ggdraw(insert_yaxis_grob(copa_goleadores_raw_plot, axis_image, position = "left"))
copa_goleadores_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = copa_goleadores_plot, filename = here::here("Copa America 2019/output/copa_goleadores_plot.png"),
height = 8, width = 10)
```
Most of these players aren't ones you might recognize. The Copa América used to be held a lot more regularly (and sometimes erratically) until this century so players had a lot more opportunities to score goals. All five of the players you see here played in the 1930s-1950s when there was a tournament every one or two years. Out of currently active players, Peruvian legend Paolo Guerrero has 11 goals along with Eduardo Vargas (from Chile). (Edit: after the Chile - Japan game, Vargas is on 12...) Another player you might recognize that was actually tied with Ademir for 5th place, along with three other players, was Gabriel Batistuta ("Batigol").
## Winners of the Copa América
After grabbing the data from the Wikipedia page I used a variety of functions to clean and reshape the dataset like `tidyr::separate()` to split the number of occurences and the year.
```{r, warning=FALSE, eval=FALSE}
url <- "https://es.wikipedia.org/wiki/Anexo:Estad%C3%ADsticas_de_la_Copa_Am%C3%A9rica"
session <- bow(url)
copa_campeones <- scrape(session) %>%
html_nodes(".mw-parser-output > table:nth-child(10)") %>%
html_table() %>%
flatten_df()
copa_campeones_limpia <- copa_campeones %>%
janitor::clean_names() %>%
slice(1:8) %>%
select(1:4) %>%
set_names(c("team", "winners", "runners_up", "third_place")) %>%
separate(winners, into = c("Champions", "first_place_year"),
sep = " ", extra = "merge") %>%
separate(runners_up, into = c("Runners-up", "second_place_year"),
sep = " ", extra = "merge") %>%
separate(third_place, into = c("Third Place", "third_place_year"),
sep = " ", extra = "merge") %>%
mutate_all(list(~str_replace_all(., "–", "0"))) %>%
mutate_at(vars(contains("num")), funs(as.numeric)) %>%
gather(key = "key", value = "value", -team,
-first_place_year, -second_place_year, -third_place_year) %>%
mutate(key = as.factor(key),
value = as.numeric(value),
team = team %>% str_replace(., "[A-Z]{3}", "") %>% str_trim(.),
team = case_when(team == "Brasil" ~ "Brazil",
TRUE ~ team)) %>%
mutate(key = forcats::fct_relevel(key,
"Champions",
"Runners-up",
"Third Place")) %>%
arrange(key, desc(value)) %>%
mutate(team = forcats::as_factor(team),
order = row_number())
```
```{r, eval=FALSE, echo=FALSE}
saveRDS(copa_campeones_limpia, file = here::here("data/copa_campeones_clean.RDS"))
```
```{r, echo=FALSE}
copa_campeones_limpia <- readRDS(file = here::here("data/copa_campeones_clean.RDS"))
```
I also wanted to add flags to this plot but `cowplot::insert_yaxis_grob()` is unfortunately not compatible with facets. I used `stringr::str_wrap()` to format the subtitle nicely while I used `glue::glue()` to avoid having the use '\n' to create a new line for the caption.
```{r, fig.height = 6, fig.width = 8}
copa_ganadores_plot <- copa_campeones_limpia %>%
ggplot(aes(value, forcats::fct_rev(team), color = key)) +
geom_point(size = 10) + # 10
geom_text(aes(label = value),
size = 5, color = "black", # 5
family = "Roboto Condensed", fontface = "bold") +
scale_color_manual(values = c("Champions" = "#FFCC33",
"Runners-up" = "#999999",
"Third Place" = "#CC6600"),
guide = FALSE) +
scale_x_continuous(breaks = c(1, 5, 10, 15),
labels = c(1, 5, 10, 15),
limits = c(-1, 16)) +
labs(x = "Number of Occurrence", y = NULL,
title = "Most Successful Teams of the Copa América!",
subtitle = str_wrap("Ordered by number of Copa América(s) won. Argentina missed the chance to leapfrog Uruguay after consecutive final losses in the previous two tournaments!", width = 80),
caption = glue("
Source: Wikipedia
By @R_by_Ryo")) +
facet_wrap(~key) +
theme_copaAmerica(subtitle.size = 14,
caption.size = 10)
copa_ganadores_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(filename = here::here("Copa America 2019/output/copa_ganadores_plot.png"),
height = 6, width = 8)
```
What's surprising to note is that Pele never won a Copa América with Brazil, although he did get Best Player and Top Scorer in the 1959 edition of the tournament. Even more bizarrely Diego Maradona has never won it either! He didn't play in either of the 1991 and 1993 editions where Argentina won their 13th and 14th Copas.
## Copa América Squad Profiles
We just looked at what happened in the past but who are the players competing in the tournament this year? To take a quick look I web-scraped the squads of each of the competing teams from Wikipedia.
I created a list of the `xpath`s for each of squads and using `purrr::map()` I grabbed the data for each participating country. After I got some meta-information about the country name and the group I created a list-column that stores the squad data as a dataframe in its own column. To explode this out I used `tidyr::unnest()` to reshape the entire dataframe to have one row with all the data for each player in every squad.
```{r, eval=FALSE, echo=FALSE}
squad_url <- "https://en.wikipedia.org/wiki/2019_Copa_Am%C3%A9rica_squads"
session <- bow(squad_url)
xpaths <- 1:12 %>%
map(., ~glue("//*[@id='mw-content-text']/div/table[{.x}]"))
squads_df_raw <- scrape(session) %>%
html_node(xpath = '//*[@id="toc"]') %>%
html_text() %>%
str_split("\n") %>%
unlist() %>%
tibble::enframe() %>%
rename(country = value) %>%
filter(str_detect(country, "^[1-8]\\."), !str_detect(country, "Group")) %>%
separate(country, c("group", "delete", "country"), sep = c(1, 3)) %>%
slice(1:12) %>%
mutate(group = LETTERS[as.numeric(group)],
country = str_trim(country),
xpaths = xpaths,
squads = map(xpaths, ~ scrape(session) %>%
html_node(xpath = .x) %>%
html_table())) %>%
unnest(squads) %>%
filter(Player != "") %>%
mutate(country_league = scrape(session) %>%
html_nodes(".nat-fs-player .thumbborder") %>%
html_attr("alt"))
```
```{r, eval=FALSE, echo=FALSE}
saveRDS(squads_df_raw, file = here::here("data/copa_america2019_squads_raw.RDS"))
```
```{r, echo=FALSE}
squads_df_raw <- readRDS(file = here::here("data/copa_america2019_squads_raw.RDS"))
```
To get a clean dataset I use some `stringr::str_*()` functions to properly format the character strings such as the player positions, ages, date of births.
```{r}
squads_df_clean <- squads_df_raw %>%
janitor::clean_names() %>%
select(-delete, squad_num = no,
position = pos, birth_age = date_of_birth_age) %>%
mutate(position = position %>% str_replace_all(., "[1-9]", ""),
birth_age = birth_age %>% str_extract_all(., pattern = "\\([^()]+\\)")) %>% unnest(birth_age) %>%
group_by(player) %>%
mutate(colnum = seq_along(player)) %>%
spread(key = colnum, value = birth_age) %>%
ungroup() %>%
select(everything(), dob = `1`, age = `2`) %>%
mutate(dob = dob %>% str_replace_all(., "[()]", "") %>% lubridate::as_date(),
age = age %>% str_extract(., "[0-9]+") %>% as.integer,
country = forcats::fct_relevel(country,
"Brazil", "Argentina", "Uruguay",
"Peru", "Qatar", "Chile",
"Venezuela", "Paraguay", "Japan",
"Bolivia", "Colombia", "Ecuador",
),
club = case_when(
club == "Barcelona" & country == "Ecuador" ~ "Barcelona (Ecuador)",
TRUE ~ club))
```
```{r, eval=FALSE, echo=FALSE}
saveRDS(squads_df_clean, file = here::here("data/copa_america2019_squads_clean.RDS"))
```
```{r, echo=FALSE}
squads_df_clean <- readRDS(file = here::here("data/copa_america2019_squads_clean.RDS"))
```
```{r}
glimpse(squads_df_clean)
```
### Age-histogram
Using this data I can plot a bunch of histograms:
```{r, fig.height = 8, fig.width = 8}
age_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_age = median(age)) %>%
ungroup() %>%
ggplot(aes(x = age)) +
geom_histogram(fill = "red", binwidth = 1) +
geom_vline(aes(xintercept = median_age), size = 1.2) +
geom_label(aes(x = median_age, y = 8,
label = glue::glue("Median: {median_age}")),
nudge_x = 0.5, hjust = 0.1, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Age Distribution of Copa América squads",
subtitle = "Columns ordered Group A to Group C",
x = "Age", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0),
breaks = scales::pretty_breaks()) +
theme_copaAmerica(title.size = 22,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 12,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white")) +
facet_wrap(~country, ncol = 3)
age_country_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = age_country_plot, filename = here::here("Copa America 2019/output/age_country_plot.png"),
height = 8, width = 8)
```
In terms of age, Japan have the youngest team with a median of 21, 4 years younger than the next youngest team, Qatar. The rest have a fairly balanced spread of ages from 20 to early-mid 30s with most of the medians hovering around 27 years of age. The reason for Japan's extremely young squad is due to the fact that the full-strength Japan team has played in both the World Cup and the Asian Cup in the past year. Along with the fact that the Tokyo Olympics are next year, it was decided to use the invitation to the Copa América as a trial-by-fire for the young stars of the future. Much like in a real Olympic squad, the team contains three "overage" players in World Cup 2010/2014/2018 goalkeeper Eiji Kawashima, Premier League winner Shinji Okazaki, and Getafe playmaker Gaku Shibasaki.
The oldest player will be Brazil captain Dani Alves at 36 with Paraguay's Oscar Cardozo only two weeks younger. On the other hand, the youngest player is Japan's 18-year old prodigy Takefusa Kubo, the ex-Barcelona youth player who only just recently moved to Real Madrid! In light of his transfer a lot of eyes will be on him to see if he can produce some Captain Tsubasa-esque performances for a very inexperienced Japan team gearing up for the Tokyo Olympics!
### Caps histogram
When considering the experience of a squad it's not enough to look at ages but one needs to look at the caps or appearances for the national team as well.
```{r, fig.height = 8, fig.width = 8}
caps_country_plot <- squads_df_clean %>%
group_by(country) %>%
mutate(median_cap = median(caps)) %>%
ungroup() %>%
ggplot(aes(x = caps)) +
geom_histogram(fill = "red", binwidth = 5) +
geom_vline(aes(xintercept = median_cap), size = 1.25) +
geom_label(aes(x = median_cap, y = 15,
label = glue::glue("Median: {median_cap}")),
nudge_x = 0.5, hjust = 0.05, size = 3,
family = "Roboto Condensed", color = "black") +
labs(title = "Caps (Appearances) by Country",
subtitle = "Columns ordered Group A to Group C",
x = "Caps", y = NULL,
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_copaAmerica(
title.size = 20,
subtitle.size = 14,
caption.size = 10,
axis.text.size = 10,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.x = element_line(color = "white", size = 0.25),
panel.grid.major.y = element_line(color = "white", size = 0.25),
panel.grid.minor.x = element_line(color = "white", size = 0.25),
panel.grid.minor.y = element_line(color = "white", size = 0.25)) +
facet_wrap(~country, ncol = 3)
caps_country_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = caps_country_plot, filename = here::here("Copa America 2019/output/caps_country_plot.png"),
height = 8, width = 8)
```
The majority of Japan's squad have 0 (ZERO) caps, with the aforementioned three "overage" players taking up most of the proportion of caps on the team. Bolivia are also taking a untested squad with 8 of their players with 2 caps or less! Chile, Uruguay, and Argentina bring their veterans with multiple players over or around 100 caps. From this data I was surprised that Jefferson Farfan and Paolo Guerrero didn't have 100 caps by now...
The player with the most caps is Lionel Messi (130) followed closely by Diego Godin (126), and Alexis Sanchez (124). On the other hand there are 29 players hopeful of making their first national team appearance at this tournament with the majority (17 players) coming from Japan.
### Goal distribution
Next I looked at the distribution of goals scored by the midfielders and strikers of each team. I found out about using `ggplot2::position_nudge()` for slightly adjusting variables on a discrete scale in similar fashion to the `nudge_y =` and `nudge_x =` arguments most people might be familiar with from other geoms. I also used `ggforce::geom_mark_hull()` to do some labelling.
```{r, fig.height = 6, fig.width = 8}
goals_country_plot <- squads_df_clean %>%
filter(position %in% c("MF", "FW")) %>%
group_by(country) %>%
mutate(median = median(goals)) %>%
ungroup() %>%
ggplot(aes(x = goals, y = reorder(country, median))) +
ggridges::geom_density_ridges(fill = "red", color = "white", scale = 1.1) +
geom_point(aes(x = median, y = country), position = position_nudge(y = 0.25),
color = "yellow", size = 3) +
ggforce::geom_mark_hull(aes(filter = country == "Argentina" & goals == 67, label = "Lionel Messi: 67 goals"),
label.buffer = unit(15, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 55, label = "Luis Suarez: 55 goals"),
label.buffer = unit(5, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Japan" & goals == 50, label = "Shinji Okazaki: 50 goals"),
label.buffer = unit(2, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Uruguay" & goals == 46, label = "Edinson Cavani: 46 goals"),
label.buffer = unit(25, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
ggforce::geom_mark_hull(aes(filter = country == "Chile" & goals == 41, label = "Alexis Sanchez: 41 goals"),
label.buffer = unit(4, "mm"), label.fontsize = 10, label.fill = "red",
label.family = "Roboto Condensed", label.colour = "white",
con.cap = unit(1, "mm"), con.type = "straight") +
scale_x_continuous(limits = c(0, 73),
expand = c(0.01, 0.01),
breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70),
labels = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)) +
expand_limits(y = 13.5) +
labs(title = "Distribution of Goals Scored by Midfielders and Strikers",
subtitle = "Copa América 2019 squads, Yellow dot = Median goals",
x = "Goals", y = NULL,
caption = glue::glue("
Source: Wikipedia
Data from prior to start of tournament
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18)
goals_country_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = goals_country_plot, filename = here::here("Copa America 2019/output/goals_country_plot.png"),
height = 6, width = 8)
```
With a lot of these players being more defensively minded or new players the distribution is heavily skewed but you can see little mounds showing the top goalscorers for each country and see which countries have their goalscorers spread out among multiple players such as Brazil, Qatar, and Peru.
If you know your South American players you can take a good guess at who are the top goal scorers for each nation. For Colombia the two outlying mounds are obviously James Rodriguez and Falcao, for example. Venezuela's top scorer with 22 is Salomon Rondon and for Brazil, if not for his injury, a lonesome mound would have appeared for Neymar with 60 goals!
### Player contribution by league
Now let's check the player contribution to the squads at the Copa América by league. I'm just going to use the country that the league is from for simplicity's sake. Originally I wanted to `left_join()` it with a 'country <> domestic league' table but couldn't find one and the league names itself aren't very meaningful or have awful sponsor names that obfuscate the country of origin even further.
```{r, fig.height = 7, fig.width = 8}
player_contrib_league_plot <- squads_df_clean %>%
group_by(country_league) %>%
summarize(n = n()) %>%
ungroup() %>%
ggplot(aes(y = n, x = reorder(country_league, n))) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 4.5, color = "yellow",
nudge_y = 0.5) +
coord_flip() +
scale_y_continuous(labels = c(0, 5, 10, 15, 20, 25),
breaks = c(0, 5, 10, 15, 20, 25),
limits = c(0, 30),
expand = c(0, 0)) +
labs(title = "Breakdown of Player Contributions by League",
subtitle = glue("
Shown as Country Name
Mexico (Liga MX) contributed 27 players to South American squads"),
x = "League (Country name)", y = "Number of players",
caption = glue::glue("
Source: Wikipedia
By: @R_by_Ryo")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 12,
caption.size = 10,
axis.text.size = 14,
axis.text.y.size = 11,
axis.title.size = 16,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white"))
player_contrib_league_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = player_contrib_league_plot, filename = here::here("Copa America 2019/output/player_contrib_league_plot.png"),
height = 7, width = 8)
```
The best of the best players from South American countries will move on to Europe so the Argentinean league (Superliga Argentina) and the Brazilian league (Brasileirão - Serie A) do not have as many players as you might think and as a consequence, the top leagues of England, Spain, and Italy contribute quite a bit! A lot of the better players but not quite elite South American players might go to Mexico instead of a lower-mid European league. With the growth of the MLS a fair number of players ply their trade there as well.
We can take a more detailed look by creating a table of the proportion of players from each squad coming from either a domestic league or any other league. I had to do a lot of wrangling to get the proper output for the table. After calculating the percentage of domestic players from a country's domestic league I added the full data back in. Then I had to make sure that for each country, the country - domestic league country was the first row in each of the country groups (so Bolivia - Bolivia, Bolivia, China, Japan - Japan, Japan - England, etc.). By doing this I can automatically `tidyr::fill()`-in the rest of the rows of that country with the 'percentage of players from domestic league stat'.
```{r}
squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
filter(country == country_league) %>%
mutate(perc_from_domestic_league = percent(player_from_league / 23, accuracy = 0.1)) %>%
right_join(squads_df_clean %>%
group_by(country, country_league) %>%
summarize(player_from_league = n()) %>%
ungroup()) %>%
mutate(first = case_when(
country == country_league ~ 1,
TRUE ~ 0)) %>%
arrange(country, desc(first)) %>%
fill(perc_from_domestic_league) %>%
group_by(country) %>%
mutate(perc_from_league = percent(player_from_league / 23, accuracy = 0.1),
country_league = glue::glue("{country_league} - league")) %>%
arrange(desc(player_from_league)) %>%
select(Country = country, `League (country name)` = country_league,
`Number of players from league` = player_from_league,
`Percentage of players from league` = perc_from_league,
`Percentage of players from domestic league` = perc_from_domestic_league) %>%
head(10) %>%
knitr::kable(format = "html",
caption = "Breakdown of Player Contribution by League") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Three interesting facts I found:
- 30% of players on the Brazil squad play for an English team, most out of any league - squad combination excluding domestic leagues.
- 100% of the Qatar squad play in their domestic league!
- Only one Uruguayan player (4.3%) plays in its domestic league.
### Player contribution by club
In the final plot for this section, I looked at the top 10 clubs contributing the most players to the tournament. I used `arrange(desc(n)) %>% slice()` instead of `top_n()` as there were too many teams tied at 4 players. To set the team names inside the bars I created a midpoint value `midval` that calculated a value half of the number of players contributed so the labels were placed neatly.
```{r fig.height = 6, fig.width = 8}
player_contrib_club_plot <- squads_df_clean %>%
group_by(club) %>%
summarize(n = n()) %>%
mutate(club = club %>% forcats::as_factor() %>% forcats::fct_reorder(n),
midval = n / 2) %>%
arrange(desc(n)) %>%
slice(1:15) %>%
ggplot(aes(x = club, y = n)) +
geom_col(fill = "red") +
geom_text(aes(label = n, family = "Roboto Condensed", fontface = "bold"),
size = 7.5, color = "yellow",
nudge_y = 0.5) +
geom_text(aes(y = midval, label = club,
family = "Roboto Condensed", fontface = "bold"),
size = 5, color = "white") +
coord_flip() +
scale_y_continuous(breaks = scales::pretty_breaks(),
expand = c(0, 0),
limits = c(0, 10.5)) +
labs(title = "Top 15 Clubs contributing the most players to the Copa América",
x = "Club", y = "Number of players",
caption = "Source: Wikipedia") +
theme_copaAmerica(
title.size = 18,
subtitle.size = 12,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_line(color = "white")) +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
player_contrib_club_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = player_contrib_club_plot, filename = here::here("Copa America 2019/output/player_contrib_club_plot.png"),
height = 6, width = 8)
```
With 100% of its players coming from the domestic league it's not surprise that the Qatari team, Al-Sadd, is contributing the most players to the tournament. Tied with another Qatari team, Mexican club America features 7 players yet none of them are Mexicans (2 Argentineans, 2 Colombians, 1 Ecuadorian, 1 Chilean, and 1 Paraguayan).
At first I thought Barcelona contributed 8 players until I realized the Ecuadorian players were coming from the Ecuadorian team called Barcelona...I had to go all the way back up to the beginning of this section to fix that small peculiarity. As futbol came to South America via European colonists and immigrants a lot of teams took up the names and colors of the teams these Europeans were fond of. Other examples include Liverpool F.C. (Montevideo, Uruguay), Arsenal de Sarandi (Buenos Aires, Argentina), and Club Atletico Juventus (Sao Paulo, Brazil - although they use the colors of Torino F.C.).
If you download the data and type in the code below you can see the entire club-country list.
```{r, eval=FALSE}
squads_df_clean %>%
group_by(club, country) %>%
summarize(n = n()) %>% View()
```
## Match Records
Now that we got a good look at the composition of the teams, we can take a look at how they've done at every Copa América.
The next code chunk mainly comes from PH Julien and his excellent Kaggle kernel of ["A Journey Through The History of Soccer"](https://www.kaggle.com/phjulien/a-journey-through-the-history-of-soccer/).
```{r, message=FALSE, eval=FALSE}
## grab football federation affiliations data
federation_files <- Sys.glob("../data/federation_affiliations/*")
df_federations = data.frame(country = NULL, federation = NULL)
for (f in federation_files) {
federation = basename(f)
content = read.csv(f, header=FALSE)
content <- cbind(content,federation=rep(federation, dim(content)[1]))
df_federations <- rbind(df_federations, content)
}
colnames(df_federations) <- c("country", "federation")
df_federations <- df_federations %>%
mutate(country = as.character(country) %>% str_trim(side = "both"))
results_raw <- readr::read_csv("../data/results.csv")
results_copa <- results_raw %>%
filter(tournament == "Copa América") %>%
rename(venue_country = country,
venue_city = city) %>%
mutate(match_num = row_number())
## combine with federation affiliations
results_copa_home <- results_copa %>%
left_join(df_federations,
by = c("home_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(home_federation = federation)
results_copa_away <- results_copa %>%
left_join(df_federations,
by = c("away_team" = "country")) %>%
mutate(federation = as.character(federation)) %>%
rename(away_federation = federation)
## combine home-away
results_copa_cleaned <- results_copa_home %>%
full_join(results_copa_away)
```
Unfortunately, this data does not have __penalty__ results as those games are all counted as a draw (as technically that is the actual result). Considering there a lot of cagey knock-out rounds that finish in a penalty shoot-out (including the last two finals...) it is unfortunate but that's just the data you have sometimes. There is a way to web-scrape all the Copa América results and assign Win-Lose to those games that went to penalties but I'll leave that for another time. Also, there is no info on what stage of the tournament the match recorded is in.
```{r, eval=FALSE}
results_copa_cleaned <- results_copa_cleaned %>%
mutate(
home_federation = case_when(
home_team == "USA" ~ "Concacaf",
TRUE ~ home_federation),
away_federation = case_when(
away_team == "USA" ~ "Concacaf",
TRUE ~ away_federation)) %>%
select(-contains("federation"), -contains("venue"),
-neutral, date, home_team, home_score, away_team, away_score,
tournament, venue_city)
```
```{r, echo=FALSE, eval=FALSE}
saveRDS(results_copa_cleaned, file = here::here("data/results_copa_cleaned.RDS"))
```
```{r, echo=FALSE}
results_copa_cleaned <- readRDS(file = here::here("data/results_copa_cleaned.RDS"))
```
```{r}
glimpse(results_copa_cleaned)
```
Now that it's nice and cleaned up I created a function that reshapes the data so that it's set from a certain team's perspective with the "team" argument. You can also set the function to look for only results against a certain opponent by filling in the `versus` argument.
```{r}
copaAmerica_resultados <- function(data, team, versus = NA) {
## team of interest: ex. 'Brazil'
team_var <- enquo(team)
todos_partidos <- data %>%
## filter only for results of team of interest
filter(home_team == !!team_var | away_team == !!team_var) %>%
## reshape columns to team vs. opponent
mutate(
opponent = case_when(
away_team != !!team_var ~ away_team,
home_team != !!team_var ~ home_team),
home_away = case_when(
home_team == !!team_var ~ "home",
away_team == !!team_var ~ "away"),
equipo_goals = case_when(
home_team == !!team_var ~ home_score,
away_team == !!team_var ~ away_score),
opp_goals = case_when(
home_team != !!team_var ~ home_score,
away_team != !!team_var ~ away_score)) %>%
## label results from team's perspective
mutate(
result = case_when(
equipo_goals > opp_goals ~ "Win",
equipo_goals < opp_goals ~ "Loss",
equipo_goals == opp_goals ~ "Draw")) %>%
mutate(result = result %>% forcats::as_factor() %>% forcats::fct_relevel(c("Win", "Draw", "Loss"))) %>%
select(-contains("score"), -contains("team"), -match_num) %>%
rename(Date = date, Tournament = tournament, `Venue` = venue_city, Opponent = opponent, `Home / Away` = home_away,
`Goals For` = equipo_goals, `Goals Against` = opp_goals, Result = result)
if (is.na(versus) | is.null(versus)) {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
## sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
## spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric)
missing_cols <- c("Win", "Draw", "Loss") %>%
map_dfr( ~tibble(!!.x := numeric()))
resultados_totalmente <- resultados_totalmente %>%
bind_rows(missing_cols) %>%
mutate(Win = if_else(is.na(Win), 0, Win),
Draw = if_else(is.na(Draw), 0, Draw),
Loss = if_else(is.na(Loss), 0, Loss)) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
} else {
## opponent: ex. 'Argentina'
todos_partidos <- todos_partidos %>%
filter(Opponent == versus)
if (nrow(todos_partidos) == 0) {
return(glue("{team} has never played {versus} at the Copa América!"))
} else {
resultados_totalmente <- todos_partidos %>%
group_by(Result, Opponent) %>%
mutate(n = n()) %>%
ungroup() %>%
# sum amount of goals by team and opponent
group_by(Result, Opponent) %>%
summarize(e_g = sum(`Goals For`),
o_g = sum(`Goals Against`),
n = n()) %>%
ungroup() %>%
# spread results over multiple columns
spread(Result, n) %>%
mutate_if(is.integer, as.numeric) %>%
group_by(Opponent) %>%
summarize(Win = sum(Win, na.rm = TRUE),
Draw = sum(Draw, na.rm = TRUE),
Loss = sum(Loss, na.rm = TRUE),
`Goals For` = sum(e_g),
`Goals Against` = sum(o_g))
return(list(resultados_totalmente, todos_partidos))
}
}
}
```
The output is either a dataframe of all the games a team has been involved in as well as the record of the team against other teams in the Copa América or a message saying that the team you picked has never played against the opponent you picked.
### Japan
```{r, warning=FALSE}
copaAmerica_resultados(data = results_copa_cleaned,
team = "Japan", versus = "Brazil")
```
Oh... that's right Japan has never played against Brazil at the Copa...
```{r}
resultados_japon <- copaAmerica_resultados(data = results_copa_cleaned, team = "Japan")
resultados_japon[[2]] %>%
knitr::kable(format = "html",
caption = "Japan's record in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Japan's only previous journey to the Copa América was in the 1999 edition where they lost all 3 games. They were invited for the 2011 edition but withdrew due to the Tohoku Earthquake and were replaced by Costa Rica. Japanese football has come a long way since 1999 but with a young squad it will be a uphill battle to get 3 points against any of their Group C opponents, Uruguay, Chile, and Ecuador.
### Colombia
```{r}
resultados_colombia <- copaAmerica_resultados(data = results_copa_cleaned, team = "Colombia")
resultados_colombia[[2]] %>%
slice(87:92) %>%
knitr::kable(format = "html",
caption = "Colombia's record in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Despite a recent resurgence of the Colombia national team they have not been able to match the feats of the 2001 side that won the Copa with their best place finish since then coming 3rd in 2004. The 2001 team were not only unbeaten but also did not concede a single goal throughout the tournament!
### Superclasico Sudamericano: Brazil vs. Argentina
```{r}
resultados_de_brazil <- copaAmerica_resultados(data = results_copa_cleaned,
team = "Brazil", versus = "Argentina")
resultados_de_brazil[[1]] %>%
knitr::kable(format = "html",
caption = "Brazil vs. Argentina in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
```{r}
resultados_de_brazil[[2]] %>%
tail(5) %>%
knitr::kable(format = "html",
caption = "Brazil vs. Argentina in the Copa América") %>%
kableExtra::kable_styling(full_width = FALSE)
```
Brazil does not have a good overall record vs. Argentina but they have not lost against their rivals at the Copa América since the 1993 edition where they lost 5-6 on penalties in the Quarter Finals. The "draw" in 1995 was won on penalties while the "draw" in 2004 was actually in the final where they won 4-2 on penalties.
What I found odd was that the Copa América seems to have a very low priority to certain countries, especially Brazil who have repeatedly sent their B or C teams to the tournament in favor of sending their best team to other tournaments or resting star players. Funnily enough these understrength Brazilian squads have actually won the entire tournament a few times, most notably in 2007 against a full strength Argentina side containing the likes of Zanetti, Riquelme, Cambiasso, Tevez, a young Messi/Mascherano, Cambiasso, et al!
## Player Profiles
After looking at the history of the competition and the composition of the squads I examined the players and their form coming into the Copa América. In recent years football analytics has really taken off and there have been many strides made in creating more informative statistics to assess players' abilities, the most prominently being the __xG__ statistics. This is the first time I talk about __xG__ in any length/depth so this introduction is as much to solidify my understanding as well as yours!
### What IS xG?
- __xG__: Quantitative measure (between 0 and 1) that assigns a probability that a shot taken will result in a goal based on a variety of variables and is used for evaluating the quality of chances and predicting players' and teams' future performances.
- __xA__: Quantitative measure (between 0 and 1) that assigns a probability that a given pass will result in an assist for a goal based on a variety of variables.
Common variables used in the models that output xG statistics are the distance and angle of a shot, the body part used, rebound, among others. Similar to how you might assess your favorite striker's chances of scoring just as he is lining up to take a shot: Is the shot a header? Is he trying to score from a cross in regular play or a corner kick? Are there a crowd of defenders in front of him or is he one-on-one with the goalkeeper? Etc. You might think __who__ takes a shot would be a genuine factor but in actuality it tells you a lot less about the chances of a goal compared to the location of the shot.
Note that there isn't a SINGLE xG model. You can check out a blog post comparing different people's xG models [here](https://mackayanalytics.nl/2017/06/19/how-accurate-are-xg-models-ii-the-big-chance-dilemma/). People and organizations (from Statsbomb to OptaPro) have their own ideas about what __could__ be the important variables in play and as such it's important to report from which source you got your data from as the stats can differ between models. A few things xG does not factor in are things like goalkeeper performance (someone pulling off incredible saves or letting in a poor shot) and one must also consider the fact that team style of play and the quality of a player's teammates. When judging players based on these stats it is important to be aware of contextual factors like the team they play for, their opponent, and the player's position/role in the team.
From xG and xA more granular statistics such as xGChain and xGBuildup were created to be able to dig a little deeper into who is contributing to chance creation, I'll introduce the latter two a bit later. As the field has grown new statistics have popped up such as [Karun Singh](https://twitter.com/karun1710/)'s "expected threat" or xT. You can check out an introduction to xT from [here](https://karun.in/blog/expected-threat.html).
Of course, these statistics only tell a part of the story and are definitenly not the be-all-and-end-all. In the context of this current blog post, these stats only tell the story about how these players did for their club teams this past season rather than for their national team. Even still it gives us a good idea of what kind of form these players are in coming into this tournament.
You might also want to watch these Youtube videos by [TifoFootball](https://www.youtube.com/channel/UCGYYNGmyhZ_kwBF_lqqXdAQ) for a quick primer on [xG](https://www.youtube.com/watch?v=zSaeaFcm1SY) and [xA](https://www.youtube.com/watch?v=1MdlkuzLdj4).
### understat data
For the data I used the website, `understat.com`. Their xG models were created via training a neural network on a dataset consisting of over 100,000 shots using more than 10 different variables. Getting data from `understat` has been made easy by Ewen Henderson's `understatr` package available from [Github](https://github.com/ewenme/understatr) (he's also the guy that made the [ghibli](https://github.com/ewenme/ghibli) color palette!). I tried to pick a wide selection of attacking players but I was also limited by the fact that `understat` only has data for teams/players from six European leagues (Premier League, Bundesliga, Serie A, La Liga, Ligue 1, and Russian Premier League).
For __Peru__ I would have chosen Paolo Guerrero but as he plays in Brazil now I went with Jefferson Farfan (who hasn't played as many games as the other players used for comparison unfortunately...). For __Chile__ I would pick Eduardo Vargas but he as doesn't play for a team covered by understat I went with Alexis Sanchez, who had a woeful season and only played ~600 minutes despite appearing in ~20 league matches and later added Arturo Vidal. For __Brazil__ I included Neymar initially but since he won't actually be playing I'll keep him for comparison's sake but also include Gabriel Jesus and Roberto Firmino who have been fighting for the starting striker spot. Note that these two aren't the ones replacing Neymar __positionally__. In Neymar's left-wing position I can see David Neres or Phil Coutinho replacing him (Richarlison and Willian mostly play on the right). (Edit: In the first match vs. Bolivia, David Neres started off on the left while Richarlison played on the right, Coutinho played just behind Bobby Firmino)
The other nation's strikers/attacking midfielders don't play for the six European leagues covered by understat or like in Shinji Okazaki's case just did not play as many minutes/games during the season. To get the data I created a list of the player codes and use `purrr::map()` to iterate each through the `understatr::get_player_seasons_stats()` function.
```{r, eval=FALSE}
player_codes <- c(2097, 2099, 813, ## Messi, Neymar, Rondon
498, 4299, 696, ## Alexis, Farfan, Falcao
3294, 2098, 5543, ## Cavani, Suarez, G. Jesus
482, 1148, 2249, ## Bobby, Duvan, James
1089, 3553, 488, ## Cuadrado, Di Maria, Coutinho
222) ## Arturo Vidal
understat_data <- player_codes %>%
map(., ~ understatr::get_player_seasons_stats(.x)) %>%
reduce(bind_rows) %>%
select(-player_id, -position, -yellow, -red)
```
```{r, eval=FALSE, echo=FALSE}
saveRDS(understat_data, file = here::here("data/copa_america_understat.RDS"))
```
```{r, echo=FALSE}
understat_data <- readRDS(file = here::here("data/copa_america_understat.RDS"))
```
```{r}
glimpse(understat_data)
```
As you can see the data consists of a row for each player and each year (from the 2014/2015 season to the 2018/2019 season). I tried to mitigate the fact that some players played a lot more minutes than others by standardize everything to a 'per 90 minutes' value but this does have its own disadvantages. These include the fact that players who play a lot of minutes (as regular starting members) may not have as high 'per 90' stat even though their production with all these minutes might suggest that they are consistently performing and producing at a high level.
It'll be a bit crowded (kind of like a spilt box of Skittles...) but let's check out the key metrics for all the players at once.
Note: npg = non-penalty goals, npxG = non-penalty goals xG
```{r, fig.height = 6, fig.width = 8}
comparison_data <- understat_data %>%
filter(year == 2018) %>%
select(-games, -team_name, -year) %>%
rename(Shots = shots, KP = key_passes) %>%
gather(key = "key", value = "value", -player_name, -time) %>%
mutate(key = forcats::as_factor(key) %>%
forcats::fct_relevel(.,
"xG", "goals", "npxG", "npg",
"xA", "assists", "xGChain", "xGBuildup",
"Shots", "KP"))
comparison_strikers_plot <- comparison_data %>%
filter(key != "Shots", key != "KP",
key != "xGBuildup", key != "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
geom_vline(xintercept = 4.5, size = 2) +
geom_vline(xintercept = 5.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.26)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Comparison: Top attackers at the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_strikers_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = comparison_strikers_plot, filename = here::here("Copa America 2019/output/comparison_strikers_plot.png"),
height = 5, width = 8)
```
As usual in these types of charts, Messi is leading a lot of the metrics here and showing consistency too with having played the third highest amount of minutes out of the selected players. It's helpful to have the xG/xA stats next to the actual goals/assists as it provides an indication of whether the player in question is scoring shots that he probabilistically should be scoring. When a player's actual goal count is higher than their xG stat this suggests that the player is __"exceeding their xG"__ or that they are scoring from shots that are historically hard to score from. It can be seen as a marker of an elite finisher as they are putting away chances from difficult situations consistently. In terms of assists and xA Alexis Sanchez, who only played about 600 minutes, looks a lot better than in reality due to the aforementioned disadvantage of standardizing everything to a "per 90 minutes" value. Normally you would have a cut-off based on a certain __minimum amount of minutes__ but as I mentioned I was rather limited in my choice of players.
A way to take a closer look at xG - Goals and xA - Assists is to use a simple dot plot with a line going through the 45 degree angle. Those below the line are underperforming relative to their xG or xA stat, those over it are overachieving ("exceeding" their xG/xA stat) while those just on the line are scoring or assisting right around what the model expects the player to be. I use non-penalty xG below as penalties have around ~0.75 xG (give or take a few percentage points depending on the model) and can inflate the stats of those players who take a lot of penalties and score them, especially if they weren't the ones who earned the penalty themselves.
```{r, fig.height = 5, fig.width = 8}
expected_goal_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, npxG, xG, goals) %>%
mutate_at(c("npxG", "xG", "goals"), ~. / time * 90) %>%
ggplot(aes(x = npxG, y = goals, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
scale_x_continuous(limits = c(0, 1.1),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 1.3),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
labs(title = "Expected vs. Actual Goals",
subtitle = "For select group of attacking players with data available from understat.com",
x = "Non-penalty xG per 90 minutes",
y = "Goals per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_goal_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = expected_goal_plot, filename = here::here("Copa America 2019/output/expected_goal_plot.png"),
height = 5, width = 8)
```
Gabriel Jesus is quite clearly below the 45 degree line meaning that he has been very poor at finishing chances (and/or incredibly unlucky). After a poor World Cup where he scored 0 goals as a starter, he is really going to have to step up to fill Neymar's goalscoring boots for this tournament. However, his build-up play for City has still been good this past season and he has been scoring for Brazil in the friendlies leading up to the tournament so it's going to be a hard decision for Tite to decide on who starts against Bolivia (edit: Firmino started and contributed an assist while Jesus replaced him in the 65th minute).
```{r, fig.height = 5, fig.width = 8}
expected_assists_plot <- understat_data %>%
filter(year == 2018) %>%
select(player_name, time, xA, assists) %>%
mutate_at(c("xA", "assists"), ~. / time * 90) %>%
ggplot(aes(x = xA, y = assists, fill = player_name)) +
geom_abline(intercept = 0, slope = 1, color = "white", size = 1.1) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
labs(title = "Expected vs. Actual Assists",
subtitle = "For select group of attacking players with data available from understat.com",
x = "xA per 90 minutes",
y = "Assists per 90 minutes",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
scale_x_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_y_continuous(limits = c(0, 0.55),
expand = c(0, 0)) +
scale_fill_manual(values = pals::glasbey(16), name = "Player") +
theme_copaAmerica(panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white"),
subtitle.size = 11)
expected_assists_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = expected_assists_plot, filename = here::here("Copa America 2019/output/expected_assists_plot.png"),
height = 5, width = 8)
```
One thing to keep in mind is that xA does not take into account the recipient of the assist pass. Even if the pass given had a high expected assist value the receiving player still might not have the quality to put it away through no fault of the passer. This might explain why most of the players with a higher xA among this group don't have the assists to match. It can also be that these players are also the ones playing a lot more minutes and the volume of chances they create just aren't translating to goals all the time.
### Key Passes, Shots, xGChain, and xGBuildup (per 90)
I separated "key passes" and "shots" as well as "xGChain" and "xGBuildup" from the rest as these two sets were on a very different scale.
```{r, fig.height=5, fig.width=7}
kp_shots_plot <- comparison_data %>%
filter(key == "Shots" | key == "KP") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 6),
breaks = c(0, 1, 2, 3, 4, 5, 6),
labels = c(0, 1, 2, 3, 4, 5, 6)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = glue("
KP = Key Passes
For select group of attacking players with data available from understat.com"),
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
kp_shots_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = kp_shots_plot, filename = here::here("Copa America 2019/output/kp_shots_plot.png"),
height = 5, width = 7)
```
- xGChain: Quantitative measure that is the combined sum of the xG of every possession that ends in a shot that a player is involved in. The same derived value is given to each of the players involved and allows us to credit players for attacking contributions outside of just shots (xG) and assists (xA).
- xGBuildup: Similar to xGChain but excluding shots and assists. This is in response to xGChain values still being dominated by the xG and xA from shots and assists, respectively.
```{r, fig.height=5, fig.width=7}
xgbuildup_xgchain_plot <- comparison_data %>%
filter(key == "xGBuildup" | key == "xGChain") %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.01, 0.01),
limits = c(0, 1.55),
breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5),
labels = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_fill_manual(values = pals::glasbey(17), name = "Player") +
geom_vline(xintercept = 1.5, size = 2) +
labs(title = "Comparison: Stars of the Copa América",
subtitle = "For select group of attacking players with data available from understat.com",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 18,
subtitle.size = 10,
panel.grid.minor.x = element_line(color = "white"))
xgbuildup_xgchain_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = xgbuildup_xgchain_plot, filename = here::here("Copa America 2019/output/xgbuildup_xgchain_plot.png"),
height = 5, width = 7)
```
Although Gabriel Jesus has been poor at finishing his chances as seen in previous graphs, his xGChain and xGBuildup stat makes it clear that he is still contributing to City's attack outside of scoring goals himself (not to mention all the defensive work he does as well).
For example below, the stats are able to clearly differentiate between James, who is more of a playmaker, compared to Falcao and Duvan who are traditional number 9s with his superior xGBuildup, xGChain, and Key Passes values. For a more detailed overview on xGChain and xGBuildup check out Statsbomb's article [here](https://statsbomb.com/2018/08/introducing-xgchain-and-xgbuildup/).
```{r, fig.height=5, fig.width=7}
## keep colors for Colombians consistent with other plots
colombia_pal <- c("#000033", "#005300", "#009FFF", "#00FFBE")
comparison_colombia_plot <- comparison_data %>%
filter(!key %in% c("xG", "goals", "npxG", "npg", "xA", "assists"),
player_name %in% c("James Rodríguez", "Falcao", "Duván Zapata", "Juan Cuadrado")) %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_point(shape = 21, size = 5, color = "black", stroke = 1.1) +
geom_vline(xintercept = 1.5, size = 2) +
geom_vline(xintercept = 2.5, size = 2) +
geom_vline(xintercept = 3.5, size = 2) +
coord_flip() +
scale_y_continuous(expand = c(0.05, 0.05),
breaks = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4),
labels = c(0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4)) +
scale_fill_manual(values = colombia_pal, name = "Player") +
labs(title = "Comparison: Stars of Colombia",
subtitle = "KP: Key Passes",
x = NULL, y = "Metric per 90",
caption = glue::glue("
data: understat.com
2018-2019 Season")) +
theme_copaAmerica(title.size = 20,
subtitle.size = 12,
panel.grid.minor.x = element_line(color = "white"))
comparison_colombia_plot
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = comparison_colombia_plot, filename = here::here("Copa America 2019/output/comparison_colombia_plot.png"),
height = 5, width = 7)
```
There are lots of different ways to visualize this data, most famously the radar charts created by [Statsbomb](http://statsbomb.com/). Using the data you can also compare and evaluate players in many different ways and using `understatr` package you could add a few more players like Paulo Dybala, Miguel Almiron, and more! I could probably do a whole other article using just this data but I'll leave it here for now.
## Conclusion
Throughout this blog post I talked about some of the historical records, squad compositions, match records, and finally the player profiles of attacking players at this summer's Copa América. Using the power of R it is really easy to webscrape and visualize data in a way that is informative and aesthetically pleasing. I wanted to finish this before the tournament started but other life things got in the way as well as the fact that the amount of content ballooned out of control (especially the xG section) so I had to cut down a lot.
It's been fun reading the articles on the Copa América website and seeing how far my "intermediate-but-very-out-of-practice"-level of Spanish can get me to understand the content, here is one that I particularly liked reading: [14 Estadisticas de la Copa América](https://copaamerica.com/es/noticias/a-14-dias-del-inicio-de-la-copa-14-estadisticas-de-la-copa-america). With so many tournaments going on right now (and with the African Cup of Nations starting in a few days) a lot of the news media is spread thin right now but there are still some quality articles out there to read about the Copa, like [this article](https://www.bbc.com/sport/football/48600098) from BBC's South American football expert, Tim Vickery.
After the first round of games, a few points of discussion:
- After an extremely lacklustre performance vs. Colombia, how does Argentina bounce back? What tactical changes need to be made?
- Qatar impressed against Paraguay but can they pull off a major upset vs. Colombia?
- How will Japan line-up against Uruguay after a losing by a scoreline that didn't really do their performance justice? How will manager Moriyasu balance experience vs. youth, will he start with veteran Okazaki after Ueda's numerous misses vs. Chile?
- Can Brazil earn a early ticket to the next round vs. Venezuela after a clinical but not excellent performance vs. Bolivia? Will they be able to keep up their streak of winning the Copa every time they have hosted it?
Thanks for reading and...
¡Buena suerte a todos equipos!
================================================
FILE: Copa America 2019/copa_extras.Rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "6/15/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
https://copaamerica.com/es/noticias/a-14-dias-del-inicio-de-la-copa-14-estadisticas-de-la-copa-america
fixture calendar: https://copaamerica.com/static/CA2019-calendar-ESP-1-d78531672f9b0a782f78f96f57acb21c.png
Colores:
https://i.ytimg.com/vi/ByYw8EyMr1E/maxresdefault.jpg
```{r}
#142052 darknavy
#7CAE51 light grassy green
#3E632F dark green
#FEDB45 yellow gold
#31359D purple
#245D98 lightblue
#012488 deep blue
#4B9213 med green
```
host nation performance
- first game results
guest nation performance >>> USA 3rd Place in 2016... also hosts though...
time-difference
how far does each team have to travel??
- hypothetical scenarios all the way to the final: Brazil, Argentina, Japan??
- grab names of stadiums then `geocode()`
- brazil training in teresopolis near Rio before the tournament...
- which leagues have contributed most players? boring bar chart
-- which team
-- percentage of players coming from each own nation's team?
- avg. age per position? per team?
- squad numbers and position?
top performers >>> understat (only for top 5 leagues...)
-->> percentile rank of select players in similar position?
- Sh90, KP90, xG, xG90, xA, xA90, xGChain90, xGBuildup90
gt tables: green/yellow/red ABOVE+BELOW league avg. (at least 30 appearances)
create radar charts?
```{r, fig.height = 5, fig.width = 8}
age_position_plot <- squads_df_clean %>%
group_by(position) %>%
mutate(median = median(age)) %>%
ungroup() %>%
ggplot(aes(x = age)) +
geom_histogram(fill = "red", binwidth = 1) +
geom_vline(aes(xintercept = median), size = 1.2) +
labs(x = "Age", y = NULL) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme_copaAmerica(title.size = 24,
subtitle.size = 14,
caption.size = 8,
axis.text.size = 14,
axis.title.size = 16,
strip.text.size = 18,
panel.grid.minor.x = element_line(color = "white"),
panel.grid.minor.y = element_line(color = "white")) +
facet_wrap(~position)
age_position_plot
```
## Team Profiles
soccerway
### Argentina
```{r}
```
probable line-up? >>> most common since world cup?
Goals + assists
### Brazil
```{r}
url <- "https://us.soccerway.com/teams/brazil/brazil/349/matches/"
session <- bow(url)
matches_links <- scrape(session) %>%
html_nodes(".score a") %>%
html_attr("href") %>%
tail(9)
matches_link_df <- matches_links %>%
tibble::enframe(name = NULL, value = "link") %>%
mutate(link = glue::glue("https://us.soccerway.com{link}"))
one_url <- "https://us.soccerway.com/matches/2018/10/16/world/friendlies/argentina/brazil/2916264/"
## starting eleven
bow(one_url) %>%
scrape() %>%
html_nodes(".right .large-link") %>%
html_text()
bow(one_url) %>%
scrape() %>%
html_nodes(".left .large-link") %>%
html_text()
## home team - away team
## goal scorers
## result
```
probable line-up?
Goals + assists
- regex shenanigans...
```{r}
```
### Lionel Messi
```{r}
messi_data_raw <- understatr::get_player_seasons_stats(2097)
messi_data_clean <- messi_data_raw %>%
filter(year == 2018) %>%
select(-player_id, -year, -team_name,
-position, -yellow, -red)
```
### Luis Suarez
```{r}
suarez_data_raw <- understatr::get_player_seasons_stats(2098)
glimpse(suarez_data_raw)
suarez_data_clean <- suarez_data_raw %>%
filter(year == 2018) %>%
select(-player_id, -year, -team_name,
-position, -yellow, -red)
```
```{r}
comparison_data <- bind_rows(messi_data_clean, suarez_data_clean) %>%
select(-games, -time) %>%
gather(key = "key", value = "value", -player_name) %>%
mutate(key = forcats::as_factor(key) %>%
forcats::fct_relevel(.,
"xG", "goals", "xA", "assists"))
comparison_data %>%
ggplot(aes(x = key, y = value, color = player_name)) +
geom_point(size = 2) +
coord_flip() +
labs(title = "Comparison: Messi vs. Suarez") +
theme_copaAmerica() +
theme(legend.position = "none")
```
of course when you have a lot of players or players with similar stats then it becomes harder to see and that's where the radar chart can be more appropriate.
```{r}
data <- bind_rows(messi_data_clean, suarez_data_clean) %>%
select(-games, -time) %>%
#mutate_at(vars(-player_name), rescale) %>%
select(1:6)
Attributes = colnames(data)
AttNo = length(Attributes)
circleFun <- function(center = c(0,0),diameter = 1, npoints = 100){
r = diameter / 2
tt <- seq(0,2*pi,length.out = npoints)
xx <- center[1] + r * cos(tt)
yy <- center[2] + r * sin(tt)
return(data.frame(x = xx, y = yy))
}
circle1 <- circleFun(c(0,0),200,npoints = 100)
circle2 <- circleFun(c(0,0),150,npoints = 100)
circle3 <- circleFun(c(0,0),100,npoints = 100)
circle4 <- circleFun(c(0,0),50,npoints = 100)
angle_spilt <- (2*pi) / (AttNo)
angle_spilt_seq <- seq(0,(2*pi),angle_spilt)
# empty dataframes to catch results
LineData <- data.frame(x = numeric, y = numeric, stringsAsFactors = F)
TitlePositioning <- data.frame(title = character, x = numeric, y = numeric, stringsAsFactors = F)
## create plot background construction data
for (i in 1:NCOL(data)) {
angle_multiplier <- if(i < NCOL(data)){i}else{1}
radians_for_segment <- angle_spilt_seq[i]
x <- 100 * cos(radians_for_segment)
y <- 100 * sin(radians_for_segment)
temp <- data.frame(x = x, y = y, stringsAsFactors = F)
LineData <- rbind(temp, LineData)
x <- 112 * cos(radians_for_segment)
y <- 112 * sin(radians_for_segment)
title <- colnames(data)[i]
temp <- data.frame(title = title, x = x, y = y, stringsAsFactors = F)
TitlePositioning <- rbind(temp, TitlePositioning)
}
## create the value labellings data
values <- c(25,50,75)
radian_for_values <- angle_spilt / 2
x <- values * cos(radian_for_values)
y <- values * sin(radian_for_values)
ValuePositioning <- data.frame(values = values, x = x, y = y, stringsAsFactors = F)
## Add the origin values for the lines
LineData$x2 <- 0
LineData$y2 <- 0
# empty dataframe to catch result
polydata <- data.frame(player = character, value = numeric,
radians = numeric,
x = numeric, y = numeric, stringsAsFactors = F)
## create polygon data for the players
for (i in 1:NCOL(data)) {
for (p in 1:NROW(data)) {
player2calc <- data[p,]
angle_multiplier <- if(i < NCOL(data)){i}else{1}
radians_for_segment <- angle_spilt_seq[i]
x <- player2calc[i] * cos(radians_for_segment)
y <- player2calc[i] * sin(radians_for_segment)
player <- rownames(data)[p]
temp <- data.frame(player = player, value = player2calc[i],
radians = radians_for_segment, x = x, y = y,
stringsAsFactors = F)
polydata <- rbind(temp, polydata)
}
}
#e data up into player 1 and 2
playersDB <- unique(polydata$player)
player1 <- polydata[which(polydata$player == playersDB[1]),]
player2 <- polydata[which(polydata$player == playersDB[2]),]
## create the title string for player 1
Player1_title <- gsub('([[:upper:]])', ' \\1', playersDB[1])
Player1_title <- trimws(Player1_title)
## Create Title Strings for Player 2
Player2_title <- gsub('([[:upper:]])', ' \\1', playersDB[2])
Player2_title <- trimws(Player2_title)
require(ggplot2)
## Add the radar background
ggplot() + xlim(c(-120, 120)) + ylim(c(-120, 150)) +
## Add circles
geom_polygon(data = circle1, aes(x=x,y=y),fill = "#F0F0F0", colour = "#969696") + geom_polygon(data = circle2, aes(x=x,y=y),fill = "#FFFFFF", colour = "#d9d9d9") + geom_polygon(data = circle3, aes(x=x,y=y),fill = "#F0F0F0", colour = "#d9d9d9") + geom_polygon(data = circle4, aes(x=x,y=y),fill = "#FFFFFF", colour = "#d9d9d9") +
## Change the theme to void
theme_void() +
## Add the segment lines and attribute/value titles
geom_segment(data=LineData, aes(x = LineData$x, y = LineData$y, xend = LineData$x2, yend = LineData$y2),colour = "#d9d9d9", linetype = "dashed") +
annotate("text", x = TitlePositioning$x , y = TitlePositioning$y, label = TitlePositioning$title, size= 2.5) +
annotate("text", x = ValuePositioning$x , y = ValuePositioning$y, label = ValuePositioning$values, size= 2.5, colour = "#969696") +
## Add player 1 data
geom_polygon(data = player1, aes(x=x,y=y),fill = "#A30845", colour = "#A30845", alpha = 0.3) + geom_point(data = player1, aes(x = x, y = y),size=0.3, colour= "#A30845") +
## Add Chart Title
annotate("text", x = -110 , y = 130, label = Player1_title, size= 5, colour = "#A30845", family = "Helvetica", fontface = "bold", hjust = 0) +
annotate("text", x = 110 , y = 130, label = "FIFA 18 Data", size= 4, colour = "#969696", family = "Helvetica", fontface = "bold", hjust = 1) +
## Add the player 2 polygon and data points
geom_polygon(data = player2, aes(x=x,y=y),fill = "#00B20B", colour = "#00B20B", alpha = 0.3) +
geom_point(data = player2, aes(x = x, y = y),size=0.3, colour= "#00B20B") +
## Add the titles for player 2
annotate("text", x = -110 , y = 116, label = Player2_title, size= 5, colour = "#00B20B", family = "Helvetica", fontface = "bold", hjust = 0) +
annotate("text", x = -110 , y = 123 , label = "vrs", size= 3, colour = "#969696", family = "Helvetica", hjust = 0)
```
### Brazil's Number 9
```{r}
comparison_data %>%
filter(player_name %in% c("Roberto Firmino", "Gabriel Jesus")) %>%
mutate(value = value / time * 90)
```
```{r, fig.height = 5, fig.width = 8, warning=FALSE, message=FALSE}
colores <- c("Roberto Firmino" = "red",
"Gabriel Jesus" = "lightblue")
comparison_data %>%
filter(key != "shots" | key != "key_passes") %>%
filter(player_name %in% c("Roberto Firmino", "Gabriel Jesus")) %>%
#mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, show.legend = FALSE, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.05, 0.05)) +
scale_fill_manual(values = colores, name = "Player") +
labs(title = "Comparison: Brazil's Number 9s",
x = NULL, y = "Metric") +
theme_copaAmerica(title.size = 14,
panel.grid.minor.x = element_line(color = "white")) -> comparison_value
comparison_data %>%
filter(key != "shots" | key != "key_passes") %>%
filter(player_name %in% c("Roberto Firmino", "Gabriel Jesus")) %>%
mutate(value = value / time * 90) %>%
ggplot(aes(x = key, y = value, fill = player_name)) +
geom_jitter(shape = 21, size = 5, color = "black", width = 0.25, show.legend = FALSE, stroke = 1.1) +
coord_flip() +
scale_y_continuous(expand = c(0.05, 0.05)) +
scale_fill_manual(values = colores, name = "Player") +
labs(title = "Comparison: Brazil's Number 9s",
x = NULL, y = "Metric per 90",
caption = glue::glue("data: understat.com")) +
theme_copaAmerica(title.size = 14,
panel.grid.minor.x = element_line(color = "white")) -> comparison_per90
plot_grid(comparison_value, comparison_per90, ncol = 2)
```
```{r, eval=FALSE, echo=FALSE}
ggsave(plot = player_contrib_club_plot, filename = here::here("Copa America 2019/output/player_contrib_club_plot.png"),
height = 6, width = 8)
```
================================================
FILE: Eredivisie 2018-2019/player_goal_contribution_matrix.rmd
================================================
---
title: "Untitled"
author: "RN7"
date: "5/25/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# pkgs
```{r, message=FALSE, warning=FALSE}
pacman::p_load(tidyverse, polite, scales, ggimage, ggforce,
rvest, glue, extrafont, ggrepel, magick)
loadfonts()
```
## add_logo
```{r}
add_logo <- function(plot_path, logo_path, logo_position, logo_scale = 10){
# Requires magick R Package https://github.com/ropensci/magick
# Useful error message for logo position
if (!logo_position %in% c("top right", "top left", "bottom right", "bottom left")) {
stop("Error Message: Uh oh! Logo Position not recognized\n Try: logo_positon = 'top left', 'top right', 'bottom left', or 'bottom right'")
}
# read in raw images
plot <- magick::image_read(plot_path)
logo_raw <- magick::image_read(logo_path)
# get dimensions of plot for scaling
plot_height <- magick::image_info(plot)$height
plot_width <- magick::image_info(plot)$width
# default scale to 1/10th width of plot
# Can change with logo_scale
logo <- magick::image_scale(logo_raw, as.character(plot_width/logo_scale))
# Get width of logo
logo_width <- magick::image_info(logo)$width
logo_height <- magick::image_info(logo)$height
# Set position of logo
# Position starts at 0,0 at top left
# Using 0.01 for 1% - aesthetic padding
if (logo_position == "top right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "top left") {
x_pos = 0.01 * plot_width
y_pos = 0.01 * plot_height
} else if (logo_position == "bottom right") {
x_pos = plot_width - logo_width - 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
} else if (logo_position == "bottom left") {
x_pos = 0.01 * plot_width
y_pos = plot_height - logo_height - 0.01 * plot_height
}
# Compose the actual overlay
magick::image_composite(plot, logo, offset = paste0("+", x_pos, "+", y_pos))
}
```
# Eredivisie
https://us.soccerway.com/national/france/ligue-1/20182019/regular-season/r48044/
## webscrape soccerway
```{r}
url <- "https://us.soccerway.com/national/netherlands/eredivisie/20182019/regular-season/r47971/"
session <- bow(url)
team_links <- scrape(session) %>%
html_nodes("#page_competition_1_block_competition_tables_7_block_competition_league_table_1_table .large-link a") %>%
html_attr("href")
team_links_df <- team_links %>%
enframe(name = NULL) %>%
separate(value, c(NA, NA, NA, "team_name", "team_num"), sep = "/") %>%
mutate(link = glue("
https://us.soccerway.com/teams/netherlands/{team_name}/{team_num}/squad/"),
stat_link = glue("{link %>% str_replace('squad', 'statistics')}"))
# for each team link:
player_name_info <- function(session) {
player_name_info <- scrape(session) %>%
html_nodes("#page_team_1_block_team_squad_3-table .name.large-link") %>%
html_text()
}
num_goals_info <- function(session) {
num_goals_info <- scrape(session) %>%
html_nodes(".goals") %>%
html_text()
num_goals_info_clean <- num_goals_info[-1]
}
num_assists_info <- function(session) {
num_assists_info <- scrape(session) %>%
html_nodes(".assists") %>%
html_text()
num_assists_info_clean <- num_assists_info[-1]
}
team_goals_info <- function(session) {
team_goals_info <- scrape(session) %>%
html_nodes("tr.first:nth-child(6) > td:nth-child(2)") %>%
html_text()
}
# BIG FUNCTION
eredivisie_stats_info <- function(link, statlink) {
session <- bow(link)
session2 <- bow(statlink)
player_name <- player_name_info(session = session)
num_goals <- num_goals_info(session = session)
num_assists <- num_assists_info(session = session)
team_goals <- team_goals_info(session = session2)
resultados <- list(player_name, num_goals, num_assists, team_goals)
col_names <- c("name", "goals", "assists", "team_goals")
eredivisie_stats <- resultados %>%
reduce(cbind) %>%
as_tibble() %>%
set_names(col_names)
}
```
### all at once
```{r}
# ALL 18 TEAMS AT ONCE, WILL TAKE A WHILE:
eredivisie_goal_contribution_df_ALL <- map2(.x = team_links_df$link,
.y = team_links_df$stat_link,
~ eredivisie_stats_info(link = .x, statlink = .y))
eredivisie_goal_contribution_df <- eredivisie_goal_contribution_df_ALL %>%
set_names(team_links_df$team_name) %>%
bind_rows(.id = "team_name")
## save
saveRDS(eredivisie_goal_contribution_df, file = glue("{here::here()}/data/eredivisie_goal_contrib_df_soccerway.RDS"))
```
## clean
```{r}
eredivisie_goal_contribution_clean_df <- eredivisie_goal_contribution_df %>%
mutate_at(.vars = c("goals", "assists"),
~str_replace(., "-", "0") %>% as.numeric) %>%
mutate(team = team_name %>% str_replace_all(., "-", " ") %>% str_to_title,
total_goals = as.numeric(team_goals)) %>%
group_by(team) %>%
mutate(total_assists = sum(assists),
goal_contrib = goals/total_goals,
assist_contrib = assists/total_goals) %>%
ungroup() %>%
select(-team_name, -team_goals)
## save
saveRDS(eredivisie_goal_contribution_clean_df,
file = glue("{here::here()}/data/eredivisie_goal_contrib_clean_df.RDS"))
eredivisie_goal_contribution_clean_df <- readRDS(file = glue("{here::here()}/data/eredivisie_goal_contrib_clean_df.RDS"))
```
## plot
-
-
-
```{r fig.width = 10, fig.height = 8}
## Description text
## Iago Aspas
desc_aspas <- "With 20 Goals from 12.47 xG, Iago Aspas greatly exceeded his xG as he heroically saved Celta Vigo from relegation yet again!"
## Jony xA90 of 0.29 (tied 3rd highest, minimum 30 games)
desc_jony <- "With 10 assists amounting to 34% of Alavés' xA, Jony had his career-best season as he helped Alavés contend for a European place."
## Messi
desc_goat <- "With total contribution nearly double of the next best, 36 Goals from a xG of 26, etc. Messi led almost every metric in La Liga this season!"
## PLOT!
eredivisie_goal_contribution_clean_df %>%
ggplot(aes(assist_contrib, goal_contrib)) +
geom_point(data = eredivisie_goal_contribution_clean_df %>%
filter(goal_contrib < 0.2 | assist_contrib < 0.125),
color = "grey20", size = 4, alpha = 0.2) +
geom_point(data = eredivisie_goal_contribution_clean_df %>%
filter(goal_contrib > 0.2 | assist_contrib > 0.125),
color = "red", size = 4) +
geom_hline(yintercept = 0.2, color = "grey20", alpha = 0.4) +
geom_vline(xintercept = 0.125, color = "grey20", alpha = 0.4) +
geom_text_repel(data = eredivisie_goal_contribution_clean_df %>%
filter(goal_contrib > 0.2 | assist_contrib > 0.125,
!name %in% c("Iago Aspas", "L. Messi",
"Jony", "Pablo Sarabia")),
aes(label = name, family = "Roboto Condensed", fontface = "bold"),
seed = 15, size = 4,
min.segment.length = 0, segment.color = "red",
point.padding = 0.5) +
# geom_mark_circle(aes(filter = name == "L. Messi",
# label = "Lionel Messi: GOAT",
# description = desc_goat),
# label.family = "Roboto Condensed", label.fontsize = c(16, 12)) +
# geom_mark_circle(aes(filter = name == "Iago Aspas",
# label = "Iago Aspas: The Hero of Vigo",
# description = desc_aspas),
# label.buffer = unit(10, "mm"), label.fontsize = c(16, 12),
# label.family = "Roboto Condensed") +
# geom_mark_circle(aes(filter = name == "Jony",
# label = "Jony: El Glorioso",
# description = desc_jony),
# label.buffer = unit(5, "mm"), label.fontsize = c(16, 12),
# label.family = "Roboto Condensed") +
scale_x_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3),
limits = c(0, 0.3)) +
scale_y_continuous(labels = percent_format(accuracy = 1),
breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5),
limits = c(0, 0.5)) +
labs(title = "Goal Contribution Matrix: Eredivisie (2018-2019 Season)",
subtitle = "Goal Involvement (Goals and/or Assists) as Percentage of Total Club Goals",
caption = glue("
Data: soccerway.com & understat.com
By: @R_by_Ryo"),
x = "Percentage of Club Goals Assisted",
y = "Percentage of Club Goals Scored") +
theme_minimal() +
theme(text = element_text(family = "Roboto Condensed"),
title = element_text(size = 20),
plot.subtitle = element_text(size = 16),
plot.caption = element_text(size = 10),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.minor.x = element_blank()) -> eredivisie_goal_contribution_matrix
eredivisie_goal_contribution_matrix
```
## save
```{r}
ggsave(plot = eredivisie_goal_contribution_matrix,
"../Eredivisie 2018-2019/output/goal_contribution_matrix_plot_eredivisie.png",
height = 9, width = 11)
```
```{r}
plot_logo <- add_logo(
plot_path = "../Eredivisie 2018-2019/output/goal_contribution_matrix_plot_eredivisie.png",
logo_path = "https://upload.wikimedia.org/wikipedia/commons/0/0f/Eredivisie_nieuw_logo_2017-.svg",
logo_position = "top right",
logo_scale = 10)
plot_logo
```
```{r}
image_write(image = plot_logo,
"../Eredivisie 2018-2019/output/goal_contribution_matrix_plot_logo_eredivisie.png")
```
================================================
FILE: Europe 2021-2022/fbref_sca_waffle_blogpost.Rmd
================================================
---
title: "Solution to the 'preserving the sum after rounding' problem in a soccer waffle viz!"
always_allow_html: yes
output:
md_document:
variant: markdown_github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
I recently came across a cool waffle viz for the top 20 shot-creating action players in the big 5 European soccer leagues done by [Harsh Krishna](https://twitter.com/placehoIder2004) on Twitter, see [original Tweet here](https://twitter.com/placehoIder2004/status/1480136784021393409). He also posted the code [in this Github gist](https://github.com/harshkrishna17/R-Code/blob/main/Waffle.R) with a call for help in solving an issue he was having with the preserving the sum of a group of metrics after rounding the individual metrics to integers. __Below is the code he posted in full.__ The problem in particular is after the second `for` loop where he rounds the metrics, he finds that the sum of the metrics don't add up to 100(%). His way of resolving this was choosing the `Sh_SCA` variable and trying to manipulate the values in that metric until the total summed back up to 100.
```{r eval=FALSE}
# Harsh Krishna `@placeholder2004`
# https://github.com/harshkrishna17/R-Code/blob/main/Waffle.R
# Libraries
library(tidyverse)
library(worldfootballR)
library(ggtext)
library(extrafont)
library(waffle)
library(MetBrewer)
# Scraping
data <- fb_big5_advanced_season_stats(season_end_year = 2022, stat_type = "gca", team_or_player = "player")
# Data Wrangling
data1 <- data %>%
filter(Mins_Per_90 >= 9) %>%
select(Player, Mins_Per_90, SCA90_SCA, SCA_SCA, PassLive_SCA, PassDead_SCA, Drib_SCA, Sh_SCA, Fld_SCA, Def_SCA)
data1 <- data1[order(as.numeric(data1$SCA90_SCA),decreasing = TRUE),]
data1 <- data1[c(1:20),]
df <- data1[order(as.numeric(data1$SCA90_SCA),decreasing = TRUE),]
df <- df[c(1:20),]
Player <- data1$Player
Mins <- data1$Mins_Per_90
data1 <- subset(data1, select = -c(Player, Mins_Per_90, SCA90_SCA))
for(i in 1:ncol(data1)) {
data1[, i] <- data1[, i] / Mins
}
SCA <- data1$SCA_SCA
for(i in 1:ncol(data1)) {
data1[, i] <- round((data1[, i] / SCA) * 100, 0)
}
data1 <- data1 %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)
## run this ifelse statement as many times as necessary until the Total comes out to be a 100 for all rows.
data1 <- data1 %>% mutate(Sh_SCA = ifelse(Total == 100, Sh_SCA,
ifelse(Total < 100, Sh_SCA + 1,
ifelse(Total > 100, Sh_SCA - 1, NA)))) %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)
data1$Player <- Player
data1 <- data1 %>%
pivot_longer(!Player, values_to = "SCAp90", names_to = "SCATypes") %>%
filter(!SCATypes == "SCA_SCA") %>%
filter(!SCATypes == "Total") %>%
count(Player, SCATypes, wt = SCAp90)
data1$Player <- factor(data1$Player, levels = print(df$Player))
# Custom theme function
theme_athletic <- function() {
theme_minimal() +
theme(plot.background = element_rect(colour = "#151515", fill = "#151515"),
panel.background = element_rect(colour = "#151515", fill = "#151515")) +
theme(plot.title = element_text(colour = "white", size = 24, family = "Fried Chicken Bold", hjust = 0.5),
plot.subtitle = element_markdown(colour = "#525252", size = 18, hjust = 0.5),
plot.caption = element_text(colour = "white", size = 15, hjust = 1),
axis.title.x = element_text(colour = "white", face = "bold", size = 14),
axis.title.y = element_text(colour = "white", face = "bold", size = 14),
axis.text.x = element_blank(),
axis.text.y = element_blank()) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank()) +
theme(legend.title = element_text(colour = "white"),
legend.text = element_text(colour = "white"))
}
# Plotting
data1 %>%
ggplot(aes(fill = SCATypes, values = n)) +
geom_waffle(nrows = 10, size = 1.5, colour = "#151515", flip = TRUE) +
scale_fill_manual(values = met.brewer(name = "Gauguin", n = 6, type = "discrete")) +
facet_wrap(~Player) +
labs(title = "Big 5 Leagues Shot-Creating Actions Share [2021/22]",
subtitle = "Top 20 Players with the most SCA per 90 so far",
caption = "Minimum 9 90's Played\nData from FBref\nCreated by @placeholder2004") +
theme_athletic() +
theme(aspect.ratio = 1,
strip.background = element_blank(),
strip.text = element_text(colour = "white", size = 14),
legend.position = "bottom",
legend.text = element_text(size = 14))
# Save
setwd("C:/Users/harsh_1mwi2o4/Downloads")
ggsave("wafflebig5.png", width = 3100, height = 3500, units = "px")
```
This is a problem I've faced in the past and I'm sure many others have too, whether in the context of soccer analysis or otherwise. So I decided to tackle this problem as a fun coding challenge using the __tidyverse__!
Let's get started!
# Packages
I prefer typing out the specific tidyverse packages rather than loading everything in at once.
```{r message=FALSE}
library(dplyr)
library(tidyr)
library(worldfootballR)
library(ggtext)
library(extrafont)
library(waffle)
library(MetBrewer)
```
# Scrape data
Get data from FBref/StatsBomb with the {worldfootballR} package.
```{r eval=FALSE}
# Scraping
data_raw <- fb_big5_advanced_season_stats(season_end_year = 2022, stat_type = "gca", team_or_player = "player")
```
```{r echo=FALSE}
# saveRDS(data_raw, file = here::here("data/sca_big5_demo.RDS"))
# write.csv(data_raw, file = here::here("data/sca_big5_demo.csv"))
data_raw <- read.csv(file = "https://raw.githubusercontent.com/Ryo-N7/soccer_ggplots/master/data/sca_big5_demo.csv")
```
```{r}
glimpse(data_raw)
```
Instead of ordering and subsetting with base functions in the original, I used `arrange()` and `slice()` to grab the top 20 players by SCA per 90.
```{r}
# Data Wrangling
df1 <- data_raw %>%
filter(Mins_Per_90 >= 9) %>%
## use contains() so I don't have to type out every `SCA` variable out
select(Player, Mins_Per_90, contains("SCA")) %>%
## arrange by SCA per 90 (descending) then take top 20 rows
arrange(desc(SCA90_SCA)) %>%
slice(1:20)
glimpse(df1)
```
Instead of using a `for` loop to perform the calculations on each column, I used `mutate()` and then `across()` to specify the columns I wanted to run the same operation on, dividing the metrics by the number of minutes that each player played. Also since we are using `mutate()` we don't need to pull out the `Mins` variable anymore as we can just refer to that specific column in the data.frame.
```{r}
df2 <- df1 %>%
## use across() to specify which vars to perform operations on
## ex. all cols EXCEPT `Player`, `Mins_Per_90`, and `SCA90_SCA`
mutate(across(-c(Player, Mins_Per_90, SCA90_SCA), ~ . / Mins_Per_90))
glimpse(df2)
```
Again, here instead of using a `for` loop I used the same `mutate() + across()` trick again to divide the values by each of the total SCA types and then multiply that by 100.
What you get now is that all the per 90 stats are in terms of percentages of the total SCA.
```{r}
df3 <- df2 %>%
## use across() to specify which vars to perform operations on
## ex. all cols EXCEPT `Player`, `Mins_Per_90`, and `SCA90_SCA`
mutate(across(-c(Player, Mins_Per_90, SCA90_SCA), ~ (. / SCA_SCA) * 100 )) %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA) %>%
select(Player, Mins_Per_90, Total, contains("SCA"))
glimpse(df3)
```
Then I check using `Total` column if the numbers sum to 100. They do, so we know that these per 90 numbers are good and everything adds up to 100% properly.
For the purpose of making a waffle chart, we have to turn all of these values into integers. However, the problem with this is that due to rounding the individual values, the sum doesn't equal 100 after the calculation! Some sums might equal to 99 or 98, some to 100, and some to 101!
# Preserve sum after rounding function
I looked around and tried out a **lot** of different things, not just different functions but also ways to re-shape the data so that the algorithm would work correctly. The first function I tried was the `largeRem()` function [here](https://github.com/basilesimon/largeRem/) but this only worked for when the sums would add up to -1 (99) or +1 (101) from 100. So I then started manually adding more branches to the if-else statements but since I knew there had to be something better out there I moved on. Eventually I found the function below, hoisted from the [{JLutils}](https://github.com/larmarange/JLutils/blob/master/R/round_preserve_sum.R) package that has a pretty good implementation of the **largest remainder** algorithm:
```{r}
round_preserve_sum <- function(x, digits = 0) {
up <- 10^digits
x <- x * up
y <- floor(x)
indices <- tail(order(x - y), round(sum(x)) - sum(y))
y[indices] <- y[indices] + 1
y / up
}
```
I honestly wanted to show how to do **row-wise** operations using the tidyverse in this section because I got tired of having to pivot back and forth and back and forth. See [here](https://dplyr.tidyverse.org/articles/rowwise.html) for how row-wise operations work in the tidyverse. However, the custom function I used to preserve the sum wasn't created to support the tidyverse's row-wise method, so I ended up having to transpose the data and un-transpose it with the pivot functions anyway. At the end, we can check our work by creating a `Total` variable again to see that everything sums up to 100 even after rounding all the metrics into integers!
```{r}
df4 <- df3 %>%
select(-Mins_Per_90, -SCA90_SCA, -SCA_SCA, -Total) %>%
## transpose
pivot_longer(cols = -Player) %>%
pivot_wider(names_from = Player, values_from = value) %>%
## run function over player column
mutate(across(-name, ~ round_preserve_sum(.))) %>%
## transpose back to original shape
pivot_longer(names_to = "Player", values_to = "thing", cols = -name) %>%
pivot_wider(names_from = name, values_from = thing) %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)
glimpse(df4)
```
Now all that's left is to pivot the data based on the different Shot-Creating Action types so that the data is re-shaped into the format needed for the waffle plot. I also cleaned up the SCA types so that it reads nicer on the final plot using the `case_when()` function, which is like a Super Saiyan version of if-else statements.
```{r}
df5 <- df4 %>%
## we don't need the `Total` column anymore...
select(-Total) %>%
## Pivot so that we get the all the SCA types collapsed into a single column
pivot_longer(!Player, values_to = "SCAp90", names_to = "SCATypes") %>%
mutate(SCATypes = case_when(
SCATypes == "Def_SCA" ~ "Defensive Action SCA",
SCATypes == "Drib_SCA" ~ "Dribble SCA",
SCATypes == "Fld_SCA" ~ "Fouled SCA",
SCATypes == "PassDead_SCA" ~ "Dead-ball Pass SCA",
SCATypes == "PassLive_SCA" ~ "Pass SCA",
SCATypes == "Sh_SCA" ~ "Shot SCA",
TRUE ~ SCATypes
)) %>%
count(Player, SCATypes, wt = SCAp90)
glimpse(df5)
```
# 'The Athletic' Theme
I do feel that the subtitle should pop out a bit more, but I want to preserve the original as much as possible so I won't change anything here...!
```{r}
theme_athletic <- function() {
theme_minimal() +
theme(plot.background = element_rect(colour = "#151515", fill = "#151515"),
panel.background = element_rect(colour = "#151515", fill = "#151515")) +
theme(plot.title = element_text(colour = "white", size = 24, family = "Fried Chicken Bold", hjust = 0.5),
plot.subtitle = element_markdown(colour = "#525252", size = 18, hjust = 0.5),
plot.caption = element_text(colour = "white", size = 15, hjust = 1),
axis.title.x = element_text(colour = "white", face = "bold", size = 14),
axis.title.y = element_text(colour = "white", face = "bold", size = 14),
axis.text.x = element_blank(),
axis.text.y = element_blank()) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank()) +
theme(legend.title = element_text(colour = "white"),
legend.text = element_text(colour = "white"))
}
```
# Plot
Not much needs to be changed here again, I just cleaned the legend title a bit.
**NOTE**: The players in the plot changed because, well, there's been plenty of football since the [original viz was posted](https://twitter.com/placehoIder2004/status/1480136784021393409)!
```{r fig.width=31, fig.height=35}
wafplot <- df5 %>%
ggplot(aes(fill = SCATypes, values = n)) +
geom_waffle(nrows = 10, size = 1.5, colour = "#151515", flip = TRUE) +
scale_fill_manual(values = met.brewer(name = "Gauguin", n = 6, type = "discrete"),
name = "Shot-Creating Action Type") +
facet_wrap(~Player) +
labs(title = "Big 5 Leagues Shot-Creating Actions Share [2021/22]",
subtitle = "Top 20 Players with the most SCA per 90 so far",
caption = "Minimum 9 90's Played\nData from FBref\nCreated by @placeholder2004") +
theme_athletic() +
theme(aspect.ratio = 1,
strip.background = element_blank(),
strip.text = element_text(colour = "white", size = 14),
legend.position = "bottom",
legend.text = element_text(size = 14))
wafplot
```
I've just added the call to `here::here()`, a force of habit to keep file paths relative to the project root directory. Typing out the entire path is an annoyance so you can just do this to avoid it. Also helps when you move projects to a different computer or you're collaborating using somebody else's code, you don't have to re-type `C:/Users/blahblah/blahblah/...` all the time.
```{r eval=FALSE}
ggsave(plot = wafplot, filename = here::here("Europe 2021-2022/output/big5_SCA_waffle_plot.png"),
width = 3100, height = 3500, units = "px")
```
...and done!
So this was a very short blog post on finding a solution to the "preserve sum after rounding" problem as well as re-writing some of the code to fit my coding style using the tidyverse. Just to be clear, aside from the main problem, none of the code in the original script was wrong! At the end of the day, it worked and a great viz was created, so there obviously was no problem at all. I hope the way I did it shows people there are different ways to approach a problem and that you learned about a couple of new functions and tricks along the way!
I spent about 1~2 hours on this and most of it was googling for different solutions and finding documentation which goes to show how important being able to search for the right things on the internet is for programming in general. You're never going to be able to memorize everything you've ever done, so being good at googling is paramount. Being able to look up stuff efficiently is a skill that needs to be mastered, so things like taking on coding challenges I see in the wild (like the problem presented in this blog post) or taking part in community challenges like #TidyTuesday can help you gain valuable experience.
Of course, there are also other ways to crystallize knowledge such as ... writing a blog post about an interesting problem!
Special thanks to Harsh Krishna for sharing the code to his beautiful data viz. Hope this blog post was useful to everyone!
================================================
FILE: Europe 2021-2022/fbref_sca_waffle_blogpost.md
================================================
---
layout: post
title: "J.League Soccer 2021 Season Review!"
tags: [japan, jleague, soccer, football, ggplot2, tidyverse, r-bloggers]
---
I recently came across a cool waffle viz for the top 20 shot-creating
action players in the big 5 European soccer leagues done by [Harsh
Krishna](https://twitter.com/placehoIder2004) on Twitter, see [original
Tweet
here](https://twitter.com/placehoIder2004/status/1480136784021393409).
He also posted the code [in this Github
gist](https://github.com/harshkrishna17/R-Code/blob/main/Waffle.R) with
a call for help in solving an issue he was having with the preserving
the sum of a group of metrics after rounding the individual metrics to
integers. **Below is the code he posted in full.** The problem in
particular is after the second `for` loop where he rounds the metrics,
he finds that the sum of the metrics don’t add up to 100(%). His way of
resolving this was choosing the `Sh_SCA` variable and trying to
manipulate the values in that metric until the total summed back up to
100.
``` r
# Harsh Krishna `@placeholder2004`
# https://github.com/harshkrishna17/R-Code/blob/main/Waffle.R
# Libraries
library(tidyverse)
library(worldfootballR)
library(ggtext)
library(extrafont)
library(waffle)
library(MetBrewer)
# Scraping
data <- fb_big5_advanced_season_stats(season_end_year = 2022, stat_type = "gca", team_or_player = "player")
# Data Wrangling
data1 <- data %>%
filter(Mins_Per_90 >= 9) %>%
select(Player, Mins_Per_90, SCA90_SCA, SCA_SCA, PassLive_SCA, PassDead_SCA, Drib_SCA, Sh_SCA, Fld_SCA, Def_SCA)
data1 <- data1[order(as.numeric(data1$SCA90_SCA),decreasing = TRUE),]
data1 <- data1[c(1:20),]
df <- data1[order(as.numeric(data1$SCA90_SCA),decreasing = TRUE),]
df <- df[c(1:20),]
Player <- data1$Player
Mins <- data1$Mins_Per_90
data1 <- subset(data1, select = -c(Player, Mins_Per_90, SCA90_SCA))
for(i in 1:ncol(data1)) {
data1[, i] <- data1[, i] / Mins
}
SCA <- data1$SCA_SCA
for(i in 1:ncol(data1)) {
data1[, i] <- round((data1[, i] / SCA) * 100, 0)
}
data1 <- data1 %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)
## run this ifelse statement as many times as necessary until the Total comes out to be a 100 for all rows.
data1 <- data1 %>% mutate(Sh_SCA = ifelse(Total == 100, Sh_SCA,
ifelse(Total < 100, Sh_SCA + 1,
ifelse(Total > 100, Sh_SCA - 1, NA)))) %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)
data1$Player <- Player
data1 <- data1 %>%
pivot_longer(!Player, values_to = "SCAp90", names_to = "SCATypes") %>%
filter(!SCATypes == "SCA_SCA") %>%
filter(!SCATypes == "Total") %>%
count(Player, SCATypes, wt = SCAp90)
data1$Player <- factor(data1$Player, levels = print(df$Player))
# Custom theme function
theme_athletic <- function() {
theme_minimal() +
theme(plot.background = element_rect(colour = "#151515", fill = "#151515"),
panel.background = element_rect(colour = "#151515", fill = "#151515")) +
theme(plot.title = element_text(colour = "white", size = 24, family = "Fried Chicken Bold", hjust = 0.5),
plot.subtitle = element_markdown(colour = "#525252", size = 18, hjust = 0.5),
plot.caption = element_text(colour = "white", size = 15, hjust = 1),
axis.title.x = element_text(colour = "white", face = "bold", size = 14),
axis.title.y = element_text(colour = "white", face = "bold", size = 14),
axis.text.x = element_blank(),
axis.text.y = element_blank()) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank()) +
theme(legend.title = element_text(colour = "white"),
legend.text = element_text(colour = "white"))
}
# Plotting
data1 %>%
ggplot(aes(fill = SCATypes, values = n)) +
geom_waffle(nrows = 10, size = 1.5, colour = "#151515", flip = TRUE) +
scale_fill_manual(values = met.brewer(name = "Gauguin", n = 6, type = "discrete")) +
facet_wrap(~Player) +
labs(title = "Big 5 Leagues Shot-Creating Actions Share [2021/22]",
subtitle = "Top 20 Players with the most SCA per 90 so far",
caption = "Minimum 9 90's Played\nData from FBref\nCreated by @placeholder2004") +
theme_athletic() +
theme(aspect.ratio = 1,
strip.background = element_blank(),
strip.text = element_text(colour = "white", size = 14),
legend.position = "bottom",
legend.text = element_text(size = 14))
# Save
setwd("C:/Users/harsh_1mwi2o4/Downloads")
ggsave("wafflebig5.png", width = 3100, height = 3500, units = "px")
```
This is a problem I’ve faced in the past and I’m sure many others have
too, whether in the context of soccer analysis or otherwise. So I
decided to tackle this problem as a fun coding challenge using the
**tidyverse**!
Let’s get started!
Packages
========
I prefer typing out the specific tidyverse packages rather than loading
everything in at once.
``` r
library(dplyr)
library(tidyr)
library(worldfootballR)
library(ggtext)
library(extrafont)
library(waffle)
library(MetBrewer)
```
Scrape data
===========
Get data from FBref/StatsBomb with the {worldfootballR} package.
``` r
# Scraping
data_raw <- fb_big5_advanced_season_stats(season_end_year = 2022, stat_type = "gca", team_or_player = "player")
```
``` r
glimpse(data_raw)
```
## Rows: 2,540
## Columns: 27
## $ X 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,~
## $ Season_End_Year 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, ~
## $ Squad "Alavés", "Alavés", "Alavés", "Alavés", "Alavés", "Ala~
## $ Comp "La Liga", "La Liga", "La Liga", "La Liga", "La Liga",~
## $ Player "Martin Agirregabiria", "Mircea Alexandru Tirlea", "Ru~
## $ Nation "ESP", "ROU", "ESP", "ARG", "ESP", "ESP", "SWE", "ESP"~
## $ Pos "DF", "DF", "DF", "MF", "MF", "FW", "FW,MF", "MF", "FW~
## $ Age "25-249", "21-292", "26-088", "28-293", "24-012", "27-~
## $ Born 1996, 2000, 1995, 1993, 1998, 1994, 1992, 1994, 1990, ~
## $ Mins_Per_90 13.6, 0.2, 16.4, 0.6, 7.6, 0.0, 3.0, 0.4, 18.5, 17.9, ~
## $ SCA_SCA 22, 0, 23, 0, 23, 0, 10, 1, 40, 3, 4, 5, 2, 12, 3, 11,~
## $ SCA90_SCA 1.62, 0.00, 1.40, 0.00, 3.02, 0.00, 3.28, 2.25, 2.16, ~
## $ PassLive_SCA 16, 0, 11, 0, 12, 0, 5, 1, 24, 0, 4, 3, 2, 10, 1, 6, 4~
## $ PassDead_SCA 2, 0, 9, 0, 8, 0, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 18, 0,~
## $ Drib_SCA 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 2, 0, 1, 1, 0, ~
## $ Sh_SCA 0, 0, 0, 0, 1, 0, 0, 0, 6, 1, 0, 0, 0, 0, 2, 0, 1, 2, ~
## $ Fld_SCA 3, 0, 2, 0, 1, 0, 2, 0, 6, 0, 0, 0, 0, 0, 0, 3, 0, 1, ~
## $ Def_SCA 0, 0, 1, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, ~
## $ GCA_GCA 2, 0, 3, 0, 0, 0, 3, 0, 5, 1, 0, 1, 0, 0, 0, 3, 3, 0, ~
## $ GCA90_GCA 0.15, 0.00, 0.18, 0.00, 0.00, 0.00, 0.99, 0.00, 0.27, ~
## $ PassLive_GCA 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 1, 1, 0, ~
## $ PassDead_GCA 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ~
## $ Drib_GCA 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ Sh_GCA 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, ~
## $ Fld_GCA 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Def_GCA 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ Url "https://fbref.com/en/players/355c883a/Martin-Agirrega~
Instead of ordering and subsetting with base functions in the original,
I used `arrange()` and `slice()` to grab the top 20 players by SCA per
90.
``` r
# Data Wrangling
df1 <- data_raw %>%
filter(Mins_Per_90 >= 9) %>%
## use contains() so I don't have to type out every `SCA` variable out
select(Player, Mins_Per_90, contains("SCA")) %>%
## arrange by SCA per 90 (descending) then take top 20 rows
arrange(desc(SCA90_SCA)) %>%
slice(1:20)
glimpse(df1)
```
## Rows: 20
## Columns: 10
## $ Player "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ Mins_Per_90 15.4, 13.8, 12.7, 9.2, 12.2, 9.2, 10.3, 9.6, 11.8, 17.3, ~
## $ SCA_SCA 115, 83, 76, 55, 72, 53, 59, 55, 66, 94, 66, 89, 56, 63, ~
## $ SCA90_SCA 7.48, 6.02, 5.97, 5.95, 5.88, 5.77, 5.75, 5.72, 5.59, 5.4~
## $ PassLive_SCA 71, 34, 41, 35, 40, 34, 44, 37, 52, 50, 45, 43, 43, 36, 3~
## $ PassDead_SCA 23, 26, 26, 10, 25, 5, 11, 4, 1, 28, 8, 25, 4, 15, 8, 2, ~
## $ Drib_SCA 7, 7, 1, 2, 2, 3, 1, 4, 3, 6, 4, 4, 5, 5, 5, 1, 13, 1, 1,~
## $ Sh_SCA 3, 10, 3, 4, 1, 4, 2, 5, 4, 1, 6, 4, 1, 4, 2, 3, 6, 3, 1,~
## $ Fld_SCA 11, 5, 3, 3, 2, 6, 1, 4, 6, 8, 2, 9, 2, 3, 8, 0, 7, 1, 1,~
## $ Def_SCA 0, 1, 2, 1, 2, 1, 0, 1, 0, 1, 1, 4, 1, 0, 3, 2, 2, 0, 2, 2
Instead of using a `for` loop to perform the calculations on each
column, I used `mutate()` and then `across()` to specify the columns I
wanted to run the same operation on, dividing the metrics by the number
of minutes that each player played. Also since we are using `mutate()`
we don’t need to pull out the `Mins` variable anymore as we can just
refer to that specific column in the data.frame.
``` r
df2 <- df1 %>%
## use across() to specify which vars to perform operations on
## ex. all cols EXCEPT `Player`, `Mins_Per_90`, and `SCA90_SCA`
mutate(across(-c(Player, Mins_Per_90, SCA90_SCA), ~ . / Mins_Per_90))
glimpse(df2)
```
## Rows: 20
## Columns: 10
## $ Player "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ Mins_Per_90 15.4, 13.8, 12.7, 9.2, 12.2, 9.2, 10.3, 9.6, 11.8, 17.3, ~
## $ SCA_SCA 7.467532, 6.014493, 5.984252, 5.978261, 5.901639, 5.76087~
## $ SCA90_SCA 7.48, 6.02, 5.97, 5.95, 5.88, 5.77, 5.75, 5.72, 5.59, 5.4~
## $ PassLive_SCA 4.610390, 2.463768, 3.228346, 3.804348, 3.278689, 3.69565~
## $ PassDead_SCA 1.49350649, 1.88405797, 2.04724409, 1.08695652, 2.0491803~
## $ Drib_SCA 0.45454545, 0.50724638, 0.07874016, 0.21739130, 0.1639344~
## $ Sh_SCA 0.19480519, 0.72463768, 0.23622047, 0.43478261, 0.0819672~
## $ Fld_SCA 0.71428571, 0.36231884, 0.23622047, 0.32608696, 0.1639344~
## $ Def_SCA 0.00000000, 0.07246377, 0.15748031, 0.10869565, 0.1639344~
Again, here instead of using a `for` loop I used the same
`mutate() + across()` trick again to divide the values by each of the
total SCA types and then multiply that by 100.
What you get now is that all the per 90 stats are in terms of
percentages of the total SCA type.
``` r
df3 <- df2 %>%
## use across() to specify which vars to perform operations on
## ex. all cols EXCEPT `Player`, `Mins_Per_90`, and `SCA90_SCA`
mutate(across(-c(Player, Mins_Per_90, SCA90_SCA), ~ (. / SCA_SCA) * 100 )) %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA) %>%
select(Player, Mins_Per_90, Total, contains("SCA"))
glimpse(df3)
```
## Rows: 20
## Columns: 11
## $ Player "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ Mins_Per_90 15.4, 13.8, 12.7, 9.2, 12.2, 9.2, 10.3, 9.6, 11.8, 17.3, ~
## $ Total 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10~
## $ SCA_SCA 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 10~
## $ SCA90_SCA 7.48, 6.02, 5.97, 5.95, 5.88, 5.77, 5.75, 5.72, 5.59, 5.4~
## $ PassLive_SCA 61.73913, 40.96386, 53.94737, 63.63636, 55.55556, 64.1509~
## $ PassDead_SCA 20.000000, 31.325301, 34.210526, 18.181818, 34.722222, 9.~
## $ Drib_SCA 6.086957, 8.433735, 1.315789, 3.636364, 2.777778, 5.66037~
## $ Sh_SCA 2.608696, 12.048193, 3.947368, 7.272727, 1.388889, 7.5471~
## $ Fld_SCA 9.565217, 6.024096, 3.947368, 5.454545, 2.777778, 11.3207~
## $ Def_SCA 0.000000, 1.204819, 2.631579, 1.818182, 2.777778, 1.88679~
Then I check using `Total` column if the numbers sum to 100. They do, so
we know that these per 90 numbers are good and everything adds up to
100% properly.
For the purpose of making a waffle chart and the fact that you can’t
really “do” 0.43 of a “Shot Creating Action”, we have to turn all of
these values into integers. However, the problem with this is that due
to rounding the individual values, the sum doesn’t equal 100 after the
calculation! Some sums might equal to 99 or 98, some to 100, and some to
101!
Preserve sum after rounding function
====================================
I looked around and tried out a **lot** of different things, not just
different functions but also ways to re-shape the data so that the
algorithm would work correctly. The first function I tried was the
`largeRem()` function [here](https://github.com/basilesimon/largeRem/)
but this only worked for when the sums would add up to -1 (99) or +1
(101) from 100. So I then started manually adding more branches to the
if-else statements but since I knew there had to be something better out
there I moved on. Eventually I found the function below, hoisted from
the
[{JLutils}](https://github.com/larmarange/JLutils/blob/master/R/round_preserve_sum.R)
package that has a pretty good implementation of the **largest
remainder** algorithm:
``` r
round_preserve_sum <- function(x, digits = 0) {
up <- 10^digits
x <- x * up
y <- floor(x)
indices <- tail(order(x - y), round(sum(x)) - sum(y))
y[indices] <- y[indices] + 1
y / up
}
```
I honestly wanted to show how to do **row-wise** operations using the
tidyverse in this section because I got tired of having to pivot back
and forth and back and forth. See
[here](https://dplyr.tidyverse.org/articles/rowwise.html) for how
row-wise operations work in the tidyverse. However, the custom function
I used to preserve the sum wasn’t created to support the tidyverse’s
row-wise method, so I ended up having to transpose the data and
un-transpose it with the pivot functions anyway. At the end, we can
check our work by creating a `Total` variable again to see that
everything sums up to 100 even after rounding all the metrics into
integers!
``` r
df4 <- df3 %>%
select(-Mins_Per_90, -SCA90_SCA, -SCA_SCA, -Total) %>%
## transpose
pivot_longer(cols = -Player) %>%
pivot_wider(names_from = Player, values_from = value) %>%
## run function over player column
mutate(across(-name, ~ round_preserve_sum(.))) %>%
## transpose back to original shape
pivot_longer(names_to = "Player", values_to = "thing", cols = -name) %>%
pivot_wider(names_from = name, values_from = thing) %>%
mutate(Total = PassLive_SCA + PassDead_SCA + Drib_SCA + Sh_SCA + Fld_SCA + Def_SCA)
glimpse(df4)
```
## Rows: 20
## Columns: 8
## $ Player "Dimitri Payet", "Lorenzo Pellegrini", "Hakan Çalhanoglu"~
## $ PassLive_SCA 62, 41, 54, 64, 55, 64, 74, 67, 79, 53, 68, 48, 77, 57, 5~
## $ PassDead_SCA