Repository: ScPoEcon/ScPoEconometrics Branch: master Commit: 4999239de84e Files: 59 Total size: 2.2 MB Directory structure: gitextract_ejq2bful/ ├── .Rbuildignore ├── .github/ │ └── ISSUE_TEMPLATE/ │ └── custom.md ├── .gitignore ├── 01-R.Rmd ├── 02-SummaryStats.Rmd ├── 03-linear-reg.Rmd ├── 04-MultipleReg.Rmd ├── 05-Categorial-Vars.Rmd ├── 06-StdErrors.Rmd ├── 07-Causality.Rmd ├── 08-STAR.Rmd ├── 09-RDD.Rmd ├── 10-IV.Rmd ├── 11-IV2.Rmd ├── 12-panel.Rmd ├── 13-discrete.Rmd ├── 14-references.Rmd ├── DESCRIPTION ├── GA-tracker.html ├── LICENSE ├── NAMESPACE ├── R/ │ └── utils.R ├── README.md ├── ScPoEconometrics.Rproj ├── _archive/ │ └── chapters/ │ └── 03-linear-reg.Rmd ├── _bookdown.yml ├── _build.sh ├── _deploy.sh ├── _local_deploy.sh ├── _output.yml ├── _tex/ │ ├── ci.tex │ ├── onesided.tex │ ├── testing.lyx │ ├── two-sided-beta.tex │ └── twosided-mean.tex ├── _to_be_done/ │ ├── 08-TBD.Rmd │ ├── 09-R-advanced.Rmd │ ├── 11-projects.Rmd │ └── notes.R ├── book.bib ├── images/ │ └── trade.html ├── index.Rmd ├── inst/ │ ├── CITATION │ └── datasets/ │ ├── airline-safety.csv │ ├── corr50.csv │ ├── demo_gind.xls │ ├── example-data.csv │ ├── grade5.dta │ └── simple_arrows.RData ├── packages.bib ├── preamble.tex ├── previous_travis.yml ├── style.css ├── teachers/ │ ├── ForTeachers.md │ ├── app-timeline.md │ ├── session1-ouline.md │ ├── tasks_ch1.Rmd │ └── tasks_ch2.Rmd └── toc.css ================================================ FILE CONTENTS ================================================ ================================================ FILE: .Rbuildignore ================================================ ^.*\.Rproj$ ^\.Rproj\.user$ ^.*\.html$ ^.*\.jpg$ _book* _slides* _tex* ^\d\d-.*\.Rmd$ ^.*\.yml$ ^.*\.sh$ ^.*\.css$ ^.*\.gif$ ^.*\.bib$ ^.*\.tex$ images/ data/.keep js/ ^appveyor\.yml$ teachers/ ================================================ FILE: .github/ISSUE_TEMPLATE/custom.md ================================================ --- name: Custom issue template about: 
Please file an issue here!
title: ''
labels: ''
assignees: ''

---

hello! Please ask any course-related questions here, or let us know if something does not work. Make sure your issue includes a reproducible example of the bug/issue that you encountered.

**Every issue needs to submit three things**:

1. The commands that lead to the error you find
1. The actual output of the error
1. **after** the error happened, type `sessionInfo()` and post the output here as well.

================================================
FILE: .gitignore
================================================
.Rproj.user
.Rhistory
.RData
_publish.R
_book
_bookdown_files
rsconnect
/data/
/inst/shinys/**/*.html
/inst/tutorials/**/*.html
/inst/tutorials/**/*data
inst/tutorials/chapter2/chapter2_files/
_slides/chapter1/chapter1-*
_slides/chapter2/chapter2-*
_slides/chapter6/chapter6_files/
_slides/**/*.html

================================================
FILE: 01-R.Rmd
================================================
# Introduction to `R` {#R-intro}

## Getting Started

`R` is both a programming language and software environment for statistical computing, which is *free* and *open-source*. To get started, you will need to install two pieces of software:

1. [`R`, the actual programming language.](https://www.r-project.org)
    - Choose your operating system, and select the most recent version.
1. [RStudio, an excellent IDE for working with `R`.](http://www.rstudio.com/)
    - Note, you must have `R` installed to use RStudio. RStudio is simply an interface used to interact with `R`.

The popularity of `R` is on the rise, and every day it becomes a better tool for statistical analysis. It even generated this book!

The following few chapters will serve as a whirlwind introduction to `R`. They are by no means meant to be a complete reference for the `R` language, but simply an introduction to the basics that we will need along the way.
Several of the more important topics will be re-stressed as they are actually needed for analyses. This introductory `R` chapter may feel like an overwhelming amount of information. You are not expected to pick up everything the first time through. You should try all of the code from this chapter, then return to it a number of times as you return to the concepts when performing analyses. We only present the most basic aspects of `R`. If you want to know more, there are countless online tutorials, and you could start with the official [CRAN sample session](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#A-sample-session) or have a look at the resources at [Rstudio](https://www.rstudio.com/online-learning/#DataScience) or on this [github repo](https://github.com/qinwf/awesome-R). ## Starting R and RStudio A key difference for you to understand is the one between `R`, the actual programming language, and `RStudio`, a popular interface to `R` which allows you to work efficiently and with greater ease with `R`. The best way to appreciate the value of `RStudio` is to start using `R` *without* `RStudio`. To do this, double-click on the R GUI that you should have downloaded on your computer following the steps above (on windows or Mac), or start R in your terminal (on Linux or Mac) by just typing `R` in a terminal, see figure \@ref(fig:console). You've just opened the R **console** which allows you to start typing code right after the `>` sign, called *prompt*. Try typing `2 + 2` or `print("Your Name")` and hit the return key. And *voilà*, your first R commands! ```{r console, fig.cap="R GUI symbol and R in a MacOS Terminal",fig.align='center',out.width="50%",echo=FALSE} knitr::include_graphics(c("images/RLogo.png","images/console.png") ) ``` Typing one command after the other into the console is not very convenient as our analysis becomes more involved. 
Ideally, we would like to collect all command statements in a file and run them one after the other, automatically. We can do this by writing so-called **script files** or just **scripts**, i.e. simple text files with extension `.R` or `.r` which can be *inserted* (or *sourced*) into an `R` session. RStudio makes this process very easy.

Open `RStudio` by clicking on the `RStudio` application on your computer, and notice how different the whole environment is from the basic `R` console – in fact, that *very same* `R` console is running in your bottom left panel. The upper-left panel is a space for you to write scripts – that is to say many lines of code which you can run when you choose to. To run a single line of code, simply highlight it and hit `Command` + `Return` (`Ctrl` + `Enter` on Windows).

```{block, type='note'}
We highly recommend that you use `RStudio` for everything related to this course (in particular, to launch our apps and tutorials).
```

RStudio has a large number of useful keyboard shortcuts. A list of these can be found using a keyboard shortcut -- the keyboard shortcut to rule them all:

- On Windows: `Alt` + `Shift` + `K`
- On Mac: `Option` + `Shift` + `K`

The `RStudio` team has developed [a number of "cheatsheets"](https://www.rstudio.com/resources/cheatsheets/) for working with both `R` and `RStudio`. [This particular cheatsheet for Base `R`](http://www.rstudio.com/wp-content/uploads/2016/05/base-r.pdf) will summarize many of the concepts in this document. ^[When programming, it is often a good practice to follow a style guide. (Where do spaces go? Tabs or spaces? Underscores or CamelCase when naming variables?) No style guide is "correct" but it helps to be aware of what others do. The more important thing is to be consistent within your own code. Here are two guides: [Hadley Wickham Style Guide](http://adv-r.had.co.nz/Style.html), and the [Google Style Guide](https://google.github.io/styleguide/Rguide.xml).
For this course, our main deviation from these two guides is the use of `=` in place of `<-`. For all practical purposes, you should think `=` whenever you see `<-`.]

### First Glossary

* `R`: a statistical programming language
* `RStudio`: an integrated development environment (IDE) to work with `R`
* *command*: user input (text or numbers) that `R` *understands*.
* *script*: a list of commands collected in a text file, each separated by a new line, to be run one after the other.

## Basic Calculations

To get started, we'll use `R` like a simple calculator. Run the following code either directly from your RStudio console, or in RStudio by writing them in a script and running them using `Command` + `Return`.

#### Addition, Subtraction, Multiplication and Division {-}

| Math          | `R` code | Result    |
|:-------------:|:-------:|:---------:|
| $3 + 2$       | `3 + 2` | `r 3 + 2` |
| $3 - 2$       | `3 - 2` | `r 3 - 2` |
| $3 \cdot 2$   | `3 * 2` | `r 3 * 2` |
| $3 / 2$       | `3 / 2` | `r 3 / 2` |

#### Exponents {-}

| Math          | `R` code | Result    |
|:-------------:|:-------:|:---------:|
| $3^2$         | `3 ^ 2` | `r 3 ^ 2` |
| $2^{(-3)}$    | `2 ^ (-3)` | `r 2 ^ (-3)` |
| $100^{1/2}$   | `100 ^ (1 / 2)` | `r 100 ^ (1 / 2)` |
| $\sqrt{100}$  | `sqrt(100)` | `r sqrt(100)` |

#### Mathematical Constants {-}

| Math         | `R` code        | Result            |
|:------------:|:---------------:|:-----------------:|
| $\pi$        | `pi`            | `r pi`            |
| $e$          | `exp(1)`        | `r exp(1)`        |

#### Logarithms {-}

Note that we will use $\ln$ and $\log$ interchangeably to mean the natural logarithm. There is no `ln()` in `R`, instead it uses `log()` to mean the natural logarithm.
| Math | `R` code | Result | |:------------:|:---------------:|:-----------------:| | $\log(e)$ | `log(exp(1))` | `r log(exp(1))` | | $\log_{10}(1000)$ | `log10(1000)` | `r log10(1000)` | | $\log_{2}(8)$ | `log2(8)` | `r log2(8)` | | $\log_{4}(16)$ | `log(16, base = 4)` | `r log(16, base = 4)` | #### Trigonometry {-} | Math | `R` code | Result | |:------------:|:---------------:|:-----------------:| | $\sin(\pi / 2)$ | `sin(pi / 2)` | `r sin(pi / 2)` | | $\cos(0)$ | `cos(0)` | `r cos(0)` | ## Getting Help In using `R` as a calculator, we have seen a number of functions: `sqrt()`, `exp()`, `log()` and `sin()`. To get documentation about a function in `R`, simply put a question mark in front of the function name, or call the function `help(function)` and RStudio will display the documentation, for example: ```{r, eval = FALSE} ?log ?sin ?paste ?lm help(lm) # help() is equivalent help(ggplot,package="ggplot2") # show help from a certain package ``` Frequently one of the most difficult things to do when learning `R` is asking for help. First, you need to decide to ask for help, then you need to know *how* to ask for help. Your very first line of defense should be to Google your error message or a short description of your issue. (The ability to solve problems using this method is quickly becoming an extremely valuable skill.) If that fails, and it eventually will, you should ask for help. There are a number of things you should include when contacting an instructor, or posting to a help website such as [Stack Overflow](https://stackoverflow.com). - Describe what you expect the code to do. - State the end goal you are trying to achieve. (Sometimes what you expect the code to do, is not what you want to actually do.) - Provide the full text of any errors you have received. - Provide enough code to recreate the error. Often for the purpose of this course, you could simply post your entire `.R` script or `.Rmd` to `slack`. 
- Sometimes it is also helpful to include a screenshot of your entire RStudio window when the error occurs. If you follow these steps, you will get your issue resolved much quicker, and possibly learn more in the process. Do not be discouraged by running into errors and difficulties when learning `R`. (Or any other technical skill.) It is simply part of the learning process. ## Installing Packages `R` comes with a number of built-in functions and datasets, but one of the main strengths of `R` as an open-source project is its package system. Packages add additional functions and data. Frequently if you want to do something in `R`, and it is not available by default, there is a good chance that there is a package that will fulfill your needs. To install a package, use the `install.packages()` function. Think of this as buying a recipe book from the store, bringing it home, and putting it on your shelf (i.e. into your library): ```{r, eval = FALSE} install.packages("ggplot2") ``` Once a package is installed, it must be loaded into your current `R` session before being used. Think of this as taking the book off of the shelf and opening it up to read. ```{r, message = FALSE, warning = FALSE} library(ggplot2) ``` Once you close `R`, all the packages are closed and put back on the imaginary shelf. The next time you open `R`, you do not have to install the package again, but you do have to load any packages you intend to use by invoking `library()`. ## `Code` vs Output in this Book {#code-output} A quick note on styling choices in this book. We had to make a decision how to visually separate `R` code and resulting output in this book. All output lines are prefixed with `##` to make the distinction. A typical code snippet with output is thus going to look like this: ```{r} 1 + 3 # everything after a # is a comment, i.e. R disregards it. ``` where you see on the first line the `R` code, and on the second line the output. 
As mentioned, that line starts with `##` to say *this is an output*, followed by `[1]` (indicating this is a vector of length *one* - more on this below!), followed by the actual result - `1 + 3 = 4`!

Notice that you can simply copy and paste all the code you see into your `R` console. In fact, you are *strongly* encouraged to actually do this and try out **all the code** you see in this book.

Finally, please note that this way of showing output is fully our choice in this textbook, and that you should expect other output formats elsewhere. For example, in my `RStudio` console, the above code and output looks like this:

```R
> 1 + 3
[1] 4
```

## `ScPoApps` Package {#install-package}

To fully take advantage of our course, please install the associated `R` package directly from its online code repository. You can do this by copying and pasting the following two lines into your `R` console:

```R
if (!require("devtools")) install.packages("devtools")
devtools::install_github(repo = "ScPoEcon/ScPoApps")
```

In order to check whether everything works fine, you could load the library, and check its current version:

```{r,warning=FALSE,message=FALSE,eval=FALSE}
library(ScPoApps)
packageVersion("ScPoApps")
```

## Data Types {#data-types}

`R` has a number of basic *data types*. While `R` is not a *strongly typed language* (i.e. you can be agnostic about types most of the time), it is useful to know what data types are available to you:

- Numeric
    - Also known as Double. The default type when dealing with numbers.
    - Examples: `1`, `1.0`, `42.5`
- Integer
    - Examples: `1L`, `2L`, `42L`
- Complex
    - Example: `4 + 2i`
- Logical
    - Two possible values: `TRUE` and `FALSE`
    - You can also use `T` and `F`, but this is *not* recommended.
    - `NA` is also considered logical.
- Character
    - Examples: `"a"`, `"Statistics"`, `"1 plus 2."`
- Categorical or `factor`
    - A mixture of integer and character. A `factor` variable assigns a label to a numeric value.
- For example `factor(x=c(0,1),labels=c("male","female"))` assigns the string *male* to the numeric values `0`, and the string *female* to the value `1`. ## Data Structures `R` also has a number of basic data *structures*. A data structure is either homogeneous (all elements are of the same data type) or heterogeneous (elements can be of more than one data type). | Dimension | **Homogeneous** | **Heterogeneous** | |:---------:|:---------------:|:-----------------:| | 1 | Vector | List | | 2 | Matrix | Data Frame | | 3+ | Array | nested Lists | ### Vectors Many operations in `R` make heavy use of **vectors**. A vector is a *container* for objects of identical type (see \@ref(data-types) above). Vectors in `R` are indexed starting at `1`. That is what the `[1]` in the output is indicating, that the first element of the row being displayed is the first element of the vector. Larger vectors will start additional rows with something like `[7]` where `7` is the index of the first element of that row. Possibly the most common way to create a vector in `R` is using the `c()` function, which is short for "combine". As the name suggests, it combines a list of elements separated by commas. (Are you busy typing all of those examples into your `R` console? :-) ) ```{r} c(1, 3, 5, 7, 8, 9) ``` Here `R` simply outputs this vector. If we would like to store this vector in a **variable** we can do so with the **assignment** operator `=`. In this case the variable `x` now holds the vector we just created, and we can access the vector by typing `x`. ```{r} x = c(1, 3, 5, 7, 8, 9) x ``` As an aside, there is a long history of the assignment operator in `R`, partially due to the keys available on the [keyboards of the creators of the `S` language.](https://twitter.com/kwbroman/status/747829864091127809) (Which preceded `R`.) For simplicity we will use `=`, but know that often you will see `<-` as the assignment operator. 
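
To make the remark about the two assignment operators concrete, here is a minimal sketch. The variable names `a_eq` and `a_arrow` are made up for this illustration and are not used elsewhere in the book. At the top level of a script, both forms create the same object:

```{r}
a_eq = c(1, 3, 5)        # assignment with `=`, the style used in this book
a_arrow <- c(1, 3, 5)    # assignment with `<-`, the style you will often see elsewhere
identical(a_eq, a_arrow) # TRUE: both variables hold the same vector
```

Inside a function call, `=` plays a different role: it matches a value to an argument name, as in `seq(from = 1, to = 3)`, and does not create a variable.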
Because vectors must contain elements that are all the same type, `R` will automatically **coerce** (i.e. convert) to a single type when attempting to create a vector that combines multiple types.

```{r}
c(42, "Statistics", TRUE)
c(42, TRUE)
```

Frequently you may wish to create a vector based on a sequence of numbers. The quickest and easiest way to do this is with the `:` operator, which creates a sequence of integers between two specified integers.

```{r}
(y = 1:100)
```

Here we see `R` labeling the rows after the first since this is a large vector. Also, we see that by putting parentheses around the assignment, `R` both stores the vector in a variable called `y` and automatically outputs `y` to the console.

Note that scalars do not exist in `R`. They are simply vectors of length `1`.

```{r}
2
```

If we want to create a sequence that isn't limited to integers and increasing by 1 at a time, we can use the `seq()` function.

```{r}
seq(from = 1.5, to = 4.2, by = 0.1)
```

We will discuss functions in detail later, but note here that the input labels `from`, `to`, and `by` are optional.

```{r}
seq(1.5, 4.2, 0.1)
```

Another common operation to create a vector is `rep()`, which can repeat a single value a number of times.

```{r}
rep("A", times = 10)
```

The `rep()` function can be used to repeat a vector some number of times.

```{r}
rep(x, times = 3)
```

We have now seen four different ways to create vectors:

- `c()`
- `:`
- `seq()`
- `rep()`

So far we have mostly used them in isolation, but they are often used together.

```{r}
c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
```

The length of a vector can be obtained with the `length()` function.

```{r}
length(x)
length(y)
```

```{block type="warning"}
Let's try this out! **Your turn**:
```

#### Task 1

1. Create a vector of five ones, i.e. `[1,1,1,1,1]`
1. Notice that the colon operator `a:b` is just short for *construct a sequence **from** `a` **to** `b`*. Create a vector that counts down from 10 to 0, i.e.
it looks like `[10,9,8,7,6,5,4,3,2,1,0]`! 1. the `rep` function takes additional arguments `times` (as above), and `each`, which tells you how often *each element* should be repeated (as opposed to the entire input vector). Use `rep` to create a vector that looks like this: `[1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3]` #### Subsetting To subset a vector, i.e. to choose only some elements of it, we use square brackets, `[]`. Here we see that `x[1]` returns the first element, and `x[3]` returns the third element: ```{r} x x[1] x[3] ``` We can also exclude certain indexes, in this case the second element. ```{r} x[-2] ``` Lastly we see that we can subset based on a vector of indices. ```{r} x[1:3] x[c(1,3,4)] ``` All of the above are subsetting a vector using a vector of indexes. (Remember a single number is still a vector.) We could instead use a vector of logical values. ```{r} z = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE) z ``` ```{r} x[z] ``` `R` is able to perform many operations on vectors and scalars alike: ```{r} x = 1:10 # a vector x + 1 # add a scalar 2 * x # multiply all elements by 2 2 ^ x # take 2 to the x as exponents sqrt(x) # compute the square root of all elements in x log(x) # take the natural log of all elements in x x + 2*x # add vector x to vector 2x ``` We see that when a function like `log()` is called on a vector `x`, a vector is returned which has applied the function to each element of the vector `x`. 
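
As a small check of what "applied to each element" means, the following sketch compares a vectorized call with the same computation written out element by element. The variable name `vals` is chosen only for this illustration.

```{r}
vals = c(1, 4, 9)

sqrt(vals)                 # vectorized: 1 2 3
by_hand = c(sqrt(vals[1]), sqrt(vals[2]), sqrt(vals[3]))
all.equal(sqrt(vals), by_hand)   # TRUE: same result, element by element

vals + c(10, 20, 30)       # arithmetic on two vectors is element-wise too: 11 24 39
```

Writing `sqrt(vals)` is shorter than the element-by-element version and works unchanged for a vector of any length.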
### Logical Operators

| Operator | Summary               | Example               | Result |
|:---------|:---------------------:|:---------------------:|:-------:|
| `x < y`  | `x` less than `y`                | `3 < 42`               | `r 3 < 42`               |
| `x > y`  | `x` greater than `y`             | `3 > 42`               | `r 3 > 42`               |
| `x <= y` | `x` less than or equal to `y`    | `3 <= 42`              | `r 3 <= 42`              |
| `x >= y` | `x` greater than or equal to `y` | `3 >= 42`              | `r 3 >= 42`              |
| `x == y` | `x` equal to `y`                 | `3 == 42`              | `r 3 == 42`              |
| `x != y` | `x` not equal to `y`             | `3 != 42`              | `r 3 != 42`              |
| `!x`     | not `x`                          | `!(3 > 42)`            | `r !(3 > 42)`            |
| `x | y`  | `x` or `y`                       | `(3 > 42) | TRUE`      | `r (3 > 42) | TRUE`      |
| `x & y`  | `x` and `y`                      | `(3 < 4) & ( 42 > 13)` | `r (3 < 4) & ( 42 > 13)` |

In `R`, logical operators also work on vectors:

```{r}
x = c(1, 3, 5, 7, 8, 9)
```

```{r}
x > 3
x < 3
x == 3
x != 3
```

```{r}
x == 3 & x != 3
x == 3 | x != 3
```

This is quite useful for subsetting.

```{r}
x[x > 3]
x[x != 3]
```

```{r}
sum(x > 3)
as.numeric(x > 3)
```

Here we saw that using the `sum()` function on a vector of logical `TRUE` and `FALSE` values that is the result of `x > 3` results in a numeric result: you just *counted* for how many elements of `x`, the condition `> 3` is `TRUE`. During the call to `sum()`, `R` is first automatically coercing the logical to numeric where `TRUE` is `1` and `FALSE` is `0`. This coercion from logical to numeric happens for most mathematical operations.

```{r}
# which(condition on x) returns the indices
# of x at which the condition is TRUE
which(x > 3)
x[which(x > 3)]

max(x)
which(x == max(x))
which.max(x)
```

#### Task 2

1. Create a vector filled with 10 numbers drawn from the uniform distribution (hint: use function `runif`) and store them in `x`.
1. Using logical subsetting as above, get all the elements of `x` which are larger than 0.5, and store them in `y`.
1. Using the function `which`, store the *indices* of all the elements of `x` which are larger than 0.5 in `iy`.
1. Check that `y` and `x[iy]` are identical.
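
Coming back to the coercion of logical values to numbers described before Task 2: because `TRUE` counts as `1` and `FALSE` as `0`, the same trick that lets `sum()` count also lets `mean()` compute a share. This is a minimal sketch with made-up data; the names `scores` and `above` are only for illustration.

```{r}
scores = c(2, 8, 5, 9, 1)
above = scores > 4   # FALSE TRUE TRUE TRUE FALSE

sum(above)           # 3 elements satisfy the condition
mean(above)          # 0.6, i.e. 3 out of 5 elements
```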
### Matrices `R` can also be used for **matrix** calculations. Matrices have rows and columns containing a single data type. In a matrix, the order of rows and columns is important. (This is not true of *data frames*, which we will see later.) Matrices can be created using the `matrix` function. ```{r} x = 1:9 x X = matrix(x, nrow = 3, ncol = 3) X ``` Notice here that `R` is case sensitive (`x` vs `X`). By default the `matrix` function fills your data into the matrix column by column. But we can also tell `R` to fill rows instead: ```{r} Y = matrix(x, nrow = 3, ncol = 3, byrow = TRUE) Y ``` We can also create a matrix of a specified dimension where every element is the same, in this case `0`. ```{r} Z = matrix(0, 2, 4) Z ``` Like vectors, matrices can be subsetted using square brackets, `[]`. However, since matrices are two-dimensional, we need to specify both a row and a column when subsetting. ```{r} X X[1, 2] ``` Here we accessed the element in the first row and the second column. We could also subset an entire row or column. ```{r} X[1, ] X[, 2] ``` We can also use vectors to subset more than one row or column at a time. Here we subset to the first and third column of the second row: ```{r} X[2, c(1, 3)] ``` Matrices can also be created by combining vectors as columns, using `cbind`, or combining vectors as rows, using `rbind`. ```{r} x = 1:9 rev(x) rep(1, 9) ``` ```{r} rbind(x, rev(x), rep(1, 9)) ``` ```{r} cbind(col_1 = x, col_2 = rev(x), col_3 = rep(1, 9)) ``` When using `rbind` and `cbind` you can specify "argument" names that will be used as column names. `R` can then be used to perform matrix calculations. ```{r} x = 1:9 y = 9:1 X = matrix(x, 3, 3) Y = matrix(y, 3, 3) X Y ``` ```{r} X + Y X - Y X * Y X / Y ``` Note that `X * Y` is **not** matrix multiplication. It is *element by element* multiplication. (Same for `X / Y`). Matrix multiplication uses `%*%`. 
Other matrix functions include `t()` which gives the transpose of a matrix and `solve()` which returns the inverse of a square matrix if it is invertible.

```{r}
X %*% Y
t(X)
```

### Arrays

A vector is a one-dimensional array. A matrix is a two-dimensional array. In `R` you can create arrays of arbitrary dimensionality `N`. Here is how:

```{r}
d = 1:16
d3 = array(data = d,dim = c(4,2,2))
d4 = array(data = d,dim = c(4,2,2,3)) # will recycle 1:16
d3
```

You can see that `d3` is simply *two* (4,2) matrices laid on top of each other, as if there were *two pages*. Similarly, `d4` would have two pages, and another 3 registers in a fourth dimension. And so on. You can subset an array like you would a vector or a matrix, taking care to index each dimension:

```{r}
d3[ ,1,1] # all elements from col 1, page 1
d3[2:3, , ] # rows 2:3 from all pages
d3[2,2, ] # row 2, col 2 from both pages.
```

#### Task 3

1. Create a vector containing `1,2,3,4,5` called v.
1. Create a (2,5) matrix `m` containing the data `1,2,3,4,5,6,7,8,9,10`. The first row should be `1,2,3,4,5`.
1. Perform matrix multiplication of `m` with `v`. Use the command `%*%`. What dimension does the output have?
1. Why does `v %*% m` not work?

### Lists

A list is a one-dimensional *heterogeneous* data structure. So it is indexed like a vector with a single integer value (or with a name), but each element can contain an element of any type. Lists are similar to a Python `dict` or a Julia `Dict` object. Many `R` structures and outputs are lists themselves. Lists are extremely useful and versatile objects, so make sure you understand their usage:

```{r}
# creation without fieldnames
list(42, "Hello", TRUE)

# creation with fieldnames
ex_list = list(
  a = c(1, 2, 3, 4),
  b = TRUE,
  c = "Hello!",
  d = function(arg = 42) {print("Hello World!")},
  e = diag(5)
)
```

Lists can be subset using two syntaxes, the `$` operator, and square brackets `[]`. The `$` operator returns a named **element** of a list.
The `[]` syntax returns a **list**, while the `[[]]` returns an **element** of a list.

- `ex_list[1]` returns a list containing the first element.
- `ex_list[[1]]` returns the first element of the list, in this case, a vector.

```{r}
# subsetting
ex_list$e

ex_list[1:2]
ex_list[1]
ex_list[[1]]
ex_list[c("e", "a")]
ex_list["e"]
ex_list[["e"]]

ex_list$d
ex_list$d(arg = 1)
```

#### Task 4

1. Copy and paste the above code for `ex_list` into your R session. Remember that `list` can hold any kind of `R` object. Like...another list! So, create a new list `new_list` that has two fields: a first field called "this" with string content `"is awesome"`, and a second field called "ex_list" that contains `ex_list`.
1. Accessing members is like in a plain list, just with several layers now. Get the element `c` from `ex_list` in `new_list`!
1. Compose a new string out of the first element in `new_list`, the element under label `this`. Use the function `paste` to print `R is awesome` to your screen.

## Data Frames {#dataframes}

We have previously seen vectors and matrices for storing data as we introduced `R`. We will now introduce a **data frame** which will be the most common way that we store and interact with data in this course. A `data.frame` is similar to a Python `pandas.DataFrame` or a Julia `DataFrame`. (But the `R` version was the first! :-) )

```{r}
example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
                          y = c(rep("Hello", 9), "Goodbye"),
                          z = rep(c(TRUE, FALSE), 5))
```

Unlike a matrix, which can be thought of as a vector rearranged into rows and columns, a data frame is not required to have the same data type for each element. A data frame is a **list** of vectors, and each vector has a *name*. So, each vector must contain the same data type, but the different vectors can store different data types. Note, however, that all vectors must have **the same length** (differently from a `list`)!

```{block, type="tip"}
A **data.frame** is similar to a typical Spreadsheet.
There are *rows*, and there are *columns*. A row is typically thought of as an *observation*, and each column is a certain *variable*, *characteristic* or *feature* of that observation. ```
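
The statement that a data frame is a list of equally long, named vectors can be checked directly. The following is a minimal sketch; the names `df_demo` and `bad` are made up for this illustration so as not to interfere with `example_data`.

```{r}
df_demo = data.frame(id = 1:3, name = c("a", "b", "c"))

is.list(df_demo)    # TRUE: a data frame is a special kind of list
length(df_demo)     # 2: one list element per column
identical(df_demo$name, df_demo[["name"]])  # TRUE: list-style access works

# columns of lengths 3 and 2 cannot be combined; try() captures the error
bad = try(data.frame(id = 1:3, name = c("a", "b")), silent = TRUE)
inherits(bad, "try-error")  # TRUE
```

Wrapping the failing call in `try()` is only a device to keep this example running; in your own work you would simply see the error message.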
Let's look at the data frame we just created above:

```{r}
example_data
```

Unlike a list, which has more flexibility, the elements of a data frame must all be vectors. Again, we access any given column with the `$` operator:

```{r}
example_data$x

# all three columns have the same length
length(example_data$x) == length(example_data$y) &&
  length(example_data$y) == length(example_data$z)

str(example_data)

nrow(example_data)
ncol(example_data)
dim(example_data)

names(example_data)
```

### Working with `data.frames`

The `data.frame()` function above is one way to create a data frame. We can also import data from various file types into `R`, as well as use data stored in packages.

```{r, echo = FALSE}
write.csv(example_data, "data/example-data.csv", row.names = FALSE)
write.csv(example_data,"inst/datasets/example-data.csv", row.names=FALSE)
```

To read this data back into `R`, we will use the built-in function `read.csv`:

```{r, message = FALSE, warning = FALSE}
path = system.file(package="ScPoEconometrics","datasets","example-data.csv")
example_data_from_disk = read.csv(path)
```

This particular line of code assumes that you installed the associated R package to this book, hence you have this dataset stored on your computer at `system.file(package = "ScPoEconometrics","datasets","example-data.csv")`.

```{r}
example_data_from_disk
```

When using data, there are three things we would generally like to do:

- Look at the raw data.
- Understand the data. (Where did it come from? What are the variables? Etc.)
- Visualize the data.

To look at data in a `data.frame`, we have two useful commands: `head()` and `str()`.

```{r}
# we are working with the built-in mtcars dataset:
mtcars
```

You can see that this prints the entire data.frame to screen. The function `head()` will display the first `n` observations of the data frame.

```{r}
head(mtcars,n=2)
head(mtcars) # default
```

The function `str()` will display the "structure" of the data frame.
It will display the number of **observations** and **variables**, list the variables, give the type of each variable, and show some elements of each variable. This information can also be found in the "Environment" window in RStudio.

```{r}
str(mtcars)
```

In this dataset an observation is for a particular model of a car, and the variables describe attributes of the car, for example its fuel efficiency, or its weight.

To understand more about the data set, we use the `?` operator to pull up the documentation for the data.

```{r, eval = FALSE}
?mtcars
```

`R` has a number of functions for quickly working with and extracting basic information from data frames. To quickly obtain a vector of the variable names, we use the `names()` function.

```{r}
names(mtcars)
```

To access one of the variables **as a vector**, we use the `$` operator.

```{r}
mtcars$mpg
mtcars$wt
```

We can use the `dim()`, `nrow()` and `ncol()` functions to obtain information about the dimension of the data frame.

```{r}
dim(mtcars)
nrow(mtcars)
ncol(mtcars)
```

Here `nrow()` is also the number of observations, which in most cases is the *sample size*.

Subsetting data frames can work much like subsetting matrices using square brackets, `[ , ]`. Here, we find vehicles with mpg over 25 miles per gallon and only display columns `cyl`, `disp` and `wt`.

```{r}
# mtcars[row condition, col condition]
mtcars[mtcars$mpg > 25, c("cyl", "disp", "wt")]
```

An alternative would be to use the `subset()` function, which has a much more readable syntax.

```{r, eval = FALSE}
subset(mtcars, subset = mpg > 25, select = c("cyl", "disp", "wt"))
```

#### Task 5

1. How many observations are there in `mtcars`?
1. How many variables?
1. What is the average value of `mpg`?
1. What is the average value of `mpg` for cars with more than 4 cylinders, i.e. with `cyl>4`?

## Programming Basics

In this section we illustrate some general concepts related to programming.

### Variables

We encountered the term *variable* already several times, but mainly in the context of a column of a data.frame. In programming, a variable denotes an *object*. Another way to say it is that a variable is a name or a *label* for something:

```{r}
x = 1
y = "roses"
z = function(x){sqrt(x)}
```

Here `x` refers to the value `1`, `y` holds the string "roses", and `z` is the name of a function that computes $\sqrt{x}$. Notice that the argument `x` of the function is different from the `x` we just defined. It is **local** to the function:

```{r}
x
z(9)
```

### Control Flow

Control Flow relates to ways in which you can adapt your code to different circumstances. Based on a `condition` being `TRUE`, your program will do one thing, as opposed to another thing. This is most widely known as an `if/else` statement. In `R`, the if/else syntax is:

```{r, eval = FALSE}
if (condition == TRUE) {
  some R code
} else {
  some other R code
}
```

For example,

```{r}
x = 1
y = 3
if (x > y) {  # test if x > y
  # if TRUE
  z = x * y
  print("x is larger than y")
} else {
  # if FALSE
  z = x + 5 * y
  print("x is less than or equal to y")
}

z
```

### Loops

Loops are a very important programming construct. As the name suggests, in a *loop*, the program *repeatedly* loops over a set of instructions, until some condition tells it to stop. A very powerful, yet simple, construction is that the program can *count how many steps* it has done already, which may be important to know for many algorithms. The syntax of a `for` loop (there are others) is

```{r eval=FALSE}
for (ix in 1:10){   # does not have to be 1:10!
  # loop body: gets executed each time
  # the value of ix changes with each iteration
}
```

For example, consider this simple `for` loop, which will simply print the value of the *iterator* (called `i` in this case) to screen:

```{r}
for (i in 1:5){
  print(i)
}
```

Notice that instead of `1:5`, we could have *any* kind of iterable collection:

```{r}
for (i in c("mangos","bananas","apples")){
  print(paste("I love",i))  # the paste function pastes together strings
}
```

We often also see *nested* loops, which are just what their name suggests:

```{r}
for (i in 2:3){
  # first nest: for each i
  for (j in c("mangos","bananas","apples")){
    # second nest: for each j
    print(paste("Can I get",i,j,"please?"))
  }
}
```

The important thing to note here is that you can do calculations with the iterators *while inside a loop*.

### Functions

So far we have been using functions, but haven't actually discussed some of their details. A function is a set of instructions that `R` executes for us, much like those collected in a script file. The good thing is that functions are much more flexible than scripts, since they can depend on *input arguments*, which change the way the function behaves. Here is how to define a function:

```{r eval=FALSE}
function_name <- function(arg1,arg2=default_value){
  # function body
  # you do stuff with arg1 and arg2
  # you can have any number of arguments, with or without defaults
  # any valid `R` commands can be included here
  # the last line is returned
}
```

And here is a trivial example of a function definition:

```{r}
hello <- function(your_name = "Lord Vader"){
  paste("You R most welcome,",your_name)
  # we could also write:
  # return(paste("You R most welcome,",your_name))
}

# we call the function by typing its name with round brackets
hello()
```

You see that by not specifying the argument `your_name`, `R` reverts to the default value given. Try with your own name now!
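
As a small sketch of how arguments are supplied, the chunk below calls the function once by position and once by name. The function is redefined so that the chunk is self-contained, and the name `"Ada"` is just a placeholder we made up.

```{r, eval = FALSE}
# same definition as in the text, repeated to keep this chunk self-contained
hello <- function(your_name = "Lord Vader"){
  paste("You R most welcome,", your_name)
}

hello("Ada")              # argument matched by position
hello(your_name = "Ada")  # argument matched by name

# both calls return the same string
identical(hello("Ada"), hello(your_name = "Ada"))
```

Naming arguments explicitly tends to pay off once a function has more than one or two of them, since the call then documents itself.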
Just typing the function name returns the actual definition to us, which is handy sometimes: ```{r} hello ``` It's instructive to consider that before we defined the function `hello` above, `R` did not know what to do, had you called `hello()`. The function did not exist! In this sense, we *taught `R` a new trick*. This feature to create new capabilities on top of a core language is one of the most powerful characteristics of programming languages. In general, it is good practice to split your code into several smaller functions, rather than one long script file. It makes your code more readable, and it is easier to track down mistakes. #### Task 6 1. Write a for loop that counts down from 10 to 1, printing the value of the iterator to the screen. 1. Modify that loop to write "i iterations to go" where `i` is the iterator 1. Modify that loop so that each iteration takes roughly one second. You can achieve that by adding the command `Sys.sleep(1)` below the line that prints "i iterations to go". ================================================ FILE: 02-SummaryStats.Rmd ================================================ # Working With Data {#sum} In this chapter we will first learn some basic concepts that help summarizing data. Then, we will tackle a real-world task and read, clean, and summarize data from the web. ## Summary Statistics `R` has built in functions for a large number of summary statistics. For numeric variables, we can summarize data by looking at their center and spread, for example. ```{r} # for the mpg dataset, we load: library(ggplot2) ``` ### Central Tendency {-} Suppose we want to know the *mean* and *median* of all the values stored in the `data.frame` column `mpg$cty`: | Measure | `R` | Result | |:---------:|:-------------------:|:---------------------:| | Mean | `mean(mpg$cty)` | `r mean(mpg$cty)` | | Median | `median(mpg$cty)` | `r median(mpg$cty)` | ### Spread {-} How do the values in that column *vary*? How far *spread out* are they? 

| Measure   | `R`                 | Result                |
|:---------:|:-------------------:|:---------------------:|
| Variance  | `var(mpg$cty)`      | `r var(mpg$cty)`      |
| Standard Deviation | `sd(mpg$cty)` | `r sd(mpg$cty)`    |
| IQR       | `IQR(mpg$cty)`      | `r IQR(mpg$cty)`      |
| Minimum   | `min(mpg$cty)`      | `r min(mpg$cty)`      |
| Maximum   | `max(mpg$cty)`      | `r max(mpg$cty)`      |
| Range     | `range(mpg$cty)`    | `r range(mpg$cty)`    |

### Categorical {-}

For categorical variables, counts and percentages can be used for summary.

```{r}
table(mpg$drv)
table(mpg$drv) / nrow(mpg)
```

## Plotting

Now that we have some data to work with, and we have learned about the data at the most basic level, our next task will be to visualize it. Often, a proper visualization can illuminate features of the data that can inform further analysis.

We will look at four methods of visualizing data by using the basic `plot` facilities built into `R`:

- Histograms
- Barplots
- Boxplots
- Scatterplots

### Histograms

When visualizing a single numerical variable, a **histogram** is useful. It summarizes the *distribution* of values in a vector. In `R` you create one using the `hist()` function:

```{r}
hist(mpg$cty)
```

The histogram function has a number of parameters which can be changed to make our plot look much nicer. Use the `?` operator to read the documentation for `hist()` to see a full list of these parameters.

```{r}
hist(mpg$cty,
     xlab   = "Miles Per Gallon (City)",
     main   = "Histogram of MPG (City)", # main title
     breaks = 12,   # how many breaks?
     col    = "red",
     border = "blue")
```

Importantly, you should always be sure to label your axes and give the plot a title. The argument `breaks` is specific to `hist()`. Entering an integer will give a suggestion to `R` for how many bars to use for the histogram. By default `R` will attempt to intelligently guess a good number of `breaks`, but as we can see here, it is sometimes useful to modify this yourself.
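
To get a feeling for the `breaks` argument, the following sketch compares a coarse and a fine suggestion. It relies on `plot = FALSE`, which makes `hist()` return the computed bins instead of drawing them; the object names `h_coarse` and `h_fine` are ours and purely illustrative.

```{r, eval = FALSE}
library(ggplot2)  # provides the mpg dataset

# ask for roughly 5 and roughly 30 bins; R treats both as suggestions
h_coarse <- hist(mpg$cty, breaks = 5,  plot = FALSE)
h_fine   <- hist(mpg$cty, breaks = 30, plot = FALSE)

# number of bars R actually used in each case
length(h_coarse$counts)
length(h_fine$counts)

# either way, the bars account for every observation
sum(h_coarse$counts) == nrow(mpg)
sum(h_fine$counts) == nrow(mpg)
```

Too few bars hide the shape of the distribution, while too many make it noisy, so it is usually worth trying a couple of values.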

### Barplots

Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.

```{r}
barplot(table(mpg$drv))
```

```{r}
barplot(table(mpg$drv),
        xlab   = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
        ylab   = "Frequency",
        main   = "Drivetrains",
        col    = "dodgerblue",
        border = "darkorange")
```

### Boxplots

To visualize the relationship between a numerical and categorical variable, one could use a **boxplot**. In the `mpg` dataset, the `drv` variable takes a small, finite number of values. A car can only be front wheel drive, 4 wheel drive, or rear wheel drive.

```{r}
unique(mpg$drv)
```

First note that we can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. To do so in `R`, we use the `boxplot()` function. The box shows the *interquartile range*, the solid line in the middle is the value of the median, the whiskers extend to the most extreme observations lying within 1.5 times the interquartile range of the box, and the dots beyond them are outliers.

```{r}
boxplot(mpg$hwy)
```

However, more often we will use boxplots to compare a numerical variable for different values of a categorical variable.

```{r}
boxplot(hwy ~ drv, data = mpg)
```

Here we used the `boxplot()` command to create side-by-side boxplots. However, since we are now dealing with two variables, the syntax has changed. The `R` syntax `hwy ~ drv, data = mpg` reads "Plot the `hwy` variable against the `drv` variable using the dataset `mpg`." We see the use of a `~` (which specifies a formula) and also a `data = ` argument. This will be a syntax that is common to many functions we will use in this course.
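
As a small sketch of how general this formula syntax is, the same `hwy ~ drv, data = mpg` pattern can be handed to the base `R` function `aggregate()` to compute the medians that the boxes display. The choice of `median` and the name `med_by_drv` are ours, for illustration only.

```{r, eval = FALSE}
library(ggplot2)  # provides the mpg dataset

# median highway mileage for each drivetrain, using the formula interface
med_by_drv <- aggregate(hwy ~ drv, data = mpg, FUN = median)
med_by_drv
```

The result has one row per value of `drv`, which should line up with the thick middle lines in the side-by-side boxplots.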
```{r} boxplot(hwy ~ drv, data = mpg, xlab = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)", ylab = "Miles Per Gallon (Highway)", main = "MPG (Highway) vs Drivetrain", pch = 20, cex = 2, col = "darkorange", border = "dodgerblue") ``` Again, `boxplot()` has a number of additional arguments which have the ability to make our plot more visually appealing. ### Scatterplots Lastly, to visualize the relationship between two numeric variables we will use a **scatterplot**. This can be done with the `plot()` function and the `~` syntax we just used with a boxplot. (The function `plot()` can also be used more generally; see the documentation for details.) ```{r} plot(hwy ~ displ, data = mpg) ``` ```{r} plot(hwy ~ displ, data = mpg, xlab = "Engine Displacement (in Liters)", ylab = "Miles Per Gallon (Highway)", main = "MPG (Highway) vs Engine Displacement", pch = 20, cex = 2, col = "dodgerblue") ``` ### `ggplot` {#ggplot} All of the above plots could also have been generated using the `ggplot` function from the already loaded `ggplot2` package. Which function you use is up to you, but sometimes a plot is easier to build in base R (like in the `boxplot` example maybe), sometimes the other way around. ```{r} ggplot(data = mpg,mapping = aes(x=displ,y=hwy)) + geom_point() ``` `ggplot` is impossible to describe in brief terms, so please look at [the package's website](http://ggplot2.tidyverse.org) which provides excellent guidance. We will from time to time use ggplot in this book, so you could familiarize yourself with it. Let's quickly demonstrate how one could further customize that first plot: ```{r} ggplot(data = mpg, mapping = aes(x=displ,y=hwy)) + # ggplot() makes base plot geom_point(color="blue",size=2) + # how to show x and y? 
  scale_y_continuous(name="Miles Per Gallon (Highway)") +  # name of y axis
  scale_x_continuous(name="Engine Displacement (in Liters)") + # x axis
  theme_bw() +    # change the background
  ggtitle("MPG (Highway) vs Engine Displacement")   # add a title
```

If you want to see `ggplot` in action, you could start with [this](http://jcyhong.github.io/ggplot_demo.html) and then look at that [very nice tutorial](https://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html). It's fun!

## Summarizing Two Variables {#summarize-two}

We often are interested in how two variables are related to each other. The core concepts here are *covariance* and *correlation*. Let's generate some data on `x` and `y` and plot them against each other:

```{r x-y-corr,echo=FALSE,message=FALSE,warning=FALSE,fig.cap='How are $x$ and $y$ related?',fig.align='center'}
library(mvtnorm)
set.seed(10)
cor = 0.9
sig = matrix(c(1,cor,cor,1),c(2,2))
ndat = data.frame(rmvnorm(n=300,sigma = sig))
x = ndat$X1
y = ndat$X2
par(pty="s")
plot(y ~ x, xlab="x",ylab="y")
```

Taking as example the data in this plot, the concepts *covariance* and *correlation* relate to the following type of question:

```{block, type="note"}
Given that we observe a value of something like $x=2$, say, can we expect a high or a low value of $y$, on average? Something like $y=2$ or rather something like $y=-2$?
```

The answer to this type of question can be addressed by computing the covariance of both variables:

```{r}
cov(x,y)
```

Here, this gives a positive number, `r round(cov(x,y),2)`, indicating that as one variable lies above its average, the other one does as well. In other words, it indicates a **positive relationship**. What is less clear, however, is how to interpret the magnitude of `r round(cov(x,y),2)`. Is that a *strong* or a *weak* positive association?

In fact, we cannot tell. This is because the covariance is measured in the units of $x$ times the units of $y$, so its magnitude changes whenever we rescale either variable. There is a better measure available to us though, the **correlation**, which is obtained by *standardizing* each variable. By *standardizing* a variable $x$ we mean here dividing $x$ by its standard deviation $\sigma_x$:

$$
z = \frac{x}{\sigma_x}
$$

The *correlation coefficient* between $x$ and $y$, commonly denoted $r_{x,y}$, is then defined as

$$
r_{x,y} = \frac{cov(x,y)}{\sigma_x \sigma_y},
$$

and we get rid of the units problem. In `R`, you can call directly

```{r}
cor(x,y)
```

Now this is better. Given that the correlation has to lie in $[-1,1]$, a value of `r round(cor(x,y),2)` is indicative of a rather strong positive relationship for the data in figure \@ref(fig:x-y-corr).

Note that $x,y$ being drawn from a *continuous distribution* (they are jointly normally distributed) has no implication for covariance and correlation: We can compute those measures also for discrete random variables (like the throws of two dice, as you will see in one of our tutorials).

### Visually estimating $\sigma$

Sometimes it is useful to estimate the standard deviation of some data *without* the help of a computer (for example during an exam ;-) ). If $x$ is approximately normally distributed, about 95% of its observations will lie within a range of $\bar{x}\pm$ two standard deviations of $x$. That is to say, *four* standard deviations of $x$ cover about 95% of its observations.
Hence, a simple way to estimate the standard deviation for a variable is to look at the range of $x$, and simply divide that number by four.

```{r vis,fig.cap='visual estimation on $\\sigma$. The x-axis labels min and max as well as mean of $x$.',echo=FALSE}
sdd = 3
md = 3
dta = rnorm(50,mean=md,sd=sdd)
plot(dta,rep(1,50),pch=3,yaxt="n",ylab="",xlab="x",xaxt="n")
axis(1,at=round(c(min(dta),md,max(dta)),2))
```

This is illustrated in figure \@ref(fig:vis). Here we see that `diff(range(x))/4` gives `r round(diff(range(dta))/4,2)`, which compares favourably to the actual standard deviation `r sdd`.

## The `tidyverse`

[Hadley Wickham](http://hadley.nz) is the author of the R packages `ggplot2` and `dplyr` (and a myriad of others). With `ggplot2` he introduced what is called the *grammar of graphics* (hence, `gg`) to `R`. Grammar in the sense that there are **nouns** and **verbs** and a **syntax**, i.e. rules of how nouns and verbs are to be put together to construct an understandable sentence. He has extended the *grammar* idea into various other packages. The `tidyverse` package is a collection of those packages.

`tidy` data is data where:

* Each variable is a column
* Each observation is a row
* Each value is a cell

Fair enough, you might say, that is a regular spreadsheet. And you are right! However, data comes to us *not* tidy most of the time, and we first need to clean, or `tidy`, it up. Once it's in `tidy` format, we can use the tools in the `tidyverse` with great efficiency to analyse the data and stop worrying about which tool to use.

### Reading `.csv` data in the *tidy* way

We could have used the `read_csv()` function from the `readr` package to read our example dataset from the previous chapter. The `readr` function `read_csv()` has a number of advantages over the built-in `read.csv`. For example, it is much faster reading larger data.
[It also uses the `tibble` package to read the data as a tibble.](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) **A `tibble` is simply a data frame that prints with sanity.** Notice in the output below that we are given additional information such as dimension and variable type.

```{r, message = FALSE, warning = FALSE}
library(readr)  # you need `install.packages("readr")` once!
path = system.file(package="ScPoEconometrics","datasets","example-data.csv")
example_data_from_disk = read_csv(path)
```

### Tidy `data.frames` are `tibbles`

Let's grab some data from the `ggplot2` package:

```{r}
data(mpg,package = "ggplot2")  # load dataset `mpg` from `ggplot2` package
head(mpg, n = 10)
```

The function `head()` will display the first `n` observations of the data frame, as we have seen. The `head()` function was more useful before tibbles. Notice that `mpg` is a tibble already, so the output from `head()` indicates there are only `10` observations. Note that this applies to `head(mpg, n = 10)` and not `mpg` itself. Also note that tibbles print a limited number of rows and columns by default. The last line of the printed output indicates which rows and columns were omitted.

```{r}
mpg
```

Let's look at `str` as well to get familiar with the content of the data:

```{r}
str(mpg)
```

In this dataset an observation is for a particular model-year of a car, and the variables describe attributes of the car, for example its highway fuel efficiency.

To understand more about the data set, we use the `?` operator to pull up the documentation for the data.

```{r, eval = FALSE}
?mpg
```

Working with tibbles is mostly the same as working with plain data.frames:

```{r}
names(mpg)
mpg$year
mpg$hwy
```

Subsetting a tibble is also similar to subsetting a plain `data.frame`. Here, we find fuel efficient vehicles earning over 35 miles per gallon and only display `manufacturer`, `model` and `year`.
```{r} # mpg[row condition, col condition] mpg[mpg$hwy > 35, c("manufacturer", "model", "year")] ``` An alternative would be to use the `subset()` function, which has a much more readable syntax. ```{r, eval = FALSE} subset(mpg, subset = hwy > 35, select = c("manufacturer", "model", "year")) ``` Lastly, and most *tidy*, we could use the `filter` and `select` functions from the `dplyr` package which introduces the *pipe operator* `f(x) %>% g(z)` from the `magrittr` package. This operator takes the output of the first command, for example `y = f(x)`, and passes it *as the first argument* to the next function, i.e. we'd obtain `g(y,z)` here.^[A *pipe* is a concept from the Unix world, where it means to take the output of some command, and pass it on to another command. This way, one can construct a *pipeline* of commands. For additional info on the pipe operator in R, you might be interested [in this tutorial](https://www.datacamp.com/community/tutorials/pipe-r-tutorial).] ```{r, eval = TRUE,message=FALSE,warning=FALSE} library(dplyr) mpg %>% filter(hwy > 35) %>% select(manufacturer, model, year) ``` Note that the above syntax is equivalent to the following pipe-free command (which is much harder to read!): ```{r, eval = TRUE,message=FALSE,warning=FALSE} library(dplyr) select(filter(mpg, hwy > 35), manufacturer, model, year) ``` All three approaches produce the same results. Which you use will be largely based on a given situation as well as your preference. #### Task 1 1. Make sure to have the `mpg` dataset loaded by typing `data(mpg)` (and `library(ggplot2)` if you haven't!). Use the `table` function to find out how many cars were built by *mercury*? 1. What is the average year the audi's were built in this dataset? Use the function `mean` on the subset of column `year` that corresponds to `audi`. (Be careful: subsetting a `tibble` returns a `tibble` (and not a vector)!. so get the `year` column after you have subset the `tibble`.) 1. 
Use the `dplyr` piping syntax from above first with `group_by` and then with `summarise(newvar=your_expression)` to find the mean `year` by all manufacturers (i.e. same as previous task, but for all manufacturers. don't write a loop!). ### Tidy Example: Importing Non-Tidy Excel Data The data we will look at is from [Eurostat](http://ec.europa.eu/eurostat/data/database) on demography and migration. You should download the data yourself (click on previous link, then drill down to *database by themes > Population and social conditions > Demograph and migration > Population change - Demographic balance and crude rates at national level (demo_gind)*). Once downloaded, we can read the data with the function `read_excel` from the package [`readxl`](http://readxl.tidyverse.org), again part of the `tidyverse` suite. It's important to know how the data is organized in the spreadsheet. Open the file with Excel to see: * There is a heading which we don't need. * There are 5 rows with info that we don't need. * There is one table per variable (total population, males, females, etc) * Each table has one row for each country, and one column for each year. * As such, this data is **not tidy**. 
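
If you prefer to inspect a spreadsheet from within `R` rather than in Excel, the sketch below shows one possible way using `readxl`. Note that it uses the example workbook that ships with `readxl` (located via `readxl_example()`) as a stand-in, not the Eurostat file, so the sheet index and the number of rows to peek at are assumptions you would adapt to your own file.

```{r, eval = FALSE}
library(readxl)

# path to a demo workbook bundled with readxl (stand-in for your download)
demo_path <- readxl_example("datasets.xlsx")

# which sheets does the workbook contain?
excel_sheets(demo_path)

# read only the first few rows, without treating any row as a header,
# to see where headings end and where the actual table starts
peek <- read_excel(demo_path, sheet = 1, col_names = FALSE, n_max = 8)
peek
```

Looking at such a raw preview is one way to decide which cell range to read later on.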
Now we will read the first chunk of data, from the first table: *total population*:

```{r,message=FALSE,warning=FALSE}
library(readxl)  # load the library

# Notice that if you installed the R package of this book,
# you have the .xls data file already at
# `system.file(package="ScPoEconometrics",
#                       "datasets","demo_gind.xls")`
# otherwise:
# * download the file to your computer
# * change the argument `path` to where you downloaded it
# you may want to change your working directory with `setwd("your/directory")`
# or in RStudio by clicking Session > Set Working Directory

# total population in raw format
tot_pop_raw = read_excel(
                path = system.file(package="ScPoEconometrics",
                                   "datasets","demo_gind.xls"),
                sheet="Data", # which sheet
                range="A9:K68")  # which excel cell range to read
names(tot_pop_raw)[1] <- "Country"   # lets rename the first column
tot_pop_raw
```

This shows a `tibble`, which we encountered just above. The column names are `Country,2008,2009,...`, and the rows are numbered `1,2,3,...`. Notice, in particular, that *all* columns seem to be of type `<chr>`, i.e. characters - a string, not a number! We'll have to fix that, as this is clearly numeric data.

#### `tidyr`

In the previous `tibble`, each year is a column name (like `2008`) instead of all years being collected in one column `year`. We really would like to have several rows for each Country, one row per year. We want to `gather()` all years into a new column to tidy this up - and here is how:

1. specify which columns are to be gathered: in our case, all years (note that `paste(2008:2017)` produces a vector like `["2008", "2009", "2010",...]`)
1. say what those columns should be gathered into, i.e. what is the *key* for those values: we'll call it `year`.
1. Finally, what is the name of the new resulting column, containing the *value* from each cell: let's call it `counts`.
```{r gather,warning=FALSE} library(tidyr) # for the gather function tot_pop = gather(tot_pop_raw, paste(2008:2017),key="year", value = "counts") tot_pop ``` That's better! However, `counts` is still `chr`! Let's convert it to a number: ```{r convert} tot_pop$counts = as.integer(tot_pop$counts) tot_pop ``` Now you can see that column `counts` is indeed `int`, i.e. an integer number, and we are fine. The `Warning: NAs introduced by coercion` means that `R` converted some values to `NA`, because it couldn't convert them into `numeric`. More below! #### `dplyr` >The [transform](http://r4ds.had.co.nz/transform.html) chapter of Hadley Wickham's book is a great place to read up more on using `dplyr`. With `dplyr` you can do the following operations on `data.frame`s and `tibble`s: * Choose observations based on a certain value (i.e. subset): `filter()` * Reorder rows: `arrange()` * Select variables by name: `select()` * Create new variables out of existing ones: `mutate()` * Summarise variables: `summarise()` All of those verbs can be used with `group_by()`, where we apply the respective operation on a *group* of the dataframe/tibble. For example, on our `tot_pop` tibble we will now * filter * mutate * and plot the resulting values Let's get a plot of the populations of France, the UK and Italy over time, in terms of millions of people. We will make use of the `piping` syntax of `dplyr` which we introduced just above. ```{r gather-plot,warning=FALSE,message=FALSE} library(dplyr) # for %>%, filter, mutate, ... # 1. take the data.frame `tot_pop` tot_pop %>% # 2. pipe it into the filter function # filter on Country being one of "France","United Kingdom" or "Italy" filter(Country %in% c("France","United Kingdom","Italy")) %>% # 3. pipe the result into the mutate function # create a new column called millions mutate(millions = counts / 1e6) %>% # 4. 
  # pipe the result into ggplot to make a plot
  ggplot(mapping = aes(x=year,y=millions,color=Country,group=Country)) + geom_line(size=1)
```

#### Arrange a `tibble` {-}

* What are the top/bottom 5 most populated areas?

```{r,message=FALSE}
top5 = tot_pop %>%
  arrange(desc(counts)) %>%  # arrange in descending order of col `counts`
  top_n(5)

bottom5 = tot_pop %>%
  arrange(desc(counts)) %>%
  top_n(-5)
# let's see top 5
top5
# and bottom 5
bottom5
```

Now this is not exactly what we wanted. It's always the same country in both top and bottom, because there are multiple years per country. Let's compute average population over the last 5 years and rank according to that:

```{r,message=FALSE}
topbottom = tot_pop %>%
  group_by(Country) %>%
  filter(year > 2012) %>%
  summarise(mean_count = mean(counts)) %>%
  arrange(desc(mean_count))

top5 = topbottom %>% top_n(5)
bottom5 = topbottom %>% top_n(-5)
top5
bottom5
```

That's better!

#### Look for `NA`s in a `tibble` {-}

Sometimes data is *missing*, and `R` represents it with the special value `NA` (not available). It is good to know where in our dataset we are going to encounter any missing values, so the task here is: let's produce a table that has three columns:

1. the names of countries with missing data
2. how many years of data are missing for each of those
3. and the actual years that are missing

```{r}
missings = tot_pop %>%
  filter(is.na(counts)) %>% # is.na(x) returns TRUE if x is NA
  group_by(Country) %>%
  summarise(n_missing = n(),years = paste(year,collapse = ", "))
knitr::kable(missings)  # knitr::kable makes a nice table
```

#### Males and Females {-}

Let's look at the numbers by male and female population. They are in the same xls file, but at different cell ranges. Also, I just realised that the special character `:` indicates *missing* data. We can feed that to `read_excel` and that will spare us the need to convert data types afterwards.
Let's see: ```{r females} females_raw = read_excel( path = system.file(package="ScPoEconometrics", "datasets","demo_gind.xls"), sheet="Data", # which sheet range="A141:K200", # which excel cell range to read na=":" ) # missing data indicator names(females_raw)[1] <- "Country" # lets rename the first column females_raw ``` You can see that `R` now correctly read the numbers as such, after we told it that the `:` character has the special *missing* meaning: before, it *coerced* the entire `2008` column (for example) to be of type `chr` after it hit the first `:`. We had to manually convert the column back to `numeric`, in the process automatically coercing the `:`s into `NA`. Now we addressed that issue directly. Let's also get the male data in the same way: ```{r males} males_raw = read_excel( path = system.file(package="ScPoEconometrics", "datasets","demo_gind.xls"), sheet="Data", # which sheet range="A75:K134", # which excel cell range to read na=":" ) # missing data indicator names(males_raw)[1] <- "Country" # lets rename the first column ``` Next step was to `tidy` up this data, just as before: ```{r tidymales} females = gather(females_raw, paste(2008:2017),key="year", value = "counts") males = gather(males_raw, paste(2008:2017),key="year", value = "counts") ``` Let's try to tweak our above plot to show the same data in two separate panels: one for males and one for females. This is easiest to do with `ggplot` if we have all the data in one single `data.frame` (or `tibble`), and marked with a *group identifier*. 
Let's first add this to both datasets, and then let's just combine both into one: ```{r} females$sex = "female" males$sex = "male" sexes = rbind(males,females) # "row bind" 2 data.frames sexes ``` Now that we have all the data nice and `tidy` in a `data.frame`, this is a very small change to our previous plotting code: ```{r psexes} sexes %>% filter(Country %in% c("France","United Kingdom","Italy")) %>% mutate(millions = counts / 1e6) %>% ggplot(mapping = aes(x=as.Date(year,format="%Y"), # convert to `Date` y=millions,colour=Country,group=Country)) + geom_line() + scale_x_date(name = "year") + # rename x axis facet_wrap(~sex) # make two panels, splitting by groups `sex` ``` #### Always Compare to Germany :-) {-} How do our three countries compare with respect to the biggest country in the EU in terms of population? What *fraction* of Germany does the French population make in any given year, for example? ```{r} # remember that the pipe operator %>% takes the # result of the previous operation and passes it # as the *first* argument to the next function call merge_GER <- tot_pop %>% # 1. subset to countries of interest filter( Country %in% c("France", "United Kingdom", "Italy") ) %>% # 2. group data by year group_by(year) %>% # 3. add GER's count as new column *by year* left_join( # Germany only filter(tot_pop, Country %in% "Germany including former GDR"), # join back in `by year` by="year") merge_GER ``` Here you see that the merge (or join) operation labelled `col.x` and `col.y` if both datasets contained a column called `col`. Now let's continue to compute what proportion of german population each country amounts to: ```{r} names(merge_GER)[1] <- "Country" merge_GER %>% mutate(prop_GER = 100 * counts.x / counts.y) %>% # 5. plot ggplot(mapping = aes(x = year, y = prop_GER, color = Country, group = Country)) + geom_line(size=1) + scale_y_continuous("percent of German population") + theme_bw() # new theme for a change? 
```

================================================
FILE: 03-linear-reg.Rmd
================================================
---
output:
  pdf_document: default
  html_document: default
---

# Linear Regression {#linreg}

In this chapter we will learn an additional way to represent the relationship between an *outcome*, or *dependent*, variable $y$ and an *explanatory*, or *independent*, variable $x$. We will refer throughout to the graphical representation of a collection of independent observations on $x$ and $y$, i.e., a *dataset*.

## How are `x` and `y` related?

### Data on Cars

We will look at the built-in `cars` dataset. Let's get a view of this by just typing `View(cars)` in RStudio. You can see something like this:

```{r,echo=FALSE}
head(cars)
```

We have a `data.frame` with two columns: `speed` and `dist`. Type `help(cars)` to find out more about the dataset. There you could read that

>The data give the speed of cars (mph) and the distances taken to stop (ft).

It's good practice to know the extent of a dataset. You could just type

```{r}
dim(cars)
```

to find out that we have 50 rows and 2 columns. A central question that we want to ask now is the following:

### How are `speed` and `dist` related?

The simplest way to start is to plot the data. Remembering that we view each row of a data.frame as an observation, we could just label one axis of a graph `speed`, and the other one `dist`, and go through our table above row by row. We just have to read off the x/y coordinates and mark them in the graph. In `R`:

```{r}
plot(dist ~ speed, data = cars,
     xlab = "Speed (in Miles Per Hour)",
     ylab = "Stopping Distance (in Feet)",
     main = "Stopping Distance vs Speed",
     pch  = 20,
     cex  = 2,
     col  = "red")
```

Here, each dot represents one observation. In this case, one particular measurement of `speed` and `dist` for a car. Now, again:

```{block, type='note'}
How are `speed` and `dist` related? How could one best *summarize* this relationship?
```
One thing we could do is draw a straight line through this scatterplot, like so:

```{r}
plot(dist ~ speed, data = cars,
     xlab = "Speed (in Miles Per Hour)",
     ylab = "Stopping Distance (in Feet)",
     main = "Stopping Distance vs Speed",
     pch  = 20,
     cex  = 2,
     col  = "red")
abline(a = 60,b = 0,lw=3)
```

Now that doesn't seem a particularly *good* way to summarize the relationship. Clearly, a *better* line would not be flat, but have a *slope*, i.e. go upwards:

```{r,echo=FALSE}
plot(dist ~ speed, data = cars,
     xlab = "Speed (in Miles Per Hour)",
     ylab = "Stopping Distance (in Feet)",
     main = "Stopping Distance vs Speed",
     pch  = 20,
     cex  = 2,
     col  = "red")
abline(a = 0,b = 5,lw=3)
```

That is slightly better. However, the line seems to sit at too high a level: the point at which it crosses the y-axis is called the *intercept*, and it's too high. We just learned how to represent a *line*, i.e. with two numbers called *intercept* and *slope*. Let's write down a simple formula which represents a line where some outcome $z$ is related to a variable $x$:

\begin{equation}
z = b_0 + b_1 x (\#eq:bline)
\end{equation}

Here $b_0$ represents the value of the intercept (i.e. $z$ when $x=0$), and $b_1$ is the value of the slope. The question for us is now: How do we choose the numbers $b_0$ and $b_1$ such that the result is a **good** line?
### Choosing the Best Line ```{r, echo = FALSE, message = FALSE, warning = FALSE} generate_data = function(int = 0.5, slope = 1, sigma = 10, n_obs = 9, x_min = 0, x_max = 10) { x = seq(x_min, x_max, length.out = n_obs) y = int + slope * x + rnorm(n_obs, 0, sigma) fit = lm(y ~ x) y_hat = fitted(fit) y_bar = rep(mean(y), n_obs) error = resid(fit) meandev = y - y_bar data.frame(x, y, y_hat, y_bar, error, meandev) } plot_total_dev = function(reg_data,title=NULL) { if (is.null(title)){ plot(reg_data$x, reg_data$y, xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey") rect(xleft = reg_data$x, ybottom = reg_data$y, xright = reg_data$x + abs(reg_data$meandev), ytop = reg_data$y - reg_data$meandev, density = -1, col = rgb(red = 0, green = 0, blue = 1, alpha = 0.5), border = NA) } else { plot(reg_data$x, reg_data$y, xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey",main=title,ylim=c(-2,10.5)) axis(side=2,at=seq(-2,10,by=2)) rect(xleft = reg_data$x, ybottom = reg_data$y, xright = reg_data$x + abs(reg_data$meandev), ytop = reg_data$y - reg_data$meandev, density = -1, col = rgb(red = 0, green = 0, blue = 1, alpha = 0.5), border = NA) } # arrows(reg_data$x, reg_data$y_bar, # reg_data$x, reg_data$y, # col = 'grey', lwd = 1, lty = 3, length = 0.2, angle = 20) abline(h = mean(reg_data$y), lwd = 2,col = "grey") # abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey") } plot_total_dev_prop = function(reg_data) { plot(reg_data$x, reg_data$y, xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey") arrows(reg_data$x, reg_data$y_bar, reg_data$x, reg_data$y_hat, col = 'darkorange', lwd = 1, length = 0.2, angle = 20) arrows(reg_data$x, reg_data$y_hat, reg_data$x, reg_data$y, col = 'dodgerblue', lwd = 1, lty = 2, length = 0.2, angle = 20) abline(h = mean(reg_data$y), lwd = 2,col = "grey") abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey") } plot_unexp_dev = function(reg_data) { plot(reg_data$x, reg_data$y, xlab = "x", ylab = "y", pch = 20, cex = 2,asp=1) 
  arrows(reg_data$x, reg_data$y_hat,
         reg_data$x, reg_data$y,
         col = 'red', lwd = 2, lty = 1, length = 0.1, angle = 20)
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "black")
}

plot_unexp_SSR = function(reg_data,asp=1,title=NULL) {
  if (is.null(title)){
    plot(reg_data$x, reg_data$y,
         xlab = "x", ylab = "y", pch = 20, cex = 2,
         rect(xleft = reg_data$x, ybottom = reg_data$y,
              xright = reg_data$x + abs(reg_data$error), ytop = reg_data$y - reg_data$error, density = -1,
              col = rgb(red = 1, green = 0, blue = 0, alpha = 0.5), border = NA),asp=asp)
    abline(lm(y ~ x, data = reg_data), lwd = 2, col = "black")
  } else {
    plot(reg_data$x, reg_data$y,
         xlab = "x", ylab = "y", pch = 20, cex = 2,
         rect(xleft = reg_data$x, ybottom = reg_data$y,
              xright = reg_data$x + abs(reg_data$error), ytop = reg_data$y - reg_data$error, density = -1,
              col = rgb(red = 1, green = 0, blue = 0, alpha = 0.5), border = NA),asp=asp,main=title)
    axis(side=2,at=seq(-2,10,by=2))
    abline(lm(y ~ x, data = reg_data), lwd = 2, col = "black")
  }
}

plot_exp_dev = function(reg_data) {
  plot(reg_data$x, reg_data$y, main = "SSReg (Sum of Squares Regression)",
       xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  arrows(reg_data$x, reg_data$y_bar,
         reg_data$x, reg_data$y_hat,
         col = 'darkorange', lwd = 1, length = 0.2, angle = 20)
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
  abline(h = mean(reg_data$y), col = "grey")
}
```

```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(21)
plot_data = generate_data(sigma = 2)
```

In order to be able to reason about good or bad lines, we need a name for the *output* of equation \@ref(eq:bline). We call the value $\hat{y}_i$ the *predicted value* for observation $i$, after having chosen some particular values $b_0$ and $b_1$:

\begin{equation}
\hat{y}_i = b_0 + b_1 x_i (\#eq:abline-pred)
\end{equation}

In general it is likely that we won't be able to choose $b_0$ and $b_1$ in such a way as to provide a perfect prediction, i.e. one where $\hat{y}_i = y_i$ for all $i$.
That is, we expect to make an *error* in our prediction $\hat{y}_i$, so let's denote this value $e_i$. If we acknowledge that we will make errors, let's at least make them as small as possible! Exactly this is going to be our task now.

Suppose we have the following set of `r nrow(plot_data)` observations on `x` and `y`, and we put the *best* straight line we can think of into it. It would look like this:

```{r line-arrows, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="The best line and its errors",fig.align="center"}
plot_unexp_dev(plot_data)
```

Here, the red arrows indicate the **distance** between the prediction (i.e. the black line) and each data point; in other words, each arrow is a particular $e_i$. An upward pointing arrow indicates a positive value of a particular $e_i$, and vice versa for downward pointing arrows. The errors are also called *residuals*, which comes from the way we can write the equation for this relationship between two particular values $(y_i,x_i)$ belonging to observation $i$:

\begin{equation}
y_i = b_0 + b_1 x_i + e_i (\#eq:abline)
\end{equation}

You realize of course that $\hat{y}_i = y_i - e_i$, which just means that our prediction is the observed value $y_i$ minus any error $e_i$ we make. In other words, $e_i$ is what is left to be explained on top of the line $b_0 + b_1 x_i$; hence, it's the *residual* part of $y_i$. Here are $y,\hat{y}$ and the resulting $e$ which are plotted in figure \@ref(fig:line-arrows):

```{r,echo=FALSE}
knitr:::kable(subset(plot_data,select=c(x,y,y_hat,error)),align = "c",digits = 2)
```

If our line was a **perfect fit** to the data, all $e_i = 0$, and the column `error` would display `0` for each row: there would be no errors at all. (All points in figure \@ref(fig:line-arrows) would perfectly line up on a straight line.)

Now, back to our claim that this particular line is the *best* line. What exactly characterizes this best line?
We now come back to what we said above: *how do we make the errors as small as possible*? Keeping in mind that each residual $e_i$ is $y_i - \hat{y}_i$, we have the following minimization problem to solve:

\begin{align}
e_i & = y_i - \hat{y}_i = y_i - \underbrace{\left(b_0 + b_1 x_i\right)}_\text{prediction}\\
e_1^2 + \dots + e_N^2 &= \sum_{i=1}^N e_i^2 \equiv \text{SSR}(b_0,b_1) \\
(b_0,b_1) &= \arg \min_{\text{int},\text{slope}} \sum_{i=1}^N \left[y_i - \left(\text{int} + \text{slope } x_i\right)\right]^2 (\#eq:ols-min)
\end{align}

```{block,type="warning"}
The best line chooses $b_0$ and $b_1$ so as to minimize the sum of **squared residuals** (SSR).
```
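The minimization problem in \@ref(eq:ols-min) can also be attacked numerically. The following is a minimal sketch, assuming the built-in `cars` data from above; the helper name `ssr`, the starting values and the tolerance setting are our own choices for illustration. It hands the SSR function to R's general-purpose optimizer `optim` and compares the result to the coefficients reported by `lm`. The two should agree up to numerical tolerance.

```{r}
# sum of squared residuals for a candidate line b = c(intercept, slope)
ssr <- function(b, x, y) {
  sum((y - (b[1] + b[2] * x))^2)
}

# minimize SSR numerically, starting from the flat line through zero
res <- optim(par = c(0, 0), fn = ssr,
             x = cars$speed, y = cars$dist,
             method = "BFGS", control = list(reltol = 1e-10))

res$par                              # numerical minimizer (intercept, slope)
coef(lm(dist ~ speed, data = cars))  # OLS coefficients for comparison
```

Any other candidate pair of intercept and slope that you plug into `ssr` should give a larger value than `res$value`.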
Wait a moment, why *squared* residuals? This is easy to understand: suppose that instead, we wanted to just make the *sum* of the arrows in figure \@ref(fig:line-arrows) as small as possible (that is, no squares). Choosing our line to make this number small would not give a particularly good representation of the data: given that errors of opposite sign and equal magnitude offset each other, we could have very long arrows (but of opposite signs), and a poor resulting line. Squaring each error avoids this (because now negative errors get positive values!)

```{r line-squares, echo=FALSE, message=FALSE, warning=FALSE, fig.align='center', fig.cap="The best line and its SQUARED errors"}
plot_unexp_SSR(plot_data)
```

We illustrate this in figure \@ref(fig:line-squares). This is the same data as in figure \@ref(fig:line-arrows), but instead of arrows of length $e_i$ for each observation $i$, now we draw a square with side $e_i$, i.e. an area of $e_i^2$. We have three apps for you at this point: one where you have to try and find the best line by choosing $b_0$ and $b_1$ while only focusing on the sum of errors (and not their square), a second one focusing on squared errors, and a third one that visualizes the minimization problem:

```{r app1, eval=FALSE}
library(ScPoApps)
launchApp("reg_simple_arrows")
launchApp("reg_simple") # with squared errors
launchApp("SSR_cone") # visualize the minimization problem from above!
```

Most of our `apps` have an associated `about` document, which gives extra information and explanations. After you have looked at all three apps, we invite you to have a look at the associated explainers by typing

```{r,eval=FALSE}
aboutApp("reg_simple_arrows")
aboutApp("reg_simple")
aboutApp("SSR_cone")
```

## Ordinary Least Squares (OLS) Estimator{#OLS}

The method to compute (or *estimate*) $b_0$ and $b_1$ we illustrated above is called *Ordinary Least Squares*, or OLS. $b_0$ and $b_1$ are therefore also often called the *OLS coefficients*.
By solving problem \@ref(eq:ols-min) one can derive an explicit formula for them:

\begin{equation}
b_1 = \frac{cov(x,y)}{var(x)}, (\#eq:beta1hat)
\end{equation}

i.e. the estimate of the slope coefficient is the covariance between $x$ and $y$ divided by the variance of $x$, both computed from our sample of data. With $b_1$ in hand, we can get the estimate for the intercept as

\begin{equation}
b_0 = \bar{y} - b_1 \bar{x}. (\#eq:beta0hat)
\end{equation}

where $\bar{z}$ denotes the sample mean of variable $z$. The interpretation of the OLS slope coefficient $b_1$ is as follows. Given a line as in $y = b_0 + b_1 x$,

* $b_1 = \frac{d y}{d x}$ measures the change in $y$ resulting from a one unit change in $x$
* For example, if $y$ is wage and $x$ is years of education, $b_1$ would measure the effect of an additional year of education on wages.

There is an alternative representation for the OLS slope coefficient which relates to the *correlation coefficient* $r$. Remember from section \@ref(summarize-two) that $r = \frac{cov(x,y)}{s_x s_y}$, where $s_z$ is the standard deviation of variable $z$. With this in hand, we can derive the OLS slope coefficient as

\begin{align}
b_1 &= \frac{cov(x,y)}{var(x)}\\
    &= \frac{cov(x,y)}{s_x s_x} \\
    &= \frac{cov(x,y)}{s_x s_y} \cdot \frac{s_y}{s_x} \\
    &= r\frac{s_y}{s_x} (\#eq:beta1-r)
\end{align}

In other words, the slope coefficient is equal to the correlation coefficient $r$ times the ratio of standard deviations of $y$ and $x$.

### Linear Regression without Regressor

There are several important special cases for the linear regression introduced above. Let's start with the most obvious one: What is the meaning of running a regression *without any regressor*, i.e. without an $x$? Our line becomes very simple. Instead of \@ref(eq:bline), we get

\begin{equation}
y = b_0. (\#eq:b0line)
\end{equation}

This means that our minimization problem in \@ref(eq:ols-min) *also* becomes very simple: We only have to choose $b_0$!
We have

$$
b_0 = \arg\min_{\text{int}} \sum_{i=1}^N \left[y_i - \text{int}\right]^2,
$$

which is a quadratic function of $\text{int}$ with a unique minimum, such that

$$
b_0 = \frac{1}{N} \sum_{i=1}^N y_i = \overline{y}.
$$

```{block type='tip'}
Least Squares **without regressor** $x$ estimates the sample mean of the outcome variable $y$, i.e. it produces $\overline{y}$.
```

### Regression without an Intercept

We follow the same logic here, except that this time we drop the intercept from our initial equation. The minimisation problem in \@ref(eq:ols-min) now becomes:

\begin{align}
b_1 &= \arg\min_{\text{slope}} \sum_{i=1}^N \left[y_i - \text{slope } x_i \right]^2\\
\mapsto b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N x_i y_i}{\frac{1}{N}\sum_{i=1}^N x_i^2} = \frac{\overline{x y}}{\overline{x^2}} (\#eq:b1line)
\end{align}

Here $\overline{x y}$ is the sample mean of the products $x_i y_i$, which is in general *not* the same as the product of means $\bar{x}\bar{y}$.

```{block type='tip'}
Least Squares **without intercept** (i.e. with $b_0=0$) is a line that passes through the origin.
```
In this case we only get to choose the slope $b_1$ of this anchored line.^[This slope has a geometric reading: collecting the data in vectors $\mathbf{x} = (x_1,\dots,x_N)$ and $\mathbf{y} = (y_1,\dots,y_N)$, the slope is $b_1 = \frac{\mathbf{x}\cdot\mathbf{y}}{\mathbf{x}\cdot\mathbf{x}}$. Hence, it's related to the [scalar projection](https://en.wikipedia.org/wiki/Scalar_projection) of $\mathbf{y}$ on $\mathbf{x}$.] You should now try out both of those restrictions on our linear model by spending some time with

```{r,eval=FALSE}
launchApp("reg_constrained")
```

### Centering A Regression

By *centering* or *demeaning* a regression, we mean to subtract from both $y$ and $x$ their respective averages to obtain $\tilde{y}_i = y_i - \bar{y}$ and $\tilde{x}_i = x_i - \bar{x}$. We then run a regression *without intercept* as above. That is, we use $\tilde{x}_i,\tilde{y}_i$ instead of $x_i,y_i$ in \@ref(eq:b1line) to obtain our slope estimate $b_1$:

\begin{align}
b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N \tilde{x}_i \tilde{y}_i}{\frac{1}{N}\sum_{i=1}^N \tilde{x}_i^2}\\
    &= \frac{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x}) (y_i - \bar{y})}{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2} \\
    &= \frac{cov(x,y)}{var(x)} (\#eq:bline-centered)
\end{align}

This last expression is *identical* to the one in \@ref(eq:beta1hat)! It's the standard OLS estimate for the slope coefficient. We note the following:

```{block type='tip'}
Adding a constant to a regression produces the same slope estimate as centering all variables and estimating without intercept. So, unless all variables are centered, **always** include an intercept in the regression.
```
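The equivalence stated in the tip can be checked on data. Here is a minimal sketch, assuming the built-in `cars` dataset; the object names (`b1_formula`, `x_c`, and so on) are ours and only serve the illustration. It computes the slope in three ways: from the covariance formula \@ref(eq:beta1hat), from a standard regression with intercept, and from a regression without intercept on the demeaned variables.

```{r}
x <- cars$speed
y <- cars$dist

# 1. slope from the covariance formula
b1_formula <- cov(x, y) / var(x)

# 2. slope from a standard regression with intercept
b1_lm <- coef(lm(y ~ x))[["x"]]

# 3. demean both variables, then regress without intercept
#    (the `- 1` in the formula drops the intercept)
x_c <- x - mean(x)
y_c <- y - mean(y)
b1_centered <- coef(lm(y_c ~ x_c - 1))[["x_c"]]

c(formula = b1_formula, with_intercept = b1_lm, centered = b1_centered)
```

All three numbers should coincide. Dropping the intercept *without* centering first, i.e. `lm(y ~ x - 1)`, would in general give a different slope.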
To get a better feel for what is going on here, you can try this out now by yourself by typing:

```{r,eval=FALSE}
launchApp("demeaned_reg")
```

### Standardizing A Regression {#reg-standard}

*Standardizing* a variable $z$ means to demean it as above, and in addition to divide the demeaned value by the standard deviation of $z$. Similarly to what we did above for *centering*, we define transformed variables $\breve{y}_i = \frac{y_i-\bar{y}}{\sigma_y}$ and $\breve{x}_i = \frac{x_i-\bar{x}}{\sigma_x}$ where $\sigma_z$ is the standard deviation of variable $z$. By now you should be used to what comes next! As above, we use $\breve{x}_i,\breve{y}_i$ instead of $x_i,y_i$ in \@ref(eq:b1line), this time to obtain:

\begin{align}
b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N \breve{x}_i \breve{y}_i}{\frac{1}{N}\sum_{i=1}^N \breve{x}_i^2}\\
    &= \frac{\frac{1}{N}\sum_{i=1}^N \frac{x_i - \bar{x}}{\sigma_x} \frac{y_i - \bar{y}}{\sigma_y}}{\frac{1}{N}\sum_{i=1}^N \left(\frac{x_i - \bar{x}}{\sigma_x}\right)^2} \\
    &= \frac{Cov(x,y)}{\sigma_x \sigma_y} \\
    &= Corr(x,y) (\#eq:bline-standardized)
\end{align}

The third line uses the fact that the denominator is $\frac{Var(x)}{\sigma_x^2} = 1$.

```{block type='tip'}
After we standardize both $y$ and $x$, the slope coefficient $b_1$ in the regression without intercept is equal to the **correlation coefficient**.
```
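Again we can verify this claim numerically. The following is a minimal sketch, assuming the built-in `cars` dataset; the object names are ours. The base `R` function `scale` demeans a variable and divides it by its standard deviation. We then compare the slope of the regression without intercept to the correlation coefficient, and also check the alternative representation \@ref(eq:beta1-r).

```{r}
# standardize both variables (scale() returns a matrix, hence as.numeric)
x_s <- as.numeric(scale(cars$speed))
y_s <- as.numeric(scale(cars$dist))

# slope of the regression without intercept on standardized data
b1_std <- coef(lm(y_s ~ x_s - 1))[["x_s"]]

# correlation coefficient of the original data
r <- cor(cars$speed, cars$dist)
all.equal(b1_std, r)

# slope on the original scale equals r * s_y / s_x
b1 <- coef(lm(dist ~ speed, data = cars))[["speed"]]
all.equal(b1, r * sd(cars$dist) / sd(cars$speed))
```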
And also for this case we have a practical application for you. Just type this and play around with the app for a little while!

```{r,eval=FALSE}
launchApp("reg_standardized")
```

## Predictions and Residuals {#pred-resids}

Now we want to ask how our residuals $e_i$ relate to the prediction $\hat{y_i}$. Let us first think about the average of all predictions $\hat{y_i}$, i.e. the number $\frac{1}{N} \sum_{i=1}^N \hat{y_i}$. Let's just take \@ref(eq:abline-pred) and plug it into this average, so that we get

\begin{align}
\frac{1}{N} \sum_{i=1}^N \hat{y_i} &= \frac{1}{N} \sum_{i=1}^N \left(b_0 + b_1 x_i\right) \\
&= b_0 + b_1 \frac{1}{N} \sum_{i=1}^N x_i \\
&= b_0 + b_1 \bar{x}
\end{align}

But rearranging the formula for the OLS intercept \@ref(eq:beta0hat), $b_0 = \bar{y} - b_1 \bar{x}$, shows that this last line is just equal to $\bar{y}$! That means of course that

$$
\frac{1}{N} \sum_{i=1}^N \hat{y_i} = b_0 + b_1 \bar{x} = \bar{y}
$$

in other words:

```{block type='tip'}
The average of our predictions $\hat{y_i}$ is identically equal to the mean of the outcome $y$. This implies that the average of the residuals is equal to zero.
```
Related to this result, we can show that the prediction $\hat{y}$ and the residuals are *uncorrelated*, something that is often called **orthogonality** between $\hat{y}_i$ and $e_i$. We would write this as

\begin{align}
Cov(\hat{y},e) &=\frac{1}{N} \sum_{i=1}^N (\hat{y}_i-\bar{y})(e_i-\bar{e}) = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i-\bar{y})e_i \\
&= \frac{1}{N} \sum_{i=1}^N \hat{y}_i e_i-\bar{y} \frac{1}{N} \sum_{i=1}^N e_i = 0
\end{align}

The first line uses $\bar{e} = 0$. The final equality holds because $\sum_{i=1}^N e_i = 0$ and because the first order conditions of problem \@ref(eq:ols-min) imply $\sum_{i=1}^N x_i e_i = 0$, so that $\sum_{i=1}^N \hat{y}_i e_i = b_0 \sum_{i=1}^N e_i + b_1 \sum_{i=1}^N x_i e_i = 0$.

It's useful to bring back the sample data which generate figure \@ref(fig:line-arrows) at this point in order to verify these claims:

```{r,echo=FALSE}
ss = subset(plot_data,select=c(y,y_hat,error))
round(ss,2)
```

Let's check that these claims are true in this sample of data. We want that

1. The average of $\hat{y}_i$ is the same as the mean of $y$.
2. The average of the errors is zero.
3. Prediction and errors are uncorrelated.

```{r}
# 1. average prediction equals average outcome
all.equal(mean(ss$y_hat), mean(ss$y))
# 2. average error is zero
all.equal(mean(ss$error), 0)
# 3. prediction and errors are uncorrelated
all.equal(cov(ss$error,ss$y_hat), 0)
```

So indeed we can confirm this result with our test dataset. Great!

## Correlation, Covariance and Linearity

It is important to keep in mind that Correlation and Covariance relate to a *linear* relationship between `x` and `y`. Given how the regression line is estimated by OLS (see just above), you can see that the regression line inherits this property from the Covariance.

A famous exercise by Francis Anscombe (1973) illustrates this by constructing 4 different datasets which all have identical **linear** statistics: mean, variance, correlation and regression line *are identical*. However, the usefulness of the statistics to describe the relationship in the data is not clear.
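You can check the claim about identical statistics yourself. The following is a minimal sketch, assuming the `anscombe` data frame that ships with base `R` (columns `x1` to `x4` and `y1` to `y4`); the object name `stats` is ours. It computes the summary statistics and the OLS coefficients separately for each of the four datasets. Expect the columns to agree closely after rounding, though not necessarily in every digit.

```{r}
# summary statistics for each of the four x/y pairs in `anscombe`
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), mean_y = mean(y),
    var_x  = var(x),  var_y  = var(y),
    corr   = cor(x, y),
    intercept = coef(fit)[[1]],
    slope     = coef(fit)[[2]])
})
colnames(stats) <- paste("dataset", 1:4)
round(stats, 2)
```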
```{r,echo=FALSE} ##-- now some "magic" to do the 4 regressions in a loop: ff <- y ~ x mods <- setNames(as.list(1:4), paste0("lm", 1:4)) for(i in 1:4) { ff[2:3] <- lapply(paste0(c("y","x"), i), as.name) ## or ff[[2]] <- as.name(paste0("y", i)) ## ff[[3]] <- as.name(paste0("x", i)) mods[[i]] <- lmi <- lm(ff, data = anscombe) } op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0)) for(i in 1:4) { ff[2:3] <- lapply(paste0(c("y","x"), i), as.name) plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2, xlim = c(3, 19), ylim = c(3, 13),main=paste("dataset",i)) abline(mods[[i]], col = "blue") } par(op) ``` The important lesson from this example is the following: ```{block,type="warning"} Always **visually inspect** your data, and don't rely exclusively on summary statistics like *mean, variance, correlation and regression line*. All of those assume a **linear** relationship between the variables in your data. ```
The mission of Anscombe has been continued recently. As a result of this we can have a look at the `datasauRus` package, which pursues Anscombe's idea through a multitude of funny data sets, all with the same linear statistics. Don't just compute the covariance, or you might actually end up looking at a Dinosaur! What? Type this to find out:

```{r,eval=FALSE}
launchApp("datasaurus")
aboutApp("datasaurus")
```

### Non-Linear Relationships in Data

Suppose our data now looks like this:

```{r non-line-cars,echo=FALSE}
with(mtcars,plot(hp,mpg,xlab="x",ylab="y"))
```

Fitting our previous *best line*, defined in equation \@ref(eq:abline) as $y = b_0 + b_1 x + e$, to this data, we get something like this:

```{r non-line-cars-ols,echo=FALSE,fig.align='center',fig.cap='Best line with non-linear data?'}
l1 = lm(mpg~hp,data=mtcars)
plot(mtcars$hp,mtcars$mpg,xlab="x",ylab="y")
abline(reg=l1,lw=2)
```

Somehow when looking at figure \@ref(fig:non-line-cars-ols) one is not totally convinced that the straight line is a good summary of this relationship. For values $x\in[50,120]$ the line seems too low, then again too high, and it completely misses the right boundary. It's easy to address this shortcoming by including *higher order terms* of an explanatory variable. We would modify \@ref(eq:abline) to read now

\begin{equation}
y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i (\#eq:abline2)
\end{equation}

This is a special case of *multiple regression*, which we will talk about in chapter \@ref(multiple-reg). You can see that there are *multiple* slope coefficients.
For now, let's just see how this performs:

```{r non-line-cars-ols2,echo=FALSE,fig.align="center",fig.cap="Better line with non-linear data!"}
l1 = lm(mpg~hp+I(hp^2),data=mtcars)
newdata=data.frame(hp=seq(from=min(mtcars$hp),to=max(mtcars$hp),length.out=100))
newdata$y = predict(l1,newdata=newdata)
plot(mtcars$hp,mtcars$mpg,xlab="x",ylab="y")
lines(newdata$hp,newdata$y,lw=2)
```

## Analysing $Var(y)$

Analysis of Variance (ANOVA) refers to a method to decompose variation in one variable as a function of several others. We can use this idea on our outcome $y$. Suppose we wanted to know the variance of $y$, keeping in mind that, by definition, $y_i = \hat{y}_i + e_i$. We would write

\begin{align}
Var(y) &= Var(\hat{y} + e)\\
       &= Var(\hat{y}) + Var(e) + 2 Cov(\hat{y},e)\\
       &= Var(\hat{y}) + Var(e) (\#eq:anova)
\end{align}

We have seen above in section \@ref(pred-resids) that the covariance between prediction $\hat{y}$ and error $e$ is zero, which is why the term $2 Cov(\hat{y},e)$ drops out in \@ref(eq:anova). What this tells us in words is that we can decompose the variance in the observed outcome $y$ into a part that relates to variance as *explained by the model* and a part that comes from unexplained variation. Finally, we know the definition of *variance*, and can thus write down the respective formulae for each part:

* $Var(y) = \frac{1}{N}\sum_{i=1}^N (y_i - \bar{y})^2$
* $Var(\hat{y}) = \frac{1}{N}\sum_{i=1}^N (\hat{y_i} - \bar{y})^2$, because the mean of $\hat{y}$ is $\bar{y}$ as we know.
* $Var(e) = \frac{1}{N}\sum_{i=1}^N e_i^2$, because the mean of $e$ is zero.

We can thus formulate how the total variation in outcome $y$ is apportioned between model and unexplained variation:

```{block, type="tip"}
The total variation in outcome $y$ (often called SST, or *total sum of squares*) is equal to the sum of explained squares (SSE) plus the sum of squared residuals (SSR). We have thus **SST = SSE + SSR**.
```

## Assessing the *Goodness of Fit*

In our setup, there exists a convenient measure for how well a particular statistical model fits the data. It is called $R^2$ (*R squared*), also called the *coefficient of determination*. We make use of the just introduced decomposition of variance, and write the formula as

\begin{equation}
R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] (\#eq:Rsquared)
\end{equation}

It is easy to see that a *good fit* is one where the sum of *explained* squares (SSE) is large relative to the total variation (SST). In such a case, we observe an $R^2$ close to one. In the opposite case, we will see an $R^2$ close to zero. Notice that a small $R^2$ does not imply that the model is useless, just that it explains a small fraction of the observed variation.

## An Example: A Log Wage Equation

Let's consider the following example concerning wage data collected in the 1976 Current Population Survey in the USA.^[This example is close to the vignette of the [wooldridge](https://cloud.r-project.org/web/packages/wooldridge/index.html) package, whose author I hereby thank for the excellent work.] We want to investigate the relationship between average hourly earnings, and years of education. Let's start with a plot:

```{r wooldridge-wages, echo=TRUE,fig.cap='Wages vs Education from the wooldridge dataset wage1.',fig.height=7}
data("wage1", package = "wooldridge")   # load data

# a function that returns a plot
plotfun <- function(wage1,log=FALSE,rug = TRUE){
    y = wage1$wage
    if (log){
        y = log(wage1$wage)
    }
    plot(y = y,
       x = wage1$educ,
       col = "red", pch = 21, bg = "grey",
       cex=1.25, xaxt="n", frame = FALSE,      # set default x-axis to none
       main = ifelse(log,"log(Wages) vs. Education, 1976","Wages vs.
Education, 1976"),
       xlab = "years of education",
       ylab = ifelse(log,"Log Hourly wages","Hourly wages"))
    axis(side = 1, at = c(0,6,12,18))         # add custom ticks to x axis
    if (rug) rug(wage1$wage, side=2, col="red")        # add `rug` to y axis
}

par(mfcol = c(2,1))  # set up a plot with 2 panels
# plot 1: standard scatter plot
plotfun(wage1)

# plot 2: add a panel with histogram+density
hist(wage1$wage,prob = TRUE, col = "grey", border = "red",
     main = "Histogram of wages and Density",xlab = "hourly wage")
lines(density(wage1$wage), col = "black", lw = 2)
```

```{r,echo=FALSE}
par(mfcol = c(1,1))
```

Looking at the top panel of figure \@ref(fig:wooldridge-wages), you notice two things. First, from the red ticks on the y axis, you see that wages are very concentrated at around 5 USD per hour, with fewer and fewer observations at higher rates. Second, the hourly wage seems to increase with higher education levels. The bottom panel reinforces the first point, showing that the estimated pdf (probability density function), shown as a black line, has a very long right tail: there are fewer and fewer observations at larger and larger values of hourly wage.

```{block,type="warning"}
You have seen this shape of a distribution in the tutorial for chapter 2 already! Do you remember the name of this particular shape of a distribution? (Why not type `ScPoEconometrics::runTutorial('chapter2')` to check?)
```
Let's run a first regression on this data to generate some intuition:

\begin{equation}
\text{wage}_i = b_0 + b_1 \text{educ}_i + e_i (\#eq:wage)
\end{equation}

We use the `lm` function for this purpose as follows:

```{r}
hourly_wage <- lm(formula = wage ~ educ, data = wage1)
```

and we can add the resulting regression line to our above plot:

```{r wooldridge-wages2, echo=TRUE,fig.cap='Wages vs Education from the wooldridge dataset wage1, with regression'}
plotfun(wage1)
abline(hourly_wage, col = 'black', lw = 2) # add regression line
```

The `hourly_wage` object contains the results of this estimation. We can get a summary of those results with the `summary` method:

```{r}
summary(hourly_wage)
```

The main interpretation of this table can be read off the column labelled *Estimate*, reporting estimated coefficients $b_0,b_1$:

1. With zero years of education, the predicted hourly wage is about -0.9 dollars per hour (row named `(Intercept)`)
1. Each additional year of education increases the hourly wage by 54 cents. (row named `educ`)
1. For example, for 15 years of education, we predict roughly -0.9 + 0.541 * 15 = `r -0.9 + 0.541 * 15` dollars/h.

## Scaling Regressions

```{block type="tip"}
Regression estimates ($b_0, b_1$) are in the scale *of the data*. The actual *value* of the estimates will vary if we change the scale of the data. The overall fit of the model to the data would *not* change, however, so that the $R^2$ statistic would be constant.
```
Suppose we wanted to use the above estimates to report the effect of years of education on *annual* wages instead of *hourly* ones. Let's assume we have full-time workers, 7h per day, 5 days per week, 45 weeks per year. Calling this factor $\delta = 7 \times 5 \times 45 = 1575$, we have that $x$ dollars per hour imply $x \times \delta = x \times `r 7*5*45`$ dollars per year.

What would be the effect of using $\tilde{y} = wage \times `r 7*5*45`$ instead of $y = wage$ as outcome variable on our regression coefficients $b_0$ and $b_1$? Well, let's try!

```{r,results= "asis",echo = FALSE}
delta = 7*5*45
wage1$annual_wage <- wage1$wage * delta
wage_annual <- lm(formula = annual_wage ~ educ, data = wage1)
c1 = coef(hourly_wage)
c2 = coef(wage_annual)
stargazer::stargazer(hourly_wage, wage_annual,  type = if (knitr:::is_latex_output()) "latex" else "html",title = "Effect of Scaling on Coefficients")
```

Let's call the coefficients in the column labelled (1) $b_0$ and $b_1$, and let's call the ones in column (2) $b_0^*$ and $b_1^*$. In column (1) we see that another year increases hourly wage by `r round(c1[2],2)` dollars, as before. In column (2), the corresponding number is `r round(c2[2],2)`, i.e. another year of education will increase *annual* wages by `r round(c2[2],2)` dollars, on average. Notice however, that $b_0 \times \delta = `r round(c1[1],2)` \times `r delta` = `r round(c2[1],2)` = b_0^*$ and that $b_1 \times \delta = `r round(c1[2],2)` \times `r delta` = `r round(c2[2],2)` = b_1^*$, that is, we just had to multiply both coefficients by the scaling factor applied to the original outcome $y$ to obtain our new coefficients $b_0^*$ and $b_1^*$! Also, observe that the $R^2$s of both regressions are identical! So, really, we did not have to run the regression in column (2) at all to make this change: multiplying all coefficients through by $\delta$ is enough in this case. We keep exactly the same fit to the data.
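The same logic applies to any dataset. Here is a minimal sketch that does not need the `wooldridge` package: we assume the built-in `cars` data and convert the stopping distance from feet to meters, so the scaling factor is $\delta = 0.3048$. The object names `fit_ft` and `fit_m` are ours.

```{r}
delta  <- 0.3048                       # one foot in meters
fit_ft <- lm(dist ~ speed, data = cars)
fit_m  <- lm(I(dist * delta) ~ speed, data = cars)

# both coefficients are scaled by delta ...
rbind(scaled_by_hand = coef(fit_ft) * delta,
      re_estimated   = coef(fit_m))

# ... while the R^2 does not change
c(feet = summary(fit_ft)$r.squared, meters = summary(fit_m)$r.squared)
```

The two rows of the first output should be identical, and so should the two $R^2$ values.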
Rescaling the regressors $x$ is slightly different, but it's easy to work out *how* different, given the linear nature of the covariance operator, which is part of the OLS estimator. Suppose we rescale $x$ by the number $c$. Then, using the OLS formula in \@ref(eq:beta1hat), we see that we get the new slope coefficient $b_1^*$ via

\begin{align}
b_1^* &= \frac{Cov(cx,y)}{Var(cx)} \\
      &= \frac{c\,Cov(x,y)}{c^2 Var(x)} \\
      &= \frac{1}{c} b_1.
\end{align}

As for the intercept, and by using \@ref(eq:beta0hat)

\begin{align}
b_0^* &= \bar{y} - b_1^* \frac{1}{N}\sum_{i=1}^N c \cdot x_i \\
      &= \bar{y} - b_1^* \frac{c}{N}\sum_{i=1}^N x_i \\
      &= \bar{y} - \frac{1}{c} b_1 \cdot c \cdot \bar{x} \\
      &= \bar{y} - b_1 \bar{x} \\
      &= b_0
\end{align}

That is, we change the slope by the *inverse* of the scaling factor applied to regressor $x$, but the intercept is unaffected by this. You should play around for a while with our rescaling app to get a feeling for this:

```{r,eval=FALSE}
library(ScPoApps)
launchApp('Rescale')
```

## A Particular Rescaling: The $\log$ Transform

The natural logarithm is a particularly important transformation that we often encounter in economics. Why would we transform a variable with the $\log$ function to start with?

1. Several important economic variables (like wages, city size, firm size, etc) are approximately *log-normally* distributed. By transforming them with the $\log$, we obtain an approximately *normally* distributed variable, which has desirable properties for our regression.
1. Applying the $\log$ reduces the impact of outliers.
1. The transformation allows for a convenient interpretation in terms of *percentage changes* of the outcome variable.

Let's investigate this issue in our running example by transforming the wage data above. Look back at the bottom panel of figure \@ref(fig:wooldridge-wages): Of course you saw immediately that this looked a lot like a log-normal distribution, so point 1. above applies.
We modify the left hand side of equation \@ref(eq:wage):

\begin{equation}
\log(\text{wage}_i) = b_0 + b_1 \text{educ}_i + e_i (\#eq:log-wage)
\end{equation}

Let's use the `update` function to modify our previous regression model:

```{r}
log_hourly_wage = update(hourly_wage, log(wage) ~ ., data = wage1)
```

The `update` function takes an existing `lm` object, like `hourly_wage` here, and updates the `formula`. Here the `.` on the right hand side means *leave unchanged* (so the RHS stays unchanged). How do our pictures change?

```{r logplot,echo = TRUE}
par(mfrow = c(1,2))

plotfun(wage1,rug = FALSE)
abline(hourly_wage, col = 'black', lw = 2) # add regression line

plotfun(wage1,log = TRUE, rug = FALSE)
abline(log_hourly_wage, col = 'black', lw = 2) # add regression line
par(mfrow = c(1,1))
```

It *looks as if* the regression line has the same slope, but beware of the different scales of the y-axis! You can clearly see that all y-values have been compressed by the log transformation. The log case behaves differently from our *scaling by a constant number* case above because it is a *nonlinear* function. Let's compare the output between both models:

```{r,echo = FALSE, results = "asis"}
stargazer::stargazer(hourly_wage, log_hourly_wage, title = "Log Transformed Equation",type = if (knitr:::is_latex_output()) "latex" else "html")
```

The interpretation of the transformed model in column (2) is now the following:

```{block type = "note"}
We call a regression of the form $\log(y) = b_0 + b_1 x + e$ a *log-level* specification, because we regressed the log of a variable on the level (i.e. not the log!) of another variable. Here, the impact of increasing $x$ by one unit is to increase $y$ by approximately $100 \times b_1$ **percent**. In our example: an additional year of education will increase hourly wages by 8.3%. Notice that this is very different from saying *...increases log hourly wages by 8.3%*, which is wrong.
```
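The percentage reading of $b_1$ is an approximation which works well when $b_1$ is small; the exact percentage change implied by a log-level model is $100 \times (e^{b_1} - 1)$. The following is a minimal sketch of this point that does not rely on the wage data: we assume the built-in `cars` dataset and regress `log(dist)` on `speed` purely for illustration, and the object names are ours.

```{r}
fit <- lm(log(dist) ~ speed, data = cars)
b1  <- coef(fit)[["speed"]]

# approximate vs. exact percentage change in y for one more unit of x
c(approx_pct = 100 * b1, exact_pct = 100 * (exp(b1) - 1))

# the ratio of predicted y one unit of x apart equals exp(b1),
# no matter at which value of x we start
p <- predict(fit, newdata = data.frame(speed = c(10, 11)))
all.equal(unname(exp(p[2] - p[1])), exp(b1))
```

The closer $b_1$ is to zero, the closer the two percentages in the first output are to each other.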
Notice that the $R^2$ slightly improved, so we have a better fit to the data. This is due to the fact that the log compressed large outlier values. Whether we apply the $\log$ to left or right-hand side variables makes a difference, as outlined in this important table:
(\#tab:loglog) Common Regression Specifications

Specification | Outcome Var | Regressor | Interpretation of $b_1$ | Comment
:------------:|:------------:|:------------:|:------------:|:------------:
Level-level | y | x | $\Delta y = b_1 \Delta x$ | Standard
Level-log | y | $\log(x)$ | $\Delta y = \frac{b_1}{100} \% \Delta x$ | less frequent
Log-level | $\log(y)$ | x | $\% \Delta y = (100 b_1) \Delta x$ | Semi-elasticity
Log-Log | $\log(y)$ | $\log(x)$ | $\% \Delta y = b_1 \% \Delta x$ | Elasticity

You may remember from your introductory micro course the definition of the *elasticity* of $y$ with respect to $x$: this number tells us by how many percent $y$ will change if we change $x$ by one percent. Let's look at another example from the `wooldridge` package of datasets, this time concerning CEO salaries and their relationship with company sales.

```{r ceo-sal,fig.cap="The effect of log-transforming highly skewed data.",fig.height = 4}
data("ceosal1", package = "wooldridge")
par(mfrow = c(1,2))
plot(salary ~ sales, data = ceosal1, main = "Sales vs Salaries",xaxt = "n",frame = FALSE)
axis(1, at = c(0,40000, 80000))
rug(ceosal1$salary,side = 2)
rug(ceosal1$sales,side = 1)
plot(log(salary) ~ log(sales), data = ceosal1, main = "Log(Sales) vs Log(Salaries)")
```

```{r,echo = FALSE}
par(mfrow = c(1,1))
```

In the left panel of figure \@ref(fig:ceo-sal) you clearly see that both `sales` and `salary` have very long right tails, as indicated by the rug plots on either axis. As a consequence, the points are clustered in the bottom left corner of the plot. We suspect a positive relationship, but it's hard to see. Contrast this with the right panel, where both axes have been log transformed: the points are nicely spread out, clearly spelling out a positive correlation. Let's see what this gives in a regression model!
```{r,results = 'asis',echo = FALSE,warning=FALSE,message = FALSE}
library(magrittr)
library(dplyr)
ceosal1 %>%
  mutate(logsalary = log(salary), logsales = log(sales)) %>%
  lm(logsalary ~ logsales, data = .) %>%
  equatiomatic::extract_eq(use_coefs = TRUE)
```

Referring back to table \@ref(tab:loglog), here we have a log-log specification. Therefore we interpret this regression as follows:

```{block type = "tip"}
In a log-log equation, the slope coefficient $b_1$ is the *elasticity of $y$ with respect to changes in $x$*. Here: A 1% increase in sales is associated with a 0.26% increase in CEO salaries. Note, again, that there is no *log* in this statement.
```

================================================
FILE: 04-MultipleReg.Rmd
================================================
# Multiple Regression {#multiple-reg}

We can extend the discussion from chapter \@ref(linreg) to more than one explanatory variable. For example, suppose that instead of only $x$ we now had $x_1$ and $x_2$ in order to explain $y$. Everything we've learned for the single variable case applies here as well. Instead of a regression *line*, we now get a regression *plane*, i.e. an object representable in 3 dimensions: $(x_1,x_2,y)$.

As an example, suppose we wanted to explain how many *miles per gallon* (`mpg`) a car can travel as a function of its *horse power* (`hp`) and its *weight* (`wt`). In other words we want to estimate the equation

\begin{equation}
mpg_i = b_0 + b_1 hp_i + b_2 wt_i + e_i (\#eq:abline2d)
\end{equation}

on our built-in dataset of cars (`mtcars`):

```{r mtcarsdata}
head(subset(mtcars, select = c(mpg,hp,wt)))
```

How do you think `hp` and `wt` will influence how many miles per gallon of gasoline each of those cars can travel? In other words, what do you expect the signs of $b_1$ and $b_2$ to be? With two explanatory variables as here, it is still possible to visualize the regression plane, so let's start with this as an answer.
The OLS regression plane through this dataset looks like the one in figure \@ref(fig:plane3D-reg):

```{r plane3D-reg,echo=FALSE,fig.align='center',fig.cap='Multiple Regression - a plane in 3D. The red points are the observations.',warning=FALSE,message=FALSE}
library(plotly)
library(reshape2)
data(mtcars)

# linear fit
fit <- lm(mpg ~ wt+hp,data=mtcars)

to_plot_x <- range(mtcars$wt)
to_plot_y <- range(mtcars$hp)

df <- data.frame(wt = rep(to_plot_x, 2), hp = rep(to_plot_y, each = 2))
df["pred"] <- predict.lm(fit, df, se.fit = F)

surf <- acast(df, wt ~ hp)

color <- rep(0, length(df))
mtcars %>%
  plot_ly(colors = "grey") %>%
  add_markers(x = ~wt, y = ~hp, z = ~mpg,name = "data",opacity = .8, marker=list(color = 'red', size = 5, hoverinfo="skip")) %>%
  add_surface(x = to_plot_x, y = to_plot_y, z = ~surf, inherit = F, name = "Mtcars 3D", opacity = .75, cauto = F, surfacecolor = color) %>%
  hide_colorbar()
```

This visualization shows a couple of things: the data are shown with red points and the grey plane is the one resulting from OLS estimation of equation \@ref(eq:abline2d). You should realize that this is exactly the same story as told in figure \@ref(fig:line-arrows) - just in three dimensions!

Furthermore, *multiple* regression refers to the fact that there could be *more* than two regressors. In fact, you could in principle have $K$ regressors, and our theory developed so far would still be valid:

\begin{align}
\hat{y}_i &= b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_K x_{Ki}\\
e_i &= y_i - \hat{y}_i (\#eq:multiple-reg)
\end{align}

Just as before, the least squares method chooses numbers $(b_0,b_1,\dots,b_K)$ so as to minimize SSR, exactly as in the minimization problem for the one regressor case seen in \@ref(eq:ols-min).

## All Else Equal {#ceteris}

We can see from the above plot that cars with more horse power and greater weight in general travel fewer miles per gallon of fuel.
Hence, we observe a plane that is downward sloping in both the *weight* and *horse power* directions. Suppose now we wanted to know the impact of `hp` on `mpg` *in isolation*, as if we could ask

```{block,type="tip"}
Keeping the value of $wt$ fixed for a certain car, what would the impact on $mpg$ be if we were to increase **only** its $hp$? Put differently, keeping **all else equal**, what's the impact of changing $hp$ on $mpg$?
```
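One way to make this thought experiment concrete, sketched here with the built-in `mtcars` data: take two hypothetical cars of identical weight whose horse power differs by one unit, and compare their predicted `mpg`. The object names `fit_sketch` and `two_cars`, and the values `wt = 3` and `hp = 110`, are arbitrary choices for illustration.

```{r}
fit_sketch <- lm(mpg ~ wt + hp, data = mtcars)
two_cars <- data.frame(wt = c(3, 3), hp = c(110, 111))  # same weight, one more horse power
pred <- predict(fit_sketch, newdata = two_cars)
diff(pred)               # change in predicted mpg between the two cars
coef(fit_sketch)["hp"]   # the same number: the coefficient on hp
```

Because the model is linear, the result does not depend on the particular values chosen for `wt` and `hp`.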
We ask this kind of question all the time in econometrics. In figure \@ref(fig:plane3D-reg) you clearly see that both explanatory variables have a negative impact on the outcome of interest: as one increases either the horse power or the weight of a car, one finds that miles per gallon decreases. What is kind of hard to read off is *how negative* an impact each variable has in isolation.

As a matter of fact, the kind of question asked here is so common that it has its own name: we'd say "*ceteris paribus*, what is the impact of `hp` on `mpg`?". *Ceteris paribus* is Latin and means *the others equal*, i.e. all other variables fixed. In terms of our model in \@ref(eq:abline2d), we want to know the following quantity:

\begin{equation}
\frac{\partial mpg_i}{\partial hp_i} = b_1 (\#eq:abline2d-deriv)
\end{equation}

The $\partial$ sign denotes a *partial derivative* of the function describing `mpg` with respect to the variable `hp`. It measures *how the value of `mpg` changes, as we change the value of `hp` ever so slightly*. In our context, this means: *keeping all other variables fixed, what is the effect of `hp` on `mpg`?*. We therefore also call the value of coefficient $b_1$ the *partial effect* of `hp` on `mpg`. In terms of our dataset, we use `R` to run the following **multiple regression**:
```{r,echo=FALSE}
summary(fit)
```

From this table you see that the coefficient on `wt` has value `r round(coef(fit)[2],5)`. You can interpret this as follows:

```{block,type="warning"}
Holding all other variables fixed at their observed values - or *ceteris paribus* - a one unit increase in $wt$ implies a -3.87783 units change in $mpg$. In other words, increasing the weight of a car by 1000 pounds (lbs) will lead to 3.88 miles less travelled per gallon. Similarly, a car with one additional horse power means that we will travel 0.03177 fewer miles per gallon of gasoline, *all else (i.e. $wt$) equal*.
```

## Multicolinearity {#multicol}

One important requirement for multiple regression is that the data be **not linearly dependent**: Each variable should provide at least some new information for the outcome, and it cannot be replicated as a linear combination of other variables. Suppose that in the example above, we had a variable `wtplus` defined as `wt + 1`, and we included this new variable together with `wt` in our regression. In this case, `wtplus` provides no new information. It's enough to know $wt$, and add $1$ to it. In this sense, `wtplus` is a redundant variable and should not be included in the model. Notice that this holds only for *linearly* dependent variables - *nonlinear* transformations (like for example $wt^2$) are exempt from this rule. Here is why:

\begin{align}
y &= b_0 + b_1 \text{wt} + b_2 \text{wtplus} + e \\
  &= b_0 + b_1 \text{wt} + b_2 (\text{wt} + 1) + e \\
  &= (b_0 + b_2) + \text{wt} (b_1 + b_2) + e
\end{align}

This shows that we cannot *identify* the regression coefficients in case of linearly dependent data. Variation in the variable `wt` identifies a different coefficient, say $\gamma = b_1 + b_2$, from what we actually wanted: separate estimates for $b_1,b_2$.

```{block, type="note"}
We cannot have variables which are *linearly dependent*, or *perfectly colinear*. This is known as the **rank condition**.
In particular, the condition dictates that we need $N \geq K+1$, i.e. at least as many observations as coefficients. The greater the degree of linear dependence amongst our explanatory variables, the less information we can extract from them, and our estimates become *less precise*.
```

## Log Wage Equation

Let's go back to our previous example of the relationship between log wages and education. How does this relationship change if we also think that experience in the labor market has an impact, next to years of education? Here is a picture:

```{r plane3D-lwage,echo=FALSE,fig.align='center',fig.cap='Log wages vs education and experience in 3D.',warning=FALSE,message=FALSE}
data("wage1", package = "wooldridge")

# linear fit
log_wage <- lm(lwage ~ educ + exper,data=wage1)

to_plot_x <- range(wage1$educ)
to_plot_y <- range(wage1$exper)

df <- data.frame(educ = rep(to_plot_x, 2), exper = rep(to_plot_y, each = 2))
df["pred"] <- predict.lm(log_wage, df, se.fit = F)

surf <- acast(df, educ ~ exper)

color <- rep(0, length(df))
wage1 %>%
  plot_ly(colors = "grey") %>%
  add_markers(x = ~educ, y = ~exper, z = ~lwage,name = "data",opacity = .8, marker=list(color = 'red', size = 5, hoverinfo="skip", opacity = 0.8)) %>%
  add_surface(x = to_plot_x, y = to_plot_y, z = ~surf, inherit = F, name = "wages 3D", opacity = .75, cauto = F, surfacecolor = color) %>%
  hide_colorbar()
```

Let's add even more variables! For instance, what's the impact of experience in the labor market, and time spent with the current employer? Let's first look at how those variables co-vary with each other:

```{r corrplot, fig.cap = "correlation plot"}
cmat = round(cor(subset(wage1,select = c(lwage,educ,exper,tenure))),2) # correlation matrix
corrplot::corrplot(cmat,type = "upper",method = "ellipse")
```

The way to read the so-called *correlation plot* in figure \@ref(fig:corrplot) is straightforward: each row illustrates the correlation of a certain variable with the other variables.
In this example both the shape of the ellipse in each cell and its color coding tell us how strongly two variables correlate. Let us put this into a regression model now:

```{r,results = 'asis'}
educ_only <- lm(lwage ~ educ , data = wage1)
educ_exper <- lm(lwage ~ educ + exper , data = wage1)
log_wages <- lm(lwage ~ educ + exper + tenure, data = wage1)
stargazer::stargazer(educ_only, educ_exper, log_wages,type = if (knitr:::is_latex_output()) "latex" else "html")
```

Column (1) refers to model \@ref(eq:log-wage) from the previous chapter, where we only had `educ` as a regressor: we obtain an $R^2$ of 0.186. Column (2) is the model that generated the plane in figure \@ref(fig:plane3D-lwage) above. Column (3) is the model with three regressors. You can see that by adding more regressors, the quality of our fit increases, as more of the variation in $y$ is now accounted for by our model. You can also see that the values of our estimated coefficients keep changing as we move from left to right across the columns. Given the correlation structure shown in figure \@ref(fig:corrplot), it is only natural that this is happening: We see that `educ` and `exper` are negatively correlated, for example. So, if we *omit* `exper` from the model in column (1), `educ` will reflect part of this correlation with `exper` by a lower estimated value. By directly controlling for `exper` in column (2) we get an estimate of the effect of `educ` *net of* whatever effect `exper` has in isolation on the outcome variable. We will come back to this point later on.

## How To Make Predictions {#make-preds}

So suppose we have a model like

$$\text{lwage} = b_0 + b_{1}(\text{educ}) + b_{2}(\text{exper}) + b_{3}(\text{tenure}) + \epsilon$$

How could we use this to make a *prediction* of log wages, given some new data? Remember that the OLS procedure gives us *estimates* for the values $b_0,b_1, b_2,b_3$.
With those in hand, it is straightforward to make a prediction about the *conditional mean* of the outcome - just plug in the desired numbers for `educ,exper` and `tenure`. Suppose you want to know what the mean of `lwage` is conditional on `educ = 10,exper=4` and `tenure = 2`. You'd do \begin{align} E[\text{lwage}|\text{educ}=10,\text{exper}=4,\text{tenure}=2] &= b_0 + b_1 10 + b_2 4 + b_3 2\\ &= `r round(coef(log_wages) %*% c(1,10,4,2),2)`. \end{align} I computed the last line directly with ```{r,eval=FALSE} x = c(1,10,4,2) # 1 for intercept pred = coef(log_wages) %*% x ``` but `R` has a more complete prediction interface, using the function `predict`. For starters, you can predict the model on all data points which were contained in the dataset we used for estimation, i.e. `wage1` in our case: ```{r} head(predict(log_wages)) # first 6 observations of wage1 as predicted by our model ``` Often you want to add that prediction *to* the original dataset: ```{r} wage_prediction = cbind(wage1, prediction = predict(log_wages)) head(wage_prediction[, c("lwage","educ","exper","tenure","prediction")]) ``` You'll remember that we called the distance in prediction and observed outcome our *residual* $e$. Well here this is just `lwage - prediction`. Indeed, $e$ is such an important quantity that `R` has a convenient method to compute $y - \hat{y}$ from an `lm` object directly - the method `resid`. Let's add another column to `wage_prediction`: ```{r} wage_prediction = cbind(wage_prediction, residual = resid(log_wages)) head(wage_prediction[, c("lwage","educ","exper","tenure","prediction","residual")]) ``` Using the data in `wage_prediction`, you should now check for yourself what we already know about $\hat{y}$ and $e$ from section \@ref(pred-resids): 1. What is the average of the vector `residual`? 1. What is the average of `prediction`? 1. How does this compare to the average of the outcome `lwage`? 1. What is the correlation between `prediction` and `residual`? 
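As a complement to the questions above, note that `predict` also accepts new observations through its `newdata` argument, which reproduces the plug-in computation without spelling out the matrix product. The following minimal sketch uses the `mtcars` model from the start of this chapter rather than the wage model; the object names `m_sketch` and `new_car`, and the values `wt = 3` and `hp = 110`, are illustrative assumptions.

```{r}
m_sketch <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 110)              # hypothetical car: 3000 lbs, 110 hp
by_hand <- drop(coef(m_sketch) %*% c(1, 3, 110))     # 1 for the intercept
by_predict <- predict(m_sketch, newdata = new_car)   # same computation via predict
c(by_hand = by_hand, by_predict = unname(by_predict))
```

The column names in `newdata` must match the regressor names used in the model formula.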
================================================
FILE: 05-Categorial-Vars.Rmd
================================================
# Categorical Variables {#categorical-vars}

Up until now, we have encountered only examples with *continuous* variables $x$ and $y$, that is, $x,y \in \mathbb{R}$, so that a typical observation could have been $(y_i,x_i) = (1.5,5.62)$. There are many situations where it makes sense to think about the data in terms of *categories*, rather than continuous numbers. For example, whether an observation $i$ is *male* or *female*, whether a pixel on a screen is *black* or *white*, and whether a good was produced in *France*, *Germany*, *Italy*, *China* or *Spain* are all categorical classifications of data.

Probably the simplest type of categorical variable is the *binary*, *boolean*, or just *dummy* variable. As the name suggests, it can take on only two values, `0` and `1`, or `TRUE` and `FALSE`.

## The Binary Regressor Case

Even though this is an extremely parsimonious way of encoding data, it is a very powerful tool that allows us to represent that a certain observation $i$ **is a member** of a certain category $j$. Let's imagine we have income data on males and females, and we create a variable called `is.male` that is `TRUE` whenever $i$ is male and `FALSE` otherwise, and similarly for women. To encode whether subject $i$ is male, one could do this:

\begin{align*}
\text{is.male}_i &= \begin{cases}
                    1 & \text{if }i\text{ is male} \\
                    0 & \text{if }i\text{ is not male},
                    \end{cases}
\end{align*}

and similarly for females, we'd have

\begin{align*}
\text{is.female}_i &= \begin{cases}
                    1 & \text{if }i\text{ is female} \\
                    0 & \text{if }i\text{ is not female}.
                    \end{cases}
\end{align*}

By definition, we have just introduced a linear dependence into our dataset. It will always be true that $\text{is.male}_i + \text{is.female}_i = 1$.
This is because dummy variables are based on data being mutually exclusively categorized - here, you are either male or female.^[There are [transgender](https://en.wikipedia.org/wiki/Transgender) individuals where this example will not apply.] This should immediately remind you of section \@ref(multicol) where we introduced *multicolinearity*. A regression of income on both of our variables like this $$ y_i = b_0 + b_1 \text{is.female}_i + b_2 \text{is.male}_i + e_i $$ would be invalid because of perfect colinearity between $\text{is.female}_i$ and $\text{is.male}_i$. The solution to this is pragmatic and simple: ```{block, type="tip"} In dummy variable regressions, we remove one category from the regression (for example here: `is.male`) and call it the *reference category*. The effect of being *male* is absorbed in the intercept. The coefficient on the remaining categories measures the *difference* in mean outcome with respect to the reference category. ```
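A minimal sketch on simulated data shows what happens in practice if both dummies are included anyway. The variable names and the coefficients `2` and `-3` are arbitrary assumptions made for illustration; `lm` does not stop with an error but reports `NA` for the redundant dummy.

```{r}
set.seed(1)
is.female <- rep(c(0, 1), each = 25)
is.male   <- 1 - is.female                     # perfectly colinear with is.female and the intercept
income    <- 2 - 3 * is.female + rnorm(50)     # simulated outcome
trap <- coef(lm(income ~ is.female + is.male)) # both dummies: is.male cannot be estimated
trap
coef(lm(income ~ is.female))                   # dropping one category removes the problem
```
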
Now let's try this out. We start by creating the female indicator as above, $$ \text{is.female}_i = \begin{cases} 1 & \text{if }i\text{ is female} \\ 0 & \text{if }i\text{ is not female}. \\ \end{cases} $$ and let's suppose that $y_i$ is a measure of $i$'s annual labor income. Our model is \begin{equation} y_i = b_0 + b_1 \text{is.female}_i + e_i (\#eq:dummy-reg) \end{equation} and here is how we estimate this in `R`: ```{r, echo=FALSE} set.seed(19) n = 50 b0 = 2 b1 = -3 x = sample(x = c(0, 1), size = n, replace = T) y = b0 + b1 * x + rnorm(n) dta = data.frame(x,y) zero_one = lm(y~x,dta) ``` ```{r, dummy-reg} # x = sample(x = c(0, 1), size = n, replace = T) dta$is.female = factor(x) # convert x to factor dummy_reg = lm(y~is.female,dta) summary(dummy_reg) ``` Notice that `R` displays the *level* of the factor to which coefficient $b_1$ belongs here, i.e. `is.female1` means this coefficient is on level `is.female = 1` - the reference level is `is.female = 0`, and it has no separate coefficient. Also interesting is that $b_1$ is equal to the difference in conditional means between male and female $$b_1 = E[y|\text{is.female}=1] - E[y|\text{is.female}=0]=`r round(mean(dta[dta$x == 1, "y"]) - mean(dta[dta$x == 0, "y"]),4)`.$$ ```{block,type="note"} A dummy variable measures the difference or the *offset* in the mean of the response variable, $E[y]$, **conditional** on $x$ belonging to some category - relative to a baseline category. In our artificial example, the coefficient $b_1$ informs us that women earn on average 3.756 units less than men. ```
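The equality between the dummy coefficient and the difference in group means is easy to verify directly. A minimal sketch on simulated data, where the group sizes and the true values `1` and `2` are arbitrary assumptions:

```{r}
set.seed(2)
d  <- rep(c(0, 1), times = c(35, 25))   # simulated 0/1 dummy with unequal group sizes
yy <- 1 + 2 * d + rnorm(60)             # simulated outcome
b  <- coef(lm(yy ~ d))
mean0 <- mean(yy[d == 0])               # mean in the reference group
mean1 <- mean(yy[d == 1])               # mean in the other group
rbind(regression = b, group_means = c(mean0, mean1 - mean0))
```

The intercept reproduces the mean of the reference group, and the slope reproduces the difference between the two group means.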
It is instructive to reconsider this example graphically: ```{r x-zero-one,fig.align='center',fig.cap='regressing $y \\in \\mathbb{R}$ on $\\text{is.female}_i \\in \\{0,1\\}$. The blue line is $E[y]$, the red arrow is the size of $b_1$. Which is the same as the slope of the regression line in this case and the difference in conditional means!',echo=FALSE} a <- coef(zero_one)[1] b <- coef(zero_one)[2] # plot expr <- function(x) a + b*x errors <- (a + b*x) - y plot(x, y, type = "p", pch = 21, col = "blue", bg = "royalblue", asp=.25, xlim = c(-.1, 1.1), ylim = c(min(y)-.1, max(y)+.1), frame.plot = T, cex = 1.2) points(0, mean(dta[dta$x == 0, "y"]), col = 'orange', cex = 3, pch = 15) text(0.05, mean(dta[dta$x == 0, "y"]), "E[Y | is.female = 0]", pos = 4) points(1, mean(dta[dta$x == 1, "y"]), col = 'orange', cex = 3, pch = 15) text(1.05, mean(dta[dta$x == 1, "y"]), "E[Y | is.female = 1]", pos = 4) curve(expr = expr, from = min(x)-10, to = max(x)+10, add = TRUE, col = "black") segments(x0 = x, y0 = y, x1 = x, y1 = (y + errors), col = "green") arrows(x0 =-1, y0 = mean(dta[dta$x == 0, "y"]), x1 = -1, y1 = mean(dta[dta$x == 1, "y"]),col="red",lw=3,code=3,length=0.1) # dashes segments(x0=-1,y0 = mean(dta[dta$x == 0, "y"]),x1=0,y1 = mean(dta[dta$x == 0, "y"]),col="red",lty="dashed") segments(x0=-1,y0 = mean(dta[dta$x == 1, "y"]),x1=1,y1 = mean(dta[dta$x == 1, "y"]),col="red",lty="dashed") text(-1, mean(y)+1, paste("b1=",round(b,2)), pos = 4,col="red") abline(a=mean(dta$y),b=0,col="blue",lw=2) ``` In figure \@ref(fig:x-zero-one) we see that this regression simplifies to the straight line connecting the mean, or the *expected value* of $y$ when $\text{is.female}_i = 0$, i.e. $E[y|\text{is.female}_i=0]$, to the mean when $\text{is.female}_i=1$, i.e. $E[y|\text{is.female}_i=1]$. It is useful to remember that the *unconditional mean* of $y$, i.e. $E[y]$, is going to be the result of regressing $y$ only on an intercept, illustrated by the blue line. 
This line will always lie in between both conditional means. As indicated by the red arrow, the estimate of the coefficient on the dummy, $b_1$, is equal to the difference in conditional means for both groups. You should look at our app now to deepen your understanding of what's going on here: ```{r,eval=FALSE} library(ScPoApps) launchApp("reg_dummy") ``` ## Dummy and Continuous Variables What happens if there are more predictors than just the dummy variable in a regression? For example, what if instead we had \begin{equation} y_i = b_0 + b_1 \text{is.female}_i + b_2 \text{exper}_i + e_i (\#eq:dummy-reg2) \end{equation} where $\text{exper}_i$ would measure years of experience in the labor market? As above, the dummy variable acts as an intercept shifter. We have \begin{equation} y_i = \begin{cases} b_0 + b_1 + b_2 \times \text{exper}_i + e_i & \text{if is.female=1} \\ b_0 + \hphantom{b_1} +b_2 \times \text{exper}_i + e_i & \text{if is.female=0} \end{cases} \end{equation} so that the intercept is $b_0 + b_1$ for women but $b_0$ for men. We will see this in the real-world example below, but for now let's see the effect of switching the dummy *on* and *off* in this app: ```{r,eval=FALSE} library(ScPoApps) launchApp("reg_dummy_example") ``` ## Categorical Variables in `R`: `factor` `R` has extensive support for categorical variables built-in. The relevant data type representing a categorical variable is called `factor`. We encountered them as basic data types in section \@ref(data-types) already, but it is worth repeating this here. We have seen that a factor *categorizes* a usually small number of numeric values by *labels*, as in this example which is similar to what I used to create regressor `is.female` for the above regression: ```{r factors} is.female = factor(x = c(0,1,1,0), labels = c(FALSE,TRUE)) is.female ``` You can see the result is a vector object of type `factor` with 4 entries, whereby `0` is represented as `FALSE` and `1` as `TRUE`. 
Another example could be if we wanted to record a variable *sex* instead, and we could do

```{r}
sex = factor(x = c(0,1,1,0), labels = c("male","female"))
sex
```

You can see that this is almost identical, just the *labels* are different.

### More Levels

We can go beyond *binary* categorical variables such as `TRUE` vs `FALSE`. For example, suppose that $x$ measures educational attainment, i.e. it is now something like $x_i \in \{\text{high school,some college,BA,MSc}\}$. In `R` parlance, *high school, some college, BA, MSc* are the **levels of factor $x$**. A straightforward extension of the above would dictate to create one dummy variable for each category (or level), like

\begin{align*}
\text{has.HS}_i &= \mathbf{1}[x_i==\text{high school}] \\
\text{has.someCol}_i &= \mathbf{1}[x_i==\text{some college}] \\
\text{has.BA}_i &= \mathbf{1}[x_i==\text{BA}] \\
\text{has.MSc}_i &= \mathbf{1}[x_i==\text{MSc}]
\end{align*}

but you can see that this is cumbersome. There is a better solution available:

```{r}
factor(x = c(1,1,2,4,3,4),labels = c("HS","someCol","BA","MSc"))
```

Notice here that `R` applies the labels to the sorted unique values of `x`, in the order in which you supplied them (i.e. a numerical value `4` will correspond to "MSc", no matter the ordering in `x`).

### Log Wages and Dummies {#factors}

The `factor` terminology developed above fits neatly into `R`'s linear model fitting framework. Let us illustrate the simplest use by way of example.
Going back to our wage example, let's say that a worker's wage depends on their education as well as their sex:

\begin{equation}
\ln w_i = b_0 + b_1 educ_i + b_2 female_i + e_i (\#eq:wage-sex)
\end{equation}

```{r,results = "asis"}
data("wage1", package = "wooldridge")
wage1$female = as.factor(wage1$female) # convert 0-1 to factor
lm_w = lm(lwage ~ educ, data = wage1)
lm_w_sex = lm(lwage ~ educ + female, data = wage1)
stargazer::stargazer(lm_w,lm_w_sex,type = if (knitr:::is_latex_output()) "latex" else "html")
```

We know the results from column (1) very well by now. How does the relationship change if we include the `female` indicator? Remember from above that `female` is a `factor` with two levels, *0* and *1*, where *1* means *that's a female*. We see in the above output that `R` included a regressor called `female1`. This is a combination of the variable name `female` and the level which was included in the regression. In other words, `R` chooses a *reference category* (by default the first level, where levels are sorted in increasing order), which is excluded - here this is `female==0`. The interpretation is that $b_2$ measures the effect of being female *relative* to being male. `R` automatically creates a dummy variable for each potential level, excluding the first category.

```{r wage-plot,fig.align='center',echo=FALSE,fig.cap='log wage vs educ. Right panel with female dummy.',message=FALSE,warning=FALSE,fig.height=3}
library(ggplot2)
p1 = ggplot(mapping = aes(y=lwage,x=educ), data=wage1) + geom_point(shape=1,alpha=0.6) + geom_smooth(method="lm",col="blue",se=FALSE) + theme_bw()
p_sex = cbind(wage1,pred=predict(lm_w_sex))
# p_sex = dplyr::sample_n(p_sex,2500)
p2 <- ggplot(data=p_sex,mapping=aes(x=educ,y=lwage,color=female))
p2 <- p2 + geom_jitter(shape=1,alpha=0.6,width=0.1) + geom_line(mapping = aes(y=pred), size=1) + theme_bw() + scale_y_continuous(name = NULL)
cowplot::plot_grid(p1,p2, rel_widths = c(1,1.2))
```

Figure \@ref(fig:wage-plot) illustrates this.
The left panel is our previous model. The right panel adds the `female` dummy. You can see that the regression lines for males and females have the same upward slope. But you can also see that there is a parallel downward shift from the male to the female line. The estimate of $b_2 = `r round(coef(lm_w_sex)[3],2)`$ is the size of the downward shift.

## Interactions

Sometimes it is useful to let the slope of a certain variable depend on the value of *another* regressor. For example consider a model for the sales prices of houses, where `area` is the livable surface of the property, and `age` is its age:

\begin{equation}
\log(price) = b_0 + b_1 \text{area} + b_2 \text{age} + b_3 (\text{area} \times \text{age}) + e (\#eq:price-interact)
\end{equation}

In that model, the partial effect of `area` on `log(price)`, keeping all other variables fixed, is

\begin{equation}
\frac{\partial \log(price)}{\partial \text{area}} = b_1 + b_3 (\text{age})
\end{equation}

If we find that $b_3 > 0$ in a regression, we conclude that the size of a house is valued more in older houses. We call $b_3$ the **interaction effect** between area and age. Let's look at that regression model now.

```{r}
data(hprice3, package = "wooldridge")
summary(lm(lprice ~ area*age, data = hprice3))
```

In this instance, we see that indeed there is a small positive interaction between `area` and `age` on the sales price: even though `age` in isolation decreases the sales value, bigger houses command a small premium if they are older.

### Interactions with Dummies: Differential Slopes

It is straightforward to extend the interactions logic to allow not only for different *intercepts*, but also different *slopes* for each subgroup in a dataset. Let's go back to our dataset of wages from section \@ref(factors) above.
Now that we know how to create an interaction between two variables, we can easily modify equation \@ref(eq:wage-sex) like this:

\begin{equation}
\ln w = b_0 + b_1 \text{female} + b_2 \text{educ} + b_3 (\text{female} \times \text{educ}) + e (\#eq:wage-sex2)
\end{equation}

The only peculiarity here is that `female` is a factor with levels `0` and `1`: i.e. the interaction term $\text{female} \times \text{educ}$ will be zero for all men. Similarly to above, we can test whether there are indeed different returns to education for men and women by looking at the estimated value $b_3$:

```{r,echo = TRUE}
lm_w_interact <- lm(lwage ~ educ * female , data = wage1) # R expands to full interactions model
summary(lm_w_interact)
```

We will learn in the next chapter that the estimate for $b_3$ on the interaction `educ:female1` is difficult to distinguish from zero in a statistical sense; hence for now we conclude that there are *no* significantly different returns to education for men and women in this data. This is easy to verify visually in this plot, where we are unable to detect a difference in slopes in the right panel.

```{r wage-plot2,fig.align='center',echo=FALSE,fig.cap='log wage vs educ. 
Right panel allows slopes to be different - turns out they are not!',message=FALSE,warning=FALSE,fig.height=3}
p_sex = cbind(wage1,pred=predict(lm_w_sex))
p_sex = cbind(p_sex,pred_inter=predict(lm_w_interact))
# p_sex = dplyr::sample_n(p_sex,2500)
p2 <- ggplot(data=p_sex,mapping=aes(x=educ,y=lwage,color=female))
p2 <- p2 + geom_jitter(shape=1,alpha=0.6,width=0.1) + geom_line(mapping = aes(y=pred), size=1) + theme_bw() + scale_y_continuous(name = "log wage") + ggtitle("Impose Parallel Slopes") + theme(legend.position = "none")
p3 <- ggplot(data=p_sex,mapping=aes(x=educ,y=lwage,color=female))
p3 <- p3 + geom_jitter(shape=1,alpha=0.6,width=0.1) + geom_line(mapping = aes(y=pred_inter), size=1) + theme_bw() + scale_y_continuous(name = NULL) + ggtitle("Allow Different Slopes") + theme(legend.position = "none")
cowplot::plot_grid(p2,p3)
```

## (Unobserved) Individual Heterogeneity

Finally, dummy variables are sometimes very important to account for spurious relationships in the data. Consider the following (artificial) example:

1. Suppose we collected data on hourly wages together with the number of hours worked for a set of individuals.
1. We want to investigate the labour supply behaviour of those individuals, hence we run the regression `hours_worked ~ wage`.
1. We expect to get a positive coefficient on `wage`: the higher the wage, the more hours worked.
1. You know that individuals are members of either group `g=1` or `g=2`.

```{r, echo = FALSE}
two_clouds <- function(a1 = 5, a2 = -2, b1 = 2.5, b2 = 1.5, n1 = 50, n2 = 50){
  set.seed(12)
  x1 = rnorm(n1,mean = 1,sd = 0.5)
  x2 = rnorm(n2,mean = 1.3,sd = 0.5)
  y1 = a1 + b1 * x1 + rnorm(n1,sd = 2)
  y2 = a2 + b2 * x2 + rnorm(n2,sd = 2)
  x = c(x1,x2)
  y = c(y1,y2)
  g = factor(c(rep(1,n1),rep(2,n2)))  # group labels: n1 ones followed by n2 twos
  z = data.frame(x,y,g)
  m1 = lm(y~x,data = z)
  m2 = update(m1, . ~ .
    + g)
  p1 = ggplot(z, aes(x,y)) + geom_point() + geom_smooth(method = "lm",se =FALSE) + scale_x_continuous(name = "wage") + scale_y_continuous(name = "hours") + theme_bw()
  p2 = ggplot(z, aes(x,y,color = g)) + geom_point() + geom_smooth(method = "lm",se = FALSE) + scale_x_continuous(name = "wage") + scale_y_continuous(name = "hours") + theme_bw() + ggtitle("Controlling for Group")
  # par(mfcol = c(1,2))
  # plot(z$x,z$y)
  # abline(m1)
  list(m1=m1,m2=m2,p1 = p1, p = cowplot::plot_grid(p1,p2,rel_widths = c(1,1.2)))
}
tc = two_clouds()
tc$p1
```

Here we observe a slightly negative relationship: higher wages are associated with fewer hours worked? Maybe. But what is this, there is a group identifier in this data! Let's use this and include `g` as a dummy in the regression - suppose `g` encodes male and female.

```{r,echo = FALSE,fig.align='center',fig.cap='Left and right panel exhibit the same data. The right panel controls for group composition.'}
tc$p
```

This is an artificial example; yet it shows that you can be severely misled if you don't account for group-specific effects in your data. The problem is particularly acute if we *don't know group membership* - we can then resort to advanced methods that are beyond the scope of this course to *estimate* which group each individual belongs to. If we *do know* group membership, however, it is good practice to include a group dummy so as to control for group effects.

================================================
FILE: 06-StdErrors.Rmd
================================================
# Regression Inference {#std-errors}

In this chapter we want to investigate uncertainty in regression estimates. We want to understand what precisely the `Std. Error` column in a typical regression table is telling us.
In terms of a picture, we want to understand better the meaning of the shaded area in a plot like this one:

```{r confint,fig.align="center",message=FALSE,warning=FALSE,echo=FALSE,fig.cap="Confidence bands around a regression line."}
library(ggplot2)
data("wage1", package = "wooldridge")
p <- ggplot(mapping = aes(x = educ, y = lwage), data = subset(wage1,educ > 5))  # base plot
p <- p + geom_point()  # add points
p <- p + geom_smooth(method = "lm", size=1, color="red")  # add regression line
p <- p + scale_y_continuous(name = "log hourly wage") + scale_x_continuous(name = "years of education")
p + theme_bw() + ggtitle("Log Wages vs Education")
```

In order to fully understand this, we need to go back and make sure we have a good grasp of *sampling*. Let's do this first.

## Sampling

In class we were confronted with a jar of Tricolore Fusilli pasta as pictured in figure \@ref(fig:pasta1).^[This part is largely based on [moderndive](https://moderndive.com/7-sampling.html), to which I am giving full credit hereby. Thanks for this great idea.] We asked ourselves a question which, secretly, many of you had asked yourselves at one point in your lives, namely:

```{block type = "tip"}
What is the proportion of **green** Fusilli in a pack of Tricolore Fusilli?
```
Well, it's time to find out.

```{r pasta1, fig.cap="A glass jar filled with Fusilli pasta in three different colors.",echo = FALSE,fig.width = 8, out.width = "90%"}
knitr::include_graphics("images/pasta1.JPG")
```

Let's call the Fusilli in this jar our *study population*, i.e. the set of units about which we want to learn something. There are several approaches to address the question of how big a proportion of the population the green Fusilli make up. One obvious solution is to enumerate all Fusilli according to their color, and compute their proportion in the entire population. It works perfectly well as a solution, but is a long and arduous process, see figures \@ref(fig:pasta2) and \@ref(fig:pasta3).

```{r pasta2, fig.cap="Manually separating Fusilli by their color is very costly in terms of time and effort.",echo = FALSE,fig.width = 8, out.width = "90%"}
knitr::include_graphics("images/pasta2.JPG")
```

Additionally, you may draw worried looks from the people around you while you are doing it. Maybe this is not the right way to approach this task?^[Regardless of the worried onlookers, I did what I had to do and I carried on to count the green pile. I know exactly how many greens are in there now! I then computed the weight of 20 Fusilli (5g), and backed out the number of Fusilli in the other piles. I will declare those numbers as the *true numbers*. (Sceptics are free to recount.)]

```{r pasta3, fig.cap="Heaps of Fusilli pasta ready to be counted.",echo = FALSE, out.width = "90%"}
knitr::include_graphics("images/pasta3.JPG")
```

### Taking One Sample From the Population

We started by randomly grabbing a handful of Fusilli from the jar and by letting exactly $N=20$ drop into a paper coffee cup, pictured in figure \@ref(fig:pasta5). We call $N$ the *sample size*.
The count and corresponding proportions of each color in this first sample are shown in the following table:

Color | Count | Proportion
:------:|:------:|:--------:
Red | 7 | 0.35
Green | 5 | 0.25
White | 8 | 0.4

So far, so good. We have our first *estimate of the proportion of green Fusilli in the overall population*: 0.25. Notice that taking a sample of $N=20$ was *much* quicker and *much less painful* than performing the full count (i.e. the *census*) of Fusilli performed above.

```{r pasta5, fig.cap="Taking one sample of 20 Fusilli from the jar.",echo = FALSE, out.width = "90%"}
knitr::include_graphics("images/pasta5.JPG")
```

Then, we put the sample back into the jar, and we reshuffled the Fusilli. Had we taken *another* sample, again of $N=20$, would we again have gotten 7 Red, 5 Green, and 8 White, just as in the first sample? Maybe, but maybe not. Suppose we had carried on drawing samples of 20 several times and counting the colors: Would we always have observed exactly 5 green Fusilli? Almost certainly not. We would have noted some degree of *variability* in the proportions computed from our samples. The *sample proportions* in this case are an example of a *sample statistic*.

```{block type = "note"}
**Sampling Variation** refers to the fact that if we *randomly* take samples from a wider population, the *random* composition of each sample will imply that we obtain statistics that vary - they take on potentially different values in each sample.
```

Let's see how this story evolved as we started taking more samples at a time.

## Taking Eleven Samples From The Population

We formed teams of two students in class who would each in turn take samples from the jar (the population) of size $N=20$, as before. Each team computed the proportion of green Fusilli they had in their sample, and we wrote this data down in a table on the board. Then, we drew a histogram which showed how many samples had fallen into which bins.
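
The same procedure can be mimicked in `R`. The following is only a minimal sketch: the composition of the jar (250 green, 350 red and 400 white Fusilli) is a made-up assumption and *not* the true count mentioned above, and the helper `sample_prop_green` is a hypothetical name introduced for this illustration.

```{r}
set.seed(1)

# Hypothetical jar: these counts are assumptions, not the true numbers
jar <- rep(c("green", "red", "white"), times = c(250, 350, 400))

# Draw one sample of size N without replacement and
# return the proportion of green Fusilli in it
sample_prop_green <- function(N = 20) {
  mean(sample(jar, size = N) == "green")
}

# Repeat many times: each call starts again from the full jar,
# i.e. the previous sample has been put back
props <- replicate(1000, sample_prop_green())

range(props)  # smallest and largest sample proportion
sd(props)     # a measure of the sampling variation
```

Each call to `sample_prop_green()` plays the role of one team drawing 20 Fusilli; the vector `props` collects the sample proportions, which differ from draw to draw.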
```{r pasta6, fig.cap="Taking eleven samples of 20 Fusilli each from the jar, and plotting the histogram of obtained sample proportions of Green Fusilli.",echo = FALSE, out.width = "90%"}
knitr::include_graphics("images/pasta6.JPG")
```

We looked at the histogram in figure \@ref(fig:pasta6) and we noted several things:

1. The largest proportion was 0.3 green.
1. The smallest proportion was 0.15 green.
1. The most frequent proportion was 0.25 green Fusilli.
1. We did think that this looked *suspiciously* like a **normal distribution**.

We collected the sample data into a data.frame:

```{r sample-data}
pasta_samples <- data.frame(group = 1:11, replicate = 1:11, prop_green = c(0.3,0.25,0.25,0.3,0.15,0.3,0.25,0.25,0.2,0.25,0.2))
pasta_samples
```

This produces an associated histogram which looks very much like the one we drew onto the board:

```{r pasta-hist,echo = FALSE}
hist(pasta_samples$prop_green,breaks = c(0.125,0.175,0.225,0.275,0.325),main = "Histogram of 11 Pasta Samples", xlab = "Proportion of Green Fusilli")
```

### Recap

Let's recapitulate what we just did. We wanted to know what proportion of Fusilli in the glass jar in figure \@ref(fig:pasta1) are green. We acknowledged that an exhaustive count, or a census, is a costly and cumbersome exercise, which in most circumstances we will try to avoid. In order to make some progress nonetheless, we took a *random sample* from the full population in the jar: we randomly selected 20 Fusilli, and looked at the proportion of green ones in there. We found a proportion of 0.25.

After putting the Fusilli from the first sample back in the jar, we asked ourselves if, upon drawing a *new* sample of 20 Fusilli, we should expect to see the same outcome - and we concluded: maybe, but maybe not. In short, we discovered some random variation from sample to sample. We called this **sampling variation**.

The purpose of this little activity was three-fold:

1.
To understand that random samples differ and that there is sampling variation.
1. To understand that bigger samples will yield smaller sampling variation.
1. To illustrate that the sampling distribution of many statistics (not only the sample proportion as in our case, but also, for example, the sample mean) computed from a random sample converges to a normal distribution as the sample size increases.

```{block type = "note"}
The value of this exercise consisted in making **you** perform the sampling activity yourself. We will now hand over to the brilliant **moderndive** package, which will further develop this chapter.
```

## Handover to `Moderndive`

```{r handover,out.width="90%", fig.cap="The Moderndive package used red and white balls instead of fusilli pasta.",echo = FALSE}
knitr::include_graphics("images/transition.png")
```

The sampling activity in `moderndive` was performed with red and white balls instead of green Fusilli pasta. The rest is identical. We will now read sections [7.2](https://moderndive.com/7-sampling.html#sampling-simulation) and [7.3](https://moderndive.com/7-sampling.html#sampling-framework) in their book, as well as [chapter 8 on confidence intervals and bootstrapping](https://moderndive.com/8-confidence-intervals.html), and [chapter 9 on hypothesis testing](https://moderndive.com/9-hypothesis-testing.html).

## Uncertainty in Regression Estimates

In the previous chapters we have seen how the OLS method can produce estimates of intercept and slope coefficients from data. You have seen this method at work in `R` by using the `lm` function as well. It is now time to introduce the notion that, given that $b_0$, $b_1$ and $b_2$ are *estimates* of some unknown *population parameters*, there is some degree of **uncertainty** about their values. Another way to say this is that we want some indication about the *precision* of those estimates. The underlying issue is that the data we have at hand are usually *samples* from a larger population.

```{block,type="note"}
How *confident* should we be about the estimated values $b$?
```
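
One way to get a feeling for this question is to pretend that a data set we already know is the entire population, and to repeatedly draw random samples from it. The sketch below does this with the built-in `mtcars` data; treating its 32 cars as a "population", the sample size of 15, and the helper name `slope_from_sample` are all assumptions made purely for illustration.

```{r}
set.seed(2)

# Estimate the slope of mpg on wt from a random sample of n cars
slope_from_sample <- function(n = 15) {
  idx <- sample(nrow(mtcars), size = n)            # pick n rows at random
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))[["wt"]] # slope estimate b_1
}

# 200 different samples give 200 different slope estimates
slopes <- replicate(200, slope_from_sample())
summary(slopes)
```

Every sample delivers a somewhat different slope estimate. How widely those estimates are spread is exactly what we want to quantify in this chapter.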
## What is *true*? What are Statistical Models?

A **statistical model** is simply a set of assumptions about how some data have been generated. As such, it models the data-generating process (DGP), as we have it in mind. Once we define a DGP, we could simulate data from it and see how this compares to the data we observe in the real world. Or, we could change the parameters of the DGP so as to understand how the real world data *would* change, could we (or some policy) change the corresponding parameters in reality. Let us now consider one particular statistical model, which in fact we have seen so many times already.

## The Classical Regression Model (CRM) {#class-reg}

Let's bring back our simple model \@ref(eq:abline) to explain this concept.

\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i (\#eq:abline-5)
\end{equation}

The smallest set of assumptions used to define the *classical regression model* as in \@ref(eq:abline-5) is the following:

1. The data are **not linearly dependent**: Each variable provides new information for the outcome, and it cannot be replicated as a linear combination of other variables. We have seen this in section \@ref(multicol). In the particular case of one regressor, as here, we require that $x$ exhibit some variation in the data, i.e. $Var(x)\neq 0$.
1. The mean of the errors conditional on $x$ should be zero, $E[\varepsilon|x] = 0$. Notice that this also means that $Cov(\varepsilon,x) = 0$, i.e. that the errors and our explanatory variable(s) should be *uncorrelated*. It is said that $x$ should be **strictly exogenous** to the model.

These assumptions are necessary to successfully (and correctly!) run an OLS regression. They are often supplemented with an additional set of assumptions, which help with certain aspects of the exposition, but are not strictly necessary:

3.
The data are drawn from a **random sample** of size $n$: observation $(x_i,y_i)$ comes from the exact same distribution, and is independent of observation $(x_j,y_j)$, for all $i\neq j$.
4. The variance of the error term $\varepsilon$ is the same for each value of $x$: $Var(\varepsilon|x) = \sigma^2$. This property is called **homoskedasticity**.
5. The error is normally distributed, i.e. $\varepsilon \sim \mathcal{N}(0,\sigma^2)$.

Invoking assumption 5. in particular defines what is commonly called the *normal* linear regression model.

### $b$ is not $\beta$!

Let's talk about the small but important modifications we applied to model \@ref(eq:abline) to end up at \@ref(eq:abline-5) above:

* $\beta_0$ and $\beta_1$ are the intercept and slope parameters.
* $\varepsilon$ is the error term.

First, we *assumed* that \@ref(eq:abline-5) is the correct representation of the DGP. With that assumption in place, the values $\beta_0$ and $\beta_1$ are the *true parameter values* which generated the data. Notice that $\beta_0$ and $\beta_1$ are potentially different from $b_0$ and $b_1$ in \@ref(eq:abline) for a given sample of data - they could in practice be very close to each other, but $b_0$ and $b_1$ are *estimates* of $\beta_0$ and $\beta_1$. And, crucially, those estimates are generated from a sample of data. Now, the fact that our data $\{y_i,x_i\}_{i=1}^n$ are a sample from a larger population means that there will be *sampling variation* in our estimates - exactly like in the case of the sample mean estimating the population average as mentioned above. One particular sample of data will generate one particular set of estimates $b_0$ and $b_1$, whereas another sample of data will generate estimates which will in general be different - by *how much* those estimates differ across samples is the question in this chapter.
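
The distinction between $\beta$ and $b$ can be made concrete with a small simulation, in which we play the role of nature and *choose* the true parameters. This is only a sketch: the values $\beta_0 = 1$, $\beta_1 = 2$, the standard normal distributions for $x$ and $\varepsilon$, and the helper name `simulate_b1` are assumptions made for illustration.

```{r}
set.seed(3)

beta0 <- 1  # "true" intercept, chosen for illustration
beta1 <- 2  # "true" slope, chosen for illustration

# Draw one sample of size n from the DGP and return the OLS slope b_1
simulate_b1 <- function(n) {
  x <- rnorm(n)
  y <- beta0 + beta1 * x + rnorm(n)  # epsilon ~ N(0, 1)
  coef(lm(y ~ x))[["x"]]
}

# 500 samples each, for a small and a large sample size
b1_small <- replicate(500, simulate_b1(n = 20))
b1_large <- replicate(500, simulate_b1(n = 500))

# spread of the estimates b_1 around the true beta1
c(sd_small = sd(b1_small), sd_large = sd(b1_large))
```

No single estimate equals `beta1` exactly, but the estimates scatter around it, and they do so more tightly for the larger sample size.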
In general, the more observations we have, the greater the precision of our estimates; hence, the closer the estimates from different samples will lie together.

### Violating the Assumptions of the CRM {#violating}

It's interesting to consider in which circumstances we might violate those assumptions. Let's give an example for each of them:

1. No Perfect Collinearity. We have seen that perfect collinearity makes it impossible to compute the OLS coefficients. Remember the example about adding `wtplus = wt + 1` to the `mtcars` dataset? Here it is:

    ```{r,warning = FALSE, message = FALSE}
    library(dplyr)
    mtcars %>% mutate(wtplus = wt + 1) %>% lm(mpg ~ wt + wtplus, data = .)
    ```

    That the coefficient on `wtplus` is `NA` is the result of the direct linear dependence. (Notice that creating `wtplus2 = (wt + 1)^2` would work, since that is not linear!)
1. Conditional Mean of errors is zero, $E[\varepsilon|x] = 0$. Going back to our running example in figure \@ref(fig:confint) about wages and education: Suppose that each individual $i$ in our data has something like *innate ability*, something we might wish to measure with an IQ-test, however imperfectly. Let's call it $a_i$. It seems reasonable to think that high $a_i$ will go together with high wages. At the same time, people with high $a_i$ will find studying for exams and school work much less burdensome than others, hence they might select into obtaining more years of schooling. The problem? Well, there is no $a_i$ in our regression equation - most of the time we don't have a good measure of it to start with. So it's an *unobserved variable*, and as such, it is part of the error term $\varepsilon$ in our model. We will attribute to `educ` part of the effect on wages that is actually *caused* by ability $a_i$! Sometimes we may be able to reason about whether our estimate on `educ` is too high or too low, but we will never know its true value.
We don't get the *ceteris paribus* effect (the true partial derivative of `lwage` with respect to `educ`). Technically, the assumption $E[\varepsilon|x] = 0$ implies that $Cov(\varepsilon,x) = 0$, so that's the part that is violated.
1. Data from Random Sample. One common concern here is that the observations in the data could have been *selected* in a particular fashion, which would make the data less representative of the underlying population. Suppose we had ended up with individuals only from the richest neighborhood of town; our interpretation of the impact of education on wages might not be valid for other areas.
1. Homoskedasticity. For correct inference (below!), we want to know whether the variance of $\varepsilon$ varies with our explanatory variable $x$, or not. Here is a typical example where it does:

    ```{r,echo = FALSE}
    data("engel",package = "quantreg")
    plot(foodexp ~ log(income) ,data = engel,main = "Food Expenditure vs Log(income)")
    ```

    As income increases, not all people increase their food consumption in an equal way. So $Var(\varepsilon|x)$ will vary with the value of $x$, hence it won't be equal to the constant $\sigma^2$.
1. If the distribution of $\varepsilon$ is not normal, it is more cumbersome to derive theoretical results about inference.

## Standard Errors in Theory {#se-theory}

The standard deviation of an OLS estimate is generally called its *standard error*. As such, it is just the square root of the estimate's variance. Under assumptions 1. through 4. above we can define the formula for the variance of our slope coefficient in the context of our single regressor model \@ref(eq:abline-5) as follows:

\begin{equation}
Var(b_1|x_i) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} (\#eq:var-ols)
\end{equation}

In practice, we don't know the theoretical variance of $\varepsilon$, i.e. $\sigma^2$, but we form an estimate of it from our sample of data.
A widely used estimate uses the already encountered SSR (sum of squared residuals), and is denoted $s^2$:

$$
s^2 = \frac{SSR}{n-p} = \frac{\sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2}{n-p} = \frac{\sum_{i=1}^n e_i^2}{n-p}
$$

where $n-p$ are the *degrees of freedom* available in this estimation. $p$ is the number of parameters we wish to estimate (here: 2, the intercept $b_0$ and the slope $b_1$). So, the variance formula would become

\begin{equation}
Var(b_1|x_i) = \frac{SSR}{(n-p)\sum_{i=1}^n (x_i - \bar{x})^2} (\#eq:var-ols2)
\end{equation}

Most of the time we work directly with the *standard error* of a coefficient, hence we define

\begin{equation}
SE(b_1) = \sqrt{Var(b_1|x_i)} = \sqrt{\frac{SSR}{(n-p)\sum_{i=1}^n (x_i - \bar{x})^2}} (\#eq:SE-ols2)
\end{equation}

You can clearly see that, as $n$ increases, the denominator increases, and therefore variance and standard error of the estimate will decrease.

================================================
FILE: 07-Causality.Rmd
================================================
# Causality {#causality}

In this chapter we take on a challenging part of our course. Remember that in the [first set of slides](https://rawcdn.githack.com/ScPoEcon/ScPoEconometrics-Slides/session2_1/chapter1/chapter1.html) we introduced Econometrics as the economist's toolkit to answer questions like *does $x$ **cause** $y$?* Let's illustrate the issues at stake with a question from epidemiology and public health:

```{block type = "warning"}
Does smoking **cause** lung cancer?
```
Just in case you were wondering: Yes it does! However, for a very long time the *causal impact* of smoking on lung cancer was hotly debated, and it's instructive for us to look at this history.^[This chapter is drawn from chapter 5 of *The Book of Why* by [Judea Pearl](http://bayes.cs.ucla.edu/jp_home.html).]

Let's go back to the 1950s. We are at the start of a big increase in deaths from lung cancer. At the same time cigarette consumption was growing very fast. With the benefit of hindsight, we can now draw this graph:

```{r smoking-cancer,echo = FALSE,fig.align = "center",fig.cap="Two time series showing cigarette consumption per capita and incidence of lung cancer in the USA."}
knitr::include_graphics("images/Smoking_lung_cancer.png")
```

However, time series graphs are poor tools to make causal statements. Many *other things* had changed from 1900 to 1950, all of which could equally be responsible for the rise in cancer rates:

1. Tarring of roads
1. Inhalation of motor exhausts (leaded gasoline fumes)
1. Generally greater air pollution.

We call those other factors **confounders** of the relationship between smoking and lung cancer. So, there were a number of sceptics around who at the time were contesting the existing evidence. That evidence consisted in general of the following:

1. **Case-Control studies**: British epidemiologists Richard Doll and Austin Bradford Hill started to compare people already diagnosed with cancer to those without, recording their history and observable characteristics (like age and health behaviours). In one study, out of 649 lung cancer patients interviewed, all but 2 had been smokers! In that study, a cancer patient was 1.5 million times more likely to have been a smoker than a non-smoker. Still, critics said, there are several sources of bias:
    * Hospital patients could be a selected sample of the general (smoking) population.
    * Patients could suffer from *recall bias*, affecting their recollection of facts.
    * So, while comparing cancer patients to non-patients and controlling for several important *confounders* (like age, income and other observable characteristics), there was still scope for bias.
    * Moreover, replicating those studies, as Doll and Hill attempted, would not have solved this issue.
1. Next they attempted what doctors call a **Dose-Response Effect** study. In 1951 they sent out 60,000 questionnaires to British physicians asking about *their* smoking habits. Then they followed them over time:
    * Only 5 years on, heavy smokers had a death rate from lung cancer that was 24 times higher than for nonsmokers.
    * People who had smoked and then stopped reduced their risk by a factor of 2.
    * Still, notorious sceptics like R.A. Fisher were unconvinced. The studies *still* failed to compare **otherwise identical** smokers to non-smokers. There were *still* important unobserved confounders out there which could invalidate the conclusion that we had indeed observed a **causal** relationship.

Let's put some structure on this problem now, so we can make progress.

## Directed Acyclic Graphs (DAG) {#dags}

A DAG is a tool to visualize a causal relationship. It is a graph where nodes are connected via arrows, where an arrow can run in one direction only (hence, *directed* graph). If an arrow starts at node $x$ and ends at node $y$, we say that $x$ causes $y$. Here is a simple example of such a DAG:

```{r dag1,echo = FALSE,warning = FALSE,message = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "A simple DAG showing the causal impact of $x$ on $y$."}
library(ggdag)
theme_set(theme_dag())
d1 = dagify(y ~ x) %>% ggdag()
d1
```

Now consider this setting, where there is a third variable, $z$.
It could be possible that $z$ also has a direct influence on $y$:

```{r dag2,echo = FALSE,warning = FALSE,message = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "A simple DAG with 2 causal paths: Both $x$ and $z$ have a direct impact on $y$."}
dagify(y ~ x, y ~ z) %>% ggdag()
```

Now let's change this and create a path from $z$ to *both* $x$ and $y$ instead. We call $z$ a *confounder* in the relationship between $x$ and $y$: $z$ *confounds* the direct causal impact of $x$ on $y$, by affecting them both at the same time. What is more, there is no arrow from $x$ to $y$ at all, so the only *real* explanatory variable here is in fact $z$. Attributing any explanatory power to $x$ would be wrong in this setting.

```{r dag3,echo = FALSE,warning = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "A simple DAG where $z$ is a confounder. There is no causal path from $x$ to $y$, and any correlation we observe between those variables is completely induced by $z$. We call this spurious correlation."}
ggdag_confounder_triangle()
```

Here is a second, slightly different example where $z$ is a confounder.

```{r dag41,echo=FALSE,fig.cap="$z$ is still a confounder here, but there is a causal link from $x$ to $y$ now. If we observed $z$, we can control for it."}
d4 = dagify(y ~ x, x ~ z, y ~ z) %>% tidy_dagitty(layout = "tree") %>% ggdag()
d4
```

In figure \@ref(fig:dag41) there is an arrow from $x$ to $y$. In this setting, if we are able to *observe* $z$, we can adjust the correlation we observe between $x$ and $y$ for the variation induced by $z$. In practice, this is precisely what multiple regression will do: holding $z$ fixed at some value, what is the partial effect of $x$ on $y$? Notice that $z$ ceases to be a confounder in this situation, and interpreting our regression coefficient on $x$ as *causal* is correct.

## Smoking in a DAG

Let's use this and cast our problem as a DAG now.
What the scientists in the 1950s faced were two competing models of the relationship between smoking and lung cancer:

```{r dag-cig,fig.height = 4,echo = FALSE,fig.cap = "Two competing causal graphs for the relationship between smoking and lung cancer. In the right panel Lung Cancer is directly impacted by a genetic factor, which at the same time also influences smoking. This is a stark representation of Fisher's view. Another version would have an additional arrow from Smoking to Lung Cancer in the right panel."}
# https://cran.r-project.org/web/packages/ggdag/vignettes/bias-structures.html
p1 = dagify(cancer ~ smoking,
            labels = c("cancer" = "Lung Cancer", "smoking" = "Smoking" ),
            exposure = "smoking", outcome = "cancer") %>%
  ggdag(text = FALSE, use_labels = "label") + ggtitle("Doll & Hill") + theme(plot.title = element_text(hjust = 0.5))
p2 = confounder_triangle(x = "Smoking", y = "Lung Cancer", z = "Gene") %>%
  ggdag(text = FALSE, use_labels = "label") + ggtitle("R.A. Fisher") + theme(plot.title = element_text(hjust = 0.5))
p3 <- dagify(cancer ~ smoking, smoking ~ gene, cancer ~ gene,
             outcome = "cancer",
             labels = c("gene" = "Gene", "cancer" = "Lung Cancer", "smoking" = "Smoking")) %>%
  tidy_dagitty(layout = "tree") %>%
  ggdag(text = FALSE, use_labels = "label") + ggtitle("Gene Partial")
cowplot::plot_grid(p1,p2,axis = "tb")
```

Basically, what critics like Fisher were claiming was that the existing studies did not compare like for like. In other words, our *ceteris paribus* assumption was not satisfied. They were worried that *smoking* was not the only relevant difference between a population of smokers and one of non-smokers. In particular, they worried that people **self-selected** into smoking, and that the choice to become a smoker may be influenced by other, unobserved, underlying forces - like genetic predisposition, for example.
That could mean that smokers were also more likely to take risks, or more likely to be heavy drinkers, or to engage in other behaviours that might be conducive to developing lung cancer. They did not formulate it in terms of genetics at the time, because they could not know until the 2000s, when the human genome was sufficiently mapped to establish this fact (and indeed there **is** a smoking gene! But that's beside the point), but they worried about this factor.

The argument was settled in the eyes of most physicians when Jerome Cornfield in 1959 wrote a rebuttal of Fisher's points. Cornfield's strategy was to allow Fisher to have his unobserved factor, but to show that there was an upper bound to *how important* it could be in determining the outcome. Here goes:

1. Suppose there is indeed a confounding factor "smoking gene", and that it completely determines the risk of cancer in smokers.
1. Suppose smokers are observed to have 9 times the risk of non-smokers to develop lung cancer.
1. The smoking gene needs to be at least 9 times more prevalent in smokers than in non-smokers to explain this difference in risk.

But now consider what this implies. Let's suppose that around 11% of all non-smokers have the smoking gene. That means that $9\times 11 = 99\%$ of smokers need to have it! What's even more worrying, if as few as 12% of non-smokers have the gene, then the argument breaks down because it would require $9\times 12 = 108\%$ of smokers to have it, which is of course impossible.

This argument was so important that it got a name: **Cornfield's inequality**. It left nothing of Fisher's argument but a pile of rubble. It's impossible to think that genetic variation alone could be so important in determining a complex choice like becoming a smoker or not. Looking back at the right panel of figure \@ref(fig:dag-cig), the link from smoking to lung cancer was much too strong to be explained by the genetic hypothesis alone.
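
The arithmetic behind Cornfield's argument is easy to reproduce. The sketch below simply multiplies an assumed prevalence of the gene among non-smokers by the observed relative risk of 9 and checks whether the implied prevalence among smokers is still a valid proportion. The function name `required_prevalence` and the additional value of 5% are assumptions made for illustration.

```{r}
# Prevalence of the gene that smokers would need to have, if the gene alone
# were to explain a risk that is `relative_risk` times higher among smokers
required_prevalence <- function(prev_nonsmokers, relative_risk = 9) {
  relative_risk * prev_nonsmokers
}

prev <- c(0.05, 0.11, 0.12)      # assumed prevalence among non-smokers
req  <- required_prevalence(prev)

data.frame(prev_nonsmokers     = prev,
           required_in_smokers = req,
           feasible            = req <= 1)  # a proportion cannot exceed 1
```

As soon as the required prevalence among smokers exceeds 1, the purely genetic explanation is impossible; this is the case for the 12% example in the text.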
## Randomized Control Trials (RCT) Primer {#rct}

We now present a quick introduction to Randomized Control Trials (RCTs). The history of randomization is fascinating and goes back a long time, again involving R.A. Fisher from above.^[I refer the interested student to the introduction of the *potential outcomes model* of [Scott Cunningham's](https://twitter.com/causalinf) [mixtape](http://scunning.com/cunningham_mixtape.pdf), which heavily influences this section.] Suffice it to say that RCTs have become so important in Economics that the [Nobel Prize in Economics 2019](https://www.nobelprize.org/prizes/economic-sciences/2019/summary/) has been awarded to three exponents of the RCT literature, [Duflo, Banerjee and Kremer](https://www.economist.com/finance-and-economics/2019/10/17/a-nobel-economics-prize-goes-to-pioneers-in-understanding-poverty). RCTs are widely used in Medicine, where they originate from (in some sense). But, what *are* RCTs?

```{block type="note"}
A randomized controlled trial is a type of scientific experiment that aims to reduce certain sources of bias when testing the effectiveness of some intervention (treatment or policy); this is accomplished by randomly allocating subjects to two or more groups, treating them differently, and then comparing them with respect to a measured response.
```
That sounds really intuitive. If we *randomly* allocate people to receive treatment, there can be no concern of unobserved confounders, as we have relieved the subjects of making the choice to get treated. Remember the cigarette smokers above: The concern was that an unobserved genetic predisposition correlated both with choosing to become a smoker and with other potentially cancer-inducing behaviours like drinking or risk taking. Imagine for a moment that we could randomly select people at some young age for treatment (smoking for 30 years, say). The genetic predisposition will be equally prevalent in both treatment and control group. However, only the treatment group is allowed (and indeed forced) to smoke. Observing higher cancer rates in the treatment group would provide *causal evidence* for the effect of smoking on lung cancer.

Thankfully, such an experiment is impossible to run on ethical grounds. We could never subject individuals to such severe and prolonged health risks for the sake of a research study. That's why the question took so long to be settled!

Let's introduce a formal framework now to think more about RCTs.

## The Potential Outcomes Model {#rubin}

The Potential Outcomes Model, often called the *Rubin Causal Model* after one of its inventors, posits that there are two states of the world - the *potential outcomes*. A first state, where a certain intervention is administered to an individual, and a second state, where this is not the case. Formally, this idea is expressed with superscripts 0 and 1, like this:

* $Y_i^1$: the outcome of individual $i$ if they have been treated
* $Y_i^0$: the outcome of individual $i$ if they have **not** been treated

Denoting with $D_i \in \{0,1\}$ the treatment indicator which is one if $i$ is indeed treated, the *observed outcome* $Y_i$ is then

\begin{equation}
Y_i = D_i Y_i^1 + (1-D_i)Y_i^0 (\#eq:rubin-model)
\end{equation}

This simple equation is able to formalize a rather deep question.
We only ever observe one outcome of events for a given individual $i$, say $Y_i = Y_i^1$ in case treatment was given. The deep question is: *what would have happened to $i$, had they **not** received treatment*? You will realize that this is a very natural question for us humans to put to ourselves, and to subsequently answer:

* How long would the trip have taken, had I chosen another metro line?
* What would have happened, had I chosen to study a different subject?
* What would have happened, had [Neo](https://en.wikipedia.org/wiki/Neo_(The_Matrix)) taken the blue pill instead?

Our ability to make those considerations distinguishes us from animals. It's one of the biggest challenges for machines when trying to be *intelligent*. What makes this question so hard to answer for machines and animals alike is the fact that one has to *imagine a parallel universe* where the actions taken were different, **without** having observed that precise situation before. Neo did *not* take the blue pill, and whatever happened after that originated from this decision - so how are we to tell what would have happened? It's easy for us and [still hard for machines](https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/).

Potential outcome $Y_i^0$ above is what is known as the *counterfactual* outcome. What would have happened to subject $i$, had they **not** received treatment $D$? Following Rubin, let us define the **treatment effect** for individual $i$ as follows:

\begin{equation}
\delta_i = Y_i^1 - Y_i^0 (\#eq:TE)
\end{equation}

Notice our insistence on talking about a single individual $i$ throughout here. Keeping the potential outcome model \@ref(eq:rubin-model) in mind, i.e.
the fact that we only observe *one* of both outcomes, we face the **fundamental identification problem of program evaluation**: ```{block type="warning"} Given we only observe *one* potential outcome, we cannot compute the treatment effect $\delta_i$ for any individual $i$. ```
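To see the identification problem in data terms, here is a minimal sketch with made-up numbers (the values of `y0`, `y1` and `d` are purely illustrative assumptions). It builds both potential outcomes, applies the switching equation \@ref(eq:rubin-model), and then shows what a researcher would actually observe.

```r
set.seed(1)
n  <- 6
y0 <- round(rnorm(n, mean = 10), 1)   # hypothetical outcome without treatment
y1 <- y0 + c(2, 0, 1, 3, -1, 1)       # hypothetical outcome with treatment
d  <- c(1, 0, 1, 0, 0, 1)             # treatment indicator D_i

# switching equation: Y_i = D_i * Y_i^1 + (1 - D_i) * Y_i^0
y <- d * y1 + (1 - d) * y0

# what the researcher sees: the counterfactual outcome is always missing
observed <- data.frame(
  d  = d,
  y  = y,
  y1 = ifelse(d == 1, y1, NA),
  y0 = ifelse(d == 0, y0, NA)
)

# individual treatment effect computed from observed data only
observed$delta <- observed$y1 - observed$y0
observed
```

In every row either `y1` or `y0` is `NA`, so `delta` is `NA` for all individuals, even though in this simulated world we know the true effects.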
That's pretty dire news. Let's see if we can do better with an average effect instead. Let's define three *average* effects of interest:

1. the Average Treatment Effect (ATE): $$\delta^{ATE} = E[\delta_i] = E[Y_i^1] - E[Y_i^0]$$
1. the Average Treatment Effect on the Treated (ATT): $$\delta^{ATT} = E[\delta_i|D_i = 1] = E[Y_i^1|D_i = 1] - E[Y_i^0|D_i = 1]$$
1. the Average Treatment Effect on the Untreated (ATU): $$\delta^{ATU} = E[\delta_i|D_i = 0] = E[Y_i^1|D_i = 0] - E[Y_i^0|D_i = 0]$$

Notice that *none* of those can be computed from data either, because all of them require data on individual $i$ from *both* scenarios. Let's focus on the ATE for now. Fundamentally we face a **missing data problem**: for each $i$, either $Y_i^1$ or $Y_i^0$ is missing from our dataset. Nevertheless, let's set up the following *naive* simple difference in means estimator $\hat{\delta}$:

\begin{align}
\hat{\delta} =& E[Y_i^1|D_i = 1] - E[Y_i^0|D_i = 0]\\
             =& \frac{1}{N_T} \sum_{i \in T}^{N_T} Y_i - \frac{1}{N_C} \sum_{j \in C}^{N_C} Y_j (\#eq:SDO)
\end{align}

In other words, we just difference the mean outcomes in the treatment (T) and control (C) groups. Here $N_C$ is the number of people in the control group, and $N_T$ is the same for the treatment group.

Now let's consider what randomly choosing people for treatment does. The key consideration here is that the true $\delta_i$ is potentially different for each person. That is, some people will have a high effect of treatment, while others may have a small (or even negative!) effect. To learn about the true $\delta^{ATE}$ from our naive $\hat{\delta}$, it matters who ends up being treated! Imagine that individuals have at least some partial knowledge about their likely *gains from treatment*, i.e. their personal $\delta_i$. If those who expect to benefit a lot select disproportionately into treatment, then our estimator $\hat{\delta}$ will be biased upwards for the true average effect $\delta^{ATE}$.
This is so because the average of observed outcomes in the treatment group, i.e. $$ \frac{1}{N_T} \sum_{i \in T}^{N_T} Y_i $$ will be **too high**. It represents the disproportionately *high* treatment outcome $Y_i^1$ for all those who *anticipated* such a high outcome from treatment, and who therefore were particularly eager to get selected into treatment. It's not *representative* of the true population wide treatment outcome $E[Y_i^1]$. Here is where randomization comes into play. Suppose we now flip a coin for each person to determine whether they obtain treatment or not. This takes away from them the possibility to select on expected gains into treatment. Crucially, the distribution of effects $\delta_i$ is still the same in the study population, i.e. there are still people with high and people with low effects. But we have solved the missing data problem mentioned above, because whether $Y_i^1$ or rather $Y_i^0$ is observed for each $i$ is now **random**, and no longer a function of any other factor that $i$ could act upon! Hooray! Notice how this links back to our initial discussion about DAGs above. Randomisation essentially cancels the links starting at confounder $z$ in \@ref(fig:dag41). ## Omitted Variable Bias and DAGs We want to revisit the underlying assumptions of the classical model outlined in \@ref(class-reg) in the previous chapter, which is closely related to the previous discussion. Let's talk a bit more about assumption number 2 of the definition in \@ref(class-reg). It said this: ```{block type='warning'} The mean of the residuals conditional on $x$ should be zero, $E[\varepsilon|x] = 0$. This means that $Cov(\varepsilon,x) = 0$, i.e. that the errors and our explanatory variable(s) should be *uncorrelated*. We want $x$ to be **strictly exogenous** to the model. ```
Let us start again with \begin{equation} y_i = \beta_0 + \beta_1 x_i + \varepsilon_i (\#eq:DGP-h) \end{equation} and imagine it represents the data generating process (DGP) of the impact of $x$ on $y$. Writing down this equation is tightly linked to drawing this DAG from above: ```{r dag4,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "The same simple DAG showing the causal impact of $x$ on $y$.",echo = FALSE} d1 ``` The role of $\varepsilon_i$ in equation \@ref(eq:DGP-h) is to allow for random variability in the data not captured by our model, almost as an acknowledgement that we would never be able to *fully* explain $y_i$ with our necessarily simple model. However, assumption $E[\varepsilon|x] = 0$ (or $Cov(\varepsilon,x) = 0$) makes sure that those other factors are in **no systematic relationship** with our regressor $x$. Why? Well if it *were* the case that another factor $z$ is related to $x$, we could never make our ceteris paribus statements of *holding all other factors fixed, the impact of $x$ on $y$ is $\beta$*. In other words, we'd have a confounder in our regression. ```{r dag5,echo = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "The same simple DAG where $z$ is a confounder that needs to be controlled for."} d4 ``` Notice, again, that the key here is that if we don't control for $z$, it will form part of the error term $\varepsilon$. Given the causal link from $z$ to $x$, we will then observe that $Cov(x,u) = Cov(x,\varepsilon + z) \neq 0$, invalidating our assumption. ### House Prices and Bathrooms Let's imagine that equation \@ref(eq:DGP-h) represents the impact of number of bathrooms ($x$) on the sales price of houses ($y$). 
We run OLS as $$ y_i = b_0 + b_1 x_i + e_i $$ and find a positive impact of bathrooms on houses: ```{r housing,echo=TRUE} data(Housing, package="Ecdat") hlm = lm(price ~ bathrms, data = Housing) summary(hlm) ``` In fact, from this you conclude that each additional bathroom increases the sales price of a house by `r options(scipen=999);round(coef(hlm)[2],1)` dollars. Let's see if our assumption $E[\varepsilon|x] = 0$ is satisfied: ```{r,warning=FALSE,message=FALSE} library(dplyr) # add residuals to the data Housing$resid <- resid(hlm) Housing %>% group_by(bathrms) %>% summarise(mean_of_resid=mean(resid)) ``` Oh, that doesn't look good. Even though the unconditional mean $E[e] = 0$ is *very* close to zero (type `mean(resid(hlm))`!), this doesn't seem to hold at all by categories of $x$. This indicates that there is something in the error term $e$ which is *correlated* with `bathrms`. Going back to our discussion about *ceteris paribus* in section \@ref(ceteris), we stated that the interpretation of our OLS slope estimate is that ```{block,type="tip"} Keeping everything else fixed at the current value, what is the impact of $x$ on $y$? *Everything* also includes things in $\varepsilon$ (and, hence, $e$)! ```
It looks like our DGP in \@ref(eq:DGP-h) is the *wrong model*. Suppose instead, that in reality sales prices are generated like this: \begin{equation} y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i (\#eq:DGP-h2) \end{equation} This would now mean that by running our regression, informed by the wrong DGP, what we estimate is in fact this: $$ y_i = b_0 + b_1 x_i + (b_2 z_i + e_i) = b_0 + b_1 x_i + u_i. $$ This is to say that by *omitting* variable $z$, we relegate it to a new error term, here called $u_i = b_2 z_i + e_i$. Our assumption above states that *all regressors need to be uncorrelated with the error term* - so, if $Corr(x,z)\neq 0$, we have a problem. Let's take this idea to our running example. ### Including an Omitted Variable What we are discussing here is called *Omitted Variable Bias*. There is a variable which we omitted from our regression, i.e. we forgot to include it. It is often difficult to find out what that variable could be, and you can go a long way by just reasoning about the data-generating process. In other words, do you think it's *reasonable* that price be determined by the number of bathrooms only? Or could there be another variable, omitted from our model, that is important to explain prices, and at the same time correlated with `bathrms`? Let's try with `lotsize`, i.e. the size of the area on which the house stands. Intuitively, larger lots should command a higher price; At the same time, however, larger lots imply more space, hence, you can also have more bathrooms! Let's check this out: ```{r,echo=FALSE} options(scipen=0) hlm2 = update(hlm, . ~ . + lotsize) summary(hlm2) options(scipen=999) ``` Here we see that the estimate for the effect of an additional bathroom *decreased* from `r round(coef(hlm)[2],1)` to `r round(coef(hlm2)[2],1)` by almost 5000 dollars! Well that's the problem then. `r options(scipen=999)`We said above that one more bathroom is worth `r round(coef(hlm)[2],1)` dollars - if **nothing else changes**! 
But that doesn't seem to hold, because we have seen that as we increase `bathrms` from `1` to `2`, the mean of the resulting residuals changes quite a bit. So there **is something in $\varepsilon$ which does change**, hence, our conclusion that one more bathroom is worth `r round(coef(hlm)[2],1)` dollars is in fact *invalid*!

The way in which `bathrms` and `lotsize` are correlated is important here, so let's investigate that:

```{r, fig.align='center', fig.cap='Distribution of `lotsize` by `bathrms`',echo=FALSE}
options(scipen=0)
h = subset(Housing,lotsize<13000 & bathrms<4)
h$bathrms = factor(h$bathrms)
ggplot(data=h,aes(x=lotsize,color=bathrms,fill=bathrms)) + geom_density(alpha=0.2,size=1) + theme_bw()
```

This shows that `lotsize` and the number of bathrooms are indeed positively related: the larger the lot of the house, the more bathrooms it tends to have. This leads to a general result:

```{block type='note'}
**Direction of Omitted Variable Bias**

If the omitted variable $z$ is correlated with $x$ in the same direction as it affects $y$, we will observe upward bias in our estimate of $b_1$, and vice versa if the two go in opposite directions. In other words, we have positive bias if $b_2 \cdot Cov(x,z) > 0$ and negative bias if $b_2 \cdot Cov(x,z) < 0$.
```
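The direction-of-bias result can be checked on simulated data. The sketch below is an illustration under assumed parameter values (a true $\beta_1 = 2$, $\beta_2 = 3$, and a positive link from $z$ to $x$); it is not based on the `Housing` data. It compares the *short* regression that omits $z$ with the *long* regression that includes it, and verifies the standard in-sample identity that the short slope equals the long slope plus $b_2$ times the slope of a regression of $z$ on $x$.

```r
set.seed(123)
n <- 10000
z <- rnorm(n)                        # omitted variable
x <- 0.5 * z + rnorm(n)              # x is positively correlated with z
y <- 1 + 2 * x + 3 * z + rnorm(n)    # assumed true DGP: beta_1 = 2, beta_2 = 3

short <- lm(y ~ x)       # omits z
long  <- lm(y ~ x + z)   # includes z

b1_short <- unname(coef(short)["x"])
b1_long  <- unname(coef(long)["x"])
b2_long  <- unname(coef(long)["z"])

# short slope = long slope + b2 * (slope of z on x)
implied <- b1_long + b2_long * cov(x, z) / var(x)

c(short = b1_short, long = b1_long, implied = implied)
```

Because both $b_2$ and $Cov(x,z)$ are positive here, the short regression overstates the effect of $x$. Flipping the sign of either one (for example `x <- -0.5 * z + rnorm(n)`) should turn the bias negative.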
================================================ FILE: 08-STAR.Rmd ================================================ # STAR Experiment {#STAR} How to best allocate spending on schooling is an important question. What's the impact of spending money to finance smaller classrooms on student performance and outcomes, both in the short and in the long run? A vast literature in economics is concerned with this question, and for a long time there was no consensus. The big underlying problem in answering this question is that we do not really know how student outcomes are *produced*. In other words, what makes a successful student? Is it the quality of their teacher? Surely matters. is it quality of the school building? Could be. Is it that the other pupils are of high quality and this somehow *rubs off* to weaker pupils? Also possible. What about parental background? Sure. You see that there are many potential channels that could determine student outcomes. What is more, there could be several interdependencies amongst those factors. Here's a DAG! ```{r star1,echo = FALSE,warning = FALSE,message = FALSE,fig.align = "center",fig.cap = "Possible Channels determining student outcomes. 
Dashed arrows represent potentially unobserved links."}
library(ggdag)
library(dplyr)
p1 = dagify(outcome ~ teacher,
            size ~ building,
            outcome ~ building,
            outcome ~ peers,
            outcome ~ size,
            peers ~ size,
            outcome ~ parents,
            labels = c("teacher" = "teacher quality",
                       "building" = "building quality",
                       "size" = "class size",
                       "peers" = "quality of peers",
                       "parents" = "parental background",
                       "outcome" = "student outcome"
                       ),
            outcome = "outcome") %>%
  tidy_dagitty() %>%
  mutate(linetype = if_else(name %in% c("peers","parents","teacher"), "dashed","solid")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point() +
  geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) +
  geom_dag_label_repel(aes(label=label)) +
  theme_dag()
p1
```

We will look at an important paper in this literature now, which used a randomized experiment to make some substantial progress in answering the question *what is the production function for student outcomes*. We will study @krueger1999, which analyses the Tennessee Student/Teacher Achievement Ratio Experiment, STAR in short.

## The STAR Experiment

Starting in 1985-1986 and lasting for four years, young pupils starting kindergarten *and their teachers* were randomly allocated to one of several possible groups:

1. small classes with 13-17 students
2. regular classes with 22-25 students
3. regular classes with 22-25 students but with an additional full-time teaching aide.

The experiment involved about 6000 students per year, for a total of 11,600 students from 80 schools. Each school was required to have at least one class of each size-type above, and random assignment happened *at the school level*. At the end of each school grade (kindergarten and grades 1 through 3) the pupils were given a standardized test.

Now, looking back at figure \@ref(fig:star1), what are the complications when we'd like to assess the impact of *class size* on student outcome?
Put differently, why can't we just look at observational data of all schools (absent any experiment!), group classes by their size, and compute the mean outcomes for each group? Here is a short list:

1. There is selection into schools with different sized classes. Suppose parents have a prior that smaller classes are better - they will try to get their kids into those schools.
1. Relatedly, who ends up being in the classroom with a child could matter (peer effects). So, if high quality kids are sorting into schools with small classes, and if peer effects are strong, we could conclude that small classes improved student outcomes when in reality this was due to the high quality of peers in class.
1. Also related, teachers could sort towards schools with smaller classes because it's easier to teach a small rather than a large class, and if there is competition for those places, higher quality teachers will have an advantage.

Now, what can STAR do for us here? There will still be selection into schools; however, once a school has been chosen, it is random whether one ends up in a small or a large class. So, the quality of peers present in the school (determined before the experiment through school choice) will be similar across small and big groups.

In figure \@ref(fig:star1), you see that some factors are drawn as unobserved (dashed arrow), and some are observed (solid). In any observational dataset, the dashed arrows would be really troubling. Here, given randomisation into class sizes, *we don't care* whether those factors are unobserved or not: It's reasonable to assume that across randomly assigned groups, the distribution of each of those factors is roughly the same! If we *can* in fact proxy some of those factors (suppose we had data on teacher qualifications), even better, but this is not necessary to identify the causal effect of class size.
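The argument in the last paragraph can be illustrated with a small simulation. Everything here is an assumption made for illustration (the variable `parents`, a true small-class effect of 5 points, the sorting rule); it does not use the STAR data. In the observational world, children with a better parental background are more likely to end up in a small class, while in the experimental world a coin flip decides.

```r
set.seed(42)
n       <- 100000
parents <- rnorm(n)     # unobserved parental background (hypothetical)
effect  <- 5            # assumed true effect of being in a small class

# observational world: better-off parents sort their kids into small classes
small_obs <- rbinom(n, size = 1, prob = plogis(parents))
# experimental world: a fair coin flip decides
small_rct <- rbinom(n, size = 1, prob = 0.5)

# test score depends on class type and on the unobserved factor
score <- function(small) 50 + effect * small + 10 * parents + rnorm(n, sd = 5)

# simple difference in means between small (d = 1) and other (d = 0) classes
diff_means <- function(y, d) mean(y[d == 1]) - mean(y[d == 0])

y_obs <- score(small_obs)
y_rct <- score(small_rct)

res <- c(balance_obs = diff_means(parents, small_obs),
         balance_rct = diff_means(parents, small_rct),
         naive_obs   = diff_means(y_obs, small_obs),
         naive_rct   = diff_means(y_rct, small_rct))
round(res, 2)
```

Under sorting, parental background differs markedly between class types and the naive comparison overstates the effect. Under random assignment the unobserved factor is balanced across groups, and the same naive comparison lands close to the assumed effect of 5 without ever using `parents`.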
## PO as Regression

Before we start replicating the findings in @krueger1999, let's augment our potential outcomes (PO) notation from the previous chapter. To remind you, we had defined the PO model in equation \@ref(eq:rubin-model):

\begin{equation*}
Y_i = D_i Y_i^1 + (1-D_i)Y_i^0
\end{equation*}

and we had defined the treatment effect of individual $i$ as in \@ref(eq:TE):

\begin{equation*}
\delta_i = Y_i^1 - Y_i^0.
\end{equation*}

Now, as a start, let's assume that the treatment effect of *small class* is identical for all $i$: in that case we have

\begin{equation*}
\delta_i = \delta ,\forall i
\end{equation*}

Next, let's rearrange \@ref(eq:rubin-model) by collecting the terms in $D_i$ as follows:

\begin{align*}
Y_i &= Y_i^0 + D_i (Y_i^1 - Y_i^0 )\\
    &= Y_i^0 + D_i \delta
\end{align*}

Finally, let's add $E[Y_i^0] - E[Y_i^0]=0$ to the RHS of that last equation to get

\begin{equation*}
Y_i = E[Y_i^0] + D_i \delta + Y_i^0 - E[Y_i^0]
\end{equation*}

which we can rewrite in our well-known regression format

\begin{equation}
Y_i = b_0 + \delta D_i + u_i (\#eq:PO-reg)
\end{equation}

In that formulation, the first $E[Y_i^0]$ is the average non-treatment outcome, which we could regard as some sort of baseline - i.e. our intercept $b_0$. $\delta$ is the coefficient on the binary treatment indicator. The random deviation $Y_i^0 - E[Y_i^0]$ is the residual $u_i$. Under only very specific circumstances will the OLS estimator $\hat{\delta}$ identify the true Average Treatment Effect $\delta^{ATE}$. Random assignment ensures that the crucial assumption $E[u|D] = E[Y_i^0 - E[Y_i^0]|D] = E[Y_i^0|D] - E[Y_i^0] = 0$ holds; in other words, there is no difference in nontreatment outcomes across treatment groups. Additionally, we could easily include regressors $X_i$ in equation \@ref(eq:PO-reg) to account for additional variation in the outcome.

With that out of the way, let's write down the regression that @krueger1999 wants to estimate.
Equation (2) in his paper reads like this:

\begin{equation}
Y_{ics} = \beta_0 + \beta_1 \text{small}_{cs} + \beta_2 \text{REG/A}_{cs} + \beta_3 X_{ics} + \alpha_s + \varepsilon_{ics} (\#eq:krueger2)
\end{equation}

where $i$ indexes pupils, $c$ is the class id and $s$ is the school id. $\text{small}_{cs}$ and $\text{REG/A}_{cs}$ are dummy variables equal to one if class $c$ in school $s$ is *small*, or *regular with aide*, respectively. $X_{ics}$ contains student specific controls (like gender). Importantly, given that randomization was at the school level, we control for the identity of the school with a school fixed effect $\alpha_s$.

Before we proceed to run this regression, we need to define the outcome variable $Y_{ics}$. @krueger1999 combines the various SAT test scores in an average score for each student in each grade. However, given that the SAT scores are on different scales, he first computes a ranking of all scores for each subject (reading or math), and then assigns to each student their percentile in the rank distribution. The highest score is 100, the lowest score is 0.

## Implementing STAR

Let's start with computing the ranking of grades. Let's load the data and the `data.table` package:

```{r,message = FALSE}
data("STAR", package = "AER")
library(data.table)
x = as.data.table(STAR)
x
```

It's a bit unfortunate to switch to data.table, but I haven't been able to do what I wanted in dplyr :-( . Ok, here goes. The first thing you can see is that this data set is *wide*. We want to make it *long*, i.e. reshape it so that it has 4 ID columns, and several measurement columns thereafter.
First, let's add a student ID:

```{r}
x[ , ID := 1:nrow(x)] # add a column called `ID`
```

```{r}
# to `melt` a data.table means to dissolve it and reassemble it around some ID variables
mx = melt.data.table(x,
                     id = c("ID","gender","ethnicity","birth"),
                     measure.vars = patterns("star*","read*","math*", "schoolid*",
                                             "degree*","experience*","tethnicity*","lunch*"),
                     value.name = c("classtype","read","math","schoolid","degree",
                                    "experience","tethnicity","lunch"),
                     variable.name = "grade")
levels(mx$grade) <- c("stark","star1","star2","star3") # reassign levels to grade factor
mx[,1:8] # show first 8 cols
```

You can see here that for example pupil `ID=1` was not present in kindergarten, but joined later. We will only keep complete records, hence we drop those NAs:

```{r}
mx <- mx[complete.cases(mx)]
mx[ID==2] # here is pupil number 2
```

Ok, now on to standardizing those `read` and `math` scores. You can see they are on their own, somewhat arbitrary, SAT scales:

```{r}
mx[,range(read)]
```

The first thing to do is to create an empirical cdf of each of those scores within a certain grade. That is the *ranking* of scores from 0 to 1:

```{r,message = FALSE, results='hide'}
setkey(mx, classtype) # key mx by class type
ecdfs = mx[classtype != "small", # subset data.table to this
           list(readcdf = list(ecdf(read)), # create cols readcdf and mathcdf
                mathcdf = list(ecdf(math))
                ),
           by = grade] # by grade
# let's look at those cdf!
om = par("mar")
par(mfcol=c(4,2),mar = c(2,om[2],2.5,om[4]))
ecdfs[,.SD[,plot(mathcdf[[1]],main = paste("math ecdf grade",.BY))],by = grade]
ecdfs[,.SD[,plot(readcdf[[1]],main = paste("read ecdf grade",.BY))],by = grade]
par(mfcol=c(1,1),mar = om)
```

You can see here how the cdf maps SAT scores (650, for example), into the interval $[0,1]$. Now, in the `ecdfs` `data.table` object, the `readcdf` column contains a *function* (a cdf) for each grade.
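If the idea of a column holding *functions* seems unusual, here is a toy illustration with made-up scores (the vector `scores` is an assumption, not STAR data). Base R's `ecdf()` returns a function, which can then be evaluated at any score to obtain the share of observations at or below it.

```r
# ten made-up test scores
scores <- c(420, 450, 450, 480, 510, 530, 560, 600, 650, 700)

F_hat <- ecdf(scores)        # ecdf() returns a function, not a number
is.function(F_hat)

F_hat(650)                   # share of scores <= 650: 9 out of 10
F_hat(c(400, 450, 1000))     # vectorised: 0, 3 out of 10, and 1

# functions can be stored in a list, which is what the list-columns above do
cdfs <- list(toy = F_hat)
cdfs$toy(510) * 100          # percentile rank on a 0-100 scale
```

Wrapping `ecdf(read)` in `list()` inside the `data.table` call above serves the same purpose: it lets one cdf per grade be stored in a single cell.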
We can evaluate the observed test scores for each student in that function to get their ranking in $[0,1]$, by grade: ```{r gradedens, fig.cap = "Reproducing Figure I in @krueger1999",fig.align="center"} setkey(ecdfs, grade) # key ecdfs according to `grade` setkey(mx,grade) z = mx[ , list(ID,perc_read = ecdfs[(.BY),readcdf][[1]](read), perc_math = ecdfs[(.BY),mathcdf][[1]](math)), by=grade] # stick `grade` into `ecdfs` as `.BY` z[,score := rowMeans(.SD)*100, .SDcols = c("perc_read","perc_math")] # take average of scores # and multiply by 100, so it's comparable to Krueger # merge back into main data mxz = merge(mx,z,by = c("grade","ID")) # make a plot ggplot(data = mxz, mapping = aes(x = score,color=classtype)) + geom_density() + facet_wrap(~grade) + theme_bw() ``` You can compare figure \@ref(fig:gradedens) to @krueger1999 figure 1. You can see that the density estimates are almost identical, the discrepancy comes mainly from the fact that we split the regular classes also by with/without aide. ```{r kruegerdens,echo=FALSE, fig.cap = "Outcome densities, @krueger1999 figure 1."} knitr::include_graphics("images/krueger1.png") ``` So far, so good! Now we can move to run a regression and estimate \@ref(eq:krueger2). 
```{r}
# create Krueger's dummy variables
mxz = as_tibble(mxz) %>%
  mutate(small = classtype == "small",
         rega = classtype == "regular+aide",
         girl = gender == "female",
         freelunch = lunch == "free")
# reproduce columns 1-3
m1 = mxz %>% group_by(grade) %>% do(model = lm(score ~ small + rega, data = .))
m2 = mxz %>% group_by(grade) %>% do(model = lm(score ~ small + rega + schoolid, data = .))
m3 = mxz %>% group_by(grade) %>% do(model = lm(score ~ small + rega + schoolid + girl + freelunch, data = .))
# get school id names to omit from regression tables
school_co = grep(names(coef(m2[1,]$model[[1]])),pattern = "schoolid*",value=TRUE)
# combine with the school coefficients of m3 before taking unique values
school_co = c(unique(c(school_co,grep(names(coef(m3[1,]$model[[1]])),pattern = "schoolid*",value=TRUE))),"schoolid77")
```

Now let's look at each grade's models.

```{r}
h = list()
for (g in unique(mxz$grade)) {
  h[[g]] <- huxtable::huxreg(subset(m1,grade == g)$model[[1]],
                             subset(m2,grade == g)$model[[1]],
                             subset(m3,grade == g)$model[[1]],
                             omit_coefs = school_co,
                             statistics = c(N = "nobs", R2 = "r.squared"),
                             number_format = 2
                             ) %>%
    huxtable::insert_row(c("School FE","No","Yes","Yes"),after = 11) %>%
    huxtable::theme_article() %>%
    huxtable::set_caption(paste("Estimates for grade",g)) %>%
    huxtable::set_top_border(12, 1:4, 2)
}
h$stark
h$star1
h$star2
h$star3
```

You should compare those to table 5 in @krueger1999, where it says *OLS: actual class size*. For the most part, we come quite close to his estimates! We did not follow his more sophisticated error structure (which allows errors to be correlated at the classroom level), and we seem to have a different number of individuals in each year.
Here is his table 5:

```{r krug-table,echo=FALSE,fig.show = "hold", fig.align = "default"}
knitr::include_graphics("images/krueger2.png",dpi = 300)
knitr::include_graphics("images/krueger3.png",dpi = 300)
knitr::include_graphics("images/krueger4.png",dpi = 300)
knitr::include_graphics("images/krueger5.png",dpi = 300)
```

So, based on those results we can say that attending a small class raises students' test scores by about 5 percentile points. Unfortunately, says @krueger1999, it is hard to gauge whether those are big or small effects: how important is it to score 5% more or less? Well, you might say, it depends how close you are to an important cutoff value (maybe entrance to the next school level requires a score of `x`, and the 5% boost would have made that school feasible). Be that as it may, now you know more about one of the most influential papers in education economics, and why using an experimental setup allowed it to achieve credible causal estimates.

================================================
FILE: 09-RDD.Rmd
================================================

# Regression Discontinuity Design {#RDD}

In the previous chapter we have seen how an experimental setup can be useful to recover causal effects from an OLS regression. In this chapter we will look at a similar approach where we don't randomly allocate subjects to either treatment or control (maybe because that's impossible to do in that particular situation), but where we can *zoom in* on a group of individuals where having been allocated to treatment is **as good as random** - hence not influenced by selection bias. The idea is called Regression Discontinuity Design, RDD for short.

## RDD Setup

Let's again start with a DAG for the main idea. Remember the numerical example in the set of slides on randomization, where we showed that if we know the allocating mechanism, we can recover the true ATE. RDD plays along those lines, in that we *know* how individuals got assigned to treatment.
As is the case in many real-life situations, people are eligible for some treatment if some value **crosses a threshold**:

* You are eligible to obtain a driving license as your age crosses the 18-year threshold.
* You will receive pension benefits as your age crosses the 65-year threshold.
* You are liable to criminal charges if you are caught with more than 3g of marijuana in your pocket.
* You are considered a subprime mortgage borrower if your loan-to-value ratio is above 95%.

In RDD parlance, we call that particular variable we are looking at the *running variable* (age, quantity of marijuana, LTV ratio etc). If we know the applicable threshold (18 years of age, say), and we know an individual's age, then it's trivial to figure out whether they were eligible to get a driving license.

Let's formalize this a bit. Let's call the running variable $x$, the outcome $y$ as usual, let $D$ be the treatment indicator and let us define a *threshold* value $c$. Treatment for individual $i$ will be such that

\begin{equation*}
D_i = \begin{cases}\begin{array}{c}1\text{ if }x_i > c \\ 0\text{ if }x_i \leq c. \end{array} \end{cases}
\end{equation*}

Here's the obligatory DAG!

```{r rdd1,echo = FALSE,warning = FALSE,message = FALSE,fig.align = "center",fig.cap = "DAG for a simple RDD design: $x$ determines treatment via the cutoff $c$"}
library(ggdag)
dagify(y ~ x,
       c ~ x,
       D ~ c,
       y ~ D) %>%
  ggdag(layout = "circle") + theme_dag()
```

The key idea can be gleaned from figure \@ref(fig:rdd1): if we can *know* who ends up in treatment $D$, this can be useful for us to recover the true ATE. In particular, the idea is going to be to compare individuals who are *close* to the threshold $c$: Those with an $x$ *just above* the threshold should be comparable (in terms of their $x$!) to the ones *just below* $c$. Someone who has their 18th birthday next week is almost identical to someone who had their 18th birthday last week - in terms of age!
So, computing our naive difference in mean outcomes for those narrowly defined groups should be a good approximation to a random allocation. Notice that there are two important things to keep in mind:

1. None of the other variables in the model should exhibit any discontinuity at $c$, other than $D$!
2. We obtain only *locally* valid identification of the ATE: as we move further and further away from the threshold, our individuals will cease to be really comparable.

In our DAG, point 1 above is not currently visible. Let's augment this:

```{r rdd2,echo = FALSE,warning = FALSE,message = FALSE,fig.align = "center",fig.cap = "Augmented DAG for a simple RDD design: $x$ determines treatment via the cutoff $c$"}
library(ggdag)
dagify(y ~ x,
       c ~ x,
       D ~ c,
       y ~ z,
       y ~ D) %>%
  ggdag(layout = "circle") + theme_dag()
```

So, the condition we want is that the additional explanatory variable $z$ does **not** suddenly jump as $x$ crosses $c$. Because we will be comparing the mean outcomes of people slightly to the left and right of $c$, we need to make sure that there is nothing that would *confound* our estimate of the size of the effect $\delta$. Let's look at an example of a recent RDD study now.

## Clicking on Heaven's Door

In a recent paper titled [Clicking on Heaven's Door](https://www.aeaweb.org/articles?id=10.1257/aer.20150355), U Bocconi economist [Paolo Pinotti](https://sites.google.com/view/paolo-pinotti/home) uses an RDD to show the effects of legal status of immigrants on criminal behaviour. The question of whether immigrants commit more or less crime than others (natives, for example) is a first order policy question. The question of that paper is what causal impact the *legal status* we confer upon an immigrant has on their propensity to commit a crime one year later. The study is based in Italy, where the legal status refers to an official permission to work. The key detail is that the residence permit needs to be sponsored by the immigrant's employer.
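Before going into the details of that study, the generic RDD comparison from the setup section can be sketched on simulated data. All numbers below are assumptions chosen for illustration (a cutoff at 0, a true jump of 2, a linear effect of the running variable); nothing here comes from the paper. The sketch contrasts a naive comparison using everybody with a comparison of observations close to the cutoff, and with a local linear regression that allows separate slopes on each side.

```r
set.seed(2024)
n      <- 20000
cutoff <- 0                              # threshold c
x      <- runif(n, -1, 1)                # running variable
d      <- as.numeric(x > cutoff)         # treatment rule: D_i = 1 if x_i > c
delta  <- 2                              # assumed true jump at the cutoff
y      <- 1 + 3 * x + delta * d + rnorm(n)

# naive comparison using everybody: also picks up the effect of x itself
naive_all <- mean(y[d == 1]) - mean(y[d == 0])

# difference in means within a narrow bandwidth h around the cutoff
h    <- 0.05
near <- abs(x - cutoff) <= h
naive_local <- mean(y[near & d == 1]) - mean(y[near & d == 0])

# local linear regression with separate slopes left and right of the cutoff
rd <- lm(y ~ d * x, subset = abs(x - cutoff) <= 0.25)
local_linear <- unname(coef(rd)["d"])

round(c(naive_all = naive_all, naive_local = naive_local,
        local_linear = local_linear), 2)
```

The comparison over the whole sample is far from the assumed jump because treated units also have a larger $x$, whereas the two local estimates come much closer. The bandwidths 0.05 and 0.25 are arbitrary choices here; in applied work one would check that results are stable across several bandwidths.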
### Institutional Details In the Italian context, immigrants often enter illegally first, and then hope to obtain a residence permit through an employer later on. There is a quota system in place, which establishes how many permits are to be granted to how many people from each nationality, and to which Italian industries (construction, services, etc). See table \@ref(fig:pin1) for an overview of those quotas. ```{r pin1,echo=FALSE,fig.cap="Table 1 from @pinotti.",fig.align="center"} knitr::include_graphics("images/pinotti1.png") ``` Almost all of the estimated 650,000 illegal immigrants participate in the click days. @pinotti is able to link each immigrant to official interior ministry crime records, and is thus able to precisely identify whether an immigrant with a certain legal status is showing up in crime records in the year(s) after click days. ### Discontinuity Feature The principal feature of the Italian setting which makes this almost perfect for an RDD is the following: The quotas illustrated in \@ref(fig:pin1) are defined for a total of 1751 employer groups (varying by industry and location). Applications for a permit must be submitted online by employers, **starting at 8:00 AM on specific click days**, and will be given out on a first come first served basis. This implies that thousands of applicants are denied their permit each year not because they were not eligible (they had an employer sponsoring them!), but because they got late online (some seconds) when all permits for their specific quota were gone already. Here we formalize as $c$ the *exact* time the quota for a certain group was full, and the running variable $x$ is the *exact* time that the sponsoring employer clicked on the *submit* button on the website. This is measured at the level of milliseconds. The key observation is now that the exact timing when a certain quota is full, $c$, is impossible to foresee. 
Even if it is the case that the employers of highly-skilled individuals are the first ones to log on, there is sufficient random variation (slow internet connection today...) in arrival times of employers on the platform such that the distributions of immigrant and employer types on both sides of the cutoff $c$ are almost identical. So, it's very hard to *manipulate* the assignment rule. @pinotti experiments with several definitions of *close by the cutoff*, from 1 minute up to 30 minutes. The results are stable regardless of the definition chosen.

### Findings

* In the year before click days, the crime rate for the individuals in the study is 1.1% in both treatment and control groups.
* In the year after click days, crime rates decline to 0.8% for people who clicked before time $c$, and stay at 1.1% for people who came late.
* The effect is mainly driven by so-called type A permits, which cover domestic workers (nannies, cleaning aides etc), while type B is for formal firms. Type A permits are easy to manipulate. Being granted legal status has a particularly large impact on reducing crime for type A permits.
* There is no effect for type B permits.
* Firm-sponsored applicants (type B) have a higher opportunity cost of crime already before click days - in the end they already are (informally) employed somewhere! Undocumented immigrants in domestic (fictitious) employment have a lower opportunity cost of crime.
* At the same time, and crucially, that latter group seems to be particularly responsive to legalization, most likely because a residence permit allows them to search for work in the official labor market.
* For people at the margin between pursuing illegal activities or not, this is an important effect.
* Figure \@ref(fig:pin2) illustrates this finding graphically with a typical RD plot.
The vertical difference in red and green lines at the cutoff $T=0$ in the *2008: type-A applicants* is the estimate of the local average treatment effect (LATE), i.e. $0.5 - 1.8 = -1.3$. ```{r pin2,echo=FALSE,fig.cap="Table 3 from @pinotti.",fig.align="center"} knitr::include_graphics("images/pinotti2.png") ``` ================================================ FILE: 10-IV.Rmd ================================================ # Instrumental Variables (IV) {#IV} ```{r, echo = FALSE} library(modelsummary) gm = modelsummary::gof_map gm$omit <- TRUE gm$omit[gm$clean == "R2"] <- FALSE gm$omit[gm$clean == "Num.Obs."] <- FALSE gom = "p.value.|se_type|statistic.end|statistic.overid|statistic.weakinst" ``` In the previous chapters we have seen how to get credible causal estimates for the effect of some intervention via randomization techniques. Randomly allocating a subject to treatment or control groups makes sure that *everything else is equal*, hence we are really comparing apples with apples. Unfortunately, oftentimes we cannot perform a randomized control trial (RCT), out of technological, ethical, or other constraints. (We mentioned that forcing people to smoke via lottery draw is impossible to justify.) So, that's it, no RCT, no causal estimates? No! Methods like instrumental variables can help us to establish causality if we only have *observational data* (i.e. data generated not via experiment). That's reassuring, because in many settings this kind of data is the only thing we have. ## John Snow and the London Cholera Epidemic ```{r father-thames, fig.cap="Father Thames Introducing his Offspring to the Fair City of London, [Punch (1858)](https://www.bl.uk/collection-items/father-thames-introducing-his-offspring-to-the-fair-city-of-london-from-punch)",echo = FALSE} knitr::include_graphics("images/father-thames.jpg") ``` The 1853-1854 Cholera outbreak in London killed 616 people. 
Physician [John Snow](https://en.wikipedia.org/wiki/John_Snow) was able to use data collected during this period to demonstrate that the illness was water-borne, and not transmitted via air, as was widely believed at the time. In order to better appreciate this section, let's imagine the world of John Snow in 1853^[The following is based on @freedman1991]: ```{block, type='notel'} * It is not yet known that germs can cause disease (or indeed, that they exist). * Microscopes exist, but work at rather poor resolution. * Most human pathogens are not visible to the naked eye, and the isolation of such microbes is still several decades away. * The so-called *infection theory* (i.e. infection via *germs*) has some supporters, but the dominant idea is that disease, in general, results from [*miasmas*](https://en.wikipedia.org/wiki/Miasma_theory): very small, non-living poisonous particles that float in the air - basically rotting organic matter would emanate foul air, that caused disease. The figure below shows an illustration. ```
```{r miasma, fig.cap="Miasma in the Air. Robert Seymour - A Short History of the National Institutes of Health National Library of Medicine photographic archive.",echo = FALSE}
knitr::include_graphics("images/Cholera_art.jpg",dpi = 75)
```

Snow hypothesized that the pathogen causing cholera was taken into the body via food or drink, multiplied and generated a poisonous substance causing the body to expel water, i.e. an extreme form of diarrhea. The active agent then would leave the human body via those excrements, and find its way back into the water supply, infecting the next victim. So, the question at the time was: is cholera a result of *miasmas* (foul, poisonous air), or a pathogen that was water-borne and infected new victims via excrements of former victims (the *infection theory*)?

Snow conducted some impressive detective work tracking down exceptional cases that would refute the miasma theory. For example he documented that two adjacent apartment buildings, one hit by cholera and the other not, had different water supplies: the first building's supply was contaminated by runoffs from privies (toilets), while the second one had arguably cleaner water. He studied the wider water supply system of London, finding that several water providers took their water from the heavily polluted River Thames.

During 1853-1854, John Snow drew a map (see figure \@ref(fig:snow-map)) that showed where the fatalities had occurred. It became obvious that the cases clustered around the Broad Street pump.

```{r snow-map, fig.cap="John Snow's original map of the Broad Street pump. https://commons.wikimedia.org/wiki/File:Snow-cholera-map-1.jpg",echo = FALSE}
knitr::include_graphics("images/snow-map.jpg")
```

The story goes that after he had studied the map and insisted with the local council, the handle of the water pump was removed, and the outbreak ended.
Alas, Snow himself showed that the epidemic was stopping anyway and that the removal of the pump handle was close to irrelevant. This can be seen in figure \@ref(fig:snow-TS).

```{r snow-TS,fig.cap="Time series of cholera deaths and timing of pump removal",warning=FALSE,echo = FALSE}
plot(cholera::timeSeries())
```

What seemed much more interesting to him were other observations, like for example:

1. He found that a large poorhouse in the Broad Street area had very few cholera cases. He observed that the poorhouse had its own well (no need for the inmates to go to the public Broad Street pump).
1. There was a large brewery in the vicinity of the pump, whose workers did not die of cholera. The workers drank beer, and there was a private well on the premises.

### Mapping London's Water Supply

A few years before the outbreak, Lambeth water company had decided to move its water intake point upstream along the Thames, beyond the main sewage discharge points. Two other companies, the Southwark and Vauxhall water companies, however, left their intake points where they were, i.e. downstream from the sewage discharges. Snow's analysis of the data showed that cholera was more prevalent in the Southwark and Vauxhall serviced areas and largely had spared Lambeth.
He was able to compile the data in table \@ref(tab:snow-tab9):

```{r snow-tab9,echo = FALSE}
st9 <- data.frame(numhouses = c(40046,26107,256423),
                  deaths = c(1263,98,1422),
                  death1000 = c(315,37,59))  # deaths per 10,000 houses
```

Table: (\#tab:snow-tab9) John Snow's table IX

| | Number of houses | Deaths from Cholera | Deaths per 10,000 Houses |
|----|:----------------:| :----------------: | :----------------: |
|Southwark and Vauxhall | 40,046 | 1,263 | 315 |
|Lambeth | 26,107 | 98 | 37 |
|Rest of London | 256,423 | 1,422 | 59 |

With table \@ref(tab:snow-tab9) in hand, Snow concluded that *if* Southwark and Vauxhall water companies had moved their water intakes upstream to where Lambeth water was taking in their supply, roughly 1,000 lives could have been saved. For proponents of the miasma theory, this was still not evidence enough, because there were also many factors that led to poor air quality in those areas.

Of course Snow was proven right later on, when in 1884 Koch first isolated the cholera *vibrio*, basically confirming Snow's version of the story. But what is really interesting for us is how he was able to use his **non-experimental** data set to make his point.

### Snow's Model of Cholera Transmission

Even though he never formally wrote it down in this way, we can formulate Snow's way of thinking about the issue in the following terms:

* Suppose that $c_i$ takes the value 1 if individual $i$ dies of cholera, and 0 otherwise.
* Let $w_i = 1$ mean that $i$'s water supply is impure and $w_i = 0$ that it is pure. Water purity is assessed with a technology that cannot detect small microbes.
* Collect in $u_i$ all unobservable factors that impact $i$'s likelihood of dying from the disease: whether $i$ is poor, where exactly they reside, whether there is bad air quality in $i$'s surrounding, and other individual characteristics which impact the outcome (like genetic setup of $i$).
With this, we can write model \@ref(eq:snow-mod1):

\begin{equation}
c_i = \alpha + \delta w_i + u_i (\#eq:snow-mod1)
\end{equation}

John Snow could have used his data to assess the correlation between drinking impure water and cholera incidence, i.e. try to measure $Cor(c_i,w_i)$, or, which is similar, just run model \@ref(eq:snow-mod1) as a linear regression. There is a problem with this, however. As @deaton1997 says,

> The people who drank impure water were also more likely to be poor, and to live in an environment contaminated in many ways, not least by the ‘poison miasmas’ that were then thought to be the cause of cholera.

In other words, it does not make sense to compare someone who drinks pure water with someone with impure water, as model \@ref(eq:snow-mod1) proposes to do, because *all else is not equal*: impure water is correlated with being poor, living in a bad area, bad air quality and so on - all factors that we encounter in $u_i$. This violates the crucial orthogonality assumption for valid OLS estimates, $E[u_i | w_i]=0$ in this context. Another way to say this is that $Cov(w_i, u_i) \neq 0$, implying that $w_i$ is *endogenous* in equation \@ref(eq:snow-mod1): There are factors in $u_i$ that affect both $w_i$ and $c_i$, so we cannot reasonably say that *the effect of $w$ is that...*, because things in $u_i$ move at the same time as $w_i$ moves (and we can't see those things). So, the *miasma* theorists actually had a point.
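To get a feeling for how much this endogeneity can matter, here is a minimal simulation sketch. Everything in it is made up for illustration and is *not* Snow's data: `sim_poor` plays the role of one unobserved factor hidden in $u_i$, the probabilities are arbitrary, and the true effect `delta_true` is set to 0.3 by assumption.

```{r}
set.seed(42)                       # for reproducibility
sim_n      <- 100000               # hypothetical number of individuals
delta_true <- 0.3                  # assumed true effect of impure water

# `sim_poor` stands in for an unobserved factor contained in u
sim_poor <- rbinom(sim_n, 1, 0.5)
# poor people are more likely to have impure water (w = 1)
sim_w    <- rbinom(sim_n, 1, 0.2 + 0.6 * sim_poor)
# dying of cholera depends on the water AND on being poor
sim_c    <- rbinom(sim_n, 1, 0.05 + delta_true * sim_w + 0.2 * sim_poor)

# naive comparison of means, identical to the OLS slope of c on w
sim_naive <- mean(sim_c[sim_w == 1]) - mean(sim_c[sim_w == 0])
sim_ols   <- unname(coef(lm(sim_c ~ sim_w))["sim_w"])

c(naive = sim_naive, ols = sim_ols, truth = delta_true)
```

In this setup both the naive difference and the OLS slope should come out clearly above the assumed 0.3, because part of the effect of being poor is wrongly attributed to the water.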
Let us condition equation \@ref(eq:snow-mod1) on either value $w$ might take:

\begin{align}
E[c_i | w_i = 1] &= \alpha + \delta + E[u_i | w_i = 1] \\
E[c_i | w_i = 0] &= \alpha + \phantom{\delta +} E[u_i | w_i = 0]
\end{align}

Simply differencing those two lines thus yields

\begin{equation}
E[c_i | w_i = 1] - E[c_i | w_i = 0] = \delta + \left\{ E[u_i | w_i = 1] - E[u_i | w_i = 0]\right\}
\end{equation}

and we said that it stands to reason that this last term $\left\{ E[u_i | w_i = 1] - E[u_i | w_i = 0]\right\}$ is not equal to zero, hence our regression estimate for $\delta$ would be biased by that quantity.

## Defining the IV Estimator

John Snow did not know what an IV estimator was because it had not been described yet.^[In @angristkruegerIV you can read that Philip Wright (or his son Sewall, or both) is widely credited with this discovery in 1928.] However, we can use the above setup to develop the idea. To get at this, it is useful to hear what John Snow has to say after he shows us his table IX, displayed in table \@ref(tab:snow-tab9) above:

> [...] the mixing of the supply is of the most intimate kind. The pipes of each Company go down all the streets, and into nearly all the courts and alleys. [...] The experiment, too, is on the grandest scale. No fewer than three hundred thousand people of both sexes, of every age and occupation, and of every rank and station, from gentlefolks down to the very poor, were divided into two groups without their choice, and in most cases, without their knowledge; one group supplied with water containing the sewage of London, and amongst it, whatever might have come from the cholera patients, the other group having water quite free from such impurity.

To back this up, he produced the following map showing which areas were served by which water company. As you can see from the legend, the purple areas denote those with mixed water supply.
```{r snow-supply,fig.cap="Snow's map of water supply in London",echo=FALSE} knitr::include_graphics("images/snow-supply.jpg",dpi = 100) ``` So, without knowing, Snow is proposing an **instrumental variable** $z_i$, the *identity of the water supplying company* to household $i$, which is highly correlated with the water purity $w_i$. However, following his remarks above, it seems to be uncorrelated with all the other factors in $u_i$, which worried us before: people in most cases didn't even know who supplied their water, as those decisions were taken years before. Very similar households, on either side of a street, may have had different water purity in their homes as a result of a different supplier.^[The formulation as an IV has been taken from [W. Greene's website](http://people.stern.nyu.edu/wgreene/Econometrics/Cholera-IV-Study.pdf)] Let's visualize this setup in a DAG to start with: ```{r IV-dag,warning = FALSE,message = FALSE,echo = FALSE,fig.cap="DAG for IV setup in Snow's study setting. $u$ affects both explanatory variable $w$ and outcome $c$ at the same time. $z$ affects the outcome *only through* its impact on $w$. Solid arrows are measurable with data, dashed arrows are not.",fig.height = 3 } library(ggdag) library(dplyr) coords <- list( x = c(z = 1, w = 3, u = 4, c = 5), y = c(z = 0, w = 0, u = 0.5, c = 0) ) dag <- dagify(c ~ w + u, w ~ z + u, coords = coords) dag %>% tidy_dagitty() %>% mutate(linetype = ifelse(name == "u", "dashed", "solid")) %>% ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_dag_point() + geom_dag_text() + geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) + theme_void() ``` You can see that outcome $c$ is affected by water purity $w$ and other factors $u$. 
The point of the DAG in figure \@ref(fig:IV-dag) is to show that $u$ affects both the outcome $c$ *and* the explanatory variable $w$ at the same time, and the conditional mean assumption $E[u_i | w_i]=0$ is violated (this is an implication of the arrow from $u$ to $w$). Now, $z$ affects *only* $w$ (it is *relevant* for $w$) and, furthermore, it is *not affected* by $u$ (it is *exogenous* to the outcome equation). Now, if you change the value of $z_i$ in this DAG, this will change $w_i$ (follow the solid arrow!), which will then change the outcome $c_i$. The key insight is that we can be sure that this change in $c_i$ had nothing to do with other factors $u_i$ - that's what our model assumes (no arrow from $u$ to $z$!). So, we can measure the part of the correlation between $w$ and $c$ that is *due to* correlation between $w$ and $z$, and obtain a *causal* effect! Spoiler alert: The formula for a one variable, one IV setting like this here is $$\frac{Cov(z,c)}{Cov(z,w)}.$$ The DAG is silent about the strength of each of the arrows. Whether any of the arrows is more or less important is a statistical question (i.e. we have to *measure* their strengths in data somehow). The usefulness of a DAG like this one is purely to think about and justify the model we have in mind, and for that purpose it is a very good tool. I would encourage you to draw one of those each time before you want to use IV analysis. In particular, your theory needs to spell out why there is no arrow from $u$ to $z$ - see Snow's argumentation above that water supply was close to randomly allocated to households in 1850 London. More formally, let's define the instrument as follows: \begin{align*} z_i &= \begin{cases} 1 & \text{if water supplied by Lambeth} \\ 0 & \text{if water supplied by Southwark or Vauxhall.} \\ \end{cases} \\ \end{align*} Here are the conditions for a valid instrument: ```{block, type="notel"} 1. 
**Relevance** or **First Stage**: Water purity is indeed a function of supplier identity. We want that $$E[w_i | z_i = 1] \neq E[w_i | z_i = 0]$$ i.e. the average water purity differs across suppliers. We can *verify* this condition with observational data. We want this effect to be reliably causal. 2. **Independence**: Whether a household has $z_i = 1$ or $z_i = 0$ is unrelated to $u$, hence *as good as random*. Whether we condition $u$ on certain values of $z$ does not change the result - we want $$E[u_i | z_i = 1] = E[u_i | z_i = 0].$$ 3. **Excludability** the instrument should affect the outcome $c$ *only* through the specified channel (i.e. via water purity $w$), and nothing else. ```
Point 3. is the difficult part, because there is no real test for it: we have to reason and argue to make the case that it is a reasonable assumption. This is where Snow's quote from above comes into play: He reasons that water supply varies **randomly** over households, irrespective of their unobservables $u$. The statement is that whatever factors are present in $u$, they are present in equal proportion in households with different $z$, because assignment of $z$ was **random**. So it is hard to imagine that the identity of the water company could affect $c$ through other channels (like, whether you are poor or not is *not* a function of $z$).

We are now ready to define a simple IV estimator. Notice that conditioning \@ref(eq:snow-mod1) on values of $z$ yields

\begin{align}
E[c_i | z_i = 1] &= \alpha + \delta E[w_i | z_i = 1] + E[u_i | z_i = 1] \\
E[c_i | z_i = 0] &= \alpha + \delta E[w_i | z_i = 0] + E[u_i | z_i = 0]
\end{align}

which upon differencing both lines gives

\begin{align}
E[c_i | z_i = 1] - E[c_i | z_i = 0] &= \delta \left\{ E[w_i | z_i = 1] - E[w_i | z_i = 0]\right\} \\
&+ \underbrace{\left\{ E[u_i | z_i = 1] - E[u_i | z_i = 0] \right\}}_{=0 \text{ by the Independence assumption}}
\end{align}

The IV estimator is then obtained by isolating $\delta$ and writing

\begin{equation}
\delta = \frac{E[c_i | z_i = 1] - E[c_i | z_i = 0]}{E[w_i | z_i = 1] - E[w_i | z_i = 0]} (\#eq:IV)
\end{equation}

Notice that this is only defined if the denominator is nonzero, i.e. the Relevance condition (point 1. above) holds.

### Computing the IV Estimate

You remember from earlier chapters that equation \@ref(eq:IV) refers to a *population* parameter, i.e. *the true* value in an infinite population. To learn about its value, we need to *estimate* it from data. We can use a simple sample analog of the population expectations, i.e. sample means. With some abuse of notation let's say that *$x \mapsto y$ means that $x$ is an estimate for $y$*: 1.
$\overline{c}_1 \mapsto E[c_i | z_i = 1]$: the proportion of households supplied by Lambeth with cholera. 1. $\overline{w}_1 \mapsto E[w_i | z_i = 1]$: the proportion of households supplied by Lambeth with bad water. 1. $\overline{c}_0 \mapsto E[c_i | z_i = 0]$: the proportion of households not supplied by Lambeth with cholera. 1. $\overline{w}_0 \mapsto E[w_i | z_i = 0]$: the proportion of households not supplied by Lambeth with bad water. The estimator would then be \begin{equation} \hat{\delta} = \frac{\overline{c}_1 - \overline{c}_0}{\overline{w}_1 - \overline{w}_0} (\#eq:IVhat) \end{equation} In this special case where all involved variables $c,w,z$ are binary, the estimator is called the *Wald estimator*. Unfortunately we do not know the values of the above numbers, or at least I did not find them in readily available format (I think they are in Snow's book). So let's make some numbers up just for the sake of it. 1. $\overline{c}_1 = 0.002$: the proportion of households supplied by Lambeth with cholera. 1. $\overline{w}_1 = 0.1$: the proportion of households supplied by Lambeth with bad water. 1. $\overline{c}_0 = 0.315$: the proportion of households not supplied by Lambeth with cholera. 1. $\overline{w}_0 = 0.5$: the proportion of households not supplied by Lambeth with bad water. ```{r} delta = (0.002 - 0.315) / (0.1 - 0.5) ``` So, in this artificial dataset, we would have found an estimated **causal** effect of `r delta` of impure water on the likelihood of contracting cholera. We would write \begin{equation} \Delta \Pr(c = 1 | w ) = \alpha + \delta \times 1 - \alpha - \delta \times 0 = `r delta` \end{equation} so the probability of getting cholera is `r 100*round(delta,2)` percent higher, if you have impure water (i.e. if $w$ goes from 0 to 1). 
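The four sample means above can also be computed from individual-level data. The following is a minimal sketch on *simulated* households (none of these numbers come from Snow; the confounder `iv_poor`, the probabilities and the assumed true effect `iv_delta` are all hypothetical). It computes the Wald estimator from group means, checks that it coincides with the covariance ratio $Cov(z,c)/Cov(z,w)$, and contrasts both with the naive comparison by water quality.

```{r}
set.seed(123)                          # for reproducibility
iv_n     <- 200000                     # hypothetical number of households
iv_delta <- 0.3                        # assumed true effect of impure water

iv_poor <- rbinom(iv_n, 1, 0.5)        # unobserved confounder (part of u)
iv_z    <- rbinom(iv_n, 1, 0.5)        # 1 = Lambeth; assigned independently of iv_poor
# impure water is more likely for the poor, less likely for Lambeth customers
iv_w    <- rbinom(iv_n, 1, 0.5 + 0.3 * iv_poor - 0.4 * iv_z)
# cholera depends on water and on the confounder, but not directly on z
iv_c    <- rbinom(iv_n, 1, 0.05 + iv_delta * iv_w + 0.2 * iv_poor)

# Wald estimator: ratio of differences in group means
iv_wald <- (mean(iv_c[iv_z == 1]) - mean(iv_c[iv_z == 0])) /
           (mean(iv_w[iv_z == 1]) - mean(iv_w[iv_z == 0]))

# same quantity written as a ratio of sample covariances
iv_cov <- cov(iv_z, iv_c) / cov(iv_z, iv_w)

# naive comparison by water quality, contaminated by iv_poor
iv_naive <- mean(iv_c[iv_w == 1]) - mean(iv_c[iv_w == 0])

c(wald = iv_wald, cov_ratio = iv_cov, naive = iv_naive, truth = iv_delta)
```

With a binary instrument the Wald and covariance versions are algebraically the same number. In this simulated setting they should land close to the assumed 0.3, while the naive comparison is pushed upwards by the confounder.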
```{block, type = "warningl"}
**Summary**: IVs are a powerful tool to establish causality in contexts with observational data only and where we are concerned that the conditional mean assumption $E[u_i | x_i]=0$ is violated, hence, we cannot say *all else equal, as $x$ changes, $y$ changes like this and that*. Then we say that $x$ is *endogenous*. The key features of IV $z$ are that

1. $z$ is *relevant* for $x$. For example, in a simple regression of $x$ on $z$, we want $z$ to have considerable predictive power. We can *test* this condition in data.
2. We need a theory according to which it is *reasonable* to assume that $z$ is *unrelated* to other unobservable factors that might impact the outcome. Hence, $z$ is *exogenous* to $u$, or $E[u | z] = 0$. This is an **assumption** (i.e. we can not test this with data).
```

================================================
FILE: 11-IV2.Rmd
================================================
# IV Applications

```{r, echo = FALSE}
library(modelsummary)
gm = modelsummary::gof_map
gm$omit <- TRUE
gm$omit[gm$clean == "R2"] <- FALSE
gm$omit[gm$clean == "Num.Obs."] <- FALSE
gom = "p.value.|se_type|statistic.end|statistic.overid|statistic.weakinst"
```

An important concept in economics is the *returns to schooling*, by which we mean the causal effect of education on later earnings. If you think about it, it's a crucial question for every single student (like yourself), if not even more so for a policy maker who needs to decide how to allocate the budget (education or other things?). One very famous and early study to estimate those returns to schooling was proposed by Jacob Mincer, and an equation of this kind was henceforth known as the *Mincer Equation* (we have encountered this equation before as a running example in chapter \@ref(linreg)). He measured $\log Y_i$, annual earnings for man $i$, $S_i$ his schooling (years spent studying), and $X_i$ his (potential) work experience (age minus years of schooling minus 6).
The model can be drawn like this:

```{r mincer,warning = FALSE,message = FALSE,echo = FALSE,fig.cap="Jacob Mincer's model",fig.height = 3 }
library(ggdag)
library(dplyr)
coords <- list(
    x = c(e = 1, x = 2, y = 3, s = 2),
    y = c(e=0, x = -.5, y = 0, s = 0.5)
    )

dag <- dagify(y ~ s,
              y ~ e,
              y ~ x, coords = coords)
dag %>%
  tidy_dagitty() %>%
  mutate(linetype = ifelse(name == "e", "dashed", "solid")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point() +
  geom_dag_text() +
  geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) +
  scale_x_continuous(limits = c(0,4))+
  theme_void()
```

Earnings are assumed to be affected by experience and schooling *only*. In terms of an equation,

\begin{equation}
\log Y_i = \alpha + \rho S_i + \beta_1 X_i + \beta_2 X_i^2 + e_i (\#eq:mincer)
\end{equation}

His results implied an estimate for $\rho$ of about 0.11, or an 11% earnings advantage for each additional year of education, given a certain level of experience. Notice that in the DAG in figure \@ref(fig:mincer), we explicitly drew other unobserved factors $e$ which *only* have an arrow directly into $y$. But is that a good model? Well, why would it not be?

## Ability Bias

The model in \@ref(eq:mincer) compares earnings of men with certain schooling and work experience. The question to ask is whether, given those two controls, all else is equal. For a given value of $X$, are there more diligent and able workers out there? Do family connections vary across people with the same $X$? It seems quite likely that we'd answer yes. Well, then, all else is *not* equal, and we are in trouble. Because, again, our crucial identifying assumption for the linear model is violated, as

$$E[e_i | S_i, X_i] \neq 0.$$

Our concern can be formalized by explicitly introducing *ability* $A$ as an (unobserved) factor into our model. That means we have now *two* unobservables - of course we can't tell them apart, so let's write them as a new unobservable factor $u_i = e_i + A_i$.
Then we could visualize this new model as follows:

```{r mincer2,warning = FALSE,message = FALSE,echo = FALSE,fig.cap="Jacob Mincer's model with unobserved ability $A$. Given that it is *unobserved*, it is lumped together with all other unobservable factors in $e$, and we've called it $u = e + A$.",fig.height = 3 }
coords <- list(
    x = c(e_A = 1, x = 2, y = 3, s = 2),
    y = c(e_A =0, x = -.5, y = 0, s = 0.5)
    )

dag <- dagify(y ~ s,
              y ~ e_A,
              s ~ e_A,
              y ~ x,coords = coords)
d = dag %>%
  tidy_dagitty() %>%
  dag_label(labels = c("y" = "y","s" = "s","x" = "x","e_A"= "u=e+A")) %>%
  mutate(linetype = ifelse(label == "u=e+A", "dashed", "solid")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point() +
  geom_dag_text(aes(label=label)) +
  geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) +
  scale_x_continuous(limits = c(0,4))+
  theme_void()
d
```

In figure \@ref(fig:mincer2), the unobserved factor $A$ influences *both* years of schooling and earnings on the labor market. For example, if we think of $A$ as something like *intelligence*, it might be that more intelligent students find it less painful to attend school (it's less costly for them in terms of effort), so they get more education, and also they earn higher wages because their intelligence is rewarded in the labor market. The same works if $A$ is related to family type and network. Suppose a family with high socio-economic status is also well connected. Then high $A$ could mean that the parents of $i$ know that education is a good signalling device (so they force $i$ to go to university, say), while at the same time their good network means that high $A$ will mean a good job and hence earnings. We would write in terms of an equation

\begin{equation}
\log Y_i = \alpha + \rho S_i + \beta_1 X_i + \beta_2 X_i^2 + \underbrace{u_i}_{A_i + e_i} (\#eq:ability)
\end{equation}

Sometimes these considerations do not matter greatly, and the (biased) OLS estimate is close to the causal IV estimate.
But in other cases, we might be very far from the truth with OLS, even inferring the wrong sign of an effect. Let's look at an example!

## Birthdate is as good as Random

In an influential study, @angristkrueger address the above issues related to the ability bias in Mincer's equation by constructing an IV which encodes the birth date of a given student. The idea is that given certain features of the school system, children born shortly after a certain cutoff date will start school a year later than their peers who are a bit older than they are. For example, suppose it is mandated that all children who reach the age of 6 by the 31st of December 2021 are required to enroll in the first grade of school in September 2021. Then someone born in December 2015 (i.e. 6 years prior) will be 5 years and 3/4 by the time they start school, while someone born on the 1st of January 2016 will be 6 and 3/4 years when *they* enter school in September 2022. Furthermore, the legal dropout age in the US is 16, so by the time those pupils may decide to stop school, they have been exposed to different amounts of schooling. All of this means that an IV defined by *quarter of birth of person $i$* will affect the outcome *earnings* through its effect on schooling - keeping other factors (in particular $A$!) constant across values of the IV. What's the implication for our model?
```{r ak-mod,echo = FALSE,fig.cap="Angrist and Krueger's IV to tackle ability bias.",fig.height = 3 }
coords <- list(
    x = c(e_A = 1, z = 1, y = 3, s = 2),
    y = c(e_A =0, z = 0.5, y = 0, s = 0.5)
    )

dag <- dagify(y ~ s,
              y ~ e_A,
              s ~ e_A,
              s ~ z,coords = coords)
d = dag %>%
  tidy_dagitty() %>%
  dag_label(labels = c("y" = "y","s" = "s","z" = "z","e_A"= "u=e+A")) %>%
  mutate(linetype = ifelse(label == "u=e+A", "dashed", "solid")) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point() +
  geom_dag_text(aes(label=label)) +
  geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) +
  scale_x_continuous(limits = c(0,4))+
  theme_void()
d
```

In the DAG for @angristkrueger's model in figure \@ref(fig:ak-mod) we see that the IV directly impacts the endogenous explanatory variable $s$, but is itself independent of $u$ - we argued that $A$ is equally distributed across different birth quarters $z$ (birth date is almost random). Let us now formulate the following two-stage procedure:

1. We estimate a **first stage model** which uses only exogenous variables (like $z$) to explain our endogenous regressor $s$.
2. We then use the first stage model to *predict* values of $s$ in what is called the **second stage** or the **reduced form** model.^[It's called reduced form because this second equation is supposed to be derived from a true underlying structural model.]

Performing this procedure is supposed to take out any impact of $A$ in the correlation we observe in our data between $s$ and $y$. This estimation technique is called the **Two stage least squares** estimator, or 2SLS for short. The great virtue is that in the first stage we could have any number of exogenous variables helping to predict our endogenous $s$ (here we have just one - quarter of birth.) In terms of equations, we could write the following:

\begin{align}
\text{1. Stage: }s_i &= \alpha_0 + \alpha_1 z_i + \eta_i (\#eq:2SLS1)\\
\text{2.
Stage: }y_i &= \beta_0 + \beta_1 \hat{s}_i + u_i (\#eq:2SLS2) \end{align} where the $\hat{s}_i$ means to insert the *predicted* value from the first stage for $i$'s observed $s_i$ in the second stage regression. We can write down the conditions for a valid IV $z$ in this context: ```{block, type = "warningl"} **Conditions for a valid Instrument in this simple 2SLS setup**: 1. Relevance of the IV: $\alpha_1 \neq 0$ 1. Independence (IV assignment as good as random): $E[\eta | z] = 0$ 1. Exogeneity (our exclusion restriction): $E[u | z] = 0$ ``` ### Data on birth quarter and wages Let's load the data and look at a quick summary^[Code in this section comes from the great [mastering metrics with R](https://jrnold.github.io/masteringmetrics/) by J Arnold 👏 🙏]: ```{r ak-data} data("ak91", package = "masteringmetrics") # library(modelsummary) # loaded already for me datasummary_skim(data.frame(ak91),histogram = TRUE) ``` We convert quarter of birth to a factor: ```{r} ak91 <- mutate(ak91, qob_fct = factor(qob), q4 = as.integer(qob == "4"), yob_fct = factor(yob)) # get mean wage by year/quarter ak91_age <- ak91 %>% group_by(qob, yob) %>% summarise(lnw = mean(lnw), s = mean(s)) %>% mutate(q4 = (qob == 4)) ``` Let's reproduce their first figure now on education as a function of quarter of birth! ```{r ak91-age,fig.cap = "Reproducing figure 1 from @angristkruegerIV"} ggplot(ak91_age, aes(x = yob + (qob - 1) / 4, y = s )) + geom_line() + geom_label(mapping = aes(label = qob, color = q4)) + guides(label = FALSE, color = FALSE) + scale_x_continuous("Year of birth", breaks = 1930:1940) + scale_y_continuous("Years of Education", breaks = seq(12.2, 13.2, by = 0.2), limits = c(12.2, 13.2)) + theme_bw() ``` In figure \@ref(fig:ak91-age) we see first that there was a trend in getting more and more education as time passed. Secondly, and more importantly here, is that for almost all birth years in the sample, the group born in quarter 4 has the highest value for years of education! 
So what we said above about the institutional rules of school attendance in the US seems to be borne out in this dataset. What about earnings for those groups? ```{r ak91-wage,fig.cap = "Reproducing figure 2 from @angristkruegerIV"} ggplot(ak91_age, aes(x = yob + (qob - 1) / 4, y = lnw)) + geom_line() + geom_label(mapping = aes(label = qob, color = q4)) + scale_x_continuous("Year of birth", breaks = 1930:1940) + scale_y_continuous("Log weekly wages") + guides(label = FALSE, color = FALSE) + theme_bw() ``` Figure \@ref(fig:ak91-wage) does not show a long running trend in earnings, so on average we'd say a log weekly wage of about 5.9 (since $e^{5.9} \approx 365$, that is roughly 365 dollars per week). But, again, the group born in the fourth quarter seems special: In many cases they earn the highest or close to highest wage if compared to the other 3 groups in their birth year. So, there really seems to be a relationship between quarter of birth and later in life earnings! Let us now construct an IV estimator, which will allow us to relate the sawtooth pattern in figure \@ref(fig:ak91-age) to the one in figure \@ref(fig:ak91-wage). We will start out with using just *born in fourth quarter* as our IV. ### Running IV estimation in `R` There are several possibilities to run IV estimation in `R`. We will use the `iv_robust` function from the `estimatr` package.^[the *robust* here refers to the fact that the `estimatr` package by default chooses formulae to compute standard errors which are correcting for heteroskedasticity - we have encountered this term before in \@ref(violating). See details [here](https://declaredesign.org/r/estimatr/articles/mathematical-notes.html)] Let's estimate a simple OLS version (subject to ability bias), the first stage and second stages, and the final 2SLS estimate: ```{r} mod <- list() mod$ols <- lm(lnw ~ s, data = ak91) mod[["first stage"]] <- lm(s ~ q4, data = ak91) # IV: born in q4 is TRUE?
ak91$shat <- predict(mod[["first stage"]]) mod[["second stage"]] <- lm(lnw ~ shat, data = ak91) mod$`2SLS` <- estimatr::iv_robust(lnw ~ s | q4, data = ak91 ) # IV: born in q4 is TRUE? ``` Let's look at those models next to each other in table \@ref(tab:ms1): ```{r ms1,echo = FALSE} msummary(models = mod, stars = TRUE, statistic = 'std.error', gof_omit = 'DF|Deviance|AIC|BIC|R2|p.value|se_type|statistic|Log.Lik.|Num.Obs.|N', title = "OLS, first and second stages as well as 2SLS estimates for Angrist and Krueger (1991)") ``` Table \@ref(tab:ms1) contains a lot of information, so let's go column-wise: 1. The column labelled **ols** is the basic earnings equation similar to Mincer's model (without experience). We are worried about bias from the omitted variable *ability*, but we note that here we estimate a 7% higher wage for each additional year of schooling. 2. The next column is the first stage, i.e. the estimates for $\alpha$ in equation \@ref(eq:2SLS1). Remember we require that $\alpha_1 \neq 0$. That seems to be the case here (p-value very small). 3. Then we run the second stage model with the predicted values $\hat{s}$ from the first stage, i.e. we estimate the $\beta$s in \@ref(eq:2SLS2). You should compare `s` and `shat` in the first and third column. 4. Finally, we perform first and second stage estimation in one go (you would usually go down this route directly) with 2SLS. You should compare `shat` from the second stage with the `s` estimate from the 2SLS model! The reason you should always go directly to something like `iv_robust` is that this procedure handles computation of standard errors correctly. In other words, the displayed standard error in the second stage for `shat` (0.03) is not taking into account that we estimated `shat` itself in a previous step - `iv_robust` does (note the small difference to 0.028). ### Additional Control Variables We saw in figure \@ref(fig:ak91-age) that there is a clear time trend in years of schooling.
There are also business-cycle fluctuations in earnings, even if we were not able to see them from the graph above. It is probably a good idea to control for calendar year in order to guard against any time effects in our results. Also, we can use more than one IV! Here is how: ```{r} mod$ols_yr <- update(mod$ols, . ~ . + yob_fct) # just update previous model mod[["2SLS_yr"]] <- estimatr::iv_robust(lnw ~ s + yob_fct | q4 + yob_fct, data = ak91 ) # add exogenous vars on both sides of | ! mod[["2SLS_all"]] <- estimatr::iv_robust(lnw ~ s + yob_fct | qob_fct + yob_fct, data = ak91 ) # use all quarters as IVs # here is how to make the table: rows <- data.frame(term = c("Instruments","Year of birth"), ols = c("none","no"), SLS = c("Q4","no"), ols_yr = c("none","yes"), SLS_yr = c("Q4","yes"), SLS_all = c("All Quarters","yes") ) names(rows)[c(3,5,6)] <- c("2SLS","2SLS_yr","2SLS_all") modelsummary::msummary(models = mod[c("ols","2SLS","ols_yr","2SLS_yr","2SLS_all")], statistic = 'std.error', gof_omit = 'DF|Deviance|AIC|BIC|R2|p.value|se_type|statistic|Log.Lik.|Num.Obs.|N', title = "Adding Year as control, and more IVs", add_rows = rows, coef_omit = 'yob_fct') ``` ## IV Mechanics {#IV-mech} Let's now look a little closer under the hood of our simple IV estimator. We want to understand how inference of IV relates to OLS inference, and what we can say about *weak* instruments, i.e. IVs with small predictive power in the first stage. Let's go back to our simple linear model $$ y = \beta_0 + \beta_1 x + u (#eq:iv0) $$ where we think that $Cov(x,u) \neq 0$, hence, that $x$ is *endogenous* in this equation. By the way, IV estimation will work whether or not $Cov(x,u) \neq 0$, but we should prefer OLS if $x$ is exogenous, as should become clear soon. We now know that the conditions under which IV $z$ will deliver consistent estimates are the following: 1. **first stage** or **relevance**: $Cov(z,x) \neq 0$ 2. 
**IV exogeneity**: $Cov(z,u) = 0$: the IV is exogenous in the outcome equation. To reiterate, condition 2 here calls for $z$ to have no partial effect on $y$, after $x$ and other omitted variables have been considered (they are in $u$), hence, that $z$ is uncorrelated with $u$. Figure \@ref(fig:IV-dag2) shows a valid IV in panel A and an IV which violates condition 2 in panel B. ```{r IV-dag2,warning = FALSE,message = FALSE,echo = FALSE,fig.cap="A valid IV (A) and one where the exogeneity assumption is violated (B).",fig.height = 3 } coords <- list( x = c(z = 1, x = 3, u = 4, y = 5), y = c(z = 0, x = 0, u = 0.5, y = 0) ) dag1 <- dagify(y ~ x + u, x ~ z + u, coords = coords) d1 = dag1 %>% tidy_dagitty() %>% mutate(linetype = ifelse(name == "u", "dashed", "solid")) %>% ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_dag_point() + geom_dag_text() + geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) + theme_void() dag2 <- dagify(y ~ x + u, x ~ z + u, z ~ u, coords = coords) d2 = dag2 %>% tidy_dagitty() %>% mutate(linetype = ifelse(name == "u", "dashed", "solid")) %>% ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_dag_point() + geom_dag_text() + geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) + theme_void() cowplot::plot_grid(d1,NULL,d2,nrow = 1 , rel_widths = c(1, 0.15, 1) , labels = c("(A)", "", "(B)")) ``` Let us now discuss how conditions 1. and 2. are helpful in *identifying* parameter $\beta_1$ above. By this we mean our ability to express $\beta_1$ in terms of population moments, which we can estimate from a sample of data. We start by computing the covariance of the instrument $z$ with outcome $y$. \begin{align} Cov(z,y) &= Cov(z, \beta_0 + \beta_1 x + u) \\ &= \beta_1 Cov(z,x) + Cov(z,u) \end{align} Under condition 2. above (*IV exogeneity*), we have $Cov(z,u)=0$, hence $$ Cov(z,y) = \beta_1 Cov(z,x) $$ and under condition 1. 
(*relevance*), we have $Cov(z,x)\neq0$, so that we can divide the equation through to obtain $$ \beta_1 = \frac{Cov(z,y)}{Cov(z,x)}. $$ This shows that the parameter $\beta_1$ is *identified* via population moments $Cov(z,y)$ and $Cov(z,x)$. What is more, we can *estimate* those moments via their *sample analogs*, hence we have as an IV estimator this expression, where we just *plug in* the sample estimators for the population moments: $$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^n (z_i - \bar{z})(x_i - \bar{x})} (#eq:iv-estim) $$ The corresponding intercept estimate $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ is identical to before (modulo using \@ref(eq:iv-estim)). Given both assumptions 1. and 2. are satisfied, we say that *the IV estimator is consistent for $\beta_1$*. This can also be written as $$ \text{plim}(\hat{\beta}_1) = \beta_1 $$ meaning that, as the sample size $n$ increases, the **probability limit** (plim) of the estimator $\hat{\beta}_1$ is the true value $\beta_1$.^[More precisely, we say that a sequence of random variables indexed by sample size $n$, ${X_n}$ say, *converges in probability* to the random variable $X$, if for all $\varepsilon > 0$, $$ \lim_{n \to \infty} \Pr \left(|X_n - X | > \varepsilon \right) = 0 $$ ] ### IV Inference Let us extend the homoskedasticity assumption to $z$, such that $E(u^2|z) = \sigma^2$, implying that the asymptotic (i.e. as the sample size gets very large) variance of the IV slope estimator is given by $$ Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{n \sigma_x^2 \rho_{x,z}^2} (#eq:iv-var) $$ where $\sigma_x^2$ is the population variance of $x$, $\sigma^2$ the one of $u$, and $\rho_{x,z}$ is the population correlation between $x$ and $z$ - a measure of *how strongly* our IV and endogenous variable $x$ are correlated in the population. You can see 2 important things in equation \@ref(eq:iv-var): 1. 
Without the term $\rho_{x,z}^2$ in the denominator, this is identical to the variance of the OLS slope estimator. 2. As with the variance of the OLS slope estimator, as sample size $n$ increases, the variance decreases. It is convenient to replace $\rho_{x,z}^2$ with $R_{x,z}^2$, i.e. the R-squared of a regression of $x$ on $z$ - in a single regressor model we have this exact correspondence. It is convenient because we can now rewrite the variance of the IV slope as $$ Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{n \sigma_x^2 R_{x,z}^2} $$ 1. Given $R_{x,z}^2 < 1$ in most real life situations, we have that $Var(\hat{\beta}_{1,IV}) > Var(\hat{\beta}_{1,OLS})$ almost certainly. 1. The higher the correlation between $z$ and $x$, the closer their $R_{x,z}^2$ is to 1. With $R_{x,z}^2 = 1$ we get back to the OLS variance. This is no surprise, because that implies that $z$ is an exact linear function of $x$. So, if you have a valid, exogenous regressor $x$, you should *not* perform IV estimation using $z$ to obtain $\hat{\beta}$, since your variance will be unnecessarily large. #### Returns to Education for Married Women Consider the following model for married women's wages: $$ \log wage = \beta_0 + \beta_1 educ + u $$ Let's run an OLS on this, and then compare it to an IV estimate using *father's education*. Keep in mind that this is a valid IV $z$ if 1. *fatheduc* and *educ* are correlated 2. *fatheduc* and $u$ are not correlated. ```{r mroz1} data(mroz,package = "wooldridge") mods = list() mods$OLS <- lm(lwage ~ educ, data = mroz) mods[['First Stage']] <- lm(educ ~ fatheduc, data = subset(mroz, inlf == 1)) mods$IV <- estimatr::iv_robust(lwage ~ educ | fatheduc, data = mroz) modelsummary::modelsummary(mods, gof_map = gm, gof_omit = gom, title = "Mroz female labor supply and wage data.") ``` The results in table \@ref(tab:mroz1) show in the first column that an additional year of education implies an 11% increase in hourly wages for women.
This is a standard OLS estimate which may be biased because of *ability bias*. In the second column we show the first stage of the IV procedure. We see that *fatheduc* is indeed a statistically significant predictor of $educ$: Each additional year of father's education increases women's education by more than a quarter of a year (0.269). Also important, we observe that the $R^2$ here is about 17%. Turning to the final IV estimate in the third column, we can see that using *fatheduc* as an IV reduces the return to education by about half to 5.9%! This result *suggests* that OLS is biased upwards (for example by ability bias). But let's compare the standard errors of OLS and IV estimates, which are 0.014 for OLS vs 0.037 for IV. This can be seen in figure \@ref(fig:se-plot). You can clearly see that the confidence intervals of both estimators overlap, hence from this alone we cannot conclude they are different (we need a special statistical test to decide this). ```{r se-plot,echo = FALSE,fig.cap="OLS vs IV Standard Errors: The dots represent the point estimates, and the solid vertical lines the 95% confidence intervals for both estimators."} coefs_ols = broom::tidy(mods$OLS, conf.int = TRUE) coefs_IV = broom::tidy(mods$IV, conf.int = TRUE) bind_rows( coefs_ols %>% filter(term == "educ") %>% mutate(estimator = "OLS"), coefs_IV %>% filter(term == "educ") %>% mutate(estimator = "IV") ) %>% ggplot(aes(x=estimator, y=estimate, ymin=conf.low, ymax=conf.high)) + geom_hline(yintercept = 0.0, color = "red", size =1.2) + geom_pointrange() + theme_bw() ``` ### IV with a *Weak* Instrument We have seen that IV will produce consistent estimates under our stated assumptions. However, even under valid assumptions, we get large IV standard errors if the correlation between IV and endogenous $x$ is small.
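
The link between instrument strength and the precision of IV can be illustrated with a small simulation. The following is a minimal sketch on made-up data, not part of the original analysis: the helper names `iv_slope` and `one_draw`, the sample size and all parameter values are assumptions chosen for illustration. The IV slope is computed as the ratio of sample covariances, and the spread is summarised with the median and interquartile range because this ratio can have very heavy tails when its denominator is close to zero.

```r
# Sketch with simulated data (all numbers below are made up for illustration).
set.seed(1)

# IV slope as the ratio of sample covariances (plug-in estimator)
iv_slope <- function(y, x, z) cov(z, y) / cov(z, x)

# draw one sample with first-stage coefficient `strength`, return the IV slope
one_draw <- function(strength, n = 500, beta1 = 1) {
  z <- rnorm(n)                      # instrument, drawn independently of u
  u <- rnorm(n)                      # structural error
  x <- strength * z + u + rnorm(n)   # x is endogenous: it contains u
  y <- 2 + beta1 * x + u             # true slope is beta1
  iv_slope(y, x, z)
}

strong <- replicate(1000, one_draw(strength = 1))    # strong first stage
weak   <- replicate(1000, one_draw(strength = 0.1))  # weak first stage

# centre and spread of the IV estimates across the simulated samples
rbind(strong = c(median = median(strong), IQR = IQR(strong)),
      weak   = c(median = median(weak),   IQR = IQR(weak)))
```

Both designs use a valid (exogenous) instrument; only the strength of the first stage differs, so the difference in spread between the two rows comes from instrument strength alone.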
What is even worse is that *even if* we have only a very small correlation between $z$ and $u$, so that we might *almost* be happy to assume exogeneity, a small correlation between $x$ and $z$ can produce **inconsistent** estimates. To see this, consider the probability limit of the IV estimator again $$ \text{plim}(\hat{\beta}_{1,IV}) = \beta_1 + \frac{Cor(z,u)}{Cor(z,x)} \cdot \frac{\sigma_u}{\sigma_x} $$ The interesting part here involves the correlation terms. *Even if* $Cor(z,u)$ is very small, a **weak instrument**, i.e. one with only a small absolute value for $Cor(z,x)$, will blow up this second term in the probability limit. This would mean that even with a very big sample size $n$, our estimator would **not converge** to the true population parameter $\beta_1$, because we are using a weak instrument. To illustrate this point, let's assume we want to look at the impact of number of packs of cigarettes smoked per day by pregnant women (*packs*) on the birthweight of their child (*bwght*): $$ \log(bwght) = \beta_0 + \beta_1 packs + u $$ We are worried that smoking behavior is correlated with a range of other health-related variables which are in $u$ and which could impact the birthweight of the child - think of diet, physical exercise, and other lifestyle choices. So we look for an IV. Suppose we use the price of cigarettes (*cigprice*), assuming that the price of cigarettes is uncorrelated with factors in $u$ - the price of cigarettes would not impact birthweight (apart from through its effect on smoking behaviour, of course).
Let's run the first stage, i.e. a regression of *packs* on *cigprice*, and then let's show the 2SLS estimates: ```{r bw} data(bwght, package = "wooldridge") mods <- list() mods[["First Stage"]] <- lm(packs ~ cigprice, data = bwght) mods[["IV"]] <- estimatr::iv_robust(log(bwght) ~ packs | cigprice, data = bwght) modelsummary(mods, gof_map = gm, gof_omit = gom, title = "IV regression with weak instrument *cigprice*") ``` The first column of table \@ref(tab:bw) shows that the first stage is *very* weak. The partial effect of *cigprice* on *packs* smoked is essentially zero! People don't seem to care a great deal about the price of cigarettes (at least in the range of price variation observed in this dataset). The $R^2$ of that first stage is thus zero. What do we get if we use that IV nonetheless in a 2SLS estimation as in column 2? We get a huge coefficient of unexpected sign on *packs* (smoking more increases birthweight? 🤔), with very large standard error, so statistically speaking, we cannot distinguish this from zero. What is more important, however, is that even if they *were* significant, the estimates of column 2 are **invalid**. The *relevance* condition for the IV is clearly not satisfied, hence the approach is invalid. ⛔ ================================================ FILE: 12-panel.Rmd ================================================ # Panel Data ## Crime Rate vs Probability of Arrest This part draws heavily on [Nick C Huntington-Klein's](http://nickchk.com) outstanding [slides](https://github.com/NickCH-K/EconometricsSlides).
🙏 Up until now we have dealt with data that looked like the following, where `County` is the identifier for a county, `CrimeRate` is the number of crimes committed per person, and `ProbofArrest` is the probability of arrest, given crime committed: ```{r,message=FALSE,warning=FALSE,echo = TRUE} library(dplyr) library(ggplot2) data(crime4,package = "wooldridge") crime4 %>% filter(year == 81) %>% arrange(county,year) %>% select(county, crmrte, prbarr) %>% rename(County = county, CrimeRate = crmrte, ProbofArrest = prbarr) %>% slice(1:5) %>% knitr::kable(align = "ccc") ``` We would have a unit identifier (like `County` here), and some observables on each unit. Such a dataset is usually called a **cross-sectional** dataset, providing one single snapshot view about variables from a study population at a single point in time. Each row, in other words, was one *observation*. **Panel data**, or **longitudinal** datasets, on the other hand, also index units over *time*. In the above dataset, for example, we could record crime rates in each county *and in each year*: ```{r,echo = FALSE} crime4 %>% select(county, year, crmrte, prbarr) %>% arrange(county,year) %>% rename(County = county, Year = year, CrimeRate = crmrte, ProbofArrest = prbarr) %>% slice(1:9) %>% knitr::kable(align = "ccc") ``` Here each unit $i$ (e.g. county `1`) is observed *several times*. Let's start by looking at the dataset as a single cross section (i.e.
we forget about the $t$ index and treat each observation as independent over time) and investigate the relationship between crime rates and probability of arrest in counties number `1,3,23,145`: ```{r crime1,echo = TRUE,fig.cap = "Probability of arrest vs Crime rates in the cross section",message = FALSE} css = crime4 %>% filter(county %in% c(1,3,145, 23)) # subset to 4 counties ggplot(css,aes(x = prbarr, y = crmrte)) + geom_point() + geom_smooth(method="lm",se=FALSE) + theme_bw() + labs(x = 'Probability of Arrest', y = 'Crime Rate') ``` We see an upward-sloping regression line, so it seems that the higher the crime rate, the higher the probability of arrest. In particular, we'd get: ```{r} xsection = lm(crmrte ~ prbarr, css) coef(xsection)[2] # gets slope coef ``` ```{r,echo = FALSE} xsection_p = round(predict(xsection,newdata = data.frame(prbarr = c(0.2,0.3))),3) ``` such that we'd associate an increase of 10 percentage points in the probability of arrest (`prbarr` goes from 0.2 to 0.3) with an increase in crime rate from `r xsection_p[1]` to `r xsection_p[2]`, or a `r round(100 * diff(xsection_p) / xsection_p[1],2)` percent increase. Ok, but what does that *mean*? Literally, it tells us counties with a higher probability of being arrested also have a higher crime rate. So, does it mean that as there is more crime in certain areas, the police become more efficient at arresting criminals, and so the probability of getting arrested on any committed crime goes up? What does police efficiency depend on? Does the poverty level in a county matter for this? The local laws? 🤯 wow, there seem to be too many things left out of this simple picture. It's impossible to decide whether this estimate *makes sense* or not like this. A DAG to the rescue! 
```{r cri-dag,echo = FALSE,message = FALSE,fig.cap="DAG to answer *what causes the local crime rate?*"} library(ggdag) coords <- list( x = c(ProbArrest = 1,LawAndOrder = 1, Police = 1.5, CivilRights = 3,Poverty = 3, CrimeRate = 5, LocalStuff = 5), y = c(ProbArrest = 1,LawAndOrder = 4, Police = 2.5, CivilRights = 2,Poverty = 4, CrimeRate = 1, LocalStuff = 4) ) dagify(CrimeRate ~ ProbArrest, CrimeRate ~ LocalStuff, CrimeRate ~ Poverty, CrimeRate ~ CivilRights, ProbArrest ~ LocalStuff, Poverty ~ LocalStuff, ProbArrest ~ Poverty, ProbArrest ~ LawAndOrder, ProbArrest ~ Police, CivilRights ~ LawAndOrder, Police ~ LawAndOrder, labels = c("CrimeRate" = "Crime Rate", "ProbArrest" = "ProbArrest", "LocalStuff" = "LocalStuff", "Poverty" = "Poverty", "Police" = "Police", "CivilRights" = "CivilRights", "LawAndOrder" = "LawAndOrder" ), exposure = "ProbArrest", outcome = "CrimeRate", coords = coords) %>% ggdag(text = FALSE, use_labels = "label") + ggtitle("What causes the Crime Rate in County i?") + theme_dag() ``` In figure \@ref(fig:cri-dag) we've written `LawAndOrder` for how committed local politicians are to *law and order politics*, and `LocalStuff` for everything that is unique to a particular county apart from the things we've listed. So, at least we can appreciate the full problem now, but it's still really complicated. Let's try to think about *at which level* (i.e. county or time) each of those factors *varies*: * `LocalStuff` are things that describe the county, like geography, and other persistent features. * `LawAndOrder` and how many `CivilRights` one gets might change a little from year to year, but not very drastically. Let's assume they are fixed characteristics as well. * `Police` budget and the `Poverty` level vary by county and by year: an elected politician has some discretion over police spending (not too much, but still), and poverty varies with the national/global state of the economy.
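
For the variables we do observe, we can check at which level they vary by splitting their total variation into a part that comes from differences in county averages and a part that comes from year-to-year movements around those averages. The following is a minimal sketch for `prbarr` and `crmrte` in `crime4`; the helper `split_variation` is a hypothetical name introduced here for illustration and is not part of any package.

```r
# Sketch: split the variance of a panel variable into an "across counties"
# part and an "over time, around the county mean" part.
# `split_variation` is a hypothetical helper, introduced for illustration only.
data(crime4, package = "wooldridge")

split_variation <- function(x, id) {
  unit_mean <- ave(x, id)                      # each unit's mean over time
  c(total   = mean((x - mean(x))^2),           # overall variance (divisor n)
    between = mean((unit_mean - mean(x))^2),   # spread of the unit means
    within  = mean((x - unit_mean)^2))         # spread around the unit means
}

v_prb <- split_variation(crime4$prbarr, crime4$county)
v_crm <- split_variation(crime4$crmrte, crime4$county)

round(rbind(prbarr = v_prb, crmrte = v_crm), 6)

# share of the total variation that comes from differences across counties
c(prbarr = v_prb[["between"]] / v_prb[["total"]],
  crmrte = v_crm[["between"]] / v_crm[["total"]])
```

By construction the two parts add up to the total, so the shares tell us how much of the variation in each variable is cross-county and how much is over time.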
You will often hear the terms *within* and *between* variation in panel data contexts. If we think of our data as classified into groups of $i$ (i.e., counties), the *within* variation refers to things that change *within each group* over time: here we said police budgets and poverty levels would change within each group and over time. On the other hand, we said that `LocalStuff`, `LawAndOrder` and `CivilRights` were persistent features of each group, hence they would *not* vary over time (or *within* the group) - they would differ only across or **between** groups. Let's try to separate those out visually! ```{r,echo = FALSE,message = FALSE} pcolor = css %>% group_by(county) %>% mutate(label = case_when( crmrte == max(crmrte) ~ paste('County',county), TRUE ~ NA_character_ ), mcrm = mean(crmrte), mpr = mean(prbarr)) %>% ggplot(aes(x = prbarr, y = crmrte, label = label)) + geom_point(aes(color = factor(county))) + theme_bw() + geom_smooth(method = "lm", se=FALSE) + labs(x = 'Probability of Arrest', y = 'Crime Rate', color = "County") pcolor ``` That looks intriguing! 
Let's add the mean of `ProbofArrest` and `CrimeRate` for each of the counties to that plot, in order to show the *between* county variation: ```{r,echo = FALSE,fig.height = 3,warning = FALSE,message = FALSE} p1 = css %>% group_by(county) %>% mutate(label = case_when( crmrte == max(crmrte) ~ paste('County',county), TRUE ~ NA_character_ ), mcrm = mean(crmrte), mpr = mean(prbarr)) %>% ggplot(aes(x = prbarr, y = crmrte, label = label)) + geom_point(aes(color = factor(county))) + theme_bw() + # geom_smooth(method = "lm", se=FALSE) + scale_x_continuous(limits = c(0.1,0.43)) + scale_y_continuous(limits = c(0.01,0.041)) + labs(x = 'Probability of Arrest', y = 'Crime Rate', color = "County") + # scale_color_manual(values = c('black','blue','red','purple')) geom_point(aes(x = mpr, y = mcrm,color = factor(county)), size = 20, shape = 3) + annotate(geom = 'text', x = .3, y = .02, label = 'Means Within Each County', color = 'darkorange', size = 14/.pt) + guides(color = FALSE, labels = FALSE) # p11 = css %>% # ggplot(aes(x = prbarr, y = crmrte)) + geom_smooth(method = "lm", se = FALSE) p2 = css %>% group_by(county) %>% mutate(label = case_when( crmrte == max(crmrte) ~ paste('County',county), TRUE ~ NA_character_ ), mcrm = mean(crmrte), mpr = mean(prbarr)) %>% ggplot(aes(x = mpr, y = mcrm)) + theme_bw() + geom_smooth(method = "lm",se = FALSE) + geom_point(size = 20, shape = 3, aes(color = factor(county))) + scale_x_continuous(limits = c(0.1,0.43)) + scale_y_continuous(limits = c(0.01,0.041)) + labs(x = 'Probability of Arrest', y = 'Crime Rate') + guides(color = FALSE, labels = FALSE) cowplot::plot_grid(p1,p2,axis = "tb") ``` Simple OLS on the cross section (i.e. not taking into account the panel structure) seems to recover only the *between* group differences. It fits a line to the group means. Well this considerably simplifies our DAG from above! 
Let's collect all group-specific time-invariant features in the factor `County` - we don't really care about what they all are, because we can net the group effects out of the data it seems: ```{r cri-dag2,echo = FALSE,message = FALSE,fig.cap="DAG to answer *what causes the local crime rate?*"} coords <- list( x = c(ProbArrest = 1,Poverty = 1, Police = 1.5, County = 3, CrimeRate = 4), y = c(ProbArrest = 1,Police = 2.5, Poverty = 4, CrimeRate = 1, County = 4) ) dagify(CrimeRate ~ ProbArrest, CrimeRate ~ County, CrimeRate ~ Poverty, ProbArrest ~ Poverty, ProbArrest ~ County, Poverty ~ County, ProbArrest ~ Police, Police ~ County, labels = c("CrimeRate" = "Crime Rate", "ProbArrest" = "ProbArrest", "County" = "County", "Poverty" = "Poverty", "Police" = "Police"), exposure = "ProbArrest", outcome = "CrimeRate", coords = coords) %>% ggdag(text = FALSE, use_labels = "label") + theme_dag() ``` So, controlling for `County` takes care of all factors which do *not* vary over time within each unit. Police and Poverty will have a county-specific mean value, but there will be variation over time. We will basically be able to compare each county with itself at different points in time. ## Panel Data Estimation with `R` We have now seen several instances of problems with simple OLS arising from *omitted variable bias*. For example, if the true model read $$ y_i = \beta_0 + \beta_1 x_i + c_i + u_i $$ with $c_i$ unobservable and potentially correlated with $x_i$, we were in trouble because the orthogonality assumption fails: $E[u_i+c_i|x_i]\neq 0$ ($u_i+c_i$ is the total unobserved component). We have seen such an example where $c_i=A_i$ was ability and $x_i=s_i$ was schooling, and we were worried about *ability bias*. One solution we discussed was to find an IV which is correlated with schooling (quarter of birth) - but not with ability (we thought ability is equally distributed across birthdates). Today we'll look at solutions when we have *more than a single* observation for each unit $i$.
To be precise, let's put down a basic unobserved effects model like this: $$ y_{it} = \beta_1 x_{it} + c_i + u_{it},\quad t=1,2,\dots,T (\#eq:panel) $$ The object of interest here is $c_i$, called the *individual fixed effect*, *unobserved effect* or *unobserved heterogeneity*. The important thing to note is that it is fixed over time (ability $A_i$ for example). ### Dummy Variable Regression The simplest approach is arguably this: we could take the equation literally and estimate a linear model where we include a dummy variable for each $i$. This is closest to what we said above is *controlling for county* - that's exactly what we do here. You can see in \@ref(eq:panel) that each $i$ has basically their own intercept $c_i$, so this works. In `R` you achieve this like so: ```{r} mod = list() mod$dummy <- lm(crmrte ~ prbarr + factor(county), css) # i is the unit ID broom::tidy(mod$dummy) ``` Here is what we are talking about in a picture: ```{r dummy,echo = FALSE,message = FALSE} # not sure that's helpful css$pred <- predict(mod$dummy) # get predicted line pcolor = css %>% group_by(county) %>% ggplot(aes(x = prbarr, y = crmrte, color =factor(county) )) + geom_point() + geom_line(aes(y = pred )) + theme_bw() + # geom_smooth(method = "lm", se=FALSE) + labs(x = 'Probability of Arrest', y = 'Crime Rate', color = "County") pcolor ``` It's evident that *within* each county, there is a negative relationship. The dummy variable regression allows for different intercepts (county `1` is the reference group), and one common slope coefficient $\beta_1$ (you can see that the lines are parallel). You can see in the picture that what the dummies are doing is shifting each county's line down from that of the reference group 1.
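
The county-specific intercepts can be made explicit by dropping the common intercept from the dummy variable regression. The following is a minimal sketch, assuming the same four counties as in `css` above; the data are rebuilt as `css4` so that the block is self-contained, and the names `css4`, `with_ref` and `own_int` are introduced here for illustration only.

```r
# Sketch: two equivalent ways of writing the dummy variable regression.
data(crime4, package = "wooldridge")
css4 <- subset(crime4, county %in% c(1, 3, 23, 145))   # same 4 counties as css

# (a) common intercept + dummies: county 1 is the reference group
with_ref <- lm(crmrte ~ prbarr + factor(county), data = css4)

# (b) no common intercept: one intercept per county, an estimate of each c_i
own_int <- lm(crmrte ~ 0 + prbarr + factor(county), data = css4)

# the slope on prbarr is the same in both parameterisations
coef(with_ref)["prbarr"]
coef(own_int)["prbarr"]

# intercept of county 3: reference intercept plus its shift in (a) ...
coef(with_ref)["(Intercept)"] + coef(with_ref)["factor(county)3"]
# ... equals the county 3 intercept reported directly in (b)
coef(own_int)["factor(county)3"]
```

Both versions fit the same parallel lines; they only differ in whether the county effects are reported as shifts relative to county 1 or as intercepts in their own right.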
### First Differencing If we only had $T=2$ periods, we could just take the difference between both periods: \begin{align} y_{i1} &= \beta_1 x_{i1} + c_i + u_{i1} \\ y_{i2} &= \beta_1 x_{i2} + c_i + u_{i2} \\ & \Rightarrow \\ y_{i1}-y_{i2} &= \beta_1 (x_{i1} - x_{i2}) + c_i-c_i + u_{i1}-u_{i2} \\ \Delta y_{i} &= \beta_1 \Delta x_{i} + \Delta u_{i} \end{align} where $\Delta$ means *difference over time of* and to recover the parameter of interest $\beta_1$ we would run ```{r,eval=FALSE} lm(deltay ~ deltax, diff_data) ``` ### The Within Transformation In cases with $T>2$ we need a different approach - this is the most relevant case. One important concept is called the *within* transformation.^[Different packages implement different flavours of this procedure, this is the main gist] This is directly related to our discussion from above when we simplified our DAG. So, *controlling for group identity and only looking at time variation* is what we said - let's write it down! Here we denote as $\bar{x}_i$ the average *over time* of $i$'s $x$ values: $$ \bar{x}_i = \frac{1}{T} \sum_{t=1}^T x_{it} $$ With this in hand, the transformation goes like this: 1. for all variables compute their time-mean for each unit $i$: $\bar{x}_i,\bar{y}_i$ etc 1. for each observation, subtract that time mean from the actual value and define $(x_{it} - \bar{x}_i),(y_{it}-\bar{y}_i)$ 1. Finally, regress $(y_{it}-\bar{y}_i)$ on $(x_{it} - \bar{x}_i)$ This *works* for our problem with fixed effect $c_i$ because $c_i$ is not time varying by assumption! Hence its time mean is $c_i$ itself and it drops out: $$ y_{it}-\bar{y}_i = \beta_1 (x_{it} - \bar{x}_i) + c_i - c_i + u_{it}-\bar{u}_i $$ It's easy to do yourself!
First let's compute the demeaned values: ```{r} cdata <- css %>% group_by(county) %>% mutate(mean_crime = mean(crmrte), mean_prob = mean(prbarr)) %>% mutate(demeaned_crime = crmrte - mean_crime, demeaned_prob = prbarr - mean_prob) ``` Then let's run the models with OLS: ```{r tab1} mod$xsect <- lm(crmrte ~ prbarr, data = cdata) mod$demeaned <- lm(demeaned_crime ~ demeaned_prob, data = cdata) gom = 'DF|Deviance|AIC|BIC|p.value|se_type|R2 Adj. |statistic|Log.Lik.|Num.Obs.' # stuff to omit from table modelsummary::modelsummary(mod[c("xsect","dummy","demeaned")], statistic = 'std.error', title = "Comparing (biased) X-sectional OLS, dummy variable and manual demeaning panel regressions", coef_omit = "factor", gof_omit = gom) ``` Notice how in table \@ref(tab:tab1) the estimate for `prbarr` is positive in the cross-section, like in figure \@ref(fig:crime1). If we take care of the unobserved heterogeneity $c_i$ either by including an intercept for each $i$ or by time-demeaning the data, we obtain the same estimate: `r round(coef(mod$demeaned)[2],3)` in both cases. ```{r,echo = FALSE} panel_p = round(predict(mod$dummy,newdata = data.frame(prbarr = c(0.2,0.3), county = factor(1))),3) ``` We interpret those *within* estimates by imagining that we look at a single unit $i$ and asking: *if the arrest probability in $i$ increases by 10 percentage points (i.e. from 0.2 to 0.3) from year $t$ to $t+1$, we expect crimes per person to fall from `r panel_p[1]` to `r panel_p[2]`, or by `r round(100 * diff(panel_p) / panel_p[1],2)` percent* (in the reference county number 1). ### Using a Package In real life you will hardly ever perform the within-transformation by yourself, but use a package instead. There are several options (`fixest` is fastest).
```{r} mod$FE = fixest::feols(crmrte ~ prbarr | county, cdata) modelsummary::modelsummary(mod[c("xsect","dummy","demeaned","FE")], statistic = 'std.error', title = "Comparing (biased) X-sectional OLS, dummy variable, manual demeaning and fixest panel regressions", coef_omit = "factor", gof_omit = paste(gom,"Std. errors","R2",sep = "|")) ``` Again, we get the same result as with manual demeaning 😅. Let's finish off with a nice visualisation by [Nick C Huntington-Klein](http://nickchk.com) which illustrates how the within transformation works in this example. If you look back at $$ y_{it}-\bar{y}_i = \beta_1 (x_{it} - \bar{x}_i) + u_{it}-\bar{u}_i $$ you can see that we perform a form of **data centering** in the within transformation: subtracting from each variable its unit-specific time mean centers all variables! Here's how this looks (only visible in HTML version online). ```{r anim, echo=FALSE, fig.cap = "Animation of a fixed effects panel data estimator: we remove *between group* variation and concentrate on *within group* variation only", fig.width=5, fig.height=4.5,message = FALSE, warning = FALSE, eval = knitr::is_html_output()} library(gganimate) cranim <- css %>% mutate(allcrm = mean(crmrte), allmpr = mean(prbarr)) %>% group_by(county) %>% mutate(label = case_when( crmrte == max(crmrte) ~ paste('County',county), TRUE ~ NA_character_ ), mcrm = mean(crmrte), mpr = mean(prbarr), stage = '1. Raw Data') cranim <- cranim %>% bind_rows(cranim %>% mutate(crmrte = crmrte - mcrm + allcrm, prbarr = prbarr - mpr + allmpr, mcrm = allcrm, mpr = allmpr, stage = '2.
Remove all between variation')) p <- ggplot(cranim, aes(x = prbarr, y = crmrte, color = factor(county), label = label)) + geom_point() + geom_text(hjust = -.1, size = 14/.pt) + labs(x = 'Probability of Arrest', y = 'Crime Rate') + guides(color = FALSE, label = FALSE) + # scale_color_manual(values = c('black','blue','red','purple')) + geom_smooth(aes(color = NULL), method = 'lm', se = FALSE)+ theme_bw() + geom_point(aes(x = mpr, y = mcrm), size = 20, shape = 3, color = 'darkorange') + transition_states(stage) animate(p, nframes = 80) ``` ================================================ FILE: 13-discrete.Rmd ================================================ # Binary Outcomes {#binary} Until now we have encountered only contiunously distributed outcomes on the right hand side of our estimation equations. For example, in our typical linear model, we would define \begin{align} y &= b_0 + b_1 + e \\ e &\sim N\left(0,\sigma^2\right) \end{align} where the second line defines the unobservable $e$ to be drawn from the Normal distribution with mean zero and variance $\sigma^2$.^[We have not insisted too much on the fact that $e$ should be distributed according to the *Normal* distribution (this is required in particular for the theoretical derivation of standard errors as seen in chapter \@ref(std-errors)). However, we'd always have an unbounded and continuous distribution underlying our models] That means that, at least in principle, $y$ could be any number from the real line ($e$ could be arbitrarily small or large), and we can say that $y \in \mathbb{R}$. For the outcomes we studied, that was fine: test scores, earnings, crime rates etc are all continuous outcomes. But some outcomes are clearly binary (i.e. either `TRUE` or `FALSE`): * You either work or you don't, * You either have children or you don't, * You either bought a product or you didn't, * You flipped a coin and it came up either heads or tails. 
In this situation, our outcome is restricted to come from a small set of values: `FALSE` vs `TRUE`, or `0` vs `1`. We'd have $y \in \{0,1\}$. In those situations we are primarily interested in estimating the **response probability** or the **probability of success**,

$$
p(x) = \Pr(y=1 | x),
$$

or in words, *the probability of observing $y=1$ (a success), given explanatory variables $x$*. In particular, we will often be interested in learning how $p(x)$ changes as we change $x$ - that is, we are interested in the same *partial effect* of $x$ on the outcome as in our usual linear regression setup. Here, we ask

```{block,type = "tip"}
If we increase $x$ by one unit, how would the probability of $y=1$ change?
```

It is worth reminding ourselves of two simple facts about binary random variables (i.e. drawn from the [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution) distribution). We call a random variable $y \in \{0,1\}$ such that

\begin{align}
\Pr(y = 1) &= p \\
\Pr(y = 0) &= 1-p \\
p &\in[0,1]
\end{align}

a *Bernoulli* random variable. In our setting, we just *condition* those probabilities on a covariate $x$, as above - that is, we measure the probability *given that $X$ takes value $x$*:

\begin{align}
\Pr(y = 1 | X = x) &= p(x) \\
\Pr(y = 0 | X = x) &= 1-p(x) \\
p(x) &\in[0,1]
\end{align}

Of particular interest for us is the fact that the *expected value* (i.e. the average) of $y$ given $x$ is

$$
E[y | x] = p(x) \times 1 + (1-p(x)) \times 0 = p(x)
$$

There are several ways to model such binary outcomes. Let's look at them.

## The Linear Probability Model

The Linear Probability Model (LPM) is the simplest option.
In this case, we model the response probability as

$$
\Pr(y = 1 | x) = p(x) = \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K (\#eq:LPM)
$$

Our interpretation changes slightly relative to our usual setup: we'd say *a 1 unit change in $x_1$, say, results in a change of $p(x)$ of $\beta_1$.* Estimation of the LPM as in equation \@ref(eq:LPM) can be performed by standard OLS. Let's look at an example. The Mroz (1987) dataset lets us investigate female labor market participation. How does a woman's `inlf` (*in labor force*) status depend on non-wife household income, her education, age and number of small children? First, let's look at a quick plot that shows how the outcome varies with one variable, age say:

```{r}
data(mroz, package = "wooldridge")
plot(factor(inlf) ~ age, data = mroz, ylevels = 2:1, ylab = "in labor force?")
```

Not so much variation with respect to age, except for the later years. Let's run the LPM now:

```{r}
LPM = lm(inlf ~ nwifeinc + educ + exper + I(exper^2) + age +I(age^2) + kidslt6, mroz)
summary(LPM)
```

You can see that this is *identical* to our previous linear regression models - with the exception that the outcome `inlf` takes on only two values, 0 or 1. The results tell us that if non-wife income increases by 10 (i.e. 10,000 USD), the probability of being in the labor force falls by 0.034 (that's a small effect!), whereas an additional small child would reduce the probability of work by 0.26 (that's large). So far, so simple.

One often-mentioned problem of this model is the fact that nothing restricts our predictions of $p(x)$ to be proper probabilities, i.e. to lie in the unit interval $[0,1]$. You can see that quite easily here:

```{r}
pr = predict(LPM)
plot(pr[order(pr)],ylab = "p(inlf = 1)")
abline(a = 0, b = 0, col = "red")
abline(a = 1, b = 0, col = "red")
```

This picture tells you that for quite a few observations, this model predicts a probability of working which is either greater than 1, or smaller than zero.
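
The same check can be done with numbers rather than by eye. The following is a minimal sketch on *simulated* data (an assumption, made only so that the example runs without any extra package; the names `lpm_sim`, `p_hat` and `outside` are made up for illustration). It fits an LPM by OLS and computes the share of fitted values that are not valid probabilities:

```r
# Simulated data: NOT the Mroz sample, just an illustration
set.seed(1)
n <- 500
x <- rnorm(n)
p_true <- plogis(0.5 + 2 * x)            # true response probability, inside (0, 1)
y <- rbinom(n, size = 1, prob = p_true)  # binary outcome

lpm_sim <- lm(y ~ x)                     # LPM: plain OLS on a 0/1 outcome
p_hat <- predict(lpm_sim)                # fitted "probabilities"

# flag fitted values below 0 or above 1
outside <- p_hat < 0 | p_hat > 1
mean(outside)                            # share of invalid predictions
range(p_hat)                             # smallest and largest fitted value
```

For the model estimated above, the analogous one-liner would be `mean(pr < 0 | pr > 1)`.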
This may or may not be a big problem for your analysis. If you only care about marginal effects (i.e. the $\beta$s), that may be ok, in particular if you have discrete variables on the RHS; if you want actual *predictions*, then that's more problematic. In the case of a *saturated model* - if we only have dummy explanatory variables - this problem does not exist for the LPM:

```{r saturated,message=FALSE,warning=FALSE,fig.cap = "LPM model in a saturated setting, i.e. only mutually exclusive and exhaustive dummy variables on the RHS."}
library(dplyr)
library(ggplot2)
library(magrittr)  # provides the %<>% assignment pipe used below
mroz %<>%
  # classify age into 3 and huswage into 2 classes
  mutate(age_fct = cut(age,breaks = 3,labels = FALSE),
         huswage_fct = cut(huswage, breaks = 2,labels = FALSE)) %>%
  mutate(classes = paste0("age_",age_fct,"_hus_",huswage_fct))

LPM_saturated = mroz %>%
  lm(inlf ~ age_fct + huswage_fct, data = .)
mroz$pred <- predict(LPM_saturated)

ggplot(mroz[order(mroz$pred),], aes(x = 1:nrow(mroz),y = pred,color = classes)) +
  geom_point() +
  theme_bw() +
  scale_y_continuous(limits = c(0,1), name = "p(inlf)") +
  ggtitle("LPM in a Saturated Model is Perfectly Fine")
```

In figure \@ref(fig:saturated) each line segment corresponds to the average probability of work *within that cell* of people. For example, you see that women from the youngest age category and lowest husband income (class `age_1_hus_1`) have the highest probability of working (`r round(max(mroz$pred),3)`).

## Nonlinear Binary Response Models

In this class of models we change the way we model the response probability $p(x)$. Instead of the simple linear structure from above, we write

$$
\Pr(y = 1 | x) = p(x) = G \left(\beta_0 + \beta_1 x_1 + \dots + \beta_K x_K \right) (\#eq:GLM)
$$

Note that this is *almost* identical, only that the entire sum $\beta_0 + \beta_1 x_1 + \dots + \beta_K x_K$ is now inside some function $G(\cdot)$. The main property of $G$ is that it can transform any value $z\in \mathbb{R}$ you give it to a number in the interval $(0,1)$.
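
As a quick numerical illustration of that property, here is a small sketch using two cumulative distribution functions that ship with base `R`, `pnorm()` and `plogis()`. Using these two as examples of $G$ is an assumption at this point, and the grid `z` is arbitrary:

```r
z <- c(-5, -2, 0, 2, 5)     # arbitrary points on the real line

G_normal   <- pnorm(z)      # standard normal CDF
G_logistic <- plogis(z)     # logistic CDF, 1 / (1 + exp(-z))

# one row per value of z, with the two transformed values next to it
round(cbind(z, G_normal, G_logistic), 4)
```

Both columns increase with `z`, and whatever real number goes in, what comes out is a valid probability.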
This immediately solves our problem of getting weird predictions for probabilities. The two most widely used forms of $G$ are the **probit** and the **logit** model. Here are both forms for $G$ in one plot:

```{r cdfs, fig.cap = "The Probit and Logit functional forms for binary choice models",warning = FALSE}
library(ggplot2)
ggplot(data.frame(x = c(-5,5)), aes(x=x)) +
  stat_function(fun = pnorm, aes(colour = "Probit")) +
  stat_function(fun = plogis, aes(colour = "Logit")) +
  theme_bw() +
  scale_colour_manual(name = "Function G",values = c("red", "blue")) +
  scale_y_continuous(name = "Pr(y = 1 | x)")
```

You can see that

1. any value $x$ results in a value $y$ between 0 and 1
1. the higher $x$, the higher the resulting $p(x)$.

### Interpretation of Coefficients

Let's run the Mroz example from above in both probit and logit now:

```{r}
probit <- glm(inlf ~ age, data = mroz, family = binomial(link = "probit"))
logit <- glm(inlf ~ age, data = mroz, family = binomial(link = "logit"))
modelsummary::modelsummary(list("probit" = probit,"logit" = logit))
```

From this table, we learn that the coefficient for `age` is `r round(coef(probit)[2],3)` for probit and `r round(coef(logit)[2],3)` for logit, respectively. In both cases, this tells us that the impact of an additional year of age on the probability of working is **negative**. However, we cannot straightforwardly read off the *magnitude* of the effect: we can't tell **how much** the probability decreases. Why is that?

One simple way to see this is to look back at figure \@ref(fig:cdfs) and imagine we had just one explanatory variable (like here - `age`).
The model is

$$
\Pr(y = 1 | \text{age})= G \left(x \beta\right) = G \left(\beta_0 + \beta_1 \text{age} \right)
$$

and the *marginal effect* of `age` on the response probability is

$$
\frac{\partial{\Pr(y = 1 | \text{age})}}{ \partial{\text{age}}} = g \left(\beta_0 + \beta_1 \text{age} \right) \beta_1 (\#eq:ME)
$$

where function $g$ is defined as $g(z) = \frac{dG}{dz}(z)$ - the first derivative function of $G$ (i.e. the *slope* of $G$). The formulation in \@ref(eq:ME) is a result of the [chain rule](https://en.wikipedia.org/wiki/Chain_rule). Now, given that the $G$ we see in figure \@ref(fig:cdfs) is nonlinear, its slope $g$ is not constant: sometimes (close to the edges of the graph) the slope will be really small and close to zero, but sometimes (in the center of the graph), the slope will be really steep. You are able to try this out yourself using this app:

```{r, eval = FALSE}
ScPoApps::launchApp("marginal_effects_of_logit_probit")
```

So you can see that there is not one single *marginal effect* in those models, as that depends on *where we evaluate* expression \@ref(eq:ME). Notice that the case is identical for more than one $x$. In practice, there are two common approaches:

1. report \@ref(eq:ME) at the average values of $x$: $$g(\bar{x} \beta) \beta_j$$
1. report the sample average of all marginal effects: $$\frac{1}{n} \sum_{i=1}^n g(x_i \beta) \beta_j$$

Thankfully there are packages available that help us to compute those marginal effects fairly easily.
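
To see what such a computation involves, it can help to do it once by hand. The sketch below uses *simulated* data and a probit with a single regressor (both are assumptions made for illustration; `probit_sim`, `me_at_mean` and `ame` are made-up names). It evaluates the two approaches listed above with `dnorm()`, the derivative $g$ of the probit's $G$:

```r
# Simulated data: an illustration only, not the Mroz sample
set.seed(42)
n <- 1000
x <- rnorm(n, mean = 1, sd = 2)
y <- rbinom(n, size = 1, prob = pnorm(-0.5 + 0.4 * x))

probit_sim <- glm(y ~ x, family = binomial(link = "probit"))
b <- coef(probit_sim)                        # b[1] = intercept, b[2] = slope

# 1. marginal effect at the mean of x: g(xbar * beta) * beta_1
me_at_mean <- dnorm(b[1] + b[2] * mean(x)) * b[2]

# 2. average marginal effect: (1/n) * sum_i g(x_i * beta) * beta_1
index <- predict(probit_sim, type = "link")  # x_i * beta for each observation
ame <- mean(dnorm(index)) * b[2]

c(at_mean = unname(me_at_mean), average = unname(ame))
```

For a logit, `dlogis()` would take the place of `dnorm()`. Packages automate exactly this kind of calculation.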
One of them is called [`mfx`](https://cran.r-project.org/web/packages/mfx/), and we would use it as follows:

```{r glms}
f <- "inlf ~ age + kidslt6 + nwifeinc" # setup a formula
glms <- list()
glms$probit <- glm(formula = f, data = mroz, family = binomial(link = "probit"))
glms$logit <- glm(formula = f, data = mroz, family = binomial(link = "logit"))
# now the marginal effects versions
glms$probitMean <- mfx::probitmfx(formula = f, data = mroz, atmean = TRUE)
glms$probitAvg <- mfx::probitmfx(formula = f, data = mroz, atmean = FALSE)
glms$logitMean <- mfx::logitmfx(formula = f, data = mroz, atmean = TRUE)
glms$logitAvg <- mfx::logitmfx(formula = f, data = mroz, atmean = FALSE)

modelsummary::modelsummary(glms,
                           stars = TRUE,
                           gof_omit = "AIC|BIC",
                           title = "Logit and Probit estimates and marginal effects evaluated at mean of x or as sample average of effects")
```

In table \@ref(tab:glms) you should first note that the estimates of the first two columns (probit or logit) don't correspond to the remaining columns. That's because they only give you the $\beta$'s. As we have learned above, that in itself is not informative, as the marginal effect depends on *where* one computes it. Hence the remaining columns compute the marginal effects either at the mean of all regressors (`probitMean`) or as the sample average over all effects in the data (`probitAvg`). You can notice some differences here: for example, evaluated at the average regressor values, an additional child below the age of 6 reduces the probability of work by 0.314, whereas as an average over all sample effects it reduces it by 0.29. Furthermore, you see that the marginal effect estimates between probit and logit don't correspond exactly, which is a consequence of the different shapes of the curves in figure \@ref(fig:cdfs). Neither approach is the single correct one; which to prefer depends on how your data is distributed (e.g. is the mean a good summary of the data here?).
What is clear, though, is that in most cases reporting coefficient estimates only is not very informative (it only tells you the direction of any effect).

================================================
FILE: 14-references.Rmd
================================================
`r if (knitr::is_html_output()) '# References {-}'`

================================================
FILE: DESCRIPTION
================================================
Package: ScPoEconometrics
Type: Package
Title: ScPoEconometrics
Date: 2020-10-31
Version: 0.2.7
Authors@R: c(
    person("Florian", "Oswald", email = "florian.oswald@sciencespo.fr", role = c("aut","cre")),
    person("Jean-Marc", "Robin", email = "jeanmarc.robin@sciencespo.fr", role = "ctb"),
    person("Vincent", "Viers", email = "vincent.viers@sciencespo.fr", role = "ctb"),
    person("Gustave", "Kenedi", email = "gustave.kenedi@sciencespo.fr", role = "ctb"),
    person("Pierre", "Villedieu", email = "pierre.villedieu@sciencespo.fr", role = "ctb"))
Depends: R (>= 3.5.0)
License: MIT + file LICENSE
Description: This is the 2nd year UG econometrics book at SciencesPo.
URL: https://github.com/ScPoEcon/ScPoEconometrics
Imports:
    bookdown, tidyverse, datasauRus, plotly, webshot, Ecdat,
    rmarkdown (>= 1.11), AER, magick, pdftools, Hmisc, magrittr, dplyr,
    corrplot, cowplot, wooldridge, stargazer, quantreg, equatiomatic,
    ungeviz, masteringmetrics, ggdag, data.table, huxtable, cholera,
    reshape2, modelsummary, estimatr, gganimate, fixest, transformr, mfx
Remotes:
    datalorax/equatiomatic,
    wilkelab/ungeviz,
    jrnold/masteringmetrics/masteringmetrics
RoxygenNote: 7.1.1

================================================
FILE: GA-tracker.html
================================================

================================================
FILE: LICENSE
================================================
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. ================================================ FILE: NAMESPACE ================================================ # Generated by roxygen2: do not edit by hand ================================================ FILE: R/utils.R ================================================ gitbook <- function(){ bookdown::render_book('index.Rmd', 'bookdown::gitbook') } pdfbook <- function(){ bookdown::render_book('index.Rmd', 'bookdown::pdf_book') } pasta_maker <- function(){ pasta_jar <- tibble::tibble(id = 1:1980,color = sample(c(rep("Red",488),rep("Green",492),rep("White",1000)), size = 1980 )) usethis::use_data(pasta_jar,overwrite = TRUE) pasta_jar } pasta_image <- function(){ data(pasta_jar) pasta_jar$cf = as.numeric(factor(pasta_jar$color)) m = matrix(c(pasta_jar$cf),44,45) image(m,col = c("green","orange","white"),xaxt="n",yaxt = "n") } ================================================ FILE: README.md ================================================ # ScPo UG Econometrics This is the git repo for the UG Econometrics book taught to 2nd year students at SciencesPo. **Slides for the Intro Course?** If you are looking for the introductory course slides, they are in the [ScPoEconometrics-Slides](https://github.com/ScPoEcon/ScPoEconometrics-Slides) repo. **Slides for the Advanced Course?** If you are looking for the advanced course slides, they are in the [Advanced-Metrics-Slides](https://github.com/ScPoEcon/Advanced-Metrics-slides) repo. **Apps and Tutorials?** If you are looking for our apps and tutorials, they are in the [ScPoApps](https://github.com/ScPoEcon/ScPoApps) repo. ## Meta Information for Teachers *This section is only relevant if you want to teach this course.* All material of this course is open source, and you are free to use it. 
Please refer to the [license](#license) section below for the precise wording and terms of the agreement. In particular, please stick to the agreement about proper citation of this repository. There is some relevant material in the [teachers](/teachers) folder. In particular, the `ForTeachers.md` document contains a detailed explanation of the course structure, as well as a section on student feedback from the first iteration of the course. The few other documents in there should be self explanatory. As outlined in the license section, you are free to use and re-use any parts of the content as you see fit. For instance, you could re-use our slides, and modify them, or publish a different version of our textbook (with proper attribution). However, it could be valuable to integrate your changes/additions to the project. In this case, please read on in the next section about how to make contributions. ## Contribution Workflow - Developers only! This section is only for people who want to contribute code to this project. 1. fork this repository 1. clone your fork to your computer: `git clone url_of_your_fork` 1. Start to work on your things on a new branch: `git checkout -b new_branch` 1. **commit** your work to that new branch! 1. Place your new stuff on top of the most recent `upstream/master`: 1. add the upstream repo as a remote: `git remote add upstream git@github.com:ScPoEcon/ScPoEconometrics.git` 1. Use the `rebase` command ``` # git add your stuff # git commit your stuff git fetch upstream # get stuff from upstream git rebase upstream/master # merge upstream master and put your commits on top of it ``` 1. push that branch to your fork: `git push origin new_branch` 1. create pull request on `upstream` (from your fork at github.com) ## Technology The book is made using bookdown. 
You can find the preview of an example at https://bookdown.org/yihui/bookdown-demo/ ## License This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/) ![](images/cc.png) You are free to: * Share — copy and redistribute the material in any medium or format * Adapt — remix, transform, and build upon the material **under the following terms**: 1. Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. We are happy to suggest the following citation if you use our material in your work: ```R > citation("ScPoEconometrics") Oswald F, Viers V, Villedieu P, Kennedy G (2020). Introduction to Econometrics with R. SciencesPo Department of Economics, Paris, France. . A BibTeX entry for LaTeX users is @Manual{, title = {Introduction to Econometrics with R}, author = {Florian Oswald and Vincent Viers and Pierre Villedieu and Gustave Kennedi}, organization = {SciencesPo Department of Economics}, address = {Paris, France}, year = {2020}, url = {https://scpoecon.github.io/ScPoEconometrics/}, } ``` 2. NonCommercial — You may not use the material for commercial purposes. 3. ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. ### Attributions Under the CC licence above, we are obliged to attribute any material that this book uses and which was shared under the same license: 1. Large parts of Chapter 1 *Introduction to R* are copied [appliedstats](https://daviddalpiaz.github.io/appliedstats/) by [David Dalpiaz](https://daviddalpiaz.com). I added a couple of practical tasks and made some minor edits. 1. Chapter 2 is partly based on [appliedstats](https://daviddalpiaz.github.io/appliedstats/), but only up to *scatterplots*. 
================================================
FILE: ScPoEconometrics.Rproj
================================================
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: knitr
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace

================================================
FILE: _archive/chapters/03-linear-reg.Rmd
================================================
## An Example: California Student Test Scores {#lm-example1}

Luckily for us, fitting a linear model to some data does not require us to iteratively find the best intercept and slope manually, as you have experienced in our `apps`. As it turns out, `R` can do this much more precisely, and very fast! Let's explore how to do this, using a real life dataset taken from the `Ecdat` package which includes many economics-related datasets. In this example, we will use the `Caschool` dataset which contains the average test scores of 420 elementary schools in California along with some additional information.

### Loading and exploring Data

We can explore which variables are included in the dataset using the `names()` function:

```{r str, warning=F, message = F}
library("Ecdat") # Load the Ecdat library
names(Caschool) # Display the variables of the Caschool dataset
```

For each variable in the dataset, basic summary statistics can be obtained by calling `summary()`

```{r summary}
summary(Caschool[, c("testscr", "str", "avginc")])
```

### Fitting a linear model

Suppose we are interested in the following linear model:

$$\text{testscr}_i = b_0 + b_1 \times \text{str}_i + e_i$$

where $\text{testscr}_i$ is the *average test score* for a given school $i$ and $\text{str}_i$ is the *Student/Teacher Ratio* (i.e.
the average number of students per teacher) in the same school $i$. Again, $b_0$ and $b_1$ are the intercept and the slope of the regression line. The subscript $i$ indexes all unique elementary schools ($i \in \{1, 2, 3, \dots 420\}$) and $e_i$ is the error, or *residual*, of the regression. (Remember that our procedure for finding the line of best fit is to minimize the *sum of squared residuals* (SSR)).

At this point you should step back and take a second to think about what you believe the relation between a school's test scores and student/teacher ratio will be. Do you believe that, in general, a high student/teacher ratio will be associated with higher-than-average test scores for the school? Do you think that the number of students per teacher will impact results in any way?

Let's find out! As always, we will start by plotting the data to inspect it visually:

```{r first-reg0,fig.align='center',fig.cap='Student Teacher Ratio vs Test Scores'}
plot(formula = testscr ~ str,
     data = Caschool,
     xlab = "Student/Teacher Ratio",
     ylab = "Average Test Score",
     pch = 21,
     col = 'blue')
```

Can you spot a trend in the data? In your view, what would the line of best fit look like? Would it be upward or downward sloping? Let's ask `R`!

### The `lm()` function

We will use the built-in `lm()` function to estimate the coefficients $b_0$ and $b_1$ using the data at hand. `lm` stands for *linear model*, which is what our representation in \@ref(eq:abline) amounts to. This function typically only takes 2 arguments, `formula` and `data`:

`lm(formula, data)`

- `formula` is the description of our model which we want `R` to estimate for us. Its syntax is very simple: `Y ~ X` (more generally, `DependentVariable ~ Independent Variables`). You can think of the tilde operator `~` as the equal sign in your model equation. An intercept is included by default and so you do not have to ask for it in `formula`.
For example, the simple model $income = b_0 + b_1 \cdot age$ can be written as `income ~ age`. A `formula` can sometimes be written between quotation marks: `"Y ~ X"`.
- `data` is simply the `data.frame` containing the variables in the model.

In the context of our example, the function call is therefore:

```{r lmfit}
# assign lm() output to some object `fit_cal`
fit_cal <- lm(formula = testscr ~ str, data = Caschool)

# ask R for the regression summary
summary(fit_cal)
```

As we can see, `R` returns its estimates for the Intercept and Slope coefficients, $b_0 =$ `r round(coef(fit_cal)[1], 2)` and $b_1 =$ `r round(coef(fit_cal)[2], 2)`. The estimated relationship between a school's Student/Teacher Ratio and its average test results is **negative**.

The output of the `summary` method for an `lm` object is commonly called a *regression table*, and you will be able to decipher it by the end of this course. You should be able to find and interpret the $R^2$ though: Are we explaining a lot of the variance in `testscr` with this simple model, or are we not?

### Plotting the regression line

We can also use our `lm` fit to draw the regression line on top of our initial scatterplot, using the following syntax:

```{r plot-reg1,fig.align='center',fig.cap='Test Scores with Regression Line'}
plot(formula = testscr ~ str,
     data = Caschool,
     xlab = "Student/Teacher Ratio",
     ylab = "Average Test Score",
     pch = 21,
     col = 'blue') # same plot as before
abline(fit_cal, col = 'red') # add regression line
```

As you probably expected, the line of best fit relating a school's Student/Teacher Ratio to its average test results is downward sloping.
Just to showcase another way of making the above plot, here is how you could use `ggplot`:

```{r,fig.align="center"}
library(ggplot2)
p <- ggplot(mapping = aes(x = str, y = testscr), data = Caschool) # base plot
p <- p + geom_point() # add points
p <- p + geom_smooth(method = "lm", size=1, color="red") # add regression line
p <- p + scale_y_continuous(name = "Average Test Score") +
  scale_x_continuous(name = "Student/Teacher Ratio")
p + theme_bw() + ggtitle("Testscores vs Student/Teacher Ratio")
```

The shaded area around the red line shows the 95% confidence interval around the fitted regression line, which reflects the uncertainty in our estimates of $b_0$ and $b_1$. We will learn more about it in chapter \@ref(std-errors).

## Interactions {#mreg-interactions}

Interactions allow that the *ceteris paribus* effect of a certain regressor, `str` say, depends also on the value of yet another regressor, `computer` for example. In other words, do test scores depend differentially on the student teacher ratio, depending on whether there are many or few computers in a given school? Is `str` *particularly* important for the test score if there are only a few computers available, for instance? Notice that `str` and `computer` in isolation cannot answer that question (because the value of other variables is assumed *fixed*!). To measure such an effect, we would reformulate our model like this:

\begin{equation}
\text{testscr}_i = b_0 + b_1 \text{str}_i + b_2 \text{computer}_i + b_3 (\text{str}_i \times \text{computer}_i)+ e_i (\#eq:caschool-inter)
\end{equation}

The inclusion of the *product* of `str` and `computer` amounts to having different slopes with respect to `str` for different values of `computer` (and vice versa).
This is easy to see if we take the partial derivative of \@ref(eq:caschool-inter) with respect to `str`:

\begin{equation}
\frac{\partial \text{testscr}_i}{\partial \text{str}_i} = b_1 + b_3 \text{computer}_i (\#eq:caschool-inter-deriv)
\end{equation}

>You should go back to equation \@ref(eq:abline2d-deriv) to remind yourself of what a *partial effect* was, and how exactly the present \@ref(eq:caschool-inter-deriv) differs from what we saw there.

Back in our `R` session, we can run the full interactions model like this:

```{r}
fit_inter = lm(formula = testscr ~ str + computer + str*computer, data = Caschool)
# note that this would produce the same result:
# lm(formula = testscr ~ str*computer, data = Caschool)
# R expands str*computer for you in main effects + interactions
summary(fit_inter)
```

We see here that the regression now estimates an additional coefficient $b_3$ for us. We observe also that the estimate of $b_2$ changes signs and becomes positive, while the interaction effect $b_3$ is negative. This means that an increase in `str` reduces average student scores (more students per teacher make it harder to teach effectively); that an additional computer increases the average test score by 0.05 points; and that the interaction of both decreases scores, implying that more students per teacher decrease scores slightly more if there are more computers.

Looking at our visualization may help understand this result better. Figure \@ref(fig:3D-Plotly-inter) shows a plane that is no longer actually a *plane*. It shows a curved surface. You can see that the surface became more flexible in that we could kind of *bend* it more. Which model do you like better to explain this data?
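
The partial effect in \@ref(eq:caschool-inter-deriv) can also be made concrete with a few numbers. Here is a minimal sketch on *simulated* data (an assumption made so the example runs on its own; the variables ending in `_sim` and the coefficient values used to generate them are made up and are not the `Caschool` estimates). It computes $b_1 + b_3 \text{computer}$ at three values of `computer`:

```r
# Simulated stand-in for the school data (illustration only)
set.seed(123)
n <- 400
str_sim      <- runif(n, 14, 26)    # hypothetical student/teacher ratios
computer_sim <- runif(n, 0, 500)    # hypothetical number of computers
score_sim <- 700 - 1.5 * str_sim + 0.05 * computer_sim -
  0.002 * str_sim * computer_sim + rnorm(n, sd = 5)

fit_sim <- lm(score_sim ~ str_sim * computer_sim)
b <- coef(fit_sim)

# partial effect of str at a few values of computer: b1 + b3 * computer
computers <- c(0, 100, 500)
partial_str <- b["str_sim"] + b["str_sim:computer_sim"] * computers
data.frame(computers, partial_str)
```

With a negative interaction coefficient, the slope with respect to the student/teacher ratio gets more negative as the number of computers grows, which is exactly the *bending* of the surface.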
```{r 3D-Plotly-inter, echo = FALSE, warning=F, message = F,fig.cap='California Test Scores vs student/teacher ratio and computers in school plus interaction term'}
df["pred"] <- predict.lm(fit_inter, df, se.fit = F)
surf <- acast(df, computer ~ str)
color <- rep(0, length(df))

Caschool %>%
  plot_ly(colors = "blue") %>%
  add_markers(x = ~str, y = ~computer, z = ~testscr, name = "Data", hoverinfo = "skip", opacity = .6, marker=list(color = 'red', size = 4)) %>%
  add_surface(x = to_plot_x, y = to_plot_y, z = ~surf, inherit = F, name = "Best Fit Plane with Interaction", opacity = .75, cauto = F, surfacecolor = color) %>%
  hide_colorbar()
```

## Saturated Models: Main Effects and Interactions

You can see above that we *restricted* male and female to have the same slope with respect to years of experience. This may or may not be a good assumption. Thankfully, the dummy variable regression machinery allows for a quick solution to this - so-called *interaction* effects. As already introduced in chapter \@ref(mreg-interactions), interactions allow that the *ceteris paribus* effect of a certain regressor, `exp` say, depends also on the value of yet another regressor, `sex` for example. Suppose then we would like to see whether male and female not only have different intercepts, but also different slopes with respect to `exp` in figure \@ref(fig:wage-plot2). Therefore we formulate this version of our model:

\begin{equation}
\ln w_i = b_0 + b_1 exp_i + b_2 sex_i + b_3 (sex_i \times exp_i) + e_i (\#eq:wage-sex-inter)
\end{equation}

The inclusion of the *product* of `exp` and `sex` amounts to having different slopes for different categories in `sex`.
This is easy to see if we take the partial derivative of \@ref(eq:wage-sex-inter) with respect to `exp`:

\begin{equation}
\frac{\partial \ln w_i}{\partial exp_i} = b_1 + b_3 sex_i (\#eq:wage-sex-inter-deriv)
\end{equation}

Back in our `R` session, we can run the full interactions model like this:

```{r}
lm_inter = lm(lwage ~ exp*sex, data = Wages)
summary(lm_inter)
```

You can see here that `R` automatically expands `exp*sex` to include both *main effects*, i.e. `exp` and `sex` as single regressors as before, and their interaction, denoted by `exp:sexmale`. It turns out that in this example, the estimate for the interaction is not statistically significant, i.e. we cannot reject the null hypothesis that $b_3 = 0$. (If, for some reason, you wanted to include only the interaction, you could supply directly `formula = lwage ~ exp:sex` to `lm`, although this would be a model that is rather difficult to interpret.)

We call a model like \@ref(eq:wage-sex-inter) a *saturated model*, because it includes all main effects and possible interactions. What our little exercise showed us was that with the sample of data at hand, we cannot actually claim that there exists a differential slope for male and female, so the model with main effects only may be more appropriate here.

To finally illustrate the limits of interpretability when including interactions, suppose we run the fully saturated model for `sex`, `smsa`, `union` and `bluecol`, including all main and all interaction effects:

```{r}
lm_full = lm(lwage ~ sex*smsa*union*bluecol,data=Wages)
summary(lm_full)
```

The main effects remain clear to interpret: being a blue collar worker, for example, reduces average wages by 34% relative to white collar workers.
Two-way interactions are still fairly easy to interpret as well: `sexmale:bluecolyes` indicates that, in addition to the wage premium of males over females of `r round(coef(lm_full)[2],2)` and the penalty for being blue collar of `r round(coef(lm_full)[5],2)`, **male** blue collar workers suffer an additional wage loss of `r round(coef(lm_full)[9],2)`. All of this is relative to the base category, which is female white collar workers who don't live in an SMSA and are not union members. If we now add three-way or even four-way interactions, this becomes much harder to interpret, and in fact we rarely see such interactions in applied work.

================================================ FILE: _bookdown.yml ================================================ book_filename: "ScPoEconometrics" language: ui: chapter_name: "Chapter " delete_merged_file: true new_session: no ================================================ FILE: _build.sh ================================================ #!/bin/sh set -e # build book(s) Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')" # Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book')" Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::epub_book')" ================================================ FILE: _deploy.sh ================================================ #!/bin/sh set -e [ -z "${GH_TOKEN}" ] && exit 0 [ "${TRAVIS_BRANCH}" != "master" ] && exit 0 git config --global user.email "florian.oswald@gmail.com" git config --global user.name "Florian Oswald" git clone -b gh-pages https://${GH_TOKEN}@github.com/${TRAVIS_REPO_SLUG}.git book-output cd book-output cp -r ../_book/* ./ git add --all * git status git commit -m"Update the book" || true git push origin gh-pages ================================================ FILE: _local_deploy.sh ================================================ #!/bin/bash # this script builds the book on your computer # and deploys it to your gh-pages branch.
set -e gitbranch=$(git symbolic-ref --short -q HEAD) if [ "${gitbranch}" == "master" ] then echo building Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')" # Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book')" Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::epub_book')" echo done building echo copy to gh-pages branch # git clone -b gh-pages git@github.com:ScPoEcon/ScPoEconometrics.git ScPoEconometrics-book cd ../ScPoEconometrics-book cp -r ../ScPoEconometrics/_book/* ./ git checkout gh-pages && git pull origin gh-pages echo deploying git add --all * git status git commit -m"Update the book" || true git push origin gh-pages echo done deploying. else echo not on master branch - not deploying fi ================================================ FILE: _output.yml ================================================ bookdown::gitbook: toc_depth: 2 css: style.css config: toc: before: |
  • ScPo 2nd Year Econometrics
  • after: |
  • Published with bookdown
  • edit: https://github.com/ScPoEcon/ScPoEconometrics/edit/master/%s download: ["pdf","epub"] includes: in_header: GA-tracker.html bookdown::pdf_book: includes: in_header: preamble.tex latex_engine: xelatex citation_package: natbib keep_tex: yes bookdown::epub_book: default ================================================ FILE: _tex/ci.tex ================================================ % confidence interval % guassian with conficence region and with y axis \begin{center} \begin{tikzpicture}[scale=2, y=5cm] \draw[domain=-3.15:-2] (-3.15,0) plot[id=gauss1,samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}) -- (-2,0); \draw[domain=-2:2,fill=blue,opacity=0.4] (-2,0) -- plot[id=gauss1, samples=100] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}) -- (2,0); \draw[domain=2:3.2] (2,0) -- plot[id=gauss3, samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw (2,-0.01) -- (2,0.01); % ticks \draw (-2,-0.01) -- (-2,0.01); % ciritcal estimator values \node at (2,-0.04) {$c$}; \node at (-2,-0.04) {$-c$}; % x-axis 1 \draw[->, semithick] (-3.2,0) -- (3.4,0) node[right] {$T$}; % x-axis 2 %\draw[->, semithick] (-3.2,-0.1) -- (3.4,-0.1) node[right] %{$T=\sqrt{n}\frac{\bar{x}-\bar{x}}{s}$}; % ticks on t %\draw[-,semithick] (2,-0.11) -- (2,-0.09); %\node at (2,-0.13) {$c=2$}; %\draw[-,semithick] (-2,-0.11) -- (-2,-0.09); %\node at (-2,-0.13) {$-c=-2$}; %y-axis \draw[->, semithick] (-3.15,-0.02) node[left] {$0$} -- (-3.15,0.45) node[above] {$t_{n-1}$}; \draw[-,semithick] (0,-0.01) -- (0,0.01); % zero tick on x % \draw[-,semithick] (0,-0.11) -- (0,0.-0.09); % zero tick on t \node at (0,-0.03) {$0$}; % \node at (0,-0.13) {$0$}; % annotate alphas \node at (0,0.15) {$1 - \alpha = 0.95$}; \draw (2.2,0.02) -- (2.4,0.1) node[above] {$\alpha = 0.025$}; \draw (-2.2,0.02) -- (-2.4,0.1) node[above] {$\alpha = 0.025$}; \end{tikzpicture} \end{center} ================================================ FILE: _tex/onesided.tex ================================================ %guassian with conficence region and 
with y axis \begin{center} \begin{tikzpicture}[scale=2, y=5cm] \draw[domain=-3:1.645] plot[id=gauss1, samples=100] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw[domain=1.645:3.2,fill=red,opacity=0.4] (1.645,0) -- plot[id=gauss3, samples=25] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw (1.645,-0.01) -- (1.645,0.01); % tick \node at (1.645,-0.03) {$\bar{x}_c = 171.74$}; \draw[->, semithick] (-3.2,0) -- (3.4,0) node[right] {$\bar{x}$}; % x-axis 1 \draw[->, semithick] (-3.2,-0.1) -- (3.4,-0.1) node[right] {$t=\frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$}; % x-axis 2 %y-axis \draw[->, semithick] (-3.15,-0.02) node[left] {$0$} -- (-3.15,0.45) node[above] {$f(\bar{x}|\mu=167)$}; % y-axis \draw[-,semithick] (0,-0.01) -- (0,0.01); % zero tick on x \draw[-,semithick] (0,-0.11) -- (0,0.-0.09); % zero tick on z \node at (0,-0.03) {$\mu = 167$}; \node at (0,-0.13) {$0$}; \draw[-,semithick] (1.645,-0.11) -- (1.645,-0.09); \node at (1.645,-0.13) {$t_c = 1.676$}; % annotate alphas \node at (0,0.15) {$1 - \alpha = 0.95$}; \draw (2,0.02) -- (2.4,0.1) node[above] {$\alpha = 0.05$}; \end{tikzpicture} \end{center} ================================================ FILE: _tex/testing.lyx ================================================ #LyX 2.3 created this file. 
For more info see http://www.lyx.org/ \lyxformat 544 \begin_document \begin_header \save_transient_properties true \origin unavailable \textclass article \begin_preamble \usepackage{natbib} \usepackage{tikz} \end_preamble \use_default_options false \maintain_unincluded_children false \language english \language_package none \inputencoding auto \fontencoding default \font_roman "default" "default" \font_sans "default" "default" \font_typewriter "default" "default" \font_math "auto" "auto" \font_default_family default \use_non_tex_fonts false \font_sc false \font_osf false \font_sf_scale 100 100 \font_tt_scale 100 100 \use_microtype false \use_dash_ligatures true \graphics default \default_output_format default \output_sync 0 \bibtex_command default \index_command default \paperfontsize default \spacing single \use_hyperref false \papersize default \use_geometry true \use_package amsmath 1 \use_package amssymb 0 \use_package cancel 0 \use_package esint 1 \use_package mathdots 0 \use_package mathtools 0 \use_package mhchem 0 \use_package stackrel 0 \use_package stmaryrd 0 \use_package undertilde 0 \cite_engine basic \cite_engine_type default \biblio_style plain \use_bibtopic false \use_indices false \paperorientation portrait \suppress_date false \justification true \use_refstyle 0 \use_minted 0 \index Index \shortcut idx \color #008000 \end_index \leftmargin 2.5cm \rightmargin 2.5cm \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \paragraph_indentation default \is_math_indent 0 \math_numbering_side default \quotes_style english \dynamic_quotes 0 \papercolumns 1 \papersides 1 \paperpagestyle default \tracking_changes false \output_changes false \html_math_output 0 \html_css_as_file 0 \html_be_strict false \end_header \begin_body \begin_layout Standard \begin_inset Float figure wide false sideways false status open \begin_layout Plain Layout \begin_inset ERT status open \begin_layout Plain Layout \backslash input{ci.tex} \end_layout \end_inset \end_layout 
\begin_layout Plain Layout \begin_inset Caption Standard \begin_layout Plain Layout cc \end_layout \end_inset \end_layout \end_inset \end_layout \end_body \end_document ================================================ FILE: _tex/two-sided-beta.tex ================================================ % two sided test for beta % guassian with conficence region and with y axis \begin{tikzpicture}[scale=2, y=5cm] \draw[domain=-3.15:-1.96,fill=red,opacity=0.4] (-3.15,0) plot[id=gauss1,samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}) -- (-1.96,0); \draw[domain=-1.96:1.96] plot[id=gauss1, samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw[domain=1.96:3.2,fill=red,opacity=0.4] (1.96,0) -- plot[id=gauss3, samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw (1.96,-0.01) -- (1.96,0.01); % ticks \draw (-1.96,-0.01) -- (-1.96,0.01); % ciritcal estimator values \node at (1.96,-0.03) {$b_c$}; \node at (-1.96,-0.03) {$-b_c$}; % ticks on t \draw[-,semithick] (1.96,-0.11) -- (1.96,-0.09); \node at (1.96,-0.13) {$t_{up}=1.96$}; \draw[-,semithick] (-1.96,-0.11) -- (-1.96,-0.09); \node at (-1.96,-0.13) {$t_{down}=-1.96$}; % x-axis 1 \draw[->, semithick] (-3.2,0) -- (3.4,0) node[right] {$b_k$}; \draw[->, semithick] (-3.2,-0.1) -- (3.4,-0.1) node[right] {$t=\frac{b_k-0}{s_k}$}; % x-axis 2 %y-axis \draw[->, semithick] (-3.15,-0.02) node[left] {$0$} -- (-3.15,0.45) node[above] {$f(b_k|\beta_k=0)$}; \draw[-,semithick] (0,-0.01) -- (0,0.01); % zero tick on x \draw[-,semithick] (0,-0.11) -- (0,0.-0.09); % zero tick on t \node at (0,-0.03) {$\beta_k = 0$}; \node at (0,-0.13) {$0$}; % annotate alphas \node at (0,0.15) {$1 - \alpha = 0.95$}; \draw (2.2,0.02) -- (2.4,0.1) node[above] {$\alpha = 0.025$}; \draw (-2.2,0.02) -- (-2.4,0.1) node[above] {$\alpha = 0.025$}; \end{tikzpicture} ================================================ FILE: _tex/twosided-mean.tex ================================================ % two sided test for mean % guassian with conficence region and with y axis 
\begin{tikzpicture}[scale=2, y=5cm] \draw[domain=-3.15:-1.96,fill=blue,opacity=0.4] (-3.15,0) plot[id=gauss1,samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}) -- (-1.96,0); \draw[domain=-1.96:1.96] plot[id=gauss1, samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw[domain=1.96:3.2,fill=blue,opacity=0.4] (1.96,0) -- plot[id=gauss3, samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw (1.96,-0.01) -- (1.96,0.01); % ticks \draw (-1.96,-0.01) -- (-1.96,0.01); % ciritcal estimator values \node at (1.96,-0.03) {$\bar{x}_c$}; \node at (-1.96,-0.03) {$-\bar{x}_c$}; % ticks on t \draw[-,semithick] (1.96,-0.11) -- (1.96,-0.09); \node at (1.96,-0.13) {$t_{up}=2.262$}; \draw[-,semithick] (-1.96,-0.11) -- (-1.96,-0.09); \node at (-1.96,-0.13) {$t_{down}=-2.262$}; % x-axis 1 \draw[->, semithick] (-3.2,0) -- (3.4,0) node[right] {$\bar{x}$\hspace{0.1cm}cm}; \draw[->, semithick] (-3.2,-0.1) -- (3.4,-0.1) node[right] {$t=\frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$}; % x-axis 2 %y-axis \draw[->, semithick] (-3.15,-0.02) node[left] {$0$} -- (-3.15,0.45) node[above] {$f(\bar{x}|\mu=167)$}; \draw[-,semithick] (0,-0.01) -- (0,0.01); % zero tick on x \draw[-,semithick] (0,-0.11) -- (0,0.-0.09); % zero tick on t \node at (0,-0.03) {$\mu = 167$}; \node at (0,-0.13) {$0$}; % annotate alphas \node at (0,0.15) {$1 - \alpha = 0.95$}; \draw (2.2,0.02) -- (2.4,0.1) node[above] {$\alpha = 0.025$}; \draw (-2.2,0.02) -- (-2.4,0.1) node[above] {$\alpha = 0.025$}; \end{tikzpicture} ================================================ FILE: _to_be_done/08-TBD.Rmd ================================================ # To Be Done Chapters The following topics could be part of a future version of this course. ## Quantile Regression 1. before you were modelling the mean. the average link 1. now what happens to **outliers**? how robust is the mean to that 1. what about the entire distribution of this? 
## Panel Data

### fixed effects

### DiD

### RDD

### Example

* scanner data on breakfast cereals, $(Q_{it},D_{it})$
* why does D vary with Q
* positive relationship
* don't observe the group identity!
* unobserved heterogeneity alpha is correlated with Q
* within group estimator
* what if you don't have panel data?

## Logit and Probit

## Principal Component Analysis

## General Notes

This creates a bibliography database for the R packages used.

```{r include=FALSE}
# automatically create a bib database for R packages
knitr::write_bib(c(
  .packages(), 'bookdown', 'knitr', 'rmarkdown','ScPoEconometrics','shiny','learnr','datasauRus','webshot','AER'
), 'packages.bib')
```

Packages used:

* **bookdown** [@R-bookdown]
* **shiny** [@R-shiny]
* **learnr** [@R-learnr]
* **datasauRus** [@R-datasauRus]
* **webshot** [@R-webshot]
* **AER** [@R-AER]
* **knitr** [@xie2015]
* **ScPoEconometrics** [@R-ScPoEconometrics]
* **Ecdat** [@R-Ecdat]
* **Ecfun** [@R-Ecfun]
* **R** [@R-base]
* **dplyr** [@R-dplyr]
* **ggplot2** [@R-ggplot2]
* **reshape2** [@R-reshape2]
* **bindrcpp** [@R-bindrcpp]
* **mvtnorm** [@R-mvtnorm]
* **plotly** [@R-plotly]
* **readr** [@R-readr]
* **readxl** [@R-readxl]
* **tidyr** [@R-tidyr]

## Inference via Bootstrap

We will now take the ideas from the previous section and illustrate them using one single powerful idea. Instead of relying on some *population distribution* from which our samples of data have been drawn, we can regard the observed sample itself as a stand-in for the population, and take samples *from* it. Our discussion in the previous section *relied* on a tight connection between sample and population distribution: how else could we have inferred anything about the population from looking at our sample? We will take this idea to its extreme now and look *only* at the sample. This idea is a *resampling* technique, commonly referred to as the *Bootstrap*.
```{block type = "warning"}
**Bootstrapping** refers to any test or metric that relies on *resampling with replacement*. It allows estimation of the sampling distribution of almost any statistic using random sampling methods.
```

### How does the Bootstrap work?

This is best shown with an example. Suppose we had stored our `r n` measurements of students' heights from figure \@ref(fig:heightdata) above in the variable `height`

```{r}
head(height)
```

and let's suppose we want to figure out the distribution of the sample mean $\bar{height}$ here. We will now repeatedly resample `height` with replacement and compute the mean of each resample, each time collecting the result in a vector:

```{r}
r = c()
R = 200
for (ib in 1:R){
  bs = sample(height,size = 10,replace = TRUE) # bootstrap sample
  r = c(r,mean(bs)) # add to results vector
}
```

How are the `r R` means we have computed distributed? Do they all have the same value? Different? How?

```{r hists-x,echo = FALSE,fig.cap="The Distribution of 200 means of `height`, where each version was obtained by resampling from `height` with replacement. The red line indicates the *true* mean from the original sample of `height`."}
hist(r,main = "Histogram of Means of Height",xlab = "Mean of Height")
abline(v = mean(height),col = "red",lwd=2)
```

Well, this is very similar to a normal distribution, isn't it?^[In particular, notice that this is *not* the same histogram as the one in figure \@ref(fig:heightdata) above. Here we see the *means* of height, whereas above we plotted the raw data values.] You should remember that above, for example in figure \@ref(fig:cifig), we had derived a *theoretical* sampling distribution. Relying on statistical theory, we learned that a certain test statistic will be distributed according to a certain distribution (Student's t, or normal, for instance), and that we could use that knowledge to construct confidence intervals and hypothesis tests.
Well, what you are seeing in figure \@ref(fig:hists-x) is the bootstrapped counterpart to that theoretical distribution. It's *simulated evidence* that the sampling distribution of the mean is indeed approximately normal. The advantage here is that we did not need to rely on *any* theory at all, just simple resampling.

### Bootstrapped Confidence Intervals

Let's redo what we did above with the bootstrap. We will use the brilliant [infer](https://github.com/tidymodels/infer) package to have some fun with this.

```{r}
library(infer)
hdf = data.frame(height) # needs a data.frame
boot <- hdf %>%
  specify(response = height) %>% # specify response
  generate(reps = 1000, type = "bootstrap") %>% # generate BS samples
  calculate(stat = "mean") # calculate statistic of interest
( percentile_ci <- get_ci(boot) ) # get CI
```

Note that this is reasonably close to the confidence interval about our sample mean we obtained above, which was

$$\left[`r round(xbar - qt(0.975,df=n-1)* s/sqrt(n),3)` , `r round(xbar + qt(0.975,df=n-1)* s/sqrt(n),3)` \right]$$

What is really cool is the visualization:

```{r boot-ci, fig.cap="Simulated bootstrap distribution of the sample mean together with a 95% confidence region"}
visualize(boot) +
  shade_confidence_interval(endpoints = percentile_ci) + theme_bw()
```

In figure \@ref(fig:boot-ci) we see that each value of a sample mean within the green shaded area would lie within a 95% confidence region about the location of the true *population* mean.

We can also repeat our hypothesis test from above with the help of the bootstrap. The hypotheses were

\begin{align}
H_0:& \mu = `r mu`\\
H_1:& \mu > `r mu`.
\end{align}

```{r H0-height, fig.cap = "One sided hypothesis test with bootstrap."}
H0_height <- hdf %>%
  specify(response = height) %>% # specify response
  hypothesize(null = "point", mu = 167) %>% # spell out H0
  generate(reps = 1000, type = "bootstrap") %>% # generate BS samples
  calculate(stat = "mean") # calculate statistic of interest
visualize(H0_height) +
  shade_p_value(obs_stat = xbar, direction = "right") + theme_bw()
```

In figure \@ref(fig:H0-height) we see the simulated distribution under the Null, i.e. the one where indeed $\mu = `r mu`$. The red vertical line is the value of our calculated test statistic, which was `r xbar`. The shaded area is the corresponding level of significance that we would have to adopt if we wanted to reject H0 on the grounds of having observed $\bar{x} = `r xbar`$. The size of the red-shaded area is the *p-value* of this test. It's easy to compute from this object via

```{r}
pval = H0_height %>%
  get_p_value(obs_stat = xbar, direction = "right")
pval
```

This means that had we adopted a significance level of $\alpha = `r pval[["p_value"]]`$, we would (just!) have been able to reject the Null hypothesis. Now you remember that $\alpha$ is the probability of a Type 1 Error. So, we would have to be happy to make a wrong decision (i.e. to reject when in fact we should not) in about `r round(100*pval[["p_value"]])`% of all cases. So, as above, we probably conclude that this is pretty weak evidence against H0, and we cannot reject it based on this evidence.

It's illustrative to reason about how this picture changes as we change the hypothesized value. Suppose we change our hypothesis to

\begin{align}
H_0:& \mu = 164\\
H_1:& \mu > 164.
\end{align} ```{r H0-height2, fig.cap = "One sided hypothesis test with bootstrap and different hypothesis."} H0_height <- hdf %>% specify(response = height) %>% # specify repsonse hypothesize(null = "point", mu = 164) %>% # spell out H0 generate(reps = 1000, type = "bootstrap") %>% # generate BS samples calculate(stat = "mean") # calculate statistic of interest visualize(H0_height) + shade_p_value(obs_stat = xbar, direction = "right") + theme_bw() ``` The concept is astonishingly simple. It's best to illustrate with an example from the [ungeviz](https://github.com/wilkelab/ungeviz) package.^[this is based on `help(bootstrapper,package = "ungeviz")`] Here is a dataset: ```{r} set.seed(1) n = 10 # data points x = rnorm(n) df <- data.frame(x,y = x + 0.5*rnorm(n)) plot(y~x,data=df) grid() ``` Now we are going to randomly choose rows of this dataframe, `n` at a time, but with replacement. One way to achieve this is via ```{r, eval = FALSE} dplyr::sample_n(df, size = n, replace = TRUE) ``` which would generate one reshuffled sample of `df`. We repeat this for $R$ draws, and each time we calculate the statistic we are interested in. The mean of `x`, mean of `y`, whatever. 
Let's compute the OLS slope coefficient instead, just another statistic, and let's just take a small number of draws, $R=9$: ```{r ungeviz-demo,echo = FALSE,fig.height = 8} library(ungeviz) bs <- bootstrapper(9) p <- ggplot(df, aes(x, y)) + geom_point(shape = 21, size = 6, fill = "white") + geom_text(label = "0", hjust = 0.5, vjust = 0.5, size = 10/.pt) + geom_point(data = bs, aes(group = .row), shape = 21, size = 6, fill = "blue") + geom_text( data = bs, aes(label = .copies, group = .row), hjust = 0.5, vjust = 0.5, size = 10/.pt, color = "white" ) + geom_smooth(data = bs, method = "lm", se = FALSE, color = "red") + ggtitle("Bootstrap demonstration") + theme_bw() p + facet_wrap(~.draw) ``` ## Inference in Theory ```{r testing,echo=FALSE} s = 10 n = 50 mu = 167 xbar = 168.5 set.seed(2) height = rnorm(n,mean=xbar,sd=s) tstat = round((xbar - mu)/(s/sqrt(n)),3) ctval = qt(0.95,df=n) cxbar = ctval * (s/sqrt(n)) + mu ``` Imagine we were tasked by the Director of our school to provide him with our best guess of the *mean body height* $\mu$ amongst all SciencesPo students in order to assess which height the new desks should have. Of course, we are econometricians and don't *guess* things: we **estimate** them! How would we go about this task and estimate $\mu$? You may want to ask: Why bother with this estimation business at all, and not just measure all students' height, compute $\mu$, and that's it? That's a good question! In most cases, we cannot do this, either because we do not have access to the entire population (think of computing the mean height of all Europeans!), or it's too costly to measure everyone, or it's impractical. That's why we take *samples* from the wider population, to make inference. In our example, suppose we'd randomly measure students coming out of the SciencesPo building at 27 Rue Saint Guillaume until we have $`r n`$ measurements on any given Monday. 
Suppose further that we found a sample mean height $\bar{x} = `r xbar`$, and that the sample standard deviation was $s=`r s`$. In short, we found the data summarized in figure \@ref(fig:heightdata).

```{r heightdata,echo=FALSE,fig.cap="Our fictitious sample of SciencesPo students' body height. The small ticks indicate the location of each measurement.",fig.align='center'}
hist(height)
rug(height)
```

What are we going to tell *Monsieur le Directeur* now, with those two numbers and figure \@ref(fig:heightdata) in hand? Before we address this issue, we need to make a short detour into *test statistics*.

### Test Statistics

We have encountered many statistics already: think of the sample mean, or the standard deviation. Statistics are just functions of data. *Test* statistics are used to perform statistical tests.

Many test statistics rely on some notion of *standardizing* the sample data so that it becomes comparable to a theoretical distribution. We encountered this idea already in section \@ref(reg-standard), where we talked about a standardized regression. The most common standardization is the so-called *z-score*, which says that

\begin{equation}
\frac{x - \mu}{\sigma}\equiv z\sim \mathcal{N}(0,1), (\#eq:zscore)
\end{equation}

in other words, subtracting the population mean from a normally distributed random variable $x$ and dividing by its population standard deviation yields a standard normally distributed random variable, commonly called $z$.

A very similar idea applies if we *don't know* the population variance (which is our case here!). The corresponding standardization gives rise to the *t-statistic*, and it looks very similar to \@ref(eq:zscore):

\begin{equation}
\sqrt{n} \frac{\bar{x} - \mu}{s} \equiv T \sim t_{n-1} (\#eq:tscore)
\end{equation}

Several things to note:

* We observe the same standardization as above: dividing by the sample standard deviation $s$ brings $\bar{x} - \mu$ to a *unit free* scale.
* We use $\bar{x}$ and $s$ instead of $x$ and $\sigma$.
* We multiply by $\sqrt{n}$ because the standard deviation of $\bar{x}$ shrinks at rate $1/\sqrt{n}$, so we expect $\bar{x} - \mu$ to be a small number: we need to *rescale* it again to make it compatible with the $t_{n-1}$ distribution.
* $t_{n-1}$ is the [Student's T](https://en.wikipedia.org/wiki/Student's_t-distribution) distribution with $n-1$ degrees of freedom. We don't have $n$ degrees of freedom because we already had to estimate one statistic ($\bar{x}$) in order to construct $T$.

### Confidence Intervals {#CI}

Back to our example now! We are clearly in need of some measure of *confidence* about our sample statistic $\bar{x} = `r xbar`$ before we communicate our result. It seems reasonable to inform the Director about $\bar{x}$, but surely we also need to tell him that there was considerable *dispersion* in the data: Some people were as short as `r round(min(height),2)`cm, while others were as tall as `r round(max(height),2)`cm!

The way to proceed is to construct a *confidence interval* about the true population mean $\mu$, based on $\bar{x}$, which will take this uncertainty into account. We will use the *t* statistic from above. We want to have a *symmetric interval* around $\bar{x}$ which contains the true value $\mu$ with probability $1-\alpha$. One very popular choice of $\alpha$ is $0.05$, hence we cover $\mu$ with 95% probability. After computing our statistic $T$ as defined in \@ref(eq:tscore), this interval is defined as follows:

\begin{align}
\Pr \left(-c \leq T \leq c \right) = 1-\alpha (\#eq:ci)
\end{align}

where $c$ stands for *critical value*, which we need to choose. This is illustrated in figure \@ref(fig:cifig).

```{r cifig, echo=FALSE, engine='tikz', out.width='90%', fig.ext=if (knitr:::is_latex_output()) 'pdf' else 'png', fig.cap='Confidence Interval Construction.
The blue area is called *coverage region* which contains the true $\\mu$ with probability $1-\\alpha$.',fig.align='center'} \begin{tikzpicture}[scale=2, y=5cm] \draw[domain=-3.15:-2] (-3.15,0) plot[id=gauss1,samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}) -- (-2,0); \draw[domain=-2:2,fill=blue,opacity=0.4] (-2,0) -- plot[id=gauss1, samples=100] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}) -- (2,0); \draw[domain=2:3.2] (2,0) -- plot[id=gauss3, samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}); \draw (2,-0.01) -- (2,0.01); % ticks \draw (-2,-0.01) -- (-2,0.01); % ciritcal estimator values \node at (2,-0.04) {$c$}; \node at (-2,-0.04) {$-c$}; % x-axis 1 \draw[->, semithick] (-3.2,0) -- (3.4,0) node[right] {$T$}; % x-axis 2 %\draw[->, semithick] (-3.2,-0.1) -- (3.4,-0.1) node[right] %{$T=\sqrt{n}\frac{\bar{x}-\bar{x}}{s}$}; % ticks on t %\draw[-,semithick] (2,-0.11) -- (2,-0.09); %\node at (2,-0.13) {$c=2$}; %\draw[-,semithick] (-2,-0.11) -- (-2,-0.09); %\node at (-2,-0.13) {$-c=-2$}; %y-axis \draw[->, semithick] (-3.15,-0.02) node[left] {$0$} -- (-3.15,0.45) node[above] {$t_{n-1}$}; \draw[-,semithick] (0,-0.01) -- (0,0.01); % zero tick on x % \draw[-,semithick] (0,-0.11) -- (0,0.-0.09); % zero tick on t \node at (0,-0.03) {$0$}; % \node at (0,-0.13) {$0$}; % annotate alphas \node at (0,0.15) {$1 - \alpha = 0.95$}; \draw (2.2,0.02) -- (2.4,0.1) node[above] {$\frac{\alpha}{2} = 0.025$}; \draw (-2.2,0.02) -- (-2.4,0.1) node[above] {$\frac{\alpha}{2} = 0.025$}; \end{tikzpicture} ``` Given the symmetry of the *t* distribution it's enough to find $c$ at the upper tail: the point above which $\frac{\alpha}{2}$ of all probability mass of the $t_{df}$ distribution comes to lie. 
In other words, if $\mathcal{T}_{df}$ is the CDF of the *t* distribution with *df* degrees of freedom, we find $c$ as

\begin{align}
\mathcal{T}_{df}(c)\equiv& \Pr \left( T < c \right) = 1-\frac{\alpha}{2} = 0.975 \\(\#eq:ci1)
c =& \mathcal{T}_{df}^{-1}(\mathcal{T}_{df}(c)) = \mathcal{T}_{df}^{-1}(0.975)
\end{align}

Here $\mathcal{T}_{df}^{-1}$ stands for the *quantile function*, i.e. the inverse of the CDF. In our example with $df = `r n-1`$, you can thus find that $c = `r round(qt(0.975,df=n-1),3)`$ by typing `qt(0.975,df=49)` into your `R` session.^[You often will see $c=1.96$, which comes from the fact that one relies on the *t* distribution converging to the normal distribution with large $n$. Type `qnorm(0.975)` to confirm!]

Now we only have to expand the definition of the *T* statistic from \@ref(eq:tscore) inside \@ref(eq:ci) to obtain

\begin{align}
0.95 = 1-\alpha &= \Pr \left(-c \leq T \leq c \right) \\(\#eq:ci2)
&= \Pr \left(-`r round(qt(0.975,df=n-1),3)` \leq \sqrt{n} \frac{\bar{x} - \mu}{s} \leq `r round(qt(0.975,df=n-1),3)` \right) \\
&= \Pr \left(\bar{x} -`r round(qt(0.975,df=n-1),3)` \frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + `r round(qt(0.975,df=n-1),3)` \frac{s}{\sqrt{n}} \right)
\end{align}

Finally, filling in our numbers for $s$ etc, this implies that a 95% confidence interval about the location of the true average height of all SciencesPo students, $\mu$, is given by:

\begin{equation}
CI = \left[`r round(xbar - qt(0.975,df=n-1)* s/sqrt(n),3)` , `r round(xbar + qt(0.975,df=n-1)* s/sqrt(n),3)` \right]
\end{equation}

We would tell the director that with 95% probability, the true average height of all students comes to lie within those two bounds.

### Hypothesis Testing

You know by now how the standard errors of an OLS estimate are computed, and what they stand for. We can now briefly^[We will not go into great detail here.
Please refer back to your statistics course from last spring semester (chapters 8 and 9), or the short note I [wrote a while ago](images/hypothesis.pdf).] discuss a very common usage of this information, in relation to which variables we should include in our regression. There is a statistical procedure called *hypothesis testing* which helps us to make such decisions. In [hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing), we have a baseline, or *null* hypothesis $H_0$, which we want to confront with a competing *alternative* hypothesis $H_1$. Continuing with our example of the mean height of SciencesPo students ($\mu$), one potential hypothesis could be

\begin{align}
H_0:& \mu = `r mu`\\
H_1:& \mu \neq `r mu`
\end{align}

Here we state that under the null hypothesis, $\mu = `r mu`$, and under the alternative, it's not equal to that value. This would be called a *two-sided* test, because it tests deviations from $H_0$ below as well as above. An alternative formulation could use the *one-sided* test that

\begin{align}
H_0:& \mu = `r mu`\\
H_1:& \mu > `r mu`.
\end{align}

which would mean: under the null hypothesis, the average of all ScPo students' body height is `r mu`cm. Under the alternative, it is larger.

You can immediately see that this is very similar to confidence interval construction. Suppose as above that we found $\bar{x} = `r xbar`$, and that the sample standard deviation is still $s=`r s`$. Would you regard this as strong or weak evidence against $H_0$ and in favor of $H_1$?

You should now remember what you saw when you did `launchApp("estimate")`. Look again at this app and set the slider to a sample size of $`r n`$, just as in our running example. You can see that the app draws one hundred (100) samples for you, locates their sample mean on the x-axis, and estimates the red density.
```{block type='note'} The crucial thing to note here is that, given we are working with a **random sample** from a population with a certain distribution of *height*, our sample statistic $\bar{x}$ is **also a random variable**. Every new set of randomly drawn students would yield a different $\bar{x}$, and all of them together would follow the red density in the app. In reality we often only get to draw one single sample, and we can use knowledge about the sampling distribution to make inference. ```
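
What the app does can be mimicked in a few lines of base `R`. The following is only a minimal sketch under stated assumptions: we pretend that the population of heights is normal with mean 167 and standard deviation 10 (illustrative values, not something we know about real students), and the names `sim_means`, `n_samples`, `n_obs`, `pop_mean` and `pop_sd` are ours, not the app's. The sketch draws 100 samples of size 50, stores each sample mean, and compares the spread of those means with $\sigma/\sqrt{n}$.

```r
set.seed(123)      # for reproducibility
n_samples <- 100   # number of repeated samples, as in the app
n_obs     <- 50    # sample size, as in our running example
pop_mean  <- 167   # assumed population mean (illustrative)
pop_sd    <- 10    # assumed population standard deviation (illustrative)

# draw n_samples samples and store the mean of each one
sim_means <- replicate(n_samples,
                       mean(rnorm(n_obs, mean = pop_mean, sd = pop_sd)))

# the sample means scatter around pop_mean ...
mean(sim_means)
# ... with a standard deviation close to pop_sd / sqrt(n_obs)
sd(sim_means)
pop_sd / sqrt(n_obs)

# estimated density of the sample means (the analogue of the red density)
plot(density(sim_means), col = "red",
     main = "Simulated sampling distribution of the mean")
rug(sim_means)
```

Each entry of `sim_means` is one realization of the random variable $\bar{x}$; in practice we only ever observe one of them.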
Our task is now to decide, given that particular sampling distribution, our estimate $\bar{x}$ and an observed sample variance $s^2$, whether $\bar{x} = `r xbar`$ is *far away* from the hypothesized $\mu = `r mu`$, or not. The way to proceed is by computing a *test statistic*, which is to be compared to a *critical value*: if the test statistic exceeds that value, we reject $H_0$, otherwise we cannot. The critical value depends on the sampling distribution, and the size of the test. We talk about this next.

### Making Errors

There are two types of error one can make when deploying such a test:

1. We might reject $H_0$, when in fact it is true! Here, upon observing $\bar{x} = `r xbar`$ we might conclude that indeed $\mu > `r mu`$ and thus we'd reject. But we might have gotten unlucky and by chance have obtained an unusually tall sample of students. This is called **type one error**.
2. We might *fail* to reject $H_0$ when in fact $H_1$ is true. This is called the **type two error**.

We design a test with a certain probability of *type one error* $\alpha$ in mind. In other words, we choose with which probability $\alpha$ we are willing to make a type one error. (Notice that the best tests also avoid making type two errors! The number $1-\Pr(\text{type 2 error})$ is called *power*, hence we prefer tests with *high power*). A typical choice for $\alpha$ is 0.05, i.e. we are willing to make a type one error with probability 5%. $\alpha$ is commonly called the **level of significance** or the **size** of a test.

### Performing the Test

We can stick to the following cookbook procedure, which is illustrated in figure \@ref(fig:testfig).

1. Set up hypothesis and significance level:
    1. $H_0: \mu = `r mu`$
    2. $H_1: \mu > `r mu`$
    3. $\alpha = 0.05$
2. Test Statistic and test distribution:
    * We don't know the true population variance $\sigma^2$, hence we estimate it via $s^2$ from our sample.
    * The corresponding test statistic is the *t-statistic*, which follows the [Student's T](https://en.wikipedia.org/wiki/Student's_t-distribution) distribution.
    * That is, our statistic is $T=\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{`r n-1`}$, where `r n-1` is equal to the *degrees of freedom* in this case.
3. Rejection Region: We perform a one-sided test. We said we are happy with a 5% significance level, i.e. we are looking for the $t$ value which corresponds *just* to $1-0.05 = 0.95$ mass under the pdf of the $t$ distribution. More precisely, we are looking for the $1-0.05 = 0.95$ quantile of the $t_{`r n-1`}$ distribution.^[See the previous footnote for an explanation of this!] This implies a critical value $c = `r round(qt(0.95,df=n-1),3)`$, which you can verify by typing `qt(0.95,df=49)` in `R`.
4. Calculate our test statistic: $\frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{168.5 - 167}{`r s`/\sqrt{`r n`}} = `r tstat`$
5. Decide: We find that $`r tstat` < `r round(qt(0.95,df=n-1),3)`$. Hence, we cannot reject $H_0$, because we only found weak evidence against it in our sample of data.

```{r testfig, echo=FALSE, engine='tikz', out.width='90%', fig.ext=if (knitr:::is_latex_output()) 'pdf' else 'png', fig.cap='Cookbook Testing Procedure. Subscripts $c$ indicate *critical value*. There are two x-axes: one for values of $\\bar{x}$, and one for the corresponding $t$ statistic. The red area is the rejection area. If we observe a test statistic such that $t>t_c$, we feel reassured that our $\\bar{x}$ is *sufficiently far away* from the hypothesized value $\\mu$, such that we feel comfortable with rejecting $H_0$. 
And vice versa: If our test statistic falls below $t_c$, we will not reject $H_0$',fig.align='center'}
\begin{tikzpicture}[scale=2, y=5cm]
\draw[domain=-3:1.645] plot[id=gauss1, samples=100] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)});
\draw[domain=1.645:3.2,fill=red,opacity=0.4] (1.645,0) -- plot[id=gauss3, samples=25] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)});
\draw (1.645,-0.01) -- (1.645,0.01); % tick
\node at (1.645,-0.03) {$\bar{x}_c = 171.74$};
\draw[->, semithick] (-3.2,0) -- (3.4,0) node[right] {$\bar{x}$}; % x-axis 1
\draw[->, semithick] (-3.2,-0.1) -- (3.4,-0.1) node[right] {$t=\sqrt{n} \frac{\bar{x}-\mu}{s}$}; % x-axis 2
% y-axis
\draw[->, semithick] (-3.15,-0.02) node[left] {$0$} -- (-3.15,0.45) node[above] {$f(\bar{x}|\mu=167)$};
\draw[-,semithick] (0,-0.01) -- (0,0.01); % zero tick on x
\draw[-,semithick] (0,-0.11) -- (0,-0.09); % zero tick on t
\node at (0,-0.03) {$\mu = 167$};
\node at (0,-0.13) {$0$};
\draw[-,semithick] (1.645,-0.11) -- (1.645,-0.09);
\node at (1.645,-0.13) {$t_c = 1.676$};
% annotate alphas
\node at (0,0.15) {$1 - \alpha = 0.95$};
\draw (2,0.02) -- (2.4,0.1) node[above] {$\alpha = 0.05$};
\end{tikzpicture}
```

You can see from this that whether our test statistic is far away from the critical value or just below it does not change our decision: we either reject or we do not. The decision alone does not tell us whether we *narrowly* rejected $H_0$, or not.

P-values are an improvement over this stark dichotomy. The p-value is defined as the smallest level of significance $\alpha^*$ at which $H_0$ would still be rejected, given our test statistic. If this is a very small number, we have overwhelming support to reject the null. If, on the contrary, $\alpha^*$ turns out to be rather large, we only found weak evidence against $H_0$.

We compute the p-value as the area of the rejection region implied by a given test statistic $T^*$. In our one-sided test this is a single tail probability; in a two-sided test, the symmetry of the $t$ distribution implies that we would multiply this tail probability by two.
\begin{align} \alpha^* = \Pr(t > |T^*|) \end{align} ### Standard Errors in Practice We would like to further make this point in an experiential way, i.e. we want you to experience what is going on. We invite you to spend some time with the following apps. In particular, make sure you have a thorough understanding of `launchApp("estimate")`. ```{r,eval=FALSE} library(ScPoApps) launchApp("estimate") launchApp("sampling") launchApp("standard_errors_simple") launchApp("standard_errors_changeN") ``` ### Testing Regression Coefficients In Regression Analysis, we often want to test a very specific alternative hypothesis: We want to have a quick way to tell us whether a certain variable $x_k$ is *relevant* in our statistical model or not. In hypothesis testing language, that would be \begin{align} H_0:& \beta_k = 0\\ H_1:& \beta_k \neq 0.(\#eq:H0) \end{align} Clearly, if in the **true** regression model we find $\beta_k=0$, this means that $x_k$ has a zero partial effect on the outcome, hence it should be excluded from the regression. Notice that we are interested in $\beta_k$, not in $b_k$, which is the estimator that we compute from our sample (similarly to $\bar{x}$, which estimates $\mu$ above). As such, this is a *two-sided test*. We can again illustrate this in figure \@ref(fig:testfig2). Notice how we now have two rejection areas. ```{r testfig2, echo=FALSE, engine='tikz', out.width='90%', fig.ext=if (knitr:::is_latex_output()) 'pdf' else 'png', fig.cap='Testing whether coefficient $b_k$ is *statistically significantly different* from zero. Now we have two red rejection areas. We relabel critical values with a superscript here. If we observe a test statistic falling in either red region, we reject, else we do not. Notice that the true value under $H_0$ is $\\beta_k=0$. 
',fig.align='center'}
\begin{tikzpicture}[scale=2, y=5cm]
\draw[domain=-3.15:-1.96,fill=red,opacity=0.4] (-3.15,0) plot[id=gauss1,samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)}) -- (-1.96,0);
\draw[domain=-1.96:1.96] plot[id=gauss1, samples=100] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)});
\draw[domain=1.96:3.2,fill=red,opacity=0.4] (1.96,0) -- plot[id=gauss3, samples=50] (\x,{1/sqrt(2*pi)*exp(-0.5*(\x)^2)});
% ticks
\draw (1.96,-0.01) -- (1.96,0.01);
\draw (-1.96,-0.01) -- (-1.96,0.01);
% critical estimator values
\node at (1.96,-0.03) {$b_k^c$};
\node at (-1.96,-0.03) {$-b_k^c$};
% ticks on t
\draw[-,semithick] (1.96,-0.11) -- (1.96,-0.09);
\node at (1.96,-0.13) {$t_{up}=1.96$};
\draw[-,semithick] (-1.96,-0.11) -- (-1.96,-0.09);
\node at (-1.96,-0.13) {$t_{down}=-1.96$};
% x-axis 1
\draw[->, semithick] (-3.2,0) -- (3.4,0) node[right] {$b_k$};
% x-axis 2
\draw[->, semithick] (-3.2,-0.1) -- (3.4,-0.1) node[right] {$t=\frac{b_k-0}{s_k}$};
% y-axis
\draw[->, semithick] (-3.15,-0.02) node[left] {$0$} -- (-3.15,0.45) node[above] {$f(b_k|\beta_k=0)$};
\draw[-,semithick] (0,-0.01) -- (0,0.01); % zero tick on x
\draw[-,semithick] (0,-0.11) -- (0,-0.09); % zero tick on t
\node at (0,-0.03) {$\beta_k = 0$};
\node at (0,-0.13) {$0$};
% annotate alphas
\node at (0,0.15) {$1 - \alpha = 0.95$};
\draw (2.2,0.02) -- (2.4,0.1) node[above] {$\frac{\alpha}{2} = 0.025$};
\draw (-2.2,0.02) -- (-2.4,0.1) node[above] {$\frac{\alpha}{2} = 0.025$};
\end{tikzpicture}
```

The relevant test statistic for a regression coefficient again follows the *t* distribution. In fact, this particular test is so important that all statistical packages report the *t* statistic corresponding to \@ref(eq:H0) automatically. Let's look at an example:

```{r,echo=FALSE}
lm1=lm(mpg ~ wt + hp + drat, mtcars)
summary(lm1)
```

The column `t value` is just `Estimate` divided by `Std. Error`.
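As a quick check of that statement, the following sketch refits the same model and recomputes the ratio by hand. The object names `coefs` and `t_manual` are our own choices for illustration; `coef(summary(...))` simply returns the coefficient table of the regression output.

```r
# Refit the model from the text and extract the coefficient table
lm1 <- lm(mpg ~ wt + hp + drat, data = mtcars)
coefs <- coef(summary(lm1))

# Recompute the t statistic by hand: estimate divided by its standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Put the hand-computed values next to the column reported by R
cbind(manual = t_manual, reported = coefs[, "t value"])
all.equal(t_manual, coefs[, "t value"])

# Residual degrees of freedom: number of observations minus estimated coefficients
df.residual(lm1)
```
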
That is, `R` reports in the column `t value` the following number for us: \begin{equation} \text{t value} = \frac{b_k-0}{s_k} (\#eq:tstat) \end{equation} where $s_k$ is the estimated standard error as introduced in \@ref(se-theory), and where we test $H_0:\beta_k = 0$. Notice that this particular *t* statistic is different from our previous formulation in \@ref(eq:tscore): we don't have to scale by $\sqrt{n}$! This is so because `R` and other statistical software assumes the *normal* linear regression model (see \@ref(class-reg)). Normality of the regression error $\varepsilon$ implies that the *t* statistic looks like in \@ref(eq:tstat). We have to choose a critical value for this test. Many people automatically choose the 0.975 quantile of the standard normal distribution, `qnorm(0.975)`, `r round(qnorm(0.975),2)` in this case. This is fine for sample sizes greater than 100, say. In this regression, we only have 28 degrees of freedom, so we better choose the critical value from the *t* distribution as above. We get $t_{down} = `r round(qt(0.025,df=28),3)`$ and $t_{up} = `r round(qt(0.975,df=28),3)`$ as critical values. Let's test whether the coefficient on `wt` is statistically different from zero: \begin{align} H_0:& \beta_{wt} = 0\\ H_1:& \beta_{wt} \neq 0 (\#eq:mtcarswt) \end{align} We just take the `t value` entry, and see whether it lies above or below either critical value: Indeed, we see that $-4.053 < `r round(qt(0.025,df=28),3)`$, and we are happy to reject $H_0$. On the other hand, when testing for statistical significance of `drat` that does not seem to be the case: \begin{align} H_0:& \beta_{drat} = 0\\ H_1:& \beta_{drat} \neq 0 (\#eq:mtcarsdrat) \end{align} Here we find that $1.316 \in [`r round(qt(0.025,df=28),3)`,`r round(qt(0.975,df=28),3)`]$, hence it does not lie in any rejection region, and we can *not* reject $H_0$. We would say that *coefficient $\beta_{drat}$ is not statistically significant at the 5% level*. 
As such, we should not include it in our regression.

### P-Values and Stars

`R` also reports two additional columns in its regression output: the so-called *p-value* in column `Pr(>|t|)`, and a column with stars. P-values are an improvement over the dichotomy introduced in the standard reject/accept framework above, where we never know if we *narrowly* rejected a $H_0$, or not. The p-value is defined as the smallest level of significance $\alpha^*$ at which $H_0$ would still be rejected, given our test statistic. If this is a very small number, we have overwhelming support to reject the null. If, on the contrary, $\alpha^*$ turns out to be rather large, we only found weak evidence against $H_0$.

We compute the p-value as the sum of rejection areas for a given test statistic $T^*$. Notice that the symmetry of the $t$ distribution implies that the two tail probabilities are equal, so we multiply one of them by two:

\begin{align}
\alpha^* = 2 \Pr(t > |T^*|)
\end{align}

The stars in the final column are a visualization of this information. They show a quick summary of the magnitude of each p-value. In `R`'s output, `***` marks a p-value below 0.001, `**` a p-value between 0.001 and 0.01, `*` one between 0.01 and 0.05, and `.` one between 0.05 and 0.1. With three stars, for example, we would reject $H_0$ even at a significance level as strict as 0.1%. Note that the columns `Std. Error`, `t value` and `Pr(>|t|)` all present the same information in different forms.

* Measurement error
* Omitted Variable Bias
* Reverse Causality / Simultaneity Bias

are all called *endogeneity* problems.

## Simultaneity Bias

* Detroit has a large police force
* Detroit has a high crime rate
* Omaha has a small police force
* Omaha has a low crime rate

Do large police forces **cause** high crime rates? Absurd! Absurd? How could we use data to tell? We have the problem that large police forces and high crime rates covary positively in the data, and for obvious reasons: Cities want to protect their citizens and therefore respond to increased crime with increased police.
Using mathematical symbols, we have the following *system of linear equations*, i.e. two equations which are **jointly determined**:

\begin{align*}
\text{crime}_{it} &= f(\text{police}_{it}) \\
\text{police}_{it}&= g(\text{crime}_{it} )
\end{align*}

We need a factor that is outside this circular system, affecting **only** the size of the police force, but not the actual crime rate. Such a factor is called an *instrumental variable*.

================================================
FILE: _to_be_done/09-R-advanced.Rmd
================================================
# Advanced `R` {#R-advanced}

This chapter continues with some advanced usage examples from chapter \@ref(R-intro).

## More Vectorization

```{r}
x = c(1, 3, 5, 7, 8, 9)
y = 1:100
```

```{r}
x + 2
x + rep(2, 6)
```

```{r}
x > 3
x > rep(3, 6)
```

```{r}
x + y
length(x)
length(y)
length(y) / length(x)
(x + y) - y
```

```{r}
y = 1:60
x + y
length(y) / length(x)
```

```{r}
rep(x, 10) + y
```

```{r}
all(x + y == rep(x, 10) + y)
identical(x + y, rep(x, 10) + y)
```

```{r}
# ?any
# ?all.equal
```

## Calculations with Vectors and Matrices

Certain operations in `R`, for example `%*%`, have different behavior on vectors and matrices. To illustrate this, we will first create two vectors.

```{r}
a_vec = c(1, 2, 3)
b_vec = c(2, 2, 2)
```

Note that these are indeed vectors. They are not matrices.

```{r}
c(is.vector(a_vec), is.vector(b_vec))
c(is.matrix(a_vec), is.matrix(b_vec))
```

When this is the case, the `%*%` operator is used to calculate the **dot product**, also known as the **inner product** of the two vectors.

The dot product of vectors $\boldsymbol{a} = \lbrack a_1, a_2, \cdots, a_n \rbrack$ and $\boldsymbol{b} = \lbrack b_1, b_2, \cdots, b_n \rbrack$ is defined to be

\[ \boldsymbol{a} \cdot \boldsymbol{b} = \sum_{i = 1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n. 
\] ```{r} a_vec %*% b_vec # inner product a_vec %o% b_vec # outer product ``` The `%o%` operator is used to calculate the **outer product** of the two vectors. When vectors are coerced to become matrices, they are column vectors. So a vector of length $n$ becomes an $n \times 1$ matrix after coercion. ```{r} as.matrix(a_vec) ``` If we use the `%*%` operator on matrices, `%*%` again performs the expected matrix multiplication. So you might expect the following to produce an error, because the dimensions are incorrect. ```{r} as.matrix(a_vec) %*% b_vec ``` At face value this is a $3 \times 1$ matrix, multiplied by a $3 \times 1$ matrix. However, when `b_vec` is automatically coerced to be a matrix, `R` decided to make it a "row vector", a $1 \times 3$ matrix, so that the multiplication has conformable dimensions. If we had coerced both, then `R` would produce an error. ```{r, eval = FALSE} as.matrix(a_vec) %*% as.matrix(b_vec) ``` Another way to calculate a *dot product* is with the `crossprod()` function. Given two vectors, the `crossprod()` function calculates their dot product. The function has a rather misleading name. ```{r} crossprod(a_vec, b_vec) # inner product tcrossprod(a_vec, b_vec) # outer product ``` These functions could be very useful later. When used with matrices $X$ and $Y$ as arguments, it calculates \[ X^\top Y. \] When dealing with linear models, the calculation \[ X^\top X \] is used repeatedly. ```{r} C_mat = matrix(c(1, 2, 3, 4, 5, 6), 2, 3) D_mat = matrix(c(2, 2, 2, 2, 2, 2), 2, 3) ``` This is useful both as a shortcut for a frequent calculation and as a more efficient implementation than using `t()` and `%*%`. 
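To see why $X^\top X$ comes up so often with linear models, here is a minimal sketch that obtains least-squares coefficients from the normal equations $X^\top X b = X^\top y$ using `crossprod()`, and compares them with `lm()`. The use of `mtcars` and the names `X`, `y` and `b` are our own assumptions for illustration, not part of the chapter.

```r
# Design matrix with an intercept column, and the outcome vector
X <- cbind(intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) b = X'y for b
b <- solve(crossprod(X), crossprod(X, y))
b

# lm() gives the same coefficients (it relies on a QR decomposition internally)
all.equal(as.vector(b), unname(coef(lm(mpg ~ wt + hp, data = mtcars))))
```

Solving the normal equations directly is fine for a small, well-behaved example like this one; it is shown here only to connect `crossprod()` to regression.
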
```{r} crossprod(C_mat, D_mat) t(C_mat) %*% D_mat all.equal(crossprod(C_mat, D_mat), t(C_mat) %*% D_mat) ``` ```{r} crossprod(C_mat, C_mat) t(C_mat) %*% C_mat all.equal(crossprod(C_mat, C_mat), t(C_mat) %*% C_mat) ``` ## Matrices ```{r} Z = matrix(c(9, 2, -3, 2, 4, -2, -3, -2, 16), 3, byrow = TRUE) Z solve(Z) ``` To verify that `solve(Z)` returns the inverse, we multiply it by `Z`. We would expect this to return the identity matrix, however we see that this is not the case due to some computational issues. However, `R` also has the `all.equal()` function which checks for equality, with some small tolerance which accounts for some computational issues. The `identical()` function is used to check for exact equality. ```{r} solve(Z) %*% Z diag(3) all.equal(solve(Z) %*% Z, diag(3)) ``` `R` has a number of matrix specific functions for obtaining dimension and summary information. ```{r} X = matrix(1:6, 2, 3) X dim(X) rowSums(X) colSums(X) rowMeans(X) colMeans(X) ``` The `diag()` function can be used in a number of ways. We can extract the diagonal of a matrix. ```{r} diag(Z) ``` Or create a matrix with specified elements on the diagonal. (And `0` on the off-diagonals.) ```{r} diag(1:5) ``` Or, lastly, create a square matrix of a certain dimension with `1` for every element of the diagonal and `0` for the off-diagonals. ```{r} diag(5) ``` ================================================ FILE: _to_be_done/11-projects.Rmd ================================================ # Projects This chapter contains several empirical projects. 
## Trade Exercise * [Trade exercise](images/trade.html) ================================================ FILE: _to_be_done/notes.R ================================================ data("STAR",package = "AER") x = as.data.table(STAR) mx = melt.data.table(x, id = 1:3, measure.vars = patterns("star*"), variable.name = "grade", value.name = "classtype") # mx[, grade := as.character(grade)] ms = melt.data.table(x, id = 1:3, measure.vars = patterns("read*","math*", "schoolid*", "degree*","experience*","tethnicity*"), variable.name = "grade", value.name = c("read","math","schoolid","degree","experience","tethniticy")) mx = cbind(mx,ms[,-c(1:4)]) mx = mx[complete.cases(mx)] setkey(mx, classtype) ecdfs = mx[classtype != "small", list(readcdf = list(ecdf(read)),mathcdf = list(ecdf(math))),by = grade] om = par("mar") par(mfcol=c(4,2),mar = c(2,om[2],2.5,om[4])) ecdfs[,.SD[,plot(mathcdf[[1]],main = paste("math ecdf grade",.BY))],by = grade] ecdfs[,.SD[,plot(readcdf[[1]],main = paste("read ecdf grade",.BY))],by = grade] par(mfcol=c(1,1),mar = om) setkey(ecdfs, grade) setkey(mx,grade) z=mx[,list(perc_read = ecdfs[(.BY),readcdf][[1]](read),perc_math = ecdfs[(.BY),mathcdf][[1]](math)),by=grade] z[,score := rowMeans(.SD), .SDcols = c("perc_read","perc_math")] mx = cbind(mx,z[,!"grade"]) ggplot(data = z, mapping = aes(x = score,color=classtype)) + geom_density() + facet_wrap(~grade) summary(lm(score ~ classtype + schoolid,mx[grade == "stark"])) ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_dag_point() + geom_dag_edges_diagonal() + geom_dag_text() + theme_dag() dagify(y ~ x, x ~ z, y ~ z) %>% tidy_dagitty(layout = "tree") %>% ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_dag_point() + geom_dag_edges_diagonal() + geom_dag_text() + theme_dag() + scale_adjus d = dagify(y ~ x, x ~ z, y ~ z) %>% tidy_dagitty(layout = "tree") p1 = d %>% filter(name == "z") %>% ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + geom_dag_point() + geom_dag_edges(edge_linetype = 
"dashed") + geom_dag_text() + theme_dag() d2 = d %>% filter(name != "z") %>% ggdag() p1 + geom_dag_point(data = d2) # bootstrap from ungeviz repo library(gganimate) set.seed(69527) x <- rnorm(15) data <- data.frame( x, y = x + 0.5*rnorm(15) ) bs <- bootstrapper(9) library(ungeviz) bs <- bootstrapper(9) bs p <- ggplot(data, aes(x, y)) + geom_point(shape = 21, size = 6, fill = "white") + geom_text(label = "0", hjust = 0.5, vjust = 0.5, size = 10/.pt) + geom_point(data = bs, aes(group = .row), shape = 21, size = 6, fill = "blue") + geom_text( data = bs, aes(label = .copies, group = .row), hjust = 0.5, vjust = 0.5, size = 10/.pt, color = "white" ) + geom_smooth(data = bs, method = "lm", se = FALSE) + ggtitle("Bootstrap demonstration") + theme_bw() p + facet_wrap(~.draw) p + transition_states(.draw, 1, 1) + enter_fade() + exit_fade() # and groups by variable `KnownSex` bsr <- bootstrapper(20, KnownSex) ggplot(BlueJays, aes(BillLength, Head, color = KnownSex)) + geom_smooth(method = "lm", color = NA) + geom_point(alpha = 0.3) + # `.row` is a generated column providing a unique row number for all rows geom_point(data = bsr, aes(group = .row)) + geom_smooth(data = bsr, method = "lm", fullrange = TRUE, se = FALSE) + facet_wrap(~KnownSex, scales = "free_x") + scale_color_manual(values = c(F = "#D55E00", M = "#0072B2"), guide = "none") + theme_bw() + transition_states(.draw, 1, 1) + enter_fade() + exit_fade() data("wage1", package = "wooldridge") wage1$female = factor(wage1$female, labels = c("male","female")) bs <- bootstrapper(20, female) anim <- ggplot(wage1, aes(x = educ, y = lwage, color = female)) + geom_smooth(method = "lm", color = NA) + geom_point(alpha = 0.3) + # `.row` is a generated column providing a unique row number for all rows geom_point(data = bs, aes(group = .row)) + geom_smooth(data = bs, method = "lm", fullrange = TRUE, se = FALSE) + facet_wrap(~female, scales = "free_x") + scale_color_manual(values = c("male" = "#D55E00", "female" = "#0072B2"), guide = 
"none") + theme_bw() + transition_states(.draw, 1, 1) + enter_fade() + exit_fade() animate(anim, renderer = gifski_renderer()) ================================================ FILE: book.bib ================================================ @Book{xie2015, title = {Dynamic Documents with {R} and knitr}, author = {Yihui Xie}, publisher = {Chapman and Hall/CRC}, address = {Boca Raton, Florida}, year = {2015}, edition = {2nd}, note = {ISBN 978-1498716963}, url = {http://yihui.name/knitr/}, } @article{krueger1999, title={Experimental estimates of education production functions}, author={Krueger, Alan B}, journal={The quarterly journal of economics}, volume={114}, number={2}, pages={497--532}, year={1999}, publisher={MIT Press} } @article{pinotti, Author = {Pinotti, Paolo}, Title = {Clicking on Heaven's Door: The Effect of Immigrant Legalization on Crime}, Journal = {American Economic Review}, Volume = {107}, Number = {1}, Year = {2017}, Month = {January}, Pages = {138-68}, DOI = {10.1257/aer.20150355}, URL = {http://www.aeaweb.org/articles?id=10.1257/aer.20150355}} @article{freedman1991, title={Statistical models and shoe leather}, author={Freedman, David A}, journal={Sociological methodology}, pages={291--313}, year={1991}, publisher={JSTOR} } @book{deaton1997, title={The analysis of household surveys: a microeconometric approach to development policy}, author={Deaton, Angus}, year={1997}, publisher={The World Bank} } @article{angristlavy, title={Using Maimonides' rule to estimate the effect of class size on scholastic achievement}, author={Angrist, Joshua D and Lavy, Victor}, journal={The Quarterly journal of economics}, volume={114}, number={2}, pages={533--575}, year={1999}, publisher={MIT Press} } @article{angristkrueger, author = {Angrist, Joshua D. 
and Krueger, Alan B.}, title = {Does Compulsory School Attendance Affect Schooling and Earnings?}, journal = {The Quarterly Journal of Economics}, volume = {106}, number = {4}, pages = {979-1014}, year = {1991}, month = {11}, abstract = "{We establish that season of birth is related to educational attainment because of school start age policy and compulsory school attendance laws. Individuals born in the beginning of the year start school at an older age, and can therefore drop out after completing less schooling than individuals born near the end of the year. Roughly 25 percent of potential dropouts remain in school because of compulsory schooling laws. We estimate the impact of compulsory schooling on earnings by using quarter of birth as an instrument for education. The instrumental variables estimate of the return to education is close to the ordinary least squares estimate, suggesting that there is little bias in conventional estimates.}", issn = {0033-5533}, doi = {10.2307/2937954}, url = {https://doi.org/10.2307/2937954}, eprint = {https://academic.oup.com/qje/article-pdf/106/4/979/5298446/106-4-979.pdf}, } @article{angristkruegerIV, title={Instrumental variables and the search for identification: From supply and demand to natural experiments}, author={Angrist, Joshua D and Krueger, Alan B}, journal={Journal of Economic perspectives}, volume={15}, number={4}, pages={69--85}, year={2001} } ================================================ FILE: images/trade.html ================================================ Effects of Free Trade Agreements

    Setup

    This exercise was developed by Thierry Mayer for the International Trade and Finance Course. The dataset needed for this exercise is available in Stata format at this dropbox link. Download the file, and read it into R with the function read_stata from the haven package.

    Exploring the data

    1. What variables are included in the data?

      ##  [1] "year"        "iso_o"       "iso_d"       "contig"      "comlang_off"
      ##  [6] "distw"       "pop_o"       "pop_d"       "gdp_o"       "gdp_d"      
      ## [11] "comcur"      "fta_wto"     "flow"
    2. How many observations do we have in total?

      ## [1] 1106870      13
    3. How many unique countries do we have in the columns iso_o and iso_d (origin/destination)?

      ## [1] 208
      ## [1] 208
    4. How does the total number of observations evolve over the years? That is, how many rows of data do we have for each year?
    5. What about countries? How many countries iso_o do we have by year?
    6. How often does each country appear as iso_d within a year? Make a table that counts how often each country appears as iso_d per year!

      ## # A tibble: 1,106,870 x 3
      ## # Groups:   year [69]
      ##     year iso_d     n
      ##    <dbl> <chr> <int>
      ##  1  1984 ABW       2
      ##  2  1984 ABW       2
      ##  3  1985 ABW       1
      ##  4  1986 ABW       1
      ##  5  1987 ABW       1
      ##  6  1988 ABW       5
      ##  7  1988 ABW       5
      ##  8  1988 ABW       5
      ##  9  1988 ABW       5
      ## 10  1988 ABW       5
      ## # ... with 1,106,860 more rows
    7. Do all countries trade with each other? How many country pairs would we observe if each country traded with every other country? Produce a graph that illustrates cross-country trade. You could think of a square matrix \(M\) with as many rows and columns as there are unique countries, where rows index origin and columns index destination countries. You could fill the matrix like this, where \(i,j\) index origin and destination country:

      \[ M(i,j) = \begin{cases} 1 & \text{if flow}_{ij}>0 \\ 0 & \text{else.} \end{cases} \] Your graph should visualize this matrix somehow. Make the graph for two years, 1948 and 2016, and compute the share of trading countries in each of them.
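One possible way to build such a matrix is sketched below on a made-up three-country example. The data frame `flows` and its values are invented for illustration; only the column names `iso_o`, `iso_d` and `flow` come from the dataset, and in the exercise you would first subset the data to a single year.

```r
# Hypothetical toy data: one row per origin-destination pair in a given year
flows <- data.frame(
  iso_o = c("A", "A", "B", "C"),
  iso_d = c("B", "C", "A", "A"),
  flow  = c(10, 0, 5, 2),
  stringsAsFactors = FALSE
)

countries <- sort(unique(c(flows$iso_o, flows$iso_d)))
n <- length(countries)

# Start from a matrix of zeros; rows index origins, columns index destinations
M <- matrix(0, n, n, dimnames = list(origin = countries, destination = countries))

# Set M(i, j) = 1 wherever a strictly positive flow is recorded
pos <- flows[flows$flow > 0, ]
M[cbind(pos$iso_o, pos$iso_d)] <- 1
M

# Share of possible ordered pairs (excluding i = j) that actually trade
sum(M) / (n * (n - 1))

# A simple picture of the matrix: filled cells are trading pairs
image(1:n, 1:n, M, col = c("white", "steelblue"), axes = FALSE,
      xlab = "origin", ylab = "destination")
axis(1, at = 1:n, labels = countries)
axis(2, at = 1:n, labels = countries)
```
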

    Gravity

    Compute a new variable called gravity, defined as

    \[ \text{gravity}_{odt} = \frac{GDP_{ot} \cdot GDP_{dt}}{GDP_{wt}\cdot distance_{od}} \]

    where indices \(o,d,t\) stand for origin, destination and year. The index \(w\) means world, i.e. \(GDP_{wt}\) is total GDP summed over all countries in year \(t\). You need to be careful here because some countries don’t have any data in certain years (as we know from above), so there will be missing values. When you prepare this computation, apply the following cleaning protocol to your data:

    1. you need to be careful in computing world gdp. Look back at point 6. above for why. Using dplyr, I would compute world gdp by year first, and then merge it back onto the main dataset.
    2. group the data by year
    3. compute the share of gdp_o and gdp_d in world gdp and drop observations smaller than the first percentile of either share
    4. transform flow into flow/1000 i.e. trade flows in thousand dollars.
    5. compute gravity as above.
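The protocol above can be sketched as follows. This is only an illustration under stated assumptions: the toy data (three countries, two years, all numbers) are invented, and the sketch uses base `R` (`merge`, `aggregate`, `ave`) as a stand-in for the `dplyr` approach suggested in step 1. Only the column names `iso_o`, `iso_d`, `year`, `gdp_o`, `gdp_d`, `distw` and `flow` are taken from the dataset.

```r
# Hypothetical toy data: every ordered pair of 3 countries in 2 years
countries <- c("A", "B", "C")
d <- expand.grid(iso_o = countries, iso_d = countries, year = c(1995, 1996),
                 stringsAsFactors = FALSE)
d <- d[d$iso_o != d$iso_d, ]
gdp <- data.frame(iso  = rep(countries, times = 2),
                  year = rep(c(1995, 1996), each = 3),
                  gdp  = c(100, 50, 10, 110, 55, 12))
d <- merge(d, setNames(gdp, c("iso_o", "year", "gdp_o")), by = c("iso_o", "year"))
d <- merge(d, setNames(gdp, c("iso_d", "year", "gdp_d")), by = c("iso_d", "year"))
dist_lookup <- c(AB = 500, AC = 1000, BC = 2000)
pair <- ifelse(d$iso_o < d$iso_d, paste0(d$iso_o, d$iso_d), paste0(d$iso_d, d$iso_o))
d$distw <- unname(dist_lookup[pair])
d$flow <- 1000 * seq_len(nrow(d))

# Step 1: world GDP by year. Each country shows up in many rows, so reduce to
# one row per country-year before summing, then merge the total back on.
by_country <- unique(d[, c("year", "iso_o", "gdp_o")])
world <- aggregate(gdp_o ~ year, data = by_country, FUN = sum)
names(world)[names(world) == "gdp_o"] <- "gdp_w"
d <- merge(d, world, by = "year")

# Steps 2 and 3: GDP shares, and the first percentile of each share within year
d$share_o <- d$gdp_o / d$gdp_w
d$share_d <- d$gdp_d / d$gdp_w
p1 <- function(s) quantile(s, 0.01, na.rm = TRUE)
keep <- d$share_o >= ave(d$share_o, d$year, FUN = p1) &
        d$share_d >= ave(d$share_d, d$year, FUN = p1)
d <- d[keep, ]

# Step 4: trade flows in thousands
d$flow1000 <- d$flow / 1000

# Step 5: the gravity variable
d$gravity <- d$gdp_o * d$gdp_d / (d$gdp_w * d$distw)
head(d[, c("year", "iso_o", "iso_d", "gdp_w", "flow1000", "gravity")])
```

With real data the first-percentile filter removes the very smallest economies in each year; in a three-country toy example it has little to bite on.
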

    Gravity Regression

    Run a regression of the log of trade flows on the log of gravity, using only data for the year 1995. Interpret the coefficient obtained. In a scatterplot, represent the relationship between the log of trade flows and the log of the gravity prediction, together with the regression line, which is very close to a 45 degree line for the 1995 data. How should we interpret the distance of each point to this 45 degree line?

    ## 
    ## Call:
    ## lm(formula = log(flow1000) ~ log(gravity), data = d95)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -12.836  -1.192   0.162   1.414   8.501 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  -0.847702   0.052725  -16.08   <2e-16 ***
    ## log(gravity)  1.036308   0.005989  173.04   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 2.295 on 17801 degrees of freedom
    ## Multiple R-squared:  0.6272, Adjusted R-squared:  0.6271 
    ## F-statistic: 2.994e+04 on 1 and 17801 DF,  p-value: < 2.2e-16

    How do the slope coefficient estimates vary by year? You could run the above regression for each year, collect the slopes, and plot them against year.
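A minimal sketch of that year-by-year loop, run on simulated data so that it is self-contained. The data frame `sim` and all of its numbers are invented, with a true slope of 1 in every year; in the exercise you would use your cleaned dataset in its place, with the same variable names `flow1000` and `gravity` as in the regression output above.

```r
set.seed(1)

# Hypothetical panel: 3 years, 200 country pairs per year, true slope equal to 1
sim <- data.frame(year = rep(c(1994, 1995, 1996), each = 200),
                  gravity = exp(rnorm(600)))
sim$flow1000 <- exp(-1 + 1 * log(sim$gravity) + rnorm(600, sd = 0.5))

# One regression per year; keep only the slope on log(gravity)
slopes <- sapply(split(sim, sim$year), function(df) {
  coef(lm(log(flow1000) ~ log(gravity), data = df))[["log(gravity)"]]
})
slopes

# Plot the collected slopes against year
plot(as.numeric(names(slopes)), slopes, type = "b",
     xlab = "year", ylab = "slope on log(gravity)")
```
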

    Effect of Free Trade Agreements

    Redo the same scatterplot, but highlight in a different color the pairs of countries engaged in a Free Trade Agreement (fta_wto = 1 for those in the database). Is the effect of agreements clear graphically? I used function dplyr::sample_frac to randomly select 10% of rows from the 1995 data in order to avoid overplotting.

    Investigate this more with Regressions

    Run the following regressions using the 1995 data as above.

    1. A classical gravity equation with only GDPs and distance (in logs) explaining the log of trade flows. That is, instead of the computed gravity variable from above, we include the following variables individually: \[\begin{align} \log(gravity)_{odt} &= \log\left( \frac{GDP_{ot} \cdot GDP_{dt}}{GDP_{wt} \cdot distance_{od}}\right) \\ &= \log(GDP_{ot}) + \log(GDP_{dt}) - \log(GDP_{wt}) - \log(distance_{od}) \end{align}\] Since we use a single year of data, \(\log(GDP_{wt})\) is a constant that is absorbed by the intercept, and so you are supposed to investigate \[ \log \left( \frac{flow_{odt}}{1000} \right) = \log(GDP_{ot}) + \log(GDP_{dt}) - \log(distance_{od}) \]

    2. Introduce the fta_wto dummy variable in that regression. What is the impact of a free trade agreement on expected trade flows? To answer that last question, remember that for a zero-one dummy \(d\), \[\begin{align} \ln y &= a + b d \\ y &= \exp(a + b d) \\ E[y|d=0] &= \exp(a)\\ E[y|d=1] &= \exp(a + b )\\ \Delta E[y|d] &= \exp(a + b ) - \exp(a)\\ \%\Delta E[y|d] &= \frac{\exp(a + b ) - \exp(a)}{\exp(a)}\\ &= e^{a + b - a} - 1 = \exp( b ) - 1 \end{align}\]

    3. Introduce common language and contiguity. Again compute the impact of having a common official language and of being contiguous countries.
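The dummy-variable formula from item 2 can be tried out on simulated data. This is only a sketch: the variables `d` and `log_y` and the true coefficient of 0.5 are invented, and in the exercise `b` would instead be the estimated coefficient on fta_wto (or on the language and contiguity dummies).

```r
set.seed(42)

# Hypothetical data: the log outcome depends on a 0/1 dummy with true b = 0.5
n <- 1000
d <- rbinom(n, 1, 0.5)
log_y <- 2 + 0.5 * d + rnorm(n, sd = 0.1)

fit <- lm(log_y ~ d)
b <- coef(fit)[["d"]]

# Percentage change in expected y when the dummy switches from 0 to 1
100 * (exp(b) - 1)

# For comparison: reading b itself as a percentage understates the effect here
100 * b
```

The gap between the two numbers grows with the size of `b`, which is why the exponential transformation matters for large dummy coefficients.
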

    ================================================ FILE: index.Rmd ================================================ --- title: "Introduction to Econometrics with R" author: "Florian Oswald, Vincent Viers, Jean-Marc Robin, Pierre Villedieu, Gustave Kenedi " date: "`r Sys.Date()`" site: bookdown::bookdown_site output: bookdown::gitbook documentclass: book bibliography: ["packages.bib","book.bib"] biblio-style: apalike link-citations: yes url: 'https\://scpoecon.github.io/ScPoEconometrics/' favicon: "favicon.gif" github-repo: ScPoEcon/ScPoEconometrics description: "SciencesPo UG Econometrics online textbook. Almost no Maths." --- ```{r, setup, include=FALSE} # knitr::opts_chunk$set(comment=ScPoEconometrics:::getprompt(),fig.align = 'center') knitr::opts_chunk$set(fig.align = 'center') ``` # Syllabus {-} ![](ScPo.jpg) Welcome to Introductory Econometrics for 2nd year undergraduates at ScPo! On this page we outline the course and present the Syllabus. 2018/2019 was the first time that we taught this course in this format, so we are in year 3 now. ### Objective {-} We teach this course split over two levels and two semesters: *Introduction* and *Advanced*. Having taken the *Introduction* course is a requirement to enroll in *Advanced*. The *Introduction* course aims to teach you the basics of data analysis needed in a Social Sciences oriented University like SciencesPo. We purposefully start at a level that assumes no prior knowledge about statistics whatsoever. Our objective is to have you understand and be able to interpret linear regression analysis. We will not rely on maths and statistics, but practical learning in order to teach the main concepts. We also add the principal elements of causal inference, such that you will start being able to distinguish between simple statistical correlation and actual causation. 
The *Advanced* course will continue in the semester *after* you have taken the *Introduction* course, following the same philosophy of staying away as much as possible from formal derivations and proofs. We treat important further classical econometric topics like Instrumental Variables, Panel Data, and Discrete Dependent Variables. Towards the end of the course we reserve a good amount of time to give an overview of *Statistical Learning*. We will study and apply important concepts from machine learning in an accessible way.

### Course Structure {-}

Either course is taught in several different groups across various campuses of SciencesPo. All groups will go over the same material, do the same exercises, and will have the same assessments. Groups meet once per week for 2 hours. The main purpose of the weekly meetings is to clarify any questions, and to work together through tutorials. The little theory we need will be covered in this book, and **you are expected to read through this in your own time** before coming to class.

### Introduction Course: Syllabus and Requirements {-}

**Requirements**

The only requirement is that **you bring your own personal computer** to each session. We will be using the free statistical computing language [`R`](https://www.r-project.org) very intensively. Before coming to the first session, please install `R` and `RStudio` as explained at the beginning of chapter \@ref(R-intro).

**Syllabus**

1. Introduction: Chapters 1.1 and 1.2 from this book, Introduction from *Mastering Metrics*, *The Credibility Revolution in Empirical Economics* by Angrist and Pischke (JEP 2010)
2. Summarizing, Visualizing and Tidying Data: Chapter 2 of this book, Chapters 2 and 3 from [ModernDive](https://moderndive.com)
3. Continues with previous session.
4. Simple Linear Regression: Chapter \@ref(linreg) of this book, Chapter 5 of [ModernDive](https://moderndive.com)
5. 
   Introduction to Causality: Chapter \@ref(causality) of this book, Chapter 1 of Mastering Metrics, Potential Outcomes Model in *Causal Inference: The Mixtape* by Scott Cunningham
6. Multiple Linear Regression: Chapter \@ref(multiple-reg)
7. Sampling: Chapter 7 of [ModernDive](https://moderndive.com)
8. Confidence Interval and Hypothesis Testing: Chapters 8 and 9 of [ModernDive](https://moderndive.com)
9. Regression Inference: Chapter \@ref(std-errors) of this book, Chapter 10 of [ModernDive](https://moderndive.com)
10. Differences-in-Differences: Chapter 5 of Mastering Metrics, Card and Krueger (AER 1994)
11. Regression Discontinuity: Chapter 4 of Mastering Metrics, Carpenter and Dobkin (AEJ, Applied, 2009), Imbens and Lemieux (Journal of Econometrics, 2008), Lee and Lemieux (JEL 2010)
12. Review Session

### Advanced Course: Syllabus and Requirements {-}

**Requirements**

*You must have taken the Intro course before, or a course with a similar syllabus at your home institution.*

**Syllabus**

1. Logistics, Organisation, Recap 1 from Intro Course
2. Recap 2 from Intro Course
3. Intro to `data.table`
4. Instrumental Variables and Causality 1
5. Instrumental Variables and Causality 2
6. Instrumental Variables and Causality 3
7. Panel Data: What, How and Why?
8. Discrete Outcomes: Logit and Probit
9. Intro to Statistical Learning 1: Taxonomy and Intro to Machine Learning
9. Intro to Statistical Learning 2: Model Validation
10. Intro to Statistical Learning 3: Unsupervised Learning

Session 11: Recap / Buffer 1

Session 11: Recap / Buffer 2

### Slides {-}

**Introductory Level**

There are slides for each book chapter at a [dedicated website](https://github.com/ScPoEcon/ScPoEconometrics-Slides).

**Advanced Level**

We host slides [here](https://github.com/ScPoEcon/Advanced-Metrics-slides).

### This Book and Other Material {-}

What you are looking at is an online textbook.
You can therefore look at it in your browser (as you are doing just now), on your mobile phone or tablet, but you can also download it as a `pdf` file or as an `epub` file for your ebook-reader. We don't have any ambition to actually produce and publish a *book* for now, so you should just see this as a way to disseminate our lecture notes to you.

Besides the book, the second part of the course material is an extensive suite of tutorials and interactive demonstrations, all contained in the `R` package that is associated with this book and that you will install in chapter 1.

### Open Source {-}

The book and all other content for this course are hosted under an open source license on GitHub. You can contribute to the book by just clicking on the appropriate *edit* symbol in the top bar of this page. Other teachers who want to use our material can freely do so, observing the terms of the license on the [GitHub repository](https://github.com/ScPoEcon/ScPoEconometrics).

### Assessments {-}

We will assess participation in class, quizzes on Moodle, and take-home exams.

### Communication {-}

We will communicate exclusively on our Slack group. You will get an invitation email to join from your instructor in due course.

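Coming back to the requirements above: if you want to check your setup before the first session, the lines below are a minimal sketch of what you could type into the `R` console. It only assumes that `R` itself is installed. The package name `ScPoEconometrics` is taken from the name of the course repository, and the variable names `r_version` and `has_pkg` are made up for this illustration.

```r
# Which version of R is running? R.version.string is part of base R.
r_version <- R.version.string
print(r_version)

# Is the course package already installed? requireNamespace() checks this
# without attaching the package. We assume it is called "ScPoEconometrics".
has_pkg <- requireNamespace("ScPoEconometrics", quietly = TRUE)

if (has_pkg) {
  message("The course package is installed.")
} else {
  message("The course package is not installed yet; chapter 1 explains how to get it.")
}
```

If the second message appears, nothing is wrong: it simply means that you have not yet worked through the installation step in chapter 1.
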
================================================ FILE: inst/CITATION ================================================ rref <- bibentry( bibtype = "Manual", title = "Introduction to Econometrics with R", author = c(person("Florian", "Oswald", role=c("aut","cre")), person("Jean-Marc", "Robin", role=c("ctb")), person("Vincent", "Viers", role=c("aut","ctb"))), organization = "SciencesPo, Department of Economics", address = "Paris, France", year = "2018", url = "https://scpoecon.github.io/ScPoEconometrics/") ================================================ FILE: inst/datasets/airline-safety.csv ================================================ "airline","avail_seat_km_per_week","type","value","period" "Aer Lingus",320906734,"incidents",2,"1985_1999" "Aeroflot*",1197672318,"incidents",76,"1985_1999" "Aerolineas Argentinas",385803648,"incidents",6,"1985_1999" "Aeromexico*",596871813,"incidents",3,"1985_1999" "Air Canada",1865253802,"incidents",2,"1985_1999" "Air France",3004002661,"incidents",14,"1985_1999" "Air India*",869253552,"incidents",2,"1985_1999" "Air New Zealand*",710174817,"incidents",3,"1985_1999" "Alaska Airlines*",965346773,"incidents",5,"1985_1999" "Alitalia",698012498,"incidents",7,"1985_1999" "All Nippon Airways",1841234177,"incidents",3,"1985_1999" "American*",5228357340,"incidents",21,"1985_1999" "Austrian Airlines",358239823,"incidents",1,"1985_1999" "Avianca",396922563,"incidents",5,"1985_1999" "British Airways*",3179760952,"incidents",4,"1985_1999" "Cathay Pacific*",2582459303,"incidents",0,"1985_1999" "China Airlines",813216487,"incidents",12,"1985_1999" "Condor",417982610,"incidents",2,"1985_1999" "COPA",550491507,"incidents",3,"1985_1999" "Delta / Northwest*",6525658894,"incidents",24,"1985_1999" "Egyptair",557699891,"incidents",8,"1985_1999" "El Al",335448023,"incidents",1,"1985_1999" "Ethiopian Airlines",488560643,"incidents",25,"1985_1999" "Finnair",506464950,"incidents",1,"1985_1999" "Garuda Indonesia",613356665,"incidents",10,"1985_1999" "Gulf 
Air",301379762,"incidents",1,"1985_1999" "Hawaiian Airlines",493877795,"incidents",0,"1985_1999" "Iberia",1173203126,"incidents",4,"1985_1999" "Japan Airlines",1574217531,"incidents",3,"1985_1999" "Kenya Airways",277414794,"incidents",2,"1985_1999" "KLM*",1874561773,"incidents",7,"1985_1999" "Korean Air",1734522605,"incidents",12,"1985_1999" "LAN Airlines",1001965891,"incidents",3,"1985_1999" "Lufthansa*",3426529504,"incidents",6,"1985_1999" "Malaysia Airlines",1039171244,"incidents",3,"1985_1999" "Pakistan International",348563137,"incidents",8,"1985_1999" "Philippine Airlines",413007158,"incidents",7,"1985_1999" "Qantas*",1917428984,"incidents",1,"1985_1999" "Royal Air Maroc",295705339,"incidents",5,"1985_1999" "SAS*",682971852,"incidents",5,"1985_1999" "Saudi Arabian",859673901,"incidents",7,"1985_1999" "Singapore Airlines",2376857805,"incidents",2,"1985_1999" "South African",651502442,"incidents",2,"1985_1999" "Southwest Airlines",3276525770,"incidents",1,"1985_1999" "Sri Lankan / AirLanka",325582976,"incidents",2,"1985_1999" "SWISS*",792601299,"incidents",2,"1985_1999" "TACA",259373346,"incidents",3,"1985_1999" "TAM",1509195646,"incidents",8,"1985_1999" "TAP - Air Portugal",619130754,"incidents",0,"1985_1999" "Thai Airways",1702802250,"incidents",8,"1985_1999" "Turkish Airlines",1946098294,"incidents",8,"1985_1999" "United / Continental*",7139291291,"incidents",19,"1985_1999" "US Airways / America West*",2455687887,"incidents",16,"1985_1999" "Vietnam Airlines",625084918,"incidents",7,"1985_1999" "Virgin Atlantic",1005248585,"incidents",1,"1985_1999" "Xiamen Airlines",430462962,"incidents",9,"1985_1999" "Aer Lingus",320906734,"fatal_accidents",0,"1985_1999" "Aeroflot*",1197672318,"fatal_accidents",14,"1985_1999" "Aerolineas Argentinas",385803648,"fatal_accidents",0,"1985_1999" "Aeromexico*",596871813,"fatal_accidents",1,"1985_1999" "Air Canada",1865253802,"fatal_accidents",0,"1985_1999" "Air France",3004002661,"fatal_accidents",4,"1985_1999" "Air 
India*",869253552,"fatal_accidents",1,"1985_1999" "Air New Zealand*",710174817,"fatal_accidents",0,"1985_1999" "Alaska Airlines*",965346773,"fatal_accidents",0,"1985_1999" "Alitalia",698012498,"fatal_accidents",2,"1985_1999" "All Nippon Airways",1841234177,"fatal_accidents",1,"1985_1999" "American*",5228357340,"fatal_accidents",5,"1985_1999" "Austrian Airlines",358239823,"fatal_accidents",0,"1985_1999" "Avianca",396922563,"fatal_accidents",3,"1985_1999" "British Airways*",3179760952,"fatal_accidents",0,"1985_1999" "Cathay Pacific*",2582459303,"fatal_accidents",0,"1985_1999" "China Airlines",813216487,"fatal_accidents",6,"1985_1999" "Condor",417982610,"fatal_accidents",1,"1985_1999" "COPA",550491507,"fatal_accidents",1,"1985_1999" "Delta / Northwest*",6525658894,"fatal_accidents",12,"1985_1999" "Egyptair",557699891,"fatal_accidents",3,"1985_1999" "El Al",335448023,"fatal_accidents",1,"1985_1999" "Ethiopian Airlines",488560643,"fatal_accidents",5,"1985_1999" "Finnair",506464950,"fatal_accidents",0,"1985_1999" "Garuda Indonesia",613356665,"fatal_accidents",3,"1985_1999" "Gulf Air",301379762,"fatal_accidents",0,"1985_1999" "Hawaiian Airlines",493877795,"fatal_accidents",0,"1985_1999" "Iberia",1173203126,"fatal_accidents",1,"1985_1999" "Japan Airlines",1574217531,"fatal_accidents",1,"1985_1999" "Kenya Airways",277414794,"fatal_accidents",0,"1985_1999" "KLM*",1874561773,"fatal_accidents",1,"1985_1999" "Korean Air",1734522605,"fatal_accidents",5,"1985_1999" "LAN Airlines",1001965891,"fatal_accidents",2,"1985_1999" "Lufthansa*",3426529504,"fatal_accidents",1,"1985_1999" "Malaysia Airlines",1039171244,"fatal_accidents",1,"1985_1999" "Pakistan International",348563137,"fatal_accidents",3,"1985_1999" "Philippine Airlines",413007158,"fatal_accidents",4,"1985_1999" "Qantas*",1917428984,"fatal_accidents",0,"1985_1999" "Royal Air Maroc",295705339,"fatal_accidents",3,"1985_1999" "SAS*",682971852,"fatal_accidents",0,"1985_1999" "Saudi 
Arabian",859673901,"fatal_accidents",2,"1985_1999" "Singapore Airlines",2376857805,"fatal_accidents",2,"1985_1999" "South African",651502442,"fatal_accidents",1,"1985_1999" "Southwest Airlines",3276525770,"fatal_accidents",0,"1985_1999" "Sri Lankan / AirLanka",325582976,"fatal_accidents",1,"1985_1999" "SWISS*",792601299,"fatal_accidents",1,"1985_1999" "TACA",259373346,"fatal_accidents",1,"1985_1999" "TAM",1509195646,"fatal_accidents",3,"1985_1999" "TAP - Air Portugal",619130754,"fatal_accidents",0,"1985_1999" "Thai Airways",1702802250,"fatal_accidents",4,"1985_1999" "Turkish Airlines",1946098294,"fatal_accidents",3,"1985_1999" "United / Continental*",7139291291,"fatal_accidents",8,"1985_1999" "US Airways / America West*",2455687887,"fatal_accidents",7,"1985_1999" "Vietnam Airlines",625084918,"fatal_accidents",3,"1985_1999" "Virgin Atlantic",1005248585,"fatal_accidents",0,"1985_1999" "Xiamen Airlines",430462962,"fatal_accidents",1,"1985_1999" "Aer Lingus",320906734,"fatalities",0,"1985_1999" "Aeroflot*",1197672318,"fatalities",128,"1985_1999" "Aerolineas Argentinas",385803648,"fatalities",0,"1985_1999" "Aeromexico*",596871813,"fatalities",64,"1985_1999" "Air Canada",1865253802,"fatalities",0,"1985_1999" "Air France",3004002661,"fatalities",79,"1985_1999" "Air India*",869253552,"fatalities",329,"1985_1999" "Air New Zealand*",710174817,"fatalities",0,"1985_1999" "Alaska Airlines*",965346773,"fatalities",0,"1985_1999" "Alitalia",698012498,"fatalities",50,"1985_1999" "All Nippon Airways",1841234177,"fatalities",1,"1985_1999" "American*",5228357340,"fatalities",101,"1985_1999" "Austrian Airlines",358239823,"fatalities",0,"1985_1999" "Avianca",396922563,"fatalities",323,"1985_1999" "British Airways*",3179760952,"fatalities",0,"1985_1999" "Cathay Pacific*",2582459303,"fatalities",0,"1985_1999" "China Airlines",813216487,"fatalities",535,"1985_1999" "Condor",417982610,"fatalities",16,"1985_1999" "COPA",550491507,"fatalities",47,"1985_1999" "Delta / 
Northwest*",6525658894,"fatalities",407,"1985_1999" "Egyptair",557699891,"fatalities",282,"1985_1999" "El Al",335448023,"fatalities",4,"1985_1999" "Ethiopian Airlines",488560643,"fatalities",167,"1985_1999" "Finnair",506464950,"fatalities",0,"1985_1999" "Garuda Indonesia",613356665,"fatalities",260,"1985_1999" "Gulf Air",301379762,"fatalities",0,"1985_1999" "Hawaiian Airlines",493877795,"fatalities",0,"1985_1999" "Iberia",1173203126,"fatalities",148,"1985_1999" "Japan Airlines",1574217531,"fatalities",520,"1985_1999" "Kenya Airways",277414794,"fatalities",0,"1985_1999" "KLM*",1874561773,"fatalities",3,"1985_1999" "Korean Air",1734522605,"fatalities",425,"1985_1999" "LAN Airlines",1001965891,"fatalities",21,"1985_1999" "Lufthansa*",3426529504,"fatalities",2,"1985_1999" "Malaysia Airlines",1039171244,"fatalities",34,"1985_1999" "Pakistan International",348563137,"fatalities",234,"1985_1999" "Philippine Airlines",413007158,"fatalities",74,"1985_1999" "Qantas*",1917428984,"fatalities",0,"1985_1999" "Royal Air Maroc",295705339,"fatalities",51,"1985_1999" "SAS*",682971852,"fatalities",0,"1985_1999" "Saudi Arabian",859673901,"fatalities",313,"1985_1999" "Singapore Airlines",2376857805,"fatalities",6,"1985_1999" "South African",651502442,"fatalities",159,"1985_1999" "Southwest Airlines",3276525770,"fatalities",0,"1985_1999" "Sri Lankan / AirLanka",325582976,"fatalities",14,"1985_1999" "SWISS*",792601299,"fatalities",229,"1985_1999" "TACA",259373346,"fatalities",3,"1985_1999" "TAM",1509195646,"fatalities",98,"1985_1999" "TAP - Air Portugal",619130754,"fatalities",0,"1985_1999" "Thai Airways",1702802250,"fatalities",308,"1985_1999" "Turkish Airlines",1946098294,"fatalities",64,"1985_1999" "United / Continental*",7139291291,"fatalities",319,"1985_1999" "US Airways / America West*",2455687887,"fatalities",224,"1985_1999" "Vietnam Airlines",625084918,"fatalities",171,"1985_1999" "Virgin Atlantic",1005248585,"fatalities",0,"1985_1999" "Xiamen 
Airlines",430462962,"fatalities",82,"1985_1999" "Aer Lingus",320906734,"incidents",0,"2000_2014" "Aeroflot*",1197672318,"incidents",6,"2000_2014" "Aerolineas Argentinas",385803648,"incidents",1,"2000_2014" "Aeromexico*",596871813,"incidents",5,"2000_2014" "Air Canada",1865253802,"incidents",2,"2000_2014" "Air France",3004002661,"incidents",6,"2000_2014" "Air India*",869253552,"incidents",4,"2000_2014" "Air New Zealand*",710174817,"incidents",5,"2000_2014" "Alaska Airlines*",965346773,"incidents",5,"2000_2014" "Alitalia",698012498,"incidents",4,"2000_2014" "All Nippon Airways",1841234177,"incidents",7,"2000_2014" "American*",5228357340,"incidents",17,"2000_2014" "Austrian Airlines",358239823,"incidents",1,"2000_2014" "Avianca",396922563,"incidents",0,"2000_2014" "British Airways*",3179760952,"incidents",6,"2000_2014" "Cathay Pacific*",2582459303,"incidents",2,"2000_2014" "China Airlines",813216487,"incidents",2,"2000_2014" "Condor",417982610,"incidents",0,"2000_2014" "COPA",550491507,"incidents",0,"2000_2014" "Delta / Northwest*",6525658894,"incidents",24,"2000_2014" "Egyptair",557699891,"incidents",4,"2000_2014" "El Al",335448023,"incidents",1,"2000_2014" "Ethiopian Airlines",488560643,"incidents",5,"2000_2014" "Finnair",506464950,"incidents",0,"2000_2014" "Garuda Indonesia",613356665,"incidents",4,"2000_2014" "Gulf Air",301379762,"incidents",3,"2000_2014" "Hawaiian Airlines",493877795,"incidents",1,"2000_2014" "Iberia",1173203126,"incidents",5,"2000_2014" "Japan Airlines",1574217531,"incidents",0,"2000_2014" "Kenya Airways",277414794,"incidents",2,"2000_2014" "KLM*",1874561773,"incidents",1,"2000_2014" "Korean Air",1734522605,"incidents",1,"2000_2014" "LAN Airlines",1001965891,"incidents",0,"2000_2014" "Lufthansa*",3426529504,"incidents",3,"2000_2014" "Malaysia Airlines",1039171244,"incidents",3,"2000_2014" "Pakistan International",348563137,"incidents",10,"2000_2014" "Philippine Airlines",413007158,"incidents",2,"2000_2014" 
"Qantas*",1917428984,"incidents",5,"2000_2014" "Royal Air Maroc",295705339,"incidents",3,"2000_2014" "SAS*",682971852,"incidents",6,"2000_2014" "Saudi Arabian",859673901,"incidents",11,"2000_2014" "Singapore Airlines",2376857805,"incidents",2,"2000_2014" "South African",651502442,"incidents",1,"2000_2014" "Southwest Airlines",3276525770,"incidents",8,"2000_2014" "Sri Lankan / AirLanka",325582976,"incidents",4,"2000_2014" "SWISS*",792601299,"incidents",3,"2000_2014" "TACA",259373346,"incidents",1,"2000_2014" "TAM",1509195646,"incidents",7,"2000_2014" "TAP - Air Portugal",619130754,"incidents",0,"2000_2014" "Thai Airways",1702802250,"incidents",2,"2000_2014" "Turkish Airlines",1946098294,"incidents",8,"2000_2014" "United / Continental*",7139291291,"incidents",14,"2000_2014" "US Airways / America West*",2455687887,"incidents",11,"2000_2014" "Vietnam Airlines",625084918,"incidents",1,"2000_2014" "Virgin Atlantic",1005248585,"incidents",0,"2000_2014" "Xiamen Airlines",430462962,"incidents",2,"2000_2014" "Aer Lingus",320906734,"fatal_accidents",0,"2000_2014" "Aeroflot*",1197672318,"fatal_accidents",1,"2000_2014" "Aerolineas Argentinas",385803648,"fatal_accidents",0,"2000_2014" "Aeromexico*",596871813,"fatal_accidents",0,"2000_2014" "Air Canada",1865253802,"fatal_accidents",0,"2000_2014" "Air France",3004002661,"fatal_accidents",2,"2000_2014" "Air India*",869253552,"fatal_accidents",1,"2000_2014" "Air New Zealand*",710174817,"fatal_accidents",1,"2000_2014" "Alaska Airlines*",965346773,"fatal_accidents",1,"2000_2014" "Alitalia",698012498,"fatal_accidents",0,"2000_2014" "All Nippon Airways",1841234177,"fatal_accidents",0,"2000_2014" "American*",5228357340,"fatal_accidents",3,"2000_2014" "Austrian Airlines",358239823,"fatal_accidents",0,"2000_2014" "Avianca",396922563,"fatal_accidents",0,"2000_2014" "British Airways*",3179760952,"fatal_accidents",0,"2000_2014" "Cathay Pacific*",2582459303,"fatal_accidents",0,"2000_2014" "China 
Airlines",813216487,"fatal_accidents",1,"2000_2014" "Condor",417982610,"fatal_accidents",0,"2000_2014" "COPA",550491507,"fatal_accidents",0,"2000_2014" "Delta / Northwest*",6525658894,"fatal_accidents",2,"2000_2014" "Egyptair",557699891,"fatal_accidents",1,"2000_2014" "El Al",335448023,"fatal_accidents",0,"2000_2014" "Ethiopian Airlines",488560643,"fatal_accidents",2,"2000_2014" "Finnair",506464950,"fatal_accidents",0,"2000_2014" "Garuda Indonesia",613356665,"fatal_accidents",2,"2000_2014" "Gulf Air",301379762,"fatal_accidents",1,"2000_2014" "Hawaiian Airlines",493877795,"fatal_accidents",0,"2000_2014" "Iberia",1173203126,"fatal_accidents",0,"2000_2014" "Japan Airlines",1574217531,"fatal_accidents",0,"2000_2014" "Kenya Airways",277414794,"fatal_accidents",2,"2000_2014" "KLM*",1874561773,"fatal_accidents",0,"2000_2014" "Korean Air",1734522605,"fatal_accidents",0,"2000_2014" "LAN Airlines",1001965891,"fatal_accidents",0,"2000_2014" "Lufthansa*",3426529504,"fatal_accidents",0,"2000_2014" "Malaysia Airlines",1039171244,"fatal_accidents",2,"2000_2014" "Pakistan International",348563137,"fatal_accidents",2,"2000_2014" "Philippine Airlines",413007158,"fatal_accidents",1,"2000_2014" "Qantas*",1917428984,"fatal_accidents",0,"2000_2014" "Royal Air Maroc",295705339,"fatal_accidents",0,"2000_2014" "SAS*",682971852,"fatal_accidents",1,"2000_2014" "Saudi Arabian",859673901,"fatal_accidents",0,"2000_2014" "Singapore Airlines",2376857805,"fatal_accidents",1,"2000_2014" "South African",651502442,"fatal_accidents",0,"2000_2014" "Southwest Airlines",3276525770,"fatal_accidents",0,"2000_2014" "Sri Lankan / AirLanka",325582976,"fatal_accidents",0,"2000_2014" "SWISS*",792601299,"fatal_accidents",0,"2000_2014" "TACA",259373346,"fatal_accidents",1,"2000_2014" "TAM",1509195646,"fatal_accidents",2,"2000_2014" "TAP - Air Portugal",619130754,"fatal_accidents",0,"2000_2014" "Thai Airways",1702802250,"fatal_accidents",1,"2000_2014" "Turkish Airlines",1946098294,"fatal_accidents",2,"2000_2014" 
"United / Continental*",7139291291,"fatal_accidents",2,"2000_2014" "US Airways / America West*",2455687887,"fatal_accidents",2,"2000_2014" "Vietnam Airlines",625084918,"fatal_accidents",0,"2000_2014" "Virgin Atlantic",1005248585,"fatal_accidents",0,"2000_2014" "Xiamen Airlines",430462962,"fatal_accidents",0,"2000_2014" "Aer Lingus",320906734,"fatalities",0,"2000_2014" "Aeroflot*",1197672318,"fatalities",88,"2000_2014" "Aerolineas Argentinas",385803648,"fatalities",0,"2000_2014" "Aeromexico*",596871813,"fatalities",0,"2000_2014" "Air Canada",1865253802,"fatalities",0,"2000_2014" "Air France",3004002661,"fatalities",337,"2000_2014" "Air India*",869253552,"fatalities",158,"2000_2014" "Air New Zealand*",710174817,"fatalities",7,"2000_2014" "Alaska Airlines*",965346773,"fatalities",88,"2000_2014" "Alitalia",698012498,"fatalities",0,"2000_2014" "All Nippon Airways",1841234177,"fatalities",0,"2000_2014" "American*",5228357340,"fatalities",416,"2000_2014" "Austrian Airlines",358239823,"fatalities",0,"2000_2014" "Avianca",396922563,"fatalities",0,"2000_2014" "British Airways*",3179760952,"fatalities",0,"2000_2014" "Cathay Pacific*",2582459303,"fatalities",0,"2000_2014" "China Airlines",813216487,"fatalities",225,"2000_2014" "Condor",417982610,"fatalities",0,"2000_2014" "COPA",550491507,"fatalities",0,"2000_2014" "Delta / Northwest*",6525658894,"fatalities",51,"2000_2014" "Egyptair",557699891,"fatalities",14,"2000_2014" "El Al",335448023,"fatalities",0,"2000_2014" "Ethiopian Airlines",488560643,"fatalities",92,"2000_2014" "Finnair",506464950,"fatalities",0,"2000_2014" "Garuda Indonesia",613356665,"fatalities",22,"2000_2014" "Gulf Air",301379762,"fatalities",143,"2000_2014" "Hawaiian Airlines",493877795,"fatalities",0,"2000_2014" "Iberia",1173203126,"fatalities",0,"2000_2014" "Japan Airlines",1574217531,"fatalities",0,"2000_2014" "Kenya Airways",277414794,"fatalities",283,"2000_2014" "KLM*",1874561773,"fatalities",0,"2000_2014" "Korean 
Air",1734522605,"fatalities",0,"2000_2014" "LAN Airlines",1001965891,"fatalities",0,"2000_2014" "Lufthansa*",3426529504,"fatalities",0,"2000_2014" "Malaysia Airlines",1039171244,"fatalities",537,"2000_2014" "Pakistan International",348563137,"fatalities",46,"2000_2014" "Philippine Airlines",413007158,"fatalities",1,"2000_2014" "Qantas*",1917428984,"fatalities",0,"2000_2014" "Royal Air Maroc",295705339,"fatalities",0,"2000_2014" "SAS*",682971852,"fatalities",110,"2000_2014" "Saudi Arabian",859673901,"fatalities",0,"2000_2014" "Singapore Airlines",2376857805,"fatalities",83,"2000_2014" "South African",651502442,"fatalities",0,"2000_2014" "Southwest Airlines",3276525770,"fatalities",0,"2000_2014" "Sri Lankan / AirLanka",325582976,"fatalities",0,"2000_2014" "SWISS*",792601299,"fatalities",0,"2000_2014" "TACA",259373346,"fatalities",3,"2000_2014" "TAM",1509195646,"fatalities",188,"2000_2014" "TAP - Air Portugal",619130754,"fatalities",0,"2000_2014" "Thai Airways",1702802250,"fatalities",1,"2000_2014" "Turkish Airlines",1946098294,"fatalities",84,"2000_2014" "United / Continental*",7139291291,"fatalities",109,"2000_2014" "US Airways / America West*",2455687887,"fatalities",23,"2000_2014" "Vietnam Airlines",625084918,"fatalities",0,"2000_2014" "Virgin Atlantic",1005248585,"fatalities",0,"2000_2014" "Xiamen Airlines",430462962,"fatalities",0,"2000_2014" ================================================ FILE: inst/datasets/corr50.csv ================================================ -1.5769,-0.107 -0.4231,5.72 1.2308,-2.6454 1.2308,1.2776 2.2692,5.72 4.1154,1.2776 4.0385,-1.8954 5.3462,8.893 4.4231,8.3738 5.0385,6.9892 6.1923,4.5661 5.9615,0.4123 7.4615,4.8546 7.5385,6.9892 9.1923,6.1238 3.8846,3.9892 2.3462,1.2776 8.7692,5.0276 8.7692,7.5084 -0.6923,1.7969 ================================================ FILE: inst/datasets/example-data.csv ================================================ "x","y","z" 1,"Hello",TRUE 3,"Hello",FALSE 5,"Hello",TRUE 7,"Hello",FALSE 
9,"Hello",TRUE 1,"Hello",FALSE 3,"Hello",TRUE 5,"Hello",FALSE 7,"Hello",TRUE 9,"Goodbye",FALSE ================================================ FILE: packages.bib ================================================ @Manual{R-Ecdat, title = {Ecdat: Data Sets for Econometrics}, author = {Yves Croissant}, year = {2016}, note = {R package version 0.3-1}, url = {https://CRAN.R-project.org/package=Ecdat}, } @Manual{R-Ecfun, title = {Ecfun: Functions for Ecdat}, author = {Spencer Graves}, year = {2016}, note = {R package version 0.1-7}, url = {https://CRAN.R-project.org/package=Ecfun}, } @Manual{R-ScPoEconometrics, title = {ScPoEconometrics: ScPoEconometrics}, author = {Florian Oswald}, year = {2018}, note = {R package version 0.1.8}, url = {https://github.com/ScPoEcon/ScPoEconometrics}, } @Manual{R-base, title = {R: A Language and Environment for Statistical Computing}, author = {{R Core Team}}, organization = {R Foundation for Statistical Computing}, address = {Vienna, Austria}, year = {2018}, url = {https://www.R-project.org/}, } @Manual{R-bindrcpp, title = {bindrcpp: An 'Rcpp' Interface to Active Bindings}, author = {Kirill Müller}, year = {2018}, note = {R package version 0.2.2}, url = {https://CRAN.R-project.org/package=bindrcpp}, } @Manual{R-bookdown, title = {bookdown: Authoring Books and Technical Documents with R Markdown}, author = {Yihui Xie}, year = {2018}, note = {R package version 0.7}, url = {https://CRAN.R-project.org/package=bookdown}, } @Manual{R-dplyr, title = {dplyr: A Grammar of Data Manipulation}, author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller}, year = {2018}, note = {R package version 0.7.6}, url = {https://CRAN.R-project.org/package=dplyr}, } @Manual{R-ggplot2, title = {ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics}, author = {Hadley Wickham and Winston Chang and Lionel Henry and Thomas Lin Pedersen and Kohske Takahashi and Claus Wilke and Kara Woo}, year = {2018}, note = {R package version 
3.0.0}, url = {https://CRAN.R-project.org/package=ggplot2}, } @Manual{R-knitr, title = {knitr: A General-Purpose Package for Dynamic Report Generation in R}, author = {Yihui Xie}, year = {2018}, note = {R package version 1.20}, url = {https://CRAN.R-project.org/package=knitr}, } @Manual{R-mvtnorm, title = {mvtnorm: Multivariate Normal and t Distributions}, author = {Alan Genz and Frank Bretz and Tetsuhisa Miwa and Xuefei Mi and Torsten Hothorn}, year = {2018}, note = {R package version 1.0-8}, url = {https://CRAN.R-project.org/package=mvtnorm}, } @Manual{R-plotly, title = {plotly: Create Interactive Web Graphics via 'plotly.js'}, author = {Carson Sievert and Chris Parmer and Toby Hocking and Scott Chamberlain and Karthik Ram and Marianne Corvellec and Pedro Despouy}, year = {2017}, note = {R package version 4.7.1}, url = {https://CRAN.R-project.org/package=plotly}, } @Manual{R-readr, title = {readr: Read Rectangular Text Data}, author = {Hadley Wickham and Jim Hester and Romain Francois}, year = {2017}, note = {R package version 1.1.1}, url = {https://CRAN.R-project.org/package=readr}, } @Manual{R-readxl, title = {readxl: Read Excel Files}, author = {Hadley Wickham and Jennifer Bryan}, year = {2018}, note = {R package version 1.1.0}, url = {https://CRAN.R-project.org/package=readxl}, } @Manual{R-reshape2, title = {reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package}, author = {Hadley Wickham}, year = {2017}, note = {R package version 1.4.3}, url = {https://CRAN.R-project.org/package=reshape2}, } @Manual{R-rmarkdown, title = {rmarkdown: Dynamic Documents for R}, author = {JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang}, year = {2018}, note = {R package version 1.10}, url = {https://CRAN.R-project.org/package=rmarkdown}, } @Manual{R-tidyr, title = {tidyr: Easily Tidy Data with 'spread()' and 'gather()' Functions}, author = {Hadley Wickham and Lionel 
Henry},
  year = {2018},
  note = {R package version 0.8.1},
  url = {https://CRAN.R-project.org/package=tidyr},
}

================================================
FILE: preamble.tex
================================================
\usepackage{tcolorbox}
\usepackage{booktabs}
\usepackage{amsthm}
\newenvironment{note}{\begin{tcolorbox}[colback=blue!5!white,colframe=blue!75!black]}{\end{tcolorbox}}
\newenvironment{notel}{\begin{tcolorbox}[colback=blue!5!white,colframe=blue!75!black]}{\end{tcolorbox}}
\newenvironment{warning}{\begin{tcolorbox}[colback=orange!5!white,colframe=orange]}{\end{tcolorbox}}
\newenvironment{warningl}{\begin{tcolorbox}[colback=orange!5!white,colframe=orange]}{\end{tcolorbox}}
\newenvironment{tip}{\begin{tcolorbox}[colback=green!5!white,colframe=green]}{\end{tcolorbox}}
\makeatletter
\def\thm@space@setup{%
  \thm@preskip=8pt plus 2pt minus 4pt
  \thm@postskip=\thm@preskip
}
\makeatother

================================================
FILE: previous_travis.yml
================================================
language: r
os:
  - linux
  - osx

before_install:
  # - if [ $TRAVIS_OS_NAME = linux ]; then sudo apt-get update; fi
  - if [ $TRAVIS_OS_NAME = linux ]; then sudo apt-get install -y ghostscript; sudo apt-get install -y libmagick++-dev; sudo add-apt-repository -y ppa:cran/poppler;sudo apt-get install -y libpoppler-cpp-dev; sudo apt-get install -y libv8-dev ; sudo apt-get install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev libfontconfig1-dev;fi
  - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install llvm; brew install v8; brew install poppler; export PATH="/usr/local/opt/llvm/bin:$PATH" && export LDFLAGS="-L/usr/local/opt/llvm/lib" && export CFLAGS="-I/usr/local/opt/llvm/include"; fi

cache:
  packages: yes
  directories:
    - $TRAVIS_BUILD_DIR/_bookdown_files

sudo: false

pandoc_version: 1.19.2.1

before_script:
  - chmod +x ./_build.sh
  - chmod +x ./_deploy.sh
  - if [ $TRAVIS_OS_NAME = osx ]; then brew tap homebrew/cask; brew cask install phantomJS; brew
    install imagemagick@6; fi

script:
  - R CMD build .
  - R CMD INSTALL *tar.gz
  - if [ $TRAVIS_OS_NAME = osx ]; then R CMD check *tar.gz ; fi
  - if [ $TRAVIS_OS_NAME = linux ]; then R CMD check *tar.gz; fi
  - if [ $TRAVIS_OS_NAME = osx ] && [[ $TRAVIS_COMMIT_MESSAGE != *"[nobook]"* ]]; then ./_build.sh && ./_deploy.sh; fi

================================================
FILE: style.css
================================================
p.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
}
pre code {
  white-space: inherit;
}

/*
 * Admonitions
 *
 * Colors (title, body)
 * warning: #f0b37e #ffedcc (orange)
 * note: #6ab0de #e7f2fa (blue)
 * tip: #1abc9c #dbfaf4 (green)
 */

.note {
  padding: 0.5em;
  background-color: #e7f2fa;
  border-radius: 5px;
  text-align: center;
}

.notel {
  padding: 0.5em;
  background-color: #e7f2fa;
  border-radius: 5px;
  text-align: left;
}

.warning {
  padding: 0.5em;
  background-color: #f0b37e;
  border-radius: 5px;
  text-align: center;
}

.warningl {
  padding: 0.5em;
  background-color: #f0b37e;
  border-radius: 5px;
  text-align: left;
}

.tip {
  padding: 0.5em;
  background-color: #dbfaf4;
  border-radius: 5px;
  text-align: center;
}

================================================
FILE: teachers/ForTeachers.md
================================================
# Meta Info For Teachers

This document contains info for teachers (at SciencesPo and elsewhere) who want to teach this course.

## Content

1. [Outline and Philosophy](#outline-and-philosophy)
2. [Details](#details)
3. [TODO list teachers](#todo-list-teachers)
4. [Student/Teacher feedback from first iteration of course](#student-and-teacher-feedback)

## Outline and Philosophy

* This is an introductory course to econometrics taught to 2nd year students at SciencesPo
* The course is mandatory for the Economics and Society major.
* Based on our experience teaching this course for many years, the traditional setup of teaching econometrics was found to be unsuitable.
* The traditional curriculum assumes some basic maths knowledge, summation notation for example, as well as some basic statistics.
* Both maths and stats are taught in the first year.
* It seems that for many students this is too abstract (or not interesting).
* The distribution of student evaluations was always *bimodal*: some students thought it was great, but didn’t go far enough, and a relatively larger number thought it was much too hard and they didn’t get much out of it.
* This edition of the course uses only minimal maths and statistics
* We are focusing on the lower mode of the above mentioned student evaluations population.
* We will use `R` to illustrate key concepts interactively.
* **Important**: this is not a course *about `R`*, in the sense that our primary goal is not to teach students how to program. (This is a very laudable goal in general, but we are constrained in this sense.)
* Our primary goal is for students to understand the basics of linear regression, *using `R`*. They will be exposed to some very basic `R` programming.

## Details

* course structure
    * material: Everything the students need is contained in an [online code repository](https://github.com/ScPoEcon/ScPoEconometrics). In particular, this contains an `R` package with
        * code that produces interactive `apps`, i.e. small server applications, used for illustration
        * `tutorials`, which are worked examples that require some student input for completion
        * code that produces the associated textbook
    * textbook: The textbook is online at [https://scpoecon.github.io/ScPoEconometrics/](https://scpoecon.github.io/ScPoEconometrics/).
        * It’s readable online in a browser (also on a mobile device), as an `epub` on an ebook reader, or as a `pdf`.
        * It is still a work in progress (contributions welcome!). Particularly chapter 1 needs drastic shortening.
    * sessions: standard weekly meetings, 12 times per semester, 2 hours per session.
The focus of the meetings will be to work on the tutorials, either alone or in small teams. The teacher will start each session with a short overview of the relevant chapter from the online textbook. The main task of the teacher will be to help students along the way and to break up the session every 20 minutes (or so) with short quizzes (more below). * The book is meant for home study; practical exercises are done in class. * Grades: Some weighted average between a final exam and bi-weekly online quizzes. * Exams - Both Exams and online quizzes rely on the amazing [R-exams](http://www.r-exams.org) package. - We produce a pool of template questions, and the package generates random numbers to populate the questions with. Cheating becomes very hard. * The package produces solutions and scannable pdfs for automatic grading. * One final exam. * We can produce a mock exam beforehand. * Pen and paper. We could allow students to use a computer for computations during the exam, but this carries a high risk of cheating or technical problems. * Each teacher should supply as many exam questions as possible. We need a question bank from which to choose. * More on this below, see [TODO](#todo-list-teachers). * Online Quizzes (Homework) - Part of the grade. * weekly or bi-weekly * Their purpose is to make sure that students read the book * [automatically put on moodle](https://moodle.sciences-po.fr/mod/quiz/view.php?id=114720) * Can be automatically generated from our questions pool. * We used moodle, but this works for pretty much all other online learning platforms. * Kahoots - Not part of the grade. * Kahoot! is an online quiz platform widely used in teaching * Kahoots should be given to students in class, just to have some fun and check they understand what is going on. They are played on a mobile phone or a browser. Students choose nicknames. The best (fastest and correct) answer wins, and a podium is shown at the end. * quick demo: * Students need to be able to see your screen. * I (or you!)
create intermediate quizzes before class at https://create.kahoot.it/ * In class, you launch a kahoot from *my kahoots* (click on *play*). * Students go to https://kahoot.it and enter the quiz pin * Teachers should sign up and I can share my kahoots with them. See [TODO](#todo-list-teachers) * Here is the [kahoot for chapter 2](https://play.kahoot.it/#/k/9dfe2cc0-ea38-491a-9e0b-fb55867fcdda) * Communication * slack: this is a chatroom-like environment that I have tested successfully in my other courses. * every group gets their separate channel * every teacher is responsible for managing questions in their group’s channel * general questions should be asked in the #general channel * using this technology is a viable way for me to maintain a global view of how this course is going in the various locations. If I can see what you and your students are talking about, we can react fast to adapt the course. On the other hand, if I have to read through several threaded emails back and forth between you and your students before I can understand what the problem is, this will be much harder (read: *impossible*) to do. * I **strongly recommend** communicating with your students via slack, not via email. * When working with software and computers, there is **ALWAYS** another student who has exactly the same problem as the one you are currently emailing with. The economies of scale are almost unlimited in this domain. * For once, you can share `code` in a readable way. * I would prefer if you communicated with me on slack as well. You can send private messages. ## TODO list teachers To ensure consistency in the department's approach to the *Introduction to Econometrics* curriculum, instructors are strongly encouraged to follow the guidelines below. 1. Sign up to slack: send me an email at florian.oswald@sciencespo.fr so I can add you 2. Get a free account on github.com 3. Have a look at our course [code repository](https://github.com/ScPoEcon/ScPoEconometrics) 1.
In particular, look at the [current list of issues](https://github.com/ScPoEcon/ScPoEconometrics/issues) and file new ones 4. Install `R` 5. Install the `R` package as described on the readme of the [code repository](https://github.com/ScPoEcon/ScPoEconometrics). 6. Go through **all** the apps. Instructions are on the same readme. 1. This is important. 2. Please run all apps. If you find any trouble, please [file an issue](https://github.com/ScPoEcon/ScPoEconometrics/issues). 3. Make sure you understand what each app is supposed to teach. If it’s not clear, [file an issue](https://github.com/ScPoEcon/ScPoEconometrics/issues). 4. Feel free to suggest other apps by [filing an issue](https://github.com/ScPoEcon/ScPoEconometrics/issues). 7. Create questions. 1. Have a look at the textbook for the level of difficulty you should aim at 2. You will be added to the [private exams repo](https://github.com/floswald/ScPoMetricsExams) as soon as you send me your github user name (Point 2. above!). External teachers, please send me an email with that request. 3. I would like to get at the *very least* 4 questions from each teacher. They can be a mixture of short and long questions. 8. Have a close look at [the textbook](https://scpoecon.github.io/ScPoEconometrics/). If you have any suggestions about anything at all please [file an issue](https://github.com/ScPoEcon/ScPoEconometrics/issues). 9. Sign up for a free account at [https://kahoot.com/](https://kahoot.com/) so we can share short quizzes. 10. Please be vocal. This course is an experiment and we are sailing in uncharted territory. Every comment you have will be valuable for us. So [file an issue](https://github.com/ScPoEcon/ScPoEconometrics/issues), post a message on slack, or get in touch otherwise with anything at all! 11. Thank you for participating! ## Student and Teacher Feedback ### Course Iteration 1: September 2018. ScPo Paris and Regional Campuses.
#### Teachers half-term feedback: ##### T1 - A few problems at the beginning concerning the installation of packages: many people had to change their security options in order to install the packages. Now everything is working smoothly. - A few people had problems opening the slides using safari and google chrome. - I think that some of the students would like to see more “real world” examples (like the one on California student test scores). - Two exchange students seem to have trouble understanding basic math concepts (one of them was not able to understand a simple linear equation). - 10 to 15 students reported some issue with the quiz. - They seem to like the format of the course. ##### T2 - Installations of R, RStudio, and packages were ok at the end of the first course - I do not use the slides, I follow the book, projecting RStudio from my laptop - Students do not use Slack but ask their questions during the course - (personal opinion) the tidyverse framework arrives too early for students to understand its interest - No problem with the quiz (we've tested only the first) ##### T3 - No specific problems with the package. Sometimes students using Macs have more difficulties because they need to adapt certain lines of code concerning import of files (folder paths etc.) - Students are sometimes surprised that certain functions can use only particular types of objects as arguments. - Several students had to retake the test twice because of the server collapse. Overall, the results are good, low grades are rare. Overall, nothing very peculiar or worrisome so far in my groups ##### T4 1. The main problem concerns the ScPoEconometrics package. Sometimes it suddenly blocks although it worked the day before. Otherwise, everything goes well. 2. Some students find that the book is hard to follow. Aside from the slides, I give them a synthesis of the R code at the end of each chapter. 3. Students would like to know the weight of the moodle quizzes. ##### T5 1.
Some students had problems when they updated the package and ran the tutorial; fortunately it seemed to be ok in the last session. More students had problems with the first test, but with the second one only one student has had problems so far. Students in my groups rarely interact in Slack, some never even check the messages. 2. Agree that the tidyverse seems to be technical and students were not very interested at this early stage. 3. The average of the quizzes is good. 4. I think the commands should be kept simple since it’s hard for some students even to replicate a command. ##### T6 1. Some students had problems with package installation at the beginning. They seem to know how to interact with slack but don’t really use it. 2. Agree with the past comments about tidyverse, it’s too early for them to understand its interest. They seem to like real world examples. I think that having a kind of small applied project would be helpful as it seems that they just try to reproduce class results and not to play with R. I have 1 student with almost no math background. I think the slides on OLS transformations (normalization, demeaning) may be too cryptic at this stage. 3. 3-4 students had problems with the quiz. The average grade is very nice for now. ##### My response to teacher feedback So my experience was overall similar to what you are writing, just to reassure you. Going forward, i.e. for the next edition of the course, I take the following messages out of what you wrote: 1. no tidyverse, or only later 2. more real world examples a la `Caschool` and/or an applied project 3. Slides on OLS transformation too much math 4. Some find book hard to follow. All of those are good points. Let me just put some more realism into each point by highlighting that nothing comes for free. Again, this is mainly for my own future benefit, but please feel free to discuss. 1. The Tidyverse approach to cleaning data is easier than the corresponding solution using base R. This is related to *real world*.
The example with reading an excel dataset downloaded from the web is _very_ real world in this sense. You will always have to reshape the data somehow, and I am doubtful whether the base R route is simpler to understand. 2. Can produce more worked examples or projects. We did as much as we could with the tutorials so far, clearly the more the better. 3. I explicitly say that the math is only for whoever is interested on those particular slides. I think we should at least give the option for those interested to get a chance to see how stuff works. Debatable. 4. I need more info as to which parts of the book they find hard to follow. Please don’t say *all of it*. #### Student Feedback * At the time of writing, the official course evaluation on behalf of students has not yet been published. To be added here. * I got some informal feedback during the semester. 1. Some moodle/exam questions are not suitable for exams. For example, the answer to a question about a distribution being left/right skewed or unimodal etc. is not always unambiguous, given the random nature of the data. 2. The crashing moodle server caused some real pain. Giving people grades under conditions of such technical frailty was quite borderline and I was tempted not to use the moodle quizzes in the grades at all. 3. The federated structure of SciencesPo (central paris and regional campuses) caused some frustration. It is hard to synchronize classrooms at a distance. Some people didn't find slack helpful. 4. Students who performed poorly on the final exam thought it was too hard/unfair. Students who performed well thought it was not unfair. Not much to learn from this. 80% of exam questions were using the *identical* template previously used in one of the online quizzes. ================================================ FILE: teachers/app-timeline.md ================================================ # App and Tutorial Schedule This doc sets out a rough timeline for when to do which app or tutorial.
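The commands in this schedule come from the course `R` package. As a minimal sketch of how they are called (assuming the package is installed under the name `ScPoEconometrics`, as in the course repository, and that each call is run on its own in an interactive session):

```r
# Sketch only: assumes the course package is installed as `ScPoEconometrics`.
has_pkg <- requireNamespace("ScPoEconometrics", quietly = TRUE)

if (has_pkg && interactive()) {
  library(ScPoEconometrics)

  # Run these one at a time; each call opens an app or tutorial.
  runTutorial('chapter2')       # worked tutorial that needs student input
  aboutApp("corr_continous")    # explanation that accompanies an app
  launchApp('reg_simple')       # interactive app
} else {
  message("Install the course package and use an interactive R session first.")
}
```

The app and tutorial names used here are taken from the chapter lists below.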
## Chapter 1 Nothing ## Chapter 2: Summarizing Data * After slide *scatter plots*, do `runTutorial('chapter2')`. - Discrete Data - Continuous Data - Estimation based on a sample * Immediately after, `runTutorial('correlation')` * After that, introduce the `aboutApp()` function. Do `aboutApp("corr_continous")` (slightly different app, but fitting explanation) * Finally in that chapter, I would recommend going through the *entire* worked example in the book at 2.4.1 "Reading .csv data in the tidy way" ## Chapter 3: Linear Regression This is by far the most important chapter, so we have a lot of apps. You should take as long for this chapter as you feel is necessary. It's the core of the course. * After you have shown figure 3.1, do `launchApp('reg_simple_arrows')` * Continue with `launchApp('reg_simple')`. The explanation for the squares comes later; at this stage this is just intuition. * Continue in 3.1.2 to introduce SSR * After this point, they **must** have the simple formula (3.1) and what each part means in their head for the rest of their lives. Make sure that is the case. * Go back to `launchApp('reg_simple')` and explain the squares * Now `launchApp('SSR_cone')` and tell them that OLS solves exactly this minimization problem. Spend good time there, explain all the numbers that are visible and that they can drag the 3D graph with their mouse to see better. * Now do `launchApp('reg_full')`. Explain - there are 10 different examples - what happens when you increase the noise level? - you should spend a lot of time with this app. * Now we have the basics down. Next we talk about some simple restrictions on the basic model. * What happens if we demean both x and y? `launchApp('demeaned_reg')` * Constrained regression: what if we have only an intercept, or only a slope? How does our result improve (with 0 intercept, say), if we then demean the data? `launchApp('reg_constrained')` * What happens if we rescale either x or y or both by some number?
Say, what if instead of measuring wage in a regression in euros, we now measure it in 1000s of euros? 1. `runTutorial('rescaling')` 1. `launchApp('rescale')` * Go back to 3.1.3 in the book and define the simple formulae for both coefficients * 3.1.4 - `launchApp('anscombe')` - `launchApp('datasaurus')` * Work through book 3.3 example till the end ## Chapter 4: Standard Errors * `launchApp('sampling')` * `launchApp('standard_errors_simple')` * `launchApp('standard_errors_changeN')` * `launchApp('confidence_intervals')` * `runTutorial('non_normal')` ## Chapter 5: Multiple Regression * `launchApp('reg_multivariate')` * `runTutorial('lm_example')` * `launchApp('multicollinearity')` ## Chapter 6: Categorical Variables * `launchApp('reg_dummy')` * `launchApp('reg_dummy_example')` ================================================ FILE: teachers/session1-ouline.md ================================================ # Session 1 Teacher brings a laptop with Slack, R and RStudio installed. Our package code is installed on the laptop. The laptop is connected to the projector. ScPo provided hardware allows neither Slack nor the installation of our package, so it is not useful. ## Welcome! * Who am I? * name * experience (research, teaching, other) * What does this course try to teach you? * We want to teach you the basics of data analysis and Econometrics. * We want you to try things out, rather than to be able to prove them formally * For those of you very eager to derive formal and more rigorous insights, there will be ample opportunity later on, in a Masters or a PhD * Our aim is for *everybody* to understand and to be able to run a linear regression with `R`. * This is a brand new course. * This means that we are quite happy to show you plenty of new things, but you should be aware there are still some rough edges. Please be patient if something does not work as expected - we are here to help!
## Meetings * We meet once per week * please bring your laptop each time ## Exam and Grading * There will be quizzes on Moodle roughly every two weeks. * There will be a final exam on paper. * We will do online quizzes on kahoot.com, but those will not be part of your grade. ## Today * We will talk about some logistical details first. You will need your computer running and connected to the internet, so why not start up now? * Then we will have a first look at `R`. ### Communication * We will talk to each other on Slack. * Who is not yet signed up to Slack? * This is much better than email for talking about issues with computer code 1. code *looks* nicer than in an email 2. Slack is like a chatroom, so other people see what you say. Odds are that there are several people who have the same or a similar problem as you, so this is much more efficient in a chatroom. * Let me quickly show you Slack. You should open Slack on your computer now as well. 1. [WAITS for all] 1. In the left panel you can see all the channels you are subscribed to. You can see I am subscribed to more channels than you are. 1. You should subscribe to *my* channel, so we can talk about things in this classroom. Just click on `Channels` and start typing my first name. You will see my channel appear, click on it, and finally click *join channel* at the bottom. 1. This channel is your first reference for any questions you have about the course. 1. Let me check that you are all in my channel now 1. [checks *members* in right panel] 1. I'll post an example message now in our channel to say hello to you all. 1. [posts hello message into their channel] 1. You can **react** to any post by clicking on the appropriate symbol at the top right corner of the post. 1. [reacts to hello message just posted] 1. Let me show you now how to nicely format computer code in a slack post. It's easy. 1. [starts typing x + y = 3 and alerts students to the appearing info just below the text box] 1.
We want this to be formatted like ``code``. So we put this in backticks `` ` ``, like so: `` `x + y = 3` `` [hits enter] 1. If you want to write multiple lines of code, you could start with three backticks, and create a new line with `shift` and `enter` (`enter` alone sends the message!): ```` ``` x = 3 y = 4 x + y ``` ```` 1. You can also attach files by clicking on the plus symbol. 1. Please don't post in the #general channel, as this is for public announcements for all courses. 1. Finally, you can send direct messages by clicking on a username, or on *direct messages* in your left panel. ### RStudio * Do you all have R and RStudio installed? * If not, install now and look at your neighbor's screen * Let's all open RStudio! * [make sure you have standard layout, from top left to bottom right source, environment, console, files/plots] * open an empty script * here is the console (bottom left): write some commands into it * show that variables show up in environment if you assign a value (top right) * make a base plot (not ggplot) and show where it appears * write 2 lines of code in the open script file, execute each line (place cursor on line and hit cmd+enter or click on run) * save the script file somewhere by clicking on the save symbol * type `help(plot)` in the console and explain help file ### Let's get going with R! * open https://scpoecon.github.io/ScPoEconometrics/R-intro.html and project to wall * explain how the **book** works * left: TOC * menu bar on top: * make TOC disappear * search for a term * choose text type * edit this page of the book on github.com (to suggest a change or if you found a mistake) * download as pdf or as epub. * If you like what you see, on the right you can tweet and post to facebook about this book. * All the code you see in the book actually works. So please copy and paste from it as much as you can! ### Continue with Slides!
* Start at 1.2.1: First Glossary * Do some basics from 1.3 * Do 1.4 * Do 1.5 * Do 1.7 and install the package! * make them load the library and check the version! * keep going over the chapter: * ideally you have your RStudio screen open and type commands as you go along * we want them to type as many commands as possible!!! * go until Task 1: * 2 minute break! * whoever is having any trouble with their computer, please come and see me now. * then do task 1 * then keep going ================================================ FILE: teachers/tasks_ch1.Rmd ================================================ --- title: "tasks for session 1" author: "Florian Oswald" date: "8/18/2018" output: pdf_document: default html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Task 1 1. Create a vector of five ones, i.e. `[1,1,1,1,1]` `rep(1,5)` 1. Notice that the colon operator `a:b` is just short for *construct a sequence **from** `a` **to** `b`*. Create a vector that counts down from 10 to 0, i.e. it looks like `10,9,8,7,6,5,4,3,2,1,0`! `10:0` 1. The `rep` function takes additional arguments `times` (as above), and `each`, which tells you how often *each element* should be repeated (as opposed to the entire input vector). Use `rep` to create a vector that looks like this: `1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3` `rep(1:3,times=2,each=3)` # Task 2 1. Create a vector filled with 10 numbers drawn from the uniform distribution (hint: use function `runif`) and store them in `x`. `x = runif(10)` 1. Using logical subsetting as above, get all the elements of `x` which are larger than 0.5, and store them in `y`. `y = x[x>0.5]` 1. Using the function `which`, store the *indices* of all the elements of `x` which are larger than 0.5 in `iy`. `iy = which(x>0.5)` 1. Check that `y` and `x[iy]` are identical. `identical(y,x[iy])` or `all(y == x[iy])` # Task 3 1. Create a vector containing `1,2,3,4,5` called `v`. `v = 1:5` 1.
Create a (2,5) matrix `m` containing the data `1,2,3,4,5,6,7,8,9,10`. The first row should be `1,2,3,4,5`. `m = matrix(data = 1:10,nrow=2,ncol=5,byrow=TRUE)` 1. Perform matrix multiplication of `m` with `v`. Use the command `%*%`. What dimension does the output have? `dim(m %*% v)`, which is `2 1` 1. Why does `v %*% m` not work? The dimensions are non-conformable. # Task 4 1. Copy and paste the above code for `ex_list` into your R session. Remember that `list` can hold any kind of `R` object. Like...another list! So, create a new list `new_list` that has two fields: a first field called "this" with string content `"is awesome"`, and a second field called "ex_list" that contains `ex_list`. `new_list = list(this = "is awesome", ex_list = ex_list)` 1. Accessing members is like in a plain list, just with several layers now. Get the element `c` from `ex_list` in `new_list`! `new_list$ex_list$c` 1. Compose a new string out of the first element in `new_list`, the element under label `this`. Use the function `paste` to print `R is awesome` to your screen. `paste("R",new_list$this)` # Task 5 1. How many observations are there in `mtcars`? `nrow(mtcars)` 1. How many variables? `ncol(mtcars)` 1. What is the average value of `mpg`? `mean(mtcars$mpg)` 1. What is the average value of `mpg` for cars with more than 4 cylinders, i.e. with `cyl>4`? `mean(subset(mtcars,subset=cyl>4)$mpg)` # Task 6 1. Write a for loop that counts down from 10 to 1, printing the value of the iterator to the screen. ```{r} for (i in 10:1){ print(i) } ``` 1. Modify that loop to write "i iterations to go" where `i` is the iterator ```{r} for (i in 10:1){ print(paste(i,"iterations to go")) } ``` 1. Modify that loop so that each iteration takes roughly one second. You can achieve that by adding the command `Sys.sleep(1)` below the line that prints "i iterations to go".
```{r} for (i in 10:1){ print(paste(i,"iterations to go")) Sys.sleep(1) } ``` ================================================ FILE: teachers/tasks_ch2.Rmd ================================================ --- title: "tasks for chapter 2" author: "Florian Oswald" date: "8/18/2018" output: pdf_document: default html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Task 1. Make sure to have the `mpg` dataset loaded by typing `data(mpg)` (and `library(ggplot2)` if you haven't!). Use the `table` function to find out how many cars were built by *mercury*. `table(mpg$manufacturer)`, `4`. 1. What is the average year the audis were built in this dataset? Use the function `mean` on the subset of column `year` that corresponds to `audi`. (Be careful: subsetting a `tibble` returns a `tibble`, not a vector! So get the `year` column after you have subset the `tibble`.) `mean(subset(mpg,subset=manufacturer=="audi")$year)`, `mean(mpg[mpg$manufacturer=="audi","year"]$year)` 1. Use the `dplyr` piping syntax from above first with `group_by` and then with `summarise(newvar=your_expression)` to find the mean `year` by manufacturer!
```{r} library(ggplot2) library(dplyr) mpg %>% group_by(manufacturer) %>% summarise(year=mean(year)) ``` ================================================ FILE: toc.css ================================================ #TOC ul, #TOC li, #TOC span, #TOC a { margin: 0; padding: 0; position: relative; } #TOC { line-height: 1; border-radius: 5px 5px 0 0; background: #141414; background: linear-gradient(to bottom, #333333 0%, #141414 100%); border-bottom: 2px solid #0fa1e0; width: auto; } #TOC:after, #TOC ul:after { content: ''; display: block; clear: both; } #TOC a { background: #141414; background: linear-gradient(to bottom, #333333 0%, #141414 100%); color: #ffffff; display: block; padding: 19px 20px; text-decoration: none; text-shadow: none; } #TOC ul { list-style: none; } #TOC > ul > li { display: inline-block; float: left; margin: 0; } #TOC > ul > li > a { color: #ffffff; } #TOC > ul > li:hover:after { content: ''; display: block; width: 0; height: 0; position: absolute; left: 50%; bottom: 0; border-left: 10px solid transparent; border-right: 10px solid transparent; border-bottom: 10px solid #0fa1e0; margin-left: -10px; } #TOC > ul > li:first-child > a { border-radius: 5px 0 0 0; } #TOC.align-right > ul > li:first-child > a, #TOC.align-center > ul > li:first-child > a { border-radius: 0; } #TOC.align-right > ul > li:last-child > a { border-radius: 0 5px 0 0; } #TOC > ul > li.active > a, #TOC > ul > li:hover > a { color: #ffffff; box-shadow: inset 0 0 3px #000000; background: #070707; background: linear-gradient(to bottom, #262626 0%, #070707 100%); } #TOC .has-sub { z-index: 1; } #TOC .has-sub:hover > ul { display: block; } #TOC .has-sub ul { display: none; position: absolute; width: 200px; top: 100%; left: 0; } #TOC .has-sub ul li a { background: #0fa1e0; border-bottom: 1px dotted #31b7f1; filter: none; display: block; line-height: 120%; padding: 10px; color: #ffffff; } #TOC .has-sub ul li:hover a { background: #0c7fb0; } #TOC ul ul li:hover > a { color: #ffffff; } 
#TOC .has-sub .has-sub:hover > ul { display: block; } #TOC .has-sub .has-sub ul { display: none; position: absolute; left: 100%; top: 0; } #TOC .has-sub .has-sub ul li a { background: #0c7fb0; border-bottom: 1px dotted #31b7f1; } #TOC .has-sub .has-sub ul li a:hover { background: #0a6d98; } #TOC ul ul li.last > a, #TOC ul ul li:last-child > a, #TOC ul ul ul li.last > a, #TOC ul ul ul li:last-child > a, #TOC .has-sub ul li:last-child > a, #TOC .has-sub ul li.last > a { border-bottom: 0; } #TOC ul { font-size: 1.2rem; }