Repository: ScPoEcon/ScPoEconometrics
Branch: master
Commit: 4999239de84e
Files: 59
Total size: 2.2 MB

Directory structure:
gitextract_ejq2bful/

├── .Rbuildignore
├── .github/
│   └── ISSUE_TEMPLATE/
│       └── custom.md
├── .gitignore
├── 01-R.Rmd
├── 02-SummaryStats.Rmd
├── 03-linear-reg.Rmd
├── 04-MultipleReg.Rmd
├── 05-Categorial-Vars.Rmd
├── 06-StdErrors.Rmd
├── 07-Causality.Rmd
├── 08-STAR.Rmd
├── 09-RDD.Rmd
├── 10-IV.Rmd
├── 11-IV2.Rmd
├── 12-panel.Rmd
├── 13-discrete.Rmd
├── 14-references.Rmd
├── DESCRIPTION
├── GA-tracker.html
├── LICENSE
├── NAMESPACE
├── R/
│   └── utils.R
├── README.md
├── ScPoEconometrics.Rproj
├── _archive/
│   └── chapters/
│       └── 03-linear-reg.Rmd
├── _bookdown.yml
├── _build.sh
├── _deploy.sh
├── _local_deploy.sh
├── _output.yml
├── _tex/
│   ├── ci.tex
│   ├── onesided.tex
│   ├── testing.lyx
│   ├── two-sided-beta.tex
│   └── twosided-mean.tex
├── _to_be_done/
│   ├── 08-TBD.Rmd
│   ├── 09-R-advanced.Rmd
│   ├── 11-projects.Rmd
│   └── notes.R
├── book.bib
├── images/
│   └── trade.html
├── index.Rmd
├── inst/
│   ├── CITATION
│   └── datasets/
│       ├── airline-safety.csv
│       ├── corr50.csv
│       ├── demo_gind.xls
│       ├── example-data.csv
│       ├── grade5.dta
│       └── simple_arrows.RData
├── packages.bib
├── preamble.tex
├── previous_travis.yml
├── style.css
├── teachers/
│   ├── ForTeachers.md
│   ├── app-timeline.md
│   ├── session1-ouline.md
│   ├── tasks_ch1.Rmd
│   └── tasks_ch2.Rmd
└── toc.css

================================================
FILE CONTENTS
================================================

================================================
FILE: .Rbuildignore
================================================
^.*\.Rproj$
^\.Rproj\.user$
^.*\.html$
^.*\.jpg$
_book*
_slides*
_tex*
^\d\d-.*\.Rmd$
^.*\.yml$
^.*\.sh$
^.*\.css$
^.*\.gif$
^.*\.bib$
^.*\.tex$
images/
data/.keep
js/
^appveyor\.yml$
teachers/


================================================
FILE: .github/ISSUE_TEMPLATE/custom.md
================================================
---
name: Custom issue template
about: Please file an issue here!
title: ''
labels: ''
assignees: ''

---

hello!

Please ask any course-related questions here, or let us know if something does not work. Make sure your issue includes a reproducible example of the bug/issue that you encountered. **Every issue needs to include three things**:

1. The commands that led to the error you encountered
1. The actual output of the error
1. **After** the error happened, type `sessionInfo()` and post the output here as well.


================================================
FILE: .gitignore
================================================
.Rproj.user
.Rhistory
.RData
_publish.R
_book
_bookdown_files
rsconnect
/data/
/inst/shinys/**/*.html
/inst/tutorials/**/*.html
/inst/tutorials/**/*data
inst/tutorials/chapter2/chapter2_files/
_slides/chapter1/chapter1-*
_slides/chapter2/chapter2-*
_slides/chapter6/chapter6_files/
_slides/**/*.html


================================================
FILE: 01-R.Rmd
================================================
# Introduction to `R`  {#R-intro}



## Getting Started

`R` is both a programming language and software environment for statistical computing, which is *free* and *open-source*. To get started, you will need to install two pieces of software:

1. [`R`, the actual programming language.](https://www.r-project.org)
    - Choose your operating system, and select the most recent version.
1. [RStudio, an excellent IDE for working with `R`.](http://www.rstudio.com/)
    - Note, you must have `R` installed to use RStudio. RStudio is simply an interface used to interact with `R`.

The popularity of `R` is on the rise, and every day it becomes a better tool for statistical analysis. It even generated this book!

The following few chapters will serve as a whirlwind introduction to `R`. They are by no means meant to be a complete reference for the `R` language, but simply an introduction to the basics that we will need along the way. Several of the more important topics will be re-stressed as they are actually needed for analyses.

This introductory `R` chapter may feel like an overwhelming amount of information. You are not expected to pick up everything the first time through. You should try all of the code from this chapter, then come back to it whenever you meet these concepts again while performing analyses. We only present the most basic aspects of `R`. If you want to know more, there are countless online tutorials, and you could start with the official [CRAN sample session](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#A-sample-session) or have a look at the resources at [RStudio](https://www.rstudio.com/online-learning/#DataScience) or on this [GitHub repo](https://github.com/qinwf/awesome-R).


## Starting R and RStudio

A key difference for you to understand is the one between `R`, the actual programming language, and `RStudio`, a popular interface to `R` which allows you to work efficiently and with greater ease with `R`.

The best way to appreciate the value of `RStudio` is to start using `R` *without* `RStudio`. To do this, double-click on the R GUI that you should have downloaded on your computer following the steps above (on Windows or Mac), or start R in your terminal (on Linux or Mac) by just typing `R` (see figure \@ref(fig:console)). You've just opened the R **console** which allows you to start typing code right after the `>` sign, called the *prompt*. Try typing `2 + 2` or `print("Your Name")` and hit the return key. And *voilà*, your first R commands!

```{r console, fig.cap="R GUI symbol and R in a MacOS Terminal",fig.align='center',out.width="50%",echo=FALSE}
knitr::include_graphics(c("images/RLogo.png","images/console.png") )
```


Typing one command after the other into the console is not very convenient as our analysis becomes more involved. Ideally, we would like to collect all command statements in a file and run them one after the other, automatically. We can do this by writing so-called **script files** or just **scripts**, i.e. simple text files with extension `.R` or `.r` which can be *inserted* (or *sourced*) into an `R` session. RStudio makes this process very easy.
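
As a minimal sketch of what *sourcing* means, the following lines write a tiny script to a temporary file and then run it with `source()`. The names `script_path`, `a` and `b` are hypothetical and chosen only for illustration; in practice you would write and save the script yourself in the RStudio editor.

```r
# create a temporary file with extension .R (illustrative only)
script_path = tempfile(fileext = ".R")

# write two commands into the script, one per line
writeLines(c("a = 2 + 2", "b = sqrt(a)"), con = script_path)

# run all commands in the script, one after the other
source(script_path)

# the objects created by the script are now available in the session
a
b
```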

Open `RStudio` by clicking on the `RStudio` application on your computer, and notice how different the whole environment is from the basic `R` console – in fact, that *very same* `R` console is running in your bottom left panel. The upper-left panel is a space for you to write scripts – that is to say many lines of code which you can run when you choose to. To run a single line of code, simply highlight it and hit `Command` + `Return` (on Mac) or `Ctrl` + `Enter` (on Windows and Linux).

```{block, type='note'}
We highly recommend that you use `RStudio` for everything related to this course (in particular, to launch our apps and tutorials).
```


RStudio has a large number of useful keyboard shortcuts. A list of these can be found using a keyboard shortcut -- the keyboard shortcut to rule them all:

- On Windows: `Alt` + `Shift` + `K`
- On Mac:  `Option` + `Shift` + `K`

The `RStudio` team has developed [a number of "cheatsheets"](https://www.rstudio.com/resources/cheatsheets/) for working with both `R` and `RStudio`. [This particular cheatsheet for Base `R`](http://www.rstudio.com/wp-content/uploads/2016/05/base-r.pdf) will summarize many of the concepts in this document. ^[When programming, it is often a good practice to follow a style guide. (Where do spaces go? Tabs or spaces? Underscores or CamelCase when naming variables?) No style guide is "correct" but it helps to be aware of what others do. The more important thing is to be consistent within your own code. Here are two guides: [Hadley Wickham Style Guide](http://adv-r.had.co.nz/Style.html), and the [Google Style Guide](https://google.github.io/styleguide/Rguide.xml). For this course, our main deviation from these two guides is the use of `=` in place of `<-`. For all practical purposes, you should think `=` whenever you see `<-`.]

### First Glossary

* `R`: a statistical programming language
* `RStudio`: an integrated development environment (IDE) to work with `R`
* *command*: user input (text or numbers) that `R` *understands*.
* *script*: a list of commands collected in a text file, each separated by a new line, to be run one after the other.

## Basic Calculations

To get started, we'll use `R` like a simple calculator. Run the following code either directly from your RStudio console, or in RStudio by writing it in a script and running it using `Command` + `Return` (`Ctrl` + `Enter` on Windows and Linux).

#### Addition, Subtraction, Multiplication and Division {-}

| Math          | `R` code    | Result    |
|:-------------:|:-------:|:---------:|
| $3 + 2$       | `3 + 2` | `r 3 + 2` |
| $3 - 2$       | `3 - 2` | `r 3 - 2` |
| $3 \cdot2$    | `3 * 2` | `r 3 * 2` |
| $3 / 2$       | `3 / 2` | `r 3 / 2` |

#### Exponents  {-}

| Math         | `R` code             | Result            |
|:-------------:|:-------:|:---------:|
| $3^2$        | `3 ^ 2`         | `r 3 ^ 2`         |
| $2^{(-3)}$   | `2 ^ (-3)`      | `r 2 ^ (-3)`      |
| $100^{1/2}$  | `100 ^ (1 / 2)` | `r 100 ^ (1 / 2)` |
| $\sqrt{100}$ | `sqrt(100)`     | `r sqrt(100)`     |


#### Mathematical Constants  {-}

| Math         | `R` code             | Result            |
|:------------:|:---------------:|:-----------------:|
| $\pi$        | `pi`            | `r pi`            |
| $e$          | `exp(1)`        | `r exp(1)`        |

#### Logarithms  {-}

Note that we will use $\ln$ and $\log$ interchangeably to mean the natural logarithm. There is no `ln()` in `R`, instead it uses `log()` to mean the natural logarithm.

| Math              | `R` code                | Result                |
|:------------:|:---------------:|:-----------------:|
| $\log(e)$         | `log(exp(1))`       | `r log(exp(1))`       |
| $\log_{10}(1000)$ | `log10(1000)`       | `r log10(1000)`       |
| $\log_{2}(8)$     | `log2(8)`           | `r log2(8)`           |
| $\log_{4}(16)$    | `log(16, base = 4)` | `r log(16, base = 4)` |
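
The `base` argument of `log()` is a shortcut for the usual change-of-base rule, $\log_b(x) = \log(x) / \log(b)$. A short sketch that checks this numerically, using `all.equal()` to compare numbers up to rounding error:

```r
# log(x, base = b) is the same as log(x) / log(b)
log(16, base = 4)
log(16) / log(4)
all.equal(log(16, base = 4), log(16) / log(4))

# log2() and log10() are shortcuts for base 2 and base 10
all.equal(log2(8), log(8, base = 2))
all.equal(log10(1000), log(1000, base = 10))
```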

#### Trigonometry  {-}

| Math            | `R` code           | Result          |
|:------------:|:---------------:|:-----------------:|
| $\sin(\pi / 2)$ | `sin(pi / 2)` | `r sin(pi / 2)` |
| $\cos(0)$       | `cos(0)`      | `r cos(0)`      |

## Getting Help

In using `R` as a calculator, we have seen a number of functions: `sqrt()`, `exp()`, `log()` and `sin()`. To get documentation about a function in `R`, simply put a question mark in front of the function name, or call the function `help(function)` and RStudio will display the documentation, for example: 

```{r, eval = FALSE}
?log
?sin
?paste
?lm
help(lm)   # help() is equivalent
help(ggplot,package="ggplot2")  # show help from a certain package
```

Frequently one of the most difficult things to do when learning `R` is asking for help. First, you need to decide to ask for help, then you need to know *how* to ask for help. Your very first line of defense should be to Google your error message or a short description of your issue. (The ability to solve problems using this method is quickly becoming an extremely valuable skill.) If that fails, and it eventually will, you should ask for help. There are a number of things you should include when contacting an instructor, or posting to a help website such as [Stack Overflow](https://stackoverflow.com).

- Describe what you expect the code to do.
- State the end goal you are trying to achieve. (Sometimes what you expect the code to do, is not what you want to actually do.)
- Provide the full text of any errors you have received.
- Provide enough code to recreate the error. Often for the purpose of this course, you could simply post your entire `.R` script or `.Rmd` to `slack`.
- Sometimes it is also helpful to include a screenshot of your entire RStudio window when the error occurs.

If you follow these steps, you will get your issue resolved much quicker, and possibly learn more in the process. Do not be discouraged by running into errors and difficulties when learning `R`. (Or any other technical skill.) It is simply part of the learning process.

## Installing Packages

`R` comes with a number of built-in functions and datasets, but one of the main strengths of `R` as an open-source project is its package system. Packages add additional functions and data. Frequently if you want to do something in `R`, and it is not available by default, there is a good chance that there is a package that will fulfill your needs.

To install a package, use the `install.packages()` function. Think of this as buying a recipe book from the store, bringing it home, and putting it on your shelf (i.e. into your library):

```{r, eval = FALSE}
install.packages("ggplot2")
```

Once a package is installed, it must be loaded into your current `R` session before being used. Think of this as taking the book off of the shelf and opening it up to read.

```{r, message = FALSE, warning = FALSE}
library(ggplot2)
```

Once you close `R`, all the packages are closed and put back on the imaginary shelf. The next time you open `R`, you do not have to install the package again, but you do have to load any packages you intend to use by invoking `library()`.
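
The difference between an *installed* and a *loaded* package can be checked from the console. The sketch below uses `requireNamespace()`, which reports whether a package is installed without attaching it to your session, and the `::` operator, which calls a single function from a package without a prior call to `library()`. The object name `has_ggplot2` is hypothetical.

```r
# is ggplot2 on the "shelf"? TRUE if it is installed, FALSE otherwise
has_ggplot2 = requireNamespace("ggplot2", quietly = TRUE)
has_ggplot2

# call one function from a package without attaching the whole package;
# the stats package ships with R, so this works on any installation
stats::median(c(1, 5, 9))
```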

## `Code` vs Output in this Book {#code-output}

A quick note on styling choices in this book. We had to make a decision how to visually separate `R` code and resulting output in this book. All output lines are prefixed with `##` to make the distinction. A typical code snippet with output is thus going to look like this:

```{r}
1 + 3
# everything after a # is a comment, i.e. R disregards it.
```

where you see on the first line the `R` code, and on the second line the output. As mentioned, that line starts with `##` to say *this is an output*, followed by `[1]` (indicating this is a vector of length *one* - more on this below!), followed by the actual result - `1 + 3 = 4`!

Notice that you can simply copy and paste all the code you see into your `R` console. In fact, you are *strongly* encouraged to actually do this and try out **all the code** you see in this book.

Finally, please note that this way of showing output is fully our choice in this textbook, and that you should expect other output formats elsewhere. For example, in my `RStudio` console, the above code and output looks like this:

```R
> 1 + 3
[1] 4
```


## `ScPoApps` Package {#install-package}

To fully take advantage of our course, please install the associated `R` package directly from its online code repository. You can do this by copying and pasting the following two lines into your `R` console:

```R
if (!require("devtools")) install.packages("devtools")
devtools::install_github(repo = "ScPoEcon/ScPoApps")
```

In order to check whether everything works fine, you could load the library, and check its current version:

```{r,warning=FALSE,message=FALSE,eval=FALSE}
library(ScPoApps)
packageVersion("ScPoApps")
```


## Data Types {#data-types}

`R` has a number of basic *data types*. While `R` is not a *strongly typed language* (i.e. you can be agnostic about types most of the time), it is useful to know what data types are available to you:

- Numeric
    - Also known as Double. The default type when dealing with numbers.
    - Examples: `1`, `1.0`, `42.5`
- Integer
    - Examples: `1L`, `2L`, `42L`
- Complex
    - Example: `4 + 2i`
- Logical
    - Two possible values: `TRUE` and `FALSE`
    - You can also use `T` and `F`, but this is *not* recommended.
    - `NA` is also considered logical.
- Character
    - Examples: `"a"`, `"Statistics"`, `"1 plus 2."`
- Categorical or `factor`
    - A mixture of integer and character. A `factor` variable assigns a label to a numeric value.
    - For example `factor(x=c(0,1),labels=c("male","female"))` assigns the string *male* to the numeric value `0`, and the string *female* to the value `1`. 
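
To see which type a given value has, you can ask `R` directly with `class()` and `typeof()`. The following is a small sketch; the object name `f` is hypothetical. Note that a `factor` stores integer codes starting at `1` internally, regardless of the original values `0` and `1`.

```r
class(1)         # "numeric"
typeof(1)        # "double"
class(1L)        # "integer"
class(4 + 2i)    # "complex"
class(TRUE)      # "logical"
class(NA)        # "logical"
class("a")       # "character"

# a factor attaches labels to the distinct values of x
f = factor(x = c(0, 1), labels = c("male", "female"))
class(f)         # "factor"
levels(f)        # the labels
as.integer(f)    # the internal integer codes
```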

## Data Structures

`R` also has a number of basic data *structures*. A data structure is either homogeneous (all elements are of the same data type) or heterogeneous (elements can be of more than one data type).

| Dimension | **Homogeneous** | **Heterogeneous** |
|:---------:|:---------------:|:-----------------:|
| 1         | Vector          | List              |
| 2         | Matrix          | Data Frame        |
| 3+        | Array           | Nested lists      |





### Vectors

Many operations in `R` make heavy use of **vectors**. A vector is a *container* for objects of identical type (see \@ref(data-types) above). Vectors in `R` are indexed starting at `1`. That is what the `[1]` in the output is indicating, that the first element of the row being displayed is the first element of the vector. Larger vectors will start additional rows with something like `[7]` where `7` is the index of the first element of that row.

Possibly the most common way to create a vector in `R` is using the `c()` function, which is short for "combine". As the name suggests, it combines a list of elements separated by commas. (Are you busy typing all of those examples into your `R` console? :-) )

```{r}
c(1, 3, 5, 7, 8, 9)
```

Here `R` simply outputs this vector. If we would like to store this vector in a **variable** we can do so with the **assignment** operator `=`. In this case the variable `x` now holds the vector we just created, and we can access the vector by typing `x`.

```{r}
x = c(1, 3, 5, 7, 8, 9)
x
```

As an aside, there is a long history of the assignment operator in `R`, partially due to the keys available on the [keyboards of the creators of the `S` language.](https://twitter.com/kwbroman/status/747829864091127809) (Which preceded `R`.) For simplicity we will use `=`, but know that often you will see `<-` as the assignment operator. 

Because vectors must contain elements that are all the same type, `R` will automatically **coerce** (i.e. convert) to a single type when attempting to create a vector that combines multiple types.

```{r}
c(42, "Statistics", TRUE)
c(42, TRUE)
```

Frequently you may wish to create a vector based on a sequence of numbers. The quickest and easiest way to do this is with the `:` operator, which creates a sequence of integers between two specified integers.

```{r}
(y = 1:100)
```

Here we see `R` labeling the rows after the first since this is a large vector. Also, we see that by putting parentheses around the assignment, `R` both stores the vector in a variable called `y` and automatically outputs `y` to the console.

Note that scalars do not exist in `R`. They are simply vectors of length `1`.

```{r}
2
```

If we want to create a sequence that isn't limited to integers and increasing by 1 at a time, we can use the `seq()` function.

```{r}
seq(from = 1.5, to = 4.2, by = 0.1)
```

We will discuss functions in detail later, but note here that the input labels `from`, `to`, and `by` are optional.

```{r}
seq(1.5, 4.2, 0.1)
```
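
A short sketch of how argument matching works for `seq()`: named arguments can be supplied in any order, while unnamed ones are matched by position. The object names `s1` and `s2` are hypothetical. The last line uses the `length.out` argument of `seq()`, which fixes the number of points rather than the step size.

```r
# named arguments can be given in any order
s1 = seq(by = 0.1, to = 4.2, from = 1.5)

# unnamed arguments are matched by position: from, to, by
s2 = seq(1.5, 4.2, 0.1)

identical(s1, s2)

# five equally spaced points between 0 and 1
seq(from = 0, to = 1, length.out = 5)
```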

Another common operation to create a vector is `rep()`, which can repeat a single value a number of times.

```{r}
rep("A", times = 10)
```

The `rep()` function can be used to repeat a vector some number of times.

```{r}
rep(x, times = 3)
```

We have now seen four different ways to create vectors:

- `c()`
- `:`
- `seq()`
- `rep()`

So far we have mostly used them in isolation, but they are often used together.

```{r}
c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
```

The length of a vector can be obtained with the `length()` function.

```{r}
length(x)
length(y)
```


```{block type="warning"}
Let's try this out! **Your turn**:
```

#### Task 1

1. Create a vector of five ones, i.e. `[1,1,1,1,1]`
1. Notice that the colon operator `a:b` is just short for *construct a sequence **from** `a` **to** `b`*. Create a vector that counts down from 10 to 0, i.e. it looks like `[10,9,8,7,6,5,4,3,2,1,0]`!
1. The `rep` function takes additional arguments `times` (as above), and `each`, which tells you how often *each element* should be repeated (as opposed to the entire input vector). Use `rep` to create a vector that looks like this: `[1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3]`

#### Subsetting

To subset a vector, i.e. to choose only some elements of it, we use square brackets, `[]`. Here we see that `x[1]` returns the first element, and `x[3]` returns the third element:

```{r}
x
x[1]
x[3]
```

We can also exclude certain indexes, in this case the second element. 

```{r}
x[-2]
```

Lastly we see that we can subset based on a vector of indices.

```{r}
x[1:3]
x[c(1,3,4)]
```


All of the above are subsetting a vector using a vector of indexes. (Remember a single number is still a vector.) We could instead use a vector of logical values.

```{r}
z = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
z
```

```{r}
x[z]
```

`R` is able to perform many operations on vectors and scalars alike:

```{r}
x = 1:10  # a vector
x + 1     # add a scalar
2 * x     # multiply all elements by 2
2 ^ x     # take 2 to the x as exponents
sqrt(x)   # compute the square root of all elements in x
log(x)    # take the natural log of all elements in x
x + 2*x   # add vector x to vector 2x
```

We see that when a function like `log()` is called on a vector `x`, a vector is returned which has applied the function to each element of the vector  `x`.
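
To make this concrete, here is a sketch that computes the same result twice: once with an explicit loop over the elements of `x`, and once with a single vectorized call. The object name `out` is hypothetical.

```r
x = 1:10

# element-by-element version with an explicit loop
out = numeric(length(x))      # empty numeric vector of the right length
for (i in seq_along(x)) {
  out[i] = log(x[i])
}

# the vectorized call gives the same result in one line
all.equal(out, log(x))
```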


### Logical Operators

| Operator | Summary               | Example               | Result |
|:---------|:---------------------:|:---------------------:|:-------:|
| `x < y`  | `x` less than `y`                | `3 < 42`               | `r 3 < 42`               |
| `x > y`  | `x` greater than `y`             | `3 > 42`               | `r 3 > 42`               |
| `x <= y` | `x` less than or equal to `y`    | `3 <= 42`              | `r 3 <= 42`              |
| `x >= y` | `x` greater than or equal to `y` | `3 >= 42`              | `r 3 >= 42`              |
| `x == y` | `x` equal to `y`                 | `3 == 42`              | `r 3 == 42`              |
| `x != y` | `x` not equal to `y`             | `3 != 42`              | `r 3 != 42`              |
| `!x`     | not `x`                          | `!(3 > 42)`            | `r !(3 > 42)`            |
| `x | y`  | `x` or `y`                       | `(3 > 42) | TRUE`      | `r (3 > 42) | TRUE`      |
| `x & y`  | `x` and `y`                      | `(3 < 4) & ( 42 > 13)` | `r (3 < 4) & ( 42 > 13)` |

In `R`, logical operators also work on vectors:

```{r}
x = c(1, 3, 5, 7, 8, 9)
```

```{r}
x > 3
x < 3
x == 3
x != 3
```

```{r}
x == 3 & x != 3
x == 3 | x != 3
```

This is quite useful for subsetting.

```{r}
x[x > 3]
x[x != 3]
```


```{r}
sum(x > 3)
as.numeric(x > 3)
```

Here we saw that using the `sum()` function on a vector of logical `TRUE` and `FALSE` values that is the result of `x > 3` results in a numeric result: you just *counted* for how many elements of `x`, the condition `> 3` is `TRUE`. During the call to `sum()`, `R` is first automatically coercing the logical to numeric where `TRUE` is `1` and `FALSE` is `0`. This coercion from logical to numeric happens for most mathematical operations.
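
The same coercion makes it easy to compute a *share* instead of a count: the mean of a logical vector is the proportion of `TRUE` values. A minimal sketch:

```r
x = c(1, 3, 5, 7, 8, 9)

sum(x > 3)     # number of elements larger than 3
mean(x > 3)    # share of elements larger than 3
TRUE + TRUE    # logicals are coerced to 1 and 0 in arithmetic
```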

```{r}
# which(condition on x) returns the indices of x
# at which the condition is TRUE
which(x > 3)
x[which(x > 3)]

max(x)
which(x == max(x))
which.max(x)
```

#### Task 2

1. Create a vector filled with 10 numbers drawn from the uniform distribution (hint: use function `runif`) and store them in `x`.
1. Using logical subsetting as above, get all the elements of `x` which are larger than 0.5, and store them in `y`. 
1. Using the function `which`, store the *indices* of all the elements of `x` which are larger than 0.5 in `iy`. 
1. Check that `y` and `x[iy]` are identical. 

### Matrices

`R` can also be used for **matrix** calculations. Matrices have rows and columns containing a single data type. In a matrix, the order of rows and columns is important. (This is not true of *data frames*, which we will see later.)

Matrices can be created using the `matrix` function. 

```{r}
x = 1:9
x
X = matrix(x, nrow = 3, ncol = 3)
X
```

Notice here that `R` is case sensitive (`x` vs `X`).

By default the `matrix` function fills your data into the matrix column by column. But we can also tell `R` to fill rows instead:

```{r}
Y = matrix(x, nrow = 3, ncol = 3, byrow = TRUE)
Y
```

We can also create a matrix of a specified dimension where every element is the same, in this case `0`.

```{r}
Z = matrix(0, 2, 4)
Z
```

Like vectors, matrices can be subsetted using square brackets, `[]`. However, since matrices are two-dimensional, we need to specify both a row and a column when subsetting.

```{r}
X
X[1, 2]
```

Here we accessed the element in the first row and the second column. We could also subset an entire row or column.

```{r}
X[1, ]
X[, 2]
```

We can also use vectors to subset more than one row or column at a time. Here we subset to the first and third column of the second row:

```{r}
X[2, c(1, 3)]
```

Matrices can also be created by combining vectors as columns, using `cbind`, or combining vectors as rows, using `rbind`.

```{r}
x = 1:9
rev(x)
rep(1, 9)
```

```{r}
rbind(x, rev(x), rep(1, 9))
```

```{r}
cbind(col_1 = x, col_2 = rev(x), col_3 = rep(1, 9))
```

When using `rbind` and `cbind` you can specify "argument" names that will be used as row names (for `rbind`) or column names (for `cbind`).
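
A small sketch of where those names end up. The vectors `a` and `b` and the names `first` and `second` are hypothetical:

```r
a = 1:3
b = 4:6

R = rbind(first = a, second = b)   # vectors become rows
C = cbind(first = a, second = b)   # vectors become columns

rownames(R)   # names given to rbind label the rows
colnames(C)   # names given to cbind label the columns
dim(R)        # 2 rows, 3 columns
dim(C)        # 3 rows, 2 columns
```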

`R` can then be used to perform matrix calculations.

```{r}
x = 1:9
y = 9:1
X = matrix(x, 3, 3)
Y = matrix(y, 3, 3)
X
Y
```

```{r}
X + Y
X - Y
X * Y
X / Y
```

Note that `X * Y` is **not** matrix multiplication. It is *element by element* multiplication. (Same for `X / Y`). 
Matrix multiplication uses `%*%`. Other matrix functions include `t()` which gives the transpose of a matrix and `solve()` which returns the inverse of a square matrix if it is invertible.

```{r}
X %*% Y
t(X)
```

### Arrays

A vector is a one-dimensional array. A matrix is a two-dimensional array. In `R` you can create arrays of arbitrary dimensionality `N`. Here is how:

```{r}
d = 1:16
d3 = array(data = d,dim = c(4,2,2))
d4 = array(data = d,dim = c(4,2,2,3))  # will recycle 1:16
d3
```

You can see that `d3` is simply *two* (4,2) matrices laid on top of each other, as if there were *two pages*. Similarly, `d4` would have two pages, and another 3 registers in a fourth dimension. And so on.
You can subset an array like you would a vector or a matrix, taking care to index each dimension:

```{r}
d3[ ,1,1]  # all elements from col 1, page 1
d3[2:3, , ]  # rows 2:3 from all pages
d3[2,2, ]  # row 2, col 2 from both pages.
```


#### Task 3

1. Create a vector containing `1,2,3,4,5` called `v`. 
1. Create a (2,5) matrix `m` containing the data `1,2,3,4,5,6,7,8,9,10`. The first row should be `1,2,3,4,5`.
1. Perform matrix multiplication of `m` with `v`. Use the command `%*%`. What dimension does the output have?
1. Why does `v %*% m` not work? 


### Lists

A list is a one-dimensional *heterogeneous* data structure. So it is indexed like a vector with a single integer value (or with a name), but each element can contain an element of any type. Lists are similar to a Python `dict` or a Julia `Dict` object. Many `R` structures and outputs are lists themselves. Lists are extremely useful and versatile objects, so make sure you understand their usage:

```{r}
# creation without fieldnames
list(42, "Hello", TRUE)

# creation with fieldnames
ex_list = list(
  a = c(1, 2, 3, 4),
  b = TRUE,
  c = "Hello!",
  d = function(arg = 42) {print("Hello World!")},
  e = diag(5)
)
```

Lists can be subset using two syntaxes, the `$` operator, and square brackets `[]`. The `$` operator returns a named **element** of a list. The `[]` syntax returns a **list**, while the `[[]]` returns an **element** of a list.

- `ex_list[1]` returns a list containing the first element.
- `ex_list[[1]]` returns the first element of the list, in this case, a vector.

```{r}
# subsetting
ex_list$e

ex_list[1:2]
ex_list[1]
ex_list[[1]]
ex_list[c("e", "a")]
ex_list["e"]
ex_list[["e"]]

ex_list$d
ex_list$d(arg = 1)
```

#### Task 4

1. Copy and paste the above code for `ex_list` into your R session. Remember that `list` can hold any kind of `R` object. Like...another list! So, create a new list `new_list` that has two fields: a first field called "this" with string content `"is awesome"`, and a second field called "ex_list" that contains `ex_list`. 
1. Accessing members is like in a plain list, just with several layers now. Get the element `c` from `ex_list` in `new_list`!
1. Compose a new string out of the first element in `new_list`, the element under label `this`. Use the function `paste` to print `R is awesome` to your screen.

## Data Frames {#dataframes}

We have previously seen vectors and matrices for storing data as we introduced `R`. We will now introduce a **data frame** which will be the most common way that we store and interact with data in this course. A `data.frame` is similar to a Python `pandas.DataFrame` or a Julia `DataFrame`. (But the `R` version was the first! :-) )

```{r}
example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
                          y = c(rep("Hello", 9), "Goodbye"),
                          z = rep(c(TRUE, FALSE), 5))
```

Unlike a matrix, which can be thought of as a vector rearranged into rows and columns, a data frame is not required to have the same data type for each element. A data frame is a **list** of vectors, and each vector has a *name*. So, each vector must contain the same data type, but the different vectors can store different data types. Note, however, that all vectors must have **the same length** (differently from a `list`)!

```{block, type="tip"}
A **data.frame** is similar to a typical Spreadsheet. There are *rows*,  and there are *columns*. A row is typically thought of as an *observation*, and each column is a certain *variable*, *characteristic* or *feature* of that observation.
```

<br>
Let's look at the data frame we just created above:

```{r}
example_data
```

Unlike a list, which has more flexibility, the elements of a data frame must all be vectors. Again, we access any given column with the `$` operator:

```{r}
example_data$x

# check that every column has the same length as the number of rows
all(lengths(example_data) == nrow(example_data))

str(example_data)

nrow(example_data)
ncol(example_data)
dim(example_data)
names(example_data)
```


### Working with `data.frames`

The `data.frame()` function above is one way to create a data frame. We can also import data from various file types into `R`, as well as use data stored in packages.

```{r, echo = FALSE}
write.csv(example_data, "data/example-data.csv", row.names = FALSE)
write.csv(example_data,"inst/datasets/example-data.csv", row.names=FALSE)
```

To read this data back into `R`, we will use the built-in function `read.csv`:

```{r, message = FALSE, warning = FALSE}
path = system.file(package="ScPoEconometrics","datasets","example-data.csv")
example_data_from_disk = read.csv(path)
```

This particular line of code assumes that you installed the associated R package to this book, hence you have this dataset stored on your computer at `system.file(package = "ScPoEconometrics","datasets","example-data.csv")`. 

```{r}
example_data_from_disk
```

When using data, there are three things we would generally like to do:

- Look at the raw data.
- Understand the data. (Where did it come from? What are the variables? Etc.)
- Visualize the data.

To look at data in a `data.frame`, we have two useful commands: `head()` and `str()`.

```{r}
# we are working with the built-in mtcars dataset:
mtcars
```

You can see that this prints the entire data.frame to screen. The function `head()` will display the first `n` observations of the data frame. 

```{r}
head(mtcars,n=2)
head(mtcars) # default
```

The function `str()` will display the "structure" of the data frame. It will display the number of **observations** and **variables**, list the variables, give the type of each variable, and show some elements of each variable. This information can also be found in the "Environment" window in RStudio.

```{r}
str(mtcars)
```

In this dataset an observation is for a particular model of a car, and the variables describe attributes of the car, for example its fuel efficiency, or its weight.

To understand more about the data set, we use the `?` operator to pull up the documentation for the data.

```{r, eval = FALSE}
?mtcars
```

`R` has a number of functions for quickly working with and extracting basic information from data frames. To quickly obtain a vector of the variable names, we use the `names()` function.

```{r}
names(mtcars)
```

To access one of the variables **as a vector**, we use the `$` operator.

```{r}
mtcars$mpg
mtcars$wt
```

We can use the `dim()`, `nrow()` and `ncol()` functions to obtain information about the dimension of the data frame.

```{r}
dim(mtcars)
nrow(mtcars)
ncol(mtcars)
```

Here `nrow()` is also the number of observations, which in most cases is the *sample size*.

Subsetting data frames can work much like subsetting matrices using square brackets, `[ , ]`. Here, we find vehicles with mpg over 25 miles per gallon and only display columns `cyl`, `disp` and `wt`.

```{r}
# mtcars[row condition, col condition]
mtcars[mtcars$mpg > 25, c("cyl", "disp", "wt")]
```

An alternative would be to use the `subset()` function, which has a much more readable syntax.

```{r, eval = FALSE}
subset(mtcars, subset = mpg > 25, select = c("cyl", "disp", "wt"))
```
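
As a quick check, the two approaches can be compared directly. The sketch below uses only the built-in `mtcars` data; the object names `by_brackets` and `by_subset` are ours, chosen purely for illustration.

```{r}
# select the same rows and columns in two different ways
by_brackets = mtcars[mtcars$mpg > 25, c("cyl", "disp", "wt")]
by_subset   = subset(mtcars, subset = mpg > 25, select = c("cyl", "disp", "wt"))

identical(by_brackets, by_subset)  # do both give exactly the same data.frame?
nrow(by_brackets)                  # number of cars with mpg above 25
```

Which form you prefer is a matter of taste; square brackets are more general, while `subset()` saves you from repeating the name of the data frame.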

#### Task 5

1. How many observations are there in `mtcars`?
1. How many variables?
1. What is the average value of `mpg`?
1. What is the average value of `mpg` for cars with more than 4 cylinders, i.e. with `cyl>4`?

## Programming Basics

In this section we illustrate some general concepts related to programming.

### Variables

We encountered the term *variable* already several times, but mainly in the context of a column of a data.frame. In programming, a variable denotes an *object*. Another way to say it is that a variable is a name or a *label* for something:

```{r}
x = 1
y = "roses"
z = function(x){sqrt(x)}
```

Here `x` refers to the value `1`, `y` holds the string "roses", and `z` is the name of a function that computes $\sqrt{x}$. Notice that the argument `x` of the function is different from the `x` we just defined. It is **local** to the function: 

```{r}
x
z(9)
```
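
To see what *local* means in practice, here is a minimal sketch. The function name `add_one` is made up for this illustration. Changing `x` inside the function does not touch the `x` in our workspace:

```{r}
x = 1                      # a variable in our workspace
add_one <- function(x) {   # this `x` is local to the function
  x = x + 1                # changes the local `x` only
  x
}
add_one(10)
x                          # the outer `x` is unchanged
```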

### Control Flow

Control Flow relates to ways in which you can adapt your code to different circumstances. Based on a `condition` being `TRUE`, your program will do one thing, as opposed to another thing. This is most widely known as an `if/else` statement. In `R`, the if/else syntax is:

```{r, eval = FALSE}
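if (condition) {   # `condition` must evaluate to TRUE or FALSE
  # some R code, executed if condition is TRUE
} else {
  # some other R code, executed if condition is FALSE
}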
```

For example,

```{r}
x = 1
y = 3
if (x > y) {  # test if x > y
  # if TRUE
  z = x * y
  print("x is larger than y")
} else {
  # if FALSE
  z = x + 5 * y
  print("x is less than or equal to y")
}

z
```
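
When there are more than two cases, conditions can be chained with `else if`. The following is only a sketch; the function name `compare_xy` is hypothetical and not used elsewhere in this book:

```{r}
compare_xy <- function(x, y) {
  if (x > y) {
    "x is larger than y"
  } else if (x == y) {       # only checked if the first condition is FALSE
    "x is equal to y"
  } else {                   # reached if none of the conditions above is TRUE
    "x is smaller than y"
  }
}
compare_xy(1, 3)
compare_xy(2, 2)
compare_xy(5, 3)
```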


### Loops

Loops are a very important programming construct. As the name suggests, in a *loop*, the program *repeatedly* loops over a set of instructions, until some condition tells it to stop. A very powerful, yet simple, feature is that the program can *count how many steps* it has done already, which may be important to know for many algorithms. The syntax of a `for` loop (there are other kinds of loops) is

```{r eval=FALSE}
for (ix in 1:10){   # does not have to be 1:10!
  # loop body: gets executed each time
  # the value of ix changes with each iteration
}
```

For example, consider this simple `for` loop, which will simply print the value of the *iterator* (called `i` in this case) to screen:

```{r}
for (i in 1:5){
  print(i)
}
```

Notice that instead of `1:5`, we could have *any* kind of iterable collection:

```{r}
for (i in c("mangos","bananas","apples")){
  print(paste("I love",i))  # the paste function pastes together strings
}
```

We often also see *nested* loops, which are just what their name suggests:

```{r}
for (i in 2:3){
  # first nest: for each i
  for (j in c("mangos","bananas","apples")){
    # second nest: for each j
    print(paste("Can I get",i,j,"please?"))
  }
}
```

The important thing to note here is that you can do calculations with the iterators *while inside a loop*. 
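
As a small illustration of such a calculation, the sketch below adds up the squares of the numbers 1 to 5. The variable name `total` is our own choice:

```{r}
total = 0
for (i in 1:5){
  total = total + i^2   # use the iterator in a calculation
}
total
sum((1:5)^2)            # the same result without a loop
```

In `R`, the loop-free version is usually preferable, but the loop makes explicit what happens at each step.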

### Functions

So far we have been using functions, but haven't actually discussed some of their details. A function is a set of instructions that `R` executes for us, much like those collected in a script file. The good thing is that functions are much more flexible than scripts, since they can depend on *input arguments*, which change the way the function behaves. Here is how to define a function:

```{r eval=FALSE}
function_name <- function(arg1,arg2=default_value){
  # function body
  # you do stuff with arg1 and arg2
  # you can have any number of arguments, with or without defaults
  # any valid `R` commands can be included here
  # the value of the last evaluated expression is returned
}
```

And here is a trivial example of a function definition:

```{r}
hello <- function(your_name = "Lord Vader"){
  paste("You R most welcome,",your_name)
  # we could also write:
  # return(paste("You R most welcome,",your_name))
}

# we call the function by typing its name with round brackets
hello()
```
You see that by not specifying the argument `your_name`, `R` reverts to the default value given. Try with your own name now! 

Just typing the function name returns the actual definition to us, which is handy sometimes:

```{r}
hello
```

It's instructive to consider that before we defined the function `hello` above, `R` would not have known what to do had you called `hello()`. The function did not exist! In this sense, we *taught `R` a new trick*. This ability to create new capabilities on top of a core language is one of the most powerful characteristics of programming languages. In general, it is good practice to split your code into several smaller functions, rather than one long script file. It makes your code more readable, and it is easier to track down mistakes.
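
Here is one more sketch of a function, this time with one required argument and one argument with a default value. The name `power` is made up for this example:

```{r}
power <- function(x, p = 2){
  x^p               # the last evaluated expression is the return value
}
power(3)            # uses the default p = 2
power(3, p = 3)     # overrides the default
power(1:4)          # works element by element on a vector
```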

#### Task 6

1. Write a for loop that counts down from 10 to 1, printing the value of the iterator to the screen.
1. Modify that loop to print "i iterations to go", where `i` is the iterator.
1. Modify that loop so that each iteration takes roughly one second. You can achieve that by adding the command `Sys.sleep(1)` below the line that prints "i iterations to go". 





================================================
FILE: 02-SummaryStats.Rmd
================================================
# Working With Data  {#sum}


In this chapter we will first learn some basic concepts that help to summarize data. Then, we will tackle a real-world task and read, clean, and summarize data from the web.

## Summary Statistics

`R` has built in functions for a large number of summary statistics. For numeric variables, we can summarize data by looking at their center and spread, for example.

```{r}
# for the mpg dataset, we load:
library(ggplot2)
```

### Central Tendency {-}

Suppose we want to know the *mean* and *median* of all the values stored in the `data.frame` column `mpg$cty`:

| Measure | `R`               | Result              |
|:---------:|:-------------------:|:---------------------:|
| Mean    | `mean(mpg$cty)`   | `r mean(mpg$cty)`   |
| Median  | `median(mpg$cty)` | `r median(mpg$cty)` |

### Spread {-}

How do the values in that column *vary*? How far *spread out* are they?

| Measure            | `R`              | Result             |
|:---------:|:-------------------:|:---------------------:|
| Variance           | `var(mpg$cty)`   | `r var(mpg$cty)`   |
| Standard Deviation | `sd(mpg$cty)`    | `r sd(mpg$cty)`    |
| IQR                | `IQR(mpg$cty)`   | `r IQR(mpg$cty)`   |
| Minimum            | `min(mpg$cty)`   | `r min(mpg$cty)`   |
| Maximum            | `max(mpg$cty)`   | `r max(mpg$cty)`   |
| Range              | `range(mpg$cty)` | `r range(mpg$cty)` |
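
To see what these functions compute, we can redo some of the calculations by hand. This is only a sketch on a small made-up vector `v`; all object names are our own:

```{r}
v = c(2, 4, 4, 7, 8)                               # made-up numbers
n_v = length(v)
var_by_hand = sum((v - mean(v))^2) / (n_v - 1)     # sample variance
all.equal(var_by_hand, var(v))
all.equal(sqrt(var_by_hand), sd(v))                # sd is the square root of the variance
all.equal(IQR(v), unname(quantile(v, 0.75) - quantile(v, 0.25)))  # IQR: 75% minus 25% quantile
diff(range(v))                                     # width of the range: max(v) - min(v)
```

Note that `range()` returns two numbers (the minimum and the maximum), so we use `diff()` to obtain the distance between them.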

### Categorical {-}

For categorical variables, counts and proportions can be used for summary.

```{r}
table(mpg$drv)
table(mpg$drv) / nrow(mpg)
```
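
The base `R` function `prop.table()` computes such shares directly from a table of counts. The sketch below uses a short made-up vector `drv_toy` rather than the `mpg` data:

```{r}
drv_toy = c("f", "f", "4", "r", "f", "4", "f", "4")  # made-up values
drv_counts = table(drv_toy)
drv_counts
prop.table(drv_counts)          # shares that sum to 1
100 * prop.table(drv_counts)    # the same as percentages
```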

## Plotting

Now that we have some data to work with, and we have learned about the data at the most basic level, our next task will be to visualize it. Often, a proper visualization can illuminate features of the data that can inform further analysis.

We will look at four methods of visualizing data by using the basic `plot` facilities built into `R`:

- Histograms
- Barplots
- Boxplots
- Scatterplots

### Histograms

When visualizing a single numerical variable, a **histogram** is useful. It summarizes the *distribution* of values in a vector. In `R` you create one using the `hist()` function:

```{r}
hist(mpg$cty)
```

The histogram function has a number of parameters which can be changed to make our plot look much nicer. Use the `?` operator to read the documentation for `hist()` to see a full list of these parameters.

```{r}
hist(mpg$cty,
     xlab   = "Miles Per Gallon (City)",
     main   = "Histogram of MPG (City)", # main title
     breaks = 12,   # how many breaks?
     col    = "red",
     border = "blue")
```

Importantly, you should always be sure to label your axes and give the plot a title. The argument `breaks` is specific to `hist()`. Entering an integer will give a suggestion to `R` for how many bars to use for the histogram. By default `R` will attempt to intelligently guess a good number of `breaks`, but as we can see here, it is sometimes useful to modify this yourself.

### Barplots

Somewhat similar to a histogram, a barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10.

```{r}
barplot(table(mpg$drv))
```

```{r}
barplot(table(mpg$drv),
        xlab   = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
        ylab   = "Frequency",
        main   = "Drivetrains",
        col    = "dodgerblue",
        border = "darkorange")
```

### Boxplots

To visualize the relationship between a numerical and categorical variable, one could use a **boxplot**. In the `mpg` dataset, the `drv` variable takes a small, finite number of values. A car can only be front wheel drive, 4 wheel drive, or rear wheel drive.

```{r}
unique(mpg$drv)
```

First note that we can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. To do so in `R`, we use the `boxplot()` function. The box shows the *interquartile range*, the solid line in the middle is the median, the whiskers extend to the most extreme observations that lie within 1.5 times the interquartile range from the box, and the dots are outliers.

```{r}
boxplot(mpg$hwy)
```

However, more often we will use boxplots to compare a numerical variable for different values of a categorical variable.

```{r}
boxplot(hwy ~ drv, data = mpg)
```

Here we used the `boxplot()` command to create side-by-side boxplots. However, since we are now dealing with two variables, the syntax has changed. The `R` syntax `hwy ~ drv, data = mpg` reads "Plot the `hwy` variable against the `drv` variable using the dataset `mpg`." We see the use of a `~` (which specifies a formula) and also a `data = ` argument. This will be a syntax that is common to many functions we will use in this course.

```{r}
boxplot(hwy ~ drv, data = mpg,
     xlab   = "Drivetrain (f = FWD, r = RWD, 4 = 4WD)",
     ylab   = "Miles Per Gallon (Highway)",
     main   = "MPG (Highway) vs Drivetrain",
     pch    = 20,
     cex    = 2,
     col    = "darkorange",
     border = "dodgerblue")
```

Again, `boxplot()` has a number of additional arguments which have the ability to make our plot more visually appealing.

### Scatterplots

Lastly, to visualize the relationship between two numeric variables we will use a **scatterplot**. This can be done with the `plot()` function and the `~` syntax we just used with a boxplot. (The function `plot()` can also be used more generally; see the documentation for details.)

```{r}
plot(hwy ~ displ, data = mpg)
```

```{r}
plot(hwy ~ displ, data = mpg,
     xlab = "Engine Displacement (in Liters)",
     ylab = "Miles Per Gallon (Highway)",
     main = "MPG (Highway) vs Engine Displacement",
     pch  = 20,
     cex  = 2,
     col  = "dodgerblue")
```

### `ggplot` {#ggplot}

All of the above plots could also have been generated using the `ggplot` function from the already loaded `ggplot2` package. Which function you use is up to you, but sometimes a plot is easier to build in base R (like in the `boxplot` example maybe), sometimes the other way around.

```{r}
ggplot(data = mpg,mapping = aes(x=displ,y=hwy)) + geom_point()
```

`ggplot` is impossible to describe in brief terms, so please look at [the package's website](http://ggplot2.tidyverse.org) which provides excellent guidance. We will from time to time use ggplot in this book, so you could familiarize yourself with it. Let's quickly demonstrate how one could further customize that first plot:

```{r}
ggplot(data = mpg, mapping = aes(x=displ,y=hwy)) +   # ggplot() makes base plot
  geom_point(color="blue",size=2) +     # how to show x and y?
  scale_y_continuous(name="Miles Per Gallon (Highway)") +  # name of y axis
  scale_x_continuous(name="Engine Displacement (in Liters)") + # x axis
  theme_bw() +    # change the background
  ggtitle("MPG (Highway) vs Engine Displacement")   # add a title
```

If you want to see `ggplot` in action, you could start with [this](http://jcyhong.github.io/ggplot_demo.html) and then look at that [very nice tutorial](https://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html). It's fun!

## Summarizing Two Variables {#summarize-two}

We often are interested in how two variables are related to each other. The core concepts here are *covariance* and *correlation*. Let's generate some data on `x` and `y` and plot them against each other:

```{r x-y-corr,echo=FALSE,message=FALSE,warning=FALSE,fig.cap='How are $x$ and $y$ related?',fig.align='center'}
library(mvtnorm)
set.seed(10)
cor = 0.9
sig = matrix(c(1,cor,cor,1),c(2,2))
ndat = data.frame(rmvnorm(n=300,sigma = sig))
x = ndat$X1
y = ndat$X2
par(pty="s")
plot(y ~ x, xlab="x",ylab="y")  # formula is vertical ~ horizontal: y against x
```

Taking as example the data in this plot, the concepts *covariance* and *correlation* relate to the following type of question:

```{block, type="note"}
Given that we observe a value of $x$, something like $x=2$, say, can we expect a high or a low value of $y$, on average? Something like $y=2$ or rather something like $y=-2$?
```
<br>
This type of question can be addressed by computing the covariance of both variables:

```{r}
cov(x,y)  
```

Here, this gives a positive number, `r round(cov(x,y),2)`, indicating that as one variable lies above its average, the other one does as well. In other words, it indicates a **positive relationship**. What is less clear, however, is how to interpret the magnitude of `r round(cov(x,y),2)`. Is that a *strong* or a *weak* positive association?

In fact, we cannot tell. This is because the covariance depends on the units in which the data are measured (its unit is the product of the units of $x$ and $y$), and those units often differ between both variables. There is a better measure available to us though, the **correlation**, which is obtained by *standardizing* each variable. By *standardizing* a variable $x$, we mean here dividing $x$ by its standard deviation $\sigma_x$:

$$
z = \frac{x}{\sigma_x}
$$

The *correlation coefficient* between $x$ and $y$, commonly denoted $r_{x,y}$, is then defined as

$$
r_{x,y} = \frac{cov(x,y)}{\sigma_x \sigma_y},
$$

and we get rid of the units problem. In `R`, you can call directly

```{r}
cor(x,y)
```
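
The units problem is easy to demonstrate. The sketch below uses made-up heights and weights (all numbers and object names are invented for illustration) and measures the same heights once in meters and once in centimeters:

```{r}
height_m  = c(1.60, 1.65, 1.70, 1.75, 1.80, 1.85)   # made-up heights in meters
weight_kg = c(55, 62, 64, 70, 78, 80)               # made-up weights in kg
height_cm = 100 * height_m                          # the same heights in centimeters

cov(height_m, weight_kg)
cov(height_cm, weight_kg)   # 100 times larger: the covariance depends on units
cor(height_m, weight_kg)
cor(height_cm, weight_kg)   # unchanged: the correlation is unit free

# the correlation is the covariance divided by both standard deviations
all.equal(cor(height_m, weight_kg),
          cov(height_m, weight_kg) / (sd(height_m) * sd(weight_kg)))
```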

Now this is better. Given that the correlation has to lie in $[-1,1]$, a value of `r round(cor(x,y),2)` is indicative of a rather strong positive relationship for the data in figure \@ref(fig:x-y-corr).

Note that $x,y$ being drawn from a *continuous distribution* (they are jointly normally distributed) had no implication for covariance and correlation: We can compute those measures also for discrete random variables (like the throws of two dice, as you will see in one of our tutorials).

### Visually estimating $\sigma$

Sometimes it is useful to estimate the standard deviation of some data *without* the help of a computer (for example during an exam ;-) ). If $x$ is approximately normally distributed, 95% of its observations will lie within a range of $\bar{x}\pm$ two standard deviations of $x$. That is to say, *four* standard deviations of $x$ cover 95% of its observations. Hence, a simple way to estimate the standard deviation for a variable is to look at the range of $x$, and simply divide that number by four. 
 
```{r vis,fig.cap='Visual estimation of $\\sigma$. The x-axis labels show the minimum, mean and maximum of $x$.',echo=FALSE}
sdd = 3
md = 3
dta = rnorm(50,mean=md,sd=sdd)
plot(dta,rep(1,50),pch=3,yaxt="n",ylab="",xlab="x",xaxt="n")
axis(1,at=round(c(min(dta),md,max(dta)),2))
```

This is illustrated in figure \@ref(fig:vis). Here we see that `diff(range(x))/4` gives `r round(diff(range(dta))/4,2)` which compares favourably to the actual standard deviation `r sdd`.
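
You can try this rule of thumb yourself on simulated data. The sketch below assumes an arbitrary seed and our own object names `sim` and `quick_guess`; with a different seed or sample size the guess will be closer to or further from the computed value:

```{r}
set.seed(1)                           # arbitrary seed, for reproducibility
sim = rnorm(50, mean = 3, sd = 3)     # 50 simulated observations
quick_guess = diff(range(sim)) / 4    # (max - min) / 4
quick_guess
sd(sim)                               # the computed sample standard deviation
```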


## The `tidyverse`

[Hadley Wickham](http://hadley.nz) is the author of the R packages `ggplot2` and `dplyr` (and a myriad of others). With `ggplot2` he introduced what is called the *grammar of graphics* (hence, `gg`) to `R`. Grammar in the sense that there are **nouns** and **verbs** and a **syntax**, i.e. rules of how nouns and verbs are to be put together to construct an understandable sentence. He has extended the *grammar* idea into various other packages. The `tidyverse` package is a collection of those packages.

`tidy` data is data where:

* Each variable is a column
* Each observation is a row
* Each value is a cell

Fair enough, you might say, that is a regular spreadsheet. And you are right! However, data comes to us *not* tidy most of the time, and we first need to clean, or `tidy`, it up. Once it's in `tidy` format, we can use the tools in the `tidyverse` with great efficiency to analyse the data and stop worrying about which tool to use.

### Reading `.csv` data in the *tidy* way

We could have used the `read_csv()` function from the `readr` package to read our example dataset from the previous chapter. The `readr` function `read_csv()` has a number of advantages over the built-in `read.csv`. For example, it is much faster reading larger data. [It also uses the `tibble` package to read the data as a tibble.](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) **A `tibble` is simply a data frame that prints with sanity.** Notice in the output below that we are given additional information such as dimension and variable type.

```{r, message = FALSE, warning = FALSE}
library(readr)  # you need `install.packages("readr")` once!
path = system.file(package="ScPoEconometrics","datasets","example-data.csv")
example_data_from_disk = read_csv(path)
```


### Tidy `data.frames` are `tibbles`

Let's grab some data from the `ggplot2` package:

```{r}
data(mpg,package = "ggplot2")  # load dataset `mpg` from `ggplot2` package
head(mpg, n = 10)
```

The function `head()` will display the first `n` observations of the data frame, as we have seen. The `head()` function was more useful before tibbles. Notice that `mpg` is a tibble already, so the output from `head()` indicates there are only `10` observations. Note that this applies to `head(mpg, n = 10)` and not `mpg` itself. Also note that tibbles print a limited number of rows and columns by default. The last line of the printed output indicates which rows and columns were omitted.

```{r}
mpg
```

Let's look at `str` as well to get familiar with the content of the data:

```{r}
str(mpg)
```

In this dataset an observation is for a particular model-year of a car, and the variables describe attributes of the car, for example its highway fuel efficiency.

To understand more about the data set, we use the `?` operator to pull up the documentation for the data.

```{r, eval = FALSE}
?mpg
```

Working with tibbles is mostly the same as working with plain data.frames:

```{r}
names(mpg)
mpg$year
mpg$hwy
```

Subsetting also works much as it does for a plain data.frame. Here, we find fuel efficient vehicles earning over 35 miles per gallon and only display `manufacturer`, `model` and `year`.

```{r}
# mpg[row condition, col condition]
mpg[mpg$hwy > 35, c("manufacturer", "model", "year")]
```

An alternative would be to use the `subset()` function, which has a much more readable syntax.

```{r, eval = FALSE}
subset(mpg, subset = hwy > 35, select = c("manufacturer", "model", "year"))
```

Lastly, and most *tidy*, we could use the `filter` and `select` functions from the `dplyr` package which introduces the *pipe operator* `f(x) %>% g(z)` from the `magrittr` package. This operator takes the output of the first command, for example `y = f(x)`, and passes it *as the first argument* to the next function, i.e. we'd obtain `g(y,z)` here.^[A *pipe* is a concept from the Unix world, where it means to take the output of some command, and pass it on to another command. This way, one can construct a *pipeline* of commands. For additional info on the pipe operator in R, you might be interested [in this tutorial](https://www.datacamp.com/community/tutorials/pipe-r-tutorial).]

```{r, eval = TRUE,message=FALSE,warning=FALSE}
library(dplyr)
mpg %>% 
  filter(hwy > 35) %>% 
  select(manufacturer, model, year)
```

Note that the above syntax is equivalent to the following pipe-free command (which is much harder to read!):

```{r, eval = TRUE,message=FALSE,warning=FALSE}
library(dplyr)
select(filter(mpg, hwy > 35), manufacturer, model, year)
```

All three approaches produce the same results. Which you use will be largely based on a given situation as well as your preference.
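
If the pipe still feels unfamiliar, it may help to see it on plain numbers. This is a minimal sketch that assumes only that `dplyr` is installed, since loading it makes `%>%` available:

```{r, message=FALSE, warning=FALSE}
library(dplyr)                    # makes the pipe operator %>% available
c(1, 4, 9) %>% sqrt() %>% sum()   # read: take the vector, then sqrt, then sum
sum(sqrt(c(1, 4, 9)))             # the same computation, written inside-out
# the left-hand side becomes the *first* argument of the next function:
c(1.234, 5.678) %>% round(1)      # same as round(c(1.234, 5.678), 1)
```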

#### Task 1

1. Make sure to have the `mpg` dataset loaded by typing `data(mpg)` (and `library(ggplot2)` if you haven't!). Use the `table` function to find out how many cars were built by *mercury*.
1. What is the average year in which the audis in this dataset were built? Use the function `mean` on the subset of column `year` that corresponds to `audi`. (Be careful: subsetting a `tibble` returns a `tibble`, not a vector! So get the `year` column after you have subset the `tibble`.)
1. Use the `dplyr` piping syntax from above first with `group_by` and then with `summarise(newvar=your_expression)` to find the mean `year` by all manufacturers (i.e. the same as the previous task, but for all manufacturers. Don't write a loop!).



### Tidy Example: Importing Non-Tidy Excel Data

The data we will look at is from [Eurostat](http://ec.europa.eu/eurostat/data/database) on demography and migration. You should download the data yourself (click on previous link, then drill down to *database by themes > Population and social conditions > Demography and migration > Population change - Demographic balance and crude rates at national level (demo_gind)*).

Once downloaded, we can read the data with the function `read_excel` from the package [`readxl`](http://readxl.tidyverse.org), again part of the `tidyverse` suite.

It's important to know how the data is organized in the spreadsheet. Open the file with Excel to see:

* There is a heading which we don't need.
* There are 5 rows with info that we don't need.
* There is one table per variable (total population, males, females, etc)
* Each table has one row for each country, and one column for each year.
* As such, this data is **not tidy**.

Now we will read the first chunk of data, from the first table: *total population*:

```{r,message=FALSE,warning=FALSE}
library(readxl)  # load the library
# Notice that if you installed the R package of this book,
# you have the .xls data file already at 
# `system.file(package="ScPoEconometrics",
#                        "datasets","demo_gind.xls")`
# otherwise:
# * download the file to your computer
# * change the argument `path` to where you downloaded it
# you may want to change your working directory with `setwd("your/directory")`
# or in RStudio by clicking Session > Set Working Directory

# total population in raw format
tot_pop_raw = read_excel(
                path = system.file(package="ScPoEconometrics",
                                    "datasets","demo_gind.xls"), 
                sheet="Data", # which sheet
                range="A9:K68")  # which excel cell range to read
names(tot_pop_raw)[1] <- "Country"   # let's rename the first column
tot_pop_raw
```

This shows a `tibble`, which we encountered just above. The column names are `Country,2008,2009,...`, and the rows are numbered `1,2,3,...`. Notice, in particular, that *all* columns seem to be of type `<chr>`, i.e. characters - a string, not a number! We'll have to fix that, as this is clearly numeric data.

#### `tidyr`

In the previous `tibble`, each year is a column name (like `2008`) instead of all years being collected in one column `year`. We really would like to have several rows for each Country, one row per year. We want to `gather()` all years into a new column to tidy this up - and here is how:

1. specify which columns are to be gathered: in our case, all years (note that `paste(2008:2017)` produces a character vector like `c("2008", "2009", "2010", ...)`)
1. say what those columns should be gathered into, i.e. what is the *key* for those values: we'll call it `year`.
1. Finally, what is the name of the new resulting column, containing the *value* from each cell: let's call it `counts`.

```{r gather,warning=FALSE}
library(tidyr)   # for the gather function
tot_pop = gather(tot_pop_raw, paste(2008:2017),key="year", value = "counts")
tot_pop
```
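
To see more clearly what `gather()` does, here is the same operation on a tiny made-up table. This is only a sketch; `toy_wide` and `toy_long` and all numbers in them are invented for illustration:

```{r}
library(tidyr)
# a tiny table in the same wide layout: one column per year
toy_wide = data.frame(Country = c("A", "B"),
                      `2008` = c(10, 20),
                      `2009` = c(11, 21),
                      check.names = FALSE)   # keep "2008" and "2009" as column names
toy_wide
toy_long = gather(toy_wide, `2008`, `2009`, key = "year", value = "counts")
toy_long   # one row per Country and year
```

The two year columns have been stacked into the pair of columns `year` (the former column names) and `counts` (the former cell values).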

That's better! However, `counts` is still `chr`! Let's convert it to a number:

```{r convert}
tot_pop$counts = as.integer(tot_pop$counts)
tot_pop
```

Now you can see that column `counts` is indeed `int`, i.e. an integer number, and we are fine. The `Warning: NAs introduced by coercion` means that `R` converted some values to `NA`, because it couldn't convert them into `numeric`. More below!
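
The coercion behind that warning can be reproduced on a small scale. The values in this sketch are made up, with `":"` standing in for an entry that is not a number:

```{r}
toy_raw = c("10", ":", "30")    # made-up values; ":" is not a number
toy_num = as.integer(toy_raw)   # gives a warning: ":" cannot be converted
toy_num
is.na(toy_num)                  # TRUE where the conversion failed
```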

#### `dplyr`

>The [transform](http://r4ds.had.co.nz/transform.html) chapter of Hadley Wickham's book is a great place to read up more on using `dplyr`.

With `dplyr` you can do the following operations on `data.frame`s and `tibble`s:

* Choose observations based on a certain value (i.e. subset): `filter()`
* Reorder rows: `arrange()`
* Select variables by name: `select()`
* Create new variables out of existing ones: `mutate()`
* Summarise variables: `summarise()`

All of those verbs can be used with `group_by()`, where we apply the respective operation on a *group* of the dataframe/tibble. For example, on our `tot_pop` tibble we will now

* filter
* mutate
* and plot the resulting values

Let's get a plot of the populations of France, the UK and Italy over time, in terms of millions of people. We will make use of the `piping` syntax of `dplyr` which we introduced just above.

```{r gather-plot,warning=FALSE,message=FALSE}
library(dplyr)  # for %>%, filter, mutate, ...
# 1. take the data.frame `tot_pop`
tot_pop %>%
  # 2. pipe it into the filter function
  # filter on Country being one of "France","United Kingdom" or "Italy"
  filter(Country %in% c("France","United Kingdom","Italy")) %>%
  # 3. pipe the result into the mutate function
  # create a new column called millions
  mutate(millions = counts / 1e6) %>%
  # 4. pipe the result into ggplot to make a plot
  ggplot(mapping = aes(x=year,y=millions,color=Country,group=Country)) + geom_line(size=1)
```

#### Arrange a `tibble` {-} 

* What are the top/bottom 5 most populated areas?

```{r,message=FALSE}
top5 = tot_pop %>%
  arrange(desc(counts)) %>%  # arrange in descending order of col `counts`
  top_n(5)

bottom5 = tot_pop %>%
  arrange(desc(counts)) %>%
  top_n(-5)
# let's see top 5
top5
# and bottom 5
bottom5
```

Now this is not exactly what we wanted. It's always the same country in both top and bottom, because there are multiple years per country. Let's compute average population over the last 5 years and rank according to that:

```{r,message=FALSE}
topbottom = tot_pop %>%
  group_by(Country) %>%
  filter(as.integer(year) > 2012) %>%  # `year` is stored as text: convert before comparing
  summarise(mean_count = mean(counts)) %>%
  arrange(desc(mean_count))

top5 = topbottom %>% top_n(5)
bottom5 = topbottom %>% top_n(-5)
top5
bottom5
```
That's better! 

#### Look for `NA`s in a `tibble` {-} 

Sometimes data is *missing*, and `R` represents it with the special value `NA` (not available). It is good to know where in our dataset we are going to encounter any missing values, so the task here is: let's produce a table that has three columns:

1. the names of countries with missing data
2. how many years of data are missing for each of those
3. and the actual years that are missing

```{r}
missings = tot_pop %>%
  filter(is.na(counts)) %>% # is.na(x) returns TRUE if x is NA
  group_by(Country) %>%
  summarise(n_missing = n(),years = paste(year,collapse = ", "))
knitr::kable(missings)  # knitr::kable makes a nice table
```
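
The basic tools for working with missing values are worth trying on a small vector first. The numbers in this sketch are made up:

```{r}
toy_counts = c(5, NA, 7, NA, 9)   # made-up values with two missing entries
is.na(toy_counts)                 # TRUE where a value is missing
sum(is.na(toy_counts))            # how many are missing?
which(is.na(toy_counts))          # at which positions?
mean(toy_counts)                  # NA: missing values propagate
mean(toy_counts, na.rm = TRUE)    # drop missing values first
```

Keep the last two lines in mind: a summary such as `mean()` returns `NA` as soon as a single value is missing, unless you ask for missing values to be removed.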


#### Males and Females {-} 

Let's look at the numbers by male and female population. They are in the same xls file, but at different cell ranges. Also, I just realised that the special character `:` indicates *missing* data. We can feed that to `read_excel` and that will spare us the need to convert data types afterwards. Let's see:

```{r females}
females_raw = read_excel(
                path = system.file(package="ScPoEconometrics",
                                    "datasets","demo_gind.xls"), 
                sheet="Data", # which sheet
                range="A141:K200",  # which excel cell range to read
                na=":" )   # missing data indicator
names(females_raw)[1] <- "Country"   # let's rename the first column
females_raw
```

You can see that `R` now correctly read the numbers as such, after we told it that the `:` character has the special *missing* meaning: before, it *coerced* the entire `2008` column (for example) to be of type `chr` after it hit the first `:`. We had to manually convert the column back to `numeric`, in the process automatically coercing the `:`s into `NA`. Now we addressed that issue directly. Let's also get the male data in the same way:

```{r males}
males_raw = read_excel(
                path = system.file(package="ScPoEconometrics",
                                    "datasets","demo_gind.xls"), 
                sheet="Data", # which sheet
                range="A75:K134",  # which excel cell range to read
                na=":" )   # missing data indicator
names(males_raw)[1] <- "Country"   # let's rename the first column
```

The next step is to `tidy` up this data, just as before:

```{r tidymales}
females = gather(females_raw, paste(2008:2017),key="year", value = "counts")
males = gather(males_raw, paste(2008:2017),key="year", value = "counts")
```

Let's try to tweak our above plot to show the same data in two separate panels: one for males and one for females. This is easiest to do with `ggplot` if we have all the data in one single `data.frame` (or `tibble`), and marked with a *group identifier*. Let's first add this to both datasets, and then let's just combine both into one:

```{r}
females$sex = "female"
males$sex = "male"
sexes = rbind(males,females)   # "row bind" 2 data.frames
sexes
```

Now that we have all the data nice and `tidy` in a `data.frame`, this is a very small change to our previous plotting code:

```{r psexes}
sexes %>%
  filter(Country %in% c("France","United Kingdom","Italy")) %>%
  mutate(millions = counts / 1e6) %>%
  ggplot(mapping = aes(x=as.Date(year,format="%Y"),  # convert to `Date`
                       y=millions,colour=Country,group=Country)) + 
      geom_line() +
  scale_x_date(name = "year") + # rename x axis
  facet_wrap(~sex)   # make two panels, splitting by groups `sex`
```

#### Always Compare to Germany :-) {-}

How do our three countries compare with respect to the biggest country in the EU in terms of population? What *fraction* of Germany does the French population make in any given year, for example?

```{r}
# remember that the pipe operator %>% takes the 
# result of the previous operation and passes it
# as the *first* argument to the next function call
merge_GER <- tot_pop %>%
  # 1. subset to countries of interest
  filter(
    Country %in% 
      c("France",
        "United Kingdom",
        "Italy")
    ) %>%
  # 2. group data by year
  group_by(year) %>%
  # 3. add GER's count as new column *by year*
  left_join(
    # Germany only
    filter(tot_pop,
           Country %in% "Germany including former GDR"),
    # join back in `by year`
    by="year")
merge_GER
```
 
Here you see that the merge (or join) operation labels columns as `col.x` and `col.y` whenever both datasets contain a column called `col`. Now let's continue and compute what proportion of the German population each country amounts to:


```{r}
names(merge_GER)[1] <- "Country"
merge_GER %>%
  # 4. compute each country's population as a percentage of Germany's
  mutate(prop_GER = 100 * counts.x / counts.y) %>%
  # 5. plot
  ggplot(mapping = 
           aes(x = year,
               y = prop_GER,
               color = Country,
               group = Country)) + 
  geom_line(size=1) + 
  scale_y_continuous("percent of German population") + 
  theme_bw()  # new theme for a change?
```
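
The `.x`/`.y` naming used in this join can be reproduced on a tiny example. The following is only a sketch: it uses base `R`'s `merge()` as a stand-in for `left_join()` (both add these suffixes by default), and the data frame `pop` with countries `A`, `B` and `C` is made up for illustration.

```r
# Toy population counts (made-up numbers, for illustration only)
pop <- data.frame(
  Country = c("A", "A", "B", "B", "C", "C"),
  year    = c("2008", "2009", "2008", "2009", "2008", "2009"),
  counts  = c(50, 52, 30, 31, 100, 101)
)

others    <- pop[pop$Country != "C", ]   # countries we want to compare
reference <- pop[pop$Country == "C", ]   # the reference country only

# Both inputs contain `Country` and `counts`, so after joining on `year`
# those columns get the suffix `.x` (left input) and `.y` (right input)
joined <- merge(others, reference, by = "year")
names(joined)

# each country's count as a percentage of the reference country's count
joined$prop_ref <- 100 * joined$counts.x / joined$counts.y
joined
```

Every row of `others` is matched with the reference row of the same `year`, which is what allows us to divide `counts.x` by `counts.y` row by row.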






================================================
FILE: 03-linear-reg.Rmd
================================================
---
output:
  pdf_document: default
  html_document: default
---
# Linear Regression {#linreg}

In this chapter we will learn an additional way to represent the relationship between an *outcome*, or *dependent*, variable $y$ and an *explanatory*, or *independent*, variable $x$. We will refer throughout to the graphical representation of a collection of independent observations on $x$ and $y$, i.e., a *dataset*. 

## How are `x` and `y` related?
    
### Data on Cars

We will look at the built-in `cars` dataset. Let's get a view of this by just typing `View(cars)` in RStudio. You will see something like this:

```{r,echo=FALSE}
head(cars)
```

We have a `data.frame` with two columns: `speed` and `dist`. Type `help(cars)` to find out more about the dataset. There you could read that

>The data give the speed of cars (mph) and the distances taken to stop (ft).

It's good practice to know the extent of a dataset. You could just type 

```{r}
dim(cars)
```

to find out that we have 50 rows and 2 columns. A central question that we want to ask now is the following:

### How are `speed` and `dist` related?

The simplest way to start is to plot the data. Remembering that we view each row of a data.frame as an observation, we could just label one axis of a graph `speed`, and the other one `dist`, and go through our table above row by row. We just have to read off the x/y coordinates and mark them in the graph. In `R`:

```{r}
plot(dist ~ speed, data = cars,
     xlab = "Speed (in Miles Per Hour)",
     ylab = "Stopping Distance (in Feet)",
     main = "Stopping Distance vs Speed",
     pch  = 20,
     cex  = 2,
     col  = "red")
```

Here, each dot represents one observation. In this case, one particular measurement of `speed` and `dist` for a car. Now, again: 


```{block, type='note'}
How are `speed` and `dist` related? How could one best *summarize* this relationship?
```

<br>
One thing we could do is draw a straight line through this scatterplot, like so:

```{r}
plot(dist ~ speed, data = cars,
     xlab = "Speed (in Miles Per Hour)",
     ylab = "Stopping Distance (in Feet)",
     main = "Stopping Distance vs Speed",
     pch  = 20,
     cex  = 2,
     col  = "red")
abline(a = 60,b = 0,lw=3)
```

Now that doesn't seem a particularly *good* way to summarize the relationship. Clearly, a *better* line would not be flat, but have a *slope*, i.e. go upwards:

```{r,echo=FALSE}
plot(dist ~ speed, data = cars,
     xlab = "Speed (in Miles Per Hour)",
     ylab = "Stopping Distance (in Feet)",
     main = "Stopping Distance vs Speed",
     pch  = 20,
     cex  = 2,
     col  = "red")
abline(a = 0,b = 5,lw=3)
```

That is slightly better. However, the line seems to sit at too high a level. The point at which a line crosses the y-axis is called the *intercept*, and here it's too high. We have just learned how to represent a *line*: with two numbers called *intercept* and *slope*. Let's write down a simple formula which represents a line where some outcome $z$ is related to a variable $x$:

\begin{equation}
z = b_0 + b_1 x (\#eq:bline)
\end{equation}

Here $b_0$ represents the value of the intercept (i.e. $z$ when $x=0$), and $b_1$ is the value of the slope. The question for us is now: How should we choose the numbers $b_0$ and $b_1$ such that the result is a **good** line?

### Choosing the Best Line

```{r, echo = FALSE, message = FALSE, warning = FALSE}
generate_data = function(int = 0.5,
                         slope = 1,
                         sigma = 10,
                         n_obs = 9,
                         x_min = 0,
                         x_max = 10) {
  x = seq(x_min, x_max, length.out = n_obs)
  y = int + slope * x + rnorm(n_obs, 0, sigma)
  fit = lm(y ~ x)
  y_hat = fitted(fit)
  y_bar = rep(mean(y), n_obs)
  error = resid(fit)
  meandev = y - y_bar
  data.frame(x, y, y_hat, y_bar, error, meandev)
}

plot_total_dev = function(reg_data,title=NULL) {
  if (is.null(title)){
    plot(reg_data$x, reg_data$y, 
       xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  rect(xleft = reg_data$x, ybottom = reg_data$y,
         xright = reg_data$x + abs(reg_data$meandev), ytop = reg_data$y - reg_data$meandev, density = -1,
         col = rgb(red = 0, green = 0, blue = 1, alpha = 0.5), border = NA)
  } else {
    plot(reg_data$x, reg_data$y, 
       xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey",main=title,ylim=c(-2,10.5))
     axis(side=2,at=seq(-2,10,by=2))
  rect(xleft = reg_data$x, ybottom = reg_data$y,
         xright = reg_data$x + abs(reg_data$meandev), ytop = reg_data$y - reg_data$meandev, density = -1,
         col = rgb(red = 0, green = 0, blue = 1, alpha = 0.5), border = NA)
  }
  # arrows(reg_data$x, reg_data$y_bar,
  #        reg_data$x, reg_data$y,
  #        col = 'grey', lwd = 1, lty = 3, length = 0.2, angle = 20)
  abline(h = mean(reg_data$y), lwd = 2,col = "grey")
  # abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
}

plot_total_dev_prop = function(reg_data) {
  plot(reg_data$x, reg_data$y, 
       xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  arrows(reg_data$x, reg_data$y_bar,
         reg_data$x, reg_data$y_hat,
         col = 'darkorange', lwd = 1, length = 0.2, angle = 20)
  arrows(reg_data$x, reg_data$y_hat,
         reg_data$x, reg_data$y,
         col = 'dodgerblue', lwd = 1, lty = 2, length = 0.2, angle = 20)
  abline(h = mean(reg_data$y), lwd = 2,col = "grey")
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
}

plot_unexp_dev = function(reg_data) {
  plot(reg_data$x, reg_data$y, 
       xlab = "x", ylab = "y", pch = 20, cex = 2,asp=1)
  arrows(reg_data$x, reg_data$y_hat,
         reg_data$x, reg_data$y,
         col = 'red', lwd = 2, lty = 1, length = 0.1, angle = 20)
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "black")
}

plot_unexp_SSR = function(reg_data,asp=1,title=NULL) {
  if (is.null(title)){
      plot(reg_data$x, reg_data$y,
       xlab = "x", ylab = "y", pch = 20, cex = 2, 
  rect(xleft = reg_data$x, ybottom = reg_data$y,
         xright = reg_data$x + abs(reg_data$error), ytop = reg_data$y - reg_data$error, density = -1,
         col = rgb(red = 1, green = 0, blue = 0, alpha = 0.5), border = NA),asp=asp)
      abline(lm(y ~ x, data = reg_data), lwd = 2, col = "black")
  } else {
      plot(reg_data$x, reg_data$y,
       xlab = "x", ylab = "y", pch = 20, cex = 2, 
  rect(xleft = reg_data$x, ybottom = reg_data$y,
         xright = reg_data$x + abs(reg_data$error), ytop = reg_data$y - reg_data$error, density = -1,
         col = rgb(red = 1, green = 0, blue = 0, alpha = 0.5), border = NA),asp=asp,main=title)
    axis(side=2,at=seq(-2,10,by=2))
      abline(lm(y ~ x, data = reg_data), lwd = 2, col = "black")
  }
}

plot_exp_dev = function(reg_data) {
  plot(reg_data$x, reg_data$y, main = "SSReg (Sum of Squares Regression)", 
  xlab = "x", ylab = "y", pch = 20, cex = 2, col = "grey")
  arrows(reg_data$x, reg_data$y_bar,
         reg_data$x, reg_data$y_hat,
         col = 'darkorange', lwd = 1, length = 0.2, angle = 20)
  abline(lm(y ~ x, data = reg_data), lwd = 2, col = "grey")
  abline(h = mean(reg_data$y), col = "grey")
}
```

```{r, echo=FALSE, message=FALSE, warning=FALSE}
set.seed(21)
plot_data = generate_data(sigma = 2)
```

In order to be able to reason about a good or a bad line, we need to denote the *output* of equation \@ref(eq:bline). We call the value $\hat{y}_i$ the *predicted value* for observation $i$, after having chosen some particular values $b_0$ and $b_1$:

\begin{equation}
\hat{y}_i = b_0 + b_1 x_i (\#eq:abline-pred)
\end{equation}

In general it is likely that we won't be able to choose $b_0$ and $b_1$ in such a way as to provide a perfect prediction, i.e. one where $\hat{y}_i = y_i$ for all $i$. That is, we expect to make an *error* in our prediction $\hat{y}_i$, so let's denote this value $e_i$. If we acknowledge that we will make errors, let's at least make them as small as possible! Exactly this is going to be our task now.

Suppose we have the following set of `r nrow(plot_data)` observations on `x` and `y`, and we put the *best* straight line that we can think of through them. It would look like this: 

```{r line-arrows, echo=FALSE, message=FALSE, warning=FALSE, fig.cap="The best line and its errors",fig.align="center"}
plot_unexp_dev(plot_data)
```

Here, the red arrows indicate the **distance** between the prediction (i.e. the black line) and each data point. In other words, each arrow is a particular $e_i$. An upward pointing arrow indicates a positive value of a particular $e_i$, and vice versa for downward pointing arrows. The errors are also called *residuals*, which comes from the way we can write the equation for this relationship between two particular values $(y_i,x_i)$ belonging to observation $i$:

\begin{equation}
y_i = b_0 + b_1 x_i + e_i (\#eq:abline)
\end{equation}

You realize of course that $\hat{y}_i = y_i - e_i$, which just means that our prediction is the observed value $y_i$ minus any error $e_i$ we make. In other words, $e_i$ is what is left to be explained on top of the line $b_0 + b_1 x_i$, hence, it's a residual to explain $y_i$. Here are $y,\hat{y}$ and the resulting $e$ which are plotted in figure \@ref(fig:line-arrows):

```{r,echo=FALSE}
knitr::kable(subset(plot_data,select=c(x,y,y_hat,error)),align = "c",digits = 2)
```

If our line was a **perfect fit** to the data, all $e_i = 0$, and the column `error` would display `0` for each row - there would be no errors at all. (All points in figure \@ref(fig:line-arrows) would perfectly line up on a straight line). 

Now, back to our claim that this particular line is the *best* line. What exactly characterizes this best line? We now come back to what we said above: *how to make the errors as small as possible*? Keeping in mind that each residual $e_i$ is $y_i - \hat{y}_i$, we have the following minimization problem to solve:

\begin{align}
e_i & = y_i - \hat{y}_i = y_i - \underbrace{\left(b_0 + b_1 x_i\right)}_\text{prediction}\\
e_1^2 + \dots + e_N^2 &= \sum_{i=1}^N e_i^2 \equiv \text{SSR}(b_0,b_1) \\
(b_0,b_1) &= \arg \min_{\text{int},\text{slope}} \sum_{i=1}^N \left[y_i - \left(\text{int} + \text{slope } x_i\right)\right]^2 (\#eq:ols-min)
\end{align}

```{block,type="warning"}
The best line chooses $b_0$ and $b_1$ so as to minimize the sum of **squared residuals** (SSR). 
```

<br>
Wait a moment, why *squared* residuals? This is easy to understand: suppose that instead, we wanted to just make the *sum* of the arrows in figure \@ref(fig:line-arrows) as small as possible (that is, no squares). Choosing our line to make this number small would not give a particularly good representation of the data -- given that errors of opposite sign and equal magnitude offset, we could have very long arrows (but of opposite signs), and a poor resulting line. Squaring each error avoids this (because now negative errors get positive values!)
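
The cancelling of positive and negative errors can be checked numerically. The sketch below uses the built-in `cars` data and compares a flat line through the mean of `dist` with the least squares line. It relies on `lm()`, `R`'s function for fitting such a line, which is only introduced properly later in this chapter; the helper `errors()` is a hypothetical name chosen for this illustration.

```r
# residuals of a line with intercept b0 and slope b1 on the `cars` data
errors <- function(b0, b1) cars$dist - (b0 + b1 * cars$speed)

# candidate 1: a flat line through the mean of dist
e_flat <- errors(b0 = mean(cars$dist), b1 = 0)

# candidate 2: the line chosen by least squares
fit   <- lm(dist ~ speed, data = cars)
b     <- unname(coef(fit))
e_ols <- errors(b0 = b[1], b1 = b[2])

# plain sums: positive and negative errors cancel for BOTH lines
c(flat = sum(e_flat), ols = sum(e_ols))

# sums of squares: the flat line is clearly worse
c(flat = sum(e_flat^2), ols = sum(e_ols^2))
```

Judged by the plain sum of errors, the two lines look equally good (both sums are zero up to rounding). Only the sum of squares tells them apart.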

```{r line-squares, echo=FALSE, message=FALSE, warning=FALSE, fig.align='center', fig.cap="The best line and its SQUARED errors"}
plot_unexp_SSR(plot_data)
```

We illustrate this in figure \@ref(fig:line-squares). This is the same data as in figure \@ref(fig:line-arrows), but instead of arrows of length $e_i$ for each observation $i$, now we draw a square with side length $|e_i|$, i.e. an area of $e_i^2$. We have three apps for you at this point: one where you have to try and find the best line by choosing $b_0$ and $b_1$, only focusing on the sum of errors (and not their square), a second one focusing on squared errors, and a third one which visualizes the minimization problem from above:

```{r app1, eval=FALSE}
library(ScPoApps)
launchApp("reg_simple_arrows")
launchApp("reg_simple") # with squared errors
launchApp("SSR_cone") # visualize the minimization problem from above!
```

Most of our `apps` have an associated `about` document, which gives extra information and explanations. After you have looked at all three apps, we invite you to have a look at the associated explainers by typing

```{r,eval=FALSE}
aboutApp("reg_simple_arrows")
aboutApp("reg_simple") 
aboutApp("SSR_cone") 
```

## Ordinary Least Squares (OLS) Estimator{#OLS}

The method to compute (or *estimate*) $b_0$ and $b_1$ we illustrated above is called *Ordinary Least Squares*, or OLS. $b_0$ and $b_1$ are therefore also often called the *OLS coefficients*. By solving problem \@ref(eq:ols-min) one can derive an explicit formula for them:

\begin{equation}
b_1 = \frac{cov(x,y)}{var(x)},  (\#eq:beta1hat)
\end{equation}

i.e. the estimate of the slope coefficient is the covariance between $x$ and $y$ divided by the variance of $x$, both computed from our sample of data. With $b_1$ in hand, we can get the estimate for the intercept as

\begin{equation}
b_0 = \bar{y} - b_1 \bar{x}.  (\#eq:beta0hat)
\end{equation}

where $\bar{z}$ denotes the sample mean of variable $z$. The interpretation of the OLS slope coefficient $b_1$ is as follows. Given a line as in $y = b_0 + b_1 x$,

* $b_1 = \frac{d y}{d x}$ measures the change in $y$ resulting from a one unit change in $x$
* For example, if $y$ is wage and $x$ is years of education, $b_1$ would measure the effect of an additional year of education on wages.

There is an alternative representation for the OLS slope coefficient which relates to the *correlation coefficient* $r$. Remember from section \@ref(summarize-two) that $r = \frac{cov(x,y)}{s_x s_y}$, where $s_z$ is the standard deviation of variable $z$. With this in hand, we can derive the OLS slope coefficient as

\begin{align}
b_1 &= \frac{cov(x,y)}{var(x)}\\
    &= \frac{cov(x,y)}{s_x s_x} \\
    &= \frac{cov(x,y)}{s_x s_y} \cdot \frac{s_y}{s_x} \\
    &= r\frac{s_y}{s_x} (\#eq:beta1-r)
\end{align}
In other words, the slope coefficient is equal to the correlation coefficient $r$ times the ratio of standard deviations of $y$ and $x$.
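
As a quick check of formulas \@ref(eq:beta1hat), \@ref(eq:beta0hat) and \@ref(eq:beta1-r), here is a minimal sketch on the built-in `cars` data. The variable names `b0`, `b1` and `b1_r` are just chosen for this illustration; `lm()` is `R`'s function for linear regression, which we use in more detail later in this chapter.

```r
x <- cars$speed
y <- cars$dist

b1   <- cov(x, y) / var(x)          # slope: covariance over variance
b0   <- mean(y) - b1 * mean(x)      # intercept
b1_r <- cor(x, y) * sd(y) / sd(x)   # slope via the correlation coefficient

c(b0 = b0, b1 = b1, b1_r = b1_r)

# the coefficients computed by lm() should agree with b0 and b1
coef(lm(dist ~ speed, data = cars))
```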

### Linear Regression without Regressor

There are several important special cases for the linear regression introduced above. Let's start with the most obvious one: What is the meaning of running a regression *without any regressor*, i.e. without an $x$? Our line becomes very simple. Instead of \@ref(eq:bline), we get

\begin{equation}
y = b_0. (\#eq:b0line)
\end{equation}

This means that our minimization problem in \@ref(eq:ols-min) *also* becomes very simple: We only have to choose $b_0$! We have 

$$
b_0 = \arg\min_{\text{int}} \sum_{i=1}^N \left[y_i - \text{int}\right]^2,
$$
which is a quadratic function of $\text{int}$ with a unique minimum, such that 
$$
b_0 = \frac{1}{N} \sum_{i=1}^N y_i = \overline{y}.
$$

```{block type='tip'}
Least Squares **without regressor** $x$ estimates the sample mean of the outcome variable $y$, i.e. it produces $\overline{y}$.
```
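
In `R`'s formula notation, a regression without any regressor is written as `y ~ 1`. A minimal sketch on the built-in `cars` data (the object name `fit0` is just chosen for this illustration):

```r
fit0 <- lm(dist ~ 1, data = cars)   # `~ 1` means: intercept only, no regressor
coef(fit0)                          # the estimated b_0 ...
mean(cars$dist)                     # ... is the sample mean of the outcome
```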


### Regression without an Intercept

We follow the same logic here, except that we now drop a different part of our initial equation: the intercept. The minimisation problem in \@ref(eq:ols-min) becomes:

\begin{align}
b_1 &= \arg\min_{\text{slope}} \sum_{i=1}^N \left[y_i - \text{slope } x_i \right]^2\\
\mapsto b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N x_i y_i}{\frac{1}{N}\sum_{i=1}^N x_i^2} = \frac{\overline{x y}}{\overline{x^2}} (\#eq:b1line)
\end{align}
where $\overline{x y}$ denotes the sample mean of the products $x_i y_i$, and $\overline{x^2}$ the sample mean of the squares $x_i^2$.

```{block type='tip'}
Least Squares **without intercept** (i.e. with $b_0=0$) is a line that passes through the origin. 
```
<br>

In this case we only get to choose the slope $b_1$ of this anchored line.^[This slope equals $\frac{\mathbf{x}\cdot\mathbf{y}}{\mathbf{x}\cdot\mathbf{x}}$ for the data vectors $\mathbf{x} = (x_1,\dots,x_N)$ and $\mathbf{y} = (y_1,\dots,y_N)$. Hence, it's related to the [scalar projection](https://en.wikipedia.org/wiki/Scalar_projection) of $\mathbf{y}$ on $\mathbf{x}$.] You should now try out both of those restrictions on our linear model by spending some time with 

```{r,eval=FALSE}
launchApp("reg_constrained")
```
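
Outside of the app, the slope of the regression without intercept can also be computed by hand. The sketch below uses the built-in `cars` data; in `R`'s formula notation, adding `0 +` removes the intercept. Note that the numerator is the average of the products $x_i y_i$, which in general differs from the product of the averages.

```r
x <- cars$speed
y <- cars$dist

# slope of the line through the origin, computed by hand
b1_origin <- mean(x * y) / mean(x^2)

# the same regression with lm(): `0 +` drops the intercept
fit_origin <- lm(dist ~ 0 + speed, data = cars)

c(by_hand = b1_origin, lm = unname(coef(fit_origin)))

# for comparison: the product of the means gives a different number
mean(x) * mean(y) / mean(x^2)
```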

### Centering A Regression

By *centering* or *demeaning* a regression, we mean to subtract from both $y$ and $x$ their respective averages to obtain $\tilde{y}_i = y_i - \bar{y}$ and $\tilde{x}_i = x_i - \bar{x}$. We then run a regression *without intercept* as above. That is, we use $\tilde{x}_i,\tilde{y}_i$ instead of $x_i,y_i$ in \@ref(eq:b1line) to obtain our slope estimate $b_1$:

\begin{align}
b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N \tilde{x}_i \tilde{y}_i}{\frac{1}{N}\sum_{i=1}^N \tilde{x}_i^2}\\
    &= \frac{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x}) (y_i - \bar{y})}{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2} \\
    &= \frac{cov(x,y)}{var(x)}
    (\#eq:bline-centered)
\end{align}

This last expression is *identical* to the one in \@ref(eq:beta1hat)! It's the standard OLS estimate for the slope coefficient. We note the following: 

```{block type='tip'}
Adding a constant to a regression produces the same slope estimate as centering all variables and estimating without intercept. So, unless all variables are centered, **always** include an intercept in the regression.
```
<br>
To get a better feel for what is going on here, you can try this out now by yourself by typing:

```{r,eval=FALSE}
launchApp("demeaned_reg")
```
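
The same comparison can be done in a few lines of code. This is a minimal sketch on the built-in `cars` data; the names `x_c` and `y_c` for the centered variables are just chosen for this illustration.

```r
x_c <- cars$speed - mean(cars$speed)   # demeaned regressor
y_c <- cars$dist  - mean(cars$dist)    # demeaned outcome

# centered data, no intercept
slope_centered <- unname(coef(lm(y_c ~ 0 + x_c)))

# raw data, with intercept
slope_intercept <- unname(coef(lm(dist ~ speed, data = cars))[2])

c(centered = slope_centered, with_intercept = slope_intercept)
```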

### Standardizing A Regression {#reg-standard}

*Standardizing* a variable $z$ means to demean as above, and in addition to divide the demeaned value by its own standard deviation. Similarly to what we did above for *centering*, we define transformed variables $\breve{y}_i = \frac{y_i-\bar{y}}{\sigma_y}$ and $\breve{x}_i = \frac{x_i-\bar{x}}{\sigma_x}$ where $\sigma_z$ is the standard deviation of variable $z$. By now you should be used to what comes next! As above, we use $\breve{x}_i,\breve{y}_i$ instead of $x_i,y_i$ in \@ref(eq:b1line), this time to obtain:

\begin{align}
b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N \breve{x}_i \breve{y}_i}{\frac{1}{N}\sum_{i=1}^N \breve{x}_i^2}\\
    &= \frac{\frac{1}{N}\sum_{i=1}^N \frac{x_i - \bar{x}}{\sigma_x} \frac{y_i - \bar{y}}{\sigma_y}}{\frac{1}{N}\sum_{i=1}^N \left(\frac{x_i - \bar{x}}{\sigma_x}\right)^2} \\
    &= \frac{Cov(x,y)}{\sigma_x \sigma_y} \\
    &= Corr(x,y)  (\#eq:bline-standardized)
\end{align}

```{block type='tip'}
After we standardize both $y$ and $x$, the slope coefficient $b_1$ in the regression without intercept is equal to the **correlation coefficient**.
```
<br>
And also for this case we have a practical application for you. Just type this and play around with the app for a little while!

```{r,eval=FALSE}
launchApp("reg_standardized")
```
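
Here is the same result as a short sketch on the built-in `cars` data. We use `sd()`, which divides by $N-1$ rather than $N$; this does not matter here, because the same factor appears in the numerator and the denominator of the slope formula and cancels.

```r
# standardize both variables: demean, then divide by the standard deviation
x_s <- (cars$speed - mean(cars$speed)) / sd(cars$speed)
y_s <- (cars$dist  - mean(cars$dist))  / sd(cars$dist)

# regression without intercept on the standardized variables
slope_std <- unname(coef(lm(y_s ~ 0 + x_s)))

c(slope = slope_std, correlation = cor(cars$speed, cars$dist))
```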


## Predictions and Residuals {#pred-resids}

Now we want to ask how our residuals $e_i$ relate to the prediction $\hat{y_i}$. Let us first think about the average of all predictions $\hat{y_i}$, i.e. the number $\frac{1}{N} \sum_{i=1}^N \hat{y_i}$. Let's just take \@ref(eq:abline-pred) and plug this into this average, so that we get

\begin{align}
\frac{1}{N} \sum_{i=1}^N \hat{y_i} &= \frac{1}{N} \sum_{i=1}^N \left(b_0 + b_1 x_i\right) \\
&= b_0 + b_1  \frac{1}{N} \sum_{i=1}^N x_i \\
&= b_0 + b_1  \bar{x}
\end{align}

But rearranging the formula for the OLS intercept \@ref(eq:beta0hat), $b_0 = \bar{y} - b_1 \bar{x}$, shows that this last line is just equal to $\bar{y}$! That means of course that

$$
\frac{1}{N} \sum_{i=1}^N \hat{y_i}  = b_0 + b_1  \bar{x} = \bar{y}
$$
in other words:

```{block type='tip'}
The average of our predictions $\hat{y_i}$ is identically equal to the mean of the outcome $y$. This implies that the average of the residuals is equal to zero.
```
<br>
Related to this result, we can show that the prediction $\hat{y}$ and the residuals are *uncorrelated*, something that is often called **orthogonality** between $\hat{y}_i$ and $e_i$. We would write this as

\begin{align}
Cov(\hat{y},e) &=\frac{1}{N} \sum_{i=1}^N (\hat{y}_i-\bar{y})(e_i-\bar{e}) =   \frac{1}{N} \sum_{i=1}^N (\hat{y}_i-\bar{y})e_i \\
&=  \frac{1}{N} \sum_{i=1}^N \hat{y}_i e_i-\bar{y} \frac{1}{N} \sum_{i=1}^N e_i = 0
\end{align}
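
Why are both sums in the last line equal to zero? A short sketch of the argument: at the values $b_0$ and $b_1$ which solve problem \@ref(eq:ols-min), the derivatives of the SSR with respect to both numbers must be zero. Using $e_i = y_i - b_0 - b_1 x_i$ and $\hat{y}_i = b_0 + b_1 x_i$, we get

$$
\begin{aligned}
\frac{\partial\, \text{SSR}}{\partial b_0} = -2\sum_{i=1}^N e_i = 0 \quad &\Rightarrow \quad \sum_{i=1}^N e_i = 0 \\
\frac{\partial\, \text{SSR}}{\partial b_1} = -2\sum_{i=1}^N x_i e_i = 0 \quad &\Rightarrow \quad \sum_{i=1}^N x_i e_i = 0 \\
\sum_{i=1}^N \hat{y}_i e_i = b_0 \sum_{i=1}^N e_i + b_1 \sum_{i=1}^N x_i e_i &= 0
\end{aligned}
$$
so both $\frac{1}{N} \sum_{i=1}^N \hat{y}_i e_i$ and $\frac{1}{N} \sum_{i=1}^N e_i$ vanish.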

It's useful to bring back the sample data which generate figure \@ref(fig:line-arrows) at this point in order to verify these claims:

```{r,echo=FALSE}
ss = subset(plot_data,select=c(y,y_hat,error))
round(ss,2)
```

Let's check that these claims are true in this sample of data. We want to verify that

1. The average of $\hat{y}_i$ is the same as the mean of $y$.
2. The average of the errors is zero.
3. Prediction and errors are uncorrelated.

```{r}
# 1.
all.equal(mean(ss$y_hat), mean(ss$y))
# 2.
all.equal(mean(ss$error), 0)
# 3.
all.equal(cov(ss$error,ss$y_hat), 0)
```

So indeed we can confirm this result with our test dataset. Great! 

## Correlation, Covariance and Linearity

It is important to keep in mind that Correlation and Covariance relate to a *linear* relationship between `x` and `y`. Given how the regression line is estimated by OLS (see just above), you can see that the regression line inherits this property from the Covariance. 
A famous exercise by Francis Anscombe (1973) illustrates this by constructing 4 different datasets which all have (almost exactly) identical **linear** statistics: mean, variance, correlation and regression line. However, how useful those statistics are for describing the relationship in the data differs a lot across the 4 datasets.

```{r,echo=FALSE}
##-- now some "magic" to do the 4 regressions in a loop:
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
}

op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13),main=paste("dataset",i))
  abline(mods[[i]], col = "blue")
}
par(op)
```
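
To see just how similar the four datasets look through the lens of summary statistics, we can compute them directly. The sketch below uses the built-in `anscombe` data frame, which stores the four datasets in columns `x1` to `x4` and `y1` to `y4`; the object name `stats` is just chosen for this illustration.

```r
# summary statistics for each of the four x/y pairs in `anscombe`
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    var_x     = var(x),
    var_y     = var(y),
    corr      = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
})
colnames(stats) <- paste("dataset", 1:4)
round(stats, 2)
```

Each column of the resulting table describes one dataset. Rounded to two digits the columns are practically indistinguishable, while the four panels of the plot clearly are not.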

The important lesson from this example is the following:

```{block,type="warning"}
Always **visually inspect** your data, and don't rely exclusively on summary statistics like *mean, variance, correlation and regression line*. All of those assume a **linear** relationship between the variables in your data.
```
<br>
The mission of Anscombe has been continued recently. As a result of this we can have a look at the `datasauRus` package, which pursues Anscombe's idea through a multitude of funny data sets, all with the same linear statistics. Don't just compute the covariance, or you might actually end up looking at a Dinosaur! What? Type this to find out:

```{r,eval=FALSE}
launchApp("datasaurus")
aboutApp("datasaurus")
```


### Non-Linear Relationships in Data

Suppose our data now looks like this:

```{r non-line-cars,echo=FALSE}
with(mtcars,plot(hp,mpg,xlab="x",ylab="y"))
```

Fitting our previous *best line*, defined in equation \@ref(eq:abline) as $y = b_0 + b_1 x + e$, to this data, we get something like this:

```{r non-line-cars-ols,echo=FALSE,fig.align='center',fig.cap='Best line with non-linear data?'}
l1 = lm(mpg~hp,data=mtcars)
plot(mtcars$hp,mtcars$mpg,xlab="x",ylab="y")
abline(reg=l1,lw=2)
```

Somehow when looking at figure \@ref(fig:non-line-cars-ols) one is not totally convinced that the straight line is a good summary of this relationship. For values $x\in[50,120]$ the line seems too low, then again too high, and it completely misses the right boundary. It's easy to address this shortcoming by including *higher order terms* of an explanatory variable. We would modify \@ref(eq:abline) to read now

\begin{equation}
y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i (\#eq:abline2)
\end{equation}

This is a special case of *multiple regression*, which we will talk about in chapter \@ref(multiple-reg). You can see that there are *multiple* slope coefficients. For now, let's just see how this performs:

```{r non-line-cars-ols2,echo=FALSE,fig.align="center",fig.cap="Better line with non-linear data!"}
l1 = lm(mpg~hp+I(hp^2),data=mtcars)
newdata=data.frame(hp=seq(from=min(mtcars$hp),to=max(mtcars$hp),length.out=100))
newdata$y = predict(l1,newdata=newdata)
plot(mtcars$hp,mtcars$mpg,xlab="x",ylab="y")
lines(newdata$hp,newdata$y,lw=2)
```
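
The code behind this figure is hidden, so here is a minimal sketch of how such a model with a squared term can be estimated, using the `mtcars` variables `hp` (as $x$) and `mpg` (as $y$) shown in the plots. Inside a formula, the squared term has to be wrapped in `I()` so that `^` is read as arithmetic. The object names `lin` and `quad` are just chosen for this illustration.

```r
lin  <- lm(mpg ~ hp, data = mtcars)             # straight line
quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)   # adds the squared term

coef(quad)   # three coefficients: b_0, b_1 and b_2

# sum of squared residuals of each fit
c(linear = sum(resid(lin)^2), quadratic = sum(resid(quad)^2))
```

The quadratic model can never have a larger sum of squared residuals than the straight line, because setting $b_2 = 0$ gives back the straight line as a special case.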

## Analysing $Var(y)$

Analysis of Variance (ANOVA) refers to a method to decompose variation in one variable as a function of several others. We can use this idea on our outcome $y$. Suppose we wanted to know the variance of $y$, keeping in mind that, by definition, $y_i = \hat{y}_i + e_i$. We would write

\begin{align}
Var(y) &= Var(\hat{y} + e)\\
 &= Var(\hat{y}) + Var(e) + 2 Cov(\hat{y},e)\\
 &= Var(\hat{y}) + Var(e) (\#eq:anova)
\end{align}

We have seen above in section \@ref(pred-resids) that the covariance between prediction $\hat{y}$ and error $e$ is zero, that's why we have $Cov(\hat{y},e)=0$ in \@ref(eq:anova).
What this tells us in words is that we can decompose the variance in the observed outcome $y$ into a part that relates to variance as *explained by the model* and a part that comes from unexplained variation. Finally, we know the definition of *variance*, and can thus write down the respective formulae for each part:

* $Var(y) = \frac{1}{N}\sum_{i=1}^N (y_i - \bar{y})^2$
* $Var(\hat{y}) = \frac{1}{N}\sum_{i=1}^N (\hat{y_i} - \bar{y})^2$, because the mean of $\hat{y}$ is $\bar{y}$ as we know. Finally,
* $Var(e) = \frac{1}{N}\sum_{i=1}^N e_i^2$, because the mean of $e$ is zero.

We can thus formulate how the total variation in outcome $y$ is apportioned between model and unexplained variation:

```{block, type="tip"}
The total variation in outcome $y$ (often called SST, or *total sum of squares*) is equal to the sum of explained squares (SSE) plus the sum of squared residuals (SSR). We have thus **SST = SSE + SSR**.
```
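
We can verify this decomposition on a concrete regression. The following sketch uses the built-in `cars` data; `fitted()` and `resid()` extract $\hat{y}_i$ and $e_i$ from an `lm` object.

```r
fit   <- lm(dist ~ speed, data = cars)
y     <- cars$dist
y_hat <- fitted(fit)   # predictions
e     <- resid(fit)    # residuals

SST <- sum((y - mean(y))^2)       # total sum of squares
SSE <- sum((y_hat - mean(y))^2)   # explained sum of squares
SSR <- sum(e^2)                   # sum of squared residuals

c(SST = SST, SSE_plus_SSR = SSE + SSR)
```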



## Assessing the *Goodness of Fit*

In our setup, there exists a convenient measure for how well a particular statistical model fits the data. It is called $R^2$ (*R squared*), also called the *coefficient of determination*. We make use of the just introduced decomposition of variance, and write the formula as

\begin{equation}
R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1]  (\#eq:Rsquared)
\end{equation}

It is easy to see that a *good fit* is one where the sum of *explained* squares (SSE) is large relative to the total variation (SST). In such a case, we observe an $R^2$ close to one. In the opposite case, we will see an $R^2$ close to zero. Notice that a small $R^2$ does not imply that the model is useless, just that it explains a small fraction of the observed variation.
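
As a small sketch of formula \@ref(eq:Rsquared), we can compute the $R^2$ by hand on the built-in `cars` data and compare it to the value which `R` reports in the `summary` of an `lm` object:

```r
fit <- lm(dist ~ speed, data = cars)

SST <- sum((cars$dist - mean(cars$dist))^2)   # total sum of squares
SSR <- sum(resid(fit)^2)                      # sum of squared residuals

R2_by_hand <- 1 - SSR / SST
c(by_hand = R2_by_hand, from_summary = summary(fit)$r.squared)
```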


## An Example: A Log Wage Equation

Let's consider the following example concerning wage data collected in the 1976 Current Population Survey in the USA.^[This example is close to the vignette of the [wooldridge](https://cloud.r-project.org/web/packages/wooldridge/index.html) package, whose author I hereby thank for the excellent work.] We want to investigate the relationship between average hourly earnings, and years of education. Let's start with a plot:

```{r wooldridge-wages, echo=TRUE,fig.cap='Wages vs Education from the wooldridge dataset wage1.',fig.height=7}
data("wage1", package = "wooldridge")   # load data

# a function that returns a plot
plotfun <- function(wage1,log=FALSE,rug = TRUE){
    y = wage1$wage
    if (log){
        y = log(wage1$wage)
    }
    plot(y = y,
       x = wage1$educ, 
       col = "red", pch = 21, bg = "grey",     
       cex=1.25, xaxt="n", frame = FALSE,      # set default x-axis to none
       main = ifelse(log,"log(Wages) vs. Education, 1976","Wages vs. Education, 1976"),
       xlab = "years of education", 
       ylab = ifelse(log,"Log Hourly wages","Hourly wages"))
    axis(side = 1, at = c(0,6,12,18))         # add custom ticks to x axis
    if (rug) rug(y, side=2, col="red")        # add `rug` to y axis (same scale as the plotted y)
}

par(mfcol = c(2,1))  # set up a plot with 2 panels
# plot 1: standard scatter plot
plotfun(wage1)

# plot 2: add a panel with histogram+density
hist(wage1$wage,prob = TRUE, col = "grey", border = "red", 
     main = "Histogram of wages and Density",xlab = "hourly wage")
lines(density(wage1$wage), col = "black", lw = 2)
```

```{r,echo=FALSE}
par(mfcol = c(1,1)) 
```

Looking at the top panel of figure \@ref(fig:wooldridge-wages), you notice two things: first, from the red ticks on the y axis, you see that wages are very concentrated at around 5 USD per hour, with fewer and fewer observations at higher rates; and second, the hourly wage seems to increase with higher education levels. The bottom panel reinforces the first point, showing that the estimated pdf (probability density function), shown as a black line, has a very long right tail: as the hourly wage gets larger and larger, there are fewer and fewer observations in the data.

```{block,type="warning"}
You have seen this shape of a distribution in the tutorial for chapter 2 already! Do you remember the name of this particular shape of a distribution? (Why not type `ScPoEconometrics::runTutorial('chapter2')` to check?)
```
<br>

Let's run a first regression on this data to generate some intuition:

\begin{equation}
\text{wage}_i = b_0 + b_1 \text{educ}_i + e_i (\#eq:wage)
\end{equation}

We use the `lm` function for this purpose as follows:

```{r}
hourly_wage <- lm(formula = wage ~ educ, data = wage1)
```

and we can add the resulting regression line to our above plot:

```{r wooldridge-wages2, echo=TRUE,fig.cap='Wages vs Education from the wooldridge dataset wage1, with regression'}
plotfun(wage1)
abline(hourly_wage, col = 'black', lw = 2) # add regression line
```

The `hourly_wage` object contains the results of this estimation. We can get a summary of those results with the `summary` method:

```{r}
summary(hourly_wage)
```

The main interpretation of this table can be read off the column labelled *Estimate*, reporting estimated coefficients $b_0,b_1$:

1. With zero years of education, the predicted hourly wage is about -0.9 dollars per hour (row named `(Intercept)`).
1. Each additional year of education increases the predicted hourly wage by about 54 cents (row named `educ`).
1. For example, for 15 years of education, we predict roughly -0.9 + 0.541 * 15 = `r -0.9 + 0.541 * 15` dollars/h.

## Scaling Regressions

```{block type="tip"}
Regression estimates ($b_0, b_1$) are in the scale *of the data*. The actual *value* of the estimates will vary, if we change the scale of the data. The overall fit of the model to the data would *not* change, however, so that the $R^2$ statistic would be constant.
```
<br>

Suppose we wanted to use the above estimates to report the effect of years of education on *annual* wages instead of *hourly* ones. Let's assume we have full-time workers, 7h per day, 5 days per week, 45 weeks per year. Calling this factor $\delta = 7 \times 5 \times 45 = 1575$, we have that $x$ dollars per hour imply $x \times \delta = x \times `r 7*5*45`$ dollars per year. 

What would be the effect of using $\tilde{y} = wage \times `r 7*5*45`$ instead of $y = wage$ as outcome variable on our regression coefficients $b_0$ and $b_1$? Well, let's try!

```{r,results= "asis",echo = FALSE}
delta = 7*5*45
wage1$annual_wage <- wage1$wage * delta
wage_annual <- lm(formula = annual_wage ~ educ, data = wage1)
c1 = coef(hourly_wage)
c2 = coef(wage_annual)
stargazer::stargazer(hourly_wage, wage_annual, type = if (knitr:::is_latex_output()) "latex" else "html",title = "Effect of Scaling on Coefficients")
```

Let's call the coefficients in the column labelled (1) $b_0$ and $b_1$, and the ones in column (2) $b_0^*$ and $b_1^*$. In column (1) we see that another year of education increases hourly wage by `r round(c1[2],2)` dollars, as before. In column (2), the corresponding number is `r round(c2[2],2)`, i.e. another year of education will increase *annual* wages by `r round(c2[2],2)` dollars, on average. Notice, however, that $b_0 \times \delta = `r round(c1[1],2)` \times `r delta` = `r round(c2[1],2)` = b_0^*$ and that $b_1 \times \delta = `r round(c1[2],2)` \times `r delta` = `r round(c2[2],2)` = b_1^*$. That is, we just had to multiply both coefficients by the scaling factor applied to the original outcome $y$ to obtain our new coefficients $b_0^*$ and $b_1^*$! Also, observe that the $R^2$s of both regressions are identical. So, really, we did not have to run the regression in column (2) at all to make this change: multiplying all coefficients through by $\delta$ is enough in this case. We keep exactly the same fit to the data.
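
If you want to convince yourself that this is not a coincidence of our dataset, here is a small sketch on simulated data (the data frame `scale_df` and its generating numbers are hypothetical, only the factor $7 \times 5 \times 45$ comes from the text). Both comparisons should return `TRUE`:

```{r}
set.seed(2)
# hypothetical data: NOT the wage1 dataset
scale_df <- data.frame(educ = sample(8:18, size = 200, replace = TRUE))
scale_df$wage <- 1 + 0.5 * scale_df$educ + rnorm(200)
scale_delta <- 7 * 5 * 45                        # hours per year, as in the text
scale_df$annual <- scale_df$wage * scale_delta   # rescaled outcome

fit_hourly <- lm(wage ~ educ, data = scale_df)
fit_annual <- lm(annual ~ educ, data = scale_df)

# both coefficients are multiplied by the scaling factor
all.equal(coef(fit_annual), coef(fit_hourly) * scale_delta)
# the fit is unchanged
all.equal(summary(fit_annual)$r.squared, summary(fit_hourly)$r.squared)
```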

Rescaling the regressors $x$ is slightly different, but it's easy to work out *how* different, given the linear nature of the covariance operator, which is part of the OLS estimator. Suppose we rescale $x$ by the number $c$. Then, using the OLS formula in \@ref(eq:beta1hat), we see that we get new slope coefficient $b_1^*$ via

\begin{align} 
b_1^* &= \frac{Cov(cx,y)}{Var(cx)} \\ 
      &= \frac{cCov(x,y)}{c^2 Var(x)} \\
      &= \frac{1}{c} b_1.
\end{align}

As for the intercept, using \@ref(eq:beta0hat), we get

\begin{align} 
b_0^* &= \bar{y} - b_1^* \frac{1}{N}\sum_{i=1}^N c \cdot x_i \\ 
      &= \bar{y} - b_1^* \frac{c}{N}\sum_{i=1}^N x_i  \\
      &= \bar{y} - \frac{1}{c} b_1 \cdot c \cdot \bar{x}  \\
      &= \bar{y} - b_1 \bar{x}  \\
      &= b_0.
\end{align}

That is, we change the slope by the *inverse* of the scaling factor applied to regressor $x$, but the intercept is unaffected by this. You should play around for a while with our rescaling app to get a feeling for this:

```{r,eval=FALSE}
library(ScPoApps)
launchApp('Rescale')
```
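
The derivation above can also be checked without the app. The following sketch uses simulated data and measures education in months instead of years, so that $c = 12$ (the data frame `rescale_df` and all numbers in it are assumptions made for illustration):

```{r}
set.seed(3)
# hypothetical data: NOT the wage1 dataset
rescale_df <- data.frame(educ = sample(8:18, size = 200, replace = TRUE))
rescale_df$wage <- 1 + 0.5 * rescale_df$educ + rnorm(200)
rescale_c <- 12                                          # years -> months
rescale_df$educ_months <- rescale_c * rescale_df$educ    # rescaled regressor

fit_years  <- lm(wage ~ educ, data = rescale_df)
fit_months <- lm(wage ~ educ_months, data = rescale_df)

# the slope is divided by c ...
all.equal(unname(coef(fit_months)[2]), unname(coef(fit_years)[2]) / rescale_c)
# ... and the intercept does not move
all.equal(unname(coef(fit_months)[1]), unname(coef(fit_years)[1]))
```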

## A Particular Rescaling: The $\log$ Transform

The natural logarithm is a particularly important transformation that we often encounter in economics. Why would we transform a variable with the $\log$ function to start with?

1. Several important economic variables (like wages, city size, firm size, etc) are approximately *log-normally* distributed. By transforming them with the $\log$, we obtain an approximately *normally* distributed variable, which has desirable properties for our regression.
1. Applying the $\log$ reduces the impact of outliers.
1. The transformation allows for a convenient interpretation in terms of *percentage changes* of the outcome variable.

Let's investigate this issue in our running example by transforming the wage data above. Look back at the bottom panel of figure \@ref(fig:wooldridge-wages): Of course you saw immediately that this looked a lot like a log-normal distribution, so point 1. above applies. We modify the left hand side of equation \@ref(eq:wage):

\begin{equation}
\log(\text{wage}_i) = b_0 + b_1 \text{educ}_i + e_i (\#eq:log-wage)
\end{equation}

Let's use the `update` function to modify our previous regression model:

```{r}
log_hourly_wage = update(hourly_wage, log(wage) ~ ., data = wage1)
```

The `update` function takes an existing `lm` object, like `hourly_wage` here, and updates the `formula`. Here the `.` on the right hand side means *leave unchanged* (so the RHS stays unchanged). How do our pictures change?

```{r logplot,echo = TRUE}
par(mfrow = c(1,2))

plotfun(wage1,rug = FALSE)
abline(hourly_wage, col = 'black', lwd = 2) # add regression line

plotfun(wage1,log = TRUE, rug = FALSE)
abline(log_hourly_wage, col = 'black', lwd = 2) # add regression line

par(mfrow = c(1,1))
```

It *looks as if* the regression line has the same slope, but beware of the different scales of the y-axis! You can clearly see that all y-values have been compressed by the log transformation. The log case behaves differently from our *scaling by a constant number* case above because it is a *nonlinear* function. Let's compare the output between both models:

```{r,echo = FALSE, results = "asis"}
stargazer::stargazer(hourly_wage, log_hourly_wage, title = "Log Transformed Equation",type = if (knitr:::is_latex_output()) "latex" else "html")
```

The interpretation of the transformed model in column (2) is now the following: 

```{block type = "note"}
We call a regression of the form $\log(y) = b_0 + b_1 x + u$ a *log-level* specification, because we regressed the log of a variable on the level (i.e. not the log!) of another variable. Here, the impact of increasing $x$ by one unit is to increase $y$ by approximately $100 \times b_1$ **percent**. In our example: an additional year of education will increase hourly wages by 8.3%. Notice that this is very different from saying *...increases log hourly wages by 8.3%*, which is wrong.
```
<br>
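
The $100 \times b_1$ rule is an approximation that works well for small coefficients. Increasing $x$ by one unit changes $\log(y)$ by $b_1$, so $y$ itself is multiplied by $e^{b_1}$, and the exact change is $100 \times (e^{b_1} - 1)$ percent. A short calculation, taking the 8.3% quoted above as $b_1 = 0.083$:

```{r}
b1_loglevel <- 0.083                            # slope quoted in the text above
approx_pct  <- 100 * b1_loglevel                # rule-of-thumb percentage change
exact_pct   <- 100 * (exp(b1_loglevel) - 1)     # exact percentage change in y
c(approx = approx_pct, exact = exact_pct)
```

The two numbers differ by less than half a percentage point here. The gap grows as $b_1$ gets larger.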

Notice that the $R^2$ slightly improved, so we have a better fit to the data. This is due to the fact that the log compressed large outlier values. Whether we apply the $\log$ to left or right-hand side variables makes a difference, as outlined in this important table:

<center><caption> (\#tab:loglog) Common Regression Specifications </caption></center>
Specification | Outcome Var | Regressor | Interpretation of $b_1$ | Comment
:------------:|:------------:|:------------:|:------------:|:------------:
Level-level | y | x | $\Delta y = b_1 \Delta x$ | Standard
Level-log | y | $\log(x)$ | $\Delta y = \frac{b_1}{100} \% \Delta x$ | less frequent
Log-level | $\log(y)$  | x | $\% \Delta y = (100 b_1) \Delta x$ | Semi-elasticity
Log-Log  | $\log(y)$  | $\log(x)$ | $\% \Delta y = b_1 \% \Delta x$ | Elasticity

You may remember from your introductory micro course what the definition of the *elasticity* of $y$ with respect to $x$ is: This number tells us by how many percent $y$ will change, if we change $x$ by one percent. Let's look at another example from the `wooldridge` package of datasets, this time concerning CEO salaries and their relationship with company sales.

```{r ceo-sal,fig.cap="The effect of log-transforming highly skewed data.",fig.height = 4}
data("ceosal1", package = "wooldridge")  
par(mfrow = c(1,2))
plot(salary ~ sales, data = ceosal1, main = "Sales vs Salaries",xaxt = "n",frame = FALSE)
axis(1, at = c(0,40000, 80000))
rug(ceosal1$salary,side = 2)
rug(ceosal1$sales,side = 1)
plot(log(salary) ~ log(sales), data = ceosal1, main = "Log(Sales) vs Log(Salaries)")
```
```{r,echo = FALSE}
par(mfrow = c(1,1))
```
In the left panel of figure \@ref(fig:ceo-sal) you clearly see that both `sales` and `salary` have very long right tails, as indicated by the rug plots on either axis. As a consequence, the points are clustered in the bottom left corner of the plot. We suspect a positive relationship, but it's hard to see. Contrast this with the right panel, where both axes have been log transformed: the points are nicely spread out, clearly spelling out a positive correlation. Let's see what this gives in a regression model!

```{r,results = 'asis',echo = FALSE,warning=FALSE,message = FALSE}
library(magrittr)
library(dplyr)
ceosal1 %>%
  mutate(logsalary = log(salary), logsales = log(sales)) %>%
  lm(logsalary ~ logsales, data = .) %>%
  equatiomatic::extract_eq(use_coefs = TRUE)
```

Referring back to table \@ref(tab:loglog), here we have a log-log specification. Therefore we interpret this regression as follows: 

```{block type = "tip"}
In a log-log equation, the slope coefficient $b_1$ is the *elasticity of $y$ with respect to changes in $x$*. Here: A 1% increase in sales is associated with a 0.26% increase in CEO salaries. Note, again, that there is no *log* in this statement.
```
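
To see the elasticity interpretation at work, here is a sketch on simulated data where we *know* the true elasticity to be 0.25 (the data frame `elas_df` and every number used to generate it are assumptions for illustration, this is *not* the `ceosal1` data). The regression should recover a slope near 0.25, and a 1% increase in `sales` should raise predicted `salary` by roughly that many percent:

```{r}
set.seed(4)
# hypothetical data with a true elasticity of 0.25: NOT the ceosal1 dataset
elas_df <- data.frame(sales = exp(rnorm(500, mean = 8, sd = 1)))
elas_df$salary <- exp(5 + 0.25 * log(elas_df$sales) + rnorm(500, sd = 0.3))

fit_loglog <- lm(log(salary) ~ log(sales), data = elas_df)
elas_hat <- unname(coef(fit_loglog)[2])
elas_hat                                   # estimated elasticity

# predicted log salary before and after a 1% increase in sales
pred_lo <- predict(fit_loglog, newdata = data.frame(sales = 1000))
pred_hi <- predict(fit_loglog, newdata = data.frame(sales = 1010))
elas_pct <- unname(100 * (exp(pred_hi - pred_lo) - 1))
elas_pct                                   # percentage change in predicted salary
```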


================================================
FILE: 04-MultipleReg.Rmd
================================================
# Multiple Regression {#multiple-reg}


We can extend the discussion from chapter \@ref(linreg) to more than one explanatory variable. For example, suppose that instead of only $x$ we now had $x_1$ and $x_2$ in order to explain $y$. Everything we've learned for the single variable case applies here as well. Instead of a regression *line*, we now get a regression *plane*, i.e. an object representable in 3 dimensions: $(x_1,x_2,y)$.
As an example, suppose we wanted to explain how many *miles per gallon* (`mpg`) a car can travel as a function of its *horse power* (`hp`) and its *weight* (`wt`). In other words we want to estimate the equation

\begin{equation}
mpg_i = b_0 + b_1 hp_i + b_2 wt_i + e_i (\#eq:abline2d)
\end{equation}

on our built-in dataset of cars (`mtcars`):

```{r mtcarsdata}
head(subset(mtcars, select = c(mpg,hp,wt)))
```

How do you think `hp` and `wt` will influence how many miles per gallon of gasoline each of those cars can travel? In other words, what do you expect the signs of $b_1$ and $b_2$ to be? 


With two explanatory variables as here, it is still possible to visualize the regression plane, so let's start with this as an answer. The OLS regression plane through this dataset is shown in figure \@ref(fig:plane3D-reg):

```{r plane3D-reg,echo=FALSE,fig.align='center',fig.cap='Multiple Regression - a plane in 3D. The red points are the observations.',warning=FALSE,message=FALSE}
library(plotly)
library(reshape2)
data(mtcars)
 
# linear fit
fit <- lm(mpg ~ wt+hp,data=mtcars)
 
to_plot_x <- range(mtcars$wt)
to_plot_y <- range(mtcars$hp)

df <- data.frame(wt = rep(to_plot_x, 2),
           hp = rep(to_plot_y, each = 2))
df["pred"] <- predict.lm(fit, df, se.fit = F)

surf <- acast(df, wt ~ hp)

color <- rep(0, length(df))
mtcars %>%
  plot_ly(colors = "grey") %>%
  add_markers(x = ~wt, y = ~hp, z = ~mpg,name = "data",opacity = .8, marker=list(color = 'red', size = 5, hoverinfo="skip")) %>%
  add_surface(x = to_plot_x, y = to_plot_y, z = ~surf, inherit = F, name = "Mtcars 3D", opacity = .75, cauto = F, surfacecolor = color) %>%
  hide_colorbar()
```



This visualization shows a couple of things: the data are shown with red points and the grey plane is the one resulting from OLS estimation of equation \@ref(eq:abline2d). You should realize that this is exactly the same story as told in figure \@ref(fig:line-arrows) - just in three dimensions!

Furthermore, *multiple* regression refers to the fact that there could be *more* than two regressors. In fact, you could in principle have $K$ regressors, and our theory developed so far would still be valid:

\begin{align}
\hat{y}_i &= b_0 + b_1 x_{1i} +   b_2 x_{2i} + \dots + b_K x_{Ki}\\
e_i &= y_i - \hat{y}_i (\#eq:multiple-reg)
\end{align}

Just as before, the least squares method chooses numbers $(b_0,b_1,\dots,b_K)$ so as to minimize SSR, exactly as in the minimization problem for the one regressor case seen in \@ref(eq:ols-min).
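
We can illustrate this minimization property directly on `mtcars`. The sketch below defines a helper `ssr_at` (a name made up for this example) that computes the sum of squared residuals for *any* coefficient vector, and checks that moving a coefficient away from the OLS value makes the SSR larger. The sizes of the nudges are arbitrary:

```{r}
ssr_fit <- lm(mpg ~ hp + wt, data = mtcars)

# sum of squared residuals for an arbitrary coefficient vector (b0, b1, b2)
ssr_at <- function(b) {
  y_hat <- b[1] + b[2] * mtcars$hp + b[3] * mtcars$wt
  sum((mtcars$mpg - y_hat)^2)
}

ssr_ols <- ssr_at(coef(ssr_fit))
all.equal(ssr_ols, sum(resid(ssr_fit)^2))   # same SSR as implied by lm's residuals

# nudging any single coefficient away from its OLS value increases the SSR
ssr_nudged <- c(
  ssr_at(coef(ssr_fit) + c(0.1, 0, 0)),
  ssr_at(coef(ssr_fit) + c(0, 0.01, 0)),
  ssr_at(coef(ssr_fit) - c(0, 0, 0.1))
)
all(ssr_nudged > ssr_ols)
```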

## All Else Equal {#ceteris}

We can see from the above plot that cars with more horse power and greater weight in general travel fewer miles per gallon of fuel. Hence, we observe a plane that is downward sloping in both the *weight* and *horse power* directions. Suppose now we wanted to know the impact of `hp` on `mpg` *in isolation*, as if we could ask 

```{block,type="tip"}
<center>
Keeping the value of $wt$ fixed for a certain car, what would the impact on $mpg$ be if we were to increase **only** its $hp$? Put differently, keeping **all else equal**, what's the impact of changing $hp$ on $mpg$?
</center>
```
<br>
We ask this kind of question all the time in econometrics. In figure \@ref(fig:plane3D-reg) you clearly see that both explanatory variables have a negative impact on the outcome of interest: as one increases either the horse power or the weight of a car, one finds that miles per gallon decreases. What is kind of hard to read off is *how negative* an impact each variable has in isolation. 

As a matter of fact, the kind of question asked here is so common that it has got its own name: we'd say "*ceteris paribus*, what is the impact of `hp` on `mpg`?". *Ceteris paribus* is Latin and means *the others equal*, i.e. all other variables fixed. In terms of our model in \@ref(eq:abline2d), we want to know the following quantity:

\begin{equation}
\frac{\partial mpg_i}{\partial hp_i} = b_1 (\#eq:abline2d-deriv)
\end{equation}

The $\partial$ sign denotes a *partial derivative* of the function describing `mpg` with respect to the variable `hp`. It measures *how the value of `mpg` changes, as we change the value of `hp` ever so slightly*. In our context, this means: *keeping all other variables fixed, what is the effect of `hp` on `mpg`?*. We call the value of coefficient $b_1$ therefore also the *partial effect* of `hp` on `mpg`. In terms of our dataset, we use `R` to run the following **multiple regression**:
<br>

```{r,echo=FALSE}
summary(fit)
```

From this table you see that the coefficient on `wt` has value `r round(coef(fit)[2],5)`. You can interpret this as follows:

```{block,type="warning"}
Holding all other variables fixed at their observed values - or *ceteris paribus* - a one unit increase in $wt$ implies a -3.87783 units change in $mpg$. In other words, increasing the weight of a car by 1000 pounds (lbs) will lead to 3.88 fewer miles travelled per gallon. Similarly, one additional unit of horse power means that we will travel 0.03177 fewer miles per gallon of gasoline, *all else (i.e. $wt$) equal*.
```
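
One way to make the *all else equal* idea concrete is to predict `mpg` for two hypothetical cars that have the same weight and differ by exactly one unit of horse power. The values `wt = 3` and `hp = 110` below are arbitrary choices for illustration. The difference in predictions is the coefficient on `hp`:

```{r}
cp_fit <- lm(mpg ~ wt + hp, data = mtcars)

# two hypothetical cars: same weight, horse power differs by one unit
cp_cars <- data.frame(wt = c(3, 3), hp = c(110, 111))
cp_pred <- predict(cp_fit, newdata = cp_cars)

cp_diff <- unname(diff(cp_pred))      # change in predicted mpg
cp_b_hp <- unname(coef(cp_fit)["hp"]) # coefficient on hp
c(difference = cp_diff, coefficient = cp_b_hp)
```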


## Multicolinearity {#multicol}

One important requirement for multiple regression is that the data be **not linearly dependent**: Each variable should provide at least some new information for the outcome, and it cannot be replicated as a linear combination of other variables. Suppose that in the example above, we had a variable `wtplus` defined as `wt + 1`, and we included this new variable together with `wt` in our regression. In this case, `wtplus` provides no new information. It's enough to know $wt$, and add $1$ to it. In this sense, `wtplus` is a redundant variable and should not be included in the model. Notice that this holds only for *linearly* dependent variables - *nonlinear* transformations (like for example $wt^2$) are exempt from this rule. Here is why:

\begin{align}
y &= b_0 + b_1 \text{wt} + b_2 \text{wtplus} + e \\
  &= b_0 + b_1 \text{wt} + b_2 (\text{wt} + 1) + e \\
  &= (b_0 + b_2) + \text{wt} (b_1 + b_2) + e
\end{align}

This shows that we cannot *identify* the regression coefficients in case of linearly dependent data. Variation in the variable `wt` identifies a different coefficient, say $\gamma = b_1 + b_2$, from what we actually wanted: separate estimates for $b_1,b_2$.

```{block, type="note"}
We cannot have variables which are *linearly dependent*, or *perfectly colinear*. This is known as the **rank condition**. In particular, the condition dictates that we need $N \geq K+1$, i.e. at least as many observations as coefficients. The greater the degree of linear dependence amongst our explanatory variables, the less information we can extract from them, and the *less precise* our estimates become.
```
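
What does `R` do if we ignore this advice? The sketch below builds the redundant variable `wtplus` on a copy of `mtcars` (the copy is called `mc_cars`, a name chosen for this example). `lm` does not stop with an error. It drops the redundant regressor and reports `NA` for its coefficient, whereas the nonlinear transformation $wt^2$ is estimated without problems:

```{r}
mc_cars <- mtcars                     # work on a copy of the built-in data
mc_cars$wtplus <- mc_cars$wt + 1      # linear combination of wt and the intercept

mc_fit <- lm(mpg ~ wt + wtplus, data = mc_cars)
coef(mc_fit)                          # the coefficient on wtplus is NA

# a nonlinear transformation is not linearly dependent, so it gets a coefficient
mc_fit_sq <- lm(mpg ~ wt + I(wt^2), data = mc_cars)
coef(mc_fit_sq)
```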



## Log Wage Equation

Let's go back to our previous example of the relationship between log wages and education. How does this relationship change if we also think that experience in the labor market has an impact, next to years of education? Here is a picture:

```{r plane3D-lwage,echo=FALSE,fig.align='center',fig.cap='Log wages vs education and experience in 3D.',warning=FALSE,message=FALSE}
data("wage1", package = "wooldridge")
# linear fit
log_wage <- lm(lwage ~ educ + exper,data=wage1)
 
to_plot_x <- range(wage1$educ)
to_plot_y <- range(wage1$exper)

df <- data.frame(educ = rep(to_plot_x, 2),
           exper = rep(to_plot_y, each = 2))
df["pred"] <- predict.lm(log_wage, df, se.fit = F)

surf <- acast(df, educ ~ exper)

color <- rep(0, length(df))
wage1 %>%
  plot_ly(colors = "grey") %>%
  add_markers(x = ~educ, y = ~exper, z = ~lwage,name = "data",opacity = .8, marker=list(color = 'red', size = 5, hoverinfo="skip", opacity = 0.8)) %>%
  add_surface(x = to_plot_x, y = to_plot_y, z = ~surf, inherit = F, name = "wages 3D", opacity = .75, cauto = F, surfacecolor = color) %>%
  hide_colorbar()
```

Let's add even more variables! For instance, what's the impact of experience in the labor market, and time spent with the current employer? Let's first look at how those variables co-vary with each other:

```{r corrplot, fig.cap = "correlation plot"}
cmat = round(cor(subset(wage1,select = c(lwage,educ,exper,tenure))),2) # correlation matrix
corrplot::corrplot(cmat,type = "upper",method = "ellipse")
```

The way to read the so-called *correlation plot* in figure \@ref(fig:corrplot) is straightforward: each row illustrates the correlation of a certain variable with the other variables. In this example both the shape of the ellipse in each cell as well as their color coding tell us how strongly two variables correlate. Let us put this into a regression model now:

```{r,results = 'asis'}
educ_only <- lm(lwage ~ educ                 , data = wage1)
educ_exper <- lm(lwage ~ educ + exper        , data = wage1)
log_wages <- lm(lwage ~ educ + exper + tenure, data = wage1)
stargazer::stargazer(educ_only, educ_exper, log_wages,type = if (knitr:::is_latex_output()) "latex" else "html")
```

Column (1) refers to model \@ref(eq:log-wage) from the previous chapter, where we only had `educ` as a regressor: we obtain an $R^2$ of 0.186. Column (2) is the model that generated the plane in figure \@ref(fig:plane3D-lwage) above. Column (3) is the model with three regressors. You can see that by adding more regressors, the quality of our fit increases, as more of the variation in $y$ is now accounted for by our model. You can also see that the values of our estimated coefficients keep changing as we move from left to right across the columns. Given the correlation structure shown in figure \@ref(fig:corrplot), it is only natural that this is happening: We see that `educ` and `exper` are negatively correlated, for example. So, if we *omit* `exper` from the model in column (1), `educ` will reflect part of this correlation with `exper` by a lower estimated value. By directly controlling for `exper` in column (2) we get an estimate of the effect of `educ` *net of* whatever effect `exper` has in isolation on the outcome variable. We will come back to this point later on.

## How To Make Predictions {#make-preds}

So suppose we have a model like 

$$\text{lwage} = b_0 + b_{1}(\text{educ}) + b_{2}(\text{exper}) + b_{3}(\text{tenure}) + \epsilon$$
How could we use this to make a *prediction* of log wages, given some new data? Remember that the OLS procedure gives us *estimates* for the values $b_0,b_1, b_2,b_3$. With those in hand, it is straightforward to make a prediction about the *conditional mean* of the outcome - just plug in the desired numbers for `educ,exper` and `tenure`. Suppose you want to know what the mean of `lwage` is conditional on `educ = 10,exper=4` and `tenure = 2`. You'd do

\begin{align}
E[\text{lwage}|\text{educ}=10,\text{exper}=4,\text{tenure}=2] &= b_0 + b_1 \cdot 10 + b_2 \cdot 4 + b_3 \cdot 2\\
&= `r round(coef(log_wages) %*% c(1,10,4,2),2)`.
\end{align}

I computed the last line directly with

```{r,eval=FALSE}
x = c(1,10,4,2)  # 1 for intercept
pred = coef(log_wages) %*% x
```

but `R` has a more complete prediction interface, using the function `predict`. For starters, you can predict the model on all data points which were contained in the dataset we used for estimation, i.e. `wage1` in our case:

```{r}
head(predict(log_wages))  # first 6 observations of wage1 as predicted by our model
```
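
To predict at values that are *not* in the estimation sample, `predict` takes a `newdata` argument: a data frame with one column per regressor, named exactly as in the model formula. Here is a self-contained sketch on the built-in `mtcars` data (the car with `wt = 3` and `hp = 110` is a made-up example). With our `log_wages` model the call has the same shape, with columns `educ`, `exper` and `tenure` in `newdata`.

```{r}
nd_fit <- lm(mpg ~ wt + hp, data = mtcars)

# hypothetical new car: 3000 lbs (wt is measured in 1000 lbs) and 110 horse power
nd_car  <- data.frame(wt = 3, hp = 110)
nd_pred <- predict(nd_fit, newdata = nd_car)

# the same number by hand: 1 for the intercept, then wt and hp
nd_by_hand <- coef(nd_fit) %*% c(1, 3, 110)
all.equal(unname(nd_pred), as.numeric(nd_by_hand))
```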

Often you want to add that prediction *to* the original dataset:

```{r}
wage_prediction = cbind(wage1, prediction = predict(log_wages))
head(wage_prediction[, c("lwage","educ","exper","tenure","prediction")])
```

You'll remember that we called the distance between prediction and observed outcome our *residual* $e$. Well here this is just `lwage - prediction`. Indeed, $e$ is such an important quantity that `R` has a convenient method to compute $y - \hat{y}$ from an `lm` object directly - the method `resid`. Let's add another column to `wage_prediction`: 

```{r}
wage_prediction = cbind(wage_prediction, residual = resid(log_wages))
head(wage_prediction[, c("lwage","educ","exper","tenure","prediction","residual")])
```

Using the data in `wage_prediction`, you should now check for yourself what we already know about $\hat{y}$ and $e$ from section \@ref(pred-resids): 

1. What is the average of the vector `residual`?
1. What is the average of `prediction`?
1. How does this compare to the average of the outcome `lwage`?
1. What is the correlation between `prediction` and `residual`?
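
If you are unsure how to go about these checks, here is how they could look for a different model, estimated on the built-in `mtcars` data (the object names starting with `pr_` are chosen for this example only). Try the same steps on `wage_prediction` afterwards:

```{r}
pr_fit <- lm(mpg ~ wt + hp, data = mtcars)
pr_hat <- predict(pr_fit)           # in-sample predictions
pr_res <- resid(pr_fit)             # residuals

mean(pr_res)                        # average residual
mean(pr_hat) - mean(mtcars$mpg)     # average prediction minus average outcome
cor(pr_hat, pr_res)                 # correlation of prediction and residual
```

All three numbers are zero, up to the rounding error of your computer.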



================================================
FILE: 05-Categorial-Vars.Rmd
================================================
# Categorical Variables {#categorical-vars} 


Up until now, we have encountered only examples with *continuous* variables $x$ and $y$, that is, $x,y \in \mathbb{R}$, so that a typical observation could have been $(y_i,x_i) = (1.5,5.62)$. There are many situations where it makes sense to think about the data in terms of *categories*, rather than continuous numbers. For example, whether an observation $i$ is *male* or *female*, whether a pixel on a screen is *black* or *white*, and whether a good was produced in *France*, *Germany*, *Italy*, *China* or *Spain* are all categorical classifications of data. 

Probably the simplest type of categorical variable is the *binary*, *boolean*, or just *dummy* variable. As the name suggests, it can take on only two values, `0` and `1`, or `TRUE` and `FALSE`. 

## The Binary Regressor Case

Even though this is an extremely parsimonious way of encoding data, it is a very powerful tool that allows us to represent that a certain observation $i$ **is a member** of a certain category $j$. For example, let's imagine we have income data on males and females, and we create a variable called `is.male` that is `TRUE` whenever $i$ is male and `FALSE` otherwise, and similarly for females. To encode whether subject $i$ is male, one could do this:

\begin{align*}
\text{is.male}_i &=  \begin{cases}
                    1 & \text{if }i\text{ is male} \\
                    0 & \text{if }i\text{ is not male},
                 \end{cases}
\end{align*}

and similarly for females, we'd have

\begin{align*}
\text{is.female}_i &=  \begin{cases}
                    1 & \text{if }i\text{ is female} \\
                    0 & \text{if }i\text{ is not female}.
                 \end{cases}
\end{align*}

By definition, we have just introduced a linear dependence into our dataset. It will always be true that $\text{is.male}_i + \text{is.female}_i = 1$. This is because dummy variables are based on data being mutually exclusively categorized - here, you are either male or female.^[There are [transgender](https://en.wikipedia.org/wiki/Transgender) individuals where this example will not apply.] This should immediately remind you of section \@ref(multicol) where we introduced *multicolinearity*. A regression of income on both of our variables like this

$$
y_i = b_0 + b_1 \text{is.female}_i + b_2 \text{is.male}_i + e_i
$$
would be invalid because of perfect colinearity between $\text{is.female}_i$ and $\text{is.male}_i$. The solution to this is pragmatic and simple: 

```{block, type="tip"}
In dummy variable regressions, we remove one category from the regression (for example here: `is.male`) and call it the *reference category*. The effect of being *male* is absorbed in the intercept. The coefficient on the remaining categories measures the *difference* in mean outcome with respect to the reference category.
```
<br>

Now let's try this out. We start by creating the female indicator as above,

$$
\text{is.female}_i = \begin{cases}
          1 & \text{if }i\text{ is female} \\
            0 & \text{if }i\text{ is not female}. \\
   \end{cases}
$$
and let's suppose that $y_i$ is a measure of $i$'s annual labor income. Our model is

\begin{equation}
y_i = b_0 + b_1 \text{is.female}_i + e_i (\#eq:dummy-reg)
\end{equation}

and here is how we estimate this in `R`:

```{r, echo=FALSE}
set.seed(19)
n = 50
b0 = 2
b1 = -3
x = sample(x = c(0, 1), size = n, replace = T)
y = b0 + b1 * x + rnorm(n)
dta = data.frame(x,y)
zero_one = lm(y~x,dta)
```

```{r, dummy-reg}
# x = sample(x = c(0, 1), size = n, replace = T)
dta$is.female = factor(x)  # convert x to factor
dummy_reg = lm(y~is.female,dta)
summary(dummy_reg)
```

Notice that `R` displays the *level* of the factor to which coefficient $b_1$ belongs here, i.e. `is.female1` means this coefficient is on level `is.female = 1` - the reference level is `is.female = 0`, and it has no separate coefficient. Also interesting is that $b_1$ is equal to the difference in conditional means between females and males

$$b_1 = E[y|\text{is.female}=1] - E[y|\text{is.female}=0]=`r round(mean(dta[dta$x == 1, "y"]) - mean(dta[dta$x == 0, "y"]),4)`.$$ 

```{block,type="note"}
A dummy variable measures the difference or the *offset* in the mean of the response variable, $E[y]$, **conditional** on $x$ belonging to some category - relative to a baseline category. In our artificial example, the coefficient $b_1$ informs us that women earn on average 3.756 units less than men.
```
<br>
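
You can verify the link between the regression coefficients and the group means yourself. The sketch below generates a fresh artificial dataset in the same spirit as the one above (the data frame `dm_df`, the seed and the sample size are assumptions made for this example) and compares the `lm` output to the means computed with `tapply`:

```{r}
set.seed(5)
# hypothetical data, generated in the same way as the artificial example above
dm_df <- data.frame(is.female = sample(c(0, 1), size = 100, replace = TRUE))
dm_df$income <- 2 - 3 * dm_df$is.female + rnorm(100)

dm_fit   <- lm(income ~ factor(is.female), data = dm_df)
dm_means <- tapply(dm_df$income, dm_df$is.female, mean)  # group means, named "0" and "1"

# intercept = mean of the reference group (is.female = 0)
all.equal(unname(coef(dm_fit)[1]), unname(dm_means["0"]))
# coefficient on the dummy = difference in group means
all.equal(unname(coef(dm_fit)[2]), unname(dm_means["1"] - dm_means["0"]))
```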

It is instructive to reconsider this example graphically:

```{r x-zero-one,fig.align='center',fig.cap='Regressing $y \\in \\mathbb{R}$ on $\\text{is.female}_i \\in \\{0,1\\}$. The blue line is $E[y]$ and the red arrow is the size of $b_1$, which in this case equals both the slope of the regression line and the difference in conditional means.',echo=FALSE}

a <- coef(zero_one)[1]
b <- coef(zero_one)[2]

# plot
expr <- function(x) a + b*x
errors <- (a + b*x) - y

plot(x, y, type = "p", pch = 21, col = "blue", bg = "royalblue", asp=.25,
   xlim = c(-.1, 1.1),
   ylim = c(min(y)-.1, max(y)+.1),
   frame.plot = T,
   cex = 1.2)

points(0, mean(dta[dta$x == 0, "y"]), col = 'orange',
       cex = 3, pch = 15)
text(0.05, mean(dta[dta$x == 0, "y"]), "E[Y | is.female = 0]", pos = 4)

points(1, mean(dta[dta$x == 1, "y"]), col = 'orange',
       cex = 3, pch = 15)
text(1.05, mean(dta[dta$x == 1, "y"]), "E[Y | is.female = 1]", pos = 4)
curve(expr = expr, from = min(x)-10, to = max(x)+10, add = TRUE, col = "black")
segments(x0 = x, y0 = y, x1 = x, y1 = (y + errors), col = "green")
arrows(x0 =-1, y0 = mean(dta[dta$x == 0, "y"]), x1 = -1, y1 = mean(dta[dta$x == 1, "y"]),col="red",lw=3,code=3,length=0.1)
# dashes
segments(x0=-1,y0 = mean(dta[dta$x == 0, "y"]),x1=0,y1 = mean(dta[dta$x == 0, "y"]),col="red",lty="dashed")
segments(x0=-1,y0 = mean(dta[dta$x == 1, "y"]),x1=1,y1 = mean(dta[dta$x == 1, "y"]),col="red",lty="dashed")

text(-1, mean(y)+1, paste("b1=",round(b,2)), pos = 4,col="red")
abline(a=mean(dta$y),b=0,col="blue",lwd=2)
```

In figure \@ref(fig:x-zero-one) we see that this regression simplifies to the straight line connecting the mean, or the *expected value* of $y$ when $\text{is.female}_i = 0$, i.e. $E[y|\text{is.female}_i=0]$, to the mean when $\text{is.female}_i=1$, i.e.  $E[y|\text{is.female}_i=1]$. It is useful to remember that the *unconditional mean* of $y$, i.e. $E[y]$, is going to be the result of regressing $y$ only on an intercept, illustrated by the blue line. This line will always lie in between both conditional means. As indicated by the red arrow, the estimate of the coefficient on the dummy, $b_1$, is equal to the difference in conditional means for both groups. You should look at our app now to deepen your understanding of what's going on here:

```{r,eval=FALSE}
library(ScPoApps)
launchApp("reg_dummy")
```


## Dummy and Continuous Variables

What happens if there are more predictors than just the dummy variable in a regression? For example, what if instead we had

\begin{equation}
y_i = b_0 + b_1 \text{is.female}_i + b_2 \text{exper}_i + e_i (\#eq:dummy-reg2)
\end{equation}

where $\text{exper}_i$ would measure years of experience in the labor market? As above, the dummy variable acts as an intercept shifter. We have

\begin{equation}
y_i =  \begin{cases}
b_0 + b_1 + b_2 \times \text{exper}_i + e_i & \text{if is.female=1} \\
b_0  + \hphantom{b_1} +b_2 \times \text{exper}_i + e_i & \text{if is.female=0}
\end{cases}
\end{equation}

so that the intercept is $b_0 + b_1$ for women but $b_0$ for men. We will see this in the real-world example below, but for now let's see the effect of switching the dummy *on* and *off* in this app:

```{r,eval=FALSE}
library(ScPoApps)
launchApp("reg_dummy_example")
```
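
The intercept shift can also be seen in numbers. In this sketch on simulated data (the data frame `sh_df` and all coefficients used to generate it are made up), we predict income for a man and a woman at two different levels of experience. The gap between them is $b_1$ in both cases, which is what *parallel lines* means:

```{r}
set.seed(6)
# hypothetical data: income depends on a female dummy and on experience
sh_df <- data.frame(
  is.female = sample(c(0, 1), size = 200, replace = TRUE),
  exper     = runif(200, min = 0, max = 30)
)
sh_df$income <- 10 - 2 * sh_df$is.female + 0.5 * sh_df$exper + rnorm(200)
sh_fit <- lm(income ~ is.female + exper, data = sh_df)

# a man and a woman at 5 years, then a man and a woman at 20 years of experience
sh_new  <- data.frame(is.female = c(0, 1, 0, 1), exper = c(5, 5, 20, 20))
sh_pred <- predict(sh_fit, newdata = sh_new)

sh_gap_5  <- unname(sh_pred[2] - sh_pred[1])   # female-male gap at 5 years
sh_gap_20 <- unname(sh_pred[4] - sh_pred[3])   # female-male gap at 20 years
c(gap_5 = sh_gap_5, gap_20 = sh_gap_20, b1 = unname(coef(sh_fit)["is.female"]))
```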




## Categorical Variables in `R`: `factor` 

`R` has extensive support for categorical variables built-in. The relevant data type representing a categorical variable is called `factor`. We encountered them as basic data types in section \@ref(data-types) already, but it is worth repeating this here. We have seen that a factor *categorizes* a usually small number of numeric values by *labels*, as in this example which is similar to what I used to create regressor `is.female` for the above regression:

```{r factors}
is.female = factor(x = c(0,1,1,0), labels = c(FALSE,TRUE))
is.female
```

You can see the result is a vector object of type `factor` with 4 entries, whereby `0` is represented as `FALSE` and `1` as `TRUE`. Another example could be if we wanted to record a variable *sex* instead, and we could do 

```{r}
sex = factor(x = c(0,1,1,0), labels = c("male","female"))
sex
```

You can see that this is almost identical, just the *labels* are different.


### More Levels

We can go beyond *binary* categorical variables such as `TRUE` vs `FALSE`. For example, suppose that $x$ measures educational attainment, i.e. it is now something like $x_i \in \{\text{high school,some college,BA,MSc}\}$. In `R` parlance, *high school, some college, BA, MSc* are the **levels of factor $x$**. A straightforward extension of the above would dictate to create one dummy variable for each category (or level), like 

\begin{align*}
\text{has.HS}_i &= \mathbf{1}[x_i==\text{high school}] \\
\text{has.someCol}_i &= \mathbf{1}[x_i==\text{some college}] \\
\text{has.BA}_i &= \mathbf{1}[x_i==\text{BA}] \\
\text{has.MSc}_i &= \mathbf{1}[x_i==\text{MSc}] 
\end{align*}

but you can see that this is cumbersome. There is a better solution for us available:

```{r}
factor(x = c(1,1,2,4,3,4),labels = c("HS","someCol","BA","MSc"))
```

Notice here that `R` assigns the labels to the sorted unique values of `x`, in the order in which you supplied them (i.e. the numerical value `4` will correspond to "MSc", no matter where it appears in `x`).

### Log Wages and Dummies {#factors}

The above developed `factor` terminology fits neatly into `R`'s linear model fitting framework. Let us illustrate the simplest use by way of example.

Going back to our wage example, let's say that a worker's wage depends on their education as well as their sex:

\begin{equation}
\ln w_i = b_0 + b_1 educ_i + b_2 female_i + e_i (\#eq:wage-sex)
\end{equation}


```{r,results = "asis"}
data("wage1", package = "wooldridge")
wage1$female = as.factor(wage1$female)  # convert 0-1 to factor
lm_w = lm(lwage ~ educ, data = wage1)
lm_w_sex = lm(lwage ~ educ + female, data = wage1)
stargazer::stargazer(lm_w,lm_w_sex,type = if (knitr:::is_latex_output()) "latex" else "html")
```

We know the results from column (1) very well by now. How does the relationship change if we include the `female` indicator? Remember from above that `female` is a `factor` with two levels, *0* and *1*, where *1* means *that's a female*. We see in the above output that `R` included a regressor called `female1`. This is a combination of the variable name `female` and the level which was included in the regression. In other words, `R` chooses a *reference category* (by default the first level of the factor, which is the first value in sorted order unless you specify the levels yourself), which is excluded - here this is `female==0`. The interpretation is that $b_2$ measures the effect of being female *relative* to being male. `R` automatically creates a dummy variable for each level except the first one.
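
If you want to see which dummy variables `R` builds from a factor, you can look at the output of `model.matrix`. The function `relevel` lets you pick a different reference category. A small sketch with a made-up education variable `ed_x`:

```{r}
# hypothetical education variable with four levels
ed_x <- factor(c(1, 1, 2, 4, 3, 4), labels = c("HS", "someCol", "BA", "MSc"))

levels(ed_x)                          # "HS" comes first, so it is the reference category
ed_mm <- model.matrix(~ ed_x)         # intercept plus one dummy per remaining level
ed_mm

ed_x2 <- relevel(ed_x, ref = "BA")    # make "BA" the reference category instead
colnames(model.matrix(~ ed_x2))       # now there is no dummy for "BA"
```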

```{r wage-plot,fig.align='center',echo=FALSE,fig.cap='log wage vs educ. Right panel with female dummy.',message=FALSE,warning=FALSE,fig.height=3}
library(ggplot2)
p1 = ggplot(mapping = aes(y=lwage,x=educ), data=wage1) + geom_point(shape=1,alpha=0.6) + geom_smooth(method="lm",col="blue",se=FALSE) + theme_bw()

p_sex = cbind(wage1,pred=predict(lm_w_sex))
# p_sex = dplyr::sample_n(p_sex,2500)
p2 <- ggplot(data=p_sex,mapping=aes(x=educ,y=lwage,color=female)) 
p2 <- p2 + geom_jitter(shape=1,alpha=0.6,width=0.1) + geom_line(mapping = aes(y=pred), size=1) + theme_bw() + scale_y_continuous(name = NULL)
cowplot::plot_grid(p1,p2, rel_widths = c(1,1.2))
```

Figure \@ref(fig:wage-plot) illustrates this. The left panel is our previous model. The right panel adds the `female` dummy. You can see that the regression lines for males and females share the same upward slope, but the female line is shifted down in parallel relative to the male line. The estimate of $b_2 = `r round(coef(lm_w_sex)[3],2)`$ is the size of the downward shift. 




## Interactions

Sometimes it is useful to let the slope of a certain variable depend on the value of *another* regressor. For example, consider a model for the sales prices of houses, where `area` is the livable surface of the property, and `age` is its age:

\begin{equation}
\log(price) = b_0 + b_1 \text{area} + b_2 \text{age} + b_3 (\text{area} \times \text{age}) + e  (\#eq:price-interact)
\end{equation}

In that model, the partial effect of `area` on `log(price)`, keeping all other variables fixed, is

\begin{equation}
\frac{\partial \log(price)}{\partial \text{area}} = b_1 + b_3 (\text{age}) 
\end{equation}

If we find that $b_3 > 0$ in a regression, we conclude that an additional unit of area is valued more in older houses. We call $b_3$ the **interaction effect** between area and age. Let's look at that regression model now.


```{r}
data(hprice3, package = "wooldridge")
summary(lm(lprice ~ area*age, data = hprice3))
```

In this instance, we see that indeed there is a small positive interaction between `area` and `age` on the sales price: even though `age` in isolation decreases the sales value, bigger houses command a small premium if they are older.
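
To make the formula for the partial effect concrete, here is a small sketch on simulated data. All variable names (`area_h`, `age_h`, `lprice_h`) and the coefficients used to generate the data are made up for illustration; they are not estimates from `hprice3`.

```{r}
set.seed(1)
n_h    <- 500
area_h <- runif(n_h, 1000, 3000)   # made-up living area
age_h  <- runif(n_h, 0, 50)        # made-up age in years
# made-up "true" coefficients, including a positive interaction
lprice_h <- 10 + 4e-4 * area_h - 0.02 * age_h + 4e-6 * area_h * age_h +
  rnorm(n_h, sd = 0.1)

fit_h <- lm(lprice_h ~ area_h * age_h)
b_h   <- coef(fit_h)

# partial effect of area: b1 + b3 * age, evaluated at three ages
ages    <- c(0, 20, 40)
pe_area <- b_h["area_h"] + b_h["area_h:age_h"] * ages
setNames(pe_area, paste0("age=", ages))
```

With a positive interaction coefficient, the implied effect of `area_h` on the log price grows with `age_h`.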

### Interactions with Dummies: Differential Slopes

It is straightforward to extend the interactions logic to allow not only for different *intercepts*, but also different *slopes* for each subgroup in a dataset. Let's go back to our dataset of wages from section \@ref(factors) above. Now that we know how to create an interaction between two variables, we can easily modify equation \@ref(eq:wage-sex) like this:

\begin{equation}
\ln w = b_0 + b_1 \text{female} + b_2 \text{educ} + b_3 (\text{female} \times \text{educ}) + e (\#eq:wage-sex2)
\end{equation}

The only peculiarity here is that `female` is a factor with levels `0` and `1`, i.e. the interaction term $\text{female} \times \text{educ}$ will be zero for all men. As above, we can test whether there are indeed different returns to education for men and women by looking at the estimated value of $b_3$:

```{r,echo = TRUE}
lm_w_interact <- lm(lwage ~ educ * female , data = wage1)  # R expands to full interactions model
summary(lm_w_interact)
```

We will learn in the next chapter that the estimate for $b_3$ on the interaction `educ:female1` is difficult to distinguish from zero in a statistical sense. Hence, for now, we conclude that there are *no* significantly different returns to education for men and women in this data. This is easy to verify visually in this plot, where we are unable to detect a difference in slopes in the right panel.

```{r wage-plot2,fig.align='center',echo=FALSE,fig.cap='log wage vs educ. Right panel allows slopes to be different - turns out they are not!',message=FALSE,warning=FALSE,fig.height=3}

p_sex = cbind(wage1,pred=predict(lm_w_sex))
p_sex = cbind(p_sex,pred_inter=predict(lm_w_interact))
# p_sex = dplyr::sample_n(p_sex,2500)
p2 <- ggplot(data=p_sex,mapping=aes(x=educ,y=lwage,color=female)) 
p2 <- p2 + geom_jitter(shape=1,alpha=0.6,width=0.1) + geom_line(mapping = aes(y=pred), size=1) + theme_bw() + scale_y_continuous(name = "log wage") + ggtitle("Impose Parallel Slopes") + theme(legend.position = "none")

p3 <- ggplot(data=p_sex,mapping=aes(x=educ,y=lwage,color=female)) 
p3 <- p3 + geom_jitter(shape=1,alpha=0.6,width=0.1) + geom_line(mapping = aes(y=pred_inter), size=1) + theme_bw() + scale_y_continuous(name = NULL)  + ggtitle("Allow Different Slopes") + theme(legend.position = "none")

cowplot::plot_grid(p2,p3)
```
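
One way to convince yourself of what the full interaction model does: it fits a separate intercept and slope for each group, just as if we had run one regression per group. Here is a sketch on simulated data; all names and numbers are assumptions, chosen such that the true slopes are identical across groups.

```{r}
set.seed(2)
n_s   <- 400
fem_s <- factor(rep(c(0, 1), each = n_s / 2))   # made-up group indicator
edu_s <- runif(n_s, 8, 18)                      # made-up years of schooling
lw_s  <- 0.5 + 0.08 * edu_s - 0.3 * (fem_s == 1) + rnorm(n_s, sd = 0.4)
sim_s <- data.frame(lw_s, edu_s, fem_s)

# one regression with a full interaction ...
fit_int <- lm(lw_s ~ edu_s * fem_s, data = sim_s)
# ... versus one regression per group
fit_0 <- lm(lw_s ~ edu_s, data = sim_s, subset = fem_s == 0)
fit_1 <- lm(lw_s ~ edu_s, data = sim_s, subset = fem_s == 1)

# slope for group 1 in the interaction model = b_educ + b_interaction
slope_1_int <- sum(coef(fit_int)[c("edu_s", "edu_s:fem_s1")])
c(interaction_model = slope_1_int,
  separate_fit      = unname(coef(fit_1)["edu_s"]))
```

The advantage of the single interacted regression is that the difference in slopes shows up directly as one coefficient.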


## (Unobserved) Individual Heterogeneity

Finally, dummy variables are sometimes very important to account for spurious relationships in the data. Consider the following (artificial) example:

1. Suppose we collected data on hourly wages together with the number of hours worked for a set of individuals.
1. We want to investigate the labour supply behaviour of those individuals, hence we run the regression `hours_worked ~ wage`.
1. We expect to get a positive coefficient on `wage`: the higher the wage, the more hours worked.
1. You know that individuals are members of either group `g=1` or `g=2`.

```{r, echo = FALSE}
two_clouds <- function(a1 = 5, a2 = -2, b1 = 2.5, b2 = 1.5, n1 = 50, n2 = 50){
  set.seed(12)
  x1 = rnorm(n1,mean = 1,sd = 0.5)
  x2 = rnorm(n2,mean = 1.3,sd = 0.5)
  y1 = a1 + b1 * x1 + rnorm(n1,sd = 2)
  y2 = a2 + b2 * x2 + rnorm(n2,sd = 2)
  x = c(x1,x2)
  y = c(y1,y2)
  g = factor(c(rep(1,n1),rep(2,n2)))
  z = data.frame(x,y,g)
  m1 = lm(y~x,data = z)
  m2 = update(m1, . ~ . + g)
  p1 = ggplot(z, aes(x,y)) + geom_point() + geom_smooth(method = "lm",se =FALSE) + scale_x_continuous(name = "wage")  + scale_y_continuous(name = "hours") + theme_bw() 
  p2 = ggplot(z, aes(x,y,color = g)) + geom_point() + geom_smooth(method = "lm",se = FALSE) + scale_x_continuous(name = "wage")  + scale_y_continuous(name = "hours") + theme_bw()  + ggtitle("Controlling for Group")
  # par(mfcol = c(1,2))
  # plot(z$x,z$y)
  # abline(m1)
  list(m1=m1,m2=m2,p1 = p1, p = cowplot::plot_grid(p1,p2,rel_widths = c(1,1.2)))
}
tc = two_clouds()
tc$p1
```

Here we observe a slightly negative relationship: higher wages are associated with fewer hours worked? Maybe. But what is this, there is a group identifier in this data! Let's use this and include `g` as a dummy in the regression - suppose `g` encodes male and female. 

```{r,echo = FALSE,fig.align='center',fig.cap='Left and right panel exhibit the same data. The right panel controls for group composition.'}
tc$p
```

This is an artificial example; yet it shows that you can be severely misled if you don't account for group-specific effects in your data. The problem is particularly acute if we *don't know group membership* - we can then resort to advanced methods that are beyond the scope of this course to *estimate* which group each individual belongs to. If we *do know* group membership, however, it is good practice to include a group dummy so as to control for group effects.
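
Here is a variation on the example above, as a sketch you can run yourself. The numbers differ from the ones behind the figures and are chosen purely for illustration: within each group an extra unit of wage raises hours by 2, but the group with the higher wages has a much lower intercept.

```{r}
set.seed(3)
n_g  <- 2000                                   # observations per group
g_id <- factor(rep(c(1, 2), each = n_g))
# made-up design: group 2 has higher wages but a much lower intercept
wage_g  <- c(rnorm(n_g, mean = 1, sd = 0.5), rnorm(n_g, mean = 2, sd = 0.5))
hours_g <- ifelse(g_id == 1, 5, 0) + 2 * wage_g + rnorm(2 * n_g, sd = 1)

fit_pooled <- lm(hours_g ~ wage_g)          # ignores group membership
fit_group  <- lm(hours_g ~ wage_g + g_id)   # adds the group dummy

c(pooled     = unname(coef(fit_pooled)["wage_g"]),
  with_dummy = unname(coef(fit_group)["wage_g"]))
```

The pooled slope has the wrong sign, while the regression with the group dummy gets close to the within-group slope of 2 that we built into the simulation.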


================================================
FILE: 06-StdErrors.Rmd
================================================
# Regression Inference {#std-errors}

In this chapter we want to investigate uncertainty in regression estimates. We want to understand precisely what the `Std. Error` column in a typical regression table is telling us. In terms of a picture, we want to better understand the meaning of the shaded area in a plot like this one:

```{r confint,fig.align="center",message=FALSE,warning=FALSE,echo=FALSE,fig.cap="Confidence bands around a regression line."}
library(ggplot2)
data("wage1", package = "wooldridge")
p <- ggplot(mapping = aes(x = educ, y = lwage), data = subset(wage1,educ > 5)) # base plot
p <- p + geom_point() # add points
p <- p + geom_smooth(method = "lm", size=1, color="red") # add regression line
p <- p + scale_y_continuous(name = "log hourly wage") + 
         scale_x_continuous(name = "years of education")
p + theme_bw() + ggtitle("Log Wages vs Education")
```

In order to fully understand this, we need to go back and make sure we have a good grasp of *sampling*. Let's do this first.

## Sampling

In class we were confronted with a jar of Tricolore Fusilli pasta as pictured in figure \@ref(fig:pasta1).^[This part is largely based on [moderndive](https://moderndive.com/7-sampling.html), to which I am giving full credit hereby. Thanks for this great idea.] We asked ourselves a question which, secretly, many of you had asked yourselves at one point in your lives, namely:

```{block type = "tip"}
What is the proportion of **green** Fusilli in a pack of Tricolore Fusilli?
```
<br>

Well, it's time to find out.

```{r pasta1, fig.cap="A glass jar filled with Fusilli pasta in three different colors.",echo = FALSE,fig.width = 8, out.width = "90%"}
knitr::include_graphics("images/pasta1.JPG")
```

Let's call the Fusilli in this jar our *study population*, i.e. the set of units about which we want to learn something. There are several approaches to address the question of how big a proportion of the population the green Fusilli make up. One obvious solution is to enumerate all Fusilli according to their color, and compute their proportion in the entire population. It works perfectly well as a solution, but is a long and arduous process, see figures \@ref(fig:pasta2) and \@ref(fig:pasta3). 

```{r pasta2, fig.cap="Manually separating Fusilli by their color is very costly in terms of time and effort.",echo = FALSE,fig.width = 8, out.width = "90%"}
knitr::include_graphics("images/pasta2.JPG")
```


Additionally, you may draw worried looks from the people around you, while you are doing it. Maybe this is not the right way to approach this task?^[Regardless of the worried onlookers, I did what I had to do and I carried on to count the green pile. I know exactly how many greens are in there now! I then computed the weight of 20 Fusilli (5g), and backed out the number of Fusilli in the other piles. I will declare those numbers as the *true numbers*. (Sceptics are free to recount.)]

```{r pasta3, fig.cap="Heaps of Fusilli pasta ready to be counted.",echo = FALSE, out.width = "90%"}
knitr::include_graphics("images/pasta3.JPG")
```

### Taking One Sample From the Population

We started by randomly grabbing a handful of Fusilli from the jar and dropping exactly $N=20$ of them into a paper coffee cup, pictured in figure \@ref(fig:pasta5). We call $N$ the *sample size*. The count and corresponding proportions of each color in this first sample are shown in the following table:

Color | Count | Proportion
:------:|:------:|:--------:
Red   |  7        |  0.35
Green   |  5     |   0.25
White   |  8    |     0.4

So far, so good. We have our first *estimate of the proportion of green Fusilli in the overall population*: 0.25. Notice that taking a sample of $N=20$ was *much* quicker and *much less painful* than the full count (i.e. the *census*) of Fusilli performed above. 

```{r pasta5, fig.cap="Taking one sample of 20 Fusilli from the jar.",echo = FALSE, out.width = "90%"}
knitr::include_graphics("images/pasta5.JPG")
```

Then, we put our sample back into the jar, and we reshuffled the Fusilli. Had we taken *another* sample, again of $N=20$, would we again have gotten 7 Red, 5 Green, and 8 White, just as in the first sample? Maybe, but maybe not. Suppose we had carried on drawing samples of 20 several times and counting the colors: would we have observed exactly 5 green Fusilli every time? Definitely not. We would have noted some degree of *variability* in the proportions computed from our samples. The *sample proportions* in this case are an example of a *sample statistic*.

```{block type = "note"}
**Sampling Variation** refers to the fact that if we *randomly* take samples from a wider population, the *random* composition of each sample will imply that we obtain statistics that vary - they take on potentially different values in each sample.
```
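
We can mimic this on the computer. The sketch below builds a virtual jar with made-up color counts (these are *not* the true numbers from our jar) and repeatedly draws 20 Fusilli from it without replacement.

```{r}
set.seed(4)
# a made-up jar: the color counts are hypothetical, not the real ones
jar <- rep(c("red", "green", "white"), times = c(700, 500, 800))

# draw one sample of n Fusilli and compute the share of green ones
one_sample <- function(n = 20) mean(sample(jar, size = n) == "green")

# repeat the draw ten times and look at the spread of the results
props <- replicate(10, one_sample())
props
range(props)
```

Each entry of `props` is one sample proportion; the fact that they are not all equal is sampling variation.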


Let's see how this story evolved as we started taking more samples at a time. 

### Taking Eleven Samples From the Population

We formed teams of two students in class who would each in turn take samples from the jar (the population) of size $N=20$, as before. Each team computed the proportion of green Fusilli they had in their sample, and we wrote this data down in a table on the board. Then, we drew a histogram which showed how many samples had fallen into which bins. 

```{r pasta6, fig.cap="Taking eleven samples of 20 Fusilli each from the jar, and plotting the histogram of obtained sample proportions of Green Fusilli.",echo = FALSE, out.width = "90%"}
knitr::include_graphics("images/pasta6.JPG")
```

We looked at the histogram in figure \@ref(fig:pasta6) and we noted several things:

1. The largest proportion was 0.3 green.
1. The smallest proportion was 0.15 green.
1. The most frequent proportion was 0.25 green Fusilli.
1. We did think that this looked *suspiciously* like a **normal distribution**. 

We collected the sample data into a data.frame:

```{r sample-data}
pasta_samples <- data.frame(group = 1:11, replicate = 1:11, prop_green = c(0.3,0.25,0.25,0.3,0.15,0.3,0.25,0.25,0.2,0.25,0.2))
pasta_samples
```

This produces an associated histogram which looks very much like the one we drew on the board:

```{r pasta-hist,echo = FALSE}
hist(pasta_samples$prop_green,breaks = c(0.125,0.175,0.225,0.275,0.325),main = "Histogram of 11 Pasta Samples", xlab = "Proportion of Green Fusilli")
```

### Recap

Let's recapitulate what we just did. We wanted to know what proportion of Fusilli in the glass jar in figure \@ref(fig:pasta1) are green. We acknowledged that an exhaustive count, or a census, is a costly and cumbersome exercise, which in most circumstances we will try to avoid. In order to make some progress nonetheless, we took a *random sample* from the full population in the jar: we randomly selected 20 Fusilli, and looked at the proportion of green ones in there. We found a proportion of 0.25.

After replacing the Fusilli from the first sample in the jar, we asked ourselves if, upon drawing a *new* sample of 20 Fusilli, we should expect to see the same outcome - and we concluded: maybe, but maybe not. In short, we discovered some random variation from sample to sample. We called this **sampling variation**.

The purpose of this little activity was three-fold:

1. To understand that random samples differ and that there is sampling variation.
1. To understand that bigger samples will yield smaller sampling variation.
1. To illustrate that the sampling distribution of many statistics (such as the sample proportion in our case, or a sample mean) computed from a random sample is approximately normal once the sample size is large enough.
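
The second point can be checked by simulation. The following sketch assumes a population proportion of green Fusilli of 0.25 (an assumption for illustration) and, for simplicity, draws with replacement via `rbinom`, which is a good approximation when the jar is large relative to the sample.

```{r}
set.seed(5)
p_green <- 0.25                       # assumed population share of green

# simulate many sample proportions for a given sample size n
sim_props <- function(n, reps = 5000) rbinom(reps, size = n, prob = p_green) / n

sd_20  <- sd(sim_props(20))
sd_100 <- sd(sim_props(100))

# compare with the textbook standard error sqrt(p * (1 - p) / n)
rbind(simulated = c(n20 = sd_20, n100 = sd_100),
      theory    = sqrt(p_green * (1 - p_green) / c(20, 100)))
```

Samples of 100 produce proportions that are much more tightly clustered around 0.25 than samples of 20.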

```{block type = "note"}
The value of this exercise consisted in making **you** perform the sampling activity yourself. We will now hand over to the brilliant **moderndive** package, which will further develop this chapter. 
```


## Handover to `Moderndive`

```{r handover,out.width="90%", fig.cap="The Moderndive package used red and white balls instead of fusilli pasta.",echo = FALSE}
knitr::include_graphics("images/transition.png")
```

The sampling activity in `moderndive` was performed with red and white balls instead of green fusilli pasta. The rest is identical. We will now read sections [7.2](https://moderndive.com/7-sampling.html#sampling-simulation) and [7.3](https://moderndive.com/7-sampling.html#sampling-framework) in their book, as well as [chapter 8 on confidence intervals and bootstrapping](https://moderndive.com/8-confidence-intervals.html), and [chapter 9 on hypothesis testing](https://moderndive.com/9-hypothesis-testing.html).



 
 
## Uncertainty in Regression Estimates

In the previous chapters we have seen how the OLS method can produce estimates of intercept and slope coefficients from data. You have seen this method at work in `R` by using the `lm` function as well. It is now time to introduce the notion that, given that $b_0$, $b_1$ and $b_2$ are *estimates* of some unknown *population parameters*, there is some degree of **uncertainty** about their values. Another way to say this is that we want some indication of the *precision* of those estimates. The underlying issue is that the data we have at hand are usually *samples* from a larger population.

```{block,type="note"}
<center>
How *confident* should we be about the estimated values $b$?
</center>
```
<br>

## What is *true*? What are Statistical Models?

A **statistical model** is simply a set of assumptions about how some data have been generated. As such, it models the  data-generating process (DGP), as we have it in mind. Once we define a DGP, we could simulate data from it and see how this compares to the data we observe in the real world. Or, we could change the parameters of the DGP so as to understand how the real world data *would* change, could we (or some policy) change the corresponding parameters in reality. Let us now consider one particular statistical model, which in fact we have seen so many times already.

## The Classical Regression Model (CRM) {#class-reg}

Let's bring back our simple model \@ref(eq:abline) to explain this concept.

\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i (\#eq:abline-5)
\end{equation}

The smallest set of assumptions used to define the *classical regression model* as in \@ref(eq:abline-5) is the following:

1. The data are **not linearly dependent**: Each variable provides new information for the outcome, and it cannot be replicated as a linear combination of other variables. We have seen this in section \@ref(multicol). In the particular case of one regressor, as here, we require that $x$ exhibit some variation in the data, i.e. $Var(x)\neq 0$.
1. The mean of the errors conditional on $x$ should be zero, $E[\varepsilon|x] = 0$. Notice that this also means that $Cov(\varepsilon,x) = 0$, i.e. that the errors and our explanatory variable(s) should be *uncorrelated*. It is said that $x$ should be **strictly exogenous** to the model.

These assumptions are necessary to successfully (and correctly!) run an OLS regression. They are often supplemented with an additional set of assumptions, which help with certain aspects of the exposition, but are not strictly necessary:

3. The data are drawn from a **random sample** of size $n$: observation $(x_i,y_i)$ comes from the exact same distribution, and is independent of observation $(x_j,y_j)$, for all $i\neq j$.
4. The variance of the error term $\varepsilon$ is the same for each value of $x$: $Var(\varepsilon|x) = \sigma^2$. This property is called **homoskedasticity**.
5. The error is normally distributed, i.e. $\varepsilon \sim \mathcal{N}(0,\sigma^2)$

Invoking assumption 5. in particular defines what is commonly called the *normal* linear regression model.


### $b$ is not $\beta$!

Let's talk about the small but important modifications we applied to  model \@ref(eq:abline) to end up at \@ref(eq:abline-5) above:

* $\beta_0$ and $\beta_1$ are the intercept and slope parameters.
* $\varepsilon$ is the error term.

First, we *assumed* that \@ref(eq:abline-5) is the correct representation of the DGP. With that assumption in place, the values $\beta_0$ and $\beta_1$ are the *true parameter values* which generated the data. Notice that $\beta_0$ and $\beta_1$ are potentially different from $b_0$ and $b_1$ in \@ref(eq:abline) for a given sample of data - they could in practice be very close to each other, but $b_0$ and $b_1$ are *estimates* of $\beta_0$ and $\beta_1$. And, crucially, those estimates are generated from a sample of data. Now, the fact that our data $\{y_i,x_i\}_{i=1}^N$ are a sample from a larger population means that there will be *sampling variation* in our estimates - exactly like in the case of the sample proportion estimating the population proportion in the pasta example above. One particular sample of data will generate one particular set of estimates $b_0$ and $b_1$, whereas another sample of data will generate estimates which will in general be different - by *how much* those estimates differ across samples is the question in this chapter. In general, the more observations we have, the greater the precision of our estimates, and hence the closer together the estimates from different samples will lie.
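
Because we control the DGP in a simulation, we can watch this sampling variation directly. The sketch below assumes "true" values $\beta_0 = 1$ and $\beta_1 = 0.5$ and standard normal $x$ and $\varepsilon$ (all of these are arbitrary choices for illustration), draws many samples, and stores the OLS slope $b_1$ from each one.

```{r}
set.seed(6)
beta0 <- 1; beta1 <- 0.5               # assumed "true" parameters

# draw one sample of size n from the DGP and return the OLS slope b1
draw_b1 <- function(n) {
  x   <- rnorm(n)
  eps <- rnorm(n)                      # error term with variance 1
  y   <- beta0 + beta1 * x + eps
  unname(coef(lm(y ~ x))["x"])
}

b1_small <- replicate(1000, draw_b1(25))    # 1000 samples of size 25
b1_large <- replicate(1000, draw_b1(400))   # 1000 samples of size 400

c(mean_small = mean(b1_small), sd_small = sd(b1_small),
  mean_large = mean(b1_large), sd_large = sd(b1_large))
```

In both cases the estimates are centered around the true slope, but with the larger sample size they are spread out much less.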



### Violating the Assumptions of the CRM {#violating}

It's interesting to consider in which circumstances we might violate those assumptions. Let's give an example for each of them:

1. No Perfect Collinearity. We have seen that perfect collinearity makes it impossible to compute the OLS coefficients. Remember the example about adding `wtplus = wt + 1` to the `mtcars` dataset? Here it is:
    ```{r,warning = FALSE, message = FALSE}
    library(dplyr)
    mtcars %>%
    mutate(wtplus = wt + 1) %>%
    lm(mpg ~ wt + wtplus, data = .)
    ```
    That the coefficient on `wtplus` is `NA` is the result of the direct linear dependence. (Notice that creating `wtplus2 = (wt + 1)^2` would work, since that is not linear!)
1. Conditional Mean of errors is zero, $E[\varepsilon|x] = 0$. Going back to our running example in figure \@ref(fig:confint) about wages and education: Suppose that each individual $i$ in our data has something like *innate ability*, something we might wish to measure with an IQ-test, however imperfectly. Let's call it $a_i$. It seems reasonable to think that high $a_i$ will go together with high wages. At the same time, people with high $a_i$ will find studying for exams and school work much less burdensome than others, hence they might select into obtaining more years of schooling. The problem? Well, there is no $a_i$ in our regression equation - most of the time we don't have a good measure of it to start with. So it's an *unobserved variable*, and as such, it is part of the error term $\varepsilon$ in our model. We will attribute to `educ` part of the effect on wages that is actually *caused* by ability $a_i$! Sometimes we may be able to reason about whether our estimate on `educ` is too high or too low, but we will never know its true value. We don't get the *ceteris paribus* effect (the true partial derivative of `educ` on `lwage`). Technically, the assumption $E[\varepsilon|x] = 0$ implies that $Cov(\varepsilon,x) = 0$, so that's the part that is violated.
1. Data from Random Sample. One common concern here is that the observations in the data could have been *selected* in a particular fashion, which would make them less representative of the underlying population. Suppose we had ended up with individuals only from the richest neighborhood of town: our interpretation of the impact of education on wages might not be valid for other areas.
1. Homoskedasticity. For correct inference (below!), we want to know whether the variance of $\varepsilon$ varies with our explanatory variable $x$, or not. Here is a typical example where it does:
    ```{r,echo = FALSE}
    data("engel",package = "quantreg")
    plot(foodexp ~ log(income) ,data = engel,main = "Food Expenditure vs Log(income)")
    ```
    As income increases, not all people increase their food consumption in an equal way. So $Var(\varepsilon|x)$ will vary with the value of $x$, hence it won't be equal to the constant $\sigma^2$. 
1. If the distribution of $\varepsilon$ is not normal, it is more cumbersome to derive theoretical results about inference. 
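
The ability story from point 2 can be illustrated with a simulation, where (unlike in real data) we do get to observe ability. Everything in this sketch is made up: the variable names, the assumed true return to education of 0.05, and the strength of the link between ability and schooling.

```{r}
set.seed(7)
n_a     <- 5000
ability <- rnorm(n_a)                          # unobserved in practice
educ_a  <- 12 + 2 * ability + rnorm(n_a)       # ability raises schooling
# assumed true return to education: 0.05
lwage_a <- 1 + 0.05 * educ_a + 0.2 * ability + rnorm(n_a, sd = 0.3)

short_fit <- lm(lwage_a ~ educ_a)              # ability sits in the error term
long_fit  <- lm(lwage_a ~ educ_a + ability)    # infeasible without ability data

c(without_ability = unname(coef(short_fit)["educ_a"]),
  with_ability    = unname(coef(long_fit)["educ_a"]))
```

The regression that leaves out ability overstates the return to education, because `educ_a` picks up part of the effect of ability on wages.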


## Standard Errors in Theory {#se-theory}

The standard deviation of an OLS estimate is generally called its *standard error*. As such, it is just the square root of the estimate's variance.
Under assumptions 1. through 4. above we can define the formula for the variance of our slope coefficient in the context of our single regressor model \@ref(eq:abline-5) as follows: 

\begin{equation}
Var(b_1|x_i) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}  (\#eq:var-ols)
\end{equation}

In practice, we don't know the theoretical variance of $\varepsilon$, i.e. $\sigma^2$, but we form an estimate of it from our sample of data. A widely used estimate uses the already encountered SSR (sum of squared residuals), and is denoted $s^2$:

$$
s^2 = \frac{SSR}{n-p} = \frac{\sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2}{n-p} =  \frac{\sum_{i=1}^n e_i^2}{n-p}
$$
where $n-p$ are the *degrees of freedom* available in this estimation. $p$ is the number of parameters we wish to estimate (here: 2, the intercept $b_0$ and the slope $b_1$). So, the variance formula would become

\begin{equation}
Var(b_1|x_i) = \frac{SSR}{(n-p)\sum_{i=1}^n (x_i - \bar{x})^2}  (\#eq:var-ols2)
\end{equation}

Most of the time we work directly with the *standard error* of a coefficient, hence we define

\begin{equation}
SE(b_1) = \sqrt{Var(b_1|x_i)} = \sqrt{\frac{SSR}{(n-p)\sum_{i=1}^n (x_i - \bar{x})^2}}  (\#eq:SE-ols2)
\end{equation}

You can clearly see that, as $n$ increases, the denominator increases, and therefore variance and standard error of the estimate will decrease.
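
It is instructive to compute this standard error by hand once and compare it to what `lm` reports. The sketch below uses the built-in `mtcars` data with `mpg` regressed on `wt` as a convenient single-regressor example (my choice for illustration), and sets $p = 2$ because both an intercept and a slope are estimated.

```{r}
# built-in data; mpg on wt is just a convenient single-regressor example
fit_se <- lm(mpg ~ wt, data = mtcars)

n_obs <- nrow(mtcars)
p_par <- 2                                  # intercept and slope
ssr   <- sum(resid(fit_se)^2)               # sum of squared residuals
s2    <- ssr / (n_obs - p_par)              # estimate of sigma^2
se_b1 <- sqrt(s2 / sum((mtcars$wt - mean(mtcars$wt))^2))

# hand-computed value next to the one reported by summary()
c(by_hand = se_b1,
  from_lm = summary(fit_se)$coefficients["wt", "Std. Error"])
```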







================================================
FILE: 07-Causality.Rmd
================================================
# Causality {#causality}

In this chapter we take on a challenging part of our course. Remember that in the [first set of slides](https://rawcdn.githack.com/ScPoEcon/ScPoEconometrics-Slides/session2_1/chapter1/chapter1.html) we introduced Econometrics as the economist's toolkit to answer questions like *does $x$ **cause** $y$?* Let's illustrate the issues at stake with a question from epidemiology and public health:

```{block type = "warning"}
Does smoking **cause** lung cancer?
```
<br>

Just in case you were wondering: Yes it does! However, for a very long time the *causal impact* of smoking on lung cancer was hotly debated, and it's instructive for us to look at this history.^[This chapter is drawn from chapter 5 of *The Book of Why* by [Judea Pearl](http://bayes.cs.ucla.edu/jp_home.html).]

Let's go back to the 1950's. We are at the start of a big increase in deaths from lung cancer. At the same time cigarette consumption was growing very fast. With the benefit of hindsight, we can now draw this graph:

```{r smoking-cancer,echo = FALSE,fig.align = "center",fig.cap="Two time series showing cigarette consumption per capita and incidence of lung cancer in the USA."}
knitr::include_graphics("images/Smoking_lung_cancer.png")
```

However, time series graphs are poor tools to make causal statements. Many *other things* had changed from 1900 to 1950, all of which could equally be responsible for the rise in cancer rates:

1. Tarring of roads
1. Inhalation of motor exhausts (leaded gasoline fumes)
1. General greater air pollution.

We call those other factors **confounders** of the relationship between smoking and lung cancer.

So, there were a series of sceptics around who at the time were contesting the existing evidence. That evidence consisted in general of the following:

1. **Case-Control studies**: British epidemiologists Richard Doll and Austin Bradford Hill started to compare people already diagnosed with cancer to those without, recording their history and observable characteristics (like age and health behaviours). In one study, out of 649 lung cancer patients interviewed, all but 2 had been smokers! In that study, a cancer patient was 1.5 million times more likely to have been a smoker than a non-smoker. Still, critics said, there are several sources of bias: 
    * Hospital patients could be a selected sample of the general (smoking) population.
    * Patients could suffer from *recall bias*, affecting their recollection of facts.
    * So, while comparing cancer patients to non-patients and controlling for several important *confounders* (like age, income and other observable characteristics), there was still scope for bias.
    * Moreover, replicating those studies, as Doll and Hill attempted, would not have solved this issue.
1. Next they attempted what doctors call a **Dose-Response Effect** study. In 1951 they sent out 60,000 questionnaires to British physicians asking about *their* smoking habits. Then they followed them over time:
    * Only 5 years on, heavy smokers had a death rate from lung cancer that was 24 times higher than for nonsmokers.
    * People who had smoked and then stopped reduced their risk by a factor of 2.
    * Still, notorious sceptics like R.A. Fisher were unconvinced. The studies *still* failed to compare **otherwise identical** smokers to non-smokers. There were *still* important unobserved confounders out there which could invalidate the conclusion that we observed indeed a **causal** relationship.
    
Let's put some structure on this problem now, so we can make progress.

## Directed Acyclic Graphs (DAG) {#dags}

A DAG is a tool to visualize a causal relationship. It is a graph where nodes are connected via arrows, where an arrow can run in one direction only (hence, *directed* graph). If an arrow starts at node $x$ and ends at node $y$, we say that $x$ causes $y$. Here is a simple example of such a DAG:

```{r dag1,echo = FALSE,warning = FALSE,message = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "A simple DAG showing the causal impact of $x$ on $y$."}
library(ggdag)
theme_set(theme_dag())
d1 = dagify(y ~ x) %>% 
  ggdag()
d1
```

Now consider this setting, where there is a third variable, $z$. It could be possible that also $z$ has a direct influence on $y$:

```{r dag2,echo = FALSE,warning = FALSE,message = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "A simple DAG with 2 causal paths: Both $x$ and $z$ have a direct impact on $y$."}
dagify(y ~ x,
       y ~ z) %>% 
  ggdag()
```

Now let's change this and create a path from $z$ to *both* $x$ and $y$ instead. We call $z$ a *confounder* in the relationship between $x$ and $y$: $z$ *confounds* the direct causal impact of $x$ on $y$, by affecting them both at the same time. What is more, there is no arrow from $x$ to $y$ at all, so the only *real* explanatory variable here is in fact $z$. Attributing any explanatory power to $x$ would be wrong in this setting.

```{r dag3,echo = FALSE,warning = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "A simple DAG where $z$ is a confounder. There is no causal path from $x$ to $y$, and any correlation we observe between those variables is completely induced by $z$. We call this spurious correlation."}
ggdag_confounder_triangle()
```
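
A quick simulation shows how such a spurious correlation arises. In this sketch (all names and numbers are made up) $z$ causes both $x$ and $y$, while $x$ has no effect on $y$ at all. Because we generate the data ourselves, we observe $z$ and can see what happens when it is added to the regression.

```{r}
set.seed(8)
n_d <- 5000
z_d <- rnorm(n_d)                 # the confounder
x_d <- z_d + rnorm(n_d)           # z causes x
y_d <- 2 * z_d + rnorm(n_d)       # z causes y; x has no effect on y

naive_fit    <- lm(y_d ~ x_d)         # picks up the path through z
adjusted_fit <- lm(y_d ~ x_d + z_d)   # holds z fixed

c(naive    = unname(coef(naive_fit)["x_d"]),
  adjusted = unname(coef(adjusted_fit)["x_d"]))
```

The naive regression finds a sizeable slope on `x_d` even though there is no causal path from $x$ to $y$; once `z_d` is included, the coefficient on `x_d` is close to zero.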

Here is a second example where $z$ is a confounder, but slightly different. 

```{r dag41,echo=FALSE,fig.cap="$z$ is still a confounder here, but there is a causal link from $x$ to $y$ now. If we observed $z$, we can control for it."}
d4 = dagify(y ~ x,
  x ~ z,
  y ~ z) %>%
    tidy_dagitty(layout = "tree") %>%
  ggdag()
d4
```

In figure \@ref(fig:dag41) there is an arrow from $x$ to $y$. In this setting, if we are able to *observe* $z$, we can adjust the correlation we observe between $x$ and $y$ for the variation induced by $z$. In practice, this is precisely what multiple regression will do: holding $z$ fixed at some value, what is the partial effect of $x$ on $y$? Notice that $z$ ceases to be a confounder in this situation, and interpreting our regression coefficient on $x$ as *causal* is correct.


## Smoking in a DAG

Let's use this and cast our problem as a DAG now. What the scientists in the 1950s faced were two competing models of the relationship between smoking and lung cancer:

```{r dag-cig,fig.height = 4,echo = FALSE,fig.cap = "Two competing causal graphs for the relationship between smoking and lung cancer. In the right panel Lung Cancer is directly impacted by a genetic factor, which at the same time also influences smoking. This is a stark representation of Fisher's view. Another version would have an additional arrow from Smoking to Lung Cancer in the right panel."}
# https://cran.r-project.org/web/packages/ggdag/vignettes/bias-structures.html
p1 = dagify(cancer ~ smoking,
       labels = c("cancer" = "Lung Cancer", 
                  "smoking" = "Smoking"
                  ),
       exposure = "smoking",
       outcome = "cancer") %>%
  ggdag(text = FALSE, use_labels = "label") + ggtitle("Doll & Hill") + theme(plot.title = element_text(hjust = 0.5))

p2 = confounder_triangle(x = "Smoking", y = "Lung Cancer", z = "Gene") %>% 
       ggdag(text = FALSE, use_labels = "label") + ggtitle("R.A. Fisher") + theme(plot.title = element_text(hjust = 0.5))
p3 <- dagify(cancer ~ smoking,
smoking ~ gene,
cancer ~ gene,
outcome = "cancer",
labels = c("gene" = "Gene", "cancer" = "Lung Cancer",
"smoking" = "Smoking")) %>%   
  tidy_dagitty(layout = "tree") %>%
  ggdag(text = FALSE, use_labels = "label") + ggtitle("Gene Partial")

cowplot::plot_grid(p1,p2,axis = "tb")
```

Basically, what critics like Fisher were claiming was that the existing studies did not compare like for like. In other words, our *ceteris paribus* assumption was not satisfied. They were worried that *smoking* was not the only relevant difference between a population of smokers and one of non-smokers. In particular, they worried that people **self-selected** into smoking, and that the choice to become a smoker may be influenced by other, unobserved, underlying forces - like genetic predisposition, for example. That could mean that smokers were also more likely to take risks, or more likely to be heavy drinkers, or to engage in other behaviours that might be conducive to developing lung cancer. They did not formulate it in terms of genetics at the time, because they could not know until the 2000s, when the human genome was sufficiently mapped to establish this fact (and indeed there **is** a smoking gene! But that's beside the point), but they worried about this factor.


The argument was settled in the eyes of most physicians, when Jerome Cornfield in 1959 wrote a rebuttal of Fisher's points. Cornfield's strategy was to allow Fisher to have his unobserved factor, but to show that there was an upper bound to *how important* it could be in determining the outcome. Here goes:

1. Suppose there is indeed a confounding factor "smoking gene", and that it completely determines the risk of cancer in smokers. 
1. Suppose smokers are observed to have 9 times the risk of non-smokers to develop lung cancer.
1. The smoking gene needs to be at least 9 times more prevalent in smokers than in non-smokers to explain this difference in risk.

But now consider what this implies. Let's suppose that around 11% of all non-smokers have the smoking gene. That means that $9\times 11 = 99\%$ of smokers need to have it! What's even more worrying, if even 12% of non-smokers have the gene, then the argument breaks down because it would require $9\times 12 = 108\%$ of smokers to have it, which is of course impossible.

This argument was so important that it got a name: **Cornfield's inequality**. It left of Fisher's argument nothing but a pile of rubble. It's impossible to think that genetic variation alone could be so important in determining a complex choice of becoming a smoker or not. Looking back at the right panel of figure \@ref(fig:dag-cig), the link from smoking to lung cancer was much too strong to be explained by the genetic hypothesis alone.
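
The arithmetic behind this bound is easy to reproduce. The following is a small sketch; the helper `required_share()` and the grid of prevalence values are made up for illustration, and only the relative risk of 9 comes from the argument above.

```{r}
# Hypothetical helper: share of smokers who would need to carry the gene
# if the gene alone were to explain a given relative risk.
required_share <- function(share_nonsmokers, relative_risk = 9) {
  relative_risk * share_nonsmokers
}

share_grid <- c(0.05, 0.11, 0.12, 0.20)
data.frame(
  share_nonsmokers    = share_grid,
  required_in_smokers = required_share(share_grid),
  feasible            = required_share(share_grid) <= 1  # a share cannot exceed 100%
)
```

Any prevalence among non-smokers above $1/9 \approx 11.1\%$ makes the purely genetic explanation infeasible, which is exactly the point of the 11% versus 12% comparison.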


## Randomized Control Trials (RCT) Primer {#rct}

We now present a quick introduction to Randomized Control Trials (RCTs). The history of randomization is fascinating and goes back a long time, again involving R.A. Fisher from above.^[I refer the interested student to the introduction of the *potential outcomes model* of [Scott Cunningham's](https://twitter.com/causalinf) [mixtape](http://scunning.com/cunningham_mixtape.pdf), which heavily influences this section.] Suffice it to say that RCTs have become so important in Economics that the [Nobel Prize in Economics 2019](https://www.nobelprize.org/prizes/economic-sciences/2019/summary/) has been awarded to three exponents of the RCT literature, [Duflo, Banerjee and Kremer](https://www.economist.com/finance-and-economics/2019/10/17/a-nobel-economics-prize-goes-to-pioneers-in-understanding-poverty). RCTs are widely used in Medicine, where they originate from (in some sense). But, what *are* RCTs?

```{block type="note"}
A randomized controlled trial is a type of scientific experiment that aims to reduce certain sources of bias when testing the effectiveness of some intervention (treatment or policy); this is accomplished by randomly allocating subjects to two or more groups, treating them differently, and then comparing them with respect to a measured response.
```
<br>
That sounds really intuitive. If we *randomly* allocate people to receive treatment, there can be no concern of unobserved confounders, as we have relieved the subjects of making the choice to get treated. Remember the cigarette smokers above: the concern was that an unobserved genetic predisposition was correlated both with choosing to become a smoker and with other potentially cancer-inducing behaviours like drinking or risk taking. Imagine for a moment that we could randomly select people at some young age for treatment (smoking for 30 years, say). The genetic predisposition would be equally prevalent in both treatment and control group. However, only the treatment group is allowed (and indeed forced) to smoke. Observing higher cancer rates in the treatment group would provide *causal evidence* for the effect of smoking on lung cancer.

Thankfully, such an experiment is impossible to run on ethical grounds. We could never subject individuals to such severe and prolonged health risks for the sake of a research study. That's why the question took so long to be settled!

Let's introduce a formal framework now to think more about RCTs.


## The Potential Outcomes Model {#rubin}

The Potential Outcomes Model, often called the *Rubin Causal Model* after one of its inventors, posits that there are two states of the world, each with its own *potential outcome*: a first state, where a certain intervention is administered to an individual, and a second state, where this is not the case. Formally, this idea is expressed with superscripts 0 and 1, like this:

* $Y_i^1$: outcome of individual $i$ if they **have** been treated
* $Y_i^0$: outcome of individual $i$ if they have **not** been treated

Denoting with $D_i \in \{0,1\}$ the treatment indicator which is one if $i$ is indeed treated, the *observed outcome* $Y_i$ is then

\begin{equation}
Y_i = D_i Y_i^1 + (1-D_i)Y_i^0 (\#eq:rubin-model)
\end{equation}

This simple equation is able to formalize a rather deep question. We only ever observe one outcome of events for a given individual $i$, say $Y_i = Y_i^1$ in case treatment was given. The deep question is: *what would have happened to $i$, had they **not** received treatment*? You will realize that this is a very natural question for us humans to put to ourselves, and to subsequently answer:

* How long would the trip have taken, had I chosen another metro line?
* What would have happened, had I chosen to study a different subject?
* What would have happened, had [Neo](https://en.wikipedia.org/wiki/Neo_(The_Matrix)) taken the blue pill instead?

Our ability to make those considerations distinguishes us from animals. It's one of the biggest challenges for machines when trying to be *intelligent*.

What makes this question so hard to answer for machines and animals alike is the fact that one has to *imagine a parallel universe* where the actions taken were different, **without** having observed that precise situation before. Neo did *not* take the blue pill, and whatever happened after that originated from this decision - so how are we to tell what would have happened? It's easy for us and [still hard for machines](https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/). 

Potential outcome $Y_i^0$ above is what is known as the *counterfactual* outcome. What would have happened to subject $i$, had they **not** received treatment $D$?

Following Rubin, let us define the **treatment effect** for individual $i$ as follows:

\begin{equation}
\delta_i = Y_i^1 - Y_i^0 (\#eq:TE)
\end{equation}

Notice our insistence on talking about a single individual $i$ throughout here. Keeping the potential outcome model \@ref(eq:rubin-model) in mind, i.e. the fact that we only observe *one* of the two outcomes, we face the **fundamental identification problem of program evaluation**:

```{block type="warning"}
Given we only observe *one* potential outcome, we cannot compute the treatment effect $\delta_i$ for any individual $i$.
```
<br>
That's pretty dire news. Let's see if we can do better with an average effect instead. Let's define three *average* effects of interest:

1. the Average Treatment Effect (ATE): $$\delta^{ATE} = E[\delta_i] = E[Y_i^1] - E[Y_i^0]$$
1. the Average Treatment Effect on the Treated (ATT): $$\delta^{ATT} = E[\delta_i|D_i = 1] = E[Y_i^1|D_i = 1] - E[Y_i^0|D_i = 1]$$
1. the Average Treatment Effect on the Untreated (ATU): $$\delta^{ATU} = E[\delta_i|D_i = 0] = E[Y_i^1|D_i = 0] - E[Y_i^0|D_i = 0]$$

Notice that *none* of those can be computed from data either, because all of them require data on individual $i$ from *both* scenarios. Let's focus on the ATE for now. Fundamentally we face a **missing data problem**: either $Y_i^1$ or $Y_i^0$ is missing from our dataset. Nevertheless, let's set up the following *naive* simple difference in means estimator $\hat{\delta}$:

\begin{align}
\hat{\delta} =& E[Y_i^1|D_i = 1] - E[Y_i^0|D_i = 0]\\
             =& \frac{1}{N_T} \sum_{i \in T}^{N_T} Y_i - \frac{1}{N_C} \sum_{j \in C}^{N_C} Y_j (\#eq:SDO)
\end{align}

in other words, we simply take the difference between the mean observed outcomes in the treatment (T) and control (C) groups, where $N_T$ and $N_C$ denote the number of people in the treatment and control group, respectively.
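
A toy example may help to fix ideas. In the sketch below we pretend to know *both* potential outcomes for six made-up individuals, something that is never possible with real data; all numbers and object names are invented for illustration.

```{r}
# Toy data: BOTH potential outcomes are known here, which never happens in practice.
po_toy <- data.frame(
  y1 = c(10, 8, 9, 5, 4, 6),   # outcome if treated
  y0 = c( 6, 5, 7, 4, 4, 5),   # outcome if not treated
  d  = c( 1, 1, 1, 0, 0, 0)    # treatment indicator
)
po_toy$delta <- po_toy$y1 - po_toy$y0                               # individual effects
po_toy$y     <- po_toy$d * po_toy$y1 + (1 - po_toy$d) * po_toy$y0   # observed outcome

ate <- mean(po_toy$delta)
att <- mean(po_toy$delta[po_toy$d == 1])
atu <- mean(po_toy$delta[po_toy$d == 0])
# naive estimator: uses observed outcomes only
sdo <- mean(po_toy$y[po_toy$d == 1]) - mean(po_toy$y[po_toy$d == 0])

round(c(ATE = ate, ATT = att, ATU = atu, SDO = sdo), 2)
```

Here the ATT is 3, the ATE is about 1.83 and the ATU is about 0.67, while the naive difference in means is about 4.67. It overshoots all three because, in this made-up data, the treated individuals have both larger gains and a higher non-treatment outcome than the untreated.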

Now let's consider what randomly choosing people for treatment does. The key consideration here is that the true $\delta_i$ is potentially different for each person. That is, some people will have a high effect of treatment, while others may have a small (or even negative!) effect. To learn about the true $\delta^{ATE}$ from our naive $\hat{\delta}$, it matters who ends up being treated! 

Imagine that individuals have at least some partial knowledge about their likely *gains from treatment*, i.e. their personal $\delta_i$. If those who expect to benefit a lot will select disproportionately into treatment, then our estimator $\hat{\delta}$ will be biased upwards for the true average effect $\delta^{ATE}$. This is so because the average of observed outcomes in the treatment group, i.e.

$$
\frac{1}{N_T} \sum_{i \in T}^{N_T} Y_i
$$

will be **too high**. It represents the disproportionately *high* treatment outcome $Y_i^1$ for all those who *anticipated* such a high outcome from treatment, and who therefore were particularly eager to get selected into treatment. It's not *representative* of the true population wide treatment outcome $E[Y_i^1]$.

Here is where randomization comes into play. Suppose we now flip a coin for each person to determine whether they obtain treatment or not. This takes away from them the possibility to select on expected gains into treatment. Crucially, the distribution of effects $\delta_i$ is still the same in the study population, i.e. there are still people with high and people with low effects. But we have solved the missing data problem mentioned above, because whether $Y_i^1$ or rather $Y_i^0$ is observed for each $i$ is now **random**, and no longer a function of any other factor that $i$ could act upon! Hooray!
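
The contrast between self-selection and a coin flip can be sketched in a short simulation. Everything below is an assumption chosen for illustration: gains are normally distributed around a true ATE of 2, and under self-selection exactly those people whose gain exceeds 2 opt into treatment.

```{r}
set.seed(42)
n_pop <- 100000
gain  <- rnorm(n_pop, mean = 2, sd = 1)    # individual effect delta_i, true ATE = 2
po_y0 <- rnorm(n_pop, mean = 10, sd = 1)   # outcome without treatment
po_y1 <- po_y0 + gain                      # outcome with treatment

# naive difference in means for a given treatment indicator d
naive_sdo <- function(d) {
  y_obs <- d * po_y1 + (1 - d) * po_y0     # only one outcome is observed
  mean(y_obs[d == 1]) - mean(y_obs[d == 0])
}

d_selected <- as.numeric(gain > 2)                   # people with high gains opt in
d_random   <- rbinom(n_pop, size = 1, prob = 0.5)    # coin flip

round(c(true_ATE       = mean(gain),
        self_selection = naive_sdo(d_selected),
        randomized     = naive_sdo(d_random)), 2)
```

Under these assumptions the naive estimator should clearly overstate the ATE when people select on their gains, and should be close to 2 when treatment is assigned by coin flip.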

Notice how this links back to our initial discussion about DAGs above. Randomisation essentially removes the arrow from confounder $z$ into $x$ in figure \@ref(fig:dag41): who gets treated no longer depends on $z$.

## Omitted Variable Bias and DAGs

We want to revisit the underlying assumptions of the classical model outlined in \@ref(class-reg) in the previous chapter, which is closely related to the previous discussion. Let's talk a bit more about assumption number 2 of the definition in \@ref(class-reg). It said this:

```{block type='warning'}
The mean of the residuals conditional on $x$ should be zero, $E[\varepsilon|x] = 0$. This means that $Cov(\varepsilon,x) = 0$, i.e. that the errors and our explanatory variable(s) should be *uncorrelated*. We want $x$ to be **strictly exogenous** to the model.
```
<br>

Let us start again with

\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i (\#eq:DGP-h)
\end{equation}

and imagine it represents the data generating process (DGP) of the impact of $x$ on $y$. Writing down this equation is tightly linked to drawing this DAG from above:

```{r dag4,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "The same simple DAG showing the causal impact of $x$ on $y$.",echo = FALSE}
d1
```

The role of $\varepsilon_i$ in equation \@ref(eq:DGP-h) is to allow for random variability in the data not captured by our model, almost as an acknowledgement that we would never be able to *fully* explain $y_i$ with our necessarily simple model. However, assumption $E[\varepsilon|x] = 0$ (or $Cov(\varepsilon,x) = 0$) makes sure that those other factors are in **no systematic relationship** with our regressor $x$. Why? Well if it *were* the case that another factor $z$ is related to $x$, we could never make our ceteris paribus statements of *holding all other factors fixed, the impact of $x$ on $y$ is $\beta$*. In other words, we'd have a confounder in our regression. 

```{r dag5,echo = FALSE,fig.width=4,fig.height = 4, fig.align = "center",fig.cap = "The same simple DAG where $z$ is a confounder that needs to be controlled for."}
d4
```


Notice, again, that the key here is that if we don't control for $z$, it will form part of the error term. Given the causal link from $z$ to $x$, the resulting composite error $u = \varepsilon + z$ then satisfies $Cov(x,u) = Cov(x,\varepsilon + z) \neq 0$, invalidating our assumption.

### House Prices and Bathrooms

Let's imagine that equation \@ref(eq:DGP-h) represents the impact of number of bathrooms ($x$) on the sales price of houses ($y$). We run OLS as

$$
y_i = b_0 + b_1 x_i + e_i 
$$ 

and find a positive impact of bathrooms on house prices:

```{r housing,echo=TRUE}
data(Housing, package="Ecdat")
hlm = lm(price ~ bathrms, data = Housing)
summary(hlm)
```

In fact, from this you conclude that each additional bathroom increases the sales price of a house by `r options(scipen=999);round(coef(hlm)[2],1)` dollars. Let's see if our assumption $E[\varepsilon|x] = 0$ is satisfied:

```{r,warning=FALSE,message=FALSE}
library(dplyr)
# add residuals to the data
Housing$resid <- resid(hlm)
Housing %>%
  group_by(bathrms) %>%
  summarise(mean_of_resid=mean(resid))
```

Oh, that doesn't look good. Even though the unconditional mean of the residuals $e$ is *very* close to zero (type `mean(resid(hlm))`!), this doesn't seem to hold at all within categories of $x$. This indicates that there is something in the error term $e$ which is *correlated* with `bathrms`. Going back to our discussion about *ceteris paribus* in section \@ref(ceteris), we stated that the interpretation of our OLS slope estimate is that 

```{block,type="tip"}
Keeping everything else fixed at the current value, what is the impact of $x$ on $y$? *Everything* also includes things in $\varepsilon$ (and, hence, $e$)!
```
<br>
It looks like our DGP in \@ref(eq:DGP-h) is the *wrong model*. Suppose instead, that in reality sales prices are generated like this:

\begin{equation}
y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i (\#eq:DGP-h2)
\end{equation}

This would now mean that by running our regression, informed by the wrong DGP, what we estimate is in fact this:
$$
y_i = b_0 + b_1 x_i + (b_2 z_i + e_i)  = b_0 + b_1 x_i + u_i.
$$ 
This is to say that by *omitting* variable $z$, we relegate it to a new error term, here called $u_i = b_2 z_i + e_i$. Our assumption above states that *all regressors need to be uncorrelated with the error term* - so, if $Corr(x,z)\neq 0$, we have a problem. Let's take this idea to our running example.


### Including an Omitted Variable

What we are discussing here is called *Omitted Variable Bias*. There is a variable which we omitted from our regression, i.e. we forgot to include it. It is often difficult to find out what that variable could be, and you can go a long way by just reasoning about the data-generating process. In other words, do you think it's *reasonable* that price be determined by the number of bathrooms only? Or could there be another variable, omitted from our model, that is important to explain prices, and at the same time correlated with `bathrms`? 

Let's try with `lotsize`, i.e. the size of the area on which the house stands. Intuitively, larger lots should command a higher price; at the same time, however, larger lots imply more space, hence, you can also have more bathrooms! Let's check this out:


```{r,echo=FALSE}
options(scipen=0)
hlm2 = update(hlm, . ~ . + lotsize)
summary(hlm2)
options(scipen=999)
```

Here we see that the estimate for the effect of an additional bathroom *decreased* from `r round(coef(hlm)[2],1)` to `r round(coef(hlm2)[2],1)` by almost 5000 dollars! Well that's the problem then. `r options(scipen=999)`We said above that one more bathroom is worth `r round(coef(hlm)[2],1)` dollars - if **nothing else changes**! But that doesn't seem to hold, because we have seen that as we increase `bathrms` from `1` to `2`, the mean of the resulting residuals changes quite a bit. So there **is something in $\varepsilon$ which does change**, hence, our conclusion that one more bathroom is worth `r round(coef(hlm)[2],1)` dollars is in fact *invalid*! 

The way in which `bathrms` and `lotsize` are correlated is important here, so let's investigate that:


```{r, fig.align='center', fig.cap='Distribution of `lotsize` by `bathrms`',echo=FALSE}
options(scipen=0)
h = subset(Housing,lotsize<13000 & bathrms<4)
h$bathrms = factor(h$bathrms)
ggplot(data=h,aes(x=lotsize,color=bathrms,fill=bathrms)) + geom_density(alpha=0.2,size=1) + theme_bw()
```

This shows that `lotsize` and the number of bathrooms are indeed positively related: the larger the lot, the more bathrooms. This leads to a general result:

```{block type='note'}
**Direction of Omitted Variable Bias**

If the omitted variable $z$ is correlated with $x$ in the same direction as it affects $y$ (both positive or both negative), we will observe upward bias in our estimate of $b_1$, and downward bias if the two go in opposite directions. In other words, we have positive bias if $\beta_2 Cov(x,z) > 0$ and vice versa.
```
<br>
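
The direction of the bias follows from the textbook omitted variable bias formula: the slope from the short regression (without $z$) equals the slope from the long regression (with $z$) plus the coefficient on $z$ times the slope of a regression of $z$ on $x$. The following sketch checks this identity on simulated data; the data-generating process and all object names are assumptions for illustration and have nothing to do with the `Housing` data.

```{r}
set.seed(7)
n_ovb <- 5000
ovb_x <- rnorm(n_ovb)
ovb_z <- 0.8 * ovb_x + rnorm(n_ovb)                   # z positively related to x
ovb_y <- 1 + 2 * ovb_x + 1.5 * ovb_z + rnorm(n_ovb)   # z raises y

short_fit <- lm(ovb_y ~ ovb_x)           # omits z
long_fit  <- lm(ovb_y ~ ovb_x + ovb_z)   # includes z
aux_fit   <- lm(ovb_z ~ ovb_x)           # how z moves with x

b1_short <- unname(coef(short_fit)["ovb_x"])
b1_long  <- unname(coef(long_fit)["ovb_x"])
b2_long  <- unname(coef(long_fit)["ovb_z"])
slope_zx <- unname(coef(aux_fit)["ovb_x"])

# the short slope equals the long slope plus the bias term
c(short = b1_short, long_plus_bias = b1_long + b2_long * slope_zx)
```

Both entries should coincide, and since $z$ is positively related to $x$ and raises $y$ in this sketch, the short regression overstates the effect of $x$.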




================================================
FILE: 08-STAR.Rmd
================================================
# STAR Experiment {#STAR}

How to best allocate spending on schooling is an important question. What's the impact of spending money to finance smaller classrooms on student performance and outcomes, both in the short and in the long run? A vast literature in economics is concerned with this question, and for a long time there was no consensus.

The big underlying problem in answering this question is that we do not really know how student outcomes are *produced*. In other words, what makes a successful student? Is it the quality of their teacher? Surely that matters. Is it the quality of the school building? Could be. Is it that the other pupils are of high quality and this somehow *rubs off* on weaker pupils? Also possible. What about parental background? Sure. You see that there are many potential channels that could determine student outcomes. What is more, there could be several interdependencies amongst those factors. Here's a DAG!

```{r star1,echo = FALSE,warning = FALSE,message = FALSE,fig.align = "center",fig.cap = "Possible Channels determining student outcomes. Dashed arrows represent potentially unobserved links."}
library(ggdag)
library(dplyr)
p1 = dagify(outcome ~ teacher,
            size ~ building,
            outcome ~ building,
            outcome ~ peers,
            outcome ~ size,
            peers ~ size,
            outcome ~ parents,
       labels = c("teacher" = "teacher quality", 
                  "building" = "building quality",
                  "size" = "class size",
                  "peers" = "quality of peers",
                  "parents" = "parental background",
                  "outcome" = "student outcome"
                  ),
       outcome = "outcome") %>%
    tidy_dagitty() %>% 
    mutate(linetype = if_else(name %in% c("peers","parents","teacher"), "dashed","solid")) %>% 
    ggplot(aes(x = x, y = y, xend = xend, yend = yend)) + 
    geom_dag_point() + 
    geom_dag_edges(aes(edge_linetype = linetype), show.legend = FALSE) +
    geom_dag_label_repel(aes(label=label)) + theme_dag()
p1
```

We will look at an important paper in this literature now, which used a randomized experiment to make some substantial progress in answering the question *what is the production function for student outcomes*. We will study @krueger1999, which analyses the Tennessee Student/Teacher Achievement Ratio Experiment, STAR in short.

## The STAR Experiment

Starting in 1985-1986 and lasting for four years, young pupils starting kindergarten *and their teachers* were randomly allocated to one of several possible groups:

1. small classes with 13-17 students
2. regular classes with 22-25 students
3. regular classes with 22-25 students but with an additional full-time teaching aide.

The experiment involved about 6000 students per year, for a total of 11,600 students from 80 schools. Each school was required to have at least one class of each type above, and random assignment happened *at the school level*, i.e. within each school. At the end of each school grade (kindergarten and grades 1 through 3) the pupils were given a standardized test. Now, looking back at figure \@ref(fig:star1), what are the complications when we'd like to assess the impact of *class size* on student outcome? Put differently, why can't we just look at observational data of all schools (absent any experiment!), group classes by their size, and compute the mean outcomes for each group? Here is a short list:

1. There is selection into schools with different sized classes. Suppose parents have a prior that smaller classes are better - they will try to get their kids into those schools.
1. Relatedly, who ends up being in the classroom with a child could matter (peer effects). So, if high quality kids are sorting into schools with small classes, and if peer effects are strong, we could conclude that small classes improved student outcomes when in reality this was due to the high quality of peers in class.
1. Also related, teachers could sort towards schools with smaller classes because it's easier to teach a small rather than a large class, and if there is competition for those places, higher quality teachers will have an advantage.

Now, what can STAR do for us here? There will still be selection into schools; however, once a school has been selected, it is random whether one ends up in a small or a large class. So, the quality of peers present in the school (determined before the experiment through school choice) will be similar across small and big groups. In figure \@ref(fig:star1), you see that some factors are drawn as unobserved (dashed arrow), and some are observed (solid). In any observational dataset, the dashed arrows would be really troubling. Here, given randomisation into class sizes, *we don't care* whether those factors are unobserved or not: it's reasonable to assume that across randomly assigned groups, the distribution of each of those factors is roughly the same! If we *can* in fact proxy some of those factors (suppose we had data on teacher qualifications), even better, but this is not necessary to identify the causal effect of class size.

## PO as Regression

Before we start replicating the findings in @krueger1999, let's augment our potential outcomes (PO) notation from the previous chapter. To remind you, we had defined the PO model in equation \@ref(eq:rubin-model):

\begin{equation*}
Y_i = D_i Y_i^1 + (1-D_i)Y_i^0 
\end{equation*}

and we had defined the treatment effect of individual $i$ as in \@ref(eq:TE):

\begin{equation*}
\delta_i = Y_i^1 - Y_i^0. 
\end{equation*}

Now, as a start, let's assume that the treatment effect of *small class* is identical for all $i$: in that case we have

\begin{equation*}
\delta_i = \delta ,\forall i
\end{equation*}

Next, let's distribute the $Y_i^0$ in \@ref(eq:rubin-model) as follows:

\begin{align*}
Y_i &= Y_i^0 + D_i (Y_i^1 - Y_i^0 )\\
    &= Y_i^0 + D_i \delta  
\end{align*}

Finally, let's add $E[Y_i^0] - E[Y_i^0]=0$ to the RHS of that last equation to get

\begin{equation*}
Y_i = E[Y_i^0] + D_i \delta + Y_i^0 - E[Y_i^0]  
\end{equation*}

which we can rewrite in our well-known regression format 

\begin{equation}
Y_i = b_0 + \delta D_i  + u_i  (\#eq:PO-reg)
\end{equation}

In that formulation, the first $E[Y_i^0]$ is the average non-treatment outcome, which we can regard as some sort of baseline, i.e. our intercept $b_0$. $\delta$ is the coefficient on the binary treatment indicator. The random deviation $Y_i^0 - E[Y_i^0]$ is the residual $u_i$. Only under very specific circumstances will the OLS estimator $\hat{\delta}$ identify the true Average Treatment Effect $\delta^{ATE}$. Random assignment ensures that the crucial assumption $E[u|D] = E[Y_i^0 - E[Y_i^0]|D] = E[Y_i^0|D] - E[Y_i^0] = 0$ holds; in other words, there is no difference in non-treatment outcomes across treatment groups. Additionally, we could easily include regressors $X_i$ in equation \@ref(eq:PO-reg) to account for additional variation in the outcome.
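
As a quick check of equation \@ref(eq:PO-reg), the sketch below simulates a randomized experiment with a constant treatment effect and compares the OLS slope on the treatment dummy with the simple difference in means. The effect size of 5, the distribution of $Y_i^0$ and all object names are assumptions for illustration only.

```{r}
set.seed(123)
n_rct  <- 20000
rct_y0 <- rnorm(n_rct, mean = 50, sd = 10)       # non-treatment outcome
rct_d  <- rbinom(n_rct, size = 1, prob = 0.5)    # random assignment
rct_y  <- rct_y0 + 5 * rct_d                     # constant effect delta = 5

rct_fit   <- lm(rct_y ~ rct_d)
diff_mean <- mean(rct_y[rct_d == 1]) - mean(rct_y[rct_d == 0])

# the OLS slope on the dummy is exactly the difference in group means
c(ols = unname(coef(rct_fit)["rct_d"]), diff_in_means = diff_mean)
```

With a single binary regressor the two numbers are identical by construction, and under random assignment both should be close to the assumed effect of 5.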

With that out of the way, let's write down the regression that @krueger1999 wants to estimate. Equation (2) in his paper reads like this:

\begin{equation}
Y_{ics} = \beta_0 + \beta_1 \text{small}_{cs} + \beta_2 \text{REG/A}_{cs} + \beta_3 X_{ics} + \alpha_s + \varepsilon_{ics} (\#eq:krueger2)
\end{equation}

where $i$ indexes pupil, $c$ is class id and $s$ is the school id. $\text{small}_{cs}$ and $\text{REG/A}_{cs}$ are both dummy variables equal to one if class $c$ in school $s$ is either *small*, or *regular with aide*. $X_{ics}$ contains student specific controls (like gender). Importantly, given that randomization was at the school level, we control for the identity of the school with a school fixed effect $\alpha_s$. 
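
To see why the school fixed effect can matter when assignment is random only *within* schools, consider the following simulation sketch. It is not based on the STAR data: the number of schools, the true effect of 5, and in particular the assumption that better schools assign a larger share of pupils to small classes are all made up for illustration.

```{r}
set.seed(2024)
n_school      <- 40
n_per         <- 200
fe_school     <- rep(seq_len(n_school), each = n_per)   # school id of each pupil
school_effect <- rnorm(n_school, sd = 10)                # alpha_s
# assumption: better schools happen to run more small classes
p_small  <- plogis(school_effect / 10)                   # assignment probability by school
fe_small <- rbinom(n_school * n_per, size = 1, prob = p_small[fe_school])
fe_score <- 50 + 5 * fe_small + school_effect[fe_school] +
  rnorm(n_school * n_per, sd = 10)

no_fe   <- lm(fe_score ~ fe_small)                       # ignores the school
with_fe <- lm(fe_score ~ fe_small + factor(fe_school))   # school dummies

round(c(no_fe   = unname(coef(no_fe)["fe_small"]),
        with_fe = unname(coef(with_fe)["fe_small"])), 2)
```

Under these assumptions the regression without school dummies mixes the class size effect with differences between schools, whereas the fixed effects regression only compares pupils within the same school and should be close to 5.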

Before we proceed to run this regression, we need to define the outcome variable $Y_{ics}$. @krueger1999 combines the various SAT test scores in an average score for each student in each grade. However, given that the SAT scores are on different scales, he first computes a ranking of all scores for each subject (reading or math), and then assigns to each student their percentile in the rank distribution. The highest score is 100, the lowest score is 0.

## Implementing STAR

Let's start with computing the ranking of grades. Let's load the data and the `data.table` package:

```{r,message = FALSE}
data("STAR", package = "AER")
library(data.table)
x = as.data.table(STAR)
x
```

It's a bit unfortunate to switch to data.table, but I haven't been able to do what I wanted in dplyr :-( . Ok, here goes. You can see that this data set is *wide*. The first thing we want to do is to make it *long*, i.e. reshape it so that it has 4 ID columns, followed by several measurement columns. To start, let's add a student ID:

```{r}
x[ , ID := 1:nrow(x)]  # add a column called `ID`
```

```{r}
# to `melt` a data.table means to dissolve it and reassemble it around some ID variables

mx = melt.data.table(x, 
                     id = c("ID","gender","ethnicity","birth"), 
                     measure.vars = patterns("star*","read*","math*", "schoolid*",
                                             "degree*","experience*","tethnicity*","lunch*"), 
                     value.name = c("classtype","read","math","schoolid","degree",
                                    "experience","tethniticy","lunch"),
                     variable.name = "grade")

levels(mx$grade) <- c("stark","star1","star2","star3")  # reassign levels to grade factor
mx[,1:8]  # show first 8 cols

```
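
If the reshape above looks opaque, here is the same idea on a tiny made-up table with two pupils, two grades and two subjects. The object names `wide_toy` and `long_toy` and all scores are invented for illustration; the call is a simplified sketch of the one used above.

```{r}
library(data.table)
# toy wide table: one row per pupil, one column per grade and subject
wide_toy <- data.table(ID = 1:2,
                       readk = c(400, 420), read1 = c(450, 470),
                       mathk = c(380, 410), math1 = c(430, 460))

long_toy <- melt(wide_toy,
                 id.vars = "ID",
                 measure.vars = patterns("^read", "^math"),  # one pattern per output column
                 value.name = c("read", "math"),
                 variable.name = "grade")
# with several patterns, `grade` is a factor with levels "1", "2", ...
levels(long_toy$grade) <- c("k", "1")
long_toy
```

Each pupil now appears once per grade, with one column per subject, which is the layout the rest of this section works with.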

You can see here that, for example, pupil `ID=1` was not present in kindergarten, but joined later. We will only keep complete records, hence we drop those NAs:

```{r}
mx <- mx[complete.cases(mx)]
mx[ID==2]  # here is pupil number 2
```

Ok, now on to standardizing those `read` and `math` scores. You can see they are on their own, rather arbitrary, SAT scales:

```{r}
mx[,range(read)]
```

First thing to do is to create an empirical cdf of each of those scores within a certain grade. That is the *ranking* of scores from 0 to 1:

```{r,message = FALSE, results='hide'}
setkey(mx, classtype)  # key mx by class type
ecdfs = mx[classtype != "small",        # reference group: regular classes, with or without aide
    list(readcdf = list(ecdf(read)),    # create cols readcdf and mathcdf
         mathcdf = list(ecdf(math))
         ),
         by = grade]    # by grade

# let's look at those cdf!
om = par("mar")
par(mfcol=c(4,2),mar = c(2,om[2],2.5,om[4]))
ecdfs[,.SD[,plot(mathcdf[[1]],main = paste("math ecdf grade",.BY))],by = grade]
ecdfs[,.SD[,plot(readcdf[[1]],main = paste("read ecdf grade",.BY))],by = grade]
par(mfcol=c(1,1),mar = om)
```         

You can see here how the cdf maps SAT scores (650, for example), into the interval $[0,1]$. Now, in the `ecdfs` `data.table` object, the `readcdf` column contains a *function* (a cdf) for each grade. We can evaluate the observed test scores for each student in that function to get their ranking in $[0,1]$, by grade:

```{r gradedens, fig.cap = "Reproducing Figure I in @krueger1999",fig.align="center"}
setkey(ecdfs, grade)  # key ecdfs according to `grade`
setkey(mx,grade)

z = mx[ , list(ID,perc_read = ecdfs[(.BY),readcdf][[1]](read),
               perc_math = ecdfs[(.BY),mathcdf][[1]](math)),
        by=grade]   # stick `grade` into `ecdfs` as `.BY`

z[,score := rowMeans(.SD)*100, .SDcols = c("perc_read","perc_math")]  # take average of scores
# and multiply by 100, so it's comparable to Krueger

# merge back into main data
mxz = merge(mx,z,by = c("grade","ID"))

# make a plot
ggplot(data = mxz, mapping = aes(x = score,color=classtype)) + geom_density() + facet_wrap(~grade) + theme_bw()
```
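
In case the role of `ecdf` in the chunks above is unclear, here is a minimal sketch with four made-up reference scores. The names `ref_scores` and `toy_cdf` are invented for illustration.

```{r}
ref_scores <- c(400, 450, 500, 550)   # made-up reference scores
toy_cdf <- ecdf(ref_scores)           # ecdf() returns a *function*
toy_cdf(c(390, 500, 600))             # share of reference scores <= each value
```

This returns 0, 0.75 and 1: a score below all reference scores is ranked at the bottom, 500 is at least as high as three of the four reference scores, and 600 beats all of them. The code above does the same thing grade by grade, with the pupils in regular classes as the reference group.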

You can compare figure \@ref(fig:gradedens) to @krueger1999 figure 1. The density estimates are almost identical; the discrepancy comes mainly from the fact that we additionally split the regular classes into those with and without an aide.

```{r kruegerdens,echo=FALSE, fig.cap = "Outcome densities, @krueger1999 figure 1."}
knitr::include_graphics("images/krueger1.png")
```

So far, so good! Now we can move to run a regression and estimate \@ref(eq:krueger2).

```{r}
# create Krueger's dummy variables
mxz =  as_tibble(mxz) %>%
    mutate(small = classtype == "small",
           rega  = classtype == "regular+aide",
           girl  = gender == "female",
           freelunch = lunch == "free")

# reproduce columns 1-3
m1 = mxz %>% 
    group_by(grade) %>%
    do(model = lm(score ~ small + rega, data = .))
m2 = mxz %>% 
    group_by(grade) %>%
    do(model = lm(score ~ small + rega + schoolid, data = .))
m3 = mxz %>% 
    group_by(grade) %>%
    do(model = lm(score ~ small + rega + schoolid + girl + freelunch, data = .))

# get school id names to omit from regression tables
school_co = grep(names(coef(m2[1,]$model[[1]])),pattern = "schoolid*",value=TRUE)
# combine with the school coefficients of m3 before dropping duplicates
school_co = c(unique(c(school_co,grep(names(coef(m3[1,]$model[[1]])),pattern = "schoolid*",value=TRUE))),"schoolid77")
```


Now let's look at each grade's models. 

```{r}
h = list()
for (g in unique(mxz$grade)) {
    h[[g]] <- huxtable::huxreg(subset(m1,grade == g)$model[[1]],
                 subset(m2,grade == g)$model[[1]],
                 subset(m3,grade == g)$model[[1]],
                 omit_coefs = school_co,  
                 statistics = c(N = "nobs", R2 = "r.squared"),
                 number_format = 2
        ) %>% 
    huxtable::insert_row(c("School FE","No","Yes","Yes"),after = 11) %>%
    huxtable::theme_article() %>%
    huxtable::set_caption(paste("Estimates for grade",g)) %>%
    huxtable::set_top_border(12, 1:4, 2)
}
h$stark
h$star1
h$star2
h$star3
```

You should compare those to table 5 in @krueger1999, in the columns labelled *OLS: actual class size*. For the most part, we come quite close to his estimates! We did not follow his more sophisticated error structure (which allows errors to be correlated at the classroom level), and we seem to have a different number of individuals in each year. Here is his table 5:

```{r krug-table,echo=FALSE,fig.show = "hold", fig.align = "default"}
knitr::include_graphics("images/krueger2.png",dpi = 300) 
knitr::include_graphics("images/krueger3.png",dpi = 300)
knitr::include_graphics("images/krueger4.png",dpi = 300) 
knitr::include_graphics("images/krueger5.png",dpi = 300)
```

So, based on those results we can say that attending a small class raises students' test scores by about 5 percentage points. Unfortunately, says @krueger1999, it is hard to gauge whether those are big or small effects: how important is it to score 5% more or less? Well, you might say, it depends on how close you are to an important cutoff value (maybe entrance to the next school level requires a score of `x`, and the 5% boost would have made that school feasible). Be that as it may, now you know more about one of the most influential papers in education economics, and why using an experimental setup allowed it to achieve credible causal estimates.



================================================
FILE: 09-RDD.Rmd
================================================
# Regression Discontinuity Design {#RDD}

In the previous chapter we saw how an experimental setup can be useful to recover causal effects from an OLS regression. In this chapter we will look at a similar approach where we don't randomly allocate subjects to either treatment or control (maybe because that's impossible to do in that particular situation), but where we can *zoom in* on a group of individuals for whom allocation to treatment is **as good as random**, and hence not influenced by selection bias. The idea is called Regression Discontinuity Design, or RDD for short.

## RDD Setup

Let's again start with a DAG for the main idea. Remember the numerical example in the set of slides on randomization, where we showed that if we know the allocating mechanism, we can recover the true ATE. RDD plays along those lines, in that we *know* how individuals got assigned to treatment. As is the case in many real-life situations, people are eligible for some treatment if some value **crosses a threshold**:

* You are eligible to obtain a driving license as your age crosses the 18-year threshold.
* You will receive pension benefits as your age crosses the 65-year threshold.
* You are liable to criminal charges if you are caught with more than 3g of marijuana in your pocket.
* You are considered a subprime mortgage borrower if your loan-to-value ratio is above 95%.

In RDD parlance, we call the particular variable we are looking at the *running variable* (age, quantity of marijuana, LTV ratio, etc.). If we know the applicable threshold (18 years of age, say), and we know an individual's age, then it's trivial to figure out whether they were eligible to get a driving license. Let's formalize this a bit.

Let's call the running variable $x$ and the outcome $y$ as usual, let $D$ be the treatment indicator, and let us define a *threshold* value $c$. Treatment for individual $i$ will be such that

\begin{equation*}
D_i = \begin{cases}\begin{array}{c}1\text{ if }x_i > c \\
                  0\text{ if }x_i \leq c. \end{array}
                  \end{cases}
\end{equation*}
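
The assignment rule above is mechanical, so it is easy to write down in code. The following is a minimal sketch with made-up ages; the names `c_cut`, `x` and `D` are chosen here for illustration only and do not refer to any dataset used in this book.

```r
# Hypothetical example: eligibility for a driving license at age 18
c_cut <- 18                          # threshold c
x <- c(16.5, 17.9, 18.0, 18.1, 25)   # running variable: age in years (made up)
D <- as.integer(x > c_cut)           # treatment indicator: D_i = 1 if x_i > c, else 0
data.frame(x = x, D = D)
```

Notice that the individual with $x_i$ exactly equal to $c$ is untreated under this definition, because the inequality is strict.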

Here's the obligatory DAG!

```{r rdd1,echo = FALSE,warning = FALSE,message = FALSE,fig.align = "center",fig.cap = "DAG for a simple RDD design: $x$ determines treatment via the cutoff $c$"}
library(ggdag)
dagify(y ~ x,
       c ~ x,
       D ~ c,
       y ~ D) %>% 
    ggdag(layout = "circle") + theme_dag()
```

The key idea can be gleaned from figure \@ref(fig:rdd1): if we *know* who ends up in treatment $D$, this can be useful for us to recover the true ATE. In particular, the idea is going to be to compare individuals who are *close* to the threshold $c$: Those with an $x$ *just above* the threshold should be comparable (in terms of their $x$!) to the ones *just below* $c$. Someone who has their 18th birthday next week is almost identical to someone who had their 18th birthday last week - in terms of age! So, computing our naive difference in mean outcomes for those narrowly defined groups should be a good approximation to a random allocation. Notice that there are two important things to keep in mind:

1. None of the other variables in the model should exhibit any discontinuity at $c$, other than $D$!
2. We obtain only *locally* valid identification of the ATE: as we move further and further away from the threshold, our individuals will cease to be really comparable.
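
To see what *locally valid* means in practice, here is a small simulation sketch. Everything in it is assumed for illustration: the data generating process, the true effect `delta = 2` and the window half-widths `h` are invented and do not come from any study. The outcome depends smoothly on $x$ with slope 3 and jumps by `delta` at $c$, and we compute the naive difference in means within windows of shrinking width around the threshold.

```r
set.seed(1)
n     <- 10000
c_cut <- 0                              # threshold c
delta <- 2                              # true treatment effect (assumed)
x <- runif(n, -1, 1)                    # running variable
D <- as.integer(x > c_cut)              # sharp assignment rule
y <- 1 + 3 * x + delta * D + rnorm(n)   # smooth in x, jumps by delta at c

# naive difference in mean outcomes within a window of half-width h around c
local_diff <- function(h) {
  near <- abs(x - c_cut) <= h
  mean(y[near & D == 1]) - mean(y[near & D == 0])
}

sapply(c(wide = 1, medium = 0.25, narrow = 0.05), local_diff)
```

In this particular design the naive difference is biased by $3h$ in expectation, so the estimate approaches `delta` as the window shrinks. The price is that fewer observations fall into a narrow window, which makes the estimate noisier.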

In our DAG, point 1 above is not currently visible. Let's augment this:

```{r rdd2,echo = FALSE,warning = FALSE,message = FALSE,fig.align = "center",fig.cap = "Augmented DAG for a simple RDD design: $x$ determines treatment via the cutoff $c$"}
library(ggdag)
dagify(y ~ x,
       c ~ x,
       D ~ c,
       y ~ z,
       y ~ D) %>% 
    ggdag(layout = "circle") + theme_dag()
```

So, the condition we want is that the additional explanatory variable $z$ does **not** suddenly jump as $x$ crosses $c$. Because we will be comparing the mean outcomes of people slightly to the left and right of $c$, we need to make sure that there is nothing that would *confound* our estimate of the size of the treatment effect, which we denote $\delta$. Let's look at an example of a recent RDD study now.
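
Before turning to that example, here is a sketch of how one might check this condition on data. It is only an illustration under invented assumptions: the covariates `z_smooth` and `z_jump`, the half-width `h = 0.25` and the helper `jump_at_c()` are all hypothetical. The helper fits a separate line on each side of $c$ within the window and reports the estimated jump in the covariate at the threshold.

```r
set.seed(2)
n     <- 10000
c_cut <- 0
x <- runif(n, -1, 1)                      # running variable
D <- as.integer(x > c_cut)                # sharp assignment rule
z_smooth <- 0.5 * x + rnorm(n)            # covariate that varies smoothly with x
z_jump   <- 0.5 * x + 1 * D + rnorm(n)    # covariate that jumps by 1 at c (violates point 1)

# estimated discontinuity of a covariate v at c, using observations within h of c
jump_at_c <- function(v, h = 0.25) {
  near <- abs(x - c_cut) <= h
  fit  <- lm(v ~ D * I(x - c_cut), subset = near)  # separate intercept and slope on each side
  unname(coef(fit)["D"])                           # coefficient on D is the jump at x = c
}

c(smooth = jump_at_c(z_smooth), jump = jump_at_c(z_jump))
```

An estimated jump close to zero is what we hope to see for every covariate. A clearly non-zero jump, as for `z_jump`, would mean that something other than $D$ changes at the threshold and could confound the estimate.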

## Clicking on Heaven's Door

In a recent paper titled [Clicking on Heaven's Door](https://www.aeaweb.org/articles?id=10.1257/aer.20150355), Bocconi University economist [Paolo Pinotti](https://sites.google.com/view/paolo-pinotti/home) uses an RDD to show the effects of the legal status of immigrants on criminal behaviour. The question of whether immigrants commit more or less crime than others (natives, for example) is a first-order policy question. The question of that paper is what causal impact the *legal status* we confer upon an immigrant has on their propensity to commit a crime one year later. The study is based in Italy, where legal status refers to an official permission to work. The key detail is that the residence permit needs to be sponsored by the immigrant's employer.

### Institutional Details

In the Italian context, immigrants often enter illegally first, and then hope to obtain a residence permit through an employer later on. There is a quota system in place, which establishes how many permits are to be granted to immigrants of each nationality, and in which Italian industries (construction, services, etc.). See table \@ref(fig:pin1) for an overview of those quotas.

```{r pin1,echo=FALSE,fig.cap="Table 1 from @pinotti.",fig.align="center"}
knitr::include_graphics("images/pinotti1.png")
```

<!-- There are type A and type B permits. Type A permits are mainly used by families and individuals to obtain permits for domestic employment, but those are open to fraudulent behaviour - i.e. it's easy to get a friend with a permit to say that they would be ones employer. Type B permits are for actual firms. -->
Almost all of the estimated 650,000 illegal immigrants participate in the click days (the days on which permit applications open, described below). @pinotti is able to link each immigrant to official interior ministry crime records, and is thus able to precisely identify whether an immigrant with a certain legal status shows up in crime records in the year(s) after the click days.

### Discontinuity Feature

The principal feature of the Italian setting which makes it almost perfect for an RDD is the following: The quotas illustrated in \@ref(fig:pin1) are defined for a total of 1751 employer groups (varying by industry and location). Applications for a permit must be submitted online by employers, **starting at 8:00 AM on specific click days**, and permits are given out on a first come, first served basis. This implies that thousands of applicants are denied their permit each year not because they were ineligible (they had an employer sponsoring them!), but because their application arrived online a few seconds too late, when all permits for their specific quota were already gone. Here we formalize as $c$ the *exact* time the quota for a certain group was full, and the running variable $x$ is the *exact* time that the sponsoring employer clicked on the *submit* button on the website. This is measured at the level of milliseconds.
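
To fix ideas, here is how the running variable and the cutoff could be coded for a single quota group. All numbers are invented for illustration and are not taken from the paper; `submit_time`, `quota_full` and `permit` are hypothetical names. Note that in this setting it is the applications submitted *before* the cutoff that are in time for a permit, so the inequality points the other way compared with the generic definition of $D$ above, and the sketch ignores any imperfect compliance with the rule.

```r
# Hypothetical data: seconds elapsed since 8:00 AM at which each application was submitted
submit_time <- c(12.104, 35.870, 59.998, 60.003, 61.250, 300.000)
quota_full  <- 60.000                  # assumed cutoff c: the moment the quota was exhausted

x      <- submit_time - quota_full     # running variable, centred at the cutoff
permit <- as.integer(x < 0)            # 1 if submitted before the quota was full
data.frame(submit_time, x, permit)
```

The third and fourth applications differ by only five milliseconds, yet they end up on opposite sides of the cutoff. Pairs like this are the ones an RDD compares.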

The key observation is now that the exact time at which a certain quota is full, $c$, is impossible to foresee. Even if it is the case that the employers of highly-skilled individuals are the first ones to log on, there is sufficient random variation (slow internet connecti
