Repository: phlippe/UvA_Summaries
Branch: master
Commit: c42eab447ecd
Files: 88
Total size: 781.5 KB

Directory structure:
gitextract_3o3s9o2d/

├── .gitignore
├── Computer_Vision_1/
│   ├── cv_appendix.tex
│   ├── cv_applications.tex
│   ├── cv_deep_learning.tex
│   ├── cv_deep_video.tex
│   ├── cv_imgformation.tex
│   ├── cv_imgprocessing.tex
│   ├── cv_intro.tex
│   ├── cv_object_rec.tex
│   └── cv_summary.tex
├── Deep_Learning/
│   ├── cheat_sheet/
│   │   └── main.tex
│   ├── dl_appendix.tex
│   ├── dl_autoregressive.tex
│   ├── dl_bayesian.tex
│   ├── dl_convnets.tex
│   ├── dl_deep_rl.tex
│   ├── dl_generative_models.tex
│   ├── dl_intro.tex
│   ├── dl_modularity.tex
│   ├── dl_optimization.tex
│   ├── dl_rnn.tex
│   └── dl_summary.tex
├── Information_Retrieval_1/
│   ├── ir_boolean_retrieval.tex
│   ├── ir_click_models.tex
│   ├── ir_counterfactual_eval.tex
│   ├── ir_language_models.tex
│   ├── ir_learning_to_rank.tex
│   ├── ir_neural_models.tex
│   ├── ir_offline_evaluation.tex
│   ├── ir_online_evaluation.tex
│   ├── ir_semantic_matching.tex
│   └── ir_summary.tex
├── Knowledge_Representation/
│   ├── figures/
│   │   └── figures.pptx
│   ├── kr_csp.tex
│   ├── kr_dl.tex
│   ├── kr_intro.tex
│   ├── kr_qr.tex
│   ├── kr_sat.tex
│   └── kr_summary.tex
├── LICENSE
├── ML4QS/
│   ├── mlqs_clustering.tex
│   ├── mlqs_feature_engineering.tex
│   ├── mlqs_intro.tex
│   ├── mlqs_modeling_with_time.tex
│   ├── mlqs_modeling_without_time.tex
│   ├── mlqs_reinforcement_learning.tex
│   ├── mlqs_sensory_noise.tex
│   ├── mlqs_summary.tex
│   └── mlqs_supervised_learning.tex
├── Machine_Learning_1/
│   ├── ml_appendix.tex
│   ├── ml_basic_probability.tex
│   ├── ml_combining_models.tex
│   ├── ml_kernel_methods.tex
│   ├── ml_linear_classification.tex
│   ├── ml_linear_regression.tex
│   ├── ml_neural_networks.tex
│   ├── ml_summary.tex
│   └── ml_unsupervised_learning.tex
├── Machine_Learning_2/
│   ├── ml2_appendix.tex
│   ├── ml2_causality.tex
│   ├── ml2_exponential_family.tex
│   ├── ml2_graphical_models.tex
│   ├── ml2_graphical_models.tex.recover.bak~
│   ├── ml2_sampling_methods.tex
│   ├── ml2_sequential_data.tex
│   ├── ml2_summary.tex
│   └── ml2_variational_EM.tex
├── Natural_Language_Processing_1/
│   ├── nlp_bayesian.tex
│   ├── nlp_compositional_semantic.tex
│   ├── nlp_dialog_modelling.tex
│   ├── nlp_formal_grammars.tex
│   ├── nlp_lexical_distributional_semantics.tex
│   ├── nlp_morphology.tex
│   ├── nlp_pos_tagging.tex
│   ├── nlp_summarization.tex
│   ├── nlp_summary.tex
│   ├── nlp_textual_entailment_paraphrasing.tex
│   └── nlp_translation.tex
├── README.md
└── Reinforcement_Learning/
    ├── rl_appendix.tex
    ├── rl_introduction.tex
    ├── rl_learning_with_approx.tex
    ├── rl_mcts_alpha_go.tex
    ├── rl_model_based.tex
    ├── rl_partially_observable.tex
    ├── rl_policy_gradient_methods.tex
    ├── rl_summary.tex
    └── rl_tabular_methods.tex

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.aux
*.log
*.out
*.synctex.gz
*.toc
*.txss
.DS_Store

================================================
FILE: Computer_Vision_1/cv_appendix.tex
================================================
\section{Practicals}
Gathering some interesting/important questions from the practicals and old exams.
\subsection{Color spaces}
\subsubsection{General parameters in color spaces}
\begin{itemize}
	\item \textbf{Chromaticity}: the color component regardless of its luminance/intensity. For example, the $xy$-diagram in Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ_diagram} visualizes the chromaticity (includes saturation and hue)
	\item \textbf{Saturation}: defined as ``colorfulness of a stimulus relative to its own brightness''. In the normalized $rgb$ space, it is the distance to the point $(1/3,1/3,1/3)$ (ratio to the maximum distance). In case of the wavelength distribution, a color is saturated if it is very peaked.
	\item \textbf{Intensity}: the energy of the light. It is the integral of the wavelength distribution.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.25\textwidth]{figures/cv_image_formation_rg_chromaticity.png}
		\caption{\textit{rg}-chromaticity diagram. A point in this space represents the chromaticity (color without intensity). With a white light source as reference, the saturation is the distance to the point $(1/3,1/3)$ relative to the distance from that point to the border.}
	\end{figure}
\end{itemize}
\subsubsection{XYZ color space}
Calculate saturation, hue, intensity, plotting in the diagram, using reference lights, etc.

Interpolate between colors. We can perceive color (e.g. white) although it is not as we would define 
\subsubsection{Color invariance}
How to determine whether a formula is color invariant or not. 
\begin{itemize}
	\item Color invariance is trying to remove transformations that do not directly affect the color, but let the sensor perceive it differently. 
	\item Hence, color invariant models are more or less insensitive to varying imaging conditions such as variations in illumination (light source) and object pose (shading, highlighting cues)
	\item For example, if we assume a Lambertian world where we only have body reflection and a white light source (equal for all wavelengths), we get for the $rgb$ space (note that $R=\cos\theta \cdot e\cdot \int_{\lambda} p(\lambda) f_R(\lambda)d\lambda$):
	\begin{equation*}
		\begin{split}
			r & = \frac{R}{R + G + B} = \frac{\cancel{\cos\theta} \cdot \cancel{e}\cdot \int_{\lambda} p(\lambda) f_R(\lambda)d\lambda}{\cancel{\cos\theta} \cdot \cancel{e}\cdot \int_{\lambda} p(\lambda) \left(f_R(\lambda) + f_G(\lambda) + f_B(\lambda)\right)d\lambda}
		\end{split}
	\end{equation*}
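	Thus, the normalized values depend neither on the surface orientation ($\cos\theta$) nor on the light intensity ($e$), so the \textit{rgb} color space is color invariant when assuming a Lambertian reflection model.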
\end{itemize}
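As a quick numerical illustration of this cancellation (the sensor values are made up for the example), consider a pixel with $(R,G,B)=(50,100,50)$ and the same surface under twice the intensity, $(R',G',B')=(100,200,100)$:
\begin{equation*}
	\begin{split}
		(r,g,b) & = \left(\frac{50}{200}, \frac{100}{200}, \frac{50}{200}\right) = (0.25, 0.5, 0.25)\\
		(r',g',b') & = \left(\frac{100}{400}, \frac{200}{400}, \frac{100}{400}\right) = (0.25, 0.5, 0.25)
	\end{split}
\end{equation*}
The common factor in front of the integrals cancels, so both pixels map to the same normalized $rgb$ value.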
\subsection{Convolution operator}
\subsubsection{Difference between convolution and correlation}
Formally, correlation is a measure of similarity between two signals, whilst convolution measures the effect of one signal on the other. In practice, however, correlation simply moves the filter over the image and computes the weighted sum of the covered window at each pixel. Convolution is practically the same, except that the filter is rotated by 180 degrees before it is moved over the image. The formulas are:
\begin{equation*}
	\begin{split}
		\text{Correlation:} & I_{out} = I \otimes h,\hspace{1mm} I_{out}(i,j) = \sum\limits_{k,l} I(i+k, j+l) \cdot h(k,l)\\
		\text{Convolution:} &  I_{out} = I \ast h,\hspace{1mm} I_{out}(i,j) = \sum\limits_{k,l} I(i-k, j-l) \cdot h(k,l)
	\end{split}
\end{equation*}
Note that for both methods there is no difference in the result if we take the center pixel or a corner pixel as the start point for a filter. 
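A small 1D sketch may help to see the difference; the values are chosen purely for illustration. Take $I = \left[\begin{array}{ccc}1 & 2 & 3\end{array}\right]$ and $h = \left[\begin{array}{ccc}1 & 0 & -1\end{array}\right]$, both indexed by $-1, 0, 1$, and evaluate the output at the center position $i=0$:
\begin{equation*}
	\begin{split}
		\text{Correlation:} & \sum\limits_{k} I(0+k) \cdot h(k) = 1\cdot 1 + 2\cdot 0 + 3\cdot (-1) = -2\\
		\text{Convolution:} & \sum\limits_{k} I(0-k) \cdot h(k) = 3\cdot 1 + 2\cdot 0 + 1\cdot (-1) = 2
	\end{split}
\end{equation*}
For this antisymmetric filter the sign flips. For a symmetric filter (e.g. a box or Gaussian filter), both operations give the same result.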
\subsubsection{Convolving two filters}
Two consecutive filters applied to an image can be summarized into one by convolving the two filters. There are two ways to calculate the convolution of two filters. The more intuitive way is to calculate the effect of every element of the first filter on the second one, and to sum up the shifted results.
Example:
\begin{equation*}
	\begin{split}
		f &=\left[\begin{array}{ccc}3 & 7 & 6\end{array}\right], \hspace{2mm}g=\left[\begin{array}{ccc}-1 & 5 & 8\end{array}\right] \Rightarrow f\ast g \\[5pt]
		& \implies \begin{array}{cccccc}
			& [-1\cdot 3 & 5\cdot 3 & 8\cdot 3] & & \\
		 +	& & [-1\cdot 7 & 5\cdot 7 & 8\cdot 7] & \\
		 +	& & & [-1\cdot 6 & 5\cdot 6 & 8\cdot 6] \\[5pt]
		 \hline
		 & [ -3 & 8 & 53 & 86 & 48 ]
		\end{array}
	\end{split}
\end{equation*}
The second option is to apply convolution right away with extended zero padding. We can imagine using infinite zero padding and removing the zero elements in the convolved filter afterwards. Note that we perform convolution, and therefore have to flip the second filter.
\begin{equation*}
	\begin{split}
		f\ast g & = \left[\begin{array}{ccccccc}0 & 0 & 3 & 7 & 6 & 0 & 0\end{array}\right] \otimes \left[\begin{array}{ccc}8 & 5 & -1\end{array}\right]\\
		& = \left[\begin{array}{ccccccc}-1\cdot 3 & (5\cdot 3 - 1\cdot 7) & (8\cdot 3 + 5\cdot 7 - 1\cdot 6) & (8\cdot 7 + 5\cdot 6) & 8\cdot 6\end{array}\right]\\
		& = \left[\begin{array}{ccccc}-3 & 8 & 53 & 86 & 48\end{array}\right]
	\end{split}
\end{equation*}
\subsubsection{Linearly Separable Filters}
Some 2D filters are separable in their $x$ and $y$ dimension. We can test it by comparing the convolution of separated $x$ and $y$ filters with the 2D version.
\begin{itemize}
	\item \textit{What is the benefit of separable filters?}
	
	\underline{Answer}: The computational cost per pixel is reduced from $k^2$ to $2\cdot k$.
	
	\item \textit{Prove that a 2D Gaussian filter is linearly separable.}
	
	\underline{Answer}: We can show this holds for the continuous case, and thus also for the discrete. Note that we can neglect a constant factor $c$ for normalization as this does not introduce any significant computational effort.
	\begin{equation*}
		\begin{split}
			G_x * G_y  & = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{x^2}{2\sigma^2}} * \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{y^2}{2\sigma^2}}\\
			& = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}\\
			& = G_{xy}
		\end{split}
	\end{equation*}
	\item \textit{Prove that a 2D box filter (size $3\times 3$) is linearly separable.}
	
	\underline{Answer}: We can show this by simply computing the convolution.
	\begin{equation*}
	\begin{split}
		\left[\begin{array}{ccc}1 & 1 & 1\end{array}\right] *  
		\left[\begin{array}{c}1 \\ 1 \\ 1\end{array}\right] & = \left[\begin{array}{ccc}
		1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1
		\end{array}\right]\\
	\end{split}
	\end{equation*}
	\item \textit{Check whether the following 2D filter is linearly separable:}
	$$h = \left[\begin{array}{ccc}
	1 & -2 & 1\\ -2 & 4 & -2\\ 1 & -2 & 1
	\end{array}\right]$$
	
	\underline{Answer}: A filter is linearly separable exactly when it can be written as the product of a column and a row filter, i.e. when all of its rows are multiples of each other. In this case, we can easily spot the pattern:
	\begin{equation*}
	\begin{split}
	\left[\begin{array}{ccc}1 & -2 & 1\end{array}\right] *  
	\left[\begin{array}{c}1 \\ -2 \\ 1\end{array}\right] & = \left[\begin{array}{ccc}
	1 & -2 & 1\\ -2 & 4 & -2\\ 1 & -2 & 1
	\end{array}\right]\\
	\end{split}
	\end{equation*}
	\item \textit{Check whether the following 2D filter is linearly separable:}
	$$h = \left[\begin{array}{ccc}
	1 & 8 & 3\\ 7 & 6 & 2\\ 4 & 9 & 5
	\end{array}\right]$$
	
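	\underline{Answer}: No, this kernel is not linearly separable. Its rows are not multiples of each other (e.g. $7/1 \neq 6/8$ for the first two rows), so it cannot be written as the product of a column and a row filter.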
\end{itemize}
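One way to make the last two checks explicit is a rank argument, sketched here: a filter is separable exactly when its matrix has rank one, which requires every $2\times 2$ minor to vanish. Taking the top-left minor of the two kernels above:
\begin{equation*}
	\begin{split}
		\det\left[\begin{array}{cc}1 & -2\\ -2 & 4\end{array}\right] & = 1\cdot 4 - (-2)\cdot(-2) = 0\\
		\det\left[\begin{array}{cc}1 & 8\\ 7 & 6\end{array}\right] & = 1\cdot 6 - 8\cdot 7 = -50 \neq 0
	\end{split}
\end{equation*}
The second kernel therefore has at least rank two and cannot be separated. For the first kernel, a single vanishing minor is only a necessary condition; the explicit factorization shown above completes the argument.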
\subsection{Object detection}

\subsection{Convolutional Neural Networks}
\subsubsection{Amount of parameters, output size and computational cost}
\begin{itemize}
	\item \textbf{Output size}: the spatial output size of a convolutional layer depends on the kernel size $k$, the padding $p$ (per side), the stride $s$ and the input size $w_i$. The output size is then calculated by $w_o = (w_i + 2\cdot p - k)/s + 1$
	\begin{itemize}
		\item \textit{What is the size of the output volume with stride $3$, kernel $5\times 5$, number of neurons $5$ and input size $32\times 32\times 3$ (no padding)?}
		
		\underline{Answer}: The spatial output size is $w_o = (32 + 2\cdot 0 - 5)/3 + 1 = 10$. With $5$ neurons (filters), the output volume is $10\times 10\times 5$.
		
		\item \textit{What padding size is required to keep the output size equal to the input size for a kernel $k$ and stride $s$?}
		
		\underline{Answer}: we have to reverse the equation above to:
		\begin{equation*}
			\begin{split}
				w_o = w_i & = (w_i + 2\cdot p - k)/s + 1\\
				\Leftrightarrow (w_i - 1) \cdot s & = w_i + 2\cdot p - k\\
				\Leftrightarrow p & = \frac{1}{2}\left(w_i \cdot \left(s-1\right) - s + k\right)
			\end{split}
		\end{equation*}
		Hence, if stride is $s=1$, the necessary padding is $p=\frac{k-1}{2}$.
		
		\item \textit{How many output frames do we get for a 3D convolution of $3\times 3\times 3$ (stride $s=3$ and padding $p=1$ in temporal dimension) on an input video size of $16\times 256\times 256\times 3$?}
		
		\underline{Answer}: We can apply the same formula as before: $l_o = (16 + 2\cdot 1 - 3)/3 + 1 = 6$ output frames.
	\end{itemize}
	\item \textbf{Number of parameters}: a 2D convolution contains $k\times k\times c_F \times c_G$ parameters where $k$ is the kernel size, and $c_F$ and $c_G$ the number of input and output channels. For a 3D convolution, we multiply it by another $k$. Note that all these three $k$'s can be different (e.g. $3\times 3\times 1$, $5\times 1 \times 1$, ...)
	\begin{itemize}
		\item \textit{How many parameters are learned in a convolutional layer with an RGB input image, $5\times 5$ kernel size and $100$ different filters?}
		
		\underline{Answer}: We learn $5\times 5\times 3\times 100 = 7,500$ parameters for the filters, and $100$ biases. Thus, we have overall $7,600$ parameters.
		
		\item \textit{How many parameters are learned if we set the padding to $p=2$ and stride $s=2$?}

		\underline{Answer}: The number of parameters is independent of the stride and the padding.
	\end{itemize}
	\item \textbf{Computational cost}: The computational cost of a layer is the cost of a single filter application (the filter size) times the number of output neurons.
	\begin{itemize}
		\item \textit{Given the input $w_F \times h_F \times c_F$ and output $w_G \times h_G \times c_G$, what is the computational cost of a 2D convolution with kernel size $k\times k$ between these two layers?}
		
		\underline{Answer}: The cost of applying a single filter once is $k\times k\times c_F$. We then have to move the filter over $x$ and $y$ dimension, and repeat it for $c_G$ filters. Thus, the overall cost is determined by:
		$$k\times k\times c_F\times c_G\times w_G\times h_G$$
		
		\item \textit{Given the input $256 \times 256 \times 3$, what is the computational cost of a 2D convolution with kernel size $7\times 7$, $32$ output channels, stride $s=3$ and padding $p=0$?}
		
		\underline{Answer}: We first have to calculate the output size $w_G = (w_F + 2\cdot p - k)/s + 1 = (256 + 0 - 7)/3 + 1 = 84$ and $h_G = 84$. Next, we can apply our previous formula:
		$7\times 7\times 3\times 32\times 84\times 84 = 33,191,424$
		
		\item \textit{What are two ways to reduce the number of computations for 2D convolutions?}
		
		\underline{Answer}: Same as in case of 3D convolutions. We can either do depth-wise convolutions (\textit{MobileNet}), or do pseudo 2D convolutions by separating the filter $k\times k$ to a $1\times k$ and $k\times 1$ convolution (\textit{InceptionV2}).
	\end{itemize}
\end{itemize}
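To get a feeling for the savings of depth-wise separable convolutions mentioned in the last answer, the following sketch compares the cost per output position. The layer sizes ($k=3$, $c_F=64$, $c_G=128$) are arbitrary and only chosen for illustration:
\begin{equation*}
	\begin{split}
		\text{Standard 2D convolution:} \hspace{2mm} & k\times k\times c_F\times c_G = 9 \cdot 64\cdot 128 = 73,728\\
		\text{Depth-wise + }1\times 1\text{ convolution:} \hspace{2mm} & k\times k\times c_F + c_F\times c_G = 576 + 8,192 = 8,768\\
		\text{Relative cost:} \hspace{2mm} & \frac{8,768}{73,728} = \frac{1}{c_G} + \frac{1}{k^2} \approx 0.12
	\end{split}
\end{equation*}
Both variants are applied at the same $w_G\times h_G$ output positions, so this factor cancels in the ratio.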
\subsubsection{Other general questions}
\begin{itemize}
	\item \textbf{Locally constrained layer}: A convolutional layer where we don't share weights over spatial dimensions.
	\begin{itemize}
		\item \textit{How many parameters are needed for a locally constrained layer, where each neuron looks at a $10\times10$ window, when using $W=H=100$, and stride of $5$?}
		
		\underline{Answer}: The spatial output size is $(100 - 10) / 5 + 1 = 19$ so that we have $19\times 19=361$ different kernels. Combined with the kernel/window size, we get overall $10\times 10\times 361=36,100$ parameters.
		\item \textit{Describe a scenario where weight sharing as done in plain convolutional layers is not beneficial for recognition}
		
		\underline{Answer}: Weight sharing works most effectively if the input is translation invariant, i.e. the same pattern can occur at any position. However, if this is not the case and we have stationary data, we should for example use locally constrained layers where the weights are not shared. This may lead to more parameters but reduces the required amount of channels (restricted number of possible objects per position). Example: face recognition with standardized position (eyes and mouth filters at different parts of the image). 
	\end{itemize}
\end{itemize}


================================================
FILE: Computer_Vision_1/cv_applications.tex
================================================
\section{Applications}
Not in the exam :-)

================================================
FILE: Computer_Vision_1/cv_deep_learning.tex
================================================
\section{Deep Learning}
\begin{itemize}
	\item Deep Neural Networks perform hierarchical feature learning and classification in a single architecture
\end{itemize}
\subsection{Convolutional Neural Networks}
\begin{itemize}
	\item Key layer of CNNs are convolutions. The weights are surface-wise local, but depth-wise global.
	\item Multiple neurons look at the same position, but using different kernels (channels)
	\item Parameters of a convolutional layer
	\begin{itemize}
		\item \textit{Kernel size}: size of the filter which is learned. If size is $k\times k$, we learn overall $k^2$ parameters per channel
		\item \textit{Input channels}: number of input channels $c_i$. Every filter has the size of $k\times k\times c_i$
		\item \textit{Output channels}: number of output channels $c_o$. Represent the number of different filters learned.
		\item \textit{Stride} with which we slide the filter over the image. Stride of $s=1$ means we apply a filter on every pixel as usual, $s=2$ would skip every second pixel and $s=4$ takes only every fourth pixel as center of a filter application. Default: $s=1$.
	\end{itemize}
	\item Overall, we learn $(k\times k\times c_i + 1)\times c_o$ \textbf{parameters} in a convolutional layer (the 1 extra parameter for bias)
	\item The \textbf{output size} is calculated by $$h_o = (h_i + 2\cdot p - k) / s + 1$$ where $p$ is the padding (number of extra pixels on each side)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.2\textwidth]{figures/cv_deep_learning_convolution_operator.png}
		\caption{Convolutional layer in a CNN}
	\end{figure}
	\item Activation layers like ReLU ($\max(0,x)$) introduce non-linearity
	\item Pooling aggregates multiple values into a single value making it invariant to small transformations. Reduces the size of the next output layer while keeping the most important information 
\end{itemize}
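As a worked example of the two formulas above, consider a hypothetical first layer with $k=3$, $c_i=3$ (RGB), $c_o=64$, padding $p=1$ and stride $s=1$ on an input of height $h_i=224$ (all values are assumptions for illustration):
\begin{equation*}
	\begin{split}
		\text{Parameters:} \hspace{2mm} & (3\times 3\times 3 + 1)\times 64 = 28\cdot 64 = 1,792\\
		\text{Output size:} \hspace{2mm} & h_o = (224 + 2\cdot 1 - 3)/1 + 1 = 224
	\end{split}
\end{equation*}
The spatial size is preserved here because $p=(k-1)/2$ for stride $s=1$.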
\subsubsection{Transfer Learning}
\begin{itemize}
	\item Reuse information gained on a large dataset (e.g. ImageNet) on a new one
	\item Depending on the amount and similarity of data with the pretrained one, we should fine-tune different layers (see Figure~\ref{fig:transfer_learning})
	\item Transfer Learning can greatly influence the performance of a network. Low level features (first layers) are almost always the same for images as we have to detect edges, colors, etc.
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.5\textwidth]{figures/cv_deep_learning_transfer_learning.png}
	\caption{Transfer learning}
	\label{fig:transfer_learning}
\end{figure}
% cv_deep_learning_transfer_learning.png
\subsection{GANs}
\begin{itemize}
	\item The goal is to capture the underlying data distribution and to be able to generate new samples from it
	\item Next to generative adversarial networks, we can also apply Variational Autoencoders or PixelCNN/RNN for this task
	\item GANs are trained by a minimax game between two neural networks (Discriminator $D$ and Generator $G$). $G$ wants to fool $D$ by generating realistic images. $D$ tries to distinguish between generated and real images/data:
	$$\min_G \max_D V(G,D) = \mathbb{E}_{\bm{x}\sim p_{\text{data}}(\bm{x})} \left[\log \left(D\left(\bm{x}\right)\right)\right] + \mathbb{E}_{\bm{z}\sim p_{z}(\bm{z})} \left[\log\left(1 - D\left(G\left(\bm{z}\right)\right)\right)\right] $$
	\item The standard/plain GAN architecture uses a noise vector $\bm{z}$ as input to the generator. Note that it is also possible to feed an additional input to the generator and condition the output on it (aka \textit{conditional GANs}). To ensure that the generator learns a relation from input to output, we might need to add an additional loss term like MSE to a label
	\item The training procedure consists of two steps which can be alternated or repeated by themselves for multiple times
	\begin{enumerate}
		\item \textit{Fix $G$ and train $D$}: in order to train the discriminator, we let $G$ generate fake images and feed the discriminator both the fake and sampled real data. Note that we need to fix $G$ to not backpropagate the error of $D$ through $G$.
		\item \textit{Fix $D$ and train $G$}: $G$ is trained by generating images and backpropagating the error of the prediction of $D$ (towards prediction of a real image). Although the gradients flow back through $D$, we do not update any weights of the discriminator as we otherwise cheat (train $D$ to optimize loss of $G$)
	\end{enumerate}
\end{itemize}
\subsubsection{Stability and Training problems}
\begin{itemize}
	\item In general, it is hard to train a GAN. There are a lot of problems that can occur
	\item \textbf{Vanishing gradients} during training:
	\begin{itemize}
		\item If the discriminator is too bad, the generator does not get valid/accurate feedback and can therefore not learn properly
		\item If the discriminator is perfect, the generator has very low gradients as a small change does not influence the discriminator
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_vanishing_gradients.jpeg}
			\caption{Vanishing gradients problem for training with KL-divergence. When the distance between the two distributions $p$ and $q$ (respectively $P_g$ and $P_r$) is too large, the gradient of the divergence is very close to zero. Hence, it does not provide any strong gradients in these regions.}
		\end{figure}
	\end{itemize}
	\item \textbf{Reaching the equilibrium}
	\begin{itemize}
		\item We know that the Nash equilibrium of the minimax game is $P_g=P_r$, meaning the distribution of the generated data is equal to the distribution of the real data. In that case, $D$ returns 0.5 no matter what example we put in (as both distributions are equal).
		\item However, it has been shown that such cost functions may not converge when using gradient descent. An example is shown in Figure~\ref{fig:GAN_reaching_equilibrium}.
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_oscillating.png}
			\caption{Oscillating behavior of a non-cooperative game where $\min_x \max_y V(x,y) = x\cdot y$. The equilibrium $x=y=0$ is never reached.}
			\label{fig:GAN_reaching_equilibrium}
		\end{figure}
	\end{itemize}
	\item \textbf{Mode collapse}
	\begin{itemize}
		\item A GAN suffers from a mode collapse if the generator limits its predictions/generated distribution to a few samples/modes.
		\item For example in case of the MNIST dataset, this would mean that the generator only creates numbers of one or two different digits. Although a full mode collapse is rarely the case, partial mode collapses frequently occur
		\item In order to create a mode collapse, the gradients regarding the noise $\bm{z}$ must be very low/close to zero. This can for example happen if we fix the discriminator and the generator converges to the optimal image $\bm{x}^*$ that fools the discriminator the most
		\item Once the generator collapses to one mode, the discriminator will learn that this mode is purely/mostly generated and thus changes its predictions. The generator will address that by changing the mode (note that as $\partial L/\partial \bm{z}\approx 0$, we will just collapse to the next mode and are not able to escape this loop).
		\item In the end, this turns into a cat-and-mouse game between the generator and discriminator, and will not converge (see Figure~\ref{fig:GAN_mode_collapse}).
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_learning_GAN_mode_collapse.png}
			\caption{\textit{Top row}: optimal convergence of generator distribution to 8 modes. \textit{Bottom row}: Sample of a mode collapse after 10k iterations. The generator is only able to generate a single mode.}
			\label{fig:GAN_mode_collapse}
		\end{figure}
	\end{itemize}
\end{itemize}
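The statement that $D$ returns $0.5$ at the equilibrium can be made explicit with the standard derivation of the optimal discriminator. This is only a sketch; $p_g$ denotes the density of the generated samples (corresponding to $P_g$ above). For a fixed generator, maximizing the integrand pointwise over $D(\bm{x})$ gives:
\begin{equation*}
	\begin{split}
		V(G,D) & = \int_{\bm{x}} p_{\text{data}}(\bm{x}) \log D(\bm{x}) + p_g(\bm{x}) \log\left(1 - D(\bm{x})\right) d\bm{x}\\
		\Rightarrow D^{*}(\bm{x}) & = \frac{p_{\text{data}}(\bm{x})}{p_{\text{data}}(\bm{x}) + p_g(\bm{x})}\\
		p_g = p_{\text{data}} \Rightarrow D^{*}(\bm{x}) & = \frac{1}{2}, \hspace{2mm} V(G,D^{*}) = \log\frac{1}{2} + \log\frac{1}{2} = -\log 4
	\end{split}
\end{equation*}
The oscillating behavior of the game $V(x,y)=x\cdot y$ can be sketched in a similar way. Assuming simultaneous gradient steps with learning rate $\eta$ (an assumption of this example), where $x$ minimizes and $y$ maximizes $V$:
\begin{equation*}
	\begin{split}
		x_{t+1} & = x_t - \eta \cdot y_t, \hspace{2mm} y_{t+1} = y_t + \eta \cdot x_t\\
		x_{t+1}^2 + y_{t+1}^2 & = \left(1 + \eta^2\right) \cdot \left(x_t^2 + y_t^2\right)
	\end{split}
\end{equation*}
Under this update rule, the distance to the equilibrium $x=y=0$ grows in every step instead of shrinking, so the iterates spiral outwards.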

================================================
FILE: Computer_Vision_1/cv_deep_video.tex
================================================
\section{Deep Video}
\begin{itemize}
	\item Understanding a video requires analyzing spatial and temporal information. Thus, more data is needed to fully train such a network, although we cannot label every single frame (too expensive)
	\item Grid-like data can be processed by a CNN, temporal mostly by RNN, and for unstructured data a fully connected network is most suitable
	\item Easiest solution for video understanding would be to classify (sample/all) frames independently by standard CNN, and then perform average pooling over predictions. However, this approach does not capture temporal structure
\end{itemize}
\subsection{Recurrent Neural Networks}
\begin{itemize}
	\item In Recurrent Neural Networks, a hidden state flows over time steps. The vanilla RNN formula is
	\begin{equation*}
		\begin{split}
			h_t & = \tanh \left(W_{hh}h_{t-1} + W_{xh} x_{t}\right)\\
			y_t & = W_{hy} h_t
		\end{split}
	\end{equation*}
	\item Weights are shared over time (also $W_{hh}$) so that an RNN can process an arbitrary sequence length. Also, it reduces the number of parameters and thus the chance of overfitting 
	\item However, weight sharing can also lead to vanishing gradients: if we backpropagate from $h_t$ to $h_k$, the same factor $\theta$ is applied $t-k$ times, which lets the gradients vanish if it is lower than one, and explode if it is greater than one:
	$$\frac{\partial h_t}{\partial h_k} = \theta^{(t-k)} \sum f(\cdot)$$
	\item Vanilla RNNs have trouble capturing long-term dependencies. A possible solution is using LSTMs that control the information flow by three gates (see Figure~\ref{fig:deep_video_LSTM}):
	\begin{equation*}
		\begin{split}
			\text{Forget gate:  } & f_t = \sigma\left(W_f \cdot \left[h_{t-1}, x_t\right] + b_f\right)\\[7pt]
			\text{Input gate:  } & i_t = \sigma\left(W_i \cdot \left[h_{t-1}, x_t\right] + b_i\right)\\
			& \tilde{c}_t = \tanh\left(W_c \cdot \left[h_{t-1}, x_t\right] + b_c\right)\\
			& c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\[7pt]
			\text{Output gate:  } & o_t = \sigma\left(W_o \cdot \left[h_{t-1}, x_t\right] + b_o\right)\\
			& h_t = o_t * \tanh\left(c_t\right)
		\end{split}
	\end{equation*}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_deep_video_LSTM.png}
		\caption{Visual representation of an LSTM chain.}
		\label{fig:deep_video_LSTM}
	\end{figure}
\end{itemize}
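To illustrate how quickly the factor $\theta^{(t-k)}$ in the gradient of the vanilla RNN shrinks or grows, consider a scalar $\theta$ and a distance of $t-k=50$ time steps (both are assumptions for this example, and the remaining terms of the gradient are ignored):
\begin{equation*}
	\begin{split}
		\theta = 0.9: \hspace{2mm} & 0.9^{50} \approx 5.2\cdot 10^{-3}\\
		\theta = 1.1: \hspace{2mm} & 1.1^{50} \approx 1.2\cdot 10^{2}
	\end{split}
\end{equation*}
Even factors close to one therefore change the gradient by several orders of magnitude over long sequences.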
\subsection{3D convolutions}
\begin{itemize}
	\item We can extend standard convolutions to 3D by moving the filter over the time dimension as well (channels are now 4th dimension over which filter is still global)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/cv_deep_video_3D_convs.png}
		\caption{A 3D convolution is local over spatial and temporal dimensions, but still global over channels (i.e. RGB).}
		\label{fig:deep_video_3d_convs}
	\end{figure}
	\item Example: extending a 2D kernel by temporal dimension:
	\begin{equation*}
		\begin{split}
			200\times 200\times 3 \textcolor{blue}{\times 16} \xrightarrow{\text{filter }3\times3\textcolor{blue}{\times 3}} 200\times 200\times 256 \textcolor{blue}{\times 16}
			\Rightarrow \underbrace{3\times 3}_{\text{ spatial }}\underbrace{\textcolor{blue}{\times 3}}_{\text{ temporal }}\underbrace{\times 3}_{\text{ input channels }}\underbrace{\times 256}_{\text{ output channels}}\text{ parameters}
		\end{split}
	\end{equation*}
	\item Such convolutions learn combined temporal and spatial information. 
	\item Alternative is to concatenate all input frames over the channel dimension and pass it to a simple 2D network (also called \textit{early fusion}). Note that this approach loses the temporal information very fast
	\item Consecutive 3D convolutions can be seen as hierarchical combination of frames. Low level layers therefore capture low level motions, while high level layers (close to output) are able to reason about a longer set of frames and thus high level motion.
	\item Still, it is hard to learn long term dependencies with 3D convolutions as they do not have any gates and thus no explicit control over the information flow
	\item Note that in general, video-based networks are more likely to suffer from overfitting as the input space has a much higher dimensionality and the network has more parameters
\end{itemize}
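For the example layer above ($3\times 3$ spatial kernel, $3$ input channels, $256$ output channels, $16$ frames), the parameter counts of the different options can be compared as follows. This is a sketch that ignores biases and only considers the first layer:
\begin{equation*}
	\begin{split}
		\text{2D convolution on a single frame:} \hspace{2mm} & 3\times 3\times 3\times 256 = 6,912\\
		\text{3D convolution (temporal size 3):} \hspace{2mm} & 3\times 3\times 3\times 3\times 256 = 20,736\\
		\text{Early fusion (}16\text{ frames as channels):} \hspace{2mm} & 3\times 3\times (3\cdot 16)\times 256 = 110,592
	\end{split}
\end{equation*}
The 3D kernel is shared over time and thus independent of the number of frames, while the early fusion layer grows linearly with it.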
\subsection{State-of-the-art}
\subsubsection{Two Stream Network}
\begin{itemize}
	\item The earliest proposed network for action recognition was the \textbf{Two stream network}
	\item The architecture consists of two networks. One takes a single frame (\textit{spatial} stream net), and the other processes the concatenated optical flow over the set of frames (\textit{temporal} stream net). Both predictions are in the end combined
	\item The biggest problem here is that the spatial and temporal information is processed independently, and the very late fusion makes it impossible to reason about both
	\item Other disadvantages include a higher computational cost (two networks plus optical flow), only capturing short motion (early fusion of optical flow), noisy optical flow, and higher probability of overfitting due to number of parameters
	\item The approach can be slightly improved by repeatedly applying the network on small snippets of the video, and combining the predictions afterwards
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_deep_video_two_stream_network.png}
		\caption{Architecture of the two stream network.}
		\label{fig:deep_video_two_stream_net}
	\end{figure}
\end{itemize}
\subsubsection{I3D}
\begin{itemize}
	\item Inspired by the success of the 2D version (GoogLeNet), current state-of-the-art networks apply 3D inception modules (see Figure~\ref{fig:deep_video_I3D_module})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.35\textwidth]{figures/cv_deep_video_I3D.png}
		\caption{\textit{Left}: Standard Inception module of the I3D network. \textit{Right}: Inception module with 3D temporal separable convolutions.}
		\label{fig:deep_video_I3D_module}
	\end{figure}
	\item It is pretrained on ImageNet where the 2D filters are (after pretraining) inflated to a third dimension by repeating the values $N$ times over the time dimension, and rescaled by dividing by $N$
\end{itemize}
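The rescaling by $N$ in the inflation step can be motivated with a short sketch (the index notation is an assumption of this example): with $W^{3D}_{t,i,j} = \frac{1}{N} W^{2D}_{i,j}$ for $t=1,...,N$, a video consisting of $N$ identical frames $x_{t,i,j} = x_{i,j}$ yields
\begin{equation*}
	\sum\limits_{t=1}^{N} \sum\limits_{i,j} W^{3D}_{t,i,j} \cdot x_{t,i,j} = \sum\limits_{t=1}^{N} \frac{1}{N} \sum\limits_{i,j} W^{2D}_{i,j} \cdot x_{i,j} = \sum\limits_{i,j} W^{2D}_{i,j} \cdot x_{i,j}
\end{equation*}
Hence, on such a static video the inflated filter reproduces the response of the pretrained 2D filter.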
\subsubsection{Efficient 3D convolutions}
\begin{itemize}
	\item The main drawback of I3D and all other 3D convolutional networks is the huge number of parameters. There are three ways to efficiently reduce the number of parameters
\end{itemize}
\begin{enumerate}
	\item \textbf{Pseudo 3D convolutions}
	\begin{itemize}
		\item The idea behind this operation is that the spatial and the temporal dimension do not correlate in every detail, but the temporal dimension is more important locally for the spatial dimension
		\item Thus, we split 3D convolution into a 2D spatial and a consecutive 1D temporal convolution. The concept is visualized in Figure~\ref{fig:deep_video_pseudo_3D_convs}
		\item The number of operations applied on input size $l_F \times w_F \times h_F \times c_F$ to output $l_G \times w_G \times h_G \times c_G$, where $c_I$ is the number of intermediate channels between the two convolutions, is:
		\begin{equation*}
				\underbrace{k \times k \times 1 \times c_F \times c_I \times l_F \times w_G \times h_G}_{\text{Spatial 2D convolution}} + \underbrace{1\times 1\times k \times c_I \times c_G \times l_G \times w_G \times h_G}_{\text{Temporal 1D convolution}}
		\end{equation*}
		\item Compared to a full 3D convolution, the relative cost of this operation is about $\frac{1}{k}\cdot \frac{c_I}{c_G} \cdot \frac{l_F}{l_G} + \frac{1}{k^2} \cdot \frac{c_I}{c_F}\approx \frac{1}{k}\cdot \frac{c_I}{c_G}$
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.55\textwidth]{figures/cv_deep_video_pseudo_3D_conv.png}
		\caption{Pseudo 3D convolutions split the operation into a spatial part (2D) and a temporal (1D) convolution.}
		\label{fig:deep_video_pseudo_3D_convs}
	\end{figure}
	\item \textbf{Depth-wise separable convolutions}
	\begin{itemize}
		\item This operation is inspired by the MobileNet architecture and removes the property of convolutions being depth-wise global
		\item We consider every input channel independently, and apply a different filter on each of them. For example, if we have an RGB input, we would apply three filters, each processing a different input channel
		\item To still allow interaction/combination of multiple channels, we apply a local $1\times 1\times 1$ convolution afterwards. Hence, an output channel depends again on all input channels.
		\item The number of operations applied on input size $l_F \times w_F \times h_F \times c_F$ to output $l_G \times w_G \times h_G \times c_G$ is:
		\begin{equation*}
			\underbrace{k \times k \times k \times 1 \times c_F \times l_G \times w_G \times h_G}_{\text{Depth-wise 3D convolution}} + \underbrace{1\times 1\times 1 \times c_F \times c_G \times l_G \times w_G \times h_G}_{\text{Local }1\times 1\times 1\text{ convolution}}
		\end{equation*}
		\item The reduction is considerably bigger than for pseudo 3D: the cost relative to a full 3D convolution is $\frac{1}{c_G} + \frac{1}{k^{3}} \approx \frac{1}{k^{3}}$ (for a large number of output channels $c_G$)
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_video_3D_depthwise_conv.png}
			\caption{Depth-wise 3D convolutions apply one filter per input channel, and combine the different channels afterwards. Same architecture is applied in MobileNet for the 2D case.}
			\label{fig:deep_video_depthwise_3D_convs}
		\end{figure}
	\end{itemize}
	\item \textbf{Partial 2D architecture} 
	\begin{itemize}
		\item Depending on the kind of motion we want to detect, it might not be necessary to apply 3D convolutions at every stage of the network. 
		\item For example, if we are only interested in high-level motions, we might want to use a \textit{Top-heavy I3D} which applies 3D convolutions only on the last layers. 
		\item Similarly, for short motions, we might want to consider a \textit{Bottom-heavy I3D}. 
		\item Figure~\ref{fig:deep_video_I3D_architectures} summarizes the different network architectures.
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_video_different_I3D_architectures.png}
			\caption{Different I3D network architectures.}
			\label{fig:deep_video_I3D_architectures}
		\end{figure}
	\end{itemize}
\end{enumerate}
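As a rough sanity check of the two cost ratios stated above, consider a sketch with assumed values $k=3$, $c_F=c_I=c_G=64$ and $l_F=l_G$ (these numbers are purely illustrative and not taken from the lecture). Counting operations per output position, where a full 3D convolution needs $k^3 \cdot c_F \cdot c_G$, we get:
\begin{equation*}
	\underbrace{27 \cdot 64 \cdot 64 = 110592}_{\text{Full 3D}}, \hspace{5mm} \underbrace{(9 + 3) \cdot 64 \cdot 64 = 49152}_{\text{Pseudo 3D}}, \hspace{5mm} \underbrace{27 \cdot 64 + 64 \cdot 64 = 5824}_{\text{Depth-wise separable}}
\end{equation*}
This corresponds to a relative cost of $\frac{49152}{110592} = \frac{1}{3} + \frac{1}{9} \approx 0.44$ for pseudo 3D convolutions and $\frac{5824}{110592} = \frac{1}{64} + \frac{1}{27} \approx 0.05$ for depth-wise separable convolutions. For such a small kernel, the neglected second term of the pseudo 3D ratio is still noticeable, so the approximation $\frac{1}{k}$ should only be read as a rough estimate.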
\subsection{Self-supervised learning}
\begin{itemize}
	\item Learn to represent a video adequately in the network by using data and tasks for which the labels come for free. The great benefit is that we can use a lot of (unlabeled) data
	\item This is mostly done as a pre-training step in which the network learns to process and analyze videos on a huge dataset. There are various tasks we can perform self-supervised learning on:
	\begin{itemize}
		\item \textbf{Visual tracking}: Given a tracking system, we can train a network to predict whether two patches are similar or not. We create the labels with the tracking system by setting them to 1 if two patches show the same object over time, and to 0 otherwise (we sample a random other patch from the image and compare the scores).
		\item \textbf{Learning by shuffling}: The network is given a set of frames, and its task is to determine whether they are in the correct temporal order or not. The supervision signals are easily generated by labeling the real videos as positive and shuffling their frame order to create negative examples. The goal is that the network learns to understand poses and motions over frames.
		\item \textbf{Learning by arrow of time}: The task of the network is to predict whether a video is played forwards or backwards (binary classification). This is a very challenging task as it requires the network to understand laws of physics (water only flows downwards, not upwards) by analyzing different motions in the video. Afterwards, one can cluster the clues the network has extracted which led to a prediction of forward or backward (called \textit{arrow of time}). This approach gave the best self-supervised pre-training results so far, but is still not able to beat a supervised ImageNet pre-training.
	\end{itemize}
\end{itemize}

================================================
FILE: Computer_Vision_1/cv_imgformation.tex
================================================
\section{Image formation}
\label{sec:img_formation}
\begin{itemize}
	\item To fully understand/analyze an image, we first have to examine how it was created (note that an image is a 2D representation of a 3D world)
	\item Various challenges occur in CV 
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_formation_challenges_cv.png}
		\caption{Challenges in Computer Vision}
	\end{figure}
	\item The two main parts of how an image is formed are:
	\begin{itemize}
		\item \textit{Geometry} of the projection of a 3D environment to a 2D image. This defines which pixel belongs to which object (part/location). 
		\item \textit{Physics of light} which determines the brightness of a point in the image plane as a function of illumination and surface properties. Thus, the light source has a crucial influence on an object's appearance 
	\end{itemize}
\end{itemize}
\subsection{Projective Geometry and Camera models}
\begin{itemize}
	\item A camera can be abstracted by a pinhole model. Larger aperture/pinhole results in blurry images, smaller give sharp but noisy images (less energy of light is being passed) $\Rightarrow$ Change between both by using different lenses
	\item We represent an image by a projection plane. A 3D point is mapped to the intersection of the plane with the ray through the point and the center of projection (note that $z$ is negative):
	$$(x,y,z)\to (-\frac{d}{z}\cdot x, -\frac{d}{z}\cdot y, -d)$$ 
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}[b]{0.48\textwidth}
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_image_formation_3D_model.png}
			\caption{Projection plane}
		\end{subfigure}
		\begin{subfigure}[b]{0.48\textwidth}
			\includegraphics[width=0.8\textwidth]{figures/cv_image_formation_3D_model_2.png}
			\caption{Pinhole camera model}
		\end{subfigure}
		\caption{Abstract camera model in 3D coordinates}
	\end{figure}
	\item Model projection of 3D points to 2D image plane using homogeneous coordinates. The components we use for the projection are:
	\begin{itemize}
		\item \textit{Viewport projection}: Convert plane points to image coordinates (top left corner $(0,0)$, resolution scaling $s_x$, $s_y$)
		\item \textit{Perspective projection}: 3D points to image plane (homogeneous coordinates)
		\item \textit{View transformation}: rotation and translation matrix $\bm{R}$ and $\bm{T}$ for modeling the position and orientation of the camera. Can be seen as changing the coordinate system
		\item All together, we get the transformed points by:
		$$\left[\begin{array}{c}u\\v\\1\end{array}\right] = \underbrace{\left[\begin{array}{ccc}s_x & 0 & u_0\\0 & -s_y & v_0\\0 & 0 & 1\end{array}\right]}_{\text{Viewport}} \cdot \underbrace{\left[\begin{array}{cccc}1 & 0 & 0 & 0\\0 & 1 & 0 & 0\\0 & 0 & -1/d & 0\end{array}\right]}_{\text{Perspective}} \cdot \underbrace{\left[\begin{array}{cc}\bm{R} & \bm{T} \\\bm{0}^T_3 &  1\end{array}\right]}_{\text{View}} \cdot \left[\begin{array}{c}x\\y\\z\\1\end{array}\right]$$
	\end{itemize}
	\item Viewport and perspective projection depend on the camera (size and position of the image plane), so that those are called \textit{intrinsic} camera parameters. In contrast, the view transformation is determined by \textit{extrinsic} camera parameters as it defines the camera position in the (original) coordinate system
\end{itemize}
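The projection chain above can be traced for a single point. A minimal sketch, assuming an identity view transformation ($\bm{R}=\bm{I}$, $\bm{T}=\bm{0}$), $d=1$ and made-up viewport values $s_x=s_y=100$, $u_0=320$, $v_0=240$: for the point $(x,y,z)=(2,1,-4)$, the perspective matrix yields the homogeneous vector $(2,1,4)^T$, which the viewport matrix maps to
$$\left[\begin{array}{ccc}100 & 0 & 320\\0 & -100 & 240\\0 & 0 & 1\end{array}\right] \cdot \left[\begin{array}{c}2\\1\\4\end{array}\right] = \left[\begin{array}{c}1480\\860\\4\end{array}\right] \sim \left[\begin{array}{c}370\\215\\1\end{array}\right]$$
Hence, the point lands at $(u,v)=(370,215)$. This agrees with the direct formula: $-\frac{d}{z}\cdot x = 0.5$ and $-\frac{d}{z}\cdot y = 0.25$ give $u = 100\cdot 0.5 + 320 = 370$ and $v=-100\cdot 0.25 + 240 = 215$. Note that the equality in homogeneous coordinates only holds up to scale, which is why we divide by the last component at the end.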
\subsection{Light and Color models}
\label{sec:color_models}
\begin{itemize}
	\item The appearance color of an object is influenced by three components
	\begin{itemize}
		\item \textit{Light source}: spectral power distribution of light $e(\lambda)$ 
		\item \textit{Object}: the reflection distribution of an object $p(\lambda)$ (how well certain wavelengths are reflected)
		\item \textit{Sensor}: Detection by the sensor of the distribution $e(\lambda) p(\lambda)$
	\end{itemize}
	\item The goal is to be invariant to light source $e(\lambda)$ and sensor perspective
	\item Two very simple approaches to make an image independent of light source
	\begin{itemize}
		\item \textbf{Gray-world} assumption: the world is on average gray. So, we rescale every channel independently by $128/$mean of channel. Problematic if the image is biased towards not being gray (high single channel, etc.)
		\item \textbf{Scale-by-max}/\textbf{White-patch} assumption: there is always at least one white pixel in an image. Hence, the channels are rescaled by $255$/max of channel. Fails if there is actually no white pixel in the image (results in wrong maximum), or if white pixel is in the shadow $\Rightarrow$ assumes whole image being shaded.
		\item Both approaches are based on the von Kries model where we convert an unknown light source $u$ to a canonical $c$ (e.g. daylight) by simple channel scaling:
		$$\left(\begin{array}{c}R^c\\G^c\\B^c\end{array}\right) = \left(\begin{array}{ccc}
		\alpha & 0 & 0 \\
		0 & \beta & 0\\
		0 & 0 & \gamma
		\end{array}\right) \cdot \left(\begin{array}{c}R^u\\G^u\\B^u\end{array}\right)$$
		Note that the diagonal matrix assumes that the channels $R$, $G$ and $B$ are independent. To simplify the calculation of $\alpha$, $\beta$ and $\gamma$, we assume narrow-band filters so that each integral can be approximated by a single wavelength.
	\end{itemize}
	\item As computers cannot handle continuous distributions, the spectrum is summarized by a small number of integrals, as for example in the RGB model:
	$$R = \int_\lambda e(\lambda) p(\lambda) f_R(\lambda) d\lambda, \hspace{2mm}G = \int_\lambda e(\lambda) p(\lambda) f_G(\lambda) d\lambda, \hspace{2mm}B = \int_\lambda e(\lambda) p(\lambda) f_B(\lambda) d\lambda$$
	Every spectral color (see diagram in Figure~\ref{fig:rgb_color_wavelength_distribution_RGB}) can be represented by a linear combination of RGB values.
	Note that the cone cells of the human eye have similar sensitivity functions, but are the most sensitive to green.
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}[b]{0.4\textwidth}
			\centering
			\includegraphics[width=0.75\textwidth]{figures/cv_image_formation_color_RGB_model.png}
			\caption{RGB model}
			\label{fig:rgb_color_wavelength_distribution_RGB}
		\end{subfigure}
		\begin{subfigure}[b]{0.24\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/cv_image_formation_color_XYZ_model.png}
			\caption{XYZ model}
			\label{fig:rgb_color_wavelength_distribution_XYZ}
		\end{subfigure}
		\hspace{5mm}
		\begin{subfigure}[b]{0.28\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/cv_image_formation_color_XYZ_diagram_2.png}
			\caption{XYZ diagram}
			\label{fig:rgb_color_wavelength_distribution_XYZ_diagram}
		\end{subfigure}
		\caption{Color matching functions $f_R$, $f_G$ and $f_B$ for the standard (a) RGB / (b) XYZ model. The colors represented by the XYZ system are shown in (c). Note that the line of purples contains colors that cannot be created by a monochromatic light source but need a combination of fully saturated red and violet (max and min of spectrum).}
		\label{fig:rgb_color_wavelength_distribution}
	\end{figure}
	\item The intensity of the RGB color space is calculated by the sum of the channels: $I=R+G+B$
	\item Another color space is the XYZ system. The color matching functions $\overline{x}(\lambda), \overline{y}(\lambda), \overline{z}(\lambda)$ are similar but not the same as RGB (see Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ}). The values are calculated by:
	$$X = \int_\lambda e(\lambda) p(\lambda) \overline{x}(\lambda) d\lambda, \hspace{2mm}Y = \int_\lambda e(\lambda) p(\lambda) \overline{y}(\lambda) d\lambda, \hspace{2mm}Z = \int_\lambda e(\lambda) p(\lambda) \overline{z}(\lambda) d\lambda$$
	\item However, we can split these measurements into a brightness/luminance and chromaticity/color component specified by $x$ and $y$. The luminance is given by $Y$ ($XYZ$ was designed for that), and the chromaticity is determined as (the third coordinate $z$ is implicitly given by $z=1-x-y$):
	$$x=\frac{X}{X+Y+Z},\hspace{2mm}y=\frac{Y}{X+Y+Z}$$
	\item The created colors can be visualized in an $xy$-diagram (see Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ_diagram}). 
	\item Given a reference light source $e$, we can determine the dominant wavelength (\textit{hue}) of a point $p$ by a line from $e$ through $p$ towards the boundary. The \textit{saturation} is given by the ratio of the distance between $e$ and $p$ to the distance between $e$ and the boundary point of the dominant wavelength. Combining these with the luminance $Y$, a point $p$ can be converted into the HSI color space (see Figure~\ref{fig:rgb_color_HSV_color_cone}).
	\item HSV can be seen as applying non-linear functions on the wavelength distribution (see Figure~\ref{fig:rgb_color_HSV_wavelength_dist}). Hue is defined as the dominant wavelength, saturation as the purity of the color (probably relation between max energy and mean), and the brightness/luminance (given by average)
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}[b]{0.4\textwidth}
			\centering
			\includegraphics[width=0.5\textwidth]{figures/cv_image_formation_color_HSV.png}
			\caption{HSV color cone}
			\label{fig:rgb_color_HSV_color_cone}
		\end{subfigure}
		\begin{subfigure}[b]{0.4\textwidth}
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_image_formation_color_HSV_wavelength_dist.png}
			\caption{HSV wavelength distribution}
			\label{fig:rgb_color_HSV_wavelength_dist}
		\end{subfigure}
		\caption{HSV color space. }
	\end{figure}
	\item The $xy$-diagram in Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ_diagram} visualizes the gamut that is visible for an average person/human vision. Different color spaces/devices capture colors by defining three points and linearly interpolate between those. However, it can be seen that there is no such gamut that can include the whole human vision gamut.
\end{itemize}
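As an illustration of the gray-world correction written in the von Kries form, assume (hypothetical values) the channel means $\overline{R}=160$, $\overline{G}=128$ and $\overline{B}=100$. The diagonal scaling factors are then
$$\alpha = \frac{128}{160} = 0.8, \hspace{5mm} \beta = \frac{128}{128} = 1, \hspace{5mm} \gamma = \frac{128}{100} = 1.28$$
so that a pixel $(R^u, G^u, B^u) = (200, 100, 50)$ is mapped to $(R^c, G^c, B^c) = (160, 100, 64)$. After the correction, every channel has a mean of $128$, which removes the reddish color cast under the assumption that the scene is gray on average.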
\subsection{Reflection models}
\begin{itemize}
	\item When a light source shines on an object, it might be differently perceived from different sensors/cameras although they have the same properties $\Rightarrow$ object appearance by reflectance
	\item The reflectance properties of an object/point can be specified by a \textit{BRDF}: Bi-directional reflectance distribution function $f(\theta_i, \phi_i; \theta_r, \phi_r)$ ($\theta_i$ and $\phi_i$ define the angles between input light and surface normal in $x$-$z$/$x$-$y$ direction respectively, $\theta_r$ and $\phi_r$ for the outgoing direction).
	\item A BRDF can be built up by different components, as visualized in Figure~\ref{fig:reflection_models_brdf_reflection_components}. The main parts can be distinguished into \textit{body reflection} (also referred to as matte appearance), and \textit{surface reflection} (responsible for the glossy appearance)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.8\textwidth]{figures/cv_image_formation_reflectance_properties.png}
		\caption{Different components of reflectance. Black arrow visualizes the input ray, and greyish/shaded arrows the output rays. Length of the output rays indicate their energy. }
		\label{fig:reflection_models_brdf_reflection_components}
	\end{figure}
	\item There are different models that approximate/assume/deal with certain forms of BRDFs
\end{itemize}
\subsubsection{Lambertian model}
\begin{itemize}
	\item The Lambertian reflectance model assumes a BRDF that is constant: $f(\theta_i, \phi_i; \theta_r, \phi_r) = \frac{\rho_d}{\pi}$ where $\rho_d$ is defined by the albedo of the object, and the division by $\pi$ arises as the energy is equally distributed over the hemisphere
	\item The surface reflection/output radiance can be calculated by $L=\frac{\rho_d}{\pi}I\cos \theta_i=\frac{\rho_d}{\pi}I\cdot (\vec{n}\cdot \vec{s})$ where $I$ is light source intensity, $\vec{n}$ the surface normal and $\vec{s}$ the input ray direction.
	\item Note that the factor $(\vec{n}\cdot \vec{s})$ defines the ratio of energy/photons that interact with that point/surface
	\item By assuming a Lambertian world, we can decompose an image into a shading part (surface normals) and the albedo (reflectance) of an object.
\end{itemize}
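A small worked example of the Lambertian radiance, with assumed values $\rho_d=0.8$ and $I=100$ (chosen only for illustration): for a surface normal $\vec{n}=(0,0,1)^T$ and a light direction tilted by $\theta_i=60^{\circ}$, i.e. $\vec{s}=(\sin 60^{\circ}, 0, \cos 60^{\circ})^T$, we obtain
$$L = \frac{\rho_d}{\pi}\cdot I\cdot (\vec{n}\cdot \vec{s}) = \frac{0.8}{\pi}\cdot 100 \cdot 0.5 = \frac{40}{\pi} \approx 12.7$$
For light hitting the surface frontally ($\theta_i=0^{\circ}$), the radiance doubles to $\frac{80}{\pi}\approx 25.5$. In both cases, the result does not depend on the viewing direction, which is the defining property of a Lambertian surface.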
\subsubsection{Phong model}
\begin{itemize}
	\item The Phong model extends the Lambertian model by taking glossy reflectance into account (note that mirror is mostly approximated by glossy as mirror only looks at a single output angle which is rarely met). See Figure~\ref{fig:reflection_models_phong} for the components of the Phong model
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.7\textwidth]{figures/cv_image_formation_phong_model.png}
		\caption{The Phong model combines diffuse and glossy reflectance. Note that ambient gives the object a certain base brightness for approximating reflectance among objects/walls/...}
		\label{fig:reflection_models_phong}
	\end{figure}
	\item The reflection component of the specularity is calculated by $L_s=I\cdot \rho_s \left(\cos \phi\right)^{n_{shiny}}=I\cdot \rho_s \left(\vec{r}\cdot \vec{v}\right)^{n_{shiny}}$ where $\vec{r}$ is the output ray/mirror direction (calculated by $\vec{r}=2\cdot \vec{n} \cdot (\cos \theta) - \vec{s}$ with $\cos \theta = \vec{n}\cdot \vec{s}$), $\vec{v}$ the view direction of the sensor, and $\phi$ the angle between $\vec{r}$ and $\vec{v}$. 
	\item Large values for $n_{shiny}$ lead to narrow, small dot reflections (close to mirror) while small $n_{shiny}$ give broad, big surface reflectance. Note that the intensity is capped at a highest value (e.g. 1 or 255), so that multiple points can have the maximum intensity although they have a slightly different angle
	\item Also, Figure~\ref{fig:reflection_models_phong} shows that the body reflection (diffuse and ambient) contains the object color while the specularity depends on the light source (highlights take the color of the light source)
\end{itemize}
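To get a feeling for the influence of $n_{shiny}$, consider a view direction that deviates from the mirror direction such that $\cos \phi = \vec{r}\cdot \vec{v} = 0.9$ (about $26^{\circ}$, a value assumed only for illustration). The specular factor then becomes
$$\left(\cos \phi\right)^{1} = 0.9, \hspace{5mm} \left(\cos \phi\right)^{10} \approx 0.35, \hspace{5mm} \left(\cos \phi\right)^{100} \approx 2.7\cdot 10^{-5}$$
Hence, for large $n_{shiny}$ the highlight has practically vanished at this angle and is only visible very close to the mirror direction, while for small $n_{shiny}$ it is spread over a broad region. As a check of the mirror direction: for $\vec{n}=(0,0,1)^T$ and $\vec{s}=(\sin\theta, 0, \cos\theta)^T$, the formula gives $\vec{r}=(-\sin\theta, 0, \cos\theta)^T$, i.e. the input ray flipped around the surface normal.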
\subsubsection{Dichromatic reflection models}
\begin{itemize}
	\item The previously discussed models only consider the light source intensity for the reflection. However, we can integrate the reflection in our color models:
	$$\text{body}_C = m_b (\vec{n}, \vec{s})  \int_\lambda e(\lambda) p(\lambda) f_C(\lambda) d\lambda$$
	$$\text{surface}_C = m_s (\vec{n}, \vec{s}, \vec{v})  \int_\lambda e(\lambda) c(\lambda) f_C(\lambda) d\lambda$$
	where $C$ is a specific channel (for example $R$, $G$ or $B$), $\vec{n}$ is the surface normal, $\vec{s}$ the input ray direction and $\vec{v}$ the viewpoint. 
	\item The function $m_b$ models the diffuse body reflection (i.e. $m_b(\vec{n}, \vec{s})=\cos \theta = \vec{n}\cdot \vec{s}$ as for Lambertian) whereas $m_s$ represents the glossy surface reflection (i.e. Phong model). 
	\item The diffuse reflectance depends on the albedo of the object $p(\lambda)$ whereas $c(\lambda)$ determines the specularity of the object for certain wavelengths.
	\item The perceived color of an object is the sum of the body and the surface
	\item Our goal is to map an input image into a space which is independent of the scene (i.e. independent of $m_b$, $m_s$, ...). Different color models can help:
	\begin{itemize}
		\item \textbf{rgb}: Assuming a white light source, normalize RGB values by the intensity (i.e. $r=\frac{R}{R+G+B}$). This leads to photometric invariance for pure matte objects ($m_b$ cancels out as it is the same for all channels when assuming $m_s=0$). Note that this approach fails if an object has no color (i.e. all gray tones are mapped to the same value).
		\item \textbf{c1c2c3}: color space is obtained from RGB manipulation and is invariant to shadowing effects of light interaction particularly for matte objects. It has similar properties as rgb, but is determined by $c_1(R,G,B) = \arctan\frac{R}{\max\left\{G,B\right\}}$
		\item \textbf{HSV} can be invariant to specularity if we assume a white light source and thus white specularity. The dominant wavelength, i.e. the hue, stays the same for those points. However, note that this model is unstable for gray and especially white points that commonly occur at maximum specularity, as the hue is undefined.
		\item \textbf{l1l2l3}: Similar behavior as HSV, but calculates the values by $l_1(R,G,B) = \frac{(R-G)^2}{(R-G)^2 + (R-B)^2 + (G-B)^2}$
	\end{itemize}
	\item Figure~\ref{fig:cv_image_formation_invariance_color_spaces} summarizes some invariance properties of common color spaces
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_formation_invariance_color_spaces.png}
		\caption{Overview of invariance in color spaces.}
		\label{fig:cv_image_formation_invariance_color_spaces}
	\end{figure}
	\item Different color spaces have different instabilities. Normalized colors get unstable around black pixels ($R=1, G=0, B=0$ is considered as pure red in rgb although in RGB it is black) whereas hue is unstable for low saturation (any hue gives the same color)
	\item Another method to be invariant to shadows is filtering the image for smooth image intensity transitions as color transitions are harsh compared to that. The new image is recovered by summing up over gradients. Note that this method fails for sharp shadows and/or smooth color transitions
\end{itemize}
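The invariance of the normalized rgb space for matte objects can be verified in one line. As a sketch under the assumptions stated above ($m_s=0$), we abbreviate the channel integrals by $k_C = \int_\lambda e(\lambda) p(\lambda) f_C(\lambda) d\lambda$ (the symbol $k_C$ is introduced here only for readability), so that every channel is $C = m_b(\vec{n}, \vec{s})\cdot k_C$. Then
$$r = \frac{R}{R+G+B} = \frac{m_b(\vec{n}, \vec{s})\cdot k_R}{m_b(\vec{n}, \vec{s})\cdot \left(k_R + k_G + k_B\right)} = \frac{k_R}{k_R + k_G + k_B}$$
The geometry term $m_b$ cancels, so $r$ no longer depends on the surface normal or the light direction (shading and shadows). The same cancellation happens inside the ratio $\frac{R}{\max\left\{G,B\right\}}$ of $c_1$. As soon as $m_s \neq 0$, an additive surface term remains in numerator and denominator, and the invariance is lost.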

================================================
FILE: Computer_Vision_1/cv_imgprocessing.tex
================================================
\section{Image processing}
\begin{itemize}
	\item Apply various algorithms on image to analyze/improve the data
	\item The simplest image transformations are those independent of the spatial position (thus also called point processing), where the new image is calculated by $g=a\cdot t(f)+ b$. Examples: gamma correction ($f^{\gamma}$ with $\gamma<1$, or similarly $\log f$, to boost small, dark values more than high ones), histogram equalization
\end{itemize}
\subsection{Neighborhood processing}
\begin{itemize}
	\item The most common way to process an image is by applying filters on it. A filter is a linear weighted sum of local input values. 
	\item A convolution of image $I$ and a linear filter $h$ is calculated by $$I_{out} = I \ast h, \hspace{1mm} I_{out}(i,j) = \sum\limits_{k,l} I(i-k, j-l) \cdot h(k,l)$$
	\item Depending on the size of the filter, we might not be able to apply the filter on the pixels at the border. Thus, we extend the image to have the same output shape. Common padding methods are zero/black, mirror/copy edge or wrap around.
	\item There are a lot of different filters that can be applied on an image. Filters can for example also be used for translation if wanted/needed. 1D example: $\left[\begin{array}{ccc}
	0 & 0 & 1
	\end{array}\right]$
	\item In general, we distinguish between \textit{low}-pass filters (smoothing) and \textit{high}-pass filters (edge detection, sharpening). The frequency is thereby the change of pixel values, and the passed wavelengths describe to what the filters react the most. Note that there are also \textit{band}-pass filters (low-pass filter convolved with high-pass filter)
	\item For example, unicolor images stay mostly unchanged when they are processed by a low-pass filter. In contrast, applying a high-pass filter on such images leads to very low activations.
\end{itemize}
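The translation filter mentioned above can be checked with the convolution formula in 1D. Assuming that the filter $h=\left[\begin{array}{ccc}0 & 0 & 1\end{array}\right]$ is indexed by $k\in\{-1,0,1\}$ (the indexing convention is an assumption here), only $h(1)=1$ is non-zero, and the sum collapses to
$$I_{out}(i) = \sum\limits_{k} I(i-k)\cdot h(k) = I(i-1)$$
Every pixel takes the value of its left neighbor, i.e. the signal is shifted by one pixel to the right. For example, with zero padding, $I = \left[\begin{array}{cccc}3 & 5 & 7 & 9\end{array}\right]$ becomes $I_{out} = \left[\begin{array}{cccc}0 & 3 & 5 & 7\end{array}\right]$.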
\subsubsection{Smoothing filters}
\begin{itemize}
	\item \textit{Box filter}: replace every pixel by the average of its neighborhood. 
	$$h = \frac{1}{9}\left[\begin{array}{ccc}
	1 & 1 & 1\\
	1 & 1 & 1\\
	1 & 1 & 1\\
	\end{array}\right]$$
	Convolving a box filter repeatedly with itself results in a filter that approaches the shape of a Gaussian
	\item \textit{Gaussian filter}: weight contributions of neighboring pixels by distance: $G_\sigma = \frac{1}{2\pi \sigma^2} e^{-\frac{(x^2 +y^2)}{2\sigma^2 }}$. A $3\times 3$ Gaussian with $\sigma=0.5$ has the following values:
	$$h= \left[\begin{array}{ccc}
	0.011 & 0.084 & 0.011\\
	0.084 & 0.619 & 0.084\\
	0.011 & 0.084 & 0.011\\
	\end{array}\right]$$
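	Note that convolving a Gaussian with another Gaussian is again a Gaussian. Moreover, the 2D Gaussian is separable, as it factorizes into a product of a function of $x$ and a function of $y$. Thus, we can split a 2D Gaussian into two 1D filters which are sequentially applied on the image $\Rightarrow$ reduce computational effort per pixel from $n^2$ to $2n$ for a filter of size $n\times n$.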
	\item \textit{Sharpening filter}: reverses the process of smoothing by accentuating differences with the local average
	$$h = \left[\begin{array}{ccc}
	0 & 0 & 0\\
	0 & 2 & 0\\
	0 & 0 & 0\\
	\end{array}\right]-\frac{1}{9}\left[\begin{array}{ccc}
	1 & 1 & 1\\
	1 & 1 & 1\\
	1 & 1 & 1\\
	\end{array}\right]$$
	\item \textit{Median filter}: A non-linear filter that selects the median value in the kernel window. The advantage of this filter is that it is robust against outliers (good for filtering out salt-and-pepper noise)
\end{itemize}
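The separability of the Gaussian can be verified on the $3\times 3$ example with $\sigma=0.5$ from above. A sketch, assuming that the filter is normalized to sum to one after sampling: the 1D filter is $g = \frac{1}{1+2e^{-2}}\left[\begin{array}{ccc}e^{-2} & 1 & e^{-2}\end{array}\right] \approx \left[\begin{array}{ccc}0.107 & 0.787 & 0.107\end{array}\right]$, and the outer product $g^T g$ recovers the 2D filter:
$$g^T g \approx \left[\begin{array}{c}0.107\\0.787\\0.107\end{array}\right]\cdot \left[\begin{array}{ccc}0.107 & 0.787 & 0.107\end{array}\right] \approx \left[\begin{array}{ccc}
0.011 & 0.084 & 0.011\\
0.084 & 0.619 & 0.084\\
0.011 & 0.084 & 0.011\\
\end{array}\right]$$
Filtering the image first with $g$ along the rows and then with $g^T$ along the columns therefore gives the same result as the 2D filter, but needs only $3+3=6$ instead of $9$ multiplications per pixel.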
\subsubsection{Edge detection filters}
\begin{itemize}
	\item \textit{Simple gradient filter}: The simplest gradient/edge detector is in 1D: $h = \left[\begin{array}{cc}-1 & 1\end{array}\right]$ 
	\item \textit{Sobel filter}: a derivative filter that also takes nearby pixels into account for better approximation. $h_x$ detects vertical edges (gradients over $x$-direction) and $h_y$ detects horizontal edges.
	$$h_x = \left[\begin{array}{ccc}
	1 & 0 & -1\\
	2 & 0 & -2\\
	1 & 0 & -1\\
	\end{array}\right] \text{\hspace{5mm}and\hspace{5mm}}h_y = \left[\begin{array}{ccc}
	1 & 2 & 1\\
	0 & 0 & 0\\
	-1 & -2 & -1\\\end{array}\right] $$ 
	\item \textit{Derivative of a Gaussian}: the derivative of a Gaussian is highly suitable for edge detection as it represents a band-pass filter (a Gaussian filter convolved with a discrete gradient filter, although the derivative is mostly calculated analytically on the continuous function). Similar to Sobel, but weights the nearby pixels slightly differently. Note that we also have different filters for $x$ and $y$ direction.
	\item \textit{Laplacian of Gaussian}: Laplacian operator $\nabla^2 f = \frac{\partial^2 
	f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$ applied on a Gaussian. Is invariant to the direction of the gradient (circular symmetric). The shape of the function is also often described as a Mexican hat (see Figure~\ref{fig:cv_image_processing_gaussian_filters}). Is highly responsive to blobs (blob detection) but is sensitive to the scale. To be invariant of the scale, we can apply multiple LoG filters with different values of $\sigma$ and stack the results together.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/cv_image_processing_gaussian_filters.png}
		\caption{Visualization of different Gaussian filters.}
		\label{fig:cv_image_processing_gaussian_filters}
	\end{figure}
\end{itemize}
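The statement that the Sobel filter takes nearby pixels into account can be made explicit by factorizing it into a smoothing and a derivative part:
$$h_x = \left[\begin{array}{c}1\\2\\1\end{array}\right]\cdot \left[\begin{array}{ccc}1 & 0 & -1\end{array}\right], \hspace{5mm} h_y = \left[\begin{array}{c}1\\0\\-1\end{array}\right]\cdot \left[\begin{array}{ccc}1 & 2 & 1\end{array}\right]$$
The vector $\left[\begin{array}{ccc}1 & 2 & 1\end{array}\right]$ is an (unnormalized) smoothing filter, while $\left[\begin{array}{ccc}1 & 0 & -1\end{array}\right]$ is a central-difference derivative. Hence, $h_x$ differentiates along $x$ and smooths along $y$, which makes it less sensitive to noise than the simple gradient filter.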
\subsection{Harris corner detector}
\begin{itemize}
	\item Detect interest points in an image to perform matching or similar tasks. Corners are suitable to serve as interest points as they have a unique 2D position compared to edges and points
	\item The initial idea is derived from performing autocorrelation on a small window of the image, and testing which windows are unique/expressive. We shift the window by small amounts in $x$ and $y$ direction and measure how much the image content changes. Based on that information, we can decide whether a pixel represents a corner or not.
	\item Steps in the Harris corner detector
	\begin{enumerate}
		\item Compute the derivatives $I_x$ and $I_y$ of the image
		\item Compute the products of the derivatives at every pixel: $I_x^2$, $I_y^2$, $I_{xy}=I_{x}\cdot I_{y}$ 
		\item Compute sums of products over the window size and align them in the Harris matrix:
		$$H = \left[\begin{array}{cc}
		\sum_W I_x^2 & \sum_W I_x \cdot I_y\\
		\sum_W I_x \cdot I_y & \sum_W I_y^2 \\
		\end{array}\right]$$
		Note that the sum represents the application of a box filter. It is equally possible to apply Gaussian filters etc. 
		\item Determine the response of the detector at each pixel:
		$$R = \det(H) - k\cdot \left(\text{trace}(H)\right)^2$$
		\item If $|R|$ is small, the region is probably flat. Otherwise, if $R<0$ (with $|R|$ above a certain threshold) we have an edge, and $R>0$ indicates a corner.
		\item Perform non-maximum suppression if corner detector is calculated pixel-wise.
	\end{enumerate}
	\item Determining the \textit{cornerness} of a point is based on the eigenvalues of the matrix $H$: $R=\lambda_1 \lambda_2 - k\cdot (\lambda_1 + \lambda_2)^2$. The maximum eigenvalue is the gradient of the direction with the fastest change, and the minimum eigenvalue the gradient of the direction with the smallest change. Note that this models an ellipse for the gradients (see Figure~\ref{fig:cv_image_processing_harris_ellipse})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.25\textwidth]{figures/cv_image_processing_harris_ellipse.png}
		\caption{Visualization of relation between eigenvalues and gradients.}
		\label{fig:cv_image_processing_harris_ellipse}
	\end{figure}
	% cv_image_processing_harris_ellipse.png
	\item If we have an edge, one eigenvalue is considerably greater than the other as in one direction we have a large gradient, whereas in the other (90 degrees) the pixels stay the same. Here, $R$ is smaller than 0 as $\lambda_1\lambda_2$ is small but $\lambda_1 + \lambda_2$ is large.
	\item Thus, we only have a corner if in both directions we have a (equally) high change. In that case, $R$ is positive as $\lambda_1\lambda_2$ is large.
	\item Other properties of the Harris Corner detector
	\begin{itemize}
		\item Partial invariance to \textit{affine intensity} change. As only derivatives are used, a bias term $I+b$ does not influence result. When multiplying an image by a factor $I\cdot a$, we scale the eigenvalues and thus the cornerness as well. We therefore might only have to adapt the threshold.
		\item \textit{Rotation invariant} as only the ellipse rotates but the eigenvalues stay the same
		\item \textit{Scaling sensitive}: The Harris corner detector is sensitive to scale as it usually applies LoG/Derivatives of Gaussians for determining $I_x$ and $I_y$. To make the corner detector invariant to scale, we can apply multiple gradient filters with different values for $\sigma$ and stack them together (3D output instead of 2D). We then perform the detector on various scales, and take in the end the maximum response over scales for every pixel.
	\end{itemize}
	\item Applications
	\begin{itemize}
		\item \textit{Image stitching} as for combining separate photos into a panorama. We therefore detect interest points in all images, and try to match those (description by e.g. SIFT/histogram/...)
		\item \textit{Object recognition} by comparing local features that were found for a specific object with the ones from another image.
	\end{itemize}
\end{itemize}
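The three cases of the response $R=\lambda_1 \lambda_2 - k\cdot (\lambda_1 + \lambda_2)^2$ can be illustrated with made-up eigenvalues. As a sketch, we assume $k=0.04$, a commonly used value that is not specified in the text above:
\begin{equation*}
	\underbrace{\lambda_1 = \lambda_2 = 0 \Rightarrow R = 0}_{\text{Flat region}}, \hspace{5mm} \underbrace{\lambda_1 = 10, \lambda_2 = 0 \Rightarrow R = 0 - 0.04\cdot 10^2 = -4}_{\text{Edge}}, \hspace{5mm} \underbrace{\lambda_1 = \lambda_2 = 10 \Rightarrow R = 100 - 0.04\cdot 20^2 = 84}_{\text{Corner}}
\end{equation*}
Since $R$ only requires the determinant and the trace of $H$, the eigenvalues never have to be computed explicitly.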

================================================
FILE: Computer_Vision_1/cv_intro.tex
================================================
\section{Introduction}
\subsection{Challenges in Computer Vision}
\begin{itemize}
	\item 
\end{itemize}

================================================
FILE: Computer_Vision_1/cv_object_rec.tex
================================================
\section{Object recognition}
\begin{itemize}
	\item Challenges in object recognition
	\begin{itemize}
		\item Huge dimensionality (large input size)
		\item Image formation process (see Section~\ref{sec:img_formation})
		\item Images are stationary signals and share features, which have to be distinguished from noise
	\end{itemize}
	\item Hard to define explicit rules, but easy to collect examples $\Rightarrow$ Machine learning
\end{itemize}
\subsection{Image representations}
\begin{itemize}
	\item Need to find an image representation that is able to capture the semantics of an image and hence makes it easy to recognize objects
	\item For normal pixel values, the euclidean distance does not reflect the similarity of images well. A change of illumination or translation has a huge impact on the metric although it is the same object
	\item Global histograms over whole image are scale and translation invariant, but are not really distinctive (different images have same histogram)
	\item The best way is to find \textit{local features} that images share. They are more descriptive and reoccur in different images. In the next step, we have to describe these features to get a final representation. 
	\item One way to describe them is using SIFT (Scale-invariant feature transform) which creates a local histogram of gradients in the neighborhood (see Figure~\ref{fig:SIFT})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/cv_object_detection_SIFT.png}
		\caption{SIFT descriptors for a $2\times 2$ histogram patch (normally $4\times 4$, see Figure~\ref{fig:descriptor_SIFT}).}
		\label{fig:SIFT}
	\end{figure}
\end{itemize}
\subsubsection{Histogram of Oriented Gradients (HoG)}
\begin{itemize}
	\item A HoG descriptor abstracts a patch by a histogram of gradient orientations.
	\item The steps for calculating a HoG descriptor for a given patch are
	\begin{enumerate}
		\item Determine the pixel-wise gradients $I_x$ and $I_y$, e.g. by applying a Sobel filter (or simply a $[1,-1]$ derivative filter)
		\item Determine the orientation $\theta = \arctan \frac{I_y}{I_x}$ (using the signs of $I_x$ and $I_y$ to select the quadrant) and the magnitude $I=\sqrt{I_x^2 + I_y^2}$ of the pixel-wise gradients
		\item Collect the gradients in a histogram. For example, for a 9-bin histogram, we map every gradient to the closest value of $0^{\circ}$, $45^{\circ}$, $90^{\circ}$ etc. Note that the 9th bin is for zero gradients, which have no orientation.
	\end{enumerate}
	\item An improvement over simply counting the gradients is to weight them by their magnitude, or to use soft counting (a gradient at $30^{\circ}$ contributes to both the $0^{\circ}$ and the $45^{\circ}$ bin).
	\item A disadvantage of HoG is that it is not rotation invariant
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/cv_image_processing_HoG.jpg}
		\caption{The HoG descriptor takes the gradients in a patch and groups them into a histogram of orientations.}
		\label{fig:descriptor_HoG}
	\end{figure}
\end{itemize}
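\begin{itemize}
	\item As a small worked example of the steps above (the numbers are chosen for illustration and are not part of the original notes), consider a pixel with $I_x = 3$ and $I_y = 4$:
	$$\theta = \arctan\frac{4}{3} \approx 53.1^{\circ}, \hspace{5mm} \sqrt{I_x^2 + I_y^2} = \sqrt{9 + 16} = 5$$
	With hard counting and bins every $45^{\circ}$, this gradient adds one count to the $45^{\circ}$ bin. With magnitude weighting and linear interpolation between the two neighboring bins (one possible soft counting scheme, assumed here), it contributes
	$$5 \cdot \frac{90 - 53.1}{45} \approx 4.1 \hspace{2mm}\text{to the } 45^{\circ} \text{ bin and}\hspace{2mm} 5 \cdot \frac{53.1 - 45}{45} \approx 0.9 \hspace{2mm}\text{to the } 90^{\circ} \text{ bin.}$$
\end{itemize}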
\subsubsection{Scale Invariant Feature Transform (SIFT)}
\begin{itemize}
	\item SIFT is a combination of detector and descriptor which is (mostly) both rotation and scale invariant
	\item The first step of SIFT is computing a scale-invariant response map. This is done by extracting features with LoG (in practice DoG, which is cheaper to compute) on various scales (see Figure~\ref{fig:SIFT_scale_invariance})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_processing_SIFT_scales.png}
		\caption{SIFT: response maps on multiple scales and the detected local maxima.}
		\label{fig:SIFT_scale_invariance}
	\end{figure}
	\item We now look for local maxima in terms of both scale and location. This means that we search for points whose response is higher than that of all neighbors in the $x$-$y$ plane and in the adjacent scales (see Figure~\ref{fig:SIFT_scale_invariance}, green points on the right) $\Rightarrow$ non-maximum suppression
	\item Given these points, we check their \textit{cornerness}. Only at those points do we need to calculate the gradients and the matrix $\bm{H}$. The ratio of its eigenvalues is tested via trace and determinant:
	$$\frac{\text{Tr}(\bm{H})^2}{\text{Det}(\bm{H})} < \frac{(r+1)^2}{r}$$
	Here, $r$ is the maximum allowed ratio between the larger and the smaller eigenvalue, so the term $(r+1)^2/r$ simply acts as a threshold.
	\item To guarantee rotation invariance, we look for the dominant gradient orientation in the patch. This is done by creating a histogram of gradient orientations over the whole patch (weighted by the magnitudes of these gradients) and taking the orientation with the highest value as the orientation of the patch. If other orientations reach at least 80\% of the value of the dominant orientation, we create an additional descriptor for each of them.
	\item Once a point is selected as a key-point, we group the gradients of small regions into histograms and combine them into a $4\times 4$ grid of histograms. Note that all gradients are expressed relative to the orientation of the key-point. The final descriptor then has 128 features ($4\times 4$ histograms with $8$ bins each, see Figure~\ref{fig:descriptor_SIFT})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_processing_SIFT.jpg}
		\caption{A SIFT descriptor with $4\times 4$ histogram patch.}
		\label{fig:descriptor_SIFT}
	\end{figure}
\end{itemize}
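\begin{itemize}
	\item A short derivation of the cornerness threshold above, under the assumption that $\bm{H}$ has eigenvalues $\lambda_1 \geq \lambda_2 > 0$ with ratio $\rho = \lambda_1/\lambda_2$ (the symbol $\rho$ is introduced here for illustration only):
	$$\frac{\text{Tr}(\bm{H})^2}{\text{Det}(\bm{H})} = \frac{(\lambda_1 + \lambda_2)^2}{\lambda_1 \lambda_2} = \frac{(\rho\lambda_2 + \lambda_2)^2}{\rho\lambda_2^2} = \frac{(\rho+1)^2}{\rho}$$
	This expression grows monotonically for $\rho \geq 1$, so the test is equivalent to $\rho < r$ and the eigenvalues never have to be computed explicitly. For example, $r = 10$ (a commonly used value) gives the threshold $11^2/10 = 12.1$.
	\item The descriptor length follows directly from the grid layout: $4 \cdot 4 \cdot 8 = 128$.
\end{itemize}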
\subsection{Bag-of-Words}
\begin{itemize}
	\item One approach for image representation is the visual Bag-of-Words (BoW). To this end, we split an image into patches, describe each of these patches by one ``visual word'' (a patch in our dictionary), and finally create a histogram of the visual words (see Figure~\ref{fig:BoW_pipeline})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_object_detection_BoW_pipeline.png}
		\caption{General pipeline for BoW approach.}
		\label{fig:BoW_pipeline}
	\end{figure}
	\item There are four components of BoW for which a design choice has to be made
	\begin{itemize}
		\item \textbf{Patch sampling}: which patches should be used to describe an image. We can use descriptive patches/interest points, but then the number of patches can differ significantly from image to image. Alternatively, we can perform a grid-like selection of patches (\textit{dense sampling}) on multiple scales (reduce the size of the image and sample again).
		\item \textbf{Patch description}: describe the patches/visual words by SIFT, RGB, HoG or similar. This choice goes hand in hand with the image representation.
		\item \textbf{Visual dictionary}: create a dictionary by sampling a lot of patches from a large set of images (training images), and cluster them in their descriptor space to find distinctive patches. Use these clusters as visual words. Different clustering methods can be applied, but one hyper-parameter is usually the number of clusters. A high number of clusters gives very distinctive, but noise-sensitive patches, whereas a low number of clusters gives general, but less distinctive patches.
		\item \textbf{Histogram creation}: the simplest approach is finding the nearest prototype/visual word for every sampled patch of the image, e.g. by L2 distance on the descriptors, and recording the number of occurrences of each visual word. There are many (more advanced) alternatives that, for example, take the distance to the cluster means into account, or record the mean and standard deviation etc.
	\end{itemize} 
	\item Advantages and drawbacks of visual BoW
	\begin{itemize}
		\item[+] Translation invariant
		\item[+] Fixed length feature vector 
		\item[$-$] Loss of spatial information
		\item[$-$] Quantization loses information (mapping to visual words)
	\end{itemize}
	\item In order to keep some spatial information, we can compute the histogram on multiple spatial scales (spatial pyramid) and concatenate the results into one output feature vector. Another approach is to add the spatial information ($xy$-position) as additional features to the patch descriptor, and use them during matching/clustering.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/cv_object_detection_BoW_spatial_pyramid.png}
		\caption{Spatial pyramid for histogram creation. We concatenate all histograms into a longer feature vector.}
		\label{fig:BoW_spatial_pyramid}
	\end{figure}
\end{itemize}
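\begin{itemize}
	\item A toy example of the histogram creation step (all numbers are made up for illustration): assume a dictionary with $K=4$ visual words and an image with $6$ sampled patches whose nearest visual words are $(1, 3, 3, 2, 3, 1)$. Counting the occurrences gives the histogram
	$$\bm{h} = (2, 1, 3, 0), \hspace{5mm} \frac{\bm{h}}{\sum_k h_k} = \left(\frac{1}{3}, \frac{1}{6}, \frac{1}{2}, 0\right)$$
	where the normalized version is independent of the number of sampled patches.
	\item For a spatial pyramid, the length of the feature vector grows with the number of regions. Assuming three levels with $1\times 1$, $2\times 2$ and $4\times 4$ regions (a common, but not the only possible layout), we concatenate
	$$(1 + 4 + 16) \cdot K = 21K$$
	features, i.e. $84$ for the toy dictionary with $K=4$.
\end{itemize}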
\subsubsection{Bag of Words for Retrieval}
\begin{itemize}
	\item We can compare images for the retrieval task by their BoW histograms. This is more efficient and faster than matching every single interest point between two images.
	\item Offline, we have to create the BoW vocabulary and determine a histogram for every image in our database
	\item When an image is entered as a query, we need to represent it by its BoW histogram and then compare it with the histograms of all database images.
	\item We can apply other techniques from IR as well, like TF-IDF, query expansion, stop word removal, inverted file index, ...
	\item To guarantee a good performance for the first retrieved examples, we can rerank the top $k$ by using geometric verification (detect interest points and try to match those)
\end{itemize}
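\begin{itemize}
	\item As an illustration of how TF-IDF carries over to visual words, one common form of the weighting is sketched below. The notation is introduced here and is not taken from the original notes: $n_{i,d}$ is the number of occurrences of visual word $i$ in image $d$, $n_d$ the total number of visual words in image $d$, $N$ the number of images in the database, and $N_i$ the number of database images that contain word $i$.
	$$t_{i,d} = \frac{n_{i,d}}{n_d} \cdot \log\frac{N}{N_i}$$
	For example, a word that occurs $5$ times among $100$ patches of an image and appears in $10$ out of $1000$ database images gets the weight $0.05 \cdot \log 100 \approx 0.23$ (natural logarithm). A word that appears in every image gets $\log 1 = 0$ and is thereby ignored, similar to a stop word.
\end{itemize}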
\subsection{Object detection}
\begin{itemize}
	\item Localization of objects in an image. Often approximated by bounding boxes that should be predicted around the object.
	\item A simple sliding window approach is too expensive as it generates 1) a lot of boxes over 2) a lot of scales with 3) different box ratios/shapes and 4) many classes.
	\item Hence, the first challenge is to find a set of relevant boxes that contain an ``object'' (also called \textit{candidate boxes}, each graded by an objectness score), and in a second step to determine the class of the object in each of these candidate boxes
	\item One approach for that is \textbf{selective search}, which exploits the hierarchical structure of images
	\begin{itemize}
		\item Segment the image into small fragments with a simple segmentation approach. Generate a candidate box for each of them
		\item Iteratively merge the two most similar fragments and add a box for the merged fragment as well. Repeat until only one region is left
		\item Apply a classifier on those candidate boxes
	\end{itemize}
	\item A general pipeline for object detection is shown in Figure~\ref{fig:cv_object_detection_BB_pipeline}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/cv_object_detection_BB_pipeline.png}
		\caption{Pipeline for object detection with bounding boxes.}
		\label{fig:cv_object_detection_BB_pipeline}
	\end{figure}
	% cv_object_detection_BB_pipeline.png
\end{itemize}
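\begin{itemize}
	\item A rough count shows why the sliding window approach is expensive (image size, window size and stride are assumptions chosen for illustration). For an image of size $W\times H$, a window of size $w\times h$ and a stride of $s$ pixels, the number of boxes for a single scale and box shape is
	$$\left(\left\lfloor\frac{W-w}{s}\right\rfloor + 1\right) \cdot \left(\left\lfloor\frac{H-h}{s}\right\rfloor + 1\right)$$
	With $W\times H = 640\times 480$, $w\times h = 64\times 128$ and $s = 8$, this gives $73 \cdot 45 = 3285$ boxes. Repeating this over, say, $10$ scales and $3$ box shapes already leads to a number of boxes on the order of $10^5$, each of which has to be scored for every class.
\end{itemize}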

================================================
FILE: Computer_Vision_1/cv_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage[makeroom]{cancel}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}
\definecolor{colkeyword}{rgb}{0,0.4,0}
\definecolor{colname}{rgb}{0.4,0.4,0}
\definecolor{coltype}{rgb}{0.4,0,0.4}
\definecolor{coloperators}{rgb}{0,0,1.0}
\definecolor{colscopes}{rgb}{0.4,0,0}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Computer Vision 1}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

% \input{cv_intro.tex}
\input{cv_imgformation.tex}
\input{cv_imgprocessing.tex}
\input{cv_object_rec.tex}
\input{cv_deep_learning.tex}
\input{cv_deep_video.tex}
\input{cv_applications.tex}
\appendix
\newpage
\input{cv_appendix.tex}
\end{document}

================================================
FILE: Deep_Learning/cheat_sheet/main.tex
================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MatPlotLib and Random Cheat Sheet
%
% Edited by Michelle Cristina de Sousa Baltazar
%
% http://matplotlib.org/api/pyplot_summary.html
% http://matplotlib.org/users/pyplot_tutorial.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[a4paper]{article}
\usepackage[landscape]{geometry}
\usepackage{url}
\usepackage{multicol}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{tikz}
\usetikzlibrary{decorations.pathmorphing}
\usepackage{amsmath,amssymb}

\usepackage{colortbl}
\usepackage{xcolor}
\usepackage{mathtools}
\usepackage{amsthm, amsmath, amssymb, amsfonts}
\usepackage{enumitem}

\title{Deep Learning cheat sheet}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{bm}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\loss}[0]{\mathcal{L}}
\newcommand{\chain}[3]{\frac{\partial #1}{\partial #2}\frac{\partial #2}{\partial #3}}
\newcommand{\eq}[1]{\begin{equation*}\begin{split}#1\end{split}\end{equation*}}
\newcommand{\coderef}[0]{Please find the implementation in the folder with the code files.}
\newcommand{\TODO}[1]{\textbf{\textcolor{red}{#1}}}

\definecolor{green}{RGB}{0,160,0}
\definecolor{blue}{RGB}{0,0,160}
\definecolor{red}{RGB}{160,0,0}
\definecolor{orange}{RGB}{200,160,0}
\definecolor{purple}{RGB}{170,0,200}
\definecolor{cyan}{RGB}{0,200,200}
\definecolor{lightred}{RGB}{200,50,50}

\advance\topmargin-0.9in
\advance\textheight3in
\advance\textwidth3in
\advance\oddsidemargin-1.5in
\advance\evensidemargin-1.5in
\parindent0pt
\parskip2pt
\newcommand{\hr}{\centerline{\rule{3.5in}{1pt}}}
%\colorbox[HTML]{e4e4e4}{\makebox[\textwidth-2\fboxsep][l]{texto}
\begin{document}
\footnotesize
\begin{multicols*}{3}

\tikzstyle{mybox} = [draw=black, fill=white, very thick,
    rectangle, rounded corners, inner sep=10pt, inner ysep=10pt]
\tikzstyle{fancytitle} =[fill=black, text=white, font=\bfseries]
%------------ CONTEÚDO CAIXA RANDOM ---------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
	\underline{Definition}: A family of \textcolor{green}{parametric}, \textcolor{lightred}{non-linear} and \textcolor{blue}{hierarchical} \textcolor{orange}{representation learning functions}, which are \textcolor{red}{massively optimized with stochastic gradient descent} to \textcolor{purple}{encode domain knowledge}, i.e. domain invariances, stationarity.\\
	\vspace{-3mm}
	\begin{itemize}[leftmargin=4mm]
		\setlength\itemsep{0.0em}
		\item Neural Network is a directed acyclic graph		
		% \item Every module can be expressed by $a=h(x;w)$
		\item Use loss function that matches output distribution to improve numerical stability and make gradients larger
		\item Input and output distribution of every module should be the same to prevent inconsistent behavior and harder learning
	\end{itemize}
	\underline{Backprop}: chain rule $\pd{z}{x_i}=\sum_j \chain{z}{y_j}{x_i}$, $\nabla_{\bm{x}} \bm{z} = \left(\pd{\bm{y}}{\bm{x}}\right)^T \cdot \nabla_{\bm{y}} \bm{z}$
	\vspace{-1mm}
	\begin{enumerate}[leftmargin=4mm]
	\setlength\itemsep{0.2em}
	\item Compute forward: $a^{(l)} = h^{(l)}\left(x^{(l)}\right)$, $x^{(l+1)}=a^{(l)}$
	\item Compute reverse: $\pd{\loss}{a^{(l)}} = \left(\pd{a^{(l+1)}}{x^{(l+1)}}\right)^T \cdot \pd{\loss}{a^{(l+1)}}$\\$\pd{\loss}{\theta^{(l)}} = \pd{a^{(l)}}{\theta^{(l)}} \cdot \left(\pd{\loss}{a^{(l)}}\right)^T$
	\item Update params: $\theta^{(l)}_{t+1} = \theta^{(l)}_{t}-\eta \nabla_{\theta_t^{(l)}}\loss$
	\end{enumerate}
		
%	\begin{center}\small{\begin{tabular}{lp{4.5cm} l}
%		\textit{random():} & obtém o próximo número aleatório no intervalo [0.0, 1.0] \\ \hline
%		\textit{random(começo,fim):} & obter o próximo número aleatório no intervalo [começo, fim] \\ \hline
%		\textit{random(stop):} & obtém o próximo número aleatório no intervalo [0, fim]
%	\end{tabular}}\end{center}
    \end{minipage}
};
\node[fancytitle, right=10pt] at (box.north west) {Modular Learning};
\end{tikzpicture}


%------------ CONTEÚDO CAIXA MatPlotLib ---------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
		\underline{Pure optimization}: very direct goal to optimize (e.g. scheduling). ML wants to optimize test error that is intractable/only indirectly optimizable. Reduce a different cost function on training set, optimum might not be optimal for test set (overfitting).\\[4pt]
		\underline{Gradient descent}: dataset mostly too large, slow, not better optimum/faster convergence. \underline{SGD}: standard error $\sigma/\sqrt{m}$, noisy gradients act as regularizer, dynamically changing data possible. \\[4pt]
		\underline{Ill conditioning}: if 2nd order change is greater than 1st ($\frac{1}{2}\epsilon^2 g^THg>\epsilon g^Tg$), loss increases. Later in training, reduce lr\\
		\underline{Pathological curvatures}: ravine region in loss surface, high gradients in suboptimal direction, oscillations, slow convergence\\[4pt]
		\underline{Hessian}: requires large batch to be accurate, hard to compute\\
		\underline{Momentum}: maintain momentum from previous updates to dampen oscillations: $u_{t+1}=\gamma u_t - \eta_t g_t$, $w_{t+1}=w_t+u_{t+1}$. Exponential averaging $\Rightarrow$ more robust gradients, faster\\
		\underline{Nesterov momentum}: evaluate gradient at look-ahead position $w_t + \gamma u_t$ instead of $w_t$, better in theory.\\[2pt]
		\underline{RMSprop}: adaptive lr, exp. averaging over norms, assuming directions of sensitivity axis aligned. $r_t = \alpha \cdot r_{t-1} + \left(1 - \alpha\right) \cdot g_t^2$, $\eta_t = \frac{\eta}{\sqrt{r_t} + \epsilon}$, $w_{t+1} = w_{t} - \eta_t \cdot g_t$\\[3pt]
		\underline{AdaGrad}: adaptive lr, \textit{sums} norm, thus based on scale and frequency, bad for nonconvex. $r_t = r_{t-1} + \text{diag}(g^2_t)$\\ [3pt]
		\underline{Adam}: Combine adaptive lr and momentum (applied on unscaled gradients). Bias correction to account for init at origin. \\[4pt]
		\underline{Bayesian optimization}: gradient-free, educated trial and error guesser, determine next point based on uncertainty and expectation\\
		\underline{Normalization}: center data around 0, same variance, allows higher learning rate and better learning. \textit{BatchNorm}: ensure Gaussian distribution of features over batches. $\hat{y}_i = \gamma \cdot \hat{x}_i + \beta$\\$\mu_B = \frac{1}{m} \sum\limits_{i=1}^{m} x_i$, $\sigma_B^2 = \frac{1}{m} \sum\limits_{i=1}^{m} \left(x_i - \mu_B\right)^2$, $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$\\
		Reduces 2nd-order effects between layers, acts as regularizer by introducing noise, lets network control mean and variance.\\
		During testing, use moving average of $\mu_B$, $\sigma_B^2$ from last training steps
    \end{minipage}
};
%------------ CAIXA PRELIMINARES ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Learning Optimization (1)};
\end{tikzpicture}
%------------ CONTEUDO EXEMPLO BASICO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
    	\underline{Regularization}: objective during training to reduce test error\\
    	\textit{$\ell_2$}: add term $\frac{\lambda}{2}\sum_l ||w_l||_2^2$ to objective, weight decay for SGD\\
    	\textit{$\ell_1$}: sparse weights with $\lambda\sum_l ||w_l||_1$\\
    	\textit{Others}: Dropout, Early stopping, Augmentation, Multitask\\[4pt]
    	\underline{Weight initialization}: small weights to keep data at origin, large to have strong gradients, preserve variance of activations ($w\sim \mathcal{N}(0,\sigma^2)$ with $\sigma=\sqrt{1/d}$), no learning if all same, prevent dead ReLU
    \end{minipage}
};
%------------ EXEMPLO BASICO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Learning Optimization (2)};
\end{tikzpicture}
%------------ CONTEUDO DOIS EIXOS ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
		Images stationary signals with spatial structure and huge dimensionality. Dimensions highly correlated (translation inv)\\[4pt]
		\underline{Transfer Learning}: use large datasets to learn useful features, prevent overfitting, fine-tune fewer layers if datasets similar, use lower lr for pre-trained layers as close to optimum\\[4pt]
		 \underline{Architectures}: small filter for less params and higher non-linearity (even $n\times1$/$1\times n$), different scales on same input (stack of convs prone to overfitting), counter vanishing gradients with intermediate classifiers or residual connections (learn difference instead of mapping) $H(x)=x+F(x)$, possibly with gates\\[4pt]
		 \underline{Tracking}: \textit{Fast R-CNN} based on middle feature map, extract BB (selective search, NN for \textit{Faster R-CNN}). RoI pooling to get fixed-size output. \textit{Siamese}: train on similarity of BB patches.
    \end{minipage}
};
%------------ DOIS EIXOS BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {(Modern) Convolutional Neural Networks};
\end{tikzpicture}
%------------ CONTEÚDO COMANDOS DE TEXTO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
    \underline{Backprop through time}: gradients of weights on memory $W$: $\pd{\loss}{W} = \sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\left(\prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}\right)\frac{\partial^{+}c_k}{\partial W}$\\
    Formulating RNN as $c_t = W \cdot \sigma(c_{t-1}) + U \cdot x_{t-1}$ leads to:\\ $\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert \leq \left\lVert W^T\right\rVert \cdot \left\lVert \text{diag}\left(\pd{\sigma\left(c_t\right)}{c_t}\right)\right\rVert$. If norm of non-linearity bounded by $\gamma$, and $\left\lVert W^T\right\rVert < 1/\gamma$, then vanishing gradients. If $\left\lVert W^T\right\rVert \gg 1/\gamma$ and non-linearity not zero, then exploding gradients. Quick fix for second: clip gradient norm\\[4pt]
    \underline{LSTM}: Prevent vanishing gradient by gated skip connections over time. Forget, output, and input+candidate gate\\[4pt]
    \underline{GNN}: \textit{Deep Walk}: latent repr. by random walks, skip gram on sequences, not dynamic. \textit{GraphSage}: aggregate information from neighbors, can be mean/max pool with weights, LSTM.
    \textit{GCN}: $h(H^{(l)}, A) = \sigma\left(D^{-1/2}\hat{A}D^{-1/2} H^{(l)}W^{(l)}\right)$
    \end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Recurrent Neural Networks};
\end{tikzpicture}
%------------ CONTEÚDO COMANDOS DE TEXTO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Generative modeling}: learn joint probability $p(x,y)$ or density function $p(x)$. Task can be performed by Bayes: $p(y|x)$. Generalizes better, better modeling of causal relations, out-of-distribution detection $p(y|x)p(x)$ with $p(x)$ low. \textit{Discriminative modeling}: learn pdf $p(y|x)$, task-oriented and mostly better\\[4pt]
	\underline{Applications}: RL simulator, creating missing data (pixel patches), super-res., data augm., cross-modal transl. (sketch to img)\\[4pt]
	\underline{Types}: \textit{Explicit density}: maximize log likelihood of data by modeling pdf. Must be complex enough and computationally tractable. \textit{Implicit density}: no explicit pdf, only sampling mechanism
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Generative Models (1)};
\end{tikzpicture}
%------------ CONTEÚDO COMANDOS DE TEXTO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{GAN}: implicit model, adversarial training. Mini-max game:\\ $\min_G \max_D V(G,D) = \mathbb{E}_{\bm{x}\sim p_{r}(\bm{x})} \left[\log \left(D\left(\bm{x}\right)\right)\right] + \mathbb{E}_{\bm{z}\sim p_{z}(\bm{z})} \left[\log\left(1 - D\left(G\left(\bm{z}\right)\right)\right)\right] $. Better loss for generator: $-\log D(G(z))$. Otherwise vanishing gradients if D too strong.\\[2pt]
	\textit{Problems}: reaching equilibrium (oscillation around Nash), mode collapse if $\partial \loss / \partial z\approx 0$, low dimensional support (JS assumes overlap of distributions).\\[2pt]
	\textit{Improvements}: WGAN using Earth-Mover's distance (also good for non-overlapping), usage of labels $y$ like in conditional GANs, label smoothing for overconfident D, Virtual BatchNorm with reference batch to reduce intra-batch dependence\\[4pt]
	\underline{Boltzmann machines}: Pdf based on energy function we learn: $p(x)=1/Z \exp(-E(x))$ where $Z=\sum_{x'} \exp(-E(x'))$. $Z$ complex, $2^{n}$ pos. for binary data. Restrain to pairwise relations: $E(x)=-x^TWx-b^Tx$. \textit{Restricted BM}: reduce $W$ by introducing $h$ latents: $E(x,h)=-x^TWh-b^Tx-c^Th$, $p(x)=1/Z\sum_{h'} \exp(-E(x,h'))$, higher-order relations. Can reformulate to $p(h_j|x,\theta)=\sigma(W_{:,j}^Tx+c_j)$, $p(x_i|h,\theta)=\sigma(W_{i,:}h+b_i)$. Maximize log likelihood by contrastive divergence. Sample $h_0\sim p(h|x)$, $x_1\sim p(x|h_0)$, and so on.\\[4pt]
	\underline{VAE}: Model $p(x,z)=p(x|z)p(z)$. Goal is to maximize $p(x)=\int p(x,z)dz$ which is intractable. Use ELBO instead:\\
	$\log p(x) \geq \mathbb{E}_{q_{\varphi}(z|x)}\left[\log p_{\theta}(x|z)\right] - \text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right)$\\
	Gap to $\log p(x)$ is $\text{KL}\left(q_{\varphi}(z|x)||p(z|x)\right)$. \textit{Reparameterization trick}: sample from external dist., and transform it to own. For Gaussian: $z=\mu_q + \epsilon \cdot \sigma_q$ with $\epsilon\sim\mathcal{N}(0,1)$. Allows backprop through model params, lower variance than REINFORCE.\\
	\textit{Improvements}: $q(z|x)$ with NF on top, ELBO is extended by NF term. Optimize prior $p_{\lambda}(z)=\frac{1}{K}\sum_k q_{\varphi}\left(z|u_k\right)$, $u_k$ trained\\[4pt]
	\underline{NF}: Model $p(x)$ directly with series of invertible transformations shifting probability mass. Math expression of NF:\\
	 $x = z_K = f_K \circ f_{K-1} \circ ... \circ f_1 (z_0) \to z_i = f_i(z_{i-1})$\\
	 $p(z_i) = p(z_{i-1}) \cdot \left|\det \pd{f_{i}^{-1}}{z_i}\right| \implies p(x) = p(z_0) \cdot \prod_{i=1}^{K} \left|\det \pd{f_{i}^{-1}}{z_i}\right|$\\
	 $\log p(x) = \log p(z_0) - \sum_{i=1}^{K} \log \left|\det \pd{f_{i}}{z_{i-1}}\right|$\\
	 $f$ must be invertible and have a Jacobian with simple $\det$ (triangular)
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Generative Models (2)};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	Hold dist. per latent variable instead of single val. \textit{Benefits}: ensemble modeling (better acc), uncertainty estimates, prevent overconfidence, model compression (prior towards 0)\\[4pt]
	\underline{Epistemic uncertainty}: dataset limits, unseen data, important for safety-critical and small datasets. Posterior $p(w|x,y)$ intractable. \textit{MC dropout}: apply DP during test (Bernoulli-dist over weights). Var approx. uncertainty. Any NN can be made Bayesian with that, but expensive and not accurate. Can also be motivated from Gaussian Processes. Over-param. models better uncert. estm.\\[4pt]
	\underline{Aleatoric uncertainty}: data uncertainty due to noise (e.g. bad sensor). \textit{Data-dependent/heteroscedastic}: specific raw inputs hard to interpret, predict uncert. per data point: $\loss = \frac{||y_i - \hat{y}_i||^2}{2\sigma_i^2} + \log \sigma_i$. \textit{Task-dependent/homoscedastic}: introduced by task (e.g. depth estimation), Sol: train on multiple tasks. $\loss = \frac{||y_i - \hat{y}_i||^2}{2\sigma^2} + \log \sigma$
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Bayesian Deep Learning (1)};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Bayes by Backprop}: approx. true posterior $p(w|\mathcal{D})$ by $q(w|\theta)$: $\loss = \log q(w_s|\theta) - \log p(w_s) - \log p(\mathcal{D}|w_s) \hspace{2mm}\text{ where }\hspace{2mm} w_s\sim q(w_s|\theta)$\\
	Example: assume Gaussian variational posterior with softplus $w=\mu + \epsilon\cdot \log\left(1+\exp\rho\right)$, then learn $\mu$ and $\rho$ by SGD.
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Bayesian Deep Learning (2)};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Autoregressive Models}: generative without latent variables, assuming order in data, conditional probs $p(x) = \prod_k p(x_k|x_{<k})$. Not necessarily parameter sharing, $p(x)$ tractable, but slow\\[4pt]
	\underline{NADE}: model output with single layer, $\mathcal{O}(D\times H)$ params\\$p(x_d=1|x_{<d})=\sigma\left(V_{d,:}\cdot h_d+b_d\right)$, $h_d=\sigma\left(W_{:,<d}\cdot x_{<d} + c\right)$\\
	\underline{MADE}: Autoencoder with carefully masked connections. $y_d$ only depends on $x_{<d}$. Connections can be shared with future $d$\\[4pt]
	\underline{PixelRNN}: row-wise pixel and sequential color generation\\
	$p(x_i|x_{<i}) = p(x_{i,R}|x_{<i})\cdot p(x_{i,G}|x_{i,R}, x_{<i})\cdot p(x_{i,B}|x_{i,R}, x_{i,G}, x_{<i})$\\
	\textit{Row-LSTM}: next output depends on three hidden states above\\
	\textit{Diagonal-BiLSTM}: use all pixels before (all prev rows and left)\\
	\underline{PixelCNN}: masked convs to only see top and left. Causes blind spot. Use separated vertical and horizontal stack\\[4pt]
	\underline{PixelVAE}: Standard VAE with PixelCNN as decoder
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Sequential Models};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	Value function $q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_{t+1}+\gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t, a_t\right]$\\
	Bellman equation $q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_{t+1}+\gamma q^{\pi}(s_{t+1}, a_{t+1})| s_t, a_t\right]$\\
	Optimal policy with $q^{*}(s_t,a_t) = r_{t+1} + \gamma \max_{a_{t+1}} q^{*}(s_{t+1}, a_{t+1})$\\[4pt]
	\underline{Value-based}: learn $q^{*}$ to get $\pi^{*}$. Q-Learning (off-policy):\\
	$\mathcal{L} = \mathbb{E}\left[\left(r + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}, \theta) -q(s_{t}, a_{t}, \theta) \right)^2\right]$\\
	For gradient calculation, bootstrapped val is fixed. \\
	\underline{Stability problems}: bootstrap, target and policy always changing, oscillations; seq. data break iid assump.; scale of $q$ values hard to control, unstable gradients; \\
	\underline{Solutions}: experience replay (store samples $\langle s, a, r, s'\rangle$ in dataset, sample from that, makes batch iid), freezing target network every $K$ iters to avoid oscillations, clip rewards, skip frames, control exploration vs. exploitation by annealing $\epsilon$-greedy policy\\[4pt]
	\underline{Policy-based}: learn $\pi^{*}$ directly, avoid problems with $q$ vals (especially hard for continuous action space). Training steps:\\
	\vspace{-3mm}
	\begin{enumerate}[leftmargin=4mm]
	\setlength\itemsep{0.0em}
	\item Determine $q$ by simulation: $q^{\pi_w}(s_t, a_t) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + ... | \pi_{w}\right]$
	\item Maximize $q$ by $\pd{\mathcal{L}}{w} = \mathbb{E}\left[\chain{q^{\pi}(s,a)}{a}{w}\right]$ (deterministic)\\
	or $\pd{\mathcal{L}}{w} = \mathbb{E}\left[\pd{\log \pi^{w}(a|s)}{w} q^{\pi}(s,a)\right]$ (stochastic)
	\end{enumerate}
	\textit{Asynchronous Advantage Actor-Critic}: Learn both policy and value function at same time, run multiple agents simultaneously (more diverse samples), advantage estimates: compare actually gained $q$ value against learned value function. If it is higher, unexpected (good) things happened $\Rightarrow$ exploration\\[4pt]
	\underline{Model-based}: try to model environment and be aware of rules. E.g. AlphaGo with tree-search guided by CNNs. Two policy networks playing against each other, and a third network to predict $V(s_t)=\sum_{a'} \pi(a'|s_t) \cdot q^{\pi}(s_t, a')$.
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Reinforcement Learning};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Forward KL}: $D_{KL}(p||q)$, overestimate variance\\
	\underline{Backward KL}: $D_{KL}(q||p)$, underestimate variance\\
	$D_{KL}(q||p) = \int q(x) \log \frac{q(x)}{p(x)}dx \Rightarrow$ if $p(x)=0$, then $q(x)=0$\\
	$D_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)}dx \Rightarrow$ if $p(x)>0$, then $q(x)>0$\\[2pt]
	\underline{Jensen-Shannon}: $D_{JS}(p||q) = \frac{1}{2}D_{KL}(p||M)+\frac{1}{2}D_{KL}(q||M)$\\ $M = \frac{p+q}{2}\Rightarrow D_{JS}(p||q)=D_{JS}(q||p)$\\[4pt]
	$ a = Wx+b$, $\pd{a_i}{W_{jk}} = 1(i=j)\cdot x_k$, $\pd{a}{b} = \bm{I}$, $\pd{a}{x} = W$\\
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Math to know};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Compare non-linear activation functions}
	% \vspace{-3mm}
	\begin{description}[leftmargin=4mm]
	\setlength\itemsep{0.0em}
	\item[ReLU] Strong gradient for $x>0$, non saturating \textit{Drawbacks}: dead neurons
	% \item Every module can be expressed by $a=h(x;w)$
	\item[Sigmoid] probability distribution output \textit{Drawbacks}: small gradients $\leq 1/4$, saturating, shifts distribution
	\item[Tanh] zero-centered in origin \textit{Drawbacks}: saturating, only strong gradients around 0
	\end{description}
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Old Exams};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Differences between generative and discriminative models}\\
	1. Generative models estimate the joint probability $p(x,y)$ or the data density $p(x)$. Discriminative models instead model the conditional $p(y|x)$.\\
	2. Generative models are often intractable because the integral in $p(x)=\int p(x|z) p(z) dz$ cannot always be computed analytically.\\
	3. Discriminative models tend to yield better accuracies given a task, meaning they are optimized for the particular task, at the cost of potential overfitting.\\[5pt]
	\underline{Advantages/Disadvantages of generative models}\\
	\textbf{GAN}: \textit{Benefits}: very good, realistic results, fast to sample from, no need to train on likelihood, very flexible to extension \textit{Drawbacks}: no quantitative evaluation, hard to train (sensitive to hyperparameters, mode collapse, etc.), no real objective in terms of likelihood (and distribution is unknown)\\
	\textbf{VAE}: \textit{Benefits}: Usable for data compression, distribution known (calculate likelihood function), stable training (no mode collapse) \textit{Drawbacks}: only approx. likelihood (ELBO), tends to give blurry instead of realistic images, need flexible enough encoder and prior\\
	\textbf{NF}: \textit{Benefits}: directly optimize $p(x)$, one-to-one mapping between $z$ and $x$ (knows exact embedding of any image in latent space) \textit{Drawbacks}: high number of parameters, complexity restrained by requirement of reversible $f$\\[5pt]
	\underline{Difference RNN/Autoregressive}\\
	\textbf{RNN}: shares weights over steps, applicable to any sequence length, compresses all previous inputs into single hidden state/memory, not necessarily generative\\
	\textbf{Autoregressive}: does not necessarily share weights, fixed in sequence length, are generative
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Additional questions};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\begin{minipage}{0.45\textwidth}
	\includegraphics[width=\textwidth]{figures/NN_Zoo.png}
	\end{minipage}
	\begin{minipage}{0.45\textwidth}
	\includegraphics[width=\textwidth]{figures/optimization_pathological_curvatures.png}
	\includegraphics[width=\textwidth]{figures/RNN_LSTM.png}
	\end{minipage}
	
	\begin{minipage}{\textwidth}
	\includegraphics[width=\textwidth]{figures/NF_concept.png}
	\end{minipage}
	
	\begin{minipage}{\textwidth}
	\centering
	\includegraphics[width=0.7\textwidth]{figures/Autoregressive_PixelRNN.pdf}
	\end{minipage}
	
	\begin{minipage}{\textwidth}
	\centering
	\includegraphics[width=0.7\textwidth]{figures/GAN_generative_models_overview_2.png}
	\end{minipage}
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Figures};
\end{tikzpicture}

\end{multicols*}
\end{document}

================================================
FILE: Deep_Learning/dl_appendix.tex
================================================
% \section{Neural Network Zoo}

\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.9\textwidth]{figures/NN_Zoo_High.png}
\end{figure}

================================================
FILE: Deep_Learning/dl_autoregressive.tex
================================================
\section{Deep Sequential Models}
\subsection{Autoregressive Models}
\begin{itemize}
	\item Generative models without latent variables, but assuming an order in the data (if there is none, create an artificial order, e.g. traversing an image from left to right, top to bottom). The likelihood is the product of conditionals:
	$$p(x)=\prod_{k=1}^{D} p(x_k|x_{j<k})$$
	\item In contrast to RNNs, there is no/not necessarily parameter sharing, and the chain cannot be of infinite length because of that
	\item \textit{Advantages}: $p(x)$ is tractable
	\item \textit{Drawbacks}: training and generation are slow due to being sequential and not parallel
\end{itemize}
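As a minimal worked example of the factorization above (assuming $D=3$ purely for illustration), the chain rule gives
$$p(x_1,x_2,x_3) = p(x_1)\cdot p(x_2|x_1)\cdot p(x_3|x_1,x_2) \hspace{2mm}\Rightarrow\hspace{2mm} \log p(x) = \log p(x_1) + \log p(x_2|x_1) + \log p(x_3|x_1,x_2)$$
Each factor is a normalized distribution over a single variable, so the product is normalized as well and no intractable partition function is needed.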
\subsubsection{NADE}
\begin{itemize}
	\item Originally defined for binary inputs/data. Can be generalized for other spaces as well
	\item Every output $x_d$ is modeled by a single layer that takes all previous data points as input and generates its prediction based on them:
	\begin{equation*}
		\begin{split}
			p(x_d=1|x_{<d}) & = \sigma\left(V_{d,:}\cdot h_d + b_d\right), h_d = \sigma\left(W_{:,<d}\cdot x_{<d} + c\right)
		\end{split}
	\end{equation*}
	where $V\in \mathbb{R}^{D\times H}, W\in \mathbb{R}^{H\times D}, b\in \mathbb{R}^{D}, c\in \mathbb{R}^{H}$ ($H$ hidden dimensionality, $D$ input dimensions)
	\item Objective is minimizing the negative log likelihood: $\mathcal{L} = - \log p(x) = - \sum_{k=1}^{D} \log p(x_k|x_{<k})$
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/Autoregressive_NADE.pdf}
		\caption{Concept of NADE.}
	\end{figure}
	\item \textit{Teacher forcing}: During training, use ground truth as input for all levels. For testing, use generated samples as input (sequentially)
\end{itemize}
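The definition of $h_d$ above suggests a simple recursion for the pre-activations (a sketch; the symbol $a_d$ is introduced here for illustration only):
$$a_1 = c, \hspace{4mm} a_{d+1} = a_d + W_{:,d}\cdot x_d, \hspace{4mm} h_d = \sigma(a_d)$$
This holds because $W_{:,<d+1}\cdot x_{<d+1} = W_{:,<d}\cdot x_{<d} + W_{:,d}\cdot x_d$. Under this view, all $D$ hidden vectors can be computed in $\mathcal{O}(HD)$ instead of $\mathcal{O}(HD^2)$ operations.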
\subsubsection{MADE}
\begin{itemize}
	\item Use an autoencoder where we carefully mask out connections so that the output $y_d$ only depends on inputs $x_{<d}$
	\item The name ``autoencoder'' is only used because we try to reproduce the input. However, note that we neither have a bottleneck nor do we try to get sparsity. We just remove connections to make the outputs depend only on certain inputs
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/Autoregressive_MADE.png}
		\caption{Masked autoencoder for autoregressive models. We set certain weights to 0 (i.e. remove connections between neurons) so that the generation of $x_1$ only depends on $x_2$ and $x_3$, but not on $x_1$ itself (which would be cheating and prevent the model from being generative).}
	\end{figure}
\end{itemize}
\subsubsection{PixelRNN}
\begin{itemize}
	\item Assume row-wise pixel and sequential color generation (first red channel, then green, afterwards blue):
	$$p(x_i|x_{<i}) = p(x_{i,R}|x_{<i})\cdot p(x_{i,G}|x_{i,R}, x_{<i})\cdot p(x_{i,B}|x_{i,R}, x_{i,G}, x_{<i})$$
	\item Different ways of modeling it. LSTM variants mostly have 12 layers
	\begin{itemize}
		\item \textit{Row LSTM}: to compute the next output (i.e. next hidden state), we take the three hidden states of the row above a certain pixel as ``last hidden state''. We therefore get a triangular shape of context. However, it thereby misses context from the row itself, and context that is further away. As it does not use pixels in the same row, the computation can be parallelized for a row.
		\item \textit{Diagonal Bi-LSTM}: Uses all pixels that were generated before by using a Bi-LSTM. 
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/Autoregressive_PixelRNN.pdf}
		\caption{Comparing different methods of PixelRNN and PixelCNN. The lower level is the previous layer, and the top is the next layer. If we have a single layer PixelRNN/CNN, the lower one would be the input and the upper the generated output.}
		\label{fig:Autoregressive_PixelRNN}
	\end{figure}
	\item The architecture includes residual connections to speed up training
	\item \textit{Benefits}: good modeling of $p(x)$, reasonable image quality
	\item \textit{Disadvantages}: slow training and slow generation
\end{itemize}
\subsubsection{PixelCNN}
\begin{itemize}
	\item Replace recurrence by convolutions to speed up (at least) training
	\item Convolutions are masked so that only context from before (i.e. left and top) can be used. See Figure~\ref{fig:Autoregressive_PixelRNN} left and Figure~\ref{fig:Autoregressive_PixelCNN} for an example
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.3\textwidth}
			\centering
			\includegraphics[width=0.6\textwidth]{figures/Autoregressive_Masked_Conv.png}
			\caption{Example mask for $5\times 5$ convolution}
		\end{subfigure}
		\hspace{2mm}
		\begin{subfigure}{0.32\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/Autoregressive_PixelCNN_blindspot_problem.png}
			\caption{Blindspot of PixelCNN}
		\end{subfigure}
		\hspace{2mm}
		\begin{subfigure}{0.32\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/Autoregressive_PixelCNN_blindspot.jpg}
			\caption{Solution to blindspot}
		\end{subfigure}
		\caption{Masked convolutions in PixelCNN}
		\label{fig:Autoregressive_PixelCNN}
	\end{figure}
	\item Problem: worse results than PixelRNN because of limited context and blind spot (cascaded convolutions ignore right upper part)
	\item Solution: use two convolution stacks, a vertical stack looking purely at the rows above, and a horizontal stack looking at the pixels to the left in the current row. Additionally, use gated convolutions (one half of the features go through tanh, the other through sigmoid)
	\item \textbf{PixelCNN++}: replace softmax with logistic mixture likelihood over 8 bits, use encoder-decoder architecture with skip connections
\end{itemize}
\subsubsection{PixelVAE}
\begin{itemize}
	\item Standard VAE with PixelCNN as decoder/generator
	\item However, generator is very powerful which can lead to the problem that it ignores the latent code, and just generates ``nice'' images
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/Autoregressive_PixelVAE.png}
		\caption{Architecture of a PixelVAE}
	\end{figure}
\end{itemize}

================================================
FILE: Deep_Learning/dl_bayesian.tex
================================================
\section{Bayesian Deep Learning}
\begin{itemize}
	\item Bayesian machine learning: holding a distribution per latent variable instead of single value
	\item Benefits of Bayesian
	\begin{itemize}
		\item Ensemble modeling (better accuracies)
		\item Uncertainty estimates, preventing overconfident networks
		\item Model compression (have prior that pushes weights towards 0)
		\item \TODO{Think of more}
	\end{itemize}
\end{itemize}
\subsection{Epistemic uncertainty}
\begin{itemize}
	\item \textit{Epistemic uncertainty}: dataset limits
	\item Uncertainty that is introduced by dataset limits (unseen data $\Rightarrow$ how certain are the weights)
	\item Can be reduced by increasing the amount of data
	\item Important for safety-critical applications and small datasets
	\item Hard to model because posterior is usually intractable for complex functions like NN
	$$p(w|x,y) = \frac{p(x,y|w)p(w)}{\int p(x,y|w)p(w)dw}$$
	\item \textbf{Monte-Carlo Dropout}: apply dropout during testing (Bernoulli-distribution over weights as variational distribution). The variance/uncertainty derived from there approximates uncertainty gained by variational framework. 
	\begin{itemize}
		\item \textit{Advantages}: every standard NN can be turned into a Bayesian NN. Very easy to train and no inference network necessary
		\item \textit{Drawbacks}: expensive, have to rerun model several times on data. Not very accurate (depends on activation function etc.)
	\end{itemize}
	\item \textbf{Deep Gaussian Process}: predict mean and variance for every data point.
	\begin{itemize}
		\item The predictive distribution is $p(y|x,X,Y) = \int p(y|x,w)p(w|X,Y)dw$
		\item The likelihood term is a Gaussian $p(y|x,w)=\mathcal{N}(y; \hat{y}(x,w), \tau^{-1}I_D)$ where $\hat{y}(x,w)$ is a NN and $\tau$ the model precision (so $\tau^{-1}$ is the noise variance) that can be derived from MC dropout
		\item For the posterior, we use variational approximation: $p(w|X,Y)\approx q(w)$. In case of MC dropout, we have $\tilde{W}_i = W_i\cdot \text{diag}\left(\left[z_{i,j}\right]_{1}^{K_i}\right), z_{i,j}\sim \text{Bernoulli}\left(p_i\right)$ where $\tilde{W}_i$ are the weights with applied dropout
		\item Minimize loss $\mathcal{L}= - \int q(w)\log p(Y|X,w)dw + KL\left(q(w)||p(w)\right)$. First term is approximated by Monte-Carlo integration (equivalent to sampling dropout), and second can be approximated analytically
	\end{itemize}
	\item Over-parameterized models give better uncertainty estimates as they capture a bigger class of models. However, they also need higher dropout rates
\end{itemize}
\subsection{Aleatoric uncertainty}
\begin{itemize}
	\item \textit{Aleatoric uncertainty}: data uncertainty
	\item Uncertainty due to the nature of the data (noise/hard to predict accurately. Example: depth estimation with bad sensor)
	\item Can be reduced by better data (better sensors, multiple different sensors, etc.)
	\item \textit{Data-dependent/heteroscedastic aleatoric uncertainty}: specific raw inputs like images that are hard to interpret
	\begin{itemize}
		\item Can be modeled by predicting a variance term per data point to reduce loss
		$$\mathcal{L} = \frac{\lVert y_i - \hat{y}_i\rVert^2}{2\sigma_i^2} + \log \sigma_i$$
		If variance low, the loss is weighted higher, but the $\log$ term is smaller $\Rightarrow$ trade-off
	\end{itemize}
	\item \textit{Task-dependent/homoscedastic aleatoric uncertainty}: introduced by task like semantic segmentation or depth estimation (hard at edges). Possible solution: train on multiple tasks like edge detection
	\begin{itemize}
		\item We can as well introduce a variance term, but shared by all data points (task individual):
		$$\mathcal{L} = \frac{\lVert y_i - \hat{y}_i\rVert^2}{2\sigma^2} + \log \sigma$$
	\end{itemize}
\end{itemize}
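The losses above can be read as the negative log-likelihood of a Gaussian. As a short sketch for a scalar target $y_i$ (an assumption made here to keep the normalization simple):
$$-\log \mathcal{N}(y_i; \hat{y}_i, \sigma_i^2) = \frac{(y_i - \hat{y}_i)^2}{2\sigma_i^2} + \log \sigma_i + \frac{1}{2}\log(2\pi)$$
The last term is constant and can be dropped. Setting the derivative with respect to $\sigma_i$ to zero, $-\frac{(y_i - \hat{y}_i)^2}{\sigma_i^3} + \frac{1}{\sigma_i} = 0$, gives $\sigma_i^2 = (y_i - \hat{y}_i)^2$, i.e. for a fixed prediction the optimal variance equals the squared error.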
\subsection{Bayes by Backprop}
\begin{itemize}
	\item Start from a NN with a distribution over its weights
	\item Train weights to approximate the true posterior well (same form as the negative ELBO; the evidence term $\log p(\mathcal{D})$ does not depend on $\theta$ and can be dropped during optimization)
	$$\text{KL}\left(q\left(w|\theta\right)||p\left(w|\mathcal{D}\right)\right) = \text{KL}\left(q\left(w|\theta\right)||p\left(w\right)\right) - \int q(w|\theta) \log p(\mathcal{D}|w)dw + \log p(\mathcal{D})$$
	First term pushes distributions towards prior, and second towards modeling the data well
	\item Compute by Monte-Carlo integration (over distribution $q(w|\theta)$) for \textit{both} terms:
	$$\mathcal{L} = \log q(w_s|\theta) - \log p(w_s) - \log p(\mathcal{D}|w_s) \hspace{2mm}\text{ where }\hspace{2mm} w_s\sim q(w_s|\theta)$$
	\item Example: assume a Gaussian variational posterior on the weights $w=\mu + \epsilon \cdot \log\left(1 + \exp(\rho)\right)$ with $\epsilon\sim\mathcal{N}(0,1)$ (standard deviation with softplus trick for always positive values). Learn parameters $\mu$ and $\rho$ per weight
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/Bayes_By_Backprop.png}
	\end{figure}
	\item In experiments, Bayesian NNs perform similarly to plain NNs with dropout
\end{itemize}

================================================
FILE: Deep_Learning/dl_convnets.tex
================================================
\section{Convolutional Neural Networks}
\begin{itemize}
	\item Images are stationary signals with spatial structure and huge dimensionality
	\item Input dimensions are highly correlated (e.g. translation invariant)
	\item Preserve spatial structure by convolutional filters, local connectivity (with shared weights) and being robust to local variances by spatial pooling
\end{itemize}
\subsection{Transfer Learning}
\begin{itemize}
	\item Use large datasets like ImageNet to learn useful features for other, smaller datasets
	\item Prevent overfitting, even for large networks
	\item Alternatively, we could also use a pre-trained network on task 1 as feature extractor for task 2 (same as freezing first layers)
	\item Which layer(s) to fine-tune?
	\begin{itemize}
		\item If both tasks have the same labels, we can initialize all layers. Otherwise, the classification layer (last layer) must be newly trained. If only very little data is available, only fine-tune this layer
		\item If datasets are very different, the fully connected layers need to be replaced
		\item First convolutional filters capture low-level information that mostly does not change over datasets. Mid-level convolutions can be fine-tuned if dataset is large enough
	\end{itemize}
	\item Use a smaller learning rate for pre-initialized layers as network starts already from a point close to the optimum. New layers can be trained with higher learning rate
\end{itemize}
\subsection{Standard classification architectures}
\subsubsection{VGGNet}
\begin{itemize}
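	\item All filter sizes are $3\times 3$, as this is the smallest filter size that still captures left/right and up/down context. Stacking several $3\times 3$ filters is more parameter efficient than a single large filter with the same receptive field, and adds additional non-linearities between the filters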
	\item $1\times 1$ convolutions used to increase non-linearity/complexity without increasing receptive field
\end{itemize}
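A quick parameter count illustrates the efficiency argument (assuming $C$ input and $C$ output channels per layer and ignoring biases):
$$\underbrace{2\cdot(3\cdot 3\cdot C^2)}_{\text{two } 3\times 3 \text{ layers}} = 18C^2 < \underbrace{5\cdot 5\cdot C^2}_{\text{one } 5\times 5 \text{ layer}} = 25C^2, \hspace{5mm} 3\cdot(3\cdot 3\cdot C^2) = 27C^2 < 7\cdot 7\cdot C^2 = 49C^2$$
Both stacks cover the same receptive field as the single large filter ($5\times 5$ and $7\times 7$ respectively).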
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.3\textwidth]{figures/CNN_VGGnet.png}
	\caption{VGG16 architecture}
	\label{fig:CNN_VVGnet}
\end{figure}
\subsubsection{Inception}
\begin{itemize}
	\item Receptive fields should vary in size as objects can appear in different scales
	\item Naively stacking more convolutional operations on top of each other is expensive and prone to overfitting
	\item Inception module applies different filter sizes on same input ($1\times 1$ convolutions for feature reduction)
	\item Architecture consists of 9 Inception blocks
	\item Solution for vanishing gradients: have intermediate classifiers that amplify the gradient signal for early layers
	\item InceptionV2: $5\times 5$ replaced by two $3\times 3$ filters
	\item InceptionV3: $1\times 3$ and $3\times 1$ filters instead of $3\times 3$
	\item BatchNormalization has been shown to be very helpful in this architecture
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.8\textwidth]{figures/CNN_Inception_module.pdf}
	\caption{Inception module}
	\label{fig:CNN_Inception_module}
\end{figure}
\subsubsection{ResNet/DenseNet/HighwayNet}
\begin{itemize}
	\item Deeper networks are harder to optimize, and might actually achieve worse results than shallow ones because of that (although learning identity in additional layers must lead to same results)
	\item Better approach: try to model the difference that is learned in every layer $H(x) = F(x) + x$
	\item Different ways for modeling $F(x)$. Most popular ones shown in Figure~\ref{fig:CNN_ResNet_blocks}. BatchNormalization has been shown to be very important because of vanishing gradients
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.7\textwidth]{figures/CNN_ResNet_blocks.png}
		\caption{ResNet blocks}
		\label{fig:CNN_ResNet_blocks}
	\end{figure}
	\item \textbf{HighwayNet} introduces a gate with learnable parameters to determine the importance of a layer: $H(x) = F(x) \cdot T(x) + x \cdot \left(1 - T\left(x\right)\right)$
	\item \textbf{DenseNet} uses skip connections to multiple forward layers. Creates dense blocks in which the last layer sees the outputs of all previous layers
\end{itemize}
\subsection{Tracking/Object detection}
\subsubsection{Fast R-CNN}
\begin{itemize}
	\item Based on middle feature map, get bounding boxes by e.g. selective search 
	\item RoI pooling returns fixed size feature map for selected bounding box (puts e.g. $3\times 3$ mask on features and pools accordingly)
	\item Features used to generate class prediction and location correction
	\item During training, sample multiple candidate boxes from image and train on all of them. Makes it more efficient/faster, \textit{but} batch elements might be highly correlated (in the paper, they report that this effect was negligible)
	\item Very accurate and fast, but external box proposals needed
	\item \textbf{Faster R-CNN}: train network to propose box locations
\end{itemize}
\subsubsection{Siamese Network for Tracking}
\begin{itemize}
	\item Use Siamese network to compare similarity of two patches
	\item If we compare patches over time, we can find objects with the highest similarity $\Rightarrow$ tracking of objects
	\item Can be trained on rich video dataset, and can be applied to unseen categories/targets
\end{itemize}
\subsection{Spatial Transformer Network}
\begin{itemize}
	\item ConvNets must be invariant/robust to pose/geometry changes. One simple way of doing it is data augmentation
	\item Better: use spatial transformer network to learn rotation/scale transformation
	\item Define grid on input. Scale, translation and rotation parameters are learned by the network and depend on the input. Finally, transform image based on the changed grid. 
	\item Operation is differentiable and thus can be learned
\end{itemize}

================================================
FILE: Deep_Learning/dl_deep_rl.tex
================================================
\section{Deep Reinforcement Learning}
\subsection{Fundamentals of Reinforcement Learning}
\begin{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/RL_basic_concept.png}
		\caption{Interaction model between environment and agent}
	\end{figure}
	\item The \textbf{state} $s_t$ is the summary of all experience so far: $s_t = f(o_1, r_1, a_1, o_2, r_2, a_2, ..., o_t, r_t)$ ($o_i$ observable part of environment at time step $i$). If we have a fully observable environment, then $s_t = f(o_t)$.
	\item The \textbf{policy} of an agent determines its actions: $\pi\left(a_t|s_t\right)$. Can be deterministic or stochastic
	\item The \textbf{value function} is the expected total reward under policy $\pi$: $$q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_{t+1}+\gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t, a_t\right]$$
	$\gamma$ is the discount factor: we are most certain about rewards in the near future and are sometimes more interested in immediate rewards
	\item \textbf{Bellman equation} for value function:
	$$q^{\pi}(s_t, a_t) = \mathbb{E}_{s', a'}\left[r + \gamma q^{\pi}\left(s', a'\right) | s_t, a_t\right] = \sum_{s'} p(s'|s_t,a_t)\cdot \left[r(s', a_t, s_t) + \gamma \sum_{a'} \pi(a'|s') \cdot q^{\pi}\left(s', a'\right) \right]$$
	\item The optimal value function is therefore $q^{*}(s_t,a_t) = \max_{\pi} q^{\pi}(s_t,a_t) = \mathbb{E}_{s_{t+1}}\left[r_{t+1} + \gamma \max_{a_{t+1}} q^{*}(s_{t+1}, a_{t+1}) | s_t, a_t\right]$
	\item The \textbf{environment} can be modeled by the agent (learned from experience), and used for planning and look ahead. This can be for example a simulator
\end{itemize}
\subsection{Deep RL approaches}
\subsubsection{Value-based approaches}
\begin{itemize}
	\item Try to learn value function $q^*$ to get the optimal policy $\pi^*$
	\item The input to such models is usually the state, which should be as raw as possible (e.g. image frames). We can either add the action to the input and let the network predict its Q-value, or predict Q-values for all possible actions (second is faster and simpler)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/RL_deep_QLearning.png}
		\caption{Modeling of Q-value predictions}
	\end{figure}
	\item Optimization by Q-learning loss (squared temporal-difference error; SARSA-like, but with a $\max$ over the next action):
	$$\mathcal{L} = \mathbb{E}\left[\left(r + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}, \theta) -q(s_{t}, a_{t}, \theta) \right)^2\right]$$
	\item For the gradients, we assume that the bootstrapped max value is fixed:
	$$\pd{\mathcal{L}}{\theta} = \mathbb{E}\left[-2\cdot \left(r + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}, \theta) -q(s_{t}, a_{t}, \theta) \right) \cdot \pd{q(s_{t}, a_{t}, \theta)}{\theta}\right]$$
	\item Optimize with SGD by sampling one action and state, calculate q-values for all possible future actions, and use the maximum as bootstrap goal
\end{itemize}
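A small numeric example of the loss and gradient above (all numbers are made up for illustration): with $r=1$, $\gamma=0.9$, $\max_{a_{t+1}} q(s_{t+1}, a_{t+1}, \theta)=2$ and $q(s_t,a_t,\theta)=2.5$, we get
$$\underbrace{1 + 0.9\cdot 2}_{\text{target } = 2.8} - 2.5 = 0.3, \hspace{5mm} \mathcal{L} = 0.3^2 = 0.09, \hspace{5mm} \pd{\mathcal{L}}{\theta} = -0.6\cdot \pd{q(s_t,a_t,\theta)}{\theta}$$
A gradient descent step therefore moves $q(s_t,a_t,\theta)$ upwards, towards the (fixed) target.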
\subsubsection{Stability problems}
\begin{itemize}
	\item As we bootstrap, the target is always changing $\Rightarrow$ policy changes fast, can lead to oscillations
	\item The sequential data breaks the iid assumption on which SGD relies
	\item The scale of Q-values is not easy to control, and is very task dependent $\Rightarrow$ gradients are unstable and can be either too large or too small
	\item \textbf{Improving stability}
	\begin{itemize}
		\item \textit{Experience replay}: store memories of $\langle s, a, r, s'\rangle$ (collected with e.g. an $\epsilon$-greedy policy) in a dataset, and sample batches from there to train on. Breaks the temporal dependency, so that the batches are closer to the i.i.d. assumption of SGD
		\item \textit{Freezing target}: instead of having a moving target, we freeze the $Q$ network every $K$ iterations, and use that to generate our targets (Q-targets now come from slightly older network parameters, but stay fixed over $K$ iterations). Avoids oscillations
		\item \textit{Clipping rewards}: Normalize or clip rewards to be in range $[-1,+1]$ or any other stable range. Prevents unknown scales of $Q$
		\item \textit{Skipping frames}: a light version of experience replay is skipping $N$ frames between two data points to avoid too strong temporal dependency (two consecutive frames are very similar)
		\item \textit{Exploration vs Exploitation}: use an $\epsilon$-greedy policy with annealed $\epsilon$. In the beginning, we will focus on exploration while slowly converging to exploitation
	\end{itemize}
\end{itemize}
\subsubsection{Policy-based approaches}
\begin{itemize}
	\item Try to learn the optimal policy $\pi^*$ directly from experience (parameterized policy $\pi_w(a_t|s_t)$)
	\item Avoids learning the $q$ values which are hard for continuous action spaces, and tend to oscillate because of bootstrapping
	\item Training steps
	\begin{enumerate}
		\item Determine Q-value for current policy by running a simulation:\\ $q^{\pi_w}(s_t, a_t) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t, a_t, \pi_{w}\right]$
		\item Maximize q-values as loss function. 
		\begin{enumerate}
			\item If policy is deterministic:
			$$\pd{\mathcal{L}}{w} = \mathbb{E}\left[\chain{q^{\pi}(s,a)}{a}{w}\right]$$
			\item If policy is stochastic:
			$$\pd{\mathcal{L}}{w} = \mathbb{E}\left[\pd{\log \pi_{w}(a|s)}{w} q^{\pi}(s,a)\right]$$
		\end{enumerate}
	\end{enumerate}
	\item Asynchronous Advantage Actor-Critic
	\begin{itemize}
		\item Learn both policy and value function
		\item Multiple agents that simultaneously interact with (copy of) environment and learn
		\item \textit{Advantage estimates}: Use the learned value function to compare to your actually gained $q$ value. Loss is therefore higher if unexpected things happen $\Rightarrow$ exploration
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/RL_A3C_multiple_workers.png}
		\end{subfigure}
		\begin{subfigure}{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/RL_A3C_cycle.png}
		\end{subfigure}
		\caption{Schematic overview of A3C}
		\label{fig:RL_A3C}
	\end{figure}
\end{itemize}
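The stochastic policy gradient above follows from the log-derivative trick. A short sketch for a single state $s$ with discrete actions, treating $q^{\pi}(s,a)$ as fixed with respect to $w$ (a simplifying assumption):
$$\frac{\partial}{\partial w}\sum_a \pi_w(a|s)\, q^{\pi}(s,a) = \sum_a \pi_w(a|s)\, \frac{\partial \log \pi_w(a|s)}{\partial w}\, q^{\pi}(s,a) = \mathbb{E}_{a\sim\pi_w}\left[\frac{\partial \log \pi_w(a|s)}{\partial w}\, q^{\pi}(s,a)\right]$$
using $\frac{\partial \pi_w}{\partial w} = \pi_w \cdot \frac{\partial \log \pi_w}{\partial w}$. The expectation can then be estimated from sampled actions.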
\subsubsection{Model-based approaches}
\begin{itemize}
	\item Try to model the environment to be aware of rules etc. 
	\item Example: AlphaGo relies on Tree-Search guided by CNNs. We use two policy networks to play against each other, and one value network that predicts the value function of a state
\end{itemize}

================================================
FILE: Deep_Learning/dl_generative_models.tex
================================================
\section{Deep Generative Models}
\begin{itemize}
	\item \textit{Generative modeling}: learn the joint probability $p(x,y)$ or density function $p(x)$. A discriminative task can still be solved via Bayes rule: $p(y|x) = p(x,y)/p(x)$. Generative models tend to generalize better (less prone to overfitting) and model causal relations better. Members include GAN, VAE, etc.
	\begin{itemize}
		\item We can use generative models to predict uncertainty and out of distribution examples: $p(x,y) = p(y|x)p(x) \Rightarrow$ if $x$ o.o.d., then $p(x)$ low!
	\end{itemize}
	\item \textit{Discriminative modeling}: learn conditional pdf $p(y|x)$. Is usually task-oriented and gets better results. 
	\item Applications of generative models
	\begin{itemize}
		\item Simulating possible futures for reinforcement learning
		\item Creating missing data  (e.g. pixel patches which are missing)
		\item Super-resolution scaling for images
		\item Data augmentation (replace e.g. car by bicyclist in a scene)
		\item Cross-modal translation (sketch to image)
	\end{itemize}
	\item Different types of generative models (see Figure~\ref{fig:GAN_generative_models_overview})
	\begin{itemize}
		\item \textit{Explicit density}: maximize log likelihood of the data by modeling a probability density function. Function must be complex enough and computationally tractable
		\item \textit{Implicit density}: no explicit pdf needed, only a sampling mechanism
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/GAN_generative_models_overview_2.png}
		\caption{Overview of generative models}
		\label{fig:GAN_generative_models_overview}
	\end{figure}
\end{itemize}
\subsection{Generative Adversarial Networks}
\begin{itemize}
	\item Adversarial training of generator vs discriminator
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/GAN_pipeline.png}
		\caption{Pipeline of adversarial GAN training}
		\label{fig:GAN_pipeline}
	\end{figure}
	\item The generator is a (mostly deconvolutional) network that takes noise $z$ as input, and creates fake images. The discriminator tries to distinguish between fake and real images
	\item Trained in a minimax game fashion, the loss function resembles the Jensen-Shannon divergence:
	\begin{equation*}
		\begin{split}
			\min_G \max_D V(G,D) & = \mathbb{E}_{\bm{x}\sim p_{\text{data}}(\bm{x})} \left[\log \left(D\left(\bm{x}\right)\right)\right] + \mathbb{E}_{\bm{z}\sim p_{z}(\bm{z})} \left[\log\left(1 - D\left(G\left(\bm{z}\right)\right)\right)\right] \\
			J^{(D)} & = - \frac{1}{2}\mathbb{E}_{x\sim p_{\text{data}}}\left[\log D(x)\right] - \frac{1}{2}\mathbb{E}_{z\sim p_{z}}\left[\log \left(1 - D(G(z))\right)\right]\\
			J^{(G)} & = - \frac{1}{2}\mathbb{E}_{z\sim p_{z}}\left[\log D(G(z))\right]\\
		\end{split}
	\end{equation*}
	\item Loss of generator is changed from $\log \left(1 - D(G(z))\right)$ to $-\log D(G(z))$ because otherwise the gradients of the generator vanish if the discriminator is too strong
	\item Divergence is important and can strongly influence the behavior of model
	\begin{equation*}
		\begin{split}
			D_{KL}\left(p(x)\lVert q^{*}(x)\right) = \int p(x) \log \frac{p(x)}{q^{*}(x)} dx & \implies \text{if } p(x)>0, \text{ then } q^{*}(x)>0\\
			D_{KL}\left(q^{*}(x)\lVert p(x)\right) = \int q^{*}(x) \log \frac{q^{*}(x)}{p(x)} dx & \implies \text{if } p(x)=0, \text{ then } q^{*}(x)=0\\
		\end{split}
	\end{equation*}
\end{itemize}
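To make the connection to the Jensen-Shannon divergence explicit, here is the standard derivation in brief (writing $p_g$ for the distribution of generated samples $G(\bm{z})$, a symbol introduced here). For a fixed generator, maximizing the integrand $p_{\text{data}}(\bm{x})\log D(\bm{x}) + p_g(\bm{x})\log\left(1-D(\bm{x})\right)$ pointwise gives
$$D^{*}(\bm{x}) = \frac{p_{\text{data}}(\bm{x})}{p_{\text{data}}(\bm{x}) + p_g(\bm{x})}$$
Plugging $D^{*}$ back into the objective yields
$$V(G, D^{*}) = -\log 4 + 2\cdot D_{JS}\left(p_{\text{data}}\lVert p_g\right)$$
so the generator minimizes the Jensen-Shannon divergence up to a constant, with the minimum $-\log 4$ reached at $p_g = p_{\text{data}}$.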
\subsubsection{GAN training problems}
\begin{itemize}
	\item \textbf{Vanishing gradients} during training:
	\begin{itemize}
		\item If the discriminator is too bad, the generator does not get valid/accurate feedback and can therefore not learn properly
		\item If the discriminator is perfect, the generator has very low gradients as a small change does not influence the discriminator
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_vanishing_gradients.jpeg}
			\caption{Vanishing gradients problem for training with KL-divergence. When the distance between the two distributions $p$ and $q$ (respectively $P_g$ and $P_r$) is too large, the KL divergence is very close to zero. Hence, it does not provide any strong gradients in these regions.}
		\end{figure}
	\end{itemize}
	\item \textbf{Reaching the equilibrium}
	\begin{itemize}
		\item We know that the Nash equilibrium of the minimax game is $P_g=P_r$, meaning the distribution of the generated data is equal to the distribution of the real data. In that case, $D$ returns 0.5 no matter what example we put in (as both distributions are equal).
		\item However, it has been shown that such cost functions may not converge when using gradient descent. An example is shown in Figure~\ref{fig:GAN_reaching_equilibrium}.
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_oscillating.png}
			\caption{Oscillating behavior of a non-cooperative game $\min_x \max_y V(x,y)$ with $V(x,y) = x\cdot y$. The equilibrium $x=y=0$ is never reached.}
			\label{fig:GAN_reaching_equilibrium}
		\end{figure}
	\end{itemize}
	\item \textbf{Mode collapse}
	\begin{itemize}
		\item A GAN suffers from a mode collapse if the generator limits its predictions/generated distribution to a few samples/modes.
		\item For example, in case of the MNIST dataset, this would mean that the generator only creates one or two different digits. Although a full mode collapse is rare, partial mode collapses occur frequently
		\item A mode collapse arises when the gradients with respect to the noise $\bm{z}$ are very low/close to zero, so that the output barely depends on $\bm{z}$. This can for example happen if we fix the discriminator and the generator converges to the optimal image $\bm{x}^*$ that fools the discriminator the most
		\item Once the generator collapses to one mode, the discriminator will learn that this mode is purely/mostly generated and thus changes its predictions. The generator will address that by changing the mode (note that as $\partial L/\partial \bm{z}\approx 0$, we will just collapse to the next mode and are not able to escape this loop).
		\item In the end, this turns into a cat-and-mouse game between the generator and discriminator, and will not converge (see Figure~\ref{fig:GAN_mode_collapse}).
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_learning_GAN_mode_collapse.png}
			\caption{\textit{Top row}: optimal convergence of generator distribution to 8 modes. \textit{Bottom row}: Sample of a mode collapse after 10k iterations. The generator is only able to generate a single mode.}
			\label{fig:GAN_mode_collapse}
		\end{figure}
	\end{itemize}
	\item \textbf{Low dimensional support}
	\begin{itemize}
		\item The KL and JS divergence work best for overlapping distributions, where neither density is 0 (otherwise numerical instability)
		\item However, during training, the generated distribution is not perfect, and as we have high dimensional data, both distributions are unlikely to overlap much
		\item Also, it is then easy for the discriminator to find a decision boundary that separates them
	\end{itemize}
\end{itemize}
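The oscillation for $V(x,y) = x\cdot y$ can be verified by hand. As a sketch, assume simultaneous gradient steps with a constant step size $\eta$ (an assumption made here, the figure may use a different update scheme), where $x$ descends and $y$ ascends:
\begin{equation*}
	\begin{split}
		x_{t+1} & = x_t - \eta \frac{\partial V}{\partial x} = x_t - \eta y_t, \hspace{5mm} y_{t+1} = y_t + \eta \frac{\partial V}{\partial y} = y_t + \eta x_t\\
		x_{t+1}^2 + y_{t+1}^2 & = \left(1 + \eta^2\right)\left(x_t^2 + y_t^2\right)
	\end{split}
\end{equation*}
The distance to the equilibrium $x=y=0$ therefore grows by the factor $\sqrt{1+\eta^2}$ in every step, so the iterates spiral outwards for any $\eta > 0$ instead of converging.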
\subsubsection{GAN improvements}
\begin{itemize}
	\item \textbf{Wasserstein GAN}
	\begin{itemize}
		\item Instead of KL/JS, use Wasserstein (Earth Mover's) Distance:
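		$$\mathcal{W}(p_r, p_g) = \inf\limits_{\gamma \in \Pi (p_r,p_g)} \mathbb{E}_{(x,y)\sim \gamma}\left[\lVert x-y\rVert\right]$$
		where $\Pi(p_r,p_g)$ is the set of all joint distributions with marginals $p_r$ and $p_g$
		\item Intuitive explanation: how much probability mass do we have to move, and how far, to turn one distribution into the other one. Thus, the distance is even meaningful for non-overlapping distributions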
	\end{itemize}
	\item \textbf{Usage of labels}
	\begin{itemize}
		\item Learning a conditional model $p(y|x)$ often generates better samples than from a random distribution
		\item One example is the conditional GAN, where a ground truth label is given as additional input
	\end{itemize}
	\item \textbf{Label smoothing}
	\begin{itemize}
		\item Train the discriminator to predict $D(x)\approx 1 - \alpha$ instead of 1
		\item Has been shown to be a good regularization by preventing the discriminator from being overconfident
		\item In addition, the gradients of the generator are less likely to explode
	\end{itemize}
	\item \textbf{Virtual batch normalization}
	\begin{itemize}
		\item Batch Normalization can significantly help in neural networks
		\item However, in GANs, it leads to high intra-batch correlation
		\item Solution: \textit{virtual batch normalization} where we select a reference batch which is fixed during training, and combine it with the statistics of the current batch. Reduces overfitting on reference batch and intra-batch correlation
	\end{itemize}
\end{itemize}
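A minimal example of why the Wasserstein distance still gives a learning signal for non-overlapping distributions. Assume (purely for illustration) two point masses on the real line, $p_r = \delta_0$ and $p_g = \delta_\theta$ with a scalar parameter $\theta$:
\begin{equation*}
	\begin{split}
		\mathcal{W}(p_r, p_g) & = |\theta|\\
		\text{JSD}\left(p_r \Vert p_g\right) & = \begin{cases} \log 2 & \text{ if } \theta \neq 0\\ 0 & \text{ if } \theta = 0 \end{cases}\\
		D_{KL}\left(p_r \Vert p_g\right) & = \begin{cases} \infty & \text{ if } \theta \neq 0\\ 0 & \text{ if } \theta = 0 \end{cases}
	\end{split}
\end{equation*}
Only the Wasserstein distance changes smoothly with $\theta$ and has a non-zero gradient for $\theta \neq 0$. The JS divergence is constant there, and the KL divergence is infinite.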
\subsubsection{GAN open questions}
\begin{itemize}
	\item \textbf{Mode collapse}: How to prevent a model from suffering from mode collapse? One idea is penalizing the model if features are too similar, or allowing the discriminator to see across batch elements. But these solutions are heuristics rather than theoretically grounded solutions
	\item \textbf{Evaluation of GANs}: GANs are currently judged by their qualitative results/predictions, but there is no quantitative measurement yet
	\item \textbf{Discrete outputs}: The generator and discriminator need to be differentiable, and thus discrete outputs are not possible. There are some workarounds, but no real theoretically sound solution.
	\item \textbf{Semi-supervised classification}: How to combine a GAN training and discriminative model efficiently (discriminator predicts class and fake/real at the same time)
\end{itemize}
\subsection{Boltzmann machines}
\begin{itemize}
	\item A Boltzmann distribution is defined by $p(x) = \frac{1}{Z}\exp\left(-E\left(x\right)\right)$ where $E(x)$ is an energy function described by our model, and $Z=\sum\limits_x \exp\left(-E\left(x\right)\right)$ a normalization constant
	\item The benefit of defining a distribution like that is that our model can output any value in $(-\infty, \infty)$ instead of being constrained to $[0,1]$
	\item A problem is that even if $x$ is binary, the normalizing constant $Z$ gets out of hand (sum over $2^{n}$ combinations for $n$ dimensional $x$). Thus, we limit the computations by only considering pairwise relations
	\item Pairwise relations modeled by $E(x)=-x^TWx-b^Tx$. Learning $W$ and $b$ by maximizing the likelihood of the data
	\item Problem: $W$ is still of size $n^2$ which can be too large for e.g. images ($256\times 256$ pixels lead to about $4.3$ billion parameters in $W$) $\Rightarrow$ Restricted Boltzmann machines
\end{itemize}
\subsubsection{Restricted Boltzmann machines}
\begin{itemize}
	\item Restrict model by additional bottleneck over $h$ latents
	$$E(x,h) = -x^T W h - b^T x - c^T h, \hspace{2mm} p(x) = \frac{1}{Z}\sum_h \exp\left(-E\left(x,h\right)\right)$$
	\item This function is not in the form of an energy function anymore (because of the sum). We can rewrite it with the free energy $F(x)$ as:
	\begin{equation*}
		\begin{split}
			F(x) & = -b^T x - \sum_i \log \sum_{h_i} \exp\left(h_i\left(c_i + W_i x\right)\right)\\
			p(x) & = \frac{1}{Z} \exp\left(-F(x)\right)\\
			Z & = \sum\limits_x \exp\left(-F(x)\right)
		\end{split}
	\end{equation*}
	\item Can be represented as a single MLP layer (undirected) with less hidden units
	\item Compared to simple Boltzmann machine, we can express higher-order relations 
	\item Given $x$, the hidden units are conditionally independent of each other, and the same holds for the inputs given $h$:
	$$p(h|x) = \prod_j p(h_j|x, \theta), \hspace{2mm} p(x|h) = \prod_i p(x_i|h, \theta) $$
	\item We can now reformulate the conditional probabilities as sigmoids \textbf{iff} $h$ and $x$ are binary:
	$$p(h_j=1|x, \theta) = \sigma\left(W_{:,j}^T x + c_j\right), \hspace{2mm}p(x_i=1|h, \theta) = \sigma\left(W_{i,:} h + b_i\right)$$
	\item The objective is to maximize the log likelihood:
	$$\mathcal{L}(\theta) = \frac{1}{N}\sum_n \log p(x_n|\theta) = \frac{1}{N}\sum_n\left[- F(x_n) - \log Z\right]$$
	\item The gradients can be computed accordingly:
	\begin{equation*}
		\begin{split}
			\pd{\log p(x_n|\theta)}{\theta} & = -\sum_h p(h|x_n, \theta) \pd{E(x_n,h| \theta)}{\theta} + \sum_{\tilde{x}, h} p(\tilde{x}, h|\theta) \pd{E(\tilde{x}, h|\theta)}{\theta}\\
		\end{split}
	\end{equation*}
	Problem: second term is sum over $x$ and $h$ $\Rightarrow$ high-dimensional, hard to compute
	\item One way to do it is using contrastive divergence: sample $h_0 \sim p(h|x)$, and $x_1 \sim p(x|h_0)$, etc. In practice, a single sample is mostly sufficient
	\item \textbf{Deep Belief Network}: an RBM is still a single-layer model, but we can also use a stack of RBMs. First layer is directed, others not. Our joint pdf is $p(x, h_1, h_2) = p(x|h_1)\cdot p(h_1, h_2)$, where the top two layers form an undirected RBM
	\item \textbf{Deep Boltzmann machines}: also a stack of RBMs, but with undirected first layer
	\begin{itemize}
		\item Hence, we get $p(h_2^{k}|h_1, h_3) = \sigma \left(W_1^{:,k}h_1 + W_3^{k,:}h_3 \right)$
		\item Computing gradients is intractable $\Rightarrow$ approximate by sampling
	\end{itemize}
\end{itemize}
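Two worked steps for the RBM, sketched for binary hidden units $h_i \in \{0,1\}$ and the energy $E(x,h) = -x^T W h - b^T x - c^T h$ from above. First, the inner sum of the free energy has only two terms, so
\begin{equation*}
	F(x) = -b^T x - \sum_i \log\left(1 + \exp\left(c_i + W_i x\right)\right)
\end{equation*}
Second, since $\frac{\partial E}{\partial W} = -x h^T$, the gradient with respect to $W$ becomes a difference of two expectations. Contrastive divergence with a single Gibbs step (often called CD-1) replaces the intractable model expectation by the reconstruction $x_1$ obtained from $h_0 \sim p(h|x_0)$ and $x_1 \sim p(x|h_0)$:
\begin{equation*}
	\frac{\partial \log p(x_0|\theta)}{\partial W} = x_0\, \mathbb{E}\left[h|x_0\right]^T - \mathbb{E}_{\tilde{x}, h}\left[\tilde{x} h^T\right] \approx x_0\, \mathbb{E}\left[h|x_0\right]^T - x_1\, \mathbb{E}\left[h|x_1\right]^T
\end{equation*}
The approximation on the right is biased, but it is cheap because it only needs the factorized conditionals.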
\subsection{Variational Autoencoders}
\begin{itemize}
	\item We assume an underlying, lower-dimensional latent distribution $p(z)$ with which we can model our data distribution via $p(x,z)=p(x|z)p(z)$
	\item Therefore, we need the posterior $p(z|x)$, which is often not easy to compute. In variational inference, we approximate the true posterior by $q_{\varphi}(z)$ (the approximate posterior does not have to depend on the observed $x$, but in a VAE it does)
	\item Our goal is to maximize $p(x)$. As this is intractable, we use the ELBO:
	\begin{equation*}
		\begin{split}
			\log p(x) & = \log \int p(x,z)dz \\
			& = \log \int q_{\varphi}(z) \frac{p(x,z)}{q_{\varphi}(z)} dz\\
			& = \log \mathbb{E}_{q_{\varphi}(z)}\left[\frac{p(x,z)}{q_{\varphi}(z)}\right]\\
			& \geq \mathbb{E}_{q_{\varphi}(z)}\left[\log \frac{p(x,z)}{q_{\varphi}(z)}\right]\\
			& = \mathbb{E}_{q_{\varphi}(z)}\left[\log p(x|z)\right] - \text{KL}\left(q_{\varphi}(z)||p(z)\right) = \text{ELBO}_{\theta, \varphi}\left(x\right)
		\end{split}
	\end{equation*}
	\item The distance between $\log p(x)$ and the ELBO is the KL divergence to the true (unknown) posterior:
	$$\log p(x) - \text{KL}\left(q_{\varphi}(z)||p(z|x)\right) = \mathbb{E}_{q_{\varphi}(z)}\left[\log p(x|z)\right] - \text{KL}\left(q_{\varphi}(z)||p(z)\right)$$
	\item Thus, maximizing the ELBO either increases the log likelihood or optimizes the approximated posterior
	\item Variational Autoencoders make $q_{\varphi}(z)$ dependent on $x$, and model $p_{\theta}(x|z)$ as well:
	$$\text{ELBO}_{\theta, \varphi}\left(x\right) = \mathbb{E}_{q_{\varphi}(z|x)}\left[\log p_{\theta}(x|z)\right] - \text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right)$$
	Note that $p_{\lambda}(z)$ is not optimized, and its parameters $\lambda$ just describe the prior (e.g. standard Gaussian) 
	\item The loss function for a VAE is the negative ELBO, where we approximate the expectation by a single sample. The KL term is mostly chosen to be analytically solvable (e.g. between two Gaussians) to avoid a Monte-Carlo approximation of the integral
	\item However, we face a problem when we try to compute the gradients $\nabla_{\varphi} \mathcal{L}$: a naive Monte-Carlo estimate has high variance, and sampling is a non-differentiable operation
	\item \textbf{Reparameterization trick}: sample from an external, fixed distribution, and transform this sample into a sample of the modeled distribution. For a Gaussian: $z = \mu_q + \sigma_q \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$
\end{itemize}
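As an example of an analytically solvable KL term, assume the common choice of a diagonal Gaussian encoder $q_{\varphi}(z|x) = \mathcal{N}\left(z|\mu, \text{diag}(\sigma^2)\right)$ and a standard Gaussian prior $p(z) = \mathcal{N}(z|0, I)$, with $D$ denoting the latent dimensionality (a symbol introduced here). Then
\begin{equation*}
	\text{KL}\left(q_{\varphi}(z|x)||p(z)\right) = \frac{1}{2}\sum_{d=1}^{D}\left(\mu_d^2 + \sigma_d^2 - 1 - \log \sigma_d^2\right)
\end{equation*}
The term is zero exactly if $\mu_d = 0$ and $\sigma_d = 1$ for all $d$, and it is differentiable in $\mu$ and $\sigma$, so no sampling is needed for this part of the loss.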
\subsubsection{Improvements of VAE}
\begin{itemize}
	\item \textbf{Encoder distribution}
	\begin{itemize}
		\item Modeling $q(z|x)$ as Gaussian makes training and implementation easy, but assumes that true posterior is also Gaussian, or can be at least approximated by one
		\item Simple option: use different task-specific distribution like e.g. hyperspherical, however not always suitable
		\item We can improve the complexity of this posterior by plugging in a Normalizing flow on top of the encoder output
		\begin{equation*}
		\begin{split}
		z_0 \sim q_0(z|x) & = \mathcal{N}(z|\mu(x), \text{diag}(\sigma^2(x)))\\
		q_K(z_K|x) & = q_0(z_0|x) \cdot \prod_{k=1}^{K}\left|\text{det}\pd{f_k(z_{k-1})}{z_{k-1}}\right|^{-1}\\
		\end{split}
		\end{equation*}
		\item The ELBO gets an additional term during training
		$$\text{ELBO} = \mathbb{E}_{q_{\varphi}(z|x)}\left[\log p_{\theta}(x|z)\right] - \text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right) + \mathbb{E}_{z_0 \sim q_0(z_0|x)}\left[\sum_{k=1}^{K} \log \left|\text{det}\pd{f_k(z_{k-1})}{z_{k-1}}\right|\right]$$
	\end{itemize}
	\item \textbf{Prior optimization}
	\begin{itemize}
		\item We assume a prior $p(z)$ which is for example Gaussian, but cannot make sure that every point of the prior actually has a realistic counterpart in the original $x$ space
		\item The optimal prior is the averaged distribution over all data samples: $q^{*}(z) = \frac{1}{N}\sum_{n=1}^{N} q_{\varphi}(z|x_n)$
		\item However, summing over all data points is infeasible. Thus, approximate it by $K$ pseudo-inputs $u_k$ that are trained via standard SGD in the framework:
		$$p_\lambda(z) = \frac{1}{K} \sum_{k=1}^{K} q_{\varphi}(z|u_k)$$
	\end{itemize}
	
\end{itemize}
\subsection{Normalizing flows}
\begin{itemize}
	\item A VAE cannot evaluate $p(x)$ directly because of the intractable marginalization ($p(x) = \int p(x,z)dz$)
	\item Normalizing Flows solve that problem by using a series of invertible transformations that allow more complex latent distributions than Gaussian
	\item The models can therefore be trained by directly maximizing the log likelihood instead of using the ELBO or similar
	\item A normalizing flow consists of multiple flows that transform a simple Gaussian distribution step by step into the data distribution (see Figure~\ref{fig:NF_concept})
	\item Every flow shifts the probability mass specified by parameters (determined by e.g. a NN, see Figure~\ref{fig:NF_density_shift})
	\begin{figure}[ht!]
		% NF_density_shift.png
		\centering
		\begin{subfigure}{0.7\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/NF_concept.png}
			\caption{General concept of stacking multiple flows}
			\label{fig:NF_concept}
		\end{subfigure}
		\hspace{8mm}
		\begin{subfigure}{0.2\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/NF_density_shift.png}
			\caption{Shifting density}
			\label{fig:NF_density_shift}
		\end{subfigure}
		\caption{Outline of how a normalizing flow works}
		\label{fig:NF}
	\end{figure} 
	\item Mathematically, we can define a normalizing flow by:
	\begin{equation*}
		\begin{split}
			x & = z_K = f_K \circ f_{K-1} \circ ... \circ f_1 (z_0) \to z_i = f_i(z_{i-1})\\
			p(z_i) & = p(z_{i-1}) \cdot \left|\det \frac{\partial f_{i}^{-1}}{\partial z_i}\right| \implies p(x) = p(z_0) \cdot \prod_{i=1}^{K} \left|\det \frac{\partial f_{i}^{-1}}{\partial z_i}\right|\\
			\log p(x) & = \log p(z_0) - \sum_{i=1}^{K} \log \left|\det \frac{\partial f_{i}}{\partial z_{i-1}}\right|
		\end{split}
	\end{equation*}
	\item Requirements: $f$ must be invertible (dimensions of $x$ and $z$ equal), and the determinant of the Jacobian must be easy to compute (e.g. for a triangular Jacobian)
\end{itemize}
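The simplest instance of the change-of-variables formula is a single one-dimensional affine flow. As a sketch, assume $z_0 \sim \mathcal{N}(0,1)$ and $x = f(z_0) = a z_0 + b$ with scalars $a \neq 0$ and $b$ (both introduced here for illustration). The Jacobian is just $a$, so
\begin{equation*}
	\log p(x) = \log \mathcal{N}\left(\left.\frac{x - b}{a}\right|0, 1\right) - \log |a| = -\frac{1}{2}\log (2\pi) - \log |a| - \frac{(x-b)^2}{2a^2}
\end{equation*}
which is exactly the log density of $\mathcal{N}(b, a^2)$. The term $-\log|a|$ corrects for the stretching of the space: if $a > 1$, the probability mass is spread out and the density decreases.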

================================================
FILE: Deep_Learning/dl_intro.tex
================================================
\section{Introduction}
\subsubsection{Perceptron}
\begin{itemize}
	\item Single perceptron weights every input with a weight, and adds a bias term
	\item Step function as output: if the weighted input sum is greater than zero, then the output is 1, else 0 (or -1)
	\item Problem: can only learn linearly separable problems and not e.g. XOR
	\item Overcome by the multi-layer perceptron 
\end{itemize}

================================================
FILE: Deep_Learning/dl_modularity.tex
================================================
\section{Modular Learning}
\begin{itemize}
	\item \textit{Definition}: A family of \textcolor{green}{parametric}, \textcolor{lightred}{non-linear} and \textcolor{blue}{hierarchical} \textcolor{orange}{representation learning functions}, which are \textcolor{red}{massively optimized with stochastic gradient descent} to \textcolor{purple}{encode domain knowledge}, i.e. domain invariances, stationarity.
	% \item Although with two-layer (shallow) network, we can approximate all possible functions, a deep architecture tends to be more efficient and generalize better
	\item A neural network is a series of hierarchically connected functions $\Rightarrow$ Directed Acyclic graph
	\item Note that it is not allowed to have loops except over time/additional dimension
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.2\textwidth]{figures/modularity_example_network.png}
	\caption{Example network with interweaved connections. The architecture can be made arbitrarily complex, and can also include recurrent connections.}
	\label{fig:modularity_example_network}
\end{figure}
\subsection{Module}
\begin{itemize}
	\item A module is the simplest mathematical component in a NN, and can be expressed by $a=h(x;w)$ where $a$ is the output, $x$ the input, $w$ trainable parameters and $h$ an activation function
	\item $w$ mostly learned by gradient-based methods, usually maximizing the likelihood
	\begin{itemize}
		\item ML solution: $w^{*} = \arg\max\limits_{w}\prod\limits_{x,y}p_{model}\left(y|x;w\right)$
		\item For gradient-based methods, we can minimize the negative log likelihood:\\ $\mathcal{L}(w) = -\mathbb{E}_{x,y\sim \tilde{p}_{data}}\left[\log p_{model}\left(y|x;w\right)\right]$
		\item If the output distribution is Gaussian, we obtain the squared $\ell_2$ loss
		\item If the output distribution is Laplacian, we obtain the $\ell_1$ loss
	\end{itemize} 
	\item Using a loss function that matches the output distribution of the network helps, because:
	\begin{itemize}
		\item It makes math simpler (exponential cancels out)
		\item Better numerical stability ($\log$ with very small/negative values, helps for e.g. Softmax+CrossEntropy)
		\item Makes gradients larger as exponential-like activations often lead to saturation, which means gradients are almost 0 (but not with $\log$)
	\end{itemize}
	\item It is important that the input and output distribution of every module match, as otherwise we get inconsistent behavior, which makes learning harder
	\begin{itemize}
		\item For activation functions, this means we prefer them to be mostly activated around the origin and centered
		\item Otherwise, e.g. ReLU can become a linear unit or set everything to 0
	\end{itemize}
\end{itemize}
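The link between the output distribution and the loss can be made explicit. As a sketch, assume the module predicts the mean $f_w(x)$ of the output distribution, with a fixed variance $\sigma^2$ for the Gaussian and a fixed scale $s$ for the Laplacian (the symbols $f_w$, $\sigma$ and $s$ are introduced here for illustration). For a scalar output $y$:
\begin{equation*}
	\begin{split}
		-\log \mathcal{N}\left(y|f_w(x), \sigma^2\right) & = \frac{1}{2\sigma^2}\left(y - f_w(x)\right)^2 + \frac{1}{2}\log\left(2\pi\sigma^2\right)\\
		-\log \text{Laplace}\left(y|f_w(x), s\right) & = \frac{1}{s}\left|y - f_w(x)\right| + \log\left(2s\right)
	\end{split}
\end{equation*}
In both cases the exponential of the density cancels with the logarithm, and the remaining terms that depend on $w$ are the squared $\ell_2$ and the $\ell_1$ error respectively.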
\subsubsection{Example modules}
\begin{itemize}
	\item \textbf{Linear module}: $a = wx$
	\begin{itemize}
		\item Simple gradients $\frac{\partial a}{\partial w} = x$, $\frac{\partial a}{\partial x} = w$
		\item No activation saturation $\Rightarrow$ strong, reliable gradients
	\end{itemize}
	\item \textbf{Rectified Linear Unit}: $a = \max(0,x)$
	\begin{itemize}
		\item Gradient is step function. $\pd{a}{x} = \begin{cases}
		0 & \text{ if } x\leq 0\\
		1 & \text{ if } x > 0\\
		\end{cases}$
		\item Hence, strong, fast gradients
		\item However, dead neurons might be an issue when the initialization/weights produce pre-activations smaller than 0 for every input 
		\item Different variations like LeakyReLU, Softplus ($\ln(1+e^{x})$), NoisyReLU exist
	\end{itemize}
	\item \textbf{Sigmoid}: $a=\sigma(x)=\frac{1}{1+e^{-x}}$
	\begin{itemize}
		\item Gradient easy to calculate: $\pd{a}{x} = \sigma(x)\left(1-\sigma\left(x\right)\right)$
		\item Can be used as output function for probability distribution between $[0,1]$
		\item Saturates and has small gradients
		\item Not centered around origin $\Rightarrow$ not good choice for within a network
	\end{itemize}
	\item \textbf{Tanh}: $a=\tanh\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$
	\begin{itemize}
		\item Gradients $\pd{a}{x}=1-\tanh\left(x\right)^2$
		\item Saturates as well, but has slightly higher gradients than sigmoid and is centered around origin
	\end{itemize}
	\item \textbf{Softmax}: $a^{(k)} = \text{softmax}\left(x^{(k)}\right) = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$
	\begin{itemize}
		\item Probability distribution over multiple classes
		\item Softmax trick for numerical stability: $\frac{e^{x^{(k)}-\mu}}{\sum_j e^{x^{(j)}-\mu}}$ with $\mu = \max_j x^{(j)}$ (the result is unchanged, but the exponentials cannot overflow)
	\end{itemize}
\end{itemize}
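A worked derivative that shows why Softmax is usually paired with the cross-entropy loss. Assume a target vector $y$ with $\sum_k y^{(k)} = 1$ (e.g. one-hot, a symbol introduced here) and the loss $\mathcal{L} = -\sum_k y^{(k)} \log a^{(k)}$. Using $\log a^{(k)} = x^{(k)} - \log \sum_j e^{x^{(j)}}$:
\begin{equation*}
	\begin{split}
		\mathcal{L} & = -\sum_k y^{(k)} x^{(k)} + \log \sum_j e^{x^{(j)}}\\
		\frac{\partial \mathcal{L}}{\partial x^{(k)}} & = a^{(k)} - y^{(k)}
	\end{split}
\end{equation*}
The exponential and the logarithm cancel, and the gradient is simply the prediction error. It does not saturate even if the Softmax output is confidently wrong.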
\subsection{Backpropagation}
\begin{itemize}
	\item Calculate gradients of all parameters in the network based on the loss on the last layer
	\item Principle of chain rule: $\pd{z}{x} = \sum_i \chain{z}{y_i}{x}$ (gradients from all possible paths)
	\begin{itemize}
		\item In vector notation: $\nabla_{\bm{x}} \bm{z} = \left(\pd{\bm{y}}{\bm{x}}\right)^T \cdot \nabla_{\bm{y}} \bm{z}$ with Jacobian $\pd{\bm{y}}{\bm{x}} = \left[\begin{array}{ccc}
		\pd{y_1}{x_1} & \pd{y_1}{x_2} & \pd{y_1}{x_3} \\[5pt]
		\pd{y_2}{x_1} & \pd{y_2}{x_2} & \pd{y_2}{x_3} \\
		\end{array}\right]$
	\end{itemize}
	\item Steps of Backpropagation:
	\begin{enumerate}
		\item Compute forward propagations for all layers recursively:
		$a^{(l)} = h^{(l)}\left(x^{(l)}\right) \text{ and } x^{(l+1)} = a^{(l)}$
		\item Compute the reverse path. 
		$$\pd{\mathcal{L}}{a^{(l)}} = \left(\pd{a^{(l+1)}}{x^{(l+1)}}\right)^T \cdot \pd{\mathcal{L}}{a^{(l+1)}}, \hspace{4mm} \pd{\mathcal{L}}{\theta^{(l)}} = \pd{a^{(l)}}{\theta^{(l)}} \cdot \left(\pd{\mathcal{L}}{a^{(l)}}\right)^T$$
		\item Use gradients $\pd{\mathcal{L}}{\theta^{(l)}}$ to update parameters via SGD
	\end{enumerate}
\end{itemize}
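The steps can be traced on a tiny network. As a sketch, assume two scalar modules, a linear module $a^{(1)} = w x$ followed by a sigmoid $a^{(2)} = \sigma\left(a^{(1)}\right)$, and the loss $\mathcal{L} = \frac{1}{2}\left(a^{(2)} - y\right)^2$ with a target $y$ (this concrete setup is chosen here for illustration). The reverse path reuses the gradient of the layer above at every step:
\begin{equation*}
	\begin{split}
		\frac{\partial \mathcal{L}}{\partial a^{(2)}} & = a^{(2)} - y\\
		\frac{\partial \mathcal{L}}{\partial a^{(1)}} & = \frac{\partial \mathcal{L}}{\partial a^{(2)}} \cdot a^{(2)}\left(1 - a^{(2)}\right)\\
		\frac{\partial \mathcal{L}}{\partial w} & = \frac{\partial \mathcal{L}}{\partial a^{(1)}} \cdot x
	\end{split}
\end{equation*}
Note that only values stored during the forward pass ($x$ and $a^{(2)}$) are needed, which is why the forward propagation has to be computed first.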

================================================
FILE: Deep_Learning/dl_optimization.tex
================================================
\section{Deep Learning Optimizations}
\begin{itemize}
	\item Pure optimization has a very direct goal, namely finding the optimum. However, in Machine Learning, we define a training goal. Thus, the ``optimal'' parameters might not necessarily be the optimum (e.g. overfitting)
\end{itemize}
\subsection{Stochastic Gradient Descent}
\begin{itemize}
	\item Move the weights in the direction of steepest descent, i.e. along the negative gradient
	$$w_{t+1} = w_{t} - \eta_t \nabla_{w} \mathcal{L}$$
	\item \textit{Gradient descent}: gradients on the full dataset. However:
	\begin{itemize}
		\item Dataset is mostly too large for this
		\item No real guarantee that this leads to a good optimum and/or it will converge faster
	\end{itemize}
	\item \textit{Stochastic gradient descent}: approximate gradients by averaging over a small batch. 
	\begin{itemize}
		\item The standard error of the gradient estimate is inversely proportional to the square root of the number of elements $m$ in a batch: $\sigma / \sqrt{m}$.
		\item Noisy gradients help to escape local minima and act as regularization
		\item A batch samples roughly representative gradients from the dataset. This is sufficient, as the training data is also just a rough approximation of what the test data might look like (optimum on training $\neq$ optimum on test)
		\item SGD is faster, especially in first iterations
		\item SGD is able to adapt with dynamically changing datasets
	\end{itemize}
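	\item \textit{Ill conditioning}: even a step along a large gradient can lead to worse performance. This is the case if the curvature (second-order term of the Taylor expansion) outweighs the first-order decrease, i.e. if the gradient changes quickly along the step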
\end{itemize}
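A quick calculation illustrates the diminishing returns of larger batches implied by the standard error $\sigma / \sqrt{m}$ (the batch sizes below are arbitrary example values):
\begin{equation*}
	m = 100: \hspace{2mm} \frac{\sigma}{\sqrt{100}} = \frac{\sigma}{10}, \hspace{8mm} m = 10000: \hspace{2mm} \frac{\sigma}{\sqrt{10000}} = \frac{\sigma}{100}
\end{equation*}
A batch that is 100 times larger, and thus roughly 100 times more expensive, reduces the noise of the gradient estimate only by a factor of 10. This is one reason why many cheap, noisy updates are often preferable to few exact ones.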
\subsection{Advanced optimizations}
\subsubsection{Gradient-based optimization}
\begin{itemize}
	\item \textit{Pathological curvatures}: move through a ravine towards minimum. SGD tends to oscillate between the walls because they have high gradients
	\begin{itemize}
		\item Second order optimization can help a lot for pathological curvatures: $$w_{t+1} = w_{t} - H_{\mathcal{L}}^{-1} \eta_t g_t$$
		\item Hessian $H_{\mathcal{L}}^{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i\partial w_j}$ works as adaptive learning rate per parameter
		\item However, infeasible in practice because the Hessian gets very large (quadratic in the number of parameters)
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/optimization_pathological_curvatures.png}
		\caption{Pathological curvature}
		\label{fig:optimization_pathological_curvatures}
	\end{figure}
	\item \textbf{Momentum}: maintain \textit{momentum} from previous parameter updates to dampen the oscillations.
	\begin{equation*}
		\begin{split}
			u_{t+1} & = \gamma u_{t} - \eta_t g_t \\
			w_{t+1} & = w_{t} + u_{t+1}
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Works as an exponential averaging $\Rightarrow$ more robust gradients, faster convergence
		\item $\gamma$ might be initialized lower and then increased over time to $0.9$
		\item Standard values for $\gamma$ are between $0.5$ and $0.9$ (note that a lower learning rate should be used compared to standard SGD)
	\end{itemize}
	\item \textbf{RMSprop}: adapting learning rate on current loss surface.
	\begin{equation*}
		\begin{split}
			r_t & = \alpha \cdot r_{t-1} + \left(1 - \alpha\right) \cdot g_t^2\\
			\eta_t & = \frac{\eta}{\sqrt{r_t} + \epsilon} \\
			w_{t+1} & = w_{t} - \eta_t \cdot g_t\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item $r_t$ is the (exponentially) averaged squared gradient describing the size of the gradients (per dimension!)
		\item The learning rate is then adapted by $\eta_t$ at every time step for each dimension independently
		\item $\epsilon$ to prevent numerical instability and too large learning rates
		\item With the adapted learning rate, we update our weights with SGD
	\end{itemize}
	\item \textbf{Adam}: Combining adaptive learning rate and momentum
	\begin{equation*}
		\begin{split}
			m^{(t)} & = \beta_1 m^{(t-1)} + (1 - \beta_1)\cdot g^{(t)}\\
			v^{(t)} & = \beta_2 v^{(t-1)} + (1 - \beta_2)\cdot \left(g^{(t)}\right)^2\\
			\hat{m}^{(t)} & = \frac{m^{(t)}}{1-\beta^{t}_1}, \hat{v}^{(t)} = \frac{v^{(t)}}{1-\beta^{t}_2}\\
			w^{(t)} & = w^{(t-1)} - \frac{\eta}{\sqrt{\hat{v}^{(t)}} + \epsilon}\circ \hat{m}^{(t)}\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Keeps track of the exponentially averaged gradient $m^{(t)}$ (first moment, acts as momentum), and of the exponentially averaged squared gradient $v^{(t)}$ (second moment, describes the gradient scale)
		\item The hyperparameters $\beta_1$ and $\beta_2$ correspond to the $\gamma$ and $\alpha$ respectively from the previous approaches
		\item The adaptive learning rate is expressed by $\hat{v}^{(t)}$, and the exponentially averaged gradients by $\hat{m}^{(t)}$
		\item The division is to remove the bias of $m^{(0)}$ and $v^{(0)}$ being zero. Note that $\beta_1^t$ means the value of $\beta_1$ to the power $t$, and not at time step $t$
		\item Adam is in general better for complex models, but might fail on easy/stupid tasks compared to simple methods like SGD
	\end{itemize}
	\item \textbf{Adagrad}: adapting learning rate based on both gradient scale and frequency of updates
	\begin{equation*}
		\begin{split}
			G_t & = G_{t-1} + \text{diag}\left(g_t^2\right)\\
			w_{t+1} & = w_{t} - \frac{\eta}{\sqrt{G_t + \epsilon}}\cdot g_t\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Very similar to RMSprop, but sums the scales over all time steps ($G_t$) instead of exponentially averaging 
		\item Less sensitive to learning rate tuning, but the effective learning rate gets very small over training time, annealing to 0
	\end{itemize} 
	\item \textbf{Nesterov momentum}: evaluate the gradient at the look-ahead position $w_t + \gamma u_t$ (the ``future'' gradient) instead of at the current position. Leads to better convergence in theory
\end{itemize}
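Two short calculations make the behavior of these optimizers more concrete. Both are sketches under simplifying assumptions: a constant gradient $g$ for momentum, and the very first update step for Adam.

For momentum with $u_0 = 0$ and a constant gradient $g$, the update is a geometric series:
\begin{equation*}
	u_{t} = -\eta g \sum_{k=0}^{t-1} \gamma^k = -\eta g \cdot \frac{1 - \gamma^{t}}{1 - \gamma} \hspace{3mm} \xrightarrow{t \to \infty} \hspace{3mm} -\frac{\eta}{1-\gamma}\, g
\end{equation*}
For $\gamma = 0.9$ the effective step size is thus up to 10 times larger than in plain SGD, which is why a lower learning rate should be used.

For Adam with $m^{(0)} = v^{(0)} = 0$, the bias correction exactly removes the factors $(1-\beta_1)$ and $(1-\beta_2)$ in the first step:
\begin{equation*}
	\begin{split}
		\hat{m}^{(1)} & = \frac{(1-\beta_1)\, g^{(1)}}{1 - \beta_1} = g^{(1)}, \hspace{5mm} \hat{v}^{(1)} = \frac{(1-\beta_2) \left(g^{(1)}\right)^2}{1 - \beta_2} = \left(g^{(1)}\right)^2\\
		w^{(1)} & = w^{(0)} - \eta \cdot \frac{g^{(1)}}{\left|g^{(1)}\right| + \epsilon} \approx w^{(0)} - \eta \cdot \text{sign}\left(g^{(1)}\right)
	\end{split}
\end{equation*}
The first step therefore has a magnitude of about $\eta$ in every dimension, independent of the scale of the gradient.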
\subsubsection{Bayesian optimization}
\begin{itemize}
	\item Gradient-based optimizations have the problem of getting stuck in local minima
	\item Bayesian optimization is a gradient-free, educated trial and error guesser that works in lower dimensional spaces (up to 1000, but mostly 20 to 50 parameters)
	\item Determines the next point/parameter values to evaluate based on variance/uncertainty, and expected/predictive value. 
	\item Can be used for e.g. network architecture search
\end{itemize}
\subsection{Normalization}
\begin{itemize}
	\item Data pre-processing
	\begin{itemize}
		\item Center data around 0 (activation functions are designed for that)
		\item Scale input variables to have similar diagonal covariances (not if features are differently important)
		\item De-correlate features if there is no inductive bias (e.g. sequence over time)
	\end{itemize}
	\item \textbf{Batch normalization}: ensure Gaussian distribution of features over batches at every module input
	\begin{equation*}
		\begin{split}
			\mu_B = \frac{1}{m} \sum\limits_{i=1}^{m} x_i, &\hspace{5mm} \sigma_B^2 = \frac{1}{m} \sum\limits_{i=1}^{m} \left(x_i - \mu_B\right)^2 \\
			\hat{x}_i & = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
			\hat{y}_i & = \gamma \cdot \hat{x}_i + \beta
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Normalize each feature so that $\hat{x}_i$ has zero mean and unit variance over the batch, then rescale with trainable parameters $\gamma$ (scale) and $\beta$ (shift).
		\item Helps the optimizer to control mean and variance of input distribution, and reduces effects of 2nd order between layers $\Rightarrow$ easier, faster learning 
		\item Acts as regularizer as distribution depends on mini-batch and therefore introduces noise
		\item During testing, take a moving average of the last training steps and use those for $\mu_B$ and $\sigma_B^2$
	\end{itemize}
\end{itemize}
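A minimal numerical example of the normalization step, assuming a batch of $m=2$ values $x_1 = 1$ and $x_2 = 3$ of a single feature (values chosen here for illustration) and a negligible $\epsilon$:
\begin{equation*}
	\begin{split}
		\mu_B & = \frac{1 + 3}{2} = 2, \hspace{5mm} \sigma_B^2 = \frac{(1-2)^2 + (3-2)^2}{2} = 1\\
		\hat{x}_1 & = \frac{1 - 2}{\sqrt{1}} = -1, \hspace{5mm} \hat{x}_2 = \frac{3 - 2}{\sqrt{1}} = 1
	\end{split}
\end{equation*}
With $\gamma = 1$ and $\beta = 0$ the output equals $\hat{x}_i$. Note that the same input $x_1 = 1$ would be mapped to a different value in another batch, which is the source of the noise mentioned above.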
\subsection{Regularization}
\begin{itemize}
	\item Weight regularization needed to prevent overfitting
	\item \textbf{$\ell_2$-regularization}: Introduce objective term for minimizing weights
	$$w^{*}=\arg\min_w \mathcal{L} + \frac{\lambda}{2}\sum_l ||w_l||^2$$
	\begin{itemize}
		\item When using simple (stochastic) gradient descent, $\ell_2$ regularization is the same as weight decay: $$w_{t+1} = \left(1-\lambda \eta_t\right) w_{t} - \eta_t \nabla_{w} \mathcal{L}$$
	\end{itemize}
	\item \textbf{$\ell_1$-regularization}: use $\ell_1$ objective, introduces sparse weights
	$$w^{*}=\arg\min_w \mathcal{L} + \lambda \sum_l ||w_l||$$
	\item \textbf{Early stopping}: stop the training when the validation error increases while the training loss continues to decrease. Counts as regularization because the number of training steps is limited
	\item \textbf{Dropout}: setting activations randomly to 0 during training with probability $p$ (mostly between $0.1$ and $0.5$)
	\begin{itemize}
		\item During test time, every activation is reweighted by $1 - p$
		\item Reduces co-adaptations/-dependencies between neurons because none can solely depend on the other
		\item Neurons get more robust $\Rightarrow$ reduces overfitting
		\item Effectively, a different network architecture is used every iteration. Testing can be seen as using model ensemble
	\end{itemize}
\end{itemize}
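Two of the statements above follow from one-line derivations. For $\ell_2$ regularization, the gradient of the penalty $\frac{\lambda}{2}||w||^2$ is $\lambda w$, so a plain gradient step on the regularized objective gives
\begin{equation*}
	w_{t+1} = w_t - \eta_t\left(\nabla_w \mathcal{L} + \lambda w_t\right) = \left(1 - \lambda\eta_t\right) w_t - \eta_t \nabla_w \mathcal{L}
\end{equation*}
This equivalence relies on the plain SGD update and does not carry over directly to adaptive optimizers such as Adam. For dropout, let $d \sim \text{Bernoulli}(1-p)$ be the random mask of a single activation $a$ (the symbol $d$ is introduced here for illustration). Then
\begin{equation*}
	\mathbb{E}\left[d \cdot a\right] = (1-p) \cdot a
\end{equation*}
so reweighting every activation by $1-p$ at test time matches the expected activation seen during training.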
\subsection{Weight initialization}
\begin{itemize}
	\item There are two forces on the weight magnitude: small weights are needed to keep data around origin, but large weights are required to have strong learning signals
	\item Initialization should preserve variance of activations (input variance $\approx$ output variance to keep distribution between modules same)
	\item Depends on non-linearity and data normalization
	\item \textbf{Xavier initialization}: to maintain data variance, the variance of the weights must be $1/d$ where $d$ is number of input neurons $\Rightarrow$ sample weight values from $w\sim\mathcal{N}(0,\sigma^2)$ with standard deviation $\sigma = \sqrt{1/d}$
	\item \textbf{Initialization for ReLU}: ReLU sets half of the output neurons to 0 $\Rightarrow$ double the weight variance to compensate for the flat zero region: $w\sim\mathcal{N}(0,\sigma^2)$ with $\sigma = \sqrt{2/d}$
\end{itemize}
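The variance $1/d$ can be derived in two lines. As a sketch, assume a linear module $a = \sum_{i=1}^{d} w_i x_i$ with independent, zero-mean weights $w_i$ and inputs $x_i$ that share the variances $\text{Var}(w)$ and $\text{Var}(x)$ respectively:
\begin{equation*}
	\text{Var}(a) = \sum_{i=1}^{d} \text{Var}(w_i x_i) = d \cdot \text{Var}(w) \cdot \text{Var}(x) \hspace{3mm} \implies \hspace{3mm} \text{Var}(a) = \text{Var}(x) \text{ iff } \text{Var}(w) = \frac{1}{d}
\end{equation*}
If the inputs are the outputs of a ReLU applied to a symmetric, zero-mean pre-activation $s$, they are not zero-mean anymore, and the relevant quantity becomes the second moment $\mathbb{E}\left[x^2\right] = \frac{1}{2}\text{Var}(s)$. Keeping the variance of the pre-activations constant then requires $\text{Var}(w) = 2/d$.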

================================================
FILE: Deep_Learning/dl_rnn.tex
================================================
\section{Recurrent and Graph Neural Networks}
\subsection{Backpropagation through time}
\begin{itemize}
	\item Sequences are of arbitrary length. Standard networks like CNN mostly work on fixed input dimensionality
	\item Usage of memory with shared weights $\theta$: $$c_{t+1} = h_{\theta}\left(x_{t+1}, c_{t}\right) = h_{\theta}\left(x_{t+1}, h_{\theta}\left(x_{t}, c_{t-1}\right)\right) = ...$$
	\item Simple RNN cell: 
	\begin{equation*}
		\begin{split}
			c_t & = \tanh\left(U\cdot x_t + W \cdot c_{t-1}\right) \\
			y_t & = \text{softmax}\left(V \cdot c_{t}\right) \\
			\loss & = -\sum\limits_{t=1}^{T} y_t^{*} \log y_t \\
		\end{split}
	\end{equation*}
	\item Gradient for output weights $V$:
	\begin{equation*}
		\begin{split}
			\pd{\loss_t}{V} & = \pd{\loss_t}{y_t}\pd{y_t}{V} = \left(y_t - y_t^{*}\right) \cdot \left(c_t\right)^T\\
			\pd{\loss}{V} & = \sum\limits_{t=1}^{T} \pd{\loss_t}{V}\\
		\end{split}
	\end{equation*}
	\item Gradient for memory weights $W$: 
	\begin{equation*}
		\begin{split}
			\pd{\loss_t}{W} & = \chain{\loss_t}{y_t}{c_t}\pd{c_t}{W}\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item In $\pd{c_t}{W}$, $c_t$ depends on $c_{t-1}$ which again depends on $W$. Thus, we have a recurrence in the gradient calculation:
		$$\pd{\loss_t}{W} = \sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\chain{c_t}{c_k}{W}$$
		where $\pd{c_k}{W}$ only models the dependency exactly at time step $k$
		\item The gradient $\pd{c_t}{c_k}$ can be determined by the chain rule: $\pd{c_t}{c_k} = \prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}$
		\item All in all, the final gradient is:
		\begin{equation*}
			\begin{split}
				\pd{\loss}{W} & = \sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\left(\prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}\right)\pd{c_k}{W}
			\end{split}
		\end{equation*}
	\end{itemize}
	\item Gradient for input weights $U$ very similar to $W$: 
	\begin{equation*}
		\begin{split}
			\pd{\loss}{U} & = \sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\left(\prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}\right)\pd{c_k}{U}
		\end{split}
	\end{equation*}
	\item A problem with RNNs is that the gradients at time step $t$ depend on $c_{t-1}$, which in turn depends on the weights $w$. However, the gradients are calculated under the assumption that $w$ stays the same for all previous time steps.
	\item This error can easily accumulate over many time steps so that in very long sequences, the gradients for the last steps are inaccurate
	\item Possible remedy: reduce the learning rate or perform fewer updates, but this leads to slower training
\end{itemize}
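The double sum for the gradient with respect to $W$ can be sketched for a scalar RNN. This example assumes scalar weights and, for brevity, a squared-error loss $\loss_t = \frac{1}{2}(c_t - y_t^{*})^2$ placed directly on the memory instead of the softmax output used above; all function names are illustrative.

```python
import math

def forward(w, u, xs, c0=0.0):
    """Run c_t = tanh(u * x_t + w * c_{t-1}); returns [c_0, c_1, ..., c_T]."""
    cs = [c0]
    for x in xs:
        cs.append(math.tanh(u * x + w * cs[-1]))
    return cs

def loss(w, u, xs, targets):
    """Assumed loss: sum over t of 0.5 * (c_t - target_t)^2."""
    cs = forward(w, u, xs)
    return sum(0.5 * (c - y) ** 2 for c, y in zip(cs[1:], targets))

def grad_w_bptt(w, u, xs, targets):
    """dL/dw via the double sum over t and k from the text."""
    cs = forward(w, u, xs)
    grad = 0.0
    for t in range(1, len(xs) + 1):
        dl_dc_t = cs[t] - targets[t - 1]
        for k in range(1, t + 1):
            # prod_{i=k+1}^{t} dc_i/dc_{i-1}, with dc_i/dc_{i-1} = (1 - c_i^2) * w
            jac = 1.0
            for i in range(k + 1, t + 1):
                jac *= (1.0 - cs[i] ** 2) * w
            # immediate dependency of c_k on w (c_{k-1} held fixed)
            dck_dw = (1.0 - cs[k] ** 2) * cs[k - 1]
            grad += dl_dc_t * jac * dck_dw
    return grad
```

Comparing the result of \texttt{grad\_w\_bptt} against a central finite difference of \texttt{loss} is a simple way to confirm that the recurrence is unrolled correctly.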
\subsubsection{Vanishing gradients}
\begin{itemize}
	\item The exact derivations can be found in \href{http://proceedings.mlr.press/v28/pascanu13.pdf}{this paper}
	\item We assume an alternative formulation for simplicity here: $c_t = W \cdot \sigma(c_{t-1}) + U \cdot x_{t-1}$ where $\sigma$ is an arbitrary activation function. Then, the partial derivative between two time steps is\\ $\pd{c_{t}}{c_{k}} = \prod\limits_{i=k+1}^{t} \pd{c_{i}}{c_{i-1}} = \prod\limits_{i=k+1}^{t} W^T \cdot \text{diag}\left(\pd{\sigma\left(c_{i-1}\right)}{c_{i-1}}\right)$
	\item Hence, the magnitude of $\pd{c_{t+1}}{c_{t}}$ is bounded by this derivative: 
	$$\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert \leq \left\lVert W^T\right\rVert \cdot \left\lVert \text{diag}\left(\pd{\sigma\left(c_t\right)}{c_t}\right)\right\rVert$$
	\item In case the derivative of our non-linearity is bounded by a value $\gamma$ (which is 1 in case of tanh), we know that gradients vanish if the norm of the weight matrix is lower than $1/\gamma$:
	$$\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert \leq \left\lVert W^T\right\rVert \cdot \left\lVert \text{diag}\left(\pd{\sigma\left(c_t\right)}{c_t}\right)\right\rVert < \frac{1}{\gamma}\gamma = 1$$
	\item This term is exponentiated with the number of time steps. Thus, long sequences suffer even more from vanishing gradients $\Rightarrow$ learn only short-term relationships
	\item If however $\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert > 1$ because of $\left\lVert W^T\right\rVert \gg 1/\gamma$, then we can get exploding gradients
	\item Quick fix for exploding gradients: clip the gradient norm. However, the opposite effect can then occur, where we only focus on long-term relationships
\end{itemize}
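To get a feeling for the numbers, the sketch below evaluates the bound $(\lVert W^T\rVert \cdot \gamma)^{t-k}$ over several time steps and implements gradient norm clipping. It is a minimal illustration; the function names and the flat-list representation of the gradient are assumptions made here.

```python
import math

def jacobian_norm_bound(w_norm, gamma, steps):
    """Upper bound (w_norm * gamma)^steps on ||dc_t/dc_k|| for t - k = steps."""
    return (w_norm * gamma) ** steps

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]
```

With $\gamma = 1$ (tanh), a weight norm of $0.5$ gives a bound below $10^{-3}$ after ten steps, while a weight norm of $1.5$ allows growth by a factor of more than $50$; clipping only rescales the gradient and keeps its direction.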
\subsubsection{Long Short-Term Memory}
\begin{itemize}
	\item Preventing vanishing gradients by gate mechanism
	\item By simply adding features to memory and limiting memory by sigmoid we can get strong gradients for any sequence length. Note that the gradients get lower in expectation because sigmoid has mean $0.5$. Nevertheless, if long-term dependencies are important, the network can learn them now
	\item \textit{Forget gate}: regulating how much information is kept from last time step
	\item \textit{Input + candidate gate}: Regulating which, and how much new information should be added given the current time step
	\item \textit{Output gate}: What features are important for the current time step
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/RNN_LSTM.png}
		\caption{Visualization of an LSTM cell}
		\label{fig:RNN_LSTM}
	\end{figure}
\end{itemize}
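The gates can be made concrete with a scalar sketch of one LSTM step. The equations follow the common textbook formulation (forget, input, candidate and output gate), which may differ in details from the variant shown in the figure; the parameter dictionary \texttt{p} and its key names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # how much of the old memory is kept
    i = sigmoid(pre("input"))         # how much new information is added
    g = math.tanh(pre("candidate"))   # which new information is proposed
    o = sigmoid(pre("output"))        # which features are exposed at this step
    c = f * c_prev + i * g            # additive memory update
    h = o * math.tanh(c)
    return h, c
```

If the forget gate saturates near 1 and the input gate near 0, the memory is passed on almost unchanged, which is what lets gradients survive over long sequences.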
\subsection{Graph Neural Networks}
\begin{itemize}
	\item Perform operation on graph-structured data (e.g. social networks or knowledge graphs)
\end{itemize}
\subsubsection{Deep Walk}
\begin{itemize}
	\item Learning latent representations of vertices in a network
	\item The Deep Walk algorithm consists of two simple steps:
	\begin{enumerate}
		\item Perform random walks on the graph to generate node sequences
		\item Run skip-gram on sequence (with word window) to learn node embeddings
	\end{enumerate}
	\item \textit{Drawback}: algorithm has to be re-run if a new node is added, not useful for dynamic graphs
\end{itemize}
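The first step of Deep Walk and the input to skip-gram can be sketched as follows. This is a minimal illustration assuming an unweighted graph given as an adjacency dictionary and uniform neighbor sampling; the embedding training itself is omitted and the function names are illustrative.

```python
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adjacency:
            walk = [start]
            # stop early only if the current node has no neighbors
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks

def skipgram_pairs(walk, window):
    """(center, context) pairs for all nodes within `window` positions."""
    pairs = []
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs
```

The resulting pairs play the role of (word, context word) pairs in skip-gram, with nodes taking the place of words.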
\subsubsection{GraphSage}
\begin{itemize}
	\item In every iteration, aggregate information of neighbors and the node itself to generate new embeddings
	\item Aggregation techniques are taking the mean (with weight and non-linearity applied on it afterwards), max pooling, or using an LSTM
\end{itemize}
\subsubsection{Graph Convolutional Networks}
\begin{itemize}
	\item A GNN layer takes as input the embeddings for every node $H^{(l)}$ and the adjacency matrix $A$, and creates new embeddings $H^{(l+1)}$
	\item Graph convolutional layers implement this as a matrix multiplication where the weights are shared over nodes
	\item In the simplest form, a GCN layer can be defined as $h(H^{(l)}, A) = \sigma\left(A H^{(l)} W^{(l)}\right)$
	\item To improve this, we add the identity matrix, $\hat{A} = A + I$, so that nodes use their old embeddings as well, and normalize the sum over all neighbors (using the degree matrix $D$ of $\hat{A}$), which acts like taking a mean instead of a sum:
	$$h(H^{(l)}, A) = \sigma\left(D^{-1/2}\hat{A}D^{-1/2} H^{(l)} W^{(l)}\right)$$
\end{itemize}
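The normalized GCN layer can be written out with plain nested lists. This sketch assumes ReLU as the non-linearity $\sigma$ and takes $D$ to be the degree matrix of $\hat{A}$; the helper names are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(h, adj, w):
    """ReLU(D^{-1/2} (A + I) D^{-1/2} H W) for node embeddings h."""
    n = len(adj)
    # add self-loops: A_hat = A + I
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # symmetric normalization: entry (i, j) is divided by sqrt(d_i * d_j)
    a_norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    z = matmul(matmul(a_norm, h), w)
    return [[max(v, 0.0) for v in row] for row in z]
```

For two connected nodes with embeddings $1$ and $3$ and weight $1$, every entry of the normalized matrix is $0.5$, so both nodes obtain the new embedding $2$.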

================================================
FILE: Deep_Learning/dl_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb, amsfonts} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\loss}[0]{\mathcal{L}}
\newcommand{\chain}[3]{\frac{\partial #1}{\partial #2}\frac{\partial #2}{\partial #3}}
\newcommand{\eq}[1]{\begin{equation*}\begin{split}#1\end{split}\end{equation*}}
\newcommand{\coderef}[0]{Please find the implementation in the folder with the code files.}
\newcommand{\TODO}[1]{\textbf{\textcolor{red}{#1}}}

\definecolor{green}{RGB}{0,160,0}
\definecolor{blue}{RGB}{0,0,160}
\definecolor{red}{RGB}{160,0,0}
\definecolor{orange}{RGB}{200,160,0}
\definecolor{purple}{RGB}{170,0,200}
\definecolor{cyan}{RGB}{0,200,200}
\definecolor{lightred}{RGB}{200,50,50}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Deep Learning}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{dl_intro.tex}
\input{dl_modularity.tex}
\input{dl_optimization.tex}
\input{dl_convnets.tex}
\input{dl_rnn.tex}
\input{dl_generative_models.tex}
\input{dl_bayesian.tex}
\input{dl_autoregressive.tex}
\input{dl_deep_rl.tex}
\appendix
\newpage
\input{dl_appendix.tex}

\end{document}

================================================
FILE: Information_Retrieval_1/ir_boolean_retrieval.tex
================================================
\section{Boolean Retrieval}
\begin{itemize}
	\item \textbf{Information retrieval} is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
	\item \textbf{Boolean retrieval model} is a model in which the queries are in the form of a Boolean expression of terms. Terms can be combined by the operators \texttt{AND}, \texttt{OR} and \texttt{NOT} 
\end{itemize}
\subsection{Inverted Index}
\begin{itemize}
	\item 
\end{itemize}

================================================
FILE: Information_Retrieval_1/ir_click_models.tex
================================================
\section{Click models}
\begin{itemize}
	\item User clicks can be used as evaluation of IR systems as clicks indicate the relevance of a document
	\item However, clicks are highly biased (positional, textual, attention/visual,...) $\Rightarrow$ click models try to remove these biases and help using clicks for evaluation
	\item Click models are optimized/trained on click logs which record for a given query which documents were clicked
	\item Most models are based on probabilistic graphical models (PGMs) that describe the probability of a click
	\item They are mostly trained by applying either MLE or the EM algorithm
\end{itemize}
\subsection{Random click model}
\begin{itemize}
	\item In random click models, every document on the result page has the same probability of being clicked: $$P(C_u = 1) = \text{const} = \rho$$
	\item Therefore, the model contains only a single parameter, which can be optimized by applying MLE: $$\rho = \frac{\#\text{clicks}}{\#\text{shown docs}}$$
	\item \textit{Advantages}: simple and fast
	\item \textit{Disadvantages}: the random click model does not consider many aspects including the position and content of a document
	\item There are different variations of this model (also called click-through rate models - CTR) considering more aspects
	\begin{itemize}
		\item \textbf{Rank-based CTR} - modeling a probability for every rank on the result page: $P(C_{u_r} = 1) = \rho_r$
		\item \textbf{Query-document CTR} - modeling a probability for every query-document pair in the dataset: $P(C_{u}=1) = \rho_{uq}$
	\end{itemize}
\end{itemize}
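The MLE estimates for the random click model and the rank-based CTR model reduce to simple counting. The sketch below assumes a click log given as a list of result pages, each a list of 0/1 click indicators ordered by rank; this log format and the function names are assumptions for illustration.

```python
from collections import defaultdict

def random_click_model(sessions):
    """rho = #clicks / #shown docs over all result pages."""
    shown = sum(len(s) for s in sessions)
    clicks = sum(sum(s) for s in sessions)
    return clicks / shown

def rank_based_ctr(sessions):
    """rho_r = #clicks at rank r / #times rank r was shown."""
    shown, clicks = defaultdict(int), defaultdict(int)
    for s in sessions:
        for rank, c in enumerate(s, start=1):
            shown[rank] += 1
            clicks[rank] += c
    return {r: clicks[r] / shown[r] for r in shown}
```

For the four sessions \texttt{[1,0,0]}, \texttt{[1,1,0]}, \texttt{[0,0,0]}, \texttt{[1,0,0]}, the single parameter is $\rho = 4/12$, while the rank-based model separates $\rho_1 = 0.75$, $\rho_2 = 0.25$ and $\rho_3 = 0$.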
\subsection{Position-based model}
\begin{itemize}
	\item Position-based models take the position \textit{and} the document-query pair into account for modeling the probability of a click
	\begin{itemize}
		\item \textit{Examination} - reading a snippet at a rank/position $\implies$ $P(E_r = 1) = \gamma_r$
		\item \textit{Attractiveness} - prob. for document-query relevance $\implies$ $P(A_{uq} = 1) = \alpha_{uq}$
		\item The combined probability of clicking on a document is therefore: $$P(C_u = 1) = P(E_{r_u} = 1) \cdot P(A_{uq} = 1)$$
	\end{itemize}
	\item The model is visualized in Figure~\ref{img:click_models_PBM_pgm}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.25\textwidth]{figures/click_models_PBM_pgm.png}
		\caption{Probabilistic graphical model of parameters for PBM}
		\label{img:click_models_PBM_pgm}
	\end{figure}
	\item The examination models the position bias in user clicks while the attractiveness covers the document relevance
	\item \textit{Advantages}: Distinguishing between position bias and document relevance
	\item \textit{Disadvantages}: the Position-based model assumes that all clicks are independent of each other. Models that overcome this include:
	\begin{itemize}
		\item \textit{User browsing model (UBM)} - examination is also based on the rank of the previously clicked document $\implies$ $P(E_{r,r'}=1) = \gamma_{r,r'} $ ($n + n\cdot (n-1)/2$ parameters $\to$ 55 parameters for $n=10$)
		\item \textit{Cascade model} - see next section
	\end{itemize}
\end{itemize}
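The factorization of the position-based model can be sketched in a few lines. The parameters $\gamma_r$ and $\alpha_{uq}$ are assumed to be given (in practice they would be estimated, for example with EM); the data layout and the function name are illustrative.

```python
def pbm_click_probs(ranking, query, gamma, alpha):
    """P(C_u = 1) = gamma_r * alpha_uq for every document in the ranking.

    gamma: examination probability per rank (index 0 = top rank)
    alpha: attractiveness per (document, query) pair
    """
    return [gamma[r] * alpha[(doc, query)] for r, doc in enumerate(ranking)]
```

With $\gamma = (1.0, 0.5)$ and attractiveness $0.4$ for the top document and $0.8$ for the second one, both documents have a click probability of $0.4$: the raw click rate alone cannot distinguish position bias from relevance, which is why the two factors are modeled separately.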
\subsection{Cascade model}
\begin{itemize}
	\item The cascade model assumes that the user scans the documents from top to bottom until they find a relevant document and click on it
	\item Thus, the top document is always examined, while following documents are only examined if none of the previous ones were clicked
	\item The cascade model can be summarized in the equations:
	\begin{equation*}
		\begin{split}
			P(A_r = 1) & = \alpha_{u_r q}\\
			P(E_1 = 1) & = 1 \textit{\hspace{7mm} first element is always examined}\\
			P(E_r = 1|C_{r-1} = 1) & = 0 \textit{\hspace{7mm} stop if previous document is clicked}\\
			P(E_r = 1|E_{r-1} = 0) & = 0 \textit{\hspace{7mm} only examine if none of the documents before was clicked}\\
			P(E_r = 1|E_{r-1}=1, C_{r-1}=0) & = 1 \textit{\hspace{7mm} if no click was performed yet, examine next document}\\
		\end{split}
	\end{equation*}
	\item Therefore, the model has no parameters for examination and solely relies on attractiveness. The corresponding PGM is visualized in Figure~\ref{img:click_models_CM_pgm}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/click_models_CM_pgm.png}
		\caption{Probabilistic graphical model of parameters for CM}
		\label{img:click_models_CM_pgm}
	\end{figure}
	\item \textit{Advantages}: Clicking on a document depends on previous decisions/documents
	\item \textit{Disadvantages}: No skips are allowed. Also, the cascade model only considers a single click $\implies$ Dynamic Bayesian Networks
\end{itemize}
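From the equations of the cascade model, the marginal click probability at rank $r$ is $\alpha_{u_r q}\prod_{j<r}(1-\alpha_{u_j q})$. A minimal sketch, assuming the attractiveness values are given in rank order (the function name is illustrative):

```python
def cascade_click_probs(alphas):
    """Marginal click probability per rank under the cascade model."""
    probs, examine = [], 1.0  # the first document is always examined
    for a in alphas:
        probs.append(examine * a)   # click = examined and attractive
        examine *= (1.0 - a)        # continue only if there was no click
    return probs
```

Since the user stops at the first click, the probabilities sum to at most 1, for example $0.5$, $0.25$, $0.125$ for three documents with attractiveness $0.5$ each.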

================================================
FILE: Information_Retrieval_1/ir_counterfactual_eval.tex
================================================
\section{Counterfactual Evaluation and Learning to Rank}
\begin{itemize}
	\item The term \textit{counterfactual} relates to \textit{off-policy} learning in RL
	\item Thus, we evaluate offline: we use online data that was collected under another policy to estimate how the new policy would perform in an online setting
\end{itemize}
\subsection{Counterfactual Evaluation}
\begin{itemize}
	\item In general, a user interactive system can be formalized as follows (see Figure~\ref{img:counterfactual_user_interactive_system}):
	\begin{itemize}
		\item $x$: Feature vector describing the user and context (i.e. query)
		\item $y$: Result the system returns based on its policy ($y=\pi(x)$)
		\item $\delta$: Feedback signal from the actions a user took. The function encodes the metric (user utility function) and is defined as $\delta: X\times Y\to \mathbb{R}$
		\item $\pi$: Policy describing the ranking system which takes $x$ as input and maps it to output $y$: $\pi:X\to Y$
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/counterfactual_user_interactive_system.png}
		\caption{Visualization of a user interactive system}
		\label{img:counterfactual_user_interactive_system}
	\end{figure}
	\item \textit{Counterfactual evaluation}: perform offline evaluation of online metrics given online data from another system $\pi_{\text{production}}$. Thus, we try to estimate performance of $\pi_{\text{new}}$ with interaction data obtained with $\pi_{\text{production}}$.
	\item The interactions data/log is structured as $D=\left\{\left(x_1, y_1, \delta_1\right),...,\left(x_n, y_n, \delta_n\right)\right\}$
	\begin{itemize}
		\item The actions $y_i$ were selected by $\pi_{\text{production}}:X\to Y$
		\item Note that we only have partial information feedback, and no complete supervision. Only for the chosen action, we know the feedback signal/user utility. Thus, the "correct"/optimal action is unknown (also called "bandit feedback" as it was sampled from only one arm)
	\end{itemize}
	\item We want to estimate $\mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right]$ given $D$ from $\pi_{\text{production}}$. For this, there are two approaches: \textit{model the rewards} and \textit{inverse propensity scoring}.
\end{itemize}
\subsubsection{Model the rewards}
\begin{itemize}
	\item The intuition behind \textit{model the rewards} is to learn the reward function $\delta:X\times Y\to \mathbb{R}$ from $D\sim \pi_{\text{production}}$ directly
	\item The task can be reduced to a regression problem: $$\delta_w = \arg\min_{\delta_w} \sum\limits_{i=1}^{n} \mathcal{L}\left(\delta_w\left(x_i, y_i\right), \delta_i \right)$$
	where $\mathcal{L}$ is a loss function like MSE.
	\item Once $\delta_w$ is learned, we can estimate our goal by $\mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right] \approx \frac{1}{n} \sum\limits_{i=1}^{n} \delta_w \left(x_i, \pi_{\text{new}}(x_i)\right)$
	\item However, learning $\delta_w$ is in general very difficult, as:
	\begin{itemize}
		\item Input space $X\times Y$ is very high-dimensional
		\item Rewards are highly non-linear and noisy
		\item Data is strongly biased to the actions that $\pi_{\text{production}}$ prefers
	\end{itemize}
\end{itemize}
\subsubsection{Inverse Propensity Scoring}
\begin{itemize}
	\item Instead of learning $\delta_w$, is it possible to directly estimate the value of the new policy $\pi_{\text{new}}$?
	\item Answer: only under the condition that the policy $\pi_{\text{production}}$ is stochastic: $y\sim \pi(y|x)$. The probability $p$ to choose the action $y$ is also called \textit{propensity}. Note that $p>0$ must hold for all possible actions as we otherwise have no chance to discover/obtain feedback for all actions
	\item For unbiased counterfactual evaluation, we need data samples with the propensity $p_i$ from policy $\pi_{\text{production}}$ describing the probability of selecting $y_i$ for given input $x_i$: $\left(x_i, y_i, \delta_i, p_i\right)$
	\item Use importance sampling to make distributions $\pi_{\text{production}}$ and $\pi_{\text{new}}$ comparable. This leads to the \textbf{IPS-estimator}:
	$$\frac{1}{n}\sum\limits_{i=1}^{n} \delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}$$
\end{itemize}
\subsubsection{Proof of Unbiasedness}
\begin{itemize}
	\item We want to prove that in expectation, the IPS estimator will lead to the correct value: $$\mathbb{E}_{y\sim \pi_{\text{production}}}\left[\frac{1}{n}\sum\limits_{i=1}^{n} \delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right]$$
	\item First, we can put the sum outside the expectation:
	$$\mathbb{E}_{y\sim \pi_{\text{production}}}\left[\frac{1}{n}\sum\limits_{i=1}^{n} \delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \frac{1}{n}\sum\limits_{i=1}^{n} \mathbb{E}_{y\sim \pi_{\text{production}}}\left[\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right]$$
	\item Next, we replace the expectation by a sum over actions weighted by their corresponding probabilities:
	$$\frac{1}{n}\sum\limits_{i=1}^{n} \mathbb{E}_{y\sim \pi_{\text{production}}}\left[\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \frac{1}{n}\sum\limits_{i=1}^{n} \sum\limits_{y_i\in Y}\left[\pi_{\text{production}}(y_i|x_i)\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right]$$
	\item As $p_i$ is defined as $\pi_{\text{production}}(y_i|x_i)$, we can reduce the equation to:
	$$\frac{1}{n}\sum\limits_{i=1}^{n} \sum\limits_{y_i\in Y}\left[\pi_{\text{production}}(y_i|x_i)\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \frac{1}{n}\sum\limits_{i=1}^{n} \sum\limits_{y_i\in Y}\left[\delta_i \pi_{\text{new}}(y_i|x_i)\right]$$
	\item Finally, we apply the definition of the expectation again:
	$$\frac{1}{n}\sum\limits_{i=1}^{n} \sum\limits_{y_i\in Y}\left[\delta_i \pi_{\text{new}}(y_i|x_i)\right] = \frac{1}{n}\sum\limits_{i=1}^{n} \mathbb{E}_{y_i\sim \pi_{\text{new}}}\left[\delta(x_i,y_i)\right] = \mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right]$$
	\item Note that the IPS estimator has a high variance because the importance weights $\pi_{\text{new}}(y_i|x_i)/p_i$ scale with $1/p_i$. Thus, if we have a very low probability for some actions, this can introduce a high error $\implies$ many samples needed to approximate target accurately. There are different approaches to reduce the variance
\end{itemize}
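The IPS estimator and the unbiasedness argument can be checked on a small discrete example. The sketch below assumes a finite action set and policies given as functions returning probabilities; the log format $(x_i, y_i, \delta_i, p_i)$ follows the text, while the function names are illustrative.

```python
def ips_estimate(log, pi_new):
    """IPS estimate from a log of (x, y, delta, p) tuples; pi_new(y, x) is a probability."""
    return sum(delta * pi_new(y, x) / p for x, y, delta, p in log) / len(log)

def expected_ips_term(x, actions, pi_prod, pi_new, delta):
    """Exact expectation of one IPS term when y is drawn from pi_prod(. | x)."""
    total = 0.0
    for y in actions:
        p = pi_prod(y, x)  # propensity of the logged action, must be > 0
        total += p * delta(x, y) * pi_new(y, x) / p
    return total

def true_value(x, actions, pi_new, delta):
    """Expected reward of the new policy for context x."""
    return sum(pi_new(y, x) * delta(x, y) for y in actions)
```

For any context, \texttt{expected\_ips\_term} and \texttt{true\_value} agree as long as every action has a positive propensity, which mirrors the cancellation of $p_i$ in the proof. A single logged sample with a small propensity can still contribute a very large term, which is the variance problem noted above.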
\subsection{Counterfactual Learning to Rank}
\begin{itemize}
	\item Learning to Rank: \textit{offline} - train on labeled data, \textit{online} - learn from user interactions, \textit{counterfactual} - learn offline from online retrieved data obtained by another policy/ranker
	\item The goal of counterfactual LTR is to learn a new ranker $\pi_{\text{new}}$ from the interaction data with $\pi_{\text{production}}$
	\begin{itemize}
		\item The data is specified by $D=\left\{(x_1,y_1,\delta_1),...,(x_N,y_N,\delta_N)\right\}$ where $\delta_i$ indicates which document was clicked (we assume that only one document was clicked)
		\item $y_i$ is the ranking selected by $\pi_{\text{production}}:X\to Y$
	\end{itemize}
	\item Naive approach: assume click indicates relevance and learn as if it would be a supervised dataset: $$\pi_{\text{new}} = \arg\min_{\pi} \sum\limits_{i=1}^{N} \text{rank}\left(\pi(x_i),y_i,\delta_i\right)$$
	The objective function is to reduce the rank of the relevant document given the new ranking of $\pi_{\text{new}}$ and the previous ranking by $y_i$. Can be solved by pairwise LTR objective.
	\item However, interaction data collected online is commonly noisy and biased (e.g. by the position of a document)
	\item We can take these biases into account by using the inverse propensity scores:
	$$\pi_{\text{new}} = \arg\min_{\pi} \sum\limits_{i=1}^{N} \frac{\text{rank}\left(\pi(x_i),y_i,\delta_i\right)}{p(\textit{observing }\delta_i)}$$
	This formula can be motivated from a probabilistic click model perspective:
	$$p(\textit{click}) = p(\textit{observation})\times p(\textit{relevant}) \implies p(\textit{relevant}) = \frac{p(\textit{click})}{p(\textit{observation})}$$
	Here, $p(\textit{relevant})$ is what we want to obtain, and the ratio $p(\textit{click})/p(\textit{observation})$ is what we actually optimize.
\end{itemize}
\subsubsection{Propensity estimation}
\begin{itemize}
	\item However, the question remains how we calculate $p(\textit{observing }\delta_i)$. We can either approximate it by using click models, or by performing a randomization test
	\item \textit{RandTopN}
	\begin{itemize}
		\item Randomly shuffle the top $N$ documents
		\item Measure the clicks users perform on the shuffled results (online experiment)
		\item Aggregate clicks over many samples (ideally infinitely many)
		\item Infer $\hat{p} \propto p(\textit{observing }\delta_i)$
	\end{itemize} 
	\item \textit{RandPair}
	\begin{itemize}
		\item Randomly swap the top document with a random document from the top $N$
		\item Infers $\frac{p(\textit{observing }\delta_i)}{p(\textit{observing }\delta_j)}$ for swapped documents $i$ and $j$
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\begin{subfigure}[b]{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/counterfactual_LTR_RandTopN.png}
			\caption{RandTopN}
		\end{subfigure}
		\begin{subfigure}[b]{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/counterfactual_LTR_RandPair.png}
			\caption{RandPair}
		\end{subfigure}
		\label{img:counterfactual_propensity_estimation}
	\end{figure}
\end{itemize}
\subsubsection{The Variance problem}
\begin{itemize}
	\item A problem of the counterfactual objective is that if $p(\textit{observing }\delta_i)$ approaches $0$, the overall objective will be dominated by this example $\implies$ overfitting on a single data point
	\item One way to overcome this problem is using a variance regularizer which prevents the policy from deviating too much from the original production policy $\pi_{\text{production}}$:
	$$\pi_{\text{new}} = \arg\min_{\pi} \sum\limits_{i=1}^{N} \frac{\text{rank}\left(\pi(x_i),y_i,\delta_i\right)}{p(\textit{observing }\delta_i)} + \lambda \sqrt{\frac{\mathcal{V}[\pi, \pi_{\text{production}}]}{n}}$$
	\item However, this optimization problem cannot be solved by SGD anymore and iterative methods must be applied (new learning framework \textit{counterfactual risk minimization})
\end{itemize}

================================================
FILE: Information_Retrieval_1/ir_language_models.tex
================================================
\section{Introduction to Retrieval models}
\begin{itemize}
	\item Mathematical framework for defining query-document matching
\end{itemize}
\subsection{TF-IDF}
\begin{itemize}
	\item In a vector space model, documents and queries are represented in vector space
	\item Axes are mostly terms/vocabulary so that a document or query is represented by terms they contain (or their frequency)
	\item We can rank documents based on their cosine similarity with the query:
	$$\text{score}(d,q) = \frac{\vec{q} \cdot \vec{d}}{||\vec{q}||\cdot ||\vec{d}||}$$
	\item Documents can be therefore represented as non-negative vector of term weights (raw frequency in doc):
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/language_models_tf_example.png}
		\label{img:language_models_tf_example}
	\end{figure}
	\item However, the problem here is that terms with a higher frequency in documents are automatically more important, although this is not always the case (e.g. "the"). Thus, for identifying the important terms, we can use the document frequency (number of documents in which a term occurs):
	$$\text{df}(t) \coloneqq \#\left\{d:\text{tf}(t;d)>0\right\}$$
	\item We can translate document frequencies to term weights by inverting them (inverse document frequency - \textit{IDF}):
	$$\text{idf}(t) = \log \frac{n}{\text{df}(t)} = \log n - \log \text{df}(t)$$
	The log is applied to dampen the effect of IDF.
	\item Also the term frequencies should be dampened by a monotonic, sub-linear transformation as a term occurring twice as often doesn't imply that the document is also twice as important/relevant. Together, we can define the tf-idf weights as follows:
	$$\text{tf-idf}(t;d) = \log \left(1+\text{tf}(t;d)\right) \log \frac{n}{\text{df}(t)}$$
	\item Scores are normalized by the Euclidean norm of the document vector (as in the cosine similarity above). Alternatively, we could also apply tf-idf on the relative term frequencies.
\end{itemize}
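The tf-idf weighting and the cosine score can be sketched with the standard library. This example assumes tokenized documents and uses the weights $\log(1+\text{tf}(t;d))\log\frac{n}{\text{df}(t)}$ from above; the sparse dictionary representation and the function names are illustrative choices.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequencies
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if one of them is all zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A term that occurs in every document, such as "the", receives the weight $\log 1 = 0$ and therefore does not influence the ranking.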
\subsection{BM25}
\begin{itemize}
	\item Probabilistic retrieval framework that extends the idea of tf-idf
	\item Instead of the log, we use a different damping function which is easier to control:
	$$w_t = \frac{(k_1 + 1)\cdot \text{tf}(d;t)}{k_1 + \text{tf}(d;t)}\cdot \text{idf}(t)$$
	\item In addition, we normalize the term frequency by the document length: $\text{tf}'(d;t) = \text{tf}(d;t) \cdot l_{avg}/l_{d}$ ($l_{avg}$ is the average document length of collection). By this we prevent copies of documents concatenated with each other being higher rated. Putting this into our original function, we get:
	$$w_t = \frac{(k_1 + 1) \cdot \text{tf}(d;t)}{k_1 \cdot (l_d / l_{avg})+ \text{tf}(d;t)}\cdot \text{idf}(t)$$
	\item However, longer documents also tend to contain more information. Thus, we introduce another parameter $b$ that controls the normalization:
	$$w_t = \frac{(k_1 + 1) \cdot \text{tf}(d;t)}{k_1 \cdot ((1-b) + b\cdot (l_d / l_{avg}))+ \text{tf}(d;t)}\cdot \text{idf}(t)$$
	\item For very long queries, we also need to consider this normalization which can be done by multiplying another term $\frac{(k_3 + 1)\cdot \text{tf}(q;t)}{k_3 + \text{tf}(q;t)}$
	\item In conclusion, the BM25 score is calculated as follows:
	$$\text{BM25} = \sum\limits_{\text{unique\hspace{1mm}} t\in q} \frac{(k_1 + 1) \cdot \text{tf}(d;t)}{k_1 \cdot ((1-b) + b\cdot (l_d / l_{avg}))+ \text{tf}(d;t)} \cdot \frac{(k_3 + 1)\cdot \text{tf}(q;t)}{k_3 + \text{tf}(q;t)} \cdot \text{idf}(t)$$
	\item Parameters $k_1$, $b$ and $k_3$ are tuned. Common defaults are $k_1 = 1.5$ and $b=0.75$
	\item It is the most widely used ranking in IR but only loosely inspired by probabilistic models
\end{itemize}
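The final BM25 formula translates directly into code. The sketch below assumes tokenized documents and the plain $\text{idf}(t) = \log\frac{n}{\text{df}(t)}$ from the tf-idf section; the default for $k_3$ is an arbitrary placeholder and the function name is illustrative.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75, k3=1.0):
    """BM25 score of `doc` for `query`; `docs` is the whole collection."""
    n = len(docs)
    l_avg = sum(len(d) for d in docs) / n
    tf_d, tf_q = Counter(doc), Counter(query)
    score = 0.0
    for t, qf in tf_q.items():  # iterate over unique query terms
        df = sum(1 for d in docs if t in d)
        if df == 0 or tf_d[t] == 0:
            continue  # term does not contribute to this document
        idf = math.log(n / df)
        length_norm = (1 - b) + b * len(doc) / l_avg
        doc_part = (k1 + 1) * tf_d[t] / (k1 * length_norm + tf_d[t])
        query_part = (k3 + 1) * qf / (k3 + qf)
        score += doc_part * query_part * idf
    return score
```

Setting $b=0$ switches the length normalization off; for a term with $\text{tf}(d;t)=1$ and $\text{tf}(q;t)=1$ both damping factors then equal 1 and the score reduces to $\text{idf}(t)$.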
\subsection{Statistical Language Models}
\begin{itemize}
	\item Statistical language models are a probability distribution over word sequences $P(w_1, ..., w_m)$ with which documents and queries can be represented (and uncertainty quantified)
	\item Thus, a language model describes the probability of e.g. $q$ being the given word sequence
	\item Documents are ranked by their similarity to a given query. For this, we can use either document likelihood, query likelihood or KL-divergence
\end{itemize}
\subsubsection{Query likelihood}
\begin{itemize}
	\item Given a document, what queries are most likely to be created for it? 
	\item We first have to ensure that the query likelihood correlates with document likelihood. Therefore, we apply the Bayes rule: $p(d|q) = \frac{p(q|d)p(d)}{p(q)}$. As $p(q)$ is equal for all documents, and we assume a uniform prior for all documents (though not always the case), we retrieve $p(d|q)\propto p(q|d)$
	\item Thus, by generating a probability distribution of possible queries for a document, we can approximate how likely a document is given a query.
	\item The scoring function is defined as follows:
	$$\text{score}(d,q) = \log \left[p(q|\theta_d)\cdot p(d)\right]$$
	where $\theta_d$ describes the document. There are mainly three modeling choices:
	\begin{enumerate}
		\item \textit{How to define the generative process $p(q|\theta_d)$?}
		\begin{itemize}
			\item Given $\theta_d$, what is the generative process for getting $q=w_1,...,w_{|q|}$?
			\item Different distributions are possible
			\item \textit{Multiple Bernoulli} - bag of word perspective, every word in vocabulary has probability to be in query or not. The related probability is:
			$$p(q|\theta_d) = \prod\limits_{w_i \in q} p(X_i = 1 | \theta_d) \prod\limits_{w_i \not\in q} \left(1 - p\left(X_i = 1 | \theta_d\right) \right)$$
			\item \textit{Multinomial} - similar to Bernoulli, but we now have a random variable for every word slot in the query and not one for every word in the vocabulary. Thus, the calculation is:
			$$p(q|\theta_d) = \prod\limits_{w_i \in q} p(w_i | \theta_d) \text{\hspace{4mm}where\hspace{4mm}} \sum\limits_{w_i \in V} p(w_i|\theta_d) = 1$$
			\item \textit{Multiple Poisson} - similar to Bernoulli, but instead of presence or absence, we model the number of times we expect a word from the vocabulary to occur in the query of length $|q|$ by a Poisson distribution:
			$$p(q|\theta_d) = \prod\limits_{w_i \in V} \frac{e^{-\lambda_i |q|} (\lambda_i |q|)^{\text{tf}(w_i;q)}}{\text{tf}(w_i;q)!}$$
		\end{itemize}
		\item \textit{How to estimate $\theta_d$ based on document $d$?}
		\begin{itemize}
			\item To estimate $\theta_d$ we perform MLE: $\hat{\theta}_d = \arg \max_{\theta_d} p(d|\theta_d)$
			\item In case of a multinomial distribution, we would get:
			$$p(d|\theta_d) = \prod\limits_{w_i \in V} p(w_i | \theta_d)^{\text{tf}(w_i;d)} \implies \log p(d|\theta_d) = \sum\limits_{w_i \in V} \text{tf}(w_i;d) \log p(w_i | \theta_d)$$
			\item Note that this is a constrained optimization problem with $\sum\limits_{w_i \in V} p(w_i|\theta_d) = 1$.
			\item By using lagrangian multiplier, we get $p_{MLE}(w_i|d) = \frac{\text{tf}(w_i;d)}{|d|}$
		\end{itemize}
		\item \textit{How to compute prior $p(d)$?}
		\begin{itemize}
			\item The prior takes everything into account which is independent of a query.
			\item This can include number of clicks, credibility, ...
		\end{itemize}
	\end{enumerate}
\end{itemize}
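The multinomial query likelihood with the ML estimate $p_{MLE}(w_i|d) = \text{tf}(w_i;d)/|d|$ can be sketched as follows, assuming tokenized documents and a uniform prior $p(d)$ that is dropped from the score; the function names are illustrative.

```python
import math
from collections import Counter

def mle_language_model(doc):
    """p(w | theta_d) = tf(w; d) / |d| for every word in the document."""
    tf = Counter(doc)
    return {w: c / len(doc) for w, c in tf.items()}

def query_log_likelihood(query, model):
    """log p(q | theta_d) under the multinomial model; -inf if a query word is unseen."""
    total = 0.0
    for w in query:
        p = model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total
```

A single query word that does not occur in the document drives the score to $-\infty$, which is the motivation for smoothing.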
\subsubsection{Smoothing}
\begin{itemize}
	\item Question: how to deal with unseen words, which have a probability of 0 under the ML estimate?
	\item First, we assume a multinomial distribution again with the optimal parameters of $p(w_i|\theta_d) = \frac{\text{tf}(w_i;d)}{|d|}$
	\item \textbf{Additive smoothing}: add a small extra count to every word:
	$$p(w_i|\theta_d) = \frac{\text{tf}(w_i;d) + \epsilon}{|d| + \epsilon |V|}$$
	In case of $\epsilon=0$, we fall back to ML estimation. $\epsilon=1$ is called Laplace smoothing.
	\item \textbf{Jelinek-Mercer smoothing}: linearly interpolate with "background" knowledge so that rare words also have smaller additives:
	$$p_{\lambda}(w_i|\theta_d) = \lambda \frac{\text{tf}(w_i;d)}{|d|} + (1 - \lambda) \frac{\text{tf}(w_i;C)}{|C|}$$
	The background collection $C$ is approximated by the concatenation of all documents.
	\item \textbf{Dirichlet prior smoothing}: we assume that before seeing the document, we have a prior belief over all words $p(\theta_d)$. We use the posterior which gets narrower the more words we see and therefore the more certain we are about the document distribution.
	\begin{itemize}
		\item Maximum A Posteriori estimate by $\hat{\theta}_d = \arg\max_{\theta_d} p(\theta_d|d) = \arg\max_{\theta_d} p(d|\theta_d) p(\theta_d)$
		\item Prior distribution $\theta_d\sim \text{Dir}(\bm{\alpha}) \implies p(\theta_d) \propto \prod\limits_{w \in V} p(w|\theta_d)^{\alpha_w - 1}$
		\item With a multinomial likelihood, we get:
		$$p(\theta_d | d) \propto \prod\limits_{w \in V} p(w|\theta_d)^{\text{tf}(w;d)} \prod\limits_{w \in V} p(w|\theta_d)^{\alpha_w - 1} = \prod\limits_{w \in V} p(w|\theta_d)^{\text{tf}(w;d) + \alpha_w - 1}$$
		\item Thus, our new MAP solution is:
		$$p(w|\theta_d) = \frac{\text{tf}(w;d) + \alpha_w - 1}{|d| + \sum_{w\in V}\alpha_w - |V|}$$
		\item For $\alpha_w = 1$, we get MLE estimation, and $\alpha_w = 2$ represents Laplace smoothing.
		\item We can also rewrite the smoothing similar to Jelinek-Mercer smoothing:
		$$p(w|\theta_d) = \frac{|d|}{|d|+ \mu}\frac{\text{tf}(w;d)}{|d|} + \frac{\mu}{\mu + |d|}p(w|C)$$
		where $\mu$ is a parameter determined by the prior via $\alpha_w - 1 = \mu \cdot p(w|C)$. Thus, we interpolate with the background knowledge while taking the document length into account.
	\end{itemize}
	\item Next to Dirichlet prior smoothing, we can also use other distributions (for example a beta prior with multiple Bernoulli) which lead to slightly different smoothing functions. For example, with the beta prior, we get for a variable $\alpha_w$ and $\beta_w$ (without constraints!):
	$$p(w|\theta_d) = \frac{\text{tf}(w;d) + \alpha_w - 1}{\alpha_w + \beta_w - 1}$$
\end{itemize}
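\begin{itemize}
	\item A small worked example may help to compare the smoothing methods. All numbers are illustrative assumptions: $|d|=10$, $|V|=100$, $p(w|C)=0.01$, $\epsilon=1$, $\lambda=0.8$ and $\mu=10$. We compare a word with $\text{tf}(w;d)=2$ and an unseen word with $\text{tf}(w;d)=0$:
	\begin{itemize}
		\item MLE: $\frac{2}{10} = 0.2$ and $\frac{0}{10} = 0$
		\item Additive: $\frac{2+1}{10+100} \approx 0.027$ and $\frac{0+1}{10+100} \approx 0.009$
		\item Jelinek-Mercer: $0.8 \cdot 0.2 + 0.2 \cdot 0.01 = 0.162$ and $0.8 \cdot 0 + 0.2 \cdot 0.01 = 0.002$
		\item Dirichlet prior: $\frac{10}{20} \cdot 0.2 + \frac{10}{20} \cdot 0.01 = 0.105$ and $\frac{10}{20} \cdot 0 + \frac{10}{20} \cdot 0.01 = 0.005$
	\end{itemize}
	\item The interpolated Dirichlet form can be obtained from the MAP solution by choosing $\alpha_w - 1 = \mu \cdot p(w|C)$ and using $\sum_{w\in V} p(w|C) = 1$:
	$$p(w|\theta_d) = \frac{\text{tf}(w;d) + \mu \cdot p(w|C)}{|d| + \mu} = \frac{|d|}{|d|+ \mu}\frac{\text{tf}(w;d)}{|d|} + \frac{\mu}{\mu + |d|}p(w|C)$$
	\item In contrast to Jelinek-Mercer smoothing with a fixed $\lambda$, the effective interpolation weight $\frac{|d|}{|d|+\mu}$ grows with the document length, so longer documents are smoothed less
\end{itemize}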
\subsubsection{Positional Language Models}
\begin{itemize}
	\item There are variants of basic language models capturing term dependencies
	\item Instead of having one language model representing the whole document, Positional Language Models define a LM for every word position
	\item Thus we capture (small) "fuzzy" passages with which we can match our query
	\item A term at each position can propagate its occurrence to close positions in word windows
	\begin{itemize}
		\item Example sentence: \texttt{the black hat is not}...
		\item With an equally weighted word window of size one, we would obtain the following language model (MLE params) for the position of word "\texttt{black}": $p(\texttt{black}|\theta_p) = 1/3, \hspace{2mm} p(\texttt{the}|\theta_p) = 1/3, \hspace{2mm} p(\texttt{hat}|\theta_p) = 1/3$
	\end{itemize}
	\item We can weight the occurrences of every word based on the distance to the "root" of the language model (also called kernel):
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/language_models_positional.png}
		\label{img:language_models_positional}
	\end{figure}
	\item In general, the term frequency of a word for a LM at position $j$ with kernel $k$ is determined as follows:
	$$\text{tf'}(w,j;d) = \sum\limits_{i=1}^{|d|} \text{tf}(w,i;d) \cdot k(i,j) $$
	\item The language model at every position is given by the corresponding MLE estimation:
	$$p(w|d,j) = \frac{\text{tf'}(w,j;d)}{\sum_{w'\in V} \text{tf'}(w',j;d)}$$
	\item Documents can now be scored by either their best matching language model with the query, or the average of the top-$k$ models
\end{itemize}
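\begin{itemize}
	\item To illustrate the kernel-weighted term frequencies, consider again the sentence \texttt{the black hat is not} and the position $j=2$ (\texttt{black}). As an assumption for this sketch, we use the triangular kernel $k(i,j) = \max\left(0, 1 - \frac{|i-j|}{2}\right)$, and $\text{tf}(w,i;d)$ is 1 if word $w$ occurs at position $i$ and 0 otherwise:
	\begin{itemize}
		\item Kernel weights: $k(1,2) = 0.5$, $k(2,2) = 1$, $k(3,2) = 0.5$, $k(4,2) = 0$, $k(5,2) = 0$
		\item Propagated counts: $\text{tf'}(\texttt{the},2;d) = 0.5$, $\text{tf'}(\texttt{black},2;d) = 1$, $\text{tf'}(\texttt{hat},2;d) = 0.5$, with a total of 2
		\item Language model: $p(\texttt{black}|d,2) = \frac{1}{2}, \hspace{2mm} p(\texttt{the}|d,2) = \frac{1}{4}, \hspace{2mm} p(\texttt{hat}|d,2) = \frac{1}{4}$
	\end{itemize}
	\item Compared to the equally weighted window, the word at the position itself now receives more probability mass than its neighbors
\end{itemize}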

================================================
FILE: Information_Retrieval_1/ir_learning_to_rank.tex
================================================
\section{Learning to Rank}
\begin{itemize}
	\item Main issue in information retrieval is to determine whether document $d$ is relevant for query $q$
	\item Common relevance signals include TF-IDF, BM25, document popularity etc.
	\item But: what signals to use/how to combine these signals? There is not a single relevance signal "to rule them all" $\implies$ combine all signals in a model
	\item Simplest combination method: linear model $f(\bm{d},\bm{\theta}) = \sum\limits_{i=1}^{|d|} \theta_i d_i$ where $\bm{d}$ represents the different signals for document-query pair
	\item Task: find the optimal parameter set $\bm{\theta}$, commonly by Machine Learning techniques (linear regression)
\end{itemize}
\subsection{Offline Learning To Rank}
\begin{itemize}
	\item Given: an annotated dataset that relates documents to relevance labels or rankings
	\item There are three different approaches
	\begin{enumerate}
		\item \textbf{Pointwise}: optimize the model $f(\bm{d},\bm{\theta})$ to predict the relevancy of a document. This can be recast as a regression problem with loss:
		$$\mathcal{L}=\sum_{\bm{d}} \left(f(\bm{d},\bm{\theta}) - \text{relevancy}(d,q)\right)^2$$
		However, this approach does not consider the application of ranking where only the final order is important, but not the single scores.
		\item \textbf{Pairwise}: optimize regarding the total order of the documents and not specific relevance scores. The loss can be expressed by:
		$$\mathcal{L}=\sum_{d\succ d'}\left[f(\bm{d'},\bm{\theta}) - f(\bm{d},\bm{\theta})\right]$$ 
		where $d\succ d'$ means that $d$ is ranked above $d'$ in the labeled ranking. Nevertheless, this method does not take into account that only a small part (e.g. the top 10) of the collection is actually presented to the user.
		\item \textbf{Listwise}: optimize regarding ranking metrics like $DCG$. Thus, the loss could be:
		$$\mathcal{L} = -nDCG(f(\cdot,\bm{\theta}))$$
		The problem is that most ranking metrics are not differentiable. There are heuristic approaches to still optimize with respect to such metrics. 
	\end{enumerate}
	\item Problems with offline Learning to Rank: similar to offline evaluation in Section~\ref{sec:offline_eval_problems}
	\begin{itemize}
		\item All described methods require an annotated dataset which contains either relevance labels for each document-query pair or a ranking over the whole collection.
		\item Creating such a dataset is time-consuming and expensive
		\item Impossible to personalize for a user (everyone prefers slightly different documents). Also, annotators and users might disagree on some points $\implies$ dataset does not fully reflect user behavior
		\item Relevance can change over time
	\end{itemize}
\end{itemize}
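\begin{itemize}
	\item The difference between the pointwise and pairwise view can be seen in a toy example. Assume (purely for illustration) two documents with $\text{relevancy}(d,q)=1$ and $\text{relevancy}(d',q)=0$, so that $d\succ d'$:
	\begin{itemize}
		\item Scores $f(\bm{d},\bm{\theta}) = 0.4$ and $f(\bm{d'},\bm{\theta}) = 0.6$: the pointwise loss is $(0.4-1)^2 + (0.6-0)^2 = 0.72$, and the pairwise loss is $0.6 - 0.4 = 0.2$. The pair is ordered incorrectly.
		\item Scores $f(\bm{d},\bm{\theta}) = 3$ and $f(\bm{d'},\bm{\theta}) = 2$: the pointwise loss is $(3-1)^2 + (2-0)^2 = 8$, and the pairwise loss is $2 - 3 = -1$. The pair is ordered correctly.
	\end{itemize}
	\item The pointwise loss prefers the first set of scores although it produces the wrong ranking, while the pairwise loss only depends on the score difference and prefers the second one
\end{itemize}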
\subsection{Online Learning to Rank}
\begin{itemize}
	\item Learn from implicit user feedback
	\begin{itemize}
		\item Might be noisy
		\item Consider position bias (higher rank is more frequently clicked) and selection bias (only a limited set of documents is presented to the user)
	\end{itemize}
	\item Online Learning to Rank methods can learn from user interactions, \textbf{and} control the results which are displayed/presented to the user
	\item Thus, these methods can be more efficient, as they control what data is actually gathered
	\item A general online learning to rank technique is visualized in Figure~\ref{img:learning_to_rank_online_overview}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/learning_to_rank_online_overview.png}
		\caption{Overview of the general concept of online learning to rank}
		\label{img:learning_to_rank_online_overview}
	\end{figure}
	\begin{itemize}
		\item The user enters a query, for which the ranking algorithm generates a list of documents
		\item The Online Learning to Rank system interacts with the results by adding and/or removing documents from the ranking. This can also include interleaving with another, slightly changed ranking algorithm
		\item User interacts with the displayed result and gives implicit feedback.
		\item The Online Learning to Rank algorithm updates the ranking parameters according to the analyzed feedback
	\end{itemize}
	\item \textbf{Advantages}: learns directly from the user, is more responsive by immediately adapting its parameters
	\item \textbf{Risks}:
	\begin{itemize}
		\item Unreliable methods will affect/worsen user experiences immediately.
		\item (Noisy) clicks can easily bias or even manipulate search engines
		\item \textbf{Self-confirming loop}
		\begin{itemize}
			\item If an irrelevant document was clicked at random, the system still perceives this document as relevant and will change its parameters accordingly
			\item Thus, this document will be placed higher in future rankings. Documents similar to the irrelevant one will also get an increased relevance score and will probably occur at a high position
			\item Most likely, the next clicked document will be one of these highly ranked, irrelevant documents $\implies$ entering a self-confirming loop
			\item Due to bias and noise, an irrelevant document was clicked and inferred to be relevant
			\item Due to noise, this inference is most likely to appear again
			\item The algorithm's confidence in this incorrect inference continues to increase
		\end{itemize}
	\end{itemize}
	\item To prevent a self-confirming loop, we have to balance exploration and exploitation
	\begin{itemize}
		\item \textit{Exploration}: collect feedback for learning from as many documents as possible
		\item \textit{Exploitation}: utilize what has already been learned
		\item If the system only exploits, it misses out on feedback for other documents that might be even better (danger of entering/staying in a self-confirming loop)
		\item A too high exploration rate leads to many irrelevant documents in the ranking, which worsens the user experience
	\end{itemize}
\end{itemize}
\subsubsection{Designing an Online Learning to Rank algorithm}
\begin{itemize}
	\item To design an OLTR algorithm, we have to make design choices in four aspects (see Figure~\ref{img:learning_to_rank_online_design})
\end{itemize}
\begin{figure}[ht]
	\centering
	\includegraphics[width=0.4\textwidth]{figures/learning_to_rank_online_design.png}
	\caption{General design components of an OLTR algorithm}
	\label{img:learning_to_rank_online_design}
\end{figure}
\begin{enumerate}[label=(\Alph*)]
	\item \textbf{Ranker}: the ranker maps documents to relevance scores. This module operates on feature level or on document IDs, and can for example be a linear ranker, a neural model, ...
	\item \textbf{Exploration strategy}: define interactions with results of the ranker. No exploration would mean that the document ranking is simply passed and stays unchanged. A common strategy is \textit{epsilon-greedy} where we inject random documents in random positions with ratio $\epsilon$. Other algorithms include upper confidence bound etc.
	\item \textbf{Signal recording and interpretation}: algorithm can consider multiple signals (raw observation like clicks and dwell time, more complex metrics like time to success). Should remove bias/noise. When result list was constructed by using interleaving, the feedback would also consider which ranker has won based on user interactions.
	\item \textbf{Update mechanism}: update the ranking algorithm given the user feedback. If the ranker operates on document IDs, we can update the document-specific relevance estimate for the query. If the ranker relies on features, we optimize a loss function like the ones shown in offline LTR.
\end{enumerate}
\subsubsection{Dueling Bandit Gradient Descent}
\begin{itemize}
	\item One of the first OLTR algorithms was the \textit{Dueling Bandit Gradient Descent}
	\item The intuition is that we compare two rankers by online evaluation, and optimize our system towards the better performing one.
	\item The method is structured in the following steps (visualized in Figure~\ref{img:learning_to_rank_online_DBGD})
	\begin{enumerate}
		\item From the current feature state $\theta_b$ of the ranker (shown in green), sample a new ranker/feature point $\theta_c = \theta_b + u$ lying on the unit sphere $||u||=1$ around the current one (shown in red)
		\item Get the rankings of $\theta_b$ and $\theta_c$
		\item Compare $\theta_b$ and $\theta_c$ using interleaving
		\item If $\theta_c$ wins comparison: update the current model by $\theta_b \leftarrow \theta_b + \eta (\theta_c - \theta_b)$
	\end{enumerate}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/learning_to_rank_online_DBGD.png}
		\caption{Steps in Dueling Bandit Gradient Descent}
		\label{img:learning_to_rank_online_DBGD}
	\end{figure}
	\item It can be shown that if there is only a single optimum, the Dueling Bandit Gradient Descent algorithm will be able to approximate the optimal model
\end{itemize}
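\begin{itemize}
	\item A single update step can be traced with toy numbers. The two-dimensional parameters, the sampled direction and the step size $\eta = 0.1$ are assumptions chosen only for illustration:
	\begin{itemize}
		\item Current ranker $\theta_b = (1, 0)$, sampled direction $u = (0, 1)$ with $||u||=1$, hence candidate $\theta_c = \theta_b + u = (1, 1)$
		\item If $\theta_c$ wins the interleaved comparison: $\theta_b \leftarrow (1,0) + 0.1 \cdot \left((1,1) - (1,0)\right) = (1, 0.1)$
		\item If $\theta_b$ wins the comparison, the parameters stay at $(1, 0)$ and a new direction is sampled for the next query
	\end{itemize}
	\item Since $\theta_c - \theta_b = u$, the update is a small step of length $\eta$ into the winning direction, which acts as a noisy estimate of the gradient
\end{itemize}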

================================================
FILE: Information_Retrieval_1/ir_neural_models.tex
================================================
\section{Neural Retrieval Models}
\subsection{Distributed Word Representations}
\begin{itemize}
	\item Latent, dense vector representation to model semantic similarity/relations
	
\end{itemize}
\subsubsection{Skip-gram}
\begin{itemize}
	\item \textbf{Skip gram}: learn to predict neighboring words in a small context window
	\item Model probability by similarity between word and context vectors (two matrices):
	$$p(w_k|w_j) = \frac{\exp\left(c_k \cdot v_j\right)}{\sum_{i=1}^{|V|} \exp\left(c_i \cdot v_j\right)}$$
	\item Denominator can be computationally expensive if vocabulary is quite large. Thus, we can approximate it by taking just a few negative examples $\implies$ negative sampling
	\item Overall, skip-gram will learn two representations for each word (context $C$ and target words $V$), of which we most likely only use $V$ (visualized in Figure~\ref{img:neural_ir_skip_gram})
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/neural_ir_skip_gram.png}
		\caption{Visualization of skip gram method for learning word representations}
		\label{img:neural_ir_skip_gram}
	\end{figure}
	\item Skip-gram has been shown to capture relational meaning (\texttt{KING - MAN + WOMAN} $\approx$ \texttt{QUEEN})
\end{itemize}
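\begin{itemize}
	\item As a toy illustration of the softmax above, assume a vocabulary of three words with dot products $c_1 \cdot v_j = 2$, $c_2 \cdot v_j = 1$ and $c_3 \cdot v_j = 0$ (values chosen only for this example):
	$$p(w_1|w_j) = \frac{e^{2}}{e^{2} + e^{1} + e^{0}} \approx \frac{7.389}{11.107} \approx 0.665$$
	\item The denominator needs one dot product per vocabulary word. A common formulation of negative sampling replaces it by a binary classification objective with $K$ sampled negative words $n_1, ..., n_K$, where $\sigma$ is the sigmoid function ($K$ and the noise distribution are hyperparameters not specified in these notes):
	$$\log \sigma\left(c_k \cdot v_j\right) + \sum\limits_{m=1}^{K} \log \sigma\left(-c_{n_m} \cdot v_j\right)$$
	\item Thus, each training step only requires $K+1$ dot products instead of $|V|$
\end{itemize}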
\subsubsection{Using word embeddings in IR}
\begin{itemize}
	\item \textbf{Generalized Language Model}
	\begin{itemize}
		\item The standard language model assumes that a term $t_q$ occurring in $q$ is sampled from a document or a background collection (smoothing):
		$$p_{LM}(t_q|d) = \lambda \cdot p(t_q|d) + (1 - \lambda) \cdot p(t_q|C)$$
		\item The generalized language model extends this idea by also considering terms that are similar to $t_q$ (for example synonyms):
		$$p_{LM}(t_q|d) = \lambda \cdot p(t_q|d) + \alpha \sum\limits_{t'\in d} p(t_q|t',d) p(t'|d) + \beta \sum\limits_{t' \in N_t} p(t_q|t',C) p(t'|C) + (1 - \alpha - \beta - \lambda) \cdot p(t_q|C)$$
		where $$p(t_q|t',d) = \frac{sim(t',t_q)}{\sum_{t''\in d} sim(t',t'')} \text{\hspace{2mm}and\hspace{2mm}} p(t'|d) = \frac{tf(t';d)}{|d|}$$
		$N_t$ is the set of the most similar words to $t_q$.
	\end{itemize}
	\item \textbf{Word Mover's distance}
	\begin{itemize}
		\item For every word $w_i$ in the query $q$, look for the word with the highest similarity/smallest distance in document $d$
		\item Score a document by the sum of the pairwise distances. The document with the smallest distance gets the highest rank
		\item However, this approach doesn't care about the whole document but only the best matches
	\end{itemize}
\end{itemize}
\subsection{Compositionality}
\begin{itemize}
	\item To match queries and documents in the embedding space, we need to combine the words in each $\implies$ compositionality
\end{itemize}
\subsubsection{Aggregate word vectors}
\begin{itemize}
	\item Apply simple rules/arithmetic to combine word vectors
	\item Example: \textbf{Dual Embedding Space Model} 
	\begin{itemize}
		\item represent a document by the centroid of its word vectors $\bm{\overline{D}} = \frac{1}{|D|}\sum_{\bm{d}_j \in D} \frac{\bm{d}_j}{||\bm{d}_j||}$
		\item The query-document similarity is the average over query words of cosine similarity:\\ $\text{DESM}(Q,D) = \frac{1}{|Q|}\sum_{q_i \in Q} \frac{\bm{q}_i^T \bm{\overline{D}}}{||\bm{q}_i|| \cdot ||\bm{\overline{D}}||}$
		\item We can also use both the IN (word) and OUT (context) embeddings from skip-gram to optimize matching. What worked best was using IN representations for the query and OUT for document
		\item In the ranking system, we either first rank documents by BM25 and rerank top $N$ with DESM, or use a linear combination of both scores
	\end{itemize}
\end{itemize}
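\begin{itemize}
	\item A toy computation of the DESM score, with two-dimensional embeddings assumed purely for illustration:
	\begin{itemize}
		\item Document with word vectors $\bm{d}_1 = (1,0)$ and $\bm{d}_2 = (0,1)$ (both already unit length) $\implies$ centroid $\bm{\overline{D}} = (0.5, 0.5)$ with $||\bm{\overline{D}}|| = \sqrt{0.5} \approx 0.707$
		\item Query word $\bm{q}_1 = (1,0)$: cosine similarity $\frac{0.5}{1 \cdot 0.707} \approx 0.707$
		\item Query word $\bm{q}_2 = (1,1)$: cosine similarity $\frac{1}{\sqrt{2} \cdot \sqrt{0.5}} = 1$
		\item $\text{DESM}(Q,D) \approx \frac{1}{2}\left(0.707 + 1\right) \approx 0.854$
	\end{itemize}
\end{itemize}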
\subsubsection{Tune and Aggregate word vectors}
\begin{itemize}
	\item Learn task-specific representations and not rely on pure skip-gram
	\item \textbf{Paragraph2vec}
	\begin{itemize}
		\item Generalizes word2vec to whole documents by embedding them in a fixed-size vector
		\item Two different approaches. First is \textit{distributed memory}:
		\begin{itemize}
			\item We are trying to predict the next word based on a few previous context words \textit{and} a paragraph embedding.
			\item Both the word and paragraph embeddings are learned during this process
			\item Input embeddings can either be concatenated or averaged (commonly first one is applied)
			\item Visualization in Figure~\ref{img:neural_models_distributed_memory}
			\begin{figure}[ht]
				\centering
				\includegraphics[width=0.3\textwidth]{figures/neural_models_distributed_memory.png}
				\caption{Distributed memory model. }
				\label{img:neural_models_distributed_memory}
			\end{figure}
		\end{itemize}
		\item Second method: \textit{Distributed bag of words}
		\begin{itemize}
			\item In this approach, we don't consider the context words but try to predict all possible words in the paragraph given the embedding vector
			\item This is done by sampling a random word from a small text window at every SGD iteration, and training the classifier to predict this word
			\item Thus, we optimize the embedding regarding representing the word distribution in the paragraph
			\item The distributed BOW is visualized in Figure~\ref{img:neural_model_distributed_BOW}
			\begin{figure}[ht]
				\centering
				\includegraphics[width=0.3\textwidth]{figures/neural_model_distributed_BOW.png}
				\caption{Distributed BOW model. }
				\label{img:neural_model_distributed_BOW}
			\end{figure}
		\end{itemize}
	\end{itemize}
	\item \textbf{Lexicographical definition}
	\begin{itemize}
		\item We can also use the lexicographical definitions of words to train and/or test the word embeddings
		\item The word embeddings of the definition are combined by an (arithmetic) function $f_c$, and compared to the embedding of the word to be defined
		\item The objective is to minimize the distance to the defined word, but maximize the distance to other words to distinguish between words
		\item An example is shown in Figure~\ref{img:neural_model_lexicographical_definition}
		\begin{figure}[ht]
			\centering
			\includegraphics[width=0.3\textwidth]{figures/neural_model_lexicographical_definition.png}
			\caption{Lexicographical model for the example of "\textit{person}" defined as "\textit{a human being}". }
			\label{img:neural_model_lexicographical_definition}
		\end{figure}
	\end{itemize}
\end{itemize}
\subsubsection{Tune word vectors and learn rules of composition}
\begin{itemize}
	\item Deep, neural architectures which have different aspects to be designed
	\item \textbf{Architectures and representation}
	\begin{itemize}
		\item The simplest approach is to use neural networks to embed documents and queries into a latent space, and then apply the same similarity measures as before (like cosine similarity). This method is also referred to as \textit{Projection to latent space} (see Figure~\ref{img:neural_model_architecture_comparisons_projection_latent_space}). Possible network architectures are Convolutional NNs, Recurrent NNs or fixed Deep NNs if the (max) input size is known
		\item The next step is to replace the similarity measure by another neural network. Thus, this NN takes the composed embeddings of the document and query as input, and returns a single real value indicating the similarity score. This approach is called \textit{One Dimensional Matching} and is visualized in Figure~\ref{img:neural_model_architecture_comparisons_one_dim_matching}.
		\item Another architecture spans a two-dimensional input from the query and the document. For this, we compute similarity scores between every word in the query and every word in the document, which results in a two-dimensional matrix. On this matrix, we apply a convolutional NN to end up with a fixed-size embedding. A subsequent fully-connected NN maps this embedding to a similarity score. Figure~\ref{img:neural_model_architecture_comparisons_two_dim_matching} visualizes this architecture, called \textit{Two Dimensional Matching}.
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\begin{subfigure}[b]{0.3\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/neural_model_latent_space.png}
			\caption{Projection to Latent space}
			\label{img:neural_model_architecture_comparisons_projection_latent_space}
		\end{subfigure}
		\hspace{2mm}
		\begin{subfigure}[b]{0.3\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/neural_models_one_dimensional_matching.png}
			\caption{One dimensional matching}
			\label{img:neural_model_architecture_comparisons_one_dim_matching}
		\end{subfigure}
		\hspace{2mm}
		\begin{subfigure}[b]{0.3\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/neural_models_two_dim_matching.png}
			\caption{Two dimensional matching}
			\label{img:neural_model_architecture_comparisons_two_dim_matching}
		\end{subfigure}
		\caption{Comparison of different neural architectures for compositionality}
		\label{img:neural_model_architecture_comparisons}
	\end{figure}
	\item \textbf{Training}: depending on the available data, we can perform different levels of supervision
	\begin{itemize}
		\item \textit{No supervision/labels}: If no labels are provided at all, we could use autoencoders to reduce query and document to a latent space and check similarity metrics. Otherwise, we can make use of pretrained neural language models with techniques like ELMo and BERT
		\item \textit{Distant supervision}: We create pseudo-labels by sampling short word sequences from a document and considering this as query. The document from which we sampled is labeled as relevant/high similarity, while all other documents (from which we sample one randomly for training) are considered as being not relevant
		\item \textit{Weak supervision}: As an alternative, we can use unsupervised ranking functions like BM25 to generate labels and use these scores for training (teacher-student architecture). In experiments, the neural network was even able to outperform BM25.
		\item \textit{Full supervision}: Labels are created by either annotators or using the click log as implicit feedback from the users. We can train the models in a standard supervised fashion. 
	\end{itemize}
\end{itemize}

================================================
FILE: Information_Retrieval_1/ir_offline_evaluation.tex
================================================
\section{Offline evaluation}
\begin{itemize}
	\item Evaluating an IR system without any interaction with user 
	\item Assumption: assessors can tell what is relevant
\end{itemize}
\subsection{Collection-based evaluation}
\begin{itemize}
	\item Approximating user happiness by relevance of the found documents
	\item There are different measures to do that
\end{itemize}
\subsubsection{Traditional Evaluation measures}
\begin{itemize}
	\item We can view IR as a (binary) classification problem where every document is either relevant or not with respect to a query
	\item The evaluation is performed by calculating precision ($\frac{TP}{TP+FP}$) and recall ($\frac{TP}{TP+FN}$)
	\item However, the output of an IR system is a ranking and not a binary classification. Thus, we label the first $k$ documents the system proposes as relevant, and all others as non-relevant $\implies$ precision/recall $@$ cut-off ($P@k$/$R@k$)
	\item The trade-off between precision and recall is task specific. For web searches, we want a high precision on the first few documents, but allow a worse recall (as a user doesn't want to find \textit{all} relevant documents). In contrast, in medicine, we want to have a high recall to not miss an important document
	\item Another way to incorporate precision and recall for ranking is using R/P curves. We start with a cut-off point of 0 where precision is 1 and recall 0. Then we increase the cut-off point one by one and record the new values in the curve (left, Figure~\ref{img:offline_eval_RP_curves}). As the cut-off rank heads to infinity, recall goes to 1 and precision to 0. We interpolate the curves by assigning each point the maximum precision over all points at equal or higher recall (i.e. to its right).
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/offline_eval_RP_curves.png}
		\caption{R/P curves for ranking}
		\label{img:offline_eval_RP_curves}
	\end{figure}
	\item When having multiple queries, we would average the RP curves. 
	\item The area under the curve is the average precision which can also be calculated by taking the average of all precision values at ranks with relevant documents.
	\item Usually, a binary scale of whether a document is relevant is not sufficient. For a graded relevance scale, we can use different evaluation measures
	\begin{itemize}
		\item \textbf{Discounted Cumulative Gain (DCG)} - considers the relevance grade and position of every document. The total gain is accumulated at a certain rank $k$:
		$$DCG@k = \sum\limits_{\text{rank } r=1}^{k} \frac{2^{\text{rel}_r} - 1}{\log_2\left(1 + r\right)}$$
		\item The numerator is the non-linear relevance score of the document at rank $r$, and the denominator the discount over ranking position
		\item The score highly depends on the best possible ranking for a query. Thus, the DCG can be normalized by the value of the best ranking $\implies$ $0\leq nDCG \leq 1$. This makes it easier to compare scores over different queries
	\end{itemize}
\end{itemize}
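\begin{itemize}
	\item The measures above can be traced on a small ranking. The relevance labels are made up for illustration only.
	\item Binary labels $(1, 0, 1, 0)$ for the top 4 documents, with 2 relevant documents in the whole collection:
	\begin{itemize}
		\item $P@1 = 1$, $P@2 = \frac{1}{2}$, $P@3 = \frac{2}{3}$, $P@4 = \frac{1}{2}$ and $R@1 = \frac{1}{2}$, $R@3 = 1$
		\item Average precision: mean of the precision values at the ranks of the relevant documents, $AP = \frac{1}{2}\left(1 + \frac{2}{3}\right) \approx 0.833$
	\end{itemize}
	\item Graded labels $\text{rel} = (3, 2, 0, 1)$:
	\begin{itemize}
		\item $DCG@4 = \frac{7}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} + \frac{1}{\log_2 5} \approx 7 + 1.893 + 0 + 0.431 = 9.324$
		\item Ideal ranking $(3, 2, 1, 0)$: $DCG@4 \approx 7 + 1.893 + 0.5 + 0 = 9.393$
		\item $nDCG@4 \approx \frac{9.324}{9.393} \approx 0.993$
	\end{itemize}
	\item Swapping the two lowest documents only costs little, as both the gain and the discount are small at the bottom of the ranking
\end{itemize}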
\subsubsection{Model-based Evaluation measures}
\begin{itemize}
	\item Another perspective on evaluation is looking at the different aspects of possible metrics. A model-based approach considers the following three components:
	\begin{enumerate}
		\item \textbf{Browsing model} - describes how the user interacts with results, like the probability of a document being clicked/viewed $\Rightarrow p(d)$
		\item \textbf{Model of document utility} - describes how a user derives utility from individual relevant documents. Similar to how to determine the graded relevance scale $\Rightarrow g(d)$.
		\item \textbf{Utility accumulation model} - describes how a user accumulates utility in the course of browsing $\Rightarrow E\left[g(D)\right] = \sum_{r=1}^{\infty}g(d_r) \cdot p(d_r)$ where $d_r$ is the document at rank $r$
	\end{enumerate}
	\item Examples for the browsing models
	\begin{itemize}
		\item \textit{Position-based models} - the chance of observing a document depends on the position in the ranking. We can for example model it by $\Rightarrow p(d_r)=(1-\theta)^{r-1} \theta$. The corresponding utility accumulation is described by Rank-biased Precision (RBP): $RBP = \sum_{r=1}^{\infty}\text{rel}_r (1-\theta)^{r-1} \theta$
		\item \textit{Cascade-based models} - considers $\theta$ as a function of the document at rank $r$. Mostly, the following function is used: $\theta_r = \mathcal{R}(\text{rel}_r) = \frac{2^{\text{rel}_r}-1}{2^{\text{max rel}}}$. The corresponding utility accumulation is the \textit{Expected Reciprocal Rank}: $ERR@k = \sum\limits_{r=1}^{k} \frac{1}{r} \cdot \theta_r \cdot \prod\limits_{i=1}^{r-1}\left(1 - \theta_i\right)$
	\end{itemize}
\end{itemize}
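\begin{itemize}
	\item Both browsing models can be traced on a ranking of three documents. The labels and $\theta = 0.5$ are assumptions for illustration, and all documents below rank 3 are taken to be non-relevant.
	\item RBP with binary labels $\text{rel} = (1, 0, 1)$ and $\theta = 0.5$:
	$$RBP = 1 \cdot 0.5 + 0 \cdot 0.5 \cdot 0.5 + 1 \cdot 0.5^2 \cdot 0.5 = 0.5 + 0 + 0.125 = 0.625$$
	\item ERR with graded labels $\text{rel} = (2, 0, 1)$ and a maximum grade of 2, so that $\theta_1 = \frac{3}{4}$, $\theta_2 = 0$ and $\theta_3 = \frac{1}{4}$:
	$$ERR@3 = \frac{1}{1} \cdot \frac{3}{4} + \frac{1}{2} \cdot 0 \cdot \frac{1}{4} + \frac{1}{3} \cdot \frac{1}{4} \cdot \left(\frac{1}{4} \cdot 1\right) \approx 0.75 + 0 + 0.021 = 0.771$$
	\item In the cascade model, the highly relevant document at rank 1 satisfies the user with probability $\frac{3}{4}$, so the relevant document at rank 3 contributes only little
\end{itemize}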
\subsubsection{Collection construction}
\begin{itemize}
	\item To evaluate a system offline, we need labels of whether a document is relevant with respect to a query (or graded scale) $\Rightarrow$ labels are created by humans
	\item First step is to generate a huge document collection, and generate a set of topics/queries that should be evaluated. Mostly, queries are selected from very frequent, common and rare query bin sets of highly-used search engines 
	\item To be able to calculate measures like recall, we need to find all relevant documents in the collection. Can be done either deterministically or stochastically
	\begin{itemize}
		\item \textbf{Depth-k pooling} - deterministic, standard method. Apply $M$ IR systems and take the union of the $k$ top results of all $M$ systems. This set of documents is labeled by humans, and all others are considered as not relevant. Note that we need the $M$ systems to be different/take another perspective on the data so that they don't find all the same documents. Otherwise, future IR algorithms can find relevant documents that the others haven't found yet and will be punished for that. $k$ is task specific, but a value of $100$ has shown to be sufficient
		\item \textbf{Random Sampling} - stochastic method. The simplest approach is, for a query $q$, to sample a small set of documents from the whole corpus and label those. All other documents are considered unlabeled and are thus neglected in the evaluation. Problem: significant sparsity of relevant documents in the corpus.
	\end{itemize}
\end{itemize}
\subsection{Challenges of offline evaluation}
\label{sec:offline_eval_problems}
\begin{itemize}
	\item Expensive and slow to collect new data
	\item Ambiguous queries are particularly hard to judge realistically (what intent is most popular?). Particularly hard for personalized searches
	\item Judges need to correctly appreciate uncertainty/allow different intents
	\item How to identify when relevance changes (temporal, query intent changes, ...)?
\end{itemize}
\subsection{Comparative evaluation}
\begin{itemize}
	\item How do we compare different retrieval systems? Is the difference only due to random noise? $\implies$ Statistical significance test
	\item Two hypotheses where we want to prove that $H_0$ is wrong: $$H_0: \text{MAP}_E - \text{MAP}_P = 0, \hspace{4mm} H_1: \underbrace{\text{MAP}_E - \text{MAP}_P \neq 0}_{\text{two-sided}} \text{\hspace{2mm}or\hspace{2mm}}\underbrace{\text{MAP}_E - \text{MAP}_P > 0}_{\text{one-sided}}$$
	\item Compute the $p$-value, which describes the probability of observing a result at least as extreme as the test data given that $H_0$ is valid (a low $p$-value is evidence against the null hypothesis)
\end{itemize}
\subsubsection{Student's t-test}
\begin{itemize}
	\item Statistic: $$t=\frac{\mu_{E-P}}{\frac{\sigma_{E-P}}{\sqrt{N}}} = \frac{\overline{AP_E - AP_P}}{\frac{\sigma_{E-P}}{\sqrt{N}}}$$
	\item We assume that the mean measure follows a normal distribution (see Figure~\ref{img:hypothesis_testing_t_test_t_dist})
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/hypothesis_testing_t_test_t_dist.png}
		\caption{Distribution of $t$ values under the null hypothesis}
		\label{img:hypothesis_testing_t_test_t_dist}
	\end{figure}
	\item The $p$-value is determined by the area under the distribution to the right of the computed $t$ value (one-sided test)
	\item If the $p$-value is lower than the significance level $\alpha$, reject null hypothesis
	\item There are two different error types for the t-test: 
	\begin{description}
		\item[Type 1] rejecting null hypothesis although it was true (prob. is $\alpha$)
		\item[Type 2] not rejecting the null hypothesis although it was false (prob. is $\beta$)
	\end{description}
	\item There are four aspects of the test that interact with each other. If one is unknown, it can be derived from the others
	\begin{enumerate}
		\item \textit{Sample size} $N$
		\item \textit{Effect size} = diff. of means / std. dev.
		\item \textit{Significance level} = Type 1 error $\alpha$
		\item \textit{Power} = 1 - Type 2 error $\beta$. Prob. of finding an effect if it is there. 
	\end{enumerate}
\end{itemize}
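\begin{itemize}
	\item A small numeric sketch of the paired t-test. The per-query differences are made up, and we assume that $\sigma_{E-P}$ is the sample standard deviation (normalized by $N-1$):
	\begin{itemize}
		\item Differences $AP_E - AP_P$ for $N=4$ queries: $0.1, 0.2, 0.0, 0.1 \implies \overline{AP_E - AP_P} = 0.1$
		\item $\sigma_{E-P} = \sqrt{\frac{0^2 + 0.1^2 + (-0.1)^2 + 0^2}{3}} \approx 0.0816$
		\item $t = \frac{0.1}{0.0816 / \sqrt{4}} \approx 2.45$
	\end{itemize}
	\item With $N - 1 = 3$ degrees of freedom, the critical value for $\alpha = 0.05$ is about $2.35$ for the one-sided and about $3.18$ for the two-sided test. Thus, we would reject $H_0$ in the one-sided test, but not in the two-sided one
\end{itemize}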
\subsubsection{Sign test}
\begin{itemize}
	\item Look at score/sample pairs from $A$ and $B$ and consider the null hypothesis $H_0: P(B>A)=P(A>B)=1/2$.
	\item It is a discrete way of looking at the t-test. For a sample size of $N$, the number of times $B$ wins over $A$ follows a binomial distribution under $H_0$. The $p$-value is the summed probability of all bins that are at least as extreme as the measured number of wins (see Figure~\ref{img:hypothesis_testing_sign_test})
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/hypothesis_testing_sign_test.png}
		\caption{Binomial distribution of the number of wins under the null hypothesis}
		\label{img:hypothesis_testing_sign_test}
	\end{figure}
\end{itemize}
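\begin{itemize}
	\item As a numeric illustration (numbers assumed for this example, ties ignored): for $N=10$ queries where $B$ wins on 8, the one-sided $p$-value is the probability of 8 or more wins under $H_0$:
	$$p = \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = \frac{45 + 10 + 1}{1024} \approx 0.055$$
	\item Hence, even 8 wins out of 10 are not significant at $\alpha = 0.05$, since the sign test ignores how large the individual differences are
\end{itemize}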
\subsubsection{Distribution-free tests}
\begin{itemize}
	\item Tests where we do not explicitly assume the underlying data to be sampled from a specific distribution
	\item \textbf{Randomization test}
	\begin{itemize}
		\item Given: a set of results for $N$ queries for algorithm $A$ and $B$
		\item Repeat for many times:
		\begin{itemize}
			\item Randomly swap values for a query in algorithm $A$ and $B$
			\item Compute average of both systems and their difference
			\item Add difference to an array
		\end{itemize} 
		\item The two systems are significantly different, if the actual difference without swapping is outside 95\% of the differences in the array.
	\end{itemize}
	\item \textbf{Bootstrap test}
	\begin{itemize}
		\item Same preparation as for randomization test
		\item Repeat for many times:
		\begin{itemize}
			\item Randomly sample pair of scores (i.e. selecting queries) of $A$ and $B$ with replacement
			\item Compute the average of each system over the sampled set of pairs
			\item Add difference to an array
		\end{itemize}
		\item The two systems are significantly different if the mean of the array can be shown to be significantly different from 0.
	\end{itemize}
	
\end{itemize}
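\begin{itemize}
	\item \textit{Sketch of the randomization test decision as a formula} (the notation $M$, $d^{(m)}$ and $d_{\text{obs}}$ is introduced here for illustration and is not from the lecture): let $d_{\text{obs}}$ be the actual difference of the averages without swapping, and $d^{(1)},...,d^{(M)}$ the differences collected in the array over $M$ repetitions. A two-sided $p$-value can then be estimated by
	$$p \approx \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\left[\left|d^{(m)}\right| \geq \left|d_{\text{obs}}\right|\right]$$
	Rejecting the null hypothesis for $p < 0.05$ corresponds to the criterion that the actual difference lies outside 95\% of the differences in the array.
\end{itemize}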

================================================
FILE: Information_Retrieval_1/ir_online_evaluation.tex
================================================
\section{Online evaluation}
\begin{itemize}
	\item In online evaluation, the system interacts with the user $\implies$ user "tells" what is relevant, system analyzes the user's behavior for gaining that knowledge
	\item The benefit of online evaluations is that they are mostly simpler and directly incorporate measuring the ranking quality
	\item However, the downside is that the results are harder to explain/interpret (why did users click less, different queries might rely on different metrics, ...). Also, evaluations might not be comparable over time, so we need to ensure the same conditions/user population for both systems.
\end{itemize}
\subsection{Analyzing user behavior}
\begin{itemize}
	\item A user provides various signals from which we can try to infer their "happiness" with the results. The following ones are mostly used:
	\item \textit{Clicks} - clicks are mostly noisy so that a click doesn't ensure that the document was actually relevant. Clicks have several biases:
	\begin{itemize}
		\item \textit{Position bias} - a user tends towards clicking higher ranked results
		\item \textit{Contextual bias} - nearby results affect the click probability of a document
		\item \textit{Attention bias} - some results draw more attention to themselves by the usage of images, font size, ...
	\end{itemize}
	\item \textit{Time} - the time a user spends on a certain query before coming back to the search engine
	\begin{itemize}
		\item \textit{Dwell time} - time spent on a clicked page. If the duration is more than 30 seconds, we assume that the click satisfied the user
		\item \textit{Exit type} - how the user exits the page (closing browser, continue scrolling through results, putting in new query, ...) 
	\end{itemize}
	\item \textit{Mouse movement} - time on website is not sufficient. Mouse movement can indicate whether user is actually reading or only scrolling/scanning
	\item \textit{Reformulations} - if a new query is entered, check for similarity with the previous one. Reformulated/similar queries that were entered quickly after the first one indicate that the user was not satisfied with the previous results.
\end{itemize}
\subsection{A/B Testing}
\begin{itemize}
	\item When testing two systems in an online experiment, we need to make sure that both have the same preconditions so that the system improvements clearly correlate with the new click/sale numbers
	\item In A/B Testing, users are split into two groups where each group is assigned to one of the algorithms. We analyze the users' behavior on both systems and calculate a metric based on that. Comparing the results for both systems on significance leads to a final decision.
	\item \textit{Challenges} in A/B Testing
	\begin{itemize}
		\item If one system is very different and probably bad, it will affect the \textit{user experience} and damage the website's image $\implies$ perform offline evaluation in advance to avoid testing a very bad system
		\item It is hard to define \textit{metrics} as they can contradict each other. For example, if we report the number of clicks and sessions per user, a click increase can indicate better/more relevant results. However, if another system provides snippets that already contain the information, the user will click less.
		\item The metric should be as sensitive as possible. \textit{Sensitivity} is the ability of the metric to detect the statistically significant difference when the treatment effect exists $\implies$ how many queries/days/users/... for significance needed?
	\end{itemize}

	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/online_eval_AB_testing.png}
		\caption{Visualization of A/B Testing}
		\label{img:online_eval_AB_testing}
	\end{figure}
\end{itemize}
\subsection{Interleaving}
\begin{itemize}
	\item A/B testing introduces a high variance by letting different users evaluate different systems $\implies$ Show interleaved results from both algorithms A and B without telling the user which document is from which model
	\item The evaluation is based on the clicks of a user: the algorithm that provided the clicked document gets the credit
\end{itemize}
\subsubsection{Balanced interleaving}
\begin{itemize}
	\item In balanced interleaving, we select randomly which algorithm starts (A or B). If A would start, we take the first document of A and place it in our interleaved ranking list. Then we pick the first document of B and continue with A again
	\item If a document is already in the interleaved ranking, we skip this document and continue with picking the next document from the \textit{other} ranking model 
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/online_eval_balanced_interleaving.png}
		\caption{Formal algorithm describing balanced interleaving}
		\label{img:online_eval_balanced_interleaving}
	\end{figure}
	\item Problem: balanced interleaving breaks in corner cases. Assume the following rankings:
	$$A: \left\{d_1, d_2, d_3, d_4\right\}, B: \left\{d_2, d_3, d_4, d_1\right\}$$
	No matter whether we start at model A or B, the interleaved list contains three documents assigned to B and only one to A. Thus, random clicking would lead to B winning $\implies$ bias. Resolved by team-draft interleaving
\end{itemize}
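\begin{itemize}
	\item \textit{Trace of the corner case} (following the simplified picking rule described above, where a superscript marks the model that gets the credit for a document):
	\begin{itemize}
		\item Starting with A: A picks $d_1$, B picks $d_2$. A's next document $d_2$ is already in the list and is skipped, so B picks $d_3$. A's next candidate $d_3$ is again already in the list, so B picks $d_4$. Result: $\left(d_1^A, d_2^B, d_3^B, d_4^B\right)$
		\item Starting with B: B picks $d_2$, A picks $d_1$, B picks $d_3$. A's next document $d_2$ is already in the list and is skipped, so B picks $d_4$. Result: $\left(d_2^B, d_1^A, d_3^B, d_4^B\right)$
	\end{itemize}
	\item In both cases, a user clicking uniformly at random on one of the four documents credits B with probability $\frac{3}{4}$ and A with probability $\frac{1}{4}$, although no real preference was expressed.
\end{itemize}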
\subsubsection{Team-draft interleaving}
\begin{itemize}
	\item In team-draft interleaving we guarantee that both algorithms contribute equally to the interleaved ranking
	\item At each stage, we flip a coin to determine whether to pick the next document from A or B first. Afterwards, the document of the other system is picked
	\item If a document is already in the interleaved ranking, we look for the next document from the \textit{same} ranking model until we find a new document.
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/online_eval_team_draft_interleaving.png}
		\caption{Formal algorithm describing Team-draft interleaving}
		\label{img:online_eval_team_draft_interleaving}
	\end{figure}
	\item There are also corner cases that can cause troubles in team-draft interleaving. However, in practice this rarely happens/has a significant effect. 
\end{itemize}
\subsubsection{Probabilistic interleaving}
\begin{itemize}
	\item To avoid biases completely, we can apply probabilistic models
	\item Convert the ranking of each model to a probability distribution by applying softmax ($\tau = 3$):
	$$p_i(d) = \frac{\frac{1}{r_i(d)^{\tau}}}{\sum_{d' \in D} \frac{1}{r_i(d')^{\tau}}}$$
	\item For every position in the interleaved ranking, flip a coin to determine whether to pick a document from A or B. Next, we sample from the corresponding softmax distribution a document without replacement, and add it to the interleaved list. The picked document is removed from the probability distributions of A and B. 
	\item We can perform evaluation by counting the clicks for documents sampled from A and B. We expect the same number of clicks for documents at the same position of both algorithms due to the same probability in the softmax. % both A and B based on the softmax, and compare which one has the higher probability sum of all clicked documents. Thus, for documents at the same rank, we expect a tie.
	\item Note that compared to a hard assignment of 0 or 1 in balanced and team draft interleaving, the distribution of credit accumulated for clicks is smoothed based on the relative rank of the document in the original result lists (a click on any document leads to a non-zero credit for both rankings)
	\item The algorithm is summarized in Figure~\ref{img:online_eval_probabilistic_interleaving}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/online_eval_probabilistic_interleaving.png}
		\caption{Formal algorithm describing probabilistic interleaving}
		\label{img:online_eval_probabilistic_interleaving}
	\end{figure}
	\item Another, more efficient evaluation method is by marginalizing over all possible assignments $a$. Therefore, we calculate the probability of the interleaved list $l$ given $a$ (and the query $q$) by successively multiplying the softmax probabilities at that point. For example, the first assignment $a=\left\{1,1,1,1\right\}$ leads to the following calculation:
	$$p(l_i|a=\left\{1,1,1,1\right\},q) = 0.85 \cdot \frac{0.1}{0.15} \cdot \frac{0.03}{0.05} \cdot \frac{0.02}{0.02} = 0.34$$
	\item Normalizing all $p(l_i|a,q)$ by their sum leads to $p(a|l_i,q)$ $\implies$ $p(a|l_i,q) = \frac{p(l_i|a,q)}{\sum_{a' \in A} p(l_i|a',q)}$
	\item For every assignment, we weight an outcome $o$ by the probability $p(a|l_i,q)$: we use $o=-1$ if more clicked documents were assigned to $A$, $o=1$ if $B$ has more clicks, and $o=0$ for a tie. Thus, the expected outcome (positive if $B$ wins more often than $A$, negative otherwise) is given by:
	$$E[O] = \sum_{a \in A} o_a \cdot p(a|l_i,q) \text{\hspace{2mm}where\hspace{2mm}} o_a = \begin{cases}
	 -1 & \text{if } c_A > c_B\\
	 0 & \text{if } c_A = c_B\\
	 1 & \text{if } c_A < c_B
	\end{cases}$$
	\item Figure~\ref{img:online_eval_probabilistic_interleaving_2} visualizes an example for probabilistic interleaving
	\begin{figure}[ht]
		\centering
		\includegraphics[width=\textwidth]{figures/online_eval_probabilistic_interleaving_2.png}
		\caption{Visualization of probabilistic interleaving}
		\label{img:online_eval_probabilistic_interleaving_2}
	\end{figure}
\end{itemize}
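\begin{itemize}
	\item \textit{Worked example of the rank-based softmax} (a sketch for a hypothetical ranking of four documents with $\tau = 3$): the unnormalized weights for ranks $1$ to $4$ are $\frac{1}{1^3} = 1$, $\frac{1}{2^3} = 0.125$, $\frac{1}{3^3} \approx 0.037$ and $\frac{1}{4^3} \approx 0.016$, which sum up to $\approx 1.178$. Hence:
	$$p_i(d_1) \approx 0.849, \hspace{3mm} p_i(d_2) \approx 0.106, \hspace{3mm} p_i(d_3) \approx 0.031, \hspace{3mm} p_i(d_4) \approx 0.013$$
	\item These values roughly correspond to the rounded probabilities $0.85$, $0.1$, $0.03$ and $0.02$ used in the marginalization example. Sampling without replacement means renormalizing over the remaining documents: after $d_1$ has been picked, the probability of $d_2$ becomes $\frac{0.125}{0.125 + 0.037 + 0.016} \approx 0.70$ (or $\frac{0.1}{0.15} \approx 0.67$ with the rounded values).
\end{itemize}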

================================================
FILE: Information_Retrieval_1/ir_semantic_matching.tex
================================================
\section{Semantic matching}
\begin{itemize}
	\item \textit{Vocabulary gap}: query and document might use different lexical representation for the same entity $\implies$ resolve by semantic matching
	\item Represent query and document by their meaning, not lexical/word level
	\item This will help to identify synonyms and/or semantic relatedness for computing similarity
	\item To get such a representation, one idea is to apply dimensionality reduction. This relies on the assumption that the intrinsic dimension of the data is actually lower than e.g. the vocabulary size
	\item Data that is similar in terms of semantics will (hopefully) be similar in the reduced dimensions as well
\end{itemize}
\subsection{Latent Semantic Indexing}
\begin{itemize}
	\item We represent all documents and queries in a term-document matrix:
	$$X = \left[\begin{array}{cccc}
	| & | &  & | \\
	x_1 & x_2 & \dots & x_m \\
	| & | &  & | 
	\end{array}\right]$$
	where the $m$ columns represent the documents, and the $n$ rows the terms. A single cell in the matrix $X$ specifies the term frequency $\text{tf}(w;d)$. Note that we would also add the queries in $X$ as documents.
	\item On this matrix, we apply Singular Value Decomposition (SVD) so that $X_{n\times m} = U_{n \times r} \Sigma_{r\times r} V_{m\times r}^T$
	\begin{itemize}
		\item $\bm{U}_{n \times r}$ represents the word/term embedding along rows (one word per row). The columns are the "semantic" dimensions that e.g. represent topics/hidden lower-dimensional space. 
		\item $\bm{V}_{m \times r}$ similarly represents the embedding of documents (one doc per row, but is transposed in calculation). 
		\item $\bm{\Sigma}_{r \times r}$ is a square, diagonal matrix. The magnitude of a singular value represents the importance of the corresponding latent dimension in the collection/data. It is always sorted from the highest value in first place to lowest value in last place.
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/semantic_matching_SVD_example.png}
		\caption{Example of SVD for 6 documents and 5 terms}
		\label{img:semantic_matching_SVD_example}
	\end{figure}
	\item To reduce the dimensions to $k$, we simply drop those with the lowest values in $\Sigma$. These dimensions may be noise and make documents appear dissimilar although they are similar on the topic level. $k$ is a hyperparameter.
	\item In case of Figure~\ref{img:semantic_matching_SVD_example} with $k=2$, we would drop the last three dimensions. This leads to our new embeddings $X'$ for the documents. The resulting matrices would look as in Figure~\ref{img:semantic_matching_SVD_example_2}.
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/semantic_matching_SVD_example_2.png}
		\caption{Reduced dimensions by $k=2$ for 6 documents and 5 terms}
		\label{img:semantic_matching_SVD_example_2}
	\end{figure}
	\item The similarity between documents/queries can be calculated by cosine similarity (dot product) between their new embeddings in $X'$. Furthermore, we can also compute the similarity between terms by using the rows.
	\item Choice of $k$:
	\begin{itemize}
		\item The choice of $k$ is critical in IR. The ideal value of $k$ would be large enough to fit all the real structure in the data, but small enough to compress/group terms together that are very similar (less noise).
		\item Typically, different values of $k$ are tested and compared by their performance. For example, a high precision but low recall suggests a poor generalization of the model. Therefore, we should decrease $k$ in this case.
	\end{itemize}
	\item LSI addresses synonymy by mapping similar words to the same dimensions. The cost of such a mapping is lower than for unrelated words as they occur in similar/the same documents. 
	\item \textbf{Strengths} of LSI
	\begin{itemize}
		\item Using $X'$ instead of $X$ shows a performance increase as we filter out the noise
		\item $X'$ represents the best approximation of $X$ with a matrix of rank $k$: $X' = \argmin\limits_{X':\text{rank}(X')=k} ||X-X'||$
		\item Is mostly combined with lexical methods like BM25 to not lose "obvious" matches
	\end{itemize}
	\item \textbf{Weaknesses} of LSI
	\begin{itemize}
		\item A lot of storage is required as the matrices $U$ and $V$ are dense (few zeros)
		\item Representations are not interpretable, and it is not guaranteed that hidden dimensions represent topics
		\item $k$ is often not easy to determine and requires multiple tests
		\item SVD assumes orthogonal dimensions on which the variance is maximum which is not always the case
		\item The model is not generative or probabilistic, which makes it hard to extend collection by new documents/queries (worst case: redo whole SVD)
	\end{itemize}
	\item One alternative is Non-negative Matrix Factorization which leads to smaller, positive matrices (but doesn't solve the other problems)
\end{itemize}
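\begin{itemize}
	\item \textit{Sketch of the truncation in formulas} (the subscript $k$ and the symbols $q$ and $\hat{q}$ are notation introduced here, not taken from the lecture): keeping only the $k$ largest singular values gives
	$$X' = U_k \Sigma_k V_k^T \hspace{3mm}\text{with}\hspace{3mm} U_k \in \mathbb{R}^{n \times k}, \hspace{2mm} \Sigma_k \in \mathbb{R}^{k \times k}, \hspace{2mm} V_k \in \mathbb{R}^{m \times k}$$
	\item A common approximation for a query $q \in \mathbb{R}^{n}$ that was not part of $X$ is to "fold it in" without recomputing the SVD, i.e. to project its term vector into the latent space by $\hat{q} = \Sigma_k^{-1} U_k^T q$, and to compare $\hat{q}$ with the rows of $V_k$ by cosine similarity. This is only a heuristic: the latent dimensions themselves are not updated, so the worst case of redoing the whole SVD remains when many new documents are added.
\end{itemize}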
\subsection{Probabilistic Latent Semantic Indexing}
\begin{itemize}
	\item (Pseudo-)generative model with which we try to detect key topics in the collection in an unsupervised fashion. The model describes how we would generate docs for certain topics
	\item Every topic has its own language model/distribution over words in vocabulary in a unigram/bag-of-word style: $p(w|z=1), p(w|z=2),...$ where $z$ is the variable representing the topics. 
	\item A document is represented by the distribution of these topics in the document. Generating a document would require to first sample a topic based on the topic distribution of the document, and then generating a word based on the language model of the topic. The order of the words is not taken into account. An example is shown in Figure~\ref{img:semantic_matching_PLSI_example}.
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/semantic_matching_PLSI_example.png}
		\caption{Example of document representation in PLSI}
		\label{img:semantic_matching_PLSI_example}
	\end{figure}
	\item For semantic matching, we calculate the cosine similarity between the topic distributions of the query and the document.
	\item PLSA can have problems with stop words (common words that occur frequently in every document, like "the" or "and"). To prevent that, we collect all stop words in a separate background topic model which is shared equally by all docs. Before generating a word, we toss a biased coin to decide whether to draw the word from the background or the standard topic model. Note that this effectively masks out stop words during matching
\end{itemize}
\subsubsection{Retrieving distributions}
\begin{itemize}
	\item Input: collection of $N$ documents, number of topics $K$
	\item Output:
	\begin{itemize}
		\item Distributions over words $\phi_{(z,w)} = p(w|z)$ for $z\in \left\{1,...,K\right\}$ with $\sum_{w\in V} p(w|z) = 1$
		\item Distributions over topics in all documents: $\theta_{d,z} = p(z|d)$ for every $d$, with $\sum_{z=1}^{K} p(z|d) = 1$
	\end{itemize} 
	\item We try to solve the problem by MLE. The probability of $w$ appearing at position $i$ in the document $d$ is:
	$$p(d_i = w | \Phi, \theta_d) = \sum\limits_{z=1}^{K} \phi_{(z,w)} \theta_{d,z}$$
	The joint likelihood of the entire dataset is:
	$$p(W|\Phi, \Theta) = \prod\limits_{d\in D}\prod\limits_{w\in V}\left(\sum\limits_{z=1}^{K} \phi_{(z,w)} \theta_{d,z}\right)^{\text{tf}(w;d)} $$
	\item Taking the two constraints into account, we get an optimization problem which we can solve by the EM algorithm. For that, we assume that we know from which topic a word was generated at position $i$ in the document by $R_{d_i}$:
	$$p(W|R,\Phi,\Theta) = \prod\limits_{d\in D}\prod\limits_{i=1}^{|d|}\sum\limits_{z=1}^{K} R_{(d_i,z)} \left(\phi_{(z,w_i)} \theta_{d,z}\right)$$
	\item For the EM algorithm, we would update $R_{d_i}$ during the expectation step by $R_{(d_i,z)} = \frac{\phi_{(z,w_i)}\theta_{(d,z)}}{\sum_{z'=1}^{K} \phi_{(z',w_i)}\theta_{(d,z')}}$. The maximization step consists of updating $\Theta$ and $\Phi$: $\theta_{(d,z)} = \frac{\sum_{d_i} R_{(d_i,z)}}{|d|}$, $\phi_{(z,w)} = \frac{\sum_{d \in D} n(d,w) R_{(w,z)}}{\sum_{w' \in V} \sum_{d \in D} n(d,w')R_{(w',z)}}$
	\item PLSA is able to learn topics with their corresponding word distributions. However, there are still some drawbacks:
	\begin{itemize}
		\item It is still not a fully generative model. After running PLSA, we have topic distributions for the documents we initially had, but we cannot extend them to new documents (or only with difficulty, using heuristics)
		\item Prone to overfitting
	\end{itemize}
\end{itemize}
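\begin{itemize}
	\item \textit{Worked example of a single expectation step} (illustrative numbers for $K=2$ topics, not from the lecture): assume that for the word $w$ at position $i$ of document $d$ we currently have $\phi_{(1,w)} = 0.4$, $\phi_{(2,w)} = 0.1$ and $\theta_{(d,1)} = \theta_{(d,2)} = 0.5$. The probability of observing $w$ is then $0.4 \cdot 0.5 + 0.1 \cdot 0.5 = 0.25$, and the responsibilities are
	$$R_{(d_i,1)} = \frac{0.4 \cdot 0.5}{0.25} = 0.8, \hspace{5mm} R_{(d_i,2)} = \frac{0.1 \cdot 0.5}{0.25} = 0.2$$
	\item In the maximization step, averaging such responsibilities over all positions of $d$ gives the new $\theta_{(d,z)}$. If, hypothetically, every position had the same responsibilities, the update would yield $\theta_{(d,1)} = 0.8$ and $\theta_{(d,2)} = 0.2$.
\end{itemize}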
\subsubsection{Latent Dirichlet Allocation (LDA)}
\begin{itemize}
	\item Takes PLSA but makes it a generative model with Dirichlet prior. Can also be seen as a Bayesian treatment of PLSA
	\item Instead of probabilities for every document, we now simply define two hyperparameters $\alpha$ and $\beta$
	\item For every topic $z=1,...,K$, we draw a word distribution $\phi_z \sim \text{Dir}(\beta)$. Thus, $\beta$ determines how words are distributed per topic
	\item For each document $d$, we sample a topic distribution $\theta_d \sim \text{Dir}(\alpha)$. The alpha therefore controls the mixture of topics for any given document.
	\item The words in the documents are then sampled by the probabilities $\phi_z$ and $\theta_d$. Note that this is a fully generative model as we can generate as many new documents as we need.
\end{itemize}
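\begin{itemize}
	\item \textit{The generative process written compactly} (a sketch in standard notation, where the per-position topic $z_{d,i}$, the word $w_{d,i}$ and the categorical distribution $\text{Cat}$ are symbols introduced here for illustration):
	$$\phi_z \sim \text{Dir}(\beta) \hspace{2mm}\text{for } z=1,...,K, \hspace{5mm} \theta_d \sim \text{Dir}(\alpha) \hspace{2mm}\text{for each } d$$
	$$z_{d,i} \sim \text{Cat}(\theta_d), \hspace{5mm} w_{d,i} \sim \text{Cat}\left(\phi_{z_{d,i}}\right) \hspace{2mm}\text{for each position } i \text{ in } d$$
	\item A new document is thus generated by first drawing its own $\theta_d$ from the prior, which is the step that PLSA lacks.
\end{itemize}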
\subsubsection{Graphical Models and Probabilistic Topic Models}
\begin{itemize}
	\item We can represent every probabilistic topic model as graphical model which abstracts the conditional independence relationships
	\item For example, Figure~\ref{img:semantic_matching_graphical_models_LDA} visualizes the graphical model of the Latent Dirichlet Allocation
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/semantic_matching_graphical_models_LDA.png}
		\caption{Graphical model of LDA}
		\label{img:semantic_matching_graphical_models_LDA}
	\end{figure}
	\item In these diagrams, it is easier to show extensions. For instance, Figure~\ref{img:semantic_matching_graphical_models_author_model} visualizes the Author-Topic model where we add an observed variable of the author of a document. This observation affects the probability distributions over words for each topic in a document 
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/semantic_matching_graphical_models_author_model.png}
		\caption{Graphical model of the Author-topic model}
		\label{img:semantic_matching_graphical_models_author_model}
	\end{figure}
	% semantic_matching_graphical_models_author_model.png
\end{itemize}
\subsection{Probabilistic Topic Models in IR}
\begin{enumerate}
	\item \textbf{Topic matching}
	\begin{itemize}
		\item Represent document by $p(z|d)$ and query by $p(z|q)$
		\item Take the KL divergence or similar measure to find score/similarity between the distributions
		\item But: query is very short so that topic model is harder to infer without having too much noise
	\end{itemize}
	\item \textbf{Smoothing}
	\begin{itemize}
		\item Smooth probabilities according to the topics in the document:
		\begin{equation*}
			\begin{split}
				p(w|d) & = \lambda p_{\mu}(w|d) + (1 - \lambda) p_{\text{tm}}(w|d)\\
				& = \lambda p_{\mu}(w|d) + (1 - \lambda) \left(\sum\limits_{z=1}^{K} p(w|z) p(z|d)\right)
			\end{split}
		\end{equation*}
		\item Thus we apply Jelinek-Mercer smoothing where the context is replaced by the topic word distributions of the document
	\end{itemize}
	\item \textbf{Query expansion}
	\begin{itemize}
		\item We can build a separate language model for a given query by using the word distributions of the topics:
		$$p_{\text{tm}}(w|q) = \sum\limits_{z=1}^{K} p(w|z) p(z|q)$$
	\end{itemize}
\end{enumerate}

================================================
FILE: Information_Retrieval_1/ir_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\definecolor{colkeyword}{rgb}{0,0.4,0}
\definecolor{colname}{rgb}{0.4,0.4,0}
\definecolor{coltype}{rgb}{0.4,0,0.4}
\definecolor{coloperators}{rgb}{0,0,1.0}
\definecolor{colscopes}{rgb}{0.4,0,0}
\usepackage{mathtools}
\DeclareMathOperator*{\argmin}{arg\,min}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Information Retrieval 1}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

% \input{ir_boolean_retrieval.tex}
\input{ir_offline_evaluation.tex}
\input{ir_online_evaluation.tex}
\input{ir_click_models.tex}
\input{ir_language_models.tex}
\input{ir_semantic_matching.tex}
\input{ir_neural_models.tex}
\input{ir_learning_to_rank.tex}
\input{ir_counterfactual_eval.tex}
\appendix
% \newpage
% \input{nlp_appendix.tex}
\end{document}

================================================
FILE: Knowledge_Representation/kr_csp.tex
================================================
\section{Constraint Satisfaction Problems}
\begin{itemize}
	\item Knowledge Representation is focused on qualitative reasoning, not really quantitative
	\begin{itemize}
		\item An abstract description of the world is mostly easier for humans to reason about than an exact numerical definition 
		\item We can for example reason about time points and intervals in our system where we have relations between those
	\end{itemize}
\end{itemize}
\subsection{Fundamentals of CSPs}
\begin{itemize}
	\item To represent qualitative knowledge, we again have to define a:
	\begin{itemize}
		\item \textit{Vocabulary}: finite set of relations, mostly binary. 
		
		Example: $x$ equals $y$ $\Rightarrow$ $x=y$, $x$ before $y$ $\Rightarrow$  $x<y$, $x$ after $y$ $\Rightarrow$  $x>y$
		\item \textit{Language}: sets of atomic formulae, perhaps restricted disjunction
		
		Example: define disjunction by $x\left\{<,=\right\}y$ (in maths $x\leq y$), and formulae to describe configurations: $\left\{x\left\{<,=\right\}y, y\left\{z\right\}\right\}$
		\item \textit{Formal semantics}: interpretation of function and symbols
		
		Example: interpret time points and relations over rational (or real) numbers
	\end{itemize}
	\item On those, we can perform various reasoning tasks
	\begin{itemize}
		\item \textit{Satisfiability}: Is this a consistent set of constraints? $\Rightarrow$ find satisfying instantiation of all variables
		\item \textit{Deduction}: Does $x\left\{=\right\}y$ logically follow from the configuration?
		\item \textit{Minimal description}: What are the most constrained relations that describe the same set? 
		\item \textit{Solving}: find one or all (optimal) solutions to the CSP
	\end{itemize}
	\item Formally, a Constraint Satisfaction Problem consists of:
	\begin{itemize}
		\item Variables $Y:=y_1, y_2, ..., y_k$
		\item Domains $D_1, ..., D_k$ to which the variables belong ($y_i$ represents a possible value of $D_i$: $y_i\in D_i$)
		\item Constraints $C\in \mathcal{C}$ on $Y$ which define a subset of $D_1 \times D_2 \times ... \times D_k$ 
	\end{itemize}
	\item Given a CSP $\left\{\mathcal{C}; y_1\in D_1, ..., y_k\in D_k \right\}$, $\left(d_1, ..., d_k\right)\in D_1\times ...\times D_k$ is a solution iff for all $C\in \mathcal{C}: \left(d_1, ..., d_k\right)\in C$
	\item Most CSPs have only binary constraints (constraints over two variables) which can be modeled by a constraint graph (nodes are variables, edges/arcs are constraints)
\end{itemize}
\subsubsection{Backtrack search}
\begin{itemize}
	\item Finding a solution for CSPs is a more general version of DPLL (PL is actually a special case of a CSP)
	\item In the initial state, we have an empty set of assignments. We then assign a value to an unassigned variable that does not conflict with the current assignment, and test whether the CSP is fulfilled or unsatisfiable for this assignment. If neither, we choose the next variable etc.
	\item The depth of the search tree is the number of variables, and the number of children/splits in the tree is given by the domain size. In the worst case, we have a search space of $|D|^{\# vars}$
	\item We can improve backtracking efficiency by:
	\begin{enumerate}
		\item \textbf{Smart variable picking}: various heuristics for choosing the next variable
		\begin{itemize}
			\item \textit{Most constrained variable}: choose the variable with the fewest legal values (therefore also called \textit{Minimum remaining values} (MRV))
			\item \textit{Most constraining variable}: the variable with the most constraints on the remaining variables. It is mostly used as a tie-breaker strategy for MRV
		\end{itemize}
		\item \textbf{Smart value picking}: what possible value of the domain to pick for the variable
		\begin{itemize}
			\item \textit{Least constraining value}: given a variable, choose the value that constrains the least, i.e. the one that rules out the fewest values for the remaining variables  
		\end{itemize}
		\item \textbf{Spotting failure early} to backtrack early: finding unsatisfiable constraints in a branch
		\begin{itemize}
			\item \textit{Forward checking}: by keeping track of remaining legal values for unassigned variables, we can stop the search in this branch if one variable has an empty set/no legal values to be assigned. This is done by propagating information from assigned to unassigned variables.
			\item \textit{Arc consistency}: simplest form of constraint propagation (see Section~\ref{sec:constraint_propagation}) that makes each arc/edge consistent. Hence, $x\to y$ is consistent iff for every value $x$ in $D_x$ there exists a value $y$ in $D_y$ for which the constraint is satisfied (if not, remove contradicting values of $D_x$). Note that if $x$ has been changed, we have to recheck all its neighbors. Arc consistency can detect failures earlier than forward checking
			\item More complex/sophisticated methods are discussed under constraint propagation in Section~\ref{sec:constraint_propagation}
		\end{itemize}
	\end{enumerate}   
\end{itemize}
\subsection{Constraint propagation}
\label{sec:constraint_propagation}
\begin{itemize}
	\item A constraint itself can restrict the search space without splitting
	\item Constraint propagation is similar to unit propagation and subsumes resolution. It therefore checks the CSP for local consistency (a CSP is locally consistent if we can extend it by another assignment while being satisfied)
	\item Example: the CSP $\langle x < y ; x \in \left[50..200\right], y \in \left[0..100\right] \rangle$ can be simplified to 
	
	$\langle x < y ; x \in \left[50..99\right], y \in \left[51..100\right] \rangle$ without losing any possible solutions
	\item This might be an iterative problem when multiple constraints interact with each other ($x$ has effect on $y$, $y$ on $z$, and so on)
	\item Thus, we have to decide when to stop performing local consistency checking. There are various heuristics/methods:
	\begin{itemize}
		\item \textit{(Directional) Arc consistency} is one of the simplest approaches. Note that general arc consistency applies propagation to both sides $x\to y$ and $y\to x$, while we can also limit it by a directed approach (only $x\to y$ or $y\to x$). Problem: it only looks at direct constraints and cannot, for example, spot that the following CSP is inconsistent:
		
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.3\textwidth]{figures/kr_csp_arc_const_example.pdf}
		\end{figure}
		\item \textit{Path consistency}: extends arc consistency to picking two variables. Formally, for all $i$ and $j$:
		$$\forall x,y: (x,y)\in P_{i,j}, x\in P_i, y\in P_j \to \left(\exists z: z \in P_k \wedge (x,z) \in P_{i,k} \wedge (z,y) \in P_{k,j}\right)$$
		i.e., any consistent assignment to two variables can be extended to a third one. Can be enforced by iterating the following assignment until nothing changes anymore:
		$$P_{i,j} := P_{i,j} \cap \left\{(x,y)\hspace{1mm}|\hspace{1mm}\exists z: (x,z) \in P_{i,k} \wedge (z,y) \in P_{k,j}\hspace{2mm}\text{for all } k\right\}$$
		Path consistency subsumes arc consistency and detects the inconsistency in the previous example. Note that path consistency removes pairs of values, and thus makes constraints explicit
		\item \textit{$k$-Consistency}: generalization of arc and path consistency to arbitrary $k$. For each satisfying value assignment to $k-1$ variables, there exists an extension of this assignment to a $k$-th variable such that this extended assignment satisfies all constraints among these $k$ variables.
		\item \textit{Strong $k$-consistency}: A CSP is strongly $k$-consistent iff it is $j$-consistent for all $j\leq k$. If a CSP with $n$ variables is strongly $n$-consistent, then we can construct a model in polynomial time. But note that checking this already solves the CSP, and the computational costs grow exponentially with $k$.
		\item \textcolor{red}{Question: why does PC subsume AC, while $k$-consistency does not subsume $(k-1)$-consistency?}
		\item \textit{Hyper-arc consistency}: extension of arc consistency to constraints over more than two variables (for binary constraints, it coincides with arc consistency). Defined by: for every constraint $C$ and every variable $x$ with domain $D_x$, each value for $x$ from $D_x$ participates in a solution to $C$.
	\end{itemize}
\end{itemize}
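\begin{itemize}
	\item Worked example (a sketch; the three-variable CSP in the second part is an assumption chosen for illustration and need not be the one shown in the figure above). For $x<y$ with $x \in \left[50..200\right]$, $y \in \left[0..100\right]$, arc consistency keeps only those values that have a supporting value in the other domain:
	\begin{equation*}
		\begin{split}
			D_x & = \left\{x \in \left[50..200\right]\hspace{1mm}|\hspace{1mm}\exists y \in \left[0..100\right]: x<y\right\} = \left[50..99\right]\\
			D_y & = \left\{y \in \left[0..100\right]\hspace{1mm}|\hspace{1mm}\exists x \in \left[50..99\right]: x<y\right\} = \left[51..100\right]
		\end{split}
	\end{equation*}
	Now take $x,y,z\in\left\{0,1\right\}$ with the constraints $x\neq y$, $y\neq z$ and $x\neq z$. Every value has a support in each neighbouring domain, so arc consistency removes nothing. Path consistency starts from $P_{x,y}=\left\{(0,1),(1,0)\right\}$ and asks for a supporting $z$ for each pair:
	\begin{equation*}
		\begin{split}
			(x,y)=(0,1): & \hspace{2mm} x\neq z \Rightarrow z=1, \hspace{2mm} y\neq z \Rightarrow z=0 \hspace{2mm}\Rightarrow\hspace{2mm}\text{no support}\\
			(x,y)=(1,0): & \hspace{2mm} x\neq z \Rightarrow z=0, \hspace{2mm} y\neq z \Rightarrow z=1 \hspace{2mm}\Rightarrow\hspace{2mm}\text{no support}
		\end{split}
	\end{equation*}
	Hence $P_{x,y}$ becomes empty and the CSP is detected to be inconsistent.
\end{itemize}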

================================================
FILE: Knowledge_Representation/kr_dl.tex
================================================
\section{Description Logic}
\begin{itemize}
	\item Description Logic is the logic for ontologies
	\item Is more expressive than propositional logic, but less than first-order logic (DL is subsumed by FOL as classes are unary predicates and relations binary)
	\item We limit our discussion to the $\mathcal{ALC}$ description logic (\textit{Attributive Concept Language with Complements})
\end{itemize}
\subsection{Ontologies}
\begin{itemize}
	\item Organizing knowledge in a way that is useful for people
	\item Fundamental elements of ontologies
	\begin{itemize}
		\item \textit{Class}/\textit{Type}/\textit{Concept}: name + set of properties that describe certain set of individuals
		\item \textit{Instances}: members of the set defined by the class
		\item \textit{Property}/\textit{Relation}: assert facts about the instances/relation between classes
	\end{itemize}
	\item The backbone of every ontology is a type-/class-hierarchy where multi-parent inheritance is possible
	\item Axioms that need to be formalized in a logic for ontologies:
	\begin{itemize}
		\item Two classes are equivalent iff they have the same individuals and same definition
		\item The intersection and union of classes
		\item A class can be specified by either its definition or enumeration of all members
		\item Restriction of some or all values from the specified class
	\end{itemize}
	\item Property types/possible properties of relations:
	\begin{itemize}
		\item \textit{Symmetry}: if $a$ is related to $b$ by relation $r$, then $b$ is related to $a$ by $r$
		\item \textit{Asymmetry}: if $a$ is related to $b$ by relation $r$, then $b$ cannot be related to $a$ by $r$
		\item \textit{Transitivity}: if $a$ is related to $b$ and $b$ to $c$ by relation $r$, then $a$ is related to $c$ by $r$
		\item \textit{Functionality}: if $a$ is related to $b$ and $a$ to $c$ by relation $r$, then $b$ and $c$ must be the same
		\item \textit{Inverse functionality}: if $a$ is related to $b$ and $c$ to $b$ by relation $r$, then $a$ and $c$ must be the same
		\item \textit{Reflexivity}: $a$ is always related to itself by relation $r$
		\item \textit{Irreflexivity}: $a$ can never be related to itself by relation $r$
		\item \textit{Inverse property}: if $a$ is related to $b$ by relation $r$, then $b$ is related to $a$ by relation $q$
	\end{itemize}
\end{itemize}
\subsection{Fundamentals of Description Logic}
\begin{itemize}
	\item A logic is defined by
	\begin{itemize}
		\item A language (``syntax'')
		\item The meaning of expressions (``semantics'')
		\item Inference (``deduction'')
	\end{itemize}
\end{itemize}
\subsubsection{Syntax}
\begin{itemize}
	\item The vocabulary of DL includes concept/class/type and role names
	\item Furthermore, we have the universal concept $\top$ (everything) and the bottom concept $\bot$ (nothing)
	\item More complex types can be composed using union $\sqcup$, intersection $\sqcap$ and complement $\lnot$
	\item Restrictions are encoded by $\exists r.C$ and $\forall r.C$. See semantics for meaning/explanation, and visualization in Figure~\ref{fig:kr_dl_restriction_operator_visualized}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/kr_dl_restriction_operator_visualized.png}
		\caption{Visualization of restriction operators $\exists r.C$ and $\forall r.C$}
		\label{fig:kr_dl_restriction_operator_visualized}
	\end{figure}
	\item Examples:
	\begin{itemize}
		\item $\texttt{Boys} \sqcup \texttt{Girls}$
		\item $\texttt{Girls} \sqcap \exists \texttt{owns}.\texttt{Car}$
	\end{itemize}
	\item It is important to be able to identify concepts and roles in text. Example:
	
	``\textit{Any artwork is created by an artist. A sculpture is an artwork. A painting is an artwork that is not a sculpture. A painter is someone who painted a painting. A sculptor is someone who sculptured an artwork and only created sculptures. If an artwork is created by an artist, he has either painted or sculptured it.}''
	
	The solution would be the concepts $\left\{\texttt{Artwork}, \texttt{Artist}, \texttt{Sculptor}, \texttt{Painter}, \texttt{Painting}, \texttt{Sculpture}\right\}$, and the roles $\left\{\texttt{created}, \texttt{created\_by}, \texttt{painted}, \texttt{sculptured}\right\}$
	
	``\textit{An artwork that is not a sculpture}'': $\texttt{Artwork}\sqcap \lnot\texttt{Sculpture}$
	
	``\textit{Someone who painted a painting}'': $\exists \texttt{painted}.\texttt{Painting}$
	
	``\textit{Someone who sculptured an artwork and only created sculptures}'': $\exists \texttt{sculptured}.\texttt{Artwork}\sqcap \forall \texttt{created}.\texttt{Sculpture}$
\end{itemize}
\subsubsection{Semantics}
\begin{itemize}
	\item Mathematical meaning/interpretation of syntax
	\item An interpretation $\mathcal{I}=\left(\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}} \right)$ consists of a non-empty domain of individuals $\Delta^{\mathcal{I}}$ and an interpretation function $\cdot^{\mathcal{I}}$ that maps
	\begin{itemize}
		\item $A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$, i.e. concepts to subsets of $\Delta^{\mathcal{I}}$
		\item $r^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}\times \Delta^{\mathcal{I}}$, i.e. role names to subsets of $\Delta^{\mathcal{I}}\times\Delta^{\mathcal{I}}$
	\end{itemize}
	% \item Question: why has $\Delta^{\mathcal{I}}$ need to be non-empty? Is therefore a model without any individuals not allowed?
	\item $\Delta^{\mathcal{I}}$ must be non-empty, as otherwise no model exists.
	\item Thus, a concept/class/type represents a set of individuals: $\texttt{Painter}^{\mathcal{I}} = \left\{\texttt{rembrandt}, \texttt{vanGogh}\right\}$
	\item Interpretation of a role/relation: $\texttt{hasPainted}^{\mathcal{I}} = \left\{(\texttt{rembrandt}, \texttt{nightwatch}),(\texttt{daVinci}, \texttt{MonaLisa}) \right\}$
	\item $\cdot^{\mathcal{I}}$ is inductively extended over complex concept descriptions
	\begin{equation*}
		\begin{split}
			\top^{\mathcal{I}} & = \Delta^{\mathcal{I}}\\
			\bot^{\mathcal{I}} & = \emptyset\\
			(\lnot C)^{\mathcal{I}} & = \Delta^{\mathcal{I}}\setminus C^{\mathcal{I}}\\
			(C\sqcap D)^{\mathcal{I}} & = C^{\mathcal{I}} \cap D^{\mathcal{I}}\\
			(C\sqcup D)^{\mathcal{I}} & = C^{\mathcal{I}} \cup D^{\mathcal{I}}\\
			(\exists r.C)^{\mathcal{I}} & = \left\{x\in \Delta^{\mathcal{I}}\hspace{1mm}|\hspace{1mm}\exists y.(x,y)\in r^{\mathcal{I}} \wedge y\in C^{\mathcal{I}} \right\}\\
			(\forall r.C)^{\mathcal{I}} & = \left\{x\in \Delta^{\mathcal{I}}\hspace{1mm}|\hspace{1mm}\forall y.(x,y)\in r^{\mathcal{I}} \to y\in C^{\mathcal{I}} \right\}
		\end{split}
	\end{equation*}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/kr_dl_interpretation_example.png}
		\caption{Example for interpretation function $\mathcal{I}$}
	\end{figure}
	\item Note that $(\forall r.C)^{\mathcal{I}}$ also contains all individuals that have no $r$-successor at all (the condition then holds vacuously). To exclude these, use $\forall r.C \sqcap \exists r.\top$
	% kr_dl_interpretation_example.png
\end{itemize}
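\begin{itemize}
	\item Worked example for the inductive definition (a sketch; the interpretation below is hypothetical and chosen only for illustration, it is not the one shown in the figure). Let $\Delta^{\mathcal{I}}=\left\{a,b,c,d\right\}$, $C^{\mathcal{I}}=\left\{b,c\right\}$ and $r^{\mathcal{I}}=\left\{(a,b),(a,c),(b,d)\right\}$. Then:
	\begin{equation*}
		\begin{split}
			(\lnot C)^{\mathcal{I}} & = \left\{a,d\right\}\\
			(\exists r.C)^{\mathcal{I}} & = \left\{a\right\}\\
			(\forall r.C)^{\mathcal{I}} & = \left\{a,c,d\right\}\\
			(\exists r.\top)^{\mathcal{I}} & = \left\{a,b\right\}\\
			(\forall r.C \sqcap \exists r.\top)^{\mathcal{I}} & = \left\{a\right\}
		\end{split}
	\end{equation*}
	Here $b$ is not in $(\exists r.C)^{\mathcal{I}}$ or $(\forall r.C)^{\mathcal{I}}$ because its only $r$-successor $d$ is not in $C^{\mathcal{I}}$, while $c$ and $d$ are in $(\forall r.C)^{\mathcal{I}}$ only because they have no $r$-successors.
\end{itemize}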
\subsection{Inference in Description Logic}
\begin{itemize}
	\item A \textbf{knowledge base} $\mathcal{K} = (\mathcal{T}, \mathcal{A}) $ consists of:
	\begin{itemize}
		\item the terminology/theory $\mathcal{T}$, that defines the general world model you can apply to any model. It summarizes the relations between concepts.
		\begin{itemize}
			\item The knowledge about relations between concepts is expressed by means of terminological axioms
			\item \textit{Concept inclusion}: $C\sqsubseteq D$ models necessary conditions for objects of type $C$
			
			Example: $\texttt{Elephant} \sqsubseteq \texttt{Animal} \sqcap \lnot\texttt{Mouse}$
			\item \textit{Concept equivalence}: $C\equiv D$ models necessary and sufficient conditions for objects of type $C$. Can also be written as $C \sqsubseteq D$ and $D \sqsubseteq C$
			
			Example: $\mathcal{T}$: ``\textit{a painter is a human that created a painting}'' $\Rightarrow$ $\texttt{Painter} \equiv \texttt{Human}\sqcap \exists\texttt{created}.\texttt{Painting}$
		\end{itemize}
		\item the assertions $\mathcal{A}$, which describe specific individuals in terms of the sets that are defined by $\mathcal{T}$
		\begin{itemize}
			\item Knowledge about individuals in the domain expressed in terms of the vocabulary is specified by means of assertional axioms 
			\item \textit{Concept assertion}: $a : C$ models that individual $a$ is of class $C$ 
			
			Example: $\texttt{dumbo}: \texttt{Elephant}$
			\item \textit{Role assertions}: $(a, b) : r$ models that individual $a$ is related to individual $b$ by the role $r$
			
			Example: $(\texttt{daVinci}, \texttt{MonaLisa}) : \texttt{painted}$
		\end{itemize}
	\end{itemize}
	\item Definition of a \textit{\textbf{model}}
	\begin{itemize}
		\item An interpretation $\mathcal{I}$ is a \textit{model} of the $\mathcal{T}$-box iff it satisfies every terminological axiom in $\mathcal{T}$
		\item An interpretation $\mathcal{I}$ is a \textit{model} of the $\mathcal{A}$-box iff it satisfies every assertional axiom in $\mathcal{A}$
		
		Note that this just requires $\mathcal{I}$ to have the individuals and relations/assertions defined in $\mathcal{A}$
		\item Finally, an interpretation $\mathcal{I}$ is a \textit{model} of the knowledge base $\mathcal{K} = (\mathcal{T},\mathcal{A})$ iff $\mathcal{I}$ is a model of both $\mathcal{T}$ and $\mathcal{A}$
		\item A knowledge base is called \textit{satisfiable}/\textit{consistent} iff there exists a model for it.
	\end{itemize}
	\item In general, axioms of $\mathcal{A}$- and $\mathcal{T}$-box restrict the possible models
	\item \textbf{Reasoning tasks} for $\mathcal{T}$-box
	\begin{itemize}
		\item \textit{Concept satisfiability}: Concept $C$ is satisfiable w.r.t. $\mathcal{T}$ iff there is a model $\mathcal{I}$ of $\mathcal{T}$ with $C^{\mathcal{I}} \neq \emptyset$
		\item \textit{Subsumption}: check if $\mathcal{T} \models C\sqsubseteq D$. $C$ is subsumed by $D$ in $\mathcal{T}$ iff $C^{\mathcal{I}}\subseteq D^{\mathcal{I}}$ in every model $\mathcal{I}$ of $\mathcal{T}$
		\item \textit{Equivalence}: check if $\mathcal{T} \models C\equiv D$. Concepts $C$ and $D$ are equivalent in $\mathcal{T}$ iff $C^{\mathcal{I}} = D^{\mathcal{I}}$ in every model $\mathcal{I}$ of $\mathcal{T}$
		\item All $\mathcal{T}$-box problems can be reduced to concept (un)satisfiability: $\mathcal{T} \models C\sqsubseteq D$ holds iff $C\sqcap \lnot D$ is unsatisfiable w.r.t. $\mathcal{T}$ (see section~\ref{sec:tableau_algorithm} for the Tableau algorithm to solve this)
	\end{itemize}
	\item \textbf{Reasoning tasks} for $\mathcal{A}$-box
	\begin{itemize}
		\item \textit{$\mathcal{A}$-box consistency}: $\mathcal{A}$ is consistent w.r.t. $\mathcal{T}$ iff there is a model of $\mathcal{K}$. In such a case, $\mathcal{K}$ is satisfiable.
		\item \textit{Instance checking}: check if $\mathcal{K} \models a:C$ (or respectively $\mathcal{K} \models (a,b) : r$). This holds iff every model of $\mathcal{K}$ is also a model of $a:C$
		\item \textcolor{red}{Question: don't we also need the assertion that $\mathcal{K}$ is consistent as otherwise we can prove anything for an unsatisfiable knowledge base?}
		\item \textit{Retrieval task}: given a concept $C$ and an $\mathcal{A}$-box $\mathcal{A}$, find all individuals $a$ such that $\mathcal{K}\models a:C$
		\item \textit{Realization task}: given an individual $a$ and a set of concepts, find the most specific concept $C$ such that $\mathcal{K}\models a:C$
		\item All tasks are reducible to checking $\mathcal{A}$-box consistency. For example, $\mathcal{K}\models a:C$ can be checked by showing that $\mathcal{A} \cup \left\{a:\lnot C\right\}$ is inconsistent w.r.t. $\mathcal{T}$.
	\end{itemize}
\end{itemize}
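\begin{itemize}
	\item The reduction of subsumption to unsatisfiability can be spelled out from the semantics (a short derivation, using only the definitions above; ``all $\mathcal{I}$'' ranges over the models of $\mathcal{T}$):
	\begin{equation*}
		\begin{split}
			\mathcal{T} \models C\sqsubseteq D & \iff C^{\mathcal{I}}\subseteq D^{\mathcal{I}} \text{ for all } \mathcal{I}\\
			& \iff C^{\mathcal{I}}\cap \left(\Delta^{\mathcal{I}}\setminus D^{\mathcal{I}}\right) = \emptyset \text{ for all } \mathcal{I}\\
			& \iff (C\sqcap \lnot D)^{\mathcal{I}} = \emptyset \text{ for all } \mathcal{I}\\
			& \iff C\sqcap \lnot D \text{ is unsatisfiable w.r.t. } \mathcal{T}
		\end{split}
	\end{equation*}
	Equivalence $C\equiv D$ follows by checking both $C\sqsubseteq D$ and $D\sqsubseteq C$. In the same way, $\mathcal{K}\models a:C$ holds iff no model of $\mathcal{K}$ satisfies $a:\lnot C$, i.e. iff $\mathcal{A} \cup \left\{a:\lnot C\right\}$ is inconsistent w.r.t. $\mathcal{T}$.
\end{itemize}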
\subsubsection{Tableau algorithm}
\label{sec:tableau_algorithm}
\begin{itemize}
	\item All reasoning tasks in $\mathcal{ALC}$ can be reduced to a single task of checking
	$\mathcal{A}$-box consistency w.r.t. $\mathcal{T}$-box.
	\item Tableau shows (un-)satisfiability by contradiction. It searches through the tree of possible models and delivers a proof iff no model exists and the input is therefore inconsistent (otherwise it returns a valid model).
	\item The general approach of Tableau is to extend the candidate model until we either find a model or have closed all branches
	\item To reduce the number of tableau rules, we assume that all concepts are in Negation Normal Form (NNF). Conversion rules:
	\begin{equation*}
		\begin{split}
			\lnot\left(C \sqcap D\right) & \Rightarrow \left(\lnot C \sqcup \lnot D\right)\\
			\lnot\left(C \sqcup D\right) & \Rightarrow \left(\lnot C \sqcap \lnot D\right)\\
			\lnot\exists r.C & \Rightarrow \forall r.\lnot C\\
			\lnot\forall r.C & \Rightarrow \exists r.\lnot C
		\end{split}
	\end{equation*}
	\item A \textit{\textbf{branch}} of a tableau is a set of $\mathcal{A}$-box assertions. For any branch $S$, the following rules apply:
%	\begin{figure}[ht!]
%		\hspace{10mm}
%		\includegraphics[width=0.6\textwidth]{figures/kr_dl_tableau_rules.png}
%	\end{figure}
	\begin{itemize}
		\item \textbf{IF} $(a:C\sqcap D)\in S$ \textbf{THEN} $S' := S \cup \left\{a:C, a:D\right\}$
		\item \textbf{IF} $(a:C\sqcup D)\in S$ \textbf{THEN} $S' := S \cup \left\{a:C\right\}$ \textbf{or} $S' := S \cup \left\{a:D\right\}$
		\item \textbf{IF} $(a:\exists r.C)\in S$ \textbf{THEN} $S' := S \cup \left\{(a,b):r, b:C\right\}$ where $b$ is a ``fresh'' individual name, i.e. one that does not yet occur in $S$
		\item \textbf{IF} $(a:\forall r.C)\in S$ \textbf{and} $(a,b):r \in S$ \textbf{THEN} $S' := S \cup \left\{b:C\right\}$
		\item \textbf{IF} $\left\{a:C, a:\lnot C\right\}\subseteq S$ \textbf{THEN} mark branch as CLOSED (unsatisfiable)
	\end{itemize}
	\item To check for satisfiability, we assume that there is an individual of the given concept and try to show that this leads to a contradiction.
	\item Example: \textit{show that $\exists r.A \sqcap \exists r.B$ is subsumed by $\exists r.(A\sqcap B)$}
	\begin{enumerate}
		\item Write expression as concept: $\exists r.A \sqcap \exists r.B \sqsubseteq \exists r.(A\sqcap B)$
		\item Negate concept (proof by contradiction): $\exists r.A \sqcap \exists r.B \sqcap \lnot \exists r.(A\sqcap B)$
		\item Rewrite concept in NNF: $\exists r.A \sqcap \exists r.B \sqcap \forall r.(\lnot A\sqcup \lnot B)$
		\item Assume $\mathcal{A}$-box $\mathcal{A}=\left\{a: \exists r.A \sqcap \exists r.B \sqcap \forall r.(\lnot A\sqcup \lnot B)\right\}$ and search for a model
%		\begin{itemize}
%			\item $a: \exists r.A$, $a: \exists r.B$, $a: \forall r.(\lnot A\sqcup \lnot B)$
%			\item $(a,b):r$, $b:A$, $(a,c):r$, $c:B$
%			\item $b:(\lnot A\sqcup \lnot B)$, $c:(\lnot A\sqcup \lnot B)$
%			
%		\end{itemize}
% kr_dl_tableau_proof_example.png
		\begin{figure}[ht!]
			\hspace{30mm}
			\includegraphics[width=0.5\textwidth]{figures/kr_dl_tableau_proof_example.png}
		\end{figure}
		\item Step 13 shows the valid model where $\Delta^{\mathcal{I}}=\left\{a,b,c\right\}$, $A^{\mathcal{I}}=\left\{b\right\}$, $B^{\mathcal{I}}=\left\{c\right\}$, $r^{\mathcal{I}}=\left\{(a,b), (a,c)\right\}$ $\Rightarrow$ the negated concept is satisfiable, which disproves the hypothesis
		
		If we had not been able to construct a model, the hypothesis would hold
	\end{enumerate}
	\item Note that the tableau algorithm is sound (if algorithm finds a proof, then statement is correct) and complete (if statement is correct, the algorithm finds a proof).
\end{itemize}
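\begin{itemize}
	\item The expansion for the subsumption example above can be written out as follows (a sketch; the step numbering is our own and need not match the numbering in the figure, and $b$, $c$ are the fresh individuals introduced by the $\exists$-rule):
	\begin{enumerate}
		\item $S_0=\left\{a: \exists r.A \sqcap \exists r.B \sqcap \forall r.(\lnot A\sqcup \lnot B)\right\}$
		\item $\sqcap$-rule on $a$: add $a: \exists r.A$, $a: \exists r.B$, $a: \forall r.(\lnot A\sqcup \lnot B)$
		\item $\exists$-rule on $a: \exists r.A$: add $(a,b):r$, $b:A$
		\item $\exists$-rule on $a: \exists r.B$: add $(a,c):r$, $c:B$
		\item $\forall$-rule on $a$ with $(a,b):r$ and $(a,c):r$: add $b:(\lnot A\sqcup \lnot B)$, $c:(\lnot A\sqcup \lnot B)$
		\item $\sqcup$-rule on $b$: the branch with $b:\lnot A$ is CLOSED (clash with $b:A$), continue with $b:\lnot B$
		\item $\sqcup$-rule on $c$: the branch with $c:\lnot B$ is CLOSED (clash with $c:B$), continue with $c:\lnot A$
	\end{enumerate}
	No further rule applies to the remaining branch, which contains $b:A$, $b:\lnot B$, $c:B$, $c:\lnot A$ and no clash. It is therefore open and yields the model $A^{\mathcal{I}}=\left\{b\right\}$, $B^{\mathcal{I}}=\left\{c\right\}$, $r^{\mathcal{I}}=\left\{(a,b), (a,c)\right\}$ stated above.
\end{itemize}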
\subsubsection{Reasoning with non-empty $\mathcal{T}$-box}
\begin{itemize}
	\item The tableau algorithm can be extended in order to support terminology axioms of the $\mathcal{T}$-box
	\item Input preprocessing
	\begin{itemize}
		\item Replace every $C\equiv D \in \mathcal{T}$ with $C\sqsubseteq D$ and $D\sqsubseteq C$
		\item Replace every $C\sqsubseteq D \in \mathcal{T}$ with $\top \equiv \text{NNF}(\lnot C \sqcup D)$
		\item Add all concepts/formulas of $\mathcal{T}$ to the root $S_0$ of the tableau
	\end{itemize}
	\item Extend the tableau rules by the following:
	\begin{itemize}
		\item \textbf{IF} $(\top \equiv C) \in S$ \textbf{and} an individual $a$ occurs in $S$ \textbf{THEN} $S' := S\cup\left\{a:C\right\}$
	\end{itemize}
	\item Example:
	\begin{figure}[ht!]
		\hspace{10mm}
		\includegraphics[width=0.5\textwidth]{figures/kr_dl_tableau_nonempty_tbox_example.png}
	\end{figure}
	\item In case of reasoning with non-empty $\mathcal{T}$-box, the tableau algorithm is not guaranteed to terminate (rules in the form of $\top\equiv ...\exists r.C...$ can lead to an infinite loop)
	\item Solution: Detect cycles and prevent further application of the $\Rightarrow_{\exists}$ rule. This is achieved by a special blocking rule:
	\begin{itemize}
		\item \textbf{IF} $b$ is a (possibly indirect) successor of $a$ in $S$ \textbf{and} it is the case that: $$\left\{C\hspace{1mm}|\hspace{1mm} b:C \in S\right\} \subseteq \left\{D\hspace{1mm}|\hspace{1mm} a:D \in S\right\} $$
		
		\textbf{THEN} mark $b$ as BLOCKED by $a$ in $S$ and do not apply $\Rightarrow_{\exists}$ rule to $b$.
	\end{itemize}
	\item Example:
	\begin{figure}[ht!]
		\hspace{10mm}
		\includegraphics[width=0.5\textwidth]{figures/kr_dl_tableau_blocking_example.png}
	\end{figure}
	\item Note that we can only apply the blocking rule if no other rules apply anymore on that branch.
\end{itemize}
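\begin{itemize}
	\item Worked example of preprocessing and blocking (a sketch; the knowledge base $\mathcal{T}=\left\{A\sqsubseteq \exists r.A\right\}$, $\mathcal{A}=\left\{a:A\right\}$ is an assumption chosen for illustration and is not necessarily the one shown in the figures):
	\begin{enumerate}
		\item Preprocessing: $A\sqsubseteq \exists r.A$ becomes $\top \equiv \lnot A \sqcup \exists r.A$, so $S_0=\left\{a:A,\hspace{1mm} \top \equiv \lnot A \sqcup \exists r.A\right\}$
		\item $\mathcal{T}$-box rule on $a$: add $a:\lnot A \sqcup \exists r.A$
		\item $\sqcup$-rule on $a$: the branch with $a:\lnot A$ is CLOSED (clash with $a:A$), continue with $a:\exists r.A$
		\item $\exists$-rule on $a$: add $(a,b):r$, $b:A$
		\item $\mathcal{T}$-box rule on $b$: add $b:\lnot A \sqcup \exists r.A$
		\item $\sqcup$-rule on $b$: the branch with $b:\lnot A$ is CLOSED (clash with $b:A$), continue with $b:\exists r.A$
		\item Blocking: $b$ is a successor of $a$ and $\left\{C\hspace{1mm}|\hspace{1mm} b:C \in S\right\} = \left\{A,\hspace{1mm} \lnot A \sqcup \exists r.A,\hspace{1mm} \exists r.A\right\}$ is contained in (here: equal to) $\left\{D\hspace{1mm}|\hspace{1mm} a:D \in S\right\}$, so $b$ is BLOCKED by $a$ and the $\exists$-rule is not applied to $b$
	\end{enumerate}
	Without blocking, steps 4 to 6 would repeat forever with ever new fresh individuals. The branch stays open, and the knowledge base is indeed satisfiable, e.g. by $\Delta^{\mathcal{I}}=\left\{a\right\}$, $A^{\mathcal{I}}=\left\{a\right\}$, $r^{\mathcal{I}}=\left\{(a,a)\right\}$.
\end{itemize}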

================================================
FILE: Knowledge_Representation/kr_intro.tex
================================================
\section{Introduction to KR}
\begin{itemize}
	\item There are two main lines of development in AI: \textit{symbolic} and \textit{statistical} representation
	\item Both approaches go along with different benefits/weaknesses
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/kr_intro_symbolic_vs_statistical.png}
		\caption{Weaknesses of symbolic and statistical representation in AI}
	\end{figure}
	\begin{itemize}
		\item \textit{\textbf{Construction}}: effort that is needed to create such a system
		\begin{itemize}
			\item A symbolic AI requires a knowledge base on which it bases its reasoning. This knowledge base is mostly created by humans, which can take a lot of time. For example, the SNOMED database was built over more than 40 years and now contains about $300{,}000$ definitions. The construction of knowledge bases is studied in the research area \textit{Knowledge Engineering}.
			\item In the connectionist/statistical approach, the model learns from data so that we need to provide a (huge) dataset. Depending on the task and the required labels, it can also take a lot of human effort until we have the required data size.
		\end{itemize}
		\item \textit{\textbf{Scalability}}: effect of data size/amount on the systems
		\begin{itemize}
			\item The more data we have for a symbolic AI, the easier it is to run into problems. Huge knowledge bases tend to become inconsistent, as a small mistake at one point can lead to wrong reasoning for any problem (if the knowledge base is unsatisfiable, then all given problems are unsatisfiable). So we need to put extra effort into ensuring the consistency of the knowledge base.
			\item Connectionist approaches learn from the statistics of a dataset which gets more accurate by increasing the amount of data. Small errors/noise are thereby smoothed out. However, this also means that statistical representations are inaccurate if there is only a small amount of data. 
		\end{itemize}
		\item \textit{\textbf{Explainability}}: understanding how the system came to its decision
		\begin{itemize}
			\item Symbolic AI is dedicated to creating explainable systems, as it only applies facts/statements/rules of a knowledge base that was hand-crafted and is thus understandable to a human. The reasoning of such an AI system is explainable by the used and newly derived rules.
			\item In contrast, connectionist approaches are less explainable. As they capture the data distribution in a very high-dimensional space (e.g. neural networks with millions of parameters), they serve more as a black box. Errors that are produced by small, carefully selected input noise are harder to understand and to prevent.
		\end{itemize}
		\item \textit{\textbf{Generalization}}: performance across unseen domains
		\begin{itemize}
			\item Symbolic AI relies on the given knowledge base. If we try to reason about a domain that is unknown to the knowledge base, we don't get any answer (or rather, the reasoning ends without a result). For example, if we have a system based on SNOMED and try to show that ``\textit{if it is raining outside, I probably get wet}'', the AI terminates without a solution because it lacks the needed rules/facts.
			\item Commonly, statistical AI systems are already limited to their specific domain. In the task of classification, we usually have a fixed set of classes from which the system has to choose one. If we show a new image that does not belong to any of those, it will try to find the most similar class of what the system has so far (borders in high-dimensional space).
		\end{itemize}
	\end{itemize}
	
\end{itemize}

================================================
FILE: Knowledge_Representation/kr_qr.tex
================================================
\section{Qualitative Reasoning}
\begin{itemize}
	\item Learning by making (qualitative) representations, combine meanings and reason with them
	\item Differs from numerical reasoning by not having exact values (we can express that something is positive, but we don't say it has the value $3.21$)
\end{itemize}
\subsection{General vocabulary}
\begin{itemize}
	\item In QR, we express systems by the use of entities (like \textit{population}), and describe them by their quantities (e.g. \textit{size})
	\item We can define relations and inequalities between those. The full vocabulary is visualized in Figure~\ref{fig:kr_qr_vocabulary}
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.45\textwidth]{figures/kr_qr_vocabulary.png}
	\caption{Overview of the vocabulary used in QR}
	\label{fig:kr_qr_vocabulary}
\end{figure}
\subsubsection{Quantities}
\begin{itemize}
	\item Quantities are described by a magnitude and a derivative (can also include higher order derivatives)
	\item The quantities are represented by discrete scales which are built up from
	\begin{itemize}
		\item \textit{Intervals} which define a range of values (e.g. $+$ for positive values)
		\item \textit{Landmarks} which reflect key points of the system, and are defined as a single number (e.g. $0$ or $\max$)
	\end{itemize}
\end{itemize}
\subsubsection{Influence and Proportional}
\begin{itemize}
	\item \textbf{Proportional}: If \texttt{Q1} increases, then \texttt{Q2} increases (P+)/decreases (P$-$):
	$\partial \texttt{Q1}>0\implies \partial \texttt{Q2}>0$
	\begin{itemize}
		\item Mathematically, we can define the relation by:
		$$\texttt{Q2}\propto_{Q+}\texttt{Q1} \hspace{2mm}\equiv\hspace{2mm} \exists f\hspace{1mm}|\hspace{1mm}\texttt{Q2} = f(...,\texttt{Q1},...) \wedge f \text{ is monotonically increasing in }\texttt{Q1}$$
		\item This means that if \texttt{Q2} is \textit{qualitatively} proportional to \texttt{Q1}, we can express \texttt{Q2} by a function based on \texttt{Q1} such that it increases if \texttt{Q1} increases (and everything else stays the same)
		\item Monotonic functions provide an abstraction that covers a wide range of concrete mathematical expressions, which can be narrowed down by further constraints
	\end{itemize}
	\item \textbf{Influence}: If \texttt{Q1} is greater than 0, then \texttt{Q2} increases (I+)/decreases (I$-$):
	$\texttt{Q1} > 0\implies \partial \texttt{Q2}>0$
	\begin{itemize}
		\item Expressed in mathematical terms, the influence relation is stated as:
		$$I^{+}(\texttt{Q2},\texttt{Q1})\hspace{1mm} \equiv\hspace{1mm} d\texttt{Q2}/dt=...+\texttt{Q1}+...$$
		\item Note that this definition is much more specific than the one for proportionality. Hence, it gives us knowledge about relative rates if multiple influences are given. Example: if $I^{+}(\texttt{Q2},\texttt{Q1})\wedge I^{-}(\texttt{Q2},\texttt{Q3})$, then we know that if $\texttt{Q1}>\texttt{Q3}$, $\texttt{Q2}$ increases.
		\item \textit{Warning}: for reasoning with influences, we often require the second-order derivative (if \texttt{Q1} has a derivative and doesn't change its qualitative magnitude, we can't model the change in the derivative of \texttt{Q2}). This is especially important if we have a positive and a negative influence whose magnitudes both lie in an interval.
	\end{itemize}
	\item Example in formulas:
	\begin{itemize}
		\item The change of the size of a closed population can be specified by:
		$N_{t+1} = N_{t} + B_{t} - D_{t}$
		where $B$ is the number of births, and $D$ the number of deaths between $t$ and $t+1$.
		\item Then, we have a positive influence of $B$ on $N$, and a negative influence of $D$ on $N$: $I^{+}(N,B)$, $I^{-}(N,D)$
		\item Further, we specify that the birth and death rate proportionally depend on the population size: $B=f_B(...,N,...)$, $D=f_D(...,N,...)$ $\Rightarrow$ $P^{+}(D,N)$, $P^{+}(B,N)$
	\end{itemize}
	\item Both relations can lead to ambiguity if one influence/proportionality is positive, and another is negative. Without any further facts, all outcomes are possible successor states
\end{itemize}
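\begin{itemize}
	\item Worked sign analysis for the population example (a sketch, assuming the continuous reading $dN/dt = B - D$ of the difference equation above, which follows from $I^{+}(N,B)$ and $I^{-}(N,D)$ if no other influences on $N$ exist):
	\begin{equation*}
		\begin{split}
			B > D & \implies \partial N > 0\\
			B = D & \implies \partial N = 0\\
			B < D & \implies \partial N < 0
		\end{split}
	\end{equation*}
	If we only know the qualitative magnitudes $B=+$ and $D=+$ and no inequality between $B$ and $D$, all three cases remain possible. This is exactly the ambiguity described above, and it can be resolved by adding an inequality such as $B>D$.
\end{itemize}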
\subsubsection{Inequalities}
\begin{itemize}
	\item We can define inequalities between quantities (magnitudes), landmarks and derivatives
	\item Inequalities express rather uncertain knowledge, or assumptions that are valid for at least the initial state
	 \item For example, we can state that $A$ is larger than $B$, but this might change during the simulation
	 \item Also, we can reason with inequalities by propagating knowledge (if $A>B$ and $B>C$, then $A>C$). This can for example lead to constraints in the value space of another quantity
\end{itemize}
\subsubsection{Value constraint}
\begin{itemize}
	\item The value constraint binds landmarks/intervals of two quantities together.
	\item If we have a value constraint from $A$ to $B$ at the value 0 of both quantities, this means that if $A=0$ then $B=0$ (directed relation!)
\end{itemize}
\subsection{States and Transitions}
\begin{itemize}
	\item A qualitative state is a period of time in which the qualitative behavior of the system doesn't change (changes include magnitudes, derivatives, inequalities, etc.)
	\item Thus, states are \textit{unique} sets of inequality quantity expressions
	\item Transitions are changes of at least one quantity (but all changes must take place to enter the next state if multiple quantities change)
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.5\textwidth]{figures/kr_qr_state_graph_example.png}
	\caption{Example state graph for heating water in a kettle}
	\label{fig:kr_qr_state_graph_example}
\end{figure}
\subsubsection{Compositional modeling}
\begin{itemize}
	\item To improve re-usability of QR definitions, we can split the model into several parts
	\item We define a hierarchy of generic entities (e.g. population) on which we will operate
	\item The library of model fragments contains structures that define certain entities/quantities and their relations (if a population exists, then...). These can be conditioned on assumptions (e.g. closed or open population)
	\item In the scenarios, we define certain instances of our entity hierarchy (e.g. ``\textit{Green Frogs}'' as instance of entity population). Based on them, we can state initial values (e.g. population size is max), and assumptions we make (e.g. closed population)
	\item We then combine both the scenario and the model fragment to create the behavior graph
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.5\textwidth]{figures/kr_qr_compositional_modeling.png}
	\caption{Compositional modeling}
	\label{fig:kr_qr_compositional_modeling}
\end{figure}
\subsubsection{Finding transitions}
\begin{itemize}
	\item There are three common types of termination 
	\begin{itemize}
		\item \textbf{Value termination}: if a quantity has a derivative in a certain direction, then it can move to the next quantity point in this direction
		\item \textbf{Inequality termination}: Given an inequality, if one side is changing due to derivatives, then the truth value might change (from $\texttt{Q2}>\texttt{Q1}$ to $\texttt{Q2}=\texttt{Q1}$, remember continuous changes!)
		\item \textbf{Exogenous termination}: An external effect/input that leads to a change. Can for example be controlling the derivative of a quantity. The behavior can be random, sinusoidal, etc.
	\end{itemize}
	\item Another important concept for deciding on transitions is \textbf{Epsilon ordering} dealing with value termination
	\begin{itemize}
		\item We distinguish between \textit{immediate} and \textit{non-immediate} transitions
		\item Immediate transitions occur if a derivative is non-zero and the magnitude is currently at a landmark/point. Then, we will immediately move to the next state with the magnitude being changed
		\item Non-immediate transitions occur when the derivative is non-zero, but the magnitude is in an interval. Then the magnitude can change at some point, but does not have to
	\end{itemize}
	\item Another concept we have to consider is value correspondence
	\begin{itemize}
		\item If two values are constrained by a value correspondence, then this can limit the possible transitions we can have
		\item For example, in Figure~\ref{fig:kr_qr_value_correspondence_termination}, we can either apply \texttt{T1} or \texttt{T2}, but both can't happen at the same time because then the value correspondence is not fulfilled
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/kr_qr_value_correspondence_termination.png}
			\caption{Example for transitions that exclude each other because of a value constraint}
			\label{fig:kr_qr_value_correspondence_termination}
		\end{figure}
	\end{itemize}
\end{itemize}
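\begin{itemize}
	\item Worked example for epsilon ordering (a sketch; the quantity space $\left\{0, +, \max\right\}$ and the quantities \texttt{Q1}, \texttt{Q2} are assumptions chosen for illustration):
	\begin{equation*}
		\begin{split}
			\texttt{Q1}=0,\hspace{1mm} \partial \texttt{Q1}>0 & \implies \texttt{Q1} \text{ moves to } + \hspace{3mm}\text{(immediate: magnitude at landmark } 0\text{)}\\
			\texttt{Q2}=+,\hspace{1mm} \partial \texttt{Q2}>0 & \implies \texttt{Q2} \text{ may move to } \max \hspace{3mm}\text{(non-immediate: magnitude in interval } +\text{)}
		\end{split}
	\end{equation*}
	If both terminations are possible in the same state, the usual reading of epsilon ordering is that the immediate transition of \texttt{Q1} takes place first: leaving a point takes an arbitrarily small amount of time, whereas \texttt{Q2} needs a finite amount of time to reach $\max$.
\end{itemize}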

================================================
FILE: Knowledge_Representation/kr_sat.tex
================================================
\section{Satisfiability solvers}
\subsection{Propositional Logic}
\begin{itemize}
	\item In Knowledge Representation, we have three sets of rules/formulas:
	\begin{itemize}
		\item \textit{Knowledge base}: statements which are known to be true/need to be fulfilled by all models (\textit{axioms})
		\item \textit{Premises}: statements that are only true for certain states/inputs (\textit{implications} at partial truth value assignment)
		\item \textit{Conclusion}: statements for which we want to check whether there exists a model (\textit{conjecture}). Derived from the knowledge base and premises
	\end{itemize} 
	\item Propositional logic is based on simple statements (\textit{literals}) that can be combined into complex statements (\textit{formulas})
	\begin{itemize}
		\item \textit{Conjunction}: $A\wedge B$
		\item \textit{Disjunction}: $A\vee B$
		\item \textit{Negation}: $\lnot A$
		\item \textit{Implication}: $A\to B$ ($\equiv\lnot A \vee B\equiv \lnot (A \wedge \lnot B)$)
	\end{itemize}
	\item Truth values are assigned to literals by an interpretation function $I$
	\item \textbf{Clausal normal form}
	\begin{itemize}
		\item Every formula can be rewritten in CNF (conjunction of disjunctions)
		\item Example: $(A\vee B\vee C\vee D)\wedge(E\vee F)\wedge(\lnot A \vee F \vee D)\wedge ...$
		\item To rewrite a formula to CNF, we have to remove implications ($A\to B \equiv\lnot A \vee B$), move negations in front of the literals ($\lnot (A\vee B) \equiv \lnot A \wedge \lnot B$) and move conjunctions outside ($A \vee (B\wedge C) \equiv (A\vee B)\wedge (A \vee C)$) 
	\end{itemize}
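	\item A small worked CNF conversion (the formula $(A\to B)\to C$ is chosen here purely for illustration), applying the three rewriting steps in order and using $\lnot\lnot A\equiv A$:
	\begin{align*}
		(A\to B)\to C &\equiv \lnot(\lnot A \vee B) \vee C && \text{(remove implications)}\\
		&\equiv (A \wedge \lnot B) \vee C && \text{(move negations in front of the literals)}\\
		&\equiv (A \vee C) \wedge (\lnot B \vee C) && \text{(move conjunctions outside)}
	\end{align*}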
	\item Propositional logic is a weak language i.e. it is less expressive (no instances, no functions on terms, ...)
	\item We can express first-order logic in propositional logic by instantiating all quantifiers and all possible input arguments to predicates
	\begin{itemize}
		\item Only possible for finite domains. Exponential explosion of number of instances
		\item Example: Domain $\left\{A, B, C\right\}$, $\forall x P(x) \vee Q(x,x)$ $\implies$ $(P\_A \vee Q\_A\_A)\wedge (P\_B \vee Q\_B\_B) \wedge (P\_C \vee Q\_C\_C)$
		\item $\exists x \forall y Q(x, y) \implies (Q\_A\_A \wedge Q\_A\_B \wedge Q\_A\_C) \vee (Q\_B\_A\wedge...)\vee ...$
	\end{itemize}
	\item \textit{Tautology}: formula that is always true (e.g. $A\vee \lnot A$), also called valid sentence. For $n$ symbols, there are $2^n$ possible truth assignments to check.
	\item \textit{Contradiction}: formula that is always false (e.g. $A\wedge \lnot A$), also called inconsistent sentence
	\item \textit{Satisfiable sentence}: formula that can be made true in at least one world (i.e. not inconsistent)
	\item A \textit{model} is a ``possible world'' (truth assignment) in which the knowledge base is true (including conjecture if given)
	\item $P\models Q$ means that any model of $P$ also fulfills $Q$
\end{itemize}
\subsection{Davis-Putnam algorithm}
\begin{itemize}
	\item Algorithm for deciding whether a model exists for a given knowledge base; if one exists, the algorithm \textbf{also} returns it as a result
	\item In general, this problem is NP-complete
	\begin{itemize}
		\item Any other NP problem can be reduced to SAT
		\item Exponential time $\mathcal{O}(2^n)$ in the number of variables $n$, but algorithms are optimized to be fast for most cases (only the worst case is exponential)
	\end{itemize}
	\item The DPLL algorithm is summarized in Figure~\ref{fig:kr_sat_dpll_algorithm} and described in more detail here:
	\begin{enumerate}
		\item[Step 1] \textbf{Simplification}: iteratively remove unnecessary clauses and derive literal implications
		\begin{itemize}
			\item \textit{Tautology}: remove tautologies like $P\vee \lnot P$ from knowledge base (once in the beginning)
			\item \textit{Pure literals}: set predicates that solely occur in their positive \textit{or} negative form to the corresponding truth value
			\item \textit{Unit clauses}: set literals for which the knowledge base contains a unit clause to true (or the predicate to false respectively)
		\end{itemize}
		\item[Step 2] \textbf{Split}: pick a predicate and assume a truth value
		\begin{itemize}
			\item Heuristics of which literal to pick next can improve the efficiency of DPLL a lot
			\item \textbf{DLCS} (\textit{Dynamic Largest Combined Sum}): Pick $v$ with the largest count of positive and negative occurrences: $CP(v) + CN(v)$. If $CP(v) > CN(v)$, choose $v=1$, else $v=0$
			\item \textbf{DLIS} (\textit{Dynamic Largest Individual Sum}): Pick $v$ with either largest $CP$ or $CN$. Same truth assignment as for \textit{DLCS}.
			\item \textbf{Jeroslow-Wang}: weight of literal depends on the length of clauses it occurs in. Thereby, we prefer small clauses. The score of a literal is $J(v)=\sum_{c\in \mathcal{C}_v} 2^{-|c|}$, and we pick the highest value. One-sided JW looks at $v$ and $\lnot v$ independently, whereas the two-sided approach looks at sum of $v$ and $\lnot v$, and just picks the truth value based on $J(v)\geq J(\lnot v)\Rightarrow v=1$, else $v=0$.
			\item \textbf{MOMs} (\textit{Maximum Occurrences in Clauses of Minimum Size}): similar to JW, but we only look at the smallest clauses in the knowledge base. The number of occurrences in those is given by the function $f$. We choose the literal that maximizes $[f(v)+f(\lnot v)]\cdot 2^{k}+f(v)\cdot f(\lnot v)$. $k$ is a tuning parameter for the trade-off between the total number of occurrences of $v$ and $\lnot v$ and a balanced distribution between the two.
		\end{itemize}
	\end{enumerate}
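	\item To illustrate the split heuristics, consider the (made-up) clause set $(A\vee B)\wedge(\lnot A\vee C)\wedge(A\vee\lnot B\vee C)$:
	\begin{itemize}
		\item DLCS: $CP(A)+CN(A)=2+1=3$, $CP(B)+CN(B)=1+1=2$, $CP(C)+CN(C)=2+0=2$. We therefore pick $A$, and set $A=1$ because $CP(A)>CN(A)$
		\item Jeroslow-Wang: $J(A)=2^{-2}+2^{-3}=0.375$, $J(\lnot A)=2^{-2}=0.25$, $J(B)=0.25$, $J(\lnot B)=0.125$, $J(C)=0.375$, $J(\lnot C)=0$. The two-sided variant picks $A$ (the sum $J(A)+J(\lnot A)=0.625$ is the largest) and sets $A=1$ because $J(A)\geq J(\lnot A)$
	\end{itemize}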
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/kr_sat_dpll_algorithm.png}
		\caption{DPLL pseudo-code algorithm}
		\label{fig:kr_sat_dpll_algorithm}
	\end{figure}
\end{itemize}
\subsubsection{Clause Learning}
\begin{itemize}
	\item \textit{Non-chronological backtracking}: If we encounter a conflict at the end of a search branch, we try to find the root of the problem, and then backtrack to that point.
	\item The implications (decisions/assignments) that lead to a conflict can be modeled as an acyclic graph where each node represents a literal assignment and each edge represents the reason for that assignment (see Figure~\ref{fig:kr_sat_conflict_clause_figure})
	\item We can then partition the graph such that one side contains at least all decision variables (called \textit{reason side}), and the other the conflict literal (called \textit{conflict side}). 
	\item The conflict clause is determined by the negations of the literals associated with the cut between both sides. 
	\item Learning schemes differ in where they cut the implication graph: each cut implies a different conflict clause and hence a different amount of information gained from the conflict.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/kr_sat_conflict_clause_figure.pdf}
		\caption{Conflict clause implication graph}
		\label{fig:kr_sat_conflict_clause_figure}
	\end{figure}
	\item Conflict clause learning is a selective application of resolution (probably more useful than random resolution). General resolution rule:
	$$\frac{\hspace{2mm}A\vee \lnot B\hspace{6mm}C\vee B\hspace{2mm}}{A\vee C}$$
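	\item As a minimal illustration (clauses made up for this example): suppose the knowledge base contains $(\lnot x_1 \vee x_2)$ and $(\lnot x_1 \vee \lnot x_2)$, and we decide $x_1=1$. The first clause then forces $x_2=1$, which falsifies the second clause. Resolving the two clauses on $x_2$ yields the conflict clause $\lnot x_1$:
	$$\frac{\hspace{2mm}\lnot x_1\vee \lnot x_2\hspace{6mm}\lnot x_1\vee x_2\hspace{2mm}}{\lnot x_1}$$
	Adding this clause to the knowledge base prevents the solver from trying $x_1=1$ again.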
\end{itemize}
\subsection{Stochastic solver}
\begin{itemize}
	\item Properties of a SAT solver
	\begin{itemize}
		\item \textit{Decidability} = completeness. 
		\begin{itemize}
			\item This means that given enough runtime, the SAT solver is guaranteed to find an assignment or to report that there is no solution.
			\item Although DPLL is complete in theory, it is not in practice as we have limited runtime to wait for an answer
			\item Note that \textit{undecidability} just indicates that it is not \textit{always} guaranteed to get an answer (may be harmless in practice)
		\end{itemize}
		\item \textit{Complexity}: exponential maximum runtime $\mathcal{O}(2^n)$
		\begin{itemize}
			\item However, mean instead of worst-case runtime is more important in practice
			\item Also, $\mathcal{O}$-notation does not consider constants whereas with limited runtime, it might be important
			\begin{figure}[ht!]
				\centering
				\includegraphics[width=0.2\textwidth]{figures/kr_sat_prob_worst_case.png}
				\caption{Comparison of worst case runtime $\mathcal{O}(n)$ and $\mathcal{O}(n^2)$ with different constants}
				\label{fig:kr_sat_prob_worst_case}
			\end{figure}
		\end{itemize}
	\end{itemize}
	\item A good measurement for complexity in case of SAT solvers has been shown to be the ratio of clauses to variables (see Figure~\ref{fig:kr_sat_prob_hardness})
	\item Problems with a low ratio are easy to solve as they have (on average) many solutions
	\item Problems with a high ratio tend to have no solution, and it is easy to show that a conflict exists (see Figure~\ref{fig:kr_sat_prob_hardness_2})
	\item In between, around a ratio of $4.26$ (for random 3-SAT), lie the hardest randomly generated problems, as there might be no solution or only a few. This point is also referred to as the \textit{phase transition}
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}[b]{0.23\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/kr_sat_prob_hardness.png}
			\caption{Problem complexity}
		\end{subfigure}
		\hspace{10mm}
		\begin{subfigure}[b]{0.4\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/kr_sat_prob_hardness_2.png}
			\caption{Proportion of satisfiable problems}
			\label{fig:kr_sat_prob_hardness_2}
		\end{subfigure}
		\caption{Hardness in SAT problems}
		\label{fig:kr_sat_prob_hardness}
	\end{figure}
	\item Stochastic solvers have been shown to perform quite well on such randomly generated SAT instances, but might perform poorly compared to DPLL on highly structured problems
	\item Stochastic solvers are not intended to find all solutions, nor to prove that none exists. They are, however, useful for e.g. MAXSAT (see later)
	\item Stochastic SAT solvers perform local search 
	\begin{enumerate}
		\item Make a guess (smart or random) about values of the variables
		\item Evaluate how many clauses are broken
		\item Try flipping a variable to make things better for a certain number of iterations (various heuristics in which variables to flip next)
	\end{enumerate}
	\item This search is repeated $N$ times until we either find a solution or terminate without an answer
	\item \textbf{GSAT}: Local greedy search/algorithm of picking the next variable to flip which increases the number of satisfied clauses the most (ties are broken randomly, note that flips will break also some clauses).
	\item Full algorithm shown in Figure~\ref{fig:kr_sat_prob_GSAT}
	\item Note that this algorithm tends to get stuck in local optima (flipping a single variable does not increase the score). Thus, we perform random restarts to start anew. GSAT also spends most of its time on plateaus where the score does not improve
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/kr_sat_prob_GSAT.png}
		\caption{Pseudo-code of GSAT}
		\label{fig:kr_sat_prob_GSAT}
	\end{figure}
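	\item One greedy step on a made-up instance: for the clauses $(A\vee B)\wedge(\lnot A\vee C)\wedge(\lnot B\vee\lnot C)$ and the initial guess $A=B=C=0$, two of the three clauses are satisfied ($A\vee B$ is broken). Flipping $A$ fixes $A\vee B$ but breaks $\lnot A\vee C$ (still 2 satisfied), flipping $C$ changes nothing (2 satisfied), and flipping $B$ satisfies all 3 clauses. The greedy choice is therefore to flip $B$, which yields the solution $A=0, B=1, C=0$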
\end{itemize}
\subsection{MAXSAT}
\begin{itemize}
	\item MAXSAT is an \textit{optimization} extension of SAT that asks for the maximum number of clauses that can be simultaneously satisfied
	\item Example: $\lnot A\wedge (A \vee B) \wedge (\lnot B)$ is a contradiction. But the truth assignment $\pi = \left\{\lnot A, \lnot B\right\}$ maximizes the number of satisfied clauses
	\item Some clauses might be more important to be satisfied than others $\Rightarrow$ adding a cost/positive weight to every clause $C$ that will be incurred if $C$ is falsified
	\item A cost of $\infty$ indicates \textit{hard} clauses that are mandatory to satisfy, whereas \textit{soft} clauses have a finite cost
	\item We try to minimize the sum of the costs of all unsatisfied clauses
	\item There are different variations of MAXSAT:
	\begin{itemize}
		\item (Standard) MAXSAT: no hard clauses and all have a weight of 1 (solution maximizes the number of satisfied clauses)
		\item \textit{Weighted} MAXSAT: no hard clauses, but soft clauses with any finite, positive weight
		\item \textit{Partial} MAXSAT: hard clauses are allowed, but all soft clauses have weight 1
		\item \textit{Weighted Partial} MAXSAT: both hard and soft clauses are allowed. Subsumes all previously mentioned variations.
	\end{itemize}
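	\item Worked example (weights made up for illustration): take the clauses $\lnot A$, $A\vee B$ and $\lnot B$ from the example above with weights $w(\lnot A)=3$, $w(A\vee B)=\infty$ (hard) and $w(\lnot B)=1$. The four truth assignments incur the following costs:
	$$\begin{array}{cc|c|c} A & B & \text{falsified clauses} & \text{cost}\\ \hline 0 & 0 & A\vee B & \infty\\ 0 & 1 & \lnot B & 1\\ 1 & 0 & \lnot A & 3\\ 1 & 1 & \lnot A,\ \lnot B & 4 \end{array}$$
	The weighted partial MAXSAT solution is therefore $A=0, B=1$ with cost $1$. With unit weights and no hard clauses (standard MAXSAT), the first three assignments all falsify exactly one clause and are equally good.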
\end{itemize}
\subsection{SAT planning}
\begin{itemize}
	\item Real-world planning problems can be translated to a SAT problem, and the plan can be extracted from the truth assignments\\
	\begin{minipage}{.65\textwidth}
		\item Formal definition of the planning problem
	\begin{itemize}
		\item States $S=\left\{...,s_i,...\right\}$ (in the example $S=\left\{s_0, s_1, s_2, s_3, s_4, s_5\right\}$)
		\item Actions $A=\left\{..., a_i,...\right\}$ (in the example $A=\left\{\texttt{move1}, \texttt{move2}, \texttt{put}, \texttt{take}, \texttt{load}, \texttt{unload}\right\}$)
		\item State-transition function $\gamma:S\times A\to S$ (note that we only consider deterministic transitions, otherwise $2^S$ output)
		\item Planning problem $P=(\Sigma, s_0, s_G)$ where $\Sigma=(S,A,\gamma)$, and $s_0$ initial state and $s_G$ (set of) goal states
		\item Classical plan is a sequence of actions: $\pi = \langle a_0, a_1, ..., a_{n-1}\rangle$
		\item Policy is a partial function from $S$ to $A$
	\end{itemize}
	\end{minipage}
	\hspace{10mm}
	\begin{minipage}{.3\textwidth}
			\includegraphics[width=0.8\textwidth]{figures/kr_sat_planning_example.png}
			\label{fig:kr_sat_planning_example}    
	\end{minipage}
	\item Translation for a planning problem $P$ and a fixed plan length $n$:
	\begin{itemize}
		\item If $\pi=\langle a_0, a_1, ..., a_{n-1}\rangle$ is a solution to the planning problem, we know that the traversed states are $s_0, s_1 = \gamma(s_0, a_0), s_2 = \gamma(s_1, a_1), ..., s_n = \gamma(s_{n-1}, a_{n-1})$ (where $a_i$ is the $i$-th step of $\pi$, and $s_i$ the states in which the agent is at step $i$)
		\item We denote all possible literals with $L$
		\item Formula describing the initial state: $\bigwedge\left\{l_0\hspace{1mm}|\hspace{1mm}l \in s_0\right\} \wedge \bigwedge\left\{\lnot l_0\hspace{1mm}|\hspace{1mm}l \in L-s_0\right\}$
		\item Formula describing the goal state: $\bigwedge\left\{l_n\hspace{1mm}|\hspace{1mm}l \in g^+\right\} \wedge \bigwedge\left\{\lnot l_n\hspace{1mm}|\hspace{1mm}l \in g^{-}\right\}$ ($g^{+}$: literals that must be true in the goal state, $g^{-}$: literals that must be false in the goal state)
		\item Formula describing the state-transitions. For every action $a$ and step $i$, add $a_i \Rightarrow \left(\bigwedge\left\{p_i\hspace{1mm}|\hspace{1mm}p\in \text{Precond}(a)\right\} \wedge \bigwedge\left\{e_{i+1}\hspace{1mm}|\hspace{1mm}e\in \text{Effects}(a)\right\}\right)$
		\item Complete exclusion axiom: only one action per step ($\lnot a_{1,i} \vee \lnot a_{2,i}$,...)
		\item Frame axioms that describe what \textit{doesn't} change between $i$ and $i+1$: $\left(\lnot l_i \wedge l_{i+1} \Rightarrow \bigvee_{a\in A} \left\{ a_i | l \in \text{Effects}^+ (a) \right\} \right) \wedge \left(l_i \wedge \lnot l_{i+1} \Rightarrow \bigvee_{a\in A} \left\{ a_i | l \in \text{Effects}^- (a) \right\} \right)$ ($\text{Effects}^+ (a)$: set of literals that change their truth value to true if $a$ is performed, and $\text{Effects}^- (a)$ those that are changed to false)
	\end{itemize}
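	\item A toy instance of this translation (domain made up for illustration): let $L=\{\texttt{on}\}$, $s_0=\emptyset$ (i.e. \texttt{on} is false initially), $g^+=\{\texttt{on}\}$, and let there be a single action \texttt{switch} without preconditions and with $\text{Effects}^+(\texttt{switch})=\{\texttt{on}\}$. For $n=1$, the encoding is
	$$\underbrace{\lnot \texttt{on}_0}_{\text{initial state}} \wedge \underbrace{\texttt{on}_1}_{\text{goal}} \wedge \underbrace{(\texttt{switch}_0 \Rightarrow \texttt{on}_1)}_{\text{action}} \wedge \underbrace{(\lnot \texttt{on}_0 \wedge \texttt{on}_1 \Rightarrow \texttt{switch}_0) \wedge (\texttt{on}_0 \wedge \lnot\texttt{on}_1 \Rightarrow \bot)}_{\text{frame axioms}}$$
	Its only model is $\texttt{on}_0=0$, $\texttt{on}_1=1$, $\texttt{switch}_0=1$, from which we read off the plan $\pi=\langle\texttt{switch}\rangle$. For $n=0$, initial state and goal give $\lnot\texttt{on}_0\wedge\texttt{on}_0$, which is unsatisfiable, so the shortest plan has length $1$.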
	\item We apply a SAT solver on this knowledge base. If we find a solution, then $P$ has a solution of length $n$
	\item To find solutions of shortest length, we loop over different values for $n$. If we don't find any solution for $n=0$, we encode the problem for $n=1$, and so on.
	\item In practice, SAT solvers for planning take too much time and memory, but can be combined with other techniques like planning-graph expansion (SatPlan).
\end{itemize}
\subsection{Applications of SAT}
\begin{itemize}
	\item Model checking in terms of hardware and software verification (can state $S$ ever be reached? Is state $T$ always reached after $S$?)
	\item Classical planning
	\item Combinatorial design (existence of mathematical structures)
	\item Solving subproblems in domains like scheduling, test pattern generation, multi-agent systems, ...
\end{itemize}

================================================
FILE: Knowledge_Representation/kr_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}
\definecolor{colkeyword}{rgb}{0,0.4,0}
\definecolor{colname}{rgb}{0.4,0.4,0}
\definecolor{coltype}{rgb}{0.4,0,0.4}
\definecolor{coloperators}{rgb}{0,0,1.0}
\definecolor{colscopes}{rgb}{0.4,0,0}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Knowledge Representation}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{kr_intro.tex}
\input{kr_sat.tex}
\input{kr_dl.tex}
\input{kr_csp.tex}
\input{kr_qr.tex}
\appendix
% \newpage
% \input{kr_appendix.tex}
\end{document}

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2022 Phillip Lippe

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: ML4QS/mlqs_clustering.tex
================================================
\section{Clustering}
\begin{itemize}
	\item Using the features engineered before to cluster instances
	\item Two different learning setups
	\begin{itemize}
		\item \textit{Per instance}: cluster data points/instances of quantified selfs. If multiple quantified selfs are available, we concatenate the datasets.
		\item \textit{Per person}: cluster different quantified selfs. Here, we consider all recorded instances of a person as a single data point, and we compare datasets/persons and cluster them. 
	\end{itemize}
\end{itemize}
\subsection{Distance Metrics}
\begin{itemize}
	\item We need different distance metrics per scenario
	\item We have to distinguish between feature-level and dataset-level
\end{itemize}
\subsubsection{Feature-level distance metrics}
\begin{itemize}
	\item For \textit{numerical} features, we can use the Minkowski distance $\left(\sum_k \left|x_i^k - x_j^k\right|^q\right)^{1/q}$ which subsumes the Euclidean ($q=2$) and Manhattan ($q=1$) distance. However, we need to consider the scaling of the features (assumed to be equal)
	\item For \textit{categorical} features, we can use Gower's similarity
	\begin{itemize}
		\item For binary attributes (called \textit{dichotomous}) $s(x_i^k, x_j^k)=1$ if $x_i^k$ and $x_j^k$ are present (i.e. are 1), else 0
		\item For categorical, we have $s(x_i^k, x_j^k)=\mathbbm{1}(x_i^k = x_j^k)$
		\item For numerical values in a range $R$, the Gower's similarity is $s(x_i^k, x_j^k)=1 - \frac{|x_i^k - x_j^k|}{R}$
		\item Similarity over multiple attributes is the mean of them
		\item Note that this is a similarity and not a distance (the two are inversely related, $\text{Similarity}\sim1/\text{Distance}$)
	\end{itemize}
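	\item Worked example (values made up for illustration): for $x_i=(1,5)$ and $x_j=(4,1)$, the Manhattan distance ($q=1$) is $|1-4|+|5-1|=7$ and the Euclidean distance ($q=2$) is $\sqrt{3^2+4^2}=5$. For Gower's similarity, take one numerical attribute with range $R=10$ and values $2$ and $7$, and one categorical attribute with equal values: $s=\frac{1}{2}\left(\left(1-\frac{5}{10}\right)+1\right)=0.75$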
\end{itemize}
\subsubsection{Dataset-level distance metrics}
\begin{itemize}
	\item We have to distinguish between datasets with and without temporal ordering
	\item \textbf{Non-temporal personal level distance metrics}: three different approaches possible
	\begin{enumerate}
		\item Summarize values per attribute over the entire dataset into a single number, as e.g. take the mean, min, max, stddev, etc. On these, we can use the same distance metrics as before
		\item Estimate parameters of a distribution that describes the dataset, such as a normal distribution with $\mathcal{N}(\mu, \sigma^2)$. On the parameters $\mu, \sigma^2$ we can apply the same distance metrics as before
		\item Compare the distributions of values for an attribute with a statistical test, such as the Kolmogorov-Smirnov test. The distance metric would be $1-p$ where $p$ is the $p$-value returned by the test.
	\end{enumerate}
	\item \textbf{Temporal personal level distance metrics}: again, three different approaches
	\begin{enumerate}
		\item \textit{Feature-based}: extract features from the two time series, such as those from Section~\ref{sec:chapter_4_feature_engineering} (time and frequency domain).
		\item \textit{Model-based}: we try to fit a model on the two time series, and use those parameters to compare them. For example, we could use dynamical systems or similar
		\item \textit{Raw-data based} uses a distance per point.
		\begin{itemize}
			\item For example, it can assume an equal number of points in both datasets, and just take e.g. the Euclidean distance per time point
			\item Alternatively, we can also take a possible lag into account (shifted dataset). Then we compute the cross correlation coefficient $ccc(\tau, x_{qs_i}^{l}, x_{qs_j}^{l})=\sum_{k=-\infty}^{\infty} x_{k, qs_i}^{l} \cdot x_{k+\tau, qs_j}^{l}$. 
			\item Optimize $\tau$ by $\arg\min_{\tau} \sum_{k=1}^{p} \frac{1}{ccc\left(\tau, x_{qs_i}^{k}, x_{qs_j}^{k}\right)}$, where the sum runs over all $p$ attributes. Note that we have a single $\tau$ for all attributes
			\item \textbf{Dynamic Time Warping}: make best pairs of instances in the sequence to find minimum distance. Allows different frequencies of activities
			\begin{itemize}
				\item Two conditions for pairing: \\
				\textit{Monotonicity condition}: time order has to be preserved. We cannot go ``back'' in time\\
				\textit{Boundary condition}: the first and last points of the two time series must be aligned
				\item Algorithm similar to finding shortest path in a graph.
				\begin{figure}[ht!]
					\centering
					\includegraphics[width=0.5\textwidth]{figures/chapter_5_dynamic_time_warping.png}
					\caption{Algorithm of dynamic time warping}
				\end{figure}
				\item The drawbacks of this method are that it is computationally expensive ($\mathcal{O}(N\cdot M)$), and that the distance metric might not always be the best fit (i.e. should it be allowed to align the first point of sequence 1 with the last point of sequence 2?)
				\item \comment{For this algorithm, it helps more to practice it several times instead of writing it down in all details.}
			\end{itemize}
		\end{itemize} 
	\end{enumerate}
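	\item A small dynamic time warping example (sequences made up for illustration, and assuming the common recurrence $D(i,j)=|a_i-b_j|+\min\left\{D(i-1,j), D(i,j-1), D(i-1,j-1)\right\}$, which may differ in details from the figure above): for $a=(1,2,3)$ and $b=(1,3)$, the cumulative cost matrix is
	$$\begin{array}{c|cc} D(i,j) & b_1=1 & b_2=3\\ \hline a_1=1 & 0 & 2\\ a_2=2 & 1 & 1\\ a_3=3 & 3 & 1 \end{array}$$
	The distance is the bottom-right entry $D(3,2)=1$, reached for example by the pairing $(a_1,b_1), (a_2,b_1), (a_3,b_2)$. This pairing respects both the monotonicity and the boundary condition.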
\end{itemize}
\subsection{Clustering approaches}
\begin{itemize}
	\item Overview of different clustering approaches
	\item \textbf{K-means}: define $k$ cluster means. A point is assigned to the cluster whose mean is closest. Means are updated by the assigned points.
	\begin{itemize}
		\item Using the Silhouette score to select the best $k$/determine whether a clustering is good:
		\begin{equation*}
			\begin{split}
				\text{silhouette} = \frac{\sum_{i=1}^{N}\frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))}}{N}\hspace{3mm} & \text{where}\hspace{2mm} a(x_i)=\frac{\sum_{\forall x_j \in C_l} d(x_i, x_j)}{|C_l|} \hspace{3mm} \text{(} x_i\in C_l\text{)}\\
				& \hspace{2mm}\text{and} \hspace{4mm} b(x_i) = \min_{C_m \neq C_l}\frac{\sum_{\forall x_j \in C_m} d(x_i, x_j)}{|C_m|}
			\end{split}
		\end{equation*}
		In words, the silhouette score compares the average distance of a point to the other points within its cluster ($a(x_i)$) with the average distance to the points of the next closest cluster ($b(x_i)$). The larger the score, the better (between $[-1,1]$)
	\end{itemize}
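	\item Worked silhouette example (one-dimensional points made up for illustration, using the definitions of $a$ and $b$ from above): for the clusters $C_1=\{0,1\}$ and $C_2=\{5,6\}$ with $d(x_i,x_j)=|x_i-x_j|$, the point $x=0$ has $a=\frac{0+1}{2}=0.5$ and $b=\frac{5+6}{2}=5.5$, so its term is $\frac{5.5-0.5}{5.5}\approx 0.91$. The point $x=1$ has $a=0.5$ and $b=\frac{4+5}{2}=4.5$, giving $\frac{4}{4.5}\approx 0.89$. By symmetry, $x=5$ and $x=6$ give $0.89$ and $0.91$, so the overall score is about $0.90$, which indicates well-separated clusters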
	\item \textbf{K-medoids}: very similar to $k$-means, but we use actual points as cluster centers instead of artificial ones
	\begin{itemize}
		\item Choose new cluster centers as the point with the minimum distance to all other points in the cluster
		\item More suitable if certain points in search space might not make sense
		\item For example, k-medoids is known to work better for person-level clustering
	\end{itemize}
\end{itemize}
\subsubsection{Hierarchical clustering}
\begin{itemize}
	\item Perform clustering in an iterative approach
	\item \textbf{Divisive clustering}: start with one cluster with all points in it, and in each step, perform one split
	\begin{itemize}
		\item Define the dissimilarity of a point to all other points in a cluster as the average distance $$\text{dissimilarity}(x_i, C)=\frac{\sum_{x_j\neq x_i \in C} \text{distance}(x_i, x_j)}{|C|}$$
		\item When creating a new cluster $C'$, we add the most dissimilar points (in order of dissimilarity) until a point is more dissimilar to the points in $C'$ than points in $C$. 
		\item If we have multiple clusters, we choose the cluster to split which has the greatest distance between any points in the cluster
	\end{itemize}
	\item \textbf{Agglomerative clustering}: start with all points in separate clusters, and merge them step by step. Merge decision can be based on different criteria (equal to distance metric between clusters):
	\begin{itemize}
		\item \textit{Single linkage}: merge the two clusters with the minimum distance between any two points
		$$d_{SL}(C_k, C_l) = \min\limits_{x_i \in C_k, x_j \in C_l} \text{distance}(x_i, x_j)$$
		\item \textit{Complete linkage}: merge the two clusters where the maximum distance between any two points is minimal
		$$d_{CL}(C_k, C_l) = \max\limits_{x_i \in C_k, x_j \in C_l} \text{distance}(x_i, x_j)$$
		\item \textit{Group average}: merge the two clusters for which the average distance between all points is minimal
		$$d_{GA}(C_k, C_l) = \frac{\sum\limits_{x_i \in C_k, x_j \in C_l} \text{distance}(x_i, x_j)}{|C_k|\cdot |C_l|}$$
		\item \textit{Ward's criterion}: merge the two clusters where the increase of standard deviation by the combined cluster is minimal
		$$d_{Ward}(C_k, C_l) = \sigma^2_{C_k \cup C_l} - \left(\sigma^2_{C_l} + \sigma^2_{C_k}\right)$$
	\end{itemize}
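	\item Worked example for the linkage criteria (one-dimensional clusters made up for illustration): for $C_k=\{0,2\}$ and $C_l=\{5,9\}$, the pairwise distances are $|0-5|=5$, $|0-9|=9$, $|2-5|=3$ and $|2-9|=7$. Single linkage therefore gives a cluster distance of $3$, complete linkage gives $9$, and group average gives $\frac{5+9+3+7}{2\cdot 2}=6$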
\end{itemize}
\subsubsection{Subspace clustering}
\begin{itemize}
	\item The problem of standard clustering algorithms for a huge number of features is that the distance between two points becomes uninformative (small distance in all features compared to big difference in only one feature). Hence, the clusters get less meaningful as well
	\item Better approach is therefore to look at subspaces in the feature space.
	\item Pseudo algorithm:
	\begin{enumerate}
		\item For all features, define intervals of the feature space. This leads to units $u=\{u_1, ..., u_p\}$ where $u_i(l)$ is the lower bound for attribute $i$ in this unit, and $u_i(h)$ the upper bound respectively. Note that units are not required to cover all features, but can look at subspaces only (equivalent to setting the lower bound to $-\infty$ and the upper bound to $\infty$). 
		\item Determine selectivity of a unit as proportion of points in them. We call a unit ``dense'' if it contains more points/higher proportion than a certain threshold (hyperparameter).
		\item Units are connected to a cluster if they 
		\begin{itemize}
			\item share a common face, which is defined as having a lower bound equal to the upper bound of another unit (or the other way round) for one attribute, and the same upper and lower bounds for all other attributes. 
			\item or when they share a unit to which they both have a common face.
		\end{itemize}
	\end{enumerate}
	\item In the end, we strive to find dense units using a combination of attributes to form clusters
	\item To reduce number of attributes, we can start to make units with a single attribute, and add more attributes iteratively based on some \textit{fancy} algorithm. After having found the units, we can create the clusters
\end{itemize}

================================================
FILE: ML4QS/mlqs_feature_engineering.tex
================================================
\section{Feature Engineering}
\label{sec:chapter_4_feature_engineering}
\begin{itemize}
	\item Create useful features from temporal data
\end{itemize}
\subsection{Time Domain}
\begin{itemize}
	\item Summarize the values of a certain attribute within a window of the $\lambda$ preceding time steps. If we take the time steps $t$ and $t-1$ into account, our value for $\lambda$ is $1$
	\item Note that we cannot compute any values for the time steps $t=1,..,\lambda$.
\end{itemize}
\subsubsection{Numerical}
\begin{itemize}
	\item Aggregate values by mean, min, max, stddev, etc. (including current time step)
	\item We could also use the slope of a linear interpolation between the first and last value (gradient of the attribute)
\end{itemize}
\subsubsection{Categorical}
\begin{itemize}
	\item Generate patterns of occurrences of categorical values. We distinguish between successive \texttt{(b)} and co-occurring \texttt{(c)} actions/classes. Example patterns:
	\begin{itemize}
		\item Activity level = high \texttt{(c)} Activity = running
		\item Activity = running \texttt{(b)} Activity = running
	\end{itemize}
	\item If we have a window size of $\lambda$, we just check whether there is any time step within $t-\lambda, ..., t-1, t$ where the activities are co-occurring \texttt{(c)}, or whether one activity happens (an arbitrary number of time steps) before another \texttt{(b)}.
	\item We can find important patterns by determining the support of such. This is important as the number of patterns exponentially increases with the number of categories/attributes, and only frequent patterns are interesting.
	\item The support of a pattern is defined as the proportion of the processed time steps at which this pattern would occur.
	\item The algorithm for finding such patterns starts with single attribute patterns, and extends only those which have sufficient support. In the end, we add all patterns (including the single-attribute) with enough support.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/chapter_4_pattern_identification.png}
		\caption{Pattern identification algorithm}
	\end{figure}
\end{itemize}
\subsubsection{Mixed data}
\begin{itemize}
	\item We can also combine numerical and categorical attributes for features
	\item Hereby, we create categories from numerical data by looking at them \textbf{qualitatively} (greetings from QR). For example, we can define ranges like low, medium and high, or look at the trend/gradients as \textit{increasing} or \textit{decreasing}
	\item Afterwards, we can apply the categorical approach on those
\end{itemize}
\subsection{Frequency Domain}
\begin{itemize}
	\item Apply Fourier transformation on data within a window of $\lambda$ (plus the current time point $t$) to extract periodicity of the data
	\item Assume a base frequency of $f_0 = \frac{2\pi}{\lambda+1}$ (or $f_0 = \frac{N_{\text{sec}}}{\lambda+1}$ in seconds) which is the lowest frequency with a complete sinusoid in it. 
	\item We look at all the frequencies $\left\{0\cdot f_0, 1\cdot f_0, ..., \lambda \cdot f_0\right\}$ and determine the corresponding amplitudes
	\item Our features can be:
	\begin{itemize}
		\item Frequency with highest amplitude
		\item Frequency-weighted signal average $\frac{\sum_{k=0}^{\lambda} a_{t-\lambda}^{t}(k) \cdot f(k)}{\sum_{k=0}^{\lambda} a_{t-\lambda}^{t}(k)}$
		\item \textit{Power Spectrum Entropy}: Amount of information in the signal
		$$x\_pse = - \sum_{k=0}^{\lambda} p_{t-\lambda}^{t}(k) \ln p_{t-\lambda}^{t}(k), \hspace{3mm}\text{with}\hspace{2mm} p_{t-\lambda}^{t}(k) = \frac{|a_{t-\lambda}^{t}(k)|^2}{\sum_{i=0}^{\lambda} |a_{t-\lambda}^{t}(i)|^2}$$
	\end{itemize}
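	\item Worked example for the power spectrum entropy (amplitudes made up for illustration, using the convention $0\ln 0=0$): if all power sits in a single frequency, then $p(k)=1$ for that frequency and $x\_pse=0$. If the power is split equally between two frequencies, then $p=\frac{1}{2}$ for both and $x\_pse=-2\cdot\frac{1}{2}\ln\frac{1}{2}=\ln 2\approx 0.69$. A flatter spectrum therefore gives a higher entropy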
\end{itemize}
\subsection{Unstructured data}
\begin{itemize}
	\item How to handle non-temporal/unstructured data like text, audio, images, etc.
	\item Here we focus on text. The standard pipeline contains four steps:
	\begin{itemize}
		\item \textit{Tokenization}: split sentence into smallest parts 
		\item \textit{Lower case}: convert all words to lower case so that capitalization makes no difference
		\item \textit{Stemming}: reduce words to their stem to remove all small variations (like tense, etc.)
		\item \textit{Stop word removal}: remove known, uninformative stop words 
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/chapter_4_text_pipeline.png}
		\caption{Pipeline of text processing}
	\end{figure}
	\item Three approaches in general
	\begin{itemize}
		\item \textbf{Bag of words}: count occurrences of n-gram within text. These counts are the features for the text
		\item \textbf{TF-IDF}: BoW does not take ``uniqueness'' of word into account. Thus, TF-IDF takes occurrences of a word in a text and in the whole corpus into account
		\item \textbf{Topic modeling}: assume that any text is a combination of $k$ topics. Perform LDA (Latent Dirichlet Allocation) to find these topics; the topic distribution of a text then serves as its features.
	\end{itemize}
\end{itemize}

================================================
FILE: ML4QS/mlqs_intro.tex
================================================
\section{Introduction}
\label{sec:chapter_1_2_introduction}
\subsection{Definitions}
\begin{itemize}
	\item The quantified self can be defined as:
	\blockquote{\textit{The quantified self is any individual engaged in the self-tracking of any kind of biological, physical, behavioral, or environmental information. \\The self-tracking is driven by a certain goal of the individual with a desire to act upon the collected information.}}
	\item \textbf{Augemberg} (2012): there are various types of measurements, including:
	\begin{itemize}
		\item Physical activities (movement by accelerometer, steps, etc.)
		\item Diet (calories consumed, fat, protein, etc.)
		\item Psychological states (mood, emotions, depression, etc.)
		\item Mental and cognitive traits (IQ, reaction, memory, etc.)
		\item Environmental (location, weather, etc.)
		\item Situational (time of the day, context, etc.)
		\item Social variables (influence, role in a group, etc.)
	\end{itemize} 
	\item \textbf{Choe} (2014): distinguish quantified selves into three categories based on their goal.
	\begin{itemize}
		\item Improved health (cure or manage a condition,
		execute a treatment plan, achieve a goal)
		\item Improve other aspects of life (maximize work performance, be mindful)
		\item Find new life experiences (have fun, learn new things)	
	\end{itemize}
	\item \textbf{Gimpel} (2013): identified five (non-exclusive) factors for quantified self motivation
	\begin{itemize}
		\item Self-healing (to become healthier)
		\item Self-discipline (to experience rewarding aspects of it)
		\item Self-design (to control and optimize ``yourself'', e.g. in sports)
		\item Self-association (to be associated with the movement of QS)
		\item Self-entertainment (to experience entertainment value)
	\end{itemize}
	\item Machine Learning (automatically identifying patterns from data) is slightly different in the setting of Quantified Self because
	\begin{itemize}
		\item we have to deal with sensory noise
		\item there might be missing measurements
		\item we have temporal data (feature engineering) with irregular time points
		\item there can be an interaction with a user (advice for training/mood improvements, etc.), but we cannot try out every possibility
		\item we can learn across multiple datasets/users
	\end{itemize}
	\item \comment{Most of the above definitions need to be memorized for the exam}
\end{itemize}
\subsection{Basic Terminology and Notation}
\begin{itemize}
	\item Measurement $=$ one value for an attribute recorded at a specific time point
	\item Time series $=$ series of measurements in temporal order
	\item Further notation: 
	\begin{itemize}
		\item For matrix $\bm{X}$, the columns are the different measurements like accelerometer, and rows are the time points (if dataset is temporally ordered, otherwise random list) 
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/intro_notation.png}
		\caption{Overview of notation used in this course.}
	\end{figure}
\end{itemize}
\subsection{Basic overview of Sensory Data}
\begin{itemize}
	\item Different sensors available on mobile devices, such as:
	\begin{itemize}
		\item \textit{Accelerometer}: Measures the changes in forces upon the phone along the $x$-, $y$- and $z$-axes
		\item \textit{Gyroscope}: Orientation of the phone compared to
		the earth's surface
		\item \textit{Magnetometer}: Measures $x$-$y$-$z$ orientation compared to the earth's magnetic field
	\end{itemize}
	\item Transforming raw time series data requires selecting a step size $\Delta t$ and combining the sensory data over this interval. See Section~\ref{sec:chapter_4_feature_engineering} for techniques
	\item A large $\Delta t$ gives (maybe too) coarse-grained data, but we have in the end a smaller dataset and lower standard deviation. The opposite is gained by a smaller $\Delta t$ (fine-grained data, but large dataset and high stddev)
\end{itemize}

================================================
FILE: ML4QS/mlqs_modeling_with_time.tex
================================================
\section{Predictive Modeling with Notion of Time}
\subsection{Time Series}
\begin{itemize}
	\item Understanding the periodicity and trends in data in the time domain. Can be used for forecasting or control (e.g. how can we influence the trend)
	\item A time series is built up from three components:
	\begin{itemize}
		\item \textit{Seasonality/Periodicity}: any periodic/repeating pattern over any frequency (e.g. seconds, hours or days)
		\item \textit{Trend}: how the mean evolves over time
		\item \textit{Irregular variations}: noise, everything left after we remove periodicity and trend
	\end{itemize}
	\item \textbf{Stationarity}: assumption/requirement for many algorithms applied on time series
	\begin{itemize}
		\item In general, stationarity means that the statistical properties of a process generating a time series do not change over time.
		\item We call a time series stationary if trends and periodicity are removed (mean is constant), and the variance of the remaining irregular variations is constant over time
		\item Additionally, the lagged auto correlation should be constant over time/lags $\lambda$ and close to $0$:
		$$r_{\lambda} = \frac{\sum\limits_{t=1}^{N-\lambda} (x_t - \bar{x}) (x_{t+\lambda} - \bar{x})}{\sum\limits_{t=1}^{N}(x_t - \bar{x})^2}$$
		Non-zero autocorrelations provide clues about underlying patterns in the data, which should not be present in a stationary series.
		\item If a time series is not stationary, we can mostly transform it to one by removing trend, periodicity, and try to stabilize the variance
	\end{itemize}
\end{itemize}
\subsubsection{Filtering and Smoothing}
\begin{itemize}
	\item To determine the trend of a signal, one of the simplest approaches is filtering and/or smoothing
	\item Simplest filtering is for a window size of $\pm q$ (creates a new, filtered time series):
	$$z_t = \sum\limits_{r=-q}^{q} a_r x_{t+r}$$
	\item \textbf{Differencing}
	\begin{itemize}
		\item For removing a trend, the most effective technique is differencing (or gradient filter): $z_t = x_t - x_{t-1} = \nabla x_t$. As we expect the trend to be low-frequency, its gradient is small. % Periodicity is not influenced by this filter (differentiating $\sin$ just shifts it in time). % Not 100% true because if we have \sin(0.1*x), we get 0.1 * \cos(0.1 * x). Hence, 10 times smaller
		\item We can also apply this operator multiple times, leading to a $d$-th order differencing ($\nabla^d x_t$). For $d=2$, we get $z_t = \nabla^2 x_t = x_t - 2x_{t-1}+x_{t-2}$. A $d$-th order differencing can remove trends that can be approximated by a polynomial of order $d$ or lower.
		\item A drawback of differencing is that the variance of the remaining time series increases. This can be improved by using a better approximation of the trend than $x_{t-1}$, for example an exponentially filtered signal $e_{t} = \sum\limits_{r=-q}^{0} \frac{\alpha(1 - \alpha)^{|r|}}{2 - \alpha} x_{t+r}$ (with e.g. $\alpha=0.05$), and using that for the differencing: $z_t = x_t - e_{t}$
		\item Still, we have to be careful as we might remove low-frequency periodicity as well ($\partial \sin(0.1\cdot x)/\partial x = 0.1 \cdot \cos(0.1 \cdot x)$ $\Rightarrow$ dampen signal by factor of 10). 
		\item If we would want to remove periodicity, we could apply the differencing operator not on two adjacent points, but two points that are moved by 1 period as the difference between those is expected to be zero (note that we need to know/estimate the frequency for that)
	\end{itemize}
\end{itemize}
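\begin{itemize}
	\item \textit{Worked example} (a sketch with hand-picked trends): for a linear trend $x_t = a + b\cdot t$, first-order differencing gives $\nabla x_t = (a + bt) - (a + b(t-1)) = b$, i.e. a constant. For a quadratic trend $x_t = t^2$, we get
	$$\nabla x_t = t^2 - (t-1)^2 = 2t - 1, \hspace{5mm} \nabla^2 x_t = t^2 - 2(t-1)^2 + (t-2)^2 = 2$$
	so one differencing step still leaves a linear trend, while the second one reduces it to a constant.
	\item For a perfectly periodic signal with a period of $T$ time steps ($x_t = x_{t-T}$), the lagged difference $z_t = x_t - x_{t-T}$ is exactly $0$. Here, $T$ is a symbol introduced for this example and has to be known or estimated.
\end{itemize}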
\subsubsection{ARIMA}
\begin{itemize}
	\item ``Autoregressive Integrated Moving Average Model''
	\item We try to estimate a model that describes the empirical data well and can be used to forecast/predict new values
	\item For this, we learn/determine a mapping of time point $t$ to probability distribution $P_t$
	\item In ARIMA, we assume $P_t$ to be modeled by a combination of an autoregressive process (AR), and a moving average (MA) over white noise $W_t$:
	$$P_t = \underbrace{\phi_1 P_{t-1} + \dots + \phi_p P_{t-p} + W_t}_{\text{Autoregressive process}} + \underbrace{\theta_1 W_{t-1} + \dots + \theta_q W_{t-q}}_{\text{Moving Average}}$$
	Note that an AR can be expressed by an infinite MA, and the other way round. But to reduce the number of parameters, both concepts are used here.
	\item To remove the drifts of mean (trend), we apply differencing of order $d$ on $P_t$ ($V_t = \nabla^d P_t$). 
	\item Optimization of parameters
	\begin{itemize}
		\item For $p$ we can look at the autocorrelation between $x_t$ and $x_{t-p}$ to find patterns
		\item For other parameters, gridsearch with objective function as the fit to the data we have
	\end{itemize}
	\item Note that ARIMA does not take seasonality/periodicity into account. Hence, we either have to add it externally, remove it beforehand or model it as well (ARIMAX)
\end{itemize}
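\begin{itemize}
	\item \textit{Worked example} for the AR/MA equivalence (a sketch for the simplest case $p=1$, $q=0$, assuming $|\phi_1| < 1$): repeatedly inserting $P_{t-1} = \phi_1 P_{t-2} + W_{t-1}$ into $P_t = \phi_1 P_{t-1} + W_t$ gives
	$$P_t = W_t + \phi_1 W_{t-1} + \phi_1^2 W_{t-2} + \dots = \sum_{j=0}^{\infty} \phi_1^j W_{t-j}$$
	Hence, a single AR parameter $\phi_1$ corresponds to an infinite MA with weights $\theta_j = \phi_1^j$. For $\phi_1=0.5$, the weights are $0.5, 0.25, 0.125, \dots$ and quickly become negligible.
\end{itemize}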
\subsection{Recurrent Neural Networks}
\begin{itemize}
	\item Unfolding for gradient calculation
	\item \textbf{Echo State Networks}: ``cheap'' RNNs without backprop through time
	\begin{itemize}
		\item Three weight matrices:
		\begin{itemize}
			\item $\bm{W}^{\textbf{in}}$: are the weights from the input layer to the memory (or here also called \textit{reservoir}). 
			\item $\bm{W}$: are the weights over time steps (or internally in the reservoir).
			\item $\bm{W}^{\textbf{out}}$: specify the weights from the reservoir to the output
		\end{itemize}
		\item $\bm{W}^{\textbf{in}}$ and $\bm{W}$ are randomly initialized and \textbf{fixed} during training, while $\bm{W}^{\textbf{out}}$ is learned (either by SGD or pseudo inverse)
		\item The initialization of $\bm{W}$ needs to satisfy the \textit{Echo State property}, which states that the effect of a previous state should gradually decrease over time (prevents exploding values)
		\item We can ensure this by randomly initializing a matrix, dividing it by its spectral radius and (optionally) scaling it down even further.
		\item There are different initialization heuristics to optimize this process, but all are subject to the \textit{No Free Lunch} theorem (optimizing for one use case will make it worse for others)
	\end{itemize}
\end{itemize}
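\begin{itemize}
	\item \textit{Sketch of the ESN equations} (a common formulation, not taken from the lecture; the symbols $\bm{u}_t$, $\bm{r}_t$, $\bm{y}_t$, $\bm{R}$, $\bm{Y}$, $\bm{W}_{\text{rand}}$ and $s$ are introduced here for illustration): with input $\bm{u}_t$ and reservoir state $\bm{r}_t$, a typical update is
	$$\bm{r}_t = \tanh\left(\bm{W}^{\textbf{in}} \bm{u}_t + \bm{W} \bm{r}_{t-1}\right), \hspace{5mm} \bm{y}_t = \bm{W}^{\textbf{out}} \bm{r}_t$$
	\item The rescaling of a random matrix $\bm{W}_{\text{rand}}$ with spectral radius $\rho(\bm{W}_{\text{rand}})$ (largest absolute eigenvalue) can be written as $\bm{W} = s \cdot \bm{W}_{\text{rand}} / \rho(\bm{W}_{\text{rand}})$ with a scaling factor $0 < s \leq 1$.
	\item As only $\bm{W}^{\textbf{out}}$ is trained, learning reduces to linear regression: stacking the reservoir states as columns of $\bm{R}$ and the targets as columns of $\bm{Y}$, the pseudo-inverse solution is $\bm{W}^{\textbf{out}} = \bm{Y}\bm{R}^{+}$.
\end{itemize}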
\subsection{Dynamical Systems}
\begin{itemize}
	\item Build domain knowledge-based models that cover temporal relationships between attributes by means of differential equations. Furthermore, they assume a numerical state. Very simple model for velocity:
	$$y_{vel}(t) = y_{vel}(t-1) + \gamma_1 \cdot y_{acc}(t)$$
	\item Models still contain parameters (as $\gamma_1$ above) that can/need to be tuned
	\item \textbf{Parameter optimization}: three main approaches
	\begin{itemize}
		\item \textit{Simulated Annealing}: similar to an EA with a population size of 1. 
		\begin{itemize}
			\item We have a single solution which we randomly initialize first. At each iteration, we take a random step, and compare the difference in score. 
			\item If the new point is better than the old one, we accept it as the new solution. Otherwise, we still accept it with a probability based on the difference between the scores and the number of steps we have already taken. 
			\item Note that the probability decreases with number of steps. Thus, we switch from exploration in the first iterations to exploitation in the last.
		\end{itemize}
		\item \textit{Genetic Algorithms}: 
		\begin{itemize}
			\item We represent parameter values as a bit string (arbitrary number of bits per parameter, all together concatenated into one genotype), and initialize a couple of them in our population 
			\item At each iteration, we choose a set of parents from our population (based on fitness value), and perform crossover as well as mutation on the children
			\item Perform survivor selection, or just completely replace the old generation by the new one
		\end{itemize}
		\item \textit{NSGA-II}: multi-criteria optimization GA
		\begin{itemize}
			\item ``Non-Dominated Sorting Genetic Algorithm''
			\item Used when multiple targets need to be optimized, and there is no fixed tradeoff between them 
			\item Therefore, we find Pareto fronts in our population (individuals that are not Pareto dominated by any other individual in our population)
			\item We create several Pareto fronts by iteratively creating one for our population, and then remove all individuals on it from the population, and start again
			\item Interested in a wide spread of individuals/coverage of Pareto front $\Rightarrow$ weight individuals on the Pareto front by the distance to other points (points on the border set to infinity because they are the best for a certain objective). We use this weight for survivor selection where we iteratively add individuals until we have enough for a new population. Note that we of course prioritize the individuals that are on an earlier Pareto front
		\end{itemize}
	\end{itemize}
\end{itemize}
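\begin{itemize}
	\item \textit{Sketch of the acceptance rule in simulated annealing} (one common choice, not necessarily the one from the lecture; the temperature $T_k$ and the score difference $\Delta$ are symbols introduced here): when minimizing an error and the new point is worse by $\Delta > 0$, it is accepted with probability
	$$p_{\text{accept}} = \exp\left(-\frac{\Delta}{T_k}\right)$$
	where the temperature $T_k$ is lowered with the iteration $k$. For $\Delta = 0.1$, we get $p_{\text{accept}} \approx 0.90$ at $T_k=1$, but only $p_{\text{accept}} \approx 4.5\cdot 10^{-5}$ at $T_k = 0.01$.
\end{itemize}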

================================================
FILE: ML4QS/mlqs_modeling_without_time.tex
================================================
\section{Predictive Modeling without Notion of Time}
\begin{itemize}
	\item Any predictor that does not explicitly take time into account (except the temporal features from time/frequency domain etc.)
	\item Different learning setups possible
	\begin{itemize}
		\item \textit{Individual}: train and test on a single user
		\item \textit{Population - unknown user}: train on a set of users, test on a different set of users
		\item \textit{Population - unseen data}: train on a set of users, test on the same users but different data
	\end{itemize}
\end{itemize}
\subsection{Preventing overfitting}
\begin{itemize}
	\item One big issue in the context of QS is that algorithms can easily overfit. This can be due to the large number of features, the noise contained in them, and the usually small datasets we have
	\item \textbf{Feature selection}
	\begin{itemize}
		\item To prevent the models from overfitting on uninformative features, we can reduce the number of features to the essential ones
		\item \textit{Forward selection}: start with empty set, and iteratively add most predictive feature. At every iteration, we need to run a model on the previously added features plus any of the other features left. Stop when accuracy does not improve significantly anymore
		\item \textit{Backward selection}: start with all features, and iteratively remove the least predictive feature. Similar to forward selection, just doing the whole algorithm reversed.
	\end{itemize}
	\item \textbf{Regularization}: add a term to the error function to punish more for more complex models. Examples include L1/L2 regularization for NN, points per leaf for decision trees, etc.
\end{itemize}
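\begin{itemize}
	\item \textit{Worked example} for the cost of forward selection (the number of features $p$ is a symbol introduced here): in the first iteration we train $p$ models, in the second $p-1$, and so on. Without early stopping, this sums up to at most $p + (p-1) + \dots + 1 = \frac{p(p+1)}{2}$ models, compared to $2^p$ possible feature subsets for an exhaustive search. For $p=20$, these are $210$ versus $1{,}048{,}576$ models. The price is that the greedy search is not guaranteed to find the best subset.
\end{itemize}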

================================================
FILE: ML4QS/mlqs_reinforcement_learning.tex
================================================
\section{Reinforcement Learning}
\begin{itemize}
	\item RL is used in ML4QS to learn from interactions with the user and to influence them
	\item General overview of how to integrate RL in ML4QS is shown in Figure~\ref{fig:chapter_9_RL_loop}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/chapter_9_RL_loop.png}
		\caption{Reinforcement Learning in the loop for ML4QS}
		\label{fig:chapter_9_RL_loop}
	\end{figure}
	\item \textbf{Markov Property}: $\mathbb{P}\left\{R_{t+1}=r, S_{t+1}=s|S_0, A_0, R_0, ..., S_t, A_t, R_t\right\} = \mathbb{P}\left\{R_{t+1}=r, S_{t+1}=s|S_t, A_t\right\}$\\
	The conditional probabilities of the future state and reward solely depend on the last state $S_t$ and action $A_t$. 
	\item For every problem that satisfies this property, we can easily create a Markov Decision Process with a finite set of states as in Figure~\ref{fig:chapter_9_MDP}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.35\textwidth]{figures/chapter_9_MDP.png}
		\caption{Markov Decision Process for simple example}
		\label{fig:chapter_9_MDP}
	\end{figure}
	\item \textbf{SARSA}:
	\begin{itemize}
		\item On-policy optimization, update $Q$-values by:
		$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))$$
		\item Popular policies are $\epsilon$-greedy or softmax over q-values for different actions
	\end{itemize}
	\item \textbf{Q-Learning}:
	\begin{itemize}
		\item Off-policy optimization, update by taking the maximum Q-value over next state:
		$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma \max\limits_{A'(S_{t+1})} Q(S_{t+1}, A') - Q(S_t, A_t))$$
	\end{itemize}
	\item \textbf{Eligibility traces}: update frequently seen states in a single run more
	\begin{itemize}
		\item If we have seen a state and action combination more frequently in our history, then we want to increase the weight of the update because it is more eligible (i.e. more responsible for the outcome). We can determine the eligibility by:
		$$Z_t(s, a) = \begin{cases}
		\gamma \lambda Z_{t-1}(s, a) + 1 & \text{if } s=S_t\wedge a=A_t\\
		 \gamma \lambda Z_{t-1}(s, a) & \text{otherwise}
		\end{cases}$$
		\item In our learning algorithms, we can incorporate this by weighting the update with the trace and applying it to every state-action pair $(s,a)$, as e.g. in Q-Learning:
		$$Q(s, a) \leftarrow Q(s, a) + \alpha(R_{t+1} + \gamma \max\limits_{A'(S_{t+1})} Q(S_{t+1}, A') - Q(S_t, A_t))\cdot Z_t(s,a)$$
	\end{itemize}
	\item Usually, the Q-values are stored in a table. If the number of states and actions are very large, this is not feasible. Alternative is to learn a function/model, that takes as input the state and action, and predicts the Q-value.
	\item For continuous state spaces, we can discretize it by e.g. the \textbf{U-tree} algorithm
	\begin{itemize}
		\item Start with a single unit/leaf/discrete state where all continuous states are mapped to 
		\item Collect data by trial and error for a while and estimate the Q-values
		\item On the collected data for each leaf, we test whether we can find splits for any attribute $X_i$ with a significant difference in Q-values
		\item Choose $X_i$ and its split with the lowest $p$-value, and create new leaves. Continue until the maximum number of leaves is reached
	\end{itemize}
\end{itemize}
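\begin{itemize}
	\item \textit{Worked example} for the difference between SARSA and Q-Learning (all numbers are made up for illustration): let $\alpha=0.5$, $\gamma=0.9$, $Q(S_t, A_t)=1$ and $R_{t+1}=2$. In $S_{t+1}$, the policy explores and picks an action with $Q(S_{t+1}, A_{t+1})=3$, while the best action has a Q-value of $4$.
	\begin{itemize}
		\item SARSA: $Q(S_t, A_t) \leftarrow 1 + 0.5\cdot(2 + 0.9\cdot 3 - 1) = 2.85$
		\item Q-Learning: $Q(S_t, A_t) \leftarrow 1 + 0.5\cdot(2 + 0.9\cdot 4 - 1) = 3.3$
	\end{itemize}
	\item SARSA evaluates the action that was actually taken (including exploration), whereas Q-Learning always assumes the greedy action in the next state.
\end{itemize}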

================================================
FILE: ML4QS/mlqs_sensory_noise.tex
================================================
\section{Handling Sensory Noise}
\label{sec:chapter_3_sensory_noise}
\subsection{Outlier Detection}
\begin{itemize}
	\item ``\textit{An outlier is an observation point that is distant from other observations}''
	\item Outliers can be caused by measurement errors, or variability of the data (e.g. very high heart rate due to pushing someone's limits)
	\item Outliers can be detected by either \textit{domain knowledge} (known in what range to expect value, e.g. heart rate should not be over 220), or without by filtering noise. We distinguish between two ways for doing so:
	\begin{itemize}
		\item \textit{Distribution-based}: assume a certain distribution of the data, and remove all points with a likelihood lower than a certain threshold
		\item \textit{Distance-based}: focus on the distance between data points, and mark those as outliers which are far apart
	\end{itemize}
	\item After detecting the outliers, we can replace them with \textit{unknown} values/value missing tag. 
\end{itemize}
\subsubsection{Distribution-based outlier detection}
\begin{itemize}
	\item \textbf{Chauvenet's criterion}: assume a normal distribution for a single attribute
	\begin{itemize}
		\item We can fit a normal distribution by calculating the mean and stddev of the data
		\item For each point, calculate the probability $P(X\leq x_{i}^{j})=\int_{-\infty}^{x_{i}^{j}} \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(u-\mu)^2}{2\sigma^2}} \mathrm{d}u$\\ (instance $i$ from $j$th attribute)
		\item A point is an outlier if:
		$$P(X\leq x_{i}^{j}) < \frac{1}{c\cdot N} \hspace{3mm}\text{or}\hspace{3mm} \left(1-P(X\leq x_{i}^{j})\right) < \frac{1}{c\cdot N}$$
		Thus, the probability of a point being an outlier decreases with the number of observations for this attribute (as the likelihood increases to observe rare values)
		\item Mostly $c=2$ is chosen.
	\end{itemize}
	\item \textbf{Mixture models}: assume the data to be described by $K$ normal distributions $p(x)=\sum_{k=1}^{K}\pi_k \mathcal{N}\left(x|\mu_k, \sigma_k\right)$
	\begin{itemize}
		\item Use EM algorithm to optimize maximum-likelihood of the data
		\item A point is considered as an outlier if it has a lower probability than a certain threshold
		\item Both the threshold and the number of distributions $K$ are hyperparameters to optimize
	\end{itemize}
\end{itemize}
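\begin{itemize}
	\item \textit{Worked example} for Chauvenet's criterion (numbers chosen for illustration): with $c=2$ and $N=100$ observations, the threshold is $\frac{1}{c\cdot N} = 0.005$. For a normal distribution, this corresponds to points that are more than about $2.58$ standard deviations away from the mean. With $N=1000$, the threshold drops to $0.0005$, which corresponds to about $3.29$ standard deviations.
\end{itemize}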
\subsubsection{Distance-based outlier detection}
\begin{itemize}
	\item Outlier detection based on a distance metric $d(x_{i}^{j}, x_{k}^{j})$ (as e.g. Euclidean distance) which can also be across multiple attributes
	\item \textbf{Simple distance-based approach}:
	\begin{itemize}
		\item Call two points close if they are within a distance $d_{\text{min}}$ (hyperparameter)
		\item A point $x_{i}^{j}$ is considered as an outlier if:
		$$\frac{\sum_{n=1}^{N} \mathbbm{1}\left(d(x_{i}^{j}, x_{n}^{j}) > d_{\text{min}}\right)}{N} > f_{\text{min}}$$
		Hence, a point is only kept as a regular point if at least a fraction of $1-f_{\text{min}}$ of all points lies within the range of $d_{\text{min}}$.
		\item Example values of hyperparameters: $d_{\text{min}}=0.1, f_{\text{min}}=0.99$
		\item Not working if we have multi-modal distribution (does not take local densities into account)
	\end{itemize}
	\item \textbf{Local outlier factor}: use local densities to determine outliers.
	\begin{itemize}
		\item Define $k_{\text{dist}}$ for point $x_{i}$ as the maximum distance in set of its $k$ closest neighbors. 
		\item The reachability distance of $x_{i}$ \textbf{from} $x$ is defined as:
		$$k_{\text{reach\_dist}}(x_i, x) = \max\left(k_{\text{dist}}(x), d(x, x_i)\right)$$
		Note that this distance is \textit{not} symmetric as it uses the $k$th nearest neighbors of a point.\\
		Furthermore, this operation is only done to reduce the influence of very close-by points. 
		\item The local reachability density of a point is defined by:
		$$k_{\text{lrd}}\left(x_{i}\right) = \frac{\left|k_{\text{distnh}}\left(x_{i}\right)\right| }{\sum\limits_{x\in k_{\text{distnh}}\left(x_{i}\right)} k_{\text{reach\_dist}}\left(x_i, x\right)}$$
		Hence, it is high if a point is very close to others.
		\item A point is an outlier if its neighbors have a much higher local reachability density than the point itself:
		$$k_{\text{lof}}\left(x_{i}\right) = \frac{\sum\limits_{x\in k_{\text{distnh}}\left(x_{i}\right)} k_{\text{lrd}}\left(x\right)}{\left|k_{\text{distnh}}\left(x_{i}\right)\right| \cdot k_{\text{lrd}}\left(x_{i}\right)}$$
	\end{itemize}
\end{itemize}
\subsection{Missing value imputation}
\begin{itemize}
	\item Due to outliers or measuring errors, we might have missing values in our dataset
	\item We can use simple methods like replacing them by the mean or median of the other observed data, or more advanced methods that take the values of the other attributes at this observation or a local time window into account.
	\begin{itemize}
		\item Example for the latter: \textbf{interpolation} $x_{i}^{j} = x_{i-k}^{j} + k \cdot \frac{x_{i+l}^{j} - x_{i-k}^{j}}{l+k}$
	\end{itemize}
\end{itemize}
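\begin{itemize}
	\item \textit{Worked example} for the interpolation (values are made up): assume the value at index $i$ is missing, the last known value is $x_{i-1}^{j}=10$ ($k=1$) and the next known value is $x_{i+2}^{j}=16$ ($l=2$). Then $x_{i}^{j} = 10 + 1\cdot\frac{16-10}{2+1} = 12$, which lies on the straight line between the two known points.
\end{itemize}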
\subsubsection{Kalman Filter}
\begin{itemize}
	\item Combine outlier detection and imputation into a single model
	\item For this, we keep a latent state $s_t$, for which $x_t$ is the observation in this state (the Kalman filter relates $x_t$ and $s_t$)
	\item The next value of a state is defined as: $$s_t = F_t s_{t-1} + B_t u_t + w_t$$ where $u_t$ is a control input state (as e.g. sending a message), $w_t$ is white noise, and $F_t$ and $B_t$ are learned matrices
	\item The measurements associated with $s_t$ can be predicted by: $$x_t = H_t s_t + v_t$$ where $v_t$ is again white noise.
	\item We can predict the next state (without noise) by $\hat{s}_{t|t-1} = F_t \hat{s}_{t-1|t-1} + B_t u_t$
	\item The error at time $t$ compared to the observations $x_t$ is then $e_t = x_t - H_t \hat{s}_{t|t-1}$
	\begin{itemize}
		\item If we observe (after training/modeling) a high error, we can assume a value to be an outlier, and replace it with the prediction of the Kalman Filter.
	\end{itemize}
	\item Given this error, we can update our prediction accordingly: $\hat{s}_{t|t} = \hat{s}_{t|t-1} + K_t e_t$ where $K_t$ takes the expected prediction error into account (based on the white noise $w_t$ and $v_t$)
	% \item Next, we can estimate our prediction error of $\hat{s}_{t|t-1}$ by $P_{t|t-1}=\mathbb{E}\left[\left(s_t - \hat{s}_{t|t-1}\right)\left(s_t - \hat{s}_{t|t-1}\right)^T\right]$
\end{itemize}
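\begin{itemize}
	\item \textit{Worked example} of one Kalman filter step in a scalar setting (all numbers are made up, and the gain $K_t=0.5$ is simply assumed instead of derived): let $F_t=1$, $H_t=1$, no control input ($B_t u_t = 0$), and $\hat{s}_{t-1|t-1}=10$.
	\begin{itemize}
		\item Prediction: $\hat{s}_{t|t-1} = 1\cdot 10 + 0 = 10$
		\item Error for an observation $x_t=12$: $e_t = 12 - 1\cdot 10 = 2$
		\item Update: $\hat{s}_{t|t} = 10 + 0.5\cdot 2 = 11$
	\end{itemize}
	\item In this scalar setting, a gain close to $0$ trusts the model prediction, while a gain close to $1$ trusts the measurement.
\end{itemize}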
\subsection{Transforming the Data}
\begin{itemize}
	\item Transform data to extract most useful data, and get rid of remaining noise
	\item Different approaches can be used
	\item \textbf{Lowpass filter}: filter out high-frequency noise
	\begin{itemize}
		\item We assume that our signal has a certain periodicity, but we are only interested in certain parts of the frequency band (noise is mostly very high-frequency)
		\item The low-pass filter can remove those by weighting each periodicity by its frequency:
		$$|G(f)|^2 = \frac{1}{1+\left(f/f_c\right)^{2n}}$$
		with $|G(f)|$ as the magnitude, $f_c$ as the cutoff frequency (power $|G(f)|^2$ halved, i.e. magnitude reduced to $1/\sqrt{2}$), and $n$ as the order of the filter
	\end{itemize}
	\item \textbf{Principal Component Analysis}: find components that explain most of the variance in the data
	\begin{itemize}
		\item Select number of components based on the explained variance. Other, low-variance components are removed to reduce noise 
		\item Problem: we lose insight into the data because the components are not easily interpretable anymore
	\end{itemize}
\end{itemize}
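\begin{itemize}
	\item \textit{Worked example} for the lowpass filter (frequencies chosen for illustration): at $f=f_c$, we get $|G(f)|^2 = \frac{1}{1+1} = 0.5$ independent of the order $n$. At $f=2f_c$, an order of $n=2$ gives $|G(f)|^2 = \frac{1}{1+2^4} = \frac{1}{17} \approx 0.059$, while $n=4$ gives $\frac{1}{1+2^8} = \frac{1}{257} \approx 0.0039$. Hence, a higher order leads to a sharper cutoff.
\end{itemize}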

================================================
FILE: ML4QS/mlqs_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb, amsfonts} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}
\usepackage{bbm}
\usepackage[autostyle]{csquotes} 

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\loss}[0]{\mathcal{L}}
\newcommand{\chain}[3]{\frac{\partial #1}{\partial #2}\frac{\partial #2}{\partial #3}}
\newcommand{\eq}[1]{\begin{equation*}\begin{split}#1\end{split}\end{equation*}}
\newcommand{\TODO}[1]{\textbf{\textcolor{red}{#1}}}
\newcommand{\comment}[1]{\textit{\textcolor{blue}{Comment: #1}}}

\definecolor{green}{RGB}{0,160,0}
\definecolor{blue}{RGB}{0,0,160}
\definecolor{red}{RGB}{160,0,0}
\definecolor{orange}{RGB}{200,160,0}
\definecolor{purple}{RGB}{170,0,200}
\definecolor{cyan}{RGB}{0,200,200}
\definecolor{lightred}{RGB}{200,50,50}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Machine Learning for the Quantified Self}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{mlqs_intro.tex}
\input{mlqs_sensory_noise.tex}
\input{mlqs_feature_engineering.tex}
\input{mlqs_clustering.tex}
\input{mlqs_supervised_learning.tex}
\input{mlqs_modeling_without_time.tex}
\input{mlqs_modeling_with_time.tex}
\input{mlqs_reinforcement_learning.tex}
%\appendix
%\newpage
%\input{ml4qs_appendix.tex}

\end{document}

================================================
FILE: ML4QS/mlqs_supervised_learning.tex
================================================
\section{Supervised Learning}
\begin{itemize}
	\item The perspective on supervised learning in this course is summarized in Figure~\ref{fig:chapter_6_supervised_learning}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/chapter_6_supervised_learning_overview.png}
		\caption{Overview of supervised learning framework}
		\label{fig:chapter_6_supervised_learning}
	\end{figure}
	\item Discussion of error measuring
	\begin{itemize}
		\item \textit{Risk} $E(h,f)$ describes the distance between our hypothesis $h$ and the target function $f$
		\item \textit{Loss} is the point-wise risk $e(h(x),f(x))$
		\item Given the evidence $p(x)$, we can determine the risk by $E(h,f)=\int e(h(x), f(x)) p(x) dx$
		\item However, this integral can (usually) not be computed, and only approximated by Monte-Carlo integration. 
		\item For definitions of $e$, we can use metrics like F1 or accuracy (classification), or MSE etc. (regression)
		\item The in-sample error is the average loss over all training points $E_{in}(h)=\frac{1}{N}\sum_{(x,y) \in \mathcal{O}_{\text{train}}} e(y, h(x))$
		\item The out-of-sample error accordingly for points not in the training set:\\
		$E_{out}(h)=\int_{\mathcal{X}\setminus \mathcal{O}_{\text{train}}} e(h(x), f(x)) p(x) dx$
	\end{itemize}
	\item We select the model with the lowest in-sample error, but need to be careful with overfitting
\end{itemize}
\subsection{PAC Learnability and VC dimensionality}
\begin{itemize}
	\item ``Probably approximately correct learning''
	% \item A hypothesis set is PAC learnable if a learning algorithm exists that can minimize the generalization error to $|E_{out}(\hat{f}) - E_{in}(\hat{f})| < \epsilon$ with a probability of $1-\delta$
	\item A hypothesis set is PAC learnable when it can be shown that given any value of $\delta$, $\epsilon$ there is an $N$ (number of samples) where with probability $1-\delta$ the difference between the in-sample and out-of-sample error is less than $\epsilon$. 
	\item \textit{Probably}: $1-\delta$, \textit{Approximately correct}: $|E_{out}(\hat{f}) - E_{in}(\hat{f})| < \epsilon$ 
	\item For a finite set of $M$ hypotheses, we determine it by:
	$$E_{out}(\hat{f}) \leq E_{in}(\hat{f}) + \sqrt{\frac{1}{2N}\log \frac{2M}{\delta}}$$
	Hence, every finite set of hypotheses is PAC learnable, and we can calculate the expected error given number of samples $N$, hypothesis set size $M$, and probability $\delta$
	\item For infinite set of hypotheses, we can look at VC dimensionality
	\begin{itemize}
		\item We say that a set of input vectors $X$ is shattered by a hypothesis set $\mathcal{H}$ if $\mathcal{H}$ can realize every possible labeling of $X$
		\item The VC dimension of $\mathcal{H}$ is the largest cardinality $D$ of a set $X$ that is shattered by $\mathcal{H}$. Note that not all possible point sets of cardinality $D$ must be shattered by $\mathcal{H}$. It is sufficient if this holds for at least one.
		\item Example for a perceptron:
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.3\textwidth]{figures/chapter_6_VC_dimensions.png}
		\end{figure}
	
		The VC dimension of a perceptron in 2 dimensions is $3$: there is a set of 3 points that can be shattered, but there exists no set of 4 points that can be shattered (i.e. for which we can learn any labeling)
		\item If the hypothesis set can represent any labeling for an arbitrarily large set $X$, it has a VC dimension of $\infty$
		\item Important finding: every hypothesis set with a finite VC dimension is PAC learnable
		\item We can 
	\end{itemize}
	\item Some implications from this study
	\begin{itemize}
		\item With only a few training samples, it is easy to get a low in-sample error, but it says little about the out-of-sample error. With an increasing number of samples, the out-of-sample error decreases
		\item In addition, for a fixed $N$, we can study the influence of more complex hypotheses and find the best compromise
		\item  
	\end{itemize}
\end{itemize}
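To make the bound for a finite hypothesis set concrete, here is a small worked example. The numbers $N=1000$, $M=100$ and $\delta=0.05$ are purely illustrative, and $\log$ is assumed to be the natural logarithm:
$$\sqrt{\frac{1}{2N}\log \frac{2M}{\delta}} = \sqrt{\frac{\log 4000}{2000}} \approx \sqrt{\frac{8.294}{2000}} \approx 0.064$$
So with probability at least $0.95$, the out-of-sample error exceeds the in-sample error by at most roughly $0.064$. Solving the same bound for $N$ gives the number of samples that is sufficient for a desired gap $\epsilon$:
$$N \geq \frac{1}{2\epsilon^2}\log\frac{2M}{\delta}, \text{\hspace{3mm}e.g.\hspace{1mm}} \epsilon = 0.05 \Rightarrow N \geq 200 \cdot 8.294 \approx 1659$$
Note that $M$ enters only logarithmically, whereas halving $\epsilon$ requires four times as many samples.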

================================================
FILE: Machine_Learning_1/ml_appendix.tex
================================================
\section{Appendix: Foundations}
\subsection{Important functions}
\subsubsection{Rectified Linear Unit}
Properties of the ReLU function:
\begin{itemize}
	\item $\text{ReLU}(x)=\max(x,0)$
	\item $\text{ReLU}'(x)=\begin{cases}
	1 & \text{ if } x>0\\
	0 & \text{ if } x<0\\
	\text{undef} & \text{ if } x=0\\
	\end{cases}$ (last case usually set to 0)
	\item Variations: 
	\begin{itemize}
		\item Leaky ReLU: $f(x)=\begin{cases}
		x & \text{ if } x>0\\
		0.01x & \text{ otherwise }
		\end{cases}$
		\item ELU: $f(x)=\begin{cases}
		x & \text{ if } x>0\\
		\alpha (e^x-1) & \text{ otherwise }
		\end{cases}$
		\item Self-normalizing ELU (carefully selected $\alpha$ and scaling, so that activations stay close to mean 0, variance 1)
	\end{itemize}
\end{itemize}
\subsubsection{Sigmoid}
Properties of the sigmoid function:
\begin{itemize}
	\item $\sigma(x)=\frac{1}{1+e^{-x}}$
	\item $\sigma(-x) = 1 - \sigma(x)$
	\item $\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right)$
	\item Output range: $(0,1)$ (the limits 0 and 1 are only approached for $x\to\mp\infty$)
\end{itemize}
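The derivative identity listed above follows from a short calculation, sketched here for completeness:
$$\sigma'(x) = \frac{e^{-x}}{\left(1+e^{-x}\right)^2} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\left(1 - \sigma(x)\right)$$
where the last step uses $1-\sigma(x) = \frac{e^{-x}}{1+e^{-x}} = \frac{1}{1+e^{x}} = \sigma(-x)$. The gradient is largest at $x=0$ with $\sigma'(0) = \frac{1}{4}$ and vanishes for $|x|\to\infty$.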
\begin{figure}[ht]
	\centering
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\scalebox{0.5}{%
			\begin{tikzpicture}
			\begin{axis}[
			title = {Sigmoid},
			axis lines = left,
			xlabel = {input $x$},
			ylabel = {output $y$},
			xmin = -5,
			xmax = 5,
			ymin = 0,
			ymax = 1,
			ymajorgrids=true,
			xmajorgrids=true,
			grid style=dashed
			]
			% Plot of the sigmoid function
			\addplot [
			domain=-5:5, 
			samples=1000, 
			color=black!60!green,
			line width = 0.5mm
			]
			{1/(1+e^(-x))};
			
			\end{axis}
			\end{tikzpicture}
		}
		\caption{$\sigma(x)$}
		\label{img:activation_function_sigmoid}
	\end{subfigure}
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\scalebox{0.5}{%
			\begin{tikzpicture}
			\begin{axis}[
			title = {Hyperbolic tangent},
			axis lines = left,
			xlabel = {input $x$},
			ylabel = {output $y$},
			xmin = -4,
			xmax = 4,
			ymin = -1,
			ymax = 1,
			ymajorgrids=true,
			xmajorgrids=true,
			grid style=dashed
			]
			% Plot of the hyperbolic tangent
			\addplot [
			domain=-4:4, 
			samples=1000, 
			color=blue,
			line width = 0.5mm
			]
			{tanh(x)};
			
			\end{axis}
			\end{tikzpicture}
		}
		\caption{$\tanh(x)$}
		\label{img:activation_function_tanh}
	\end{subfigure}
	\begin{subfigure}[b]{0.3\textwidth}
		\centering
		\scalebox{0.5}{%
			\begin{tikzpicture}
			\begin{axis}[
			title = {Rectified linear unit},
			axis lines = left,
			xlabel = {input $x$},
			ylabel = {output $y$},
			xmin = -2,
			xmax = 2,
			ymin = 0,
			ymax = 2,
			ymajorgrids=true,
			xmajorgrids=true,
			grid style=dashed
			]
			% Plot of the ReLU function
			\addplot [
			domain=-2:2, 
			samples=1000, 
			color=orange,
			line width=0.5mm
			]
			{(x > 0)*x};
			
			\end{axis}
			\end{tikzpicture}
		}
		\caption{$\text{ReLU}(x)$}
		\label{img:activation_function_relu}
	\end{subfigure}
	
	\caption[Comparison of activation functions]{(a) The sigmoid function maps the inputs to a range of 0 to 1 and has its largest gradients near $x=0$, which pushes the output towards either 0 or 1. (b) The hyperbolic tangent is similar to the sigmoid function but has an output range of -1 to 1. (c) A rectified linear unit (ReLU) is 0 for all inputs lower than 0. All other values are passed through unchanged.}
\end{figure}
\subsubsection{Hyperbolic tangent}
Properties of the hyperbolic tangent:
\begin{itemize}
	\item $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
	\item $\tanh(-x) = -\tanh(x)$
	\item $\tanh'(x) = 1 - \tanh(x)^2$
	\item Output range: $(-1,1)$ (the limits $-1$ and $1$ are only approached for $x\to\mp\infty$)
\end{itemize}
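The hyperbolic tangent is a rescaled and shifted sigmoid. As a short check, using $\sigma(x)=\frac{1}{1+e^{-x}}$:
$$2\sigma(2x) - 1 = \frac{2 - \left(1+e^{-2x}\right)}{1+e^{-2x}} = \frac{1-e^{-2x}}{1+e^{-2x}} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \tanh(x)$$
where the second to last step multiplies numerator and denominator by $e^{x}$. In contrast to the sigmoid, the output is centered around 0 and the maximum gradient is $\tanh'(0) = 1$.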
\subsubsection{Softmax}
Properties:
\begin{itemize}
	\item $\text{softmax}(x_k) = \frac{\exp(x_k)}{\sum_{i=1}^{N}\exp(x_i)}$
	\item If $x_k \gg x_j$ for all $j\neq k$, the softmax output is approx. 0 for all $j\neq k$, whereas for $k$ it is approx. 1
	\item Maps a vector $\bm{x} \in \mathbb{R}^N$ to a probability distribution $\bm{y} \in [0,1]^N$ with $\sum_{i=1}^{N}y_{i} = 1$ $\Rightarrow$ useful for multi-class classification
	\item Invariant to a constant shift $c$ of all inputs: $\frac{\exp(x_k + c)}{\sum_{i=1}^{N}\exp(x_i + c)} = \frac{\exp(x_k)}{\sum_{i=1}^{N}\exp(x_i)}= \text{softmax}(x_k)$
\end{itemize}
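A small numerical illustration of these properties, with the input $\bm{x} = (1, 2, 3)$ chosen arbitrarily:
$$\text{softmax}(\bm{x}) = \frac{\left(e^{1}, e^{2}, e^{3}\right)}{e^{1}+e^{2}+e^{3}} \approx \frac{\left(2.718, 7.389, 20.086\right)}{30.193} \approx \left(0.090, 0.245, 0.665\right)$$
Shifting all inputs by $c = -\max_i x_i = -3$ gives the same result:
$$\frac{\left(e^{-2}, e^{-1}, e^{0}\right)}{e^{-2}+e^{-1}+e^{0}} \approx \frac{\left(0.135, 0.368, 1\right)}{1.503} \approx \left(0.090, 0.245, 0.665\right)$$
This particular shift is commonly used in implementations, because all exponents are then at most 0 and $\exp$ cannot overflow.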
\subsection{Matrix operations}
\subsubsection{Properties of transposed and inverse matrices}
\textbf{Transpose}
\begin{itemize}
	\item $(AB)^T=B^T A^T$
	\item $\det\left(A^{T}\right) = \det\left(A\right)$
\end{itemize}
\textbf{Inverse}
\begin{itemize}
	\item $(AB)^{-1} = B^{-1} A^{-1}$
	\item $\det\left(A^{-1}\right) = \det\left(A\right)^{-1}$
\end{itemize}
\textbf{Combination}
\begin{itemize}
	\item $\left(A^{-1}\right)^T = \left(A^{T}\right)^{-1}$
\end{itemize}
\subsubsection{Derivations}
\subsubsection{Hand-in 1: 1.3d}
Derivation of multivariate Gaussian by matrix
\subsection{Lagrange Multiplier}
\begin{itemize}
	\item Finding stationary points of a function subject to one or more constraints
	\item \textbf{Equality constraint}
	\begin{itemize}
		\item Maximize $f(\bm{x})$ subject to the constraint $g(\bm{x})=0$
		\item At a constrained maximum, we know that $\nabla f(\bm{x}) = -\lambda \nabla g(\bm{x})$
		\item The Lagrangian function is therefore $$ L(\bm{x}, \lambda) = f(\bm{x}) + \lambda g(\bm{x})$$
		\item We solve it by finding a stationary point of $L$ with respect to both $\bm{x}$ and $\lambda$ (in general a saddle point of $L$, not a joint maximum)
		\item Note that the sign of the constraint term is irrelevant. A minus sign leads to the same result, as $g(\bm{x})$ must be zero at the solution
		\item We find solutions by setting the derivatives with respect to both the primal and the dual variables to 0:
		$$\frac{\partial }{\partial \bm{x}} L(\bm{x}, \lambda) = 0, \text{\hspace{3mm}} \frac{\partial }{\partial \lambda} L(\bm{x}, \lambda) = 0$$
	\end{itemize}
	\item \textbf{Inequality constraint}
	\begin{itemize}
		\item Maximize $f(\bm{x})$ subject to the constraint $g(\bm{x})\geq0$ (introduce Lagrangian multiplier $\mu$)
		\item Two kinds of solutions:
		\begin{itemize}
			\item If the optimum of $f(\bm{x})$ lies already in the region of $g(\bm{x})\geq0$, then we have an inactive constraint $\Rightarrow$ $\mu=0$
			\item Otherwise, the optimum is on the boundary so that $g(\bm{x})=0$ and $\mu> 0$
		\end{itemize}
		\item Thus, our primal Lagrangian is defined as:
		$$L(\bm{x}, \mu) = f(\bm{x}) + \mu g(\bm{x})$$
		\item We now maximize regarding $\bm{x}$, but \textit{minimize} for the Lagrangian multiplier as we prefer $f(x)$ being inside the constraint area:
		$$\max_{\bm{x}} \min_{\mu} L(\bm{x}, \mu)$$
		\item Note that the sign is important here. When we minimize $f(\bm{x})$ instead, we can keep the max-min conditions for the Lagrangian but then have to switch the sign in front of the constraint!
		\item Also, differentiating with respect to $\mu$ does not guarantee a valid solution anymore, as we have the following KKT conditions for \textit{every} Lagrangian multiplier:
		$$\mu \geq 0 \hspace{5mm} g(\bm{x})\geq 0 \hspace{5mm} \mu g(\bm{x}) = 0$$
		\item We obtain the dual Lagrangian by optimizing with respect to only the primal variables $\bm{x}$, and replacing those in the primal Lagrangian:
		$$\tilde{L}(\mu) = \max_{\bm{x}} L(\bm{x},\mu)$$
		\item Next, minimize with respect to the dual parameters $\mu$ while respecting the constraint $\mu\geq 0$
	\end{itemize}
	\item \textbf{Combined constraints}
	\begin{itemize}
		\item If we have multiple constraints (can be pure (in-)equalities or mixed), we just add them all to our Lagrangian function
		\item Solve with respect to all constraints
	\end{itemize}
\end{itemize}
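As a worked example of the recipe above, consider a toy problem (the function and the constraint are chosen only for illustration): maximize $f(\bm{x}) = 1 - x_1^2 - x_2^2$ subject to $g(\bm{x}) = x_1 + x_2 - 1 = 0$. The Lagrangian and its stationarity conditions are:
\begin{equation*}
	\begin{split}
		L(\bm{x}, \lambda) & = 1 - x_1^2 - x_2^2 + \lambda\left(x_1 + x_2 - 1\right)\\
		\frac{\partial L}{\partial x_1} & = -2x_1 + \lambda = 0, \text{\hspace{3mm}} \frac{\partial L}{\partial x_2} = -2x_2 + \lambda = 0, \text{\hspace{3mm}} \frac{\partial L}{\partial \lambda} = x_1 + x_2 - 1 = 0
	\end{split}
\end{equation*}
The first two equations give $x_1 = x_2 = \lambda/2$, and inserting this into the third yields $x_1 = x_2 = \frac{1}{2}$ with $\lambda = 1$. If the constraint is instead the inequality $g(\bm{x}) = x_1 + x_2 - 1 \geq 0$, the unconstrained maximum $\bm{x} = \bm{0}$ violates it ($g(\bm{0}) = -1$), so the constraint is active: the solution is again $x_1 = x_2 = \frac{1}{2}$ with $\mu = 1 > 0$, and the KKT conditions $\mu \geq 0$, $g(\bm{x}) \geq 0$, $\mu g(\bm{x}) = 0$ hold. For the flipped constraint $1 - x_1 - x_2 \geq 0$, the unconstrained maximum is feasible, so the constraint is inactive and $\mu = 0$.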

================================================
FILE: Machine_Learning_1/ml_basic_probability.tex
================================================
\section{Probability Theory}
\subsection{Multivariate Gaussian}
$$\mathcal{N}\left(\bm{x}|\bm{\mu}, \bm{\Sigma}\right) = \frac{1}{\left(2\pi\right)^{D/2} \cdot |\bm{\Sigma}|^{1/2}}\cdot \exp\left(-\frac{1}{2}\left(\bm{x}-\bm{\mu}\right)^T\bm{\Sigma}^{-1}\left(\bm{x}-\bm{\mu}\right)\right)$$

$$\frac{\partial}{\partial \bm{\mu}}\mathcal{N}\left(\bm{x}|\bm{\mu}, \bm{\Sigma}\right)  = \mathcal{N}\left(\bm{x}|\bm{\mu}, \bm{\Sigma}\right) \left(\bm{x}-\bm{\mu}\right)^T\bm{\Sigma}^{-1}$$
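
A brief sketch of where this gradient comes from, assuming the convention that the derivative with respect to a column vector is written as a row vector: only the exponent depends on $\bm{\mu}$, so by the chain rule
$$\frac{\partial}{\partial \bm{\mu}}\mathcal{N}\left(\bm{x}|\bm{\mu}, \bm{\Sigma}\right) = \mathcal{N}\left(\bm{x}|\bm{\mu}, \bm{\Sigma}\right) \cdot \frac{\partial}{\partial \bm{\mu}}\left[-\frac{1}{2}\left(\bm{x}-\bm{\mu}\right)^T\bm{\Sigma}^{-1}\left(\bm{x}-\bm{\mu}\right)\right] = \mathcal{N}\left(\bm{x}|\bm{\mu}, \bm{\Sigma}\right) \left(\bm{x}-\bm{\mu}\right)^T\bm{\Sigma}^{-1}$$
The last step uses that $\bm{\Sigma}^{-1}$ is symmetric, so the quadratic form contributes a factor of 2 that cancels the $\frac{1}{2}$, and the inner derivative of $\bm{x}-\bm{\mu}$ contributes a factor of $-1$ that cancels the leading minus sign.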

% $$\frac{\partial}{\partial \bm{\Sigma}}\mathcal{N}\left(\bm{x}|\bm{\mu}, \bm{\Sigma}\right)  = ...$$

\subsection{Rules of probability}
\begin{table}[ht]
	\centering
	\begin{tabular}{c|cc}
		& \textbf{Discrete} & \textbf{Continuous}\\
		\hline
		\textbf{Additivity} & $p(X\in A) = \sum\limits_{x\in A}p(x)$ & $p\left(x\in (a,b)\right) = \int\limits_{a}^{b}p(x)dx$\\[10pt]
		\textbf{Positivity} & $0 \leq p(x)\leq 1$ & $p(x) \geq 0$ (a density can exceed 1)\\[10pt]
		\textbf{Normalization} & $\sum_{x} p(x) = 1$ & $\int_{\chi} p(x)dx = 1$\\[10pt]
		\textbf{Sum Rule} & $p(x) = \sum\limits_{y\in\mathcal{Y}} p(x,y)$ & $p(x)=\int\limits p(x,y)dy$\\[10pt]
		\textbf{Product Rule} & $p(x,y) = p(x|y)p(y)$ & $p(x,y) = p(x|y)p(y)$
	\end{tabular}
\end{table}
\subsection{Bayes Rule}
$$\underbrace{p(x|y)}_{\text{posterior}} = \frac{\overbrace{p(y|x)}^{\text{likelihood}} \overbrace{p(x)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}} = \frac{p(y|x) p(x)}{\int p(y|x) p(x)dx} \text{\hspace{5mm}or\hspace{5mm}} \frac{p(y|x) p(x)}{\sum p(y|x) p(x)}$$
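
As a small numerical illustration of the discrete case (all probabilities are made up for this example), let $x\in\{0,1\}$ be a rare condition with prior $p(x=1) = 0.01$, and let $y=1$ be a positive test with $p(y=1|x=1) = 0.9$ and $p(y=1|x=0) = 0.05$. Then:
$$p(x=1|y=1) = \frac{p(y=1|x=1)p(x=1)}{\sum_{x'} p(y=1|x')p(x')} = \frac{0.9\cdot 0.01}{0.9\cdot 0.01 + 0.05\cdot 0.99} = \frac{0.009}{0.0585} \approx 0.154$$
Despite the fairly accurate test, the posterior stays small because the prior is small and the evidence is dominated by false positives.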

================================================
FILE: Machine_Learning_1/ml_combining_models.tex
================================================
\section{Combining models}
\begin{itemize}
	\item Improve performance by combining different models
	\item For example, we can train $L$ different models and take their average as prediction (called committee)
	\item Alternatively, we can also make the choice of which model to use for an input $\bm{x}$ dependent on $\bm{x}$. An example of this is the mixture of experts
	\item \textbf{Bayesian model averaging vs. model combination methods}
	\begin{itemize}
		\item In Bayesian model averaging, the entire dataset is generated by a single model. We are just unsure which one it is. The likelihood of the data is thus:
		$$p(\bm{X}) = \sum_{h=1}^{H} p(\bm{X}|h)p(h)$$
		\item In contrast, model combination methods consider that different data points can be generated by different components. So, every data point has its own latent variable $\bm{z}_n$. The likelihood is here given by:
		$$p(\bm{X}) = \prod_{n=1}^{N}\sum_{\bm{z}_n} p(\bm{x}_n|\bm{z}_n)p(\bm{z}_n)$$
		Example methods include Gaussian mixture models and Mixture of experts.
	\end{itemize}
\end{itemize}
\subsection{Committees}
\begin{itemize}
	\item We can motivate the idea of committees by the bias-variance decomposition: when we average over models, we are able to reduce the variance of the model's predictions. Thus, by using complex models with low bias error, we can improve the performance by reducing the variance through averaging
	\item Averaging is therefore only effective if models are complex enough to overfit
	\item However, in practice, we have only one dataset on which we train $\Rightarrow$ introduce variability between the models within the committee by various methods
\end{itemize}
\subsubsection{Bootstrap aggregation}
\begin{itemize}
	\item Suppose we have a dataset $\bm{X} = \left[\bm{x}_1, ..., \bm{x}_N\right]^T$
	\item \textbf{Bootstrapping dataset}: we create $B$ datasets by sampling $N$ datapoints \textit{with replacement} from the original dataset $\bm{X}$. So, in $\bm{X}_b$, some points will occur more than once and others might be absent
	\item For doing regression with this method, we train $B$ models on their corresponding dataset, and use the average prediction for a new point:
	$$y(\bm{x}) = \frac{1}{B}\sum\limits_{b=1}^{B} y_b(\bm{x})$$
	\item This is called bootstrap aggregation or also \textit{bagging}
	\item The average error made by one of the models is $E_{\text{AV}} = \frac{1}{B}\sum_{b=1}^{B} \mathbb{E}_{\bm{x}}\left[\epsilon_b(\bm{x})^2\right]$. In contrast, for the committee, we expect an error of:
	$$E_{\text{COM}} = \mathbb{E}_{\bm{x}}\left[\left\{\frac{1}{B}\sum\limits_{b=1}^{B}\epsilon_b(\bm{x})\right\}^2\right]$$
	\item If the errors of the models had zero mean and were uncorrelated (which they are not, because the models are trained on very similar datasets), the expected error would be reduced by a factor of $B$, i.e. $E_{\text{COM}} = \frac{1}{B}E_{\text{AV}}$. In practice, we can at least guarantee that $E_{\text{COM}}\leq E_{\text{AV}}$
	\item Still, bias error cannot be reduced by bagging!
\end{itemize}
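The two statements about the committee error can be sketched as follows. Assuming zero-mean and uncorrelated errors, i.e. $\mathbb{E}_{\bm{x}}\left[\epsilon_b(\bm{x})\right] = 0$ and $\mathbb{E}_{\bm{x}}\left[\epsilon_b(\bm{x})\epsilon_c(\bm{x})\right] = 0$ for $b\neq c$, all cross terms vanish:
$$E_{\text{COM}} = \frac{1}{B^2}\sum\limits_{b=1}^{B}\sum\limits_{c=1}^{B}\mathbb{E}_{\bm{x}}\left[\epsilon_b(\bm{x})\epsilon_c(\bm{x})\right] = \frac{1}{B^2}\sum\limits_{b=1}^{B}\mathbb{E}_{\bm{x}}\left[\epsilon_b(\bm{x})^2\right] = \frac{1}{B}E_{\text{AV}}$$
Without these assumptions, the convexity of the square function (Jensen's inequality) still gives, for every $\bm{x}$,
$$\left\{\frac{1}{B}\sum\limits_{b=1}^{B}\epsilon_b(\bm{x})\right\}^2 \leq \frac{1}{B}\sum\limits_{b=1}^{B}\epsilon_b(\bm{x})^2$$
and taking the expectation over $\bm{x}$ on both sides yields $E_{\text{COM}}\leq E_{\text{AV}}$.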
\subsubsection{Feature bagging}
\begin{itemize}
	\item Similar to bagging, but based on features: sample a subset of \textit{features} of length $r<D$ for each learner. 
	$$\bm{x} = \left[x_1, x_2, \dots, x_D\right]^T\Rightarrow \bm{\tilde{x}} = \left[x_1, x_3, x_5, x_{D-1}\right]^T$$
	\item Also called \textit{random subspace method}
	\item Works especially well if features are uncorrelated and/or if the number of features is much larger than the number of training points
	\item Decision trees with bagging and random subspaces lead to random forests
\end{itemize}
\subsubsection{Boosting}
\begin{itemize}
	\item Use a set of simple individual models (also called weak classifiers), which may be only slightly better than random guessing
	\item In the following description, we concentrate on boosting for classification, but it can also be used for regression
	\item \textbf{AdaBoost}: adaptive boosting
	\item Base classifiers are trained in a sequence where every model uses a weighted form of the dataset
	\item The weight coefficients depend on the performance of the previous models: points that were misclassified get a larger weight
	\item In the end, a prediction is based on the (weighted) majority voting scheme:
	$$Y_M(\bm{x}) = \text{sign}\left(\sum\limits_{m=1}^{M}\alpha_m y_m(\bm{x})\right)$$
	\item AdaBoost algorithm:
	\begin{enumerate}
		\item Initialize weights $w_n^{(1)} = 1/N$ for all $n=1, ...,N$
		\item For all models $m=1,...,M$ sequentially:
		\begin{enumerate}
			\item Fit classifier $y_m(\bm{x})$ to minimize $J_m = \sum\limits_{n=1}^{N} w_n^{(m)} \bm{I}[y_m(\bm{x}_n)\neq t_n]$
			\item Compute weighted error rate $\epsilon_m = \frac{\sum_{n=1}^{N}w_n^{(m)}\bm{I}[y_m(\bm{x}_n)\neq t_n]}{\sum_{n=1}^{N}w_n^{(m)}}$ and $\alpha_m = \ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$
			\item Update weights $w_n^{(m+1)} = w_n^{(m)}\exp\left\{\alpha_m \bm{I}\left[y_m(\bm{x}_n)\neq t_n\right]\right\}$
		\end{enumerate}
		\item Make predictions $Y_M(\bm{x}) = \text{sign}\left(\sum\limits_{m=1}^{M}\alpha_m y_m(\bm{x})\right)$
	\end{enumerate}
	\item Note that the weight of a model in the prediction ($\alpha_m$) is based on its weighted error rate $\epsilon_m$ on the training dataset (greater weights for more accurate models)
	\item When taking a huge number of base models (large $M$), we can easily overfit
	\item Interpretation/Derivation of AdaBoost: minimizing the exponential error function sequentially ($E_m = \sum_{n=1}^{N}\exp\left(-t_n f_m(\bm{x}_n)\right)$ with $f_m(\bm{x}) = \frac{1}{2}\sum_{l=1}^{m}\alpha_l y_l(\bm{x})$)
	\item \textbf{Advantages}: simple boosting algorithm
	\item \textbf{Disadvantages}: very sensitive to outliers (for a badly misclassified point, $-t_n f_m(\bm{x}_n)$ is very large, so its error and weight increase exponentially), no probabilistic interpretation
\end{itemize}
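One round of the AdaBoost weight update can be traced by hand. The setting is a made-up toy case: $N=5$ points with initial weights $w_n^{(1)} = 0.2$, and a first classifier $y_1$ that misclassifies exactly one point:
$$\epsilon_1 = \frac{0.2}{1.0} = 0.2, \text{\hspace{3mm}} \alpha_1 = \ln\left(\frac{0.8}{0.2}\right) = \ln 4 \approx 1.386$$
The weight of the misclassified point is multiplied by $\exp(\alpha_1) = 4$ and becomes $0.8$, while the four correctly classified points keep $0.2$ each, so together also $0.8$. The misclassified point therefore carries exactly half of the total weight in the next round. This holds in general: with $W = \sum_{n} w_n^{(m)}$, the misclassified points carry $\epsilon_m W \cdot \frac{1-\epsilon_m}{\epsilon_m} = (1-\epsilon_m) W$ after the update, which equals the total weight of the correctly classified points.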
\subsection{Decision trees}
\begin{itemize}
	\item Split input space into rectangles which are aligned along the axes (parallel to axes)
	\item We use sequential binary decisions which can be summarized in a tree structure
	\item Used for classification and regression
	\item \textbf{Advantages}: interpretable, combining with boosting strongly increases performance
	\item \textbf{Disadvantages}: Not state-of-the-art, large trees easily overfit but small trees underfit (can be mitigated by training large trees and then sequentially removing the nodes that reduce the error the least)
	\item The tree is built recursively by greedily minimizing the squared error (for regression). At each iteration, we add the feature boundary that reduces the error the most
	\item Stopping criteria can be, for example, a minimum number of data points in a region, a maximum depth/height, or a decrease of the loss below a certain threshold
	\item \textit{Pruning}: give a penalty to trees with a large number of leaves to prevent unnecessary overfitting
	\item \textbf{Random forests}: By combining bootstrapping and feature bagging, we ensure that the models use different features to build the trees. Thus, the models are less correlated, and the ensemble typically achieves a better accuracy.
\end{itemize}
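One common way to formalize the greedy split and the pruning penalty for regression trees is sketched below. The symbols $j$, $s$, $R_1$, $R_2$, $c_1$, $c_2$, $Q_\tau$, $\lambda$ and $|T|$ are notation introduced here and not taken from the notes above. A split on feature $j$ at threshold $s$ creates the regions $R_1 = \{\bm{x} \,|\, x_j \leq s\}$ and $R_2 = \{\bm{x} \,|\, x_j > s\}$, and we pick the pair $(j,s)$ that minimizes
$$\sum\limits_{\bm{x}_n\in R_1}\left(t_n - c_1\right)^2 + \sum\limits_{\bm{x}_n\in R_2}\left(t_n - c_2\right)^2$$
where $c_1$ and $c_2$ are the mean targets of the points in the respective region. For pruning, a tree $T$ with $|T|$ leaves can be scored by
$$C(T) = \sum\limits_{\tau=1}^{|T|} Q_{\tau}(T) + \lambda |T|$$
with $Q_{\tau}(T)$ being the residual sum of squares in leaf $\tau$. The hyperparameter $\lambda$ trades off the fit against the tree size and would typically be chosen by cross-validation.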

================================================
FILE: Machine_Learning_1/ml_kernel_methods.tex
================================================
\section{Kernel methods}
\begin{itemize}
	\item Standard parametric models have either fixed basis functions (like linear regression or linear classification models) or learnable basis functions like in neural networks. The training points are solely used to optimize the parameters $\bm{w}$, and all further predictions are based on these optimal parameters
	\item In contrast, kernel methods keep the training data points and also use them (or a subset) during prediction
	\item The predictions are based on a linear combination of the kernel function evaluated on the training data points:
	$$y(\bm{x}) = \sum\limits_{n=1}^{N} \alpha_n k\left(\bm{x}, \bm{x}_n\right)$$
	\item For linear models with fixed basis functions, the kernel is:
	$$k(\bm{x},\bm{x}') = \bm{\phi}(\bm{x})^T\bm{\phi}(\bm{x}')$$
	\item The kernel measures \textit{similarity} between $\bm{x}$ and $\bm{x}'$ in the feature space defined by $\bm{\phi}(\bm{x})$. As an inner product, it is symmetric: $$k(\bm{x},\bm{x}')=k(\bm{x}',\bm{x})$$
\end{itemize}
\subsection{Kernelizing linear parametric models}
\begin{itemize}
	\item Many linear parametric models can be recast into a ``dual representation'' by using the \textbf{kernel trick}:
	\begin{itemize}
		\item If we have an algorithm formulated in such a way that the input vector $\bm{x}$ enters only in the form of a scalar product, we can replace the scalar product with some other choice of kernel
	\end{itemize}
	\item For instance, the linear regression model is determined by minimizing the regularized sum-of-squares error function given by:
	$$J(\bm{w}) = \frac{1}{2}\sum\limits_{n=1}^{N} \left\{\bm{w}^T \bm{\phi}\left(\bm{x}_n\right) - t_n \right\}^2 + \frac{\lambda}{2}\bm{w}^T \bm{w}$$
	\item Setting the derivative with respect to $\bm{w}$ to 0, we obtain:
	$$\bm{w} = \left(\bm{\Phi}^T\bm{\Phi} +\lambda \bm{I}_M\right)^{-1}\bm{\Phi}^T\bm{t} = \bm{\Phi}^T\left(\bm{\Phi}\bm{\Phi}^T +\lambda \bm{I}_N\right)^{-1}\bm{t} $$
	\item Here, we can replace the matrix of inner products $\bm{\Phi}\bm{\Phi}^T$ by the Gram matrix $\bm{K}$ where $K_{ij} = \bm{\phi}(\bm{x}_i)^T\bm{\phi}(\bm{x}_j)$
	\item By defining the dual variable $\bm{\alpha} = \left(\bm{K} +\lambda \bm{I}_N\right)^{-1}\bm{t}$, we get the following equations:
	\begin{equation*}
		\begin{split}
			\bm{w} & =\bm{\Phi}^T \bm{\alpha} = \sum\limits_{n=1}^{N} \alpha_n \bm{\phi}(\bm{x}_n)\\
			y\left(\bm{x}'\right) & = \bm{w}^T \bm{\phi}(\bm{x}') = \sum\limits_{n=1}^{N} \alpha_n \bm{\phi}\left(\bm{x}_n\right)^T \bm{\phi}\left(\bm{x}'\right) = \sum\limits_{n=1}^{N} \alpha_n k\left(\bm{x}_n,\bm{x}'\right)
		\end{split}
	\end{equation*}
	\item Thus, we can express linear regression by a dual representation with kernel methods
	\item \textbf{Benefits} of kernel representation:
	\begin{itemize}
		\item We have no explicit parameters/features anymore, only implicit by the kernel function $k(\bm{x},\bm{x}')$
		\item No need to handpick locations of basis functions 
		\item No increase in the number of parameters when using kernels that implicitly map the inputs to a higher-dimensional space
	\end{itemize}
	\item \textbf{Disadvantages}/\textbf{problems}:
	\begin{itemize}
		\item The computational cost to retrieve $\bm{\alpha}$ is $\mathcal{O}(N^3)$ as $\bm{K}\in\mathbb{R}^{N\times N}$, compared to $\mathcal{O}(M^3)$ for calculating $\bm{w}$ in the standard way (the cost comes from the matrix inverse)
		\item During prediction, we need $\mathcal{O}(N\cdot M)$ to compute the output for a new point (one kernel evaluation per training point), but would only need $\mathcal{O}(M)$ with the primal parameters $\bm{w}$ $\Rightarrow$ slow prediction for large datasets
	\end{itemize}
\end{itemize}
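The step from the primal to the dual form of $\bm{w}$ relies on a matrix identity that can be verified in two lines. This is a sketch with $\bm{\Phi}\in\mathbb{R}^{N\times M}$, so that the first identity matrix is $M\times M$ and the second one is $N\times N$. Start from
$$\bm{\Phi}^T\left(\bm{\Phi}\bm{\Phi}^T +\lambda \bm{I}_N\right) = \bm{\Phi}^T\bm{\Phi}\bm{\Phi}^T +\lambda \bm{\Phi}^T = \left(\bm{\Phi}^T\bm{\Phi} +\lambda \bm{I}_M\right)\bm{\Phi}^T$$
For $\lambda > 0$ both bracketed matrices are invertible, so multiplying with $\left(\bm{\Phi}^T\bm{\Phi} +\lambda \bm{I}_M\right)^{-1}$ from the left and $\left(\bm{\Phi}\bm{\Phi}^T +\lambda \bm{I}_N\right)^{-1}$ from the right gives
$$\left(\bm{\Phi}^T\bm{\Phi} +\lambda \bm{I}_M\right)^{-1}\bm{\Phi}^T = \bm{\Phi}^T\left(\bm{\Phi}\bm{\Phi}^T +\lambda \bm{I}_N\right)^{-1}$$
The left-hand side inverts an $M\times M$ matrix and the right-hand side an $N\times N$ matrix, which is where the $\mathcal{O}(M^3)$ versus $\mathcal{O}(N^3)$ cost comparison comes from.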
\subsubsection{Constructing valid kernels}
\begin{itemize}
	\item For a valid kernel, the Gram matrix $\bm{K}$ must be positive semi-definite for all possible choices of $\left\{\bm{x}_n\right\}_{n=1}^{N}$
	\item An equivalent constraint is that $\bm{z}^T \bm{K} \bm{z} \geq 0$ for all $\bm{z}\in\mathbb{R}^{N}$, or that all eigenvalues of $\bm{K}$ are non-negative (note that $\bm{K}$ can still contain negative elements)
	\item We can construct a kernel from an explicit set of basis functions by using the expression $k\left(\bm{x},\bm{x}'\right)=\bm{\phi}(\bm{x})^T\bm{\phi}(\bm{x}')$
	\item Further, we can construct new kernels from other valid kernels, for example by multiplying with a positive constant (no need to know all variations)
	\item Given a valid kernel function, we can derive its corresponding feature vectors (which can be hard, and they are possibly infinite-dimensional). To do so, we need to express it in the form of $\bm{\phi}(\bm{x})^T \bm{\phi}(\bm{x}')$ where $\bm{\phi}$ must be the same function applied to different points
	\item For example, a polynomial kernel of degree $M=2$ on two-dimensional inputs can be rewritten as:
	\begin{equation*}
		\begin{split}
			k\left(\bm{x},\bm{z}\right) & = \left(1+\bm{x}^T\bm{z}\right)^2 = \left(1 + x_1 z_1 + x_2 z_2\right)^2\\
			& = 1 + 2x_1 z_1 + 2x_2 z_2 + (x_1 z_1)^2 + (x_2 z_2)^2 + 2 x_1 z_1 x_2 z_2\\
			& = \left[1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2\right] \cdot \left[1, \sqrt{2}z_1, \sqrt{2}z_2, z_1^2, z_2^2, \sqrt{2}z_1 z_2\right]^T\\
			& = \bm{\phi}(\bm{x})^T \bm{\phi}(\bm{z})
		\end{split}
	\end{equation*}
	\item Here we see that the kernel implicitly maps a two-dimensional input vector to a 6-dimensional feature vector
	\item Some (popular) kernels:
	\begin{itemize}
		\item Generalized polynomial kernel $k\left(\bm{x}, \bm{x}'\right) = \left(c + \bm{x}^T \bm{x}'\right)^{M}$ (feature vector only contains polynomial to order $M$)
		\item Gaussian kernel with infinite feature dimensionality: $k\left(\bm{x}, \bm{x}'\right) = \exp\left(-\frac{1}{2l^2} ||\bm{x}-\bm{x}'||^2\right)$
		\item Radial basis functions of the form $k\left(\bm{x}, \bm{x}'\right) = k\left(||\bm{x}-\bm{x}'||^2\right)$
	\end{itemize}
\end{itemize}
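As a sketch of how the construction rules can be used, the validity of the Gaussian kernel can be argued by expanding the squared distance $||\bm{x}-\bm{x}'||^2 = \bm{x}^T\bm{x} - 2\bm{x}^T\bm{x}' + \bm{x}'^T\bm{x}'$:
$$k\left(\bm{x}, \bm{x}'\right) = \exp\left(-\frac{\bm{x}^T\bm{x}}{2l^2}\right)\exp\left(\frac{\bm{x}^T\bm{x}'}{l^2}\right)\exp\left(-\frac{\bm{x}'^T\bm{x}'}{2l^2}\right)$$
The middle factor is the exponential of a scaled linear kernel. Its power series has only non-negative coefficients, so it is a sum of valid polynomial kernels of every order, which is why the implicit feature vector is infinite-dimensional. The outer factors have the form $f(\bm{x})\,k_1(\bm{x},\bm{x}')\,f(\bm{x}')$, which again yields a valid kernel. The rules used here (positive scaling, sums, products, and multiplication by $f(\bm{x})f(\bm{x}')$) are standard closure properties of kernels, stated without proof.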
\subsection{Support Vector Machines}
\begin{itemize}
	\item To overcome the slow prediction problem, support vector machines only use a subset of the training points on which the kernel function needs to be evaluated (also called kernel methods with \textit{sparse} solutions)
	\item It is a convex optimization problem so that only one single optimum exists
	\item No good probabilistic interpretation (see Gaussian Processes for that)
\end{itemize}
\subsubsection{Maximum Margin Classifier}
\begin{itemize}
	\item Similar to discriminant functions in 3.3
	\item For a linearly separable dataset, the margin is defined as the distance between the decision boundary and the closest training point. The maximum margin classifier chooses the boundary that maximizes it $\Rightarrow$ most robust and stable under perturbations of the input
	\item The distance of a point to the decision boundary is (as previously) defined by:
	$$r_n = \frac{|y(\bm{x}_n)|}{||w||} = \frac{t_ny(\bm{x}_n)}{||w||}\text{\hspace{2mm} if } \bm{x}_n \text{ correctly classified}$$ 
	\item The margin is defined as the minimum distance of decision boundary to any point:
	$$\min_n \frac{t_n \left(\bm{w}^T\bm{x}_n + b\right)}{||\bm{w}||} $$
	\item As rescaling $\bm{w}$ and $b$ by a factor $\kappa > 0$ does not change the distance ($\min_n \frac{t_n \left(\kappa \bm{w}^T\bm{x}_n + \kappa b\right)}{||\kappa \bm{w}||}$ is the same for every $\kappa$), we remove this freedom by setting $t_n \left(\bm{w}^T\bm{x}_n + b\right) = 1$ for the closest point.
	\item Thus, for all other points, the following constraint must hold: $t_n \left(\bm{w}^T\bm{x}_n + b\right) \geq 1$
	\item A maximum margin is found by maximizing $\frac{1}{||\bm{w}||}$ (as the upper part of the fraction is fixed to 1)
\end{itemize}
\subsubsection{Optimizing Maximum Margin}
\begin{itemize}
	\item To maximize the margin, we minimize $\frac{1}{2}||\bm{w}||^2$ (has the same optimum as maximizing $\frac{1}{||\bm{w}||}$ but is easier to optimize)
	\item While doing so, we need to fulfill the constraint $t_n \left(\bm{w}^T\bm{x}_n + b\right) \geq 1$ for all data points
	\item To do that, we use Lagrange multipliers for inequality constraints
	\begin{itemize}
		\item Given the problem to maximize $f(\bm{x})$ subject to $g(\bm{x})\geq 0$, it is equivalent to optimize:
		$$\max_{\bm{x}} \min_{\mu} L\left(\bm{x},\mu\right) = \max_{\bm{x}} \min_{\mu} f(\bm{x}) + \mu g(\bm{x})$$
		\item Note that if we want to minimize $f(\bm{x})$, it is equivalent to maximizing $-f(\bm{x})$: 
		$$\max_{\bm{x}} \min_{\mu} L\left(\bm{x},\mu\right) = \max_{\bm{x}} \min_{\mu} -f(\bm{x}) + \mu g(\bm{x}) \Rightarrow \min_{\bm{x}} \max_{\mu} L\left(\bm{x},\mu\right) = \min_{\bm{x}} \max_{\mu} f(\bm{x}) - \mu g(\bm{x})$$
		\item We have the following (Karush-Kuhn-Tucker) conditions when optimizing this function:
		$$\mu\geq0, \text{\hspace{5mm}}g(\bm{x})\geq0, \text{\hspace{5mm}}\mu\cdot g(\bm{x}) = 0$$
		\item There are two kinds of solutions:
		\begin{itemize}
			\item If the stationary point lies in the region $g(\bm{x})\geq 0$, we have $\nabla f(\bm{x})=0$ and $\mu=0$
			\item Otherwise, if the stationary point lies on the boundary, we have $\nabla f(\bm{x})=-\mu \nabla g(\bm{x})$
		\end{itemize}
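		\item We can solve the optimization problem by first getting a solution for $\tilde{L}(\mu) = \max_{\bm{x}} L(\bm{x},\mu)$, and then optimizing it with respect to $\mu$: $\min_{\mu\geq 0}\tilde{L}(\mu)$. For the minimization form, the roles are swapped: $\tilde{L}(\mu) = \min_{\bm{x}} L(\bm{x},\mu)$ followed by $\max_{\mu\geq 0}\tilde{L}(\mu)$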
	\end{itemize}
	\item When we apply this for our maximum margin classifier, we get the following optimization objective with $N$ Lagrange multipliers $a_n$:
	$$L\left(\bm{w},b,\bm{a}\right) = \frac{1}{2}||\bm{w}||^2 - \sum\limits_{n=1}^{N} a_n \left\{t_n \left(\bm{w}^T \bm{x}_n + b\right) - 1\right\}$$
	\item First, minimize with respect to the primal variables $\bm{w}$ and $b$:
	\begin{equation*}
		\begin{split}
			\frac{\partial L}{\partial \bm{w}} & = \bm{w}^T - \sum\limits_{n=1}^{N} a_n t_n \bm{x}_n^T = 0\to \bm{w} = \sum\limits_{n=1}^{N}a_n t_n \bm{x}_n\\
			\frac{\partial L}{\partial b} & = - \sum\limits_{n=1}^{N} a_n t_n = 0\to \sum\limits_{n=1}^{N} a_n t_n = 0\\
		\end{split}
	\end{equation*}
	\item Eliminating $\bm{w}$ and $b$ gives the dual representation $\tilde{L}(\bm{a})$, which is maximized subject to $a_n \geq 0$ and $\sum_{n=1}^{N} a_n t_n = 0$:
	\begin{equation*}
		\begin{split}
			\tilde{L}(\bm{a}) = \sum\limits_{n=1}^{N} a_n - \frac{1}{2} \sum\limits_{n=1}^{N}\sum\limits_{m=1}^{N} a_n a_m t_n t_m \underbrace{\bm{x}_n^T \bm{x}_m}_{k(\bm{x}_n, \bm{x}_m)}
		\end{split}
	\end{equation*}
	\item For prediction, we use the previously derived result $\bm{w} = \sum_{n=1}^{N}a_n t_n \bm{x}_n$ to express the model in terms of the kernel:
	$$y(\bm{x})=\bm{w}^T \bm{x} + b = \sum_{n=1}^{N}a_n t_n k(\bm{x}_n, \bm{x}) + b$$
	\item For every data point, either $a_n = 0$ or $t_n y(\bm{x}_n) = 1$. Only the points with $a_n > 0$ influence the prediction; they are called support vectors and lie on the maximum margin hyperplanes
	\item The bias $b$ can be determined by solving $t_n y(\bm{x}_n) = 1$ for a support vector $\bm{x}_n$
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/svm_support_vectors.png}
		\caption{Visualization of non-linear support vectors}
		\label{img:svm_support_vectors}
	\end{figure}
\end{itemize}
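The elimination of $\bm{w}$ and $b$ can be spelled out term by term. This is a sketch that substitutes $\bm{w} = \sum_{n=1}^{N} a_n t_n \bm{x}_n$ and $\sum_{n=1}^{N} a_n t_n = 0$ into the Lagrangian:
\begin{equation*}
	\begin{split}
		\frac{1}{2}||\bm{w}||^2 & = \frac{1}{2}\sum\limits_{n=1}^{N}\sum\limits_{m=1}^{N} a_n a_m t_n t_m \bm{x}_n^T \bm{x}_m\\
		\sum\limits_{n=1}^{N} a_n t_n \bm{w}^T \bm{x}_n & = \sum\limits_{n=1}^{N}\sum\limits_{m=1}^{N} a_n a_m t_n t_m \bm{x}_n^T \bm{x}_m\\
		b\sum\limits_{n=1}^{N} a_n t_n & = 0
	\end{split}
\end{equation*}
Adding the remaining term $\sum_{n} a_n$, the first line minus the second leaves $\tilde{L}(\bm{a}) = \sum_{n} a_n - \frac{1}{2}\sum_{n}\sum_{m} a_n a_m t_n t_m \bm{x}_n^T \bm{x}_m$. For the bias, multiplying $t_n y(\bm{x}_n) = 1$ by $t_n$ and using $t_n^2 = 1$ gives, for any support vector $\bm{x}_n$,
$$b = t_n - \sum\limits_{m\in\mathcal{S}} a_m t_m k(\bm{x}_n, \bm{x}_m)$$
where $\mathcal{S}$ (notation introduced here) denotes the indices of the support vectors. Averaging this expression over all support vectors is a common, numerically more stable choice.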
\subsubsection{Soft Margin Classifier}
\begin{itemize}
	\item So far, we assumed that the dataset is separable (linearly, or non-linearly via a kernel). However, sometimes the class distributions overlap
	\item Thus, soft margin classifiers allow data points to be on the ``wrong'' side of the margin, at the cost of a certain penalty
	\item We introduce \textbf{slack variables} $\xi_n\geq 0$ for $n=1,...,N$
	\item If a point is on the correct side of the margin, its slack variable is $\xi_n = 0$
	\item If it is on the wrong side of the margin, the slack variable is $\xi_n = |t_n - y(\bm{x}_n)|$
	\item Hence, we also have a ``soft'' constraint/margin $t_n y(\bm{x}_n)\geq 1 - \xi_n$
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/svm_soft_margin.png}
		\caption{A soft margin classifier uses slack variables to penalize data points on the wrong side.}
		\label{img:svm_soft_margin}
	\end{figure}
	\item The goal is now to maximize the margin while minimizing the penalty given by the slack variables:
	$$\arg\min_{\bm{w},b,\bm{\xi}} \frac{1}{2} ||\bm{w}||^2 + C\sum\limits_{n=1}^{N} \xi_n$$
	\item Introducing the conditions $\xi_n\geq 0$ and $t_n y(\bm{x}_n)\geq 1 - \xi_n$ into the minimization problem, we get the following Lagrangian:
	$$L\left(\bm{w}, b, \bm{\xi}, \bm{a}, \bm{\mu}\right) = \frac{1}{2} ||\bm{w}||^2 +C \sum\limits_{n=1}^{N} \xi_n - \sum\limits_{n=1}^{N} a_n \left\{t_n \left(\bm{w}^T \bm{x}_n + b\right) - 1 + \xi_n \right\} - \sum\limits_{n=1}^{N} \mu_n \xi_n $$
	\item The KKT conditions for the dual variables are:
	\begin{equation*}
		\begin{split}
			& a_n \geq 0,\text{\hspace{3mm}} t_n y(\bm{x}_n) - 1 + \xi_n \geq 0,\text{\hspace{3mm}} a_n \left\{t_n \left(\bm{w}^T \bm{x}_n + b\right) - 1 + \xi_n \right\} = 0\\
			& \mu_n \geq 0,\text{\hspace{3mm}} \xi_n \geq 0,\text{\hspace{3mm}} \mu_n \xi_n = 0\\
		\end{split}
	\end{equation*}
	\item Minimize w.r.t. primal variables $\bm{w}, b, \bm{\xi}$ and use these conditions to eliminate $\bm{w}, b, \bm{\xi}$ from the Lagrangian to obtain the \textbf{dual representation}
	\begin{equation*}
		\begin{split}
			\frac{\partial L}{\partial \bm{w}} & = \bm{w}^T - \sum\limits_{n=1}^{N} a_n t_n \bm{x}_n^T = 0 \implies \bm{w} = \sum\limits_{n=1}^{N} a_n t_n \bm{x}_n\\
			\frac{\partial L}{\partial b} & = - \sum\limits_{n=1}^{N} a_n t_n = 0 \implies \sum\limits_{n=1}^{N} a_n t_n = 0\\
			\frac{\partial L}{\partial \xi_n} & = C - a_n - \mu_n = 0 \implies a_n  = C - \mu_n\\
			\Rightarrow \tilde{L}(\bm{a}) & = \sum\limits_{n=1}^{N} a_n - \frac{1}{2}  \sum\limits_{n=1}^{N}  \sum\limits_{m=1}^{N} a_n a_m t_n t_m \bm{x}_n^T \bm{x}_m
		\end{split}
	\end{equation*}
	\item The remaining constraints are $0\leq a_n \leq C$, and $\sum\limits_{n=1}^{N} a_n t_n = 0$, and we try to \textit{maximize} $\tilde{L}(\bm{a})$
	\item We can also express the dual representation with the kernel trick:
	$$\tilde{L}(\bm{a}) = \sum\limits_{n=1}^{N} a_n - \frac{1}{2}  \sum\limits_{n=1}^{N}  \sum\limits_{m=1}^{N} a_n a_m t_n t_m k(\bm{x}_n, \bm{x}_m)$$
	\item When we want to predict the class for a new test data point $\bm{x}'$, we use:
	$$y(\bm{x}') = \sum\limits_{n=1}^{N} a_n t_n k(\bm{x}_n, \bm{x}') + b$$
	\item Points for different dual parameters:
	\begin{itemize}
		\item Only points with $a_n > 0$ are support vectors and contribute to the prediction
		\item If $C > a_n > 0$, then $t_n y(\bm{x}_n) = 1$ (points on the margin) as $\mu_n > 0$ and hence $\xi_n = 0$
		\item If $a_n = C$, then $\mu_n = 0$ and $\xi_n \geq 0$. When $\xi_n \leq 1$, the point is still correctly classified but lies inside the margin. Otherwise ($\xi_n > 1$), the point is misclassified
	\end{itemize}
	\item If $C\to\infty$, we recover the hard margin classifier again as we don't allow any outliers
	\item If $C\to 0$, the margin gets really large as we maximize the margin without caring about the misclassifications. Also, all points then lie on or inside the margin and become support vectors
\end{itemize}
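To see where the dual representation $\tilde{L}(\bm{a})$ comes from, the elimination step can be written out as a short worked derivation. It only uses the three stationarity conditions derived above: $a_n = C - \mu_n$ removes all terms in $\xi_n$, and $\sum_{n} a_n t_n = 0$ removes the term in $b$.
\begin{equation*}
	\begin{split}
		L & = \frac{1}{2} ||\bm{w}||^2 - \sum\limits_{n=1}^{N} a_n t_n \bm{w}^T \bm{x}_n - b \sum\limits_{n=1}^{N} a_n t_n + \sum\limits_{n=1}^{N} a_n + \sum\limits_{n=1}^{N} \left(C - a_n - \mu_n\right) \xi_n\\
		& = \frac{1}{2} \bm{w}^T\bm{w} - \bm{w}^T\bm{w} + \sum\limits_{n=1}^{N} a_n\\
		& = \sum\limits_{n=1}^{N} a_n - \frac{1}{2}  \sum\limits_{n=1}^{N}  \sum\limits_{m=1}^{N} a_n a_m t_n t_m \bm{x}_n^T \bm{x}_m
	\end{split}
\end{equation*}
The second line uses $\bm{w} = \sum_{n} a_n t_n \bm{x}_n$, so that $\sum_{n} a_n t_n \bm{w}^T \bm{x}_n = \bm{w}^T\bm{w}$.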
\subsection{Gaussian Processes}
\subsubsection{Essentials of Gaussian distributions}
\begin{itemize}
	\item \textbf{Marginalization property}: if two random variables $x_1$ and $x_2$ are jointly Gaussian distributed, then marginalizing out one variable still leads to a Gaussian
	$$p\left(\left[\begin{array}{c}
	x_1 \\ x_2
	\end{array}\right]\right) = \mathcal{N}\left(\left[\begin{array}{c}
	x_1\\x_2
	\end{array}\right]|\left[\begin{array}{c}
	\mu_1\\\mu_2
	\end{array}\right], \left[\begin{array}{cc}
	\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} 
	\end{array}\right]\right)\implies\hspace{1mm} p(x_1) = \mathcal{N}(\mu_1, \Sigma_{11}),\hspace{3mm} p(x_2) = \mathcal{N}(\mu_2, \Sigma_{22})$$
	\item \textbf{Conditional property}: if two random variables $x_1$ and $x_2$ are jointly Gaussian distributed, then conditioning one variable on the other still leads to a Gaussian
	$$p(x_1|x_2) = \mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$$
	\item \textbf{Sum property}: Summing two independent Gaussian random variables leads to a new Gaussian variable:
	$$x\sim \mathcal{N}(\mu_1, \Sigma_1) \text{\hspace{1mm}and\hspace{1mm}} y\sim \mathcal{N}(\mu_2, \Sigma_2) \implies x+y=z\sim\mathcal{N}(\mu_1+\mu_2, \Sigma_1+\Sigma_2) $$ 
	\item \textbf{Correlation property}: If $\bm{x}$ is an uncorrelated Gaussian random variable $\bm{x}\sim\mathcal{N}(\bm{0}, \bm{I})$, then $\bm{y} = \bm{\mu} + \bm{A}\bm{x}$ is a correlated Gaussian random variable with $\bm{y}\sim\mathcal{N}(\bm{\mu}, \bm{A}\bm{A}^T)$
\end{itemize}
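For reference, the parameters of the conditional distribution can be stated explicitly. This is the standard result for partitioned Gaussians, written in the block notation of the marginalization property:
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}\left(x_2 - \mu_2\right), \hspace{5mm} \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$
Note that the conditional mean is linear in the observed $x_2$, while the conditional covariance does not depend on the value of $x_2$ at all.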
\subsubsection{Introduction to Gaussian Processes}
\begin{itemize}
	\item In Bayesian linear regression, we assume that the target is distributed as $t=\bm{\phi}\left(\bm{x}\right)^T \bm{w} + \epsilon$ where $\epsilon\sim\mathcal{N}(0,\beta^{-1})$. The posterior is also Gaussian distributed: $p(\bm{w}|\bm{X},\bm{t}) = \mathcal{N}(\bm{w}|\bm{m}_N, \bm{S}_N)$.
	\item When we predict for new points, we use the mean $\mu_N = \sum_{n=1}^{N}\beta \bm{\phi}(\bm{x})^T \bm{S}_N \bm{\phi}(\bm{x}_n)t_n$ % and variance $\sigma_N^2(\bm{x})=\beta^{-1} + \bm{\phi}(\bm{x})^T \bm{S}_N \bm{\phi}(\bm{x})$
	\item Here we see that we can express the mean by the kernel $k(\bm{x}_n, \bm{x}_m) = \bm{\phi}(\bm{x}_n)^T \bm{S}_N \bm{\phi}(\bm{x}_m)$ $\Rightarrow$ increase expressiveness of Bayesian linear regression by using more complex kernels
	\item Definition of Gaussian Processes: A Gaussian process is a collection of random variables, any finite number of which is jointly Gaussian distributed
	\item Gaussian Processes represent distributions over random functions!
	$$f(\circ) \sim \mathcal{N}(m(\circ), k(\circ, \circ))$$
	\item The function \textit{evaluated} at a specific point $\bm{x}$ is a random variable, with $\mathbb{E}[f(\bm{x})] = m(\bm{x})$ and $\text{cov}(f(\bm{x}), f(\bm{x}')) = k(\bm{x}, \bm{x}')$ (covariance matrix is the gram matrix $K$)
	\item Thus, for a finite set of points $\left\{\bm{x}_1, ...,\bm{x}_N\right\}$, the random variables $\left\{f(\bm{x}_1), ...,f(\bm{x}_N)\right\}$ are:
	$$p\left(\begin{bmatrix}
	f(\bm{x}_1)\\
	\vdots\\
	f(\bm{x}_N)\\
	\end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix}
	m(\bm{x}_1)\\
	\vdots\\
	m(\bm{x}_N)\\
	\end{bmatrix}, \begin{bmatrix}
	k(\bm{x}_1, \bm{x}_1) & \cdots & k(\bm{x}_1, \bm{x}_N)\\
	\vdots & \ddots & \vdots\\
	k(\bm{x}_N, \bm{x}_1)& \cdots & k(\bm{x}_N, \bm{x}_N)\\
	\end{bmatrix}\right)$$
	\item Each entry is the sampled function evaluated at point $\bm{x}$. We can evaluate/sample functions by just using a fine-grained set of points
	\item The kernel has a significant influence on what the sampled functions look like. When we consider the kernel $k(\bm{x}_n, \bm{x}_m) = \theta_0 \exp\left(-\frac{1}{2\theta_1}||\bm{x}_n - \bm{x}_m||^2\right) + \theta_2 + \theta_3 \bm{x}_n^T \bm{x}_m$, we see that:
	\begin{itemize}
		\item $\theta_0$ influences the amplitude of the samples of $f$
		\item $\theta_1$ scales the length of correlation
		\item $\theta_2$ introduces a random bias for sampled $f$ (different bias for every sample)
		\item $\theta_3$ adds a linear component to the samples, leading to an upward/downward trend
	\end{itemize}
\end{itemize}
\subsubsection{Regression with Gaussian Processes}
\begin{itemize}
	\item We have observed data which we model by $f(\bm{x}_i) = y(\bm{x}_i) + \epsilon$ ($\epsilon\sim\mathcal{N}(0,\beta^{-1})$)
	\item We can now model $y$ as GP: 
	$$p\left(\begin{bmatrix}
	y(\bm{x}_1)\\
	\vdots\\
	y(\bm{x}_N)\\
	\end{bmatrix}\right) = \mathcal{N}\left(\bm{0}, \begin{bmatrix}
	k(\bm{x}_1, \bm{x}_1) & \cdots & k(\bm{x}_1, \bm{x}_N)\\
	\vdots & \ddots & \vdots\\
	k(\bm{x}_N, \bm{x}_1)& \cdots & k(\bm{x}_N, \bm{x}_N)\\
	\end{bmatrix}\right)$$
	\item Then, $f(\circ)$ is also a GP since $\bm{f} = \bm{y} + \bm{\epsilon}$ is a sum of independent Gaussians ($\bm{f}\sim \mathcal{N}(\bm{0}, K(\bm{X}, \bm{X}) + \beta^{-1}\bm{I})$)
	\item For new test data points, we can predict them by using:
	$$p\left(\begin{bmatrix}
	\bm{f}\\
	\bm{f}^{*}\\
	\end{bmatrix}\right) = \mathcal{N}\left(\bm{0}, \begin{bmatrix}
	K(\bm{X}, \bm{X}) + \beta^{-1}\bm{I} & K(\bm{X},\bm{X}^{*})\\
	K(\bm{X}^{*},\bm{X}) & K(\bm{X}^{*}, \bm{X}^{*}) + \beta^{-1}\bm{I}\\
	\end{bmatrix}\right)$$
	\item The more points we observe, the more certain our predictions get
	\item Kernel parameters can be chosen by maximum likelihood estimation (MLE) on the training observations
\end{itemize}
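As a sketch of how predictions follow from the joint distribution over $\bm{f}$ and $\bm{f}^{*}$: applying the standard conditioning rule for jointly Gaussian variables (with zero mean, as assumed above) and abbreviating $\bm{C} = K(\bm{X}, \bm{X}) + \beta^{-1}\bm{I}$ (a shorthand introduced only for this example) gives
$$p\left(\bm{f}^{*}|\bm{f}\right) = \mathcal{N}\left(\bm{f}^{*}\,|\,K(\bm{X}^{*},\bm{X})\,\bm{C}^{-1}\bm{f},\hspace{2mm} K(\bm{X}^{*}, \bm{X}^{*}) + \beta^{-1}\bm{I} - K(\bm{X}^{*},\bm{X})\,\bm{C}^{-1}K(\bm{X},\bm{X}^{*})\right)$$
The subtracted term is positive semi-definite, so conditioning on observations can only shrink the predictive covariance compared to the prior.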

================================================
FILE: Machine_Learning_1/ml_linear_classification.tex
================================================
\section{Linear classification}
\begin{itemize}
	\item Input $\bm{x}=\left(x_1, x_2, ..., x_D\right)^T$ with $\bm{x}\in\mathbb{R}^D$.
	\item Target $t\in\left\{C_1, C_2, ..., C_K\right\}$ with $K$ classes (one-hot representation)
	\item Goal: divide input space $\mathbb{R}^D$ into $K$ decision regions $R_k$ with $k=1,...,K$
	\item Boundaries of decision regions are called \textit{decision boundaries/surfaces}
	\begin{itemize}
		\item Linear classification only considers \textit{linear} decision boundaries $\Rightarrow$ $D-1$ dimensional hyperplanes
		\item A dataset is \textit{linearly separable}, if its classes can be exactly separated by linear decision boundaries
	\end{itemize}
	\item First, we derive the optimal solution for decision boundaries in general (3.1 Decision Theory), and then look at different models for deriving such solutions (3.2-3.4) 
\end{itemize}
\subsection{Decision Theory For Classification}
\begin{itemize}
	\item For every observed datapoint $\bm{x}_n$: label/ground truth $t_n=C_j$, prediction $C_k$ (i.e. $\bm{x}_n\in R_k$)
	\item Confusion matrix (row: GT class, columns: prediction region/class)
	$$\begin{blockarray}{ccccc}
	& R_1 & R_2 & \dots & R_K \\
	\begin{block}{c(cccc)}
	C_1 & 6 & 1 & \dots & 0 \\
	C_2 & 4 & 2 & \dots & 3 \\
	\vdots & \vdots & \vdots & \ddots & \vdots \\
	C_K & 1 & 0 & \dots & 5 \\
	\end{block}
	\end{blockarray}$$
	\begin{itemize}
		\item The elements on the diagonal represent the correctly classified examples
		\item Try to minimize misclassified examples (off-diagonal elements), or the probability of a mistake: $p\left(\text{mistake}\right) = 1 - \sum\limits_{k=1}^{K}p(\bm{x}\in R_k, C_k)$
	\end{itemize}
	\item Assign $\bm{x}$ to class $C_k$ if $\forall j\neq k: p\left(\bm{x}, t=C_k\right) > p\left(\bm{x}, t=C_j\right)$ $\Rightarrow$ $p\left(C_k|\bm{x}\right) > p\left(C_j|\bm{x}\right)$
	\item Optimal decision boundary where $p\left(C_k|\bm{x}\right) = p\left(C_j|\bm{x}\right)$
	\item Problem: \textit{class imbalance} $\Rightarrow$ possible solution: weighted loss for balancing the importance of each class
	\item For imbalanced datasets, assign $\bm{x}$ to $C_k$ if $\sum\limits_{j=1}^{K}L_{jk}p\left(\bm{x},C_j\right)$ is minimal 
	\begin{itemize}
		\item $L_{jk}$ is the misclassification weight matrix (cost of predicting $C_k$ when the true class is $C_j$) where $L_{ii}=0$
		\item Example for dataset with 1\% cancer patients: 
		$$L = \begin{blockarray}{ccc}
		\text{pred. cancer} & \text{pred. healthy} & \\
		\begin{block}{(cc)c}
		0 & 1000 & \text{true cancer} \\
		1 & 0 & \text{true healthy} \\
		\end{block}
		\end{blockarray}$$
	\end{itemize}
\end{itemize}
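As a short worked example of this rule with the loss matrix above (rows indexed by the true class, columns by the predicted class): the expected loss is $1\cdot p(\bm{x}, \text{healthy})$ for predicting cancer and $1000\cdot p(\bm{x}, \text{cancer})$ for predicting healthy. Dividing both by $p(\bm{x})$, we predict cancer whenever
$$p(\text{healthy}|\bm{x}) < 1000\, p(\text{cancer}|\bm{x}) \iff 1 - p(\text{cancer}|\bm{x}) < 1000\, p(\text{cancer}|\bm{x}) \iff p(\text{cancer}|\bm{x}) > \frac{1}{1001} \approx 0.001$$
With these weights, a patient is therefore flagged as soon as the posterior probability of cancer exceeds roughly $0.1\%$, instead of $50\%$ under the unweighted rule.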
\subsection{Probabilistic generative models}
\begin{itemize}
	\item Model the class-conditional densities $p\left(\bm{x}|C_k\right)$ \textbf{and} the prior class probabilities $p(C_k)$ to compute the posterior probabilities $p\left(C_k|\bm{x}\right)$ via Bayes' rule (we know from Decision Theory that the optimal decision boundaries lie where $p\left(C_k|\bm{x}\right)=p\left(C_j|\bm{x}\right)$)
	\item For $K=2$, the posterior is: $p\left(C_1|\bm{x}\right) = \frac{p\left(\bm{x}|C_1\right)p\left(C_1\right)}{p\left(\bm{x}\right)} = \frac{p\left(\bm{x}|C_1\right)p\left(C_1\right)}{p\left(\bm{x}|C_1\right)p\left(C_1\right) + p\left(\bm{x}|C_2\right)p\left(C_2\right)}$
	\item We can simplify the previous equation by using the sigmoid function:
	\begin{equation*}
		\begin{split}
			p\left(C_1|\bm{x}\right) & = \frac{1}{1 + \frac{p\left(\bm{x}|C_2\right)p\left(C_2\right)}{p\left(\bm{x}|C_1\right)p\left(C_1\right)} } = \frac{1}{1+\exp\left(-a\right)} = \sigma(a) \text{\hspace{5mm}where\hspace{5mm}} a=\ln\frac{\sigma}{1-\sigma} = \ln \frac{p\left(\bm{x}|C_1\right)p\left(C_1\right)}{p\left(\bm{x}|C_2\right)p\left(C_2\right)}
		\end{split}
	\end{equation*}
	\item For general $K$: $p\left(C_k|\bm{x}\right) = \frac{p\left(\bm{x}|C_k\right)p\left(C_k\right)}{\sum_{j=1}^{K}p\left(\bm{x}|C_j\right)p\left(C_j\right)} = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)}$ with $a_k = \ln \left[ p\left(\bm{x}|C_k\right)p\left(C_k\right)\right]$ (softmax)
	\item In the special case of $K=2$: $a=a_1-a_2$
\end{itemize}
\subsubsection{Continuous inputs}
\begin{itemize}
	\item Assume that the class-conditional  densities are Gaussian:
	$$p\left(\bm{x}|C_k\right) = \mathcal{N}\left(\bm{x}|\bm{\mu}_k,\bm{\Sigma}_k\right) = \frac{1}{\left(2\pi\right)^{D/2}}\frac{1}{|\bm{\Sigma}_k|^{1/2}}\exp\left\{-\frac{1}{2}\left(\bm{x}-\bm{\mu}_k\right)^T\bm{\Sigma}_k^{-1}\left(\bm{x}-\bm{\mu}_k\right)\right\}$$
	\item We assume that all classes share the same covariance: $\bm{\Sigma}_k = \bm{\Sigma}$\\$\Rightarrow$ We are able to apply \textbf{linear discriminant analysis} (otherwise, decision boundaries would be quadratic)
	\item Determining posterior for $K=2$:
	\begin{equation*}
		\begin{split}
			a & = \ln \frac{p\left(\bm{x}|C_1\right)p\left(C_1\right)}{p\left(\bm{x}|C_2\right)p\left(C_2\right)} = \ln \mathcal{N}\left(\bm{x}|\bm{\mu}_1,\bm{\Sigma}\right) - \ln \mathcal{N}\left(\bm{x}|\bm{\mu}_2,\bm{\Sigma}\right) + \ln \frac{p\left(C_1\right)}{p\left(C_2\right)}\\
			& = \left(\bm{\mu}_1 - \bm{\mu}_2\right)^T\bm{\Sigma}^{-1}\bm{x} - \frac{1}{2}\bm{\mu}_1^T\bm{\Sigma}^{-1}\bm{\mu}_1 + \frac{1}{2}\bm{\mu}_2^T\bm{\Sigma}^{-1}\bm{\mu}_2 + \ln \frac{p\left(C_1\right)}{p\left(C_2\right)}
		\end{split}
	\end{equation*}
	\item Thus, the posterior can be expressed by
	\begin{equation*}
		\begin{split}
			p(C_1|\bm{x})=\sigma\left(\bm{w}^T\bm{x}+w_0\right) \text{\hspace{3mm}where\hspace{2mm}} & \bm{w} = \bm{\Sigma}^{-1}\left(\bm{\mu}_1 - \bm{\mu}_2\right)\\
			& w_0 = -\frac{1}{2}\bm{\mu}_1^T\bm{\Sigma}^{-1}\bm{\mu}_1 + \frac{1}{2}\bm{\mu}_2^T\bm{\Sigma}^{-1}\bm{\mu}_2 + \ln \frac{p\left(C_1\right)}{p\left(C_2\right)}
		\end{split}
	\end{equation*} 
	\item For general $K$, we get $p\left(C_k|\bm{x}\right) = \frac{\exp\left(a_k\left(\bm{x}\right)\right)}{\sum_{j=1}^{K}\exp\left(a_j\left(\bm{x}\right)\right)}$ with (dropping the terms of $\ln \left[p\left(\bm{x}|C_k\right)\cdot p\left(C_k\right)\right]$ that are identical for all classes, as they cancel in the softmax)
	\begin{equation*}
	\begin{split}
	a_k\left(\bm{x}\right) =\bm{w}_k^T\bm{x}+w_{k0} \text{\hspace{3mm}where\hspace{2mm}} & \bm{w}_k = \bm{\Sigma}^{-1}\bm{\mu}_k\\
	& w_{k0} = -\frac{1}{2}\bm{\mu}_k^T\bm{\Sigma}^{-1}\bm{\mu}_k + \ln p\left(C_k\right)
	\end{split}
	\end{equation*} 
	\item Decision boundaries are at $p\left(C_k|\bm{x}\right) = p\left(C_j|\bm{x}\right)$ $\Rightarrow$ $a_k = a_j\Rightarrow$ Linear decision boundaries!
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.7\textwidth]{figures/linear_classification_pgm.png}
		\caption{Left: Gaussian class-conditional densities, Right: corresponding posterior with sigmoid}
	\end{figure}
\end{itemize}
\subsubsection{Maximum likelihood solution for $K=2$}
\begin{itemize}
	\item Binary targets $t_n\in\left\{0,1\right\}$ ($1$ for $C_1$, $0$ for $C_2$)
	\item We use maximum likelihood to find optimal solution for $\bm{\mu}_k$, $\bm{\Sigma}$ and priors $p\left(C_k\right)$
	\item For $K=2$, the priors are denoted by $p\left(C_1\right) = q$ and $p\left(C_2\right) = 1-q$
	\item If $\bm{x}_n$ has target $t_n=1$: $p\left(\bm{x}_n, C_1\right) = p\left(\bm{x}_n|C_1\right)p\left(C_1\right) =q\mathcal{N}\left(\bm{x}_n|\bm{\mu}_1, \bm{\Sigma}\right)$
	\item If $\bm{x}_n$ has target $t_n=0$: $p\left(\bm{x}_n, C_2\right) = p\left(\bm{x}_n|C_2\right)p\left(C_2\right) =(1-q)\mathcal{N}\left(\bm{x}_n|\bm{\mu}_2, \bm{\Sigma}\right)$
	\item Combined likelihood: $p\left(\bm{t}, \bm{X}|q,\bm{\mu}_1,\bm{\mu}_2,\bm{\Sigma}\right) = \prod\limits_{n=1}^{N} \left[q\mathcal{N}\left(\bm{x}_n|\bm{\mu}_1, \bm{\Sigma}\right)\right]^{t_n}\left[(1-q)\mathcal{N}\left(\bm{x}_n|\bm{\mu}_2, \bm{\Sigma}\right)\right]^{1-t_n}$
	\item Log-likelihood: $$\ln p\left(\bm{t}, \bm{X}|q,\bm{\mu}_1,\bm{\mu}_2,\bm{\Sigma}\right) = \sum\limits_{n=1}^{N}t_n \ln q + t_n \ln \mathcal{N}\left(\bm{x}_n|\bm{\mu}_1, \bm{\Sigma}\right) + (1 - t_n) \ln \left(1 - q\right) + \left(1 - t_n\right) \ln \mathcal{N}\left(\bm{x}_n|\bm{\mu}_2, \bm{\Sigma}\right) $$
	\item Estimate for $q$: $\frac{\partial}{\partial q} \ln p\left(\bm{t}, \bm{X}|q,\bm{\mu}_1,\bm{\mu}_2,\bm{\Sigma}\right) = \sum\limits_{n=1}^{N} \left[\frac{t_n}{q} - \frac{1 - t_n}{1 - q}\right] = \sum\limits_{n=1}^{N} \frac{t_n - q}{q\left(1 - q\right)} = 0 \Rightarrow q_{\text{ML}} = \frac{1}{N}\sum\limits_{n=1}^{N} t_n = \frac{N_1}{N}$\\
	- Thus, the estimate of $p(C_1)$ is the proportion of samples that belong to class 1
	\item Estimate for $\bm{\mu}_k$: $\bm{\mu}_{1,\text{ML}} = \frac{1}{N_1}\sum\limits_{n=1}^{N} t_n \bm{x}_n$, $\bm{\mu}_{2,\text{ML}} = \frac{1}{N_2}\sum\limits_{n=1}^{N} \left(1-t_n\right) \bm{x}_n$\\
	- Thus, the estimate of $\bm{\mu}_k$ is the mean of the samples belonging to class $k$
	\item Estimate for $\bm{\Sigma}$: 
	$$\bm{\Sigma}_{\text{ML}} = \frac{N_1}{N}\underbrace{\left[\frac{1}{N_1} \sum\limits_{n=1}^{N} t_n\left(\bm{x}_n - \bm{\mu}_{1,\text{ML}}\right)\left(\bm{x}_n - \bm{\mu}_{1,\text{ML}}\right)^T\right]}_{\text{sample covariance of class 1}} + \frac{N_2}{N}\underbrace{\left[\frac{1}{N_2} \sum\limits_{n=1}^{N} (1-t_n)\left(\bm{x}_n - \bm{\mu}_{2,\text{ML}}\right)\left(\bm{x}_n - \bm{\mu}_{2,\text{ML}}\right)^T\right]}_{\text{sample covariance of class 2}}$$
	- Thus, the estimate of $\bm{\Sigma}$ is a weighted average (based on the number of samples of each class) of the class sample covariances\\
	- Note that this assumes a similar covariance matrix for every class cluster. If this is not the case, the estimation gives bad results (see Figure~\ref{img:linear_discriminative_analysis_different_cov})
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.2\textwidth]{figures/linear_discriminant_analysis_different_cov.png}
		\caption{Example for three classes with different covariance matrices $\bm{\Sigma}_k$. Linear discriminant analysis fails as it estimates a single shared covariance as a weighted sum of the sample covariances, and the distribution of the green class differs significantly from the other two. The resulting estimate would tend to be a circle for each class instead of the drawn ellipses.}
		\label{img:linear_discriminative_analysis_different_cov}
	\end{figure}
\end{itemize}
\subsubsection{Discrete inputs}
\begin{itemize}
	\item In contrast to the previous subsections, we now assume that $\bm{x}_n \in\left\{0,1\right\}^D$ and is therefore discrete
	\item As $\bm{x}$ is now discrete, a general distribution over $\bm{x}$ is a table with $2^D$ entries, so we need $2^D - 1$ parameters per class to represent it exactly
	\item However, if we use the Naive Bayes assumption (feature values are treated as independent given $C_k$), we reduce the number of parameters to $D$ per class (by using the Bernoulli distribution):
	$$p\left(\bm{x}|C_k\right) = \prod\limits_{i=1}^{D}p\left(x_i|C_k\right) = \prod\limits_{i=1}^{D} \pi_{ki}^{x_i}\left(1 - \pi_{ki}\right)^{1-x_i}$$
	\item We can apply this simplification to rewrite $a_k$:
	$$a_k = \ln p\left(\bm{x}|C_k\right) + \ln p\left(C_k\right)  = \sum\limits_{i=1}^{D}\left[x_i\ln \pi_{ki} + (1 - x_i)\ln (1 - \pi_{ki}) \right] + \ln p\left(C_k\right) = \bm{w}_k^T\bm{x} + w_{k0} $$
	 where $w_{ki}=\ln \frac{\pi_{ki}}{1 - \pi_{ki}}$ and $w_{k0} = \ln p\left(C_k\right) + \sum_{i=1}^{D}\ln \left(1-\pi_{ki}\right)$ 
\end{itemize}
\subsection{Discriminant functions}
\begin{itemize}
	\item Direct mapping of input to target (similar to regression)%$t=y(x,w)$
	\item We use $y\left(\bm{x},\bm{\tilde{w}}\right) = f\left(\bm{\tilde{w}}^T\bm{\phi}\right)$, where $f$ is the activation function and might be non-linear 
	\item The decision boundary is defined at a point where $y\left(\bm{x},\bm{\tilde{w}}\right) = \text{const}_1$. As $y$ represents the application of $f$, we can rewrite it as $\bm{\tilde{w}}^T\bm{\phi} = \text{const}_2$
	\item We first review the case of two classes, and then try to find a solution for multiple classes
\end{itemize}
\subsubsection{Discriminant functions for two classes}
\begin{itemize}
	\item For a two class problem, we set the decision boundary to 0 as we are still able to shift it by $w_0$
	\item If $y\left(\bm{x},\bm{\tilde{w}}\right)\geq 0$, the input $\bm{x}$ is assigned to class $C_1$, whereas if $y\left(\bm{x},\bm{\tilde{w}}\right)< 0$, the class is $C_2$ $\Rightarrow -w_0$ acts as the activation threshold for $\bm{w}^T\bm{x}$
	\item To determine how the weights $\bm{\tilde{w}}$ influence this classification, we assume two points $\bm{x}_a$ and $\bm{x}_b$ on the decision boundary $\Rightarrow$ $y\left(\bm{x}_a\right) = y\left(\bm{x}_b\right) = 0 \Rightarrow \bm{w}^T(\bm{x}_a - \bm{x}_b) = 0$ (see Figure~\ref{img:discriminant_function_two_classes})
	\item Hence, $\bm{w}$ is orthogonal to every vector lying within the decision surface, so that $\bm{w}$ determines the orientation of the surface
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/discriminant_function_two_classes.png}
		\caption{Illustration of the geometry of a linear discriminant function in two dimensions. The decision surface, shown in red, is perpendicular to $\bm{w}$, and its displacement from the origin is controlled by the bias parameter $w_0$. Also, the signed orthogonal distance of a general point $\bm{x}$ from the decision surface is given by $y(\bm{x})/||\bm{w}||$.}
		\label{img:discriminant_function_two_classes}
	\end{figure}
	\item So, we can express every point $\bm{x}$ as the sum of its orthogonal projection $\bm{x}_{\perp}$ onto the decision surface and a multiple of the unit normal vector: $$\bm{x} = \bm{x}_{\perp} + r\frac{\bm{w}}{||\bm{w}||}$$
	\item Applied in $y$, we get: $y\left(\bm{x}\right) = \bm{w}^T \bm{x} + w_0 = \underbrace{\bm{w}^T \bm{x}_{\perp} + w_0}_{=0} + r\frac{\bm{w}^T\bm{w}}{||\bm{w}||} = r ||\bm{w}||$
	\item So, the signed distance between a point $\bm{x}$ and the decision surface is $r = \frac{y\left(\bm{x}\right)}{||\bm{w}||}$
\end{itemize}
\subsubsection{Discriminant functions for multiple classes}
\begin{itemize}
	\item K-class discriminant: $y_k (\bm{x}) = \bm{w}_k^T\bm{x} + w_{k0}$
	\item Assign $\bm{x}$ to $C_k$ if $y_k(\bm{x})>y_j(\bm{x})$ for all $j\neq k$
	\item Thus, the decision boundary between $\mathcal{R}_k$ and $\mathcal{R}_j$ is determined by: $y_k(\bm{x})=y_j(\bm{x})$
	\item Note that decision regions of linear discriminant functions are convex (if two points are in $\mathcal{R}_k$, then all points between those are also in the same region $\mathcal{R}_k$)
\end{itemize}
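The convexity of the decision regions can be checked with a short worked argument, assuming linear discriminants of the form $y_k(\bm{x}) = \bm{w}_k^T\bm{x} + w_{k0}$. Take two points $\bm{x}_a, \bm{x}_b \in \mathcal{R}_k$ and any point on the line between them, $\hat{\bm{x}} = \lambda \bm{x}_a + (1-\lambda)\bm{x}_b$ with $0\leq\lambda\leq 1$. As $y_k$ is linear in $\bm{x}$:
$$y_k(\hat{\bm{x}}) = \lambda y_k(\bm{x}_a) + (1-\lambda) y_k(\bm{x}_b) > \lambda y_j(\bm{x}_a) + (1-\lambda) y_j(\bm{x}_b) = y_j(\hat{\bm{x}}) \hspace{5mm}\text{for all } j\neq k$$
The inequality holds because $y_k(\bm{x}_a) > y_j(\bm{x}_a)$ and $y_k(\bm{x}_b) > y_j(\bm{x}_b)$. Hence $\hat{\bm{x}}$ is also assigned to $C_k$.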
\subsubsection{Least squares discriminant}
\begin{itemize}
	\item Consider $\bm{t}_n$ as one-hot vector. We try to learn a function $y_k(\bm{x}, \bm{\tilde{w}}_k)$ for every class $k$ that maps $\bm{x}$ to its corresponding value in the one-hot vector (basically regression task)
	\item For shorter notation, we write $\bm{y}(\bm{x}) = \bm{\tilde{W}}^T\bm{\tilde{x}}$ to combine all classes and weights into a single operation
	\item As before: assign $\bm{x}$ to class $C_k$ if $k=\arg\max_j y_j (\bm{x})$
	\item The error function is the sum-of-squares: $$E_D(\bm{\tilde{W}}) = \frac{1}{2} \text{Tr}\left[\left(\bm{\tilde{X}}\bm{\tilde{W}}-\bm{T}\right)^T\left(\bm{\tilde{X}}\bm{\tilde{W}}-\bm{T}\right)\right] = \frac{1}{2}\sum\limits_{n=1}^{N}\sum\limits_{k=1}^{K}\left(\sum\limits_{d=0}^{D}\tilde{X}_{nd}\tilde{W}_{dk}-T_{nk}\right)^2$$
	\item Minimizing this error leads to:  $\bm{\tilde{W}}_{\text{LS}} = \left(\bm{\tilde{X}}^T \bm{\tilde{X}}\right)^{-1}\bm{\tilde{X}}^T \bm{T}$
	\item But: there are many problems with the least squares approach
	\begin{itemize}
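		\item The decision boundaries are very sensitive to outliers (least squares minimizes the \textit{mean} squared error to the one-hot vectors, so it also penalizes points that are correctly classified but lie far from the boundary)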
		\item For $K>2$, some decision regions become very small or are even completely ignored (also called masking)
		\item Components of $\bm{y}_{\text{LS}}$ are not probabilities and can be outside the interval $\left[0,1\right]$
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/discriminant_function_least_squares_problem.png}
		\caption{Illustration of the problems with least squared error discriminant}
		\label{img:discriminant_function_least_squares_problem}
	\end{figure}
\end{itemize}
\subsubsection{Perceptron}
\begin{itemize}
	\item For the perceptron, we use the step function as activation function:
	$$y\left(\bm{x}\right) = f\left(\bm{w}^T\bm{\phi}(\bm{x})\right) \text{\hspace{5mm} where \hspace{5mm}} f(a) = \begin{cases}
	1 & \text{ if } a\geq 0\\
	-1 & \text{ if } a <0
	\end{cases}$$
	\item Thus, assign $\bm{x}$ to class $C_1$ if $\bm{w}^T\bm{\phi}(\bm{x})\geq0$, otherwise $C_2$
	\item The goal is now to find a $\bm{w}$ such that $\bm{w}^T\bm{\phi}(\bm{x}_n)t_n > 0$ for all training points ($t_n \in \left\{1,-1\right\}$)
	\item We can define the error of a perceptron based on the set of misclassified examples $\mathcal{M}$: \\$E_P(\bm{w})=-\sum\limits_{n\in\mathcal{M}}\bm{w}^T \bm{\phi}(\bm{x}_n)t_n = \sum\limits_{n\in\mathcal{M}}E_n(\bm{w})$
	\item Use Stochastic Gradient Descent (SGD) for each misclassified $\bm{x}_n$: $$\bm{w}^{\tau+1} = \bm{w}^{\tau} - \eta \triangledown E_n(\bm{w}^{\tau}) = \bm{w}^{\tau} + \eta \bm{\phi}(\bm{x}_n) t_n$$
	\item If the dataset is linearly separable, the perceptron algorithm converges to a separating solution in a finite number of steps
	\item However, there are some problems with the perceptron algorithm:
	\begin{itemize}
		\item Perceptron only works for 2 classes
		\item There might be many optimal solutions, so that the exact outcome depends on initialization of $\bm{w}$ and order of data that are used in SGD
		\item If dataset is not linearly separable, the perceptron algorithm will not converge
		\item Based on linear combination of fixed basis functions 
	\end{itemize}
\end{itemize}
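A short worked check of the perceptron update, using the per-example error $E_n(\bm{w}) = -\bm{w}^T \bm{\phi}(\bm{x}_n)t_n$ from above: after one update on a misclassified point $\bm{x}_n$, its error contribution becomes
$$-\left(\bm{w}^{\tau+1}\right)^T \bm{\phi}(\bm{x}_n)t_n = -\left(\bm{w}^{\tau}\right)^T \bm{\phi}(\bm{x}_n)t_n - \eta\, t_n^2\, ||\bm{\phi}(\bm{x}_n)||^2 < -\left(\bm{w}^{\tau}\right)^T \bm{\phi}(\bm{x}_n)t_n$$
since $t_n^2 = 1$ and $||\bm{\phi}(\bm{x}_n)||^2 > 0$. The update therefore always reduces the error of the point it is applied to. It does not guarantee that the total error $E_P(\bm{w})$ decreases, as other points may become misclassified by the change.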
\subsubsection{Usage of Basis functions}
\begin{itemize}
	\item If the data in the input space is not linearly separable, we can use basis functions (that might be non-linear) to transform them into a new space, where they can be linearly separated!
	\item However, prior knowledge is required for this step as the general data distribution must be known and how to convert it into a linearly separable space. This step is especially hard/not possible for high dimensions
\end{itemize}
\subsection{Probabilistic discriminative models}
\begin{itemize}
	\item Instead of specifying the class-conditional probabilities $p\left(\bm{x}|C_k\right)$ and applying maximum likelihood to find the best parameters, we can try to explicitly model the posterior class probability $p\left(C_k|\bm{x}\right)$ and find its distribution $\Rightarrow$ posteriors are non-linear functions with a linear function of $\bm{\phi}$ as input: $p\left(C_k|\bm{\phi}, \bm{w}\right) = f(\bm{w}_k^T\bm{\phi})$
	\item In contrast, the indirect (generative) approach finds the parameters of the generalized linear model by fitting $p\left(\bm{x}|C_k\right)$ and $p\left(C_k\right)$ and applying Bayes' rule. Such a generative model can also be used to generate synthetic data by sampling from $p(\bm{x})$
\end{itemize}
\subsubsection{Logistic regression for two classes}
\begin{itemize}
	\item Logistic regression uses the sigmoid function to model the posterior:
	$$p\left(C_1|\bm{\phi},\bm{w}\right) = y\left(\bm{\phi}\right) = \sigma\left(\bm{w}^T\bm{\phi}\right), \hspace{3mm} p\left(C_2|\bm{\phi},\bm{w}\right) = 1 - p\left(C_1|\bm{\phi},\bm{w}\right)$$
	\item For inference/classification, take the class with the higher probability ($>0.5$) $\Rightarrow$ Decision boundary: $\bm{w}^T\bm{\phi}(\bm{x}) = 0$
	\item If $\bm{w}\in\mathbb{R}^M$, we only need $M$ parameters (compared to $M(M+5)/2 + 1$ for the generative model with Gaussian class-conditional densities and shared covariance: $2M$ for the means, $M(M+1)/2$ for $\bm{\Sigma}$, and $1$ for the prior)
	\item Use maximum likelihood to determine the parameters of the logistic regression model
	\item Conditional likelihood: $p\left(\bm{t}|\bm{X},\bm{w}\right) = \prod\limits_{n=1}^{N} p\left(t_n | \bm{x}_n,\bm{w}\right) = \prod\limits_{n=1}^{N} y_n^{t_n}\left(1 - y_n\right)^{1-t_n}$
	\item Maximizing the likelihood is equal to minimizing the cross-entropy loss:
	$$E(\bm{w}) = -\ln p\left(\bm{t}|\bm{X},\bm{w}\right) = -\sum\limits_{n=1}^{N} \left[t_n \ln y_n + (1 - t_n) \ln (1 - y_n)\right] =\sum\limits_{n=1}^{N}E_n(\bm{w})$$
	\item The loss $E(\bm{w})$ is convex (has a single, \textbf{unique minimum}), but no closed-form solution exists ($y_n = \sigma(\bm{w}^T \phi_n)$ is nonlinear in $\bm{w}$) $\Rightarrow$ Use SGD/...
	\item For taking the gradient, we can make use of the property of the sigmoid function: 
	$$\frac{\partial E_n(\bm{w})}{\partial w_j} = \frac{\partial E_n(\bm{w})}{\partial y_n} \frac{\partial y_n}{\partial w_j} = \left[-\frac{t_n}{y_n}+\frac{1 - t_n}{1 - y_n}\right] \cdot \left[\sigma(\bm{w}^T\bm{\phi}(\bm{x}_n))\left(1 - \sigma(\bm{w}^T\bm{\phi}(\bm{x}_n))\right) \phi_j(\bm{x}_n)\right] = (y_n - t_n)\phi_j(\bm{x}_n)$$
	\item Update rule (SGD): $\bm{w}^{\tau + 1} = \bm{w}^{\tau} - \eta \triangledown E_n(\bm{w}^{\tau})= \bm{w}^{\tau} - \eta (y_n - t_n)\bm{\phi}\left(\bm{x}_n\right)$
	\item If $\eta$ too large: no convergence. If $\eta$ too small: very slow convergence
	\item Converged $\bm{w}^*$ minimizes the loss $E(\bm{w})$
\end{itemize}
\subsubsection{Iterative reweighted least squares}
\begin{itemize}
	\item Also called the \textit{Newton-Raphson iterative optimization scheme}
	\item We use a local \textbf{quadratic approximation} of $E$ around the current weights $\bm{w}^{\tau-1}$ instead of a linear one, as the loss is locally well approximated by a second-order polynomial. The next iterate $\bm{w}^{\tau}$ is the minimizer of this quadratic approximation $\Rightarrow$ no learning rate needed
	\item New update rule: $$\bm{w}^{\tau} = \bm{w}^{\tau-1} - \bm{H}^{-1} \triangledown E(\bm{w}^{\tau - 1})$$ where $\bm{H}$ is the Hessian matrix whose elements comprise the second derivatives of $E(\bm{w})$: $H_{ij}=\frac{\partial^2 E(\bm{w})}{\partial w_i \partial w_j}$ ($\bm{H}$ is symmetric!)
	\item Derived from the previous section, the gradient for all $N$ data-points is 
	$$\triangledown E(\bm{w}) = \sum\limits_{n=1}^{N} (y_n - t_n)\bm{\phi}\left(\bm{x}_n\right) = \bm{\Phi}^T\left(\bm{y} - \bm{t}\right)$$
	\item The elements of the Hessian are obtained by differentiating the gradient with respect to a second parameter:
	$$H_{ij} = \frac{\partial^2 E(\bm{w})}{\partial w_i \partial w_j} = \frac{\partial}{\partial w_i}\sum\limits_{n=1}^{N} (y_n - t_n)\phi_j \left(\bm{x}_n\right) = \sum\limits_{n=1}^{N} \phi_j\left(\bm{x}_n\right) \frac{\partial y_n}{\partial w_i} = \sum\limits_{n=1}^{N} y_n (1 - y_n) \phi_i(\bm{x}_n)\phi_j(\bm{x}_n)$$
	\item The overall Hessian matrix is therefore 
	$$\bm{H} = \sum\limits_{n=1}^{N} y_n (1 - y_n) \bm{\phi}(\bm{x}_n)\bm{\phi}(\bm{x}_n)^T = \bm{\Phi}^T \bm{R}\bm{\Phi}$$
	where $R_{nn} = y_n(1-y_n)$, and otherwise $R_{nm} = 0$ for $n\neq m$
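	\item Applying this term in the update equation leads to:
	$$\bm{w}^{(\tau)} = \bm{w}^{(\tau - 1)} - \left(\bm{\Phi}^T \bm{R} \bm{\Phi}\right)^{-1}\bm{\Phi}^T\left(\bm{y} - \bm{t}\right) = \left(\bm{\Phi}^T \bm{R} \bm{\Phi}\right)^{-1}\left[\bm{\Phi}^T \bm{R} \bm{\Phi}\bm{w}^{(\tau - 1)} - \bm{\Phi}^T\left(\bm{y} - \bm{t}\right)\right] = \left(\bm{\Phi}^T \bm{R} \bm{\Phi}\right)^{-1} \bm{\Phi}^T \bm{R} \bm{z} \text{\hspace{5mm}where\hspace{5mm}} \bm{z} = \bm{\Phi}\bm{w}^{(\tau-1)}-\bm{R}^{-1}(\bm{y}-\bm{t})$$
	\item Note the similarity to the maximum likelihood solution of linear regression $\bm{w}_{\text{ML}} = \left(\bm{\Phi}^T \bm{\Phi}\right)^{-1} \bm{\Phi}^T\bm{t}$: each step solves a least-squares problem weighted by $\bm{R}$. As $\bm{R}$ depends on $\bm{w}$, the solution has to be iterated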
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/logistic_regression_sgd_iterat.png}
		\caption{Illustration of SGD (green) and Newton-Raphson (red). SGD always goes in the direction of steepest descent and therefore needs more iterations than Newton-Raphson.}
		\label{img:logistic_regression_sgd_iterat}
	\end{figure}
\end{itemize}
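For comparison, a short worked example under the assumption of a sum-of-squares error $E(\bm{w}) = \frac{1}{2}||\bm{\Phi}\bm{w} - \bm{t}||^2$ (linear regression instead of logistic regression). There, the gradient is $\triangledown E(\bm{w}) = \bm{\Phi}^T\left(\bm{\Phi}\bm{w} - \bm{t}\right)$ and the Hessian is $\bm{H} = \bm{\Phi}^T\bm{\Phi}$, so a single Newton-Raphson step from an arbitrary $\bm{w}^{(\tau-1)}$ gives
$$\bm{w}^{(\tau)} = \bm{w}^{(\tau - 1)} - \left(\bm{\Phi}^T \bm{\Phi}\right)^{-1}\left(\bm{\Phi}^T\bm{\Phi}\bm{w}^{(\tau - 1)} - \bm{\Phi}^T\bm{t}\right) = \left(\bm{\Phi}^T \bm{\Phi}\right)^{-1}\bm{\Phi}^T\bm{t}$$
As this error is exactly quadratic, the quadratic approximation is exact and one step already reaches the least-squares solution. For logistic regression, the Hessian changes with $\bm{w}$ through $\bm{R}$, which is why several iterations are needed.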
\subsubsection{Logistic regression for multiple classes}
\begin{itemize}
	\item Our posterior distribution is now given by a softmax:
	$$p(C_k|\bm{\phi}, \bm{w}_1, ..., \bm{w}_K) = y_k\left(\bm{\phi}\right) = \frac{\exp\left(a_k\right)}{\sum_{j=1}^{K}\exp\left(a_j\right)} \text{\hspace{5mm}where\hspace{5mm}} a_k=\bm{w}_k^T\bm{\phi}(\bm{x})$$
	\item The derivative with respect to an element $a_j$ is $\frac{\partial y_k}{\partial a_j}=y_k\left(\mathbb{I}\left(k=j\right) - y_j\right)$
	\item We use again maximum likelihood to determine the optimal parameters
	\item Conditional likelihood $p\left(\bm{T}|\bm{X},\bm{w}_1,...,\bm{w}_K\right) = \prod\limits_{n=1}^{N}\prod\limits_{k=1}^{K}p(C_k|\bm{\phi}_n, \bm{w}_1, ..., \bm{w}_K)^{t_{nk}} = \prod\limits_{n=1}^{N}\prod\limits_{k=1}^{K} y_{k}\left(\bm{\phi}_n\right)^{t_{nk}}$
	\item Taking the negative logarithm gives the \textit{cross-entropy} loss function for the multiclass classification problem
	$$E\left(\bm{w}_1,...,\bm{w}_K\right) = - \ln p\left(\bm{T}|\bm{X},\bm{w}_1,...,\bm{w}_K\right) = -\sum\limits_{n=1}^{N}\sum\limits_{k=1}^{K} t_{nk} \ln y_{nk} \text{\hspace{5mm}with\hspace{5mm}} y_{nk} = y_k\left(\bm{\phi}_n\right)$$
	\item To minimize the function by SGD or Newton-Raphson, we need the gradient:
	$$\triangledown_{\bm{w}_j} E\left(\bm{w}_1, ...,\bm{w}_K\right) = \sum\limits_{n=1}^{N}\left(y_{nj} - t_{nj}\right) \bm{\phi}\left(\bm{x}_n\right)$$
	\item For Newton Raphson/Iterative reweighted least squares, we also need the Hessian matrix:
	$$\frac{\partial}{\partial \bm{w}_k}\frac{\partial}{\partial \bm{w}_j}E\left(\bm{w}_1,...,\bm{w}_K\right) = \sum\limits_{n=1}^{N} y_{nk}\left(\mathbb{I}\left(k=j\right) - y_{nj}\right) \bm{\phi}_n \bm{\phi}_n^T$$
	\item Decision boundaries at $\left(\bm{w}_k^*\right)^T \bm{\phi}\left(\bm{x}'\right) = \left(\bm{w}_j^*\right)^T \bm{\phi}\left(\bm{x}'\right)$
\end{itemize}

================================================
FILE: Machine_Learning_1/ml_linear_regression.tex
================================================
\section{Linear Regression}

\subsection{Basic approaches}
\subsubsection{Maximum likelihood}
\begin{itemize}
	\item Given a dataset $D=(\bm{x}_1, \bm{x}_2, ..., \bm{x}_N)$ of $N$ independent observations
	\item The likelihood of the dataset given the model parameters $\bm{w}$ is specified as $p(D|\bm{w})$
	\item \textit{Maximum likelihood estimation}: the most likely ``explanation'' of $D$ is $\bm{w}_{\text{ML}}$:
	$$\bm{w}_{\text{ML}} = \arg\max_{\bm{w}} p(D|\bm{w})$$
	\item Using the i.i.d. assumption, we can state $p(D|\bm{w}) = \prod\limits_{n=1}^{N} p(\bm{x}_n|\bm{w})$
	\item To prevent numerical underflow (a product of many small probabilities) and to simplify the derivation, we can take the logarithm $\log p(D|\bm{w})$. The maximizer stays the same as the logarithm is monotonic
	\item Maximum where $\frac{\partial}{\partial \bm{w}}\log p(D|\bm{w}) = 0$
	\item We can check whether an estimator is biased by comparing its expectation with the true distribution parameter, e.g. for the ML estimate of the variance: $$\mathbb{E}\left[\sigma_{\text{ML}}^{2}\right] = \mathbb{E}\left[\frac{1}{N}\sum\limits_{i=1}^{N}\left(x_i - \frac{1}{N}\sum_{n=1}^{N} x_n\right)^2\right] = \frac{N-1}{N} \sigma^2 \implies \text{biased estimator}$$
\end{itemize}
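\begin{itemize}
	\item The factor $\frac{N-1}{N}$ above can be checked with a short derivation. This is a sketch for a Gaussian with true mean $\mu$ and variance $\sigma^2$; the symbols $\mu$ and $\mu_{\text{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$ are introduced here only for illustration. Expanding the square gives $\sigma_{ML}^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu_{\text{ML}}^2$. Using $\mathbb{E}\left[x_i^2\right] = \sigma^2 + \mu^2$ and $\mathbb{E}\left[\mu_{\text{ML}}^2\right] = \frac{\sigma^2}{N} + \mu^2$, we get:
	$$\mathbb{E}\left[\sigma_{ML}^{2}\right] = \left(\sigma^2 + \mu^2\right) - \left(\frac{\sigma^2}{N} + \mu^2\right) = \frac{N-1}{N}\sigma^2$$
	\item Consequently, rescaling the estimate by $\frac{N}{N-1}$ removes the bias.
\end{itemize}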
\subsubsection{Maximum a posteriori}
\begin{itemize}
	\item Choose the most probable model parameters $\bm{w}$ given data $D$:
	$$\bm{w}_{\text{MAP}} = \arg\max_{\bm{w}} p(\bm{w}|D)$$
	\item By applying Bayes' rule (and taking the log), we get:
	$$\bm{w}_{\text{MAP}} = \arg\max_{\bm{w}} \log p(D|\bm{w}) + \log p(\bm{w}) - \log p(D)$$
	\item We can drop the evidence as it is independent of $\bm{w}$
\end{itemize}
\subsubsection{Bayesian approach}
\begin{itemize}
	\item Frequentist approaches only consider point estimates without taking the uncertainty of the prediction into account
	\item Given a prior belief over $\bm{w}$, we are interested in the posterior distribution (not only maximum!)
	\item The predictive distribution for a new data point $\bm{x}'$ is therefore 
	$$p(t'|\bm{x}',D) = \int p(t'| \bm{x}', \bm{w}) \cdot p(\bm{w}|D) d\bm{w}$$
	\item Thus, we also consider our uncertainty when predicting
	\item However, this requires the normalized posterior and hence the evidence $p(D)$, which is usually hard to compute (prefer less complex models)
\end{itemize}
\subsection{Model selection for supervised learning}
\begin{itemize}
	\item Model selection comes with two main questions:
	\begin{enumerate}
		\item How can we estimate the performance of a model on unknown data?
		\item How can we choose the optimal hyperparameters? $\Rightarrow$ \textbf{model selection}
	\end{enumerate}
	\item Common approach for large datasets: split in train, val and test dataset
	\begin{description}
		\item[Training dataset] About 80\% of the data should be used for training. On this, we try to minimize the error/loss $L\left(y(\bm{x}_i),t_i\right)$ for $(\bm{x}_i,t_i)\in D_{train}$ and find optimal parameters $\bm{w}^*$.
		\item[Validation dataset] About 10\% of the data is used for estimating the test error $L\left(y(\bm{x}_{\text{val}}, \bm{w}^*),t_{\text{val}}\right)$ for various $\bm{w}^*$ from different hyperparameters. Hence, the hyperparameters are tuned on the validation dataset.
		\item[Testing dataset] The last 10\% of the available data provides the final test of the chosen best weights and hyperparameters. This data is used to estimate the performance on unseen data, and should therefore not be used for choosing any parameters!
	\end{description}
	\item However, for a small dataset, the validation and test sets are very small and, hence, their error estimates are very noisy $\Rightarrow$ use cross validation
\end{itemize}
\subsubsection{Cross Validation}
\begin{itemize}
	\item Split data into $K$ folds % $D=\left\{\left(x_1, t_1\right), \left(x_2, t_2\right), ..., \left(x_N, t_N\right)\right\}$
	\item If $K=N$, it is also called leave-one-out cross validation as each validation fold is a single data point
	\item Train the model $y$ on $K-1$ folds, and test on the remaining fold $k$ $\Rightarrow$ model $\hat{y}^{\mbox{--}k}(x)$
	\item The estimation of the prediction error is the mean validation error over all folds. With the index function $\kappa:\left\{1,...,N\right\}\mapsto\left\{1,...,K\right\}$ (mapping data point to corresponding fold where it is used for validation), we get:
	$$CV(\hat{y}) = \frac{1}{N}\sum\limits_{i=1}^{N}L\left(\hat{y}^{\mbox{--}\kappa\left(i\right)}(\bm{x}_i), t_i\right)$$
	\item Task of model selection: Run cross validation for each possible parameter setting and choose the one with lowest cross validation error
	\item Task of test error estimation: after finding the best hyperparameters like $\alpha^*$, retrain model on all $K$ folds, and test this model on a held-out test set
	\item However, if test set is small, we again get a noisy estimation $\Rightarrow$ Nested cross validation
	\item Drawback of cross validation: it is computationally expensive and should therefore only be used for fast trainings/small datasets
\end{itemize}
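\begin{itemize}
	\item A small illustration of the index function $\kappa$ (a made-up example, assuming $N=6$ data points and $K=3$ folds of equal size): with $\kappa(1)=\kappa(2)=1$, $\kappa(3)=\kappa(4)=2$ and $\kappa(5)=\kappa(6)=3$, the estimate becomes
	$$CV(\hat{y}) = \frac{1}{6}\left[\sum\limits_{i\in\{1,2\}}L\left(\hat{y}^{\mbox{--}1}(\bm{x}_i), t_i\right) + \sum\limits_{i\in\{3,4\}}L\left(\hat{y}^{\mbox{--}2}(\bm{x}_i), t_i\right) + \sum\limits_{i\in\{5,6\}}L\left(\hat{y}^{\mbox{--}3}(\bm{x}_i), t_i\right)\right]$$
	so every data point is evaluated exactly once, by the model that has not seen it during training.
\end{itemize}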
\subsubsection{Nested Cross Validation}
\begin{itemize}
	\item Cross validation for both model selection and model performance by reusing dataset for testing
	\item General algorithm:
	\begin{enumerate}
		\item Split dataset into $M$ cross validation folds
		\item For each of these folds $m=1,...,M$:
		\begin{enumerate}
			\item Let fold $m$ be the test dataset
			\item Apply cross validation on the remaining data by splitting it into $K$ folds and find best hyperparameters $\alpha^*,\beta^*,...$
			\item Retrain the model with the best hyperparameters on all data besides the fold $m$
			\item Test the model on unseen data fold $m$ 
		\end{enumerate}
		\item The final generalization error/loss on unseen data is the mean over all $M$ folds
	\end{enumerate}
	\item To choose the final hyperparameters $\alpha^*,\beta^*,...$, we run a single cross validation on the whole dataset (without a held-out test set). The generalization error found by the nested procedure is then reported as the estimate of the performance on unknown data for this final model as well
\end{itemize}
\begin{figure}[ht]
	\centering
	\includegraphics[width=0.5\textwidth]{figures/cross_validation_nested.png}
	\caption{Illustration of nested cross validation. The outer loop splits dataset into test and trainval parts. Within the trainval parts, we apply cross validation to find optimal hyperparameters. Those are tested on the left-out fold from the outer loop, and the mean test error of all folds is the final generalization error. Note that every outer fold can lead to different optimal hyperparameters.}
	\label{img:linear_regression_nested_cross_validation}
\end{figure}
\subsection{Bias variance decomposition}
\begin{itemize}
	\item Frequentist view on model complexity
	\item Common loss: the squared loss function, defined as $L\left(t, y\left(\bm{x}\right)\right) = \left(t - y\left(\bm{x}\right)\right)^2$
	% \item The expected loss is $\mathbb{E}\left[L\left(t, y\left(\bm{x}\right)\right)\right]=\int\int \left(t - y\left(\bm{x}\right)\right)^2 p\left(\bm{x}, t\right) \text{d}\bm{x} \text{ d}t$
	\item An optimal model of $y\left(\bm{x}\right)$ would minimize this loss which is given by $$h(\bm{x}) = \mathbb{E}\left[t|\bm{x}\right] = \int t\cdot p\left(t|\bm{x}\right) \text{d}t$$
	where the conditional distribution $p\left(t|\bm{x}\right)$ is the actual, noisy data distribution (not known!)
	\item Thus, the expected squared loss can be written as
	$$\mathbb{E}\left[L\right] = \int\underbrace{\left\{y(\bm{x}) - \mathbb{E}\left[t|\bm{x}\right]\right\}^2}_{\text{model loss}} p(\bm{x}) \text{d}\bm{x} + \int\underbrace{\left\{\mathbb{E}\left[t|\bm{x}\right] - t\right\}^2}_{\text{intrinsic noise on data}} p(\bm{x},t) \text{d}\bm{x}\text{ d}t$$
	where the first term, the model loss, depends on how different the model $y(\bm{x})$ is from the actual data distribution, and the second term arises from the intrinsic noise and represents the minimum achievable expected loss
	\item In Bayesian approach, we would model $y(\bm{x}, \bm{w})$ where the uncertainty of $\bm{w}$ is expressed in the posterior distribution
	\item However, from a frequentist viewpoint, we use multiple datasets $\mathcal{D}$ on which we train our model and get a single estimation $\bm{\hat{w}}$ for each of them. The final model is the average over this ensemble of datasets.
	\item To apply this approach, we take the model loss for a single input $\bm{x}$, and add and subtract the expected model over all datasets:
	\begin{equation*}
		\begin{split}
			\left\{y\left(\bm{x};\mathcal{D}\right) - h\left(\bm{x}\right)\right\}^2 & = \left\{y\left(\bm{x};\mathcal{D}\right) - \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right] + \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right] - h\left(\bm{x}\right)\right\}^2\\
			& = \left\{y\left(\bm{x};\mathcal{D}\right) - \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right]\right\}^2 + \left\{\mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right] - h\left(\bm{x}\right)\right\}^2 \\
			& \text{\hspace{5mm} } + 2\left\{y\left(\bm{x};\mathcal{D}\right) - \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right]\right\}\left\{ \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right] - h\left(\bm{x}\right)\right\}
		\end{split}
	\end{equation*}
	\item In the frequentist approach, we then take the expected value of this loss over all datasets:
	\begin{equation*}
		\begin{split}
			\mathbb{E}_{\mathcal{D}}\left[\left\{y\left(\bm{x};\mathcal{D}\right) - h\left(\bm{x}\right)\right\}^2\right] & = \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left\{y\left(\bm{x};\mathcal{D}\right) - \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right]\right\}^2\right]}_{\text{\textcolor{red}{variance}}} + \underbrace{\left\{\mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right] - h(\bm{x})\right\}^2}_{\textcolor{blue}{(\text{bias})^2}}
		\end{split}
	\end{equation*}
	where the first term is the \textbf{variance} of a model trained on a single dataset around the average model, and the second term is the loss of the average/expected model over all datasets, i.e. the squared \textbf{bias} of the model. The cross term of the previous equation vanishes under the expectation: its second factor $\mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right] - h\left(\bm{x}\right)$ does not depend on $\mathcal{D}$, and its first factor has zero mean, $\mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right) - \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right]\right] = 0$.
	\item Coming back to the original expected squared loss, we can decompose it into three terms:
	$$\text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}$$
	where
	\begin{equation*}
		\begin{split}
			(\text{bias})^2 &= \int \left\{\mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right] - h(\bm{x})\right\}^2p(\bm{x}) \text{d}\bm{x}\\
			\text{variance} &= \int \mathbb{E}_{\mathcal{D}}\left[\left\{y\left(\bm{x};\mathcal{D}\right) - \mathbb{E}_{\mathcal{D}}\left[y\left(\bm{x};\mathcal{D}\right)\right]\right\}^2\right]p(\bm{x}) \text{d}\bm{x}\\
			\text{noise} &= \int \left\{h(\bm{x})-t\right\}^2p\left(\bm{x},t\right)\text{d}\bm{x}\text{ d}t
		\end{split}
	\end{equation*}
	\item Now, the task is to find the best balance between bias and variance. An example for the data distribution $\mathbb{E}[t|x]=\sin(2\pi x)$ (note that noise is canceled by expectation), 24 Gaussian basis functions and regularized loss function is shown in Figure~\ref{img:linear_regression_bias_variance_decomp_example}.
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/bias_variance_reg_comp.png}
		\caption{Illustration of the dependence of bias and variance on the model complexity controlled by $\lambda$. Less complex models (high $\lambda$) tend to have a high bias (they are far off the correct function) but are more robust with respect to the particular dataset (therefore, a low variance). Decreasing $\lambda$ results in a lower bias, but a higher variance as the models tend to overfit and are therefore sensitive to the dataset.}
		\label{img:linear_regression_bias_variance_decomp_example}
	\end{figure}
	\item Plotting the terms of the decomposed squared loss function over $\lambda$ gives further insights of the model behavior (see Figure~\ref{img:linear_regression_bias_variance_decomp_error_plot}). For generating such a plot, the integrals are approximated by sums over all data points $x$ as we have a limited number of samples.
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/bias_variance_loss_plot.png}
		\caption{Plot of the decomposed loss function for the example of Figure~\ref{img:linear_regression_bias_variance_decomp_example}. The goal is to minimize the test error. Its minimum is commonly close to the minimum of $(\text{bias})^2+\text{variance}$. High variance as on the left (small $\lambda$) indicates overfitting, high bias as on the right (large $\lambda$) shows that the model is underfitting.}
		\label{img:linear_regression_bias_variance_decomp_error_plot}
	\end{figure}
	\item In conclusion, high values of $\lambda$ reduce model complexity and therefore increase the bias, which leads to underfitting; in return, the variance is small.\\In contrast, small values of $\lambda$ cause a low bias as the model is quite complex. However, the variance is high, indicating that the model overfits the individual (small) datasets.
	\item The bias-variance decomposition is of limited practical use, as it is better to train on one large dataset instead of splitting it into several small ones. Furthermore, a large dataset reduces the risk of overfitting for a high model complexity anyway.
\end{itemize}
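\begin{itemize}
	\item The split of the expected squared loss into model loss and intrinsic noise used above follows from the same add-and-subtract trick. As a sketch, with $h(\bm{x}) = \mathbb{E}\left[t|\bm{x}\right]$ as defined before:
	\begin{equation*}
		\begin{split}
			\left\{y(\bm{x}) - t\right\}^2 & = \left\{y(\bm{x}) - h(\bm{x}) + h(\bm{x}) - t\right\}^2\\
			& = \left\{y(\bm{x}) - h(\bm{x})\right\}^2 + 2\left\{y(\bm{x}) - h(\bm{x})\right\}\left\{h(\bm{x}) - t\right\} + \left\{h(\bm{x}) - t\right\}^2
		\end{split}
	\end{equation*}
	\item Integrating over $p(\bm{x},t) = p(t|\bm{x})p(\bm{x})$, the cross term vanishes because $\int \left\{h(\bm{x}) - t\right\} p(t|\bm{x}) \text{d}t = h(\bm{x}) - \mathbb{E}\left[t|\bm{x}\right] = 0$, which leaves exactly the model loss and the noise term.
\end{itemize}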
\subsection{Bayesian Linear Regression}
\begin{itemize}
	\item Determining the suitable model complexity using the training data alone without overfitting
	\item The result is a distribution over $\bm{w}$ instead of a single value as in maximum likelihood or maximum a posteriori
\end{itemize}
\subsubsection{Parameter distribution}
\begin{itemize}
	\item Prior over weights: $p(\bm{w}) = \mathcal{N}\left(\bm{m}_0, \bm{S}_0\right)$
	\item Likelihood: $p\left(t'|\bm{x}', \bm{w}, \beta\right)=\mathcal{N}\left(t'|\bm{\phi}(\bm{x}')^T\bm{w}, \beta^{-1}\right)$
	\item Posterior distribution: $p(\bm{w}|\bm{t}, \bm{X}) = \frac{p\left(\bm{t}|\bm{X}, \bm{w}, \beta\right) p\left(\bm{w}\right)}{p\left(\bm{t}|\bm{X}, \beta\right)} = \mathcal{N}\left(\bm{m}_N, \bm{S}_N\right)$, where\\
	$$\bm{S}_N^{-1}=\bm{S}_0^{-1} + \beta \bm{\Phi}^T\bm{\Phi}$$
	$$\bm{m}_N = \bm{S}_N\left(\bm{S}_0^{-1}\bm{m}_0+\beta\bm{\Phi}^T\bm{t}\right)$$
	\item The maximum a posteriori estimate is the posterior mean, $\bm{w}_{\text{MAP}} = \bm{m}_N$, since the mode of a Gaussian coincides with its mean
	\item If no prior was given ($\bm{S}_0=\alpha^{-1} \bm{I}$ with $\alpha\to0$) the mean $\bm{m}_N$ reduces to $\bm{w}_{\text{ML}}$
	\item Mostly simpler Gaussian prior used: $p(\bm{w}) = \mathcal{N}\left(\bm{0}, \alpha^{-1}\bm{I}\right)$
	\begin{itemize}
		\item Resulting parameters of posterior: 
		$$\bm{S}_N^{-1}=\alpha\bm{I} + \beta \bm{\Phi}^T\bm{\Phi}$$
		$$\bm{m}_N = \beta \bm{S}_N\bm{\Phi}^T\bm{t}$$
		$$p(\bm{w}|\bm{t}, \bm{X})=\frac{1}{\sqrt{\left(2\pi\right)^M |\bm{S}_N|}}\exp\left[-\frac{1}{2}\left(\bm{w}-\bm{m}_N\right)^T \bm{S}_N^{-1} \left(\bm{w}-\bm{m}_N\right)\right]$$
		\item Corresponding log posterior: 
		$$\ln p\left(\bm{w}|\bm{t}, \bm{X}\right) = -\frac{\beta}{2}\sum\limits_{n=1}^{N}\left\{t_n - \bm{w}^T \bm{\phi}\left(\bm{x}_n\right)\right\}^2 - \frac{\alpha}{2}\bm{w}^T \bm{w} + C$$
		\item Thus, maximizing this posterior is equal to having a regularization term with $\lambda=\frac{\alpha}{\beta}$
		\item Infinitely narrow prior by $\alpha\to\infty$ ($\alpha\to0$ seen before ends up in maximum likelihood):
		$$\lim\limits_{\alpha\to\infty} \bm{S}_N = \lim\limits_{\alpha\to\infty} \left(\alpha\bm{I}+\beta \bm{\Phi}^{T}\bm{\Phi}\right)^{-1} = \lim\limits_{\alpha\to\infty} \alpha^{-1}\bm{I} = \bm{0}$$
		$$\lim\limits_{\alpha\to\infty} \bm{m}_N = \lim\limits_{\alpha\to\infty} \beta\left(\alpha\bm{I}+\beta \bm{\Phi}^{T}\bm{\Phi}\right)^{-1}\bm{\Phi}^T \bm{t} = \lim\limits_{\alpha\to\infty} \frac{\beta}{\alpha}\bm{\Phi}^T\bm{t} = \bm{0} = \bm{m}_{0}$$
		\item Infinite data $N\to\infty$:
		$$\lim\limits_{N\to\infty} \bm{S}_N = \lim\limits_{N\to\infty} \left(\alpha\bm{I}+\beta \bm{\Phi}^{T}\bm{\Phi}\right)^{-1} = \lim\limits_{N\to\infty} \left(\beta\bm{\Phi}^{T}\bm{\Phi}\right)^{-1} = \bm{0}$$
		$$\lim\limits_{N\to\infty} \bm{m}_N = \lim\limits_{N\to\infty} \beta\left(\alpha\bm{I}+\beta \bm{\Phi}^{T}\bm{\Phi}\right)^{-1}\bm{\Phi}^T \bm{t} = \lim\limits_{N\to\infty} \left(\bm{\Phi}^T\bm{\Phi}\right)^{-1}\bm{\Phi}^T\bm{t} = \bm{w}_{\text{ML}}$$
		$\Rightarrow$ At infinite data, all approaches agree: $\bm{m}_N = \bm{w}_{\text{ML}} = \bm{w}_{\text{MAP}}$
	\end{itemize}
\end{itemize}
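\begin{itemize}
	\item The link between the log posterior and regularized least squares can be made explicit. As a sketch for the zero-mean prior $\mathcal{N}\left(\bm{0}, \alpha^{-1}\bm{I}\right)$, dividing the negative log posterior by $\beta$ gives the familiar regularized error:
	$$-\frac{1}{\beta}\ln p\left(\bm{w}|\bm{t}, \bm{X}\right) = \frac{1}{2}\sum\limits_{n=1}^{N}\left\{t_n - \bm{w}^T \bm{\phi}\left(\bm{x}_n\right)\right\}^2 + \frac{\lambda}{2}\bm{w}^T \bm{w} + \text{const} \text{\hspace{5mm}with\hspace{5mm}} \lambda = \frac{\alpha}{\beta}$$
	\item Setting the gradient with respect to $\bm{w}$ to zero yields the ridge solution, which coincides with the posterior mean when $\bm{S}_N = \left(\alpha\bm{I}+\beta \bm{\Phi}^{T}\bm{\Phi}\right)^{-1}$ as used in the limits above:
	$$\bm{w}_{\text{MAP}} = \left(\lambda\bm{I} + \bm{\Phi}^T\bm{\Phi}\right)^{-1}\bm{\Phi}^T\bm{t} = \beta\left(\alpha\bm{I}+\beta \bm{\Phi}^{T}\bm{\Phi}\right)^{-1}\bm{\Phi}^T \bm{t} = \bm{m}_N$$
\end{itemize}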

\subsubsection{Sequential Bayesian Learning}
\begin{itemize}
	\item Data arrives as a sequence of inputs $x$ and targets $t$
	\item Posterior after $N-1$ data points constitutes the prior for the $N$th data point!
	\item Posterior 1: $p(\bm{w}|x_1,t_1,\alpha,\beta)\propto p(t_1|x_1,\bm{w},\beta)p(\bm{w}|\alpha)$
	\item Posterior 2: $p\left(\bm{w}|(x_1,t_1),(x_2,t_2),\alpha,\beta\right)\propto p(t_2|x_2,\bm{w},\beta)p(\bm{w}|x_1,t_1,\alpha,\beta)$
	\item Posterior narrows down step by step until it gets very certain of the correct estimation 
\end{itemize}
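\begin{itemize}
	\item For the Gaussian prior and likelihood of the previous subsection, each sequential step has a closed form. The following is a sketch obtained by applying the posterior formulas for $\bm{S}_N$ and $\bm{m}_N$ to a single data point; the subscripts $n-1$ and $n$ and the shorthand $\bm{\phi}_n = \bm{\phi}(x_n)$ are notation assumed here:
	$$\bm{S}_n^{-1} = \bm{S}_{n-1}^{-1} + \beta \bm{\phi}_n\bm{\phi}_n^T, \text{\hspace{5mm}} \bm{m}_n = \bm{S}_n\left(\bm{S}_{n-1}^{-1}\bm{m}_{n-1} + \beta \bm{\phi}_n t_n\right)$$
	\item Starting from the prior $\bm{m}_0$, $\bm{S}_0$ and applying this update for $n=1,...,N$ gives the same posterior as the batch formulas, because $\bm{\Phi}^T\bm{\Phi} = \sum_{n} \bm{\phi}_n\bm{\phi}_n^T$ and $\bm{\Phi}^T\bm{t} = \sum_{n} \bm{\phi}_n t_n$.
\end{itemize}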
\begin{figure}[ht]
\centering
\includegraphics[width=0.5\textwidth]{figures/sequential_bayesian_linear_regression.png}
\caption{Example for Sequential Bayesian Learning on target $t=-0.3+0.5x+\epsilon$. First column: likelihood (not normalized for $\bm{w}$, but for $t_n$!), second column: posterior, third column: sampled weights}
\end{figure}
\subsubsection{Predictive Distribution}
\begin{itemize}
	\item Predictive distribution is defined by ($\bm{\mathtt{t}}$ targets in training set):
	$$p\left(t|x, \bm{\mathtt{t}}, \bm{X}, \alpha, \beta \right) = \int p\left(t|x, \bm{w},\beta\right)p\left(\bm{w}|\bm{\mathtt{t}}, \bm{X}, \alpha, \beta\right)d\bm{w}$$
	where $p\left(t|x, \bm{w},\beta\right) = \mathcal{N}\left(t|y\left(\bm{x},\bm{w}\right), \beta^{-1}\right)$ is the conditional distribution of target variable, and \\$p\left(\bm{w}|\bm{\mathtt{t}}, \bm{X}, \alpha, \beta\right) = \mathcal{N}\left(\bm{w}|\bm{m}_N, \bm{S}_N\right)$ the posterior weight distribution
	\item Predictive distribution is convolution of two Gaussians $\Rightarrow$ $p\left(t|x, \bm{\mathtt{t}}, \bm{X}, \alpha, \beta \right)=\mathcal{N}\left(t|\bm{m}_N^{T}\bm{\phi}(\bm{x}),\sigma_N^2(\bm{x}) \right)$
	where variance $\sigma_N^2(\bm{x})=\frac{1}{\beta} + \bm{\phi}(\bm{x})^T \bm{S}_N \bm{\phi}(\bm{x})$ (first term data noise, second weight uncertainty, which goes to 0 for infinite data $N\to\infty$)
	\item Important points
	\begin{enumerate}
		\item Uncertainty is smaller near training points
		\item Variance/uncertainty decreases with larger $N$
	\end{enumerate}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/bayesian_linear_regression_predictive_dist.png}
		\caption{Example for predictive distributions. Green: ground truth data, blue: data points, red line: mean prediction, red area: 1-sigma area}
	\end{figure}
	\item The predictive distribution can be expressed by a \textbf{kernel formulation}:
	\begin{itemize}
		\item The predictive mean is 
		\begin{equation*}
		\begin{split}
			y\left(\bm{x}', \bm{m}_N\right) & = \bm{\phi}^T(\bm{x}') \bm{m}_N = \beta \bm{\phi}^T(\bm{x}')\bm{S}_N \bm{\Phi}^T \bm{t}\\
			& = \beta \bm{\phi}^T(\bm{x}')\bm{S}_N \sum\limits_{n=1}^{N}\left(\bm{\Phi}^T\right)_{:,n} t_n = \beta \sum\limits_{n=1}^{N} \bm{\phi}^T(\bm{x}')\bm{S}_N \bm{\phi}(\bm{x}_n) t_n\\
			& = \sum\limits_{n=1}^{N} k\left(\bm{x}',\bm{x}_n\right) t_n \text{\hspace{5mm} 
				where\hspace{5mm}} k\left(\bm{x}',\bm{x}_n\right)=\beta \bm{\phi}^T(\bm{x}')\bm{S}_N \bm{\phi}(\bm{x}_n)
		\end{split}
		\end{equation*}
		$\Rightarrow$ Prediction is a linear combination of training set target values
		\item Kernel values depend on whole dataset by $\bm{S}_N$
		\item Closer data points to $\bm{x}'$ are given a higher weight than points further removed from $\bm{x}'$
		\item Thus, local evidence is weighted more strongly than distant evidence
		\item Kernel can also express covariance:
		\begin{equation*}
		\begin{split}
		\text{cov}\left[t_1, t_2 | \bm{x}_1, \bm{x}_2\right] & = \text{cov}_{\bm{w}}\left[y(\bm{x}_1, \bm{w}), y(\bm{x}_2, \bm{w})\right] = \text{cov}_{\bm{w}}\left[\bm{\phi}^T(\bm{x}_1)\bm{w}, \bm{w}^T\bm{\phi}(\bm{x}_2)\right]\\
		& = \mathbb{E}_{\bm{w}}\left[\bm{\phi}^T(\bm{x}_1)\bm{w} \bm{w}^T\bm{\phi}(\bm{x}_2)\right] - \mathbb{E}_{\bm{w}}\left[\bm{\phi}^T(\bm{x}_1)\bm{w}\right]\mathbb{E}_{\bm{w}}\left[\bm{w}^T\bm{\phi}(\bm{x}_2)\right]\\
		& = \bm{\phi}^T(\bm{x}_1)\text{cov}\left[\bm{w},\bm{w}\right] \bm{\phi}(\bm{x}_2) = \bm{\phi}^T(\bm{x}_1)\bm{S}_N \bm{\phi}(\bm{x}_2)\\
		& = \frac{1}{\beta}k\left(\bm{x}_1,\bm{x}_2\right) 
		\end{split}
		\end{equation*}
		\item Based on that, we can see that the predictive means at nearby points are highly correlated (high values of the kernel), whereas the correlation is smaller for distant points
			\begin{figure}[ht]
			\centering
			\includegraphics[width=0.5\textwidth]{figures/bayesian_kernel_formulation.png}
			\caption{Right plot: matrix for $(x',x)$ of kernel $k\left(x',x\right)$ for Gaussian basis function. Left plot: slices of this matrix for different values of $x$}
		\end{figure}
		
	\end{itemize}
\end{itemize}
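\begin{itemize}
	\item The claim that the uncertainty decreases with larger $N$ can be verified directly. As a sketch, adding one data point with basis vector $\bm{\phi}_{N+1} = \bm{\phi}(\bm{x}_{N+1})$ (shorthand assumed here) changes the posterior precision by a rank-one term, $\bm{S}_{N+1}^{-1} = \bm{S}_N^{-1} + \beta\bm{\phi}_{N+1}\bm{\phi}_{N+1}^T$, so the Sherman-Morrison identity gives:
	$$\bm{S}_{N+1} = \bm{S}_N - \frac{\beta\bm{S}_N\bm{\phi}_{N+1}\bm{\phi}_{N+1}^T\bm{S}_N}{1+\beta\bm{\phi}_{N+1}^T\bm{S}_N\bm{\phi}_{N+1}}$$
	\item Inserting this into the predictive variance shows that it can only shrink, since $\bm{S}_N$ is positive definite and the subtracted term is non-negative:
	$$\sigma_{N+1}^2(\bm{x}) = \sigma_N^2(\bm{x}) - \frac{\beta\left(\bm{\phi}(\bm{x})^T\bm{S}_N\bm{\phi}_{N+1}\right)^2}{1+\beta\bm{\phi}_{N+1}^T\bm{S}_N\bm{\phi}_{N+1}} \leq \sigma_N^2(\bm{x})$$
\end{itemize}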
\subsection{Bayesian Model Comparison}
\begin{itemize}
	\item By marginalizing (integrating) over the model parameters instead of making point estimates of their values, models can be directly compared on the training data instead of separate validation data
	\item Compare $L$ models $\left\{\mathcal{M}_i\right\}_{i=1}^{L}$
	\item Probabilities are used to represent uncertainty in the choice of model. 
	\item We express our preference for different models by a prior distribution $p\left(\mathcal{M}_i\right)$, so that the posterior is:
	$$p\left(\mathcal{M}_i|\mathcal{D}\right)\propto p\left(\mathcal{M}_i\right)p\left(\mathcal{D}|\mathcal{M}_i\right)$$
	\item The important term is the \textit{model evidence} $p\left(\mathcal{D}|\mathcal{M}_i\right)$, which updates our preference based on the seen data $\mathcal{D}$. It is obtained by marginalizing over the parameters $\bm{w}$ of the model:
	$$p\left(\mathcal{D}|\mathcal{M}_i\right) = \int p\left(\mathcal{D}|\bm{w}, \mathcal{M}_i\right)p\left(\bm{w}|\mathcal{M}_i\right)d\bm{w}$$
	\begin{itemize}
		\item Can be viewed as the probability that $\mathcal{D}$ is generated by a random sample of $\bm{w}$ from the prior. 
		\item Is also the normalization constant for $p\left(\bm{w}|\mathcal{D}, \mathcal{M}_i\right)$
	\end{itemize}
	\item Two models can be compared by dividing their posteriors:
	$$\frac{p\left(\mathcal{M}_1|\mathcal{D}\right)}{p\left(\mathcal{M}_2|\mathcal{D}\right)} = \frac{p\left(\mathcal{M}_1\right)p\left(\mathcal{D}|\mathcal{M}_1\right)}{p\left(\mathcal{M}_2\right)p\left(\mathcal{D}|\mathcal{M}_2\right)} \text{\hspace{5mm}where\hspace{5mm}} \frac{p\left(\mathcal{D}|\mathcal{M}_1\right)}{p\left(\mathcal{D}|\mathcal{M}_2\right)}\text{ is called \textit{Bayes factor}}$$
	\item The predictive distribution is a weighted mean (based on the model probabilities) of our models:
	$$p\left(t'|\bm{x}',\mathcal{D}\right) = \sum\limits_{i=1}^{L}p\left(t'|\bm{x}', \mathcal{M}_i, \mathcal{D}\right) p\left(\mathcal{M}_i | \mathcal{D}\right)$$
	\item However, a simple approximation is using the single most probable model alone to make prediction $\Rightarrow$ also known as \textit{model selection}
\end{itemize}
\subsubsection{Approximated Model Evidence}
\begin{itemize}
	\item For a single parameter $w$, assume that the posterior distribution $p\left(w|\mathcal{D}, \mathcal{M}_i\right)$ is sharply peaked around the most probable value $w_{\text{MAP}}$ with width $\Delta w_{\text{posterior}}$
	\item Further, we assume that also the prior is a flat distribution with width $\Delta w_{\text{prior}}$ so that $p\left(w|\mathcal{M}_i\right) = 1/\Delta w_{\text{prior}}$
	\item Integral of model evidence can be approximated by its maximum value times the width of the peak:
	 $$p\left(\mathcal{D}|\mathcal{M}_i\right) = \int p\left(\mathcal{D}|w, \mathcal{M}_i\right)p\left(w|\mathcal{M}_i\right)dw\simeq p\left(\mathcal{D}|w_{\text{MAP}}, \mathcal{M}_i\right)\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$$
	 \item Taking the log leads to:
	 $$\ln p\left(\mathcal{D}|\mathcal{M}_i\right) \simeq \underbrace{\ln p\left(\mathcal{D}|w_{\text{MAP}}, \mathcal{M}_i\right)}_{\text{model fit}} + \underbrace{\ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}}_{\text{complexity penalty}}$$
	 \item The first term is the likelihood of the data, and therefore describes how good the model fits to the given data (optimal: maximized)
	 \item The second term penalizes model complexity: since $\Delta w_{\text{posterior}} < \Delta w_{\text{prior}}$, the term is negative, and its magnitude grows the more finely the parameter is tuned to the data (smaller ratio), which reduces the model evidence (optimal: penalty as small as possible)
	 \item Hence, model evidence favors models where we have a trade-off between model fit and complexity
	 \begin{figure}[ht]
	 	\centering
	 	\includegraphics[width=0.5\textwidth]{figures/bayesian_model_comparison_log.png}
	 	\caption{Plotting the curve of $\ln p\left(\mathcal{D}|\mathcal{M}_i\right)$ for different polynomials $M=0,1,...$ for the task of fitting a sine. As the sine is an odd function, polynomials of odd order fit the best (give the most improvement for the model fit). However, increasing the model complexity increases the penalty.}
	 \end{figure}
	 \item For a model with $K$ parameters, we get a similar approximation (assuming the same ratio $\Delta w_{\text{posterior}}/\Delta w_{\text{prior}}$ for all parameters): 
	 $$\ln p\left(\mathcal{D}|\mathcal{M}_i\right) \simeq \ln p\left(\mathcal{D}|\bm{w}_{\text{MAP}}, \mathcal{M}_i\right) + K \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}$$
	 \item Drawbacks of Bayesian approach:
	 \begin{itemize}
	 	\item Still need to make assumptions about possible models
	 	\item If no model is suitable for the data, the algorithm gives bad estimations
	 	\item Model evidence is sensitive regarding the prior
	 	\item Thus, in practice it is still common to keep a small independent test set to evaluate the selected model
	 \end{itemize}
	 \begin{figure}[ht]
	 	\centering
	 	\includegraphics[width=0.4\textwidth]{figures/bayesian_model_comparison.png}
	 	\caption{Illustration of three different models and their corresponding model evidences. Horizontal axis $x$: one-dimensional representation of all possible datasets; vertical axis $y$: probability that these models generate this specific dataset based on their prior distribution of parameters $\bm{w}$. $\mathcal{M}_1$ is the simplest model and is therefore only able to create a small set of different datasets $\mathcal{D}$. As the probability is normalized over all datasets $\mathcal{D}$, its probability on these datasets is higher than for more complex models like $\mathcal{M}_2$ and $\mathcal{M}_3$. Given a certain dataset $\mathcal{D}_0$, we choose the model with the highest probability $\Rightarrow$ the model which is just complex enough to generate this dataset}
	 \end{figure}
 	
\end{itemize}
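\begin{itemize}
	\item A small numeric illustration of the complexity penalty (the numbers are made up for this example): if the data narrows each parameter down to a tenth of its prior width, $\Delta w_{\text{posterior}}/\Delta w_{\text{prior}} = 0.1$, every parameter costs $\ln 0.1 \approx -2.30$ in log evidence. A model with $K=5$ such parameters therefore pays
	$$K \ln \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}} = 5 \cdot \ln 0.1 \approx -11.5$$
	\item Under this approximation, the 5-parameter model is preferred over a 1-parameter model only if its log likelihood at $\bm{w}_{\text{MAP}}$ is larger by more than $4\cdot 2.30 \approx 9.2$.
\end{itemize}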
\subsubsection{Model Evidence for Linear Basis Models}
\begin{itemize}
	\item In fully Bayesian treatment, we must also consider all hyperparameters:
	\begin{equation*}
		\begin{split}
			p\left(\bm{t}|\bm{X}, \mathcal{M}_i\right) & =\int\int\int p\left(\bm{t}|\bm{X},\bm{w},\beta,\mathcal{M}_i\right)p\left(\bm{w}|\alpha\right)p\left(\alpha, \beta | \mathcal{M}_i\right)d\bm{w}\text{ }d\alpha\text{ }d\beta\\
			& = \int\int \underbrace{p\left(\bm{t}|\bm{X}, \beta, \alpha, \mathcal{M}_i\right)}_{\text{peaked posterior/prior}} \underbrace{p\left(\alpha, \beta | \mathcal{M}_i\right)}_{\text{broad hyperprior}} d\alpha \text{ }d\beta\\
		\end{split}
	\end{equation*}
	\item Note that the hyperprior can again contain new hyperparameters, for which one might have to define a new prior (and so on)
	\item Approximation: instead of integrating over the hyperparameters, take the best hyperparameters $\alpha^*$ and $\beta^*$, i.e. those maximizing the marginal likelihood:
	$$\left(\alpha^*, \beta^*\right) = \arg\max_{\alpha, \beta} p\left(\bm{t}|\bm{X}, \beta, \alpha, \mathcal{M}_i\right)$$
	\item Using this approximation, we come to following predictive distribution:
	$$p\left(t'|\bm{x}',\bm{t}, \bm{X}, \mathcal{M}_i^*\right) \approx p\left(t'|\bm{x}',\bm{t}, \bm{X}, \beta^*, \alpha^*, \mathcal{M}_i\right)$$
\end{itemize}
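\begin{itemize}
	\item For the zero-mean isotropic prior $p(\bm{w}|\alpha) = \mathcal{N}\left(\bm{0}, \alpha^{-1}\bm{I}\right)$ and Gaussian noise with precision $\beta$, the marginal likelihood that is maximized over $\alpha$ and $\beta$ has a closed form. The expression below is the standard result for this setting, stated here as a reference sketch; $M$ denotes the number of parameters in $\bm{w}$, and $\bm{A}$ and $E(\bm{m}_N)$ are shorthands assumed here:
	$$\ln p\left(\bm{t}|\bm{X}, \alpha, \beta\right) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\bm{m}_N) - \frac{1}{2}\ln|\bm{A}| - \frac{N}{2}\ln 2\pi$$
	where $\bm{A} = \alpha\bm{I} + \beta\bm{\Phi}^T\bm{\Phi}$ and $E(\bm{m}_N) = \frac{\beta}{2}\left\|\bm{t} - \bm{\Phi}\bm{m}_N\right\|^2 + \frac{\alpha}{2}\bm{m}_N^T\bm{m}_N$
	\item The term $E(\bm{m}_N)$ measures the model fit, while $\ln|\bm{A}|$ grows with the number of well-determined parameters and thus acts as the complexity penalty.
\end{itemize}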
\subsection{Limitations of fixed basis functions}
\textbf{Advantages}
\begin{itemize}
	\item[+] Closed form solution for least-squares problem 
	\item[+] Tractable Bayesian treatment
	\item[+] Nonlinear models mapping input variables to target variables through basis functions
\end{itemize}
\textbf{Limitations}
\begin{itemize}
	\item[-] Assumption: Basis functions $\phi_j(\bm{x})$ are fixed, not learned
	\item[-] \textit{Curse of dimensionality}: to cover growing dimensions $D$ of input vectors, the number of basis functions needs to grow rapidly / exponentially
\end{itemize}

================================================
FILE: Machine_Learning_1/ml_neural_networks.tex
================================================
\section{Neural Networks}
\begin{itemize}
	\item Previously: fixed basis function $\bm{\phi}(\bm{x}) = \left(\phi_0\left(\bm{x}\right), \phi_1\left(\bm{x}\right), ..., \phi_M\left(\bm{x}\right)\right)^T$
	\item Neural networks: Create flexible non-linear features and learn them. 
	\begin{itemize}
		\item Basis function with extra parameters: $\phi_m\left(\bm{x},\bm{w}_m^{(1)}\right) = h\left(\left(\bm{w}_m^{(1)}\right)^T \bm{x}\right) = h\left(\sum\limits_{d=0}^{D}w_{md}^{(1)}x_d\right)$ 
		\item Note that $\bm{x}_n = \left(x_{n0}, x_{n1},...,x_{nD}\right)^T$ with $x_{n0}=1$ for the bias $\Rightarrow \bm{x}_n \in \mathbb{R}^{D+1}$
		\item $h$ is the non-linear activation function
	\end{itemize}
	\item We can define regression for a neural network with one hidden layer:
	\begin{equation*}
		\begin{split}
			y\left(\bm{x}, \bm{W}^{(1)}, \bm{w}^{(2)}\right) & = \sum\limits_{m=0}^{M} w_m^{(2)} h\left(\sum\limits_{d=0}^{D}w_{md}^{(1)}x_d\right) \\
			& = \left(\bm{w}^{(2)}\right)^T h\left(\bm{W}^{(1)} \bm{x}\right) \text{\hspace{5mm}where\hspace{5mm}} \bm{W}^{(1)} = \left(\begin{array}{cccc}
			\mid & \mid & \dots & \mid\\
			\bm{w}_0^{(1)} & \bm{w}_1^{(1)} & \dots & \bm{w}_D^{(1)}\\
			\mid & \mid & \dots & \mid
			\end{array}\right)\\
		\end{split}
	\end{equation*}
	\item The same way, we can adjust a network for classification:
	$$y\left(\bm{x}, \bm{W}^{(1)}, \bm{w}^{(2)}\right) = f\left(\sum\limits_{m=0}^{M} w_m^{(2)} h\left(\sum\limits_{d=0}^{D}w_{md}^{(1)}x_d\right)\right) = f\left(\left(\bm{w}^{(2)}\right)^T h\left(\bm{W}^{(1)} \bm{x}\right)\right)$$
	where $f$ is the sigmoid for binary and the softmax for multi-class classification (in the latter case, $\bm{w}^{(2)}$ becomes a $K\times (M+1)$ matrix)
\end{itemize}
\subsection{Feed-forward Network Functions}
\begin{itemize}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/neural_networks_overview.png}
		\caption{Illustration of a multilayer perceptron (2 layers).}
		\label{img:neural_networks_overview}
	\end{figure}
	\item The input consists of $D+1$ units, where $x_0=1$ is the bias
	\item First layer with $M\times (D+1)$ weight matrix $\bm{W}^{(1)}$ (including the bias column) $\Rightarrow$ $M$ activations $a_m = \sum\limits_{d=0}^{D} w_{md}^{(1)} x_{d}$; the hidden bias unit is fixed to $1$
	\item We apply an activation function on these activations to get the hidden units: $z_m = h^{(1)}(a_m)$ where $z_0=1$
	\item Second layer with $K\times (M+1)$ weight matrix $\bm{W}^{(2)}$ (including the bias column) $\Rightarrow$ $K$ output units $y_k = h^{(2)}\left(\sum\limits_{m=0}^{M} w_{km}^{(2)}z_m\right)$
	\item In conclusion, an output unit $y_k$ is calculated as follows:
	$$y_k\left(\bm{x}, \bm{W}^{(1)}, \bm{W}^{(2)}\right) = h^{(2)}\left(\sum\limits_{m=0}^{M} w_{km}^{(2)}\cdot h^{(1)}\left(\sum\limits_{d=0}^{D} w_{md}^{(1)}x_{d}\right)\right)$$
	\item Alternative notation: $y_k = h^{(2)} \circ \bm{a}^{(2)} \circ h^{(1)} \circ \bm{a}^{(1)} (\bm{x})$
	\item Additional forms: 
	\begin{itemize}
		\item \textit{Skip connections}: Connection between for instance first and fourth layer
		\item \textit{Sparse connections}: For instance convolutions, can have weight sharing
	\end{itemize}
	\item In general: $z_m = h\left(\sum\limits_{j}w_{mj}z_{j}\right)$ where $j$ are all incoming connections
	\item Note that no closed directed cycles are allowed
\end{itemize}
\subsubsection{Universal approximator}
\begin{itemize}
	\item Let $f$ be any continuous function on a compact subset of $\mathbb{R}^{D}$ and $h$ any fixed analytic function which is not polynomial (e.g. logistic function, tanh function, ...). Given any small acceptable error $\epsilon > 0$, we can find a number $M$ and weights $w_{m}^{(2)}$ and $w_{md}^{(1)}\in \mathbb{R}$ such that:
	$$\left|f(\bm{x}) - y\left(\bm{x}, \bm{W}^{(1)}, \bm{w}^{(2)}\right)\right| < \epsilon$$
	with $y$ as two-layer NN
	\item For smaller $\epsilon$ we need more hidden units $\Rightarrow$ larger $M$
	\item We may also use deeper networks, which are usually capable of approximating more complex functions with fewer units
	\item To approximate a deep network with a shallow one up to an error $\epsilon$, the number of units $M$ needed scales exponentially as $\epsilon$ decreases
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/neural_networks_universal_approximator.png}
		\caption{Example approximations by a 2-layer network with 3 hidden units of the function (a) $f(x)=x^2$, (b) $f(x)=\sin(x)$, (c) $f(x) = |x|$ and (d) $f(x)=\mathbb{I}\left(x>0\right)$. The outputs of the three hidden units are shown as dashed lines.}
		\label{img:neural_networks_universal_approximator}
	\end{figure}
\end{itemize}
\subsection{Network Training}
\begin{itemize}
	\item Use probabilistic interpretation of the network outputs to choose number of outputs, output activation function and loss (e.g. $p(t|\bm{x},\bm{w})$ for regression $\Rightarrow$ maximizing likelihood used as error function)
\end{itemize}
\subsubsection{Network training for regression}
\begin{itemize}
	\item Data input $\bm{x}_n \in \mathbb{R}^{D}$ with continuous target $t_n \in \mathbb{R}$
	\item Single real-valued target $\to$ Single output unit with identity activation function $y\left(\bm{x}, \bm{w}\right) = a^{\text{out}}$
	\item Derive loss function by maximum likelihood:
	$$E(\bm{w}) = -\ln p\left(\bm{t}|\bm{X}, \bm{w}\right) = \frac{\beta}{2}\sum\limits_{n=1}^{N}\left\{y(\bm{x}_n, \bm{w}) - t_n\right\}^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln 2\pi $$
	$$\text{equivalent to minimizing } E(\bm{w})=\frac{1}{2}\sum\limits_{n=1}^{N}\left\{y(\bm{x}_n, \bm{w}) - t_n\right\}^2$$
\end{itemize}
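As a brief sketch of where this loss comes from, assume (as the precision $\beta$ above suggests) Gaussian noise on the targets, $p\left(t|\bm{x},\bm{w}\right) = \mathcal{N}\left(t|y(\bm{x},\bm{w}), \beta^{-1}\right)$, and independent data points. Then:
\begin{equation*}
	\begin{split}
		-\ln p\left(\bm{t}|\bm{X}, \bm{w}\right) & = -\sum\limits_{n=1}^{N} \ln \left[\sqrt{\frac{\beta}{2\pi}}\exp\left(-\frac{\beta}{2}\left\{y(\bm{x}_n, \bm{w}) - t_n\right\}^2\right)\right]\\
		& = \frac{\beta}{2}\sum\limits_{n=1}^{N}\left\{y(\bm{x}_n, \bm{w}) - t_n\right\}^2 - \frac{N}{2}\ln\beta + \frac{N}{2}\ln 2\pi
	\end{split}
\end{equation*}
Only the first term depends on $\bm{w}$, and $\beta > 0$ is a constant factor, so minimizing over $\bm{w}$ reduces to the sum-of-squares error.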
\subsubsection{Network training for binary classification}
\begin{itemize}
	\item Targets are now binary values: $t_n\in\left\{0,1\right\}$
	\item As $p\left(t=1|\bm{x}\right) = 1 - p\left(t=0|\bm{x}\right)$, we model only one output unit: $y\left(\bm{x},\bm{w}\right) = p\left(t=1|\bm{x}\right)$
	\item The output activation function is therefore a sigmoid: $y\left(\bm{x}, \bm{w}\right) = \sigma\left(a^{\text{out}}\right)$
	\item Maximizing the likelihood is here equivalent to minimizing the binary cross-entropy (BCE): $$E\left(\bm{w}\right) = -\sum\limits_{n=1}^{N} \left\{t_n \ln y\left(\bm{x}_n, \bm{w}\right) + (1 - t_n) \ln \left(1 - y\left(\bm{x}_n, \bm{w}\right)\right)\right\}$$
\end{itemize}
\subsubsection{Network training for classification with $K$ classes}
\begin{itemize}
	\item Targets are now one-hot vectors $\bm{t}_n = \left(0,...,1,...,0\right)^T$
	\item Now, we have to model all $K$ class distributions by $y_k\left(\bm{x}, \bm{w}\right) = p\left(C_k|\bm{x}\right)$
	\item Activation function is softmax: $y_k\left(\bm{x}, \bm{w}\right) = \frac{\exp\left(a_k^{\text{out}}\right)}{\sum_{j=1}^{K}\exp\left(a_j^{\text{out}}\right)}$
	\item Maximizing the likelihood is here equivalent to minimizing the cross-entropy:
	$$E\left(\bm{w}\right) = - \sum\limits_{n=1}^{N} \sum\limits_{k=1}^{K} t_{nk} \ln y_k\left(\bm{x}_n, \bm{w}\right)$$
\end{itemize}
\subsubsection{Parameter optimization}
\begin{itemize}
	\item Optimal parameters minimize error function: $\bm{w}^{*} = \arg\min\limits_{\bm{w}} E(\bm{w})$
	\item Problem: $E(\bm{w})$ is not convex so that many local minima (can) exist 
	\item Different optimization strategies can be developed
	\item \textbf{Gradient Descent} uses full dataset for each update: $\bm{w}^{(\tau + 1)} = \bm{w}^{(\tau)} - \eta \triangledown E\left(\bm{w}^{(\tau)}\right)$
	\item Always goes in the direction of steepest descent (the negative gradient)
	\item Will easily get stuck in a local minimum, where $\triangledown E\left(\bm{w}\right) = 0$
	\item \textbf{Stochastic Gradient Descent} uses single data point or minibatches for the update step: $\bm{w}^{(\tau + 1)} = \bm{w}^{(\tau)} - \eta \triangledown\sum\limits_{i=1}^{M} E_i\left(\bm{w}^{(\tau)}\right)$ 
	\item Converges to area around local minimum
	\item More likely to escape local minimum as $\triangledown E\left(\bm{w}^{(\tau)}\right) = 0$ does not imply $\triangledown E_n\left(\bm{w}^{(\tau)}\right) = 0$ for all $n$
	\item Is more computationally efficient at the beginning, as the gradients of all $E_n\left(\bm{w}\right)$ will point in a similar direction
	\item Choose learning rate carefully to get good results
	\begin{itemize}
		\item If learning rate is too small: slow convergence
		\item If learning rate is too high: oscillations around local minimum
		\item Use learning rate scheduling with smaller learning rate over time
	\end{itemize}
\end{itemize}
\subsection{Error Backpropagation}
\begin{itemize}
	\item The error function is the sum of single point errors ($E(\bm{w})=\sum_{n}E_n(\bm{w})$), so that we can calculate the gradients for each data point independently: $\frac{\partial E_n(\bm{w})}{\partial \bm{w}}$ 
	\item Therefore, we first apply \textit{forward propagation}: calculate all $a_j^{(l)} = \sum_i w_{ji}^{(l)}z_i^{(l-1)}$ and $z_j^{(l)}=h^{(l)}\left(a_j^{(l)}\right)$
	\item Then, apply \textit{back propagation} by calculating all $\frac{\partial E_n}{\partial w_{ji}^{(l)}}$
	\item Backpropagation is based on the multi-dimensional chain rule: 
	$$\frac{\partial f\left(g_1\left(x\right), ..., g_D\left(x\right) \right)}{\partial x} = \sum\limits_{d=1}^{D}\frac{\partial f\left(g_1\left(x\right), ..., g_D\left(x\right)\right)}{\partial g_d(x)}\frac{\partial g_d(x)}{\partial x}$$
	\item Thus, we can express the gradient regarding a single weight element by (only $a_{j}^{(l)}$ depends on $w_{ji}^{(l)}$):
	$$\frac{\partial E_n}{\partial w_{ji}^{(l)}}=\frac{\partial E_n}{\partial a_{j}^{(l)}}\frac{\partial a_{j}^{(l)}}{\partial w_{ji}^{(l)}}$$
	\item The second part of the derivative is just $\frac{\partial a_{j}^{(l)}}{\partial w_{ji}^{(l)}}=z_{i}^{(l-1)}$, the first one we define as $\delta_j^{(l)}\equiv \frac{\partial E_n}{\partial a_j^{(l)}}$
	\item So, our derivative is $\frac{\partial E_n}{\partial w_{ji}^{(l)}}=\delta_j^{(l)}z_{i}^{(l-1)}$
	\item $a_j^{(l)}$ affects the error only through the units $a_k^{(l+1)}$ of the following layer $\Rightarrow \delta_j^{(l)}=\sum_{k}\frac{\partial E_n}{\partial a_k^{(l+1)}}\frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}} = \sum_{k}\delta_{k}^{(l+1)}\frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}$
	\item As $a_j^{(l)}$ affects $a_k^{(l+1)}$ only through the weight $w_{kj}^{(l+1)}$, the derivative is $\frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}=w_{kj}^{(l+1)}h^{(l)'} \left(a_{j}^{(l)}\right)$
	\item Note that we need to be careful with skip connections
	\item Overall, backpropagation can be summarized in four steps:
	\begin{enumerate}
		\item Compute $\delta_k$ for all output units
		\item Compute $\delta_j$ for all hidden units through backpropagation:
		$$\delta_j^{(l)} = h^{(l)'}\left(a_j^{(l)}\right)\sum\limits_{k}\delta_k^{(l+1)}w_{kj}^{(l+1)}$$
		\item Compute derivatives $\frac{\partial E_n}{\partial w_{ji}^{(l)}}=\delta_j^{(l)}z_{i}^{(l-1)}$
		\item Apply iterative weight update: 
		$w_{ji}^{(l)(\tau+1)} = w_{ji}^{(l)(\tau)}-\eta \delta_j^{(l)}z_{i}^{(l-1)}$
	\end{enumerate}
\end{itemize}
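As an illustration of these steps, consider a two-layer network with $\tanh$ hidden units, identity output units and the sum-of-squares error $E_n = \frac{1}{2}\sum_k \left(y_k - t_k\right)^2$ (this concrete setup is an assumption chosen for the example). Using $\tanh'(a) = 1 - \tanh^2(a)$ and $z_i^{(0)} = x_i$, the quantities above become:
\begin{equation*}
	\begin{split}
		\delta_k^{(2)} & = y_k - t_k\\
		\delta_j^{(1)} & = \left(1 - \left(z_j^{(1)}\right)^2\right)\sum\limits_{k}\delta_k^{(2)}w_{kj}^{(2)}\\
		\frac{\partial E_n}{\partial w_{kj}^{(2)}} & = \delta_k^{(2)}z_j^{(1)}, \hspace{5mm} \frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j^{(1)}x_i
	\end{split}
\end{equation*}
The same output error $\delta_k = y_k - t_k$ is obtained for a sigmoid output with the binary cross-entropy and for a softmax output with the multi-class cross-entropy.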
\subsection{Issues with Neural Networks}
\begin{itemize}
	\item Initialization of weights: randomly start near zero such that activations fall into linear part of activation functions (e.g. for tanh and sigmoid) and gradients don't vanish
	\item Networks perform best when input has mean 0 and variance 1
	\item With a large number of parameters, we need regularization!
	\item Multiple local minima: Non-convex error function. Restart experiment with different seeds and choose model with lowest regularized error
	\item Use weight sharing to reflect symmetries in data if possible
\end{itemize}

================================================
FILE: Machine_Learning_1/ml_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\definecolor{colkeyword}{rgb}{0,0.4,0}
\definecolor{colname}{rgb}{0.4,0.4,0}
\definecolor{coltype}{rgb}{0.4,0,0.4}
\definecolor{coloperators}{rgb}{0,0,1.0}
\definecolor{colscopes}{rgb}{0.4,0,0}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Machine Learning 1}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{ml_basic_probability.tex}
\input{ml_linear_regression.tex}
\input{ml_linear_classification.tex}
\input{ml_neural_networks.tex}
\input{ml_unsupervised_learning.tex}
\input{ml_kernel_methods.tex}
\input{ml_combining_models.tex}
\appendix
\newpage
\input{ml_appendix.tex}
\end{document}

================================================
FILE: Machine_Learning_1/ml_unsupervised_learning.tex
================================================
\section{Unsupervised learning}
\begin{itemize}
	\item We can express our data distribution by marginalizing latent variables (unobserved targets/values that make it easier to understand the data). This allows us to model the data with more tractable joint distributions with simpler components to understand:
	\begin{equation*}
		\begin{split}
			\bm{z}\text{ continuous: } & p\left(\bm{x}\right) = \int p\left(\bm{x}, \bm{z}\right) d\bm{z} = \int p\left(\bm{x}| \bm{z}\right) p\left(\bm{z}\right)d\bm{z}\\
			\bm{z}\text{ discrete: } & p\left(\bm{x}\right) = \sum\limits_{\bm{z}} p\left(\bm{x}, \bm{z}\right) = \sum\limits_{\bm{z}} p\left(\bm{x}| \bm{z}\right) p\left(\bm{z}\right)\\
		\end{split}
	\end{equation*}
	\item Discrete latent variables are typically used for clustering, whereas continuous are applied for dimensionality reduction 
\end{itemize}
\subsection{\textit{K}-means Clustering}
\begin{itemize}
	\item Every single data point $\bm{x}$ is assigned to a cluster $\to$ a discrete latent variable $\bm{z}$
	\item Number of clusters/different values for $\bm{z}$ must be determined beforehand
	\item A cluster is understood as a group of data points whose inter-point distances are small compared with the distances to points outside the cluster
	\item Hence, we define $\bm{\mu}_k$ as a prototype (here also the mean) of the cluster $k$, and minimize the sum of squares of the distances of each data point to its closest vector $\bm{\mu}_k$:
	$$J=\sum\limits_{n=1}^{N}\sum\limits_{k=1}^{K} z_{nk} ||\bm{x}_n - \bm{\mu}_k||^2$$
	where $\bm{z}_n$ is a one-hot vector with $z_{nk}=1$ if $\bm{\mu}_k$ is the closest cluster mean to $\bm{x}_n$
	\item Optimization algorithm (expectation-maximization (EM) algorithm):
	\begin{enumerate}
		\item Means $\bm{\mu}_k \in \mathbb{R}^D$ are initialized randomly
		\item Repeat until convergence ($\mu_k$ and $z_{nk}$ do not change for any $n$ and $k$):
		\begin{enumerate}
			\item \textbf{Expectation step}: Find the assignment of the closest cluster for every data point:
			$$\frac{\partial J}{\partial z_{nk}}=0 \Rightarrow z_{nk} = \begin{cases}
			1 & \text{ if } k=\arg\min\limits_j ||\bm{x}_n - \bm{\mu}_j||^2\\
			0 & \text{ otherwise }
			\end{cases}$$
			\item \textbf{Maximization step}: Find the means of each cluster:
			$$\frac{\partial J}{\partial \bm{\mu}_{k}}=0 \Rightarrow \bm{\mu}_k = \frac{\sum_n z_{nk}\bm{x}_n}{\sum_n z_{nk}}$$
		\end{enumerate}
	\end{enumerate}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/k_means_example.png}
		\caption{Illustration of the $K$-means algorithm. First, an expectation step is performed where the data points are assigned to a cluster (see \textit{(b)}, \textit{(d)}, \textit{(f)}, and \textit{(h)}), and then the maximization step optimizes the means of the clusters (see \textit{(c)}, \textit{(e)}, \textit{(g)}, and \textit{(i)}).}
		\label{img:k_means_example}
	\end{figure}
	\item The algorithm converges as each phase reduces (or at least does not increase) the value of the objective function $J$, but it might converge to a local rather than global minimum (perform multiple random restarts and choose the best minimum found)
	\item \textbf{Application}: image compression. Every pixel is a data point, and we search for $K$ clusters representing different colors in the image. The image is compressed by only using the cluster means (colors) instead of specifying a color at every pixel. A problem of this method is that the position correlations are ignored.
	\item \textbf{Failures} of $K$-means:
	\begin{itemize}
		\item $K$-means is only able to cluster spherical data due to the squared distance we try to minimize. Other shapes require different distance measures or data transformation by some basis functions.
		\item $K$-means strongly prefers clusters of the same size/spread, and therefore tries to find clusters with the same spread in the dataset.
		\item $K$-means is very sensitive to outliers. As it tries to minimize the squared distance, outliers may have a significant effect on the cluster means.
	\end{itemize}
	\item \textbf{Improvements}:
	\begin{itemize}
		\item For a large dataset, we can use SGD to reduce computational effort. The update for a single data point would look like:
		$$\bm{\mu}_k^{(\tau+1)} = \bm{\mu}_k^{(\tau)} - \eta \left(\frac{\partial J}{\partial \bm{\mu}_k^{(\tau)}}\right)^T = \bm{\mu}_k^{(\tau)} + 2 \eta \left(\bm{x}_n - \bm{\mu}_k^{(\tau)}\right)$$
		\item Use other distance measure between points that is for example not so sensitive to outliers:
		$$\tilde{J} = \sum\limits_{n=1}^{N}\sum\limits_{k=1}^{K} z_{nk} \mathcal{V}\left(\bm{x}_n, \bm{\mu}_k\right)$$
		where $\mathcal{V}$ measures the dissimilarity of $\bm{x}_n$ and $\bm{\mu}_k$.
	\end{itemize}
	\item \textbf{Pros and cons} of $K$-means
	\begin{itemize}
		\item[+] Simple to implement
		\item[+] Fast
		\item[$-$] Local minima
		\item[$-$] Only models spherical data 
		\item[$-$] Sensitive to feature scales and outliers
		\item[$-$] Number of clusters $K$ must be specified in advance with prior knowledge
		\item[$-$] Cluster assignments are hard and not probabilistic
	\end{itemize}
\end{itemize}
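A small worked example may help to see the two steps in action (the data and the initialization are made up for illustration). Take $N=4$ one-dimensional points $x_n \in \left\{0, 1, 9, 10\right\}$, $K=2$ clusters and initial means $\mu_1 = 0$, $\mu_2 = 1$:
\begin{enumerate}
	\item Expectation step: $0$ is assigned to cluster 1, and $1, 9, 10$ to cluster 2, so that $J = 0 + 0 + 64 + 81 = 145$
	\item Maximization step: $\mu_1 = 0$, $\mu_2 = \frac{1 + 9 + 10}{3} = \frac{20}{3} \approx 6.67$, so that $J = \frac{289 + 49 + 100}{9} \approx 48.67$
	\item Expectation step: the point $1$ is now closer to $\mu_1$ (distance $1$ versus $\frac{17}{3}$), so $0, 1$ belong to cluster 1 and $9, 10$ to cluster 2, giving $J = 0 + 1 + \frac{49}{9} + \frac{100}{9} \approx 17.56$
	\item Maximization step: $\mu_1 = 0.5$, $\mu_2 = 9.5$, so that $J = 4 \cdot 0.25 = 1$
	\item Expectation step: no assignment changes, hence the algorithm has converged
\end{enumerate}
The objective decreases in every step ($145 \to 48.67 \to 17.56 \to 1$), in line with the convergence argument above.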
\subsection{Mixture of Gaussians and EM algorithm}
\begin{itemize}
	\item Approximate the joint distribution $p\left(\bm{x},\bm{z}\right)=p\left(\bm{x}|\bm{z}\right)\cdot p\left(\bm{z}\right)$ by a mixture of Gaussians ($\bm{z}$ chooses the mixture component, and points in the cluster are Gaussian distributed)
	\item We define the prior as $p\left(z_k=1\right)=\pi_k$, where $\sum_k \pi_k= 1$ and $\pi_k \in [0,1]$
	\item The single clusters are Gaussian: $p\left(\bm{x}|z_k=1\right) = \mathcal{N}\left(\bm{x}|\bm{\mu}_k, \bm{\Sigma}_k\right)$
	\item Overall, the generative distribution is $p\left(\bm{x}\right) = \sum_{k=1}^{K} \pi_k \mathcal{N}\left(\bm{x}|\bm{\mu}_k, \bm{\Sigma}_k\right)$
	\item The posterior/conditional probability of $\bm{z}$ given $\bm{x}$ is also defined as the \textit{responsibility} (that component $k$ in the mixture model takes for 'explaining' the observation/data point $\bm{x}$):
	$$p\left(z_k=1|\bm{x}\right) = \frac{p\left(\bm{x}|z_k=1\right)\cdot p\left(z_k=1\right)}{\sum_j p\left(\bm{x}|z_j=1\right)\cdot p\left(z_j=1\right)}=\frac{\pi_k \mathcal{N}\left(\bm{x}|\bm{\mu}_k, \bm{\Sigma}_k\right)}{\sum_j \pi_j \mathcal{N}\left(\bm{x}|\bm{\mu}_j, \bm{\Sigma}_j\right)} = \gamma\left(z_{k}\right)$$
	\item To optimize our parameters, we again maximize the log-likelihood:
	$$\ln p\left(\bm{X}|\bm{\pi}, \bm{\mu}, \bm{\Sigma}\right) = \sum\limits_{n=1}^{N} \ln \sum\limits_{k=1}^{K} \pi_k \mathcal{N}\left(\bm{x}_n|\bm{\mu}_k, \bm{\Sigma}_k\right)$$
	\item However, maximizing the log-likelihood has no closed-form solution as stationary points depend on $\gamma\left(z_{nk}\right)$ which again depends on $\bm{\pi}, \bm{\mu}$ and $\bm{\Sigma}$ $\Rightarrow$ use expectation maximization algorithm by alternating the update of (\textit{expected}) posterior $\gamma\left(z_{nk}\right)$ and \textit{maximizing} for parameters $\bm{\pi}, \bm{\mu}$ and $\bm{\Sigma}$
	\item Maximizing with respect to $\bm{\mu}_k$:
	\begin{equation*}
		\begin{split}
			& \frac{\partial}{\partial \bm{\mu}_k} \sum\limits_{n=1}^{N} \ln p\left(\bm{x}_n|\left\{\pi_k\right\}_{k=1}^{K}, \left\{\bm{\mu}_k\right\}_{k=1}^{K}, \left\{\bm{\Sigma}_k\right\}_{k=1}^{K}\right)\\
			= & \sum\limits_{n=1}^{N} \frac{1}{ p\left(\bm{x}_n|\left\{\pi_k\right\}_{k=1}^{K}, \left\{\bm{\mu}_k\right\}_{k=1}^{K}, \left\{\bm{\Sigma}_k\right\}_{k=1}^{K}\right)} \frac{\partial}{\partial \bm{\mu}_k} p\left(\bm{x}_n|\left\{\pi_k\right\}_{k=1}^{K}, \left\{\bm{\mu}_k\right\}_{k=1}^{K}, \left\{\bm{\Sigma}_k\right\}_{k=1}^{K}\right)\\
			= & \sum\limits_{n=1}^{N} \frac{\pi_k \mathcal{N}\left(\bm{x}_n|\bm{\mu}_k, \bm{\Sigma}_k\right)}{\sum\limits_{j=1}^{K}\pi_j \mathcal{N}\left(\bm{x}_n|\bm{\mu}_j, \bm{\Sigma}_j\right)} \left(\bm{x}_n - \bm{\mu}_k\right)^T\bm{\Sigma}_k^{-1}\\
			= & \sum\limits_{n=1}^{N} \gamma\left(z_{nk}\right) \left(\bm{x}_n - \bm{\mu}_k\right)^T\bm{\Sigma}_k^{-1}\\
			\Rightarrow & \bm{\mu}_k = \frac{\sum_{n=1}^{N}\gamma\left(z_{nk}\right)\bm{x}_n}{\sum_{n=1}^{N}\gamma\left(z_{nk}\right)}
		\end{split}
	\end{equation*}
	\item For maximizing with respect to $\pi_k$ we need a Lagrange multiplier (as the $\pi_k$ must sum to 1):
	\begin{equation*}
		\begin{split}
			& \frac{\partial}{\partial \pi_k} \left[\sum\limits_{n=1}^{N} \ln p\left(\bm{x}_n|\left\{\pi_k\right\}_{k=1}^{K}, \left\{\bm{\mu}_k\right\}_{k=1}^{K}, \left\{\bm{\Sigma}_k\right\}_{k=1}^{K}\right) + \lambda \left(\sum\limits_{j=1}^{K}\pi_j - 1\right)\right]\\
			= & \sum\limits_{n=1}^{N} \frac{\mathcal{N}\left(\bm{x}_n|\bm{\mu}_k, \bm{\Sigma}_k\right)}{\sum\limits_{j=1}^{K}\pi_j \mathcal{N}\left(\bm{x}_n|\bm{\mu}_j, \bm{\Sigma}_j\right)} + \lambda = 0\\
			\Rightarrow & \sum\limits_{n=1}^{N} \gamma\left(z_{nk}\right) + \lambda \pi_k = 0 \text{\hspace{5mm}(multiplying both sides by } \pi_k \text{)}\\
			\Rightarrow & \pi_k = -\frac{1}{\lambda}\sum\limits_{n=1}^{N} \gamma\left(z_{nk}\right)\\[10pt]
			& \frac{\partial}{\partial \lambda} \sum\limits_{n=1}^{N} \ln p\left(\bm{x}_n|\left\{\pi_k\right\}_{k=1}^{K}, \left\{\bm{\mu}_k\right\}_{k=1}^{K}, \left\{\bm{\Sigma}_k\right\}_{k=1}^{K}\right) + \lambda \left(\sum\limits_{j=1}^{K}\pi_j - 1\right)\\
			= & \sum\limits_{j=1}^{K}\pi_j - 1 = -\frac{1}{\lambda}\sum\limits_{n=1}^{N} \underbrace{\sum\limits_{j=1}^{K}\gamma\left(z_{nj}\right)}_{=1} - 1 = 0\\
			\Rightarrow & \lambda = -N, \pi_k = \frac{N_k}{N} \text{\hspace{5mm}where\hspace{5mm}} N_k = \sum\limits_{n=1}^{N} \gamma\left(z_{nk}\right) \text{\hspace{2mm}(effective number of points in \textit{k})}
		\end{split}
	\end{equation*}
	\item Maximizing with respect to $\bm{\Sigma}_k$ gives:
	$$\bm{\Sigma}_k = \frac{1}{N_k} \sum\limits_{n=1}^{N} \gamma\left(z_{nk}\right)\left(\bm{x}_n - \bm{\mu}_k\right)\left(\bm{x}_n - \bm{\mu}_k\right)^T$$
	\item Summarized EM algorithm steps for Gaussian mixture models: 
	\begin{itemize}
		\item \textbf{Expectation step}: update the posterior:
		$$\gamma\left(z_{nk}\right) = \frac{\pi_k \mathcal{N}\left(\bm{x}_n|\bm{\mu}_k, \bm{\Sigma}_k\right)}{\sum_j \pi_j \mathcal{N}\left(\bm{x}_n|\bm{\mu}_j, \bm{\Sigma}_j\right)}$$
		\item \textbf{Maximization step}: update the parameters:
		\begin{equation*}
			\begin{split}
				\bm{\mu}_k & = \frac{\sum_{n=1}^{N}\gamma\left(z_{nk}\right)\bm{x}_n}{\sum_{n=1}^{N}\gamma\left(z_{nk}\right)}\\
				\pi_k & = \frac{N_k}{N}\\
				\bm{\Sigma}_k & = \frac{1}{N_k} \sum\limits_{n=1}^{N} \gamma\left(z_{nk}\right)\left(\bm{x}_n - \bm{\mu}_k\right)\left(\bm{x}_n - \bm{\mu}_k\right)^T\\
			\end{split}
		\end{equation*}
	\end{itemize}
	\item Assigning points to clusters: either soft clusters (probability of belonging to $k$: $\gamma\left(z_{k}\right) = p\left(z_k=1|\bm{x}\right)$) or hard clusters (most likely cluster given by $k=\arg\max_j \gamma\left(z_j\right)$)
	\item \textbf{Pros and cons} of GMM:
	\begin{itemize}
		\item[+] Allows soft-assignments in contrast to $K$-means
		\item[+] More flexible as we can model different covariances per cluster
		\item[$-$] Slower than $K$-means as every step requires more computation (can use $K$-means result as initialization)
		\item[$-$] Same local convergence issues as $K$-means
	\end{itemize}
\end{itemize}
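To make the soft assignment concrete, here is a minimal numeric sketch with made-up parameters: a one-dimensional mixture with $K=2$, $\pi_1 = \pi_2 = 0.5$, $\mu_1 = 0$, $\mu_2 = 2$ and $\sigma_1^2 = \sigma_2^2 = 1$. For the data point $x = 0$, the common factors $\pi_k$ and $\frac{1}{\sqrt{2\pi}}$ cancel in the responsibility:
$$\gamma\left(z_1\right) = \frac{\exp\left(-\frac{1}{2}(0 - 0)^2\right)}{\exp\left(-\frac{1}{2}(0 - 0)^2\right) + \exp\left(-\frac{1}{2}(0 - 2)^2\right)} = \frac{1}{1 + e^{-2}} \approx 0.881, \hspace{5mm} \gamma\left(z_2\right) \approx 0.119$$
$K$-means would assign this point entirely to cluster 1, whereas here cluster 2 still receives about $12\%$ of its weight in the updates of $\bm{\mu}_2$, $\bm{\Sigma}_2$ and $\pi_2$.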
\subsection{Principal Component Analysis}
\begin{itemize}
	\item Find linear orthogonal projection to lower dimensional space to maximize variance ($\mathbb{R}^{D}\to \mathbb{R}^{M}$ where $M<D$)
	\item Try to find projection by capturing axes of maximal variation in the data, called \textit{principal components}
	\item Covariance is given by $S=\frac{1}{N}\sum\limits_{n=1}^{N}\left(\bm{x}_n - \overline{\bm{x}}\right)\left(\bm{x}_n - \overline{\bm{x}}\right)^T$  which is symmetric and positive semi-definite
	\item Project data into first latent dimension by $\bm{u}_1 \in \mathbb{R}^D$. As we only need its direction, we make sure that $\bm{u}_1^T \bm{u}_1 = 1$
	\item The projection is given by $\bm{u}_1^T \bm{x}_n$.
	\item The variance of the projected data is $\bm{u}_1^T S \bm{u}_1$. We try to maximize this variance with respect to the constraint $\bm{u}_1^T \bm{u}_1 = 1$ (Lagrangian multiplier):
	$$\arg\max_{\bm{u}_1}\max_{\lambda_1} \bm{u}_1^T S \bm{u}_1 + \lambda_1 (1 - \bm{u}_1^T \bm{u}_1)$$
	Differentiating with respect to $\bm{u}_1$ and setting the result to zero gives us the equation $S\bm{u}_1 = \lambda_1 \bm{u}_1$, which is the eigenvalue equation. Thus, $\bm{u}_1$ is an eigenvector and $\lambda_1$ an eigenvalue of $S$. As we try to maximize the variance $\bm{u}_1^T S \bm{u}_1 = \lambda_1$, we choose $\lambda_1$ to be the \textit{greatest} eigenvalue.
	\item $\bm{u}_1$ is called a \textit{principal component}. The variance of the projected data is $\bm{u}_1^T S \bm{u}_1 = \lambda_1$
	\item We can repeat this procedure for $M$ orthogonal vectors and get a projection $U_M = \left[\bm{u}_1, ..., \bm{u}_M\right] \in \mathbb{R}^{D\times M}$. Those are $M$ eigenvectors of $S$, where $\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_M$. As $S$ is positive semi-definite, we can ensure that $\lambda_i\geq 0$
	\item The variance of the projected data is $\sum\limits_{j=1}^{M} \bm{u}_j^T S \bm{u}_j = \sum\limits_{j=1}^{M} \lambda_j$
	\item \textbf{PCA}: Compute $\overline{\bm{x}}$ and the eigen-decomposition of $S$. The \textbf{projection} is $\bm{z} = U_M^T \left(\bm{x} - \overline{\bm{x}}\right)$
	\item The idea is that the information which is lost is only noise, so that we still keep the expressiveness of the data. However, eigen-decomposition might be expensive.
	\item Note that eigenvalues can be found by solving the equation $\det\left(S - \lambda \bm{I}\right) = 0$. We can represent $S$ by its eigenvalue decomposition $S=U\Lambda U^T$. For the eigenvectors, we can state that $\bm{u}_j^T \bm{u}_i = 0 \text{ if } i\neq j, \text{ else } 1$.
	\item \textbf{Applications}
	\begin{itemize}
		\item \textit{Dimensionality reduction}: which $M$ to choose? To preserve at least 90\% of the variance we need to make sure that $\frac{\sum_{j=1}^{M}\lambda_j}{\sum_{j=1}^{D}\lambda_j} \geq 0.9$
		\item \textit{Feature de-correlation}: PCA ensures that features have no correlation in projected space. The covariance matrix is diagonal: $S'_M = \Lambda_M$
		\item \textit{Whitening}: center and de-correlate features by $\bm{z} = U_M^T (\bm{x}-\overline{\bm{x}})$ where $M$ can be equal to $D$. If we also want to rescale the features to unit variance, we divide each component by its standard deviation $\sqrt{\lambda_j}$: $\bm{z} = \Lambda_M^{-1/2} U_M^T (\bm{x}-\overline{\bm{x}})$ 
		\item \textit{Compression}: transform input to lower dimensional space. Reconstruction can be performed by $\tilde{\bm{x}} = U_M \bm{z} + \overline{\bm{x}}$
	\end{itemize}
	\item \textbf{Perspective of minimal reconstruction error}
	\begin{itemize}
		\item An alternative view on PCA is minimizing the reconstruction error of the transformed data to the original space
		$$\min \frac{1}{N} \sum\limits_{n=1}^{N} ||\bm{x}_n - \tilde{\bm{x}}_n||^2$$
		\item We can express our data by $\bm{x}_n = \sum\limits_{j=1}^{D} (\bm{x}_n^T \bm{u}_j) \bm{u}_j$. The approximation in the original space is thus $\tilde{\bm{x}}_n = \sum\limits_{j=1}^{M} (\bm{x}_n^T \bm{u}_j) \bm{u}_j + \sum\limits_{j=M+1}^{D} b_j \bm{u}_j$, where the $b_j$ are constants shared by all data points
		\item (By doing some math) we can show that the objective function is actually $\sum_{j=M+1}^{D} \bm{u}_j^T S \bm{u}_j$. Thus, both approaches lead to the same result
	\end{itemize}
\end{itemize}
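A small worked example, with a covariance matrix made up for illustration, shows how these pieces fit together. Let $D=2$ and
$$S = \begin{bmatrix} 2 & 1\\ 1 & 2 \end{bmatrix} \hspace{3mm}\Rightarrow\hspace{3mm} \det\left(S - \lambda \bm{I}\right) = (2 - \lambda)^2 - 1 = 0 \hspace{3mm}\Rightarrow\hspace{3mm} \lambda_1 = 3,\hspace{2mm} \lambda_2 = 1$$
with eigenvectors $\bm{u}_1 = \frac{1}{\sqrt{2}}\left(1, 1\right)^T$ and $\bm{u}_2 = \frac{1}{\sqrt{2}}\left(1, -1\right)^T$. Projecting onto $M=1$ component keeps the variance $\bm{u}_1^T S \bm{u}_1 = 3$, i.e. a fraction $\frac{\lambda_1}{\lambda_1 + \lambda_2} = 0.75$ of the total variance, while the discarded direction contributes $\bm{u}_2^T S \bm{u}_2 = \lambda_2 = 1$ to the reconstruction objective. Maximizing the retained variance and minimizing the reconstruction error therefore select the same component $\bm{u}_1$.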
\subsubsection{Probabilistic PCA}
\begin{itemize}
	\item Generative probabilistic version of PCA where we learn by maximizing the likelihood (both latent and observed are Gaussian)
	\item Generative model works as $\bm{x} = W\bm{z} + \bm{\mu} + \bm{\epsilon}$ where $\bm{\epsilon}$ represents the noise
	\item We define $p(\bm{z}) = \mathcal{N}(\bm{z}|0, \bm{I})$, $p(\bm{\epsilon}) = \mathcal{N}(\bm{\epsilon}|0, \sigma^2 \bm{I})$, and therefore $p(\bm{x}|\bm{z}) = \mathcal{N}(\bm{x}| W\bm{z} + \bm{\mu}, \sigma^2 \bm{I})$. By marginalizing out $\bm{z}$, we get $p(\bm{x}) = \mathcal{N}(\bm{x}|\bm{\mu}, C)$ with $C = WW^T + \sigma^2 \bm{I}$
	\item New data points are generated by first sampling from low dimensional $z$ space, and then sampling $x$ based on $z$
	\item We can optimize this sample distribution based on the maximum likelihood of a given dataset ($\mu$ is as usual the mean, $\sigma^2 = \frac{1}{D - M}\sum_{j=M+1}^{D} \lambda_j$)
\end{itemize}
\subsubsection{Non-linear variants of PCA}
\begin{itemize}
	\item Limitations of PCA: only linear transformations are possible, so we can only capture variance along linear axes through the $x$ space
	\item We can obtain non-linear projections with the following variants
	\item \textbf{Kernel PCA}
	\begin{itemize}
		\item We can define the covariance matrix in feature space by $C=\frac{1}{N}\sum\limits_{n=1}^{N} \bm{\phi}(\bm{x}_n)\bm{\phi}(\bm{x}_n)^T$
		\item Then, we can state that $z_i(\bm{x}) = \bm{\phi}(\bm{x})^T \bm{u}_i = \sum_{n=1}^{N}a_{in}\bm{\phi}(\bm{x})^T \bm{\phi}(\bm{x}_n) =  \sum_{n=1}^{N}a_{in}k(\bm{x},\bm{x}_n)$
		\item By using a non-linear kernel, we are able to get non-linear projections
	\end{itemize}
	\item \textbf{Autoencoders (NN)}
	\begin{itemize}
		\item Non-linear dimensionality reduction with neural networks by having a low hidden dimension size, and trying to reproduce input
		\item In variational auto-encoders, we introduce sampling from the latent space so that we can generate new data points
	\end{itemize}
\end{itemize}

================================================
FILE: Machine_Learning_2/ml2_appendix.tex
================================================
\section{Appendix Math}
Here we revisit some important mathematical tricks and equations to know.  
\subsection{Useful properties of a Gaussian}
Given $\bm{x}\sim \mathcal{N}(\bm{\mu}, \bm{\Sigma})$, suppose $\bm{x}=(\bm{x}_a, \bm{x}_b)$, define $\bm{\mu}=(\bm{\mu}_a, \bm{\mu}_b)$ and $\bm{\Sigma} = \begin{bmatrix}
\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}
\end{bmatrix}$ ($\Sigma_{ab}=\Sigma_{ba}^T$, $\Sigma_{aa}=\Sigma_{aa}^T$)\\

\textbf{Marginal distribution}: $p(\bm{x}_a) = \mathcal{N}(\bm{x}_a\mid \bm{\mu}_a, \Sigma_{aa})$\\

\textbf{Conditional distribution}: $p(\bm{x}_a\vert \bm{x}_b) = \mathcal{N}(\bm{x}_a\mid \bm{\mu}_{a|b}, \Sigma_{a|b})$\\ where $\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}, \bm{\mu}_{a\vert b} = \bm{\mu}_a + \Sigma_{ab}\Sigma_{bb}^{-1}\left(\bm{x}_{b}-\bm{\mu}_{b}\right)$\\[10pt]

\textbf{Multiplication of two Gaussians}: $\mathcal{N}(\bm{x}\mid \bm{a}, \bm{A})\mathcal{N}(\bm{x}\mid\bm{b}, \bm{B})=\mathcal{N}(\bm{x}|\bm{c}, \bm{C})\overbrace{\mathcal{N}(\bm{a}|\bm{b}, \bm{A}+\bm{B})}^{\text{Normalization constant}}$\\ where $\bm{C}=\left(\bm{A}^{-1}+\bm{B}^{-1}\right)^{-1}$, $\bm{c}=\bm{C}\left(\bm{A}^{-1}\bm{a}+\bm{B}^{-1}\bm{b}\right)$\\

\textbf{Conditional and marginals in graphical model}\\
\begin{wrapfigure}[3]{l}{0.1\textwidth}
	\centering
	\tikz{ %
		\node[latent] (x) {$\bm{x}$} ; %
		\node[latent, below=of x] (y) {$\bm{y}$} ; %
		
		\edge{x}{y}
	}
\end{wrapfigure}
\begin{equation*}
	\begin{split}
		p(x) & = \mathcal{N}(x|\mu, \Lambda^{-1})\\
		p(y|x) & = \mathcal{N}(y|Ax+b, L^{-1})\\[10pt]
		\Rightarrow p(y) & = \mathcal{N}(y|A\mu + b, L^{-1} + A\Lambda^{-1}A^T)\\
		\Rightarrow p(x|y) & = \mathcal{N}(x|\Sigma(A^T L (y - b) + \Lambda\mu), \Sigma), \hspace{2mm} \Sigma=(\Lambda + A^T L A)^{-1}
	\end{split}
\end{equation*}


\subsection{Distributions from the exponential family}
It is useful to know the exponential form of a few most popular distributions. Remember that in general, a distribution of the exponential family can be written in the form:
$$p(\bm{x}|\bm{\eta}) = h(\bm{x})g(\bm{\eta})\exp\left(\bm{\eta}^T \cdot \bm{u}(\bm{x})\right)$$
Some tricks to keep in mind:
\begin{itemize}
	\item $a^{b}=\exp(b\cdot \log a)$ - helpful to find sufficient statistics and natural parameters
	\item If we have the constraint $\sum_k \pi_k= 1$, replace $\pi_K$ with $\pi_K=1-\sum_{k\neq K} \pi_k$ $\Rightarrow$ one less parameter
\end{itemize}
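As a short illustration of the first trick, here is a sketch for the Bernoulli distribution (not listed below; $\mu$ denotes $p(x=1)$ and $x\in\{0,1\}$):
\begin{equation*}
	\begin{split}
		p(x|\mu) & = \mu^{x}(1-\mu)^{1-x} = \exp\left(x\ln\mu + (1-x)\ln(1-\mu)\right) = (1-\mu)\exp\left(x\ln\frac{\mu}{1-\mu}\right)\\
		\Rightarrow \eta & = \ln\frac{\mu}{1-\mu}, \hspace{3mm} u(x) = x, \hspace{3mm} h(x) = 1, \hspace{3mm} g(\eta) = 1 - \mu = \frac{1}{1+\exp(\eta)}
	\end{split}
\end{equation*}
The natural parameter is the log-odds, and inverting it gives back the logistic sigmoid $\mu = \frac{1}{1+\exp(-\eta)}$.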
\subsubsection{Gaussian}
\textbf{Univariate}:
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& p(x|\mu, \sigma^2) = \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\\[8pt]
	& \bm{\eta} = \begin{bmatrix}
	\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\\
	\end{bmatrix}^T\\
	& \bm{u}(\bm{x}) = \begin{bmatrix}
	x & x^2\\
	\end{bmatrix}^T\\
	& h(\bm{x}) = \frac{1}{\sqrt{2\pi}}\\
	& g(\bm{\eta}) = (-2\eta_2)^{\frac{1}{2}}\exp\left(\frac{\eta_1^2}{4\cdot \eta_2}\right)\\
	\end{split}
	\end{equation*}
\end{fleqn}
\textbf{Multivariate}:
\begin{fleqn}[\parindent]
	\begin{equation*}
		\begin{split}
			& p(\bm{x}|\bm{\mu}, \bm{\Sigma}) = \mathcal{N}(\bm{x}|\bm{\mu}, \bm{\Sigma}) = (2\pi)^{-D/2}|\Sigma|^{-1/2}\cdot \exp\left(-\frac{1}{2}(\bm{x}-\bm{\mu})^T\Sigma^{-1}(\bm{x}-\bm{\mu})\right)\\[8pt]
			& \bm{\eta} = \begin{bmatrix}
			\bm{\Sigma}^{-1}\bm{\mu} & -\frac{1}{2}\bm{\Sigma}^{-1}\\
			\end{bmatrix}^T\\
			& \bm{u}(\bm{x}) = \begin{bmatrix}
				\bm{x} & \bm{x}\bm{x}^T\\
			\end{bmatrix}^T\\
			& h(\bm{x}) = (2\pi)^{-D/2}\\
			& g(\bm{\eta}) = |-2\bm{\eta}_2|^{\frac{1}{2}} \cdot \exp\left(\frac{1}{4}\bm{\eta}_{1}^T\bm{\eta}_{2}^{-1}\bm{\eta}_{1}\right)\\
		\end{split}
	\end{equation*}
\end{fleqn}
\subsubsection{Beta}
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& p(x|\alpha, \beta) = \frac{x^{\alpha-1}(1 - x)^{\beta-1}}{B(\alpha, \beta)} \hspace{4mm}\text{where}\hspace{4mm}B(\alpha, \beta)=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}\\[8pt]
	& \bm{\eta} = \begin{bmatrix}
	\alpha & \beta\\
	\end{bmatrix}^T\\
	& \bm{u}(\bm{x}) = \begin{bmatrix}
	\log x & \log (1 - x)\\
	\end{bmatrix}^T\\
	& h(\bm{x}) = \frac{1}{x(1-x)}\\
	& g(\bm{\eta}) = \frac{1}{B(\eta_1, \eta_2)}\\
	\end{split}
	\end{equation*}
\end{fleqn}
\subsubsection{Multinomial}
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& p(\bm{x}|\bm{\pi}) = \frac{M!}{\prod_{i=1}^{K}x_i!}\prod_{i=1}^{K}\pi_i^{x_i}\\[8pt]
	& \bm{\eta} = \begin{bmatrix}
	\ln\frac{\pi_1}{1-\sum_{j=1}^{K-1}\pi_j} & \ln\frac{\pi_2}{1-\sum_{j=1}^{K-1}\pi_j} & ... & \ln\frac{\pi_{K-1}}{1-\sum_{j=1}^{K-1}\pi_j}
	\end{bmatrix}^T\\
	& \bm{u}(\bm{x}) = \begin{bmatrix}
	x_1 & x_2 & ... & x_{K-1}
	\end{bmatrix}^T\\
	& h(\bm{x}) = \frac{M!}{\prod_{i=1}^{K}x_i!}\\
	& g(\bm{\eta}) = \exp\left(-M\ln \left(1 + \sum_{i=1}^{K-1}\exp(\eta_i)\right)\right)\\
	\end{split}
	\end{equation*}
\end{fleqn}
\subsubsection{Dirichlet}
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& p(\bm{x}|\alpha_1,..,\alpha_K) = \frac{1}{B(\alpha_1,..., \alpha_K)}\prod_{i=1}^{K}x_i^{\alpha_i-1} \hspace{4mm}\text{where}\hspace{4mm}B(\alpha_1,...,\alpha_K)=\frac{\prod_{i=1}^{K}\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{K} \alpha_i)}\\[8pt]
	& \bm{\eta} = \begin{bmatrix}
	\alpha_1 & ... & \alpha_K\\
	\end{bmatrix}^T\\
	& \bm{u}(\bm{x}) = \begin{bmatrix}
	\log x_1 & ... & \log x_K\\
	\end{bmatrix}^T\\
	& h(\bm{x}) = \frac{1}{\prod_{i=1}^{K}x_i}\\
	& g(\bm{\eta}) = \frac{1}{B(\eta_1, ..., \eta_K)}\\
	\end{split}
	\end{equation*}
\end{fleqn}
\subsubsection{Poisson}
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& p(x|\lambda) = \frac{\lambda^{x}\exp(-\lambda)}{x!}\\[8pt]
	& \bm{\eta} = \begin{bmatrix}
	\ln \lambda \\
	\end{bmatrix}^T\\
	& \bm{u}(\bm{x}) = \begin{bmatrix}
	x\\
	\end{bmatrix}^T\\
	& h(\bm{x}) = \frac{1}{x!}\\
	& g(\bm{\eta}) = \exp(-\exp(\eta))\\
	\end{split}
	\end{equation*}
\end{fleqn}
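\noindent As a quick sanity check of the Poisson entries (a sketch that only uses the general exponential-family identity $\E[\bm{u}(\bm{x})]=-\nabla_{\bm{\eta}}\ln g(\bm{\eta})$), differentiating $-\ln g$ recovers the well-known mean and variance of the Poisson distribution:
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& -\ln g(\eta) = \exp(\eta)\\
	& \E[x] = \frac{\partial}{\partial \eta}\exp(\eta) = \exp(\eta) = \lambda\\
	& \mathbb{V}\text{ar}[x] = \frac{\partial^2}{\partial \eta^2}\exp(\eta) = \exp(\eta) = \lambda\\
	\end{split}
	\end{equation*}
\end{fleqn}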
\subsubsection{Gamma}
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& p(x|a,b) = \frac{1}{\Gamma(a)}b^{a}x^{a-1}\exp(-bx)\\[8pt]
	& \bm{\eta} = \begin{bmatrix}
	(a-1) & -b \\
	\end{bmatrix}^T\\
	& \bm{u}(\bm{x}) = \begin{bmatrix}
	\ln x & x\\
	\end{bmatrix}^T\\
	& h(\bm{x}) = 1\\
	& g(\bm{\eta}) = \frac{(-\eta_2)^{\eta_1+1}}{\Gamma(\eta_1+1)}\\
	\end{split}
	\end{equation*}
\end{fleqn}
\subsubsection{Conjugate priors}
\begin{itemize}
	\item Dirichlet $\to$ Multinomial 
	\item Dirichlet $\to$ Categorical 
	\item Beta $\to$ Bernoulli
	\item Gamma $\to$ Poisson
	\item Gaussian $\to$ Gaussian
	\item Gamma (precision) $\to$ Gaussian (known mean)
\end{itemize}
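\noindent As an illustration of one of these pairs, consider a Gamma prior on the rate $\lambda$ of a Poisson likelihood. This is only a sketch: the hyperparameters $a$, $b$ and the dataset $\{x_n\}_{n=1}^{N}$ are notation introduced for this example.
\begin{fleqn}[\parindent]
	\begin{equation*}
	\begin{split}
	& p(\lambda|x_1,...,x_N) \propto \text{Gamma}(\lambda|a,b)\prod_{n=1}^{N}p(x_n|\lambda) \propto \lambda^{a-1}\exp(-b\lambda)\cdot \lambda^{\sum_n x_n}\exp(-N\lambda)\\
	& \phantom{p(\lambda|x_1,...,x_N)} = \lambda^{a+\sum_n x_n-1}\exp\left(-(b+N)\lambda\right)\\[8pt]
	& \implies p(\lambda|x_1,...,x_N) = \text{Gamma}\left(\lambda\,\Big\vert\, a+\sum_{n=1}^{N}x_n,\; b+N\right)\\
	\end{split}
	\end{equation*}
\end{fleqn}
The posterior is again a Gamma distribution, in which the observed counts are added to $a$ and the number of observations is added to $b$.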

================================================
FILE: Machine_Learning_2/ml2_causality.tex
================================================
\section{Causality}
\begin{itemize}
	\item Causality is about testing whether one event (\textit{effect}) is the result of the occurrence of another event (\textit{cause}), i.e. a change in the cause will lead to a change in the effect. It differs from correlation by \textit{explaining}/\textit{finding} the relationship behind variables
	\item While we look at the data distribution for correlation, we are focusing on the generation mechanism of the data in causality. While in statistics we then like to predict the next observation (or its likelihood), causality is interested in what happens if we perform interventions (setting a variable to a certain value)
	\item The most important operator in causality is the \textbf{do-operator}: $p(A=a|\Cdo(B=b))$. It differs from the standard conditional probability by not assuming that we have observed $B=b$, but that we externally set the value of $B$. This means that we cannot infer anything about its parents as in standard conditionals (if we observe $B=b$, then this usually gives us information about its parents).
	
	Note that there are cases where $p(A=a|\Cdo(B=b))=p(A=a|B=b)$. One obvious example for this is when $B$ has no parents in its corresponding graphical model.
	\item We start with a discussion about the terminology in causality, and then take a closer look at Causal Bayesian networks and causal reasoning
\end{itemize}
\subsection{Causality terminology}
\begin{itemize}
	\item $A$ is said to \underline{cause} $B$ if changing $A$ leads to a change in $B$
	\item Similar to graphical models, we can define Causal graphs that represent causal relationships. An edge from $A$ to $B$ in the graph means that $A$ \textit{causes} $B$ even if all other variables are kept fixed. Figure~\ref{fig:causality_overview_causal_graphs} gives an overview.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/causality_overview_causal_graphs.png}
		\caption{Examples of causal graphs (retrieved from lecture notes).}
		\label{fig:causality_overview_causal_graphs}
	\end{figure}
	
	Note that these graphs can contain loops, which represent feedback (a change in $A$ leads to a change in $B$, and a change in $B$ leads to a change in $A$). However, we will not take a closer look at those.
	
	\item We can interpret node relationships in the graph as causal relations:
	\begin{itemize}
		\item $A$ is a \textit{parent} of $B$ $\implies$ $A$ is a direct cause of $B$
		\item $A$ is a \textit{child} of $B$ $\implies$ $A$ is a direct effect of $B$
		\item $A$ is an \textit{ancestor} of $B$ (e.g. $A\to C \to B$) $\implies$ $A$ is a cause of $B$. Note that if we fix $C$, there is no effect of $A$ on $B$. Hence, there is no direct edge.
		\item $A$ is a \textit{descendant} of $B$ (e.g. $B\to C\to A$) $\implies$ $A$ is an effect of $B$.
	\end{itemize}
	\item We use the notation $\mathcal{G}_{\overline{X}}$ to denote a sub-graph of $\mathcal{G}$ in which the incoming edges of $X$ are removed. This is useful for discussing when $X$ is set externally (hence, no influence of parents of $X$ in that case).
	
	Similarly, $\mathcal{G}_{\underline{X}}$ is $\mathcal{G}$ without the outgoing edges of $X$.
	\item  A \textbf{perfect intervention} $\Cdo(X=\xi)$ means that we force $X$ to be the value $\xi$. Thereby, the graph $\mathcal{G}$ changes to $\mathcal{G}_{\overline{X}}$
	\begin{itemize}
		\item To perform an intervention, we require \textit{modularity}, meaning that we can manipulate $X$ without influencing any other variables in the graph $\bm{V}\setminus X$
		\item This can be a challenge in real systems, but in our theoretical models, we can assume that we are able to do so
	\end{itemize}
	\item A variable $H$ is a \textbf{confounder} of $X$ and $Y$ (i.e. $H$ confounds $X$ and $Y$) if there is a directed path from $H$ to $X$ which does not include $Y$, and same from $H$ to $Y$. Note that it is still allowed to have other paths between $X$ and $Y$. Examples of confounders are shown in Figure~\ref{fig:causality_confounders_examples}.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/causality_confounders_examples.png}
		\caption{Examples of when a node $H$ is a confounder of $X$ and $Y$ (retrieved from lecture notes).}
		\label{fig:causality_confounders_examples}
	\end{figure}

	\item \textbf{Reichenbach's principle} relates correlation and causality: if $X$ and $Y$ are statistically dependent, then there is either a causal relation of the type $X\to Y$ or $Y\to X$, or there exists a confounder $H$ of $X$ and $Y$.
	\begin{itemize}
		\item Note that this principle can fail if we have a selection bias, meaning that the dataset of $X$ and $Y$ was obtained by only including samples that are conditional on some (possibly latent) event.
	\end{itemize}
\end{itemize}

\subsection{Causal Bayesian Networks}
\begin{itemize}
	\item Causal Bayesian Networks are a subclass of causal models that comes with many assumptions/limitations, but is therefore easier to work with. We make the following assumptions:
	\begin{itemize}
		\item No confounding % A graph $\mathcal{G}$ does not contain any confounder
		\item A graph $\mathcal{G}$ does not contain any loops
		\item We do not have any selection bias in the data, nor measurement errors or time dependencies
	\end{itemize}
	\item We call a Bayesian Network causal if:
	\begin{itemize}
		\item Directed edges correspond with directed causal relations
		\item After a perfect intervention $\Cdo(X_{I}=x_I)$, the probability density becomes:
		\begin{equation*}
		\tcbox[nobeforeafter]{\(
			\begin{split}
				p\left(\bm{X}_{\bm{V}\setminus I}|\Cdo(X_I=x_I)\right) = \prod_{i\in \bm{V}\setminus I} p\left(x_i|\bm{x}_{\text{pa}_i}\right)
			\end{split}
			\)}
		\end{equation*}
		
	\end{itemize}
\end{itemize}

\subsection{Causal Reasoning}
\begin{itemize}
	\item The goal of causal reasoning is to estimate $p(y|\Cdo(X=x))$. If we can express it in terms of the observational distribution $p(x,y,...)$ we say it is \textbf{identifiable} from the observational distribution. 
	
	Note that it does not necessarily require all variables to be observable.
	\item Assume we have the following Causal Bayesian Network:
	
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[latent] (X) {$X$} ; %
			\node[latent, right=of X] (H) {$H$} ; %
			\node[latent, below=of H] (Y) {$Y$} ; %
			
			\edge{H}{X};
			\edge{H}{Y};
			\edge{X}{Y};
		}
	\end{figure}

	The standard conditional distribution is:
	$$p(y|x)=\int p(y|h,x)p(h|x)dh$$

	Now, assume we perform a perfect intervention on $X$, i.e. $\Cdo(X=x)$. What happens is that we neglect the effect of $H$ on $X$, \textit{but} we still need to consider the effect of $H$ on $Y$. Hence, the conditional becomes:
	$$p(y|\Cdo(X=x))=\int p(h)p(y|h,x)dh$$
	The important point is that changing $X$ must not influence $H$ by ``back-reasoning'' (i.e. $H$ causes $X$, so observing $X$ gives us information about $H$), which would in turn influence $Y$. We cannot change $H$ by just forcing $X$ to a value, as we \textit{overwrite} the effect of $H$ on $X$. Hence, we have to explicitly remove the dependency of $H$ on $X$ in the integral, i.e. use $p(h)$ instead of $p(h|x)$.
	
	%The result indicates that we had to \textit{adjust} the conditional probability for the effect of $H$. But suppose, we would not have the connection between $H$ and $Y$. Then, we would not have to adjust for $H$, and get $p(y|\Cdo(X=x))=p(y|x)$.
	\item We can derive a more general procedure for deciding which variables we need to \textit{adjust} our conditional probability for. This works in a very similar manner to d-separation: we need to find all variables that are implicitly changed by observing $X$ at a certain value (i.e. variables that influence which value $X$ takes) and that also influence $Y$. We do not want this influence because by forcing $X$ to a certain value, we cannot change the variables that cause $X$. Hence, we are looking for a set of variables $S$ that breaks this kind of influence, and remove their dependency on $X$.
	
	\item In general, we can determine whether $S$ is a sufficient set of variables we are adjusting by the following check:
	
	\begin{tcolorbox}[colback=white!80!gray,colframe=gray!75!black,title=Back-door criterion]
		A set of variables $S$ satisfies the back-door criterion relative to a variable pair ($X$, $Y$), if:
		\begin{enumerate}
			\item $X, Y\not\in S$
			\item No node of $S$ is a descendant (i.e. child of a child etc.) of $X$
			\item $S$ blocks all paths from $Y$ to $X$ where we have an incoming edge to $X$ (other directions irrelevant for path itself). A path is blocked by $S$ if:
			\begin{enumerate}
				\item It contains a collider $...\rightarrow u \leftarrow ...$ such that $u$ is not an ancestor of a node in $S$
				\item It contains a non-collider $...\rightarrow u$, $...\rightarrow u \rightarrow ...$, $...\leftarrow u \rightarrow ...$ such that $u$ is in $S$
			\end{enumerate}
		\end{enumerate}
		Then $S$ is admissible for adjustment to find the causal effect of $X$ on $Y$:
		$$p(y|\Cdo(X=x))=\int p(y|X=x,S=s)p(S=s)ds$$
		If $S=\emptyset$: $p(y|\Cdo(X=x))=p(y|X=x)$
	\end{tcolorbox}	

	To find the actual set $S$, we can simply perform the algorithm backwards. First, find all paths from $Y$ to $X$ with an incoming edge to $X$. Then, we start with $S=\emptyset$, and try to block all paths by adding variables to $S$. If all paths are blocked from the beginning, or no such paths exist, we can stop with $S=\emptyset$.
	
	Note that the simplest choice of $S$ is the set of all parents of $X$, i.e. all nodes with an edge to $X$. It might not be the smallest set, but it is always a valid one as we do not allow loops in BNs.
	
	\item \underline{Examples}: 
	\begin{itemize}
		\item Consider the examples in Figure~\ref{fig:causality_backdoor_example}. Note that we usually try to find the smallest set of admissible variables as this simplifies the integral we have to take.
		
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.5\textwidth]{figures/causality_backdoor_example.png}
			\caption{Example of admissible sets of variables for adjustment (retrieved from lecture notes).}
			\label{fig:causality_backdoor_example}
		\end{figure}
		\newpage
		\item Consider the following, slightly more complicated Causal Bayesian Network:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[latent] (X1) {$X_1$} ; %
				\node[latent, below=of X1] (X3) {$X_3$} ; %
				\node[latent, right=of X3] (X4) {$X_4$} ; %
				\node[latent, right=of X4] (X5) {$X_5$} ; %
				\node[latent, above=of X5] (X2) {$X_2$} ; %
				\node[latent, below=of X3] (X7) {$X_7$} ; %
				\node[latent, right=of X7] (X6) {$X_6$} ; %
				\node[latent, right=of X6] (X8) {$X_8$} ; %
				
				\edge{X1}{X3};
				\edge{X1}{X4};
				\edge{X3}{X7};
				\edge{X4}{X7};
				\edge{X4}{X8};
				\edge{X7}{X6};
				\edge{X6}{X8};
				\edge{X5}{X8};
				\edge{X2}{X5};
				\edge{X2}{X4};
			}
		\end{figure}	
	
		We are trying to find the set $S$ admissible for adjustment for the causal effect of $X_7$ on $X_8$ (the two nodes on the bottom, left and right). We have to consider all paths with incoming edges to $X_7$, so those entering from $X_3$ and $X_4$. To block the path $X_8\leftarrow X_4\to X_7$, we add $X_4$ to $S$: $S=\{X_4\}$. However, by doing this, we unblock another path: $X_8\leftarrow X_5\leftarrow X_2\to X_4\leftarrow X_1\to X_3\to X_7$. The collider $X_4$ no longer blocks this path, as it is included in $S$. So, we can add $X_3$, $X_5$, $X_1$ or $X_2$ to block this path. For example, we can take $S=\{X_3,X_4\}$, which is then admissible for adjustment as all paths are blocked.
	\end{itemize}
	\item Although we found a way to estimate what happens when we perform an intervention, the best way to find causal relations is to use randomized controlled trials. In a drug test, this would mean that we assign each person completely at random to take the drug or not, ensuring that there is no underlying selection bias in the process. By that, we break all back-door paths (as there is nothing besides a coin flip that causes the event of ``taking the drug'')

\end{itemize}
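
\noindent To make the difference between conditioning and intervening concrete, consider a small numeric sketch for the three-node network $H\to X$, $H\to Y$, $X\to Y$ from above. All variables are assumed to be binary here, and the probabilities are made-up values chosen only for illustration:
\begin{equation*}
	\begin{split}
		& p(H=1)=0.5,\hspace{4mm} p(X=1|H=1)=0.8,\hspace{4mm} p(X=1|H=0)=0.2\\
		& p(Y=1|X=1,H=1)=0.9,\hspace{4mm} p(Y=1|X=1,H=0)=0.5\\[5pt]
		& p(H=1|X=1) = \frac{0.8\cdot 0.5}{0.8\cdot 0.5 + 0.2\cdot 0.5} = 0.8\\
		& p(Y=1|X=1) = 0.9\cdot 0.8 + 0.5\cdot 0.2 = 0.82\\
		& p(Y=1|\Cdo(X=1)) = 0.9\cdot 0.5 + 0.5\cdot 0.5 = 0.70\\
	\end{split}
\end{equation*}
Observing $X=1$ makes $H=1$ more likely ($0.8$ instead of the prior $0.5$), which in turn raises the probability of $Y=1$. Under the intervention, $H$ keeps its prior distribution, so the two quantities differ. This corresponds to adjusting for $S=\{H\}$ in the back-door criterion.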

================================================
FILE: Machine_Learning_2/ml2_exponential_family.tex
================================================
\section{Introduction to popular distributions and their properties}
\begin{itemize}
	\item This section (lecture 1 and 2) reviews different kinds of distributions, including the exponential family, Student-t distribution and common distributions for binary and discrete variables
	\item Furthermore, we briefly introduce Independent Component Analysis and Information theory
	\item In general, the first two lectures gave some fundamental knowledge we will use a couple of times for the rest of the course
	\item More mathematical tricks or examples of the exponential family can be found in the appendix
\end{itemize}
\subsection{Exponential family distributions}
\textbf{(Bishop 2.4)}
\begin{itemize}
	\item A distribution is considered a member of the exponential family if it can be written as follows:
	\begin{equation*}
	\tcbox[nobeforeafter]{\(
		\begin{split}
			p(\bm{x}|\bm{\eta}) & = h(\bm{x})g(\bm{\eta})\exp\left(\bm{\eta}^T \cdot \bm{u}(\bm{x})\right)\\[5pt]
			\bm{\eta} & \hspace{3mm}\text{natural parameters}\\
			\bm{u}(\bm{x}) & \hspace{3mm}\text{sufficient statistics}\\
		\end{split}
	\)}
	\end{equation*}
	\item $\bm{u}(\bm{x})$ is called sufficient statistics because for the maximum likelihood estimate of $\bm{\eta}$, it is sufficient to record $\sum_{n=1}^{N}\bm{u}(\bm{x}_n)$ instead of the whole dataset $\left\{\bm{x}_n\right\}_{n=1}^{N}$ (see below for ML estimate)
	\item An important property of the exponential family is that the moments of the distribution (i.e. mean and variance) can be determined by differentiating $-\ln g(\bm{\eta})$ with respect to $\bm{\eta}$:
	\begin{equation*}
		\begin{split}
			\text{Normalization constant}\hspace{2mm} z(\bm{\eta}) & = \frac{1}{g(\bm{\eta})} = \int h(\bm{x})\exp\left(\bm{\eta}^T \cdot \bm{u}(\bm{x})\right) d\bm{x}\\
			\frac{\partial}{\partial \bm{\eta}} -\ln g(\bm{\eta}) & = \frac{1}{z(\bm{\eta})} \int h(\bm{x})\bm{u}(\bm{x})\exp\left(\bm{\eta}^T \cdot \bm{u}(\bm{x})\right) d\bm{x} = \E[\bm{u}(\bm{x})\vert \bm{\eta}]\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Note that these moments are of the sufficient statistics $\bm{u}(\bm{x})$, and not $\bm{x}$
		\item Additionally, the second moment around the mean can be determined by: $\nabla_{\bm{\eta}}^2 -\ln g(\bm{\eta})$
	\end{itemize}
	\item From the first moment, we can show that the \underline{MLE solution} of the natural parameters satisfies:
	$$\tcbox[nobeforeafter]{\(-\nabla_{\bm{\eta}}\ln g(\bm{\eta}) = \E[\bm{u}(\bm{x})\vert \bm{\eta}] \implies -\nabla_{\bm{\eta}}\ln g(\bm{\eta}_{\text{ML}}) = \frac{1}{N}\sum_{n=1}^{N} \bm{u}(\bm{x}_n)\)}$$
\end{itemize}
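\begin{itemize}
	\item \underline{Example}: a short sketch of the moment property for the Bernoulli distribution, written as $p(x|\eta)=\sigma(-\eta)\exp(\eta x)$, i.e. $h(x)=1$, $u(x)=x$ and $g(\eta)=\sigma(-\eta)$, where $\sigma$ denotes the logistic sigmoid:
	\begin{equation*}
		\begin{split}
			-\ln g(\eta) & = \ln\left(1+\exp(\eta)\right)\\
			\E[x\vert\eta] & = \frac{\partial}{\partial \eta}\ln\left(1+\exp(\eta)\right) = \sigma(\eta) = \mu\\
			\mathbb{V}\text{ar}[x\vert\eta] & = \frac{\partial^2}{\partial \eta^2}\ln\left(1+\exp(\eta)\right) = \sigma(\eta)\left(1-\sigma(\eta)\right) = \mu(1-\mu)\\
			\sigma(\eta_{\text{ML}}) & = \frac{1}{N}\sum_{n=1}^{N} x_n
		\end{split}
	\end{equation*}
	Both moments agree with the known mean and variance of the Bernoulli distribution, and the last line is the familiar estimate $\mu_{\text{ML}}=\frac{1}{N}\sum_n x_n$ expressed in the natural parameter.
\end{itemize}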
\subsubsection{Conjugate priors}
\begin{itemize}
	\item A prior $p(\bm{\eta})$ is called conjugate to the likelihood if the posterior $p(\bm{\eta}|\bm{X})$ has the same functional form as the prior
	\item Each member of the exponential family has a conjugate prior
	\item To find the conjugate prior for an exponential family likelihood, we only have to choose the prior such that its sufficient statistics $\bm{u}$ take on the same form as the natural parameters $\bm{\eta}$ of the likelihood. Then, we simply get:
	\begin{equation*}
		\begin{split}
			\bm{u}(\bm{x})_{\text{posterior}} & = \bm{\eta}_{\text{likelihood}} = \bm{u}(\bm{x})_{\text{prior}}\\
			\bm{\eta}_{\text{posterior}} & = \bm{u}(\bm{x})_{\text{likelihood}} + \bm{\eta}_{\text{prior}}
		\end{split}
	\end{equation*}
\end{itemize}
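\begin{itemize}
	\item \underline{Example}: a sketch of this matching for a Bernoulli likelihood. The prior hyperparameters $a$, $b$ and the count $m=\sum_n x_n$ are notation introduced for this example. Written as an exponential, the likelihood of $N$ observations is
	$$p(\bm{X}|\mu)=\prod_{n=1}^{N}\mu^{x_n}(1-\mu)^{1-x_n}=\exp\left(m\ln\mu + (N-m)\ln(1-\mu)\right)$$
	A conjugate prior therefore needs the sufficient statistics $\ln\mu$ and $\ln(1-\mu)$, which is exactly the Beta distribution $p(\mu)\propto\exp\left((a-1)\ln\mu + (b-1)\ln(1-\mu)\right)$. Multiplying both simply adds up the coefficients:
	$$p(\mu|\bm{X})\propto \mu^{a-1+m}(1-\mu)^{b-1+N-m} \implies p(\mu|\bm{X})=\text{Beta}(\mu|a+m, b+N-m)$$
\end{itemize}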
\subsubsection{Bayesian Inference for Gaussian}
\begin{itemize}
	\item We can demonstrate the conjugate prior idea for Gaussians (one dimensional), where we have to distinguish three cases
\end{itemize}
\begin{enumerate}
	\item \underline{Variance known, mean estimated}
	\begin{itemize}
		\item Conjugate prior is a Gaussian $p(\mu)=\mathcal{N}(\mu\vert\mu_0, \sigma_0^2)$ such that our posterior has the distribution:
		\begin{equation*}
		\tcbox[nobeforeafter]{\(
			\begin{split}
				& \textbf{Variance known, mean estimated}\\
				& p(\mu|\mathcal{D})=\mathcal{N}(\mu|\mu_N, \sigma_N^2), \hspace{5mm}\mu_N= \frac{\sigma^2 \mu_0 + N\sigma_0^2 \mu_{\text{ML}}}{N\sigma_0^2 + \sigma^2}, \hspace{5mm}\frac{1}{\sigma_N^2}=\frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
			\end{split}
		\)}
		\end{equation*}
	\end{itemize}
	\item \underline{Mean known, variance estimated}
	\begin{itemize}
		\item Conjugate prior for the precision $\lambda=\frac{1}{\sigma^2}$ is a Gamma distribution $\text{Gamma}(\lambda|a_0, b_0)$ such that the posterior is:
		\begin{equation*}
		\tcbox[nobeforeafter]{\(
			\begin{split}
			& \textbf{Mean known, variance estimated}\\
			& p(\lambda|\mathcal{D})=\text{Gamma}(\lambda|a_N,b_N), \hspace{5mm}a_N=a_0+\frac{N}{2},\hspace{5mm}b_N = b_0+\frac{1}{2}\sum_n(x_n-\mu)^2
			\end{split}
			\)}
		\end{equation*}
	\end{itemize}
	\item \underline{Variance and mean estimated}
	\begin{itemize}
		\item If both are unknown, we have a ``normal-Gamma'' distribution as prior and posterior: $p(\mu,\lambda|a,b,\mu_0, \beta)=\mathcal{N}(\mu|\mu_0, (\beta \lambda)^{-1})\text{Gamma}(\lambda|a,b)$
		\item Finding the posterior is harder in this case because of the combined distribution. For details, see Bishop; this case was not discussed further in the lecture
	\end{itemize}
\end{enumerate}
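\begin{itemize}
	\item \underline{Example}: a small numeric sketch of the first two cases, with made-up values chosen only for illustration. For known variance $\sigma^2=1$, prior $\mu_0=0$, $\sigma_0^2=1$ and $N=4$ observations with $\mu_{\text{ML}}=2$:
	$$\mu_N=\frac{1\cdot 0 + 4\cdot 1\cdot 2}{4\cdot 1 + 1}=1.6, \hspace{5mm}\frac{1}{\sigma_N^2}=\frac{1}{1}+\frac{4}{1}=5 \implies \sigma_N^2=0.2$$
	The posterior mean lies between the prior mean and the ML estimate, and moves towards $\mu_{\text{ML}}$ as $N$ grows. For known mean, prior $a_0=1$, $b_0=1$ and $N=4$ observations with $\sum_n (x_n-\mu)^2=2$:
	$$a_N = 1 + \frac{4}{2} = 3, \hspace{5mm} b_N = 1 + \frac{1}{2}\cdot 2 = 2 \implies \E[\lambda|\mathcal{D}]=\frac{a_N}{b_N}=1.5$$
\end{itemize}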
\subsection{Student's-t distribution}
\begin{itemize}
	\item The Student-t distribution is ``heavy-tailed'', meaning that the probability density decreases more slowly with the distance from the mean/center than for a Gaussian (polynomial $\text{St}(x)\propto |x|^{-\alpha}$ instead of exponential $\mathcal{N}\propto e^{-\frac{x^2}{2\sigma^2}}$)
	\item This makes the distribution more \underline{robust against outliers} as the MLE solution is less influenced by those and focuses more on the biggest data point mass (see Figure~\ref{fig:exponential_families_student_t})
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.25\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/exponential_families_student_t.png}
			\caption{MLE estimate}
			\label{fig:exponential_families_student_t}
		\end{subfigure}
		\hspace{10mm}
		\begin{subfigure}{0.3\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/exponential_families_student_t_nu.png}
			\caption{Effect of parameter $\nu$}
			\label{fig:exponential_families_student_t_nu}
		\end{subfigure}
		\caption{(a) Comparison of MLE solution of Student-t distribution (red) and Gaussian (green). (b) The effect of the parameter $\nu$ for fixed $\mu=0$ and $\lambda=1$. For $\nu\to\infty$, the distribution becomes a Gaussian.}
	\end{figure}
	\item It emerges from an infinite mixture of Gaussians with a fixed mean and the precision (i.e. inverse variance) distributed as a Gamma distribution:
	\begin{enumerate}
		\item Draw precision $\tau \sim \text{Gamma}(a,b)$
		\item Draw $x\sim \mathcal{N}(\mu, \tau^{-1})$
	\end{enumerate}
	Then the resulting $x$ will be distributed according to the Student-t distribution
	$$x \sim \text{St}(x\mid \mu, \lambda=a/b, \nu=2a)$$
	\item By marginalizing out $\tau$, we can derive the PDF of the Student-t distribution:
	\begin{equation*}
		\begin{split}
			\text{Scalar}\hspace{2mm}&\text{St}(x\mid\mu, \lambda=a/b, \nu=2a) = \frac{b^a}{\Gamma(a)\sqrt{2\pi}}\left(b + \frac{(x-\mu)^2}{2}\right)^{-a-\frac{1}{2}}\Gamma\left(a+\frac{1}{2}\right)\\[8pt]
			\text{d-dimensional}\hspace{2mm} &  \text{St}(\bm{x}\mid\bm{\mu}, \bm{\Sigma}, \nu) = \frac{\Gamma\left(\frac{d}{2} + \frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\frac{1}{\left(\pi\nu\right)^{d/2}\left|\bm{\Sigma}\right|^{1/2}}\left(1+\nu^{-1}\left(\bm{x}-\bm{\mu}\right)^T \bm{\Sigma}^{-1}\left(\bm{x}-\bm{\mu}\right)\right)^{-\frac{d}{2}-\frac{\nu}{2}}\\
		\end{split}
	\end{equation*}
	\item The parameter $\lambda$ is often called precision, but does not exactly represent the inverse of the variance.
	\item $\nu$ is called the degrees of freedom (see Figure~\ref{fig:exponential_families_student_t_nu}). For $\nu\to\infty$, the Student-t distribution becomes a Gaussian $\mathcal{N}(x\vert\mu, \lambda^{-1})$
\end{itemize}
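\begin{itemize}
	\item \underline{Derivation sketch}: the scalar PDF follows from the two sampling steps by integrating out the precision $\tau$. The only tool needed is the Gamma integral $\int_{0}^{\infty}\tau^{c-1}\exp(-z\tau)d\tau=\Gamma(c)z^{-c}$, where $c$ and $z$ are placeholders introduced for this sketch:
	\begin{equation*}
		\begin{split}
			p(x) & = \int_{0}^{\infty}\mathcal{N}(x\vert\mu,\tau^{-1})\,\text{Gamma}(\tau\vert a,b)\,d\tau\\
			& = \int_{0}^{\infty}\left(\frac{\tau}{2\pi}\right)^{\frac{1}{2}}\exp\left(-\frac{\tau}{2}(x-\mu)^2\right)\frac{b^a}{\Gamma(a)}\tau^{a-1}\exp(-b\tau)\,d\tau\\
			& = \frac{b^a}{\Gamma(a)\sqrt{2\pi}}\int_{0}^{\infty}\tau^{a+\frac{1}{2}-1}\exp\left(-\tau\left[b+\frac{(x-\mu)^2}{2}\right]\right)d\tau\\
			& = \frac{b^a}{\Gamma(a)\sqrt{2\pi}}\,\Gamma\left(a+\frac{1}{2}\right)\left(b+\frac{(x-\mu)^2}{2}\right)^{-a-\frac{1}{2}}
		\end{split}
	\end{equation*}
	The last step applies the Gamma integral with $c=a+\frac{1}{2}$ and $z=b+\frac{(x-\mu)^2}{2}$.
\end{itemize}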
\subsection{Distributions for Binary and Discrete Variables}
\begin{itemize}
	\item In this section, we review common distributions for binary and discrete variables. We can actually find one-to-one correspondences between the binary and the categorical case:
	\begin{table}[ht!]
		\centering
		\begin{tabular}{c|c}
			Binary & Discrete\\\hline
			Bernoulli & Categorical\\
			Binomial & Multinomial\\
			Beta & Dirichlet
		\end{tabular}
		\vspace{-5mm}
	\end{table}
\end{itemize}
\subsubsection{Binary}
\begin{description}
	\item[Bernoulli distribution] can be interpreted as a coin flip, and models a single binary outcome:
	$$\text{Bern}(x|\mu)=\mu^{x}(1-\mu)^{1-x}, \hspace{3mm}x\in\{0,1\}$$
	\begin{itemize}
		\item Expectation $\E[x|\mu]=\mu$
		\item Variance $\mathbb{V}\text{ar}[x]=\E[x^2]-\E[x]^2=\mu(1-\mu)$
		\item Maximum likelihood estimate $\mu_{\text{ML}}=\frac{1}{N}\sum_{n=1}^{N} x_n$ (prone to overfitting for small datasets)
		\item Exponential family $p(x|\eta)=\sigma(-\eta)\exp(\eta\cdot x), \eta=\ln \frac{\mu}{1-\mu}$
	\end{itemize}

	\item[Binomial distribution] models $N$ i.i.d. Bernoulli experiments, where we define $m$ as $m=\sum_{i=1}^{N}x_i$, i.e. the number of times the outcome is $1$:
	$$\text{Bin}(m|N,\mu)=\frac{N!}{(N-m)!m!}\mu^{m}(1 - \mu)^{N-m}$$	
	\begin{itemize}
		\item Expectation $\E[m]=\sum_{i=1}^{N}\E[x_i]=N\cdot \mu$
		\item Variance $\mathbb{V}\text{ar}[m]=N\cdot \mu(1-\mu)$
		\item Maximum likelihood estimate $\mu_{\text{ML}}=\frac{m}{N}$
		\item Exponential family $p(m|\eta)=\frac{N!}{(N-m)!m!}\cdot \exp\left(N \log (1-\mu)\right) \cdot \exp\left(m\log\frac{\mu}{1-\mu}\right)$, $\eta=\log \frac{\mu}{1-\mu}$
		\item Conjugate prior: Beta distribution. The posterior is: $\text{Beta}(\mu|a+m,b+N-m)$
	\end{itemize}

	\item[Beta distribution] is the conjugate prior for the binomial distribution
	$$\text{Beta}(\mu|a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$$
	\begin{itemize}
		\item Expectation $\E[\mu]=\frac{a}{a+b}$
		\item Variance $\mathbb{V}\text{ar}[\mu]=\frac{ab}{(a+b)^2(a+b+1)}$
		\item Exponential family: see Appendix
	\end{itemize}
\end{description}
\subsubsection{Discrete}
\begin{description}
	\item[Categorical distribution] considers a single sample, and assigns each category a different probability. The input $\bm{x}$ is a one-hot vector, so the product simply selects the probability $\mu_k$ of the active category (the one with $x_k=1$).
	$$\text{Cat}(\bm{x}|\bm{\mu})=\prod_{k=1}^{K}\mu_k^{x_k}, \hspace{3mm}\sum_k \mu_k = 1$$
	 \begin{itemize}
	 	\item Expectation $\E[\bm{x}]=\bm{\mu}$
	 	\item Covariance $\text{Cov}[\bm{x}]=\text{diag}(\bm{\mu})-\bm{\mu}\bm{\mu}^T$
	 	\item Maximum likelihood estimate  $\bm{\mu}_{\text{ML}}=\frac{1}{N}\sum_{n=1}^{N} \bm{x}_n$
	 	\item Exponential family $p(\bm{x}|\bm{\eta})=\frac{1}{1+\sum_{k=1}^{K-1}\exp(\eta_k)}\cdot \exp(\bm{\eta}^T\bm{x})$, $\eta_k=\ln\frac{\mu_k}{1-\sum_{j=1}^{K-1}\mu_j}$
	 \end{itemize}
	\item[Multinomial distribution] takes $N$ i.i.d. categorical observations into account, where $m_k=\sum_{n=1}^{N} x_{nk}$.
	$$\text{Mult}(m_1,...,m_K|N,\bm{\mu})=\frac{N!}{\prod_{k=1}^{K}m_k!}\prod_{k=1}^{K} \mu_{k}^{m_k}$$
	\begin{itemize}
		\item Expectation: $\E[\bm{m}]=N\cdot \bm{\mu}$
		\item Covariance: $\text{Cov}[\bm{m}]=N(\text{diag}(\bm{\mu})-\bm{\mu}\bm{\mu}^T)$
		\item Maximum likelihood estimation: $\bm{\mu}_{\text{ML}}=\frac{\bm{m}}{N}$
		\item Exponential family: see Appendix
	\end{itemize}
	\item[Dirichlet distribution] is the conjugate prior for multinomial
	$$\text{Dir}(\bm{\mu}|\bm{\alpha})=\frac{\Gamma(\sum_k \alpha_k)}{\prod_k
	\Gamma(\alpha_k)} \prod_{k=1}^{K}\mu_k^{\alpha_k-1}$$
	\begin{itemize}
		\item Expectation $\E[\bm{\mu}]=\frac{1}{\alpha_0}\bm{\alpha}$ with $\alpha_0=\sum_k \alpha_k$
		\item Covariance $\text{Cov}[\bm{\mu}]=\frac{\alpha_0\,\text{diag}(\bm{\alpha})-\bm{\alpha}\bm{\alpha}^T}{\alpha_0^2(\alpha_0 + 1)}$ with $\alpha_0=\sum_k \alpha_k$
		\item Exponential family: see Appendix
	\end{itemize}
\end{description}
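\begin{itemize}
	\item \underline{Example}: a small sketch of the Dirichlet-multinomial update, with made-up numbers chosen only for illustration. Multiplying the multinomial likelihood with the Dirichlet prior adds the counts to the prior parameters:
	$$p(\bm{\mu}|\bm{m})\propto \prod_{k=1}^{K}\mu_k^{m_k}\cdot\prod_{k=1}^{K}\mu_k^{\alpha_k-1} \implies p(\bm{\mu}|\bm{m})=\text{Dir}(\bm{\mu}|\bm{\alpha}+\bm{m})$$
	For $K=3$, a uniform prior $\bm{\alpha}=(1,1,1)^T$ and counts $\bm{m}=(3,1,0)^T$ from $N=4$ observations, the posterior is $\text{Dir}(\bm{\mu}|(4,2,1)^T)$ with mean $\left(\frac{4}{7},\frac{2}{7},\frac{1}{7}\right)^T$. In contrast to the ML estimate $\frac{\bm{m}}{N}=\left(\frac{3}{4},\frac{1}{4},0\right)^T$, the unseen third category keeps a non-zero probability.
\end{itemize}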
\subsection{Independent Component Analysis}
\begin{itemize}
	\item Independent Component Analysis (ICA) tries to reconstruct source signals from linearly mixed measurements. For example, for two sources $S(t)=\begin{bmatrix}
	S_1(t)\\S_2(t)
	\end{bmatrix}$, we assume to have the measurements:
	$$X(t)=\begin{bmatrix}
	X_1(t)\\X_2(t)
	\end{bmatrix} = \begin{bmatrix}
	\alpha_1 S_1(t) + \beta_1 S_2(t)\\\alpha_2 S_1(t) + \beta_2 S_2(t)
	\end{bmatrix}$$
	The goal is now to find the parameter matrix
	$$\bm{A}=\begin{bmatrix}
	\alpha_1 & \beta_1\\ \alpha_2 & \beta_2
	\end{bmatrix}$$
	to reconstruct our signals $S(t)$ from the measurements $\bm{X}(t)=\bm{A}\bm{S}(t)$
	\item Note that we can only reconstruct $S(t)$ up to permutation and scaling/multiplicative factors as these give the same result
	\item As we assume the sources to be independent, we can write the joint probability distribution as:
	$$p(S_1,...,S_I)=\prod_{i=1}^{I}p(S_i)$$
	One crucial element of ICA is that these prior distributions need to be designed by the user. This requires prior knowledge of what the source signals look like (e.g. Gaussian, bounded Uniform, etc.). The performance of the algorithm depends on this design choice, and ICA can fail if the prior differs strongly from the actual distribution of the sources.
	\item We will again use a maximum likelihood  approach where we try to increase the probability of the observed data, which can be derived as:
	$$\ln p(\bm{x}|\bm{A})=\ln |\det \bm{A}^{-1}| + \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{I}\ln p_i\left(\sum_{j=1}^{I} \left(A^{-1}\right)_{ij} x^{(n)}_j\right)$$
	For simplicity, we replace $\bm{A}^{-1}=\bm{W}$, and aim to learn $\bm{W}$ which is slightly easier.
	\item We take now the derivative with respect to $\bm{W}$, and end up with the following expression:
	$$\bm{W}^{t+1}=\bm{W}^{t} + \alpha \cdot \frac{1}{N}\sum_{n=1}^{N}\left(\nabla_{\bm{S}} \log p(\bm{S})\Big\vert_{S=S_n}\bm{S}_n^T+\bm{I}\right)\bm{W}^{t}$$
	where we estimate $\bm{S}=\bm{W}\bm{X}$. In addition, we see here that what we actually need from our prior is the derivative of its log. Hence, the prior is mostly designed to have a simple form of $\Phi_i=\frac{\partial \ln p_i(a_i)}{\partial a_i}$.
	\item We can slightly simplify the gradient calculation by splitting it into multiple parts. Summarizing the full algorithm, we get:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=Independent Component Analysis]
		\begin{algorithm}[H]
			\SetAlgoLined
			Choose prior and calculate log derivative $\Phi_i=\frac{\partial \ln p_i(a_i)}{\partial a_i}$\;
			Set learning rate $\eta$\;
			Initialize $\bm{W}$ as an estimate of $\bm{A}^{-1}$\;
			\While{$\|\nabla \bm{W}^{(t)}\| > \epsilon$}{
				Let $\hat{\bm{S}}=\bm{W}\bm{X}$ be the current estimate of $\bm{S}$\;
				Let $\bm{Z}_i=\Phi_i(\hat{\bm{S}}_i)$\;
				Let $\bm{X}' = \bm{W}^T\hat{\bm{S}}$\;
				Calculate the gradients $\nabla \bm{W}^{(t)}= \bm{W}^{(t)} + \frac{1}{N}\left[ \bm{Z}{\bm{X}'}^T\right]$\;
				Apply gradient with learning rate $\bm{W}^{(t+1)}=\bm{W}^{(t)}+\eta \nabla \bm{W}^{(t)}$\;
			}
			Reconstruct signals $\bm{S}_n=\bm{W}\bm{X}_n$\;
		\end{algorithm}
	\end{tcolorbox} 
	\item One issue with ICA is that the source signals are not allowed to be Gaussian. If they were, we could only reconstruct the signals up to an arbitrary rotation, as (isotropic) Gaussians are rotation invariant. Hence, the reconstructed signals would still be mixed although we found an optimum
\end{itemize}
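\begin{itemize}
	\item \underline{Example}: a sketch of one possible prior choice. The heavy-tailed prior $p_i(a_i)\propto\frac{1}{\cosh(a_i)}$ is an assumption made for this example, and not necessarily the prior used in the lecture. It gives a particularly simple log derivative:
	$$\ln p_i(a_i) = -\ln\cosh(a_i) + \text{const} \implies \Phi_i(a_i)=\frac{\partial \ln p_i(a_i)}{\partial a_i} = -\tanh(a_i)$$
	Plugging this into the update rule, with $\hat{\bm{S}}_n=\bm{W}\bm{X}_n$ and $\tanh$ applied element-wise, gives
	$$\bm{W}^{t+1}=\bm{W}^{t} + \alpha\cdot\frac{1}{N}\sum_{n=1}^{N}\left(\bm{I}-\tanh(\hat{\bm{S}}_n)\hat{\bm{S}}_n^T\right)\bm{W}^{t}$$
\end{itemize}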
\subsection{Information theory}
\begin{itemize}
	\item The information of an event $A$ can be measured by:
	\begin{equation*}
		\begin{split}
			h(A) & = -\log_2 p(A)\hspace{4mm}\text{(in bits)}\\
			\text{or}\hspace{3mm} h(A) & = - \ln p(A)\hspace{4mm}\text{(in nats)}
		\end{split}
	\end{equation*}
	\item An important measurement of a distribution in information theory is the Shannon entropy, which can be interpreted as the expected information of an event according to the distribution $p$:
	\begin{equation*}
		\tcbox[nobeforeafter]{\(
			H(X) = -\sum_{x\in D_x} p(x)\log_2 p(x)
		\)}
	\end{equation*}
	In case we have $N$ independent events, the entropy is the sum of the single entropy of each of the $N$ events.
	\item The entropy can also be defined for continuous space. It is then referred to as the differential entropy:
	$$H(\bm{x})=-\int p(\bm{x})\log_2 p(\bm{x})d\bm{x}$$
	\item We can also define conditional entropy, which is as follows:
	\begin{equation*}
	\tcbox[nobeforeafter]{\(
		H(\bm{y}|\bm{x}) = -\int p(\bm{x})\left[\int p(\bm{y}|\bm{x})\ln p(\bm{y}|\bm{x})d\bm{y}\right]d\bm{x}
		\)}
	\end{equation*}
	with the property $H(\bm{x},\bm{y})=H(\bm{x})+H(\bm{y}|\bm{x})=H(\bm{y})+H(\bm{x}|\bm{y})$
	\item Another well-known measurement is the Kullback-Leibler divergence (also referred to as relative entropy):
	\begin{equation*}
		\tcbox[nobeforeafter]{\(
			\text{KL}(p(\bm{x})||q(\bm{x})) = -\int p(\bm{x})\ln \frac{q(\bm{x})}{p(\bm{x})}d\bm{x}
		\)}
	\end{equation*}
	Some properties of this divergence are:
	\begin{itemize}
		\item Always non-negative: $\text{KL}(p||q)\geq 0$
		\item If $\text{KL}(p||q) = 0$, then $p=q$ (if $p$, $q$ are sufficiently regular, i.e. strictly positive and the integral is defined)
		\item The triangle inequality does not hold for KL in general, thus it is not a distance measure: 
		
		$\text{KL}(p||q)+\text{KL}(q||r)\not\geq \text{KL}(p||r)$
	\end{itemize}
	\item Mutual information describes the amount of information that is shared among $x$ and $y$:
	\begin{equation*}
		\tcbox[nobeforeafter]{\(
			I(\bm{x};\bm{y}) = \text{KL}(p(\bm{x},\bm{y})||p(\bm{x})p(\bm{y})) = H(\bm{x})-H(\bm{x}|\bm{y}) = H(\bm{y}) - H(\bm{y}|\bm{x})
		\)}
	\end{equation*}
	In other words, how much information about $y$ do I get by observing $x$. In a diagram, mutual information can be visualized as follows:
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/information_theory_mutual_information.png}
		\caption{Visualizing the relationship between mutual information and entropy.}
	\end{figure}
\end{itemize}
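\begin{itemize}
	\item \underline{Example}: a small numeric sketch with two made-up binary distributions $p=(0.5, 0.5)$ and $q=(0.25, 0.75)$, chosen only for illustration. The entropies in bits are:
	$$H(p) = -2\cdot 0.5\log_2 0.5 = 1, \hspace{5mm} H(q) = -0.25\log_2 0.25 - 0.75\log_2 0.75 \approx 0.811$$
	The KL divergence (in nats) depends on the order of its arguments:
	\begin{equation*}
		\begin{split}
			\text{KL}(p||q) & = 0.5\ln\frac{0.5}{0.25} + 0.5\ln\frac{0.5}{0.75} \approx 0.144\\
			\text{KL}(q||p) & = 0.25\ln\frac{0.25}{0.5} + 0.75\ln\frac{0.75}{0.5} \approx 0.131
		\end{split}
	\end{equation*}
	Both values are non-negative, but they are not equal. Besides the missing triangle inequality, this lack of symmetry is a second reason why KL is not a distance measure.
\end{itemize}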


================================================
FILE: Machine_Learning_2/ml2_graphical_models.tex
================================================
\section{Probabilistic graphical models}
\begin{itemize}
	\item It is often beneficial to visualize a probabilistic model as a diagram, which we call a \textit{(probabilistic) graphical model}. 
	\item They are good for:
	\begin{itemize}
		\item Causal reasoning/modeling
		\item Performing inference and calculating conditional distributions efficiently 
		\item Designing and communicating statistical models
		\item Encoding (conditional) independence relations
	\end{itemize} 
	\item Note that there are often multiple ways to express the same probability distribution. For example, take a joint distribution $p(A,B,C)$, which we can either write as $p(A,B,C)=p(A)p(B|A)p(C|A,B)$ (see Figure~\ref{fig:graphical_models_example_1}) or $p(A,B,C)=p(C)p(A|C)p(B|A,C)$ (see Figure~\ref{fig:graphical_models_example_2}). Nevertheless, what we are interested in is the graphical representation with the least number of edges, as e.g. if $A$ and $B$ are independent (conditionally on $C$), we can drop the edge between those.
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.4\textwidth}
			\centering
			\tikz{ %
				\node[latent] (A) {$A$} ; %
				\node[latent, right=of A] (B) {$B$} ; %
				\node[latent, below=of B] (C) {$C$} ; %
				
				\edge{A}{B};
				\edge{A}{C};
				\edge{B}{C};
			}
			\caption{$p(A,B,C)=p(A)p(B|A)p(C|A,B)$}
			\label{fig:graphical_models_example_1}
		\end{subfigure}
		\hspace{10mm}
		\begin{subfigure}{0.4\textwidth}
			\centering
			\tikz{ %
				\node[latent] (A) {$A$} ; %
				\node[latent, right=of A] (B) {$B$} ; %
				\node[latent, below=of B] (C) {$C$} ; %
				
				\edge{A}{B};
				\edge{C}{A};
				\edge{C}{B};
			}
			\caption{$p(A,B,C)=p(C)p(A|C)p(B|A,C)$}
			\label{fig:graphical_models_example_2}
		\end{subfigure}
		\caption{Two different graphical models (here Bayesian Networks) for the same joint distribution $p(A,B,C)$.}
	\end{figure}
	\item We distinguish between directed acyclic graphs, which we call \textit{Bayesian networks} (BN), and undirected graphs, which are \textit{Markov Random Fields} (MRF)
\end{itemize}

\subsection{Bayesian Networks}
\begin{itemize}
	\item There is a simple way for creating a Bayesian network for a given statistical model.
	\begin{enumerate}
		\item Determine the ordering of the variables (``\textit{topological ordering}'')
		\item In this ordering, call the parents of the random variable $X_i$: $\text{pa}_i$, or $\text{pa}(X_i)$ which is a subset of variables with lower ordering: $\text{pa}_i \subseteq \left\{1,...,i-1\right\}$. The joint probability distribution can be written as:
		$$p(X_1,...,X_M) = \prod_{i=1}^{M}p(X_i|X_{\text{pa}_i})$$
		\item In the graphical model, draw an edge from $X_j$ to $X_i$ if $j\in \text{pa}_i$
	\end{enumerate}
	\item \underline{Example}: (first-order) Markov Chain
	\begin{itemize}
		\item The joint probability distribution of a Markov Chain can be expressed by:
		$$p(X_1,...,X_M)=p(X_1)\cdot \prod_{i=2}^{M} p(X_i|X_{i-1})$$
		\item The corresponding Bayesian Network looks as follows:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[obs] (x1) {$X_1$} ; %
				\node[obs, right=of x1] (x2) {$X_2$} ; %
				\node[obs, right=of x2] (x3) {$X_3$} ; %
				\node[const, right=of x3] (xetc1) { \hspace{2mm}...\hspace{2mm} } ; %
				\node[obs, right=of xetc1] (xM) {$X_M$} ; %
				
				\edge{x1}{x2};
				\edge{x2}{x3};
				\edge{x3}{xetc1};
				\edge{xetc1}{xM};
			}
		\end{figure}
	
		where the filling expresses that $X_i$ is an observed variable.
	\end{itemize}
	\item \underline{Example}: Regression
	\begin{itemize}
		\item Suppose we have a simple regression problem where we want to learn parameters $W$ to predict targets $T$ from input $X$. We further assume that we know the noise variance $\sigma^2$, and have a prior over $W$ with hyperparameters $\alpha$.
		\item 
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[obs] (x) {$X_n$} ; %
				\node[obs, right=of x] (t) {$T_n$} ; %
				\plate{xt}{(x)(t)}{$n=1,...,N$};
				
				\node[latent, above=of t] (w) {$W$} ; %
				\node[const, left=of w] (alpha) {$\alpha$} ; %
				\node[const, right=of t] (sigma) {$\sigma^2$} ; %
				
				\edge{x}{t};
				\edge{w}{t};
				\edge{alpha}{w};
				\edge{sigma}{t};
			}
		\end{figure}
	
		We can express this in the graphical model above, which represents the probability distribution
		$$p(W, \left\{T_n\right\}, \left\{X_n\right\}|\alpha, \sigma^2) = p(W|\alpha)\prod_{n=1}^{N} \left[p(T_n|X_n, W, \sigma^2)p(X_n)\right]$$
		Note that in the graphical model, $\alpha$ and $\sigma^2$ are assumed to be fixed and known, and the ``plate'' can be interpreted as copying its content $N$ times (i.e. we have $N$ variables $X_n$ and $T_n$ with the same edges).
		
		Also, if desired, we could have drawn the data points $X_n$ as constants as well, since these are often assumed to be fixed.
		\item If we also want to express the predictive distribution $p(T^{*}|X^{*},W, \left\{T_n\right\}, \left\{X_n\right\},\alpha, \sigma^2)$, we can extend our model as follows:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[obs] (x) {$X_n$} ; %
				\node[obs, right=of x] (t) {$T_n$} ; %
				\plate{xt}{(x)(t)}{$n=1,...,N$};
				
				\node[latent, above=of t] (w) {$W$} ; %
				\node[const, left=of w] (alpha) {$\alpha$} ; %
				\node[const, right=of t] (sigma) {$\sigma^2$} ; %
				
				\node[latent, right=of w] (tstar) {$T^{*}$} ; %
				\node[obs, right=of tstar] (xstar) {$X^{*}$} ; %
				
				
				\edge{x}{t};
				\edge{w}{t};
				\edge{w}{tstar};
				\edge{xstar}{tstar};
				\edge{alpha}{w};
				\edge{sigma}{t};
				\edge{sigma}{tstar};
			}
		\end{figure}
	\end{itemize}
\end{itemize}
\subsubsection{Conditional independence and D-separation}
\begin{itemize}
	\item A useful property of graphical models is that we can easily study the independence relations between random variables in our model. 
	\item We call $X$ and $Y$ independent iff $p(X,Y)=p(X)p(Y)$. The notation for this is $X\independent Y$
	\item $X$ is \textit{conditionally} independent of $Y$ given $Z$ if $p(X,Y|Z)=p(X|Z)p(Y|Z)$. The notation for this is $X\independent Y|Z$. Note that if $X$ and $Y$ are generally independent, we can also write $X\independent Y|\emptyset$
	\item For proving/testing conditional independence, we can use \textbf{d-separation}. Suppose $A$, $B$, $C$ are sets of variables. If $A$ is d-separated from $B$ given $C$ (written $A\perp B|C$), then $p(X_A,X_B|X_C)=p(X_A|X_C)p(X_B|X_C)$, i.e. $A\perp B|C\implies X_A\independent X_B|X_C$ 
	\item Note that the converse, $X_A\independent X_B|X_C\implies A\perp B|C$, does not always hold (although it mostly does), as we will show in a later example. Hence, if $A$ and $B$ are not d-separated, it does not necessarily mean that $X_A$ and $X_B$ are not conditionally independent.
	
	\item The algorithm can be summarized as follows:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=D-separation]
		Given the sets of variables $A$, $B$, $C$:
		\begin{enumerate}
			\item Consider all paths (sequence of nodes, connected by edges,  s.t. no node repeats) between any node in $A$ and any node in $B$
			\item Mark a path as \underline{blocked} by $C$ if
			\begin{enumerate}
				\item It contains a collider $...\rightarrow u \leftarrow ...$ such that $u$ is not in $C$ and $u$ is not an ancestor of a node in $C$
				\item It contains a non-collider $...\rightarrow u$, $...\rightarrow u \rightarrow ...$, $...\leftarrow u \leftarrow ...$, $...\leftarrow u \rightarrow ...$ such that $u$ is in $C$
			\end{enumerate}
			\item If all paths are marked as blocked by $C$, then $A$ is d-separated from $B$ given $C$
		\end{enumerate}
	\end{tcolorbox}	
	\item \underline{Examples}: 
	\begin{itemize}
		\item Consider the following graphical model:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[latent] (xA) {$X_A$} ; %
				\node[latent, right=of xA] (xB) {$X_B$} ; %
				\node[latent, below=of xB] (xC) {$X_C$} ; %
				
				\edge{xC}{xA};
				\edge{xC}{xB};
			}
		\end{figure}
	
		$A$ is d-separated from $B$ given $C$, as the only path from $B$ to $A$ goes through $X_C$, which is a non-collider in the conditioning set: $X_A\independent X_B|X_C$.
		
		Note that $A$ is not d-separated from $B$ given $\emptyset$: $X_C$ is a non-collider that is not in the conditioning set, so the path is not blocked.
		\item Consider the following graphical model:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[latent] (xA) {$X_A$} ; %
				\node[latent, right=of xA] (xC) {$X_C$} ; %
				\node[latent, right=of xC] (xB) {$X_B$} ; %
				
				\edge{xA}{xC};
				\edge{xC}{xB};
			}
		\end{figure}
	
		Similarly to the previous model, $A$ is d-separated from $B$ given $C$, as the only path from $B$ to $A$ goes through $X_C$, which is a non-collider in the conditioning set: $X_A\independent X_B|X_C$.
		
		However, here we can show a special case where conditional independence does not imply d-separation. Suppose that we model $p(C|A)=\delta_{C,A}$, i.e. a deterministic mapping. Now, $X_C\independent X_B|X_A$ holds because if we know $A$, we know $C$ for certain. Nevertheless, $C$ is not d-separated from $B$ given $A$ because there is a direct edge from $C$ to $B$! 
		
		\item Consider the following graphical model:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[latent] (xA) {$X_A$} ; %
				\node[latent, right=of xA] (xC) {$X_C$} ; %
				\node[latent, above=of xC] (xB) {$X_B$} ; %
				\node[const, right=of xC] (xetc) {\hspace{2mm}...\hspace{2mm} } ; %
				\node[latent, right=of xetc] (xD) {$X_D$} ; %
				
				\edge{xA}{xC};
				\edge{xB}{xC};
				\edge{xC}{xetc};
				\edge{xetc}{xD};
			}
		\end{figure}
	
		$A$ is d-separated from $B$ given the empty set, as $X_C$ is a collider and neither $X_C$ nor any of its descendants is in the conditioning set: $X_A\independent X_B|\emptyset$.
		
		$A$ is \textit{not} d-separated from $B$ given $C$, because the collider $X_C$ is now in the conditioning set and therefore no longer blocks the path: $A\not\perp B|C$.
		
		$A$ is \textit{not} d-separated from $B$ given $D$, because the collider $X_C$ is an ancestor of a node in $D$ and therefore does not block the path: $A\not\perp B|D$.
	\end{itemize}
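	\item As a sketch of why these graphical rules work (using only the factorizations of the example graphs above, not part of the original notes), we can verify two of the statements algebraically. For the fork $X_A\leftarrow X_C\rightarrow X_B$, conditioning on the non-collider $X_C$ gives
	$$p(X_A,X_B|X_C)=\frac{p(X_C)p(X_A|X_C)p(X_B|X_C)}{p(X_C)}=p(X_A|X_C)p(X_B|X_C)$$
	For the collider $X_A\rightarrow X_C\leftarrow X_B$ (ignoring the descendants of $X_C$, which sum out to one), marginalizing over the unobserved $X_C$ gives
	$$p(X_A,X_B)=\sum_{X_C}p(X_A)p(X_B)p(X_C|X_A,X_B)=p(X_A)p(X_B)$$
	Once $X_C$ is observed, the factor $p(X_C|X_A,X_B)$ can no longer be summed out and generally couples $X_A$ and $X_B$.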
\end{itemize}
\subsubsection{Markov blanket}
\begin{itemize}
	\item A Markov blanket of a variable $X_i$ is defined as the set of variables which are the parents, children or children's parents of $X_i$, except $X_i$ itself:
	$$\text{MB}(X_i)=\text{pa}_i \cup \text{ch}_i \cup \left(\text{pa}_{\text{ch}_i}\setminus i\right)$$
	\item The important property of the Markov blanket is that, for a random variable $X_i$ in any BN, given its Markov blanket $\text{MB}(X_i)$, it is conditionally independent of the rest of the graph:
	$$p\left(X_i|X_{\text{MB}(X_i)}, X_{\text{res}}\right) = p\left(X_i|X_{\text{MB}(X_i)}\right)$$
	\item \underline{Example}: For the graphical model of the regression problem, the Markov blanket of $T^{*}$ is $\text{MB}(T^{*})=\left\{X^{*}, W\right\}$. This result is intuitive as once we have trained our model, we do not need to revisit our data or our prior over $W$. Note that $\sigma^2$ is a constant, and hence not in the Markov blanket.
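	\item The Markov blanket property can be sketched directly from the BN factorization (a short derivation added for illustration; $X_{\setminus i}$ is shorthand introduced here for all variables except $X_i$). All factors that do not contain $X_i$ cancel between numerator and denominator:
	$$p\left(X_i|X_{\setminus i}\right)=\frac{\prod_{j}p(X_j|X_{\text{pa}_j})}{\sum_{X_i}\prod_{j}p(X_j|X_{\text{pa}_j})}=\frac{p(X_i|X_{\text{pa}_i})\prod_{j\in\text{ch}_i}p(X_j|X_{\text{pa}_j})}{\sum_{X_i}p(X_i|X_{\text{pa}_i})\prod_{j\in\text{ch}_i}p(X_j|X_{\text{pa}_j})}$$
	The remaining factors only involve $X_i$, its parents, its children and the children's other parents, i.e. exactly the variables in $\text{MB}(X_i)$.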
\end{itemize}
\subsection{Markov Random Fields}
\begin{itemize}
	\item A Markov Random Field is an undirected graphical model. Hence, our model now consists of two parts: the undirected graph $G$, and so-called (maximal) clique potentials $\left\{\psi_A\right\}$
	\item A clique in an undirected graph $G$ is a fully connected subset of nodes. Hence, single nodes are also considered cliques.
	\begin{itemize}
		\item A clique is \textit{maximal} if there is no clique that strictly contains it, i.e. we cannot add another node to the clique which is fully connected to all others. 
		\item \underline{Example}: Consider the following graphical model:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[latent] (xA) {$X_A$} ; %
				\node[latent, right=of xA] (xB) {$X_B$} ; %
				\node[latent, below=of xA] (xC) {$X_C$} ; %
				\node[latent, right=of xC] (xD) {$X_D$} ; %
				\node[latent, right=of xD] (xE) {$X_E$} ; %
				\node[latent, left=of xA] (xF) {$X_F$} ; %
				
				\edge[-]{xA}{xC};
				\edge[-]{xA}{xB};
				\edge[-]{xC}{xD};
				\edge[-]{xB}{xD};
				\edge[-]{xB}{xE};
				\edge[-]{xE}{xD};
			}
		\end{figure}
	
		Then our maximal cliques are $\{X_A,X_C\}, \{X_A, X_B\}, \{X_C,X_D\},\{X_B,X_D,X_E\},\{X_F\}$
	\end{itemize}
	\item We can now write our joint probability distribution in terms of the potentials $\left\{\psi_A\right\}$ of the maximal cliques:$$p(x_1,...,x_N)=\frac{1}{Z}\prod_A \psi_A(x_A)$$
	Note that we now need a normalization constant $Z$, which we did not need for Bayesian networks. The reason for this is that clique potentials might not be normalized. The only requirement is that they are positive for any $x_A$, which is why they are often modeled by an energy function $\psi_A(x_A) = \exp(f(x_A))$ (hence the name \textit{potential})
	\item For the previous example, our probability distribution can now be written as:
	$$p(x_A,...,x_F)=\frac{1}{Z}\psi_{A,B}(x_A,x_B)\psi_{A,C}(x_A,x_C)\psi_{C,D}(x_C,x_D)\psi_{B,D,E}(x_B,x_D,x_E)\psi_{F}(x_F)$$
	where $Z=\sum_{x_A}\sum_{x_B}...\sum_{x_F}\psi_{A,B}(x_A,x_B)\psi_{A,C}(x_A,x_C)...\psi_{F}(x_F)$
	\item One disadvantage of undirected graphs, as we can see here, is that we need to calculate $Z$, a sum whose number of terms grows exponentially with the number of variables.
	\item We can also define the properties of separation and Markov blanket for undirected graphs:
	\begin{description}
		\item[Separation] Similarly to d-separation in BNs, two subsets of nodes $A$ and $B$ are \underline{separated} given $C$ if each path between a node in $A$ and a node in $B$ passes through at least one node in $C$:
		$$A\perp B|C\implies X_A\independent X_B|X_C$$
		\item[Markov blanket] The Markov blanket for MRFs is defined as the neighbors of $i$, i.e. the nodes adjacent to $i$. In the previous example, the Markov blanket of $X_B$ is: $\text{MB}(X_B) = \{X_A,X_D,X_E\}$
	\end{description}
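	\item \underline{Example} (a minimal sketch with a hypothetical potential and weight $w$, for illustration only): consider two binary variables $x_1,x_2\in\{0,1\}$ with a single clique potential $\psi_{1,2}(x_1,x_2)=\exp(w\cdot\delta(x_1=x_2))$. Then
	$$Z=\sum_{x_1}\sum_{x_2}\psi_{1,2}(x_1,x_2)=2e^{w}+2,\hspace{5mm}p(x_1=x_2)=\frac{2e^{w}}{2e^{w}+2}=\frac{e^{w}}{e^{w}+1}$$
	For $w=0$ the distribution is uniform, and for $w>0$ states with equal values become more likely. With $d$ binary variables, the sum for $Z$ would already run over $2^d$ states.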
\end{itemize}
\subsubsection{Converting Bayesian networks to MRFs}
\begin{itemize}
	\item Sometimes we want to represent a statistical model, which we have as a Bayesian network, also as an MRF. This is the case when we want to apply algorithms which are generally defined for undirected graphs (e.g. sum-product)
	\item The \textit{Hammersley-Clifford} theorem states that any strictly positive joint distribution $p(\bm{X})> 0$ can be represented as an MRF. Hence, we can also do it with any Bayesian network
	\item Nevertheless, note that by converting a BN to a MRF, some properties/information might be lost, such as (conditional) independence relations. 
	\item \underline{Examples}:
	\begin{itemize}
		\item Consider a first-order Markov chain:
		\begin{figure}[ht!]
			\centering
			\begin{subfigure}{0.46\textwidth}
				\centering
				\tikz{ %
					\node[obs] (x1) {$x_1$} ; %
					\node[obs, right=of x1] (x2) {$x_2$} ; %
					\node[const, right=of x2] (xetc) {\hspace{2mm}...\hspace{2mm} } ; %
					\node[obs, right=of xetc] (xM) {$x_M$} ; %
					
					\edge{x1}{x2};
					\edge{x2}{xetc};
					\edge{xetc}{xM};
				}
				\caption{BN}
			\end{subfigure}
			\hspace{5mm}
			\begin{subfigure}{0.46\textwidth}
				\centering
				\tikz{ %
					\node[obs] (x1) {$x_1$} ; %
					\node[obs, right=of x1] (x2) {$x_2$} ; %
					\node[const, right=of x2] (xetc) {\hspace{2mm}...\hspace{2mm} } ; %
					\node[obs, right=of xetc] (xM) {$x_M$} ; %
					
					\edge[-]{x1}{x2};
					\edge[-]{x2}{xetc};
					\edge[-]{xetc}{xM};
				}
				\caption{MRF}
			\end{subfigure}
		\end{figure}
	
		As Bayesian network, we can represent it with the probability density function $p(x_1)\prod_{i=2}^{M}p(x_i|x_{i-1})$.
		
		In the case of the MRF, we have $\frac{1}{Z}\prod_{i=2}^{M}\psi_{i-1,i}(x_{i-1},x_i)$. Note that the prior $\psi_1(x_1)$ is integrated in $\psi_{1,2}(x_1,x_2)$ as the clique potentials are more flexible than the conditional probabilities in Bayesian networks.
		
		\item Consider the following Bayesian network:
		\begin{figure}[ht!]
			\centering
			\begin{subfigure}{0.25\textwidth}
				\centering
				\tikz{ %
					\node[latent] (xA) {$X_A$} ; %
					\node[latent, right=of xA] (xC) {$X_C$} ; %
					\node[latent, below=of xC] (xB) {$X_B$} ; %
					\node[latent, below=of xA] (xD) {$X_D$} ; %
					
					\edge{xA}{xC};
					\edge{xB}{xC};
					\edge{xD}{xA};
				}
				\caption{BN}
			\end{subfigure}
			\hspace{5mm}
			\begin{subfigure}{0.25\textwidth}
				\centering
				\tikz{ %
					\node[latent] (xA) {$X_A$} ; %
					\node[latent, right=of xA] (xC) {$X_C$} ; %
					\node[latent, below=of xC] (xB) {$X_B$} ; %
					\node[latent, below=of xA] (xD) {$X_D$} ; %
					
					\edge[-]{xA}{xC};
					\edge[-]{xB}{xC};
					\edge[-]{xD}{xA};
				}
				\caption{(Potential) MRF}
				\label{fig:graphical_models_BN_to_MRF_2}
			\end{subfigure}
			\hspace{5mm}
			\begin{subfigure}{0.25\textwidth}
				\centering
				\tikz{ %
					\node[latent] (xA) {$X_A$} ; %
					\node[latent, right=of xA] (xC) {$X_C$} ; %
					\node[latent, below=of xC] (xB) {$X_B$} ; %
					\node[latent, below=of xA] (xD) {$X_D$} ; %
					
					\edge[-]{xA}{xC};
					\edge[-]{xB}{xC};
					\edge[-]{xA}{xB};
					\edge[-]{xD}{xA};
				}
				\caption{MRF via moralization}
				\label{fig:graphical_models_BN_to_MRF_3}
			\end{subfigure}
			\caption{Comparing different conversions from BN to MRF}
		\end{figure}
	
		In this BN, $X_A\independent X_B$, and (typically) $X_A\not\independent X_B|X_C$. If we just replace the directed edges by undirected ones (see Figure~\ref{fig:graphical_models_BN_to_MRF_2}), we can no longer read off the independence $X_A\independent X_B$ from the graph. Furthermore, we would need to design the potentials in a way that captures $p(X_C|X_A,X_B)$ correctly, although there is no clique containing $X_A$, $X_B$ and $X_C$.
		
		The easiest way is transforming BNs by \textbf{moralization} (see Murphy, chapter 20.3). For each node, we ``marry the parents'', i.e. we add an edge between them if it does not exist already. By that, we ensure that we can express all maximal clique potentials by the conditional probabilities of the Bayesian network. For example, see the MRF in Figure~\ref{fig:graphical_models_BN_to_MRF_3} which we obtained via moralization. The clique potentials are now simply: $\psi_{A,D}(X_A,X_D)=p(X_D)p(X_A|X_D)$, $\psi_{A,B,C}(X_A,X_B,X_C)=p(X_C|X_A,X_B)p(X_B)$
		
		\item There are also MRFs which cannot be fully modeled by a Bayesian network. Consider for example the following graphical model:
		\begin{figure}[ht!]
			\centering
			\begin{subfigure}{0.25\textwidth}
				\centering
				\tikz{ %
					\node[latent] (xA) {$X_A$} ; %
					\node[latent, right=of xA] (xC) {$X_C$} ; %
					\node[latent, below=of xC] (xB) {$X_B$} ; %
					\node[latent, below=of xA] (xD) {$X_D$} ; %
					
					\edge[-]{xA}{xC};
					\edge[-]{xB}{xC};
					\edge[-]{xD}{xA};
					\edge[-]{xD}{xB};
				}
				\caption{MRF}
				\label{fig:graphical_models_MRF_to_BN_MRF}
			\end{subfigure}
			\hspace{5mm}
			\begin{subfigure}{0.25\textwidth}
				\centering
				\tikz{ %
					\node[latent] (xA) {$X_A$} ; %
					\node[latent, right=of xA] (xC) {$X_C$} ; %
					\node[latent, below=of xC] (xB) {$X_B$} ; %
					\node[latent, below=of xA] (xD) {$X_D$} ; %
					
					\edge{xA}{xC};
					\edge{xB}{xC};
					\edge{xD}{xA};
					\edge{xD}{xB};
				}
				\caption{(Potential) BN}
				\label{fig:graphical_models_MRF_to_BN}
			\end{subfigure}
		\end{figure}
	
		The MRF models the following independence relations: $C\perp D|\{A,B\}$, $A\perp B|\{C,D\}$. If we now want to model the same in a BN, we get into trouble. For the model in Figure~\ref{fig:graphical_models_MRF_to_BN}, we have $C\perp D|\{A,B\}$, but $A\not\perp B|\{C,D\}$: on the path $A\rightarrow C\leftarrow B$, $X_C$ is a collider that is in the conditioning set $\{C,D\}$, so it does not block this path.
	\end{itemize}
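	\item As a quick check of the ``marry the parents'' example (a sketch based on the clique potentials given for Figure~\ref{fig:graphical_models_BN_to_MRF_3}), the product of the two potentials recovers the joint distribution of the BN with $Z=1$:
	$$\psi_{A,D}(X_A,X_D)\cdot\psi_{A,B,C}(X_A,X_B,X_C)=p(X_D)p(X_A|X_D)\cdot p(X_B)p(X_C|X_A,X_B)=p(X_A,X_B,X_C,X_D)$$
	The distribution itself still satisfies $X_A\independent X_B$, but this can no longer be read off the undirected graph since $A$ and $B$ are now adjacent.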
\end{itemize}
\subsubsection{Factor graphs}
\begin{itemize}
	\item The third form of graphical models are Factor graphs. The idea is to represent the connections between variables by their factors in a bipartite graph. Hence, we have two sets of nodes: variable nodes, and factor nodes.
	\item Consider the statistical model $p(X_A,X_B,X_C)=p(X_A)p(X_B)p(X_C|X_A,X_B)$. The factor graph representation of this is:
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[latent] (xA) {$X_A$} ; %
			\node[latent, right=of xA] (xB) {$X_B$} ; %
			\node[latent, right=of xB] (xC) {$X_C$} ; %
			
			\factor[above=of xA] {f1} {$p(A)$} {} {} ;
			\factor[above=of xB] {f2} {$p(B)$} {} {} ;
			\factor[above=of xC] {f3} {$p(C|A,B)$} {} {} ;
			
			\factoredge[-]{xA}{f1}{} ;	
			\factoredge[-]{xB}{f2}{} ;	
			\factoredge[-]{xA}{f3}{} ;	
			\factoredge[-]{xB}{f3}{} ;	
			\factoredge[-]{xC}{f3}{} ;	
		}
	\end{figure}
	\item Similarly, for Markov Random Fields such as in Figure~\ref{fig:graphical_models_MRF_to_BN_MRF}, the factor graph representation is:
	
	
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[latent] (xA) {$X_A$} ; %
			\node[latent, right=of xA] (xC) {$X_C$} ; %
			\node[latent, right=of xC] (xB) {$X_B$} ; %
			\node[latent, right=of xB] (xD) {$X_D$} ; %
			
			\factor[above=of xA] {f1} {$\psi_{A,C}$} {} {} ;
			\factor[above=of xC] {f2} {$\psi_{C,B}$} {} {} ;
			\factor[above=of xB] {f3} {$\psi_{B,D}$} {} {} ;
			\factor[above=of xD] {f4} {$\psi_{D,A}$} {} {} ;
			\factor[right=of f4] {f0} {$\frac{1}{Z}$} {} {} ;
			
			\factoredge[-]{xA}{f1}{} ;	
			\factoredge[-]{xC}{f1}{} ;	
			\factoredge[-]{xC}{f2}{} ;	
			\factoredge[-]{xB}{f2}{} ;	
			\factoredge[-]{xD}{f3}{} ;	
			\factoredge[-]{xB}{f3}{} ;	
			\factoredge[-]{xA}{f4}{} ;	
			\factoredge[-]{xD}{f4}{} ;	
		}
	\end{figure}
	
	where we could have merged $\frac{1}{Z}$ into any other factor if wanted.
	\item Note that in contrast to MRFs, factor graphs do not require us to take the \textit{maximal} cliques. Hence, for the same statistical model, there exist different factor graphs.
\end{itemize}
\subsection{Learning in graphical models}
\begin{itemize}
	\item One of the applications of graphical models is to learn the conditional probabilities/potentials they model. We will first discuss the learning process for Bayesian networks, and afterwards do the same for MRFs
\end{itemize}
\subsubsection{Learning in Bayesian networks}
\begin{itemize}
	\item Suppose that we replace all conditionals $p(x_i|\text{pa}_i)$ by a learnable function $\theta_i(x_i, \text{pa}_i)$ (equal to $f_i$ with parameters $\theta_i$) with the constraint of $\sum_{x_i} \theta_i(x_i, \text{pa}_i) = 1$.
	\item The likelihood of a dataset can then be written as:
	$$p(\left\{\tilde{x}_{in}\right\}|\bm{\theta}) = \prod_{i=1}^{d} \prod_{n=1}^{N}\theta_i(\tilde{x}_{in}, \tilde{x}_{\text{pa}_i,n}) = \prod_{i=1}^{d} \prod_{n=1}^{N} \prod_{x_i} \prod_{x_{\text{pa}_i}} \theta_i(x_i, x_{\text{pa}_i})^{\delta(x_i=\tilde{x}_{in})\delta(x_{\text{pa}_i}=\tilde{x}_{\text{pa}_i,n})}$$
	\item Our objective to optimize is the log likelihood with the Lagrange multipliers:
	\begin{equation*}
		\begin{split}
			\mathcal{L}(\bm{\theta}, \bm{X}, \bm{\lambda}) & = \sum_{i=1}^{d} \sum_{n=1}^{N} \sum_{x_i} \sum_{x_{\text{pa}_i}} \delta(x_i=\tilde{x}_{in})\cdot \delta(x_{\text{pa}_i}=\tilde{x}_{\text{pa}_i,n})\cdot \ln \theta_i(x_i, x_{\text{pa}_i}) - \sum_{i=1}^{d} \sum_{x_{\text{pa}_i}} \lambda_{i,x_{\text{pa}_i}}\left(\sum_{x_i} \theta_i(x_i,x_{\text{pa}_i}) - 1\right)\\
			& = \sum_{i=1}^{d} \sum_{x_i} \sum_{x_{\text{pa}_i}} N(x_i, x_{\text{pa}_i})\cdot \ln \theta_i({x}_{i}, x_{\text{pa}_i}) - \sum_{i=1}^{d} \sum_{x_{\text{pa}_i}} \lambda_{i,x_{\text{pa}_i}}\left(\sum_{x_i} \theta_i(x_i,x_{\text{pa}_i}) - 1\right)\\
		\end{split}
	\end{equation*}
	where $N(x_i, x_{\text{pa}_i})$ represent a counter of how often the combination of values for $x_i$ and $x_{\text{pa}_i}$ co-occur
	\item By taking the derivative, we get the following solution:
	\begin{equation*}
		\begin{split}
			\frac{\partial \mathcal{L}(\bm{\theta}, \bm{X}, \bm{\lambda})}{\partial \theta_i(x_i, x_{\text{pa}_i})} & = \frac{N(x_i, x_{\text{pa}_i})}{\theta_i(x_i, x_{\text{pa}_i})} - \lambda_{i,x_{\text{pa}_i}} \overset{!}{=} 0\\
			\Leftrightarrow \theta_i(x_i, x_{\text{pa}_i}) & = \frac{N(x_i, x_{\text{pa}_i})}{\lambda_{i,x_{\text{pa}_i}}}\\[8pt]
			\frac{\partial \mathcal{L}(\bm{\theta}, \bm{X}, \bm{\lambda})}{\partial \lambda_{i,x_{\text{pa}_i}}} & = -\left(\sum_{x_i} \theta_i(x_i, x_{\text{pa}_i}) - 1\right) \overset{!}{=} 0\\
			\Leftrightarrow \sum_{x_i} \frac{N(x_i, x_{\text{pa}_i})}{\lambda_{i,x_{\text{pa}_i}}} & = 1\\
			\Leftrightarrow \lambda_{i,x_{\text{pa}_i}} & = \sum_{x_i} N(x_i, x_{\text{pa}_i}) = N(x_{\text{pa}_i})\\[8pt]
			\implies \theta_i(x_i, x_{\text{pa}_i}) & = \frac{N(x_i, x_{\text{pa}_i})}{N(x_{\text{pa}_i})}
		\end{split}
	\end{equation*}
	Hence, the optimal conditionals $\theta_i(x_i=a, x_{\text{pa}_i}=b)$ are simply the relative frequency with which we have seen $x_i=a$ among the data points with $x_{\text{pa}_i}=b$.
	\item For each conditional $\theta_i$, the optimum solely depends on $x_i$ and its parents, and not over all variables as we had before (log-likelihood has the sum over all variables). This is why learning in Bayesian networks is fast and efficient
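	\item \underline{Example} (hypothetical counts, for illustration only): consider a BN $X_1\rightarrow X_2$ with binary variables and $N=10$ data points, of which $4$ have $x_1=1$. Among those $4$, $x_2=1$ occurs $3$ times. As $X_1$ has no parents, its counts are normalized by $N$. The maximum likelihood estimates are then
	$$\theta_1(x_1=1)=\frac{N(x_1=1)}{N}=\frac{4}{10},\hspace{5mm}\theta_2(x_2=1,x_1=1)=\frac{N(x_2=1,x_1=1)}{N(x_1=1)}=\frac{3}{4},\hspace{5mm}\theta_2(x_2=0,x_1=1)=\frac{1}{4}$$
	so that $\sum_{x_2}\theta_2(x_2,x_1=1)=1$, as required by the constraint.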
\end{itemize}
\subsubsection{Learning in Markov Random Fields}
\begin{itemize}
	\item In the case of MRFs, we are interested in learning the potential $\psi_A$ for all cliques. With a similar rewriting with the indicator function, we get the log-likelihood of the data as:
	$$\mathcal{L}(\bm{\psi}, \bm{X}) = \sum_{n=1}^{N} \sum_A \sum_{x_A} \delta(x_A=\tilde{x}_{A,n})\ln \psi_A(x_A) - N\ln Z = \sum_A \sum_{x_A} N(x_A)\ln \psi_A(x_A) - N\ln Z$$
	\item The derivative of the log-likelihood is:
	\begin{equation*}
		\begin{split}
			\frac{\partial \mathcal{L}(\bm{\psi}, \bm{X})}{\partial \psi_A(x_A)} = \frac{N(x_A)}{\psi_A(x_A)} - \frac{N}{\psi_A(x_A)}\E_{\psi}[\delta(x_A=\cdot)] \overset{!}{=} 0
		\end{split}
	\end{equation*}
	where $\E_{\psi}[\delta(x_A=\cdot)]$ is the expected fraction of observations of $x_A$ under the model defined by all potentials $\bm{\psi}$. In other words, it is the probability mass that the model assigns to all states in which the clique takes the value $x_A$ (with all other variables arbitrary). This term comes from our normalization constant $Z$, which of course depends on our potentials.
	\item The optimum is therefore found when the observed fraction of $x_A$ in the dataset is equal to the expected fraction under the model. To find a solution, we can use sampling to approximate the expectation:
	$$\E_{\psi}[\delta(x_A=\cdot)] \approx \frac{N_{\psi}(x_A)}{N_{\psi}}$$
	where $N_{\psi}$ is the sample size, and $N_{\psi}(x_A)$ the number of times we observed $x_A=\tilde{x}_A$ during sampling.
	\item Hence, the (approximated) optimality condition is:
	\begin{equation*}
		\begin{split}
			\frac{\partial \mathcal{L}(\bm{\psi}, \bm{X})}{\partial \psi_A(x_A)} & = \frac{N(x_A)}{\psi_A(x_A)} - \frac{N}{\psi_A(x_A)}  \frac{N_{\psi}(x_A)}{N_{\psi}}\overset{!}{=} 0\\
			\Leftrightarrow \frac{N(x_A)}{N} & = \frac{N_{\psi}(x_A)}{N_{\psi}}\\
		\end{split}
	\end{equation*}
	As $\psi_A(x_A)$ cancels out, this condition cannot be solved for $\psi_A$ in closed form. Instead, we use a multiplicative fixed-point update (known as iterative proportional fitting):
	$$\psi_A^{\text{new}}(x_A) = \psi_A^{\text{old}}(x_A)\cdot \frac{N(x_A)/N}{N_{\psi}(x_A)/N_{\psi}}$$
	To interpret the result: if $x_A$ occurs less often in our samples than in our dataset, we increase $\psi_A(x_A)$. Otherwise, we reduce it. We repeat this update, re-sampling with the new potentials each time, until convergence.
\end{itemize}
\subsection{Inference in graphical models}
\begin{itemize}
	\item Another important aspect of graphical models is efficient inference. Here, our goal is to either marginalize out variables, or to condition on some observed variables and calculate the posterior distribution of the others
	\item Let's first consider again a first-order Markov chain. Its joint probability distribution can be expressed by:
	$$p(x_1,...,x_d)=\frac{1}{Z}\psi_{1,2}(x_1,x_2)\cdot \psi_{2,3}(x_2,x_3)\cdot ... \cdot \psi_{d-1,d}(x_{d-1},x_d)$$
	\item Now suppose we want to calculate the marginal distribution $p(x_j)$. We can do this by marginalizing over all other variables:
	\begin{equation*}
		\begin{split}
			p(x_j) & = \sum_{x_1}...\sum_{x_{j-1}}\sum_{x_{j+1}}...\sum_{x_d} p(x_1,...,x_d)\\
			& = \sum_{x_1}...\sum_{x_{j-1}}\sum_{x_{j+1}}...\sum_{x_d} \frac{1}{Z}\psi_{1,2}(x_1,x_2)\cdot \psi_{2,3}(x_2,x_3)\cdot ... \cdot \psi_{d-1,d}(x_{d-1},x_d)\\
		\end{split}
	\end{equation*}
	Now, we can move the sums as the potentials only contain two variables, and are hence independent of all others:
	\begin{equation*}
		\begin{split}
		p(x_j) & = \frac{1}{Z} \sum_{x_1}\sum_{x_2}...\sum_{x_{j-1}}\sum_{x_{j+1}}...\sum_{x_d}  \psi_{1,2}(x_1,x_2)\cdot \psi_{2,3}(x_2,x_3)\cdot ... \cdot \psi_{d-1,d}(x_{d-1},x_d)\\
		& =  \frac{1}{Z}\sum_{x_1}\sum_{x_2}...\sum_{x_{j-1}} \psi_{1,2}(x_1,x_2)\cdot ...\psi_{j-1,j}(x_{j-1},x_j) \cdot \\&\hspace{23mm}\sum_{x_{j+1}} \psi_{j,j+1}(x_{j},x_{j+1}) \sum_{x_{j+2}} \psi_{j+1,j+2}(x_{j+1},x_{j+2}) ... \sum_{x_{d}} \psi_{d-1,d}(x_{d-1},x_d)\\
		& = \frac{1}{Z} \underbrace{\sum_{x_{j-1}} \psi_{j-1,j}(x_{j-1},x_{j}) \sum_{x_{j-2}} \psi_{j-2,j-1}(x_{j-2},x_{j-1}) ... \sum_{x_{1}} \psi_{1,2}(x_{1},x_2)}_{\mu_{\alpha}(x_j)} \cdot \\&\hspace{8mm}\underbrace{\sum_{x_{j+1}} \psi_{j,j+1}(x_{j},x_{j+1}) \sum_{x_{j+2}} \psi_{j+1,j+2}(x_{j+1},x_{j+2}) ... \sum_{x_{d}} \psi_{d-1,d}(x_{d-1},x_d)}_{\mu_{\beta}(x_j)}\\
		& = \frac{1}{Z}\mu_{\alpha}(x_j)\mu_{\beta}(x_j)
		\end{split}
	\end{equation*}
	\item This shows us that we can split the marginal into two separate parts, $\mu_{\alpha}$ and $\mu_{\beta}$, which are both recursive functions:
	$$\mu_{\alpha}(x_j) = \sum_{x_{j-1}} \psi_{j-1,j}(x_{j-1},x_j) \mu_{\alpha}(x_{j-1})$$ 
	$$\mu_{\beta}(x_j) = \sum_{x_{j+1}}\psi_{j,j+1}(x_{j},x_{j+1}) \mu_{\beta}(x_{j+1})$$ 
	The recursions start at the ends of the chain with $\mu_{\alpha}(x_1)=1$ and $\mu_{\beta}(x_d)=1$.
	We can view these recursive functions also as \textit{messages} which are passed across the chain. $\mu_{\alpha}(x_{j})$ and $\mu_{\beta}(x_{j})$ would be then the incoming messages of $x_j$, and $\mu_{\alpha}(x_{j+1})$ and $\mu_{\beta}(x_{j-1})$ the outgoing messages.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/graphical_models_sum_product_message_passing.pdf}
		\caption{Message passing in graphical models (Bishop 8.38)}
	\end{figure}
	\item Even the normalization term $Z$ can be expressed by the messages at an arbitrary node $x_j$: $Z=\sum_{x_j}\mu_{\alpha}(x_j)\mu_{\beta}(x_j)$
	\item The benefit of this recursive implementation is that we do not have to take a joint sum over $d$ variables ($\mathcal{O}(K^{d})$), but only sums over two variables at a time ($\mathcal{O}(K^{2}d)$), where $K$ is the number of states per variable. Hence, the computational time scales linearly with the number of nodes instead of exponentially.
	
	Furthermore, we can share calculations between nodes, as the same messages can be re-used. Thus, calculating the marginals for all variables reduces to $\mathcal{O}(2\cdot K^2{d})=\mathcal{O}(K^2{d})$
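	\item \underline{Example} (a small numerical sketch with hypothetical potentials, not from the lecture): take a chain of $d=3$ binary variables with
	$$\psi_{1,2}(x_1,x_2)=\begin{pmatrix}1 & 2\\ 3 & 4\end{pmatrix},\hspace{5mm}\psi_{2,3}(x_2,x_3)=\begin{pmatrix}1 & 1\\ 1 & 3\end{pmatrix}$$
	where rows index the first and columns the second argument. The two messages arriving at $x_2$, written as vectors over $x_2\in\{0,1\}$, are
	$$\mu_{\alpha}(x_2)=\sum_{x_1}\psi_{1,2}(x_1,x_2)=(4,6),\hspace{5mm}\mu_{\beta}(x_2)=\sum_{x_3}\psi_{2,3}(x_2,x_3)=(2,4)$$
	so that $Z=4\cdot 2+6\cdot 4=32$ and $p(x_2)=\frac{1}{32}(8,24)=(0.25,0.75)$.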
\end{itemize}
\subsubsection{Sum-product algorithm}
\begin{itemize}
	\item The message passing idea is not limited to a Markov chain, but can be applied to any graph. For simplicity, we focus here on \textit{trees}, i.e. graphs where all nodes are connected, but without any loops/cycles, independent of the direction of the edges.
	\item In our discussion, we will focus on factor graphs as they represent the most general form of graphical models. Furthermore, cycles in MRFs can often be resolved in a factor graph, so that we have fewer problems obtaining a tree-structured graph
	\item Now we will send messages from variables to factors, and from factors to variables. As a result, we get the marginalization for all variables. This algorithm is called \textbf{sum-product} algorithm as we only take sums and products
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.2\textwidth]{figures/graphical_models_sum_product_messages_factor.png}
		\caption{Message passing to and from a factor node (Bishop 8.47)}
	\end{figure}
	\item The messages can be calculated as follows where we start at leaf nodes, and recursively go to the center of the tree:
	\begin{equation*}
	\tcbox[nobeforeafter]{\(
		\begin{split}
			\textbf{Factor$\to$Variable:} & \hspace{2mm} \mu_{\alpha\to i}(x_i)=\sum_{\bm{x}_{\alpha\setminus i}} f_{\alpha}(\bm{x}_{\alpha})\prod_{j\in \alpha\setminus i}\mu_{j\to\alpha}(x_j)\\
			& \text{If $\alpha$ leaf node:} \hspace{2mm}\mu_{\alpha\to i}(x_i)=f_{\alpha}(x_i)\\
			\textbf{Variable$\to$Factor:} & \hspace{2mm} \mu_{j\to \alpha}(x_j)=\prod_{\beta\in \text{ne}(j)\setminus \alpha}\mu_{\beta\to j}(x_j)\\
			&\text{If $j$ leaf node:} \hspace{2mm}\mu_{j\to \alpha}(x_j)=1
		\end{split}
		\)}
	\end{equation*}
	The marginalizations/beliefs are in the end:
	\begin{equation*}
	\tcbox[nobeforeafter]{\(
		\begin{split}
		\textbf{Variable belief:} & \hspace{2mm} p(x_i)=\frac{1}{Z}\prod_{\alpha\in\text{ne}(i)}\mu_{\alpha\to i}(x_i)\hspace{5mm}\text{where}\hspace{2mm}Z=\sum_{x_i}\prod_{\alpha\in\text{ne}(i)}\mu_{\alpha\to i}(x_i)\\
		\textbf{Factor belief:} & \hspace{2mm} p(\bm{x}_{\alpha})=\frac{1}{Z}f_{\alpha}(\bm{x}_\alpha) \prod_{i \in\text{ne}(\alpha)}\mu_{i \to\alpha}(x_i)\\
		\end{split}
		\)}
	\end{equation*}
	where a factor belief is the joint marginal of the variables with a direct edge to the factor, i.e. all other variables are marginalized out.
	\item The complexity of this algorithm scales with $\mathcal{O}(EK^{M})$ where $E$ is the number of edges, $M$ the maximum number of variables that are connected to a factor, and $K$ the maximum domain size
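	\item \underline{Example} (a sketch for the factor graph of $p(X_A)p(X_B)p(X_C|X_A,X_B)$ from the factor graph section; the factor names $f_1,f_2,f_3$ are chosen here for illustration): the leaf factors send $\mu_{f_1\to A}(x_A)=p(x_A)$ and $\mu_{f_2\to B}(x_B)=p(x_B)$. As $X_A$ and $X_B$ have no other neighboring factors besides $f_3$, they simply forward these messages: $\mu_{A\to f_3}(x_A)=p(x_A)$ and $\mu_{B\to f_3}(x_B)=p(x_B)$. The message to $X_C$ is then
	$$\mu_{f_3\to C}(x_C)=\sum_{x_A}\sum_{x_B}p(x_C|x_A,x_B)\,\mu_{A\to f_3}(x_A)\,\mu_{B\to f_3}(x_B)=\sum_{x_A}\sum_{x_B}p(x_A,x_B,x_C)=p(x_C)$$
	which is already the marginal of $X_C$ with $Z=\sum_{x_C}p(x_C)=1$, as expected for a normalized Bayesian network.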
	\item Note that this algorithm is only exact on trees or forests (groups of trees). For general graphs, we can first bring them into the shape of a tree by e.g. \textbf{variable elimination}
	\begin{itemize}
		\item Given an MRF or factor graph, we marginalize out those variable nodes which cause a loop in our graphical model. Let's for example consider this model:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[latent] (xA) {$X_A$} ; %
				\node[latent, right=of xA] (xB) {$X_B$} ; %
				\node[latent, below=of xA] (xC) {$X_C$} ; %
				\node[latent, right=of xC] (xD) {$X_D$} ; %
				\node[latent, right=of xD] (xE) {$X_E$} ; %
				
				\edge[-]{xA}{xB};
				\edge[-]{xB}{xD};
				\edge[-]{xC}{xD};
				\edge[-]{xE}{xD};
				\edge[-]{xE}{xB};
			}
		\end{figure}
		\item We can do this by simply writing down the joint probability distribution and determining an order of variables $X_1,...,X_M$ where the last variables, e.g. $X_M$, are those which are eliminated (ideally the ones that are easiest to marginalize). We then sort the sums according to the selected order.
		
		In our example, we want to eliminate $X_E$ as it has the fewest connections among the variables in the loop. Summing the unnormalized joint over all variables gives the normalization constant $Z$, and the sum order is therefore:
		\begin{equation*}
			\hspace{-10mm}Z = \sum_{X_B}\sum_{X_A} \psi_{A,B}(X_A, X_B)\sum_{X_D}\psi_{B,D}(X_B, X_D) \sum_{X_C}\psi_{C,D}(X_C, X_D)\sum_{X_E} \psi_{B,E}(X_B, X_E)\psi_{D,E}(X_D, X_E)
		\end{equation*}
		\item Finally, replace the marginalized terms by a new factor, and remove the node from the graph. In the example, we would introduce $\tau(X_B, X_D) = \sum_{X_E} \psi_{B,E}(X_B, X_E)\psi_{D,E}(X_D, X_E)$. This can be merged into the potential $\psi_{B,D}$, hence not adding any new edges to the remaining graph. Our final graph is a tree again, and looks as follows:
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\node[latent] (xA) {$X_A$} ; %
				\node[latent, right=of xA] (xB) {$X_B$} ; %
				\node[latent, below=of xA] (xC) {$X_C$} ; %
				\node[latent, right=of xC] (xD) {$X_D$} ; %
				
				\edge[-]{xA}{xB};
				\edge[-]{xB}{xD};
				\edge[-]{xC}{xD};
			}
		\end{figure}
	\end{itemize}
	\item The sum-product algorithm so far only calculated marginals. To condition on observed variables, we can use the same algorithm, but simply add additional ``hard evidence'' factor nodes, which are nothing else than a hard prior that $x_j$ has the value $\xi_j$:
	$$f_{\xi_j}(x_j)=\delta(x_j=\xi_j)$$
	Then, if we apply the sum-product algorithm on the extended graph again, the normalized beliefs are the conditionals $p(x_i|x_j=\xi_j)$
\end{itemize}
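As a small worked illustration of these message updates (the factor values are assumed for this example and are not taken from the lecture), consider two binary variables $x_1,x_2$ with a leaf factor $f_a(x_1)$ and a pairwise factor $f_b(x_1,x_2)$. We write messages as vectors over the values $0$ and $1$:
\begin{equation*}
	f_a(x_1)=(0.6, 0.4),\hspace{5mm} f_b(0,0)=0.9,\hspace{2mm} f_b(0,1)=0.1,\hspace{2mm} f_b(1,0)=0.2,\hspace{2mm} f_b(1,1)=0.8
\end{equation*}
Passing messages from the leaf factor $f_a$ towards $x_2$ gives:
\begin{equation*}
	\begin{split}
		\mu_{a\to 1}(x_1) & = f_a(x_1) = (0.6, 0.4)\\
		\mu_{1\to b}(x_1) & = \mu_{a\to 1}(x_1) = (0.6, 0.4)\\
		\mu_{b\to 2}(x_2) & = \sum_{x_1} f_b(x_1,x_2)\mu_{1\to b}(x_1) = (0.9\cdot 0.6 + 0.2\cdot 0.4,\hspace{1mm} 0.1\cdot 0.6+0.8\cdot 0.4) = (0.62, 0.38)
	\end{split}
\end{equation*}
so that $p(x_2)=(0.62,0.38)$ with $Z=1$. If we additionally observe $x_2=1$ via a hard evidence factor, the message back to $x_1$ becomes $\mu_{b\to 1}(x_1)=\sum_{x_2}f_b(x_1,x_2)\delta(x_2=1)=(0.1,0.8)$, and the belief is:
\begin{equation*}
	p(x_1|x_2=1)=\frac{1}{Z}\mu_{a\to 1}(x_1)\mu_{b\to 1}(x_1) = \frac{1}{0.38}(0.06, 0.32)\approx (0.158, 0.842)
\end{equation*}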
\subsubsection{Max-sum algorithm}
\begin{itemize}
	\item The sum-product algorithm calculates the full marginal distribution $p(x_j)$, but sometimes, we just want to know the most likely value of $x_j$, especially if we set some variables to observed states
	\item As it turns out, we can use a very similar algorithm for this, but simply replace sums by maximum operators, and products by sums.
	\item First, let's consider what the optimum is in the general case:
	$$\bm{x}^{*}=\arg\max_{\bm{x}} \prod_{\alpha} f_{\alpha}(\bm{x}_{\alpha})$$
	where we can ignore the normalization constant. Furthermore, to simplify optimization, we can apply the log:
	$$\bm{x}^{*} = \arg\max_{\bm{x}} \sum_{\alpha} \ln f_{\alpha}(\bm{x}_{\alpha})$$
	\item Our messages passed across the graph are then as follows:
	\begin{equation*}
	\tcbox[nobeforeafter]{\(
		\begin{split}
		\textbf{Factor$\to$Variable:} & \hspace{2mm} \nu_{\alpha\to i}(x_i)=\max_{\bm{x}_{\alpha\setminus i}} \left[\ln f_{\alpha}(\bm{x}_{\alpha}) + \sum_{j\in \alpha\setminus i}\nu_{j\to\alpha}(x_j)\right]\\
		& \text{If $\alpha$ leaf node:} \hspace{2mm}\nu_{\alpha\to i}(x_i)=\ln f_{\alpha}(x_i)\\
		\textbf{Variable$\to$Factor:} & \hspace{2mm} \nu_{j\to \alpha}(x_j)=\sum_{\beta\in \text{ne}(j)\setminus \alpha}\nu_{\beta\to j}(x_j)\\
		&\text{If $j$ leaf node:} \hspace{2mm}\nu_{j\to \alpha}(x_j)=0
		\end{split}
		\)}
	\end{equation*}
	\item The maximum beliefs/marginals are:
	\begin{equation*}
	\tcbox[nobeforeafter]{\(
		\begin{split}
		\textbf{Max-marginals:} & \hspace{2mm} q_i(x_i)=\sum_{\alpha\in\text{ne}(i)}\nu_{\alpha\to i}(x_i)\\
		\end{split}
		\)}
	\end{equation*}
	\item In the case that $q_i(x_i)$ has a unique maximum, we can simply take the argmax to get the optimum: $x^{*}_i = \arg\max_{x_i} q_i(x_i)$
	
	If this is not the case, we need to run the Viterbi algorithm (Bishop 8.4.5, we do not go into detail for this in the exam) to get the global optimum. The general idea is that some optima might depend on each other. For example, if we have something similar to an XOR, $(x_0=0, x_1=1)$ and $(x_0=1,x_1=0)$ are the optima, but just looking at the independent max-marginals, $(x_0=0,x_1=0)$ and $(x_0=1,x_1=1)$ would also appear to be optima. We can prevent this by slightly extending the messages passed around: each factor additionally stores which values achieved the maximum, so that we can backtrack a consistent configuration.
\end{itemize}
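As a sketch of how the max-sum messages work in practice (with factor values assumed purely for illustration), take two binary variables with a leaf factor $f_a(x_1)=(0.6,0.4)$ and a pairwise factor with $f_b(0,0)=0.9$, $f_b(0,1)=0.1$, $f_b(1,0)=0.2$, $f_b(1,1)=0.8$. Writing messages as vectors over the values $0$ and $1$, and using that the leaf variable $x_2$ sends $\nu_{2\to b}(x_2)=0$:
\begin{equation*}
	\begin{split}
		\nu_{a\to 1}(x_1) & = \ln f_a(x_1) = (\ln 0.6, \ln 0.4)\\
		\nu_{b\to 1}(x_1) & = \max_{x_2} \ln f_b(x_1,x_2) = (\ln 0.9, \ln 0.8)\\
		q_1(x_1) & = \nu_{a\to 1}(x_1)+\nu_{b\to 1}(x_1) = (\ln 0.54, \ln 0.32)
	\end{split}
\end{equation*}
Hence $x_1^{*}=0$, and the maximizing value of $x_2$ for $x_1=0$ is $x_2^{*}=0$. This agrees with enumerating all four products $f_a(x_1)f_b(x_1,x_2)$, which are $0.54$, $0.06$, $0.08$ and $0.32$.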

================================================
FILE: Machine_Learning_2/ml2_graphical_models.tex.recover.bak~
================================================
\section{Probabilistic graphical models}
\begin{itemize}
	\item It is often beneficial to visualize a probabilistic model as a diagram, which we call \textit{(probabilistic) graphical models}. 
	\item They are good for:
	\begin{itemize}
		\item causal reasoning/modeling
		\item performing inference and calculating conditional distributions efficiently
		\item designing and communicating statistical models
		\item encoding independence relations
	\end{itemize} 
	\item Note that there are often multiple ways to express the same probability distribution. For example, take a joint distribution $p(A,B,C)$, which we can either write as $p(A,B,C)=p(A)p(B|A)p(C|A,B)$ (see Figure~\ref{}) or $p(A,B,C)=p(C)p(A|C)p(B|A,C)$ (see Figure~\ref{}). Nevertheless, what we are ultimately interested in is the graphical representation with the least number of edges, as e.g. if $A$ and $B$ are independent, we can drop the edge between those.
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.4\textwidth}
			\centering
			\tikz{ %
				\node[latent] (A) {$A$} ; %
				\node[latent, right=of A] (B) {$B$} ; %
				\node[latent, below=of B] (C) {$C$} ; %
				
				\edge{A}{B};
				\edge{A}{C};
				\edge{B}{C};
			}
			\caption{$p(A,B,C)=p(A)p(B|A)p(C|A,B)$}
		\end{subfigure}
		\hspace{10mm}
		\begin{subfigure}{0.4\textwidth}
			\centering
			\tikz{ %
				\node[latent] (A) {$A$} ; %
				\node[latent, right=of A] (B) {$B$} ; %
				\node[latent, below=of B] (C) {$C$} ; %
				
				\edge{A}{B};
				\edge{C}{A};
				\edge{C}{B};
			}
			\caption{$p(A,B,C)=p(C)p(A|C)p(B|A,C)$}
			\label{fig:graphical_models_example_2}
		\end{subfigure}
		\caption{Two different graphical models for the same joint distribution $p(A,B,C)$.}
	\end{figure}
	\item We distinguish between directed acyclic graphs, which we call \textit{Bayesian networks} (BN), and undirected graphs, which are \textit{Markov Random Fields} (MRF)
\end{itemize}

================================================
FILE: Machine_Learning_2/ml2_sampling_methods.tex
================================================
\section{Sampling methods}
\begin{itemize}
	\item In the previous chapter, we have seen that we can perform inference by approximating the posterior distribution. However, as an alternative, we can also consider Monte Carlo techniques, i.e. sampling
	\item In most practical cases, we want to evaluate expectations over the posterior. We can approximate these by an average over samples:
	$$\E_{p(\bm{x})}\left[f(\bm{x})\right] \approx \frac{1}{N}\sum_{n=1}^{N} f(\bm{x}_n), \hspace{2mm} \bm{x}_n\sim p(\bm{x})$$
	This can be for example used for prediction:
	$$p(y^{*}|x^{*}) = \int p(y^{*}\vert x^{*}, \bm{\theta})p(\bm{\theta}|\bm{X}, \bm{Y})d\bm{\theta} = \E_{p(\bm{\theta}|\bm{X}, \bm{Y})}\left[p(y^{*}\vert x^{*}, \bm{\theta})\right] \approx \frac{1}{K}\sum_{k=1}^{K} p(y^{*}\vert x^{*}, \bm{\theta}^{(k)}), \hspace{2mm} \bm{\theta}^{(k)}\sim p(\bm{\theta}|\bm{X}, \bm{Y})$$
	\item Hence, we will look at different techniques for sampling from more complex distributions than standard Gaussians
\end{itemize}
\subsection{Regular Sampling}
\begin{itemize}
	\item As an introduction, we look at how we can sample from simple, known distributions, and verify the correctness of the Monte Carlo approximation
	\item Assume we draw $N$ samples from $p(z)$: $z_i\sim p(z)$, $i=1,...,N$
	\item We can calculate $\E\left[f\right]\approx \widehat{\E\left[f\right]}=\left<f\right> = \frac{1}{N}\sum_{i=1}^{N}f(z_i)$ for approximating the expectation. Note that we introduced here the different notations for the Monte Carlo approximation
	\item To verify that our approximation makes sense, we first check whether we have an \textit{unbiased} estimator:
	$$\E\left[\left<f\right>\right] = \E\left[\frac{1}{N}\sum_{i=1}^{N}f(z_i)\right]=\frac{1}{N}\sum_{i=1}^{N}\E\left[f(z_i)\right]=\E\left[f(z)\right]$$
	\item Furthermore, we would like the variance to go to zero with an infinite amount of samples. As the samples are independent, we get:
	$$\mathbb{V}\text{ar}\left[\left<f\right>\right]=\frac{1}{N^2}\sum_{i=1}^{N}\mathbb{V}\text{ar}\left[f(z_i)\right] = \frac{1}{N}\mathbb{V}\text{ar}\left[f\right]$$
	Hence, the variance is reduced by a factor of $N$ compared to a single sample, and vanishes for $N\to\infty$.
	\item As a last part, we will look into how we can sample from any \textit{known} distribution, given a uniform sampler
	\begin{description}
		\item[Discrete random variables]  In case of a discrete random variable $z\in\left\{1,2,...,K\right\}$, and given the distribution $p(z)$, we first have to calculate the cumulative distribution function (CDF): $p(z\leq \zeta)$. Then, given $u_i\sim U(0,1)$, the sample is the value $k$ for which $p(z\leq k-1)\leq  u_i < p(z\leq k)$
		\item[Continuous random variables] For continuous variables, we need to calculate the CDF by an integral:
		$$F(\zeta)=\int_{-\infty}^{\zeta} p(z)dz = p(z\leq \zeta)$$
		and take its inverse $F^{-1}$. Then, given $u_i\sim U(0,1)$, our sample is $z_i=F^{-1}(u_i)$, which corresponds to a change of variables.
	\end{description}
\end{itemize}
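As a short worked example of the inverse-CDF approach (the exponential distribution is chosen here only for illustration), take $p(z)=\lambda\exp(-\lambda z)$ with $z\geq 0$. The CDF and its inverse are:
\begin{equation*}
	\begin{split}
		F(\zeta) & = \int_{0}^{\zeta}\lambda\exp(-\lambda z)dz = 1-\exp(-\lambda\zeta)\\
		u = F(z) & \hspace{2mm}\Rightarrow\hspace{2mm} z = F^{-1}(u) = -\frac{1}{\lambda}\ln(1-u)
	\end{split}
\end{equation*}
Hence, drawing $u_i\sim U(0,1)$ and setting $z_i=-\frac{1}{\lambda}\ln(1-u_i)$ yields samples from the exponential distribution.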
\subsection{Rejection sampling}
\begin{itemize}
	\item Assume we have a probability density $p(z)$ from which we want to sample. Choose another distribution $q(z)$, called \textit{proposal} distribution, from which we can sample. Further constraints are that the unnormalized distributions $\tilde{p}(z)\propto p(z), \tilde{q}(z)\propto q(z)$ fulfill:
	$$\int \tilde{q}(z)dz < \infty, \tilde{p}(z)\leq \tilde{q}(z) \forall z$$
	\item Now, we can generate samples from $p(z)$ by a simple algorithm:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=Pseudocode for rejection sampling]
		\begin{algorithm}[H]
			\SetAlgoLined
			\For{$n=1,...,N$}{
				\While{No sample for $z_n$ accepted}{
					Sample $\hat{z}$ from $q(z)$\;
					Sample $u\sim U(0,1)$\;
					\eIf{$u<\frac{\tilde{p}(\hat{z})}{\tilde{q}(\hat{z})}$}{Accept sample $z_n=\hat{z}$\;}{Reject sample $\hat{z}$ and re-sample\;}
				}
			}
			Return samples $\left\{z_n\right\}_{n=1}^{N}$\;
		\end{algorithm}
	\end{tcolorbox}
	\item The principle of rejection sampling is visualized in Figure~\ref{fig:sampling_rejection_sampling}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/sampling_rejection_sampling.png}
		\caption{Visualizing rejection sampling. In the figure, $\tilde{q}(z)$ is denoted as $kq(z)$, and $x_0$ is the initial sample from $q$.}
		\label{fig:sampling_rejection_sampling}
	\end{figure}
	\item To show that we actually generate samples from $p(z)$, we can write down the probability for a value $z_i$ to be picked. First, the chance of $z_i$ being generated in the first place is $q(z_i)$. Next, the chance of $z_i$ being accepted is $\frac{\tilde{p}(z_i)}{\tilde{q}(z_i)}$. Together, we get the probability of $z_i$:
	$$\hat{p}(z_i) = q(z_i)\frac{\tilde{p}(z_i)}{\tilde{q}(z_i)} \propto p(z_i)$$
	Hence, we actually generate samples from $p(z)$ although we initially sample from $q(z)$
	\item One requirement for rejection sampling to work well is that the area between $\tilde{p}(z)$ and $\tilde{q}(z)$ is small. The efficiency of this sampler can be measured by the acceptance rate, which is $\E_{z_i\sim q}\left[\frac{\tilde{p}(z_i)}{\tilde{q}(z_i)} \right]$. If this value is low, it means that a lot of samples are rejected, hence the sampling process takes longer. This gets especially critical in higher dimensions as we need to make sure that \textit{for all} $z_i$, $\tilde{q}(z_i)$ is greater than or equal to $\tilde{p}(z_i)$. Finding a simple distribution in high dimensions that fulfills this requirement is often not trivial
\end{itemize}
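To make the acceptance rate more concrete, we can write it in terms of the normalization constants $Z_p=\int\tilde{p}(z)dz$ and $Z_q=\int\tilde{q}(z)dz$ (symbols introduced here for illustration):
\begin{equation*}
	\E_{z\sim q}\left[\frac{\tilde{p}(z)}{\tilde{q}(z)}\right] = \int \frac{\tilde{q}(z)}{Z_q}\cdot\frac{\tilde{p}(z)}{\tilde{q}(z)}dz = \frac{Z_p}{Z_q}
\end{equation*}
i.e. the ratio of the areas under $\tilde{p}$ and $\tilde{q}$. As a standard example of the problem in high dimensions (see e.g. Bishop, chapter 11), assume $p(\bm{z})=\mathcal{N}(\bm{z}|\bm{0},\sigma_p^2\bm{I})$ and $q(\bm{z})=\mathcal{N}(\bm{z}|\bm{0},\sigma_q^2\bm{I})$ in $D$ dimensions with $\sigma_q>\sigma_p$. The smallest $k$ with $kq(\bm{z})\geq p(\bm{z})$ for all $\bm{z}$ is attained at $\bm{z}=\bm{0}$:
\begin{equation*}
	k = \frac{p(\bm{0})}{q(\bm{0})} = \left(\frac{\sigma_q}{\sigma_p}\right)^{D}, \hspace{5mm}\text{acceptance rate} = \frac{1}{k} = \left(\frac{\sigma_p}{\sigma_q}\right)^{D}
\end{equation*}
Even for $\sigma_q=1.01\,\sigma_p$, we get for $D=1000$ an acceptance rate of $1.01^{-1000}\approx 4.8\cdot 10^{-5}$.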
\subsection{Importance sampling}
\begin{itemize}
	\item Another approach for estimating an expectation is not generating actual samples from $p$, but simply weight samples by their \textit{importance}
	\item Again, we need two distributions: $p$ over which we want to determine the expectation, and $q$ that we actually sample from. Note that for this algorithm to work, neither of these distributions needs to be normalized when computing the weights. The only requirements are that we can sample from the normalized density of $q$, and that $q(z)>0$ wherever $p(z)f(z)\neq 0$.
	\item The pseudo-code for importance sampling is fairly simple and straightforward:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=Pseudocode for importance sampling]
		\begin{algorithm}[H]
			\SetAlgoLined
			Sample $\left\{z_n\right\}_{n=1}^{N}$ from $q(z)$\;
			Calculate the weights $w_n=\frac{p(z_n)}{q(z_n)}$\;
			Determine expectation by $\E_p[f]\approx \frac{\sum_n w_n f(z_n)}{
			\sum_n w_n}$\;
		\end{algorithm}
	\end{tcolorbox}
	\item The intuition of importance sampling can be shown when plugging in the two distributions:
	$$\E_p[f]=\int p(z)f(z)dz=\frac{\int q(z)\frac{p(z)}{q(z)}f(z)dz}{\underbrace{\int q(z)\frac{p(z)}{q(z)}dz}_{=1}} = \frac{\E_q\left[w_z\cdot f(z)\right]}{\E_q\left[w_z\right]} \approx \frac{\sum_i w_i f(z_i)}{\sum_i w_i}, \hspace{2mm} z_i\sim q(z)$$
	\item Although importance sampling can use all samples, it has two major drawbacks:
	\begin{itemize}
		\item The estimate of $\E[f]$ is \underline{not} unbiased. Imagine we sample only a single time ($N=1$ in the pseudo-code). As the weight cancels out, we end up with $f(z_1)$ where $z_1$ is sampled from $q$, and not $p$! This bias decreases with the number of samples, but should always be kept in mind as an accurate estimate might require more samples, especially when it occurs together with the second drawback.
		\item The importance weighting estimate has a high variance if $p$ differs strongly from $q$, especially when $p(z_i)\gg q(z_i)$ as the weight is very high for this data point. Imagine $q$ is a Gaussian with a large variance, while $p$ is a peaked Gaussian around 0. For most of the samples, $p(z)\ll q(z)$, and hence their weight is low. But for a small set of points, namely those close to 0, $p(z)$ is much greater than $q(z)$, resulting in a high weight. If we now sample e.g. 100 times, all points except those around 0 are neglected due to their low weight. And as those important points are rarely sampled from $q$, we need many samples to reduce the variance. This problem is even more severe in high-dimensional spaces.
	\end{itemize}
	\item Furthermore, note that importance sampling can only be used for approximating an expectation, and not for generating independent samples from $p$
\end{itemize}
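The claim that neither distribution needs to be normalized can be verified with a short derivation. Assume (as notation for this sketch) $p(z)=\tilde{p}(z)/Z_p$, $q(z)=\tilde{q}(z)/Z_q$ and unnormalized weights $\tilde{w}(z)=\tilde{p}(z)/\tilde{q}(z)$. Then:
\begin{equation*}
	\begin{split}
		\E_q\left[\tilde{w}(z)f(z)\right] & = \int \frac{\tilde{q}(z)}{Z_q}\cdot\frac{\tilde{p}(z)}{\tilde{q}(z)}f(z)dz = \frac{Z_p}{Z_q}\int p(z)f(z)dz = \frac{Z_p}{Z_q}\E_p\left[f\right]\\
		\E_q\left[\tilde{w}(z)\right] & = \int \frac{\tilde{q}(z)}{Z_q}\cdot\frac{\tilde{p}(z)}{\tilde{q}(z)}dz = \frac{Z_p}{Z_q}
	\end{split}
\end{equation*}
The unknown ratio $Z_p/Z_q$ cancels when dividing the first line by the second, which is why the estimate normalizes by the sum of the weights.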
\subsection{Ancestral Sampling}
\begin{itemize}
	\item Assume we are given a Bayesian network, and want to sample from the joint probability. We can write the joint probability as:
	$$p(\bm{z})=\prod_{i=1}^{d} p(z_i\vert z_{\text{pa}(i)})$$
	where we use a topological ordering $z_1,...,z_d$ with $j<i$ if $j\in \text{pa}(i)$
	\item Now we can simply sample from the joint distribution by sampling from each of the conditionals, in the topological ordering. For example, assume we have the following Bayesian network:
	
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[latent] (z1) {$z_1$} ; %
			\node[latent, right=of z1] (z3) {$z_3$} ; %
			\node[latent, above=of z3] (z2) {$z_2$} ; %
			\node[latent, below=of z3] (z4) {$z_4$} ; %
			\node[latent, right=of z2] (z5) {$z_5$} ; %
			\node[latent, right=of z4] (z6) {$z_6$} ; %
			
			\edge{z1}{z2};
			\edge{z1}{z3};
			\edge{z1}{z4};
			\edge{z2}{z5};
			\edge{z3}{z5};
			\edge{z3}{z6};
			\edge{z2}{z6};
			\edge{z4}{z6};
		}
	\end{figure}

	Then we can sample as follows:
	\begin{equation*}
		\begin{split}
			\tilde{z}_1 & \sim p(z_1)\\
			\tilde{z}_2 & \sim p(z_2|z_1=\tilde{z}_1)\\
			\tilde{z}_3 & \sim p(z_3|z_1=\tilde{z}_1)\\
			\tilde{z}_4 & \sim p(z_4|z_1=\tilde{z}_1)\\
			\tilde{z}_5 & \sim p(z_5|z_2=\tilde{z}_2,z_3=\tilde{z}_3)\\
			\tilde{z}_6 & \sim p(z_6|z_2=\tilde{z}_2,z_3=\tilde{z}_3,z_4=\tilde{z}_4)\\
		\end{split}
	\end{equation*}
	
	\item Note that for sampling from each of the individual conditionals, we can use the other techniques like inverse-CDF or rejection sampling. 
	
	% \item \TODO{Question: lecture notes say that it works badly with high dimensions, but other sources say it works well. Why would it not work in high dimensions?}
\end{itemize}
\subsection{Markov-Chain Monte Carlo}
\begin{itemize}
	\item Given a target distribution $p(x)$ that we want to sample from, we set up a Markov chain such that the distribution of $x_N$ converges to $p(x)$ as $N\to\infty$, i.e. such that $p(x)$ is its equilibrium distribution
	
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[latent] (x1) {$x_1$} ; %
			\node[latent, right=of x1] (x2) {$x_2$} ; %
			\node[latent, right=of x2] (x3) {$x_3$} ; %
			\node[const, right=of x3] (xetc1) { \hspace{2mm}...\hspace{2mm} } ; %
			\node[latent, right=of xetc1] (xN) {$x_N$} ; %
			\node[const, right=of xN] (xetc2) { \hspace{2mm}...\hspace{2mm} } ; %
			\node[latent, right=of xetc2] (xInfty) {$x_{\infty}$} ; %
			
			\edge{x1}{x2};
			\edge{x2}{x3};
			\edge{x3}{xetc1};
			\edge{xetc1}{xN};
			\edge{xN}{xetc2};
			\edge{xetc2}{xInfty};
		}
	\end{figure}
	\item Note that as MCMC runs ancestral sampling on a chain, two consecutive samples are no longer independent. Still, we can assume that the dependency between two fairly distant samples is negligible
	\item We start sampling $x_1$ from an initial distribution $q(x_1)$ which can be chosen arbitrarily (but the closer $q$ to $p$, the faster the convergence and hence, faster sampling). For the next samples, we use a \underline{transition kernel} $T$ (which we have to specify/find by ourselves) to get $x_2\sim T(x_2|x_1)$
	\item The transition kernel is usually independent of time (most efficient), but can be extended to multiple steps (e.g. $x_t$ is sampled by $T_1$, $x_{t+1}$ from $T_2$, and $x_{t+2}$ again from $T_1$)
	\item The marginal distribution can be determined by:
	$$q(x_{n}) = \int q(x_{n-1})T(x_{n}|x_{n-1})dx_{n-1}$$
	and the joint probability for all $x_1,...,x_N$ is:
	$$q(x_1,...,x_N) = q(x_1)\prod_{i=2}^{N}T(x_i|x_{i-1})$$
	\item The equilibrium is reached when $x_N\sim p(x)$ implies $x_{N+1}\sim p(x)$, i.e. when the transition leaves $p(x)$ unchanged:
	$$p(x_{N+1}) = \int T(x_{N+1}|x_N)p(x_N)dx_N$$ 
	In this case, $p(x)$ is called an invariant of the chain. Note that a chain can have multiple invariants: for example, if $T$ is the identity transformation, any distribution is an invariant of that chain
	\item A sufficient but not necessary condition of an invariant $p(x)$ is that it satisfies the property of \underline{detailed balance}:
	$$p(x_t)T(x_{t+1}|x_t) = p(x_{t+1})T(x_t|x_{t+1})$$
	This property can be interpreted as the Markov chain being reversible. Although detailed balance is not required, it is usually the easiest property to enforce when designing a kernel
	\item Another property we are looking for is \underline{ergodicity}. A sufficient condition for ergodicity is that every state can be reached from every other state with positive probability in a single step, i.e. $T(x'|x)>0$ for all $x,x'$.
	\item If we have an ergodic Markov chain and $p^{\star}(x)$ being an invariant, then $p^{\star}(x)$ is a unique equilibrium, where for any start distribution $q(x_0)$, the distribution $q(x_N)$ with $N\to\infty$ converges to the required distribution $p^{\star}(x)$.
	\item The sketch of the sampling process is:
	
	\begin{algorithm}[H]
		Sample initial state $x_0$ from $q(x_0)$\;
		\For{$t=0,...,N-1$}{
			Sample $x_{t+1}\sim T(x_{t+1}|x_{t})$\;
		}
		Output $x_N$ as sample of $p^{\star}(x)$\;
	\end{algorithm}
	$N$ has to be large enough for the chain to converge, and is often referred to as the \textit{burn-in} time. 
	
	In addition, note that the required $N$ can be reduced by reducing the distance between the initial distribution $q$ and $p$. Thus, for generating multiple samples, we can take the first sample $x_N$, and continue the chain for $M$ additional steps, where usually $M\ll N$ (but $M$ must be large enough for the samples to be approximately independent, which depends on $T$). Then, $x_{N+M}$ is a new sample from $p$.
	
	\item The problem of MCMC is how we can find a transition kernel $T$ for a distribution $p$. Once we have found two kernels that leave $p$ invariant, we can combine those into new kernels that also leave $p$ invariant:
	\begin{equation*}
		\begin{split}
			T_3 & = \alpha T_1 + (1 - \alpha) T_2, \hspace{2mm}(\alpha \in [0,1])\\
			T_3 & = T_2 \circ T_1 \hspace{5mm}(\text{composition, first apply $T_1$, then $T_2$})
		\end{split}
	\end{equation*}
	\item One way to overcome this problem is the Metropolis-Hastings algorithm, which uses a form of rejection sampling to turn (almost) any proposal kernel into a valid transition kernel $T$
\end{itemize}
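The statement that detailed balance is sufficient for invariance follows from a short calculation, sketched here in the notation from above. Integrating both sides of the detailed balance equation over $x_t$ gives:
\begin{equation*}
	\int p(x_t)T(x_{t+1}|x_t)dx_t = \int p(x_{t+1})T(x_t|x_{t+1})dx_t = p(x_{t+1})\underbrace{\int T(x_t|x_{t+1})dx_t}_{=1} = p(x_{t+1})
\end{equation*}
which is exactly the invariance condition, as $T(\cdot|x_{t+1})$ is a normalized distribution.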
\subsubsection{Metropolis-Hastings algorithm}
\begin{itemize}
	\item Choose a proposal transition kernel $Q(x_{t+1}|x_{t})$ (e.g. a random walk). Then we can sample from $p^{\star}(x)$ as follows:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=Pseudocode for Metropolis-Hastings algorithm]
		\begin{algorithm}[H]
			\SetAlgoLined
			Sample initial state $x_0$ from $q(x_0)$\;
				\For{$t=0,...,N-1$}{
				Sample $\tilde{x}_{t+1}\sim Q(\tilde{x}_{t+1}|x_{t})$\;
				Compute acceptance probability $\alpha(\tilde{x}_{t+1}|x_t) = \min\left(1, \frac{p^{\star}(\tilde{x}_{t+1}) Q(x_t|\tilde{x}_{t+1})}{p^{\star}(x_t)Q(\tilde{x}_{t+1}|x_{t})}\right)$\;
				Sample $u_t\sim U(0,1)$\;
				\eIf{$u_t \leq \alpha(\tilde{x}_{t+1}|x_t) $}{accept sample $x_{t+1}=\tilde{x}_{t+1}$\;}{reject sample and stay at current state: $x_{t+1}=x_t$\;}
			}
			Output $x_N$ as sample of $p^{\star}(x)$\;
		\end{algorithm}
	\end{tcolorbox}	
	We can view the combination of $Q$ and acceptance probability $\alpha$ as our transition kernel $T=\alpha\circ Q$.
	\item Given this simple algorithm, we can design the transition kernel $Q$ with minimal knowledge of $p^{\star}$. For example, a kernel which mostly works well is to combine larger and smaller random walk steps. This can be very helpful in high-dimensional space as it can explore in different scales, and hence, in contrast to rejection and importance sampling, still works well in high dimensions
	\item Nonetheless, keep in mind that the performance now depends on the transition kernel $Q$. If it is designed poorly, many samples are rejected and we again end up with an inefficient sampling process. For example, if we have a highly multi-modal distribution, we need to have large enough steps to be able to jump between modes. But as said before, this is much less knowledge we need of $p^{\star}$ compared to the other discussed sampling algorithms
	\item We can sketch the proof here for the detailed balance. If a new sample $x_{t+1}$ is accepted, it has the probability:
	\begin{equation*}
		\begin{split}
			p^{\star}(x_t)T(x_{t+1}|x_{t}) & = p^{\star}(x_t)Q(x_{t+1}|x_{t})\min\left(1, \frac{p^{\star}(x_{t+1}) Q(x_t|x_{t+1})}{p^{\star}(x_t)Q(x_{t+1}|x_{t})}\right)\\
			& = \min\left(p^{\star}(x_t)Q(x_{t+1}|x_{t}), p^{\star}(x_{t+1}) Q(x_t|x_{t+1})\right)\\
			& = p^{\star}(x_{t+1})Q(x_{t}|x_{t+1})\min\left(1, \frac{p^{\star}(x_{t}) Q(x_{t+1}|x_{t})}{p^{\star}(x_{t+1})Q(x_{t}|x_{t+1})}\right)\\
			& = p^{\star}(x_{t+1})T(x_{t}|x_{t+1})
		\end{split}
	\end{equation*}
	As the expression is symmetric in $x_t$ and $x_{t+1}$, detailed balance holds, and hence $p^{\star}$ is an invariant of the Markov chain.
	
	In case $\tilde{x}_{t+1}$ was rejected, we stay at $x_t$, i.e. $x_{t+1}=x_t$, for which detailed balance holds trivially.
\end{itemize}
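Two simplifications of the acceptance probability are worth writing out (a sketch; the unnormalized density $\tilde{p}$ with $p^{\star}(x)=\tilde{p}(x)/Z$ is notation assumed here). First, the normalization constant cancels in the ratio, so we only need to evaluate $\tilde{p}$. Second, for a symmetric proposal with $Q(\tilde{x}|x)=Q(x|\tilde{x})$, such as a Gaussian random walk $Q(\tilde{x}|x)=\mathcal{N}(\tilde{x}|x,\sigma^2\bm{I})$, the proposal terms cancel as well:
\begin{equation*}
	\alpha(\tilde{x}_{t+1}|x_t) = \min\left(1, \frac{\tilde{p}(\tilde{x}_{t+1})/Z\cdot Q(x_t|\tilde{x}_{t+1})}{\tilde{p}(x_t)/Z\cdot Q(\tilde{x}_{t+1}|x_{t})}\right) = \min\left(1, \frac{\tilde{p}(\tilde{x}_{t+1})}{\tilde{p}(x_t)}\right)
\end{equation*}
Hence, moves to states with higher density are always accepted, while moves to states with lower density are accepted with probability equal to the density ratio.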
\subsubsection{Gibbs sampling}
\begin{itemize}
	\item Gibbs sampling is a special case of the Metropolis-Hastings algorithm, where the acceptance probability is always 1. The idea is that we cannot easily sample from a big joint distribution, but sampling a single variable given all others is feasible/much simpler. 
	\item Hence, we sample an $M$-dimensional vector $(x_1,x_2,...,x_M)$ by sampling from the conditional distributions of a single variable $x_i$ where we keep all other variables fixed. In pseudo-code, we can define Gibbs sampling as:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=Pseudocode for Gibbs sampling]
		\begin{algorithm}[H]
			\SetAlgoLined
			Choose an initial state $\left\{x_i:i=1,...,M\right\}$\;
				\For{$t=0,...,N-1$}{
				Sample $x_1^{(t+1)}\sim p(x_1|x_2^{(t)}, x_3^{(t)},...,x_M^{(t)})$\;
				Sample $x_2^{(t+1)}\sim p(x_2|x_1^{(t+1)}, x_3^{(t)},...,x_M^{(t)})$\;
				... \\
					Sample $x_M^{(t+1)}\sim p(x_M|x_1^{(t+1)}, x_2^{(t+1)},...,x_{M-1}^{(t+1)})$\;
			}
			Output $\left\{x_1^{(N)},...,x_M^{(N)}\right\}$ as sample of $p(x_1,...,x_M)$\;
		\end{algorithm}
	\end{tcolorbox}	
	\item Clearly, $p(\bm{x})$ is an invariant of the Markov chain because at each step, we sample from the correct conditional $p(x_i|\bm{x}_{\setminus i})$. But note that detailed balance for a full sweep can only be guaranteed if the order of the $x_i$'s is randomized in every iteration.
	\item For ergodicity, we just need to make sure that the conditionals are not zero for any point. Otherwise, we have to prove ergodicity explicitly.
	\item The acceptance probability of a sample according to the Metropolis-Hastings algorithm is in case of Gibbs sampling:
	$$\alpha(\bm{x}^{(t+1)}|\bm{x}^{(t)}) = \frac{p^{\star}(\bm{x}^{(t+1)}) Q(\bm{x}^{(t)}|\bm{x}^{(t+1)})}{p^{\star}(\bm{x}^{(t)})Q(\bm{x}^{(t+1)}|\bm{x}^{(t)})} = \frac{p^{\star}(x_i^{(t+1)}|\bm{x}_{\setminus i}^{(t+1)})p^{\star}(\bm{x}_{\setminus i}^{(t+1)})\cdot p^{\star}(x_i^{(t)}|\bm{x}_{\setminus i}^{(t+1)})}{p^{\star}(x_i^{(t)}|\bm{x}_{\setminus i}^{(t)})p^{\star}(\bm{x}_{\setminus i}^{(t)})\cdot p^{\star}(x_i^{(t+1)}|\bm{x}_{\setminus i}^{(t)})} = 1$$
	where $\bm{x}_{\setminus i}^{(t)}=\bm{x}_{\setminus i}^{(t+1)}$ as the other variables do not change.
\end{itemize}
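As a small illustration of the Gibbs updates (a standard textbook case, assumed here and not taken from the lecture), consider a two-dimensional Gaussian with zero mean, unit variances and correlation $\rho$. Its conditionals are again Gaussian, so one sweep of Gibbs sampling becomes:
\begin{equation*}
	\begin{split}
		p(x_1,x_2) & = \mathcal{N}\left(\begin{pmatrix}x_1\\x_2\end{pmatrix}\Big\vert \begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right)\\
		x_1^{(t+1)} & \sim p(x_1|x_2^{(t)}) = \mathcal{N}\left(x_1\vert \rho x_2^{(t)}, 1-\rho^2\right)\\
		x_2^{(t+1)} & \sim p(x_2|x_1^{(t+1)}) = \mathcal{N}\left(x_2\vert \rho x_1^{(t+1)}, 1-\rho^2\right)
	\end{split}
\end{equation*}
Note that for $|\rho|$ close to $1$, the conditional variance $1-\rho^2$ is small, so the chain moves only in small steps and consecutive samples are strongly correlated.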

================================================
FILE: Machine_Learning_2/ml2_sequential_data.tex
================================================
\section{Sequential Data}
\begin{itemize}
	\item Most models we discussed so far assumed that multiple data points $\bm{x}_n$ are independent of each other. However, in many use-cases, we have e.g. temporal data where consecutive data points depend on each other
	\item The likelihood of such data can be written as:
	$$p(x_1,...,x_N) = \prod_{n=1}^{N} p(x_n|x_1,...,x_{n-1})$$
	\item Here, we will focus on Markov models and their different variations
\end{itemize}
\subsection{Markov models}
\begin{itemize}
	\item One of the simplest models for sequential data is the Markov model, where we limit the conditionals to a fixed size. For example, a \textit{first-order} Markov model has the likelihood $p(x_1)\prod_{n=2}^{N}p(x_n|x_{n-1})$:
	
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[obs] (x1) {$x_1$} ; %
			\node[obs, right=of x1] (x2) {$x_2$} ; %
			\node[obs, right=of x2] (x3) {$x_3$} ; %
			\node[const, right=of x3] (xetc1) { \hspace{2mm}...\hspace{2mm} } ; %
			\node[obs, right=of xetc1] (xN) {$x_N$} ; %
			
			\edge{x1}{x2};
			\edge{x2}{x3};
			\edge{x3}{xetc1};
			\edge{xetc1}{xN};
		}
	\end{figure}

	A second-order MM would add connections between $x_1$ and $x_3$, $x_2$ and $x_4$ etc.
	\item We call a Markov model \underline{homogeneous} if all conditionals $p(x_{n}|x_{n-1})$ are the same, i.e. the transition probability is constant over time:
	$$p(x_{n}=a|x_{n-1}=b) = p(x_{n+m}=a|x_{n+m-1}=b)$$
	\item Suppose each $x_n$ has $K$ possible states. Then the number of parameters of a homogeneous MM increases with the order as follows:
	\begin{table}[ht!]
		\centering
		\begin{tabular}{c|cc}
			Order & \multicolumn{2}{c}{Number of params}\\
			& Prior & Conditionals\\
			\hline 
			$0$ & $K-1$ & $0$\\
			$1$ & $K-1$ & $K(K-1)$\\
			$2$ & $K^2 - 1$ & $K^2(K-1)$\\
			... & ... & ...\\
			$M$ & $K^{M}-1$ & $K^{M}(K-1)$\\
		\end{tabular}
	\end{table}

	For an inhomogeneous MM, we would have $(N-M)$ times as many conditional parameters, where $N$ is the length of the chain (which needs to be fixed if the conditionals are not shared!). 
	
	As the number of parameters increases exponentially with the order $M$, it is often impractical to use high-order MM. 
	\item Our next discussions will focus on the first-order MM but can be applied to any order. This can be easily shown by reducing an $M$-th order MM to a first-order MM. Suppose we
	introduce new variables $y_n=(x_n,x_{n+1},...,x_{n+M-1})$, then we can express our MM by:
	$$\hspace{-10mm}p(y_n=(x_n,x_{n+1},...,x_{n+M-1})|y_{n-1}=(\tilde{x}_{n-1},\tilde{x}_n,...,\tilde{x}_{n+M-2}))=\begin{cases}
	0 & \text{if } x_i\neq \tilde{x}_i \text{ for any } i\in\{n,...,n+M-2\}\\
	p(x_{n+M-1}|x_n,...,x_{n+M-2}) & \text{otherwise}
	\end{cases}$$
	\item Often, we want Markov models that are more expressive than a first-order model, but at the same time avoid having too many parameters. One way of enriching the class of MMs is by introducing latent variables as shown in this graphical model:
	
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[latent] (z1) {$z_1$} ; %
			\node[latent, right=of z1] (z2) {$z_2$} ; %
			\node[latent, right=of z2] (z3) {$z_3$} ; %
			\node[obs, below=of z1] (x1) {$x_1$} ; %
			\node[obs, below=of z2] (x2) {$x_2$} ; %
			\node[obs, below=of z3] (x3) {$x_3$} ; %
			\node[const, right=of z3] (zetc1) { \hspace{2mm}...\hspace{2mm} } ; %
			
			\edge{z1}{z2};
			\edge{z2}{z3};
			\edge{z3}{zetc1};
			
			\edge{z1}{x1};
			\edge{z2}{x2};
			\edge{z3}{x3};
		}
	\end{figure}
	\item The joint distribution for this model is given by: $$p(x_1,...,x_N,z_1,...,z_N)=p(z_1)\cdot \left[\prod_{n=2}^{N} p(z_n|z_{n-1})\right]\cdot \left[\prod_{n=1}^{N} p(x_n|z_{n})\right]$$
	\item We now look at two variants of this model:
	\begin{enumerate}
		\item \textit{Hidden Markov Models} assume that the latent space $z$ is discrete
		\item \textit{Linear Dynamical Systems} use a continuous latent space $z$ such as linear Gaussian
	\end{enumerate}
\end{itemize}
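To get a feeling for the exponential growth in the parameter table above, we can add up prior and conditional parameters of a homogeneous $M$-th order model (the value $K=3$ below is chosen only for illustration):
\begin{equation*}
	\underbrace{K^{M}-1}_{\text{Prior}} + \underbrace{K^{M}(K-1)}_{\text{Conditionals}} = K^{M+1}-1
\end{equation*}
This equals the number of free parameters of a full joint table over $M+1$ variables. For $K=3$, we get $2$, $8$, $26$ and $80$ parameters for the orders $M=0,1,2,3$, respectively.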

\subsection{Hidden Markov Models}
\begin{itemize}
	\item Commonly, when using discrete latent variables, we assume that we have a mixture model, and $z_n$ as discrete multinomial variables specify the component from which $x_n$ was generated
	\item For the homogeneous case, we get the likelihood of the observations by marginalizing the joint distribution over the latent variables:
	$$p(\bm{x}_1,...,\bm{x}_N|\bm{\pi},\bm{A},\bm{\phi}) = \sum_{\bm{z}_1}...\sum_{\bm{z}_N} p(\bm{z}_1|\bm{\pi})\left[\prod_{n=2}^{N} \underbrace{p(\bm{z}_n|\bm{z}_{n-1},\bm{A})}_{\text{Transition probabilities}}\right]\cdot \left[\prod_{n=1}^{N} \underbrace{p(\bm{x}_n|\bm{z}_n,\bm{\phi})}_{\text{Emission probabilities}}\right]$$
	where $\bm{A}$ can be seen as a table with $A_{jk}\equiv p(z_{nk}=1|z_{n-1,j}=1)$ which gives the constraints $\sum_k A_{jk}=1$, $0\leq A_{jk}\leq 1$. 
	
	The parameter $\bm{\pi}$ is again a prior over latent states for $z_1$, where $\sum_k \pi_k = 1$.
	
	The mapping between latent and observed variable is described as emission probabilities, which we can write as $p(\bm{x}_n|\bm{z}_n,\bm{\phi})=\prod_{k=1}^{K} p(\bm{x}_n|\bm{\phi}_k)^{z_{nk}}$
	
	\item For optimizing the parameters, we again use the EM algorithm because otherwise we would need to directly optimize $p(\bm{X}|\bm{\theta})=\sum_Z p(\bm{X},\bm{Z}|\bm{\theta})$. The naive sum has $K^N$ terms, i.e. its complexity increases exponentially with the number of hidden variables, which makes it intractable for large $N$.
	
\end{itemize}
\subsubsection{Maximum Likelihood for HMM}
\begin{itemize}
	\item Remember that our objective in the EM algorithm was:
	$$Q(\bm{\theta}, \bm{\theta}^{\text{old}})=\sum_{\bm{Z}} p(\bm{Z}|\bm{X},\bm{\theta}^{\text{old}})\ln p(\bm{X},\bm{Z}|\bm{\theta})=\E_{\bm{Z}\sim p(\bm{Z}|\bm{X},\bm{\theta}^{\text{old}})}\left[\ln p(\bm{X},\bm{Z}|\bm{\theta})\right]$$
	which can be written in our case as:
	\begin{equation*}
		\begin{split}
			Q(\bm{\theta}, \bm{\theta}^{\text{old}}) & = \E_{\bm{z}_1\sim p(\bm{z}_1|\bm{X},\bm{\theta}^{\text{old}})}\left[\sum_{k=1}^{K} z_{1k}\ln \pi_k\right] + \\
			& \hspace{5mm}\sum_{n=2}^{N} \E_{(\bm{z}_n, \bm{z}_{n-1})\sim p(\bm{z}_n, \bm{z}_{n-1}|\bm{X},\bm{\theta}^{\text{old}})}\left[\sum_{j=1}^{K}\sum_{k=1}^{K}z_{n-1,j}\cdot z_{nk}\cdot \ln A_{jk}\right] + \\
			& \hspace{5mm} \sum_{n=1}^{N} \E_{\bm{z}_n\sim p(\bm{z}_n|\bm{X},\bm{\theta}^{\text{old}})} \left[\sum_{k=1}^{K} z_{nk} \ln p(\bm{x}_n|\bm{\phi}_k)\right]\\
			& = \sum_{k=1}^{K} \gamma(z_{1k})\ln \pi_k + \sum_{n=2}^{N} \sum_{j=1}^{K}\sum_{k=1}^{K}\zeta(z_{n-1,j}, z_{nk})\cdot \ln A_{jk} + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk})\ln p(\bm{x}_n|\bm{\phi}_k)
		\end{split}
	\end{equation*}
	where we use $\gamma(\bm{z}_n)=p(\bm{z}_n|\bm{X},\bm{\theta}^{\text{old}})$ and $\zeta(\bm{z}_{n-1},\bm{z}_n)=p(\bm{z}_{n-1},\bm{z}_{n}|\bm{X},\bm{\theta}^{\text{old}})$.
	\item Now, let's take a closer look at the steps of the EM algorithm
	\begin{description}
		\item[E-step] We need to determine $p(z_1,...,z_N|\bm{X},\bm{\theta}^{\text{old}})$ which we split into $\gamma(\bm{z}_n)$ and $\zeta(\bm{z}_{n-1},\bm{z}_n)$. Hence, we need to calculate marginals, which can be done efficiently with the sum-product algorithm.
		
		First, we need to convert the Bayesian network into a factor graph. We do this by absorbing the emission probabilities into the factors: the prior is replaced by $h_1(\bm{z}_1)=p(\bm{z}_1|\bm{\pi}^{\text{old}})p(\bm{x}_1|\bm{z}_1, \bm{\phi}^{\text{old}})$, and the transitions by $f_n(\bm{z}_{n-1},\bm{z}_n)=p(\bm{z}_n|\bm{z}_{n-1}, \bm{A}^{\text{old}})p(\bm{x}_n|\bm{z}_n,\bm{\phi}^{\text{old}})$. The corresponding factor graph is shown in Figure~\ref{fig:HMM_factor_graph}.
		
		\begin{figure}[ht!]
			\centering
			\tikz{ %
				\factor[] {h1} {$h_1$} {} {} ;
				\node[latent, right=of h1] (z1) {$\bm{z}_{1}$} ; %
				\factor[right=of z1] {f2} {$f_2$} {} {} ;
				\node[latent, right=of f2] (z2) {$\bm{z}_{2}$} ; %
				\factor[right=of z2] {f3} {$f_3$} {} {} ;
				\node[latent, right=of f3] (z3) {$\bm{z}_{3}$} ; %
				\factor[right=of z3] {f4} {$f_4$} {} {} ;
				\node[const, right=of f4] (zetc) {\hspace{2mm}...\hspace{2mm} } ; %
				\factor[right=of zetc] {fN} {$f_N$} {} {} ;
				\node[latent, right=of fN] (zN) {$\bm{z}_{N}$} ; %
						
				\factoredge[-]{}{h1}{z1} ;	
				\factoredge[-]{z1}{f2}{z2} ;	
				\factoredge[-]{z2}{f3}{z3} ;	
				\factoredge[-]{z3}{f4}{zetc} ;	
				\factoredge[-]{zetc}{fN}{zN} ;	
			}
			\caption{Factor graph of the HMM with the emission probabilities absorbed into the factors $h_1$ and $f_n$.}
			\label{fig:HMM_factor_graph}
		\end{figure}
	
		Using the factor graph, we want to determine the normalized beliefs $p(\bm{z}_n|\bm{X},\bm{\theta}^{\text{old}})$. The sum product updates are:
		\begin{equation*}
			\begin{split}
				\mu_{z_{n-1}\to f_n}(\bm{z}_{n-1}) & = \mu_{f_{n-1}\to z_{n-1}}(\bm{z}_{n-1})\\
				\alpha_n(\bm{z}_n) = \mu_{f_n\to z_n}(\bm{z}_{n}) & = \sum_{\bm{z}_{n-1}} f_n(\bm{z}_{n-1},\bm{z}_n)\alpha_{n-1}(\bm{z}_{n-1})\\
				\mu_{z_n \to f_n}(\bm{z}_n) & = \mu_{f_{n+1} \to z_n}(\bm{z}_n) \\
				\beta_n(\bm{z}_n) = \mu_{f_{n+1}\to z_n}(\bm{z}_n) & = \sum_{\bm{z}_{n+1}} f_{n+1}(\bm{z}_n, \bm{z}_{n+1})\beta_{n+1}(\bm{z}_{n+1})
			\end{split}
		\end{equation*}
		The recursions are initialized with $\alpha_1(\bm{z}_1)=h_1(\bm{z}_1)$ and $\beta_N(\bm{z}_N)=1$. Our beliefs can be calculated by:
		\begin{equation*}
			\begin{split}
				\text{Variable belief } \hspace{2mm} p(\bm{z}_n,\bm{X}|\bm{\theta}^{\text{old}}) & = \alpha_n(\bm{z}_n)\beta_n(\bm{z}_n)\\
				\text{Factor belief } \hspace{2mm} p(\bm{z}_{n-1}, \bm{z}_n,\bm{X}|\bm{\theta}^{\text{old}}) & = \mu_{f_{n-1}\to z_{n-1}}(\bm{z}_{n-1})\mu_{f_{n+1}\to z_n}(\bm{z}_n)f_n(\bm{z}_{n-1}, \bm{z}_n)\\
				\text{Normalization constant } \hspace{2mm} p(\bm{X}|\bm{\theta}^{\text{old}}) & = \sum_{\bm{z}_n} \alpha_n(\bm{z}_n)\beta_n(\bm{z}_n)\\
			\end{split}
		\end{equation*}
		Finally, we can calculate our sufficient statistics:
		\begin{equation*}
			\begin{split}
				\gamma(\bm{z}_n) & = \frac{p(\bm{z}_n, \bm{X}|\bm{\theta}^{\text{old}})}{p(\bm{X}|\bm{\theta}^{\text{old}})} = \frac{\alpha_n(\bm{z}_n)\beta_n(\bm{z}_n)}{p(\bm{X}|\bm{\theta}^{\text{old}})}\\
				\zeta(\bm{z}_{n-1},\bm{z}_n) & = \frac{p(\bm{z}_{n-1},\bm{z}_n,\bm{X}|\bm{\theta}^{\text{old}})}{p(\bm{X}|\bm{\theta}^{\text{old}})} = \frac{\alpha_{n-1}(\bm{z}_{n-1})\beta_n(\bm{z}_n)p(\bm{z}_n|\bm{z}_{n-1}, \bm{A}^{\text{old}})p(\bm{x}_n|\bm{z}_n, \bm{\phi}^{\text{old}})}{p(\bm{X}|\bm{\theta}^{\text{old}})}			
			\end{split}
		\end{equation*}
		
		\item[M-step] In the maximization step, we optimize $Q(\bm{\theta}, \bm{\theta}^{\text{old}})$ with respect to the parameters $\bm{\pi}$, $\bm{A}$ and $\bm{\phi}$. Note that we need to add Lagrange multipliers for the normalization constraints on $\bm{A}$ and $\bm{\pi}$:
		$$\tilde{Q}(\bm{\theta}, \bm{\theta}^{\text{old}}) = Q(\bm{\theta}, \bm{\theta}^{\text{old}}) + \lambda \left(\sum_{k=1}^{K} \pi_k - 1\right) + \sum_{j=1}^{K} \lambda_j \left(\sum_{k=1}^{K} A_{jk} - 1\right)$$
		Performing the maximization, we get:
		\begin{equation*}
			\begin{split}
				\pi_k^{\text{new}} & =\frac{\gamma(z_{1k})}{\sum_{j=1}^{K}\gamma(z_{1j})}, \hspace{5mm} A_{jk}^{\text{new}} = \frac{\sum_{n=2}^{N} \zeta(z_{n-1,j}, z_{nk})}{\sum_{l=1}^{K}\sum_{n=2}^{N} \zeta(z_{n-1,j}, z_{nl})}
			\end{split}
		\end{equation*}
		Solving the same for the parameter $\bm{\phi}$ depends on the form of emission probability that was chosen. For example, if we have a Gaussian density $p(\bm{x}|\bm{\phi}_k)$, the optimized parameters are:
		$$\bm{\mu}_k=\frac{\sum_{n=1}^{N}\gamma(z_{nk})\bm{x}_n}{\sum_{n=1}^{N}\gamma(z_{nk})}, \hspace{5mm} \bm{\Sigma}_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk})(\bm{x}_n - \bm{\mu}_{k})(\bm{x}_n - \bm{\mu}_{k})^T}{\sum_{n=1}^{N} \gamma(z_{nk})}$$
		Similarly, if we had discrete observations and modeled them with a multinomial distribution, i.e. $p(\bm{x}|\bm{z},\bm{\mu}) = \prod_{i=1}^{D}\prod_{k=1}^{K} \mu_{ik}^{x_i z_k}$, we would get as solution:
		$$\mu_{ik} = \frac{\sum_{n=1}^{N}\gamma(z_{nk})x_{ni}}{\sum_{n=1}^{N}\gamma(z_{nk})}$$
	\end{description}
	
	\item To conclude, the EM algorithm for Hidden Markov Models can be summarized as follows:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=EM for HMM]
		\begin{enumerate}
			\item Choose initial values $\bm{\theta}^{\text{old}}$ with $\bm{\theta}=(\bm{\pi}, \bm{A}, \bm{\phi})$
			\item Iterate until $\Delta \bm{\theta}^{(t)} < \epsilon$
			\begin{enumerate}
				\item \textbf{E-step}: Calculate $Q(\bm{\theta}, \bm{\theta}^{\text{old}})$ by:
				\begin{enumerate}
					\item Run forward $\alpha$ recursion to calculate $\alpha(z_1),...,\alpha(z_N)$
					\item Run backward $\beta$ recursion to calculate $\beta(z_N),...,\beta(z_1)$
					\item Calculate sufficient statistics $\gamma(z_n)$ for $n=1,...,N$ and $\zeta(z_{n-1},z_n)$ for $n=2,...,N$, and the normalization constant $p(\bm{X}|\bm{\theta}^{\text{old}})$
				\end{enumerate}
				\item \textbf{M-step}: Calculate $\bm{\theta}^{\text{new}}=\arg\max_{\bm{\theta}} \tilde{Q}(\bm{\theta}, \bm{\theta}^{\text{old}})$
				\begin{itemize}
					\item $\pi_k^{\text{new}}=\gamma(z_{1k})/\sum_{j=1}^{K}\gamma(z_{1j})$
					\item $A_{jk}=\sum_{n=2}^{N} \zeta(z_{n-1,j}, z_{nk})/\sum_{l=1}^{K}\sum_{n=2}^{N} \zeta(z_{n-1,j}, z_{nl})$
					\item $\bm{\phi}^{\text{new}}$ depending on choice of emission probability
				\end{itemize}
			\end{enumerate}
		\end{enumerate}
	\end{tcolorbox}
\end{itemize}
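\begin{itemize}
	\item To make one EM iteration concrete, consider a toy example with $K=2$ states and $N=2$ observations. All numbers are made up for illustration, and $e_n(k)\equiv p(\bm{x}_n|\bm{\phi}_k^{\text{old}})$ is a shorthand introduced only here for the emission likelihoods of the two observed points:
	$$\bm{\pi}^{\text{old}}=\begin{pmatrix}0.5 & 0.5\end{pmatrix}, \hspace{4mm} \bm{A}^{\text{old}}=\begin{pmatrix}0.8 & 0.2\\ 0.2 & 0.8\end{pmatrix}, \hspace{4mm} e_1 = \begin{pmatrix}0.6 & 0.2\end{pmatrix}, \hspace{4mm} e_2 = \begin{pmatrix}0.1 & 0.5\end{pmatrix}$$
	The E-step then evaluates to:
	\begin{equation*}
		\begin{split}
			\alpha_1(k) & = \pi_k e_1(k) = (0.3,\ 0.1)\\
			\alpha_2(k) & = e_2(k)\sum_{j} A_{jk}\alpha_1(j) = (0.1\cdot 0.26,\ 0.5\cdot 0.14) = (0.026,\ 0.07)\\
			\beta_2(k) & = (1,\ 1), \hspace{4mm} \beta_1(j) = \sum_{k} A_{jk}e_2(k)\beta_2(k) = (0.18,\ 0.42)\\
			p(\bm{X}|\bm{\theta}^{\text{old}}) & = \sum_k \alpha_2(k)\beta_2(k) = 0.096 = \sum_k \alpha_1(k)\beta_1(k)\\
			\gamma(\bm{z}_1) & = \frac{(0.054,\ 0.042)}{0.096} = (0.5625,\ 0.4375), \hspace{4mm} \gamma(\bm{z}_2) = \frac{(0.026,\ 0.07)}{0.096} \approx (0.271,\ 0.729)\\
			\zeta(z_{1j}, z_{2k}) & = \frac{\alpha_1(j)A_{jk}e_2(k)\beta_2(k)}{0.096} = \frac{1}{0.096}\begin{pmatrix}0.024 & 0.030\\ 0.002 & 0.040\end{pmatrix} \approx \begin{pmatrix}0.250 & 0.313\\ 0.021 & 0.417\end{pmatrix}
		\end{split}
	\end{equation*}
	As a sanity check, the rows of $\zeta$ sum to $\gamma(\bm{z}_1)$ and its columns to $\gamma(\bm{z}_2)$. The M-step normalizes these statistics:
	$$\bm{\pi}^{\text{new}} = \gamma(\bm{z}_1) = \begin{pmatrix}0.5625 & 0.4375\end{pmatrix}, \hspace{4mm} \bm{A}^{\text{new}} = \begin{pmatrix}0.024/0.054 & 0.030/0.054\\ 0.002/0.042 & 0.040/0.042\end{pmatrix} \approx \begin{pmatrix}0.444 & 0.556\\ 0.048 & 0.952\end{pmatrix}$$
\end{itemize}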
\subsubsection{Viterbi Algorithm (Max-sum for HMMs)}
\begin{itemize}
	\item Sometimes we might be interested in the latent variables $\bm{Z}$ for a given fixed model as they can be interpreted, and thus we want to determine the most likely values
	\item For this, we can use an adaptation of the max-sum algorithm where we use dynamic programming for the backward path
	\item We use the same factor graph as in Figure~\ref{fig:HMM_factor_graph}. For simplification, we denote the message from $f_n$ (or $h_1$ for $n=1$) to $\bm{z}_n$ as $\omega(\bm{z}_n)$.
	\item The whole algorithm can be then summarized as:
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=Pseudocode of Viterbi algorithm]
		\begin{algorithm}[H]
			\SetAlgoLined
			\tcp{\textcolor{blue}{Forward pass}}
			Set initial message $\omega(\bm{z}_1) = \ln p(\bm{z}_1) + \ln p(\bm{x}_1|\bm{z}_1)$\;
			\For{$n=1,...,N-1$}{
				$\omega(\bm{z}_{n+1})=\ln p(\bm{x}_{n+1}|\bm{z}_{n+1}) + \max_{\bm{z}_n}\left(\ln p(\bm{z}_{n+1}|\bm{z}_n) + \omega(\bm{z}_n)\right)$\;
				$\psi_n(\bm{z}_{n+1})=\arg\max_{\bm{z}_n}\left(\ln p(\bm{z}_{n+1}|\bm{z}_n) + \omega(\bm{z}_n)\right)$
			}
			\tcp{\textcolor{blue}{Backward pass}}
			$\bm{z}_N^{\text{max}}=\arg\max_{\bm{z}_N} \omega(\bm{z}_N)$\;
			\For{$n=N-1,...,1$}{
				$\bm{z}_n^{\text{max}} = \psi_{n}(\bm{z}_{n+1}^{\text{max}})$\;
			}
			Return $\bm{z}_1^{\text{max}}, \bm{z}_2^{\text{max}},...,\bm{z}_N^{\text{max}}$\;
		\end{algorithm}
	\end{tcolorbox}
	\item Coming back to the discussion about the max-sum algorithm on trees, we said that for multiple optima, we need to use the Viterbi algorithm. Assume that for some $n$, we have two solutions for $\bm{z}_n^{\text{max}} = \psi_{n}(\bm{z}_{n+1}^{\text{max}})$ (i.e. we have two values for $z_n$ that maximize our objective). If we are just interested in one arbitrary solution, we can pick either of them and continue. Otherwise, we select one of the optima, continue the Viterbi algorithm, and afterwards go back to this step and select the next optimum. In the end, we can combine all solutions to return the full set of optima. 
	
	We prevent the problem of returning undesirable combinations of independent optima by conditioning them on each other (i.e. $\bm{z}_n^{\text{max}}$ depends on $\bm{z}_{n+1}^{\text{max}}$).
\end{itemize}
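\begin{itemize}
	\item As a small worked example (with made-up numbers), take $K=2$, $N=2$, prior $p(\bm{z}_1)=(0.5, 0.5)$, transition probabilities $p(z_2=k|z_1=j)=0.8$ for $j=k$ and $0.2$ otherwise, and emission likelihoods $p(\bm{x}_1|\bm{z}_1)=(0.6, 0.2)$ and $p(\bm{x}_2|\bm{z}_2)=(0.1, 0.5)$ for the two observed points. The forward pass gives:
	\begin{equation*}
		\begin{split}
			\omega(\bm{z}_1) & = (\ln 0.3,\ \ln 0.1)\\
			\omega(z_2=1) & = \ln 0.1 + \max\left(\ln 0.8 + \ln 0.3,\ \ln 0.2 + \ln 0.1\right) = \ln 0.024, \hspace{4mm} \psi_1(z_2=1) = 1\\
			\omega(z_2=2) & = \ln 0.5 + \max\left(\ln 0.2 + \ln 0.3,\ \ln 0.8 + \ln 0.1\right) = \ln 0.04, \hspace{4mm} \psi_1(z_2=2) = 2
		\end{split}
	\end{equation*}
	The backward pass selects $z_2^{\text{max}}=2$ (as $0.04>0.024$) and $z_1^{\text{max}}=\psi_1(2)=2$, i.e. the most likely sequence is $(2,2)$ with $p(\bm{X},\bm{Z})=0.04$. 
	
	Note that this differs from picking the most probable state for each step individually: summing the four joint probabilities $0.024, 0.03, 0.002, 0.04$ over the other variable gives $p(z_1, \bm{X})=(0.054, 0.042)$ and $p(z_2,\bm{X})=(0.026, 0.07)$, whose maximizers $(1,2)$ only have a joint probability of $0.03$.
\end{itemize}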
\subsection{Linear Dynamical Systems}
\begin{itemize}
	\item As mentioned before, Linear Dynamical Systems use a continuous latent space $\bm{z}_n\in \mathbb{R}^{d_z}$ instead of a discrete one as in HMMs. This also influences the models we choose because, as we will see later, specific distribution properties help to simplify the calculations
	\item One popular model is the \textit{Linear-Gaussian} model: all conditional distributions in the Bayesian network are Gaussians with means that depend linearly on their parents. This leads to the probabilities:
	\begin{equation*}
		\begin{split}
			\text{Transition probability}\hspace{2mm} p(\bm{z}_n|\bm{z}_{n-1}) & = \mathcal{N}(\bm{z}_n|\bm{A}\bm{z}_{n-1}, \bm{\Gamma})\\
			\text{Emission probability}\hspace{2mm} p(\bm{x}_n|\bm{z}_n) & = \mathcal{N}(\bm{x}_n|\bm{C}\bm{z}_n, \bm{\Sigma})\\
			\text{Initial state}\hspace{2mm} p(\bm{z}_1)  & = \mathcal{N}(\bm{z}_1|\bm{\mu}_0, \bm{V}_0)
		\end{split}
	\end{equation*}
	with parameters $\bm{\theta}=(\bm{A}, \bm{\Gamma}, \bm{C}, \bm{\Sigma}, \bm{\mu}_0, \bm{V}_0)$ where $\bm{\Sigma}$ is the observation noise, and $\bm{\Gamma}$ the transition uncertainty.
	\item We first consider inference in LDS which represents the E-step, and then complete the learning process with the M-step in the second subsection
\end{itemize}
\subsubsection{Inference in Linear Dynamical Systems}
\begin{itemize}
	\item To find the marginal distributions for the latent variables, we again use message passing. The forward messages are the normalized marginal distributions $\widehat{\alpha}(\bm{z}_n)$:
	$$p(\bm{z}_n|\bm{x}_1,...,\bm{x}_n) = \widehat{\alpha}(\bm{z}_n) = \mathcal{N}(\bm{z}_n|\bm{\mu}_{n}, \bm{V}_n)$$
	\item Similarly to the HMM, our forward propagation takes the form:
	$$c_n \widehat{\alpha}(\bm{z}_n) = p(\bm{x}_n|\bm{z}_n) \int \widehat{\alpha}(\bm{z}_{n-1})p(\bm{z}_n|\bm{z}_{n-1})d\bm{z}_{n-1}$$
	where we now have an integral instead of the sum, and $c_n$ is a constant making sure of the right scale as $\widehat{\alpha}(\bm{z}_n)$ is normalized.
	\item Plugging in the Gaussian distributions, we get:
	\begin{equation*}
		\begin{split}
			c_n \mathcal{N}(\bm{z}_n|\bm{\mu}_n, \bm{V}_n) &  = \mathcal{N}(\bm{x}_n|\bm{C}\bm{z}_n, \bm{\Sigma})\int \mathcal{N}(\bm{z}_{n-1}|\bm{\mu}_{n-1}, \bm{V}_{n-1})\mathcal{N}(\bm{z}_n|\bm{A}\bm{z}_{n-1}, \bm{\Gamma})d\bm{z}_{n-1}\\
			& = \mathcal{N}(\bm{x}_n|\bm{C}\bm{z}_n, \bm{\Sigma}) \mathcal{N}(\bm{z}_n|\bm{A}\bm{\mu}_{n-1}, \underbrace{\bm{\Gamma}+\bm{A}\bm{V}_{n-1}\bm{A}^T}_{\bm{P}_{n-1}})
		\end{split}
	\end{equation*}
	Here we see the first point where the chosen distributions can make a difference. The integral can be calculated due to the Gaussian, and we can get $\bm{\mu}_n$ and $\bm{V}_n$ by applying more mathematical tricks with Gaussians and matrices (which we will not detail here, but can be found in the lecture notes). The end result is:
	$$\bm{\mu}_n = \bm{A}\bm{\mu}_{n-1} + \bm{K}_n (\bm{x}_n - \bm{C}\bm{A}\bm{\mu}_{n-1}), \hspace{4mm} \bm{V}_n=(\bm{I} - \bm{K}_n\bm{C})\bm{P}_{n-1}$$
	where $\bm{K}_n=\bm{P}_{n-1}\bm{C}^T\left(\bm{C}\bm{P}_{n-1}\bm{C}^T+\bm{\Sigma}\right)^{-1}$ is the Kalman gain matrix. The important thing here is that without an observation, $\bm{\mu}_n$ would simply be $\bm{\mu}_{n-1}$ propagated by $\bm{A}$; the second term corrects this prediction using the observation $\bm{x}_n$.
	\item For the EM algorithm, in the backward pass, we get our sufficient statistics:
	$$\gamma(\bm{z}_n)=\mathcal{N}(\bm{z}_n|\hat{\bm{\mu}}_n, \hat{\bm{V}}_n),\hspace{5mm}\zeta(\bm{z}_{n-1},\bm{z}_n)=\mathcal{N}\left(\begin{bmatrix}
	\hat{\bm{\mu}}_{n-1} \\ \hat{\bm{\mu}}_{n}
	\end{bmatrix}, \begin{bmatrix}
	\hat{\bm{V}}_{n-1} & \bm{J}_{n-1}\hat{\bm{V}}_{n}\\ \hat{\bm{V}}_{n}\bm{J}_{n-1}^T& \hat{\bm{V}}_{n}
	\end{bmatrix}\right)$$
\end{itemize}
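\begin{itemize}
	\item A scalar example may help to get a feeling for the forward update. The numbers are made up, and we use the standard Kalman gain $K_n=P_{n-1}C(CP_{n-1}C+\Sigma)^{-1}$ for the one-dimensional case. Assume $A=C=1$, $\Gamma=\Sigma=1$, a previous belief with $\mu_{n-1}=0$, $V_{n-1}=1$, and a new observation $x_n=3$:
	\begin{equation*}
		\begin{split}
			P_{n-1} & = \Gamma + A V_{n-1} A = 1 + 1 = 2\\
			K_n & = \frac{P_{n-1}C}{CP_{n-1}C + \Sigma} = \frac{2}{2+1} = \frac{2}{3}\\
			\mu_n & = A\mu_{n-1} + K_n(x_n - CA\mu_{n-1}) = 0 + \frac{2}{3}\cdot 3 = 2\\
			V_n & = (1 - K_n C)P_{n-1} = \frac{1}{3}\cdot 2 = \frac{2}{3}
		\end{split}
	\end{equation*}
	The new mean lies between the prediction $A\mu_{n-1}=0$ and the observation $x_n=3$, closer to the observation because the predictive variance $P_{n-1}=2$ is larger than the observation noise $\Sigma=1$. The variance shrinks from $P_{n-1}=2$ to $V_n=2/3$ as the observation adds information.
\end{itemize}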
\subsubsection{Learning in LDS using EM}
\begin{itemize}
	\item Above we have seen the results of the E-step in Linear Dynamical Systems. Now we take a closer look at the M-step, where we want to optimize for the parameters $\bm{\theta}=(\bm{A}, \bm{\Gamma}, \bm{C}, \bm{\Sigma}, \bm{\mu}_0, \bm{V}_0)$
	\item The complete-data log-likelihood, which we want to maximize in expectation over the posterior, is:
	$$\ln p(\bm{X},\bm{Z}|\bm{\theta}) = \ln p(\bm{z}_1|\bm{\mu}_0, \bm{V}_0) + \sum_{n=2}^{N} \ln p(\bm{z}_n|\bm{z}_{n-1}, \bm{A}, \bm{\Gamma}) + \sum_{n=1}^{N} \ln p(\bm{x}_n|\bm{z}_n,\bm{C},\bm{\Sigma})$$ 
	\item If we now want to optimize e.g. $\bm{\mu}_0$ and $\bm{V}_0$, we need to differentiate $Q(\bm{\theta}, \bm{\theta}^{\text{old}})$ with respect to these variables and set the derivatives to zero. The solution is:
	\begin{equation*}
		\begin{split}
			\bm{\mu}_0, \bm{V}_0: Q(\bm{\theta}, \bm{\theta}^{\text{old}}) & = -\frac{1}{2}\ln \left|\bm{V}_0\right| - \frac{1}{2}\E_{\bm{z}_1|\bm{\theta}^{\text{old}}}\left[\left(\bm{z}_1 - \bm{\mu}_0\right)^T \bm{V}_0^{-1}(\bm{z}_1 - \bm{\mu}_0)\right] + \text{const}\\
			\bm{\mu}_0^{\text{new}} & = \E[\bm{z}_1|\bm{\theta}^{\text{old}}]\\
			\bm{V}_0^{\text{new}} & = \E[\bm{z}_1\bm{z}_1^T|\bm{\theta}^{\text{old}}] - \E[\bm{z}_1|\bm{\theta}^{\text{old}}]\E[\bm{z}_1|\bm{\theta}^{\text{old}}]^T
		\end{split}
	\end{equation*}
\end{itemize}
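\begin{itemize}
	\item As a sketch of the missing intermediate step for $\bm{\mu}_0$: differentiating the expression for $Q(\bm{\theta}, \bm{\theta}^{\text{old}})$ above and setting the result to zero gives
	$$\frac{\partial Q}{\partial \bm{\mu}_0} = \bm{V}_0^{-1}\left(\E[\bm{z}_1|\bm{\theta}^{\text{old}}] - \bm{\mu}_0\right) \overset{!}{=} 0 \hspace{3mm}\implies\hspace{3mm} \bm{\mu}_0^{\text{new}} = \E[\bm{z}_1|\bm{\theta}^{\text{old}}]$$
	The required expectations follow directly from the E-step. Since $\gamma(\bm{z}_1)=\mathcal{N}(\bm{z}_1|\hat{\bm{\mu}}_1, \hat{\bm{V}}_1)$, we have $\E[\bm{z}_1]=\hat{\bm{\mu}}_1$ and $\E[\bm{z}_1\bm{z}_1^T]=\hat{\bm{V}}_1 + \hat{\bm{\mu}}_1\hat{\bm{\mu}}_1^T$, so that the updates simplify to:
	$$\bm{\mu}_0^{\text{new}} = \hat{\bm{\mu}}_1, \hspace{5mm} \bm{V}_0^{\text{new}} = \hat{\bm{V}}_1 + \hat{\bm{\mu}}_1\hat{\bm{\mu}}_1^T - \hat{\bm{\mu}}_1\hat{\bm{\mu}}_1^T = \hat{\bm{V}}_1$$
\end{itemize}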

================================================
FILE: Machine_Learning_2/ml2_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb, amsfonts, nccmath} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
% \usepackage{geometry}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}
\usetikzlibrary{bayesnet}

\usepackage{tcolorbox}
\usepackage[ruled,vlined]{algorithm2e}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\loss}[0]{\mathcal{L}}
\newcommand{\chain}[3]{\frac{\partial #1}{\partial #2}\frac{\partial #2}{\partial #3}}
% \newcommand{\eq}[1]{\begin{equation*}\begin{split}#1\end{split}\end{equation*}}
\newcommand{\TODO}[1]{\textbf{\textcolor{red}{#1}}}
\newcommand{\E}[0]{\mathbb{E}} % Expectation
\newcommand{\R}[0]{\mathbb{R}} % Real numbers
\newcommand{\Cdo}[0]{\textnormal{do}}
\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}
\newcommand*{\QED}{\hfill\ensuremath{\blacksquare}}%

\definecolor{green}{RGB}{0,160,0}
\definecolor{blue}{RGB}{0,0,160}
\definecolor{red}{RGB}{160,0,0}
\definecolor{orange}{RGB}{200,160,0}
\definecolor{purple}{RGB}{170,0,200}
\definecolor{cyan}{RGB}{0,200,200}
\definecolor{lightred}{RGB}{200,50,50}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Machine Learning 2}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{ml2_exponential_family.tex}
\newpage
\input{ml2_graphical_models.tex}
\newpage
\input{ml2_variational_EM.tex}
\newpage
\input{ml2_sampling_methods.tex}
\newpage
\input{ml2_sequential_data.tex}
\newpage 
\input{ml2_causality.tex}
\appendix
\newpage
\input{ml2_appendix.tex}

\end{document}

================================================
FILE: Machine_Learning_2/ml2_variational_EM.tex
================================================
\section{Variational Expectation Maximization}

\begin{itemize}
	\item The expectation maximization algorithm can be viewed from a different angle where we focus on the latent variables $z_n$
	\item For each observed data point $\bm{x}_n$, we create a latent variable $\bm{z}_n$ which \textit{explains} the observation by our underlying model
	\begin{itemize}
		\item For example, in case we have a mixture model, the latent variable $z_n$ indicates from which component $\bm{x}_n$ was created
		\item We create the model in such a way that $p(X,Z|\theta)$ can be easily calculated
	\end{itemize}
	\item As a graphical model, we can represent it as follows:
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[obs] (x) {$\bm{x}_n$} ; %
			\node[latent, above=of x] (z) {$\bm{z}_n$} ; %
			
			\edge{z}{x};
			
			\plate{xz}{(x)(z)}{$n=1,...,N$};
			\node[const, left=of x] (theta) {$\bm{\theta}$};
			\edge{theta}{x};
		}
	\end{figure}
	\item The objective we want to optimize is the log-likelihood $$\ln p(\bm{X}|\bm{\theta}) = \ln \left\{\sum_{\bm{Z}}p(\bm{X}, \bm{Z}|\bm{\theta})\right\}$$
	\item However, note that in the standard setting, we are not given $\bm{Z}$ so that we need to work with the posterior $p(\bm{Z}|\bm{X},\bm{\theta})$ as knowledge over latent variables. Hence, we calculate the log-likelihood under expectation of our posterior:
	$$\E_{\bm{Z}\sim p(\bm{Z}|\bm{X},\bm{\theta})}\left[\ln p(\bm{X}, \bm{Z}|\bm{\theta})\right] = \sum_{\bm{Z}} p(\bm{Z}|\bm{X},\bm{\theta}) \ln p(\bm{X}, \bm{Z}|\bm{\theta})$$
	\item As the joint optimization problem over the posterior and the parameters often has no analytical closed-form solution, we optimize both alternately, leading to the general idea of the EM algorithm
	\begin{description}
		\item[E-step] Find the posterior distribution $p(\bm{Z}|\bm{X},\bm{\theta}^{\text{old}})$ where $\bm{\theta}^{\text{old}}$ means that we fix the other parameters
		\item[M-step] Optimize the log-likelihood with respect to parameters $\bm{\theta}$ while keeping the posterior fixed
		$$\bm{\theta}^{\text{new}} = \arg\max_{\bm{\theta}} \mathcal{Q}(\bm{\theta}, \bm{\theta}^{\text{old}}) = \arg\max_{\bm{\theta}} \sum_{\bm{Z}} p(\bm{Z}|\bm{X},\bm{\theta}^{\text{old}}) \ln p(\bm{X}, \bm{Z}|\bm{\theta})$$
	\end{description}
	\item In case we want to find the MAP instead of the MLE, we simply have to add the prior term $\ln p(\bm{\theta})$ to $\mathcal{Q}(\bm{\theta}, \bm{\theta}^{\text{old}})$ in the M-step 
\end{itemize}
\subsection{Generalizing EM}
\begin{itemize}
	\item We can further generalize the EM algorithm to a form, which was originally used to prove its optimization objective. 
	\item However, we can also look at it from a different perspective. Sometimes the posterior $p(\bm{Z}|\bm{X},\bm{\theta}^{\text{old}})$ is hard to determine. Instead, we can introduce an approximation $q(\bm{Z})$ for which we can choose the form ourselves (e.g. Gaussian, or softmax distribution over discrete states, etc.)
	\item Using this approximation, we can calculate the log likelihood by:
	\begin{equation*}
		\begin{split}
			\ln p(\bm{X}|\bm{\theta}) & = \sum_{n=1}^{N} \sum_{\bm{z}_n} q(\bm{z}_n)\ln \frac{p(\bm{x}_n, \bm{z}_n|\bm{\theta})}{p(\bm{z}_n|\bm{x}_n, \bm{\theta})}\\
			 & = \sum_{n=1}^{N} \sum_{\bm{z}_n} q(\bm{z}_n)\ln \frac{p(\bm{x}_n, \bm{z}_n|\bm{\theta})}{q(\bm{z}_n)}\frac{q(\bm{z}_n)}{p(\bm{z}_n|\bm{x}_n, \bm{\theta})}\\
			 & = \sum_{n=1}^{N} \left[\E_{q_n}\left[\log p(\bm{x}_n, \bm{z}_n|\bm{\theta})\right] + H(q_n) + \underbrace{\text{KL}\left(q_n(\bm{z}_n)||p(\bm{z}_n|\bm{x}_n, \bm{\theta})\right)}_{\geq 0}\right]\\
		\end{split}
	\end{equation*}
	\item We can define an ELBO for our objective function:
	$$\ln p(\bm{X}|\bm{\theta}) = \mathcal{L}(\bm{\theta}) \geq  \sum_{n=1}^{N} \left[\E_{q_n}\left[\log p(\bm{x}_n, \bm{z}_n|\bm{\theta})\right] + H(q_n)\right] = \mathcal{L}(\bm{\theta}, q)$$
	\item Hence, for any set of distributions $\left\{q_n(\bm{z}_n)\right\}_{n=1}^{N}$, we can define a lower bound optimization objective $\mathcal{L}(\bm{\theta}, q)$, which is equal to $\mathcal{L}(\bm{\theta})$ iff $q_n(\bm{z}_n)=p(\bm{z}_n\vert \bm{x}_n, \bm{\theta})$
	\item During the E-step, we optimize $\mathcal{L}(\bm{\theta}, q)=\ln p(\bm{X}|\bm{\theta})-\text{KL}(q||p)$ with respect to $q$. As $\ln p(\bm{X}|\bm{\theta})$ is independent of $q$, we minimize the KL-divergence, meaning that we push $q_n(\bm{z}_n)$ to be similar to $p(\bm{z}_n\vert \bm{x}_n, \bm{\theta})$
	\item In the M-step, we maximize $\mathcal{L}(\bm{\theta}, q)$ with respect to the parameters $\bm{\theta}$ while keeping $q$ fixed, which also increases $\ln p(\bm{X}|\bm{\theta})$. Note that as $q$ is fixed in this step, the KL-divergence generally increases as well, because $q$ is now sub-optimal for the new parameters. Hence, we perform another E-step, and so on.
	\item Keep in mind that when optimizing $\mathcal{L}(\bm{\theta}, q)$, we need to add Lagrange multipliers for the constraint that each $q_n$ is a valid probability distribution:
	$$\tilde{\mathcal{L}}(\bm{\theta}, q) = \mathcal{L}(\bm{\theta}, q) + \sum_{n=1}^{N} \lambda_n \left(\sum_{\bm{z}_n} q_n(\bm{z}_n) - 1\right)$$
	Depending on our model, we might need to add further Lagrange multipliers to $\tilde{\mathcal{L}}(\bm{\theta}, q)$ (e.g. over $\bm{\pi}$ in mixture models)
	\item In summary, the variational EM algorithm is:
	
	\begin{tcolorbox}[colback=white!85!gray,colframe=gray!75!black,title=Variational EM algorithm]
		\begin{enumerate}
			\item Choose initial $\bm{\theta}^{(0)}$
			\item Iterate until $\Delta \bm{\theta}^{(t)} < \epsilon$
			\begin{enumerate}
				\item \textbf{E-step}: Given \underline{fixed} $\bm{\theta}^{(t)}$,
				\begin{itemize}
					\item If the posterior can be determined, evaluate $q_n^{(t)}(\bm{z}_n) = p(\bm{z}_n|\bm{x}_n, \bm{\theta}^{(t)})$
					\item Otherwise, use Variational EM by increasing $\tilde{\mathcal{L}}(\bm{\theta}^{(t)}, q)$ over $q$, e.g. by gradient ascent
				\end{itemize}
				\item \textbf{M-step}: Given \underline{fixed} $q^{(t)}$,
				\begin{itemize}
					\item Solve, if possible, $\bm{\theta}^{(t+1)} = \arg\max_{\bm{\theta}} \tilde{\mathcal{L}}(\bm{\theta}, q^{(t)})$ % \sum_{n=1}^{N} \E_{q_n}\left[\log p(\bm{x}_n, \bm{z}_n|\bm{\theta})\right]
					\item Otherwise, increase $\tilde{\mathcal{L}}(\bm{\theta}, q^{(t)})$ over $\bm{\theta}$, e.g. by gradient ascent
				\end{itemize}
			\end{enumerate}
		\end{enumerate}
	\end{tcolorbox}
\end{itemize}
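\begin{itemize}
	\item The decomposition of the log-likelihood into ELBO and KL-divergence can be verified on a tiny example with made-up numbers. Take a single data point $\bm{x}$ and a binary latent variable with $p(\bm{x}, z=1|\bm{\theta})=0.3$ and $p(\bm{x}, z=2|\bm{\theta})=0.1$, so that $p(\bm{x}|\bm{\theta})=0.4$ and $p(z|\bm{x},\bm{\theta})=(0.75, 0.25)$. For the (sub-optimal) choice $q(z)=(0.5, 0.5)$ we get:
	\begin{equation*}
		\begin{split}
			\mathcal{L}(\bm{\theta}, q) & = 0.5\ln\frac{0.3}{0.5} + 0.5\ln\frac{0.1}{0.5} = 0.5\ln 0.12 \approx -1.060\\
			\text{KL}\left(q||p(z|\bm{x},\bm{\theta})\right) & = 0.5\ln\frac{0.5}{0.75} + 0.5\ln\frac{0.5}{0.25} = 0.5\ln\frac{4}{3} \approx 0.144\\
			\mathcal{L}(\bm{\theta}, q) + \text{KL}\left(q||p(z|\bm{x},\bm{\theta})\right) & = 0.5\ln 0.16 = \ln 0.4 = \ln p(\bm{x}|\bm{\theta}) \approx -0.916
		\end{split}
	\end{equation*}
	Setting $q(z)=(0.75, 0.25)$ in the E-step makes the KL-divergence vanish and the bound tight, i.e. $\mathcal{L}(\bm{\theta}, q)=\ln 0.4$.
\end{itemize}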
\subsubsection{Example: Mixture of multivariate Bernoullis}
\begin{itemize}
	\item As an example, we outline the EM algorithm for optimizing a mixture of multivariate Bernoullis. Our model distribution is:
	$$p(\bm{x}_n|\bm{\mu}, \bm{\pi}) = \sum_{k=1}^{K} \pi_k \prod_{i=1}^{D} \mu_{ki}^{x_{ni}} (1- \mu_{ki})^{1-x_{ni}} \hspace{4mm}\text{where}\hspace{2mm} \sum_{k=1}^{K}\pi_k = 1$$
	\item Optimizing the parameters without the EM algorithm does not have a closed-form solution because of the sum inside the logarithm:
	$$\log p(\bm{X}|\bm{\mu},\bm{\pi}) = \sum_{n=1}^{N} \log \sum_{z_n=1}^{K} \pi_{z_n} p(\bm{x}_n|\bm{\mu}_{z_n})$$
	\item Our EM objective is instead:
	\begin{equation*}
		\begin{split}
			\mathcal{L}(q,\bm{\mu}, \bm{\pi}) & = \sum_{n=1}^{N} \sum_{\bm{z}_n} q_n(\bm{z}_n) \left\{ \left(\log \pi_{z_n} + \sum_{i=1}^{D}\left[x_{ni}\log \mu_{z_n,i} + (1 - x_{ni})\log (1 - \mu_{z_n,i})\right]\right)- \log q_n(\bm{z}_n)\right\}\\
		\end{split}
	\end{equation*}
	\begin{equation*}
		\begin{split}
			\tilde{\mathcal{L}}(q,\bm{\mu}, \bm{\pi}, \lambda, \left\{\lambda_n\right\}) & = \mathcal{L}(q,\bm{\mu}, \bm{\pi}) + \lambda \left(\sum_{k=1}^{K} \pi_k - 1\right) + \sum_{n=1}^{N} \lambda_n \left(\sum_{\bm{z}_n} q_n(\bm{z}_n) - 1\right)
		\end{split}
	\end{equation*}
	\item We obtain the update equations by differentiating $\tilde{\mathcal{L}}(q,\bm{\mu}, \bm{\pi}, \lambda, \left\{\lambda_n\right\})$ with respect to the corresponding parameters and setting the derivatives to zero
	\begin{description}
		\item[E-Step] Optimize $q$
		\begin{equation*}
			\begin{split}
				\frac{\partial \tilde{\mathcal{L}}}{\partial q_n(z_n)} & = \log \pi_{z_n} + \left[\sum_{i=1}^{D} x_{ni}\log \mu_{z_n,i}+(1-x_{ni})\log (1-\mu_{z_n,i})\right] - \log q_n(z_n) - 1 + \lambda_n = 0\\
				\implies q_n(z_n) & = \exp(\lambda_n-1)\pi_{z_n}\prod_{i=1}^{D} \mu_{z_n,i}^{x_{ni}}(1 - \mu_{z_n,i})^{1-x_{ni}}
			\end{split}
		\end{equation*}
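		Solving for the Lagrange multiplier $\lambda_n$ with the constraint $\sum_{z_n}q_n(z_n)=1$, we see that $\exp(\lambda_n-1)=1/\left(\sum_{z_n}\pi_{z_n}\prod_{i=1}^{D} \mu_{z_n,i}^{x_{ni}}(1 - \mu_{z_n,i})^{1-x_{ni}}\right)=1/p(\bm{x}_n|\bm{\mu},\bm{\pi})$. Hence, it is simply a normalization factor, and $q_n$ equals the exact posterior.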
		\item[M-step] Optimize parameters $\bm{\pi}$:
		\begin{equation*}
			\begin{split}
				\frac{\partial \tilde{\mathcal{L}}}{\partial \pi_k} & = \sum_{n=1}^{N} \sum_{\bm{z}_n }  q_n(\bm{z}_n)\frac{z_{nk}}{\pi_k}  + \lambda \overset{!}{=} 0\\
				\Leftrightarrow \pi_k & = -\frac{1}{\lambda} \sum_{n=1}^{N} \sum_{\bm{z}_n}  z_{nk} q_n(\bm{z}_n)\\
				\frac{\partial \tilde{\mathcal{L}}}{\partial \lambda} & = \sum_{k=1}^{K} \pi_k - 1 \overset{!}{=} 0\\
				\Leftrightarrow 1 & = -\sum_{k=1}^{K} \frac{1}{\lambda} \sum_{n=1}^{N} \sum_{\bm{z}_n} z_{nk} q_n(\bm{z}_n) \\ 
				\Leftrightarrow \lambda & = -\sum_{k=1}^{K} \sum_{n=1}^{N} \sum_{\bm{z}_n} z_{nk} q_n(\bm{z}_n) = -\sum_{n=1}^{N} 1 = -N\\
				\implies \pi_k & = \frac{\sum_{n=1}^{N} \sum_{\bm{z}_n} z_{nk} q_n(\bm{z}_n)}{N}
			\end{split}
		\end{equation*}
		Optimize parameter $\bm{\mu}$:
		\begin{equation*}
			\begin{split}
				\frac{\partial \tilde{\mathcal{L}}}{\partial \mu_{ki}} & = \sum_{n=1}^{N} q_n(k)\left[\frac{x_{ni}}{\mu_{ki}} - \frac{1-x_{ni}}{1-\mu_{ki}}\right] \overset{!}{=} 0\\
				\implies \mu_{ki} & = \frac{\sum_{n=1}^{N} q_n(k)x_{ni}}{\sum_{n=1}^{N} q_n(k)}
			\end{split}
		\end{equation*}
	\end{description}
\end{itemize}
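\begin{itemize}
	\item To illustrate the updates, consider a toy setup with made-up numbers: $K=2$ components, $D=2$ dimensions, $\bm{\pi}=(0.5, 0.5)$, $\bm{\mu}_1=(0.8, 0.2)$, $\bm{\mu}_2=(0.3, 0.6)$, and $N=2$ data points $\bm{x}_1=(1,0)$, $\bm{x}_2=(0,1)$. The E-step evaluates $\pi_k\prod_i \mu_{ki}^{x_{ni}}(1-\mu_{ki})^{1-x_{ni}}$ for each component and normalizes:
	\begin{equation*}
		\begin{split}
			\bm{x}_1: \hspace{2mm} & (0.5\cdot 0.8\cdot 0.8,\ 0.5\cdot 0.3\cdot 0.4) = (0.32,\ 0.06) \hspace{2mm}\implies\hspace{2mm} q_1 = \frac{(0.32,\ 0.06)}{0.38} \approx (0.842,\ 0.158)\\
			\bm{x}_2: \hspace{2mm} & (0.5\cdot 0.2\cdot 0.2,\ 0.5\cdot 0.7\cdot 0.6) = (0.02,\ 0.21) \hspace{2mm}\implies\hspace{2mm} q_2 = \frac{(0.02,\ 0.21)}{0.23} \approx (0.087,\ 0.913)
		\end{split}
	\end{equation*}
	The M-step then averages these responsibilities, e.g. for the first component:
	$$\pi_1 = \frac{q_1(1) + q_2(1)}{N} \approx \frac{0.842 + 0.087}{2} \approx 0.465, \hspace{5mm} \mu_{11} = \frac{q_1(1)\cdot 1 + q_2(1)\cdot 0}{q_1(1) + q_2(1)} \approx \frac{0.842}{0.929} \approx 0.906$$
	Component 1 takes most of the responsibility for $\bm{x}_1$, so its mean in the first dimension moves towards $x_{11}=1$.
\end{itemize}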
\subsection{Variational Inference: Variational Bayes}
\begin{itemize}
	\item In standard EM, we treat $\bm{\theta}$ as a parameter for which we optimize a single value (a point estimate). However, from a Bayesian perspective, we would also treat it as a (latent) random variable and calculate its posterior, leading to the following graphical model:
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[obs] (x) {$\bm{x}_n$} ; %
			\node[latent, left=of x] (theta) {$\bm{\theta}$} ; %
			
			\edge{theta}{x};
			
			\plate{xz}{(x)}{$n=1,...,N$};
		}
	\end{figure}
	\item Note that we could also include latent variables $\bm{z}_n$ as in the EM algorithm, but for generality, we could simply include them in $\bm{\theta}$ as the calculations are exactly the same
	\item We perform the same derivation as for variational EM, with the difference, that $q$ is now over all parameters $\bm{\theta}$, giving us the following objective:
	\begin{equation*}
		\begin{split}
			L=\log p(X) & \geq \mathcal{L}(q)\\
			\text{where}\hspace{2mm}\mathcal{L}(q) & = \int q(\bm{\theta}) \log \left[p(X|\bm{\theta})p(\bm{\theta})\right]d\bm{\theta} - \int q(\bm{\theta})\log q(\bm{\theta})d\bm{\theta}\\
			& = \E_{q(\bm{\theta})}\left[\log \left(p(X|\bm{\theta})p(\bm{\theta})\right)\right] + H(q)
		\end{split}
	\end{equation*}
	Note that we use integrals here as parameters are often continuous.
	\item The posterior can be found by optimizing $q$:
	$$p(\bm{\theta}|\bm{X}) = \arg\max_{q(\bm{\theta})} \mathcal{L}(q)$$
	\item In practice, we restrict $q(\bm{\theta})$ to be in a function family $Q$ for which we can calculate the maximum of $\mathcal{L}(q)$ analytically, or optimize it by gradient ascent. 
	\item One example for $Q$ is to assume that all (or at least certain parts) of the parameters are independent of each other, hence:
	$$q(\bm{\theta})\in Q=\left\{\prod_{i=1}^{D}q_i(\bm{\theta}_i)\right\}$$
	where $\bm{\theta}_i$ can be either a scalar or a vector.
	\item For this case, we can simplify the objective:
	\begin{equation*}
		\begin{split}
			\tilde{L}(q) & = \int \left(\prod_{i=1}^{D} q_i(\bm{\theta}_i)\right) \log \left[p(X|\bm{\theta})p(\bm{\theta})\right]d\bm{\theta} - \sum_{i=1}^{D}\int q_i(\bm{\theta}_i)\log q_i(\bm{\theta}_i)d\bm{\theta}_i + \sum_{i=1}^{D} \lambda_i \left(\int q_i(\bm{\theta}_i)d\bm{\theta}_i - 1\right)\\
			\frac{\partial \tilde{L}}{\partial q_i(\bm{\theta}_i)} & = \int \left(\prod_{j\neq i} q_j(\bm{\theta}_j)\right) \log \left[p(X|\bm{\theta})p(\bm{\theta})\right]d\bm{\theta}_{\setminus i} - \log q_i(\bm{\theta}_i) - 1 + \lambda_i \overset{!}{=} 0\\
			\Leftrightarrow q_i(\bm{\theta}_i) & = \exp\left(\lambda_i - 1\right) \exp\left\{\int \left(\prod_{j\neq i} q_j(\bm{\theta}_j)\right)\log \left[p(X|\bm{\theta})p(\bm{\theta})\right]d\bm{\theta}_{\setminus i} \right\}\\
		\end{split}
	\end{equation*}
	\item Hence, our approximate posterior over $\bm{\theta}_i$ is proportional to the exponentiated expectation of $\log p(X,\bm{\theta})$ over all other parameters:
	\begin{equation*}
		\tcbox[nobeforeafter]{\(p(\bm{\theta}|\bm{X})\approx\prod_{i=1}^{D}q_i(\bm{\theta}_i), \hspace{2mm}q_i(\bm{\theta}_i) = \frac{1}{Z} \exp\left(\E_{q_{\setminus i}}\left[\log p(\bm{X}, \bm{\theta})\right]\right)\)}
	\end{equation*}
	As each $q_i$ depends on the other factors, we iterate this update over all $q_i$ until convergence.
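	\item Note that Variational Bayes is closely related to Gibbs sampling. There, we do not calculate a full probability distribution $q$, but simply alternate between sampling each $\bm{\theta}_i$ from its conditional distribution given all other variables. When run for a sufficient number of iterations, Gibbs sampling yields samples from the true posterior, whereas Variational Bayes converges to a (locally) optimal approximation within the chosen family $Q$.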
	
\end{itemize}
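\begin{itemize}
	\item An alternative way to see why this update is optimal, sketched here without Lagrange multipliers: define the normalized distribution $\tilde{p}_i(\bm{\theta}_i) = \frac{1}{Z}\exp\left(\E_{q_{\setminus i}}\left[\log p(\bm{X},\bm{\theta})\right]\right)$ (the symbol $\tilde{p}_i$ is introduced only for this argument). Keeping all factors $q_j$ with $j\neq i$ fixed, the objective as a function of $q_i$ is:
	\begin{equation*}
		\begin{split}
			\mathcal{L}(q) & = \int q_i(\bm{\theta}_i)\,\E_{q_{\setminus i}}\left[\log p(\bm{X},\bm{\theta})\right]d\bm{\theta}_i - \int q_i(\bm{\theta}_i)\log q_i(\bm{\theta}_i)d\bm{\theta}_i + \text{const}\\
			& = \int q_i(\bm{\theta}_i)\log \frac{\tilde{p}_i(\bm{\theta}_i)}{q_i(\bm{\theta}_i)}d\bm{\theta}_i + \text{const} = -\text{KL}\left(q_i||\tilde{p}_i\right) + \text{const}
		\end{split}
	\end{equation*}
	where the constant collects the entropies of the other factors and $\log Z$. As the KL-divergence is non-negative and zero only for $q_i=\tilde{p}_i$, the maximum is attained at $q_i(\bm{\theta}_i)=\tilde{p}_i(\bm{\theta}_i)$.
\end{itemize}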
\subsubsection{Example: Gaussian with mean and variance as latent variables}
\begin{itemize}
	\item We consider the following graphical model:\\
	\begin{figure}[ht!]
		\centering
		\tikz{ %
			\node[obs] (x) {$x_n$} ; %
			\node[latent, above=of x] (mu) {$\mu$} ; %
			\node[latent, right=of mu] (tau) {$\tau$} ; %
			
			\edge{mu}{x};
			\edge{tau}{x};
			\edge{tau}{mu};
			
			\plate{xz}{(x)}{$n=1,...,N$};
		}
	\end{figure}

	where
	\begin{equation*}
		\begin{split}
			p(\bm{X}|\mu, \tau) & = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu, \tau^{-1})\\
			p(\tau) & = \text{Gamma}(\tau|a_0, b_0)\\
			p(\mu|\tau) & = \mathcal{N}(\mu|\mu_0, (\lambda_0 \tau)^{-1})
		\end{split}
	\end{equation*}
	\item Now, let's assume that we approximate the posterior by $q(\tau, \mu)=q_{\tau}(\tau)q_{\mu}(\mu)$. Our objective (excluding the Lagrange multipliers) is:
	$$\mathcal{L}(q_{\mu}, q_{\tau}) = \iint q_{\mu}(\mu)q_{\tau}(\tau)\log \left[p(\bm{X}|\mu, \tau)p(\mu|\tau)p(\tau)\right]d\mu\, d\tau + H(q_{\mu}) + H(q_{\tau})$$
	\item When solving this, we will end up with:
	\begin{equation*}
		\begin{split}
			q_{\mu}(\mu) & = \mathcal{N}\left(\mu\Big\vert \frac{\lambda_0\mu_0 + N\overline{x}}{\lambda_0 + N}, \left[(\lambda_0 + N)\E_{q_{\tau}}[\tau]\right]^{-1}\right)\\
			q_{\tau}(\tau) & = \text{Gamma}\left(\tau\Big\vert a_0 + \frac{N+1}{2}, b_0 + \frac{1}{2}\E_{q_{\mu}}\left[\sum_n (x_n-\mu)^2+\lambda_0(\mu-\mu_0)^2\right]\right)\\
			\Rightarrow p(\mu, \tau|\bm{X}) &\approx q_{\mu}(\mu)q_{\tau}(\tau)
		\end{split}
	\end{equation*}
\end{itemize}
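\begin{itemize}
	\item As a sketch of how the factor $q_{\mu}$ follows from the general update $q_i(\bm{\theta}_i)\propto\exp\left(\E_{q_{\setminus i}}\left[\log p(\bm{X},\bm{\theta})\right]\right)$, we can collect all terms of the log-joint that depend on $\mu$ and complete the square (here, ``const'' gathers everything independent of $\mu$):
	\begin{equation*}
		\begin{split}
			\log q_{\mu}(\mu) & = \E_{q_{\tau}}\left[\log p(\bm{X}|\mu,\tau) + \log p(\mu|\tau)\right] + \text{const}\\
			& = -\frac{\E_{q_{\tau}}[\tau]}{2}\left[\sum_{n=1}^{N}(x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2\right] + \text{const}\\
			& = -\frac{(\lambda_0+N)\E_{q_{\tau}}[\tau]}{2}\left(\mu - \frac{\lambda_0\mu_0+N\overline{x}}{\lambda_0+N}\right)^2 + \text{const}
		\end{split}
	\end{equation*}
	The result is quadratic in $\mu$, so $q_{\mu}$ is a Gaussian whose mean and precision can be read off directly. Since $q_{\mu}$ depends on $\E_{q_{\tau}}[\tau]$ and $q_{\tau}$ depends on expectations under $q_{\mu}$, the two factors have to be updated in turns until convergence.
\end{itemize}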
\subsubsection{Combining Variational EM and Variational Bayes}
\begin{itemize}
	\item The variational EM algorithm and variational Bayes are strongly related. In fact, we can generalize the framework to combine both of them.
	\item Our goal is to optimize the parameters $\bm{\theta}$, and marginalize over latent variables $\bm{Z}$, leading to the objective:
	$$\ln p(\bm{X}) = \E_{q(\bm{Z}, \bm{\theta})}\left[\ln p(\bm{X}, \bm{Z}, \bm{\theta})\right] + \underbrace{H(q_Z) + H(q_{\theta})}_{\text{assume } q(\bm{Z},\bm{\theta})=q_Z(\bm{Z})q_{\theta}(\bm{\theta})} + \underbrace{\text{KL}\left(q_Z q_{\theta} || p(\bm{Z}, \bm{\theta}|\bm{X})\right)}_{\geq 0}$$
	\item Again, we can optimize an ELBO instead by dropping the non-negative KL term:
	$$\ln p(\bm{X}) \geq \E_{q_Z q_{\theta}}\left[\ln p(\bm{X}, \bm{Z}, \bm{\theta})\right] + H(q_Z) + H(q_{\theta}) = \mathcal{L}(q_Z, q_{\theta})$$
	\item Let $\tilde{\mathcal{L}}(q_Z, q_{\theta})$ be the objective function including the Lagrangian. We then alternate between the following steps:
	\begin{description}
		\item[$\E_{Z}$-step]: $\max_{q_Z} \tilde{\mathcal{L}}(q_Z, q_{\theta})$
		\item[$\E_{\theta}$-step]: $\max_{q_{\theta}} \tilde{\mathcal{L}}(q_Z, q_{\theta})$
	\end{description}
	\item As this is a generalization, it contains both the EM algorithm and variational Bayes as special cases. We recover the EM algorithm if we choose $q_{\theta}(\theta)=\delta(\theta'-\theta)$, where $\theta'$ is optimized $\Rightarrow$ equivalent to replacing $q_{\theta}$ with the MLE/MAP solution
	\item For variational Bayes, we can simply ignore the $Z$ part as we fully focus on $q_{\theta}$. Iterating the $\E_{\theta}$-step is then equivalent to variational Bayes
\end{itemize}
\subsection{Variational Auto-Encoder (VAE)}
\begin{itemize}
	\item VAEs are one implementation of this variational framework with neural networks, where we model the joint likelihood by:
	$$p(\bm{X}, \bm{Z}|\bm{\theta}) = \underbrace{p(\bm{X}|\bm{Z}, \bm{\theta}_2)}_{\text{decoder}}\underbrace{p(\bm{Z}|\bm{\theta}_1)}_{\text{prior}}$$
	\item The aim is to find a lower-dimensional representation $Z$ of the data $X$ (\textit{encode} $X$). For approximating the posterior, we again use $q(\bm{Z}|\bm{X},\bm{\lambda})\approx p(\bm{Z}|\bm{X}, \bm{\theta}_1, \bm{\theta}_2)$, where $q$ is now a neural network (the \textit{encoder})
	\item As a prior distribution, we use e.g. $p(\bm{Z}|\bm{\theta}_1) = \mathcal{N}(\bm{Z}|\bm{0},\bm{I})$, which encourages the latents to be independent. Furthermore, by using this prior, we can treat VAEs as a non-linear version of PCA (if we chose a Student-t distribution instead, we would get non-linear ICA)
	\item Optimize the ELBO of the log-likelihood via SGD:
	\begin{equation*}
		\begin{split}
			\mathcal{L}(\bm{\theta}) & = \ln p(\bm{X}|\bm{\theta}) = \ln \int p(\bm{X}|\bm{Z},\bm{\theta})p(\bm{Z}|\bm{\theta})d\bm{Z}\\
			\mathcal{L}(\bm{\theta}, \bm{\lambda}) & = \E_{q(Z|X,\lambda)}\left[\ln p(\bm{X}|\bm{Z},\bm{\theta}_2) + \ln p(\bm{Z}|\bm{\theta}_1)\right] + H(q(\bm{Z}|\bm{X}, \bm{\lambda}))
		\end{split}
	\end{equation*}
	\item To avoid computing the expectation as an integral, we approximate it by sampling. To make these samples differentiable with respect to $\bm{\lambda}$, we need to use the reparameterization trick:
	\begin{equation*}
		\begin{split}
			z^{(k)}\sim q(z|x, \lambda) & \Rightarrow z=g_{\lambda}(x, \epsilon), \epsilon\sim p(\epsilon)
		\end{split}
	\end{equation*}
	For a multivariate Gaussian with diagonal covariance, we have $g_{\lambda}(\bm{x}, \bm{\epsilon})=\bm{\mu}_{\lambda}(\bm{x})+\bm{\sigma}_{\lambda}(\bm{x})\odot \bm{\epsilon}, \bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$
\end{itemize}
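\begin{itemize}
	\item Putting the ELBO and the reparameterization trick together, a Monte Carlo estimate of the objective for a single data point $\bm{x}$ could look as follows. This is only a sketch: the number of samples $K$ and the sample index on $\bm{\epsilon}^{(k)}$ are notation introduced here, and the entropy is written out as $H(q)=-\E_q[\ln q]$:
	\begin{equation*}
		\begin{split}
			\mathcal{L}(\bm{\theta}, \bm{\lambda}) & \approx \frac{1}{K}\sum_{k=1}^{K}\left[\ln p(\bm{x}|\bm{z}^{(k)},\bm{\theta}_2) + \ln p(\bm{z}^{(k)}|\bm{\theta}_1) - \ln q(\bm{z}^{(k)}|\bm{x},\bm{\lambda})\right]\\
			\bm{z}^{(k)} & = \bm{\mu}_{\lambda}(\bm{x})+\bm{\sigma}_{\lambda}(\bm{x})\odot \bm{\epsilon}^{(k)}, \hspace{2mm} \bm{\epsilon}^{(k)}\sim\mathcal{N}(\bm{0},\bm{I})
		\end{split}
	\end{equation*}
	As the randomness only enters through $\bm{\epsilon}^{(k)}$, which does not depend on $\bm{\lambda}$, the estimate is a deterministic, differentiable function of $\bm{\theta}$ and $\bm{\lambda}$ for fixed noise samples and can be optimized with SGD.
\end{itemize}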

================================================
FILE: Natural_Language_Processing_1/nlp_bayesian.tex
================================================
% \section{Foundations of Bayesian NLP}
% \textbf{Foundations of Bayesian NLP is not in the exam.}

================================================
FILE: Natural_Language_Processing_1/nlp_compositional_semantic.tex
================================================
\section{Compositional semantics and discourse processing}
% \subsection{Compositional semantics}
\begin{itemize}
	\item \textbf{Principle of Compositionality}: meaning of whole phrase derivable from meaning of its parts
	\item Sentence structure conveys some meaning as well
	\begin{itemize}
		\item Different syntactic structures may have the same meaning, but similar syntactic structures can also have different meanings
	\end{itemize}
	\item Not all phrases are interpreted compositionally (e.g. \textit{kick the bucket}) but can be grouped together and viewed as one element
	\item Meaning of a single word can depend on the composition (\textit{fast} programmer vs. \textit{fast} plane, metaphors,...)
\end{itemize}
\subsection{Compositional distributional semantics}
\begin{itemize}
	\item Extending distributional semantics to phrases/sentences
	\item Unsupervised model $\Rightarrow$ general-purpose representations
	\item Model composition in vector space. However, if we modeled every sentence independently, we would get an infinite-dimensional space
\end{itemize}
\subsubsection{Vector mixture model}
\begin{itemize}
	\item Combining the vectors of all words in the sentence
	\item Mostly done either additively (summing all word vectors) or multiplicatively (element-wise product of the word vectors)
	\item Problem: does not consider word order and is therefore suitable for modelling content words (nouns, verbs, adjectives,...), but not for function words that require syntactic dependencies (pronouns, ...)
	\item Is often used as baseline
\end{itemize}
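\begin{itemize}
	\item As a small illustration (the symbols $\bm{w}_i$ and $\bm{p}$ as well as the numbers are chosen here for the example), let $\bm{w}_1,\dots,\bm{w}_k$ be the word vectors of a phrase and $\bm{p}$ the resulting phrase vector:
	\begin{equation*}
		\bm{p}_{\text{add}} = \sum_{i=1}^{k}\bm{w}_i, \hspace{5mm} \bm{p}_{\text{mult}} = \bm{w}_1\odot\bm{w}_2\odot\dots\odot\bm{w}_k
	\end{equation*}
	For $\bm{w}_{\textit{old}}=(1, 2)^T$ and $\bm{w}_{\textit{dog}}=(3, 0.5)^T$, we get $\bm{p}_{\text{add}}=(4, 2.5)^T$ and $\bm{p}_{\text{mult}}=(3, 1)^T$. Both operations are commutative, so \textit{old dog} and \textit{dog old} receive the same vector, which illustrates why word order is lost.
\end{itemize}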
\begin{figure}[ht]
	\centering
	\begin{subfigure}{0.3\textwidth}
		\includegraphics[width=\textwidth]{figures/compositional_semantic_vector_mixture_model.png}
		\caption{Vector mixture model}
	\end{subfigure}
	\begin{subfigure}{0.3\textwidth}
		\includegraphics[width=\textwidth]{figures/compositional_semantics_lexical_function_models.png}
		\caption{Lexical function model}
	\end{subfigure}
	\caption{Compositional distributional semantics}
	\label{fig:compositional_semantic_vector_mixture_model}
\end{figure}
\subsubsection{Lexical function model}
\begin{itemize}
	\item Distinguish between words whose meaning is determined by their context/distribution (e.g. nouns), and function words that are applied to the represented words as \textbf{lexical functions}
	\item Example: $\underbrace{\textit{old}}_{\text{functional}} \underbrace{\textit{dog}}_{\text{distributional}} \Rightarrow$ apply function of \textit{old} on \textit{dog}
	\item Lexical functions are parameter matrices (e.g. $\bm{A}_{\textit{old}}$) which are multiplied with the vector representation of nouns
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/compositional_semantics_lexical_function_models_adjectives.png}
	\end{figure}
	\item Adjectives that do not change the meaning of a noun have (almost) diagonal matrices, in the limit the identity matrix $\Rightarrow$ the matrix elements capture how features interact with each other given the adjective
	\item The matrices are learned by comparing the representations of plain nouns to those of adjective-noun combinations. Pseudo-code algorithm:
	\begin{enumerate}
		\item Obtain a distributional vector $n_j$ for each noun $n_j$ in the lexicon.
		\item Collect adjective noun pairs $(a_i,n_j)$ from the corpus.
		\item Obtain a distributional vector $p_{ij}$ of each pair $(a_i,n_j)$ from the same corpus using a conventional DSM.
		\item The set of tuples $\{(n_j,p_{ij})\}_j$ represents a dataset $\mathcal{D}(a_i)$ for the adjective $a_i$.
		\item Learn matrix $\bm{A}_i$ from $\mathcal{D}(a_i)$ using linear regression by minimizing:
		$$L(\bm{A}_i) = \sum\limits_{j\in \mathcal{D}(a_i)} ||p_{ij} - \bm{A}_i n_j||^2$$
	\end{enumerate} 
	\item Verbs can be represented as higher-order tensors. If only the subject is taken into consideration, a verb is a two-dimensional matrix. When also considering the object, it is a three-dimensional tensor (or of even higher order)
	\item \textit{Polysemy} (different senses of a word) is mostly handled by a single representation. We assume that the ambiguity can be resolved as long as the context is given.
	\item To identify \textit{metaphors}, two separate senses of every adjective can be learned (literal and metaphorical). We then map from literal to metaphorical by a linear transformation. 
\end{itemize}
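\begin{itemize}
	\item The linear regression objective for the adjective matrices above has a closed-form solution. As a sketch (the stacked matrices $\bm{N}$ and $\bm{P}_i$ are notation introduced here, and we assume that $\bm{N}\bm{N}^T$ is invertible), let the columns of $\bm{N}$ be the noun vectors $n_j$ and the columns of $\bm{P}_i$ the corresponding pair vectors $p_{ij}$ in $\mathcal{D}(a_i)$:
	\begin{equation*}
		\begin{split}
			L(\bm{A}_i) & = ||\bm{P}_i - \bm{A}_i \bm{N}||_F^2\\
			\frac{\partial L}{\partial \bm{A}_i} & = -2\left(\bm{P}_i - \bm{A}_i \bm{N}\right)\bm{N}^T = 0\\
			\Rightarrow \bm{A}_i & = \bm{P}_i \bm{N}^T\left(\bm{N}\bm{N}^T\right)^{-1}
		\end{split}
	\end{equation*}
	If an adjective occurs with fewer distinct nouns than there are vector dimensions, $\bm{N}\bm{N}^T$ is singular. In that case, one option would be to add a regularization term (ridge regression) or to use the pseudo-inverse.
\end{itemize}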
\subsubsection{Compositional semantics in neural networks}
\begin{itemize}
	\item Supervised learning framework $\Rightarrow$ compositional representations are fine-tuned for specific application/task
	\item Word representations are taken as input and processed within the network
	\item Example tasks include sentiment classification, paraphrasing, machine translation, ...
	\item Using recurrent and/or recursive networks (LSTMs, Tree-LSTMs, ...)
\end{itemize}
\subsection{Discourse structure}
\begin{itemize}
	\item Most documents have an implicit structure (e.g. in newspaper articles, the first sentence is a summary) or an explicit structure like sections and paragraphs
	\item There are also relationships between sentences that need to be modeled, as described in the following
\end{itemize}
\subsubsection{Rhetorical relations}
\begin{itemize}
	\item There are implicit relations between sentences. For example:\\
	\texttt{Max fell. John pushed him.}\\
	can be interpreted as \textit{explanation} (Max fell because John pushed him), or as \textit{narration} (Max fell and then John pushed him).
	\item This relation is called \textbf{discourse relation} or \textbf{rhetorical relation}
	\item \textbf{Cue phrases} indicate what kind of relation it is. In the previous examples, the cue phrases were \texttt{because} and \texttt{and then}.
	\item Analyzing a text for rhetorical relations mostly gives a binary structure: the main sentence is called the \textbf{nucleus}, and the subsidiary phrase (explanation, justification, ...) is called the \textbf{satellite}
	\item In a \textit{narration} (cue phrase \texttt{and}) both sentences have equal weight instead of nucleus vs satellite.
\end{itemize}
\subsubsection{Coherence}
\begin{itemize}
	\item Discourses need to have connectivity/context to be coherent.
	\item Otherwise, a sentence/small discourse might not make sense
	\item However, this information is mostly missing (background/world knowledge)!
	\item Assuming discourse coherence can affect interpretation, especially when dealing with pronouns
\end{itemize}
\subsubsection{Overview of factors influencing discourse interpretation}
\begin{enumerate}
	\item \textit{Cue phrases} (\texttt{because}, \texttt{and}, ...)
	\item \textit{Punctuation and text structure} (\texttt{Max fell (John pushed him), and Kim laughed.})
	\item \textit{Real world context} (\texttt{Max was falling.} \texttt{John pushed him as he lay on the ground.})
	\item \textit{Tense and aspect} (\texttt{Max was falling.} \texttt{John pushed him.})
\end{enumerate}
\begin{itemize}
	\item Discourse parsing (understanding discourse structure) is a hard task
	\item Mostly done by supervision (annotated data of about 8-10 discourses)
	\item However, \textit{surface techniques} (primitive algorithms that look at characteristic phrases, punctuation, ...) seem to work to some extent
\end{itemize}
\subsection{Referring expressions and anaphora}
\begin{itemize}
	\item To fully process a discourse, co-references/referring expressions like pronouns need to be resolved
	\item We can define the following entities for a referring expression:
	\begin{itemize}
		\item \textit{referent} - the real-world entity that is referred to
		\item \textit{referring expression} - piece of text that refers to an entity
		\item \textit{antecedent} - the text initially evoking a referent (where referent is named)
		\item \textit{anaphora} - the phenomenon of referring to an antecedent
		\item \textit{cataphora} - pronouns that appear \textit{before} their antecedent (rare)
	\end{itemize}
	\item \textbf{Pronoun resolution}
	\begin{itemize}
		\item Identifying the referents of pronouns
		\item \textit{Anaphora resolution}: in most cases, the task is limited to identifying referents that are mentioned before the actual pronoun/reference
	\end{itemize}
\end{itemize}
\subsubsection{Algorithms for anaphora resolution}
\begin{itemize}
	\item For anaphora resolution, we mostly apply a supervised training algorithm
	\item The instances in the corpus are possible pairs of pronoun and antecedent (possible antecedents include all noun phrases in the current and the last 5 sentences)
	\item The classification is binary (true if pronoun refers to this specific antecedent, otherwise false)
	\item Training data is annotated by humans
	\item Beware that there are also pronouns in the text that might have no referent at all (\textit{pleonastic pronouns})
	\item Distinguishing between \textit{hard} and \textit{soft} constraints that must be fulfilled between pronouns and antecedent
	\item \textbf{Hard constraints} : Pronoun must match in terms of tense, singular/plural, gender, ...
	\item \textbf{Soft constraints/Salience}: 
	\begin{itemize}
		\item \textit{recency} -  more recent antecedents are preferred
		\item \textit{grammatical role} - subjects might be referred to more often than objects. Also, it is preferred that entity and pronoun have the same role in the sentence (subject, object, ...)
		\item \textit{repeated mention} - entities that have been mentioned more often are preferred
		\item \textit{coherence effect} - pronoun resolution might depend on discourse relation/semantic within the sentences
	\end{itemize}
	\item Based on the hard and soft constraints, we can define features for every pronoun-antecedent pair
	\item Simple classification model takes these features as input and classifies the link as valid or not
	\item The simplest evaluation metric is link accuracy (proportion of correct links). However, it does not take into account pleonastic pronouns or chains of references, which is why multiple other metrics exist
\end{itemize}

================================================
FILE: Natural_Language_Processing_1/nlp_dialog_modelling.tex
================================================
\section{Computational Dialog Modeling}
\subsection{Modular dialog systems}
\begin{itemize}
	\item There are two main tasks in dialog modeling: either understand a conversation from outsider's view (summarizing), or the capability to take part in conversation $\Rightarrow$ make a dialog agent
	\item First approaches of modeling dialog agent were based on hand-specified patterns/transformation rules based on keywords to find an appropriate answer
	\item Recently, the focus shifted towards data-driven methods that either retrieve existing information or generate new sentences, e.g. with Encoder-Decoder architectures
	\item Problems: such systems are hard to evaluate, and they often just copy patterns from the training dataset without generalizing well.
	\item Different approach: in dialogs, there is a tendency to ascribe goals and \textbf{intentions}
	\item However, intentions are not easy to recognize. That's why such methods are often used for \textbf{task-oriented} dialog systems where the end-goal makes intentions tractable
	\item The modular dialog system architecture: 
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/dialog_modeling_modules.png}
	\end{figure}
	\item \textit{Language understanding}: NLP1 course. Morphology, POS tagging, lexical semantics and syntactic parsing, compositional semantics, ...
	\item \textit{Dialog management}: consists of two modules:
	\begin{itemize}
		\item \textit{Dialog state tracker}: handle linguistic context (what has been said) and how relevant it is to the task. We can convert messages into slots with parameters (like \texttt{request(name)}) to simplify the task. 
		\item \textit{Dialog policy}: select what action to take next/model the next answer. Estimate probabilities for possible actions, and choose best ones. Training mostly done in a reinforcement learning way with simulator
	\end{itemize}
	\item \textit{Extra-linguistic environment}: taking information into account which is not coming from this dialog itself (images, databases, ...)
\end{itemize}
\subsection{Visually grounded, task-oriented dialog}
\begin{itemize}
	\item \textbf{Visual dialog}: given an image and a history of human-dialog, answer a follow-up question 
	\item We can evaluate this task with the same metrics as for summarization and translation (BLEU, ROUGE)
	\item However, in this task the agent is thrown at random into a conversation without being able to interact
	\item \textit{Image guessing game}: one agent ($Q$) sees the original image, and the other agent ($A$) sees the image with the object highlighted that $Q$ needs to guess
	\item Current implementation is based on LSTMs with CNN encoders. Still, the models perform poorly compared to humans
\end{itemize}

================================================
FILE: Natural_Language_Processing_1/nlp_formal_grammars.tex
================================================
\section{Formal grammars and syntactic parsing}
\begin{itemize}
	\item Syntax: structure of sentence, parsing syntax to get (long-distance) dependencies of words
\end{itemize}

\subsection{Generative grammar}
\begin{itemize}
	\item Formally specified grammar that can generate all and only acceptable sentences of a natural language
	\item A phrase can be bracketed into its internal structure: \textit{((the (big dog)) slept)}
	\item Each subpart is a \textbf{constituent}: group of words/phrase behaving as a single unit 
	\item Labels can be assigned to the internal structures (for instance, \textit{the big dog} is a noun phrase)
\end{itemize}
\subsubsection{Phrases and substitutability}
\begin{itemize}
	\item Words with the same POS tag can be replaced
	\item Phrasal categories indicate which phrases can be substituted
	\item Example phrasal categories include noun phrase (NP), verb phrase (VP), prepositional phrase (PP), ...
	\item Goal: capture substitutability at phrase level by phrasal categories
\end{itemize}
\subsection{Context-Free Grammars}
\begin{itemize}
	\item Defining a grammar on rules of production, and basic lexicon
	\item Basic elements of a context-free grammar (CFG):
	\begin{enumerate}
		\item Set of non-terminal symbols (e.g. S, VP)
		\item Set of terminal symbols (i.e., the words)
		\item Set of rules, where left-hand side is single non-terminal symbol, and right side combination of non-terminal and terminal. Examples:\\
		\texttt{S -> NP VP}\\
		\texttt{V -> fish}
		\item A start symbol (here \texttt{S}) which is a non-terminal
	\end{enumerate}
	\item Exclude empty productions, like \texttt{NP->$\epsilon$}
	\item For rules of non-terminal to single word, the non-terminal represents the POS tag of this word
	\item A context free grammar can be used for either generating sentences (start with \texttt{S}, and choose rules), or for analyzing/assigning a structure to a given sentence
	\item For analyzing, the bracketed notation or a parse tree is often used to represent the structure:
	
	\texttt{(S (NP} \textit{they}\texttt{) (VP (V }\textit{fish}\texttt{)))}  (for a parse tree, \texttt{S} would be the root node with \texttt{NP} and \texttt{VP} as its children, and so on)
	\item However, the analysis of a sentence is not always unique
	\begin{itemize}
		\item \textbf{Lexical ambiguity}: a word occurs more than once in the lexicon with different POS tags
		\item \textbf{Structural ambiguity}: multiple possible analyses because of multiple rules for the same non-terminal
	\end{itemize}
\end{itemize}
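\begin{itemize}
	\item To illustrate the generation direction, the bracketed structure of \textit{they fish} from above corresponds to the following derivation. We assume here that the grammar contains the rules \texttt{S -> NP VP}, \texttt{NP -> they}, \texttt{VP -> V} and \texttt{V -> fish}, where the second and third rule are not listed explicitly above but are implied by the bracketing:
	\begin{equation*}
		\texttt{S} \Rightarrow \texttt{NP VP} \Rightarrow \textit{they}\ \texttt{VP} \Rightarrow \textit{they}\ \texttt{V} \Rightarrow \textit{they fish}
	\end{equation*}
	Each step replaces the left-most non-terminal by the right-hand side of one rule. Analyzing a sentence runs the same steps in reverse, from the words up to \texttt{S}.
\end{itemize}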
\subsection{Chart parsing with CFGs}
\begin{itemize}
	\item Increase efficiency by recording all possible rules we could apply on a sentence
	\item The \textbf{chart} is a record of all substructures that have ever been built during the parsing / stores partial results of parsing in a vector
	\item An \textbf{edge} is a data structure that represents a rule application, which includes:
	\begin{itemize}
		\item An id for referring to it
		\item The outer left and right node in the sentence of the phrase on which the rule is applied
		\item The \textit{mother} symbol (non-terminal which is on the left side of the rule)
		\item The \textit{daughters} which are the symbols on the right side of the rule (words and/or non-terminals produced by previous edges and referred to by the id)
	\end{itemize}
	\item In conclusion, a full chart for the sentence \texttt{they can fish} looks as follows:
	$$\begin{array}{ccccc}
	\text{id} & \text{left} & \text{right} & \text{mother} & \text{daughters}\\
	\hline
	1 & 0 & 1 & \texttt{NP} & \text{(they)}\\
	2 & 1 & 2 & \texttt{V} & \text{(can)}\\
	3 & 1 & 2 & \texttt{VP} & \text{(2)}\\
	4 & 0 & 2 & \texttt{S} & \text{(1 3)}\\
	5 & 2 & 3 & \texttt{V} & \text{(fish)}\\
	6 & 2 & 3 & \texttt{VP} & \text{(5)}\\
	7 & 1 & 3 & \texttt{VP} & \text{(2 6)}\\
	8 & 0 & 3 & \texttt{S} & \text{(1 7)}\\
	9 & 2 & 3 & \texttt{NP} & \text{(fish)}\\
	10 & 1 & 3 & \texttt{VP} & \text{(2 9)}\\
	11 & 0 & 3 & \texttt{S} & \text{(1 10)}\\
	\end{array}$$
	where the sentence is structured as $._0$\texttt{they}$._1$ \texttt{can}$._2$ \texttt{fish}$._3$ and rows of the chart are edges. The parsing is visualized in the figure below
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/chart_parsing_structure.png}
		\caption{Example for chart parsing. The resulting chart data structure is shown below.}
		\label{fig:chart_parsing_structure}
	\end{figure}
\end{itemize}
\subsubsection{Implementation of bottom-up parser}
\begin{itemize}
	\item A bottom-up parser starts from the left, looks at the first two connection points (i.e. the first word), and checks for a rule that can be applied to this word
	\item If a rule is found, it is added as an edge to the chart, and we start looking for rules that can be applied to the new edge (recursion!). Once no more rules can be applied to this edge and the recursion stops, we go back to the word and continue our search for applicable rules on this word
	\item After every rule has been applied, the parser moves on to the next word on the right and checks for rules that can be applied to this word \textbf{and} all other words/edges that have been processed beforehand.
	\item Only if no more rules can be applied, the parser moves on to the next word, until all words are processed
	\item A complete parse is an edge with the start symbol \texttt{S} as mother that spans from the first to the last node of the sentence
	\item Important sub-technique: \textbf{Packing}
	\begin{itemize}
		\item Due to multiple rules with same input, we can have two identical edges that are just based on different daughters
		\item Every following rule is then applied on both edges which is very inefficient
		\item Thus, with \textit{packing} we change the daughter entries to a list of possible daughter lists 
		\item For example, edges 7 and 10 from the previous example can be combined:
		$$\begin{array}{ccccc}
		\text{id} & \text{left} & \text{right} & \text{mother} & \text{daughters}\\
		\hline
		7 & 1 & 3 & \texttt{VP} & \left\{\text{(2 6), (2 9)}\right\}\\
		\end{array}$$
		\item If a new daughter list is added, no new recursion/rule application needs to be done  
	\end{itemize}
\end{itemize}
\subsection{Probabilistic parsing}
\begin{itemize}
	\item For a single sentence with 20 or more words, we can get over 1000 analyses $\Rightarrow$ how do we determine the best/most probable analysis?
	\item The traditional approach is to write the grammar rules by hand, but such grammars often fail when parsing new sentences
	\item  Current approaches: probabilistic CFG (PCFG) where every rule is augmented with a probability
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.35\textwidth]{figures/chart_parsing_prob_cfg.png}
		\caption{Probabilistic CFGs. The probabilities are normalized over all rules with same \textit{left} side.}
		\label{fig:probabilistic_cfg}
	\end{figure}
	\item The probability of a parse tree is the product of the probabilities of all the grammar rules that are used in the sentence derivation
	\item Probabilistic CFGs help with \textit{disambiguation} as we can rank all analyses by probability and just pick the best (n) one(s)
	\item Probabilities can also be used to speed up parsing (drop trees/substructures during parsing that already have a very low probability compared to other current substructures)
\end{itemize}
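\begin{itemize}
	\item As a worked example of the product rule for parse trees (the rule probabilities below are hypothetical numbers chosen for illustration, not taken from the figure), consider the tree \texttt{(S (NP} \textit{they}\texttt{) (VP (V }\textit{fish}\texttt{)))} with tree symbol $t$:
	\begin{equation*}
		\begin{split}
			p(t) & = \prod_{(X\to\alpha)\in t} p(X\to\alpha)\\
			& = p(\texttt{S}\to\texttt{NP VP})\cdot p(\texttt{NP}\to\textit{they})\cdot p(\texttt{VP}\to\texttt{V})\cdot p(\texttt{V}\to\textit{fish})\\
			& = 1.0\cdot 0.1\cdot 0.3\cdot 0.2 = 0.006
		\end{split}
	\end{equation*}
	A rule that is used several times in a tree contributes its probability once per use. In practice, one would typically sum log-probabilities instead of multiplying to avoid numerical underflow for long sentences.
\end{itemize}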
\subsubsection{Treebank PCFGs}
\begin{itemize}
	\item Instead of specifying/tuning the grammar and its corresponding probabilities by our own, we can use a large dataset of sentences with annotated parse trees
	\item This way, we implicitly get a grammar and the probabilities of each rule
	\item A \textbf{treebank} is therefore a collection of sentences annotated with constituent trees
	\item To estimate the rule probabilities, we use the maximum likelihood:
	$$p(X\to \alpha) = \frac{C(X\to \alpha )}{C(X)}$$
	where $C(X\to \alpha)$ is the number of times the rule is used in the corpus, and $C(X)$ is the number of times the non-terminal symbol $X$ appears in the treebank
\end{itemize}
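\begin{itemize}
	\item For illustration, assume a (hypothetical) treebank in which the non-terminal \texttt{VP} is expanded $C(\texttt{VP})=100$ times: 60 times as \texttt{V NP}, 30 times as \texttt{V} and 10 times as \texttt{V VP}. The maximum likelihood estimates are then:
	\begin{equation*}
		p(\texttt{VP}\to\texttt{V NP}) = \frac{60}{100} = 0.6, \hspace{3mm} p(\texttt{VP}\to\texttt{V}) = \frac{30}{100} = 0.3, \hspace{3mm} p(\texttt{VP}\to\texttt{V VP}) = \frac{10}{100} = 0.1
	\end{equation*}
	The estimates for all rules with the same left-hand side sum to one by construction. Rules that never occur in the treebank receive probability zero unless some form of smoothing is applied.
\end{itemize}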
\subsubsection{Why CFG and not finite state machines}
\begin{itemize}
	\item Language often has centre-embeddings like $A\to \alpha A \beta$ which cannot be captured by finite state automata (FSAs)
	\item However, humans limit the depth of such centre-embeddings so that we could convert those into a finite set of rules
	\item The advantage of a CFG is that we can model hierarchical structures (to capture the semantics of a sentence, we need a good internal structure such as this hierarchy)
\end{itemize}
\subsection{Dependency structures}
\begin{itemize}
	\item Context free grammars were based on phrase-structures in sentences
	\item Another possible representation of parsing sentences is using directed/asymmetric binary grammatical relations that hold among the words 
	\item A relation consists of 
	\begin{itemize}
		\item a head \texttt{H} (central word)
		\item a dependent \texttt{D}
		\item a label identifying the relation between \texttt{H} and \texttt{D}
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/dependency_parsing.png}
		\caption{Dependency parsing. All relations are directed from head \texttt{H} to dependent \texttt{D}.}
		\label{fig:dependency_parsing}
	\end{figure}
	\item It is important to prevent parsing errors as these can significantly change the semantics of a sentence
\end{itemize}

================================================
FILE: Natural_Language_Processing_1/nlp_lexical_distributional_semantics.tex
================================================
\section{Lexical and distributional semantics}
\begin{itemize}
	\item \textbf{Compositional semantics}: meaning of phrase/sentence
	\item \textbf{Lexical semantics}: meaning of individual words
\end{itemize}
\subsection{Approaches for lexical meaning}
\begin{itemize}
	\item How to represent a meaning of a word. Problem: no representation can fully capture language yet!
\end{itemize}
\subsubsection{Formal semantics}
\begin{itemize}
	\item Based on set theory, describing the features of a word
	\item Meaning postulates: $\forall x \left[\text{bachelor}'(x)\to\text{man}'(x)\wedge\text{unmarried}'(x)\right]$
	\begin{itemize}
		\item If a word is in the set \textit{bachelor}, then it also is in \textit{man} and \textit{unmarried}
	\end{itemize}
	\item Problems:
	\begin{itemize}
		\item Limited, especially for special cases (e.g. what is the Pope?)
		\item Very expensive for large corpus
		\item For some words, it is almost impossible to find a good formalization
	\end{itemize}
	\item Alternative: \textbf{Prototype theory}
	\begin{itemize}
		\item Notion of graded semantic categories with no clear boundaries
		\item No requirement that a feature must be shared by all members
		\item Certain members are more central or prototypical $\Rightarrow$ \textbf{Prototypes}
		\item New members are added based on similarity to prototypes
		\item Features and category memberships are graded
	\end{itemize}
\end{itemize}
\subsubsection{Semantic relations}
\begin{itemize}
	\item \textbf{Hyponymy}: IS-A relation, forms a taxonomy (\textit{Example}: dog is a hyponym of animal)
	\begin{itemize}
		\item Easier to construct for certain nouns, but especially hard for adjectives
	\end{itemize}
	\item \textbf{Meronymy}: PART-OF relation (\textit{Example}: arm is a meronym of body)
	\item \textbf{Synonymy}: Words can be exchanged without changing the meaning of a sentence/phrase
	\item \textbf{Antonymy}: Opposite meanings (\textit{Example}: big vs. little)
\end{itemize}
\subsubsection{WordNet}
\begin{itemize}
	\item Large-scale lexical resource for English
	\item Hand-constructed
	\item Organized in synsets: sets of synonyms
	\item Synsets are connected by semantic relations
	\item Similarity of words is the similarity of synsets
\end{itemize}

\subsection{Polysemy and word sense disambiguation}
\begin{itemize}
	\item A word can mean different things based on the sentence/context it is used in
	\item Meaning of words is not fixed, but dynamically adapted by the context
	\item \textbf{Regular polysemy}: mechanisms to apply on words to fit into context
	\begin{itemize}
		\item \textit{Zero-derivation}: verb $\leftrightarrow$ noun without changing word. Example: ``\textit{tango}''
		\item \textit{Metaphorical}: using words from a different domain to express similar meaning. Example: ``\textit{swallow information}''
		\item \textit{Metonymy}: use an entity to actually refer to other aspects of it. 
		Example: ``\textit{drinking his glass}''
	\end{itemize}
	\item \textbf{Word sense disambiguation}: derive meaning of word in context
	\begin{itemize}
		\item \textit{Supervised} (most common) $\to$ predefined list of senses (e.g. WordNet), and train model on \textit{large} corpus. Problem: we have to learn a new classifier for every word!
		\item \textit{Semi-supervised} $\to$ annotate small dataset, bootstrap from there. Might be helpful as some instances have no single/discrete meaning.
		\item \textit{Unsupervised} $\to$ induce senses by clustering word occurrences. Usually, the result is either too fine-grained or too coarse-grained for most words (only works well for frequent words)
	\end{itemize}
\end{itemize}
\subsubsection{Semi-Supervised WSD: Yarowsky algorithm}
\begin{itemize}
	\item Using bootstrapping which needs only a very small hand-labeled training set
	\item Still, learns a classifier for every word $\Rightarrow$ no generalization!
	\item Also, define features (see notion of context) for every word to determine its meaning
	\item \textbf{Algorithm iteration}: 
	\begin{enumerate}[start=0]
		\item Given a small initial seed set $\Lambda_0$ of labeled instances of each sense, and a much larger unlabeled corpus $V_0$
		\item Train classifier on $\Lambda_0$
		\item Use trained classifier to label $V_0$
		\item Select the examples in $V_0$ that the classifier is most confident on
		\begin{enumerate}
			\item Reliability of a prediction defined as $\log\left(\frac{p(a|w)}{p(b|w)}\right)$ for word $w$ and possible senses $a$ and $b$
			\item Rank reliabilities of all predictions and choose $n$ best
		\end{enumerate}
		\item Remove chosen examples from $V_0\Rightarrow V_1$, and add them to the training set $\Rightarrow\Lambda_1$
		\item Repeat steps 1) to 4) with the updated sets until:
		\begin{enumerate}
			\item Either $V_i$ is empty
			\item Or the error rate on the training/validation set is sufficiently low
		\end{enumerate}
	\end{enumerate}
	\item Reported accuracy of 95\%, but on easy homonymous examples
	\item \textbf{One sense per discourse}
	\begin{itemize}
		\item Original algorithm uses \textit{one sense per discourse} as second heuristic
		\item If a word appears twice or more often in the same text, the occurrences probably have the same meaning\\
		$\Rightarrow$ Annotate these as well and use them as additional training examples
	\end{itemize}
\end{itemize}
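\textit{Worked example for the reliability score} (the probabilities are invented for illustration, and the natural logarithm is assumed): if the classifier predicts $p(a|w)=0.9$, $p(b|w)=0.1$ for one occurrence and $p(a|w)=0.55$, $p(b|w)=0.45$ for another, then
$$\log\left(\frac{0.9}{0.1}\right)\approx 2.20 \hspace{5mm}\text{and}\hspace{5mm} \log\left(\frac{0.55}{0.45}\right)\approx 0.20$$
so the first occurrence is ranked higher and would be moved to the training set first, while the second one is close to a coin flip and stays in the unlabeled pool.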
\subsection{Distributional semantics}
\begin{itemize}
	\item Probabilistic models for semantics
	\item Distributional hypothesis about word meaning: the meaning of a word is determined by its context $\Rightarrow$ similar meanings have similar contexts
	\item Thus, distributions are a good conceptual representation if you believe that ``the meaning of a word is given by its usage'' $\Rightarrow$ corpus-dependent (reflects e.g. different cultures, domains, ...)
	\item Distributions can encode lexical- and world knowledge, but mostly only partial lexical semantics 
	\item Techniques: Count-based models and prediction models
\end{itemize}
\subsubsection{Count-based models}
\begin{itemize}
	\item Vector space models in the semantic space, where every dimension corresponds to a possible context $\Rightarrow$ features
	\item Distribution can be seen as point in space
	\item As a result, we get a feature matrix:
	$$\begin{array}{c|cccc}
	& \text{feature}_1 & \text{feature}_2 & \dots & \text{feature}_n\\
	\hline
	\text{word}_1  & f_{1,1} & f_{2,1} & \dots & f_{n,1}\\
	\text{word}_2  & f_{1,2} & f_{2,2} & \dots & f_{n,2}\\
	\vdots  & \vdots & \vdots & \ddots & \vdots\\
	\text{word}_m  & f_{1,m} & f_{2,m} & \dots & f_{n,m}\\
	\end{array}$$
	\item Possible design choices in count-based models:
\end{itemize}
\begin{enumerate}
	\item \textbf{Notion of context}: how to define the context of a word
	\begin{enumerate}
		\item \textit{Word windows}: n words on either side of the lexical item, and count occurrences of words
		\item \textit{Filtered word windows}: n words, but remove irrelevant words based on POS-tag or stop-list (don't need to extend window)
		\item \textit{Lexeme windowing}: word windows (filtered or unfiltered), but using stemming (mostly leads to more robust models)
		\item \textit{(Syntactic) dependencies}: context with dependency structure it belongs to (directed link between heads and dependents). Can be used to different extents\\
		Example: ``The prime minister acknowledged the question'' \\
		- [prime\_a 1, acknowledge\_v 1] (a for adjectives, v for verbs)\\
		- [prime\_a\_mod 1, acknowledge\_v\_subj 1] (mod for modifiers, subj for verb in relation to subject)\\
		$\Rightarrow$ Problem: complex contexts lead to sparse vectors
	\end{enumerate}
	$\Rightarrow$ Working best: small window sizes or short dependencies
	\item \textbf{Context weighting}: how to set the weights in the vector
	\begin{enumerate}
		\item \textit{Binary model}: if $c$ co-occurs with word $w$, value of entry is 1, else 0
		\item \textit{Basic frequency model}: number of times $c$ co-occurs with $w$ (possibly normalized)
		\item \textit{Characteristic model}: weights express how characteristic a given context is for a word $w$
		\item \textit{Pointwise Mutual Information (PMI)}: example of a characteristic model. Compares the probability of word and context occurring together with the probability expected if they occurred independently.
		$$PMI(w,c)=\log \frac{P(w,c)}{P(w)P(c)} = \log \frac{P(c|w)}{P(c)} \text{\hspace{5mm} where \hspace{5mm}}P(c) = \frac{f(c)}{\sum_k f(c_k)}, P(c|w) = \frac{f(w,c)}{f(w)}$$
		$$\Rightarrow PMI(w,c) = \log \frac{f(w,c)\sum_k f(c_k)}{f(w)f(c)}$$
		$PPMI\to$ only use positive values, $PPMI(w,c) =\max\left(PMI(w,c),0\right)$
	\end{enumerate}
	\item \textbf{Semantic space}: what are possible contexts
	\begin{enumerate}
		\item \textit{Entire vocabulary}: every word represents a possible context.\\
		+ All info included (also the rare one) - Inefficient (large space \& sparse), noisy
		\item \textit{Top n words with highest frequency}: \\
		+ More efficient, noise is filtered out - May miss out infrequent contexts
		\item \textit{Singular Value Decomposition}: dimension reduction by exploiting redundancies\\
		+ Very efficient, good generalization - Not interpretable (or very hard)
		\item \textit{Non-negative matrix factorization}: Similar to SVD, but performs factorization differently
	\end{enumerate}
\end{enumerate}
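\textit{Worked PMI example} (the counts are made up for illustration, and the natural logarithm is assumed): let $f(w)=100$ and $\sum_k f(c_k)=10000$, and consider a context $c$ with $f(c)=50$, $f(w,c)=10$ and a second context $c'$ with $f(c')=500$, $f(w,c')=1$. Then
$$PMI(w,c) = \log \frac{10\cdot 10000}{100\cdot 50} = \log 20 \approx 3.00, \hspace{5mm} PMI(w,c') = \log \frac{1\cdot 10000}{100\cdot 500} = \log 0.2 \approx -1.61$$
$\Rightarrow PPMI(w,c)\approx 3.00$ and $PPMI(w,c')=0$. The rare but characteristic context $c$ gets a high weight, while the frequent context $c'$, which co-occurs with $w$ less often than expected by chance, is clipped to zero.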
\subsubsection{Prediction-based models}
\begin{itemize}
	\item Train a model to predict plausible contexts for a word
	\item Learn word representations in the training process
	\item \textbf{Short dense} embeddings with \textbf{latent} dimensions 
	\begin{itemize}
		\item Easier to use as features with machine learning
		\item Better generalization than simple counting $\Rightarrow$ capturing more complex relations like synonymy
	\end{itemize}
	\item One example for prediction-based models is skip-gram, also known as word2vec (see later section)
\end{itemize}
\subsubsection{Similarity}
\begin{itemize}
	\item Definition of similarity very broad. Can include synonyms, antonyms, hyponyms, ...
	\item Measuring similarity with \textbf{Cosine} between vectors $v$ and $u$:
	$$\cos\left(\theta\right) = \frac{\sum_k v_k \cdot u_k}{\sqrt{\sum_k v_k^2} \cdot \sqrt{\sum_k u_k^2}}$$
	\item The cosine measure depends only on the angle between $v$ and $u$, and is length-independent (normalization). Important as frequent words can have longer vectors
	\item Other measures include Jaccard, Euclidean distance (vectors need to be normalized!), ...
	\item However, true synonyms do not always get higher similarity scores than near-synonyms, or even than antonyms
	\item Identifying antonyms by extra heuristics like checking words with high similarity that frequently appear together (\textit{example}: ``we serve hot and cold drinks'')
	
\end{itemize}
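\textit{Worked cosine example} (toy vectors chosen for illustration): for $v=(1,2,0)$, $u=(2,4,0)$ and $u'=(2,0,1)$ we get
$$\cos(v,u) = \frac{1\cdot 2 + 2\cdot 4 + 0\cdot 0}{\sqrt{5}\cdot\sqrt{20}} = \frac{10}{10} = 1, \hspace{5mm} \cos(v,u') = \frac{1\cdot 2 + 2\cdot 0 + 0\cdot 1}{\sqrt{5}\cdot\sqrt{5}} = \frac{2}{5} = 0.4$$
Although $u$ is twice as long as $v$ (as it could be for a more frequent word with the same distribution), the two are maximally similar, which illustrates the length-independence of the measure.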
\subsubsection{Distributional word clustering}
\begin{itemize}
	\item Cluster words based on the contexts they occur in
	\item Predefine number of clusters and corpus (for instance 2000 nouns in 200 clusters)
	\item \textbf{Features} can represent different kinds of contexts
	\begin{itemize}
		\item Window-based context, parsed or unparsed, syntactic dependencies
		\item Define notion of context, context weighting, and semantic space 
		\item Feature representation can significantly influence performance
	\end{itemize}
	\item Clustering algorithm: K-means
	\begin{itemize}
		\item Given dataset with $N$ points and task of $K$ clusters, minimize sum of squares of distance of each data point to its closest cluster mean $\bm{\mu}_i$:
		$$\arg\min_C \sum\limits_{i=1}^{K}\sum\limits_{\bm{x}\in C_i} ||\bm{x} - \bm{\mu}_{i}||^2$$
	\end{itemize}
	\item Small context sizes (small windows, syntactic dependencies) lead to clusters of synonyms (words that can be replaced)
	\item Large context sizes lead to \textbf{topical similarity} (words belonging to the same topic)
\end{itemize}

\subsubsection{Skip-gram}
\begin{itemize}
	\item Sometimes referred to as \textit{word2vec} because it is implemented in this package
	\item Given a word $w_t$, predict neighbouring words in a context window of $2L$ words (for $L=2$: $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$) 
	\item In skip-gram, we learn two representations for every word $w_j \in V$:
	\begin{itemize}
		\item \textbf{word embedding} $v$ in word matrix $W$
		\item \textbf{context embedding} $c$ in context matrix $C$ (word in the role as context for other words)
	\end{itemize}
	\item To learn these embeddings, we take every word $w_t$ in the corpus (index $j$ in the vocabulary), and try to predict its neighbours $w_{t+1}, \dots$, where we denote the index of the neighbour in the vocabulary by $k$: 
	$$p\left(w_k | w_j\right)$$
	\item The idea in skip-gram is that we compute this probability from the similarity between the words $w_k$ and $w_j$, where we use the context matrix $C$ for $w_k$ and the word matrix $W$ for $w_j$ (see figure below)
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/skip_gram_matrices.png}
		\caption{Skip gram overview}
		\label{fig:skip_gram_matrices}
	\end{figure}
	\item Similar to the cosine similarity, we use the dot product for calculating this:
	$$\text{Similarity}(c_k, v_j)\propto c_k\cdot v_j$$
	\item To normalize and get probability distribution over contexts, we use the softmax function:
	$$p\left(w_k|w_j\right)=\frac{\exp\left(c_k \cdot v_j\right)}{\sum_{i\in V} \exp\left(c_i \cdot v_j\right)}$$
	\item For the learning process, we start with randomly initialized vectors, and try to maximize the log-likelihood of the dataset (by performing SGD or similar):
	$$\arg\max \sum\limits_{\left(w_j, w_k\right)\in D} \log p\left(w_k|w_j\right) = \arg\max \sum\limits_{\left(w_j, w_k\right)\in D}  \left(c_k \cdot v_j - \log \sum\limits_{i \in V} \exp\left(c_i \cdot v_j\right)\right)$$
	\item We can also represent skip-gram as a (neural) network:
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/skip_gram_nn.png}
		\caption{Skip gram as neural network. The weights of the first layer represent the word embedding $W$, whereas the context embedding $C$ is in the second layer.}
		\label{fig:skip_gram_nn}
	\end{figure}
	\item However, the problem here is that for a large vocabulary size, the denominator of the softmax is very expensive to calculate $\Rightarrow$ \textbf{negative sampling}
	\begin{itemize}
		\item Approximate denominator by sampling $k$ random words from vocabulary (probability of sampling for a word is mostly connected to its unigram probability/frequency in the corpus like $\text{count}^{\alpha}(w)$ with for example $\alpha=0.75$)
		\item The dataset therefore consists of word pairs which are either positive or negative examples for context + word (note that we do not distinguish between the probability calculation of $w_{t-2}$ and $w_{t-1}$ for example)
		\item We convert the classification task into predicting whether a context pair is a positive or negative example from the corpus
		$$p\left(+|w_j, w_k\right) = \sigma(c_k\cdot v_j) = \frac{1}{1 + \exp(-c_k\cdot v_j)}$$
		$$p\left(-|w_j, w_k\right) = 1 - p\left(+|w_j, w_k\right) = \frac{1}{1 + \exp(c_k\cdot v_j)}$$
		$$\Rightarrow \arg\max \sum\limits_{\left(w_j, w_k\right)\in D_{+}} \log p\left(+|w_j,w_k\right) + \sum\limits_{\left(w_j, w_k\right)\in D_{-}} \log p\left(-|w_j,w_k\right) $$
	\end{itemize}
	\item Embeddings capture \textbf{analogies}: \textit{a} is to \textit{b} as \textit{c} is to \textit{d}
	\begin{itemize}
		\item Due to similarity, we can use the offsets to find the appropriate word $d$:
		$$a-b\approx c-d \Rightarrow \hat{d} = \arg\max_{d' \in V} \cos\left(a-b, c-d'\right)$$
	\end{itemize}
	\item Word2vec is often used as initialization/pretraining for other tasks. Reasons:
	\begin{itemize}
		\item Will help the model to start from an informed position
		\item Only needs a plain text corpus without any annotation
		\item Is very fast and pretrained versions are also available on the internet
		\item Best performance can be achieved by fine-tuning the weights afterwards
	\end{itemize}
\end{itemize}
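\textit{Derivation sketch} (added for illustration, using the softmax and negative-sampling definitions of this section): differentiating the log-likelihood of a single pair $(w_j, w_k)$ with respect to the word embedding $v_j$ shows why the full softmax is expensive:
$$\frac{\partial}{\partial v_j} \log p\left(w_k|w_j\right) = c_k - \sum_{i\in V} p\left(w_i|w_j\right) c_i$$
The sum runs over the whole vocabulary for every training pair. With negative sampling, the corresponding gradients only involve the sampled pairs:
$$\frac{\partial}{\partial v_j} \log p\left(+|w_j, w_k\right) = \left(1 - \sigma(c_k\cdot v_j)\right) c_k, \hspace{5mm} \frac{\partial}{\partial v_j} \log p\left(-|w_j, w_k\right) = -\sigma(c_k\cdot v_j)\, c_k$$
Hence one update costs on the order of $k+1$ dot products instead of $|V|$. Positive pairs pull $v_j$ towards $c_k$, negative pairs push it away, and in both cases the step is larger the more wrong the current prediction is.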

================================================
FILE: Natural_Language_Processing_1/nlp_morphology.tex
================================================
\section{Morphology and finite state techniques}
\begin{itemize}
	\item Morphology concerns the \textbf{structure of words}
	\item \textit{Morpheme}: minimal information carrying unit in a word. A word consists of morphemes
	\item \textit{Affix}: Morphemes that only occur in conjunction with other morphemes
	\item \textit{Stem}: a word is made up of a stem and zero or more affixes. Stems are therefore stand-alone morphemes
	\item There are different forms of affixes that describe where they attach (prefix, suffix, infix, ...)
	\item An affix is productive if it applies in general and therefore also probably for new words
	\item \textbf{Inflectional morphology}
	\begin{itemize}
		\item Fills predefined slots in a paradigm, such as plural, tense, ... (creates different grammatical forms, but the word stays the same)
		\item Fully productive, except irregular forms
		\item Inflectional affixes are not combined in English
	\end{itemize}
	\item \textbf{Derivational morphology}
	\begin{itemize}
		\item Forming a new word through affix (also change of meaning possible)
		\item May change POS tag
		\item Examples include \textit{anti-}, \textit{re-}, \textit{-ism}, \textit{-ist} (``reset'' vs. ``set'')
		\item Generally semi-productive (applies for only subset of words in language)
		\item Include \textit{zero-derivation}: word that is both verb and noun, e.g. ``text'' vs. ``(to) text (someone)''
	\end{itemize}
	\item Ambiguities in terms of morphemes (single stems or affixes are ambiguous like ``dog'') or structure (combination of affixes/stem like ``shorts'' vs ``short-s'')
	\item \textbf{Bracketing}
	\begin{itemize}
		\item Starting from the stem, find the combination of nearby affixes that still lead to a possible form
		\item Example \textit{un-ion-ise-ed}. Putting \textit{un-} and \textit{ion} together not possible as this forms a non-valid word (union would be different stem). $\Rightarrow$ \textit{un-(ion-ise)-ed}
		\item Next, adding the \textit{-ed} ending is valid, and finally concatenating it with \textit{un-}: \textit{(un-((ion-ise)-ed))}
	\end{itemize}
\end{itemize}
\subsection{Applications of morphological processing in NLP}
\begin{itemize}
	\item We can use morphology to create a full-form lexicon (lexicon with each form of every word in it). However, this tends to explode very fast (high redundancy) and is not scalable for new words
	\item \textbf{Stemming}: use rules to get the stem form of a word. This allows us to match words to a small set of base words
	\item \textbf{Lemmatization}: Only finds the split into stem and affixes. It is the preprocessing step before parsing (understanding the word!)
	\item Morphological processing can be either analysis or generation
	\item Possible aspects/steps of morphological processing
	\begin{enumerate}
		\item Surface/ground-word mapped to stem(s) and affixes. Either by declaring the affixes (\textit{ping-ed}) or by explicitly saying which rule was applied (\textit{ping} \textit{PAST\_VERB})
		\item After knowing the affixes, analyze internal structure by bracketing
		\item Finally, understand syntactic and semantic effects where parsing can filter results of previous stages
	\end{enumerate}
	\item Overall, we need a lexicon combining three aspects:
	\begin{itemize}
		\item affixes (with the associated information they carry)
		\item irregular forms
		\item stems (with syntactic categories)
	\end{itemize}
\end{itemize}
\subsection{Spelling rules}
\begin{itemize}
	\item English morphology is essentially concatenative
	\item English spelling rules can be described independently of the particular stems and affixes involved. It simply looks at the affix boundaries.
	\item Example spelling rule for e-insertion:
	$$\epsilon \to \text{e}/\left\{\begin{array}{c}
	\text{s}\\
	\text{x}\\
	\text{z}
	\end{array}\right\}\hat{\text{ }}\textunderscore s$$
	Here, the formula is interpreted as ``an empty string maps to e if an s, x or z is followed by an affix boundary and the s of the next affix'' (where e is inserted in the underscore space), e.g. box\textasciicircum{}s $\to$ boxes
	\item Finite state machines (or transducers, which also produce corresponding output while parsing) can be used to implement spelling rules
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/morphology_FST.png}
	\end{figure}
	\item Each transition corresponds to a pair of characters. 
	\item When the transducer is run in analysis mode, the system can detect an affix boundary (where we look up the stem and affix in the corresponding lexicon)
	\item In generation mode, we can just put in our parsed version and generate the correct spelling
	\item Morphology systems are usually implemented so that there is one FST per spelling rule and these operate in parallel
	\item However, FSTs are not applicable to internal structure, as, for example, bracketing cannot be modelled
	\item A system which generates invalid output/accepts invalid derivations is said to \textbf{overgenerate}.
\end{itemize}

================================================
FILE: Natural_Language_Processing_1/nlp_pos_tagging.tex
================================================
\section{Language models and part-of-speech tagging}
\subsection{Probabilistic language modeling}
\begin{itemize}
	\item The Naive Bayes approach considers words as independent. However, we can also model word sequences using statistical techniques (based on context/semantic and syntax)
	\item \textbf{Corpus}: text collected for some purpose (e.g. movie reviews)
	\begin{itemize}
		\item A corpus is \textit{balanced} if it represents different genres (types of text/domain)
		\item A \textit{tagged} corpus has annotations regarding the POS tags (mostly POS tags are learned unsupervised, but with this data we enable supervised training methods)
	\end{itemize}
	\item Use of language modeling
	\begin{itemize}
		\item In speech recognition, it is hard to distinguish between words that sound similar. Thus, language modeling is used to rank the hypotheses of the recognition system to estimate which phrase is most likely being said
		\item Language modeling can also be used for word prediction, text entry, spelling correction, ...
	\end{itemize}
\end{itemize}
\subsubsection{N-gram models}
\begin{itemize}
	\item Modeling a sequence of $n$ words
	\item \textbf{Bigram} ($n=2$): 
	\begin{itemize}
		\item use only previous word $\Rightarrow$ $p(w_n|w_{n-1})$
		\item Probability of a sequence can be expressed by $p(w_1,...w_N)=\prod\limits_{n=1}^{N} p(w_n|w_{n-1})$
		\item Still, we assume that words with distance of more than 1 are independent
		\item Estimating the probabilities by the maximum likelihood solution: $$p(w_n|w_{n-1})=\frac{c(w_n, w_{n-1})}{\sum_{w_k} c(w_k, w_{n-1})}=\frac{c(w_n, w_{n-1})}{c(w_{n-1})}$$
		Thus, we normalize the counts over all possible next words $w_k$ that follow the previous word $w_{n-1}$
	\end{itemize}
	\item \textbf{Trigram} ($n=3$):
	\begin{itemize}
		\item The probability of a word is based on the two previous words $\Rightarrow$ $p(w_n|w_{n-1},w_{n-2})$
		\item Again, the probability of the sequence is $p(w_1,...w_N)=\prod\limits_{n=1}^{N} p(w_n|w_{n-1},w_{n-2})$
	\end{itemize}
	\item Problems with sparse data
	\begin{itemize}
		\item \textbf{Smoothing}: for smoothing, we add a small extra probability for rare and unseen events to prevent probabilities of zero. E.g. for bigram:
		$$p(w_n|w_{n-1})=\frac{c(w_n, w_{n-1}) + \kappa}{c(w_{n-1}) + |V|\cdot \kappa }$$
		Simple to implement, but only suitable if there are few unseen events (higher-order $n$-grams have a lot)
		\item \textbf{Backoff}: If we have good evidence of a long phrase, we use a higher-order $n$-gram model (for example trigram). Otherwise, if the phrase has not been seen yet, we go to the next smaller model (here bigram) and check its probability. If it is also unknown there, back off further until reaching the unigram.
		\item \textbf{Interpolation}: combine the probability estimations of all models. For example we can use linear interpolation where we weight every model with a parameter:
		$$p(w_n|w_{n-1},w_{n-2}) = \lambda_1 p(w_n) + \lambda_2 p(w_n|w_{n-1}) + \lambda_3 p(w_n|w_{n-1}, w_{n-2})$$
		The parameters $\lambda_i$ need to sum up to 1 and are optimized on a small held-out subset of the training data. 
		\item \textbf{Unknown word tag}: use an unknown-word tag which also occurs in the training set. Replace all unknown words in the (test) text by this tag
	\end{itemize}
	\item Another limitation of $n$-gram models is that long-term dependencies cannot be captured efficiently
	\item \textit{Evaluation} of $n$-gram models
	\begin{itemize}
		\item \textbf{Intrinsic evaluation}: evaluate directly on test set designed for the task with a metric
		\begin{itemize}
			\item A suitable metric is for example \textit{perplexity} which is the inverse probability of the test dataset normalized by number of words $N$:
			$$PP(W) = \left(p(w_1,...,w_N)\right)^{-1/N}$$
			\item For bigram, this would be $PP(W)=\left(\prod_{n=1}^{N}p(w_n|w_{n-1})\right)^{-1/N}$
			\item The goal is to minimize perplexity (lower perplexity indicates better model)
			\item However, perplexity strongly relies on the similarity of training and test dataset and is therefore not comparable across different datasets
		\end{itemize}
		\item \textbf{Extrinsic evaluation}: evaluation in the context of an external task, e.g. speech recognition or word prediction
		\begin{itemize}
			\item Better, but very time consuming
			\item Hybrid approaches compare own models by perplexity, and apply the best model in extrinsic environment (external task)
		\end{itemize}
	\end{itemize}
\end{itemize}
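\textit{Worked example for smoothing and perplexity} (all counts are invented for illustration): suppose $c(w_{n-1})=20$, $c(w_n, w_{n-1})=5$ and $|V|=1000$. The maximum likelihood estimate is $p(w_n|w_{n-1}) = 5/20 = 0.25$. With smoothing we get
$$\kappa=1:\hspace{3mm} \frac{5+1}{20+1000\cdot 1} = \frac{6}{1020}\approx 0.0059, \hspace{8mm} \kappa=0.01:\hspace{3mm} \frac{5+0.01}{20+1000\cdot 0.01} = \frac{5.01}{30}\approx 0.167$$
whereas a bigram that was never seen after $w_{n-1}$ receives $1/1020\approx 0.00098$ and $0.01/30\approx 0.00033$, respectively. A large $\kappa$ therefore moves most of the probability mass to unseen events when the vocabulary is large, which is why $\kappa$ is usually chosen small. For perplexity, a model that assigns probability $1/10$ to each of the $N$ test words yields
$$PP(W) = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = 10$$
so perplexity can be read as the average number of equally likely words the model is choosing between.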
\subsection{Part-of-speech tagging}
\begin{itemize}
	\item Tag every word with its part of speech (verb, noun, ... $\Rightarrow$ ambiguity)
	\item The tags are taken from a tagset which uses standardized codes for fine-grained POS
	\item \textit{Benefits} of POS tagging
	\begin{itemize}
		\item First step towards syntactic analysis (is very fast, but simpler than full syntax parsing)
		\item POS tags can be useful features for application
	\end{itemize}
	\item Problem of ambiguity: most high-frequency words have more than one POS tag. Languages with rich morphology (significant affixes) tend to have less ambiguity, as the affixes distinguish the POS better
\end{itemize}
\subsubsection{Tagging strategies}
\begin{itemize}
	\item Simplest strategy: assign to each word its most common tag (also called unigram tagging). Already gives a strong baseline
	\item \textbf{Hidden Markov models}
	\begin{enumerate}
		\item Start with untagged text
		\item Assign to the words all their possible POS tags
		\item Find the most probable sequences of tags given sequences of words
		$$\hat{t}^{n} = \arg\max_{t^{n}} p(t^{n} | w^{n}) = \arg\max_{t^{n}} p(w^{n}|t^{n})\cdot p(t^{n}) $$
	\end{enumerate}
	\item If we apply for example bigram in this model, we get:
	\begin{equation*}
	\begin{split}
	p(t^{n}) & \approx \prod_{i=1}^{n} p(t_i | t_{i-1})\\
	p(w^{n}|t^{n}) & \approx \prod_{i=1}^{n} p(w_i | t_i)\\
	\hat{t}^{n} & = \arg\max_{t^{n}} \prod_{i=1}^{n}p(w_i | t_i)p(t_i | t_{i-1})
	\end{split}
	\end{equation*}
	\item Actual systems use trigrams. Smoothing and backoff are important (fewer unknown open class words)
	\item Evaluation by percentage of correct tags (but using most common tag already gives 90\% accuracy. With trigram about 97\%)
	\item Common errors
	\begin{itemize}
		\item Difference between country ``Turkey'' and bird ``turkey'' (it decides based on whether an \textit{a} is in front of turkey or not)
		\item Because of smoothing, the phrase ``have hope'' can end up with both words tagged as verbs, although ``hope'' is not in a past form here, which makes this reading ungrammatical!
	\end{itemize}
\end{itemize}
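\textit{Worked example for the bigram HMM tagger} (all probabilities are invented for illustration and $t_0=\text{START}$ is an assumed start symbol): for the words ``they fish'' we compare the tag sequences (PRON, VERB) and (PRON, NOUN) with $p(\text{PRON}|\text{START})=0.3$, $p(\text{they}|\text{PRON})=0.1$, $p(\text{VERB}|\text{PRON})=0.5$, $p(\text{NOUN}|\text{PRON})=0.01$, $p(\text{fish}|\text{VERB})=0.001$ and $p(\text{fish}|\text{NOUN})=0.005$:
$$\text{(PRON, VERB)}:\hspace{3mm} 0.1\cdot 0.3\cdot 0.001\cdot 0.5 = 1.5\cdot 10^{-5}, \hspace{8mm} \text{(PRON, NOUN)}:\hspace{3mm} 0.1\cdot 0.3\cdot 0.005\cdot 0.01 = 1.5\cdot 10^{-6}$$
Although ``fish'' on its own is more likely emitted by a noun in this toy setting, the tag transition probability makes (PRON, VERB) the more probable sequence, which is exactly the context information that unigram tagging lacks.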

================================================
FILE: Natural_Language_Processing_1/nlp_summarization.tex
================================================
\section{Language generation and summarization}
\begin{itemize}
	\item Most tasks/methods until now have concentrated on language analysis. Next coming: tasks where we have to generate text
	\item Generation mostly has the starting point at semantic representation like distributional semantics or hidden representation for neural networks
	\item We can also concentrate on \textit{regeneration} where we convert input to another representation. Examples include summarization, translation, ...
	\item For generation, there are various subtasks (e.g. content selection, discourse structuring, ...)
	\item Approaches for generation include:
	\begin{itemize}
		\item \textit{Templates}: fixed text that has slots that can be filled
		\item \textit{Statistical}: using machine learning
		\item \textit{Deep Learning}: using deep embeddings, especially for regeneration task
	\end{itemize}
\end{itemize}
\subsection{Text Summarization}
\begin{itemize}
	\item Task: generate short version of input text with important points
	\item We distinguish between \textbf{single-document summarization} (given a single document, produce summary with important points) and \textbf{multi-document summarization} (given a set of documents, produce brief summary of combination)
	\item Also, we differentiate between \textbf{generic summarization} (identifying important parts by itself and present these) and \textbf{query-focused summarization} (regarding a query/question from the user, find relevant parts in the document/s)
	\item There are two main approaches:
	\begin{itemize}
		\item \textbf{Extractive summarization}: extract important info from document by copying sentences and combine them into a summary
		\item \textbf{Abstractive summarization}: interpret content of document and generate completely new sentences (much harder task!)
	\end{itemize}
	\item Most approaches deal with extractive summarization as it is much easier to realize and has achieved better results so far
\end{itemize}
\subsubsection{Extractive summarization}
\begin{itemize}
	\item For extractive summarization, there are three main steps:
	\begin{enumerate}
		\item \textbf{Content selection}: identify important parts/sentences from the document 
		\item \textbf{Information ordering}: order the sentences within the summary
		\item \textbf{Sentence realization}: optimizing the text by e.g. sentence simplification
	\end{enumerate}
	\item Approaches for \textit{content selection}
	\begin{itemize}
		\item \textit{Unsupervised}: 
		\begin{itemize}
			\item Take those words that are used significantly more often than in other documents on average $\Rightarrow$ these are the ``informative'' words and mostly biased towards names/cities (pronoun resolution important to find these references as well)
			\item Measured by metrics like \texttt{tf-idf}
		\end{itemize}
		\item \textit{Supervised}:
		\begin{itemize}
			\item Large training corpus with human summary needed
			\item Sentences of summary are aligned with those in the original document, and features are extracted (position in document, sentence length, informative words, ...)
			\item Based on these features, we train a binary classifier whether a sentence should be included in the summary or not
			\item Problem: expensive to generate all this data, and the supervised approaches did not significantly outperform the unsupervised ones
		\end{itemize}
	\end{itemize}
	\item Approaches for \textit{information ordering}:
	\begin{itemize}
		\item For a single document, the sentences are mostly structured in the order they occur in the original document
	\end{itemize}
\end{itemize}
\subsubsection{Query-focused multi-document summarization}
\begin{itemize}
	\item For query-focused multi-document summarization, we need to extend the extractive summarization by two pre-processing steps:
	\begin{enumerate}
		\item Find a set of relevant documents
		\item (Optionally) simplify sentences in the documents (to make the task of content selection easier)
		\item \textit{Content selection}: identify informative sentences in the documents (much harder than for the single-document task)
		\item \textit{Information ordering}: order the sentences in the summary
		\item \textit{Sentence realization}: modify sentences to get consistent summary
	\end{enumerate}
	\item Approaches for \textit{sentence simplification}
	\begin{itemize}
		\item Parse sentences and apply hand-rules what parts of a sentence we might drop (initial adverbials as ``for example'', irrelevant attribute clauses, ...)
		\item Also possible to train a classifier to identify satellites (non-informative parts attached to a nucleus phrase)
	\end{itemize}
	\item Approaches for \textit{content selection} for multiple documents
	\begin{itemize}
		\item We can either combine all documents into one, or retrieve information from all documents separately and weight these documents
		\item Estimate informativeness similarly to single-document
		\item Then, start by adding the most informative sentences in summary (one by one) until the maximum length of the summary is reached
		\item When adding new sentences, we need to make sure that not the same/very similar sentences from different documents are added $\Rightarrow$ \textbf{Maximum marginal relevance}
		\begin{itemize}
			\item Iterative method to determine best sentence to add to summary. Relies on two opposing measures:
			\item \textit{Relevance to query}: high cosine similarity between a sentence and the query indicates a high relevance for the summary
			\item \textit{Novelty regarding the summary so far}: low cosine similarity between sentences and summary
			\item Estimated score is calculated as follows (for query $Q$, summary $S$, documents $D$):
			$$\hat{s} = \arg\max_{s_i \in D} \left[\lambda \text{sim}\left(s_i, Q\right) - \left(1 - \lambda\right)\max_{s_j \in S}\text{sim}\left(s_i, s_j\right)\right]$$
		\end{itemize}
	\end{itemize}
	\item Approaches for \textit{sentence ordering}
	\begin{itemize}
		\item \textit{Chronologically}: for example by date of document
		\item \textit{Coherence}: sentences that are similar/discuss same entity should be grouped together in the summary
		\item \textit{Topical ordering}: learns set of topics present in documents (by e.g. topic modeling), and then order the sentences by topic
	\end{itemize}
\end{itemize}
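\textit{Worked example for maximum marginal relevance} (similarity values are invented for illustration, with $\lambda=0.7$): assume two candidate sentences, where $s_1$ has $\text{sim}(s_1,Q)=0.8$ but is almost a duplicate of a sentence already in the summary, $\max_{s_j\in S}\text{sim}(s_1,s_j)=0.9$, and $s_2$ has $\text{sim}(s_2,Q)=0.6$ with $\max_{s_j\in S}\text{sim}(s_2,s_j)=0.1$. The scores are
$$s_1:\hspace{3mm} 0.7\cdot 0.8 - 0.3\cdot 0.9 = 0.29, \hspace{8mm} s_2:\hspace{3mm} 0.7\cdot 0.6 - 0.3\cdot 0.1 = 0.39$$
so $s_2$ is added next even though $s_1$ is more relevant to the query. A larger $\lambda$ would shift the balance back towards pure relevance.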
\subsubsection{Summarization using neural networks}
\begin{itemize}
	\item We can apply neural networks for the task of summarization
	\item For extractive summarization, we train an RNN on word level creating a representation of words, and an RNN on sentence/document level that combines sentence embeddings
	\item Apply a classifier on the outputs of the document-level RNN to decide whether to include each sentence in the summary or not (problem: still captures coarse-grained features)
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/summarization_rnn.png}
		\caption{Summarization by RNNs}
		\label{fig:summarization_rnn}
	\end{figure}
	\item Abstractive summarization can be realized by large newspaper datasets where for a small article, a headline must be predicted
	\item We use an Encoder-Decoder architecture (seq2seq models) where the encoder generates fixed-size embedding, and the decoder generates word-by-word output given this representation (decoder is autoregressive as it takes own output back as input for next time step)
\end{itemize}
\subsubsection{Evaluating summarization models}
\begin{itemize}
	\item Human judgments of quality are too expensive
	\item Better, automatic method: \textbf{ROUGE} (Recall-Oriented Understudy for Gisting Evaluation)
	\item We compare a few human-generated summaries $R_1, ..., R_N$ with the system generated summary $S$ by computing the percentage of $n$-grams from the reference summaries $R_1,...,R_N$ that occur in $S$. Example: ROUGE-2 (using bigram):
	$$\frac{\sum_{R_i}\sum_{bigram_j\in R_i} \text{count}_{\text{match}}(j,S)}{\sum_{R_i}\sum_{bigram_j\in R_i} \text{count}(j, R_i)}$$
	\item Note that summary length is not considered here 
	\item Example for calculating ROUGE metric:
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/summarization_rogue_example.png}
		\label{fig:summarization_rogue_example}
	\end{figure}
	% summarization_rogue_example.png
\end{itemize}
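\textit{Worked ROUGE-2 example} (toy sentences chosen for illustration, with a single reference): let the reference be $R_1=$ ``the cat sat on the mat'' and the system summary $S=$ ``the cat lay on the mat''. The reference contains the five bigrams (the cat), (cat sat), (sat on), (on the), (the mat), of which (the cat), (on the) and (the mat) also occur in $S$:
$$\text{ROUGE-2} = \frac{3}{5} = 0.6$$
Since the denominator only counts bigrams of the reference, appending further text to $S$ can never lower this score, which is the recall orientation and the reason why summary length is not penalized.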

================================================
FILE: Natural_Language_Processing_1/nlp_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\definecolor{colkeyword}{rgb}{0,0.4,0}
\definecolor{colname}{rgb}{0.4,0.4,0}
\definecolor{coltype}{rgb}{0.4,0,0.4}
\definecolor{coloperators}{rgb}{0,0,1.0}
\definecolor{colscopes}{rgb}{0.4,0,0}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Natural Language Processing 1}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{nlp_morphology.tex}
\input{nlp_pos_tagging.tex}
\input{nlp_formal_grammars.tex}
\input{nlp_lexical_distributional_semantics.tex}
\input{nlp_compositional_semantic.tex}
\input{nlp_textual_entailment_paraphrasing.tex}
\input{nlp_dialog_modelling.tex}
\input{nlp_summarization.tex}
\input{nlp_translation.tex}
\input{nlp_bayesian.tex}
\appendix
% \newpage
% \input{nlp_appendix.tex}
\end{document}

================================================
FILE: Natural_Language_Processing_1/nlp_textual_entailment_paraphrasing.tex
================================================
\section{Textual Entailment and Paraphrasing}
\begin{itemize}
	\item Textual entailment is defined as a directional relationship of text $T$ to hypothesis $H$
	\item We say $T$ entails $H$ if the meaning of $H$ can be inferred from the meaning of $T$
	\item The task of recognizing textual entailment (RTE) is to classify whether a pair of sentences stands in an entailment relation or not (binary classification). It can be used in different settings:
	\begin{itemize}
		\item \textit{Question-Answering}: A question answering system generates $n$ candidate solutions. The textual entailment recognizer must now decide which candidate solution is correct
		\item \textit{Summarization}: A summarization system sequentially generates new sentences to add to the summary in progress. The textual entailment recognizer should now identify whether a new sentence contains information that is already in the summary or not (redundancy checker).
	\end{itemize}
\end{itemize}
\subsection{Levels of Representation}
\begin{itemize}
	\item Determining the equivalence of the meaning of $T$ and $H$
	\item The representation of the $T$-$H$ pair is used to train a supervised model
	\item There are different levels of representation that can be used (all having their own benefits and drawbacks)
	\item \textbf{Lexical level}
	\begin{itemize}
		\item Looks solely at the words used in $T$ and $H$ (basically the BoW of both sentences)
		\item Compares the used words for similarity (are the words of $H$ contained in $T$?)
		\item Problem: structure of $H$ and $T$ cannot be fully captured by BoW
	\end{itemize}
	\item \textbf{Structural level}
	\begin{itemize}
		\item Build up syntactic structure (like parse tree from context-free grammar or dependency graph) for $T$ and $H$
		\item If $T$ contains the same structures as $H$ (e.g. certain dependency edges, subtrees, ...), we predict that $T$ entails $H$
		\item However, it is hard to distinguish which edges should contribute to similarity and which not 
	\end{itemize}
	\item \textbf{Semantic level}
	\begin{itemize}
		\item The idea is to label words/phrases with semantic role in sentence
		\item Words are grouped into \textit{arguments} (such as person or place) and connected to \textit{predicates} (mostly verbs)
		\item Now we check whether semantic connections of $H$ are in $T$ or not
		\begin{figure}[ht]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/text_entailment_semantic_level.png}
		\end{figure}
	\end{itemize}
	\item \textbf{Knowledge Acquisition for RTE}
	\begin{itemize}
		\item To decide some textual entailments, background/world knowledge is required (which words are synonyms, what is connected to a certain noun such as a person or place, ...)
		\item Knowledge is mostly constrained to lexical-semantic relations between two words (synonymy, hyponymy, ...)
		\item But we can also model more complex relations like $X$ causes $Y$ $\implies$ $Y$ is a \textit{symptom} of $X$
		\item Such connections/knowledge can be retrieved from WordNet, Wikipedia, ...
		\item This leads to the \textbf{Extended Distributional Hypothesis}: if two paths occur in similar contexts, the meanings of the paths tend to be similar ($X$ \textit{solves} $Y$ compared to $X$ \textit{is a solution of }$Y$)
	\end{itemize}
\end{itemize}
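\begin{itemize}
	\item One simple way to make the lexical level concrete is a word-overlap score. This is a sketch under assumptions: the score, the threshold $\theta$ and the sentences are illustrative and not necessarily what the lecture or the reported baseline used:
	$$\text{overlap}(T,H) = \frac{|\text{words}(H)\cap\text{words}(T)|}{|\text{words}(H)|}, \hspace{4mm} \text{predict entailment if } \text{overlap}(T,H)\geq\theta$$
	For $T=$ ``a man is playing a guitar on stage'' and $H=$ ``a man is playing an instrument'', the unique words of $H$ are \{a, man, is, playing, an, instrument\}, of which \{a, man, is, playing\} occur in $T$, so $\text{overlap}(T,H)=4/6\approx 0.67$. The score misses that \textit{guitar} is a kind of \textit{instrument}, which is exactly the kind of background knowledge discussed above.
\end{itemize}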
\subsection{Recognizing Text Entailment Methods}
\begin{itemize}
	\item RTE methods depend on the representation that is used for $T$ and $H$
	\item Different approaches to model the classifier
	\item \textbf{Similarity-based approach}
	\begin{itemize}
		\item A pair with a high similarity score is assigned a high entailment score
		\item Similarity is measured for example by WordNet (how many edges have to be traversed to get to the other word) and by string similarity (length or even single letters)
	\end{itemize}
	\item \textbf{Alignment-based approaches}
	\begin{itemize}
		\item Use heuristics to align chunks of words from $T$ to $H$
		\item For example, match the phrase ``\textit{purchase of $X$ by $Y$}'' with ``\textit{$Y$ acquired $X$}''
		\item However, we need a knowledge base to infer these relations
		\begin{figure}[ht]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/text_entailment_alginment_based_methods.png}
		\end{figure}
	\end{itemize}
	\item \textbf{Formal Logic approaches}
	\begin{itemize}
		\item Use a theorem prover to find a proof that $H$ follows from $T$
		\item Convert statements in $T$ and $H$ into formal logic
		\item Problem: the lack of background knowledge is mostly the bottleneck, as even the simplest mistakes or missing statements can prevent this approach from reaching the correct result
	\end{itemize}
	\item \textbf{Edit distance-based approaches}
	\begin{itemize}
		\item Sequence of transformations that need to be applied on $T$ to get to $H$
		\item If the number of transformations is higher than specified threshold, classify relation as \texttt{false}
		\item Alternative for \textit{expensive} theorem prover
	\end{itemize}
	\item Evaluation done on dataset with 1,600 $T$-$H$ pairs with accuracy as metric. Lexical baseline is at about 58\%
\end{itemize}
\subsection{Current methods}
\begin{itemize}
	\item RTE datasets are mostly very small which limits the application of complex systems
	\item However, there are large Natural Language Inference datasets on which neural networks can also be trained (different domains, for example image to text)
	\item We need datasets over multiple domains as otherwise the algorithms generalize poorly
	\item \textbf{Neural networks}
	\begin{itemize}
		\item Specifying features by hand for the input
		\item Using both hypothesis and text as input. Mostly, we then classify the pair into the classes \textit{entailment}, \textit{contradiction} and \textit{neutral} (not enough info to decide)
		\item Using various LSTM models with attention modules 
		\item Generative models create a hypothesis given the text and the class for which the hypothesis should be generated
		\item However, networks have been shown to overfit on noise in the data (contradictions mostly contain negative words, entailments are biased towards animals, and so on)
	\end{itemize}
\end{itemize}

================================================
FILE: Natural_Language_Processing_1/nlp_translation.tex
================================================
\section{Machine Translation}
\subsection{Statistical Machine Translation}
\begin{itemize}
	\item Given a sentence $f$ in foreign language, find most probable translation $\hat{e}$:
	$$\hat{e} = \arg\max_{e} P(e|f) = \arg\max_{e} \underbrace{P(f|e)}_{\text{channel}} \underbrace{P(e)}_{\text{source}}$$
	\item The source is the \textbf{language model} which makes sure that the grammatical structure in the text is correct
	\begin{itemize}
		\item It also helps to disambiguate the word choice in the language we translate into
		\item This is very important if a word in the foreign language is ambiguous
	\end{itemize}
	\item The (noisy) channel is the \textbf{translation model} which is responsible for translating the text (makes sure that $f$ is a translation of $e$)
	\item IBM-3 model:
	\begin{itemize}
		\item For every word:
		\begin{itemize}
			\item Choose a fertility $\phi_i$ (the number of foreign words that the word is translated into. E.g. ``did'' has a fertility of 0, ``slap'' of 3 in French)
			\item Generate $\phi_i$ foreign words
			\item Generate spurious/default words that might be needed
		\end{itemize}
		\item Permute translated words based on the position a word was before, and language it was in before
	\end{itemize}
\end{itemize}
\subsubsection{Learning parameters of models}
\begin{itemize}
	\item For learning the parameters of the language and translation model, we would need the word alignments of the translations, which in turn require the parameters
	\item Thus, we apply the Expectation-Maximization algorithm
	\item Assume that every sentence pair has multiple possible alignments. For example, we can have the following possible alignments for a two-word sentence:
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/translation_EM_alignment.png}
	\end{figure}
	\item Every alignment is assigned a probability/fractional count with which it occurs. Initially, we set the probability of every alignment/word to a uniform distribution
	\item We try to maximize the probability that a foreign word/phrase $f_j$ in our corpus is a translation of $e_{a_j}$, where $a_j$ is the index of the word that $f_j$ is aligned to. When using Bayes rule (and looking at only 1-to-1 alignments), we maximize:
	$$P(a,f|e) = \prod\limits_{j=1}^{M}t(f_j|e_{a_j})$$
	where $t$ are the translation probabilities obtained from the fractional counts
	\item EM algorithm:
	\begin{enumerate}[label=Step \theenumi:]
		\item Compute $P(a,f|e)$ for every possible alignment and sentence
		\item Normalize the alignments for the same foreign sentence. $P(a,f|e)\to P(a|f,e)$
		\item Collect the fractional counts $tc(x|b)$ by summing up the probabilities $P(a|f,e)$ of all alignments in which $b$ is aligned to $x$. 
		\item Normalize fractional counts by $b$ $\Rightarrow$ revised parameters for next iteration
	\end{enumerate}
	\item Example: Given $t(x|b) = 1/4$, $t(x|c)= 3/4$, $t(y|b)= 1/2$, $t(y|c)= 1/2$.
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/translation_EM_step_1.png}
	\end{figure}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/translation_EM_step_2.png}
	\end{figure}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/translation_EM_step_3_4.png}
	\end{figure}
	\item Note on EM: The optimization objective is non-convex, so that we might only find a local optimum
\end{itemize}
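\begin{itemize}
	\item To make the four EM steps concrete, here is one iteration on a toy corpus. The corpus is an assumption for illustration and independent of the figures above: sentence pair 1 is $e=(b,c)$ with $f=(x,y)$, and sentence pair 2 is $e=(b)$ with $f=(y)$. With 1-to-1 alignments, pair 1 has the two alignments $a^{(1)}$: $x\leftrightarrow b$, $y\leftrightarrow c$ and $a^{(2)}$: $x\leftrightarrow c$, $y\leftrightarrow b$, while pair 2 only allows $y\leftrightarrow b$. All $t(\cdot|\cdot)$ are initialized to $1/2$.
	\begin{enumerate}[label=Step \theenumi:]
		\item $P(a^{(1)},f|e)=t(x|b)\,t(y|c)=\frac{1}{4}$ and $P(a^{(2)},f|e)=t(x|c)\,t(y|b)=\frac{1}{4}$. For pair 2: $P(a,f|e)=t(y|b)=\frac{1}{2}$
		\item $P(a^{(1)}|f,e)=P(a^{(2)}|f,e)=\frac{1/4}{1/4+1/4}=\frac{1}{2}$. For pair 2: $P(a|f,e)=1$
		\item $tc(x|b)=\frac{1}{2}$, $tc(y|b)=\frac{1}{2}+1=\frac{3}{2}$, $tc(x|c)=\frac{1}{2}$, $tc(y|c)=\frac{1}{2}$
		\item $t(x|b)=\frac{1/2}{2}=\frac{1}{4}$, $t(y|b)=\frac{3/2}{2}=\frac{3}{4}$, $t(x|c)=\frac{1/2}{1}=\frac{1}{2}$, $t(y|c)=\frac{1/2}{1}=\frac{1}{2}$
	\end{enumerate}
	After a single iteration, the model already prefers $y$ as the translation of $b$, because the second sentence pair only supports this alignment.
\end{itemize}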
\subsection{Phrase-based Statistical Machine Translation}
\begin{itemize}
	\item Previously, translation was based on single words as the atomic unit. However, we can also use phrases (a few consecutive words) as the unit 
	\item The advantage is that context can be taken into account for translation, and fertility, insertion and deletion of words are no longer necessary
	\item We now have a phrase table which stores the probabilities of translating a certain phrase into another
	\item The translation model uses phrases instead of words, but also needs to consider the reordering of the phrases:
	$$P(f|e) = \prod\limits_{i=1}^{I} \phi(\overline{f}_i | \overline{e}_i) \underbrace{d(\text{start}_i - \text{end}_{i-1} -1)}_{\text{distance-based reordering}}$$
	where $I$ is the number of phrases. Note that $\text{start}_i$ and $\text{end}_{i-1}$ are positions in the foreign sentence (first word of the $i$-th and last word of the $(i-1)$-th foreign phrase), but $i$ is the index in the translated language!
	\item Extract all phrase pairs that are consistent with a word alignment $A$. A phrase pair $(\overline{f}', \overline{e}')$ is consistent if all words of $\overline{f}'$ are only aligned to words in $\overline{e}'$ and not to any other words outside this phrase (and the other way round).
	\item The \textit{phrase translation probability} $\phi$ is estimated by the relative frequency:
	$$\phi(\overline{f} | \overline{e}) = \frac{\text{count}(\overline{f}, \overline{e})}{\sum_i \text{count}(\overline{f}_i, \overline{e})}$$
\end{itemize}
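\begin{itemize}
	\item A small worked example of the distance-based reordering term. The segmentation is an assumption for illustration, and the distance is measured from the end of the previously translated phrase, $\text{end}_{i-1}$, with the convention $\text{end}_0=0$. Assume a foreign sentence of 7 words that is translated in four phrases, listed in the order of the translated language:
	\begin{center}
		\begin{tabular}{c|c|c|c|c}
			Phrase $i$ & Foreign positions & $\text{start}_i$ & $\text{end}_{i-1}$ & $\text{start}_i-\text{end}_{i-1}-1$\\
			\hline
			1 & 1--3 & 1 & 0 & $0$\\
			2 & 6 & 6 & 3 & $+2$\\
			3 & 4--5 & 4 & 6 & $-3$\\
			4 & 7 & 7 & 5 & $+1$\\
		\end{tabular}
	\end{center}
	A monotone translation gives $d(0)$ for every phrase. Skipping forward over two words gives $d(2)$, and jumping back gives a negative argument. A common choice (again an assumption here) is an exponentially decaying penalty $d(x)=\alpha^{|x|}$ with $\alpha\in(0,1]$.
\end{itemize}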

================================================
FILE: README.md
================================================
# Summaries of Master AI at UvA

In this repository, I collect all the summaries I created during my studies in the Master's programme Artificial Intelligence at the University of Amsterdam (2018 - 2020). Feel free to use them, but keep in mind that they might contain small mistakes.

## Courses

The PDF versions of the summaries can be found in the folder _Final_versions_. For the LaTeX files and editing, please see the corresponding folders.

### Machine Learning 1 (UvA, 2018/19)
* Semester 1, Period 2
* Lecturer: Dr. Rianne van den Berg

### Natural Language Processing 1 (UvA, 2018/19)
* Semester 1, Period 2
* Lecturer: Dr. Ekaterina Shutova
* Website: https://cl-illc.github.io/nlp1/

### Information Retrieval 1 (UvA, 2018/19)
* Semester 1, Period 3
* Lecturer: Dr. Evangelos Kanoulas

### Knowledge Representation (VU, 2018/19)
* Semester 2, Period 1
* Lecturer: Dr. Frank van Harmelen

### Computer Vision 1 (UvA, 2018/19)
* Semester 2, Period 1
* Lecturer: Dr. Theo Gevers
* Website: https://cv1-uva.github.io 

### Deep Learning (UvA, 2018/19)
* Semester 2, Period 2
* Lecturer: Dr. Efstratios Gavves
* Website: http://uvadlc.github.io

### Machine Learning for Quantified Self (VU, 2018/19)
* Semester 2, Period 3
* Lecturer: Dr. Mark Hoogendoorn
* Website: https://ml4qs.org

### Machine Learning 2 (UvA, 2019/20)
* Semester 1, Period 1
* Lecturer: Dr. Joris Mooij

### Reinforcement Learning (UvA, 2019/20)
* Semester 1, Period 1
* Lecturer: Dr. Herke van Hoof


================================================
FILE: Reinforcement_Learning/rl_appendix.tex
================================================
\section{Deep RL in practice}
\textit{This section reviews the lecture slides 10 (last half).}
\begin{itemize}
	\item There are several things to keep in mind when performing experiments in RL in practice
	\item If we have a research question that we want to investigate, we need to design experiments for which we have to answer the following questions:
\end{itemize}
\begin{description}
	\item[On which tasks?] We need to find environments which fit to the research question in mind. Things to consider are:
	\begin{itemize}
		\item Continuous control tasks lend themselves to actor critic methods
		\item Pixel-based tasks can show whether complex input data can be handled
		\item Highly complex tasks show whether a method scales with having lots of compute and training data available
		\item Toy examples can point out difference between methods, so it is often good to have both a toy example, and a more complex, practical one
	\end{itemize}
	\item[Which parameters and architectures to test?] RL methods have been shown to be very sensitive to the selection of hyperparameters. Hence, you should spend similar tuning effort on \underline{all} your experiments, including the baselines, to ensure a fair comparison.
	\item[Does a random seed affect my experiments?] Due to the high variance of RL methods, we need to average all runs over a sufficient number of seeds. Furthermore, if we perform a grid search, we should always keep the seeds fixed for all hyperparameter settings, but in the final test, use a different set of random seeds to prevent overfitting on seeds.
	\item[What to report?] Next to the mean and/or median performance, the spread of the result should be shown as well. Furthermore, it should represent what you want to show. For example, if we want to underline that a new method learns faster, we should show a plot over learning iterations instead of just the final performance.
\end{description}


================================================
FILE: Reinforcement_Learning/rl_introduction.tex
================================================
\section{Introduction to Reinforcement Learning}
\textit{This section reviews the lecture slides 1 and 2 (until Monte Carlo). }
\begin{itemize}
	\item Reinforcement Learning deals with the following question: 
	
	How to \textcolor{blue}{sequentially interact} with an environment to \textcolor{blue}{maximize a long-term objective}?
	
	\item The general RL setting is visualized in Figure~\ref{fig:rl_introduction_reinforcement_learning}. There are 4 variables that are passed around:
	\begin{itemize}
		\item The state $S_t$, in which the agent currently is, determined by the environment
		\item The action $A_t$ which is chosen by the agent, based on the observation of the state $S_t$
		\item Based on the action $A_t$, the agent ends up in a new state $S_{t+1}$ and receives a reward $R_{t+1}$. Both are determined by the environment, and can only be influenced by the agent through $A_t$
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/rl_introduction_reinforcement_learning.pdf}
		\caption{Illustration of the interaction of an agent with an environment in the reinforcement learning setting.}
		\label{fig:rl_introduction_reinforcement_learning}
	\end{figure}

	\item The decision of which action to take in which state is called the policy, and we denote it by $\pi$. The goal is to find the optimal policy $\pi$ which maximizes the reward we get from the environment
		
	\item As we learn by interactions, we face two major challenges in contrast to standard supervised learning:
	\begin{itemize}
		\item The dataset is not static as in e.g. image classification. Every time we change our policy, new data has to be generated as the actions we take at certain states are different. Note that there are techniques to use the data more efficiently, and we will discuss them later.  
		\item Instead of having i.i.d. data, we have sequential data which are highly correlated. Standard optimizers like SGD can fail because they assume i.i.d. samples within a batch. We will also discuss how we can tackle this problem.
	\end{itemize}
\end{itemize}
\subsection{MDPs and $k$-armed bandits}
\begin{itemize}
	\item A commonly used, simplified example for RL is the $k$-armed bandit. We can imagine it as $k$ slot machines with unknown pay-off distributions. Hence, we have a static environment where the mapping between actions and rewards is independent of the state and the time step $t$
	\item Our goal is to maximize the cumulative reward over time. There are two variants we can use:
	\begin{itemize}
		\item If we want to maximize our reward for a finite horizon $T$ (i.e. limited number of trials), our goal is to maximize $\sum_{t=1}^{T} r_t$
		\item If we assume that we can take as many actions as we want, we have the objective for an infinite horizon: $\sum_{t=1}^{\infty} \gamma^{t}r_t$ where $\gamma\in [0,1)$ is a discount factor. Note that without a discount factor, we would end up with infinite reward for any action whose reward has a mean greater than zero.
		\item For generalization, we call $G_t=\sum_{k=0}^{\infty} \gamma^{k}R_{t+k+1}=R_{t+1}+\gamma G_{t+1}$ the \textbf{(discounted) return}, or cumulative reward, at time step $t$. Only if an episode terminates (meaning that we cannot play forever), a discount factor of $\gamma=1$ is allowed.
	\end{itemize} 
	\item A general trade-off in reinforcement learning is between \textbf{exploration} (i.e. trying new actions) and \textbf{exploitation} (i.e. taking best actions we know). If we perform too little exploration, we might overlook the best action, for example due to stochasticity of the reward. However, exploiting the best actions is likely to lead to the maximum rewards, so that with exploring, we ``lose'' possible rewards.
	
	A general rule of thumb is: if we have much time left or are very uncertain about our current estimates, do more exploration. If we are limited on time, or are certain about our estimates, we should exploit more. 
	
	The discount factor $\gamma$ also plays a role: the higher $\gamma$, the more we care about rewards in the future, and hence, the more exploration we should perform.
	
	\item We now introduce a set of important functions which are used for finding the optimal policy:
	\begin{itemize}
		\item The \textbf{state-action value function}, also called \underline{$q$-function}, expresses the expected return of taking a certain action in a given state:
		$$q_{\pi}(s,a)=\E_{\pi}[G_t|S_t=s,A_t=a]$$
		Note that the $q$-value is always specific to a certain policy, as the expectation over $G_t$ assumes that all steps after $t$ are taken according to the policy $\pi$
		\item The \textbf{state-value function} is similar to the $q$-function, but only takes the state into account, and considers the action under the expectation:
		$$v_{\pi}(s)=\E_{\pi}[G_t|S_t=s]$$
	\end{itemize}

	\item In the case of the $k$-armed bandit, we try to learn a $q$-function (as we want to find the best action) but assume that we stay in the same state $s$. To balance exploration and exploitation, there are different strategies possible, for example:
	\begin{itemize}
		\item \textbf{$\epsilon$-greedy} takes the currently best (greedy) action with probability $1-\epsilon$, and with probability $\epsilon$ selects an action uniformly at random
		\item An annealed softmax takes the estimated action value into account, and creates a distribution based on this with a temperature factor $\tau$ (high $\tau$ means more stochasticity):
		$$p(a)=\frac{\exp\left(\hat{q}(a)/\tau\right)}{\sum_{a'}\exp\left(\hat{q}(a')/\tau\right)}$$
		\item We can use the current estimate $\hat{q}$ in combination with the uncertainty we have about a certain action. This leads to the upper confidence bound (UCB). Alternatively, we can initialize all $q$-values optimistically (guarantees a certain level of exploration)
	\end{itemize} 

	\item \textbf{Markov Decision Process}: The next state and reward only depend on the current state $s_t$ and the chosen action, and are independent of the history $s_0,...,s_{t-1}$ given $s_t$. Hence, the agent can choose its action based on the current state alone. Formally, we can define a finite MDP by
	\begin{itemize}
		\item A finite set of states $\mathcal{S}$
		\item A finite set of actions for each state $\mathcal{A}_s$ (often the same in all states)
		\item A dynamics function $p(s',r|s,a)=\Prob{S_t=s',R_t=r|S_{t-1}=s,A_{t-1}=a}$ which is often split into
		\begin{itemize}
			\item Transition function $p(s'|s,a)$
			\item Reward function $p(r|s,a,s')$
		\end{itemize}
		\item A discount factor $\gamma\in[0,1]$
	\end{itemize}
	\item In this setting, the optimal action can be found by optimizing the policy $\pi^{*}(s_t)$. In the rest of the course, we mostly focus on MDPs
\end{itemize}
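\begin{itemize}
	\item As a small numeric illustration of the return (the rewards are assumed values, not taken from the lecture), consider an episode that terminates after three steps with $R_1=1$, $R_2=0$, $R_3=2$ and $\gamma=0.9$. Applying the recursion $G_t=R_{t+1}+\gamma G_{t+1}$ backwards from the end of the episode gives:
	$$G_2 = 2, \hspace{4mm} G_1 = 0 + 0.9\cdot 2 = 1.8, \hspace{4mm} G_0 = 1 + 0.9\cdot 1.8 = 2.62$$
	which agrees with the direct sum $1 + 0.9\cdot 0 + 0.9^2\cdot 2 = 2.62$.
	\item Similarly, for a 2-armed bandit with the assumed estimates $\hat{q}(a_1)=1$ and $\hat{q}(a_2)=2$, the softmax strategy selects the better arm with
	$$p(a_2)\Big\vert_{\tau=1} = \frac{e^{2}}{e^{1}+e^{2}}\approx 0.73, \hspace{4mm} p(a_2)\Big\vert_{\tau=10} = \frac{e^{0.2}}{e^{0.1}+e^{0.2}}\approx 0.52$$
	so a high temperature is close to uniform sampling. With $\epsilon$-greedy and $\epsilon=0.1$, assuming the random action is drawn uniformly over both arms (including the greedy one), we get $p(a_2)=1-\epsilon+\epsilon/2=0.95$.
\end{itemize}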

\subsection{Dynamic Programming}
\begin{itemize}
	\item For simple environments where we know the dynamics function of the MDP, we can apply approaches of dynamic programming
	\item One thing to notice about the functions $v$ and $q$ is their relationship, namely:
	\begin{equation*}
		\begin{split}
			v_{\pi}(s) & = \E_{\pi}\left[G_t|S_t=s\right] = \E_{a\sim\pi(\cdot|s)}\left[\E_{\pi}\left[G_t|S_t=s,A_t=a\right]\right] = \E_{a\sim\pi(\cdot|s)}\left[q_{\pi}(s,a)\right]\\[8pt]
			q_{\pi}(s,a) & = \E_{\pi}[G_t|S_t=s,A_t=a] = \E[R_{t+1}|S_t=s,A_t=a]+\gamma\,\E_{\pi}[G_{t+1}|S_t=s,A_t=a] \\& = \E\left[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_t=s,A_t=a\right]
		\end{split}
	\end{equation*}
	\item A policy is optimal if there is no other policy for which the value of any state is larger than the current one: $v_{*}(s)=\max_{\pi} v_{\pi}(s)$, $q_{*}(s,a)=\max_{\pi} q_{\pi}(s,a)$
	\item Again, we can write down the relations between the two functions, which are called \textit{Bellman optimality equations} for the optimal case:
	\begin{equation*}
		\begin{split}
			v_{*}(s) & =\max_{a}q_{*}(s,a)= \max_a \E\left[R_{t+1}+\gamma v_{*}(S_{t+1})|S_t=s,A_t=a\right]\\
			q_{*}(s,a) & = \E\left[R_{t+1}+\gamma \max_{a'} q_{*}(S_{t+1},a')\Big\vert S_t=s,A_t=a\right]
		\end{split}
	\end{equation*}
	
	\item The first approach of finding the optimal policy is \textbf{policy iteration}. It combines two steps:
	\begin{itemize}
		\item \textit{Policy evaluation}: given a policy $\pi$, we try to find the corresponding value function $v_{\pi}$. We do this by performing the update $v(s)\leftarrow\E_{\pi}[R_{t+1}+\gamma v(S_{t+1})|S_t=s]$ for all states until the values converge. Note that we can evaluate the expectation as we know $\pi$ and the MDP dynamics $p(s',r|s,a)$
		\item \textit{Policy improvement}: given the value function $v_{\pi}$, we try to find a new policy $\pi'$ for which we know that $\forall s, v_{\pi'}(s)\geq v_{\pi}(s)$. We can do that by acting greedily, i.e. taking the argmax over actions in each state: $\pi'(s)=\arg\max_a q_{\pi}(s,a)$.
	\end{itemize}
	Policy iteration performs these two steps in a loop until the policy no longer changes in the improvement step. For finite MDPs, it is guaranteed to converge to the optimal policy $\pi^{*}$.
	
	The full algorithm is shown in Figure~\ref{fig:rl_introduction_policy_iteration}.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.7\textwidth]{figures/rl_introduction_policy_iteration.png}
		\caption{Policy iteration algorithm (Sutton book)}
		\label{fig:rl_introduction_policy_iteration}
	\end{figure}
	\item The issue with policy iteration is that the policy evaluation step can take a long time until it fully converges, although slight changes of the values might not influence the policy too much. An alternative is to stop policy evaluation after a single iteration, and directly improve the policy. This leads to the \textbf{value iteration algorithm}.
	
	When implementing it, we can efficiently combine the two steps of evaluation and improvement, which is actually the same as using the Bellman optimality equation as an update rule. See Figure~\ref{fig:rl_introduction_value_iteration} for details.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.7\textwidth]{figures/rl_introduction_value_iteration.png}
		\caption{Value iteration algorithm (Sutton book)}
		\label{fig:rl_introduction_value_iteration}
	\end{figure}

	The drawback of value iteration is that it can lead to noisy updates as it only performs a single update step and hence, can give inaccurate estimates of $v$. In practice, what has been found to mostly work the best, is to perform a limited, small number of steps of policy evaluation.
	
	\item Keep in mind that all these algorithms require knowing the MDP dynamics $p(s',r|s,a)$. However, these are often unknown, especially for more complicated, real-world environments. There we can only sample data points $(s_i,a_i,r_i,s'_i)$ which we need to use effectively. 
	
\end{itemize}
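\begin{itemize}
	\item A minimal sketch of value iteration on a toy MDP (the MDP is an assumption for illustration): two states $A$ and $B$ with deterministic transitions and $\gamma=0.5$. In $A$, the action \textit{stay} gives reward $0$ and remains in $A$, while \textit{go} gives reward $1$ and leads to $B$. In $B$, the only action gives reward $2$ and remains in $B$. The Bellman optimality update then reads:
	$$v_{k+1}(A) = \max\left(0 + \gamma v_k(A),\; 1 + \gamma v_k(B)\right), \hspace{4mm} v_{k+1}(B) = 2 + \gamma v_k(B)$$
	Starting from $v_0=0$ and updating both states synchronously (an in-place sweep may yield different intermediate values, but has the same fixed point), we get:
	\begin{center}
		\begin{tabular}{c|c|c|c|c|c}
			$k$ & 0 & 1 & 2 & 3 & $\infty$\\
			\hline
			$v_k(A)$ & 0 & 1 & 2 & 2.5 & 3\\
			$v_k(B)$ & 0 & 2 & 3 & 3.5 & 4\\
		\end{tabular}
	\end{center}
	The fixed point satisfies $v_{*}(B)=2/(1-\gamma)=4$ and $v_{*}(A)=\max(0.5\cdot 3,\; 1+0.5\cdot 4)=3$, so the greedy policy chooses \textit{go} in $A$.
\end{itemize}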
\subsection{Outline}
\begin{itemize}
	\item In the next sections (and rest of the whole course), we will deal with different ways of learning the optimal policy when the dynamics of the MDP are unknown in advance. We can distinguish the approaches into three main groups (see Figure~\ref{fig:rl_introduction_overview_leanring_techniques}):
	\begin{itemize}
		\item \textbf{Value-based} methods try to learn the value functions $v(s)$ and $q(s,a)$ from interactions with the environment. Based on these, we can find the optimal policy $\pi$.
		\item \textbf{Policy-based} methods try to directly learn the desired objective, namely the policy $\pi$. While this avoids propagating errors from a learned value function to the policy, it is often harder to optimize.
		\item In contrast to the previous techniques, \textbf{model-based} RL is based on the idea of learning the dynamics of the MDP, namely $p(s',r|s,a)$. With this knowledge given, we can then again apply value-based or policy-based methods, but support them by either using the transition function directly (i.e. take all possible future states into account instead of sampling), or simulate new trajectories if this is expensive in the original environment. 
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/rl_introduction_overview_leanring_techniques.png}
		\caption{Overview of different learning strategies in RL.}
		\label{fig:rl_introduction_overview_leanring_techniques}
	\end{figure}
	\item The next sections 2 and 3 (lecture slides 3 to 6) deal with value-based methods. First, we discuss tabular-based techniques, meaning that we store e.g. $v(s)$ by a big table (i.e. every state has an entry in this table). However, these methods cannot be applied if the state space is continuous and/or high-dimensional (size increases exponentially). Thus, we look at approximations in section 3.
	\item Section 4 (lecture slides 7 to 10) deals with policy-based RL introducing different techniques for approximating the optimal gradients in policy learning.
	\item Model-based RL is discussed in section 5 (lecture slides 11 and 12), but in less detail than the previous two.
	\item The final chapter deals with partially-observable environments (Section 7, lecture 13), and how to take uncertainty into account.
\end{itemize}


================================================
FILE: Reinforcement_Learning/rl_learning_with_approx.tex
================================================
\section{Value-based RL: Learning with approximation}
\label{sec:value_based_approximation}
\textit{This section reviews the lecture slides 5 and 6.}
\begin{itemize}
	\item When we talk about approximating the value function, we mean that instead of implementing $v$ as a look-up table, we view it as a parameterized function $\hat{v}(s,\bm{w})\approx v_{\pi}(s)$ with $\bm{w}\in\R^{d}$.
	\item Commonly, we try to allow generalization over nearby states while keeping the representation compact. Hence, the number of weights $d$ is usually much smaller than the number of states. This implies that a change in $\bm{w}$ will affect many states, and hence, generalize.
	\item Learning value functions is similar to supervised learning as we try to push a prediction closer to a target (similar to regression). The value error can be summarized as:
	$$\overline{\text{VE}}(\bm{w}) = \sum_{s\in S}\mu(s)\left[v_{\pi}(s)-\hat{v}(s,\bm{w})\right]^2=\E_{s\sim\mu(s)}\left[\left(v_{\pi}(s)-\hat{v}(s,\bm{w})\right)^2\right]$$
	where $\mu(s)$ is a weighting factor for the states (which state is how important, distribution over those). This depends on the task we are aiming for.
	
	However, keep in mind that our overall goal is to find the optimal policy, and not the best value function. So, minimizing $\overline{\text{VE}}$ does not necessarily give the best policy, and we often only converge to local optima. 
\end{itemize}
\subsection{Types of function approximations}
\begin{itemize}
	\item There are various function approximation techniques we can use. We will review here a few, practical/simple ones
	\item In general, we distinguish between linear and non-linear function approximation. We call an approximation linear if the value function is linear with respect to the weights, namely:
	$$\hat{v}(s,\bm{w})=\bm{w}^T\bm{x}(s)$$
	where $\bm{x}(s)$ can be any (non-)linear function of the state. It can also be seen as a linear combination of static feature extractors. Some examples for $\bm{x}$ are:
	\begin{itemize}
		\item \textit{Polynomials}: with enough (infinitely many) terms, we would be able to approximate any function. However, this is not feasible, and we quickly end up with many features (especially in higher dimensions because we have $\left[1,s_1,s_2,s_1^2, s_2^2, s_1s_2, s_1^2s_2,...\right]$), and hence less generalization. Furthermore, the behavior at 0 is rather static
		\item \textit{Aggregations} where we group multiple states into one. This can be seen as returning a one-hot vector for $\bm{x}$ where the $1$ assigns a point to a certain state group. Figure~\ref{fig:rl_approximate_value_based_aggregation} visualizes some examples.
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.5\textwidth]{figures/rl_approximate_value_based_aggregation.png}
			\caption{Simple types of aggregations on a 2D state space. Note that we can combine aggregations, meaning that we use both a vertical and a horizontal aggregation.}
			\label{fig:rl_approximate_value_based_aggregation}
		\end{figure}
		\item \textit{Radial basis functions} that take the distance to a mean in the state space, e.g. $||\mu_i-s||$, as input features. We can model this by having multiple Gaussians, and weight their influence by $p(s)$. It enables us to have smoother transitions between close-by states, but can be problematic for states that are far away from all means. For this reason, radial basis functions often struggle in high-dimensional state spaces.
		\item \textit{Fourier basis} where we take different frequencies to model $s$. This can provide quite a flexible feature set.
	\end{itemize}
	Note that tabular RL can also be expressed by linear function approximation where we simply use $\bm{x}(s)=\left[\delta(s=s_1), \delta(s=s_2),...\right]$, and $\bm{w}$ therefore contains one parameter per state.
	
	Linear function approximation is especially useful when prior knowledge can be built into the system. Carefully selecting the features simplifies the learning problem, and hence lets the model converge faster.
	
	\item In non-linear function approximation, we use $\bm{w}$ in a non-linear fashion in $\hat{v}$, such as in neural networks.
	
\end{itemize}
\subsection{Prediction objective for on-policy prediction}
\begin{itemize}
	\item In the case that we perform an on-policy prediction (i.e. policy evaluation for a fixed policy), the state importance is based on the visit frequency of $\pi$. To arrive at $\mu$, we also have to distinguish between the tasks:
	\begin{itemize}
		\item If we have a continuing task (never ending), $\mu_{\pi}$ is the stationary distribution under $\pi$, which satisfies:
		$$\mu_{\pi}(s)=\sum_{s'}\sum_{a}p(s|s',a)\pi(a|s')\mu_{\pi}(s')$$
		with the condition that we can reach every state from the start.
		\item For episodic tasks, we need to consider the start frequency $h(s)$ as well. To guarantee that $\mu(s)$ is a distribution, we normalize the expected number of visits $\eta(s)$ per episode:
		$$\mu_{\pi}(s)=\frac{\eta(s)}{\sum_{s'}\eta(s')}, \hspace{4mm}\eta(s)=h(s)+\sum_{s'}\sum_a p(s|s',a)\pi(a|s')\eta(s')$$
		where the second part is pretty much the same as before.
	\end{itemize} 
	\item In order to calculate the gradients $\nabla_{\bm{w}}\overline{\text{VE}}(\bm{w})$, we would need to know $\mu(s)$ and $v_{\pi}(s)$, which is not possible due to missing information of the environment dynamics ($p(s'|s,a)$). However, we can approximate the gradient by Monte Carlo samples, using the states visited under $\pi$ and the return $G_t$ as an unbiased estimate of $v_{\pi}(S_t)$, such that:
	$$\nabla_{\bm{w}}\overline{\text{VE}}(\bm{w})\approx \nabla_{\bm{w}}\left[G_t - \hat{v}(S_t,\bm{w})\right]^2 = -2\cdot \left[G_t - \hat{v}(S_t,\bm{w})\right] \nabla_{\bm{w}}\hat{v}(S_t,\bm{w})$$
	which leads us to the \textbf{Gradient Monte Carlo} algorithm (the constant factor $2$ is absorbed into the step size $\alpha$):
	$$\bm{w}_{t+1}=\bm{w}_{t}+\alpha \left[G_t - \hat{v}(S_t,\bm{w}_t)\right] \nabla_{\bm{w}}\hat{v}(S_t,\bm{w}_t)$$
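
	As a quick sanity check (a sketch, assuming linear features so that $\nabla_{\bm{w}}\hat{v}(s,\bm{w})=\bm{x}(s)$), the update becomes $\bm{w}_{t+1}=\bm{w}_t+\alpha\left[G_t-\bm{w}_t^T\bm{x}(S_t)\right]\bm{x}(S_t)$. With the one-hot features $\bm{x}(s)=\left[\delta(s=s_1), \delta(s=s_2),...\right]$ of the tabular case, only the weight $w_i$ of the visited state $S_t=s_i$ changes:
	$$w_i\leftarrow w_i+\alpha\left[G_t-w_i\right]$$
	which is exactly the tabular constant-$\alpha$ Monte Carlo update.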
	\item Alternatively, we could also think about using the bootstrapping estimate as target, which gives us the following update rule:
	$$\bm{w}_{t+1}=\bm{w}_t + \alpha\underbrace{\left[R_{t+1} + \gamma\hat{v}(S_{t+1},\bm{w}_t) - \hat{v}(S_t,\bm{w}_t)\right]}_{\text{TD error }\delta}\nabla_{\bm{w}}\hat{v}(S_t,\bm{w}_t)$$  
	This method is called \textbf{Semi-gradient TD(0)}, and indicates by its name that it is not a true gradient. The reason for that is that we actually ignore the dependency of the target on $\bm{w}$. We assume it to be fixed.  Nevertheless, experiments with the true gradient have shown that the semi-gradient works much better in practice. We will discuss it later in more detail.
	
	Note that, as in the tabular case, we can extend this approach to $n$-step targets if wanted.
	\item When comparing Gradient MC and semi-gradient TD(0), we get the same arguments as for the tabular case in Section~\ref{sec:value_based_tabular_difference_TD_MC} $\Rightarrow$ TD has lower variance and learns usually faster, but can have a bias (see below)
	\item A thing to keep in mind when using semi-gradient TD(0) is that it tries to minimize the difference between the values of close-by states, especially if we use an approximation such as aggregating multiple states into one. This is because a small step in the state space can lead to a huge TD error, which we try to minimize.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/rl_approximate_value_based_semi_gradient_td.png}
		\caption{Problems of semi-gradient TD(0) updates on the random walk example. It prefers a value function with low changes between states, so that it gets a biased prediction.}
		\label{fig:rl_approximate_value_based_semi_gradient_td}
	\end{figure}
\end{itemize}
\subsubsection{Discussion on convergence for different objectives}
\begin{itemize}
	\item The advantage of linear function approximation is that the gradients are easy to calculate ($\nabla_{\bm{w}}\hat{v}(s,\bm{w})=\bm{x}(s)$). Furthermore, it can be proven that all local optima are global optima, so that gradient MC converges to the minimum of $\overline{\text{VE}}$. This is not necessarily the case for semi-gradient TD, but we can guarantee that it converges to a point $\bm{w}_{td}$ with the upper bound $\overline{\text{VE}}(\bm{w}_{td})\leq \frac{1}{1-\gamma}\min_{\bm{w}}\overline{\text{VE}}(\bm{w})$.
	
	In the non-linear case, we cannot guarantee convergence for semi-gradient TD (only for Gradient Monte Carlo), and we might end up in local optima. Nevertheless, linear features are much more restricted than non-linear models such as neural networks. Hence, non-linear methods can lead to better results, even if we might get stuck in local optima.
	
	\item When learning via Gradient Monte Carlo or Semi-gradient TD, we have to select a step size $\alpha$. This can be a bit more tricky here because we combine the values of many states into a single function. In case of linear function approximation, we can actually give a rule of thumb because the features are static, namely:
	$$\alpha = (\tau\E[\bm{x}^T\bm{x}])^{-1}$$
	where $\tau$ is the number of experiences with the same (or a similar) feature vector over which we want to average. This mirrors the tabular setting, where the corresponding learning rate would be $\frac{1}{\tau}$.
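
	For illustration (the numbers are hypothetical): with one-hot features we have $\bm{x}^T\bm{x}=1$, so the rule gives $\alpha=\frac{1}{\tau}$ as in the tabular case. If instead every feature vector has $k=8$ active binary features, then $\bm{x}^T\bm{x}=8$, and averaging over $\tau=10$ experiences suggests
	$$\alpha=\left(10\cdot 8\right)^{-1}=0.0125$$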
	\item Alternatively, we could try to find the fixed point of semi-gradient TD(0) directly. We can write the TD update rule for linear function approximation as:
	\begin{equation*}
		\begin{split}
			\bm{w}_{t+1} & = \bm{w}_t + \alpha \left(R_{t+1}+\gamma \bm{w}_t^T \bm{x}_{t+1} - \bm{w}_t^T\bm{x}_t\right)\bm{x}_t\\
			& = \bm{w}_t + \alpha \left(R_{t+1}\bm{x}_t - \bm{x}_t (\bm{x}_t - \gamma \bm{x}_{t+1})^T \bm{w}_t\right)\\
			\implies \E[\bm{w}_{t+1}|\bm{w}_t] & = \bm{w}_t + \alpha \left(\underbrace{\E[R_{t+1}\bm{x}_t]}_{\bm{b}} - \underbrace{\E[\bm{x}_t (\bm{x}_t - \gamma \bm{x}_{t+1})^T]}_{\bm{A}} \bm{w}_t\right)\\
		\end{split}
	\end{equation*}
	The fixed point is reached when the weights no longer change in expectation, meaning $\E[\bm{w}_{t+1}|\bm{w}_t]=\bm{w}_t$. This is the case if $\bm{b}-\bm{A}\bm{w}_{td}=\bm{0}$, i.e.:
	$$\bm{w}_{td}=\bm{A}^{-1}\bm{b}$$
	We can approximate $\bm{A}$ and $\bm{b}$ by MC sampling:
	\begin{equation*}
		\begin{split}
			\hat{\bm{A}}_t & = \sum_{k=0}^{t-1}\bm{x}_k\left(\bm{x}_k - \gamma\bm{x}_{k+1}\right)^T + \epsilon\bm{I}\\
			\hat{\bm{b}}_t & = \sum_{k=0}^{t-1}R_{k+1}\bm{x}_k
		\end{split}
	\end{equation*}
	where $\epsilon$ is a small constant ensuring that $\hat{\bm{A}}_t$ is always invertible. This solution is called \textbf{least-squares temporal-difference (LSTD)}. It is usually more sample efficient because we do not have to perform iterative updates, and it has the benefit of not requiring a step size. However, it is computationally more expensive (quadratic in the number of features, plus the inversion of $\hat{\bm{A}}_t$), and we cannot adapt to a change in the environment over time (once computed, the weights are fixed)
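
	One common way to reduce the cost of the inversion (a sketch based on the Sherman-Morrison formula, which is not derived in these notes) is to maintain $\hat{\bm{A}}_t^{-1}$ directly. Since $\hat{\bm{A}}_{t}=\hat{\bm{A}}_{t-1}+\bm{x}_{t-1}\bm{d}_{t-1}^T$ with the shorthand $\bm{d}_{t-1}=\bm{x}_{t-1}-\gamma\bm{x}_{t}$ (introduced here for readability), and starting from $\hat{\bm{A}}_0^{-1}=\epsilon^{-1}\bm{I}$, we get:
	$$\hat{\bm{A}}_{t}^{-1}=\hat{\bm{A}}_{t-1}^{-1}-\frac{\hat{\bm{A}}_{t-1}^{-1}\bm{x}_{t-1}\bm{d}_{t-1}^T\hat{\bm{A}}_{t-1}^{-1}}{1+\bm{d}_{t-1}^T\hat{\bm{A}}_{t-1}^{-1}\bm{x}_{t-1}}$$
	Each step then costs $O(d^2)$ for $d$ features instead of the $O(d^3)$ of a full inversion, and the weights follow as $\bm{w}_t=\hat{\bm{A}}_t^{-1}\hat{\bm{b}}_t$.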
\end{itemize}
\subsection{Control with approximation}
\begin{itemize}
	\item For learning a policy $\pi$, we again change our objective to learning the $q$-values, which we now approximate with $\hat{q}(s,a,\bm{w})$. We will focus on episodic cases, but note that everything could be generalized to the continuing case as well.
	\item In the \textbf{on-policy} case, we can use methods like (episodic) semi-gradient SARSA, so that our update step is:
	$$\bm{w}_{t+1}=\bm{w}_t+\alpha\left[U_t - \hat{q}(S_t,A_t,\bm{w}_t)\right]\nabla_{\bm{w}}\hat{q}(S_t,A_t,\bm{w}_t)$$
	where $U_t$ is our target, which for one-step SARSA is $U_t = R_{t+1} + \gamma \hat{q}(S_{t+1},A_{t+1},\bm{w}_t)$.
	As usual, we iterate over this update rule while setting our policy to $\epsilon$-greedy on $\hat{q}$.
	\item In the \textbf{off-policy} case, we experience more problems. As we have a behavior policy $b$ and a target policy $\pi$, we often need importance weights $\rho_t$ to correct the update:
	$$\bm{w}_{t+1}=\bm{w}_{t}+\rho_t \alpha\left[U_t - \hat{q}(S_t,A_t,\bm{w}_t)\right]\nabla_{\bm{w}}\hat{q}(S_t,A_t,\bm{w}_t)$$
	Although this increases the variance, it is sometimes necessary to guarantee unbiased, correct estimates. However, note that in cases like Q-learning, where $U_t$ is independent of $b$, we do not have to consider the importance weights, i.e. $\rho_t=1$.
	
	The second issue is that we need to take the changed state distribution, $\mu_b$, into account. Consider for example the very simple MDP in Figure~\ref{fig:rl_approximation_value_based_offpolicy_divergence} for which we just want to estimate $v$ (policy evaluation). We assume the reward for any action to be $0$, and start with an initial value of $w=10$.
		
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.2\textwidth]{figures/rl_approximation_value_based_offpolicy_divergence.png}
		\caption{Simple MDP where off-policy updates can diverge. Green indicates the behavior policy, red the target policy. Every transition has a reward of $0$, meaning that the true value $v$ is 0 in both states.}
		\label{fig:rl_approximation_value_based_offpolicy_divergence}
	\end{figure}
	
	First, consider the on-policy case where $\pi=b$ (green action from second state). Then, we would alternate between the two update equations:
	\begin{equation*}
		\begin{split}
		\text{Left to right: }\hspace{2mm}w_{t+1} & = w_t + \alpha (\gamma \cdot 2w_t-w_t)\nabla_w w_t = (1+\alpha(2\gamma-1)) w_t\\
		\text{Right to left: }\hspace{2mm}w_{t+1} & = w_t + \alpha (\gamma w_t-2w_t)\nabla_w 2w_t = (1+2\alpha(\gamma-2)) w_t\\
		\end{split}
	\end{equation*}
	Overall, we would converge to $w=0$: the right-to-left update always shrinks $w$ (as $\gamma-2<0$), and it does so more strongly than the left-to-right update can grow it. For small $\alpha$, one round trip multiplies $w$ by approximately $1+\alpha(4\gamma-5)<1$.
	
	Now, assume the behavior policy stays the same, but our target policy always stays in the second state. Then, the importance weight for left to right is 1 (as both policies take this action with probability 1), but for right to left it is zero because we would never take this action with our target policy. So we end up with only one update:
	\begin{equation*}
		\begin{split}
			\text{Left to right: }\hspace{2mm}w_{t+1} & = w_t + \alpha (\gamma \cdot 2w_t-w_t)\nabla_w w_t = (1+\alpha(2\gamma-1)) w_t\\
		\end{split}
	\end{equation*}
	which makes $w_t$ head to infinity if $\gamma>0.5$. This shows that off-policy prediction can diverge!
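
	A short numerical illustration (the values of $\gamma$ and $\alpha$ are chosen arbitrarily here): with $\gamma=0.9$ and $\alpha=0.1$, every update multiplies $w$ by $1+0.1\cdot(2\cdot 0.9-1)=1.08$, so after ten updates $w$ has grown from $10$ to $10\cdot 1.08^{10}\approx 21.6$. This matches the fixed-point analysis of semi-gradient TD(0) for linear features: restricted to the only transition with a non-zero importance weight, the features are $x_t=1$ and $x_{t+1}=2$, so that
	$$A=x_t(x_t-\gamma x_{t+1})=1-2\gamma,\hspace{4mm}b=R_{t+1}x_t=0$$
	and the update $w_{t+1}=(1-\alpha A)w_t$ only moves $w$ towards the fixed point $w_{td}=0$ if $A>0$, i.e. $\gamma<0.5$.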
	
	\item This divergence can occur when the following three ingredients are combined (\textit{Deadly Triad}):
	\begin{itemize}
		\item Function approximation
		\item Semi-gradient bootstrapping
		\item Off-policy training
	\end{itemize}
	\item To overcome this issue, we need to consider alternatives to semi-gradients.
	
\end{itemize}
\subsubsection{Alternatives to semi-gradients}
\begin{itemize}
	\item There are a couple of objectives that we can use instead of the semi-gradient. We visualize all of them in Figure~\ref{fig:rl_approximation_value_based_different_errors}, and discuss them here in detail
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/rl_approximation_value_based_different_errors.png}
		\caption{Geometry of linear value-function approximation. We show value functions over three states (a 3D space) that are approximated with a two-dimensional weight vector.}
		\label{fig:rl_approximation_value_based_different_errors}
	\end{figure}
	
	\item Before we start our discussion, we need to introduce some notation:
	\begin{itemize}
		\item First, we need to consider how we measure the distance between two value functions. The standard Euclidean norm is not sufficient, as we assign different importance to different states. This is why we take $\mu$ into account:
		$$||v_1-v_2||_{\mu}^2 = \sum_{s\in\mathcal{S}} \mu(s)\left[v_1(s)-v_2(s)\right]^2$$
		\item Given the norm, we also want to define a \textit{projection operator} which assigns to an arbitrary $v$ (over the whole state space) the closest representable value function according to this norm:
		$$\Pi v = v_{\bm{w}}\hspace{3mm}\text{where}\hspace{3mm}\bm{w}=\arg\min_{\tilde{\bm{w}}}||v-v_{\tilde{\bm{w}}}||_{\mu}^2 $$
		\item The last notation we want to introduce is the Bellman operator, which maps a value function $v$ to its bootstrapping estimates:
		$$(B_{\pi}v_{\bm{w}})(s) = \sum_a \pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\bm{w}}(s')] = v_{\bm{w}}(s) + \overline{\delta}_{\bm{w}}(s)$$
		with $\overline{\delta}_{\bm{w}}(s)$ being the expected TD error for state $s$.
	\end{itemize}
	\item The value error $\overline{\text{VE}}$ is minimized by the representable value function that is closest to $v_{\pi}$ under this norm: $$\min_{\bm{w}} \overline{\text{VE}}(\bm{w}) = \min_{\bm{w}} ||v_{\bm{w}}-v_{\pi}||_{\mu}^2 $$
	The minimizer is the projection $\Pi v_{\pi}$, which is the best point we can represent in our $\bm{w}$-space. Gradient Monte Carlo methods converge to this point, but mostly quite slowly
	\item Without the approximation, we could simply apply the Bellman operator over and over again and reach $v_{\pi}$ (as in tabular TD(0) learning), which is the gray line above. However, the result of the Bellman operator is in general not representable, so that we have to project $v$ back after each step: $\Pi B_{\pi}v_{\bm{w}}$. The step we take in between is the projected Bellman error $\text{PBE}=\Pi\overline{\delta}_{\bm{w}}$.
	
	Semi-gradient TD converges to the point where $\text{PBE}=0$, as we reach a fixed point there. However, this does not have to be where the minimum Bellman error is reached. Imagine, for example, $\overline{\delta}_{\bm{w}}$ being orthogonal to the $\bm{w}$-subspace: the projected Bellman error is then 0, but without the projection we would continue changing $\bm{w}$ until we reach $\min \overline{\text{BE}}$.
	
	At the same time, even if we reached $\min \overline{\text{BE}}$, it would most likely not be a fixed point of the projected update, because the Bellman error there can point outside the representable $\bm{w}$-subspace (it does not need to be orthogonal as before), and hence the projected Bellman error can be non-zero.
	
	\item The last objective we consider here is the true-gradient TD error, meaning: $$\overline{\text{TDE}}(\bm{w})=\sum_{s\in\mathcal{S}}\mu(s)\E\left[\delta_t^2 |S_t=s,A_t\sim \pi\right] = \E_{b}[\rho_t \delta_t^2] \hspace{4mm}\text{(if we assume $\mu$ is under $b$)}$$
	Following SGD updates, we get:
	$$\bm{w}_{t+1}=\bm{w}_t + \alpha \rho_t \delta_t (\nabla \hat{v}(S_t,\bm{w}_t) - \gamma \nabla \hat{v}(S_{t+1},\bm{w}_t))$$
	
	\item Now, let's consider which of these objectives we can take as alternative to semi-gradient updates. 
	
	A major drawback of TDE is that we also take the gradient with respect to the value of the next state, which can push the value function in a wrong direction (it tries to minimize the difference between the values of successive states, similar to Figure~\ref{fig:rl_approximate_value_based_semi_gradient_td}). Consider the MDP in Figure~\ref{fig:rl_approximation_value_based_TDE}, with on-policy evaluation of a uniform policy, and $\gamma=1$.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.2\textwidth]{figures/rl_approximation_value_based_TDE.png}
		\caption{Simple example where TDE gives an undesirable result.}
		\label{fig:rl_approximation_value_based_TDE}
	\end{figure}

	The correct value function is obviously $v(A)=1/2, v(B)=1, v(C)=0$. The expected sum of squared TD errors per episode, which is proportional to $\overline{\text{TDE}}$, is given by:
	$$\E\left[\sum_t\delta_t^2\right]=\frac{1}{2}\left(\left[v(B)-v(A)\right]^2 + \left[1-v(B)\right]^2\right)+\frac{1}{2}\left(\left[v(C)-v(A)\right]^2 + \left[0-v(C)\right]^2\right)$$
	for which the minimum is actually at $v(A)=1/2, v(B)=3/4, v(C)=1/4$ because we also minimize the distance between $v(A)$ and $v(B)$, and similarly between $v(A)$ and $v(C)$.
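
	To verify this, we can write $J$ for the expression above (a name introduced only for this check) and set its partial derivatives to zero:
	\begin{equation*}
		\begin{split}
			\frac{\partial J}{\partial v(B)} & = \left[v(B)-v(A)\right]-\left[1-v(B)\right]=0 \implies v(B)=\frac{1+v(A)}{2}\\
			\frac{\partial J}{\partial v(C)} & = \left[v(C)-v(A)\right]+v(C)=0 \implies v(C)=\frac{v(A)}{2}\\
			\frac{\partial J}{\partial v(A)} & = -\left[v(B)-v(A)\right]-\left[v(C)-v(A)\right]=0 \implies v(A)=\frac{v(B)+v(C)}{2}\\
		\end{split}
	\end{equation*}
	Inserting the first two results into the third gives $v(A)=\frac{1+2v(A)}{4}$, hence $v(A)=1/2$, $v(B)=3/4$ and $v(C)=1/4$. At this point we have $J=1/8$, whereas the true values give $J=1/4$, so minimizing this objective indeed prefers the wrong value function.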
	
	\item For minimizing the Bellman error ($\min \overline{\text{BE}}$), we need to calculate:
	$$\overline{\text{BE}}=||\overline{\delta}_{\bm{w}}||_{\mu}^2\hspace{2mm}\text{where}\hspace{2mm}\overline{\delta}_{\bm{w}}=\E_{\pi}\left[\delta_{\bm{w}}|S_t=s,A_t\sim\pi\right]$$
	As the expectation is squared in the error, an unbiased estimate requires at least two independent samples of the transition from the same state (otherwise we estimate $\E[\delta^2]$ instead of $\E[\delta]^2$). This is mostly not possible when interacting with a real environment, or makes the algorithm rather slow.
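
	The bias can be made explicit with the usual variance decomposition (conditioning on $S_t=s$):
	$$\E\left[\delta_{\bm{w}}^2|S_t=s\right]=\E\left[\delta_{\bm{w}}|S_t=s\right]^2+\text{Var}\left[\delta_{\bm{w}}|S_t=s\right]$$
	so a single squared sample overestimates $\overline{\delta}_{\bm{w}}(s)^2$ by the variance of the TD error. With two independent samples $\delta'$ and $\delta''$ of the transition from the same state (notation introduced here), the product is unbiased: $\E[\delta'\delta''|S_t=s]=\overline{\delta}_{\bm{w}}(s)^2$.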
	
	\item The last remaining objective is the mean squared projected Bellman error $\overline{\text{PBE}}$. We can arrive at the following gradient for PBE:
	$$\nabla_{\bm{w}} \overline{\text{PBE}}(\bm{w}) = 2\E[\rho_t (\gamma \bm{x}_{t+1}-\bm{x}_t)\bm{x}_t^T]\E[\bm{x}_t\bm{x}_t^T]^{-1}\E[\rho_t\delta_t\bm{x}_t]$$
	Using the same samples for all the expectations gives the same bias as the one for the Bellman error. What we can do, however, is to learn some of the factors from all the data, namely the product of the last two, which we denote as $\bm{v}\approx\E[\bm{x}_t\bm{x}_t^T]^{-1}\E[\rho_t\delta_t\bm{x}_t]$. Then, we can perform SGD as:
	\begin{equation*}
		\begin{split}
			\bm{v}_{t+1} & = \bm{v}_t + \beta \rho_t (\delta_t - \bm{v}_t^T \bm{x}_t)\bm{x}_t\\
			\bm{w}_{t+1} & = \bm{w}_t + \alpha \rho_t (\bm{x}_t-\gamma \bm{x}_{t+1})\bm{x}_t^T\bm{v}_t
		\end{split}
	\end{equation*}
	This algorithm is called GTD2 (gradient TD), and it converges to the minimum PBE for linear features. The drawbacks are that we need an additional learning rate $\beta$ (mostly greater than $\alpha$), and that we need to store and update two parameter vectors.
	
	Nevertheless, as we have a guarantee of convergence for all settings, it makes GTD2 the preferred technique compared to Semi-gradient TD, except when we just want a simple method.
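
	To see why the update of $\bm{v}_t$ in GTD2 learns the desired quantity (a sketch for a fixed $\bm{w}$), note that its expected change vanishes when
	$$\E\left[\rho_t(\delta_t-\bm{v}^T\bm{x}_t)\bm{x}_t\right]=\bm{0}\iff \E\left[\rho_t\bm{x}_t\bm{x}_t^T\right]\bm{v}=\E\left[\rho_t\delta_t\bm{x}_t\right]$$
	Because $\bm{x}_t$ only depends on $S_t$ and $\E_b[\rho_t|S_t]=1$, we have $\E[\rho_t\bm{x}_t\bm{x}_t^T]=\E[\bm{x}_t\bm{x}_t^T]$, which recovers $\bm{v}=\E[\bm{x}_t\bm{x}_t^T]^{-1}\E[\rho_t\delta_t\bm{x}_t]$.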
	\item Overall, we have the following convergence properties:
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/rl_approximate_value_based_convergence_overview.png}
		\caption{Overview of convergence properties of different optimization methods. The columns show the setting (on=on-policy, off=off-policy). ``No C.'' for semi-gradient TD means that we cannot guarantee its convergence. ``N.A'' stands for ``not applicable'', as LSTD is based on the assumption of linear features.}
	\end{figure}	
\end{itemize}
\subsubsection{Deep Q network}
\begin{itemize}
	\item Another way of stabilizing off-policy control is to use several additional tricks that make training more similar to supervised learning. One popular example of this is the Deep Q-Network (DQN)
	\item Given a state as input, the network outputs one $q$-value per action, so that we can perform a simple maximization step over the outputs to get the greedy policy
	\item We use images as input. However, to detect movement, a couple of consecutive frames are stacked on top of each other
	\item To obtain (approximately) i.i.d. samples within a batch, and to use data more efficiently (look at an experience more than once), we use \textbf{experience replay}:
	\begin{itemize}
		\item All the experiences we had from interacting with the environment are stored in a buffer (if limited size, use FIFO queue)
		\item At every time step, randomly select $N$ experiences which form a batch for training
	\end{itemize}
	Note that this is only possible because of off-policy training, as the collected experiences come from a different policy, namely an older one.
	\item Another trick to stabilize learning is \textbf{fixing the target}. We use the semi-gradient version of $q$-learning, which is:
	$$\bm{w}_{t+1}\leftarrow \bm{w}_t + \alpha \left[R_{t+1}+\gamma\max_a \hat{q}(S_{t+1}, a, \bm{w}_t) - \hat{q}(S_t, A_t, \bm{w}_t)\right]\nabla \hat{q}(S_t, A_t, \bm{w}_t)$$
	To fix the target, we keep a copy $\tilde{\bm{w}}$ of the weights that is only synchronized with $\bm{w}_t$ from time to time, and use it to calculate the target $R_{t+1}+\gamma\max_a \hat{q}(S_{t+1}, a, \tilde{\bm{w}})$.
	
	Furthermore, it has been shown to work well to clip the TD error between a range of $[-1,1]$ to prevent any divergence issues. 
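
	Putting experience replay and the fixed target together, one way to write the resulting objective (a sketch; $\mathcal{D}$ denotes the replay buffer, a symbol introduced here for illustration, and clipping is left out) is
	$$L(\bm{w})=\E_{(s,a,r,s')\sim\mathcal{D}}\left[\left(r+\gamma\max_{a'}\hat{q}(s',a',\tilde{\bm{w}})-\hat{q}(s,a,\bm{w})\right)^2\right]$$
	where gradients are only taken with respect to $\bm{w}$, and $\tilde{\bm{w}}$ is treated as a constant.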
	\item If needed/wanted, we can reduce the maximization bias by using a double Q-learning approach
\end{itemize}


================================================
FILE: Reinforcement_Learning/rl_mcts_alpha_go.tex
================================================
%\section{Monte-Carlo Tree Search and Alpha Go}
%\label{sec:MCTS_Alpha_Go}
%\textit{This section reviews the lecture slides 12.}
%\begin{itemize}
%	\item In Section~\ref{sec:value_based_approximation}, we have seen that to learn a value function for problems with very large state space, we can approximate our $q$-function by e.g. a neural network. However, these approximations will always contain a certain amount of noise/inaccuracy.
%	\item An alternative approach is to learn $q(s,a)$ \underline{online}. The simplest approach for this is to perform rollouts 
%\end{itemize}


================================================
FILE: Reinforcement_Learning/rl_model_based.tex
================================================
\section{Model-based Reinforcement Learning}
\label{sec:model_based}
\textit{This section reviews the lecture slides 11 and 12.}
\begin{itemize}
	\item We have seen that, given the environment dynamics, we can find the optimal policy by dynamic programming. All the methods after that learned purely from interactions. We now want to take a step in between and try to learn the model dynamics itself, i.e. $p(s'|s,a)$ and $r(s,a,s')$. If we have that, we can plan by simulating in our learned model.
	\item There are several benefits of this approach:
	\begin{itemize}
		\item We require fewer interactions with the real environment, as we can generate new data from simulation. This is especially helpful when real data is expensive (whether in time, computational resources, etc.), as in real-life robotic systems where a single rollout takes a long time
		\item We can obtain probability distributions which tell us how likely we are to end up in a state when we take a certain action. This can be very helpful in some cases: e.g. in the (slippery) cliff world example, we would know how likely it is that we actually fall off the cliff even if we take the right action.
	\end{itemize}
	However, when these things are not required, model-free methods mostly work better and/or are computationally cheaper/simpler. Furthermore, they avoid any bias we might get when our model is inaccurate.
	\item In general, we distinguish between three types of models that try to imitate the real environment:
	\begin{itemize}
		\item A \textbf{full} or \textbf{distributional model} is a full description of all transition probabilities and rewards. 
		\item A \textbf{sample} or \textbf{generative model} can be viewed as a black-box simulator which, given any state $s$ and action $a$, can sample a reward $r$ and a next state $s'$.
		\item A \textbf{trajectory} or \textbf{simulation model} can simulate whole episodes, but is not able to start at any state and action. This is for example the case for a physical model where we cannot start with an arbitrary velocity.
	\end{itemize}
	These three models can be seen as generalization steps. The most limited implementation is the trajectory one. If we provide the ability of changing the start state to any arbitrary state, we arrive at the generative model. Adding the probabilities $p(s',r|s,a)$ gives us in the end the distributional model.
	\item There are several ways of implementing this. We will consider here a simple method, called Dyna
\end{itemize}
\subsection{Dyna-Q}
\begin{itemize}
	\item Dyna makes two assumptions about the environment:
	\begin{enumerate}
		\item Our environment is deterministic, meaning that any transition probabilities are either 1 or 0. 
		\item The state and action space is discrete and limited, so that we can store it in a tabular setting.
	\end{enumerate}
	Note that we can relax the first requirement slightly by storing e.g. how often we came from one state-action pair to another state.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/rl_model_based_dyna_Q.png}
		\caption{Real experience is generated by the interaction of the agent (according to the policy) and the environment. This real data is used to update our policy (direct RL), but at the same time update our model, from which we can generate new samples to learn from (indirect RL).}
		\label{fig:rl_model_based_dyna_Q}
	\end{figure}
	\item The general overview of the idea is shown in Figure~\ref{fig:rl_model_based_dyna_Q}. We have two sources from which we train our policy and/or value function: direct and indirect. The samples from the real environment are used to perform ``direct Reinforcement Learning'' as we use the actual samples to learn. At the same time, we use the real samples to update our model, and can generate from there as many samples as we want.
	\item Written down as an algorithm, we arrive at Figure~\ref{fig:rl_model_based_dyna_Q_algorithm}. The parameter $n$, which specifies how often we train from the simulated/learned model compared to the real environment, is a hyperparameter and depends on how accessible the environment is (how expensive interactions are, etc.). Usually, we have $n\gg 1$.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.65\textwidth]{figures/rl_model_based_dyna_Q_algorithm.png}
		\caption{Example implementation of the algorithm of Dyna-Q. We use a tabular-based setting to store transition and rewards. Note that we will discuss the design choices in this part more in detail.}
		\label{fig:rl_model_based_dyna_Q_algorithm}
	\end{figure}
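
	Assuming the standard one-step tabular Q-learning update (the exact notation in the figure may differ), both the direct RL step and each of the $n$ planning steps apply
	$$Q(S,A)\leftarrow Q(S,A)+\alpha\left[R+\gamma\max_{a}Q(S',a)-Q(S,A)\right]$$
	The only difference is the source of $(S,A,R,S')$: in direct RL it is the transition just experienced in the real environment, whereas in planning $(S,A)$ is a previously visited pair and $(R,S')$ is the prediction of the learned model.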
	\item Looking at Figure~\ref{fig:rl_model_based_dyna_Q_algorithm}, we see that there are still quite a few design decisions to make, which we will go through step-by-step:
	\begin{enumerate}
		\item Step (e) - How do we learn our model?
		\item Step (f) - When should we use our simulated environment? Can we also use it somewhere else than the update step?
		\item Step (f) loop - Which state and action should we choose to update?
		\item Step (f) loop - How do we update e.g. our values or policy?
	\end{enumerate} 
\end{itemize}
\subsubsection{How to learn the model}
\begin{itemize}
	\item As mentioned previously, Dyna uses a tabular setting to learn the model. Hence, we have a table over state and actions, where an entry contains the information of the next state and reward
	\item In case we have stochastic transitions, we have to slightly adjust our table. We now create a table over $(s,a,s')$ where we store how often we experienced this transition, and what reward we got. In case we also have stochastic rewards, we need to extend the table further.
	
	When sampling, we first have to normalize the probabilities for every $s'$ (and $r$) to occur when $(s,a)$ is given, and finally sample from this distribution.
	
	An example table is shown below
	\begin{table}[ht!]
		\centering
		\begin{tabular}{c|cc}
			& \textit{State 1} & \textit{State 2}\\
			\hline
			\textit{State 1, Action 1} & $\eta_{111}=1, r_{111}=2$ (50\%) & $\eta_{112}=1, r_{112}=-1$ (50\%)\\
			\textit{State 1, Action 2} & $\eta_{121}=5, r_{121}=5$ (100\%) & $\eta_{122}=0, r_{122}=0$ (0\%) \\
			\textit{State 2, Action 1}  & $\eta_{211}=4, r_{211}=-4$ (80\%) & $\eta_{212}=1, r_{212}=2$ (20\%)  \\
			\textit{State 2, Action 2} & $\eta_{221}=4, r_{221}=-2$ (40\%) & $\eta_{222}=6, r_{222}=1$ (60\%) \\
		\end{tabular}
		\caption{Example table for stochastic transitions and deterministic rewards.}
	\end{table}
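
	For instance, reading the counts for \textit{State 2, Action 1} from the table, the normalized transition probabilities are
	$$\hat{p}(s'=1|s=2,a=1)=\frac{\eta_{211}}{\eta_{211}+\eta_{212}}=\frac{4}{5}=0.8,\hspace{4mm}\hat{p}(s'=2|s=2,a=1)=\frac{\eta_{212}}{\eta_{211}+\eta_{212}}=\frac{1}{5}=0.2$$
	and the expected reward for this pair under the learned model is $0.8\cdot(-4)+0.2\cdot 2=-2.8$.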
\end{itemize}
\subsubsection{What to update}
\begin{itemize}
	\item To avoid spending too much computational effort on state-action pairs that are not relevant for the current/optimal policy, we should make smarter selections
	\item One approach is \textbf{prioritized sweeping}, where we prefer those state-action pairs that lead to a state whose value has just been updated (whether with real or simulated experience)
	\begin{itemize}
		\item The priority in the queue is given by the TD error we would get at the time we add the state-action pair to the queue. This ensures that pairs with high errors, i.e. wrong estimates, are updated first.
		\item To limit the queue size, we can define a threshold $\theta$ which the TD error has to exceed for a pair to be added to the queue
		\item In the simulation step (indirect RL), we perform updates based on the queue until either the queue is empty, or we reached a maximum of $n$ steps. If the queue is non-empty, it is kept for the next iteration as well.
		\item For this method, we require at least a sample/generative model because we need to be able to start at any state-action pair
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.5\textwidth]{figures/rl_model_based_dyna_prioritized_sweeping.png}
			\caption{Algorithm of Prioritized Sweeping.}
		\end{figure}
	\end{itemize}
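
	Assuming a $q$-value based variant (as in Dyna-Q), the priority of a pair $(S,A)$ with model prediction $(R,S')$ can be written as
	$$P=\left|R+\gamma\max_a Q(S',a)-Q(S,A)\right|$$
	and the pair is inserted into the queue only if $P>\theta$.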
	\item An alternative is performing \textbf{trajectory sampling} where we start from the start state (or sample one if multiple exist), and follow our current policy. 
	\begin{itemize}
		\item While this focuses the updates on the more frequently visited states, it has the disadvantage of limited exploration because we concentrate on states from the on-policy distribution
		\item Hence, if we have a (close to) deterministic environment, trajectory sampling might work well, but in a stochastic environment where we continuously have to explore, it might perform worse than uniformly sampling any state-action pair 
		\item For this method, we only require a trajectory model, making it less complex
	\end{itemize}
\end{itemize}
\subsubsection{How to update}
\begin{itemize}
	\item For updating our $q$/$v$/policy, we can use any of the methods we have discussed before. 
	\item However, remember that for some methods like dynamic programming, we can make use of the model dynamics. Nevertheless, this might not be the most efficient computation, especially when we have many possible next steps. Remember that we have to take \underline{all} next states into account for dynamic programming, although some can be neglected if we have a small weight on them. 
	\item In addition, we would require the full/distributional model to perform this kind of update, whereas the others work with either generative or trajectory models
\end{itemize}
\subsubsection{When to plan}
\begin{itemize}
	\item Currently, we only use the learned model to generate new samples for training
	\item Knowing the system dynamics can however be valuable beyond this. For example, we can easily plan ahead by trying out different actions and observing the reward in simulation. Afterwards, we take the actions in the real world which gave the best result in the simulation
	\item This idea is used in Monte Carlo Tree Search algorithms, which we will discuss in more detail in Section~\ref{sec:MCTS_Alpha_Go}. 
\end{itemize}
\subsection{Model-based policy search}
\begin{itemize}
	\item In the previous discussion, we mainly focused on value-based updates. However, we could of course use policy-based methods as well.
	\item Again, the decision of whether to use policy-based or value-based methods depends on multiple factors. For example, if we need to learn a stochastic policy, or we have continuous actions, then we might want to use policy-based methods. In the case that we have discrete actions and aim for learning a deterministic, greedy policy, value-based methods are more suited because for policy gradient we require the policy to be smoothly changeable/differentiable.
	\item Let's assume that the reward is known for now (e.g. we have defined the reward for a problem ourselves). Then, we could reformulate the transition function as:
	$$s_{t+1} = f(s_t, a_t) + w$$
	where $f(s_t, a_t)$ is a deterministic function that maps a state-action pair to a new state, and $w$ is additive noise (e.g. Gaussian for continuous states). To ensure this formulation to work well, we would require a mostly deterministic environment, as otherwise $f(s_t, a_t)$ cannot model the different outcomes.
	\item We can then train by
	\begin{enumerate}
		\item Use any model-free policy-based approach (e.g. TRPO or DDPG) to learn from the real world.
		\item Use the extra information from the environment, e.g. for richer gradient information
		\begin{itemize}
			\item As we now model the transition function $s_{t+1}\approx f(s_t, a_t)$, we know how the next state will change if we change our parameters.
			\item This allows us to look at the gradients over $s'$ and $a$, and find the best action more easily
			\item When we perform backpropagation through $v_{\theta}(s_t)$, we can (instead of sampling) also differentiate the rewards because they are a simple function of $s_{t+1}$, i.e. of $f(s_t,a_t)$. Hence, we can write:
			$\nabla_{\theta} v_{\theta}(s_t)=\nabla_{\theta} r_{t+1}+\gamma \nabla_{\theta} r_{t+2}+\dots$
		\end{itemize}
	\end{enumerate}
\end{itemize}
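
\noindent As a minimal sketch of where this richer gradient information comes from, assume (purely for illustration) a deterministic policy $a_t=\pi_{\theta}(s_t)$, a known differentiable reward $r_{t+1}=r(s_{t+1})$, and a differentiable learned model $f$ with the noise $w$ ignored. Starting from a fixed state $s_t$, the chain rule gives the gradient of the first reward:
$$\nabla_{\theta} r_{t+1} = \frac{\partial r(s_{t+1})}{\partial s_{t+1}}\,\frac{\partial f(s_t,a_t)}{\partial a_t}\,\nabla_{\theta}\pi_{\theta}(s_t)$$
Later rewards depend on $\theta$ both through the action and through the preceding state, so the state sensitivity is propagated recursively (all partial derivatives evaluated at $(s_{t+1},a_{t+1})$):
$$\frac{d s_{t+2}}{d\theta} = \frac{\partial f}{\partial s}\frac{d s_{t+1}}{d\theta} + \frac{\partial f}{\partial a}\left(\nabla_{\theta}\pi_{\theta}(s_{t+1}) + \frac{\partial \pi_{\theta}}{\partial s}\frac{d s_{t+1}}{d\theta}\right)$$
This is ordinary backpropagation through time applied to the model, and it is only as accurate as the learned $f$.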
\subsection{Monte-Carlo Tree Search and Alpha Go}
\label{sec:MCTS_Alpha_Go}
\begin{itemize}
	\item In Section~\ref{sec:value_based_approximation}, we have seen that to learn a value function for problems with very large state space, we can approximate our $q$-function by e.g. a neural network. However, these approximations will always contain a certain amount of noise/inaccuracy.
	\item An alternative approach is to learn $q_{\pi}(s,a)$ \underline{online}. The simplest approach, when we are given a model, is to perform a couple of rollouts from the state $s_t$ with our current policy $\pi$. $q_{\pi}(s_t,a)$ can then be estimated by the mean of the experienced returns $G_t$.
	
	Playing the best action according to these estimates (e.g. estimated $q_{\pi}(s_t,a_1)$, $q_{\pi}(s_t,a_2)$, etc.) is at least as good as sampling $a\sim \pi(s)$ (in expectation, i.e. up to estimation noise). If $\pi$ were already the optimal policy, its action $a$ would also be the argmax of the estimates.
\end{itemize}
\subsubsection{Monte-Carlo Tree Search}
\begin{itemize}
	\item If we are given a full model description including the dynamics, we could simply extend the previous approach by taking all possible futures into account. However, this is infeasible for games like Go because there is a huge number of possible outcomes (a full game tree has about $10^{170}$ different states).
	\item Nevertheless, a lot of this computation might not be necessary. Instead, we can focus on the most promising subtree, which only contains a small selection of possible outcomes. This leads us to the Monte-Carlo Tree Search (MCTS) algorithm.
	\item In MCTS, we build a tree incrementally by repeating four steps for $n$ iterations (limited by computational resources, time, etc.), visualized in Figure~\ref{fig:rl_model_based_MCTS}:
	\begin{enumerate}
		\item \textbf{Selection}
		
		Given a subtree, we need to decide at which point we want to expand it. This is defined by our \textit{tree policy} $\pi_{\text{tree}}$, and can be for example the upper confidence bound (similarly to choosing the next action in a bandit setting):
		$$\pi_{\text{tree}}(s)=\arg\max_{a} \left[Q(s,a)+c\sqrt{\frac{\ln N(s)}{N(s,a)}}\right]$$
		with $N(s)$ as the number of visits to $s$, and $N(s,a)$ the number of times we took $a$ in $s$. We follow the tree policy until we reach a leaf node.
		\item \textbf{Expansion}
		
		After deciding at which node we want to "grow" the tree, we need to expand it. This means that we add a new leaf, which is an action in case of $q$, or a state in case of $v$ (we always add a state-action pair, just ordering is different). We initialize it with the values $N(s)=0$, $N(s,a)=0$ in case of UCB.
		\item \textbf{Simulation}
		
		From the newly added node, we perform a rollout. This means that starting from the leaf node, we interact with the environment according to the current policy $\pi$ until terminating. 
		\item \textbf{Backup}
		
		After finishing simulation, we update our estimates based on the newly observed return. Note that we update the $q$/$v$-values for each node which led to the leaf, while taking a possible discount factor $\gamma$ into account. In case of UCB, this means that we increase $N(s)$ and $N(s,a)$ by one, as well as adding a new point to $Q$ for averaging (e.g. use running average).
	\end{enumerate}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/rl_model_based_MCTS.png}
		\caption{Visualization of the four steps in MCTS: Selection, Expansion, Simulation, Backup.}
		\label{fig:rl_model_based_MCTS}
	\end{figure}
	\item Note that we have seen two requirements of MCTS along the way: (1) We need a generative model that we can start at any $(s,a)$. Alternatively, if our environment is deterministic, we can also make trajectory models work by replaying the same actions from the root. (2) We assume $q(s,a)$ to be storable in a table.
	\item If we want to use MCTS for planning, we can choose our policy based on the visit counts, $\pi(a|s)\propto N(s,a)^{1/\tau}$ with temperature $\tau$. Note that we don't choose the policy according to the $q$-values because they are estimates, and potentially very noisy as we have a different number of samples for each action.
	\item After taking a step, we can reuse the selected branch of the tree, and don't have to start from scratch again
\end{itemize}
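
\noindent To make the selection rule concrete, here is a small worked example with made-up numbers (the values of $Q$, $N$ and $c$ are illustrative assumptions, not taken from the lecture). Suppose a node $s$ has been visited $N(s)=10$ times and has two actions with $Q(s,a_1)=0.6$, $N(s,a_1)=7$ and $Q(s,a_2)=0.4$, $N(s,a_2)=3$, and let $c=1$. With $\ln 10\approx 2.30$, the UCB scores are:
\begin{equation*}
	\begin{split}
		a_1: & \hspace{3mm} 0.6+\sqrt{2.30/7}\approx 0.6+0.57 = 1.17\\
		a_2: & \hspace{3mm} 0.4+\sqrt{2.30/3}\approx 0.4+0.88 = 1.28
	\end{split}
\end{equation*}
The tree policy therefore selects $a_2$ although its current value estimate is lower, because it has been tried less often. If the subsequent rollout returns $G=1$ (with $\gamma=1$), the backup with a running average gives $N(s)=11$, $N(s,a_2)=4$ and $Q(s,a_2)\leftarrow 0.4+\frac{1}{4}(1-0.4)=0.55$.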
\subsubsection{AlphaGo Zero}
\begin{itemize}
	\item For estimating the $q$-values, MCTS uses full Monte Carlo samples. However, we know that we can also use TD learning for it, meaning we bootstrap our estimates. Using this idea, two separate networks were used in AlphaGo: a policy network $\pi_{\theta}(a|s)$, and a value network $v_{\theta}(s)$.
	\item We can now look at the changes AlphaGo makes in the MCTS algorithm:
	\begin{enumerate}
		\item \textbf{Selection}
		
		We define our tree policy as:
		$$\pi_{\text{tree}}(s)=\arg\max_a \left[Q(s,a)+cU(s,a)\right], \hspace{5mm}U(s,a)=\frac{\pi_{\theta}(a|s)}{1+N(s,a)}$$
		Note that this is a heuristic found empirically, as it adds $q$-values and probabilities.
		
		\item \textbf{Expansion}
		
		When reaching a leaf node, we evaluate the value network $v_{\theta}(s)$ for this specific state, and expand the state by all its possible actions.
		
		\item \textbf{Simulation}
		
		In the original AlphaGo, the leaf evaluation combines $v_{\theta}(s)$ with the outcome of a simulated rollout. However, in the newer version AlphaGo Zero, we fully rely on $v_{\theta}(s)$.
		
		\item \textbf{Backup}
		
		Using $v_{\theta}(s)$, we update all $q$ values above.
	\end{enumerate}  
	\item Our policy network $\pi_{\theta}$ limits our search in width because we select values based on its prior. The value network $v_{\theta}(s)$ limits the search in depth because we don't have to sample anymore.
	\item This approach works well if we have (1) a discrete state space, (2) a fully observable environment, and (3) a deterministic environment.
	\item We train the network by self-play. The policy network tries to predict the outcome of the tree search (how often will we choose action $a$ at state $s$), and the value network tries to predict the return we get after the full rollout. See Figure~\ref{fig:rl_model_based_alphago_zero_selfplay} for a visualization of the self-play learning.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/rl_model_based_alphago_zero_selfplay.png}
		\caption{Self-play RL in AlphaGo Zero.}
		\label{fig:rl_model_based_alphago_zero_selfplay}
	\end{figure}
	\item Nevertheless, the training might not be 100\% stable. Occasionally, the network diverges. To prevent this, we evaluate the network every $n$ steps by letting it play against an older version of itself. If the policy did not improve (i.e. it loses more games than it wins against the older version), we throw away the new model and start again from the old weights.
\end{itemize}
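
\noindent The two training targets described above can be written as a single loss. The following is a sketch in the notation of these notes, following the form of the loss reported in the AlphaGo Zero paper; the symbols $\pi_{\text{MCTS}}$, $z$ and $c_{\text{reg}}$ are introduced here for illustration. Let $\pi_{\text{MCTS}}(a|s)\propto N(s,a)^{1/\tau}$ be the search policy at state $s$, and let $z\in\{-1,+1\}$ be the final outcome of the self-play game from the perspective of the player to move in $s$. Then:
$$\mathcal{L}(\theta) = \left(z - v_{\theta}(s)\right)^2 - \sum_a \pi_{\text{MCTS}}(a|s)\ln \pi_{\theta}(a|s) + c_{\text{reg}}\,\lVert\theta\rVert^2$$
The first term regresses the value estimate onto the game outcome, the second term is a cross-entropy that pulls the policy output towards the visit-count distribution of the tree search, and the last term is standard L2 weight regularization.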


================================================
FILE: Reinforcement_Learning/rl_partially_observable.tex
================================================
\section{Partially observable environments and Bayesian methods}
\label{sec:partially_observable}
\textit{This section reviews the lecture slides 13.}
\begin{itemize}
	\item Until now, we always assumed a fully observable environment. However, this is often not the case, including in real life: we cannot see what is happening behind walls or 1000km away, so we are \textit{living} in a partially observable environment.
	
	Partial observability can arise from state aliasing (we receive the same observation twice although, if the environment were fully observed, it would be clear that we are in two different states), or simply from noise.
	\item First, let's consider a generalization of our environment. We define a latent state in the environment $x'_t$ which captures all information about the true state. From this latent state, we can make an observation $o_t$ per state, which can be seen as measurement of an unknown quantity.
	
	Note that in fully observable environments, we have $s_t=o_t=x'_t$.
	\item A simple approach would be to consider an observation $o_t$ as features from the latent state, and hence use approaches from Section~\ref{sec:value_based_approximation}. But this is usually not sufficient.
	\item In most environments, we can infer information about the latent state by looking at the history $H_t=A_0,O_1,A_1,...,A_{t-1},O_t$, and we choose our next action based on the whole history, $A_t=\pi(H_t)$.
	\item However, using the full history is neither efficient nor practical (it grows over time). Hence, it is better to use a lower-dimensional feature representation of the history, $f(H_t)$, and use this as the internal state of the policy, $s_t = f(H_t)$ (note that $s_t$ has a slightly different meaning here because it is the state which the policy sees, not what we get back from the environment).
	\item The best function $f$ would be the one that summarizes all important information. We can define what this means as follows:
	$$f(H_t)=f(H'_t)\implies \Prob{O_{t+1}=o|H=H_t,A_t=a}=\Prob{O_{t+1}=o|H=H'_t,A_t=a}$$
	In words: if the representations of two histories are the same, then the distribution over the next observation is the same for both histories. Hence, we would also choose the same action, which leads to the conclusion that the optimal policy can be found based solely on $f$.
	
	A function that fulfills this condition is called a \textit{Markov function}. For any function which is not Markovian, we can only find an approximately optimal policy, but often not the optimal one itself.
	\item Which function is Markovian depends highly on the environment, and will be discussed next.
\end{itemize}
\subsection{Markov functions and histories}
\begin{itemize}
	\item First, let's consider what we need to deal with a partially observable environment. Besides the environment and our policy-/value-based approach, we also have a state-update function (see Figure~\ref{fig:rl_partially_observability_architecture}). Our goal is to find a state-update function which is Markovian and efficient/compact.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/rl_partially_observability_architecture.png}
		\caption{Overview of the dynamics in dealing with a partially observable environment.}
		\label{fig:rl_partially_observability_architecture}
	\end{figure}
\end{itemize}
\subsubsection{Sample Markov functions}
\begin{itemize}
	\item The simplest function is the identity, meaning $s_t=H_t$. However, this is neither compact, nor can we use it in any tabular policy setting efficiently (all possible sequences must be stored).
	\item We can define a probability distribution over the latent space $X$, and try to do Bayesian inference (i.e. finding the posterior). This is done by calculating a \textbf{belief state}:
	\begin{equation*}
		\begin{split}
			s'(x')=p(x'|o',a,s)& =\frac{p(o'|x',a,s)p(x'|a,s)}{p(o'|a,s)}\\
			& = \frac{p(o'|x',a)p(x'|a,s)}{p(o'|a,s)}\hspace{5mm}\text{(removing $s$ as $x'$ is given as true state)}\\
			& = \frac{p(o'|a,x')\overbrace{\sum_x p(x'|x,a)s(x)}^{=p(x'|a,s)}}{\sum_{x'}p(o'|a,x')\sum_x p(x'|x,a)s(x)}
		\end{split}
	\end{equation*}
	where $s(x)$ is the old belief (i.e. belief over $x$ from last step). We can further define $p(o'|a,x')$ as the \underline{observation model} (i.e. what do I see from the latent space), and $p(x'|x,a)$ as the \underline{transition model} (i.e. how likely is it to move from one latent state to another). If we know these model dynamics by a full model description (or can estimate them), we have as state the probability distribution over latent state $x$.
	
	This method is the classical approach for POMDPs, as it is compact, can be updated recursively and is easily interpretable by a human. However, the disadvantages are that we need the underlying model (not always given), and that it is only feasible for a discrete latent state (otherwise sums become integrals etc.).
	\item As a last example, we can consider the obvious approach of determining all the observation probabilities:
	$$f(h)=\begin{bmatrix}
	f_{o_1a_1}(h)\\ f_{o_1a_2}(h)\\\vdots\\f_{o_2a_1}(h)\\\vdots\\
	\end{bmatrix}\hspace{5mm}\text{where}\hspace{5mm}f_{oa}(h)=\Prob{O_{t+1}=o|H=h,A_t=a}$$
	Given enough data, we can learn this distribution. Furthermore, we can extend this to longer trajectories like $\tau=a_0o_1a_1o_2a_2o_3$, and it can be proven that for a special set of ``core tests'' $\tau_1,\tau_2,...,\tau_d$, we can create a Markov state. This is called a \textbf{predictive state representation}.
	
	The advantage is that it is at least as compact as the belief state, since we only keep probabilities over observations and not over the latent space. However, it might be harder to interpret as we only have the probabilities of the ``core tests'', and it is still limited to the tabular setting.
\end{itemize}
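
\noindent As an illustration of the belief update, consider a toy problem with two latent states $x_A$ and $x_B$ (all numbers below are assumptions chosen for the example). Let the old belief be $s=(0.5, 0.5)$, the transition model for the chosen action $a$ be $p(x_A|x_A,a)=0.9$ and $p(x_A|x_B,a)=0.2$, and the observation model be $p(o'|a,x_A)=0.8$ and $p(o'|a,x_B)=0.3$. The update proceeds in two steps, prediction and correction:
\begin{equation*}
	\begin{split}
		p(x'=x_A|a,s) & = 0.9\cdot 0.5 + 0.2\cdot 0.5 = 0.55, \hspace{5mm} p(x'=x_B|a,s) = 0.45\\
		p(o'|a,s) & = 0.8\cdot 0.55 + 0.3\cdot 0.45 = 0.575\\
		s'(x_A) & = \frac{0.8\cdot 0.55}{0.575}\approx 0.77, \hspace{5mm} s'(x_B) = \frac{0.3\cdot 0.45}{0.575}\approx 0.23
	\end{split}
\end{equation*}
Observing $o'$, which is more likely under $x_A$, shifts the predicted belief of $0.55$ to about $0.77$ in favour of $x_A$.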
\subsubsection{Approximations with non-Markov functions}
\begin{itemize}
	\item Alternatively, we can also consider non-Markov functions which cannot guarantee to find the optimal policy, but at least an approximate one
	\item The simplest method here is just using the last observation, $S_t=O_t$. However, this might not contain all the information we need (e.g. in Atari games, movement cannot be captured), and is often not compact (we still have the whole screen).
	
	A slight improvement is stacking a few observations, as in Atari games. This allows us to observe movement, but we still lose long-term dependencies.
	\item We can also apply RNNs which take the new observation $O_t$, the previous action $A_{t-1}$ and the last state $S_{t-1}$ as input, and generate a new state $S_t$. This feature extractor can be learned end-to-end, and applied to a wide range of environments. However, the training can be tricky in terms of hyperparameter tuning.
\end{itemize}

\subsection{Partial observability and exploration}
\begin{itemize}
	\item We have seen that Markov functions such as the belief state explicitly represent the uncertainty about the latent state. Acting optimally on them can be considered as trying to take the actions that make us most certain about the latent state (while maximizing the reward).
	\item Hence, we can also consider this as an exploration strategy. If we assume that we know the sets of states and actions of our environment, we can try to learn the transition probabilities as well by adding them to our state. This leads us to a \textit{hyperstate}:
	$$x_{\text{POMDP}} = (s_{\text{MDP}}, \text{transition}, \text{rewards})$$
	where we consider a fully observable environment as partially observable by adding the transition and reward distributions. Now, we can simply apply POMDP techniques as we have discussed before, where $p(x'|a,x)$ is now modeled by our transition parameters $\theta$.
	\item For example, consider a simple environment with 2 states and 2 actions each. Our transition probabilities can be defined as a vector:
	$$\theta=(p_{11},p_{12},p_{21},p_{22})$$
	where for a prior, we assume a uniform distribution. By interaction, we change our belief towards what we have observed by using Bayes:
	$$p(\theta | x',x,a) = \frac{p(x'|x,a,\theta)p(\theta)}{\int_{\theta} p(x'|x,a,\theta)p(\theta)d\theta}$$
	So, if we observe $x'=s_1$, $x=s_1$, $a=1$, our new belief over $p_{11}$ is (all others stay the same):
	$$p(p_{11}|x'=s_1,x=s_1,a=1)=\frac{p(x'=s_1|x=s_1,a=1,p_{11})p(p_{11})}{\int_0^1 p(x'=s_1|x=s_1,a=1,p_{11})p(p_{11})dp_{11}} = \frac{p_{11} \cdot 1}{1/2} = 2p_{11}$$
	Hence, the posterior mean is $\int_0^1 p_{11}\cdot 2p_{11}\,dp_{11}=2/3$, i.e. we now expect to go from $s_1$ to $s_1$ in $2/3$ of the cases if we take action $1$. If we now observe the same transition again, we get:
	$$p(p_{11}|x'_2=s_1,x_2=s_1,a_2=1)=\frac{p(x'_2=s_1|x_2=s_1,a_2=1,p_{11})p(p_{11}|x_1,x_1',a_1)}{\int_0^1 p(x'_2=s_1|x_2=s_1,a_2=1,p_{11})p(p_{11}|x_1,x_1',a_1)dp_{11}} = \frac{p_{11} \cdot 2p_{11}}{2/3} = 3p_{11}^2$$
	\item Now, an optimal policy takes the uncertainty of the transition probabilities into account, and tries to maximize the expected return. This leads to an optimal trade-off between exploration and exploitation.
	\item However, keep in mind that we have $|\mathcal{S}|^2|\mathcal{A}|$ transition probabilities to learn (one per triple $(x,a,x')$), which can be too many for certain environments. Hence, we might have to consider using approximations again.
\end{itemize}
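
\noindent The two updates in the example above follow a general pattern, sketched here under the same assumptions (two states, a uniform prior on $p_{11}$, and $n$, $m$ as counts introduced for illustration). After observing the transition $s_1\to s_1$ under action $1$ a total of $n$ times and the transition $s_1\to s_2$ a total of $m$ times, the posterior is a Beta distribution:
$$p(p_{11}|\text{data}) = \frac{p_{11}^{n}(1-p_{11})^{m}}{\int_0^1 p^{n}(1-p)^{m}dp} = \text{Beta}(p_{11};\, n+1,\, m+1), \hspace{5mm} \E[p_{11}|\text{data}] = \frac{n+1}{n+m+2}$$
For $n=1$, $m=0$ this gives the density $2p_{11}$ with mean $2/3$, and for $n=2$, $m=0$ the density $3p_{11}^2$ with mean $3/4$. In this setting, the hyperstate therefore only needs to store the counts $(n,m)$ per state-action pair instead of a full density.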
\subsubsection{Bayesian Adaptive MDP and Meta-reinforcement learning}
\begin{itemize}
	\item Our approach to partial observability can be seen as Bayesian because we use our posterior to estimate the expected reward given a prior over our beliefs (iterative posterior update).
	\item If the prior assigns a non-zero probability to a certain model, then the agent can find the optimal strategy for it. The number of samples needed, i.e. the amount of exploration, depends on the design of the prior. The closer the prior is to a certain model, the less the agent has to explore. But giving a model a higher probability in the prior than it should have also leads to worse performance on other MDPs.
	\item It can be shown that the optimal strategy for finding the best policy in an unknown MDP can be learned by sampling from the prior over MDPs and using simple gradient estimates.
	\item Hence, with a prior over MDPs, optimal exploration can be phrased as greedy behaviour in an augmented MDP, where the hyperstates include the unknown transition and reward probabilities
	\item Such techniques are investigated under the term \textbf{Meta-reinforcement learning}. Here the agent is not told which exact MDP it gets, but has to learn patterns across MDPs, and find the optimal way of exploring.
\end{itemize}


================================================
FILE: Reinforcement_Learning/rl_policy_gradient_methods.tex
================================================
\section{Policy gradient methods}
\label{sec:policy_learning}
\textit{This section reviews the lecture slides 7, 8, 9 and 10.}
\begin{itemize}
	\item In this section we will discuss techniques for learning the policy directly. There are a couple of advantages to this:
	\begin{itemize}
		\item We are able to deal with continuous actions
		\item We are changing the policy \textit{smoothly}, meaning that after an update, we only slightly change the probability distribution over actions. In case of $\epsilon$-greedy on $q$-values, the policy changes abruptly whenever a different action becomes the best one.
		\item Small errors in the value functions don't give a big error in $\pi$ (we are directly optimizing the quantity of interest).
		\item We are able to include prior knowledge, like ``don't fall off the cliff'', ``going left is potentially more interesting'', etc.
		\item We are able to learn how much stochasticity is optimal for the given environment. 
	\end{itemize}
	\item In case we have discrete actions, we can simply learn by a softmax over these (viewing them as different classes). For continuous, we can for example use a Gaussian, and learn to predict its mean and variance
	\item The objective of a policy is always the same, namely to maximize the expected return from its start state(s): $$J(\theta)=v_{\pi_{\theta}}(s_0)=\E\left[\sum_{t=0}^{T-1} r_{t+1}\right]$$
	Note that we assume here $\gamma=1$. We will use it throughout this section as it makes the derivations/discussion a bit easier, but we can change this term if necessary.
	\item The simplest update is using \textbf{finite differences}, meaning that we estimate the gradient by perturbing one parameter at a time by a small amount $\epsilon$:
	$$\frac{\partial J(\theta)}{\partial \theta_k}\approx \frac{ J(\theta+\epsilon \bm{e}_k) - J(\theta-\epsilon \bm{e}_k)}{2\epsilon}$$
	where $\bm{e}_k$ is the $k$-th unit vector. However, this means that for $n$ parameters, we would need at least $2n$ roll-outs for a single gradient estimate. For stochastic policies, this estimate is extremely noisy and hence not really applicable.
\end{itemize}
\subsection{REINFORCE and the Policy Gradient Theorem}
\begin{itemize}
	\item The Policy Gradient Theorem says that the gradient $\nabla_{\theta}J(\theta)$ is proportional to:
	$$\nabla_{\theta}J(\theta) \propto \sum_s \mu(s) \sum_a \nabla_{\theta} \pi_{\theta}(a|s)q_{\pi_{\theta}}(s,a) $$
	Note that we are not interested in the constant proportionality factor because we will absorb it anyways in the learning rate
	\item For deriving the REINFORCE algorithm, we follow the approach of the lecture slides. Let's define $\tau$ as a trajectory that starts from $s_0$ and ends in an arbitrary terminal state. Then, the objective $J(\theta)$ is the expected return over these trajectories. Using this equality, we can derive the gradients as:
	\begin{equation*}
		\begin{split}
			\nabla_{\theta}J(\theta) & =  \nabla_{\theta} \E_{\tau}[G(\tau)]\\
			& = \int \nabla_{\theta} p_{\theta}(\tau) G(\tau)d\tau\\
		\end{split}
	\end{equation*}
	where the probability of a trajectory is defined as $p_{\theta}(\tau)=p(S_1)\prod_{t=1}^{T}\pi_{\theta}(A_t|S_t)p(S_{t+1}|A_{t},S_t)$ (time steps are indexed from $1$ in this derivation, so $S_1$ denotes the start state $s_0$ from above), and $G(\tau)$ is the return of the trajectory $\tau$. Using the trick $\nabla_{\theta} p_{\theta}(\tau)=p_{\theta}(\tau)\cdot \nabla_{\theta} \ln p_{\theta}(\tau)$, we get:
	
	\begin{equation*}
		\begin{split}
			\nabla_{\theta}J(\theta) & = \int \nabla_{\theta} p_{\theta}(\tau) G(\tau)d\tau\\
			& = \E_{\tau}[G(\tau)\nabla_{\theta} \ln p_{\theta}(\tau)]\\
			& = \E_{\tau}\left[G(\tau)\sum_{t=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_t|s_t)\right]\\
		\end{split}
	\end{equation*}
	\item Hence, to estimate the gradient, we can sample trajectories and approximate the expectation above. This estimate is unbiased but has some disadvantages. The efficiency of REINFORCE is low because of its high variance which comes from two points:
	\begin{itemize}
		\item First, REINFORCE can be seen as a Monte Carlo method of policy-based RL. Hence, the MC samples bring a certain level of noise with them
		\item Suppose we are playing CartPole. Our reward is 1 for each time step until we terminate, so every return is positive. Hence, every update increases the probability of the actions that were sampled, which can be seen as ``supporting'' the last actions regardless of how good they were. Only if we sample the other action might we experience an even higher return, which pushes the policy towards the newly explored actions.
	\end{itemize}
	\item The second issue can be tackled by using a \textit{baseline}, which is a constant $b$ subtracted from the return. It does not change the expectation of the gradient estimate, so the estimate stays unbiased:
	\begin{equation*}
		\begin{split}
			\E_{\tau}\left[\left(G(\tau)-b\right)\sum_{t=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_t|s_t)\right] & = \E_{\tau}\left[G\left(\tau\right)\sum_{t=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_t|s_t)\right]  - \E_{\tau}\left[b\sum_{t=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_t|s_t)\right] \\
			& = \nabla J(\theta) - b\underbrace{\int p_{\theta}(\tau)\nabla_{\theta} \ln p_{\theta}(\tau)d\tau}_{=\nabla_{\theta}\int p_{\theta}(\tau)d\tau=\nabla_{\theta}1=0}\\
			& = \nabla J(\theta) 
		\end{split}
	\end{equation*}
	\item A good baseline is the expected return, which we can for example estimate by averaging over past episodes.
	\item However, one drawback still remains. We assign each action of a trajectory the same credit, meaning that we reward or punish every action equally, no matter how much it actually was responsible for the outcome. This is especially a problem when we punish actions for something that happened before them (e.g. if the last 5 steps get a reward of 10 each, but the first got -100, we punish all of them equally). To prevent this, we move to G(PO)MDP.
\end{itemize}
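
\noindent The last step of the REINFORCE derivation uses that only the policy terms of $\ln p_{\theta}(\tau)$ depend on $\theta$: the logarithm turns the product over time steps into a sum, and the start-state distribution and the transition dynamics $p(s_{t+1}|a_t,s_t)$ do not depend on $\theta$, so their gradients vanish. In practice, the expectation is replaced by an average over $N$ sampled trajectories $\tau^{(1)},\dots,\tau^{(N)}$ (the sample index $(i)$ and step size $\alpha$ are notation introduced here as a sketch):
$$\nabla_{\theta}J(\theta)\approx \frac{1}{N}\sum_{i=1}^{N}\left(G(\tau^{(i)})-b\right)\sum_{t=1}^{T}\nabla_{\theta}\ln p_{\theta}(a_t^{(i)}|s_t^{(i)}), \hspace{5mm} \theta\leftarrow\theta+\alpha\nabla_{\theta}J(\theta)$$
where $b$ is for example a running average of the returns of earlier episodes. Setting $b=0$ recovers plain REINFORCE.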
\subsubsection{G(PO)MDP}
\begin{itemize}
	\item Gradient estimates for (Partially Observable) Markov Decision Processes
	\item Let's reconsider the gradient estimate again, and try to split it into a part before $t$, and a part after $t$:
	
	\begin{equation*}
		\begin{split}
			\nabla_{\theta}J(\theta) & =  \E_{\tau}\left[G(\tau)\sum_{t=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_t|s_t)\right]\\
			& = \E_{\tau}\left[\sum_{t=1}^{T} r_t \sum_{t'=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'})\right]\hspace{5mm}\text{(Put in definition of return)}\\
			& = \sum_{t=1}^{T} \E_{\tau_{1:t}}\E_{\tau_{t+1:T}}\left[r_t \sum_{t'=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'})\right]\hspace{5mm}\text{(Move sum out and split expectation)}\\
			& = \sum_{t=1}^{T} \E_{\tau_{1:t}}\left[r_t \left(\sum_{t'=1}^{t}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'}) + \underbrace{\E_{\tau_{t+1:T}}\left[\sum_{t'=t+1}^{T}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'})\right]}_{=\E[\int p(x)\nabla\log p(x)dx]=0}\right)\right]\\
			& = \sum_{t=1}^{T} \E_{\tau_{1:t}}\left[r_t \sum_{t'=1}^{t}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'}) \right]\\
			& = \E_{\tau}\left[\sum_{t=1}^{T}  r_t \sum_{t'=1}^{t}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'}) \right]
		\end{split}
	\end{equation*}
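	With this rewritten gradient, we give credit for $r_t$ only to those actions that were taken up to and including time step $t$, i.e. actions are no longer held responsible for rewards that were received before them.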
	\item This reduces the variance from REINFORCE, and can be combined with baselines etc. However, keep in mind that we still have a Monte Carlo sample, so that there remains a significant variance
\end{itemize}
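
\noindent The example used to motivate G(PO)MDP makes the difference concrete (a six-step episode with $\gamma=1$): the first step yields $r_1=-100$ and the following five steps yield $r_2=\dots=r_6=10$. REINFORCE weights the log-probability gradient of every action with the full return $G(\tau)=-100+5\cdot 10=-50$. In G(PO)MDP, swapping the order of the two sums shows that the action at step $t'$ is weighted with the sum of rewards from $t'$ onwards:
$$\sum_{t=t'}^{6} r_t = \begin{cases} -50 & \text{for } t'=1\\ 50,\,40,\,30,\,20,\,10 & \text{for } t'=2,\dots,6\end{cases}$$
Hence, only the first action is discouraged, while the five later actions are reinforced.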
\subsubsection{Policy Gradients with Parameter-based Exploration (PGPE)}
\begin{itemize}
	\item Even when sampling from stochastic policies during a rollout, the variation and exploration we get is mostly fairly limited. Furthermore, we end up with small perturbations (choosing ``left''-``right''-``left'' in CartPole) which are unlikely to be repeated by a deterministic policy, and can damage e.g. a robot in real-life situations.
	\item Instead, we \textit{sample} a deterministic policy $\pi_{\theta}$ from a distribution $p(\theta|\nu)$. The advantage is that if we now see a state twice, we can guarantee that our policy takes the same action although we are still exploring/stochastic.
	\item So, instead of choosing $a$ randomly at every step, we randomly choose the action for every state at the beginning (represented by $\pi_{\theta}$), and keep it fixed over the trajectory.
	\item Our gradient (which is now with respect to $\nu$ as we want to learn $p(\theta|\nu)$) is:
	$$\nabla_{\nu} J(\nu) = \E_{\theta\sim p(\theta|\nu)}\E_{\tau|\pi_{\theta}}\left[G(\tau) \nabla_{\nu}\ln p(\theta|\nu)\right]$$
\end{itemize}
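
\noindent A common concrete choice, given here as an illustrative assumption rather than as part of the lecture material, is an independent Gaussian per parameter, $p(\theta_i|\nu)=\mathcal{N}(\theta_i|\mu_i,\sigma_i^2)$ with $\nu=\{\mu_i,\sigma_i\}$. Applying the log-derivative trick to $p(\theta|\nu)$ requires the score $\nabla_{\nu}\ln p(\theta|\nu)$ of this Gaussian, which is available in closed form:
$$\frac{\partial \ln p(\theta_i|\nu)}{\partial \mu_i} = \frac{\theta_i-\mu_i}{\sigma_i^2}, \hspace{5mm} \frac{\partial \ln p(\theta_i|\nu)}{\partial \sigma_i} = \frac{(\theta_i-\mu_i)^2-\sigma_i^2}{\sigma_i^3}$$
A sampled parameter vector with a return above the baseline thus pulls $\mu_i$ towards $\theta_i$, and increases $\sigma_i$ if $\theta_i$ was more than one standard deviation away from the mean. Note that no gradient of the policy itself is needed, so $\pi_{\theta}$ may even be non-differentiable.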
\subsection{Actor-critic Policy Gradient}
\begin{itemize}
	\item Although we increased the stability of REINFORCE by the discussed improvements, the problems of Monte Carlo sampling remain: we have to wait until the end of the episode, and the samples have a high variance
	\item First we take a look again at G(PO)MDP, where we slightly re-arrange the terms:
	$$\E_{\tau}\left[\sum_{t=1}^{T}  r_t \sum_{t'=1}^{t}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'}) \right] = \E_{\tau}\left[\sum_{t'=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'}) \sum_{t=t'}^{T}  r_t\right]$$
	In our value-based methods, we previously learned the expected value of the terms $\sum_{t=t'}^{T}  r_t$ in the form of the $v$/$q$-functions. Hence, we can also plug them in here:
	$$\E_{\tau}\left[\sum_{t'=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'}) \sum_{t=t'}^{T}  r_t\right] = \E_{\tau}\left[\sum_{t'=1}^{T}\nabla_{\theta} \ln p_{\theta}(a_{t'}|s_{t'}) q_{\pi}(s_{t'},a_{t'})\right]$$
	\item The question arises whether we can replace $q_{\pi}(s_{t'},a_{t'})$ by an estimate $\hat{q}_{\bm{w}}(s_{t'},a_{t'})$ without introducing a bias. The answer is yes, but with two constraints on the function $\hat{q}_{\bm{w}}$:
	\begin{enumerate}
		\item The function has to be \textit{compatible}, which means:
		$$\nabla_{\bm{w}}\hat{q}_{\bm{w}}(s,a) = \nabla_{\theta}\ln \pi_{\theta}(a|s)\hspace{5mm}\text{like}\hspace{2mm} \hat{q}_{\bm{w}}(s,a)=\bm{w}^T \nabla_{\theta} \ln \pi_{\theta}(a|s)$$
		\item $\hat{q}_{\bm{w}}$ has to be fully converged, i.e. $\bm{w}$ minimizes the mean squared error to $q_{\pi}$, so that:
		$$\E\left[(q_{\pi}(s,a)-\hat{q}_{\bm{w}}(s,a)) \frac{\partial \hat{q}_{\bm{w}}(s,a)}{\partial \bm{w}}\right]= 0$$
	\end{enumerate}
	\item We call the policy $\pi$ the actor, while $\hat{q}_{\bm{w}}$ is the critic
	\item Note that we can still add a baseline to stabilize learning further. For example, a good baseline is the value function so that our actual goal of $\hat{q}_{\bm{w}}$ should be to learn $\hat{q}_{\bm{w}}(s,a)\approx q_{\pi}(s,a)-v_{\pi}(s)=A(s,a)$ which is also called the \textbf{advantage}
	\item The \underline{benefits} of actor-critic methods are a lower variance, as the target is not sampled anymore, and that we can update our policy more frequently instead of waiting until the end of the episode.
	
	However, the \underline{drawbacks} are that we have more hyperparameters to tune (two learning rates etc.), and that we require a stochastic policy (no fully greedy policy possible). We will later see methods which can deal with deterministic ones.
\end{itemize}
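
\noindent A minimal sketch of the resulting online update, assuming a learned state-value critic $\hat{v}_{\bm{w}}$ that is used as baseline and two step sizes $\alpha_{\bm{w}}$ and $\alpha_{\theta}$ (these symbols are introduced here for illustration). After each transition $(s_t,a_t,r_t,s_{t+1})$:
\begin{equation*}
	\begin{split}
		\delta_t & = r_t + \gamma\hat{v}_{\bm{w}}(s_{t+1}) - \hat{v}_{\bm{w}}(s_t)\\
		\bm{w} & \leftarrow \bm{w} + \alpha_{\bm{w}}\,\delta_t\,\nabla_{\bm{w}}\hat{v}_{\bm{w}}(s_t)\\
		\theta & \leftarrow \theta + \alpha_{\theta}\,\delta_t\,\nabla_{\theta}\ln\pi_{\theta}(a_t|s_t)
	\end{split}
\end{equation*}
Here the TD error $\delta_t$ serves as a one-sample estimate of the advantage $A(s_t,a_t)$, and $\hat{v}_{\bm{w}}(s_{t+1})$ is taken as $0$ for terminal states. With an approximate critic, this estimate is in general biased.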
\subsubsection{Generalized Advantage Estimation (n-step AC)}
\begin{itemize}
	\item Let's reconsider the difference between actor-critic and actor-only approaches from a different perspective. Actor-critic bootstraps its estimate on the next value, which is very similar to TD(0) learning. Actor-only uses the sampled return, which is a Monte Carlo method. In value-based methods, we discussed that we can generalize TD(0) and Monte Carlo to $n$-step TD learning, which we can do here similarly. This is again a trade-off between variance and bias. %  because as we have seen before, if $\hat{q}_{\bm{w}}$ has not fully converged (which is in practice mostly not the case), we get a biased estimate of our gradients.
	\item The advantage for an $n$-step estimate is:
	$$\hat{A}_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n v(s_{t+n}) - v(s_t) = \sum_{l=0}^{n-1}\gamma^{l}\delta_{t+l}$$
	where $\delta_{t}=r_t+\gamma v(s_{t+1})-v(s_t)$ is the TD error for time step $t$. 
	
	However, we can also take a smoother version of $n$-step, where we take a weighted average of all the advantages:
	$$\hat{A}_t^{GAE} = (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + ...\right) = \sum_{l=0}^{\infty} (\gamma \lambda)^{l}\delta_{t+l}$$
	which mirrors the $\lambda$-return known from TD($\lambda$).
	\item A lower $\lambda$ reduces the variance but increases the bias (TD). Similarly, choosing a high $\lambda$ gives a low bias but high variance (MC).
	
	However, a disadvantage of using $\lambda$ instead of a fixed $n$ is that we need to run a full episode before we can calculate any advantage. We can overcome this issue by using eligibility traces where we update each state by its already known advantage factors of an episode, and continue doing so while following the trajectory.
	\item This approach can again be used in combination with many different optimization techniques, like using TRPO (see next section). 
\end{itemize}
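
\noindent To see why the $n$-step advantage is a discounted sum of TD errors, write $\delta_t=r_t+\gamma v(s_{t+1})-v(s_t)$ and consider $n=2$, where the intermediate value terms cancel:
\begin{equation*}
	\begin{split}
		\delta_t+\gamma\delta_{t+1} & = r_t+\gamma v(s_{t+1})-v(s_t)+\gamma\left(r_{t+1}+\gamma v(s_{t+2})-v(s_{t+1})\right)\\
		& = r_t+\gamma r_{t+1}+\gamma^2 v(s_{t+2})-v(s_t)
	\end{split}
\end{equation*}
As a sketch of how the exponentially weighted sum can be computed once an episode has been collected, it satisfies the backward recursion
$$\hat{A}_t^{GAE}=\delta_t+\gamma\lambda\,\hat{A}_{t+1}^{GAE}$$
starting from the last step of the episode with the advantage beyond it set to $0$. Setting $\lambda=0$ gives the one-step TD error $\delta_t$, and $\lambda=1$ gives the Monte Carlo return minus $v(s_t)$.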
\subsection{Higher-order Policy Search Methods}
\begin{itemize}
	\item When updating our policy, we want to make sure that we don't change too much. The reason for that is that our samples come from $\pi_{\theta}$. The more we change $\pi_{\theta}$, the more our gradient estimate becomes inaccurate! Hence, we should limit our change in $\pi_{\theta}$.
	\item A simple way of measuring the change is the squared L2 norm of the parameter change, namely $d\theta^T d\theta$, which we fix to a certain value $c$.
	\item To find the next optimal value, we have to solve the following equation for the step we take, namely $\theta^{*}-\theta_0$ ($\theta^{*}$ next value), such that $d\theta^T d\theta=c$:
	\begin{equation*}
		\begin{split}
			\theta^{*}-\theta_0 & = \arg\max_{d\theta} J(\theta_0 + d\theta)\\
			& \approx \arg\max_{d\theta} J(\theta_0) + (\nabla_{\theta} J(\theta))^T d\theta \hspace{5mm}\text{(Taylor expansion)}\\
			& \propto \nabla_{\theta} J(\theta)
		\end{split}
	\end{equation*}
	This means that the previous, standard policy gradient methods maximize the Taylor expansion of $J$ such that the update is on the norm sphere (as $d\theta^T d\theta=c$)
	\item However, there are many disadvantages and possible problems of this:
	\begin{itemize}
		\item The norm itself is highly sensitive to the parameterization of the model. For example, consider a Gaussian for which we want to learn the mean and the variance. We can represent the same distribution by learning the standard deviation, or even the precision of the Gaussian. However, all these parameters have a different scale, and Euclidean distance is not always the best distance measure (e.g. $\sigma=0.1$ and $\sigma=0.2$ are more \textit{different} than $\sigma=10.1$ and $\sigma=10.2$).
		\item Another simple failure case is when we change the scale of the parameters. Suppose we express the mean by $\mu=4\cdot \theta_1$ instead of $\mu=\theta_1$, while keeping $\sigma=\theta_2$. As we only constrain the norm of the step in $\theta_1$ and not in $\mu$, the same step in $\theta_1$ changes $\mu$ four times as much in the first case as in the second. This is clearly not desired because both parameterizations express the same model, just with different scales. Furthermore, it shifts the balance between the updates of $\mu$ and $\sigma=\theta_2$, as $\sigma$ is not rescaled equally.
		\item Furthermore, we ignore correlations between parameters. If the gradients of $\theta_1$ and $\theta_2$ are highly correlated, e.g. if we use $\mu=\theta_1+\theta_2$, we still update both as if they were independent.
	\end{itemize}
	\item So, we are not directly interested in the change of the parameters, but in the change of the policy distribution. A better way of measuring this change is the KL divergence:
	$$D_{\text{KL}}(p||q)=\int p(x)\log\frac{p(x)}{q(x)}dx$$
	There are different algorithms that exploit this property, and we will discuss two of them: Natural Policy Gradient and Trust Region Policy Optimization.
\end{itemize}
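\begin{itemize}
	\item To make the Gaussian example above concrete, we can plug the numbers into the closed-form KL divergence between two univariate Gaussians with the same mean (a worked sketch; the numbers are rounded):
	$$D_{\text{KL}}\left(\mathcal{N}(\mu,\sigma_1^2)||\mathcal{N}(\mu,\sigma_2^2)\right) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2}{2\sigma_2^2} - \frac{1}{2}$$
	\begin{equation*}
		\begin{split}
			\sigma_1=0.1,\ \sigma_2=0.2: \hspace{3mm} D_{\text{KL}} & = \log 2 + \frac{0.01}{0.08} - \frac{1}{2} \approx 0.32\\
			\sigma_1=10.1,\ \sigma_2=10.2: \hspace{3mm} D_{\text{KL}} & = \log\frac{10.2}{10.1} + \frac{10.1^2}{2\cdot 10.2^2} - \frac{1}{2} \approx 10^{-4}
		\end{split}
	\end{equation*}
	Both pairs have the same Euclidean distance of $0.1$ in parameter space, but the KL divergence differs by more than three orders of magnitude.
\end{itemize}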
\subsubsection{Natural Policy Gradients}
\begin{itemize}
	\item The first step for using the KL divergence as step size regulator is to replace the norm constraint $d\theta^T d\theta=c$ by a constraint on the expected KL divergence over states, which we approximate by its quadratic expansion:
	\begin{equation*}
		\begin{split}
			c & = \E_{s}\left[D_{KL}\left(\pi(a|s;\theta_0)||\pi(a|s;\theta)\right)\right] = \text{EKL}(\theta)\\
			& \approx \underbrace{\text{EKL}(\theta_0)}_{=KL(p||p)=0} + d\theta^T \underbrace{(\nabla_{d\theta}\text{EKL})(\theta_0)}_{=0 \text{ as }\theta_0\text{ optimum}} + \frac{1}{2}d\theta^T (\nabla_{d\theta}^2\text{EKL})(\theta_0)d\theta\\
			& = \frac{1}{2}d\theta^T (\nabla_{d\theta}^2\text{EKL})(\theta_0)d\theta\\
		\end{split}
	\end{equation*}
    \item So, to evaluate this constraint, we need to calculate the Hessian $\nabla_{d\theta}^2\text{EKL}$ at point $\theta_0$. This Hessian is also known as the Fisher information matrix (i.e. how sensitive $\pi$ is to changes in $\theta$), and can be calculated by:

	\begin{equation*}
		\begin{split}
			F & = \nabla_{d\theta}^2\text{EKL} = \E_{s}\left[\nabla_{d\theta}^2 D_{KL}\left(\pi(a|s;\theta_0)||\pi(a|s;\theta)\right)\right]\\
			\nabla_{d\theta}^2 D_{KL} & = -\E_{a\sim\pi(a|s;\theta_0)}[\nabla_{d\theta}^2 \log \pi(a|s;\theta_0+d\theta)]\\[10pt]
		\end{split}
	\end{equation*}
	$$\implies F = \begin{bmatrix}
	\E_a\left[\left(\nabla_{d\theta_1} \log \pi_{\theta}(a|s)\right)^2\right] & \E_a\left[\nabla_{d\theta_1} \log \pi_{\theta}(a|s) \cdot \nabla_{d\theta_2} \log \pi_{\theta}(a|s)\right] & ...\\
	\E_a\left[\nabla_{d\theta_2} \log \pi_{\theta}(a|s) \cdot \nabla_{d\theta_1} \log \pi_{\theta}(a|s)\right] & \E_a\left[\left(\nabla_{d\theta_2} \log \pi_{\theta}(a|s)\right)^2\right] & ...\\
	% \E_a\left[\nabla_{d\theta_1} \log \pi_{\theta}(a|s)\right]\cdot \E_a\left[\nabla_{d\theta_3} \log \pi_{\theta}(a|s)\right] & \E_a\left[\nabla_{d\theta_2} \log \pi_{\theta}(a|s)\right]\cdot \E_a\left[\nabla_{d\theta_3} \log \pi_{\theta}(a|s)\right] & ...\\
	\vdots & \vdots & \ddots
	\end{bmatrix}$$
	
	\item Now, we reconsider our update step:
	\begin{equation*}
		\begin{split}
			\theta^{*}-\theta_0 & \approx \arg\max_{d\theta} J(\theta_0) + (\nabla_{\theta} J(\theta))^T d\theta \hspace{5mm}\text{(Taylor expansion)}\\
		\end{split}
	\end{equation*}
	To take the maximum, we now also need to include our constraint via a Lagrangian:
	$$\max_{d\theta}\min_{\lambda} J(\theta_0) + (\nabla_{\theta} J(\theta))^T d\theta + \lambda \left(\frac{1}{2}d\theta^T F d\theta-c\right)$$
	So that, when we solve it, we get:
	$$d\theta  \propto F^{-1}\nabla_{\theta}J(\theta)$$
	which we call the \textit{natural gradient}
	\item The update rule is now:
	$$\theta_{t+1} = \theta_t + \alpha F^{-1}\nabla_{\theta_t}J(\theta_t)$$
	where for the vanilla gradient $\nabla_{\theta_t}J(\theta_t)$, we can use any of the above methods.
	\item We can show that for a sufficiently small step size, we will always improve by this update step 
	\item Figure~\ref{fig:rl_policy_gradients_NPG} shows a visualization of the update difference between NPG and standard policy gradients. NPG allows us to find a better fit in the region of ``safe'' changes, so that we can possibly take larger steps towards the right policy.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/rl_policy_gradients_NPG.png}
		\caption{Comparison of L2 norm update (left) and NPG (right). The orange background represents the safe changes, meaning the parameter changes for which our policy does not change by more than our defined threshold. While the L2 constraint sticks with the unit sphere, NPG can represent ellipsoids so that we can take larger steps in parameter space towards increasing $J$ without changing the policy too much.}
		\label{fig:rl_policy_gradients_NPG}
	\end{figure}

	We can also visualize the gradient direction grid, as in Figure~\ref{fig:rl_policy_gradients_NPG_gradient_example}. Gradients that point in the wrong direction give very slow convergence because we focus on parameters which are already close to optimal.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/rl_policy_gradients_NPG_gradient_example.png}
		\caption{Gradient direction over parameter space for finding the optimum at $(-1,0)$. While the NPG gives very smooth transitions, vanilla gradients point straight down (strong gradient for $\theta_2$) for points like $(-2,0.1)$ where we clearly need to update $\theta_1$ more. This leads to slow convergence.}
		\label{fig:rl_policy_gradients_NPG_gradient_example}
	\end{figure}
	\item The advantages of NPG are therefore:
	\begin{itemize}
		\item Faster convergence and less training time
		\item Is an adaptation on top of standard policy gradient, so we can use any of the previous methods with all additions/tricks we want
	\end{itemize}
	However, the biggest drawback is that we have to calculate the Fisher information matrix, which is known in closed form for standard distributions like Gaussians, but is harder to determine in other cases, especially if we want to use a neural network (there, the product $F^{-1}\nabla_{\theta}J$ can be approximated with the conjugate gradient algorithm). It also keeps the disadvantages of the other policy gradient methods, namely high variance and slower convergence compared to value-based methods.
\end{itemize}
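\begin{itemize}
	\item As a worked sketch of why the natural gradient removes the scaling problem discussed earlier (not part of the lecture; the factor $k$ and the learning rate $\alpha$ are symbols introduced here), consider a univariate Gaussian policy with $\mu=k\cdot\theta_1$ and $\sigma=\theta_2$. The Fisher information matrix with respect to $(\mu,\sigma)$ is known in closed form, and the chain rule gives the one for $(\theta_1,\theta_2)$:
	$$F_{(\mu,\sigma)} = \begin{bmatrix} 1/\sigma^2 & 0\\ 0 & 2/\sigma^2 \end{bmatrix}, \hspace{5mm} F_{(\theta_1,\theta_2)} = \begin{bmatrix} k^2/\sigma^2 & 0\\ 0 & 2/\sigma^2 \end{bmatrix}, \hspace{5mm} \pd{J}{\theta_1} = k\pd{J}{\mu}$$
	For a fixed learning rate $\alpha$, the resulting change of the mean is:
	\begin{equation*}
		\begin{split}
			\text{Vanilla gradient: } \hspace{2mm} d\mu & = k\cdot d\theta_1 = k \cdot \alpha k \pd{J}{\mu} = \alpha k^2 \pd{J}{\mu}\\
			\text{Natural gradient: } \hspace{2mm} d\mu & = k\cdot d\theta_1 = k \cdot \alpha \frac{\sigma^2}{k^2} k \pd{J}{\mu} = \alpha \sigma^2 \pd{J}{\mu}
		\end{split}
	\end{equation*}
	The natural gradient step on $\mu$ is independent of $k$, whereas the vanilla step grows with $k^2$ (a factor of 16 for $k=4$).
\end{itemize}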
\subsubsection{Trust region policy optimization}
\begin{itemize}
	\item A problem of Natural Policy Gradient is that we approximated the KL divergence by a second-order Taylor expansion. The errors introduced there might cause our original KL constraint to be violated, meaning that the actual expected KL divergence differs from $c$ even though $\frac{1}{2}d\theta^T F d\theta = c$.
	\item TRPO takes a slightly different view of the problem. The main concept of the algorithm is that we take steps that are as big as possible, as long as we can guarantee improvement. Hence, we have three steps:
	\begin{enumerate}
		\item Approximate the return function $J$
		\item Apply a penalty term to yield a lower bound on the exact function
		\item Maximize lower bound (which guarantees improvement on exact function) by e.g. SGD again
	\end{enumerate} 
	\item The region where we assume our approximation to be valid, is called \textit{trust region}
	\item Figure~\ref{fig:rl_policy_gradients_TRPO_vs_NPG} compares the ideas of NPG and TRPO visually. NPG takes a linear approximation of $J$ at a point, and limits the step size by approximating the KL divergence. TRPO, however, constructs a lower bound on the ``true'' function. It is a combination of an approximation of $J$ and a penalty term for big changes in the policy.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.55\textwidth]{figures/rl_policy_gradients_TRPO_vs_NPG.png}
		\caption{Comparing the optimization properties of NPG (left) and TRPO (right).}
		\label{fig:rl_policy_gradients_TRPO_vs_NPG}
	\end{figure}
	\item We can calculate the return by:
	$$\eta(\theta)=\E_{\bm{s}\sim \mu_{\pi_{\theta}},\bm{a}\sim \pi_{\theta}(\bm{s})}[r(\bm{s},\bm{a})]$$
	where we use samples from $\pi_{\theta}$ to approximate the expectation. However, if the samples were collected with other parameters $\theta'$, we cannot use them directly for $\theta$ because the estimate would be biased. Hence, we use importance weights:
	$$\eta(\theta)\approx \E_{\bm{s}\sim \mu_{\pi_{\theta'}},\bm{a}\sim \pi_{\theta'} (\bm{s})}\left[\frac{\pi_{\theta}(\bm{a}|\bm{s})}{\pi_{\theta'}(\bm{a}|\bm{s})}r(\bm{s},\bm{a})\Bigg\vert \theta' \right]=L_{\theta'}(\theta)$$
	but note that the state distribution is still the one of $\pi_{\theta'}$ and is not corrected (hence approximation!).
	\item Now, let's consider how we get the lower bound based on our approximation. 
	$$\eta(\theta)\geq L_{\theta'}(\theta) - \frac{2\epsilon \gamma}{(1-\gamma)^2}\cdot \max_s D_{\text{KL}}\left(\pi_{\theta}(\cdot|s)||\pi_{\theta'}(\cdot|s)\right)$$
	where the factor in front of the penalty is environment/policy dependent. 
	\item Although this term above guarantees us a lower bound, in practice, we might run into multiple issues:
	\begin{itemize}
		\item We need to take the maximum KL divergence over all states. However, in environments with many and/or continuous states, this is often not possible. So, we approximate it by taking the average instead.
		\item The penalty is usually very high so that we cannot make big steps. So, the average can already help to reduce the penalty, but we can also consider the KL divergence rather as a constraint than a penalty. This leads us to a similar approach as for NPG
	\end{itemize}
	\item So, what we do instead is to maximize the approximation as in Natural Policy Gradient, but dynamically set the step size based on a maximum KL divergence that we allow between policies. That is, we solve the following equation for the step size $\beta$:
	$$D_{\text{KL}} \approx \beta^2 d\theta^T F_s d\theta / 2$$
	with $d\theta$ being in the same direction as NPG. One way of (approximately) solving it is by starting with an initial $\beta_0$ and, for a couple of steps, increasing it if the constraint is fulfilled. Otherwise, we reduce it until we find a valid, sufficiently high $\beta$.
	
	\item In Figure~\ref{fig:rl_policy_gradients_TRPO_vs_NPG}, we now are at the left image again but the step size is adjusted by the KL.
	\item The advantage of TRPO is that we can take bigger steps than the standard NPG while, in theory, keeping the guarantee of converging. It has been shown to work well with neural controllers, where the natural gradient direction $F^{-1}\nabla_{\theta}J$ is approximated with the conjugate gradient method.
	
	The disadvantages are, however, that it still requires many steps, and that the return is still a Monte Carlo sample (high variance). In practice, the guarantee of convergence is broken by all the approximations we took.
\end{itemize}
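\begin{itemize}
	\item As a worked sketch of the step size rule above (the symbol $\delta_{\max}$ for the allowed KL divergence and the numbers are assumptions made here for illustration), we can solve the quadratic approximation for $\beta$ in closed form. With the natural gradient direction $d\theta = F^{-1}g$ and $g=\nabla_{\theta}J$:
	\begin{equation*}
		\begin{split}
			\frac{1}{2}\beta^2 d\theta^T F d\theta = \delta_{\max} \hspace{3mm} & \implies \hspace{3mm} \beta = \sqrt{\frac{2\delta_{\max}}{d\theta^T F d\theta}} = \sqrt{\frac{2\delta_{\max}}{g^T F^{-1} g}}
		\end{split}
	\end{equation*}
	For example, $\delta_{\max}=0.01$ and $g^T F^{-1} g = 50$ give $\beta=\sqrt{0.02/50}=0.02$. Since this $\beta$ only satisfies the quadratic approximation, a common safeguard is to evaluate the actual KL divergence on the samples afterwards and to shrink $\beta$ (e.g. halve it) as long as the constraint is violated.
\end{itemize}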

\subsection{Deep Policy Search}
\begin{itemize}
	\item When using deep neural networks for policy search, we might need to consider a few additional tricks because of the high non-linearity of the networks.
	\item To discuss this, we take deterministic policy gradient as an example, and explain the tricks that are used here
\end{itemize}
\subsubsection{Deterministic policy gradients}
\begin{itemize}
	\item All policy gradient methods we have discussed so far considered stochastic policies. However, it is sometimes preferred to learn a deterministic policy (e.g. remember Q-learning)
	\item This means that we will also learn off-policy (behavior policy $\beta$ with target/actor $\pi$). All other methods were discussed from the on-policy perspective but could be adjusted for off-policy with some minor modifications like importance sampling
	\item When using a different policy for sampling, we change our state distribution from $\mu_{\pi}$ to $\mu_{\beta}$, so that our return is:
	$$J_{\beta}(\pi_{\theta}) = \int_{\mathcal{S}} \mu_{\beta}(s)Q^{\pi}\left(s,\pi_{\theta}(s)\right)ds$$
	In terms of gradients, we end up with:
	$$\nabla_{\theta} J_{\beta}(\pi_{\theta})=\E_{s\sim\mu_{\beta}}\left[\nabla_{\theta}\pi_{\theta}(s)\nabla_a Q^{\pi}(s,a)\big|_{a=\pi_{\theta}(s)}\right] $$
	Note that this requires $\pi_{\theta}(s)$ to be differentiable, hence returning continuous actions.
	
	Although our samples are slightly off/biased ($\mu_{\beta}$ instead of $\mu_{\pi}$), this is usually not a problem as $\beta$ is chosen to be $\pi$ with some additional noise (like Gaussian).
	\item Our update equations are as follows:
	\begin{equation*}
		\begin{split}
			\text{TD error }\hspace{2mm}\delta_t & = r_t + \gamma Q^{w}\left(s_{t+1}, \pi_{\theta}\left(s_{t+1}\right)\right) - Q^{w}\left(s_{t}, a_t\right)\\
			\text{Update of $Q$ }\hspace{2mm}w_{t+1} & = w_{t} + \alpha_{w}\delta_t \nabla_{w} Q^{w}(s_t,a_t)\\
			\text{Update of $\pi$ }\hspace{2mm}\theta_{t+1} & = \theta_{t} + \alpha_{\theta} \nabla_{\theta} \pi_{\theta}(s_t) \nabla_{a} Q^{w}(s_t,a)|_{a=\pi_{\theta}(s_t)}
		\end{split}
	\end{equation*}
	where we illustrate the gradients in Figure~\ref{fig:rl_policy_gradients_DPG}.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/rl_policy_gradients_DPG.png}
		\caption{Illustrating the gradients in DPG. We have the combination of how much we change the action by changing $\theta$, and how this action change influences the action value $Q^{\pi}$.}
		\label{fig:rl_policy_gradients_DPG}
	\end{figure}
\end{itemize}
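\begin{itemize}
	\item To see how the two gradient factors interact, consider a toy case (a hypothetical example constructed here, not from the lecture): a linear deterministic policy with scalar action, $\pi_{\theta}(s)=\theta^T s$, and a critic of the form $Q^{w}(s,a)=-(a-w^T s)^2$, which is maximized by the action $a=w^T s$. Then:
	\begin{equation*}
		\begin{split}
			\nabla_{\theta}\pi_{\theta}(s) & = s\\
			\nabla_{a}Q^{w}(s,a)|_{a=\pi_{\theta}(s)} & = -2\left(\theta^T s - w^T s\right)\\
			\theta_{t+1} & = \theta_t - 2\alpha_{\theta}\left((\theta_t - w)^T s_t\right) s_t
		\end{split}
	\end{equation*}
	The update moves $\theta$ along $s_t$ such that the chosen action $\theta^T s_t$ gets closer to the action $w^T s_t$ that the critic considers best, without ever sampling a stochastic action.
\end{itemize}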
\subsubsection{Deep DPG}
\begin{itemize}
	\item When using neural networks, we again have to consider the same issues as in the DQN approach
	\item To use the collected data more efficiently and break the dependency between elements in a batch, we apply \textit{experience replay}
	\item For stabilizing the TD updates, we do not keep the target network fixed between periodic copies as in DQN, but create a second network that slowly tracks the learned $Q$ values
	\item For ensuring a similar scale of features, we apply \textit{batch normalization} within the network
	\item One aspect of exploration that we have discussed in PGPE before is that independent noise on the actions does not explore well. As a simple improvement, the Deep DPG paper correlates the noise over time, so that the random perturbation at one time step tends to stay close to the one of the time step before (using an Ornstein-Uhlenbeck process).
\end{itemize}
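\begin{itemize}
	\item The slowly tracking target network and the correlated noise can be written down as follows (a sketch; the symbols $\tau$, $\kappa$, $\sigma$ and the primes for target parameters are names chosen here, and the noise is a simple discretization of an Ornstein-Uhlenbeck process with zero mean):
	\begin{equation*}
		\begin{split}
			\text{Target critic }\hspace{2mm} w' & \leftarrow \tau w + (1-\tau) w'\\
			\text{Target actor }\hspace{2mm} \theta' & \leftarrow \tau \theta + (1-\tau) \theta'\\
			\text{Exploration noise }\hspace{2mm} n_{t} & = (1-\kappa)\, n_{t-1} + \sigma \epsilon_t, \hspace{3mm} \epsilon_t\sim\mathcal{N}(0, I)\\
			\text{Behavior action }\hspace{2mm} a_t & = \pi_{\theta}(s_t) + n_t
		\end{split}
	\end{equation*}
	With a small $\tau\in(0,1)$, the targets change only slowly. With $\kappa\in(0,1)$, the noise $n_t$ keeps a fraction $1-\kappa$ of the previous perturbation, which gives the correlation between consecutive steps.
\end{itemize}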
\subsection{Summary}
\begin{itemize}
	\item To wrap up policy-based reinforcement learning, we want to put all discussed algorithms into perspective.
	
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.45\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/rl_policy_gradient_summary_1.png}
			\caption{Exploration vs Evaluation}
		\end{subfigure}
		\hspace{5mm}
		\begin{subfigure}{0.45\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/rl_policy_gradient_summary_2.png}
			\caption{Actor-only versus Value-only methods}
		\end{subfigure}
		\caption{Comparing algorithms across two dimensions. (a) Pointing out the main difference between simple methods and advanced policy gradients (see text for more explanation). (b) Setting policy gradient methods into perspective with value-based.}
	\end{figure}

	\item When discussing the first algorithms of policy gradient, we could distinguish the methods on two dimensions:
	\begin{itemize}
		\item \textit{Exploration}: One key point in the discussion of PGPE was the exploration. Methods like REINFORCE explore by sampling an action at each time step independently, hence their exploration is step-based. PGPE however samples a new policy once in the beginning. This is episode-based because within the episode, we follow a deterministic policy and do not add noise per step. DDPG can be considered as in-between because it adds noise correlations between steps, and hence no longer has a purely step-based exploration strategy. 
		
		In general, it is hard to say which of the two is preferred, and it possibly depends on the environment. Independent noise as in the step-based methods has been shown to explore worse (which is why DDPG added the correlation). However, PGPE is more complex to implement and to learn because we have to learn a distribution over parameters $\theta$, which does not correspond one-to-one to a distribution over different policies (as discussed in NPG, the relation between $\theta$ and $\pi$ might be quite complex).
		
		\item \textit{Evaluation}: Across our discussion, we have seen that some algorithms evaluate their actions step-wise and others per episode. We prefer methods that we can evaluate step-wise because they usually don't need a full sample until the end of an episode (except GPOMDP and other non-Actor-Critic methods), and give every step individual credit assignment. REINFORCE performs episode-based evaluations because the first step's reward influences the last step's update (which we tried to prevent in the other algorithms).
	\end{itemize}
	\item Another part is to consider the different sub-groups of policy-based methods with respect to value-based techniques. REINFORCE, G(PO)MDP and PGPE are all actor-only methods, meaning that they only learn a policy $\pi$. We have seen that we can extend most approaches by introducing a critic that learns $q_{\pi}(s,a)$.
	
	NPG and TRPO can be applied either with or without a critic. Furthermore, we are free to choose how we arrive at $\nabla J$, but note that in the theoretical motivation of TRPO, we use the lower bound, so that it is, strictly speaking, not a policy gradient method.
\end{itemize}


================================================
FILE: Reinforcement_Learning/rl_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb, amsfonts, nccmath} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
% \usepackage{geometry}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}
\usetikzlibrary{bayesnet}

\usepackage{tcolorbox}
\usepackage[ruled,vlined]{algorithm2e}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\loss}[0]{\mathcal{L}}
\newcommand{\chain}[3]{\frac{\partial #1}{\partial #2}\frac{\partial #2}{\partial #3}}
% \newcommand{\eq}[1]{\begin{equation*}\begin{split}#1\end{split}\end{equation*}}
\newcommand{\TODO}[1]{\textbf{\textcolor{red}{#1}}}
\newcommand{\E}[0]{\mathbb{E}} % Expectation
\newcommand{\R}[0]{\mathbb{R}} % Real numbers
\newcommand{\Cdo}[0]{\textnormal{do}}
\newcommand{\Prob}[1]{\Pr\left\{#1\right\}} % Probability of an event
\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}
\newcommand*{\QED}{\hfill\ensuremath{\blacksquare}}%

\definecolor{green}{RGB}{0,160,0}
\definecolor{blue}{RGB}{0,0,160}
\definecolor{red}{RGB}{160,0,0}
\definecolor{orange}{RGB}{200,160,0}
\definecolor{purple}{RGB}{170,0,200}
\definecolor{cyan}{RGB}{0,200,200}
\definecolor{lightred}{RGB}{200,50,50}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Reinforcement Learning}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{rl_introduction.tex}
\newpage
\input{rl_tabular_methods.tex}
\newpage
\input{rl_learning_with_approx.tex}
\newpage
\input{rl_policy_gradient_methods.tex}
\newpage 
\input{rl_model_based.tex}
\newpage
\input{rl_partially_observable.tex}
\newpage
\appendix
\newpage
\input{rl_appendix.tex}

\end{document}

================================================
FILE: Reinforcement_Learning/rl_tabular_methods.tex
================================================
\section{Value-based RL: Tabular Methods}
\textit{This section reviews the lecture slides 2 (Monte Carlo), 3 and 4.}
\subsection{Monte Carlo}
\begin{itemize}
	\item We can try to estimate the value function by simply sampling from the expectation, meaning we generate episodes, and evaluate the cumulative reward:
	$$v(s)=\E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\Big\vert S_t=s\right]\approx \frac{1}{N}\sum_{n=1}^{N}\sum_{k=0}^{T_{n}}\gamma^k R_{t+k+1}^{(n)}$$
	Note that this requires the task to be \textbf{episodic}, meaning that an episode always ends. Otherwise, we are not able to sample complete returns.
	\item We can get into the situation that we visit the same state twice in a trajectory. In the update, we can either just take the first time into account that we visited $s$ (\textit{first-visit MC}), or we can consider all of them as different points (\textit{every-visit MC}). Both approaches are very similar, and converge to the same optimum. However, every-visit MC leads to slightly biased estimates if the number of samples is low.
	\item If we want to use Monte-Carlo for learning the optimal policy, we need to slightly adjust our algorithm. First, note that we rather want to learn the $q$-values as we can determine the optimal policy from them by greedifying: $\pi_{*}(s)=\arg\max_a q_{*}(s,a)$
	\item To guarantee that every state-action pair is visited, we can either:
	\begin{itemize}
		\item Perform ``exploring starts'', meaning that we randomly sample our start state $(S_0,A_0)$. However, note that this requires an environment where we can set the agent to any position, which is not always possible (e.g. in physical systems, hard to initialize velocity or acceleration)
		\item Use a policy that selects every action in every state with non-zero probability. 
	\end{itemize}
	\item As the random starts are often not possible, we mostly choose to integrate exploration into our policy. We can do this either by updating our policy \textit{towards} the greedy one without matching it exactly (called \textbf{on-policy}), or by sampling from a non-greedy behavior policy while updating with respect to our greedy one (called \textbf{off-policy}).
\end{itemize}
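\begin{itemize}
	\item In practice, the sample average above does not require storing all returns. A common incremental form (sketched here with a visit counter $N(s)$ and made-up returns) is:
	$$N(s)\leftarrow N(s)+1, \hspace{5mm} v(s)\leftarrow v(s) + \frac{1}{N(s)}\left(G - v(s)\right)$$
	For example, if we observe the returns $G=4,2,6$ from state $s$, the estimate evolves as:
	\begin{equation*}
		\begin{split}
			v(s) & = 0 + \frac{1}{1}(4 - 0) = 4\\
			v(s) & = 4 + \frac{1}{2}(2 - 4) = 3\\
			v(s) & = 3 + \frac{1}{3}(6 - 3) = 4
		\end{split}
	\end{equation*}
	which equals the plain average $(4+2+6)/3=4$ after every step.
\end{itemize}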
\subsubsection{On-policy MC}
\begin{itemize}
	\item To ensure that every action is taken with a non-zero probability, we can use policies like $\epsilon$-greedy. Every time we update our $q$-value, we can update our policy by making it greedy on $q$, and adding $\epsilon$ as probability for choosing a random action.
	\item We can show that making our $\epsilon$-greedy policy greedy with respect to $q_{\pi}$ in this way never makes it worse, and hence eventually leads to the optimal $\epsilon$-greedy policy, as:
	\begin{equation*}
		\begin{split}
			q_{\pi}(s,\pi'(s)) & =\frac{\epsilon}{|\mathcal{A}(s)|}\sum_a q_{\pi}(s,a) + \left(1-\epsilon\right)\max_a q_{\pi}(s,a)\\
			v_{\pi}(s) & = \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_{\pi}(s,a) + (1-\epsilon)\left[\sum_{a}\frac{\pi(a|s)-\frac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon} q_{\pi}(s,a)\right]\\
		\end{split}
	\end{equation*}
	To show that we improve, we need to show that $v_{\pi'}(s)\geq v_{\pi}(s)$. As the stochastic part $\frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_{\pi}(s,a)$ is unaffected by $\pi$, we only have to compare the greedy part:
	\begin{equation*}
		\begin{split}
			\sum_{a}\frac{\pi'(a|s)-\frac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon} q_{\pi}(s,a) \geq \sum_{a}\frac{\pi(a|s)-\frac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon} q_{\pi}(s,a)\\
		\end{split}
	\end{equation*}
	where we can put in $\pi'$ as the greedy policy:
	\begin{equation*}
		\begin{split}
			\max_a q_{\pi}(s,a) \geq \sum_{a}\frac{\pi(a|s)-\frac{\epsilon}{|\mathcal{A}(s)|}}{1-\epsilon} q_{\pi}(s,a)
		\end{split}
	\end{equation*}
	This is true because the right-hand side is a weighted average of $q_{\pi}(s,a)$ with non-negative weights that sum to one, and such an average can never exceed the maximum.
	\item Hence, when updating the policy, we either improve or stay equally good. Note that this only holds for $\epsilon$-soft policies, meaning policies for which every action has at least $\epsilon/|\mathcal{A}(s)|$ probability of being selected
\end{itemize}
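\begin{itemize}
	\item As a small numerical illustration of the $\epsilon$-greedy policy (numbers made up for this sketch), take $|\mathcal{A}(s)|=4$ actions, $\epsilon=0.1$ and action values $q_{\pi}(s,\cdot)=(1,2,3,4)$. The new policy $\pi'$ assigns:
	\begin{equation*}
		\begin{split}
			\pi'(a|s) & = \begin{cases} 1-\epsilon+\frac{\epsilon}{|\mathcal{A}(s)|} = 0.925 & \text{if } a=\arg\max_{a'} q_{\pi}(s,a')\\ \frac{\epsilon}{|\mathcal{A}(s)|} = 0.025 & \text{otherwise}\end{cases}\\
			q_{\pi}(s,\pi'(s)) & = 0.025\cdot(1+2+3+4) + 0.9\cdot 4 = 3.85
		\end{split}
	\end{equation*}
	The probabilities sum to $0.925+3\cdot 0.025=1$, and every action keeps at least probability $\epsilon/|\mathcal{A}(s)|$, so $\pi'$ is again $\epsilon$-soft.
\end{itemize}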
\subsubsection{Off-policy MC}
\begin{itemize}
	\item In off-policy, we have a behavior policy $b$ from which we sample the trajectories, and our greedy target policy $\pi$. The only constraint on $b$ is that for any action with $\pi(a|s)>0$, we also need $b(a|s)>0$. We can ensure this with any $\epsilon$-soft policy
	\item When sampling, we need to correct for the fact that we use samples from $b$ to evaluate an expectation over $\pi$. One way of doing so is importance sampling:
	$$v_{\pi}(s) \approx \frac{1}{N}\sum_{n=1}^{N} \frac{p(\tau^{n}_{t}|s,A_t\sim\pi)}{p(\tau^{n}_{t}|s,A_t\sim b)} G(\tau_{t}^{n})$$
	where $\tau^{n}_{t}$ is the $n$-th trajectory starting from time step $t$ till the end. We can rewrite the importance weights as $\rho_{t:T-1}=\frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)}$. Note that the transition probabilities between $S_{t}$ and $S_{t+1}$ cancel out as they are the same for $\pi$ and $b$.
	\item When using these importance weights, there are two ways we can average over them:
	\begin{itemize}
		\item \textbf{Ordinary} importance sampling averages by taking the number of trajectories into account:
		$$v_{\pi}(s)=\frac{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)-1}G_t}{|\mathcal{T}(s)|}$$
		While this gives us an unbiased estimate, also for small sample sizes, it suffers from high variance when $\rho_{t:T(t)-1}$ varies a lot (e.g. $b$ and $\pi$ quite different)
		\item \textbf{Weighted} importance sampling averages by summing over weights:
		$$v_{\pi}(s)=\frac{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)-1}G_t}{\sum_{t\in\mathcal{T}(s)}\rho_{t:T(t)-1}}$$
		This approach reduces the variance because we take into account whether we mostly have big or small values of $\rho_{t:T(t)-1}$, but gives a biased estimate. Suppose we have a single sample, then the importance weight cancels out, meaning we estimate $v_{\pi}(s)\approx v_{b}(s)$. The more samples we get, the lower this bias gets.
	\end{itemize}
	Note that while both give the same correct result for $N\to\infty$, they differ for cases with limited sample size. In practice, the lower variance is mostly preferred so that weighted importance sampling is usually applied.
	\item For implementing this, we take an incremental approach as we can calculate importance weights by $\rho_{t:T(t)-1}=\frac{\pi(A_t|S_t)}{b(A_t|S_t)}\rho_{t+1:T(t)-1}$ which we denote by $W$ in Figure~\ref{fig:rl_tabular_methods_offpolicy_MC_control}. Hence, given a trajectory, we should start with the last state, and iterate to the start state.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/rl_tabular_methods_offpolicy_MC_control.png}
		\caption{Incremental implementation of Off-policy Monte-Carlo control.}
		\label{fig:rl_tabular_methods_offpolicy_MC_control}
	\end{figure}
	
	Furthermore, to perform the averaging efficiently, we need to keep track of the normalization constant which is the sum of all importance weights for a certain state-action pair. This we will store in $C$. 
	
	In case we use a greedy policy for $\pi$, we can simplify the algorithm further. In the importance weight, $\pi(A_t|S_t)$ is 1 if $A_t$ was the greedy action, otherwise, we have a factor of 0 (which gives $\rho=0$ for all previous actions). If we update our states in a reverse manner, this means that we can stop the loop as soon as we hit a sub-optimal action. 
	
	Note that $W$ can get fairly large for long trajectories, as $b(A|S)$ is always smaller than 1. We need to use weighted importance sampling instead of ordinary as otherwise, the variance is too high. In addition, as only the tails of the episode are updated frequently, we might have an insufficient amount of samples for the states close to the start, which makes the algorithm inefficient.
	
	
\end{itemize}
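\begin{itemize}
	\item The difference between the two averaging schemes is easiest to see on a tiny example (numbers made up for this sketch). Suppose we have two visits of $s$ with importance weights and returns $(\rho,G)=(3,1)$ and $(\rho,G)=(0.5,0)$:
	\begin{equation*}
		\begin{split}
			\text{Ordinary: } \hspace{2mm} v_{\pi}(s) & \approx \frac{3\cdot 1 + 0.5\cdot 0}{2} = 1.5\\
			\text{Weighted: } \hspace{2mm} v_{\pi}(s) & \approx \frac{3\cdot 1 + 0.5\cdot 0}{3 + 0.5} \approx 0.86
		\end{split}
	\end{equation*}
	The ordinary estimate lies above the largest observed return, which illustrates its high variance. The weighted estimate is a convex combination of the observed returns and therefore always stays within their range.
\end{itemize}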
\subsubsection{On-policy versus Off-policy control}
\label{sec:value_based_tabular_on_off_policy}
\begin{itemize}
	\item After reviewing two alternative ways of learning a policy for a given environment, we can consider what are the advantages and drawbacks of each of the methods
	\item In general, we can note that off-policy is actually a generalization of on-policy because if we set the behavior policy equal to our target policy, i.e. $b=\pi$, then off-policy becomes on-policy
	\item Commonly, on-policy converges faster as it uses the samples from the same policy it updates for. Off-policy can introduce variance by correcting the estimates for the target policy, as e.g. the importance weight can vary a lot if $\pi$ and $b$ are quite different. This can lead to slow convergence.
	\item A benefit of off-policy is however that we can learn from already recorded data, possibly from another source, as only our updates are based on the current policy $\pi$, and not the actual samples as in on-policy. This means that we could use the same data to evaluate multiple policies, which especially helps for limited data/interactions with the environment.
	\item Another point to consider is that off-policy methods allow us to learn the actual greedy policy. This is not possible in the on-policy setting because for guaranteeing the convergence to the correct $q$-values, we need to give every state-action pair a chance greater than zero to be visited. Using an $\epsilon$-soft policy in the on-policy method and greedifying it afterwards can lead to a good approximation, but we also need to keep in mind that we then learn the optimal $\epsilon$-soft policy, and not strictly the optimal greedy policy. Hence, our final moves might be sub-optimal (remember Cliff Walking, Sutton Book Example 6.6, page 132)
\end{itemize}

\subsection{Temporal-Difference Learning}
\begin{itemize}
	\item Combining the ideas of Monte Carlo and Dynamic Programming, we arrive at a different type of methods, called Temporal difference learning. Remember that we can define the value function as a recursive function: $v(s)=\E[R_{t+1}+\gamma v(S_{t+1})|S_t=s]$. We can use this equality as a target instead of $G_t$, leading to the following update rule:
	$$v(s_t)\leftarrow v(s_t)+\alpha\underbrace{\left[R_{t+1}+\gamma v(s_{t+1}) - v(s_t)\right]}_{\text{TD error } \delta_t}$$
	Instead of waiting for a full episode to finish, we can perform this update after \textit{a single action} taken in the environment. This method is called TD(0), and we will see later that this is a special case of TD($\lambda$), or $n$-step TD
	\item TD(0) is a \textit{bootstrapping} method as it uses its own estimates as targets. 
	\item There are two main approaches for learning policies with Temporal Difference, namely \textbf{SARSA} and \textbf{Q-learning}, which we will now discuss in detail. Both learn the $q$-function, but have slightly different update rules.
	
	Note that TD learning can of course also be used for policy evaluation by simply applying the update rule above to samples from the policy we want to evaluate until our value function converges.
\end{itemize}
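\begin{itemize}
	\item \textit{Worked example (illustrative numbers, not taken from the lecture):} to make the TD(0) update concrete, assume a learning rate $\alpha=0.1$, a discount factor $\gamma=0.9$, current estimates $v(s_t)=0.5$ and $v(s_{t+1})=1.0$, and an observed reward $R_{t+1}=1$. The TD error and the updated value are then
	$$\delta_t = 1 + 0.9\cdot 1.0 - 0.5 = 1.4, \hspace{5mm} v(s_t)\leftarrow 0.5 + 0.1\cdot 1.4 = 0.64$$
	The estimate moves a fraction $\alpha$ towards the bootstrapped target $R_{t+1}+\gamma v(s_{t+1})=1.9$, without waiting for the episode to end.
\end{itemize}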
\subsubsection{SARSA}
\begin{itemize}
	\item The update rule of SARSA is as follows:
	$$Q(S_t,A_t)\leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)\right]$$
	where $A_{t+1}$ is selected by the policy $\pi$ based on $S_{t+1}$. Note that $Q(S_{t+1},A_{t+1})=0$ if $S_{t+1}$ is a terminal state.
	\item The method got its name from using $\bm{S}_t$, $\bm{A}_t$, $\bm{R}_{t+1}$, $\bm{S}_{t+1}$, $\bm{A}_{t+1}$ in its update rule.
	\item Note that SARSA is an \textbf{on-policy} method, meaning that it learns the $q$-values of the policy $\pi$ (see Section~\ref{sec:value_based_tabular_on_off_policy} for a discussion of benefits and drawbacks).
	\item Instead of just using the next sampled action to estimate $Q(S_{t+1},A_{t+1})$, we can also take the expectation over our policy, which we know, instead of sampling from it:
	\begin{equation*}
		\begin{split}
			Q(S_t,A_t) & \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1}+\gamma \E_{\pi}[Q(S_{t+1},A_{t+1})|S_{t+1}]-Q(S_t,A_t)\right]\\
			& \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1}+\gamma \sum_{a} \pi(a|S_{t+1}) Q(S_{t+1},a)-Q(S_t,A_t)\right]
		\end{split}
	\end{equation*}
	This method is also called \textbf{expected SARSA}
	\item Note that we can perform \textbf{off-policy} control with expected SARSA, where we use a different behavior policy $b$ to sample, but learn the $q$-values of $\pi$. A special case of this is when we choose $\pi$ to be the greedy policy, which leads to the \textbf{Q-learning algorithm}
\end{itemize}
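\begin{itemize}
	\item \textit{Worked example (illustrative numbers, not taken from the lecture):} the three targets can be compared on a single transition. Assume $\gamma=0.9$, $R_{t+1}=0$, and two actions in $S_{t+1}$ with $Q(S_{t+1},a_1)=1$ and $Q(S_{t+1},a_2)=2$. Let $\pi$ be $\epsilon$-greedy with $\epsilon=0.2$, i.e. $\pi(a_1|S_{t+1})=0.1$ and $\pi(a_2|S_{t+1})=0.9$. Then:
	\begin{equation*}
		\begin{split}
			\text{SARSA, sampled }A_{t+1}=a_1: \hspace{2mm} & 0 + 0.9\cdot 1 = 0.9\\
			\text{SARSA, sampled }A_{t+1}=a_2: \hspace{2mm} & 0 + 0.9\cdot 2 = 1.8\\
			\text{Expected SARSA}: \hspace{2mm} & 0 + 0.9\cdot\left(0.1\cdot 1 + 0.9\cdot 2\right) = 1.71\\
			\text{Greedy target policy}: \hspace{2mm} & 0 + 0.9\cdot\max(1,2) = 1.8
		\end{split}
	\end{equation*}
	The expected SARSA target equals the average of the two SARSA targets under $\pi$ ($0.1\cdot 0.9+0.9\cdot 1.8=1.71$), so it removes the variance caused by sampling $A_{t+1}$ while keeping the same expectation.
\end{itemize}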

\subsubsection{Q-Learning}
\begin{itemize}
	\item As mentioned before, Q-learning is expected SARSA with a greedy target policy. As the greedy policy puts all probability mass on the maximizing action, the update rule simplifies to:
	$$Q(S_t,A_t)\leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1}+\gamma \max_a  Q(S_{t+1},a)-Q(S_t,A_t)\right]$$
	\item It can be shown that Q-learning converges to the optimal $q_{*}$ under the conditions that the learning rate $\alpha$ goes to zero, but not too fast (the usual stochastic approximation conditions $\sum_t \alpha_t=\infty$ and $\sum_t \alpha_t^2<\infty$), and that every state-action pair is visited infinitely often as the number of steps goes to infinity.
	\item However, there are also disadvantages of using the greedy policy. Suppose multiple actions all have the same true value $q(s,a)=0$. While learning, our estimates are noisy, so that some are slightly below and others slightly above 0. When we now take the maximum over our estimates, $\max_a Q(s,a)$, we get a positive value although the ground truth is zero. Hence, we have a positive bias, which we refer to as \textbf{maximization bias}.
	\item This bias can occur whenever we use a maximum operator in our update step. Hence, it is also the case for SARSA if it uses an $\epsilon$-greedy policy.
	\item With an infinite number of samples it might become less relevant, but we are usually limited in computational resources/time. Take for example the environment in Figure~\ref{fig:rl_tabular_methods_maximization_bias}. Going left clearly has a lower expected reward, but because the rewards of the actions in $B$ have a high variance, some of their estimates will be positive, and the maximum over them makes going left look attractive. If we limit our number of samples, it is likely that we have not yet obtained an accurate estimate of each of the actions in $B$, and hence still prefer to go left.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/rl_tabular_methods_maximization_bias.png}
		\caption{Example environment where the maximization bias can lead to a suboptimal policy.}
		\label{fig:rl_tabular_methods_maximization_bias}
	\end{figure}

	\item To overcome this bias, we have to avoid using the maximum of the estimates as an estimate of the maximum of the true values. So, intuitively, we need to determine the maximizing action from somewhere other than the estimates we are trying to update.
	\item A simple method of doing so is \textbf{Double Q-Learning}. Instead of learning a single $q$-function, we learn two, almost independently. Now, we can update $Q_1$ by evaluating its maximizing action with $Q_2$, and vice versa. This removes the positive bias because, even if $Q_1$ and $Q_2$ are noisy themselves, they tend to overestimate different actions.
	\item The general update rule is then:
	\begin{equation*}
		\begin{split}
			\text{Either update }Q_1: \hspace{2mm} Q_1(S_t,A_t) & \leftarrow Q_1(S_t,A_t) + \alpha\left[R_{t+1}+\gamma Q_2\left(S_{t+1},\arg\max_a Q_1(S_{t+1},a)\right)-Q_1(S_t,A_t)\right]\\
			\text{Or update }Q_2: \hspace{2mm} Q_2(S_t,A_t) & \leftarrow Q_2(S_t,A_t) + \alpha\left[R_{t+1}+\gamma Q_1\left(S_{t+1},\arg\max_a Q_2(S_{t+1},a)\right)-Q_2(S_t,A_t)\right]
		\end{split}
	\end{equation*}
	where we randomly assign a sample either to $Q_1$ or $Q_2$ (but not both, because we otherwise get the same bias).
\end{itemize}
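\begin{itemize}
	\item \textit{Worked example (a toy noise model chosen for illustration, not taken from the lecture):} the maximization bias can be computed by hand. Assume a state $s$ with two actions whose true values are both $0$, and assume each estimate $Q(s,a_1)$, $Q(s,a_2)$ is independently $+1$ or $-1$ with probability $\frac{1}{2}$, so each estimate is unbiased. The maximum is $-1$ only if both estimates are $-1$:
	$$\E\left[\max_a Q(s,a)\right] = \frac{1}{4}\cdot(-1) + \frac{3}{4}\cdot(+1) = \frac{1}{2} > 0 = \max_a q(s,a)$$
	With two independent estimators of the same kind, we instead select $a^{*}=\arg\max_a Q_1(s,a)$ and evaluate it with $Q_2$. Since $Q_2$ is independent of the selection and unbiased for every action, we get
	$$\E\left[Q_2(s,a^{*})\right] = 0$$
	which matches the ground truth in this toy setting.
\end{itemize}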

\subsubsection{N-step TD learning}
\begin{itemize}
	\item As we will discuss in Section~\ref{sec:value_based_tabular_difference_TD_MC} in detail, both MC and TD have certain advantages and drawbacks. However, we can actually interpolate between these two, which we call $n$-step TD (a generalization of TD(0))
	\item Instead of bootstrapping on the next state, we could use the reward $R_{t+2}$ as well, and then bootstrap on $v(S_{t+2})$. This leads to $2$-step TD. It becomes clear that if we only ever bootstrap on the terminal state, meaning $\infty$-step TD, we arrive at MC, as the value of a terminal state is zero. So, we approximate the return $G_{t:t+n}$ for $n$-step TD by:
	$$G_{t:t+n} = R_{t+1}+\gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n v_{t+n-1}(S_{t+n})$$
	The update rule is hence:
	$$v_{t+n}(S_t) = v_{t+n-1}(S_t) + \alpha \left[G_{t:t+n}- v_{t+n-1}(S_t)\right]$$
	\item Which $n$ works best depends on the environment (see Section~\ref{sec:value_based_tabular_difference_TD_MC} for a more detailed discussion).
	\item To enable off-policy learning, we would require importance sampling as in MC which can introduce additional variance. However, there is an alternative in $n$-step, namely the $n$-step \textbf{Tree Backup} algorithm
	\item In $n$-step tree backup, we take the next $n$ steps into account, but at each action decision, we also look at all other actions. As visualized in Figure~\ref{fig:rl_tabular_methods_n_step_tree_backup}, we now have multiple leaf nodes. At each of the leaves, we use our estimates.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.1\textwidth]{figures/rl_tabular_methods_n_step_tree_backup.png}
		\caption{3-step tree backup as a backup diagram (see Section~\ref{sec:value_based_tabular_backup_diagram}).}
		\label{fig:rl_tabular_methods_n_step_tree_backup}
	\end{figure}
	\item Even if we sample from a behavior policy $b$, we can re-weight each of the leaves' contributions by the target policy $\pi$. For example, at $S_{t+1}$, the leaves are weighted by $\pi(A_1|S_{t+1})$ and $\pi(A_3|S_{t+1})$, while the main sample trajectory gets a factor $\pi(A_2|S_{t+1})$. The next leaves then have $\pi(A_1|S_{t+2})\pi(A_2|S_{t+1})$ etc.
	\item For a 1-step tree backup, we get the same return estimate as for expected SARSA:
	$$G_{t:t+1}=R_{t+1}+\gamma\sum_a \pi(a|S_{t+1})Q_t(S_{t+1},a)$$
	If we have $n$ steps, we can calculate our return estimate recursively:
	$$G_{t:t+n}=R_{t+1}+ \gamma\Bigg[\underbrace{\sum_{a\neq A_{t+1}} \pi(a|S_{t+1})Q_{t+n-1}(S_{t+1},a)}_{\text{leaf contributions}} + \underbrace{\pi(A_{t+1}|S_{t+1})G_{t+1:t+n}}_{\text{samples at }t+1}\Bigg]$$
	\item Our behavior policy therefore influences where we generate longer updates, but in expectation (due to the re-weighting), we still get the correct value estimates. Hence, we can use it on off-policy data without needing importance weights ($b$ influences just the depth of certain updates)
\end{itemize}
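\begin{itemize}
	\item \textit{Worked expansion (a sketch obtained by unrolling the recursion once):} for $n=2$, the inner return $G_{t+1:t+2}$ is a 1-step tree backup return starting at $S_{t+1}$, i.e. $G_{t+1:t+2}=R_{t+2}+\gamma\sum_a \pi(a|S_{t+2})Q_{t+1}(S_{t+2},a)$. Substituting it into the recursive definition gives
	\begin{equation*}
		\begin{split}
			G_{t:t+2} = R_{t+1} & + \gamma\sum_{a\neq A_{t+1}} \pi(a|S_{t+1})Q_{t+1}(S_{t+1},a)\\
			& + \gamma\,\pi(A_{t+1}|S_{t+1})\left[R_{t+2}+\gamma\sum_{a} \pi(a|S_{t+2})Q_{t+1}(S_{t+2},a)\right]
		\end{split}
	\end{equation*}
	The first sum contains the leaves at depth one, and the bracket contains the sampled reward and the leaves at depth two. Only probabilities of the target policy $\pi$ appear, which is why no importance weights with respect to $b$ are needed.
\end{itemize}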

\subsection{Comparing tabular-based methods}
\begin{itemize}
	\item We have now seen various methods, all of which take a slightly different approach to Reinforcement Learning.
	\item In this section, we compare and review all methods, and put things into perspective
\end{itemize}
\subsubsection{Difference of TD learning and Monte Carlo}
\label{sec:value_based_tabular_difference_TD_MC}
\begin{itemize}
	\item TD can be implemented in an online, fully incremental fashion. In contrast, MC has to wait for the whole episode to finish which can delay learning in applications with very long episodes.
	\item TD learning is more strongly influenced by the initial values we give the $v$/$q$-estimates (and is hence \underline{biased}), which can slow down training. Especially when we try to learn a policy (as in Q-learning), we can focus our exploration in the wrong direction for a long time before finding a better, possibly optimal, behavior.
	\item In general, MC suffers from \underline{high variance}. This is because $G_t$ is estimated from a single sampled trajectory, which is often not enough for an accurate estimate. For example, assume we have a uniform policy over two actions in a deterministic environment, and an episode is always exactly 10 steps long. Then, the expectation gives each of the $2^{10}=1024$ possible action sequences a weight of $\frac{1}{1024}$, whereas MC simply picks one of them, which is clearly inaccurate.
	\item As usual, there is a compromise between both, which is $n$-step TD learning. Which $n$ to choose is highly dependent on the environment. In the extreme case of only having a single action to take (and deterministic transitions and rewards), MC is clearly preferred because there is no variance in the updates. However, if many different outcomes are possible from the same state, TD learning might be the better option. A common rule of thumb is that we need a lower learning rate for larger $n$ due to the increase in variance.
	\item Another interesting difference is that (batch) MC finds the estimates that minimize the \underline{mean-squared error} on the training set, whereas (batch) TD(0) finds the estimates that would be exactly correct for the \underline{maximum-likelihood model} of the Markov process. This model estimates the transition probability from state $i$ to $j$ as the fraction of observed transitions from $i$ that went to $j$, and the expected reward as the average of the rewards observed for this transition. Thus, TD learning exploits the Markov property of the environment while MC neglects it. Note that this property can make a big difference if we observe only a small number of data points for each state.
\end{itemize}
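\begin{itemize}
	\item \textit{Worked example (a small hypothetical data set in the style of the well-known predictor example from the Sutton book, with $\gamma=1$):} suppose we observe eight episodes, written as sequences of states and rewards. One episode is $A, 0, B, 0$, six episodes are $B, 1$, and one episode is $B, 0$. Both methods agree on $B$, which was followed by a return of $1$ in six of eight visits:
	$$v(B) = \frac{6}{8} = \frac{3}{4}$$
	For $A$, batch MC only sees a single return of $0$ and minimizes the squared error on it, while batch TD(0) uses the maximum-likelihood model in which $A$ always transitions to $B$ with reward $0$:
	$$v_{\text{MC}}(A) = 0, \hspace{5mm} v_{\text{TD}}(A) = 0 + v(B) = \frac{3}{4}$$
	If the process is indeed Markov, the TD estimate is the one we would expect to generalize better to future data, although the MC estimate has zero error on the training set.
\end{itemize}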
\subsubsection{Backup diagrams}
\label{sec:value_based_tabular_backup_diagram}
\begin{itemize}
	\item Another way to visualize the difference between TD and MC is the use of \textit{backup diagrams}.
	\item A backup diagram visualizes a sequence of actions (black dots) and states (white nodes). We always go from a state to an action, which leads us to a next state. One example diagram was already given in Figure~\ref{fig:rl_tabular_methods_n_step_tree_backup}.
	\item Now, consider Figure~\ref{fig:rl_tabular_methods_final_comparison} where we visualize the different extreme cases. TD learning has a shallow update (bootstraps on the next state), while Monte Carlo has a deep update that goes until the end of the episode. The trade-off here is between bias and variance, and $n$-step TD lies in between those two extremes.
	
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/rl_tabular_methods_final_comparison.png}
		\caption{Putting all methods so far into perspective.}
		\label{fig:rl_tabular_methods_final_comparison}
	\end{figure}
	
	To the right, we get \textit{wider} updates, i.e. we move from sampled updates towards expected updates. This means that we consider more of the actions we can take and states we can end up in. Dynamic programming uses the whole environment dynamics, hence it takes all possible actions and next states into account. In between TD and DP, we can place expected SARSA and Q-learning, as they take all next actions into account, but still sample the next state and bootstrap on its estimates.
	
	If we make those methods deeper, we arrive at the $n$-step tree backup algorithm. Making dynamic programming deeper means that we consider \textbf{all} possible outcomes from a given state, which includes all actions we can take and all next states we can end up in. This obviously results in a huge, intractable tree for most environments (if the dynamics are given at all), so that it is mostly not feasible to perform.
\end{itemize}
\subsubsection{Limitations so far}
\begin{itemize}
	\item To summarize the tabular-based methods, we want to review their limitations so far.
	\item First of all, as the name indicates, for all the previous methods we store the $q$- and $v$-functions as a table. This is however not always possible. Suppose we have a continuous state space. Then we cannot create a table for that. Alternatively, imagine we want to play an Atari game. With a screen resolution of $256\times 256$ and 3 color channels of 8 bits each, a frame consists of $256\times 256\times 3$ values with $256$ possible settings each, so there are $256^{256\cdot 256\cdot 3}$ different frames, making it infeasible to store a table entry for each of them. It would also be extremely inefficient because similar frames mostly call for similar actions. This leads us to approximate value-based learning methods, which we will review in Section~\ref{sec:value_based_approximation}.
	\item Currently, we have to choose the (behavior) policy with which we explore the environment ourselves. Furthermore, if we want to learn an optimal stochastic policy, we also need to set $\epsilon$ in $\epsilon$-soft policies, or the temperature for softmax distributions. Randomness is not only useful for exploration: in partially observable environments, we also have uncertainty about the state which we have to take into account. The question arises whether we can learn the optimal amount of stochasticity within the algorithm itself, which we will discuss in Sections~\ref{sec:policy_learning} and \ref{sec:partially_observable}.
	\item Until we have learned what the effects of our actions are, it takes quite some time for TD and MC to learn. If we want to take sample efficiency into account, we might want to consider model-based approaches, as will be discussed in Section~\ref{sec:model_based}.
\end{itemize}