Repository: phlippe/UvA_Summaries
Branch: master
Commit: c42eab447ecd
Files: 88
Total size: 781.5 KB

Directory structure:
gitextract_3o3s9o2d/

├── .gitignore
├── Computer_Vision_1/
│   ├── cv_appendix.tex
│   ├── cv_applications.tex
│   ├── cv_deep_learning.tex
│   ├── cv_deep_video.tex
│   ├── cv_imgformation.tex
│   ├── cv_imgprocessing.tex
│   ├── cv_intro.tex
│   ├── cv_object_rec.tex
│   └── cv_summary.tex
├── Deep_Learning/
│   ├── cheat_sheet/
│   │   └── main.tex
│   ├── dl_appendix.tex
│   ├── dl_autoregressive.tex
│   ├── dl_bayesian.tex
│   ├── dl_convnets.tex
│   ├── dl_deep_rl.tex
│   ├── dl_generative_models.tex
│   ├── dl_intro.tex
│   ├── dl_modularity.tex
│   ├── dl_optimization.tex
│   ├── dl_rnn.tex
│   └── dl_summary.tex
├── Information_Retrieval_1/
│   ├── ir_boolean_retrieval.tex
│   ├── ir_click_models.tex
│   ├── ir_counterfactual_eval.tex
│   ├── ir_language_models.tex
│   ├── ir_learning_to_rank.tex
│   ├── ir_neural_models.tex
│   ├── ir_offline_evaluation.tex
│   ├── ir_online_evaluation.tex
│   ├── ir_semantic_matching.tex
│   └── ir_summary.tex
├── Knowledge_Representation/
│   ├── figures/
│   │   └── figures.pptx
│   ├── kr_csp.tex
│   ├── kr_dl.tex
│   ├── kr_intro.tex
│   ├── kr_qr.tex
│   ├── kr_sat.tex
│   └── kr_summary.tex
├── LICENSE
├── ML4QS/
│   ├── mlqs_clustering.tex
│   ├── mlqs_feature_engineering.tex
│   ├── mlqs_intro.tex
│   ├── mlqs_modeling_with_time.tex
│   ├── mlqs_modeling_without_time.tex
│   ├── mlqs_reinforcement_learning.tex
│   ├── mlqs_sensory_noise.tex
│   ├── mlqs_summary.tex
│   └── mlqs_supervised_learning.tex
├── Machine_Learning_1/
│   ├── ml_appendix.tex
│   ├── ml_basic_probability.tex
│   ├── ml_combining_models.tex
│   ├── ml_kernel_methods.tex
│   ├── ml_linear_classification.tex
│   ├── ml_linear_regression.tex
│   ├── ml_neural_networks.tex
│   ├── ml_summary.tex
│   └── ml_unsupervised_learning.tex
├── Machine_Learning_2/
│   ├── ml2_appendix.tex
│   ├── ml2_causality.tex
│   ├── ml2_exponential_family.tex
│   ├── ml2_graphical_models.tex
│   ├── ml2_graphical_models.tex.recover.bak~
│   ├── ml2_sampling_methods.tex
│   ├── ml2_sequential_data.tex
│   ├── ml2_summary.tex
│   └── ml2_variational_EM.tex
├── Natural_Language_Processing_1/
│   ├── nlp_bayesian.tex
│   ├── nlp_compositional_semantic.tex
│   ├── nlp_dialog_modelling.tex
│   ├── nlp_formal_grammars.tex
│   ├── nlp_lexical_distributional_semantics.tex
│   ├── nlp_morphology.tex
│   ├── nlp_pos_tagging.tex
│   ├── nlp_summarization.tex
│   ├── nlp_summary.tex
│   ├── nlp_textual_entailment_paraphrasing.tex
│   └── nlp_translation.tex
├── README.md
└── Reinforcement_Learning/
    ├── rl_appendix.tex
    ├── rl_introduction.tex
    ├── rl_learning_with_approx.tex
    ├── rl_mcts_alpha_go.tex
    ├── rl_model_based.tex
    ├── rl_partially_observable.tex
    ├── rl_policy_gradient_methods.tex
    ├── rl_summary.tex
    └── rl_tabular_methods.tex

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*.aux
*.log
*.out
*.synctex.gz
*.toc
*.txss
.DS_Store

================================================
FILE: Computer_Vision_1/cv_appendix.tex
================================================
\section{Practicals}
Gathering some interesting/important questions from the practicals and old exams.
\subsection{Color spaces}
\subsubsection{General parameters in color spaces}
\begin{itemize}
	\item \textbf{Chromaticity}: the color component regardless of its luminance/intensity. For example, the $xy$-diagram in Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ_diagram} visualizes the chromaticity (includes saturation and hue)
	\item \textbf{Saturation}: defined as ``colorfulness of a stimulus relative to its own brightness''. In the normalized $rgb$ space, it is the distance to the point $(1/3,1/3,1/3)$ (ratio to the maximum distance). In case of the wavelength distribution, a color is saturated if it is very peaked.
	\item \textbf{Intensity}: the energy of the light. It is the integral of the wavelength distribution.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.25\textwidth]{figures/cv_image_formation_rg_chromaticity.png}
		\caption{\textit{rg}-chromaticity diagram. A point in this space represents the chromaticity (color without intensity). Its distance to the point $(1/3,1/3)$ (taking a white light source as reference), relative to the distance from that point to the border, gives the saturation.}
	\end{figure}
\end{itemize}
\subsubsection{XYZ color space}
Calculate saturation, hue, intensity, plotting in the diagram, using reference lights, etc.

Interpolate between colors. We can perceive color (e.g. white) although it is not as we would define 
\subsubsection{Color invariance}
How to determine whether a formula is color invariant or not.
\begin{itemize}
	\item Color invariance aims to remove the effect of transformations that do not change the color itself, but let the sensor perceive it differently.
	\item Hence, color invariant models are more or less insensitive to varying imaging conditions such as variations in illumination (light source) and object pose (shading, highlighting cues)
	\item For example, if we assume a Lambertian world with only body reflection and a white light source (equal for all wavelengths), we get for the $rgb$ space (note that $R=\cos\theta \cdot e\cdot \int_{\lambda} p(\lambda) f_R(\lambda)d\lambda$):
	\begin{equation*}
		\begin{split}
			r & = \frac{R}{R + G + B} = \frac{\cancel{\cos\theta} \cdot \cancel{e}\cdot \int_{\lambda} p(\lambda) f_R(\lambda)d\lambda}{\cancel{\cos\theta} \cdot \cancel{e}\cdot \int_{\lambda} p(\lambda) \left(f_R(\lambda) + f_G(\lambda) + f_B(\lambda)\right)d\lambda}
		\end{split}
	\end{equation*}
	The factors $\cos\theta$ and $e$ cancel. Thus, the normalized \textit{rgb} color space is color invariant when assuming a Lambertian reflection model.
\end{itemize}
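As a quick numeric check of this cancellation (the pixel values are made up for illustration), scaling all three channels by the same factor, e.g. by halving the illumination intensity, leaves the normalized coordinates unchanged:
\begin{equation*}
	\begin{split}
		(R,G,B) = (60, 90, 150) & \Rightarrow (r,g,b) = \left(\tfrac{60}{300}, \tfrac{90}{300}, \tfrac{150}{300}\right) = (0.2, 0.3, 0.5)\\
		(R,G,B) = (30, 45, 75) & \Rightarrow (r,g,b) = \left(\tfrac{30}{150}, \tfrac{45}{150}, \tfrac{75}{150}\right) = (0.2, 0.3, 0.5)
	\end{split}
\end{equation*}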
\subsection{Convolution operator}
\subsubsection{Difference between convolution and correlation}
Formally, correlation measures the similarity between two signals, while convolution measures the effect of one signal on the other. In practice, however, correlation simply moves the filter over the image and computes the weighted sum of the window at each pixel. Convolution is practically the same, except that the filter is rotated by 180 degrees before it is moved over the image. The formulas are:
\begin{equation*}
	\begin{split}
		\text{Correlation:} & I_{out} = I \otimes h,\hspace{1mm} I_{out}(i,j) = \sum\limits_{k,l} I(i+k, j+l) \cdot h(k,l)\\
		\text{Convolution:} &  I_{out} = I \ast h,\hspace{1mm} I_{out}(i,j) = \sum\limits_{k,l} I(i-k, j-l) \cdot h(k,l)
	\end{split}
\end{equation*}
Note that for both methods, taking the center pixel or a corner pixel as the start point for a filter only shifts the output; the computed values are the same.
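
A small 1D sketch can make the flip visible. The signal and filter are illustrative choices, not taken from the practicals: let $I$ be a unit impulse ($I(0)=1$ and $I(i)=0$ otherwise) and let $h$ be the asymmetric filter with $h(-1)=1$, $h(0)=2$, $h(1)=3$. Applying the 1D versions of the two formulas gives
\begin{equation*}
	\begin{split}
		\text{Correlation:} & \hspace{1mm} I_{out}(i) = \sum\limits_{k} I(i+k) \cdot h(k) = h(-i) \hspace{2mm}\Rightarrow\hspace{2mm} \left[I_{out}(-1), I_{out}(0), I_{out}(1)\right] = \left[3, 2, 1\right]\\
		\text{Convolution:} & \hspace{1mm} I_{out}(i) = \sum\limits_{k} I(i-k) \cdot h(k) = h(i) \hspace{2mm}\Rightarrow\hspace{2mm} \left[I_{out}(-1), I_{out}(0), I_{out}(1)\right] = \left[1, 2, 3\right]
	\end{split}
\end{equation*}
Convolving an impulse reproduces the filter, while correlating it returns the mirrored filter. For symmetric filters both operations coincide.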
\subsubsection{Convolving two filters}
Two consecutive filters applied to an image can be combined into one by convolving the two filters. There are two ways to calculate the convolution of two filters. The more intuitive way is to calculate the effect of every element of the first filter on the second one and to sum the shifted results.
Example:
\begin{equation*}
	\begin{split}
		f &=\left[\begin{array}{ccc}3 & 7 & 6\end{array}\right], \hspace{2mm}g=\left[\begin{array}{ccc}-1 & 5 & 8\end{array}\right] \Rightarrow f\ast g \\[5pt]
		& \implies \begin{array}{cccccc}
			& [-1\cdot 3 & 5\cdot 3 & 8\cdot 3] & & \\
		 +	& & [-1\cdot 7 & 5\cdot 7 & 8\cdot 7] & \\
		 +	& & & [-1\cdot 6 & 5\cdot 6 & 8\cdot 6] \\[5pt]
		 \hline
		 & [ -3 & 8 & 53 & 86 & 48 ]
		\end{array}
	\end{split}
\end{equation*}
The second option is to apply convolution right away with extended zero padding. We can imagine using infinite zero padding and removing the zero elements from the convolved filter afterwards. Note that we perform convolution, and therefore have to flip the second filter.
\begin{equation*}
	\begin{split}
		f\ast g & = \left[\begin{array}{ccccccc}0 & 0 & 3 & 7 & 6 & 0 & 0\end{array}\right] \otimes \left[\begin{array}{ccc}8 & 5 & -1\end{array}\right]\\
		& = \left[\begin{array}{ccccccc}-1\cdot 3 & (5\cdot 3 - 1\cdot 7) & (8\cdot 3 + 5\cdot 7 - 1\cdot 6) & (8\cdot 7 + 5\cdot 6) & 8\cdot 6\end{array}\right]\\
		& = \left[\begin{array}{ccccc}-3 & 8 & 53 & 86 & 48\end{array}\right]
	\end{split}
\end{equation*}
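One way to double-check such a result, offered here as a side remark rather than as part of the practicals, is to read the filters as polynomial coefficients: convolving two coefficient sequences is the same as multiplying the corresponding polynomials.
\begin{equation*}
	\begin{split}
		(3 + 7x + 6x^2)\cdot(-1 + 5x + 8x^2) & = -3 + (15 - 7)\,x + (24 + 35 - 6)\,x^2 + (56 + 30)\,x^3 + 48\,x^4\\
		& = -3 + 8x + 53x^2 + 86x^3 + 48x^4
	\end{split}
\end{equation*}
The coefficients match the convolved filter $\left[\begin{array}{ccccc}-3 & 8 & 53 & 86 & 48\end{array}\right]$.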
\subsubsection{Linearly Separable Filters}
Some 2D filters are separable in their $x$ and $y$ dimension. We can test it by comparing the convolution of separated $x$ and $y$ filters with the 2D version.
\begin{itemize}
	\item \textit{What is the benefit of separable filters?}
	
	\underline{Answer}: The computational cost per pixel is reduced from $k^2$ to $2\cdot k$ multiplications.
	
	\item \textit{Prove that a 2D Gaussian filter is linearly separable.}
	
	\underline{Answer}: We can show this holds for the continuous case, and thus also for the discrete. Note that we can neglect a constant factor $c$ for normalization as this does not introduce any significant computational effort.
	\begin{equation*}
		\begin{split}
			G_x * G_y  & = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{x^2}{2\sigma^2}} * \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{y^2}{2\sigma^2}}\\
			& = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}\\
			& = G_{xy}
		\end{split}
	\end{equation*}
	\item \textit{Prove that a 2D box filter (size $3\times 3$) is linearly separable.}
	
	\underline{Answer}: We can show this by simply computing the convolution.
	\begin{equation*}
	\begin{split}
		\left[\begin{array}{ccc}1 & 1 & 1\end{array}\right] *  
		\left[\begin{array}{c}1 \\ 1 \\ 1\end{array}\right] & = \left[\begin{array}{ccc}
		1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1
		\end{array}\right]\\
	\end{split}
	\end{equation*}
	\item \textit{Check whether the following 2D filter is linearly separable:}
	$$h = \left[\begin{array}{ccc}
	1 & -2 & 1\\ -2 & 4 & -2\\ 1 & -2 & 1
	\end{array}\right]$$
	
	\underline{Answer}: A filter is separable if every row is a multiple of the same row vector (and every column a multiple of the same column vector), i.e. the matrix has rank 1. In this case, we can easily spot the pattern:
	\begin{equation*}
	\begin{split}
	\left[\begin{array}{ccc}1 & -2 & 1\end{array}\right] *  
	\left[\begin{array}{c}1 \\ -2 \\ 1\end{array}\right] & = \left[\begin{array}{ccc}
	1 & -2 & 1\\ -2 & 4 & -2\\ 1 & -2 & 1
	\end{array}\right]\\
	\end{split}
	\end{equation*}
	\item \textit{Check whether the following 2D filter is linearly separable:}
	$$h = \left[\begin{array}{ccc}
	1 & 8 & 3\\ 7 & 6 & 2\\ 4 & 9 & 5
	\end{array}\right]$$
	
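	\underline{Answer}: No, this kernel is not linearly separable: the rows $(1, 8, 3)$ and $(7, 6, 2)$ are not multiples of each other, so the kernel cannot be written as the product of a column filter and a row filter.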
\end{itemize}
\subsection{Object detection}

\subsection{Convolutional Neural Networks}
\subsubsection{Amount of parameters, output size and computational cost}
\begin{itemize}
	\item \textbf{Output size}: the spatial output size of a convolutional layer depends on the kernel size $k$, the padding $p$ (per side), the stride $s$ and the input size $w_i$. The output size is then calculated by $w_o = (w_i + 2\cdot p - k)/s + 1$
	\begin{itemize}
		\item \textit{What is the size of the output volume with stride $3$, kernel $5\times 5$, number of neurons $5$ and input size $32\times 32\times 3$ (no padding)?}
		
		\underline{Answer}: The spatial output size is $w_o = (32 + 2\cdot 0 - 5)/3 + 1 = 10$. With $5$ neurons (filters), the output volume is $10\times 10\times 5$.
		
		\item \textit{What padding size is required to keep the output size equal to the input size for a kernel $k$ and stride $s$?}
		
		\underline{Answer}: we have to reverse the equation above to:
		\begin{equation*}
			\begin{split}
				w_o = w_i & = (w_i + 2\cdot p - k)/s + 1\\
				\Leftrightarrow (w_i - 1) \cdot s & = w_i + 2\cdot p - k\\
				\Leftrightarrow p & = \frac{1}{2}\left(w_i \cdot \left(s-1\right) - s + k\right)
			\end{split}
		\end{equation*}
		Hence, if stride is $s=1$, the necessary padding is $p=\frac{k-1}{2}$.
		
		\item \textit{How many output frames do we get for a 3D convolution of $3\times 3\times 3$ (stride $s=3$ and padding $p=1$ in temporal dimension) on an input video size of $16\times 256\times 256\times 3$?}
		
		\underline{Answer}: We can apply the same formula as before: $l_o = (16 + 2\cdot 1 - 3)/3 + 1 = 6$ output frames.
	\end{itemize}
	\item \textbf{Number of parameters}: a 2D convolution contains $k\times k\times c_F \times c_G$ parameters where $k$ is the kernel size, and $c_F$ and $c_G$ the number of input and output channels. For a 3D convolution, we multiply it by another $k$. Note that all these three $k$'s can be different (e.g. $3\times 3\times 1$, $5\times 1 \times 1$, ...)
	\begin{itemize}
		\item \textit{How many parameters are learned in a convolutional layer with an RGB input image, $5\times 5$ kernel size and $100$ different filters?}
		
		\underline{Answer}: We learn $5\times 5\times 3\times 100 = 7,500$ parameters for the filters, and $100$ biases. Thus, we have overall $7,600$ parameters.
		
		\item \textit{How many parameters are learned if we set the padding to $p=2$ and stride $s=2$?}

		\underline{Answer}: The number of parameters is independent of the stride and the padding.
	\end{itemize}
	\item \textbf{Computational cost}: The computational cost of a layer is the cost of a single filter application (the filter size) times the number of output neurons.
	\begin{itemize}
		\item \textit{Given the input $w_F \times h_F \times c_F$ and output $w_G \times h_G \times c_G$, what is the computational cost of a 2D convolution with kernel size $k\times k$ between these two layers?}
		
		\underline{Answer}: The cost of applying a single filter once is $k\times k\times c_F$. We then have to move the filter over $x$ and $y$ dimension, and repeat it for $c_G$ filters. Thus, the overall cost is determined by:
		$$k\times k\times c_F\times c_G\times w_G\times h_G$$
		
		\item \textit{Given the input $256 \times 256 \times 3$, what is the computational cost of a 2D convolution with kernel size $7\times 7$, $32$ output channels, stride $s=3$ and padding $p=0$?}
		
		\underline{Answer}: We first have to calculate the output size $w_G = (w_F + 2\cdot p - k)/s + 1 = (256 + 0 - 7)/3 + 1 = 84$ and $h_G = 84$. Next, we can apply our previous formula:
		$7\times 7\times 3\times 32\times 84\times 84 \approx 3.3\cdot 10^{7}$
		
		\item \textit{What are two ways to reduce the number of computations for 2D convolutions?}
		
		\underline{Answer}: Same as in case of 3D convolutions. We can either do depth-wise convolutions (\textit{MobileNet}), or do pseudo 2D convolutions by separating the filter $k\times k$ to a $1\times k$ and $k\times 1$ convolution (\textit{InceptionV2}).
	\end{itemize}
\end{itemize}
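To get a feeling for the saving of depth-wise convolutions, the cost formula above can be applied to both variants. This is a sketch under the assumption that the depth-wise variant consists of one $k\times k$ filter per input channel followed by a $1\times 1$ convolution over the channels; the numbers $k=7$, $c_F=3$ and $c_G=32$ are those of the example above. The spatial output size cancels in the ratio:
\begin{equation*}
	\begin{split}
		\frac{k\cdot k\cdot c_F\cdot w_G\cdot h_G + c_F\cdot c_G\cdot w_G\cdot h_G}{k\cdot k\cdot c_F\cdot c_G\cdot w_G\cdot h_G} & = \frac{1}{c_G} + \frac{1}{k^2} = \frac{1}{32} + \frac{1}{49} \approx 0.052
	\end{split}
\end{equation*}
Under these assumptions the depth-wise variant needs roughly $19$ times fewer multiplications.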
\subsubsection{Other general questions}
\begin{itemize}
	\item \textbf{Locally constrained layer}: A convolutional layer where we don't share weights over spatial dimensions.
	\begin{itemize}
		\item \textit{How many parameters are needed for a locally constrained layer, where each neuron looks at a $10\times10$ window, when using $W=H=100$, and stride of $5$?}
		
		\underline{Answer}: The spatial output size is $(100 - 10) / 5 + 1 = 19$ so that we have $19\times 19=361$ different kernels. Combined with the kernel/window size, we get overall $10\times 10\times 361=36,100$ parameters.
		\item \textit{Describe a scenario where weight sharing as done in plain convolutional layers is not beneficial for recognition}
		
		\underline{Answer}: Weight sharing works most effectively if the input is translation invariant, i.e. the same pattern can appear anywhere in the image. If this is not the case and objects always appear at fixed positions, we should for example use locally constrained layers where the weights are not shared. This may lead to more parameters but reduces the required number of channels (restricted number of possible objects per position). Example: face recognition with standardized position (eyes and mouth filters at different parts of the image).
	\end{itemize}
\end{itemize}


================================================
FILE: Computer_Vision_1/cv_applications.tex
================================================
\section{Applications}
Not in the exam :-)

================================================
FILE: Computer_Vision_1/cv_deep_learning.tex
================================================
\section{Deep Learning}
\begin{itemize}
	\item Deep Neural Networks perform hierarchical feature learning and classification in a single architecture
\end{itemize}
\subsection{Convolutional Neural Networks}
\begin{itemize}
	\item The key layers of CNNs are convolutions. Their weights are surface-wise (spatially) local, but depth-wise global.
	\item Multiple neurons look at the same position, but using different kernels (channels)
	\item Parameters of a convolutional layer
	\begin{itemize}
		\item \textit{Kernel size}: size of the filter which is learned. If size is $k\times k$, we learn overall $k^2$ parameters per channel
		\item \textit{Input channels}: number of input channels $c_i$. Every filter has the size of $k\times k\times c_i$
		\item \textit{Output channels}: number of output channels $c_o$. Represent the number of different filters learned.
		\item \textit{Stride} with which we slide the filter over the image. Stride of $s=1$ means we apply a filter on every pixel as usual, $s=2$ would skip every second pixel and $s=4$ takes only every fourth pixel as center of a filter application. Default: $s=1$.
	\end{itemize}
	\item Overall, we learn $(k\times k\times c_i + 1)\times c_o$ \textbf{parameters} in a convolutional layer (the 1 extra parameter for bias)
	\item The \textbf{output size} is calculated by $$h_o = (h_i + 2\cdot p - k) / s + 1$$ where $p$ is the padding (number of extra pixels on each side)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.2\textwidth]{figures/cv_deep_learning_convolution_operator.png}
		\caption{Convolutional layer in a CNN}
	\end{figure}
	\item Activation layers like ReLU ($\max(0,x)$) introduce non-linearity
	\item Pooling aggregates multiple values into a single value making it invariant to small transformations. Reduces the size of the next output layer while keeping the most important information 
\end{itemize}
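As a worked instance of the parameter and output size formulas above, consider a hypothetical layer (values chosen only for illustration) with input $32\times 32\times 3$, kernel size $k=3$, $16$ output channels, padding $p=1$ and stride $s=1$:
\begin{equation*}
	\begin{split}
		\text{Parameters:} & \hspace{2mm} (3\times 3\times 3 + 1)\times 16 = 28\times 16 = 448\\
		\text{Output size:} & \hspace{2mm} h_o = (32 + 2\cdot 1 - 3)/1 + 1 = 32 \hspace{2mm}\Rightarrow\hspace{2mm} 32\times 32\times 16
	\end{split}
\end{equation*}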
\subsubsection{Transfer Learning}
\begin{itemize}
	\item Reuse information gained on a large dataset (e.g. ImageNet) on a new one
	\item Depending on the amount and similarity of data with the pretrained one, we should fine-tune different layers (see Figure~\ref{fig:transfer_learning})
	\item Transfer Learning can greatly influence the performance of a network. Low level features (first layers) are almost always the same for images as we have to detect edges, colors, etc.
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.5\textwidth]{figures/cv_deep_learning_transfer_learning.png}
	\caption{Transfer learning}
	\label{fig:transfer_learning}
\end{figure}
% cv_deep_learning_transfer_learning.png
\subsection{GANs}
\begin{itemize}
	\item Goal: capture the underlying data distribution and be able to generate new samples
	\item Next to generative adversarial networks, we can also apply Variational Autoencoders or PixelCNN/RNN for this task
	\item GANs are trained by a minimax game between two neural networks (Discriminator $D$ and Generator $G$). $G$ wants to fool $D$ by generating realistic images. $D$ tries to distinguish between generated and real images/data:
	$$\min_G \max_D V(G,D) = \mathbb{E}_{\bm{x}\sim p_{\text{data}}(\bm{x})} \left[\log \left(D\left(\bm{x}\right)\right)\right] + \mathbb{E}_{\bm{z}\sim p_{z}(\bm{z})} \left[\log\left(1 - D\left(G\left(\bm{z}\right)\right)\right)\right] $$
	\item The standard/plain GAN architecture uses a noise vector $\bm{z}$ as input to the generator. Note that it is also possible to feed an additional input (e.g. a label or an image) to the generator and condition the generated output on it (aka \textit{conditional GANs}). To ensure that the generator learns a relation from input to output, we might need to add an additional loss term like MSE to a label
	\item The training procedure consists of two steps which can be alternated or repeated by themselves for multiple times
	\begin{enumerate}
		\item \textit{Fix $G$ and train $D$}: in order to train the discriminator, we let $G$ generate fake images and feed the discriminator both the fake and sampled real data. Note that we need to fix $G$ to not backpropagate the error of $D$ through $G$.
		\item \textit{Fix $D$ and train $G$}: $G$ is trained by generating images and backpropagating the error of the prediction of $D$ (towards prediction of a real image). Although the gradients flow back through $D$, we do not update any weights of the discriminator as we otherwise cheat (train $D$ to optimize loss of $G$)
	\end{enumerate}
\end{itemize}
\subsubsection{Stability and Training problems}
\begin{itemize}
	\item In general, it is hard to train a GAN. There are a lot of problems that can occur
	\item \textbf{Vanishing gradients} during training:
	\begin{itemize}
		\item If the discriminator is too weak, the generator does not get valid/accurate feedback and can therefore not learn properly
		\item If the discriminator is perfect, the generator has very low gradients as a small change does not influence the discriminator
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_vanishing_gradients.jpeg}
			\caption{Vanishing gradients problem for training with KL-divergence. When the distance between the two distributions $p$ and $q$ (respectively $P_g$ and $P_r$) is too large, the KL divergence is very close to zero. Hence, it does not provide any strong gradients in these regions.}
		\end{figure}
	\end{itemize}
	\item \textbf{Reaching the equilibrium}
	\begin{itemize}
		\item We know that the Nash equilibrium of the minimax game is $P_g=P_r$, meaning the distribution of the generated data is equal to the distribution of the real data. In that case, $D$ returns 0.5 no matter what example we put in (as both distributions are equal).
		\item However, it has been shown that such cost functions may not converge when using gradient descent. An example is shown in Figure~\ref{fig:GAN_reaching_equilibrium}.
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_oscillating.png}
			\caption{Oscillating behavior of a non-cooperative game where $\min_x \max_y V(x,y) = x\cdot y$. The equilibrium $x=y=0$ is never reached.}
			\label{fig:GAN_reaching_equilibrium}
		\end{figure}
	\end{itemize}
	\item \textbf{Mode collapse}
	\begin{itemize}
		\item A GAN suffers from a mode collapse if the generator limits its predictions/generated distribution to a few samples/modes.
		\item For example in case of the MNIST dataset, this would mean that the generator only creates numbers of one or two different digits. Although a full mode collapse is rarely the case, partial mode collapses frequently occur
		\item For a mode collapse to occur, the gradients with respect to the noise $\bm{z}$ must be very low/close to zero. This can for example happen if we fix the discriminator and the generator converges to the optimal image $\bm{x}^*$ that fools the discriminator the most
		\item Once the generator collapses to one mode, the discriminator will learn that this mode is purely/mostly generated and thus changes its predictions. The generator will address that by changing the mode (note that as $\partial L/\partial \bm{z}\approx 0$, we will just collapse to the next mode and are not able to escape this loop).
		\item In the end, this turns into a cat-and-mouse game between the generator and discriminator, and will not converge (see Figure~\ref{fig:GAN_mode_collapse}).
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_learning_GAN_mode_collapse.png}
			\caption{\textit{Top row}: optimal convergence of generator distribution to 8 modes. \textit{Bottom row}: Sample of a mode collapse after 10k iterations. The generator is only able to generate a single mode.}
			\label{fig:GAN_mode_collapse}
		\end{figure}
	\end{itemize}
\end{itemize}
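The equilibrium value of $0.5$ mentioned above can be made explicit. The following is the standard argument for the minimax objective, written with $P_r$ for the data distribution $p_{\text{data}}$ and $P_g$ for the distribution of $G(\bm{z})$; it assumes a discriminator with unlimited capacity. For a fixed generator, the objective can be maximized pointwise in $\bm{x}$:
\begin{equation*}
	\begin{split}
		V(G,D) & = \int_{\bm{x}} P_r(\bm{x}) \log D(\bm{x}) + P_g(\bm{x}) \log\left(1 - D(\bm{x})\right) d\bm{x}\\
		\Rightarrow D^{*}(\bm{x}) & = \frac{P_r(\bm{x})}{P_r(\bm{x}) + P_g(\bm{x})}\\
		P_g = P_r \hspace{2mm} \Rightarrow \hspace{2mm} D^{*}(\bm{x}) & = \frac{1}{2}, \hspace{3mm} V(G,D^{*}) = \log\frac{1}{2} + \log\frac{1}{2} = -\log 4
	\end{split}
\end{equation*}
The oscillation in Figure~\ref{fig:GAN_reaching_equilibrium} can also be reproduced by hand. As an assumption for this sketch (the figure may use a different update rule), both players take simultaneous gradient steps on $V(x,y)=x\cdot y$ with learning rate $\eta$:
\begin{equation*}
	\begin{split}
		x_{t+1} & = x_t - \eta \cdot y_t, \hspace{3mm} y_{t+1} = y_t + \eta \cdot x_t\\
		\Rightarrow x_{t+1}^2 + y_{t+1}^2 & = \left(1 + \eta^2\right)\left(x_t^2 + y_t^2\right)
	\end{split}
\end{equation*}
The distance to the equilibrium $x=y=0$ grows in every step, so the iterates circle around it instead of converging.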

================================================
FILE: Computer_Vision_1/cv_deep_video.tex
================================================
\section{Deep Video}
\begin{itemize}
	\item Understanding a video requires analyzing both spatial and temporal information. Thus, more data is needed to fully train such a network, but labeling every single frame is too expensive
	\item Grid-like data can be processed by a CNN, temporal mostly by RNN, and for unstructured data a fully connected network is most suitable
	\item The easiest solution for video understanding would be to classify frames (a sample or all of them) independently with a standard CNN, and then perform average pooling over the predictions. However, this approach does not capture temporal structure
\end{itemize}
\subsection{Recurrent Neural Networks}
\begin{itemize}
	\item In Recurrent Neural Networks, a hidden state flows over time steps. The vanilla RNN formula is
	\begin{equation*}
		\begin{split}
			h_t & = \tanh \left(W_{hh}h_{t-1} + W_{xh} x_{t}\right)\\
			y_t & = W_{hy} h_t
		\end{split}
	\end{equation*}
	\item Weights are shared over time (also $W_{hh}$) so that an RNN can process an arbitrary sequence length. Also, it reduces the number of parameters and thus the chance of overfitting
	\item However, weight sharing can also lead to vanishing or exploding gradients: when we backpropagate from $h_t$ to $h_k$, the same factor $\theta$ is applied $t-k$ times, so that the gradients vanish if it is lower than one, and explode if it is greater than one:
	$$\frac{\partial h_t}{\partial h_k} = \theta^{(t-k)} \sum f(\cdot)$$
	\item Vanilla RNNs have trouble capturing long-term dependencies. A possible solution is using LSTMs that control the information flow by three gates (see Figure~\ref{fig:deep_video_LSTM}):
	\begin{equation*}
		\begin{split}
			\text{Forget gate:  } & f_t = \sigma\left(W_f \cdot \left[h_{t-1}, x_t\right] + b_f\right)\\[7pt]
			\text{Input gate:  } & i_t = \sigma\left(W_i \cdot \left[h_{t-1}, x_t\right] + b_i\right)\\
			& \tilde{c}_t = \tanh\left(W_c \cdot \left[h_{t-1}, x_t\right] + b_c\right)\\
			& c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\[7pt]
			\text{Output gate:  } & o_t = \sigma\left(W_o \cdot \left[h_{t-1}, x_t\right] + b_o\right)\\
			& h_t = o_t * \tanh\left(c_t\right)
		\end{split}
	\end{equation*}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_deep_video_LSTM.png}
		\caption{Visual representation of an LSTM chain.}
		\label{fig:deep_video_LSTM}
	\end{figure}
\end{itemize}
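To see how quickly the factor $\theta^{(t-k)}$ in the gradient of the vanilla RNN shrinks or grows, consider a gap of $t-k=50$ time steps (the values of $\theta$ are illustrative):
\begin{equation*}
	\begin{split}
		\theta = 0.9: & \hspace{2mm} 0.9^{50} \approx 5.2\cdot 10^{-3}\\
		\theta = 1.1: & \hspace{2mm} 1.1^{50} \approx 117
	\end{split}
\end{equation*}
Even a factor close to one therefore changes the gradient by orders of magnitude over long sequences. In the LSTM, the cell state is updated additively ($c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$), so the path from $c_t$ to $c_{t-1}$ is scaled by the forget gate rather than by a repeated weight factor.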
\subsection{3D convolutions}
\begin{itemize}
	\item We can extend standard convolutions to 3D by moving the filter over the time dimension as well (channels are now 4th dimension over which filter is still global)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/cv_deep_video_3D_convs.png}
		\caption{A 3D convolution is local over spatial and temporal dimensions, but still global over channels (i.e. RGB).}
		\label{fig:deep_video_3d_convs}
	\end{figure}
	\item Example: extending a 2D kernel by temporal dimension:
	\begin{equation*}
		\begin{split}
			200\times 200\times 3 \textcolor{blue}{\times 16} \xrightarrow{\text{filter }3\times3\textcolor{blue}{\times 3}} 200\times 200\times 256 \textcolor{blue}{\times 16}
			\Rightarrow \underbrace{3\times 3}_{\text{ spatial }}\underbrace{\textcolor{blue}{\times 3}}_{\text{ temporal }}\underbrace{\times 3}_{\text{ input channels }}\underbrace{\times 256}_{\text{ output channels}}\text{ parameters}
		\end{split}
	\end{equation*}
	\item Such convolutions learn combined temporal and spatial information. 
	\item Alternative is to concatenate all input frames over the channel dimension and pass it to a simple 2D network (also called \textit{early fusion}). Note that this approach loses the temporal information very fast
	\item Consecutive 3D convolutions can be seen as hierarchical combination of frames. Low level layers therefore capture low level motions, while high level layers (close to output) are able to reason about a longer set of frames and thus high level motion.
	\item Still, it is hard to learn long term dependencies with 3D convolutions as they do not have any gates and thus no explicit control over the information flow
	\item Note that in general, video-based networks are more likely to suffer from overfitting as the input space has a much higher dimensionality and the network has more parameters
\end{itemize}
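For the example above, the parameter counts of the different options can be compared directly (biases omitted). The early fusion number is an assumption of this sketch: all $16$ RGB frames are stacked into $3\cdot 16=48$ input channels of a single 2D convolution.
\begin{equation*}
	\begin{split}
		\text{2D convolution (single frame):} & \hspace{2mm} 3\times 3\times 3\times 256 = 6{,}912\\
		\text{3D convolution:} & \hspace{2mm} 3\times 3\times 3\times 3\times 256 = 20{,}736\\
		\text{Early fusion (2D, 48 channels):} & \hspace{2mm} 3\times 3\times 48\times 256 = 110{,}592
	\end{split}
\end{equation*}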
\subsection{State-of-the-art}
\subsubsection{Two Stream Network}
\begin{itemize}
	\item The earliest proposed network for action recognition was the \textbf{Two stream network}
	\item The architecture consists of two networks. One takes a single frame (\textit{spatial} stream net), and the other processes the concatenated optical flow over the set of frames (\textit{temporal} stream net). Both predictions are in the end combined
	\item The biggest problem here is that the spatial and temporal information is processed independently, and the very late fusion makes it impossible to reason about both
	\item Other disadvantages include a higher computational cost (two networks plus optical flow), only capturing short motion (early fusion of optical flow), noisy optical flow, and higher probability of overfitting due to number of parameters
	\item The approach can be slightly improved by repeatedly applying the network on small snippets of the video, and combining the predictions afterwards
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_deep_video_two_stream_network.png}
		\caption{Architecture of the two stream network.}
		\label{fig:deep_video_two_stream_net}
	\end{figure}
\end{itemize}
\subsubsection{I3D}
\begin{itemize}
	\item Inspired by the success of the 2D version (GoogLeNet), current state-of-the-art networks apply 3D inception modules (see Figure~\ref{fig:deep_video_I3D_module})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.35\textwidth]{figures/cv_deep_video_I3D.png}
		\caption{\textit{Left}: Standard Inception module of the I3D network. \textit{Right}: Inception module with 3D temporal separable convolutions.}
		\label{fig:deep_video_I3D_module}
	\end{figure}
	\item It is pretrained on ImageNet where the 2D filters are (after pretraining) inflated to a third dimension by repeating the values $N$ times over the time dimension, and rescaled by dividing by $N$
\end{itemize}
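A short sketch of why the rescaling by $N$ is needed (the index notation is introduced here for illustration): let $W^{2D}_{i,j}$ be a pretrained 2D filter and $W^{3D}_{t,i,j} = \frac{1}{N} W^{2D}_{i,j}$ for $t=1,...,N$ the inflated filter. On a static video where all $N$ frames equal the same image patch $x_{i,j}$, we get
\begin{equation*}
	\begin{split}
		\sum\limits_{t=1}^{N} \sum\limits_{i,j} W^{3D}_{t,i,j} \cdot x_{i,j} & = \sum\limits_{t=1}^{N} \frac{1}{N} \sum\limits_{i,j} W^{2D}_{i,j} \cdot x_{i,j} = \sum\limits_{i,j} W^{2D}_{i,j} \cdot x_{i,j}
	\end{split}
\end{equation*}
so the inflated filter initially gives the same response on a static video as the pretrained 2D filter gives on a single frame.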
\subsubsection{Efficient 3D convolutions}
\begin{itemize}
	\item The main drawback of I3D and all other 3D convolutional networks is the huge number of parameters. There are three ways to efficiently reduce the number of parameters
\end{itemize}
\begin{enumerate}
	\item \textbf{Pseudo 3D convolutions}
	\begin{itemize}
		\item The idea behind this operation is that the spatial and the temporal dimension do not correlate in every detail, but the temporal dimension is more important locally for the spatial dimension
		\item Thus, we split 3D convolution into a 2D spatial and a consecutive 1D temporal convolution. The concept is visualized in Figure~\ref{fig:deep_video_pseudo_3D_convs}
		\item The number of operations applied on input size $l_F \times w_F \times h_F \times c_F$ to output $l_G \times w_G \times h_G \times c_G$ is:
		\begin{equation*}
				\underbrace{k \times k \times 1 \times c_F \times c_I \times l_F \times w_G \times h_G}_{\text{Spatial 2D convolution}} + \underbrace{1\times 1\times k \times c_I \times c_G \times l_G \times w_G \times h_G}_{\text{Temporal 1D convolution}}
		\end{equation*}
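		\item Relative to a full 3D convolution with $k \times k \times k \times c_F \times c_G \times l_G \times w_G \times h_G$ operations, the cost of this operation is about $\frac{1}{k}\cdot \frac{c_I}{c_G} \cdot \frac{l_F}{l_G} + \frac{1}{k^2} \cdot \frac{c_I}{c_F}\approx \frac{1}{k}\cdot \frac{c_I}{c_G}$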
	\end{itemize}
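	The ratio in the last item follows from dividing the two terms of the operation count by the cost of a full 3D convolution, $k^3 \cdot c_F \cdot c_G \cdot l_G \cdot w_G \cdot h_G$. As a sketch with illustrative values ($k=3$, equal channel numbers $c_I=c_F=c_G$ and equal lengths $l_F=l_G$):
	\begin{equation*}
		\begin{split}
			\frac{k^2 \cdot c_F \cdot c_I \cdot l_F \cdot w_G \cdot h_G}{k^3 \cdot c_F \cdot c_G \cdot l_G \cdot w_G \cdot h_G} + \frac{k \cdot c_I \cdot c_G \cdot l_G \cdot w_G \cdot h_G}{k^3 \cdot c_F \cdot c_G \cdot l_G \cdot w_G \cdot h_G} & = \frac{1}{k}\cdot \frac{c_I}{c_G} \cdot \frac{l_F}{l_G} + \frac{1}{k^2} \cdot \frac{c_I}{c_F}\\
			& = \frac{1}{3} + \frac{1}{9} = \frac{4}{9} \approx 0.44
		\end{split}
	\end{equation*}
	In this setting the second term is not negligible, and the pseudo 3D convolution needs a bit less than half of the operations of the full 3D convolution.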
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.55\textwidth]{figures/cv_deep_video_pseudo_3D_conv.png}
		\caption{Pseudo 3D convolutions split the operation into a spatial part (2D) and a temporal (1D) convolution.}
		\label{fig:deep_video_pseudo_3D_convs}
	\end{figure}
	\item \textbf{Depth-wise separable convolutions}
	\begin{itemize}
		\item This operation is inspired by the MobileNet architecture and removes the property of convolutions being depth-wise global
		\item We consider every input channel independently, and apply a different filter on each of them. For example, if we have an RGB input, we would apply three filters, each processing a different input channel
		\item To still allow interaction/combination of multiple channels, we apply a local $1\times 1\times 1$ convolution afterwards. Hence, an output channel depends again on all input channels.
		\item The number of operations applied on input size $l_F \times w_F \times h_F \times c_F$ to output $l_G \times w_G \times h_G \times c_G$ is:
		\begin{equation*}
			\underbrace{k \times k \times k \times 1 \times c_F \times l_G \times w_G \times h_G}_{\text{Depth-wise 3D convolution}} + \underbrace{1\times 1\times 1 \times c_F \times c_G \times l_G \times w_G \times h_G}_{\text{Local }1\times 1\times 1\text{ convolution}}
		\end{equation*}
		\item The cost of this operation relative to a full 3D convolution is considerably smaller than for pseudo 3D, namely $\frac{1}{c_G} + \frac{1}{k^{3}} \approx \frac{1}{k^{3}}$ for large $c_G$, i.e. the speedup is considerably bigger
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_video_3D_depthwise_conv.png}
			\caption{Depth-wise 3D convolutions apply one filter per input channel, and combine the different channels afterwards. Same architecture is applied in MobileNet for the 2D case.}
			\label{fig:deep_video_depthwise_3D_convs}
		\end{figure}
	\end{itemize}
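	To get a feeling for the size of this ratio, a small worked example with assumed values $k=3$ and $c_G=256$ (chosen for illustration only):
	\begin{equation*}
		\frac{1}{c_G} + \frac{1}{k^3} = \frac{1}{256} + \frac{1}{27} \approx 0.0039 + 0.0370 \approx 0.041
	\end{equation*}
	In this setting, the depth-wise separable convolution would need roughly $24$ times fewer operations than the full 3D convolution, and the cost is dominated by the $\frac{1}{k^3}$ term, which belongs to the $1\times 1\times 1$ convolution.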
	\item \textbf{Partial 2D architecture} 
	\begin{itemize}
		\item Depending on the kind of motion we want to detect, it might not be necessary to apply 3D convolutions at every stage of the network. 
		\item For example, if we are only interested in high-level motions, we might want to use a \textit{Top-heavy I3D} which applies 3D convolutions only on the last layers. 
		\item Similarly, for short motions, we might want to consider a \textit{Bottom-heavy I3D}. 
		\item Figure~\ref{fig:deep_video_I3D_architectures} summarizes the different network architectures.
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_video_different_I3D_architectures.png}
			\caption{Different I3D network architectures.}
			\label{fig:deep_video_I3D_architectures}
		\end{figure}
	\end{itemize}
\end{enumerate}
\subsection{Self-supervised learning}
\begin{itemize}
	\item Learn to represent a video adequately in the network by using data and tasks for which the labels can be obtained for free. The great benefit is that we can use a lot of (unlabeled) data
	\item This is mostly done as a pre-training step in which the network learns to process and analyze videos on a huge dataset. There are various tasks we can perform self-supervised learning on:
	\begin{itemize}
		\item \textbf{Visual tracking}: Given a tracking system, we can train a network to predict whether two patches are similar or not. The labels are created by the tracking system: the label is 1 if two patches show the same object over time, and 0 otherwise (we sample a random other patch from the image and compare the scores).
		\item \textbf{Learning by shuffling}: The network is given a set of frames, and its task is to determine whether they are in the correct temporal order or not. The supervision signal is easily generated by labeling the real videos as positive, and shuffling their frame order to create negative examples. The goal is that the network learns to understand poses and motions over frames.
		\item \textbf{Learning by arrow of time}: The task of the network is to predict whether a video is played forwards or backwards (binary classification). This is a very challenging task as it requires the network to understand laws of physics (water only flows downwards, not upwards) by analyzing different motions in the video. Afterwards, one can cluster the cues that the network extracted and that led to a prediction of forward or backward (called \textit{arrow of time}). This approach gave the best self-supervised pre-training results so far, but is still not able to beat a supervised ImageNet pre-training.
	\end{itemize}
\end{itemize}

================================================
FILE: Computer_Vision_1/cv_imgformation.tex
================================================
\section{Image formation}
\label{sec:img_formation}
\begin{itemize}
	\item To fully understand/analyze an image, we first have to examine how it was created (note that an image is a 2D representation of a 3D world)
	\item Various challenges occur in CV 
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_formation_challenges_cv.png}
		\caption{Challenges in Computer Vision}
	\end{figure}
	\item The two main parts of how an image is formed are:
	\begin{itemize}
		\item \textit{Geometry} of the projection of a 3D environment to a 2D image. This defines which pixel belongs to which object (part/location). 
		\item \textit{Physics of light} which determines the brightness of a point in the image plane as a function of illumination and surface properties. Thus, the light source has a crucial influence on an object's appearance 
	\end{itemize}
\end{itemize}
\subsection{Projective Geometry and Camera models}
\begin{itemize}
	\item A camera can be abstracted by a pinhole model. A larger aperture/pinhole results in blurry images, a smaller one gives sharp but noisy images (less light energy passes through) $\Rightarrow$ Change between both by using different lenses
	\item We represent an image by a projection plane at distance $d$ from the center of projection. A 3D point is mapped to the intersection of the plane with the ray through the point and the center of projection (note that $z$ is negative):
	$$(x,y,z)\to (-\frac{d}{z}\cdot x, -\frac{d}{z}\cdot y, -d)$$ 
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}[b]{0.48\textwidth}
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_image_formation_3D_model.png}
			\caption{Projection plane}
		\end{subfigure}
		\begin{subfigure}[b]{0.48\textwidth}
			\includegraphics[width=0.8\textwidth]{figures/cv_image_formation_3D_model_2.png}
			\caption{Pinhole camera model}
		\end{subfigure}
		\caption{Abstract camera model in 3D coordinates}
	\end{figure}
	\item Model projection of 3D points to 2D image plane using homogeneous coordinates. The components we use for the projection are:
	\begin{itemize}
		\item \textit{Viewport projection}: Convert plane points to image coordinates (top left corner $(0,0)$, resolution scaling $s_x$, $s_y$)
		\item \textit{Perspective projection}: 3D points to image plane (homogeneous coordinates)
		\item \textit{View transformation}: rotation and translation matrix $\bm{R}$ and $\bm{T}$ for modeling the position and orientation of the camera. Can be seen as changing the coordinate system
		\item All together, we get the transformed points by:
		$$\left[\begin{array}{c}u\\v\\1\end{array}\right] = \underbrace{\left[\begin{array}{ccc}s_x & 0 & u_0\\0 & -s_y & v_0\\0 & 0 & 1\end{array}\right]}_{\text{Viewport}} \cdot \underbrace{\left[\begin{array}{cccc}1 & 0 & 0 & 0\\0 & 1 & 0 & 0\\0 & 0 & -1/d & 0\end{array}\right]}_{\text{Perspective}} \cdot \underbrace{\left[\begin{array}{cc}\bm{R} & \bm{T} \\\bm{0}^T_3 &  1\end{array}\right]}_{\text{View}} \cdot \left[\begin{array}{c}x\\y\\z\\1\end{array}\right]$$
	\end{itemize}
	\item Viewport and perspective projection depend on the camera (size and position of the image plane), so their parameters are called \textit{intrinsic} camera parameters. In contrast, the view transformation is determined by \textit{extrinsic} camera parameters as it defines the camera position in the (original) coordinate system
\end{itemize}
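To illustrate the projection chain above, here is a small worked example under assumed values (none of them are taken from the lecture): identity view transformation ($\bm{R}=\bm{I}$, $\bm{T}=\bm{0}$), $d=1$, $s_x=s_y=100$, $(u_0,v_0)=(320,240)$ and the point $(x,y,z)=(1,2,-4)$. The matrix product yields a homogeneous vector that still has to be divided by its last component $w=-z/d=4$:
$$\left[\begin{array}{c}s_x\cdot x + u_0\cdot w\\-s_y\cdot y + v_0\cdot w\\w\end{array}\right] = \left[\begin{array}{c}100 + 320\cdot 4\\-200 + 240\cdot 4\\4\end{array}\right] = \left[\begin{array}{c}1380\\760\\4\end{array}\right] \hspace{2mm}\Rightarrow\hspace{2mm} (u,v) = \left(\frac{1380}{4}, \frac{760}{4}\right) = (345, 190)$$
This agrees with first projecting the point to $(-\frac{d}{z}\cdot x, -\frac{d}{z}\cdot y) = (0.25, 0.5)$ on the image plane and then applying the viewport scaling and offset.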
\subsection{Light and Color models}
\label{sec:color_models}
\begin{itemize}
	\item The apparent color of an object is influenced by three components
	\begin{itemize}
		\item \textit{Light source}: spectral power distribution of light $e(\lambda)$ 
		\item \textit{Object}: the reflection distribution of an object $p(\lambda)$ (how well certain wavelengths are reflected)
		\item \textit{Sensor}: Detection by the sensor of the distribution $e(\lambda) p(\lambda)$
	\end{itemize}
	\item The goal is to be invariant to light source $e(\lambda)$ and sensor perspective
	\item Two very simple approaches to make an image independent of light source
	\begin{itemize}
		\item \textbf{Gray-world} assumption: the world is on average gray. So, we rescale every channel independently by $128/$mean of channel. Problematic if the image is biased towards a non-gray color (one dominant channel, etc.)
		\item \textbf{Scale-by-max}/\textbf{White-patch} assumption: there is always at least one white pixel in an image. Hence, the channels are rescaled by $255$/max of channel. Fails if there is actually no white pixel in the image (results in wrong maximum), or if the white pixel lies in the shadow $\Rightarrow$ assumes the whole image to be shaded.
		\item All these models are based on the von Kries model, in which we convert an unknown light source $u$ to a canonical one $c$ (e.g. daylight) by simple channel scaling:
		$$\left(\begin{array}{c}R^c\\G^c\\B^c\end{array}\right) = \left(\begin{array}{ccc}
		\alpha & 0 & 0 \\
		0 & \beta & 0\\
		0 & 0 & \gamma
		\end{array}\right) \cdot \left(\begin{array}{c}R^u\\G^u\\B^u\end{array}\right)$$
		Note that to simplify the calculation of $\alpha$, $\beta$ and $\gamma$, and to treat the channels $R$, $G$ and $B$ as independent (hence the diagonal matrix), we approximate each integral by a single wavelength, which is reasonable for narrow-band filters.
	\end{itemize}
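	As a small illustration of the diagonal scaling, assume (hypothetically) an image with a reddish cast whose channel means are $(\overline{R}, \overline{G}, \overline{B}) = (160, 128, 100)$. Under the gray-world assumption, the scaling factors would be:
	$$\alpha = \frac{128}{160} = 0.8, \hspace{2mm}\beta = \frac{128}{128} = 1, \hspace{2mm}\gamma = \frac{128}{100} = 1.28$$
	A pixel $(R^u, G^u, B^u) = (200, 100, 50)$ would then be mapped to $(R^c, G^c, B^c) = (160, 100, 64)$, i.e. red is damped and blue is boosted.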
	\item As a computer cannot handle continuous distributions, the spectrum is summarized by a few values, for example by the three integrals of the RGB model:
	$$R = \int_\lambda e(\lambda) p(\lambda) f_R(\lambda) d\lambda, \hspace{2mm}G = \int_\lambda e(\lambda) p(\lambda) f_G(\lambda) d\lambda, \hspace{2mm}B = \int_\lambda e(\lambda) p(\lambda) f_B(\lambda) d\lambda$$
	Every spectral color (see below diagram in Figure~\ref{fig:rgb_color_wavelength_distribution_RGB}) can be represented by a linear combination of RGB values.
	Note that the cone cells of the human eye have similar sensitivity functions, but humans are the most sensitive to green.
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}[b]{0.4\textwidth}
			\centering
			\includegraphics[width=0.75\textwidth]{figures/cv_image_formation_color_RGB_model.png}
			\caption{RGB model}
			\label{fig:rgb_color_wavelength_distribution_RGB}
		\end{subfigure}
		\begin{subfigure}[b]{0.24\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/cv_image_formation_color_XYZ_model.png}
			\caption{XYZ model}
			\label{fig:rgb_color_wavelength_distribution_XYZ}
		\end{subfigure}
		\hspace{5mm}
		\begin{subfigure}[b]{0.28\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/cv_image_formation_color_XYZ_diagram_2.png}
			\caption{XYZ diagram}
			\label{fig:rgb_color_wavelength_distribution_XYZ_diagram}
		\end{subfigure}
		\caption{Color matching functions $f_R$, $f_G$ and $f_B$ for the standard (a) RGB / (b) XYZ model. The colors represented by the XYZ system are shown in (c). Note that the line of purples contains colors that cannot be created by a monochromatic light source and need a combination of fully saturated red and violet (max and min of spectrum).}
		\label{fig:rgb_color_wavelength_distribution}
	\end{figure}
	\item The intensity of the RGB color space is calculated by the sum of the channels: $I=R+G+B$
	\item Another color space is the XYZ system. The color matching functions $\overline{x}(\lambda), \overline{y}(\lambda), \overline{z}(\lambda)$ are similar but not the same as RGB (see Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ}). The values are calculated by:
	$$X = \int_\lambda e(\lambda) p(\lambda) \overline{x}(\lambda) d\lambda, \hspace{2mm}Y = \int_\lambda e(\lambda) p(\lambda) \overline{y}(\lambda) d\lambda, \hspace{2mm}Z = \int_\lambda e(\lambda) p(\lambda) \overline{z}(\lambda) d\lambda$$
	\item We can split these measurements into a brightness/luminance component and a chromaticity/color component specified by $x$ and $y$. The luminance is given by $Y$ ($XYZ$ was designed for that), and the chromaticity is determined as (the third coordinate is implicitly given by $z=1-x-y$):
	$$x=\frac{X}{X+Y+Z},\hspace{2mm}y=\frac{Y}{X+Y+Z}$$
	\item The created colors can be visualized in an $xy$-diagram (see Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ_diagram}). 
	\item Given a reference light source $e$, we can determine the dominant wavelength (\textit{hue}) of a point $p$ by a line from $e$ through $p$ towards the boundary. The \textit{saturation} is given by the ratio of the distance between $e$ and $p$ to the distance between $e$ and the boundary point of the dominant wavelength. Combining these with the luminance $Y$, a point $p$ can be converted into the HSI color space (see Figure~\ref{fig:rgb_color_HSV_color_cone}).
	\item HSV can be seen as applying non-linear functions on the wavelength distribution (see Figure~\ref{fig:rgb_color_HSV_wavelength_dist}). Hue is defined as the dominant wavelength, saturation as the purity of the color (probably the relation between maximum and mean energy), and brightness/luminance as the average energy
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}[b]{0.4\textwidth}
			\centering
			\includegraphics[width=0.5\textwidth]{figures/cv_image_formation_color_HSV.png}
			\caption{HSV color cone}
			\label{fig:rgb_color_HSV_color_cone}
		\end{subfigure}
		\begin{subfigure}[b]{0.4\textwidth}
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_image_formation_color_HSV_wavelength_dist.png}
			\caption{HSV wavelength distribution}
			\label{fig:rgb_color_HSV_wavelength_dist}
		\end{subfigure}
		\caption{HSV color space. }
	\end{figure}
	\item The $xy$-diagram in Figure~\ref{fig:rgb_color_wavelength_distribution_XYZ_diagram} visualizes the gamut that is visible for an average person/human vision. Different color spaces/devices capture colors by defining three points and linearly interpolating between those. However, it can be seen that no such triangular gamut can include the whole human vision gamut.
\end{itemize}
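As a brief worked example of the chromaticity coordinates above, assume (purely for illustration) the tristimulus values $X=20$, $Y=30$ and $Z=50$:
$$x=\frac{20}{20+30+50}=0.2,\hspace{2mm}y=\frac{30}{20+30+50}=0.3,\hspace{2mm}z=1-x-y=0.5$$
Doubling all three values to $(40, 60, 100)$ leaves $x$ and $y$ unchanged, while the luminance $Y$ doubles. This is what separates the chromaticity from the brightness of a color.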
\subsection{Reflection models}
\begin{itemize}
	\item When a light source shines on an object, it might be perceived differently by different sensors/cameras although they have the same properties $\Rightarrow$ object appearance by reflectance
	\item The reflectance properties of an object/point can be specified by a \textit{BRDF}: Bi-directional reflectance distribution function $f(\theta_i, \phi_i; \theta_r, \phi_r)$ ($\theta_i$ and $\phi_i$ define the angles between input light and surface normal in $x$-$z$/$x$-$y$ direction respectively, $\theta_r$ and $\phi_r$ for the outgoing direction).
	\item A BRDF can be built from different components, as visualized in Figure~\ref{fig:reflection_models_brdf_reflection_components}. The main parts can be distinguished into \textit{body reflection} (also referred to as matte appearance), and \textit{surface reflection} (responsible for the glossy appearance)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.8\textwidth]{figures/cv_image_formation_reflectance_properties.png}
		\caption{Different components of reflectance. The black arrow visualizes the input ray, and the greyish/shaded arrows the output rays. The length of the output rays indicates their energy. }
		\label{fig:reflection_models_brdf_reflection_components}
	\end{figure}
	\item There are different models that approximate/assume/deal with certain forms of BRDFs
\end{itemize}
\subsubsection{Lambertian model}
\begin{itemize}
	\item The Lambertian reflectance model assumes a BRDF that is constant: $f(\theta_i, \phi_i; \theta_r, \phi_r) = \frac{\rho_d}{\pi}$ where $\rho_d$ is the albedo of the object, and the division by $\pi$ accounts for the energy being equally distributed over the hemisphere
	\item The reflected (output) radiance can be calculated by $L=\frac{\rho_d}{\pi}I\cos \theta_i=\frac{\rho_d}{\pi}I\cdot (\vec{n}\cdot \vec{s})$ where $I$ is the light source intensity, $\vec{n}$ the surface normal and $\vec{s}$ the input ray direction.
	\item Note that the factor $(\vec{n}\cdot \vec{s})$ defines the ratio of energy/photons that interact with that point/surface
	\item By assuming a Lambertian world, we can decompose an image into a shading part (surface normals) and the albedo (reflectance) of an object.
\end{itemize}
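A short numerical illustration of the Lambertian radiance, with assumed values $\rho_d=0.8$ and $I=1$ (chosen for this sketch only):
$$\theta_i = 0^{\circ}: \hspace{2mm} L = \frac{0.8}{\pi}\cdot 1 \cdot \cos 0^{\circ} \approx 0.255, \hspace{8mm} \theta_i = 60^{\circ}: \hspace{2mm} L = \frac{0.8}{\pi}\cdot 1 \cdot \cos 60^{\circ} \approx 0.127$$
Tilting the surface by $60^{\circ}$ away from the light source halves the radiance, and the result does not depend on the viewing direction $(\theta_r, \phi_r)$.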
\subsubsection{Phong model}
\begin{itemize}
	\item The Phong model extends the Lambertian model by taking glossy reflectance into account (note that mirror reflection is mostly approximated by glossy reflection, as a perfect mirror only reflects into a single output angle which is rarely met exactly). See Figure~\ref{fig:reflection_models_phong} for the components of the Phong model
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.7\textwidth]{figures/cv_image_formation_phong_model.png}
		\caption{The Phong model combines diffuse and glossy reflectance. Note that ambient gives the object a certain base brightness for approximating reflectance among objects/walls/...}
		\label{fig:reflection_models_phong}
	\end{figure}
	\item The reflection component of the specularity is calculated by $L_s=I\cdot \rho_s \left(\cos \phi\right)^{n_{shiny}}=I\cdot \rho_s \left(\vec{r}\cdot \vec{v}\right)^{n_{shiny}}$ where $\vec{r}$ is the output ray/mirror direction (calculated by $\vec{r}=2\cdot \vec{n} \cdot (\cos \theta) - \vec{s}$), $\vec{v}$ the view direction of the sensor, and $\phi$ the angle between $\vec{r}$ and $\vec{v}$.
	\item Large values for $n_{shiny}$ lead to narrow, small dot reflections (close to mirror) while small $n_{shiny}$ give broad, big surface reflectance. Note that the intensity is capped at a highest value (e.g. 1 or 255), so that multiple points can have the maximum intensity although they have a slightly different angle
	\item Also, Figure~\ref{fig:reflection_models_phong} shows that the body reflection (diffuse and ambient) contains the object color while the specularity depends on the light source (highlights take their color from the light source)
\end{itemize}
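To illustrate the influence of $n_{shiny}$, consider a viewer that is $\phi=20^{\circ}$ away from the mirror direction, with assumed values $I=1$ and $\rho_s=1$ (all numbers are illustrative):
$$n_{shiny}=5: \hspace{2mm} L_s = \left(\cos 20^{\circ}\right)^{5} \approx 0.73, \hspace{8mm} n_{shiny}=50: \hspace{2mm} L_s = \left(\cos 20^{\circ}\right)^{50} \approx 0.045$$
For the small exponent, the highlight is still clearly visible at this angle (broad highlight), whereas for the large exponent it has almost vanished (narrow highlight close to a mirror).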
\subsubsection{Dichromatic reflection models}
\begin{itemize}
	\item The previously discussed models only consider the light source intensity for the reflection. However, we can integrate the reflection in our color models:
	$$\text{body}_C = m_b (\vec{n}, \vec{s})  \int_\lambda e(\lambda) p(\lambda) f_C(\lambda) d\lambda$$
	$$\text{surface}_C = m_s (\vec{n}, \vec{s}, \vec{v})  \int_\lambda e(\lambda) c(\lambda) f_C(\lambda) d\lambda$$
	where $C$ is a specific channel (for example $R$, $G$ or $B$), $\vec{n}$ is the surface normal, $\vec{s}$ the input ray direction and $\vec{v}$ the viewpoint. 
	\item The function $m_b$ models the diffuse body reflection (i.e. $m_b(\vec{n}, \vec{s})=\cos \theta = \vec{n}\cdot \vec{s}$ as for Lambertian) whereas $m_s$ represents the glossy surface reflection (i.e. Phong model). 
	\item The diffuse reflectance depends on the albedo of the object $p(\lambda)$ whereas $c(\lambda)$ determines the specularity of the object for certain wavelengths.
	\item The perceived color of an object is the sum of the body and the surface reflection
	\item Our goal is to map an input image into a space which is independent of the scene (i.e. independent of $m_b$, $m_s$, ...). Different color models can help:
	\begin{itemize}
		\item \textbf{rgb}: Assuming a white light source, normalize RGB values by the intensity (i.e. $r=\frac{R}{R+G+B}$). This leads to photometric invariance for pure matte objects ($m_b$ cancels out as it is the same for all channels when assuming $m_s=0$). Note that this approach fails if an object has no color (i.e. all gray tones are mapped to the same value).
		\item \textbf{c1c2c3}: color space is obtained from RGB manipulation and is invariant to shadowing effects of light interaction particularly for matte objects. It has similar properties as rgb, but is determined by $c_1(R,G,B) = \arctan\frac{R}{\max\left\{G,B\right\}}$
		\item \textbf{HSV} can be invariant to specularity if we assume a white light source and thus white specularity. The dominant wavelength, i.e. the hue, stays the same for those points. However, note that this model is unstable for gray and especially white points that commonly occur at maximum specularity, as the hue is undefined.
		\item \textbf{l1l2l3}: Similar behavior as HSV, but calculates the values by $l_1(R,G,B) = \frac{(R-G)^2}{(R-G)^2 + (R-B)^2 + (G-B)^2}$
	\end{itemize}
	\item Figure~\ref{fig:cv_image_formation_invariance_color_spaces} summarizes some invariance properties of common color spaces
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_formation_invariance_color_spaces.png}
		\caption{Overview of invariance in color spaces.}
		\label{fig:cv_image_formation_invariance_color_spaces}
	\end{figure}
	\item Different color spaces have different instabilities. Normalized colors become unstable around black pixels (e.g. $R=1, G=0, B=0$ is considered pure red in rgb although it is practically black in RGB) whereas hue is unstable for low saturation (any hue gives the same color)
	\item Another method to be invariant to shadows is to filter out smooth intensity transitions from the image gradients, as shadow transitions are typically smooth whereas color transitions are harsh. The new image is recovered by summing up over the remaining gradients. Note that this method fails for sharp shadows and/or smooth color transitions
\end{itemize}
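The invariance of normalized rgb for matte objects can be written out as a short derivation. As a shorthand that is not part of the original notation, let $k_C = \int_\lambda e(\lambda) p(\lambda) f_C(\lambda) d\lambda$ denote the body integral of channel $C$. With $m_s=0$, every channel is $C = m_b(\vec{n}, \vec{s})\cdot k_C$, so that:
$$r = \frac{R}{R+G+B} = \frac{m_b(\vec{n}, \vec{s})\cdot k_R}{m_b(\vec{n}, \vec{s})\cdot (k_R + k_G + k_B)} = \frac{k_R}{k_R + k_G + k_B}$$
The geometry term $m_b$ cancels, so $r$ (and likewise $g$ and $b$) no longer depends on the surface normal or the direction of the light. As soon as $m_s \neq 0$, the surface term adds a contribution that does not share this common factor, and the cancellation no longer holds.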

================================================
FILE: Computer_Vision_1/cv_imgprocessing.tex
================================================
\section{Image processing}
\begin{itemize}
	\item Apply various algorithms on image to analyze/improve the data
	\item The simplest image transformations are those that are independent of the spatial position (thus also called point processing), where the new image is calculated by $g=a\cdot t(f)+ b$. Examples: gamma correction (e.g. $\log x$ to boost small, dark values more than high ones), histogram equalization
\end{itemize}
\subsection{Neighborhood processing}
\begin{itemize}
	\item The most common way to process an image is by applying filters on it. A filter is a linear weighted sum of local input values. 
	\item A convolution of image $I$ and a linear filter $h$ is calculated by $$I_{out} = I \ast h, \hspace{1mm} I_{out}(i,j) = \sum\limits_{k,l} I(i-k, j-l) \cdot h(k,l)$$
	\item Depending on the size of the filter, we might not be able to apply the filter on the pixels at the border. Thus, we pad the image so that the output has the same shape as the input. Common padding methods are zero/black, mirror/copy edge or wrap around.
	\item There are a lot of different filters that can be applied on an image. Filters can for example also be used for translation if wanted/needed. 1D example: $\left[\begin{array}{ccc}
	0 & 0 & 1
	\end{array}\right]$
	\item In general, we distinguish between \textit{low}-pass filters (smoothing) and \textit{high}-pass filters (edge detection, sharpening). The frequency is thereby the change of pixel values, and the passed wavelengths describe to what the filters react the most. Note that there are also \textit{band}-pass filters (low-pass filter convolved with high-pass filter)
	\item For example, unicolor images stay mostly unchanged when they are processed by a low-pass filter. In contrast, applying a high-pass filter on such images leads to very low activations.
\end{itemize}
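To make the translation filter concrete, here is a small worked example under the assumption that the 1D filter $\left[\begin{array}{ccc}0 & 0 & 1\end{array}\right]$ is centered, i.e. $h(-1)=0$, $h(0)=0$, $h(1)=1$, and that the signal is zero-padded (both are choices made for this illustration):
$$I_{out}(i) = \sum\limits_{k=-1}^{1} I(i-k) \cdot h(k) = I(i-1), \hspace{5mm} I = \left[\begin{array}{cccc}1 & 2 & 3 & 4\end{array}\right] \hspace{1mm}\Rightarrow\hspace{1mm} I_{out} = \left[\begin{array}{cccc}0 & 1 & 2 & 3\end{array}\right]$$
The signal is shifted by one position to the right. With correlation instead of convolution (no flipping of the filter), the shift would go in the opposite direction.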
\subsubsection{Smoothing filters}
\begin{itemize}
	\item \textit{Box filter}: replace every pixel by the average of its neighborhood. 
	$$h = \frac{1}{9}\left[\begin{array}{ccc}
	1 & 1 & 1\\
	1 & 1 & 1\\
	1 & 1 & 1\\
	\end{array}\right]$$
	Repeatedly convolving a box filter with itself results in a filter whose shape approaches a Gaussian
	\item \textit{Gaussian filter}: weight contributions of neighboring pixels by distance: $G_\sigma = \frac{1}{2\pi \sigma^2} e^{-\frac{(x^2 +y^2)}{2\sigma^2 }}$. A $3\times 3$ Gaussian with $\sigma=0.5$ has the following values:
	$$h= \left[\begin{array}{ccc}
	0.011 & 0.084 & 0.011\\
	0.084 & 0.619 & 0.084\\
	0.011 & 0.084 & 0.011\\
	\end{array}\right]$$
	Note that convolving a Gaussian with another Gaussian is again a Gaussian. Furthermore, as $e^{-\frac{x^2 +y^2}{2\sigma^2}} = e^{-\frac{x^2}{2\sigma^2}}\cdot e^{-\frac{y^2}{2\sigma^2}}$, we can separate a 2D Gaussian into two 1D filters which are sequentially applied on the image $\Rightarrow$ reduce computational effort per pixel from $n^2$ to $2n$ for a filter of size $n\times n$.
	\item \textit{Sharpening filter}: reverses the process of smoothing by accentuating differences with the local average
	$$h = \left[\begin{array}{ccc}
	0 & 0 & 0\\
	0 & 2 & 0\\
	0 & 0 & 0\\
	\end{array}\right]-\frac{1}{9}\left[\begin{array}{ccc}
	1 & 1 & 1\\
	1 & 1 & 1\\
	1 & 1 & 1\\
	\end{array}\right]$$
	\item \textit{Median filter}: A non-linear filter that selects the median value in the kernel window. The advantage of this filter is that it is robust against outliers (good for filtering out salt-and-pepper noise)
\end{itemize}
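The separability of the Gaussian can be checked on the $3\times 3$ example with $\sigma=0.5$ from above. As a sketch, assume the 1D filter $g$ is sampled at $x\in\{-1,0,1\}$ and normalized to sum to one (the normalization is an assumption, but it reproduces the listed values):
$$g = \frac{1}{1+2e^{-2}}\left[\begin{array}{ccc}e^{-2} & 1 & e^{-2}\end{array}\right] \approx \left[\begin{array}{ccc}0.107 & 0.787 & 0.107\end{array}\right], \hspace{5mm} g^T g \approx \left[\begin{array}{ccc}
0.011 & 0.084 & 0.011\\
0.084 & 0.619 & 0.084\\
0.011 & 0.084 & 0.011\\
\end{array}\right]$$
Applying $g$ along the rows and then along the columns therefore gives the same result as the 2D filter, with $2\cdot 3=6$ instead of $3^2=9$ multiplications per pixel.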
\subsubsection{Edge detection filters}
\begin{itemize}
	\item \textit{Simple gradient filter}: The simplest gradient/edge detector is in 1D: $h = \left[\begin{array}{cc}-1 & 1\end{array}\right]$ 
	\item \textit{Sobel filter}: a derivative filter that also takes nearby pixels into account for better approximation. $h_x$ detects vertical edges (gradients over $x$-direction) and $h_y$ detects horizontal edges.
	$$h_x = \left[\begin{array}{ccc}
	1 & 0 & -1\\
	2 & 0 & -2\\
	1 & 0 & -1\\
	\end{array}\right] \text{\hspace{5mm}and\hspace{5mm}}h_y = \left[\begin{array}{ccc}
	1 & 2 & 1\\
	0 & 0 & 0\\
	-1 & -2 & -1\\\end{array}\right] $$ 
	\item \textit{Derivative of a Gaussian}: the derivative of a Gaussian is highly suitable for edge detection as it represents a band-pass filter (a Gaussian filter convolved with a discrete gradient filter, although the derivative is mostly calculated analytically in the continuous domain). Similar to Sobel, but weights the nearby pixels slightly differently. Note that we also have different filters for $x$ and $y$ direction.
	\item \textit{Laplacian of Gaussian} (LoG): The Laplacian operator $\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$ applied on a Gaussian. It is invariant to the direction of the gradient (circularly symmetric). The shape of the function is also often described as a Mexican hat (see Figure~\ref{fig:cv_image_processing_gaussian_filters}). It is highly responsive to blobs (blob detection) but is sensitive to the scale. To be invariant to scale, we can apply multiple LoG filters with different values of $\sigma$ and stack the results together.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/cv_image_processing_gaussian_filters.png}
		\caption{Visualization of different Gaussian filters.}
		\label{fig:cv_image_processing_gaussian_filters}
	\end{figure}
\end{itemize}
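For reference, the closed form of the Laplacian of Gaussian follows directly from the Gaussian $G_\sigma = \frac{1}{2\pi \sigma^2} e^{-\frac{(x^2 +y^2)}{2\sigma^2 }}$ (a standard derivation, written out here as a sketch):
$$\frac{\partial^2 G_\sigma}{\partial x^2} = \left(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\right) G_\sigma \hspace{3mm}\Rightarrow\hspace{3mm} \nabla^2 G_\sigma = \frac{x^2 + y^2 - 2\sigma^2}{\sigma^4}\cdot G_\sigma$$
The filter changes its sign at the radius $\sqrt{x^2+y^2}=\sqrt{2}\sigma$. The size of the structures it responds to is thus tied directly to $\sigma$, which illustrates why several values of $\sigma$ are needed to cover different scales.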
\subsection{Harris corner detector}
\begin{itemize}
	\item Detect interest points in an image to perform matching or similar tasks. Corners are suitable to serve as interest points as they have a unique 2D position compared to edges and points
	\item The initial idea is derived from performing autocorrelation on a small window of the image, and testing which windows are unique/expressive. We examine how much the image content of the window changes for small shifts in $x$ and $y$ direction. Based on that information, we can decide whether a pixel represents a corner or not.
	\item Steps in the Harris corner detector
	\begin{enumerate}
		\item Compute the derivatives $I_x$ and $I_y$ of the image
		\item Compute the products of the derivatives at every pixel: $I_x^2$, $I_y^2$, $I_{xy}=I_{x}\cdot I_{y}$ 
		\item Compute the sums of the products over the window and arrange them in the Harris matrix:
		$$H = \left[\begin{array}{cc}
		\sum_W I_x^2 & \sum_W I_x \cdot I_y\\
		\sum_W I_x \cdot I_y & \sum_W I_y^2 \\
		\end{array}\right]$$
		Note that the sum represents the application of a box filter. It is equally possible to apply Gaussian filters etc. 
		\item Determine the response of the detector at each pixel:
		$$R = \det(H) - k\cdot \left(\text{trace}(H)\right)^2$$
		\item If $|R|$ is small, the region is probably flat. Otherwise, if $R<0$ (with a magnitude above a certain threshold) we have an edge, and $R>0$ indicates a corner.
		\item Perform non-maximum suppression if corner detector is calculated pixel-wise.
	\end{enumerate}
	\item Determining the \textit{cornerness} of a point is based on the eigenvalues of the matrix $H$: $R=\lambda_1 \lambda_2 - k\cdot (\lambda_1 + \lambda_2)^2$. The maximum eigenvalue is the gradient of the direction with the fastest change, and the minimum eigenvalue the gradient of the direction with the smallest change. Note that this models an ellipse for the gradients (see Figure~\ref{fig:cv_image_processing_harris_ellipse})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.25\textwidth]{figures/cv_image_processing_harris_ellipse.png}
		\caption{Visualization of relation between eigenvalues and gradients.}
		\label{fig:cv_image_processing_harris_ellipse}
	\end{figure}
	% cv_image_processing_harris_ellipse.png
	\item If we have an edge, one eigenvalue is considerably greater than the other as in one direction we have a large gradient, whereas in the other (90 degrees) the pixels stay the same. Here, $R$ is smaller than 0 as $\lambda_1\lambda_2$ is small but $\lambda_1 + \lambda_2$ is large.
	\item Thus, we only have a corner if in both directions we have a (equally) high change. In that case, $R$ is positive as $\lambda_1\lambda_2$ is large.
	\item Other properties of the Harris Corner detector
	\begin{itemize}
		\item Partial invariance to \textit{affine intensity} change. As only derivatives are used, a bias term $I+b$ does not influence result. When multiplying an image by a factor $I\cdot a$, we scale the eigenvalues and thus the cornerness as well. We therefore might only have to adapt the threshold.
		\item \textit{Rotation invariant} as only the ellipse rotates but the eigenvalues stay the same
		\item \textit{Scaling sensitive}: The Harris corner detector is sensitive to scale as it usually applies LoG/Derivatives of Gaussians for determining $I_x$ and $I_y$. To make the corner detector invariant to scale, we can apply multiple gradient filters with different values for $\sigma$ and stack them together (3D output instead of 2D). We then perform the detector on various scales, and take in the end the maximum response over scales for every pixel.
	\end{itemize}
	\item Applications
	\begin{itemize}
		\item \textit{Image stitching}, e.g. for combining separate photos into a panorama. For this, we detect interest points in all images, and try to match those (description by e.g. SIFT/histogram/...)
		\item \textit{Object recognition} by comparing local features that were found for a specific object with the ones from another image.
	\end{itemize}
\end{itemize}
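The three cases of the response $R=\lambda_1 \lambda_2 - k\cdot (\lambda_1 + \lambda_2)^2$ can be illustrated with a few hypothetical eigenvalue pairs. The value $k=0.04$ is an assumption here (values around $0.04$ to $0.06$ are commonly used), and the eigenvalues are made up for illustration:
$$\begin{array}{llll}
\text{Corner:} & \lambda_1=10,\ \lambda_2=10 & \Rightarrow & R = 100 - 0.04\cdot 400 = 84\\
\text{Edge:} & \lambda_1=10,\ \lambda_2=0.1 & \Rightarrow & R = 1 - 0.04\cdot 102.01 \approx -3.08\\
\text{Flat:} & \lambda_1=0.1,\ \lambda_2=0.1 & \Rightarrow & R = 0.01 - 0.04\cdot 0.04 \approx 0.008\\
\end{array}$$
The corner gives a clearly positive response, the edge a negative one, and the flat region a response close to zero, in line with the decision rule in the steps above.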

================================================
FILE: Computer_Vision_1/cv_intro.tex
================================================
\section{Introduction}
\subsection{Challenges in Computer Vision}
\begin{itemize}
	\item 
\end{itemize}

================================================
FILE: Computer_Vision_1/cv_object_rec.tex
================================================
\section{Object recognition}
\begin{itemize}
	\item Challenges in object recognition
	\begin{itemize}
		\item Huge dimensionality (large input size)
		\item Image formation process (see Section~\ref{sec:img_formation})
		\item Images are stationary signals and share features, but these features have to be distinguished from noise
	\end{itemize}
	\item Hard to define explicit rules, but easy to collect examples $\Rightarrow$ Machine learning
\end{itemize}
\subsection{Image representations}
\begin{itemize}
	\item Need to find an image representation that is able to capture the semantics of an image and hence makes it easy to recognize objects
	\item For normal pixel values, the Euclidean distance does not reflect the similarity of images well. A change of illumination or translation has a huge impact on the metric although it is the same object
	\item Global histograms over whole image are scale and translation invariant, but are not really distinctive (different images have same histogram)
	\item The best way is to find \textit{local features} that images share. They are more descriptive and reoccur in different images. In the next step, we have to describe these features to get a final representation. 
	\item One way to describe them is to use SIFT (Scale-invariant feature transform), which creates a local histogram of gradients in the neighborhood (see Figure~\ref{fig:SIFT})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/cv_object_detection_SIFT.png}
		\caption{SIFT descriptors for a $2\times 2$ histogram patch (normally $4\times 4$ see Figure~\ref{fig:descriptor_SIFT}).}
		\label{fig:SIFT}
	\end{figure}
\end{itemize}
\subsubsection{Histogram of Gradients (HoG)}
\begin{itemize}
	\item A HoG descriptor abstracts a patch by a histogram of gradient orientations.
	\item The steps for calculating a HoG descriptor for a given patch are
	\begin{enumerate}
		\item Determine pixel-wise gradients $I_x$ and $I_y$ by e.g. applying a Sobel filter (or rather simple $[1,-1]$ derivative filter)
		\item Determine the orientation $\theta = \arctan \frac{I_y}{I_x}$ and magnitude $I=\sqrt{I_y^2 + I_x^2}$ of the pixel-wise gradients
		\item Report gradients as a histogram. For example, if we take a 9 bin histogram, we map every gradient to the closest value of $0^{\circ}$, $45^{\circ}$, $90^{\circ}$ etc. Note that the 9th bin is for zero gradients which have no orientation.
	\end{enumerate}
	\item An improvement over simply counting the number of gradients is to weight them by their magnitude as well, or to use soft counting (e.g. $30^{\circ}$ contributes to both the $0^{\circ}$ and the $45^{\circ}$ bin).
	\item A disadvantage of HoG is that it is not rotation invariant
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/cv_image_processing_HoG.jpg}
		\caption{The HoG descriptor takes the gradients in a patch and groups them into a histogram of orientations.}
		\label{fig:descriptor_HoG}
	\end{figure}
\end{itemize}
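\begin{itemize}
	\item A small worked example of the steps above, assuming linear interpolation between neighboring bins and magnitude weighting (both are illustrative choices, other voting schemes are possible). For a pixel with $I_x=\sqrt{3}$ and $I_y=1$ we get
	$$\theta = \arctan\frac{1}{\sqrt{3}} = 30^{\circ}, \hspace{5mm} I = \sqrt{1^2+\left(\sqrt{3}\right)^2} = 2$$
	With bin centers at $0^{\circ}$ and $45^{\circ}$, the orientation lies $30/45=2/3$ of the way towards $45^{\circ}$. The vote of size $I=2$ is therefore split into
	$$\left(1-\tfrac{2}{3}\right)\cdot 2 = \tfrac{2}{3} \text{ for the } 0^{\circ} \text{ bin}, \hspace{5mm} \tfrac{2}{3}\cdot 2 = \tfrac{4}{3} \text{ for the } 45^{\circ} \text{ bin}$$
	Hard counting without magnitudes would instead add $1$ to the closest bin ($45^{\circ}$) only.
\end{itemize}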
\subsubsection{Scale Invariant Feature Transform (SIFT)}
\begin{itemize}
	\item SIFT is a combination of detector and descriptor which is (mostly) both rotation and scale invariant
	\item The first step of SIFT is getting a scale-invariant response map. This is done by extracting features by LoG (or rather DoG due to runtime) on various scales (see Figure~\ref{fig:SIFT_scale_invariance})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_processing_SIFT_scales.png}
		\caption{SIFT}
		\label{fig:SIFT_scale_invariance}
	\end{figure}
	\item We now look for local maxima in terms of both scale and location. This means that we search for points that are higher than all neighboring pixels in $x$-$y$ direction and scale (see Figure~\ref{fig:SIFT_scale_invariance} green points on the right) $\Rightarrow$ non-maximum suppression
	\item Given these points, we check for their \textit{cornerness}. Only at those points do we need to calculate the gradients and estimate the eigenvalues of $\bm{H}$. A point is kept if
	$$\frac{\text{Tr}(\bm{H})^2}{\text{Det}(\bm{H})} < \frac{(r+1)^2}{r}$$
	The term $(r+1)^2/r$ is just a new threshold where $r$ specifies the maximally allowed ratio between the larger and the smaller eigenvalue.
	\item To guarantee rotation invariance, we look for the dominant gradient orientation in the patch. This is done by creating a weighted histogram of gradient orientations in the whole patch (weighted by the magnitudes of these gradients), and taking the orientation with the highest value as the orientation of the patch. If the patch has other orientations with a value of at least 80\% of the dominant orientation, we create another descriptor for each of those as well.
	\item Once a point is selected as a key-point, we can group all gradients in small regions in a histogram and combine them into a $4\times 4$ grid of histograms. Note that we adjust all gradients according to the orientation of the key-point. Our final descriptor has then 128 features ($4\times 4$ histograms with each $8$ bins, see Figure~\ref{fig:descriptor_SIFT})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_image_processing_SIFT.jpg}
		\caption{A SIFT descriptor with $4\times 4$ histogram patch.}
		\label{fig:descriptor_SIFT}
	\end{figure}
\end{itemize}
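\begin{itemize}
	\item The cornerness threshold above can be made explicit with a short derivation. As a sketch, assume $\bm{H}$ has eigenvalues $\lambda_1 \geq \lambda_2 > 0$ and write $\lambda_1 = r\lambda_2$ (the symbols $\lambda_1, \lambda_2$ are introduced here for illustration only). Using $\text{Tr}(\bm{H}) = \lambda_1+\lambda_2$ and $\text{Det}(\bm{H}) = \lambda_1\lambda_2$, we get
	$$\frac{\text{Tr}(\bm{H})^2}{\text{Det}(\bm{H})} = \frac{(r\lambda_2+\lambda_2)^2}{r\lambda_2^2} = \frac{(r+1)^2}{r}$$
	The expression depends only on the ratio $r$ and grows with $r$ for $r \geq 1$, so the test can be evaluated from trace and determinant without computing the eigenvalues explicitly. For an illustrative choice of $r=10$, the threshold is $11^2/10 = 12.1$.
\end{itemize}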
\subsection{Bag-of-Words}
\begin{itemize}
	\item One approach for image representation is the visual Bag-of-Words (BoW). For this, we split an image into patches, describe each of these patches by one ``visual word'' (patch in our dictionary), and finally create a histogram out of it (see Figure~\ref{fig:BoW_pipeline})
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/cv_object_detection_BoW_pipeline.png}
		\caption{General pipeline for BoW approach.}
		\label{fig:BoW_pipeline}
	\end{figure}
	\item There are four components of BoW for which a design choice has to be made
	\begin{itemize}
		\item \textbf{Patch sampling}: which patches should be used to describe an image. We can either use descriptive patches/interest points, but then the number of patches can differ significantly from image to image. Alternatively, we can select patches on a regular grid (\textit{dense sampling}) on multiple scales (reduce the size of the image and sample again).
		\item \textbf{Patch description}: describe the patches/visual words by SIFT, RGB, HoG or similar. This choice goes along with the image representation
		\item \textbf{Visual dictionary}: create a dictionary by sampling a lot of patches from a large set of images (training images), and cluster them in their descriptor space to find distinctive patches. Use these clusters as visual words. There are different clustering methods that can be applied. However, one hyper-parameter is usually the number of clusters. A high number of clusters gives very distinctive but noise-sensitive visual words, whereas a low number of clusters gives general but less distinctive ones.
		\item \textbf{Histogram creation}: the simplest approach is finding the nearest prototype/visual word for every sampled patch of the image by e.g. the L2 distance on the descriptor, and recording the number of occurrences for each visual word. There are many (more advanced) alternatives that, for example, take the distance to the cluster means into account, or additionally compute statistics such as mean and standard deviation.
	\end{itemize} 
	\item Advantages and drawbacks of visual BoW
	\begin{itemize}
		\item[+] Translation invariant
		\item[+] Fixed length feature vector 
		\item[$-$] Loss of spatial information
		\item[$-$] Quantization loses information (mapping to visual words)
	\end{itemize}
	\item In order to keep some spatial information, we can extend the histogram by using multiple scales (spatial pyramid) and concatenate the resulting histograms into one output feature vector. Another approach would be to use the spatial information ($xy$-position) as additional features for the patch descriptor, and use them during matching/clustering.
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/cv_object_detection_BoW_spatial_pyramid.png}
		\caption{Spatial pyramid for histogram creation. We concatenate all histograms into a longer feature vector.}
		\label{fig:BoW_spatial_pyramid}
	\end{figure}
\end{itemize}
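\begin{itemize}
	\item The simplest histogram creation (hard assignment) can be written down compactly. As a sketch, let $\bm{d}_1,...,\bm{d}_N$ be the descriptors of the $N$ sampled patches of one image and $\bm{c}_1,...,\bm{c}_K$ the $K$ visual words (this notation is introduced here for illustration only). The $k$-th histogram entry is then
	$$h_k = \sum_{i=1}^{N} 1\left[k = \arg\min_{j} \left\lVert \bm{d}_i - \bm{c}_j \right\rVert_2\right]$$
	Dividing by $N$ gives a normalized histogram, which makes images with a different number of patches comparable. One possible soft alternative that takes the distance to the cluster means into account replaces the indicator by weights proportional to $\exp\left(-\left\lVert \bm{d}_i - \bm{c}_k \right\rVert_2^2 / 2\sigma^2\right)$, where $\sigma$ is an additional hyper-parameter.
\end{itemize}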
\subsubsection{Bag of Words for Retrieval}
\begin{itemize}
	\item We can compare images for the retrieval task by their BoW histograms. This is more efficient than matching every interest point of the query against the interest points of every other image.
	\item Offline, we have to create the BoW vocabulary and determine a histogram for every image in our database
	\item When an image is entered as a query, we need to represent it by its BoW histogram and then compare it with the histogram of every image in the database.
	\item We can apply other techniques from IR as well like TF-IDF, query expansion, stop word removal, inverted file index,...
	\item To guarantee a good performance for the first retrieved examples, we can rerank the top $k$ by using geometrical verification (detect interest points and try to match those)
\end{itemize}
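\begin{itemize}
	\item As a sketch of how TF-IDF carries over from text retrieval to visual words (the symbols below are assumptions for illustration): let $n_{k}$ be the number of occurrences of visual word $k$ in an image, $n$ the total number of visual words in that image, $M$ the number of images in the database, and $M_k$ the number of database images that contain word $k$. The weighted histogram entry is
	$$t_k = \frac{n_{k}}{n} \cdot \log \frac{M}{M_k}$$
	Visual words that occur in almost every image (e.g. uniform background patches) get a weight close to zero, whereas rare words are emphasized. Query and database vectors can then be compared by e.g. their cosine similarity.
\end{itemize}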
\subsection{Object detection}
\begin{itemize}
	\item Localization of objects in an image. Often approximated by bounding boxes that should be predicted around the object.
	\item A simple sliding window approach is too expensive as it generates 1) a lot of boxes over 2) a lot of scales with 3) different box ratios/shapes and 4) many classes.
	\item Hence, the first challenge is to find a set of relevant boxes that likely contain an object (also called \textit{candidate boxes}, each graded by an objectness score), and in a second step determine the class of the object in these candidate boxes
	\item One approach for that is \textbf{selective search} which is based on the property of images being hierarchical
	\begin{itemize}
		\item Segment image into small fragments based on simple approaches. Generate for all of these a candidate box
		\item For multiple iterations (recursively), merge the two most similar fragments and consider a box for the combined fragment as well. Repeat until only one region is left
		\item Apply a classifier on those candidate boxes
	\end{itemize}
	\item A general pipeline for object detection is shown in Figure~\ref{fig:cv_object_detection_BB_pipeline}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/cv_object_detection_BB_pipeline.png}
		\caption{Pipeline for object detection with bounding boxes.}
		\label{fig:cv_object_detection_BB_pipeline}
	\end{figure}
	% cv_object_detection_BB_pipeline.png
\end{itemize}

================================================
FILE: Computer_Vision_1/cv_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage[makeroom]{cancel}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}
\definecolor{colkeyword}{rgb}{0,0.4,0}
\definecolor{colname}{rgb}{0.4,0.4,0}
\definecolor{coltype}{rgb}{0.4,0,0.4}
\definecolor{coloperators}{rgb}{0,0,1.0}
\definecolor{colscopes}{rgb}{0.4,0,0}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Computer Vision 1}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

% \input{cv_intro.tex}
\input{cv_imgformation.tex}
\input{cv_imgprocessing.tex}
\input{cv_object_rec.tex}
\input{cv_deep_learning.tex}
\input{cv_deep_video.tex}
\input{cv_applications.tex}
\appendix
\newpage
\input{cv_appendix.tex}
\end{document}

================================================
FILE: Deep_Learning/cheat_sheet/main.tex
================================================
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MatPlotLib and Random Cheat Sheet
%
% Edited by Michelle Cristina de Sousa Baltazar
%
% http://matplotlib.org/api/pyplot_summary.html
% http://matplotlib.org/users/pyplot_tutorial.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\documentclass[a4paper]{article}
\usepackage[landscape]{geometry}
\usepackage{url}
\usepackage{multicol}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{tikz}
\usetikzlibrary{decorations.pathmorphing}
\usepackage{amsmath,amssymb}

\usepackage{colortbl}
\usepackage{xcolor}
\usepackage{mathtools}
\usepackage{amsthm, amsmath, amssymb, amsfonts}
\usepackage{enumitem}

\title{Deep Learning cheat sheet}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{bm}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\loss}[0]{\mathcal{L}}
\newcommand{\chain}[3]{\frac{\partial #1}{\partial #2}\frac{\partial #2}{\partial #3}}
\newcommand{\eq}[1]{\begin{equation*}\begin{split}#1\end{split}\end{equation*}}
\newcommand{\coderef}[0]{Please find the implementation in the folder with the code files.}
\newcommand{\TODO}[1]{\textbf{\textcolor{red}{#1}}}

\definecolor{green}{RGB}{0,160,0}
\definecolor{blue}{RGB}{0,0,160}
\definecolor{red}{RGB}{160,0,0}
\definecolor{orange}{RGB}{200,160,0}
\definecolor{purple}{RGB}{170,0,200}
\definecolor{cyan}{RGB}{0,200,200}
\definecolor{lightred}{RGB}{200,50,50}

\advance\topmargin-0.9in
\advance\textheight3in
\advance\textwidth3in
\advance\oddsidemargin-1.5in
\advance\evensidemargin-1.5in
\parindent0pt
\parskip2pt
\newcommand{\hr}{\centerline{\rule{3.5in}{1pt}}}
%\colorbox[HTML]{e4e4e4}{\makebox[\textwidth-2\fboxsep][l]{texto}
\begin{document}
\footnotesize
\begin{multicols*}{3}

\tikzstyle{mybox} = [draw=black, fill=white, very thick,
    rectangle, rounded corners, inner sep=10pt, inner ysep=10pt]
\tikzstyle{fancytitle} =[fill=black, text=white, font=\bfseries]
%------------ CONTEÚDO CAIXA RANDOM ---------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
	\underline{Definition}: A family of \textcolor{green}{parametric}, \textcolor{lightred}{non-linear} and \textcolor{blue}{hierarchical} \textcolor{orange}{representation learning functions}, which are \textcolor{red}{massively optimized with stochastic gradient descent} to \textcolor{purple}{encode domain knowledge}, i.e. domain invariances, stationarity.\\
	\vspace{-3mm}
	\begin{itemize}[leftmargin=4mm]
		\setlength\itemsep{0.0em}
		\item Neural Network is a directed acyclic graph		
		% \item Every module can be expressed by $a=h(x;w)$
		\item Use loss function that matches output distribution to improve numerical stability and make gradients larger
		\item Input and output distribution of every module should be the same to prevent inconsistent behavior and harder learning
	\end{itemize}
	\underline{Backprop}: chain rule $\pd{z}{x_i}=\sum_j \chain{z}{y_j}{x_i}$, $\nabla_{\bm{x}} \bm{z} = \left(\pd{\bm{y}}{\bm{x}}\right)^T \cdot \nabla_{\bm{y}} \bm{z}$
	\vspace{-1mm}
	\begin{enumerate}[leftmargin=4mm]
	\setlength\itemsep{0.2em}
	\item Compute forward: $a^{(l)} = h^{(l)}\left(x^{(l)}\right)$, $x^{(l+1)}=a^{(l)}$
	\item Compute reverse: $\pd{\loss}{a^{(l)}} = \left(\pd{a^{(l+1)}}{x^{(l+1)}}\right)^T \cdot \pd{\loss}{a^{(l+1)}}$\\$\pd{\loss}{\theta^{(l)}} = \pd{a^{(l)}}{\theta^{(l)}} \cdot \left(\pd{\loss}{a^{(l)}}\right)^T$
	\item Update params: $\theta^{(l)}_{t+1} = \theta^{(l)}_{t}-\eta \nabla_{\theta_t^{(l)}}\loss$
	\end{enumerate}
		
%	\begin{center}\small{\begin{tabular}{lp{4.5cm} l}
%		\textit{random():} & obtém o próximo número aleatório no intervalo [0.0, 1.0] \\ \hline
%		\textit{random(começo,fim):} & obter o próximo número aleatório no intervalo [começo, fim] \\ \hline
%		\textit{random(stop):} & obtém o próximo número aleatório no intervalo [0, fim]
%	\end{tabular}}\end{center}
    \end{minipage}
};
\node[fancytitle, right=10pt] at (box.north west) {Modular Learning};
\end{tikzpicture}


%------------ CONTEÚDO CAIXA MatPlotLib ---------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
		\underline{Pure optimization} very direct goal to optimize (e.g. scheduling). ML wants to optimize test error that is intractable/only indirectly optimizable. Reduce different cost function on training set, optimum might be not optimal for test set (overfitting).\\[4pt]
		\underline{Gradient descent}: dataset mostly too large, slow, not better optimum/faster convergence. \underline{SGD}: standard error $\sigma/\sqrt{m}$, noisy gradients act as regularizer, dynamically changing data possible. \\[4pt]
		\underline{Ill conditioning}: if 2nd order change is greater than 1st ($\frac{1}{2}\epsilon^2 g^THg>\epsilon g^Tg$), loss increases. Later training, reduce lr\\
		\underline{Pathological curvatures}: ravine region in loss surface, high gradients in suboptimal direction, oscillations, slow convergence\\[4pt]
		\underline{Hessian}: requires large batch to be accurate, hard to compute\\
		\underline{Momentum}: maintain momentum from previous updates to dampen oscillations: $u_{t+1}=\gamma u_t - \eta_t g_t$, $w_{t+1}=w_t+u_{t+1}$. Exponential averaging $\Rightarrow$ more robust gradients, faster\\
		\underline{Nesterov momentum}: take future gradients, better in theory.\\[2pt]
		\underline{RMSprop}: adaptive lr, exp. averaging over norms, assuming directions of sensitivity axis aligned. $r_t = \alpha \cdot r_{t-1} + \left(1 - \alpha\right) \cdot g_t^2$, $\eta_t = \frac{\eta}{\sqrt{r_t} + \epsilon}$, $w_{t+1} = w_{t} - \eta_t \cdot g_t$\\[3pt]
		\underline{AdaGrad}: adaptive lr, \textit{sums} norm, thus based on scale and frequency, bad for nonconvex. $r_t = r_{t-1} + \text{diag}(g^2_t)$\\ [3pt]
		\underline{Adam}: Combine adaptive lr and momentum (applied on unscaled gradients). Bias correction to account init at origin. \\[4pt]
		\underline{Bayesian optimization}: gradient-free, educated trial and error guesser, determine next point on uncertainty and expectation\\
		\underline{Normalization}: center data around 0, same variance, allows higher learning rate and better learning. \textit{BatchNorm}: ensure Gaussian distribution of features over batches. $\hat{y}_i = \gamma \cdot \hat{x}_i + \beta$\\$\mu_B = \frac{1}{m} \sum\limits_{i=1}^{m} x_i$, $\sigma_B^2 = \frac{1}{m} \sum\limits_{i=1}^{m} \left(x_i - \mu_B\right)^2$, $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$\\
		Reduce effect of 2nd order between layers, acts as regularizer by introducing noise, let network control mean and variance.\\
		During testing, take moving average of last training steps
    \end{minipage}
};
%------------ CAIXA PRELIMINARES ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Learning Optimization (1)};
\end{tikzpicture}
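
\underline{Adam, written out} (a sketch in the common notation; the decay rates $\beta_1,\beta_2$ and the moment names $m_t, v_t$ are assumptions, they do not appear in the box above):\\
$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$\\
$\hat{m}_t = m_t/(1-\beta_1^t)$, $\hat{v}_t = v_t/(1-\beta_2^t)$, $w_{t+1} = w_t - \eta \cdot \hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$\\
Bias correction at $t=1$ with $m_0=0$: $m_1=(1-\beta_1)g_1$, hence $\hat{m}_1 = g_1$ instead of an estimate shrunk towards the origin.
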
%------------ CONTEUDO EXEMPLO BASICO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
    	\underline{Regularization}: objective during training to reduce test error\\
    	\textit{$\ell_2$}: introduce objective $\frac{\lambda}{2}\sum_l ||w_l||^2$, weight decay for SGD\\
    	\textit{$\ell_1$}: sparse weights with $\lambda\sum_l ||w_l||$\\
    	\textit{Others}: Dropout, Early stopping, Augmentation, Multitask\\[4pt]
    	\underline{Weight initialization}: small weights to keep data at origin, large to have strong gradients, preserve variance of activations ($w\sim \mathcal{N}(0,\sqrt{1/d})$), no learning if all same, prevent dead ReLU
    \end{minipage}
};
%------------ EXEMPLO BASICO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Learning Optimization (2)};
\end{tikzpicture}
%------------ CONTEUDO DOIS EIXOS ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
		Images stationary signals with spatial structure and huge dimensionality. Dimensions highly correlated (translation inv)\\[4pt]
		\underline{Transfer Learning}: use large datasets to learn useful features, prevent overfitting, fine-tune less layers if datasets similar, use lower lr for pre-trained layers as close to optimum\\[4pt]
		 \underline{Architectures}: small filter for less params and higher non-linearity (even $n\times1$/$1\times n$), different scales on same input (stack of convs prone to overfitting), vanishing gradients by intermediate classifiers or residual connections (learn difference instead of mapping) $H(x)=x+F(x)$, possibly with gates\\[4pt]
		 \underline{Tracking}: \textit{Fast R-CNN} based on middle feature map, extract BB (selective search, NN for \textit{Faster R-CNN}). RoI pooling to get fixed-size output. \textit{Siamese}: train on similarity of BB patches.
    \end{minipage}
};
%------------ DOIS EIXOS BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {(Modern) Convolutional Neural Networks};
\end{tikzpicture}
%------------ CONTEÚDO COMANDOS DE TEXTO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
    \begin{minipage}{0.3\textwidth}
    \underline{Backprop through time}: gradients of weights on memory $W$: $\pd{\loss}{W} = \sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\left(\prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}\right)\frac{\partial^{+}c_k}{\partial W}$\\
    Formulating RNN as $c_t = W \cdot \sigma(c_{t-1}) + U \cdot x_{t-1}$ leads to:\\ $\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert \leq \left\lVert W^T\right\rVert \cdot \left\lVert \text{diag}\left(\pd{\sigma\left(c_t\right)}{c_t}\right)\right\rVert$. If norm of non-linearity bounded by $\gamma$, and $\left\lVert W^T\right\rVert < 1/\gamma$, then vanishing gradients. If $\left\lVert W^T\right\rVert \gg 1/\gamma$ and non-linearity not zero, then exploding gradients. Quick fix for second: clip gradient norm\\[4pt]
    \underline{LSTM}: Prevent vanishing gradient by gated skip connections over time. Forget, output, and input+candidate gate\\[4pt]
    \underline{GNN}: \textit{Deep Walk}: latent repr. by random walks, skip gram on sequences, not dynamic. \textit{GraphSage}: aggregate information from neighbors, can be mean/max pool with weights, LSTM.
    \textit{GCN}: $h(H^{(l)}, A) = \sigma\left(D^{-1/2}\hat{A}D^{-1/2} H^{(l)}W^{(l)}\right)$
    \end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Recurrent Neural Networks};
\end{tikzpicture}
%------------ CONTEÚDO COMANDOS DE TEXTO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Generative modeling}: learn joint probability $p(x,y)$ or density function $p(x)$. Task can be performed by Bayes: $p(y|x)$. Generalizes better, better modeling of causal relations, out-of-distribution detection $p(y|x)p(x)$ with $p(x)$ low. \textit{Discriminative modeling}: learn pdf $p(y|x)$, task-oriented and mostly better\\[4pt]
	\underline{Applications}: RL simulator, creating missing data (pixel patches), super-res., data augm., cross-modal transl. (sketch to img)\\[4pt]
	\underline{Types}: \textit{Explicit density}: maximize log likelihood of data by modeling pdf. Must be complex enough and computationally tractable. \textit{Implicit density}: no explicit pdf, only sampling mechanism
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Generative Models (1)};
\end{tikzpicture}
%------------ CONTEÚDO COMANDOS DE TEXTO ---------------------
\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{GAN}: implicit model, adversarial training. Mini-max game:\\ $\min_G \max_D V(G,D) = \mathbb{E}_{\bm{x}\sim p_{r}(\bm{x})} \left[\log \left(D\left(\bm{x}\right)\right)\right] + \mathbb{E}_{\bm{z}\sim p_{z}(\bm{z})} \left[\log\left(1 - D\left(G\left(\bm{z}\right)\right)\right)\right] $. Better loss for generator: $-\log D(G(z))$. Otherwise vanishing gradients if D too strong.\\[2pt]
	\textit{Problems}: reaching equilibrium (oscillation around Nash), mode collapse if $\partial \loss / \partial z\approx 0$, low dimensional support (JS assumes overlap of distributions).\\[2pt]
	\textit{Improvements}: WGAN using Earth-Mover's distance (also good for non-overlapping), usage of labels $y$ like in conditional GANs, label smoothing for overconfident D, Virtual BatchNorm with reference batch to reduce intra-batch interference\\[4pt]
	\underline{Boltzmann machines}: Pdf based on energy function we learn: $p(x)=1/Z \exp(-E(x))$ where $Z=\sum_{x'} \exp(-E(x'))$. $Z$ complex, $2^{n}$ pos. for binary data. Restrain to pairwise relations: $E(x)=-x^TWx-b^Tx$. \textit{Restricted BM}: reduce $W$ by introducing $h$ latents: $E(x,h)=-x^TWh-b^Tx-c^Th$, $p(x)=1/Z\sum_{h'} \exp(-E(x,h'))$, higher-order relations. Can reformulate to $p(h_j|x,\theta)=\sigma(W_{:,j}^Tx+c_j)$, $p(x_i|h,\theta)=\sigma(W_{i,:}h+b_i)$. Maximize log likelihood by contrastive divergence. Sample $h_0\sim p(h|x)$, $x_1\sim p(x|h_0)$, and so on.\\[4pt]
	\underline{VAE}: Model $p(x,z)=p(x|z)p(z)$. Goal is to maximize $p(x)=\int p(x,z)dz$ which is intractable. Use ELBO instead:\\
	$\log p(x) \geq \mathbb{E}_{q_{\varphi}(z|x)}\left[\log p_{\theta}(x|z)\right] - \text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right)$\\
	Difference is $ - \text{KL}\left(q_{\varphi}(z|x)||p(z|x)\right)$. \textit{Reparameterization trick}: sample from external dist., and transform it to own. For Gaussian: $z=\mu_q + \epsilon \cdot \sigma_q$. Backprop through model params and lower variance than REINFORCE.\\
	\textit{Improvements}: $q(z|x)$ with NF on top, ELBO is extended by NF term. Optimize prior $p_{\lambda}(z)=\frac{1}{K}\sum_k q_{\varphi}\left(z|u_k\right)$, $u_k$ trained\\[4pt]
	\underline{NF}: Model $p(x)$ directly with series of invertible transformations shifting probability mass. Math expression of NF:\\
	 $x = z_k = f_k \circ f_{k-1} \circ ... \circ f_1 (z_0) \to z_i = f_i(z_{i-1})$\\
	 $p(z_i) = p(z_{i-1}) \cdot \left|\det \pd{f_{i}^{-1}}{z_i}\right| \implies p(x) = p(z_0) \cdot \prod_{i=1}^{K} \left|\det \pd{f_{i}^{-1}}{z_i}\right|$\\
	 $\log p(x) = \log p(z_0) - \sum_{i=1}^{K} \log \left|\det \pd{f_{i}}{z_{i-1}}\right|$\\
	 $f$ must be invertible and has simple $\det$ Jacobian (triangular)
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Generative Models (2)};
\end{tikzpicture}
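
\underline{NF sanity check} (a minimal one-dimensional sketch; the affine flow and its parameters $a,b$ are assumptions for illustration): take a single transformation $x = f(z_0) = a\cdot z_0 + b$ with $a\neq 0$. Then $\pd{f}{z_0}=a$ and $z_0 = (x-b)/a$, so\\
$\log p(x) = \log p(z_0) - \log |a|$\\
For $z_0\sim\mathcal{N}(0,1)$ this is exactly the log density of $\mathcal{N}(b, a^2)$: scaling by $|a|>1$ spreads the probability mass and lowers the density accordingly.
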

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	Hold dist. per latent variable instead of single val. \textit{Benefits}: ensemble modeling (better acc), uncertainty estimates, prevent overconfidence, model compression (prior towards 0)\\[4pt]
	\underline{Epistemic uncertainty}: dataset limits, unseen data, important for safety-critical and small datasets. Posterior $p(w|x,y)$ intractable. \textit{MC dropout}: apply DP during test (Bernoulli-dist over weights). Var approx. uncertainty. Any NN can be made Bayesian with that, but expensive and not accurate. Can also be motivated from Gaussian Processes. Over-param. models better uncert. estm.\\[4pt]
	\underline{Aleatoric uncertainty}: data uncertainty due to noise (e.g. bad sensor). \textit{Data-dependent/heteroscedastic}: specific raw inputs hard to interpret, predict uncert. per data point: $\loss = \frac{||y_i - \hat{y}_i||^2}{2\sigma_i^2} + \log \sigma_i$. \textit{Task-dependent/homoscedastic}: introduced by task (e.g. depth estimation), Sol: train on multiple tasks. $\loss = \frac{||y_i - \hat{y}_i||^2}{2\sigma^2} + \log \sigma$
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Bayesian Deep Learning (1)};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Bayes by Backprop}: approx. true posterior $p(w|\mathcal{D})$ by $q(w|\theta)$: $\loss = \log q(w_s|\theta) - \log p(w_s) - \log p(\mathcal{D}|w_s) \hspace{2mm}\text{ where }\hspace{2mm} w_s\sim q(w_s|\theta)$\\
	Example: assume Gaussian variational posterior with softplus $w=\mu + \epsilon\cdot \log\left(1+\exp\rho\right)$, then learn $\mu$ and $\rho$ by SGD.
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Bayesian Deep Learning (2)};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Autoregressive Models}: generative without latent variables, assuming order in data, conditional probs $p(x) = \prod_k p(x_k|x_{<k})$. Not necessarily parameter sharing, $p(x)$ tractable, but slow\\[4pt]
	\underline{NADE}: model output with single layer, $\mathcal{O}(D\times H)$ params\\$p(x_d=1|x_{<d})=\sigma\left(V_{d,:}\cdot h_d+b_d\right)$, $h_d=\sigma\left(W_{:,<d}\cdot x_{<d} + c\right)$\\
	\underline{MADE}: Autoencoder with carefully masked connections. $y_d$ only depends on $x_{<d}$. Connections can be shared with future $d$\\[4pt]
	\underline{PixelRNN}: row-wise pixel and sequential color generation\\
	$p(x_i|x_{<i}) = p(x_{i,R}|x_{<i})\cdot p(x_{i,G}|x_{i,R}, x_{<i})\cdot p(x_{i,B}|x_{i,R}, x_{i,G}, x_{<i})$\\
	\textit{Row-LSTM}: next output depends on three hidden states above\\
	\textit{Diagonal-BiLSTM}: use all pixels before (all prev rows and left)\\
	\underline{PixelCNN}: masked convs to only see top and left. Causes blind spot. Use separated vertical and horizontal stack\\[4pt]
	\underline{PixelVAE}: Standard VAE with PixelCNN as decoder
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Sequential Models};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	Value function $q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_{t+1}+\gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t, a_t\right]$\\
	Bellman equation $q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_{t+1}+\gamma q^{\pi}(s_{t+1}, a_{t+1})| s_t, a_t\right]$\\
	Optimal policy with $q^{*}(s_t,a_t) = \mathbb{E}\left[r_{t+1} + \gamma \max_{a_{t+1}} q^{*}(s_{t+1}, a_{t+1}) | s_t, a_t\right]$\\[4pt]
	\underline{Value-based}: learn $q^{*}$ to get $\pi^{*}$. Q-Learning (off-policy):\\
	$\mathcal{L} = \mathbb{E}\left[\left(r + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}, \theta) -q(s_{t}, a_{t}, \theta) \right)^2\right]$\\
	For gradient calculation, bootstrapped val is fixed. \\
	\underline{Stability problems}: bootstrap, target and policy always changing, oscillations; seq. data break iid assump.; scale of $q$ values hard to control, unstable gradients; \\
	\underline{Solutions}: experience replay (store samples $\langle s, a, r, s'\rangle$ in dataset, sample from that, makes batch iid), freezing target network every $K$ iters to avoid oscillations, clip rewards, skip frames, control exploration vs. exploitation by annealing $\epsilon$-greedy policy\\[4pt]
	\underline{Policy-based}: learn $\pi^{*}$ directly, avoid problems with $q$ vals (especially hard for continuous action space). Training steps:\\
	\vspace{-3mm}
	\begin{enumerate}[leftmargin=4mm]
	\setlength\itemsep{0.0em}
	\item Determine $q$ by simulation: $q^{\pi_w}(s_t, a_t) = \mathbb{E}\left[r_t + \gamma r_{t+1}... | \pi_{w}\right]$
	\item Maximize $q$ by $\pd{\mathcal{L}}{w} = \mathbb{E}\left[\chain{q^{\pi}(s,a)}{a}{w}\right]$ (deterministic)\\
	or $\pd{\mathcal{L}}{w} = \mathbb{E}\left[\pd{\log \pi^{w}(a|s)}{w} q^{\pi}(s,a)\right]$ (stochastic)
	\end{enumerate}
	\textit{Asynchronous Advantage Actor-Critic}: Learn both policy and value function at same time, run multiple agents simultaneously (more diverse samples), advantage estimates: use learned value function to compare actually gained $q$ value. If loss is higher, unexpected (good) things happened $\Rightarrow$ exploration\\[4pt]
	\underline{Model-based}: try to model environment and be aware of rules. E.g. AlphaGo with tree-search guided by CNNs. Two policy networks playing against each other, and a third network to predict $V(s_t)=\sum_{a'} \pi(a'|s_t) \cdot q^{\pi}(s_t, a')$.
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Deep Reinforcement Learning};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Forward KL}: $D_{KL}(p||q)$, overestimate variance\\
	\underline{Backward KL}: $D_{KL}(q||p)$, underestimate variance\\
	$D_{KL}(q||p) = \int q(x) \log \frac{q(x)}{p(x)}dx \Rightarrow$ if $p(x)=0$, then $q(x)=0$\\
	$D_{KL}(p||q) = \int p(x) \log \frac{p(x)}{q(x)}dx \Rightarrow$ if $p(x)>0$, then $q(x)>0$\\[2pt]
	\underline{Jensen-Shannon}: $D_{JS}(p||q) = \frac{1}{2}D_{KL}(p||M)+\frac{1}{2}D_{KL}(q||M)$\\ $M = \frac{p+q}{2}\Rightarrow D_{JS}(p||q)=D_{JS}(q||p)$\\[4pt]
	$ a = Wx+b$, $\pd{a_i}{W_{jk}} = 1(i=j)\cdot x_k$, $\pd{a}{b} = \bm{I}$, $\pd{a}{x} = W$\\
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Math to know};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Compare non-linear activation functions}
	% \vspace{-3mm}
	\begin{description}[leftmargin=4mm]
	\setlength\itemsep{0.0em}
	\item[ReLU] Strong gradient for $x>0$, non saturating \textit{Drawbacks}: dead neurons
	% \item Every module can be expressed by $a=h(x;w)$
	\item[Sigmoid] probability distribution output \textit{Drawbacks}: small gradients $\leq 1/4$, saturating, shifts distribution (not zero-centered)
	\item[Tanh] zero-centered in origin \textit{Drawbacks}: saturating, only strong gradients around 0
	\end{description}
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Old Exams};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\underline{Differences between generative and discriminative models}\\
	1. Generative models are used to estimate the joint probability $p(x,y)$ or the density function $p(x)$. Discriminative models are used, instead, to model the conditional $p(y|x)$.\\
	2. Generative models are often intractable because the integral in $p(x)=\int p(x|z) p(z) dz$ cannot always be computed analytically.\\
	3. Discriminative models tend to yield better accuracies given a task, meaning they are optimized for the particular task, at the cost of potential overfitting.\\[5pt]
	\underline{Advantages/Disadvantages of generative models}\\
	\textbf{GAN}: \textit{Benefits}: very good, realistic results, fast to sample from, no need to train on likelihood, very flexible to extension \textit{Drawbacks}: no quantitative evaluation, hard to train (sensitive to hyperparameters, mode collapse, etc.), no real objective in terms of likelihood (and distribution is unknown)\\
	\textbf{VAE}: \textit{Benefits}: Usable for data compression, distribution known (calculate likelihood function), stable training (no mode collapse) \textit{Drawbacks}: only approx. likelihood (ELBO), tends to give blurry instead of realistic images, need flexible enough encoder and prior\\
	\textbf{NF}: \textit{Benefits}: directly optimize $p(x)$, one-to-one mapping between $z$ and $x$ (knows exact embedding of any image in latent space) \textit{Drawbacks}: high number of parameters, complexity restrained by requirement of reversible $f$\\[5pt]
	\underline{Difference RNN/Autoregressive}\\
	\textbf{RNN}: shares weights over steps, applicable to any sequence length, compresses all previous inputs into single hidden state/memory, not necessarily generative\\
	\textbf{Autoregressive}: does not necessarily share weights, fixed in sequence length, are generative
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Additional questions};
\end{tikzpicture}

\begin{tikzpicture}
\node [mybox] (box){%
	\begin{minipage}{0.3\textwidth}
	\begin{minipage}{0.45\textwidth}
	\includegraphics[width=\textwidth]{figures/NN_Zoo.png}
	\end{minipage}
	\begin{minipage}{0.45\textwidth}
	\includegraphics[width=\textwidth]{figures/optimization_pathological_curvatures.png}
	\includegraphics[width=\textwidth]{figures/RNN_LSTM.png}
	\end{minipage}
	
	\begin{minipage}{\textwidth}
	\includegraphics[width=\textwidth]{figures/NF_concept.png}
	\end{minipage}
	
	\begin{minipage}{\textwidth}
	\centering
	\includegraphics[width=0.7\textwidth]{figures/Autoregressive_PixelRNN.pdf}
	\end{minipage}
	
	\begin{minipage}{\textwidth}
	\centering
	\includegraphics[width=0.7\textwidth]{figures/GAN_generative_models_overview_2.png}
	\end{minipage}
	\end{minipage}
};
%------------ COMANDOS DE TEXTO BOX ---------------------
\node[fancytitle, right=10pt] at (box.north west) {Figures};
\end{tikzpicture}

\end{multicols*}
\end{document}

================================================
FILE: Deep_Learning/dl_appendix.tex
================================================
% \section{Neural Network Zoo}

\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.9\textwidth]{figures/NN_Zoo_High.png}
\end{figure}

================================================
FILE: Deep_Learning/dl_autoregressive.tex
================================================
\section{Deep Sequential Models}
\subsection{Autoregressive Models}
\begin{itemize}
	\item Generative models without latent variables, but assuming an order in the data (if there is none, impose an artificial order, e.g. an image from left to right, top to bottom). The likelihood is the product of conditionals:
	$$p(x)=\prod_{k=1}^{D} p(x_k|x_{j<k})$$
	\item In contrast to RNNs, there is no/not necessarily parameter sharing, and the chain cannot be of infinite length because of that
	\item \textit{Advantages}: $p(x)$ is tractable
	\item \textit{Drawbacks}: training and generation are slow due to being sequential and not parallel
\end{itemize}
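To make the factorization concrete, consider a small illustrative case with $D=3$ (the dimensionality is an assumption chosen only for this example). The chain rule gives
$$p(x_1,x_2,x_3) = p(x_1)\cdot p(x_2|x_1)\cdot p(x_3|x_1,x_2) \hspace{2mm}\Rightarrow\hspace{2mm} \log p(x) = \log p(x_1) + \log p(x_2|x_1) + \log p(x_3|x_1,x_2)$$
Each factor is a normalized distribution over a single variable, which is why $p(x)$ stays tractable. Sampling, however, has to evaluate the factors one after another, since $x_3$ can only be drawn once $x_1$ and $x_2$ are known.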
\subsubsection{NADE}
\begin{itemize}
	\item Originally defined for binary inputs/data. Can be generalized for other spaces as well
	\item Every output $x_d$ is modeled by a single layer that takes as input all previous dimensions $x_{<d}$, and generates its prediction based on them:
	\begin{equation*}
		\begin{split}
			p(x_d=1|x_{<d}) & = \sigma\left(V_{d,:}\cdot h_d + b_d\right), h_d = \sigma\left(W_{:,<d}\cdot x_{<d} + c\right)
		\end{split}
	\end{equation*}
	where $V\in \mathbb{R}^{D\times H}, W\in \mathbb{R}^{H\times D}, b\in \mathbb{R}^{D}, c\in \mathbb{R}^{H}$ ($H$ hidden dimensionality, $D$ input dimensions)
	\item Objective is minimizing the negative log likelihood: $\mathcal{L} = - \log p(x) = - \sum_{k=1}^{D} \log p(x_k|x_{<k})$
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.6\textwidth]{figures/Autoregressive_NADE.pdf}
		\caption{Concept of NADE.}
	\end{figure}
	\item \textit{Teacher forcing}: During training, use ground truth as input for all levels. For testing, use generated samples as input (sequentially)
\end{itemize}
\subsubsection{MADE}
\begin{itemize}
	\item Use an autoencoder where we carefully mask out connections so that the output $y_d$ only depends on inputs $x_{<d}$
	\item The name ``autoencoder'' is only used because we try to reproduce the input. However, note that we neither have a bottleneck nor do we try to get sparsity. We just remove connections so that each output depends only on certain inputs
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/Autoregressive_MADE.png}
		\caption{Masked autoencoder for autoregressive models. We set certain weights to 0 (i.e. remove connections between neurons) so that the generation of $x_1$ only depends on $x_2$ and $x_3$, but not on $x_1$ itself (which would be cheating and prevent the model from being generative).}
	\end{figure}
\end{itemize}
\subsubsection{PixelRNN}
\begin{itemize}
	\item Assume row-wise pixel and sequential color generation (first red channel, then green, afterwards blue):
	$$p(x_i|x_{<i}) = p(x_{i,R}|x_{<i})\cdot p(x_{i,G}|x_{i,R}, x_{<i})\cdot p(x_{i,B}|x_{i,R}, x_{i,G}, x_{<i})$$
	\item Different ways of modeling it. LSTM variants mostly have 12 layers
	\begin{itemize}
		\item \textit{Row LSTM}: to compute the next output (i.e. next hidden state), we take the three hidden states in the row above a certain pixel as ``last hidden state''. We therefore get a triangular shape of context. However, it thereby misses context from the row itself, and context further away. As it does not use pixels in the same row, the computation can be parallelized for a row.
		\item \textit{Diagonal Bi-LSTM}: Uses all pixels that were generated before by using a Bi-LSTM. 
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/Autoregressive_PixelRNN.pdf}
		\caption{Comparing different methods of PixelRNN and PixelCNN. The lower level is the previous layer, and the top is the next layer. If we have a single layer PixelRNN/CNN, the lower one would be the input and the upper the generated output.}
		\label{fig:Autoregressive_PixelRNN}
	\end{figure}
	\item The architecture includes residual connections to speed up training
	\item \textit{Benefits}: good modeling of $p(x)$, reasonable image quality
	\item \textit{Disadvantages}: slow training and slow generation
\end{itemize}
\subsubsection{PixelCNN}
\begin{itemize}
	\item Replace recurrence by convolutions to speed up (at least) training
	\item Convolutions are masked so that only context from before (i.e. left and top) can be used. See Figure~\ref{fig:Autoregressive_PixelRNN} left and Figure~\ref{fig:Autoregressive_PixelCNN} for an example
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.3\textwidth}
			\centering
			\includegraphics[width=0.6\textwidth]{figures/Autoregressive_Masked_Conv.png}
			\caption{Example mask for $5\times 5$ convolution}
		\end{subfigure}
		\hspace{2mm}
		\begin{subfigure}{0.32\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/Autoregressive_PixelCNN_blindspot_problem.png}
			\caption{Blindspot of PixelCNN}
		\end{subfigure}
		\hspace{2mm}
		\begin{subfigure}{0.32\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/Autoregressive_PixelCNN_blindspot.jpg}
			\caption{Solution to blindspot}
		\end{subfigure}
		\caption{Masked convolutions in PixelCNN}
		\label{fig:Autoregressive_PixelCNN}
	\end{figure}
	\item Problem: worse results than PixelRNN because of limited context and blind spot (cascaded convolutions ignore right upper part)
	\item Solution: use two convolution stacks, a vertical stack looking purely at the rows above, and a horizontal stack looking at the pixels to the left in the current row. Additionally, use gated convolutions (one half of the features go through tanh, the other through sigmoid)
	\item \textbf{PixelCNN++}: replace the 256-way softmax over 8-bit pixel values with a discretized logistic mixture likelihood, use encoder-decoder architecture with skip connections
\end{itemize}
\subsubsection{PixelVAE}
\begin{itemize}
	\item Standard VAE with PixelCNN as decoder/generator
	\item However, generator is very powerful which can lead to the problem that it ignores the latent code, and just generates ``nice'' images
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/Autoregressive_PixelVAE.png}
		\caption{Architecture of a PixelVAE}
	\end{figure}
\end{itemize}

================================================
FILE: Deep_Learning/dl_bayesian.tex
================================================
\section{Bayesian Deep Learning}
\begin{itemize}
	\item Bayesian machine learning: holding a distribution per latent variable instead of single value
	\item Benefits of Bayesian
	\begin{itemize}
		\item Ensemble modeling (better accuracies)
		\item Uncertainty estimates, preventing overconfident networks
		\item Model compression (have prior that pushes weights towards 0)
		\item \TODO{Think of more}
	\end{itemize}
\end{itemize}
\subsection{Epistemic uncertainty}
\begin{itemize}
	\item \textit{Epistemic uncertainty}: dataset limits
	\item Uncertainty that is introduced by dataset limits (unseen data $\Rightarrow$ how certain are the weights)
	\item Can be reduced by increasing the amount of data
	\item Important for safety-critical applications and small datasets
	\item Hard to model because posterior is usually intractable for complex functions like NN
	$$p(w|x,y) = \frac{p(x,y|w)p(w)}{\int p(x,y|w)p(w)dw}$$
	\item \textbf{Monte-Carlo Dropout}: apply dropout during testing (Bernoulli-distribution over weights as variational distribution). The variance/uncertainty derived from there approximates uncertainty gained by variational framework. 
	\begin{itemize}
		\item \textit{Advantages}: every standard NN can be turned into a Bayesian NN. Very easy to train and no inference network necessary
		\item \textit{Drawbacks}: expensive, have to rerun model several times on data. Not very accurate (depends on activation function etc.)
	\end{itemize}
	\item \textbf{Deep Gaussian Process}: predict mean and variance for every data point.
	\begin{itemize}
		\item The predictive distribution is $p(y|x,X,Y) = \int p(y|x,w)p(w|X,Y)dw$
		\item The likelihood term is a Gaussian $p(y|x,w)=\mathcal{N}(y; \hat{y}(x,w), \tau^{-1}I_D)$ where $\hat{y}(x,w)$ is a NN and $\tau$ the model precision (so $\tau^{-1}$ is the noise variance), which can be derived from MC dropout
		\item For the posterior, we use variational approximation: $p(w|X,Y)\approx q(w)$. In case of MC dropout, we have $\tilde{W}_i = W_i\cdot \text{diag}\left(\left[z_{i,j}\right]_{1}^{K_i}\right), z_{i,j}\sim \text{Bernoulli}\left(p_i\right)$ where $\tilde{W}_i$ are the weights with applied dropout
		\item Minimize loss $\mathcal{L}= - \int q(w)\log p(Y|X,w)dw + KL\left(q(w)||p(w)\right)$. First term is approximated by Monte-Carlo integration (equivalent to sampling dropout), and second can be approximated analytically
	\end{itemize}
	\item Over-parameterized models give better uncertainty estimates as they capture a bigger class of models. However, they also need higher dropout rates
\end{itemize}
\subsection{Aleatoric uncertainty}
\begin{itemize}
	\item \textit{Aleatoric uncertainty}: data uncertainty
	\item Uncertainty due to the nature of the data (noisy/hard to predict accurately. Example: depth estimation with bad sensor)
	\item Can be reduced by better data (better sensors, multiple different sensors, etc.)
	\item \textit{Data-dependent/heteroscedastic aleatoric uncertainty}: specific raw inputs like images that are hard to interpret
	\begin{itemize}
		\item Can be modeled by predicting a variance term per data point to reduce loss
		$$\mathcal{L} = \frac{\lVert y_i - \hat{y}_i\rVert^2}{2\sigma_i^2} + \log \sigma_i$$
		If variance low, the loss is weighted higher, but the $\log$ term is smaller $\Rightarrow$ trade-off
	\end{itemize}
	\item \textit{Task-dependent/homoscedastic aleatoric uncertainty}: introduced by task like semantic segmentation or depth estimation (hard at edges). Possible solution: train on multiple tasks like edge detection
	\begin{itemize}
		\item We can as well introduce a variance term, but shared by all data points (task individual):
		$$\mathcal{L} = \frac{\lVert y_i - \hat{y}_i\rVert^2}{2\sigma^2} + \log \sigma$$
	\end{itemize}
\end{itemize}
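The trade-off in the heteroscedastic loss can be made explicit with a short derivation (a sketch for a single data point, treating $\sigma_i$ as a free parameter rather than a network output). Setting the derivative with respect to $\sigma_i$ to zero gives
$$\frac{\partial \mathcal{L}}{\partial \sigma_i} = -\frac{\lVert y_i - \hat{y}_i\rVert^2}{\sigma_i^3} + \frac{1}{\sigma_i} = 0 \hspace{2mm}\Rightarrow\hspace{2mm} \sigma_i^2 = \lVert y_i - \hat{y}_i\rVert^2$$
Under this simplification, the loss is minimized when the predicted variance matches the squared residual. A larger $\sigma_i$ is penalized by the $\log$ term, a smaller one by the amplified error term.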
\subsection{Bayes by Backprop}
\begin{itemize}
	\item Start from a NN with a distribution over its weights
	\item Train weights to approximate the true posterior well (same objective as the ELBO: the evidence term $\log p(\mathcal{D})$ is constant w.r.t. $\theta$ and can be dropped)
	$$\text{KL}\left(q\left(w|\theta\right)||p\left(w|\mathcal{D}\right)\right) = \text{KL}\left(q\left(w|\theta\right)||p\left(w\right)\right) - \int q(w|\theta) \log p(\mathcal{D}|w)dw + \log p(\mathcal{D})$$
	First term pushes distributions towards prior, and second towards modeling the data well
	\item Compute by Monte-Carlo integration (over distribution $q(w|\theta)$) for \textit{both} terms:
	$$\mathcal{L} = \log q(w_s|\theta) - \log p(w_s) - \log p(\mathcal{D}|w_s) \hspace{2mm}\text{ where }\hspace{2mm} w_s\sim q(w_s|\theta)$$
	\item Example: assume a Gaussian variational posterior on the weights $w=\mu + \epsilon \cdot \log(1 + \exp(\rho))$ with $\epsilon\sim\mathcal{N}(0,1)$ (standard deviation with softplus trick for always positive values). Learn parameters $\mu$ and $\rho$ per weight
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/Bayes_By_Backprop.png}
	\end{figure}
	\item In experiments, Bayesian NNs perform similarly to plain NNs with dropout
\end{itemize}
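The decomposition of the KL divergence used in this section follows from Bayes' rule $p(w|\mathcal{D}) = p(\mathcal{D}|w)p(w)/p(\mathcal{D})$. As a short sketch of the steps:
\begin{equation*}
	\begin{split}
		\text{KL}\left(q\left(w|\theta\right)||p\left(w|\mathcal{D}\right)\right) & = \int q(w|\theta) \log \frac{q(w|\theta)\, p(\mathcal{D})}{p(\mathcal{D}|w)\, p(w)} dw\\
		& = \text{KL}\left(q\left(w|\theta\right)||p\left(w\right)\right) - \int q(w|\theta) \log p(\mathcal{D}|w)dw + \log p(\mathcal{D})
	\end{split}
\end{equation*}
The evidence term $\log p(\mathcal{D})$ does not depend on $\theta$, so it has no influence on the gradients and can be ignored during training.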

================================================
FILE: Deep_Learning/dl_convnets.tex
================================================
\section{Convolutional Neural Networks}
\begin{itemize}
	\item Images are stationary signals with spatial structure and huge dimensionality
	\item Input dimensions are highly correlated (e.g. translation invariant)
	\item Preserve spatial structure by convolutional filters, local connectivity (with shared weights) and being robust to local variations by spatial pooling
\end{itemize}
\subsection{Transfer Learning}
\begin{itemize}
	\item Use large datasets like ImageNet to learn useful features for other, smaller datasets
	\item Prevent overfitting, even for large networks
	\item Alternatively, we could also use a pre-trained network on task 1 as feature extractor for task 2 (same as freezing first layers)
	\item Which layer(s) to fine-tune?
	\begin{itemize}
		\item If both tasks have the same labels, we can initialize all layers. Otherwise, the classification layer (last layer) must be newly trained. If there is only very little data available, only fine-tune this layer
		\item If datasets are very different, the fully connected layers need to be replaced
		\item First convolutional filters capture low-level information that mostly does not change over datasets. Mid-level convolutions can be fine-tuned if dataset is large enough
	\end{itemize}
	\item Use a smaller learning rate for pre-initialized layers as network starts already from a point close to the optimum. New layers can be trained with higher learning rate
\end{itemize}
\subsection{Standard classification architectures}
\subsubsection{VGGNet}
\begin{itemize}
	\item All filter sizes are $3\times 3$, the smallest size that still covers a pixel's direct neighborhood. Stacking them is a more parameter-efficient way to build up large receptive fields, and adds non-linearities between the filters
	\item $1\times 1$ convolutions used to increase non-linearity/complexity without increasing receptive field
\end{itemize}
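The parameter efficiency can be checked with a short count (a sketch assuming $C$ input and $C$ output channels per layer and ignoring biases; $C$ is introduced only for this illustration). Two stacked $3\times 3$ convolutions cover a $5\times 5$ receptive field, and three cover $7\times 7$:
$$\underbrace{2\cdot 3^2 C^2}_{=18C^2} < \underbrace{5^2 C^2}_{=25C^2}, \hspace{5mm} \underbrace{3\cdot 3^2 C^2}_{=27C^2} < \underbrace{7^2 C^2}_{=49C^2}$$
So the stacked version needs roughly $28\%$ and $45\%$ fewer weights, respectively, while applying two or three non-linearities instead of one.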
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.3\textwidth]{figures/CNN_VGGnet.png}
	\caption{VGG16 architecture}
	\label{fig:CNN_VVGnet}
\end{figure}
\subsubsection{Inception}
\begin{itemize}
	\item Receptive fields should vary in size as objects can appear in different scales
	\item Naively stacking more convolutional operations on top of each other is expensive and prone to overfitting
	\item Inception module applies different filter sizes on same input ($1\times 1$ convolutions for feature reduction)
	\item Architecture consists of 9 Inception blocks
	\item Solution for vanishing gradients: have intermediate classifiers that amplify the gradient signal for early layers
	\item InceptionV2: $5\times 5$ replaced by two $3\times 3$ filters
	\item InceptionV3: $1\times 3$ and $3\times 1$ filters instead of $3\times 3$
	\item BatchNormalization has been shown to be very helpful in this architecture
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.8\textwidth]{figures/CNN_Inception_module.pdf}
	\caption{Inception module}
	\label{fig:CNN_Inception_module}
\end{figure}
\subsubsection{ResNet/DenseNet/HighwayNet}
\begin{itemize}
	\item Deeper networks are harder to optimize, and might actually achieve worse results than shallow ones because of that (although learning identity in additional layers must lead to same results)
	\item Better approach: try to model the difference that is learned in every layer $H(x) = F(x) + x$
	\item Different ways for modeling $F(x)$. Most popular ones shown in Figure~\ref{fig:CNN_ResNet_blocks}. BatchNormalization has been shown to be very important because of vanishing gradients
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.7\textwidth]{figures/CNN_ResNet_blocks.png}
		\caption{ResNet blocks}
		\label{fig:CNN_ResNet_blocks}
	\end{figure}
	\item \textbf{HighwayNet} introduces a gate with learnable parameters to determine the importance of a layer: $H(x) = F(x) \cdot T(x) + x \cdot \left(1 - T\left(x\right)\right)$
	\item \textbf{DenseNet} uses skip connections to multiple forward layers. Creates complex blocks where the last layer sees the outputs of all previous layers
\end{itemize}
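One way to see why the residual formulation eases optimization is to look at the gradient through a single block (a sketch with a generic loss $\mathcal{L}$ and identity matrix $I$, both introduced here for illustration):
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H}\cdot \frac{\partial H}{\partial x} = \frac{\partial \mathcal{L}}{\partial H}\cdot\left(\frac{\partial F}{\partial x} + I\right)$$
The identity term passes the gradient on unchanged even if $\partial F/\partial x$ is small, so early layers still receive a learning signal. Setting $F(x)=0$ recovers the identity mapping, which makes it easy for additional layers to do no harm.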
\subsection{Tracking/Object detection}
\subsubsection{Fast R-CNN}
\begin{itemize}
	\item Based on middle feature map, get bounding boxes by e.g. selective search 
	\item RoI pooling returns fixed size feature map for selected bounding box (puts e.g. $3\times 3$ mask on features and pools accordingly)
	\item Features used to generate class prediction and location correction
	\item During training, sample multiple candidate boxes from an image and train on all of them. Makes it more efficient/faster, \textit{but} batch elements might be highly correlated (in the paper, they report that they found the effect to be negligible)
	\item Very accurate and fast, but external box proposals needed
	\item \textbf{Faster R-CNN}: train network to propose box locations
\end{itemize}
\subsubsection{Siamese Network for Tracking}
\begin{itemize}
	\item Use Siamese network to compare similarity of two patches
	\item If we compare patches over time, we can find objects with the highest similarity $\Rightarrow$ tracking of objects
	\item Can be trained on rich video dataset, and can be applied to unseen categories/targets
\end{itemize}
\subsection{Spatial Transformer Network}
\begin{itemize}
	\item ConvNets must be invariant/robust to pose/geometry changes. One simple way of doing it is data augmentation
	\item Better: use spatial transformer network to learn rotation/scale transformation
	\item Define grid on input. Scale, translation and rotation parameters are learned by the network and depend on the input. Finally, transform image based on the changed grid. 
	\item Operation is differentiable and thus can be learned
\end{itemize}

================================================
FILE: Deep_Learning/dl_deep_rl.tex
================================================
\section{Deep Reinforcement Learning}
\subsection{Fundamentals of Reinforcement Learning}
\begin{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/RL_basic_concept.png}
		\caption{Interaction model between environment and agent}
	\end{figure}
	\item The \textbf{state} $s_t$ is the summary of all experience so far: $s_t = f(o_1, r_1, a_1, o_2, r_2, a_2, ..., o_t, r_t)$ ($o_i$ observable part of environment at time step $i$). If we have a fully observable environment, then $s_t = f(o_t)$.
	\item The \textbf{policy} of an agent determines its actions: $\pi\left(a_t|s_t\right)$. Can be deterministic or stochastic
	\item The \textbf{value function} is the expected total reward under policy $\pi$: $$q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[r_{t+1}+\gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t, a_t\right]$$
	$\gamma$ as discount factor as we are most certain about close rewards and sometimes are more interested in immediate rewards
	\item \textbf{Bellman equation} for value function:
	$$q^{\pi}(s_t, a_t) = \mathbb{E}_{s', a'}\left[r + \gamma q^{\pi}\left(s', a'\right) | s_t, a_t\right] = \sum_{s'} p(s'|s_t,a_t)\cdot \left[r(s', a_t, s_t) + \gamma \sum_{a'} \pi(a'|s') \cdot q^{\pi}\left(s', a'\right) \right]$$
	\item The optimal value function is therefore $q^{*}(s_t,a_t) = \max_{\pi} q^{\pi}(s_t,a_t) = \mathbb{E}_{s_{t+1}}\left[r_{t+1} + \gamma \max_{a_{t+1}} q^{*}(s_{t+1}, a_{t+1}) | s_t, a_t\right]$
	\item The \textbf{environment} can be modeled by the agent (learned from experience), and used for planning and look ahead. This can be for example a simulator
\end{itemize}
\subsection{Deep RL approaches}
\subsubsection{Value-based approaches}
\begin{itemize}
	\item Try to learn value function $q^*$ to get the optimal policy $\pi^*$
	\item The input to such models is usually the state, which should be as raw as possible (e.g. image frames). We can either add the action to the input and let the network predict its Q-value, or predict Q-values for all possible actions (second is faster and simpler)
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/RL_deep_QLearning.png}
		\caption{Modeling of Q-value predictions}
	\end{figure}
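	\item Optimization by a Q-learning loss (similar to SARSA, but bootstrapping with the maximum over next actions instead of the action actually taken):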
	$$\mathcal{L} = \mathbb{E}\left[\left(r + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}, \theta) -q(s_{t}, a_{t}, \theta) \right)^2\right]$$
	\item For the gradients, we assume that the bootstrapped max value is fixed:
	$$\pd{\mathcal{L}}{\theta} = \mathbb{E}\left[-2\cdot \left(r + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}, \theta) -q(s_{t}, a_{t}, \theta) \right) \cdot \pd{q(s_{t}, a_{t}, \theta)}{\theta}\right]$$
	\item Optimize with SGD by sampling one action and state, calculate q-values for all possible future actions, and use the maximum as bootstrap goal
\end{itemize}
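As a small worked example of this loss for one sampled transition (all numbers are hypothetical and chosen only for illustration): let $r=1$, $\gamma=0.9$, $\max_{a_{t+1}} q(s_{t+1},a_{t+1},\theta)=2$ and $q(s_t,a_t,\theta)=2.5$. Then
$$\underbrace{r + \gamma \max_{a_{t+1}} q(s_{t+1},a_{t+1},\theta)}_{\text{target}} = 1 + 0.9\cdot 2 = 2.8, \hspace{5mm} \delta = 2.8 - 2.5 = 0.3, \hspace{5mm} \mathcal{L} = \delta^2 = 0.09$$
With the target treated as a constant, the gradient is $-2\cdot 0.3\cdot \frac{\partial q(s_t,a_t,\theta)}{\partial \theta} = -0.6\cdot \frac{\partial q(s_t,a_t,\theta)}{\partial \theta}$, so a gradient descent step increases $q(s_t,a_t,\theta)$ towards the target.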
\subsubsection{Stability problems}
\begin{itemize}
	\item As we bootstrap, the target is always changing $\Rightarrow$ policy changes fast, can lead to oscillations
	\item The sequential data breaks the iid assumption on which SGD relies
	\item The scale of Q-values is not easy to control, and is very task dependent $\Rightarrow$ gradients are unstable and can be either too large or too small
	\item \textbf{Improving stability}
	\begin{itemize}
		\item \textit{Experience replay}: store transitions $\langle s, a, r, s'\rangle$ (collected with e.g. an $\epsilon$-greedy policy) in a dataset, and sample batches from there to train on. Breaks the temporal dependency, so batches are closer to the i.i.d. assumption of SGD
		\item \textit{Freezing target}: instead of having a moving target, we keep a copy of the $Q$ network that is only updated every $K$ iterations, and use that copy to generate our targets (Q-targets now come from slightly older network parameters, but are steady over $K$ iterations). Avoids oscillations
		\item \textit{Clipping rewards}: Normalize or clip rewards to be in range $[-1,+1]$ or any other stable range. Prevents unknown scales of $Q$
		\item \textit{Skipping frames}: a light version of experience replay is skipping $N$ frames between two data points to avoid too strong temporal dependency (two consecutive frames are very similar)
		\item \textit{Exploration vs Exploitation}: use an $\epsilon$-greedy policy with annealed $\epsilon$. In the beginning, we focus on exploration and slowly shift towards exploitation
	\end{itemize}
\end{itemize}
\subsubsection{Policy-based approaches}
\begin{itemize}
	\item Try to learn the optimal policy $\pi^*$ directly from experience (parameterized policy $\pi_w(a_t|s_t)$)
	\item Avoids learning the $q$ values which are hard for continuous action spaces, and tend to oscillate because of bootstrapping
	\item Training steps
	\begin{enumerate}
		\item Determine Q-value for current policy by running a simulation:\\ $q^{\pi_w}(s_t, a_t) = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + ... | s_t, a_t, \pi_{w}\right]$
		\item Maximize the q-values as objective. 
		\begin{enumerate}
			\item If policy is deterministic:
			$$\pd{\mathcal{L}}{w} = \mathbb{E}\left[\chain{q^{\pi}(s,a)}{a}{w}\right]$$
			\item If policy is stochastic:
			$$\pd{\mathcal{L}}{w} = \mathbb{E}\left[\pd{\log \pi_{w}(a|s)}{w} q^{\pi}(s,a)\right]$$
		\end{enumerate}
	\end{enumerate}
	\item Asynchronous Advantage Actor-Critic
	\begin{itemize}
		\item Learn both policy and value function
		\item Multiple agents that simultaneously interact with (copy of) environment and learn
		\item \textit{Advantage estimates}: Use the learned value function to compare to your actually gained $q$ value. Loss is therefore higher if unexpected things happen $\Rightarrow$ exploration
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\begin{subfigure}{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/RL_A3C_multiple_workers.png}
		\end{subfigure}
		\begin{subfigure}{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/RL_A3C_cycle.png}
		\end{subfigure}
		\caption{Schematic overview of A3C}
		\label{fig:RL_A3C}
	\end{figure}
\end{itemize}
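The gradient for stochastic policies relies on the log-derivative trick. A sketch for a single state $s$ with discrete actions, under the simplifying assumption that $q^{\pi}(s,a)$ is treated as fixed with respect to $w$:
\begin{equation*}
	\begin{split}
		\frac{\partial}{\partial w}\, \mathbb{E}_{a\sim \pi_w(\cdot|s)}\left[q^{\pi}(s,a)\right] & = \sum_{a} \frac{\partial \pi_w(a|s)}{\partial w}\, q^{\pi}(s,a) = \sum_{a} \pi_w(a|s)\, \frac{\partial \log \pi_w(a|s)}{\partial w}\, q^{\pi}(s,a)\\
		& = \mathbb{E}_{a\sim \pi_w(\cdot|s)}\left[\frac{\partial \log \pi_w(a|s)}{\partial w}\, q^{\pi}(s,a)\right]
	\end{split}
\end{equation*}
The expectation can therefore be estimated from sampled actions, without having to differentiate through the sampling step itself.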
\subsubsection{Model-based approaches}
\begin{itemize}
	\item Try to model the environment to be aware of rules etc. 
	\item Example: AlphaGo relies on Tree-Search guided by CNNs. We use two policy networks to play against each other, and one value network that predicts the value function of a state
\end{itemize}

================================================
FILE: Deep_Learning/dl_generative_models.tex
================================================
\section{Deep Generative Models}
\begin{itemize}
	\item \textit{Generative modeling}: learn the joint probability $p(x,y)$ or density function $p(x)$. The prediction task can still be performed via Bayes rule: $p(y|x) = p(x,y)/p(x)$. Generalize better (less prone to overfitting), and better modeling of causal relations. Members include GAN, VAE, etc.
	\begin{itemize}
		\item We can use generative models to predict uncertainty and out of distribution examples: $p(x,y) = p(y|x)p(x) \Rightarrow$ if $x$ o.o.d., then $p(x)$ low!
	\end{itemize}
	\item \textit{Discriminative modeling}: learn conditional pdf $p(y|x)$. Is usually task-oriented and gets better results. 
	\item Applications of generative models
	\begin{itemize}
		\item Simulating possible futures for reinforcement learning
		\item Creating missing data  (e.g. pixel patches which are missing)
		\item Super-resolution scaling for images
		\item Data augmentation (replace e.g. car by bicyclist in a scene)
		\item Cross-modal translation (sketch to image)
	\end{itemize}
	\item Different types of generative models (see Figure~\ref{fig:GAN_generative_models_overview})
	\begin{itemize}
		\item \textit{Explicit density}: maximize log likelihood of the data by modeling a probability density function. Function must be complex enough and computationally tractable
		\item \textit{Implicit density}: no explicit pdf needed, only a sampling mechanism
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/GAN_generative_models_overview_2.png}
		\caption{Overview of generative models}
		\label{fig:GAN_generative_models_overview}
	\end{figure}
\end{itemize}
\subsection{Generative Adversarial Networks}
\begin{itemize}
	\item Adversarial training of generator vs discriminator
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.45\textwidth]{figures/GAN_pipeline.png}
		\caption{Pipeline of adversarial GAN training}
		\label{fig:GAN_pipeline}
	\end{figure}
	\item The generator is a (mostly deconvolutional) network that takes noise $z$ as input, and creates fake images. The discriminator tries to distinguish between fake and real images
	\item Trained in a minimax game fashion, the loss function resembles the Jensen-Shannon divergence:
	\begin{equation*}
		\begin{split}
			\min_G \max_D V(G,D) & = \mathbb{E}_{\bm{x}\sim p_{\text{data}}(\bm{x})} \left[\log \left(D\left(\bm{x}\right)\right)\right] + \mathbb{E}_{\bm{z}\sim p_{z}(\bm{z})} \left[\log\left(1 - D\left(G\left(\bm{z}\right)\right)\right)\right] \\
			J^{(D)} & = - \frac{1}{2}\mathbb{E}_{x\sim p_{\text{data}}}\left[\log D(x)\right] - \frac{1}{2}\mathbb{E}_{z\sim p_{z}}\left[\log \left(1 - D(G(z))\right)\right]\\
			J^{(G)} & = - \frac{1}{2}\mathbb{E}_{z\sim p_{z}}\left[\log D(G(z))\right]\\
		\end{split}
	\end{equation*}
	\item Loss of generator is changed from $\log\left(1 - D(G(z))\right)$ to $-\log D(G(z))$ because otherwise the gradients of the generator vanish for a too strong discriminator 
	\item Divergence is important and can strongly influence the behavior of model
	\begin{equation*}
		\begin{split}
			D_{KL}\left(p(x)\lVert q^{*}(x)\right) = \int p(x) \log \frac{p(x)}{q^{*}(x)} dx & \implies \text{if } p(x)>0, \text{ then } q^{*}(x)>0\\
			D_{KL}\left(q^{*}(x)\lVert p(x)\right) = \int q^{*}(x) \log \frac{q^{*}(x)}{p(x)} dx & \implies \text{if } p(x)=0, \text{ then } q^{*}(x)=0\\
		\end{split}
	\end{equation*}
\end{itemize}
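The connection to the Jensen-Shannon divergence can be sketched with the standard argument for an optimal discriminator ($p_g$ denotes the distribution of generated samples $G(\bm{z})$, a symbol introduced here for the illustration). For a fixed generator, maximizing $V(G,D)$ pointwise in $D(\bm{x})$ gives
$$D^{*}(\bm{x}) = \frac{p_{\text{data}}(\bm{x})}{p_{\text{data}}(\bm{x}) + p_g(\bm{x})}$$
and inserting $D^{*}$ back into the objective yields
$$V(G,D^{*}) = -\log 4 + 2\cdot D_{JS}\left(p_{\text{data}}\lVert p_g\right)$$
Hence, with an optimal discriminator, the generator minimizes the Jensen-Shannon divergence between data and generated distribution. At the optimum $p_g = p_{\text{data}}$, the discriminator outputs $D^{*}(\bm{x}) = \frac{1}{2}$ everywhere.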
\subsubsection{GAN training problems}
\begin{itemize}
	\item \textbf{Vanishing gradients} during training:
	\begin{itemize}
		\item If the discriminator is too bad, the generator does not get valid/accurate feedback and can therefore not learn properly
		\item If the discriminator is perfect, the generator has very low gradients as a small change does not influence the discriminator
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_vanishing_gradients.jpeg}
			\caption{Vanishing gradients problem for training with KL-divergence. When the distance between the two distributions $p$ and $q$ (respectively $P_g$ and $P_r$) is too large, the gradient of the divergence is very close to zero. Hence, it does not provide any strong learning signal in these regions.}
		\end{figure}
	\end{itemize}
	\item \textbf{Reaching the equilibrium}
	\begin{itemize}
		\item We know that the Nash equilibrium of the minimax game is $P_g=P_r$, meaning that the distribution of the generated data is equal to the distribution of the real data. In that case, $D$ returns 0.5 no matter what example we put in (as both distributions are equal).
		\item However, it has been shown that such cost functions may not converge when using gradient descent. An example is shown in Figure~\ref{fig:GAN_reaching_equilibrium}.
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.4\textwidth]{figures/cv_deep_learning_GAN_oscillating.png}
			\caption{Oscillating behavior of a non-cooperative game $\min_x \max_y V(x,y)$ with $V(x,y) = x\cdot y$. The equilibrium $x=y=0$ is never reached.}
			\label{fig:GAN_reaching_equilibrium}
		\end{figure}
	\end{itemize}
	\item \textbf{Mode collapse}
	\begin{itemize}
		\item A GAN suffers from a mode collapse if the generator limits its predictions/generated distribution to a few samples/modes.
		\item For example, in the case of the MNIST dataset, this would mean that the generator only creates one or two different digits. Although a full mode collapse is rarely the case, partial mode collapses occur frequently.
		\item For a mode collapse to occur, the gradients with respect to the noise $\bm{z}$ must be very low/close to zero. This can for example happen if we fix the discriminator and the generator converges to the optimal image $\bm{x}^*$ that fools the discriminator the most
		\item Once the generator collapses to one mode, the discriminator will learn that this mode is purely/mostly generated and thus changes its predictions. The generator will address that by changing the mode (note that as $\partial L/\partial \bm{z}\approx 0$, we will just collapse to the next mode and are not able to escape this loop).
		\item In the end, this turns into a cat-and-mouse game between the generator and discriminator, and will not converge (see Figure~\ref{fig:GAN_mode_collapse}).
		\begin{figure}[ht!]
			\centering
			\includegraphics[width=0.6\textwidth]{figures/cv_deep_learning_GAN_mode_collapse.png}
			\caption{\textit{Top row}: optimal convergence of generator distribution to 8 modes. \textit{Bottom row}: Sample of a mode collapse after 10k iterations. The generator is only able to generate a single mode.}
			\label{fig:GAN_mode_collapse}
		\end{figure}
	\end{itemize}
	\item \textbf{Low dimensional support}
	\begin{itemize}
		\item The KL and JS divergences work best for overlapping distributions, where neither density is 0 (otherwise we get numerical instabilities)
		\item However, during training, the generated distribution is not perfect, and as we have high-dimensional data, both distributions are unlikely to overlap much
		\item In that case, it is also easy for the discriminator to find a decision boundary that separates them
	\end{itemize}
\end{itemize}
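The oscillation in Figure~\ref{fig:GAN_reaching_equilibrium} can be verified with a short calculation. This is a sketch assuming simultaneous gradient steps with a constant step size $\eta$ (notation introduced here), where $x$ descends and $y$ ascends on $V(x,y) = x\cdot y$:
\begin{equation*}
	\begin{split}
		x_{t+1} & = x_t - \eta \pd{V}{x} = x_t - \eta\, y_t, \hspace{5mm} y_{t+1} = y_t + \eta \pd{V}{y} = y_t + \eta\, x_t\\
		x_{t+1}^2 + y_{t+1}^2 & = \left(x_t - \eta\, y_t\right)^2 + \left(y_t + \eta\, x_t\right)^2 = \left(1 + \eta^2\right)\left(x_t^2 + y_t^2\right)
	\end{split}
\end{equation*}
The distance to the equilibrium $x=y=0$ therefore grows by a factor of $\sqrt{1+\eta^2}$ in every step: under these assumptions, the iterates spiral around the equilibrium instead of converging to it, no matter how small $\eta$ is chosen.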
\subsubsection{GAN improvements}
\begin{itemize}
	\item \textbf{Wasserstein GAN}
	\begin{itemize}
		\item Instead of KL/JS, use Wasserstein (Earth Mover's) Distance:
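		$$\mathcal{W}(p_r, p_g) = \inf\limits_{\gamma \in \Pi (p_r,p_g)} \mathbb{E}_{(x,y)\sim \gamma}\left[\lVert x-y\rVert\right]$$
		where $\Pi(p_r,p_g)$ is the set of all joint distributions with marginals $p_r$ and $p_g$
		\item Intuitive explanation: how much probability mass has to be moved, and how far, to transform one distribution into the other. Thus, the distance is meaningful even for non-overlapping distributions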
	\end{itemize}
	\item \textbf{Usage of labels}
	\begin{itemize}
		\item Learning a conditional model $p(y|x)$ often generates better samples than from a random distribution
		\item One example is conditional GANs, where a ground truth label is given
	\end{itemize}
	\item \textbf{Label smoothing}
	\begin{itemize}
		\item Train the discriminator to predict $D(x)\approx 1 - \alpha$ instead of 1
		\item Has been shown to be a good regularization, as it prevents the discriminator from being overconfident
		\item In addition, the gradients of the generator are less likely to explode
	\end{itemize}
	\item \textbf{Virtual batch normalization}
	\begin{itemize}
		\item Batch Normalization can significantly help in neural networks
		\item However, in GANs, it leads to high intra-batch correlation
		\item Solution: \textit{virtual batch normalization} where we select a reference batch which is fixed during training, and combine it with the statistics of the current batch. Reduces overfitting on reference batch and intra-batch correlation
	\end{itemize}
\end{itemize}
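A minimal example of why the Wasserstein distance helps for non-overlapping distributions (a sketch assuming two point masses on the real line, $p_r = \delta_0$ and $p_g = \delta_{\theta}$, where $\theta$ is a hypothetical generator parameter introduced for illustration):
\begin{equation*}
	\begin{split}
		D_{KL}\left(p_r \lVert p_g\right) & = \begin{cases}
		0 & \text{ if } \theta = 0\\
		\infty & \text{ if } \theta \neq 0\\
		\end{cases}, \hspace{5mm}
		D_{JS}\left(p_r \lVert p_g\right) = \begin{cases}
		0 & \text{ if } \theta = 0\\
		\log 2 & \text{ if } \theta \neq 0\\
		\end{cases}\\
		\mathcal{W}(p_r, p_g) & = |\theta|
	\end{split}
\end{equation*}
The KL and JS divergences are constant for all $\theta \neq 0$ and thus give no gradient with respect to $\theta$, whereas the Wasserstein distance decreases smoothly as $\theta$ approaches $0$ and provides the gradient $\text{sign}(\theta)$.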
\subsubsection{GAN open questions}
\begin{itemize}
	\item \textbf{Mode collapse}: How to prevent a model from suffering from mode collapse? One idea is to penalize the model if features are too similar, or to allow the discriminator to see across batch elements. But these solutions are heuristics rather than theoretically grounded solutions
	\item \textbf{Evaluation of GANs}: GANs are currently judged by their qualitative results/predictions, but there is no quantitative measurement yet
	\item \textbf{Discrete outputs}: The generator and discriminator need to be differentiable, and thus discrete outputs are not possible. There are some workarounds, but no real theoretically sound solution.
	\item \textbf{Semi-supervised classification}: How to combine a GAN training and discriminative model efficiently (discriminator predicts class and fake/real at the same time)
\end{itemize}
\subsection{Boltzmann machines}
\begin{itemize}
	\item A Boltzmann distribution is defined by $p(x) = \frac{1}{Z}\exp\left(-E\left(x\right)\right)$ where $E(x)$ is an energy function described by our model, and $Z=\sum\limits_x \exp\left(-E\left(x\right)\right)$ a normalization constant
	\item The benefit of defining a distribution like that is that our model can output any value in $(-\infty, \infty)$ instead of being constrained to $[0,1]$
	\item A problem is that even if $x$ is binary, the normalizing constant $Z$ gets out of hand (sum over $2^{n}$ combinations for $n$-dimensional $x$). Thus, we limit the computations by only considering pairwise relations
	\item Pairwise relations modeled by $E(x)=-x^TWx-b^Tx$. Learning $W$ and $b$ by maximizing the likelihood of the data
	\item Problem: $W$ is still of size $n^2$ which can be too large for e.g. images ($256\times 256$ pixels lead to $\approx 4.3$ billion parameters in $W$) $\Rightarrow$ Restricted Boltzmann machines
\end{itemize}
\subsubsection{Restricted Boltzmann machines}
\begin{itemize}
	\item Restrict model by additional bottleneck over $h$ latents
	$$E(x,h) = -x^T W h - b^T x - c^T h, \hspace{2mm} p(x) = \frac{1}{Z}\sum_h \exp\left(-E\left(x,h\right)\right)$$
	\item This function is not in the form of an energy function anymore (because of the sum). We can rewrite it as:
	\begin{equation*}
		\begin{split}
			F(x) & = -b^T x - \sum_i \log \sum_{h_i} \exp\left(h_i\left(c_i + W_i x\right)\right)\\
			p(x) & = \frac{1}{Z} \exp\left(-F(x)\right)\\
			Z & = \sum\limits_x \exp\left(-F(x)\right)
		\end{split}
	\end{equation*}
	\item Can be represented as a single (undirected) MLP layer with fewer hidden units
	\item Compared to a simple Boltzmann machine, we can express higher-order relations 
	\item The hidden units are conditionally independent of each other given $x$, and the same holds for the inputs $x$ given $h$:
	$$p(h|x) = \prod_j p(h_j|x, \theta), \hspace{2mm} p(x|h) = \prod_i p(x_i|h, \theta) $$
	\item We can now reformulate the conditional probabilities as sigmoids \textbf{iff} $h$ and $x$ are still binary:
	$$p(h_j=1|x, \theta) = \sigma\left(W_{:,j}^T x + c_j\right), \hspace{2mm}p(x_i=1|h, \theta) = \sigma\left(W_{i,:} h + b_i\right)$$
	\item The loss is maximizing the log likelihood:
	$$\mathcal{L}(\theta) = \frac{1}{N}\sum_n \log p(x_n|\theta) = \frac{1}{N}\sum_n\left[- F(x_n) - \log Z\right]$$
	\item The gradients can be computed accordingly:
	\begin{equation*}
		\begin{split}
			\pd{\log p(x_n|\theta)}{\theta} & = -\sum_h p(h|x_n, \theta) \pd{E(x_n,h| \theta)}{\theta} + \sum_{\tilde{x}, h} p(\tilde{x}, h|\theta) \pd{E(\tilde{x}, h|\theta)}{\theta}\\
		\end{split}
	\end{equation*}
	Problem: second term is sum over $x$ and $h$ $\Rightarrow$ high-dimensional, hard to compute
	\item One way to approximate it is contrastive divergence: start a sampling chain at a data point $x_0 = x_n$, sample $h_0 \sim p(h|x_0)$, $x_1 \sim p(x|h_0)$, etc., and use the samples to estimate the second term. In practice, a single sampling step is mostly sufficient
	\item \textbf{Deep Belief Network}: RBMs are still single-layer models, but we can also use a stack of RBMs. First layer is directed, others not. Our joint pdf is $p(x, h_1, h_2) = p(x|h_1)\cdot p(h_1, h_2)$
	\item \textbf{Deep Boltzmann machines}: also a stack of RBMs, but with undirected first layer
	\begin{itemize}
		\item Hence, we get $p(h_2^{k}|h_1, h_3) = \sigma \left(W_1^{:,k}h_1 + W_3^{k,:}h_3 \right)$
		\item Computing gradients is intractable $\Rightarrow$ approximate by sampling
	\end{itemize}
\end{itemize}
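The sigmoid form of the conditionals follows directly from the energy $E(x,h) = -x^T W h - b^T x - c^T h$, in which $c$ is the bias of the hidden units. As a short derivation for binary $h_j\in\{0,1\}$ (all terms that do not depend on $h_j$ cancel in the normalization):
\begin{equation*}
	\begin{split}
		p(h|x) & \propto \exp\left(-E(x,h)\right) \propto \prod_j \exp\left(h_j\left(c_j + W_{:,j}^T x\right)\right)\\
		p(h_j=1|x) & = \frac{\exp\left(c_j + W_{:,j}^T x\right)}{1 + \exp\left(c_j + W_{:,j}^T x\right)} = \sigma\left(c_j + W_{:,j}^T x\right)
	\end{split}
\end{equation*}
For the weights, $\pd{E(x,h)}{W_{ij}} = -x_i h_j$, so that the gradient and its contrastive divergence approximation with a single sampling step (a sketch, with $x_0 = x_n$, $h_0 \sim p(h|x_0)$ and $x_1 \sim p(x|h_0)$) become
\begin{equation*}
	\begin{split}
		\pd{\log p(x_n|\theta)}{W_{ij}} & = \mathbb{E}_{p(h|x_n)}\left[x_{n,i} h_j\right] - \mathbb{E}_{p(\tilde{x},h)}\left[\tilde{x}_i h_j\right]\\
		& \approx x_{0,i} \cdot p(h_j=1|x_0) - x_{1,i} \cdot p(h_j=1|x_1)
	\end{split}
\end{equation*}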
\subsection{Variational Autoencoders}
\begin{itemize}
	\item We assume an underlying, lower-dimensional data distribution $p(z)$ with which we can model our data distribution $p(x,z)=p(x|z)p(z)$
	\item Therefore, we need to model $p(z|x)$ which is often not easy to compute. In variational inference, we approximate the true posterior by $q_{\varphi}(z)$ (the approximate posterior does not have to depend on the observed $x$, but in a VAE it does)
	\item Our goal is to maximize $p(x)$. As this is intractable, we use the evidence lower bound (ELBO), which follows from Jensen's inequality:
	\begin{equation*}
		\begin{split}
			\log p(x) & = \log \int p(x,z)dz \\
			& = \log \int q_{\varphi}(z) \frac{p(x,z)}{q_{\varphi}(z)} dz\\
			& = \log \mathbb{E}_{q_{\varphi}(z)}\left[\frac{p(x,z)}{q_{\varphi}(z)}\right]\\
			& \geq \mathbb{E}_{q_{\varphi}(z)}\left[\log \frac{p(x,z)}{q_{\varphi}(z)}\right]\\
			& = \mathbb{E}_{q_{\varphi}(z)}\left[\log p(x|z)\right] - \text{KL}\left(q_{\varphi}(z)||p(z)\right) = \text{ELBO}_{\theta, \varphi}\left(x\right)
		\end{split}
	\end{equation*}
	\item The distance between $\log p(x)$ and the ELBO is the KL divergence to the true (unknown) posterior:
	$$\log p(x) - \text{KL}\left(q_{\varphi}(z)||p(z|x)\right) = \mathbb{E}_{q_{\varphi}(z)}\left[\log p(x|z)\right] - \text{KL}\left(q_{\varphi}(z)||p(z)\right)$$
	\item Thus, maximizing the ELBO either increases the log likelihood or optimizes the approximated posterior
	\item Variational Autoencoders make $q_{\varphi}(z)$ dependent on $x$, and model $p_{\theta}(x|z)$ as well:
	$$\text{ELBO}_{\theta, \varphi}\left(x\right) = \mathbb{E}_{q_{\varphi}(z|x)}\left[\log p_{\theta}(x|z)\right] - \text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right)$$
	Note that $p_{\lambda}(z)$ is not optimized, and its parameters $\lambda$ just describe the prior (e.g. standard Gaussian) 
	\item The loss function for a VAE is the negative ELBO, where we approximate the expectation by a single sample. The KL is mostly chosen to be analytically solvable (e.g. for two Gaussians) to avoid a Monte-Carlo approximation of the integral 
	\item However, we face a problem when we try to compute the gradients $\nabla_{\varphi} \mathcal{L}$: a naive Monte-Carlo estimate has high variance, and sampling is a non-differentiable operation
	\item \textbf{Reparameterization trick}: sample from an external, constant distribution, and transform this sample into a sample of the modeled distribution. For Gaussian: $z = \mu_q + \sigma_q \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$
\end{itemize}
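As a worked example of the analytically solvable KL term (a sketch assuming a diagonal Gaussian encoder $q_{\varphi}(z|x)=\mathcal{N}(z|\mu_q, \text{diag}(\sigma_q^2))$ and a standard Gaussian prior $p_{\lambda}(z)=\mathcal{N}(0,I)$; the index $d$ over the latent dimensions is notation introduced here), the single-sample loss can be written as
\begin{equation*}
	\begin{split}
		\text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right) & = \frac{1}{2}\sum_d \left(\mu_{q,d}^2 + \sigma_{q,d}^2 - 1 - \log \sigma_{q,d}^2\right)\\
		z & = \mu_q + \sigma_q \cdot \epsilon, \hspace{3mm} \epsilon \sim \mathcal{N}(0, I)\\
		\mathcal{L}(x) & = -\log p_{\theta}(x|z) + \text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right)
	\end{split}
\end{equation*}
The KL term is zero exactly for $\mu_q = 0$ and $\sigma_q = 1$. Since $\pd{z}{\mu_q} = 1$ and $\pd{z}{\sigma_q} = \epsilon$ (element-wise), the gradients with respect to $\varphi$ can flow through the sample $z$.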
\subsubsection{Improvements of VAE}
\begin{itemize}
	\item \textbf{Encoder distribution}
	\begin{itemize}
		\item Modeling $q(z|x)$ as Gaussian makes training and implementation easy, but assumes that true posterior is also Gaussian, or can be at least approximated by one
		\item Simple option: use different task-specific distribution like e.g. hyperspherical, however not always suitable
		\item We can improve the complexity of this posterior by plugging in a Normalizing flow on top of the encoder output
		\begin{equation*}
		\begin{split}
		z_0 \sim q_0(z|x) & = \mathcal{N}(z|\mu(x), \text{diag}(\sigma^2(x)))\\
		q_K(z_K|x) & = q_0(z_0|x) \cdot \prod_{k=1}^{K}\left|\text{det}\pd{f_k(z_{k-1})}{z_{k-1}}\right|^{-1}\\
		\end{split}
		\end{equation*}
		\item The ELBO is extended by an additional term during training
		$$\text{ELBO} = \mathbb{E}_{q_{\varphi}(z|x)}\left[\log p_{\theta}(x|z)\right] - \text{KL}\left(q_{\varphi}(z|x)||p_{\lambda}(z)\right) + \mathbb{E}_{z_0 \sim q_0(z_0|x)}\left[\sum_{k=1}^{K} \log \left|\text{det}\pd{f_k(z_{k-1})}{z_{k-1}}\right|\right]$$
	\end{itemize}
	\item \textbf{Prior optimization}
	\begin{itemize}
		\item We assume a prior $p(z)$ which is for example Gaussian, but cannot make sure that every point of the prior actually has a realistic counterpart in the original $x$ space
		\item The optimal prior is the averaged distribution over all data samples: $q^{*}(z) = \frac{1}{N}\sum_{n=1}^{N} q_{\varphi}(z|x_n)$
		\item However, summing over all data points is infeasible. Thus, approximate it by $K$ pseudo-inputs $u_k$ that are trained via standard SGD in the framework:
		$$p_\lambda(z) = \frac{1}{K} \sum_{k=1}^{K} q_{\varphi}(z|u_k)$$
	\end{itemize}
	
\end{itemize}
\subsection{Normalizing flows}
\begin{itemize}
	\item VAE cannot model $p(x)$ directly because of the intractable formulation ($p(x) = \int p(x,z)dz$)
	\item Normalizing Flows solve that problem by using a series of invertible transformations that allow more complex latent distributions than a Gaussian
	\item The models can therefore be trained by directly maximizing the log likelihood instead of using the ELBO or similar
	\item A normalizing flow consists of multiple flows that transform a simple Gaussian distribution step by step into the data distribution (see Figure~\ref{fig:NF_concept})
	\item Every flow shifts the probability mass specified by parameters (determined by e.g. a NN, see Figure~\ref{fig:NF_density_shift})
	\begin{figure}[ht!]
		% NF_density_shift.png
		\centering
		\begin{subfigure}{0.7\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/NF_concept.png}
			\caption{General concept of stacking multiple flows}
			\label{fig:NF_concept}
		\end{subfigure}
		\hspace{8mm}
		\begin{subfigure}{0.2\textwidth}
			\centering
			\includegraphics[width=\textwidth]{figures/NF_density_shift.png}
			\caption{Shifting density}
			\label{fig:NF_density_shift}
		\end{subfigure}
		\caption{Outline of how a normalizing flow works}
		\label{fig:NF}
	\end{figure} 
	\item Mathematically, we can define a normalizing flow by:
	\begin{equation*}
		\begin{split}
			x & = z_K = f_K \circ f_{K-1} \circ ... \circ f_1 (z_0) \to z_i = f_i(z_{i-1})\\
			p(z_i) & = p(z_{i-1}) \cdot \left|\det \pd{f_{i}^{-1}}{z_i}\right| \implies p(x) = p(z_0) \cdot \prod_{i=1}^{K} \left|\det \pd{f_{i}^{-1}}{z_i}\right|\\
			\log p(x) & = \log p(z_0) - \sum_{i=1}^{K} \log \left|\det \pd{f_{i}}{z_{i-1}}\right|
		\end{split}
	\end{equation*}
	\item Requirements: $f$ must be invertible (dimensions of $x$ and $z$ equal), and the determinant of the Jacobian must be easy to compute (e.g. for a triangular Jacobian)
\end{itemize}
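As a minimal worked example of the change of variables (a sketch assuming a single, one-dimensional affine flow $f(z) = a\cdot z + b$ with $a\neq 0$ and a standard Gaussian base distribution; $a$ and $b$ are hypothetical parameters introduced for illustration):
\begin{equation*}
	\begin{split}
		x & = f(z_0) = a\cdot z_0 + b, \hspace{3mm} z_0 = f^{-1}(x) = \frac{x - b}{a}, \hspace{3mm} \pd{f}{z_0} = a\\
		\log p(x) & = \log \mathcal{N}\left(\frac{x-b}{a}\,\Big|\,0, 1\right) - \log |a|
	\end{split}
\end{equation*}
This is exactly the log density of $\mathcal{N}(x|b, a^2)$: stretching the space by $|a|>1$ lowers the density by the same factor, so that $p(x)$ still integrates to one. Stacking several such flows with learned, non-linear components applies the same correction term once per flow.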

================================================
FILE: Deep_Learning/dl_intro.tex
================================================
\section{Introduction}
\subsubsection{Perceptron}
\begin{itemize}
	\item Single perceptron weights every input with a weight, and adds a bias term
	\item Step function as output: if the input sum is greater than zero, then the output is 1, else 0 (or -1)
	\item Problem: can only learn linearly separable problems and not e.g. XOR
	\item Overcome by multi-layer perceptron 
\end{itemize}

================================================
FILE: Deep_Learning/dl_modularity.tex
================================================
\section{Modular Learning}
\begin{itemize}
	\item \textit{Definition}: A family of \textcolor{green}{parametric}, \textcolor{lightred}{non-linear} and \textcolor{blue}{hierarchical} \textcolor{orange}{representation learning functions}, which are \textcolor{red}{massively optimized with stochastic gradient descent} to \textcolor{purple}{encode domain knowledge}, i.e. domain invariances, stationarity.
	% \item Although with two-layer (shallow) network, we can approximate all possible functions, a deep architecture tends to be more efficient and generalize better
	\item A neural network is a series of hierarchically connected functions $\Rightarrow$ Directed Acyclic graph
	\item Note that it is not allowed to have loops except over time/additional dimension
\end{itemize}
\begin{figure}[ht!]
	\centering
	\includegraphics[width=0.2\textwidth]{figures/modularity_example_network.png}
	\caption{Example network with interweaved connections. The architecture can be made arbitrarily complex, and can also include recurrent connections.}
	\label{fig:modularity_example_network}
\end{figure}
\subsection{Module}
\begin{itemize}
	\item A module is the simplest mathematical component in a NN, and can be expressed by $a=h(x;w)$ where $a$ is the output, $x$ the input, $w$ trainable parameters and $h$ an activation function
	\item $w$ mostly learned by gradient-based methods, usually maximizing the likelihood
	\begin{itemize}
		\item ML solution: $w^{*} = \arg\max\limits_{w}\prod\limits_{x,y}p_{model}\left(y|x;w\right)$
		\item For gradient-based methods, we can minimize the negative log likelihood:\\ $\mathcal{L}(w) = -\mathbb{E}_{x,y\sim \tilde{p}_{data}}\left[\log p_{model}\left(y|x;w\right)\right]$
		\item If output is Gaussian, we would get the $\ell_2$ norm
		\item If output is Laplacian, we would get the $\ell_1$ norm
	\end{itemize} 
	\item Using a loss function that matches the output distribution of the network helps, because:
	\begin{itemize}
		\item It makes math simpler (exponential cancels out)
		\item Better numerical stability ($\log$ with very small/negative values, helps for e.g. Softmax+CrossEntropy)
		\item Makes gradients larger as exponential-like activations often lead to saturation, which means gradients are almost 0 (but not with $\log$)
	\end{itemize}
	\item It is important that the input and output distribution of every module match, as otherwise we get inconsistent behavior, which makes it harder to learn
	\begin{itemize}
		\item For activation functions, this means we prefer them to be mostly activated around the origin and centered
		\item Otherwise, e.g. ReLU can become a linear unit or set everything to 0
	\end{itemize}
\end{itemize}
\subsubsection{Example modules}
\begin{itemize}
	\item \textbf{Linear module}: $a = wx$
	\begin{itemize}
		\item Simple gradients $\frac{\partial a}{\partial w} = x$, $\frac{\partial a}{\partial x} = w$
		\item No activation saturation $\Rightarrow$ strong, reliable gradients
	\end{itemize}
	\item \textbf{Rectified Linear Unit}: $a = \max(0,x)$
	\begin{itemize}
		\item Gradient is step function. $\pd{a}{x} = \begin{cases}
		0 & \text{ if } x\leq 0\\
		1 & \text{ if } x > 0\\
		\end{cases}$
		\item Hence, strong, fast gradients
		\item However, dead neurons might be an issue when the initialization/weights produce outputs smaller than 0 for every input 
		\item Different variations like LeakyReLU, Softplus ($\ln(1+e^{x})$), NoisyReLU exist
	\end{itemize}
	\item \textbf{Sigmoid}: $a=\sigma(x)=\frac{1}{1+e^{-x}}$
	\begin{itemize}
		\item Gradient easy to calculate: $\pd{a}{x} = \sigma(x)\left(1-\sigma\left(x\right)\right)$
		\item Can be used as output function for probabilities in $[0,1]$
		\item Saturates and has small gradients
		\item Not centered around origin $\Rightarrow$ not good choice for within a network
	\end{itemize}
	\item \textbf{Tanh}: $a=\tanh\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$
	\begin{itemize}
		\item Gradients $\pd{a}{x}=1-\tanh\left(x\right)^2$
		\item Saturates as well, but has slightly higher gradients than sigmoid and is centered around origin
	\end{itemize}
	\item \textbf{Softmax}: $a^{(k)} = \text{softmax}\left(x^{(k)}\right) = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$
	\begin{itemize}
		\item Probability distribution over multiple classes
		\item Softmax trick for numerical stability: subtract $\mu = \max_j x^{(j)}$ from all inputs, which does not change the output: $\frac{e^{x^{(k)}-\mu}}{\sum_j e^{x^{(j)}-\mu}}$
	\end{itemize}
\end{itemize}
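The remark that the exponential cancels out when the loss matches the output distribution can be made concrete for Softmax with cross-entropy. This is a short derivation assuming a one-hot (or otherwise normalized) target $y$ with $\sum_k y^{(k)} = 1$, where $y$ is notation introduced here:
\begin{equation*}
	\begin{split}
		\mathcal{L} & = -\sum_k y^{(k)} \log a^{(k)} = -\sum_k y^{(k)} \left(x^{(k)} - \log \sum_j e^{x^{(j)}}\right)\\
		\pd{\mathcal{L}}{x^{(i)}} & = -y^{(i)} + \left(\sum_k y^{(k)}\right)\cdot \frac{e^{x^{(i)}}}{\sum_j e^{x^{(j)}}} = a^{(i)} - y^{(i)}
	\end{split}
\end{equation*}
The gradient is simply the difference between prediction and target. It does not saturate for confident but wrong predictions, which is the benefit of combining the $\log$ of the loss with the exponential of the activation.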
\subsection{Backpropagation}
\begin{itemize}
	\item Calculate gradients of all parameters in the network based on the loss on the last layer
	\item Principle of chain rule: $\pd{z}{x} = \sum_i \chain{z}{y_i}{x}$ (gradients from all possible paths)
	\begin{itemize}
		\item In vector notation: $\nabla_{\bm{x}} \bm{z} = \left(\pd{\bm{y}}{\bm{x}}\right)^T \cdot \nabla_{\bm{y}} \bm{z}$ with Jacobian $\pd{\bm{y}}{\bm{x}} = \left[\begin{array}{ccc}
		\pd{y_1}{x_1} & \pd{y_1}{x_2} & \pd{y_1}{x_3} \\[5pt]
		\pd{y_2}{x_1} & \pd{y_2}{x_2} & \pd{y_2}{x_3} \\
		\end{array}\right]$
	\end{itemize}
	\item Steps of Backpropagation:
	\begin{enumerate}
		\item Compute forward propagations for all layers recursively:
		$a^{(l)} = h^{(l)}\left(x^{(l)}\right) \text{ and } x^{(l+1)} = a^{(l)}$
		\item Compute the reverse path. 
		$$\pd{\mathcal{L}}{a^{(l)}} = \left(\pd{a^{(l+1)}}{x^{(l+1)}}\right)^T \cdot \pd{\mathcal{L}}{a^{(l+1)}}, \hspace{4mm} \pd{\mathcal{L}}{\theta^{(l)}} = \pd{a^{(l)}}{\theta^{(l)}} \cdot \left(\pd{\mathcal{L}}{a^{(l)}}\right)^T$$
		\item Use gradients $\pd{\mathcal{L}}{\theta^{(l)}}$ to update parameters via SGD
	\end{enumerate}
\end{itemize}
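The three steps can be traced on a tiny example. This is a sketch assuming a scalar network with a linear module $a^{(1)} = w\cdot x$, a sigmoid module $a^{(2)} = \sigma\left(a^{(1)}\right)$, the loss $\mathcal{L} = \frac{1}{2}\left(a^{(2)} - y\right)^2$, and the hypothetical values $x=1$, $w=0$, $y=1$, $\eta=1$:
\begin{equation*}
	\begin{split}
		\text{Forward:} \hspace{3mm} & a^{(1)} = 0, \hspace{3mm} a^{(2)} = \sigma(0) = 0.5, \hspace{3mm} \mathcal{L} = 0.125\\
		\text{Backward:} \hspace{3mm} & \pd{\mathcal{L}}{a^{(2)}} = a^{(2)} - y = -0.5\\
		& \pd{\mathcal{L}}{a^{(1)}} = \sigma\left(a^{(1)}\right)\left(1 - \sigma\left(a^{(1)}\right)\right) \cdot \pd{\mathcal{L}}{a^{(2)}} = 0.25 \cdot (-0.5) = -0.125\\
		& \pd{\mathcal{L}}{w} = x \cdot \pd{\mathcal{L}}{a^{(1)}} = -0.125\\
		\text{Update:} \hspace{3mm} & w \leftarrow w - \eta \cdot \pd{\mathcal{L}}{w} = 0.125
	\end{split}
\end{equation*}
Every module only needs its local derivative and the gradient arriving from the module above it, which is why the gradients of arbitrarily deep networks can be computed in a single backward sweep.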

================================================
FILE: Deep_Learning/dl_optimization.tex
================================================
\section{Deep Learning Optimizations}
\begin{itemize}
	\item Pure optimization has a very direct goal, namely finding the optimum. However, in Machine Learning, we define a training goal. Thus, the ``optimal'' parameters might not necessarily be the optimum (e.g. overfitting)
\end{itemize}
\subsection{Stochastic Gradient Descent}
\begin{itemize}
	\item Move the weights in the direction of steepest descent, i.e. along the negative gradient
	$$w_{t+1} = w_{t} - \eta_t \nabla_{w} \mathcal{L}$$
	\item \textit{Gradient descent}: gradients on the full dataset. However:
	\begin{itemize}
		\item Dataset is mostly too large for this
		\item No real guarantee that this leads to a good optimum and/or it will converge faster
	\end{itemize}
	\item \textit{Stochastic gradient descent}: approximate gradients by averaging over a small batch. 
	\begin{itemize}
		\item Standard error is inversely proportional to the square root of the number of elements $m$ in a batch: $\sigma / \sqrt{m}$.
		\item Noisy gradients help to escape local minima, acts as regularization
		\item Mini-batches sample roughly representative gradients from the dataset. This is sufficient, as the training data is itself just a rough approximation of what the test data might look like (optimum on training $\neq$ optimum on test)
		\item SGD is faster, especially in first iterations
		\item SGD is able to adapt with dynamically changing datasets
	\end{itemize}
	\item \textit{Ill conditioning}: if gradients are large, applying them can lead to worse performance. This is the case if the curvature (second-order derivative) is high, so that the gradient changes quickly along the update step
\end{itemize}
\subsection{Advanced optimizations}
\subsubsection{Gradient-based optimization}
\begin{itemize}
	\item \textit{Pathological curvatures}: move through a ravine towards minimum. SGD tends to oscillate between the walls because they have high gradients
	\begin{itemize}
		\item Second order optimization can help a lot for pathological curvatures: $$w_{t+1} = w_{t} - H_{\mathcal{L}}^{-1} \eta_t g_t$$
		\item Hessian $H_{\mathcal{L}}^{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i\partial w_j}$ works as adaptive learning rate per parameter
		\item However, infeasible in practice because the Hessian gets very large (quadratic in the number of parameters)
	\end{itemize}
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/optimization_pathological_curvatures.png}
		\caption{Pathological curvature}
		\label{fig:optimization_pathological_curvatures}
	\end{figure}
	\item \textbf{Momentum}: maintain \textit{momentum} from previous parameter updates to dampen the oscillations.
	\begin{equation*}
		\begin{split}
			u_{t+1} & = \gamma u_{t} - \eta_t g_t \\
			w_{t+1} & = w_{t} + u_{t+1}
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Works as an exponential averaging $\Rightarrow$ more robust gradients, faster convergence
		\item $\gamma$ might be initialized lower and then increased over time to $0.9$
		\item Standard values for $\gamma$ are between $0.5$ and $0.9$ (note that a lower learning rate should be used compared to standard SGD)
	\end{itemize}
	\item \textbf{RMSprop}: adapting learning rate on current loss surface.
	\begin{equation*}
		\begin{split}
			r_t & = \alpha \cdot r_{t-1} + \left(1 - \alpha\right) \cdot g_t^2\\
			\eta_t & = \frac{\eta}{\sqrt{r_t} + \epsilon} \\
			w_{t+1} & = w_{t} - \eta_t \cdot g_t\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item $r_t$ is the (exponentially) averaged squared gradient, describing the size of the gradients (per dimension!)
		\item The learning rate is then adapted by $\eta_t$ at every time step for each dimension independently
		\item $\epsilon$ to prevent numerical instability and too large learning rates
		\item With the adapted learning rate, we update our weights with SGD
	\end{itemize}
	\item \textbf{Adam}: Combining adaptive learning rate and momentum
	\begin{equation*}
		\begin{split}
			m^{(t)} & = \beta_1 m^{(t-1)} + (1 - \beta_1)\cdot g^{(t)}\\
			v^{(t)} & = \beta_2 v^{(t-1)} + (1 - \beta_2)\cdot \left(g^{(t)}\right)^2\\
			\hat{m}^{(t)} & = \frac{m^{(t)}}{1-\beta^{t}_1}, \hat{v}^{(t)} = \frac{v^{(t)}}{1-\beta^{t}_2}\\
			w^{(t)} & = w^{(t-1)} - \frac{\eta}{\sqrt{\hat{v}^{(t)}} + \epsilon}\circ \hat{m}^{(t)}\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Keeps track of the exponentially averaged gradient $m^{(t)}$ (momentum), and the exponentially averaged squared gradient $v^{(t)}$ (also known as velocity)
		\item The hyperparameters $\beta_1$ and $\beta_2$ correspond to $\gamma$ and $\alpha$ respectively from the previous approaches
		\item The adaptive learning rate is expressed by $\hat{v}^{(t)}$, and the exponentially averaged gradients by $\hat{m}^{(t)}$
		\item The division is to remove the bias of $m^{(0)}$ and $v^{(0)}$ being zero. Note that $\beta_1^t$ means the value of $\beta_1$ to the power $t$, and not at time step $t$
		\item Adam is in general better for complex models, but might fail on very simple tasks compared to simple methods like SGD
	\end{itemize}
	\item \textbf{Adagrad}: adapting learning rate based on both gradient scale and frequency of updates
	\begin{equation*}
		\begin{split}
			G_t & = G_{t-1} + \text{diag}\left(g_t^2\right)\\
			w_{t+1} & = w_{t} - \frac{\eta}{\sqrt{G_t + \epsilon}}\cdot g_t\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Very similar to RMSprop, but sums the scales over all time steps ($G_t$) instead of exponentially averaging 
		\item Less sensitive to learning rate tuning, but the effective learning rate shrinks over training time and anneals to 0
	\end{itemize} 
	\item \textbf{Nesterov momentum}: use the future gradient, i.e. the gradient evaluated at the look-ahead position $w_t + \gamma u_t$, instead of the current gradient. Leads to better convergence in theory
\end{itemize}
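The effect of the bias correction in Adam is easiest to see in the very first step. This is a worked sketch assuming $m^{(0)} = v^{(0)} = 0$, a scalar gradient $g = g^{(1)} \neq 0$, and the bias-corrected $\hat{v}^{(t)}$ in the denominator of the update:
\begin{equation*}
	\begin{split}
		m^{(1)} & = (1-\beta_1)\cdot g, \hspace{3mm} v^{(1)} = (1-\beta_2)\cdot g^2\\
		\hat{m}^{(1)} & = \frac{(1-\beta_1)\cdot g}{1-\beta_1} = g, \hspace{3mm} \hat{v}^{(1)} = \frac{(1-\beta_2)\cdot g^2}{1-\beta_2} = g^2\\
		w^{(1)} & = w^{(0)} - \eta \cdot \frac{g}{|g| + \epsilon} \approx w^{(0)} - \eta \cdot \text{sign}(g)
	\end{split}
\end{equation*}
The first step therefore has a size of about $\eta$, independent of the scale of the gradient. Without the correction, and with the commonly used values $\beta_1 = 0.9$ and $\beta_2 = 0.999$ (an assumption, they are not stated above), the ratio would be $\frac{0.1\cdot g}{\sqrt{0.001}\cdot |g|} \approx 3.16\cdot \text{sign}(g)$, i.e. a more than three times larger step.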
\subsubsection{Bayesian optimization}
\begin{itemize}
	\item Gradient-based optimizations have the problem of getting stuck in local minima
	\item Bayesian optimization is a gradient-free, educated trial and error guesser that works in lower dimensional spaces (up to 1000, but mostly 20 to 50 parameters)
	\item Determines the next point/parameter values to evaluate based on variance/uncertainty, and expected/predictive value. 
	\item Can be used for e.g. network architecture search
\end{itemize}
\subsection{Normalization}
\begin{itemize}
	\item Data pre-processing
	\begin{itemize}
		\item Center data around 0 (activation functions are designed for that)
		\item Scale input variables to have similar diagonal covariances (not if features are differently important)
		\item De-correlate features if there is no inductive bias (e.g. sequence over time)
	\end{itemize}
	\item \textbf{Batch normalization}: ensure Gaussian distribution of features over batches at every module input
	\begin{equation*}
		\begin{split}
			\mu_B = \frac{1}{m} \sum\limits_{i=1}^{m} x_i, &\hspace{5mm} \sigma_B^2 = \frac{1}{m} \sum\limits_{i=1}^{m} \left(x_i - \mu_B\right)^2 \\
			\hat{x}_i & = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
			\hat{y}_i & = \gamma \cdot \hat{x}_i + \beta
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item Normalize features to zero mean and unit variance ($\hat{x}_i \sim \mathcal{N}(0,1)$), then rescale with trainable parameters $\gamma$ (scale) and $\beta$ (shift).
		\item Helps the optimizer to control mean and variance of input distribution, and reduces effects of 2nd order between layers $\Rightarrow$ easier, faster learning 
		\item Acts as regularizer as distribution depends on mini-batch and therefore introduces noise
		\item During testing, take a moving average of the last training steps and use those for $\mu_B$ and $\sigma_B^2$
	\end{itemize}
\end{itemize}
\subsection{Regularization}
\begin{itemize}
	\item Weight regularization needed to prevent overfitting
	\item \textbf{$\ell_2$-regularization}: Introduce objective term for minimizing weights
	$$w^{*}=\arg\min_w \mathcal{L} + \frac{\lambda}{2}\sum_l ||w_l||^2$$
	\begin{itemize}
		\item When using simple (stochastic) gradient descent, $\ell_2$ regularization is the same as weight decay: $$w_{t+1} = \left(1-\lambda \eta_t\right) w_{t} - \eta_t \nabla_{w} \mathcal{L}$$
	\end{itemize}
	\item \textbf{$\ell_1$-regularization}: use $\ell_1$ objective, introduces sparse weights
	$$w^{*}=\arg\min_w \mathcal{L} + \lambda \sum_l ||w_l||_1$$
	\item \textbf{Early stopping}: stop the training when the validation error increases while the training loss continues to decrease. Can be seen as regularization, as the number of training steps is reduced
	\item \textbf{Dropout}: setting activations randomly to 0 during training with probability $p$ (mostly between $0.1$ and $0.5$)
	\begin{itemize}
		\item During test time, every activation is reweighted by $1 - p$
		\item Reduces co-adaptations/-dependencies between neurons because none can solely depend on the other
		\item Neurons get more robust $\Rightarrow$ reduces overfitting
		\item Effectively, a different network architecture is used every iteration. Testing can be seen as using model ensemble
	\end{itemize}
\end{itemize}
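Two of the statements above can be checked with one line each. For plain (stochastic) gradient descent, the gradient of the $\ell_2$ term is $\nabla_w \frac{\lambda}{2}\sum_l ||w_l||^2 = \lambda w$, which gives the weight decay form. For Dropout, let $m \sim \text{Bernoulli}(1-p)$ be the random mask that keeps an activation $a$ ($m$ is notation introduced here):
\begin{equation*}
	\begin{split}
		w_{t+1} & = w_t - \eta_t \left(\nabla_w \mathcal{L} + \lambda w_t\right) = \left(1 - \lambda \eta_t\right) w_t - \eta_t \nabla_w \mathcal{L}\\
		\mathbb{E}_m\left[m \cdot a\right] & = (1-p)\cdot a + p \cdot 0 = (1-p)\cdot a
	\end{split}
\end{equation*}
Reweighting every activation by $1-p$ at test time thus matches the expected activation seen during training. Note that the equivalence between $\ell_2$ regularization and weight decay is only shown here for plain gradient descent; it does not hold in general for optimizers with adaptive learning rates.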
\subsection{Weight initialization}
\begin{itemize}
	\item There are two forces on the weight magnitude: small weights are needed to keep data around origin, but large weights are required to have strong learning signals
	\item Initialization should preserve variance of activations (input variance $\approx$ output variance to keep distribution between modules same)
	\item Depends on non-linearity and data normalization
	\item \textbf{Xavier initialization}: to maintain data variance, the variance of the weights must be $1/d$ where $d$ is the number of input neurons $\Rightarrow$ sample weight values from a zero-mean Gaussian with standard deviation $\sqrt{1/d}$, i.e. $w\sim\mathcal{N}(0,1/d)$
	\item \textbf{Initialization for ReLU}: ReLU sets half of the output neurons to 0 $\Rightarrow$ double the weight variance to compensate for the flat zero area: $w\sim\mathcal{N}(0,2/d)$ (standard deviation $\sqrt{2/d}$)
\end{itemize}
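The value $1/d$ follows from a short variance calculation. This is a sketch assuming a linear module $a = \sum_{i=1}^{d} w_i x_i$ with independent, zero-mean weights and inputs that share the variances $\text{Var}(w)$ and $\text{Var}(x)$:
\begin{equation*}
	\begin{split}
		\text{Var}(a) & = \sum_{i=1}^{d} \text{Var}(w_i x_i) = d \cdot \text{Var}(w) \cdot \text{Var}(x) \overset{!}{=} \text{Var}(x) \implies \text{Var}(w) = \frac{1}{d}\\
		\text{ReLU:} \hspace{3mm} \text{Var}(a) & = d \cdot \text{Var}(w) \cdot \mathbb{E}\left[x^2\right] = d \cdot \text{Var}(w) \cdot \frac{1}{2}\text{Var}(a_{\text{prev}}) \implies \text{Var}(w) = \frac{2}{d}
	\end{split}
\end{equation*}
In the ReLU case, the input is $x = \max(0, a_{\text{prev}})$ with a symmetric, zero-mean pre-activation $a_{\text{prev}}$ of the previous layer (notation introduced here), so that only half of its second moment survives: $\mathbb{E}\left[x^2\right] = \frac{1}{2}\text{Var}(a_{\text{prev}})$.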

================================================
FILE: Deep_Learning/dl_rnn.tex
================================================
\section{Recurrent and Graph Neural Networks}
\subsection{Backpropagation through time}
\begin{itemize}
	\item Sequences are of arbitrary length. Standard networks like CNN mostly work on fixed input dimensionality
	\item Usage of memory with shared weights $\theta$: $$c_{t+1} = h_{\theta}\left(x_{t+1}, c_{t}\right) = h_{\theta}\left(x_{t+1}, h_{\theta}\left(x_{t}, c_{t-1}\right)\right) = ...$$
	\item Simple RNN cell: 
	\begin{equation*}
		\begin{split}
			c_t & = \tanh\left(U\cdot x_t + W \cdot c_{t-1}\right) \\
			y_t & = \text{softmax}\left(V \cdot c_{t}\right) \\
			\loss & = -\sum\limits_{t=1}^{T} y_t^{*} \log y_t \\
		\end{split}
	\end{equation*}
	\item Gradient for output weights $V$:
	\begin{equation*}
		\begin{split}
			\pd{\loss_t}{V} & = \chain{\loss_t}{y_t}{V} = \left(y_t - y_t^{*}\right) \cdot \left(c_t\right)^T\\
			\pd{\loss}{V} & = \sum\limits_{t=1}^{T} \pd{\loss_t}{V}\\
		\end{split}
	\end{equation*}
	\item Gradient for memory weights $W$: 
	\begin{equation*}
		\begin{split}
			\pd{\loss_t}{W} & = \chain{\loss_t}{y_t}{c_t}\pd{c_t}{W}\\
		\end{split}
	\end{equation*}
	\begin{itemize}
		\item In $\pd{c_t}{W}$, $c_t$ depends on $c_{t-1}$ which again depends on $W$. Thus, we have a recurrence in the gradient calculation:
		$$\pd{\loss_t}{W} = \sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\chain{c_t}{c_k}{W}$$
		where $\pd{c_k}{W}$ only models the dependency exactly at time step $k$
		\item The gradient $\pd{c_t}{c_k}$ can be determined by the chain rule: $\pd{c_t}{c_k} = \prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}$
		\item All in all, the final gradient is:
		\begin{equation*}
			\begin{split}
				\pd{\loss}{W} & = \sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\left(\prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}\right)\pd{c_k}{W}
			\end{split}
		\end{equation*}
	\end{itemize}
	\item Gradient for input weights $U$ very similar to $W$: 
	\begin{equation*}
		\begin{split}
			\pd{\loss}{U} & = \sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t} \chain{\loss_t}{y_t}{c_t}\left(\prod\limits_{i=k+1}^{t} \pd{c_i}{c_{i-1}}\right)\pd{c_k}{U}
		\end{split}
	\end{equation*}
	\item The problem with RNNs is that the gradients at time step $t$ depend on $c_{t-1}$, which in turn depends on the weights $w$. However, the gradients are calculated under the assumption that $w$ stays the same for all previous time steps.
	\item This error can easily accumulate over many time steps so that in very long sequences, the gradients for the last steps are inaccurate
	\item Possible remedy: reduce the learning rate or perform fewer updates, but this leads to slower training
\end{itemize}
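The double sum for the gradient with respect to $W$ can be verified on a toy example. The sketch below assumes a scalar RNN $c_t = \tanh(u x_t + w c_{t-1})$ and, as a simplification, a squared-error loss on $c_t$ instead of the softmax output; the names are illustrative. It compares the double-sum gradient against a finite-difference estimate.

```python
import math

def forward(u, w, xs, c0=0.0):
    """Run c_t = tanh(u*x_t + w*c_{t-1}) and return [c_0, c_1, ..., c_T]."""
    cs = [c0]
    for x in xs:
        cs.append(math.tanh(u * x + w * cs[-1]))
    return cs

def loss(u, w, xs, targets):
    """Sum over time of 0.5 * (c_t - target_t)^2 (simplified loss, an assumption)."""
    cs = forward(u, w, xs)
    return sum(0.5 * (c - y) ** 2 for c, y in zip(cs[1:], targets))

def grad_w(u, w, xs, targets):
    """dL/dw via the double sum over t and k."""
    cs = forward(u, w, xs)
    total = 0.0
    for t in range(1, len(cs)):
        dloss_dct = cs[t] - targets[t - 1]
        for k in range(1, t + 1):
            # Product of dc_i/dc_{i-1} for i = k+1, ..., t.
            prod = 1.0
            for i in range(k + 1, t + 1):
                prod *= (1.0 - cs[i] ** 2) * w
            # Immediate dependency of c_k on w, treating c_{k-1} as constant.
            dck_dw = (1.0 - cs[k] ** 2) * cs[k - 1]
            total += dloss_dct * prod * dck_dw
    return total

u, w = 0.7, 0.9
xs = [0.5, -1.0, 0.3, 0.8]
targets = [0.1, -0.2, 0.0, 0.4]
eps = 1e-6
numeric = (loss(u, w + eps, xs, targets) - loss(u, w - eps, xs, targets)) / (2 * eps)
print(grad_w(u, w, xs, targets), numeric)  # the two values should agree closely
```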
\subsubsection{Vanishing gradients}
\begin{itemize}
	\item The exact derivations can be found in \href{http://proceedings.mlr.press/v28/pascanu13.pdf}{this paper}
	\item We assume an alternative formulation for simplicity here: $c_t = W \cdot \sigma(c_{t-1}) + U \cdot x_{t}$ where $\sigma$ is an arbitrary activation function. Then, the partial derivative between two time steps is\\ $\pd{c_{t}}{c_{k}} = \prod\limits_{i=k+1}^{t} \pd{c_{i}}{c_{i-1}} = \prod\limits_{i=k+1}^{t} W^T \cdot \text{diag}\left(\pd{\sigma\left(c_{i-1}\right)}{c_{i-1}}\right)$
	\item Hence, the magnitude of $\pd{c_{t+1}}{c_{t}}$ is bounded by this derivative: 
	$$\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert \leq \left\lVert W^T\right\rVert \cdot \left\lVert \text{diag}\left(\pd{\sigma\left(c_t\right)}{c_t}\right)\right\rVert$$
	\item In case the derivative of our non-linearity is bounded by a value $\gamma$ (which is 1 in case of tanh), we know that gradients vanish if the norm of the weight matrix is lower than $1/\gamma$:
	$$\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert \leq \left\lVert W^T\right\rVert \cdot \left\lVert \text{diag}\left(\pd{\sigma\left(c_t\right)}{c_t}\right)\right\rVert < \frac{1}{\gamma}\gamma = 1$$
	\item This bound is raised to the power of the number of time steps. Thus, long sequences suffer even more from vanishing gradients $\Rightarrow$ the network learns only short-term relationships
	\item If however $\left\lVert \pd{c_{t+1}}{c_{t}}\right\rVert > 1$ because of $\left\lVert W^T\right\rVert \gg 1/\gamma$, then we can get exploding gradients
	\item Quick fix for exploding gradients: clip the gradient norm. However, the opposite effect can then occur, where we only focus on long-term relationships
\end{itemize}
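Both effects can be made concrete with a few lines of Python. The sketch below evaluates the bound $(\lVert W\rVert \gamma)^{n}$ for $n$ time steps and shows gradient-norm clipping; the function names and the example numbers are illustrative assumptions.

```python
import math

def gradient_bound(weight_norm, gamma, steps):
    """Upper bound (||W|| * gamma)^steps on the norm of dc_t/dc_k with t - k = steps."""
    return (weight_norm * gamma) ** steps

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

# tanh has gamma = 1: a weight norm slightly below 1 makes the bound vanish,
# a weight norm slightly above 1 allows the gradient to explode.
print(gradient_bound(0.9, 1.0, 50))
print(gradient_bound(1.1, 1.0, 50))
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```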
\subsubsection{Long Short-Term Memory}
\begin{itemize}
	\item Preventing vanishing gradients by gate mechanism
	\item By simply adding features to memory and limiting memory by sigmoid we can get strong gradients for any sequence length. Note that the gradients get lower in expectation because sigmoid has mean $0.5$. Nevertheless, if long-term dependencies are important, the network can learn them now
	\item \textit{Forget gate}: regulating how much information is kept from last time step
	\item \textit{Input + candidate gate}: Regulating which, and how much new information should be added given the current time step
	\item \textit{Output gate}: What features are important for the current time step
	\begin{figure}[ht!]
		\centering
		\includegraphics[width=0.3\textwidth]{figures/RNN_LSTM.png}
		\caption{Visualization of an LSTM cell}
		\label{fig:RNN_LSTM}
	\end{figure}
\end{itemize}
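The gates can be illustrated with a single step of a scalar LSTM cell. The summary itself does not state the update equations, so the sketch below assumes the commonly used formulation (sigmoid gates, tanh candidate, additive memory update); the parameter layout and names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old memory is kept
    i = sigmoid(pre("input"))        # how much new information is added
    g = math.tanh(pre("candidate"))  # which new information is proposed
    o = sigmoid(pre("output"))       # which features are exposed at this step
    c = f * c_prev + i * g           # additive memory update
    h = o * math.tanh(c)
    return h, c

# With an open forget gate and a closed input gate the memory is carried over.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=0.3, h_prev=0.1, c_prev=0.7, params=params)
print(h, c)
```

Because the memory update is additive, the derivative of the new memory with respect to the old one is just the forget gate value, which is what allows gradients to survive over long sequences.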
\subsection{Graph Neural Networks}
\begin{itemize}
	\item Perform operation on graph-structured data (e.g. social networks or knowledge graphs)
\end{itemize}
\subsubsection{Deep Walk}
\begin{itemize}
	\item Learning latent representations of vertices in a network
	\item The Deep Walk algorithm consists of two simple steps:
	\begin{enumerate}
		\item Perform random walks on the graph to generate node sequences
		\item Run skip-gram on sequence (with word window) to learn node embeddings
	\end{enumerate}
	\item \textit{Drawback}: algorithm has to be re-run if a new node is added, not useful for dynamic graphs
\end{itemize}
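The first step of Deep Walk, and the construction of skip-gram training pairs from the walks, can be sketched as follows. This assumes an adjacency-list graph and uniform random walks; training the skip-gram model itself is omitted, and all names are illustrative.

```python
import random

def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append(walk)
    return walks

def skipgram_pairs(walk, window):
    """(center, context) pairs within the given window, as used by skip-gram."""
    pairs = []
    for i, center in enumerate(walk):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
walks = random_walks(graph, walk_length=5, walks_per_node=2)
print(walks[0])
print(skipgram_pairs(["a", "b", "c"], window=1))
```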
\subsubsection{GraphSage}
\begin{itemize}
	\item In every iteration, aggregate information of neighbors and the node itself to generate new embeddings
	\item Possible aggregation techniques are taking the mean (with weight and non-linearity applied to it afterwards), max pooling, or using an LSTM
\end{itemize}
\subsubsection{Graph Convolutional Networks}
\begin{itemize}
	\item A GNN layer takes as input the embeddings for every node $H^{(l)}$ and the adjacency matrix $A$, and creates new embeddings $H^{(l+1)}$
	\item Graph convolutional layers use for this a matrix multiplication where weights are shared over nodes
	\item In the simplest form, a GCN layer can be defined as $h(H^{(l)}, A) = \sigma\left(A H^{(l)} W^{(l)}\right)$
	\item To make it more effective, we add self-connections via the identity matrix, $\hat{A} = A + I$, so that nodes use their old embeddings as well, and normalize by the node degrees instead of summing over all neighbors (with $D$ the diagonal degree matrix of $\hat{A}$):
	$$h(H^{(l)}, A) = \sigma\left(D^{-1/2}\hat{A}D^{-1/2} H^{(l)} W^{(l)}\right)$$
\end{itemize}
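A single normalized GCN layer can be written out with plain nested lists. The sketch below assumes ReLU as the non-linearity $\sigma$ and computes the degrees from $\hat{A}$; the helper names are illustrative.

```python
import math

def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcn_layer(A, H, W):
    """Compute ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]  # degrees including the self-connection
    A_norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]  # ReLU as sigma (assumption)

# Two connected nodes with one feature each and an identity weight.
A = [[0, 1], [1, 0]]
H = [[1.0], [3.0]]
W = [[1.0]]
print(gcn_layer(A, H, W))  # each node receives the average of both embeddings
```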

================================================
FILE: Deep_Learning/dl_summary.tex
================================================
\documentclass[a4paper]{article} 
\addtolength{\hoffset}{-2.25cm}
\addtolength{\textwidth}{4.5cm}
\addtolength{\voffset}{-3.25cm}
\addtolength{\textheight}{5cm}
\setlength{\parskip}{0pt}
\setlength{\parindent}{0in}

\usepackage{blindtext} % Package to generate dummy text
\usepackage{charter} % Use the Charter font
\usepackage[utf8]{inputenc} % Use UTF-8 encoding
\usepackage{microtype} % Slightly tweak font spacing for aesthetics
\usepackage[english]{babel} % Language hyphenation and typographical rules
\usepackage{amsthm, amsmath, amssymb, amsfonts} % Mathematical typesetting
\usepackage{float} % Improved interface for floating objects
\usepackage[final, colorlinks = true, 
linkcolor = black, 
citecolor = black]{hyperref} % For hyperlinks in the PDF
\usepackage{graphicx, multicol} % Enhanced support for graphics
\usepackage{xcolor} % Driver-independent color extensions
\usepackage{marvosym, wasysym} % More symbols
\usepackage{rotating} % Rotation tools
\usepackage{subcaption}
\usepackage{wrapfig}
\usepackage{censor} % Facilities for controlling restricted text
\newcommand{\note}[1]{\marginpar{\scriptsize \textcolor{red}{#1}}} % Enables comments in red on margin
\usepackage{bm}
\usepackage{blkarray}
\usepackage{enumitem}
\usepackage{pgfplots}
\usepackage{tikz}

\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\loss}[0]{\mathcal{L}}
\newcommand{\chain}[3]{\frac{\partial #1}{\partial #2}\frac{\partial #2}{\partial #3}}
\newcommand{\eq}[1]{\begin{equation*}\begin{split}#1\end{split}\end{equation*}}
\newcommand{\coderef}[0]{Please find the implementation in the folder with the code files.}
\newcommand{\TODO}[1]{\textbf{\textcolor{red}{#1}}}

\definecolor{green}{RGB}{0,160,0}
\definecolor{blue}{RGB}{0,0,160}
\definecolor{red}{RGB}{160,0,0}
\definecolor{orange}{RGB}{200,160,0}
\definecolor{purple}{RGB}{170,0,200}
\definecolor{cyan}{RGB}{0,200,200}
\definecolor{lightred}{RGB}{200,50,50}

\setcounter{tocdepth}{2}
% Title Page
\title{Summary Deep Learning}
\author{Phillip Lippe}


\begin{document}
\maketitle
\tableofcontents
\newpage

\input{dl_intro.tex}
\input{dl_modularity.tex}
\input{dl_optimization.tex}
\input{dl_convnets.tex}
\input{dl_rnn.tex}
\input{dl_generative_models.tex}
\input{dl_bayesian.tex}
\input{dl_autoregressive.tex}
\input{dl_deep_rl.tex}
\appendix
\newpage
\input{dl_appendix.tex}

\end{document}

================================================
FILE: Information_Retrieval_1/ir_boolean_retrieval.tex
================================================
\section{Boolean Retrieval}
\begin{itemize}
	\item \textbf{Information retrieval} is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
	\item \textbf{Boolean retrieval model} is a model in which the queries are in the form of a Boolean expression of terms. Terms can be combined by the operators \texttt{AND}, \texttt{OR} and \texttt{NOT} 
\end{itemize}
\subsection{Inverted Index}
\begin{itemize}
	\item 
\end{itemize}

================================================
FILE: Information_Retrieval_1/ir_click_models.tex
================================================
\section{Click models}
\begin{itemize}
	\item User clicks can be used as evaluation of IR systems as clicks indicate the relevance of a document
	\item However, clicks are highly biased (positional, textual, attention/visual,...) $\Rightarrow$ click models try to remove these biases and help using clicks for evaluation
	\item Click models are optimized/trained on click logs which record for a given query which documents were clicked
	\item Most models are based on probabilistic graphical models (PGMs) that describe the probability of a click
	\item They are mostly trained by maximum likelihood estimation (MLE) or the EM algorithm
\end{itemize}
\subsection{Random click model}
\begin{itemize}
	\item In random click models, every document on the result page has the same probability of being clicked: $$P(C_u = 1) = \text{const} = \rho$$
	\item Therefore, the model contains only a single parameter, which can be optimized by applying MLE: $$\rho = \frac{\#\text{clicks}}{\#\text{shown docs}}$$
	\item \textit{Advantages}: simple and fast
	\item \textit{Disadvantages}: the random click model does not consider many aspects including the position and content of a document
	\item There are different variations of this model (also called click-through rate models - CTR) considering more aspects
	\begin{itemize}
		\item \textbf{Rank-based CTR} - modeling a probability for every rank on the result page: $P(C_{u_r} = 1) = \rho_r$
		\item \textbf{Query-document CTR} - modeling a probability for every query-document pair in the dataset: $P(C_{u}=1) = \rho_{uq}$
	\end{itemize}
\end{itemize}
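The MLE estimates of the random click model and the rank-based CTR model are simple counts. The sketch below assumes a hypothetical click log in which each session is a list of (document, clicked) pairs ordered by rank.

```python
from collections import defaultdict

def random_click_model(sessions):
    """MLE of rho: number of clicks divided by number of shown documents."""
    shown = sum(len(session) for session in sessions)
    clicks = sum(clicked for session in sessions for _, clicked in session)
    return clicks / shown

def rank_based_ctr(sessions):
    """MLE of rho_r: click-through rate per rank."""
    clicks, shown = defaultdict(int), defaultdict(int)
    for session in sessions:
        for rank, (_, clicked) in enumerate(session, start=1):
            shown[rank] += 1
            clicks[rank] += clicked
    return {rank: clicks[rank] / shown[rank] for rank in shown}

sessions = [
    [("d1", 1), ("d2", 0), ("d3", 0)],
    [("d2", 1), ("d1", 0), ("d3", 1)],
]
print(random_click_model(sessions))
print(rank_based_ctr(sessions))
```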
\subsection{Position-based model}
\begin{itemize}
	\item Position-based models take the position \textit{and} the document-query pair into account for modeling the probability of a click
	\begin{itemize}
		\item \textit{Examination} - reading a snippet at a rank/position $\implies$ $P(E_r = 1) = \gamma_r$
		\item \textit{Attractiveness} - prob. for document-query relevance $\implies$ $P(A_{uq} = 1) = \alpha_{uq}$
		\item The combined probability of clicking on a document is therefore: $$P(C_u = 1) = P(E_{r_u} = 1) \cdot P(A_{uq} = 1)$$
	\end{itemize}
	\item The model is visualized in Figure~\ref{img:click_models_PBM_pgm}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.25\textwidth]{figures/click_models_PBM_pgm.png}
		\caption{Probabilistic graphical model of parameters for PBM}
		\label{img:click_models_PBM_pgm}
	\end{figure}
	\item The examination models the position bias in user clicks while the attractiveness covers the document relevance
	\item \textit{Advantages}: Distinguishing between position bias and document relevance
	\item \textit{Disadvantages}: the Position-based model assumes that all clicks are independent of each other. Models that overcome this include:
	\begin{itemize}
		\item \textit{User browsing model (UBM)} - examination is also based on the rank of the previously clicked document $\implies$ $P(E_{r,r'}=1) = \gamma_{r,r'} $ ($n + n\cdot (n-1)/2$ parameters $\to$ 55 parameters for $n=10$)
		\item \textit{Cascade model} - see next section
	\end{itemize}
\end{itemize}
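As a small illustration of the position-based model, the click probability of every document on a result page is the product of the examination probability of its rank and its attractiveness. The parameter values below are made up, and the names are illustrative; the second function reproduces the UBM parameter count from the text.

```python
def pbm_click_probabilities(ranking, query, gamma, alpha):
    """P(C_u = 1) = gamma_r * alpha_uq for every document in the ranking."""
    return [gamma[rank] * alpha[(doc, query)]
            for rank, doc in enumerate(ranking, start=1)]

def ubm_parameter_count(n):
    """Number of examination parameters gamma_{r,r'} in UBM for n ranks."""
    return n + n * (n - 1) // 2

gamma = {1: 1.0, 2: 0.5}                        # examination per rank (made up)
alpha = {("d1", "q"): 0.4, ("d2", "q"): 0.8}    # attractiveness (made up)
print(pbm_click_probabilities(["d1", "d2"], "q", gamma, alpha))
print(ubm_parameter_count(10))
```

In this toy setting the more attractive document at rank 2 ends up with the same click probability as the less attractive one at rank 1, which is exactly the position bias the model separates out.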
\subsection{Cascade model}
\begin{itemize}
	\item The cascade model assumes that the user scans the documents from top to bottom until they find a relevant document and click on it
	\item Thus, the top document is always examined, while following documents are only examined if none of the previous ones were clicked
	\item The cascade model can be summarized in the equations:
	\begin{equation*}
		\begin{split}
			P(A_r = 1) & = \alpha_{u_r q}\\
			P(E_1 = 1) & = 1 \textit{\hspace{7mm} first element is always examined}\\
			P(E_r = 1|C_{r-1} = 1) & = 0 \textit{\hspace{7mm} stop if previous document is clicked}\\
			P(E_r = 1|E_{r-1} = 0) & = 0 \textit{\hspace{7mm} stop if the previous document was not examined}\\
			P(E_r = 1|E_{r-1}=1, C_{r-1}=0) & = 1 \textit{\hspace{7mm} if no click was performed yet, examine next document}\\
		\end{split}
	\end{equation*}
	\item Therefore, the model has no parameters for examination and solely relies on attractiveness. The corresponding PGM is visualized in Figure~\ref{img:click_models_CM_pgm}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/click_models_CM_pgm.png}
		\caption{Probabilistic graphical model of parameters for CM}
		\label{img:click_models_CM_pgm}
	\end{figure}
	\item \textit{Advantages}: Clicking on a document depends on previous decisions/documents
	\item \textit{Disadvantages}: No skips are allowed. Also, the cascade model only considers a single click $\implies$ Dynamic Bayesian Networks
\end{itemize}
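Under the cascade assumptions, a document at rank $r$ is examined only if none of the documents above it was clicked, so its click probability is its attractiveness times the probability that all previous documents were skipped. A minimal sketch with illustrative names:

```python
def cascade_click_probabilities(attractiveness):
    """Click probability per rank: alpha_r * prod_{j<r} (1 - alpha_j)."""
    probs, examine = [], 1.0  # the first document is always examined
    for alpha in attractiveness:
        probs.append(examine * alpha)
        examine *= 1.0 - alpha  # continue only if this document was not clicked
    return probs

print(cascade_click_probabilities([0.5, 0.5, 0.5]))
```

The probabilities sum to at most one, which reflects that the model allows only a single click per session.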

================================================
FILE: Information_Retrieval_1/ir_counterfactual_eval.tex
================================================
\section{Counterfactual Evaluation and Learning to Rank}
\begin{itemize}
	\item The term \textit{counterfactual} relates to \textit{off-policy} learning in RL
	\item Thus, we evaluate offline: we use online data obtained by another policy to estimate the performance the new policy would have in an online setting
\end{itemize}
\subsection{Counterfactual Evaluation}
\begin{itemize}
	\item In general, a user interactive system can be formalized as follows (see Figure~\ref{img:counterfactual_user_interactive_system}):
	\begin{itemize}
		\item $x$: Feature vector describing the user and context (i.e. query)
		\item $y$: Result the system returns based on its policy ($y=\pi(x)$)
		\item $\delta$: Feedback signal from the actions a user took. The function encodes the metric (user utility function) and is defined as $\delta: X\times Y\to \mathbb{R}$
		\item $\pi$: Policy describing the ranking system which takes $x$ as input and maps it to output $y$: $\pi:X\to Y$
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/counterfactual_user_interactive_system.png}
		\caption{Visualization of a user interactive system}
		\label{img:counterfactual_user_interactive_system}
	\end{figure}
	\item \textit{Counterfactual evaluation}: perform offline evaluation of online metrics given online data from another system $\pi_{\text{production}}$. Thus, we try to estimate performance of $\pi_{\text{new}}$ with interaction data obtained with $\pi_{\text{production}}$.
	\item The interaction data/log is structured as $D=\left\{\left(x_1, y_1, \delta_1\right),...,\left(x_N, y_N, \delta_N\right)\right\}$
	\begin{itemize}
		\item The actions $y_i$ were selected by $\pi_{\text{production}}:X\to Y$
		\item Note that we only have partial information feedback, and no complete supervision. Only for the chosen action, we know the feedback signal/user utility. Thus, the "correct"/optimal action is unknown (also called "bandit feedback" as it was sampled from only one arm)
	\end{itemize}
	\item We want to estimate $\mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right]$ given $D$ from $\pi_{\text{production}}$. For this, there are two approaches: \textit{model the rewards} and \textit{inverse propensity scoring}.
\end{itemize}
\subsubsection{Model the rewards}
\begin{itemize}
	\item The intuition behind \textit{model the rewards} is to learn the reward function $\delta:X\times Y\to \mathbb{R}$ from $D\sim \pi_{\text{production}}$ directly
	\item The task can be reduced to a regression problem: $$\delta_w = \arg\min_{\delta_w} \sum\limits_{i=1}^{N} \mathcal{L}\left(\delta_w\left(x_i, y_i\right), \delta_i \right)$$
	where $\mathcal{L}$ is a loss function like MSE.
	\item Once $\delta_w$ is learned, we can estimate our goal by $\mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right] \approx \frac{1}{N} \sum\limits_{i=1}^{N} \delta_w \left(x_i, \pi_{\text{new}}(x_i)\right)$
	\item However, learning $\delta_w$ is in general very difficult, as:
	\begin{itemize}
		\item Input space $X\times Y$ is very high-dimensional
		\item Rewards are highly non-linear and noisy
		\item Data is strongly biased to the actions that $\pi_{\text{production}}$ prefers
	\end{itemize}
\end{itemize}
\subsubsection{Inverse Propensity Scoring}
\begin{itemize}
	\item Instead of learning $\delta_w$, is it possible to directly estimate the value of the new policy $\pi_{\text{new}}$?
	\item Answer: only under the condition that the policy $\pi_{\text{production}}$ is stochastic: $y\sim \pi(y|x)$. The probability $p$ to choose the action $y$ is also called \textit{propensity}. Note that $p>0$ must hold for all possible actions as we otherwise have no chance to discover/obtain feedback for all actions
	\item For unbiased counterfactual evaluation, we need data samples with the propensity $p_i$ from policy $\pi_{\text{production}}$ describing the probability of selecting $y_i$ for given input $x_i$: $\left(x_i, y_i, \delta_i, p_i\right)$
	\item Use importance sampling to make distributions $\pi_{\text{production}}$ and $\pi_{\text{new}}$ comparable. This leads to the \textbf{IPS-estimator}:
	$$\frac{1}{N}\sum\limits_{i=1}^{N} \delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}$$
\end{itemize}
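The IPS estimator is a weighted average over the logged interactions. The sketch below assumes a hypothetical log of tuples $(x_i, y_i, \delta_i, p_i)$ and a new policy given as a function that returns the probability of an action for a context; all names and numbers are illustrative.

```python
def ips_estimate(log, pi_new):
    """Average of delta_i * pi_new(y_i | x_i) / p_i over the logged interactions."""
    return sum(delta * pi_new(y, x) / p for x, y, delta, p in log) / len(log)

# Toy log: the production policy chose uniformly between two actions,
# action "a" yields reward 1 and action "b" yields reward 0.
log = [("x", "a", 1.0, 0.5), ("x", "b", 0.0, 0.5)]

def pi_new(y, x):
    """New policy: prefers action "a" with probability 0.9 (made up)."""
    return 0.9 if y == "a" else 0.1

print(ips_estimate(log, pi_new))  # estimate of the value of the new policy
```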
\subsubsection{Proof of Unbiasedness}
\begin{itemize}
	\item We want to prove that in expectation, the IPS estimator yields the correct value: $$\mathbb{E}_{y\sim \pi_{\text{production}}}\left[\frac{1}{N}\sum\limits_{i=1}^{N} \delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right]$$
	\item First, we can move the sum outside the expectation (linearity of expectation):
	$$\mathbb{E}_{y\sim \pi_{\text{production}}}\left[\frac{1}{N}\sum\limits_{i=1}^{N} \delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \frac{1}{N}\sum\limits_{i=1}^{N} \mathbb{E}_{y\sim \pi_{\text{production}}}\left[\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right]$$
	\item Next, we replace the expectation by a sum over actions weighted by their corresponding probabilities:
	$$\frac{1}{N}\sum\limits_{i=1}^{N} \mathbb{E}_{y\sim \pi_{\text{production}}}\left[\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \frac{1}{N}\sum\limits_{i=1}^{N} \sum\limits_{y_i\in Y}\left[\pi_{\text{production}}(y_i|x_i)\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right]$$
	\item As $p_i$ is defined as $\pi_{\text{production}}(y_i|x_i)$, we can reduce the equation to:
	$$\frac{1}{N}\sum\limits_{i=1}^{N} \sum\limits_{y_i\in Y}\left[\pi_{\text{production}}(y_i|x_i)\delta_i \frac{\pi_{\text{new}}(y_i|x_i)}{p_i}\right] = \frac{1}{N}\sum\limits_{i=1}^{N} \sum\limits_{y_i\in Y}\left[\delta_i \pi_{\text{new}}(y_i|x_i)\right]$$
	\item Finally, we apply the definition of the expectation:
	$$\frac{1}{N}\sum\limits_{i=1}^{N} \sum\limits_{y_i\in Y}\left[\delta_i \pi_{\text{new}}(y_i|x_i)\right] = \frac{1}{N}\sum\limits_{i=1}^{N} \mathbb{E}_{y_i\sim \pi_{\text{new}}}\left[\delta(x_i,y_i)\right] = \mathbb{E}_{y\sim \pi_{\text{new}}}\left[\delta(x,y)\right]$$
	\item Note that the IPS estimator has a high variance, which grows as the propensities $p_i$ get small (the importance weights scale with $1/p_i$). Thus, if we have a very low probability for some actions, this can introduce a high error $\implies$ many samples are needed to approximate the target accurately. There are different approaches to reduce the variance
\end{itemize}
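For a single context with a finite action set, both the unbiasedness and the variance issue can be checked by exact enumeration. The sketch below assumes policies and rewards given as dictionaries over actions; the numbers are made up for illustration.

```python
def expected_ips_term(pi_prod, pi_new, delta):
    """E_{y ~ pi_prod}[delta(y) * pi_new(y) / pi_prod(y)] by enumeration."""
    return sum(pi_prod[y] * delta[y] * pi_new[y] / pi_prod[y] for y in pi_prod)

def true_value(pi_new, delta):
    """E_{y ~ pi_new}[delta(y)] by enumeration."""
    return sum(pi_new[y] * delta[y] for y in pi_new)

def ips_variance(pi_prod, pi_new, delta):
    """Variance of a single IPS term under the production policy."""
    second_moment = sum(pi_prod[y] * (delta[y] * pi_new[y] / pi_prod[y]) ** 2
                        for y in pi_prod)
    return second_moment - expected_ips_term(pi_prod, pi_new, delta) ** 2

delta = {"a": 1.0, "b": 0.0}
pi_new = {"a": 0.9, "b": 0.1}
uniform = {"a": 0.5, "b": 0.5}
rare = {"a": 0.01, "b": 0.99}  # production policy almost never picks "a"

print(expected_ips_term(uniform, pi_new, delta), true_value(pi_new, delta))
print(ips_variance(uniform, pi_new, delta), ips_variance(rare, pi_new, delta))
```

The expectation is the same for both production policies, while the variance is far larger when the rewarding action has a small propensity.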
\subsection{Counterfactual Learning to Rank}
\begin{itemize}
	\item Learning to Rank: \textit{offline} - train on labeled data, \textit{online} - learn from user interactions, \textit{counterfactual} - learn offline from online retrieved data obtained by another policy/ranker
	\item The goal of counterfactual LTR is to learn a new ranker $\pi_{\text{new}}$ from the interaction data with $\pi_{\text{production}}$
	\begin{itemize}
		\item The data is specified by $D=\left\{(x_1,y_1,\delta_1),...,(x_N,y_N,\delta_N)\right\}$ where $\delta_i$ indicates which document was clicked (we assume that only one document was clicked)
		\item $y_i$ is the ranking selected by $\pi_{\text{production}}:X\to Y$
	\end{itemize}
	\item Naive approach: assume click indicates relevance and learn as if it would be a supervised dataset: $$\pi_{\text{new}} = \arg\min_{\pi} \sum\limits_{i=1}^{N} \text{rank}\left(\pi(x_i),y_i,\delta_i\right)$$
	The objective function is to reduce the rank of the relevant document given the new ranking of $\pi_{\text{new}}$ and the previous ranking by $y_i$. Can be solved by pairwise LTR objective.
	\item However, data obtained by online Learning to Rank is commonly noisy and biased
	\item We can take these biases into account by using the inverse propensity scores:
	$$\pi_{\text{new}} = \arg\min_{\pi} \sum\limits_{i=1}^{N} \frac{\text{rank}\left(\pi(x_i),y_i,\delta_i\right)}{p(\textit{observing }\delta_i)}$$
	This formula can be motivated from a probabilistic click model perspective:
	$$p(\textit{click}) = p(\textit{observation})\times p(\textit{relevant}) \implies p(\textit{relevant}) = \frac{p(\textit{click})}{p(\textit{observation})}$$
	The relevance probability on the right-hand side of the implication is what we want to obtain; dividing the clicks by the observation probability is what the objective actually does.
\end{itemize}
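The propensity-weighted objective can be evaluated for a fixed candidate ranker as follows. This sketch assumes that the rank term is simply the position of the clicked document in the new ranking; that interpretation, the function name, and the toy numbers are assumptions for illustration.

```python
def ips_rank_objective(new_rankings, clicked_docs, propensities):
    """Sum over interactions of rank(clicked doc in new ranking) / p(observing click)."""
    total = 0.0
    for ranking, doc, p in zip(new_rankings, clicked_docs, propensities):
        total += (ranking.index(doc) + 1) / p
    return total

new_rankings = [["a", "b", "c"], ["b", "a", "c"]]
clicked_docs = ["b", "b"]
propensities = [0.5, 1.0]  # the first click was observed at a less visible position
print(ips_rank_objective(new_rankings, clicked_docs, propensities))
```

Clicks that were unlikely to be observed get a larger weight, so a ranker is penalized more for placing those documents low.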
\subsubsection{Propensity estimation}
\begin{itemize}
	\item However, the question remains how we calculate $p(\textit{observing }\delta_i)$. We can either approximate it by using click models, or by performing a randomization test
	\item \textit{RandTopN}
	\begin{itemize}
		\item Randomly shuffle the top $N$ documents
		\item Measure clicks people have performed on the data (online experiment)
		\item Aggregate clicks over many samples (exact in the limit of infinitely many samples)
		\item Infer $\hat{p} \propto p(\textit{observing }\delta_i)$
	\end{itemize} 
	\item \textit{RandPair}
	\begin{itemize}
		\item Randomly swap the top document with a random document from the top $N$
		\item Infers $\frac{p(\textit{observing }\delta_i)}{p(\textit{observing }\delta_j)}$ for swapped documents $i$ and $j$
	\end{itemize}
	\begin{figure}[ht]
		\centering
		\begin{subfigure}[b]{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/counterfactual_LTR_RandTopN.png}
			\caption{RandTopN}
		\end{subfigure}
		\begin{subfigure}[b]{0.45\textwidth}
			\centering
			\includegraphics[width=0.8\textwidth]{figures/counterfactual_LTR_RandPair.png}
			\caption{RandPair}
		\end{subfigure}
		\label{img:counterfactual_propensity_estimation}
	\end{figure}
\end{itemize}
\subsubsection{The Variance problem}
\begin{itemize}
	\item The problem with the counterfactual objective is that if $p(\textit{observing }\delta_i)$ approaches $0$, the overall objective is dominated by this example $\implies$ overfitting on a single data point
	\item One way to overcome this problem is using a variance regularizer which prevents the policy from deviating too much from the original production policy $\pi_{\text{production}}$:
	$$\pi_{\text{new}} = \arg\min_{\pi} \sum\limits_{i=1}^{N} \frac{\text{rank}\left(\pi(x_i),y_i,\delta_i\right)}{p(\textit{observing }\delta_i)} + \lambda \sqrt{\frac{\mathcal{V}[\pi, \pi_{\text{production}}]}{N}}$$
	\item However, this optimization problem cannot be solved by SGD anymore and iterative methods must be applied (new learning framework \textit{counterfactual risk minimization})
\end{itemize}

================================================
FILE: Information_Retrieval_1/ir_language_models.tex
================================================
\section{Introduction to Retrieval models}
\begin{itemize}
	\item Mathematical framework for defining query-document matching
\end{itemize}
\subsection{TF-IDF}
\begin{itemize}
	\item In a vector space model, documents and queries are represented in vector space
	\item Axes are mostly terms/vocabulary so that a document or query is represented by terms they contain (or their frequency)
	\item We can rank documents based on their cosine similarity with the query:
	$$\text{score}(d,q) = \frac{\vec{q} \cdot \vec{d}}{||\vec{q}||\cdot ||\vec{d}||}$$
	\item Documents can be therefore represented as non-negative vector of term weights (raw frequency in doc):
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.5\textwidth]{figures/language_models_tf_example.png}
		\label{img:language_models_tf_example}
	\end{figure}
	\item However, the problem here is that terms with a higher frequency in documents are automatically considered more important, although this is not always the case (e.g. "the"). Thus, for identifying the important terms, we can use the document frequency (number of documents in which the term occurs):
	$$\text{df}(t) \coloneqq \#\left\{d:\text{tf}(t;d)>0\right\}$$
	\item We can translate document frequencies to term weights by inverting them (inverse document frequency - \textit{IDF}):
	$$\text{idf}(t) = \log \frac{n}{\text{df}(t)} = \log n - \log \text{df}(t)$$
	The log is applied to dampen the effect of IDF.
	\item Also the term frequencies should be dampened by a monotonic, sub-linear transformation as a term occurring twice as often doesn't imply that the document is also twice as important/relevant. Together, we can define the tf-idf weights as follows:
	$$\text{tf-idf}(t;d) = \log \left(1+\text{tf}(t;d)\right) \log \frac{n}{\text{df}(t)}$$
	\item Scores are normalized by the Euclidean norm (length) of the document vector. Alternatively, we could also apply tf-idf on the relative term frequencies.
\end{itemize}
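The tf-idf weighting and cosine ranking can be sketched in a few lines. The example assumes tokenized documents, sparse vectors stored as dictionaries, and that the query is weighted in the same way as the documents; all names and the toy collection are illustrative.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return (document vectors, idf) with weights log(1 + tf) * log(n / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = [{t: math.log(1 + tf) * idf[t] for t, tf in Counter(doc).items()}
               for doc in docs]
    return vectors, idf

def query_vector(query, idf):
    """Weight the query like a document; unknown terms are ignored (assumption)."""
    return {t: math.log(1 + tf) * idf[t] for t, tf in Counter(query).items() if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
vectors, idf = tf_idf_vectors(docs)
q = query_vector(["cat"], idf)
scores = [cosine(q, d) for d in vectors]
print(idf["the"], scores)  # "the" occurs everywhere and gets weight 0
```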
\subsection{BM25}
\begin{itemize}
	\item Probabilistic retrieval framework that extends the idea of tf-idf
	\item Instead of the log, we use a different damping function which is easier to control:
	$$w_t = \frac{(k_1 + 1)\cdot \text{tf}(d;t)}{k_1 + \text{tf}(d;t)}\cdot \text{idf}(t)$$
	\item In addition, we normalize the term frequency by the document length: $\text{tf}'(d;t) = \text{tf}(d;t) \cdot l_{avg}/l_{d}$ ($l_{avg}$ is the average document length of collection). By this we prevent copies of documents concatenated with each other being higher rated. Putting this into our original function, we get:
	$$w_t = \frac{(k_1 + 1) \cdot \text{tf}(d;t)}{k_1 \cdot (l_d / l_{avg})+ \text{tf}(d;t)}\cdot \text{idf}(t)$$
	\item However, longer documents also tend to contain more information. Thus, we introduce another parameter $b$ that controls the normalization:
	$$w_t = \frac{(k_1 + 1) \cdot \text{tf}(d;t)}{k_1 \cdot ((1-b) + b\cdot (l_d / l_{avg}))+ \text{tf}(d;t)}\cdot \text{idf}(t)$$
	\item For very long queries, we also need to dampen the query term frequencies, which can be done by multiplying with another term $\frac{(k_3 + 1)\cdot \text{tf}(q;t)}{k_3 + \text{tf}(q;t)}$
	\item In conclusion, the BM25 score is calculated as follows:
	$$\text{BM25} = \sum\limits_{\text{unique\hspace{1mm}} t\in q} \frac{(k_1 + 1) \cdot \text{tf}(d;t)}{k_1 \cdot ((1-b) + b\cdot (l_d / l_{avg}))+ \text{tf}(d;t)} \cdot \frac{(k_3 + 1)\cdot \text{tf}(q;t)}{k_3 + \text{tf}(q;t)} \cdot \text{idf}(t)$$
	\item Parameters $k_1$, $b$ and $k_3$ are tuned. Common defaults are $k_1 = 1.5$ and $b=0.75$
	\item It is the most widely used ranking in IR but only loosely inspired by probabilistic models
\end{itemize}
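The final BM25 formula translates directly into code. The sketch below uses the idf definition from the tf-idf section and the defaults $k_1=1.5$, $b=0.75$ mentioned above; the default for $k_3$ is an arbitrary assumption, as are the function name and the toy collection.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75, k3=8.0):
    """BM25 score of doc for query; docs is the whole collection (token lists)."""
    n = len(docs)
    l_avg = sum(len(d) for d in docs) / n
    tf_d, tf_q = Counter(doc), Counter(query)
    score = 0.0
    for t in tf_q:  # unique query terms
        df = sum(1 for d in docs if t in d)
        if df == 0 or tf_d[t] == 0:
            continue  # term contributes nothing
        idf = math.log(n / df)
        length_norm = (1 - b) + b * len(doc) / l_avg
        doc_part = (k1 + 1) * tf_d[t] / (k1 * length_norm + tf_d[t])
        query_part = (k3 + 1) * tf_q[t] / (k3 + tf_q[t])
        score += doc_part * query_part * idf
    return score

docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
print([bm25_score(["cat"], d, docs) for d in docs])
```

With $b=0$ and a term frequency of one in both document and query, both damping factors equal one and the score reduces to the idf of the term.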
\subsection{Statistical Language Models}
\begin{itemize}
	\item A statistical language model is a probability distribution over word sequences $P(w_1, ..., w_m)$ with which documents and queries can be represented (and uncertainty quantified)
	\item Thus, a language model describes the probability of e.g. $q$ being the given word sequence
	\item Documents are ranked by their similarity to the given query. For this, we can use either document likelihood, query likelihood or KL-divergence
\end{itemize}
\subsubsection{Query likelihood}
\begin{itemize}
	\item Given a document, what queries are most likely to be created for it? 
	\item We first have to ensure that the query likelihood correlates with document likelihood. Therefore, we apply Bayes' rule: $p(d|q) = \frac{p(q|d)p(d)}{p(q)}$. As $p(q)$ is equal for all documents, and we assume a uniform prior over documents (though this does not always hold), we obtain $p(d|q)\propto p(q|d)$
	\item Thus, by generating a probability distribution of possible queries for a document, we can approximate how likely a document is given a query.
	\item The scoring function is defined as follows:
	$$\text{score}(d,q) = \log \left[p(q|\theta_d)\cdot p(d)\right]$$
	where $\theta_d$ describes the document. There are mainly three modeling choices:
	\begin{enumerate}
		\item \textit{How to define the generative process $p(q|\theta_d)$?}
		\begin{itemize}
			\item Given $\theta_d$, what is the generative process for getting $q=w_1,...,w_{|q|}$?
			\item Different distributions are possible
			\item \textit{Multiple Bernoulli} - bag of word perspective, every word in vocabulary has probability to be in query or not. The related probability is:
			$$p(q|\theta_d) = \prod\limits_{w_i \in q} p(X_i = 1 | \theta_d) \prod\limits_{w_i \not\in q} \left(1 - p\left(X_i = 1 | \theta_d\right) \right)$$
			\item \textit{Multinomial} - similar to Bernoulli, but we now have a random variable for every word slot in the query and not one for every word in the vocabulary. Thus, the calculation is:
			$$p(q|\theta_d) = \prod\limits_{w_i \in q} p(w_i | \theta_d) \text{\hspace{4mm}where\hspace{4mm}} \sum\limits_{w_i \in V} p(w_i|\theta_d) = 1$$
			\item \textit{Multiple Poisson} - similar to Bernoulli, but instead of presence or absence, we model the number of times we expect a word from the vocabulary to occur in the query of length $|q|$ by a Poisson distribution:
			$$p(q|\theta_d) = \prod\limits_{w_i \in V} \frac{e^{-\lambda_i |q|} (\lambda_i |q|)^{\text{tf}(w_i;q)}}{\text{tf}(w_i;q)!}$$
		\end{itemize}
		\item \textit{How to estimate $\theta_d$ based on document $d$?}
		\begin{itemize}
			\item To estimate $\theta_d$ we perform MLE: $\hat{\theta}_d = \arg \max_{\theta_d} p(d|\theta_d)$
			\item In case of a multinomial distribution, we would get:
			$$p(d|\theta_d) = \prod\limits_{w_i \in V} p(w_i | \theta_d)^{\text{tf}(w_i;d)} \implies \log p(d|\theta_d) = \sum\limits_{w_i \in V} \text{tf}(w_i;d) \log p(w_i | \theta_d)$$
			\item Note that this is a constrained optimization problem with $\sum\limits_{w_i \in V} p(w_i|\theta_d) = 1$.
			\item Using a Lagrange multiplier, we get $p_{MLE}(w_i|d) = \frac{\text{tf}(w_i;d)}{|d|}$
		\end{itemize}
		\item \textit{How to compute prior $p(d)$?}
		\begin{itemize}
			\item The prior takes everything into account which is independent of a query.
			\item This can include number of clicks, credibility, ...
		\end{itemize}
	\end{enumerate}
\end{itemize}
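For the multinomial case, the MLE of the document model and the resulting query log-likelihood can be sketched as follows. The example assumes tokenized text and a uniform document prior (so the prior term is dropped); the names are illustrative.

```python
import math
from collections import Counter

def mle_language_model(doc):
    """p_MLE(w | d) = tf(w; d) / |d| for every word in the document."""
    counts = Counter(doc)
    return {w: c / len(doc) for w, c in counts.items()}

def query_log_likelihood(query, theta):
    """log p(q | theta_d) under the multinomial model (one factor per query slot)."""
    logp = 0.0
    for w in query:
        p = theta.get(w, 0.0)
        if p == 0.0:
            return float("-inf")  # a word unseen in the document gives probability 0
        logp += math.log(p)
    return logp

theta = mle_language_model(["a", "a", "b", "c"])
print(theta)
print(query_log_likelihood(["a", "b"], theta))
print(query_log_likelihood(["a", "z"], theta))
```

The last call shows that a single query word that does not occur in the document sets the whole query likelihood to zero under the unsmoothed estimate.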
\subsubsection{Smoothing}
\begin{itemize}
	\item How to deal with unseen words which have a probability of 0.
	\item First, we assume a multinomial distribution again with the optimal parameters of $p(w_i|\theta_d) = \frac{\text{tf}(w_i;d)}{|d|}$
	\item \textbf{Additive smoothing}: add a small extra count to every word:
	$$p(w_i|\theta_d) = \frac{\text{tf}(w_i;d) + \epsilon}{|d| + \epsilon |V|}$$
	In case of $\epsilon=0$, we fall back to ML estimation. $\epsilon=1$ is called Laplace smoothing.
	\item \textbf{Jelinek-Mercer smoothing}: linearly interpolate with ``background'' knowledge, so that words that are rare in the collection also receive less additional probability mass:
	$$p_{\lambda}(w_i|\theta_d) = \lambda \frac{\text{tf}(w_i;d)}{|d|} + (1 - \lambda) \frac{\text{tf}(w_i;C)}{|C|}$$
	The background collection $C$ is approximated by the concatenation of all documents.
	\item \textbf{Dirichlet prior smoothing}: we assume that before seeing the document, we have a prior belief over all words $p(\theta_d)$. We use the posterior which gets narrower the more words we see and therefore the more certain we are about the document distribution.
	\begin{itemize}
		\item Maximum A Posteriori estimate by $\hat{\theta}_d = \arg\max_{\theta_d} p(\theta_d|d) = \arg\max_{\theta_d} p(d|\theta_d) p(\theta_d)$
		\item Prior distribution $\theta_d \sim \text{Dir}(\alpha) \implies p(\theta_d) \propto \prod\limits_{w \in V} p(w|\theta_d)^{\alpha_w - 1}$
		\item With a multinomial likelihood, we get:
		$$p(\theta_d | d) \propto \prod\limits_{w \in V} p(w|\theta_d)^{\text{tf}(w;d)} \prod\limits_{w \in V} p(w|\theta_d)^{\alpha_w - 1} = \prod\limits_{w \in V} p(w|\theta_d)^{\text{tf}(w;d) + \alpha_w - 1}$$
		\item Thus, our new MAP solution is:
		$$p(w|\theta_d) = \frac{\text{tf}(w;d) + \alpha_w - 1}{|d| + \sum_{w\in V}\alpha_w - |V|}$$
		\item For $\alpha_w = 1$, we recover the MLE, and $\alpha_w = 2$ corresponds to Laplace smoothing.
		\item We can also rewrite the smoothing similar to Jelinek-Mercer smoothing:
		$$p(w|\theta_d) = \frac{|d|}{|d|+ \mu}\frac{\text{tf}(w;d)}{|d|} + \frac{\mu}{\mu + |d|}p(w|C)$$
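		where $\mu = \sum_{w\in V} (\alpha_w - 1)$ if we choose $\alpha_w = \mu\, p(w|C) + 1$. Thus, we interpolate with the background knowledge while taking the document length into account: the longer the document, the more weight its own statistics receive.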
	\end{itemize}
	\item Besides Dirichlet prior smoothing, we can also use other prior distributions (for example a Beta prior with the multiple Bernoulli model), which lead to slightly different smoothing functions. For example, with a Beta prior with per-word parameters $\alpha_w$ and $\beta_w$ (without constraints!), we get:
	$$p(w|\theta_d) = \frac{\text{tf}(w;d) + \alpha_w - 1}{\alpha_w + \beta_w - 1}$$
\end{itemize}
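As a small numeric illustration of the three smoothing formulas above, consider a hypothetical setting (all numbers are made up for this example): $|V| = 5$, $|d| = 10$, a word $w$ with $\text{tf}(w;d) = 3$, an unseen word $u$ with $\text{tf}(u;d) = 0$, and a background probability of $0.01$ for both words.
\begin{itemize}
	\item Adding $\epsilon = 1$ to every count (Laplace smoothing): $p(w|\theta_d) = \frac{3 + 1}{10 + 5} = \frac{4}{15} \approx 0.267$ and $p(u|\theta_d) = \frac{0 + 1}{10 + 5} = \frac{1}{15} \approx 0.067$
	\item Jelinek-Mercer with $\lambda = 0.8$: $p(w|\theta_d) = 0.8 \cdot 0.3 + 0.2 \cdot 0.01 = 0.242$ and $p(u|\theta_d) = 0.8 \cdot 0 + 0.2 \cdot 0.01 = 0.002$
	\item Dirichlet prior with $\mu = 10$: $p(w|\theta_d) = \frac{10}{20} \cdot 0.3 + \frac{10}{20} \cdot 0.01 = 0.155$ and $p(u|\theta_d) = \frac{10}{20} \cdot 0 + \frac{10}{20} \cdot 0.01 = 0.005$
\end{itemize}
In all three cases the unseen word receives a non-zero probability. With the Dirichlet prior, a longer document (larger $|d|$) would shift weight from the background term towards the document statistics, whereas the Jelinek-Mercer weights stay fixed.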
\subsubsection{Positional Language Models}
\begin{itemize}
	\item There are variants of basic language models capturing term dependencies
	\item Instead of having one language model representing the whole document, Positional Language Models define a LM for every word position
	\item Thus we capture (small) "fuzzy" passages with which we can match our query
	\item A term at each position can propagate its occurrence to close positions in word windows
	\begin{itemize}
		\item Example sentence: \texttt{the black hat is not}...
		\item With an equally weighted word window of size one, we obtain the following language model (MLE parameters) for the position of the word ``\texttt{black}'': $p(\texttt{black}|\theta_p) = 1/3, \hspace{2mm} p(\texttt{the}|\theta_p) = 1/3, \hspace{2mm} p(\texttt{hat}|\theta_p) = 1/3$
	\end{itemize}
	\item We can weight the occurrences of every word by their distance to the ``root'' position of the language model. The weighting function is called a kernel:
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/language_models_positional.png}
		\label{img:language_models_positional}
	\end{figure}
	\item In general, the term frequency of a word for a LM at position $j$ with kernel $k$ is determined as follows:
	$$\text{tf'}(w,j;d) = \sum\limits_{i=1}^{|d|} \text{tf}(w,i;d) \cdot k(i,j) $$
	\item The language model at every position is given by the corresponding MLE estimation:
	$$p(w|d,j) = \frac{\text{tf'}(w,j;d)}{\sum_{w'\in V} \text{tf'}(w',j;d)}$$
	\item Documents can now be scored either by the positional language model that best matches the query, or by the average over the top-$k$ best-matching positional models
\end{itemize}
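As a worked instance of the two formulas above, take the example sentence \texttt{the black hat is not} and the position $j = 2$ (\texttt{black}). The triangular kernel below is chosen only for illustration; it is an assumption of this example and not necessarily one of the kernels shown in the figure:
$$k(i,j) = \max\left(0,\; 1 - \frac{|i-j|}{2}\right)$$
This gives $\text{tf'}(\texttt{the},2;d) = 0.5$, $\text{tf'}(\texttt{black},2;d) = 1$, $\text{tf'}(\texttt{hat},2;d) = 0.5$, and $0$ for \texttt{is} and \texttt{not}. Normalizing by the sum $0.5 + 1 + 0.5 = 2$ yields:
$$p(\texttt{black}|d,2) = 0.5, \hspace{2mm} p(\texttt{the}|d,2) = 0.25, \hspace{2mm} p(\texttt{hat}|d,2) = 0.25$$
Compared to the equally weighted window, the word at the root position now dominates its own language model.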

================================================
FILE: Information_Retrieval_1/ir_learning_to_rank.tex
================================================
\section{Learning to Rank}
\begin{itemize}
	\item The main issue in information retrieval is to determine whether document $d$ is relevant to query $q$
	\item Common relevance signals include TF-IDF, BM25, document popularity etc.
	\item But: what signals to use/how to combine these signals? There is not a single relevance signal "to rule them all" $\implies$ combine all signals in a model
	\item Simplest combination method: linear model $f(\bm{d},\bm{\theta}) = \sum\limits_{i=1}^{|d|} \theta_i d_i$ where $\bm{d}$ represents the different signals for document-query pair
	\item Task: find the optimal parameter set $\bm{\theta}$, commonly with Machine Learning techniques (e.g. linear regression)
\end{itemize}
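As a minimal illustration of the linear model above, assume a hypothetical document-query pair with three signals $\bm{d} = (\text{BM25}, \text{TF-IDF}, \text{popularity}) = (12.0, 3.5, 0.2)$ and weights $\bm{\theta} = (0.5, 1.0, 2.0)$ (all values are made up for this example):
$$f(\bm{d},\bm{\theta}) = 0.5 \cdot 12.0 + 1.0 \cdot 3.5 + 2.0 \cdot 0.2 = 6.0 + 3.5 + 0.4 = 9.9$$
Documents are then ranked by this score; learning to rank amounts to choosing $\bm{\theta}$ such that the resulting order matches the desired ranking.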
\subsection{Offline Learning To Rank}
\begin{itemize}
	\item Given: an annotated dataset that relates documents to relevance labels or rankings
	\item There are three different approaches
	\begin{enumerate}
		\item \textbf{Pointwise}: optimize the model $f(\bm{d},\bm{\theta})$ to predict the relevancy of a document. This can be recast as a regression problem with loss:
		$$\mathcal{L}=\sum_{\bm{d}} \left(f(\bm{d},\bm{\theta}) - \text{relevancy}(d,q)\right)^2$$
		However, this approach ignores that in ranking only the final order matters, not the individual scores.
		\item \textbf{Pairwise}: optimize with respect to the total order of the documents and not specific relevance scores. The loss can be expressed by:
		$$\mathcal{L}=\sum_{d\succ d'}\left[f(\bm{d'},\bm{\theta}) - f(\bm{d},\bm{\theta})\right]$$
		where $d\succ d'$ means that $d'$ is the successor of $d$ in the labeled ranking. Nevertheless, this method does not take into account that only a small part (e.g. the top 10) of the collection is actually presented to the user.
		\item \textbf{Listwise}: optimize with respect to ranking metrics like $DCG$. Thus, the loss could be:
		$$\mathcal{L} = -nDCG(f(\cdot,\bm{\theta}))$$
		The problem is that most ranking metrics are not differentiable, because they depend on the scores only through the sorted order. There are heuristic approaches to still optimize with respect to such metrics.
	\end{enumerate}
	\item Problems with offline Learning to Rank: similar to offline evaluation in Section~\ref{sec:offline_eval_problems}
	\begin{itemize}
		\item All described methods require an annotated dataset which contains either relevance labels for each document-query pair or a ranking over the whole collection.
		\item Creating such a dataset is time-consuming and expensive
		\item Impossible to personalize for a user (every user prefers slightly different documents). Also, annotators and users might disagree on some points $\implies$ dataset does not fully reflect user behavior
		\item Relevance can change over time
	\end{itemize}
\end{itemize}
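To make the pairwise loss above concrete, consider a hypothetical labeled ranking $d_1 \succ d_2 \succ d_3$ with model scores $f_1 = 2.0$, $f_2 = 0.5$, $f_3 = 1.0$, where $f_k$ is shorthand for $f(\bm{d_k},\bm{\theta})$ (all numbers are made up for this example). Summing over the adjacent pairs gives:
$$\mathcal{L} = \left[f_2 - f_1\right] + \left[f_3 - f_2\right] = (0.5 - 2.0) + (1.0 - 0.5) = -1.0$$
Only the pair $(d_2, d_3)$, which the model orders incorrectly, contributes a positive term. Because the raw difference keeps rewarding ever larger score gaps, a bounded variant is often used in practice. One such option (an assumption of this example, not taken from the text above) is a hinge loss with margin 1:
$$\mathcal{L}_{hinge} = \sum_{d\succ d'} \max\left(0,\; 1 - \left(f(\bm{d},\bm{\theta}) - f(\bm{d'},\bm{\theta})\right)\right) = \max(0, 1 - 1.5) + \max(0, 1 + 0.5) = 1.5$$
Here correctly ordered pairs with a sufficient margin contribute nothing, so the loss focuses on the misordered pair.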
\subsection{Online Learning to Rank}
\begin{itemize}
	\item Learn from implicit user feedback
	\begin{itemize}
		\item Might be noisy
		\item Consider position bias (higher-ranked documents are clicked more frequently) and selection bias (only a limited set of documents is presented to the user)
	\end{itemize}
	\item Online Learning to Rank methods can learn from user interactions, \textbf{and} control the results which are displayed/presented to the user
	\item Thus, these methods can be more efficient, as they control what data is actually gathered
	\item A general online learning to rank technique is visualized in Figure~\ref{img:learning_to_rank_online_overview}
	\begin{figure}[ht]
		\centering
		\includegraphics[width=0.4\textwidth]{figures/learning_to_rank_online_overview.png}
		\caption{Overview of the general concept of online learning to rank}
		\label{img:learning_to_rank_online_overview}
	\end{figure}
	\begin{itemize}
		\item The user enters a query, for which the ranking algorithm generates a list of documents
		\item The Online Learning to Rank system interacts with the results by adding and/or removing documents from the ranking. This can also include interleaving with another, slightly changed ranking algorithm
		\item User interacts with the displayed result and gives implicit feedback.
		\item The Online Learning to Rank algorithm updates the ranking parameters according to the analyzed feedback
	\end{itemize}
	\item \textbf{Advantages}: learns directly from the user, is more responsive by immediately adapting its parameters
	\item \textbf{Risks}:
	\begin{itemize}
		\item Unreliable methods will affect/worsen user experiences immediately.
		\item (Noisy) clicks can easily bias or even manipulate search engines
		\item \textbf{Self-confirming loop}
		\begin{itemize}
			\item If an irrelevant document was clicked at random, the system still perceives this document as relevant and changes its parameters accordingly
			\item Thus, the random document will be placed higher in future rankings. Moreover, documents similar to the irrelevant one will also receive an increased relevance score and will probably occur at a high position
			\item Most likely, the next clicked document will be one of these highly ranked irrelevant documents $\implies$ entering a self-confirming loop
			\item Due to bias and noise, an irrelevant document was clicked and inferred to be relevant
			\item Due to noise, this inference is most likely to appear again
			\item The algorithm's confidence in this incorrect inference continues to increase
		\end{itemize}
	\end{itemize}
	\item To prevent a self-confirming loop, we have to balance exploration and exploitation
	\begin{itemize}
		\item \textit{Exploration}: collect feedback for learning from as many documents as possible
		\item \textit{Exploitation}: utilize what has already been learned
		\item If the system only exploits, it misses out on feedback for other documents that might be even better (danger of entering/staying in a self-confirming loop)
		\item Too high an exploration rate leads to many irrelevant documents in the ranking, which worsens the user experience
	\end{itemize}
\end{itemize}
\subsubsection{Designing an Online Learning to Rank algorithm}
\begin{itemize}
	\item To design an OLTR algorithm, we have to make design choices in four aspects (see Figure~\ref{img:learning_to_rank_online_design})
\end{itemize}
\begin{figure}[ht]
	\centering
	\includegraphics[width=0.4\textwidth]{figures/learning_to_rank_online_design.png}
	\caption{General design components of an OLTR algorithm}
	\label{img:learning_to_rank_online_design}
\end{figure}
\begin{enumerate}[label=(\Alph*)]
	\item \textbf{Ranker}: the ranker maps documents to relevance scores. This module operates on the feature level or on document IDs and can be, for example, a linear ranker, a neural model, ...
	\item \textbf{Exploration strategy}: define interactions with results of the ranker. No exploration would mean that the document ranking is simply passed and stays unchanged. A common strategy is \textit{epsilon-greedy} where we inject random documents in random positions with ratio $\epsilon$. Other algorithms include upper confidence bound etc.
	\item \textbf{Signal recording and interpretation}: the algorithm can consider multiple signals (raw observations like clicks and dwell time, or more complex metrics like time to success) and should remove bias/noise from them. When result list was constructed by using interleaving,
Download .txt
gitextract_3o3s9o2d/

├── .gitignore
├── Computer_Vision_1/
│   ├── cv_appendix.tex
│   ├── cv_applications.tex
│   ├── cv_deep_learning.tex
│   ├── cv_deep_video.tex
│   ├── cv_imgformation.tex
│   ├── cv_imgprocessing.tex
│   ├── cv_intro.tex
│   ├── cv_object_rec.tex
│   └── cv_summary.tex
├── Deep_Learning/
│   ├── cheat_sheet/
│   │   └── main.tex
│   ├── dl_appendix.tex
│   ├── dl_autoregressive.tex
│   ├── dl_bayesian.tex
│   ├── dl_convnets.tex
│   ├── dl_deep_rl.tex
│   ├── dl_generative_models.tex
│   ├── dl_intro.tex
│   ├── dl_modularity.tex
│   ├── dl_optimization.tex
│   ├── dl_rnn.tex
│   └── dl_summary.tex
├── Information_Retrieval_1/
│   ├── ir_boolean_retrieval.tex
│   ├── ir_click_models.tex
│   ├── ir_counterfactual_eval.tex
│   ├── ir_language_models.tex
│   ├── ir_learning_to_rank.tex
│   ├── ir_neural_models.tex
│   ├── ir_offline_evaluation.tex
│   ├── ir_online_evaluation.tex
│   ├── ir_semantic_matching.tex
│   └── ir_summary.tex
├── Knowledge_Representation/
│   ├── figures/
│   │   └── figures.pptx
│   ├── kr_csp.tex
│   ├── kr_dl.tex
│   ├── kr_intro.tex
│   ├── kr_qr.tex
│   ├── kr_sat.tex
│   └── kr_summary.tex
├── LICENSE
├── ML4QS/
│   ├── mlqs_clustering.tex
│   ├── mlqs_feature_engineering.tex
│   ├── mlqs_intro.tex
│   ├── mlqs_modeling_with_time.tex
│   ├── mlqs_modeling_without_time.tex
│   ├── mlqs_reinforcement_learning.tex
│   ├── mlqs_sensory_noise.tex
│   ├── mlqs_summary.tex
│   └── mlqs_supervised_learning.tex
├── Machine_Learning_1/
│   ├── ml_appendix.tex
│   ├── ml_basic_probability.tex
│   ├── ml_combining_models.tex
│   ├── ml_kernel_methods.tex
│   ├── ml_linear_classification.tex
│   ├── ml_linear_regression.tex
│   ├── ml_neural_networks.tex
│   ├── ml_summary.tex
│   └── ml_unsupervised_learning.tex
├── Machine_Learning_2/
│   ├── ml2_appendix.tex
│   ├── ml2_causality.tex
│   ├── ml2_exponential_family.tex
│   ├── ml2_graphical_models.tex
│   ├── ml2_graphical_models.tex.recover.bak~
│   ├── ml2_sampling_methods.tex
│   ├── ml2_sequential_data.tex
│   ├── ml2_summary.tex
│   └── ml2_variational_EM.tex
├── Natural_Language_Processing_1/
│   ├── nlp_bayesian.tex
│   ├── nlp_compositional_semantic.tex
│   ├── nlp_dialog_modelling.tex
│   ├── nlp_formal_grammars.tex
│   ├── nlp_lexical_distributional_semantics.tex
│   ├── nlp_morphology.tex
│   ├── nlp_pos_tagging.tex
│   ├── nlp_summarization.tex
│   ├── nlp_summary.tex
│   ├── nlp_textual_entailment_paraphrasing.tex
│   └── nlp_translation.tex
├── README.md
└── Reinforcement_Learning/
    ├── rl_appendix.tex
    ├── rl_introduction.tex
    ├── rl_learning_with_approx.tex
    ├── rl_mcts_alpha_go.tex
    ├── rl_model_based.tex
    ├── rl_partially_observable.tex
    ├── rl_policy_gradient_methods.tex
    ├── rl_summary.tex
    └── rl_tabular_methods.tex
Condensed preview — 88 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (858K chars).
[
  {
    "path": ".gitignore",
    "chars": 53,
    "preview": "*.aux\n*.log\n*.out\n*.synctex.gz\n*.toc\n*.txss\n.DS_Store"
  },
  {
    "path": "Computer_Vision_1/cv_appendix.tex",
    "chars": 12462,
    "preview": "\\section{Practicals}\nGathering some interesting/important questions from the practicals and old exams.\n\\subsection{Color"
  },
  {
    "path": "Computer_Vision_1/cv_applications.tex",
    "chars": 42,
    "preview": "\\section{Applications}\nNot in the exam :-)"
  },
  {
    "path": "Computer_Vision_1/cv_deep_learning.tex",
    "chars": 7711,
    "preview": "\\section{Deep Learning}\n\\begin{itemize}\n\t\\item Deep Neural Networks perform hierarchical feature learning and classifica"
  },
  {
    "path": "Computer_Vision_1/cv_deep_video.tex",
    "chars": 12088,
    "preview": "\\section{Deep Video}\n\\begin{itemize}\n\t\\item Understanding a video requires to analyze spatial and temporal information. "
  },
  {
    "path": "Computer_Vision_1/cv_imgformation.tex",
    "chars": 16904,
    "preview": "\\section{Image formation}\n\\label{sec:img_formation}\n\\begin{itemize}\n\t\\item To fully understand/analyze an image, we firs"
  },
  {
    "path": "Computer_Vision_1/cv_imgprocessing.tex",
    "chars": 9159,
    "preview": "\\section{Image processing}\n\\begin{itemize}\n\t\\item Apply various algorithms on image to analyze/improve the data\n\t\\item T"
  },
  {
    "path": "Computer_Vision_1/cv_intro.tex",
    "chars": 103,
    "preview": "\\section{Introduction}\n\\subsection{Challenges in Computer Vision}\n\\begin{itemize}\n\t\\item \n\\end{itemize}"
  },
  {
    "path": "Computer_Vision_1/cv_object_rec.tex",
    "chars": 10459,
    "preview": "\\section{Object recognition}\n\\begin{itemize}\n\t\\item Challenges in object recognition\n\t\\begin{itemize}\n\t\t\\item Huge dimen"
  },
  {
    "path": "Computer_Vision_1/cv_summary.tex",
    "chars": 1893,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "Deep_Learning/cheat_sheet/main.tex",
    "chars": 24779,
    "preview": "%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n% MatPlotLib and Random Cheat Sheet\n%\n% Edited by Michelle Crist"
  },
  {
    "path": "Deep_Learning/dl_appendix.tex",
    "chars": 140,
    "preview": "% \\section{Neural Network Zoo}\n\n\\begin{figure}[ht!]\n\t\\centering\n\t\\includegraphics[width=0.9\\textwidth]{figures/NN_Zoo_Hi"
  },
  {
    "path": "Deep_Learning/dl_autoregressive.tex",
    "chars": 6025,
    "preview": "\\section{Deep Sequential Models}\n\\subsection{Autoregressive Models}\n\\begin{itemize}\n\t\\item Generative models without lat"
  },
  {
    "path": "Deep_Learning/dl_bayesian.tex",
    "chars": 4949,
    "preview": "\\section{Bayesian Deep Learning}\n\\begin{itemize}\n\t\\item Bayesian machine learning: holding a distribution per latent var"
  },
  {
    "path": "Deep_Learning/dl_convnets.tex",
    "chars": 5662,
    "preview": "\\section{Convolutional Neural Networks}\n\\begin{itemize}\n\t\\item Images are stationary signals with spatial structure and "
  },
  {
    "path": "Deep_Learning/dl_deep_rl.tex",
    "chars": 6271,
    "preview": "\\section{Deep Reinforcement Learning}\n\\subsection{Fundamentals of Reinforcement Learning}\n\\begin{itemize}\n\t\\begin{figure"
  },
  {
    "path": "Deep_Learning/dl_generative_models.tex",
    "chars": 18880,
    "preview": "\\section{Deep Generative Models}\n\\begin{itemize}\n\t\\item \\textit{Generative modeling}: learn the joint probability $p(x,y"
  },
  {
    "path": "Deep_Learning/dl_intro.tex",
    "chars": 361,
    "preview": "\\section{Introduction}\n\\subsubsection{Perceptron}\n\\begin{itemize}\n\t\\item Single perceptron weights every input with a we"
  },
  {
    "path": "Deep_Learning/dl_modularity.tex",
    "chars": 5584,
    "preview": "\\section{Modular Learning}\n\\begin{itemize}\n\t\\item \\textit{Definition}: A family of \\textcolor{green}{parametric}, \\textc"
  },
  {
    "path": "Deep_Learning/dl_optimization.tex",
    "chars": 9611,
    "preview": "\\section{Deep Learning Optimizations}\n\\begin{itemize}\n\t\\item Pure optimization has a very direct goal, namely finding th"
  },
  {
    "path": "Deep_Learning/dl_rnn.tex",
    "chars": 6847,
    "preview": "\\section{Recurrent and Graph Neural Networks}\n\\subsection{Backpropagation through time}\n\\begin{itemize}\n\t\\item Sequences"
  },
  {
    "path": "Deep_Learning/dl_summary.tex",
    "chars": 2363,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "Information_Retrieval_1/ir_boolean_retrieval.tex",
    "chars": 542,
    "preview": "\\section{Boolean Retrieval}\n\\begin{itemize}\n\t\\item \\textbf{Information retrieval} is finding material (usually documents"
  },
  {
    "path": "Information_Retrieval_1/ir_click_models.tex",
    "chars": 4620,
    "preview": "\\section{Click models}\n\\begin{itemize}\n\t\\item User clicks can be used as evaluation of IR systems as clicks indicate the"
  },
  {
    "path": "Information_Retrieval_1/ir_counterfactual_eval.tex",
    "chars": 10208,
    "preview": "\\section{Counterfactual Evaluation and Learning to Rank}\n\\begin{itemize}\n\t\\item The term \\textit{counterfactual} relates"
  },
  {
    "path": "Information_Retrieval_1/ir_language_models.tex",
    "chars": 11322,
    "preview": "\\section{Introduction to Retrieval models}\n\\begin{itemize}\n\t\\item Mathematical framework for defining query-document mat"
  },
  {
    "path": "Information_Retrieval_1/ir_learning_to_rank.tex",
    "chars": 8824,
    "preview": "\\section{Learning to Rank}\n\\begin{itemize}\n\t\\item Main issue in information retrieval is to determine whether document $"
  },
  {
    "path": "Information_Retrieval_1/ir_neural_models.tex",
    "chars": 10130,
    "preview": "\\section{Neural Retrieval Models}\n\\subsection{Distributed Word Representations}\n\\begin{itemize}\n\t\\item Latent, dense vec"
  },
  {
    "path": "Information_Retrieval_1/ir_offline_evaluation.tex",
    "chars": 10622,
    "preview": "\\section{Offline evaluation}\n\\begin{itemize}\n\t\\item Evaluating an IR system without any interaction with user \n\t\\item As"
  },
  {
    "path": "Information_Retrieval_1/ir_online_evaluation.tex",
    "chars": 9377,
    "preview": "\\section{Online evaluation}\n\\begin{itemize}\n\t\\item In online evaluation, the system interacts with the user $\\implies$ u"
  },
  {
    "path": "Information_Retrieval_1/ir_semantic_matching.tex",
    "chars": 11718,
    "preview": "\\section{Semantic matching}\n\\begin{itemize}\n\t\\item \\textit{Vocabulary gap}: query and document might use different lexic"
  },
  {
    "path": "Information_Retrieval_1/ir_summary.tex",
    "chars": 2000,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "Knowledge_Representation/kr_csp.tex",
    "chars": 8169,
    "preview": "\\section{Constraint Satisfaction Problems}\n\\begin{itemize}\n\t\\item Knowledge Representation is focused on qualitative rea"
  },
  {
    "path": "Knowledge_Representation/kr_dl.tex",
    "chars": 16987,
    "preview": "\\section{Description Logic}\n\\begin{itemize}\n\t\\item Description Logic is the logic for ontologies\n\t\\item Is more expressi"
  },
  {
    "path": "Knowledge_Representation/kr_intro.tex",
    "chars": 3650,
    "preview": "\\section{Introduction to KR}\n\\begin{itemize}\n\t\\item There are two main lines of development in AI: \\textit{symbolic} and"
  },
  {
    "path": "Knowledge_Representation/kr_qr.tex",
    "chars": 8682,
    "preview": "\\section{Qualitative Reasoning}\n\\begin{itemize}\n\t\\item Learning by making (qualitative) representations, combine meaning"
  },
  {
    "path": "Knowledge_Representation/kr_sat.tex",
    "chars": 15994,
    "preview": "\\section{Satisfiability solvers}\n\\subsection{Propositional Logic}\n\\begin{itemize}\n\t\\item In Knowledge Representation, we"
  },
  {
    "path": "Knowledge_Representation/kr_summary.tex",
    "chars": 1780,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "LICENSE",
    "chars": 1070,
    "preview": "MIT License\n\nCopyright (c) 2022 Phillip Lippe\n\nPermission is hereby granted, free of charge, to any person obtaining a c"
  },
  {
    "path": "ML4QS/mlqs_clustering.tex",
    "chars": 9867,
    "preview": "\\section{Clustering}\n\\begin{itemize}\n\t\\item Using the features engineered before to cluster instances\n\t\\item Two differe"
  },
  {
    "path": "ML4QS/mlqs_feature_engineering.tex",
    "chars": 4925,
    "preview": "\\section{Feature Engineering}\n\\label{sec:chapter_4_feature_engineering}\n\\begin{itemize}\n\t\\item Create useful features fr"
  },
  {
    "path": "ML4QS/mlqs_intro.tex",
    "chars": 3923,
    "preview": "\\section{Introduction}\n\\label{sec:chapter_1_2_introduction}\n\\subsection{Definitions}\n\\begin{itemize}\n\t\\item The quantifi"
  },
  {
    "path": "ML4QS/mlqs_modeling_with_time.tex",
    "chars": 8789,
    "preview": "\\section{Predictive Modeling with Notion of Time}\n\\subsection{Time Series}\n\\begin{itemize}\n\t\\item Understanding the peri"
  },
  {
    "path": "ML4QS/mlqs_modeling_without_time.tex",
    "chars": 1665,
    "preview": "\\section{Predictive Modeling without Notion of Time}\n\\begin{itemize}\n\t\\item Any predictor that does not explicitly take "
  },
  {
    "path": "ML4QS/mlqs_reinforcement_learning.tex",
    "chars": 3261,
    "preview": "\\section{Reinforcement Learning}\n\\begin{itemize}\n\t\\item RL for ML4QS to learn from interactions with user and influencin"
  },
  {
    "path": "ML4QS/mlqs_sensory_noise.tex",
    "chars": 7460,
    "preview": "\\section{Handling Sensory Noise}\n\\label{sec:chapter_3_sensory_noise}\n\\subsection{Outlier Detection}\n\\begin{itemize}\n\t\\it"
  },
  {
    "path": "ML4QS/mlqs_summary.tex",
    "chars": 2463,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "ML4QS/mlqs_supervised_learning.tex",
    "chars": 3841,
    "preview": "\\section{Supervised Learning}\n\\begin{itemize}\n\t\\item The perspective on supervised learning in this course is summarized"
  },
  {
    "path": "Machine_Learning_1/ml_appendix.tex",
    "chars": 7401,
    "preview": "\\section{Appendix: Foundations}\n\\subsection{Important functions}\n\\subsubsection{Rectified Linear Unit}\nProperties of the"
  },
  {
    "path": "Machine_Learning_1/ml_basic_probability.tex",
    "chars": 1525,
    "preview": "\\section{Probability Theory}\n\\subsection{Multivariate Gaussian}\n$$\\mathcal{N}\\left(\\bm{x}|\\bm{\\mu}, \\bm{\\Sigma}\\right) ="
  },
  {
    "path": "Machine_Learning_1/ml_combining_models.tex",
    "chars": 6875,
    "preview": "\\section{Combining models}\n\\begin{itemize}\n\t\\item Improve performance by combining different models\n\t\\item For example, "
  },
  {
    "path": "Machine_Learning_1/ml_kernel_methods.tex",
    "chars": 20090,
    "preview": "\\section{Kernel methods}\n\\begin{itemize}\n\t\\item Standard parametric models have either fixed basis functions (like linea"
  },
  {
    "path": "Machine_Learning_1/ml_linear_classification.tex",
    "chars": 24887,
    "preview": "\\section{Linear classification}\n\\begin{itemize}\n\t\\item Input $\\bm{x}=\\left(x_1, x_2, ..., x_D\\right)^T$ with $\\bm{x}\\in\\"
  },
  {
    "path": "Machine_Learning_1/ml_linear_regression.tex",
    "chars": 28752,
    "preview": "\\section{Linear Regression}\n\n\\subsection{Basic approaches}\n\\subsubsection{Maximum likelihood}\n\\begin{itemize}\n\t\\item Giv"
  },
  {
    "path": "Machine_Learning_1/ml_neural_networks.tex",
    "chars": 11351,
    "preview": "\\section{Neural Networks}\n\\begin{itemize}\n\t\\item Previously: fixed basis function $\\bm{\\phi}(\\bm{x}) = \\left(\\phi_0\\left"
  },
  {
    "path": "Machine_Learning_1/ml_summary.tex",
    "chars": 1869,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "Machine_Learning_1/ml_unsupervised_learning.tex",
    "chars": 17150,
    "preview": "\\section{Unsupervised learning}\n\\begin{itemize}\n\t\\item We can express our data distribution by marginalizing latent vari"
  },
  {
    "path": "Machine_Learning_2/ml2_appendix.tex",
    "chars": 6563,
    "preview": "\\section{Appendix Math}\nHere we revisit some important mathematical tricks and equations to know.  \n\\subsection{Useful p"
  },
  {
    "path": "Machine_Learning_2/ml2_causality.tex",
    "chars": 12156,
    "preview": "\\section{Causality}\n\\begin{itemize}\n\t\\item Causality is about testing whether one event (\\textit{effect}) is the result "
  },
  {
    "path": "Machine_Learning_2/ml2_exponential_family.tex",
    "chars": 17697,
    "preview": "\\section{Introduction to popular distributions and their properties}\n\\begin{itemize}\n\t\\item This section (lecture 1 and "
  },
  {
    "path": "Machine_Learning_2/ml2_graphical_models.tex",
    "chars": 37866,
    "preview": "\\section{Probabilistic graphical models}\n\\begin{itemize}\n\t\\item It is often beneficial to visualize a probabilistic mode"
  },
  {
    "path": "Machine_Learning_2/ml2_graphical_models.tex.recover.bak~",
    "chars": 1898,
    "preview": "\\section{Probabilistic graphical models}\n\\begin{itemize}\n\t\\item It is often beneficial to visualize a probabilistic mode"
  },
  {
    "path": "Machine_Learning_2/ml2_sampling_methods.tex",
    "chars": 19214,
    "preview": "\\section{Sampling methods}\n\\begin{itemize}\n\t\\item In the previous chapter, we have seen that we can perform inference by"
  },
  {
    "path": "Machine_Learning_2/ml2_sequential_data.tex",
    "chars": 20376,
    "preview": "\\section{Sequential Data}\n\\begin{itemize}\n\t\\item Most models we discussed so far assumed that multiple data points $\\bm{"
  },
  {
    "path": "Machine_Learning_2/ml2_summary.tex",
    "chars": 2727,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "Machine_Learning_2/ml2_variational_EM.tex",
    "chars": 18130,
    "preview": "\\section{Variational Expectation Maximization}\n\n\\begin{itemize}\n\t\\item The expectation maximization algorithm can be vie"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_bayesian.tex",
    "chars": 98,
    "preview": "% \\section{Foundations of Bayesian NLP}\n% \\textbf{Foundations of Bayesian NLP is not in the exam.}"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_compositional_semantic.tex",
    "chars": 9862,
    "preview": "\\section{Compositional semantics and discourse processing}\n% \\subsection{Compositional semantics}\n\\begin{itemize}\n\t\\item"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_dialog_modelling.tex",
    "chars": 2745,
    "preview": "\\section{Computational Dialog Modeling}\n\\subsection{Modular dialog systems}\n\\begin{itemize}\n\t\\item There are two main ta"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_formal_grammars.tex",
    "chars": 9729,
    "preview": "\\section{Formal grammars and syntactic parsing}\n\\begin{itemize}\n\t\\item Syntax: structure of sentence, parsing syntax to "
  },
  {
    "path": "Natural_Language_Processing_1/nlp_lexical_distributional_semantics.tex",
    "chars": 15920,
    "preview": "\\section{Lexical and distributional semantics}\n\\begin{itemize}\n\t\\item \\textbf{Compositional semantics}: meaning of phras"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_morphology.tex",
    "chars": 4916,
    "preview": "\\section{Morphology and finite state techniques}\n\\begin{itemize}\n\t\\item Morphology concerns the \\textbf{structure of wor"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_pos_tagging.tex",
    "chars": 6526,
    "preview": "\\section{Language models and part-of-speech tagging}\n\\subsection{Probabilistic language modeling}\n\\begin{itemize}\n\t\\item"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_summarization.tex",
    "chars": 8320,
    "preview": "\\section{Language generation and summarization}\n\\begin{itemize}\n\t\\item Most tasks/methods until now have concentrated on"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_summary.tex",
    "chars": 1991,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_textual_entailment_paraphrasing.tex",
    "chars": 6047,
    "preview": "\\section{Textual Entailment and Paraphrasing}\n\\begin{itemize}\n\t\\item Textual entailment is defined as a directional rela"
  },
  {
    "path": "Natural_Language_Processing_1/nlp_translation.tex",
    "chars": 4755,
    "preview": "\\section{Machine Translation}\n\\subsection{Statistical Machine Translation}\n\\begin{itemize}\n\t\\item Given a sentence $f$ i"
  },
  {
    "path": "README.md",
    "chars": 1474,
    "preview": "# Summaries of Master AI at UvA\n\nIn this repository, I collect all my summaries I created during my studies in the Maste"
  },
  {
    "path": "Reinforcement_Learning/rl_appendix.tex",
    "chars": 1931,
    "preview": "\\section{Deep RL in practice}\n\\textit{This section reviews the lecture slides 10 (last half).}\n\\begin{itemize}\n\t\\item Th"
  },
  {
    "path": "Reinforcement_Learning/rl_introduction.tex",
    "chars": 12807,
    "preview": "\\section{Introduction to Reinforcement Learning}\n\\textit{This section reviews the lecture slides 1 and 2 (until Monte Ca"
  },
  {
    "path": "Reinforcement_Learning/rl_learning_with_approx.tex",
    "chars": 23090,
    "preview": "\\section{Value-based RL: Learning with approximation}\n\\label{sec:value_based_approximation}\n\\textit{This section reviews"
  },
  {
    "path": "Reinforcement_Learning/rl_mcts_alpha_go.tex",
    "chars": 581,
    "preview": "%\\section{Monte-Carlo Tree Search and Alpha Go}\n%\\label{sec:MCTS_Alpha_Go}\n%\\textit{This section reviews the lecture sli"
  },
  {
    "path": "Reinforcement_Learning/rl_model_based.tex",
    "chars": 18741,
    "preview": "\\section{Model-based Reinforcement Learning}\n\\label{sec:model_based}\n\\textit{This section reviews the lecture slides 11 "
  },
  {
    "path": "Reinforcement_Learning/rl_partially_observable.tex",
    "chars": 10442,
    "preview": "\\section{Partially observable environments and Bayesian methods}\n\\label{sec:partially_observable}\n\\textit{This section r"
  },
  {
    "path": "Reinforcement_Learning/rl_policy_gradient_methods.tex",
    "chars": 31633,
    "preview": "\\section{Policy gradient methods}\n\\label{sec:policy_learning}\n\\textit{This section reviews the lecture slides 7, 8, 9 an"
  },
  {
    "path": "Reinforcement_Learning/rl_summary.tex",
    "chars": 2805,
    "preview": "\\documentclass[a4paper]{article} \n\\addtolength{\\hoffset}{-2.25cm}\n\\addtolength{\\textwidth}{4.5cm}\n\\addtolength{\\voffset}"
  },
  {
    "path": "Reinforcement_Learning/rl_tabular_methods.tex",
    "chars": 25038,
    "preview": "\\section{Value-based RL: Tabular Methods}\n\\textit{This section reviews the lecture slides 2 (Monte Carlo), 3 and 4.}\n\\su"
  }
]

// ... and 1 more file (download for full content)
