Repository: alirezadir/Machine-Learning-Interviews
Branch: main
Commit: 164d43a85d86
Files: 45
Total size: 355.8 KB
Directory structure:
gitextract_pa4rkxdb/
├── LICENSE
├── README.md
└── src/
    ├── MLC/
    │   ├── ml-coding.md
    │   └── notebooks/
    │       ├── .test.ipynb
    │       ├── convolution.ipynb
    │       ├── decision_tree.ipynb
    │       ├── feedforward.ipynb
    │       ├── k_means.ipynb
    │       ├── k_means_2.ipynb
    │       ├── k_nearest_neighbors.ipynb
    │       ├── knn.ipynb
    │       ├── linear_regression.ipynb
    │       ├── linear_regression_md.ipynb
    │       ├── logistic_regression.ipynb
    │       ├── logistic_regression_md.ipynb
    │       ├── numpy_practice.ipynb
    │       ├── perceptron.ipynb
    │       ├── softmax.ipynb
    │       ├── svm.ipynb
    │       └── ww_classifier.ipynb
    ├── MLSD/
    │   ├── ml-companies.md
    │   ├── ml-system-design.md
    │   ├── mlsd-ads-ranking.md
    │   ├── mlsd-av.md
    │   ├── mlsd-event-recom.md
    │   ├── mlsd-feature-eng.md
    │   ├── mlsd-game-recom.md
    │   ├── mlsd-harmful-content.md
    │   ├── mlsd-image-search.md
    │   ├── mlsd-metrics.md
    │   ├── mlsd-mm-video-search.md
    │   ├── mlsd-modeling-popular-archs.md
    │   ├── mlsd-newsfeed.md
    │   ├── mlsd-prediction.md
    │   ├── mlsd-preprocessing.md
    │   ├── mlsd-pymk.md
    │   ├── mlsd-search.md
    │   ├── mlsd-template.md
    │   ├── mlsd-typeahead.md
    │   ├── mlsd-video-recom.md
    │   └── mlsd_obj_detection.md
    ├── behavior.md
    ├── lc-coding.md
    ├── ml-depth.md
    └── ml-fundamental.md
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2021 Alireza Dirafzoon
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# Machine Learning Technical Interviews :robot:
:newspaper: **News: Updated in 2025**: I have added a new repo for [Agentic AI Systems](https://github.com/alirezadir/Agentic-AI-Systems.git), including the latest trends in AI engineering and agentic systems design and development, for those who are interested. You can find a variety of resources, system design summaries, and hands-on coding examples, projects, and more.
This repo aims to serve as a guide for preparing for **Machine Learning (AI) Engineering** interviews for relevant roles at big tech companies (in particular FAANG). It has been compiled from the author's personal experience and notes from his own interview preparation, during which he received offers from Meta (ML Specialist), Google (ML Engineer), Amazon (Applied Scientist), Apple (Applied Scientist), and Roku (ML Engineer).
The following components are the most commonly used interview modules for technical ML roles at different companies. We will go through them one by one and share how one can prepare:
|Chapter | Content|
|---| --- |
| Chapter 1 | [General Coding (Algos and Data Structures)](src/lc-coding.md) |
| Chapter 2 | [ML Coding](src/MLC/ml-coding.md) |
| Chapter 3 | [ML Fundamentals/Breadth](src/ml-fundamental.md)|
| Chapter 4 | [ML System Design (Updated in 2023)](src/MLSD/ml-system-design.md)|
| Chapter 5 | [*Agentic AI Systems (2025)*](https://github.com/alirezadir/Agentic-AI-Systems.git)|
| Chapter 6 | [Behavioral](src/behavior.md)|
**Notes:**
* At the time I'm putting these notes together, machine learning interviews at different companies do not follow a single standard structure, unlike software engineering interviews. However, I found some of the components to be very similar to each other, although under different names.
* The guide here is mostly focused on *Machine Learning Engineer* (and Applied Scientist) roles at big companies. Although related roles such as "Data Science" or "ML research scientist" have different interview structures, some of the modules reviewed here can still be useful.
* As a supplementary resource, you can also refer to my [Production Level Deep Learning](https://github.com/alirezadir/Production-Level-Deep-Learning) repo for further insights on how to design deep learning systems for production.
# Contribution
* Feedback and contribution are very welcome :blush:
**If you'd like to contribute**, please make a pull request with your suggested changes.
================================================
FILE: src/MLC/ml-coding.md
================================================
# 2. ML/Data Coding :robot:
An ML coding module may or may not be part of a particular company's interviews. The good news is that there are only a limited number of ML algorithms that candidates are expected to be able to code. The most common ones include:
## ML Algorithms
- Linear regression ([code](./notebooks/linear_regression.ipynb)) :white_check_mark:
- Logistic regression ([code](./notebooks/logistic_regression.ipynb)) :white_check_mark:
- K-means clustering ([code](./notebooks/k_means.ipynb)) :white_check_mark:
- K-nearest neighbors ([code 1](./notebooks/knn.ipynb) - [code 2](https://github.com/MahanFathi/CS231/blob/master/assignment1/cs231n/classifiers/k_nearest_neighbor.py)) :white_check_mark:
- Decision trees ([code](./notebooks/decision_tree.ipynb)) :white_check_mark:
- Linear SVM ([code](./notebooks/svm.ipynb))
- Neural networks
  - Perceptron ([code](./notebooks/perceptron.ipynb))
  - FeedForward NN ([code](./notebooks/feedforward.ipynb))
  - Softmax ([code](./notebooks/softmax.ipynb))
  - Convolution ([code](./notebooks/convolution.ipynb))
  - CNN
  - RNN
## Sampling
- stratified sampling ([link](https://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know-43c7bc11d17c))
- uniform sampling
- reservoir sampling
- sampling multinomial distribution
- random generator
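
As an illustration of the reservoir sampling item above, here is a minimal sketch of the classic single-pass approach (often called Algorithm R). The function name `reservoir_sample` and the optional `rng` parameter are assumptions made for this example and do not come from the linked article.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # the first k items fill the reservoir directly
            reservoir.append(item)
        else:
            # item i (0-based) is kept with probability k / (i + 1):
            # draw j uniformly from 0..i and keep the item only if j < k
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 5)
print(sample)
```

The stream is read once and only `k` items are held in memory, which is why this variant tends to come up when the input is too large to store or its length is not known in advance.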
## NLP algorithms
- bigrams
- tf-idf
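
Both items above can be sketched in a few lines of plain Python. This is only one possible formulation: the helper names `bigrams` and `tf_idf` are made up for this example, and the weighting used here is the unsmoothed variant tf = count / document length, idf = log(N / df). Libraries typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
print(bigrams(docs[0]))
print(tf_idf(docs))
```

With this unsmoothed idf, a term that occurs in every document gets weight zero, which is the intended effect: such terms do not help distinguish documents.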
## Other
- Random int in range ([link1](https://leetcode.com/discuss/interview-question/125347/generate-uniform-random-integer), [link2](https://leetcode.com/articles/implement-rand10-using-rand7/))
- Triangle closing
- Meeting point
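
For the random-integer item in the list above, a common approach is rejection sampling. The sketch below builds a uniform `rand10()` from a uniform `rand7()`; here `rand7` is simulated with the standard library as a stand-in for whatever generator the interviewer provides, and it is not claimed to match the linked solutions line by line.

```python
import random


def rand7():
    # stand-in for the given generator: uniform integer in 1..7
    return random.randint(1, 7)


def rand10():
    """Uniform integer in 1..10 built only from calls to rand7()."""
    while True:
        # two calls give a uniform value in 1..49
        x = (rand7() - 1) * 7 + rand7()
        # keep only 1..40, so each of the 10 outcomes has exactly 4 sources;
        # reject 41..49 and try again
        if x <= 40:
            return (x - 1) % 10 + 1


print([rand10() for _ in range(10)])
```

Rejecting 41..49 is what keeps the result exactly uniform; taking `x % 10` over all 49 values would favor some outcomes over others.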
## Sample codes
- You can find some sample code under [Notebooks](./notebooks/).
================================================
FILE: src/MLC/notebooks/.test.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Kmeans"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class KMeans:\n",
"    def __init__(self, k, max_it=100):\n",
"        self.k = k\n",
"        self.max_it = max_it\n",
"        self.centroids = None\n",
"\n",
"    def fit(self, X):\n",
"        # initialise centroids with k distinct points drawn from X\n",
"        self.centroids = X[np.random.choice(X.shape[0], size=self.k, replace=False)]\n",
"        for _ in range(self.max_it):\n",
"            # assign each point to its closest centroid;\n",
"            # dist has shape (n_samples, k) via broadcasting\n",
"            dist = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)\n",
"            clusters = np.argmin(dist, axis=1)\n",
"\n",
"            # keep a copy: the update below modifies self.centroids in place\n",
"            pre_centroids = self.centroids.copy()\n",
"\n",
"            # update centroids (mean of each non-empty cluster)\n",
"            for k in range(self.k):\n",
"                cluster_X = X[clusters == k]\n",
"                if len(cluster_X) > 0:\n",
"                    self.centroids[k] = np.mean(cluster_X, axis=0)\n",
"\n",
"            # check convergence: stop once the centroids no longer move\n",
"            if np.array_equal(self.centroids, pre_centroids):\n",
"                break\n",
"\n",
"        self.clusters = clusters\n",
"\n",
"    def predict(self, X):\n",
"        # label each point with the index of its nearest centroid\n",
"        clusters = []\n",
"        for j in range(len(X)):\n",
"            dist = np.linalg.norm(X[j] - self.centroids, axis=1)\n",
"            clusters.append(np.argmin(dist))\n",
"        return clusters\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 0, 0, 0, 0, 1, 1, 1, 1, 1]\n",
"[[ 4.62131563 5.38818365]\n",
" [-4.47889882 -4.71564167]]\n"
]
}
],
"source": [
"x1 = np.random.randn(5,2) + 5 \n",
"x2 = np.random.randn(5,2) - 5\n",
"X = np.concatenate([x1,x2], axis=0)\n",
"\n",
"\n",
"kmeans = KMeans(k=2)\n",
"kmeans.fit(X)\n",
"clusters = kmeans.predict(X)\n",
"print(clusters)\n",
"print(kmeans.centroids)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from matplotlib import pyplot as plt \n",
"\n",
"colors = ['b', 'r']\n",
"for k in range(kmeans.k):\n",
"    plt.scatter(X[np.where(np.array(clusters) == k)][:,0], \n",
"                X[np.where(np.array(clusters) == k)][:,1], \n",
"                color=colors[k])\n",
"plt.scatter(kmeans.centroids[:,0], kmeans.centroids[:,1], color='black')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 1, 2)"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X[:, np.newaxis] "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### KNN"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(100, 2) (100,)\n",
"[0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0.]\n",
"[0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1.]\n"
]
}
],
"source": [
"import numpy as np \n",
"from collections import Counter\n",
"class KNN:\n",
"    def __init__(self, k):\n",
"        self.k = k\n",
"\n",
"    def fit(self, X, y):\n",
"        # KNN is lazy: fitting just stores the training data\n",
"        self.X = X\n",
"        self.y = y\n",
"\n",
"    def predict(self, X_test):\n",
"        y_pred = []\n",
"        for x in X_test:\n",
"            # Euclidean distance from x to every training point\n",
"            dist = np.linalg.norm(x - self.X, axis=1)\n",
"            knn_idcs = np.argsort(dist)[:self.k]\n",
"            knn_labels = self.y[knn_idcs]\n",
"            # majority vote among the k nearest neighbours\n",
"            label = Counter(knn_labels).most_common(1)[0][0]\n",
"            y_pred.append(label)\n",
"        return np.array(y_pred)\n",
"\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"x1 = np.random.randn(50,2) + 1\n",
"x2 = np.random.randn(50,2) - 1\n",
"X = np.concatenate([x1, x2], axis=0)\n",
"y = np.concatenate([np.ones(50), np.zeros(50)])\n",
"print(X.shape, y.shape)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n",
"\n",
"\n",
"knn = KNN(k=5)\n",
"knn.fit(X_train, y_train)\n",
"y_pred = knn.predict(X_test)\n",
"print(y_pred)\n",
"print(y_test)\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(40, 2)"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.shape"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 0.])"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.zeros(2,)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 1., 1., 0., 0., 0.])"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.concatenate([np.ones(3), np.zeros(3)])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lin Regression "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class LinearRegression:\n",
"    def __init__(self):\n",
"        self.m = None\n",
"        self.b = None\n",
"\n",
"    def fit(self, X, y):\n",
"        pass  # not implemented yet\n",
"\n",
"    def predict(self, X):\n",
"        pass"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/convolution.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Convolution "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1D convolution "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def convolve(signal, kernel):\n",
"    # 1D sliding-window product with zero padding; the kernel is not\n",
"    # flipped, so strictly speaking this computes a cross-correlation\n",
"    output = []\n",
"    kernel_size = len(kernel)\n",
"    padding = kernel_size // 2  # assume zero padding\n",
"    padded_signal = [0] * padding + signal + [0] * padding\n",
"\n",
"    for i in range(padding, len(signal) + padding):\n",
"        total = 0  # named total to avoid shadowing the built-in sum()\n",
"        for j in range(kernel_size):\n",
"            total += kernel[j] * padded_signal[i - padding + j]\n",
"        output.append(total)\n",
"\n",
"    return output\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-2, -2, -2, -2, -2, 5]\n"
]
}
],
"source": [
"signal = [1, 2, 3, 4, 5, 6]\n",
"kernel = [1, 0, -1]\n",
"output = convolve(signal, kernel)\n",
"print(output)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2D convolution (multi-channel image) "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def convolution(image, kernel):\n",
"    # get the size of the input image and kernel\n",
"    (image_height, image_width, image_channels) = image.shape\n",
"    (kernel_height, kernel_width, kernel_channels) = kernel.shape\n",
"\n",
"    # calculate the padding needed for 'same' convolution\n",
"    # (exact only for odd kernel sizes)\n",
"    pad_h = (kernel_height - 1) // 2\n",
"    pad_w = (kernel_width - 1) // 2\n",
"\n",
"    # pad the input image with zeros\n",
"    padded_image = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w), (0, 0)), 'constant')\n",
"\n",
"    # create an empty output tensor\n",
"    output_height = image_height\n",
"    output_width = image_width\n",
"    output_channels = kernel_channels\n",
"    output = np.zeros((output_height, output_width, output_channels))\n",
"\n",
"    # perform the convolution operation\n",
"    # NOTE: kernel[:, :, k] is 2D, so NumPy broadcasts it against the trailing\n",
"    # (width, channel) axes of the 3D patch rather than (height, width)\n",
"    for i in range(output_height):\n",
"        for j in range(output_width):\n",
"            for k in range(output_channels):\n",
"                output[i, j, k] = np.sum(kernel[:, :, k] * padded_image[i:i+kernel_height, j:j+kernel_width, :])\n",
"\n",
"    return output\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input image:\n",
"[[[ 1 2]\n",
" [ 3 4]]\n",
"\n",
" [[ 5 6]\n",
" [ 7 8]]\n",
"\n",
" [[ 9 10]\n",
" [11 12]]]\n",
"\n",
"Kernel:\n",
"[[[ 1 0]\n",
" [ 0 -1]]\n",
"\n",
" [[ 0 1]\n",
" [-1 0]]]\n",
"\n",
"Output:\n",
"[[[-6. 2.]\n",
" [-2. -2.]]\n",
"\n",
" [[-6. 2.]\n",
" [-2. -2.]]\n",
"\n",
" [[-3. 1.]\n",
" [-1. -1.]]]\n"
]
}
],
"source": [
"# create an example image and kernel\n",
"image = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])\n",
"kernel = np.array([[[1, 0], [0, -1]], [[0, 1], [-1, 0]]])\n",
"\n",
"# perform the convolution operation\n",
"output = convolution(image, kernel)\n",
"\n",
"print('Input image:')\n",
"print(image)\n",
"\n",
"print('\\nKernel:')\n",
"print(kernel)\n",
"\n",
"print('\\nOutput:')\n",
"print(output)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/decision_tree.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A decision tree is a type of machine learning algorithm used for classification and regression tasks. It consists of a tree-like structure where each internal node represents a feature or attribute, each branch represents a decision based on that feature, and each leaf node represents a predicted output.\n",
"\n",
"To **train** a decision tree, the algorithm uses a dataset with labeled examples to create the tree structure. It starts with the root node, which includes all the examples, and selects the feature that provides the most information gain to split the data into two subsets. It then repeats this process for each subset until it reaches a stopping criterion, such as a maximum tree depth or minimum number of examples in a leaf node.\n",
"\n",
"Once the decision tree is trained, it can be used to **predict** the output for new, unseen examples. To make a prediction, the algorithm starts at the root node and follows the branches based on the values of the input features until it reaches a leaf node. The predicted output for that example is the value associated with the leaf node.\n",
"\n",
"Decision trees have several advantages, such as being easy to interpret and visualize and handling both numerical and categorical data; some implementations can also handle missing values. However, they can also suffer from overfitting if the tree is too complex or if there is noise or outliers in the data. \n",
"\n",
"To address this issue, various techniques such as pruning, ensemble methods, and regularization (for example, limiting the tree depth) can be used to simplify the decision tree or combine multiple trees to improve generalization performance. Additionally, decision trees may not perform well with highly imbalanced datasets or datasets with many irrelevant features, and a single tree can be a poor fit when the output varies smoothly with the features (for example, a linear trend), because axis-aligned splits can only approximate such a relationship with a step function."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class DecisionTree:\n",
"    def __init__(self, max_depth=None):\n",
"        self.max_depth = max_depth\n",
"\n",
"    def fit(self, X, y):\n",
"        # class labels are assumed to be the integers 0..n_classes-1\n",
"        self.n_classes_ = len(np.unique(y))\n",
"        self.n_features_ = X.shape[1]\n",
"        self.tree_ = self._grow_tree(X, y)\n",
"\n",
"    def predict(self, X):\n",
"        return [self._predict(inputs) for inputs in X]\n",
"\n",
"    def _gini(self, y):\n",
"        _, counts = np.unique(y, return_counts=True)\n",
"        impurity = 1 - np.sum([(count / len(y)) ** 2 for count in counts])\n",
"        return impurity\n",
"\n",
"    def _best_split(self, X, y):\n",
"        m = y.size\n",
"        if m <= 1:\n",
"            return None, None\n",
"\n",
"        num_parent = [np.sum(y == c) for c in range(self.n_classes_)]\n",
"        best_gini = 1.0 - sum((n / m) ** 2 for n in num_parent)\n",
"        best_idx, best_thr = None, None\n",
"\n",
"        for idx in range(self.n_features_):\n",
"            # sort samples by this feature and sweep the split point left to right\n",
"            thresholds, classes = zip(*sorted(zip(X[:, idx], y)))\n",
"            num_left = [0] * self.n_classes_\n",
"            num_right = num_parent.copy()\n",
"            for i in range(1, m):\n",
"                c = classes[i - 1]\n",
"                num_left[c] += 1\n",
"                num_right[c] -= 1\n",
"                gini_left = 1.0 - sum(\n",
"                    (num_left[x] / i) ** 2 for x in range(self.n_classes_)\n",
"                )\n",
"                gini_right = 1.0 - sum(\n",
"                    (num_right[x] / (m - i)) ** 2 for x in range(self.n_classes_)\n",
"                )\n",
"                # weighted average impurity of the two children\n",
"                gini = (i * gini_left + (m - i) * gini_right) / m\n",
"                if thresholds[i] == thresholds[i - 1]:\n",
"                    continue\n",
"                if gini < best_gini:\n",
"                    best_gini = gini\n",
"                    best_idx = idx\n",
"                    best_thr = (thresholds[i] + thresholds[i - 1]) / 2\n",
"\n",
"        return best_idx, best_thr\n",
"\n",
"    def _grow_tree(self, X, y, depth=0):\n",
"        num_samples_per_class = [np.sum(y == i) for i in range(self.n_classes_)]\n",
"        predicted_class = np.argmax(num_samples_per_class)\n",
"        node = Node(predicted_class=predicted_class)\n",
"        # max_depth=None (the default) means no depth limit\n",
"        if self.max_depth is None or depth < self.max_depth:\n",
"            idx, thr = self._best_split(X, y)\n",
"            if idx is not None:\n",
"                indices_left = X[:, idx] < thr\n",
"                X_left, y_left = X[indices_left], y[indices_left]\n",
"                X_right, y_right = X[~indices_left], y[~indices_left]\n",
"                node.feature_index = idx\n",
"                node.threshold = thr\n",
"                node.left = self._grow_tree(X_left, y_left, depth + 1)\n",
"                node.right = self._grow_tree(X_right, y_right, depth + 1)\n",
"        return node\n",
"\n",
"    def _predict(self, inputs):\n",
"        # walk from the root to a leaf, following the split at each node\n",
"        node = self.tree_\n",
"        while node.left:\n",
"            if inputs[node.feature_index] < node.threshold:\n",
"                node = node.left\n",
"            else:\n",
"                node = node.right\n",
"        return node.predicted_class\n",
"\n",
"class Node:\n",
"    def __init__(self, *, predicted_class):\n",
"        self.predicted_class = predicted_class\n",
"        self.feature_index = 0\n",
"        self.threshold = 0.0\n",
"        self.left = None\n",
"        self.right = None\n",
"\n",
"    def is_leaf_node(self):\n",
"        return self.left is None and self.right is None\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 1.0\n"
]
}
],
"source": [
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Load the iris dataset\n",
"iris = load_iris()\n",
"X = iris.data\n",
"y = iris.target\n",
"\n",
"# Split the data into training and testing sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Train the decision tree\n",
"tree = DecisionTree(max_depth=3)\n",
"tree.fit(X_train, y_train)\n",
"\n",
"# Make predictions on the test set\n",
"y_pred = tree.predict(X_test)\n",
"\n",
"# Compute the accuracy of the predictions\n",
"accuracy = accuracy_score(y_test, y_pred)\n",
"\n",
"print(f\"Accuracy: {accuracy}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/feedforward.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Forward propagation:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"```python\n",
"Z1 = X.dot(W1) + b1\n",
"A1 = np.maximum(0, Z1)  # ReLU\n",
"Z2 = A1.dot(W2) + b2\n",
"exp_scores = np.exp(Z2)\n",
"probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # softmax\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Backward propagation:\n",
"\n",
"```python \n",
"delta3 = probs.copy()\n",
"delta3[range(len(X)), y] -= 1  # gradient of the cross-entropy loss w.r.t. Z2\n",
"dW2 = A1.T.dot(delta3)\n",
"db2 = np.sum(delta3, axis=0)\n",
"delta2 = delta3.dot(W2.T) * (A1 > 0)  # derivative of ReLU\n",
"dW1 = X.T.dot(delta2)\n",
"db1 = np.sum(delta2, axis=0)\n",
"\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Here:\n",
"\n",
"- `X` is the input data matrix of shape (num_samples, input_size).\n",
"- `W1` is the weight matrix connecting the input layer to the hidden layer, of shape (input_size, hidden_size).\n",
"- `b1` is the bias vector for the hidden layer, of shape (hidden_size,).\n",
"- `A1` is the output of the hidden layer (also known as the hidden representation), of shape (num_samples, hidden_size).\n",
"- `W2` is the weight matrix connecting the hidden layer to the output layer, of shape (hidden_size, output_size).\n",
"- `b2` is the bias vector for the output layer, of shape (output_size,).\n",
"- `Z2` is the weighted sum of the hidden layer output.\n",
"- `exp_scores` is the exponential of the output layer weighted sum.\n",
"- `probs` is the output probability for each class.\n",
"- `y` is the true label vector of shape (num_samples,).\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class TwoLayerNet:\n",
"    def __init__(self, input_size, hidden_size, output_size):\n",
"        # random weights, zero biases\n",
"        self.params = {}\n",
"        self.params['W1'] = np.random.randn(input_size, hidden_size)\n",
"        self.params['b1'] = np.zeros(hidden_size)\n",
"        self.params['W2'] = np.random.randn(hidden_size, output_size)\n",
"        self.params['b2'] = np.zeros(output_size)\n",
"\n",
"    def forward(self, X):\n",
"        W1, b1 = self.params['W1'], self.params['b1']\n",
"        W2, b2 = self.params['W2'], self.params['b2']\n",
"        z1 = np.dot(X, W1) + b1\n",
"        a1 = np.maximum(0, z1)  # ReLU activation function\n",
"        z2 = np.dot(a1, W2) + b2\n",
"        # probs = 1 / (1 + np.exp(-z2))  # Sigmoid activation function\n",
"        # softmax; subtracting the row max avoids overflow in exp\n",
"        exp_z = np.exp(z2 - np.max(z2, axis=1, keepdims=True))\n",
"        probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)\n",
"        return probs\n",
"\n",
"    def loss(self, X, y):\n",
"        # mean cross-entropy over the batch\n",
"        probs = self.forward(X)\n",
"        correct_logprobs = -np.log(probs[range(len(X)), y])\n",
"        data_loss = np.sum(correct_logprobs)\n",
"        return 1.0/len(X) * data_loss\n",
"\n",
"    def train(self, X, y, num_epochs, learning_rate=0.1):\n",
"        for epoch in range(num_epochs):\n",
"            # Forward propagation\n",
"            z1 = np.dot(X, self.params['W1']) + self.params['b1']\n",
"            a1 = np.maximum(0, z1)  # ReLU activation function\n",
"            z2 = np.dot(a1, self.params['W2']) + self.params['b2']\n",
"            # probs = 1 / (1 + np.exp(-z2))  # Sigmoid activation function\n",
"            exp_z = np.exp(z2 - np.max(z2, axis=1, keepdims=True))\n",
"            probs = exp_z / np.sum(exp_z, axis=1, keepdims=True)\n",
"\n",
"            # Backpropagation\n",
"            # NOTE: gradients are summed over the batch (not averaged), so the\n",
"            # effective step size grows with the number of samples\n",
"            delta3 = probs\n",
"            delta3[range(len(X)), y] -= 1\n",
"            dW2 = np.dot(a1.T, delta3)\n",
"            db2 = np.sum(delta3, axis=0)\n",
"            delta2 = np.dot(delta3, self.params['W2'].T) * (a1 > 0)  # derivative of ReLU\n",
"            dW1 = np.dot(X.T, delta2)\n",
"            db1 = np.sum(delta2, axis=0)\n",
"\n",
"            # Update parameters\n",
"            self.params['W1'] -= learning_rate * dW1\n",
"            self.params['b1'] -= learning_rate * db1\n",
"            self.params['W2'] -= learning_rate * dW2\n",
"            self.params['b2'] -= learning_rate * db2\n",
"\n",
"            # Print loss for monitoring training progress\n",
"            if epoch % 100 == 0:\n",
"                loss = self.loss(X, y)\n",
"                print(\"Epoch {}: loss = {}\".format(epoch, loss))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This code defines a TwoLayerNet class with an initializer that takes the input size, hidden size, and output size as arguments. The weights of the two layers are drawn from a standard normal distribution and the biases are initialized to zero.\n",
"\n",
"The forward function takes an input X and performs the forward propagation to calculate the output probabilities for each class.\n",
"\n",
"The loss function calculates the cross-entropy loss between the predicted probabilities and the true labels y.\n",
"\n",
"The train function runs full-batch gradient descent: in each epoch it performs a forward pass over the whole dataset, backpropagates to compute the gradients, and updates the weights and biases. The number of training epochs and the learning rate can be specified as arguments to this function."
]
},
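{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to gain confidence in the backpropagation step is a numerical gradient check. The sketch below re-implements the same ReLU + softmax network as standalone functions and compares the analytic gradients with central finite differences. The helper names (forward_loss, analytic_grads, numerical_grad), the tiny random problem, and the step size h are assumptions made for illustration. It differentiates the mean loss, so its gradients equal the summed gradients used in train divided by len(X).\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def forward_loss(params, X, y):\n",
"    # Mean cross-entropy of a ReLU + softmax two-layer net; also returns a1 and probs\n",
"    a1 = np.maximum(0, X @ params['W1'] + params['b1'])\n",
"    z2 = a1 @ params['W2'] + params['b2']\n",
"    z2 = z2 - z2.max(axis=1, keepdims=True)\n",
"    probs = np.exp(z2) / np.exp(z2).sum(axis=1, keepdims=True)\n",
"    loss = -np.log(probs[np.arange(len(X)), y]).mean()\n",
"    return loss, a1, probs\n",
"\n",
"def analytic_grads(params, X, y):\n",
"    # Same backpropagation equations as train, scaled by 1 / len(X)\n",
"    _, a1, probs = forward_loss(params, X, y)\n",
"    delta3 = probs.copy()\n",
"    delta3[np.arange(len(X)), y] -= 1\n",
"    delta3 /= len(X)\n",
"    delta2 = (delta3 @ params['W2'].T) * (a1 > 0)\n",
"    return {'W1': X.T @ delta2, 'b1': delta2.sum(axis=0),\n",
"            'W2': a1.T @ delta3, 'b2': delta3.sum(axis=0)}\n",
"\n",
"def numerical_grad(params, name, X, y, h=1e-5):\n",
"    # Central finite differences, one parameter entry at a time\n",
"    grad = np.zeros_like(params[name])\n",
"    for idx in np.ndindex(*params[name].shape):\n",
"        old = params[name][idx]\n",
"        params[name][idx] = old + h\n",
"        loss_plus = forward_loss(params, X, y)[0]\n",
"        params[name][idx] = old - h\n",
"        loss_minus = forward_loss(params, X, y)[0]\n",
"        params[name][idx] = old  # restore the original value\n",
"        grad[idx] = (loss_plus - loss_minus) / (2 * h)\n",
"    return grad\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X = rng.normal(size=(5, 3))\n",
"y = rng.integers(0, 2, size=5)\n",
"params = {'W1': rng.normal(size=(3, 4)), 'b1': np.zeros(4),\n",
"          'W2': rng.normal(size=(4, 2)), 'b2': np.zeros(2)}\n",
"\n",
"grads = analytic_grads(params, X, y)\n",
"max_err = max(np.abs(grads[name] - numerical_grad(params, name, X, y)).max()\n",
"              for name in params)\n",
"print(max_err)\n",
"```\n",
"\n",
"A very small maximum difference suggests the analytic gradients are consistent with the loss. A large one usually points to a sign, transpose, or scaling error in the backward pass."
]
},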
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's an example of how to use the TwoLayerNet class to train and test the network on a toy dataset:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 0: loss = 0.8791617000548932\n",
"Epoch 100: loss = 0.03272609589944909\n",
"Epoch 200: loss = 0.010130354895034843\n",
"Epoch 300: loss = 0.005517334222420798\n",
"Epoch 400: loss = 0.0036701620853277555\n",
"Epoch 500: loss = 0.002707635703438397\n",
"Epoch 600: loss = 0.0021206045443387493\n",
"Epoch 700: loss = 0.0017317523015295431\n",
"Epoch 800: loss = 0.0014568091215886065\n",
"Epoch 900: loss = 0.0012539964886349238\n",
"Predictions: [0 1 1 0]\n"
]
}
],
"source": [
"# Generate a toy dataset\n",
"X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n",
"y = np.array([0, 1, 1, 0])\n",
"\n",
"# Initialize a neural network\n",
"net = TwoLayerNet(input_size=2, hidden_size=10, output_size=2)\n",
"\n",
"# Train the neural network\n",
"net.train(X, y, num_epochs=1000)\n",
"\n",
"# Test the neural network\n",
"probs = net.forward(X)\n",
"predictions = np.argmax(probs, axis=1)\n",
"print(\"Predictions: \", predictions)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Improvements \n",
"\n",
"There are several ways to improve the implementation of a two-layer neural network with softmax. Here are a few suggestions:\n",
"\n",
"1. Weight initialization: The current implementation initializes the weights randomly using a Gaussian distribution. However, it is recommended to use other weight initialization methods such as Xavier or He initialization to improve convergence and avoid vanishing or exploding gradients. One possible implementation for Xavier initialization of the weights is:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
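"# Xavier-style initialization: scale each weight matrix by 1 / sqrt(fan_in)\n",
"# (for ReLU layers, He initialization scales by np.sqrt(2.0 / fan_in) instead)\n",
"self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size)\n",
"self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size)"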
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Learning rate decay: The learning rate is a hyperparameter that determines the step size at each iteration during training. However, using a fixed learning rate may lead to suboptimal performance or slow convergence. A common technique is to gradually decrease the learning rate over time, known as learning rate decay, to fine-tune the network weights as the optimization process progresses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
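"# Learning rate decay\n",
"learning_rate = 0.1\n",
"lr_decay = 0.95\n",
"lr_decay_epoch = 100\n",
"for epoch in range(num_epochs):\n",
"    # ...\n",
"    # Decay every lr_decay_epoch epochs (skip epoch 0 so training starts at the full rate)\n",
"    if epoch > 0 and epoch % lr_decay_epoch == 0:\n",
"        learning_rate *= lr_decay"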
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Regularization: Overfitting can occur when the model is too complex and the training data is limited. Regularization techniques such as L1 or L2 regularization can be applied to the loss function to prevent overfitting and improve the generalization performance of the model.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
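"# L2 regularization: add the penalty to the loss and its gradient to each weight gradient\n",
"reg_lambda = 0.1\n",
"data_loss += 0.5 * reg_lambda * (np.sum(self.params['W1'] ** 2) + np.sum(self.params['W2'] ** 2))\n",
"dW1 += reg_lambda * self.params['W1']\n",
"dW2 += reg_lambda * self.params['W2']"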
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"4. Mini-batch training: The current implementation updates the weights using the entire training set at each iteration, which can be computationally expensive for large datasets. An alternative is to use mini-batch training, where a random subset of the training data is used at each iteration to update the weights. This can speed up the training process and improve convergence."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mini-batch training\n",
"batch_size = 64\n",
"for epoch in range(num_epochs):\n",
"    # Shuffle once per epoch so that every sample is visited exactly once\n",
"    perm = np.random.permutation(len(X))\n",
"    for start in range(0, len(X), batch_size):\n",
"        # Select the next batch of data (the last batch may be smaller)\n",
"        batch_idx = perm[start:start + batch_size]\n",
"        X_batch = X[batch_idx]\n",
"        y_batch = y[batch_idx]\n",
"\n",
"        # Forward and backward propagation using the batch data\n",
"        # ...\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"5. Optimization algorithm: The current implementation uses plain gradient descent on the full training set. Other optimization algorithms such as Adam, Adagrad, and RMSprop adapt the step size for each parameter and can improve the convergence speed and performance of the network."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
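"# Adam optimization (applied to the weights; the biases use plain gradient descent here)\n",
"beta1, beta2 = 0.9, 0.999\n",
"eps = 1e-8\n",
"mW1, vW1 = 0, 0\n",
"mW2, vW2 = 0, 0\n",
"for epoch in range(num_epochs):\n",
"    # Forward and backward propagation\n",
"    # ...\n",
"    # Update the biased first and second moment estimates\n",
"    mW1 = beta1 * mW1 + (1 - beta1) * dW1\n",
"    vW1 = beta2 * vW1 + (1 - beta2) * (dW1 ** 2)\n",
"    mW2 = beta1 * mW2 + (1 - beta1) * dW2\n",
"    vW2 = beta2 * vW2 + (1 - beta2) * (dW2 ** 2)\n",
"    # Correct the bias toward zero caused by initializing the moments at 0\n",
"    t = epoch + 1\n",
"    mW1_hat, vW1_hat = mW1 / (1 - beta1 ** t), vW1 / (1 - beta2 ** t)\n",
"    mW2_hat, vW2_hat = mW2 / (1 - beta1 ** t), vW2 / (1 - beta2 ** t)\n",
"    # Update parameters using the bias-corrected estimates\n",
"    self.params['W1'] -= learning_rate * mW1_hat / (np.sqrt(vW1_hat) + eps)\n",
"    self.params['b1'] -= learning_rate * db1\n",
"    self.params['W2'] -= learning_rate * mW2_hat / (np.sqrt(vW2_hat) + eps)\n",
"    self.params['b2'] -= learning_rate * db2\n"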
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other extensions: "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"* Arbitrary activation function \n",
"* Arbitrary loss function \n",
"* Extension to multiple layers"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"\n",
"import numpy as np\n",
"\n",
"class ActivationFunction:\n",
"    def __call__(self, x):\n",
"        raise NotImplementedError\n",
"\n",
"    def derivative(self, x):\n",
"        raise NotImplementedError\n",
"\n",
"class ReLU(ActivationFunction):\n",
"    def __call__(self, x):\n",
"        return np.maximum(0, x)\n",
"\n",
"    def derivative(self, x):\n",
"        return (x > 0).astype(float)\n",
"\n",
"class Sigmoid(ActivationFunction):\n",
"    # Used by the test cell below\n",
"    def __call__(self, x):\n",
"        return 1.0 / (1.0 + np.exp(-x))\n",
"\n",
"    def derivative(self, x):\n",
"        s = self(x)\n",
"        return s * (1.0 - s)\n",
"\n",
"class Softmax(ActivationFunction):\n",
"    def __call__(self, x):\n",
"        exp_scores = np.exp(x - np.max(x, axis=1, keepdims=True))\n",
"        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)\n",
"        return probs\n",
"\n",
"    def derivative(self, x):\n",
"        raise NotImplementedError\n",
"\n",
"class MultiLayerNet:\n",
"    def __init__(self, input_size, hidden_sizes, output_size, activation_function, loss_function, reg_lambda=0.0):\n",
"        self.params = {}\n",
"        self.num_layers = 1 + len(hidden_sizes)\n",
"        self.layer_sizes = [input_size] + hidden_sizes + [output_size]\n",
"\n",
"        for i in range(1, self.num_layers + 1):\n",
"            self.params[f'W{i}'] = np.random.randn(self.layer_sizes[i-1], self.layer_sizes[i]) / np.sqrt(self.layer_sizes[i-1])\n",
"            self.params[f'b{i}'] = np.zeros(self.layer_sizes[i])\n",
"\n",
"        self.activation_function = activation_function\n",
"        self.activation_function_derivatives = [activation_function.derivative for _ in range(self.num_layers)]\n",
"        self.loss_function = loss_function\n",
"        self.reg_lambda = reg_lambda\n",
"\n",
"    def forward(self, X):\n",
"        layer_output = X\n",
"        self.layer_inputs = []     # pre-activations of layers 1..num_layers\n",
"        self.layer_outputs = [X]   # activations, with the input X at index 0\n",
"\n",
"        for i in range(1, self.num_layers + 1):\n",
"            W, b = self.params[f'W{i}'], self.params[f'b{i}']\n",
"            layer_input = np.dot(layer_output, W) + b\n",
"            self.layer_inputs.append(layer_input)\n",
"            layer_output = self.activation_function(layer_input)\n",
"            self.layer_outputs.append(layer_output)\n",
"\n",
"        return layer_output\n",
"\n",
"    def backward(self, X, y, output):\n",
"        # Using (output - y) as the output-layer error treats it as the gradient with respect\n",
"        # to the output pre-activation, which holds for a sigmoid (or softmax) output trained\n",
"        # with cross-entropy.\n",
"        y = np.asarray(y).reshape(output.shape)  # avoid broadcasting (n, 1) against (n,)\n",
"        delta = (output - y) / X.shape[0]\n",
"        dW = {}\n",
"        db = {}\n",
"\n",
"        for i in reversed(range(1, self.num_layers + 1)):\n",
"            dW[f'W{i}'] = np.dot(self.layer_outputs[i-1].T, delta) + self.reg_lambda * self.params[f'W{i}']\n",
"            db[f'b{i}'] = np.sum(delta, axis=0)\n",
"            if i > 1:\n",
"                # Propagate the error to layer i-1 using the derivative at that layer's pre-activation\n",
"                activation_derivative = self.activation_function_derivatives[i-2](self.layer_inputs[i-2])\n",
"                delta = np.dot(delta, self.params[f'W{i}'].T) * activation_derivative\n",
"\n",
"        return dW, db\n",
"\n",
"    def loss(self, X, y, output):\n",
"        data_loss = self.loss_function(output, y)\n",
"        reg_loss = 0.0\n",
"\n",
"        for i in range(1, self.num_layers + 1):\n",
"            reg_loss += 0.5 * self.reg_lambda * np.sum(self.params[f'W{i}'] ** 2)\n",
"\n",
"        total_loss = data_loss + reg_loss\n",
"        return total_loss\n",
"\n",
"    def train(self, X, y, num_epochs, learning_rate=0.1):\n",
"        for epoch in range(num_epochs):\n",
"            # Forward propagation\n",
"            output = self.forward(X)\n",
"\n",
"            # Backward propagation\n",
"            dW, db = self.backward(X, y, output)\n",
"\n",
"            # Update parameters\n",
"            for i in range(1, self.num_layers + 1):\n",
"                self.params[f'W{i}'] -= learning_rate * dW[f'W{i}']\n",
"                self.params[f'b{i}'] -= learning_rate * db[f'b{i}']\n",
"\n",
"            # Print loss for monitoring training progress\n",
"            if epoch % 100 == 0:\n",
"                loss = self.loss(X, y, output)\n",
"                print(f\"Epoch {epoch}, loss: {loss}\")\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Generate a toy classification dataset\n",
"X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)\n",
"\n",
"# Split the dataset into training and testing sets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"# Normalize the input data using statistics from the training set only\n",
"mean = X_train.mean(axis=0)\n",
"std = X_train.std(axis=0)\n",
"X_train = (X_train - mean) / std\n",
"X_test = (X_test - mean) / std\n",
"\n",
"\n",
"# Define the mean squared error loss function\n",
"def mse_loss(output, y):\n",
"    # Reshape y to match the (n, 1) output so the subtraction does not broadcast to (n, n)\n",
"    return np.mean((output - np.asarray(y).reshape(output.shape)) ** 2)\n",
"\n",
"# Create a multi-layer neural network with 2 hidden layers\n",
"net = MultiLayerNet(input_size=10, hidden_sizes=[20, 10], output_size=1,\n",
"                    activation_function=Sigmoid(), loss_function=mse_loss, reg_lambda=0.01)\n",
"\n",
"# Train the network for 1000 epochs\n",
"net.train(X_train, y_train, num_epochs=1000, learning_rate=0.01)\n",
"\n",
"# Evaluate the trained network on the test set\n",
"output = net.forward(X_test)\n",
"predicted_classes = np.round(output).ravel()  # flatten (n, 1) to (n,) before comparing\n",
"accuracy = np.mean(predicted_classes == y_test)\n",
"print(f\"Test accuracy: {accuracy}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/k_means.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "functional-corrections",
"metadata": {},
"source": [
"## K-means "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "109c1cfe",
"metadata": {},
"source": [
"K-means clustering is a popular unsupervised machine learning algorithm that partitions a given dataset into k (predefined) clusters of similar data points.\n",
"\n",
"The k-means algorithm works by first randomly initializing k cluster centers, one for each cluster. Each data point in the dataset is then assigned to the nearest cluster center. The distance metric is typically Euclidean distance; variants of the algorithm use other measures, such as Manhattan distance (k-medians) or cosine distance (spherical k-means).\n",
"\n",
"After all the data points have been assigned to a cluster, the algorithm calculates the new mean for each cluster by taking the average of all the data points assigned to that cluster. These new means become the new cluster centers. The algorithm then repeats the assignment and mean calculation steps until the cluster assignments no longer change or until a maximum number of iterations is reached.\n",
"\n",
"The final output of the k-means algorithm is a set of k clusters, where each cluster contains the data points that are most similar to each other based on the distance metric used. The algorithm is commonly used in various fields such as image segmentation, market segmentation, and customer profiling.\n",
"\n",
"\n",
"```\n",
"Initialize:\n",
"- K: number of clusters\n",
"- Data: the input dataset\n",
"- Randomly select K initial centroids\n",
"\n",
"Repeat:\n",
"- Assign each data point to the nearest centroid (based on Euclidean distance)\n",
"- Calculate the mean of each cluster to update its centroid\n",
"- Check if the centroids have converged (i.e., they no longer change)\n",
"\n",
"Until:\n",
"- The centroids have converged, or\n",
"- The maximum number of iterations has been reached\n",
"\n",
"Output:\n",
"- The final K clusters and their corresponding centroids\n",
"```\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "36cafa73",
"metadata": {},
"source": [
"## Code \n",
"Here's an implementation of the k-means clustering algorithm in Python from scratch:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "ab3cb277",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class KMeans:\n",
"    def __init__(self, k, max_iterations=100):\n",
"        self.k = k\n",
"        self.max_iterations = max_iterations\n",
"\n",
"    def fit(self, X):\n",
"        # Initialize centroids randomly (as floats, so that mean updates are not truncated)\n",
"        self.centroids = X[np.random.choice(len(X), self.k, replace=False)].astype(float)\n",
"\n",
"        for i in range(self.max_iterations):\n",
"            # Assign each data point to the nearest centroid\n",
"            cluster_assignments = self.predict(X)\n",
"            previous_centroids = np.copy(self.centroids)\n",
"\n",
"            # Update centroids (an empty cluster keeps its previous centroid)\n",
"            for c in range(self.k):\n",
"                cluster_data_points = X[np.array(cluster_assignments) == c]\n",
"                if len(cluster_data_points) > 0:\n",
"                    self.centroids[c] = np.mean(cluster_data_points, axis=0)\n",
"\n",
"            # Check for convergence: stop once the centroids no longer change\n",
"            if np.array_equal(self.centroids, previous_centroids):\n",
"                break\n",
"\n",
"        # Store the final cluster assignments\n",
"        self.cluster_assignments = cluster_assignments\n",
"\n",
"    def predict(self, X):\n",
"        # Assign each data point to the nearest centroid\n",
"        cluster_assignments = []\n",
"        for j in range(len(X)):\n",
"            distances = np.linalg.norm(X[j] - self.centroids, axis=1)\n",
"            cluster_assignments.append(np.argmin(distances))\n",
"\n",
"        return cluster_assignments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "538027c3",
"metadata": {},
"source": [
"The KMeans class has an `__init__` method that takes the number of clusters (k) and the maximum number of iterations to run (max_iterations). The fit method takes the input dataset (X) and runs the k-means clustering algorithm. The predict method takes a new dataset (X) and returns the cluster assignments for each data point based on the centroids learned during training.\n",
"\n",
"Note that this implementation assumes that the input dataset X is a NumPy array with each row representing a single data point and each column representing a feature. The algorithm also uses Euclidean distance to calculate the distances between data points and centroids.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1724d308",
"metadata": {},
"source": [
"### Test "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "141e9843",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]\n",
"[[-5.53443211 -5.13920695]\n",
" [ 4.46522152 5.04931144]]\n"
]
}
],
"source": [
"\n",
"x1 = np.random.randn(5,2) + 5\n",
"x2 = np.random.randn(5,2) - 5\n",
"X = np.concatenate([x1,x2], axis=0)\n",
"\n",
"# Initialize the KMeans object with k=2\n",
"kmeans = KMeans(k=2)\n",
"\n",
"# Fit the k-means model to the dataset\n",
"kmeans.fit(X)\n",
"\n",
"# Get the cluster assignments for the input dataset\n",
"cluster_assignments = kmeans.predict(X)\n",
"\n",
"# Print the cluster assignments\n",
"print(cluster_assignments)\n",
"\n",
"# Print the learned centroids\n",
"print(kmeans.centroids)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "04430ff9",
"metadata": {},
"source": [
"### Visualize"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "fa0fb8d4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"from matplotlib import pyplot as plt\n",
"# Plot the data points with different colors based on their cluster assignments\n",
"colors = ['r', 'b']\n",
"assignments = np.array(cluster_assignments)\n",
"for i in range(kmeans.k):\n",
"    mask = assignments == i\n",
"    plt.scatter(X[mask, 0], X[mask, 1], color=colors[i])\n",
"\n",
"# Plot the centroids as black circles\n",
"plt.scatter(kmeans.centroids[:,0], kmeans.centroids[:,1], color='black', marker='o')\n",
"\n",
"# Show the plot\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "69fc2d74",
"metadata": {},
"source": [
"### Optimization \n",
"Here are some ways to optimize the k-means clustering algorithm:\n",
"\n",
"Random initialization of centroids: Rather than taking the first k data points, we select k distinct data points at random from the input dataset as the initial centroids (the implementation above already does this).\n",
"\n",
"Early stopping: We can stop the k-means algorithm as soon as the centroids move by less than a small tolerance between two consecutive iterations. This avoids unnecessary computation.\n",
"\n",
"Vectorization: We can use NumPy broadcasting to compute all point-to-centroid distances at once. This replaces the Python loop over data points and makes the code more efficient.\n",
"\n",
"Here's an optimized version of the k-means clustering algorithm that implements these optimizations:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "121e7b70",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class KMeans:\n",
"    def __init__(self, k=3, max_iters=100, tol=1e-4):\n",
"        self.k = k\n",
"        self.max_iters = max_iters\n",
"        self.tol = tol\n",
"\n",
"    def fit(self, X):\n",
"        # Initialize centroids randomly from k distinct data points\n",
"        self.centroids = X[np.random.choice(X.shape[0], self.k, replace=False)]\n",
"\n",
"        # Iterate until convergence or maximum number of iterations is reached\n",
"        for i in range(self.max_iters):\n",
"            # Assign each data point to the closest centroid\n",
"            cluster_assignments = self.predict(X)\n",
"\n",
"            # Update the centroids; an empty cluster keeps its previous centroid\n",
"            # (taking the mean of an empty selection would produce NaN)\n",
"            new_centroids = np.array([X[cluster_assignments == j].mean(axis=0)\n",
"                                      if np.any(cluster_assignments == j) else self.centroids[j]\n",
"                                      for j in range(self.k)])\n",
"\n",
"            # Check for convergence: total centroid movement below the tolerance\n",
"            shift = np.linalg.norm(new_centroids - self.centroids)\n",
"            self.centroids = new_centroids\n",
"            if shift < self.tol:\n",
"                break\n",
"\n",
"    def predict(self, X):\n",
"        # Assign each data point to the closest centroid\n",
"        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)\n",
"        cluster_assignments = np.argmin(distances, axis=1)\n",
"\n",
"        return cluster_assignments\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0a8514c5",
"metadata": {},
"source": [
"This optimized version initializes the centroids randomly, uses vectorized operations for computing distances and updating the centroids, and checks for convergence after each iteration to stop the algorithm if it has converged."
]
},
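{
"cell_type": "markdown",
"id": "kmeans-broadcast-sketch",
"metadata": {},
"source": [
"The vectorized distance computation relies on NumPy broadcasting, which can be easier to follow on a tiny example. The three points and two centroids below are made-up values chosen so that the result is easy to verify by hand.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])  # 3 points, 2 features\n",
"centroids = np.array([[0.0, 1.0], [10.0, 10.0]])      # 2 centroids\n",
"\n",
"# X[:, np.newaxis] has shape (3, 1, 2); subtracting the (2, 2) centroids\n",
"# broadcasts to (3, 2, 2): one difference vector per (point, centroid) pair\n",
"diff = X[:, np.newaxis] - centroids\n",
"distances = np.linalg.norm(diff, axis=2)    # shape (3, 2)\n",
"assignments = np.argmin(distances, axis=1)  # nearest centroid per point\n",
"\n",
"print(diff.shape, distances.shape)  # (3, 2, 2) (3, 2)\n",
"print(assignments)                  # [0 0 1]\n",
"```\n",
"\n",
"The intermediate array holds n * k * d values, so for very large datasets the distances may need to be computed in chunks to limit memory use."
]
},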
{
"attachments": {},
"cell_type": "markdown",
"id": "a98d4ac5",
"metadata": {},
"source": [
"Follow ups:\n",
"\n",
"* Computational complexity: O(it * k * n * d) for it iterations, k clusters, n data points, and d dimensions\n",
"* Improve space: use index instead of copy\n",
"* Improve time: \n",
" * dim reduction\n",
" * subsample (cons?)\n",
"* mini-batch\n",
"* k-median https://mmuratarat.github.io/2019-07-23/kmeans_from_scratch"
]
},
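{
"cell_type": "markdown",
"id": "kmeans-minibatch-sketch",
"metadata": {},
"source": [
"The mini-batch follow-up can be sketched as below. This is a minimal illustration rather than a tuned implementation: the function name mini_batch_kmeans, its default arguments, and the per-centroid step size of 1 / count (which makes each centroid a running mean of the points assigned to it) are assumptions made for this example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def mini_batch_kmeans(X, k, batch_size=32, num_iters=100, seed=0):\n",
"    rng = np.random.default_rng(seed)\n",
"    # Start from k distinct data points\n",
"    centroids = X[rng.choice(len(X), k, replace=False)].astype(float)\n",
"    counts = np.zeros(k)  # number of points each centroid has absorbed so far\n",
"\n",
"    for _ in range(num_iters):\n",
"        # Sample a random batch (with replacement) and assign it to the nearest centroids\n",
"        batch = X[rng.choice(len(X), batch_size)]\n",
"        dists = np.linalg.norm(batch[:, np.newaxis] - centroids, axis=2)\n",
"        nearest = np.argmin(dists, axis=1)\n",
"\n",
"        # Move each centroid toward its assigned points with a shrinking step size\n",
"        for x, c in zip(batch, nearest):\n",
"            counts[c] += 1\n",
"            eta = 1.0 / counts[c]\n",
"            centroids[c] = (1 - eta) * centroids[c] + eta * x\n",
"\n",
"    return centroids\n",
"\n",
"# Two well-separated blobs around (5, 5) and (-5, -5)\n",
"rng = np.random.default_rng(1)\n",
"X = np.concatenate([rng.normal(size=(100, 2)) + 5, rng.normal(size=(100, 2)) - 5])\n",
"centroids = mini_batch_kmeans(X, k=2)\n",
"print(centroids)\n",
"```\n",
"\n",
"Each iteration costs O(batch_size * k * d) instead of O(n * k * d), at the price of noisier centroid estimates."
]
},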
{
"cell_type": "markdown",
"id": "a756163a",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
================================================
FILE: src/MLC/notebooks/k_means_2.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"id": "functional-corrections",
"metadata": {},
"source": [
"## K-means with multi-dimensional data\n",
" \n",
"$X_{n \\times d}$"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "formal-antique",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "durable-horse",
"metadata": {},
"outputs": [],
"source": [
"n, d, k=1000, 20, 4\n",
"max_itr=100"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "egyptian-omaha",
"metadata": {},
"outputs": [],
"source": [
"X=np.random.random((n,d))"
]
},
{
"cell_type": "markdown",
"id": "employed-helen",
"metadata": {},
"source": [
"$$ \\arg\\min_j \\lVert x_i - c_j \\rVert_2 $$"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "center-timer",
"metadata": {},
"outputs": [],
"source": [
"def k_means(X, k):\n",
"    # Randomly initialize centroids\n",
"    np.random.seed(0)\n",
"    C = X[np.random.randint(n, size=k), :]\n",
"    E = float('inf')  # np.float was removed in NumPy 1.24; use the built-in float\n",
"    for itr in range(max_itr):\n",
"\n",
"        # Find the distance of each point from the centroids\n",
"        E_prev = E\n",
"        E = 0\n",
"        center_idx = np.zeros(n)\n",
"        for i in range(n):\n",
"            min_d = float('inf')\n",
"            c = 0\n",
"            for j in range(k):\n",
"                d = np.linalg.norm(X[i, :] - C[j, :], 2)\n",
"                if d"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Load the iris dataset\n",
"iris = load_iris()\n",
"\n",
"# Split the data into training and test sets (first two features only, so the data can be plotted in 2D)\n",
"X_train, X_test, y_train, y_test = train_test_split(iris.data[:, :2], iris.target, test_size=0.2, random_state=42)\n",
"\n",
"# Create a KNN classifier with k=5 and euclidean distance\n",
"knn = KNN(k=5, distance='euclidean')\n",
"\n",
"# Train the classifier on the training data\n",
"knn.fit(X_train, y_train)\n",
"\n",
"# Make predictions on the test data\n",
"y_pred = np.asarray(knn.predict(X_test))\n",
"\n",
"# Scatter plot of the test data: colors show the true labels, black crosses mark misclassified points\n",
"fig, ax = plt.subplots()\n",
"scatter1 = ax.scatter(X_test[y_test==0, 0], X_test[y_test==0, 1], c='b', label=iris.target_names[0])\n",
"scatter2 = ax.scatter(X_test[y_test==1, 0], X_test[y_test==1, 1], c='g', label=iris.target_names[1])\n",
"scatter3 = ax.scatter(X_test[y_test==2, 0], X_test[y_test==2, 1], c='r', label=iris.target_names[2])\n",
"wrong = y_pred != y_test\n",
"scatter4 = ax.scatter(X_test[wrong, 0], X_test[wrong, 1], c='k', marker='x', label='Misclassified')\n",
"ax.set_xlabel(iris.feature_names[0])\n",
"ax.set_ylabel(iris.feature_names[1])\n",
"ax.set_title('KNN Classifier Results')\n",
"handles = [scatter1, scatter2, scatter3, scatter4]\n",
"labels = [h.get_label() for h in handles]\n",
"ax.legend(handles=handles, labels=labels)\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/linear_regression.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear Regression \n",
"\n",
"Linear regression is a statistical method used to model the relationship between a dependent variable (often denoted as \"y\") and one or more independent variables (often denoted as \"x\"). The basic idea of linear regression is to find the straight line that best fits the data points in a scatter plot.\n",
"\n",
"The most common form of linear regression is simple linear regression, which models the relationship between two variables:\n",
"\n",
"$y = mx + b$\n",
"\n",
"where y is the dependent variable, x is the independent variable, m is the slope, and b is the intercept. \n",
"\n",
"Given a set of input data $\\{(x_i, y_i)\\}$, the goal of linear regression is to find the values of m and b that best fit the data.\n",
"\n",
"The values of m and b are chosen to minimize the \"sum of squared errors\" (SSE), $\\sum_i (y_i - \\hat{y}_i)^2$, where $\\hat{y}_i = m x_i + b$ is the prediction for $x_i$.\n",
"\n",
"Taking the partial derivatives of the SSE with respect to m and b, setting them equal to 0, and solving for m and b, we get:\n",
"\n",
"$$m = \\frac{\\sum_i (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_i (x_i - \\bar{x})^2}, \\qquad b = \\bar{y} - m \\bar{x}$$\n",
"\n",
"where $\\bar{x}$ and $\\bar{y}$ are the means of x and y.\n",
"\n",
"Multiple linear regression is a more general form of linear regression that models the relationship between multiple independent variables and one dependent variable. The formula for the best-fit hyperplane in multiple linear regression is:\n",
"\n",
"$y = w_0 + w_1 x_1 + w_2 x_2 + \\dots + w_n x_n = x^T W$, where $x = [1, x_1, \\dots, x_n]^T$ and $W = [w_0, w_1, \\dots, w_n]^T$."
]
},
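{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of where the closed-form expressions for m and b come from (a standard least-squares derivation, with $\\bar{x}$ and $\\bar{y}$ written for the sample means), treat the SSE as a function of m and b:\n",
"\n",
"$$S(m, b) = \\sum_i (y_i - m x_i - b)^2$$\n",
"\n",
"Setting the partial derivative with respect to b to zero gives the intercept:\n",
"\n",
"$$\\frac{\\partial S}{\\partial b} = -2 \\sum_i (y_i - m x_i - b) = 0 \\;\\Rightarrow\\; b = \\bar{y} - m \\bar{x}$$\n",
"\n",
"Substituting this b into the condition for m gives the slope:\n",
"\n",
"$$\\frac{\\partial S}{\\partial m} = -2 \\sum_i x_i \\left( (y_i - \\bar{y}) - m (x_i - \\bar{x}) \\right) = 0 \\;\\Rightarrow\\; m = \\frac{\\sum_i (x_i - \\bar{x})(y_i - \\bar{y})}{\\sum_i (x_i - \\bar{x})^2}$$\n",
"\n",
"The last step uses $\\sum_i (x_i - \\bar{x}) = 0$ and $\\sum_i (y_i - \\bar{y}) = 0$, so replacing the factor $x_i$ by $x_i - \\bar{x}$ inside the sums changes nothing. For example, with $x = (1, 2, 3, 4, 5)$ and $y = (2, 4, 5, 4, 5)$ we have $\\bar{x} = 3$ and $\\bar{y} = 4$, the numerator is $6$ and the denominator is $10$, so $m = 0.6$ and $b = 4 - 0.6 \\cdot 3 = 2.2$."
]
},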
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code \n",
"### Simple linear regression \n",
"Here is a basic implementation of simple linear regression in Python using the least squares method:"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class LinearRegression:\n",
"    def __init__(self):\n",
"        self.slope = None\n",
"        self.intercept = None\n",
"\n",
"    def fit(self, X, y):\n",
"        n = len(X)\n",
"        x_mean = np.mean(X)\n",
"        y_mean = np.mean(y)\n",
"        # Accumulate sum((x - x_mean) * (y - y_mean)) and sum((x - x_mean)^2)\n",
"        numerator = 0\n",
"        denominator = 0\n",
"        for i in range(n):\n",
"            numerator += (X[i] - x_mean) * (y[i] - y_mean)\n",
"            denominator += (X[i] - x_mean) ** 2\n",
"        self.slope = numerator / denominator\n",
"        self.intercept = y_mean - self.slope * x_mean\n",
"\n",
"    def predict(self, X):\n",
"        y_pred = []\n",
"        for x in X:\n",
"            y_pred.append(self.slope * x + self.intercept)\n",
"        return y_pred\n"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.6\n",
"2.2\n",
"[2.8000000000000003, 3.4000000000000004, 4.0, 4.6, 5.2]\n"
]
}
],
"source": [
"X = np.array([1, 2, 3, 4, 5])\n",
"y = np.array([2, 4, 5, 4, 5])\n",
"lr = LinearRegression()\n",
"lr.fit(X, y)\n",
"print(lr.slope) # Output: 0.6\n",
"print(lr.intercept) # Output: 2.2\n",
"y_pred = lr.predict(X)\n",
"print(y_pred) # Output: approximately [2.8, 3.4, 4.0, 4.6, 5.2]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vectorized "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"$y = XW$  \n",
"$W = (X^T X)^{-1} X^T y$"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class LinearRegression:\n",
"    def __init__(self):\n",
"        self.W = None\n",
"\n",
"    def fit(self, X, y):\n",
"        '''\n",
"        X: n x d feature matrix, y: n target values\n",
"        '''\n",
"        # Add bias term to X -> [1 X]\n",
"        n = X.shape[0]\n",
"        X = np.hstack([np.ones((n, 1)), X])\n",
"        # Normal equation: W = (X^T X)^{-1} X^T y\n",
"        self.W = np.linalg.inv(X.T @ X) @ X.T @ y\n",
"\n",
"    def predict(self, X):\n",
"        n = X.shape[0]\n",
"        # The bias column must be prepended at prediction time as well\n",
"        X = np.hstack([np.ones((n, 1)), X])\n",
"        return X @ self.W"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[3. 1. 2.]\n",
"[43. 55.]\n"
]
}
],
"source": [
"# Create example input data\n",
"X = np.array([[2, 2], [4, 5], [7, 8]])\n",
"y = np.array([9, 17, 26])\n",
"\n",
"# Fit linear regression model\n",
"lr = LinearRegression()\n",
"lr.fit(X, y)\n",
"print(lr.W) # [3. 1. 2.]\n",
"\n",
"# Make predictions on new data\n",
"X_new = np.array([[10, 11], [13, 14]])\n",
"y_pred = lr.predict(X_new)\n",
"print(y_pred) # Output: [43. 55.]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Improvements \n",
"Here are some improvements that make the linear regression implementation more robust:\n",
"\n",
"1. Add input **validation**: Check that the input arrays X and y have the same length and are not empty.\n",
"\n",
"2. Use NumPy **vectorization**: Instead of looping through the data to calculate the numerator and denominator, use NumPy array operations to perform the calculations in a vectorized way. This makes the code faster and more concise.\n",
"\n",
"3. Add **regularization**: Regularization can help prevent overfitting by adding a penalty term to the cost function. One common regularization technique is L2 regularization, which adds the sum of squares of the coefficients to the cost function. This can be easily added to the code by adding a regularization parameter to the constructor.\n",
"\n",
"4. Use **gradient descent**: For large datasets, calculating the inverse of the matrix in the normal equation can be computationally expensive. To overcome this, we can use gradient descent to minimize the cost function. This can be implemented by adding a method that updates the coefficients iteratively using the gradient descent algorithm.\n",
"\n",
"Here's the updated code that incorporates these improvements:"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"class LinearRegressionGD:\n",
"    def __init__(self, regul=0):\n",
"        self.regul = regul\n",
"        self.W = None\n",
"\n",
"    def fit(self, X, y, lr=0.01, num_iter=1000):\n",
"        # Input validation\n",
"        if len(X) != len(y) or len(X) == 0:\n",
"            raise ValueError(\"X and y must have the same length and cannot be empty\")\n",
"\n",
"        # Add bias term to X -> [1 X]\n",
"        X = np.hstack([np.ones((len(X), 1)), X])\n",
"\n",
"        # Initialize W to zeros\n",
"        self.W = np.zeros(X.shape[1])\n",
"\n",
"        # Use gradient descent to minimize cost function\n",
"        for i in range(num_iter):\n",
"            # Calculate predicted values\n",
"            y_pred = np.dot(X, self.W)\n",
"\n",
"            # Calculate cost function: squared error plus L2 penalty\n",
"            cost = np.sum((y_pred - y) ** 2) + self.regul * np.sum(self.W ** 2)\n",
"\n",
"            # Calculate gradients\n",
"            gradients = 2 * np.dot(X.T, (y_pred - y)) + 2 * self.regul * self.W\n",
"\n",
"            # Update W\n",
"            self.W = self.W - lr * gradients\n",
"\n",
"            # Report the cost every 1000 iterations\n",
"            if i % 1000 == 0:\n",
"                print(cost)\n",
"\n",
"    def predict(self, X):\n",
"        # Add bias term to X\n",
"        X = np.hstack([np.ones((len(X), 1)), X])\n",
"\n",
"        # Calculate predicted values\n",
"        y_pred = np.dot(X, self.W)\n",
"        return y_pred\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test "
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"86.0\n",
"2.8791287270130335\n",
"2.8791287270130344\n",
"2.8791287270130344\n",
"2.8791287270130344\n",
"2.8791287270130344\n",
"2.8791287270130344\n",
"2.8791287270130344\n",
"2.8791287270130344\n",
"2.8791287270130344\n",
"[1.99964292 0.65345474]\n",
"[2.65309766 3.3065524 3.96000714 4.61346188 5.26691662]\n"
]
}
],
"source": [
"X = np.array([[1, 2, 3, 4, 5]]).T\n",
"y = np.array([2, 4, 5, 4, 5])\n",
"lr = LinearRegressionGD(regul=0.1)\n",
"lr.fit(X, y, lr=0.01, num_iter=10000)\n",
"print(lr.W) # Output: [1.99964292 0.65345474]\n",
"y_pred = lr.predict(X)\n",
"print(y_pred) # Output: [2.65309766 3.3065524 3.96000714 4.61346188 5.26691662]\n"
]
},
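{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check on the gradient-descent result, the same L2-regularized cost has a closed-form minimizer, $W = (X^T X + \\lambda I)^{-1} X^T y$. The cell below is a minimal sketch, assuming the bias is penalized together with the other weights (as in the cost used by `LinearRegressionGD`). The helper name `ridge_closed_form` and the `*_demo` variables are made up for this illustration. With `regul=0` it reduces to ordinary least squares."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def ridge_closed_form(X, y, regul=0.0):\n",
"    \"\"\"Hypothetical helper: minimize ||[1 X] W - y||^2 + regul * ||W||^2.\"\"\"\n",
"    # Add bias term to X -> [1 X], mirroring LinearRegressionGD.fit\n",
"    Xb = np.hstack([np.ones((len(X), 1)), X])\n",
"    # Regularized normal equations; the penalty covers every weight, bias included\n",
"    A = Xb.T @ Xb + regul * np.eye(Xb.shape[1])\n",
"    # Solve the linear system rather than forming an explicit inverse\n",
"    return np.linalg.solve(A, Xb.T @ y)\n",
"\n",
"\n",
"X_demo = np.array([[1, 2, 3, 4, 5]]).T\n",
"y_demo = np.array([2, 4, 5, 4, 5])\n",
"W_closed = ridge_closed_form(X_demo, y_demo, regul=0.1)\n",
"print(W_closed)  # should be close to the weights found by gradient descent above"
]
},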
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize "
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt \n",
"\n",
"# Plot the data and the linear regression line\n",
"plt.scatter(X, y, color='blue')\n",
"plt.plot(X, y_pred, color='red')\n",
"plt.xlabel('X')\n",
"plt.ylabel('y')\n",
"plt.title('Linear Regression')\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/linear_regression_md.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear regression in Python with multi-dimensional data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Linear regression with $k$ input features and $p$ output targets, fitted on $n$ samples:\n",
"\n",
"$$ F(X) = XW $$\n",
"$$ C = \\|F(X) - Y\\|_F^2 + \\lambda \\|W\\|_F^2 $$\n",
"\n",
"Here $\\|\\cdot\\|_F^2$ is the squared Frobenius norm (the sum of squared entries), and the matrices have the following shapes:\n",
"\n",
"$X_{n \\times k}$\n",
"\n",
"$W_{k \\times p}$\n",
"\n",
"$Y_{n \\times p}$"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"n, k, p=100, 8, 3 \n",
"X=np.random.random([n,k])\n",
"W=np.random.random([k,p])\n",
"Y=np.random.random([n,p])\n",
"max_itr=1000\n",
"alpha=0.0001\n",
"Lambda=0.01"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the residual $E = F(X) - Y$, the gradient of the cost with respect to $W$ is:\n",
"$$ \\nabla_W C = 2 X^T E + 2 \\lambda W $$"
]
},
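{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of why the gradient has this form, assuming the norms in $C$ are read as sums of squared entries (Frobenius norms) and writing $E = XW - Y$:\n",
"\n",
"$$C = \\mathrm{tr}(E^T E) + \\lambda \\, \\mathrm{tr}(W^T W)$$\n",
"\n",
"A small change $dW$ changes the residual by $dE = X \\, dW$, so\n",
"\n",
"$$dC = 2 \\, \\mathrm{tr}(E^T X \\, dW) + 2 \\lambda \\, \\mathrm{tr}(W^T dW) = \\mathrm{tr}\\left( (2 X^T E + 2 \\lambda W)^T dW \\right)$$\n",
"\n",
"and therefore $\\nabla_W C = 2 X^T E + 2 \\lambda W$. The shapes agree: $X^T$ is $k \\times n$ and $E$ is $n \\times p$, so the gradient is $k \\times p$, the same shape as $W$."
]
},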
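{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# F(X) = X W: predictions for all n samples and p targets at once\n",
"def F(X, W):\n",
"    return np.matmul(X, W)\n",
"\n",
"def cost(Y_est, Y, W, Lambda):\n",
"    # E is the n x p residual matrix; the second return value is a scalar used to monitor progress.\n",
"    # For 2-D arrays np.linalg.norm(A, 2) is the spectral norm (largest singular value), so this\n",
"    # monitor is not the squared-norm cost C itself; np.sum(E**2) + Lambda * np.sum(W**2) would be.\n",
"    E = Y_est - Y\n",
"    return E, np.linalg.norm(E, 2) + Lambda * np.linalg.norm(W, 2)\n",
"\n",
"def gradient(E, X, W, Lambda):\n",
"    # Gradient of the squared-norm cost with respect to W: 2 X^T E + 2 lambda W\n",
"    return 2 * np.matmul(X.T, E) + Lambda * 2 * W"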
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def fit(W, X, Y, alpha, Lambda, max_itr):\n",
"    for i in range(max_itr):\n",
"\n",
"        Y_est = F(X, W)\n",
"        E, c = cost(Y_est, Y, W, Lambda)\n",
"        Wg = gradient(E, X, W, Lambda)\n",
"        # Gradient-descent step with learning rate alpha\n",
"        W = W - alpha * Wg\n",
"        if i % 100 == 0:\n",
"            print(c)\n",
"\n",
"    return W"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To account for the biases, we append a column of ones to X and add one row to W, so the last row of W holds the bias of each output."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"34.3004759224227\n",
"4.265835757989014\n",
"4.052505749060854\n",
"3.8807845759072968\n",
"3.7422281683979812\n",
"3.6303399157863434\n",
"3.5398708528835554\n",
"3.4665749938168915\n",
"3.4070257924246747\n",
"3.3584711183863862\n"
]
}
],
"source": [
"X=np.concatenate( (X, np.ones((n,1))), axis=1 ) \n",
"W=np.concatenate( (W, np.random.random((1,p)) ), axis=0 )\n",
"\n",
"W = fit(W, X, Y, alpha, Lambda, max_itr)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
================================================
FILE: src/MLC/notebooks/logistic_regression.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Logistic Regression \n",
"\n",
"Logistic regression is a statistical method used for binary classification, which means it is used to predict the probability of an event occurring or not. It is a type of generalized linear model that is used when the dependent variable is binary or categorical.\n",
"\n",
"In logistic regression, the dependent variable is binary (i.e., it can take on one of two values, usually 0 or 1), and the independent variables can be either continuous or categorical. The goal of logistic regression is to find the relationship between the independent variables and the dependent variable by estimating the probability of the dependent variable being 1 given the values of the independent variables.\n",
"\n",
"The logistic regression model uses a logistic function (also known as the sigmoid function) to map the input values of the independent variables to a value between 0 and 1, which represents the probability of the dependent variable being 1. The logistic function is defined as:\n",
"\n",
"$$p = \\frac{1}{1 + e^{-z}}$$\n",
"\n",
"where p is the predicted probability of the dependent variable being 1, e is the base of the natural logarithm, and z is the linear combination of the independent variables.\n",
"\n",
"The logistic regression model estimates the values of the coefficients of the independent variables that maximize the likelihood of observing the data given the model (maximum likelihood estimation). There is no closed-form solution, so the likelihood is typically maximized numerically, for example with gradient descent.\n",
"\n",
"Once the model is trained, it can be used to make predictions on new data by inputting the values of the independent variables into the logistic function and obtaining the predicted probability of the dependent variable being 1. The model can then classify the new observation as 1 or 0 based on a threshold probability value that is chosen by the user.\n",
"\n"
]
},
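{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the gradient that gradient descent follows, assuming the cost is the average negative log-likelihood (binary cross-entropy) over $n$ samples. The symbols $w$ (weights), $b$ (bias), $\\sigma$ (sigmoid), and $\\eta$ (learning rate) are notation introduced here, with $z_i = w^T x_i + b$ and $p_i = \\sigma(z_i)$:\n",
"\n",
"$$J(w, b) = -\\frac{1}{n} \\sum_i \\left[ y_i \\log p_i + (1 - y_i) \\log (1 - p_i) \\right]$$\n",
"\n",
"Because $\\sigma'(z) = \\sigma(z) (1 - \\sigma(z))$, the derivative of the $i$-th term with respect to $z_i$ simplifies to $p_i - y_i$, which gives\n",
"\n",
"$$\\nabla_w J = \\frac{1}{n} X^T (p - y), \\qquad \\frac{\\partial J}{\\partial b} = \\frac{1}{n} \\sum_i (p_i - y_i)$$\n",
"\n",
"Each gradient-descent step then updates $w \\leftarrow w - \\eta \\, \\nabla_w J$ and $b \\leftarrow b - \\eta \\, \\partial J / \\partial b$."
]
},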
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code \n",
" Here's an example implementation using gradient descent optimization:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class LogisticRegression:\n",
"\n",
"    def __init__(self, learning_rate=0.01, n_iters=1000):\n",
"        self.learning_rate = learning_rate\n",
"        self.n_iters = n_iters\n",
"        self.weights = None\n",
"        self.bias = None\n",
"\n",
"    def fit(self, X, y):\n",
"        # initialize weights and bias to zeros\n",
"        n_samples, n_features = X.shape\n",
"        self.weights = np.zeros(n_features)\n",
"        self.bias = 0\n",
"\n",
"        # gradient descent optimization\n",
"        for i in range(self.n_iters):\n",
"            # calculate predicted probabilities and cost (the cost is only for monitoring)\n",
"            z = np.dot(X, self.weights) + self.bias\n",
"            y_pred = self._sigmoid(z)\n",
"            cost = (-1 / n_samples) * np.sum(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))\n",
"\n",
"            # calculate gradients\n",
"            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))\n",
"            db = (1 / n_samples) * np.sum(y_pred - y)\n",
"\n",
"            # update weights and bias\n",
"            self.weights -= self.learning_rate * dw\n",
"            self.bias -= self.learning_rate * db\n",
"\n",
"    def predict(self, X):\n",
"        # calculate predicted probabilities\n",
"        z = np.dot(X, self.weights) + self.bias\n",
"        y_pred = self._sigmoid(z)\n",
"        # convert probabilities to binary predictions\n",
"        return np.round(y_pred).astype(int)\n",
"\n",
"    def _sigmoid(self, z):\n",
"        return 1 / (1 + np.exp(-z))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 1]\n"
]
}
],
"source": [
"# create sample dataset\n",
"X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])\n",
"y = np.array([0, 0, 1, 1, 1])\n",
"\n",
"# initialize logistic regression model\n",
"lr = LogisticRegression()\n",
"\n",
"# train model on sample dataset\n",
"lr.fit(X, y)\n",
"\n",
"# make predictions on new data\n",
"X_new = np.array([[6, 7], [7, 8]])\n",
"y_pred = lr.predict(X_new)\n",
"\n",
"print(y_pred) # [1, 1]\n",
"\n"
]
},
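{
"cell_type": "markdown",
"metadata": {},
"source": [
"The description at the top notes that the final class is decided by a user-chosen probability threshold, while `predict` always rounds at 0.5. The cell below is a minimal sketch of an explicit threshold. The names `stable_sigmoid` and `predict_with_threshold` and the demo weights are hypothetical, made up for this illustration; the tanh form of the sigmoid is a substitution that avoids overflow for large negative `z`. Raising the threshold trades fewer false positives for more false negatives."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"\n",
"def stable_sigmoid(z):\n",
"    # Mathematically equal to 1 / (1 + exp(-z)); the tanh form does not overflow for large |z|\n",
"    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))\n",
"\n",
"\n",
"def predict_with_threshold(X, weights, bias, threshold=0.5):\n",
"    # Probability of class 1 for each row of X\n",
"    probs = stable_sigmoid(X @ weights + bias)\n",
"    # Label a sample 1 when its probability reaches the chosen threshold\n",
"    return probs, (probs >= threshold).astype(int)\n",
"\n",
"\n",
"# Hypothetical weights and bias, chosen only for illustration\n",
"w_demo = np.array([1.0, -0.5])\n",
"b_demo = -1.0\n",
"X_demo = np.array([[1.0, 2.0], [4.0, 2.0], [3.0, 1.0]])\n",
"probs, labels = predict_with_threshold(X_demo, w_demo, b_demo, threshold=0.7)\n",
"print(probs)\n",
"print(labels)"
]
},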
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Improvements \n",
"Here are some possible improvements you could make to the code:\n",
"\n",
"1. Add regularization: Regularization can help prevent overfitting and improve the generalization performance of the model. You could add L1 or L2 regularization to the cost function and adjust the regularization strength with a hyperparameter."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Use a more sophisticated optimization algorithm: Full-batch gradient descent is simple and effective, but every update touches the whole dataset, which becomes expensive for large datasets. Stochastic gradient descent (SGD), mini-batch SGD, or Adam use cheaper or adaptive updates and often reach a good solution in less time. The implementation below combines L1/L2 regularization with mini-batch SGD:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class LogisticRegression:\n",
"\n",
"    def __init__(self, learning_rate=0.01, n_iters=1000, regularization='l2', reg_strength=0.1, batch_size=32):\n",
"        self.learning_rate = learning_rate\n",
"        self.n_iters = n_iters\n",
"        self.regularization = regularization\n",
"        self.reg_strength = reg_strength\n",
"        self.batch_size = batch_size\n",
"        self.weights = None\n",
"        self.bias = None\n",
"\n",
"    def fit(self, X, y):\n",
"        n_samples, n_features = X.shape\n",
"        self.weights = np.zeros(n_features)\n",
"        self.bias = 0\n",
"        for i in range(self.n_iters):\n",
"            # sample a mini-batch (with replacement, so batch_size may exceed n_samples)\n",
"            batch_indices = np.random.choice(n_samples, self.batch_size)\n",
"            X_batch = X[batch_indices]\n",
"            y_batch = y[batch_indices]\n",
"            # predicted probabilities and cross-entropy cost on the batch (cost is only for monitoring)\n",
"            z = np.dot(X_batch, self.weights) + self.bias\n",
"            y_pred = self._sigmoid(z)\n",
"            cost = (-1 / self.batch_size) * np.sum(y_batch * np.log(y_pred) + (1 - y_batch) * np.log(1 - y_pred))\n",
"            if self.regularization == 'l2':\n",
"                reg_cost = (self.reg_strength / (2 * n_samples)) * np.sum(self.weights ** 2)\n",
"                cost += reg_cost\n",
"            elif self.regularization == 'l1':\n",
"                # no factor 1/2 here, so the penalty matches the sign-based gradient below\n",
"                reg_cost = (self.reg_strength / n_samples) * np.sum(np.abs(self.weights))\n",
"                cost += reg_cost\n",
"            # gradients of the batch cost\n",
"            dw = (1 / self.batch_size) * np.dot(X_batch.T, (y_pred - y_batch))\n",
"            db = (1 / self.batch_size) * np.sum(y_pred - y_batch)\n",
"            if self.regularization == 'l2':\n",
"                dw += (self.reg_strength / n_samples) * self.weights\n",
"            elif self.regularization == 'l1':\n",
"                dw += (self.reg_strength / n_samples) * np.sign(self.weights)\n",
"            # update weights and bias\n",
"            self.weights -= self.learning_rate * dw\n",
"            self.bias -= self.learning_rate * db\n",
"\n",
"    def predict(self, X):\n",
"        z = np.dot(X, self.weights) + self.bias\n",
"        y_pred = self._sigmoid(z)\n",
"        return np.round(y_pred).astype(int)\n",
"\n",
"    def _sigmoid(self, z):\n",
"        return 1 / (1 + np.exp(-z))\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This implementation includes the following improvements:\n",
"\n",
"1. Regularization: You can choose between L1 and L2 regularization by setting the `regularization` parameter to either 'l1' or 'l2', and adjust the regularization strength with the `reg_strength` parameter.\n",
"\n",
"2. Mini-batch stochastic gradient descent: The model updates the weights and bias from a random mini-batch of `batch_size` samples (instead of the full dataset) at each iteration, which makes each update cheaper and often speeds up convergence on large datasets.\n"
]
},
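{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check on the update rule above, the cell below compares the analytic gradient of the L2-regularized loss with a finite-difference estimate. It is a minimal sketch on synthetic data and not part of the class: `loss_and_grad`, `numeric_grad`, the random seed, and the data are illustrative assumptions, and the loss is evaluated on the full batch rather than on a mini-batch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def loss_and_grad(w, b, X, y, reg_strength=0.1):\n",
"    # full-batch cross-entropy with the penalty reg_strength / (2n) * ||w||^2\n",
"    n = X.shape[0]\n",
"    p = 1 / (1 + np.exp(-(X @ w + b)))\n",
"    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))\n",
"    loss += (reg_strength / (2 * n)) * np.sum(w ** 2)\n",
"    dw = X.T @ (p - y) / n + (reg_strength / n) * w\n",
"    db = np.mean(p - y)\n",
"    return loss, dw, db\n",
"\n",
"def numeric_grad(w, b, X, y, h=1e-6):\n",
"    # central finite differences, one weight at a time\n",
"    g = np.zeros_like(w)\n",
"    for j in range(w.size):\n",
"        step = np.zeros_like(w)\n",
"        step[j] = h\n",
"        g[j] = (loss_and_grad(w + step, b, X, y)[0] - loss_and_grad(w - step, b, X, y)[0]) / (2 * h)\n",
"    return g\n",
"\n",
"rng = np.random.default_rng(0)\n",
"X_chk = rng.normal(size=(20, 3))  # synthetic features\n",
"y_chk = (rng.random(20) > 0.5).astype(float)  # synthetic labels\n",
"w_chk = rng.normal(size=3)\n",
"\n",
"_, dw_chk, _ = loss_and_grad(w_chk, 0.5, X_chk, y_chk)\n",
"max_gap = np.max(np.abs(dw_chk - numeric_grad(w_chk, 0.5, X_chk, y_chk)))\n",
"print(max_gap < 1e-6)"
]
},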
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 1]\n"
]
}
],
"source": [
"# create sample dataset\n",
"X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])\n",
"y = np.array([0, 0, 1, 1, 1])\n",
"\n",
"# initialize logistic regression model\n",
"lr = LogisticRegression(learning_rate=0.01, n_iters=1000, regularization='l2', reg_strength=0.1, batch_size=2)\n",
"\n",
"# train model on sample dataset\n",
"lr.fit(X, y)\n",
"\n",
"# make predictions on new data\n",
"X_new = np.array([[6, 7], [7, 8]])\n",
"y_pred = lr.predict(X_new)\n",
"\n",
"print(y_pred) # [1, 1]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Logistic regression is hard to visualize when the data has many features. For a two-dimensional dataset, however, we can plot the decision boundary of the model directly.\n",
"\n",
"Here's an example of how to visualize the decision boundary of the LogisticRegression class on a 2D dataset using the matplotlib library:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# create 2D dataset\n",
"X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]])\n",
"y = np.array([0, 0, 1, 1, 1])\n",
"\n",
"# initialize logistic regression model\n",
"lr = LogisticRegression(learning_rate=0.01, n_iters=1000, regularization='l2', reg_strength=0.1, batch_size=2)\n",
"\n",
"# train model on dataset\n",
"lr.fit(X, y)\n",
"\n",
"# plot decision boundary\n",
"x1 = np.linspace(0, 6, 100)\n",
"x2 = np.linspace(0, 8, 100)\n",
"xx, yy = np.meshgrid(x1, x2)\n",
"Z = lr.predict(np.c_[xx.ravel(), yy.ravel()])\n",
"Z = Z.reshape(xx.shape)\n",
"plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8)\n",
"\n",
"# plot data points\n",
"plt.scatter(X[:,0], X[:,1], c=y, cmap=plt.cm.Spectral)\n",
"\n",
"plt.show()"
]
},
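{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the model is linear in the features, its decision boundary in 2D is the straight line $w_1 x_1 + w_2 x_2 + b = 0$. The cell below is a minimal sketch that solves this equation for $x_2$. The helper `boundary_x2` and the example weights are hypothetical; they are not taken from the fitted model above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def boundary_x2(weights, bias, x1):\n",
"    # points on the line w1*x1 + w2*x2 + b = 0 (requires w2 != 0)\n",
"    w1, w2 = weights\n",
"    return -(w1 * x1 + bias) / w2\n",
"\n",
"w_demo = np.array([1.0, 2.0])  # hypothetical weights\n",
"b_demo = -4.0  # hypothetical bias\n",
"x1_demo = np.array([0.0, 2.0, 4.0])\n",
"x2_demo = boundary_x2(w_demo, b_demo, x1_demo)\n",
"\n",
"# every point on the boundary has zero logit, i.e. predicted probability 0.5\n",
"logits = w_demo[0] * x1_demo + w_demo[1] * x2_demo + b_demo\n",
"print(x2_demo, np.allclose(logits, 0.0))"
]
},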
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/logistic_regression_md.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## logistic regression multi-dimensional data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$ F(X)=X \\times W $$\n",
"$$ H(x)= \\frac{1}{1+ e ^{-F(x)}} $$\n",
"$$ C= -\\frac{1}{n} \\sum_{i,j} (Y \\odot \\log(H(x)) + (1-Y) \\odot \\log(1-H(x)) ) + \\lambda \\lVert W \\rVert_F^2 $$\n",
"\n",
"$X_{n \\times k}$\n",
"\n",
"$W_{k \\times p}$\n",
"\n",
"$Y_{n \\times p}$"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import random"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"n, k, p=100, 8, 3 "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"X=np.random.random([n,k])\n",
"W=np.random.random([k,p])\n",
"\n",
"y=np.random.randint(p, size=(1,n))\n",
"Y=np.zeros((n,p))\n",
"Y[np.arange(n), y]=1\n",
"\n",
"max_itr=5000\n",
"alpha=0.01\n",
"Lambda=0.01"
]
},
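{
"cell_type": "markdown",
"metadata": {},
"source": [
"The assignment `Y[np.arange(n), y]=1` above writes a 1 into column `y[i]` of row `i`, which is one-hot encoding. The cell below is a minimal sketch on a tiny hypothetical label vector (`labels` and `n_classes` are illustrative names) showing that the same result can be read off the rows of an identity matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"labels = np.array([2, 0, 1, 2])  # hypothetical class labels\n",
"n_classes = 3\n",
"\n",
"# same indexing pattern as Y[np.arange(n), y] = 1\n",
"one_hot = np.zeros((labels.size, n_classes))\n",
"one_hot[np.arange(labels.size), labels] = 1\n",
"\n",
"# equivalent: pick rows of the identity matrix\n",
"one_hot_eye = np.eye(n_classes)[labels]\n",
"print(np.array_equal(one_hot, one_hot_eye))"
]
},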
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The gradient of the cost with respect to $W$ is:\n",
"$$ \\frac{1}{n} X^T (H(x)-Y) + 2 \\lambda W $$"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# F(X) = X W: one column of scores per class\n",
"def F(X, W):\n",
"    return np.matmul(X,W)\n",
"\n",
"def H(F):\n",
"    return 1/(1+np.exp(-F))\n",
"\n",
"# W is passed in explicitly so the penalty always uses the current weights\n",
"def cost(Y_est, Y, W):\n",
"    E= - (1/n) * (np.sum(Y*np.log(Y_est) + (1-Y)*np.log(1-Y_est))) + Lambda * np.sum(W**2)\n",
"    return E, np.sum(np.argmax(Y_est,1)==y)/n\n",
"\n",
"def gradient(Y_est, Y, X, W):\n",
"    return (1/n) * np.matmul(X.T, (Y_est - Y) ) + Lambda* 2* W"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def fit(W, X, Y, alpha, max_itr):\n",
"    for i in range(max_itr):\n",
"        \n",
"        F_x=F(X,W)\n",
"        Y_est=H(F_x)\n",
"        E, c= cost(Y_est, Y, W)\n",
"        Wg=gradient(Y_est, Y, X, W)\n",
"        W=W - alpha * Wg\n",
"        if i%1000==0:\n",
"            print(E, c)\n",
"    \n",
"    return W, Y_est"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To account for the biases, we append a column of ones to X and add one row to W."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9.368653735228364 0.31\n",
"4.994251188297815 0.43\n",
"4.951873226767272 0.48\n",
"4.922370610237865 0.47\n",
"4.901694423284286 0.48\n"
]
}
],
"source": [
"X=np.concatenate( (X, np.ones((n,1))), axis=1 ) \n",
"W=np.concatenate( (W, np.random.random((1,p)) ), axis=0 )\n",
"\n",
"W, Y_est = fit(W, X, Y, alpha, max_itr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
================================================
FILE: src/MLC/notebooks/numpy_practice.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Numpy Practice\n",
"- Author: Alireza Dirafzoon\n",
"- Contributions are welcome :) "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 3])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### array()\n",
"a = [1, 2, 3]\n",
"x = np.array(a) \n",
"x = np.asarray(a)\n",
"x"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[1, 2, 3]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x.tolist()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 2., 3.], dtype=float32)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x.astype(np.float32)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 2, 1, 0])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### arange()\n",
"np.arange(3) \n",
"np.arange(0,7,2) \n",
"np.arange(3, -1, -1)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 0.],\n",
" [0., 0., 0.],\n",
" [0., 0., 0.]])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### zeros, ones, eye, linspace\n",
"np.zeros(3) \n",
"np.zeros((3,3)) "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 1., 1.],\n",
" [1., 1., 1.],\n",
" [1., 1., 1.]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.ones(3)\n",
"np.ones((3,3))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0., 0.],\n",
" [0., 1., 0.],\n",
" [0., 0., 1.]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.eye(3)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0. , 3.5, 7. ])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.linspace(0,10,3) # 3 points, from 0 to 10, inclusive \n",
"np.linspace(0,7,3) "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.7771434 , 0.08427174, 0.84780602],\n",
" [0.6069425 , 0.72381233, 0.54255502]])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### np.random\n",
"\n",
"# random.rand(): uniform distr over [0, 1)\n",
"np.random.rand() \n",
"np.random.rand(2)\n",
"np.random.rand(2,3)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 2.2473354 , -1.27775236, -0.70635289],\n",
" [-1.56768889, 0.33955847, -0.16860601]])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# random.randn(): normal distr.\n",
"np.random.randn(2,3)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[2, 3],\n",
" [3, 1]])"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# random.randint: int in [low,high) / [0, high)\n",
"np.random.randint(1,4)\n",
"np.random.randint(1,4, (2,2))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.random.randint(4)"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 2, 3],\n",
" [4, 5, 6]])"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## array methods \n",
"\n",
"### reshape\n",
"a = np.arange(1,7)\n",
"a = a.reshape(2,3)\n",
"a"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4, 5, 6])"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### max, min, argmax, argmin \n",
"a.max(axis = 0)"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 1])"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.argmax(axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3, 6])"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.max(axis = 1)"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.max()"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.argmax()"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2, 3)"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### shape and dtype \n",
"a.shape"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('int64')"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.dtype"
]
},
{
"cell_type": "code",
"execution_count": 132,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"48"
]
},
"execution_count": 132,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.nbytes"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 1, 2, 3],\n",
" [ 4, 5, 6, 7],\n",
" [ 8, 9, 10, 11]])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### 2D array/matrix \n",
"m = np.arange(12).reshape(3,4)\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4., 5., 6., 7.])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.mean(axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.11803399, 1.11803399, 1.11803399])"
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.std(axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 4, 8],\n",
" [ 1, 5, 9],\n",
" [ 2, 6, 10],\n",
" [ 3, 7, 11]])"
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.T # or m.transpose()"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.reshape((-1,12))"
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])"
]
},
"execution_count": 141,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.reshape(-1)"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.ravel()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0, 1, 2, 3],\n",
" [ 4, 5, 6, 7],\n",
" [ 8, 9, 10, 11]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## indexing and selection"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"6\n",
"6\n",
"[4 5 6 7]\n"
]
}
],
"source": [
"print(m[1][2])\n",
"print(m[1,2])\n",
"print(m[1,:])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0., 0., 0., 0., 0., 5., 5., 5., 0., 0.])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## broadcasting\n",
"a = np.zeros(10)\n",
"a[5:8] = 5 # note we can't do this with a list \n",
"a"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2., 2., 2.])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"suba = a[:3]\n",
"suba[:] = 2\n",
"suba"
]
},
{
"cell_type": "code",
"execution_count": 135,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2., 2., 2., 0., 0., 5., 5., 5., 0., 0.])"
]
},
"execution_count": 135,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a # note that suba is not a copy, just points to a slice of a"
]
},
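{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between a slice (a view) and an explicit copy can be checked with `np.shares_memory`. The cell below is a minimal sketch; the names `base`, `view`, and `copied` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"base = np.zeros(5)\n",
"view = base[:3]  # basic slicing returns a view\n",
"copied = base[:3].copy()  # an explicit copy owns its data\n",
"\n",
"view[:] = 2  # writes through to base\n",
"copied[:] = 9  # leaves base untouched\n",
"print(base)\n",
"print(np.shares_memory(base, view), np.shares_memory(base, copied))"
]
},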
{
"cell_type": "code",
"execution_count": 136,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2., 2., 2.])"
]
},
"execution_count": 136,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"suba = np.copy(a[:3])\n",
"suba"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 0., 0.],\n",
" [2., 2., 2., 2.],\n",
" [0., 0., 0., 0.],\n",
" [0., 0., 0., 0.]])"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = np.zeros((4,4))\n",
"m[1] = 2\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[2., 2., 2., 2.],\n",
" [0., 0., 0., 0.]])"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fancy indexing\n",
"m[[1,3]]"
]
},
{
"cell_type": "code",
"execution_count": 153,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([False, False, False, True])"
]
},
"execution_count": 153,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## selection \n",
"a = np.arange(4)\n",
"a > 2 # note we can't do this with list "
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(a == 2).astype(np.int16).sum()"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([1, 2, 3]),)"
]
},
"execution_count": 154,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a.nonzero()"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [],
"source": [
"## Operations"
]
},
{
"cell_type": "code",
"execution_count": 159,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0. , 0.25, 0.4 ])"
]
},
"execution_count": 159,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a, b = np.arange(0,3), np.arange(3,6)\n",
"a + b\n",
"a - b \n",
"a * b # element-wise \n",
"a/b # element-wise "
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 4, 10])"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.multiply(a,b) # element-wise "
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14"
]
},
"execution_count": 163,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# dot product of arrays\n",
"np.dot(a,b)"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-3, 6, -3])"
]
},
"execution_count": 164,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# cross product \n",
"np.cross(a,b)"
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"14"
]
},
"execution_count": 162,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# matmul: for two 1D arrays this is the inner product (.T is a no-op on 1D arrays)\n",
"np.matmul(a,b.T)"
]
},
{
"cell_type": "code",
"execution_count": 177,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 1 2]\n",
" [3 4 5]\n",
" [6 7 8]] [0 1 0]\n"
]
},
{
"data": {
"text/plain": [
"array([1, 4, 7])"
]
},
"execution_count": 177,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a = np.arange(9).reshape((3,3)) #2D\n",
"b = np.array([0,1,0]) # 1D\n",
"print(a,b) \n",
"np.matmul(a,b) # 2D * 1D -> broadcasts the 1D array, treating it as a col "
]
},
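{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a 2D array times a 1D array, `matmul` treats the 1D operand as a column and then drops the extra dimension from the result. The cell below is a minimal sketch comparing that with an explicit column vector; `A`, `v`, `as_vector`, and `as_column` are illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"A = np.arange(9).reshape(3, 3)\n",
"v = np.array([0, 1, 0])\n",
"\n",
"as_vector = A @ v  # 1D operand: result has shape (3,)\n",
"as_column = A @ v[:, None]  # explicit column: result has shape (3, 1)\n",
"print(as_vector.shape, as_column.shape)\n",
"print(np.array_equal(as_vector, as_column.ravel()))"
]
},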
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2])"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.power(a,2) # element-wise \n",
"np.power(a,b) # element-wise \n",
"np.mod(a,b)"
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/alirezadirafzoon/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: RuntimeWarning: divide by zero encountered in log\n",
" after removing the cwd from sys.path.\n"
]
},
{
"data": {
"text/plain": [
"array([[ -inf, 0. , 0.69314718],\n",
" [1.09861229, 1.38629436, 1.60943791],\n",
" [1.79175947, 1.94591015, 2.07944154]])"
]
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sqrt(a)\n",
"np.exp(a)\n",
"np.sin(a)\n",
"np.log(a)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"## Kmeans "
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "Population must be a sequence or set. For dicts, use list(d).",
"output_type": "error",
"traceback": [
"---------------------------------------------------------------------------",
"TypeError                                 Traceback (most recent call last)",
" in \n      2 x2 = np.add(np.random.randn(10,2), -5)\n      3 X = np.concatenate([x1,x2], axis=0)\n----> 4 mu, clusters = find_centers(X,2)\n",
" in find_centers(X, K)\n     24 def find_centers(X, K):\n     25     # Initialize to K random centers\n---> 26     oldmu = random.sample(X, K)\n     27     mu = random.sample(X, K)\n     28     while not has_converged(mu, oldmu):\n",
"~/opt/anaconda3/lib/python3.7/random.py in sample(self, population, k)\n    315             population = tuple(population)\n    316         if not isinstance(population, _Sequence):\n--> 317             raise TypeError(\"Population must be a sequence or set. For dicts, use list(d).\")\n    318         randbelow = self._randbelow\n    319         n = len(population)\n",
"TypeError: Population must be a sequence or set. For dicts, use list(d)."
]
}
],
"source": [
"x1 = np.add(np.random.randn(10,2), 5)\n",
"x2 = np.add(np.random.randn(10,2), -5)\n",
"X = np.concatenate([x1,x2], axis=0)\n",
"mu, clusters = kmeans(X,2)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[4.8649386655349955, 3.9952475402226817],\n",
" [5.498489206113001, 5.069951322563478],\n",
" [4.929449898684354, 5.719151512307626],\n",
" [4.595440437145644, 4.810271477510138],\n",
" [5.285073437207049, 5.922053828848186],\n",
" [3.2378112256065865, 4.595935658934975],\n",
" [3.8231073755832887, 6.144586325794659],\n",
" [4.1009988278675245, 6.559105478655928],\n",
" [3.9976386132206, 4.424471531025596],\n",
" [4.691876028371731, 5.345908717367563],\n",
" [-5.720985281350966, -5.68922383985498],\n",
" [-5.4201288230000815, -4.431411907717413],\n",
" [-3.6983426126902725, -4.636565625778152],\n",
" [-5.342010805119905, -6.095419133835849],\n",
" [-4.2666049359220235, -3.284073438471302],\n",
" [-6.469221214094414, -7.369070651238069],\n",
" [-3.284553291631532, -5.672466183383029],\n",
" [-3.4845642662555996, -5.40312458836927],\n",
" [-5.6863731385517005, -5.30056130289524],\n",
" [-5.194321373602274, -5.935463756358125]]"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 4.61084009, 3.81951528, 4.80554869, 4.89529018, 4.32093641,\n",
" 4.23753206, 2.72005748, 6.7060486 , 4.02801539, 4.04573508,\n",
" -6.07068165, -7.1437371 , -6.49431954, -4.90412879, -5.25460504,\n",
" -3.86646858, -6.98290866, -4.82434449, -6.14940609, -6.55090156],\n",
" [ 4.79319455, 4.07846865, 6.5072268 , 4.9865201 , 4.66317278,\n",
" 2.8773762 , 3.67165213, 4.9436099 , 3.93374775, 1.65634458,\n",
" -5.16876443, -5.99996567, -4.70701913, -5.61868297, -5.08846978,\n",
" -6.31017567, -3.46475003, -4.53046633, -4.35615641, -7.20485829]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
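],
"source": [
"import numpy as np\n",
"\n",
"def dist(a, b):\n",
"    # Euclidean distance between two points\n",
"    return np.linalg.norm(a - b)\n",
"\n",
"def assign_clusters(X, mu):\n",
"    # index of the nearest center for every point\n",
"    return np.array([np.argmin([dist(x, m) for m in mu]) for x in X])\n",
"\n",
"def calculate_centers(X, mu, clusters):\n",
"    # mean of each cluster; an empty cluster keeps its old center\n",
"    return np.array([X[clusters == j].mean(axis=0) if np.any(clusters == j) else mu[j]\n",
"                     for j in range(len(mu))])\n",
"\n",
"def kmeans(X, k, max_it=100):\n",
"    # random.sample(X, k) raises TypeError on an ndarray, so draw k distinct row indices instead\n",
"    mu = X[np.random.choice(len(X), k, replace=False)]\n",
"    for _ in range(max_it):\n",
"        # assign clusters to centers\n",
"        clusters = assign_clusters(X, mu)\n",
"        # calculate new centers\n",
"        new_mu = calculate_centers(X, mu, clusters)\n",
"        if np.allclose(new_mu, mu):\n",
"            break\n",
"        mu = new_mu\n",
"    return mu, clusters"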
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"mu = np.random.rand(5,2)\n",
"x = np.random.rand(2)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[array([-0.722217 , -0.95781472]),\n",
" array([-0.43986246, -0.58463802]),\n",
" array([-0.16399214, -0.27117604]),\n",
" array([-0.88255848, -0.98718324]),\n",
" array([-0.53056903, -0.60690576])]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[-mu[i[0]] for i in enumerate(mu)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# distance from x to every center, then the (index, distance) pair of the nearest one\n",
"dists = [np.linalg.norm(x - mu_i) for mu_i in mu]\n",
"print(min(enumerate(dists), key=lambda pair: pair[1]))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 [1, 2, 3]\n",
"1 [4, 5, 6]\n"
]
}
],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/perceptron.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The perceptron is a linear classifier for two-class problems (labels -1 and +1 in the code below). It learns from its mistakes: whenever a training example is misclassified, the weights of the input features are adjusted toward the correct label, while correctly classified examples leave them unchanged. The prediction and update rules are:\n",
"\n",
"```python \n",
"y_pred = sign(w0 + w1*x1 + w2*x2 + ... + wn*xn)\n",
"wi = wi + learning_rate * (target - y_pred) * xi\n",
"```\n",
"\n",
"Here is an implementation of the perceptron algorithm in Python:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class Perceptron:\n",
"    def __init__(self, lr=0.01, n_iter=100):\n",
"        self.lr = lr\n",
"        self.n_iter = n_iter\n",
"\n",
"    def fit(self, X, y):\n",
"        self.weights = np.zeros(1 + X.shape[1])\n",
"        self.errors = []\n",
"\n",
"        for _ in range(self.n_iter):\n",
"            errors = 0\n",
"            for xi, target in zip(X, y):\n",
"                update = self.lr * (target - self.predict(xi))\n",
"                self.weights[1:] += update * xi\n",
"                self.weights[0] += update\n",
"                errors += int(update != 0.0)\n",
"            self.errors.append(errors)\n",
"        return self\n",
"\n",
"    def net_input(self, X):\n",
"        return np.dot(X, self.weights[1:]) + self.weights[0]\n",
"\n",
"    def predict(self, X):\n",
"        return np.where(self.net_input(X) >= 0.0, 1, -1)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The Perceptron class has the following methods:\n",
"\n",
"`__init__(self, lr=0.01, n_iter=100)`: Initializes the perceptron with a learning rate (`lr`) and number of iterations (`n_iter`) to perform during training.\n",
"\n",
"`fit(self, X, y)`: Trains the perceptron on the input data `X` and target labels `y`. The method initializes the weights to zero and iterates through the data `n_iter` times, adjusting the weights after each misclassification. The method returns the trained perceptron.\n",
"\n",
"`net_input(self, X)`: Computes the weighted sum of inputs and bias.\n",
"\n",
"`predict(self, X)`: Predicts the class label for a given input `X` based on the current weights.\n",
"\n",
"To use the perceptron algorithm, you can create an instance of the Perceptron class, and then call the fit method with your input data X and target labels y. Here is an example usage:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([-1, 1])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = np.array([[2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [3.0, 1.0]])\n",
"y = np.array([-1, 1, 1, -1])\n",
"perceptron = Perceptron()\n",
"perceptron.fit(X, y)\n",
"\n",
"new_X = np.array([[5.0, 2.0], [1.0, 3.0]])\n",
"perceptron.predict(new_X)\n",
"\n"
]
},
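{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the update rule, the sketch below applies one perceptron update in plain Python. The helper name `perceptron_step` and the sample values are assumptions made for this example; they are not part of the `Perceptron` class above.\n",
"\n",
"```python\n",
"def perceptron_step(weights, bias, xi, target, lr=0.01):\n",
"    # predict with the current parameters: sign of the weighted sum\n",
"    net = sum(w * x for w, x in zip(weights, xi)) + bias\n",
"    y_pred = 1 if net >= 0.0 else -1\n",
"    # the update is zero when the prediction is already correct\n",
"    update = lr * (target - y_pred)\n",
"    new_weights = [w + update * x for w, x in zip(weights, xi)]\n",
"    return new_weights, bias + update\n",
"\n",
"# zero weights predict +1, so a sample labelled -1 triggers an update\n",
"w, b = perceptron_step([0.0, 0.0], 0.0, [2.0, 1.0], target=-1)\n",
"print(w, b)\n",
"```\n",
"\n",
"Because the update is proportional to `target - y_pred`, a correctly classified sample leaves the weights unchanged."
]
},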
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/softmax.ipynb
================================================
================================================
FILE: src/MLC/notebooks/svm.ipynb
================================================
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Support Vector Machines (SVMs)\n",
"\n",
"Support Vector Machines (SVMs) are a type of machine learning algorithm used for classification and regression analysis. In particular, linear SVMs are used for binary classification problems where the goal is to separate two classes by a hyperplane.\n",
"\n",
"The hyperplane is the decision boundary (a line in two dimensions) that divides the feature space into two regions. The SVM algorithm tries to find the hyperplane that maximizes the margin, which is the distance between the hyperplane and the closest points from each class. The points closest to the hyperplane are called support vectors and play a crucial role in the algorithm's optimization process.\n",
"\n",
"In linear SVMs, the hyperplane is defined by a linear function of the input features. The algorithm tries to find the optimal values of the coefficients of this function, called weights, that maximize the margin. This optimization problem can be formulated as a quadratic programming problem, which can be efficiently solved using standard optimization techniques.\n",
"\n",
"In addition to finding the optimal hyperplane, SVMs can also handle data that is not linearly separable by using the kernel trick. This technique implicitly maps the input features into a higher-dimensional space, where they might become linearly separable. The SVM algorithm then finds the optimal hyperplane in this transformed feature space, which corresponds to a non-linear decision boundary in the original feature space.\n",
"\n",
"Linear SVMs have been widely used in many applications, including text classification, image classification, and bioinformatics. They have the advantage of being computationally efficient and easy to interpret. However, they may perform poorly on datasets that are far from linearly separable, where non-linear (kernel) SVMs may be a better choice."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Code "
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class SVM:\n",
"    def __init__(self, learning_rate=0.001, lambda_param=0.01, n_iters=1000):\n",
"        self.lr = learning_rate\n",
"        self.lambda_param = lambda_param\n",
"        self.n_iters = n_iters\n",
"        self.w = None\n",
"        self.b = None\n",
"\n",
"    def fit(self, X, y):\n",
"        n_samples, n_features = X.shape\n",
"        # map labels to {-1, 1}\n",
"        y_ = np.where(y <= 0, -1, 1)\n",
"        self.w = np.zeros(n_features)\n",
"        self.b = 0\n",
"\n",
"        # Per-sample (stochastic) sub-gradient descent on the regularized hinge loss\n",
"        for _ in range(self.n_iters):\n",
"            for idx, x_i in enumerate(X):\n",
"                # True when the sample is on the correct side and outside the margin\n",
"                condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1\n",
"                if condition:\n",
"                    # only the regularizer contributes to the gradient\n",
"                    self.w -= self.lr * (2 * self.lambda_param * self.w)\n",
"                else:\n",
"                    # margin violated: the hinge term contributes as well\n",
"                    self.w -= self.lr * (2 * self.lambda_param * self.w - np.dot(x_i, y_[idx]))\n",
"                    self.b -= self.lr * y_[idx]\n",
"\n",
"    def predict(self, X):\n",
"        linear_output = np.dot(X, self.w) - self.b\n",
"        return np.sign(linear_output)\n",
"\n",
"\n"
]
},
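{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The training loop above can be read as per-sample sub-gradient descent on a regularized hinge loss. The sketch below writes that objective out in plain Python, using the same `w.x - b` convention as the class. The function name `hinge_objective` and the toy data are assumptions made for illustration.\n",
"\n",
"```python\n",
"def hinge_objective(w, b, X, y, lambda_param=0.01):\n",
"    # lambda * ||w||^2 plus the mean hinge loss max(0, 1 - y * (w.x - b))\n",
"    reg = lambda_param * sum(wj * wj for wj in w)\n",
"    losses = []\n",
"    for xi, yi in zip(X, y):\n",
"        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b)\n",
"        losses.append(max(0.0, 1.0 - margin))\n",
"    return reg + sum(losses) / len(losses)\n",
"\n",
"# toy data: both points lie outside the margin, so only the regularizer remains\n",
"X_toy = [[2.0, 0.0], [-2.0, 0.0]]\n",
"y_toy = [1, -1]\n",
"print(hinge_objective([1.0, 0.0], 0.0, X_toy, y_toy))\n",
"```\n",
"\n",
"A sample only adds to the loss when `y * (w.x - b) < 1`, which is the same test as `condition` in `fit`."
]
},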
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy: 1.0\n"
]
}
],
"source": [
"# Example usage\n",
"from sklearn import datasets\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X, y = datasets.make_blobs(n_samples=100, centers=2, random_state=42)\n",
"y = np.where(y == 0, -1, 1)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"\n",
"svm = SVM()\n",
"svm.fit(X_train, y_train)\n",
"y_pred = svm.predict(X_test)\n",
"\n",
"\n",
"# Evaluate model\n",
"accuracy = accuracy_score(y_test, y_pred)\n",
"print(\"Accuracy:\", accuracy)"
]
},
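{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generate data\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"X, y = make_classification(n_features=5, n_samples=100, n_informative=5, n_redundant=0, n_classes=2, random_state=1)\n",
"# make_classification returns labels in {0, 1}; map them to {-1, 1} so they match SVM.predict\n",
"y = np.where(y == 0, -1, 1)\n",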
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)\n",
"\n",
"# Initialize SVM model\n",
"svm = SVM()\n",
"\n",
"# Train model\n",
"svm.fit(X_train, y_train)\n",
"\n",
"# Make predictions\n",
"y_pred = svm.predict(X_test)\n",
"\n",
"# Evaluate model\n",
"accuracy = accuracy_score(y_test, y_pred)\n",
"print(\"Accuracy:\", accuracy)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLC/notebooks/ww_classifier.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CS 224N Lecture 3: Word Window Classification\n",
"\n",
"### PyTorch Exploration\n",
"\n",
"### Author: Matthew Lamm"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"import pprint\n",
"import torch\n",
"import torch.nn as nn\n",
"pp = pprint.PrettyPrinter()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Our Data\n",
"\n",
"The task at hand is to assign a label of 1 to words in a sentence that correspond to a LOCATION, and a label of 0 to everything else. \n",
"\n",
"In this simplified example, we only ever see spans of length 1."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"train_sents = [s.lower().split() for s in [\"we 'll always have Paris\",\n",
" \"I live in Germany\",\n",
" \"He comes from Denmark\",\n",
" \"The capital of Denmark is Copenhagen\"]]\n",
"train_labels = [[0, 0, 0, 0, 1],\n",
" [0, 0, 0, 1],\n",
" [0, 0, 0, 1],\n",
" [0, 0, 0, 1, 0, 1]]\n",
"\n",
"assert all([len(train_sents[i]) == len(train_labels[i]) for i in range(len(train_sents))])\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"test_sents = [s.lower().split() for s in [\"She comes from Paris\"]]\n",
"test_labels = [[0, 0, 0, 1]]\n",
"\n",
"assert all([len(test_sents[i]) == len(test_labels[i]) for i in range(len(test_sents))])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating a dataset of batched tensors."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PyTorch (like other deep learning frameworks) is optimized to work on __tensors__, which can be thought of as a generalization of vectors and matrices with arbitrarily large rank.\n",
"\n",
"Here we'll go over how to translate data to a list of vocabulary indices, and how to construct *batch tensors* out of the data for easy input to our model. \n",
"\n",
"We'll use the *torch.utils.data.DataLoader* object to handle batching and iteration."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Converting tokenized sentence lists to vocabulary indices.\n",
"\n",
"Let's assume we have the following vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"id_2_word = [\"<pad>\", \"<unk>\", \"we\", \"always\", \"have\", \"paris\",\n",
"             \"i\", \"live\", \"in\", \"germany\",\n",
"             \"he\", \"comes\", \"from\", \"denmark\",\n",
"             \"the\", \"of\", \"is\", \"copenhagen\"]\n",
"word_2_id = {w:i for i,w in enumerate(id_2_word)}"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['we', \"'ll\", 'always', 'have', 'paris']\n"
]
}
],
"source": [
"instance = train_sents[0]\n",
"print(instance)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"def convert_tokens_to_inds(sentence, word_2_id):\n",
"    return [word_2_id.get(t, word_2_id[\"<unk>\"]) for t in sentence]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[2, 1, 3, 4, 5]\n"
]
}
],
"source": [
"token_inds = convert_tokens_to_inds(instance, word_2_id)\n",
"pp.pprint(token_inds)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's convince ourselves that worked:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['we', '<unk>', 'always', 'have', 'paris']\n"
]
}
],
"source": [
"print([id_2_word[tok_idx] for tok_idx in token_inds])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Padding for windows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the word window classifier, for each word in the sentence we want to get the +/- n window around the word, where 0 <= n < len(sentence).\n",
"\n",
"In order for such windows to be defined for words at the beginning and ends of the sentence, we actually want to insert padding around the sentence before converting to indices:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"def pad_sentence_for_window(sentence, window_size, pad_token=\"<pad>\"):\n",
"    return [pad_token]*window_size + sentence + [pad_token]*window_size"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['<pad>', '<pad>', 'we', \"'ll\", 'always', 'have', 'paris', '<pad>', '<pad>']\n"
]
}
],
"source": [
"window_size = 2\n",
"instance = pad_sentence_for_window(train_sents[0], window_size)\n",
"print(instance)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make sure this works with our vocabulary:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['<pad>', '<pad>', 'we', '<unk>', 'always', 'have', 'paris', '<pad>', '<pad>']\n",
"['<pad>', '<pad>', 'i', 'live', 'in', 'germany', '<pad>', '<pad>']\n",
"['<pad>', '<pad>', 'he', 'comes', 'from', 'denmark', '<pad>', '<pad>']\n",
"['<pad>', '<pad>', 'the', '<unk>', 'of', 'denmark', 'is', 'copenhagen', '<pad>', '<pad>']\n"
]
}
],
"source": [
"for sent in train_sents:\n",
"    tok_idxs = convert_tokens_to_inds(pad_sentence_for_window(sent, window_size), word_2_id)\n",
"    print([id_2_word[idx] for idx in tok_idxs])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Batching sentences together with a DataLoader"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we train our model, we rarely update with respect to a single training instance at a time, because a single instance provides a very noisy estimate of the global loss's gradient. We instead construct small *batches* of data, and update parameters for each batch. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given some batch size, we want to construct batch tensors out of the word index lists we've just created with our vocab.\n",
"\n",
"For each length B list of inputs, we'll have to:\n",
"\n",
" (1) Add window padding to sentences in the batch like we just saw.\n",
" (2) Add additional padding so that each sentence in the batch is the same length.\n",
" (3) Make sure our labels are in the desired format.\n",
"\n",
"At the level of the dataset we want:\n",
"\n",
" (4) Easy shuffling, because shuffling from one training epoch to the next gets rid of \n",
" pathological batches that are tough to learn from.\n",
" (5) Making sure we shuffle inputs and their labels together!\n",
" \n",
"PyTorch provides us with an object *torch.utils.data.DataLoader* that gets us (4) and (5). All that's required of us is to specify a *collate_fn* that tells it how to do (1), (2), and (3). "
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('raw train label instance', tensor([0, 0, 0, 0, 1]))\n",
"torch.Size([5])\n"
]
}
],
"source": [
"l = torch.LongTensor(train_labels[0])\n",
"pp.pprint((\"raw train label instance\", l))\n",
"print(l.size())\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('unfilled label instance',\n",
" tensor([[0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 0.]]))\n",
"torch.Size([2, 5])\n"
]
}
],
"source": [
"one_hots = torch.zeros((2, len(l)))\n",
"pp.pprint((\"unfilled label instance\", one_hots))\n",
"print(one_hots.size())"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('one-hot labels', tensor([[0., 0., 0., 0., 0.],\n",
" [0., 0., 0., 0., 1.]]))\n"
]
}
],
"source": [
"one_hots[1] = l\n",
"pp.pprint((\"one-hot labels\", one_hots))"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('one-hot labels', tensor([[1., 1., 1., 1., 0.],\n",
" [0., 0., 0., 0., 1.]]))\n"
]
}
],
"source": [
"l_not = 1 - l  # flips 0/1 labels; on recent PyTorch, ~ on a uint8 tensor is a bitwise NOT instead\n",
"one_hots[0] = l_not\n",
"pp.pprint((\"one-hot labels\", one_hots))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"from torch.utils.data import DataLoader\n",
"from functools import partial"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"def my_collate(data, window_size, word_2_id):\n",
"    \"\"\"\n",
"    For some chunk of sentences and labels\n",
"        -add window padding\n",
"        -pad for lengths using pad_sequence\n",
"        -convert our labels to one-hots\n",
"        -return padded inputs, one-hot labels, and lengths\n",
"    \"\"\"\n",
"\n",
"    x_s, y_s = zip(*data)\n",
"\n",
"    # deal with input sentences as we've seen\n",
"    window_padded = [convert_tokens_to_inds(pad_sentence_for_window(sentence, window_size), word_2_id)\n",
"                     for sentence in x_s]\n",
"    # append zeros to each list of token ids in batch so that they are all the same length\n",
"    padded = nn.utils.rnn.pad_sequence([torch.LongTensor(t) for t in window_padded], batch_first=True)\n",
"\n",
"    # convert labels to one-hots\n",
"    labels = []\n",
"    lengths = []\n",
"    for y in y_s:\n",
"        lengths.append(len(y))\n",
"        label = torch.zeros((len(y), 2))\n",
"        true = torch.LongTensor(y)\n",
"        # 1 - true flips 0/1 labels; on recent PyTorch, ~ on a uint8 tensor is a bitwise NOT instead\n",
"        false = 1 - true\n",
"        label[:, 0] = false\n",
"        label[:, 1] = true\n",
"        labels.append(label)\n",
"    padded_labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)\n",
"\n",
"    return padded.long(), padded_labels, torch.LongTensor(lengths)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Shuffle True is good practice for train loaders.\n",
"# Use functools.partial to construct a partially populated collate function\n",
"example_loader = DataLoader(list(zip(train_sents, \n",
" train_labels)), \n",
" batch_size=2, \n",
" shuffle=True, \n",
" collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('inputs',\n",
" tensor([[ 0, 0, 2, 1, 3, 4, 5, 0, 0],\n",
" [ 0, 0, 10, 11, 12, 13, 0, 0, 0]]),\n",
" torch.Size([2, 9]))\n",
"('labels',\n",
" tensor([[[1., 0.],\n",
" [1., 0.],\n",
" [1., 0.],\n",
" [1., 0.],\n",
" [0., 1.]],\n",
"\n",
" [[1., 0.],\n",
" [1., 0.],\n",
" [1., 0.],\n",
" [0., 1.],\n",
" [0., 0.]]]),\n",
" torch.Size([2, 5, 2]))\n",
"tensor([5, 4])\n"
]
}
],
"source": [
"for batched_input, batched_labels, batch_lengths in example_loader:\n",
"    pp.pprint((\"inputs\", batched_input, batched_input.size()))\n",
"    pp.pprint((\"labels\", batched_labels, batched_labels.size()))\n",
"    pp.pprint(batch_lengths)\n",
"    break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling\n",
"\n",
"### Thinking through vectorization of word windows.\n",
"Before we go ahead and build our model, let's think about the first thing it needs to do to its inputs.\n",
"\n",
"We're passed batches of sentences. For each sentence i in the batch, for each word j in the sentence, we want to construct a single tensor out of the embeddings surrounding word j in the +/- n window.\n",
"\n",
"Thus, the first thing we're going to need is a (B, L, 2N+1) tensor of token indices."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A *terrible* but nevertheless informative *iterative* solution looks something like the following, where we iterate through the batch elements of a (dummy) input, iterate over the non-padded word positions in each, and for each non-padded word position, construct a window:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[0, 0, 1, 2, 3, 4, 0, 0],\n",
" [0, 0, 5, 6, 7, 8, 0, 0]])\n"
]
}
],
"source": [
"dummy_input = torch.zeros(2, 8).long()\n",
"dummy_input[:,2:-2] = torch.arange(1,9).view(2,4)\n",
"pp.pprint(dummy_input)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch.Size([2, 4, 5])\n",
"tensor([[[0, 0, 1, 2, 3],\n",
" [0, 1, 2, 3, 4],\n",
" [1, 2, 3, 4, 0],\n",
" [2, 3, 4, 0, 0]],\n",
"\n",
" [[0, 0, 5, 6, 7],\n",
" [0, 5, 6, 7, 8],\n",
" [5, 6, 7, 8, 0],\n",
" [6, 7, 8, 0, 0]]])\n"
]
}
],
"source": [
"dummy_output = [[[dummy_input[i, j-2+k].item() for k in range(2*2+1)] \n",
" for j in range(2, 6)] \n",
" for i in range(2)]\n",
"dummy_output = torch.LongTensor(dummy_output)\n",
"print(dummy_output.size())\n",
"pp.pprint(dummy_output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Technically* it works: For each element in the batch, for each word in the original sentence and ignoring window padding, we've got the 5 token indices centered at that word. But in practice it will be very slow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead, we ideally want to find the right tensor operation in the PyTorch arsenal. Here, that happens to be __Tensor.unfold__."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"tensor([[[0, 0, 1, 2, 3],\n",
" [0, 1, 2, 3, 4],\n",
" [1, 2, 3, 4, 0],\n",
" [2, 3, 4, 0, 0]],\n",
"\n",
" [[0, 0, 5, 6, 7],\n",
" [0, 5, 6, 7, 8],\n",
" [5, 6, 7, 8, 0],\n",
" [6, 7, 8, 0, 0]]])"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dummy_input.unfold(1, 2*2+1, 1)"
]
},
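{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, the windows produced by the `dummy_input.unfold(1, 2*2+1, 1)` call above can be mimicked for a single row in plain Python. This is only a sketch of the sliding-window idea, not how PyTorch implements it; `sliding_windows` is a hypothetical helper introduced for this example.\n",
"\n",
"```python\n",
"def sliding_windows(row, size, step=1):\n",
"    # every contiguous slice of length `size`, starting every `step` positions\n",
"    return [row[i:i + size] for i in range(0, len(row) - size + 1, step)]\n",
"\n",
"row = [0, 0, 1, 2, 3, 4, 0, 0]\n",
"for window in sliding_windows(row, 2 * 2 + 1):\n",
"    print(window)\n",
"```\n",
"\n",
"A row of length L yields L - size + 1 windows when the step is 1, which is why the window-padded length 8 shrinks back to the original sentence length 4."
]
},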
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### A model in full."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In PyTorch, we implement models by extending the nn.Module class. Minimally, this requires implementing an *\\_\\_init\\_\\_* function and a *forward* function.\n",
"\n",
"In *\\_\\_init\\_\\_* we want to store model parameters (weights) and hyperparameters (dimensions).\n"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"class SoftmaxWordWindowClassifier(nn.Module):\n",
"    \"\"\"\n",
"    A one-layer, binary word-window classifier.\n",
"    \"\"\"\n",
"    def __init__(self, config, vocab_size, pad_idx=0):\n",
"        super(SoftmaxWordWindowClassifier, self).__init__()\n",
"        \"\"\"\n",
"        Instance variables.\n",
"        \"\"\"\n",
"        self.window_size = 2*config[\"half_window\"]+1\n",
"        self.embed_dim = config[\"embed_dim\"]\n",
"        self.hidden_dim = config[\"hidden_dim\"]\n",
"        self.num_classes = config[\"num_classes\"]\n",
"        self.freeze_embeddings = config[\"freeze_embeddings\"]\n",
"\n",
"        \"\"\"\n",
"        Embedding layer\n",
"        -model holds an embedding for each word in our vocab\n",
"        -sets aside a special index in the embedding matrix for padding vector (of zeros)\n",
"        -by default, embeddings are parameters (so gradients pass through them)\n",
"        \"\"\"\n",
"        self.embed_layer = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_idx)\n",
"        if self.freeze_embeddings:\n",
"            self.embed_layer.weight.requires_grad = False\n",
"\n",
"        \"\"\"\n",
"        Hidden layer\n",
"        -we want to map embedded word windows of dim self.window_size*self.embed_dim to a hidden layer.\n",
"        -nn.Sequential allows you to efficiently specify sequentially structured models\n",
"            -first the linear transformation is applied to the embedded word windows\n",
"            -next the nonlinear transformation tanh is applied.\n",
"        \"\"\"\n",
"        self.hidden_layer = nn.Sequential(nn.Linear(self.window_size*self.embed_dim,\n",
"                                                    self.hidden_dim),\n",
"                                          nn.Tanh())\n",
"\n",
"        \"\"\"\n",
"        Output layer\n",
"        -we want to map elements of the output layer (of size self.hidden_dim) to a number of classes.\n",
"        \"\"\"\n",
"        self.output_layer = nn.Linear(self.hidden_dim, self.num_classes)\n",
"\n",
"        \"\"\"\n",
"        Softmax\n",
"        -The final step of the softmax classifier: mapping final hidden layer to class scores.\n",
"        -PyTorch has both logsoftmax and softmax functions (and many others)\n",
"        -since our loss is the negative LOG likelihood, we use logsoftmax\n",
"        -technically you can take the softmax, and take the log but PyTorch's implementation\n",
"         is optimized to avoid numerical underflow issues.\n",
"        \"\"\"\n",
"        self.log_softmax = nn.LogSoftmax(dim=2)\n",
"\n",
"    def forward(self, inputs):\n",
"        \"\"\"\n",
"        Let B:= batch_size\n",
"            L:= window-padded sentence length\n",
"            D:= self.embed_dim\n",
"            S:= self.window_size\n",
"            H:= self.hidden_dim\n",
"\n",
"        inputs: a (B, L) tensor of token indices\n",
"        \"\"\"\n",
"        B, L = inputs.size()\n",
"\n",
"        \"\"\"\n",
"        Reshaping.\n",
"        Takes in a (B, L) LongTensor\n",
"        Outputs a (B, L~, S) LongTensor\n",
"        \"\"\"\n",
"        # First, get our word windows for each word in our input.\n",
"        token_windows = inputs.unfold(1, self.window_size, 1)\n",
"        _, adjusted_length, _ = token_windows.size()\n",
"\n",
"        # Good idea to do internal tensor-size sanity checks, at the least in comments!\n",
"        assert token_windows.size() == (B, adjusted_length, self.window_size)\n",
"\n",
"        \"\"\"\n",
"        Embedding.\n",
"        Takes in a torch.LongTensor of size (B, L~, S)\n",
"        Outputs a (B, L~, S, D) FloatTensor.\n",
"        \"\"\"\n",
"        embedded_windows = self.embed_layer(token_windows)\n",
"\n",
"        \"\"\"\n",
"        Reshaping.\n",
"        Takes in a (B, L~, S, D) FloatTensor.\n",
"        Resizes it into a (B, L~, S*D) FloatTensor.\n",
"        -1 argument \"infers\" what the last dimension should be based on leftover axes.\n",
"        \"\"\"\n",
"        embedded_windows = embedded_windows.view(B, adjusted_length, -1)\n",
"\n",
"        \"\"\"\n",
"        Layer 1.\n",
"        Takes in a (B, L~, S*D) FloatTensor.\n",
"        Maps it to a (B, L~, H) FloatTensor\n",
"        \"\"\"\n",
"        layer_1 = self.hidden_layer(embedded_windows)\n",
"\n",
"        \"\"\"\n",
"        Layer 2\n",
"        Takes in a (B, L~, H) FloatTensor.\n",
"        Maps it to a (B, L~, 2) FloatTensor.\n",
"        \"\"\"\n",
"        output = self.output_layer(layer_1)\n",
"\n",
"        \"\"\"\n",
"        Softmax.\n",
"        Takes in a (B, L~, 2) FloatTensor of unnormalized class scores.\n",
"        Outputs a (B, L~, 2) FloatTensor of (log-)normalized class scores.\n",
"        \"\"\"\n",
"        output = self.log_softmax(output)\n",
"\n",
"        return output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training.\n",
"\n",
"Now that we've got a model, we have to train it."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"def loss_function(outputs, labels, lengths):\n",
" \"\"\"Computes negative LL loss on a batch of model predictions.\"\"\"\n",
" B, L, num_classes = outputs.size()\n",
" num_elems = lengths.sum().float()\n",
" \n",
" # get only the values with non-zero labels\n",
" loss = outputs*labels\n",
" \n",
" # rescale average\n",
" return -loss.sum() / num_elems"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"def train_epoch(loss_function, optimizer, model, train_data):\n",
" \n",
" ## For each batch, we must reset the gradients\n",
" ## stored by the model. \n",
" total_loss = 0\n",
" for batch, labels, lengths in train_data:\n",
" # clear gradients\n",
" optimizer.zero_grad()\n",
" # run the model's forward pass on the batch\n",
" outputs = model(batch)\n",
" # compute loss w.r.t batch\n",
" loss = loss_function(outputs, labels, lengths)\n",
" # backpropagate gradients, starting from the loss value\n",
" loss.backward()\n",
" # update parameters\n",
" optimizer.step()\n",
" total_loss += loss.item()\n",
" \n",
" # return the total to keep track of how you did this time around\n",
" return total_loss\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"config = {\"batch_size\": 4,\n",
" \"half_window\": 2,\n",
" \"embed_dim\": 25,\n",
" \"hidden_dim\": 25,\n",
" \"num_classes\": 2,\n",
" \"freeze_embeddings\": False,\n",
" }\n",
"learning_rate = .0002\n",
"num_epochs = 10000\n",
"model = SoftmaxWordWindowClassifier(config, len(word_2_id))\n",
"optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"train_loader = torch.utils.data.DataLoader(list(zip(train_sents, train_labels)), \n",
" batch_size=2, \n",
" shuffle=True, \n",
" collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1.4967301487922668, 1.408476173877716, 1.3443800806999207, 1.2865177989006042, 1.2272869944572449, 1.1691689491271973, 1.1141255497932434, 1.0696152448654175, 1.023829996585846, 0.978839099407196, 0.937132716178894, 0.8965558409690857, 0.8551942408084869, 0.8171629309654236, 0.7806291580200195, 0.7467736303806305, 0.7136902511119843, 0.6842415034770966, 0.6537061333656311, 0.6195352077484131, 0.5914349257946014, 0.5682767033576965, 0.5430445969104767, 0.5190333724021912, 0.49760693311691284, 0.47582894563674927, 0.45516568422317505, 0.4298042058944702, 0.41591694951057434, 0.39368535578250885, 0.3817802667617798, 0.36694473028182983, 0.35200121998786926, 0.3370656222105026, 0.31913231313228607, 0.3065541982650757, 0.2946578562259674, 0.28842414915561676, 0.27765345573425293, 0.26745346188545227, 0.25778329372406006, 0.24860621988773346, 0.23990143835544586, 0.22729042172431946, 0.22337404638528824, 0.21637336909770966, 0.20889568328857422, 0.20218300074338913, 0.19230441004037857, 0.19007354974746704, 0.18426819890737534, 0.17840557545423508, 0.173139289021492, 0.16499895602464676, 0.1602725237607956, 0.1590176522731781, 0.15144427865743637, 0.14732149988412857, 0.14641961455345154, 0.13959994912147522, 0.13598214834928513, 0.13251276314258575, 0.13197287172079086, 0.12871850654482841, 0.1253872662782669, 0.12239058315753937, 0.1171659529209137, 0.11695125326514244, 0.11428486183285713, 0.11171672493219376, 0.10924769192934036, 0.10686498507857323, 0.1045713983476162, 0.10218603909015656, 0.10022115334868431, 0.09602915123105049, 0.09616792947053909, 0.09424330666661263, 0.09223027899861336, 0.090587567538023, 0.08691023662686348, 0.08717184513807297, 0.08540527895092964, 0.0839710421860218, 0.08230703324079514, 0.0808291956782341, 0.07777531817555428, 0.0780084915459156, 0.07678597420454025, 0.07535399869084358, 0.07408255711197853, 0.07296567782759666, 0.07176320999860764, 0.07059716433286667, 0.0694643184542656, 0.06684627756476402, 0.06579622253775597, 
0.06477534398436546, 0.06378135085105896, 0.06281331554055214]\n"
]
}
],
"source": [
"losses = []\n",
"for epoch in range(num_epochs):\n",
" epoch_loss = train_epoch(loss_function, optimizer, model, train_loader)\n",
" if epoch % 100 == 0:\n",
" losses.append(epoch_loss)\n",
"print(losses)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prediction."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"test_loader = torch.utils.data.DataLoader(list(zip(test_sents, test_labels)), \n",
" batch_size=1, \n",
" shuffle=False, \n",
" collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[0, 0, 0, 1]])\n",
"tensor([[0, 0, 0, 1]])\n"
]
}
],
"source": [
"for test_instance, labs, _ in test_loader:\n",
" outputs = model.forward(test_instance)\n",
" print(torch.argmax(outputs, dim=2))\n",
" print(torch.argmax(labs, dim=2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: src/MLSD/ml-companies.md
================================================
## ML Systems at Big Companies
- LinkedIn
- [Learning to be Relevant](http://www.shivanirao.info/uploads/3/1/2/8/31287481/cikm-cameryready.v1.pdf)
- [Two tower models for retrieval](https://www.linkedin.com/pulse/personalized-recommendations-iv-two-tower-models-gaurav-chakravorty/)
- A closer look at the AI behind course recommendations on LinkedIn Learning, [Part 1](https://engineering.linkedin.com/blog/2020/course-recommendations-ai-part-one), [Part 2](https://engineering.linkedin.com/blog/2020/course-recommendations-ai-part-two)
- [Intro to AI at Linkedin](https://engineering.linkedin.com/blog/2018/10/an-introduction-to-ai-at-linkedin)
- [Building The LinkedIn Knowledge Graph](https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph)
- [The AI Behind LinkedIn Recruiter search and recommendation systems](https://engineering.linkedin.com/blog/2019/04/ai-behind-linkedin-recruiter-search-and-recommendation-systems)
- [Communities AI: Building communities around interests on LinkedIn](https://engineering.linkedin.com/blog/2019/06/building-communities-around-interests)
- [Linkedin's follow feed](https://engineering.linkedin.com/blog/2016/03/followfeed--linkedin-s-feed-made-faster-and-smarter)
- XLNT for A/B testing
- Google
- [The YouTube Video Recommendation System](https://www.inf.unibz.it/~ricci/ISR/papers/p293-davidson.pdf)
- [Deep Neural Networks for YouTube Recommendations](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf)
- [Recommending What Video to Watch Next: A Multitask Ranking System](https://daiwk.github.io/assets/youtube-multitask.pdf)
- [Exploring Transfer Learning with T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)
- [Google Research, 2022 & beyond](https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html)
- ML pipelines with TFX and KubeFlow
- [How Google Search works](https://www.google.com/search/howsearchworks/)
- Page Rank algorithm ([intro to page rank](https://www.youtube.com/watch?v=IKXvSKaI2Ko), [the algorithm that started google](https://www.youtube.com/watch?v=qxEkY8OScYY))
- [TFX workshop by Robert Crowe](https://conferences.oreilly.com/artificial-intelligence/ai-ca-2019/cdn.oreillystatic.com/en/assets/1/event/298/TFX_%20Production%20ML%20pipelines%20with%20TensorFlow%20Presentation.pdf)
- [Google Cloud Platform Big Data and Machine Learning Fundamentals](https://www.coursera.org/learn/gcp-big-data-ml-fundamentals)
- Scalable ML using AWS
- [AWS Machine Learning Blog](https://aws.amazon.com/blogs/machine-learning/)
- [Deploy a machine learning model with AWS Elastic Beanstalk](https://medium.com/swlh/deploy-a-machine-learning-model-with-aws-elasticbeanstalk-dfcc47b6043e)
- [Deploying Machine Learning Models as API using AWS](https://medium.com/towards-artificial-intelligence/deploying-machine-learning-models-as-api-using-aws-a25d05518084)
- [Serverless Machine Learning On AWS Lambda](https://medium.com/swlh/how-to-deploy-your-scikit-learn-model-to-aws-44aabb0efcb4)
- Meta
- [Machine Learning at Facebook Talk](https://www.youtube.com/watch?v=C4N1IZ1oZGw)
- [Scaling AI Experiences at Facebook with PyTorch](https://www.youtube.com/watch?v=O8t9xbAajbY)
- [Understanding text in images and videos](https://ai.facebook.com/blog/rosetta-understanding-text-in-images-and-videos-with-machine-learning/)
- [Protecting people](https://ai.facebook.com/blog/advances-in-content-understanding-self-supervision-to-protect-people/)
- Ads
- [Practical Lessons from Predicting Clicks on Ads at Facebook](https://quinonero.net/Publications/predicting-clicks-facebook.pdf)
- Newsfeed Ranking
- [How Facebook News Feed Works](https://techcrunch.com/2016/09/06/ultimate-guide-to-the-news-feed/)
- [How does Facebook’s advertising targeting algorithm work?](https://quantmar.com/99/How-does-facebooks-advertising-targeting-algorithm-work)
- [ML and Auction Theory](https://www.youtube.com/watch?v=94s0yYECeR8)
- [Serving Billions of Personalized News Feeds with AI - Meihong Wang](https://www.youtube.com/watch?v=wcVJZwO_py0&t=80s)
- [Generating a Billion Personal News Feeds](https://www.youtube.com/watch?v=iXKR3HE-m8c&list=PLefpqz4O1tblTNAtKaSIOU8ecE6BATzdG&index=2)
- [Instagram feed ranking](https://www.facebook.com/atscaleevents/videos/1856120757994353/?v=1856120757994353)
- [How Instagram Feed Works](https://techcrunch.com/2018/06/01/how-instagram-feed-works/)
- [Photo search](https://engineering.fb.com/ml-applications/under-the-hood-photo-search/)
- Social graph search
- Recommendation
- [Instagram explore recommendation](https://about.instagram.com/blog/engineering/designing-a-constrained-exploration-system)
- [Recommending items to more than a billion people](https://engineering.fb.com/core-data/recommending-items-to-more-than-a-billion-people/)
- [Social recommendations](https://engineering.fb.com/android/made-in-ny-the-engineering-behind-social-recommendations/)
- [Live videos](https://engineering.fb.com/ios/under-the-hood-broadcasting-live-video-to-millions/)
- [Large Scale Graph Partitioning](https://engineering.fb.com/core-data/large-scale-graph-partitioning-with-apache-giraph/)
- [TAO: Facebook’s Distributed Data Store for the Social Graph](https://www.youtube.com/watch?time_continue=66&v=sNIvHttFjdI&feature=emb_logo) ([Paper](https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf))
- [NLP at Facebook](https://www.youtube.com/watch?v=ZcMvffdkSTE)
- Netflix
- [Recommendation at Netflix](https://www.slideshare.net/moustaki/recommending-for-the-world)
- [Past, Present & Future of Recommender Systems: An Industry Perspective](https://www.slideshare.net/justinbasilico/past-present-future-of-recommender-systems-an-industry-perspective)
- [Deep learning for recommender systems](https://www.slideshare.net/moustaki/deep-learning-for-recommender-systems-86752234)
- [Reliable ML at Netflix](https://www.slideshare.net/justinbasilico/making-netflix-machine-learning-algorithms-reliable)
- [ML at Netflix (Spark and GraphX)](https://www.slideshare.net/SessionsEvents/ehtsham-elahi-senior-research-engineer-personalization-science-and-engineering-group-at-netflix-at-mlconf-sea-50115?next_slideshow=1)
- [Recent Trends in Personalization](https://www.slideshare.net/justinbasilico/recent-trends-in-personalization-a-netflix-perspective)
- [Artwork Personalization @ Netflix](https://www.slideshare.net/justinbasilico/artwork-personalization-at-netflix)
- Airbnb
- [Categorizing Listing Photos at Airbnb](https://medium.com/airbnb-engineering/categorizing-listing-photos-at-airbnb-f9483f3ab7e3)
- [WIDeText: A Multimodal Deep Learning Framework](https://medium.com/airbnb-engineering/widetext-a-multimodal-deep-learning-framework-31ce2565880c)
- [Applying Deep Learning To Airbnb Search](https://dl.acm.org/doi/pdf/10.1145/3292500.3330658)
- Uber
- [DeepETA: How Uber Predicts Arrival Times Using Deep Learning](https://www.uber.com/blog/deepeta-how-uber-predicts-arrival-times/)
================================================
FILE: src/MLSD/ml-system-design.md
================================================
# 4. Machine Learning System Design
||
| --- |
| 1. [The 9-Step ML System Design Formula](#1-the-9-step-ml-system-design-formula-template) |
| 2. [ML System Design Sample Questions](#2-ml-system-design-sample-questions) |
|3. [ML System Design Topics](#3-ml-system-design-topics)|
|4. [ML at big tech companies](#4-ml-at-big-tech-companies)|
| 5. [Agentic AI System Design (2025)](https://github.com/alirezadir/Agentic-AI-Systems.git)|
||
### Designing ML systems for production
Deploying deep learning models in production can be challenging, and it goes well beyond training models with good performance. Several distinct components need to be designed and developed in order to deploy a production-level deep learning system.
Approaching an ML system design problem follows a similar logical flow to generic software system design. (For more insight on general system design interviews, you can check out, e.g., [Grokking the System Design Interview](https://www.educative.io/courses/grokking-the-system-design-interview) and [System design primer](https://github.com/donnemartin/system-design-primer).) However, certain components in the design of an ML-based system need special attention, as you will see below in the ML system design flow.
### ML System Design Interview
- In an ML system design interview you are exposed to open ended questions with no single correct answer.
- The goal of the ML system design interview is to evaluate your ability to zoom out and design a production-level ML system that can be deployed as a service within a company's ML infrastructure.
# 1. The 9-Step ML System Design Formula ([Template](./mlsd-template.md))
In order to design a solid ML system for real world applications, it is important to follow a design flow.
I recommend using the following **9-Step ML System Design Formula** to design software system solutions for ML-relevant business problems, both at work and during interviews:
Note: Remember when using this design flow during an interview to be flexible. According to the needs of the interview or the interests of the interviewer, you may skip some of these components or spend more time for a deep dive in one or two components.
## 1. Problem Formulation
- Clarifying questions
- Use case(s) and business goal
- Requirements
- Scope (features needed), scale, and personalization
- Performance: prediction latency, scale of prediction
- Constraints
- Data: sources and availability
- Assumptions
- Translate an abstract problem into an ML problem
- ML objective,
- ML I/O,
- ML category (e.g. binary classification, multi-class classification, unsupervised learning, etc)
- Do we need ML to solve this problem?
- Trade off between impact and cost
- Costs: Data collection, data annotation, compute
- If yes, we choose an ML system to design. If no, follow a general system design flow.
- Note: in an ML system design interview we can assume we need ML.
## 2. Metrics (Offline and Online)
- Offline metrics (e.g. classification, relevance metrics)
- Classification metrics
- Precision, Recall, F1, ROC AUC, P/R AUC, mAP, log-loss, etc
- Imbalanced data
- Retrieval and ranking metrics
- Precision@k, Recall@k (do not consider ranking quality)
- mAP, MRR, nDCG
- Regression metrics: MSE, MAE
- Problem specific metrics
- Language: BLEU, BLEURT, GLUE, ROUGE, etc
- ads: CPE, etc
- Latency
- Computational cost (in particular for on-device)
- Online metrics
- CTR
- Task/session success/failure rate,
- Task/session total (e.g. watch) times,
- Engagement rate (like rate, comment rate)
- Conversion rate
- Revenue lift
- Reciprocal rank of first click, etc,
- Counter metrics: direct negative feedback (hide, report)
- Trade-offs b/w metrics
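
To make the retrieval and ranking metrics above concrete, here is a minimal sketch in plain Python. The function names, the toy document IDs, and the `gains` mapping are illustrative assumptions, not part of any particular library. Note how Precision@k and Recall@k ignore the order within the top k, while reciprocal rank and nDCG reward placing relevant items earlier.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(item in relevant for item in ranked[:k]) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item (0 if none is retrieved)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """nDCG@k, where `gains` maps item -> graded relevance (hypothetical input)."""
    dcg = sum(gains.get(item, 0) / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the model ranks four documents; d1 and d2 are relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 1, "d2": 1}

print(precision_at_k(ranked, relevant, 2))
print(recall_at_k(ranked, relevant, 2))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, gains, 4))
```

MRR is then the mean of `reciprocal_rank` over a set of queries, and mAP can be built the same way from per-query average precision.
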
## 3. Architectural Components (MVP Logic)
- High level architecture and main components
- Non-ML components:
- user, app server, DBs, KGs, etc and their interactions
- ML components:
- Modeling modules (e.g. candidate generator, ranker, etc)
- Train data generator
- ...
- Modular architecture design
- Model 1 architecture (e.g. candidate generation)
- Model 2 architecture (e.g. ranker, filter)
- ...
## 4. Data Collection and Preparation
- Data needs
- target variable
- big actors in signals (e.g. users, items, etc)
- type (e.g. image, text, video, etc) and volume
- Data Sources
- availability and cost
- implicit (logging), explicit (e.g. user survey)
- Data storage
- ML Data types
- structured
- numerical (discrete, continuous)
- categorical (ordinal, nominal),
- unstructured (e.g. image, text, video, audio)
- Labelling (for supervised)
- Labeling methods
- Natural labels (extracted from data e.g. clicks, likes, purchase, etc)
- Missing negative labels (not clicking is not a negative label):
- Negative sampling
- Explicit user feedback
- Human annotation (super costly, slow, privacy issues)
- Handling lack of labels
- Programmatic labeling methods (noisy, pros: cost, privacy, adaptive)
- Semi-supervised methods (from an initial smaller set of labels e.g. perturbation based)
- Weak supervision (encode heuristics e.g. keywords, regex, db, output of other ML models)
- Transfer learning:
- pre-train on cheap large data (e.g. GPT-3),
- zero-shot or fine-tune for downstream task
- Active learning
- Labeling cost and trade-offs
- Data augmentation
- Data Generation Pipeline
- Data collection/ingestion (offline, online)
- Feature generation (next)
- Feature transform
- Label generation
- Joiner
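
As one illustration of the labeling step, the sketch below builds training examples from natural labels (clicks) and adds sampled negatives. It assumes a small in-memory catalog and a hypothetical `build_training_examples` helper; a production pipeline would read from logs and would usually prefer impressions without clicks over purely random negatives. Sampled negatives are noisy, since an item the user never saw is not necessarily one they dislike.

```python
import random

def build_training_examples(clicks_by_user, catalog, negatives_per_positive=2, seed=0):
    """Return (user, item, label) triples: clicks as 1, sampled non-clicked items as 0."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    examples = []
    for user in sorted(clicks_by_user):
        clicked = clicks_by_user[user]
        # Candidate negatives: catalog items this user has not clicked.
        unseen = [item for item in catalog if item not in clicked]
        for item in sorted(clicked):
            examples.append((user, item, 1))
            k = min(negatives_per_positive, len(unseen))
            examples.extend((user, neg, 0) for neg in rng.sample(unseen, k))
    return examples

# Hypothetical toy data.
catalog = ["a", "b", "c", "d", "e"]
clicks_by_user = {"u1": {"a"}, "u2": {"b", "c"}}
examples = build_training_examples(clicks_by_user, catalog)
for row in examples:
    print(row)
```

The ratio of negatives to positives is a tunable assumption here; it changes the base rate the model sees, which matters later for calibration.
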
## 5. Feature Engineering
- Choosing features
- Define big actors (e.g. user, item, document, query, ad, context),
- Define actor specific features (current, historic)
- Example user features: user profile, user history, user interests
- Example text features: n-grams (uni,bi), intent, topic, frequency, length, embeddings
- Define cross features (e.g. user-item, or query-document features)
- Example query-document features: tf-idf
- Example user-item features: user-video watch history, user search history, user-ad interactions (view, like)
- Privacy constraints
- Feature representation
- One hot encoding
- Embeddings
- e.g. for text, image, graphs, users (how), stores, etc
- how to generate/learn?
- pre-compute and store
- Encoding categorical features (one hot, ordinal, count, etc)
- Positional embeddings
- Scaling/Normalization (for numerical features)
- Preprocessing features
- Needed for unstructured data
- Text: Tokenize (Normalize, pre-tokenize, tokenizer model (ch/word/subword level), post-process (add special tokens))
- Images: Resize, normalize
- Video: Decode frames, sample, resize, scale and normalize
- Missing Values
- Feature importance
- Featurizer (raw data -> features)
- Static (from feature store) vs dynamic (computed online) features
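
The categorical encodings listed above can be sketched as follows. This is a minimal, assumption-laden example: `build_vocab`, `one_hot`, and `hash_bucket` are hypothetical helper names, index 0 is reserved for unseen categories by choice, and MD5 is used only to get a hash that is stable across processes (Python's built-in `hash` is salted per run).

```python
import hashlib

def build_vocab(values):
    """Map each distinct category to an index; index 0 is reserved for unseen values."""
    return {v: i + 1 for i, v in enumerate(sorted(set(values)))}

def one_hot(value, vocab):
    """One-hot vector of length len(vocab) + 1 (slot 0 = unknown category)."""
    vec = [0] * (len(vocab) + 1)
    vec[vocab.get(value, 0)] = 1
    return vec

def hash_bucket(value, num_buckets):
    """Hashing trick for high-cardinality IDs: stable bucket index in [0, num_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

vocab = build_vocab(["ios", "android", "web"])
print(one_hot("ios", vocab))   # known category
print(one_hot("tv", vocab))    # unseen category falls into slot 0
print(hash_bucket("user_42", 1000))
```

One-hot encoding suits low-cardinality features; for IDs with millions of values, the bucket index would typically feed an embedding table instead.
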
## 6. Model Development and Offline Evaluation
- Model selection (MVP)
- Heuristics -> simple model -> more complex model -> ensemble of models
- Pros and cons, and decision
- Note: Always start as simple as possible (KISS) and iterate
- Typical modeling choices:
- Logistic Regression
- Decision tree variants
- GBDT (XGBoost) and RF
- Neural networks
- FeedForward
- CNN
- RNN
- Transformers
- Decision Factors
- Complexity of the task
- Data: Type of data (structured, unstructured), amount of data, complexity of data
- Training speed
- Inference requirements: compute, latency, memory
- Continual learning
- Interpretability
- [Popular NN architectures](./mlsd-modeling-popular-archs.md)
- Dataset
- Sampling
- Non-probabilistic sampling
- Probabilistic sampling methods
- random, stratified, reservoir, importance sampling
- Data splits (train, dev, test)
- Portions
- Splitting time-correlated data (split by time)
- seasonality, trend
- Data leakage:
- scale after split,
- use only train split for stats, scaling, and missing vals
- Class Imbalance
- Resampling
- Weighted loss function
- combining classes
- Model training
- Loss functions
- MSE, Binary/Categorical CE, MAE, Huber loss, Hinge loss, Contrastive loss, etc
- Optimizers
- SGD, AdaGrad, RMSProp, Adam, etc
- Model training
- Training from scratch or fine-tune
- Model validation
- Debugging
- Offline vs online training
- Model offline evaluation
- Hyper parameter tuning
- Grid search
- Iterate over MVP model
- Model Selection
- Data augmentation
- Model update frequency
- Model calibration
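
The data-leakage points above (split by time, then compute scaling statistics on the train split only) can be sketched like this. The row layout with `ts` and `x` keys and the helper names are assumptions for illustration.

```python
import statistics

def time_split(rows, train_frac=0.8):
    """Split rows (dicts with a 'ts' key) by time: the oldest rows go to train."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def fit_scaler(values):
    """Compute mean and std on the TRAIN split only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)

def transform(values, mean, std):
    """Apply previously fitted statistics; never refit on dev/test data."""
    return [(v - mean) / std for v in values]

# Toy data with an upward trend over time.
rows = [{"ts": t, "x": float(t)} for t in range(10)]
train, test = time_split(rows)

mean, std = fit_scaler([r["x"] for r in train])
train_x = transform([r["x"] for r in train], mean, std)
test_x = transform([r["x"] for r in test], mean, std)
print(train_x)
print(test_x)
```

Because the statistics come from the train split alone, the test values land outside the train range here, which is exactly the trend the model will face after deployment; scaling before splitting would have hidden it.
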
## 7. Prediction Service
- Data processing and verification
- Web app and serving system
- Prediction service
- Batch vs Online prediction
- Batch: periodic, pre-computed and stored, retrieved as needed - high throughput
- Online: predict as request arrives - low latency
- Hybrid: e.g. Netflix: batch for titles, online for rows
- Nearest Neighbor Service
- Approximate NN
- Tree based, LSH, Clustering based
- ML on the Edge (on-device AI)
- Network connection/latency, privacy, cheap
- Memory, compute power, energy constraints
- Model Compression
- Quantization
- Pruning
- Knowledge distillation
- Factorization
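
As a rough illustration of quantization from the model compression list, the sketch below applies symmetric per-tensor int8 quantization to a list of weights. It is a simplified assumption of how post-training quantization works (real toolchains also handle activations, per-channel scales, and calibration data), and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per weight.
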
## 8. Online Testing and Model Deployment
- A/B Experiments
- How to A/B test?
- what portion of users?
- control and test groups
- null hypothesis
- Bandits
- Shadow deployment
- Canary release
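
For the A/B testing step, one common way to test the null hypothesis that control and treatment have the same CTR is a two-proportion z-test. The sketch below is one possible choice under the usual large-sample assumptions, not the only valid test; the function name and the click counts are made up for illustration.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for H0: CTR of A equals CTR of B."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # degenerate case: no variation at all
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control (A) vs. new model (B).
z, p_value = two_proportion_z_test(1000, 10000, 1150, 10000)
print(z, p_value)
```

A small p-value suggests the CTR difference is unlikely under the null hypothesis; the significance threshold and the required sample size should be fixed before the experiment starts.
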
## 9. Scaling, Monitoring, and Updates
- Scaling for increased demand (same as in distributed systems)
- Scaling general SW system (distributed servers, load balancer, sharding, replication, caching, etc)
- Train data / KB partitioning
- Scaling ML system
- Distributed ML
- Data parallelism (for training)
- Model parallelism (for training, inference)
- Asynchronous SGD
- Synchronous SGD
- Distributed training
- Data parallel DT, RPC based DT
- Scaling data collection
- [MT for 1000 languages](https://arxiv.org/abs/2205.03983)
- [NLLB](https://research.facebook.com/publications/no-language-left-behind/)
- Monitoring, failure tolerance, updating (below)
- Auto ML (soft: HP tuning, hard: arch search (NAS))
- Monitoring:
- Logging
- Features, predictions, metrics, events
- Monitoring metrics
- SW system metrics
- ML metrics (accuracy related, predictions, features)
- Online and offline metric dashboards
- Monitoring data distribution shifts
- Types: Covariate, label and concept shifts
- Detection (stats, hypothesis testing)
- Correction
- System failures
- SW system failure
- dependency, deployment, hardware, downtime
- ML system failure
- data distribution difference (test vs online)
- feedback loops
- edge cases (e.g. invalid/junk input)
- data distribution changes
- Alarms
- failures (data pipeline, training, deployment), low metrics, etc
- Updates: Continual training
- Model updates
- train from scratch or a base model
- how often? daily, weekly, monthly, etc
- Auto update models
- Active learning
- Human in the loop ML
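
One simple statistic for detecting a shift in a feature's distribution between training and serving is the Population Stability Index (PSI). The sketch below is an assumption about how monitoring could be done, not a prescribed method; the bin count, the `eps` floor, and any alert threshold are choices to tune per feature.

```python
import math

def psi(expected, actual, num_bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all reference values identical

    def histogram(values):
        counts = [0] * num_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = max(0, min(num_bins - 1, idx))  # clamp out-of-range values
            counts[idx] += 1
        # Floor empty bins at eps so the log below is defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(100)]   # feature values at training time
shifted = [v + 0.5 for v in reference]      # same feature after a covariate shift
print(psi(reference, reference))
print(psi(reference, shifted))
```

A PSI near zero means the two samples fill the bins similarly; larger values indicate drift and can trigger an alarm or a retraining job.
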
### Other topics:
- Extensions:
- Iterations over the base design to add a new functional feature
- Bias in training data
- Bias introduced by human labeling
- Freshness, Diversity
- Privacy and security
# 2. ML System Design Sample Questions
Below are the most common ML system design questions for ML engineering interviews:
### Recommendation Systems
- **[Video/Movie recommendation](./mlsd-video-recom.md)** (Netflix, YouTube)
- **[Friends / follower recommendation](./mlsd-pymk.md)** (Facebook, Twitter, LinkedIn)
- **[Event recommendation system](./mlsd-event-recom.md)** (Eventbrite)
- **[Game recommendation](./mlsd-game-recom.md)**
- **Replacement product recommendation** (Instacart)
- **Rental recommendation** (Airbnb)
- **Place recommendation**
### Search systems (retrieval, ranking)
- **Document search**
- **[Text query search](./mlsd-search.md)** (full text, semantic)
- **[Image/Video search](./mlsd-image-search.md)**
- **[Multimodal search](./mlsd-mm-video-search.md)** (MM Query)
### Ranking systems
- **[Newsfeed system](./mlsd-newsfeed.md)** (ranking)
- **Ads serving system** (retrieval, ranking)
- **[Ads click prediction](./mlsd-ads-ranking.md)** (ranking)
### NLP
- **Named entity linking system (NLP tagging, reasoning)**
- **Autocompletion / typeahead suggestion system**
- **Sentiment analysis system**
- **Language identification system**
- **Chatbot system**
- **Question answering system**
### CV
- **Image blurring system**
- **OCR/Text recognition system**
### AV
- **Self-driving car**
- Perception, Prediction, and Planning
- [Pedestrian jaywalking detection](./mlsd-av.md)
- **Ride matching system**
### Other
- **Proximity service / Yelp**
- **Food delivery time approximation**
- **Harmful content / Spam detection system**
- [Multimodal harmful content detection](./mlsd-harmful-content.md)
- Fraud detection system
- **Healthcare diagnosis system**
# 3. ML System Design Topics
I have observed that certain sets of topics are frequently brought up or can be used as part of the logic of the system. Here are some of the important ones:
### Recommendation Systems
- Candidate generation
- Collaborative Filtering (CF)
- User based, item based
- Matrix factorization
- Two-tower approach
- Content based filtering
- Ranking
- Learning to rank (LTR)
- point-wise (simplest), pairwise, list-wise
### Search and Ranking (Ads, newsfeed, etc)
- Search systems
- Query search (keyword search, [semantic search](https://txt.cohere.ai/what-is-semantic-search/?utm_source=linkedin&utm_medium=paidsocial&utm_campaign=contentpromotion_bloglookalikes))
- Visual search
- Video search
- Two stage model
- document selection
- document ranking
- Ranking
- Newsfeed ranking system
- Ads ranking system
- Ranking as classification
- Multi-stage ranking + blender + filter
### NLP
- Feature engineering
- Preprocessing (tokenization)
- Text Embeddings
- Word2Vec, GloVe, ELMo, BERT
- NLP Tasks:
- Text classification
- Sentiment analysis
- Topic modeling
- Sequence tagging
- Named entity recognition
- Part of speech tagging
- POS HMM
- Viterbi algorithm, beam search
- Text generation
- Language modeling
- N-grams vs deep learning models (trade-offs)
- Decoding
- Sequence 2 Sequence models
- Machine Translation
- Seq2seq models, NMT, Transformers
- Question Answering
- [Adv] Dialog and chatbots
- [CMU lecture on chatbots](http://tts.speech.cs.cmu.edu/courses/11492/slides/chatbots_shrimai.pdf)
- [CMU lecture on spoken dialogue systems](http://tts.speech.cs.cmu.edu/courses/11492/slides/sds_components.pdf)
- Speech Recognition Systems
- Feature extraction, MFCCs
- Acoustic modeling
- HMMs for AM
- CTC algorithm (advanced)
### Computer Vision
- Image classification
- VGG, ResNet
- [Object detection](https://viso.ai/deep-learning/object-detection/)
- Two stage models (R-CNN, Fast R-CNN, Faster R-CNN)
- One stage models (YOLO, SSD)
- [Vision Transformer (ViT)](https://viso.ai/deep-learning/vision-transformer-vit/)
- NMS algorithm
- Object Tracking
### Graph problems
- People you may know
# 4. ML at big tech companies
Once you learn the basics, I highly recommend checking out different companies' blogs on ML systems. You can refer to some of those resources in the [ML at Companies](./ml-companies.md) section.
# More resources
- For more insight on the different components above, you can check out the following resources:
- [Full Stack Deep Learning course](https://fall2019.fullstackdeeplearning.com/)
- [Production Level Deep Learning](https://github.com/alirezadir/Production-Level-Deep-Learning)
- [Machine Learning Systems Design](https://github.com/chiphuyen/machine-learning-systems-design)
- [Stanford course on ML system design](https://online.stanford.edu/courses/cs329s-machine-learning-systems-design)
================================================
FILE: src/MLSD/mlsd-ads-ranking.md
================================================
# Ads Click Prediction
### 1. Problem Formulation
* Clarifying questions
* What is the primary business objective of the click prediction system?
* What types of ads are we predicting clicks for (e.g., display ads, video ads, sponsored content)?
* Are there specific user segments or contexts we should consider (e.g., user demographics, browsing history)?
* How will we define and measure the success of click predictions (e.g., click-through rate, conversion rate)?
* Do we have negative feedback features (such as hide ad, block, etc)?
* Do we have a fatigue period (where an ad is no longer shown to a user who has shown no interest in it, for X days)?
* What type of user-ad interaction data do we have access to, and can we use it to train our models?
* Do we need continual training?
* How do we collect negative samples? (not clicked, negative feedback).
* Use case(s) and business goal
* use case: predict which ads a user is likely to click on when presented with multiple ad options.
* business objective: maximize ad revenue by delivering more relevant ads to users, improving click-through rates, and maximizing the value of ad inventory.
* Requirements:
* Real-time prediction capabilities to serve ads dynamically.
* Scalability to handle a large number of ad impressions.
* Integration with ad serving platforms and data sources.
* Continuous model training and updating.
* Constraints:
* Privacy and compliance with data protection regulations.
* Latency requirements for real-time ad serving.
* Limited user attention, as users may quickly decide whether to click on an ad.
* Data: Sources and Availability:
* Data sources include user interaction logs, ad content data, user profiles, and contextual information.
* Historical click and impression data for model training and evaluation.
* Availability of labeled data for supervised learning.
* Assumptions:
* Users' click behavior is influenced by factors that can be learned from historical data.
* Ad content and relevance play a significant role in click predictions.
* The click behavior can be modeled as a classification problem.
* ML Formulation:
* Ad click prediction is a ranking problem
### 2. Metrics
* Offline metrics
* Cross entropy (CE)
* Normalized cross entropy (NCE): CE normalized over a baseline
* Online metrics
* CTR (#clicks/#impressions)
* Conversion rate (#conversion/#impression)
* Revenue lift (increase in revenue over time)
* Hide rate (#hidden ads/#impression)
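The NCE offline metric above can be sketched as follows. This is a minimal sketch that assumes the baseline is a model that always predicts the average click rate of the evaluation set; the function names and the toy data are illustrative, not part of the original notes.

```python
import math

def cross_entropy(labels, probs, eps=1e-12):
    """Average binary cross entropy (log loss) over all impressions."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def normalized_cross_entropy(labels, probs):
    """Model CE divided by the CE of a baseline that always predicts the average CTR (assumed baseline)."""
    base_rate = sum(labels) / len(labels)
    baseline = cross_entropy(labels, [base_rate] * len(labels))
    return cross_entropy(labels, probs) / baseline

# Toy evaluation set: 1 = click, 0 = no click
labels = [1, 0, 0, 0, 1, 0, 0, 0]
probs = [0.7, 0.1, 0.2, 0.1, 0.6, 0.3, 0.1, 0.2]
nce = normalized_cross_entropy(labels, probs)
print(f"NCE = {nce:.3f}")  # values below 1 mean the model beats the baseline
```

Under this definition a model that only predicts the average CTR scores exactly 1, so lower is better.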
### 3. Architectural Components
* High level architecture
* We can use point-wise learning to rank (LTR)
* This is a binary classification task, where the goal is to predict whether a user will click (1) or not click (0) on a given ad impression -> given a (user, ad) pair as input -> click or no click
* Features can include user demographics, ad characteristics, context (e.g., device, location), and historical behavior.
* Machine learning models, such as logistic regression, decision trees, gradient boosting, or deep neural networks, can be used for prediction.
### 4. Data Collection and Preparation
* Data Sources
* Users,
* Ads,
* User-ad interaction
* ML Data types
* Labelling
### 5. Feature Engineering
* Feature selection
* Ads:
* IDs
* categories
* Image/videos
* No of impressions / clicks (ad, adv, campaign)
* User:
* ID, username
* Demographics (Age, gender, location)
* Context (device, time of day, etc)
* Interaction history (e.g. user ad click rate, total clicks, etc)
* User-Ad interaction:
* IDs(user, Ad), interaction type, time, location, dwell time
* Feature representation / preparation
* sparse features
* IDs: embedding layer (each ID type its own embedding layer)
* Dense features:
* Engagement feats: No of clicks, impressions, etc
* use directly
* Image / Video:
* preprocess
* use e.g. SimCLR to convert -> feature vector
* Category: Textual data
* normalization, tokenization, encoding
### 6. Model Development and Offline Evaluation
* Model selection
* LR
* Feature crossing + LR
* feature crossing: combine 2/more features into new feats (e.g. sum, product)
* pros: capture nonlin interactions b/w feats
* cons: manual process, and domain knowledge needed
* GBDT
* pros: interpretable
* cons: inefficient for continual training, can't train embedding layers
* GBDT + LR
* GBDT for feature selection and/or extraction, LR for classific
* NN
* Two options: single network, two tower network (user tower, ad tower)
* Cons for ads prediction:
* sparsity of features, huge number of them
* hard to capture pairwise interactions (large no of them)
* Not a good choice here.
* Deep and cross network (DCN)
* finds feature interactions automatically
* two parallel networks: deep network (learns complex features) and cross network (learns interactions)
* two types: stacked, and parallel
* Factorization Machine
* embedding based model, improves LR by automatically learning feature interactions (by learning embeddings for features)
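* $\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_i \sum_{j>i} \langle v_i, v_j \rangle x_i x_j$, where $v_i$ is the learned embedding of feature $i$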
* cons: can't learn higher order interactions from features unlike NN
* Deep factorization machine (DeepFM)
* combines a NN (for complex features) and a FM (for pairwise interactions)
* start with LR to form a baseline, then experiment with DCN & DeepFM
* Model Training
* Loss function:
* binary classification: CE
* Dataset
* labels: positive: user clicks the ad < t seconds after ad is shown, negative: no click within t secs
* Model eval and HP tuning
* Iterations
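The factorization machine score listed under model selection can be sketched as below. This is a minimal sketch with hand-picked toy weights (assumptions, not trained values): the pairwise term sums the dot product of the two feature embeddings times `x_i * x_j` over all feature pairs, and the fast version uses the standard identity that reduces the pairwise sum to linear time in the number of features.

```python
def fm_predict(x, w0, w, v):
    """Factorization machine score for one example.

    x: feature values (length n), w0: bias, w: linear weights (length n),
    v: n latent vectors of size k, one per feature.
    """
    n, k = len(x), len(v[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))            # (sum_i v_if x_i)
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))  # sum_i (v_if x_i)^2
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise

def fm_predict_naive(x, w0, w, v):
    """Same score computed with the explicit double sum over feature pairs."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            score += dot * x[i] * x[j]
    return score

# Toy example: 3 features, embedding size 2 (all values are illustrative)
x = [1, 0, 1]
w0, w = 0.1, [0.2, -0.1, 0.3]
v = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
fast = fm_predict(x, w0, w, v)
naive = fm_predict_naive(x, w0, w, v)
print(fast, naive)
```

For click prediction the raw score would typically be passed through a sigmoid to obtain a probability.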
### 7. Prediction Service
* Data Prep pipeline
* static features (e.g. ad img, category) -> batch feature compute (daily, weekly) -> feature store
* dynamic features: # of ad impressions, clicks.
* Prediction pipeline
* two stage (funnel) architecture
* candidate generation
* use ad targeting criteria by advertiser (age, gender, location, etc)
* ranking
* features -> model -> click prob. -> sort
* re-ranking: business logic (e.g. diversity)
* Continual learning pipeline
* fine tune on new data, eval, and deploy if improves metrics
### 8. Online Testing and Deployment
* A/B Test
* Deployment and release
### 9. Scaling, Monitoring, and Updates
* Scaling (SW and ML systems)
* Monitoring
* Updates
### 10. Other topics
* calibration:
* fine-tuning predicted probabilities to align them with actual click probabilities
* data leakage:
* info from the test or eval dataset influences the training process
* target leakage, data contamination (from test to train set)
* catastrophic forgetting
* model trained on new data loses its ability to perform well on previously learned tasks
================================================
FILE: src/MLSD/mlsd-av.md
================================================
# Self-driving cars
- drives itself, with little or no human intervention
- different levels of autonomy
## Hardware support
### Sensors
* Camera
* used for classification, segmentation, and localization.
* problem w/ night time, and extreme conditions like fog, heavy rain.
* LiDAR (Light Detection And Ranging)
* uses lasers or light to measure the distance of the nearby objects.
* adds depth (3D perception), point cloud
* works at night or in the dark, but still fails when there’s noise from rain or fog.
* RADAR (Radio detection and ranging)
* use radio waves (instead of lasers), so they work in any conditions
* sense the distance from reflection,
* very noisy (needs clean up (thresholding, FFT)), lower spatial resolution, interference w/ other radio systems
* point cloud
* Audio
## Stack

* **Perception**
* Raw sensor (lidar, camera, etc) data (image, point cloud) -> world understanding (objects)
* Object detection (traffic lights, pedestrians, road signs, walkways, parking spots, lanes, etc), traffic light state detection, etc
* Localization
* calculate position and orientation of the vehicle as it navigates (Visual Odometry (VO)).
* Deep learning used to improve the performance of VO, and to classify objects.
* Examples: PoseNet and VLocNet++, use point data to estimate the 3D position and orientation.
* ....
* **Behavior prediction**
* predict future trajectory of agents
* **Planning**: decision making and generate trajectory
* **Controller**: generate control commands: accelerate, brake, steer left or right
* Note: latency: on the order of milliseconds for some tasks, and on the order of 10 ms for others
## Perception
* 2D Object detection:
* Two-stage detectors: using Region Proposal Network (RPN) to learn RoI for potential objects + bounding box predictions (using RoI pooling): R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN (also does segmentation)
* used to outperform one-stage detectors until focal loss (RetinaNet) was introduced
* One-stage: skip proposal generation; directly produce obj BB: YOLO, SSD, RetinaNet
* computationally appealing (real time)
* Transformer based:
* Detection Transformer ([DETR](https://github.com/facebookresearch/detr)): End-to-End Object Detection with Transformers
* uses a CNN backbone for feature extraction, followed by a transformer encoder-decoder.
* input image -> CNN -> feature map -> transformer encoder-decoder -> final object queries, corresponding class labels and bounding boxes.
* handles varying no. of objects in an image, as it does not rely on a fixed set of object proposals.
* [More](https://towardsdatascience.com/detr-end-to-end-object-detection-with-transformers-and-implementation-of-python-8f195015c94d)
* TrackFormer: Multi-Object Tracking with Transformers
* on top of DETR
* NMS:
* 3D Object detection:
* from point cloud data, ideas transferred from 2D detection
* Examples:
* 3D convolutions on voxelized point cloud
* 2D convolutions on BEV
* heavy computation
* Object tracking:
* use probabilistic methods such as EKF
* use ML based models
* use/fine-tune pre-trained CNNs for feature extraction -> do tracking with correlation or regression.
* use a tracking-by-detection algorithm, such as SORT (Simple Online and Realtime Tracking) or DeepSORT (which adds deep appearance features)
* Semantic segmentation
* pixel-wise classification of image (each pixel assigned a class)
* Instance segmentation
* combine obj detection + semantic segmentation -> classify pixels of each instance of an object
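The NMS (non-maximum suppression) step listed under 2D object detection can be sketched as follows. This is a minimal greedy version, assuming boxes are given as `(x1, y1, x2, y2)` tuples and a single class; the threshold value and the toy boxes are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping detections of the same object plus one separate object
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive suppression
```

The same `iou` helper is what the IoU-threshold-based precision and AP metrics rely on when matching predictions to ground-truth boxes.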
## Behavior prediction
* Main task: Motion forecasting/ trajectory prediction (future):
* predict where each object will be in the future given multiple past frames
* Examples:
* use RNN/LSTM for prediction
* Input from perception + HDMap
* Options:
* top-view representation: input -> CNN -> ..
* vectorized: context map
* graph representation: GNN
* Render a bird's-eye view (BEV) image on a single RGB image
* one option for history: also render on single image
* another option: use feature extractor (CNN) for each frame then use LSTM to get temporal info
* Input: BEV image + (v, a, a_v)
* Out: (x, y, std)

* also possible to use LSTM networks to generate waypoints in the trajectory sequentially.
* Challenge: Multimodality (distribution of different modes) - future uncertain
## Planning
- Decision making and generate trajectory
- input: route (from A to B), context map, prediction for nearby agents
- proposal: what are possible options for the plan (mathematical methods vs imitation learning) - predict what is optimal
* Hierarchical RL can be used
* high level planner: yield, stop, turn left/right, lane following, etc
* low level planner: execute commands
- motion validation: check e.g. collision, red light, etc -> reject + ranking
## Multi task approaches
* ### Perception + Behavior prediction
* Fast & Furious (Uber):
* Tasks: Detection, tracking, short term (e.g. 1 sec) motion forecasting
* create BEV from point cloud data:
* quantize 3D → 3D voxel grid (binary for occupancy) → height as channel (3rd dimension, like RGB channels) + time as 4th dimension → single-stage detector similar to SSD
* deal with temporal dimension in two ways:
* early fusion (aggregate temporal info at the very first layer)
* late fusion (gradually merge the temporal info: allows the model to capture high-level motion features.)
* use multiple predefined boxes for each feature map location (similar to SSD)
* two branches after the feature map:
* binary classification (P (being a vehicle) for each pre-allocated box)
* predict (regress) the BB over the current frame as well as n − 1 frames into the future → size and heading

* IntentNet: learning to predict intent from raw sensor data (Uber)
* Fuse BEV generated from the point cloud + HDMap info to do detection, intention prediction, and trajectory prediction.
* I: Voxelized LiDAR in BEV, Rasterized HDMap
* O: detected objects, trajectory, 8-class intention (keep lane, turn left, etc)

* ### Behavior Prediction + Planning (Mid-to-Mid Model)
* ChauffeurNet (Waymo)
* prediction and planning using single NN using Imitation Learning (IL)
* More info [here](https://medium.com/aiguys/behavior-prediction-and-decision-making-in-self-driving-cars-using-deep-learning-784761ed34af)
* ### End to end
* Learning to drive in a day (wayve.ai)
* RL to train a driving policy to follow a lane from scratch in less than 20 minutes!
* Without any HDMap and hand-written rules!
* Learning to Drive Like a Human
* Imitation learning + RL
* used some auxiliary tasks like segmentation, depth estimation, and optical flow estimation to learn a better representation of the scene and use it to train the policy.
---
# Example
Design an ML system to detect if a pedestrian is about to jaywalk.
### 1. Problem Formulation
- Jaywalking: a pedestrian crossing a street where there is no crosswalk or intersection.
- Goal: develop an ML system that can accurately predict if a pedestrian is going to do jaywalking over a short time horizon (e.g. 1 sec) in real-time.
- Pedestrian action prediction is harder than vehicle: future behavior depends on other factors such as body pose, activity, etc.
* ML Objective
* binary classification (predict if a pedestrian is going to do jaywalking or not in the next T seconds.)
* Discuss data sources and availability.
### 2. Metrics
#### Component level metrics
* Object detection
* Precision
* calculated based on IOU threshold
* AP: avg. across various IOU thresholds
* mAP: mean of AP over C classes
* jaywalking detection:
* Precision, Recall, F1
#### End-to-end metrics
* Manual intervention
* Simulation Errors
* historical log (scene recording) w/ expert driver
* input to our system and compare the decisions with the expert driver
### 3. Architectural Components
* Visual Understanding System
* Camera: Object detection (pedestrian, drivable region?) + tracking
* [Optional] Camera + object detection: Activity recognition
* Radar: 3D Object detection (skip)
* Behavior prediction system
* Trajectory estimation
* require motion history
* ML-based approach (classification)
* Input:
* Vision: local context: seq. of ped's cropped image (last k frames) + global context (semantically segmented images over last k frames)
* Non-vision: Ped's trajectory (as BBs, last k frames) + context map + context(location, age group, etc)
### 4. Data Collection and Preparation
* Data collection and annotation:
* Collect datasets of pedestrian behavior, including both jaywalking and non-jaywalking behavior. This data can be obtained through public video footage or by recording video footage ourselves.
* Collect a diverse dataset of video clips or image sequences from various locations, including urban and suburban areas, with different pedestrian behaviors, traffic conditions, and lighting conditions.
* Annotate the data by marking pedestrians, their positions, and whether they are jaywalking or not. This can be done by drawing bounding boxes around pedestrians and labeling them accordingly (initially by human labelers, eventually by an auto-labeling system)
* Targeted data collection:
* in later iterations, we check cases where driver had to intervene when pedestrian jaywalking, check performance on last 20 frames, and ask labelers to label those and add to the dataset (examples need to be seen)
* Labeling:
* each video frame is annotated with BB + pose info of the ped + activity tags (walking, standing, crossing, looking, etc) + attributes of pedestrian (age, gender, location, etc),
* each video is annotated with weather conditions and time of day.
* Data preprocessing:
* Split the dataset into training, validation, and test sets.
* Normalize and resize the images to maintain consistency in input data.
* Apply data augmentation techniques (e.g., rotation, flipping, brightness adjustments) to increase the dataset's size and improve model generalization.
* enhance or augment the data with GANs
* Data augmentation
### 5. Feature Engineering
* relevant features from the video footage, such as the pedestrian's position, speed, and direction of movement.
* We can also use computer vision techniques to extract features like the presence of a crosswalk, traffic lights, or other relevant environmental cues.
* features from frames: fc6 features by Faster R-CNN object detector at each BB (4096T vector)
* assume: we can query cropped images of last T (e.g. 5) frames of detected pedestrians from built-in object detector and tracking system
* features from cropped frames: activity recognition
* context map : traffic signs, street width, etc
* ped's history (seq. of BB info) + current info (BB + pose info (openPose) + activity + local context) + global context (context map) + context(location, age group, etc) -> JW/NJW classifier
* other features that can be fused: ped's pose, BB, semantic segmentation maps (semantic masks for relevant objects), road geometry, surrounding people, interaction with other agents
### 6. Model Development and Offline Evaluation
Model selection and architecture:
Assume built-in object detector and tracker. If not,
* Object detection: Use a pre-trained object detection model like Faster R-CNN, YOLO, or SSD to identify and localize pedestrians in the video frames.
* Object tracking:
* use EKF based method or ML based method (SORT or DeepSORT)
* Activity recognition:
* 3D CNN, or CNN + RNN(GRU) (chose this to fit the rest of the architecture)
(Output of object detection and tracking can be converted into rasterized image for each actor -> Base CNN )
* Encoders:
* Visual Encoder: vision content (last k frames) -> CNN base encoders + RNN for temporal info(GRU) [Another option is to use 3D CNNs]
* CNN base encoder -> another RNN for activity recognition
* Non-vision encoder: for temporal content use GRU
* Fusion strategies:
* early fusion
* late fusion
* hierarchical fusion
* Jaywalking clf: Design a custom clf layer to classify detected pedestrians as jaywalking or not.
* Example: RF, or a FC layer
* we can do ablation study for selection of the fusion architecture + visual and non-visual encoders
Another example:

Model training and evaluation:
a. Train model(s) using the annotated dataset,
+ loss functions for object detection (MSE, BCE, IoU)
+ jaywalking classification tasks (BCE).
b. Regularly evaluate the model on the validation set to monitor performance and avoid overfitting. Adjust hyperparameters, such as learning rate and batch size, if necessary.
c. Once the model converges, evaluate its performance on the test set, using relevant metrics like precision, recall, F1 score, and Intersection over Union (IoU).
Transfer learning for object detection (use powerful feature detectors from pre-trained models)
* for fine tuning e.g. use 500 videos each 5-10 seconds, 30fps
### 7. Prediction Service
* SDV on the road: will receive real-time images -> ...
* Model optimization: Optimize the model for real-time deployment by using techniques such as model pruning, quantization, and TensorRT optimization.
### 8. Online Testing and Deployment
Deployment: Deploy the trained model on edge devices or servers equipped with cameras to monitor real-time video feeds (e.g. traffic camera system) and detect jaywalking instances. Integrate the system with existing traffic infrastructure, such as traffic signals and surveillance systems.
### 9. Scaling, Monitoring, and Updates
Continuous improvement: Regularly update the model with new data and retrain it to improve its performance and adapt to changing pedestrian behaviors and environmental conditions.
* Other points:
* Occlusion detection
* hallucinated agent
* when visual signal is imprecise
* poor lighting conditions
================================================
FILE: src/MLSD/mlsd-event-recom.md
================================================
# Design an event recommendation system
## 1. Problem Formulation
* Clarifying questions
- Use case?
- event recommendation system similar to eventbrite's.
- What is the main Business objective?
- Increase ticket sales
- Does it need to be personalized for the user? Personalized for the user
- User locations? Worldwide (multiple languages)
- User’s age group:
- How many users? 100 million DAU
- How many events? 1M events / month
- Latency requirements - 200msec?
- Data access
- Do we log and have access to any data? Can we build a dataset using user interactions ?
- Do we have textual description of items?
- Can we use location data (e.g. 3rd party API)? (events are location based)
- Can users become friends on the platform? Do we wanna use friendships?
- Can users invite friends?
- Can users RSVP or just register?
- Free or Paid? Both
* ML formulation
* ML Objective: Recommend most relevant (define) events to the users to maximize the number of registered events
* ML category: Recommendation system (ranking approach)
* rule based system
* embedding based (CF and content based)
* Ranking problem (LTR)
* pointwise, pairwise, listwise
* we choose pointwise LTR ranking formulation
* I/O: In: user_id, Out: ranked list of events + relevance score
* Pointwise LTR classifier I/O: I: (user, event) pair, O: P(event register) (Binary classification)
## 2. Metrics (Offline and Online)
* Offline:
* precision@k, recall@k (do not consider ranking quality)
* MRR (focuses on the first relevant item), mAP (binary relevance), nDCG (non-binary relevance) -> here event registration is a binary relevance signal, so use mAP
* Online:
* CTR, conversion rate, bookmark/like rate, revenue lift
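The mAP and MRR offline metrics above can be sketched as follows. This is a minimal sketch assuming each user's ranked list is given as binary relevance flags (1 = registered, 0 = not) and that AP is normalized by the number of relevant items present in the list; the toy lists are illustrative.

```python
def average_precision(relevance):
    """AP for one ranked list with binary relevance flags, top result first."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision@k at each relevant position
    return total / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """mAP: mean of AP over all users (one ranked list per user)."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item (0 if none); MRR is its mean over users."""
    for k, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / k
    return 0.0

# Two users, each with a ranked list of 4 recommended events
user_a = [1, 0, 1, 0]
user_b = [0, 1, 0, 0]
map_score = mean_average_precision([user_a, user_b])
print(f"mAP = {map_score:.3f}")
```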
## 3. Architectural Components (MVP Logic)
* We use a two-stage (funnel) architecture:
* candidate generation
* rule based event filtering (e.g. location, etc)
* ranking formulation (pointwise LTR) binary classifier
## 4. Data preparation
* Data Sources:
1. Users (user profile, historical interactions)
2. Events
3. User friendships
4. User-event interactions
5. Context
* Labeling:
## 5. Feature engineering
* Note: Event based recommendation is more challenging than movie/video:
* events are short lived -> not many historical interactions -> cold start (constant new item problem)
* So we put more effort on feature engineering (many meaningful features)
* Features:
- User features
- age (bucketize + one hot), gender (one hot), event history
- Event features
- price, No of registered,
- time (event time, length, remaining time)
- location (city, country, accessibility)
- description
- host (& popularity)
- User Event features
- event price similarity
- event description similarity
- no. registered similarity
- same city, state, country
- distance
- time similarity (event length, day, time of day)
- Social features
- No./ ratio of friends going
- invited by friends (No)
- hosted by friend (similarity)
- context
- location, time
* Feature preprocessing
- one hot (gender)
- bucketize + one hot (age, distance, time)
* feature processing
* Batch (for static) vs Online (streaming, for dynamic) processing
* efficient feature computation (e.g. for location, distance)
* improve: embedding learning - for users and events
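The "bucketize + one hot" preprocessing listed above can be sketched as follows. This is a minimal sketch; the age boundaries are a hypothetical choice for illustration, and the same helpers would apply to distance or time features with their own boundaries.

```python
import bisect

def bucketize(value, boundaries):
    """Index of the bucket that value falls into, given sorted bucket boundaries."""
    return bisect.bisect_right(boundaries, value)

def one_hot(index, size):
    """One-hot vector of the given size with a 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical boundaries -> buckets: <18, 18-24, 25-34, 35-49, 50+
AGE_BOUNDARIES = [18, 25, 35, 50]

def encode_age(age):
    """Bucketize an age and one-hot encode the bucket index."""
    return one_hot(bucketize(age, AGE_BOUNDARIES), len(AGE_BOUNDARIES) + 1)

print(encode_age(30))
print(encode_age(70))
```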
## 6. Model Development and Offline Evaluation
* Model selection
* Binary classification problem:
* LR (cannot capture nonlinear interactions)
* GBDT (good for structured, not for continual learning)
* NN (continual learning, expressive, nonlinear rels)
* we can start with GBDT as a baseline and experiment improvements by NN (both good options)
* Dataset
* for each user and event pair, compute features, and label 1 if registered, 0 if not
* class imbalance
* resampling
* use focal loss or class-balanced loss
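The focal loss option for class imbalance can be sketched as below. This is a minimal per-example version of the binary focal loss; the default `gamma` and `alpha` values are commonly used settings, taken here as assumptions rather than tuned choices.

```python
import math

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example.

    y: true label (1 = registered, 0 = not), p: predicted P(register).
    gamma down-weights easy examples; alpha balances positives vs negatives.
    """
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    p_t = p if y == 1 else 1 - p   # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

easy_negative = focal_loss(0, 0.1)  # confidently correct -> tiny loss
hard_negative = focal_loss(0, 0.9)  # confidently wrong -> large loss
print(easy_negative, hard_negative)
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted cross entropy, which is one way to sanity-check an implementation.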
## 7. Prediction Service
* Candidate generation
* event filtering (millions to hundreds)
* rule based (given a user, e.g. location, type, etc filters)
* Ranking
* compute scores for (user, event) pairs, and sort
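The two-stage flow above (rule-based filtering, then scoring and sorting) can be sketched as follows. This is only an illustration: the same-city rule, the dictionary fields, and `toy_score` (a stand-in for the trained classifier's P(register)) are all hypothetical.

```python
def generate_candidates(user, events):
    """Rule-based filtering: keep only events in the user's city (hypothetical rule)."""
    return [e for e in events if e["city"] == user["city"]]

def rank(user, candidates, score_fn, top_k=10):
    """Score each (user, event) pair and return the top_k (event_id, score), best first."""
    scored = [(score_fn(user, e), e["id"]) for e in candidates]
    scored.sort(reverse=True)
    return [(event_id, score) for score, event_id in scored[:top_k]]

def toy_score(user, event):
    """Stand-in for the trained model's P(register | user, event)."""
    return 0.9 if event["category"] in user["interests"] else 0.2

user = {"city": "Berlin", "interests": {"music"}}
events = [
    {"id": 1, "city": "Berlin", "category": "music"},
    {"id": 2, "city": "Paris", "category": "music"},
    {"id": 3, "city": "Berlin", "category": "sports"},
]
ranked = rank(user, generate_candidates(user, events), toy_score)
print(ranked)
```

Keeping the filter cheap and running the model only on the surviving candidates is what lets the funnel meet the latency budget.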
## 8. Online Testing and Deployment
Standard approaches as before.
## 9. Scaling
================================================
FILE: src/MLSD/mlsd-feature-eng.md
================================================
# Feature preprocessing
## Text preprocessing
normalization -> tokenization -> token to ids
* normalization
* tokenization
* Word tokenization
* Subword tokenization
* Character tokenization
* token to ids
* lookup table
* Hashing
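The normalization -> tokenization -> token-to-ids pipeline above can be sketched as follows, showing both the lookup-table and the hashing options. This is a minimal sketch assuming lowercase normalization and whitespace word tokenization; the `<unk>` token and the bucket count are illustrative choices.

```python
import hashlib

def build_vocab(tokens, unk="<unk>"):
    """Lookup table: assign each distinct token the next free integer id."""
    vocab = {unk: 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def lookup_ids(tokens, vocab, unk="<unk>"):
    """Map tokens to ids; unseen tokens fall back to the unknown id."""
    return [vocab.get(t, vocab[unk]) for t in tokens]

def hash_ids(tokens, num_buckets=1000):
    """Hashing: no table to store, but different tokens may collide in a bucket.

    md5 is used so ids are stable across runs (Python's built-in hash() is salted per process).
    """
    return [int(hashlib.md5(t.encode("utf-8")).hexdigest(), 16) % num_buckets for t in tokens]

text = "The cat sat on the mat"
tokens = text.lower().split()  # normalization + word tokenization
vocab = build_vocab(tokens)
ids = lookup_ids(tokens + ["dog"], vocab)  # "dog" is out of vocabulary
hashed = hash_ids(tokens)
print(ids)
```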
## Text encoders:
Text -> Vector (Embeddings)
Two approaches:
- Statistical
- BoW: converts documents into word frequency vectors, ignoring word order and grammar
- TF-IDF: evaluates the importance of a word (term) in a document relative to a collection of documents. It is calculated as the product of two components:
- Term Frequency (TF): This component measures how frequently a term occurs in a specific document and is calculated as the ratio of the number of times a term appears in a document (denoted as "term_count") to the total number of terms in that document (denoted as "total_terms"). The formula for TF is:
$$TF(t, d) = \frac{\text{term\_count}}{\text{total\_terms}}$$
- Inverse Document Frequency (IDF): This component measures the rarity of a term across the entire collection of documents and is calculated as the logarithm of the ratio of the total number of documents in the collection (denoted as "total_documents") to the number of documents containing the term (denoted as "document_frequency"). The formula for IDF is:
$$IDF(t) = \log\left(\frac{\text{total\_documents}}{\text{document\_frequency}}\right)$$
The final TF-IDF score for a term "t" in a document "d" is obtained by multiplying the TF and IDF components:
$$\text{TF-IDF}(t, d) = TF(t, d) \times IDF(t)$$
- ML encoders
- Embedding (look up) layer: a trainable layer that converts categorical inputs, such as words or IDs, into continuous-valued vectors, allowing the network to learn meaningful representations of these inputs during training.
- Word2Vec: based on shallow neural networks and consists of two main approaches: Continuous Bag of Words (CBOW) and Skip-gram.
- CBOW (Continuous Bag of Words):
In CBOW, the model predicts a target word based on the context words (words that surround it) within a fixed window.
It learns to generate the target word by taking the average of the embeddings of the context words.
CBOW is computationally efficient and works well for smaller datasets.
- Skip-gram:
In Skip-gram, the model predicts the context words (surrounding words) given a target word.
It learns to capture the relationships between the target word and its context words.
Skip-gram is particularly effective for capturing fine-grained semantic relationships and works well with large datasets.
Both CBOW and Skip-gram use shallow neural networks to learn word embeddings. The resulting word vectors are dense and continuous, making them suitable for various NLP tasks, such as sentiment analysis, language modeling, and text classification.
- transformer based e.g. BERT: consider context, different embeddings for same words in different context
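The TF-IDF definition given under the statistical encoders can be sketched directly from its formulas. This is a minimal sketch using whitespace tokenization and the unsmoothed IDF exactly as defined above; returning 0 for a term that appears in no document is an assumption, since the formula is undefined in that case.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: term_count / total_terms in one document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(total_documents / document_frequency)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0  # df == 0 handled by assumption

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF score of a term in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)

docs = ["the game is fun", "the event is sold out", "the fun game"]
corpus = [d.split() for d in docs]
print(tf_idf("the", corpus[0], corpus))    # appears in every document
print(tf_idf("event", corpus[1], corpus))  # appears in only one document
```

A term that occurs in every document gets an IDF of log(1) = 0, which is how TF-IDF suppresses uninformative words.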
## Video preprocessing
Frame-level:
Decode frames -> sample frames -> resize -> scale, normalize, color correction
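The frame sampling and scaling/normalization steps above can be sketched as follows. This is a minimal sketch assuming uniform sampling and frames stored as nested lists of 8-bit grayscale values; decoding and resizing would normally be done by a video or image library and are left out. The mean and std values are illustrative.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over the video."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def normalize_frame(frame, mean=0.5, std=0.5):
    """Scale 8-bit pixel values to [0, 1], then standardize with the given mean and std."""
    return [[(px / 255.0 - mean) / std for px in row] for row in frame]

indices = sample_frame_indices(300, 5)  # e.g. a 10 s clip at 30 fps
frame = [[0, 255], [255, 0]]            # tiny stand-in for a decoded frame
normalized = normalize_frame(frame)
print(indices, normalized)
```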
### Video encoders:
- Video-level
- process whole video to create an embedding
- 3D convolutions or Transformers used
- more expensive, but captures temporal understanding
- Example: ViViT (Video Vision Transformer)
- Frame-level (from sampled frames and aggregate frame embeddings)
- less expensive (training and serving speed, compute power)
- Example: ViT (Vision Transformer)
- works by dividing images into non-overlapping patches and processing them with a self-attention mechanism; it differs from the original Transformer, which was designed for sequential data such as text and relied on 1D positional encodings.
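The patch-splitting step that ViT applies before self-attention can be sketched as follows. This is a minimal sketch assuming a single-channel image stored as a list of rows; a real ViT would also project each flattened patch linearly and add positional embeddings, which is omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W image into non-overlapping, flattened patches in row-major order."""
    h, w = len(image), len(image[0])
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# 4 x 4 toy image with pixel values 0..15, split into four 2 x 2 patches
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
print(patches)
```

Each flattened patch plays the role that a token plays in a text Transformer.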
================================================
FILE: src/MLSD/mlsd-game-recom.md
================================================
# Design a game recommendation engine
## 1. Problem Formulation
User-game interaction
Some existing data examples:
* Games data
* app_id,
title,
date_release,
win,
mac,
linux,
rating,
positive_ratio,
user_reviews,
price_final,
price_original,
discount,
steam_deck,
* User historic data
* user_id,
products,
reviews,
* Recommendations data
* app_id,
helpful,
funny,
date,
is_recommended,
hours,
user_id,
review_id,
* Reviews
* Example Open Source Data: [Steam games complete dataset](https://www.kaggle.com/datasets/trolukovich/steam-games-complete-dataset) ([CF and content based github](https://github.com/AudreyGermain/Game-Recommendation-System))
* Game features include:
Url,
types
name,
desc_snippet,
recent_reviews,
all_reviews,
release_date,
developer,
publisher,
popular_tag,
### Clarifying questions
- Use case? Homepage?
- Does the user send a text query as well?
- Business objective?
- Increase user engagement (play, like, click, share), purchase?, create a better ultimate gaming experience
- Similar to previously played, or personalized for the user? Personalized for the user
- User locations? Worldwide (multiple languages)
- User’s age group:
- Do users have any favorite lists, play later, etc?
- How many games? 100 million
- How many users? 100 million DAU
- Latency requirements - 200msec?
- Data access
- Do we log and have access to any data? Can we build a dataset using user interactions ?
- Do we have textual description of items?
- can users become friends on the platform and do we wanna take that into account?
- Free or Paid?
### ML objective
- Recommend most engaging (define) games
* Max. No. of clicks (clickbait)
* Max. No. completed games/sessions/levels (bias to shorter)
* Max. total hours played ()
* Max. No. of relevant items (proxy by user implicit/explicit reactions) -> more control over signals, not the above shortcomings
* Define relevance: e.g. like is relevant, or playing half of it is, …
* ML Objective: build dataset and model to predict the relevance score b/w user and a game
* I/O: I: user_id, O: ranked list of games + relevance score
* ML category: Recommendation System
## 2. Metrics (Offline and Online)
* Offline:
* precision @k, mAP, and diversity
* Online:
* CTR, # of completed, # of purchased, total play time, total purchase, user feedback
## 3. Architectural Components (MVP Logic)
The main approaches used for personalized recommendation systems:
* Content-based filtering: suggest items similar to those user found relevant (e.g. liked)
* No need for interaction data, recommends new items to users (no item cold start)
* Capture unique interests of users
* New user cold start
* Needs domain knowledge
* CF: Using user-user (user based CF) or item-item similarities (item based CF)
* Pros
* No domain knowledge
* Capture new areas of interest
* Faster than content (no content info needed)
* Cons:
* Cold start problem (both user and item)
* No niche interest
* Hybrid
* Parallel hybrid: combine(CF results, content based)
* Sequential: [CF based] -> Content based
What do we choose?
We choose a sequential hybrid model (standard e.g. for video recommendation)
We follow the three-stage recommender system (funnel architecture) in order to meet latency requirements and be able to scale the system to billions of items.
```mermaid
graph LR
    A[Candidate generation] --> B[Ranking] --> C[Re-ranking]
```
In the first stage, we use a light model to retrieve thousands of items from millions.
In the second (ranking) stage, we focus on high precision using a powerful model. This will not impact serving speed much because it's only run on smaller subset of items.
Candidate generation in practice comes from aggregation of different candidate generation models. Here we can assume three candidate generation modules:
1. Candidate generation 1 (Relevance based)
2. Candidate generation 2 (Popularity)
3. Candidate generation 3 (Trending)
where we use CF for candidate generation 1
We use content based modeling for ranking.
## 4. Data preparation
Data Sources:
1. Users (user profile, historical interactions):
* User profile
* User_id, username, age, gender, location (city, country), lang, timezone
2. Games (structures, metadata, game content - what is it?)
- Game_id, title, date, rating, expected_length?, #reviews, language, tags, description, price, developer, publisher, level, #levels
3. User-Game interactions:
Historical interactions: Play, purchase, like, and search history, etc
- User_id, game_id, timestamp, interaction_type(purchase, play, like, impression, search), interaction_val, location
4. Context: time of the day, day of the week, device, OS
Type
- Removing duplicates
- filling missing values
- normalizing data.
### Labeling:
For features in the form of (user, game) pairs -> labeling strategy based on explicit or implicit feedback
e.g. "positive" if user liked the item explicitly or interacted (e.g. watched/played) at least for X (e.g. half of it).
negative samples: sample from background distribution -> correct via importance sampling
## 5. Feature engineering
There are several machine learning features that can be extracted from games. Here are some examples:
- Game metadata features
- Game state: e.g. the positions of players, the status of objects and obstacles, the time remaining, and the score.
- Game mechanics: The rules and interactions that govern the game.
- User engagement: e.g. the length of play sessions, frequency of play, and player retention rates.
- Social interactions: b/w players: to identify patterns of behavior, such as the formation of alliances, the sharing of resources, and the types of communication used between players.
- Player preferences: which game features are most popular among players, which can help inform game design decisions.
- Player behaviors: player movement patterns, the types of actions taken by players, and the strategies used to achieve objectives.
We select some important features as follows:
* Game metadata features:
* Game ID,
Duration,
Language,
Title,
Description,
Genre/Category,
Tags,
Publisher(popularity, reviews),
Release date,
Ratings,
Reviews,
(Game content ?)
game titles, genres, platforms, release dates, user ratings, and user reviews.
* User profile:
* User ID, Age, Gender, Language, City, Country
* User-item historical features:
* User-item interactions
* Played, liked, impressions
* purchase history (avg. price)
* User search history
* Context
### Feature representation:
* Categorical data (game_id, user_id, language, city): Use embedding layers, learned during
training
* Categorical_data(gender, age): one_hot
* Continuous variables: normalize, or bucketize and one-hot (e.g. price)
* Text:(title, desc, tags): title/description use embeddings, pre-trained BERT, fine tune on game language?, tags: CBOW
*
* Game content embeddings?
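
As a rough illustration of the "normalize, or bucketize and one-hot" option for continuous variables, here is a minimal NumPy sketch. The price values and bucket boundaries are hypothetical and chosen only for the example; real boundaries would come from the data distribution.

```python
import numpy as np

def bucketize_one_hot(values, boundaries):
    """Map each continuous value to a bucket index, then one-hot encode it."""
    values = np.asarray(values, dtype=float)
    # np.digitize returns i such that boundaries[i-1] <= x < boundaries[i]
    bucket_ids = np.digitize(values, boundaries)
    n_buckets = len(boundaries) + 1
    return np.eye(n_buckets, dtype=int)[bucket_ids]

def standardize(values):
    """Alternative to bucketizing: zero mean, unit variance."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    return (values - values.mean()) / (std if std > 0 else 1.0)

# Hypothetical game prices and bucket edges (assumptions, not from the notes)
prices = [0.0, 4.99, 19.99, 59.99]
price_boundaries = [5.0, 20.0, 50.0]
one_hot_prices = bucketize_one_hot(prices, price_boundaries)
print(one_hot_prices)
```

Bucketizing trades resolution for robustness to outliers, which is often a reasonable choice for heavy-tailed features such as price.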
## 6. Model Development and Offline Evaluation
### 6.1 Candidate Generation
For candidate generation 1 (Relevance Based), we choose CF.
For CF there are two embedding based modeling options:
1. Matrix Factorization
* Pros: Training speed (only two matrices to learn), Serving speed (static learned embeddings)
* Cons: only relies on user-item interactions (No user profile info e.g. language is used); new-user cold start problem
2. Two tower neural network:
* Pros: Accepts user features (user profile + user search history) -> better quality recommendation; handles new users
* Cons: Expensive training, serving speed
We chose two-tower network here.
#### Two-tower network
* two encoder towers (user tower + item tower)
* user tower encodes user features into user embeddings $u$
* item tower encodes item features into item embeddings $v_i$
* the similarity between $u$ and $v_i$ (e.g. dot product) is used as the relevance score (ranking as a classification problem)
#### Loss function:
Minimize cross entropy for each positive label and sampled negative examples
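
The two-tower scoring and the loss above can be sketched in NumPy as follows. This is only an illustration under simplifying assumptions: each tower is a single linear layer with L2 normalization, the weights `W_user` and `W_item` are random placeholders rather than trained parameters, and the feature dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(features, weights):
    """One-layer encoder: linear projection followed by L2 normalization."""
    emb = features @ weights
    return emb / np.linalg.norm(emb, axis=-1, keepdims=True)

def sampled_softmax_loss(user_emb, item_embs, positive_index):
    """Cross entropy over one positive item and the sampled negatives."""
    logits = item_embs @ user_emb          # dot-product relevance scores
    logits = logits - logits.max()         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[positive_index])

# Hypothetical inputs: 8 user features, 6 item features, 4-dim embeddings
user_features = rng.normal(size=8)
item_features = rng.normal(size=(5, 6))   # row 0: positive, rows 1-4: sampled negatives
W_user = rng.normal(size=(8, 4))
W_item = rng.normal(size=(6, 4))

u = tower(user_features, W_user)
v = tower(item_features, W_item)
loss = sampled_softmax_loss(u, v, positive_index=0)
print(loss)
```

In a real system both towers would be deeper networks trained jointly by backpropagating this loss.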
### 6.2 Ranking
For the ranking stage, we prioritize precision over efficiency. We choose content-based filtering, i.e. a model that relies on item features.
ML Obj options:
- max P(watch| U, C)
- max expected total watch time
- multi-objective (multi-task learning: add corresponding losses)
Model Options:
- FF NN (e.g. a network similar to one tower of the two-tower network) + logistic regression
- Deep Cross Network (DCN)
Features
* Video ID embeddings (watched video embedding avg, impression video embedding),
* Video historic
* No. of previous impressions, reviews, likes, etc
* Time features (e.g. time since last play),
* Language embedding (user, item),
* User profile
* User Historic (e.g. search history)
### 6.3 Re-Ranking
Re-ranks items by additional business criteria (filter, promote)
We can use ML models for clickbait, harmful content, etc or use heuristics
Examples:
* Age restriction filter
* Region restriction filter
* Video freshness (promote fresh content)
* Deduplication
* Fairness, bias, etc
## 7. Prediction Service
two-tower network inference: find the top-k most relevant items given a user ->
It's a classic nearest neighbor problem -> use approximate nearest neighbor (ANN) algorithms
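
For reference, the exact (brute-force) version of this top-k lookup is sketched below; ANN libraries approximate the same result at lower cost. The embeddings are toy values chosen for the example, and dot product is assumed as the similarity.

```python
import numpy as np

def top_k_items(user_emb, item_embs, k):
    """Exact top-k retrieval by dot-product score."""
    scores = item_embs @ user_emb
    k = min(k, len(scores))
    # argpartition selects the k best in linear time; only those k are sorted
    candidate_idx = np.argpartition(-scores, k - 1)[:k]
    order = np.argsort(-scores[candidate_idx])
    return candidate_idx[order], scores[candidate_idx][order]

# Toy embeddings (assumptions for illustration)
item_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]])
user_emb = np.array([1.0, 0.0])
top_ids, top_scores = top_k_items(user_emb, item_embs, k=2)
print(top_ids, top_scores)
```

This scan costs time proportional to the number of items per query, which is why ANN is preferred at catalog scale.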
## 8. Online Testing and Deployment
Standard approaches as before.
## 9. Scaling
The three-stage pipeline (candidate generation, ranking, re-ranking) scales well, as described earlier. It also meets the requirements of speed (funnel architecture), precision (ranking component), and diversity (multiple candidate generators).
### Cold start problem:
* new users: the two-tower architecture accepts new users, since we can still use user profile info even with no interactions
* new items: recommend to random users and collect some data - then fine tune the model using new data
### Training:
We need to be able to fine tune the model
### Exploration exploitation trade-off
- Multi-armed bandit (an agent repeatedly selects an option and receives a reward/cost. The goal is to maximize its cumulative reward over time, while simultaneously learning which options are most valuable.)
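
One simple bandit strategy is epsilon-greedy, sketched below. It is only one of several options (UCB and Thompson sampling are common alternatives), and the click probabilities in the simulation are invented for the example.

```python
import random

class EpsilonGreedyBandit:
    """Explore a random arm with probability epsilon, otherwise exploit the best one."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # number of pulls per arm
        self.values = [0.0] * n_arms    # running mean reward per arm
        self.rng = random.Random(seed)

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))                    # explore
        return max(range(len(self.values)), key=self.values.__getitem__)  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental update of the mean reward
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Simulated environment: hypothetical click probability for each of 3 games
true_click_prob = [0.1, 0.5, 0.8]
bandit = EpsilonGreedyBandit(n_arms=3, epsilon=0.1)
env = random.Random(42)
for _ in range(2000):
    arm = bandit.select_arm()
    reward = 1.0 if env.random() < true_click_prob[arm] else 0.0
    bandit.update(arm, reward)
print(bandit.counts, [round(v, 2) for v in bandit.values])
```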
### Other Extensions:
* [Multi-task learning](https://daiwk.github.io/assets/youtube-multitask.pdf)
* Includes a shared feature extractor that is trained jointly with multiple prediction heads, each of which is responsible for predicting a different aspect of user behavior, such as click-through rate, watch time, and view count. The model is trained using a combination of supervised and unsupervised learning techniques, including cross-entropy loss, pairwise ranking loss, and self-supervised contrastive learning.
* Positional bias (detection and correction)
* Selection bias (detection and correction)
* Add negative feedback (dislike)
* Locality preservation:
* Use sequential user behavior info (CBOW model)
* effect of seasonality
* what if we only have a query and personal (item, provider) history?
* item embeddings, provider embeddings, query embeddings
* we can build a query-aware attention mechanism that computes
### More resources
* [Content-based](https://www.kaggle.com/code/fetenbasak/content-based-recommendation-game-recommender), [NLP analysis](https://www.kaggle.com/code/greentearus/steam-reviews-nlp-analysis), [Collaborative Denoising AE](https://www.kaggle.com/code/krsnewwave/collaborative-denoising-autoencoder-steam)
* [User-based CF, item-based CF and MF](https://github.com/manandesai/game-recommendation-engine) ([github](https://github.com/manandesai/game-recommendation-engine/blob/main/recommenders.ipynb))
* [CF and content based](https://github.com/AudreyGermain/Game-Recommendation-System)
================================================
FILE: src/MLSD/mlsd-harmful-content.md
================================================
# Harmful content detection on social media
### 1. Problem Formulation
* Clarifying questions
* What types of harmful content are we aiming to detect? (e.g., hate speech, explicit images, cyberbullying)?
* What are the potential sources of harmful content? (e.g., social media, user-generated content platforms)
* Are there specific legal or ethical considerations for content moderation?
* What is the expected volume of content to be analyzed daily?
* What are supported languages?
* Are there human annotators available for labeling?
* Is there a feature for users to report harmful content? (click, text, etc).
* Is explainability important here?
* Integrity deals with:
* Harmful content (focus here)
* Harmful act/actors
* Goal: monitor posts, detect harmful content, and demote/remove
* Examples harmful content categories: violence, nudity, hate speech
* ML objective: predict if a post is harmful
* Input: Post (MM: text, image, video)
* Output: P(harmful) or P(violent), P(nude), P(hate), etc
* ML Category: Multimodal (Multi-label) classification
* Data: 500M posts / day (about 10K annotated)
* Latency: can vary for different categories
* Able to explain the reason to the users (category)
* support different languages? Yes
### 2. Metrics
- Offline
- F1 score, PR-AUC, ROC-AUC
- Online
- prevalence (percentage of harmful posts we failed to prevent, over all posts), harmful impressions, percentage of valid (reversed) appeals, proactive rate (ratio of system-detected harmful posts over system-detected + user-reported ones)
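
Two of the online metrics above reduce to simple ratios. The sketch below assumes the counts are already available from logs; the function and argument names are hypothetical.

```python
def prevalence(missed_harmful_posts, total_posts):
    """Fraction of all posts that were harmful and not prevented."""
    return missed_harmful_posts / total_posts

def proactive_rate(system_detected, user_reported):
    """Share of harmful posts caught by the system before users reported them."""
    total = system_detected + user_reported
    return system_detected / total if total else 0.0

print(prevalence(5, 1000))       # 5 missed harmful posts out of 1000 posts
print(proactive_rate(90, 10))    # 90 system-detected vs 10 user-reported
```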
### 3. Architectural Components
* Multimodal input (text, image, video, etc):
* Multimodal fusion techniques
* Early Fusion: modalities combined first, then make a single prediction
* Late Fusion: process modalities independently, fuse predictions
* cons: needs separate training data per modality; a combination of individually safe content might be harmful
* Multi-Label/Multi-Task classification
* Single binary classifier (P(harmful))
* easy, not explainable
* One binary classifier per harm category (p(violence), p(nude), p(hate))
* multiple models, trained and maintained separately, expensive
* Single multi-label classifier
* complicated task to learn
* Multi-task classifier: learn multiple tasks simultaneously
* single shared layers (learns similarities between tasks) -> transformed features
* task specific layers: classification heads
* pros: single model, shared layers prevent redundancy, train data for each task can be used for others as well (limited data)
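
A minimal forward pass for the multi-task option might look like the sketch below: one shared layer produces transformed features, and each harm category has its own classification head. The layer sizes, random weights, and task names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
TASKS = ["violence", "nudity", "hate"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_task_forward(fused_features, shared_w, head_ws):
    """Shared layer (ReLU) followed by one sigmoid head per task."""
    hidden = np.maximum(0.0, fused_features @ shared_w)   # transformed features
    return {task: sigmoid(hidden @ w) for task, w in head_ws.items()}

# Hypothetical shapes: 2 posts, 16-dim fused multimodal features, 8 hidden units
fused = rng.normal(size=(2, 16))
shared_w = rng.normal(size=(16, 8))
head_ws = {task: rng.normal(size=8) for task in TASKS}

probs = multi_task_forward(fused, shared_w, head_ws)
for task, p in probs.items():
    print(task, p)
```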
### 4. Data Collection and Preparation
* Main actors for which data is available:
* Users
* user_id, age, gender, location, contact
* Items(Posts)
* post_id, author_id, text context, images, videos, links, timestamp
* User-post interactions
* user_id, post_id, interaction_type, value, timestamp
### 5. Feature Engineering
Features:
Post Content (text, image, video) + Post Interactions (text + structured) + Author info + Context
* Posts
* Text:
* Preprocessing (normalization + tokenization)
* Encoding (Vectorization):
* Statistical (BoW, TF-IDF)
* ML based encoders (BERT)
* We chose pre-trained ML based encoders (need semantics of the text)
* We chose Multilingual Distilled (smaller, faster) version of BERT (need context), DistilmBERT
* Images/ Videos:
* Preprocessing: decoding, resize, scaling, normalization
* Feature extraction: pre-trained feature extractors
* Images:
* CLIP's visual encoder
* SimCLR
* Videos:
* VideoMoCo
* Post interactions:
* No. of likes, comments, shares, reports (scale)
* Comments (text): similar to the post text (aggregate embeddings over comments)
* Users:
* Only use post author's info
* demographics (age, gender, location)
* account features (No. of followers /following, account age)
* violation history (No of violations, No of user reports, profane words rate)
* Context:
* Time of day, device
### 6. Model Development and Offline Evaluation
* Model selection
* NN: we use NN as it's commonly used for multi-task learning
* HP tuning:
* No of hidden layers, neurons in layers, act. fcns, learning rate, etc
* grid search commonly used
* Dataset:
* Natural labeling (user reports) - speed
* Hand labeling (human contractors) - accuracy
* we use natural labeling for train set (speed) and manual for eval set (accuracy)
* loss function:
* L = L1 + L2 + L3 ... for each task
* each task is a binary classification, so e.g. CE for each task
* Challenge for MM training:
* overfitting (when one modality e.g. image dominates training)
* gradient blending and focal loss
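
The multi-task loss described above (L = L1 + L2 + L3 ..., with binary cross entropy per task) can be sketched as follows. The unweighted sum is the simplest choice and per-task weights are a common variation; the labels and probabilities here are toy values.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross entropy over a batch."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def multi_task_loss(labels, probs):
    """Unweighted sum of the per-task binary cross entropies."""
    return sum(binary_cross_entropy(labels[task], probs[task]) for task in labels)

# Toy batch of 2 posts and 3 tasks (values are assumptions for illustration)
labels = {
    "violence": np.array([1.0, 0.0]),
    "nudity": np.array([0.0, 0.0]),
    "hate": np.array([0.0, 1.0]),
}
probs = {
    "violence": np.array([0.8, 0.1]),
    "nudity": np.array([0.2, 0.1]),
    "hate": np.array([0.3, 0.6]),
}
print(multi_task_loss(labels, probs))
```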
### 7. Prediction Service
* 3 main components:
* Harmful content detection service
* Demoting service (prob of harm with low confidence)
* violation service (prob of harm with high confidence)
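
The split between the demoting and violation services can be expressed as a threshold rule on the per-category probabilities. The thresholds below are hypothetical; in practice they would be tuned per category against precision and recall targets.

```python
def route_post(category_probs, demote_threshold=0.5, violation_threshold=0.9):
    """Pick an action from the most likely harm category and its probability."""
    category, p = max(category_probs.items(), key=lambda kv: kv[1])
    if p >= violation_threshold:
        return "remove", category    # high confidence: violation service
    if p >= demote_threshold:
        return "demote", category    # low confidence: demoting service
    return "allow", None

print(route_post({"violence": 0.95, "nudity": 0.10, "hate": 0.20}))
print(route_post({"violence": 0.60, "nudity": 0.10, "hate": 0.20}))
print(route_post({"violence": 0.05, "nudity": 0.10, "hate": 0.20}))
```

Returning the category alongside the action supports the requirement of explaining the reason to the user.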
### 8. Online Testing and Deployment
### 9. Scaling, Monitoring, and Updates
### 10. Other topics
* biases by human labeling
* use temporal information (e.g. sequence of actions)
* detect fake accounts
* architecture improvement: linear transformers
================================================
FILE: src/MLSD/mlsd-image-search.md
================================================
# Image Search System (Pinterest)
### 1. Problem Formulation
* Clarifying questions
- What is the primary (business) objective of the visual search system?
- What are the specific use cases and scenarios where it will be applied?
- What are the system requirements (such as response time, accuracy, scalability, and integration with existing systems or platforms)?
- How will users interact with the system? (click, like, share, etc)? Click only
- What types of visual content will the system search through (images, videos, etc.)? Images only
- Are there any specific industries or domains where this system will be deployed (e.g., fashion, e-commerce, art, industrial inspection)?
- What is the expected scale of the system in terms of data and user interactions?
- Personalized? not required
- Can we use metadata? In general yes, here let's not.
- Can we assume the platform provides images which are safe? Yes
* Use case(s) and business goal
* Use case: allowing users to search for visually similar items, given a query image by the user
* business goal: enhance user experience, increase click through rate, conversion rates, etc (depends on use case)
* Requirements
* response time, accuracy, scalability (billions of images)
* Constraints
* budget limitations, hardware limitations, or legal and privacy constraints
* Data: sources and availability
* sources of visual data: user-generated, product catalogs, or public image databases?
* Available?
* Assumptions
* ML formulation:
* ML Objective: retrieve images that are similar to query image in terms of visual content
* ML I/O: I: a query image, and O: a ranked list of most similar images to the query image
* ML category: Ranking problem (rank a collection of items based on their relevance to a query)
### 2. Metrics
* Offline metrics
* MRR
* Recall@k
* Precision@k
* mAP
* nDCG
* Online metrics
* CTR
* Time spent on images
### 3. Architectural Components
* High level architecture
* Representation learning:
* transform input data into representations (embeddings) - similar images are close in their embedding space
* use distance between embeddings as a similarity measure between images
### 4. Data Collection and Preparation
* Data Sources
* User profile
* Images
* image file
* metadata
* User-image interactions: impressions, clicks:
* Context
* Data storage
* ML Data types
* Labelling
### 5. Feature Engineering
* Feature selection
* User profile : User_id, username, age, gender, location (city, country), lang, timezone
* Image metadata: ID, user ID, tags, upload date, ...
* User-image interactions: impressions, clicks:
* user id, Query img id, returned img id, interaction type (click, impression), time, location
* Feature representation
* Representation learning (embedding)
* Feature preprocessing
* common feature preprocessing for images:
* Resize (e.g. 224x224), Scale (0-1), normalize (mean 0, var 1), color mode (RGB, CMYK)
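
The scaling and normalization steps can be sketched with NumPy alone. Resizing and color-mode conversion are omitted because they need an image library; the random array stands in for an already decoded and resized RGB image.

```python
import numpy as np

def preprocess_image(image_uint8, mean=None, std=None):
    """Scale pixel values to [0, 1], then normalize each channel."""
    x = image_uint8.astype(np.float32) / 255.0
    # If no dataset statistics are given, fall back to per-image channel statistics
    if mean is None:
        mean = x.mean(axis=(0, 1))
    if std is None:
        std = x.std(axis=(0, 1))
    return (x - mean) / np.maximum(std, 1e-8)

# Stand-in for a decoded 224x224 RGB image (random pixels, for illustration only)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
x = preprocess_image(image)
print(x.shape, x.mean(axis=(0, 1)), x.std(axis=(0, 1)))
```

With pre-trained feature extractors, the `mean` and `std` should normally be the dataset statistics the model was trained with rather than per-image values.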
### 6. Model Development and Offline Evaluation
* Model selection
* we choose NN because of
* unstructured data (images, text) -> NN good at it
* embeddings needed
* Architecture type:
* CNN based e.g. ResNet
* Transformer based (ViT)
* Example: Image -> Convolutional layers -> FC layers -> embedding vector
* Model Training
* contrastive learning -> used for image representation learning
* train to distinguish similar and dissimilar items (images)
* Dataset
* each data point: query img, positive sample (similar to q), n - 1 neg samples (dissimilar)
* query img : randomly choose
* neg samples: randomly choose
* positive samples: human judge, interactions (e.g. click) as a proxy, artificial image generated from q (self supervision)
* human: expensive, time consuming
* interactions: noisy and sparse
* artificial: augment (e.g. rotate) and use as a positive sample (similar to SimCLR or MoCo) - data distribution differs in reality
* Loss Function: contrastive loss
* contrastive loss:
* works on pairs (Eq, Ei)
* calculate distance: b/w pairs -> softmax -> cross entropy <- Labels
* Model eval and HP tuning
* Iterations
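
The contrastive loss pipeline described above (similarity between the query and each candidate -> softmax -> cross entropy against the positive label) can be sketched as follows. Cosine similarity and the `temperature` parameter are assumptions of this sketch, not something the notes specify.

```python
import numpy as np

def contrastive_loss(query_emb, candidate_embs, positive_index, temperature=0.1):
    """Softmax cross entropy over similarities to 1 positive and n-1 negatives."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    logits = (c @ q) / temperature      # cosine similarities, sharpened
    logits = logits - logits.max()      # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return float(-log_softmax[positive_index])

# Toy embeddings: candidate 0 matches the query, the others do not
query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
print(contrastive_loss(query, candidates, positive_index=0))   # small loss
print(contrastive_loss(query, candidates, positive_index=2))   # large loss
```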
### 7. Prediction Service
* Prediction pipeline
* Embedding generation service
* image -> preprocess -> embedding gen (ML model) -> img embedding
* NN search service
* retrieve the most similar images from embedding space
* Exact: O(N.D)
* Approximate(ANN) - sublinear e.g. O(D.logN)
* Tree based ANN (e.g. R-trees, Kd-trees)
* partition space into two (or more) at each non-leaf node,
* only search the partition for query q
* Locality Sensitive Hashing LSH
* using hash functions to group points into buckets (close points into same buckets)
* Clustering based
* We use ANN using an existing library like Faiss (Facebook)
* Re-ranking service
* business level logic and policies (e.g. filter inappropriate or private items, deduplicate, etc)
* Indexing pipeline
* Indexing service: indexes images by their embeddings
* keep the table updated for new images
* increases memory usage -> use optimization (vector / product quantization)
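
To make the LSH idea concrete, here is a toy random-hyperplane index: each vector is hashed to a bit signature, and a query only compares against vectors in its own bucket. This is an illustrative sketch with a single hash table, not how Faiss is implemented; production systems use multiple tables or other index types to improve recall.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Toy LSH index: vectors on the same side of every hyperplane share a bucket."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets = defaultdict(list)
        self.vectors = {}

    def _hash(self, vec):
        # One bit per hyperplane: which side of the plane the vector lies on
        return tuple((self.planes @ vec > 0).astype(int))

    def add(self, item_id, vec):
        vec = np.asarray(vec, dtype=float)
        self.vectors[item_id] = vec
        self.buckets[self._hash(vec)].append(item_id)

    def query(self, vec, k=5):
        vec = np.asarray(vec, dtype=float)
        candidates = self.buckets.get(self._hash(vec), [])

        def cosine(item_id):
            other = self.vectors[item_id]
            return float(other @ vec / (np.linalg.norm(other) * np.linalg.norm(vec)))

        # Rank only the candidates from the matching bucket
        return sorted(candidates, key=cosine, reverse=True)[:k]

index = RandomHyperplaneLSH(dim=4, n_bits=8, seed=0)
v = np.array([1.0, 2.0, -0.5, 0.3])
index.add("img_a", v)
index.add("img_b", -v)          # opposite direction, lands in a different bucket
print(index.query(2 * v))       # scaling does not change the signature
```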
### 8. Online Testing and Deployment
* A/B Test
* Deployment and release
### 9. Scaling, Monitoring, and Updates
* Scaling (SW and ML systems)
* Monitoring
* Updates
### 10. Other points:
================================================
FILE: src/MLSD/mlsd-metrics.md
================================================
# Offline Metrics
These offline metrics are commonly used in search, information retrieval, and recommendation systems to evaluate the quality of results or recommendations:
### Recall@k:
- Definition: Recall@k evaluates the fraction of relevant items retrieved among the top k recommendations over total relevant items. It measures the system's ability to find all relevant items in a fixed-sized list.
- Use Case: In information retrieval and recommendation systems, Recall@k is crucial when it's essential to ensure that no relevant items are missed in the top k recommendations.
### Precision@k:
- Definition: Precision@k assesses the fraction of retrieved items that are relevant among the top k recommendations. It measures the system's ability to provide relevant content at the top of the list.
- Use Case: Precision@k is vital when there's a need to present users with highly relevant content in the initial recommendations. It helps in reducing user frustration caused by irrelevant suggestions.
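
A minimal sketch of Recall@k and Precision@k for a single ranked list, assuming binary relevance given as a set of relevant item IDs (the item names are made up):

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / k

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f", "g"}
print(precision_at_k(ranked, relevant, k=3))   # 2 of the top 3 are relevant
print(recall_at_k(ranked, relevant, k=3))      # 2 of the 4 relevant items were found
```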
### Mean Reciprocal Rank (MRR):
- Definition: MRR measures the effectiveness of a system in ranking the most relevant items at the top of a list. It calculates the average of reciprocal ranks of the first correct item found in each ranked list of results:
$MRR = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{rank_i}$
- Use Case: MRR is often used in search and recommendation systems to assess how quickly users find relevant content. It's particularly useful when there is only one correct answer or when the order of results matters.
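
A small sketch of MRR over several ranked lists. It assumes that a list with no relevant item contributes a reciprocal rank of 0, which is a common convention but not stated in the notes.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item in each ranked list."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break   # only the first relevant item counts
    return total / len(ranked_lists)

ranked_lists = [["a", "b", "c"], ["d", "e", "f"], ["g", "h"]]
relevant_sets = [{"b"}, {"d"}, {"z"}]
# Reciprocal ranks: 1/2, 1/1, and 0 (no relevant item in the third list)
print(mean_reciprocal_rank(ranked_lists, relevant_sets))
```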
### Mean Average Precision (mAP):
- Definition: mAP is the mean of the average precision (AP) over multiple queries or users. For each query, AP averages Precision@k over the positions k at which a relevant item appears; the mean of these AP values gives a single performance score.
- Use Case: mAP is valuable in scenarios where there are multiple users or queries, and you want to assess the overall quality of recommendations or search results across a diverse set of queries. mAP works well for binary relevance. For continuous (graded) relevance scores, we use nDCG.
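
A sketch of AP and mAP with binary relevance. It normalizes AP by the total number of relevant items; some implementations divide by the number of relevant items actually retrieved instead, so treat the convention as an assumption.

```python
def average_precision(ranked_items, relevant_items):
    """Mean of Precision@k taken at each rank k where a relevant item appears."""
    if not relevant_items:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_items)

def mean_average_precision(ranked_lists, relevant_sets):
    """Mean of AP over all queries."""
    aps = [average_precision(r, rel) for r, rel in zip(ranked_lists, relevant_sets)]
    return sum(aps) / len(aps)

ranked_lists = [["a", "b", "c", "d"], ["x", "y"]]
relevant_sets = [{"a", "c"}, {"y"}]
# AP of query 1: (1/1 + 2/3) / 2, AP of query 2: (1/2) / 1
print(mean_average_precision(ranked_lists, relevant_sets))
```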
### Discounted Cumulative Gain (DCG):
- Definition: Discounted Cumulative Gain (DCG) is a widely used evaluation metric primarily applied in the fields of information retrieval, search engines, and recommendation systems.
- DCG quantifies the quality of a ranked list of items or search results by considering two key aspects:
1. Relevance: Each item in the list is associated with a relevance score, which indicates how relevant it is to the user's query or preferences. Relevance scores are typically on a scale, with higher values indicating greater relevance.
2. Position: DCG takes into account the position of each item in the ranked list. Items appearing higher in the list are considered more important because users are more likely to interact with or click on items at the top of the list.
- DCG calculates the cumulative gain by summing the relevance scores of items in the ranked list up to a specified position.
- To reflect the decreasing importance of items further down the list, DCG applies a discount factor, often logarithmic in nature.
- Use case:
- DCG is employed to evaluate how effectively a system ranks and presents relevant items to users.
- It is instrumental in optimizing search and recommendation algorithms, ensuring that highly relevant items are positioned at the top of the list for user engagement and satisfaction.
### Normalized Discounted Cumulative Gain (nDCG):
- Definition: nDCG measures the quality of a ranked list by considering the graded relevance of items. It discounts the relevance of items as they appear further down the list and normalizes the score. It is calculated as the fraction of DCG over the Ideal DCG(IDCG) for an ideal ranking.
- Use Case: nDCG is beneficial when relevance is not binary (i.e., there are degrees of relevance), and you want to account for the diminishing importance of items lower in the ranking.
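
A sketch of DCG and nDCG using the common logarithmic discount, where the item at position i contributes rel_i / log2(i + 1). Other gain variants exist (e.g. 2^rel - 1), so the exact form here is an assumption; the relevance grades are toy values.

```python
import math

def dcg_at_k(relevances, k):
    """Sum of graded relevances, discounted by log2(position + 1)."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the items in the order the system ranked them
relevances = [3, 2, 3, 0, 1, 2]
print(dcg_at_k(relevances, k=3))
print(ndcg_at_k(relevances, k=6))
```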
# Cross Entropy and Normalized Cross Entropy
- Cross entropy (CE), which is also used as a loss function, measures how well the predicted probabilities align with the true class labels. It's defined as:
- For binary classification:
CE = - [y * log(p) + (1 - y) * log(1 - p)]
- For multi-class classification:
CE = - Σ(y_i * log(p_i))
Where:
- y is the true class label (0 or 1 for binary, one-hot encoded vector for multi-class).
- p is the predicted probability of the positive class (binary); p_i is the predicted probability of class i (multi-class).
- The negative sign ensures that the loss is minimized when the predicted probabilities match the true labels. (the lower the better)
- NCE: CE(ML model) / CE(simple baseline)
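
A sketch of binary CE and NCE. The notes only say "simple baseline"; this example assumes the baseline always predicts the average positive rate of the data, which is a common choice for ads CTR models.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross entropy; lower is better."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def normalized_cross_entropy(y, p):
    """CE of the model divided by CE of a constant base-rate predictor."""
    y = np.asarray(y, dtype=float)
    baseline_pred = np.full(y.shape, y.mean())
    return binary_cross_entropy(y, p) / binary_cross_entropy(y, baseline_pred)

y = [1, 0, 0, 0]
print(normalized_cross_entropy(y, [0.9, 0.1, 0.1, 0.1]))       # below 1: better than baseline
print(normalized_cross_entropy(y, [0.25, 0.25, 0.25, 0.25]))   # equals 1: same as baseline
```

An NCE below 1 means the model is more informative than the base-rate predictor, which makes the metric comparable across datasets with different click rates.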
### Ranking:
* Precision@k and Recall@k are not a good fit (they do not consider the ranking quality of the output)
* MRR, mAP, and nDCG good:
* MRR: focus on rank of 1st relevant item
* nDCG: relevance b/w user and item is non-binary
* mAP: relevance is binary
* Ads ranking: NCE
# Online metrics
* CTR
- Definition:
- Click-Through Rate (CTR) is a metric that quantifies user engagement with a specific item or element, such as an advertisement, a search result, a recommended product, or a link.
- It is calculated by dividing the number of clicks on the item by the total number of impressions (or views) it received.
- Formula for CTR:
CTR = (Number of Clicks / Number of Impressions) × 100%
- Impressions: Impressions refer to the total number of times the item was displayed or viewed by users. For ads, it's the number of times the ad was shown to users. For recommendations, it's the number of times an item was recommended to users.
- Use Cases:
- Online Advertising campaigns: widely used to assess how well ads are performing. A high CTR indicates that the ad is compelling and relevant to the target audience.
- Recommendation Systems: CTR is used to measure how effectively recommended items attract user clicks.
- Search Engines: CTR is used to evaluate the quality of search results. High CTR for a search result indicates that it was relevant to the user's query.
* Conversion Rate: Conversion Rate measures the percentage of users who take a specific desired action after interacting with an item, such as making a purchase, signing up for a newsletter, or filling out a form. It helps assess the effectiveness of a call to action.
* Bounce Rate: Bounce Rate calculates the percentage of users who visit a webpage or view an item but leave without taking any further action, such as navigating to another page or interacting with additional content. A high bounce rate may indicate that users are not finding the content engaging.
* Engagement Rate: Engagement Rate evaluates the level of user interaction and participation with content or ads. It can include metrics like comments, shares, likes, or time spent on a webpage. A high engagement rate suggests that users are actively involved with the content.
* Time on Page: Time on Page measures how long users spend on a webpage or interacting with a specific piece of content. It helps evaluate user engagement and the effectiveness of content in holding user attention.
* Return on Investment (ROI): ROI assesses the financial performance of an advertising or marketing campaign by comparing the costs of the campaign to the revenue generated from it. It's crucial for measuring the profitability of marketing efforts.
================================================
FILE: src/MLSD/mlsd-mm-video-search.md
================================================
# Multimodal Video Search System
### 1. Problem Formulation
* Clarifying questions
- What is the primary (business) objective of the search system?
- What are the specific use cases and scenarios where it will be applied?
- What are the system requirements (such as response time, accuracy, scalability, and integration with existing systems or platforms)?
- What is the expected scale of the system in terms of data and user interactions?
- Is there any data available? In what format?
- Can we use video metadata? Yes
- Personalized? not required
- How many languages need to be supported?
* Use case(s) and business goal
* Use case: user enters text query into search box, system shows the most relevant videos
* business goal: increase click through rate, watch time, etc.
* Requirements
* response time, accuracy, scalability (50M DAU)
* Constraints
* budget limitations, hardware limitations, or legal and privacy constraints
* Data: sources and availability
* Sources: videos (1B), text
* 10M pairs of