Repository: imteekay/machine-learning-research
Branch: master
Commit: 857b46db77ba
Files: 242
Total size: 122.7 MB

Directory structure:
machine-learning-research/

├── .gitignore
├── .prettierrc
├── FUNDING.yml
├── LICENSE
├── README.md
├── a-unified-theory-of-ai-in-biomedicine.md
├── a-unified-theory-of-ml-ai.md
├── books/
│   ├── an-introduction-to-statistical-learning-wtih-applications-in-python/
│   │   ├── README.md
│   │   ├── classification.md
│   │   ├── introduction.md
│   │   ├── linear-regression.md
│   │   ├── resampling-methods.md
│   │   ├── selection-and-regularization.md
│   │   └── statistical-learning.md
│   ├── deep-learning-for-biology/
│   │   ├── README.md
│   │   ├── learning-the-language-of-proteins.md
│   │   └── learning-the-logic-of-dna.md
│   ├── machine-learning-system-design/
│   │   ├── README.md
│   │   └── residual_analysis.ipynb
│   ├── mathematics-for-machine-learning/
│   │   └── README.md
│   ├── practical-statistics-for-data-scientists/
│   │   ├── README.md
│   │   ├── data-and-sampling-distributions.ipynb
│   │   └── practical-statistics-exploratory-data-analysis.ipynb
│   ├── reinforcement-learning/
│   │   └── README.md
│   └── understanding-deep-learning/
│       └── README.md
├── cancer/
│   └── README.md
├── careers/
│   └── README.md
├── courses/
│   ├── agentic-ai/
│   │   ├── README.md
│   │   ├── email-assistant.ipynb
│   │   ├── eval.ipynb
│   │   ├── external-evaluation.ipynb
│   │   ├── multi-agent-workflow.ipynb
│   │   ├── planning-with-code.ipynb
│   │   ├── reflection.ipynb
│   │   └── tools.ipynb
│   ├── ai-for-medicine/
│   │   └── ai-for-medical-diagnosis/
│   │       ├── README.md
│   │       ├── ai-for-medicine-densenet.ipynb
│   │       ├── ai-for-medicine-diagnosis-counting-labels-and-we.ipynb
│   │       ├── ai-for-medicine-patient-overlap-and-data-leakage.ipynb
│   │       ├── chest-x-ray-medical-diagnosis-with-deep-learning.ipynb
│   │       └── data-exploration-and-image-pre-processing.ipynb
│   ├── attention-in-transformers/
│   │   ├── README.md
│   │   ├── encoder-decoder-attention.ipynb
│   │   ├── masked-self-attention-pytorch.ipynb
│   │   ├── next-token-prediction.ipynb
│   │   ├── self-attention-pytorch.ipynb
│   │   ├── tokenization.ipynb
│   │   └── transformers-from-scratch.ipynb
│   ├── data-visualization/
│   │   ├── README.md
│   │   ├── bar-charts-and-heatmaps.ipynb
│   │   ├── choosing-plot-types-and-custom-styles.ipynb
│   │   ├── data-visualization-final-project.ipynb
│   │   ├── distributions.ipynb
│   │   ├── line-charts.ipynb
│   │   ├── scatter-plots.ipynb
│   │   └── seaborn.ipynb
│   ├── diffusion-models/
│   │   ├── README.md
│   │   ├── controlling-model-generation.ipynb
│   │   ├── ddim-vs-ddpm-faster-sampling.ipynb
│   │   ├── denoise-and-add-noise.ipynb
│   │   ├── diffusion_utilities.py
│   │   └── training-unet.ipynb
│   ├── gen-ai/
│   │   ├── README.md
│   │   ├── building-an-agent-with-langgraph.ipynb
│   │   ├── classifying-embeddings-with-keras.ipynb
│   │   ├── document-q-a-with-rag.ipynb
│   │   ├── embeddings-and-similarity-scores.ipynb
│   │   ├── evaluation-and-structured-output.ipynb
│   │   ├── fine-tuning-a-custom-model.ipynb
│   │   ├── function-calling-with-the-gemini-api.ipynb
│   │   ├── google-search-grounding.ipynb
│   │   └── prompt-engineering.ipynb
│   ├── genomic-data-science/
│   │   ├── algorithms-for-dna-sequencing/
│   │   │   ├── README.md
│   │   │   ├── fasta/
│   │   │   │   └── lambda_virus.fa
│   │   │   └── src/
│   │   │       ├── read_genome.py
│   │   │       └── reverse_complement.py
│   │   └── introduction-genomics/
│   │       ├── README.md
│   │       └── quizz-001.md
│   ├── introduction-to-deep-learning/
│   │   └── README.md
│   ├── introduction-to-machine-learning/
│   │   ├── README.md
│   │   ├── logistic-regression/
│   │   │   └── README.md
│   │   └── multilayer-perceptron/
│   │       └── README.md
│   ├── introduction-to-neural-networks-and-pytorch/
│   │   ├── 1D-tensor.ipynb
│   │   ├── 2D-tensor.ipynb
│   │   ├── README.md
│   │   ├── activation-functions-and-max-pooling-in-cnn.ipynb
│   │   ├── activation-functions-on-mnist.ipynb
│   │   ├── activation-functions.ipynb
│   │   ├── batch-normalization.ipynb
│   │   ├── best-practices-for-model-training.md
│   │   ├── cnn-for-small-image.ipynb
│   │   ├── computer-vision-with-pytorch.ipynb
│   │   ├── convolution-neural-network.ipynb
│   │   ├── convolutional-neural-network-for-anime-image-class.ipynb
│   │   ├── convolutional-neural-network-with-batch-normalization.ipynb
│   │   ├── core-neural-network-components.ipynb
│   │   ├── data-management-in-pytorch.ipynb
│   │   ├── deep-learning-with-pytorch.ipynb
│   │   ├── deep-neural-network-for-breast-cancer-classification.ipynb
│   │   ├── deep-neural-networks.ipynb
│   │   ├── deeper-neural-networks-with-nn-modulelist.ipynb
│   │   ├── derivatives.ipynb
│   │   ├── different-parameter-initialization.ipynb
│   │   ├── dropout-neural-net.ipynb
│   │   ├── dropout-regression.ipynb
│   │   ├── fashion-mnist.ipynb
│   │   ├── he-parameter-initialization.ipynb
│   │   ├── initialization-with-same-weights.ipynb
│   │   ├── linear-regression-training-one-parameter.ipynb
│   │   ├── linear-regression-training.ipynb
│   │   ├── linear_regression_model.ipynb
│   │   ├── linear_regression_with_multiple_outputs.ipynb
│   │   ├── logistic-regression-and-bad-initialization-value.ipynb
│   │   ├── logistic-regression-cross-entropy.ipynb
│   │   ├── logistic_regression.ipynb
│   │   ├── mini_batch_gradient_descent.ipynb
│   │   ├── mini_batch_gradient_descent_pytorch.ipynb
│   │   ├── mnist-softmax.ipynb
│   │   ├── mnist_vision_transform.ipynb
│   │   ├── momentum-with-polynomial-functions.ipynb
│   │   ├── multi-class-neural-networks-with-mnist.ipynb
│   │   ├── multiple-channel-convolution.ipynb
│   │   ├── multiple_linear_regression.ipynb
│   │   ├── multiple_linear_regression_training.ipynb
│   │   ├── neural-network-with-momentum.ipynb
│   │   ├── neural-network-with-multiple-neurons.ipynb
│   │   ├── neural-networks-with-multiple-hidden-layers.ipynb
│   │   ├── simple-convolutional-neural-network.ipynb
│   │   ├── small-neural-network.ipynb
│   │   ├── softmax-classifier-1d.ipynb
│   │   ├── stochastic_gradient_descent.ipynb
│   │   ├── training_and_validation_data.ipynb
│   │   ├── training_multiple_output_linear_regression.ipynb
│   │   ├── transform.ipynb
│   │   └── vision_transform.ipynb
│   ├── kaggle-intermdiate-ml/
│   │   ├── README.md
│   │   ├── categorical-variables.ipynb
│   │   ├── cross-validation.ipynb
│   │   ├── data-leakage.ipynb
│   │   ├── intro-house-pricing.ipynb
│   │   ├── missing-values.ipynb
│   │   ├── pipeline.ipynb
│   │   └── xgboost.ipynb
│   ├── kaggle-intro-to-ml/
│   │   ├── README.md
│   │   ├── explore-data.ipynb
│   │   ├── house-price-decision-tree-regressor.ipynb
│   │   ├── model-validation.ipynb
│   │   ├── random-forests.ipynb
│   │   └── underfitting-and-overfitting.ipynb
│   ├── language-modeling-from-scratch/
│   │   └── README.md
│   ├── machine-learning-for-health-predictions/
│   │   └── README.md
│   ├── machine-learning-with-python/
│   │   └── README.md
│   ├── math-for-machine-learning-with-python/
│   │   ├── 001-intro-to-equations.py
│   │   ├── 002-linear-equations.py
│   │   ├── 003-systems-of-equations.py
│   │   └── README.md
│   ├── ml-for-computational-biology/
│   │   └── README.md
│   ├── ml-in-healthcare/
│   │   └── README.md
│   ├── multimodal-machine-learning/
│   │   └── README.md
│   ├── pyspark/
│   │   └── learning_spark.ipynb
│   └── python/
│       ├── README.md
│       ├── booleans-and-conditionals.ipynb
│       ├── functions-and-getting-help.ipynb
│       ├── lists.ipynb
│       ├── loops-and-list-comprehensions.ipynb
│       ├── object-oriented-programming-in-python.ipynb
│       ├── strings-and-dictionaries.ipynb
│       ├── syntax-variables-and-numbers.ipynb
│       └── working-with-external-libraries.ipynb
├── interview-prep/
│   └── README.md
├── introduction/
│   ├── README.md
│   ├── data/
│   │   └── visualizing-data.ipynb
│   ├── matlab_plot/
│   │   ├── jupyter/
│   │   │   ├── 1.line_plot.ipynb
│   │   │   ├── 2.line_plot.ipynb
│   │   │   ├── 3.scatter_plot.ipynb
│   │   │   ├── 4.scatter_plot.ipynb
│   │   │   ├── 5.histogram.ipynb
│   │   │   ├── 6.histogram_bin.ipynb
│   │   │   ├── 7.labels.ipynb
│   │   │   ├── 8.ticks.ipynb
│   │   │   └── 9.scatter_size.ipynb
│   │   └── python/
│   │       ├── 1.line_plot.py
│   │       ├── 10.colors.py
│   │       ├── 11.grid.py
│   │       ├── 2.line_plot.py
│   │       ├── 3.scatter_plot.py
│   │       ├── 4.scatter_plot.py
│   │       ├── 5.histogram.py
│   │       ├── 6.histogram_bin.py
│   │       ├── 7.labels.py
│   │       ├── 8.ticks.py
│   │       └── 9.scatter_size.py
│   └── numpy/
│       ├── jupyter/
│       │   ├── 1.array.ipynb
│       │   ├── 2.array_calculation.ipynb
│       │   └── 3.array_calculation.ipynb
│       └── python/
│           ├── 1.array.py
│           ├── 10.matrix_calculation.py
│           ├── 11.matrix_first_column.py
│           ├── 12.statistics.py
│           ├── 13.statistics_2.py
│           ├── 2.array_calculation.py
│           ├── 3.array_calculation.py
│           ├── 4.boolean_array.py
│           ├── 5.homogeneous_array.py
│           ├── 6.array_slice.py
│           ├── 7.array_shape.py
│           ├── 8.array_shape.py
│           └── 9.matrix.py
├── learning-path.md
├── math.md
├── papers/
│   ├── alphafold/
│   │   └── README.md
│   ├── artificial-intelligence-in-healthcare-past-present-and-future/
│   │   └── README.md
│   ├── highly-accurate protein-structure-prediction-with-alphafold/
│   │   └── README.md
│   └── sybil-a-validated-deep-learning-model-to-predict-future-lung-cancer-risk-from-a-single-low-dose/
│       └── index.md
├── projects/
│   ├── biomedicine/
│   │   └── learning-the-language-of-proteins/
│   │       └── data/
│   │           ├── CAFA3_targets.tgz
│   │           └── CAFA3_training_data.tgz
│   ├── classification/
│   │   └── svc-decision-tree-classifiers.ipynb
│   ├── pytorch/
│   │   ├── pytorch-computer-vision-exercises.ipynb
│   │   ├── pytorch-computer-vision.ipynb
│   │   ├── pytorch-custom-datasets.ipynb
│   │   ├── pytorch-fundamentals.ipynb
│   │   └── pytorch-neural-network-classification.ipynb
│   ├── regression/
│   │   └── house-price-regression-model.ipynb
│   └── rnn/
│       └── recurrent-neural-network-regression.ipynb
├── research/
│   ├── README.md
│   └── ideas.md
├── rosalind/
│   ├── README.md
│   ├── cons.py
│   ├── dna.py
│   ├── fib.py
│   ├── fibd.py
│   ├── gc.py
│   ├── hamm.py
│   ├── iev.py
│   ├── iprb.py
│   ├── prob.py
│   ├── prot.py
│   ├── prtm.py
│   ├── revc.py
│   ├── rna.py
│   └── subs.py
└── skills.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
.ipynb_checkpoints
.DS_Store


================================================
FILE: .prettierrc
================================================
{
  "singleQuote": true,
  "trailingComma": "all"
}


================================================
FILE: FUNDING.yml
================================================
github: [imteekay]
custom: [https://teekay.substack.com]


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) TK

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
<samp>

# ML Research

## Table of Contents

- [ML Research](#ml-research)
  - [Table of Contents](#table-of-contents)
  - [Learning Roadmap](#learning-roadmap)
  - [Mathematics](#mathematics)
    - [General Math](#general-math)
    - [How to learn mathematics](#how-to-learn-mathematics)
    - [Linear Algebra](#linear-algebra)
    - [Statistics](#statistics)
    - [Calculus](#calculus)
    - [Optimization](#optimization)
  - [Programming](#programming)
    - [Algorithms](#algorithms)
    - [Python](#python)
    - [ML Engineering](#ml-engineering)
    - [Distributed Systems](#distributed-systems)
  - [Artificial Intelligence](#artificial-intelligence)
    - [Data Science](#data-science)
    - [Machine Learning](#machine-learning)
      - [Multimodal Machine Learning](#multimodal-machine-learning)
      - [Geometric Machine Learning](#geometric-machine-learning)
    - [Deep Learning](#deep-learning)
      - [Computer Vision](#computer-vision)
      - [Transformers](#transformers)
      - [Large Language Models (LLMs)](#large-language-models-llms)
    - [Generative AI](#generative-ai)
    - [Deep Reinforcement Learning](#deep-reinforcement-learning)
    - [Causal Inference](#causal-inference)
    - [PyTorch](#pytorch)
    - [ML/AI \& Healthcare](#mlai--healthcare)
      - [ML for Clinical Knowledge](#ml-for-clinical-knowledge)
    - [ML/AI \& Biology](#mlai--biology)
    - [Podcasts](#podcasts)
    - [Questions and Answers](#questions-and-answers)
    - [Databases](#databases)
    - [Meta / Lists](#meta--lists)
  - [Science](#science)
    - [Fundamentals](#fundamentals)
    - [Science](#science-1)
    - [Biology](#biology)
    - [Cancer](#cancer)
    - [Genetics](#genetics)
    - [Computational Biology](#computational-biology)
    - [Precision Health](#precision-health)
    - [Meta](#meta)
    - [Science: Q\&A](#science-qa)
  - [Careers](#careers)
    - [How to: Interview Prep](#how-to-interview-prep)
    - [How to: Coding prep](#how-to-coding-prep)
    - [Jobs](#jobs)
  - [Projects](#projects)
  - [Community](#community)
    - [People](#people)
    - [Research \& Laboratories](#research--laboratories)
    - [Communities](#communities)
    - [Central Resources](#central-resources)
  - [License](#license)

## Learning Roadmap

- [Machine Learning Roadmap 2022](https://www.youtube.com/watch?v=y4o9hrSCDPI&list=TLPQMzAxMjIwMjMIRqKttLLFsg&index=3&ab_channel=SmithaKolan-MachineLearningEngineer)
- [How to learn AI and ML](https://www.youtube.com/watch?v=KEB-w9DUdCw&ab_channel=PythonProgrammer)
- [Recommendations by Ilya Sutskever](https://arc.net/folder/D0472A20-9C20-4D3F-B145-D2865C0A9FEE)
- [The Ultimate Guide to Learning About Artificial Intelligence](https://adam-maj.medium.com/the-ultimate-guide-to-becoming-an-artificial-intelligence-expert-db5124dc8ae0)
- [How I Would Learn Bioinformatics From Scratch 12 Years Later: A Roadmap](https://divingintogeneticsandgenomics.com/post/bioinfo-roadmap)
- [Machine Learning and Deep Learning in Python using Scikit-Learn and PyTorch](https://github.com/ageron/handson-mlp)
- [From Logistic Regression to Transformers: Learning Path](https://romeepanchal.com/posts/deep_learning/learning_path)
- [Palindrome: ML Library](https://thepalindrome.org/p/the-palindrome-library)

## Mathematics

### General Math

- [Data Science Math Skills](https://www.coursera.org/learn/datasciencemathskills)
- [Mathematics of Big Data and Machine Learning](https://ocw.mit.edu/courses/res-ll-005-mathematics-of-big-data-and-machine-learning-january-iap-2020)
- [Mathematics for Machine Learning](https://github.com/imteekay/mathematics-for-machine-learning)
- [How to get from high school math to cutting-edge ML/AI](https://www.justinmath.com/how-to-get-from-high-school-math-to-cutting-edge-ml-ai)
- [The Complete Mathematics of Neural Networks and Deep Learning](https://www.youtube.com/watch?v=Ixl3nykKG9M)
- [Math Academy](https://www.mathacademy.com)
- [Why Should You Learn Mathematics for ML](https://romeepanchal.com/posts/general/why_learn_maths)
- [[Book] Mathematical Foundations of Machine Learning](https://skim.math.msstate.edu/LectureNotes/Machine_Learning_Lecture.pdf)

### How to learn mathematics

- [How to study math — Jo Boaler](https://www.youtube.com/watch?v=pRsutB2NhLk&list=TLPQMjkwNzIwMjND3tvET8TH0g&index=2&ab_channel=LexClips)
- [How To Self-Study Math](https://www.youtube.com/watch?v=fb_v5Bc8PSk&list=TLPQMjkwNzIwMjND3tvET8TH0g&index=3&ab_channel=TheMathSorcerer)
- [How to learn physics & math](https://www.youtube.com/watch?v=klEFaIZuiYk&list=TLPQMjkwNzIwMjND3tvET8TH0g&index=4&ab_channel=Tibees)
- [Best Way to Learn Math](https://www.youtube.com/watch?v=zvrleanEYOw&list=TLPQMjkwNzIwMjND3tvET8TH0g&index=5&ab_channel=LexClips)
- [How to learn math — Jordan Ellenberg](https://www.youtube.com/watch?v=UcpmwBOVp44&list=TLPQMjkwNzIwMjND3tvET8TH0g&index=6&ab_channel=LexClips)
- [Learn Mathematics from START to FINISH](https://www.youtube.com/watch?v=pTnEG_WGd2Q&t=17s&ab_channel=TheMathSorcerer)
- [How to Learn Math](https://math.ucr.edu/home/baez/books.html#math)
- [Why Learn Discrete Math?](https://www.youtube.com/watch?v=oJhAPsy9hBU&ab_channel=Intermation)

### Linear Algebra

- [Linear Algebra at MIT](https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures)
- [Khan Academy Linear Algebra](https://www.khanacademy.org/math/linear-algebra)
- [Linear algebra cheat sheet for deep learning](https://towardsdatascience.com/linear-algebra-cheat-sheet-for-deep-learning-cd67aba4526c)
- [[Course] Essence of linear algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab)
- [[Course] Linear Algebra Crash Course](https://www.youtube.com/watch?v=n9jZmymHX6o&ab_channel=LunarTech)
- [Linear Algebra Tutorial](https://www.youtube.com/watch?v=3Bf9oh7nkus&ab_channel=metacodeM)
- [Tiled Matrix Multiplication](https://penny-xu.github.io/blog/tiled-matrix-multiplication)
- [The Big Picture of Linear Algebra](https://www.youtube.com/watch?v=ggWYkes-n6E)
- [[Interview] Gilbert Strang: Linear Algebra](https://www.youtube.com/watch?v=lEZPfmGCEk0)
- [Mathematics for Machine Learning - Linear Algebra](https://www.youtube.com/playlist?list=PLiiljHvN6z1_o1ztXTKWPrShrMrBLo5P3)
- [Linear Algebra for Data Science](https://drive.google.com/file/d/1nJVwdQV9zp-Q9VQenZF0-HOOG6L2lEOD/view)
- [Introduction to Linear Algebra for Applied Machine Learning with Python](https://pabloinsente.github.io/intro-linear-algebra)
- [[Book] Linear Algebra for Data Science](https://drive.google.com/file/d/1pLrhXT_wBeQNJegbkVbC2fll9z_ykXzD/view?pli=1)
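
As a quick companion to the list above, here is a minimal NumPy sketch of the operations most of these resources build on: the matrix-vector product and the eigendecomposition of a symmetric matrix. The matrix and vector are arbitrary toy values, not taken from any linked course.

```python
import numpy as np

# A symmetric 2x2 matrix and a vector; arbitrary example values.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, -1.0])

# Matrix-vector product: each output entry is a row of A dotted with x.
y = A @ x

# Eigendecomposition of a symmetric matrix: A = Q diag(w) Q^T.
w, Q = np.linalg.eigh(A)

# Reconstructing A from its eigenpairs verifies the factorization.
assert np.allclose(A, Q @ np.diag(w) @ Q.T)
print(y, w)
```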

### Statistics

- [Khan Academy Probability](https://www.khanacademy.org/math/probability)
- [Khan Academy Statistics and probability](https://www.khanacademy.org/math/statistics-probability)
- [Inferential Statistics](https://br.udacity.com/course/intro-to-inferential-statistics--ud201)
- [Introduction to Statistics](https://www.coursera.org/learn/stanford-statistics)
- [The better way to do statistics](https://www.youtube.com/watch?v=3jP4H0kjtng)
- [A complete guide to box plots](https://www.atlassian.com/data/charts/box-plot-complete-guide)
- [Probability and Statistics](https://www.youtube.com/playlist?list=PLMrJAkhIeNNR3sNYvfgiKgcStwuPSts9V)
- [Probability for Computer Scientists](https://chrispiech.github.io/probabilityForComputerScientists/en)
- [Introduction to Statistics and Data Analysis](https://www.youtube.com/playlist?list=PLMrJAkhIeNNT14qn1c5qdL29A1UaHamjx)
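
A small simulation sketch of the sampling-distribution idea that runs through these statistics resources: draw many samples from a population and watch the sample means concentrate. The exponential "population" and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed "population"; the exponential distribution is an arbitrary choice.
population = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean: draw many samples, record each sample mean.
sample_means = np.array([
    rng.choice(population, size=50, replace=False).mean()
    for _ in range(2_000)
])

# By the central limit theorem, the sample means are approximately normal,
# centered on the population mean with spread sigma / sqrt(n).
print(population.mean(), sample_means.mean(), sample_means.std())
```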

### Calculus

- [[Course] Essence of calculus](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr)
- [Khan Academy Multivariable Calculus](https://www.khanacademy.org/math/multivariable-calculus)
- [Khan Academy Differential Calculus](https://www.khanacademy.org/math/differential-calculus)
- [Calculus Applied](https://www.edx.org/learn/calculus/harvard-university-calculus-applied)
- [Mathematics for Machine Learning - Multivariate Calculus](https://www.youtube.com/playlist?list=PLiiljHvN6z193BBzS0Ln8NnqQmzimTW23)
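
The place calculus shows up first in ML practice is checking derivatives numerically. A minimal sketch, assuming an arbitrary example function; the central-difference formula is the standard one these courses derive.

```python
import numpy as np

def f(x):
    return x**3 - 2 * x          # arbitrary example function

def analytic_grad(x):
    return 3 * x**2 - 2          # f'(x) by the power rule

def numeric_grad(f, x, h=1e-5):
    # Central difference: (f(x + h) - f(x - h)) / (2h) approximates f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

xs = np.linspace(-2.0, 2.0, 9)
assert np.allclose(analytic_grad(xs), numeric_grad(f, xs), atol=1e-6)
```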

### Optimization

- [Convex Optimization](https://web.stanford.edu/class/ee364a/videos.html)
- [Understanding Gradient Descent](https://degatchi.com/articles/gradient-descent)
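
The gradient-descent idea behind both links above fits in a few lines. A minimal sketch on a one-dimensional quadratic; the starting point and learning rate are arbitrary.

```python
# Minimize f(x) = (x - 3)^2 with plain gradient descent.
def grad(x):
    return 2 * (x - 3)   # f'(x)

x = 0.0                  # arbitrary starting point
lr = 0.1                 # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)    # step against the gradient

print(x)                 # converges toward the minimizer x = 3
```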

## Programming

### Algorithms

- [A&DS — Pavel Mavrin](https://www.youtube.com/playlist?list=PLrS21S1jm43igE57Ye_edwds_iL7ZOAG4)

### Python

- [Practical Python Programming](https://dabeaz-course.github.io/practical-python/Notes/Contents.html)
- [100 Numpy Exercises](https://www.kaggle.com/code/iamteekay/100-numpy-exercises)
- [From Python to Numpy](https://www.labri.fr/perso/nrougier/from-python-to-numpy)
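
The core lesson of the NumPy resources above is replacing Python-level loops with whole-array operations. A minimal sketch with arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1000, 3))
b = rng.random((1000, 3))

# Pure-Python loop: one iteration per row (slow for large arrays).
loop_result = np.array([((ai - bi) ** 2).sum() for ai, bi in zip(a, b)])

# Vectorized: the same arithmetic expressed on whole arrays at once.
vec_result = ((a - b) ** 2).sum(axis=1)

assert np.allclose(loop_result, vec_result)
```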

### ML Engineering

- [Introduction to Data Intensive Engineering](https://www.youtube.com/playlist?list=PLMrJAkhIeNNTv-u25xlhIyiAIV_d7vf-8)
- [Machine Learning in Production: Why You Should Care About Data and Concept Drift](https://towardsdatascience.com/machine-learning-in-production-why-you-should-care-about-data-and-concept-drift-d96d0bc907fb)
- [Monitoring Machine Learning Models in Production](https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-learning-models)
- [From Model-centric to Data-centric AI](https://www.youtube.com/watch?v=06-AZXmwHjo)
- [Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html)
- [ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It](https://neptune.ai/blog/ml-experiment-tracking)
- [ML Model Baselines](https://blog.ml.cmu.edu/2020/08/31/3-baselines)
- [[Paper] Large Scale Distributed Deep Networks](https://www.cs.toronto.edu/~ranzato/publications/DistBeliefNIPS2012_withAppendix.pdf)
- [[Book] Machine Learning Systems](https://www.mlsysbook.ai) ([PDF](https://www.mlsysbook.ai/assets/downloads/Machine-Learning-Systems.pdf))
- [[Article] A Meticulous Guide to Advances in Deep Learning Efficiency over the Years](https://alexzhang13.github.io/blog/2024/efficient-dl)

### Distributed Systems

- [What even is distributed systems](https://notes.eatonphil.com/2025-08-09-what-even-is-distributed-systems.html)
- [6.5840: Distributed Systems course](https://pdos.csail.mit.edu/6.824/index.html)
- [Blogs](https://eatonphil.com/blogs.html)
- [Distributed Systems Course | Distributed Computing @ University Cambridge](https://www.youtube.com/watch?v=sGzQT_ZrsFI)

## Artificial Intelligence

### Data Science

- [Fundamental Python Data Science Libraries: Numpy](https://hackernoon.com/fundamental-python-data-science-libraries-a-cheatsheet-part-1-4-58884e95c2bd)
- [Fundamental Python Data Science Libraries: Pandas](https://hackernoon.com/fundamental-python-data-science-libraries-a-cheatsheet-part-2-4-fcf5fab9cdf1)
- [Fundamental Python Data Science Libraries: Matplotlib](https://hackernoon.com/fundamental-python-data-science-libraries-a-cheatsheet-part-3-4-6c2aecc697a4)
- [Fundamental Python Data Science Libraries: Scikit-Learn](https://hackernoon.com/fundamental-python-data-science-libraries-a-cheatsheet-part-4-4-fd8895ef85d5)
- [Data Engineering Roadmap](https://github.com/hasbrain/data-engineer-roadmap)
- [How to build a data science project from scratch](https://medium.freecodecamp.org/how-to-build-a-data-science-project-from-scratch-dc4f096a62a1)
- [Introduction to Data Analysis for SUS Research](https://cursosqualificacao.campusvirtual.fiocruz.br/hotsite/analise-dados-sus)

### Machine Learning

- [Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)
- [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)
- [Introduction to Machine Learning Course](https://www.udacity.com/course/intro-to-machine-learning--ud120)
- [Learning Math for Machine Learning](https://blog.ycombinator.com/learning-math-for-machine-learning)
- [Machine Learning at CMU](http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml)
- [Bishop Keynotes on ML](https://www.microsoft.com/en-us/research/people/cmbishop/#!videos)
- [Machine Learning Guides by Google](https://developers.google.com/machine-learning/guides)
- [Machine Learning Crash Course by Google](https://developers.google.com/machine-learning/crash-course/ml-intro)
- [Facebook Field Guide to Machine Learning](https://research.fb.com/the-facebook-field-guide-to-machine-learning-video-series)
- [A Short Guide to Data Science / Machine Learning](http://lgmoneda.github.io/2017/06/12/data-science-guide.html)
- [Machine Learning for All](https://www.coursera.org/learn/uol-machine-learning-for-all)
- [Reinforcement Learning](https://www.udacity.com/course/reinforcement-learning--ud600)
- [Machine Learning Crash Course with TensorFlow APIs](https://developers.google.com/machine-learning/crash-course)
- [Backpropagation from the ground up](https://www.youtube.com/watch?v=SmZmBKc7Lrs)
- [Understanding Machine Learning: From Theory to Algorithms](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf)
- [CS229 Lecture Notes](https://cs229.stanford.edu/lectures-spring2022/main_notes.pdf)
- [A theory-heavy intro to machine learning](https://0xpemulis.net/learningtheory.html)
- [ML Code Challenges](https://www.deep-ml.com)
- [Machine learning in Python with scikit-learn](https://lms.fun-mooc.fr/courses/course-v1:inria+41026+session03/6c7bd3e1d86545c4b723b844ae2702f9)
- [Introduction to Algorithms and Machine Learning](https://www.justinmath.com/files/introduction-to-algorithms-and-machine-learning.pdf)
- [How to actually learn AI/ML: Reading Research Papers](https://www.youtube.com/watch?v=x6slke5niqw)
- [Machine Learning Fundamentals: Bias and Variance](https://www.youtube.com/watch?v=EuBBz3bI-aA)
- [Machine Learning Fundamentals: Cross Validation](https://www.youtube.com/watch?v=fSytzGwwBVw)
- [Machine Learning Specialization by Andrew Ng](https://www.youtube.com/playlist?list=PLkDaE6sCZn6FNC6YRfRQc_FbeQrF8BwGI)
- [AI Fundamentals](https://www.udacity.com/course/ai-fundamentals--ud099)
- [Artificial Intelligence](https://www.udacity.com/course/artificial-intelligence--ud954)
- [[Paper] Hyper-Parameter Optimization: A Review of Algorithms and Applications](https://arxiv.org/pdf/2003.05689)
- [[Paper] How to avoid machine learning pitfalls: a guide for academic researchers](https://arxiv.org/pdf/2108.02497)
- [[Paper] Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning](https://arxiv.org/pdf/1811.12808)
- [Decision Tree](https://www.youtube.com/watch?v=W7MfsE5av0c)
- [The ML Roadmap](https://github.com/loganthorneloe/ml-road-map)
- [Why is machine learning 'hard'?](https://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html)
- [A Gentle Introduction to Machine Learning Theory](https://data-processing.club/theory)
- [Stanford Intro to Machine Learning](https://www.youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU)
- [ML Resources](https://www.trybackprop.com/blog/top_ml_learning_resources)
- [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book)
- [Intro to Data Science](https://www.youtube.com/playlist?list=PLMrJAkhIeNNQV7wi9r7Kut8liLFMWQOXn)
- [Support Vector Machines Part 1 (of 3): Main Ideas](https://www.youtube.com/watch?v=efR1C6CvhmE)
- [Support Vector Machines Part 2: The Polynomial Kernel](https://www.youtube.com/watch?v=Toet3EiSFcM)
- [Support Vector Machines Part 3: The Radial (RBF) Kernel](https://www.youtube.com/watch?v=Qc5IyLW_hns)
- [MIT — Learning: Support Vector Machines](https://www.youtube.com/watch?v=_PwhiWxHK8o)
- [Support Vector Machines | Stanford CS229](https://www.youtube.com/watch?v=lDwow4aOrtg)
- [Preprocessing for Machine Learning in Python](https://www.datacamp.com/courses/preprocessing-for-machine-learning-in-python)
- [Computer Science for Artificial Intelligence](https://www.edx.org/professional-certificate/harvardx-computer-science-for-artifical-intelligence)
- [Machine Learning courses](https://www.edx.org/learn/machine-learning)
- [Machine Learning Stanford Course](https://www.coursera.org/learn/machine-learning)
- [Machine Learning with Python](https://www.coursera.org/learn/machine-learning-with-python)
- [Math for Machine Learning with Python](https://www.edx.org/learn/math/edx-math-for-machine-learning-with-python)
- [Machine Learning with Python: from Linear Models to Deep Learning](https://www.edx.org/learn/machine-learning/massachusetts-institute-of-technology-machine-learning-with-python-from-linear-models-to-deep-learning)
- [Probabilistic Machine Learning](https://probml.github.io/pml-book)
- [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
- [Pattern Recognition and Machine Learning](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)
- [Python Machine Learning](https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939)
- [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
- [Think Stats: Exploratory Data Analysis in Python](http://greenteapress.com/thinkstats2/html/index.html)
- [The Orange Book of Machine Learning](https://carl-mcbride-ellis.github.io/TOBoML/TOBoML.pdf)
- [[Book] Hyperparameter Optimization in Machine Learning](https://arxiv.org/pdf/2410.22854)
- [[Playbook] Tuning Deep Learning Models](https://github.com/google-research/tuning_playbook)
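
Several entries above (cross-validation, preprocessing, pitfalls) come down to evaluating a model without leaking information between folds. A minimal scikit-learn sketch, using a standard bundled dataset; the logistic-regression model is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline keeps each fold's preprocessing leak-free.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: fit on 4 folds, score on the held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```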

#### Multimodal Machine Learning

- [[List] Awesome Multimodal Machine Learning](https://github.com/pliang279/awesome-multimodal-ml)
- [[Course] CMU - Multimodal Machine Learning](https://www.youtube.com/playlist?list=PL-Fhd_vrvisMYs8A5j7sj8YW1wHhoJSmW)
- [[Paper] Overview of Multimodal Machine Learning](https://dl.acm.org/doi/abs/10.1145/3701031)
- [[Paper] Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions](https://arxiv.org/pdf/2209.03430)

#### Geometric Machine Learning

- [A Gentle Introduction to Graph Neural Networks](https://distill.pub/2021/gnn-intro)
- [[Book] Graph Neural Networks](https://graph-neural-networks.github.io)
- [[Paper] Everything is Connected: Graph Neural Networks](https://arxiv.org/pdf/2301.08210)
- [[Article] Graph Convolutional Networks](https://tkipf.github.io/graph-convolutional-networks)
- [[Video] Theoretical Foundations of Graph Neural Networks](https://www.youtube.com/watch?v=uF53xsT7mjc)
- [[Book] Geometric Deep Learning](https://geometricdeeplearning.com)
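
The GNN resources above mostly build on one primitive: a graph-convolution layer that averages neighbor features. A minimal NumPy sketch of a single layer with the symmetric normalization popularized by Kipf & Welling; the toy graph and feature sizes are arbitrary.

```python
import numpy as np

def gcn_layer(A, H, W):
    # One graph-convolution layer: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^-1/2
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0)

# Toy 4-node path graph, 2 input features, 3 output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.rand(4, 2)
W = np.random.rand(2, 3)
print(gcn_layer(A, H, W).shape)  # (4, 3)
```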

### Deep Learning

- [Intro to Deep Learning](https://www.kaggle.com/learn/intro-to-deep-learning)
- [MIT Introduction to Deep Learning](https://introtodeeplearning.com)
- [Language Modeling from Scratch](https://www.youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_)
- [Deep Learning Book](http://www.deeplearningbook.org)
- [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python)
- [Dive into Deep Learning](https://d2l.ai/index.html)
- [Intro to Deep Learning](http://introtodeeplearning.com/2020/index.html)
- [Intro to Deep Learning with PyTorch](https://www.udacity.com/course/deep-learning-pytorch--ud188)
- [Deep Learning Research and the Future of AI](https://www.youtube.com/watch?v=5BrNt38OraE&ab_channel=MicrosoftResearch)
- [[Paper] Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215)
- [Demystifying deep reinforcement learning](https://nail.cs.ut.ee/index.php/2015/12/19/globular-star-cluster-radio-scope-great-turbulent-clouds)
- [A Review of: Human-Level Control through deep Reinforcement Learning](https://hci.iwr.uni-heidelberg.de/system/files/private/downloads/213797145/report_carsten_lueth_human_level_control.pdf)
- [[Paper] Mastering the game of Go without human knowledge](https://www.nature.com/articles/nature24270.epdf?author_access_token=VJXbVjaSHxFoctQQ4p2k4tRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-2rmo8-KG06gqVobU5NSCFeHILHcVFUeMsbvwS-lxjqQGg98faovwjxeTUgZAUMnRQ)
- [AlphaGo Zero: Starting from scratch](https://deepmind.google/discover/blog/alphago-zero-starting-from-scratch)
- [Neural Networks](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
- [The Principles of Deep Learning Theory](https://arxiv.org/pdf/2106.10165)
- [Why do tree-based models still outperform deep learning on tabular data?](https://arxiv.org/pdf/2207.08815)
- [MIT 6.S191: Introduction to Deep Learning](https://www.youtube.com/playlist?list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI)
- [Deep Learning NYU](https://www.youtube.com/playlist?list=PLLHTzKZzVU9e6xUfG10TkTWApKSZCzuBI)
- [Building Neural Networks from Scratch](https://www.youtube.com/playlist?list=PLPTV0NXA_ZSj6tNyn_UadmUeU3Q3oR-hu)
- [The Matrix Calculus You Need For Deep Learning](https://explained.ai/matrix-calculus)
  - [Paper](https://arxiv.org/pdf/1802.01528)
- [Convolution is Matrix Multiplication](https://penny-xu.github.io/blog/convolution-is-matrixmultiplication)
- [Neural Networks and Deep Learning — Course 1](https://www.youtube.com/watch?v=CS4cs9xVecg&list=PLkDaE6sCZn6Ec-XTbcX1uRg2_u4xOEky0)
- [Improving Deep Neural Networks — Course 2](https://www.youtube.com/playlist?list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc)
- [Structuring Machine Learning Projects — Course 3](https://www.youtube.com/playlist?list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b)
- [Convolutional Neural Networks — Course 4](https://www.youtube.com/playlist?list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF)
- [Sequence Models — Course 5](https://www.youtube.com/playlist?list=PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6)
- [Understanding Deep Learning Book Club](https://www.youtube.com/playlist?list=PLmp4AHm0u1g0AdLp-LPo5lCCf-3ZW_rNq)
- [TABPFN: A transformer that solves small tabular classification problems in a second](https://arxiv.org/pdf/2207.01848)
- [The Mathematics of Neural Networks](https://www.youtube.com/watch?v=qZ9xuPcoWSA)
- [Introduction to Neural Networks and Deep Learning](https://www.youtube.com/watch?v=Z2SGE3_2Grg)
- [How do neural networks learn features from data?](https://www.youtube.com/watch?v=y0KxsLJvG14)
- [Neural Networks: Zero to Hero](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
- [Building A Neural Network from Scratch with Mathematics and Python](https://www.iamtk.co/building-a-neural-network-from-scratch-with-mathematics-and-python)
- [Neural Network from Scratch](https://github.com/imteekay/neural-network-from-scratch)
- [Feedforward Neural Networks in Depth, Part 1: Forward and Backward Propagations](https://jonaslalin.com/2021/12/10/feedforward-neural-networks-part-1)
- [Feedforward Neural Networks in Depth, Part 2: Activation Functions](https://jonaslalin.com/2021/12/21/feedforward-neural-networks-part-2)
- [Feedforward Neural Networks in Depth, Part 3: Cost Functions](https://jonaslalin.com/2021/12/22/feedforward-neural-networks-part-3)
- [[Paper] Three Decades of Activations: A comprehensive survey of 400 activation functions for neural networks](https://arxiv.org/pdf/2402.09092)
- [Famous Deep Learning Papers](https://papers.baulab.info)
- [[Paper] Decentralized Diffusion Models](https://arxiv.org/pdf/2501.05450)
- [Deep Learning for Data Science (DL4DS)](https://dl4ds.github.io/sp2025/lectures)
- [Recurrent Neural Networks cheatsheet](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks)
- [Deep Learning with PyTorch](https://www.coursera.org/learn/advanced-deep-learning-with-pytorch)
- [A deep-dive on the entire history of deep-learning](https://github.com/adam-maj/deep-learning)
- [History of Deep Learning](https://github.com/saurabhaloneai/History-of-Deep-Learning)
- [Intuitions on Language Models & Shaping the Future of AI from the History of Transformer](https://www.youtube.com/watch?v=3gb-ZkVRemQ&ab_channel=StanfordOnline)
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness)
- [The Recurrent Neural Network - Theory and Implementation of the Elman Network and LSTM](https://pabloinsente.github.io/the-recurrent-net)
- [[Article] A Brief History of Large Language Models](https://medium.com/@bradneysmith/98a1320e7650)
- [[Article] Tokenization — A Complete Guide](https://medium.com/@bradneysmith/tokenization-llms-from-scratch-1-cedc9f72de4e)
- [[Article] Word Embeddings with word2vec from Scratch in Python](https://medium.com/@bradneysmith/word-embeddings-with-word2vec-from-scratch-in-python-eb9326c6ab7c)
- [[Article] Self-Attention Explained with Code](https://medium.com/data-science/contextual-transformer-embeddings-using-self-attention-explained-with-diagrams-and-python-code-d7a9f0f4d94e)
- [[Article] A Complete Guide to BERT with Code](https://towardsdatascience.com/a-complete-guide-to-bert-with-code-9f87602e4a11)
- [[Video] The physics behind diffusion models](https://www.youtube.com/watch?v=R0uMcXsfo2o)
- [[Paper] An end-to-end attention-based approach for learning on graphs](https://www.nature.com/articles/s41467-025-60252-z)
- [[Book] Learning Deep Representations of Data Distributions](https://ma-lab-berkeley.github.io/deep-representation-learning-book)
- [[Article] Large Language Model Optimization: Memory, Compute, and Inference Techniques](https://gaurigupta19.github.io/llms/distributed%20ml/optimization/2025/10/02/efficient-ml.html)
- [[Paper Video] Continuous Thought Machine Deep Dive](https://www.youtube.com/watch?v=5X9cjGLggv0)
- [[Course] Deep Learning](https://www.youtube.com/playlist?list=PLoROMvodv4rNRRGdS0rBbXOUGA0wjdh1X)
- [PaperCode](https://papercode.in/papers)
- [Language model alignment-focused deep learning curriculum](https://github.com/jacobhilton/deep_learning_curriculum)
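
Many of the "from scratch" resources above walk through forward and backward propagation. A minimal NumPy sketch on XOR; the architecture (8 tanh hidden units, sigmoid output, cross-entropy loss) and hyperparameters are arbitrary choices, not any particular course's recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(10_000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Backward pass for binary cross-entropy: dL/dlogits = p - y.
    dlogits = (p - y) / len(X)
    dW2, db2 = h.T @ dlogits, dlogits.sum(axis=0)
    dh = dlogits @ W2.T * (1 - h**2)   # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dh, dh.sum(axis=0)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(p.round(3).ravel())  # should approach [0, 1, 1, 0]
```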

#### Computer Vision

- [[Course] Deep Learning for Computer Vision](https://www.youtube.com/playlist?list=PLoROMvodv4rOmsNzYBMe0gJY2XS8AQg16)

#### Transformers

- [[Course] Large Language Models](https://www.youtube.com/playlist?list=PLs8w1Cdi-zva4fwKkl9EK13siFvL9Wewf)
- [[Paper] A Generalization of Transformer Networks to Graphs](https://arxiv.org/pdf/2012.09699)
  - [Video Lecture](https://www.youtube.com/watch?v=h-_HNeBmaaU&t=237s)
- [Biomedical Transformers](https://www.youtube.com/watch?v=nz7_wg5iOlA&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=19)
- [Glossary of Deep Learning: Word Embedding](https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca)
- [How I Learned to Stop Worrying and Love the Transformer](https://www.youtube.com/watch?v=1GbDTTK3aR4&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=22)
- [How Transformer LLMs Work](https://www.deeplearning.ai/short-courses/how-transformer-llms-work)
- [Introduction to Transformers](https://www.youtube.com/watch?v=XfpMkf4rD6E)
- [Overview of Transformers](https://www.youtube.com/watch?v=JKbtWimlzAE&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=36)
- [Recurrent Neural Networks, Transformers, and Attention](https://www.youtube.com/watch?v=GvezxUdLrEk)
- [Stanford CS25 - Transformers United](https://www.youtube.com/playlist?list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM)
- [Stanford CS25: V4 I Overview of Transformers](https://www.youtube.com/watch?v=fKMB5UlVY1E&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=27)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer)
- [The math behind Attention: Keys, Queries, and Values matrices](https://www.youtube.com/watch?v=UPtG_38Oq8o)
- [Transformer Neural Networks Derived from Scratch](https://www.youtube.com/watch?v=kWLed8o5M2Y)
- [Transformers are Graph Neural Networks](https://arxiv.org/pdf/2506.22084)
- [Transformers from Scratch](https://www.kaggle.com/code/auxeno/transformers-from-scratch-dl)
- [Visualizing A Neural Machine Translation Model](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention)
- [Visualizing transformers and attention](https://www.youtube.com/watch?v=KJtZARuO3JY)
- [[Article] The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer)
- [[Article] Attention Mechanism: From Math to GPU](https://isztld.com/posts/attention-mechanism.html)
- [[Article] Visualizing Parallelism in Transformer](https://ailzhang.github.io/posts/distributed-compute-in-transformer)
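
The computation at the center of all of these resources is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, from "Attention Is All You Need". A minimal NumPy sketch with arbitrary toy shapes, omitting masking and multiple heads:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_q, seq_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values

# Toy shapes: 4 query tokens, 6 key/value tokens, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```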

#### Large Language Models (LLMs)

- [[Course] Large Language Models (LLMs)](https://www.youtube.com/playlist?list=PLoROMvodv4rObv1FMizXqumgVVdzX4_05)
- [[Book] How to Scale Your Model](https://jax-ml.github.io/scaling-book)
- [[Course] CME 295 - Transformers & Large Language Models](https://cme295.stanford.edu)

### Generative AI

- [The Principles of Diffusion Models](https://arxiv.org/pdf/2510.21890)

### Deep Reinforcement Learning

- [[Course] Deep Reinforcement Learning](https://www.youtube.com/playlist?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX)
- [Spinning Up in Deep RL](https://spinningup.openai.com/en/latest)
- [Reinforcement Learning: An Overview](https://arxiv.org/pdf/2412.05265)
- [A vision researcher’s guide to some RL stuff: PPO & GRPO](https://yugeten.github.io/posts/2025/01/ppogrpo)
- [[Course] DeepMind Reinforcement Learning](https://www.youtube.com/playlist?list=PLqYmG7hTraZBKeNJ-JE_eyJHZ7XgBoAyb)
- [An Ultra Opinionated Guide to Reinforcement Learning](https://x.com/jsuarez5341/status/1943692998975402064)
- [Reinforcement Learning Quickstart Guide](https://x.com/jsuarez5341/status/1854855861295849793)
- [A Reinforcement Learning Guide](https://naklecha.notion.site/a-reinforcement-learning-guide)
- [Reinforcement Learning](https://www.youtube.com/playlist?list=PLMrJAkhIeNNQe1JXNvaFvURxGY4gE9k74)
- [Reinforcement Learning: An Introduction](https://www.amazon.com/Reinforcement-Learning-Introduction-Adaptive-Computation/dp/0262039249)
- [Understanding reinforcement learning for model training from scratch](https://medium.com/data-science-collective/understanding-reinforcement-learning-for-model-training-from-scratch-8bffe8d87a07)
- [[Course] Reinforcement Learning of Large Language Models](https://www.youtube.com/playlist?list=PLir0BWtR5vRp5dqaouyMU-oTSzaU5LK9r)
- [[Book] Reinforcement Learning from Human Feedback](https://rlhfbook.com)
- [[Course] Mathematical Foundations of Reinforcement Learning](https://www.youtube.com/playlist?list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8)
- [[Course] Stanford CS234 Reinforcement Learning](https://www.youtube.com/playlist?list=PLoROMvodv4rN4wG6Nk6sNpTEbuOSosZdX)
- [[Course] Stanford CS224R Deep Reinforcement Learning](https://www.youtube.com/playlist?list=PLoROMvodv4rPwxE0ONYRa_itZFdaKCylL)
- [[Article] How to Explore to Scale RL Training of LLMs on Hard Problems?](https://blog.ml.cmu.edu/2025/11/26/how-to-explore-to-scale-rl-training-of-llms-on-hard-problems)
- [[Book] Deep Reinforcement Learning in Action](https://www.oreilly.com/library/view/deep-reinforcement-learning/9781617295430)
- [[Article] Reinforcement Learning (RL) Guide by Unsloth](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide)
- [[Book] Reinforcement Learning: Theory and Python Implementation](https://link.springer.com/book/10.1007/978-981-19-4933-3)
- [Key Papers](https://spinningup.openai.com/en/latest/spinningup/keypapers.html)
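
Before the deep variants these resources cover, the core idea is the tabular Q-learning update Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)). A minimal sketch on a made-up 1-D corridor environment; the environment and hyperparameters are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Corridor of 5 states; start at 0, reward 1 on reaching terminal state 4.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy, with a random action when Q-values are tied.
        if rng.random() < eps or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # non-terminal states learn to move right
```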

### Causal Inference

- [[Article] Causal inference resources](https://yanirseroussi.com/causal-inference-resources)
- [[Book] The Book of Why](https://www.goodreads.com/book/show/36204378-the-book-of-why)
- [[Book] Causal Inference in Statistics: A Primer](https://www.goodreads.com/book/show/27164550-causal-inference-in-statistics)
- [[Book] Causal Inference for the Brave and True](https://www.goodreads.com/book/show/58898489-causal-inference-for-the-brave-and-true)
- [[Book] Causal Inference in Python](https://www.goodreads.com/book/show/140399013-causal-inference-in-python)
- [[Book] Causal Inference and Discovery in Python](https://www.goodreads.com/book/show/150345394-causal-inference-and-discovery-in-python)
- [[Book] Causal Artificial Intelligence](https://causalai-book.net)

### PyTorch

- [PyTorch internals](https://blog.ezyang.com/2019/05/pytorch-internals)
- [Learn PyTorch for deep learning in a day](https://www.youtube.com/watch?v=Z_ikDlimN6A)
- [PyTorch in One Hour: From Tensors to Training Neural Networks on Multiple GPUs](https://sebastianraschka.com/teaching/pytorch-1h)
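
The training loop these resources teach (zero gradients, forward, backward, step) fits in a short sketch. A minimal example fitting a synthetic linear-regression task; the data and hyperparameters are arbitrary.

```python
import torch
from torch import nn

# Fit y = 2x + 1 with a single linear layer on noisy synthetic data.
torch.manual_seed(0)
X = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 2 * X + 1 + 0.05 * torch.randn_like(X)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass and loss
    loss.backward()              # backprop: populate .grad on parameters
    optimizer.step()             # gradient descent update

print(model.weight.item(), model.bias.item())  # close to 2 and 1
```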

### ML/AI & Healthcare

- [[Course] Machine Learning for Healthcare](https://www.edx.org/learn/machine-learning/massachusetts-institute-of-technology-machine-learning-for-healthcare)
- [AI in Healthcare @ Google Brain](https://www.youtube.com/watch?v=cvXVK8oqU4Q&ab_channel=AlexanderAmini)
- [Healthcare's AI Future: A Conversation with Fei-Fei Li & Andrew Ng](https://www.youtube.com/watch?v=Gbnep6RJinQ&ab_channel=StanfordHAI)
- [AI and the Future of Health](https://www.microsoft.com/en-us/research/blog/ai-and-the-future-of-health)
- [Applications of Deep Learning to Genetics](https://www.youtube.com/watch?v=GiL6RnXLjvI)
- [Daphne Koller: Biomedicine and Machine Learning](https://www.youtube.com/watch?v=xlMTWfkQqbY&ab_channel=LexFridman)
- [Data and resource needs for machine learning in genomics](https://www.youtube.com/watch?v=kjQ-8LFkeaA&ab_channel=NationalHumanGenomeResearchInstitute)
- [Machine Learning for Health Predictions](https://www.youtube.com/playlist?list=PLpvV74h3lihLdYrlnhlx_phy4pFZeZsKx)
- [Artificial Intelligence in Healthcare](https://www.youtube.com/playlist?list=PLAudUnJeNg4tvUFZ8tXQDoAkFAASQzOHm)
- [[Course] Collaborative Data Science for Healthcare](https://www.edx.org/learn/data-science/massachusetts-institute-of-technology-collaborative-data-science-for-healthcare)
- [[Course] Data Analytics and Visualization in Health Care](https://www.edx.org/learn/data-analysis/rochester-institute-of-technology-data-analytics-and-visualization-in-health-care)
- [[Course] Introduction to Applied Biostatistics: Statistics for Medical Research](https://www.edx.org/learn/biostatistics/osaka-university-introduction-to-applied-biostatistics-statistics-for-medical-research)
- [[Paper] Capabilities of Gemini Models in Medicine](https://arxiv.org/pdf/2404.18416)
  - [Journal Club Debate: Capabilities of Gemini Models in Medicine](https://www.youtube.com/watch?v=qj-4_dP6BQw)
- [[Paper] Deep learning methods for drug response prediction in cancer: Predominant and emerging trends](https://www.frontiersin.org/articles/10.3389/fmed.2023.1086097/full)
- [[Paper] Machine Learning Prediction of Cancer Cell Sensitivity to Drugs Based on Genomic and Chemical Properties](https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0061318&type=printable)
- [[Paper] Artificial intelligence in healthcare: past, present and future](https://svn.bmj.com/content/svnbmj/2/4/230.full.pdf)
- [Multimodal Generative AI: the Next Frontier in Precision Health](https://www.microsoft.com/en-us/research/quarterly-brief/mar-2024-brief/articles/multimodal-generative-ai-the-next-frontier-in-precision-health)
- [[Paper] The myth of generalisability in clinical research and machine learning in health care](https://www.thelancet.com/action/showPdf?pii=S2589-7500%2820%2930186-2)
- [[Paper] Sybil: A Validated Deep Learning Model to Predict Future Lung Cancer Risk From a Single Low-Dose Chest Computed Tomography](https://ascopubs.org/doi/pdfdirect/10.1200/JCO.22.01345)
- [[Paper] Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine](https://arxiv.org/pdf/2311.16452)
- [Large Language Models Encode Clinical Knowledge](https://arxiv.org/pdf/2212.13138)
- [AI Aspirations Healthcare Futures](https://www.youtube.com/watch?v=Bn5M6hT3W1E)
- [Breast Cancer Prediction: project](https://github.com/imteekay/breast-cancer-prediction)
- [Training ML Models for Cancer Tumor Classification](https://www.iamtk.co/training-ml-models-for-cancer-tumor-classification)
- [AI for Business Transformation: Lessons from Healthcare](https://www.youtube.com/watch?v=8C-XiXB67_Q)
- [The revolution in high-throughput proteomics and AI](https://www.science.org/doi/10.1126/science.ads5749)
- [[Course] AI for Medicine Specialization](https://www.deeplearning.ai/courses/ai-for-medicine-specialization)
- [Towards Democratization of Subspeciality Medical Expertise](https://arxiv.org/pdf/2410.03741)
- [Uncovering early predictors of cerebral palsy through the application of machine learning: a case-control study](https://bmjpaedsopen.bmj.com/content/bmjpo/8/1/e002800.full.pdf)
- [Development and Validation of a Deep Learning Method to Predict Cerebral Palsy From Spontaneous Movements in Infants at High Risk](https://watermark.silverchair.com/groos_2022_oi_220608_1656698661.11703.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAzEwggMtBgkqhkiG9w0BBwagggMeMIIDGgIBADCCAxMGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQM1jHKezBOqHeyGBSVAgEQgIIC5IKGlXjEvmbfHnrMFH3WqDX3nEMySJWaqxi9RQk-_fsW1yrXRseVGAYSDEElc6gPIbMpTmJ4hHCYzQhvIQ4igHIgJCq6U_8git_LEJR2GLAS3VE8HjBUsH0pqmhWDJ24P6WW94jgn9ZL3nPTrZX2nU6uL_ZFtDAo64muDnJ54N0NPgrxUQJOCyKz65neeKqVVM8mO6F60HFzbBuPapBMKVlTVpyB3UfDtVw7VuCPdFNyiQ8A7Z7EmzWR7rXL93pg2KztZ2-0qWqN8upA0XgN4N01LOFsanLZIf6TaZ5TTjpaAuWtgSrwurxJW5Wh3A2a6zr8SrfpGq92muV3XHJ4CtElyitZ1z9BKjZDkSrQSv9jpG8Of0ngOna4xCDwIMJ6CmaV9cajxs9ARCzmUlWyNxiVenwXCLR1z-x_W9QEeuTT58BUB9fRVStPKngy-7IG4IWbOaxAP8sLa50CtUkBPtOnichM0pdJWkuDYvOv_ylqDzoGjT6VVPk_wLVjJPGlisp9V0ZLea7gDI5OuHOfcDTO6rjWwynkUNAHZYM_dHCkBG0rFSlqxKarpOUMRR0Z6RqPJiAFzYGBnTBs2kpI0Ax5UD1Dhk2wxRcu7z8UALf-riLDXIzJZDXp_o8dHZW2HL5809Kt6k5OFiV5ovUenCXBLCDBhZC1I9r6bQD9M-CvDCBFP2vVNNUzIlT2ARGgluxXP_BOp3dQFSy3V5dwWR2vHhqj8_WFjn7kLPAiqNtjotcwYYPXrMg7mROH--dC3fwzZ608O5KXiZo717_1ftjNrWfQ-SYpq2nkkxIAln4NmoGsuIZgqHaTwZvmacMt-q0y6TQRSRkKVRhIWtF0XjVcjzlqfmOOwmF8ehdsmMnovU1pL_vDGqj2TSVMpkgG3oQ6dHR-6OAzAlvsNtV_QsFxiWUCof9MSXafhQGGWaCoyTvXwK2Iy7lEwYYzu1H0WvpywrJD1SNqa4gcgo-KHZOInjVj)
- [What VCs Look for When Investing in Bio and Healthcare](https://www.youtube.com/watch?v=t1AHFTCj4yo)
- [[Paper] Dermatologist-level classification of skin cancer with deep neural networks](papers/dermatologist-level-classification-of-skin-cancer-with-deep-neural-networks.pdf)
- [[Paper] Deep learning for healthcare: review, opportunities and challenges](https://watermark.silverchair.com/bbx044.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAA0wwggNIBgkqhkiG9w0BBwagggM5MIIDNQIBADCCAy4GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMQ2-7mTx2wplagxXZAgEQgIIC_zHCUwGjE8hQ82MCrBKVwsCS1q5zpR2eGYgCruIlBx6Uz8NFqhaNjJFvOcs7ayAZcfmC9tPi_kfMf5vF9o5jjs-lpvqwS87nhaYMeHXX2cGqSSoAZVC2YYOvjmBPbMdsVNy9yvpFpIikO6Qi4OIx0V_itE7QxGfojUTHKBebd2kt6aLN4bO73rGSX-I_Q9ElPT3v7sdrjTnfrSBAR5K5XfCGE2JwlXEOfcyxnboQoELcCALtFszLF9Xb8EDciu_qXIDEFunAPQwScasT1a5IGqhSVRolejeRZuLCTu2XxpBBLEwcsPkzwgQVpJifpG10TbWPFTQzIyPX_KQDUyR9e9VFnHMs-goG1vLnT-HZQKEP-aLTAnY6zDBhICGLLfx66JQR9DRVHZGzKRJ_p9j2FVjJ107l0Ru1Lk0WrWptBCz5p-g-luZfEndVTpuAMPf_r3wQxhJuCn4luYj_RtSOR3sM7MxsJzS_-JBQgqmjAwMDElFjVOok2r7lYU0M2xU3r_YhJCCBhxAd7s_PsNkPNj-j9QcrEw_jQ0RxHGlv-t3mpmStrIuBBBQBBLdTbJJgMN9I0S9TS6rSaHL0W2VmHVvYXckM8QcuEmpMHVqjuysYYcbgBTr8gIP4HE40VLBUIzFZNHEeOi0tL2TLdorAabXGAhcxysfl4_h-S0FNNQsGx3M56h22quLenkeixWmVH7GpXcnTNdEYH44Nt4U5Kq_PqeI5Hz53eo_hN9LgGOeLANC7Z4nNmYtNrGMrKEbMUERJJMZrEjgglcPd9fydFLKXhL_KBJ-ha1CkgQmxoOD0nkjLS4qdwjiOWNpheNzaNkJGzj8fIn-CK0U3C28APP4kWJRK3HCkJyHwWpMNJsQNPgbR94NhDykOGkzJJVR_k7QHNfVIOd_MunbPCCWi0kdxmIMzFlbCCnHUvz0KMNpLIsPZuHVmXPb5VJqdLdX9Sx56hYl37tKLGu3W-oWI89Ts1eK01T60gZ7ki3DKLf0Afc1BMZmy9hkCcAVt6oCZ)
- [[Paper] Opportunities and obstacles for deep learning in biology and medicine](https://royalsocietypublishing.org/doi/10.1098/rsif.2017.0387)
- [[Paper] Deep Learning in Medical Image Analysis](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442218/pdf/nihms-1617552.pdf)
- [[Paper] CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning](https://arxiv.org/pdf/1711.05225)
- [[Paper] Medical deep learning—A systematic meta-review](https://www.sciencedirect.com/science/article/pii/S0169260722002565?via%3Dihub)
- [[Paper] Scalable and accurate deep learning with electronic health records](papers/scalable-and-accurate-deep-learning-with-electronic-health-records/paper.pdf)
- [[Paper] Dermatologist–level classification of skin cancer with deep neural networks](https://pmc.ncbi.nlm.nih.gov/articles/PMC8382232/pdf/nihms-1724608.pdf)
- [[Paper] Deep Learning in Medicine](papers/deep-learning-in-medicine/paper.pdf)
- [[Paper] Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology](https://arxiv.org/pdf/2402.14252)
- [[Paper] Collaboration between clinicians and vision–language models in radiology report generation](papers/collaboration-between-clinicians-and-vision–language-models-in-radiology-report-generation/paper.pdf)
- [[Paper] ReXplain: Translating Radiology into Patient-Friendly Video Reports](https://arxiv.org/pdf/2410.00441)
- [AI for Medical Diagnosis](https://www.coursera.org/learn/ai-for-medical-diagnosis)
- [AI for Medical Prognosis](https://www.coursera.org/learn/ai-for-medical-prognosis)
- [AI For Medical Treatment](https://www.coursera.org/learn/ai-for-medical-treatment)
- [Multimodal, Generative, and Agentic AI for Pathology](https://www.youtube.com/watch?v=tbJwdK48hJw&list=PLlMMtlgw6qNjROoMNTBQjAcdx53kV50cS&index=2)
- [[Paper] Medical multimodal foundation models in clinical diagnosis and treatment: Applications, challenges, and future directions](https://www.sciencedirect.com/science/article/abs/pii/S0933365725002003)
- [[Paper] Large language models are less effective at clinical prediction tasks than locally trained machine learning models](https://academic.oup.com/jamia/article/32/5/811/8064348)
- [[Course] Applied Artificial Intelligence for Health Research](https://learninghub.kingshealthpartners.org/course/applied-artificial-intelligence-for-health-research)

#### ML for Clinical Knowledge

- [[Paper] Large Language Models Encode Clinical Knowledge](https://arxiv.org/pdf/2212.13138)

### ML/AI & Biology

- [[Paper] Machine learning-aided generative molecular design](https://www.nature.com/articles/s42256-024-00843-5)
- [Simulating 500 million years of evolution with a language model](https://evolutionaryscale-public.s3.us-east-2.amazonaws.com/research/esm3.pdf)
- [Learning to Plan Chemical Syntheses](https://www.semanticscholar.org/reader/ef8ab2a0be51a0cd04c2c0f01adfae956a2a84af)
- [Machine Learning for Genomics](https://www.youtube.com/playlist?list=PLypiXJdtIca6dEYlNoZJwBaz__CdsaoKJ)
- [MIT Deep Learning in Life Sciences](https://www.youtube.com/playlist?list=PLypiXJdtIca5sxV7aE3-PS9fYX3vUdIOX)
- [AI Text2Protein Breakthrough Tackles the Molecule Programming Challenge](https://medium.com/310-ai/mpm4-ai-text2protein-breakthrough-tackles-the-molecule-programming-challenge-870045a8c1ad)
- [Genomic Language Models: Opportunities and Challenges](https://arxiv.org/pdf/2407.11435)
- [Melodia: A Python Library for Protein Structure Analysis](https://watermark.silverchair.com/btae468.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAA4swggOHBgkqhkiG9w0BBwagggN4MIIDdAIBADCCA20GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMVNfSCiowdD1a6WnjAgEQgIIDPsK_bI3A6IGF7cjZqL-1PehaqGZsY0AwhsIWIAc5Qa0rKYxHgeqnDIClLsf0Ey_I6ps6u545OlMuxBXd7yIO3xB0N0EMbsq5qYVSHqnuiqu2-LShlZxwk0ICGlLuJDR0ROgvGT837Lh72d2Eax_WuXzx6bkr9L2eUBifW8x4fULkCBqFtvhySkJwwvIIYd46Pi8bgM-XeQZI1DjwxN4KuHG15xkQpbdGvMmYpSGJvJefQTnY_YzF94F7zheUVj4s3JRYpKPtbxhG-6ba525xHNpMiFOy7gIbdn2X3JlH7LlQu6qE77E27t43nzyGujAvZEMl0Fir4TXs59Syp-c7Ss6MkCe1eh_VQtzdA3R00o7MHNy2fL_ES_Vkjdf1WcAB4nWQogaw_xZyOptjJxUJfLZyUYkEHpvfiSDx7f6Xr0F9w-gy-2tSemDG7Bp0xGjJJqA3oDZ8KZlR1hINXtCFG9qHIMy0425YFsGJY8nsTyZ2ULFlP2aeH1nnvUY_3O9r7KN_hKZhauxn5qkV5aSY1owVH9GDraYyRf-5JxpqVVAiovkzoqwa5YJXlMgflbK1S-004q_vtlNO2E9Wijy6qjiNUoot3QKybZogrumKSAuuvZwRRtAbvDdt7pZFyqxfEp6G7ofjR-MHNlinTq9rku2zu3znlFWI7j-nny465XasRL04KJXzjHXOAjpc0Ww4Ns-xnS24kVACj_ioBQ4XWSsHMUSdZfttGBWE4AL-64Ll7avyn9U64iEf9grCct3Hu1Dub8wMcwbXzjN7OPb3FLTlT8-zLTgWFmuMXpI7PV4wYRzt61APV3OCDfoq21XTr9Qn-nTaiNESDsClOvL49ZqYPTwFunCYkfR-jhgH06vc6wdB9XXV5jgIqdD5z1JXv8g4XJV3BTTj5SpGhomM9LkHgkDtZwzqMzJClbtkQArncyzLAKoX2kLx2_8t5u69rqCV6mSVDPwoeiJjVcl0uK8UmnCnk8MvHyN6odT-u_osm7aihSojxqKHBRJxdS3eB3gXq4qdNb8qVMGACMNpH4x_bp0qPvKGCOKJV0Lncer6H3HeLHVbrD6KPvWv5_g8JirNW5RDe5umOhuD1rFJ)
- [Biomolecular Modeling and Design Resources](https://abeebyekeen.com/categories/resources)
- [Understanding AlphaFold – Dame Janet Thornton](https://www.youtube.com/watch?v=lxgaILSZEbU)
- [Leveraging Molecular ML + Property Prediction in Drug Design](https://www.youtube.com/watch?v=wisrT2_EYrA)
- [Geometric Deep Learning for Protein Understanding](https://www.youtube.com/watch?v=h7Rifw0Nuv4)
- [Polaris: Industry-Led Initiative to Critically Assess ML for Real-World Drug Discovery](https://www.youtube.com/watch?v=Tsz_T1WyufI)
- [Efficiently Exploring Combinatorial Perturbations From High Dimensional Observation](https://www.youtube.com/watch?v=8ZjqsgsPV_0)
- [Towards Rational Drug Design with AlphaFold 3](https://www.youtube.com/watch?v=AE35XCN5NuU)
- [How AI and accelerated computing are transforming drug discovery](https://www.ft.com/partnercontent/nvidia/how-ai-and-accelerated-computing-are-transforming-drug-discovery.html)
- [Review and discussion of AlphaFold3](https://www.youtube.com/watch?v=qjFgthkKxcA)
- [Understanding & discovering fold-switching proteins by combining AlphaFold2](https://www.youtube.com/watch?v=rgGceDDnIEo)
- [Accelerating drug discovery with AI](https://www.youtube.com/watch?v=-hl0jpwWbV4)
- [Intro to ML in Drug Discovery: Principles & Applications](https://www.youtube.com/watch?v=j-oLfEm7xD8)
- [Introduction to AI in Drug Discovery](https://www.youtube.com/watch?v=7NgPGh0E0XE)
- [AlphaFold3: A foundation model for biology](https://harrisbio.substack.com/p/alphafold3-a-foundation-model-for)
- [[Paper] Deep Learning for Drug Discovery and Cancer Research: Automated Analysis of Vascularization Images](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7904235)
- [[Paper] Deep learning in drug discovery: an integrative review and future challenges](https://github.com/imteekay/machine-learning-research/blob/master/papers/deep-learning-in-drug-discovery-an-integrative-review-and-future-challenges/paper.pdf)
- [DeepMind AlphaFold 3](https://www.youtube.com/watch?v=Mz7Qp73lj9o&ab_channel=TwoMinutePapers)
- [[Course] Introduction to Genomic Data Science](https://www.edx.org/learn/bioinformatics/the-university-of-california-san-diego-introduction-to-genomic-data-science)
- [Generative models for molecular discovery: Recent advances and challenges](https://wires.onlinelibrary.wiley.com/doi/epdf/10.1002/wcms.1608)
- [Generative Models of Molecular Structures](https://www.youtube.com/watch?v=15bHUOjp6IU&list=PLoVkjhDgBOt3NyXcTGg_fi-H8qBzNnKgk&index=15)
- [Opportunities and obstacles for deep learning in biology and medicine](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5938574/pdf/rsif20170387.pdf)
- [Ten quick tips for machine learning in computational biology](https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0155-3)
- [Machine learning and complex biological data](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1689-0)
- [A guide to machine learning for biologists](https://hfenglab.org/NRev21.pdf)
- [Next-Generation Machine Learning for Biological Networks](https://www.cell.com/action/showPdf?pii=S0092-8674%2818%2930592-0)
- [AlphaFold3 — What’s next in computational drug discovery? — Part 1](https://medium.com/@leowossnig/alphafold3-whats-next-in-computational-drug-discovery-2da534c0845e)
- [Deep generative models for biomolecular engineering](https://www.youtube.com/watch?v=4A51MwTuctk)
- [Discovering New Molecules Using Graph Neural Networks](https://www.youtube.com/watch?v=fzSL7MWfXtQ)
- [AI-Driven Drug Discovery Using Digital Biology](https://www.youtube.com/watch?v=27JMkAleyNw)
- [Digital Biology with insitro's Daphne Koller](https://www.youtube.com/watch?v=79qJLY-30ao)
- [AI-First: Daphne Koller’s plan to revolutionize drug discovery](https://www.youtube.com/watch?v=ukEaOOn9ZaE)
- [How AI is saving billions of years of human research time](https://www.ted.com/talks/max_jaderberg_how_ai_is_saving_billions_of_years_of_human_research_time)
- [Bioinformatics Toolkit](https://github.com/evanpeikon/Bioinformatics_Toolkit)
- [An Overview of Deep Generative Models in Functional and Evolutionary Genomics](https://www.annualreviews.org/content/journals/10.1146/annurev-biodatasci-020722-115651)
- [Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review](https://pmc.ncbi.nlm.nih.gov/articles/PMC10376273)
- [Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models](https://www.mdpi.com/1422-0067/24/21/15858)
- [A review of multimodal deep learning methods for genomic-enabled prediction in plant breeding](https://academic.oup.com/genetics/article/228/4/iyae161/7876340)
- [Deep Generative Models for Drug Design and Response](https://arxiv.org/pdf/2109.06469)
- [So where are we with deep learning for biochem?](https://www.ladanuzhna.xyz/writing/deep-learning-for-biochem)
- [A review of transformers in drug discovery and beyond](https://www.sciencedirect.com/science/article/pii/S2095177924001783)
- [Biomedical Transformers](https://www.youtube.com/watch?v=nz7_wg5iOlA&list=PLoROMvodv4rNiJRchCzutFw5ItR_Z27CM&index=18)
- [Large language models for genomics](https://github.com/raphaelmourad/LLM-for-genomics-training)
- [Data Science for Biologists](https://www.youtube.com/playlist?list=PLMrJAkhIeNNQz4BMoGSsN8cbt8pHlokhV)
- [Using state-of-the-art AI models to power drug-design](https://www.isomorphiclabs.com/articles/using-bespoke-ai-models-to-power-drug-design)
- [AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model](https://storage.googleapis.com/deepmind-media/papers/alphagenome.pdf)
- [AlphaGenome: AI for better understanding the genome](https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome)
- [[Paper] Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/doi/full/10.1073/pnas.2016239118)
- [[Article] ProteinML 101](https://kidger.site/thoughts/just-know-stuff-protein-ml)
- [[Paper] Flow Matching for Generative Modeling](https://arxiv.org/abs/2210.02747)
- [[Paper] SE(3)-Stochastic Flow Matching for Protein Backbone Generation](https://arxiv.org/abs/2310.02391)
- [[Article] Diving Into Protein Design with SE(3) Flow Matching](https://chekmenev.me/posts/protein_discovery)
- [[Article] Protein Embeddings: Unlocking Protein Secrets with Deep Learning; ESM, and ProteinBERT](https://medium.com/@aynr/protein-embeddings-unlocking-protein-secrets-with-deep-learning-esm-and-protbert-ac1951c31d2f)
- [[Paper] Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting](https://www.nature.com/articles/s41467-024-45566-8)
- [[Video] The ESM-1b protein language model](https://www.youtube.com/watch?v=nXaqAEBtItc)
- [[Paper] AI-driven protein design](https://www.nature.com/articles/s44222-025-00349-8)
- [[Paper] Sequence modeling and design from molecular to genome scale with Evo](http://rivaslab.org/teaching/MCB128_AIMB/downloads/Nguyen24.pdf)
- [[Paper] Learning the protein language: Evolution, structure, and function](http://rivaslab.org/teaching/MCB128_AIMB/downloads/BeplerBerger21.pdf)
- [[Paper] DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome](http://rivaslab.org/teaching/MCB128_AIMB/downloads/Ji21.pdf)
- [[Paper] Evolutionary-scale prediction of atomic-level protein structure with a language model](http://rivaslab.org/teaching/MCB128_AIMB/downloads/Lin23.pdf)
- [[Paper] ProGen2: Exploring the boundaries of protein language models](http://rivaslab.org/teaching/MCB128_AIMB/downloads/Nijkamp23.pdf)
- [[Paper] A Deep Learning Approach to Antibiotic Discovery](http://rivaslab.org/teaching/MCB128_AIMB/downloads/Stokes20.pdf)
- [[Paper] gReLU: a comprehensive framework for DNA sequence modeling and design](https://www.nature.com/articles/s41592-025-02868-z)
- [[Paper] DeepSomatic: Accurate somatic small variant discovery for multiple sequencing technologies](https://www.biorxiv.org/content/10.1101/2024.08.16.608331v1.full.pdf) ([code](https://github.com/google/deepsomatic))
- [[Video] AI for molecular modeling and protein design](https://www.youtube.com/watch?v=6utofVMSaIw)
- [[Video] AI Virtual Cell and Spatial Proteomics](https://www.youtube.com/watch?v=Ifc1FDdDlvw)
- [[Article] No, AlphaFold has not completely solved protein folding](https://blog.genesmindsmachines.com/p/no-alphafold-has-not-completely-solved)
- [[Video] Making biology programmable](https://www.youtube.com/watch?v=f5N6PU93L0I)
- [[Video] DeepMind AlphaFold](https://www.youtube.com/watch?v=FYVf0bRgO5Q)
- [[Video] AlphaFold 3 deep dive](https://www.youtube.com/watch?v=Or3iq4_9-wA)
- [[Video] AlphaFold - Veritasium](https://www.youtube.com/watch?v=P_fHJIYENdI)
- [[Article] How to Build the AI Virtual Cell](https://aimm.epfl.ch/blog/how-to-build-the-ai-virtual-cell)
- [[Paper] Open-source protein structure AI aims to match AlphaFold](https://www.nature.com/articles/d41586-025-03546-y)
- [[Video] Behind the scenes at Isomorphic Labs](https://vimeo.com/1127619333/2a63980585)
- [[Paper] Causal Inference for Computational Biology](https://summit.sfu.ca/_flysystem/fedora/2023-05/etd22427.pdf)
- [[Video] The State of AI in Drug Discovery 2025](https://webinars.liebertpub.com/e/the-state-of-ai-in-drug-discovery-2025/portal/stage)
- [[Article] Gaps and Risks of AI in the Life Sciences](https://rachel.fast.ai/posts/2024-09-10-gaps-risks-science)
- [[Paper] Opportunities and obstacles for deep learning in biology and medicine](https://www.biorxiv.org/content/10.1101/142760v1.full.pdf)
- [[Notebook] Deep Learning in Genomics Primer](https://colab.research.google.com/drive/17E4h5aAOioh5DiTo7MZg4hpL6Z_0FyWr)
- [[Paper] Designing synthetic regulatory elements using the generative AI framework DNA-Diffusion](https://www.nature.com/articles/s41588-025-02441-6.epdf?sharing_token=53y8IvPhYfq0ZTmy0QJpRdRgN0jAjWel9jnR3ZoTv0P5wDQIGfy7FJrfwjPqNvr1zRvkWoH8vACly1kiOQ9iZynAyRBgild2HtTvGc5yF6sg51GMBao_B0YWTpr4-wmY29Js7lrQFTV1H6w95z3FF0VJ6zeRJjbMz4C2MoC9xUU%3D)
- [[Paper] The AI revolution: how multimodal intelligence will reshape the oncology ecosystem](https://www.nature.com/articles/s44387-025-00044-4)
- [[Paper] A multimodal machine learning model for the stratification of breast cancer risk](https://www.nature.com/articles/s41551-024-01302-7)
- [[Course] MO640 - Biologia Computacional](https://www.youtube.com/playlist?list=PLf62OlGffu12Vju0L7IgKB2WJh0x3YDGV)
- [[Thesis] Generative Models for Real-World Drug Discovery](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-223.pdf)
- [[Book] Deep Learning for Molecules & Materials](https://dmol.pub)
- [Protein Structure & Structural Bioinformatics](https://www.proteinstructures.com)
- [Multiple Sequence Alignment in AlphaFold2](https://www.chrishayduk.com/p/understanding-protein-language-models)
- [Encoder-only Transformers as Continuous Fuzzy String Matching](https://www.chrishayduk.com/p/understanding-protein-language-models-e1a)
- [Structure Prediction without Multiple Sequence Alignment in ESMFold](https://www.chrishayduk.com/p/understanding-protein-language-models-40b)

### Podcasts

- [Data Science, Past, Present and Future with Hilary Mason](https://www.datacamp.com/community/podcast/data-science-past-present-and-future)

### Questions and Answers

- [Andrew Ng's answer on "How should you start a career in Machine Learning?"](https://www.quora.com/How-should-you-start-a-career-in-Machine-Learning)
- [How do I learn mathematics for machine learning?](https://www.quora.com/How-do-I-learn-mathematics-for-machine-learning)
- [How do I learn machine learning?](https://www.quora.com/How-do-I-learn-machine-learning-1)

### Databases

- [Penn Machine Learning Benchmarks](https://epistasislab.github.io/pmlb)

### Meta / Lists

- [Papers on machine learning for proteins](https://github.com/yangkky/Machine-learning-for-proteins)
- [Papers on Protein Design using Deep Learning](https://github.com/Peldom/papers_for_protein_design_using_DL)
- [How To Read AI Research Papers Effectively](https://www.youtube.com/watch?v=K6Wui3mn-uI)
- [fast.ai vs. deeplearning.ai](https://medium.com/@markryan_69718/learning-deep-learning-fast-ai-vs-deeplearning-ai-34f9c42cf701)

## Science

### Fundamentals

- [AP Biology](https://www.khanacademy.org/science/ap-biology)
- [AP Chemistry](https://www.khanacademy.org/science/ap-chemistry-beta)
- [Intro to Biology](https://www.khanacademy.org/science/biology)
- [Intro to Chemistry](https://www.khanacademy.org/science/chemistry)
- [Organic Chemistry](https://www.khanacademy.org/science/organic-chemistry)
- [Introductory Biology](https://ocw.mit.edu/courses/7-016-introductory-biology-fall-2018)
- [Molecular Biology - Part 1: DNA Replication and Repair](https://www.edx.org/learn/molecular-biology/massachusetts-institute-of-technology-molecular-biology-part-1-dna-replication-and-repair)
- [Introduction to Biology - The Secret of Life](https://www.edx.org/learn/biology/massachusetts-institute-of-technology-introduction-to-biology-the-secret-of-life)

### Science

- [AI Case Studies for Natural Science Research](https://www.youtube.com/watch?v=rfPQ2y857eM&ab_channel=MicrosoftResearch)
- [How AI Is Unlocking the Secrets of Nature and the Universe](https://www.youtube.com/watch?v=0_M_syPuFos&ab_channel=TED)
- [Will AI Spark the Next Scientific Revolution?](https://www.youtube.com/watch?v=7wznuB0sKlw)

### Biology

- [Cell biology by the numbers](https://book.bionumbers.org)
- [Systems Biology and Biotechnology](https://www.coursera.org/specializations/systems-biology)
- [MIT Systems Biology](https://ocw.mit.edu/courses/8-591j-systems-biology-fall-2014)
- [The Most Important Concept in Biology](https://www.youtube.com/watch?v=diKGuE7CDsY)
- [[Course] MIT 7.01SC Fundamentals of Biology](https://www.youtube.com/playlist?list=PLF83B8D8C87426E44)

### Cancer

- [Introduction to the Biology of Cancer](https://www.coursera.org/learn/cancer)
- [Understanding Prostate Cancer](https://www.coursera.org/learn/prostate-cancer)
- [Understanding Cancer Metastasis](https://www.coursera.org/learn/cancer-metastasis)
- [Ask a Researcher: Working in a Cancer Research Lab](https://www.youtube.com/watch?v=YJ8Fk6iLxdg&list=TLPQMjUwNDIwMjIX07N_vVhBIQ&index=5&ab_channel=NationalCancerInstitute)
- [What Causes Cancer?](https://www.youtube.com/watch?v=UlHK3Y_c5Wo&ab_channel=UniversityofCaliforniaTelevision%28UCTV%29)
- [What is Cancer?](https://www.youtube.com/watch?v=2X5kw3mVk08&ab_channel=UniversityofCaliforniaTelevision%28UCTV%29)
- [How is Cancer Diagnosed?](https://www.youtube.com/watch?v=oSOJbu5uqJE&ab_channel=UniversityofCaliforniaTelevision%28UCTV%29)
- [Cancer: Winning the War](https://www.youtube.com/playlist?list=PL504E935D23E00B4B)
- [The Emperor of All Maladies: A Biography of Cancer](https://www.youtube.com/watch?v=D4BGYf2Nkks&ab_channel=GBHForumNetwork)
- [Regina Barzilay: Deep Learning for Cancer Diagnosis and Treatment](https://www.youtube.com/watch?v=x0-zGdlpTeg&ab_channel=LexFridman)
- [Tumour heterogeneity and resistance to cancer therapies](https://www.nature.com/articles/nrclinonc.2017.166)
- [Why, even with so much scientific progress, do we still not have a cure for cancer? (in Portuguese)](https://threadreaderapp.com/thread/1512474043290632194.html)

### Genetics

- [Cell Biology: Transport and Signaling](https://www.edx.org/learn/cellular-biology/massachusetts-institute-of-technology-cell-biology-transport-and-signaling)
- [Introduction to Genomic Technologies](https://www.coursera.org/learn/introduction-genomics)
- [Classical papers in molecular genetics](https://www.coursera.org/learn/papers-molecular-genetics)
- [Genetics: The Fundamentals](https://www.edx.org/learn/genetics/massachusetts-institute-of-technology-genetics-the-fundamentals)
- [[Course] Genetics: The Fundamentals](https://www.edx.org/learn/genetics/massachusetts-institute-of-technology-genetics-the-fundamentals)
- [[Course] Genetics: Analysis and Applications](https://www.edx.org/learn/genetics/massachusetts-institute-of-technology-genetics-analysis-and-applications)
- [[Course] Genomic Medicine Gets Personal](https://www.edx.org/learn/bioinformatics/georgetown-university-genomic-medicine-gets-personal)
- [[Course] Essentials of Genomics and Biomedical Informatics](https://www.edx.org/learn/biomedical-sciences/israelx-essentials-of-genomics-and-biomedical-informatics)
- [Genomics Papers](https://github.com/jtleek/genomicspapers)
- [Jennifer Doudna: The Exciting Future of Genome Editing](https://www.youtube.com/watch?v=D4FOtJoqoKM)

### Computational Biology

- [Foundations of Computational and Systems Biology](https://ocw.mit.edu/courses/7-91j-foundations-of-computational-and-systems-biology-spring-2014)
- [Bioinformatics](https://seen-politician-a47.notion.site/ccd895cfaee94849bc9c405a4143b4f5?v=8ca8b89a8be54d7c800a1dfe9780abfc)
- [Understanding life via computational bioinformatics](https://www.youtube.com/watch?v=KH_ZxNu9vj4&ab_channel=OrangeCountyACMChapter)
- [[Book] Introduction to Protein Structure](https://www.goodreads.com/book/show/651485.Introduction_to_Protein_Structure)
- [[Book] Molecular Modelling: Principles and Applications](https://www.goodreads.com/book/show/1012202.Molecular_Modelling)
- [[Book] Structural Bioinformatics](https://www.goodreads.com/book/show/737881.Structural_Bioinformatics)
- [[Book] Computational Structural Biology](https://www.goodreads.com/book/show/8005370-computational-structural-biology)
- [[Book] Molecular Modelling and Simulation](https://www.goodreads.com/book/show/11639171-molecular-modeling-and-simulation)
- [[Book] Computational Methods for Protein Structure Prediction and Modeling](https://www.goodreads.com/book/show/651517.Computational_Methods_for_Protein_Structure_Prediction_and_Modeling)
- [[Book] Understanding Molecular Simulation](https://www.goodreads.com/book/show/258137.Understanding_Molecular_Simulation)
- [[Book] The Art of Molecular Dynamics Simulation](https://www.goodreads.com/book/show/5345626-the-art-of-molecular-dynamics-simulation)
- [[Book] Protein Actions: Principles and Modeling](https://www.goodreads.com/book/show/34740617-protein-actions)
- [[Book] Computational Protein Design](https://www.goodreads.com/book/show/31237121-computational-protein-design)
- [[Article] How anyone can teach themselves bioengineering](https://kakama-content-sharing.notion.site/How-anyone-can-teach-themselves-bioengineering-how-I-did-it-too-7a256c2aa5704e87bd5985cad1fed4ce)
- [[Course] Introduction to Bioinformatics and Computational Biology — Harvard](https://liulab-dfci.github.io/bioinfo-combio/)

### Precision Health

- [Defining precision health: a scoping review protocol](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7888329/pdf/bmjopen-2020-044663.pdf)

### Meta

- [Fei-Fei Li & Demis Hassabis: Using AI to Accelerate Scientific Discovery](https://www.youtube.com/watch?v=KHFmIknP_Hc&ab_channel=StanfordHAI)
- [Science is the great giver](https://www.gatesnotes.com/European-Innovation)
- [The Age of AI has begun](https://www.gatesnotes.com/The-Age-of-AI-Has-Begun)
- [Writing in the Sciences](https://www.coursera.org/learn/sciwrite)
- [How to read and understand a scientific paper: a guide for non-scientists](https://blogs.lse.ac.uk/impactofsocialsciences/2016/05/09/how-to-read-and-understand-a-scientific-paper-a-guide-for-non-scientists)
- [Demis Hassabis, AI to Accelerate Scientific Discovery](https://www.youtube.com/watch?v=u1dl_keFK4w&ab_channel=Axial)
- [Demis Hassabis, AI for Science](https://www.youtube.com/watch?v=Q2JmdyqLqiw&ab_channel=Axial)

### Science: Q&A

- [As a computer science graduate student, I am motivated to do cancer research. How significantly can computer scientists contribute to cancer research? Where are such research institutes where I can pursue a PhD?](https://www.quora.com/As-a-computer-science-graduate-student-I-am-motivated-to-do-cancer-research-How-significantly-can-computer-scientists-contribute-to-cancer-research-Where-are-such-research-institutes-where-I-can-pursue-a-PhD)
- [How can I contribute to cancer research as a computer engineering student if I have basic knowledge in artificial Intelligence?](https://www.quora.com/How-can-I-contribute-to-cancer-research-as-a-computer-engineering-student-if-I-have-basic-knowledge-in-artificial-Intelligence)
- [What kind of knowledge gaps in molecular biology make cancer a big problem for researchers?](https://www.quora.com/What-kind-of-knowledge-gaps-in-molecular-biology-make-cancer-a-big-problem-for-researchers)

## Careers

### How to: Interview Prep

- [How I Prepared for DeepMind and Google AI Research Internship Interviews in 2019](https://davidstutz.de/how-i-prepared-for-deepmind-and-google-ai-research-internship-interviews-in-2019)
- [65 Machine Learning Interview Questions](https://github.com/andrewekhalel/MLQuestions)
- [Data Science Interview Questions Answers](https://github.com/youssefHosni/Data-Science-Interview-Questions-Answers)
- [Machine Learning Interviews](https://github.com/khangich/machine-learning-interview)
- [Crushing your interviews for Data Science and Machine Learning Engineering roles](https://building.nubank.com.br/crushing-your-interviews-for-data-science-and-machine-learning-engineering-roles-8-practical-tips)
- [Becoming an MLE at FAANG: What you need to know to know about MLE roles and interviews at Google, Meta, and other top companies](https://interviewing.io/blog/becoming-an-mle-at-faang-what-you-need-to-know)
- [How To Become an ML Engineer in 2024](https://www.youtube.com/watch?v=b7fe72GuFZk)
- [How to land a Data Scientist job at your dream company — My journey to Airbnb](https://towardsdatascience.com/how-to-land-a-data-scientist-job-at-your-dream-company-my-journey-to-airbnb-f6a1e99892e8)
- [Data Science Interview Prep](https://www.youtube.com/playlist?list=PLrtCHHeadkHqFEO69UiOj9BHA7pah-0fj)
- [Machine Learning Interview Prep](https://www.youtube.com/playlist?list=PLrtCHHeadkHqYX7O5cjHeWHzH2jzQqWg5)
- [ML Design Interview strategy](https://www.youtube.com/watch?v=XN2ymraj27g)
- [[Video] Google DeepMind ML Engineer Interview Tips](https://www.youtube.com/watch?v=UGZRFSqvNng)
- [[Article] An Unofficial Guide to Prepare for a Research Position Application](https://pub.sakana.ai/Unofficial_Guide)
- [MLE Interview 2.0: Research Engineering and Scary Rounds](https://www.yuan-meng.com/posts/mle_interviews_2.0)

### How to: Coding prep

- [LeetCode for ML](https://ml1337.ai)
- [TorchLeet: Leetcode for PyTorch](https://github.com/Exorust/TorchLeet)
- [Deep-ML](https://www.deep-ml.com)
- [PyTorch Practice Notebook - ML Interview Preparation](https://colab.research.google.com/drive/1PVHupMXgJwfAVrTWhJQcafxUcqj8IqkD?usp=sharing)
- [100-Days-Of-ML-Code](https://github.com/Avik-Jain/100-Days-Of-ML-Code)
- [65 Machine Learning Interview Questions](https://github.com/andrewekhalel/MLQuestions)

### Jobs

- [ML Researcher at Borealis AI](careers/ml-researcher-borealis-ai.pdf)
- [Research Scientist, Health AI — OpenAI](careers/research-scientist-health-ai-openaI.pdf)
- [ML Engineer — Apple](careers/ml-engineer-apple.pdf)
- [ML Protein Design](careers/machine-learning-researcher-in-protein-design.pdf)
- [Machine Learning Scientist - Computational Biology - Deep Genomics](careers/machine-learning-scientist-deep-genomics.pdf)
- [(Senior) Machine Learning Scientist, AI Foundation Model Specialist - Deep Genomics](careers/machine-learning-scientist-ai-foundation-model-deep-genomics.pdf)
- [(Senior) Principal Scientist, Computational / Systems Biology - Deep Genomics](careers/computational-systems-biology-deep-genomics.pdf)
- [(Senior) Research Scientist - Statistical Genetics - Deep Genomics](careers/research-scientist-deep-genomics.pdf)
- [Research Scientist (Computational Biology, Cells and Tissues) — Isomorphic Labs](careers/research-scientist-computational-biolog-isomorphic-labs.pdf)

## Projects

- [Breast Cancer Prediction: Predicting whether breast cancer tumors are malignant or benign](https://github.com/imteekay/breast-cancer-prediction)
- [House Price Regression: Model prediction for house prices](projects/regression/house-price-regression-model.ipynb)
- [SVC/Decision Tree Classifiers](projects/classification/svc-decision-tree-classifiers.ipynb)
- [PyTorch Fundamentals](projects/pytorch/pytorch-fundamentals.ipynb)
- [PyTorch Neural Network Classification](projects/pytorch/pytorch-neural-network-classification.ipynb)
- [PyTorch Computer Vision](projects/pytorch/pytorch-computer-vision.ipynb)
- [PyTorch Computer Vision Exercises](projects/pytorch/pytorch-computer-vision-exercises.ipynb)
- [PyTorch Custom Datasets](projects/pytorch/pytorch-custom-datasets.ipynb)

## Community

### People

- [Andrej Karpathy](https://karpathy.ai)
- [Alex Krizhevsky](https://www.cs.toronto.edu/~kriz)
- [Geoffrey E. Hinton](https://www.cs.toronto.edu/~hinton)
- [Rob Tibshirani](https://tibshirani.su.domains)
- [Trevor Hastie](https://hastie.su.domains)
- [Daniela Witten](https://www.danielawitten.com)
- [Hattie Zhou](http://hattiezhou.com)
- [Chelsea Voss](https://csvoss.com)
- [Lilian Weng](https://lilianweng.github.io)
- [Christopher Olah](https://colah.github.io)
- [Alex Irpan](https://www.alexirpan.com)
- [Gwern Branwen](https://gwern.net)
- [Jonathan Taylor](https://jtaylor.su.domains)
- [Apoorva Srinivasan](https://www.apoorva-srinivasan.com)
- [Susan Zhang](https://suchenzang.github.io)
- [Michael Chang](https://mbchang.github.io)
- [Jan Leike](https://jan.leike.name)
- [Xiao Ma](https://maxiao.info)
- [Gabriele Corso](https://gcorso.github.io)
- [Falk Hoffmann](https://medium.com/@falk_hoffmann)
- [Sara Hooker](https://www.sarahooker.me)
- [Mario Geiger](https://mariogeiger.ch)
- [Charlotte Bunne](https://www.bunne.ch)
- [Charlie Harris](https://cch1999.github.io)
- [Yuanqi Du](https://yuanqidu.github.io)
- [Sophia Sanborn](https://www.sophiasanborn.com)
- [Omar Sanseviero](https://osanseviero.github.io/hackerllama/blog)
- [Simon Willison](https://simonwillison.net)
- [Hamel Husain](https://hamel.dev)
- [Philipp Schmid](https://www.philschmid.de)
- [Eugene Yan](https://eugeneyan.com/writing)
- [Chip Huyen](https://huyenchip.com/blog)
- [Chenru Duan](https://www.crduan.com)
- [Jeff Guo](https://guojeff.github.io)
- [Arian Jamasb](https://jamasb.io)
- [Joseph Suárez](https://jsuarez5341.github.io)
- [Andrew Ng](http://andrewng.org)
- [Mathematics behind Deep learning](https://mathblog.vercel.app)
- [Kevin Kaichuang Yang](https://yangkky.github.io)
- [Terence Parr](https://explained.ai)
- [Penny Xu](https://penny-xu.github.io)
- [Amy X. Lu](https://amyx.lu)
- [Benjamin Bloem-Reddy](https://www.stat.ubc.ca/~benbr)
- [Quanhan (Johnny) Xi](https://xijohnny.github.io)
- [Eric Horvitz](https://erichorvitz.com)
- [Rinaldo Montalvão](https://www.linkedin.com/in/rwmontalvao)
- [Joanne Peng](https://www.joannepeng.com)
- [Sarah Alamdari](https://www.sarahalamdari.com)
- [Lorin Crawford](https://www.lorincrawford.com)
- [Ava Amini](https://avaamini.com)
- [Alex Lu](https://www.alexluresearch.com)
- [Rocío Mercado Oropeza](https://rociomer.github.io)
- [Pranav Rajpurkar](https://pranavrajpurkar.com)
- [Martin Steinegger](https://steineggerlab.com)
- [Jue Wang](https://juewang.mystrikingly.com)
- [Wenhu Chen](https://wenhuchen.github.io)
- [Melanie Mitchell](https://melaniemitchell.me)
- [Jenny Zhang](https://www.jennyzhangzt.com)
- [Abhinav Gupta](https://www.guabhinav.com)
- [Beidi Chen](https://www.andrew.cmu.edu/user/beidic)
- [Avantika Lal](https://avantikalal.github.io)
- [Will Connell](https://wconnell.github.io)
- [Danqi Chen](https://www.cs.princeton.edu/~danqic)
- [Yuqing Du](https://yuqingd.github.io)
- [Patrick Hsu](https://patrickhsu.com)
- [Brian Hie](https://brianhie.com)
- [Andrew Y. K. Foong](https://andrewfoongyk.github.io)
- [Karina Nguyen](https://karinanguyen.com)
- [Myle Ott](https://myleott.com)
- [Reza Rezvan](https://rezarezvan.com)
- [Artem Moskalev](https://amoskalev.github.io)
- [Dr. Yejin Kim](https://yejinjkim.github.io)
- [Jay Alammar](https://jalammar.github.io)
- [Luis Serrano](https://serrano.academy)
- [Jiayuan Mao](https://jiayuanm.com)
- [Edward Z. Yang](https://ezyang.com)
- [Edward Z. Yang's blog](https://blog.ezyang.com)
- [Kejun Ying](https://kejunying.com)
- [Xiang Lisa Li](https://xiangli1999.github.io)
- [Rachel Wu](https://people.csail.mit.edu/rmwu)
- [Simon Kohl](https://www.simonkohl.com)
- [Yehlin Cho](https://sites.google.com/view/yehlincho/home)
- [Lada Nuzhna](https://www.ladanuzhna.xyz)
- [Valerie Chen](https://valeriechen.github.io)
- [Jian Zhou](https://zhoulab.io)
- [Maria Chikina](https://chikinalab.org)
- [Sergey Ovchinnikov](https://www.solab.org)
- [James Zou](https://www.james-zou.com)
- [Hanxue Gu](https://guhanxue.github.io)
- [Chaitanya K. Joshi](https://www.chaitjo.com)
- [Adam Majmudar](https://adammaj.com)
- [Qian Huang](https://q-hwang.github.io)
- [Lily Liu](https://liuxiaoxuanpku.github.io)
- [Yist Yu](https://yistyu.github.io)
- [Seul Lee](https://seullee05.github.io)
- [Zhenqiao Song](https://jocelynsong.github.io)
- [Wenxian Shi](https://wxsh1213.github.io)
- [Justas Dauparas](https://dauparas.github.io)
- [Maria Nattestad](https://marianattestad.com)
- [Cao (Danica) Xiao](https://sites.google.com/view/danicaxiao)
- [Sophia Tang](https://sophtang.github.io)
- [So Yeon Tiffany Min](https://soyeonm.github.io)
- [Carlos Busso](https://carlosbusso.com)
- [Daniel Fried](https://dpfried.github.io)
- [Yonatan Bisk](https://talkingtorobots.com/yonatanbisk.html)
- [Paul Liang](https://pliang279.github.io)
- [Alex L. Zhang](https://alexzhang13.github.io)
- [Sichuan from Total Health Optimization](https://totalhealthoptimization.com)
- [Mengyue Yang](https://ymy4323460.github.io)
- [Shekoofeh Azizi](https://www.shekoofehazizi.com)
- [Pedro O. Pinheiro](https://pedro.opinheiro.com)
- [Ada Fang](https://www.linkedin.com/in/ada--fang)
- [Shanghua Gao](https://shgao.site)
- [Ailing Zhang](https://ailzhang.github.io)
- [Jun Cheng](https://chengjun.me)
- [Qiguang Chen](https://lightchen233.github.io)
- [Yizhong Wang](https://yizhong-wang.com)
- [Yuan Meng](https://www.yuan-meng.com)
- [Soojung Yang](https://sites.google.com/view/soojungy)
- [Elana P. Simon](https://elanapearl.github.io)

### Research & Laboratories

- [Microsoft Health Futures](https://www.microsoft.com/en-us/research/lab/microsoft-health-futures)
- [Baker Lab](https://www.bakerlab.org)
- [Zhang Lab](https://zhanggroup.org)
- [Zhang lab @ MIT](https://www.zlab.bio)
- [The Programmable Biology Group](https://www.chatterjeelab.com)

### Communities

- [Machine Learning Reddit](https://www.reddit.com/r/MachineLearning)
- [NLP Reddit](https://www.reddit.com/r/LanguageTechnology)
- [Statistics Reddit](https://www.reddit.com/r/statistics)
- [Data Science Reddit](https://www.reddit.com/r/datascience)
- [Machine Learning Quora Topic](https://www.quora.com/topic/Machine-Learning)
- [Statistics Quora Topic](https://www.quora.com/topic/Statistics-academic-discipline)
- [Data Science Quora Topic](https://www.quora.com/topic/Data-Science)
- [Lee Lab of AI for bioMedical Sciences](https://suinlee.cs.washington.edu)
- [Lab of big data and predictive analysis in healthcare](https://www.fsp.usp.br/labdaps)
- [Jean Fan lab](https://jef.works)
- [Pranav Rajpurkar](https://pranavrajpurkar.com)
- [The AI Health Podcast](https://twitter.com/AIHealthPodcast)
- [Starkly Speaking](https://portal.valencelabs.com/starklyspeaking)

### Central Resources

- [Armando Hasudungan](https://www.youtube.com/user/armandohasudungan)
- [John Gilmore M.D.](https://www.youtube.com/channel/UCqBho4rDGlST_PY5I2Bh9yQ)
- [Dr. Najeeb Lectures](https://www.youtube.com/channel/UCPHpx55tgrbm8FrYYCflAHw)
- [MedCram - Medical Lectures Explained CLEARLY](https://www.youtube.com/channel/UCG-iSMVtWbbwDDXgXXypARQ)
- [Nabil Ebraheim](https://www.youtube.com/user/nabilebraheim)
- [Strong Medicine](https://www.youtube.com/channel/UCFq5vPnNRNNNysLrktz4aSw)
- [Cancer Research Demystified](https://www.youtube.com/c/CancerResearchDemystified/featured)
- [Cancer.Net](https://www.cancer.net)
- [Books on Computational Molecular Biology](https://mitpress.mit.edu/books/series/computational-molecular-biology)
- [Obenauf Lab](https://www.obenauflab.com)

## License

[MIT](/LICENSE) © [TK](https://iamtk.co)

</samp>


================================================
FILE: a-unified-theory-of-ai-in-biomedicine.md
================================================
<samp>

# A Unified Theory of AI in Biomedicine & Healthcare

## Table of Contents

- [A Unified Theory of AI in Biomedicine \& Healthcare](#a-unified-theory-of-ai-in-biomedicine--healthcare)
  - [Table of Contents](#table-of-contents)
  - [How Cells Work](#how-cells-work)
  - [Central Dogma of Biology](#central-dogma-of-biology)
  - [Proteins](#proteins)
  - [Current challenges in biology](#current-challenges-in-biology)
  - [Multimodality](#multimodality)

## How Cells Work

- Specialization through spatial organization
- DNA: a molecule with a double-helix shape
  - Made of pairs of nucleotides: adenine and thymine (AT) or guanine and cytosine (GC)
  - The nucleotides carry the genetic information
- RNA
  - It has individual strands (compared to the double strands from the DNA)
  - It uses uracil in place of thymine
- Proteins
  - They are biomolecules built as chains of amino acids (20 types of amino acids with different chemical properties)
  - Types of roles: structural, catalytic (speeding up chemical reactions)
  - They can receive signals, transport molecules, break down sugar, replicate DNA
  - Hundreds or thousands of amino acids folded into 3-dimensional structures
  - Protein shapes and functions are not static, which makes it even more complex
  - A protein is produced when a gene is transcribed into messenger RNA (mRNA) in the nucleus; the mRNA is then translated into a chain of amino acids by ribosomes in the cytoplasm
- Each protein functions well in a specific location; each one gets shipped to a specific destination inside the cell
  - Location importance: in amyotrophic lateral sclerosis (ALS), the protein TDP-43 accumulates in the cytoplasm of neurons, even though its normal location is in the nucleus. The protein isn’t mutated, misfolded, or present in abnormal quantities. It's simply in the wrong place. And that alone is enough to disrupt cellular function and trigger disease.
  - A protein has a short sequence of amino acids that guides the cell's transport mechanism to place it in the right location: the nucleus, mitochondria, or plasma membrane

## Central Dogma of Biology

DNA -> RNA -> Protein

## Proteins

- Understand proteins to understand cells, and ultimately to build virtual cells

## Current challenges in biology

- Fundamental Problems in Protein Function and Annotation
  - Misannotation: an estimated 80% of protein families are wrongly annotated
  - Difficulty in Discovering Truly Unknown Functions: ML models fail on anything truly new because the function is not in the training set; they're better used as "labeling" tools
  - The Importance of Context: To define function correctly, one needs both the molecular function and the biological context
- Systemic, Funding, and Infrastructure Challenges
  - Absence of Data Publication Standards
  - Lack of Incentives for Data Curation
  - Critical Underfunding of Databases
- Challenges in Applying AI to Biology
  - Neglect of Foundational Data Work: doing the essential, foundational work of ensuring datasets are thoughtfully constructed and splits (training/test) are well-designed
  - Biological Impossibility in Predictions: Applying AI without domain expertise can lead to biologically impossible or thermodynamically impossible conclusions
  - Data Leakage: e.g. fail to remove proteins of unknown function from the training set, allowing information to leak between the training and test data
- Help biologists and scientists to design better experiments
  - The model doesn't need to be super accurate, but it needs to help lab people design better experiments
  - e.g. using ChatGPT to ask research questions like "should I look at this set of proteins?", "what tissue should I look at?", or "how should I design my experiment?"
- Vision models vs human eyes for microscopy images of cells
  - Microscopy images are very rich in information
  - Vision models can interpret more information than human eyes

## Multimodality

- Multimodality in Oncology
  - Cancer multiomics
  - Histopathology
  - Histology
  - Clinical records
  - Imaging
- Clinical applications
  - Prevention and early detection
  - Screening and risk stratification
  - Optimizing trial efficiency
  - Diagnosis in pathology and radiology
  - Prognosis and outcome prediction
  - Precision oncology
  - Drug development
  - Diagnostics and treatment improvements leading to better health-economic outcomes: enhanced health results coupled with reduced costs

</samp>


================================================
FILE: a-unified-theory-of-ml-ai.md
================================================
<samp>

# A Unified Theory of ML/AI

## Table of Contents

- [A Unified Theory of ML/AI](#a-unified-theory-of-mlai)
  - [Table of Contents](#table-of-contents)
  - [ML Engineering \& ML Lifecycle](#ml-engineering--ml-lifecycle)
    - [Scoping: Look at the big picture](#scoping-look-at-the-big-picture)
    - [Data](#data)
    - [Modeling (Model Training \& Machine Learning Models)](#modeling-model-training--machine-learning-models)
    - [Deployment](#deployment)
  - [Pre-processing](#pre-processing)
    - [Understanding the Domain/Data](#understanding-the-domaindata)
    - [Data Engineering](#data-engineering)
    - [Handling Missing Data](#handling-missing-data)
    - [Data Cleaning](#data-cleaning)
    - [Scaling/Normalization](#scalingnormalization)
    - [Data Leakage](#data-leakage)
    - [Encoding Categorical Variables](#encoding-categorical-variables)
    - [Splitting Data \& Cross Validation](#splitting-data--cross-validation)
    - [Handling imbalanced datasets](#handling-imbalanced-datasets)
    - [PCA](#pca)
  - [Model Development \& Training](#model-development--training)
    - [Baseline](#baseline)
    - [Model Selection](#model-selection)
    - [Model Performance](#model-performance)
    - [Objective Functions: Loss functions](#objective-functions-loss-functions)
      - [Mean squared error (MSE)](#mean-squared-error-mse)
      - [Mean Absolute Error (MAE)](#mean-absolute-error-mae)
      - [Root Mean Squared Error (RMSE)](#root-mean-squared-error-rmse)
      - [Binary Cross Entropy loss](#binary-cross-entropy-loss)
      - [Cross Entropy loss](#cross-entropy-loss)
      - [Focal Loss](#focal-loss)
    - [Metrics](#metrics)
      - [Accuracy](#accuracy)
      - [F1](#f1)
      - [Precision](#precision)
      - [Recall](#recall)
      - [ROC](#roc)
      - [R²](#r)
      - [Youden Index](#youden-index)
      - [DCA](#dca)
      - [Log-likelihood](#log-likelihood)
    - [Experiment tracking](#experiment-tracking)
    - [Model Debugging](#model-debugging)
  - [Machine Learning Models](#machine-learning-models)
    - [Linear Regression](#linear-regression)
    - [Logistic Regression](#logistic-regression)
    - [Multiple Logistic Regression](#multiple-logistic-regression)
    - [Support Vector Machines](#support-vector-machines)
    - [Tree-Based Models](#tree-based-models)
    - [NLP](#nlp)
    - [Neural Networks](#neural-networks)
    - [CNN](#cnn)
    - [Transfer Learning](#transfer-learning)
  - [Mathematics](#mathematics)
    - [Linear Algebra](#linear-algebra)
      - [Importance of linear dependence and independence: Linear Algebra](#importance-of-linear-dependence-and-independence-linear-algebra)
    - [Statistics](#statistics)

## ML Engineering & ML Lifecycle

### Scoping: Look at the big picture

- Frame the problem
  - Define which type of problem to work on
  - An ML problem is defined by inputs, outputs, and the objective function that guides the learning process
  - Framing a problem
    - Problem: Use ML to speed up your customer service support
    - Bottleneck: routing customer requests to the right department among four departments: accounting, inventory, HR (human resources), and IT. 
    - Framing it as an ML problem: developing an ML model to predict which of these four departments a request should go to — a classification problem
      - The input is the customer request
      - The output is the department the request should go to
      - The objective function is to minimize the difference between the predicted department and the actual department
  - When framing a problem, think about how the data changes. e.g. predicting which app the user will open next: multiclass classification is the first framing that comes to mind, but every time a new app is added you have to retrain the model. If you instead frame it as a regression problem that scores one app at a time (input: the user's, environment's, and app's features; output: how likely the user is to open that app), a newly added app is just another input, and the model keeps working without retraining
- Learn: How value will be created solving a given problem
- Push back a bit
  - Is it worth building an ML model to solve this problem?
  - Is it easy for a human to do?
  - How much data do we have, and is it enough?
- Decide on key metrics: accuracy, latency, throughput
  - Relate it to the business: how do these metrics translate to business value? What does it mean to improve a given metric for the business?
- Characteristics of ML systems
  - Reliability: perform the correct function; correctness both in terms of the software and in terms of the predictions
  - Scalability: The system can scale while the ML system grows
    - Grows in complexity: from logistic regression (1GB of RAM) to a 100-million-parameter neural network (16GB of RAM) for prediction
    - Grows in traffic volume: 10,000 prediction requests daily -> 10 million
  - Maintainability: easy to maintain the system and enable other people to contribute to the repository
    - Set up infrastructure
    - Code documentation
    - Code versioning
    - Reproducible models
  - Adaptability: the system should have some capacity for both discovering aspects for performance improvement and allowing updates without service interruption
    - Shifting data distributions
    - Business requirements
- Estimate resources and timeline
- Decoupling objectives
  - When minimizing multiple objectives, you need to make the model optimize for different scores. e.g. `loss_quality`: to rank posts by quality; `engagement_loss`: to rank posts by engagement. If you combine both losses into one `loss = ɑ quality_loss + β engagement_loss`, every time you need to tune the hyperparameters (ɑ, β), you need to retrain the model
  - Another approach is to train two different models, each optimizing one loss. So you have two models:
    - `quality_model`: Minimizes `quality_loss` and outputs the predicted quality of each post
    - `engagement_model`: Minimizes `engagement_loss` and outputs the predicted number of clicks of each post
    - In general, when there are multiple objectives, it’s a good idea to decouple them first because it makes model development and maintenance easier. First, it’s easier to tweak your system without retraining models, as previously explained. Second, it’s easier for maintenance since different objectives might need different maintenance schedules.
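
A minimal sketch (PyTorch, synthetic data; names like `rank_score` are illustrative, not from the source) contrasting the two approaches: the coupled version bakes ɑ and β into the loss, so retuning them means retraining, while the decoupled version only combines scores at ranking time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

features = torch.randn(100, 8)   # 100 posts, 8 features (synthetic)
quality = torch.rand(100, 1)     # hypothetical quality labels
engagement = torch.rand(100, 1)  # hypothetical engagement labels

# Coupled: one model, one blended loss; changing ɑ/β means retraining
coupled = nn.Linear(8, 1)
alpha, beta = 0.7, 0.3
pred = coupled(features)
loss = alpha * F.mse_loss(pred, quality) + beta * F.mse_loss(pred, engagement)

# Decoupled: two models, each minimizing its own loss; ɑ/β are applied
# only at ranking time, so tuning them requires no retraining
quality_model = nn.Linear(8, 1)
engagement_model = nn.Linear(8, 1)
quality_loss = F.mse_loss(quality_model(features), quality)
engagement_loss = F.mse_loss(engagement_model(features), engagement)

def rank_score(x, alpha=0.7, beta=0.3):
    # blend the two predictions when ranking posts
    return alpha * quality_model(x) + beta * engagement_model(x)
```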

### [Data](#pre-processing)

- Define data: is the data labeled consistently? How should data normalization be performed?
- Establish a baseline
- Label and organize data
- [Visualizing data](introduction/data/visualizing-data.ipynb)
  - No dimensions: a scalar value (a single point) — 1 ([])
  - 1D: a vector (a list of items) — [1, 2, 3, 4] ([4])
  - 2D: a matrix (a flat grid; e.g. a grayscale image, 28x28 pixels) — [[2, 3], [5, 6]] ([2, 2])
  - 3D: a cube (a stack of flat grids; e.g. a color image) — ([3, 3, 3])
  - 4D: a list of cubes (e.g. batches of color images) — ([2, 3, 3, 3])
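
A quick sanity check of these shapes, as a minimal NumPy sketch (PyTorch tensors report `ndim` and `shape` the same way):

```python
import numpy as np

scalar = np.array(1)                 # 0D: shape ()
vector = np.array([1, 2, 3, 4])      # 1D: shape (4,)
matrix = np.array([[2, 3], [5, 6]])  # 2D: shape (2, 2), e.g. a grayscale image
cube = np.zeros((3, 3, 3))           # 3D: e.g. a color image (channels, height, width)
batch = np.zeros((2, 3, 3, 3))       # 4D: a batch of 2 color images

for tensor in (scalar, vector, matrix, cube, batch):
    print(tensor.ndim, tensor.shape)
```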

### Modeling ([Model Training](#model-training) & [Machine Learning Models](#machine-learning-models))
  
- Code
- Optimizing the hyperparameters and the data: high performing model

### Deployment

- Deploy in production: Prediction Server API responding the prediction output
  - Common deployments
    - New product/feature
    - Automate with manual task
    - Replace previous ML system
- Deployment patterns: enables monitoring and rollback
  - Canary release
  - Blue green deployment
- Monitor & maintain the system
  - Brainstorm all the things that could go wrong
    - Software metrics: memory, compute, latency, throughput, server load
    - Input metrics: avg input length, avg input volume, num of missing values (fraction of missing input values), avg image brightness
    - Output metrics: fraction of non-null outputs, search redos
  - System failure
    - Operational failure: software metrics (performance, latency, error rate)
    - ML performance metrics: accuracy, precision
- Concept and data drift: how has the (test) data changed?
  - Concept drift occurs when the relationship between the input data (x) and the target variable (y) changes over time.
    - e.g. when the price of a house changes over time due to factors like inflation or a change in the market, even if the size of the house remains the same
  - Data drift occurs when the distribution of the input data (x) changes over time, while the relationship between x and y remains the same.
    - e.g. when the input data itself changes, such as people building larger or smaller houses over time, which changes the distribution of house sizes in the data
- Software engineering issues/checklist
  - Should it run in real time or in batch?
  - Does it run in the cloud or in the browser/edge?
  - Compute resources: CPU, GPU, memory
  - Latency, throughput (QPS - queries per second) requirements
  - Logging: for analysis and review
  - Security and privacy
- Experiment Tracking
  - What to track?
    - Algorithm/code versioning
    - Dataset used
    - Hyperparameters
    - Results
  - Tracking tools
    - Text files
    - Spreadsheets
    - Experiment tracking systems
  - Desired features
    - Information needed to replicate results
    - Experiment results (metrics, analysis)
    - Resource monitoring, visualization, model error analysis
  - A/B testing
    - Control (baseline) vs Treatment groups
  - Hypothesis testing
    - Significance test to assess whether random chance is a reasonable explanation for the observed difference between groups A and B
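
A minimal sketch of such a significance test, assuming a per-request metric was logged for the control and treatment groups (the data below is synthetic): SciPy's two-sample t-test returns a p-value for whether chance alone is a reasonable explanation for the observed difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=0.10, scale=0.02, size=1_000)    # baseline model's metric
treatment = rng.normal(loc=0.11, scale=0.02, size=1_000)  # new model's metric

# Welch's t-test: does the difference in means exceed what chance would explain?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3g}")  # small p => unlikely to be chance
```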

## Pre-processing

- **Understanding the Data**: Graph the data, distribution, domain knowledge
- **Data Engineering**
- **Handling Missing Data**: Filling missing values (e.g., using mean, median, mode, or interpolation).
- **Data Cleaning**: Removing duplicates, fixing incorrect labels, correcting inconsistencies.
- **Scaling/Normalization**: Standardizing or normalizing numerical features to ensure consistency.
- **Data Leakage**: Separate training, validation, and test sets before processing data
- **Encoding Categorical Variables**: Converting categorical data into numerical form (e.g., one-hot encoding, label encoding).
- **Handling Outliers**: Removing or transforming extreme values that may distort the model.
- **Splitting Data & Cross Validation**: Dividing data into training, validation, and test sets.
- **Handling imbalanced datasets**: Using transformations and other techniques.

### Understanding the Domain/Data

- Graph the data to analyse the distribution: find out whether the dataset is asymmetrical and whether it will introduce bias
- Domain knowledge about the data: understand its features, default values, missing values, the importance or unimportance of each feature
- Correlations: multicollinearity (independent variables in a regression model are highly correlated)
- Mean, Central Limit Theorem, Confidence interval (standard error)
- Visualize the data (see the sketch after this list for a couple of ways to plot and summarize it)
- Feature importance: built-in feature importance functions by XGBoost, SHAP (SHapley Additive exPlanations), InterpretML
- Understand the domain
  - Risks and Failures Due to Lack of Domain Knowledge
    - Conceptual mistakes (e.g. fail to look at the biological context and literature)
    - Impossible predictions (e.g. AI solutions that result in impossible biological predictions)
    - Ignore what are the current challenges and what could unlock impactful results
    - Misunderstanding of foundational concepts
    - Difficulty catching errors
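
A minimal EDA sketch along these lines, using a synthetic stand-in for a real tabular dataset (all column names are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sqft": rng.normal(1500, 400, 500),
    "rooms": rng.integers(1, 6, 500),
})
df["price"] = 200 * df["sqft"] + 10_000 * df["rooms"] + rng.normal(0, 20_000, 500)

print(df.describe())       # mean, spread, min/max of each feature
print(df["price"].skew())  # asymmetry of the target distribution
print(df.corr())           # very high predictor-predictor correlations hint at multicollinearity
df["price"].hist(bins=50)  # graph the distribution (requires matplotlib)
```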

### Data Engineering

- Data formats: how to store data formats
  - How do I store multimodal data, e.g., a sample that might contain both images and texts?
  - Where do I store my data so that it’s cheap and still fast to access?
  - How do I store complex models so that they can be loaded and run correctly on different hardware?
- Structured vs Unstructured data
  - Structured: stored in data warehouses, follows a schema
  - Unstructured: stored in data lakes (raw data before it's transformed), more flexible, doesn't follow a schema
- Data processing
  - Transaction processing uses databases that satisfy the low latency, high availability requirements
  - ACID (atomicity, consistency, isolation, durability)
    - **Atomicity**: To guarantee that all the steps in a transaction are completed successfully as a group. If any step in the transaction fails, all other steps must fail also. For example, if a user’s payment fails, you don’t want to still assign a driver to that user.
    - **Consistency**: To guarantee that all the transactions coming through must follow predefined rules. For example, a transaction must be made by a valid user. 
    - **Isolation**: To guarantee that two transactions happen at the same time as if they were isolated. Two users accessing the same data won’t change it at the same time. For example, you don’t want two users to book the same driver at the same time.
    - **Durability**: To guarantee that once a transaction has been committed, it will remain committed even in the case of a system failure. For example, after you’ve ordered a ride and your phone dies, you still want your ride to come.
- Availability
  - Online: data is immediately available for input/output
  - Nearline: short for near-online; data is not immediately available but can be made online quickly without human intervention
  - Offline: data is not immediately available and requires some human intervention to become online
- ETL: Extract-Transform-Load
  - Extract from data sources
  - Transform: join multiple sources, clean them, standardize values, making operations (transposing, deduplicating, sorting, aggregating, deriving new features)
  - Load: how and how often to load your transformed data into the target destination
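
A minimal ETL sketch with pandas; the file names and columns are hypothetical placeholders for real data sources:

```python
import pandas as pd

# Extract: pull raw data from the sources
orders = pd.read_csv("orders.csv")  # hypothetical source files
users = pd.read_csv("users.csv")

# Transform: join sources, deduplicate, standardize values, derive new features
df = orders.merge(users, on="user_id", how="left")
df = df.drop_duplicates()
df["country"] = df["country"].str.upper()
df["order_value"] = df["quantity"] * df["unit_price"]

# Load: write the transformed data to the target destination
df.to_csv("warehouse/orders_enriched.csv", index=False)
```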

### Handling Missing Data

- Dropping columns with missing values, or filling in values by inferring them from the dataset or using default values for a given feature
- Dropping the rows that have the missing value for a given feature
- Use `SimpleImputer` to fill missing values with a statistic of the column (mean, median, or mode; see the sketch after this list)
- **Insight**: understand the data so you can reason what's the best decision — using the mean value, 0 or dropping the column
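
A minimal imputation sketch with scikit-learn's `SimpleImputer` (the tiny matrix is synthetic):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # also: "median", "most_frequent"
print(imputer.fit_transform(X))           # NaNs replaced by each column's mean
```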

### Data Cleaning

- Removing duplicates
- Fixing incorrect labels / label ambiguity
  - Many ways to label object recognition images
  - Many ways to transcribe audio
  - Standardize labels (reach agreement on how to label the data), merge classes
- Major types of data problems
  - Small data (<= 10,000) + Unstructured: manufacturing visual inspection from 100 training examples
  - Small data (<= 10,000) + Structured: housing price based on square footage, etc. from 50 training examples
  - Big data (> 10,000) + Unstructured: speech recognition from 50 million training examples
  - Big data (> 10,000) + Structured: online shopping recommendations from 1 million users
- Handling types of data problems:
  - Unstructured data: humans can label, data augmentation
  - Structured data: harder to obtain more data
  - Small data (<= 10,000): clean labels are critical, can manually go through the dataset and fix labels
  - Big data (> 10,000): emphasis on data process - investigate and improve how the data is collected, labeled (e.g. labeling instructions)
- Correcting inconsistencies
- Formatting the values (e.g. using float when the data is object)
- It's important to do data imputation and data cleaning after the train-test split
  - Split the data
  - Clean and impute the training set
  - Apply the same imputation rules to the test set
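
A minimal sketch of this split-first workflow with synthetic data: the imputer learns its statistics from the training set only and then reuses them on the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.random.default_rng(0).normal(size=(100, 3))
X[::7, 1] = np.nan  # introduce some missing values

# Split first...
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# ...then impute: fit on the training set, apply the same rule to the test set
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)  # medians learned from training data only
X_test = imputer.transform(X_test)        # training medians reused, no refitting
```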

### Scaling/Normalization

- Transformation (via `FunctionTransformer(np.log1p)` for example) is done to adjust the distribution of the dataset
  - e.g. when there are more houses with low prices, it will be difficult for the model to learn from the high-priced houses (low volume) and to predict them well on the test data
- Standardizing or normalizing numerical features to ensure consistency
- Use separate scalers for X and Y
  - X and Y have different distributions (different scales and meanings)
  - You can scale Y if it's a regression problem. Don't scale if it's a classification problem, since it's categorical
  - Tree-based models like XGBoost, Decision Trees, or Random Forests usually don't need scaling because these models are not sensitive to feature scaling
- Use only the training set to calculate the mean and variance, normalize the training set, and then at test time, use that same (training) mean and variance statistics to normalize the test set
  - If using the whole dataset to figure out the feature mean and variance, you're using knowledge about the distribution of the test set to set the scale of the training set - 'leaking' information — [detailed explanation](https://datascience.stackexchange.com/questions/39932/feature-scaling-both-training-and-test-data?newreg=64c8fc13490744028eb7414da9b6693a)
  - Don't use statistics and knowledge from test data: this is why the scaling of test features should be done with knowledge from the training set
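
A minimal sketch with `StandardScaler` and synthetic data: the mean and variance are computed from the training set only, and the same statistics then normalize the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(loc=10, scale=3, size=(200, 4))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training data only
X_test_scaled = scaler.transform(X_test)        # no test-set statistics leak in
```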

### Data Leakage

- Data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction
  - This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production
  - There are two main types of leakage: target leakage and train-test contamination.
- **Target leakage**: occurs when your predictors include data that will not be available at the time you make predictions.
  - e.g. after having pneumonia, a patient usually takes antibiotic medicines, so a "took_antibiotic_medicine" information has a strong relationship with "got_pneumonia". The value of "took_antibiotic_medicine" is usually changed after the value for "got_pneumonia" is determined
  - In this case, this feature (or any "variable updated (or created) after the target value") should be excluded from the training and validation set
- Splitting time-correlated data randomly instead of by time leads to data leakage because a later point in time can influence an earlier point
  - For time-correlated data, don't split randomly, do Time-Based Splitting
- **Train-Test Contamination**: when you don't distinguish training data from validation data
  - Validation is meant to be a measure of how the model does on data that it hasn't considered before
  - Running preprocessing (e.g. Filling in missing data) before splitting data into train and validation would lead to the model getting good validation scores, giving you great confidence in it, but perform poorly when you deploy it to make decisions
  - The idea is to exclude the validation data from any type of fitting, including the fitting of preprocessing steps
  - Scikit-learn's `Pipeline` helps prevent this kind of leakage (see the sketch after this list)
- Divide training and test into separate datasets before scaling the features. If you scale on the whole dataset first:
  - The mean and standard deviation (global statistics from training data) used for scaling will be computed from the entire dataset.
  - This means that information from the test set is indirectly influencing the training data.
  - Your model will learn from statistics that it would not have access to in a real-world scenario.
  - This can lead to overfitting and poor generalization.
  - Split first, then scale
- Detecting Data Leakage
  - Measure the predictive power of each feature or a set of features with respect to the target variable (label)
  - If a feature has unusually high correlation, investigate how this feature is generated and whether the correlation makes sense.
  - Measure how important a feature or a set of features is to your model. Investigate why that feature is so important
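
A minimal sketch of using `Pipeline` to avoid train-test contamination, assuming a generic `X`/`y` and a placeholder model; each preprocessing step is re-fit on the training folds only:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imputer and scaler are fit inside each training fold,
# so the validation fold never leaks into their statistics
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
```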

### Encoding Categorical Variables

- Drop Categorical Variables: This approach will only work well if the columns did not contain useful information.
  - Get all the data without the categorical values: `X.select_dtypes(exclude=['object'])`
- Ordinal Encoding: assigns each unique value to a different integer
  - e.g. `OrdinalEncoder`
- One-Hot Encoding: creates one column for each categorical variable and assigns the value 1 to the column that the example holds (one-hot) and 0 to the other columns
  - e.g. `OneHotEncoder`
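
A minimal sketch of both encoders, assuming a pandas DataFrame `X` with a hypothetical `color` column (`sparse_output` requires scikit-learn >= 1.2):

```python
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Ordinal: each category becomes an integer (e.g. red -> 0, green -> 1)
ordinal = OrdinalEncoder()
X_ordinal = ordinal.fit_transform(X[['color']])

# One-hot: one column per category, 1 for the held category, 0 otherwise;
# handle_unknown='ignore' encodes unseen categories as all zeros
onehot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_onehot = onehot.fit_transform(X[['color']])
```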

### Splitting Data & Cross Validation

- Create the test set as early as possible, even before cleaning the data
- Be careful to not introduce any data leakage
- When splitting the data into training and validation, the model can perform well on the 20% validation data and badly on the 80% (or vice-versa)
  - In larger validation sets, there is less randomness ("noise")
- Cross validation makes you do experiments in different folds
  - e.g. Divide training and validation into 80-20
    - Experiment 1: first 20% fold will be the validation set and the other 80% will be the training set
    - Experiment 2: second 20% fold will be the validation set and the other 80% will be the training set
    - The same for experiments 3, 4, and 5, until every fold has served as the validation set
  - When should you use each approach?
    - For small datasets, where extra computational burden isn't a big deal, you should run cross-validation.
    - For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

![](images/cross-validation.png)

- Use `cross_val_score` from `model_selection`:
  - estimator: model or pipeline that implements the `fit` method
  - input `X` and `y`
  - cv: number of folds in cross validation
  - [scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter): model evaluation rules, e.g. mae, accuracy, recall, mse, etc
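
A minimal sketch, assuming a generic `X`/`y` and a placeholder model:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

model = RandomForestRegressor(n_estimators=100, random_state=0)

# scikit-learn scorers follow "higher is better", so MAE is
# exposed as its negated version
scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print(f'Average MAE: {scores.mean():.3f}')
```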

### Handling imbalanced datasets

- Data augmentation: generating more examples for the ML model to train on (e.g. rotating images)
- Resampling
  - Oversampling: increase the number of data points of a minority class via synthetic generation
  - Undersampling: reduces examples from the majority class to balance the number of data points
- Ensemble methods: combine multiple individual models to produce a single, more robust, and often more accurate predictive model
- Stratification: split so that each subset keeps the original class proportions (e.g. the same percentage of class 1, class 2, etc in every split), even if the dataset is imbalanced
- Choosing better metrics: measure precision, recall, and F1, and inspect ROC and AUC curves
  - Note: F1, recall, and the ROC curve focus only on the positive class and don’t show how well your model does on the negative class
  - Precision-Recall Curve: gives a more informative picture of an algorithm’s performance on tasks with heavy class imbalance.
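
A minimal sketch of stratified splitting plus class weighting, assuming a generic binary `X`/`y`:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# stratify=y keeps the original class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight='balanced' penalizes mistakes on the minority class more
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
```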

### PCA

- Use PCA to reduce dimensionality
  - Always scale the predictors before applying PCA
  - PCA relies on the variance of the data to identify the principal components. If your predictors are on different scales, PCA may disproportionately weigh the features with larger scales
- [ ] What's a covariance matrix?
  - A covariance matrix is a square matrix that contains the covariances between pairs of variables in a dataset.
  - Covariance measures the degree to which two variables change together
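
A minimal sketch of scaling before PCA, assuming a generic feature matrix `X`:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale first so features with larger scales don't dominate the components
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
X_reduced = pca_pipeline.fit_transform(X)
```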

## Model Development & Training

- Have a data-centric AI development: improve the data that goes into the model rather than only tweaking the model to fit the data
- Challenges
  - Do well in the training set
  - Do well in the validation set
  - Do well in the test set
  - Do well on business metrics
- A model can fit the training data well but fail to generalize to new examples
  - The cost is low for the training set because it fits well, but the cost for the test set will be high because it doesn't generalize well
  - Split the dataset into two parts
    - 70%: training set - fit the data
    - 30%: test set - test the model to this data

### Baseline

- Scikit-learn has a DummyRegressor/DummyClassifier
  - The dummy model sets a baseline for your performance metrics
  - Starting with a dummy model also makes it easier to diagnose any bugs in your data preparation code, because the model isn’t adding much complexity
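
A minimal sketch, assuming an already split classification dataset:

```python
from sklearn.dummy import DummyClassifier

# Always predicts the most frequent class; any real model should beat this
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))
```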

### Model Selection

Which model is better? It depends on the problem at hand. If the relationship between the features and the response is well approximated by a linear model, then an approach such as linear regression will likely work well, and will outperform a method such as a regression tree that does not exploit this linear structure. If instead there is a highly non-linear and complex relationship between the features and the response, then decision trees may outperform classical approaches.

### Model Performance

- Improving model performance and generalization
  - Regularization: L1, L2 (`weight_decay` can be used for Adam)
  - Dropout
  - More data
  - Data augmentation
  - Early stopping
  - Learning rate decay
  - Handle imbalanced datasets
- Prefer choosing models that have good cross-validation and test accuracy
  - The test cost estimates how well the model generalizes to new data (compared to the training cost)
  - training/cross-validation/test
    - cross-validation is also called dev or validation set
    - It improves the robustness and reliability of your model evaluation and hyperparameter tuning process
    - Cross-validation involves splitting your training data into multiple subsets (folds). The model is trained on a subset of these folds and then evaluated on the remaining fold. This process is repeated multiple times, with each fold serving as the validation set once. This gives you multiple performance estimates on different "held-out" portions of your training data.
    - By averaging the performance across all the validation folds, you get a more stable and less biased estimate of how well your model is likely to generalize to unseen data compared to relying on a single test set evaluation during development.
  - Good Cross-Validation Accuracy: a good cross-validation accuracy indicates good stability and generalization across different subsets of data
  - Good Test Accuracy: the model generalizes well on unseen data
- Bias/Variance tradeoff
  - High bias: underfit
    - Simple model
    - If the cost of the training set is high, the costs of cross validation and test sets will also be high
    - It doesn't matter if we collect more data, the model is too simple and won't learn more
  - High variance: overfitting
    - Complex model
    - High variability of the model
    - The training cost will be low and the cross validation and test costs will be high
    - Increasing the training set size can help close the gap between training and cross validation error
  - Balanced bias/variance: optimal
    - The costs of training, cross validation, and test will be low: it performs well
  - Model complexity vs Cost 
    - Training cost: as the degree of the polynomial (or the model complexity) increases, the cost decreases
    - Cross validation cost: as model complexity increases, the cost decreases up to the point where the model starts overfitting, and then it starts to increase again
  - Regularization influence in bias/variance
    - Regularization adds a penalty to the cost function that discourages the model from learning overly complex patterns and prevent overfitting
    - As the lambda increases, the bias gets higher
    - As the lambda decreases, the variance gets higher
    - L1 (Lasso): shrinks the model parameters toward zero
    - L2 (Ridge Regression): adds a penalty term to the objective function (loss function) with the intention of keeping the model parameters smaller and preventing overfitting
    - Elastic net: a combination of L1 and L2 techniques
- Establishing a baseline level of performance
  - Human error (or competing algorithm or guess based on prior experience) as the baseline vs Training Error vs Cross validation error: analyse gaps between these errors
  - High variance: 0.2% gap between baseline and training / 4% gap between training and cross-validation (overfitting to the training data)
    - baseline: 10.6%
    - training: 10.8%
    - cross-validation: 14.8%
  - High bias: 4.4% gap between baseline and training (not performing well) / 0.5% gap between training and cross-validation (performing similarly in training and cross validation)
    - baseline: 10.6%
    - training: 15%
    - cross-validation: 15.5%
- Debugging a learning algorithm
  - Get more training examples -> fixes high variance
  - Try smaller set of features -> fixes high variance
  - Try getting additional features -> fixes high bias
  - Try adding polynomial features -> fixes high bias
  - Try decreasing the regularization term lambda -> fixes high bias
  - Try increasing the regularization term lambda -> fixes high variance
- In classification models, the way to measure performance is based on accuracy, precision, recall (sensitivity), specificity, and F1 score (see the sketch after this list)
  - **Precision**: Out of all the instances that the model predicted as positive, how many were actually positive?
    - Precision = TP / (TP + FP)
      - TP = True positive
      - FP = False positive
    - **High Precision**: Indicates that when the model predicts a positive class, it is often correct. This is crucial in applications where the cost of a false positive is high.
    - **Low Precision**: Suggests that the model frequently predicts positive incorrectly, leading to many false alarms.
    - e.g. Cancer tumor is malignant
      - High precision: when the model predicts that a cancer tumor is malignant, it's often correct. There's a high chance the person has a malignant tumor
      - Low precision: the model predicting that a person has malignant cancer is probably incorrect, leading to false alarms, and in this particular case, anxiety
  - **Recall (Sensitivity)**: Measures the proportion of actual positives that were correctly identified.
    - Recall = TP / (TP + FN)
    - True positive: correctly identified as positive
    - False negative: incorrectly identified as negative (it's actually positive)
  - Precision-Recall tradeoff
    - The higher the threshold, the higher the precision and the lower the recall
      - Predict Y=1 only if very confident. e.g. a very rare disease
    - The lower the threshold, the higher the recall and the lower the precision
      - e.g. to avoid missing too many cases of a rare disease
    - We need to specify the threshold point
  - **F1 Score**: The "harmonic mean" of precision and recall, providing a balance between the two.
    - F1 Score = 2 x (Precision x Recall / (Precision + Recall))
  - Importance in applications: in medical diagnosis, for diseases where a false positive can cause unnecessary stress or treatment, high precision is essential.
- Fine-Tune Model
  - Grid Search
  - Randomized Search
  - Ensemble Methods
  - Analyzing the Best Models and Their Errors
    - Visualizing errors (poor predictions)
  - Evaluate the model on the test set
    - Metrics should be similar to your validation numbers, or else you may have some overfitting going on
- Learning curves: analyze loss and accuracy metrics with the increase or decrease of training data. Use [LearningCurveDisplay](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html) to plot a learning curve for different models and understand how they handle different amounts of training data.
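
A minimal NumPy sketch of the precision, recall, and F1 formulas above, assuming `y_true` and `y_pred` are binary 0/1 arrays:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    # Confusion-matrix counts for the positive class
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```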

### Objective Functions: Loss functions

MSE, RMSE, MAE (mean absolute error) for regression, logistic loss (also log loss) for binary classification, and cross entropy for multiclass classification.

- When choosing the loss function, ask yourself "How much do I fear a single massive mistake?"
  - When outliers are fatal, prefer MSE
    - Squared error penalizes big errors heavily
    - The penalty grows quadratically: the model works to ensure no single error gets too large
    - Especially a good fit for safety use cases. e.g. medicine, aircraft, structural engineering
  - When consistency and efficiency are important, prefer MAE
    - MAE treats all errors linearly

#### Mean squared error (MSE)

```python
def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)
```

#### Mean Absolute Error (MAE)

TODO
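
A minimal NumPy sketch in the meantime (MAE averages the absolute differences, treating all errors linearly):

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Linear penalty: outliers influence the result less than in MSE
    return np.mean(np.abs(y_true - y_pred))
```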

#### Root Mean Squared Error (RMSE)

TODO
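
A minimal NumPy sketch in the meantime (RMSE is the square root of MSE, bringing the error back to the units of the target):

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    # Square root of the mean squared error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```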

#### Binary Cross Entropy loss

- Used for binary classification problems
- TODO: add more details
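
A minimal NumPy sketch, assuming `y_true` is a 0/1 array and `y_prob` is the predicted probability of the positive class:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Clip to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
```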

#### Cross Entropy loss

- Used for multiclass classification problems
- TODO: add more details
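
A minimal NumPy sketch, assuming one-hot targets and a matrix of predicted class probabilities:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    # y_true_onehot, y_prob: shape (n_samples, n_classes)
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=1))
```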

#### Focal Loss

- It is a generalization of a binary cross-entropy loss
- Focal loss down-weights easy examples
- Useful in cases where the data is imbalanced (focus on the hard examples)
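
A minimal NumPy sketch of binary focal loss (gamma=2.0 is the usual default, and the optional alpha class-weighting term is omitted here):

```python
import numpy as np

def focal_loss(y_true, y_prob, gamma=2.0, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1 - eps)
    # p_t: probability the model assigns to the true class
    p_t = np.where(y_true == 1, y_prob, 1 - y_prob)
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))
```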

### Metrics

#### Accuracy

TODO

#### F1

TODO

#### Precision

TODO

#### Recall

TODO

#### ROC

TODO

#### R²

R² (coefficient of determination): measures how well your model explains the variance in the target variable

```python
def r2_score(Y_true, Y_pred):
    residual_sum_of_squares = np.sum((Y_true - Y_pred) ** 2)
    total_sum_of_squares = np.sum((Y_true - np.mean(Y_true)) ** 2)
    return 1 - (residual_sum_of_squares / total_sum_of_squares)
```

#### Youden Index

Evaluation of the overall performance of a diagnostic test or binary classifier.

It represents the maximum potential effectiveness of a biomarker, calculating the point on a receiver operating characteristic (ROC) curve where the sum of sensitivity and specificity is maximized.

```python
import numpy as np
from sklearn.metrics import roc_curve

def sensitivity_specificity_cutoff(y_true, y_score):
    # Youden's J = sensitivity + specificity - 1 = TPR - FPR;
    # pick the threshold that maximizes it
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    idx = np.argmax(tpr - fpr)
    return thresholds[idx]
```

#### DCA

TODO

#### Log-likelihood

TODO

### Experiment tracking

The process of tracking the progress and results of an experiment.

- The loss curve corresponding to the train split and each of the eval splits.
- The model performance metrics that you care about on all nontest splits, such as accuracy, F1, perplexity.
- A log of corresponding samples, predictions, and ground truth labels.
- The speed of your model, evaluated by the number of steps per second or, if your data is text, the number of tokens processed per second.
- System performance metrics such as memory usage and CPU/GPU utilization. They’re important to identify bottlenecks and avoid wasting system resources.
- The values over time of any parameter and hyperparameter whose changes can affect your model’s performance, such as the learning rate if you use a learning rate schedule; gradient norms (both globally and per layer).

### Model Debugging

- Simplify Your Model
  - Simplify your architecture: Complex architectures can make it difficult to reason about what’s going wrong. To help with this:
    - Eliminate unnecessary layers: If a layer doesn’t contribute directly to the input–output mapping—such as layers that simply add model capacity without altering dimensions—it’s best to remove it during debugging.
    - Call basic layers directly: Use layers like nn.Conv and nn.Dense instead of custom blocks, which can obscure bugs and internal behavior.
    - Reduce depth and width: If your model has many layers or units per layer, consider reducing both. A shallower model is easier to debug and understand, especially in the early stages of development.
    - Remove residual connections: These can complicate debugging by introducing dependencies between layers and by masking issues in the layers they connect (like poor initialization or gradient problems).
  - Turn off extras, like normalization and dropout: These add complexity that’s often unnecessary during early debugging. In Flax, you must explicitly manage both model state (e.g., batch norm statistics) and random number generators (RNGs), which can easily lead to subtle bugs.
    - Don’t use batch norm: The complexity of batch norm is threefold. First, it behaves differently during training and inference. Second, it introduces additional state (running mean and variance) that must be updated outside standard gradient updates. Third, it breaks a key assumption: most layers operate independently on each batch element, but batch norm computes statistics across the batch. This makes it incompatible with tools like vmap or sharded training (pmap, pjit) unless you take special care to synchronize statistics across devices.
    - Skip dropout: Dropout introduces stochasticity, making it harder to determine whether poor performance is due to randomness or a deeper issue.
  - Simplify your optimizer: Don’t worry about experimenting with different optimizers or learning rate schedules until the basics are working. Pick a sensible default—like good old Adam with a learning rate of 1e-3, and focus on solving more fundamental issues first.
  - Avoid mixed precision: While lower-precision data types like bfloat16, float16, or TensorFloat32 can improve performance and memory usage (out of scope for this book), they can lead to subtle numerical instability that’s extremely hard to debug. Note that even if you pass in float32 inputs, JAX may default to lower-precision matmuls. To force full float32 precision globally (especially during debugging), add jax.config.update("jax_default_matmul_precision", "float32").
- Simplify and Control Your Environment
  - Sort out determinism and reproducibility: It’s easier to isolate issues if your experiments are reproducible. In addition to turning off stochastic components like dropout, consider:
    - Setting explicit random seeds
    - Turning off dataset shuffling: Don’t shuffle your training data, or shuffle with a fixed random seed, to maintain a consistent order of examples across runs.
    - Keeping the environment constant: Avoid inconsistencies caused by external factors (e.g., use the same hardware, library versions, and configurations).
  - Strip down your training loop: Especially in JAX and Flax, training loops require manual control over RNGs, state, and updates—making them powerful but easy to get wrong.
    - Train on a single batch for just a few steps
    - Disable extras like logging, metrics, or learning rate schedules
    - Use fixed inputs and random seeds: Run your training step on the same batch every time. This eliminates variability due to changing data and makes bugs easier to isolate.
    - Avoid high-level abstractions temporarily: If you’re using tools like TrainState, try replacing them with raw variable updates until the core logic works.
  - Make your code self-contained: Especially when working in interactive environments like Colab or Jupyter notebooks, it’s easy for your environment to get cluttered with old variables or states without you realizing it. Restarting the kernel and organizing your code into self-contained functions can really help with debugging.
  - Turn off JIT compilation (with caution)
  - Train on a single GPU instead of multiple GPUs: If you are using GPUs, start with just one. Multi-GPU training via sharding (e.g., with jax.pjit) introduces a lot of additional complexity. 
  - Use CPU for simpler debugging setups
- Simplify the Data and Problem
  - Visualize individual examples: Actually plot or print raw inputs and labels—not just summaries. You’ll often catch issues like incorrect encodings, off-by-one errors, or mismatched image-label pairs this way.
  - Check class balance: An imbalanced dataset can make the model appear broken when it’s just doing the naive thing (predicting the majority class). Consider subsampling or rebalancing during debugging.
  - Remove data augmentation: Augmentations like cropping, flipping, or adding noise can hide underlying issues or make the task unnecessarily hard. Turn them off until you’re confident the core pipeline works.
  - Reduce the number of classes: Instead of predicting many categories, reframe your task as binary classification to focus on the clearest signal first.
  - Simplify the output space: if your target is complex (e.g., a regression label or structured output), try reducing it to something simpler. For instance, predict a binary class or thresholded version of the label instead. 
  - Make your dataset smaller: Large datasets slow everything down. Use a small, representative subset that captures the key structure.
  - Limit the scope of your data: Use a natural slice of your dataset. For example, restrict to a single species, tissue type, year, or patient group. This cuts variability and helps isolate bugs.
  - Check for label leakage: Especially in biological datasets, label leakage can creep in through metadata like patient ID, batch number, or experiment date.
  - Work with synthetic or simulated data: If feasible, start with synthetic data that mimics key characteristics of your real dataset but is easier to understand and trace.
- Overfit to a single batch (see the sketch after this list): if the model can't drive the loss on one small fixed batch close to zero, something is broken. Common causes:
  - Learning rate issues: The learning rate might be too high (causing divergence) or too low (causing no learning).
  - Frozen parameters or bad gradients: Sometimes parameters are not being updated at all — for example, if they were accidentally excluded from the params dict due to naming mismatches or scoping issues. Also inspect gradients — if they’re all zero or NaN, that’s a clue.
  - Loss function bugs: Make sure you’re using the correct loss function for your task (e.g., cross-entropy for classification) and that it’s behaving numerically as expected. Prefer standard implementations over custom home-made ones, at least during debugging.
  - Model initialization problems: Poor or inconsistent weight initialization can prevent learning, especially in deeper networks. If you’re using custom modules, double-check their initializations.
  - Batch size too small: Very small batches (e.g., 1–2 examples) can lead to noisy gradients and unstable updates. For debugging, use a small but reasonable batch size like 8–32.
  - Silent shape mismatches or broadcasting errors: These won’t always crash your code, but they can silently mess up your loss or gradients. Print tensor shapes and inspect intermediate outputs to confirm everything lines up as expected.
- Log Everything
  - Log training loss and key metrics over time: At a minimum, track loss, accuracy, and any relevant task-specific metrics (like auROC or auPRC). This makes it easier to spot overfitting, instability, or underperformance.
  - Log validation performance at regular intervals: Seeing how your model generalizes during training helps detect overfitting early and can catch bugs where validation performance diverges for no obvious reason.
  - Log inputs, predictions, and errors: Save a few input samples, predicted outputs, and errors at each step (or epoch). This is especially useful for spotting systematic failures (e.g., always misclassifying a certain class).
  - Record configuration and hyperparameters: Save the learning rate, batch size, optimizer type, and model architecture along with each run.
  - Use a structured logger or tracking tool: Tools like TensorBoard, Weights & Biases, or even just structured JSON logs can make it easier to compare runs and understand what changed.
- Common Data Issues
  - Data Leakage
    - Evaluating the model on the training set itself. This sounds slightly silly, but it’s surprisingly common—especially in informal settings like Kaggle notebooks.
    - Evaluating on a validation or test set where some examples overlap with the training set.
    - Leakage through preprocessing, for example, normalizing the full dataset before splitting into train/valid/test sets.
    - Features that leak future information—correlated with the target only because they wouldn’t be available at prediction time.
  - To avoid data leakage:
    - Always ensure that test data is fully isolated and untouched by the training pipeline.
    - When adding new training data partway through a project, check whether it already appears in your validation or test sets.
    - Ask yourself: Would this feature be available at inference time? If not, don’t use it.
    - Use model interpretation tools to see what aspects of your data your model is relying on when making its predictions. Does what you see match your expectations, or is the model picking up on artifacts?
  - Incorrect Data Labels
    - Labels stored separately from inputs: If labels are in a separate file (e.g., a CSV with filenames and classes), they can easily be mismatched or misjoined during preprocessing.
    - Shuffling inputs and labels independently: If you shuffle data and labels separately, they’ll fall out of sync—silently.
    - Silent shape mismatches in datasets
    - Merging tabular datasets incorrectly: Joining datasets without verifying alignment (e.g., via merge in pandas) can mislabel data rows without throwing errors.
    - Data augmentation pipelines modifying labels incorrectly: Augmentation effectively increases your dataset—so it’s also a high-risk area for introducing label corruption.
  - Imbalanced Classes
    - Warning signs:
      - Accuracy is high, but precision or recall on the minority class is poor.
      - Confusion matrix shows the model rarely predicts the minority class.
    - To address this:
      - Use class weighting or focal loss to penalize the model more for mistakes on the minority class. Focal loss down-weights easy examples and focuses learning on hard, misclassified ones—especially useful when the rare class is easily overwhelmed.
      - Resample the data—either oversample the minority class or undersample the majority class—to reduce imbalance. Oversampling is often safer when data is scarce but can lead to overfitting if not done carefully.
      - Use stratified sampling to ensure class balance is preserved across your train, validation, and test splits. This means splitting the data so each subset maintains the original class proportions, avoiding skewed performance estimates.
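
A minimal sketch of the single-batch overfitting check, written in PyTorch for brevity (`model`, `x_batch`, and `y_batch` are placeholders for any classifier and one small fixed batch):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    # The loss should approach zero; if it plateaus, suspect frozen
    # parameters, a wrong loss function, or a learning rate issue
    if step % 20 == 0:
        print(step, loss.item())
```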

## Machine Learning Models

**Supervised Learning**: Labeled data, finding the right answer

- Linear Regression
- Logistic Regression
- Support Vector Machines
- Decision Trees: XGBoost, LightGBM, CatBoost
- Neural Networks

**Unsupervised Learning**: Unlabeled data, finding patterns

- Clustering: k-means
- Dimensionality Reduction: PCA
- Autoencoders

### Linear Regression

- [ ] How a linear regression behaves
  - Illustration of a graph
  - Equation
  - What do we use to estimate the βs?
- [WIP] add infos here: https://github.com/imteekay/linear-regression

### Logistic Regression

- [ ] How a logistic regression behaves
  - Illustration of a graph
  - Equation
  - What do we use to estimate the βs?
- [WIP] add infos here: https://github.com/imteekay/logistic-regression

### Multiple Logistic Regression

- [ ] How a multiple logistic regression behaves
  - Illustration of a graph
  - Equation
  - What do we use to estimate the βs?

### Support Vector Machines

- [ ] Theory on Support Vector Machines

### Tree-Based Models

- Decision Trees
  - Decision 1: decide which feature to use in the root node
  - Decision 2: when to stop splitting: Making the tree smaller, avoid overfitting
    - when a node is 100% one class
    - when splitting a node will result in the tree exceeding a maximum depth
    - when improvements in purity score are below a threshold
    - when a number of examples in a node is below a threshold
  - Measuring purity
    - Purity: Purity in a decision tree refers to the homogeneity of the labels within a node. A node is considered "pure" if all the data points it contains belong to the same class
    - Entropy is a measure of impurity
      - The closer the fraction of positive examples is to 0 or 1, the purer the node, because almost all examples share the same class
      - If the fraction is around 0.5, the impurity is highest because the node has no homogeneity
  - To choose a split (i.e. which feature to use first), we calculate the information gain of each candidate and pick the one with the highest gain, which most increases the purity of the subsets (see the sketch after this list)
  - The whole process
    - Measure the information gain for the root node to choose the feature
      - Split the dataset into two "nodes" (subtrees) based on the feature
      - Calculate the weight for each subtree for the weighted entropy
        - The proportion of the number of examples in that child subset relative to the total number of examples in the parent node
      - Calculate the weighted entropy
      - Calculate the information gain
      - Do this for each feature and choose the feature with the largest information gain
    - Check whether the left subtree meets a stopping criterion
      - If so, stop
      - If not, measure the information gain for this subtree node to choose the feature
    - Check whether the right subtree meets a stopping criterion
      - If so, stop
      - If not, measure the information gain for this subtree node to choose the feature
    - Keep doing that until you reach the stopping criteria
  - Trees are highly sensitive to small changes of the data: not robust
    - Tree Ensemble: a collection of decision trees
- Tree Ensembles
  - Sampling with replacement
    - Sample an example (with features): selecting individual data points (including their features and the target variable) from your dataset
    - Replace: After an example is selected, it is put back into the original dataset. This means that the same example can be selected again in subsequent sampling steps
    - Sample again and keep doing this process: repeat the selection process multiple times, and each time, the original dataset remains unchanged due to the replacement
  - Decision trees work well on tabular (structured) data but are not recommended for unstructured data (images, audio, text)
  - Fast and good interpretability
  - **Bagging**: In bagging, the trees are grown independently on random samples of the observations. Consequently, the trees tend to be quite similar to each other. Thus, bagging can get caught in local optima and can fail to thoroughly explore the model space.
    - Bagging trains multiple models on different subsets of the training data and combines their predictions to make a final prediction.
    - In classification problems, it uses the mode for the most common label
    - In regression problems, it uses the average of all predictions
  - In random forests, the trees are once again grown independently on random samples of the observations. However, each split on each tree is performed using a random subset of the features, thereby decorrelating the trees, and leading to a more thorough exploration of model space relative to bagging.
    - For B (B = number of trees to be generated), use sampling with replacement to create a new subset, and train a decision tree on the new dataset
    - For big Bs, it won't hurt but will have diminishing returns
    - At each split, the tree considers a random subset of k features out of the n total features
      - k = √n is a very common and often effective default value for k
  - **Boosting**: In boosting, we only use the original data, and do not draw any random samples. The trees are grown successively, using a “slow” learning approach: each new tree is fit to the signal that is left over from the earlier trees, and shrunken down before it is used.
    - Boosting trains a series of models where each model tries to correct the mistakes made by the previous model. The final prediction is made by all the models.
    - Similar to random forest, but instead of sampling from all m examples with equal probability, it increases the weight of misclassified examples from previously trained trees and decreases the weight of correctly classified examples
    - Misclassified examples mean the tree algorithm is not doing quite well on these examples, so the model should train more to correctly classify them
    - XGBoost
      - `n_estimators`: number of boosting rounds
      - `early_stopping_rounds`: number of rounds without improvement after which training stops (early_stopping_rounds = 5 is a reasonable value)
      - Good to have high `n_estimators` and use `early_stopping_rounds` to find the optimal time to stop
  - In Bayesian Additive Regression Trees (BART), we once again only make use of the original data, and we grow the trees successively. However, each tree is perturbed in order to avoid local minima and achieve a more thorough exploration of the model space.
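
A minimal NumPy sketch of entropy and information gain for a binary split, assuming 0/1 label arrays:

```python
import numpy as np

def entropy(p1):
    # p1: fraction of positive examples in the node
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(parent_labels, left_labels, right_labels):
    n = len(parent_labels)
    w_left, w_right = len(left_labels) / n, len(right_labels) / n
    # Weighted entropy of the two child nodes
    weighted_entropy = (
        w_left * entropy(np.mean(left_labels))
        + w_right * entropy(np.mean(right_labels))
    )
    return entropy(np.mean(parent_labels)) - weighted_entropy
```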

### NLP

- Handling Data
  - Tokenize the input sentences
  - Clean the tokens: remove punctuation, bad words
  - Add padding: keep input length similar
  - Input Data
    - Handling CSV with `csv.reader`
    - Handling JSON with `json.load`
    - Handling HTML with `BeautifulSoup`

### Neural Networks

- Activation functions
  - Why do we need activation functions?
    - Using a linear activation function or no activation, the model is just a linear regression
    - If using a linear activation function, the forward prop will be a linear combination leading to an output equivalent to a linear regression
  - Argmax: the index of the largest value in a sequence of numbers
  - Softmax (see the sketch after this list)
    - Outputs the probability for each of the N classes, so we can compute the loss for each class
    - The class with the largest probability is the model's prediction
    - The intuition behind the exponentiation: softmax uses exponentiation to compute the probability of each class in a multiclass classification problem
      - Transforms arbitrary real-valued scores into positive values.   
      - Amplifies the differences between scores, emphasizing the most likely class.   
      - Allows for the subsequent normalization step to create a valid probability distribution.
      - Provides mathematical convenience for optimization algorithms like gradient descent.
- Common errors
  - Neural net training is a leaky abstraction: you can't just plug and play. You need to understand how the technology works to make it *magically* work
  - Neural net training fails silently: it works silently, it fails silently
    - Be obsessed with visualizing everything
- Recipe to train Neural Nets
  - Become one with the data: inspect the data, scan through thousands of examples, understand their distribution and look for patterns
    - e.g. duplicate examples, corrupted images / labels, data imbalances and biases
    - Questions that help drive this exploration
      - Are very local features enough or do we need global context?
      - How much variation is there and what form does it take?
      - What variation is spurious and could be preprocessed out?
      - Does spatial position matter or do we want to average pool it out?
      - How much does detail matter and how far could we afford to downsample the images?
      - How noisy are the labels?
  - Set up the end-to-end training/evaluation skeleton: gain trust in its correctness via a series of experiments
    - fix random seed. Always use a fixed random seed to guarantee that when you run the code twice you will get the same outcome. This removes a factor of variation and will help keep you sane.
    - simplify. Make sure to disable any unnecessary fanciness. As an example, definitely turn off any data augmentation at this stage. Data augmentation is a regularization strategy that we may incorporate later, but for now it is just another opportunity to introduce some dumb bug.
    - add significant digits to your eval. When plotting the test loss run the evaluation over the entire (large) test set. Do not just plot test losses over batches and then rely on smoothing them in Tensorboard. We are in pursuit of correctness and are very willing to give up time for staying sane.
    - verify loss @ init. Verify that your loss starts at the correct loss value. E.g. if you initialize your final layer correctly you should measure -log(1/n_classes) on a softmax at initialization. The same default values can be derived for L2 regression, Huber losses, etc.
    - init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias.
    - human baseline. Monitor metrics other than loss that are human interpretable and checkable (e.g. accuracy). Whenever possible evaluate your own (human) accuracy and compare to it. Alternatively, annotate the test data twice and for each example treat one annotation as prediction and the second as ground truth.
    - input-independent baseline. Train an input-independent baseline (e.g. the easiest is to just set all your inputs to zero). This should perform worse than when you actually plug in your data without zeroing it out. Does it? i.e. does your model learn to extract any information out of the input at all?
    - overfit one batch. Overfit a single batch of only a few examples (e.g. as little as two). To do so we increase the capacity of our model (e.g. add layers or filters) and verify that we can reach the lowest achievable loss (e.g. zero). I also like to visualize in the same plot both the label and the prediction and ensure that they end up aligning perfectly once we reach the minimum loss. If they do not, there is a bug somewhere and we cannot continue to the next stage.
    - verify decreasing training loss. At this stage you will hopefully be underfitting on your dataset because you’re working with a toy model. Try to increase its capacity just a bit. Did your training loss go down as it should?
    - visualize just before the net. The unambiguously correct place to visualize your data is immediately before your y_hat = model(x) (or sess.run in tf). That is - you want to visualize exactly what goes into your network, decoding that raw tensor of data and labels into visualizations. This is the only “source of truth”. I can’t count the number of times this has saved me and revealed problems in data preprocessing and augmentation.
    - visualize prediction dynamics. I like to visualize model predictions on a fixed test batch during the course of training. The “dynamics” of how these predictions move will give you incredibly good intuition for how the training progresses. Many times it is possible to feel the network “struggle” to fit your data if it wiggles too much in some way, revealing instabilities. Very low or very high learning rates are also easily noticeable in the amount of jitter.
    - use backprop to chart dependencies. Your deep learning code will often contain complicated, vectorized, and broadcasted operations. A relatively common bug I’ve come across a few times is that people get this wrong (e.g. they use view instead of transpose/permute somewhere) and inadvertently mix information across the batch dimension. It is a depressing fact that your network will typically still train okay because it will learn to ignore data from the other examples. One way to debug this (and other related problems) is to set the loss to be something trivial like the sum of all outputs of example i, run the backward pass all the way to the input, and ensure that you get a non-zero gradient only on the i-th input. The same strategy can be used to e.g. ensure that your autoregressive model at time t only depends on 1..t-1. More generally, gradients give you information about what depends on what in your network, which can be useful for debugging.
    - generalize a special case. This is a bit more of a general coding tip but I’ve often seen people create bugs when they bite off more than they can chew, writing a relatively general functionality from scratch. I like to write a very specific function to what I’m doing right now, get that to work, and then generalize it later making sure that I get the same result. Often this applies to vectorizing code, where I almost always write out the fully loopy version first and only then transform it to vectorized code one loop at a time.
  - Overfit
    - picking the model. To reach a good training loss you’ll want to choose an appropriate architecture for the data. When it comes to choosing this my #1 advice is: Don’t be a hero. I’ve seen a lot of people who are eager to get crazy and creative in stacking up the lego blocks of the neural net toolbox in various exotic architectures that make sense to them. Resist this temptation strongly in the early stages of your project. I always advise people to simply find the most related paper and copy paste their simplest architecture that achieves good performance. E.g. if you are classifying images don’t be a hero and just copy paste a ResNet-50 for your first run. You’re allowed to do something more custom later and beat this.
    - adam is safe. In the early stages of setting baselines I like to use Adam with a learning rate of 3e-4. In my experience Adam is much more forgiving to hyperparameters, including a bad learning rate. For ConvNets a well-tuned SGD will almost always slightly outperform Adam, but the optimal learning rate region is much more narrow and problem-specific. (Note: If you are using RNNs and related sequence models it is more common to use Adam. At the initial stage of your project, again, don’t be a hero and follow whatever the most related papers do.)
    - complexify only one at a time. If you have multiple signals to plug into your classifier I would advise that you plug them in one by one and every time ensure that you get a performance boost you’d expect. Don’t throw the kitchen sink at your model at the start. There are other ways of building up complexity - e.g. you can try to plug in smaller images first and make them bigger later, etc.
    - do not trust learning rate decay defaults. If you are re-purposing code from some other domain always be very careful with learning rate decay. Not only would you want to use different decay schedules for different problems, but - even worse - in a typical implementation the schedule will be based on the current epoch number, which can vary widely simply depending on the size of your dataset. E.g. ImageNet would decay by 10 on epoch 30. If you’re not training ImageNet then you almost certainly do not want this. If you’re not careful your code could secretly be driving your learning rate to zero too early, not allowing your model to converge. In my own work I always disable learning rate decays entirely (I use a constant LR) and tune this all the way at the very end.
  - Regularize
    - get more data. First, the by far best and preferred way to regularize a model in any practical setting is to add more real training data. It is a very common mistake to spend a lot engineering cycles trying to squeeze juice out of a small dataset when you could instead be collecting more data. As far as I’m aware adding more data is pretty much the only guaranteed way to monotonically improve the performance of a well-configured neural network almost indefinitely. The other would be ensembles (if you can afford them), but that tops out after ~5 models.
    - data augment. The next best thing to real data is half-fake data - try out more aggressive data augmentation.
    - creative augmentation. If half-fake data doesn’t do it, fake data may also do something. People are finding creative ways of expanding datasets; For example, domain randomization, use of simulation, clever hybrids such as inserting (potentially simulated) data into scenes, or even GANs.
    - pretrain. It rarely ever hurts to use a pretrained network if you can, even if you have enough data.
    - stick with supervised learning. Do not get over-excited about unsupervised pretraining. Unlike what that blog post from 2008 tells you, as far as I know, no version of it has reported strong results in modern computer vision (though NLP seems to be doing pretty well with BERT and friends these days, quite likely owing to the more deliberate nature of text, and a higher signal to noise ratio).
    - smaller input dimensionality. Remove features that may contain spurious signal. Any added spurious input is just another opportunity to overfit if your dataset is small. Similarly, if low-level details don’t matter much try to input a smaller image.
    - smaller model size. In many cases you can use domain knowledge constraints on the network to decrease its size. As an example, it used to be trendy to use Fully Connected layers at the top of backbones for ImageNet but these have since been replaced with simple average pooling, eliminating a ton of parameters in the process.
    - decrease the batch size. Due to the normalization inside batch norm smaller batch sizes somewhat correspond to stronger regularization. This is because the batch empirical mean/std are more approximate versions of the full mean/std so the scale & offset “wiggles” your batch around more.
    - drop. Add dropout. Use dropout2d (spatial dropout) for ConvNets. Use this sparingly/carefully because dropout does not seem to play nice with batch normalization.
    - weight decay. Increase the weight decay penalty.
    - early stopping. Stop training based on your measured validation loss to catch your model just as it’s about to overfit.
    - try a larger model. I mention this last and only after early stopping but I’ve found a few times in the past that larger models will of course overfit much more eventually, but their “early stopped” performance can often be much better than that of smaller models.
  - Tune
    - random over grid search. For simultaneously tuning multiple hyperparameters it may sound tempting to use grid search to ensure coverage of all settings, but keep in mind that it is best to use random search instead. Intuitively, this is because neural nets are often much more sensitive to some parameters than others. In the limit, if a parameter a matters but changing b has no effect then you’d rather sample a more thoroughly than at a few fixed points multiple times.
    - hyper-parameter optimization
  - Squeeze out the juice
    - Model ensembles
    - leave it training
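
A minimal NumPy sketch of softmax and argmax, with made-up logits:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability (doesn't change the result)
    exp = np.exp(logits - np.max(logits))
    return exp / np.sum(exp)

probs = softmax(np.array([2.0, 1.0, 0.1]))  # a valid probability distribution
prediction = np.argmax(probs)               # index of the most likely class
```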

### CNN

- Key components of a CNN architecture
  - Convolutional layers: have multiple filters/kernels (weight matrices) used to compute dot products that act like pattern detectors. Each layer outputs a feature map with all the patterns found
  - Pooling layers: downsample the feature maps to reduce dimensionality and computation. e.g. Max Pooling keeps only the strongest signal in each region, helping the model find the strongest patterns in the data
  - Normalization: rescaling activations to make training more stable and efficient
  - Fully connected layers: use the extracted features from earlier layers to make predictions

### Transfer Learning

- Learn parameters with a ML model for a given dataset
- Download the pre-trained parameters
- Train/fine-tune the model on the new data
  - If you first trained on a big dataset, the fine-tuning can be done with a smaller dataset
- Training the model
  - Train all model parameters
  - Train only the output parameters, leaving the other parameters of the model fixed
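
A minimal sketch with torchvision, assuming torchvision >= 0.13 for the `weights` argument; the 10-class output head is a placeholder:

```python
import torch
from torchvision import models

# Download the pre-trained parameters (ImageNet weights)
model = models.resnet18(weights='IMAGENET1K_V1')

# Option: train only the output parameters, freezing the rest
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for the new task
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Only the new head's parameters will be updated
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```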

## Mathematics

### Linear Algebra

#### Importance of linear dependence and independence: Linear Algebra

1. Understanding Vector Spaces:
   - Linear Independence: A set of vectors is linearly independent if no vector in the set can be written as a linear combination of the others. This means that each vector adds a new dimension to the vector space, and the set spans a space of dimension equal to the number of vectors.
   - Linear Dependence: If a set of vectors is linearly dependent, then at least one vector in the set can be expressed as a linear combination of the others, meaning the vectors do not all contribute to expanding the space. This reduces the effective dimensionality of the space they span.
2. Basis of a Vector Space:
   - A basis of a vector space is a set of linearly independent vectors that span the entire space. The number of vectors in the basis is equal to the dimension of the vector space. Identifying a basis is essential for understanding the structure of the vector space, and it simplifies operations like solving linear systems, performing coordinate transformations, and more.
3. Dimensionality Reduction:
   - In machine learning, high-dimensional data can often be reduced to a lower-dimensional space without losing significant information. This reduction is based on identifying linearly independent components (e.g., via techniques like PCA). Understanding linear independence helps in determining the minimum number of vectors needed to describe the data fully, leading to more efficient computations and better generalization.
4. Solving Linear Systems:
   - When solving systems of linear equations, knowing whether the vectors (or the columns of a matrix) are linearly independent is critical. If they are independent, the system has a unique solution. If they are dependent, the system may have infinitely many solutions or none, depending on the consistency of the equations.
5. Eigenvalues and Eigenvectors:
   - In linear algebra, the concepts of linear dependence and independence are central to understanding eigenvalues and eigenvectors, which are crucial in many applications, such as in principal component analysis (PCA), stability analysis in differential equations, and more.
6. Geometric Interpretation:
   - Geometrically, linearly independent vectors point in different directions, and no vector lies in the span of the others. This concept is fundamental in understanding the shape and orientation of geometric objects like planes, spaces, and hyperplanes in higher dimensions.
7. Optimizing Computations:
   - In numerical methods, computations are often more efficient when working with linearly independent vectors. For example, when inverting matrices, working with a basis (a set of linearly independent vectors) avoids redundant calculations.
8. Rank of a Matrix:
   - The rank of a matrix is the maximum number of linearly independent column (or row) vectors in the matrix. This concept is crucial in determining the solutions to linear systems, understanding the properties of transformations, and more.

### Statistics

- Mean: average value in a distribution
- Trimmed mean: drop smallest and biggest values and average the remaining ones (remove the influence of extreme values)
- Median: the middle number in a sorted list. It depends only on the numbers in the center of the data, so it's not influenced by outliers
- Outliers: they are extreme cases, that are very distant from the other values in the dataset
  - It can influence the mean but not the median
  - It can be a result of errors in the data (invalid or erroneous data)

To study

- [ ] Standard Error (SE)
  - How to interpret this concept
  - What's the range of SE?
- [ ] z-statistic
  - How to interpret this concept
  - What's the range of z-statistic?
- [ ] p-value
  - How to interpret this concept: what's the meaning of a low/high p-value
  - What's the range of p-value?
- [ ] collinearity
  - How to interpret this concept
- [ ] probability density
  - How to interpret this concept
    - it describes how the probability of a random variable is distributed over a range of values
- [ ] Degrees of Freedom
- [ ] Bayesian Inference



================================================
FILE: books/an-introduction-to-statistical-learning-wtih-applications-in-python/README.md
================================================
# An Introduction to Statistical Learning with Applications in Python

- [Intro](introduction.md)
- [Statistical Learning](statistical-learning.md)
- [Linear Regression](linear-regression.md)


================================================
FILE: books/an-introduction-to-statistical-learning-wtih-applications-in-python/classification.md
================================================
# Classification

- For classification models, the response variable is a qualitative variable: yes/no, e.g. eye color {brown|blue|green}
- Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class.
- X, Y, and C (the set of classes): the classifier C(X) assigns X a label belonging to C
  - X is a vector (1-dimensional array) with different features
  - Pass these features to the function and it will return the output that belongs to the C set
- Other examples
  - A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
  - An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
  - On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
- Why Not Linear Regression?
  - There are at least two reasons not to perform classification using a regression method: (a) a regression method cannot accommodate a qualitative response with more than two classes; (b) a regression method will not provide meaningful estimates of `Pr(Y|X)`, even with just two classes. Thus, it is preferable to use a classification method that is truly suited for qualitative response values.
- Some classifiers:
  - logistic regression
  - multiple logistic regression
  - multinomial logistic regression
  - LDA
  - QDA
  - Naive Bayes

## Logistic Regression

- Considering a linear regression model: `p(X) = β₀ + β₁X`. The problem with this approach is that for balances close to zero we predict a negative probability of default — it can produce any real number
- In logistic regression, we use the logistic function (aka sigmoid function): `p(X) = e^(β₀+β₁X) / (1 + e^(β₀+β₁X))` (see the sketch after this list).
  - The logistic function maps any real-valued number into the range (0, 1). Negative responses and values greater than 1 are inherently removed because the logistic function constrains the output to a probability-like range (Probabilities, by definition, cannot be negative or exceed 1)
  - The logistic function will always produce an **S-shaped curve** of this form: For large positive inputs, the function approaches 1, and for large negative inputs, it approaches 0. This helps in creating a clear decision boundary between different classes.
- We use the maximum likelihood method to estimate the βs: β₀,β₁,...,βp.
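
A minimal NumPy sketch of the logistic function (note that e^z / (1 + e^z) = 1 / (1 + e^(-z))):

```python
import numpy as np

def logistic(x, b0, b1):
    # Maps any real-valued input into (0, 1), producing the S-shaped curve
    return 1 / (1 + np.exp(-(b0 + b1 * x)))
```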

## Multiple Logistic Regression

- In multiple logistic regression, we now consider the problem of predicting a binary response using multiple predictors.
- We use the maximum likelihood method to estimate the βs: β₀,β₁,...,βp.
- There are dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant.
  - The results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors.

## Multinomial Logistic Regression

- Classify a response variable that has more than two classes
- We choose one variable as the baseline and make the model estimates the coefficients for the comparisons
  - e.g. you have three outcome categories: A, B, and C. You choose A as the baseline. The model estimates coefficients for the comparisons B vs. A and C vs. A.
  - In the end, you have the probability of A, B, and C

## Linear Discriminant Analysis (LDA)

- [ ] TODO: bayes theorem
- [ ] TODO: LDA for p = 1
- [ ] TODO: LDA for p > 1
- [ ] TODO: Assumptions in LDA

## Quadratic Discriminant Analysis (QDA)

- [ ] TODO: bayes theorem
- [ ] TODO: Assumptions in QDA

## Labs

- Train the model using the training data set
- Predict the results with the test data set

## Questions

- [ ] TODO: case-control sampling
- [ ] TODO: poisson regression
- [ ] TODO: poisson regression vs gauss vs logistic regression
- [ ] TODO: the math behind poisson regression, gauss, logistic regression
- [ ] TODO: what does mean a model being stable?


================================================
FILE: books/an-introduction-to-statistical-learning-wtih-applications-in-python/introduction.md
================================================
# Introduction

- A set of tools for understanding data
  - **Supervised**: a statistical model for predicting, or estimating, an output based on one or more inputs.
  - **Unsupervised**: there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data.
- Classes of problems
  - Regression: predicting a continuous or quantitative output value
    - e.g. Wage data: having wage, age, education level, and wage in years, we can predict a continuous output value, the wage for a specific example
  - Classification: predict a non-numerical value, a categorical or qualitative output
    - e.g. Stock Market data: in a given day, the stock market performance will fall into the `Up` bucket or the `Down` bucket
  - Clustering: it creates groups or clusters based on an input
    - e.g. Gene Expression data: based on two principal components (Z1 and Z2) forming clusters of cell lines


================================================
FILE: books/an-introduction-to-statistical-learning-wtih-applications-in-python/linear-regression.md
================================================
# Linear Regression

- Variables
  - Dependent variables: the response variable that we want to predict based on the values of one or more independent variables. e.g. `Y`
  - Independent variables: the predictor or features. e.g. `X1`, `X2`, ..., `Xn`
- For a given dataset, you have the output variable `Y` and the predictors `X₁`, `X₂`, ..., `Xᵢ`.
  - The goal of linear regression is to find the relationship between the dependent variable `Y` and the variables `Xᵢ` by estimating the coefficients or parameters `β₁`, `β₂`, ..., `βᵢ`.
  - So then you can use techniques to find the coefficients or the parameters `β₁`, `β₂`, ..., `βᵢ`.
- `Y^` refers to the predicted value of `Y`
- `intercept` and `slope` are parameters that define the relationship between the independent variable (X) and the dependent variable (Y) in a linear equation
  - The intercept (often denoted as `β₀` or beta zero) is the value of the dependent variable (Y) when the independent variable (X) is zero. It represents the starting point of the regression line.
  - The slope (often denoted as `β₁` or beta one) represents the change in the dependent variable (Y) for a one-unit change in the independent variable (X). It indicates the steepness or slope of the regression line.
    - A positive slope indicates a positive relationship between X and Y, while a negative slope indicates a negative relationship.
- We use residual sum of squares (RSS) to measure the discrepancy between the observed values of the dependent variable in a regression model and the values predicted by the model
  - Residual i = Yi - Y^i, where:
    - `Yi` is the observed value of the dependent variable.
    - `Y^i` is the predicted value of the dependent variable from the regression model.
- Linear regression has one predictor while a multiple linear regression can have multiple predictors
  - e.g. of multiple linear regression:
    - `Y = β₀ + β₁ × X₁ + β₂ × X₂ + β₃ × X₃ + ϵ`
    - `sales = β₀ + β₁ × TV + β₂ × radio + β₃ × newspaper + ϵ`
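
A minimal sketch of fitting this kind of model, assuming synthetic data in place of the Advertising dataset (the coefficients below are made up); it also computes the RSS described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for Advertising: sales ~ TV + radio + newspaper
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(200, 3))                # TV, radio, newspaper
beta_true = np.array([0.05, 0.19, 0.0])               # made-up coefficients
y = 3.0 + X @ beta_true + rng.normal(0, 1, size=200)  # + noise term ϵ

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # estimates of β₀ and β₁..β₃

# RSS = sum of squared residuals, residual i = Yi - Y^i
residuals = y - model.predict(X)
print("RSS:", (residuals ** 2).sum())
```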

## Variance

- Variance measures how much the predicted values from the regression model vary around the true values of the dependent variable.
- A smaller variance indicates that the predicted values are closer to the true values, suggesting a better fit of the model to the data.
- A larger variance indicates greater variability in the predictions, suggesting that the model may not be capturing all the relevant information in the data.

## Interpreting regression coefficients

- Uncorrelated predictors
  - Each coefficient can be estimated and tested separately
  - Interpretation: "a unit change in Xⱼ is associated with βⱼ change in Y, while all the other variables stay fixed"
  - The variables are independent
- Correlated predictors
  - The variance of all coefficients tends to increase, meaning there are larger standard error, indicating "uncertainty" and worse coefficient estimation precision.
  - Interpretations become hard — when Xⱼ changes, everything else changes

## Important questions

1. Is at least one of the predictors useful in predicting the response?
2. Do the predictors on the whole have anything to say about the outcome?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

## Model Matrix, Coefficients β, and making predictions

- **Transformation**: The process of creating the model matrix 𝑋 involves transforming the original data 𝑋 into a format suitable for the specific modeling technique. This may involve standardization, normalization, encoding categorical variables, adding polynomial features, or other preprocessing steps.
- **Model Fitting**: During model training, the model matrix 𝑋 is used along with the target variable 𝑦 to estimate the coefficients 𝛽 (or weights) that best fit the data. For example, in linear regression, 𝑦^ = 𝑋𝛽.
- **Prediction**: Once the model is trained and validated, the same transformations applied to the original data (to create the model matrix 𝑋) are applied to new, unseen data to generate predictions.


================================================
FILE: books/an-introduction-to-statistical-learning-wtih-applications-in-python/resampling-methods.md
================================================
# Resampling Methods

## Cross-validation and bootstrap

- Refit a model to samples formed from the training set, in order to obtain additional information about the fitted model
  - These methods provide estimates of test-set prediction error, and the standard deviation and bias of our parameter estimates
- Errors
  - Training error: the error we get from the application of a statistical learning method to the observations used in its training
  - Test error: the average error that results from using a statistical learning method to predict the response on a new observation
  - Training versus test set performance: the more complex the model, the smaller the training error; but as complexity increases the model overfits, which increases the test error (the overfitting problem)
- Validation-set approach
  - Randomly divide the available set of samples into two parts: a training set and a validation or hold-out set
  - The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set
  - The resulting validation-set error provides an estimate of the test error
    - Quantitative response: MSE
    - Qualitative response: misclassification rate
  - Two problems in the validation set approach
    - the validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
    - In the validation approach, only a subset of the observations (those included in the training set rather than in the validation set) are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
- K-fold cross-validation
  - Estimate test error
  - Randomly divide the data into K equal-sized parts
    - Leave out part k, fit the model to the other K - 1 parts (combined), and then obtain predictions for the left-out kth part
    - This is done in turn for each part k = 1,2,3,...,K, and then the results are combined (the cross-validation error)
  - The most obvious advantage is computational. LOOCV requires fitting the statistical learning method n times. This has the potential to be computationally expensive
  - It often gives more accurate estimates of the test error rate than does LOOCV (Leave One Out Cross Validation) because of the bias-variance trade-off
    - Performing k-fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of bias compared to LOOCV
    - LOOCV has higher variance because if the data point left out is influential or an outlier, the model's performance on that point can vary significantly. The larger validation set of k-fold cross validation (compared to LOOCV) and more varied training set help smooth out the impact of outliers and reduce the overall variance of the model’s performance estimates
  - Comparison with the validation set approach
    - The model’s performance is averaged over 𝑘 different training-validation splits, providing a more robust and reliable estimate of the model’s true performance.
    - Advantage: This reduces the likelihood of overfitting to a particular validation set and provides a better estimate of how the model will generalize to new data.
- Bootstrap
  - Quantify the uncertainty associated with a given estimator
  - Estimate of the standard error of a coefficient
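
A minimal sketch of both resampling methods on synthetic data (the model and data below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(size=100)

# K-fold cross-validation: average test MSE over k = 10 splits
mse = -cross_val_score(LinearRegression(), X, y,
                       scoring="neg_mean_squared_error", cv=10)
print("10-fold CV estimate of test MSE:", mse.mean())

# Bootstrap: resample with replacement, refit, and use the spread
# of the slope estimates as its standard error
slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))
    slopes.append(LinearRegression().fit(X[idx], y[idx]).coef_[0])
print("bootstrap SE of the slope:", np.std(slopes))
```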


================================================
FILE: books/an-introduction-to-statistical-learning-wtih-applications-in-python/selection-and-regularization.md
================================================
# Linear Model Selection and Regularization

Methods to improve the model's performance:

- **Subset Selection**: using a subset of the original variables (Best Subset Selection, Stepwise Selection)
- **Shrinkage Methods**: shrinking the coefficients toward zero, or all the way to zero (Ridge Regression, Lasso Regression)
- **Dimension Reduction Methods**: transforming and reducing predictors


================================================
FILE: books/an-introduction-to-statistical-learning-wtih-applications-in-python/statistical-learning.md
================================================
# Statistical Learning

- We have `X` and `Y` as variables, `X` as a predictor variable and `Y` as an output value
  - e.g. With the increase of number of years of education (`X`), the income (`Y`) also increases
  - The function `f` is unknown and statistical learning refers to a set of approaches for estimating `f`
- There are two main reasons that we may wish to estimate `f`: prediction and inference.

## Prediction

- We want to find `Y = f(X)`, so we try to get the estimate using `Y^ = f^(X)`
  - `f^` is the estimate of `f`
  - `Y^` is the estimate of `Y`
- e.g. predicting the risk of a drug reaction based on blood characteristics
  - `X1`, `X2`, `X3`, ... `Xp` are characteristics of a patient’s blood sample
  - `Y` is a variable encoding the patient’s risk for a severe adverse reaction to a particular drug
- e.g. Marketing campaign: identify individuals who are likely to respond positively to a mailing
  - predictors: demographic variables
  - outcome: response to the campaign
- Errors in predictions
  - Reducible error: whenever we're estimating `f`, it won't be perfect and the inaccuracy will introduce some error. Because we can potentially improve the accuracy using the most appropriate statistical learning technique, this error is called reducible error
  - Irreducible error: some factors are not counted in the model producing errors in the prediction that cannot be eliminated
    - e.g. 1 - human error when collecting data
    - e.g. 2 - randomness events, for example, factors like market sentiment can make stock prices fluctuate
    - e.g. 3 - missing variables, important variables that are not included in the model due to limitations in data availability or lack of understanding

## Inference

- We want to understand the exact form of `f`, the association between `Y` and `X1`, ..., `Xp`.
- A set of questions to help understand `f`
  - Which predictors are associated with the response? Only a small fraction of the available predictors are substantially associated with Y
  - What is the relationship between the response and each predictor? Positive or negative relationship and dependent on the values of the other predictors
  - Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? If it's too complex, a linear model may not provide an accurate representation of the relationship between the input and output variables
- e.g. Advertising through different media
  - Which media are associated with sales?
  - Which media generate the biggest boost in sales? or
  - How large of an increase in sales is associated with a given increase in TV advertising?

## Finding `f`: regression model

- In a model, for a given input `X`, it has an output of `Y`
- The idea of a regression model is to find the function `f` that models the relationship between `X` and `Y`
- What's a good f(x)?
  - A good `f` can make predictions of `Y` at any point of `X`
- Finding the function `f`
  - For a given point `X`, get all the points in `Y` and calculate the average of all points
    - `f(x) = E(Y|X = x)` is called a regression function
  - Not all `X` will have `Y`s or maybe it has just a few `Y`s
  - So we relax the definition
    - `f(x) = Ave(Y|X ∈ N(x))`, where `N(x)` is some neighborhood of `X`
    - Form a window for `X` to find `Y` in the "neighborhood"
    - The concept is called "Nearest neighbor" or "local average"
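
A minimal sketch of this local average on synthetic data (the window width below is an arbitrary choice):

```python
import numpy as np

# f(x) = Ave(Y | X ∈ N(x)): average Y over a neighborhood around x
def local_average(x0, X, Y, width=0.1):
    mask = np.abs(X - x0) <= width  # N(x0): points within ±width of x0
    if not mask.any():
        return np.nan               # no neighbors -> no estimate
    return Y[mask].mean()

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=200)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, size=200)
print(local_average(0.25, X, Y))  # estimate of E(Y | X = 0.25)
```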

## Dimensionality and Structured Models

- The more dimensions, the more complex the model
- Curse of dimensionality: more dimensions commonly means sparser data (nearest neighbors tend to be far away in high dimensions)
  - e.g. a 10% neighborhood in 1 dimension can be straightforward, but in high dimensions it may lose the spirit of local estimation because the neighborhood may no longer be local
- Provide structure to models
  - Linear model as a parametric model: `f(X) = β0 + β1*X1 + β2*X2 + β3*X3 + ... + βp*Xp`
  - `p + 1` parameters: `β0`, `β1`, ..., `βp`
  - A linear model draws a straight line through the data that best fits the patterns they see
  - we need to estimate the parameters `β0`, `β1`, ..., `βp` such that `Y ≈ β0 + β1X1 + β2X2 + ··· + βpXp`
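
A minimal sketch of estimating the `p + 1` parameters by least squares (synthetic data, made-up true coefficients):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = 4.0 + X @ beta_true + rng.normal(0, 0.3, size=n)

# Design matrix with a leading column of ones for β0: p + 1 parameters
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(beta_hat)  # [β0, β1, ..., βp], close to [4.0, 1.0, -2.0, 0.5]
```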

## Assessing Model Accuracy

- In the regression setting, the most commonly-used measure is the mean squared error (MSE)
  - The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations the predicted and true responses differ substantially
- MSE computed on the training data is referred to as the training MSE, but in general, we do not really care how well the method works on the training data.
  - We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.
  - We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE
- When estimating `f(x)`, if we have an increase of test MSE when increasing the model's flexibility, we have an overfitting problem, meaning that the model is finding random patterns and not having a good accuracy in estimating `f(x)`.
  - Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
  - When increasing the flexibility of the model, we have a better fit for the training data but less accuracy (larger MSE) for test data.
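
A minimal sketch of the training/test MSE gap as flexibility grows, using polynomial degree as the flexibility knob on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(0, 0.3, size=60)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

# Higher polynomial degree = more flexibility: training MSE keeps
# shrinking while test MSE eventually rises (overfitting)
for degree in (1, 3, 10, 15):
    coefs = np.polyfit(x_tr, y_tr, degree)
    print(degree,
          "train MSE:", round(mse(y_tr, np.polyval(coefs, x_tr)), 3),
          "test MSE:", round(mse(y_te, np.polyval(coefs, x_te)), 3))
```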

### Model Selection and Bias-Variance Tradeoff

- Bias and variance are two sources of error in machine learning models
- **Bias** refers to the error introduced by approximating a real-life problem with a simplified model. It represents the difference between the average prediction of our model and the true value we're trying to predict.
  - Underfitting: A model with high bias tends to be too simple and may fail to capture the underlying patterns and relationships in the data. A model with high bias makes strong assumptions, often simplifying the data too much, leading to underfitting.
  - Error introduced by simplifying the model
- **Variance** refers to the variability in model predictions when trained on different datasets. It represents the sensitivity of the model to the specific training data used.
  - Overfitting: A model with high variance tends to be too complex and may capture noise or random fluctuations in the training data
  - Error introduced by the model's sensitivity to the training data
  - High variance means that the model’s predictions are inconsistent, making the model unreliable
- The bias-variance tradeoff arises because reducing bias often increases variance and vice versa. The goal is to find the right balance between bias and variance to minimize the overall prediction error of the model on unseen data.

## Classification

- For classification problems, the response variable Y is qualitative
  - e.g. email is one of C = (spam,ham), where ham is "good email"
- Bayes classifier uses a conditional probability to predict: class one if Pr(Y = 1|X = x0) > 0.5, and class two otherwise.
- K-nearest neighbors (KNN) classifier: Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0.
  - When K = 1, the decision boundary is overly flexible and finds patterns in the data that don’t correspond to the Bayes decision boundary. This corresponds to a classifier that has low bias but very high variance.
  - As K grows, the method becomes less flexible and produces a decision boundary that is close to linear.
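
A minimal KNN sketch on synthetic data, varying K to see the flexibility trade-off:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=300) > 0).astype(int)
X_tr, y_tr = X[:200], y[:200]
X_te, y_te = X[200:], y[200:]

# Small K -> flexible boundary (low bias, high variance);
# large K -> smoother, close-to-linear boundary
for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(k, "train acc:", knn.score(X_tr, y_tr),
          "test acc:", knn.score(X_te, y_te))
```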

## Conclusion

- In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method.
- The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task.


================================================
FILE: books/deep-learning-for-biology/README.md
================================================
# Deep Learning for Biology

## Success Criteria

**Performance metric (e.g., accuracy, AUC, F1)**: You might aim to match the performance of a human expert, achieve a correlation with experimental results comparable to a technical replicate, or keep the false-positive rate below a certain number.

**Level of interpretability**: In many applications, it’s important not only that a model performs well, but also that its decisions can be understood by domain experts. For instance, you may prioritize well-calibrated uncertainty estimates or interpretable feature attributions, especially when trust and explainability are critical.

**Model size or inference latency**: If your model needs to operate in a resource-constrained environment (e.g., smartphones or embedded devices) or meet real-time throughput targets (e.g., process 20 frames per second), your success criterion might focus on efficiency—such as achieving high performance per floating point operation (FLOP), which measures how effectively the model uses computational resources. In such cases, metrics like inference time, memory usage, or energy consumption may matter more than raw accuracy.

**Training time and efficiency**: When compute is limited—or for educational contexts—you may prioritize fast training or minimal hardware requirements. Since training deep learning models typically involves large matrix operations, they are often accelerated using graphics processing units (GPUs). In low-resource settings, developing a simpler model that trains quickly on a CPU may be a more practical goal than maximizing performance.

**Generalizability**: In some cases, the goal is to build a model that works well across many datasets or tasks, rather than one that is finely tuned to a single benchmark. For example, foundational models—large models trained on broad datasets that can be adapted to many downstream applications—prioritize flexibility and reuse. In such settings, broad applicability may be more valuable than squeezing out the best possible performance on a specific task.

## Invest Heavily in Evaluations

Thinking carefully about precisely how you’ll measure progress—including what metrics you’ll use, how you’ll validate results, and which baselines you’ll compare against. Without a clear, well-designed evaluation strategy, even a technically impressive model can fail to produce meaningful conclusions.

## Designing Baselines

### Classification tasks

**Random prediction**: Assign labels completely at random, with equal probability for each class. This tells you what performance looks like with no information at all.

**Random prediction weighted by class frequencies**: Sample labels randomly, but in proportion to how often they occur in the training data. This is useful for imbalanced datasets.

**Majority class**: Always predict the most common class. This can be a surprisingly hard baseline to beat in highly class imbalanced settings.

**Nearest neighbor**: Predict the label of the most similar example in the training data (e.g., 1-nearest neighbor using Euclidean distance). This is often effective when inputs are low dimensional or well structured.

### Regression tasks

**Mean or median of the target**: Always predict the average or median target value from the training set. This often matches what a model would do if it’s not learning anything meaningful.

**Linear regression with a single feature**: Fit a line using just the strongest individual predictor (e.g., one biomarker). This helps gauge how much a more complex model improves over a simple signal.

**K-nearest neighbor regression**: Predict the target as the average (or weighted average) of the k most similar data points. This is simple to implement and often surprisingly competitive on structured datasets.

### For both

**Simple heuristics**: Use straightforward rules based on domain knowledge. For example, in diagnostics, classify a patient as positive if a single biomarker or measurement exceeds a threshold. For skin cancer images, rank lesions by average pixel intensity. In genomics, if the task is to predict which gene a mutation affects, a simple baseline is to assume it affects the nearest gene in the genome.
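
A minimal sketch of the random, class-frequency-weighted, majority-class, and mean/median baselines via scikit-learn's dummy estimators (synthetic data for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5))
y_cls = (rng.random(500) < 0.1).astype(int)  # imbalanced binary labels
y_reg = rng.normal(10, 2, size=500)

# Classification: random, class-frequency-weighted, majority class
for strategy in ("uniform", "stratified", "most_frequent"):
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y_cls)
    print(strategy, clf.score(X, y_cls))  # majority class scores ~0.9 here

# Regression: always predict the mean or median target
for strategy in ("mean", "median"):
    reg = DummyRegressor(strategy=strategy).fit(X, y_reg)
    print(strategy, reg.score(X, y_reg))  # R², ~0 for the mean baseline
```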

## Chapters

- [Learning the Language of Proteins](./learning-the-language-of-proteins.md)
- [Learning the Logic of DNA](./learning-the-logic-of-dna.md)


================================================
FILE: books/deep-learning-for-biology/learning-the-language-of-proteins.md
================================================
# Learning the Language of Proteins

- A protein can be represented as a sequence of its constituent building blocks, called amino acids.
- Proteins use an alphabet of 20 amino acids to form long chains with specific shapes and jobs.
- Train a model to predict a protein’s function given its amino acid sequence. For example:
  - Given the sequence of the `COL1A1` collagen protein (`MFSFVDLR...`), we might predict its function is likely `structural` with probability 0.7, `enzymatic` with probability 0.01, and so on.
  - Given the sequence of the `INS` insulin protein (`MALWMRLL...`), we might predict its function is likely `metabolic` with probability 0.6, `signaling` with probability 0.3, and so on.
- Accurate protein function prediction is an extremely challenging problem
  - How amino acid sequence determines 3D structure
  - How structure enables function
  - How these functions operate in the dynamic, crowded environment of the cell.

## Biology Primer 

- A protein’s function is very closely tied to its 3D structure, which in turn is determined by its primary amino acid sequence.
- A gene encodes the primary amino acid sequence of a protein. That sequence determines the protein’s structure, and the structure governs its function.

**Protein Structure**

![](images/001.png)

- **Primary structure**: The linear sequence of amino acids
- **Secondary structure**: Local folding into structural elements such as alpha helices and beta sheets
- **Tertiary structure**: The overall 3D shape formed by the complete amino acid chain
- **Quaternary structure**: The assembly of multiple protein subunits into a functional complex (not all proteins have this)

The importance of amino acids in proteins: a single substitution in the amino acid sequence can dramatically alter how a protein folds or functions—sometimes with serious effects

e.g. Many genetic diseases are caused by such point mutations. An example is sickle cell anemia, which is caused by a single-letter change in the gene for hemoglobin that replaces a hydrophilic amino acid (E) with a hydrophobic one (V), which ultimately leads to misshapen red blood cells.

**Protein Function**

Functions: catalyze chemical reactions, transmit signals, transport molecules, provide structural support, and regulate gene expression

- **Biological process**: The broader biological objective the protein contributes to—like cell division, response to stress, carbohydrate metabolism, or immune signaling.
- **Molecular function**: This describes the specific biochemical activity of the protein itself—such as binding to DNA or ATP (a molecule that stores and transfers energy in cells), acting as a kinase (an enzyme that attaches a small chemical tag called a phosphate group to other molecules to change their activity), or transporting ions across membranes.
- **Cellular component**: This indicates where in the cell the protein usually resides—such as the nucleus, mitochondria, or extracellular space. Although it’s technically a location label and not a function per se, it often provides important clues about the protein’s role (e.g., proteins in the mitochondria are probably involved in energy production).

**Why Predicting Protein Function?**

- **Biotechnology and protein engineering**: If we can reliably predict function from sequence, we can begin to design new proteins with desired properties. This could be useful for designing enzymes for industrial chemistry, therapeutic proteins for medicine, or synthetic biology components.
- **Understanding disease mechanisms**: Many diseases are caused by specific sequence changes (variants, or mutations) that disrupt protein function. A good predictive model can help identify how specific mutations alter function, offering insights into disease mechanisms and potential therapeutic targets.
- **Genome annotation**: As we continue sequencing the genomes of new species, we’re uncovering vast numbers of proteins whose functions remain unknown. For newly identified proteins—especially those that are distantly evolutionarily related to any known ones—computational prediction is essential for assigning functional hypotheses.
- **Metagenomics and microbiome analysis**: When sequencing entire microbial communities, such as gut bacteria or ocean microbiota, many protein-coding genes have no close matches in existing databases. Predicting function from sequence helps uncover the roles of these unknown proteins, advancing our understanding of microbial ecosystems and their effects on hosts or the environment.

## Machine Learning Primer

**Embeddings**

- An embedding is a numerical vector — a list of floating-point numbers — that encodes the meaning or structure of an entity like a word, sentence, or protein sequence. 
- A protein might be represented by an embedding such as [0.1, -0.3, 1.3, 0.9, 0.2], which could capture aspects of its biochemical or structural properties in a compact numerical form.
- Similar inputs result in similar embeddings: protein sequences with similar structure or function — such as collagen I and collagen II — will tend to have embeddings that are close together in what we might call a “protein space.”
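
A minimal sketch of comparing embeddings with cosine similarity (the vectors below are made up, not real protein embeddings):

```python
import numpy as np

# Hypothetical embeddings: similar proteins should be close in "protein space"
collagen_1 = np.array([0.1, -0.3, 1.3, 0.9, 0.2])
collagen_2 = np.array([0.2, -0.2, 1.1, 1.0, 0.1])
insulin    = np.array([-0.8, 0.9, -0.5, 0.1, 0.7])

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(collagen_1, collagen_2))  # high: similar proteins
print(cosine_similarity(collagen_1, insulin))     # lower: different function
```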

## The ESM2 Protein Language Model

![](images/protein-language-esm.png)

ESM2 is a masked language model (MLM):

- Mask a random subset of amino acids in each protein sequence (randomly selected 15% of the amino acids in each sequence were masked during training)
- Ask the model to predict them

## Embedding Entire Proteins

**Concatenation of amino acid embeddings**: loop through each amino acid in a sequence, extract its embedding, and concatenate them into one long vector. e.g. if a protein has length 10 and each amino acid has a 640-dimensional embedding, this yields a protein embedding of length 10 × 640 = 6400

Several drawbacks:

- **Variable length**: Different proteins will yield different-length embeddings, which complicates model input formatting.
- **Scalability**: Long proteins produce huge embeddings. For example, titin—the longest known human protein at ~34,000 amino acids—would produce an embedding with over 43 million values. That’s unwieldy for most models.
- **Limited modeling**: This approach treats amino acids independently, ignoring the contextual relationships that are central to protein function.

**Averaging of amino acid embeddings**: average the token embeddings across the sequence.

- This has the advantage of producing fixed-size vectors, regardless of protein length.
- It’s efficient and sometimes used, but also crude—averaging discards ordering and interaction information. It’s like summarizing a novel by averaging all its word vectors: some meaning survives, but the nuance is lost.

**Using the model’s contextual sequence embeddings**: extract the hidden representations for the entire sequence directly from the language model

- Concretely, we can pass a protein sequence through ESM2 and extract the final hidden layer activations, resulting in a tensor of shape (L', D), where L' is the number of output tokens (which may differ from the input length L), and D is the model’s hidden size (e.g., 640).
- We then apply mean pooling across the sequence length to produce a fixed-length embedding of shape (D,). While averaging may seem simplistic, it often works surprisingly well—because the model has already integrated contextual information into each token’s representation using self-attention, the pooled vector still captures meaningful dependencies across the sequence.
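
A minimal sketch of this mean-pooling approach, assuming the Hugging Face `transformers` ESM2 checkpoint `facebook/esm2_t30_150M_UR50D` (whose hidden size is 640, matching the example above); the sequence is a toy fragment:

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "facebook/esm2_t30_150M_UR50D"  # hidden size D = 640
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sequence = "MFSFVDLRLLLLAATALLTHG"  # toy fragment, not the full COL1A1
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, L', 640)

# Mean pooling over tokens (the mask excludes padding; note that L'
# includes special tokens added by the tokenizer)
mask = inputs["attention_mask"].unsqueeze(-1)   # (1, L', 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)                          # torch.Size([1, 640])
```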


================================================
FILE: books/deep-learning-for-biology/learning-the-logic-of-dna.md
================================================
# Learning the Logic of DNA

- Based on a DNA sequence, predict if it's bound by transcription factors (TFs, a class of proteins)
  - TFs are related to gene regulation
  - TFs bind to a DNA sequence and turn surrounding genes on or off
- Biology Primer
  - DNA is a molecule of inheritance
  - DNA of a cell has all the instructions to build an entire human body
  - DNA is built from 4 unique letters, or nucleotides: A (adenine), C (cytosine), G (guanine), and T (thymine)
  - It underpins processes like cell division and differentiation
  - Genome is a complete set of DNA in an organism
    - Genetics: study specific genes or small set of genes
    - Genomics: study an entire genome
- Coding and non-coding regions
  - Coding regions: regions of DNA that are transcribed into RNA and translated into proteins (DNA -> RNA -> Protein)
  - The human genome contains around 20,000 protein-coding genes.
  - Protein-coding genes account for only about 2% of the genome. The remaining 98% is noncoding DNA.
  - Non-coding DNA doesn't produce proteins but plays a critical regulatory role
    - It can produce RNAs that regulate gene expression
    - It can help organize 3D structure of the genome
    - It can serve as docking sites for regulatory proteins
- How Transcription Factors Orchestrate Gene Activity
  - Transcription factor binding means that a TF binds to a specific DNA sequence in order to regulate gene expression: increase/activate or decrease/repress the rate at which the gene is transcribed into RNA
- CNNs are also used in 1D data like DNA sequences
  - Shallow layers: detect low-level DNA features (GC-rich, AT-rich regions)
  - Mid-level filters: identify known TF motifs
  - Deeper layers: learn higher-level features
  - CNNs are relatively lightweight and easy to train, but struggle with interactions between distant bases, in other words, problems involving long-range dependencies: relationships between elements far apart in a sequence
- Transformers
  - CNNs are great to find local patterns
  - Transformers are powerful for modeling relationships across long distance in a sequence, and global sequence context
  - Self-attention is a mechanism that receives input token embeddings (input vector) and outputs context-aware embeddings (output vector)
    - It does this by letting each token attend to every other position and weigh how much each one should influence its representation
  - Transformer Block: attention/self attention -> feedforward layer -> residual connection -> layer normalization
  - Transformer models have many of these blocks stacked
  - Multiheaded Attention: it runs several attention mechanisms in parallel to help capture a richer and more diverse set of relationships within the data
- Modeling task
  - Given a 200-base DNA sequence, predict whether it binds a specific TF called CTCF (a minimal untrained model sketch follows this list)
  - It's a binary classification problem
    - Input: DNA sequence
    - Output: 0 or 1 (the sequence binds the TF or not)
  - CTCF helps organize the genome's 3D structure (folding DNA into loops and domains that regulate gene activity)
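
A minimal untrained PyTorch sketch of the CNN framing of this task (layer sizes and the motif-width kernel are arbitrary choices; a real model would be trained with `BCEWithLogitsLoss` on labeled sequences):

```python
import torch
import torch.nn as nn

# One-hot encode a DNA sequence into (4 channels, L) for A, C, G, T
def one_hot(seq):
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[idx[base], i] = 1.0
    return x

# Shallow conv filters scan for local motifs, pooling keeps the
# strongest match per filter, a linear head outputs a binding logit
model = nn.Sequential(
    nn.Conv1d(4, 32, kernel_size=12, padding=6),  # motif detectors
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),
    nn.Flatten(),
    nn.Linear(32, 1),
)

seq = "ACGT" * 50                         # toy 200-base sequence
logit = model(one_hot(seq).unsqueeze(0))  # batch of 1
print(torch.sigmoid(logit))               # P(sequence binds CTCF)
```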


================================================
FILE: books/machine-learning-system-design/README.md
================================================
# Machine Learning System Design

## Is there a problem?

- Focus on the problem space before the solution space (implementation)
  - Trying to understand what people want is important; trying to understand what they need is critical.
- Try to question every word in a given sentence to make sure you can explain it to a 10-year-old child. 
  - e.g. There are fraudsters in our mobile app who try to attack our legit users.
  - Who are fraudsters?
  - How do they attack? 
  - What report gave the initial insight about excessive prices?
  - What bothers our customers the most?
  - Where is the most time wasted?
  - How do we measure user engagement?
  - How are recommendations related to this metric?
- Find out any possible risks and limitations as soon as possible; otherwise, you can be forced to discard all your hard work
  - Proper understanding of the costs of a mistake
    - Affects requirements, data gathering, and metrics to choose
  - Requirements: Functional requirements, non-functional requirements
  - Trade-off between robustness (software keeps working) and correctness (returning the correct result)

## Design Document

- Goal: reduce the uncertainty about a problem
  - Successful metrics
  - Functional and non-functional requirements
- Antigoals: inverse statements that can help us narrow down both the problem space and the solution space
  - Find properties of the system you're building that are not hard requirements
  - It helps us focus only on the important aspects of a system
- Designing the document
  - Problem definition
    - Origin/Context
    - Relevance and reasons: problem relevance based on exploratory data analysis
    - Previous work: list of problems to avoid based on previous work
    - Issues and risks

## Metrics

- Loss metrics, evaluation metrics (offline), proxy metrics, business metrics (online)
- When the positive class appears at a rate like 9 in 10,000, it means
  - Low amount of class 1 data, huge class imbalance
  - Increased A/B test duration
- Build a hierarchy of metrics to understand what could be used as proxy metrics for the actual goal
  - Use proxy metrics to speed up the experimentation phase and increase the number of positive-class examples to get a more balanced dataset
- Summary
  - Don’t fall into the temptation of using time-tested loss functions just because they worked on your previous project(s).
  - A loss function must be globally continuous and differentiable.
  - Loss selection is an important step, but it is even more crucial with deep learning-based systems.
  - Consider applying consistency metrics when small changes to the inputs can have significant effects on the output of your model from the product perspective.
  - Offline metrics can be applied before putting your project into production and play the role of proxy metrics for online metrics.
  - Make sure to have the hierarchy of metrics at hand, as it will be useful while working on the design of your system.

## Datasets

- Sampling is effective when a dataset is not only huge but also tends to be imbalanced and/or may contain a lot of duplicates
- A critical characteristic of data uncertainty is that it does not decrease no matter how much additional training data is collected.
- Handling data
  - Generating synthetic data
  - Using available data from similar situations
  - Creating data manually
  - Taking data from a similar problem and trying to adjust it
  - Use a dummy baseline model or third party to bootstrap
- Properties of a healthy data pipeline
  - Reproducibility: be able to create a dataset from scratch if needed
  - Consistency: data origin, how data is preprocessed, filters applied
  - Reliability: data comes from a reliable source
  - Availability: pulling data should be fairly easy
- Design document: Dataset
  - ETL:
    - What are the data sources?
    - How should we represent and store the data for our system?
  - Filtering:
    - What are the criteria for good and bad data samples?
    - What corner cases can we expect? How do we handle them?
    - Do we filter data automatically or set up a process for manual verification?
  - Feature engineering:
    - How are the features computed?
    - How are representations generated?
  - Labeling:
    - What labels do we need?
    - What’s the label’s source?

## Evaluation process

- The best evaluation schemas (dataset splits) have the highest reliability and robustness: low bias / low variance
- Data split
  - A training set is used for model training
  - A validation set is designed to evaluate performance during training
  - A test set is used to calculate final metrics
- Be careful with validation leading to data leakage and optimistic model performance
- Cross validation: helps with mitigating "selection bias", when we get a non-representative train/test split
  - Improve reliability on model performance for unseen data

## Baseline

- Baseline
  - **Reduce the maximum risk with the lowest amount of time, cost, and effort invested in a product**. At the beginning of the product’s life, it is still unclear whether the market needs it, what use cases the product will have, whether the economy will converge, and so on. To a large extent, these risks are peculiar to ML products, too. In a way, a baseline (or MVP) is the easiest way to test a hypothesis that lies at the heart of your product.
  - **Get early feedback**. This is the fail-fast principle cut down to the product scale. If the whole idea of your ML system is wrong, you can see it at an early stage, rethink the entire plan, rewrite the design document with new knowledge, and start anew.
  - **Bring user value as soon as possible**. Each company aims to generate revenue by making its 
Download .txt
gitextract_hamld7q0/

├── .gitignore
├── .prettierrc
├── FUNDING.yml
├── LICENSE
├── README.md
├── a-unified-theory-of-ai-in-biomedicine.md
├── a-unified-theory-of-ml-ai.md
├── books/
│   ├── an-introduction-to-statistical-learning-wtih-applications-in-python/
│   │   ├── README.md
│   │   ├── classification.md
│   │   ├── introduction.md
│   │   ├── linear-regression.md
│   │   ├── resampling-methods.md
│   │   ├── selection-and-regularization.md
│   │   └── statistical-learning.md
│   ├── deep-learning-for-biology/
│   │   ├── README.md
│   │   ├── learning-the-language-of-proteins.md
│   │   └── learning-the-logic-of-dna.md
│   ├── machine-learning-system-design/
│   │   ├── README.md
│   │   └── residual_analysis.ipynb
│   ├── mathematics-for-machine-learning/
│   │   └── README.md
│   ├── practical-statistics-for-data-scientists/
│   │   ├── README.md
│   │   ├── data-and-sampling-distributions.ipynb
│   │   └── practical-statistics-exploratory-data-analysis.ipynb
│   ├── reinforcement-learning/
│   │   └── README.md
│   └── understanding-deep-learning/
│       └── README.md
├── cancer/
│   └── README.md
├── careers/
│   └── README.md
├── courses/
│   ├── agentic-ai/
│   │   ├── README.md
│   │   ├── email-assistant.ipynb
│   │   ├── eval.ipynb
│   │   ├── external-evaluation.ipynb
│   │   ├── multi-agent-workflow.ipynb
│   │   ├── planning-with-code.ipynb
│   │   ├── reflection.ipynb
│   │   └── tools.ipynb
│   ├── ai-for-medicine/
│   │   └── ai-for-medical-diagnosis/
│   │       ├── README.md
│   │       ├── ai-for-medicine-densenet.ipynb
│   │       ├── ai-for-medicine-diagnosis-counting-labels-and-we.ipynb
│   │       ├── ai-for-medicine-patient-overlap-and-data-leakage.ipynb
│   │       ├── chest-x-ray-medical-diagnosis-with-deep-learning.ipynb
│   │       └── data-exploration-and-image-pre-processing.ipynb
│   ├── attention-in-transformers/
│   │   ├── README.md
│   │   ├── encoder-decoder-attention.ipynb
│   │   ├── masked-self-attention-pytorch.ipynb
│   │   ├── next-token-prediction.ipynb
│   │   ├── self-attention-pytorch.ipynb
│   │   ├── tokenization.ipynb
│   │   └── transformers-from-scratch.ipynb
│   ├── data-visualization/
│   │   ├── README.md
│   │   ├── bar-charts-and-heatmaps.ipynb
│   │   ├── choosing-plot-types-and-custom-styles.ipynb
│   │   ├── data-visualization-final-project.ipynb
│   │   ├── distributions.ipynb
│   │   ├── line-charts.ipynb
│   │   ├── scatter-plots.ipynb
│   │   └── seaborn.ipynb
│   ├── diffusion-models/
│   │   ├── README.md
│   │   ├── controlling-model-generation.ipynb
│   │   ├── ddim-vs-ddpm-faster-sampling.ipynb
│   │   ├── denoise-and-add-noise.ipynb
│   │   ├── diffusion_utilities.py
│   │   └── training-unet.ipynb
│   ├── gen-ai/
│   │   ├── README.md
│   │   ├── building-an-agent-with-langgraph.ipynb
│   │   ├── classifying-embeddings-with-keras.ipynb
│   │   ├── document-q-a-with-rag.ipynb
│   │   ├── embeddings-and-similarity-scores.ipynb
│   │   ├── evaluation-and-structured-output.ipynb
│   │   ├── fine-tuning-a-custom-model.ipynb
│   │   ├── function-calling-with-the-gemini-api.ipynb
│   │   ├── google-search-grounding.ipynb
│   │   └── prompt-engineering.ipynb
│   ├── genomic-data-science/
│   │   ├── algorithms-for-dna-sequencing/
│   │   │   ├── README.md
│   │   │   ├── fasta/
│   │   │   │   └── lambda_virus.fa
│   │   │   └── src/
│   │   │       ├── read_genome.py
│   │   │       └── reverse_complement.py
│   │   └── introduction-genomics/
│   │       ├── README.md
│   │       └── quizz-001.md
│   ├── introduction-to-deep-learning/
│   │   └── README.md
│   ├── introduction-to-machine-learning/
│   │   ├── README.md
│   │   ├── logistic-regression/
│   │   │   └── README.md
│   │   └── multilayer-perceptron/
│   │       └── README.md
│   ├── introduction-to-neural-networks-and-pytorch/
│   │   ├── 1D-tensor.ipynb
│   │   ├── 2D-tensor.ipynb
│   │   ├── README.md
│   │   ├── activation-functions-and-max-pooling-in-cnn.ipynb
│   │   ├── activation-functions-on-mnist.ipynb
│   │   ├── activation-functions.ipynb
│   │   ├── batch-normalization.ipynb
│   │   ├── best-practices-for-model-training.md
│   │   ├── cnn-for-small-image.ipynb
│   │   ├── computer-vision-with-pytorch.ipynb
│   │   ├── convolution-neural-network.ipynb
│   │   ├── convolutional-neural-network-for-anime-image-class.ipynb
│   │   ├── convolutional-neural-network-with-batch-normalization.ipynb
│   │   ├── core-neural-network-components.ipynb
│   │   ├── data-management-in-pytorch.ipynb
│   │   ├── deep-learning-with-pytorch.ipynb
│   │   ├── deep-neural-network-for-breast-cancer-classification.ipynb
│   │   ├── deep-neural-networks.ipynb
│   │   ├── deeper-neural-networks-with-nn-modulelist.ipynb
│   │   ├── derivatives.ipynb
│   │   ├── different-parameter-initialization.ipynb
│   │   ├── dropout-neural-net.ipynb
│   │   ├── dropout-regression.ipynb
│   │   ├── fashion-mnist.ipynb
│   │   ├── he-parameter-initialization.ipynb
│   │   ├── initialization-with-same-weights.ipynb
│   │   ├── linear-regression-training-one-parameter.ipynb
│   │   ├── linear-regression-training.ipynb
│   │   ├── linear_regression_model.ipynb
│   │   ├── linear_regression_with_multiple_outputs.ipynb
│   │   ├── logistic-regression-and-bad-initialization-value.ipynb
│   │   ├── logistic-regression-cross-entropy.ipynb
│   │   ├── logistic_regression.ipynb
│   │   ├── mini_batch_gradient_descent.ipynb
│   │   ├── mini_batch_gradient_descent_pytorch.ipynb
│   │   ├── mnist-softmax.ipynb
│   │   ├── mnist_vision_transform.ipynb
│   │   ├── momentum-with-polynomial-functions.ipynb
│   │   ├── multi-class-neural-networks-with-mnist.ipynb
│   │   ├── multiple-channel-convolution.ipynb
│   │   ├── multiple_linear_regression.ipynb
│   │   ├── multiple_linear_regression_training.ipynb
│   │   ├── neural-network-with-momentum.ipynb
│   │   ├── neural-network-with-multiple-neurons.ipynb
│   │   ├── neural-networks-with-multiple-hidden-layers.ipynb
│   │   ├── simple-convolutional-neural-network.ipynb
│   │   ├── small-neural-network.ipynb
│   │   ├── softmax-classifier-1d.ipynb
│   │   ├── stochastic_gradient_descent.ipynb
│   │   ├── training_and_validation_data.ipynb
│   │   ├── training_multiple_output_linear_regression.ipynb
│   │   ├── transform.ipynb
│   │   └── vision_transform.ipynb
│   ├── kaggle-intermdiate-ml/
│   │   ├── README.md
│   │   ├── categorical-variables.ipynb
│   │   ├── cross-validation.ipynb
│   │   ├── data-leakage.ipynb
│   │   ├── intro-house-pricing.ipynb
│   │   ├── missing-values.ipynb
│   │   ├── pipeline.ipynb
│   │   └── xgboost.ipynb
│   ├── kaggle-intro-to-ml/
│   │   ├── README.md
│   │   ├── explore-data.ipynb
│   │   ├── house-price-decision-tree-regressor.ipynb
│   │   ├── model-validation.ipynb
│   │   ├── random-forests.ipynb
│   │   └── underfitting-and-overfitting.ipynb
│   ├── language-modeling-from-scratch/
│   │   └── README.md
│   ├── machine-learning-for-health-predictions/
│   │   └── README.md
│   ├── machine-learning-with-python/
│   │   └── README.md
│   ├── math-for-machine-learning-with-python/
│   │   ├── 001-intro-to-equations.py
│   │   ├── 002-linear-equations.py
│   │   ├── 003-systems-of-equations.py
│   │   └── README.md
│   ├── ml-for-computational-biology/
│   │   └── README.md
│   ├── ml-in-healthcare/
│   │   └── README.md
│   ├── multimodal-machine-learning/
│   │   └── README.md
│   ├── pyspark/
│   │   └── learning_spark.ipynb
│   └── python/
│       ├── README.md
│       ├── booleans-and-conditionals.ipynb
│       ├── functions-and-getting-help.ipynb
│       ├── lists.ipynb
│       ├── loops-and-list-comprehensions.ipynb
│       ├── object-oriented-programming-in-python.ipynb
│       ├── strings-and-dictionaries.ipynb
│       ├── syntax-variables-and-numbers.ipynb
│       └── working-with-external-libraries.ipynb
├── interview-prep/
│   └── README.md
├── introduction/
│   ├── README.md
│   ├── data/
│   │   └── visualizing-data.ipynb
│   ├── matlab_plot/
│   │   ├── jupyter/
│   │   │   ├── 1.line_plot.ipynb
│   │   │   ├── 2.line_plot.ipynb
│   │   │   ├── 3.scatter_plot.ipynb
│   │   │   ├── 4.scatter_plot.ipynb
│   │   │   ├── 5.histogram.ipynb
│   │   │   ├── 6.histogram_bin.ipynb
│   │   │   ├── 7.labels.ipynb
│   │   │   ├── 8.ticks.ipynb
│   │   │   └── 9.scatter_size.ipynb
│   │   └── python/
│   │       ├── 1.line_plot.py
│   │       ├── 10.colors.py
│   │       ├── 11.grid.py
│   │       ├── 2.line_plot.py
│   │       ├── 3.scatter_plot.py
│   │       ├── 4.scatter_plot.py
│   │       ├── 5.histogram.py
│   │       ├── 6.histogram_bin.py
│   │       ├── 7.labels.py
│   │       ├── 8.ticks.py
│   │       └── 9.scatter_size.py
│   └── numpy/
│       ├── jupyter/
│       │   ├── 1.array.ipynb
│       │   ├── 2.array_calculation.ipynb
│       │   └── 3.array_calculation.ipynb
│       └── python/
│           ├── 1.array.py
│           ├── 10.matrix_calculation.py
│           ├── 11.matrix_first_column.py
│           ├── 12.statistics.py
│           ├── 13.statistics_2.py
│           ├── 2.array_calculation.py
│           ├── 3.array_calculation.py
│           ├── 4.boolean_array.py
│           ├── 5.homogeneous_array.py
│           ├── 6.array_slice.py
│           ├── 7.array_shape.py
│           ├── 8.array_shape.py
│           └── 9.matrix.py
├── learning-path.md
├── math.md
├── papers/
│   ├── alphafold/
│   │   └── README.md
│   ├── artificial-intelligence-in-healthcare-past-present-and-future/
│   │   └── README.md
│   ├── highly-accurate protein-structure-prediction-with-alphafold/
│   │   └── README.md
│   └── sybil-a-validated-deep-learning-model-to-predict-future-lung-cancer-risk-from-a-single-low-dose/
│       └── index.md
├── projects/
│   ├── biomedicine/
│   │   └── learning-the-language-of-proteins/
│   │       └── data/
│   │           ├── CAFA3_targets.tgz
│   │           └── CAFA3_training_data.tgz
│   ├── classification/
│   │   └── svc-decision-tree-classifiers.ipynb
│   ├── pytorch/
│   │   ├── pytorch-computer-vision-exercises.ipynb
│   │   ├── pytorch-computer-vision.ipynb
│   │   ├── pytorch-custom-datasets.ipynb
│   │   ├── pytorch-fundamentals.ipynb
│   │   └── pytorch-neural-network-classification.ipynb
│   ├── regression/
│   │   └── house-price-regression-model.ipynb
│   └── rnn/
│       └── recurrent-neural-network-regression.ipynb
├── research/
│   ├── README.md
│   └── ideas.md
├── rosalind/
│   ├── README.md
│   ├── cons.py
│   ├── dna.py
│   ├── fib.py
│   ├── fibd.py
│   ├── gc.py
│   ├── hamm.py
│   ├── iev.py
│   ├── iprb.py
│   ├── prob.py
│   ├── prot.py
│   ├── prtm.py
│   ├── revc.py
│   ├── rna.py
│   └── subs.py
└── skills.md
Download .txt
SYMBOL INDEX (49 symbols across 17 files)

FILE: courses/diffusion-models/diffusion_utilities.py
  class ResidualConvBlock (line 13) | class ResidualConvBlock(nn.Module):
    method __init__ (line 14) | def __init__(
    method forward (line 39) | def forward(self, x: torch.Tensor) -> torch.Tensor:
    method get_out_channels (line 68) | def get_out_channels(self):
    method set_out_channels (line 72) | def set_out_channels(self, out_channels):
  class UnetUp (line 79) | class UnetUp(nn.Module):
    method __init__ (line 80) | def __init__(self, in_channels, out_channels):
    method forward (line 94) | def forward(self, x, skip):
  class UnetDown (line 103) | class UnetDown(nn.Module):
    method __init__ (line 104) | def __init__(self, in_channels, out_channels):
    method forward (line 114) | def forward(self, x):
  class EmbedFC (line 118) | class EmbedFC(nn.Module):
    method __init__ (line 119) | def __init__(self, input_dim, emb_dim):
    method forward (line 137) | def forward(self, x):
  function unorm (line 143) | def unorm(x):
  function norm_all (line 150) | def norm_all(store, n_t, n_s):
  function norm_torch (line 158) | def norm_torch(x_all):
  function gen_tst_context (line 169) | def gen_tst_context(n_cfeat):
  function plot_grid (line 183) | def plot_grid(x,n_sample,n_rows,save_dir,w):
  function plot_sample (line 191) | def plot_sample(x_gen_store,n_sample,nrows,save_dir, fn,  w, save=False):
  class CustomDataset (line 216) | class CustomDataset(Dataset):
    method __init__ (line 217) | def __init__(self, sfilename, lfilename, transform, null_context=False):
    method __len__ (line 228) | def __len__(self):
    method __getitem__ (line 232) | def __getitem__(self, idx):
    method getshapes (line 242) | def getshapes(self):

FILE: courses/genomic-data-science/algorithms-for-dna-sequencing/src/read_genome.py
  function read_genome (line 4) | def read_genome(filename):

FILE: courses/genomic-data-science/algorithms-for-dna-sequencing/src/reverse_complement.py
  function reverse_complement (line 1) | def reverse_complement(s):

FILE: rosalind/cons.py
  function build_matrix (line 175) | def build_matrix(dna_strings):
  function cons (line 193) | def cons():

FILE: rosalind/dna.py
  function test (line 4) | def test(dna):

FILE: rosalind/fib.py
  function rabbit_pairs (line 30) | def rabbit_pairs(n, k):

FILE: rosalind/fibd.py
  function fibd (line 3) | def fibd(n, m):

FILE: rosalind/gc.py
  function parse_fasta (line 3) | def parse_fasta(data):
  function gc_content (line 22) | def gc_content(sequence):
  function highest_gc_content (line 26) | def highest_gc_content(sequences):

FILE: rosalind/hamm.py
  function hamm (line 3) | def hamm(dna_string_1, dna_string_2):

FILE: rosalind/iev.py
  function parse_couples (line 3) | def parse_couples(string):
  function iev (line 6) | def iev(string):
  function iev_with_zip (line 19) | def iev_with_zip(string):

FILE: rosalind/iprb.py
  function iprb (line 3) | def iprb(AA, Aa, aa):

FILE: rosalind/prob.py
  function prob (line 3) | def prob(dna_string, A):

FILE: rosalind/prot.py
  function prot (line 91) | def prot(rna_string):
  function prot_list_comprehension (line 104) | def prot_list_comprehension(rna_string):

FILE: rosalind/prtm.py
  function parse_table (line 28) | def parse_table(monoisotopic_mass_table_string):
  function total_protein_weight (line 37) | def total_protein_weight(monoisotopic_mass_table, sample_dataset):

FILE: rosalind/revc.py
  function revc (line 17) | def revc(dna):

FILE: rosalind/rna.py
  function rna (line 16) | def rna(t):

FILE: rosalind/subs.py
  function subs (line 3) | def subs(s, t):
  function subs_list_comprehension (line 12) | def subs_list_comprehension(s, t):
Copy disabled (too large) Download .json
Condensed preview — 242 files, each showing path, character count, and a content snippet. Download the .json file for the full structured content (60,039K chars).
[
  {
    "path": ".gitignore",
    "chars": 29,
    "preview": ".ipynb_checkpoints\n.DS_Store\n"
  },
  {
    "path": ".prettierrc",
    "chars": 52,
    "preview": "{\n  \"singleQuote\": true,\n  \"trailingComma\": \"all\"\n}\n"
  },
  {
    "path": "FUNDING.yml",
    "chars": 57,
    "preview": "github: [imteekay]\ncustom: [https://teekay.substack.com]\n"
  },
  {
    "path": "LICENSE",
    "chars": 1054,
    "preview": "MIT License\n\nCopyright (c) TK\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this soft"
  },
  {
    "path": "README.md",
    "chars": 78204,
    "preview": "<samp>\n\n# ML Research\n\n## Table of Contents\n\n- [ML Research](#ml-research)\n  - [Table of Contents](#table-of-contents)\n "
  },
  {
    "path": "a-unified-theory-of-ai-in-biomedicine.md",
    "chars": 4429,
    "preview": "<samp>\n\n# A Unified Theory of AI in Biomedicine & Healthcare\n\n## Table of Contents\n\n- [A Unified Theory of AI in Biomedi"
  },
  {
    "path": "a-unified-theory-of-ml-ai.md",
    "chars": 71391,
    "preview": "<samp>\n\n# A Unified Theory of ML/AI\n\n## Table of Contents\n\n- [A Unified Theory of ML/AI](#a-unified-theory-of-mlai)\n  - "
  },
  {
    "path": "books/an-introduction-to-statistical-learning-wtih-applications-in-python/README.md",
    "chars": 192,
    "preview": "# An Introduction to Statistical Learning with Applications in Python\n\n- [Intro](introduction.md)\n- [Statistical Learnin"
  },
  {
    "path": "books/an-introduction-to-statistical-learning-wtih-applications-in-python/classification.md",
    "chars": 4287,
    "preview": "# Classification\n\n- For classification models, the response variable is a qualitative variable: yes/no, e.g. eye color {"
  },
  {
    "path": "books/an-introduction-to-statistical-learning-wtih-applications-in-python/introduction.md",
    "chars": 935,
    "preview": "# Introduction\n\n- A set of tools for understanding data\n  - **Supervised**: a statistical model for pre- dicting, or est"
  },
  {
    "path": "books/an-introduction-to-statistical-learning-wtih-applications-in-python/linear-regression.md",
    "chars": 4073,
    "preview": "# Linear Regression\n\n- Variables\n  - Dependent variables: the response variable that we want to predict based on the val"
  },
  {
    "path": "books/an-introduction-to-statistical-learning-wtih-applications-in-python/resampling-methods.md",
    "chars": 3549,
    "preview": "# Resampling Methods\n\n## Cross-validation and bootstrap\n\n- Refit a model to samples formed from the training set, in ord"
  },
  {
    "path": "books/an-introduction-to-statistical-learning-wtih-applications-in-python/selection-and-regularization.md",
    "chars": 410,
    "preview": "# Linear Model Selection and Regularization\n\nMethods to improve the model's performance:\n\n- **Subset Selection**: using "
  },
  {
    "path": "books/an-introduction-to-statistical-learning-wtih-applications-in-python/statistical-learning.md",
    "chars": 7781,
    "preview": "# Statistical Learning\n\n- We have `X` and Y as variables, `X` as a predictor variable and `Y` as a output value\n  - e.g."
  },
  {
    "path": "books/deep-learning-for-biology/README.md",
    "chars": 4372,
    "preview": "# Deep Learning for Biology\n\n## Success Criteria\n\n**Performance metric (e.g., accuracy, AUC, F1)**: You might aim to mat"
  },
  {
    "path": "books/deep-learning-for-biology/learning-the-language-of-proteins.md",
    "chars": 7489,
    "preview": "# Learning the Language of Proteins\n\n- A protein can be represented as a sequence of its constituent building blocks, ca"
  },
  {
    "path": "books/deep-learning-for-biology/learning-the-logic-of-dna.md",
    "chars": 3105,
    "preview": "# Learning the Logic of DNA\n\n- Based on a DNA sequence, predict if it's bound by transcrition factors (TFs — class of pr"
  },
  {
    "path": "books/machine-learning-system-design/README.md",
    "chars": 13690,
    "preview": "# Machine Learning System Design\n\n## Is there a problem?\n\n- Focus on the problem space before the solution space (implem"
  },
  {
    "path": "books/machine-learning-system-design/residual_analysis.ipynb",
    "chars": 106823,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"provenance\": []\n    },\n    \"kernelspec\":"
  },
  {
    "path": "books/mathematics-for-machine-learning/README.md",
    "chars": 938,
    "preview": "# Mathematics for Machine Learning\n\n## Mathematical Foundation\n\n### Introduction\n\n- Concepts of machine learning\n  - We "
  },
  {
    "path": "books/practical-statistics-for-data-scientists/README.md",
    "chars": 203,
    "preview": "# Practical Statistics for Data Scientists\n\n- [Exploratory Data Analysis](practical-statistics-exploratory-data-analysis"
  },
  {
    "path": "books/practical-statistics-for-data-scientists/data-and-sampling-distributions.ipynb",
    "chars": 53702,
    "preview": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "books/practical-statistics-for-data-scientists/practical-statistics-exploratory-data-analysis.ipynb",
    "chars": 530859,
    "preview": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "books/reinforcement-learning/README.md",
    "chars": 1799,
    "preview": "# Reinforcement Learning\n\n## Introduction\n\n- RL is about decision making\n  - The two most important features of RL: `tri"
  },
  {
    "path": "books/understanding-deep-learning/README.md",
    "chars": 40691,
    "preview": "# Understanding Deep Learning\n\n## Table of Contents\n\n- [Understanding Deep Learning](#understanding-deep-learning)\n  - ["
  },
  {
    "path": "cancer/README.md",
    "chars": 943,
    "preview": "# Cancer\n\nIt's a group of more than 100 different diseases.\n\n![how cancer works](images/cancer.png)\n\n- Cells are the bas"
  },
  {
    "path": "careers/README.md",
    "chars": 4090,
    "preview": "# Careers\n\n## What You Will Do\n\n- Make original research contributions to enable machine learning model development, app"
  },
  {
    "path": "courses/agentic-ai/README.md",
    "chars": 444,
    "preview": "# Agentic AI\n\n- [Reflection: Chart Generation](reflection.ipynb)\n- [External Feedback: SQL Generation](external-evaluati"
  },
  {
    "path": "courses/agentic-ai/email-assistant.ipynb",
    "chars": 43481,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"8b541d70c97713d9\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Agen"
  },
  {
    "path": "courses/agentic-ai/eval.ipynb",
    "chars": 38385,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"22711013\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Agentic AI -"
  },
  {
    "path": "courses/agentic-ai/external-evaluation.ipynb",
    "chars": 82504,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"1ef32438\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Agentic AI -"
  },
  {
    "path": "courses/agentic-ai/multi-agent-workflow.ipynb",
    "chars": 91121,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"c2c398ec\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Agentic AI -"
  },
  {
    "path": "courses/agentic-ai/planning-with-code.ipynb",
    "chars": 97960,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"b8b808d5\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Agentic AI -"
  },
  {
    "path": "courses/agentic-ai/reflection.ipynb",
    "chars": 1089965,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"38ae026d\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Agentic AI -"
  },
  {
    "path": "courses/agentic-ai/tools.ipynb",
    "chars": 61284,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"273a633870c44821\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Agen"
  },
  {
    "path": "courses/ai-for-medicine/ai-for-medical-diagnosis/README.md",
    "chars": 4883,
    "preview": "# AI for Medical Diagnosis\n\n## Disease Detection with Computer Vision\n\n### Notebooks\n\n- [Data Exploration and Image Pre-"
  },
  {
    "path": "courses/ai-for-medicine/ai-for-medical-diagnosis/ai-for-medicine-densenet.ipynb",
    "chars": 63892,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/ai-for-medicine/ai-for-medical-diagnosis/ai-for-medicine-diagnosis-counting-labels-and-we.ipynb",
    "chars": 87720,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/ai-for-medicine/ai-for-medical-diagnosis/ai-for-medicine-patient-overlap-and-data-leakage.ipynb",
    "chars": 26120,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/ai-for-medicine/ai-for-medical-diagnosis/chest-x-ray-medical-diagnosis-with-deep-learning.ipynb",
    "chars": 1618065,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"a7c9d57a\",\n   \"metadata\": {\n    \"_cell_guid"
  },
  {
    "path": "courses/ai-for-medicine/ai-for-medical-diagnosis/data-exploration-and-image-pre-processing.ipynb",
    "chars": 1273556,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/attention-in-transformers/README.md",
    "chars": 3774,
    "preview": "# Attention in Transformers\n\n## The main ideas\n\n- `Tokenization`: breaks the input and separate it into tokens (IDs from"
  },
  {
    "path": "courses/attention-in-transformers/encoder-decoder-attention.ipynb",
    "chars": 6776,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"9f40c232-df6e-49df-9016-6459e4af2e1e\",\n   \"metadata\": {\n    \"tag"
  },
  {
    "path": "courses/attention-in-transformers/masked-self-attention-pytorch.ipynb",
    "chars": 17682,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"id\": \"9f40c232-df6e-49df-9016-6459e4af2e1e\",\n      \"metadata\""
  },
  {
    "path": "courses/attention-in-transformers/next-token-prediction.ipynb",
    "chars": 170409,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Next Token Prediction\"\n   ]\n  },\n"
  },
  {
    "path": "courses/attention-in-transformers/self-attention-pytorch.ipynb",
    "chars": 11606,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"9f40c232-df6e-49df-9016-6459e4af2e1e\",\n   \"metadata\": {\n    \"tag"
  },
  {
    "path": "courses/attention-in-transformers/tokenization.ipynb",
    "chars": 590147,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"g_a9QvUFVCUR\"\n   },\n   \"source\": [\n    \"# LLM "
  },
  {
    "path": "courses/attention-in-transformers/transformers-from-scratch.ipynb",
    "chars": 5207518,
    "preview": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/data-visualization/README.md",
    "chars": 2020,
    "preview": "# Data Visualization\n\n- [Seaborn](seaborn.ipynb)\n- [Line Charts](line-charts.ipynb)\n- [Bar Charts and Heatmaps](bar-char"
  },
  {
    "path": "courses/data-visualization/bar-charts-and-heatmaps.ipynb",
    "chars": 299043,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/data-visualization/choosing-plot-types-and-custom-styles.ipynb",
    "chars": 174500,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/data-visualization/data-visualization-final-project.ipynb",
    "chars": 63382,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/data-visualization/distributions.ipynb",
    "chars": 78587,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/data-visualization/line-charts.ipynb",
    "chars": 198945,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/data-visualization/scatter-plots.ipynb",
    "chars": 208371,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/data-visualization/seaborn.ipynb",
    "chars": 341617,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/diffusion-models/README.md",
    "chars": 251,
    "preview": "# Diffusion Models\n\n- [Denoise & Add Noise](denoise-and-add-noise.ipynb)\n- [Training UNet](training-unet.ipynb)\n- [Contr"
  },
  {
    "path": "courses/diffusion-models/controlling-model-generation.ipynb",
    "chars": 2657740,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"958524a2-cb56-439e-850e-032dd10478f2\",\n   \"metadata\": {},\n   \"so"
  },
  {
    "path": "courses/diffusion-models/ddim-vs-ddpm-faster-sampling.ipynb",
    "chars": 3793294,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"5f912400\",\n   \"metadata\": {},\n   \"source\": [\n    \"# DDIM vs DDPM"
  },
  {
    "path": "courses/diffusion-models/denoise-and-add-noise.ipynb",
    "chars": 5123044,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"id\": \"024a10c2-6511-4bbf-a789-a12952d57988\",\n   \"metadata\": {},\n   \"so"
  },
  {
    "path": "courses/diffusion-models/diffusion_utilities.py",
    "chars": 9929,
    "preview": "import torch\nimport torch.nn as nn\nimport numpy as np\nfrom torchvision.utils import save_image, make_grid\nimport matplot"
  },
  {
    "path": "courses/gen-ai/README.md",
    "chars": 2863,
    "preview": "# Gen AI\n\n## Foundational Large Language Models & Text Generation\n\n- [Prompt Engineering](prompt-engineering.ipynb)\n  - "
  },
  {
    "path": "courses/gen-ai/building-an-agent-with-langgraph.ipynb",
    "chars": 153601,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/gen-ai/classifying-embeddings-with-keras.ipynb",
    "chars": 38515,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/gen-ai/document-q-a-with-rag.ipynb",
    "chars": 18669,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/gen-ai/embeddings-and-similarity-scores.ipynb",
    "chars": 113546,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/gen-ai/evaluation-and-structured-output.ipynb",
    "chars": 65707,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Copyright 2025 Google LLC.\"\n   ]\n  "
  },
  {
    "path": "courses/gen-ai/fine-tuning-a-custom-model.ipynb",
    "chars": 53964,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"b6e13eef3f5d\"\n   },\n   \"source\": [\n    \"##### "
  },
  {
    "path": "courses/gen-ai/function-calling-with-the-gemini-api.ipynb",
    "chars": 40234,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/gen-ai/google-search-grounding.ipynb",
    "chars": 99374,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/gen-ai/prompt-engineering.ipynb",
    "chars": 77409,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"id\": \"jkxRSYjzA1oX\"\n   },\n   \"source\": [\n    \"##### "
  },
  {
    "path": "courses/genomic-data-science/algorithms-for-dna-sequencing/README.md",
    "chars": 11285,
    "preview": "# Algorithms for DNA Sequencing\n\nNext Generation Sequencing (NGS) in 2007 (second generation sequencing or massive paral"
  },
  {
    "path": "courses/genomic-data-science/algorithms-for-dna-sequencing/fasta/lambda_virus.fa",
    "chars": 49270,
    "preview": ">gi|9626243|ref|NC_001416.1| Enterobacteria phage lambda, complete genome\nGGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGT"
  },
  {
    "path": "courses/genomic-data-science/algorithms-for-dna-sequencing/src/read_genome.py",
    "chars": 554,
    "preview": "import os\nimport collections\n\ndef read_genome(filename):\n  genome = []\n  filename_path = os.path.join('courses', 'genomi"
  },
  {
    "path": "courses/genomic-data-science/algorithms-for-dna-sequencing/src/reverse_complement.py",
    "chars": 143,
    "preview": "def reverse_complement(s):\n  complement = { 'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A' }\n  return ''.join(complement[char] f"
  },
  {
    "path": "courses/genomic-data-science/introduction-genomics/README.md",
    "chars": 3395,
    "preview": "# Introduction to Genomics\n\n- Genomics is the study of genomes\n\n![](./images/001.png)\n\n- DNA -> RNA -> Protein\n- Messeng"
  },
  {
    "path": "courses/genomic-data-science/introduction-genomics/quizz-001.md",
    "chars": 1954,
    "preview": "# Quizz\n\n## Question 1\n\nThe central dogma of molecular biology tells us that information is passed from\n\n- 1 RNA to DNA "
  },
  {
    "path": "courses/introduction-to-deep-learning/README.md",
    "chars": 5969,
    "preview": "# Introduction to Deep Learning\n\n- [Introduction to Deep Learning Course](https://www.edx.org/learn/engineering/purdue-u"
  },
  {
    "path": "courses/introduction-to-machine-learning/README.md",
    "chars": 141,
    "preview": "# Introduction to Machine Learning\n\n## Week 1\n\n- [Logistic Regression](logistic-regression)\n- [Multilayer Perceptron](mu"
  },
  {
    "path": "courses/introduction-to-machine-learning/logistic-regression/README.md",
    "chars": 3709,
    "preview": "# Logistic Regression\n\n## Why Machine Learning Is Exciting\n\nMachine learning and deep learning are now solving very comp"
  },
  {
    "path": "courses/introduction-to-machine-learning/multilayer-perceptron/README.md",
    "chars": 3413,
    "preview": "# Multilayer Perceptron\n\nIn logistic regression, the predictive model does a multiplication of each feature with paramet"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/1D-tensor.ipynb",
    "chars": 156635,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<h1>Torch Tensors in 1D</h1>\\n\"\n   "
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/2D-tensor.ipynb",
    "chars": 24816,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/README.md",
    "chars": 3444,
    "preview": "# Introduction to Neural Networks and PyTorch\n\n- [1D Tensor](1D-tensor.ipynb)\n- [2D Tensor](2D-tensor.ipynb)\n- [Differen"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/activation-functions-and-max-pooling-in-cnn.ipynb",
    "chars": 10662,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/activation-functions-on-mnist.ipynb",
    "chars": 107507,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/activation-functions.ipynb",
    "chars": 200515,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/batch-normalization.ipynb",
    "chars": 86853,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/best-practices-for-model-training.md",
    "chars": 962,
    "preview": "# Best Practices for Training Linear Regression Models\n\n- **Learning Rate**: Setting a moderate initial learning rate (e"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/cnn-for-small-image.ipynb",
    "chars": 663400,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/computer-vision-with-pytorch.ipynb",
    "chars": 39103,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"ed731dc9\",\n   \"metadata\": {\n    \"_cell_guid"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/convolution-neural-network.ipynb",
    "chars": 20155,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/convolutional-neural-network-for-anime-image-class.ipynb",
    "chars": 4020759,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/convolutional-neural-network-with-batch-normalization.ipynb",
    "chars": 92299,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/core-neural-network-components.ipynb",
    "chars": 13540,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"63ef17fe\",\n   \"metadata\": {\n    \"_cell_guid"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/data-management-in-pytorch.ipynb",
    "chars": 613798,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"77a45e08\",\n   \"metadata\": {\n    \"_cell_guid"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/deep-learning-with-pytorch.ipynb",
    "chars": 318480,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"id\": \"e98c2ab3\",\n   \"metadata\": {\n    \"_cell_guid"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/deep-neural-network-for-breast-cancer-classification.ipynb",
    "chars": 289960,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/deep-neural-networks.ipynb",
    "chars": 105194,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/deeper-neural-networks-with-nn-modulelist.ipynb",
    "chars": 374516,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/derivatives.ipynb",
    "chars": 75403,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/different-parameter-initialization.ipynb",
    "chars": 82736,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/dropout-neural-net.ipynb",
    "chars": 292585,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/dropout-regression.ipynb",
    "chars": 204750,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/fashion-mnist.ipynb",
    "chars": 102065,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/he-parameter-initialization.ipynb",
    "chars": 99346,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/initialization-with-same-weights.ipynb",
    "chars": 298627,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/linear-regression-training-one-parameter.ipynb",
    "chars": 458724,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/linear-regression-training.ipynb",
    "chars": 1460462,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/linear_regression_model.ipynb",
    "chars": 24072,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<h1>Linear Regression 1D: Predictio"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/linear_regression_with_multiple_outputs.ipynb",
    "chars": 7832,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/logistic-regression-and-bad-initialization-value.ipynb",
    "chars": 528619,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<h1>Logistic Regression and Bad Ini"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/logistic-regression-cross-entropy.ipynb",
    "chars": 555717,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<h1>Logistic Regression Cross Entro"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/logistic_regression.ipynb",
    "chars": 57012,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/mini_batch_gradient_descent.ipynb",
    "chars": 4664014,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/mini_batch_gradient_descent_pytorch.ipynb",
    "chars": 760020,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/mnist-softmax.ipynb",
    "chars": 291947,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/mnist_vision_transform.ipynb",
    "chars": 85218,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/momentum-with-polynomial-functions.ipynb",
    "chars": 252278,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/multi-class-neural-networks-with-mnist.ipynb",
    "chars": 100884,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/multiple-channel-convolution.ipynb",
    "chars": 224912,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/multiple_linear_regression.ipynb",
    "chars": 14213,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/multiple_linear_regression_training.ipynb",
    "chars": 503814,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/neural-network-with-momentum.ipynb",
    "chars": 838458,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/neural-network-with-multiple-neurons.ipynb",
    "chars": 145487,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/neural-networks-with-multiple-hidden-layers.ipynb",
    "chars": 203458,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/simple-convolutional-neural-network.ipynb",
    "chars": 179965,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/small-neural-network.ipynb",
    "chars": 388010,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/softmax-classifier-1d.ipynb",
    "chars": 224566,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/stochastic_gradient_descent.ipynb",
    "chars": 3201024,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/training_and_validation_data.ipynb",
    "chars": 116318,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/training_multiple_output_linear_regression.ipynb",
    "chars": 31257,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/transform.ipynb",
    "chars": 31456,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/introduction-to-neural-networks-and-pytorch/vision_transform.ipynb",
    "chars": 136113,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/README.md",
    "chars": 423,
    "preview": "# Kaggle Intermediate Machine Learning\n\nThe intermediate ML course by kaggle: https://www.kaggle.com/learn/intermediate-"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/categorical-variables.ipynb",
    "chars": 28018,
    "preview": "{\"metadata\":{\"kernelspec\":{\"name\":\"python3\",\"display_name\":\"Python 3\",\"language\":\"python\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/cross-validation.ipynb",
    "chars": 46211,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/data-leakage.ipynb",
    "chars": 9492,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/intro-house-pricing.ipynb",
    "chars": 15514,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/missing-values.ipynb",
    "chars": 24809,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/pipeline.ipynb",
    "chars": 14787,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intermdiate-ml/xgboost.ipynb",
    "chars": 18112,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intro-to-ml/README.md",
    "chars": 318,
    "preview": "# Kaggle Introduction to Machine Learning\n\n- [Explore Data](explore-data.ipynb)\n- [House Price Decision Tree Regressor]("
  },
  {
    "path": "courses/kaggle-intro-to-ml/explore-data.ipynb",
    "chars": 13594,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intro-to-ml/house-price-decision-tree-regressor.ipynb",
    "chars": 11876,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intro-to-ml/model-validation.ipynb",
    "chars": 13226,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intro-to-ml/random-forests.ipynb",
    "chars": 4604,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/kaggle-intro-to-ml/underfitting-and-overfitting.ipynb",
    "chars": 5984,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/language-modeling-from-scratch/README.md",
    "chars": 1688,
    "preview": "# Language Modeling from Scratch\n\n[CS336: Language Modeling from Scratch](https://stanford-cs336.github.io/spring2024)\n\n"
  },
  {
    "path": "courses/machine-learning-for-health-predictions/README.md",
    "chars": 2828,
    "preview": "# Machine Learning for Health Predictions\n\n## Pre-processing\n\n### Reasons for Poor Algorithm Performance\n\n- Inadequate e"
  },
  {
    "path": "courses/machine-learning-with-python/README.md",
    "chars": 5426,
    "preview": "# Machine Learning with Python\n\n## Machine Learning Algorithms\n\n- Supervised Learning\n- Unsupervised Learning\n- Reinforc"
  },
  {
    "path": "courses/math-for-machine-learning-with-python/001-intro-to-equations.py",
    "chars": 98,
    "preview": "x = -41\nx + 16 == -25 # True\n\nx = 45\nx / 3 + 1 == 16 # True\n\nx = 1.5\n3 * x + 2 == 5 * x -1 # True\n"
  },
  {
    "path": "courses/math-for-machine-learning-with-python/002-linear-equations.py",
    "chars": 705,
    "preview": "import pandas as pd\nfrom matplotlib import pyplot as plt\n\ndf = pd.DataFrame({'x': range(-10, 11)})\n\n# only displaying th"
  },
  {
    "path": "courses/math-for-machine-learning-with-python/003-systems-of-equations.py",
    "chars": 393,
    "preview": "from matplotlib import pyplot as plt\n\n# get the extremes for number of chips\nchipsAll10s = [16, 0]\nchipsAll25s = [0, 16]"
  },
  {
    "path": "courses/math-for-machine-learning-with-python/README.md",
    "chars": 66395,
    "preview": "# Math for Machine Learning with Python\n\n- [Algebra Fundamentals](#algebra-fundamentals-equations-graphs-and-functions)\n"
  },
  {
    "path": "courses/ml-for-computational-biology/README.md",
    "chars": 2174,
    "preview": "# ML for Computational Biology\n\n## Why Computational Biology?\n\n- High volume of data\n- The iterative process of hypothes"
  },
  {
    "path": "courses/ml-in-healthcare/README.md",
    "chars": 6501,
    "preview": "# ML in Healthcare\n\n## Outline to curso\n\n- Capacidade preditiva: acurácia das decisões\n- Intro to R and Python\n- Machine"
  },
  {
    "path": "courses/multimodal-machine-learning/README.md",
    "chars": 505,
    "preview": "# Multimodal Machine Learning\n\n- [Multimodal Machine Learning](#multimodal-machine-learning)\n  - [Introduction](#introdu"
  },
  {
    "path": "courses/pyspark/learning_spark.ipynb",
    "chars": 354330,
    "preview": "{\n  \"cells\": [\n    {\n      \"cell_type\": \"markdown\",\n      \"metadata\": {\n        \"id\": \"KccqfD8ujL3K\"\n      },\n      \"sou"
  },
  {
    "path": "courses/python/README.md",
    "chars": 527,
    "preview": "# Python\n\n- [Syntax, Variables, and Numbers](syntax-variables-and-numbers.ipynb)\n- [Functions and Getting Help](function"
  },
  {
    "path": "courses/python/booleans-and-conditionals.ipynb",
    "chars": 26020,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/python/functions-and-getting-help.ipynb",
    "chars": 8990,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/python/lists.ipynb",
    "chars": 12539,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/python/loops-and-list-comprehensions.ipynb",
    "chars": 16793,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/python/object-oriented-programming-in-python.ipynb",
    "chars": 8833,
    "preview": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/python/strings-and-dictionaries.ipynb",
    "chars": 15935,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/python/syntax-variables-and-numbers.ipynb",
    "chars": 18042,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "courses/python/working-with-external-libraries.ipynb",
    "chars": 116233,
    "preview": "{\"metadata\":{\"kernelspec\":{\"display_name\":\"Python 3\",\"language\":\"python\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "interview-prep/README.md",
    "chars": 920,
    "preview": "# Interview Prep\n\n## Tips to how to prepare\n\n- It’s important to be able to explain algorithms and ML concepts at two le"
  },
  {
    "path": "introduction/README.md",
    "chars": 62,
    "preview": "# Introduction\n\n- [Matlab Plot](matlab_plot)\n- [Numpy](numpy)\n"
  },
  {
    "path": "introduction/data/visualizing-data.ipynb",
    "chars": 5296,
    "preview": "{\"metadata\":{\"kernelspec\":{\"language\":\"python\",\"display_name\":\"Python 3\",\"name\":\"python3\"},\"language_info\":{\"name\":\"pyth"
  },
  {
    "path": "introduction/matlab_plot/jupyter/1.line_plot.ipynb",
    "chars": 12494,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/2.line_plot.ipynb",
    "chars": 64094,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/3.scatter_plot.ipynb",
    "chars": 20682,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/4.scatter_plot.ipynb",
    "chars": 13851,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/5.histogram.ipynb",
    "chars": 7766,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/6.histogram_bin.ipynb",
    "chars": 14260,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/7.labels.ipynb",
    "chars": 27037,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/8.ticks.ipynb",
    "chars": 22407,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/matlab_plot/jupyter/9.scatter_size.ipynb",
    "chars": 72,
    "preview": "{\n \"cells\": [],\n \"metadata\": {},\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "introduction/matlab_plot/python/1.line_plot.py",
    "chars": 424,
    "preview": "# defining years and population lists\nyear = [1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]\npopulation = [2.53, 2.8, 3"
  },
  {
    "path": "introduction/matlab_plot/python/10.colors.py",
    "chars": 6961,
    "preview": "# defining \"gross_domestic_product_per_capita\" and \"life_expectancy\" lists\ngross_domestic_product_per_capita = [\n    974"
  },
  {
    "path": "introduction/matlab_plot/python/11.grid.py",
    "chars": 7081,
    "preview": "# defining \"gross_domestic_product_per_capita\" and \"life_expectancy\" lists\ngross_domestic_product_per_capita = [\n    974"
  },
  {
    "path": "introduction/matlab_plot/python/2.line_plot.py",
    "chars": 3609,
    "preview": "# defining \"gross_domestic_product_per_capita\" and \"life_expectancy\" lists\ngross_domestic_product_per_capita = [\n    974"
  },
  {
    "path": "introduction/matlab_plot/python/3.scatter_plot.py",
    "chars": 3612,
    "preview": "# defining \"gross_domestic_product_per_capita\" and \"life_expectancy\" lists\ngross_domestic_product_per_capita = [\n    974"
  },
  {
    "path": "introduction/matlab_plot/python/4.scatter_plot.py",
    "chars": 3017,
    "preview": "# defining \"population\" and \"life_expectancy\" lists\npopulation = [\n    31.889923, 3.600523, 33.333216, 12.420476, 40.301"
  },
  {
    "path": "introduction/matlab_plot/python/5.histogram.py",
    "chars": 1475,
    "preview": "# define life_expectancy list\nlife_expectancy = [\n    43.828, 76.423, 72.301, 42.731, 75.32, 81.235, 79.829, 75.635, 64."
  },
  {
    "path": "introduction/matlab_plot/python/6.histogram_bin.py",
    "chars": 1562,
    "preview": "# define life_expectancy list\nlife_expectancy = [\n    43.828, 76.423, 72.301, 42.731, 75.32, 81.235, 79.829, 75.635, 64."
  },
  {
    "path": "introduction/matlab_plot/python/7.labels.py",
    "chars": 3838,
    "preview": "# defining \"gross_domestic_product_per_capita\" and \"life_expectancy\" lists\ngross_domestic_product_per_capita = [\n    974"
  },
  {
    "path": "introduction/matlab_plot/python/8.ticks.py",
    "chars": 4003,
    "preview": "# defining \"gross_domestic_product_per_capita\" and \"life_expectancy\" lists\ngross_domestic_product_per_capita = [\n    974"
  },
  {
    "path": "introduction/matlab_plot/python/9.scatter_size.py",
    "chars": 5711,
    "preview": "# defining \"gross_domestic_product_per_capita\" and \"life_expectancy\" lists\ngross_domestic_product_per_capita = [\n    974"
  },
  {
    "path": "introduction/numpy/jupyter/1.array.ipynb",
    "chars": 1179,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/numpy/jupyter/2.array_calculation.ipynb",
    "chars": 1856,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n"
  },
  {
    "path": "introduction/numpy/jupyter/3.array_calculation.ipynb",
    "chars": 2066,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n "
  },
  {
    "path": "introduction/numpy/python/1.array.py",
    "chars": 230,
    "preview": "baseball = [180, 215, 210, 210, 188, 176, 209, 200]\n\n# import numpy as np\nimport numpy as np\n\n# Create a numpy array bas"
  },
  {
    "path": "introduction/numpy/python/10.matrix_calculation.py",
    "chars": 439,
    "preview": "# baseball is available as a regular list of lists\n# updated is available as 2D numpy array\n\n# Import numpy package\nimpo"
  },
  {
    "path": "introduction/numpy/python/11.matrix_first_column.py",
    "chars": 254,
    "preview": "# np_baseball is available\n\n# Import numpy\nimport numpy as np\n\n# Create np_height from np_baseball\nnp_height = np_baseba"
  },
  {
    "path": "introduction/numpy/python/12.statistics.py",
    "chars": 627,
    "preview": "# np_baseball is available\n\n# Import numpy\nimport numpy as np\n\n# get only height data (first column)\nnp_height = np_base"
  },
  {
    "path": "introduction/numpy/python/13.statistics_2.py",
    "chars": 676,
    "preview": "# heights and positions are available as lists\n\n# Import numpy\nimport numpy as np\n\n# Convert positions and heights to nu"
  },
  {
    "path": "introduction/numpy/python/2.array_calculation.py",
    "chars": 359,
    "preview": "# height is available as a regular list\nheight = [43, 53, 65, 54, 62, 99]\n\n# Import numpy\nimport numpy as np\n\n# Create a"
  }
]

// ... and 42 more files (download for full content)
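
Many previews above are cut off mid-line by the snippet limit; for example, src/read_genome.py under algorithms-for-dna-sequencing stops mid-path. As a rough sketch of the usual shape of such a FASTA reader, assuming the format shown in the lambda_virus.fa preview (a '>' header line followed by sequence lines); this is not the repository's implementation:

# Hedged sketch of a FASTA reader in the spirit of the truncated
# read_genome.py preview; details are assumptions, not the repo's code.
def read_genome(filename):
    genome = []
    with open(filename) as f:
        for line in f:
            # FASTA header lines start with '>'; everything else is sequence.
            if not line.startswith('>'):
                genome.append(line.rstrip())
    return ''.join(genome)

# Example (path is hypothetical):
# genome = read_genome('fasta/lambda_virus.fa')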

About this extraction

This page contains the full source code of the imteekay/machine-learning-research GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 242 files (122.7 MB), approximately 14.6M tokens, and a symbol index with 49 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
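
The condensed listing above mirrors the downloadable .json: an array of objects with path, chars, and preview fields. A minimal sketch of consuming that manifest follows; the filename manifest.json is an assumption, so use whatever name the downloaded file has.

import json

# Load the extraction manifest (filename is an assumption).
with open('manifest.json') as f:
    entries = json.load(f)

# Each entry carries "path", "chars", and "preview", as in the listing above.
notebooks = [e for e in entries if e['path'].endswith('.ipynb')]
largest = max(entries, key=lambda e: e['chars'])
print(len(notebooks), 'notebooks; largest file:', largest['path'])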

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
