Full Code of amueller/scipy-2016-sklearn for AI

master 38caa5ac44c9 cached

68 files

350.7 KB

101.4k tokens

30 symbols

1 requests

Download .txt

Showing preview only (373K chars total). Download the full file or copy to clipboard to get everything.

Repository: amueller/scipy-2016-sklearn
Branch: master
Commit: 38caa5ac44c9
Files: 68
Total size: 350.7 KB

Directory structure:
gitextract_9qj13mdv/

├── .gitignore
├── LICENSE
├── README.md
├── abstract.rst
├── check_env.ipynb
├── fetch_data.py
├── notebooks/
│   ├── 01 Introduction to Machine Learning.ipynb
│   ├── 02 Scientific Computing Tools in Python.ipynb
│   ├── 03 Data Representation for Machine Learning.ipynb
│   ├── 04 Training and Testing Data.ipynb
│   ├── 05 Supervised Learning - Classification.ipynb
│   ├── 06 Supervised Learning - Regression.ipynb
│   ├── 07 Unsupervised Learning - Transformations and Dimensionality Reduction.ipynb
│   ├── 08 Unsupervised Learning - Clustering.ipynb
│   ├── 09 Review of Scikit-learn API.ipynb
│   ├── 10 Case Study - Titanic Survival.ipynb
│   ├── 11 Text Feature Extraction.ipynb
│   ├── 12 Case Study - SMS Spam Detection.ipynb
│   ├── 13 Cross Validation.ipynb
│   ├── 14 Model Complexity and GridSearchCV.ipynb
│   ├── 15 Pipelining Estimators.ipynb
│   ├── 16 Performance metrics and Model Evaluation.ipynb
│   ├── 17 In Depth - Linear Models.ipynb
│   ├── 18 In Depth - Support Vector Machines.ipynb
│   ├── 19 In Depth - Trees and Forests.ipynb
│   ├── 20 Feature Selection.ipynb
│   ├── 21 Unsupervised learning - Hierarchical and density-based clustering algorithms.ipynb
│   ├── 22 Unsupervised learning - Non-linear dimensionality reduction.ipynb
│   ├── 23 Out-of-core Learning Large Scale Text Classification.ipynb
│   ├── figures/
│   │   ├── ML_flow_chart.py
│   │   ├── __init__.py
│   │   ├── plot_2d_separator.py
│   │   ├── plot_digits_dataset.py
│   │   ├── plot_helpers.py
│   │   ├── plot_interactive_forest.py
│   │   ├── plot_interactive_tree.py
│   │   ├── plot_kneighbors_regularization.py
│   │   ├── plot_linear_svc_regularization.py
│   │   ├── plot_pca.py
│   │   ├── plot_rbf_svm_parameters.py
│   │   └── plot_scaling.py
│   ├── helpers.py
│   └── solutions/
│       ├── 03A_faces_plot.py
│       ├── 04_wrong-predictions.py
│       ├── 05A_knn_with_diff_k.py
│       ├── 06A_knn_vs_linreg.py
│       ├── 07A_iris-pca.py
│       ├── 08B_digits_clustering.py
│       ├── 10_titanic.py
│       ├── 11_ngrams.py
│       ├── 12A_tfidf.py
│       ├── 12B_vectorizer_params.py
│       ├── 13_cross_validation.py
│       ├── 14_grid_search.py
│       ├── 15A_ridge_grid.py
│       ├── 16A_avg_per_class_acc.py
│       ├── 17A_logreg_grid.py
│       ├── 17B_learning_curve_alpha.py
│       ├── 18_svc_grid.py
│       ├── 19_gbc_grid.py
│       ├── 20_univariate_vs_mb_selection.py
│       ├── 21_clustering_comparison.py
│       ├── 22A_isomap_digits.py
│       ├── 22B_tsne_classification.py
│       └── 23_batchtrain.py
├── requirements.txt
├── slides/
│   └── scipy2016.pptx
└── todo.rst

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# exlude datasets and externals
notebooks/datasets
notebooks/joblib/

# exclude temporary files
.ipynb_checkpoints
.DS_Store
gmon.out
__pycache__
*.pyc
*.o
*.so
*.gcno
*.swp
*.egg-info
*.egg
*~
build
dist
lib/test
doc/_build
*env
*ENV
.idea


================================================
FILE: LICENSE
================================================
CC0 1.0 Universal

Statement of Purpose

The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator and
subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").

Certain owners wish to permanently relinquish those rights to a Work for the
purpose of contributing to a commons of creative, cultural and scientific
works ("Commons") that the public can reliably and without fear of later
claims of infringement build upon, modify, incorporate in other works, reuse
and redistribute as freely as possible in any form whatsoever and for any
purposes, including without limitation commercial purposes. These owners may
contribute to the Commons to promote the ideal of a free culture and the
further production of creative, cultural and scientific works, or to gain
reputation or greater distribution for their Work in part through the use and
efforts of others.

For these and/or other purposes and motivations, and without any expectation
of additional consideration or compensation, the person associating CC0 with a
Work (the "Affirmer"), to the extent that he or she is an owner of Copyright
and Related Rights in the Work, voluntarily elects to apply CC0 to the Work
and publicly distribute the Work under its terms, with knowledge of his or her
Copyright and Related Rights in the Work and the meaning and intended legal
effect of CC0 on those rights.

1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not limited
to, the following:

  i. the right to reproduce, adapt, distribute, perform, display, communicate,
  and translate a Work;

  ii. moral rights retained by the original author(s) and/or performer(s);

  iii. publicity and privacy rights pertaining to a person's image or likeness
  depicted in a Work;

  iv. rights protecting against unfair competition in regards to a Work,
  subject to the limitations in paragraph 4(a), below;

  v. rights protecting the extraction, dissemination, use and reuse of data in
  a Work;

  vi. database rights (such as those arising under Directive 96/9/EC of the
  European Parliament and of the Council of 11 March 1996 on the legal
  protection of databases, and under any national implementation thereof,
  including any amended or successor version of such directive); and

  vii. other similar, equivalent or corresponding rights throughout the world
  based on applicable law or treaty, and any national implementations thereof.

2. Waiver. To the greatest extent permitted by, but not in contravention of,
applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and
unconditionally waives, abandons, and surrenders all of Affirmer's Copyright
and Related Rights and associated claims and causes of action, whether now
known or unknown (including existing as well as future claims and causes of
action), in the Work (i) in all territories worldwide, (ii) for the maximum
duration provided by applicable law or treaty (including future time
extensions), (iii) in any current or future medium and for any number of
copies, and (iv) for any purpose whatsoever, including without limitation
commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes
the Waiver for the benefit of each member of the public at large and to the
detriment of Affirmer's heirs and successors, fully intending that such Waiver
shall not be subject to revocation, rescission, cancellation, termination, or
any other legal or equitable action to disrupt the quiet enjoyment of the Work
by the public as contemplated by Affirmer's express Statement of Purpose.

3. Public License Fallback. Should any part of the Waiver for any reason be
judged legally invalid or ineffective under applicable law, then the Waiver
shall be preserved to the maximum extent permitted taking into account
Affirmer's express Statement of Purpose. In addition, to the extent the Waiver
is so judged Affirmer hereby grants to each affected person a royalty-free,
non transferable, non sublicensable, non exclusive, irrevocable and
unconditional license to exercise Affirmer's Copyright and Related Rights in
the Work (i) in all territories worldwide, (ii) for the maximum duration
provided by applicable law or treaty (including future time extensions), (iii)
in any current or future medium and for any number of copies, and (iv) for any
purpose whatsoever, including without limitation commercial, advertising or
promotional purposes (the "License"). The License shall be deemed effective as
of the date CC0 was applied by Affirmer to the Work. Should any part of the
License for any reason be judged legally invalid or ineffective under
applicable law, such partial invalidity or ineffectiveness shall not
invalidate the remainder of the License, and in such case Affirmer hereby
affirms that he or she will not (i) exercise any of his or her remaining
Copyright and Related Rights in the Work or (ii) assert any associated claims
and causes of action with respect to the Work, in either case contrary to
Affirmer's express Statement of Purpose.

4. Limitations and Disclaimers.

  a. No trademark or patent rights held by Affirmer are waived, abandoned,
  surrendered, licensed or otherwise affected by this document.

  b. Affirmer offers the Work as-is and makes no representations or warranties
  of any kind concerning the Work, express, implied, statutory or otherwise,
  including without limitation warranties of title, merchantability, fitness
  for a particular purpose, non infringement, or the absence of latent or
  other defects, accuracy, or the present or absence of errors, whether or not
  discoverable, all to the greatest extent permissible under applicable law.

  c. Affirmer disclaims responsibility for clearing rights of other persons
  that may apply to the Work or any use thereof, including without limitation
  any person's Copyright and Related Rights in the Work. Further, Affirmer
  disclaims responsibility for obtaining any necessary consents, permissions
  or other rights required for any use of the Work.

  d. Affirmer understands and acknowledges that Creative Commons is not a
  party to this document and has no duty or obligation with respect to this
  CC0 or use of the Work.

For more information, please see
<http://creativecommons.org/publicdomain/zero/1.0/>


================================================
FILE: README.md
================================================
SciPy 2016 Scikit-learn Tutorial
================================

Based on the SciPy [2015 tutorial](https://github.com/amueller/scipy_2015_sklearn_tutorial) by [Kyle Kastner](https://kastnerkyle.github.io/) and [Andreas Mueller](http://amueller.github.io).


Instructors
-----------

- [Sebastian Raschka](http://sebastianraschka.com)  [@rasbt](https://twitter.com/rasbt) - Michigan State University, Computational Biology;  [Book: Python Machine Learning](https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130/ref=sr_1_1?ie=UTF8&qid=1468347805&sr=8-1&keywords=sebastian+raschka)
- [Andreas Mueller](http://amuller.github.io) [@amuellerml](https://twitter.com/t3kcit) - NYU Center for Data Science; [Book: Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)

---

**The video recording of the tutorial is now available via YouTube:**

- [Part 1](https://www.youtube.com/watch?list=PLYx7XA2nY5Gf37zYZMw6OqGFRPjB1jCy6&v=OB1reY6IX-o)
- [Part 2](https://www.youtube.com/watch?v=Cte8FYCpylk&list=PLYx7XA2nY5Gf37zYZMw6OqGFRPjB1jCy6&index=90)

---

This repository will contain the teaching material and other info associated with our scikit-learn tutorial
at [SciPy 2016](http://scipy2016.scipy.org/ehome/index.php?eventid=146062&tabid=332930&) held July 11-17 in Austin, Texas.

Parts 1 to 5 make up the morning session, while
parts 6 to 9 will be presented in the afternoon.

### Schedule:

The 2-part tutorial will be held on Tuesday, July 12, 2016.

- Parts 1 to 5: 8:00 AM - 12:00 PM (Room 105)
- Parts 6 to 9: 1:30 PM - 5:30 PM (Room 105)

(You can find the full SciPy 2016 tutorial schedule [here](http://scipy2016.scipy.org/ehome/146062/332960/).)



Obtaining the Tutorial Material
------------------


If you have a GitHub account, it is probably most convenient if you fork the GitHub repository. If you don’t have an GitHub account, you can download the repository as a .zip file by heading over to the GitHub repository (https://github.com/amueller/scipy-2016-sklearn) in your browser and click the green “Download” button in the upper right.

![](images/download-repo.png)

Please note that we may add and improve the material until shortly before the tutorial session, and we recommend you to update your copy of the materials one day before the tutorials. If you have an GitHub account and forked/cloned the repository via GitHub, you can sync your existing fork with via the following commands:

```
git remote add upstream https://github.com/amueller/scipy-2016-sklearn.git
git fetch upstream
git checkout master merge upstream/master
```

If you don’t have a GitHub account, you may have to re-download the .zip archive from GitHub.


Installation Notes
------------------

This tutorial will require recent installations of

- [NumPy](http://www.numpy.org)
- [SciPy](http://www.scipy.org)
- [matplotlib](http://matplotlib.org)
- [pillow](https://python-pillow.org)
- [scikit-learn](http://scikit-learn.org/stable/)
- [PyYaml](http://pyyaml.org/wiki/PyYAML)
- [IPython](http://ipython.readthedocs.org/en/stable/)
- [Jupyter Notebook](http://jupyter.org)
- [Watermark](https://pypi.python.org/pypi/watermark)

The last one is important, you should be able to type:

    jupyter notebook

in your terminal window and see the notebook panel load in your web browser.
Try opening and running a notebook from the material to see check that it works.

For users who do not yet have these  packages installed, a relatively
painless way to install all the requirements is to use a Python distribution
such as [Anaconda CE](http://store.continuum.io/ "Anaconda CE"), which includes
the most relevant Python packages for science, math, engineering, and
data analysis; Anaconda can be downloaded and installed for free
including commercial use and redistribution.
The code examples in this tutorial should be compatible to Python 2.7,
Python 3.4, and Python 3.5.

After obtaining the material, we **strongly recommend** you to open and execute the Jupyter Notebook
`jupter notebook check_env.ipynb` that is located at the top level of this repository. Inside the repository, you can open the notebook
by executing

```bash
jupyter notebook check_env.ipynb
```

inside this repository. Inside the Notebook, you can run the code cell by
clicking on the "Run Cells" button as illustrated in the figure below:

![](images/check_env-1.png)


Finally, if your environment satisfies the requirements for the tutorials, the executed code cell will produce an output message as shown below:

![](images/check_env-2.png)


Although not required, we also recommend you to update the required Python packages to their latest versions to ensure best compatibility with the teaching material. Please upgrade already installed packages by executing

- `pip install [package-name] --upgrade`  
- or `conda update [package-name]`



Data Downloads
--------------

The data for this tutorial is not included in the repository.  We will be
using several data sets during the tutorial: most are built-in to
scikit-learn, which
includes code that automatically downloads and caches these
data.

**Because the wireless network
at conferences can often be spotty, it would be a good idea to download these
data sets before arriving at the conference.
Please run ``python fetch_data.py`` to download all necessary data beforehand.**

The download size of the data files are approx. 280 MB, and after `fetch_data.py`
extracted the data on your disk, the ./notebook/dataset folder will take 480 MB
of your local solid state or hard drive.


Outline
=======

Morning Session
---------------

- 01 Introduction to machine learning with sample applications, Supervised and Unsupervised learning [[view](notebooks/01\ Introduction\ to\ Machine\ Learning.ipynb)]
- 02 Scientific Computing Tools for Python: NumPy, SciPy, and matplotlib [[view](notebooks/02\ Scientific\ Computing\ Tools\ in\ Python.ipynb)]
- 03 Data formats, preparation, and representation [[view](notebooks/03\ Data\ Representation\ for\ Machine\ Learning.ipynb)]
- 04 Supervised learning: Training and test data [[view](notebooks/04\ Training\ and\ Testing\ Data.ipynb)]
- 05 Supervised learning: Estimators for classification [[view](notebooks/05\ Supervised\ Learning\ -\ Classification.ipynb)]
- 06 Supervised learning: Estimators for regression analysis [[view](notebooks/06\ Supervised\ Learning\ -\ Regression.ipynb)]
- 07 Unsupervised learning: Unsupervised Transformers [[view](notebooks/07\ Unsupervised\ Learning\ -\ Transformations\ and\ Dimensionality\ Reduction.ipynb)]
- 08 Unsupervised learning: Clustering [[view](notebooks/08\ Unsupervised\ Learning\ -\ Clustering.ipynb)]
- 09 The scikit-learn estimator interface [[view](notebooks/09\ Review\ of\ Scikit-learn\ API.ipynb)]
- 10 Preparing a real-world dataset (titanic) [[view](notebooks/10\ Case\ Study\ -\ Titanic\ Survival.ipynb)]
- 11 Working with text data via the bag-of-words model [[view](notebooks/11\ Text\ Feature\ Extraction.ipynb)]
- 12 Application: IMDb Movie Review Sentiment Analysis [[view](notebooks/12\ Case\ Study\ -\ SMS\ Spam\ Detection.ipynb)]

Afternoon Session
-----------------

- 13 Cross-Validation [[view](notebooks/13\ Cross\ Validation.ipynb)]
- 14 Model complexity and grid search for adjusting hyperparameters [[view](notebooks/14\ Model\ Complexity\ and\ GridSearchCV.ipynb)]
- 15 Scikit-learn Pipelines [[view](notebooks/15\ Pipelining\ Estimators.ipynb)]
- 16 Supervised learning: Performance metrics for classification [[view](notebooks/16\ Performance\ metrics\ and\ Model\ Evaluation.ipynb)]
- 17 Supervised learning: Linear Models [[view](notebooks/17\ In\ Depth\ -\ Linear\ Models.ipynb)]
- 18 Supervised learning: Support Vector Machines [[view](notebooks/18\ In\ Depth\ -\ Support\ Vector\ Machines.ipynb)]
- 19 Supervised learning: Decision trees and random forests, and ensemble methods [[view](notebooks/19\ In\ Depth\ -\ Trees\ and\ Forests.ipynb)]
- 20 Supervised learning: feature selection [[view](notebooks/20\ Feature\ Selection.ipynb)]
- 21 Unsupervised learning: Hierarchical and density-based clustering algorithms [[view](notebooks/21\ Unsupervised\ learning\ -\ Hierarchical\ and\ density-based\ clustering\ algorithms.ipynb)]
- 22 Unsupervised learning: Non-linear dimensionality reduction [[view](notebooks/22\ Unsupervised\ learning\ -\ Non-linear\ dimensionality\ reduction.ipynb)]
- 23 Supervised learning: Out-of-core learning [[view](notebooks/23\ Out-of-core\ Learning\ Large\ Scale\ Text\ Classification.ipynb)]


================================================
FILE: abstract.rst
================================================
Machine Learning with scikit-learn

Tutorial Topic
--------------

This tutorial aims to provide an introduction to machine learning and
scikit-learn "from the ground up". We will start with core concepts of machine
learning, some example uses of machine learning, and how to implement them
using scikit-learn. Going in detail through the characteristics of several
methods, we will discuss how to pick an algorithm for your application, how to
set its parameters, and how to evaluate performance.

Please provide a more detailed abstract of your tutorial (again, see last years tutorials).
---------------------------------------------------------------------------------------------

Machine learning is the task of extracting knowledge from data, often with the
goal of generalizing to new and unseen data. Applications of machine learning 
now touch nearly every aspect of everyday life, from the face detection in our
phones and the streams of social media we consume to picking restaurants,
partners, and movies. Machine learning has also become indispensable to many
empirical sciences, from physics, astronomy and biology to social sciences.

Scikit-learn has emerged as one of the most popular toolkits for machine
learning, and is now widely used in industry and academia.
The goal of this tutorial is to enable participants to use the wide variety of
machine learning algorithms available in scikit-learn on their own data sets,
for their own domains.

This tutorial will comprise an introductory morning session and an advanced
afternoon session. The morning part of the tutorial will cover basic concepts
of machine learning, data representation, and preprocessing. We will explain
different problem settings and which algorithms to use in each situation.
We will then go through some sample applications using algorithms implemented
in scikit-learn, including SVMs, Random Forests, K-Means, PCA, t-SNE, and
others.

In the afternoon session, we will discuss setting hyper-parameters and how to
prevent overfitting. We will go in-depth into the trade-off of model complexity
and dataset size, as well as discussing complexity of learning algorithms and
how to cope with very large datasets. The session will conclude by stepping
through the process of building machine learning pipelines consisting of
feature extraction, preprocessing and supervised learning.


Outline
========

Morning Session
----------------

- Introduction to machine learning with sample applications

- Types of machine learning: Unsupervised vs. supervised learning

- Scientific Computing Tools for Python: NumPy, SciPy, and matplotlib

- Data formats, preparation, and representation

- Supervised learning: Training and test data
- Supervised learning: The scikit-learn estimator interface
- Supervised learning: Estimators for classification
- Supervised learning: Estimators for regression analysis

- Unsupervised learning: Unsupervised Transformers
- Unsupervised learning: Preprocessing and scaling
- Unsupervised learning: Feature extraction and dimensionality reduction
- Unsupervised learning: Clustering

- Preparing a real-world dataset
- Working with text data via the bag-of-words model
- Application: IMDB Movie Review Sentiment Analysis


Afternoon Session
------------------
- Cross-Validation
- Model Complexity: Overfitting and underfitting
- Complexity of various model types
- Grid search for adjusting hyperparameters 

- Scikit-learn Pipelines

- Supervised learning: Performance metrics for classification
- Supervised learning: Support Vector Machines
- Supervised learning: Algorithm and model selection via nested cross-validation
- Supervised learning: Decision trees and random forests, and ensemble methods

- Unsupervised learning: Non-linear regression analysis
- Unsupervised learning: Hierarchical and density-based clustering algorithms
- Unsupervised learning: Non-linear dimensionality reduction

- Wrapper, filter, and embedded approaches for feature selection

- Supervised learning: Artificial neural networks: Multi-layer perceptrons
- Supervised learning: Out-of-core learning


================================================
FILE: check_env.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "from distutils.version import LooseVersion as Version\n",
    "import sys\n",
    "\n",
    "\n",
    "try:\n",
    "    import curses\n",
    "    curses.setupterm()\n",
    "    assert curses.tigetnum(\"colors\") > 2\n",
    "    OK = \"\\x1b[1;%dm[ OK ]\\x1b[0m\" % (30 + curses.COLOR_GREEN)\n",
    "    FAIL = \"\\x1b[1;%dm[FAIL]\\x1b[0m\" % (30 + curses.COLOR_RED)\n",
    "except:\n",
    "    OK = '[ OK ]'\n",
    "    FAIL = '[FAIL]'\n",
    "\n",
    "try:\n",
    "    import importlib\n",
    "except ImportError:\n",
    "    print(FAIL, \"Python version 3.4 (or 2.7) is required,\"\n",
    "                \" but %s is installed.\" % sys.version)\n",
    "\n",
    "    \n",
    "def import_version(pkg, min_ver, fail_msg=\"\"):\n",
    "    mod = None\n",
    "    try:\n",
    "        mod = importlib.import_module(pkg)\n",
    "        if pkg in {'PIL'}:\n",
    "            ver = mod.VERSION\n",
    "        else:\n",
    "            ver = mod.__version__\n",
    "        if Version(ver) < min_ver:\n",
    "            print(FAIL, \"%s version %s or higher required, but %s installed.\"\n",
    "                  % (lib, min_ver, ver))\n",
    "        else:\n",
    "            print(OK, '%s version %s' % (pkg, ver))\n",
    "    except ImportError:\n",
    "        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))\n",
    "    return mod\n",
    "\n",
    "\n",
    "# first check the python version\n",
    "print('Using python in', sys.prefix)\n",
    "print(sys.version)\n",
    "pyversion = Version(sys.version)\n",
    "if pyversion >= \"3\":\n",
    "    if pyversion < \"3.4\":\n",
    "        print(FAIL, \"Python version 3.4 (or 2.7) is required,\"\n",
    "                    \" but %s is installed.\" % sys.version)\n",
    "elif pyversion >= \"2\":\n",
    "    if pyversion < \"2.7\":\n",
    "        print(FAIL, \"Python version 2.7 is required,\"\n",
    "                    \" but %s is installed.\" % sys.version)\n",
    "else:\n",
    "    print(FAIL, \"Unknown Python version: %s\" % sys.version)\n",
    "\n",
    "print()\n",
    "requirements = {'numpy': \"1.6.1\", 'scipy': \"0.9\", 'matplotlib': \"1.0\",\n",
    "                'IPython': \"3.0\", 'sklearn': \"0.18\", 'pandas': \"0.19\",\n",
    "                'watermark': \"1.3.1\",\n",
    "                'yaml': \"3.11\", 'PIL': \"1.1.7\"}\n",
    "\n",
    "# now the dependencies\n",
    "for lib, required_version in list(requirements.items()):\n",
    "    import_version(lib, required_version)\n",
    "\n",
    "# pydot is a bit different\n",
    "import_version(\"pydot\", \"0\", fail_msg=\"pydot is not installed.\"\n",
    "               \"It is not required but you will miss out on some plots.\"\n",
    "               \"\\nYou can install it using \"\n",
    "               \"'pip install pydot' on python2, and 'pip install \"\n",
    "               \"git+https://github.com/nlhepler/pydot.git' on python3.\");\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: fetch_data.py
================================================
import os

try:
    from urllib.request import urlopen
except ImportError:
    from urllib import urlopen

import tarfile


IMDB_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
IMDB_ARCHIVE_NAME = "aclImdb_v1.tar.gz"


def get_datasets_folder():
    here = os.path.dirname(__file__)
    notebooks = os.path.join(here, 'notebooks')
    datasets_folder = os.path.abspath(os.path.join(notebooks, 'datasets'))
    datasets_archive = os.path.abspath(os.path.join(notebooks, 'datasets.zip'))

    if not os.path.exists(datasets_folder):
        if os.path.exists(datasets_archive):
            print("Extracting " + datasets_archive)
            zf = zipfile.ZipFile(datasets_archive)
            zf.extractall('.')
            assert os.path.exists(datasets_folder)
        else:
            print("Creating datasets folder: " + datasets_folder)
            os.makedirs(datasets_folder)
    else:
        print("Using existing dataset folder:" + datasets_folder)
    return datasets_folder


def check_imdb(datasets_folder):
    print("\nChecking availability of the IMDb dataset")
    archive_path = os.path.join(datasets_folder, IMDB_ARCHIVE_NAME)
    imdb_path = os.path.join(datasets_folder, 'IMDb')

    train_path = os.path.join(imdb_path, 'aclImdb', 'train')
    test_path = os.path.join(imdb_path, 'aclImdb', 'test')

    if not os.path.exists(imdb_path):
        if not os.path.exists(archive_path):
            print("Downloading dataset from %s (84.1MB)" % IMDB_URL)
            opener = urlopen(IMDB_URL)
            open(archive_path, 'wb').write(opener.read())
        else:
            print("Found archive: " + archive_path)

        print("Extracting %s to %s" % (archive_path, imdb_path))

        tar = tarfile.open(archive_path, "r:gz")
        tar.extractall(path=imdb_path)
        tar.close()
        os.remove(archive_path)

    print("Checking that the IMDb train & test directories exist...")
    assert os.path.exists(train_path)
    assert os.path.exists(test_path)
    print("=> Success!")


if __name__ == "__main__":
    datasets_folder = get_datasets_folder()
    check_imdb(datasets_folder)

    print("\nLoading Labeled Faces Data (~200MB)")
    from sklearn.datasets import fetch_lfw_people
    fetch_lfw_people(min_faces_per_person=70, resize=0.4,
                     data_home=datasets_folder)
    print("=> Success!")


================================================
FILE: notebooks/01 Introduction to Machine Learning.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The use of watermark (above) is optional, and we use it to keep track of the changes while developing the tutorial material. (You can install this IPython extension via \"pip install watermark\". For more information, please see: https://github.com/rasbt/watermark)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01.1 Introduction to Machine Learning in Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is Machine Learning?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Machine learning is the process of extracting knowledge from data automatically, usually with the goal of making predictions on new, unseen data. A classical example is a spam filter, for which the user keeps labeling incoming mails as either spam or not spam. A machine learning algorithm then \"learns\" a predictive model from data that distinguishes spam from normal emails, a model which can predict for new emails whether they are spam or not.   \n",
    "\n",
    "Central to machine learning is the concept of **automating decision making** from data **without the user specifying explicit rules** how this decision should be made.\n",
    "\n",
    "For the case of emails, the user doesn't provide a list of words or characteristics that make an email spam. Instead, the user provides examples of spam and non-spam emails that are labeled as such.\n",
    "\n",
    "The second central concept is **generalization**. The goal of a machine learning model is to predict on new, previously unseen data. In a real-world application, we are not interested in marking an already labeled email as spam or not. Instead, we want to make the user's life easier by automatically classifying new incoming mail."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/supervised_workflow.svg\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data is presented to the algorithm usually as a two-dimensional array (or matrix) of numbers. Each data point (also known as a *sample* or *training instance*) that we want to either learn from or make a decision on is represented as a list of numbers, a so-called feature vector, and its containing features represent the properties of this point. \n",
    "\n",
    "Later, we will work with a popular dataset called *Iris* -- among many other datasets. Iris, a classic benchmark dataset in the field of machine learning, contains the measurements of 150 iris flowers from 3 different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica. \n",
    "\n",
    "\n",
    "Iris Setosa\n",
    "<img src=\"figures/iris_setosa.jpg\" width=\"50%\">\n",
    "\n",
    "Iris Versicolor\n",
    "<img src=\"figures/iris_versicolor.jpg\" width=\"50%\">\n",
    "\n",
    "Iris Virginica\n",
    "<img src=\"figures/iris_virginica.jpg\" width=\"50%\">\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "We represent each flower sample as one row in our data array, and the columns (features) represent the flower measurements in centimeters. For instance, we can represent this Iris dataset, consisting of 150 samples and 4 features, a 2-dimensional array or matrix $\\mathbb{R}^{150 \\times 4}$ in the following format:\n",
    "\n",
    "\n",
    "$$\\mathbf{X} = \\begin{bmatrix}\n",
    "    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & \\dots  & x_{4}^{(1)} \\\\\n",
    "    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & \\dots  & x_{4}^{(2)} \\\\\n",
    "    \\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\\n",
    "    x_{1}^{(150)} & x_{2}^{(150)} & x_{3}^{(150)} & \\dots  & x_{4}^{(150)}\n",
    "\\end{bmatrix}.\n",
    "$$\n",
    "\n",
    "(The superscript denotes the *i*th row, and the subscript denotes the *j*th feature, respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two kinds of machine learning we will talk about today: ***supervised learning*** and ***unsupervised learning***."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Supervised Learning: Classification and regression\n",
    "\n",
    "In **Supervised Learning**, we have a dataset consisting of both input features and a desired output, such as in the spam / no-spam example.\n",
    "The task is to construct a model (or program) which is able to predict the desired output of an unseen object\n",
    "given the set of features.\n",
    "\n",
    "Some more complicated examples are:\n",
    "\n",
    "- Given a multicolor image of an object through a telescope, determine\n",
    "  whether that object is a star, a quasar, or a galaxy.\n",
    "- Given a photograph of a person, identify the person in the photo.\n",
    "- Given a list of movies a person has watched and their personal rating\n",
    "  of the movie, recommend a list of movies they would like.\n",
    "- Given a persons age, education and position, infer their salary\n",
    "\n",
    "What these tasks have in common is that there is one or more unknown\n",
    "quantities associated with the object which needs to be determined from other\n",
    "observed quantities.\n",
    "\n",
    "Supervised learning is further broken down into two categories, **classification** and **regression**:\n",
    "\n",
    "- **In classification, the label is discrete**, such as \"spam\" or \"no spam\". In other words, it provides a clear-cut distinction between categories. Furthermore, it is important to note that class labels are nominal, not ordinal variables. Nominal and ordinal variables are both subcategories of categorical variable. Ordinal variables imply an order, for example, T-shirt sizes \"XL > L > M > S\". On the contrary, nominal variables don't imply an order, for example, we (usually) can't assume \"orange > blue > green\".\n",
    "\n",
    "\n",
    "- **In regression, the label is continuous**, that is a float output. For example,\n",
    "in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a\n",
    "classification problem: the label is from three distinct categories. On the other hand, we might\n",
    "wish to estimate the age of an object based on such observations: this would be a regression problem,\n",
    "because the label (age) is a continuous quantity.\n",
    "\n",
    "In supervised learning, there is always a distinction between a **training set** for which the desired outcome is given, and a **test set** for which the desired outcome needs to be inferred. The learning model fits the predictive model to the training set, and we use the test set to evaluate its generalization performance.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unsupervised Learning\n",
    "\n",
    "In **Unsupervised Learning** there is no desired output associated with the data.\n",
    "Instead, we are interested in extracting some form of knowledge or model from the given data.\n",
    "In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself.\n",
    "Unsupervised learning is often harder to understand and to evaluate.\n",
    "\n",
    "Unsupervised learning comprises tasks such as *dimensionality reduction*, *clustering*, and\n",
    "*density estimation*. For example, in the iris data discussed above, we can used unsupervised\n",
    "methods to determine combinations of the measurements which best display the structure of the\n",
    "data. As we’ll see below, such a projection of the data can be used to visualize the\n",
    "four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:\n",
    "\n",
    "- Given detailed observations of distant galaxies, determine which features or combinations of\n",
    "  features summarize best the information.\n",
    "- Given a mixture of two sound sources (for example, a person talking over some music),\n",
    "  separate the two (this is called the [blind source separation](http://en.wikipedia.org/wiki/Blind_signal_separation) problem).\n",
    "- Given a video, isolate a moving object and categorize in relation to other moving objects which have been seen.\n",
    "- Given a large collection of news articles, find recurring topics inside these articles.\n",
    "- Given a collection of images, cluster similar images together (for example to group them when visualizing a collection)\n",
    "\n",
    "Sometimes the two may even be combined: e.g. unsupervised learning can be used to find useful\n",
    "features in heterogeneous data, and then these features can be used within a supervised\n",
    "framework."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/02 Scientific Computing Tools in Python.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The use of watermark (above) is optional, and we use it to keep track of the changes while developing the tutorial material. (You can install this IPython extension via \"pip install watermark\". For more information, please see: https://github.com/rasbt/watermark)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "01.2 Jupyter Notebooks\n",
    "==================\n",
    "\n",
    "* You can run a cell by pressing ``[shift] + [Enter]`` or by pressing the \"play\" button in the menu.\n",
    "\n",
    "![](figures/ipython_run_cell.png)\n",
    "\n",
    "* You can get help on a function or object by pressing ``[shift] + [tab]`` after the opening parenthesis ``function(``\n",
    "\n",
    "![](figures/ipython_help-1.png)\n",
    "\n",
    "* You can also get help by executing ``function?``\n",
    "\n",
    "![](figures/ipython_help-2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Numpy Arrays"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Manipulating `numpy` arrays is an important part of doing machine learning\n",
    "(or, really, any type of scientific computation) in python.  This will likely\n",
    "be a short review for most. In any case, let's quickly go through some of the most important features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Setting a random seed for reproducibility\n",
    "rnd = np.random.RandomState(seed=123)\n",
    "\n",
    "# Generating a random array\n",
    "X = rnd.uniform(low=0.0, high=1.0, size=(3, 5))  # a 3 x 5 array\n",
    "\n",
    "print(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Note that NumPy arrays use 0-indexing just like other data structures in Python.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Accessing elements\n",
    "\n",
    "# get a single element \n",
    "# (here: an element in the first row and column)\n",
    "print(X[0, 0])\n",
    "\n",
    "# get a row \n",
    "# (here: 2nd row)\n",
    "print(X[1])\n",
    "\n",
    "# get a column\n",
    "# (here: 2nd column)\n",
    "print(X[:, 1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Transposing an array\n",
    "print(X.T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\begin{bmatrix}\n",
    "    1 & 2 & 3 & 4 \\\\\n",
    "    5 & 6 & 7 & 8\n",
    "\\end{bmatrix}^T\n",
    "= \n",
    "\\begin{bmatrix}\n",
    "    1 & 5 \\\\\n",
    "    2 & 6 \\\\\n",
    "    3 & 7 \\\\\n",
    "    4 & 8\n",
    "\\end{bmatrix}\n",
    "$$\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Creating a row vector\n",
    "# of evenly spaced numbers over a specified interval.\n",
    "y = np.linspace(0, 12, 5)\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Turning the row vector into a column vector\n",
    "print(y[:, np.newaxis])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Getting the shape or reshaping an array\n",
    "\n",
    "# Generating a random array\n",
    "rnd = np.random.RandomState(seed=123)\n",
    "X = rnd.uniform(low=0.0, high=1.0, size=(3, 5))  # a 3 x 5 array\n",
    "\n",
    "print(X.shape)\n",
    "print(X.reshape(5, 3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Indexing by an array of integers (fancy indexing)\n",
    "indices = np.array([3, 1, 0])\n",
    "print(indices)\n",
    "X[:, indices]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is much, much more to know, but these few operations are fundamental to what we'll\n",
    "do during this tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SciPy Sparse Matrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We won't make very much use of these in this tutorial, but sparse matrices are very nice\n",
    "in some situations.  In some machine learning tasks, especially those associated\n",
    "with textual analysis, the data may be mostly zeros.  Storing all these zeros is very\n",
    "inefficient, and representing in a way that only contains the \"non-zero\" values can be much more efficient.  We can create and manipulate sparse matrices as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from scipy import sparse\n",
    "\n",
    "# Create a random array with a lot of zeros\n",
    "rnd = np.random.RandomState(seed=123)\n",
    "\n",
    "X = rnd.uniform(low=0.0, high=1.0, size=(10, 5))\n",
    "print(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# set the majority of elements to zero\n",
    "X[X < 0.7] = 0\n",
    "print(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# turn X into a CSR (Compressed-Sparse-Row) matrix\n",
    "X_csr = sparse.csr_matrix(X)\n",
    "print(X_csr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Converting the sparse matrix to a dense array\n",
    "print(X_csr.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(You may have stumbled upon an alternative method for converting sparse to dense representations: `numpy.todense`; `toarray` returns a NumPy array, whereas `todense` returns a NumPy matrix. In this tutorial, we will be working with NumPy arrays, not matrices; the latter are not supported by scikit-learn.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The CSR representation can be very efficient for computations, but it is not\n",
    "as good for adding elements.  For that, the LIL (List-In-List) representation\n",
    "is better:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create an empty LIL matrix and add some items\n",
    "X_lil = sparse.lil_matrix((5, 5))\n",
    "\n",
    "for i, j in np.random.randint(0, 5, (15, 2)):\n",
    "    X_lil[i, j] = i + j\n",
    "\n",
    "print(X_lil)\n",
    "print(type(X_lil))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_dense = X_lil.toarray()\n",
    "print(X_dense)\n",
    "print(type(X_dense))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Often, once an LIL matrix is created, it is useful to convert it to a CSR format\n",
    "(many scikit-learn algorithms require CSR or CSC format)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_csr = X_lil.tocsr()\n",
    "print(X_csr)\n",
    "print(type(X_csr))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The available sparse formats that can be useful for various problems are:\n",
    "\n",
    "- `CSR` (compressed sparse row)\n",
    "- `CSC` (compressed sparse column)\n",
    "- `BSR` (block sparse row)\n",
    "- `COO` (coordinate)\n",
    "- `DIA` (diagonal)\n",
    "- `DOK` (dictionary of keys)\n",
    "- `LIL` (list in list)\n",
    "\n",
    "The [``scipy.sparse``](http://docs.scipy.org/doc/scipy/reference/sparse.html) submodule also has a lot of functions for sparse matrices\n",
    "including linear algebra, sparse solvers, graph algorithms, and much more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another important part of machine learning is the visualization of data.  The most common\n",
    "tool for this in Python is [`matplotlib`](http://matplotlib.org).  It is an extremely flexible package, and\n",
    "we will go over some basics here.\n",
    "\n",
    "Since we are using Jupyter notebooks, let us use one of IPython's convenient built-in \"[magic functions](https://ipython.org/ipython-doc/3/interactive/magics.html)\", the \"matoplotlib inline\" mode, which will draw the plots directly inside the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Plotting a line\n",
    "x = np.linspace(0, 10, 100)\n",
    "plt.plot(x, np.sin(x));"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Scatter-plot points\n",
    "x = np.random.normal(size=500)\n",
    "y = np.random.normal(size=500)\n",
    "plt.scatter(x, y);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Showing images using imshow\n",
    "# - note that origin is at the top-left by default!\n",
    "\n",
    "x = np.linspace(1, 12, 100)\n",
    "y = x[:, np.newaxis]\n",
    "\n",
    "im = y * np.sin(x) * np.cos(y)\n",
    "print(im.shape)\n",
    "\n",
    "plt.imshow(im);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Contour plots \n",
    "# - note that origin here is at the bottom-left by default!\n",
    "plt.contour(im);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# 3D plotting\n",
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "ax = plt.axes(projection='3d')\n",
    "xgrid, ygrid = np.meshgrid(x, y.ravel())\n",
    "ax.plot_surface(xgrid, ygrid, im, cmap=plt.cm.jet, cstride=2, rstride=2, linewidth=0);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are many, many more plot types available.  One useful way to explore these is by\n",
    "looking at the [matplotlib gallery](http://matplotlib.org/gallery.html).\n",
    "\n",
    "You can test these examples out easily in the notebook: simply copy the ``Source Code``\n",
    "link on each page, and put it in a notebook using the ``%load`` magic.\n",
    "For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load http://matplotlib.org/mpl_examples/pylab_examples/ellipse_collection.py\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from matplotlib.collections import EllipseCollection\n",
    "\n",
    "x = np.arange(10)\n",
    "y = np.arange(15)\n",
    "X, Y = np.meshgrid(x, y)\n",
    "\n",
    "XY = np.hstack((X.ravel()[:,np.newaxis], Y.ravel()[:,np.newaxis]))\n",
    "\n",
    "ww = X/10.0\n",
    "hh = Y/15.0\n",
    "aa = X*9\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "ec = EllipseCollection(ww, hh, aa, units='x', offsets=XY,\n",
    "                       transOffset=ax.transData)\n",
    "ec.set_array((X+Y).ravel())\n",
    "ax.add_collection(ec)\n",
    "ax.autoscale_view()\n",
    "ax.set_xlabel('X')\n",
    "ax.set_ylabel('y')\n",
    "cbar = plt.colorbar(ec)\n",
    "cbar.set_label('X+Y')\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/03 Data Representation for Machine Learning.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The use of watermark (above) is optional, and we use it to keep track of the changes while developing the tutorial material. (You can install this IPython extension via \"pip install watermark\". For more information, please see: https://github.com/rasbt/watermark)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Representation and Visualization of Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Machine learning is about fitting models to data; for that reason, we'll start by\n",
    "discussing how data can be represented in order to be understood by the computer.  Along\n",
    "with this, we'll build on our matplotlib examples from the previous section and show some\n",
    "examples of how to visualize data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data in scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data in scikit-learn, with very few exceptions, is assumed to be stored as a\n",
    "**two-dimensional array**, of shape `[n_samples, n_features]`. Many algorithms also accept ``scipy.sparse`` matrices of the same shape."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **n_samples:**   The number of samples: each sample is an item to process (e.g. classify).\n",
    "  A sample can be a document, a picture, a sound, a video, an astronomical object,\n",
    "  a row in database or CSV file,\n",
    "  or whatever you can describe with a fixed set of quantitative traits.\n",
    "- **n_features:**  The number of features or distinct traits that can be used to describe each\n",
    "  item in a quantitative manner.  Features are generally real-valued, but may be Boolean or\n",
    "  discrete-valued in some cases.\n",
    "\n",
    "The number of features must be fixed in advance. However it can be very high dimensional\n",
    "(e.g. millions of features) with most of them being \"zeros\" for a given sample. This is a case\n",
    "where `scipy.sparse` matrices can be useful, in that they are\n",
    "much more memory-efficient than NumPy arrays.\n",
    "\n",
    "As we recall from the previous section (or Jupyter notebook), we represent samples (data points or instances) as rows in the data array, and we store the corresponding features, the \"dimensions,\" as columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A Simple Example: the Iris Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an example of a simple dataset, we're going to take a look at the iris data stored by scikit-learn.\n",
    "The data consists of measurements of three different iris flower species.  There are three different species of iris\n",
    "in this particular dataset as illustrated below:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Iris Setosa\n",
    "<img src=\"figures/iris_setosa.jpg\" width=\"50%\">\n",
    "\n",
    "Iris Versicolor\n",
    "<img src=\"figures/iris_versicolor.jpg\" width=\"50%\">\n",
    "\n",
    "Iris Virginica\n",
    "<img src=\"figures/iris_virginica.jpg\" width=\"50%\">\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quick Question:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Let's assume that we are interested in categorizing new observations; we want to predict whether unknown flowers are  Iris-Setosa, Iris-Versicolor, or Iris-Virginica flowers, respectively. Based on what we've discussed in the previous section, how would we construct such a dataset?***\n",
    "\n",
    "Remember: we need a 2D array of size `[n_samples x n_features]`.\n",
    "\n",
    "- What would the `n_samples` refer to?\n",
    "\n",
    "- What might the `n_features` refer to?\n",
    "\n",
    "Remember that there must be a **fixed** number of features for each sample, and feature\n",
    "number *j* must be a similar kind of quantity for each sample."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the Iris Data with Scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For future experiments with machine learning algorithms, we recommend you to bookmark the [UCI machine learning repository](http://archive.ics.uci.edu/ml/), which hosts many of the commonly used datasets that are useful for benchmarking machine learning algorithms -- a very popular resource for machine learning practioners and researchers. Conveniently, some of these datasets are already included in scikit-learn so that we can skip the tedious parts of downloading, reading, parsing, and cleaning these text/CSV files. You can find a list of available datasets in scikit-learn at: http://scikit-learn.org/stable/datasets/#toy-datasets.\n",
    "\n",
    "For example, scikit-learn has a very straightforward set of data on these iris species.  The data consist of\n",
    "the following:\n",
    "\n",
    "- Features in the Iris dataset:\n",
    "\n",
    "  1. sepal length in cm\n",
    "  2. sepal width in cm\n",
    "  3. petal length in cm\n",
    "  4. petal width in cm\n",
    "\n",
    "- Target classes to predict:\n",
    "\n",
    "  1. Iris Setosa\n",
    "  2. Iris Versicolour\n",
    "  3. Iris Virginica"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/petal_sepal.jpg\" alt=\"Sepal\" style=\"width: 50%;\"/>\n",
    "\n",
    "(Image: \"Petal-sepal\". Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg#/media/File:Petal-sepal.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "iris = load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting dataset is a ``Bunch`` object: you can see what's available using\n",
    "the method ``keys()``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "iris.keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The features of each sample flower are stored in the ``data`` attribute of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "n_samples, n_features = iris.data.shape\n",
    "print('Number of samples:', n_samples)\n",
    "print('Number of features:', n_features)\n",
    "# the sepal length, sepal width, petal length and petal width of the first sample (first flower)\n",
    "print(iris.data[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The information about the class of each sample is stored in the ``target`` attribute of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(iris.data.shape)\n",
    "print(iris.target.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "np.bincount(iris.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the NumPy's bincount function (above), we can see that the classes are distributed uniformly in this dataset - there are 50 flowers from each species, where\n",
    "\n",
    "- class 0: Iris-Setosa\n",
    "- class 1: Iris-Versicolor\n",
    "- class 2: Iris-Virginica"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These class names are stored in the last attribute, namely ``target_names``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(iris.target_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data is four dimensional, but we can visualize one or two of the dimensions\n",
    "at a time using a simple histogram or scatter-plot.  Again, we'll start by enabling\n",
    "matplotlib inline mode:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x_index = 3\n",
    "colors = ['blue', 'red', 'green']\n",
    "\n",
    "for label, color in zip(range(len(iris.target_names)), colors):\n",
    "    plt.hist(iris.data[iris.target==label, x_index], \n",
    "             label=iris.target_names[label],\n",
    "             color=color)\n",
    "\n",
    "plt.xlabel(iris.feature_names[x_index])\n",
    "plt.legend(loc='upper right')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x_index = 3\n",
    "y_index = 0\n",
    "\n",
    "colors = ['blue', 'red', 'green']\n",
    "\n",
    "for label, color in zip(range(len(iris.target_names)), colors):\n",
    "    plt.scatter(iris.data[iris.target==label, x_index], \n",
    "                iris.data[iris.target==label, y_index],\n",
    "                label=iris.target_names[label],\n",
    "                c=color)\n",
    "\n",
    "plt.xlabel(iris.feature_names[x_index])\n",
    "plt.ylabel(iris.feature_names[y_index])\n",
    "plt.legend(loc='upper left')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Quick Exercise:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Change** `x_index` **and** `y_index` **in the above script\n",
    "and find a combination of two parameters\n",
    "which maximally separate the three classes.**\n",
    "\n",
    "This exercise is a preview of **dimensionality reduction**, which we'll see later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An aside: scatterplot matrices\n",
    "\n",
    "Instead of looking at the data one plot at a time, a common tool that analysts use is called the **scatterplot matrix**.\n",
    "\n",
    "Scatterplot matrices show scatter plots between all features in the data set, as well as histograms to show the distribution of each feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "    \n",
    "iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)\n",
    "pd.plotting.scatter_matrix(iris_df, figsize=(8, 8));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Other Available Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Scikit-learn makes available a host of datasets for testing learning algorithms](http://scikit-learn.org/stable/datasets/#dataset-loading-utilities).\n",
    "They come in three flavors:\n",
    "\n",
    "- **Packaged Data:** these small datasets are packaged with the scikit-learn installation,\n",
    "  and can be downloaded using the tools in ``sklearn.datasets.load_*``\n",
    "- **Downloadable Data:** these larger datasets are available for download, and scikit-learn\n",
    "  includes tools which streamline this process.  These tools can be found in\n",
    "  ``sklearn.datasets.fetch_*``\n",
    "- **Generated Data:** there are several datasets which are generated from models based on a\n",
    "  random seed.  These are available in the ``sklearn.datasets.make_*``\n",
    "\n",
    "You can explore the available dataset loaders, fetchers, and generators using IPython's\n",
    "tab-completion functionality.  After importing the ``datasets`` submodule from ``sklearn``,\n",
    "type\n",
    "\n",
    "    datasets.load_<TAB>\n",
    "\n",
    "or\n",
    "\n",
    "    datasets.fetch_<TAB>\n",
    "\n",
    "or\n",
    "\n",
    "    datasets.make_<TAB>\n",
    "\n",
    "to see a list of available functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data downloaded using the ``fetch_`` scripts are stored locally,\n",
    "within a subdirectory of your home directory.\n",
    "You can use the following to determine where it is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import get_data_home\n",
    "get_data_home()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Be warned: many of these datasets are quite large, and can take a long time to download!\n",
    "\n",
    "If you start a download within the IPython notebook\n",
    "and you want to kill it, you can use ipython's \"kernel interrupt\" feature, available in the menu or using\n",
    "the shortcut ``Ctrl-m i``.\n",
    "\n",
    "You can press ``Ctrl-m h`` for a list of all ``ipython`` keyboard shortcuts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Digits Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll take a look at another dataset, one where we have to put a bit\n",
    "more thought into how to represent the data.  We can explore the data in\n",
    "a similar manner as above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "digits = load_digits()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "digits.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "n_samples, n_features = digits.data.shape\n",
    "print((n_samples, n_features))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(digits.data[0])\n",
    "print(digits.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target here is just the digit represented by the data.  The data is an array of\n",
    "length 64... but what does this data mean?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's a clue in the fact that we have two versions of the data array:\n",
    "``data`` and ``images``.  Let's take a look at them:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(digits.data.shape)\n",
    "print(digits.images.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that they're related by a simple reshaping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "print(np.all(digits.images.reshape((1797, 64)) == digits.data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's visualize the data.  It's little bit more involved than the simple scatter-plot\n",
    "we used above, but we can do it rather quickly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# set up the figure\n",
    "fig = plt.figure(figsize=(6, 6))  # figure size in inches\n",
    "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n",
    "\n",
    "# plot the digits: each image is 8x8 pixels\n",
    "for i in range(64):\n",
    "    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n",
    "    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')\n",
    "    \n",
    "    # label the image with the target value\n",
    "    ax.text(0, 7, str(digits.target[i]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see now what the features mean.  Each feature is a real-valued quantity representing the\n",
    "darkness of a pixel in an 8x8 image of a hand-written digit.\n",
    "\n",
    "Even though each sample has data that is inherently two-dimensional, the data matrix flattens\n",
    "this 2D data into a **single vector**, which can be contained in one **row** of the data matrix."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generated Data: the S-Curve"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One dataset often used as an example of a simple nonlinear dataset is the S-cure:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_s_curve\n",
    "data, colors = make_s_curve(n_samples=1000)\n",
    "print(data.shape)\n",
    "print(colors.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "ax = plt.axes(projection='3d')\n",
    "ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=colors)\n",
    "ax.view_init(10, -60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example is typically used with an unsupervised learning method called Locally\n",
    "Linear Embedding.  We'll explore unsupervised learning in detail later in the tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise: working with the faces dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we'll take a moment for you to explore the datasets yourself.\n",
    "Later on we'll be using the Olivetti faces dataset.\n",
    "Take a moment to fetch the data (about 1.4MB), and visualize the faces.\n",
    "You can copy the code used to visualize the digits above, and modify it for this data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_olivetti_faces"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# fetch the faces data\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Use a script like above to plot the faces image data.\n",
    "# hint: plt.cm.bone is a good colormap for this data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load solutions/03A_faces_plot.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/04 Training and Testing Data.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Training and Testing Data\n",
    "=====================================\n",
    "\n",
    "To evaluate how well our supervised models generalize, we can split our data into a training and a test set:\n",
    "\n",
    "<img src=\"figures/train_test_split_matrix.svg\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "classifier = KNeighborsClassifier()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thinking about how machine learning is normally performed, the idea of a train/test split makes sense. Real world systems train on the data they have, and as other data comes in (from customers, sensors, or other sources) the classifier that was trained must predict on fundamentally *new* data. We can simulate this during training using a train/test split - the test data is a simulation of \"future data\" which will come into the system during production. \n",
    "\n",
    "Specifically for iris, the 150 labels in iris are sorted, which means that if we split the data using a proportional split, this will result in fudamentally altered class distributions. For instance, if we'd perform a common 2/3 training data and 1/3 test data split, our training dataset will only consists of flower classes 0 and 1 (Setosa and Versicolor), and our test set will only contain samples with class label 2 (Virginica flowers).\n",
    "\n",
    "Under the assumption that all samples are independent of each other (in contrast time series data), we want to **randomly shuffle the dataset before we split the dataset** as illustrated above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to split the data into training and testing. Luckily, this is a common pattern in machine learning and scikit-learn has a pre-built function to split data into training and testing sets for you. Here, we use 50% of the data as training, and 50% testing. 80% and 20% is another common split, but there are no hard and fast rules. The most important thing is to fairly evaluate your system on data it *has not* seen during training!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "train_X, test_X, train_y, test_y = train_test_split(X, y, \n",
    "                                                    train_size=0.5, \n",
    "                                                    random_state=123)\n",
    "print(\"Labels for training and testing data\")\n",
    "print(train_y)\n",
    "print(test_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Tip: Stratified Split**\n",
    "\n",
    "Especially for relatively small datasets, it's better to stratify the split. Stratification means that we maintain the original class proportion of the dataset in the test and training sets. For example, after we randomly split the dataset as shown in the previous code example, we have the following class proportions in percent:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('All:', np.bincount(y) / float(len(y)) * 100.0)\n",
    "print('Training:', np.bincount(train_y) / float(len(train_y)) * 100.0)\n",
    "print('Test:', np.bincount(test_y) / float(len(test_y)) * 100.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, in order to stratify the split, we can pass the label array as an additional option to the `train_test_split` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "train_X, test_X, train_y, test_y = train_test_split(X, y, \n",
    "                                                    train_size=0.5, \n",
    "                                                    random_state=123,\n",
    "                                                    stratify=y)\n",
    "\n",
    "print('All:', np.bincount(y) / float(len(y)) * 100.0)\n",
    "print('Training:', np.bincount(train_y) / float(len(train_y)) * 100.0)\n",
    "print('Test:', np.bincount(test_y) / float(len(test_y)) * 100.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By evaluating our classifier performance on data that has been seen during training, we could get false confidence in the predictive power of our model. In the worst case, it may simply memorize the training samples but completely fails classifying new, similar samples -- we really don't want to put such a system into production!\n",
    "\n",
    "Instead of using the same dataset for training and testing (this is called \"resubstitution evaluation\"), it is much much better to use a train/test split in order to estimate how well your trained model is doing on new data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier.fit(train_X, train_y)\n",
    "pred_y = classifier.predict(test_X)\n",
    "\n",
    "print(\"Fraction Correct [Accuracy]:\")\n",
    "print(np.sum(pred_y == test_y) / float(len(test_y)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also visualize the correct and failed predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('Samples correctly classified:')\n",
    "correct_idx = np.where(pred_y == test_y)[0]\n",
    "print(correct_idx)\n",
    "\n",
    "print('\\nSamples incorrectly classified:')\n",
    "incorrect_idx = np.where(pred_y != test_y)[0]\n",
    "print(incorrect_idx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Plot two dimensions\n",
    "\n",
    "colors = [\"darkblue\", \"darkgreen\", \"gray\"]\n",
    "\n",
    "for n, color in enumerate(colors):\n",
    "    idx = np.where(test_y == n)[0]\n",
    "    plt.scatter(test_X[idx, 1], test_X[idx, 2], color=color, label=\"Class %s\" % str(n))\n",
    "\n",
    "plt.scatter(test_X[incorrect_idx, 1], test_X[incorrect_idx, 2], color=\"darkred\")\n",
    "\n",
    "plt.xlabel('sepal width [cm]')\n",
    "plt.ylabel('petal length [cm]')\n",
    "plt.legend(loc=3)\n",
    "plt.title(\"Iris Classification results\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the errors occur in the area where green (class 1) and gray (class 2) overlap. This gives us insight about what features to add - any feature which helps separate class 1 and class 2 should improve classifier performance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print the true labels of 3 wrong predictions and modify the scatterplot code, which we used above, to visualize and distinguish these three samples with different markers in the 2D scatterplot. Can you explain why our classifier made these wrong predictions?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load solutions/04_wrong-predictions.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/05 Supervised Learning - Classification.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,pillow,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Supervised Learning Part 1 -- Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To visualize the workings of machine learning algorithms, it is often helpful to study two-dimensional or one-dimensional data, that is data with only one or two features. While in practice, datasets usually have many more features, it is hard to plot high-dimensional data in on two-dimensional screens.\n",
    "\n",
    "We will illustrate some very simple examples before we move on to more \"real world\" data sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "First, we will look at a two class classification problem in two dimensions. We use the synthetic data generated by the ``make_blobs`` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_blobs\n",
    "\n",
    "X, y = make_blobs(centers=2, random_state=0)\n",
    "\n",
    "print('X ~ n_samples x n_features:', X.shape)\n",
    "print('y ~ n_samples:', y.shape)\n",
    "\n",
    "print('\\nFirst 5 samples:\\n', X[:5, :])\n",
    "print('\\nFirst 5 labels:', y[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the data is two-dimensional, we can plot each sample as a point in a two-dimensional coordinate system, with the first feature being the x-axis and the second feature being the y-axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(X[y == 0, 0], X[y == 0, 1], \n",
    "            c='blue', s=40, label='0')\n",
    "plt.scatter(X[y == 1, 0], X[y == 1, 1], \n",
    "            c='red', s=40, label='1', marker='s')\n",
    "\n",
    "plt.xlabel('first feature')\n",
    "plt.ylabel('second feature')\n",
    "plt.legend(loc='upper right');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Classification is a supervised task, and since we are interested in its performance on unseen data, we split our data into two parts:\n",
    "\n",
    "1. a training set that the learning algorithm uses to fit the model\n",
    "2. a test set to evaluate the generalization performance of the model\n",
    "\n",
    "The ``train_test_split`` function from the ``model_selection`` module does that for us -- we will use it to split a dataset into 75% training data and 25% test data.\n",
    "\n",
    "<img src=\"figures/train_test_split_matrix.svg\" width=\"100%\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
    "                                                    test_size=0.25,\n",
    "                                                    random_state=1234,\n",
    "                                                    stratify=y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The scikit-learn estimator API\n",
    "<img src=\"figures/supervised_workflow.svg\" width=\"100%\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Every algorithm is exposed in scikit-learn via an ''Estimator'' object. (All models in scikit-learn have a very consistent interface). For instance, we first import the logistic regression class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we instantiate the estimator object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier = LogisticRegression()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To built the model from our data, that is to learn how to classify new points, we call the ``fit`` function with the training data, and the corresponding training labels (the desired output for the training data point):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(Some estimator methods such as `fit` return `self` by default. Thus, after executing the code snippet above, you will see the default parameters of this particular instance of `LogisticRegression`. Another way of retrieving the estimator's ininitialization parameters is to execute `classifier.get_params()`, which returns a parameter dictionary.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then apply the model to unseen data and use the model to predict the estimated outcome using the ``predict`` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "prediction = classifier.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare these against the true labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(prediction)\n",
    "print(y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct. This is called **accuracy**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "np.mean(prediction == y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is also a convenience function , ``score``, that all scikit-learn classifiers have to compute this directly from the test data:\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "classifier.score(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "LogisticRegression is a so-called linear model,\n",
    "that means it will create a decision that is linear in the input space. In 2d, this simply means it finds a line to separate the blue from the red:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from figures import plot_2d_separator\n",
    "\n",
    "plt.scatter(X[y == 0, 0], X[y == 0, 1], \n",
    "            c='blue', s=40, label='0')\n",
    "plt.scatter(X[y == 1, 0], X[y == 1, 1], \n",
    "            c='red', s=40, label='1', marker='s')\n",
    "\n",
    "plt.xlabel(\"first feature\")\n",
    "plt.ylabel(\"second feature\")\n",
    "plot_2d_separator(classifier, X)\n",
    "plt.legend(loc='upper right');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Estimated parameters**: All the estimated model parameters are attributes of the estimator object ending by an underscore. Here, these are the coefficients and the offset of the line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(classifier.coef_)\n",
    "print(classifier.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another classifier: K Nearest Neighbors\n",
    "------------------------------------------------\n",
    "Another popular and easy to understand classifier is K nearest neighbors (kNN).  It has one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.\n",
    "\n",
    "The interface is exactly the same as for ``LogisticRegression above``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "knn = KNeighborsClassifier(n_neighbors=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We fit the model with out training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "knn.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(X[y == 0, 0], X[y == 0, 1], \n",
    "            c='blue', s=40, label='0')\n",
    "plt.scatter(X[y == 1, 0], X[y == 1, 1], \n",
    "            c='red', s=40, label='1', marker='s')\n",
    "\n",
    "plt.xlabel(\"first feature\")\n",
    "plt.ylabel(\"second feature\")\n",
    "plot_2d_separator(knn, X)\n",
    "plt.legend(loc='upper right');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "knn.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise\n",
    "=========\n",
    "Apply the KNeighborsClassifier to the ``iris`` dataset. Play with different values of the ``n_neighbors`` and observe how training and test score change."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load solutions/05A_knn_with_diff_k.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/06 Supervised Learning - Regression.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Supervised Learning Part 2 -- Regression Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In regression we are trying to predict a continuous output variable -- in contrast to the nominal variables we were predicting in the previous classification examples. \n",
    "\n",
    "Let's start with a simple toy example with one feature dimension (explanatory variable) and one target variable. We will create a dataset out of a sinus curve with some noise:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x = np.linspace(-3, 3, 100)\n",
    "print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rng = np.random.RandomState(42)\n",
    "y = np.sin(4 * x) + x + rng.uniform(size=len(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot(x, y, 'o');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Linear Regression\n",
    "=================\n",
    "\n",
    "The first model that we will introduce is the so-called simple linear regression. Here, we want to fit a line to the data, which \n",
    "\n",
    "One of the simplest models again is a linear one, that simply tries to predict the data as lying on a line. One way to find such a line is `LinearRegression` (also known as [*Ordinary Least Squares (OLS)*](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression).\n",
    "The interface for LinearRegression is exactly the same as for the classifiers before, only that ``y`` now contains float values, instead of classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we remember, the scikit-learn API requires us to provide the target variable (`y`) as a 1-dimensional array; scikit-learn's API expects the samples (`X`) in form a 2-dimensional array -- even though it may only consist of 1 feature. Thus, let us convert the 1-dimensional `x` NumPy array into an `X` array with 2 axes:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('Before: ', x.shape)\n",
    "X = x[:, np.newaxis]\n",
    "print('After: ', X.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we start by splitting our dataset into a training (75%) and a test set (25%):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use the learning algorithm implemented in `LinearRegression` to **fit a regression model to the training data**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "regressor = LinearRegression()\n",
    "regressor.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After fitting to the training data, we paramerterized a linear regression model with the following values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('Weight coefficients: ', regressor.coef_)\n",
    "print('y-axis intercept: ', regressor.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since our regression model is a linear one, the relationship between the target variable (y) and the feature variable (x) is defined as \n",
    "\n",
    "$$y = weight \\times x + \\text{intercept}$$.\n",
    "\n",
    "Plugging in the min and max values into thos equation, we can plot the regression fit to our training data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "min_pt = X.min() * regressor.coef_[0] + regressor.intercept_\n",
    "max_pt = X.max() * regressor.coef_[0] + regressor.intercept_\n",
    "\n",
    "plt.plot([X.min(), X.max()], [min_pt, max_pt])\n",
    "plt.plot(X_train, y_train, 'o');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the estimators for classification in the previous notebook, we use the `predict` method to predict the target variable. And we expect these predicted values to fall onto the line that we plotted previously:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_pred_train = regressor.predict(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot(X_train, y_train, 'o', label=\"data\")\n",
    "plt.plot(X_train, y_pred_train, 'o', label=\"prediction\")\n",
    "plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')\n",
    "plt.legend(loc='best')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see in the plot above, the line is able to capture the general slope of the data, but not many details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's try the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y_pred_test = regressor.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot(X_test, y_test, 'o', label=\"data\")\n",
    "plt.plot(X_test, y_pred_test, 'o', label=\"prediction\")\n",
    "plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')\n",
    "plt.legend(loc='best');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, scikit-learn provides an easy way to evaluate the prediction quantitatively using the ``score`` method. For regression tasks, this is the R<sup>2</sup> score. Another popular way would be the Mean Squared Error (MSE). As its name implies, the MSE is simply the average squared difference over the predicted and actual target values\n",
    "\n",
    "$$MSE = \\frac{1}{n} \\sum^{n}_{i=1} (\\text{predicted}_i - \\text{true}_i)^2$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "regressor.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "KNeighborsRegression\n",
    "=======================\n",
    "As for classification, we can also use a neighbor based method for regression. We can simply take the output of the nearest point, or we could average several nearest points. This method is less popular for regression than for classification, but still a good baseline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "kneighbor_regression = KNeighborsRegressor(n_neighbors=1)\n",
    "kneighbor_regression.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, let us look at the behavior on training and test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_pred_train = kneighbor_regression.predict(X_train)\n",
    "\n",
    "plt.plot(X_train, y_train, 'o', label=\"data\", markersize=10)\n",
    "plt.plot(X_train, y_pred_train, 's', label=\"prediction\", markersize=4)\n",
    "plt.legend(loc='best');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the training set, we do a perfect job: each point is its own nearest neighbor!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y_pred_test = kneighbor_regression.predict(X_test)\n",
    "\n",
    "plt.plot(X_test, y_test, 'o', label=\"data\", markersize=8)\n",
    "plt.plot(X_test, y_pred_test, 's', label=\"prediction\", markersize=4)\n",
    "plt.legend(loc='best');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the test set, we also do a better job of capturing the variation, but our estimates look much messier than before.\n",
    "Let us look at the R<sup>2</sup> score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "kneighbor_regression.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better than before! Here, the linear model was not a good fit for our problem; it was lacking in complexity and thus under-fit our data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise\n",
    "=========\n",
    "Compare the KNeighborsRegressor and LinearRegression on the boston housing dataset. You can load the dataset using ``sklearn.datasets.load_boston``. You can learn about the dataset by reading the ``DESCR`` attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load solutions/06A_knn_vs_linreg.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/07 Unsupervised Learning - Transformations and Dimensionality Reduction.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Unsupervised Learning Part 1 -- Transformation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many instances of unsupervised learning, such as dimensionality reduction, manifold learning, and feature extraction, find a new representation of the input data without any additional input. (In contrast to supervised learning, usnupervised algorithms don't require or consider target variables like in the previous classification and regression examples). \n",
    "\n",
    "<img src=\"figures/unsupervised_workflow.svg\" width=\"100%\">\n",
    "\n",
    "A very basic example is the rescaling of our data, which is a requirement for many machine learning algorithms as they are not scale-invariant -- rescaling falls into the category of data pre-processing and can barely be called *learning*. There exist many different rescaling technques, and in the following example, we will take a look at a particular method that is commonly called \"standardization.\" Here, we will recale the data so that each feature is centered at zero (mean = 0) with unit variance (standard deviation = 0).\n",
    "\n",
    "For example, if we have a 1D dataset with the values [1, 2, 3, 4, 5], the standardized values are\n",
    "\n",
    "- 1 -> -1.41\n",
    "- 2 -> -0.71\n",
    "- 3 -> 0.0\n",
    "- 4 -> 0.71\n",
    "- 5 -> 1.41\n",
    "\n",
    "computed via the equation $x_{standardized} = \\frac{x - \\mu_x}{\\sigma_x}$,\n",
    "where $\\mu$ is the sample mean, and $\\sigma$ the standard deviation, respectively.\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "ary = np.array([1, 2, 3, 4, 5])\n",
    "ary_standardized = (ary - ary.mean()) / ary.std()\n",
    "ary_standardized"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although standardization is a most basic preprocessing procedure -- as we've seen in the code snipped above -- scikit-learn implements a `StandardScaler` class for this computation. And in later sections, we will see why and when the scikit-learn interface comes in handy over the code snippet we executed above.  \n",
    "\n",
    "Applying such a preprocessing has a very similar interface to the supervised learning algorithms we saw so far.\n",
    "To get some more practice with scikit-learn's \"Transformer\" interface, let's start by loading the iris dataset and rescale it:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "iris = load_iris()\n",
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)\n",
    "print(X_train.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The iris dataset is not \"centered\" that is it has non-zero mean and the standard deviation is different for each component:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"mean : %s \" % X_train.mean(axis=0))\n",
    "print(\"standard deviation : %s \" % X_train.std(axis=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use a preprocessing method, we first import the estimator, here StandardScaler and instantiate it:\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "scaler = StandardScaler()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As with the classification and regression algorithms, we call ``fit`` to learn the model from the data. As this is an unsupervised model, we only pass ``X``, not ``y``. This simply estimates mean and standard deviation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "scaler.fit(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can rescale our data by applying the ``transform`` (not ``predict``) method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train_scaled = scaler.transform(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "``X_train_scaled`` has the same number of samples and features, but the mean was subtracted and all features were scaled to have unit standard deviation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(X_train_scaled.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"mean : %s \" % X_train_scaled.mean(axis=0))\n",
    "print(\"standard deviation : %s \" % X_train_scaled.std(axis=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To summarize: Via the `fit` method, the estimator is fitted to the data we provide. In this step, the estimator estimates the parameters from the data (here: mean and standard deviation). Then, if we `transform` data, these parameters are used to transform a dataset. (Please note that the transform method does not update these parameters)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's important to note that the same transformation is applied to the training and the test set. That has the consequence that usually the mean of the test data is not zero after scaling:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_test_scaled = scaler.transform(X_test)\n",
    "print(\"mean test data: %s\" % X_test_scaled.mean(axis=0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is important for the training and test data to be transformed in exactly the same way, for the following processing steps to make sense of the data, as is illustrated in the figure below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from figures import plot_relative_scaling\n",
    "plot_relative_scaling()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several common ways to scale the data. The most common one is the ``StandardScaler`` we just introduced, but rescaling the data to a fix minimum an maximum value with ``MinMaxScaler`` (usually between 0 and 1), or using more robust statistics like median and quantile, instead of mean and standard deviation (with ``RobustScaler``), are also useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from figures import plot_scaling\n",
    "plot_scaling()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Principal Component Analysis\n",
    "============================"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An unsupervised transformation that is somewhat more interesting is Principal Component Analysis (PCA).\n",
    "It is a technique to reduce the dimensionality of the data, by creating a linear projection.\n",
    "That is, we find new features to represent the data that are a linear combination of the old data (i.e. we rotate it). Thus, we can think of PCA as a projection of our data onto a *new* feature space.\n",
    "\n",
    "The way PCA finds these new directions is by looking for the directions of maximum variance.\n",
    "Usually only few components that explain most of the variance in the data are kept. Here, the premise is to reduce the size (dimensionality) of a dataset while capturing most of its information. There are many reason why dimensionality reduction can be useful: It can reduce the computational cost when running learning algorithms, decrease the storage space, and may help with the so-called \"curse of dimensionality,\" which we will discuss in greater detail later.\n",
    "\n",
    "To illustrate how a rotation might look like, we first show it on two-dimensional data and keep both principal components. Here is an illustraion:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from figures import plot_pca_illustration\n",
    "plot_pca_illustration()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's go through all the steps in more detail:\n",
    "We create a Gaussian blob that is rotated:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "rnd = np.random.RandomState(5)\n",
    "X_ = rnd.normal(size=(300, 2))\n",
    "X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2)\n",
    "y = X_[:, 0] > 0\n",
    "plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y, linewidths=0, s=30)\n",
    "plt.xlabel(\"feature 1\")\n",
    "plt.ylabel(\"feature 2\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As always, we instantiate our PCA model. By default all directions are kept."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "pca = PCA()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we fit the PCA model with our data. As PCA is an unsupervised algorithm, there is no output ``y``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pca.fit(X_blob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we can transform the data, projected on the principal components:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_pca = pca.transform(X_blob)\n",
    "\n",
    "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, linewidths=0, s=30)\n",
    "plt.xlabel(\"first principal component\")\n",
    "plt.ylabel(\"second principal component\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the left of the plot you can see the four points that were on the top right before. PCA found fit first component to be along the diagonal, and the second to be perpendicular to it. As PCA finds a rotation, the principal components are always at right angles (\"orthogonal\") to each other."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dimensionality Reduction for Visualization with PCA\n",
    "-------------------------------------------------------------\n",
    "Consider the digits dataset. It cannot be visualized in a single 2D plot, as it has 64 features. We are going to extract 2 dimensions to visualize it in, using the example from the sklearn examples [here](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from figures import digits_plot\n",
    "\n",
    "digits_plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this projection was determined *without* any information about the\n",
    "labels (represented by the colors): this is the sense in which the learning\n",
    "is **unsupervised**.  Nevertheless, we see that the projection gives us insight\n",
    "into the distribution of the different digits in parameter space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    "\n",
    "Visualize the iris dataset using the first two principal components, and compress this visualization to using two of the original features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load solutions/07A_iris-pca.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/08 Unsupervised Learning - Clustering.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Unsupervised Learning Part 2 -- Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clustering is the task of gathering samples into groups of similar\n",
    "samples according to some predefined similarity or distance (dissimilarity)\n",
    "measure, such as the Euclidean distance.\n",
    "In this section we will explore a basic clustering task on some synthetic and real-world datasets.\n",
    "\n",
    "Here are some common applications of clustering algorithms:\n",
    "\n",
    "- Compression for data reduction\n",
    "- Summarizing data as a reprocessing step for recommender systems\n",
    "- Similarly:\n",
    "   - grouping related web news (e.g. Google News) and web search results\n",
    "   - grouping related stock quotes for investment portfolio management\n",
    "   - building customer profiles for market analysis\n",
    "- Building a code book of prototype samples for unsupervised feature extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by creating a simple, 2-dimensional, synthetic dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_blobs\n",
    "\n",
    "X, y = make_blobs(random_state=42)\n",
    "X.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(X[:, 0], X[:, 1]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the scatter plot above, we can see three separate groups of data points and we would like to recover them using clustering -- think of \"discovering\" the class labels that we already take for granted in a classification task.\n",
    "\n",
    "Even if the groups are obvious in the data, it is hard to find them when the data lives in a high-dimensional space, which we can't visualize in a single histogram or scatterplot."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will use one of the simplest clustering algorithms, K-means.\n",
    "This is an iterative algorithm which searches for three cluster\n",
    "centers such that the distance from each point to its cluster is\n",
    "minimized. The standard implementation of K-means uses the Euclidean distance, which is why we want to make sure that all our variables are measured on the same scale if we are working with real-world datastets. In the previous notebook, we talked about one technique to achieve this, namely, standardization.\n",
    "\n",
    "**Question:** what would you expect the output to look like?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "kmeans = KMeans(n_clusters=3, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the cluster labels either by calling fit and then accessing the \n",
    "``labels_`` attribute of the K means estimator, or by calling ``fit_predict``.\n",
    "Either way, the result contains the ID of the cluster that each point is assigned to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "labels = kmeans.fit_predict(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "all(y == labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's visualize the assignments that have been found"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(X[:, 0], X[:, 1], c=labels);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compared to the true labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(X[:, 0], X[:, 1], c=y);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we are probably satisfied with the clustering results. But in general we might want to have a more quantitative evaluation. How about comparing our cluster labels with the ground truth we got when generating the blobs?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, accuracy_score\n",
    "\n",
    "print('Accuracy score:', accuracy_score(y, labels))\n",
    "print(confusion_matrix(y, labels))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "np.mean(y == labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "**Exercise**\n",
    "\n",
    "After looking at the \"True\" label array y, and the scatterplot and `labels` above, can you figure out why our computed accuracy is 0.0, not 1.0, and can you fix it?\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even though we recovered the partitioning of the data into clusters perfectly, the cluster IDs we assigned were arbitrary,\n",
    "and we can not hope to recover them. Therefore, we must use a different scoring metric, such as ``adjusted_rand_score``, which is invariant to permutations of the labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import adjusted_rand_score\n",
    "\n",
    "adjusted_rand_score(y, labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the \"short-comings\" of K-means is that we have to specify the number of clusters, which we often don't know *apriori*. For example, let's have a look what happens if we set the number of clusters to 2 in our synthetic 3-blob dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "kmeans = KMeans(n_clusters=2, random_state=42)\n",
    "labels = kmeans.fit_predict(X)\n",
    "plt.scatter(X[:, 0], X[:, 1], c=labels);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The Elbow Method\n",
    "\n",
    "The Elbow method is a \"rule-of-thumb\" approach to finding the optimal number of clusters. Here, we look at the cluster dispersion for different values of k:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "distortions = []\n",
    "for i in range(1, 11):\n",
    "    km = KMeans(n_clusters=i, \n",
    "                random_state=0)\n",
    "    km.fit(X)\n",
    "    distortions.append(km.inertia_)\n",
    "plt.plot(range(1, 11), distortions, marker='o')\n",
    "plt.xlabel('Number of clusters')\n",
    "plt.ylabel('Distortion')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we pick the value that resembles the \"pit of an elbow.\" As we can see, this would be k=3 in this case, which makes sense given our visual expection of the dataset previously."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Clustering comes with assumptions**: A clustering algorithm finds clusters by making assumptions with samples should be grouped together. Each algorithm makes different assumptions and the quality and interpretability of your results will depend on whether the assumptions are satisfied for your goal. For K-means clustering, the model is that all clusters have equal, spherical variance.\n",
    "\n",
    "**In general, there is no guarantee that structure found by a clustering algorithm has anything to do with what you were interested in**.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can easily create a dataset that has non-isotropic clusters, on which kmeans will fail:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_blobs\n",
    "\n",
    "X, y = make_blobs(random_state=170, n_samples=600)\n",
    "rng = np.random.RandomState(74)\n",
    "\n",
    "transformation = rng.normal(size=(2, 2))\n",
    "X = np.dot(X, transformation)\n",
    "\n",
    "y_pred = KMeans(n_clusters=3).fit_predict(X)\n",
    "\n",
    "plt.scatter(X[:, 0], X[:, 1], c=y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Some Notable Clustering Routines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following are two well-known clustering algorithms. \n",
    "\n",
    "- `sklearn.cluster.KMeans`: <br/>\n",
    "    The simplest, yet effective clustering algorithm. Needs to be provided with the\n",
    "    number of clusters in advance, and assumes that the data is normalized as input\n",
    "    (but use a PCA model as preprocessor).\n",
    "- `sklearn.cluster.MeanShift`: <br/>\n",
    "    Can find better looking clusters than KMeans but is not scalable to high number of samples.\n",
    "- `sklearn.cluster.DBSCAN`: <br/>\n",
    "    Can detect irregularly shaped clusters based on density, i.e. sparse regions in\n",
    "    the input space are likely to become inter-cluster boundaries. Can also detect\n",
    "    outliers (samples that are not part of a cluster).\n",
    "- `sklearn.cluster.AffinityPropagation`: <br/>\n",
    "    Clustering algorithm based on message passing between data points.\n",
    "- `sklearn.cluster.SpectralClustering`: <br/>\n",
    "    KMeans applied to a projection of the normalized graph Laplacian: finds\n",
    "    normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.\n",
    "- `sklearn.cluster.Ward`: <br/>\n",
    "    Ward implements hierarchical clustering based on the Ward algorithm,\n",
    "    a variance-minimizing approach. At each step, it minimizes the sum of\n",
    "    squared differences within all clusters (inertia criterion).\n",
    "\n",
    "Of these, Ward, SpectralClustering, DBSCAN and Affinity propagation can also work with precomputed similarity matrices."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cluster_comparison.png\" width=\"900\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise: digits clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perform K-means clustering on the digits data, searching for ten clusters.\n",
    "Visualize the cluster centers as images (i.e. reshape each to 8x8 and use\n",
    "``plt.imshow``)  Do the clusters seem to be correlated with particular digits? What is the ``adjusted_rand_score``?\n",
    "\n",
    "Visualize the projected digits as in the last notebook, but this time use the\n",
    "cluster labels as the color.  What do you notice?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "digits = load_digits()\n",
    "# ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#%load solutions/08B_digits_clustering.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/09 Review of Scikit-learn API.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A recap on Scikit-learn's estimator interface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Scikit-learn strives to have a uniform interface across all methods. Given a scikit-learn *estimator*\n",
    "object named `model`, the following methods are available (not all for each model):\n",
    "\n",
    "- Available in **all Estimators**\n",
    "  + `model.fit()` : fit training data. For supervised learning applications,\n",
    "    this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).\n",
    "    For unsupervised learning applications, `fit` takes only a single argument,\n",
    "    the data `X` (e.g. `model.fit(X)`).\n",
    "- Available in **supervised estimators**\n",
    "  + `model.predict()` : given a trained model, predict the label of a new set of data.\n",
    "    This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),\n",
    "    and returns the learned label for each object in the array.\n",
    "  + `model.predict_proba()` : For classification problems, some estimators also provide\n",
    "    this method, which returns the probability that a new observation has each categorical label.\n",
    "    In this case, the label with the highest probability is returned by `model.predict()`.\n",
    "  + `model.decision_function()` : For classification problems, some estimators provide an uncertainty estimate that is not a probability. For binary classification, a decision_function >= 0 means the positive class will be predicted, while < 0 means the negative class.\n",
    "  + `model.score()` : for classification or regression problems, most (all?) estimators implement\n",
    "    a score method.  Scores are between 0 and 1, with a larger score indicating a better fit. For classifiers, the `score` method computes the prediction accuracy. For regressors, `score` computes the coefficient of determination (R<sup>2</sup>) of the prediction.\n",
    "  + `model.transform()` : For feature selection algorithms, this will reduce the dataset to the selected features. For some classification and regression models such as some linear models and random forests, this method reduces the dataset to the most informative features. These classification and regression models can therefore also be used as feature selection methods.\n",
    "  \n",
    "- Available in **unsupervised estimators**\n",
    "  + `model.transform()` : given an unsupervised model, transform new data into the new basis.\n",
    "    This also accepts one argument `X_new`, and returns the new representation of the data based\n",
    "    on the unsupervised model.\n",
    "  + `model.fit_transform()` : some estimators implement this method,\n",
    "    which more efficiently performs a fit and a transform on the same input data.\n",
    "  + `model.predict()` : for clustering algorithms, the predict method will produce cluster labels for new data points. Not all clustering methods have this functionality.\n",
    "  + `model.predict_proba()` : Gaussian mixture models (GMMs) provide the probability for each point to be generated by a given mixture component.\n",
    "  + `model.score()` : Density models like KDE and GMMs provide the likelihood of the data under the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Apart from ``fit``, the two most important functions are arguably ``predict`` to produce a target variable (a ``y``) ``transform``, which produces a new representation of the data (an ``X``).\n",
    "The following table shows for which class of models which function applies:\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "<tr style=\"border:None; font-size:20px; padding:10px;\"><th>``model.predict``</th><th>``model.transform``</th></tr>\n",
    "<tr style=\"border:None; font-size:20px; padding:10px;\"><td>Classification</td><td>Preprocessing</td></tr>\n",
    "<tr style=\"border:None; font-size:20px; padding:10px;\"><td>Regression</td><td>Dimensionality Reduction</td></tr>\n",
    "<tr style=\"border:None; font-size:20px; padding:10px;\"><td>Clustering</td><td>Feature Extraction</td></tr>\n",
    "<tr style=\"border:None; font-size:20px; padding:10px;\"><td>&nbsp;</td><td>Feature Selection</td></tr>\n",
    "\n",
    "</table>\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/10 Case Study - Titanic Survival.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Case Study  - Titanic Survival"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we will talk about an important piece of machine learning: the extraction of\n",
    "quantitative features from data.  By the end of this section you will\n",
    "\n",
    "- Know how features are extracted from real-world data.\n",
    "- See an example of extracting numerical features from textual data\n",
    "\n",
    "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What Are Features?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Numerical Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size\n",
    "**n_samples** $\\times$ **n_features**.\n",
    "\n",
    "Previously, we looked at the iris dataset, which has 150 samples and 4 features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "\n",
    "iris = load_iris()\n",
    "print(iris.data.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These features are:\n",
    "\n",
    "- sepal length in cm\n",
    "- sepal width in cm\n",
    "- petal length in cm\n",
    "- petal width in cm\n",
    "\n",
    "Numerical features such as these are pretty straightforward: each sample contains a list\n",
    "of floating-point numbers corresponding to the features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Categorical Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if you have categorical features?  For example, imagine there is data on the color of each\n",
    "iris:\n",
    "\n",
    "    color in [red, blue, purple]\n",
    "\n",
    "You might be tempted to assign numbers to these features, i.e. *red=1, blue=2, purple=3*\n",
    "but in general **this is a bad idea**.  Estimators tend to operate under the assumption that\n",
    "numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike\n",
    "than 1 and 3, and this is often not the case for categorical features.\n",
    "\n",
    "In fact, the example above is a subcategory of \"categorical\" features, namely, \"nominal\" features. Nominal features don't imply an order, whereas \"ordinal\" features are categorical features that do imply an order. An example of ordinal features would be T-shirt sizes, e.g., XL > L > M > S. \n",
    "\n",
    "One work-around for parsing nominal features into a format that prevents the classification algorithm from asserting an order is the so-called one-hot encoding representation. Here, we give each category its own dimension.  \n",
    "\n",
    "The enriched iris feature set would hence be in this case:\n",
    "\n",
    "- sepal length in cm\n",
    "- sepal width in cm\n",
    "- petal length in cm\n",
    "- petal width in cm\n",
    "- color=purple (1.0 or 0.0)\n",
    "- color=blue (1.0 or 0.0)\n",
    "- color=red (1.0 or 0.0)\n",
    "\n",
    "Note that using many of these categorical features may result in data which is better\n",
    "represented as a **sparse matrix**, as we'll see with the text classification example\n",
    "below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Using the DictVectorizer to encode categorical features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the source data is encoded has a list of dicts where the values are either strings names for categories or numerical values, you can use the `DictVectorizer` class to compute the boolean expansion of the categorical features while leaving the numerical features unimpacted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "measurements = [\n",
    "    {'city': 'Dubai', 'temperature': 33.},\n",
    "    {'city': 'London', 'temperature': 12.},\n",
    "    {'city': 'San Francisco', 'temperature': 18.},\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "\n",
    "vec = DictVectorizer()\n",
    "vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vec.fit_transform(measurements).toarray()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vec.get_feature_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Derived Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another common feature type are **derived features**, where some pre-processing step is\n",
    "applied to the data to generate features that are somehow more informative.  Derived\n",
    "features may be based in **feature extraction** and **dimensionality reduction** (such as PCA or manifold learning),\n",
    "may be linear or nonlinear combinations of features (such as in polynomial regression),\n",
    "or may be some more sophisticated transform of the features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combining Numerical and Categorical Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an example of how to work with both categorical and numerical data, we will perform survival predicition for the passengers of the HMS Titanic.\n",
    "\n",
    "We will use a version of the Titanic (titanic3.xls) from [here](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls). We converted the .xls to .csv for easier manipulation but left the data is otherwise unchanged.\n",
    "\n",
    "We need to read in all the lines from the (titanic3.csv) file, set aside the keys from the first line, and find our labels (who survived or died) and data (attributes of that person). Let's look at the keys and some corresponding example lines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "titanic = pd.read_csv(os.path.join('datasets', 'titanic3.csv'))\n",
    "print(titanic.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a broad description of the keys and what they mean:\n",
    "\n",
    "```\n",
    "pclass          Passenger Class\n",
    "                (1 = 1st; 2 = 2nd; 3 = 3rd)\n",
    "survival        Survival\n",
    "                (0 = No; 1 = Yes)\n",
    "name            Name\n",
    "sex             Sex\n",
    "age             Age\n",
    "sibsp           Number of Siblings/Spouses Aboard\n",
    "parch           Number of Parents/Children Aboard\n",
    "ticket          Ticket Number\n",
    "fare            Passenger Fare\n",
    "cabin           Cabin\n",
    "embarked        Port of Embarkation\n",
    "                (C = Cherbourg; Q = Queenstown; S = Southampton)\n",
    "boat            Lifeboat\n",
    "body            Body Identification Number\n",
    "home.dest       Home/Destination\n",
    "```\n",
    "\n",
    "In general, it looks like `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `homedest` may be candidates for categorical features, while the rest appear to be numerical features. We can also look at the first couple of rows in the dataset to get a better understanding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "titanic.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We clearly want to discard the \"boat\" and \"body\" columns for any classification into survived vs not survived as they already contain this information. The name is unique to each person (probably) and also non-informative. For a first try, we will use \"pclass\", \"sibsp\", \"parch\", \"fare\" and \"embarked\" as our features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "labels = titanic.survived.values\n",
    "features = titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "features.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data now contains only useful features, but they are not in a format that the machine learning algorithms can understand. We need to transform the strings \"male\" and \"female\" into binary variables that indicate the gender, and similarly for \"embarked\".\n",
    "We can do that using the pandas ``get_dummies`` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pd.get_dummies(features).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This transformation successfully encoded the string columns. However, one might argue that the class is also a categorical variable. We can explicitly list the columns to encode using the ``columns`` parameter, and include ``pclass``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "features_dummies = pd.get_dummies(features, columns=['pclass', 'sex', 'embarked'])\n",
    "features_dummies.head(n=16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data = features_dummies.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "np.isnan(data).any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With all of the hard data loading work out of the way, evaluating a classifier on this data becomes straightforward. Setting up the simplest possible model, we want to see what the simplest score can be with `DummyClassifier`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import Imputer\n",
    "\n",
    "\n",
    "train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=0)\n",
    "\n",
    "imp = Imputer()\n",
    "imp.fit(train_data)\n",
    "train_data_finite = imp.transform(train_data)\n",
    "test_data_finite = imp.transform(test_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "clf = DummyClassifier('most_frequent')\n",
    "clf.fit(train_data_finite, train_labels)\n",
    "print(\"Prediction accuracy: %f\" % clf.score(test_data_finite, test_labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Exercise\n",
    "=====\n",
    "Try executing the above classification, using LogisticRegression and RandomForestClassifier instead of DummyClassifier\n",
    "\n",
    "Does selecting a different subset of features help?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# %load solutions/10_titanic.py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/11 Text Feature Extraction.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Methods - Text Feature Extraction with Bag-of-Words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many tasks, like in the classical spam detection, your input data is text.\n",
    "Free text with variables length is very far from the fixed length numeric representation that we need to do machine learning with scikit-learn.\n",
    "However, there is an easy and effective way to go from text data to a numeric representation using the so-called bag-of-words model, which provides a data structure that is compatible with the machine learning aglorithms in scikit-learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/bag_of_words.svg\" width=\"100%\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's assume that each sample in your dataset is represented as one string, which could be just a sentence, an email, or a whole news article or book. To represent the sample, we first split the string into a list of tokens, which correspond to (somewhat normalized) words. A simple way to do this to just split by whitespace, and then lowercase the word. \n",
    "\n",
    "Then, we build a vocabulary of all tokens (lowercased words) that appear in our whole dataset. This is usually a very large vocabulary.\n",
    "Finally, looking at our single sample, we could show how often each word in the vocabulary appears.\n",
    "We represent our string by a vector, where each entry is how often a given word in the vocabulary appears in the string.\n",
    "\n",
    "As each sample will only contain very few words, most entries will be zero, leading to a very high-dimensional but sparse representation.\n",
    "\n",
    "The method is called \"bag-of-words,\" as the order of the words is lost entirely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = [\"Some say the world will end in fire,\",\n",
    "     \"Some say in ice.\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "len(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "vectorizer = CountVectorizer()\n",
    "vectorizer.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vectorizer.vocabulary_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_bag_of_words = vectorizer.transform(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_bag_of_words.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_bag_of_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_bag_of_words.toarray()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vectorizer.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vectorizer.inverse_transform(X_bag_of_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# tf-idf Encoding\n",
    "A useful transformation that is often applied to the bag-of-word encoding is the so-called term-frequency inverse-document-frequency (tf-idf) scaling, which is a non-linear transformation of the word counts.\n",
    "\n",
    "The tf-idf encoding rescales words that are common to have less weight:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "tfidf_vectorizer = TfidfVectorizer()\n",
    "tfidf_vectorizer.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "np.set_printoptions(precision=2)\n",
    "\n",
    "print(tfidf_vectorizer.transform(X).toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "tf-idfs are a way to represent documents as feature vectors. tf-idfs can be understood as a modification of the raw term frequencies (`tf`); the `tf` is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification. If you are interested in the mathematical details and equations, we have compiled an [external IPython Notebook](http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb) that walks you through the computation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bigrams and N-Grams\n",
    "\n",
    "In the example illustrated in the figure at the beginning of this notebook, we used the so-called 1-gram (unigram) tokenization: Each token represents a single element with regard to the splittling criterion. \n",
    "\n",
    "Entirely discarding word order is not always a good idea, as composite phrases often have specific meaning, and modifiers like \"not\" can invert the meaning of words.\n",
    "\n",
    "A simple way to include some word order are n-grams, which don't only look at a single token, but at all pairs of neighborhing tokens. For example, in 2-gram (bigram) tokenization, we would group words together with an overlap of one word; in 3-gram (trigram) splits we would create an overlap two words, and so forth:\n",
    "\n",
    "- original text: \"this is how you get ants\"\n",
    "- 1-gram: \"this\", \"is\", \"how\", \"you\", \"get\", \"ants\"\n",
    "- 2-gram: \"this is\", \"is how\", \"how you\", \"you get\", \"get ants\"\n",
    "- 3-gram: \"this is how\", \"is how you\", \"how you get\", \"you get ants\"\n",
    "\n",
    "Which \"n\" we choose for \"n-gram\" tokenization to obtain the optimal performance in our predictive model depends on the learning algorithm, dataset, and task. Or in other words, we have consider \"n\" in \"n-grams\" as a tuning parameters, and in later notebooks, we will see how we deal with these.\n",
    "\n",
    "Now, let's create a bag of words model of bigrams using scikit-learn's `CountVectorizer`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# look at sequences of tokens of minimum length 2 and maximum length 2\n",
    "bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))\n",
    "bigram_vectorizer.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bigram_vectorizer.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "bigram_vectorizer.transform(X).toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Often we want to include unigrams (single tokens) AND bigrams, wich we can do by passing the following tuple as an argument to the `ngram_range` parameter of the `CountVectorizer` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gram_vectorizer = CountVectorizer(ngram_range=(1, 2))\n",
    "gram_vectorizer.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gram_vectorizer.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "gram_vectorizer.transform(X).toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Character n-grams\n",
    "=================\n",
    "\n",
    "Sometimes it is also helpful not only to look at words, but to consider single characters instead.   \n",
    "That is particularly useful if we have very noisy data and want to identify the language, or if we want to predict something about a single word.\n",
    "We can simply look at characters instead of words by setting ``analyzer=\"char\"``.\n",
    "Looking at single characters is usually not very informative, but looking at longer n-grams of characters could be:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer=\"char\")\n",
    "char_vectorizer.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(char_vectorizer.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Exercise\n",
    "Compute the bigrams from \"zen of python\" as given below (or by ``import this``), and find the most common trigram.\n",
    "We want to treat each line as a separate document. You can achieve this by splitting the string by newlines (``\\n``).\n",
    "Compute the Tf-idf encoding of the data. Which words have the highest tf-idf score? Why?\n",
    "What changes if you use ``TfidfVectorizer(norm=\"none\")``?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "zen = \"\"\"Beautiful is better than ugly.\n",
    "Explicit is better than implicit.\n",
    "Simple is better than complex.\n",
    "Complex is better than complicated.\n",
    "Flat is better than nested.\n",
    "Sparse is better than dense.\n",
    "Readability counts.\n",
    "Special cases aren't special enough to break the rules.\n",
    "Although practicality beats purity.\n",
    "Errors should never pass silently.\n",
    "Unless explicitly silenced.\n",
    "In the face of ambiguity, refuse the temptation to guess.\n",
    "There should be one-- and preferably only one --obvious way to do it.\n",
    "Although that way may not be obvious at first unless you're Dutch.\n",
    "Now is better than never.\n",
    "Although never is often better than *right* now.\n",
    "If the implementation is hard to explain, it's a bad idea.\n",
    "If the implementation is easy to explain, it may be a good idea.\n",
    "Namespaces are one honking great idea -- let's do more of those!\"\"\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#%load solutions/11_ngrams.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/12 Case Study - SMS Spam Detection.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Case Study - Text classification for SMS spam detection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first load the text data from the `dataset` directory that should be located in your notebooks directory, which we created by running the `fetch_data.py` script from the top level of the GitHub repository.\n",
    "\n",
    "Furthermore, we perform some simple preprocessing and split the data array into two parts:\n",
    "\n",
    "1. `text`: A list of lists, where each sublists contains the contents of our emails\n",
    "2. `y`: our SPAM vs HAM labels stored in binary; a 1 represents a spam message, and a 0 represnts a ham (non-spam) message. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "with open(os.path.join(\"datasets\", \"smsspam\", \"SMSSpamCollection\")) as f:\n",
    "    lines = [line.strip().split(\"\\t\") for line in f.readlines()]\n",
    "\n",
    "text = [x[1] for x in lines]\n",
    "y = [int(x[0] == \"spam\") for x in lines]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "text[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "y[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print('Number of ham and spam messages:', np.bincount(y))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "type(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "type(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we split our dataset into 2 parts, the test and training dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "text_train, text_test, y_train, y_test = train_test_split(text, y, \n",
    "                                                          random_state=42,\n",
    "                                                          test_size=0.25,\n",
    "                                                          stratify=y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we use the CountVectorizer to parse the text data into a bag-of-words model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "print('CountVectorizer defaults')\n",
    "CountVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer()\n",
    "vectorizer.fit(text_train)\n",
    "\n",
    "X_train = vectorizer.transform(text_train)\n",
    "X_test = vectorizer.transform(text_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(len(vectorizer.vocabulary_))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(vectorizer.get_feature_names()[:20])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(vectorizer.get_feature_names()[2000:2020])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(X_train.shape)\n",
    "print(X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training a Classifier on Text Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now train a classifier, for instance a logistic regression classifier, which is a fast baseline for text classification tasks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "clf = LogisticRegression()\n",
    "clf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now evaluate the classifier on the testing set. Let's first use the built-in score function, which is the rate of correct classification in the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clf.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also compute the score on the training set to see how well we do there:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "clf.score(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing important features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def visualize_coefficients(classifier, feature_names, n_top_features=25):\n",
    "    # get coefficients with large absolute values \n",
    "    coef = classifier.coef_.ravel()\n",
    "    positive_coefficients = np.argsort(coef)[-n_top_features:]\n",
    "    negative_coefficients = np.argsort(coef)[:n_top_features]\n",
    "    interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])\n",
    "    # plot them\n",
    "    plt.figure(figsize=(15, 5))\n",
    "    colors = [\"red\" if c < 0 else \"blue\" for c in coef[interesting_coefficients]]\n",
    "    plt.bar(np.arange(50), coef[interesting_coefficients], color=colors)\n",
    "    feature_names = np.array(feature_names)\n",
    "    plt.xticks(np.arange(1, 51), feature_names[interesting_coefficients], rotation=60, ha=\"right\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "visualize_coefficients(clf, vectorizer.get_feature_names())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer(min_df=2)\n",
    "vectorizer.fit(text_train)\n",
    "\n",
    "X_train = vectorizer.transform(text_train)\n",
    "X_test = vectorizer.transform(text_test)\n",
    "\n",
    "clf = LogisticRegression()\n",
    "clf.fit(X_train, y_train)\n",
    "\n",
    "print(clf.score(X_train, y_train))\n",
    "print(clf.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "visualize_coefficients(clf, vectorizer.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/supervised_scikit_learn.png\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    "\n",
    "Use TfidfVectorizer instead of CountVectorizer. Are the results better? How are the coefficients different?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#%load solutions/12A_tfidf.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the parameters min_df and ngram_range of the TfidfVectorizer and CountVectorizer. How does that change the important features?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#%load solutions/12B_vectorizer_params.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/13 Cross Validation.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cross-Validation and scoring methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous sections and notebooks, we split our dataset into two parts, a training set and a test set. We used the training set to fit our model, and we used the test set to evaluate its generalization performance -- how well it performs on new, unseen data.\n",
    "\n",
    "\n",
    "<img src=\"figures/train_test_split.svg\" width=\"100%\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, often (labeled) data is precious, and this approach lets us only use ~ 3/4 of our data for training. On the other hand, we will only ever try to apply our model 1/4 of our data for testing.\n",
    "A common way to use more of the data to build a model, but also get a more robust estimate of the generalization performance, is cross-validation.\n",
    "In cross-validation, the data is split repeatedly into a training and non-overlapping test-sets, with a separate model built for every pair. The test-set scores are then aggregated for a more robust estimate.\n",
    "\n",
    "The most common way to do cross-validation is k-fold cross-validation, in which the data is first split into k (often 5 or 10) equal-sized folds, and then for each iteration, one of the k folds is used as test data, and the rest as training data:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/cross_validation.svg\" width=\"100%\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This way, each data point will be in the test-set exactly once, and we can use all but a k'th of the data for training.\n",
    "Let us apply this technique to evaluate the KNeighborsClassifier algorithm on the Iris dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "iris = load_iris()\n",
    "X, y = iris.data, iris.target\n",
    "\n",
    "classifier = KNeighborsClassifier()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The labels in iris are sorted, which means that if we split the data as illustrated above, the first fold will only have the label 0 in it, while the last one will only have the label 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "y"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To avoid this problem in evaluation, we first shuffle our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "rng = np.random.RandomState(0)\n",
    "\n",
    "permutation = rng.permutation(len(X))\n",
    "X, y = X[permutation], y[permutation]\n",
    "print(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now implementing cross-validation is easy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "k = 5\n",
    "n_samples = len(X)\n",
    "fold_size = n_samples // k\n",
    "scores = []\n",
    "masks = []\n",
    "for fold in range(k):\n",
    "    # generate a boolean mask for the test set in this fold\n",
    "    test_mask = np.zeros(n_samples, dtype=bool)\n",
    "    test_mask[fold * fold_size : (fold + 1) * fold_size] = True\n",
    "    # store the mask for visualization\n",
    "    masks.append(test_mask)\n",
    "    # create training and test sets using this mask\n",
    "    X_test, y_test = X[test_mask], y[test_mask]\n",
    "    X_train, y_train = X[~test_mask], y[~test_mask]\n",
    "    # fit the classifier\n",
    "    classifier.fit(X_train, y_train)\n",
    "    # compute the score and record it\n",
    "    scores.append(classifier.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check that our test mask does the right thing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "plt.matshow(masks, cmap='gray_r')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now let's look a the scores we computed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(scores)\n",
    "print(np.mean(scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, there is a rather wide spectrum of scores from 90% correct to 100% correct. If we only did a single split, we might have gotten either answer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As cross-validation is such a common pattern in machine learning, there are functions to do the above for you with much more flexibility and less code.\n",
    "The ``sklearn.model_selection`` module has all functions related to cross validation. There easiest function is ``cross_val_score`` which takes an estimator and a dataset, and will do all of the splitting for you:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "scores = cross_val_score(classifier, X, y)\n",
    "print(scores)\n",
    "print(np.mean(scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the function uses three folds by default. You can change the number of folds using the cv argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cross_val_score(classifier, X, y, cv=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are also helper objects in the cross-validation module that will generate indices for you for all kinds of different cross-validation methods, including k-fold:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, cross_val_score will use ``StratifiedKFold`` for classification, which ensures that the class proportions in the dataset are reflected in each fold. If you have a binary classification dataset with 90% of data point belonging to class 0, that would mean that in each fold, 90% of datapoints would belong to class 0.\n",
    "If you would just use KFold cross-validation, it is likely that you would generate a split that only contains class 0.\n",
    "It is generally a good idea to use ``StratifiedKFold`` whenever you do classification.\n",
    "\n",
    "``StratifiedKFold`` would also remove our need to shuffle ``iris``.\n",
    "Let's see what kinds of folds it generates on the unshuffled iris dataset.\n",
    "Each cross-validation class is a generator of sets of training and test indices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cv = StratifiedKFold(n_splits=5)\n",
    "for train, test in cv.split(iris.data, iris.target):\n",
    "    print(test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, there are a couple of samples from the beginning, then from the middle, and then from the end, in each of the folds.\n",
    "This way, the class ratios are preserved. Let's visualize the split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def plot_cv(cv, features, labels):\n",
    "    masks = []\n",
    "    for train, test in cv.split(features, labels):\n",
    "        mask = np.zeros(len(labels), dtype=bool)\n",
    "        mask[test] = 1\n",
    "        masks.append(mask)\n",
    "    \n",
    "    plt.matshow(masks, cmap='gray_r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot_cv(StratifiedKFold(n_splits=5), iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, again the standard KFold, that ignores the labels:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot_cv(KFold(n_splits=5), iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Keep in mind that increasing the number of folds will give you a larger training dataset, but will lead to more repetitions, and therefore a slower evaluation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot_cv(KFold(n_splits=10), iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another helpful cross-validation generator is ``ShuffleSplit``. This generator simply splits of a random portion of the data repeatedly. This allows the user to specify the number of repetitions and the training set size independently:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot_cv(ShuffleSplit(n_splits=5, test_size=.2), iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want a more robust estimate, you can just increase the number of splits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plot_cv(ShuffleSplit(n_splits=20, test_size=.2), iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use all of these cross-validation generators with the `cross_val_score` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cv = ShuffleSplit(n_splits=5, test_size=.2)\n",
    "cross_val_score(classifier, X, y, cv=cv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise\n",
    "Perform three-fold cross-validation using the ``KFold`` class on the iris dataset without shuffling the data. Can you explain the result?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#%load solutions/13_cross_validation.py"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}


================================================
FILE: notebooks/14 Model Complexity and GridSearchCV.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%load_ext watermark\n",
    "%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SciPy 2016 Scikit-learn Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Parameter selection, Validation, and Testing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most models have parameters that influence how complex a model they can learn. Remember using `KNeighborsRegressor`.\n",
    "If we change the number of neighbors we consider, we get a smoother and smoother prediction:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/plot_kneigbors_regularization.png\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above figure, we see fits for three different values of ``n_neighbors``.\n",
    "For ``n_neighbors=2``, the data is overfit, the model is too flexible and can adjust too much to the noise in the training data. For ``n_neighbors=20``, the model is not flexible enough, and can not model the variation in the data appropriately.\n",
    "\n",
    "In the middle, for ``n_neighbors = 5``, we have found a good mid-point. It fits\n",
    "the data fairly well, and does not suffer from the overfit or underfit\n",
    "problems seen in the figures on either side. What we would like is a\n",
    "way to quantitatively identify overfit and underfit, and optimize the\n",
    "hyperparameters (in this case, the polynomial degree d) in order to\n",
    "determine the best algorithm.\n",
    "\n",
    "We trade off remembering too much about the particularities and noise of the training data vs. not modeling enough of the variability. This is a trade-off that needs to be made in basically every machine learning application and is a central concept, called bias-variance-tradeoff or \"overfitting vs underfitting\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/overfitting_underfitting_cartoon.svg\" width=\"100%\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hyperparameters, Over-fitting, and Under-fitting\n",
    "\n",
    "Unfortunately, there is no general rule how to find the sweet spot, and so machine learning practitioners have to find the best trade-off of model-complexity and generalization by trying several hyperparameter settings. Hyperparameters are the internal knobs or tuning parameters of a machine learning algorithm (in contrast to model parameters that the algorithm learns from the training data -- for example, the weight coefficients of a linear regression model); the number of *k* in K-nearest neighbors is such a hyperparameter.\n",
    "\n",
    "Most commonly this \"hyperparameter tuning\" is done using a brute force search, for example over multiple values of ``n_neighbors``:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, KFold\n",
    "from sklearn.neighbors import KNeighborsRegressor\n",
    "# generate toy dataset:\n",
    "x = np.linspace(-3, 3, 100)\n",
    "rng = np.random.RandomState(42)\n",
    "y = np.sin(4 * x) + x + rng.normal(size=len(x))\n",
    "X = x[:, np.newaxis]\n",
    "\n",
    "cv = KFold(shuffle=True)\n",
    "\n",
    "# for each parameter setting do cross-validation:\n",
    "for n_neighbors in [1, 3, 5, 10, 20]:\n",
    "    scores = cross_val_score(KNeighborsRegressor(n_neighbors=n_neighbors), X, y, cv=cv)\n",
    "    print(\"n_neighbors: %d, average score: %f\" % (n_neighbors, np.mean(scores)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a function in scikit-learn, called ``validation_plot`` to reproduce the cartoon figure above. It plots one parameter, such as the number of neighbors, against training and validation error (using cross-validation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import validation_curve\n",
    "n_neighbors = [1, 3, 5, 10, 20, 50]\n",
    "train_errors, test_errors = validation_curve(KNeighborsRegressor(), X, y, param_name=\"n_neighbors\",\n",
    "                                             param_range=n_neighbors, cv=cv)\n",
    "plt.plot(n_neighbors, train_errors.mean(axis=1), label=\"train error\")\n",
    "plt.plot(n_neighbors, test_errors.mean(axis=1), label=\"test error\")\n",
    "plt.legend(loc=\"best\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that many neighbors mean a \"smooth\" or \"simple\" model, so the plot is the mirror image of the diagram above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If multiple parameters are important, like the parameters ``C`` and ``gamma`` in an ``SVM`` (more about that later), all possible combinations are tried:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, KFold\n",
    "from sklearn.svm import SVR\n",
    "\n",
    "# each parameter setting do cross-validation:\n",
    "for C in [0.001, 0.01, 0.1, 1, 10]:\n",
    "    for gamma in [0.001, 0.01, 0.1, 1]:\n",
    "        scores = cross_val_score(SVR(C=C, gamma=gamma), X, y, cv=cv)\n",
    "        print(\"C: %f, gamma: %f, average score: %f\" % (C, gamma, np.mean(scores)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As this is such a very common pattern, there is a built-in class for this in scikit-learn, ``GridSearchCV``. ``GridSearchCV`` takes a dictionary that describes the parameters that should be tried and a model to train.\n",
    "\n",
    "The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}\n",
    "\n",
    "grid = GridSearchCV(SVR(), param_grid=param_grid, cv=cv, verbose=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the great things about GridSearchCV is that it is a *meta-estimator*. It takes an estimator like SVR above, and creates a new estimator, that behaves exactly the same - in this case, like a regressor.\n",
    "So we can call ``fit`` on it, to train it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "grid.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What ``fit`` does is a bit more involved then what we did above. First, it runs the same loop with cross-validation, to find the best parameter combination.\n",
    "Once it has the best combination, it runs fit again on all data passed to fit (without cross-validation), to built a single new model using the best parameter setting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, as with all models, we can use ``predict`` or ``score``:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "grid.predict(X)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "You can inspect the best parameters found by ``GridSearchCV`` in the ``best_params_`` attribute, and the best score in the ``best_score_`` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(grid.best_score_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(grid.best_params_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a problem with using this score for evaluation, however. You might be making what is called a multiple hypothesis testing error. If you try very many parameter settings, some of them will work better just by chance, and the score that you obtained might not reflect how your model would perform on new unseen data.\n",
    "Therefore, it is good to split off a separate test-set before performing grid-search. This pattern can be seen as a training-validation-test split, and is common in machine learning:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"figures/grid_search_cross_validation.svg\" width=\"100%\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can do this very easily by splitting of some test data using ``train_test_split``, training ``GridSearchCV`` on the training set, and applying the ``score`` method to the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
    "\n",
    "param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}\n",
    "cv = KFold(n_splits=10, shuffle=True)\n",
    "\n",
    "grid = GridSearchCV(SVR(), param_grid=param_grid, cv=cv)\n",
    "\n",
    "grid.fit(X_train, y_train)\n",
    "grid.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also look at the parameters that were selected:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "grid.best_params_"
   ]
  },
  {
   "cell_type": "markdown",

Download .txt

gitextract_9qj13mdv/

├── .gitignore
├── LICENSE
├── README.md
├── abstract.rst
├── check_env.ipynb
├── fetch_data.py
├── notebooks/
│   ├── 01 Introduction to Machine Learning.ipynb
│   ├── 02 Scientific Computing Tools in Python.ipynb
│   ├── 03 Data Representation for Machine Learning.ipynb
│   ├── 04 Training and Testing Data.ipynb
│   ├── 05 Supervised Learning - Classification.ipynb
│   ├── 06 Supervised Learning - Regression.ipynb
│   ├── 07 Unsupervised Learning - Transformations and Dimensionality Reduction.ipynb
│   ├── 08 Unsupervised Learning - Clustering.ipynb
│   ├── 09 Review of Scikit-learn API.ipynb
│   ├── 10 Case Study - Titanic Survival.ipynb
│   ├── 11 Text Feature Extraction.ipynb
│   ├── 12 Case Study - SMS Spam Detection.ipynb
│   ├── 13 Cross Validation.ipynb
│   ├── 14 Model Complexity and GridSearchCV.ipynb
│   ├── 15 Pipelining Estimators.ipynb
│   ├── 16 Performance metrics and Model Evaluation.ipynb
│   ├── 17 In Depth - Linear Models.ipynb
│   ├── 18 In Depth - Support Vector Machines.ipynb
│   ├── 19 In Depth - Trees and Forests.ipynb
│   ├── 20 Feature Selection.ipynb
│   ├── 21 Unsupervised learning - Hierarchical and density-based clustering algorithms.ipynb
│   ├── 22 Unsupervised learning - Non-linear dimensionality reduction.ipynb
│   ├── 23 Out-of-core Learning Large Scale Text Classification.ipynb
│   ├── figures/
│   │   ├── ML_flow_chart.py
│   │   ├── __init__.py
│   │   ├── plot_2d_separator.py
│   │   ├── plot_digits_dataset.py
│   │   ├── plot_helpers.py
│   │   ├── plot_interactive_forest.py
│   │   ├── plot_interactive_tree.py
│   │   ├── plot_kneighbors_regularization.py
│   │   ├── plot_linear_svc_regularization.py
│   │   ├── plot_pca.py
│   │   ├── plot_rbf_svm_parameters.py
│   │   └── plot_scaling.py
│   ├── helpers.py
│   └── solutions/
│       ├── 03A_faces_plot.py
│       ├── 04_wrong-predictions.py
│       ├── 05A_knn_with_diff_k.py
│       ├── 06A_knn_vs_linreg.py
│       ├── 07A_iris-pca.py
│       ├── 08B_digits_clustering.py
│       ├── 10_titanic.py
│       ├── 11_ngrams.py
│       ├── 12A_tfidf.py
│       ├── 12B_vectorizer_params.py
│       ├── 13_cross_validation.py
│       ├── 14_grid_search.py
│       ├── 15A_ridge_grid.py
│       ├── 16A_avg_per_class_acc.py
│       ├── 17A_logreg_grid.py
│       ├── 17B_learning_curve_alpha.py
│       ├── 18_svc_grid.py
│       ├── 19_gbc_grid.py
│       ├── 20_univariate_vs_mb_selection.py
│       ├── 21_clustering_comparison.py
│       ├── 22A_isomap_digits.py
│       ├── 22B_tsne_classification.py
│       └── 23_batchtrain.py
├── requirements.txt
├── slides/
│   └── scipy2016.pptx
└── todo.rst

Download .txt

SYMBOL INDEX (30 symbols across 14 files)

FILE: fetch_data.py
  function get_datasets_folder (line 15) | def get_datasets_folder():
  function check_imdb (line 35) | def check_imdb(datasets_folder):

FILE: notebooks/figures/ML_flow_chart.py
  function create_base (line 12) | def create_base(box_bg='#CCCCCC',
  function plot_supervised_chart (line 106) | def plot_supervised_chart(annotate=False):
  function plot_unsupervised_chart (line 124) | def plot_unsupervised_chart():

FILE: notebooks/figures/plot_2d_separator.py
  function plot_2d_separator (line 5) | def plot_2d_separator(classifier, X, fill=False, ax=None, eps=None):

FILE: notebooks/figures/plot_digits_dataset.py
  function digits_plot (line 14) | def digits_plot():

FILE: notebooks/figures/plot_interactive_forest.py
  function plot_forest (line 11) | def plot_forest(max_depth=1):
  function plot_forest_interactive (line 36) | def plot_forest_interactive():

FILE: notebooks/figures/plot_interactive_tree.py
  function tree_image (line 17) | def tree_image(tree, fout=None):
  function plot_tree (line 38) | def plot_tree(max_depth=1):
  function plot_tree_interactive (line 68) | def plot_tree_interactive():

FILE: notebooks/figures/plot_kneighbors_regularization.py
  function make_dataset (line 7) | def make_dataset(n_samples=100):
  function plot_regression_datasets (line 15) | def plot_regression_datasets():
  function plot_kneighbors_regularization (line 22) | def plot_kneighbors_regularization():

FILE: notebooks/figures/plot_linear_svc_regularization.py
  function plot_linear_svc_regularization (line 8) | def plot_linear_svc_regularization():

FILE: notebooks/figures/plot_pca.py
  function plot_pca_illustration (line 6) | def plot_pca_illustration():
  function plot_pca_whitening (line 64) | def plot_pca_whitening():

FILE: notebooks/figures/plot_rbf_svm_parameters.py
  function make_handcrafted_dataset (line 8) | def make_handcrafted_dataset():
  function plot_rbf_svm_parameters (line 18) | def plot_rbf_svm_parameters():
  function plot_svm (line 37) | def plot_svm(log_C, log_gamma):
  function plot_svm_interactive (line 52) | def plot_svm_interactive():

FILE: notebooks/figures/plot_scaling.py
  function plot_scaling (line 9) | def plot_scaling():
  function plot_relative_scaling (line 44) | def plot_relative_scaling():

FILE: notebooks/helpers.py
  function process_titanic_line (line 8) | def process_titanic_line(line):
  function load_titanic (line 40) | def load_titanic(test_size=.25, feature_skip_tuple=(), random_state=1999):
  function read_sentiment_csv (line 98) | def read_sentiment_csv(csv_file, fieldnames=FIELDNAMES, max_count=None,

FILE: notebooks/solutions/16A_avg_per_class_acc.py
  function accuracy (line 1) | def accuracy(true, pred):
  function macro (line 5) | def macro(true, pred):

FILE: notebooks/solutions/23_batchtrain.py
  function batch_train (line 9) | def batch_train(clf, fnames, labels, iterations=1,

Download .json

Condensed preview — 68 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (396K chars).

[
  {
    "path": ".gitignore",
    "chars": 241,
    "preview": "# exlude datasets and externals\nnotebooks/datasets\nnotebooks/joblib/\n\n# exclude temporary files\n.ipynb_checkpoints\n.DS_S"
  },
  {
    "path": "LICENSE",
    "chars": 6555,
    "preview": "CC0 1.0 Universal\n\nStatement of Purpose\n\nThe laws of most jurisdictions throughout the world automatically confer\nexclus"
  },
  {
    "path": "README.md",
    "chars": 8558,
    "preview": "SciPy 2016 Scikit-learn Tutorial\n================================\n\nBased on the SciPy [2015 tutorial](https://github.com"
  },
  {
    "path": "abstract.rst",
    "chars": 4109,
    "preview": "Machine Learning with scikit-learn\n\nTutorial Topic\n--------------\n\nThis tutorial aims to provide an introduction to mach"
  },
  {
    "path": "check_env.ipynb",
    "chars": 3768,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "fetch_data.py",
    "chars": 2375,
    "preview": "import os\n\ntry:\n    from urllib.request import urlopen\nexcept ImportError:\n    from urllib import urlopen\n\nimport tarfil"
  },
  {
    "path": "notebooks/01 Introduction to Machine Learning.ipynb",
    "chars": 9729,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/02 Scientific Computing Tools in Python.ipynb",
    "chars": 13640,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/03 Data Representation for Machine Learning.ipynb",
    "chars": 21363,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/04 Training and Testing Data.ipynb",
    "chars": 9404,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/05 Supervised Learning - Classification.ipynb",
    "chars": 12449,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/06 Supervised Learning - Regression.ipynb",
    "chars": 11368,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/07 Unsupervised Learning - Transformations and Dimensionality Reduction.ipynb",
    "chars": 14154,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/08 Unsupervised Learning - Clustering.ipynb",
    "chars": 13477,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/09 Review of Scikit-learn API.ipynb",
    "chars": 5524,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/10 Case Study - Titanic Survival.ipynb",
    "chars": 13666,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/11 Text Feature Extraction.ipynb",
    "chars": 13186,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/12 Case Study - SMS Spam Detection.ipynb",
    "chars": 10201,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/13 Cross Validation.ipynb",
    "chars": 12440,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/14 Model Complexity and GridSearchCV.ipynb",
    "chars": 13477,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/15 Pipelining Estimators.ipynb",
    "chars": 9144,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/16 Performance metrics and Model Evaluation.ipynb",
    "chars": 16657,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/17 In Depth - Linear Models.ipynb",
    "chars": 15213,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/18 In Depth - Support Vector Machines.ipynb",
    "chars": 6065,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/19 In Depth - Trees and Forests.ipynb",
    "chars": 12023,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/20 Feature Selection.ipynb",
    "chars": 11511,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"hide"
  },
  {
    "path": "notebooks/21 Unsupervised learning - Hierarchical and density-based clustering algorithms.ipynb",
    "chars": 12480,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpr"
  },
  {
    "path": "notebooks/22 Unsupervised learning - Non-linear dimensionality reduction.ipynb",
    "chars": 7846,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/23 Out-of-core Learning Large Scale Text Classification.ipynb",
    "chars": 22137,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \""
  },
  {
    "path": "notebooks/figures/ML_flow_chart.py",
    "chars": 4745,
    "preview": "\"\"\"\nTutorial Diagrams\n-----------------\n\nThis script plots the flow-charts used in the scikit-learn tutorials.\n\"\"\"\n\nimpo"
  },
  {
    "path": "notebooks/figures/__init__.py",
    "chars": 1017,
    "preview": "from .plot_2d_separator import plot_2d_separator\nfrom .plot_kneighbors_regularization import plot_kneighbors_regularizat"
  },
  {
    "path": "notebooks/figures/plot_2d_separator.py",
    "chars": 1513,
    "preview": "import numpy as np\nimport matplotlib.pyplot as plt\n\n\ndef plot_2d_separator(classifier, X, fill=False, ax=None, eps=None)"
  },
  {
    "path": "notebooks/figures/plot_digits_dataset.py",
    "chars": 2747,
    "preview": "# Taken from example in scikit-learn examples\n# Authors: Fabian Pedregosa <fabian.pedregosa@inria.fr>\n#          Olivier"
  },
  {
    "path": "notebooks/figures/plot_helpers.py",
    "chars": 147,
    "preview": "from matplotlib.colors import ListedColormap\n\ncm3 = ListedColormap(['#0000aa', '#ff2020', '#50ff50'])\ncm2 = ListedColorm"
  },
  {
    "path": "notebooks/figures/plot_interactive_forest.py",
    "chars": 1279,
    "preview": "import numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import make_blobs\nfrom sklearn.ensemble import"
  },
  {
    "path": "notebooks/figures/plot_interactive_tree.py",
    "chars": 2320,
    "preview": "import numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import make_blobs\nfrom sklearn.tree import Dec"
  },
  {
    "path": "notebooks/figures/plot_kneighbors_regularization.py",
    "chars": 1363,
    "preview": "import numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.neighbors import KNeighborsRegressor\n\n\ndef make_dataset"
  },
  {
    "path": "notebooks/figures/plot_linear_svc_regularization.py",
    "chars": 651,
    "preview": "import matplotlib.pyplot as plt\nimport numpy as np\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import make_blobs\nf"
  },
  {
    "path": "notebooks/figures/plot_pca.py",
    "chars": 3131,
    "preview": "from sklearn.decomposition import PCA\nimport matplotlib.pyplot as plt\nimport numpy as np\n\n\ndef plot_pca_illustration():\n"
  },
  {
    "path": "notebooks/figures/plot_rbf_svm_parameters.py",
    "chars": 2018,
    "preview": "import matplotlib.pyplot as plt\nimport numpy as np\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import make_blobs\nf"
  },
  {
    "path": "notebooks/figures/plot_scaling.py",
    "chars": 3240,
    "preview": "import matplotlib.pyplot as plt\nimport numpy as np\nfrom sklearn.datasets import make_blobs\nfrom sklearn.preprocessing im"
  },
  {
    "path": "notebooks/helpers.py",
    "chars": 5032,
    "preview": "import numpy as np\nfrom collections import defaultdict\nimport os\nfrom sklearn.model_selection import StratifiedShuffleSp"
  },
  {
    "path": "notebooks/solutions/03A_faces_plot.py",
    "chars": 363,
    "preview": "faces = fetch_olivetti_faces()\n\n# set up the figure\nfig = plt.figure(figsize=(6, 6))  # figure size in inches\nfig.subplo"
  },
  {
    "path": "notebooks/solutions/04_wrong-predictions.py",
    "chars": 699,
    "preview": "for i in incorrect_idx:\n    print('%d: Predicted %d True label %d' % (i, pred_y[i], test_y[i]))\n\n# Plot two dimensions\n\n"
  },
  {
    "path": "notebooks/solutions/05A_knn_with_diff_k.py",
    "chars": 1137,
    "preview": "from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\n\niris = load_iris()\nX = iri"
  },
  {
    "path": "notebooks/solutions/06A_knn_vs_linreg.py",
    "chars": 815,
    "preview": "from sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model "
  },
  {
    "path": "notebooks/solutions/07A_iris-pca.py",
    "chars": 1132,
    "preview": "from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.decomposition i"
  },
  {
    "path": "notebooks/solutions/08B_digits_clustering.py",
    "chars": 768,
    "preview": "from sklearn.cluster import KMeans\nkmeans = KMeans(n_clusters=10)\nclusters = kmeans.fit_predict(digits.data)\n\nprint(kmea"
  },
  {
    "path": "notebooks/solutions/10_titanic.py",
    "chars": 1224,
    "preview": "from sklearn.linear_model import LogisticRegression\nlr = LogisticRegression().fit(train_data_finite, train_labels)\nprint"
  },
  {
    "path": "notebooks/solutions/11_ngrams.py",
    "chars": 546,
    "preview": "text = zen.split(\"\\n\")\nfor n in [2, 3, 4]:\n    cv = CountVectorizer(ngram_range=(n, n)).fit(text)\n    counts = cv.transf"
  },
  {
    "path": "notebooks/solutions/12A_tfidf.py",
    "chars": 388,
    "preview": "from sklearn.feature_extraction.text import TfidfVectorizer\n\nvectorizer = TfidfVectorizer()\nvectorizer.fit(text_train)\n\n"
  },
  {
    "path": "notebooks/solutions/12B_vectorizer_params.py",
    "chars": 611,
    "preview": "# CountVectorizer\nvectorizer = CountVectorizer(min_df=10, ngram_range=(1, 3))\nvectorizer.fit(text_train)\n\nX_train = vect"
  },
  {
    "path": "notebooks/solutions/13_cross_validation.py",
    "chars": 82,
    "preview": "cv = KFold(n_splits=3)\ncross_val_score(classifier, iris.data, iris.target, cv=cv)\n"
  },
  {
    "path": "notebooks/solutions/14_grid_search.py",
    "chars": 473,
    "preview": "from sklearn.datasets import load_digits\nfrom sklearn.neighbors import KNeighborsClassifier\n\ndigits = load_digits()\nX_tr"
  },
  {
    "path": "notebooks/solutions/15A_ridge_grid.py",
    "chars": 904,
    "preview": "from sklearn.preprocessing import StandardScaler\nfrom sklearn.datasets import load_boston\nfrom sklearn.preprocessing imp"
  },
  {
    "path": "notebooks/solutions/16A_avg_per_class_acc.py",
    "chars": 528,
    "preview": "def accuracy(true, pred):\n    return (true == pred).sum() / float(true.shape[0])\n\n\ndef macro(true, pred):\n    scores = ["
  },
  {
    "path": "notebooks/solutions/17A_logreg_grid.py",
    "chars": 693,
    "preview": "from sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split, GridSearchCV\nfrom sklearn"
  },
  {
    "path": "notebooks/solutions/17B_learning_curve_alpha.py",
    "chars": 676,
    "preview": "X, y, true_coefficient = make_regression(n_samples=200, n_features=30, n_informative=10, noise=100, coef=True, random_st"
  },
  {
    "path": "notebooks/solutions/18_svc_grid.py",
    "chars": 672,
    "preview": "from sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split, GridSearchCV\nfrom sklearn"
  },
  {
    "path": "notebooks/solutions/19_gbc_grid.py",
    "chars": 777,
    "preview": "from sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split, GridSearchCV\nfrom sklearn"
  },
  {
    "path": "notebooks/solutions/20_univariate_vs_mb_selection.py",
    "chars": 612,
    "preview": "from sklearn.feature_selection import SelectKBest, SelectFromModel\nfrom sklearn.ensemble import RandomForestClassifier\ni"
  },
  {
    "path": "notebooks/solutions/21_clustering_comparison.py",
    "chars": 539,
    "preview": "from sklearn.datasets import make_circles\nfrom sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN\n\nX, y = ma"
  },
  {
    "path": "notebooks/solutions/22A_isomap_digits.py",
    "chars": 552,
    "preview": "from sklearn.manifold import Isomap\niso = Isomap(n_components=2)\ndigits_isomap = iso.fit_transform(digits.data)\n\nplt.fig"
  },
  {
    "path": "notebooks/solutions/22B_tsne_classification.py",
    "chars": 670,
    "preview": "from sklearn.manifold import TSNE\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.model_selection import"
  },
  {
    "path": "notebooks/solutions/23_batchtrain.py",
    "chars": 1947,
    "preview": "import os\nimport numpy as np\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.feature_extraction.text import "
  },
  {
    "path": "requirements.txt",
    "chars": 233,
    "preview": "# brew update && brew install gcc (this includes gfortran)\nipython[all]>=3.2.0\npyzmq>=14.7.0\nPillow>=2.9.0\nnumpy>=1.9.2\n"
  },
  {
    "path": "todo.rst",
    "chars": 119,
    "preview": "replace spam by imdb text data\nmake sure there are notebooks for all sections\nmake sure there are exercises everywhere\n"
  }
]

// ... and 1 more files (download for full content)

About this extraction

This page contains the full source code of the amueller/scipy-2016-sklearn GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 68 files (350.7 KB), approximately 101.4k tokens, and a symbol index with 30 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo