[
  {
    "path": ".gitignore",
    "content": "# exlude datasets and externals\nnotebooks/datasets\nnotebooks/joblib/\n\n# exclude temporary files\n.ipynb_checkpoints\n.DS_Store\ngmon.out\n__pycache__\n*.pyc\n*.o\n*.so\n*.gcno\n*.swp\n*.egg-info\n*.egg\n*~\nbuild\ndist\nlib/test\ndoc/_build\n*env\n*ENV\n.idea\n"
  },
  {
    "path": "LICENSE",
    "content": "CC0 1.0 Universal\n\nStatement of Purpose\n\nThe laws of most jurisdictions throughout the world automatically confer\nexclusive Copyright and Related Rights (defined below) upon the creator and\nsubsequent owner(s) (each and all, an \"owner\") of an original work of\nauthorship and/or a database (each, a \"Work\").\n\nCertain owners wish to permanently relinquish those rights to a Work for the\npurpose of contributing to a commons of creative, cultural and scientific\nworks (\"Commons\") that the public can reliably and without fear of later\nclaims of infringement build upon, modify, incorporate in other works, reuse\nand redistribute as freely as possible in any form whatsoever and for any\npurposes, including without limitation commercial purposes. These owners may\ncontribute to the Commons to promote the ideal of a free culture and the\nfurther production of creative, cultural and scientific works, or to gain\nreputation or greater distribution for their Work in part through the use and\nefforts of others.\n\nFor these and/or other purposes and motivations, and without any expectation\nof additional consideration or compensation, the person associating CC0 with a\nWork (the \"Affirmer\"), to the extent that he or she is an owner of Copyright\nand Related Rights in the Work, voluntarily elects to apply CC0 to the Work\nand publicly distribute the Work under its terms, with knowledge of his or her\nCopyright and Related Rights in the Work and the meaning and intended legal\neffect of CC0 on those rights.\n\n1. Copyright and Related Rights. A Work made available under CC0 may be\nprotected by copyright and related or neighboring rights (\"Copyright and\nRelated Rights\"). Copyright and Related Rights include, but are not limited\nto, the following:\n\n  i. the right to reproduce, adapt, distribute, perform, display, communicate,\n  and translate a Work;\n\n  ii. moral rights retained by the original author(s) and/or performer(s);\n\n  iii. publicity and privacy rights pertaining to a person's image or likeness\n  depicted in a Work;\n\n  iv. rights protecting against unfair competition in regards to a Work,\n  subject to the limitations in paragraph 4(a), below;\n\n  v. rights protecting the extraction, dissemination, use and reuse of data in\n  a Work;\n\n  vi. database rights (such as those arising under Directive 96/9/EC of the\n  European Parliament and of the Council of 11 March 1996 on the legal\n  protection of databases, and under any national implementation thereof,\n  including any amended or successor version of such directive); and\n\n  vii. other similar, equivalent or corresponding rights throughout the world\n  based on applicable law or treaty, and any national implementations thereof.\n\n2. Waiver. To the greatest extent permitted by, but not in contravention of,\napplicable law, Affirmer hereby overtly, fully, permanently, irrevocably and\nunconditionally waives, abandons, and surrenders all of Affirmer's Copyright\nand Related Rights and associated claims and causes of action, whether now\nknown or unknown (including existing as well as future claims and causes of\naction), in the Work (i) in all territories worldwide, (ii) for the maximum\nduration provided by applicable law or treaty (including future time\nextensions), (iii) in any current or future medium and for any number of\ncopies, and (iv) for any purpose whatsoever, including without limitation\ncommercial, advertising or promotional purposes (the \"Waiver\"). 
Affirmer makes\nthe Waiver for the benefit of each member of the public at large and to the\ndetriment of Affirmer's heirs and successors, fully intending that such Waiver\nshall not be subject to revocation, rescission, cancellation, termination, or\nany other legal or equitable action to disrupt the quiet enjoyment of the Work\nby the public as contemplated by Affirmer's express Statement of Purpose.\n\n3. Public License Fallback. Should any part of the Waiver for any reason be\njudged legally invalid or ineffective under applicable law, then the Waiver\nshall be preserved to the maximum extent permitted taking into account\nAffirmer's express Statement of Purpose. In addition, to the extent the Waiver\nis so judged Affirmer hereby grants to each affected person a royalty-free,\nnon transferable, non sublicensable, non exclusive, irrevocable and\nunconditional license to exercise Affirmer's Copyright and Related Rights in\nthe Work (i) in all territories worldwide, (ii) for the maximum duration\nprovided by applicable law or treaty (including future time extensions), (iii)\nin any current or future medium and for any number of copies, and (iv) for any\npurpose whatsoever, including without limitation commercial, advertising or\npromotional purposes (the \"License\"). The License shall be deemed effective as\nof the date CC0 was applied by Affirmer to the Work. Should any part of the\nLicense for any reason be judged legally invalid or ineffective under\napplicable law, such partial invalidity or ineffectiveness shall not\ninvalidate the remainder of the License, and in such case Affirmer hereby\naffirms that he or she will not (i) exercise any of his or her remaining\nCopyright and Related Rights in the Work or (ii) assert any associated claims\nand causes of action with respect to the Work, in either case contrary to\nAffirmer's express Statement of Purpose.\n\n4. Limitations and Disclaimers.\n\n  a. No trademark or patent rights held by Affirmer are waived, abandoned,\n  surrendered, licensed or otherwise affected by this document.\n\n  b. Affirmer offers the Work as-is and makes no representations or warranties\n  of any kind concerning the Work, express, implied, statutory or otherwise,\n  including without limitation warranties of title, merchantability, fitness\n  for a particular purpose, non infringement, or the absence of latent or\n  other defects, accuracy, or the present or absence of errors, whether or not\n  discoverable, all to the greatest extent permissible under applicable law.\n\n  c. Affirmer disclaims responsibility for clearing rights of other persons\n  that may apply to the Work or any use thereof, including without limitation\n  any person's Copyright and Related Rights in the Work. Further, Affirmer\n  disclaims responsibility for obtaining any necessary consents, permissions\n  or other rights required for any use of the Work.\n\n  d. Affirmer understands and acknowledges that Creative Commons is not a\n  party to this document and has no duty or obligation with respect to this\n  CC0 or use of the Work.\n\nFor more information, please see\n<http://creativecommons.org/publicdomain/zero/1.0/>\n"
  },
  {
    "path": "README.md",
    "content": "SciPy 2016 Scikit-learn Tutorial\n================================\n\nBased on the SciPy [2015 tutorial](https://github.com/amueller/scipy_2015_sklearn_tutorial) by [Kyle Kastner](https://kastnerkyle.github.io/) and [Andreas Mueller](http://amueller.github.io).\n\n\nInstructors\n-----------\n\n- [Sebastian Raschka](http://sebastianraschka.com)  [@rasbt](https://twitter.com/rasbt) - Michigan State University, Computational Biology;  [Book: Python Machine Learning](https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130/ref=sr_1_1?ie=UTF8&qid=1468347805&sr=8-1&keywords=sebastian+raschka)\n- [Andreas Mueller](http://amuller.github.io) [@amuellerml](https://twitter.com/t3kcit) - NYU Center for Data Science; [Book: Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)\n\n---\n\n**The video recording of the tutorial is now available via YouTube:**\n\n- [Part 1](https://www.youtube.com/watch?list=PLYx7XA2nY5Gf37zYZMw6OqGFRPjB1jCy6&v=OB1reY6IX-o)\n- [Part 2](https://www.youtube.com/watch?v=Cte8FYCpylk&list=PLYx7XA2nY5Gf37zYZMw6OqGFRPjB1jCy6&index=90)\n\n---\n\nThis repository will contain the teaching material and other info associated with our scikit-learn tutorial\nat [SciPy 2016](http://scipy2016.scipy.org/ehome/index.php?eventid=146062&tabid=332930&) held July 11-17 in Austin, Texas.\n\nParts 1 to 5 make up the morning session, while\nparts 6 to 9 will be presented in the afternoon.\n\n### Schedule:\n\nThe 2-part tutorial will be held on Tuesday, July 12, 2016.\n\n- Parts 1 to 5: 8:00 AM - 12:00 PM (Room 105)\n- Parts 6 to 9: 1:30 PM - 5:30 PM (Room 105)\n\n(You can find the full SciPy 2016 tutorial schedule [here](http://scipy2016.scipy.org/ehome/146062/332960/).)\n\n\n\nObtaining the Tutorial Material\n------------------\n\n\nIf you have a GitHub account, it is probably most convenient if you fork the GitHub repository. If you don’t have an GitHub account, you can download the repository as a .zip file by heading over to the GitHub repository (https://github.com/amueller/scipy-2016-sklearn) in your browser and click the green “Download” button in the upper right.\n\n![](images/download-repo.png)\n\nPlease note that we may add and improve the material until shortly before the tutorial session, and we recommend you to update your copy of the materials one day before the tutorials. 
If you have a GitHub account and forked/cloned the repository via GitHub, you can sync your existing fork via the following commands:\n\n```\ngit remote add upstream https://github.com/amueller/scipy-2016-sklearn.git\ngit fetch upstream\ngit checkout master\ngit merge upstream/master\n```\n\nIf you don’t have a GitHub account, you may have to re-download the .zip archive from GitHub.\n\n\nInstallation Notes\n------------------\n\nThis tutorial will require recent installations of\n\n- [NumPy](http://www.numpy.org)\n- [SciPy](http://www.scipy.org)\n- [matplotlib](http://matplotlib.org)\n- [pillow](https://python-pillow.org)\n- [scikit-learn](http://scikit-learn.org/stable/)\n- [PyYaml](http://pyyaml.org/wiki/PyYAML)\n- [IPython](http://ipython.readthedocs.org/en/stable/)\n- [Jupyter Notebook](http://jupyter.org)\n- [Watermark](https://pypi.python.org/pypi/watermark)\n\nAmong these, the Jupyter Notebook is especially important: you should be able to type\n\n    jupyter notebook\n\nin your terminal window and see the notebook panel load in your web browser.\nTry opening and running a notebook from the material to check that it works.\n\nFor users who do not yet have these packages installed, a relatively\npainless way to install all the requirements is to use a Python distribution\nsuch as [Anaconda CE](http://store.continuum.io/ \"Anaconda CE\"), which includes\nthe most relevant Python packages for science, math, engineering, and\ndata analysis; Anaconda can be downloaded and installed for free,\nincluding for commercial use and redistribution.\nThe code examples in this tutorial should be compatible with Python 2.7,\nPython 3.4, and Python 3.5.\n\nAfter obtaining the material, we **strongly recommend** that you open and execute the Jupyter Notebook\n`check_env.ipynb`, which is located at the top level of this repository. You can open the notebook\nby executing\n\n```bash\njupyter notebook check_env.ipynb\n```\n\ninside the repository. Inside the notebook, you can run the code cell by\nclicking on the \"Run Cells\" button as illustrated in the figure below:\n\n![](images/check_env-1.png)\n\n\nFinally, if your environment satisfies the requirements for the tutorials, the executed code cell will produce an output message as shown below:\n\n![](images/check_env-2.png)\n\n\nAlthough not required, we also recommend updating the required Python packages to their latest versions to ensure best compatibility with the teaching material. Please upgrade already installed packages by executing\n\n- `pip install [package-name] --upgrade`  \n- or `conda update [package-name]`\n\n\n\nData Downloads\n--------------\n\nThe data for this tutorial is not included in the repository.  We will be\nusing several data sets during the tutorial: most are built into\nscikit-learn, which\nincludes code that automatically downloads and caches these\ndata.\n\n**Because the wireless network\nat conferences can often be spotty, it would be a good idea to download these\ndata sets before arriving at the conference.\nPlease run ``python fetch_data.py`` to download all necessary data beforehand.**\n\nThe download size of the data files is approx. 280 MB, and after `fetch_data.py`\nhas extracted the data on your disk, the ./notebooks/datasets folder will take up about 480 MB\nof disk space.\n\n
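If you would like to double-check that the largest download (the Labeled Faces in the Wild data) was cached correctly, the following minimal sketch mirrors the call made by `fetch_data.py`; the relative `notebooks/datasets` path assumes you run it from the top level of this repository. With the data already cached, it loads from disk instead of re-downloading:\n\n```python\nfrom sklearn.datasets import fetch_lfw_people\n\n# Loads the cached copy from notebooks/datasets (created by fetch_data.py)\n# instead of re-downloading the ~200 MB archive.\nlfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4,\n                       data_home='notebooks/datasets')\nprint(lfw.images.shape)\n```\n\n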
Outline\n=======\n\nMorning Session\n---------------\n\n- 01 Introduction to machine learning with sample applications, Supervised and Unsupervised learning [[view](notebooks/01\\ Introduction\\ to\\ Machine\\ Learning.ipynb)]\n- 02 Scientific Computing Tools for Python: NumPy, SciPy, and matplotlib [[view](notebooks/02\\ Scientific\\ Computing\\ Tools\\ in\\ Python.ipynb)]\n- 03 Data formats, preparation, and representation [[view](notebooks/03\\ Data\\ Representation\\ for\\ Machine\\ Learning.ipynb)]\n- 04 Supervised learning: Training and test data [[view](notebooks/04\\ Training\\ and\\ Testing\\ Data.ipynb)]\n- 05 Supervised learning: Estimators for classification [[view](notebooks/05\\ Supervised\\ Learning\\ -\\ Classification.ipynb)]\n- 06 Supervised learning: Estimators for regression analysis [[view](notebooks/06\\ Supervised\\ Learning\\ -\\ Regression.ipynb)]\n- 07 Unsupervised learning: Unsupervised Transformers [[view](notebooks/07\\ Unsupervised\\ Learning\\ -\\ Transformations\\ and\\ Dimensionality\\ Reduction.ipynb)]\n- 08 Unsupervised learning: Clustering [[view](notebooks/08\\ Unsupervised\\ Learning\\ -\\ Clustering.ipynb)]\n- 09 The scikit-learn estimator interface [[view](notebooks/09\\ Review\\ of\\ Scikit-learn\\ API.ipynb)]\n- 10 Preparing a real-world dataset (titanic) [[view](notebooks/10\\ Case\\ Study\\ -\\ Titanic\\ Survival.ipynb)]\n- 11 Working with text data via the bag-of-words model [[view](notebooks/11\\ Text\\ Feature\\ Extraction.ipynb)]\n- 12 Application: IMDb Movie Review Sentiment Analysis [[view](notebooks/12\\ Case\\ Study\\ -\\ SMS\\ Spam\\ Detection.ipynb)]\n\nAfternoon Session\n-----------------\n\n- 13 Cross-Validation [[view](notebooks/13\\ Cross\\ Validation.ipynb)]\n- 14 Model complexity and grid search for adjusting hyperparameters [[view](notebooks/14\\ Model\\ Complexity\\ and\\ GridSearchCV.ipynb)]\n- 15 Scikit-learn Pipelines [[view](notebooks/15\\ Pipelining\\ Estimators.ipynb)]\n- 16 Supervised learning: Performance metrics for classification [[view](notebooks/16\\ Performance\\ metrics\\ and\\ Model\\ Evaluation.ipynb)]\n- 17 Supervised learning: Linear Models [[view](notebooks/17\\ In\\ Depth\\ -\\ Linear\\ Models.ipynb)]\n- 18 Supervised learning: Support Vector Machines [[view](notebooks/18\\ In\\ Depth\\ -\\ Support\\ Vector\\ Machines.ipynb)]\n- 19 Supervised learning: Decision trees and random forests, and ensemble methods [[view](notebooks/19\\ In\\ Depth\\ -\\ Trees\\ and\\ Forests.ipynb)]\n- 20 Supervised learning: feature selection [[view](notebooks/20\\ Feature\\ Selection.ipynb)]\n- 21 Unsupervised learning: Hierarchical and density-based clustering algorithms [[view](notebooks/21\\ Unsupervised\\ learning\\ -\\ Hierarchical\\ and\\ density-based\\ clustering\\ algorithms.ipynb)]\n- 22 Unsupervised learning: Non-linear dimensionality reduction [[view](notebooks/22\\ Unsupervised\\ learning\\ -\\ Non-linear\\ dimensionality\\ reduction.ipynb)]\n- 23 Supervised learning: Out-of-core learning [[view](notebooks/23\\ Out-of-core\\ Learning\\ Large\\ Scale\\ Text\\ Classification.ipynb)]\n"
  },
  {
    "path": "abstract.rst",
    "content": "Machine Learning with scikit-learn\n\nTutorial Topic\n--------------\n\nThis tutorial aims to provide an introduction to machine learning and\nscikit-learn \"from the ground up\". We will start with core concepts of machine\nlearning, some example uses of machine learning, and how to implement them\nusing scikit-learn. Going in detail through the characteristics of several\nmethods, we will discuss how to pick an algorithm for your application, how to\nset its parameters, and how to evaluate performance.\n\nPlease provide a more detailed abstract of your tutorial (again, see last years tutorials).\n---------------------------------------------------------------------------------------------\n\nMachine learning is the task of extracting knowledge from data, often with the\ngoal of generalizing to new and unseen data. Applications of machine learning \nnow touch nearly every aspect of everyday life, from the face detection in our\nphones and the streams of social media we consume to picking restaurants,\npartners, and movies. Machine learning has also become indispensable to many\nempirical sciences, from physics, astronomy and biology to social sciences.\n\nScikit-learn has emerged as one of the most popular toolkits for machine\nlearning, and is now widely used in industry and academia.\nThe goal of this tutorial is to enable participants to use the wide variety of\nmachine learning algorithms available in scikit-learn on their own data sets,\nfor their own domains.\n\nThis tutorial will comprise an introductory morning session and an advanced\nafternoon session. The morning part of the tutorial will cover basic concepts\nof machine learning, data representation, and preprocessing. We will explain\ndifferent problem settings and which algorithms to use in each situation.\nWe will then go through some sample applications using algorithms implemented\nin scikit-learn, including SVMs, Random Forests, K-Means, PCA, t-SNE, and\nothers.\n\nIn the afternoon session, we will discuss setting hyper-parameters and how to\nprevent overfitting. We will go in-depth into the trade-off of model complexity\nand dataset size, as well as discussing complexity of learning algorithms and\nhow to cope with very large datasets. The session will conclude by stepping\nthrough the process of building machine learning pipelines consisting of\nfeature extraction, preprocessing and supervised learning.\n\n\nOutline\n========\n\nMorning Session\n----------------\n\n- Introduction to machine learning with sample applications\n\n- Types of machine learning: Unsupervised vs. 
supervised learning\n\n- Scientific Computing Tools for Python: NumPy, SciPy, and matplotlib\n\n- Data formats, preparation, and representation\n\n- Supervised learning: Training and test data\n- Supervised learning: The scikit-learn estimator interface\n- Supervised learning: Estimators for classification\n- Supervised learning: Estimators for regression analysis\n\n- Unsupervised learning: Unsupervised Transformers\n- Unsupervised learning: Preprocessing and scaling\n- Unsupervised learning: Feature extraction and dimensionality reduction\n- Unsupervised learning: Clustering\n\n- Preparing a real-world dataset\n- Working with text data via the bag-of-words model\n- Application: IMDB Movie Review Sentiment Analysis\n\n\nAfternoon Session\n------------------\n- Cross-Validation\n- Model Complexity: Overfitting and underfitting\n- Complexity of various model types\n- Grid search for adjusting hyperparameters \n\n- Scikit-learn Pipelines\n\n- Supervised learning: Performance metrics for classification\n- Supervised learning: Support Vector Machines\n- Supervised learning: Algorithm and model selection via nested cross-validation\n- Supervised learning: Decision trees and random forests, and ensemble methods\n\n- Unsupervised learning: Non-linear regression analysis\n- Unsupervised learning: Hierarchical and density-based clustering algorithms\n- Unsupervised learning: Non-linear dimensionality reduction\n\n- Wrapper, filter, and embedded approaches for feature selection\n\n- Supervised learning: Artificial neural networks: Multi-layer perceptrons\n- Supervised learning: Out-of-core learning\n"
  },
  {
    "path": "check_env.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from __future__ import print_function\\n\",\n    \"from distutils.version import LooseVersion as Version\\n\",\n    \"import sys\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"try:\\n\",\n    \"    import curses\\n\",\n    \"    curses.setupterm()\\n\",\n    \"    assert curses.tigetnum(\\\"colors\\\") > 2\\n\",\n    \"    OK = \\\"\\\\x1b[1;%dm[ OK ]\\\\x1b[0m\\\" % (30 + curses.COLOR_GREEN)\\n\",\n    \"    FAIL = \\\"\\\\x1b[1;%dm[FAIL]\\\\x1b[0m\\\" % (30 + curses.COLOR_RED)\\n\",\n    \"except:\\n\",\n    \"    OK = '[ OK ]'\\n\",\n    \"    FAIL = '[FAIL]'\\n\",\n    \"\\n\",\n    \"try:\\n\",\n    \"    import importlib\\n\",\n    \"except ImportError:\\n\",\n    \"    print(FAIL, \\\"Python version 3.4 (or 2.7) is required,\\\"\\n\",\n    \"                \\\" but %s is installed.\\\" % sys.version)\\n\",\n    \"\\n\",\n    \"    \\n\",\n    \"def import_version(pkg, min_ver, fail_msg=\\\"\\\"):\\n\",\n    \"    mod = None\\n\",\n    \"    try:\\n\",\n    \"        mod = importlib.import_module(pkg)\\n\",\n    \"        if pkg in {'PIL'}:\\n\",\n    \"            ver = mod.VERSION\\n\",\n    \"        else:\\n\",\n    \"            ver = mod.__version__\\n\",\n    \"        if Version(ver) < min_ver:\\n\",\n    \"            print(FAIL, \\\"%s version %s or higher required, but %s installed.\\\"\\n\",\n    \"                  % (lib, min_ver, ver))\\n\",\n    \"        else:\\n\",\n    \"            print(OK, '%s version %s' % (pkg, ver))\\n\",\n    \"    except ImportError:\\n\",\n    \"        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))\\n\",\n    \"    return mod\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"# first check the python version\\n\",\n    \"print('Using python in', sys.prefix)\\n\",\n    \"print(sys.version)\\n\",\n    \"pyversion = Version(sys.version)\\n\",\n    \"if pyversion >= \\\"3\\\":\\n\",\n    \"    if pyversion < \\\"3.4\\\":\\n\",\n    \"        print(FAIL, \\\"Python version 3.4 (or 2.7) is required,\\\"\\n\",\n    \"                    \\\" but %s is installed.\\\" % sys.version)\\n\",\n    \"elif pyversion >= \\\"2\\\":\\n\",\n    \"    if pyversion < \\\"2.7\\\":\\n\",\n    \"        print(FAIL, \\\"Python version 2.7 is required,\\\"\\n\",\n    \"                    \\\" but %s is installed.\\\" % sys.version)\\n\",\n    \"else:\\n\",\n    \"    print(FAIL, \\\"Unknown Python version: %s\\\" % sys.version)\\n\",\n    \"\\n\",\n    \"print()\\n\",\n    \"requirements = {'numpy': \\\"1.6.1\\\", 'scipy': \\\"0.9\\\", 'matplotlib': \\\"1.0\\\",\\n\",\n    \"                'IPython': \\\"3.0\\\", 'sklearn': \\\"0.18\\\", 'pandas': \\\"0.19\\\",\\n\",\n    \"                'watermark': \\\"1.3.1\\\",\\n\",\n    \"                'yaml': \\\"3.11\\\", 'PIL': \\\"1.1.7\\\"}\\n\",\n    \"\\n\",\n    \"# now the dependencies\\n\",\n    \"for lib, required_version in list(requirements.items()):\\n\",\n    \"    import_version(lib, required_version)\\n\",\n    \"\\n\",\n    \"# pydot is a bit different\\n\",\n    \"import_version(\\\"pydot\\\", \\\"0\\\", fail_msg=\\\"pydot is not installed.\\\"\\n\",\n    \"               \\\"It is not required but you will miss out on some plots.\\\"\\n\",\n    \"               \\\"\\\\nYou can install it using \\\"\\n\",\n    \"               \\\"'pip install pydot' on python2, and 'pip install \\\"\\n\",\n    \"               
\\\"git+https://github.com/nlhepler/pydot.git' on python3.\\\");\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "fetch_data.py",
    "content": "import os\n\ntry:\n    from urllib.request import urlopen\nexcept ImportError:\n    from urllib import urlopen\n\nimport tarfile\n\n\nIMDB_URL = \"http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\nIMDB_ARCHIVE_NAME = \"aclImdb_v1.tar.gz\"\n\n\ndef get_datasets_folder():\n    here = os.path.dirname(__file__)\n    notebooks = os.path.join(here, 'notebooks')\n    datasets_folder = os.path.abspath(os.path.join(notebooks, 'datasets'))\n    datasets_archive = os.path.abspath(os.path.join(notebooks, 'datasets.zip'))\n\n    if not os.path.exists(datasets_folder):\n        if os.path.exists(datasets_archive):\n            print(\"Extracting \" + datasets_archive)\n            zf = zipfile.ZipFile(datasets_archive)\n            zf.extractall('.')\n            assert os.path.exists(datasets_folder)\n        else:\n            print(\"Creating datasets folder: \" + datasets_folder)\n            os.makedirs(datasets_folder)\n    else:\n        print(\"Using existing dataset folder:\" + datasets_folder)\n    return datasets_folder\n\n\ndef check_imdb(datasets_folder):\n    print(\"\\nChecking availability of the IMDb dataset\")\n    archive_path = os.path.join(datasets_folder, IMDB_ARCHIVE_NAME)\n    imdb_path = os.path.join(datasets_folder, 'IMDb')\n\n    train_path = os.path.join(imdb_path, 'aclImdb', 'train')\n    test_path = os.path.join(imdb_path, 'aclImdb', 'test')\n\n    if not os.path.exists(imdb_path):\n        if not os.path.exists(archive_path):\n            print(\"Downloading dataset from %s (84.1MB)\" % IMDB_URL)\n            opener = urlopen(IMDB_URL)\n            open(archive_path, 'wb').write(opener.read())\n        else:\n            print(\"Found archive: \" + archive_path)\n\n        print(\"Extracting %s to %s\" % (archive_path, imdb_path))\n\n        tar = tarfile.open(archive_path, \"r:gz\")\n        tar.extractall(path=imdb_path)\n        tar.close()\n        os.remove(archive_path)\n\n    print(\"Checking that the IMDb train & test directories exist...\")\n    assert os.path.exists(train_path)\n    assert os.path.exists(test_path)\n    print(\"=> Success!\")\n\n\nif __name__ == \"__main__\":\n    datasets_folder = get_datasets_folder()\n    check_imdb(datasets_folder)\n\n    print(\"\\nLoading Labeled Faces Data (~200MB)\")\n    from sklearn.datasets import fetch_lfw_people\n    fetch_lfw_people(min_faces_per_person=70, resize=0.4,\n                     data_home=datasets_folder)\n    print(\"=> Success!\")\n"
  },
  {
    "path": "notebooks/01 Introduction to Machine Learning.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The use of watermark (above) is optional, and we use it to keep track of the changes while developing the tutorial material. (You can install this IPython extension via \\\"pip install watermark\\\". For more information, please see: https://github.com/rasbt/watermark).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# 01.1 Introduction to Machine Learning in Python\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## What is Machine Learning?\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Machine learning is the process of extracting knowledge from data automatically, usually with the goal of making predictions on new, unseen data. A classical example is a spam filter, for which the user keeps labeling incoming mails as either spam or not spam. A machine learning algorithm then \\\"learns\\\" a predictive model from data that distinguishes spam from normal emails, a model which can predict for new emails whether they are spam or not.   \\n\",\n    \"\\n\",\n    \"Central to machine learning is the concept of **automating decision making** from data **without the user specifying explicit rules** how this decision should be made.\\n\",\n    \"\\n\",\n    \"For the case of emails, the user doesn't provide a list of words or characteristics that make an email spam. Instead, the user provides examples of spam and non-spam emails that are labeled as such.\\n\",\n    \"\\n\",\n    \"The second central concept is **generalization**. The goal of a machine learning model is to predict on new, previously unseen data. In a real-world application, we are not interested in marking an already labeled email as spam or not. Instead, we want to make the user's life easier by automatically classifying new incoming mail.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/supervised_workflow.svg\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The data is presented to the algorithm usually as a two-dimensional array (or matrix) of numbers. Each data point (also known as a *sample* or *training instance*) that we want to either learn from or make a decision on is represented as a list of numbers, a so-called feature vector, and its containing features represent the properties of this point. \\n\",\n    \"\\n\",\n    \"Later, we will work with a popular dataset called *Iris* -- among many other datasets. Iris, a classic benchmark dataset in the field of machine learning, contains the measurements of 150 iris flowers from 3 different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica. 
\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Iris Setosa\\n\",\n    \"<img src=\\\"figures/iris_setosa.jpg\\\" width=\\\"50%\\\">\\n\",\n    \"\\n\",\n    \"Iris Versicolor\\n\",\n    \"<img src=\\\"figures/iris_versicolor.jpg\\\" width=\\\"50%\\\">\\n\",\n    \"\\n\",\n    \"Iris Virginica\\n\",\n    \"<img src=\\\"figures/iris_virginica.jpg\\\" width=\\\"50%\\\">\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"We represent each flower sample as one row in our data array, and the columns (features) represent the flower measurements in centimeters. For instance, we can represent this Iris dataset, consisting of 150 samples and 4 features, a 2-dimensional array or matrix $\\\\mathbb{R}^{150 \\\\times 4}$ in the following format:\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"$$\\\\mathbf{X} = \\\\begin{bmatrix}\\n\",\n    \"    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & \\\\dots  & x_{4}^{(1)} \\\\\\\\\\n\",\n    \"    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & \\\\dots  & x_{4}^{(2)} \\\\\\\\\\n\",\n    \"    \\\\vdots & \\\\vdots & \\\\vdots & \\\\ddots & \\\\vdots \\\\\\\\\\n\",\n    \"    x_{1}^{(150)} & x_{2}^{(150)} & x_{3}^{(150)} & \\\\dots  & x_{4}^{(150)}\\n\",\n    \"\\\\end{bmatrix}.\\n\",\n    \"$$\\n\",\n    \"\\n\",\n    \"(The superscript denotes the *i*th row, and the subscript denotes the *j*th feature, respectively.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There are two kinds of machine learning we will talk about today: ***supervised learning*** and ***unsupervised learning***.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Supervised Learning: Classification and regression\\n\",\n    \"\\n\",\n    \"In **Supervised Learning**, we have a dataset consisting of both input features and a desired output, such as in the spam / no-spam example.\\n\",\n    \"The task is to construct a model (or program) which is able to predict the desired output of an unseen object\\n\",\n    \"given the set of features.\\n\",\n    \"\\n\",\n    \"Some more complicated examples are:\\n\",\n    \"\\n\",\n    \"- Given a multicolor image of an object through a telescope, determine\\n\",\n    \"  whether that object is a star, a quasar, or a galaxy.\\n\",\n    \"- Given a photograph of a person, identify the person in the photo.\\n\",\n    \"- Given a list of movies a person has watched and their personal rating\\n\",\n    \"  of the movie, recommend a list of movies they would like.\\n\",\n    \"- Given a persons age, education and position, infer their salary\\n\",\n    \"\\n\",\n    \"What these tasks have in common is that there is one or more unknown\\n\",\n    \"quantities associated with the object which needs to be determined from other\\n\",\n    \"observed quantities.\\n\",\n    \"\\n\",\n    \"Supervised learning is further broken down into two categories, **classification** and **regression**:\\n\",\n    \"\\n\",\n    \"- **In classification, the label is discrete**, such as \\\"spam\\\" or \\\"no spam\\\". In other words, it provides a clear-cut distinction between categories. Furthermore, it is important to note that class labels are nominal, not ordinal variables. Nominal and ordinal variables are both subcategories of categorical variable. Ordinal variables imply an order, for example, T-shirt sizes \\\"XL > L > M > S\\\". 
On the contrary, nominal variables don't imply an order, for example, we (usually) can't assume \\\"orange > blue > green\\\".\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"- **In regression, the label is continuous**, that is a float output. For example,\\n\",\n    \"in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a\\n\",\n    \"classification problem: the label is from three distinct categories. On the other hand, we might\\n\",\n    \"wish to estimate the age of an object based on such observations: this would be a regression problem,\\n\",\n    \"because the label (age) is a continuous quantity.\\n\",\n    \"\\n\",\n    \"In supervised learning, there is always a distinction between a **training set** for which the desired outcome is given, and a **test set** for which the desired outcome needs to be inferred. The learning model fits the predictive model to the training set, and we use the test set to evaluate its generalization performance.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Unsupervised Learning\\n\",\n    \"\\n\",\n    \"In **Unsupervised Learning** there is no desired output associated with the data.\\n\",\n    \"Instead, we are interested in extracting some form of knowledge or model from the given data.\\n\",\n    \"In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself.\\n\",\n    \"Unsupervised learning is often harder to understand and to evaluate.\\n\",\n    \"\\n\",\n    \"Unsupervised learning comprises tasks such as *dimensionality reduction*, *clustering*, and\\n\",\n    \"*density estimation*. For example, in the iris data discussed above, we can used unsupervised\\n\",\n    \"methods to determine combinations of the measurements which best display the structure of the\\n\",\n    \"data. As we’ll see below, such a projection of the data can be used to visualize the\\n\",\n    \"four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:\\n\",\n    \"\\n\",\n    \"- Given detailed observations of distant galaxies, determine which features or combinations of\\n\",\n    \"  features summarize best the information.\\n\",\n    \"- Given a mixture of two sound sources (for example, a person talking over some music),\\n\",\n    \"  separate the two (this is called the [blind source separation](http://en.wikipedia.org/wiki/Blind_signal_separation) problem).\\n\",\n    \"- Given a video, isolate a moving object and categorize in relation to other moving objects which have been seen.\\n\",\n    \"- Given a large collection of news articles, find recurring topics inside these articles.\\n\",\n    \"- Given a collection of images, cluster similar images together (for example to group them when visualizing a collection)\\n\",\n    \"\\n\",\n    \"Sometimes the two may even be combined: e.g. 
unsupervised learning can be used to find useful\\n\",\n    \"features in heterogeneous data, and then these features can be used within a supervised\\n\",\n    \"framework.\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/02 Scientific Computing Tools in Python.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The use of watermark (above) is optional, and we use it to keep track of the changes while developing the tutorial material. (You can install this IPython extension via \\\"pip install watermark\\\". For more information, please see: https://github.com/rasbt/watermark).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"01.2 Jupyter Notebooks\\n\",\n    \"==================\\n\",\n    \"\\n\",\n    \"* You can run a cell by pressing ``[shift] + [Enter]`` or by pressing the \\\"play\\\" button in the menu.\\n\",\n    \"\\n\",\n    \"![](figures/ipython_run_cell.png)\\n\",\n    \"\\n\",\n    \"* You can get help on a function or object by pressing ``[shift] + [tab]`` after the opening parenthesis ``function(``\\n\",\n    \"\\n\",\n    \"![](figures/ipython_help-1.png)\\n\",\n    \"\\n\",\n    \"* You can also get help by executing ``function?``\\n\",\n    \"\\n\",\n    \"![](figures/ipython_help-2.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Numpy Arrays\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Manipulating `numpy` arrays is an important part of doing machine learning\\n\",\n    \"(or, really, any type of scientific computation) in python.  This will likely\\n\",\n    \"be a short review for most. 
In any case, let's quickly go through some of the most important features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"# Setting a random seed for reproducibility\\n\",\n    \"rnd = np.random.RandomState(seed=123)\\n\",\n    \"\\n\",\n    \"# Generating a random array\\n\",\n    \"X = rnd.uniform(low=0.0, high=1.0, size=(3, 5))  # a 3 x 5 array\\n\",\n    \"\\n\",\n    \"print(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"(Note that NumPy arrays use 0-indexing just like other data structures in Python.)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Accessing elements\\n\",\n    \"\\n\",\n    \"# get a single element \\n\",\n    \"# (here: an element in the first row and column)\\n\",\n    \"print(X[0, 0])\\n\",\n    \"\\n\",\n    \"# get a row \\n\",\n    \"# (here: 2nd row)\\n\",\n    \"print(X[1])\\n\",\n    \"\\n\",\n    \"# get a column\\n\",\n    \"# (here: 2nd column)\\n\",\n    \"print(X[:, 1])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Transposing an array\\n\",\n    \"print(X.T)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"$$\\\\begin{bmatrix}\\n\",\n    \"    1 & 2 & 3 & 4 \\\\\\\\\\n\",\n    \"    5 & 6 & 7 & 8\\n\",\n    \"\\\\end{bmatrix}^T\\n\",\n    \"= \\n\",\n    \"\\\\begin{bmatrix}\\n\",\n    \"    1 & 5 \\\\\\\\\\n\",\n    \"    2 & 6 \\\\\\\\\\n\",\n    \"    3 & 7 \\\\\\\\\\n\",\n    \"    4 & 8\\n\",\n    \"\\\\end{bmatrix}\\n\",\n    \"$$\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Creating a row vector\\n\",\n    \"# of evenly spaced numbers over a specified interval.\\n\",\n    \"y = np.linspace(0, 12, 5)\\n\",\n    \"print(y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Turning the row vector into a column vector\\n\",\n    \"print(y[:, np.newaxis])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Getting the shape or reshaping an array\\n\",\n    \"\\n\",\n    \"# Generating a random array\\n\",\n    \"rnd = np.random.RandomState(seed=123)\\n\",\n    \"X = rnd.uniform(low=0.0, high=1.0, size=(3, 5))  # a 3 x 5 array\\n\",\n    \"\\n\",\n    \"print(X.shape)\\n\",\n    \"print(X.reshape(5, 3))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Indexing by an array of integers (fancy indexing)\\n\",\n    \"indices = np.array([3, 1, 0])\\n\",\n    \"print(indices)\\n\",\n    \"X[:, indices]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There is much, much more to know, but these few operations are fundamental to what 
we'll\\n\",\n    \"do during this tutorial.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## SciPy Sparse Matrices\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We won't make very much use of these in this tutorial, but sparse matrices are very nice\\n\",\n    \"in some situations.  In some machine learning tasks, especially those associated\\n\",\n    \"with textual analysis, the data may be mostly zeros.  Storing all these zeros is very\\n\",\n    \"inefficient, and representing in a way that only contains the \\\"non-zero\\\" values can be much more efficient.  We can create and manipulate sparse matrices as follows:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from scipy import sparse\\n\",\n    \"\\n\",\n    \"# Create a random array with a lot of zeros\\n\",\n    \"rnd = np.random.RandomState(seed=123)\\n\",\n    \"\\n\",\n    \"X = rnd.uniform(low=0.0, high=1.0, size=(10, 5))\\n\",\n    \"print(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# set the majority of elements to zero\\n\",\n    \"X[X < 0.7] = 0\\n\",\n    \"print(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# turn X into a CSR (Compressed-Sparse-Row) matrix\\n\",\n    \"X_csr = sparse.csr_matrix(X)\\n\",\n    \"print(X_csr)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Converting the sparse matrix to a dense array\\n\",\n    \"print(X_csr.toarray())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"(You may have stumbled upon an alternative method for converting sparse to dense representations: `numpy.todense`; `toarray` returns a NumPy array, whereas `todense` returns a NumPy matrix. In this tutorial, we will be working with NumPy arrays, not matrices; the latter are not supported by scikit-learn.)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The CSR representation can be very efficient for computations, but it is not\\n\",\n    \"as good for adding elements.  
For that, the LIL (List-In-List) representation\\n\",\n    \"is better:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Create an empty LIL matrix and add some items\\n\",\n    \"X_lil = sparse.lil_matrix((5, 5))\\n\",\n    \"\\n\",\n    \"for i, j in np.random.randint(0, 5, (15, 2)):\\n\",\n    \"    X_lil[i, j] = i + j\\n\",\n    \"\\n\",\n    \"print(X_lil)\\n\",\n    \"print(type(X_lil))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_dense = X_lil.toarray()\\n\",\n    \"print(X_dense)\\n\",\n    \"print(type(X_dense))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Often, once an LIL matrix is created, it is useful to convert it to a CSR format\\n\",\n    \"(many scikit-learn algorithms require CSR or CSC format)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_csr = X_lil.tocsr()\\n\",\n    \"print(X_csr)\\n\",\n    \"print(type(X_csr))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The available sparse formats that can be useful for various problems are:\\n\",\n    \"\\n\",\n    \"- `CSR` (compressed sparse row)\\n\",\n    \"- `CSC` (compressed sparse column)\\n\",\n    \"- `BSR` (block sparse row)\\n\",\n    \"- `COO` (coordinate)\\n\",\n    \"- `DIA` (diagonal)\\n\",\n    \"- `DOK` (dictionary of keys)\\n\",\n    \"- `LIL` (list in list)\\n\",\n    \"\\n\",\n    \"The [``scipy.sparse``](http://docs.scipy.org/doc/scipy/reference/sparse.html) submodule also has a lot of functions for sparse matrices\\n\",\n    \"including linear algebra, sparse solvers, graph algorithms, and much more.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## matplotlib\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Another important part of machine learning is the visualization of data.  The most common\\n\",\n    \"tool for this in Python is [`matplotlib`](http://matplotlib.org).  
It is an extremely flexible package, and\\n\",\n    \"we will go over some basics here.\\n\",\n    \"\\n\",\n    \"Since we are using Jupyter notebooks, let us use one of IPython's convenient built-in \\\"[magic functions](https://ipython.org/ipython-doc/3/interactive/magics.html)\\\", the \\\"matoplotlib inline\\\" mode, which will draw the plots directly inside the notebook.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Plotting a line\\n\",\n    \"x = np.linspace(0, 10, 100)\\n\",\n    \"plt.plot(x, np.sin(x));\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Scatter-plot points\\n\",\n    \"x = np.random.normal(size=500)\\n\",\n    \"y = np.random.normal(size=500)\\n\",\n    \"plt.scatter(x, y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Showing images using imshow\\n\",\n    \"# - note that origin is at the top-left by default!\\n\",\n    \"\\n\",\n    \"x = np.linspace(1, 12, 100)\\n\",\n    \"y = x[:, np.newaxis]\\n\",\n    \"\\n\",\n    \"im = y * np.sin(x) * np.cos(y)\\n\",\n    \"print(im.shape)\\n\",\n    \"\\n\",\n    \"plt.imshow(im);\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Contour plots \\n\",\n    \"# - note that origin here is at the bottom-left by default!\\n\",\n    \"plt.contour(im);\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# 3D plotting\\n\",\n    \"from mpl_toolkits.mplot3d import Axes3D\\n\",\n    \"ax = plt.axes(projection='3d')\\n\",\n    \"xgrid, ygrid = np.meshgrid(x, y.ravel())\\n\",\n    \"ax.plot_surface(xgrid, ygrid, im, cmap=plt.cm.jet, cstride=2, rstride=2, linewidth=0);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There are many, many more plot types available.  
One useful way to explore these is by\\n\",\n    \"looking at the [matplotlib gallery](http://matplotlib.org/gallery.html).\\n\",\n    \"\\n\",\n    \"You can test these examples out easily in the notebook: simply copy the ``Source Code``\\n\",\n    \"link on each page, and put it in a notebook using the ``%load`` magic.\\n\",\n    \"For example:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load http://matplotlib.org/mpl_examples/pylab_examples/ellipse_collection.py\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\\n\",\n    \"from matplotlib.collections import EllipseCollection\\n\",\n    \"\\n\",\n    \"x = np.arange(10)\\n\",\n    \"y = np.arange(15)\\n\",\n    \"X, Y = np.meshgrid(x, y)\\n\",\n    \"\\n\",\n    \"XY = np.hstack((X.ravel()[:,np.newaxis], Y.ravel()[:,np.newaxis]))\\n\",\n    \"\\n\",\n    \"ww = X/10.0\\n\",\n    \"hh = Y/15.0\\n\",\n    \"aa = X*9\\n\",\n    \"\\n\",\n    \"fig, ax = plt.subplots()\\n\",\n    \"\\n\",\n    \"ec = EllipseCollection(ww, hh, aa, units='x', offsets=XY,\\n\",\n    \"                       transOffset=ax.transData)\\n\",\n    \"ec.set_array((X+Y).ravel())\\n\",\n    \"ax.add_collection(ec)\\n\",\n    \"ax.autoscale_view()\\n\",\n    \"ax.set_xlabel('X')\\n\",\n    \"ax.set_ylabel('y')\\n\",\n    \"cbar = plt.colorbar(ec)\\n\",\n    \"cbar.set_label('X+Y')\\n\",\n    \"plt.show()\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/03 Data Representation for Machine Learning.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The use of watermark (above) is optional, and we use it to keep track of the changes while developing the tutorial material. (You can install this IPython extension via \\\"pip install watermark\\\". For more information, please see: https://github.com/rasbt/watermark).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Representation and Visualization of Data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Machine learning is about fitting models to data; for that reason, we'll start by\\n\",\n    \"discussing how data can be represented in order to be understood by the computer.  Along\\n\",\n    \"with this, we'll build on our matplotlib examples from the previous section and show some\\n\",\n    \"examples of how to visualize data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Data in scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Data in scikit-learn, with very few exceptions, is assumed to be stored as a\\n\",\n    \"**two-dimensional array**, of shape `[n_samples, n_features]`. Many algorithms also accept ``scipy.sparse`` matrices of the same shape.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"- **n_samples:**   The number of samples: each sample is an item to process (e.g. classify).\\n\",\n    \"  A sample can be a document, a picture, a sound, a video, an astronomical object,\\n\",\n    \"  a row in database or CSV file,\\n\",\n    \"  or whatever you can describe with a fixed set of quantitative traits.\\n\",\n    \"- **n_features:**  The number of features or distinct traits that can be used to describe each\\n\",\n    \"  item in a quantitative manner.  Features are generally real-valued, but may be Boolean or\\n\",\n    \"  discrete-valued in some cases.\\n\",\n    \"\\n\",\n    \"The number of features must be fixed in advance. However it can be very high dimensional\\n\",\n    \"(e.g. millions of features) with most of them being \\\"zeros\\\" for a given sample. 
This is a case\\n\",\n    \"where `scipy.sparse` matrices can be useful, in that they are\\n\",\n    \"much more memory-efficient than NumPy arrays.\\n\",\n    \"\\n\",\n    \"As we recall from the previous section (or Jupyter notebook), we represent samples (data points or instances) as rows in the data array, and we store the corresponding features, the \\\"dimensions,\\\" as columns.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### A Simple Example: the Iris Dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As an example of a simple dataset, we're going to take a look at the iris data stored by scikit-learn.\\n\",\n    \"The data consists of measurements of three different iris flower species.  There are three different species of iris\\n\",\n    \"in this particular dataset as illustrated below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"source\": [\n    \"Iris Setosa\\n\",\n    \"<img src=\\\"figures/iris_setosa.jpg\\\" width=\\\"50%\\\">\\n\",\n    \"\\n\",\n    \"Iris Versicolor\\n\",\n    \"<img src=\\\"figures/iris_versicolor.jpg\\\" width=\\\"50%\\\">\\n\",\n    \"\\n\",\n    \"Iris Virginica\\n\",\n    \"<img src=\\\"figures/iris_virginica.jpg\\\" width=\\\"50%\\\">\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Quick Question:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Let's assume that we are interested in categorizing new observations; we want to predict whether unknown flowers are  Iris-Setosa, Iris-Versicolor, or Iris-Virginica flowers, respectively. Based on what we've discussed in the previous section, how would we construct such a dataset?***\\n\",\n    \"\\n\",\n    \"Remember: we need a 2D array of size `[n_samples x n_features]`.\\n\",\n    \"\\n\",\n    \"- What would the `n_samples` refer to?\\n\",\n    \"\\n\",\n    \"- What might the `n_features` refer to?\\n\",\n    \"\\n\",\n    \"Remember that there must be a **fixed** number of features for each sample, and feature\\n\",\n    \"number *j* must be a similar kind of quantity for each sample.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Loading the Iris Data with Scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For future experiments with machine learning algorithms, we recommend you to bookmark the [UCI machine learning repository](http://archive.ics.uci.edu/ml/), which hosts many of the commonly used datasets that are useful for benchmarking machine learning algorithms -- a very popular resource for machine learning practioners and researchers. Conveniently, some of these datasets are already included in scikit-learn so that we can skip the tedious parts of downloading, reading, parsing, and cleaning these text/CSV files. You can find a list of available datasets in scikit-learn at: http://scikit-learn.org/stable/datasets/#toy-datasets.\\n\",\n    \"\\n\",\n    \"For example, scikit-learn has a very straightforward set of data on these iris species.  The data consist of\\n\",\n    \"the following:\\n\",\n    \"\\n\",\n    \"- Features in the Iris dataset:\\n\",\n    \"\\n\",\n    \"  1. sepal length in cm\\n\",\n    \"  2. sepal width in cm\\n\",\n    \"  3. 
petal length in cm\\n\",\n    \"  4. petal width in cm\\n\",\n    \"\\n\",\n    \"- Target classes to predict:\\n\",\n    \"\\n\",\n    \"  1. Iris Setosa\\n\",\n    \"  2. Iris Versicolour\\n\",\n    \"  3. Iris Virginica\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/petal_sepal.jpg\\\" alt=\\\"Sepal\\\" style=\\\"width: 50%;\\\"/>\\n\",\n    \"\\n\",\n    \"(Image: \\\"Petal-sepal\\\". Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg#/media/File:Petal-sepal.jpg)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_iris\\n\",\n    \"iris = load_iris()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The resulting dataset is a ``Bunch`` object: you can see what's available using\\n\",\n    \"the method ``keys()``:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"iris.keys()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The features of each sample flower are stored in the ``data`` attribute of the dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"n_samples, n_features = iris.data.shape\\n\",\n    \"print('Number of samples:', n_samples)\\n\",\n    \"print('Number of features:', n_features)\\n\",\n    \"# the sepal length, sepal width, petal length and petal width of the first sample (first flower)\\n\",\n    \"print(iris.data[0])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The information about the class of each sample is stored in the ``target`` attribute of the dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(iris.data.shape)\\n\",\n    \"print(iris.target.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(iris.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"np.bincount(iris.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Using the NumPy's bincount function (above), we can see that the classes are distributed uniformly in this dataset - there are 50 flowers from each species, where\\n\",\n    \"\\n\",\n    \"- class 0: Iris-Setosa\\n\",\n    \"- class 1: Iris-Versicolor\\n\",\n    \"- class 2: Iris-Virginica\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"These class names are stored in the last attribute, namely 
``target_names``:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(iris.target_names)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This data is four dimensional, but we can visualize one or two of the dimensions\\n\",\n    \"at a time using a simple histogram or scatter-plot.  Again, we'll start by enabling\\n\",\n    \"matplotlib inline mode:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"x_index = 3\\n\",\n    \"colors = ['blue', 'red', 'green']\\n\",\n    \"\\n\",\n    \"for label, color in zip(range(len(iris.target_names)), colors):\\n\",\n    \"    plt.hist(iris.data[iris.target==label, x_index], \\n\",\n    \"             label=iris.target_names[label],\\n\",\n    \"             color=color)\\n\",\n    \"\\n\",\n    \"plt.xlabel(iris.feature_names[x_index])\\n\",\n    \"plt.legend(loc='upper right')\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"x_index = 3\\n\",\n    \"y_index = 0\\n\",\n    \"\\n\",\n    \"colors = ['blue', 'red', 'green']\\n\",\n    \"\\n\",\n    \"for label, color in zip(range(len(iris.target_names)), colors):\\n\",\n    \"    plt.scatter(iris.data[iris.target==label, x_index], \\n\",\n    \"                iris.data[iris.target==label, y_index],\\n\",\n    \"                label=iris.target_names[label],\\n\",\n    \"                c=color)\\n\",\n    \"\\n\",\n    \"plt.xlabel(iris.feature_names[x_index])\\n\",\n    \"plt.ylabel(iris.feature_names[y_index])\\n\",\n    \"plt.legend(loc='upper left')\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Quick Exercise:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Change** `x_index` **and** `y_index` **in the above script\\n\",\n    \"and find a combination of two parameters\\n\",\n    \"which maximally separate the three classes.**\\n\",\n    \"\\n\",\n    \"This exercise is a preview of **dimensionality reduction**, which we'll see later.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### An aside: scatterplot matrices\\n\",\n    \"\\n\",\n    \"Instead of looking at the data one plot at a time, a common tool that analysts use is called the **scatterplot matrix**.\\n\",\n    \"\\n\",\n    \"Scatterplot matrices show scatter plots between all features in the data set, as well as histograms to show the distribution of each feature.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"    \\n\",\n    \"iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)\\n\",\n    \"pd.plotting.scatter_matrix(iris_df, figsize=(8, 8));\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Other Available Data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"[Scikit-learn makes available a host of datasets for testing learning algorithms](http://scikit-learn.org/stable/datasets/#dataset-loading-utilities).\\n\",\n    \"They come in three flavors:\\n\",\n    \"\\n\",\n    \"- **Packaged Data:** these small datasets are packaged with the scikit-learn installation,\\n\",\n    \"  and can be downloaded using the tools in ``sklearn.datasets.load_*``\\n\",\n    \"- **Downloadable Data:** these larger datasets are available for download, and scikit-learn\\n\",\n    \"  includes tools which streamline this process.  These tools can be found in\\n\",\n    \"  ``sklearn.datasets.fetch_*``\\n\",\n    \"- **Generated Data:** there are several datasets which are generated from models based on a\\n\",\n    \"  random seed.  These are available in the ``sklearn.datasets.make_*``\\n\",\n    \"\\n\",\n    \"You can explore the available dataset loaders, fetchers, and generators using IPython's\\n\",\n    \"tab-completion functionality.  After importing the ``datasets`` submodule from ``sklearn``,\\n\",\n    \"type\\n\",\n    \"\\n\",\n    \"    datasets.load_<TAB>\\n\",\n    \"\\n\",\n    \"or\\n\",\n    \"\\n\",\n    \"    datasets.fetch_<TAB>\\n\",\n    \"\\n\",\n    \"or\\n\",\n    \"\\n\",\n    \"    datasets.make_<TAB>\\n\",\n    \"\\n\",\n    \"to see a list of available functions.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn import datasets\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The data downloaded using the ``fetch_`` scripts are stored locally,\\n\",\n    \"within a subdirectory of your home directory.\\n\",\n    \"You can use the following to determine where it is:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import get_data_home\\n\",\n    \"get_data_home()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Be warned: many of these datasets are quite large, and can take a long time to download!\\n\",\n    \"\\n\",\n    \"If you start a download within the IPython notebook\\n\",\n    \"and you want to kill it, you can use ipython's \\\"kernel interrupt\\\" feature, available in the menu or using\\n\",\n    \"the shortcut ``Ctrl-m i``.\\n\",\n    \"\\n\",\n    \"You can press ``Ctrl-m h`` for a list of all ``ipython`` keyboard shortcuts.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Loading Digits Data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we'll take a look at another dataset, one where we have to put a bit\\n\",\n    \"more thought into how to represent the data.  
We can explore the data in\\n\",\n    \"a similar manner as above:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_digits\\n\",\n    \"digits = load_digits()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"digits.keys()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"n_samples, n_features = digits.data.shape\\n\",\n    \"print((n_samples, n_features))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(digits.data[0])\\n\",\n    \"print(digits.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The target here is just the digit represented by the data.  The data is an array of\\n\",\n    \"length 64... but what does this data mean?\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There's a clue in the fact that we have two versions of the data array:\\n\",\n    \"``data`` and ``images``.  Let's take a look at them:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(digits.data.shape)\\n\",\n    \"print(digits.images.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can see that they're related by a simple reshaping:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"print(np.all(digits.images.reshape((1797, 64)) == digits.data))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's visualize the data.  It's little bit more involved than the simple scatter-plot\\n\",\n    \"we used above, but we can do it rather quickly.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# set up the figure\\n\",\n    \"fig = plt.figure(figsize=(6, 6))  # figure size in inches\\n\",\n    \"fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\\n\",\n    \"\\n\",\n    \"# plot the digits: each image is 8x8 pixels\\n\",\n    \"for i in range(64):\\n\",\n    \"    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\\n\",\n    \"    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')\\n\",\n    \"    \\n\",\n    \"    # label the image with the target value\\n\",\n    \"    ax.text(0, 7, str(digits.target[i]))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We see now what the features mean.  
Each feature is a real-valued quantity representing the\\n\",\n    \"darkness of a pixel in an 8x8 image of a hand-written digit.\\n\",\n    \"\\n\",\n    \"Even though each sample has data that is inherently two-dimensional, the data matrix flattens\\n\",\n    \"this 2D data into a **single vector**, which can be contained in one **row** of the data matrix.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Generated Data: the S-Curve\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"One dataset often used as an example of a simple nonlinear dataset is the S-cure:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_s_curve\\n\",\n    \"data, colors = make_s_curve(n_samples=1000)\\n\",\n    \"print(data.shape)\\n\",\n    \"print(colors.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from mpl_toolkits.mplot3d import Axes3D\\n\",\n    \"ax = plt.axes(projection='3d')\\n\",\n    \"ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=colors)\\n\",\n    \"ax.view_init(10, -60)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This example is typically used with an unsupervised learning method called Locally\\n\",\n    \"Linear Embedding.  We'll explore unsupervised learning in detail later in the tutorial.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Exercise: working with the faces dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here we'll take a moment for you to explore the datasets yourself.\\n\",\n    \"Later on we'll be using the Olivetti faces dataset.\\n\",\n    \"Take a moment to fetch the data (about 1.4MB), and visualize the faces.\\n\",\n    \"You can copy the code used to visualize the digits above, and modify it for this data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import fetch_olivetti_faces\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# fetch the faces data\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Use a script like above to plot the faces image data.\\n\",\n    \"# hint: plt.cm.bone is a good colormap for this data\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Solution:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load solutions/03A_faces_plot.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n 
   \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/04 Training and Testing Data.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Training and Testing Data\\n\",\n    \"=====================================\\n\",\n    \"\\n\",\n    \"To evaluate how well our supervised models generalize, we can split our data into a training and a test set:\\n\",\n    \"\\n\",\n    \"<img src=\\\"figures/train_test_split_matrix.svg\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_iris\\n\",\n    \"from sklearn.neighbors import KNeighborsClassifier\\n\",\n    \"\\n\",\n    \"iris = load_iris()\\n\",\n    \"X, y = iris.data, iris.target\\n\",\n    \"\\n\",\n    \"classifier = KNeighborsClassifier()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Thinking about how machine learning is normally performed, the idea of a train/test split makes sense. Real world systems train on the data they have, and as other data comes in (from customers, sensors, or other sources) the classifier that was trained must predict on fundamentally *new* data. We can simulate this during training using a train/test split - the test data is a simulation of \\\"future data\\\" which will come into the system during production. \\n\",\n    \"\\n\",\n    \"Specifically for iris, the 150 labels in iris are sorted, which means that if we split the data using a proportional split, this will result in fudamentally altered class distributions. For instance, if we'd perform a common 2/3 training data and 1/3 test data split, our training dataset will only consists of flower classes 0 and 1 (Setosa and Versicolor), and our test set will only contain samples with class label 2 (Virginica flowers).\\n\",\n    \"\\n\",\n    \"Under the assumption that all samples are independent of each other (in contrast time series data), we want to **randomly shuffle the dataset before we split the dataset** as illustrated above.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we need to split the data into training and testing. Luckily, this is a common pattern in machine learning and scikit-learn has a pre-built function to split data into training and testing sets for you. Here, we use 50% of the data as training, and 50% testing. 80% and 20% is another common split, but there are no hard and fast rules. 
The most important thing is to fairly evaluate your system on data it *has not* seen during training!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"train_X, test_X, train_y, test_y = train_test_split(X, y, \\n\",\n    \"                                                    train_size=0.5, \\n\",\n    \"                                                    random_state=123)\\n\",\n    \"print(\\\"Labels for training and testing data\\\")\\n\",\n    \"print(train_y)\\n\",\n    \"print(test_y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Tip: Stratified Split**\\n\",\n    \"\\n\",\n    \"Especially for relatively small datasets, it's better to stratify the split. Stratification means that we maintain the original class proportion of the dataset in the test and training sets. For example, after we randomly split the dataset as shown in the previous code example, we have the following class proportions in percent:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print('All:', np.bincount(y) / float(len(y)) * 100.0)\\n\",\n    \"print('Training:', np.bincount(train_y) / float(len(train_y)) * 100.0)\\n\",\n    \"print('Test:', np.bincount(test_y) / float(len(test_y)) * 100.0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"So, in order to stratify the split, we can pass the label array as an additional option to the `train_test_split` function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"train_X, test_X, train_y, test_y = train_test_split(X, y, \\n\",\n    \"                                                    train_size=0.5, \\n\",\n    \"                                                    random_state=123,\\n\",\n    \"                                                    stratify=y)\\n\",\n    \"\\n\",\n    \"print('All:', np.bincount(y) / float(len(y)) * 100.0)\\n\",\n    \"print('Training:', np.bincount(train_y) / float(len(train_y)) * 100.0)\\n\",\n    \"print('Test:', np.bincount(test_y) / float(len(test_y)) * 100.0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"By evaluating our classifier performance on data that has been seen during training, we could get false confidence in the predictive power of our model. 
In the worst case, it may simply memorize the training samples but completely fail to classify new, similar samples -- we really don't want to put such a system into production!\\n\",\n    \"\\n\",\n    \"Instead of using the same dataset for training and testing (this is called \\\"resubstitution evaluation\\\"), it is much better to use a train/test split in order to estimate how well your trained model is doing on new data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"classifier.fit(train_X, train_y)\\n\",\n    \"pred_y = classifier.predict(test_X)\\n\",\n    \"\\n\",\n    \"print(\\\"Fraction Correct [Accuracy]:\\\")\\n\",\n    \"print(np.sum(pred_y == test_y) / float(len(test_y)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also visualize the correct and failed predictions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print('Samples correctly classified:')\\n\",\n    \"correct_idx = np.where(pred_y == test_y)[0]\\n\",\n    \"print(correct_idx)\\n\",\n    \"\\n\",\n    \"print('\\\\nSamples incorrectly classified:')\\n\",\n    \"incorrect_idx = np.where(pred_y != test_y)[0]\\n\",\n    \"print(incorrect_idx)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# Plot two dimensions\\n\",\n    \"\\n\",\n    \"colors = [\\\"darkblue\\\", \\\"darkgreen\\\", \\\"gray\\\"]\\n\",\n    \"\\n\",\n    \"for n, color in enumerate(colors):\\n\",\n    \"    idx = np.where(test_y == n)[0]\\n\",\n    \"    plt.scatter(test_X[idx, 1], test_X[idx, 2], color=color, label=\\\"Class %s\\\" % str(n))\\n\",\n    \"\\n\",\n    \"plt.scatter(test_X[incorrect_idx, 1], test_X[incorrect_idx, 2], color=\\\"darkred\\\")\\n\",\n    \"\\n\",\n    \"plt.xlabel('sepal width [cm]')\\n\",\n    \"plt.ylabel('petal length [cm]')\\n\",\n    \"plt.legend(loc=3)\\n\",\n    \"plt.title(\\\"Iris Classification results\\\")\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can see that the errors occur in the area where green (class 1) and gray (class 2) overlap. This gives us insight about what features to add - any feature which helps separate class 1 and class 2 should improve classifier performance.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Print the true labels of 3 wrong predictions and modify the scatterplot code, which we used above, to visualize and distinguish these three samples with different markers in the 2D scatterplot. 
Can you explain why our classifier made these wrong predictions?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load solutions/04_wrong-predictions.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/05 Supervised Learning - Classification.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,pillow,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Supervised Learning Part 1 -- Classification\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To visualize the workings of machine learning algorithms, it is often helpful to study two-dimensional or one-dimensional data, that is data with only one or two features. While in practice, datasets usually have many more features, it is hard to plot high-dimensional data in on two-dimensional screens.\\n\",\n    \"\\n\",\n    \"We will illustrate some very simple examples before we move on to more \\\"real world\\\" data sets.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"\\n\",\n    \"First, we will look at a two class classification problem in two dimensions. We use the synthetic data generated by the ``make_blobs`` function.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_blobs\\n\",\n    \"\\n\",\n    \"X, y = make_blobs(centers=2, random_state=0)\\n\",\n    \"\\n\",\n    \"print('X ~ n_samples x n_features:', X.shape)\\n\",\n    \"print('y ~ n_samples:', y.shape)\\n\",\n    \"\\n\",\n    \"print('\\\\nFirst 5 samples:\\\\n', X[:5, :])\\n\",\n    \"print('\\\\nFirst 5 labels:', y[:5])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As the data is two-dimensional, we can plot each sample as a point in a two-dimensional coordinate system, with the first feature being the x-axis and the second feature being the y-axis.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.scatter(X[y == 0, 0], X[y == 0, 1], \\n\",\n    \"            c='blue', s=40, label='0')\\n\",\n    \"plt.scatter(X[y == 1, 0], X[y == 1, 1], \\n\",\n    \"            c='red', s=40, label='1', marker='s')\\n\",\n    \"\\n\",\n    \"plt.xlabel('first feature')\\n\",\n    \"plt.ylabel('second feature')\\n\",\n    \"plt.legend(loc='upper right');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Classification is a supervised task, and since we are interested in its performance on unseen data, we split our data into two parts:\\n\",\n    \"\\n\",\n    \"1. a training set that the learning algorithm uses to fit the model\\n\",\n    \"2. 
a test set to evaluate the generalization performance of the model\\n\",\n    \"\\n\",\n    \"The ``train_test_split`` function from the ``model_selection`` module does that for us -- we will use it to split a dataset into 75% training data and 25% test data.\\n\",\n    \"\\n\",\n    \"<img src=\\\"figures/train_test_split_matrix.svg\\\" width=\\\"100%\\\">\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y,\\n\",\n    \"                                                    test_size=0.25,\\n\",\n    \"                                                    random_state=1234,\\n\",\n    \"                                                    stratify=y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### The scikit-learn estimator API\\n\",\n    \"<img src=\\\"figures/supervised_workflow.svg\\\" width=\\\"100%\\\">\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Every algorithm is exposed in scikit-learn via an ``Estimator`` object. (All models in scikit-learn have a very consistent interface). For instance, we first import the logistic regression class.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import LogisticRegression\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we instantiate the estimator object.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"classifier = LogisticRegression()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_train.shape\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y_train.shape\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To build the model from our data, that is, to learn how to classify new points, we call the ``fit`` function with the training data, and the corresponding training labels (the desired output for the training data point):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"classifier.fit(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"(Some estimator methods such as `fit` return `self` by default. Thus, after executing the code snippet above, you will see the default parameters of this particular instance of `LogisticRegression`. 
Another way of retrieving the estimator's initialization parameters is to execute `classifier.get_params()`, which returns a parameter dictionary.)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can then apply the model to unseen data and use the model to predict the estimated outcome using the ``predict`` method:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"prediction = classifier.predict(X_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can compare these against the true labels:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(prediction)\\n\",\n    \"print(y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct. This is called **accuracy**:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"np.mean(prediction == y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There is also a convenience function, ``score``, that all scikit-learn classifiers have, which computes this directly from the test data:\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"classifier.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"classifier.score(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"LogisticRegression is a so-called linear model,\\n\",\n    \"that means it will create a decision boundary that is linear in the input space. In 2d, this simply means it finds a line to separate the blue from the red:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_2d_separator\\n\",\n    \"\\n\",\n    \"plt.scatter(X[y == 0, 0], X[y == 0, 1], \\n\",\n    \"            c='blue', s=40, label='0')\\n\",\n    \"plt.scatter(X[y == 1, 0], X[y == 1, 1], \\n\",\n    \"            c='red', s=40, label='1', marker='s')\\n\",\n    \"\\n\",\n    \"plt.xlabel(\\\"first feature\\\")\\n\",\n    \"plt.ylabel(\\\"second feature\\\")\\n\",\n    \"plot_2d_separator(classifier, X)\\n\",\n    \"plt.legend(loc='upper right');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Estimated parameters**: All the estimated model parameters are attributes of the estimator object ending with an underscore. 
Here, these are the coefficients and the offset of the line:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(classifier.coef_)\\n\",\n    \"print(classifier.intercept_)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Another classifier: K Nearest Neighbors\\n\",\n    \"------------------------------------------------\\n\",\n    \"Another popular and easy to understand classifier is K nearest neighbors (kNN).  It has one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.\\n\",\n    \"\\n\",\n    \"The interface is exactly the same as for ``LogisticRegression`` above.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.neighbors import KNeighborsClassifier\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"knn = KNeighborsClassifier(n_neighbors=1)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We fit the model with our training data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"knn.fit(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.scatter(X[y == 0, 0], X[y == 0, 1], \\n\",\n    \"            c='blue', s=40, label='0')\\n\",\n    \"plt.scatter(X[y == 1, 0], X[y == 1, 1], \\n\",\n    \"            c='red', s=40, label='1', marker='s')\\n\",\n    \"\\n\",\n    \"plt.xlabel(\\\"first feature\\\")\\n\",\n    \"plt.ylabel(\\\"second feature\\\")\\n\",\n    \"plot_2d_separator(knn, X)\\n\",\n    \"plt.legend(loc='upper right');\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"knn.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Exercise\\n\",\n    \"=========\\n\",\n    \"Apply the KNeighborsClassifier to the ``iris`` dataset. 
Play with different values of the ``n_neighbors`` and observe how training and test score change.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load solutions/05A_knn_with_diff_k.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/06 Supervised Learning - Regression.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Supervised Learning Part 2 -- Regression Analysis\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In regression we are trying to predict a continuous output variable -- in contrast to the nominal variables we were predicting in the previous classification examples. \\n\",\n    \"\\n\",\n    \"Let's start with a simple toy example with one feature dimension (explanatory variable) and one target variable. We will create a dataset out of a sinus curve with some noise:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"x = np.linspace(-3, 3, 100)\\n\",\n    \"print(x)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"rng = np.random.RandomState(42)\\n\",\n    \"y = np.sin(4 * x) + x + rng.uniform(size=len(x))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.plot(x, y, 'o');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Linear Regression\\n\",\n    \"=================\\n\",\n    \"\\n\",\n    \"The first model that we will introduce is the so-called simple linear regression. Here, we want to fit a line to the data, which \\n\",\n    \"\\n\",\n    \"One of the simplest models again is a linear one, that simply tries to predict the data as lying on a line. One way to find such a line is `LinearRegression` (also known as [*Ordinary Least Squares (OLS)*](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression).\\n\",\n    \"The interface for LinearRegression is exactly the same as for the classifiers before, only that ``y`` now contains float values, instead of classes.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we remember, the scikit-learn API requires us to provide the target variable (`y`) as a 1-dimensional array; scikit-learn's API expects the samples (`X`) in form a 2-dimensional array -- even though it may only consist of 1 feature. 
Thus, let us convert the 1-dimensional `x` NumPy array into an `X` array with 2 axes:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print('Before: ', x.shape)\\n\",\n    \"X = x[:, np.newaxis]\\n\",\n    \"print('After: ', X.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Again, we start by splitting our dataset into a training (75%) and a test set (25%):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we use the learning algorithm implemented in `LinearRegression` to **fit a regression model to the training data**:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import LinearRegression\\n\",\n    \"\\n\",\n    \"regressor = LinearRegression()\\n\",\n    \"regressor.fit(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"After fitting to the training data, we parameterized a linear regression model with the following values.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print('Weight coefficients: ', regressor.coef_)\\n\",\n    \"print('y-axis intercept: ', regressor.intercept_)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Since our regression model is a linear one, the relationship between the target variable (y) and the feature variable (x) is defined as \\n\",\n    \"\\n\",\n    \"$$y = \\\\text{weight} \\\\times x + \\\\text{intercept}$$.\\n\",\n    \"\\n\",\n    \"Plugging the min and max values into this equation, we can plot the regression fit to our training data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"min_pt = X.min() * regressor.coef_[0] + regressor.intercept_\\n\",\n    \"max_pt = X.max() * regressor.coef_[0] + regressor.intercept_\\n\",\n    \"\\n\",\n    \"plt.plot([X.min(), X.max()], [min_pt, max_pt])\\n\",\n    \"plt.plot(X_train, y_train, 'o');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Similar to the estimators for classification in the previous notebook, we use the `predict` method to predict the target variable. 
And we expect these predicted values to fall onto the line that we plotted previously:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y_pred_train = regressor.predict(X_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.plot(X_train, y_train, 'o', label=\\\"data\\\")\\n\",\n    \"plt.plot(X_train, y_pred_train, 'o', label=\\\"prediction\\\")\\n\",\n    \"plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')\\n\",\n    \"plt.legend(loc='best')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can see in the plot above, the line is able to capture the general slope of the data, but not many details.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, let's try the test set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y_pred_test = regressor.predict(X_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.plot(X_test, y_test, 'o', label=\\\"data\\\")\\n\",\n    \"plt.plot(X_test, y_pred_test, 'o', label=\\\"prediction\\\")\\n\",\n    \"plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')\\n\",\n    \"plt.legend(loc='best');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Again, scikit-learn provides an easy way to evaluate the prediction quantitatively using the ``score`` method. For regression tasks, this is the R<sup>2</sup> score. Another popular way would be the Mean Squared Error (MSE). As its name implies, the MSE is simply the average squared difference over the predicted and actual target values\\n\",\n    \"\\n\",\n    \"$$MSE = \\\\frac{1}{n} \\\\sum^{n}_{i=1} (\\\\text{predicted}_i - \\\\text{true}_i)^2$$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"regressor.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"KNeighborsRegression\\n\",\n    \"=======================\\n\",\n    \"As for classification, we can also use a neighbor based method for regression. We can simply take the output of the nearest point, or we could average several nearest points. 
This method is less popular for regression than for classification, but still a good baseline.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.neighbors import KNeighborsRegressor\\n\",\n    \"kneighbor_regression = KNeighborsRegressor(n_neighbors=1)\\n\",\n    \"kneighbor_regression.fit(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Again, let us look at the behavior on training and test set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y_pred_train = kneighbor_regression.predict(X_train)\\n\",\n    \"\\n\",\n    \"plt.plot(X_train, y_train, 'o', label=\\\"data\\\", markersize=10)\\n\",\n    \"plt.plot(X_train, y_pred_train, 's', label=\\\"prediction\\\", markersize=4)\\n\",\n    \"plt.legend(loc='best');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"On the training set, we do a perfect job: each point is its own nearest neighbor!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y_pred_test = kneighbor_regression.predict(X_test)\\n\",\n    \"\\n\",\n    \"plt.plot(X_test, y_test, 'o', label=\\\"data\\\", markersize=8)\\n\",\n    \"plt.plot(X_test, y_pred_test, 's', label=\\\"prediction\\\", markersize=4)\\n\",\n    \"plt.legend(loc='best');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"On the test set, we also do a better job of capturing the variation, but our estimates look much messier than before.\\n\",\n    \"Let us look at the R<sup>2</sup> score:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"kneighbor_regression.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Much better than before! Here, the linear model was not a good fit for our problem; it was lacking in complexity and thus under-fit our data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Exercise\\n\",\n    \"=========\\n\",\n    \"Compare the KNeighborsRegressor and LinearRegression on the boston housing dataset. You can load the dataset using ``sklearn.datasets.load_boston``. You can learn about the dataset by reading the ``DESCR`` attribute.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load solutions/06A_knn_vs_linreg.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/07 Unsupervised Learning - Transformations and Dimensionality Reduction.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Unsupervised Learning Part 1 -- Transformation\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Many instances of unsupervised learning, such as dimensionality reduction, manifold learning, and feature extraction, find a new representation of the input data without any additional input. (In contrast to supervised learning, usnupervised algorithms don't require or consider target variables like in the previous classification and regression examples). \\n\",\n    \"\\n\",\n    \"<img src=\\\"figures/unsupervised_workflow.svg\\\" width=\\\"100%\\\">\\n\",\n    \"\\n\",\n    \"A very basic example is the rescaling of our data, which is a requirement for many machine learning algorithms as they are not scale-invariant -- rescaling falls into the category of data pre-processing and can barely be called *learning*. There exist many different rescaling technques, and in the following example, we will take a look at a particular method that is commonly called \\\"standardization.\\\" Here, we will recale the data so that each feature is centered at zero (mean = 0) with unit variance (standard deviation = 0).\\n\",\n    \"\\n\",\n    \"For example, if we have a 1D dataset with the values [1, 2, 3, 4, 5], the standardized values are\\n\",\n    \"\\n\",\n    \"- 1 -> -1.41\\n\",\n    \"- 2 -> -0.71\\n\",\n    \"- 3 -> 0.0\\n\",\n    \"- 4 -> 0.71\\n\",\n    \"- 5 -> 1.41\\n\",\n    \"\\n\",\n    \"computed via the equation $x_{standardized} = \\\\frac{x - \\\\mu_x}{\\\\sigma_x}$,\\n\",\n    \"where $\\\\mu$ is the sample mean, and $\\\\sigma$ the standard deviation, respectively.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"ary = np.array([1, 2, 3, 4, 5])\\n\",\n    \"ary_standardized = (ary - ary.mean()) / ary.std()\\n\",\n    \"ary_standardized\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Although standardization is a most basic preprocessing procedure -- as we've seen in the code snipped above -- scikit-learn implements a `StandardScaler` class for this computation. And in later sections, we will see why and when the scikit-learn interface comes in handy over the code snippet we executed above.  
\\n\",\n    \"\\n\",\n    \"Applying such a preprocessing has a very similar interface to the supervised learning algorithms we saw so far.\\n\",\n    \"To get some more practice with scikit-learn's \\\"Transformer\\\" interface, let's start by loading the iris dataset and rescale it:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_iris\\n\",\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"iris = load_iris()\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)\\n\",\n    \"print(X_train.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The iris dataset is not \\\"centered\\\" that is it has non-zero mean and the standard deviation is different for each component:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"mean : %s \\\" % X_train.mean(axis=0))\\n\",\n    \"print(\\\"standard deviation : %s \\\" % X_train.std(axis=0))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To use a preprocessing method, we first import the estimator, here StandardScaler and instantiate it:\\n\",\n    \"    \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.preprocessing import StandardScaler\\n\",\n    \"scaler = StandardScaler()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As with the classification and regression algorithms, we call ``fit`` to learn the model from the data. As this is an unsupervised model, we only pass ``X``, not ``y``. 
This simply estimates mean and standard deviation.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"scaler.fit(X_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we can rescale our data by applying the ``transform`` (not ``predict``) method:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_train_scaled = scaler.transform(X_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"``X_train_scaled`` has the same number of samples and features, but the mean was subtracted and all features were scaled to have unit standard deviation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(X_train_scaled.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(\\\"mean : %s \\\" % X_train_scaled.mean(axis=0))\\n\",\n    \"print(\\\"standard deviation : %s \\\" % X_train_scaled.std(axis=0))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To summarize: Via the `fit` method, the estimator is fitted to the data we provide. In this step, the estimator estimates the parameters from the data (here: mean and standard deviation). Then, if we `transform` data, these parameters are used to transform a dataset. (Please note that the transform method does not update these parameters).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"It's important to note that the same transformation is applied to the training and the test set. That has the consequence that usually the mean of the test data is not zero after scaling:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_test_scaled = scaler.transform(X_test)\\n\",\n    \"print(\\\"mean test data: %s\\\" % X_test_scaled.mean(axis=0))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"It is important for the training and test data to be transformed in exactly the same way, for the following processing steps to make sense of the data, as is illustrated in the figure below:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_relative_scaling\\n\",\n    \"plot_relative_scaling()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There are several common ways to scale the data. 
The most common one is the ``StandardScaler`` we just introduced, but rescaling the data to a fixed minimum and maximum value with ``MinMaxScaler`` (usually between 0 and 1), or using more robust statistics like the median and quantiles instead of the mean and standard deviation (with ``RobustScaler``), are also useful.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_scaling\\n\",\n    \"plot_scaling()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Principal Component Analysis\\n\",\n    \"============================\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"An unsupervised transformation that is somewhat more interesting is Principal Component Analysis (PCA).\\n\",\n    \"It is a technique to reduce the dimensionality of the data, by creating a linear projection.\\n\",\n    \"That is, we find new features to represent the data that are a linear combination of the old features (i.e. we rotate it). Thus, we can think of PCA as a projection of our data onto a *new* feature space.\\n\",\n    \"\\n\",\n    \"The way PCA finds these new directions is by looking for the directions of maximum variance.\\n\",\n    \"Usually, only a few components that explain most of the variance in the data are kept. Here, the premise is to reduce the size (dimensionality) of a dataset while capturing most of its information. There are many reasons why dimensionality reduction can be useful: It can reduce the computational cost when running learning algorithms, decrease the storage space, and may help with the so-called \\\"curse of dimensionality,\\\" which we will discuss in greater detail later.\\n\",\n    \"\\n\",\n    \"To illustrate what such a rotation might look like, we first show it on two-dimensional data and keep both principal components. Here is an illustration:\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_pca_illustration\\n\",\n    \"plot_pca_illustration()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now let's go through all the steps in more detail:\\n\",\n    \"We create a Gaussian blob that is rotated:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"rnd = np.random.RandomState(5)\\n\",\n    \"X_ = rnd.normal(size=(300, 2))\\n\",\n    \"X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2)\\n\",\n    \"y = X_[:, 0] > 0\\n\",\n    \"plt.scatter(X_blob[:, 0], X_blob[:, 1], c=y, linewidths=0, s=30)\\n\",\n    \"plt.xlabel(\\\"feature 1\\\")\\n\",\n    \"plt.ylabel(\\\"feature 2\\\");\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As always, we instantiate our PCA model. 
By default all directions are kept.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.decomposition import PCA\\n\",\n    \"pca = PCA()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then we fit the PCA model with our data. As PCA is an unsupervised algorithm, there is no output ``y``.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"pca.fit(X_blob)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then we can transform the data to project it onto the principal components:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_pca = pca.transform(X_blob)\\n\",\n    \"\\n\",\n    \"plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, linewidths=0, s=30)\\n\",\n    \"plt.xlabel(\\\"first principal component\\\")\\n\",\n    \"plt.ylabel(\\\"second principal component\\\");\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"On the left of the plot you can see the points that were on the top right before. PCA found the first component to be along the diagonal, and the second to be perpendicular to it. As PCA finds a rotation, the principal components are always at right angles (\\\"orthogonal\\\") to each other.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Dimensionality Reduction for Visualization with PCA\\n\",\n    \"-------------------------------------------------------------\\n\",\n    \"Consider the digits dataset. It cannot be visualized in a single 2D plot, as it has 64 features. We are going to extract 2 dimensions to visualize it in, following the scikit-learn example [here](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import digits_plot\\n\",\n    \"\\n\",\n    \"digits_plot()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that this projection was determined *without* any information about the\\n\",\n    \"labels (represented by the colors): this is the sense in which the learning\\n\",\n    \"is **unsupervised**.  
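The plotting helper above hides the details; as a minimal sketch (assuming two components are enough for a picture), a comparable projection can be computed directly with ``PCA``:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# minimal sketch: project the 64-dimensional digits data onto 2 principal components\\n\",\n    \"from sklearn.datasets import load_digits\\n\",\n    \"from sklearn.decomposition import PCA\\n\",\n    \"\\n\",\n    \"digits = load_digits()\\n\",\n    \"digits_pca = PCA(n_components=2).fit_transform(digits.data)\\n\",\n    \"plt.scatter(digits_pca[:, 0], digits_pca[:, 1], c=digits.target, s=10);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"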
Nevertheless, we see that the projection gives us insight\\n\",\n    \"into the distribution of the different digits in parameter space.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercises\\n\",\n    \"\\n\",\n    \"Visualize the iris dataset using the first two principal components, and compress this visualization to using two of the original features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load solutions/07A_iris-pca.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/08 Unsupervised Learning - Clustering.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Unsupervised Learning Part 2 -- Clustering\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Clustering is the task of gathering samples into groups of similar\\n\",\n    \"samples according to some predefined similarity or distance (dissimilarity)\\n\",\n    \"measure, such as the Euclidean distance.\\n\",\n    \"In this section we will explore a basic clustering task on some synthetic and real-world datasets.\\n\",\n    \"\\n\",\n    \"Here are some common applications of clustering algorithms:\\n\",\n    \"\\n\",\n    \"- Compression for data reduction\\n\",\n    \"- Summarizing data as a reprocessing step for recommender systems\\n\",\n    \"- Similarly:\\n\",\n    \"   - grouping related web news (e.g. Google News) and web search results\\n\",\n    \"   - grouping related stock quotes for investment portfolio management\\n\",\n    \"   - building customer profiles for market analysis\\n\",\n    \"- Building a code book of prototype samples for unsupervised feature extraction\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's start by creating a simple, 2-dimensional, synthetic dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_blobs\\n\",\n    \"\\n\",\n    \"X, y = make_blobs(random_state=42)\\n\",\n    \"X.shape\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.scatter(X[:, 0], X[:, 1]);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the scatter plot above, we can see three separate groups of data points and we would like to recover them using clustering -- think of \\\"discovering\\\" the class labels that we already take for granted in a classification task.\\n\",\n    \"\\n\",\n    \"Even if the groups are obvious in the data, it is hard to find them when the data lives in a high-dimensional space, which we can't visualize in a single histogram or scatterplot.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now we will use one of the simplest clustering algorithms, K-means.\\n\",\n    \"This is an iterative algorithm which searches for three cluster\\n\",\n    \"centers such that the distance from each point to its cluster is\\n\",\n    \"minimized. 
The standard implementation of K-means uses the Euclidean distance, which is why we want to make sure that all our variables are measured on the same scale if we are working with real-world datastets. In the previous notebook, we talked about one technique to achieve this, namely, standardization.\\n\",\n    \"\\n\",\n    \"**Question:** what would you expect the output to look like?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.cluster import KMeans\\n\",\n    \"\\n\",\n    \"kmeans = KMeans(n_clusters=3, random_state=42)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can get the cluster labels either by calling fit and then accessing the \\n\",\n    \"``labels_`` attribute of the K means estimator, or by calling ``fit_predict``.\\n\",\n    \"Either way, the result contains the ID of the cluster that each point is assigned to.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"labels = kmeans.fit_predict(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"labels\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"all(y == labels)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's visualize the assignments that have been found\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.scatter(X[:, 0], X[:, 1], c=labels);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Compared to the true labels:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.scatter(X[:, 0], X[:, 1], c=y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, we are probably satisfied with the clustering results. But in general we might want to have a more quantitative evaluation. 
How about comparing our cluster labels with the ground truth we got when generating the blobs?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics import confusion_matrix, accuracy_score\\n\",\n    \"\\n\",\n    \"print('Accuracy score:', accuracy_score(y, labels))\\n\",\n    \"print(confusion_matrix(y, labels))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"np.mean(y == labels)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"---\\n\",\n    \"\\n\",\n    \"**Exercise**\\n\",\n    \"\\n\",\n    \"After looking at the \\\"True\\\" label array y, and the scatterplot and `labels` above, can you figure out why our computed accuracy is 0.0, not 1.0, and can you fix it?\\n\",\n    \"\\n\",\n    \"---\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Even though we recovered the partitioning of the data into clusters perfectly, the cluster IDs we assigned were arbitrary,\\n\",\n    \"and we can not hope to recover them. Therefore, we must use a different scoring metric, such as ``adjusted_rand_score``, which is invariant to permutations of the labels:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics import adjusted_rand_score\\n\",\n    \"\\n\",\n    \"adjusted_rand_score(y, labels)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"One of the \\\"short-comings\\\" of K-means is that we have to specify the number of clusters, which we often don't know *apriori*. For example, let's have a look what happens if we set the number of clusters to 2 in our synthetic 3-blob dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"kmeans = KMeans(n_clusters=2, random_state=42)\\n\",\n    \"labels = kmeans.fit_predict(X)\\n\",\n    \"plt.scatter(X[:, 0], X[:, 1], c=labels);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### The Elbow Method\\n\",\n    \"\\n\",\n    \"The Elbow method is a \\\"rule-of-thumb\\\" approach to finding the optimal number of clusters. 
Here, we look at the cluster dispersion for different values of k:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"distortions = []\\n\",\n    \"for i in range(1, 11):\\n\",\n    \"    km = KMeans(n_clusters=i, \\n\",\n    \"                random_state=0)\\n\",\n    \"    km.fit(X)\\n\",\n    \"    distortions.append(km.inertia_)\\n\",\n    \"plt.plot(range(1, 11), distortions, marker='o')\\n\",\n    \"plt.xlabel('Number of clusters')\\n\",\n    \"plt.ylabel('Distortion')\\n\",\n    \"\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then, we pick the value that sits at the \\\"elbow\\\" of the curve. As we can see, this would be k=3 in this case, which makes sense given our visual expectation of the dataset from earlier.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Clustering comes with assumptions**: A clustering algorithm finds clusters by making assumptions about how samples should be grouped together. Each algorithm makes different assumptions, and the quality and interpretability of your results will depend on whether the assumptions are satisfied for your goal. For K-means clustering, the model is that all clusters have equal, spherical variance.\\n\",\n    \"\\n\",\n    \"**In general, there is no guarantee that the structure found by a clustering algorithm has anything to do with what you were interested in**.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can easily create a dataset that has non-isotropic clusters, on which k-means will fail:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_blobs\\n\",\n    \"\\n\",\n    \"X, y = make_blobs(random_state=170, n_samples=600)\\n\",\n    \"rng = np.random.RandomState(74)\\n\",\n    \"\\n\",\n    \"transformation = rng.normal(size=(2, 2))\\n\",\n    \"X = np.dot(X, transformation)\\n\",\n    \"\\n\",\n    \"y_pred = KMeans(n_clusters=3).fit_predict(X)\\n\",\n    \"\\n\",\n    \"plt.scatter(X[:, 0], X[:, 1], c=y_pred)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Some Notable Clustering Routines\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The following are some well-known clustering algorithms. \\n\",\n    \"\\n\",\n    \"- `sklearn.cluster.KMeans`: <br/>\\n\",\n    \"    The simplest, yet effective clustering algorithm. Needs to be provided with the\\n\",\n    \"    number of clusters in advance, and assumes that the data is normalized as input\\n\",\n    \"    (but you can use a PCA model as a preprocessor).\\n\",\n    \"- `sklearn.cluster.MeanShift`: <br/>\\n\",\n    \"    Can find better looking clusters than KMeans but is not scalable to a high number of samples.\\n\",\n    \"- `sklearn.cluster.DBSCAN`: <br/>\\n\",\n    \"    Can detect irregularly shaped clusters based on density, i.e. sparse regions in\\n\",\n    \"    the input space are likely to become inter-cluster boundaries. 
Can also detect\\n\",\n    \"    outliers (samples that are not part of a cluster).\\n\",\n    \"- `sklearn.cluster.AffinityPropagation`: <br/>\\n\",\n    \"    Clustering algorithm based on message passing between data points.\\n\",\n    \"- `sklearn.cluster.SpectralClustering`: <br/>\\n\",\n    \"    KMeans applied to a projection of the normalized graph Laplacian: finds\\n\",\n    \"    normalized graph cuts if the affinity matrix is interpreted as an adjacency matrix of a graph.\\n\",\n    \"- `sklearn.cluster.Ward`: <br/>\\n\",\n    \"    Ward implements hierarchical clustering based on the Ward algorithm,\\n\",\n    \"    a variance-minimizing approach. At each step, it minimizes the sum of\\n\",\n    \"    squared differences within all clusters (inertia criterion).\\n\",\n    \"\\n\",\n    \"Of these, Ward, SpectralClustering, DBSCAN and Affinity propagation can also work with precomputed similarity matrices.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/cluster_comparison.png\\\" width=\\\"900\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Exercise: digits clustering\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Perform K-means clustering on the digits data, searching for ten clusters.\\n\",\n    \"Visualize the cluster centers as images (i.e. reshape each to 8x8 and use\\n\",\n    \"``plt.imshow``)  Do the clusters seem to be correlated with particular digits? What is the ``adjusted_rand_score``?\\n\",\n    \"\\n\",\n    \"Visualize the projected digits as in the last notebook, but this time use the\\n\",\n    \"cluster labels as the color.  What do you notice?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_digits\\n\",\n    \"digits = load_digits()\\n\",\n    \"# ...\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/08B_digits_clustering.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/09 Review of Scikit-learn API.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka'\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# A recap on Scikit-learn's estimator interface\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Scikit-learn strives to have a uniform interface across all methods. Given a scikit-learn *estimator*\\n\",\n    \"object named `model`, the following methods are available (not all for each model):\\n\",\n    \"\\n\",\n    \"- Available in **all Estimators**\\n\",\n    \"  + `model.fit()` : fit training data. For supervised learning applications,\\n\",\n    \"    this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).\\n\",\n    \"    For unsupervised learning applications, `fit` takes only a single argument,\\n\",\n    \"    the data `X` (e.g. `model.fit(X)`).\\n\",\n    \"- Available in **supervised estimators**\\n\",\n    \"  + `model.predict()` : given a trained model, predict the label of a new set of data.\\n\",\n    \"    This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),\\n\",\n    \"    and returns the learned label for each object in the array.\\n\",\n    \"  + `model.predict_proba()` : For classification problems, some estimators also provide\\n\",\n    \"    this method, which returns the probability that a new observation has each categorical label.\\n\",\n    \"    In this case, the label with the highest probability is returned by `model.predict()`.\\n\",\n    \"  + `model.decision_function()` : For classification problems, some estimators provide an uncertainty estimate that is not a probability. For binary classification, a decision_function >= 0 means the positive class will be predicted, while < 0 means the negative class.\\n\",\n    \"  + `model.score()` : for classification or regression problems, most (all?) estimators implement\\n\",\n    \"    a score method.  Scores are between 0 and 1, with a larger score indicating a better fit. For classifiers, the `score` method computes the prediction accuracy. For regressors, `score` computes the coefficient of determination (R<sup>2</sup>) of the prediction.\\n\",\n    \"  + `model.transform()` : For feature selection algorithms, this will reduce the dataset to the selected features. For some classification and regression models such as some linear models and random forests, this method reduces the dataset to the most informative features. 
These classification and regression models can therefore also be used as feature selection methods.\\n\",\n    \"  \\n\",\n    \"- Available in **unsupervised estimators**\\n\",\n    \"  + `model.transform()` : given an unsupervised model, transform new data into the new basis.\\n\",\n    \"    This also accepts one argument `X_new`, and returns the new representation of the data based\\n\",\n    \"    on the unsupervised model.\\n\",\n    \"  + `model.fit_transform()` : some estimators implement this method,\\n\",\n    \"    which more efficiently performs a fit and a transform on the same input data.\\n\",\n    \"  + `model.predict()` : for clustering algorithms, the predict method will produce cluster labels for new data points. Not all clustering methods have this functionality.\\n\",\n    \"  + `model.predict_proba()` : Gaussian mixture models (GMMs) provide the probability for each point to be generated by a given mixture component.\\n\",\n    \"  + `model.score()` : Density models like KDE and GMMs provide the likelihood of the data under the model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Apart from ``fit``, the two most important functions are arguably ``predict``, which produces a target variable (a ``y``), and ``transform``, which produces a new representation of the data (an ``X``).\\n\",\n    \"The following table shows which function applies to which class of models:\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<table>\\n\",\n    \"<tr style=\\\"border:None; font-size:20px; padding:10px;\\\"><th>``model.predict``</th><th>``model.transform``</th></tr>\\n\",\n    \"<tr style=\\\"border:None; font-size:20px; padding:10px;\\\"><td>Classification</td><td>Preprocessing</td></tr>\\n\",\n    \"<tr style=\\\"border:None; font-size:20px; padding:10px;\\\"><td>Regression</td><td>Dimensionality Reduction</td></tr>\\n\",\n    \"<tr style=\\\"border:None; font-size:20px; padding:10px;\\\"><td>Clustering</td><td>Feature Extraction</td></tr>\\n\",\n    \"<tr style=\\\"border:None; font-size:20px; padding:10px;\\\"><td>&nbsp;</td><td>Feature Selection</td></tr>\\n\",\n    \"\\n\",\n    \"</table>\\n\",\n    \"\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/10 Case Study - Titanic Survival.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Case Study  - Titanic Survival\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Feature Extraction\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here we will talk about an important piece of machine learning: the extraction of\\n\",\n    \"quantitative features from data.  By the end of this section you will\\n\",\n    \"\\n\",\n    \"- Know how features are extracted from real-world data.\\n\",\n    \"- See an example of extracting numerical features from textual data\\n\",\n    \"\\n\",\n    \"In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## What Are Features?\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Numerical Features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Recall that data in scikit-learn is expected to be in two-dimensional arrays, of size\\n\",\n    \"**n_samples** $\\\\times$ **n_features**.\\n\",\n    \"\\n\",\n    \"Previously, we looked at the iris dataset, which has 150 samples and 4 features\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_iris\\n\",\n    \"\\n\",\n    \"iris = load_iris()\\n\",\n    \"print(iris.data.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"These features are:\\n\",\n    \"\\n\",\n    \"- sepal length in cm\\n\",\n    \"- sepal width in cm\\n\",\n    \"- petal length in cm\\n\",\n    \"- petal width in cm\\n\",\n    \"\\n\",\n    \"Numerical features such as these are pretty straightforward: each sample contains a list\\n\",\n    \"of floating-point numbers corresponding to the features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Categorical Features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"What if you have categorical features?  For example, imagine there is data on the color of each\\n\",\n    \"iris:\\n\",\n    \"\\n\",\n    \"    color in [red, blue, purple]\\n\",\n    \"\\n\",\n    \"You might be tempted to assign numbers to these features, i.e. *red=1, blue=2, purple=3*\\n\",\n    \"but in general **this is a bad idea**.  
Estimators tend to operate under the assumption that\\n\",\n    \"numerical features lie on some continuous scale, so, for example, 1 and 2 are more alike\\n\",\n    \"than 1 and 3, and this is often not the case for categorical features.\\n\",\n    \"\\n\",\n    \"In fact, the example above is a subcategory of \\\"categorical\\\" features, namely, \\\"nominal\\\" features. Nominal features don't imply an order, whereas \\\"ordinal\\\" features are categorical features that do imply an order. An example of ordinal features would be T-shirt sizes, e.g., XL > L > M > S. \\n\",\n    \"\\n\",\n    \"One work-around for encoding nominal features in a format that prevents the classification algorithm from asserting an order is the so-called one-hot encoding representation. Here, we give each category its own dimension.  \\n\",\n    \"\\n\",\n    \"The enriched iris feature set would in this case be:\\n\",\n    \"\\n\",\n    \"- sepal length in cm\\n\",\n    \"- sepal width in cm\\n\",\n    \"- petal length in cm\\n\",\n    \"- petal width in cm\\n\",\n    \"- color=purple (1.0 or 0.0)\\n\",\n    \"- color=blue (1.0 or 0.0)\\n\",\n    \"- color=red (1.0 or 0.0)\\n\",\n    \"\\n\",\n    \"Note that using many of these categorical features may result in data which is better\\n\",\n    \"represented as a **sparse matrix**, as we'll see with the text classification example\\n\",\n    \"below.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"#### Using the DictVectorizer to encode categorical features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"When the source data is encoded as a list of dicts where the values are either string names for categories or numerical values, you can use the `DictVectorizer` class to compute the boolean expansion of the categorical features while leaving the numerical features untouched:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"measurements = [\\n\",\n    \"    {'city': 'Dubai', 'temperature': 33.},\\n\",\n    \"    {'city': 'London', 'temperature': 12.},\\n\",\n    \"    {'city': 'San Francisco', 'temperature': 18.},\\n\",\n    \"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_extraction import DictVectorizer\\n\",\n    \"\\n\",\n    \"vec = DictVectorizer()\\n\",\n    \"vec\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vec.fit_transform(measurements).toarray()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vec.get_feature_names()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Derived Features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Another common feature type is **derived features**, where some pre-processing step is\\n\",\n    \"applied to the data to generate features that are somehow more informative.  
Derived\\n\",\n    \"features may be based on **feature extraction** and **dimensionality reduction** (such as PCA or manifold learning),\\n\",\n    \"may be linear or nonlinear combinations of features (such as in polynomial regression),\\n\",\n    \"or may be some more sophisticated transform of the features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Combining Numerical and Categorical Features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As an example of how to work with both categorical and numerical data, we will perform survival prediction for the passengers of the Titanic.\\n\",\n    \"\\n\",\n    \"We will use a version of the Titanic (titanic3.xls) from [here](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls). We converted the .xls to .csv for easier manipulation but the data is otherwise unchanged.\\n\",\n    \"\\n\",\n    \"We need to read in all the lines from the titanic3.csv file, set aside the keys from the first line, and find our labels (who survived or died) and data (attributes of each person). Let's look at the keys and some corresponding example lines.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"import pandas as pd\\n\",\n    \"\\n\",\n    \"titanic = pd.read_csv(os.path.join('datasets', 'titanic3.csv'))\\n\",\n    \"print(titanic.columns)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here is a broad description of the keys and what they mean:\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"pclass          Passenger Class\\n\",\n    \"                (1 = 1st; 2 = 2nd; 3 = 3rd)\\n\",\n    \"survival        Survival\\n\",\n    \"                (0 = No; 1 = Yes)\\n\",\n    \"name            Name\\n\",\n    \"sex             Sex\\n\",\n    \"age             Age\\n\",\n    \"sibsp           Number of Siblings/Spouses Aboard\\n\",\n    \"parch           Number of Parents/Children Aboard\\n\",\n    \"ticket          Ticket Number\\n\",\n    \"fare            Passenger Fare\\n\",\n    \"cabin           Cabin\\n\",\n    \"embarked        Port of Embarkation\\n\",\n    \"                (C = Cherbourg; Q = Queenstown; S = Southampton)\\n\",\n    \"boat            Lifeboat\\n\",\n    \"body            Body Identification Number\\n\",\n    \"home.dest       Home/Destination\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"In general, it looks like `name`, `sex`, `cabin`, `embarked`, `boat`, `body`, and `home.dest` may be candidates for categorical features, while the rest appear to be numerical features. We can also look at the first couple of rows in the dataset to get a better understanding:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"titanic.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We clearly want to discard the \\\"boat\\\" and \\\"body\\\" columns for any classification into survived vs not survived as they already contain this information. The name is unique to each person (probably) and also non-informative. 
For a first try, we will use \\\"pclass\\\", \\\"sex\\\", \\\"age\\\", \\\"sibsp\\\", \\\"parch\\\", \\\"fare\\\", and \\\"embarked\\\" as our features:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"labels = titanic.survived.values\\n\",\n    \"features = titanic[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"features.head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The data now contains only useful features, but they are not in a format that the machine learning algorithms can understand. We need to transform the strings \\\"male\\\" and \\\"female\\\" into binary variables that indicate the gender, and similarly for \\\"embarked\\\".\\n\",\n    \"We can do that using the pandas ``get_dummies`` function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"pd.get_dummies(features).head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This transformation successfully encoded the string columns. However, one might argue that the class is also a categorical variable. We can explicitly list the columns to encode using the ``columns`` parameter, and include ``pclass``:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"features_dummies = pd.get_dummies(features, columns=['pclass', 'sex', 'embarked'])\\n\",\n    \"features_dummies.head(n=16)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"data = features_dummies.values\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"np.isnan(data).any()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"With all of the hard data loading work out of the way, evaluating a classifier on this data becomes straightforward. 
Setting up the simplest possible model, we want to see what the simplest score can be with `DummyClassifier`.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"from sklearn.preprocessing import Imputer\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=0)\\n\",\n    \"\\n\",\n    \"imp = Imputer()\\n\",\n    \"imp.fit(train_data)\\n\",\n    \"train_data_finite = imp.transform(train_data)\\n\",\n    \"test_data_finite = imp.transform(test_data)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.dummy import DummyClassifier\\n\",\n    \"\\n\",\n    \"clf = DummyClassifier('most_frequent')\\n\",\n    \"clf.fit(train_data_finite, train_labels)\\n\",\n    \"print(\\\"Prediction accuracy: %f\\\" % clf.score(test_data_finite, test_labels))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"Exercise\\n\",\n    \"=====\\n\",\n    \"Try executing the above classification, using LogisticRegression and RandomForestClassifier instead of DummyClassifier\\n\",\n    \"\\n\",\n    \"Does selecting a different subset of features help?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load solutions/10_titanic.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/11 Text Feature Extraction.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Methods - Text Feature Extraction with Bag-of-Words\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In many tasks, like in the classical spam detection, your input data is text.\\n\",\n    \"Free text with variables length is very far from the fixed length numeric representation that we need to do machine learning with scikit-learn.\\n\",\n    \"However, there is an easy and effective way to go from text data to a numeric representation using the so-called bag-of-words model, which provides a data structure that is compatible with the machine learning aglorithms in scikit-learn.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/bag_of_words.svg\\\" width=\\\"100%\\\">\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's assume that each sample in your dataset is represented as one string, which could be just a sentence, an email, or a whole news article or book. To represent the sample, we first split the string into a list of tokens, which correspond to (somewhat normalized) words. A simple way to do this to just split by whitespace, and then lowercase the word. \\n\",\n    \"\\n\",\n    \"Then, we build a vocabulary of all tokens (lowercased words) that appear in our whole dataset. 
This is usually a very large vocabulary.\\n\",\n    \"Finally, looking at our single sample, we could show how often each word in the vocabulary appears.\\n\",\n    \"We represent our string by a vector, where each entry is how often a given word in the vocabulary appears in the string.\\n\",\n    \"\\n\",\n    \"As each sample will only contain very few words, most entries will be zero, leading to a very high-dimensional but sparse representation.\\n\",\n    \"\\n\",\n    \"The method is called \\\"bag-of-words,\\\" as the order of the words is lost entirely.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X = [\\\"Some say the world will end in fire,\\\",\\n\",\n    \"     \\\"Some say in ice.\\\"]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"len(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_extraction.text import CountVectorizer\\n\",\n    \"\\n\",\n    \"vectorizer = CountVectorizer()\\n\",\n    \"vectorizer.fit(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vectorizer.vocabulary_\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_bag_of_words = vectorizer.transform(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_bag_of_words.shape\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_bag_of_words\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_bag_of_words.toarray()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vectorizer.get_feature_names()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vectorizer.inverse_transform(X_bag_of_words)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# tf-idf Encoding\\n\",\n    \"A useful transformation that is often applied to the bag-of-word encoding is the so-called term-frequency inverse-document-frequency (tf-idf) scaling, which is a non-linear transformation of the word counts.\\n\",\n    \"\\n\",\n    \"The tf-idf encoding rescales words that are common to have less weight:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_extraction.text import TfidfVectorizer\\n\",\n    \"\\n\",\n    \"tfidf_vectorizer = TfidfVectorizer()\\n\",\n    
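\"# fit() learns the vocabulary and the idf weight of each term from X\\n\",\n    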
\"tfidf_vectorizer.fit(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"np.set_printoptions(precision=2)\\n\",\n    \"\\n\",\n    \"print(tfidf_vectorizer.transform(X).toarray())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"tf-idfs are a way to represent documents as feature vectors. tf-idfs can be understood as a modification of the raw term frequencies (`tf`); the `tf` is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification. If you are interested in the mathematical details and equations, we have compiled an [external IPython Notebook](http://nbviewer.jupyter.org/github/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb) that walks you through the computation.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Bigrams and N-Grams\\n\",\n    \"\\n\",\n    \"In the example illustrated in the figure at the beginning of this notebook, we used the so-called 1-gram (unigram) tokenization: Each token represents a single element with regard to the splittling criterion. \\n\",\n    \"\\n\",\n    \"Entirely discarding word order is not always a good idea, as composite phrases often have specific meaning, and modifiers like \\\"not\\\" can invert the meaning of words.\\n\",\n    \"\\n\",\n    \"A simple way to include some word order are n-grams, which don't only look at a single token, but at all pairs of neighborhing tokens. For example, in 2-gram (bigram) tokenization, we would group words together with an overlap of one word; in 3-gram (trigram) splits we would create an overlap two words, and so forth:\\n\",\n    \"\\n\",\n    \"- original text: \\\"this is how you get ants\\\"\\n\",\n    \"- 1-gram: \\\"this\\\", \\\"is\\\", \\\"how\\\", \\\"you\\\", \\\"get\\\", \\\"ants\\\"\\n\",\n    \"- 2-gram: \\\"this is\\\", \\\"is how\\\", \\\"how you\\\", \\\"you get\\\", \\\"get ants\\\"\\n\",\n    \"- 3-gram: \\\"this is how\\\", \\\"is how you\\\", \\\"how you get\\\", \\\"you get ants\\\"\\n\",\n    \"\\n\",\n    \"Which \\\"n\\\" we choose for \\\"n-gram\\\" tokenization to obtain the optimal performance in our predictive model depends on the learning algorithm, dataset, and task. 
In other words, we have to consider \\\"n\\\" in \\\"n-grams\\\" as a tuning parameter, and in later notebooks, we will see how to deal with it.\\n\",\n    \"\\n\",\n    \"Now, let's create a bag-of-words model of bigrams using scikit-learn's `CountVectorizer`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# look at sequences of tokens of minimum length 2 and maximum length 2\\n\",\n    \"bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))\\n\",\n    \"bigram_vectorizer.fit(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"bigram_vectorizer.get_feature_names()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"bigram_vectorizer.transform(X).toarray()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Often we want to include unigrams (single tokens) AND bigrams, which we can do by passing the following tuple as an argument to the `ngram_range` parameter of the `CountVectorizer` function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"gram_vectorizer = CountVectorizer(ngram_range=(1, 2))\\n\",\n    \"gram_vectorizer.fit(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"gram_vectorizer.get_feature_names()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"gram_vectorizer.transform(X).toarray()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Character n-grams\\n\",\n    \"=================\\n\",\n    \"\\n\",\n    \"Sometimes it is also helpful not only to look at words, but to consider single characters instead. 
\\n\",\n    \"That is particularly useful if we have very noisy data and want to identify the language, or if we want to predict something about a single word.\\n\",\n    \"We can simply look at characters instead of words by setting ``analyzer=\\\"char\\\"``.\\n\",\n    \"Looking at single characters is usually not very informative, but looking at longer n-grams of characters could be:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"char_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer=\\\"char\\\")\\n\",\n    \"char_vectorizer.fit(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(char_vectorizer.get_feature_names())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"# Exercise\\n\",\n    \"Compute the bigrams from \\\"zen of python\\\" as given below (or by ``import this``), and find the most common trigram.\\n\",\n    \"We want to treat each line as a separate document. You can achieve this by splitting the string by newlines (``\\\\n``).\\n\",\n    \"Compute the Tf-idf encoding of the data. Which words have the highest tf-idf score? Why?\\n\",\n    \"What changes if you use ``TfidfVectorizer(norm=\\\"none\\\")``?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"zen = \\\"\\\"\\\"Beautiful is better than ugly.\\n\",\n    \"Explicit is better than implicit.\\n\",\n    \"Simple is better than complex.\\n\",\n    \"Complex is better than complicated.\\n\",\n    \"Flat is better than nested.\\n\",\n    \"Sparse is better than dense.\\n\",\n    \"Readability counts.\\n\",\n    \"Special cases aren't special enough to break the rules.\\n\",\n    \"Although practicality beats purity.\\n\",\n    \"Errors should never pass silently.\\n\",\n    \"Unless explicitly silenced.\\n\",\n    \"In the face of ambiguity, refuse the temptation to guess.\\n\",\n    \"There should be one-- and preferably only one --obvious way to do it.\\n\",\n    \"Although that way may not be obvious at first unless you're Dutch.\\n\",\n    \"Now is better than never.\\n\",\n    \"Although never is often better than *right* now.\\n\",\n    \"If the implementation is hard to explain, it's a bad idea.\\n\",\n    \"If the implementation is easy to explain, it may be a good idea.\\n\",\n    \"Namespaces are one honking great idea -- let's do more of those!\\\"\\\"\\\"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/11_ngrams.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": 
\"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/12 Case Study - SMS Spam Detection.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Case Study - Text classification for SMS spam detection\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We first load the text data from the `dataset` directory that should be located in your notebooks directory, which we created by running the `fetch_data.py` script from the top level of the GitHub repository.\\n\",\n    \"\\n\",\n    \"Furthermore, we perform some simple preprocessing and split the data array into two parts:\\n\",\n    \"\\n\",\n    \"1. `text`: A list of lists, where each sublists contains the contents of our emails\\n\",\n    \"2. `y`: our SPAM vs HAM labels stored in binary; a 1 represents a spam message, and a 0 represnts a ham (non-spam) message. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"with open(os.path.join(\\\"datasets\\\", \\\"smsspam\\\", \\\"SMSSpamCollection\\\")) as f:\\n\",\n    \"    lines = [line.strip().split(\\\"\\\\t\\\") for line in f.readlines()]\\n\",\n    \"\\n\",\n    \"text = [x[1] for x in lines]\\n\",\n    \"y = [int(x[0] == \\\"spam\\\") for x in lines]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"text[:10]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y[:10]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print('Number of ham and spam messages:', np.bincount(y))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"type(text)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"type(y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, we split our dataset into 2 parts, the test and training dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    
\"text_train, text_test, y_train, y_test = train_test_split(text, y, \\n\",\n    \"                                                          random_state=42,\\n\",\n    \"                                                          test_size=0.25,\\n\",\n    \"                                                          stratify=y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we use the CountVectorizer to parse the text data into a bag-of-words model.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_extraction.text import CountVectorizer\\n\",\n    \"\\n\",\n    \"print('CountVectorizer defaults')\\n\",\n    \"CountVectorizer()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vectorizer = CountVectorizer()\\n\",\n    \"vectorizer.fit(text_train)\\n\",\n    \"\\n\",\n    \"X_train = vectorizer.transform(text_train)\\n\",\n    \"X_test = vectorizer.transform(text_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(len(vectorizer.vocabulary_))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_train.shape\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(vectorizer.get_feature_names()[:20])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(vectorizer.get_feature_names()[2000:2020])\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(X_train.shape)\\n\",\n    \"print(X_test.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Training a Classifier on Text Features\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can now train a classifier, for instance a logistic regression classifier, which is a fast baseline for text classification tasks:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import LogisticRegression\\n\",\n    \"\\n\",\n    \"clf = LogisticRegression()\\n\",\n    \"clf\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"clf.fit(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can now evaluate the classifier on the testing set. 
Let's first use the built-in score function, which is the rate of correct classification in the test set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"clf.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also compute the score on the training set to see how well we do there:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"clf.score(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Visualizing important features\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def visualize_coefficients(classifier, feature_names, n_top_features=25):\\n\",\n    \"    # get coefficients with large absolute values \\n\",\n    \"    coef = classifier.coef_.ravel()\\n\",\n    \"    positive_coefficients = np.argsort(coef)[-n_top_features:]\\n\",\n    \"    negative_coefficients = np.argsort(coef)[:n_top_features]\\n\",\n    \"    interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])\\n\",\n    \"    # plot them\\n\",\n    \"    plt.figure(figsize=(15, 5))\\n\",\n    \"    colors = [\\\"red\\\" if c < 0 else \\\"blue\\\" for c in coef[interesting_coefficients]]\\n\",\n    \"    plt.bar(np.arange(50), coef[interesting_coefficients], color=colors)\\n\",\n    \"    feature_names = np.array(feature_names)\\n\",\n    \"    plt.xticks(np.arange(1, 51), feature_names[interesting_coefficients], rotation=60, ha=\\\"right\\\");\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"visualize_coefficients(clf, vectorizer.get_feature_names())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vectorizer = CountVectorizer(min_df=2)\\n\",\n    \"vectorizer.fit(text_train)\\n\",\n    \"\\n\",\n    \"X_train = vectorizer.transform(text_train)\\n\",\n    \"X_test = vectorizer.transform(text_test)\\n\",\n    \"\\n\",\n    \"clf = LogisticRegression()\\n\",\n    \"clf.fit(X_train, y_train)\\n\",\n    \"\\n\",\n    \"print(clf.score(X_train, y_train))\\n\",\n    \"print(clf.score(X_test, y_test))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"visualize_coefficients(clf, vectorizer.get_feature_names())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/supervised_scikit_learn.png\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercises\\n\",\n    \"\\n\",\n    \"Use TfidfVectorizer instead of CountVectorizer. Are the results better? 
How are the coefficients different?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/12A_tfidf.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Change the parameters min_df and ngram_range of the TfidfVectorizer and CountVectorizer. How does that change the important features?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/12B_vectorizer_params.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/13 Cross Validation.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Cross-Validation and scoring methods\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the previous sections and notebooks, we split our dataset into two parts, a training set and a test set. We used the training set to fit our model, and we used the test set to evaluate its generalization performance -- how well it performs on new, unseen data.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"<img src=\\\"figures/train_test_split.svg\\\" width=\\\"100%\\\">\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"However, often (labeled) data is precious, and this approach lets us only use ~ 3/4 of our data for training. On the other hand, we will only ever try to apply our model 1/4 of our data for testing.\\n\",\n    \"A common way to use more of the data to build a model, but also get a more robust estimate of the generalization performance, is cross-validation.\\n\",\n    \"In cross-validation, the data is split repeatedly into a training and non-overlapping test-sets, with a separate model built for every pair. The test-set scores are then aggregated for a more robust estimate.\\n\",\n    \"\\n\",\n    \"The most common way to do cross-validation is k-fold cross-validation, in which the data is first split into k (often 5 or 10) equal-sized folds, and then for each iteration, one of the k folds is used as test data, and the rest as training data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/cross_validation.svg\\\" width=\\\"100%\\\">\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This way, each data point will be in the test-set exactly once, and we can use all but a k'th of the data for training.\\n\",\n    \"Let us apply this technique to evaluate the KNeighborsClassifier algorithm on the Iris dataset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_iris\\n\",\n    \"from sklearn.neighbors import KNeighborsClassifier\\n\",\n    \"\\n\",\n    \"iris = load_iris()\\n\",\n    \"X, y = iris.data, iris.target\\n\",\n    \"\\n\",\n    \"classifier = KNeighborsClassifier()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The labels in iris are sorted, which means that if we split the data as illustrated above, the first fold will only have the label 0 in it, while the last one will only have the label 2:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To avoid this 
problem in evaluation, we first shuffle our data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"rng = np.random.RandomState(0)\\n\",\n    \"\\n\",\n    \"permutation = rng.permutation(len(X))\\n\",\n    \"X, y = X[permutation], y[permutation]\\n\",\n    \"print(y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now implementing cross-validation is easy:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"k = 5\\n\",\n    \"n_samples = len(X)\\n\",\n    \"fold_size = n_samples // k\\n\",\n    \"scores = []\\n\",\n    \"masks = []\\n\",\n    \"for fold in range(k):\\n\",\n    \"    # generate a boolean mask for the test set in this fold\\n\",\n    \"    test_mask = np.zeros(n_samples, dtype=bool)\\n\",\n    \"    test_mask[fold * fold_size : (fold + 1) * fold_size] = True\\n\",\n    \"    # store the mask for visualization\\n\",\n    \"    masks.append(test_mask)\\n\",\n    \"    # create training and test sets using this mask\\n\",\n    \"    X_test, y_test = X[test_mask], y[test_mask]\\n\",\n    \"    X_train, y_train = X[~test_mask], y[~test_mask]\\n\",\n    \"    # fit the classifier\\n\",\n    \"    classifier.fit(X_train, y_train)\\n\",\n    \"    # compute the score and record it\\n\",\n    \"    scores.append(classifier.score(X_test, y_test))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's check that our test mask does the right thing:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import matplotlib.pyplot as plt\\n\",\n    \"%matplotlib inline\\n\",\n    \"plt.matshow(masks, cmap='gray_r')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"And now let's look a the scores we computed:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(scores)\\n\",\n    \"print(np.mean(scores))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can see, there is a rather wide spectrum of scores from 90% correct to 100% correct. If we only did a single split, we might have gotten either answer.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As cross-validation is such a common pattern in machine learning, there are functions to do the above for you with much more flexibility and less code.\\n\",\n    \"The ``sklearn.model_selection`` module has all functions related to cross validation. 
The easiest function is ``cross_val_score``, which takes an estimator and a dataset, and will do all of the splitting for you:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import cross_val_score\\n\",\n    \"scores = cross_val_score(classifier, X, y)\\n\",\n    \"print(scores)\\n\",\n    \"print(np.mean(scores))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can see, the function uses three folds by default. You can change the number of folds using the cv argument:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"cross_val_score(classifier, X, y, cv=5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There are also helper objects in the cross-validation module that will generate indices for you for all kinds of different cross-validation methods, including k-fold:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"By default, cross_val_score will use ``StratifiedKFold`` for classification, which ensures that the class proportions in the dataset are reflected in each fold. If you have a binary classification dataset with 90% of data points belonging to class 0, that would mean that in each fold, 90% of datapoints would belong to class 0.\\n\",\n    \"If you just used KFold cross-validation, it is likely that you would generate a split that only contains class 0.\\n\",\n    \"It is generally a good idea to use ``StratifiedKFold`` whenever you do classification.\\n\",\n    \"\\n\",\n    \"``StratifiedKFold`` would also remove our need to shuffle ``iris``.\\n\",\n    \"Let's see what kinds of folds it generates on the unshuffled iris dataset.\\n\",\n    \"Each cross-validation class is a generator of sets of training and test indices:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"cv = StratifiedKFold(n_splits=5)\\n\",\n    \"for train, test in cv.split(iris.data, iris.target):\\n\",\n    \"    print(test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can see, there are a couple of samples from the beginning, then from the middle, and then from the end, in each of the folds.\\n\",\n    \"This way, the class ratios are preserved. 
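We can check this directly by counting the class labels in each test fold, reusing the ``cv`` object defined above (a quick added check):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# added check: each StratifiedKFold test fold should contain all three iris classes in equal counts\\n\",\n    \"for train, test in cv.split(iris.data, iris.target):\\n\",\n    \"    print(np.bincount(iris.target[test]))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"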
Let's visualize the split:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def plot_cv(cv, features, labels):\\n\",\n    \"    masks = []\\n\",\n    \"    for train, test in cv.split(features, labels):\\n\",\n    \"        mask = np.zeros(len(labels), dtype=bool)\\n\",\n    \"        mask[test] = 1\\n\",\n    \"        masks.append(mask)\\n\",\n    \"    \\n\",\n    \"    plt.matshow(masks, cmap='gray_r')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plot_cv(StratifiedKFold(n_splits=5), iris.data, iris.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For comparison, again the standard KFold, that ignores the labels:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plot_cv(KFold(n_splits=5), iris.data, iris.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Keep in mind that increasing the number of folds will give you a larger training dataset, but will lead to more repetitions, and therefore a slower evaluation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plot_cv(KFold(n_splits=10), iris.data, iris.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Another helpful cross-validation generator is ``ShuffleSplit``. This generator simply splits of a random portion of the data repeatedly. This allows the user to specify the number of repetitions and the training set size independently:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plot_cv(ShuffleSplit(n_splits=5, test_size=.2), iris.data, iris.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If you want a more robust estimate, you can just increase the number of splits:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plot_cv(ShuffleSplit(n_splits=20, test_size=.2), iris.data, iris.target)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"You can use all of these cross-validation generators with the `cross_val_score` method:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"cv = ShuffleSplit(n_splits=5, test_size=.2)\\n\",\n    \"cross_val_score(classifier, X, y, cv=cv)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise\\n\",\n    \"Perform three-fold cross-validation using the ``KFold`` class on the iris dataset without shuffling the data. 
Can you explain the result?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/13_cross_validation.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/14 Model Complexity and GridSearchCV.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Parameter selection, Validation, and Testing\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Most models have parameters that influence how complex a model they can learn. Remember using `KNeighborsRegressor`.\\n\",\n    \"If we change the number of neighbors we consider, we get a smoother and smoother prediction:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/plot_kneigbors_regularization.png\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the above figure, we see fits for three different values of ``n_neighbors``.\\n\",\n    \"For ``n_neighbors=2``, the data is overfit, the model is too flexible and can adjust too much to the noise in the training data. For ``n_neighbors=20``, the model is not flexible enough, and can not model the variation in the data appropriately.\\n\",\n    \"\\n\",\n    \"In the middle, for ``n_neighbors = 5``, we have found a good mid-point. It fits\\n\",\n    \"the data fairly well, and does not suffer from the overfit or underfit\\n\",\n    \"problems seen in the figures on either side. What we would like is a\\n\",\n    \"way to quantitatively identify overfit and underfit, and optimize the\\n\",\n    \"hyperparameters (in this case, the polynomial degree d) in order to\\n\",\n    \"determine the best algorithm.\\n\",\n    \"\\n\",\n    \"We trade off remembering too much about the particularities and noise of the training data vs. not modeling enough of the variability. This is a trade-off that needs to be made in basically every machine learning application and is a central concept, called bias-variance-tradeoff or \\\"overfitting vs underfitting\\\".\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/overfitting_underfitting_cartoon.svg\\\" width=\\\"100%\\\">\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Hyperparameters, Over-fitting, and Under-fitting\\n\",\n    \"\\n\",\n    \"Unfortunately, there is no general rule how to find the sweet spot, and so machine learning practitioners have to find the best trade-off of model-complexity and generalization by trying several hyperparameter settings. 
Hyperparameters are the internal knobs or tuning parameters of a machine learning algorithm (in contrast to model parameters that the algorithm learns from the training data -- for example, the weight coefficients of a linear regression model); the value of *k* in K-nearest neighbors is such a hyperparameter.\\n\",\n    \"\\n\",\n    \"Most commonly this \\\"hyperparameter tuning\\\" is done using a brute force search, for example over multiple values of ``n_neighbors``:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import cross_val_score, KFold\\n\",\n    \"from sklearn.neighbors import KNeighborsRegressor\\n\",\n    \"# generate toy dataset:\\n\",\n    \"x = np.linspace(-3, 3, 100)\\n\",\n    \"rng = np.random.RandomState(42)\\n\",\n    \"y = np.sin(4 * x) + x + rng.normal(size=len(x))\\n\",\n    \"X = x[:, np.newaxis]\\n\",\n    \"\\n\",\n    \"cv = KFold(shuffle=True)\\n\",\n    \"\\n\",\n    \"# for each parameter setting do cross-validation:\\n\",\n    \"for n_neighbors in [1, 3, 5, 10, 20]:\\n\",\n    \"    scores = cross_val_score(KNeighborsRegressor(n_neighbors=n_neighbors), X, y, cv=cv)\\n\",\n    \"    print(\\\"n_neighbors: %d, average score: %f\\\" % (n_neighbors, np.mean(scores)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There is a function in scikit-learn, called ``validation_curve``, to reproduce the cartoon figure above. It plots one parameter, such as the number of neighbors, against training and validation error (using cross-validation):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import validation_curve\\n\",\n    \"n_neighbors = [1, 3, 5, 10, 20, 50]\\n\",\n    \"train_errors, test_errors = validation_curve(KNeighborsRegressor(), X, y, param_name=\\\"n_neighbors\\\",\\n\",\n    \"                                             param_range=n_neighbors, cv=cv)\\n\",\n    \"plt.plot(n_neighbors, train_errors.mean(axis=1), label=\\\"train error\\\")\\n\",\n    \"plt.plot(n_neighbors, test_errors.mean(axis=1), label=\\\"test error\\\")\\n\",\n    \"plt.legend(loc=\\\"best\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that many neighbors mean a \\\"smooth\\\" or \\\"simple\\\" model, so the plot is the mirror image of the diagram above.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"If multiple parameters are important, like the parameters ``C`` and ``gamma`` in an ``SVM`` (more about that later), all possible combinations are tried:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import cross_val_score, KFold\\n\",\n    \"from sklearn.svm import SVR\\n\",\n    \"\\n\",\n    \"# for each parameter setting do cross-validation:\\n\",\n    \"for C in [0.001, 0.01, 0.1, 1, 10]:\\n\",\n    \"    for gamma in [0.001, 0.01, 0.1, 1]:\\n\",\n    \"        scores = cross_val_score(SVR(C=C, gamma=gamma), X, y, cv=cv)\\n\",\n    \"        print(\\\"C: %f, gamma: %f, average score: %f\\\" % (C, gamma, np.mean(scores)))\"\n   ]\n  },\n  {\n   
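\"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The loop above only prints each score. As a small sketch of what ``GridSearchCV`` (introduced next) automates, we could also keep track of the best parameter combination seen so far, reusing the ``X``, ``y``, ``cv``, and ``SVR`` objects from the previous cell:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# sketch: manually track the best (C, gamma) combination from the brute-force search above\\n\",\n    \"best_score, best_params = -np.inf, None\\n\",\n    \"for C in [0.001, 0.01, 0.1, 1, 10]:\\n\",\n    \"    for gamma in [0.001, 0.01, 0.1, 1]:\\n\",\n    \"        score = np.mean(cross_val_score(SVR(C=C, gamma=gamma), X, y, cv=cv))\\n\",\n    \"        if score > best_score:\\n\",\n    \"            best_score, best_params = score, {'C': C, 'gamma': gamma}\\n\",\n    \"print(best_params, best_score)\"\n   ]\n  },\n  {\n   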
\"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As this is such a very common pattern, there is a built-in class for this in scikit-learn, ``GridSearchCV``. ``GridSearchCV`` takes a dictionary that describes the parameters that should be tried and a model to train.\\n\",\n    \"\\n\",\n    \"The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import GridSearchCV\\n\",\n    \"param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}\\n\",\n    \"\\n\",\n    \"grid = GridSearchCV(SVR(), param_grid=param_grid, cv=cv, verbose=3)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"One of the great things about GridSearchCV is that it is a *meta-estimator*. It takes an estimator like SVR above, and creates a new estimator, that behaves exactly the same - in this case, like a regressor.\\n\",\n    \"So we can call ``fit`` on it, to train it:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"grid.fit(X, y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"What ``fit`` does is a bit more involved then what we did above. First, it runs the same loop with cross-validation, to find the best parameter combination.\\n\",\n    \"Once it has the best combination, it runs fit again on all data passed to fit (without cross-validation), to built a single new model using the best parameter setting.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Then, as with all models, we can use ``predict`` or ``score``:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"grid.predict(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"source\": [\n    \"You can inspect the best parameters found by ``GridSearchCV`` in the ``best_params_`` attribute, and the best score in the ``best_score_`` attribute:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(grid.best_score_)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print(grid.best_params_)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There is a problem with using this score for evaluation, however. You might be making what is called a multiple hypothesis testing error. If you try very many parameter settings, some of them will work better just by chance, and the score that you obtained might not reflect how your model would perform on new unseen data.\\n\",\n    \"Therefore, it is good to split off a separate test-set before performing grid-search. 
This pattern can be seen as a training-validation-test split, and is common in machine learning:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/grid_search_cross_validation.svg\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can do this very easily by splitting off some test data using ``train_test_split``, training ``GridSearchCV`` on the training set, and applying the ``score`` method to the test set:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\\n\",\n    \"\\n\",\n    \"param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}\\n\",\n    \"cv = KFold(n_splits=10, shuffle=True)\\n\",\n    \"\\n\",\n    \"grid = GridSearchCV(SVR(), param_grid=param_grid, cv=cv)\\n\",\n    \"\\n\",\n    \"grid.fit(X_train, y_train)\\n\",\n    \"grid.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also look at the parameters that were selected:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"grid.best_params_\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Some practitioners go for an easier scheme, simply splitting the data into three parts: training, validation and testing. 
This is a possible alternative if your training set is very large, or it is infeasible to train many models using cross-validation because training a model takes very long.\\n\",\n    \"You can do this with scikit-learn for example by splitting of a test-set and then applying GridSearchCV with ShuffleSplit cross-validation with a single iteration:\\n\",\n    \"\\n\",\n    \"<img src=\\\"figures/train_validation_test2.svg\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split, ShuffleSplit\\n\",\n    \"\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\\n\",\n    \"\\n\",\n    \"param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}\\n\",\n    \"single_split_cv = ShuffleSplit(n_splits=1)\\n\",\n    \"\\n\",\n    \"grid = GridSearchCV(SVR(), param_grid=param_grid, cv=single_split_cv, verbose=3)\\n\",\n    \"\\n\",\n    \"grid.fit(X_train, y_train)\\n\",\n    \"grid.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This is much faster, but might result in worse hyperparameters and therefore worse results.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"clf = GridSearchCV(SVR(), param_grid=param_grid)\\n\",\n    \"clf.fit(X_train, y_train)\\n\",\n    \"clf.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise\\n\",\n    \"Apply grid-search to find the best setting for the number of neighbors in ``KNeighborsClassifier``, and apply it to the digits dataset.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/14_grid_search.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/15 Pipelining Estimators.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Pipelining estimators\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In this section we study how different estimators maybe be chained.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## A simple example: feature extraction and selection before an estimator\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Feature extraction: vectorizer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"For some types of data, for instance text data, a feature extraction step must be applied to convert it to numerical features.\\n\",\n    \"To illustrate we load the SMS spam dataset we used earlier.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"with open(os.path.join(\\\"datasets\\\", \\\"smsspam\\\", \\\"SMSSpamCollection\\\")) as f:\\n\",\n    \"    lines = [line.strip().split(\\\"\\\\t\\\") for line in f.readlines()]\\n\",\n    \"text = [x[1] for x in lines]\\n\",\n    \"y = [x[0] == \\\"ham\\\" for x in lines]\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"text_train, text_test, y_train, y_test = train_test_split(text, y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Previously, we applied the feature extraction manually, like so:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_extraction.text import TfidfVectorizer\\n\",\n    \"from sklearn.linear_model import LogisticRegression\\n\",\n    \"\\n\",\n    \"vectorizer = TfidfVectorizer()\\n\",\n    \"vectorizer.fit(text_train)\\n\",\n    \"\\n\",\n    \"X_train = vectorizer.transform(text_train)\\n\",\n    \"X_test = vectorizer.transform(text_test)\\n\",\n    \"\\n\",\n    \"clf = LogisticRegression()\\n\",\n    \"clf.fit(X_train, y_train)\\n\",\n    \"\\n\",\n    \"clf.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The situation where we learn a transformation and then apply it to the test data is very common in machine learning.\\n\",\n    \"Therefore scikit-learn has a 
shortcut for this, called pipelines:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.pipeline import make_pipeline\\n\",\n    \"\\n\",\n    \"pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())\\n\",\n    \"pipeline.fit(text_train, y_train)\\n\",\n    \"pipeline.score(text_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As you can see, this makes the code much shorter and easier to handle. Behind the scenes, exactly the same as above is happening. When calling fit on the pipeline, it will call fit on each step in turn.\\n\",\n    \"\\n\",\n    \"After the first step is fit, it will use the ``transform`` method of the first step to create a new representation.\\n\",\n    \"This will then be fed to the ``fit`` of the next step, and so on.\\n\",\n    \"Finally, on the last step, only ``fit`` is called.\\n\",\n    \"\\n\",\n    \"![pipeline](figures/pipeline.svg)\\n\",\n    \"\\n\",\n    \"If we call ``score``, only ``transform`` will be called on each step - this could be the test set after all! Then, on the last step, ``score`` is called with the new representation. The same goes for ``predict``.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Building pipelines not only simplifies the code, it is also important for model selection.\\n\",\n    \"Say we want to grid-search C to tune our Logistic Regression above.\\n\",\n    \"\\n\",\n    \"Let's say we do it like this:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# This illustrates a common mistake. Don't use this code!\\n\",\n    \"from sklearn.model_selection import GridSearchCV\\n\",\n    \"\\n\",\n    \"vectorizer = TfidfVectorizer()\\n\",\n    \"vectorizer.fit(text_train)\\n\",\n    \"\\n\",\n    \"X_train = vectorizer.transform(text_train)\\n\",\n    \"X_test = vectorizer.transform(text_test)\\n\",\n    \"\\n\",\n    \"clf = LogisticRegression()\\n\",\n    \"grid = GridSearchCV(clf, param_grid={'C': [.1, 1, 10, 100]}, cv=5)\\n\",\n    \"grid.fit(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### 2.1.2 What did we do wrong?\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, we did grid-search with cross-validation on ``X_train``. However, when applying ``TfidfVectorizer``, it saw all of the ``X_train``,\\n\",\n    \"not only the training folds! So it could use knowledge of the frequency of the words in the test-folds. 
This is called \\\"contamination\\\" of the test set, and leads to too optimistic estimates of generalization performance, or badly selected parameters.\\n\",\n    \"We can fix this with the pipeline, though:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import GridSearchCV\\n\",\n    \"\\n\",\n    \"pipeline = make_pipeline(TfidfVectorizer(), \\n\",\n    \"                         LogisticRegression())\\n\",\n    \"\\n\",\n    \"grid = GridSearchCV(pipeline,\\n\",\n    \"                    param_grid={'logisticregression__C': [.1, 1, 10, 100]}, cv=5)\\n\",\n    \"\\n\",\n    \"grid.fit(text_train, y_train)\\n\",\n    \"grid.score(text_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that we need to tell the pipeline where at which step we wanted to set the parameter ``C``.\\n\",\n    \"We can do this using the special ``__`` syntax. The name before the ``__`` is simply the name of the class, the part after ``__`` is the parameter we want to set with grid-search.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/pipeline_cross_validation.svg\\\" width=\\\"50%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Another benefit of using pipelines is that we can now also search over parameters of the feature extraction with ``GridSearchCV``:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import GridSearchCV\\n\",\n    \"\\n\",\n    \"pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())\\n\",\n    \"\\n\",\n    \"params = {'logisticregression__C': [.1, 1, 10, 100],\\n\",\n    \"          \\\"tfidfvectorizer__ngram_range\\\": [(1, 1), (1, 2), (2, 2)]}\\n\",\n    \"grid = GridSearchCV(pipeline, param_grid=params, cv=5)\\n\",\n    \"grid.fit(text_train, y_train)\\n\",\n    \"print(grid.best_params_)\\n\",\n    \"grid.score(text_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Exercise\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create a pipeline out of a StandardScaler and Ridge regression and apply it to the Boston housing dataset (load using ``sklearn.datasets.load_boston``). 
Try adding the ``sklearn.preprocessing.PolynomialFeatures`` transformer as a second preprocessing step, and grid-search the degree of the polynomials (try 1, 2 and 3).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/15A_ridge_grid.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/16 Performance metrics and Model Evaluation.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Model Evaluation, Scoring Metrics, and Dealing with Class Imbalances\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In the previous notebook, we already went into some detail on how to evaluate a model and how to pick the best model. So far, we assumed that we were given a performance measure, a measure  of the quality of the model. What measure one should use is not always obvious, though.\\n\",\n    \"The default scores in scikit-learn are ``accuracy`` for classification, which is the fraction of correctly classified samples, and ``r2`` for regression, with is the coefficient of determination.\\n\",\n    \"\\n\",\n    \"These are reasonable default choices in many scenarious; however, depending on our task, these are not always the definitive or recommended choices.\\n\",\n    \"\\n\",\n    \"Let's take look at classification in more detail, going back to the application of classifying handwritten digits.\\n\",\n    \"So, how about training a classifier and walking through the different ways we can evaluate it? Scikit-learn has many helpful methods in the ``sklearn.metrics`` module that can help us with this task:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\\n\",\n    \"np.set_printoptions(precision=2)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_digits\\n\",\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"from sklearn.svm import LinearSVC\\n\",\n    \"\\n\",\n    \"digits = load_digits()\\n\",\n    \"X, y = digits.data, digits.target\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, \\n\",\n    \"                                                    random_state=1,\\n\",\n    \"                                                    stratify=y,\\n\",\n    \"                                                    test_size=0.25)\\n\",\n    \"\\n\",\n    \"classifier = LinearSVC(random_state=1).fit(X_train, y_train)\\n\",\n    \"y_test_pred = classifier.predict(X_test)\\n\",\n    \"\\n\",\n    \"print(\\\"Accuracy: {}\\\".format(classifier.score(X_test, y_test)))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here, we predicted 95.3% of samples correctly. For multi-class problems, it is often interesting to know which of the classes are hard to predict, and which are easy, or which classes get confused. 
One way to get more information about misclassifications is the ``confusion_matrix``, which shows, for each true class, how frequent a given predicted outcome is.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics import confusion_matrix\\n\",\n    \"confusion_matrix(y_test, y_test_pred)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"A plot is sometimes more readable:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.matshow(confusion_matrix(y_test, y_test_pred), cmap=\\\"Blues\\\")\\n\",\n    \"plt.colorbar(shrink=0.8)\\n\",\n    \"plt.xticks(range(10))\\n\",\n    \"plt.yticks(range(10))\\n\",\n    \"plt.xlabel(\\\"Predicted label\\\")\\n\",\n    \"plt.ylabel(\\\"True label\\\");\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can see that most entries are on the diagonal, which means that we predicted nearly all samples correctly. The off-diagonal entries show us that many eights were classified as ones, and that nines are likely to be confused with many other classes. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Another useful function is the ``classification_report``, which provides precision, recall, f1-score and support for all classes.\\n\",\n    \"Precision is how many of the predictions for a class are actually that class. With TP, FP, TN, FN standing for \\\"true positive\\\", \\\"false positive\\\", \\\"true negative\\\" and \\\"false negative\\\" respectively:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Precision = TP / (TP + FP)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Recall is how many of the true positives were recovered:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Recall = TP / (TP + FN)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The f1-score is the harmonic mean of precision and recall:\\n\",\n    \"\\n\",\n    \"F1 = 2 x (precision x recall) / (precision + recall)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"All of these values are in the closed interval [0, 1], where 1 means a perfect score.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics import classification_report\\n\",\n    \"print(classification_report(y_test, y_test_pred))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"These metrics are helpful in two particular cases that come up often in practice:\\n\",\n    \"1. Imbalanced classes, that is, one class might be much more frequent than the other.\\n\",\n    \"2. Asymmetric costs, that is, one kind of error is much more \\\"costly\\\" than the other.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's have a look at 1. first. 
Say we have a class imbalance of 1:9, which is rather mild (think about ad-click-prediction where maybe 0.001% of ads might be clicked):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"np.bincount(y) / y.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As a toy example, let's say we want to classify the digit three against all other digits:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X, y = digits.data, digits.target == 3\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"source\": [\n    \"Now we run cross-validation on a classifier to see how well it does:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import cross_val_score\\n\",\n    \"from sklearn.svm import SVC\\n\",\n    \"\\n\",\n    \"cross_val_score(SVC(), X, y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Our classifier is 90% accurate. Is that good? Or bad? Keep in mind that 90% of the data is \\\"not three\\\". So let's see how well a dummy classifier that always predicts the most frequent class does:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.dummy import DummyClassifier\\n\",\n    \"cross_val_score(DummyClassifier(strategy=\\\"most_frequent\\\"), X, y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Also 90% (as expected)! So one might think that means our classifier is not very good, as it doesn't do better than a simple strategy that doesn't even look at the data.\\n\",\n    \"That would be judging too quickly, though. Accuracy is simply not a good way to evaluate classifiers for imbalanced datasets!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"np.bincount(y) / y.shape[0]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"ROC Curves\\n\",\n    \"=======\\n\",\n    \"\\n\",\n    \"A much better measure is the so-called ROC (Receiver Operating Characteristic) curve. An ROC curve works with the uncertainty outputs of a classifier, say the \\\"decision_function\\\" of the ``SVC`` we trained above. 
Instead of making a cut-off at zero and looking at classification outcomes, it looks at every possible cut-off and records how many true positive predictions there are, and how many false positive predictions there are.\\n\",\n    \"\\n\",\n    \"The following plot compares the ROC curves of three parameter settings of our classifier on the \\\"three vs rest\\\" task.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics import roc_curve, roc_auc_score\\n\",\n    \"\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\\n\",\n    \"\\n\",\n    \"for gamma in [.01, .05, 1]:\\n\",\n    \"    plt.xlabel(\\\"False Positive Rate\\\")\\n\",\n    \"    plt.ylabel(\\\"True Positive Rate (recall)\\\")\\n\",\n    \"    svm = SVC(gamma=gamma).fit(X_train, y_train)\\n\",\n    \"    decision_function = svm.decision_function(X_test)\\n\",\n    \"    fpr, tpr, _ = roc_curve(y_test, decision_function)\\n\",\n    \"    acc = svm.score(X_test, y_test)\\n\",\n    \"    auc = roc_auc_score(y_test, svm.decision_function(X_test))\\n\",\n    \"    plt.plot(fpr, tpr, label=\\\"acc:%.2f auc:%.2f\\\" % (acc, auc), linewidth=3)\\n\",\n    \"plt.legend(loc=\\\"best\\\");\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"With a very high decision threshold, there will be few false positives, but also few true positives, while with a very low threshold, both the true positive rate and the false positive rate will be high. So in general, the curve will go from the lower left to the upper right. A diagonal line reflects chance performance, while the goal is to be as much in the top left corner as possible. This means giving a higher decision_function value to all positive samples than to any negative sample.\\n\",\n    \"\\n\",\n    \"In this sense, this curve only considers the ranking of the positive and negative samples, not the actual values.\\n\",\n    \"As you can see from the curves and the accuracy values in the legend, even though all classifiers have the same accuracy, 89%, which is even lower than that of the dummy classifier, one of them has a perfect ROC curve, while one of them performs on chance level.\\n\",\n    \"\\n\",\n    \"For doing grid-search and cross-validation, we usually want to condense our model evaluation into a single number. A good way to do this with the ROC curve is to use the area under the curve (AUC).\\n\",\n    \"We can simply use this in ``cross_val_score`` by specifying ``scoring=\\\"roc_auc\\\"``:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import cross_val_score\\n\",\n    \"cross_val_score(SVC(), X, y, scoring=\\\"roc_auc\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Built-in and custom scoring functions\\n\",\n    \"=======================================\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"source\": [\n    \"There are many more scoring methods available, which are useful for different kinds of tasks. You can find them in the \\\"SCORERS\\\" dictionary. 
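For example, ``accuracy``, ``roc_auc`` and ``r2`` are among the valid ``scoring`` strings (this short list is just an illustration, not the full set). 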
The online documentation explains all of them.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics.scorer import SCORERS\\n\",\n    \"print(SCORERS.keys())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"It is also possible to define your own scoring metric. Instead of a string, you can provide a callable as the ``scoring`` parameter, that is, an object with a ``__call__`` method or a function.\\n\",\n    \"It needs to take a model, test-set features ``X_test``, and test-set labels ``y_test``, and return a float. Higher floats are taken to mean better models.\\n\",\n    \"\\n\",\n    \"Let's reimplement the standard accuracy score:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def my_accuracy_scoring(est, X, y):\\n\",\n    \"    return np.mean(est.predict(X) == y)\\n\",\n    \"\\n\",\n    \"cross_val_score(SVC(), X, y, scoring=my_accuracy_scoring)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The interesting thing about this interface is that we can access any attributes of the estimator we trained. Let's say we have trained a linear model, and we want to penalize having non-zero coefficients in our model selection:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"def my_super_scoring(est, X, y):\\n\",\n    \"    return np.mean(est.predict(X) == y) - np.mean(est.coef_ != 0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can evaluate whether this worked as expected by grid-searching over l1 and l2 penalties in a linear SVM. An l1 penalty is expected to produce exactly zero coefficients:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import GridSearchCV\\n\",\n    \"from sklearn.svm import LinearSVC\\n\",\n    \"y = digits.target\\n\",\n    \"grid = GridSearchCV(LinearSVC(C=.01, dual=False),\\n\",\n    \"                    param_grid={'penalty' : ['l1', 'l2']},\\n\",\n    \"                    scoring=my_super_scoring)\\n\",\n    \"grid.fit(X, y)\\n\",\n    \"print(grid.best_params_)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In previous sections, we typically used the accuracy measure to evaluate the performance of our classifiers. A related measure that we haven't talked about yet is the average-per-class accuracy (APCA). As we remember, the accuracy is defined as\\n\",\n    \"\\n\",\n    \"$$ACC = \\\\frac{TP+TN}{n},$$\\n\",\n    \"\\n\",\n    \"where *n* is the total number of samples. 
This can be generalized to \\n\",\n    \"\\n\",\n    \"$$ACC =  \\\\frac{T}{n},$$\\n\",\n    \"\\n\",\n    \"where *T* is the number of all correct predictions in multi-class settings.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"![](figures/average-per-class.png)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Given the following arrays of \\\"true\\\" class labels and predicted class labels, can you implement a function that uses the accuracy measure to compute the average-per-class accuracy as shown below?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2])\\n\",\n    \"y_pred = np.array([0, 1, 1, 0, 1, 1, 2, 2, 2, 2])\\n\",\n    \"\\n\",\n    \"confusion_matrix(y_true, y_pred)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/16A_avg_per_class_acc.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/17 In Depth - Linear Models.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Linear models\\n\",\n    \"Linear models are useful when little data is available or for very large feature spaces as in text classification. In addition, they form a good case study for regularization.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Linear models for regression\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"All linear models for regression learn a coefficient parameter ``coef_`` and an offset ``intercept_`` to make predictions using a linear combination of features:\\n\",\n    \"\\n\",\n    \"```\\n\",\n    \"y_pred = x_test[0] * coef_[0] + ... + x_test[n_features-1] * coef_[n_features-1] + intercept_\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"The difference between the linear models for regression is what kind of restrictions or penalties are put on ``coef_`` as regularization , in addition to fitting the training data well.\\n\",\n    \"The most standard linear model is the 'ordinary least squares regression', often simply called 'linear regression'. 
It doesn't put any additional restrictions on ``coef_``, so when the number of features is large, it becomes ill-posed and the model overfits.\\n\",\n    \"\\n\",\n    \"Let us generate a simple simulation, to see the behavior of these models.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_regression\\n\",\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"X, y, true_coefficient = make_regression(n_samples=200, n_features=30, n_informative=10, noise=100, coef=True, random_state=5)\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5, train_size=60)\\n\",\n    \"print(X_train.shape)\\n\",\n    \"print(y_train.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Linear Regression\\n\",\n    \"\\n\",\n    \"$$ \\\\text{min}_{w, b} \\\\sum_i || w^\\\\mathsf{T}x_i + b  - y_i||^2 $$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"scrolled\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import LinearRegression\\n\",\n    \"linear_regression = LinearRegression().fit(X_train, y_train)\\n\",\n    \"print(\\\"R^2 on training set: %f\\\" % linear_regression.score(X_train, y_train))\\n\",\n    \"print(\\\"R^2 on test set: %f\\\" % linear_regression.score(X_test, y_test))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics import r2_score\\n\",\n    \"print(r2_score(np.dot(X, true_coefficient), y))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.figure(figsize=(10, 5))\\n\",\n    \"coefficient_sorting = np.argsort(true_coefficient)[::-1]\\n\",\n    \"plt.plot(true_coefficient[coefficient_sorting], \\\"o\\\", label=\\\"true\\\")\\n\",\n    \"plt.plot(linear_regression.coef_[coefficient_sorting], \\\"o\\\", label=\\\"linear regression\\\")\\n\",\n    \"\\n\",\n    \"plt.legend()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import learning_curve\\n\",\n    \"\\n\",\n    \"def plot_learning_curve(est, X, y):\\n\",\n    \"    training_set_size, train_scores, test_scores = learning_curve(est, X, y, train_sizes=np.linspace(.1, 1, 20))\\n\",\n    \"    estimator_name = est.__class__.__name__\\n\",\n    \"    line = plt.plot(training_set_size, train_scores.mean(axis=1), '--', label=\\\"training scores \\\" + estimator_name)\\n\",\n    \"    plt.plot(training_set_size, test_scores.mean(axis=1), '-', label=\\\"test scores \\\" + estimator_name, c=line[0].get_color())\\n\",\n    \"    plt.xlabel('Training set size')\\n\",\n    \"    plt.legend(loc='best')\\n\",\n    \"    plt.ylim(-0.1, 1.1)\\n\",\n    \"\\n\",\n    \"plt.figure()    \\n\",\n    \"plot_learning_curve(LinearRegression(), X, y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Ridge Regression (L2 penalty)\\n\",\n    \"\\n\",\n    \"**The Ridge estimator** is 
a simple regularization (called the l2 penalty) of the ordinary LinearRegression. In particular, it has the benefit of not being computationally more expensive than the ordinary least squares estimate.\\n\",\n    \"\\n\",\n    \"$$ \\\\text{min}_{w,b}  \\\\sum_i || w^\\\\mathsf{T}x_i + b  - y_i||^2  + \\\\alpha ||w||_2^2$$ \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The amount of regularization is set via the `alpha` parameter of the Ridge.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import Ridge\\n\",\n    \"ridge_models = {}\\n\",\n    \"training_scores = []\\n\",\n    \"test_scores = []\\n\",\n    \"\\n\",\n    \"for alpha in [100, 10, 1, .01]:\\n\",\n    \"    ridge = Ridge(alpha=alpha).fit(X_train, y_train)\\n\",\n    \"    training_scores.append(ridge.score(X_train, y_train))\\n\",\n    \"    test_scores.append(ridge.score(X_test, y_test))\\n\",\n    \"    ridge_models[alpha] = ridge\\n\",\n    \"\\n\",\n    \"plt.figure()\\n\",\n    \"plt.plot(training_scores, label=\\\"training scores\\\")\\n\",\n    \"plt.plot(test_scores, label=\\\"test scores\\\")\\n\",\n    \"plt.xticks(range(4), [100, 10, 1, .01])\\n\",\n    \"plt.legend(loc=\\\"best\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.figure(figsize=(10, 5))\\n\",\n    \"plt.plot(true_coefficient[coefficient_sorting], \\\"o\\\", label=\\\"true\\\", c='b')\\n\",\n    \"\\n\",\n    \"for i, alpha in enumerate([100, 10, 1, .01]):\\n\",\n    \"    plt.plot(ridge_models[alpha].coef_[coefficient_sorting], \\\"o\\\", label=\\\"alpha = %.2f\\\" % alpha, c=plt.cm.summer(i / 3.))\\n\",\n    \"    \\n\",\n    \"plt.legend(loc=\\\"best\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Tuning alpha is critical for performance.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.figure()\\n\",\n    \"plot_learning_curve(LinearRegression(), X, y)\\n\",\n    \"plot_learning_curve(Ridge(alpha=10), X, y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Lasso (L1 penalty)\\n\",\n    \"**The Lasso estimator** is useful for imposing sparsity on the coefficients. In other words, it is to be preferred if we believe that many of the features are not relevant. 
This is done via the so-called l1 penalty.\\n\",\n    \"\\n\",\n    \"$$ \\\\text{min}_{w, b} \\\\sum_i || w^\\\\mathsf{T}x_i + b  - y_i||^2  + \\\\alpha ||w||_1$$ \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import Lasso\\n\",\n    \"\\n\",\n    \"lasso_models = {}\\n\",\n    \"training_scores = []\\n\",\n    \"test_scores = []\\n\",\n    \"\\n\",\n    \"for alpha in [30, 10, 1, .01]:\\n\",\n    \"    lasso = Lasso(alpha=alpha).fit(X_train, y_train)\\n\",\n    \"    training_scores.append(lasso.score(X_train, y_train))\\n\",\n    \"    test_scores.append(lasso.score(X_test, y_test))\\n\",\n    \"    lasso_models[alpha] = lasso\\n\",\n    \"plt.figure()\\n\",\n    \"plt.plot(training_scores, label=\\\"training scores\\\")\\n\",\n    \"plt.plot(test_scores, label=\\\"test scores\\\")\\n\",\n    \"plt.xticks(range(4), [30, 10, 1, .01])\\n\",\n    \"plt.legend(loc=\\\"best\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.figure(figsize=(10, 5))\\n\",\n    \"plt.plot(true_coefficient[coefficient_sorting], \\\"o\\\", label=\\\"true\\\", c='b')\\n\",\n    \"\\n\",\n    \"for i, alpha in enumerate([30, 10, 1, .01]):\\n\",\n    \"    plt.plot(lasso_models[alpha].coef_[coefficient_sorting], \\\"o\\\", label=\\\"alpha = %.2f\\\" % alpha, c=plt.cm.summer(i / 3.))\\n\",\n    \"    \\n\",\n    \"plt.legend(loc=\\\"best\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.figure(figsize=(10, 5))\\n\",\n    \"plot_learning_curve(LinearRegression(), X, y)\\n\",\n    \"plot_learning_curve(Ridge(alpha=10), X, y)\\n\",\n    \"plot_learning_curve(Lasso(alpha=10), X, y)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Instead of picking Ridge *or* Lasso, you can also use [ElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html), which uses both forms of regularization and provides a parameter to assign a weighting between them. ElasticNet typically performs the best amongst these models.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Linear models for classification\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"All linear models for classification learn a coefficient parameter ``coef_`` and an offset ``intercept_`` to make predictions using a linear combination of features:\\n\",\n    \"```\\n\",\n    \"y_pred = x_test[0] * coef_[0] + ... 
+ x_test[n_features-1] * coef_[n_features-1] + intercept_ > 0\\n\",\n    \"```\\n\",\n    \"\\n\",\n    \"As you can see, this is very similar to regression, only that a threshold at zero is applied.\\n\",\n    \"\\n\",\n    \"Again, the difference between the linear models for classification is what kind of regularization is put on ``coef_`` and ``intercept_``, but there are also minor differences in how the fit to the training set is measured (the so-called loss function).\\n\",\n    \"\\n\",\n    \"The two most common models for linear classification are the linear SVM as implemented in LinearSVC and LogisticRegression.\\n\",\n    \"\\n\",\n    \"A good intuition for regularization of linear classifiers is that with high regularization, it is enough if most of the points are classified correctly. But with less regularization, more importance is given to each individual data point.\\n\",\n    \"This is illustrated using a linear SVM with different values of ``C`` below.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### The influence of C in LinearSVC\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In LinearSVC, the `C` parameter controls the regularization within the model.\\n\",\n    \"\\n\",\n    \"Lower `C` entails more regularization and simpler models, whereas higher `C` entails less regularization and more influence from individual data points.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_linear_svc_regularization\\n\",\n    \"plot_linear_svc_regularization()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Similar to the Ridge/Lasso separation, you can set the `penalty` parameter to 'l1' to enforce sparsity of the coefficients (similar to Lasso) or 'l2' to encourage smaller coefficients (similar to Ridge).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Multi-class linear classification\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_blobs\\n\",\n    \"plt.figure()\\n\",\n    \"X, y = make_blobs(random_state=42)\\n\",\n    \"plt.scatter(X[:, 0], X[:, 1], c=y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.svm import LinearSVC\\n\",\n    \"linear_svm = LinearSVC().fit(X, y)\\n\",\n    \"print(linear_svm.coef_.shape)\\n\",\n    \"print(linear_svm.intercept_.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.scatter(X[:, 0], X[:, 1], c=y)\\n\",\n    \"line = np.linspace(-15, 15)\\n\",\n    \"for coef, intercept in zip(linear_svm.coef_, linear_svm.intercept_):\\n\",\n    \"    plt.plot(line, -(line * coef[0] + intercept) / coef[1])\\n\",\n    \"plt.ylim(-10, 15)\\n\",\n    \"plt.xlim(-10, 8);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Points are classified in a one-vs-rest fashion (aka one-vs-all), where we assign a test point to 
the class whose model has the highest confidence (in the SVM case, highest distance to the separating hyperplane) for the test point.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercises\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Use LogisticRegression to classify the digits data set, and grid-search the C parameter.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_digits\\n\",\n    \"from sklearn.linear_model import LogisticRegression\\n\",\n    \"\\n\",\n    \"digits = load_digits()\\n\",\n    \"X_digits, y_digits = digits.data, digits.target\\n\",\n    \"\\n\",\n    \"# split the dataset, apply grid-search\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/17A_logreg_grid.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"How do you think the learning curves above change when you increase or decrease alpha?\\n\",\n    \"Try changing the alpha parameter in ridge and lasso, and see if your intuition was correct.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/17B_learning_curve_alpha.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/18 In Depth - Support Vector Machines.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# In Depth -  Support Vector Machines\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"SVM stands for \\\"support vector machines\\\". They are efficient and easy to use estimators.\\n\",\n    \"They come in two kinds: SVCs, Support Vector Classifiers, for classification problems, and SVRs, Support Vector Regressors, for regression problems.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Linear SVMs\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The SVM module contains LinearSVC, which we already discussed briefly in the section on linear models.\\n\",\n    \"Using ``SVC(kernel=\\\"linear\\\")`` will also yield a linear predictor that is only different in minor technical aspects.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Kernel SVMs\\n\",\n    \"The real power of SVMs lies in using kernels, which allow for non-linear decision boundaries. A kernel defines a similarity measure between data points. The most common are:\\n\",\n    \"\\n\",\n    \"- **linear** will give linear decision frontiers. It is the most computationally efficient approach and the one that requires the least amount of data.\\n\",\n    \"\\n\",\n    \"- **poly** will give decision frontiers that are polynomial. The order of this polynomial is given by the 'order' argument.\\n\",\n    \"\\n\",\n    \"- **rbf** uses 'radial basis functions' centered at each support vector to assemble a decision frontier. The size of the RBFs ultimately controls the smoothness of the decision frontier. RBFs are the most flexible approach, but also the one that will require the largest amount of data.\\n\",\n    \"\\n\",\n    \"Predictions in a kernel-SVM are made using the formular\\n\",\n    \"\\n\",\n    \"$$\\n\",\n    \"\\\\hat{y} = \\\\text{sign}(\\\\alpha_0 + \\\\sum_{j}\\\\alpha_j y_j k(\\\\mathbf{x^{(j)}}, \\\\mathbf{x}))\\n\",\n    \"$$\\n\",\n    \"\\n\",\n    \"where $\\\\mathbf{x}^{(j)}$ are training samples, $\\\\mathbf{y}^{(j)}$ the corresponding labels, $\\\\mathbf{x}$ is a test-sample to predict on, $k$ is the kernel, and $\\\\alpha$ are learned parameters.\\n\",\n    \"\\n\",\n    \"What this says is \\\"if $\\\\mathbf{x}$ is similar to $\\\\mathbf{x}^{(j)}$ then they probably have the same label\\\", where the importance of each $\\\\mathbf{x}^{(j)}$ for this decision is learned. 
[Or something much less intuitive about an infinite dimensional Hilbert-space]\\n\",\n    \"\\n\",\n    \"Often only few samples have non-zero $\\\\alpha$, these are called the \\\"support vectors\\\" from which SVMs get their name.\\n\",\n    \"These are the most discriminant samples.\\n\",\n    \"\\n\",\n    \"The most important parameter of the SVM is the regularization parameter $C$, which bounds the influence of each individual sample:\\n\",\n    \"\\n\",\n    \"- Low C values: many support vectors... Decision frontier = mean(class A) - mean(class B)\\n\",\n    \"- High C values: small number of support vectors: Decision frontier fully driven by most discriminant samples\\n\",\n    \"\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The other important parameters are those of the kernel. Let's look at the RBF kernel in more detail:\\n\",\n    \"\\n\",\n    \"$$k(\\\\mathbf{x}, \\\\mathbf{x'}) = \\\\exp(-\\\\gamma ||\\\\mathbf{x} - \\\\mathbf{x'}||^2)$$\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.metrics.pairwise import rbf_kernel\\n\",\n    \"\\n\",\n    \"line = np.linspace(-3, 3, 100)[:, np.newaxis]\\n\",\n    \"kernel_value = rbf_kernel(line, [[0]], gamma=1)\\n\",\n    \"plt.plot(line, kernel_value);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The rbf kernel has an inverse bandwidth-parameter gamma, where large gamma mean a very localized influence for each data point, and\\n\",\n    \"small values mean a very global influence.\\n\",\n    \"Let's see these two parameters in action:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_svm_interactive\\n\",\n    \"plot_svm_interactive()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Exercise: tune a SVM on the digits dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_digits\\n\",\n    \"from sklearn.svm import SVC\\n\",\n    \"\\n\",\n    \"digits = load_digits()\\n\",\n    \"X_digits, y_digits = digits.data, digits.target\\n\",\n    \"\\n\",\n    \"# split the dataset, apply grid-search\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/18_svc_grid.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/19 In Depth - Trees and Forests.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# In Depth - Decision Trees and Forests\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import numpy as np\\n\",\n    \"import matplotlib.pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here we'll explore a class of algorithms based on decision trees.\\n\",\n    \"Decision trees at their root are extremely intuitive.  They\\n\",\n    \"encode a series of \\\"if\\\" and \\\"else\\\" choices, similar to how a person might make a decision.\\n\",\n    \"However, which questions to ask, and how to proceed for each answer is entirely learned from the data.\\n\",\n    \"\\n\",\n    \"For example, if you wanted to create a guide to identifying an animal found in nature, you\\n\",\n    \"might ask the following series of questions:\\n\",\n    \"\\n\",\n    \"- Is the animal bigger or smaller than a meter long?\\n\",\n    \"    + *bigger*: does the animal have horns?\\n\",\n    \"        - *yes*: are the horns longer than ten centimeters?\\n\",\n    \"        - *no*: is the animal wearing a collar\\n\",\n    \"    + *smaller*: does the animal have two or four legs?\\n\",\n    \"        - *two*: does the animal have wings?\\n\",\n    \"        - *four*: does the animal have a bushy tail?\\n\",\n    \"\\n\",\n    \"and so on.  This binary splitting of questions is the essence of a decision tree.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"One of the main benefit of tree-based models is that they require little preprocessing of the data.\\n\",\n    \"They can work with variables of different types (continuous and discrete) and are invariant to scaling of the features.\\n\",\n    \"\\n\",\n    \"Another benefit is that tree-based models are what is called \\\"nonparametric\\\", which means they don't have a fix set of parameters to learn. Instead, a tree model can become more and more flexible, if given more data.\\n\",\n    \"In other words, the number of free parameters grows with the number of samples and is not fixed, as for example in linear models.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Decision Tree Regression\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"A decision tree is a simple binary classification tree that is\\n\",\n    \"similar to nearest neighbor classification.  
It can be used as follows:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import make_dataset\\n\",\n    \"x, y = make_dataset()\\n\",\n    \"X = x.reshape(-1, 1)\\n\",\n    \"\\n\",\n    \"plt.xlabel('Feature X')\\n\",\n    \"plt.ylabel('Target y')\\n\",\n    \"plt.scatter(X, y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.tree import DecisionTreeRegressor\\n\",\n    \"\\n\",\n    \"reg = DecisionTreeRegressor(max_depth=5)\\n\",\n    \"reg.fit(X, y)\\n\",\n    \"\\n\",\n    \"X_fit = np.linspace(-3, 3, 1000).reshape((-1, 1))\\n\",\n    \"y_fit_1 = reg.predict(X_fit)\\n\",\n    \"\\n\",\n    \"plt.plot(X_fit.ravel(), y_fit_1, color='blue', label=\\\"prediction\\\")\\n\",\n    \"plt.plot(X.ravel(), y, '.k', label=\\\"training data\\\")\\n\",\n    \"plt.legend(loc=\\\"best\\\");\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"A single decision tree allows us to estimate the signal in a non-parametric way,\\n\",\n    \"but clearly has some issues.  In some regions, the model shows high bias and\\n\",\n    \"under-fits the data\\n\",\n    \"(seen in the long flat lines which don't follow the contours of the data),\\n\",\n    \"while in other regions the model shows high variance and over-fits the data\\n\",\n    \"(reflected in the narrow spikes which are influenced by noise in single points).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Decision Tree Classification\\n\",\n    \"==================\\n\",\n    \"Decision tree classification works very similarly, by assigning all points within a leaf the majority class in that leaf:\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_blobs\\n\",\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"from sklearn.tree import DecisionTreeClassifier\\n\",\n    \"from figures import plot_2d_separator\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"X, y = make_blobs(centers=[[0, 0], [1, 1]], random_state=61526, n_samples=100)\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\\n\",\n    \"\\n\",\n    \"clf = DecisionTreeClassifier(max_depth=5)\\n\",\n    \"clf.fit(X_train, y_train)\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"plot_2d_separator(clf, X, fill=True)\\n\",\n    \"plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=60, alpha=.7)\\n\",\n    \"plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=60);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"There are many parameters that control the complexity of a tree, but the one that might be easiest to understand is the maximum depth. This limits how finely the tree can partition the input space, or how many \\\"if-else\\\" questions can be asked before deciding which class a sample lies in.\\n\",\n    \"\\n\",\n    \"This parameter is important to tune for trees and tree-based models. The interactive plot below shows what underfitting and overfitting look like for this model. 
Having a ``max_depth`` of 1 is clearly an underfit model, while a depth of 7 or 8 clearly overfits. The maximum depth a tree can be grown at for this dataset is 8, at which point each leaf only contains samples from a single class. This is known as all leaves being \\\"pure.\\\"\\n\",\n    \"\\n\",\n    \"In the interactive plot below, the regions are assigned blue and red colors to indicate the predicted class for that region. The shade of the color indicates the predicted probability for that class (darker = higher probability), while yellow regions indicate an equal predicted probability for either class.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_tree_interactive\\n\",\n    \"plot_tree_interactive()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Decision trees are fast to train, easy to understand, and often lead to interpretable models. However, single trees often tend to overfit the training data. Playing with the slider above, you might notice that the model starts to overfit even before it has a good separation between the classes.\\n\",\n    \"\\n\",\n    \"Therefore, in practice it is more common to combine multiple trees to produce models that generalize better. The most common methods for combining trees are random forests and gradient boosted trees.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Random Forests\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Random forests are simply many trees, built on different random subsets (drawn with replacement) of the data, and using different random subsets (drawn without replacement) of the features for each split.\\n\",\n    \"This makes the trees different from each other, and makes them overfit to different aspects. 
Then, their predictions are averaged, leading to a smoother estimate that overfits less.\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from figures import plot_forest_interactive\\n\",\n    \"plot_forest_interactive()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Selecting the Optimal Estimator via Cross-Validation\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.model_selection import GridSearchCV\\n\",\n    \"from sklearn.datasets import load_digits\\n\",\n    \"from sklearn.ensemble import RandomForestClassifier\\n\",\n    \"\\n\",\n    \"digits = load_digits()\\n\",\n    \"X, y = digits.data, digits.target\\n\",\n    \"\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\\n\",\n    \"\\n\",\n    \"rf = RandomForestClassifier(n_estimators=200)\\n\",\n    \"parameters = {'max_features':['sqrt', 'log2', 10],\\n\",\n    \"              'max_depth':[5, 7, 9]}\\n\",\n    \"\\n\",\n    \"clf_grid = GridSearchCV(rf, parameters, n_jobs=-1)\\n\",\n    \"clf_grid.fit(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"clf_grid.score(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"clf_grid.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Another option: Gradient Boosting\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Another Ensemble method that can be useful is *Boosting*: here, rather than\\n\",\n    \"looking at 200 (say) parallel estimators, We construct a chain of 200 estimators\\n\",\n    \"which iteratively refine the results of the previous estimator.\\n\",\n    \"The idea is that by sequentially applying very fast, simple models, we can get a\\n\",\n    \"total model error which is better than any of the individual pieces.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.ensemble import GradientBoostingRegressor\\n\",\n    \"clf = GradientBoostingRegressor(n_estimators=100, max_depth=5, learning_rate=.2)\\n\",\n    \"clf.fit(X_train, y_train)\\n\",\n    \"\\n\",\n    \"print(clf.score(X_train, y_train))\\n\",\n    \"print(clf.score(X_test, y_test))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Exercise: Cross-validating Gradient Boosting\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Use a grid search to optimize the `learning_rate` and `max_depth` for a Gradient Boosted\\n\",\n    \"Decision tree on the digits data set.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_digits\\n\",\n    \"from sklearn.ensemble 
import GradientBoostingClassifier\\n\",\n    \"\\n\",\n    \"digits = load_digits()\\n\",\n    \"X_digits, y_digits = digits.data, digits.target\\n\",\n    \"\\n\",\n    \"# split the dataset, apply grid-search\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/19_gbc_grid.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/20 Feature Selection.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"hide_input\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\\n\",\n    \"%matplotlib inline\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Automatic Feature Selection\\n\",\n    \"Often we collected many features that might be related to a supervised prediction task, but we don't know which of them are actually predictive. To improve interpretability, and sometimes also generalization performance, we can use automatic feature selection to select a subset of the original features. There are several types of feature selection methods available, which we'll explain in order of increasing complexity.\\n\",\n    \"\\n\",\n    \"For a given supervised model, the best feature selection strategy would be to try out each possible subset of the features, and evaluate generalization performance using this subset. However, there are exponentially many subsets of features, so this exhaustive search is generally infeasible. The strategies discussed below can be thought of as proxies for this infeasible computation.\\n\",\n    \"\\n\",\n    \"### Univariate statistics\\n\",\n    \"The simplest method to select features is using univariate statistics, that is by looking at each feature individually and running a statistical test to see whether it is related to the target. This kind of test is also known as analysis of variance (ANOVA).\\n\",\n    \"\\n\",\n    \"We create a synthetic dataset that consists of the breast cancer data with an additional 50 completely random features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_breast_cancer, load_digits\\n\",\n    \"from sklearn.model_selection import train_test_split\\n\",\n    \"\\n\",\n    \"cancer = load_breast_cancer()\\n\",\n    \"\\n\",\n    \"# get deterministic random numbers\\n\",\n    \"rng = np.random.RandomState(42)\\n\",\n    \"noise = rng.normal(size=(len(cancer.data), 50))\\n\",\n    \"# add noise features to the data\\n\",\n    \"# the first 30 features are from the dataset, the next 50 are noise\\n\",\n    \"X_w_noise = np.hstack([cancer.data, noise])\\n\",\n    \"\\n\",\n    \"X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target,\\n\",\n    \"                                                    random_state=0, test_size=.5)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We have to define a threshold on the p-value of the statistical test to decide how many features to keep. 
There are several strategies implemented in scikit-learn, a straight-forward one being ``SelectPercentile``, which selects a percentile of the original features (we select 50% below):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_selection import SelectPercentile\\n\",\n    \"\\n\",\n    \"# use f_classif (the default) and SelectPercentile to select 50% of features:\\n\",\n    \"select = SelectPercentile(percentile=50)\\n\",\n    \"select.fit(X_train, y_train)\\n\",\n    \"# transform training set:\\n\",\n    \"X_train_selected = select.transform(X_train)\\n\",\n    \"\\n\",\n    \"print(X_train.shape)\\n\",\n    \"print(X_train_selected.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can also use the test statistic directly to see how relevant each feature is. As the breast cancer dataset is a classification task, we use f_classif, the F-test for classification. Below we plot the p-values associated with each of the 80 features (30 original features + 50 noise features). Low p-values indicate informative features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_selection import f_classif, f_regression, chi2\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"F, p = f_classif(X_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.figure()\\n\",\n    \"plt.plot(p, 'o')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Clearly most of the first 30 features have very small p-values.\\n\",\n    \"\\n\",\n    \"Going back to the SelectPercentile transformer, we can obtain the features that are selected using the ``get_support`` method:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"mask = select.get_support()\\n\",\n    \"print(mask)\\n\",\n    \"# visualize the mask. 
black is True, white is False\\n\",\n    \"plt.matshow(mask.reshape(1, -1), cmap='gray_r')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Nearly all of the original 30 features were recovered.\\n\",\n    \"We can also analyze the utility of the feature selection by training a supervised model on the data.\\n\",\n    \"It's important to learn the feature selection only on the training set!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import LogisticRegression\\n\",\n    \"\\n\",\n    \"# transform test data:\\n\",\n    \"X_test_selected = select.transform(X_test)\\n\",\n    \"\\n\",\n    \"lr = LogisticRegression()\\n\",\n    \"lr.fit(X_train, y_train)\\n\",\n    \"print(\\\"Score with all features: %f\\\" % lr.score(X_test, y_test))\\n\",\n    \"lr.fit(X_train_selected, y_train)\\n\",\n    \"print(\\\"Score with only selected features: %f\\\" % lr.score(X_test_selected, y_test))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Model-based Feature Selection\\n\",\n    \"A somewhat more sophisticated method for feature selection is using a supervised machine learning model and selecting features based on how important they were deemed by the model. This requires the model to provide some way to rank the features by importance. This can be done for all tree-based models (which provide the ``feature_importances_`` attribute) and all linear models, for which the coefficients can be used to determine how much influence a feature has on the outcome.\\n\",\n    \"\\n\",\n    \"Any of these models can be made into a transformer that does feature selection by wrapping it with the ``SelectFromModel`` class:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_selection import SelectFromModel\\n\",\n    \"from sklearn.ensemble import RandomForestClassifier\\n\",\n    \"select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), threshold=\\\"median\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"select.fit(X_train, y_train)\\n\",\n    \"X_train_rf = select.transform(X_train)\\n\",\n    \"print(X_train.shape)\\n\",\n    \"print(X_train_rf.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"mask = select.get_support()\\n\",\n    \"# visualize the mask. 
black is True, white is False\\n\",\n    \"plt.matshow(mask.reshape(1, -1), cmap='gray_r')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_test_rf = select.transform(X_test)\\n\",\n    \"LogisticRegression().fit(X_train_rf, y_train).score(X_test_rf, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This method builds a single model (in this case a random forest) and uses the feature importances from this model.\\n\",\n    \"We can do a somewhat more elaborate search by training multiple models on subsets of the data. One particular strategy is recursive feature elimination:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Recursive Feature Elimination\\n\",\n    \"Recursive feature elimination builds a model on the full set of features, and similar to the method above selects a subset of features that are deemed most important by the model. However, usually only a single feature is dropped from the dataset, and a new model is built with the remaining features. The process of dropping features and model building is repeated until there are only a pre-specified number of features left:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_selection import RFE\\n\",\n    \"select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=40)\\n\",\n    \"\\n\",\n    \"select.fit(X_train, y_train)\\n\",\n    \"# visualize the selected features:\\n\",\n    \"mask = select.get_support()\\n\",\n    \"plt.matshow(mask.reshape(1, -1), cmap='gray_r')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X_train_rfe = select.transform(X_train)\\n\",\n    \"X_test_rfe = select.transform(X_test)\\n\",\n    \"\\n\",\n    \"LogisticRegression().fit(X_train_rfe, y_train).score(X_test_rfe, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"select.score(X_test, y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Create the \\\"XOR\\\" dataset which is created like this:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"rng = np.random.RandomState(1)\\n\",\n    \"\\n\",\n    \"# Generate 400 random integers in the range [0, 1]\\n\",\n    \"X = rng.randint(0, 2, (200, 2))\\n\",\n    \"y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Add random features to it and compare how univariate selection compares to model based selection using a Random Forest in recovering the original features.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   
\"outputs\": [],\n   \"source\": [\n    \"#%load solutions/20_univariate_vs_mb_selection.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
  },
  {
    "path": "notebooks/21 Unsupervised learning - Hierarchical and density-based clustering algorithms.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"0dfe514f-a5ac-4f6e-8f15-64d89406b0ad\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Sebastian Raschka' -v -p numpy,scipy,matplotlib\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"2de13356-e9ae-466c-89d1-50618945c658\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import numpy as np\\n\",\n    \"from matplotlib import pyplot as plt\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"c6e7fe3e-13df-4169-89bc-0dad2fc6e579\"\n    }\n   },\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"4a9d75ee-def8-451e-836f-707a63d8ea90\"\n    }\n   },\n   \"source\": [\n    \"# Unsupervised learning: Hierarchical and density-based clustering algorithms\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"2e676319-4de0-4ee0-84ec-f525353b5195\"\n    }\n   },\n   \"source\": [\n    \"In a previous notebook, \\\"08 Unsupervised Learning - Clustering.ipynb\\\", we introduced one of the essential and widely used clustering algorithms, K-means. One of the advantages of K-means is that it is extremely easy to implement, and it is also computationally very efficient compared to other clustering algorithms. However, we've seen that one of the weaknesses of K-Means is that it only works well if the data can be grouped into a globular or spherical shape. Also, we have to assign the number of clusters, *k*, *a priori* -- this can be a problem if we have no prior knowledge about how many clusters we expect to find. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"7f44eab5-590f-4228-acdb-4fd1d187a441\"\n    }\n   },\n   \"source\": [\n    \"In this notebook, we will take a look at 2 alternative approaches to clustering, hierarchical clustering and density-based clustering. \"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"a9b317b4-49cb-47e0-8f69-5f6ad2491370\"\n    }\n   },\n   \"source\": [\n    \"# Hierarchical Clustering\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"d70d19aa-a949-4942-89c0-8c4911bbc733\"\n    }\n   },\n   \"source\": [\n    \"One nice feature of hierachical clustering is that we can visualize the results as a dendrogram, a hierachical tree. Using the visualization, we can then decide how \\\"deep\\\" we want to cluster the dataset by setting a \\\"depth\\\" threshold. Or in other words, we don't need to make a decision about the number of clusters upfront.\\n\",\n    \"\\n\",\n    \"**Agglomerative and divisive hierarchical clustering**\\n\",\n    \"\\n\",\n    \"Furthermore, we can distinguish between 2 main approaches to hierarchical clustering: Divisive clustering and agglomerative clustering. 
In agglomerative clustering, we start with a single sample from our dataset and iteratively merge it with other samples to form clusters -- we can see it as a bottom-up approach for building the clustering dendrogram.  \\n\",\n    \"In divisive clustering, however, we start with the whole dataset as one cluster, and we iteratively split it into smaller subclusters -- a top-down approach.  \\n\",\n    \"\\n\",\n    \"In this notebook, we will use **agglomerative** clustering.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"d448e9d1-f80d-4bf4-a322-9af800ce359c\"\n    }\n   },\n   \"source\": [\n    \"**Single and complete linkage**\\n\",\n    \"\\n\",\n    \"Now, the next question is how we measure the similarity between samples. One approach is the familiar Euclidean distance metric that we already used via the K-Means algorithm. As a refresher, the distance between 2 m-dimensional vectors $\\\\mathbf{p}$ and $\\\\mathbf{q}$ can be computed as:\\n\",\n    \"\\n\",\n    \"\\\\begin{align} \\\\mathrm{d}(\\\\mathbf{q},\\\\mathbf{p}) & = \\\\sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \\\\cdots + (q_m-p_m)^2} \\\\\\\\[8pt]\\n\",\n    \"& = \\\\sqrt{\\\\sum_{j=1}^m (q_j-p_j)^2}.\\\\end{align}\\t\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"045c17ed-c253-4b84-813b-0f3f2c4eee3a\"\n    }\n   },\n   \"source\": [\n    \"However, that's the distance between 2 samples. Now, how do we compute the similarity between subclusters of samples in order to decide which clusters to merge when constructing the dendrogram? I.e., our goal is to iteratively merge the most similar pairs of clusters until only one big cluster remains. There are many different approaches to this, for example single and complete linkage. 
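To make the two linkage criteria concrete, here is a hedged sketch (not part of the original notebook) that runs SciPy's `linkage` with single and complete linkage on a tiny toy dataset and compares the resulting flat clusterings; the toy points are made up for illustration only.

```python
# A minimal sketch comparing single vs. complete linkage on made-up 2D points.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9], [5.0, 5.0]])

for method in ("single", "complete"):
    Z = linkage(X_toy, metric="euclidean", method=method)
    # cut the tree into 2 flat clusters and inspect the assignment
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, "->", labels)
```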
\\n\",\n    \"\\n\",\n    \"In single linkage, we take the pair of the most similar samples (based on the Euclidean distance, for example) in each cluster, and merge the two clusters which have the most similar 2 members into one new, bigger cluster.\\n\",\n    \"\\n\",\n    \"In complete linkage, we compare the pairs of the two most dissimilar members of each cluster with each other, and we merge the 2 clusters where the distance between its 2 most dissimilar members is smallest.\\n\",\n    \"\\n\",\n    \"![](figures/clustering-linkage.png)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"b6cc173c-044c-4a59-8a51-ec81eb2a1098\"\n    }\n   },\n   \"source\": [\n    \"To see the agglomerative, hierarchical clustering approach in action, let us load the familiar Iris dataset -- pretending we don't know the true class labels and want to find out how many different follow species it consists of:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"b552a94c-9dc1-4c76-9d9b-90a47cd7811a\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn import datasets\\n\",\n    \"\\n\",\n    \"iris = datasets.load_iris()\\n\",\n    \"X = iris.data[:, [2, 3]]\\n\",\n    \"y = iris.target\\n\",\n    \"n_samples, n_features = X.shape\\n\",\n    \"\\n\",\n    \"plt.scatter(X[:, 0], X[:, 1], c=y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"473764d4-3610-43e8-94a0-d62731dd5a1c\"\n    }\n   },\n   \"source\": [\n    \"First, we start with some exploratory clustering, visualizing the clustering dendrogram using SciPy's `linkage` and `dendrogram` functions:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"d7f4a0e0-5b4f-4e08-9c77-fd1b1d13c877\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from scipy.cluster.hierarchy import linkage\\n\",\n    \"from scipy.cluster.hierarchy import dendrogram\\n\",\n    \"\\n\",\n    \"clusters = linkage(X, \\n\",\n    \"                   metric='euclidean',\\n\",\n    \"                   method='complete')\\n\",\n    \"\\n\",\n    \"dendr = dendrogram(clusters)\\n\",\n    \"\\n\",\n    \"plt.ylabel('Euclidean Distance');\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"68cb3270-9d4b-450f-9372-58989fe93a3d\"\n    }\n   },\n   \"source\": [\n    \"Next, let's use the `AgglomerativeClustering` estimator from scikit-learn and divide the dataset into 3 clusters. 
Can you guess which 3 clusters from the dendrogram it will reproduce?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"4746ea9e-3206-4e5a-bf06-8e2cd49c48d1\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.cluster import AgglomerativeClustering\\n\",\n    \"\\n\",\n    \"ac = AgglomerativeClustering(n_clusters=3,\\n\",\n    \"                             affinity='euclidean',\\n\",\n    \"                             linkage='complete')\\n\",\n    \"\\n\",\n    \"prediction = ac.fit_predict(X)\\n\",\n    \"print('Cluster labels: %s\\\\n' % prediction)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"a4e419ac-a735-442e-96bd-b90e60691f97\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.scatter(X[:, 0], X[:, 1], c=prediction);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"63c6aeb6-3b8f-40f4-b1a8-b5e2526beaa5\"\n    }\n   },\n   \"source\": [\n    \"# Density-based Clustering - DBSCAN\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"688a6a37-3a28-40c8-81ba-f5c92f6d7aa8\"\n    }\n   },\n   \"source\": [\n    \"Another useful approach to clustering is *Density-based Spatial Clustering of Applications with Noise* (DBSCAN). In essence, we can think of DBSCAN as an algorithm that divides the dataset into subgroup based on dense regions of point.\\n\",\n    \"\\n\",\n    \"In DBSCAN, we distinguish between 3 different \\\"points\\\":\\n\",\n    \"\\n\",\n    \"- Core points: A core point is a point that has at least a minimum number of other points (MinPts) in its radius epsilon.\\n\",\n    \"- Border points: A border point is a point that is not a core point, since it doesn't have enough MinPts in its neighborhood, but lies within the radius epsilon of a core point.\\n\",\n    \"- Noise points: All other points that are neither core points nor border points.\\n\",\n    \"\\n\",\n    \"![](figures/dbscan.png)\\n\",\n    \"\\n\",\n    \"A nice feature about DBSCAN is that we don't have to specify a number of clusters upfront. 
However, it requires the setting of additional hyperparameters such as the value for MinPts and the radius epsilon.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"98acb13b-bbf6-412e-a7eb-cc096c34dca1\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_moons\\n\",\n    \"X, y = make_moons(n_samples=400,\\n\",\n    \"                  noise=0.1,\\n\",\n    \"                  random_state=1)\\n\",\n    \"plt.scatter(X[:,0], X[:,1])\\n\",\n    \"plt.show()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"86c183f7-0889-443c-b989-219a2c9a1aad\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.cluster import DBSCAN\\n\",\n    \"\\n\",\n    \"db = DBSCAN(eps=0.2,\\n\",\n    \"            min_samples=10,\\n\",\n    \"            metric='euclidean')\\n\",\n    \"prediction = db.fit_predict(X)\\n\",\n    \"\\n\",\n    \"print(\\\"Predicted labels:\\\\n\\\", prediction)\\n\",\n    \"\\n\",\n    \"plt.scatter(X[:, 0], X[:, 1], c=prediction);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"84c2fb5c-a984-4a8e-baff-0eee2cbf0184\"\n    }\n   },\n   \"source\": [\n    \"# Exercise\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {\n    \"nbpresent\": {\n     \"id\": \"6881939d-0bfe-4768-9342-1fc68a0b8dbc\"\n    }\n   },\n   \"source\": [\n    \"Using the following toy dataset, two concentric circles, experiment with the three different clustering algorithms that we used so far: `KMeans`, `AgglomerativeClustering`, and `DBSCAN`.\\n\",\n    \"\\n\",\n    \"Which clustering algorithms reproduces or discovers the hidden structure (pretending we don't know `y`) best?\\n\",\n    \"\\n\",\n    \"Can you explain why this particular algorithm is a good choice while the other 2 \\\"fail\\\"?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false,\n    \"nbpresent\": {\n     \"id\": \"4ad922fc-9e38-4d1d-b0ed-b0654c1c483a\"\n    }\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_circles\\n\",\n    \"\\n\",\n    \"X, y = make_circles(n_samples=1500, \\n\",\n    \"                    factor=.4, \\n\",\n    \"                    noise=.05)\\n\",\n    \"\\n\",\n    \"plt.scatter(X[:, 0], X[:, 1], c=y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/21_clustering_comparison.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
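As a hedged follow-up sketch (not part of the original notebook), the DBSCAN output can be inspected directly: noise points receive the label `-1`, so the number of clusters is the number of distinct non-negative labels. The moons dataset and hyperparameters mirror the cell above.

```python
# A minimal sketch: counting clusters and noise points in a DBSCAN result.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.1, random_state=1)
labels = DBSCAN(eps=0.2, min_samples=10, metric='euclidean').fit_predict(X)

# noise points are labeled -1 and do not count as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)
print("clusters found:", n_clusters, "| noise points:", n_noise)
```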
  },
  {
    "path": "notebooks/22 Unsupervised learning - Non-linear dimensionality reduction.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%matplotlib inline\\n\",\n    \"import matplotlib.pyplot as plt\\n\",\n    \"import numpy as np\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Manifold Learning\\n\",\n    \"\\n\",\n    \"One weakness of PCA is that it cannot detect non-linear features.  A set\\n\",\n    \"of algorithms known as *Manifold Learning* have been developed to address\\n\",\n    \"this deficiency.  A canonical dataset used in Manifold learning is the\\n\",\n    \"*S-curve*:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import make_s_curve\\n\",\n    \"X, y = make_s_curve(n_samples=1000)\\n\",\n    \"\\n\",\n    \"from mpl_toolkits.mplot3d import Axes3D\\n\",\n    \"ax = plt.axes(projection='3d')\\n\",\n    \"\\n\",\n    \"ax.scatter3D(X[:, 0], X[:, 1], X[:, 2], c=y)\\n\",\n    \"ax.view_init(10, -60);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This is a 2-dimensional dataset embedded in three dimensions, but it is embedded\\n\",\n    \"in such a way that PCA cannot discover the underlying data orientation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.decomposition import PCA\\n\",\n    \"X_pca = PCA(n_components=2).fit_transform(X)\\n\",\n    \"plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Manifold learning algorithms, however, available in the ``sklearn.manifold``\\n\",\n    \"submodule, are able to recover the underlying 2-dimensional manifold:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.manifold import Isomap\\n\",\n    \"\\n\",\n    \"iso = Isomap(n_neighbors=15, n_components=2)\\n\",\n    \"X_iso = iso.fit_transform(X)\\n\",\n    \"plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y);\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Manifold learning on the digits data\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can apply manifold learning techniques to much higher dimensional datasets, for example the digits data that we saw before:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_digits\\n\",\n    \"digits = load_digits()\\n\",\n    \"\\n\",\n    \"fig, axes = 
plt.subplots(2, 5, figsize=(10, 5),\\n\",\n    \"                         subplot_kw={'xticks':(), 'yticks': ()})\\n\",\n    \"for ax, img in zip(axes.ravel(), digits.images):\\n\",\n    \"    ax.imshow(img, interpolation=\\\"none\\\", cmap=\\\"gray\\\")\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can visualize the dataset using a linear technique, such as PCA. We saw this already provides some intuition about the data:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# build a PCA model\\n\",\n    \"pca = PCA(n_components=2)\\n\",\n    \"pca.fit(digits.data)\\n\",\n    \"# transform the digits data onto the first two principal components\\n\",\n    \"digits_pca = pca.transform(digits.data)\\n\",\n    \"colors = [\\\"#476A2A\\\", \\\"#7851B8\\\", \\\"#BD3430\\\", \\\"#4A2D4E\\\", \\\"#875525\\\",\\n\",\n    \"          \\\"#A83683\\\", \\\"#4E655E\\\", \\\"#853541\\\", \\\"#3A3120\\\",\\\"#535D8E\\\"]\\n\",\n    \"plt.figure(figsize=(10, 10))\\n\",\n    \"plt.xlim(digits_pca[:, 0].min(), digits_pca[:, 0].max() + 1)\\n\",\n    \"plt.ylim(digits_pca[:, 1].min(), digits_pca[:, 1].max() + 1)\\n\",\n    \"for i in range(len(digits.data)):\\n\",\n    \"    # actually plot the digits as text instead of using scatter\\n\",\n    \"    plt.text(digits_pca[i, 0], digits_pca[i, 1], str(digits.target[i]),\\n\",\n    \"             color = colors[digits.target[i]],\\n\",\n    \"             fontdict={'weight': 'bold', 'size': 9})\\n\",\n    \"plt.xlabel(\\\"first principal component\\\")\\n\",\n    \"plt.ylabel(\\\"second principal component\\\");\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Using a more powerful, nonlinear techinque can provide much better visualizations, though.\\n\",\n    \"Here, we are using the t-SNE manifold learning method:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.manifold import TSNE\\n\",\n    \"tsne = TSNE(random_state=42)\\n\",\n    \"# use fit_transform instead of fit, as TSNE has no transform method:\\n\",\n    \"digits_tsne = tsne.fit_transform(digits.data)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"plt.figure(figsize=(10, 10))\\n\",\n    \"plt.xlim(digits_tsne[:, 0].min(), digits_tsne[:, 0].max() + 1)\\n\",\n    \"plt.ylim(digits_tsne[:, 1].min(), digits_tsne[:, 1].max() + 1)\\n\",\n    \"for i in range(len(digits.data)):\\n\",\n    \"    # actually plot the digits as text instead of using scatter\\n\",\n    \"    plt.text(digits_tsne[i, 0], digits_tsne[i, 1], str(digits.target[i]),\\n\",\n    \"             color = colors[digits.target[i]],\\n\",\n    \"             fontdict={'weight': 'bold', 'size': 9})\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"t-SNE has a somewhat longer runtime that other manifold learning algorithms, but the result is quite striking. Keep in mind that this algorithm is purely unsupervised, and does not know about the class labels. 
Still it is able to separate the classes very well (though the classes four, one and nine have been split into multiple groups).\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Exercises\\n\",\n    \"\\n\",\n    \"Compare the results of applying isomap to the digits dataset to the results of PCA and t-SNE. Which result do you think looks best?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/22A_isomap_digits.py\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Given how well t-SNE separated the classes, one might be tempted to use this processing for classification. Try training a K-nearest neighbor classifier on digits data transformed with t-SNE, and compare to the accuracy on using the dataset without any transformation.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"#%load solutions/22B_tsne_classification.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
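For the first exercise, here is one possible hedged sketch (this is not the provided solution file) of applying Isomap to the digits data so its embedding can be compared with the PCA and t-SNE plots above; the choice of `n_neighbors=10` is an arbitrary illustrative value.

```python
# A minimal sketch: embed the digits data with Isomap and plot the result,
# colored by the (unused-during-fitting) class labels for visual comparison.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
iso = Isomap(n_neighbors=10, n_components=2)
digits_iso = iso.fit_transform(digits.data)

plt.figure(figsize=(8, 8))
plt.scatter(digits_iso[:, 0], digits_iso[:, 1], c=digits.target, s=10)
plt.xlabel("first Isomap component")
plt.ylabel("second Isomap component");
```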
  },
  {
    "path": "notebooks/23 Out-of-core Learning Large Scale Text Classification.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"%load_ext watermark\\n\",\n    \"%watermark  -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# SciPy 2016 Scikit-learn Tutorial\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Out-of-core Learning - Large Scale Text Classification for Sentiment Analysis\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Scalability Issues\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The `sklearn.feature_extraction.text.CountVectorizer` and `sklearn.feature_extraction.text.TfidfVectorizer` classes suffer from a number of scalability issues that all stem from the internal usage of the `vocabulary_` attribute (a Python dictionary) used to map the unicode string feature names to the integer feature indices.\\n\",\n    \"\\n\",\n    \"The main scalability issues are:\\n\",\n    \"\\n\",\n    \"- **Memory usage of the text vectorizer**: the all the string representations of the features are loaded in memory\\n\",\n    \"- **Parallelization problems for text feature extraction**: the `vocabulary_` would be a shared state: complex synchronization and overhead\\n\",\n    \"- **Impossibility to do online or out-of-core / streaming learning**: the `vocabulary_` needs to be learned from the data: its size cannot be known before making one pass over the full dataset\\n\",\n    \"    \\n\",\n    \"    \\n\",\n    \"To better understand the issue let's have a look at how the `vocabulary_` attribute work. 
At `fit` time the tokens of the corpus are uniquely identified by an integer index, and this mapping is stored in the vocabulary:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_extraction.text import CountVectorizer\\n\",\n    \"\\n\",\n    \"vectorizer = CountVectorizer(min_df=1)\\n\",\n    \"\\n\",\n    \"vectorizer.fit([\\n\",\n    \"    \\\"The cat sat on the mat.\\\",\\n\",\n    \"])\\n\",\n    \"vectorizer.vocabulary_\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The vocabulary is used at `transform` time to build the occurrence matrix:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X = vectorizer.transform([\\n\",\n    \"    \\\"The cat sat on the mat.\\\",\\n\",\n    \"    \\\"This cat is a nice cat.\\\",\\n\",\n    \"]).toarray()\\n\",\n    \"\\n\",\n    \"print(len(vectorizer.vocabulary_))\\n\",\n    \"print(vectorizer.get_feature_names())\\n\",\n    \"print(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Let's refit with a slightly larger corpus:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vectorizer = CountVectorizer(min_df=1)\\n\",\n    \"\\n\",\n    \"vectorizer.fit([\\n\",\n    \"    \\\"The cat sat on the mat.\\\",\\n\",\n    \"    \\\"The quick brown fox jumps over the lazy dog.\\\",\\n\",\n    \"])\\n\",\n    \"vectorizer.vocabulary_\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The `vocabulary_` is growing (logarithmically) with the size of the training corpus. Note that we could not have built the vocabularies in parallel on the 2 text documents as they share some words and would hence require some kind of shared data structure or synchronization barrier, which is complicated to set up, especially if we want to distribute the processing on a cluster.\\n\",\n    \"\\n\",\n    \"With this new vocabulary, the dimensionality of the output space is now larger:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"X = vectorizer.transform([\\n\",\n    \"    \\\"The cat sat on the mat.\\\",\\n\",\n    \"    \\\"This cat is a nice cat.\\\",\\n\",\n    \"]).toarray()\\n\",\n    \"\\n\",\n    \"print(len(vectorizer.vocabulary_))\\n\",\n    \"print(vectorizer.get_feature_names())\\n\",\n    \"print(X)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## The IMDb movie dataset\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To illustrate the scalability issues of the vocabulary-based vectorizers, let's load a more realistic dataset for a classical text classification task: sentiment analysis on text documents. 
The goal is to tell apart negative from positive movie reviews from the [Internet Movie Database](http://www.imdb.com) (IMDb).\\n\",\n    \"\\n\",\n    \"In the following sections, we will work with a [large subset](http://ai.stanford.edu/~amaas/data/sentiment/) of movie reviews from the IMDb that has been collected by Maas et al. \\n\",\n    \"\\n\",\n    \"- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. In the proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. \\n\",\n    \"\\n\",\n    \"This dataset contains 50,000 movie reviews, which were split into 25,000 training samples and 25,000 test samples. The reviews are labeled as either negative (neg) or positive (pos). Here, *positive* means that a movie received >6 stars on IMDb; *negative* means that a movie received <5 stars.\\n\",\n    \"\\n\",\n    \"\\n\",\n    \"Assuming that the `../fetch_data.py` script was run successfully, the following files should be available:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import os\\n\",\n    \"\\n\",\n    \"train_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'train')\\n\",\n    \"test_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'test')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's load them into our active session via scikit-learn's `load_files` function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.datasets import load_files\\n\",\n    \"\\n\",\n    \"train = load_files(container_path=(train_path),\\n\",\n    \"                   categories=['pos', 'neg'])\\n\",\n    \"\\n\",\n    \"test = load_files(container_path=(test_path),\\n\",\n    \"                  categories=['pos', 'neg'])\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"**Note**  \\n\",\n    \"Since the movie dataset consists of 50,000 individual text files, executing the code snippet above may take ~20 sec or longer.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The `load_files` function loaded the datasets into `sklearn.datasets.base.Bunch` objects, which are Python dictionaries:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"train.keys()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In particular, we are only interested in the `data` and `target` arrays.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import numpy as np\\n\",\n    \"\\n\",\n    \"for label, data in zip(('TRAINING', 'TEST'), (train, test)):\\n\",\n    \"    print('\\\\n\\\\n%s' % label)\\n\",\n    \"    print('Number of documents:', len(data['data']))\\n\",\n    \"    print('\\\\n1st document:\\\\n', data['data'][0])\\n\",\n    \"    print('\\\\n1st label:', 
data['target'][0])\\n\",\n    \"    print('\\\\nClass names:', data['target_names'])\\n\",\n    \"    print('Class count:', \\n\",\n    \"          np.unique(data['target']), ' -> ',\\n\",\n    \"          np.bincount(data['target']))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can see above the `'target'` array consists of integers `0` and `1`, where `0` stands for negative and `1` stands for positive.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## The Hashing Trick\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Remember the bag of word representation using a vocabulary based vectorizer:\\n\",\n    \"\\n\",\n    \"<img src=\\\"figures/bag_of_words.svg\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"To workaround the limitations of the vocabulary-based vectorizers, one can use the hashing trick. Instead of building and storing an explicit mapping from the feature names to the feature indices in a Python dict, we can just use a hash function and a modulus operation:\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"<img src=\\\"figures/hashing_vectorizer.svg\\\" width=\\\"100%\\\">\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"More info and reference for the original papers on the Hashing Trick in the [following site](http://www.hunch.net/~jl/projects/hash_reps/index.html) as well as a description specific to language [here](http://blog.someben.com/2013/01/hashing-lang/).\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.utils.murmurhash import murmurhash3_bytes_u32\\n\",\n    \"\\n\",\n    \"# encode for python 3 compatibility\\n\",\n    \"for word in \\\"the cat sat on the mat\\\".encode(\\\"utf-8\\\").split():\\n\",\n    \"    print(\\\"{0} => {1}\\\".format(\\n\",\n    \"        word, murmurhash3_bytes_u32(word, 0) % 2 ** 20))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"This mapping is completely stateless and the dimensionality of the output space is explicitly fixed in advance (here we use a modulo `2 ** 20` which means roughly 1M dimensions). 
This makes it possible to work around the limitations of the vocabulary-based vectorizer both for parallelizability and online / out-of-core learning.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The `HashingVectorizer` class is an alternative to the `CountVectorizer` (or `TfidfVectorizer` class with `use_idf=False`) that internally uses the murmurhash hash function:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.feature_extraction.text import HashingVectorizer\\n\",\n    \"\\n\",\n    \"h_vectorizer = HashingVectorizer(encoding='latin-1')\\n\",\n    \"h_vectorizer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"It shares the same \\\"preprocessor\\\", \\\"tokenizer\\\" and \\\"analyzer\\\" infrastructure:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"analyzer = h_vectorizer.build_analyzer()\\n\",\n    \"analyzer('This is a test sentence.')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"We can vectorize our datasets into a scipy sparse matrix exactly as we would have done with the `CountVectorizer` or `TfidfVectorizer`, except that we can directly call the `transform` method: there is no need to `fit` as `HashingVectorizer` is a stateless transformer:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"docs_train, y_train = train['data'], train['target']\\n\",\n    \"docs_valid, y_valid = test['data'][:12500], test['target'][:12500]\\n\",\n    \"docs_test, y_test = test['data'][12500:], test['target'][12500:]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The dimension of the output is fixed ahead of time to `n_features=2 ** 20` by default (nearly 1M features) to minimize the rate of collision on most classification problems while having reasonably sized linear models (1M weights in the `coef_` attribute):\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"h_vectorizer.transform(docs_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, let's compare the computational efficiency of the `HashingVectorizer` to the `CountVectorizer`:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"h_vec = HashingVectorizer(encoding='latin-1')\\n\",\n    \"%timeit -n 1 -r 3 h_vec.fit(docs_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"count_vec = CountVectorizer(encoding='latin-1')\\n\",\n    \"%timeit -n 1 -r 3 count_vec.fit(docs_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"As we can see, the HashingVectorizer is much faster than the CountVectorizer in this case.\"\n   ]\n  },\n  {\n   
\"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Finally, let us train a LogisticRegression classifier on the IMDb training subset:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import LogisticRegression\\n\",\n    \"from sklearn.pipeline import Pipeline\\n\",\n    \"\\n\",\n    \"h_pipeline = Pipeline((\\n\",\n    \"    ('vec', HashingVectorizer(encoding='latin-1')),\\n\",\n    \"    ('clf', LogisticRegression(random_state=1)),\\n\",\n    \"))\\n\",\n    \"\\n\",\n    \"h_pipeline.fit(docs_train, y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"print('Train accuracy', h_pipeline.score(docs_train, y_train))\\n\",\n    \"print('Validation accuracy', h_pipeline.score(docs_valid, y_valid))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import gc\\n\",\n    \"\\n\",\n    \"del count_vec\\n\",\n    \"del h_pipeline\\n\",\n    \"\\n\",\n    \"gc.collect()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Out-of-Core learning\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Out-of-Core learning is the task of training a machine learning model on a dataset that does not fit into memory or RAM. This requires the following conditions:\\n\",\n    \"    \\n\",\n    \"- a **feature extraction** layer with **fixed output dimensionality**\\n\",\n    \"- knowing the list of all classes in advance (in this case we only have positive and negative tweets)\\n\",\n    \"- a machine learning **algorithm that supports incremental learning** (the `partial_fit` method in scikit-learn).\\n\",\n    \"\\n\",\n    \"In the following sections, we will set up a simple batch-training function to train an `SGDClassifier` iteratively. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"But first, let us load the file names into a Python list:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"train_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'train')\\n\",\n    \"train_pos = os.path.join(train_path, 'pos')\\n\",\n    \"train_neg = os.path.join(train_path, 'neg')\\n\",\n    \"\\n\",\n    \"fnames = [os.path.join(train_pos, f) for f in os.listdir(train_pos)] +\\\\\\n\",\n    \"         [os.path.join(train_neg, f) for f in os.listdir(train_neg)]\\n\",\n    \"\\n\",\n    \"fnames[:3]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Next, let us create the target label array:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"y_train = np.zeros((len(fnames), ), dtype=int)\\n\",\n    \"y_train[:12500] = 1\\n\",\n    \"np.bincount(y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Now, we implement the `batch_train function` as follows:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.base import clone\\n\",\n    \"\\n\",\n    \"def batch_train(clf, fnames, labels, iterations=25, batchsize=1000, random_seed=1):\\n\",\n    \"    vec = HashingVectorizer(encoding='latin-1')\\n\",\n    \"    idx = np.arange(labels.shape[0])\\n\",\n    \"    c_clf = clone(clf)\\n\",\n    \"    rng = np.random.RandomState(seed=random_seed)\\n\",\n    \"    \\n\",\n    \"    for i in range(iterations):\\n\",\n    \"        rnd_idx = rng.choice(idx, size=batchsize)\\n\",\n    \"        documents = []\\n\",\n    \"        for i in rnd_idx:\\n\",\n    \"            with open(fnames[i], 'r') as f:\\n\",\n    \"                documents.append(f.read())\\n\",\n    \"        X_batch = vec.transform(documents)\\n\",\n    \"        batch_labels = labels[rnd_idx]\\n\",\n    \"        c_clf.partial_fit(X=X_batch, \\n\",\n    \"                          y=batch_labels, \\n\",\n    \"                          classes=[0, 1])\\n\",\n    \"      \\n\",\n    \"    return c_clf\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Note that we are not using `LogisticRegression` as in the previous section, but we will use a `SGDClassifier` with a logistic cost function instead. SGD stands for `stochastic gradient descent`, an optimization alrogithm that optimizes the weight coefficients iteratively sample by sample, which allows us to feed the data to the classifier chunk by chuck.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"And we train the `SGDClassifier`; using the default settings of the `batch_train` function, it will train the classifier on 25*1000=25000 documents. 
(Depending on your machine, this may take >2 min)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sklearn.linear_model import SGDClassifier\\n\",\n    \"\\n\",\n    \"sgd = SGDClassifier(loss='log', random_state=1)\\n\",\n    \"\\n\",\n    \"sgd = batch_train(clf=sgd, \\n\",\n    \"                  fnames=fnames, \\n\",\n    \"                  labels=y_train)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Finally, let us evaluate its performance:\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"vec = HashingVectorizer(encoding='latin-1')\\n\",\n    \"sgd.score(vec.transform(docs_test), y_test)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Limitations of the Hashing Vectorizer\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Using the Hashing Vectorizer makes it possible to implement streaming and parallel text classification but can also introduce some issues:\\n\",\n    \"    \\n\",\n    \"- The collisions can introduce too much noise in the data and degrade prediction quality,\\n\",\n    \"- The `HashingVectorizer` does not provide \\\"Inverse Document Frequency\\\" reweighting (lack of a `use_idf=True` option).\\n\",\n    \"- There is no easy way to invert the mapping and find the feature names from the feature index.\\n\",\n    \"\\n\",\n    \"The collision issues can be controlled by increasing the `n_features` parameter.\\n\",\n    \"\\n\",\n    \"The IDF weighting might be reintroduced by appending a `TfidfTransformer` instance on the output of the vectorizer. However, computing the `idf_` statistic used for the feature reweighting will require at least one additional pass over the training set before being able to start training the classifier: this breaks the online learning scheme.\\n\",\n    \"\\n\",\n    \"The lack of inverse mapping (the `get_feature_names()` method of `TfidfVectorizer`) is even harder to work around. That would require extending the `HashingVectorizer` class to add a \\\"trace\\\" mode to record the mapping of the most important features to provide statistical debugging information.\\n\",\n    \"\\n\",\n    \"In the meantime, to debug feature extraction issues, it is recommended to use `TfidfVectorizer(use_idf=False)` on a small-ish subset of the dataset to simulate a `HashingVectorizer()` instance that has the `get_feature_names()` method and no collision issues.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Exercise\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"In our implementation of the batch_train function above, we randomly draw *k* training samples as a batch in each iteration, which can be considered as a random subsampling ***with*** replacement. 
Can you modify the `batch_train` function so that it iterates over the documents ***without*** replacement, i.e., that it uses each document ***exactly once*** per iteration?\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": false\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# %load solutions/27_B-batchtrain.py\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"anaconda-cloud\": {},\n  \"kernelspec\": {\n   \"display_name\": \"Python [default]\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.5.2\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 0\n}\n"
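One possible hedged sketch of the sampling idea behind that exercise (this is not the provided solution file): shuffle the indices once per pass and then walk over them in consecutive slices, so every document is used exactly once per pass. The helper name `batches_without_replacement` is made up for illustration.

```python
# A minimal sketch: yield index batches without replacement by permuting the
# indices once and slicing the permutation into consecutive chunks.
import numpy as np

def batches_without_replacement(n_samples, batchsize, random_seed=1):
    rng = np.random.RandomState(seed=random_seed)
    shuffled = rng.permutation(n_samples)
    for start in range(0, n_samples, batchsize):
        yield shuffled[start:start + batchsize]

# tiny demonstration: 10 samples split into batches of 4
for batch_idx in batches_without_replacement(10, 4):
    print(batch_idx)
```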
  },
  {
    "path": "notebooks/figures/ML_flow_chart.py",
    "content": "\"\"\"\nTutorial Diagrams\n-----------------\n\nThis script plots the flow-charts used in the scikit-learn tutorials.\n\"\"\"\n\nimport matplotlib.pyplot as plt\nfrom matplotlib.patches import Circle, Rectangle, Polygon, FancyArrow\n\n\ndef create_base(box_bg='#CCCCCC',\n                arrow1='#88CCFF',\n                arrow2='#88FF88',\n                supervised=True):\n    plt.figure(figsize=(9, 6), facecolor='w')\n    ax = plt.axes((0, 0, 1, 1), xticks=[], yticks=[], frameon=False)\n    ax.set_xlim(0, 9)\n    ax.set_ylim(0, 6)\n\n    patches = [Rectangle((0.3, 3.6), 1.5, 1.8, zorder=1, fc=box_bg),\n               Rectangle((0.5, 3.8), 1.5, 1.8, zorder=2, fc=box_bg),\n               Rectangle((0.7, 4.0), 1.5, 1.8, zorder=3, fc=box_bg),\n\n               Rectangle((2.9, 3.6), 0.2, 1.8, fc=box_bg),\n               Rectangle((3.1, 3.8), 0.2, 1.8, fc=box_bg),\n               Rectangle((3.3, 4.0), 0.2, 1.8, fc=box_bg),\n\n               Rectangle((0.3, 0.2), 1.5, 1.8, fc=box_bg),\n\n               Rectangle((2.9, 0.2), 0.2, 1.8, fc=box_bg),\n\n               Circle((5.5, 3.5), 1.0, fc=box_bg),\n\n               Polygon([[5.5, 1.7],\n                        [6.1, 1.1],\n                        [5.5, 0.5],\n                        [4.9, 1.1]], fc=box_bg),\n\n               FancyArrow(2.3, 4.6, 0.35, 0, fc=arrow1,\n                          width=0.25, head_width=0.5, head_length=0.2),\n\n               FancyArrow(3.75, 4.2, 0.5, -0.2, fc=arrow1,\n                          width=0.25, head_width=0.5, head_length=0.2),\n\n               FancyArrow(5.5, 2.4, 0, -0.4, fc=arrow1,\n                          width=0.25, head_width=0.5, head_length=0.2),\n\n               FancyArrow(2.0, 1.1, 0.5, 0, fc=arrow2,\n                          width=0.25, head_width=0.5, head_length=0.2),\n\n               FancyArrow(3.3, 1.1, 1.3, 0, fc=arrow2,\n                          width=0.25, head_width=0.5, head_length=0.2),\n\n               FancyArrow(6.2, 1.1, 0.8, 0, fc=arrow2,\n                          width=0.25, head_width=0.5, head_length=0.2)]\n\n    if supervised:\n        patches += [Rectangle((0.3, 2.4), 1.5, 0.5, zorder=1, fc=box_bg),\n                    Rectangle((0.5, 2.6), 1.5, 0.5, zorder=2, fc=box_bg),\n                    Rectangle((0.7, 2.8), 1.5, 0.5, zorder=3, fc=box_bg),\n                    FancyArrow(2.3, 2.9, 2.0, 0, fc=arrow1,\n                               width=0.25, head_width=0.5, head_length=0.2),\n                    Rectangle((7.3, 0.85), 1.5, 0.5, fc=box_bg)]\n    else:\n        patches += [Rectangle((7.3, 0.2), 1.5, 1.8, fc=box_bg)]\n\n    for p in patches:\n        ax.add_patch(p)\n\n    plt.text(1.45, 4.9, \"Training\\nText,\\nDocuments,\\nImages,\\netc.\",\n             ha='center', va='center', fontsize=14)\n\n    plt.text(3.6, 4.9, \"Feature\\nVectors\",\n             ha='left', va='center', fontsize=14)\n\n    plt.text(5.5, 3.5, \"Machine\\nLearning\\nAlgorithm\",\n             ha='center', va='center', fontsize=14)\n\n    plt.text(1.05, 1.1, \"New Text,\\nDocument,\\nImage,\\netc.\",\n             ha='center', va='center', fontsize=14)\n\n    plt.text(3.3, 1.7, \"Feature\\nVector\",\n             ha='left', va='center', fontsize=14)\n\n    plt.text(5.5, 1.1, \"Predictive\\nModel\",\n             ha='center', va='center', fontsize=12)\n\n    if supervised:\n        plt.text(1.45, 3.05, \"Labels\",\n                 ha='center', va='center', fontsize=14)\n\n        plt.text(8.05, 1.1, \"Expected\\nLabel\",\n                 ha='center', va='center', 
fontsize=14)\n        plt.text(8.8, 5.8, \"Supervised Learning Model\",\n                 ha='right', va='top', fontsize=18)\n\n    else:\n        plt.text(8.05, 1.1,\n                 \"Likelihood\\nor Cluster ID\\nor Better\\nRepresentation\",\n                 ha='center', va='center', fontsize=12)\n        plt.text(8.8, 5.8, \"Unsupervised Learning Model\",\n                 ha='right', va='top', fontsize=18)\n\n\ndef plot_supervised_chart(annotate=False):\n    create_base(supervised=True)\n    if annotate:\n        fontdict = dict(color='r', weight='bold', size=14)\n        plt.text(1.9, 4.55, 'X = vec.fit_transform(input)',\n                 fontdict=fontdict,\n                 rotation=20, ha='left', va='bottom')\n        plt.text(3.7, 3.2, 'clf.fit(X, y)',\n                 fontdict=fontdict,\n                 rotation=20, ha='left', va='bottom')\n        plt.text(1.7, 1.5, 'X_new = vec.transform(input)',\n                 fontdict=fontdict,\n                 rotation=20, ha='left', va='bottom')\n        plt.text(6.1, 1.5, 'y_new = clf.predict(X_new)',\n                 fontdict=fontdict,\n                 rotation=20, ha='left', va='bottom')\n\n\ndef plot_unsupervised_chart():\n    create_base(supervised=False)\n\n\nif __name__ == '__main__':\n    plot_supervised_chart(False)\n    plot_supervised_chart(True)\n    plot_unsupervised_chart()\n    plt.show()\n"
  },
  {
    "path": "notebooks/figures/__init__.py",
    "content": "from .plot_2d_separator import plot_2d_separator\nfrom .plot_kneighbors_regularization import plot_kneighbors_regularization, \\\n    plot_regression_datasets, make_dataset\nfrom .plot_linear_svc_regularization import plot_linear_svc_regularization\nfrom .plot_interactive_tree import plot_tree_interactive\nfrom .plot_interactive_forest import plot_forest_interactive\nfrom .plot_rbf_svm_parameters import plot_rbf_svm_parameters\nfrom .plot_rbf_svm_parameters import plot_svm_interactive\nfrom .plot_scaling import plot_scaling, plot_relative_scaling\nfrom .plot_digits_dataset import digits_plot\nfrom .plot_pca import plot_pca_illustration\n\n__all__ = ['plot_2d_separator', 'plot_kneighbors_regularization',\n           'plot_linear_svc_regularization', 'plot_tree_interactive',\n           'plot_regression_datasets', 'make_dataset',\n           \"plot_forest_interactive\", \"plot_rbf_svm_parameters\",\n           \"plot_svm_interactive\", 'plot_scaling', 'digits_plot',\n           'plot_relative_scaling', 'plot_pca_illustration']\n"
  },
  {
    "path": "notebooks/figures/plot_2d_separator.py",
    "content": "import numpy as np\nimport matplotlib.pyplot as plt\n\n\ndef plot_2d_separator(classifier, X, fill=False, ax=None, eps=None):\n    if eps is None:\n        eps = X.std() / 2.\n    x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps\n    y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps\n    xx = np.linspace(x_min, x_max, 100)\n    yy = np.linspace(y_min, y_max, 100)\n\n    X1, X2 = np.meshgrid(xx, yy)\n    X_grid = np.c_[X1.ravel(), X2.ravel()]\n    try:\n        decision_values = classifier.decision_function(X_grid)\n        levels = [0]\n        fill_levels = [decision_values.min(), 0, decision_values.max()]\n    except AttributeError:\n        # no decision_function\n        decision_values = classifier.predict_proba(X_grid)[:, 1]\n        levels = [.5]\n        fill_levels = [0, .5, 1]\n\n    if ax is None:\n        ax = plt.gca()\n    if fill:\n        ax.contourf(X1, X2, decision_values.reshape(X1.shape),\n                    levels=fill_levels, colors=['blue', 'red'])\n    else:\n        ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=levels,\n                   colors=\"black\")\n    ax.set_xlim(x_min, x_max)\n    ax.set_ylim(y_min, y_max)\n    ax.set_xticks(())\n    ax.set_yticks(())\n\n\nif __name__ == '__main__':\n    from sklearn.datasets import make_blobs\n    from sklearn.linear_model import LogisticRegression\n    X, y = make_blobs(centers=2, random_state=42)\n    clf = LogisticRegression().fit(X, y)\n    plot_2d_separator(clf, X, fill=True)\n    plt.scatter(X[:, 0], X[:, 1], c=y)\n    plt.show()\n"
  },
  {
    "path": "notebooks/figures/plot_digits_dataset.py",
    "content": "# Taken from example in scikit-learn examples\n# Authors: Fabian Pedregosa <fabian.pedregosa@inria.fr>\n#          Olivier Grisel <olivier.grisel@ensta.org>\n#          Mathieu Blondel <mathieu@mblondel.org>\n#          Gael Varoquaux\n# License: BSD 3 clause (C) INRIA 2011\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom matplotlib import offsetbox\nfrom sklearn import (manifold, datasets, decomposition, ensemble,\n                     random_projection)\n\ndef digits_plot():\n    digits = datasets.load_digits(n_class=6)\n    n_digits = 500\n    X = digits.data[:n_digits]\n    y = digits.target[:n_digits]\n    n_samples, n_features = X.shape\n    n_neighbors = 30\n\n    def plot_embedding(X, title=None):\n        x_min, x_max = np.min(X, 0), np.max(X, 0)\n        X = (X - x_min) / (x_max - x_min)\n\n        plt.figure()\n        ax = plt.subplot(111)\n        for i in range(X.shape[0]):\n            plt.text(X[i, 0], X[i, 1], str(digits.target[i]),\n                     color=plt.cm.Set1(y[i] / 10.),\n                     fontdict={'weight': 'bold', 'size': 9})\n\n        if hasattr(offsetbox, 'AnnotationBbox'):\n            # only print thumbnails with matplotlib > 1.0\n            shown_images = np.array([[1., 1.]])  # just something big\n            for i in range(X.shape[0]):\n                dist = np.sum((X[i] - shown_images) ** 2, 1)\n                if np.min(dist) < 1e5:\n                    # don't show points that are too close\n                    # set a high threshold to basically turn this off\n                    continue\n                shown_images = np.r_[shown_images, [X[i]]]\n                imagebox = offsetbox.AnnotationBbox(\n                    offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),\n                    X[i])\n                ax.add_artist(imagebox)\n        plt.xticks([]), plt.yticks([])\n        if title is not None:\n            plt.title(title)\n\n    n_img_per_row = 10\n    img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))\n    for i in range(n_img_per_row):\n        ix = 10 * i + 1\n        for j in range(n_img_per_row):\n            iy = 10 * j + 1\n            img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))\n\n    plt.imshow(img, cmap=plt.cm.binary)\n    plt.xticks([])\n    plt.yticks([])\n    plt.title('A selection from the 64-dimensional digits dataset')\n    print(\"Computing PCA projection\")\n    pca = decomposition.PCA(n_components=2).fit(X)\n    X_pca = pca.transform(X)\n    plot_embedding(X_pca, \"Principal Components projection of the digits\")\n    plt.figure()\n    plt.matshow(pca.components_[0, :].reshape(8, 8), cmap=\"gray\")\n    plt.axis('off')\n    plt.figure()\n    plt.matshow(pca.components_[1, :].reshape(8, 8), cmap=\"gray\")\n    plt.axis('off')\n    plt.show()\n"
  },
  {
    "path": "notebooks/figures/plot_helpers.py",
    "content": "from matplotlib.colors import ListedColormap\n\ncm3 = ListedColormap(['#0000aa', '#ff2020', '#50ff50'])\ncm2 = ListedColormap(['#0000aa', '#ff2020'])\n"
  },
  {
    "path": "notebooks/figures/plot_interactive_forest.py",
    "content": "import numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import make_blobs\nfrom sklearn.ensemble import RandomForestClassifier\n\n\nX, y = make_blobs(centers=[[0, 0], [1, 1]], random_state=61526, n_samples=50)\n\n\ndef plot_forest(max_depth=1):\n    plt.figure()\n    ax = plt.gca()\n    h = 0.02\n\n    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n\n    if max_depth != 0:\n        forest = RandomForestClassifier(n_estimators=20, max_depth=max_depth,\n                                        random_state=1).fit(X, y)\n        Z = forest.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]\n        Z = Z.reshape(xx.shape)\n        ax.contourf(xx, yy, Z, alpha=.4)\n        ax.set_title(\"max_depth = %d\" % max_depth)\n    else:\n        ax.set_title(\"data set\")\n    ax.scatter(X[:, 0], X[:, 1], c=np.array(['b', 'r'])[y], s=60)\n    ax.set_xlim(x_min, x_max)\n    ax.set_ylim(y_min, y_max)\n    ax.set_xticks(())\n    ax.set_yticks(())\n\n\ndef plot_forest_interactive():\n    from IPython.html.widgets import interactive, IntSlider\n    slider = IntSlider(min=0, max=8, step=1, value=0)\n    return interactive(plot_forest, max_depth=slider)\n"
  },
  {
    "path": "notebooks/figures/plot_interactive_tree.py",
    "content": "import numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.datasets import make_blobs\nfrom sklearn.tree import DecisionTreeClassifier\n\nfrom sklearn.externals.six import StringIO  # doctest: +SKIP\nfrom sklearn.tree import export_graphviz\nfrom scipy.misc import imread\nfrom scipy import ndimage\n\nimport re\n\nX, y = make_blobs(centers=[[0, 0], [1, 1]], random_state=61526, n_samples=50)\n\n\ndef tree_image(tree, fout=None):\n    try:\n        import pydot\n    except ImportError:\n        # make a hacky white plot\n        x = np.ones((10, 10))\n        x[0, 0] = 0\n        return x\n    dot_data = StringIO()\n    export_graphviz(tree, out_file=dot_data)\n    data = re.sub(r\"gini = 0\\.[0-9]+\\\\n\", \"\", dot_data.getvalue())\n    data = re.sub(r\"samples = [0-9]+\\\\n\", \"\", data)\n    data = re.sub(r\"\\\\nsamples = [0-9]+\", \"\", data)\n\n    graph = pydot.graph_from_dot_data(data)[0]\n    if fout is None:\n        fout = \"tmp.png\"\n    graph.write_png(fout)\n    return imread(fout)\n\n\ndef plot_tree(max_depth=1):\n    fig, ax = plt.subplots(1, 2, figsize=(15, 7))\n    h = 0.02\n\n    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n\n    if max_depth != 0:\n        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=1).fit(X, y)\n        Z = tree.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]\n        Z = Z.reshape(xx.shape)\n        faces = tree.tree_.apply(np.c_[xx.ravel(), yy.ravel()].astype(np.float32))\n        faces = faces.reshape(xx.shape)\n        border = ndimage.laplace(faces) != 0\n        ax[0].contourf(xx, yy, Z, alpha=.4)\n        ax[0].scatter(xx[border], yy[border], marker='.', s=1)\n        ax[0].set_title(\"max_depth = %d\" % max_depth)\n        ax[1].imshow(tree_image(tree))\n        ax[1].axis(\"off\")\n    else:\n        ax[0].set_title(\"data set\")\n        ax[1].set_visible(False)\n    ax[0].scatter(X[:, 0], X[:, 1], c=np.array(['b', 'r'])[y], s=60)\n    ax[0].set_xlim(x_min, x_max)\n    ax[0].set_ylim(y_min, y_max)\n    ax[0].set_xticks(())\n    ax[0].set_yticks(())\n\n\ndef plot_tree_interactive():\n    from IPython.html.widgets import interactive, IntSlider\n    slider = IntSlider(min=0, max=8, step=1, value=0)\n    return interactive(plot_tree, max_depth=slider)\n"
  },
  {
    "path": "notebooks/figures/plot_kneighbors_regularization.py",
    "content": "import numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.neighbors import KNeighborsRegressor\n\n\ndef make_dataset(n_samples=100):\n    rnd = np.random.RandomState(42)\n    x = np.linspace(-3, 3, n_samples)\n    y_no_noise = np.sin(4 * x) + x\n    y = y_no_noise + rnd.normal(size=len(x))\n    return x, y\n\n\ndef plot_regression_datasets():\n    fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n    for n_samples, ax in zip([10, 100, 1000], axes):\n        x, y = make_dataset(n_samples)\n        ax.plot(x, y, 'o', alpha=.6)\n\n\ndef plot_kneighbors_regularization():\n    rnd = np.random.RandomState(42)\n    x = np.linspace(-3, 3, 100)\n    y_no_noise = np.sin(4 * x) + x\n    y = y_no_noise + rnd.normal(size=len(x))\n    X = x[:, np.newaxis]\n    fig, axes = plt.subplots(1, 3, figsize=(15, 5))\n\n    x_test = np.linspace(-3, 3, 1000)\n\n    for n_neighbors, ax in zip([2, 5, 20], axes.ravel()):\n        kneighbor_regression = KNeighborsRegressor(n_neighbors=n_neighbors)\n        kneighbor_regression.fit(X, y)\n        ax.plot(x, y_no_noise, label=\"true function\")\n        ax.plot(x, y, \"o\", label=\"data\")\n        ax.plot(x_test, kneighbor_regression.predict(x_test[:, np.newaxis]),\n                label=\"prediction\")\n        ax.legend()\n        ax.set_title(\"n_neighbors = %d\" % n_neighbors)\n\nif __name__ == \"__main__\":\n    plot_kneighbors_regularization()\n    plt.show()\n"
  },
  {
    "path": "notebooks/figures/plot_linear_svc_regularization.py",
    "content": "import matplotlib.pyplot as plt\nimport numpy as np\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import make_blobs\nfrom .plot_2d_separator import plot_2d_separator\n\n\ndef plot_linear_svc_regularization():\n    X, y = make_blobs(centers=2, random_state=4, n_samples=30)\n    # a carefully hand-designed dataset lol\n    y[7] = 0\n    y[27] = 0\n\n    fig, axes = plt.subplots(1, 3, figsize=(12, 4))\n\n    for ax, C in zip(axes, [1e-2, 1, 1e2]):\n        ax.scatter(X[:, 0], X[:, 1], s=150, c=np.array(['red', 'blue'])[y])\n\n        svm = SVC(kernel='linear', C=C).fit(X, y)\n        plot_2d_separator(svm, X, ax=ax, eps=.5)\n        ax.set_title(\"C = %f\" % C)\n"
  },
  {
    "path": "notebooks/figures/plot_pca.py",
    "content": "from sklearn.decomposition import PCA\nimport matplotlib.pyplot as plt\nimport numpy as np\n\n\ndef plot_pca_illustration():\n    rnd = np.random.RandomState(5)\n    X_ = rnd.normal(size=(300, 2))\n    X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2)\n\n    pca = PCA()\n    pca.fit(X_blob)\n    X_pca = pca.transform(X_blob)\n\n    S = X_pca.std(axis=0)\n\n    fig, axes = plt.subplots(2, 2, figsize=(10, 10))\n    axes = axes.ravel()\n\n    axes[0].set_title(\"Original data\")\n    axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=X_pca[:, 0], linewidths=0,\n                    s=60, cmap='viridis')\n    axes[0].set_xlabel(\"feature 1\")\n    axes[0].set_ylabel(\"feature 2\")\n    axes[0].arrow(pca.mean_[0], pca.mean_[1], S[0] * pca.components_[0, 0],\n                  S[0] * pca.components_[0, 1], width=.1, head_width=.3,\n                  color='k')\n    axes[0].arrow(pca.mean_[0], pca.mean_[1], S[1] * pca.components_[1, 0],\n                  S[1] * pca.components_[1, 1], width=.1, head_width=.3,\n                  color='k')\n    axes[0].text(-1.5, -.5, \"Component 2\", size=14)\n    axes[0].text(-4, -4, \"Component 1\", size=14)\n    axes[0].set_aspect('equal')\n\n    axes[1].set_title(\"Transformed data\")\n    axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=X_pca[:, 0], linewidths=0,\n                    s=60, cmap='viridis')\n    axes[1].set_xlabel(\"First principal component\")\n    axes[1].set_ylabel(\"Second principal component\")\n    axes[1].set_aspect('equal')\n    axes[1].set_ylim(-8, 8)\n\n    pca = PCA(n_components=1)\n    pca.fit(X_blob)\n    X_inverse = pca.inverse_transform(pca.transform(X_blob))\n\n    axes[2].set_title(\"Transformed data w/ second component dropped\")\n    axes[2].scatter(X_pca[:, 0], np.zeros(X_pca.shape[0]), c=X_pca[:, 0],\n                    linewidths=0, s=60, cmap='viridis')\n    axes[2].set_xlabel(\"First principal component\")\n    axes[2].set_aspect('equal')\n    axes[2].set_ylim(-8, 8)\n\n    axes[3].set_title(\"Back-rotation using only first component\")\n    axes[3].scatter(X_inverse[:, 0], X_inverse[:, 1], c=X_pca[:, 0],\n                    linewidths=0, s=60, cmap='viridis')\n    axes[3].set_xlabel(\"feature 1\")\n    axes[3].set_ylabel(\"feature 2\")\n    axes[3].set_aspect('equal')\n    axes[3].set_xlim(-8, 4)\n    axes[3].set_ylim(-8, 4)\n\n\ndef plot_pca_whitening():\n    rnd = np.random.RandomState(5)\n    X_ = rnd.normal(size=(300, 2))\n    X_blob = np.dot(X_, rnd.normal(size=(2, 2))) + rnd.normal(size=2)\n\n    pca = PCA(whiten=True)\n    pca.fit(X_blob)\n    X_pca = pca.transform(X_blob)\n\n    fig, axes = plt.subplots(1, 2, figsize=(10, 10))\n    axes = axes.ravel()\n\n    axes[0].set_title(\"Original data\")\n    axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c=X_pca[:, 0], linewidths=0, s=60, cmap='viridis')\n    axes[0].set_xlabel(\"feature 1\")\n    axes[0].set_ylabel(\"feature 2\")\n    axes[0].set_aspect('equal')\n\n    axes[1].set_title(\"Whitened data\")\n    axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=X_pca[:, 0], linewidths=0, s=60, cmap='viridis')\n    axes[1].set_xlabel(\"First principal component\")\n    axes[1].set_ylabel(\"Second principal component\")\n    axes[1].set_aspect('equal')\n    axes[1].set_xlim(-3, 4)\n"
  },
  {
    "path": "notebooks/figures/plot_rbf_svm_parameters.py",
    "content": "import matplotlib.pyplot as plt\nimport numpy as np\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import make_blobs\nfrom .plot_2d_separator import plot_2d_separator\n\n\ndef make_handcrafted_dataset():\n    # a carefully hand-designed dataset lol\n    X, y = make_blobs(centers=2, random_state=4, n_samples=30)\n    y[np.array([7, 27])] = 0\n    mask = np.ones(len(X), dtype=np.bool)\n    mask[np.array([0, 1, 5, 26])] = 0\n    X, y = X[mask], y[mask]\n    return X, y\n\n\ndef plot_rbf_svm_parameters():\n    X, y = make_handcrafted_dataset()\n\n    fig, axes = plt.subplots(1, 3, figsize=(12, 4))\n    for ax, C in zip(axes, [1e0, 5, 10, 100]):\n        ax.scatter(X[:, 0], X[:, 1], s=150, c=np.array(['red', 'blue'])[y])\n\n        svm = SVC(kernel='rbf', C=C).fit(X, y)\n        plot_2d_separator(svm, X, ax=ax, eps=.5)\n        ax.set_title(\"C = %f\" % C)\n\n    fig, axes = plt.subplots(1, 4, figsize=(15, 3))\n    for ax, gamma in zip(axes, [0.1, .5, 1, 10]):\n        ax.scatter(X[:, 0], X[:, 1], s=150, c=np.array(['red', 'blue'])[y])\n        svm = SVC(gamma=gamma, kernel='rbf', C=1).fit(X, y)\n        plot_2d_separator(svm, X, ax=ax, eps=.5)\n        ax.set_title(\"gamma = %f\" % gamma)\n\n\ndef plot_svm(log_C, log_gamma):\n    X, y = make_handcrafted_dataset()\n    C = 10. ** log_C\n    gamma = 10. ** log_gamma\n    svm = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y)\n    ax = plt.gca()\n    plot_2d_separator(svm, X, ax=ax, eps=.5)\n    # plot data\n    ax.scatter(X[:, 0], X[:, 1], s=150, c=np.array(['red', 'blue'])[y])\n    # plot support vectors\n    sv = svm.support_vectors_\n    ax.scatter(sv[:, 0], sv[:, 1], s=230, facecolors='none', zorder=10, linewidth=3)\n    ax.set_title(\"C = %.4f gamma = %.4f\" % (C, gamma))\n\n\ndef plot_svm_interactive():\n    from IPython.html.widgets import interactive, FloatSlider\n    C_slider = FloatSlider(min=-3, max=3, step=.1, value=0, readout=False)\n    gamma_slider = FloatSlider(min=-2, max=2, step=.1, value=0, readout=False)\n    return interactive(plot_svm, log_C=C_slider, log_gamma=gamma_slider)\n"
  },
  {
    "path": "notebooks/figures/plot_scaling.py",
    "content": "import matplotlib.pyplot as plt\nimport numpy as np\nfrom sklearn.datasets import make_blobs\nfrom sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, RobustScaler\nfrom sklearn.model_selection import train_test_split\nfrom .plot_helpers import cm2\n\n\ndef plot_scaling():\n    X, y = make_blobs(n_samples=50, centers=2, random_state=4, cluster_std=1)\n    X += 3\n\n    plt.figure(figsize=(15, 8))\n    main_ax = plt.subplot2grid((2, 4), (0, 0), rowspan=2, colspan=2)\n\n    main_ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cm2, s=60)\n    maxx = np.abs(X[:, 0]).max()\n    maxy = np.abs(X[:, 1]).max()\n\n    main_ax.set_xlim(-maxx + 1, maxx + 1)\n    main_ax.set_ylim(-maxy + 1, maxy + 1)\n    main_ax.set_title(\"Original Data\")\n    other_axes = [plt.subplot2grid((2, 4), (i, j)) for j in range(2, 4) for i in range(2)]\n\n    for ax, scaler in zip(other_axes, [StandardScaler(), RobustScaler(),\n                                       MinMaxScaler(), Normalizer(norm='l2')]):\n        X_ = scaler.fit_transform(X)\n        ax.scatter(X_[:, 0], X_[:, 1], c=y, cmap=cm2, s=60)\n        ax.set_xlim(-2, 2)\n        ax.set_ylim(-2, 2)\n        ax.set_title(type(scaler).__name__)\n\n    other_axes.append(main_ax)\n\n    for ax in other_axes:\n        ax.spines['left'].set_position('center')\n        ax.spines['right'].set_color('none')\n        ax.spines['bottom'].set_position('center')\n        ax.spines['top'].set_color('none')\n        ax.xaxis.set_ticks_position('bottom')\n        ax.yaxis.set_ticks_position('left')\n\n\ndef plot_relative_scaling():\n    # make synthetic data\n    X, _ = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)\n    # split it into training and test set\n    X_train, X_test = train_test_split(X, random_state=5, test_size=.1)\n    # plot the training and test set\n    fig, axes = plt.subplots(1, 3, figsize=(13, 4))\n    axes[0].scatter(X_train[:, 0], X_train[:, 1],\n                    c='b', label=\"training set\", s=60)\n    axes[0].scatter(X_test[:, 0], X_test[:, 1], marker='^',\n                    c='r', label=\"test set\", s=60)\n    axes[0].legend(loc='upper left')\n    axes[0].set_title(\"original data\")\n\n    # scale the data using MinMaxScaler\n    scaler = MinMaxScaler()\n    scaler.fit(X_train)\n    X_train_scaled = scaler.transform(X_train)\n    X_test_scaled = scaler.transform(X_test)\n\n    # visualize the properly scaled data\n    axes[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],\n                    c='b', label=\"training set\", s=60)\n    axes[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^',\n                    c='r', label=\"test set\", s=60)\n    axes[1].set_title(\"scaled data\")\n\n    # rescale the test set separately, so that test set min is 0 and test set max is 1\n    # DO NOT DO THIS! For illustration purposes only\n    test_scaler = MinMaxScaler()\n    test_scaler.fit(X_test)\n    X_test_scaled_badly = test_scaler.transform(X_test)\n\n    # visualize wrongly scaled data\n    axes[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1],\n                    c='b', label=\"training set\", s=60)\n    axes[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^',\n                    c='r', label=\"test set\", s=60)\n    axes[2].set_title(\"improperly scaled data\")\n"
  },
  {
    "path": "notebooks/helpers.py",
    "content": "import numpy as np\nfrom collections import defaultdict\nimport os\nfrom sklearn.model_selection import StratifiedShuffleSplit\nfrom sklearn.feature_extraction import DictVectorizer\n\n# Can also use pandas!\ndef process_titanic_line(line):\n    # Split line on \",\" to get fields without comma confusion\n    vals = line.strip().split('\",')\n    # replace spurious \" characters\n    vals = [v.replace('\"', '') for v in vals]\n    pclass = int(vals[0])\n    survived = int(vals[1])\n    name = str(vals[2])\n    sex = str(vals[3])\n    try:\n        age = float(vals[4])\n    except ValueError:\n        # Blank age\n        age = -1\n    sibsp = float(vals[5])\n    parch = int(vals[6])\n    ticket = str(vals[7])\n    try:\n        fare = float(vals[8])\n    except ValueError:\n        # Blank fare\n        fare = -1\n    cabin = str(vals[9])\n    embarked = str(vals[10])\n    boat = str(vals[11])\n    homedest = str(vals[12])\n    line_dict = {'pclass': pclass, 'survived': survived, 'name': name, 'sex': sex, 'age': age, 'sibsp': sibsp,\n                 'parch': parch, 'ticket': ticket, 'fare': fare, 'cabin': cabin, 'embarked': embarked,\n                 'boat': boat, 'homedest': homedest}\n    return line_dict\n\n\ndef load_titanic(test_size=.25, feature_skip_tuple=(), random_state=1999):\n    f = open(os.path.join('datasets', 'titanic', 'titanic3.csv'))\n    # Remove . from home.dest, split on quotes because some fields have commas\n    keys = f.readline().strip().replace('.', '').split('\",\"')\n    lines = f.readlines()\n    f.close()\n    string_keys = ['name', 'sex', 'ticket', 'cabin', 'embarked', 'boat',\n                   'homedest']\n    string_keys = [s for s in string_keys if s not in feature_skip_tuple]\n    numeric_keys = ['pclass', 'age', 'sibsp', 'parch', 'fare']\n    numeric_keys = [n for n in numeric_keys if n not in feature_skip_tuple]\n    train_vectorizer_list = []\n    test_vectorizer_list = []\n\n    n_samples = len(lines)\n    numeric_data = np.zeros((n_samples, len(numeric_keys)))\n    numeric_labels = np.zeros((n_samples,), dtype=int)\n\n    # Doing this twice is horribly inefficient but the file is small...\n    for n, l in enumerate(lines):\n        line_dict = process_titanic_line(l)\n        strings = {k: line_dict[k] for k in string_keys}\n        numeric_labels[n] = line_dict[\"survived\"]\n\n    sss = StratifiedShuffleSplit(n_iter=1, test_size=test_size, random_state=12)\n    # This is a weird way to get the indices but it works\n    train_idx = None\n    test_idx = None\n    for train_idx, test_idx in sss.split(numeric_data, numeric_labels):\n        pass\n\n    for n, l in enumerate(lines):\n        line_dict = process_titanic_line(l)\n        strings = {k: line_dict[k] for k in string_keys}\n        if n in train_idx:\n            train_vectorizer_list.append(strings)\n        else:\n            test_vectorizer_list.append(strings)\n        numeric_data[n] = np.asarray([line_dict[k]\n                                      for k in numeric_keys])\n\n    train_numeric = numeric_data[train_idx]\n    test_numeric = numeric_data[test_idx]\n    train_labels = numeric_labels[train_idx]\n    test_labels = numeric_labels[test_idx]\n\n    vec = DictVectorizer()\n    # .toarray() due to returning a scipy sparse array\n    train_categorical = vec.fit_transform(train_vectorizer_list).toarray()\n    test_categorical = vec.transform(test_vectorizer_list).toarray()\n    train_data = np.concatenate([train_numeric, train_categorical], axis=1)\n    test_data = 
np.concatenate([test_numeric, test_categorical], axis=1)\n    keys = numeric_keys + string_keys\n    return keys, train_data, test_data, train_labels, test_labels\n\n\nFIELDNAMES = ('polarity', 'id', 'date', 'query', 'author', 'text')\n\ndef read_sentiment_csv(csv_file, fieldnames=FIELDNAMES, max_count=None,\n             n_partitions=1, partition_id=0):\n    import csv  # put the import inside for use in IPython.parallel\n    def file_opener(csv_file):\n        try:\n            open(csv_file, 'r', encoding=\"latin1\").close()\n            return open(csv_file, 'r', encoding=\"latin1\")\n        except TypeError:\n            # Python 2 does not have encoding arg\n            return open(csv_file, 'rb')\n\n    texts = []\n    targets = []\n    with file_opener(csv_file) as f:\n        reader = csv.DictReader(f, fieldnames=fieldnames,\n                                delimiter=',', quotechar='\"')\n        pos_count, neg_count = 0, 0\n        for i, d in enumerate(reader):\n            if i % n_partitions != partition_id:\n                # Skip entry if not in the requested partition\n                continue\n\n            if d['polarity'] == '4':\n                if max_count and pos_count >= max_count / 2:\n                    continue\n                pos_count += 1\n                texts.append(d['text'])\n                targets.append(1)\n\n            elif d['polarity'] == '0':\n                if max_count and neg_count >= max_count / 2:\n                    continue\n                neg_count += 1\n                texts.append(d['text'])\n                targets.append(-1)\n\n    return texts, targets\n"
  },
  {
    "path": "notebooks/solutions/03A_faces_plot.py",
    "content": "faces = fetch_olivetti_faces()\n\n# set up the figure\nfig = plt.figure(figsize=(6, 6))  # figure size in inches\nfig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n\n# plot the faces:\nfor i in range(64):\n    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n    ax.imshow(faces.images[i], cmap=plt.cm.bone, interpolation='nearest')\n"
  },
  {
    "path": "notebooks/solutions/04_wrong-predictions.py",
    "content": "for i in incorrect_idx:\n    print('%d: Predicted %d True label %d' % (i, pred_y[i], test_y[i]))\n\n# Plot two dimensions\n\ncolors = [\"darkblue\", \"darkgreen\", \"gray\"]\n\nfor n, color in enumerate(colors):\n    idx = np.where(test_y == n)[0]\n    plt.scatter(test_X[idx, 1], test_X[idx, 2],\n                color=color, label=\"Class %s\" % str(n))\n\nfor i, marker in zip(incorrect_idx, ['x', 's', 'v']):\n    plt.scatter(test_X[i, 1], test_X[i, 2],\n                color=\"darkred\",\n                marker=marker,\n                s=40,\n                label=i)\n\nplt.xlabel('sepal width [cm]')\nplt.ylabel('petal length [cm]')\nplt.legend(loc=1, scatterpoints=1)\nplt.title(\"Iris Classification results\")\nplt.show()\n"
  },
  {
    "path": "notebooks/solutions/05A_knn_with_diff_k.py",
    "content": "from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\n\n\niris = load_iris()\nX = iris.data\ny = iris.target\n\nX_train, X_test, y_train, y_test = train_test_split(X, y,\n                                                    test_size=0.25,\n                                                    random_state=1234,\n                                                    stratify=y)\n\nX_trainsub, X_valid, y_trainsub, y_valid = train_test_split(X_train, y_train,\n                                                            test_size=0.5,\n                                                            random_state=1234,\n                                                            stratify=y_train)\n\nfor k in range(1, 20):\n    knn = KNeighborsClassifier(n_neighbors=k)\n    train_score = knn.fit(X_trainsub, y_trainsub).\\\n        score(X_trainsub, y_trainsub)\n    valid_score = knn.score(X_valid, y_valid)\n    print('k: %d, Train/Valid Acc: %.3f/%.3f' %\n          (k, train_score, valid_score))\n\n\nknn = KNeighborsClassifier(n_neighbors=9)\nknn.fit(X_train, y_train)\nprint('k=9 Test Acc: %.3f' % knn.score(X_test, y_test))\n"
  },
  {
    "path": "notebooks/solutions/06A_knn_vs_linreg.py",
    "content": "from sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LinearRegression\n\n\nboston = load_boston()\nX = boston.data\ny = boston.target\n\nprint('X.shape:', X.shape)\nX_train, X_test, y_train, y_test = train_test_split(X, y,\n                                                    test_size=0.25,\n                                                    random_state=42)\n\nlinreg = LinearRegression()\nknnreg = KNeighborsRegressor(n_neighbors=1)\n\nlinreg.fit(X_train, y_train)\nprint('Linear Regression Train/Test: %.3f/%.3f' %\n      (linreg.score(X_train, y_train),\n       linreg.score(X_test, y_test)))\n\nknnreg.fit(X_train, y_train)\nprint('KNeighborsRegressor Train/Test: %.3f/%.3f' %\n      (knnreg.score(X_train, y_train),\n       knnreg.score(X_test, y_test)))\n"
  },
  {
    "path": "notebooks/solutions/07A_iris-pca.py",
    "content": "from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.decomposition import PCA\nfrom sklearn.preprocessing import StandardScaler\n\niris = load_iris()\n\nX_train, X_test, y_train, y_test = train_test_split(iris.data,\n                                                    iris.target,\n                                                    random_state=0,\n                                                    stratify=iris.target)\n\nsc = StandardScaler()\nsc.fit(X_train)\npca = PCA(n_components=2)\n\nX_train_pca = pca.fit_transform(sc.transform(X_train))\nX_test_pca = pca.transform(sc.transform(X_test))\n\nfor X, y in zip((X_train_pca, X_test_pca), (y_train, y_test)):\n\n    for i, annot in enumerate(zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),\n                                  ('blue', 'red', 'green'))):\n        plt.scatter(X[y==i, 0],\n                    X[y==i, 1],\n                    label=annot[0],\n                    c=annot[1])\n    plt.xlabel('Principal Component 1')\n    plt.ylabel('Principal Component 2')\n    plt.legend(loc='best')\n    plt.tight_layout()\n    plt.show()\n"
  },
  {
    "path": "notebooks/solutions/08B_digits_clustering.py",
    "content": "from sklearn.cluster import KMeans\nkmeans = KMeans(n_clusters=10)\nclusters = kmeans.fit_predict(digits.data)\n\nprint(kmeans.cluster_centers_.shape)\n\n#------------------------------------------------------------\n# visualize the cluster centers\nfig = plt.figure(figsize=(8, 3))\nfor i in range(10):\n    ax = fig.add_subplot(2, 5, 1 + i)\n    ax.imshow(kmeans.cluster_centers_[i].reshape((8, 8)),\n              cmap=plt.cm.binary)\nfrom sklearn.manifold import Isomap\nX_iso = Isomap(n_neighbors=10).fit_transform(digits.data)\n\n#------------------------------------------------------------\n# visualize the projected data\nfig, ax = plt.subplots(1, 2, figsize=(8, 4))\n\nax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)\nax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=digits.target)\n"
  },
  {
    "path": "notebooks/solutions/10_titanic.py",
    "content": "from sklearn.linear_model import LogisticRegression\nlr = LogisticRegression().fit(train_data_finite, train_labels)\nprint(\"logistic regression score: %f\" % lr.score(test_data_finite, test_labels))\n\nfrom sklearn.ensemble import RandomForestClassifier\nrf = RandomForestClassifier(n_estimators=500, random_state=0).fit(train_data_finite, train_labels)\nprint(\"random forest score: %f\" % rf.score(test_data_finite, test_labels))\n\nfeatures_dummies_sub = pd.get_dummies(features[['pclass', 'sex', 'age', 'sibsp', 'fare']])\ndata_sub = features_dummies_sub.values\n\ntrain_data_sub, test_data_sub, train_labels, test_labels = train_test_split(data_sub, labels, random_state=0)\n\nimp = Imputer()\nimp.fit(train_data_sub)\ntrain_data_finite_sub = imp.transform(train_data_sub)\ntest_data_finite_sub = imp.transform(test_data_sub)\n                                         \nlr = LogisticRegression().fit(train_data_finite_sub, train_labels)\nprint(\"logistic regression score w/o embark, parch: %f\" % lr.score(test_data_finite_sub, test_labels))\nrf = RandomForestClassifier(n_estimators=500, random_state=0).fit(train_data_finite_sub, train_labels)\nprint(\"random forest score w/o embark, parch: %f\" % rf.score(test_data_finite_sub, test_labels))\n"
  },
  {
    "path": "notebooks/solutions/11_ngrams.py",
    "content": "text = zen.split(\"\\n\")\nfor n in [2, 3, 4]:\n    cv = CountVectorizer(ngram_range=(n, n)).fit(text)\n    counts = cv.transform(text)\n    most_common = np.argmax(counts.sum(axis=0))\n    print(\"most common %d-gram: %s\" % (n, cv.get_feature_names()[most_common]))\n\n\nfor norm in [\"l2\", None]:\n    tfidf_vect = TfidfVectorizer(norm=norm).fit(text)\n    data_tfidf = tfidf_vect.transform(text) \n    most_common = tfidf_vect.get_feature_names()[np.argmax(data_tfidf.max(axis=0).toarray())]\n    print(\"highest tf-idf with norm=%s: %s\" % (norm, most_common))\n"
  },
  {
    "path": "notebooks/solutions/12A_tfidf.py",
    "content": "from sklearn.feature_extraction.text import TfidfVectorizer\n\nvectorizer = TfidfVectorizer()\nvectorizer.fit(text_train)\n\nX_train = vectorizer.transform(text_train)\nX_test = vectorizer.transform(text_test)\n\nclf = LogisticRegression()\nclf.fit(X_train, y_train)\n\nprint(clf.score(X_train, y_train))\nprint(clf.score(X_test, y_test))\n\nvisualize_coefficients(clf, vectorizer.get_feature_names())\n"
  },
  {
    "path": "notebooks/solutions/12B_vectorizer_params.py",
    "content": "# CountVectorizer\nvectorizer = CountVectorizer(min_df=10, ngram_range=(1, 3))\nvectorizer.fit(text_train)\n\nX_train = vectorizer.transform(text_train)\nX_test = vectorizer.transform(text_test)\n\nclf = LogisticRegression()\nclf.fit(X_train, y_train)\n\nvisualize_coefficients(clf, vectorizer.get_feature_names())\n\n# TfidfVectorizer\nvectorizer = TfidfVectorizer(min_df=10, ngram_range=(1, 3))\nvectorizer.fit(text_train)\n\nX_train = vectorizer.transform(text_train)\nX_test = vectorizer.transform(text_test)\n\nclf = LogisticRegression()\nclf.fit(X_train, y_train)\n\nvisualize_coefficients(clf, vectorizer.get_feature_names())\n"
  },
  {
    "path": "notebooks/solutions/13_cross_validation.py",
    "content": "cv = KFold(n_splits=3)\ncross_val_score(classifier, iris.data, iris.target, cv=cv)\n"
  },
  {
    "path": "notebooks/solutions/14_grid_search.py",
    "content": "from sklearn.datasets import load_digits\nfrom sklearn.neighbors import KNeighborsClassifier\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)\n\nparam_grid = {'n_neighbors': [1, 3, 5, 10, 50]}\ngs = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5, verbose=3)\ngs.fit(X_train, y_train)\nprint(\"Score on test set: %f\" % gs.score(X_test, y_test))\nprint(\"Best parameters: %s\" % gs.best_params_)\n"
  },
  {
    "path": "notebooks/solutions/15A_ridge_grid.py",
    "content": "from sklearn.preprocessing import StandardScaler\nfrom sklearn.datasets import load_boston\nfrom sklearn.preprocessing import PolynomialFeatures\nfrom sklearn.linear_model import Ridge\n\nboston = load_boston()\ntext_train, text_test, y_train, y_test = train_test_split(boston.data,\n                                                          boston.target,\n                                                          test_size=0.25,\n                                                          random_state=123)\n\npipeline = make_pipeline(StandardScaler(),\n                         PolynomialFeatures(),\n                         Ridge())\n\ngrid = GridSearchCV(pipeline,\n                    param_grid={'polynomialfeatures__degree': [1, 2, 3]}, cv=5)\n\ngrid.fit(text_train, y_train)\n\nprint('best parameters:', grid.best_params_)\nprint('best score:', grid.best_score_)\nprint('test score:', grid.score(text_test, y_test))\n"
  },
  {
    "path": "notebooks/solutions/16A_avg_per_class_acc.py",
    "content": "def accuracy(true, pred):\n    return (true == pred).sum() / float(true.shape[0])\n\n\ndef macro(true, pred):\n    scores = []\n    for l in np.unique(true):\n        scores.append(accuracy(np.where(true != l, 1, 0),\n                               np.where(pred != l, 1, 0)))\n    return float(sum(scores)) / float(len(scores))\n\ny_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2])\ny_pred = np.array([0, 1, 1, 0, 1, 1, 2, 2, 2, 2])\n\n\nprint('accuracy:', accuracy(y_true, y_pred))\nprint('average-per-class accuracy:', macro(y_true, y_pred))\n"
  },
  {
    "path": "notebooks/solutions/17A_logreg_grid.py",
    "content": "from sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split, GridSearchCV\nfrom sklearn.linear_model import LogisticRegression\n\ndigits = load_digits()\nX_digits, y_digits = digits.data, digits.target\nX_digits_train, X_digits_test, y_digits_train, y_digits_test = train_test_split(X_digits, y_digits, random_state=1)\n\nparam_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}\n\ngrid = GridSearchCV(LogisticRegression(), param_grid=param_grid, cv=5, verbose=3)\ngrid.fit(X_digits_train, y_digits_train)\nprint('Best score for LogisticRegression: {}'.format(grid.score(X_digits_test, y_digits_test)))\nprint('Best parameters for LogisticRegression: {}'.format(grid.best_params_))\n"
  },
  {
    "path": "notebooks/solutions/17B_learning_curve_alpha.py",
    "content": "X, y, true_coefficient = make_regression(n_samples=200, n_features=30, n_informative=10, noise=100, coef=True, random_state=5)\n\nplt.figure(figsize=(10, 5))\nplt.title('alpha=1')\nplot_learning_curve(LinearRegression(), X, y)\nplot_learning_curve(Ridge(alpha=1), X, y)\nplot_learning_curve(Lasso(alpha=1), X, y)\n\nplt.figure(figsize=(10, 5))\nplt.title('alpha=10')\nplot_learning_curve(LinearRegression(), X, y)\nplot_learning_curve(Ridge(alpha=10), X, y)\nplot_learning_curve(Lasso(alpha=10), X, y)\n\nplt.figure(figsize=(10, 5))\nplt.title('alpha=100')\nplot_learning_curve(LinearRegression(), X, y)\nplot_learning_curve(Ridge(alpha=100), X, y)\nplot_learning_curve(Lasso(alpha=100), X, y)\n"
  },
  {
    "path": "notebooks/solutions/18_svc_grid.py",
    "content": "from sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split, GridSearchCV\nfrom sklearn.svm import SVC\n\ndigits = load_digits()\nX_digits, y_digits = digits.data, digits.target\nX_digits_train, X_digits_test, y_digits_train, y_digits_test = train_test_split(X_digits, y_digits, random_state=1)\n\nparam_grid = {'C': [0.001, 0.01, 0.1, 1, 10],\n              'gamma': [0.01, 0.1, 1, 10, 100]}\n\ngrid = GridSearchCV(SVC(), param_grid=param_grid, cv=5, verbose=3)\ngrid.fit(X_digits_train, y_digits_train)\nprint('Best score for SVC: {}'.format(grid.score(X_digits_test, y_digits_test)))\nprint('Best parameters for SVC: {}'.format(grid.best_params_))\n"
  },
  {
    "path": "notebooks/solutions/19_gbc_grid.py",
    "content": "from sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split, GridSearchCV\nfrom sklearn.ensemble import GradientBoostingClassifier\n\ndigits = load_digits()\nX_digits, y_digits = digits.data, digits.target\nX_digits_train, X_digits_test, y_digits_train, y_digits_test = train_test_split(X_digits, y_digits, random_state=1)\n\nparam_grid = {'learning_rate': [0.01, 0.1, 0.1, 0.5, 1.0],\n              'max_depth':[1, 3, 5, 7, 9]}\n\ngrid = GridSearchCV(GradientBoostingClassifier(), param_grid=param_grid, cv=5, verbose=3)\ngrid.fit(X_digits_train, y_digits_train)\nprint('Best score for GradientBoostingClassifier: {}'.format(grid.score(X_digits_test, y_digits_test)))\nprint('Best parameters for GradientBoostingClassifier: {}'.format(grid.best_params_))\n"
  },
  {
    "path": "notebooks/solutions/20_univariate_vs_mb_selection.py",
    "content": "from sklearn.feature_selection import SelectKBest, SelectFromModel\nfrom sklearn.ensemble import RandomForestClassifier\nimport numpy as np\n\nrng = np.random.RandomState(1)\nX = rng.randint(0, 2, (200, 20))\ny = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)\n\nfs_univariate = SelectKBest(k=10)\nfs_modelbased = SelectFromModel(RandomForestClassifier(n_estimators=100), threshold='median')\n\nfs_univariate.fit(X, y)\nprint('Features selected by univariate selection:')\nprint(fs_univariate.get_support())\nprint('')\n\nfs_modelbased.fit(X, y)\nprint('Features selected by model-based selection:')\nprint(fs_modelbased.get_support())\n"
  },
  {
    "path": "notebooks/solutions/21_clustering_comparison.py",
    "content": "from sklearn.datasets import make_circles\nfrom sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN\n\nX, y = make_circles(n_samples=1500, \n                    factor=.4, \n                    noise=.05)\n\nkm = KMeans(n_clusters=2)\nplt.figure()\nplt.scatter(X[:, 0], X[:, 1], c=km.fit_predict(X))\n\nac = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete')\nplt.figure()\nplt.scatter(X[:, 0], X[:, 1], c=ac.fit_predict(X))\n\ndb = DBSCAN(eps=0.2)\nplt.figure()\nplt.scatter(X[:, 0], X[:, 1], c=db.fit_predict(X));\n"
  },
  {
    "path": "notebooks/solutions/22A_isomap_digits.py",
    "content": "from sklearn.manifold import Isomap\niso = Isomap(n_components=2)\ndigits_isomap = iso.fit_transform(digits.data)\n\nplt.figure(figsize=(10, 10))\nplt.xlim(digits_isomap[:, 0].min(), digits_isomap[:, 0].max() + 1)\nplt.ylim(digits_isomap[:, 1].min(), digits_isomap[:, 1].max() + 1)\nfor i in range(len(digits.data)):\n    # actually plot the digits as text instead of using scatter\n    plt.text(digits_isomap[i, 0], digits_isomap[i, 1], str(digits.target[i]),\n             color = colors[digits.target[i]],\n             fontdict={'weight': 'bold', 'size': 9})\n"
  },
  {
    "path": "notebooks/solutions/22B_tsne_classification.py",
    "content": "from sklearn.manifold import TSNE\nfrom sklearn.neighbors import KNeighborsClassifier\nfrom sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=1)\n\nclf = KNeighborsClassifier()\nclf.fit(X_train, y_train)\nprint('KNeighborsClassifier accuracy without t-SNE: {}'.format(clf.score(X_test, y_test)))\n\ntsne = TSNE(random_state=42)\ndigits_tsne_train = tsne.fit_transform(X_train)\ndigits_tsne_test = tsne.fit_transform(X_test)\n\nclf = KNeighborsClassifier()\nclf.fit(digits_tsne_train, y_train)\nprint('KNeighborsClassifier accuracy with t-SNE: {}'.format(clf.score(digits_tsne_test, y_test)))\n"
  },
  {
    "path": "notebooks/solutions/23_batchtrain.py",
    "content": "import os\nimport numpy as np\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.feature_extraction.text import HashingVectorizer\nfrom sklearn.base import clone\nfrom sklearn.datasets import load_files\n\n\ndef batch_train(clf, fnames, labels, iterations=1,\n                batchsize=1000, random_seed=1):\n    vec = HashingVectorizer(encoding='latin-1')\n    idx = np.arange(labels.shape[0])\n    c_clf = clone(clf)\n    rng = np.random.RandomState(seed=random_seed)\n    shuffled_idx = rng.permutation(range(len(fnames)))\n    fnames_ary = np.asarray(fnames)\n\n    for _ in range(iterations):\n        for batch in np.split(shuffled_idx, len(fnames) // 1000):\n            documents = []\n            for fn in fnames_ary[batch]:\n                with open(fn, 'r') as f:\n                    documents.append(f.read())\n            X_batch = vec.transform(documents)\n            batch_labels = labels[batch]\n            c_clf.partial_fit(X=X_batch,\n                              y=batch_labels,\n                              classes=[0, 1])\n\n    return c_clf\n\n\n# Out-of-core Training\ntrain_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'train')\ntrain_pos = os.path.join(train_path, 'pos')\ntrain_neg = os.path.join(train_path, 'neg')\n\nfnames = [os.path.join(train_pos, f) for f in os.listdir(train_pos)] +\\\n         [os.path.join(train_neg, f) for f in os.listdir(train_neg)]\ny_train = np.zeros((len(fnames), ), dtype=int)\ny_train[:12500] = 1\nnp.bincount(y_train)\n\nsgd = SGDClassifier(loss='log', random_state=1)\n\nsgd = batch_train(clf=sgd,\n                  fnames=fnames,\n                  labels=y_train)\n\n\n# Testing\ntest_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'test')\ntest = load_files(container_path=(test_path),\n                  categories=['pos', 'neg'])\ndocs_test, y_test = test['data'][12500:], test['target'][12500:]\n\nvec = HashingVectorizer(encoding='latin-1')\nprint('accuracy:', sgd.score(vec.transform(docs_test), y_test))\n"
  },
  {
    "path": "requirements.txt",
    "content": "# brew update && brew install gcc (this includes gfortran)\nipython[all]>=3.2.0\npyzmq>=14.7.0\nPillow>=2.9.0\nnumpy>=1.9.2\nscipy>=0.15.1\nscikit-learn>=0.18\nmatplotlib>=1.4.3\ngraphviz>=0.4.4\npyparsing==1.5.7\npydot\npandas>=0.19\nwatermark\n"
  },
  {
    "path": "todo.rst",
    "content": "replace spam by imdb text data\nmake sure there are notebooks for all sections\nmake sure there are exercises everywhere\n"
  }
]