Repository: justmarkham/DAT5
Branch: master
Commit: 87aa6d195393
Files: 78
Total size: 2.7 MB
Directory structure:
gitextract_q__99hvo/
├── .gitignore
├── README.md
├── code/
│ ├── 00_python_beginner_workshop.py
│ ├── 00_python_intermediate_workshop.py
│ ├── 01_chipotle_homework_solution.py
│ ├── 01_reading_files.py
│ ├── 03_exploratory_analysis_pandas.py
│ ├── 04_apis.py
│ ├── 04_visualization.py
│ ├── 05_iris_exercise.py
│ ├── 05_sklearn_knn.py
│ ├── 07_glass_id_homework_solution.py
│ ├── 08_web_scraping.py
│ ├── 10_logistic_regression_confusion_matrix.py
│ ├── 13_naive_bayes.py
│ ├── 15_kaggle.py
│ ├── 17_ensembling_exercise.py
│ ├── 18_clustering.py
│ ├── 18_regularization.py
│ ├── 19_advanced_sklearn.py
│ ├── 19_gridsearchcv_exercise.py
│ ├── 19_regex_exercise.py
│ ├── 19_regex_reference.py
│ ├── 20_sql.py
│ └── 21_ensembles_example.py
├── data/
│ ├── SMSSpamCollection.txt
│ ├── airline_safety.csv
│ ├── auto_mpg.txt
│ ├── chipotle_orders.tsv
│ ├── default.csv
│ ├── drinks.csv
│ ├── homicides.txt
│ ├── imdb_movie_ratings_top_1000.csv
│ ├── imdb_movie_urls.csv
│ ├── kaggle_tweets.csv
│ ├── titanic_train.csv
│ ├── vehicles_test.csv
│ └── vehicles_train.csv
├── homework/
│ ├── 02_command_line_hw_soln.md
│ ├── 03_pandas_hw_soln.py
│ ├── 04_visualization_hw_soln.py
│ ├── 06_bias_variance.md
│ ├── 07_glass_identification.md
│ ├── 11_roc_auc.md
│ ├── 11_roc_auc_annotated.md
│ ├── 13_spam_filtering.md
│ └── 13_spam_filtering_annotated.md
├── notebooks/
│ ├── 06_bias_variance.ipynb
│ ├── 06_model_evaluation_procedures.ipynb
│ ├── 09_linear_regression.ipynb
│ ├── 11_cross_validation.ipynb
│ ├── 11_roc_auc.ipynb
│ ├── 11_titanic_exercise.ipynb
│ ├── 13_bayes_iris.ipynb
│ ├── 13_naive_bayes_spam.ipynb
│ ├── 14_nlp.ipynb
│ ├── 16_decision_trees.ipynb
│ ├── 17_ensembling.ipynb
│ └── 18_regularization.ipynb
├── other/
│ ├── peer_review.md
│ ├── project.md
│ ├── public_data.md
│ └── resources.md
└── slides/
├── 01_course_overview.pptx
├── 02_Introduction_to_the_Command_Line.md
├── 02_git_github.pptx
├── 04_apis.pptx
├── 04_visualization.pptx
├── 05_intro_to_data_science.pptx
├── 05_machine_learning_knn.pptx
├── 08_web_scraping.pptx
├── 10_logistic_regression_confusion_matrix.pptx
├── 11_drawing_roc.pptx
├── 13_bayes_theorem.pptx
├── 13_naive_bayes.pptx
├── 15_kaggle.pptx
├── 18_clustering.pptx
└── 20_sql.pptx
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
.ipynb_checkpoints/
.DS_Store
*.pyc
================================================
FILE: README.md
================================================
## DAT5 Course Repository
Course materials for [General Assembly's Data Science course](https://generalassemb.ly/education/data-science/washington-dc/) in Washington, DC (3/18/15 - 6/3/15).
**Instructors:** Brandon Burroughs and Kevin Markham ([Data School blog](http://www.dataschool.io/), [email newsletter](http://www.dataschool.io/subscribe/), [YouTube channel](https://www.youtube.com/user/dataschool))
Monday | Wednesday
--- | ---
| 3/18: Introduction and Python
3/23: Git and Command Line | 3/25: Exploratory Data Analysis
**3/30:** Visualization and APIs | 4/1: Machine Learning and KNN
**4/6:** Bias-Variance and Model Evaluation | 4/8: Kaggle Titanic
4/13: Web Scraping, Tidy Data, Reproducibility | 4/15: Linear Regression
4/20: Logistic Regression and Confusion Matrices | 4/22: ROC and Cross-Validation
**4/27:** Project Presentation #1 | 4/29: Naive Bayes
5/4: Natural Language Processing | 5/6: Kaggle Stack Overflow
5/11: Decision Trees | 5/13: Ensembles
**5/18:** Clustering and Regularization | 5/20: Advanced scikit-learn and Regex
**5/25:** *No Class* | 5/27: Databases and SQL
6/1: Course Review | **6/3:** Project Presentation #2
### Key Project Dates
* **3/30:** Deadline for discussing your project idea(s) with an instructor
* **4/6:** Project question and dataset (write-up)
* **4/27:** Project presentation #1 (slides, code, visualizations)
* **5/18:** First draft due (draft of project paper, code, visualizations)
* **5/25:** Peer review due
* **6/3:** Project presentation #2 (project paper, slides, code, visualizations, data, data dictionary)
### Key Project Links
* [Course project requirements](other/project.md)
* [Public data sources](other/public_data.md)
* [Kaggle competitions](http://www.kaggle.com/)
* [Examples of student projects](https://github.com/justmarkham/DAT-project-examples)
* [Peer review guidelines](other/peer_review.md)
### Logistics
* Office hours will take place every Saturday and Sunday.
* Homework will be assigned every Wednesday and due on Monday, and you'll receive feedback by Wednesday.
* Our primary tool for out-of-class communication will be a private chat room through [Slack](https://slack.com/).
### Submission Forms
* [Homework submission form](http://bit.ly/dat5homework) (also for project submissions)
* [Gist](https://gist.github.com/) is an easy way to put your homework online
* [Feedback submission form](http://bit.ly/dat5feedback) (at the end of every class)
### Before the Course Begins
* Install the [Anaconda distribution](http://continuum.io/downloads) of Python 2.7.x.
* Install [Git](http://git-scm.com/book/en/v2/Getting-Started-Installing-Git) and create a [GitHub](https://github.com/) account.
* Once you receive an email invitation from Slack, join our "DAT5 team" and add your photo.
* Choose a [Python workshop](https://generalassemb.ly/education?format=classes-workshops) to attend, depending upon your current skill level:
* Beginner: [Saturday 3/7 10am-2pm](https://generalassemb.ly/education/introduction-to-python-programming/washington-dc/11137) or [Thursday 3/12 6:30pm-9pm](https://generalassemb.ly/education/introduction-to-python-programming/washington-dc/11136)
* Intermediate: [Saturday 3/14 10am-2pm](https://generalassemb.ly/education/python-for-data-science-intermediate/washington-dc/11167)
* Practice your Python using the resources below.
### Python Resources
* [Codecademy's Python course](http://www.codecademy.com/en/tracks/python): Good beginner material, including tons of in-browser exercises.
* [DataQuest](https://dataquest.io/missions): Similar interface to Codecademy, but focused on teaching Python in the context of data science.
* [Google's Python Class](https://developers.google.com/edu/python/): Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
* [A Crash Course in Python for Scientists](http://nbviewer.ipython.org/gist/rpmuller/5920182): Read through the Overview section for a quick introduction to Python.
* [Python for Informatics](http://www.pythonlearn.com/book.php): A very beginner-oriented book, with associated [slides](https://drive.google.com/folderview?id=0B7X1ycQalUnyal9yeUx3VW81VDg&usp=sharing) and [videos](https://www.youtube.com/playlist?list=PLlRFEj9H3Oj4JXIwMwN1_ss1Tk8wZShEJ).
* Code from our [beginner](code/00_python_beginner_workshop.py) and [intermediate](code/00_python_intermediate_workshop.py) workshops: Useful for review and reference.
-----
### Class 1: Introduction and Python
* Introduction to General Assembly
* Course overview ([slides](slides/01_course_overview.pdf))
* Brief tour of Slack
* Checking the setup of your laptop
* Python lesson with [airline safety data](https://github.com/fivethirtyeight/data/tree/master/airline-safety) ([code](code/01_reading_files.py))
**Homework:**
* Python exercises with [Chipotle order data](https://github.com/TheUpshot/chipotle) (listed at bottom of [code](code/01_reading_files.py) file) ([solution](code/01_chipotle_homework_solution.py))
* Work through GA's excellent introductory [command line tutorial](http://generalassembly.github.io/prework/command-line/#/) and then take this brief [quiz](https://gahub.typeform.com/to/J6xirf).
* Read through the [course project requirements](other/project.md) and start thinking about your own project!
**Optional:**
* If we discovered any setup issues with your laptop, please resolve them before Monday.
* If you're not feeling comfortable in Python, keep practicing using the resources above!
-----
### Class 2: Git and Command Line
* Any questions about the course project?
* Command line ([slides](slides/02_Introduction_to_the_Command_Line.md))
* Git and GitHub ([slides](slides/02_git_github.pdf))
**Homework:**
* Command line exercises with [SMS Spam Data](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) (listed at the bottom of [Introduction to the Command Line](slides/02_Introduction_to_the_Command_Line.md)) ([solution](homework/02_command_line_hw_soln.md))
* **Note**: This homework is not due until Monday. You might want to create a GitHub repo for your homework instead of using Gist!
**Optional:**
* Browse through some [example student projects](https://github.com/justmarkham/DAT-project-examples) to stimulate your thinking and give you a sense of project scope.
**Resources:**
* This [Command Line Primer](http://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything) goes a bit more into command line scripting.
* Read the first two chapters of [Pro Git](http://git-scm.com/book/en/v2) to gain a much deeper understanding of version control and basic Git commands.
* Watch [Introduction to Git and GitHub](https://www.youtube.com/playlist?list=PL5-da3qGB5IBLMp7LtN8Nc3Efd4hJq0kD) (36 minutes) for a quick review of a lot of today's material.
* [GitRef](http://gitref.org/) is an excellent reference guide for Git commands, and [Git quick reference for beginners](http://www.dataschool.io/git-quick-reference-for-beginners/) is a shorter guide with commands grouped by workflow.
* The [Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) covers standard Markdown and a bit of "[GitHub Flavored Markdown](https://help.github.com/articles/github-flavored-markdown/)."
-----
### Class 3: Pandas
* Pandas for data exploration, analysis, and visualization ([code](code/03_exploratory_analysis_pandas.py))
* [Split-Apply-Combine](http://i.imgur.com/yjNkiwL.png) pattern
* Simple examples of [joins in Pandas](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/#joining)
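The Split-Apply-Combine pattern linked above maps directly onto Pandas `groupby`. Here is a minimal sketch using the course's drinks data (the column names `continent` and `beer_servings` are assumed from the FiveThirtyEight drinks dataset):

```python
import pandas as pd

# read the drinks data (assumes the file is in the data/ folder)
drinks = pd.read_csv('data/drinks.csv')

# split the rows by continent, apply the mean to one column, combine the results
drinks.groupby('continent').beer_servings.mean()

# the same pattern with several aggregations at once
drinks.groupby('continent').beer_servings.agg(['count', 'mean', 'min', 'max'])
```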
**Homework:**
* Pandas practice with [Automobile MPG Data](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) (listed at the bottom of [Exploratory Analysis in Pandas](code/03_exploratory_analysis_pandas.py)) ([solution](homework/03_pandas_hw_soln.py))
* Talk to an instructor about your project
* Don't forget about the Command line exercises (listed at the bottom of [Introduction to the Command Line](slides/02_Introduction_to_the_Command_Line.md))
**Optional:**
* To learn more Pandas, review this [three-part tutorial](http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/), or review these two excellent (but extremely long) notebooks on Pandas: [introduction](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_5-Introduction-to-Pandas.ipynb) and [data wrangling](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_6-Data-Wrangling-with-Pandas.ipynb).
* Read [How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips](http://iquantny.tumblr.com/post/107245431809/how-software-in-half-of-nyc-cabs-generates-5-2) for an excellent example of exploratory data analysis.
-----
### Class 4: Visualization and APIs
* Visualization ([slides](slides/04_visualization.pdf) and [code](code/04_visualization.py))
* APIs ([slides](slides/04_apis.pdf) and [code](code/04_apis.py))
**Homework:**
* Visualization practice with [Automobile MPG Data](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) (listed at the bottom of [the visualization code](code/04_visualization.py)) ([solution](homework/04_visualization_hw_soln.py))
* **Note**: This homework isn't due until Monday.
**Optional:**
* Watch [Look at Your Data](https://www.youtube.com/watch?v=coNDCIMH8bk) (18 minutes) for an excellent example of why visualization is useful for understanding your data.
**Resources:**
* For more on Pandas plotting, read this [notebook](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_7-Plotting-with-Pandas.ipynb) or the [visualization page](http://pandas.pydata.org/pandas-docs/stable/visualization.html) from the official Pandas documentation.
* To learn how to customize your plots further, browse through this [notebook on matplotlib](http://nbviewer.ipython.org/github/fonnesbeck/Bios8366/blob/master/notebooks/Section2_4-Matplotlib.ipynb) or this [similar notebook](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb).
* To explore different types of visualizations and when to use them, [Choosing a Good Chart](http://extremepresentation.typepad.com/files/choosing-a-good-chart-09.pdf) and [The Graphic Continuum](http://www.coolinfographics.com/storage/post-images/The-Graphic-Continuum-POSTER.jpg) are handy one-page references, or check out the [R Graph Catalog](http://shinyapps.stat.ubc.ca/r-graph-catalog/).
* For a more in-depth introduction to visualization, browse through these [PowerPoint slides](http://www2.research.att.com/~volinsky/DataMining/Columbia2011/Slides/Topic2-EDAViz.ppt) from Columbia's Data Mining class.
* [Mashape](https://www.mashape.com/explore) and [Apigee](https://apigee.com/providers) allow you to explore tons of different APIs. Alternatively, a [Python API wrapper](http://www.pythonforbeginners.com/api/list-of-python-apis) is available for many popular APIs.
-----
### Class 5: Data Science Workflow, Machine Learning, KNN
* Iris dataset
* [What does an iris look like?](http://sebastianraschka.com/Images/2014_python_lda/iris_petal_sepal.png)
* [Data](http://archive.ics.uci.edu/ml/datasets/Iris) hosted by the UCI Machine Learning Repository
* "Human learning" exercise ([solution](code/05_iris_exercise.py))
* Introduction to data science ([slides](slides/05_intro_to_data_science.pdf))
* [Quora: What is data science?](https://www.quora.com/What-is-data-science/answer/Michael-Hochster)
* [Data science Venn diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)
* [Quora: What is the workflow of a data scientist?](http://www.quora.com/What-is-the-work-flow-or-process-of-a-data-scientist/answer/Ryan-Fox-Squire)
* Example student project: [MetroMetric](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/bus_presentation.pdf)
* Machine learning and KNN ([slides](slides/05_machine_learning_knn.pdf))
* [Reddit AMA with Yann LeCun](http://www.reddit.com/r/MachineLearning/comments/25lnbt/ama_yann_lecun)
* [Characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/)
* Introduction to scikit-learn ([code](code/05_sklearn_knn.py))
* Documentation: [user guide](http://scikit-learn.org/stable/modules/neighbors.html), [module reference](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors), [class documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
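Below is a minimal sketch of the scikit-learn pattern introduced in this class (instantiate, fit, predict) using KNN on the iris data; see `05_sklearn_knn.py` for the full walkthrough:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load the iris data: X is the feature matrix, y is the response vector
iris = load_iris()
X, y = iris.data, iris.target

# instantiate the model with K=5, fit it, and predict the species of a new flower
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
knn.predict([[3, 5, 4, 2]])
```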
**Homework:**
* Complete your visualization homework assigned in class 4
* [Reading assignment on the bias-variance tradeoff](homework/06_bias_variance.md)
* A write-up about your [project question and dataset](other/project.md) is due on Monday! ([example one](https://github.com/justmarkham/DAT4-students/blob/master/jason/jk_project_idea.md), [example two](https://github.com/justmarkham/DAT4-students/blob/master/alexlee/project_question.md))
**Optional:**
* For a useful look at the different types of data scientists, read [Analyzing the Analyzers](http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf) (32 pages).
* For some thoughts on what it's like to be a data scientist, read these short posts from [Win-Vector](http://www.win-vector.com/blog/2012/09/on-being-a-data-scientist/) and [Datascope Analytics](http://datascopeanalytics.com/what-we-think/2014/07/31/six-qualities-of-a-great-data-scientist).
* For a fun (yet enlightening) look at the data science workflow, read [What I do when I get a new data set as told through tweets](http://simplystatistics.org/2014/06/13/what-i-do-when-i-get-a-new-data-set-as-told-through-tweets/).
* For a more in-depth introduction to data science, browse through these [PowerPoint slides](http://www2.research.att.com/~volinsky/DataMining/Columbia2011/Slides/Topic1-DMIntro.ppt) from Columbia's Data Mining class.
* For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/). (It's a free PDF download!)
* For a really nice comparison of supervised versus unsupervised learning, plus an introduction to reinforcement learning, watch this [video](http://work.caltech.edu/library/014.html) (13 minutes) from Caltech's [Learning From Data](http://work.caltech.edu/telecourse.html) course.
**Resources:**
* Quora has a [data science topic FAQ](https://www.quora.com/What-is-the-Data-Science-topic-FAQ) with lots of interesting Q&A.
* Keep up with local data-related events through the Data Community DC [event calendar](http://www.datacommunitydc.org/calendar) or [weekly newsletter](http://www.datacommunitydc.org/thenewsletter/).
-----
### Class 6: Bias-Variance Tradeoff and Model Evaluation
* Brief introduction to the IPython Notebook
* Exploring the bias-variance tradeoff ([notebook](notebooks/06_bias_variance.ipynb))
* Discussion of the [assigned reading](homework/06_bias_variance.md) on the bias-variance tradeoff
* Model evaluation procedures ([notebook](notebooks/06_model_evaluation_procedures.ipynb))
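A minimal sketch of the train/test split procedure covered in the model evaluation notebook (note that `train_test_split` lives in `sklearn.cross_validation` in older scikit-learn releases and in `sklearn.model_selection` in newer ones):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()

# hold out 30% of the data so the model is evaluated on observations it has never seen
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, knn.predict(X_test)))
```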
**Resources:**
* If you would like to learn the IPython Notebook, the official [Notebook tutorials](http://nbviewer.ipython.org/github/ipython/ipython/blob/master/examples/Notebook/Index.ipynb) are useful.
* To get started with Seaborn for visualization, the official website has a series of [tutorials](http://web.stanford.edu/~mwaskom/software/seaborn/tutorial.html) and an [example gallery](http://web.stanford.edu/~mwaskom/software/seaborn/examples/index.html).
* Hastie and Tibshirani have an excellent [video](https://www.youtube.com/watch?v=_2ij6eaaSl0&t=2m34s) (12 minutes, starting at 2:34) that covers training error versus testing error, the bias-variance tradeoff, and train/test split (which they call the "validation set approach").
* Caltech's Learning From Data course includes a fantastic [video](http://work.caltech.edu/library/081.html) (15 minutes) that may help you to visualize bias and variance.
-----
### Class 7: Kaggle Titanic
* Guest instructor: [Josiah Davis](https://generalassemb.ly/instructors/josiah-davis/3315)
* Participate in Kaggle's [Titanic competition](http://www.kaggle.com/c/titanic-gettingStarted)
* Work in pairs, but the goal is for every person to make at least one submission by the end of the class period!
**Homework:**
* Option 1 is to do the [Glass identification homework](homework/07_glass_identification.md). This is a good option if you are still getting comfortable with what we have learned so far, and prefer a very structured assignment. ([solution](code/07_glass_id_homework_solution.py))
* Option 2 is to keep working on the Titanic competition, and see if you can make some additional progress! This is a good assignment if you are feeling comfortable with the material and want to learn a bit more on your own.
* In either case, please submit your code as usual, and include lots of code comments!
-----
### Class 8: Web Scraping, Tidy Data, Reproducibility
* Web scraping ([slides](slides/08_web_scraping.pdf) and [code](code/08_web_scraping.py)); a short scraping sketch appears after this list
* [HTML Tree](http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png)
* Tidy data:
* [Introduction](http://stat405.had.co.nz/lectures/18-tidy-data.pdf)
* Example datasets: [Bob Ross](https://github.com/fivethirtyeight/data/blob/master/bob-ross/elements-by-episode.csv), [NFL ticket prices](https://github.com/fivethirtyeight/data/blob/master/nfl-ticket-prices/2014-average-ticket-price.csv), [airline safety](https://github.com/fivethirtyeight/data/blob/master/airline-safety/airline-safety.csv), [Jets ticket prices](https://github.com/fivethirtyeight/data/blob/master/nfl-ticket-prices/jets-buyer.csv), [Chipotle orders](https://github.com/TheUpshot/chipotle/blob/master/orders.tsv)
* Reproducibility:
* [Introduction](http://www.dataschool.io/reproducibility-is-not-just-for-researchers/), [Tweet](https://twitter.com/jakevdp/status/519563939177197571)
* [Components of reproducible analysis](https://github.com/jtleek/datasharing)
* Examples: [Classic rock](https://github.com/fivethirtyeight/data/tree/master/classic-rock), [student project 1](https://github.com/jwknobloch/DAT4_final_project), [student project 2](https://github.com/justmarkham/DAT4-students/tree/master/Jonathan_Bryan/Project_Files)
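As referenced above, here is a minimal web scraping sketch with requests and Beautiful Soup (the URL is a placeholder, not a page from the class code):

```python
import requests
from bs4 import BeautifulSoup

# fetch a page and parse the HTML into a tree (the "HTML Tree" linked above)
r = requests.get('http://www.example.com')   # placeholder URL
soup = BeautifulSoup(r.text)                 # pass 'html.parser' as a second argument on newer versions

# find elements by tag and pull out an attribute from each one
for link in soup.find_all('a'):
    print(link.get('href'))
```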
**Resources:**
* This [web scraping tutorial from Stanford](http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html) provides an example of getting a list of items.
* If you want to learn more about tidy data, [Hadley Wickham's paper](http://www.jstatsoft.org/v59/i10/paper) has a lot of nice examples.
* If your co-workers tend to create spreadsheets that are [unreadable by computers](https://bosker.wordpress.com/2014/12/05/the-government-statistical-services-terrible-spreadsheet-advice/), perhaps they would benefit from reading this list of [tips for releasing data in spreadsheets](http://www.clean-sheet.org/). (There are some additional suggestions in this [answer](http://stats.stackexchange.com/questions/83614/best-practices-for-creating-tidy-data/83711#83711) from Cross Validated.)
* Here's [Colbert on reproducibility](http://thecolbertreport.cc.com/videos/dcyvro/austerity-s-spreadsheet-error) (8 minutes).
-----
### Class 9: Linear Regression
* Linear regression ([notebook](notebooks/09_linear_regression.ipynb))
* Simple linear regression
* Estimating and interpreting model coefficients
* Confidence intervals
* Hypothesis testing and p-values
* R-squared
* Multiple linear regression
* Feature selection
* Model evaluation metrics for regression
* Handling categorical predictors
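A minimal scikit-learn sketch of the core workflow from this list (the notebook also covers Statsmodels; the data here is a made-up placeholder):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# placeholder data: two numeric features and a numeric response
data = pd.DataFrame({'feature1': [1, 2, 3, 4, 5],
                     'feature2': [5, 3, 6, 2, 4],
                     'response': [7, 9, 13, 12, 15]})
X = data[['feature1', 'feature2']]
y = data.response

# fit the model and inspect the estimated intercept and coefficients
linreg = LinearRegression()
linreg.fit(X, y)
print(linreg.intercept_)
print(zip(['feature1', 'feature2'], linreg.coef_))  # wrap in list(...) on Python 3
```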
**Homework:**
* If you're behind on homework, use this time to catch up.
* Keep working on your project... your first presentation is in less than two weeks!!
**Resources:**
* To go much more in-depth on linear regression, read Chapter 3 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), from which this lesson was adapted. Alternatively, watch the [related videos](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/) or read my [quick reference guide](http://www.dataschool.io/applying-and-interpreting-linear-regression/) to the key points in that chapter.
* To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on [simple linear regression](http://www.datarobot.com/blog/ordinary-least-squares-in-python/) and [multiple linear regression](http://www.datarobot.com/blog/multiple-regression-using-statsmodels/).
* This [introduction to linear regression](http://people.duke.edu/~rnau/regintro.htm) is much more detailed and mathematically thorough, and includes lots of good advice.
* This is a relatively quick post on the [assumptions of linear regression](http://pareonline.net/getvn.asp?n=2&v=8).
-----
### Class 10: Logistic Regression and Confusion Matrices
* Logistic regression ([slides](slides/10_logistic_regression_confusion_matrix.pdf) and [code](code/10_logistic_regression_confusion_matrix.py))
* Confusion matrices (same links as above)
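A minimal sketch of fitting a logistic regression and reading a confusion matrix (tiny made-up data, just to show the calls):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# tiny synthetic example: one feature, binary response
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

logreg = LogisticRegression()
logreg.fit(X, y)
y_pred = logreg.predict(X)

# rows are actual classes, columns are predicted classes
print(metrics.confusion_matrix(y, y_pred))

# predicted probabilities of class 1 (these become useful for ROC curves in class 11)
print(logreg.predict_proba(X)[:, 1])
```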
**Homework:**
* Video assignment on [ROC Curves and Area Under the Curve](homework/11_roc_auc.md)
* Review the notebook from class 6 on [model evaluation procedures](notebooks/06_model_evaluation_procedures.ipynb)
**Resources:**
* For more on logistic regression, watch the [first three videos](https://www.youtube.com/playlist?list=PL5-da3qGB5IC4vaDba5ClatUmFppXLAhE) (30 minutes total) from Chapter 4 of An Introduction to Statistical Learning.
* UCLA's IDRE has a handy table to help you remember the [relationship between probability, odds, and log-odds](http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm).
* Better Explained has a very friendly introduction (with lots of examples) to the [intuition behind "e"](http://betterexplained.com/articles/an-intuitive-guide-to-exponential-functions-e/).
* Here are some useful lecture notes on [interpreting logistic regression coefficients](http://www.unm.edu/~schrader/biostat/bio2/Spr06/lec11.pdf).
* Kevin wrote a [simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) that you can use as a reference guide.
-----
### Class 11: ROC Curves and Cross-Validation
* ROC curves and Area Under the Curve
* Discuss the [video assignment](homework/11_roc_auc.md)
* Exercise: [drawing an ROC curve](slides/11_drawing_roc.pdf)
* Calculating AUC and plotting an ROC curve ([notebook](notebooks/11_roc_auc.ipynb))
* Cross-validation ([notebook](notebooks/11_cross_validation.ipynb))
* Discuss this article on [Smart Autofill for Google Sheets](http://googleresearch.blogspot.com/2014/10/smart-autofill-harnessing-predictive.html)
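A minimal sketch of the two evaluation ideas above, ROC/AUC and cross-validation (`cross_val_score` lives in `sklearn.cross_validation` in older releases and `sklearn.model_selection` in newer ones; tiny made-up data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases
from sklearn import metrics

# tiny synthetic example: one feature, binary response
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

logreg = LogisticRegression().fit(X, y)
probs = logreg.predict_proba(X)[:, 1]

# AUC, plus the points needed to plot an ROC curve
print(metrics.roc_auc_score(y, probs))
fpr, tpr, thresholds = metrics.roc_curve(y, probs)

# 4-fold cross-validated accuracy (10 folds is more typical on real data)
print(cross_val_score(LogisticRegression(), X, y, cv=4, scoring='accuracy'))
```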
**Homework:**
* Your first [project presentation](other/project.md) is on Monday! Please submit a link to your project repository (with slides, code, and visualizations) before class using the homework submission form.
**Optional:**
* Titanic exercise ([notebook](notebooks/11_titanic_exercise.ipynb))
**Resources:**
* scikit-learn has extensive documentation on [model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html).
* For more on cross-validation, read section 5.1 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (11 pages) or watch the related videos: [K-fold and leave-one-out cross-validation](https://www.youtube.com/watch?v=nZAM5OXrktY) (14 minutes), [cross-validation the right and wrong ways](https://www.youtube.com/watch?v=S06JpVoNaA0) (10 minutes).
-----
### Class 12: Project Presentation #1
* Project presentations!
**Homework:**
* Read these [Introduction to Probability](https://docs.google.com/presentation/d/1cM2dVbJgTWMkHoVNmYlB9df6P2H8BrjaqAcZTaLe9dA/edit#slide=id.gfc3caad2_00) slides (from the [OpenIntro Statistics textbook](https://www.openintro.org/stat/textbook.php)) and try the included quizzes. Pay specific attention to the following terms: probability, sample space, mutually exclusive, independent.
* Reading assignment on [spam filtering](homework/13_spam_filtering.md).
-----
### Class 13: Naive Bayes
* Conditional probability and Bayes' theorem
* [Slides](slides/13_bayes_theorem.pdf) (adapted from [Visualizing Bayes' theorem](http://oscarbonilla.com/2009/05/visualizing-bayes-theorem/))
* [Visualization of conditional probability](http://setosa.io/conditional/)
* Applying Bayes' theorem to iris classification ([notebook](notebooks/13_bayes_iris.ipynb))
* Naive Bayes classification
* [Slides](slides/13_naive_bayes.pdf)
* Example with spam email ([notebook](notebooks/13_naive_bayes_spam.ipynb))
* Discuss the reading assignment on [spam filtering](homework/13_spam_filtering.md)
* [Airport security example](http://www.quora.com/In-laymans-terms-how-does-Naive-Bayes-work/answer/Konstantin-Tt)
* Classifying [SMS messages](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) ([code](code/13_naive_bayes.py))
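A minimal sketch of the text-classification pipeline used for the SMS messages (this assumes `SMSSpamCollection.txt` is tab-separated with a label column and a message column, as in the UCI dataset; `13_naive_bayes.py` has the full version with a proper train/test split):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# read the SMS data (assumed format: label <tab> message)
sms = pd.read_csv('data/SMSSpamCollection.txt', sep='\t', header=None, names=['label', 'msg'])

# turn each message into a vector of token counts
vect = CountVectorizer()
X = vect.fit_transform(sms.msg)
y = (sms.label == 'spam').astype(int)

# fit Multinomial Naive Bayes and check the (in-sample) accuracy
nb = MultinomialNB()
nb.fit(X, y)
print(nb.score(X, y))
```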
**Homework:**
* Please download/install the following before the NLP class on Monday:
* In Spyder, `import nltk` and run `nltk.download('all')`. This downloads all of the necessary resources for the Natural Language Toolkit (NLTK).
* We'll be using two new packages/modules for this class: textblob and lda. Please install them. **Hint**: In the Terminal (Mac) or Git Bash (Windows), run `pip install textblob` and `pip install lda`.
**Resources:**
* For other intuitive introductions to Bayes' theorem, here are two good blog posts that use [ducks](https://planspacedotorg.wordpress.com/2014/02/23/bayes-rule-for-ducks/) and [legos](http://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego).
* For more on conditional probability, these [slides](https://docs.google.com/presentation/d/1psUIyig6OxHQngGEHr3TMkCvhdLInnKnclQoNUr4G4U/edit#slide=id.gfc69f484_00) may be useful.
* For more details on Naive Bayes classification, Wikipedia has two excellent articles ([Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) and [Naive Bayes spam filtering](http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)), and Cross Validated has a good [Q&A](http://stats.stackexchange.com/questions/21822/understanding-naive-bayes).
* If you enjoyed Paul Graham's article, you can read [his follow-up article](http://www.paulgraham.com/better.html) on how he improved his spam filter and this [related paper](http://www.merl.com/publications/docs/TR2004-091.pdf) about state-of-the-art spam filtering in 2004.
* If you're planning on using text features in your project, it's worth exploring the different types of [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html) and the many options for [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
-----
### Class 14: Natural Language Processing
* Natural Language Processing ([notebook](notebooks/14_nlp.ipynb))
* NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition, LDA
* Alternative: TextBlob
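A minimal sketch of a few of the NLTK/TextBlob operations listed above (assumes the `nltk.download('all')` step from the previous class was completed):

```python
import nltk
from textblob import TextBlob

sentence = 'The quick brown foxes are jumping over the lazy dogs.'

# tokenization and part-of-speech tagging
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))

# stemming and lemmatization
stemmer = nltk.stem.PorterStemmer()
print([stemmer.stem(t) for t in tokens])
lemmatizer = nltk.stem.WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])

# TextBlob wraps much of this in a friendlier interface
blob = TextBlob(sentence)
print(blob.words)
print(blob.sentiment)
```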
**Resources:**
* [Natural Language Processing with Python](http://www.nltk.org/book/): free online book to go in-depth with NLTK
* [NLP online course](https://www.coursera.org/course/nlp): no sessions are available, but [video lectures](https://class.coursera.org/nlp/lecture) and [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) are still accessible
* [Brief slides](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) on the major task areas of NLP
* [Detailed slides](https://github.com/ga-students/DAT_SF_9/blob/master/16_Text_Mining/DAT9_lec16_Text_Mining.pdf) on a lot of NLP terminology
* [A visual survey of text visualization techniques](http://textvis.lnu.se/): for exploration and inspiration
* [DC Natural Language Processing](http://www.meetup.com/DC-NLP/): active Meetup group
* [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml): suite of tools if you want to get serious about NLP
* Getting started with regex: [Python introductory lesson](https://developers.google.com/edu/python/regular-expressions) and [reference guide](https://github.com/justmarkham/DAT3/blob/master/code/99_regex_reference.py), [real-time regex tester](https://regex101.com/#python), [in-depth tutorials](http://www.rexegg.com/)
* [A good explanation of LDA](http://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation)
* [Textblob documentation](http://textblob.readthedocs.org/en/dev/)
* [SpaCy](http://honnibal.github.io/spaCy/): a new NLP package
-----
### Class 15: Kaggle Stack Overflow
* Overview of how Kaggle works ([slides](slides/15_kaggle.pdf))
* Kaggle In-Class competition: [Predict whether a Stack Overflow question will be closed](https://inclass.kaggle.com/c/dat5-stack-overflow) ([code](code/15_kaggle.py))
**Optional:**
* Keep working on this competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Wednesday, May 27 (class 20).
**Resources:**
* For a great overview of the diversity of problems tackled by Kaggle competitions, watch [Kaggle Transforms Data Science Into Competitive Sport](https://www.youtube.com/watch?v=8w4UY66GKcM) (28 minutes) by Jeremy Howard (past president of Kaggle).
* [Getting in Shape for the Sport of Data Science](https://www.youtube.com/watch?v=kwt6XEh7U3g) (74 minutes), also by Jeremy Howard, contains a lot of tips for competitive machine learning.
* [Learning from the best](http://blog.kaggle.com/2014/08/01/learning-from-the-best/) is an excellent blog post covering top tips from Kaggle Masters on how to do well on Kaggle.
* [Feature Engineering Without Domain Expertise](https://www.youtube.com/watch?v=bL4b1sGnILU) (17 minutes), a talk by Kaggle Master Nick Kridler, provides some simple advice about how to iterate quickly and where to spend your time during a Kaggle competition.
* Kevin's [project presentation video](https://www.youtube.com/watch?v=HGr1yQV3Um0) (16 minutes) gives a nice tour of the end-to-end machine learning process for a Kaggle competition. (Or, just check out the [slides](https://speakerdeck.com/justmarkham/allstate-purchase-prediction-challenge-on-kaggle).)
-----
### Class 16: Decision Trees
* Decision trees ([notebook](notebooks/16_decision_trees.ipynb))
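A minimal decision tree sketch (the notebook goes much deeper; the `export_graphviz` step is what the optional Graphviz install below is for):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()

# limit the tree depth to control overfitting
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
treeclf.fit(iris.data, iris.target)

# which features does the tree consider most important?
print(zip(iris.feature_names, treeclf.feature_importances_))  # wrap in list(...) on Python 3

# write a .dot file that Graphviz can render as a diagram of the tree
export_graphviz(treeclf, out_file='tree.dot', feature_names=iris.feature_names)
```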
**Resources:**
* scikit-learn documentation: [Decision Trees](http://scikit-learn.org/stable/modules/tree.html)
**Installing Graphviz (optional):**
* Mac:
* [Download and install PKG file](http://www.graphviz.org/Download_macos.php)
* Windows:
* [Download and install MSI file](http://www.graphviz.org/Download_windows.php)
* **Add it to your Path:** Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: `C:\Program Files (x86)\Graphviz2.38\bin`
-----
### Class 17: Ensembles
* Ensembles and random forests ([notebook](notebooks/17_ensembling.ipynb))
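A minimal random forest sketch in the spirit of the ensembling notebook:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# an ensemble of 100 decision trees, each grown on a bootstrap sample with random feature subsets
rfclf = RandomForestClassifier(n_estimators=100, random_state=1)
rfclf.fit(iris.data, iris.target)

# feature importances are averaged across all of the trees
print(zip(iris.feature_names, rfclf.feature_importances_))  # wrap in list(...) on Python 3
```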
**Homework:**
* Your [project draft](other/project.md#may-18-first-draft-due) is due on Monday! Please submit a link to your project repository (with paper, code, and visualizations) before class using the homework submission form.
* Your peers and your instructors will be giving you feedback on your project draft.
* Here's an example of a great [final project paper](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf) from a past student.
* Make at least one new submission to our [Kaggle competition](https://inclass.kaggle.com/c/dat5-stack-overflow)! We suggest trying Random Forests or building your own ensemble of models. For assistance, you could use this [framework code](code/17_ensembling_exercise.py), or refer to the [complete code](code/15_kaggle.py) from class 15. You can optionally submit your code to us if you want feedback.
**Resources:**
* scikit-learn documentation: [Ensembles](http://scikit-learn.org/stable/modules/ensemble.html)
* Quora: [How do random forests work in layman's terms?](http://www.quora.com/How-do-random-forests-work-in-laymans-terms/answer/Edwin-Chen-1)
-----
### Class 18: Clustering and Regularization
* Clustering ([slides](slides/18_clustering.pdf) and [code](code/18_clustering.py))
* Regularization ([notebook](notebooks/18_regularization.ipynb) and [code](code/18_regularization.py))
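Two minimal sketches of the ideas above: K-means on scaled features, and ridge/lasso regularization of a linear model (the iris data and alpha values are just placeholders to make the code run):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge, Lasso

iris = load_iris()

# clustering: scale the features first, since K-means is distance-based
X_scaled = StandardScaler().fit_transform(iris.data)
km = KMeans(n_clusters=3, random_state=1)
km.fit(X_scaled)
print(km.labels_)

# regularization: ridge (L2) and lasso (L1) shrink the coefficients of a linear model
ridge = Ridge(alpha=1.0).fit(iris.data, iris.target)
lasso = Lasso(alpha=0.1).fit(iris.data, iris.target)
print(ridge.coef_)
print(lasso.coef_)
```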
**Homework:**
* You will be assigned to review the project drafts of two of your peers. You have until next Monday to provide them with feedback, according to [these guidelines](other/peer_review.md).
**Resources:**
* [Introduction to Data Mining](http://www-users.cs.umn.edu/~kumar/dmbook/index.php) has a thorough [chapter on cluster analysis](http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf).
* The scikit-learn user guide has a nice [section on clustering](http://scikit-learn.org/stable/modules/clustering.html).
* Wikipedia article on [determining the number of clusters](http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set).
* This [K-means clustering visualization](http://shiny.rstudio.com/gallery/kmeans-example.html) allows you to set different numbers of clusters for the iris data, and this [other visualization](http://asa.1gb.ru/kmeans/1.html) allows you to see the effects of different initial positions for the centroids.
* Fun examples of clustering: [A Statistical Analysis of the Work of Bob Ross](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/) (with [data and Python code](https://github.com/fivethirtyeight/data/tree/master/bob-ross)), [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/), and [characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/).
* An Introduction to Statistical Learning has useful videos on [K-means clustering](https://www.youtube.com/watch?v=aIybuNt9ps4&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2&index=3) (17 minutes), [ridge regression](https://www.youtube.com/watch?v=cSKzqb0EKS0&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=6) (13 minutes), and [lasso regression](https://www.youtube.com/watch?v=A5I1G1MfUmA&index=7&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI) (15 minutes).
* Caltech's Learning From Data course has a great video introducing [regularization](http://work.caltech.edu/library/121.html) (8 minutes) that builds upon their video about the [bias-variance tradeoff](http://work.caltech.edu/library/081.html).
* Here is a longer example of [feature scaling](http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/preprocessing/about_standardization_normalization.ipynb) in scikit-learn, with additional discussion of the types of scaling you can use.
* [Clever Methods of Overfitting](http://hunch.net/?p=22) is a classic post by John Langford.
-----
### Class 19: Advanced scikit-learn and Regular Expressions
* Advanced scikit-learn ([code](code/19_advanced_sklearn.py)); a short Pipeline/GridSearchCV sketch appears after this list
* Searching for optimal parameters: [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html)
* [Exercise](code/19_gridsearchcv_exercise.py)
* Standardization of features: [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* Chaining steps: [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html)
* Regular expressions ("regex")
* Motivating example: [data](data/homicides.txt), [code](code/19_regex_exercise.py)
* Reference guide: [code](code/19_regex_reference.py)
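A minimal sketch tying together the three scikit-learn items above (StandardScaler, Pipeline, and GridSearchCV); note that `GridSearchCV` lives in `sklearn.grid_search` in older releases and `sklearn.model_selection` in newer ones:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older releases

iris = load_iris()

# chain the scaler and the classifier so that scaling happens inside each CV fold
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

# search over the number of neighbors; parameter names follow the 'stepname__param' convention
param_grid = {'knn__n_neighbors': [1, 5, 15, 25]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)
```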
**Optional:**
* Use regular expressions to create a list of causes from the homicide data. Your list should look like this: `['shooting', 'shooting', 'blunt force', ...]`. If the cause is not listed for a particular homicide, include it in the list as `'unknown'`.
**Resources:**
* scikit-learn has an incredibly active [mailing list](https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/index.html) that is often much more useful than Stack Overflow for researching a particular function.
* The scikit-learn documentation includes a [machine learning map](http://scikit-learn.org/stable/tutorial/machine_learning_map/) that may help you to choose the "best" model for your task.
* If you want to build upon the regex material presented in today's class, Google's Python Class includes an excellent [lesson](https://developers.google.com/edu/python/regular-expressions) (with an associated [video](https://www.youtube.com/watch?v=kWyoYtvJpe4&index=4&list=PL5-da3qGB5IA5NwDxcEJ5dvt8F9OQP7q5)).
* [regex101](https://regex101.com/#python) is an online tool for testing your regular expressions in real time.
* If you want to go really deep with regular expressions, [RexEgg](http://www.rexegg.com/) includes endless articles and tutorials.
* [Exploring Expressions of Emotions in GitHub Commit Messages](http://geeksta.net/geeklog/exploring-expressions-emotions-github-commit-messages/) is a fun example of how regular expressions can be used for data analysis.
-----
### Class 20: Databases and SQL
* Databases and SQL ([slides](slides/20_sql.pdf) and [code](code/20_sql.py))
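A minimal sketch of running SQL from Python with the built-in `sqlite3` module and reading the results into Pandas, in the spirit of `20_sql.py` (the table and data here are made up):

```python
import sqlite3
import pandas as pd

# create a throwaway SQLite database in memory and add a small table (placeholder data)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (item TEXT, quantity INTEGER)')
conn.executemany('INSERT INTO orders VALUES (?, ?)', [('Chips', 2), ('Burrito', 1), ('Chips', 1)])

# run a query and pull the results straight into a DataFrame
df = pd.read_sql_query('SELECT item, SUM(quantity) AS total FROM orders GROUP BY item', conn)
print(df)
conn.close()
```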
**Homework:**
* Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).
* Your [final project](other/project.md#june-3-project-presentation-2) is due next Wednesday!
* Please submit a link to your project repository before Wednesday's class using the homework submission form.
* Your presentation should start with a recap of the key information from the previous presentation, but you should spend most of your presentation discussing what has happened since then.
* Don't forget to practice your presentation and time yourself!
**Resources:**
* [SQLZOO](http://sqlzoo.net/wiki/SQL_Tutorial), [Mode Analytics](http://sqlschool.modeanalytics.com/), and [Code School](http://campus.codeschool.com/courses/try-sql/contents) all have online SQL tutorials that look promising.
* [w3schools](http://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all) has a sample database that allows you to practice your SQL.
* [10 Easy Steps to a Complete Understanding of SQL](http://tech.pro/tutorial/1555/10-easy-steps-to-a-complete-understanding-of-sql) is a good article for those who have some SQL experience and want to understand it at a deeper level.
* [A Comparison Of Relational Database Management Systems](https://www.digitalocean.com/community/tutorials/sqlite-vs-mysql-vs-postgresql-a-comparison-of-relational-database-management-systems) gives the pros and cons of SQLite, MySQL, and PostgreSQL.
* If you want to go deeper into databases and SQL, Stanford has a well-respected series of [14 mini-courses](https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/about).
-----
### Class 21: Course Review
* Pipelines ([code](code/19_advanced_sklearn.py))
* Class review
* Creating an ensemble ([code](code/21_ensembles_example.py))
**Resources:**
* [Data science review](https://docs.google.com/document/d/1XCdyrsQwU5OC5os7RHdVTEtS-tpHBbsoKKWLpYI6Svo/edit?usp=sharing): A summary of key concepts from the Data Science course.
* [Comparing supervised learning algorithms](https://docs.google.com/spreadsheets/d/15_QJXm6urctsbIXO-C_eXrsSffbHedio8z0E5ozxO-M/edit?usp=sharing): Kevin's table comparing the machine learning models we studied in the course.
* [Choosing a Machine Learning Classifier](http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/): Edwin Chen's short and highly readable guide.
* [Machine Learning Done Wrong](http://ml.posthaven.com/machine-learning-done-wrong) and [Common Pitfalls in Machine Learning](http://danielnee.com/?p=155): Thoughtful advice on common mistakes to avoid in machine learning.
* [Practical machine learning tricks from the KDD 2011 best industry paper](http://blog.david-andrzejewski.com/machine-learning/practical-machine-learning-tricks-from-the-kdd-2011-best-industry-paper/): More advanced advice than the resources above.
* [An Empirical Comparison of Supervised Learning Algorithms](http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf): Research paper from 2006.
* [Many more resources for continued learning!](other/resources.md)
-----
### Class 22: Project Presentation #2
* Presentations!
**Class is over! What should I do now?**
* Take a break!
* Go back through class notes/code/videos to make sure you feel comfortable with what we've learned.
* Take a look at the **Resources** for each class to get a deeper understanding of what we've learned. Start with the **Resources** from Class 21 and move to topics you are most interested in.
* You might not realize it, but you are at a point where you can continue learning on your own. You have all of the skills necessary to read papers, blogs, documentation, etc.
* GA Data Guild
* [8/24/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13274)
* [9/21/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13275)
* [10/19/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13276)
* [11/9/2015](https://generalassemb.ly/education/data-science-guild/washington-dc/13277)
* Follow data scientists on Twitter. This will help you stay up on the latest news/models/applications/tools.
* Participate in [Data Community DC](http://www.datacommunitydc.org/) events. They sponsor meetups, workshops, etc., notably the [Data Science DC Meetup](http://www.meetup.com/Data-Science-DC/). Sign up for their [newsletter](http://www.datacommunitydc.org/newsletter/) too!
* Read blogs to keep learning. I really like [District Data Labs](http://districtdatalabs.silvrback.com/).
* Do Kaggle competitions! This is a good way to continue and hone your skillset. Plus, you'll learn a ton along the way.
And finally, don't forget about [graduation](https://generalassemb.ly/education/graduation-april-may-june-courses/washington-dc/12892)!
================================================
FILE: code/00_python_beginner_workshop.py
================================================
'''
Multi-line comments go between 3 quotation marks.
You can use single or double quotes.
'''
# One-line comments are preceded by the pound symbol
# BASIC DATA TYPES
x = 5 # creates an object
print type(x) # check the type: int (not declared explicitly)
type(x) # automatically prints
type(5) # assigning it to a variable is not required
type(5.0) # float
type('five') # str
type(True) # bool
# LISTS
nums = [5, 5.0, 'five'] # multiple data types
nums # print the list
type(nums) # check the type: list
len(nums) # check the length: 3
nums[0] # print first element
nums[0] = 6 # replace a list element
nums.append(7) # list 'method' that modifies the list
help(nums.append) # help on this method
help(nums) # help on a list object
nums.remove('five') # another list method
sorted(nums) # 'function' that does not modify the list
nums # it was not affected
nums = sorted(nums) # overwrite the original list
sorted(nums, reverse=True) # optional argument
# list slicing [start:end:stride]
weekdays = ['mon','tues','wed','thurs','fri']
weekdays[0] # element 0
weekdays[0:3] # elements 0, 1, 2
weekdays[:3] # elements 0, 1, 2
weekdays[3:] # elements 3, 4
weekdays[-1] # last element (element 4)
weekdays[::2] # every 2nd element (0, 2, 4)
weekdays[::-1] # backwards (4, 3, 2, 1, 0)
days = weekdays + ['sat','sun'] # concatenate lists
# FUNCTIONS
def give_me_five():     # function definition ends with colon
    return 5            # indentation required for function body
give_me_five()          # prints the return value (5)
num = give_me_five()    # assigns return value to a variable, doesn't print it
def calc(x, y, op):     # three parameters (without any defaults)
    if op == 'add':     # conditional statement
        return x + y
    elif op == 'subtract':
        return x - y
    else:
        print 'Valid operations: add, subtract'
calc(5, 3, 'add')
calc(5, 3, 'subtract')
calc(5, 3, 'multiply')
calc(5, 3)
# EXERCISE: Write a function that takes two parameters (hours and rate), and
# returns the total pay.
def compute_pay(hours, rate):
    return hours * rate
compute_pay(40, 10.50)
# EXERCISE: Update your function to give the employee 1.5 times the hourly rate
# for hours worked above 40 hours.
def compute_more_pay(hours, rate):
    if hours <= 40:
        return hours * rate
    else:
        return 40*rate + (hours-40)*(rate*1.5)
compute_more_pay(30, 10)
compute_more_pay(45, 10)
# STRINGS
# create a string
s = str(42) # convert another data type into a string
s = 'I like you'
# examine a string
s[0] # returns 'I'
len(s) # returns 10
# string slicing like lists
s[:6] # returns 'I like'
s[7:] # returns 'you'
s[-1] # returns 'u'
# split a string into a list of substrings separated by a delimiter
s.split(' ') # returns ['I','like','you']
s.split() # same thing
# concatenate strings
s3 = 'The meaning of life is'
s4 = '42'
s3 + ' ' + s4 # returns 'The meaning of life is 42'
s3 + ' ' + str(42) # same thing
# EXERCISE: Given a string s, return a string made of the first 2 and last 2
# characters of the original string, so 'spring' yields 'spng'. However, if the
# string length is less than 2, instead return the empty string.
def both_ends(s):
    if len(s) < 2:
        return ''
    else:
        return s[:2] + s[-2:]
both_ends('spring')
both_ends('cat')
both_ends('a')
# FOR LOOPS
# range returns a list of integers
range(0, 3) # returns [0, 1, 2]: includes first value but excludes second value
range(3) # same thing: starting at zero is the default
# simple for loop
for i in range(5):
    print i
# print each list element in uppercase
fruits = ['apple', 'banana', 'cherry']
for i in range(len(fruits)):
    print fruits[i].upper()
# better for loop
for fruit in fruits:
    print fruit.upper()
# EXERCISE: Write a program that prints the numbers from 1 to 100. But for
# multiples of 3 print 'fizz' instead of the number, and for the multiples of
# 5 print 'buzz'. For numbers which are multiples of both 3 and 5 print 'fizzbuzz'.
def fizz_buzz():
    nums = range(1, 101)
    for num in nums:
        if num % 15 == 0:
            print 'fizzbuzz'
        elif num % 3 == 0:
            print 'fizz'
        elif num % 5 == 0:
            print 'buzz'
        else:
            print num
fizz_buzz()
# EXERCISE: Given a list of strings, return a list with the strings
# in sorted order, except group all the strings that begin with 'x' first.
# e.g. ['mix', 'xyz', 'apple', 'xanadu', 'aardvark'] returns
# ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
# Hint: this can be done by making 2 lists and sorting each of them
# before combining them.
def front_x(words):
    lista = []
    listb = []
    for word in words:
        if word[0] == 'x':
            lista.append(word)
        else:
            listb.append(word)
    return sorted(lista) + sorted(listb)
front_x(['mix', 'xyz', 'apple', 'xanadu', 'aardvark'])
================================================
FILE: code/00_python_intermediate_workshop.py
================================================
## QUIZ TO REVIEW BEGINNER WORKSHOP
a = 5
b = 5.0
c = a/2
d = b/2
'''
What is type(a)?
int
What is type(b)?
float
What is c?
2
What is d?
2.5
'''
e = [a, b]
f = range(10)
'''
What is type(e)?
list
What is len(e)?
2
What is type(f)?
list
What are the contents of f?
integers 0 through 9
What is 'range' called?
a function
How do I get help on 'range'?
help(range)
'''
g = ['mon','tues','wed','thurs','fri']
'''
How do I slice out 'mon'?
g[0]
How do I slice out 'mon' through 'wed'?
g[0:3]
What are two ways to slice out 'fri'?
g[4] or g[-1]
How do I check the type of 'mon'?
type(g[0])
'''
g.remove('wed')
sorted(g)
h = sorted(g, reverse=True)
'''
What are the contents of g?
['mon','tues','thurs','fri']
What are the contents of h?
['tues','thurs','mon','fri']
What is 'remove' called?
a list method
How do I get help on 'remove'?
help(g.remove)
What is 'reverse=True' called?
an optional argument
'''
i = 'Hello'
j = 'friend'
k = i + j
l = i + 3
m = i[0]
'''
What is 'k'?
'Hellofriend'
What is 'l'?
undefined (due to error)
What is 'm'?
'H'
'''
## FOR LOOPS AND BASIC LIST COMPREHENSIONS
# print 1 through 5
nums = range(1, 6)
for num in nums:
    print num
# for loop to create a list of cubes
cubes = []
for num in nums:
    cubes.append(num**3)
# equivalent list comprehension
cubes = [num**3 for num in nums] # [1, 8, 27, 64, 125]
'''
EXERCISE:
Given that: letters = ['a','b','c']
Write a list comprehension that returns: ['A','B','C']
Hint: 'hello'.upper() returns 'HELLO'
[letter.upper() for letter in letters]
BONUS EXERCISE:
Given that: word = 'abc'
Write a list comprehension that returns: ['A','B','C']
[letter.upper() for letter in word]
'''
## LIST COMPREHENSIONS WITH CONDITIONS
nums = range(1, 6)
# for loop to create a list of cubes of even numbers
cubes_of_even = []
for num in nums:
    if num % 2 == 0:
        cubes_of_even.append(num**3)
# equivalent list comprehension
# syntax: [expression for variable in iterable if condition]
cubes_of_even = [num**3 for num in nums if num % 2 == 0] # [8, 64]
## DICTIONARIES
# dictionaries are similar to lists:
# - both can contain multiple data types
# - both are iterable
# - both are mutable
# dictionaries are different from lists:
# - dictionaries are unordered
# - dictionary lookup time is constant regardless of dictionary size
# dictionaries are like real dictionaries:
# - dictionaries are made of key-value pairs (word and definition)
# - dictionary keys must be unique (each word is only defined once)
# - you can use the key to look up the value, but not the other way around
# create a dictionary (and open Variable Explorer in Spyder)
family = {'dad':'homer', 'mom':'marge', 'size':6}
# examine a dictionary
family[0] # throws an error (there is no ordering)
family['dad'] # returns 'homer'
len(family) # returns 3
family.keys() # returns list: ['dad', 'mom', 'size']
family.values() # returns list: ['homer', 'marge', 6]
family.items() # returns list of tuples:
# [('dad', 'homer'), ('mom', 'marge'), ('size', 6)]
# modify a dictionary
family['cat'] = 'snowball' # add a new entry
family['cat'] = 'snowball ii' # edit an existing entry
del family['cat'] # delete an entry
family['kids'] = ['bart', 'lisa'] # value can be a list
# accessing a list element within a dictionary
family['kids'][0] # returns 'bart'
'''
EXERCISE:
Given that: d = {'a':10, 'b':20, 'c':[30, 40]}
First, print the value for 'a'
Then, change the value for 'b' to be 25
Then, change the 30 to be 35
Finally, append 45 to the end of the list that contains 35 and 40
d['a']
d['b'] = 25
d['c'][0] = 35
d['c'].append(45)
BONUS EXERCISE:
Write a list comprehension that returns a list of the keys in uppercase
[key.upper() for key in d.keys()]
'''
## APIs
# API Providers: https://apigee.com/providers
# Echo Nest API Console: https://apigee.com/console/echonest
# Echo Nest Developer Center: http://developer.echonest.com/
import requests # import module (make its functions available)
# use requests to talk to the web
r = requests.get('http://www.google.com')
r.text
type(r.text)
# request data from the Echo Nest API
r = requests.get('http://developer.echonest.com/api/v4/artist/top_hottt?api_key=KBGUPZPJZS9PHWNIN&format=json')
r.text
r.json() # decode JSON
type(r.json())
top = r.json()
# pretty print for easier readability
import pprint
pprint.pprint(top)
# pull out the artist data
artists = top['response']['artists'] # list of 15 dictionaries
# reformat data into a table structure
artists_data = [artist.values() for artist in artists] # list of 15 lists
artists_header = artists[0].keys() # list of 2 strings
## WORKING WITH PUBLIC DATA
# List of data sources: https://github.com/justmarkham/DAT5/blob/master/other/public_data.md
# FiveThirtyEight: http://fivethirtyeight.com/
# FiveThirtyEight data: https://github.com/fivethirtyeight/data
# NFL ticket prices data: https://github.com/fivethirtyeight/data/tree/master/nfl-ticket-prices
# Question: What is the average ticket price for Ravens' home vs away games?
# open a CSV file from a URL
import csv
r = requests.get('https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-ticket-prices/2014-average-ticket-price.csv')
data = [row for row in csv.reader(r.iter_lines())] # list of lists
# open a downloaded CSV file from your working directory
with open('2014-average-ticket-price.csv', 'rU') as f:
    data = [row for row in csv.reader(f)] # list of lists
# examine the data
type(data)
len(data)
data[0]
data[1]
# save the data we want
data = data[1:97]
# step 1: create a list that only contains events
data[0][0]
data[1][0]
data[2][0]
events = [row[0] for row in data]
# EXERCISE
# step 2: create a list that only contains prices (stored as integers)
prices = [int(row[2]) for row in data]
# step 3: figure out how to locate the away teams
events[0]
events[0].find(' at ')
stop = events[0].find(' at ')
events[0][:stop]
# step 4: use a for loop to make a list of the away teams
away_teams = []
for event in events:
    stop = event.find(' at ')
    away_teams.append(event[:stop])
# EXERCISE
# step 5: use a for loop to make a list of the home teams
home_teams = []
for event in events:
    start = event.find(' at ') + 4
    stop = event.find(' Tickets ')
    home_teams.append(event[start:stop])
# step 6: figure out how to get prices only for Ravens home games
zip(home_teams, prices) # list of tuples
[pair[1] for pair in zip(home_teams, prices)] # iterate through tuples and get price
[price for team, price in zip(home_teams, prices)] # better way to get price
[price for team, price in zip(home_teams, prices) if team == 'Baltimore Ravens'] # add a condition
# step 7: create lists of the Ravens home and away game prices
ravens_home = [price for team, price in zip(home_teams, prices) if team == 'Baltimore Ravens']
ravens_away = [price for team, price in zip(away_teams, prices) if team == 'Baltimore Ravens']
# EXERCISE
# step 8: calculate the average of each list
float(sum(ravens_home)) / len(ravens_home)
float(sum(ravens_away)) / len(ravens_away)
================================================
FILE: code/01_chipotle_homework_solution.py
================================================
'''
SOLUTION FILE: Homework with Chipotle data
https://github.com/TheUpshot/chipotle
'''
'''
PART 1: read in the data, parse it, and store it in a list of lists called 'data'
Hint: this is a tsv file, and csv.reader() needs to be told how to handle it
'''
import csv
# specify that the delimiter is a tab character
with open('chipotle_orders.tsv', 'rU') as f:
data = [row for row in csv.reader(f, delimiter='\t')]
'''
PART 2: separate the header and data into two different lists
'''
header = data[0]
data = data[1:]
'''
PART 3: calculate the average price of an order
Hint: examine the data to see if the 'quantity' column is relevant to this calculation
Hint: work smarter, not harder! (this can be done in a few lines of code)
'''
# count the number of unique order_id's
# note: you could assume this is 1834 because that's the maximum order_id, but it's best to check
num_orders = len(set([row[0] for row in data])) # 1834
# create a list of prices
# note: ignore the 'quantity' column because the 'item_price' takes quantity into account
prices = [float(row[4][1:-1]) for row in data] # strip the dollar sign and trailing space
# calculate the average price of an order and round to 2 digits
round(sum(prices) / num_orders, 2) # $18.81
'''
PART 4: create a list (or set) of all unique sodas and soft drinks that they sell
Note: just look for 'Canned Soda' and 'Canned Soft Drink', and ignore other drinks like 'Izze'
'''
# if 'item_name' includes 'Canned', append 'choice_description' to 'sodas' list
sodas = []
for row in data:
if 'Canned' in row[2]:
sodas.append(row[3][1:-1]) # strip the brackets
# create a set of unique sodas
unique_sodas = set(sodas)
'''
PART 5: calculate the average number of toppings per burrito
Note: let's ignore the 'quantity' column to simplify this task
Hint: think carefully about the easiest way to count the number of toppings
Hint: 'hello there'.count('e')
'''
# keep a running total of burritos and toppings
burrito_count = 0
topping_count = 0
# calculate number of toppings by counting the commas and adding 1
# note: x += 1 is equivalent to x = x + 1
for row in data:
if 'Burrito' in row[2]:
burrito_count += 1
topping_count += (row[3].count(',') + 1)
# calculate the average topping count and round to 2 digits
round(topping_count / float(burrito_count), 2) # 5.40
'''
PART 6: create a dictionary in which the keys represent chip orders and
the values represent the total number of orders
Expected output: {'Chips and Roasted Chili-Corn Salsa': 18, ... }
Note: please take the 'quantity' column into account!
Advanced: learn how to use 'defaultdict' to simplify your code
'''
# start with an empty dictionary
chips = {}
# if chip order is not in dictionary, then add a new key/value pair
# if chip order is already in dictionary, then update the value for that key
for row in data:
if 'Chips' in row[2]:
if row[2] not in chips:
chips[row[2]] = int(row[1]) # this is a new key, so create key/value pair
else:
chips[row[2]] += int(row[1]) # this is an existing key, so add to the value
# defaultdict saves you the trouble of checking whether a key already exists
from collections import defaultdict
dchips = defaultdict(int)
for row in data:
if 'Chips' in row[2]:
dchips[row[2]] += int(row[1])
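# OPTIONAL (a sketch): inspect the results by sorting the chip orders by total quantity, largest first
sorted(dchips.items(), key=lambda pair: pair[1], reverse=True)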
'''
BONUS: think of a question about this data that interests you, and then answer it!
'''
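# EXAMPLE BONUS (just one possible question, shown as a sketch; reuses 'data' and 'defaultdict' from above):
# Which item is ordered most often, taking the 'quantity' column into account?
item_counts = defaultdict(int)
for row in data:
    item_counts[row[2]] += int(row[1])
sorted(item_counts.items(), key=lambda pair: pair[1], reverse=True)[0:5]  # top 5 items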
================================================
FILE: code/01_reading_files.py
================================================
'''
Lesson on file reading using Airline Safety Data
https://github.com/fivethirtyeight/data/tree/master/airline-safety
'''
# read the whole file at once, return a single string (including newlines)
# 'rU' mode (read universal) converts different line endings into '\n'
f = open('airline_safety.csv', 'rU')
data = f.read()
f.close()
# use a context manager to automatically close your file
with open('airline_safety.csv', 'rU') as f:
data = f.read()
# read the whole file at once, return a list of lines
with open('airline_safety.csv', 'rU') as f:
data = f.readlines()
# use list comprehension to duplicate readlines
with open('airline_safety.csv', 'rU') as f:
data = [row for row in f]
# use the csv module to create a list of lists
import csv
with open('airline_safety.csv', 'rU') as f:
data = [row for row in csv.reader(f)]
# alternative method that doesn't require downloading the file
import requests
r = requests.get('https://raw.githubusercontent.com/fivethirtyeight/data/master/airline-safety/airline-safety.csv')
data = [row for row in csv.reader(r.iter_lines())]
# separate the header and data
header = data[0]
data = data[1:]
# EXERCISE:
# create a list of airline names (without the star)
# create a list of the same length that contains 1 if there's a star and 0 if not
airlines = []
starred = []
for row in data:
if row[0][-1] == '*':
starred.append(1)
airlines.append(row[0][:-1])
else:
starred.append(0)
airlines.append(row[0])
# EXERCISE:
# create a list that contains the average number of incidents per distance
[(int(row[2]) + int(row[5])) / float(row[1]) for row in data]
'''
A few extra things that will help you with the homework
'''
# 'in' statement is useful for lists
my_list = [1, 2, 1]
1 in my_list # True
3 in my_list # False
# 'in' is useful for strings (checks for substrings)
my_string = 'hello there'
'the' in my_string # True
'then' in my_string # False
# 'in' is useful for dictionaries (checks keys but not values)
my_dict = {'name':'Kevin', 'title':'instructor'}
'name' in my_dict # True
'Kevin' in my_dict # False
# 'set' data structure is useful for gathering unique elements
set(my_list) # returns a set of 1, 2
len(set(my_list)) # count of unique elements
'''
Homework with Chipotle data
https://github.com/TheUpshot/chipotle
'''
'''
PART 1: read in the data, parse it, and store it in a list of lists called 'data'
Hint: this is a tsv file, and csv.reader() needs to be told how to handle it
'''
'''
PART 2: separate the header and data into two different lists
'''
'''
PART 3: calculate the average price of an order
Hint: examine the data to see if the 'quantity' column is relevant to this calculation
Hint: work smarter, not harder! (this can be done in a few lines of code)
'''
'''
PART 4: create a list (or set) of all unique sodas and soft drinks that they sell
Note: just look for 'Canned Soda' and 'Canned Soft Drink', and ignore other drinks like 'Izze'
'''
'''
PART 5: calculate the average number of toppings per burrito
Note: let's ignore the 'quantity' column to simplify this task
Hint: think carefully about the easiest way to count the number of toppings
Hint: 'hello there'.count('e')
'''
'''
PART 6: create a dictionary in which the keys represent chip orders and
the values represent the total number of orders
Expected output: {'Chips and Roasted Chili-Corn Salsa': 18, ... }
Note: please take the 'quantity' column into account!
Advanced: learn how to use 'defaultdict' to simplify your code
'''
'''
BONUS: think of a question about this data that interests you, and then answer it!
'''
================================================
FILE: code/03_exploratory_analysis_pandas.py
================================================
"""
CLASS: Pandas for Data Exploration, Analysis, and Visualization
About the data:
WHO alcohol consumption data:
article: http://fivethirtyeight.com/datalab/dear-mona-followup-where-do-people-drink-the-most-beer-wine-and-spirits/
original data: https://github.com/fivethirtyeight/data/tree/master/alcohol-consumption
files: drinks.csv (with additional 'continent' column)
"""
"""
First, we need to import Pandas into Python. Pandas is a Python package that
allows for easy manipulation of DataFrames. You'll also need to import
matplotlib for plotting.
"""
#imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
'''
Reading Files, Summarizing, Selecting, Filtering, Sorting
'''
# Can read from a local file on your computer or from a URL
drinks = pd.read_table('drinks.csv', sep=',') # read_table is more general
drinks = pd.read_csv('drinks.csv') # read_csv is specific to CSV and implies sep=","
# Can also read from URLs
drinks = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT5/master/data/drinks.csv')
'''
Key Concept: Dot notation
In Python, you can think of an object as an entity that can have both attributes
and methods. A dot following an object indicates that you are about to access
something within the object, an attribute or a method. Attributes contain
information about the object. They are usually a single "word" following the
dot. A method is something the object can do. They are usually a "word" with
parentheses following the dot.
'''
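# a quick illustration using the 'drinks' DataFrame loaded above:
drinks.shape     # attribute: information about the object, accessed without parentheses
drinks.head()    # method: something the object can do, called with parentheses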
# examine the drinks data
drinks # print the first 30 and last 30 rows
type(drinks) # DataFrame
drinks.head() # print the first 5 rows
drinks.head(10) # print the first 10 rows
drinks.tail() # print the last 5 rows
drinks.describe() # summarize all numeric columns
drinks.describe(include='all') # includes non-numeric columns; new in pandas 0.15.0
drinks.index # "the index" (aka "the labels")
drinks.columns # column names (which is "an index")
drinks.dtypes # data types of each column
drinks.shape # number of rows and columns
drinks.values # underlying numpy array
drinks.info() # concise summary (includes memory usage as of pandas 0.15.0)
# Print the 'beer_servings' Series (a single column)
drinks.beer_servings
drinks['beer_servings']
type(drinks.beer_servings)
# Print two columns
drinks[['beer_servings','wine_servings']]
cols = ['beer_servings','wine_servings']
drinks[cols]
# Calculate the average 'beer_servings' for the entire dataset
drinks.describe() # summarize all numeric columns
drinks.beer_servings.describe() # summarize only the 'beer_servings' Series
drinks.beer_servings.mean() # only calculate the mean
drinks.beer_servings.max() # only calculate the max
drinks.beer_servings.min() # only calculate the min
# Other aggregation functions
drinks.beer_servings.sum()
drinks.beer_servings.count()
float(drinks.beer_servings.sum())/drinks.beer_servings.count()
# Count the number of occurrences of each 'continent' value
drinks.continent.value_counts()
# Simple logical filters
# Print all columns, but only show rows where the country is in Europe
# Let's look at each piece of this.
drinks.continent # Returns all of the continent values
drinks.continent=='EU' # Returns True/False list
drinks[drinks.continent=='EU'] # Returns all rows where True
# Other logical filters
drinks[drinks.beer_servings > 158]
drinks[drinks.beer_servings <= 10]
type(drinks[drinks.beer_servings <= 10]) # DataFrame
drinks[drinks.beer_servings <= 10][['country','beer_servings']]
# Calculate the average 'beer_servings' for all of Europe
drinks[drinks.continent=='EU'].beer_servings.mean()
# More complex logical filtering
# Only show European countries with 'wine_servings' greater than 300
# Note: parentheses are required for each condition, and you can't use 'and' or 'or' keywords
drinks[(drinks.continent=='EU') & (drinks.wine_servings > 300)]
# Show European countries or countries with 'wine_servings' greater than 300
drinks[(drinks.continent=='EU') | (drinks.wine_servings > 300)]
# Show countries who have more than the mean beer_servings
drinks[drinks.beer_servings > drinks.beer_servings.mean()]
##########################################
############ Exercise 1 ############
##########################################
# Using the 'drinks' data, answer the following questions:
# 1. What is the maximum number of total litres of pure alcohol?
drinks.total_litres_of_pure_alcohol.max()
# 2. Which country has the maximum number of total litres of pure alcohol?
drinks[drinks.total_litres_of_pure_alcohol == drinks.total_litres_of_pure_alcohol.max()]['country']
# 3. Does Haiti or Belarus consume more servings of spirits?
drinks.spirit_servings[drinks.country=='Haiti'] > drinks.spirit_servings[drinks.country=='Belarus']
# 4. How many countries have more than 300 wine servings OR more than 300
# beer servings OR more than 300 spirit servings?
drinks[(drinks.wine_servings > 300) | (drinks.beer_servings > 300) | (drinks.spirit_servings > 300)].country.count()
# 5. For the countries in the previous question, what is the average total litres
# of pure alcohol?
drinks[(drinks.wine_servings > 300) | (drinks.beer_servings > 300) | (drinks.spirit_servings > 300)].total_litres_of_pure_alcohol.mean()
# sorting
drinks.beer_servings.order() # only works for a Series
drinks.sort_index() # sort rows by label
drinks.sort_index(by='beer_servings') # sort rows by a specific column
drinks.sort_index(by='beer_servings', ascending=False) # use descending order instead
drinks.sort_index(by=['beer_servings', 'wine_servings']) # sort by multiple columns
# Determine which 10 countries have the highest 'total_litres_of_pure_alcohol'
drinks.sort_index(by='total_litres_of_pure_alcohol').tail(10)
# Determine which country has the highest value for 'beer_servings'
drinks[drinks.beer_servings==drinks.beer_servings.max()].country
# Use dot notation to string together commands
# How many countries in each continent have beer_servings greater than 182?
# i.e. a beer every two days
drinks[drinks.beer_servings > 182].continent.value_counts()
# add a new column as a function of existing columns
# note: can't (usually) assign to an attribute (e.g., 'drinks.total_servings')
drinks['total_servings'] = drinks.beer_servings + drinks.spirit_servings + drinks.wine_servings
drinks['alcohol_mL'] = drinks.total_litres_of_pure_alcohol * 1000
drinks.head()
'''
Split-Apply-Combine
'''
# for each continent, calculate mean beer servings
drinks.groupby('continent').beer_servings.mean()
# for each continent, calculate mean of all numeric columns
drinks.groupby('continent').mean()
# for each continent, count number of occurrences
drinks.groupby('continent').continent.count()
drinks.continent.value_counts()
'''
A little numpy
'''
probs = np.array([0.51, 0.50, 0.02, 0.49, 0.78])
# np.where functions like an IF statement in Excel
# np.where(condition, value if true, value if false)
np.where(probs >= 0.5, 1, 0)
drinks['lots_of_beer'] = np.where(drinks.beer_servings > 300, 1, 0)
##########################################
############ Exercise 2 ############
##########################################
# 1. What is the average number of total litres of pure alcohol for each
# continent?
drinks.groupby('continent').total_litres_of_pure_alcohol.mean()
# 2. For each continent, calculate the mean wine_servings for all countries who
# have a spirit_servings greater than the overall spirit_servings mean.
drinks[drinks.spirit_servings > drinks.spirit_servings.mean()].groupby('continent').wine_servings.mean()
# 3. Per continent, for all of the countries that drink more beer servings than
# the average number of beer servings, what is the average number of wine
# servings?
drinks[drinks.beer_servings > drinks.beer_servings.mean()].groupby('continent').wine_servings.mean()
'''
Advanced Filtering (of rows) and Selecting (of columns)
'''
# loc: filter rows by LABEL, and select columns by LABEL
drinks.loc[0] # row with label 0
drinks.loc[0:3] # rows with labels 0 through 3
drinks.loc[0:3, 'beer_servings':'wine_servings'] # rows 0-3, columns 'beer_servings' through 'wine_servings'
drinks.loc[:, 'beer_servings':'wine_servings'] # all rows, columns 'beer_servings' through 'wine_servings'
drinks.loc[[0,3], ['beer_servings','spirit_servings']] # rows with labels 0 and 3, columns 'beer_servings' and 'spirit_servings'
# iloc: filter rows by POSITION, and select columns by POSITION
drinks.iloc[0] # row with 0th position (first row)
drinks.iloc[0:3] # rows with positions 0 through 2 (not 3)
drinks.iloc[0:3, 0:3] # rows and columns with positions 0 through 2
drinks.iloc[:, 0:3] # all rows, columns with positions 0 through 2
drinks.iloc[[0,2], [0,1]] # 1st and 3rd row, 1st and 2nd column
# mixing: select columns by LABEL, then filter rows by POSITION
drinks.wine_servings[0:3]
drinks[['beer_servings', 'spirit_servings', 'wine_servings']][0:3]
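# OPTIONAL (a sketch): 'loc' also accepts a boolean mask for the rows, which combines
# filtering and column selection in a single step
drinks.loc[drinks.continent=='EU', 'beer_servings':'wine_servings']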
##########################################
############# Homework #############
##########################################
'''
Use the automotive mpg data (https://raw.githubusercontent.com/justmarkham/DAT5/master/data/auto_mpg.txt)
to complete the following parts. Please turn in your code for each part.
Before each code chunk, give a brief description (one line) of what the code is
doing (e.g. "Loads the data" or "Creates scatter plot of mpg and weight"). If
the code output produces a plot or answers a question, give a brief
interpretation of the output (e.g. "This plot shows X,Y,Z" or "The mean for
group A is higher than the mean for group B which means X,Y,Z").
'''
'''
Part 1
Load the data (https://raw.githubusercontent.com/justmarkham/DAT5/master/data/auto_mpg.txt)
into a DataFrame. Try looking at the "head" of the file in the command line
to see how the file is delimited and how to load it.
Note: You do not need to turn in any command line code you may use.
'''
'''
Part 2
Get familiar with the data. Answer the following questions:
- What is the shape of the data? How many rows and columns are there?
- What variables are available?
- What are the ranges for the values in each numeric column?
- What is the average value for each column? Does that differ significantly
from the median?
'''
'''
Part 3
Use the data to answer the following questions:
- Which 5 cars get the best gas mileage?
- Which 5 cars with more than 4 cylinders get the best gas mileage?
- Which 5 cars get the worst gas mileage?
- Which 5 cars with 4 or fewer cylinders get the worst gas mileage?
'''
'''
Part 4
Use groupby and aggregations to explore the relationships
between mpg and the other variables. Which variables seem to have the greatest
effect on mpg?
Some examples of things you might want to look at are:
- What is the mean mpg for each number of cylinders (i.e. 3 cylinders,
4 cylinders, 5 cylinders, etc)?
- Did mpg rise or fall over the years contained in this dataset?
- What is the mpg for the group of lighter cars vs the group of heavier cars?
Note: Be creative in the ways in which you divide up the data. You are trying
to create segments of the data using logical filters and comparing the mpg
for each segment of the data.
'''
================================================
FILE: code/04_apis.py
================================================
'''
CLASS: APIs
Data Science Toolkit text2sentiment API
'''
'''
APIs without wrappers (i.e. there is no nicely formatted function)
'''
# Import the necessary modules
import requests # Helps construct the request to send to the API
import json # JSON helper functions
# We have a sentence we want the sentiment of
sample_sentence = 'A couple hundred hours & several thousand lines of code later... thank you @GA_DC!! #DataScience #GAGradNight'
# We know the URL endpoint to send it to
url = 'http://www.datasciencetoolkit.org/text2sentiment/'
# First we specify the header
header = {'content-type': 'application/json'}
# Next we specify the body (the information we want the API to work on)
body = sample_sentence
# Now we make the request
response = requests.post(url, data=body, headers=header)
# Notice that this is a POST request
# Let's look at the response
response.status_code
response.ok
response.text
# Let's turn that text back into JSON
r_json = json.loads(response.text)
r_json
r_json['score'] # 2.0
##########################################
############ Exercise 1 ############
##########################################
# Turn the above code into a function
# The function should take in one argument, some text, and return a number,
# the sentiment. Call your function "get_sentiment".
def get_sentiment(text):
url = 'http://www.datasciencetoolkit.org/text2sentiment/'
#specify header
header = {'content-type': 'application/json'}
# Next we specify the body (the information we want the API to work on)
body = text
# Now we make the request
response = requests.post(url, data=body, headers=header)
# Notice that this is a POST request
r_json = json.loads(response.text)
sentiment = r_json['score'] # 2.0
return sentiment
# Now that we've created our own wrapper, we can use it throughout our code.
# We now have multiple sentences
sentences = ['I love pizza!', 'I hate pizza!', 'I feel nothing about pizza!']
# Loop through the sentences
for sentence in sentences:
sentiment = get_sentiment(sentence)
print sentence, sentiment # Print the results
'''
APIs with wrappers (i.e. there is a nicely formatted function)
'''
# Import the API library
import dstk
# Remember our sample sentence?
sample_sentence
# Let's try our new API library
# Instantiate DSTK object
dstk = dstk.DSTK()
dstk.text2sentiment(sample_sentence) # 2.0
# We can once again loop through our sentences
for sentence in sentences:
sentiment = dstk.text2sentiment(sentence)
print sentence, sentiment['score']
================================================
FILE: code/04_visualization.py
================================================
"""
CLASS: Visualization
"""
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import the data available at https://raw.githubusercontent.com/justmarkham/DAT5/master/data/drinks.csv
drinks = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT5/master/data/drinks.csv')
'''
Visualization
'''
# bar plot of number of countries in each continent
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')
plt.show() # show plot window (if it doesn't automatically appear)
plt.savefig('countries_per_continent.png') # save plot to file
# bar plot of average number of beer servings (per adult per year) by continent
drinks.groupby('continent').beer_servings.mean().plot(kind='bar', title='Average Number of Beer Servings By Continent')
plt.ylabel('Average Number of Beer Servings Per Year')
plt.show()
# histogram of beer servings (shows the distribution of a numeric column)
drinks.beer_servings.hist(bins=20)
plt.title("Distribution of Beer Servings")
plt.xlabel('Beer Servings')
plt.ylabel('Frequency')
plt.show()
# density plot of beer servings (smooth version of a histogram)
drinks.beer_servings.plot(kind='density', xlim=(0,500))
plt.title("Distribution of Beer Servings")
plt.xlabel('Beer Servings')
plt.show()
# grouped histogram of beer servings (shows the distribution for each group)
drinks.beer_servings.hist(by=drinks.continent)
plt.show()
drinks.beer_servings.hist(by=drinks.continent, sharex=True)
plt.show()
drinks.beer_servings.hist(by=drinks.continent, sharex=True, sharey=True)
plt.show()
drinks.beer_servings.hist(by=drinks.continent, sharey=True, layout=(2, 3)) # change layout (new in pandas 0.15.0)
plt.show()
# boxplot of beer servings by continent (shows five-number summary and outliers)
drinks.boxplot(column='beer_servings', by='continent')
plt.show()
# scatterplot of beer servings versus wine servings
drinks.plot(kind='scatter', x='beer_servings', y='wine_servings', alpha=0.3)
plt.show()
# same scatterplot, except point color varies by 'spirit_servings'
# note: must use 'c=drinks.spirit_servings' prior to pandas 0.15.0
drinks.plot(kind='scatter', x='beer_servings', y='wine_servings', c='spirit_servings', colormap='Blues')
plt.show()
# same scatterplot, except all European countries are colored red
colors = np.where(drinks.continent=='EU', 'r', 'b')
drinks.plot(x='beer_servings', y='wine_servings', kind='scatter', c=colors)
plt.show()
# Scatter matrix
pd.scatter_matrix(drinks)
plt.show()
##########################################
############ Exercise 1 ############
##########################################
# 1. Generate a plot showing the average number of total litres of pure alcohol
# by continent.
drinks.groupby('continent').total_litres_of_pure_alcohol.mean().plot(kind='bar')
plt.show()
# 2. Illustrate the relationship between spirit servings and total litres of
# pure alcohol. What kind of relationship is there?
drinks.plot(kind='scatter', x='spirit_servings', y='total_litres_of_pure_alcohol', alpha=0.4)
plt.show()
# 3. Generate one plot that shows the distribution of spirit servings for each
# continent.
drinks.spirit_servings.hist(by=drinks.continent, sharex=True, sharey=True)
plt.show()
##########################################
############# Homework #############
##########################################
'''
Use the automotive mpg data (https://raw.githubusercontent.com/justmarkham/DAT5/master/data/auto_mpg.txt)
to complete the following parts. Please turn in your code for each part.
Before each code chunk, give a brief description (one line) of what the code is
doing (e.g. "Loads the data" or "Creates scatter plot of mpg and weight"). If
the code output produces a plot or answers a question, give a brief
interpretation of the output (e.g. "This plot shows X,Y,Z" or "The mean for
group A is higher than the mean for group B which means X,Y,Z").
'''
'''
Part 1
Produce a plot that compares the mean mpg for the different numbers of cylinders.
'''
'''
Part 2
Use a scatter matrix to explore relationships between different numeric variables.
'''
'''
Part 3
Use a plot to answer the following questions:
-Do heavier or lighter cars get better mpg?
-How are horsepower and displacement related?
-What does the distribution of acceleration look like?
-How is mpg spread for cars with different numbers of cylinders?
-Do cars made before or after 1975 get better average mpg? (Hint: You need to
create a new column that encodes whether a year is before or after 1975.)
'''
================================================
FILE: code/05_iris_exercise.py
================================================
'''
EXERCISE: "Human Learning" with iris data
Can you predict the species of an iris using petal and sepal measurements?
TASKS:
1. Read iris data into a pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use groupby, sorting, and/or plotting to look for differences between species.
4. Come up with a set of rules that could be used to predict species based upon measurements.
BONUS: Define a function that accepts a row of data and returns a predicted species.
Then, use that function to make predictions for all existing rows of data.
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
## TASK 1
# read the iris data into a pandas DataFrame, including column names
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
names=col_names)
## TASK 2
# gather basic information
iris.shape
iris.head()
iris.describe()
iris.species.value_counts()
iris.dtypes
iris.isnull().sum()
## TASK 3
# use groupby to look for differences between the species
iris.groupby('species').sepal_length.mean()
iris.groupby('species').mean()
iris.groupby('species').describe()
# use sorting to look for differences between the species
iris.sort_index(by='sepal_length').values
iris.sort_index(by='sepal_width').values
iris.sort_index(by='petal_length').values
iris.sort_index(by='petal_width').values
# use plotting to look for differences between the species
iris.petal_width.hist(by=iris.species, sharex=True)
iris.boxplot(column='petal_width', by='species')
iris.boxplot(by='species')
# map species to a numeric value so that plots can be colored by category
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap='Blues')
pd.scatter_matrix(iris, c=iris.species_num)
## TASK 4
# If petal length is less than 3, predict setosa.
# Else if petal width is less than 1.8, predict versicolor.
# Otherwise predict virginica.
## BONUS
# define function that accepts a row of data and returns a predicted species
def classify_iris(row):
if row[2] < 3: # petal_length
return 0 # setosa
elif row[3] < 1.8: # petal_width
return 1 # versicolor
else:
return 2 # virginica
# predict for a single row
classify_iris(iris.iloc[0, :]) # first row
classify_iris(iris.iloc[149, :]) # last row
# store predictions for all rows
predictions = [classify_iris(row) for row in iris.values]
# calculate the percentage of correct predictions
np.mean(iris.species_num == predictions) # 0.96
================================================
FILE: code/05_sklearn_knn.py
================================================
'''
CLASS: Introduction to scikit-learn with iris data
'''
# read in iris data
import pandas as pd
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
names=col_names)
# create numeric column for the response
# note: features and response must both be entirely numeric!
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# create X (features) three different ways
X = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
X = iris.loc[:, 'sepal_length':'petal_width']
X = iris.iloc[:, 0:4]
# create y (response)
y = iris.species_num
# check the shape of X and y
X.shape # 150 by 4 (n=150, p=4)
y.shape # 150 (must match first dimension of X)
# scikit-learn 4-step modeling pattern:
# Step 1: import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier
# Step 2: instantiate the "estimator" (aka the model)
# note: all unspecified parameters are set to the defaults
knn = KNeighborsClassifier(n_neighbors=1)
# Step 3: fit the model with data (learn the relationship between X and y)
knn.fit(X, y)
# Step 4: use the "fitted model" to predict the response for a new observation
knn.predict([3, 5, 4, 2])
# predict for multiple observations at once
X_new = [[3, 5, 4, 2], [3, 5, 2, 2]]
knn.predict(X_new)
# try a different value of K ("tuning parameter")
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
knn.predict(X_new) # predicted classes
knn.predict_proba(X_new) # predicted probabilities of class membership
knn.kneighbors([3, 5, 4, 2]) # distances to nearest neighbors (and identities)
# calculate Euclidean distance manually for nearest neighbor
import numpy as np
np.sqrt(((X.iloc[106, :] - [3, 5, 4, 2])**2).sum())
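# sanity check (a sketch): the distance reported by 'kneighbors' should match the manual calculation above
distances, indices = knn.kneighbors([3, 5, 4, 2])
distances[0][0]   # distance to the nearest neighbor
indices[0][0]     # row position of the nearest neighbor (106, per the manual calculation)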
================================================
FILE: code/07_glass_id_homework_solution.py
================================================
'''
HOMEWORK: Glass Identification (aka "Glassification")
'''
# TASK 1: read data into a DataFrame
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data',
names=['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type'],
index_col='id')
# TASK 2: briefly explore the data
df.shape
df.head()
df.tail()
df.glass_type.value_counts()
df.isnull().sum()
# TASK 3: convert to binary classification problem (1/2/3/4 maps to 0, 5/6/7 maps to 1)
import numpy as np
df['binary'] = np.where(df.glass_type < 5, 0, 1) # method 1
df['binary'] = df.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1}) # method 2
df.binary.value_counts()
# TASK 4: create a feature matrix (X)
features = ['ri','na','mg','al','si','k','ca','ba','fe'] # create a list of features
features = df.columns[:-2] # alternative way: slice 'columns' attribute like a list
X = df[features] # create DataFrame X by only selecting features
# TASK 5: create a response vector (y)
y = df.binary
# TASK 6: split X and y into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)
# TASK 7: fit a KNN model on the training set using K=5
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# TASK 8: make predictions on the testing set and calculate accuracy
y_pred = knn.predict(X_test)
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred) # 90.7% accuracy
# TASK 9: calculate null accuracy
1 - y.mean() # 76.2% null accuracy
# BONUS: write a for loop that computes test set accuracy for a range of K values
k_range = range(1, 30, 2)
scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
scores.append(metrics.accuracy_score(y_test, y_pred))
# BONUS: plot K versus test set accuracy to choose an optimal value for K
import matplotlib.pyplot as plt
plt.plot(k_range, scores) # optimal value is K=1
================================================
FILE: code/08_web_scraping.py
================================================
'''
CLASS: Web Scraping
We will be using two packages in particular: requests and Beautiful Soup 4.
'''
'''
Introduction to Beautiful Soup
'''
# imports
import requests # How Python gets the webpages
from bs4 import BeautifulSoup # Creates structured, searchable object
import pandas as pd
import matplotlib.pyplot as plt
# First, let's play with beautiful soup on a "toy" webpage
html_doc = """
<!doctype html>
<html lang="en">
<head>
<title>Brandon's Homepage!</title>
</head>
<body>
<h1>Brandon's Homepage</h1>
<p id="intro">My name is Brandon. I'm love web scraping!</p>
<p id="background">I'm originally from Louisiana. I went to undergrad at Louisiana Tech and grad school at UNC.</p>
<p id="current">I currently work as a Product Manager of Linguistics and Analytics at Clarabridge.</p>
<h3>My Hobbies</h3>
<ul>
<li id="my favorite">Data Science</li>
<li>Backcountry Camping</li>
<li>Rock Climbing</li>
<li>Cycling</li>
<li>The Internet</li>
</ul>
</body>
</html>
"""
type(html_doc)
# Beautiful soup allows us to create a structured object out of this string
b = BeautifulSoup(html_doc)
type(b)
# Let's look at "b"
b
# The most useful methods in a Beautiful Soup object are "find" and "findAll".
# "find" takes several parameters, the most important are "name" and "attrs".
# Let's talk about "name".
b.find(name='body') # Finds the 'body' tag and everything inside of it.
body = b.find(name='body')
type(body) #tag
# You can search tags also
h1 = body.find(name='h1') # Find the 'h1' tag inside of the 'body' tag
h1
h1.text # Print out just the text inside of the 'h1' tag
# Now let's find the 'p' tags
p = b.find(name='p')
# This only finds one. This is where 'findAll' comes in.
all_p = b.findAll(name='p')
all_p
type(all_p) # Result sets are a lot like Python lists
all_p[0] # Access specific element with index
all_p[1]
# Iterable like list
for one_p in all_p:
print one_p.text # Print text
# Access specific attribute of a tag
all_p[0] # Specific tag
all_p[0]['id'] # Specific attribute of a specific tag
# Now let's talk about 'attrs'
# Beautiful soup also allows us to choose tags with specific attributes
b.find(name='p', attrs={"id":"intro"})
b.find(name='p', attrs={"id":"background"})
b.find(name='p', attrs={"id":"current"})
##########################################
############ Exercise 1 ############
##########################################
# 1. Extract the 'h3' element from Brandon's webpage.
b.find(name='h3')
# 2. Extract Brandon's hobbies from the html_doc. Print out the text of the hobby.
hobbies = b.findAll(name='ul')
for hobby in hobbies:
print hobby.text
# 3. Extract Brandon's hobby that has the id "my favorite".
b.find(name='li', attrs={'id':'my favorite'})
'''
Beautiful Soup from the web
'''
# We see data on a web page that we want to get. First we need the HTML.
# This downloads the HTML and puts it into the variable r
r = requests.get('http://www.imdb.com/title/tt1856010/')
# But when we look at it, it's just one giant string.
type(r.text) # Unicode string
r.text[0:200]
# Beautiful soup allows us to create a structured object out of this string
b = BeautifulSoup(r.text)
type(b)
'''
"find" and "findAll" with the 'name' parameter in Beautiful Soup
'''
b.find(name='body') # Find a specific HTML tag
body = b.find(name='body') # Store the output of your "find"
type(body) # Let's look at the type
# Can we still run another "find" command on the output?
img = body.find('img') # Find the image tags
img
type(img)
# Yes, but it only finds one of the "img" tags. We want them all.
imgs = body.findAll(name='img')
imgs # Now we get them all.
type(imgs) # Resultsets are a lot like Python lists
# Let's look at each individual image
imgs[0]
imgs[1]
# We're really interested in the 'src' attribute, the actual image location.
# How do we access attributes in a Python object? Using the dot notation or the
# brackets. With Beautiful Soup, we must use the brackets
imgs[0]['src']
# Now we can look through each image and print the 'src' attribute.
for img in imgs:
print img['src']
# Or maybe we want to create a list of all of the 'src' attributes
src_list = []
for img in imgs:
src_list.append(img['src'])
len(src_list)
'''
"find" and "findAll" with the 'attrs' parameter in Beautiful Soup
'''
# Now let's talk about 'attrs'
# Beautiful soup also allows us to choose tags with specific attributes
title = b.find(name="span", attrs={"class":"itemprop", "itemprop":"name"})
title # Prints HTML matching that tag, but we want the actual name
title.text # The "text" attribute gives you the text between two HTML tags
star_rating = b.find(name="div", attrs={"class":"titlePageSprite star-box-giga-star"})
# How do I get the actual star_rating number?
star_rating.text
# How do I make this star_rating a number instead of a string?
float(star_rating.text)
##########################################
############ Exercise 2 ############
##########################################
'''
We've retrieved the title of the show, but now we want the show's rating,
duration, and genre. Using "find" and "findAll", write code that retrieves
each of these things.
Hint: Everything can be found in the "infobar". Try finding that first and
searching within it.
'''
infobar = b.find(name="div", attrs={"class":"infobar"})
# Retrieve the show's content rating
content_rating = infobar.find(name='meta', attrs={"itemprop":"contentRating"})['content']
# Retrieve the show's duration
duration = infobar.find(name='time', attrs={"itemprop":"duration"}).text
# Retrieve the show's genre
genre = infobar.find(name='span', attrs={"itemprop":"genre"}).text
'''
Looping through 'findAll' results
'''
# Now we want to get the list of actors and actresses
# First let's get the "div" block with all of the actor info
actors_raw = b.find(name='div', attrs={"class":"txt-block", "itemprop":"actors", "itemscope":"", "itemtype":"http://schema.org/Person"})
# Now let's find all of the occurrences of the "span" with "itemprop" "name",
# meaning the tags with actors' and actresses' names.
actors = actors_raw.findAll(name="span", attrs={"itemprop":"name"})
# Now we want to loop through each one and get the text inside the tags
actors_list = [actor.text for actor in actors]
'''
Creating a "Web Scraping" Function
The code we've written above is useful, but we don't want to have to type it
every time. We want to create a function that takes a URL and outputs the pieces
we want.
'''
def getIMDBInfo(url):
r = requests.get(url) # Get HTML
b = BeautifulSoup(r.text) # Create Beautiful Soup object
# Get various attributes and put them in dictionary
results = {} # Initialize empty dictionary
# Get the title
results['title'] = b.find(name="span", attrs={"class":"itemprop", "itemprop":"name"}).text
# Rating
results['star_rating'] = float(b.find(name="div", attrs={"class":"titlePageSprite"}).text)
# Actors/actresses
actors_raw = b.find(name='div', attrs={"class":"txt-block", "itemprop":"actors", "itemscope":"", "itemtype":"http://schema.org/Person"})
actors = actors_raw.findAll(name="span", attrs={"class":"itemprop", "itemprop":"name"})
results['actors_list'] = [actor.text for actor in actors]
# Content Rating
infobar = b.find(name="div", attrs={"class":"infobar"})
results['content_rating'] = infobar.find(name='meta', attrs={"itemprop":"contentRating"})['content']
# Show duration
    results['duration'] = int(infobar.find(name='time', attrs={"itemprop":"duration"}).text.strip()[:-4])
# Genre
results['genre'] = infobar.find(name='span', attrs={"itemprop":"genre"}).text
# Return dictionary
return results
# Let's see if it worked
# We can look at the results of our previous web page, "House of Cards"
getIMDBInfo('http://www.imdb.com/title/tt1856010/')
# Now let's try another one: Interstellar
getIMDBInfo('http://www.imdb.com/title/tt0816692/')
# Now let's show the true functionality
list_of_title_urls = []
with open('imdb_movie_urls.csv', 'rU') as f:
list_of_title_urls = f.read().split('\n')
# Let's get the data for each title in the list
data = []
for title_url in list_of_title_urls:
imdb_data = getIMDBInfo(title_url)
data.append(imdb_data)
column_names = ['star_rating', 'title', 'content_rating', 'genre', 'duration', 'actors_list']
movieRatings = pd.DataFrame(data, columns = column_names)
movieRatings
# Now we have some data we can begin exploring, aggregating, etc.
'''
Bonus material: Getting movie data for the top 1000 movies on IMDB
'''
# Or let's build another webscraper to get the IMDB top 1000
movie_links = [] # Create empty list
# Notice that we are creating a list [1,101,201,...] and changing the URL slightly each time.
for i in range(1,1000,100):
# Get url
r = requests.get('http://www.imdb.com/search/title?groups=top_1000&sort=user_rating&start=' + str(i) + '&view=simple') # Get HTML
b = BeautifulSoup(r.text) # Create Beautiful Soup object
links = b.findAll(name='td', attrs={'class':'title'}) # Find all 'td's with 'class'='title'
for link in links:
        a_link = link.find('a') # Find links
movie_links.append('http://www.imdb.com' + str(a_link['href'])) # Add link to list
# Create dataframe of the top 1000 movies on IMDB
# NOTE: This could take 5-10 minutes. You can skip this part as I've already
# pulled all of this data and saved it to a file.
data = []
j=0
# Loop through every movie title
for movie_link in movie_links:
try:
imdb_data = getIMDBInfo(movie_link) # Get movie data
data.append(imdb_data) # Put movie data in list
except:
pass
j += 1
if j%50 == 0:
print 'Completed ' + str(j) + ' titles!' # Print progress
# Create data frame with movies
column_names = ['star_rating', 'title', 'content_rating', 'genre', 'duration', 'actors_list']
movieRatingsTop1000 = pd.DataFrame(data, columns = column_names)
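# OPTIONAL (a sketch): save the scraped results so the scraping step can be skipped next time
# ('my_imdb_top_1000.csv' is just an example filename)
movieRatingsTop1000.to_csv('my_imdb_top_1000.csv', index=False)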
# Read in the previously saved dataframe
movieRatingsTop1000 = pd.read_csv('imdb_movie_ratings_top_1000.csv')
# Now you're ready to do some analysis
movieRatingsTop1000.describe()
movieRatingsTop1000.groupby('genre').star_rating.mean()
movieRatingsTop1000.groupby('content_rating').star_rating.mean()
movieRatingsTop1000.plot(kind='scatter', x='duration', y='star_rating')
plt.show()
================================================
FILE: code/10_logistic_regression_confusion_matrix.py
================================================
'''
CLASS: Logistic Regression and Confusion Matrix
'''
###############################################################################
### Logistic Regression
###############################################################################
# Imports
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn import metrics
from math import exp
import numpy as np
import matplotlib.pyplot as plt
# Read in data
data = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT5/master/data/default.csv')
data.head()
# Change column to number
data['student_bin'] = data.student.map({'No':0, 'Yes':1})
# Let's do some cursory analysis.
data.groupby('default').balance.mean()
data.groupby('default').income.mean()
# Set X and y
feature_cols = ['balance', 'income','student_bin']
X = data[feature_cols]
y = data.default
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2)
# Fit model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test) # Predict
# Assess accuracy
print metrics.accuracy_score(y_test, y_pred)
###############################################################################
### Null Accuracy Rate
###############################################################################
# Compare to the null accuracy rate. The null accuracy rate is the accuracy you'd get
# by always predicting the majority class. If there are more 1's, predict all 1's.
# If there are more 0's, predict all 0's. There are several ways to calculate this.
# 1. Create a vector of the majority class and use the accuracy_score function.
# "If I predicted all 0's, how accurate would I be?"
print metrics.accuracy_score(y_test, [0]*len(y_test))
# 2. Calculate the mean of y_test (AKA the percentage of 1's)
y_test.mean()
# One minus that number will be the percentage of 0's. This means that if you
# predict all 0's, you will be correct 1 - y_test.mean() of the time.
1 - y_test.mean()
# This puts our accuracy score into context a bit. We can now see that we
# actually didn't do so great!
###############################################################################
### Interpreting Logistic Regression Coefficients
###############################################################################
# Let's look at the coefficients
for col in zip(feature_cols, logreg.coef_[0]):
print col[0], col[1]
# Let's interpret those: exponentiating a coefficient gives the multiplicative change in the odds.
for col in zip(feature_cols, logreg.coef_[0]):
    print 'A unit increase in', col[0], 'multiplies the odds of default by', exp(col[1])
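# a sketch of how the coefficients produce a probability: the linear combination
# (intercept + sum of coefficient * feature) is passed through the logistic function
# 1 / (1 + exp(-t)); this should match predict_proba for the same observation
t = logreg.intercept_[0] + np.dot(X_test.iloc[0, :], logreg.coef_[0])
1 / (1 + np.exp(-t))                   # manually computed probability of default
logreg.predict_proba(X_test)[0, 1]     # same probability from scikit-learn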
###############################################################################
### Confusion Matrix
###############################################################################
# Let's look at the confusion matrix
con_mat = metrics.confusion_matrix(y_test, y_pred)
print con_mat
# Let's define our true positives, false positives, true negatives, and false negatives
true_neg = con_mat[0][0]
false_neg = con_mat[1][0]
true_pos = con_mat[1][1]
false_pos = con_mat[0][1]
# Sensitivity: percent of correct predictions when reference value is 'default'
sensitivity = float(true_pos)/(false_neg + true_pos)
print sensitivity
print metrics.recall_score(y_test, y_pred)
# Specificity: percent of correct predictions when reference value is 'not default'
specificity = float(true_neg) / (true_neg + false_pos)
print specificity
###############################################################################
### Logistic Regression Thresholds
###############################################################################
# Logistic regression is actually predicting the underlying probability.
# However, when you call the "predict" function, it returns class labels. You
# can still predict the actual probability and set your own threshold if you'd
# like. This can be useful in cases where the "signal" from the model isn't
# strong.
# Predict probabilities
logreg.predict_proba(X_test).shape
probs = logreg.predict_proba(X_test)[:, 1]
# The natural threshold for probability is 0.5, but you don't have to use
# that.
# Use 0.5 threshold for predicting 'default' and confirm we get the same results
preds_05 = np.where(probs >= 0.5, 1, 0)
print metrics.accuracy_score(y_test, preds_05)
con_mat_05 = metrics.confusion_matrix(y_test, preds_05)
print con_mat_05
# Let's look at a histogram of these probabilities.
plt.hist(probs, bins=20)
plt.title('Distribution of Probabilities')
plt.xlabel('Probability')
plt.ylabel('Frequency')
plt.show()
# Change cutoff for predicting default to 0.2
preds_02 = np.where(probs > 0.2, 1, 0)
delta = float((preds_02 != preds_05).sum())/len(X_test)*100
print 'Changing the threshold from 0.5 to 0.2 changed %.2f percent of the predictions.' % delta
# Check the new accuracy, sensitivity, specificity
print metrics.accuracy_score(y_test, preds_02)
con_mat_02 = metrics.confusion_matrix(y_test, preds_02)
print con_mat_02
# Let's define our true positives, false positives, true negatives, and false negatives
true_neg = con_mat_02[0][0]
false_neg = con_mat_02[1][0]
true_pos = con_mat_02[1][1]
false_pos = con_mat_02[0][1]
# Sensitivity: percent of correct predictions when reference value is 'default'
sensitivity = float(true_pos)/(false_neg + true_pos)
print sensitivity
print metrics.recall_score(y_test, preds_02)
# Specificity: percent of correct predictions when reference value is 'not default'
specificity = float(true_neg) / (true_neg + false_pos)
print specificity
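# OPTIONAL (a sketch): sweep several thresholds to see the sensitivity/specificity trade-off
# (a lower threshold increases sensitivity but decreases specificity)
results = []
for cutoff in [0.1, 0.2, 0.3, 0.4, 0.5]:
    preds = np.where(probs >= cutoff, 1, 0)
    cm = metrics.confusion_matrix(y_test, preds)
    sens = float(cm[1][1]) / (cm[1][0] + cm[1][1])
    spec = float(cm[0][0]) / (cm[0][0] + cm[0][1])
    results.append((cutoff, sens, spec))
results  # list of (threshold, sensitivity, specificity) tuples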
###############################################################################
### Exercise/Possibly Homework
###############################################################################
'''
Let's use the glass identification dataset again. We've previously run knn
on this dataset. Now, let's try logistic regression. Access the dataset at
http://archive.ics.uci.edu/ml/datasets/Glass+Identification. Complete the
following tasks or answer the following questions.
'''
'''
1. Read the data into a pandas dataframe.
'''
# Taken from Kevin's 07 HW solution
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data',
names=['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type'],
index_col='id')
'''
2. Explore the data and look at what columns are available.
'''
# Taken from Kevin's 07 HW solution
df.shape # 214 x 10
df.head()
df.tail()
df.glass_type.value_counts()
df.isnull().sum() # No nulls in our data
'''
3. Convert the 'glass type' column into a binary response.
* If type of glass = 1/2/3/4, binary=0.
* If type of glass = 5/6/7, binary=1.
'''
# Taken from Kevin's 07 HW solution
df['binary'] = np.where(df.glass_type < 5, 0, 1) # method 1
df['binary'] = df.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1}) # method 2
df.binary.value_counts()
'''
4. Create a feature matrix and a response vector.
'''
# Taken from Kevin's 07 HW solution
features = ['ri','na','mg','al','si','k','ca','ba','fe'] # create a list of features
features = df.columns[:-2] # alternative way: slice 'columns' attribute like a list
X = df[features] # create DataFrame X by only selecting features
y = df.binary
'''
5. Split the data into the appropriate training and testing sets.
'''
# Taken from Kevin's 07 HW solution
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)
'''
6. Create and fit a logistic regression model.
'''
logreg = LogisticRegression() # Instantiate estimator
logreg.fit(X_train, y_train) # Fit data
'''
7. Make predictions with your new model.
'''
y_pred = logreg.predict(X_test) # Create predictions
'''
8. Calculate the accuracy rate of your model and compare it to the null accuracy.
'''
# Calculate accuracy of model
metrics.accuracy_score(y_test, y_pred)
# Calculate null accuracy
metrics.accuracy_score(y_test, [0]*len(y_test))
'''
9. Generate a confusion matrix for your predictions. Use this to calculate the
sensitivity and specificity of your model.
'''
# Let's look at the confusion matrix
con_mat = metrics.confusion_matrix(y_test, y_pred)
print con_mat
# Let's define our true positives, false positives, true negatives, and false negatives
true_neg = con_mat[0][0]
false_neg = con_mat[1][0]
true_pos = con_mat[1][1]
false_pos = con_mat[0][1]
# Sensitivity: percent of correct predictions when the actual value is 1 (glass types 5/6/7)
sensitivity = float(true_pos)/(false_neg + true_pos)
print sensitivity
# Specificity: percent of correct predictions when the actual value is 0 (glass types 1/2/3/4)
specificity = float(true_neg) / (true_neg + false_pos)
print specificity
================================================
FILE: code/13_naive_bayes.py
================================================
'''
CLASS: Naive Bayes SMS spam classifier
DATA SOURCE: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
'''
## READING IN THE DATA
# read tab-separated file using pandas
import pandas as pd
df = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT5/master/data/SMSSpamCollection.txt',
sep='\t', header=None, names=['label', 'msg'])
# examine the data
df.head(20)
df.label.value_counts()
df.msg.describe()
# convert label to a binary variable
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()
# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.msg, df.label, random_state=1)
X_train.shape
X_test.shape
## COUNTVECTORIZER: 'convert text into a matrix of token counts'
## http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer
# start with a simple example
train_simple = ['call you tonight',
'Call me a cab',
'please call me... PLEASE!']
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(train_simple)
vect.get_feature_names()
# transform training data into a 'document-term matrix'
train_simple_dtm = vect.transform(train_simple)
train_simple_dtm
train_simple_dtm.toarray()
# examine the vocabulary and document-term matrix together
pd.DataFrame(train_simple_dtm.toarray(), columns=vect.get_feature_names())
# transform testing data into a document-term matrix (using existing vocabulary)
test_simple = ["please don't call me"]
test_simple_dtm = vect.transform(test_simple)
test_simple_dtm.toarray()
pd.DataFrame(test_simple_dtm.toarray(), columns=vect.get_feature_names())
## REPEAT PATTERN WITH SMS DATA
# instantiate the vectorizer
vect = CountVectorizer()
# learn vocabulary and create document-term matrix in a single step
train_dtm = vect.fit_transform(X_train)
train_dtm
# transform testing data into a document-term matrix
test_dtm = vect.transform(X_test)
test_dtm
# store feature names and examine them
train_features = vect.get_feature_names()
len(train_features)
train_features[:50]
train_features[-50:]
# convert train_dtm to a regular array
train_arr = train_dtm.toarray()
train_arr
## SIMPLE SUMMARIES OF THE TRAINING DATA
# refresher on NumPy
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
arr
arr[0, 0]
arr[1, 3]
arr[0, :]
arr[:, 0]
np.sum(arr)
np.sum(arr, axis=0)
np.sum(arr, axis=1)
# exercise: calculate the number of tokens in the 0th message in train_arr
sum(train_arr[0, :])
# exercise: count how many times the 0th token appears across ALL messages in train_arr
sum(train_arr[:, 0])
# exercise: count how many times EACH token appears across ALL messages in train_arr
np.sum(train_arr, axis=0)
# exercise: create a DataFrame of tokens with their counts
train_token_counts = pd.DataFrame({'token':train_features, 'count':np.sum(train_arr, axis=0)})
train_token_counts.sort('count', ascending=False)
## MODEL BUILDING WITH NAIVE BAYES
## http://scikit-learn.org/stable/modules/naive_bayes.html
# train a Naive Bayes model using train_dtm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
# make predictions on test data using test_dtm
y_pred = nb.predict(test_dtm)
y_pred
# compare predictions to true labels
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred)
print metrics.confusion_matrix(y_test, y_pred)
# predict (poorly calibrated) probabilities and calculate AUC
y_prob = nb.predict_proba(test_dtm)[:, 1]
y_prob
print metrics.roc_auc_score(y_test, y_prob)
# exercise: show the message text for the false positives
X_test[y_test < y_pred]
# exercise: show the message text for the false negatives
X_test[y_test > y_pred]
## COMPARE NAIVE BAYES AND LOGISTIC REGRESSION
## USING ALL DATA AND CROSS-VALIDATION
# create a document-term matrix using all data
all_dtm = vect.fit_transform(df.msg)
# instantiate logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# compare AUC using cross-validation
# note: this is slightly improper cross-validation... can you figure out why?
from sklearn.cross_validation import cross_val_score
cross_val_score(nb, all_dtm, df.label, cv=10, scoring='roc_auc').mean()
cross_val_score(logreg, all_dtm, df.label, cv=10, scoring='roc_auc').mean()
## EXERCISE: CALCULATE THE 'SPAMMINESS' OF EACH TOKEN
# create separate DataFrames for ham and spam
df_ham = df[df.label==0]
df_spam = df[df.label==1]
# learn the vocabulary of ALL messages and save it
vect.fit(df.msg)
all_features = vect.get_feature_names()
# create document-term matrix of ham, then convert to a regular array
ham_dtm = vect.transform(df_ham.msg)
ham_arr = ham_dtm.toarray()
# create document-term matrix of spam, then convert to a regular array
spam_dtm = vect.transform(df_spam.msg)
spam_arr = spam_dtm.toarray()
# count how many times EACH token appears across ALL messages in ham_arr
ham_counts = np.sum(ham_arr, axis=0)
# count how many times EACH token appears across ALL messages in spam_arr
spam_counts = np.sum(spam_arr, axis=0)
# create a DataFrame of tokens with their separate ham and spam counts
all_token_counts = pd.DataFrame({'token':all_features, 'ham':ham_counts, 'spam':spam_counts})
# add one to ham counts and spam counts so that ratio calculations (below) make more sense
all_token_counts['ham'] = all_token_counts.ham + 1
all_token_counts['spam'] = all_token_counts.spam + 1
# calculate ratio of spam-to-ham for each token
all_token_counts['spam_ratio'] = all_token_counts.spam / all_token_counts.ham
all_token_counts.sort('spam_ratio')
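# a sketch of how this connects to what Naive Bayes learns internally: after fitting
# MultinomialNB on all messages, its per-class token counts (feature_count_) match the
# ham and spam counts computed above (before the +1 adjustment); assumes 0=ham, 1=spam
nb_all = MultinomialNB()
nb_all.fit(vect.transform(df.msg), df.label)
nb_all.feature_count_   # row 0 = ham counts, row 1 = spam counts (one column per token)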
================================================
FILE: code/15_kaggle.py
================================================
'''
CLASS: Kaggle Stack Overflow competition
'''
# read in the file and set the first column as the index
import pandas as pd
train = pd.read_csv('train.csv', index_col=0)
train.head()
'''
What are some assumptions and theories to test?
PostId: unique within the dataset
OwnerUserId: not unique within the dataset, assigned in order
OwnerCreationDate: users with older accounts have more open questions
ReputationAtPostCreation: higher reputation users have more open questions
OwnerUndeletedAnswerCountAtPostTime: users with more answers have more open questions
Tags: 1 to 5 tags are required, many unique tags
PostClosedDate: should only exist for closed questions
OpenStatus: 1 means open
'''
## OPEN STATUS
# dataset is perfectly balanced in terms of OpenStatus (not a representative sample)
train.OpenStatus.value_counts()
## USER ID
# OwnerUserId is not unique within the dataset, let's examine the top 3 users
train.OwnerUserId.value_counts()
# mostly closed questions, all lowercase, lots of spelling errors
train[train.OwnerUserId==466534]
# fewer closed questions, better grammar, high reputation but few answers
train[train.OwnerUserId==39677]
# very few closed questions, lots of answers
train[train.OwnerUserId==34537]
## REPUTATION
# ReputationAtPostCreation is higher for open questions: possibly use as a feature
train.groupby('OpenStatus').ReputationAtPostCreation.describe()
# not a useful histogram
train.ReputationAtPostCreation.hist()
# much more useful histogram
train[train.ReputationAtPostCreation < 1000].ReputationAtPostCreation.hist()
# grouped histogram
train[train.ReputationAtPostCreation < 1000].ReputationAtPostCreation.hist(by=train.OpenStatus, sharey=True)
## ANSWER COUNT
# rename column
train.rename(columns={'OwnerUndeletedAnswerCountAtPostTime':'Answers'}, inplace=True)
# Answers is higher for open questions: possibly use as a feature
train.groupby('OpenStatus').Answers.describe()
# grouped histogram
train[train.Answers < 50].Answers.hist(by=train.OpenStatus, sharey=True)
## USER ID
# OwnerUserId is assigned in numerical order
train.sort('OwnerUserId').OwnerCreationDate
# OwnerUserId is lower for open questions: possibly use as a feature
train.groupby('OpenStatus').OwnerUserId.describe()
## TITLE
# create a new feature that represents the length of the title (in characters)
train['TitleLength'] = train.Title.apply(len)
# Title is longer for open questions: possibly use as a feature
train.TitleLength.hist(by=train.OpenStatus)
## BODY
# create a new feature that represents the length of the body (in characters)
train['BodyLength'] = train.BodyMarkdown.apply(len)
# BodyMarkdown is longer for open questions: possibly use as a feature
train.BodyLength.hist(by=train.OpenStatus)
## TAGS
# Tag1 is required, and the rest are optional
train.isnull().sum()
# there are over 5000 unique tags
len(train.Tag1.unique())
# calculate the percentage of open questions for each tag
train.groupby('Tag1').OpenStatus.mean()
# percentage of open questions varies widely by tag (among popular tags)
train.groupby('Tag1').OpenStatus.agg(['mean','count']).sort('count')
# create a new feature that represents the number of tags for each question
train['NumTags'] = train.loc[:, 'Tag1':'Tag5'].notnull().sum(axis=1)
# NumTags is higher for open questions: possibly use as a feature
train.NumTags.hist(by=train.OpenStatus)
'''
Define a function that takes in a raw CSV file and returns a DataFrame that
includes all created features (and any other modifications). That way, we
can apply the same changes to both train.csv and test.csv.
'''
# define the function
def make_features(filename):
df = pd.read_csv(filename, index_col=0)
df.rename(columns={'OwnerUndeletedAnswerCountAtPostTime':'Answers'}, inplace=True)
df['TitleLength'] = df.Title.apply(len)
df['BodyLength'] = df.BodyMarkdown.apply(len)
df['NumTags'] = df.loc[:, 'Tag1':'Tag5'].notnull().sum(axis=1)
return df
# apply function to both training and testing files
train = make_features('train.csv')
test = make_features('test.csv')
'''
Use train/test split to compare a model that includes 1 feature with a model
that includes 5 features.
'''
## ONE FEATURE
# define X and y
feature_cols = ['ReputationAtPostCreation']
X = train[feature_cols]
y = train.OpenStatus
# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# fit a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# examine the coefficient to check that it makes sense
logreg.coef_
# predict response classes and predict class probabilities
y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]
# check how well we did
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred) # 0.538 (better than guessing)
metrics.confusion_matrix(y_test, y_pred) # predicts closed most of the time
metrics.roc_auc_score(y_test, y_prob) # 0.602 (not horrible)
metrics.log_loss(y_test, y_prob) # 0.690 (what is this?)
# log loss is the competition's evaluation metric, so let's get a feel for it
true = [0, 0, 1, 1]
prob = [0.1, 0.2, 0.8, 0.9]
metrics.log_loss(true, prob) # 0.164 (lower is better)
# let's try a few other predicted probabilities and check the log loss each time
prob = [0.4, 0.4, 0.6, 0.6]
metrics.log_loss(true, prob) # 0.511 (predictions are right, but less confident)
prob = [0.4, 0.4, 0.4, 0.6]
metrics.log_loss(true, prob) # 0.612 (one wrong prediction that is a bit off)
prob = [0.4, 0.4, 0.1, 0.6]
metrics.log_loss(true, prob) # 0.959 (one wrong prediction that is way off)
prob = [0.5, 0.5, 0.5, 0.5]
metrics.log_loss(true, prob) # 0.693 (you can get this score without a model)
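# for reference, log loss is just the average of -log(predicted probability of the
# true class); a quick sketch of that calculation by hand for the first example:
import numpy as np
true_arr = np.array([0, 0, 1, 1])
prob_arr = np.array([0.1, 0.2, 0.8, 0.9])
np.mean(-np.log(np.where(true_arr == 1, prob_arr, 1 - prob_arr)))   # 0.164, matches metrics.log_loss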
## FIVE FEATURES
# define X and y
feature_cols = ['ReputationAtPostCreation', 'Answers', 'TitleLength', 'BodyLength', 'NumTags']
X = train[feature_cols]
y = train.OpenStatus
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# fit a logistic regression model
logreg.fit(X_train, y_train)
# examine the coefficients to check that they make sense
logreg.coef_
# predict response classes and predict class probabilities
y_pred = logreg.predict(X_test)
y_prob = logreg.predict_proba(X_test)[:, 1]
# check how well we did
metrics.accuracy_score(y_test, y_pred) # 0.589 (doing better)
metrics.confusion_matrix(y_test, y_pred) # predicts open more often
metrics.roc_auc_score(y_test, y_prob) # 0.625 (tiny bit better)
metrics.log_loss(y_test, y_prob) # 0.677 (a bit better)
# let's see if cross-validation gives us similar results
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(logreg, X, y, scoring='log_loss', cv=10)
-scores.mean() # 0.677 (identical to train/test split; negated because the scorer returns negative log loss)
scores.std() # very small
'''
Use the model with 5 features to make a submission
'''
# make sure that X and y are defined properly
feature_cols = ['ReputationAtPostCreation', 'Answers', 'TitleLength', 'BodyLength', 'NumTags']
X = train[feature_cols]
y = train.OpenStatus
# train the model on ALL data (not X_train and y_train)
logreg.fit(X, y)
# predict class probabilities for the actual testing data (not X_test)
y_prob = logreg.predict_proba(test[feature_cols])[:, 1]
# sample submission file indicates we need two columns: PostId and predicted probability
test.index # PostId
y_prob # predicted probability
# create a DataFrame that has 'id' as the index, then export to a CSV file
sub = pd.DataFrame({'id':test.index, 'OpenStatus':y_prob}).set_index('id')
sub.to_csv('sub1.csv')
'''
Create a few more features from Title
'''
# string methods for a Series are accessed via 'str'
train.Title.str.lower()
# create a new feature that represents whether a Title is all lowercase
train['TitleLowercase'] = (train.Title.str.lower() == train.Title).astype(int)
# check if there are a meaningful number of ones
train.TitleLowercase.value_counts()
# percentage of open questions is lower among questions with lowercase titles: possibly use as a feature
train.groupby('TitleLowercase').OpenStatus.mean()
# create features that represent whether Title contains certain words
train['TitleQuestion'] = train.Title.str.contains('question', case=False).astype(int)
train['TitleNeed'] = train.Title.str.contains('need', case=False).astype(int)
train['TitleHelp'] = train.Title.str.contains('help', case=False).astype(int)
'''
Build a document-term matrix from Title using CountVectorizer
'''
# define X and y
X = train.Title
y = train.OpenStatus
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# use CountVectorizer with the default settings
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# fit and transform on X_train, but only transform on X_test
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)
# try a Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
y_prob = nb.predict_proba(test_dtm)[:, 1]
metrics.log_loss(y_test, y_prob) # 0.659 (a bit better than our previous model)
# try tuning CountVectorizer and repeat Naive Bayes
vect = CountVectorizer(stop_words='english')
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)
nb.fit(train_dtm, y_train)
y_prob = nb.predict_proba(test_dtm)[:, 1]
metrics.log_loss(y_test, y_prob) # 0.637 (even better)
# try switching to logistic regression
logreg.fit(train_dtm, y_train)
y_prob = logreg.predict_proba(test_dtm)[:, 1]
metrics.log_loss(y_test, y_prob) # 0.573 (much better!)
'''
Create features from BodyMarkdown using TextBlob
'''
# examine BodyMarkdown for first question
train.iloc[0].BodyMarkdown
# calculate the number of sentences in that question using TextBlob
from textblob import TextBlob
len(TextBlob(train.iloc[0].BodyMarkdown).sentences)
# calculate the number of sentences for all questions (raises an error)
train.BodyMarkdown.apply(lambda x: len(TextBlob(x).sentences))
# explicitly decode string to unicode to fix error (WARNING: VERY SLOW)
train['BodySentences'] = train.BodyMarkdown.apply(lambda x: len(TextBlob(x.decode('utf-8')).sentences))
================================================
FILE: code/17_ensembling_exercise.py
================================================
# Helper code for class 17 exercise
import pandas as pd
# define the function
def make_features(filename):
df = pd.read_csv(filename, index_col=0)
df.rename(columns={'OwnerUndeletedAnswerCountAtPostTime':'Answers'}, inplace=True)
df['TitleLength'] = df.Title.apply(len)
df['BodyLength'] = df.BodyMarkdown.apply(len)
df['NumTags'] = df.loc[:, 'Tag1':'Tag5'].notnull().sum(axis=1)
return df
# apply function to both training and testing files
train = make_features('train.csv')
test = make_features('test.csv')
# define X and y
feature_cols = ['ReputationAtPostCreation', 'Answers', 'TitleLength', 'BodyLength', 'NumTags']
X = train[feature_cols]
y = train.OpenStatus
###############################################################################
##### Create some models with the derived features
###############################################################################
###############################################################################
##### Count vectorizer
###############################################################################
# define X and y
X = train.Title
y = train.OpenStatus
# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# use CountVectorizer with the default settings
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# fit and transform on X_train, but only transform on X_test
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)
###############################################################################
##### Create a model with the text features
###############################################################################
================================================
FILE: code/18_clustering.py
================================================
'''
THE DATA
We have data about cars: things like MPG, acceleration, weight, etc. However,
we don't have logical groupings for these cars. We can construct these
manually using our domain knowledge (e.g. we could put all of the high mpg cars
together and all of the low mpg cars together), but we want a more automatic
way of grouping these vehicles that can take into account more features.
'''
# Imports
from sklearn.cluster import KMeans # K means model
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Read in data
data = pd.read_table('auto_mpg.txt', sep='|') # All values range from 0 to 1
data.drop('car_name', axis=1, inplace=True) # Drop labels from dataframe
data.head()
'''
CLUSTER ANALYSIS
How do we implement a k-means clustering algorithm?
scikit-learn KMeans documentation for reference:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
'''
# Standardize our data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Set random seed for reproducibility
np.random.seed(0)
# Run KMeans
est = KMeans(n_clusters=2, init='random') # Instantiate estimator
est.fit(data_scaled) # Fit your data
y_kmeans = est.predict(data_scaled) # Make cluster "predictions"
# Inspect the data by looking at the means for each cluster
data.groupby(y_kmeans).mean()
# This can be compared to the overall means for each variable
data.mean()
# We can get the coordinates for the center of each cluster
centers = est.cluster_centers_
'''
VISUALIZING THE CLUSTERS
'''
# We can create a nice plot to visualize the clusters along two of the dimensions
colors = np.array(['red', 'green', 'blue', 'yellow', 'orange'])
plt.figure()
plt.scatter(data_scaled[:, 0], data_scaled[:, 5], c=colors[y_kmeans], s=50)
plt.xlabel('MPG')
plt.ylabel('Acceleration')
plt.scatter(centers[:, 0], centers[:, 5], linewidths=3, marker='+', s=300, c='black')
plt.show()
# We can generate a scatter matrix to see all of the different dimensions paired
pd.scatter_matrix(data, c=colors[y_kmeans], figsize=(15,15), s = 100)
plt.show()
'''
DETERMINING THE NUMBER OF CLUSTERS
How do you choose k? There isn't a bright line, but we can evaluate
performance metrics such as the silhouette coefficient across values of k.
Note: You also have to take into account the practical limitations of choosing k.
Ten clusters may give the best value, but it might not make sense in the
context of your data.
scikit-learn Clustering metrics documentation:
http://scikit-learn.org/stable/modules/classes.html#clustering-metrics
'''
# Create a bunch of different models
k_rng = range(2,15)
k_est = [KMeans(n_clusters = k).fit(data) for k in k_rng]
# Silhouette Coefficient
# Generally want SC to be closer to 1, while also minimizing k
from sklearn import metrics
silhouette_score = [metrics.silhouette_score(data, e.labels_, metric='euclidean') for e in k_est]
# Plot the results
plt.figure()
plt.title('Silhouette coefficient for various values of k')
plt.plot(k_rng, silhouette_score, 'b*-')
plt.xlim([1,15])
plt.grid(True)
plt.ylabel('Silhouette Coefficient')
plt.show()
================================================
FILE: code/18_regularization.py
================================================
###############################################################################
##### Regularization with Linear Regression
###############################################################################
## TASK: Regularized regression
## FUNCTIONS: Ridge, RidgeCV, Lasso, LassoCV
## DOCUMENTATION: http://scikit-learn.org/stable/modules/linear_model.html
## DATA: Crime (n=319 non-null, p=122, type=regression)
## DATA DICTIONARY: http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
########## Prepare data ##########
# read in data, remove categorical features, remove rows with missing values
import pandas as pd
crime = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None, na_values=['?'])
crime = crime.iloc[:, 5:]
crime.dropna(inplace=True)
crime.head()
# define X and y
X = crime.iloc[:, :-1]
y = crime.iloc[:, -1]
# split into train/test
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
########## Linear Regression Model Without Regularization ##########
# linear regression
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
lm.coef_
# make predictions and evaluate
import numpy as np
from sklearn import metrics
preds = lm.predict(X_test)
print 'RMSE (no regularization) =', np.sqrt(metrics.mean_squared_error(y_test, preds))
########## Ridge Regression Model ##########
# ridge regression (alpha must be positive, larger means more regularization)
from sklearn.linear_model import Ridge
rreg = Ridge(alpha=0.1, normalize=True)
rreg.fit(X_train, y_train)
rreg.coef_
preds = rreg.predict(X_test)
print 'RMSE (Ridge reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds))
# use RidgeCV to select best alpha
from sklearn.linear_model import RidgeCV
alpha_range = 10.**np.arange(-2, 3)
rregcv = RidgeCV(normalize=True, scoring='mean_squared_error', alphas=alpha_range)
rregcv.fit(X_train, y_train)
rregcv.alpha_
preds = rregcv.predict(X_test)
print 'RMSE (Ridge CV reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds))
########## Lasso Regression Model ##########
# lasso (alpha must be positive, larger means more regularization)
from sklearn.linear_model import Lasso
las = Lasso(alpha=0.01, normalize=True)
las.fit(X_train, y_train)
las.coef_
preds = las.predict(X_test)
print 'RMSE (Lasso reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds))
# try a smaller alpha
las = Lasso(alpha=0.0001, normalize=True)
las.fit(X_train, y_train)
las.coef_
preds = las.predict(X_test)
print 'RMSE (Lasso reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds))
# use LassoCV to select the best alpha (it tries 100 alphas by default, but here we pass our own alpha_range)
from sklearn.linear_model import LassoCV
lascv = LassoCV(normalize=True, alphas=alpha_range)
lascv.fit(X_train, y_train)
lascv.alpha_
lascv.coef_
preds = lascv.predict(X_test)
print 'RMSE (Lasso CV reg.) =', np.sqrt(metrics.mean_squared_error(y_test, preds))
###############################################################################
##### Regularization with Logistic Regression
###############################################################################
## TASK: Regularized classification
## FUNCTION: LogisticRegression
## DOCUMENTATION: http://scikit-learn.org/stable/modules/linear_model.html
## DATA: Titanic (n=891, p=5 selected, type=classification)
## DATA DICTIONARY: https://www.kaggle.com/c/titanic-gettingStarted/data
########## Prepare data ##########
# Get and prepare data
titanic = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT5/master/data/titanic_train.csv')
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:, 1:]
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# define X and y
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived
# split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# standardize our data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
########## Logistic Regression Model Without Regularization ##########
# logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
logreg.coef_
y_pred = logreg.predict(X_test_scaled)
# Assess accuracy
print 'Accuracy (no penalty) =', metrics.accuracy_score(y_test, y_pred)
########## Logistic Regression With L1 Penalty ##########
# logistic regression with L1 penalty (C must be positive, smaller means more regularization)
logreg_l1 = LogisticRegression(C=0.1, penalty='l1')
logreg_l1.fit(X_train_scaled, y_train)
logreg_l1.coef_
y_pred_l1 = logreg_l1.predict(X_test_scaled)
# Assess accuracy
print 'Accuracy (L1 penalty) =', metrics.accuracy_score(y_test, y_pred_l1)
########## Logistic Regression With L2 Penalty ##########
# logistic regression with L2 penalty (C must be positive, smaller means more regularization)
logreg_l2 = LogisticRegression(C=0.1, penalty='l2')
logreg_l2.fit(X_train_scaled, y_train)
logreg_l2.coef_
y_pred_l2 = logreg_l2.predict(X_test_scaled)
# Assess accuracy
print 'Accuracy (L2 penalty) =', metrics.accuracy_score(y_test, y_pred_l2)
================================================
FILE: code/19_advanced_sklearn.py
================================================
## TASK: Searching for optimal parameters
## FUNCTION: GridSearchCV
## DOCUMENTATION: http://scikit-learn.org/stable/modules/grid_search.html
## DATA: Titanic (n=891, p=5 selected, type=classification)
## DATA DICTIONARY: https://www.kaggle.com/c/titanic-gettingStarted/data
# read in and prepare titanic data
import pandas as pd
titanic = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT5/master/data/titanic_train.csv')
titanic['Sex'] = titanic.Sex.map({'female':0, 'male':1})
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked').iloc[:, 1:]
titanic = pd.concat([titanic, embarked_dummies], axis=1)
# define X and y
feature_cols = ['Pclass', 'Sex', 'Age', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived
# use cross-validation to find best max_depth
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
# try max_depth=2
treeclf = DecisionTreeClassifier(max_depth=2, random_state=1)
cross_val_score(treeclf, X, y, cv=10, scoring='roc_auc').mean()
# try max_depth=3
treeclf = DecisionTreeClassifier(max_depth=3, random_state=1)
cross_val_score(treeclf, X, y, cv=10, scoring='roc_auc').mean()
# use GridSearchCV to automate the search
from sklearn.grid_search import GridSearchCV
treeclf = DecisionTreeClassifier(random_state=1)
depth_range = range(1, 21)
param_grid = dict(max_depth=depth_range)
grid = GridSearchCV(treeclf, param_grid, cv=10, scoring='roc_auc')
grid.fit(X, y)
# check the results of the grid search
grid.grid_scores_
grid_mean_scores = [result[1] for result in grid.grid_scores_]
# plot the results
import matplotlib.pyplot as plt
plt.plot(depth_range, grid_mean_scores)
# what was best?
grid.best_score_
grid.best_params_
grid.best_estimator_
# search a "grid" of parameters
depth_range = range(1, 21)
leaf_range = range(1, 11)
param_grid = dict(max_depth=depth_range, min_samples_leaf=leaf_range)
grid = GridSearchCV(treeclf, param_grid, cv=10, scoring='roc_auc')
grid.fit(X, y)
grid.grid_scores_
grid.best_score_
grid.best_params_
## TASK: Standardization of features (aka "center and scale" or "z-score normalization")
## FUNCTION: StandardScaler
## DOCUMENTATION: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
## EXAMPLE: http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/preprocessing/about_standardization_normalization.ipynb
## DATA: Wine (n=178, p=2 selected, type=classification)
## DATA DICTIONARY: http://archive.ics.uci.edu/ml/datasets/Wine
# fake data
train = pd.DataFrame({'id':[0,1,2], 'length':[0.9,0.3,0.6], 'mass':[0.1,0.2,0.8], 'rings':[40,50,60]})
oos = pd.DataFrame({'length':[0.59], 'mass':[0.79], 'rings':[54.9]})
# define X and y
X = train[['length','mass','rings']]
y = train.id
# KNN with k=1
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
# what "should" it predict? what does it predict?
knn.predict(oos)
# standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
# compare original to standardized
X.values
X_scaled
# figure out how it standardized
scaler.mean_
scaler.std_
(X.values-scaler.mean_) / scaler.std_
# try this on real data
wine = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None, usecols=[0,10,13])
wine.columns=['label', 'color', 'proline']
wine.head()
wine.describe()
# define X and y
X = wine[['color', 'proline']]
y = wine.label
# split into train/test
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# standardize X_train
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
# check that it worked properly
X_train_scaled[:, 0].mean()
X_train_scaled[:, 0].std()
X_train_scaled[:, 1].mean()
X_train_scaled[:, 1].std()
# standardize X_test
X_test_scaled = scaler.transform(X_test)
# is this right?
X_test_scaled[:, 0].mean()
X_test_scaled[:, 0].std()
X_test_scaled[:, 1].mean()
X_test_scaled[:, 1].std()
# compare KNN accuracy on original vs scaled data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
knn.fit(X_train_scaled, y_train)
knn.score(X_test_scaled, y_test)
## TASK: Chaining steps
## FUNCTION: Pipeline
## DOCUMENTATION: http://scikit-learn.org/stable/modules/pipeline.html
## DATA: Wine (n=178, p=2 selected, type=classification)
## DATA DICTIONARY: http://archive.ics.uci.edu/ml/datasets/Wine
# here is proper cross-validation on the original (unscaled) data
X = wine[['color', 'proline']]
y = wine.label
knn = KNeighborsClassifier(n_neighbors=3)
cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()
# why is this improper cross-validation on the scaled data?
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy').mean()
# fix this using Pipeline
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
# using GridSearchCV with Pipeline
neighbors_range = range(1, 21)
param_grid = dict(kneighborsclassifier__n_neighbors=neighbors_range)
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
grid.best_score_
grid.best_params_
================================================
FILE: code/19_gridsearchcv_exercise.py
================================================
'''
EXERCISE: GridSearchCV with Stack Overflow competition data
'''
import pandas as pd
# define a function to create features
def make_features(filename):
df = pd.read_csv(filename, index_col=0)
df.rename(columns={'OwnerUndeletedAnswerCountAtPostTime':'Answers'}, inplace=True)
df['TitleLength'] = df.Title.apply(len)
df['BodyLength'] = df.BodyMarkdown.apply(len)
df['NumTags'] = df.loc[:, 'Tag1':'Tag5'].notnull().sum(axis=1)
return df
# apply function to both training and testing files
train = make_features('train.csv')
test = make_features('test.csv')
# define X and y
feature_cols = ['ReputationAtPostCreation', 'Answers', 'TitleLength', 'BodyLength', 'NumTags']
X = train[feature_cols]
y = train.OpenStatus
'''
MAIN TASK: Use GridSearchCV to find optimal parameters for KNeighborsClassifier.
- For "n_neighbors", try 5 different integer values.
- For "weights", try 'uniform' and 'distance'.
- Use 5-fold cross-validation (instead of 10-fold) to save computational time.
- Remember that log loss is your evaluation metric!
BONUS TASK #1: Once you have found optimal parameters, train your KNN model using
those parameters, make predictions on the test set, and submit those predictions.
BONUS TASK #2: Read the scikit-learn documentation for GridSearchCV to find the
shortcut for accomplishing bonus task #1.
'''
# MAIN TASK
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
from sklearn.grid_search import GridSearchCV
neighbors_range = [20, 40, 60, 80, 100]
weight_options = ['uniform', 'distance']
param_grid = dict(n_neighbors=neighbors_range, weights=weight_options)
grid = GridSearchCV(knn, param_grid, cv=5, scoring='log_loss')
grid.fit(X, y)
grid.grid_scores_
grid.best_score_
grid.best_params_
# BONUS TASK #1
knn = KNeighborsClassifier(n_neighbors=100, weights='uniform')
knn.fit(X, y)
y_prob = knn.predict_proba(test[feature_cols])[:, 1]
sub = pd.DataFrame({'id':test.index, 'OpenStatus':y_prob}).set_index('id')
sub.to_csv('sub.csv')
# BONUS TASK #2 (shortcut: GridSearchCV refits the best model on all of the data by default, so you can predict directly with the grid object)
y_prob = grid.predict_proba(test[feature_cols])[:, 1]
sub = pd.DataFrame({'id':test.index, 'OpenStatus':y_prob}).set_index('id')
sub.to_csv('sub.csv')
================================================
FILE: code/19_regex_exercise.py
================================================
'''
Regular Expressions Exercise
'''
# open file and store each line as one row
with open('../data/homicides.txt', 'rU') as f:
raw = [row for row in f]
'''
Create a list of ages
'''
import re
ages = []
for row in raw:
match = re.search(r'\d+ years old', row)
if match:
ages.append(match.group())
else:
ages.append('0')
# split the string on spaces, only keep the first element, and convert to int
ages = [int(element.split()[0]) for element in ages]
# check that 'raw' and 'ages' are the same length
assert(len(raw)==len(ages))
# simplify process using a lookahead
ages = []
for row in raw:
match = re.search(r'\d+(?= years)', row)
if match:
ages.append(int(match.group()))
else:
ages.append(0)
================================================
FILE: code/19_regex_reference.py
================================================
'''
Regular Expressions (regex) Reference Guide
Sources:
https://developers.google.com/edu/python/regular-expressions
https://docs.python.org/2/library/re.html
'''
'''
Basic Patterns:
Ordinary characters match themselves exactly
. matches any single character except newline \n
\w matches a word character (letter, digit, underscore)
\W matches any non-word character
\b matches boundary between word and non-word
\s matches single whitespace character (space, newline, return, tab, form)
\S matches single non-whitespace character
\d matches single digit (0 through 9)
\t matches tab
\n matches newline
\r matches return
\ match a special character, such as period: \.
Rules for Searching:
Search proceeds through string from start to end, stopping at first match
All of the pattern must be matched
Basic Search Function:
match = re.search(r'pattern', string_to_search)
Returns match object
If there is a match, access match using match.group()
If there is no match, match is None
Use 'r' in front of pattern to designate a raw string
'''
import re
s = 'my 1st string!!'
match = re.search(r'my', s) # returns match object
if match: # checks whether match was found
print match.group() # if match was found, then print result
re.search(r'my', s).group() # single-line version (without error handling)
re.search(r'st', s).group() # 'st'
re.search(r'sta', s).group() # error
re.search(r'\w\w\w', s).group() # '1st'
re.search(r'\W', s).group() # ' '
re.search(r'\W\W', s).group() # '!!'
re.search(r'\s', s).group() # ' '
re.search(r'\s\s', s).group() # error
re.search(r'..t', s).group() # '1st'
re.search(r'\s\St', s).group() # ' st'
re.search(r'\bst', s).group() # 'st'
'''
Repetition:
+ 1 or more occurrences of the pattern to its left
* 0 or more occurrences
? 0 or 1 occurrence
+ and * are 'greedy': they try to use up as much of the string as possible
Add ? after + or * to make them non-greedy: +? or *?
'''
s = 'sid is missing class'
re.search(r'miss\w+', s).group() # 'missing'
re.search(r'is\w+', s).group() # 'issing'
re.search(r'is\w*', s).group() # 'is'
s = '<h1>my heading</h1>'
re.search(r'<.+>', s).group() # '<h1>my heading</h1>'
re.search(r'<.+?>', s).group() # '<h1>'
'''
Positions:
^ match start of a string
$ match end of a string
'''
s = 'sid is missing class'
re.search(r'^miss', s).group() # error
re.search(r'..ss', s).group() # 'miss'
re.search(r'..ss$', s).group() # 'lass'
'''
Brackets:
[abc] match a or b or c
\w, \s, etc. work inside brackets, except period just means a literal period
[a-z] match any lowercase letter (dash indicates range unless it's last)
[abc-] match a or b or c or -
[^ab] match anything except a or b
'''
s = 'my email is john-doe@gmail.com'
re.search(r'\w+@\w+', s).group() # 'doe@gmail'
re.search(r'[\w.-]+@[\w.-]+', s).group() # 'john-doe@gmail.com'
'''
Lookarounds:
Lookahead matches a pattern only if it is followed by another pattern
100(?= dollars) matches '100' only if it is followed by ' dollars'
Lookbehind matches a pattern only if it is preceded by another pattern
(?<=\$)100 matches '100' only if it is preceded by '$'
'''
s = 'Name: Cindy, 30 years old'
re.search(r'\d+(?= years? old)', s).group() # '30'
re.search(r'(?<=Name: )\w+', s).group() # 'Cindy'
'''
Match Groups:
Parentheses create logical groups inside of match text
match.group(1) corresponds to first group
match.group(2) corresponds to second group
match.group() corresponds to entire match text (as usual)
'''
s = 'my email is john-doe@gmail.com'
match = re.search(r'([\w.-]+)@([\w.-]+)', s)
if match:
match.group(1) # 'john-doe'
match.group(2) # 'gmail.com'
match.group() # 'john-doe@gmail.com'
'''
Finding All Matches:
re.findall() finds all matches and returns them as a list of strings
list_of_strings = re.findall(r'pattern', string_to_search)
If pattern includes parentheses, a list of tuples is returned
'''
s = 'emails: joe@gmail.com, bob@gmail.com'
re.findall(r'[\w.-]+@[\w.-]+', s) # ['joe@gmail.com', 'bob@gmail.com']
re.findall(r'([\w.-]+)@([\w.-]+)', s) # [('joe', 'gmail.com'), ('bob', 'gmail.com')]
'''
Option Flags:
Option flags modify the behavior of the pattern matching
default: matching is case sensitive
re.IGNORECASE: ignore uppercase/lowercase differences ('a' matches 'a' or 'A')
default: period matches any character except newline
re.DOTALL: allow period to match newline
default: within a string of many lines, ^ and $ match start and end of entire string
re.MULTILINE: allow ^ and $ to match start and end of each line
Option flag is third argument to re.search() or re.findall():
re.search(r'pattern', string_to_search, re.IGNORECASE)
re.findall(r'pattern', string_to_search, re.IGNORECASE)
'''
s = 'emails: nicole@ga.co, joe@gmail.com, PAT@GA.CO'
re.findall(r'\w+@ga\.co', s) # ['nicole@ga.co']
re.findall(r'\w+@ga\.co', s, re.IGNORECASE) # ['nicole@ga.co', 'PAT@GA.CO']
'''
Substitution:
re.sub() finds all matches and replaces them with a specified string
new_string = re.sub(r'pattern', r'replacement', string_to_search)
Replacement string can refer to text from matching groups:
\1 refers to group(1)
\2 refers to group(2)
etc.
'''
s = 'sid is missing class'
re.sub(r'is ', r'was ', s) # 'sid was missing class'
s = 'emails: joe@gmail.com, bob@gmail.com'
re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@yahoo.com', s) # 'emails: joe@yahoo.com, bob@yahoo.com'
'''
Useful, But Not Covered:
re.split() splits a string by the occurrences of a pattern
re.compile() compiles a pattern (for improved performance if it's used many times)
A|B indicates a pattern that can match A or B
'''
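# quick illustrations of the items above (a sketch, not part of the reference proper)
s = 'sid is missing class'
re.split(r'\s', s)                                        # ['sid', 'is', 'missing', 'class']
pattern = re.compile(r'[\w.-]+@[\w.-]+')                  # compile once, reuse many times
pattern.findall('emails: joe@gmail.com, bob@gmail.com')   # ['joe@gmail.com', 'bob@gmail.com']
re.search(r'cat|dog', 'I have a dog').group()             # 'dog'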
================================================
FILE: code/20_sql.py
================================================
###############################################################################
##### Class 20: SQL
###############################################################################
"""
Accessing the data from a database is just another way to get data. This has
no affect on how you model the data or do anything else; it's just a different
repository for storing data. We're used to getting data from "flat files" like
CSV or TXT files. The method for getting the data is different, but the result
is the same.
"""
###############################################################################
##### Accessing Data from a SQLite Database
###############################################################################
# Python package to interface with database files
import sqlite3 as lite
##### Connecting to a Database #####
# Connect to a local database (it's basically just a file)
con = lite.connect('sales.db')
con
# Create a Cursor object. This lets you browse your database
cur = con.cursor()
cur
##### What Tables are in our Database? ######
# Let's look at what tables we have available. Don't worry about the exact
# command being executed here; we'll cover that more later.
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
# Note that this doesn't explicitly return anything. It only stores the
# results in your cursor. You have to 'fetch' the results to get them back.
# There are several different ways to do this. However, once you 'fetch' a
# result, it is no longer there.
# Fetch all results at once
cur.fetchall()
# One at a time
cur.fetchone()
# Some specified number at a time
cur.fetchmany(4)
# Also note that the results weren't stored anywhere, only printed out. To
# keep them, we must put them in a variable
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = cur.fetchall()
tables
##### Getting Data from our Database #####
# Select all of the data from the table 'Orders'
cur.execute("SELECT * FROM Orders")
orders_table = cur.fetchall()
orders_table
# This is a list of tuples, not the most convenient thing to work with, but
# manageable. We know how to access these elements.
orders_table[0]
orders_table[0][0]
orders_table[0][3]
# We could also put these into a dataframe.
import pandas as pd
orders_data = pd.DataFrame(orders_table)
orders_data
# However, pandas has a nice function to read the results of your SQL query into
# a pandas data frame. This is the best thing ever!
orders_data = pd.read_sql_query("SELECT * FROM Orders", con=con)
orders_data
# Let's look at our other data to see what is contained in it.
for table in tables:
print 'Table %s' % table[0]
print ' '
print pd.read_sql_query("SELECT * FROM %s" % table[0], con=con).head()
print ' '
print ' '
print ' '
# NOTE: Be careful doing this if there are a lot of tables in your database.
# Here we only have five, so it's okay.
# Let's look at the sales database schema to get a better idea of the layout.
# https://raw.githubusercontent.com/justmarkham/DAT5/master/slides/20_sales_db_schema.png
###############################################################################
##### Exploring, Discovering, and Aggregating Data
###############################################################################
"""
Everything that is done in the following queries could be done in pandas.
However, it is at times easier to do it in SQL, so it's important to be aware
of how to do it. I've included the pandas ways of doing things below the SQL
queries for ease of comparison.
"""
##### Selecting Data #####
# Return all of the data (* means all columns)
pd.read_sql_query("SELECT * FROM Orders", con=con)
orders_data
# Return specific columns by name
pd.read_sql_query("SELECT CustomerID, EmployeeID, OrderDate, FreightCharge FROM Orders", con)
orders_data[['CustomerID','EmployeeID','OrderDate','FreightCharge']]
##### Segmenting Data #####
# Return only the more recent orders (order date more recent than 2013)
pd.read_sql_query("SELECT * FROM Orders WHERE OrderDate > '2013-01-01'", con)
orders_data[orders_data.OrderDate > '2013-01-01']
# Return only orders shipped via '1'
pd.read_sql_query("SELECT * FROM Orders WHERE ShipVia = 1", con)
orders_data[orders_data.ShipVia == 1]
# Combine conditions with AND and OR
pd.read_sql_query("SELECT * FROM Orders WHERE ShipVia = 1 AND OrderDate > '2013-01-01'", con)
orders_data[(orders_data.ShipVia == 1) & (orders_data.OrderDate > '2013-01-01')]
pd.read_sql_query("SELECT * FROM Orders WHERE ShipVia = 1 OR OrderDate > '2013-01-01'", con)
orders_data[(orders_data.ShipVia == 1) | (orders_data.OrderDate > '2013-01-01')]
##### Ordering Data #####
# We can return the rows in a specific order.
pd.read_sql_query("SELECT * FROM Orders ORDER BY OrderDate", con)
orders_data.sort_index(by="OrderDate")
# Ascending
pd.read_sql_query("SELECT * FROM Orders ORDER BY FreightCharge", con)
orders_data.sort_index(by="FreightCharge")
# Descending
pd.read_sql_query("SELECT * FROM Orders ORDER BY FreightCharge DESC", con)
orders_data.sort_index(by="FreightCharge", ascending=False)
##### Aggregating Data #####
# Count the number of rows in the order dataset
pd.read_sql_query("SELECT COUNT(*) FROM Orders", con)
orders_data.OrderID.count()
pd.read_sql_query("SELECT COUNT(*) AS row_count FROM Orders", con) # Alias column
# Compute the minimum, maximum, and average freight charge
pd.read_sql_query("""SELECT MIN(FreightCharge) AS min, MAX(FreightCharge) AS max, AVG(FreightCharge) AS avg
FROM Orders""", con)
(orders_data.FreightCharge.min(), orders_data.FreightCharge.max(), orders_data.FreightCharge.mean())
##### Group By #####
# Let's look at the average freight cost by the method of shipping
# What are all of the ShipVia values?
pd.read_sql_query("SELECT DISTINCT ShipVia FROM Orders", con) # Note DISTINCT
orders_data.ShipVia.unique()
# We can write a query for each one of the ShipVia values
pd.read_sql_query("SELECT ShipVia, AVG(FreightCharge) AS avg FROM Orders WHERE ShipVia = 1", con)
orders_data[orders_data.ShipVia == 1].FreightCharge.mean()
pd.read_sql_query("SELECT ShipVia, AVG(FreightCharge) AS avg FROM Orders WHERE ShipVia = 2", con)
orders_data[orders_data.ShipVia == 2].FreightCharge.mean()
pd.read_sql_query("SELECT ShipVia, AVG(FreightCharge) AS avg FROM Orders WHERE ShipVia = 3", con)
orders_data[orders_data.ShipVia == 3].FreightCharge.mean()
pd.read_sql_query("SELECT ShipVia, AVG(FreightCharge) AS avg FROM Orders WHERE ShipVia = 4", con)
orders_data[orders_data.ShipVia == 4].FreightCharge.mean()
# However, this is pretty verbose. Also, what if there were 20 values? Should
# we write 20 queries? Of course not! This is where GROUP BY comes in.
pd.read_sql_query("SELECT ShipVia, AVG(FreightCharge) AS avg FROM Orders GROUP BY ShipVia", con)
orders_data.groupby('ShipVia').FreightCharge.mean()
# You can use any aggregation or other metric with a group by
pd.read_sql_query("SELECT ShipVia, MAX(FreightCharge) AS max FROM Orders GROUP BY ShipVia", con)
orders_data.groupby('ShipVia').FreightCharge.max()
# However, we don't know what any of these "ShipVia" values mean. We can
# probably look in the Shippers table and figure it out.
pd.read_sql_query("SELECT * FROM Shippers", con)
# But it's always better to have all of this info together.
###############################################################################
##### Joining Tables
###############################################################################
"""
But surely there's a better way to look at it all at once. This is where
JOINs come in. As the name suggests, JOINs allow you to JOIN two (or more)
tables together. There are several types of joins:
-INNER JOIN: Returns all rows when there is at least one match in BOTH tables
-LEFT JOIN: Return all rows from the left table, and the matched rows from
the right table
-RIGHT JOIN: Return all rows from the right table, and the matched rows from
the left table
-FULL JOIN: Return all rows when there is a match in ONE of the tables
http://i.stack.imgur.com/GbJ7N.png
These have different use cases (please read more about them). In our case, we
want to join the Shippers table (with the ShipVia ids) to the corresponding ids
in our Orders table. So, we want to LEFT JOIN Shippers to Orders based upon
the matching id.
NOTE: You can also JOIN pandas dataframes using the "merge" function (a sketch of
this is shown at the end of this section).
"""
# Let's look at the tables separately to evaluate how to join.
pd.read_sql_query("SELECT * FROM Orders", con)
pd.read_sql_query("SELECT * FROM Shippers", con)
# Let's look at the join
pd.read_sql_query("""SELECT *
FROM Orders
LEFT JOIN Shippers
ON Orders.ShipVia = Shippers.ShipperID"""
, con)
# Note that any time we want to refer to a column in a particular table, we
# have to type the table name. That would get old. Instead, we can give each
# table an alias or nickname. We get the same result.
pd.read_sql_query("""SELECT *
FROM Orders a
LEFT JOIN Shippers b
ON a.ShipVia = b.ShipperID"""
, con)
# We can also return specific columns from each table.
pd.read_sql_query("""SELECT b.CompanyName, a.FreightCharge
FROM Orders a
LEFT JOIN Shippers b
ON a.ShipVia = b.ShipperID"""
, con)
# We can get our result from before, but with the company name instead of just
# their id.
pd.read_sql_query("""SELECT b.CompanyName, AVG(a.FreightCharge) AS avg
FROM Orders a
LEFT JOIN Shippers b
ON a.ShipVia = b.ShipperID
GROUP BY b.CompanyName"""
, con)
# Finally, we can order our data by average freight charge.
pd.read_sql_query("""SELECT b.CompanyName, AVG(a.FreightCharge) AS avg
FROM Orders a
LEFT JOIN Shippers b
ON a.ShipVia = b.ShipperID
GROUP BY b.CompanyName
ORDER BY avg"""
, con)
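# For comparison, a sketch of the pandas equivalent of the basic LEFT JOIN above,
# using the "merge" function mentioned earlier (column names as in the sales schema).
shippers_data = pd.read_sql_query("SELECT * FROM Shippers", con)
orders_data.merge(shippers_data, how='left', left_on='ShipVia', right_on='ShipperID')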
###############################################################################
##### Nested Queries
###############################################################################
"""
Nested queries are exactly what they sound like, queries within queries. These
can be convenient in a number of different places. They allow you to use the
result from one query in another query.
"""
# Let's say we want to figure out what percentage of orders get shipped by each
# shipper. We can count the number of occurrences of each shipper.
pd.read_sql_query("SELECT ShipVia, COUNT(ShipVia) AS count FROM Orders GROUP BY ShipVia", con)
# We can also calculate the total number of orders.
pd.read_sql_query("SELECT COUNT(*) FROM Orders", con)
# We can divide each of the counts above by the total number of orders.
pd.read_sql_query("""SELECT ShipVia, 1.0*COUNT(ShipVia)/20*100 AS percent
FROM Orders GROUP BY ShipVia""", con)
# But what happens when we get a new order? We have to update the "20"
# manually. That's not optimal. That's where nested queries help.
pd.read_sql_query("""SELECT ShipVia, 1.0*COUNT(ShipVia)/(SELECT COUNT(*) FROM Orders)*100 AS percent
FROM Orders GROUP BY ShipVia""", con)
# You can nest any number of queries in a full query.
###############################################################################
##### Case Statements
###############################################################################
"""
CASE statements are similar to if else statements in Python. They allow you to
specify conditions and what the results are if the condition is true. They
also allow you to specify what happens when none of the conditions are met
(similar to the else statement in Python).
"""
# Let's say you want to determine whether the average freight charge has
# changed from year to year. A CASE statement allows you to create conditions
# for each year.
pd.read_sql_query("""SELECT CASE
WHEN OrderDate > '2012-01-01' AND OrderDate < '2013-01-01' THEN 2012
WHEN OrderDate > '2013-01-01' AND OrderDate < '2014-01-01' THEN 2013
ELSE 'Not a date!'
END AS year, OrderDate
FROM Orders""", con)
# Now we can use this to calculate the average freight charge per year.
pd.read_sql_query("""SELECT CASE
WHEN OrderDate > '2012-01-01' AND OrderDate < '2013-01-01' THEN 2012
WHEN OrderDate > '2013-01-01' AND OrderDate < '2014-01-01' THEN 2013
ELSE 'Not a date!'
END AS year, AVG(FreightCharge) AS avg
FROM Orders
GROUP BY year""", con)
# Close the connection
con.close()
###############################################################################
##### Normal Data Science Process with a Database
###############################################################################
"""
Finally, just to reiterate that getting data from databases is nothing more
than another way to get data (and thus, has no effect upon the rest of the data
science process), here is some code we used in a previous class. Instead of
reading data from a CSV file, we get it from a database.
"""
##### Training #####
# Open new connection
con = lite.connect('vehicles.db')
# Get training data from database
train = pd.read_sql_query('SELECT * FROM vehicle_train', con=con)
# Encode car as 0 and truck as 1
train['type'] = train.type.map({'car':0, 'truck':1})
train.head()
# Create a list of the feature columns (every column except for the 0th column)
feature_cols = train.columns[1:]
# Define X (features) and y (response)
X = train[feature_cols]
y = train.price
# Import the relevant class, and instantiate the model (with random_state=1)
from sklearn.tree import DecisionTreeRegressor
treereg = DecisionTreeRegressor(random_state=1)
treereg.fit(X, y)
# Use 3-fold cross-validation to estimate the RMSE for this model
from sklearn.cross_validation import cross_val_score
import numpy as np
scores = cross_val_score(treereg, X, y, cv=3, scoring='mean_squared_error')
np.mean(np.sqrt(-scores))
##### Testing #####
# Get testing data from database
test = pd.read_sql_query('SELECT * FROM vehicle_test', con=con)
con.close()
# Encode car as 0 and truck as 1
test['type'] = test.type.map({'car':0, 'truck':1})
# Print the data
test
# Define X and y
X_test = test[feature_cols]
y_test = test.price
# Make predictions on test data
y_pred = treereg.predict(X_test)
y_pred
# Calculate RMSE
from sklearn import metrics
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
# Calculate RMSE for your own tree!
y_test = [3000, 6000, 12000]
y_pred = [3057, 3057, 16333]
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
================================================
FILE: code/21_ensembles_example.py
================================================
'''
Imports
'''
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
'''
Define a function that takes in a raw CSV file and returns a DataFrame that
includes all created features (and any other modifications). That way, we
can apply the same changes to both train.csv and test.csv.
'''
# Define the function
def make_features(filename):
# Read in dataframe
df = pd.read_csv(filename, index_col=0)
    # Rename columns
df.rename(columns={'OwnerUndeletedAnswerCountAtPostTime':'Answers'}, inplace=True)
# Get length of title of post
df['TitleLength'] = df.Title.apply(len)
# Get length of body of post
df['BodyLength'] = df.BodyMarkdown.apply(len)
# Number of tags for post
df['NumTags'] = df.loc[:, 'Tag1':'Tag5'].notnull().sum(axis=1)
# Is the title lowercase?
df['TitleLowercase'] = (df.Title.str.lower() == df.Title).astype(int)
# Create features that represent whether Title contains certain words
df['TitleQuestion'] = df.Title.str.contains('question', case=False).astype(int)
df['TitleNeed'] = df.Title.str.contains('need', case=False).astype(int)
df['TitleHelp'] = df.Title.str.contains('help', case=False).astype(int)
return df
# Apply function to the training data
train = make_features('train.csv')
X = train.drop('OpenStatus', axis=1)
y = train.OpenStatus
# Read in test data
test = make_features('test.csv')
# Split into training and testing sets
#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
'''
Five feature logistic regression model
'''
# Define feature cols
feature_cols_logreg = ['ReputationAtPostCreation', 'Answers', 'TitleLength', 'BodyLength', 'NumTags']
# Perform cross validation to get an idea of the performance of the model
# (negate the score because scikit-learn's 'log_loss' scorer returns negative log loss)
logreg = LogisticRegression()
-cross_val_score(logreg, X[feature_cols_logreg], y, scoring="log_loss", cv=5).mean()
# Predict class probabilities for the actual testing data
logreg.fit(X[feature_cols_logreg], y)
y_prob_logreg = logreg.predict_proba(test[feature_cols_logreg])[:, 1]
'''
Four feature random forest model
'''
# Define feature cols
feature_cols_rf = ['TitleLowercase', 'TitleQuestion', 'TitleNeed', 'TitleHelp']
# Perform cross validation to get an idea of the performance of the model
rf = RandomForestClassifier()
-cross_val_score(rf, X[feature_cols_rf], y, scoring="log_loss", cv=5).mean()
# Predict class probabilities for the actual testing data
rf.fit(X[feature_cols_rf], y)
y_prob_rf = rf.predict_proba(test[feature_cols_rf])[:, 1]
'''
Text logistic regression model on 'Title' using pipeline
'''
# Make pipeline
pipe = make_pipeline(CountVectorizer(stop_words='english'), LogisticRegression())
# Perform cross validation to get an idea of the performance of the model
-cross_val_score(pipe, X['Title'], y, scoring="log_loss", cv=5).mean()
# Predict class probabilities for the actual testing data
pipe.fit(X['Title'], y)
y_prob_pipe = pipe.predict_proba(test['Title'])[:, 1]
'''
Create submission
'''
# Ensemble predictions (weighted average: the text model gets double weight, so divide by the total weight of 4)
y_prob_combined = (y_prob_logreg + y_prob_rf + 2*y_prob_pipe) / 4
# Create a DataFrame that has 'id' as the index, then export to a CSV file
sub = pd.DataFrame({'id':test.index, 'OpenStatus':y_prob_combined}).set_index('id')
sub.to_csv('sub_ensemble.csv')
================================================
FILE: data/SMSSpamCollection.txt
================================================
ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham Ok lar... Joking wif u oni...
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham U dun say so early hor... U c already then say...
ham Nah I don't think he goes to usf, he lives around here though
spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham Even my brother is not like to speak with me. They treat me like aids patent.
ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
ham I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
spam SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info
spam URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
ham I've been searching for the right words to thank you for this breather. I promise i wont take your help for granted and will fulfil my promise. You have been wonderful and a blessing at all times.
ham I HAVE A DATE ON SUNDAY WITH WILL!!
spam XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL
ham Oh k...i'm watching here:)
ham Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet.
ham Fine if thats the way u feel. Thats the way its gota b
spam England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/ú1.20 POBOXox36504W45WQ 16+
ham Is that seriously how you spell his name?
ham I‘m going to try for 2 months ha ha only joking
ham So ü pay first lar... Then when is da stock comin...
ham Aft i finish my lunch then i go str down lor. Ard 3 smth lor. U finish ur lunch already?
ham Ffffffffff. Alright no way I can meet up with you sooner?
ham Just forced myself to eat a slice. I'm really not hungry tho. This sucks. Mark is getting worried. He knows I'm sick when I turn down pizza. Lol
ham Lol your always so convincing.
ham Did you catch the bus ? Are you frying an egg ? Did you make a tea? Are you eating your mom's left over dinner ? Do you feel my Love ?
ham I'm back & we're packing the car now, I'll let you know if there's room
ham Ahhh. Work. I vaguely remember that! What does it feel like? Lol
ham Wait that's still not all that clear, were you not sure about me being sarcastic or that that's why x doesn't want to live with us
ham Yeah he got in at 2 and was v apologetic. n had fallen out and she was actin like spoilt child and he got caught up in that. Till 2! But we won't go there! Not doing too badly cheers. You?
ham K tell me anything about you.
ham For fear of fainting with the of all that housework you just did? Quick have a cuppa
spam Thanks for your subscription to Ringtone UK your mobile will be charged £5/month Please confirm by replying YES or NO. If you reply NO you will not be charged
ham Yup... Ok i go home look at the timings then i msg ü again... Xuhui going to learn on 2nd may too but her lesson is at 8am
ham Oops, I'll let you know when my roommate's done
ham I see the letter B on my car
ham Anything lor... U decide...
ham Hello! How's you and how did saturday go? I was just texting to see if you'd decided to do anything tomo. Not that i'm trying to invite myself or anything!
ham Pls go ahead with watts. I just wanted to be sure. Do have a great weekend. Abiola
ham Did I forget to tell you ? I want you , I need you, I crave you ... But most of all ... I love you my sweet Arabian steed ... Mmmmmm ... Yummy
spam 07732584351 - Rodger Burns - MSG = We tried to call you re your reply to our sms for a free nokia mobile + free camcorder. Please call now 08000930705 for delivery tomorrow
ham WHO ARE YOU SEEING?
ham Great! I hope you like your man well endowed. I am <#> inches...
ham No calls..messages..missed calls
ham Didn't you get hep b immunisation in nigeria.
ham Fair enough, anything going on?
ham Yeah hopefully, if tyler can't do it I could maybe ask around a bit
ham U don't know how stubborn I am. I didn't even want to go to the hospital. I kept telling Mark I'm not a weak sucker. Hospitals are for weak suckers.
ham What you thinked about me. First time you saw me in class.
ham A gram usually runs like <#> , a half eighth is smarter though and gets you almost a whole second gram for <#>
ham K fyi x has a ride early tomorrow morning but he's crashing at our place tonight
ham Wow. I never realized that you were so embarassed by your accomodations. I thought you liked it, since i was doing the best i could and you always seemed so happy about "the cave". I'm sorry I didn't and don't have more to give. I'm sorry i offered. I'm sorry your room was so embarassing.
spam SMS. ac Sptv: The New Jersey Devils and the Detroit Red Wings play Ice Hockey. Correct or Incorrect? End? Reply END SPTV
ham Do you know what Mallika Sherawat did yesterday? Find out now @ <URL>
spam Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out!
ham Sorry, I'll call later in meeting.
ham Tell where you reached
ham Yes..gauti and sehwag out of odi series.
ham Your gonna have to pick up a $1 burger for yourself on your way home. I can't even move. Pain is killing me.
ham Ha ha ha good joke. Girls are situation seekers.
ham Its a part of checking IQ
ham Sorry my roommates took forever, it ok if I come by now?
ham Ok lar i double check wif da hair dresser already he said wun cut v short. He said will cut until i look nice.
spam As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a £1500 Bonus Prize, call 09066364589
ham Today is "song dedicated day.." Which song will u dedicate for me? Send this to all ur valuable frnds but first rply me...
spam Urgent UR awarded a complimentary trip to EuroDisinc Trav, Aco&Entry41 Or £1000. To claim txt DIS to 87121 18+6*£1.50(moreFrmMob. ShrAcomOrSglSuplt)10, LS1 3AJ
spam Did you hear about the new "Divorce Barbie"? It comes with all of Ken's stuff!
ham I plane to give on this month end.
ham Wah lucky man... Then can save money... Hee...
ham Finished class where are you.
ham HI BABE IM AT HOME NOW WANNA DO SOMETHING? XX
ham K..k:)where are you?how did you performed?
ham U can call me now...
ham I am waiting machan. Call me once you free.
ham Thats cool. i am a gentleman and will treat you with dignity and respect.
ham I like you peoples very much:) but am very shy pa.
ham Does not operate after <#> or what
ham Its not the same here. Still looking for a job. How much do Ta's earn there.
ham Sorry, I'll call later
ham K. Did you call me just now ah?
ham Ok i am on the way to home hi hi
ham You will be in the place of that man
ham Yup next stop.
ham I call you later, don't have network. If urgnt, sms me.
ham For real when u getting on yo? I only need 2 more tickets and one more jacket and I'm done. I already used all my multis.
ham Yes I started to send requests to make it but pain came back so I'm back in bed. Double coins at the factory too. I gotta cash in all my nitros.
ham I'm really not up to it still tonight babe
ham Ela kano.,il download, come wen ur free..
ham Yeah do! Don‘t stand to close tho- you‘ll catch something!
ham Sorry to be a pain. Is it ok if we meet another night? I spent late afternoon in casualty and that means i haven't done any of y stuff42moro and that includes all my time sheets and that. Sorry.
ham Smile in Pleasure Smile in Pain Smile when trouble pours like Rain Smile when sum1 Hurts U Smile becoz SOMEONE still Loves to see u Smiling!!
spam Please call our customer service representative on 0800 169 6031 between 10am-9pm as you have WON a guaranteed £1000 cash or £5000 prize!
ham Havent planning to buy later. I check already lido only got 530 show in e afternoon. U finish work already?
spam Your free ringtone is waiting to be collected. Simply text the password "MIX" to 85069 to verify. Get Usher and Britney. FML, PO Box 5249, MK17 92H. 450Ppw 16
ham Watching telugu movie..wat abt u?
ham i see. When we finish we have loads of loans to pay
ham Hi. Wk been ok - on hols now! Yes on for a bit of a run. Forgot that i have hairdressers appointment at four so need to get home n shower beforehand. Does that cause prob for u?"
ham I see a cup of coffee animation
ham Please don't text me anymore. I have nothing else to say.
ham Okay name ur price as long as its legal! Wen can I pick them up? Y u ave x ams xx
ham I'm still looking for a car to buy. And have not gone 4the driving test yet.
ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
ham wow. You're right! I didn't mean to do that. I guess once i gave up on boston men and changed my search location to nyc, something changed. Cuz on my signin page it still says boston.
ham Umma my life and vava umma love you lot dear
ham Thanks a lot for your wishes on my birthday. Thanks you for making my birthday truly memorable.
ham Aight, I'll hit you up when I get some cash
ham How would my ip address test that considering my computer isn't a minecraft server
ham I know! Grumpy old people. My mom was like you better not be lying. Then again I am always the one to play jokes...
ham Dont worry. I guess he's busy.
ham What is the plural of the noun research?
ham Going for dinner.msg you after.
ham I'm ok wif it cos i like 2 try new things. But i scared u dun like mah. Cos u said not too loud.
spam GENT! We are trying to contact you. Last weekends draw shows that you won a £1000 prize GUARANTEED. Call 09064012160. Claim Code K52. Valid 12hrs only. 150ppm
ham Wa, ur openin sentence very formal... Anyway, i'm fine too, juz tt i'm eatin too much n puttin on weight...Haha... So anythin special happened?
ham As I entered my cabin my PA said, '' Happy B'day Boss !!''. I felt special. She askd me 4 lunch. After lunch she invited me to her apartment. We went there.
spam You are a winner U have been specially selected 2 receive £1000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+)
ham Goodo! Yes we must speak friday - egg-potato ratio for tortilla needed!
ham Hmm...my uncle just informed me that he's paying the school directly. So pls buy food.
spam PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires
spam URGENT! Your Mobile No. was awarded £2000 Bonus Caller Prize on 5/9/03 This is our final try to contact U! Call from Landline 09064019788 BOX42WR29C, 150PPM
ham here is my new address -apples&pairs&all that malarky
spam Todays Voda numbers ending 7548 are selected to receive a $350 award. If you have a match please call 08712300220 quoting claim code 4041 standard rates app
ham I am going to sao mu today. Will be done only at 12
ham Ü predict wat time ü'll finish buying?
ham Good stuff, will do.
ham Just so that you know,yetunde hasn't sent money yet. I just sent her a text not to bother sending. So its over, you dont have to involve yourself in anything. I shouldn't have imposed anything on you in the first place so for that, i apologise.
ham Are you there in room.
ham HEY GIRL. HOW R U? HOPE U R WELL ME AN DEL R BAK! AGAIN LONG TIME NO C! GIVE ME A CALL SUM TIME FROM LUCYxx
ham K..k:)how much does it cost?
ham I'm home.
ham Dear, will call Tmorrow.pls accomodate.
ham First answer my question.
spam Sunshine Quiz Wkly Q! Win a top Sony DVD player if u know which country the Algarve is in? Txt ansr to 82277. £1.50 SP:Tyrone
spam Want 2 get laid tonight? Want real Dogging locations sent direct 2 ur mob? Join the UK's largest Dogging Network bt Txting GRAVEL to 69888! Nt. ec2a. 31p.msg@150p
ham I only haf msn. It's yijue@hotmail.com
ham He is there. You call and meet him
ham No no. I will check all rooms befor activities
spam You'll not rcv any more msgs from the chat svc. For FREE Hardcore services text GO to: 69988 If u get nothing u must Age Verify with yr network & try again
ham Got c... I lazy to type... I forgot ü in lect... I saw a pouch but like not v nice...
ham K, text me when you're on the way
ham Sir, Waiting for your mail.
ham A swt thought: "Nver get tired of doing little things 4 lovable persons.." Coz..somtimes those little things occupy d biggest part in their Hearts.. Gud ni8
ham I know you are. Can you pls open the back?
ham Yes see ya not on the dot
ham Whats the staff name who is taking class for us?
spam FreeMsg Why haven't you replied to my text? I'm Randy, sexy, female and live local. Luv to hear from u. Netcollex Ltd 08700621170150p per msg reply Stop to end
ham Ummma.will call after check in.our life will begin from qatar so pls pray very hard.
ham K..i deleted my contact that why?
ham Sindu got job in birla soft ..
ham The wine is flowing and i'm i have nevering..
ham Yup i thk cine is better cos no need 2 go down 2 plaza mah.
ham Ok... Ur typical reply...
ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
ham You are everywhere dirt, on the floor, the windows, even on my shirt. And sometimes when i open my mouth, you are all that comes flowing out. I dream of my world without you, then half my chores are out too. A time of joy for me, lots of tv shows i.ll see. But i guess like all things you just must exist, like rain, hail and mist, and when my time here is done, you and i become one.
ham Aaooooright are you at work?
ham I'm leaving my house now...
ham Hello, my love. What are you doing? Did you get to that interview today? Are you you happy? Are you being a good boy? Do you think of me?Are you missing me ?
spam Customer service annoncement. You have a New Years delivery waiting for you. Please call 07046744435 now to arrange delivery
spam You are a winner U have been specially selected 2 receive £1000 cash or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810810
ham Keep yourself safe for me because I need you and I miss you already and I envy everyone that see's you in real life
ham New car and house for my parents.:)i have only new job in hand:)
ham I'm so in love with you. I'm excited each day i spend with you. You make me so happy.
spam -PLS STOP bootydelious (32/F) is inviting you to be her friend. Reply YES-434 or NO-434 See her: www.SMS.ac/u/bootydelious STOP? Send STOP FRND to 62468
spam BangBabes Ur order is on the way. U SHOULD receive a Service Msg 2 download UR content. If U do not, GoTo wap. bangb. tv on UR mobile internet/service menu
ham I place all ur points on e cultures module already.
spam URGENT! We are trying to contact you. Last weekends draw shows that you have won a £900 prize GUARANTEED. Call 09061701939. Claim code S89. Valid 12hrs only
ham Hi frnd, which is best way to avoid missunderstding wit our beloved one's?
ham Great escape. I fancy the bridge but needs her lager. See you tomo
ham Yes :)it completely in out of form:)clark also utter waste.
ham Sir, I need AXIS BANK account no and bank address.
ham Hmmm.. Thk sure got time to hop ard... Ya, can go 4 free abt... Muz call u to discuss liao...
ham What time you coming down later?
ham Bloody hell, cant believe you forgot my surname Mr . Ill give u a clue, its spanish and begins with m...
ham Well, i'm gonna finish my bath now. Have a good...fine night.
ham Let me know when you've got the money so carlos can make the call
ham U still going to the mall?
ham Turns out my friends are staying for the whole show and won't be back til ~ <#> , so feel free to go ahead and smoke that $ <#> worth
ham Text her. If she doesnt reply let me know so i can have her log in
ham Hi! You just spoke to MANEESHA V. We'd like to know if you were satisfied with the experience. Reply Toll Free with Yes or No.
ham You lifted my hopes with the offer of money. I am in need. Especially when the end of the month approaches and it hurts my studying. Anyways have a gr8 weekend
ham Lol no. U can trust me.
ham ok. I am a gentleman and will treat you with dignity and respect.
ham He will, you guys close?
ham Going on nothing great.bye
ham Hello handsome ! Are you finding that job ? Not being lazy ? Working towards getting back that net for mummy ? Where's my boytoy now ? Does he miss me ?
ham Haha awesome, be there in a minute
spam Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!
ham Have you got Xmas radio times. If not i will get it now
ham I jus reached home. I go bathe first. But my sis using net tell u when she finishes k...
spam Are you unique enough? Find out from 30th August. www.areyouunique.co.uk
ham I'm sorry. I've joined the league of people that dont keep in touch. You mean a great deal to me. You have been a friend at all times even at great personal cost. Do have a great week.|
ham Hi :)finally i completed the course:)
ham It will stop on itself. I however suggest she stays with someone that will be able to give ors for every stool.
ham How are you doing? Hope you've settled in for the new school year. Just wishin you a gr8 day
ham Gud mrng dear hav a nice day
ham Did u got that persons story
ham is your hamster dead? Hey so tmr i meet you at 1pm orchard mrt?
ham Hi its Kate how is your evening? I hope i can see you tomorrow for a bit but i have to bloody babyjontet! Txt back if u can. :) xxx
ham Found it, ENC <#> , where you at?
ham I sent you <#> bucks
ham Hello darlin ive finished college now so txt me when u finish if u can love Kate xxx
ham Your account has been refilled successfully by INR <DECIMAL> . Your KeralaCircle prepaid account balance is Rs <DECIMAL> . Your Transaction ID is KR <#> .
ham Goodmorning sleeping ga.
ham U call me alter at 11 ok.
ham Ü say until like dat i dun buy ericsson oso cannot oredi lar...
ham As I entered my cabin my PA said, '' Happy B'day Boss !!''. I felt special. She askd me 4 lunch. After lunch she invited me to her apartment. We went there.
ham Aight yo, dats straight dogg
ham You please give us connection today itself before <DECIMAL> or refund the bill
ham Both :) i shoot big loads so get ready!
ham What's up bruv, hope you had a great break. Do have a rewarding semester.
ham Home so we can always chat
ham K:)k:)good:)study well.
ham Yup... How ü noe leh...
ham Sounds great! Are you home now?
ham Finally the match heading towards draw as your prediction.
ham Tired. I haven't slept well the past few nights.
ham Easy ah?sen got selected means its good..
ham I have to take exam with march 3
ham Yeah you should. I think you can use your gt atm now to register. Not sure but if there's anyway i can help let me know. But when you do be sure you are ready.
ham Ok no prob. Take ur time.
ham There is os called ubandu which will run without installing in hard disk...you can use that os to copy the important files in system and give it to repair shop..
ham Sorry, I'll call later
ham U say leh... Of course nothing happen lar. Not say v romantic jus a bit only lor. I thk e nite scenery not so nice leh.
spam 500 New Mobiles from 2004, MUST GO! Txt: NOKIA to No: 89545 & collect yours today!From ONLY £1 www.4-tc.biz 2optout 087187262701.50gbp/mtmsg18
ham Would really appreciate if you call me. Just need someone to talk to.
spam Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES
ham Hey company elama po mudyadhu.
ham Life is more strict than teacher... Bcoz Teacher teaches lesson & then conducts exam, But Life first conducts Exam & then teaches Lessons. Happy morning. . .
ham Dear good morning now only i am up
ham Get down in gandhipuram and walk to cross cut road. Right side <#> street road and turn at first right.
ham Dear we are going to our rubber place
ham Sorry battery died, yeah I'm here
ham Yes:)here tv is always available in work place..
spam Text & meet someone sexy today. U can find a date or even flirt its up to U. Join 4 just 10p. REPLY with NAME & AGE eg Sam 25. 18 -msg recd@thirtyeight pence
ham I have printed it oh. So <#> come upstairs
ham Or ill be a little closer like at the bus stop on the same street
ham Where are you?when wil you reach here?
ham New Theory: Argument wins d SITUATION, but loses the PERSON. So dont argue with ur friends just.. . . . kick them & say, I'm always correct.!
spam U 447801259231 have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094597
ham Tomarrow final hearing on my laptop case so i cant.
ham PLEASSSSSSSEEEEEE TEL ME V AVENT DONE SPORTSx
ham Okay. No no, just shining on. That was meant to be signing, but that sounds better.
ham Although i told u dat i'm into baig face watches now but i really like e watch u gave cos it's fr u. Thanx 4 everything dat u've done today, i'm touched...
ham U don't remember that old commercial?
ham Too late. I said i have the website. I didn't i have or dont have the slippers
ham I asked you to call him now ok
ham Kallis wont bat in 2nd innings.
ham It didnt work again oh. Ok goodnight then. I.ll fix and have it ready by the time you wake up. You are very dearly missed have a good night sleep.
spam Congratulations ur awarded 500 of CD vouchers or 125gift guaranteed & Free entry 2 100 wkly draw txt MUSIC to 87066 TnCs www.Ldew.com1win150ppmx3age16
ham Ranjith cal drpd Deeraj and deepak 5min hold
ham Wen ur lovable bcums angry wid u, dnt take it seriously.. Coz being angry is d most childish n true way of showing deep affection, care n luv!.. kettoda manda... Have nice day da.
ham What you doing?how are you?
ham Ups which is 3days also, and the shipping company that takes 2wks. The other way is usps which takes a week but when it gets to lag you may have to bribe nipost to get your stuff.
ham I'm back, lemme know when you're ready
ham Don't necessarily expect it to be done before you get back though because I'm just now headin out
ham Mmm so yummy babe ... Nice jolt to the suzy
ham Where are you lover ? I need you ...
spam We tried to contact you re your reply to our offer of a Video Handset? 750 anytime networks mins? UNLIMITED TEXT? Camcorder? Reply or call 08000930705 NOW
ham I‘m parked next to a MINI!!!! When are you coming in today do you think?
ham Yup
ham Anyway i'm going shopping on my own now. Cos my sis not done yet. Dun disturb u liao.
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
spam Hey I am really horny want to chat or see me naked text hot to 69698 text charged at 150pm to unsubscribe text stop 69698
ham Why you Dint come with us.
ham Same. Wana plan a trip sometme then
ham Not sure yet, still trying to get a hold of him
spam Ur ringtone service has changed! 25 Free credits! Go to club4mobiles.com to choose content now! Stop? txt CLUB STOP to 87070. 150p/wk Club4 PO Box1146 MK45 2WT
ham The evo. I just had to download flash. Jealous?
spam Ringtone Club: Get the UK singles chart on your mobile each week and choose any top quality ringtone! This message is free of charge.
ham Come to mu, we're sorting out our narcotics situation
ham Night has ended for another day, morning has come in a special way. May you smile like the sunny rays and leaves your worries at the blue blue bay.
spam HMV BONUS SPECIAL 500 pounds of genuine HMV vouchers to be won. Just answer 4 easy questions. Play Now! Send HMV to 86688 More info:www.100percent-real.com
ham Usf I guess, might as well take 1 car
ham No objection. My bf not coming.
ham Thanx...
ham Tell rob to mack his gf in the theater
ham Awesome, I'll see you in a bit
ham Just sent it. So what type of food do you like?
ham All done? All handed in? Celebrations in full swing yet?
ham You got called a tool?
ham "Wen u miss someone, the person is definitely special for u..... But if the person is so special, why to miss them, just Keep-in-touch" gdeve..
ham Ok. I asked for money how far
ham Okie...
ham Yeah I think my usual guy's still passed out from last night, if you get ahold of anybody let me know and I'll throw down
ham K, I might come by tonight then if my class lets out early
ham Ok..
ham hi baby im cruisin with my girl friend what r u up 2? give me a call in and hour at home if thats alright or fone me on this fone now love jenny xxx
ham My life Means a lot to me, Not because I love my life, But because I love the people in my life, The world calls them friends, I call them my World:-).. Ge:-)..
ham Dear,shall mail tonite.busy in the street,shall update you tonite.things are looking ok.varunnathu edukkukayee raksha ollu.but a good one in real sense.
ham Hey you told your name to gautham ah?
ham Haf u found him? I feel so stupid da v cam was working.
ham Oops. 4 got that bit.
ham Are you this much buzy
ham I accidentally deleted the message. Resend please.
spam T-Mobile customer you may now claim your FREE CAMERA PHONE upgrade & a pay & go sim card for your loyalty. Call on 0845 021 3680.Offer ends 28thFeb.T&C's apply
ham Unless it's a situation where YOU GO GURL would be more appropriate
ham Hurt me... Tease me... Make me cry... But in the end of my life when i die plz keep one rose on my grave and say STUPID I MISS U.. HAVE A NICE DAY BSLVYL
ham I cant pick the phone right now. Pls send a message
ham Need a coffee run tomo?Can't believe it's that time of week already
ham Awesome, I remember the last time we got somebody high for the first time with diesel :V
ham Shit that is really shocking and scary, cant imagine for a second. Def up for night out. Do u think there is somewhere i could crash for night, save on taxi?
ham Oh and by the way you do have more food in your fridge! Want to go out for a meal tonight?
ham He is a womdarfull actor
spam SMS. ac Blind Date 4U!: Rodds1 is 21/m from Aberdeen, United Kingdom. Check Him out http://img. sms. ac/W/icmb3cktz8r7!-4 no Blind Dates send HIDE
ham Yup... From what i remb... I think should be can book...
ham Jos ask if u wana meet up?
ham Lol yes. Our friendship is hanging on a thread cause u won't buy stuff.
spam TheMob> Check out our newest selection of content, Games, Tones, Gossip, babes and sport, Keep your mobile fit and funky text WAP to 82468
ham Where are the garage keys? They aren't on the bookshelf
ham Today is ACCEPT DAY..U Accept me as? Brother Sister Lover Dear1 Best1 Clos1 Lvblefrnd Jstfrnd Cutefrnd Lifpartnr Belovd Swtheart Bstfrnd No rply means enemy
spam Think ur smart ? Win £200 this week in our weekly quiz, text PLAY to 85222 now!T&Cs WinnersClub PO BOX 84, M26 3UZ. 16+. GBP1.50/week
ham He says he'll give me a call when his friend's got the money but that he's definitely buying before the end of the week
ham Hi the way I was with u 2day, is the normal way&this is the real me. UR unique&I hope I know u 4 the rest of mylife. Hope u find wot was lost.
ham You made my day. Do have a great day too.
ham K.k:)advance happy pongal.
ham Hmmm... Guess we can go 4 kb n power yoga... Haha, dunno we can tahan power yoga anot... Thk got lo oso, forgot liao...
ham Not really dude, have no friends i'm afraid :(
spam December only! Had your mobile 11mths+? You are entitled to update to the latest colour camera mobile for Free! Call The Mobile Update Co FREE on 08002986906
ham Coffee cake, i guess...
ham Merry Christmas to you too babe, i love ya *kisses*
ham Hey... Why dont we just go watch x men and have lunch... Haha
ham cud u tell ppl im gona b a bit l8 cos 2 buses hav gon past cos they were full & im still waitin 4 1. Pete x
ham That would be great. We'll be at the Guild. Could meet on Bristol road or somewhere - will get in touch over weekend. Our plans take flight! Have a good week
ham No problem. How are you doing?
ham No calls..messages..missed calls
ham Hi da:)how is the todays class?
ham I'd say that's a good sign but, well, you know my track record at reading women
ham Cool, text me when you're parked
ham I'm reading the text i just sent you. Its meant to be a joke. So read it in that light
ham K.k:)apo k.good movie.
ham Maybe i could get book out tomo then return it immediately ..? Or something.
spam Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 85. No prepayment. Direct access!
ham Any chance you might have had with me evaporated as soon as you violated my privacy by stealing my phone number from your employer's paperwork. Not cool at all. Please do not contact me again or I will report you to your supervisor.
spam Valentines Day Special! Win ov
SYMBOL INDEX (14 symbols across 8 files)

FILE: code/00_python_beginner_workshop.py
    function give_me_five (line 55)      | def give_me_five():    # function definition ends with colon
    function calc (line 61)              | def calc(x, y, op):    # three parameters (without any defaults)
    function compute_pay (line 78)       | def compute_pay(hours, rate):
    function compute_more_pay (line 87)  | def compute_more_pay(hours, rate):
    function both_ends (line 127)        | def both_ends(s):
    function fizz_buzz (line 162)        | def fizz_buzz():
    function front_x (line 184)          | def front_x(words):

FILE: code/04_apis.py
    function get_sentiment (line 46)     | def get_sentiment(text):

FILE: code/05_iris_exercise.py
    function classify_iris (line 74)     | def classify_iris(row):

FILE: code/08_web_scraping.py
    function getIMDBInfo (line 214)      | def getIMDBInfo(url):

FILE: code/15_kaggle.py
    function make_features (line 127)    | def make_features(filename):

FILE: code/17_ensembling_exercise.py
    function make_features (line 4)      | def make_features(filename):

FILE: code/19_gridsearchcv_exercise.py
    function make_features (line 8)      | def make_features(filename):

FILE: code/21_ensembles_example.py
    function make_features (line 19)     | def make_features(filename):
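The symbol index above is produced by the extraction tool. As an illustrative sketch only (not GitExtract's actual implementation), a comparable index of top-level function definitions can be built with Python's standard ast module; the repo path passed in below is a placeholder:

import ast
from pathlib import Path

def index_functions(repo_root):
    """Collect (file, name, line, signature) for every top-level def in .py files."""
    index = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that do not parse (e.g. Python 2-only syntax)
        for node in tree.body:
            if isinstance(node, ast.FunctionDef):
                args = ", ".join(a.arg for a in node.args.args)
                index.append((str(path), node.name, node.lineno,
                              "def %s(%s):" % (node.name, args)))
    return index

# Example usage: print an index similar in shape to the one above
for file_path, name, lineno, signature in index_functions("code"):
    print("FILE: %s  function %s (line %d) | %s" % (file_path, name, lineno, signature))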
Condensed preview — 78 files, each showing path, character count, and a content snippet (full structured content: 2,977K chars).
[
{
"path": ".gitignore",
"chars": 36,
"preview": ".ipynb_checkpoints/\n.DS_Store\n*.pyc\n"
},
{
"path": "README.md",
"chars": 42466,
"preview": "## DAT5 Course Repository\n\nCourse materials for [General Assembly's Data Science course](https://generalassemb.ly/educat"
},
{
"path": "code/00_python_beginner_workshop.py",
"chars": 5359,
"preview": "'''\nMulti-line comments go between 3 quotation marks.\nYou can use single or double quotes.\n'''\n\n# One-line comments are "
},
{
"path": "code/00_python_intermediate_workshop.py",
"chars": 7257,
"preview": "## QUIZ TO REVIEW BEGINNER WORKSHOP\n\na = 5\nb = 5.0\nc = a/2\nd = b/2\n\n'''\nWhat is type(a)?\n int\nWhat is type(b)?\n float\n"
},
{
"path": "code/01_chipotle_homework_solution.py",
"chars": 3480,
"preview": "'''\nSOLUTION FILE: Homework with Chipotle data\nhttps://github.com/TheUpshot/chipotle\n'''\n\n\n'''\nPART 1: read in the data,"
},
{
"path": "code/01_reading_files.py",
"chars": 3693,
"preview": "'''\nLesson on file reading using Airline Safety Data\nhttps://github.com/fivethirtyeight/data/tree/master/airline-safety\n"
},
{
"path": "code/03_exploratory_analysis_pandas.py",
"chars": 11738,
"preview": "\"\"\"\nCLASS: Pandas for Data Exploration, Analysis, and Visualization\n\nAbout the data:\nWHO alcohol consumption data:\n a"
},
{
"path": "code/04_apis.py",
"chars": 2609,
"preview": "'''\nCLASS: APIs\n\nData Science Toolkit text2sentiment API\n'''\n\n'''\nAPIs without wrappers (i.e. there is no nicely formatt"
},
{
"path": "code/04_visualization.py",
"chars": 4621,
"preview": "\"\"\"\nCLASS: Visualization\n\"\"\"\n\n# imports\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\n# import the data availabl"
},
{
"path": "code/05_iris_exercise.py",
"chars": 2777,
"preview": "'''\nEXERCISE: \"Human Learning\" with iris data\n\nCan you predict the species of an iris using petal and sepal measurements"
},
{
"path": "code/05_sklearn_knn.py",
"chars": 1906,
"preview": "'''\nCLASS: Introduction to scikit-learn with iris data\n'''\n\n# read in iris data\nimport pandas as pd\ncol_names = ['sepal_"
},
{
"path": "code/07_glass_id_homework_solution.py",
"chars": 2238,
"preview": "'''\nHOMEWORK: Glass Identification (aka \"Glassification\")\n'''\n\n# TASK 1: read data into a DataFrame\nimport pandas as pd\n"
},
{
"path": "code/08_web_scraping.py",
"chars": 10565,
"preview": "'''\nCLASS: Web Scraping\n\nWe will be using two packages in particular: requests and Beautiful Soup 4.\n'''\n\n'''\nIntroduct"
},
{
"path": "code/10_logistic_regression_confusion_matrix.py",
"chars": 8710,
"preview": "'''\nCLASS: Logistic Regression and Confusion Matrix\n'''\n\n###############################################################"
},
{
"path": "code/13_naive_bayes.py",
"chars": 5779,
"preview": "'''\nCLASS: Naive Bayes SMS spam classifier\nDATA SOURCE: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection\n'''\n"
},
{
"path": "code/15_kaggle.py",
"chars": 10388,
"preview": "'''\nCLASS: Kaggle Stack Overflow competition\n'''\n\n# read in the file and set the first column as the index\nimport pandas"
},
{
"path": "code/17_ensembling_exercise.py",
"chars": 1718,
"preview": "# Helper code for class 17 exercise\n\n# define the function\ndef make_features(filename):\n df = pd.read_csv(filename, i"
},
{
"path": "code/18_clustering.py",
"chars": 3161,
"preview": "'''\nTHE DATA\n\nWe have data about cars: things like MPG, acceleration, weight, etc. However,\nwe don't have logical grou"
},
{
"path": "code/18_regularization.py",
"chars": 5506,
"preview": "###############################################################################\n##### Regularization with Linear Regress"
},
{
"path": "code/19_advanced_sklearn.py",
"chars": 5526,
"preview": "## TASK: Searching for optimal parameters\n## FUNCTION: GridSearchCV\n## DOCUMENTATION: http://scikit-learn.org/stable/mod"
},
{
"path": "code/19_gridsearchcv_exercise.py",
"chars": 2186,
"preview": "'''\nEXERCISE: GridSearchCV with Stack Overflow competition data\n'''\n\nimport pandas as pd\n\n# define a function to create "
},
{
"path": "code/19_regex_exercise.py",
"chars": 764,
"preview": "'''\nRegular Expressions Exercise\n'''\n\n# open file and store each line as one row\nwith open('../data/homicides.txt', 'rU'"
},
{
"path": "code/19_regex_reference.py",
"chars": 5798,
"preview": "'''\nRegular Expressions (regex) Reference Guide\n\nSources:\n https://developers.google.com/edu/python/regular-expressio"
},
{
"path": "code/20_sql.py",
"chars": 15135,
"preview": "###############################################################################\n##### Class 20: SQL\n###################"
},
{
"path": "code/21_ensembles_example.py",
"chars": 3527,
"preview": "'''\nImports\n'''\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.cross_validation im"
},
{
"path": "data/SMSSpamCollection.txt",
"chars": 477203,
"preview": "ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\nham\t"
},
{
"path": "data/airline_safety.csv",
"chars": 2266,
"preview": "airline,avail_seat_km_per_week,incidents_85_99,fatal_accidents_85_99,fatalities_85_99,incidents_00_14,fatal_accidents_00"
},
{
"path": "data/auto_mpg.txt",
"chars": 17483,
"preview": "mpg|cylinders|displacement|horsepower|weight|acceleration|model_year|origin|car_name\r18|8|307|130|3504|12|70|1|chevrolet"
},
{
"path": "data/chipotle_orders.tsv",
"chars": 364975,
"preview": "order_id\tquantity\titem_name\tchoice_description\titem_price\n1\t1\tChips and Fresh Tomato Salsa\tNULL\t$2.39 \n1\t1\tIzze\t[Clement"
},
{
"path": "data/default.csv",
"chars": 285843,
"preview": "default,student,balance,income\n0,No,729.5264952,44361.62507\n0,Yes,817.1804066,12106.1347\n0,No,1073.549164,31767.13895\n0,"
},
{
"path": "data/drinks.csv",
"chars": 4937,
"preview": "country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent\rAfghanistan,0,0,0,0,AS\rAlbani"
},
{
"path": "data/homicides.txt",
"chars": 553694,
"preview": "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />B"
},
{
"path": "data/imdb_movie_ratings_top_1000.csv",
"chars": 91494,
"preview": "star_rating,title,content_rating,genre,duration,actors_list\r9.3,The Shawshank Redemption,R,Crime,142,\"[u'Tim Robbins', u"
},
{
"path": "data/imdb_movie_urls.csv",
"chars": 332,
"preview": "http://www.imdb.com/title/tt1856010/\rhttp://www.imdb.com/title/tt0816692/\rhttp://www.imdb.com/title/tt1826940/\rhttp://ww"
},
{
"path": "data/kaggle_tweets.csv",
"chars": 111725,
"preview": "\"37 spots up. Is anybody else even trying? #kaggle - https://t.co/9GLyaK3OhT\"\n\"RT @antgoldbloom: Proud to share that ove"
},
{
"path": "data/titanic_train.csv",
"chars": 60302,
"preview": "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n1,0,3,\"Braund, Mr. Owen Harris\",male,22,"
},
{
"path": "data/vehicles_test.csv",
"chars": 98,
"preview": "price,year,miles,doors,type\n3000,2003,130000,4,truck\n6000,2005,82500,4,car\n12000,2010,60000,2,car\n"
},
{
"path": "data/vehicles_train.csv",
"chars": 353,
"preview": "price,year,miles,doors,type\n22000,2012,13000,2,car\n14000,2010,30000,2,car\n13000,2010,73500,4,car\n9500,2009,78000,4,car\n9"
},
{
"path": "homework/02_command_line_hw_soln.md",
"chars": 4241,
"preview": "## Command Line Homework Solution\n#### The following solution assumes you are working from the class \"DAT5\" directory. \n"
},
{
"path": "homework/03_pandas_hw_soln.py",
"chars": 7669,
"preview": "'''\nExploratory Data Analysis Homework Solution\n'''\n\n'''\nUse the automotive mpg data (https://raw.githubusercontent.com/"
},
{
"path": "homework/04_visualization_hw_soln.py",
"chars": 7442,
"preview": "'''\nVisualization Homework Solution\n'''\n\n'''\nUse the automotive mpg data (https://raw.githubusercontent.com/justmarkham/"
},
{
"path": "homework/06_bias_variance.md",
"chars": 1550,
"preview": "## Class 6 Pre-work: Bias-Variance Tradeoff\n\nRead this excellent article, [Understanding the Bias-Variance Tradeoff](htt"
},
{
"path": "homework/07_glass_identification.md",
"chars": 1114,
"preview": "## Class 7 Homework: Glass Identification\n\nLet's practice what we have learned using the [Glass Identification dataset]("
},
{
"path": "homework/11_roc_auc.md",
"chars": 2488,
"preview": "## Class 11 Pre-work: ROC Curves and Area Under the Curve (AUC)\n\nBefore learning about ROC curves, it's important to be "
},
{
"path": "homework/11_roc_auc_annotated.md",
"chars": 3496,
"preview": "## Class 11 Pre-work: ROC Curves and Area Under the Curve (AUC)\n\nBefore learning about ROC curves, it's important to be "
},
{
"path": "homework/13_spam_filtering.md",
"chars": 630,
"preview": "## Class 13 Pre-work: Spam Filtering\n\nRead Paul Graham's [A Plan for Spam](http://www.paulgraham.com/spam.html) and be p"
},
{
"path": "homework/13_spam_filtering_annotated.md",
"chars": 1345,
"preview": "## Class 13 Pre-work: Spam Filtering\n\nRead Paul Graham's [A Plan for Spam](http://www.paulgraham.com/spam.html) and be p"
},
{
"path": "notebooks/06_bias_variance.ipynb",
"chars": 134381,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:14d04997739722e6bbeb8a37f8f5fc7f30d18a0f0dc4619c08b1982912c9cedf\"\n"
},
{
"path": "notebooks/06_model_evaluation_procedures.ipynb",
"chars": 19080,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:ccea3b12dbd9d0bc247abfb693b91de332ed10a26258dd0dce7138bca03b0fc4\"\n"
},
{
"path": "notebooks/09_linear_regression.ipynb",
"chars": 178815,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:029ed166d1b9e896a08ee8d00c3eaa9354ec0607f6b83b74a228b5562d01b430\"\n"
},
{
"path": "notebooks/11_cross_validation.ipynb",
"chars": 31518,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:4aa33b85c25916966326c2f6ba5b3a06c4df0e8ad5ba66461d3b5f66460f3892\"\n"
},
{
"path": "notebooks/11_roc_auc.ipynb",
"chars": 56906,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:6ab739bad14c85b70b2ad82a62559e2d9c118d2d5943fe690424af62ccf3caf6\"\n"
},
{
"path": "notebooks/11_titanic_exercise.ipynb",
"chars": 8269,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:557342100eb7ce91ca76e7e4f24737943f3625640543427282904a15759174c8\"\n"
},
{
"path": "notebooks/13_bayes_iris.ipynb",
"chars": 14948,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:8b2fbf4113bc8fefcc6add5899d1725dc3a1a926ebd5d31dadd1876ac74bd349\"\n"
},
{
"path": "notebooks/13_naive_bayes_spam.ipynb",
"chars": 2385,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:b7e3a62e1216c53fa3e7d0fa56c5373dfe7f58c3817a1468b0abbc52dfe7b6a7\"\n"
},
{
"path": "notebooks/14_nlp.ipynb",
"chars": 100372,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:217a675665d9d52af77e2656091e39dd1b522040e8fd3c0e64135e07865179bf\"\n"
},
{
"path": "notebooks/16_decision_trees.ipynb",
"chars": 70054,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:87008866d7dcc97facb27ccb4d1630e237017383a6de66e047b039f143121da7\"\n"
},
{
"path": "notebooks/17_ensembling.ipynb",
"chars": 43588,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:803dd679ec261442a6f0cc436aae1f7776f20b5d231f8ec1e4d225e813a68f16\"\n"
},
{
"path": "notebooks/18_regularization.ipynb",
"chars": 8274,
"preview": "{\n \"metadata\": {\n \"name\": \"\",\n \"signature\": \"sha256:283eafa4edacfbbb8b51d404c8feab98319104a044aaa4138d97957373762033\"\n"
},
{
"path": "other/peer_review.md",
"chars": 1034,
"preview": "## Peer Review Guidelines\n\nYou will be assigned to review the project drafts of two of your peers. You will have one wee"
},
{
"path": "other/project.md",
"chars": 6623,
"preview": "# Course Project\n\n\n## Overview\n\nThe final project should represent significant original work applying data science techn"
},
{
"path": "other/public_data.md",
"chars": 4396,
"preview": "## Public Data Sources\n\n* Open data catalogs from various governments and NGOs:\n * [NYC Open Data](https://nycopenda"
},
{
"path": "other/resources.md",
"chars": 7648,
"preview": "# Resources for Continued Learning\n\n\n## Blogs\n\n* [Simply Statistics](http://simplystatistics.org/): Written by the Biost"
},
{
"path": "slides/02_Introduction_to_the_Command_Line.md",
"chars": 12711,
"preview": "# Introduction to the Command Line\nThis document outlines some basic commands for the Unix command line. For Linux and "
}
]
// ... and 14 more files
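For anyone working from the full structured .json export instead of this page, a minimal sketch of consuming it, assuming the export keeps the same three fields shown in each entry above ("path", "chars", "preview"); the filename below is a placeholder:

import json

# Load the structured export (hypothetical filename) and report the largest files.
# Assumes each entry is an object with "path", "chars", and "preview" keys,
# matching the condensed preview shown above.
with open("DAT5_extract.json", encoding="utf-8") as f:
    entries = json.load(f)

for entry in sorted(entries, key=lambda e: e["chars"], reverse=True)[:5]:
    print("%8d  %s" % (entry["chars"], entry["path"]))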
About this extraction
This page contains the full source code of the justmarkham/DAT5 GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 78 files (2.7 MB), approximately 720.3k tokens, and a symbol index with 14 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, built by Nikandr Surkov.