Repository: joelgrus/data-science-from-scratch
Branch: master
Commit: d5d0f117f41b
Files: 108
Total size: 3.1 MB

Directory structure:
gitextract_o6_achnt/

├── .gitignore
├── INSTALL.md
├── LICENSE
├── README.md
├── comma_delimited_stock_prices.csv
├── first-edition/
│   ├── README.md
│   ├── code/
│   │   ├── __init__.py
│   │   ├── charts.py
│   │   ├── clustering.py
│   │   ├── colon_delimited_stock_prices.txt
│   │   ├── comma_delimited_stock_prices.csv
│   │   ├── comma_delimited_stock_prices.txt
│   │   ├── databases.py
│   │   ├── decision_trees.py
│   │   ├── egrep.py
│   │   ├── getting_data.py
│   │   ├── gradient_descent.py
│   │   ├── hypothesis_and_inference.py
│   │   ├── introduction.py
│   │   ├── line_count.py
│   │   ├── linear_algebra.py
│   │   ├── logistic_regression.py
│   │   ├── machine_learning.py
│   │   ├── mapreduce.py
│   │   ├── most_common_words.py
│   │   ├── multiple_regression.py
│   │   ├── naive_bayes.py
│   │   ├── natural_language_processing.py
│   │   ├── nearest_neighbors.py
│   │   ├── network_analysis.py
│   │   ├── neural_networks.py
│   │   ├── plot_state_borders.py
│   │   ├── probability.py
│   │   ├── recommender_systems.py
│   │   ├── simple_linear_regression.py
│   │   ├── states.txt
│   │   ├── statistics.py
│   │   ├── stocks.txt
│   │   ├── tab_delimited_stock_prices.txt
│   │   ├── visualizing_data.py
│   │   └── working_with_data.py
│   └── code-python3/
│       ├── README.md
│       ├── __init__.py
│       ├── charts.py
│       ├── clustering.py
│       ├── colon_delimited_stock_prices.txt
│       ├── comma_delimited_stock_prices.csv
│       ├── comma_delimited_stock_prices.txt
│       ├── databases.py
│       ├── decision_trees.py
│       ├── egrep.py
│       ├── getting_data.py
│       ├── gradient_descent.py
│       ├── hypothesis_and_inference.py
│       ├── introduction.py
│       ├── line_count.py
│       ├── linear_algebra.py
│       ├── logistic_regression.py
│       ├── machine_learning.py
│       ├── mapreduce.py
│       ├── most_common_words.py
│       ├── multiple_regression.py
│       ├── naive_bayes.py
│       ├── natural_language_processing.py
│       ├── nearest_neighbors.py
│       ├── network_analysis.py
│       ├── neural_networks.py
│       ├── plot_state_borders.py
│       ├── probability.py
│       ├── recommender_systems.py
│       ├── simple_linear_regression.py
│       ├── states.txt
│       ├── stats.py
│       ├── stocks.txt
│       ├── tab_delimited_stock_prices.txt
│       ├── visualizing_data.py
│       └── working_with_data.py
├── im/
│   └── README.md
├── links.md
├── requirements.txt
├── scratch/
│   ├── __init__.py
│   ├── clustering.py
│   ├── crash_course_in_python.py
│   ├── databases.py
│   ├── decision_trees.py
│   ├── deep_learning.py
│   ├── getting_data.py
│   ├── gradient_descent.py
│   ├── inference.py
│   ├── introduction.py
│   ├── k_nearest_neighbors.py
│   ├── linear_algebra.py
│   ├── logistic_regression.py
│   ├── machine_learning.py
│   ├── mapreduce.py
│   ├── multiple_regression.py
│   ├── naive_bayes.py
│   ├── network_analysis.py
│   ├── neural_networks.py
│   ├── nlp.py
│   ├── nlp_advanced.py
│   ├── probability.py
│   ├── recommender_systems.py
│   ├── simple_linear_regression.py
│   ├── statistics.py
│   ├── visualization.py
│   └── working_with_data.py
└── stocks.csv

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
__pycache__
*.png



================================================
FILE: INSTALL.md
================================================
# How to Install Python

If you don't already have Python, I strongly recommend you install the Anaconda version,
which includes many of the libraries needed for data science. Get the Python 3 version, not the Python 2 version.

https://www.anaconda.com/distribution/#download-section

Follow the instructions indicated for your platform.


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2019 Joel Grus

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
Data Science from Scratch
=========================

Here's all the code and examples from the second edition of my book _Data Science from Scratch_. They require at least Python 3.6.

(If you're looking for the code and examples from the first edition, that's in the `first-edition` folder.)

If you want to use the code, you should be able to clone the repo and just do things like

```
In [1]: from scratch.linear_algebra import dot

In [2]: dot([1, 2, 3], [4, 5, 6])
Out[2]: 32
```

and so on and so forth.

Two notes:

1. In order to use the library like this, you need to be in the root directory (that is, the directory that contains the `scratch` folder). If you are in the `scratch` directory itself, the imports won't work.

2. It's possible that it will just work. It's also possible that you may need to add the root directory to your `PYTHONPATH`. If you are on Linux or OSX, this is as simple as

```
export PYTHONPATH=/path/to/where/you/cloned/this/repo
```

(substituting in the real path, of course).

If you are on Windows, it's [potentially more complicated](https://stackoverflow.com/questions/3701646/how-to-add-to-the-pythonpath-in-windows-so-it-finds-my-modules-packages).
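
As an alternative on any platform, the root directory can also be added to the import path from inside Python. The following is a minimal sketch, not part of the repository: `add_repo_root` is a hypothetical helper name, and the path shown is a placeholder to be replaced with the real location of your clone.

```python
import sys
from pathlib import Path


def add_repo_root(repo_root: str) -> str:
    """Prepend repo_root to sys.path (if it is not already there).

    Hypothetical helper for illustration; returns the resolved path.
    """
    resolved = str(Path(repo_root).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Placeholder path: substitute the directory that contains the `scratch` folder.
add_repo_root("/path/to/where/you/cloned/this/repo")
```

After this call, `from scratch.linear_algebra import dot` should resolve, provided the path points at the directory that contains the `scratch` folder. Unlike `PYTHONPATH`, the change only lasts for the current Python session.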

## Table of Contents

1. Introduction
2. A Crash Course in Python
3. [Visualizing Data](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/visualization.py)
4. [Linear Algebra](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/linear_algebra.py)
5. [Statistics](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/statistics.py)
6. [Probability](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/probability.py)
7. [Hypothesis and Inference](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/inference.py)
8. [Gradient Descent](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/gradient_descent.py)
9. [Getting Data](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/getting_data.py)
10. [Working With Data](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/working_with_data.py)
11. [Machine Learning](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/machine_learning.py)
12. [k-Nearest Neighbors](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/k_nearest_neighbors.py)
13. [Naive Bayes](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/naive_bayes.py)
14. [Simple Linear Regression](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/simple_linear_regression.py)
15. [Multiple Regression](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/multiple_regression.py)
16. [Logistic Regression](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/logistic_regression.py)
17. [Decision Trees](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/decision_trees.py)
18. [Neural Networks](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/neural_networks.py)
19. [Deep Learning](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/deep_learning.py)
20. [Clustering](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/clustering.py)
21. [Natural Language Processing](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/nlp.py)
22. [Network Analysis](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/network_analysis.py)
23. [Recommender Systems](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/recommender_systems.py)
24. [Databases and SQL](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/databases.py)
25. [MapReduce](https://github.com/joelgrus/data-science-from-scratch/blob/master/scratch/mapreduce.py)
26. Data Ethics
27. Go Forth And Do Data Science


================================================
FILE: comma_delimited_stock_prices.csv
================================================
AAPL,6/20/2014,90.91
MSFT,6/20/2014,41.68
FB,6/20/3014,64.5
AAPL,6/19/2014,91.86
MSFT,6/19/2014,n/a
FB,6/19/2014,64.34


================================================
FILE: first-edition/README.md
================================================
Data Science from Scratch
=========================

Here's all the code and examples from the first edition of my book __[Data Science from Scratch](http://joelgrus.com/2015/04/26/data-science-from-scratch-first-principles-with-python/)__. The `code` directory contains Python 2.7 versions, and the `code-python3` directory contains the Python 3 equivalents. (I tested them in 3.5, but they should work in any 3.x.)


Each can be imported as a module, for example (after you cd into the /code directory):

```python
from linear_algebra import distance, vector_mean
v = [1, 2, 3]
w = [4, 5, 6]
print distance(v, w)
print vector_mean([v, w])
```

Or can be run from the command line to get a demo of what it does (and to execute the examples from the book):

```bat
python recommender_systems.py
```
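
The import example above uses the Python 2 `print` statement, so it matches the `code` directory. Under `code-python3`, `print` has to be called as a function. A minimal sketch of the same example in Python 3 syntax follows. To keep it runnable on its own, it defines stand-in versions of `distance` and `vector_mean` (assumed here to be Euclidean distance and the component-wise mean) rather than importing the repository's `linear_algebra` module.

```python
import math


def distance(v, w):
    """Stand-in: Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))


def vector_mean(vectors):
    """Stand-in: component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


v = [1, 2, 3]
w = [4, 5, 6]
print(distance(v, w))        # Python 3: print is a function call
print(vector_mean([v, w]))
```

With the real module, the two stand-in definitions would be replaced by `from linear_algebra import distance, vector_mean`, run from inside the `code-python3` directory.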

Additionally, I've collected all the [links](https://github.com/joelgrus/data-science-from-scratch/blob/master/links.md) from the book.

And, by popular demand, I made an index of functions defined in the book, by chapter and page number.
The data is in a [spreadsheet](https://docs.google.com/spreadsheets/d/1mjGp94ehfxWOEaAFJsPiHqIeOioPH1vN1PdOE6v1az8/edit?usp=sharing), and I also made a toy (experimental) [searchable webapp](http://joelgrus.com/experiments/function-index/).

## Table of Contents

1. Introduction
2. A Crash Course in Python
3. [Visualizing Data](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/visualizing_data.py)
4. [Linear Algebra](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/linear_algebra.py)
5. [Statistics](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/statistics.py)
6. [Probability](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/probability.py)
7. [Hypothesis and Inference](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/hypothesis_and_inference.py)
8. [Gradient Descent](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/gradient_descent.py)
9. [Getting Data](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/getting_data.py)
10. [Working With Data](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/working_with_data.py)
11. [Machine Learning](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/machine_learning.py)
12. [k-Nearest Neighbors](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/nearest_neighbors.py)
13. [Naive Bayes](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/naive_bayes.py)
14. [Simple Linear Regression](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/simple_linear_regression.py)
15. [Multiple Regression](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/multiple_regression.py)
16. [Logistic Regression](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/logistic_regression.py)
17. [Decision Trees](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/decision_trees.py)
18. [Neural Networks](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/neural_networks.py)
19. [Clustering](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/clustering.py)
20. [Natural Language Processing](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/natural_language_processing.py)
21. [Network Analysis](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/network_analysis.py)
22. [Recommender Systems](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/recommender_systems.py)
23. [Databases and SQL](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/databases.py)
24. [MapReduce](https://github.com/joelgrus/data-science-from-scratch/blob/master/code/mapreduce.py)
25. Go Forth And Do Data Science


================================================
FILE: first-edition/code/__init__.py
================================================


================================================
FILE: first-edition/code/charts.py
================================================


================================================
FILE: first-edition/code/clustering.py
================================================
from __future__ import division
from linear_algebra import squared_distance, vector_mean, distance
import math, random
import matplotlib.image as mpimg
import matplotlib.pyplot as plt

class KMeans:
    """performs k-means clustering"""

    def __init__(self, k):
        self.k = k          # number of clusters
        self.means = None   # means of clusters
        
    def classify(self, input):
        """return the index of the cluster closest to the input"""
        return min(range(self.k),
                   key=lambda i: squared_distance(input, self.means[i]))
                   
    def train(self, inputs):
    
        self.means = random.sample(inputs, self.k)
        assignments = None
        
        while True:
            # Find new assignments
            new_assignments = map(self.classify, inputs)

            # If no assignments have changed, we're done.
            if assignments == new_assignments:                
                return

            # Otherwise keep the new assignments,
            assignments = new_assignments

            # and recompute each cluster's mean from the points assigned to it.
            for i in range(self.k):
                i_points = [p for p, a in zip(inputs, assignments) if a == i]
                # avoid divide-by-zero if i_points is empty
                if i_points:                                
                    self.means[i] = vector_mean(i_points)    

def squared_clustering_errors(inputs, k):
    """finds the total squared error from k-means clustering the inputs"""
    clusterer = KMeans(k)
    clusterer.train(inputs)
    means = clusterer.means
    assignments = map(clusterer.classify, inputs)
    
    return sum(squared_distance(input,means[cluster])
               for input, cluster in zip(inputs, assignments))

def plot_squared_clustering_errors(plt):

    ks = range(1, len(inputs) + 1)
    errors = [squared_clustering_errors(inputs, k) for k in ks]

    plt.plot(ks, errors)
    plt.xticks(ks)
    plt.xlabel("k")
    plt.ylabel("total squared error")
    plt.show()

#
# using clustering to recolor an image
#

def recolor_image(input_file, k=5):

    img = mpimg.imread(input_file)  # read the PNG named by the input_file argument
    pixels = [pixel for row in img for pixel in row]
    clusterer = KMeans(k)
    clusterer.train(pixels) # this might take a while    

    def recolor(pixel):
        cluster = clusterer.classify(pixel) # index of the closest cluster
        return clusterer.means[cluster]     # mean of the closest cluster

    new_img = [[recolor(pixel) for pixel in row]
               for row in img]

    plt.imshow(new_img)
    plt.axis('off')
    plt.show()

#
# hierarchical clustering
#

def is_leaf(cluster):
    """a cluster is a leaf if it has length 1"""
    return len(cluster) == 1

def get_children(cluster):
    """returns the two children of this cluster if it's a merged cluster;
    raises an exception if this is a leaf cluster"""
    if is_leaf(cluster):
        raise TypeError("a leaf cluster has no children")
    else:
        return cluster[1]

def get_values(cluster):
    """returns the value in this cluster (if it's a leaf cluster)
    or all the values in the leaf clusters below it (if it's not)"""
    if is_leaf(cluster):
        return cluster # is already a 1-tuple containing value
    else:
        return [value
                for child in get_children(cluster)
                for value in get_values(child)]

def cluster_distance(cluster1, cluster2, distance_agg=min):
    """finds the aggregate distance between elements of cluster1
    and elements of cluster2"""
    return distance_agg([distance(input1, input2)
                        for input1 in get_values(cluster1)
                        for input2 in get_values(cluster2)])

def get_merge_order(cluster):
    if is_leaf(cluster):
        return float('inf')
    else:
        return cluster[0] # merge_order is first element of 2-tuple

def bottom_up_cluster(inputs, distance_agg=min):
    # start with every input a leaf cluster / 1-tuple
    clusters = [(input,) for input in inputs]
    
    # as long as we have more than one cluster left...
    while len(clusters) > 1:
        # find the two closest clusters
        c1, c2 = min([(cluster1, cluster2)
                     for i, cluster1 in enumerate(clusters)
                     for cluster2 in clusters[:i]],
                     key=lambda (x, y): cluster_distance(x, y, distance_agg))

        # remove them from the list of clusters
        clusters = [c for c in clusters if c != c1 and c != c2]

        # merge them, using merge_order = # of clusters left
        merged_cluster = (len(clusters), [c1, c2])

        # and add their merge
        clusters.append(merged_cluster)

    # when there's only one cluster left, return it
    return clusters[0]

def generate_clusters(base_cluster, num_clusters):
    # start with a list with just the base cluster
    clusters = [base_cluster]
    
    # as long as we don't have enough clusters yet...
    while len(clusters) < num_clusters:
        # choose the last-merged of our clusters
        next_cluster = min(clusters, key=get_merge_order)
        # remove it from the list
        clusters = [c for c in clusters if c != next_cluster]
        # and add its children to the list (i.e., unmerge it)
        clusters.extend(get_children(next_cluster))

    # once we have enough clusters...
    return clusters

if __name__ == "__main__":

    inputs = [[-14,-5],[13,13],[20,23],[-19,-11],[-9,-16],[21,27],[-49,15],[26,13],[-46,5],[-34,-1],[11,15],[-49,0],[-22,-16],[19,28],[-12,-8],[-13,-19],[-41,8],[-11,-6],[-25,-9],[-18,-3]]

    random.seed(0) # so you get the same results as me
    clusterer = KMeans(3)
    clusterer.train(inputs)
    print "3-means:"
    print clusterer.means
    print

    random.seed(0)
    clusterer = KMeans(2)
    clusterer.train(inputs)
    print "2-means:"
    print clusterer.means
    print

    print "errors as a function of k"

    for k in range(1, len(inputs) + 1):
        print k, squared_clustering_errors(inputs, k)
    print


    print "bottom up hierarchical clustering"

    base_cluster = bottom_up_cluster(inputs)
    print base_cluster

    print
    print "three clusters, min:"
    for cluster in generate_clusters(base_cluster, 3):
        print get_values(cluster)

    print
    print "three clusters, max:"
    base_cluster = bottom_up_cluster(inputs, max)
    for cluster in generate_clusters(base_cluster, 3):
        print get_values(cluster)


================================================
FILE: first-edition/code/colon_delimited_stock_prices.txt
================================================
date:symbol:closing_price
6/20/2014:AAPL:90.91
6/20/2014:MSFT:41.68
6/20/2014:FB:64.5

================================================
FILE: first-edition/code/comma_delimited_stock_prices.csv
================================================
6/20/2014,AAPL,90.91
6/20/2014,MSFT,41.68
6/20/3014,FB,64.5
6/19/2014,AAPL,91.86
6/19/2014,MSFT,n/a
6/19/2014,FB,64.34

================================================
FILE: first-edition/code/comma_delimited_stock_prices.txt
================================================
AAPL,90.91
FB,64.5
MSFT,41.68


================================================
FILE: first-edition/code/databases.py
================================================
from __future__ import division
import math, random, re
from collections import defaultdict

class Table:
    def __init__(self, columns):
        self.columns = columns
        self.rows = []

    def __repr__(self):
        """pretty representation of the table: columns then rows"""
        return str(self.columns) + "\n" + "\n".join(map(str, self.rows))

    def insert(self, row_values):
        if len(row_values) != len(self.columns):
            raise TypeError("wrong number of elements")
        row_dict = dict(zip(self.columns, row_values))
        self.rows.append(row_dict)

    def update(self, updates, predicate):
        for row in self.rows:
            if predicate(row):
                for column, new_value in updates.iteritems():
                    row[column] = new_value

    def delete(self, predicate=lambda row: True):
        """delete all rows matching predicate
        or all rows if no predicate supplied"""
        self.rows = [row for row in self.rows if not(predicate(row))]

    def select(self, keep_columns=None, additional_columns=None):

        if keep_columns is None:         # if no columns specified,
            keep_columns = self.columns  # return all columns

        if additional_columns is None:
            additional_columns = {}

        # new table for results
        result_table = Table(keep_columns + additional_columns.keys())

        for row in self.rows:
            new_row = [row[column] for column in keep_columns]
            for column_name, calculation in additional_columns.iteritems():
                new_row.append(calculation(row))
            result_table.insert(new_row)

        return result_table

    def where(self, predicate=lambda row: True):
        """return only the rows that satisfy the supplied predicate"""
        where_table = Table(self.columns)
        where_table.rows = filter(predicate, self.rows)
        return where_table

    def limit(self, num_rows=None):
        """return only the first num_rows rows"""
        limit_table = Table(self.columns)
        limit_table.rows = (self.rows[:num_rows] 
                            if num_rows is not None
                            else self.rows)
        return limit_table

    def group_by(self, group_by_columns, aggregates, having=None):

        grouped_rows = defaultdict(list)

        # populate groups
        for row in self.rows:
            key = tuple(row[column] for column in group_by_columns)
            grouped_rows[key].append(row)

        result_table = Table(group_by_columns + aggregates.keys())

        for key, rows in grouped_rows.iteritems():
            if having is None or having(rows):
                new_row = list(key)
                for aggregate_name, aggregate_fn in aggregates.iteritems():
                    new_row.append(aggregate_fn(rows))
                result_table.insert(new_row)

        return result_table

    def order_by(self, order):
        new_table = self.select()       # make a copy
        new_table.rows.sort(key=order)
        return new_table

    def join(self, other_table, left_join=False):

        join_on_columns = [c for c in self.columns           # columns in
                           if c in other_table.columns]      # both tables

        additional_columns = [c for c in other_table.columns # columns only
                              if c not in join_on_columns]   # in right table

        # all columns from left table + additional_columns from right table
        join_table = Table(self.columns + additional_columns)

        for row in self.rows:
            def is_join(other_row):
                return all(other_row[c] == row[c] for c in join_on_columns)

            other_rows = other_table.where(is_join).rows

            # each other row that matches this one produces a result row
            for other_row in other_rows:
                join_table.insert([row[c] for c in self.columns] +
                                  [other_row[c] for c in additional_columns])

            # if no rows match and it's a left join, output with Nones
            if left_join and not other_rows:
                join_table.insert([row[c] for c in self.columns] +
                                  [None for c in additional_columns])

        return join_table

if __name__ == "__main__":

    users = Table(["user_id", "name", "num_friends"])
    users.insert([0, "Hero", 0])
    users.insert([1, "Dunn", 2])
    users.insert([2, "Sue", 3])
    users.insert([3, "Chi", 3])
    users.insert([4, "Thor", 3])
    users.insert([5, "Clive", 2])
    users.insert([6, "Hicks", 3])
    users.insert([7, "Devin", 2])
    users.insert([8, "Kate", 2])
    users.insert([9, "Klein", 3])
    users.insert([10, "Jen", 1])

    print "users table"
    print users
    print

    # SELECT

    print "users.select()"
    print users.select()
    print

    print "users.limit(2)"
    print users.limit(2)
    print

    print "users.select(keep_columns=[\"user_id\"])"
    print users.select(keep_columns=["user_id"])
    print

    print 'where(lambda row: row["name"] == "Dunn")'
    print users.where(lambda row: row["name"] == "Dunn") \
               .select(keep_columns=["user_id"])
    print

    def name_len(row): return len(row["name"])

    print 'with name_length:'
    print users.select(keep_columns=[],
             additional_columns = { "name_length" : name_len })
    print

    # GROUP BY

    def min_user_id(rows): return min(row["user_id"] for row in rows)

    stats_by_length = users \
        .select(additional_columns={"name_len" : name_len}) \
        .group_by(group_by_columns=["name_len"],
                  aggregates={ "min_user_id" : min_user_id,
                               "num_users" : len })

    print "stats by length"
    print stats_by_length
    print

    def first_letter_of_name(row): 
        return row["name"][0] if row["name"] else ""

    def average_num_friends(rows):
        return sum(row["num_friends"] for row in rows) / len(rows)

    def enough_friends(rows):
        return average_num_friends(rows) > 1

    avg_friends_by_letter = users \
        .select(additional_columns={'first_letter' : first_letter_of_name}) \
        .group_by(group_by_columns=['first_letter'],
                  aggregates={ "avg_num_friends" : average_num_friends },
                  having=enough_friends)

    print "avg friends by letter"
    print avg_friends_by_letter
    print

    def sum_user_ids(rows): return sum(row["user_id"] for row in rows)

    user_id_sum = users \
        .where(lambda row: row["user_id"] > 1) \
        .group_by(group_by_columns=[],
                  aggregates={ "user_id_sum" : sum_user_ids })

    print "user id sum"
    print user_id_sum
    print

    # ORDER BY

    friendliest_letters = avg_friends_by_letter \
        .order_by(lambda row: -row["avg_num_friends"]) \
        .limit(4)

    print "friendliest letters"
    print friendliest_letters
    print

    # JOINs

    user_interests = Table(["user_id", "interest"])
    user_interests.insert([0, "SQL"])
    user_interests.insert([0, "NoSQL"])
    user_interests.insert([2, "SQL"])
    user_interests.insert([2, "MySQL"])

    sql_users = users \
        .join(user_interests) \
        .where(lambda row: row["interest"] == "SQL") \
        .select(keep_columns=["name"])

    print "sql users"
    print sql_users
    print

    def count_interests(rows):
        """counts how many rows have non-None interests"""
        return len([row for row in rows if row["interest"] is not None])

    user_interest_counts = users \
        .join(user_interests, left_join=True) \
        .group_by(group_by_columns=["user_id"],
                  aggregates={"num_interests" : count_interests })

    print "user interest counts"
    print user_interest_counts

    # SUBQUERIES

    likes_sql_user_ids = user_interests \
        .where(lambda row: row["interest"] == "SQL") \
        .select(keep_columns=['user_id'])

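    # keep the aggregate so the subquery result is actually used
    min_sql_user_id = likes_sql_user_ids.group_by(
        group_by_columns=[],
        aggregates={ "min_user_id" : min_user_id })

    print "likes sql user ids"
    print likes_sql_user_ids
    print
    print "min user id among sql users"
    print min_sql_user_id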

================================================
FILE: first-edition/code/decision_trees.py
================================================
from __future__ import division
from collections import Counter, defaultdict
from functools import partial
import math, random

def entropy(class_probabilities):
    """given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2) for p in class_probabilities if p)

def class_probabilities(labels):
    total_count = len(labels)
    return [count / total_count
            for count in Counter(labels).values()]

def data_entropy(labeled_data):        
    labels = [label for _, label in labeled_data]
    probabilities = class_probabilities(labels)
    return entropy(probabilities)

def partition_entropy(subsets):
    """find the entropy from this partition of data into subsets"""
    total_count = sum(len(subset) for subset in subsets)
    
    return sum( data_entropy(subset) * len(subset) / total_count
                for subset in subsets )

def group_by(items, key_fn):
    """returns a defaultdict(list), where each input item 
    is in the list whose key is key_fn(item)"""
    groups = defaultdict(list)
    for item in items:
        key = key_fn(item)
        groups[key].append(item)
    return groups
    
def partition_by(inputs, attribute):
    """returns a dict of inputs partitioned by the attribute
    each input is a pair (attribute_dict, label)"""
    return group_by(inputs, lambda x: x[0][attribute])    

def partition_entropy_by(inputs,attribute):
    """computes the entropy corresponding to the given partition"""        
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())        

def classify(tree, input):
    """classify the input using the given decision tree"""
    
    # if this is a leaf node, return its value
    if tree in [True, False]:
        return tree
   
    # otherwise find the correct subtree
    attribute, subtree_dict = tree
    
    subtree_key = input.get(attribute)  # None if input is missing attribute

    if subtree_key not in subtree_dict: # if no subtree for key,
        subtree_key = None              # we'll use the None subtree
    
    subtree = subtree_dict[subtree_key] # choose the appropriate subtree
    return classify(subtree, input)     # and use it to classify the input

def build_tree_id3(inputs, split_candidates=None):

    # if this is our first pass, 
    # all keys of the first input are split candidates
    if split_candidates is None:
        split_candidates = inputs[0][0].keys()

    # count Trues and Falses in the inputs
    num_inputs = len(inputs)
    num_trues = len([label for item, label in inputs if label])
    num_falses = num_inputs - num_trues
    
    if num_trues == 0:                  # if only Falses are left
        return False                    # return a "False" leaf
        
    if num_falses == 0:                 # if only Trues are left
        return True                     # return a "True" leaf

    if not split_candidates:            # if no split candidates left
        return num_trues >= num_falses  # return the majority leaf
                            
    # otherwise, split on the best attribute
    best_attribute = min(split_candidates,
        key=partial(partition_entropy_by, inputs))

    partitions = partition_by(inputs, best_attribute)
    new_candidates = [a for a in split_candidates 
                      if a != best_attribute]
    
    # recursively build the subtrees
    subtrees = { attribute : build_tree_id3(subset, new_candidates)
                 for attribute, subset in partitions.iteritems() }

    subtrees[None] = num_trues > num_falses # default case

    return (best_attribute, subtrees)

def forest_classify(trees, input):
    votes = [classify(tree, input) for tree in trees]
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]


if __name__ == "__main__":

    inputs = [
        ({'level':'Senior','lang':'Java','tweets':'no','phd':'no'},   False),
        ({'level':'Senior','lang':'Java','tweets':'no','phd':'yes'},  False),
        ({'level':'Mid','lang':'Python','tweets':'no','phd':'no'},     True),
        ({'level':'Junior','lang':'Python','tweets':'no','phd':'no'},  True),
        ({'level':'Junior','lang':'R','tweets':'yes','phd':'no'},      True),
        ({'level':'Junior','lang':'R','tweets':'yes','phd':'yes'},    False),
        ({'level':'Mid','lang':'R','tweets':'yes','phd':'yes'},        True),
        ({'level':'Senior','lang':'Python','tweets':'no','phd':'no'}, False),
        ({'level':'Senior','lang':'R','tweets':'yes','phd':'no'},      True),
        ({'level':'Junior','lang':'Python','tweets':'yes','phd':'no'}, True),
        ({'level':'Senior','lang':'Python','tweets':'yes','phd':'yes'},True),
        ({'level':'Mid','lang':'Python','tweets':'no','phd':'yes'},    True),
        ({'level':'Mid','lang':'Java','tweets':'yes','phd':'no'},      True),
        ({'level':'Junior','lang':'Python','tweets':'no','phd':'yes'},False)
    ]

    for key in ['level','lang','tweets','phd']:
        print key, partition_entropy_by(inputs, key)
    print

    senior_inputs = [(input, label)
                     for input, label in inputs if input["level"] == "Senior"]

    for key in ['lang', 'tweets', 'phd']:
        print key, partition_entropy_by(senior_inputs, key)
    print

    print "building the tree"
    tree = build_tree_id3(inputs)
    print tree

    print "Junior / Java / tweets / no phd", classify(tree, 
        { "level" : "Junior", 
          "lang" : "Java", 
          "tweets" : "yes", 
          "phd" : "no"} ) 

    print "Junior / Java / tweets / phd", classify(tree, 
        { "level" : "Junior", 
                 "lang" : "Java", 
                 "tweets" : "yes", 
                 "phd" : "yes"} )

    print "Intern", classify(tree, { "level" : "Intern" } )
    print "Senior", classify(tree, { "level" : "Senior" } )



================================================
FILE: first-edition/code/egrep.py
================================================
# egrep.py
import sys, re

if __name__ == "__main__":

    # sys.argv is the list of command-line arguments
    # sys.argv[0] is the name of the program itself
    # sys.argv[1] will be the regex specified at the command line
    regex = sys.argv[1]

    # for every line passed into the script
    for line in sys.stdin:
        # if it matches the regex, write it to stdout
        if re.search(regex, line):
            sys.stdout.write(line)

================================================
FILE: first-edition/code/getting_data.py
================================================
from __future__ import division
from collections import Counter
import math, random, csv, json, re

from bs4 import BeautifulSoup
import requests

######
#
# BOOKS ABOUT DATA
#
######

def is_video(td):
    """it's a video if it has exactly one pricelabel, and if
    the stripped text inside that pricelabel starts with 'Video'"""
    pricelabels = td('span', 'pricelabel')
    return (len(pricelabels) == 1 and
            pricelabels[0].text.strip().startswith("Video"))

def book_info(td):
    """given a BeautifulSoup <td> Tag representing a book,
    extract the book's details and return a dict"""
    
    title = td.find("div", "thumbheader").a.text
    by_author = td.find('div', 'AuthorName').text
    authors = [x.strip() for x in re.sub("^By ", "", by_author).split(",")]
    isbn_link = td.find("div", "thumbheader").a.get("href")
    isbn = re.match(r"/product/(.*)\.do", isbn_link).groups()[0]
    date = td.find("span", "directorydate").text.strip()
    
    return {
        "title" : title,
        "authors" : authors,
        "isbn" : isbn,
        "date" : date
    }

from time import sleep

def scrape(num_pages=31):
    base_url = "http://shop.oreilly.com/category/browse-subjects/" + \
           "data.do?sortby=publicationDate&page="

    books = []

    for page_num in range(1, num_pages + 1):
        print "souping page", page_num
        url = base_url + str(page_num)
        soup = BeautifulSoup(requests.get(url).text, 'html5lib')
            
        for td in soup('td', 'thumbtext'):
            if not is_video(td):
                books.append(book_info(td))

        # now be a good citizen and respect the robots.txt!
        sleep(30)

    return books

def get_year(book):
    """book["date"] looks like 'November 2014' so we need to 
    split on the space and then take the second piece"""
    return int(book["date"].split()[1])

def plot_years(plt, books):
    # 2014 is the last complete year of data (when I ran this)
    year_counts = Counter(get_year(book) for book in books
                          if get_year(book) <= 2014)

    years = sorted(year_counts)
    book_counts = [year_counts[year] for year in years]
    plt.bar([x - 0.5 for x in years], book_counts)
    plt.xlabel("year")
    plt.ylabel("# of data books")
    plt.title("Data is Big!")
    plt.show()

##
# 
# APIs
#
##

endpoint = "https://api.github.com/users/joelgrus/repos"

repos = json.loads(requests.get(endpoint).text)

from dateutil.parser import parse

dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)

####
#
# Twitter
#
####

from twython import Twython

# fill these in if you want to use the code
CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""

def call_twitter_search_api():

    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET)

    # search for tweets containing the phrase "data science"
    for status in twitter.search(q='"data science"')["statuses"]:
        user = status["user"]["screen_name"].encode('utf-8')
        text = status["text"].encode('utf-8')
        print user, ":", text
        print

from twython import TwythonStreamer

# appending data to a global variable is pretty poor form
# but it makes the example much simpler
tweets = [] 

class MyStreamer(TwythonStreamer):
    """our own subclass of TwythonStreamer that specifies
    how to interact with the stream"""

    def on_success(self, data):
        """what do we do when twitter sends us data?
        here data will be a Python object representing a tweet"""

        # only want to collect English-language tweets
        if data['lang'] == 'en':
            tweets.append(data)

        # stop when we've collected enough
        if len(tweets) >= 1000:
            self.disconnect()

    def on_error(self, status_code, data):
        print status_code, data
        self.disconnect()

def call_twitter_streaming_api():
    stream = MyStreamer(CONSUMER_KEY, CONSUMER_SECRET, 
                        ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    # starts consuming public statuses that contain the keyword 'data'
    stream.statuses.filter(track='data')
    

if __name__ == "__main__":

    def process(date, symbol, price):
        print date, symbol, price

    print "tab delimited stock prices:"

    with open('tab_delimited_stock_prices.txt', 'rb') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            date = row[0]
            symbol = row[1]
            closing_price = float(row[2])
            process(date, symbol, closing_price)

    print

    print "colon delimited stock prices:"

    with open('colon_delimited_stock_prices.txt', 'rb') as f:
        reader = csv.DictReader(f, delimiter=':')
        for row in reader:
            date = row["date"]
            symbol = row["symbol"]
            closing_price = float(row["closing_price"])
            process(date, symbol, closing_price)

    print

    print "writing out comma_delimited_stock_prices.txt"

    today_prices = { 'AAPL' : 90.91, 'MSFT' : 41.68, 'FB' : 64.5 }

    with open('comma_delimited_stock_prices.txt','wb') as f:
        writer = csv.writer(f, delimiter=',')
        for stock, price in today_prices.items():
            writer.writerow([stock, price])

    print "BeautifulSoup"
    html = requests.get("http://www.example.com").text
    soup = BeautifulSoup(html, 'html5lib')  # name the parser explicitly, as in scrape()
    print soup
    print

    print "parsing json"

    serialized = """{ "title" : "Data Science Book",
                      "author" : "Joel Grus",
                      "publicationYear" : 2014,
                      "topics" : [ "data", "science", "data science"] }"""

    # parse the JSON to create a Python object
    deserialized = json.loads(serialized)
    if "data science" in deserialized["topics"]:
        print deserialized 

    print

    print "GitHub API"
    print "dates", dates
    print "month_counts", month_counts
    print "weekday_count", weekday_counts

    last_5_repositories = sorted(repos,
                                 key=lambda r: r["created_at"],
                                 reverse=True)[:5]

    print "last five languages", [repo["language"] 
                                  for repo in last_5_repositories]



================================================
FILE: first-edition/code/gradient_descent.py
================================================
from __future__ import division
from collections import Counter
from linear_algebra import distance, vector_subtract, scalar_multiply
import math, random

def sum_of_squares(v):
    """computes the sum of squared elements in v"""
    return sum(v_i ** 2 for v_i in v)

def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h
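
# A hedged aside, not part of the original code: the forward difference
# above has error roughly proportional to h. A central difference, sketched
# here under the hypothetical name central_difference_quotient, evaluates f
# on both sides of x and typically has error proportional to h ** 2.
def central_difference_quotient(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)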

def plot_estimated_derivative():

    def square(x):
        return x * x

    def derivative(x):
        return 2 * x

    derivative_estimate = lambda x: difference_quotient(square, x, h=0.00001)

    # plot to show they're basically the same
    import matplotlib.pyplot as plt
    x = range(-10,10)
    plt.plot(x, map(derivative, x), 'rx')           # red  x
    plt.plot(x, map(derivative_estimate, x), 'b+')  # blue +
    plt.show()                                      # purple *, hopefully

def partial_difference_quotient(f, v, i, h):

    # add h to just the i-th element of v
    w = [v_j + (h if j == i else 0)
         for j, v_j in enumerate(v)]
         
    return (f(w) - f(v)) / h

def estimate_gradient(f, v, h=0.00001):
    return [partial_difference_quotient(f, v, i, h)
            for i, _ in enumerate(v)] 

def step(v, direction, step_size):
    """move step_size in the direction from v"""
    return [v_i + step_size * direction_i
            for v_i, direction_i in zip(v, direction)]

def sum_of_squares_gradient(v): 
    return [2 * v_i for v_i in v]

def safe(f):
    """define a new function that wraps f and return it"""
    def safe_f(*args, **kwargs):
        try:
            return f(*args, **kwargs)
        except Exception:               # don't swallow KeyboardInterrupt / SystemExit
            return float('inf')         # this means "infinity" in Python
    return safe_f


#
# 
# minimize / maximize batch
#
#

def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    """use gradient descent to find theta that minimizes target function"""
    
    step_sizes = [100, 10, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    
    theta = theta_0                           # set theta to initial value
    target_fn = safe(target_fn)               # safe version of target_fn
    value = target_fn(theta)                  # value we're minimizing
    
    while True:
        gradient = gradient_fn(theta)  
        next_thetas = [step(theta, gradient, -step_size)
                       for step_size in step_sizes]
                   
        # choose the one that minimizes the error function        
        next_theta = min(next_thetas, key=target_fn)
        next_value = target_fn(next_theta)
        
        # stop if we're "converging"
        if abs(value - next_value) < tolerance:
            return theta
        else:
            theta, value = next_theta, next_value

def negate(f):
    """return a function that for any input x returns -f(x)"""
    return lambda *args, **kwargs: -f(*args, **kwargs)
    
def negate_all(f):
    """the same when f returns a list of numbers"""
    return lambda *args, **kwargs: [-y for y in f(*args, **kwargs)]

def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
    return minimize_batch(negate(target_fn),
                          negate_all(gradient_fn),
                          theta_0, 
                          tolerance)

#
# minimize / maximize stochastic
#

def in_random_order(data):
    """generator that returns the elements of data in random order"""
    indexes = [i for i, _ in enumerate(data)]  # create a list of indexes
    random.shuffle(indexes)                    # shuffle them
    for i in indexes:                          # return the data in that order
        yield data[i]

def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):

    data = zip(x, y)
    theta = theta_0                             # initial guess
    alpha = alpha_0                             # initial step size
    min_theta, min_value = None, float("inf")   # the minimum so far
    iterations_with_no_improvement = 0
    
    # if we ever go 100 iterations with no improvement, stop
    while iterations_with_no_improvement < 100:
        value = sum( target_fn(x_i, y_i, theta) for x_i, y_i in data )

        if value < min_value:
            # if we've found a new minimum, remember it
            # and go back to the original step size
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # otherwise we're not improving, so try shrinking the step size
            iterations_with_no_improvement += 1
            alpha *= 0.9

        # and take a gradient step for each of the data points        
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))
            
    return min_theta

def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    return minimize_stochastic(negate(target_fn),
                               negate_all(gradient_fn),
                               x, y, theta_0, alpha_0)

if __name__ == "__main__":

    print "using the gradient"

    v = [random.randint(-10,10) for i in range(3)]

    tolerance = 0.0000001

    while True:
        #print v, sum_of_squares(v)
        gradient = sum_of_squares_gradient(v)   # compute the gradient at v
        next_v = step(v, gradient, -0.01)       # take a negative gradient step
        if distance(next_v, v) < tolerance:     # stop if we're converging
            break
        v = next_v                              # continue if we're not

    print "minimum v", v
    print "minimum value", sum_of_squares(v)
    print


    print "using minimize_batch"

    v = [random.randint(-10,10) for i in range(3)]

    v = minimize_batch(sum_of_squares, sum_of_squares_gradient, v)

    print "minimum v", v
    print "minimum value", sum_of_squares(v)


================================================
FILE: first-edition/code/hypothesis_and_inference.py
================================================
from __future__ import division
from probability import normal_cdf, inverse_normal_cdf
import math, random

def normal_approximation_to_binomial(n, p):
    """finds mu and sigma corresponding to a Binomial(n, p)"""
    mu = p * n
    sigma = math.sqrt(p * (1 - p) * n)
    return mu, sigma

#####
#
# probabilities a normal lies in an interval
#
######

# the normal cdf _is_ the probability the variable is below a threshold
normal_probability_below = normal_cdf

# it's above the threshold if it's not below the threshold
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)
    
# it's between if it's less than hi, but not less than lo
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

# it's outside if it's not between
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)

######
#
#  normal bounds
#
######


def normal_upper_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z <= z) = probability"""
    return inverse_normal_cdf(probability, mu, sigma)
    
def normal_lower_bound(probability, mu=0, sigma=1):
    """returns the z for which P(Z >= z) = probability"""
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    """returns the symmetric (about the mean) bounds 
    that contain the specified probability"""
    tail_probability = (1 - probability) / 2

    # upper bound should have tail_probability above it
    upper_bound = normal_lower_bound(tail_probability, mu, sigma)

    # lower bound should have tail_probability below it
    lower_bound = normal_upper_bound(tail_probability, mu, sigma)

    return lower_bound, upper_bound

def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        # if x is greater than the mean, the tail is above x
        return 2 * normal_probability_above(x, mu, sigma)
    else:
        # if x is less than the mean, the tail is below x
        return 2 * normal_probability_below(x, mu, sigma)   

def count_extreme_values():
    extreme_value_count = 0
    for _ in range(100000):
        num_heads = sum(1 if random.random() < 0.5 else 0    # count # of heads
                        for _ in range(1000))                # in 1000 flips
        if num_heads >= 530 or num_heads <= 470:             # and count how often
            extreme_value_count += 1                         # the # is 'extreme'

    return extreme_value_count / 100000

upper_p_value = normal_probability_above
lower_p_value = normal_probability_below    

##
#
# P-hacking
#
##

def run_experiment():
    """flip a fair coin 1000 times, True = heads, False = tails"""
    return [random.random() < 0.5 for _ in range(1000)]

def reject_fairness(experiment):
    """using the 5% significance level"""
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531


##
#
# running an A/B test
#
##

def estimated_parameters(N, n):
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma

def a_b_test_statistic(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)
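
# A worked instance of the statistic above, using the first call in the
# __main__ block below (N_A = 1000, n_A = 200, N_B = 1000, n_B = 180);
# the rounded values are hand arithmetic:
#   p_A = 200 / 1000 = 0.20,  sigma_A = sqrt(0.20 * 0.80 / 1000) ~ 0.01265
#   p_B = 180 / 1000 = 0.18,  sigma_B = sqrt(0.18 * 0.82 / 1000) ~ 0.01215
#   z   = (0.18 - 0.20) / sqrt(0.01265 ** 2 + 0.01215 ** 2)      ~ -1.14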

##
#
# Bayesian Inference
#
##

def B(alpha, beta):
    """a normalizing constant so that the total probability is 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x < 0 or x > 1:          # no weight outside of [0, 1]    
        return 0        
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)


if __name__ == "__main__":

    mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)
    print "mu_0", mu_0
    print "sigma_0", sigma_0
    print "normal_two_sided_bounds(0.95, mu_0, sigma_0)", normal_two_sided_bounds(0.95, mu_0, sigma_0)
    print
    print "power of a test"
    
    print "95% bounds based on assumption p is 0.5"
    
    lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)
    print "lo", lo
    print "hi", hi

    print "actual mu and sigma based on p = 0.55"
    mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)
    print "mu_1", mu_1
    print "sigma_1", sigma_1

    # a type 2 error means we fail to reject the null hypothesis
    # which will happen when X is still in our original interval
    type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
    power = 1 - type_2_probability # 0.887

    print "type 2 probability", type_2_probability
    print "power", power
    print

    print "one-sided test"
    hi = normal_upper_bound(0.95, mu_0, sigma_0) 
    print "hi", hi # is 526 (< 531, since we need more probability in the upper tail)
    type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
    power = 1 - type_2_probability # = 0.936
    print "type 2 probability", type_2_probability
    print "power", power
    print

    print "two_sided_p_value(529.5, mu_0, sigma_0)", two_sided_p_value(529.5, mu_0, sigma_0)  

    print "two_sided_p_value(531.5, mu_0, sigma_0)", two_sided_p_value(531.5, mu_0, sigma_0)

    print "upper_p_value(525, mu_0, sigma_0)", upper_p_value(525, mu_0, sigma_0)
    print "upper_p_value(527, mu_0, sigma_0)", upper_p_value(527, mu_0, sigma_0)    
    print 

    print "P-hacking"

    random.seed(0)
    experiments = [run_experiment() for _ in range(1000)]
    num_rejections = len([experiment
                          for experiment in experiments 
                          if reject_fairness(experiment)])

    print num_rejections, "rejections out of 1000"
    print

    print "A/B testing"
    z = a_b_test_statistic(1000, 200, 1000, 180)
    print "a_b_test_statistic(1000, 200, 1000, 180)", z
    print "p-value", two_sided_p_value(z)
    z = a_b_test_statistic(1000, 200, 1000, 150)
    print "a_b_test_statistic(1000, 200, 1000, 150)", z
    print "p-value", two_sided_p_value(z)


================================================
FILE: first-edition/code/introduction.py
================================================
from __future__ import division

# at this stage in the book we haven't actually installed matplotlib,
# comment this out if you need to
from matplotlib import pyplot as plt

##########################
#                        #
# FINDING KEY CONNECTORS #
#                        #
##########################

users = [
    { "id": 0, "name": "Hero" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" },
    { "id": 10, "name": "Jen" }
]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]


# first give each user an empty list
for user in users:
    user["friends"] = []

# and then populate the lists with friendships
for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j]) # add j as a friend of i
    users[j]["friends"].append(users[i]) # add i as a friend of j

def number_of_friends(user):
    """how many friends does _user_ have?"""
    return len(user["friends"]) # length of the user's friends list

total_connections = sum(number_of_friends(user)
                        for user in users) # 24

num_users = len(users)
avg_connections = total_connections / num_users # about 2.18 (24 / 11)

################################
#                              #
# DATA SCIENTISTS YOU MAY KNOW #
#                              #
################################

def friends_of_friend_ids_bad(user):
    # "foaf" is short for "friend of a friend"
    return [foaf["id"]
            for friend in user["friends"] # for each of user's friends
            for foaf in friend["friends"]] # get each of _their_ friends

from collections import Counter # not loaded by default

def not_the_same(user, other_user):
    """two users are not the same if they have different ids"""
    return user["id"] != other_user["id"]

def not_friends(user, other_user):
    """other_user is not a friend if he's not in user["friends"];
    that is, if he's not_the_same as all the people in user["friends"]"""
    return all(not_the_same(friend, other_user)
               for friend in user["friends"])

def friends_of_friend_ids(user):
    return Counter(foaf["id"]
                   for friend in user["friends"]  # for each of my friends
                   for foaf in friend["friends"]  # count *their* friends
                   if not_the_same(user, foaf)    # who aren't me
                   and not_friends(user, foaf))   # and aren't my friends

print friends_of_friend_ids(users[3]) # Counter({0: 2, 5: 1})

interests = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (0, "Spark"), (0, "Storm"), (0, "Cassandra"),
    (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
]

def data_scientists_who_like(target_interest):
    return [user_id
            for user_id, user_interest in interests
            if user_interest == target_interest]

from collections import defaultdict

# keys are interests, values are lists of user_ids with that interest
user_ids_by_interest = defaultdict(list)

for user_id, interest in interests:
    user_ids_by_interest[interest].append(user_id)

# keys are user_ids, values are lists of interests for that user_id
interests_by_user_id = defaultdict(list)

for user_id, interest in interests:
    interests_by_user_id[user_id].append(interest)

def most_common_interests_with(user_id):
    return Counter(interested_user_id
        for interest in interests_by_user_id[user_id]
        for interested_user_id in user_ids_by_interest[interest]
        if interested_user_id != user_id)

###########################
#                         #
# SALARIES AND EXPERIENCE #
#                         #
###########################

salaries_and_tenures = [(83000, 8.7), (88000, 8.1),
                        (48000, 0.7), (76000, 6),
                        (69000, 6.5), (76000, 7.5),
                        (60000, 2.5), (83000, 10),
                        (48000, 1.9), (63000, 4.2)]

def make_chart_salaries_by_tenure():
    tenures = [tenure for salary, tenure in salaries_and_tenures]
    salaries = [salary for salary, tenure in salaries_and_tenures]
    plt.scatter(tenures, salaries)
    plt.xlabel("Years Experience")
    plt.ylabel("Salary")
    plt.show()

# keys are years
# values are the salaries for each tenure
salary_by_tenure = defaultdict(list)

for salary, tenure in salaries_and_tenures:
    salary_by_tenure[tenure].append(salary)

average_salary_by_tenure = {
    tenure : sum(salaries) / len(salaries)
    for tenure, salaries in salary_by_tenure.items()
}

def tenure_bucket(tenure):
    if tenure < 2: return "less than two"
    elif tenure < 5: return "between two and five"
    else: return "more than five"

salary_by_tenure_bucket = defaultdict(list)

for salary, tenure in salaries_and_tenures:
    bucket = tenure_bucket(tenure)
    salary_by_tenure_bucket[bucket].append(salary)

average_salary_by_bucket = {
  tenure_bucket : sum(salaries) / len(salaries)
  for tenure_bucket, salaries in salary_by_tenure_bucket.iteritems()
}


#################
#               #
# PAID_ACCOUNTS #
#               #
#################

def predict_paid_or_unpaid(years_experience):
  if years_experience < 3.0: return "paid"
  elif years_experience < 8.5: return "unpaid"
  else: return "paid"

######################
#                    #
# TOPICS OF INTEREST #
#                    #
######################

words_and_counts = Counter(word
                           for user, interest in interests
                           for word in interest.lower().split())


if __name__ == "__main__":

    print
    print "######################"
    print "#"
    print "# FINDING KEY CONNECTORS"
    print "#"
    print "######################"
    print


    print "total connections", total_connections
    print "number of users", num_users
    print "average connections", total_connections / num_users
    print

    # create a list (user_id, number_of_friends)
    num_friends_by_id = [(user["id"], number_of_friends(user))
                         for user in users]

    print "users sorted by number of friends:"
    print sorted(num_friends_by_id,
                 key=lambda (user_id, num_friends): num_friends, # by number of friends
                 reverse=True)                                   # largest to smallest

    print
    print "######################"
    print "#"
    print "# DATA SCIENTISTS YOU MAY KNOW"
    print "#"
    print "######################"
    print


    print "friends of friends bad for user 0:", friends_of_friend_ids_bad(users[0])
    print "friends of friends for user 3:", friends_of_friend_ids(users[3])

    print
    print "######################"
    print "#"
    print "# SALARIES AND TENURES"
    print "#"
    print "######################"
    print

    print "average salary by tenure", average_salary_by_tenure
    print "average salary by tenure bucket", average_salary_by_bucket

    print
    print "######################"
    print "#"
    print "# MOST COMMON WORDS"
    print "#"
    print "######################"
    print

    for word, count in words_and_counts.most_common():
        if count > 1:
            print word, count


================================================
FILE: first-edition/code/line_count.py
================================================
# line_count.py
import sys

if __name__ == "__main__":

    count = 0
    for line in sys.stdin:
        count += 1

    # print goes to sys.stdout
    print count

================================================
FILE: first-edition/code/linear_algebra.py
================================================
# -*- coding: iso-8859-15 -*-

from __future__ import division # want 3 / 2 == 1.5
import re, math, random # regexes, math functions, random numbers
import matplotlib.pyplot as plt # pyplot
from collections import defaultdict, Counter
from functools import partial

# 
# functions for working with vectors
#

def vector_add(v, w):
    """adds two vectors componentwise"""
    return [v_i + w_i for v_i, w_i in zip(v,w)]

def vector_subtract(v, w):
    """subtracts two vectors componentwise"""
    return [v_i - w_i for v_i, w_i in zip(v,w)]

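def vector_sum(vectors):
    """sums all corresponding elements of a list of vectors"""
    return reduce(vector_add, vectors)

def scalar_multiply(c, v):
    """multiplies every element of v by the scalar c"""
    return [c * v_i for v_i in v]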

# without "from __future__ import division", 1/n is integer division
# in Python 2 (0 whenever n > 1), which makes this function wrong
def vector_mean(vectors):
    """compute the vector whose i-th element is the mean of the
    i-th elements of the input vectors"""
    n = len(vectors)
    return scalar_multiply(1/n, vector_sum(vectors))

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return dot(v, v)

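def magnitude(v):
    """length of v: the square root of its sum of squares"""
    return math.sqrt(sum_of_squares(v))

def squared_distance(v, w):
    """(v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2"""
    return sum_of_squares(vector_subtract(v, w))

def distance(v, w):
    """Euclidean distance between v and w"""
    return math.sqrt(squared_distance(v, w))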

#
# functions for working with matrices
#

def shape(A):
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

def get_row(A, i):
    return A[i]
    
def get_column(A, j):
    return [A_i[j] for A_i in A]

def make_matrix(num_rows, num_cols, entry_fn):
    """returns a num_rows x num_cols matrix 
    whose (i,j)-th entry is entry_fn(i, j)"""
    return [[entry_fn(i, j) for j in range(num_cols)]
            for i in range(num_rows)]  

def is_diagonal(i, j):
    """1's on the 'diagonal', 0's everywhere else"""
    return 1 if i == j else 0

identity_matrix = make_matrix(5, 5, is_diagonal)

#          user 0  1  2  3  4  5  6  7  8  9
#
friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0
               [1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1
               [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2
               [0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3
               [0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4
               [0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5
               [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6
               [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7
               [0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8
               [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9

#####
# DELETE DOWN
#


def matrix_add(A, B):
    if shape(A) != shape(B):
        raise ArithmeticError("cannot add matrices with different shapes")
        
    num_rows, num_cols = shape(A)
    def entry_fn(i, j): return A[i][j] + B[i][j]
        
    return make_matrix(num_rows, num_cols, entry_fn)


def make_graph_dot_product_as_vector_projection(plt):

    v = [2, 1]
    w = [math.sqrt(.25), math.sqrt(.75)]
    c = dot(v, w)
    vonw = scalar_multiply(c, w)
    o = [0,0]

    plt.arrow(0, 0, v[0], v[1], 
              width=0.002, head_width=.1, length_includes_head=True)
    plt.annotate("v", v, xytext=[v[0] + 0.1, v[1]])
    plt.arrow(0 ,0, w[0], w[1], 
              width=0.002, head_width=.1, length_includes_head=True)
    plt.annotate("w", w, xytext=[w[0] - 0.1, w[1]])
    plt.arrow(0, 0, vonw[0], vonw[1], length_includes_head=True)
    plt.annotate(u"(v•w)w", vonw, xytext=[vonw[0] - 0.1, vonw[1] + 0.1])
    plt.arrow(v[0], v[1], vonw[0] - v[0], vonw[1] - v[1], 
              linestyle='dotted', length_includes_head=True)
    plt.scatter(*zip(v,w,o),marker='.')
    plt.axis('equal')
    plt.show()


================================================
FILE: first-edition/code/logistic_regression.py
================================================
from __future__ import division
from collections import Counter
from functools import partial
from linear_algebra import dot, vector_add
from gradient_descent import maximize_stochastic, maximize_batch
from working_with_data import rescale
from machine_learning import train_test_split
from multiple_regression import estimate_beta, predict
import math, random

def logistic(x):
    return 1.0 / (1 + math.exp(-x))

def logistic_prime(x):
    return logistic(x) * (1 - logistic(x))
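
# A hedged sketch, not part of the original module: logistic above calls
# math.exp(-x), which raises OverflowError for very negative x (roughly
# x < -709). The variant below only ever exponentiates a non-positive
# number. The name logistic_stable is hypothetical.
import math  # already imported by this module; repeated so the sketch stands alone

def logistic_stable(x):
    if x >= 0:
        return 1.0 / (1 + math.exp(-x))
    z = math.exp(x)     # x < 0 here, so this cannot overflow
    return z / (1 + z)  # algebraically equal to 1 / (1 + exp(-x))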

def logistic_log_likelihood_i(x_i, y_i, beta):
    if y_i == 1:
        return math.log(logistic(dot(x_i, beta)))
    else:
        return math.log(1 - logistic(dot(x_i, beta)))

def logistic_log_likelihood(x, y, beta):
    return sum(logistic_log_likelihood_i(x_i, y_i, beta)
               for x_i, y_i in zip(x, y))

def logistic_log_partial_ij(x_i, y_i, beta, j):
    """here i is the index of the data point,
    j the index of the derivative"""

    return (y_i - logistic(dot(x_i, beta))) * x_i[j]
    
def logistic_log_gradient_i(x_i, y_i, beta):
    """the gradient of the log likelihood 
    corresponding to the i-th data point"""

    return [logistic_log_partial_ij(x_i, y_i, beta, j)
            for j, _ in enumerate(beta)]
            
def logistic_log_gradient(x, y, beta):
    return reduce(vector_add,
                  [logistic_log_gradient_i(x_i, y_i, beta)
                   for x_i, y_i in zip(x,y)])    

if __name__ == "__main__":

    data = [(0.7,48000,1),(1.9,48000,0),(2.5,60000,1),(4.2,63000,0),(6,76000,0),(6.5,69000,0),(7.5,76000,0),(8.1,88000,0),(8.7,83000,1),(10,83000,1),(0.8,43000,0),(1.8,60000,0),(10,79000,1),(6.1,76000,0),(1.4,50000,0),(9.1,92000,0),(5.8,75000,0),(5.2,69000,0),(1,56000,0),(6,67000,0),(4.9,74000,0),(6.4,63000,1),(6.2,82000,0),(3.3,58000,0),(9.3,90000,1),(5.5,57000,1),(9.1,102000,0),(2.4,54000,0),(8.2,65000,1),(5.3,82000,0),(9.8,107000,0),(1.8,64000,0),(0.6,46000,1),(0.8,48000,0),(8.6,84000,1),(0.6,45000,0),(0.5,30000,1),(7.3,89000,0),(2.5,48000,1),(5.6,76000,0),(7.4,77000,0),(2.7,56000,0),(0.7,48000,0),(1.2,42000,0),(0.2,32000,1),(4.7,56000,1),(2.8,44000,1),(7.6,78000,0),(1.1,63000,0),(8,79000,1),(2.7,56000,0),(6,52000,1),(4.6,56000,0),(2.5,51000,0),(5.7,71000,0),(2.9,65000,0),(1.1,33000,1),(3,62000,0),(4,71000,0),(2.4,61000,0),(7.5,75000,0),(9.7,81000,1),(3.2,62000,0),(7.9,88000,0),(4.7,44000,1),(2.5,55000,0),(1.6,41000,0),(6.7,64000,1),(6.9,66000,1),(7.9,78000,1),(8.1,102000,0),(5.3,48000,1),(8.5,66000,1),(0.2,56000,0),(6,69000,0),(7.5,77000,0),(8,86000,0),(4.4,68000,0),(4.9,75000,0),(1.5,60000,0),(2.2,50000,0),(3.4,49000,1),(4.2,70000,0),(7.7,98000,0),(8.2,85000,0),(5.4,88000,0),(0.1,46000,0),(1.5,37000,0),(6.3,86000,0),(3.7,57000,0),(8.4,85000,0),(2,42000,0),(5.8,69000,1),(2.7,64000,0),(3.1,63000,0),(1.9,48000,0),(10,72000,1),(0.2,45000,0),(8.6,95000,0),(1.5,64000,0),(9.8,95000,0),(5.3,65000,0),(7.5,80000,0),(9.9,91000,0),(9.7,50000,1),(2.8,68000,0),(3.6,58000,0),(3.9,74000,0),(4.4,76000,0),(2.5,49000,0),(7.2,81000,0),(5.2,60000,1),(2.4,62000,0),(8.9,94000,0),(2.4,63000,0),(6.8,69000,1),(6.5,77000,0),(7,86000,0),(9.4,94000,0),(7.8,72000,1),(0.2,53000,0),(10,97000,0),(5.5,65000,0),(7.7,71000,1),(8.1,66000,1),(9.8,91000,0),(8,84000,0),(2.7,55000,0),(2.8,62000,0),(9.4,79000,0),(2.5,57000,0),(7.4,70000,1),(2.1,47000,0),(5.3,62000,1),(6.3,79000,0),(6.8,58000,1),(5.7,80000,0),(2.2,61000,0),(4.8,62000,0),(3.7,64000,0),(4.1,85000,0),(2.3,51000,0),(3.5,58000,0),(0.9,43000,
0),(0.9,54000,0),(4.5,74000,0),(6.5,55000,1),(4.1,41000,1),(7.1,73000,0),(1.1,66000,0),(9.1,81000,1),(8,69000,1),(7.3,72000,1),(3.3,50000,0),(3.9,58000,0),(2.6,49000,0),(1.6,78000,0),(0.7,56000,0),(2.1,36000,1),(7.5,90000,0),(4.8,59000,1),(8.9,95000,0),(6.2,72000,0),(6.3,63000,0),(9.1,100000,0),(7.3,61000,1),(5.6,74000,0),(0.5,66000,0),(1.1,59000,0),(5.1,61000,0),(6.2,70000,0),(6.6,56000,1),(6.3,76000,0),(6.5,78000,0),(5.1,59000,0),(9.5,74000,1),(4.5,64000,0),(2,54000,0),(1,52000,0),(4,69000,0),(6.5,76000,0),(3,60000,0),(4.5,63000,0),(7.8,70000,0),(3.9,60000,1),(0.8,51000,0),(4.2,78000,0),(1.1,54000,0),(6.2,60000,0),(2.9,59000,0),(2.1,52000,0),(8.2,87000,0),(4.8,73000,0),(2.2,42000,1),(9.1,98000,0),(6.5,84000,0),(6.9,73000,0),(5.1,72000,0),(9.1,69000,1),(9.8,79000,1),]
    data = map(list, data) # change tuples to lists

    x = [[1] + row[:2] for row in data] # each element is [1, experience, salary]
    y = [row[2] for row in data]        # each element is paid_account

    print "linear regression:"

    rescaled_x = rescale(x)
    beta = estimate_beta(rescaled_x, y)
    print beta

    print "logistic regression:"

    random.seed(0)
    x_train, x_test, y_train, y_test = train_test_split(rescaled_x, y, 0.33)

    # want to maximize log likelihood on the training data
    fn = partial(logistic_log_likelihood, x_train, y_train)
    gradient_fn = partial(logistic_log_gradient, x_train, y_train)

    # pick a starting point (fixed here rather than random)
    beta_0 = [1, 1, 1]

    # and maximize using gradient descent
    beta_hat = maximize_batch(fn, gradient_fn, beta_0)

    print "beta_batch", beta_hat

    beta_0 = [1, 1, 1]
    beta_hat = maximize_stochastic(logistic_log_likelihood_i,
                               logistic_log_gradient_i,
                               x_train, y_train, beta_0)

    print "beta stochastic", beta_hat

    true_positives = false_positives = true_negatives = false_negatives = 0

    for x_i, y_i in zip(x_test, y_test):
        # predicted probability of a paid account
        prediction = logistic(dot(beta_hat, x_i))

        if y_i == 1 and prediction >= 0.5:  # TP: paid and we predict paid
            true_positives += 1
        elif y_i == 1:                      # FN: paid and we predict unpaid
            false_negatives += 1
        elif prediction >= 0.5:             # FP: unpaid and we predict paid
            false_positives += 1
        else:                               # TN: unpaid and we predict unpaid
            true_negatives += 1

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)

    print "precision", precision
    print "recall", recall

================================================
FILE: first-edition/code/machine_learning.py
================================================
from __future__ import division
from collections import Counter
import math, random

#
# data splitting
#

def split_data(data, prob):
    """split data into fractions [prob, 1 - prob]"""
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

def train_test_split(x, y, test_pct):
    data = zip(x, y)                              # pair corresponding values  
    train, test = split_data(data, 1 - test_pct)  # split the dataset of pairs
    x_train, y_train = zip(*train)                # magical un-zip trick
    x_test, y_test = zip(*test)
    return x_train, x_test, y_train, y_test
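
# A minimal sketch of the "un-zip" idiom used in train_test_split above.
# The helper name unzip_pairs is hypothetical and not part of the original
# module. It also handles empty input, where a bare "xs, ys = zip(*pairs)"
# would fail to unpack into two names.
def unzip_pairs(pairs):
    """turn [(x_1, y_1), ..., (x_n, y_n)] into ([x_1, ..., x_n], [y_1, ..., y_n])"""
    pairs = list(pairs)
    if not pairs:
        return [], []
    xs, ys = zip(*pairs)   # zip(*pairs) transposes rows into columns
    return list(xs), list(ys)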

#
# correctness
#

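def accuracy(tp, fp, fn, tn):
    """fraction of all predictions that were correct"""
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

def precision(tp, fp, fn, tn):
    """fraction of predicted positives that were actually positive"""
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    """fraction of actual positives that were predicted positive"""
    return tp / (tp + fn)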

def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)

    return 2 * p * r / (p + r)
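
# A hedged sketch, not part of the original module: f1_score above raises
# ZeroDivisionError when there are no true positives (precision and recall
# are then both 0, or one of their denominators is empty). The name
# f1_score_or_zero and the choice to return 0.0 in that case are
# assumptions made for illustration.
def f1_score_or_zero(tp, fp, fn, tn):
    if tp == 0:
        return 0.0
    p = float(tp) / (tp + fp)    # precision
    r = float(tp) / (tp + fn)    # recall
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall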

if __name__ == "__main__":

    print "accuracy(70, 4930, 13930, 981070)", accuracy(70, 4930, 13930, 981070)
    print "precision(70, 4930, 13930, 981070)", precision(70, 4930, 13930, 981070)
    print "recall(70, 4930, 13930, 981070)", recall(70, 4930, 13930, 981070)
    print "f1_score(70, 4930, 13930, 981070)", f1_score(70, 4930, 13930, 981070)



================================================
FILE: first-edition/code/mapreduce.py
================================================
from __future__ import division
import math, random, re, datetime
from collections import defaultdict, Counter
from functools import partial
from naive_bayes import tokenize

def word_count_old(documents):
    """word count not using MapReduce"""
    return Counter(word 
        for document in documents 
        for word in tokenize(document))

def wc_mapper(document):
    """for each word in the document, emit (word,1)"""        
    for word in tokenize(document):
        yield (word, 1)

def wc_reducer(word, counts):
    """sum up the counts for a word"""
    yield (word, sum(counts))

def word_count(documents):
    """count the words in the input documents using MapReduce"""

    # place to store grouped values
    collector = defaultdict(list) 

    for document in documents:
        for word, count in wc_mapper(document):
            collector[word].append(count)

    return [output
            for word, counts in collector.iteritems()
            for output in wc_reducer(word, counts)]

def map_reduce(inputs, mapper, reducer):
    """runs MapReduce on the inputs using mapper and reducer"""
    collector = defaultdict(list)

    for input in inputs:
        for key, value in mapper(input):
            collector[key].append(value)

    return [output
            for key, values in collector.iteritems()
            for output in reducer(key,values)]

def reduce_with(aggregation_fn, key, values):
    """reduces a key-values pair by applying aggregation_fn to the values"""
    yield (key, aggregation_fn(values))

def values_reducer(aggregation_fn):
    """turns a function (values -> output) into a reducer"""
    return partial(reduce_with, aggregation_fn)

sum_reducer = values_reducer(sum)
max_reducer = values_reducer(max)
min_reducer = values_reducer(min)
count_distinct_reducer = values_reducer(lambda values: len(set(values)))

# 
# Analyzing Status Updates
#

status_updates = [
    {"id": 1, 
     "username" : "joelgrus", 
     "text" : "Is anyone interested in a data science book?",
     "created_at" : datetime.datetime(2013, 12, 21, 11, 47, 0),
     "liked_by" : ["data_guy", "data_gal", "bill"] },
    # add your own
]

def data_science_day_mapper(status_update):
    """yields (day_of_week, 1) if status_update contains "data science" """
    if "data science" in status_update["text"].lower():
        day_of_week = status_update["created_at"].weekday()
        yield (day_of_week, 1)
        
data_science_days = map_reduce(status_updates, 
                               data_science_day_mapper, 
                               sum_reducer)

def words_per_user_mapper(status_update):
    user = status_update["username"]
    for word in tokenize(status_update["text"]):
        yield (user, (word, 1))
            
def most_popular_word_reducer(user, words_and_counts):
    """given a sequence of (word, count) pairs, 
    return the word with the highest total count"""
    
    word_counts = Counter()
    for word, count in words_and_counts:
        word_counts[word] += count

    word, count = word_counts.most_common(1)[0]
                       
    yield (user, (word, count))

user_words = map_reduce(status_updates,
                        words_per_user_mapper, 
                        most_popular_word_reducer)

def liker_mapper(status_update):
    user = status_update["username"]
    for liker in status_update["liked_by"]:
        yield (user, liker)
                
distinct_likers_per_user = map_reduce(status_updates, 
                                      liker_mapper, 
                                      count_distinct_reducer)


#
# matrix multiplication
#

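def matrix_multiply_mapper(m, element):
    """m is the common dimension (columns of A, rows of B)
    element is a tuple (matrix_name, i, j, value)
    note: range(m) below also bounds the rows and columns of C,
    so this assumes A has at most m rows and B has at most m columns"""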
    matrix, i, j, value = element

    if matrix == "A":
        for column in range(m):
            # A_ij is the jth entry in the sum for each C_i_column
            yield((i, column), (j, value))
    else:
        for row in range(m):
            # B_ij is the ith entry in the sum for each C_row_j
            yield((row, j), (i, value))
     
def matrix_multiply_reducer(m, key, indexed_values):
    results_by_index = defaultdict(list)
    for index, value in indexed_values:
        results_by_index[index].append(value)

    # sum up all the products of the positions with two results
    sum_product = sum(results[0] * results[1]
                      for results in results_by_index.values()
                      if len(results) == 2)
                      
    if sum_product != 0.0:
        yield (key, sum_product)
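
# A small cross-check for the sparse MapReduce multiplication above, written
# as a sketch with a hypothetical name (dense_multiply_nonzero is not in the
# original module). It multiplies two dense list-of-lists matrices directly
# and keeps only the nonzero entries, keyed by (row, column), which mirrors
# the (key, sum_product) pairs the reducer yields.
def dense_multiply_nonzero(A, B):
    num_rows, common, num_cols = len(A), len(B), len(B[0])
    result = {}
    for i in range(num_rows):
        for j in range(num_cols):
            total = sum(A[i][k] * B[k][j] for k in range(common))
            if total != 0:
                result[(i, j)] = total
    return result

# For instance, the entries used in the __main__ block below correspond to
# A = [[3, 2]] and B = [[4, -1], [10, 0]] when written densely (the exact
# padding with zero rows and columns is an assumption), and
# dense_multiply_nonzero(A, B) gives {(0, 0): 32, (0, 1): -3}.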

if __name__ == "__main__":

    documents = ["data science", "big data", "science fiction"]

    wc_mapper_results = [result 
                         for document in documents
                         for result in wc_mapper(document)]

    print "wc_mapper results"
    print wc_mapper_results
    print 

    print "word count results"
    print word_count(documents)
    print

    print "word count using map_reduce function"
    print map_reduce(documents, wc_mapper, wc_reducer)
    print

    print "data science days"
    print data_science_days
    print

    print "user words"
    print user_words
    print

    print "distinct likers"
    print distinct_likers_per_user
    print

    # matrix multiplication

    entries = [("A", 0, 0, 3), ("A", 0, 1,  2),
           ("B", 0, 0, 4), ("B", 0, 1, -1), ("B", 1, 0, 10)]
    mapper = partial(matrix_multiply_mapper, 3)
    reducer = partial(matrix_multiply_reducer, 3)

    print "map-reduce matrix multiplication"
    print "entries:", entries
    print "result:", map_reduce(entries, mapper, reducer)

    

================================================
FILE: first-edition/code/most_common_words.py
================================================
# most_common_words.py
import sys
from collections import Counter

if __name__ == "__main__":

    # pass in number of words as first argument
    try:
        num_words = int(sys.argv[1])
    except (IndexError, ValueError):   # missing or non-integer argument
        print "usage: most_common_words.py num_words"
        sys.exit(1)   # non-zero exit code indicates error

    counter = Counter(word.lower()                      
                      for line in sys.stdin             
                      for word in line.strip().split()  
                      if word)                          
            
    for word, count in counter.most_common(num_words):
        sys.stdout.write(str(count))
        sys.stdout.write("\t")
        sys.stdout.write(word)
        sys.stdout.write("\n")

================================================
FILE: first-edition/code/multiple_regression.py
================================================
from __future__ import division
from collections import Counter
from functools import partial
from linear_algebra import dot, vector_add
from statistics import median, standard_deviation
from probability import normal_cdf
from gradient_descent import minimize_stochastic
from simple_linear_regression import total_sum_of_squares
import math, random


def predict(x_i, beta):
    return dot(x_i, beta)

def error(x_i, y_i, beta):
    return y_i - predict(x_i, beta)
    
def squared_error(x_i, y_i, beta):
    return error(x_i, y_i, beta) ** 2

def squared_error_gradient(x_i, y_i, beta):
    """the gradient corresponding to the ith squared error term"""
    return [-2 * x_ij * error(x_i, y_i, beta)
            for x_ij in x_i]

def estimate_beta(x, y):
    beta_initial = [random.random() for x_i in x[0]]
    return minimize_stochastic(squared_error, 
                               squared_error_gradient, 
                               x, y, 
                               beta_initial, 
                               0.001)            

def multiple_r_squared(x, y, beta):
    sum_of_squared_errors = sum(error(x_i, y_i, beta) ** 2
                                for x_i, y_i in zip(x, y))
    return 1.0 - sum_of_squared_errors / total_sum_of_squares(y)

def bootstrap_sample(data):
    """randomly samples len(data) elements with replacement"""
    return [random.choice(data) for _ in data]
    
def bootstrap_statistic(data, stats_fn, num_samples):
    """evaluates stats_fn on num_samples bootstrap samples from data"""
    return [stats_fn(bootstrap_sample(data)) 
            for _ in range(num_samples)]

def estimate_sample_beta(sample):
    x_sample, y_sample = zip(*sample) # magic unzipping trick
    return estimate_beta(x_sample, y_sample)

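def p_value(beta_hat_j, sigma_hat_j):
    """two-sided p-value for the coefficient estimate beta_hat_j
    with standard error sigma_hat_j, using a normal approximation"""
    if beta_hat_j > 0:
        # positive estimate: twice the probability of an even larger value
        return 2 * (1 - normal_cdf(beta_hat_j / sigma_hat_j))
    else:
        # otherwise: twice the probability of an even smaller value
        return 2 * normal_cdf(beta_hat_j / sigma_hat_j)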

#
# REGULARIZED REGRESSION
#

# alpha is a *hyperparameter* controlling how harsh the penalty is
# sometimes it's called "lambda" but that already means something in Python
def ridge_penalty(beta, alpha):
    return alpha * dot(beta[1:], beta[1:])

def squared_error_ridge(x_i, y_i, beta, alpha):
    """estimate error plus ridge penalty on beta"""
    return error(x_i, y_i, beta) ** 2 + ridge_penalty(beta, alpha)

def ridge_penalty_gradient(beta, alpha):
    """gradient of just the ridge penalty"""
    return [0] + [2 * alpha * beta_j for beta_j in beta[1:]]

def squared_error_ridge_gradient(x_i, y_i, beta, alpha):
    """the gradient corresponding to the ith squared error term
    including the ridge penalty"""
    return vector_add(squared_error_gradient(x_i, y_i, beta),
                      ridge_penalty_gradient(beta, alpha))

def estimate_beta_ridge(x, y, alpha):
    """use gradient descent to fit a ridge regression
    with penalty alpha"""
    beta_initial = [random.random() for x_i in x[0]]
    return minimize_stochastic(partial(squared_error_ridge, alpha=alpha), 
                               partial(squared_error_ridge_gradient, 
                                       alpha=alpha), 
                               x, y, 
                               beta_initial, 
                               0.001)

def lasso_penalty(beta, alpha):
    return alpha * sum(abs(beta_i) for beta_i in beta[1:])    

if __name__ == "__main__":

    x = [[1,49,4,0],[1,41,9,0],[1,40,8,0],[1,25,6,0],[1,21,1,0],[1,21,0,0],[1,19,3,0],[1,19,0,0],[1,18,9,0],[1,18,8,0],[1,16,4,0],[1,15,3,0],[1,15,0,0],[1,15,2,0],[1,15,7,0],[1,14,0,0],[1,14,1,0],[1,13,1,0],[1,13,7,0],[1,13,4,0],[1,13,2,0],[1,12,5,0],[1,12,0,0],[1,11,9,0],[1,10,9,0],[1,10,1,0],[1,10,1,0],[1,10,7,0],[1,10,9,0],[1,10,1,0],[1,10,6,0],[1,10,6,0],[1,10,8,0],[1,10,10,0],[1,10,6,0],[1,10,0,0],[1,10,5,0],[1,10,3,0],[1,10,4,0],[1,9,9,0],[1,9,9,0],[1,9,0,0],[1,9,0,0],[1,9,6,0],[1,9,10,0],[1,9,8,0],[1,9,5,0],[1,9,2,0],[1,9,9,0],[1,9,10,0],[1,9,7,0],[1,9,2,0],[1,9,0,0],[1,9,4,0],[1,9,6,0],[1,9,4,0],[1,9,7,0],[1,8,3,0],[1,8,2,0],[1,8,4,0],[1,8,9,0],[1,8,2,0],[1,8,3,0],[1,8,5,0],[1,8,8,0],[1,8,0,0],[1,8,9,0],[1,8,10,0],[1,8,5,0],[1,8,5,0],[1,7,5,0],[1,7,5,0],[1,7,0,0],[1,7,2,0],[1,7,8,0],[1,7,10,0],[1,7,5,0],[1,7,3,0],[1,7,3,0],[1,7,6,0],[1,7,7,0],[1,7,7,0],[1,7,9,0],[1,7,3,0],[1,7,8,0],[1,6,4,0],[1,6,6,0],[1,6,4,0],[1,6,9,0],[1,6,0,0],[1,6,1,0],[1,6,4,0],[1,6,1,0],[1,6,0,0],[1,6,7,0],[1,6,0,0],[1,6,8,0],[1,6,4,0],[1,6,2,1],[1,6,1,1],[1,6,3,1],[1,6,6,1],[1,6,4,1],[1,6,4,1],[1,6,1,1],[1,6,3,1],[1,6,4,1],[1,5,1,1],[1,5,9,1],[1,5,4,1],[1,5,6,1],[1,5,4,1],[1,5,4,1],[1,5,10,1],[1,5,5,1],[1,5,2,1],[1,5,4,1],[1,5,4,1],[1,5,9,1],[1,5,3,1],[1,5,10,1],[1,5,2,1],[1,5,2,1],[1,5,9,1],[1,4,8,1],[1,4,6,1],[1,4,0,1],[1,4,10,1],[1,4,5,1],[1,4,10,1],[1,4,9,1],[1,4,1,1],[1,4,4,1],[1,4,4,1],[1,4,0,1],[1,4,3,1],[1,4,1,1],[1,4,3,1],[1,4,2,1],[1,4,4,1],[1,4,4,1],[1,4,8,1],[1,4,2,1],[1,4,4,1],[1,3,2,1],[1,3,6,1],[1,3,4,1],[1,3,7,1],[1,3,4,1],[1,3,1,1],[1,3,10,1],[1,3,3,1],[1,3,4,1],[1,3,7,1],[1,3,5,1],[1,3,6,1],[1,3,1,1],[1,3,6,1],[1,3,10,1],[1,3,2,1],[1,3,4,1],[1,3,2,1],[1,3,1,1],[1,3,5,1],[1,2,4,1],[1,2,2,1],[1,2,8,1],[1,2,3,1],[1,2,1,1],[1,2,9,1],[1,2,10,1],[1,2,9,1],[1,2,4,1],[1,2,5,1],[1,2,0,1],[1,2,9,1],[1,2,9,1],[1,2,0,1],[1,2,1,1],[1,2,1,1],[1,2,4,1],[1,1,0,1],[1,1,2,1],[1,1,2,1],[1,1,5,1],[1,1,3,1],[1,1,10,1],[1,1,6,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,4,1],[1,1,9,1],[1,1,9,1]
,[1,1,4,1],[1,1,2,1],[1,1,9,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,1,1],[1,1,1,1],[1,1,5,1]]
    daily_minutes_good = [68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]

    random.seed(0)
    beta = estimate_beta(x, daily_minutes_good) # [30.63, 0.972, -1.868, 0.911]
    print "beta", beta
    print "r-squared", multiple_r_squared(x, daily_minutes_good, beta)
    print

    print "digression: the bootstrap"
    # 101 points all very close to 100
    close_to_100 = [99.5 + random.random() for _ in range(101)]

    # 101 points, 50 of them near 0, 50 of them near 200
    far_from_100 = ([99.5 + random.random()] + 
                    [random.random() for _ in range(50)] +
                    [200 + random.random() for _ in range(50)])

    print "bootstrap_statistic(close_to_100, median, 100):"
    print bootstrap_statistic(close_to_100, median, 100)
    print "bootstrap_statistic(far_from_100, median, 100):"
    print bootstrap_statistic(far_from_100, median, 100)
    print

    random.seed(0) # so that you get the same results as me

    bootstrap_betas = bootstrap_statistic(zip(x, daily_minutes_good),
                                          estimate_sample_beta,
                                          100)

    bootstrap_standard_errors = [
        standard_deviation([beta[i] for beta in bootstrap_betas])
        for i in range(4)]

    print "bootstrap standard errors", bootstrap_standard_errors
    print

    print "p_value(30.63, 1.174)", p_value(30.63, 1.174)
    print "p_value(0.972, 0.079)", p_value(0.972, 0.079)
    print "p_value(-1.868, 0.131)", p_value(-1.868, 0.131)
    print "p_value(0.911, 0.990)", p_value(0.911, 0.990)
    print

    print "regularization"

    random.seed(0)
    for alpha in [0.0, 0.01, 0.1, 1, 10]:
        beta = estimate_beta_ridge(x, daily_minutes_good, alpha=alpha)
        print "alpha", alpha
        print "beta", beta
        print "dot(beta[1:],beta[1:])", dot(beta[1:], beta[1:])
        print "r-squared", multiple_r_squared(x, daily_minutes_good, beta)
        print


================================================
FILE: first-edition/code/naive_bayes.py
================================================
from __future__ import division
from collections import Counter, defaultdict
from machine_learning import split_data
import math, random, re, glob

def tokenize(message):
    message = message.lower()                       # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)   # extract the words
    return set(all_words)                           # remove duplicates


def count_words(training_set):
    """training set consists of pairs (message, is_spam)"""
    counts = defaultdict(lambda: [0, 0])
    for message, is_spam in training_set:
        for word in tokenize(message):
            counts[word][0 if is_spam else 1] += 1
    return counts

def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
    """turn the word_counts into a list of triplets 
    w, p(w | spam) and p(w | ~spam)"""
    return [(w,
             (spam + k) / (total_spams + 2 * k),
             (non_spam + k) / (total_non_spams + 2 * k))
             for w, (spam, non_spam) in counts.iteritems()]

def spam_probability(word_probs, message):
    message_words = tokenize(message)
    log_prob_if_spam = log_prob_if_not_spam = 0.0

    for word, prob_if_spam, prob_if_not_spam in word_probs:

        # for each word in the message, 
        # add the log probability of seeing it 
        if word in message_words:
            log_prob_if_spam += math.log(prob_if_spam)
            log_prob_if_not_spam += math.log(prob_if_not_spam)

        # for each word that's not in the message
        # add the log probability of _not_ seeing it
        else:
            log_prob_if_spam += math.log(1.0 - prob_if_spam)
            log_prob_if_not_spam += math.log(1.0 - prob_if_not_spam)
            
    prob_if_spam = math.exp(log_prob_if_spam)
    prob_if_not_spam = math.exp(log_prob_if_not_spam)
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)
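
# A hedged sketch, not part of the original module: when many words are
# scored, both math.exp calls at the end of spam_probability can underflow
# to 0.0, and the final division then fails. Working with the difference of
# the two log probabilities avoids that. The helper name and the cutoff of
# 700 are assumptions chosen for illustration.
import math  # already imported by this module; repeated so the sketch stands alone

def probability_from_logs(log_prob_if_spam, log_prob_if_not_spam):
    """exp(a) / (exp(a) + exp(b)), computed as 1 / (1 + exp(b - a))"""
    diff = log_prob_if_not_spam - log_prob_if_spam
    if diff > 700:      # math.exp(diff) would overflow; the ratio is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))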


class NaiveBayesClassifier:

    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []

    def train(self, training_set):
    
        # count spam and non-spam messages
        num_spams = len([is_spam 
                         for message, is_spam in training_set 
                         if is_spam])
        num_non_spams = len(training_set) - num_spams

        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts, 
                                             num_spams, 
                                             num_non_spams,
                                             self.k)
                                             
    def classify(self, message):
        return spam_probability(self.word_probs, message)


def get_subject_data(path):

    data = []

    # regex for stripping out the leading "Subject:" and any spaces after it
    subject_regex = re.compile(r"^Subject:\s+")

    # glob.glob returns every filename that matches the wildcarded path
    for fn in glob.glob(path):
        is_spam = "ham" not in fn
        
        with open(fn,'r') as file:
            for line in file:
                if line.startswith("Subject:"):
                    subject = subject_regex.sub("", line).strip()
                    data.append((subject, is_spam))

    return data

def p_spam_given_word(word_prob):
    word, prob_if_spam, prob_if_not_spam = word_prob
    return prob_if_spam / (prob_if_spam + prob_if_not_spam)

def train_and_test_model(path):

    data = get_subject_data(path)
    random.seed(0)      # just so you get the same answers as me
    train_data, test_data = split_data(data, 0.75)    

    classifier = NaiveBayesClassifier()
    classifier.train(train_data)

    classified = [(subject, is_spam, classifier.classify(subject))
              for subject, is_spam in test_data]

    counts = Counter((is_spam, spam_probability > 0.5) # (actual, predicted)
                     for _, is_spam, spam_probability in classified)

    print counts

    classified.sort(key=lambda row: row[2])
    spammiest_hams = filter(lambda row: not row[1], classified)[-5:]
    hammiest_spams = filter(lambda row: row[1], classified)[:5]

    print "spammiest_hams", spammiest_hams
    print "hammiest_spams", hammiest_spams

    words = sorted(classifier.word_probs, key=p_spam_given_word)

    spammiest_words = words[-5:]
    hammiest_words = words[:5]

    print "spammiest_words", spammiest_words
    print "hammiest_words", hammiest_words


if __name__ == "__main__":
    train_and_test_model(r"c:\spam\*\*")

================================================
FILE: first-edition/code/natural_language_processing.py
================================================
from __future__ import division
import math, random, re
from collections import defaultdict, Counter
from bs4 import BeautifulSoup
import requests

def plot_resumes(plt):
    data = [ ("big data", 100, 15), ("Hadoop", 95, 25), ("Python", 75, 50),
         ("R", 50, 40), ("machine learning", 80, 20), ("statistics", 20, 60),
         ("data science", 60, 70), ("analytics", 90, 3),
         ("team player", 85, 85), ("dynamic", 2, 90), ("synergies", 70, 0),
         ("actionable insights", 40, 30), ("think out of the box", 45, 10),
         ("self-starter", 30, 50), ("customer focus", 65, 15),
         ("thought leadership", 35, 35)]

    def text_size(total):
        """equals 8 if total is 0, 28 if total is 200"""
        return 8 + total / 200 * 20

    for word, job_popularity, resume_popularity in data:
        plt.text(job_popularity, resume_popularity, word,
                 ha='center', va='center',
                 size=text_size(job_popularity + resume_popularity))
    plt.xlabel("Popularity on Job Postings")
    plt.ylabel("Popularity on Resumes")
    plt.axis([0, 100, 0, 100])
    plt.show()

#
# n-gram models
#

def fix_unicode(text):
    return text.replace(u"\u2019", "'")

def get_document():

    url = "http://radar.oreilly.com/2010/06/what-is-data-science.html"
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html5lib')

    content = soup.find("div", "article-body")        # find article-body div
    regex = r"[\w']+|[\.]"                            # matches a word or a period

    document = []


    for paragraph in content("p"):
        words = re.findall(regex, fix_unicode(paragraph.text))
        document.extend(words)

    return document

def generate_using_bigrams(transitions):
    current = "."   # this means the next word will start a sentence
    result = []
    while True:
        next_word_candidates = transitions[current]    # bigrams (current, _)
        current = random.choice(next_word_candidates)  # choose one at random
        result.append(current)                         # append it to results
        if current == ".": return " ".join(result)     # if "." we're done

def generate_using_trigrams(starts, trigram_transitions):
    current = random.choice(starts)   # choose a random starting word
    prev = "."                        # and precede it with a '.'
    result = [current]
    while True:
        next_word_candidates = trigram_transitions[(prev, current)]
        next_word = random.choice(next_word_candidates)

        prev, current = current, next_word
        result.append(current)

        if current == ".":
            return " ".join(result)

def is_terminal(token):
    return token[0] != "_"

def expand(grammar, tokens):
    for i, token in enumerate(tokens):

        # ignore terminals
        if is_terminal(token): continue

        # choose a replacement at random
        replacement = random.choice(grammar[token])

        if is_terminal(replacement):
            tokens[i] = replacement
        else:
            tokens = tokens[:i] + replacement.split() + tokens[(i+1):]
        return expand(grammar, tokens)

    # if we get here we had all terminals and are done
    return tokens

def generate_sentence(grammar):
    return expand(grammar, ["_S"])

#
# Gibbs Sampling
#

def roll_a_die():
    return random.choice([1,2,3,4,5,6])

def direct_sample():
    d1 = roll_a_die()
    d2 = roll_a_die()
    return d1, d1 + d2

def random_y_given_x(x):
    """equally likely to be x + 1, x + 2, ... , x + 6"""
    return x + roll_a_die()

def random_x_given_y(y):
    if y <= 7:
        # if the total is 7 or less, the first die is equally likely to be
        # 1, 2, ..., (total - 1)
        return random.randrange(1, y)
    else:
        # if the total is more than 7, the first die is equally likely to be
        # (total - 6), (total - 5), ..., 6
        return random.randrange(y - 6, 7)
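
# A sketch, not part of the original listing: the two branches of
# random_x_given_y can be checked by brute force. The helper name
# exact_x_values_given_y is hypothetical. It enumerates all 36 equally
# likely rolls and keeps the first-die values consistent with a total of y.
def exact_x_values_given_y(y):
    """sorted first-die values d1 for which some second die gives d1 + d2 == y"""
    return sorted(d1
                  for d1 in range(1, 7)
                  for d2 in range(1, 7)
                  if d1 + d2 == y)

# For example, exact_x_values_given_y(5) is [1, 2, 3, 4], the same values
# that random.randrange(1, 5) draws from, and exact_x_values_given_y(9) is
# [3, 4, 5, 6], the same values that random.randrange(9 - 6, 7) draws from.
# Each value appears exactly once, so x is uniform over that range.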

def gibbs_sample(num_iters=100):
    x, y = 1, 2 # doesn't really matter
    for _ in range(num_iters):
        x = random_x_given_y(y)
        y = random_y_given_x(x)
    return x, y

def compare_distributions(num_samples=1000):
    counts = defaultdict(lambda: [0, 0])
    for _ in range(num_samples):
        counts[gibbs_sample()][0] += 1
        counts[direct_sample()][1] += 1
    return counts

#
# TOPIC MODELING
#

def sample_from(weights):
    total = sum(weights)
    rnd = total * random.random()       # uniform between 0 and total
    for i, w in enumerate(weights):
        rnd -= w                        # return the smallest i such that
        if rnd <= 0: return i           # sum(weights[:(i+1)]) >= rnd

documents = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

K = 4

document_topic_counts = [Counter()
                         for _ in documents]

topic_word_counts = [Counter() for _ in range(K)]

topic_counts = [0 for _ in range(K)]

document_lengths = map(len, documents)

distinct_words = set(word for document in documents for word in document)
W = len(distinct_words)

D = len(documents)

def p_topic_given_document(topic, d, alpha=0.1):
    """the fraction of words in document _d_
    that are assigned to _topic_ (plus some smoothing)"""

    return ((document_topic_counts[d][topic] + alpha) /
            (document_lengths[d] + K * alpha))

def p_word_given_topic(word, topic, beta=0.1):
    """the fraction of words assigned to _topic_
    that equal _word_ (plus some smoothing)"""

    return ((topic_word_counts[topic][word] + beta) /
            (topic_counts[topic] + W * beta))

def topic_weight(d, word, k):
    """given a document and a word in that document,
    return the weight for the k-th topic"""

    return p_word_given_topic(word, k) * p_topic_given_document(k, d)

def choose_new_topic(d, word):
    return sample_from([topic_weight(d, word, k)
                        for k in range(K)])


random.seed(0)
document_topics = [[random.randrange(K) for word in document]
                   for document in documents]

for d in range(D):
    for word, topic in zip(documents[d], document_topics[d]):
        document_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

for _ in range(1000):
    for d in range(D):
        for i, (word, topic) in enumerate(zip(documents[d],
                                              document_topics[d])):

            # remove this word / topic from the counts
            # so that it doesn't influence the weights
            document_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            document_lengths[d] -= 1

            # choose a new topic based on the weights
            new_topic = choose_new_topic(d, word)
            document_topics[d][i] = new_topic

            # and now add it back to the counts
            document_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1
            document_lengths[d] += 1

if __name__ == "__main__":

    document = get_document()

    bigrams = zip(document, document[1:])
    transitions = defaultdict(list)
    for prev, current in bigrams:
        transitions[prev].append(current)

    random.seed(0)
    print "bigram sentences"
    for i in range(10):
        print i, generate_using_bigrams(transitions)
    print

    # trigrams

    trigrams = zip(document, document[1:], document[2:])
    trigram_transitions = defaultdict(list)
    starts = []

    for prev, current, next in trigrams:

        if prev == ".":              # if the previous "word" was a period
            starts.append(current)   # then this is a start word

        trigram_transitions[(prev, current)].append(next)

    print "trigram sentences"
    for i in range(10):
        print i, generate_using_trigrams(starts, trigram_transitions)
    print

    grammar = {
        "_S"  : ["_NP _VP"],
        "_NP" : ["_N",
                 "_A _NP _P _A _N"],
        "_VP" : ["_V",
                 "_V _NP"],
        "_N"  : ["data science", "Python", "regression"],
        "_A"  : ["big", "linear", "logistic"],
        "_P"  : ["about", "near"],
        "_V"  : ["learns", "trains", "tests", "is"]
    }

    print "grammar sentences"
    for i in range(10):
        print i, " ".join(generate_sentence(grammar))
    print

    print "gibbs sampling"
    comparison = compare_distributions()
    for roll, (gibbs, direct) in comparison.iteritems():
        print roll, gibbs, direct


    # topic modeling

    for k, word_counts in enumerate(topic_word_counts):
        for word, count in word_counts.most_common():
            if count > 0: print k, word, count

    topic_names = ["Big Data and programming languages",
                   "Python and statistics",
                   "databases",
                   "machine learning"]

    for document, topic_counts in zip(documents, document_topic_counts):
        print document
        for topic, count in topic_counts.most_common():
            if count > 0:
                print topic_names[topic], count,
        print


================================================
FILE: first-edition/code/nearest_neighbors.py
================================================
from __future__ import division
from collections import Counter
from linear_algebra import distance
from statistics import mean
import math, random
import matplotlib.pyplot as plt

def raw_majority_vote(labels):
    votes = Counter(labels)
    winner, _ = votes.most_common(1)[0]
    return winner

def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count 
                       for count in vote_counts.values()
                       if count == winner_count])

    if num_winners == 1:
        return winner                     # unique winner, so return it
    else:
        return majority_vote(labels[:-1]) # try again without the farthest


def knn_classify(k, labeled_points, new_point):
    """each labeled point should be a pair (point, label)"""
    
    # order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points,
                         key=lambda (point, _): distance(point, new_point))

    # find the labels for the k closest
    k_nearest_labels = [label for _, label in by_distance[:k]]

    # and let them vote
    return majority_vote(k_nearest_labels)


cities = [(-86.75,33.5666666666667,'Python'),(-88.25,30.6833333333333,'Python'),(-112.016666666667,33.4333333333333,'Java'),(-110.933333333333,32.1166666666667,'Java'),(-92.2333333333333,34.7333333333333,'R'),(-121.95,37.7,'R'),(-118.15,33.8166666666667,'Python'),(-118.233333333333,34.05,'Java'),(-122.316666666667,37.8166666666667,'R'),(-117.6,34.05,'Python'),(-116.533333333333,33.8166666666667,'Python'),(-121.5,38.5166666666667,'R'),(-117.166666666667,32.7333333333333,'R'),(-122.383333333333,37.6166666666667,'R'),(-121.933333333333,37.3666666666667,'R'),(-122.016666666667,36.9833333333333,'Python'),(-104.716666666667,38.8166666666667,'Python'),(-104.866666666667,39.75,'Python'),(-72.65,41.7333333333333,'R'),(-75.6,39.6666666666667,'Python'),(-77.0333333333333,38.85,'Python'),(-80.2666666666667,25.8,'Java'),(-81.3833333333333,28.55,'Java'),(-82.5333333333333,27.9666666666667,'Java'),(-84.4333333333333,33.65,'Python'),(-116.216666666667,43.5666666666667,'Python'),(-87.75,41.7833333333333,'Java'),(-86.2833333333333,39.7333333333333,'Java'),(-93.65,41.5333333333333,'Java'),(-97.4166666666667,37.65,'Java'),(-85.7333333333333,38.1833333333333,'Python'),(-90.25,29.9833333333333,'Java'),(-70.3166666666667,43.65,'R'),(-76.6666666666667,39.1833333333333,'R'),(-71.0333333333333,42.3666666666667,'R'),(-72.5333333333333,42.2,'R'),(-83.0166666666667,42.4166666666667,'Python'),(-84.6,42.7833333333333,'Python'),(-93.2166666666667,44.8833333333333,'Python'),(-90.0833333333333,32.3166666666667,'Java'),(-94.5833333333333,39.1166666666667,'Java'),(-90.3833333333333,38.75,'Python'),(-108.533333333333,45.8,'Python'),(-95.9,41.3,'Python'),(-115.166666666667,36.0833333333333,'Java'),(-71.4333333333333,42.9333333333333,'R'),(-74.1666666666667,40.7,'R'),(-106.616666666667,35.05,'Python'),(-78.7333333333333,42.9333333333333,'R'),(-73.9666666666667,40.7833333333333,'R'),(-80.9333333333333,35.2166666666667,'Python'),(-78.7833333333333,35.8666666666667,'Python'),(-100.75,46.7666666666667,'Java'
),(-84.5166666666667,39.15,'Java'),(-81.85,41.4,'Java'),(-82.8833333333333,40,'Java'),(-97.6,35.4,'Python'),(-122.666666666667,45.5333333333333,'Python'),(-75.25,39.8833333333333,'Python'),(-80.2166666666667,40.5,'Python'),(-71.4333333333333,41.7333333333333,'R'),(-81.1166666666667,33.95,'R'),(-96.7333333333333,43.5666666666667,'Python'),(-90,35.05,'R'),(-86.6833333333333,36.1166666666667,'R'),(-97.7,30.3,'Python'),(-96.85,32.85,'Java'),(-95.35,29.9666666666667,'Java'),(-98.4666666666667,29.5333333333333,'Java'),(-111.966666666667,40.7666666666667,'Python'),(-73.15,44.4666666666667,'R'),(-77.3333333333333,37.5,'Python'),(-122.3,47.5333333333333,'Python'),(-89.3333333333333,43.1333333333333,'R'),(-104.816666666667,41.15,'Java')]
cities = [([longitude, latitude], language) for longitude, latitude, language in cities]

def plot_state_borders(plt, color='0.8'):
    pass

def plot_cities():

    # key is language, value is pair (longitudes, latitudes)
    plots = { "Java" : ([], []), "Python" : ([], []), "R" : ([], []) }

    # we want each language to have a different marker and color
    markers = { "Java" : "o", "Python" : "s", "R" : "^" }
    colors  = { "Java" : "r", "Python" : "b", "R" : "g" }

    for (longitude, latitude), language in cities:
        plots[language][0].append(longitude)
        plots[language][1].append(latitude)

    # create a scatter series for each language
    for language, (x, y) in plots.iteritems():
        plt.scatter(x, y, color=colors[language], marker=markers[language],
                          label=language, zorder=10)

    plot_state_borders(plt)    # assume we have a function that does this

    plt.legend(loc=0)          # let matplotlib choose the location
    plt.axis([-130,-60,20,55]) # set the axes
    plt.title("Favorite Programming Languages")
    plt.show()

def classify_and_plot_grid(k=1):
    plots = { "Java" : ([], []), "Python" : ([], []), "R" : ([], []) }
    markers = { "Java" : "o", "Python" : "s", "R" : "^" }
    colors  = { "Java" : "r", "Python" : "b", "R" : "g" }

    for longitude in range(-130, -60):
        for latitude in range(20, 55):
            predicted_language = knn_classify(k, cities, [longitude, latitude])
            plots[predicted_language][0].append(longitude)
            plots[predicted_language][1].append(latitude)

    # create a scatter series for each language
    for language, (x, y) in plots.iteritems():
        plt.scatter(x, y, color=colors[language], marker=markers[language],
                          label=language, zorder=0)

    plot_state_borders(plt, color='black')    # assume we have a function that does this

    plt.legend(loc=0)          # let matplotlib choose the location
    plt.axis([-130,-60,20,55]) # set the axes
    plt.title(str(k) + "-Nearest Neighbor Programming Languages")
    plt.show()

#
# the curse of dimensionality
#

def random_point(dim):
    return [random.random() for _ in range(dim)]

def random_distances(dim, num_pairs):
    return [distance(random_point(dim), random_point(dim))
            for _ in range(num_pairs)]
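
# A sketch, not part of the original listing: one way to see why distances
# grow with dimension. For two independent uniform(0, 1) coordinates u and v,
# E[(u - v) ** 2] = Var(u) + Var(v) = 1/12 + 1/12 = 1/6, so the expected
# *squared* distance between two random points is dim / 6. The helper name
# mean_squared_distance_sketch and its seed parameter are assumptions made
# for illustration; it uses a private generator so the global random state
# is left alone.
import random

def mean_squared_distance_sketch(dim, num_pairs, seed=0):
    """average squared distance over num_pairs random pairs in the unit cube"""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        total += sum((rng.random() - rng.random()) ** 2
                     for _ in range(dim))
    return total / num_pairs

# With enough pairs, mean_squared_distance_sketch(dim, num_pairs) should land
# close to dim / 6.0, which grows linearly in dim.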


if __name__ == "__main__":

    # try several different values for k
    for k in [1, 3, 5, 7]:
        num_correct = 0

        for location, actual_language in cities:

            other_cities = [other_city 
                            for other_city in cities
                            if other_city != (location, actual_language)]

            predicted_language = knn_classify(k, other_cities, location)

            if predicted_language == actual_language: 
                num_correct += 1

        print k, "neighbor[s]:", num_correct, "correct out of", len(cities)

    dimensions = range(1, 101, 5)

    avg_distances = []
    min_distances = []

    random.seed(0)
    for dim in dimensions:
        distances = random_distances(dim, 10000)  # 10,000 random pairs
        avg_distances.append(mean(distances))     # track the average
        min_distances.append(min(distances))      # track the minimum
        print dim, min(distances), mean(distances), min(distances) / mean(distances)

================================================
FILE: first-edition/code/network_analysis.py
================================================
from __future__ import division
import math, random, re
from collections import defaultdict, Counter, deque
from linear_algebra import dot, get_row, get_column, make_matrix, magnitude, scalar_multiply, shape, distance
from functools import partial

users = [
    { "id": 0, "name": "Hero" },
    { "id": 1, "name": "Dunn" },
    { "id": 2, "name": "Sue" },
    { "id": 3, "name": "Chi" },
    { "id": 4, "name": "Thor" },
    { "id": 5, "name": "Clive" },
    { "id": 6, "name": "Hicks" },
    { "id": 7, "name": "Devin" },
    { "id": 8, "name": "Kate" },
    { "id": 9, "name": "Klein" }
]

friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]

# give each user a friends list
for user in users:
    user["friends"] = []
    
# and populate it
for i, j in friendships:
    # this works because users[i] is the user whose id is i
    users[i]["friends"].append(users[j]) # add j as a friend of i
    users[j]["friends"].append(users[i]) # add i as a friend of j

# 
# Betweenness Centrality
#

def shortest_paths_from(from_user):
    
    # a dictionary from "user_id" to *all* shortest paths to that user
    shortest_paths_to = { from_user["id"] : [[]] }

    # a queue of (previous user, next user) that we need to check.
    # starts out with all pairs (from_user, friend_of_from_user)
    frontier = deque((from_user, friend)
                     for friend in from_user["friends"])

    # keep going until we empty the queue
    while frontier: 

        prev_user, user = frontier.popleft() # take from the beginning
        user_id = user["id"]

        # the fact that we're pulling from our queue means that
        # necessarily we already know a shortest path to prev_user
        paths_to_prev = shortest_paths_to[prev_user["id"]]
        paths_via_prev = [path + [user_id] for path in paths_to_prev]
        
        # it's possible we already know a shortest path to here as well
        old_paths_to_here = shortest_paths_to.get(user_id, [])
        
        # what's the shortest path to here that we've seen so far?
        if old_paths_to_here:
            min_path_length = len(old_paths_to_here[0])
        else:
            min_path_length = float('inf')
                
        # any new paths to here that aren't too long
        new_paths_to_here = [path_via_prev
                             for path_via_prev in paths_via_prev
                             if len(path_via_prev) <= min_path_length
                             and path_via_prev not in old_paths_to_here]
        
        shortest_paths_to[user_id] = old_paths_to_here + new_paths_to_here
        
        # add new neighbors to the frontier
        frontier.extend((user, friend)
                        for friend in user["friends"]
                        if friend["id"] not in shortest_paths_to)

    return shortest_paths_to

for user in users:
    user["shortest_paths"] = shortest_paths_from(user)

for user in users:
    user["betweenness_centrality"] = 0.0

for source in users:
    source_id = source["id"]
    for target_id, paths in source["shortest_paths"].iteritems():
        if source_id < target_id:   # don't double count
            num_paths = len(paths)  # how many shortest paths?
            contrib = 1 / num_paths # contribution to centrality
            for path in paths:
                for id in path:
                    if id not in [source_id, target_id]:
                        users[id]["betweenness_centrality"] += contrib

#
# closeness centrality
#

def farness(user):
    """the sum of the lengths of the shortest paths to each other user"""
    return sum(len(paths[0]) 
               for paths in user["shortest_paths"].values())

for user in users:
    user["closeness_centrality"] = 1 / farness(user)


#
# matrix multiplication
#

def matrix_product_entry(A, B, i, j):
    return dot(get_row(A, i), get_column(B, j))

def matrix_multiply(A, B):
    n1, k1 = shape(A)
    n2, k2 = shape(B)
    if k1 != n2:
        raise ArithmeticError("incompatible shapes!")
                
    return make_matrix(n1, k2, partial(matrix_product_entry, A, B))

def vector_as_matrix(v):
    """returns the vector v (represented as a list) as an n x 1 matrix"""
    return [[v_i] for v_i in v]
    
def vector_from_matrix(v_as_matrix):
    """returns the n x 1 matrix as a list of values"""
    return [row[0] for row in v_as_matrix]

def matrix_operate(A, v):
    v_as_matrix = vector_as_matrix(v)
    product = matrix_multiply(A, v_as_matrix)
    return vector_from_matrix(product)

def find_eigenvector(A, tolerance=0.00001):
    guess = [1 for __ in A]

    while True:
        result = matrix_operate(A, guess)
        length = magnitude(result)
        next_guess = scalar_multiply(1/length, result)
        
        if distance(guess, next_guess) < tolerance:
            return next_guess, length # eigenvector, eigenvalue
        
        guess = next_guess
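
# A sketch, not part of the original listing: the same power iteration as
# find_eigenvector, written with plain lists so it can be tried without the
# linear_algebra module. The name power_iteration_sketch and the max_iters
# cap are assumptions added for illustration. The cap matters because
# find_eigenvector loops forever when the normalized guess keeps flipping
# sign (for example when the dominant eigenvalue is negative); here that
# case returns None instead.
import math

def power_iteration_sketch(A, tolerance=0.00001, max_iters=1000):
    guess = [1.0 for _ in A]
    for _ in range(max_iters):
        # result = A * guess
        result = [sum(a_ij * g_j for a_ij, g_j in zip(row, guess))
                  for row in A]
        length = math.sqrt(sum(r * r for r in result))
        next_guess = [r / length for r in result]

        gap = math.sqrt(sum((g - n) ** 2
                            for g, n in zip(guess, next_guess)))
        if gap < tolerance:
            return next_guess, length     # eigenvector, eigenvalue

        guess = next_guess

    return None                           # no convergence within max_iters

# For A = [[2, 1], [1, 2]] this settles on an eigenvector proportional to
# [1, 1] with eigenvalue close to 3. For A = [[-1, 0], [0, -1]] the guess
# flips sign on every step and the function returns None.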

#
# eigenvector centrality
#

def entry_fn(i, j):
    return 1 if (i, j) in friendships or (j, i) in friendships else 0

n = len(users)
adjacency_matrix = make_matrix(n, n, entry_fn)

eigenvector_centralities, _ = find_eigenvector(adjacency_matrix)

#
# directed graphs
#

endorsements = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1), (1, 3),
                (2, 3), (3, 4), (5, 4), (5, 6), (7, 5), (6, 8), (8, 7), (8, 9)]

for user in users:
    user["endorses"] = []       # add one list to track outgoing endorsements
    user["endorsed_by"] = []    # and another to track incoming endorsements
    
for source_id, target_id in endorsements:
    users[source_id]["endorses"].append(users[target_id])
    users[target_id]["endorsed_by"].append(users[source_id])


endorsements_by_id = [(user["id"], len(user["endorsed_by"]))
                      for user in users]

# order from most to fewest endorsements
endorsements_by_id.sort(key=lambda (user_id, num_endorsements): num_endorsements,
                        reverse=True)

def page_rank(users, damping = 0.85, num_iters = 100):
    
    # initially distribute PageRank evenly
    num_users = len(users)
    pr = { user["id"] : 1 / num_users for user in users }

    # this is the small fraction of PageRank
    # that each node gets each iteration
    base_pr = (1 - damping) / num_users
    
    for __ in range(num_iters):
        next_pr = { user["id"] : base_pr for user in users }
        for user in users:
            # distribute PageRank to outgoing links
            links_pr = pr[user["id"]] * damping
            for endorsee in user["endorses"]:
                next_pr[endorsee["id"]] += links_pr / len(user["endorses"])

        pr = next_pr
        
    return pr

if __name__ == "__main__":

    print "Betweenness Centrality"
    for user in users:
        print user["id"], user["betweenness_centrality"]
    print

    print "Closeness Centrality"
    for user in users:
        print user["id"], user["closeness_centrality"]
    print

    print "Eigenvector Centrality"
    for user_id, centrality in enumerate(eigenvector_centralities):
        print user_id, centrality
    print

    print "PageRank"
    for user_id, pr in page_rank(users).iteritems():
        print user_id, pr


================================================
FILE: first-edition/code/neural_networks.py
================================================
from __future__ import division
from collections import Counter
from functools import partial
from linear_algebra import dot
import math, random
import matplotlib
import matplotlib.pyplot as plt

def step_function(x):
    return 1 if x >= 0 else 0

def perceptron_output(weights, bias, x):
    """returns 1 if the perceptron 'fires', 0 if not"""
    return step_function(dot(weights, x) + bias)

def sigmoid(t):
    return 1 / (1 + math.exp(-t))
    
def neuron_output(weights, inputs):
    return sigmoid(dot(weights, inputs))

def feed_forward(neural_network, input_vector):
    """takes in a neural network (represented as a list of lists of lists of weights)
    and returns the output from forward-propagating the input"""

    outputs = []

    for layer in neural_network:

        input_with_bias = input_vector + [1]             # add a bias input
        output = [neuron_output(neuron, input_with_bias) # compute the output
                  for neuron in layer]                   # for this layer
        outputs.append(output)                           # and remember it

        # the input to the next layer is the output of this one
        input_vector = output

    return outputs

def backpropagate(network, input_vector, target):

    hidden_outputs, outputs = feed_forward(network, input_vector)
    
    # the output * (1 - output) is from the derivative of sigmoid
    output_deltas = [output * (1 - output) * (output - target[i])
                     for i, output in enumerate(outputs)]
                     
    # adjust weights for output layer (network[-1])
    for i, output_neuron in enumerate(network[-1]):
        for j, hidden_output in enumerate(hidden_outputs + [1]):
            output_neuron[j] -= output_deltas[i] * hidden_output

    # back-propagate errors to hidden layer
    hidden_deltas = [hidden_output * (1 - hidden_output) * 
                      dot(output_deltas, [n[i] for n in network[-1]]) 
                     for i, hidden_output in enumerate(hidden_outputs)]

    # adjust weights for hidden layer (network[0])
    for i, hidden_neuron in enumerate(network[0]):
        for j, input in enumerate(input_vector + [1]):
            hidden_neuron[j] -= hidden_deltas[i] * input
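
# A sketch, not part of the original listing: backpropagate relies on the
# identity sigmoid'(t) = sigmoid(t) * (1 - sigmoid(t)). The helper below
# (sigmoid_derivative_gap is a hypothetical name) compares that closed form
# against a central finite difference, which is a quick way to convince
# yourself that the output * (1 - output) factor is right.
import math

def sigmoid_derivative_gap(t, h=1e-5):
    """absolute difference between the closed-form and numeric derivative"""
    def s(u):
        return 1.0 / (1.0 + math.exp(-u))

    closed_form = s(t) * (1 - s(t))
    numeric = (s(t + h) - s(t - h)) / (2 * h)
    return abs(closed_form - numeric)

# The gap should be tiny for moderate inputs such as t = -2, 0, 0.5 or 3.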

def patch(x, y, hatch, color):
    """return a matplotlib 'patch' object with the specified
    location, crosshatch pattern, and color"""
    return matplotlib.patches.Rectangle((x - 0.5, y - 0.5), 1, 1,
                                        hatch=hatch, fill=False, color=color)


def show_weights(neuron_idx):
    weights = network[0][neuron_idx]
    abs_weights = map(abs, weights)

    grid = [abs_weights[row:(row+5)] # turn the weights into a 5x5 grid
            for row in range(0,25,5)] # [weights[0:5], ..., weights[20:25]]

    ax = plt.gca() # to use hatching, we'll need the axis

    ax.imshow(grid, # here same as plt.imshow
              cmap=matplotlib.cm.binary, # use white-black color scale
              interpolation='none') # plot blocks as blocks

    # cross-hatch the negative weights
    for i in range(5): # row
        for j in range(5): # column
            if weights[5*i + j] < 0: # row i, column j = weights[5*i + j]
                # add black and white hatches, so visible whether dark or light
                ax.add_patch(patch(j, i, '/', "white"))
                ax.add_patch(patch(j, i, '\\', "black"))
    plt.show()

if __name__ == "__main__":

    raw_digits = [
          """11111
             1...1
             1...1
             1...1
             11111""",
             
          """..1..
             ..1..
             ..1..
             ..1..
             ..1..""",
             
          """11111
             ....1
             11111
             1....
             11111""",
             
          """11111
             ....1
             11111
             ....1
             11111""",     
             
          """1...1
             1...1
             11111
             ....1
             ....1""",             
             
          """11111
             1....
             11111
             ....1
             11111""",   
             
          """11111
             1....
             11111
             1...1
             11111""",             

          """11111
             ....1
             ....1
             ....1
             ....1""",
             
          """11111
             1...1
             11111
             1...1
             11111""",    
             
          """11111
             1...1
             11111
             ....1
             11111"""]     

    def make_digit(raw_digit):
        return [1 if c == '1' else 0
                for row in raw_digit.split("\n")
                for c in row.strip()]
                
    inputs = map(make_digit, raw_digits)

    targets = [[1 if i == j else 0 for i in range(10)]
               for j in range(10)]

    random.seed(0)   # to get repeatable results
    input_size = 25  # each input is a vector of length 25
    num_hidden = 5   # we'll have 5 neurons in the hidden layer
    output_size = 10 # we need 10 outputs for each input

    # each hidden neuron has one weight per input, plus a bias weight
    hidden_layer = [[random.random() for __ in range(input_size + 1)]
                    for __ in range(num_hidden)]

    # each output neuron has one weight per hidden neuron, plus a bias weight
    output_layer = [[random.random() for __ in range(num_hidden + 1)]
                    for __ in range(output_size)]

    # the network starts out with random weights
    network = [hidden_layer, output_layer]

    # 10,000 iterations seems enough to converge
    for __ in range(10000):
        for input_vector, target_vector in zip(inputs, targets):
            backpropagate(network, input_vector, target_vector)

    def predict(input):
        return feed_forward(network, input)[-1]

    for i, input in enumerate(inputs):
        outputs = predict(input)
        print i, [round(p,2) for p in outputs]

    print """.@@@.
...@@
..@@.
...@@
.@@@."""
    print [round(x, 2) for x in
          predict(  [0,1,1,1,0,  # .@@@.
                     0,0,0,1,1,  # ...@@
                     0,0,1,1,0,  # ..@@.
                     0,0,0,1,1,  # ...@@
                     0,1,1,1,0]) # .@@@.
          ]
    print

    print """.@@@.
@..@@
.@@@.
@..@@
.@@@."""
    print [round(x, 2) for x in 
          predict(  [0,1,1,1,0,  # .@@@.
                     1,0,0,1,1,  # @..@@
                     0,1,1,1,0,  # .@@@.
                     1,0,0,1,1,  # @..@@
                     0,1,1,1,0]) # .@@@.
          ]
    print

    

================================================
FILE: first-edition/code/plot_state_borders.py
================================================
import re

segments = []
points = []

lat_long_regex = r"<point lat=\"(.*)\" lng=\"(.*)\""

with open("states.txt", "r") as f:
    lines = [line for line in f]

for line in lines:
    if line.startswith("</state>"):
        for p1, p2 in zip(points, points[1:]):
            segments.append((p1, p2))
        points = []
    s = re.search(lat_long_regex, line)
    if s:
        lat, lon = s.groups()
        points.append((float(lon), float(lat)))

def plot_state_borders(plt, color='0.8'):
    for (lon1, lat1), (lon2, lat2) in segments:
        plt.plot([lon1, lon2], [lat1, lat2], color=color)

================================================
FILE: first-edition/code/probability.py
================================================
from __future__ import division
from collections import Counter
import math, random

def random_kid():
    return random.choice(["boy", "girl"])

def uniform_pdf(x):
    return 1 if x >= 0 and x < 1 else 0

def uniform_cdf(x):
    "returns the probability that a uniform random variable is less than x"
    if x < 0:   return 0    # uniform random is never less than 0
    elif x < 1: return x    # e.g. P(X < 0.4) = 0.4
    else:       return 1    # uniform random is always less than 1

def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return (math.exp(-(x-mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma))

def plot_normal_pdfs(plt):
    xs = [x / 10.0 for x in range(-50, 50)]
    plt.plot(xs,[normal_pdf(x,sigma=1) for x in xs],'-',label='mu=0,sigma=1')
    plt.plot(xs,[normal_pdf(x,sigma=2) for x in xs],'--',label='mu=0,sigma=2')
    plt.plot(xs,[normal_pdf(x,sigma=0.5) for x in xs],':',label='mu=0,sigma=0.5')
    plt.plot(xs,[normal_pdf(x,mu=-1)   for x in xs],'-.',label='mu=-1,sigma=1')
    plt.legend()
    plt.show()      

def normal_cdf(x, mu=0,sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2  

def plot_normal_cdfs(plt):
    xs = [x / 10.0 for x in range(-50, 50)]
    plt.plot(xs,[normal_cdf(x,sigma=1) for x in xs],'-',label='mu=0,sigma=1')
    plt.plot(xs,[normal_cdf(x,sigma=2) for x in xs],'--',label='mu=0,sigma=2')
    plt.plot(xs,[normal_cdf(x,sigma=0.5) for x in xs],':',label='mu=0,sigma=0.5')
    plt.plot(xs,[normal_cdf(x,mu=-1) for x in xs],'-.',label='mu=-1,sigma=1')
    plt.legend(loc=4) # bottom right
    plt.show()

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find approximate inverse using binary search"""

    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    
    low_z, low_p = -10.0, 0            # normal_cdf(-10) is (very close to) 0
    hi_z,  hi_p  =  10.0, 1            # normal_cdf(10)  is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2     # consider the midpoint
        mid_p = normal_cdf(mid_z)      # and the cdf's value there
        if mid_p < p:
            # midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p
        else:
            break

    return mid_z

def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(p, n):
    return sum(bernoulli_trial(p) for _ in range(n))

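def make_hist(p, n, num_points):
    """plot num_points Binomial(n, p) samples against the normal approximation"""
    # this module never imports matplotlib at the top level, so import it here
    import matplotlib.pyplot as plt
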
    data = [binomial(p, n) for _ in range(num_points)]
    
    # use a bar chart to show the actual binomial samples
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8,
            color='0.75')
    
    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    # use a line chart to show the normal approximation
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma) 
          for i in xs]
    plt.plot(xs,ys)
    plt.show()



if __name__ == "__main__":

    #
    # CONDITIONAL PROBABILITY
    #

    both_girls = 0
    older_girl = 0
    either_girl = 0

    random.seed(0)
    for _ in range(10000):
        younger = random_kid()
        older = random_kid()
        if older == "girl":
            older_girl += 1
        if older == "girl" and younger == "girl":
            both_girls += 1
        if older == "girl" or younger == "girl":
            either_girl += 1

    print "P(both | older):", both_girls / older_girl      # 0.514 ~ 1/2
    print "P(both | either):", both_girls / either_girl    # 0.342 ~ 1/3
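
# Illustrative cross-check (not part of the original script): the two simulated
# ratios above can also be computed exactly by enumerating the four equally
# likely (older, younger) families. The function name is hypothetical.
from fractions import Fraction

def exact_two_child_probabilities():
    """return (P(both girls | older is a girl), P(both girls | at least one girl))"""
    families = [(older, younger)
                for older in ["boy", "girl"]
                for younger in ["boy", "girl"]]
    both = [f for f in families if f == ("girl", "girl")]
    older_girl = [f for f in families if f[0] == "girl"]
    either_girl = [f for f in families if "girl" in f]
    return (Fraction(len(both), len(older_girl)),
            Fraction(len(both), len(either_girl)))

# exact_two_child_probabilities() returns (Fraction(1, 2), Fraction(1, 3))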

================================================
FILE: first-edition/code/recommender_systems.py
================================================
from __future__ import division
import math, random
from collections import defaultdict, Counter
from linear_algebra import dot

users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests).most_common()

def most_popular_new_interests(user_interests, max_results=5):
    suggestions = [(interest, frequency) 
                   for interest, frequency in popular_interests
                   if interest not in user_interests]
    return suggestions[:max_results]

#
# user-based filtering
#

def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))
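
# Illustrative sketch (not part of the original module): cosine similarity on
# small 0/1 interest vectors. `_dot_example` is a hypothetical stand-in for
# linear_algebra.dot so that the snippet runs on its own.
import math

def _dot_example(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity_example(v, w):
    return _dot_example(v, w) / math.sqrt(_dot_example(v, v) * _dot_example(w, w))

# identical interests -> cosine_similarity_example([1, 1, 0], [1, 1, 0]) is 1.0
# no shared interests -> cosine_similarity_example([1, 0, 0], [0, 1, 1]) is 0.0
# an all-zero vector (a user with no interests) raises ZeroDivisionError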

unique_interests = sorted(list({ interest 
                                 for user_interests in users_interests
                                 for interest in user_interests }))

def make_user_interest_vector(user_interests):
    """given a list of interests, produce a vector whose i-th element is 1
    if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interests else 0
            for interest in unique_interests]

user_interest_matrix = map(make_user_interest_vector, users_interests)

user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_matrix]
                     for interest_vector_i in user_interest_matrix]

def most_similar_users_to(user_id):
    pairs = [(other_user_id, similarity)                      # find other
             for other_user_id, similarity in                 # users with
                enumerate(user_similarities[user_id])         # nonzero 
             if user_id != other_user_id and similarity > 0]  # similarity

    return sorted(pairs,                                      # sort them
                  key=lambda (_, similarity): similarity,     # most similar
                  reverse=True)                               # first


def user_based_suggestions(user_id, include_current_interests=False):
    # sum up the similarities
    suggestions = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity

    # convert them to a sorted list
    suggestions = sorted(suggestions.items(),
                         key=lambda (_, weight): weight,
                         reverse=True)

    # and (maybe) exclude interests the user already has
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight) 
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

#
# Item-Based Collaborative Filtering
#

interest_user_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_matrix]
                        for j, _ in enumerate(unique_interests)]

interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                          for user_vector_j in interest_user_matrix]
                         for user_vector_i in interest_user_matrix]

def most_similar_interests_to(interest_id):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda (_, similarity): similarity,
                  reverse=True)

def item_based_suggestions(user_id, include_current_interests=False):
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity

    # sort by accumulated weight, highest first
    suggestions = sorted(suggestions.items(),
                         key=lambda (_, weight): weight,
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight) 
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]


if __name__ == "__main__":

    print "Popular Interests"
    print popular_interests
    print

    print "Most Popular New Interests"
    print "already like:", ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"]
    print most_popular_new_interests(["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"])
    print
    print "already like:", ["R", "Python", "statistics", "regression", "probability"]
    print most_popular_new_interests(["R", "Python", "statistics", "regression", "probability"])
    print    

    print "User based similarity"
    print "most similar to 0"
    print most_similar_users_to(0)

    print "Suggestions for 0"
    print user_based_suggestions(0)
    print

    print "Item based similarity"
    print "most similar to 'Big Data'"
    print most_similar_interests_to(0)
    print

    print "suggestions for user 0"
    print item_based_suggestions(0)



================================================
FILE: first-edition/code/simple_linear_regression.py
================================================
from __future__ import division
from collections import Counter, defaultdict
from linear_algebra import vector_subtract
from statistics import mean, correlation, standard_deviation, de_mean
from gradient_descent import minimize_stochastic
import math, random

def predict(alpha, beta, x_i):
    return beta * x_i + alpha

def error(alpha, beta, x_i, y_i):
    return y_i - predict(alpha, beta, x_i)

def sum_of_squared_errors(alpha, beta, x, y):
    return sum(error(alpha, beta, x_i, y_i) ** 2
               for x_i, y_i in zip(x, y))

def least_squares_fit(x,y):
    """given training values for x and y,
    find the least-squares values of alpha and beta"""
    beta = correlation(x, y) * standard_deviation(y) / standard_deviation(x)
    alpha = mean(y) - beta * mean(x)
    return alpha, beta

def total_sum_of_squares(y):
    """the total squared variation of y_i's from their mean"""
    return sum(v ** 2 for v in de_mean(y))

def r_squared(alpha, beta, x, y):
    """the fraction of variation in y captured by the model, which equals
    1 - the fraction of variation in y not captured by the model"""
    
    return 1.0 - (sum_of_squared_errors(alpha, beta, x, y) /
                  total_sum_of_squares(y))

def squared_error(x_i, y_i, theta):
    alpha, beta = theta
    return error(alpha, beta, x_i, y_i) ** 2

def squared_error_gradient(x_i, y_i, theta):
    alpha, beta = theta
    return [-2 * error(alpha, beta, x_i, y_i),       # alpha partial derivative
            -2 * error(alpha, beta, x_i, y_i) * x_i] # beta partial derivative
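
# Illustrative sketch (not part of the original module): a finite-difference
# check of the gradient formulas above. All names here are hypothetical, and
# the squared error is restated locally so that the snippet runs on its own.
def _squared_error_example(x_i, y_i, theta):
    alpha, beta = theta
    return (y_i - (beta * x_i + alpha)) ** 2

def estimate_gradient_example(x_i, y_i, theta, h=1e-6):
    """central differences: (f(theta + h e_j) - f(theta - h e_j)) / (2h)"""
    gradient = []
    for j in range(len(theta)):
        up = [t + (h if k == j else 0) for k, t in enumerate(theta)]
        down = [t - (h if k == j else 0) for k, t in enumerate(theta)]
        gradient.append((_squared_error_example(x_i, y_i, up) -
                         _squared_error_example(x_i, y_i, down)) / (2 * h))
    return gradient

# for x_i = 2.0, y_i = 5.0, theta = [1.0, 1.0] the error is 2.0, so the
# analytic gradient is [-2 * 2.0, -2 * 2.0 * 2.0] = [-4.0, -8.0]; the
# estimate above should agree to within rounding error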

if __name__ == "__main__":

    num_friends_good = [49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
    daily_minutes_good = [68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]

    alpha, beta = least_squares_fit(num_friends_good, daily_minutes_good)
    print "alpha", alpha
    print "beta", beta

    print "r-squared", r_squared(alpha, beta, num_friends_good, daily_minutes_good)

    print

    print "gradient descent:"
    # choose random value to start
    random.seed(0)
    theta = [random.random(), random.random()]
    alpha, beta = minimize_stochastic(squared_error, 
                                      squared_error_gradient,
                                      num_friends_good,
                                      daily_minutes_good, 
                                      theta,
                                      0.0001)
    print "alpha", alpha
    print "beta", beta

================================================
FILE: first-edition/code/states.txt
================================================
<state name ="Alabama" colour="#ff0000" >
  <point lat="35.0041" lng="-88.1955"/>
  <point lat="34.9918" lng="-85.6068"/>
  <point lat="32.8404" lng="-85.1756"/>
  <point lat="32.2593" lng="-84.8927"/>
  <point lat="32.1535" lng="-85.0342"/>
  <point lat="31.7947" lng="-85.1358"/>
  <point lat="31.5200" lng="-85.0438"/>
  <point lat="31.3384" lng="-85.0836"/>
  <point lat="31.2093" lng="-85.1070"/>
  <point lat="31.0023" lng="-84.9944"/>
  <point lat="30.9953" lng="-87.6009"/>
  <point lat="30.9423" lng="-87.5926"/>
  <point lat="30.8539" lng="-87.6256"/>
  <point lat="30.6745" lng="-87.4072"/>
  <point lat="30.4404" lng="-87.3688"/>
  <point lat="30.1463" lng="-87.5240"/>
  <point lat="30.1546" lng="-88.3864"/>
  <point lat="31.8939" lng="-88.4743"/>
  <point lat="34.8938" lng="-88.1021"/>
  <point lat="34.9479" lng="-88.1721"/>
  <point lat="34.9107" lng="-88.1461"/>
</state>
<state name ="Arkansas" colour="#ff0000" >
  <point lat="33.0225" lng="-94.0416"/>
  <point lat="33.0075" lng="-91.2057"/>
  <point lat="33.1180" lng="-91.1989"/>
  <point lat="33.1824" lng="-91.1041"/>
  <point lat="33.3053" lng="-91.1343"/>
  <point lat="33.4211" lng="-91.1646"/>
  <point lat="33.4337" lng="-91.2263"/>
  <point lat="33.5403" lng="-91.2524"/>
  <point lat="33.6112" lng="-91.1797"/>
  <point lat="33.6855" lng="-91.2524"/>
  <point lat="33.6946" lng="-91.1261"/>
  <point lat="33.7883" lng="-91.1412"/>
  <point lat="33.7700" lng="-91.0451"/>
  <point lat="33.8328" lng="-91.0341"/>
  <point lat="33.9399" lng="-91.0863"/>
  <point lat="34.0208" lng="-90.9256"/>
  <point lat="34.0856" lng="-90.9036"/>
  <point lat="34.1345" lng="-90.9586"/>
  <point lat="34.1675" lng="-90.9132"/>
  <point lat="34.1380" lng="-90.8501"/>
  <point lat="34.2311" lng="-90.9325"/>
  <point lat="34.3446" lng="-90.6935"/>
  <point lat="34.4409" lng="-90.5603"/>
  <point lat="34.5348" lng="-90.5548"/>
  <point lat="34.5959" lng="-90.5768"/>
  <point lat="34.7213" lng="-90.5301"/>
  <point lat="34.7574" lng="-90.5328"/>
  <point lat="34.8780" lng="-90.4546"/>
  <point lat="34.8454" lng="-90.3529"/>
  <point lat="34.8690" lng="-90.2911"/>
  <point lat="35.0255" lng="-90.3104"/>
  <point lat="35.1154" lng="-90.2843"/>
  <point lat="35.1323" lng="-90.1772"/>
  <point lat="35.1985" lng="-90.1112"/>
  <point lat="35.2826" lng="-90.1524"/>
  <point lat="35.4383" lng="-90.1332"/>
  <point lat="35.5579" lng="-90.0206"/>
  <point lat="35.6740" lng="-89.9780"/>
  <point lat="35.7287" lng="-89.9547"/>
  <point lat="35.9169" lng="-89.6594"/>
  <point lat="35.9658" lng="-89.6883"/>
  <point lat="36.0013" lng="-89.7130"/>
  <point lat="35.9958" lng="-90.3735"/>
  <point lat="36.1268" lng="-90.2664"/>
  <point lat="36.2875" lng="-90.0934"/>
  <point lat="36.3892" lng="-90.0742"/>
  <point lat="36.4180" lng="-90.1511"/>
  <point lat="36.4997" lng="-90.1566"/>
  <point lat="36.4986" lng="-94.6198"/>
  <point lat="35.3801" lng="-94.4412"/>
  <point lat="33.6318" lng="-94.4893"/>
  <point lat="33.6421" lng="-94.4522"/>
  <point lat="33.5597" lng="-94.4000"/>
  <point lat="33.5883" lng="-94.2462"/>
  <point lat="33.5872" lng="-94.1885"/>
  <point lat="33.5345" lng="-94.0375"/>
  <point lat="33.4314" lng="-94.0430"/>
  <point lat="33.0213" lng="-94.0430"/>
</state> 
<state name ="Arizona" colour="#ff0000" >
  <point lat="36.9993" lng="-112.5989"/>
  <point lat="37.0004" lng="-110.8630"/>
  <point lat="37.0004" lng="-109.0475"/>
  <point lat="31.3325" lng="-109.0503"/>
  <point lat="31.3325" lng="-111.0718"/>
  <point lat="32.4935" lng="-114.8126"/>
  <point lat="32.5184" lng="-114.8099"/>
  <point lat="32.5827" lng="-114.8044"/>
  <point lat="32.6246" lng="-114.7992"/>
  <point lat="32.6700" lng="-114.7474"/>
  <point lat="32.7457" lng="-114.7014"/>
  <point lat="32.7342" lng="-114.6176"/>
  <point lat="32.7422" lng="-114.5819"/>
  <point lat="32.7584" lng="-114.5393"/>
  <point lat="32.8167" lng="-114.5095"/>
  <point lat="32.8450" lng="-114.4696"/>
  <point lat="32.9107" lng="-114.4817"/>
  <point lat="32.9741" lng="-114.4803"/>
  <point lat="33.0317" lng="-114.5256"/>
  <point lat="33.0259" lng="-114.6094"/>
  <point lat="33.0317" lng="-114.6588"/>
  <point lat="33.0904" lng="-114.7096"/>
  <point lat="33.2065" lng="-114.6849"/>
  <point lat="33.2846" lng="-114.7220"/>
  <point lat="33.3546" lng="-114.6973"/>
  <point lat="33.4051" lng="-114.7258"/>
  <point lat="33.4120" lng="-114.6533"/>
  <point lat="33.5016" lng="-114.5888"/>
  <point lat="33.5317" lng="-114.5599"/>
  <point lat="33.6306" lng="-114.5187"/>
  <point lat="33.6786" lng="-114.5297"/>
  <point lat="33.7083" lng="-114.4940"/>
  <point lat="33.7609" lng="-114.5036"/>
  <point lat="33.8157" lng="-114.5284"/>
  <point lat="33.8545" lng="-114.5325"/>
  <point lat="33.9285" lng="-114.5380"/>
  <point lat="33.9530" lng="-114.5235"/>
  <point lat="34.0049" lng="-114.4748"/>
  <point lat="34.0299" lng="-114.4308"/>
  <point lat="34.0891" lng="-114.4363"/>
  <point lat="34.1357" lng="-114.3526"/>
  <point lat="34.1720" lng="-114.2908"/>
  <point lat="34.2044" lng="-114.2255"/>
  <point lat="34.2595" lng="-114.1685"/>
  <point lat="34.2572" lng="-114.1301"/>
  <point lat="34.3037" lng="-114.1397"/>
  <point lat="34.3664" lng="-114.2276"/>
  <point lat="34.4012" lng="-114.2633"/>
  <point lat="34.4534" lng="-114.3388"/>
  <point lat="34.4930" lng="-114.3608"/>
  <point lat="34.5292" lng="-114.3811"/>
  <point lat="34.5959" lng="-114.4377"/>
  <point lat="34.6547" lng="-114.4569"/>
  <point lat="34.7506" lng="-114.5297"/>
  <point lat="34.8172" lng="-114.5847"/>
  <point lat="34.8724" lng="-114.6341"/>
  <point lat="34.9490" lng="-114.6313"/>
  <point lat="35.0342" lng="-114.6351"/>
  <point lat="35.1019" lng="-114.6451"/>
  <point lat="35.1233" lng="-114.6190"/>
  <point lat="35.1716" lng="-114.5682"/>
  <point lat="35.3364" lng="-114.5984"/>
  <point lat="35.4506" lng="-114.6643"/>
  <point lat="35.5780" lng="-114.6753"/>
  <point lat="35.6171" lng="-114.6547"/>
  <point lat="35.6528" lng="-114.6918"/>
  <point lat="35.7053" lng="-114.7028"/>
  <point lat="35.8050" lng="-114.7093"/>
  <point lat="35.8679" lng="-114.6602"/>
  <point lat="35.9836" lng="-114.7426"/>
  <point lat="36.0891" lng="-114.7536"/>
  <point lat="36.1124" lng="-114.6794"/>
  <point lat="36.1423" lng="-114.6327"/>
  <point lat="36.1301" lng="-114.4872"/>
  <point lat="36.1445" lng="-114.3690"/>
  <point lat="36.0746" lng="-114.3038"/>
  <point lat="36.0602" lng="-114.3172"/>
  <point lat="36.0163" lng="-114.2451"/>
  <point lat="36.0402" lng="-114.1438"/>
  <point lat="36.0979" lng="-114.1150"/>
  <point lat="36.1101" lng="-114.1274"/>
  <point lat="36.1190" lng="-114.1054"/>
  <point lat="36.1989" lng="-114.0463"/>
  <point lat="36.3638" lng="-114.0450"/>
  <point lat="37.0001" lng="-114.0508"/>
</state>
<state name ="California" colour="#880000" >
  <point lat="41.9983" lng="-124.4009"/>
  <point lat="42.0024" lng="-123.6237"/>
  <point lat="42.0126" lng="-123.1526"/>
  <point lat="42.0075" lng="-122.0073"/>
  <point lat="41.9962" lng="-121.2369"/>
  <point lat="41.9983" lng="-119.9982"/>
  <point lat="39.0021" lng="-120.0037"/>
  <point lat="37.5555" lng="-117.9575"/>
  <point lat="36.3594" lng="-116.3699"/>
  <point lat="35.0075" lng="-114.6368"/>
  <point lat="34.9659" lng="-114.6382"/>
  <point lat="34.9107" lng="-114.6286"/>
  <point lat="34.8758" lng="-114.6382"/>
  <point lat="34.8454" lng="-114.5970"/>
  <point lat="34.7890" lng="-114.5682"/>
  <point lat="34.7269" lng="-114.4968"/>
  <point lat="34.6648" lng="-114.4501"/>
  <point lat="34.6581" lng="-114.4597"/>
  <point lat="34.5869" lng="-114.4322"/>
  <point lat="34.5235" lng="-114.3787"/>
  <point lat="34.4601" lng="-114.3869"/>
  <point lat="34.4500" lng="-114.3361"/>
  <point lat="34.4375" lng="-114.3031"/>
  <point lat="34.4024" lng="-114.2674"/>
  <point lat="34.3559" lng="-114.1864"/>
  <point lat="34.3049" lng="-114.1383"/>
  <point lat="34.2561" lng="-114.1315"/>
  <point lat="34.2595" lng="-114.1651"/>
  <point lat="34.2044" lng="-114.2249"/>
  <point lat="34.1914" lng="-114.2221"/>
  <point lat="34.1720" lng="-114.2908"/>
  <point lat="34.1368" lng="-114.3237"/>
  <point lat="34.1186" lng="-114.3622"/>
  <point lat="34.1118" lng="-114.4089"/>
  <point lat="34.0856" lng="-114.4363"/>
  <point lat="34.0276" lng="-114.4336"/>
  <point lat="34.0117" lng="-114.4652"/>
  <point lat="33.9582" lng="-114.5119"/>
  <point lat="33.9308" lng="-114.5366"/>
  <point lat="33.9058" lng="-114.5091"/>
  <point lat="33.8613" lng="-114.5256"/>
  <point lat="33.8248" lng="-114.5215"/>
  <point lat="33.7597" lng="-114.5050"/>
  <point lat="33.7083" lng="-114.4940"/>
  <point lat="33.6832" lng="-114.5284"/>
  <point lat="33.6363" lng="-114.5242"/>
  <point lat="33.5895" lng="-114.5393"/>
  <point lat="33.5528" lng="-114.5242"/>
  <point lat="33.5311" lng="-114.5586"/>
  <point lat="33.5070" lng="-114.5778"/>
  <point lat="33.4418" lng="-114.6245"/>
  <point lat="33.4142" lng="-114.6506"/>
  <point lat="33.4039" lng="-114.7055"/>
  <point lat="33.3546" lng="-114.6973"/>
  <point lat="33.3041" lng="-114.7302"/>
  <point lat="33.2858" lng="-114.7206"/>
  <point lat="33.2754" lng="-114.6808"/>
  <point lat="33.2582" lng="-114.6698"/>
  <point lat="33.2467" lng="-114.6904"/>
  <point lat="33.1720" lng="-114.6794"/>
  <point lat="33.0904" lng="-114.7083"/>
  <point lat="33.0858" lng="-114.6918"/>
  <point lat="33.0328" lng="-114.6629"/>
  <point lat="33.0501" lng="-114.6451"/>
  <point lat="33.0305" lng="-114.6286"/>
  <point lat="33.0282" lng="-114.5888"/>
  <point lat="33.0351" lng="-114.5750"/>
  <point lat="33.0328" lng="-114.5174"/>
  <point lat="32.9718" lng="-114.4913"/>
  <point lat="32.9764" lng="-114.4775"/>
  <point lat="32.9372" lng="-114.4844"/>
  <point lat="32.8427" lng="-114.4679"/>
  <point lat="32.8161" lng="-114.5091"/>
  <point lat="32.7850" lng="-114.5311"/>
  <point lat="32.7573" lng="-114.5284"/>
  <point lat="32.7503" lng="-114.5641"/>
  <point lat="32.7353" lng="-114.6162"/>
  <point lat="32.7480" lng="-114.6986"/>
  <point lat="32.7191" lng="-114.7220"/>
  <point lat="32.6868" lng="-115.1944"/>
  <point lat="32.5121" lng="-117.3395"/>
  <point lat="32.7838" lng="-117.4823"/>
  <point lat="33.0501" lng="-117.5977"/>
  <point lat="33.2341" lng="-117.6814"/>
  <point lat="33.4578" lng="-118.0591"/>
  <point lat="33.5403" lng="-118.6290"/>
  <point lat="33.7928" lng="-118.7073"/>
  <point lat="33.9582" lng="-119.3706"/>
  <point lat="34.1925" lng="-120.0050"/>
  <point lat="34.2561" lng="-120.7164"/>
  <point lat="34.5360" lng="-120.9128"/>
  <point lat="34.9749" lng="-120.8427"/>
  <point lat="35.2131" lng="-121.1325"/>
  <point lat="35.5255" lng="-121.3220"/>
  <point lat="35.9691" lng="-121.8013"/>
  <point lat="36.2808" lng="-122.1446"/>
  <point lat="36.7268" lng="-122.1721"/>
  <point lat="37.2227" lng="-122.6871"/>
  <point lat="37.7783" lng="-122.8903"/>
  <point lat="37.8965" lng="-123.2378"/>
  <point lat="38.3449" lng="-123.3202"/>
  <point lat="38.7423" lng="-123.8338"/>
  <point lat="38.9946" lng="-123.9793"/>
  <point lat="39.3088" lng="-124.0329"/>
  <point lat="39.7642" lng="-124.0823"/>
  <point lat="40.1663" lng="-124.5314"/>
  <point lat="40.4658" lng="-124.6509"/>
  <point lat="41.0110" lng="-124.3144"/>
  <point lat="41.2386" lng="-124.3419"/>
  <point lat="41.7170" lng="-124.4545"/>
  <point lat="41.9983" lng="-124.4009"/>
</state>
<state name ="Colorado" colour="#880000" >
  <point lat="37.0004" lng="-109.0448"/>
  <point lat="36.9949" lng="-102.0424"/>
  <point lat="41.0006" lng="-102.0534"/>
  <point lat="40.9996" lng="-109.0489"/>
  <point lat="37.0004" lng="-109.0448"/>
</state>
<state name ="Connecticut" colour="#880000" >
  <point lat="42.0498" lng="-73.4875"/>
  <point lat="42.0511" lng="-73.4247"/>
  <point lat="42.0371" lng="-72.8146"/>
  <point lat="41.9983" lng="-72.8174"/>
  <point lat="42.0044" lng="-72.7638"/>
  <point lat="42.0360" lng="-72.7563"/>
  <point lat="42.0368" lng="-72.6945"/>
  <point lat="42.0309" lng="-72.6086"/>
  <point lat="42.0269" lng="-72.6059"/>
  <point lat="42.0269" lng="-72.5784"/>
  <point lat="42.0350" lng="-72.5729"/>
  <point lat="42.0350" lng="-72.4026"/>
  <point lat="42.0248" lng="-71.7984"/>
  <point lat="41.6832" lng="-71.7874"/>
  <point lat="41.4165" lng="-71.7984"/>
  <point lat="41.3892" lng="-71.8341"/>
  <point lat="41.3273" lng="-71.8526"/>
  <point lat="41.3309" lng="-71.8938"/>
  <point lat="41.3103" lng="-71.9302"/>
  <point lat="41.2907" lng="-72.0195"/>
  <point lat="41.2618" lng="-72.0827"/>
  <point lat="41.1962" lng="-72.4322"/>
  <point lat="41.0866" lng="-73.0007"/>
  <point lat="41.0255" lng="-73.2493"/>
  <point lat="40.9509" lng="-73.6132"/>
  <point lat="40.9830" lng="-73.6606"/>
  <point lat="41.0338" lng="-73.6723"/>
  <point lat="41.1011" lng="-73.7272"/>
  <point lat="41.2153" lng="-73.4834"/>
  <point lat="41.2953" lng="-73.5507"/>
  <point lat="41.4906" lng="-73.5329"/>
  <point lat="42.0493" lng="-73.4875"/>
</state>
<state name ="Delaware" colour="#880000" >
  <point lat="39.7188" lng="-75.7919"/>
  <point lat="39.5210" lng="-75.7837"/>
  <point lat="38.9081" lng="-75.7288"/>
  <point lat="38.5911" lng="-75.7068"/>
  <point lat="38.4600" lng="-75.6944"/>
  <point lat="38.4482" lng="-74.8608"/>
  <point lat="38.8654" lng="-74.8526"/>
  <point lat="38.8451" lng="-75.0504"/>
  <point lat="39.0565" lng="-75.1678"/>
  <point lat="39.2525" lng="-75.3236"/>
  <point lat="39.3662" lng="-75.4610"/>
  <point lat="39.4542" lng="-75.5592"/>
  <point lat="39.4908" lng="-75.5578"/>
  <point lat="39.5713" lng="-75.5118"/>
  <point lat="39.6284" lng="-75.5557"/>
  <point lat="39.8106" lng="-75.3937"/>
  <point lat="39.8249" lng="-75.4692"/>
  <point lat="39.8296" lng="-75.6477"/>
  <point lat="39.7199" lng="-75.7906"/>
</state>
<state name ="Florida" colour="#8800ff" >
  <point lat="30.9988" lng="-87.6050"/>
  <point lat="30.9964" lng="-86.5613"/>
  <point lat="31.0035" lng="-85.5313"/>
  <point lat="31.0012" lng="-85.1193"/>
  <point lat="31.0023" lng="-85.0012"/>
  <point lat="30.9364" lng="-84.9847"/>
  <point lat="30.8845" lng="-84.9367"/>
  <point lat="30.8409" lng="-84.9271"/>
  <point lat="30.7902" lng="-84.9257"/>
  <point lat="30.7489" lng="-84.9147"/>
  <point lat="30.6993" lng="-84.8611"/>
  <point lat="30.6911" lng="-84.4272"/>
  <point lat="30.6509" lng="-83.5991"/>
  <point lat="30.5895" lng="-82.5595"/>
  <point lat="30.5682" lng="-82.2134"/>
  <point lat="30.5315" lng="-82.2134"/>
  <point lat="30.3883" lng="-82.1997"/>
  <point lat="30.3598" lng="-82.1544"/>
  <point lat="30.3598" lng="-82.0638"/>
  <point lat="30.4877" lng="-82.0226"/>
  <point lat="30.6308" lng="-82.0473"/>
  <point lat="30.6757" lng="-82.0514"/>
  <point lat="30.7111" lng="-82.0377"/>
  <point lat="30.7371" lng="-82.0514"/>
  <point lat="30.7678" lng="-82.0102"/>
  <point lat="30.7914" lng="-82.0322"/>
  <point lat="30.7997" lng="-81.9717"/>
  <point lat="30.8244" lng="-81.9608"/>
  <point lat="30.8056" lng="-81.8893"/>
  <point lat="30.7914" lng="-81.8372"/>
  <point lat="30.7796" lng="-81.7960"/>
  <point lat="30.7536" lng="-81.6696"/>
  <point lat="30.7289" lng="-81.6051"/>
  <point lat="30.7324" lng="-81.5666"/>
  <point lat="30.7229" lng="-81.5295"/>
  <point lat="30.7253" lng="-81.4856"/>
  <point lat="30.7111" lng="-81.4609"/>
  <point lat="30.7088" lng="-81.4169"/>
  <point lat="30.7064" lng="-81.2274"/>
  <point lat="30.4345" lng="-81.2357"/>
  <point lat="30.3160" lng="-81.1725"/>
  <point lat="29.7763" lng="-81.0379"/>
  <point lat="28.8603" lng="-80.5861"/>
  <point lat="28.4771" lng="-80.3650"/>
  <point lat="28.1882" lng="-80.3815"/>
  <point lat="27.1789" lng="-79.9255"/>
  <point lat="26.8425" lng="-79.8198"/>
  <point lat="26.1394" lng="-79.9118"/>
  <point lat="25.5115" lng="-79.9997"/>
  <point lat="24.8802" lng="-80.3815"/>
  <point lat="24.5384" lng="-80.8704"/>
  <point lat="24.3959" lng="-81.9250"/>
  <point lat="24.4496" lng="-82.2066"/>
  <point lat="24.5484" lng="-82.3137"/>
  <point lat="24.6982" lng="-82.1997"/>
  <point lat="25.2112" lng="-81.3977"/>
  <point lat="25.6019" lng="-81.4622"/>
  <point lat="25.9235" lng="-81.9456"/>
  <point lat="26.3439" lng="-82.2876"/>
  <point lat="26.9098" lng="-82.5307"/>
  <point lat="27.3315" lng="-82.8342"/>
  <point lat="27.7565" lng="-83.0182"/>
  <point lat="28.0574" lng="-83.0017"/>
  <point lat="28.6098" lng="-82.8548"/>
  <point lat="28.9697" lng="-83.0264"/>
  <point lat="29.0478" lng="-83.2050"/>
  <point lat="29.4157" lng="-83.5318"/>
  <point lat="29.9133" lng="-83.9767"/>
  <point lat="29.8930" lng="-84.1072"/>
  <point lat="29.6940" lng="-84.4409"/>
  <point lat="29.4551" lng="-85.0465"/>
  <point lat="29.4946" lng="-85.3610"/>
  <point lat="29.7262" lng="-85.5807"/>
  <point lat="30.1594" lng="-86.1946"/>
  <point lat="30.2175" lng="-86.8510"/>
  <point lat="30.1499" lng="-87.5171"/>
  <point lat="30.3006" lng="-87.4429"/>
  <point lat="30.4256" lng="-87.3750"/>
  <point lat="30.4830" lng="-87.3743"/>
  <point lat="30.5658" lng="-87.3907"/>
  <point lat="30.6344" lng="-87.4004"/>
  <point lat="30.6763" lng="-87.4141"/>
  <point lat="30.7702" lng="-87.5253"/>
  <point lat="30.8527" lng="-87.6256"/>
  <point lat="30.9470" lng="-87.5912"/>
  <point lat="30.9682" lng="-87.5912"/>
  <point lat="30.9964" lng="-87.6050"/>
</state>
<state name ="Georgia" colour="#880000" >
  <point lat="34.9974" lng="-85.6082"/>
  <point lat="34.9906" lng="-84.7266"/>
  <point lat="34.9895" lng="-84.1580"/>
  <point lat="34.9996" lng="-83.1088"/>
  <point lat="34.9287" lng="-83.1418"/>
  <point lat="34.8318" lng="-83.3025"/>
  <point lat="34.7281" lng="-83.3560"/>
  <point lat="34.6569" lng="-83.3080"/>
  <point lat="34.5744" lng="-83.1528"/>
  <point lat="34.4839" lng="-83.0072"/>
  <point lat="34.4681" lng="-82.8918"/>
  <point lat="34.4443" lng="-82.8589"/>
  <point lat="34.2674" lng="-82.7490"/>
  <point lat="34.1254" lng="-82.6831"/>
  <point lat="34.0140" lng="-82.5952"/>
  <point lat="33.8647" lng="-82.3988"/>
  <point lat="33.7563" lng="-82.2505"/>
  <point lat="33.6695" lng="-82.2217"/>
  <point lat="33.5963" lng="-82.1558"/>
  <point lat="33.5036" lng="-82.0432"/>
  <point lat="33.3707" lng="-81.9484"/>
  <point lat="33.2077" lng="-81.8303"/>
  <point lat="33.1674" lng="-81.7795"/>
  <point lat="33.1456" lng="-81.7424"/>
  <point lat="33.0881" lng="-81.6078"/>
  <point lat="33.0075" lng="-81.5034"/>
  <point lat="32.9418" lng="-81.5089"/>
  <point lat="32.6914" lng="-81.4142"/>
  <point lat="32.5815" lng="-81.4087"/>
  <point lat="32.5283" lng="-81.2769"/>
  <point lat="32.4576" lng="-81.1945"/>
  <point lat="32.3185" lng="-81.1642"/>
  <point lat="32.2151" lng="-81.1436"/>
  <point lat="32.1128" lng="-81.1134"/>
  <point lat="32.0477" lng="-80.9225"/>
  <point lat="32.0500" lng="-80.6960"/>
  <point lat="31.8881" lng="-80.7289"/>
  <point lat="31.4697" lng="-80.9665"/>
  <point lat="30.9988" lng="-81.1011"/>
  <point lat="30.7041" lng="-81.2288"/>
  <point lat="30.7241" lng="-81.6023"/>
  <point lat="30.7713" lng="-81.7657"/>
  <point lat="30.8221" lng="-81.9498"/>
  <point lat="30.7560" lng="-82.0239"/>
  <point lat="30.6379" lng="-82.0459"/>
  <point lat="30.4866" lng="-82.0239"/>
  <point lat="30.4309" lng="-82.0363"/>
  <point lat="30.3575" lng="-82.0610"/>
  <point lat="30.3598" lng="-82.1585"/>
  <point lat="30.3859" lng="-82.2025"/>
  <point lat="30.4842" lng="-82.2148"/>
  <point lat="30.5682" lng="-82.2162"/>
  <point lat="30.6131" lng="-82.9688"/>
  <point lat="30.7041" lng="-84.8639"/>
  <point lat="30.7831" lng="-84.9257"/>
  <point lat="30.9117" lng="-84.9586"/>
  <point lat="30.9741" lng="-84.9985"/>
  <point lat="31.1282" lng="-85.0630"/>
  <point lat="31.2116" lng="-85.1070"/>
  <point lat="31.5247" lng="-85.0493"/>
  <point lat="31.8006" lng="-85.1358"/>
  <point lat="31.9592" lng="-85.0919"/>
  <point lat="32.1570" lng="-85.0342"/>
  <point lat="32.2500" lng="-84.9023"/>
  <point lat="32.3974" lng="-84.9628"/>
  <point lat="32.5468" lng="-85.0342"/>
  <point lat="32.6949" lng="-85.1001"/>
  <point lat="32.8138" lng="-85.1660"/>
  <point lat="32.9833" lng="-85.2072"/>
  <point lat="33.6512" lng="-85.3418"/>
  <point lat="34.5620" lng="-85.5231"/>
  <point lat="34.9929" lng="-85.6068"/>
</state>
<state name ="Iowa" colour="#00ff00" >
  <point lat="40.5848" lng="-95.7623"/>
  <point lat="40.5785" lng="-93.5445"/>
  <point lat="40.6129" lng="-91.7372"/>
  <point lat="40.5545" lng="-91.6768"/>
  <point lat="40.5451" lng="-91.6246"/>
  <point lat="40.3622" lng="-91.4365"/>
  <point lat="40.4637" lng="-91.3623"/>
  <point lat="40.5482" lng="-91.4021"/>
  <point lat="40.6931" lng="-91.1124"/>
  <point lat="40.8107" lng="-91.1028"/>
  <point lat="40.9218" lng="-90.9668"/>
  <point lat="41.1642" lng="-91.0121"/>
  <point lat="41.2406" lng="-91.1082"/>
  <point lat="41.4067" lng="-91.0451"/>
  <point lat="41.4510" lng="-90.7086"/>
  <point lat="41.5178" lng="-90.4793"/>
  <point lat="41.5908" lng="-90.3419"/>
  <point lat="41.7457" lng="-90.2788"/>
  <point lat="41.8164" lng="-90.2074"/>
  <point lat="41.9023" lng="-90.1538"/>
  <point lat="42.0962" lng="-90.1744"/>
  <point lat="42.1441" lng="-90.2692"/>
  <point lat="42.2905" lng="-90.4298"/>
  <point lat="42.4093" lng="-90.5370"/>
  <point lat="42.5217" lng="-90.6400"/>
  <point lat="42.6360" lng="-90.7127"/>
  <point lat="42.6956" lng="-90.7883"/>
  <point lat="42.7712" lng="-91.0533"/>
  <point lat="42.8448" lng="-91.0904"/>
  <point lat="42.9082" lng="-91.1398"/>
  <point lat="43.0609" lng="-91.1549"/>
  <point lat="43.1391" lng="-91.1522"/>
  <point lat="43.2882" lng="-91.0547"/>
  <point lat="43.3322" lng="-91.2057"/>
  <point lat="43.4140" lng="-91.2236"/>
  <point lat="43.5008" lng="-91.2305"/>
  <point lat="43.4998" lng="-96.5973"/>
  <point lat="43.4818" lng="-96.6110"/>
  <point lat="43.3871" lng="-96.5245"/>
  <point lat="43.2232" lng="-96.5533"/>
  <point lat="43.1301" lng="-96.4421"/>
  <point lat="42.9243" lng="-96.5479"/>
  <point lat="42.7188" lng="-96.6357"/>
  <point lat="42.6158" lng="-96.5561"/>
  <point lat="42.5055" lng="-96.4847"/>
  <point lat="42.4599" lng="-96.3995"/>
  <point lat="42.3667" lng="-96.4050"/>
  <point lat="42.2722" lng="-96.3446"/>
  <point lat="42.2051" lng="-96.3625"/>
  <point lat="41.9983" lng="-96.2416"/>
  <point lat="41.9513" lng="-96.1372"/>
  <point lat="41.7662" lng="-96.0741"/>
  <point lat="41.6267" lng="-96.0988"/>
  <point lat="41.4561" lng="-95.9477"/>
  <point lat="41.2819" lng="-95.8804"/>
  <point lat="41.0338" lng="-95.8653"/>
  <point lat="40.8346" lng="-95.8365"/>
  <point lat="40.6775" lng="-95.8461"/>
  <point lat="40.5837" lng="-95.7610"/>
</state>
<state name ="Idaho" colour="#00ff00" >
  <point lat="49.0000" lng="-117.0319"/>
  <point lat="49.0018" lng="-116.0486"/>
  <point lat="47.9752" lng="-116.0445"/>
  <point lat="47.5765" lng="-115.6915"/>
  <point lat="47.5487" lng="-115.7574"/>
  <point lat="47.4239" lng="-115.7595"/>
  <point lat="47.3109" lng="-115.5350"/>
  <point lat="47.2606" lng="-115.3235"/>
  <point lat="47.1888" lng="-115.2878"/>
  <point lat="47.1542" lng="-115.2493"/>
  <point lat="46.9728" lng="-115.0433"/>
  <point lat="46.8677" lng="-114.9472"/>
  <point lat="46.7201" lng="-114.7865"/>
  <point lat="46.6984" lng="-114.7549"/>
  <point lat="46.6325" lng="-114.5874"/>
  <point lat="46.6325" lng="-114.4638"/>
  <point lat="46.6466" lng="-114.3279"/>
  <point lat="46.5135" lng="-114.3430"/>
  <point lat="46.4530" lng="-114.3896"/>
  <point lat="46.3488" lng="-114.4144"/>
  <point lat="46.2682" lng="-114.4611"/>
  <point lat="46.1227" lng="-114.5105"/>
  <point lat="45.8585" lng="-114.4418"/>
  <point lat="45.7742" lng="-114.5654"/>
  <point lat="45.6745" lng="-114.5229"/>
  <point lat="45.5621" lng="-114.5654"/>
  <point lat="45.5439" lng="-114.4666"/>
  <point lat="45.4601" lng="-114.3375"/>
  <point lat="45.5468" lng="-114.2441"/>
  <point lat="45.5631" lng="-114.1342"/>
  <point lat="45.6889" lng="-113.9708"/>
  <point lat="45.6102" lng="-113.8403"/>
  <point lat="45.4409" lng="-113.7978"/>
  <point lat="45.2720" lng="-113.7085"/>
  <point lat="45.0260" lng="-113.4256"/>
  <point lat="44.9405" lng="-113.4998"/>
  <point lat="44.7887" lng="-113.3459"/>
  <point lat="44.8062" lng="-113.2471"/>
  <point lat="44.7350" lng="-113.1180"/>
  <point lat="44.4887" lng="-113.0246"/>
  <point lat="44.3592" lng="-112.8502"/>
  <point lat="44.4151" lng="-112.8310"/>
  <point lat="44.4887" lng="-112.7266"/>
  <point lat="44.4504" lng="-112.3901"/>
  <point lat="44.5347" lng="-112.3270"/>
  <point lat="44.5220" lng="-112.1127"/>
  <point lat="44.5582" lng="-111.8848"/>
  <point lat="44.5132" lng="-111.8271"/>
  <point lat="44.5396" lng="-111.4645"/>
  <point lat="44.6198" lng="-111.5057"/>
  <point lat="44.7292" lng="-111.3684"/>
  <point lat="44.4759" lng="-111.0539"/>
  <point lat="43.8623" lng="-111.0471"/>
  <point lat="42.0013" lng="-111.0471"/>
  <point lat="41.9962" lng="-112.1663"/>
  <point lat="41.9871" lng="-113.8458"/>
  <point lat="41.9942" lng="-114.0422"/>
  <point lat="42.0013" lng="-114.8222"/>
  <point lat="41.9973" lng="-115.9126"/>
  <point lat="41.9962" lng="-117.0140"/>
  <point lat="42.0013" lng="-117.0264"/>
  <point lat="43.7820" lng="-117.0277"/>
  <point lat="43.8330" lng="-117.0325"/>
  <point lat="43.8632" lng="-117.0030"/>
  <point lat="43.9073" lng="-116.9776"/>
  <point lat="44.0244" lng="-116.9302"/>
  <point lat="44.0491" lng="-116.9735"/>
  <point lat="44.1014" lng="-116.9330"/>
  <point lat="44.1561" lng="-116.8945"/>
  <point lat="44.1965" lng="-116.9714"/>
  <point lat="44.2442" lng="-116.9810"/>
  <point lat="44.2486" lng="-117.0339"/>
  <point lat="44.2304" lng="-117.0525"/>
  <point lat="44.2585" lng="-117.0895"/>
  <point lat="44.2806" lng="-117.1122"/>
  <point lat="44.2590" lng="-117.1541"/>
  <point lat="44.2973" lng="-117.2255"/>
  <point lat="44.3445" lng="-117.1994"/>
  <point lat="44.3813" lng="-117.2372"/>
  <point lat="44.4769" lng="-117.2269"/>
  <point lat="44.5234" lng="-117.1836"/>
  <point lat="44.5376" lng="-117.1458"/>
  <point lat="44.7423" lng="-117.0442"/>
  <point lat="44.7921" lng="-116.9316"/>
  <point lat="44.8568" lng="-116.8980"/>
  <point lat="44.9356" lng="-116.8327"/>
  <point lat="44.9624" lng="-116.8513"/>
  <point lat="44.9896" lng="-116.8554"/>
  <point lat="45.0313" lng="-116.8417"/>
  <point lat="45.0968" lng="-116.7819"/>
  <point lat="45.1627" lng="-116.7229"/>
  <point lat="45.2178" lng="-116.7105"/>
  <point lat="45.3213" lng="-116.6741"/>
  <point lat="45.3984" lng="-116.6185"/>
  <point lat="45.4433" lng="-116.5883"/>
  <point lat="45.4630" lng="-116.5553"/>
  <point lat="45.5371" lng="-116.5334"/>
  <point lat="45.6140" lng="-116.4640"/>
  <point lat="45.6904" lng="-116.5354"/>
  <point lat="45.7340" lng="-116.5354"/>
  <point lat="45.7541" lng="-116.5594"/>
  <point lat="45.7843" lng="-116.6357"/>
  <point lat="45.7781" lng="-116.5965"/>
  <point lat="45.7805" lng="-116.6597"/>
  <point lat="45.8259" lng="-116.7105"/>
  <point lat="45.8159" lng="-116.7586"/>
  <point lat="45.8341" lng="-116.7908"/>
  <point lat="45.8642" lng="-116.8046"/>
  <point lat="45.9053" lng="-116.8595"/>
  <point lat="45.9545" lng="-116.8739"/>
  <point lat="45.9769" lng="-116.8925"/>
  <point lat="46.0218" lng="-116.9302"/>
  <point lat="46.0932" lng="-116.9838"/>
  <point lat="46.1385" lng="-116.9344"/>
  <point lat="46.1727" lng="-116.9268"/>
  <point lat="46.2007" lng="-116.9646"/>
  <point lat="46.2435" lng="-116.9591"/>
  <point lat="46.2782" lng="-116.9920"/>
  <point lat="46.3152" lng="-117.0209"/>
  <point lat="46.3446" lng="-117.0511"/>
  <point lat="46.4270" lng="-117.0408"/>
  <point lat="46.9451" lng="-117.0394"/>
  <point lat="48.9996" lng="-117.0319"/>
</state>
<state name ="Illinois" colour="#00ffff" >
  <point lat="42.5116" lng="-90.6290"/>
  <point lat="42.4924" lng="-87.0213"/>
  <point lat="41.7641" lng="-87.2067"/>
  <point lat="41.7611" lng="-87.5226"/>
  <point lat="39.6417" lng="-87.5336"/>
  <point lat="39.3566" lng="-87.5308"/>
  <point lat="39.1386" lng="-87.6517"/>
  <point lat="38.9445" lng="-87.5157"/>
  <point lat="38.7294" lng="-87.5047"/>
  <point lat="38.6115" lng="-87.6146"/>
  <point lat="38.4944" lng="-87.6544"/>
  <point lat="38.3740" lng="-87.7780"/>
  <point lat="38.2856" lng="-87.8371"/>
  <point lat="38.2414" lng="-87.9758"/>
  <point lat="38.1454" lng="-87.9291"/>
  <point lat="37.9788" lng="-88.0225"/>
  <point lat="37.8900" lng="-88.0458"/>
  <point lat="37.7881" lng="-88.0321"/>
  <point lat="37.6349" lng="-88.1529"/>
  <point lat="37.5097" lng="-88.0609"/>
  <point lat="37.4149" lng="-88.4152"/>
  <point lat="37.2828" lng="-88.5086"/>
  <point lat="37.1428" lng="-88.4221"/>
  <point lat="37.0585" lng="-88.4990"/>
  <point lat="37.1428" lng="-88.7256"/>
  <point lat="37.2128" lng="-88.9453"/>
  <point lat="37.1559" lng="-89.0689"/>
  <point lat="37.0376" lng="-89.1650"/>
  <point lat="36.9894" lng="-89.2873"/>
  <point lat="37.1505" lng="-89.4356"/>
  <point lat="37.2762" lng="-89.5345"/>
  <point lat="37.3996" lng="-89.4315"/>
  <point lat="37.6936" lng="-89.5358"/>
  <point lat="37.9767" lng="-89.9670"/>
  <point lat="38.2587" lng="-90.3790"/>
  <point lat="38.6169" lng="-90.2376"/>
  <point lat="38.7573" lng="-90.1744"/>
  <point lat="38.8247" lng="-90.1167"/>
  <point lat="38.8846" lng="-90.1799"/>
  <point lat="38.9680" lng="-90.4504"/>
  <point lat="38.8654" lng="-90.5905"/>
  <point lat="39.0405" lng="-90.7086"/>
  <point lat="39.2301" lng="-90.7306"/>
  <point lat="39.3173" lng="-90.8350"/>
  <point lat="39.3853" lng="-90.9338"/>
  <point lat="39.5559" lng="-91.1398"/>
  <point lat="39.7262" lng="-91.3554"/>
  <point lat="39.8570" lng="-91.4406"/>
  <point lat="39.9940" lng="-91.4941"/>
  <point lat="40.1694" lng="-91.5120"/>
  <point lat="40.3497" lng="-91.4667"/>
  <point lat="40.4166" lng="-91.3939"/>
  <point lat="40.5566" lng="-91.4021"/>
  <point lat="40.6265" lng="-91.2524"/>
  <point lat="40.6963" lng="-91.1151"/>
  <point lat="40.8232" lng="-91.0890"/>
  <point lat="40.9312" lng="-90.9792"/>
  <point lat="41.1642" lng="-91.0162"/>
  <point lat="41.2355" lng="-91.1055"/>
  <point lat="41.4170" lng="-91.0368"/>
  <point lat="41.4458" lng="-90.8487"/>
  <point lat="41.4417" lng="-90.7251"/>
  <point lat="41.5816" lng="-90.3516"/>
  <point lat="41.7713" lng="-90.2637"/>
  <point lat="41.9023" lng="-90.1538"/>
  <point lat="42.0819" lng="-90.1758"/>
  <point lat="42.2021" lng="-90.3598"/>
  <point lat="42.2936" lng="-90.4395"/>
  <point lat="42.4032" lng="-90.5356"/>
  <point lat="42.4843" lng="-90.6564"/>
</state>
<state name ="Indiana" colour="#00ff00" >
  <point lat="41.7611" lng="-87.5253"/>
  <point lat="41.7611" lng="-84.8090"/>
  <point lat="39.0981" lng="-84.8199"/>
  <point lat="39.0533" lng="-84.8927"/>
  <point lat="38.8996" lng="-84.8625"/>
  <point lat="38.8312" lng="-84.8268"/>
  <point lat="38.7841" lng="-84.8145"/>
  <point lat="38.7905" lng="-84.8941"/>
  <point lat="38.7809" lng="-84.9861"/>
  <point lat="38.6877" lng="-85.1797"/>
  <point lat="38.7198" lng="-85.4420"/>
  <point lat="38.5653" lng="-85.4091"/>
  <point lat="38.4461" lng="-85.5986"/>
  <point lat="38.2695" lng="-85.7510"/>
  <point lat="38.2824" lng="-85.8266"/>
  <point lat="38.2414" lng="-85.8376"/>
  <point lat="38.0967" lng="-85.9035"/>
  <point lat="38.0232" lng="-85.9200"/>
  <point lat="37.9594" lng="-86.0477"/>
  <point lat="38.0102" lng="-86.0944"/>
  <point lat="38.0578" lng="-86.2729"/>
  <point lat="38.0935" lng="-86.2811"/>
  <point lat="38.1346" lng="-86.2729"/>
  <point lat="38.1842" lng="-86.3704"/>
  <point lat="38.0416" lng="-86.5187"/>
  <point lat="37.9193" lng="-86.5874"/>
  <point lat="37.8402" lng="-86.6409"/>
  <point lat="37.9085" lng="-86.6478"/>
  <point lat="37.9085" lng="-86.6876"/>
  <point lat="37.9821" lng="-86.8236"/>
  <point lat="37.9464" lng="-86.9019"/>
  <point lat="37.9009" lng="-87.0392"/>
  <point lat="37.7924" lng="-87.1394"/>
  <point lat="37.9464" lng="-87.4429"/>
  <point lat="37.9756" lng="-87.5885"/>
  <point lat="37.9225" lng="-87.6283"/>
  <point lat="37.8694" lng="-87.6915"/>
  <point lat="37.9236" lng="-87.8879"/>
  <point lat="37.7718" lng="-87.9620"/>
  <point lat="37.7870" lng="-88.0321"/>
  <point lat="37.8092" lng="-88.0376"/>
  <point lat="37.8011" lng="-88.0643"/>
  <point lat="37.8206" lng="-88.0925"/>
  <point lat="37.8223" lng="-88.0451"/>
  <point lat="37.8483" lng="-88.0575"/>
  <point lat="37.9041" lng="-88.0980"/>
  <point lat="37.9307" lng="-88.0705"/>
  <point lat="37.9561" lng="-88.0369"/>
  <point lat="37.9669" lng="-88.0122"/>
  <point lat="38.0102" lng="-88.0259"/>
  <point lat="38.0384" lng="-88.0417"/>
  <point lat="38.0530" lng="-88.0005"/>
  <point lat="38.0762" lng="-87.9607"/>
  <point lat="38.1000" lng="-88.0163"/>
  <point lat="38.1313" lng="-87.9710"/>
  <point lat="38.1497" lng="-87.9284"/>
  <point lat="38.1734" lng="-87.9387"/>
  <point lat="38.1939" lng="-87.9730"/>
  <point lat="38.2349" lng="-87.9813"/>
  <point lat="38.2608" lng="-87.9421"/>
  <point lat="38.2759" lng="-87.8604"/>
  <point lat="38.3029" lng="-87.8302"/>
  <point lat="38.3233" lng="-87.8350"/>
  <point lat="38.3567" lng="-87.8137"/>
  <point lat="38.3767" lng="-87.7739"/>
  <point lat="38.4116" lng="-87.7444"/>
  <point lat="38.5149" lng="-87.6448"/>
  <point lat="38.5460" lng="-87.6723"/>
  <point lat="38.5949" lng="-87.6105"/>
  <point lat="38.5986" lng="-87.6242"/>
  <point lat="38.6828" lng="-87.5343"/>
  <point lat="38.7284" lng="-87.5075"/>
  <point lat="38.7696" lng="-87.4972"/>
  <point lat="38.8247" lng="-87.5322"/>
  <point lat="38.9039" lng="-87.5171"/>
  <point lat="38.9413" lng="-87.5253"/>
  <point lat="38.9712" lng="-87.5281"/>
  <point lat="38.9872" lng="-87.5761"/>
  <point lat="39.0906" lng="-87.6228"/>
  <point lat="39.1066" lng="-87.6517"/>
  <point lat="39.1365" lng="-87.6599"/>
  <point lat="39.1695" lng="-87.6366"/>
  <point lat="39.2493" lng="-87.5899"/>
  <point lat="39.3492" lng="-87.5336"/>
  <point lat="41.7600" lng="-87.5253"/>
</state>
<state name ="Kansas" colour="#008800" >
  <point lat="40.0034" lng="-102.0506"/>
  <point lat="36.9927" lng="-102.0438"/>
  <point lat="36.9982" lng="-94.6211"/>
  <point lat="38.8803" lng="-94.6046"/>
  <point lat="39.0789" lng="-94.6143"/>
  <point lat="39.1971" lng="-94.6184"/>
  <point lat="39.1673" lng="-94.7255"/>
  <point lat="39.2759" lng="-94.8793"/>
  <point lat="39.5612" lng="-95.0990"/>
  <point lat="39.7283" lng="-94.8807"/>
  <point lat="39.8286" lng="-94.8930"/>
  <point lat="39.8823" lng="-94.9342"/>
  <point lat="39.8971" lng="-95.0098"/>
  <point lat="39.8760" lng="-95.0922"/>
  <point lat="39.9445" lng="-95.2213"/>
  <point lat="40.0087" lng="-95.3036"/>
  <point lat="40.0024" lng="-102.0506"/>
</state>
<state name ="Kentucky" colour="#008800" >
  <point lat="36.4986" lng="-89.5372"/>
  <point lat="36.5074" lng="-89.3010"/>
  <point lat="36.5008" lng="-88.6871"/>
  <point lat="36.4931" lng="-88.0568"/>
  <point lat="36.6695" lng="-88.0692"/>
  <point lat="36.6343" lng="-87.8535"/>
  <point lat="36.6265" lng="-86.5654"/>
  <point lat="36.5979" lng="-83.6375"/>
  <point lat="36.6860" lng="-83.3423"/>
  <point lat="36.7466" lng="-83.1377"/>
  <point lat="36.9762" lng="-82.8589"/>
  <point lat="37.2894" lng="-82.3192"/>
  <point lat="37.4934" lng="-82.0308"/>
  <point lat="37.6653" lng="-82.2121"/>
  <point lat="37.8618" lng="-82.4016"/>
  <point lat="37.9908" lng="-82.5073"/>
  <point lat="38.1778" lng="-82.6392"/>
  <point lat="38.3761" lng="-82.5952"/>
  <point lat="38.5030" lng="-82.7477"/>
  <point lat="38.5825" lng="-82.8369"/>
  <point lat="38.7316" lng="-82.9015"/>
  <point lat="38.7027" lng="-83.0196"/>
  <point lat="38.6190" lng="-83.1418"/>
  <point lat="38.5986" lng="-83.2819"/>
  <point lat="38.6941" lng="-83.5291"/>
  <point lat="38.6351" lng="-83.6595"/>
  <point lat="38.7487" lng="-83.8930"/>
  <point lat="38.7701" lng="-84.0440"/>
  <point lat="38.8119" lng="-84.2184"/>
  <point lat="38.9872" lng="-84.3228"/>
  <point lat="39.1013" lng="-84.4917"/>
  <point lat="39.1183" lng="-84.6277"/>
  <point lat="39.1439" lng="-84.7554"/>
  <point lat="39.0523" lng="-84.8914"/>
  <point lat="38.9263" lng="-84.8735"/>
  <point lat="38.7894" lng="-84.8131"/>
  <point lat="38.7691" lng="-84.9957"/>
  <point lat="38.6866" lng="-85.1921"/>
  <point lat="38.7209" lng="-85.4407"/>
  <point lat="38.5653" lng="-85.4077"/>
  <point lat="38.4461" lng="-85.5972"/>
  <point lat="38.2748" lng="-85.7455"/>
  <point lat="38.2716" lng="-85.8087"/>
  <point lat="38.2069" lng="-85.8650"/>
  <point lat="38.0286" lng="-85.9323"/>
  <point lat="37.9550" lng="-86.0422"/>
  <point lat="38.0135" lng="-86.1108"/>
  <point lat="38.0643" lng="-86.2756"/>
  <point lat="38.1389" lng="-86.2770"/>
  <point lat="38.1864" lng="-86.3690"/>
  <point lat="38.0308" lng="-86.5283"/>
  <point lat="37.9204" lng="-86.5874"/>
  <point lat="37.8423" lng="-86.6423"/>
  <point lat="37.9041" lng="-86.6547"/>
  <point lat="37.9864" lng="-86.8250"/>
  <point lat="37.9095" lng="-87.0406"/>
  <point lat="37.7935" lng="-87.1381"/>
  <point lat="37.9420" lng="-87.4168"/>
  <point lat="37.9745" lng="-87.5858"/>
  <point lat="37.8749" lng="-87.6929"/>
  <point lat="37.9215" lng="-87.8906"/>
  <point lat="37.7761" lng="-87.9552"/>
  <point lat="37.7903" lng="-88.0307"/>
  <point lat="37.6479" lng="-88.1584"/>
  <point lat="37.5097" lng="-88.0664"/>
  <point lat="37.4193" lng="-88.4180"/>
  <point lat="37.2784" lng="-88.5086"/>
  <point lat="37.1428" lng="-88.4248"/>
  <point lat="37.0738" lng="-88.5059"/>
  <point lat="37.1461" lng="-88.7421"/>
  <point lat="37.2249" lng="-88.9522"/>
  <point lat="37.1406" lng="-89.0964"/>
  <point lat="37.0278" lng="-89.1815"/>
  <point lat="36.9488" lng="-89.1032"/>
  <point lat="36.8214" lng="-89.1733"/>
  <point lat="36.7411" lng="-89.1925"/>
  <point lat="36.6265" lng="-89.2007"/>
  <point lat="36.5449" lng="-89.2529"/>
  <point lat="36.6232" lng="-89.3518"/>
  <point lat="36.4986" lng="-89.5345"/>
</state>
<state name ="Louisiana" colour="#008800" >
  <point lat="33.0225" lng="-94.0430"/>
  <point lat="33.0179" lng="-93.0048"/>
  <point lat="33.0087" lng="-91.1646"/>
  <point lat="32.9269" lng="-91.2209"/>
  <point lat="32.8773" lng="-91.1220"/>
  <point lat="32.8358" lng="-91.1481"/>
  <point lat="32.7642" lng="-91.1412"/>
  <point lat="32.6382" lng="-91.1536"/>
  <point lat="32.5804" lng="-91.1069"/>
  <point lat="32.6093" lng="-91.0080"/>
  <point lat="32.4588" lng="-91.0904"/>
  <point lat="32.4379" lng="-91.0355"/>
  <point lat="32.3742" lng="-91.0286"/>
  <point lat="32.3150" lng="-90.9064"/>
  <point lat="32.2616" lng="-90.9723"/>
  <point lat="32.1942" lng="-91.0464"/>
  <point lat="32.1198" lng="-91.0739"/>
  <point lat="32.0593" lng="-91.0464"/>
  <point lat="31.9918" lng="-91.1014"/>
  <point lat="31.9498" lng="-91.1865"/>
  <point lat="31.8262" lng="-91.3101"/>
  <point lat="31.7947" lng="-91.3527"/>
  <point lat="31.6230" lng="-91.3925"/>
  <point lat="31.6218" lng="-91.5134"/>
  <point lat="31.5668" lng="-91.4310"/>
  <point lat="31.5130" lng="-91.5161"/>
  <point lat="31.3701" lng="-91.5244"/>
  <point lat="31.2598" lng="-91.5477"/>
  <point lat="31.2692" lng="-91.6425"/>
  <point lat="31.2328" lng="-91.6603"/>
  <point lat="31.1917" lng="-91.5848"/>
  <point lat="31.1047" lng="-91.6287"/>
  <point lat="31.0318" lng="-91.5614"/>
  <point lat="30.9988" lng="-91.6397"/>
  <point lat="31.0012" lng="-89.7336"/>
  <point lat="30.6686" lng="-89.8517"/>
  <point lat="30.5386" lng="-89.7858"/>
  <point lat="30.3148" lng="-89.6347"/>
  <point lat="30.1807" lng="-89.5688"/>
  <point lat="30.1582" lng="-89.4960"/>
  <point lat="30.2140" lng="-89.1843"/>
  <point lat="30.1463" lng="-89.0373"/>
  <point lat="30.0905" lng="-88.8354"/>
  <point lat="29.8383" lng="-88.7421"/>
  <point lat="29.5758" lng="-88.8712"/>
  <point lat="29.1833" lng="-88.9371"/>
  <point lat="28.9649" lng="-89.0359"/>
  <point lat="28.8832" lng="-89.2282"/>
  <point lat="28.9048" lng="-89.4754"/>
  <point lat="29.1210" lng="-89.7418"/>
  <point lat="28.9529" lng="-90.1126"/>
  <point lat="28.9120" lng="-90.6619"/>
  <point lat="28.9553" lng="-91.0355"/>
  <point lat="29.1210" lng="-91.3211"/>
  <point lat="29.2864" lng="-91.9061"/>
  <point lat="29.4360" lng="-92.7452"/>
  <point lat="29.6009" lng="-93.8177"/>
  <point lat="29.6749" lng="-93.8631"/>
  <point lat="29.7370" lng="-93.8933"/>
  <point lat="29.7930" lng="-93.9304"/>
  <point lat="29.8216" lng="-93.9276"/>
  <point lat="29.8883" lng="-93.8370"/>
  <point lat="29.9811" lng="-93.7985"/>
  <point lat="30.0144" lng="-93.7601"/>
  <point lat="30.0691" lng="-93.7106"/>
  <point lat="30.0929" lng="-93.7354"/>
  <point lat="30.1166" lng="-93.6996"/>
  <point lat="30.1997" lng="-93.7271"/>
  <point lat="30.2899" lng="-93.7106"/>
  <point lat="30.3350" lng="-93.7656"/>
  <point lat="30.3871" lng="-93.7601"/>
  <point lat="30.4416" lng="-93.6914"/>
  <point lat="30.5102" lng="-93.7106"/>
  <point lat="30.5433" lng="-93.7463"/>
  <point lat="30.5954" lng="-93.7106"/>
  <point lat="30.5906" lng="-93.6914"/>
  <point lat="30.6545" lng="-93.6859"/>
  <point lat="30.6781" lng="-93.6365"/>
  <point lat="30.7513" lng="-93.6200"/>
  <point lat="30.7890" lng="-93.5925"/>
  <point lat="30.8150" lng="-93.5513"/>
  <point lat="30.8645" lng="-93.5623"/>
  <point lat="30.8881" lng="-93.5788"/>
  <point lat="30.9187" lng="-93.5541"/>
  <point lat="30.9423" lng="-93.5294"/>
  <point lat="31.0082" lng="-93.5760"/>
  <point lat="31.0318" lng="-93.5101"/>
  <point lat="31.0906" lng="-93.5596"/>
  <point lat="31.1211" lng="-93.5321"/>
  <point lat="31.1799" lng="-93.5349"/>
  <point lat="31.1658" lng="-93.5953"/>
  <point lat="31.2292" lng="-93.6282"/>
  <point lat="31.2668" lng="-93.6118"/>
  <point lat="31.3044" lng="-93.6859"/>
  <point lat="31.3888" lng="-93.6694"/>
  <point lat="31.4240" lng="-93.7051"/>
  <point lat="31.4427" lng="-93.6859"/>
  <point lat="31.4755" lng="-93.7573"/>
  <point lat="31.5083" lng="-93.7189"/>
  <point lat="31.5411" lng="-93.8040"/>
  <point lat="31.6113" lng="-93.8425"/>
  <point lat="31.6581" lng="-93.8205"/>
  <point lat="31.7071" lng="-93.7985"/>
  <point lat="31.8029" lng="-93.8480"/>
  <point lat="31.8892" lng="-93.9029"/>
  <point lat="31.9149" lng="-93.9606"/>
  <point lat="32.0081" lng="-94.0430"/>
  <point lat="32.7041" lng="-94.0430"/>
  <point lat="33.0225" lng="-94.0430"/>
</state>
<state name ="Massachusetts" colour="#0000ff" >
  <point lat="42.0003" lng="-72.7789"/>
  <point lat="42.0330" lng="-72.7405"/>
  <point lat="42.0330" lng="-72.3779"/>
  <point lat="42.0228" lng="-71.7984"/>
  <point lat="42.0085" lng="-71.8011"/>
  <point lat="42.0197" lng="-71.3850"/>
  <point lat="41.8961" lng="-71.3837"/>
  <point lat="41.8982" lng="-71.3411"/>
  <point lat="41.8358" lng="-71.3370"/>
  <point lat="41.8245" lng="-71.3493"/>
  <point lat="41.7816" lng="-71.3342"/>
  <point lat="41.7529" lng="-71.2628"/>
  <point lat="41.6719" lng="-71.1914"/>
  <point lat="41.6616" lng="-71.1351"/>
  <point lat="41.6124" lng="-71.1433"/>
  <point lat="41.5939" lng="-71.1310"/>
  <point lat="41.4973" lng="-71.1214"/>
  <point lat="41.3149" lng="-71.0266"/>
  <point lat="41.1590" lng="-70.8316"/>
  <point lat="41.1662" lng="-69.9225"/>
  <point lat="41.3201" lng="-69.7948"/>
  <point lat="41.8133" lng="-69.7398"/>
  <point lat="42.1939" lng="-70.0337"/>
  <point lat="42.2173" lng="-70.5144"/>
  <point lat="42.4133" lng="-70.6984"/>
  <point lat="42.6420" lng="-70.3647"/>
  <point lat="42.8286" lng="-70.4759"/>
  <point lat="42.8760" lng="-70.6133"/>
  <point lat="42.8619" lng="-70.8440"/>
  <point lat="42.8890" lng="-70.9154"/>
  <point lat="42.8075" lng="-71.0651"/>
  <point lat="42.8226" lng="-71.1337"/>
  <point lat="42.7873" lng="-71.1859"/>
  <point lat="42.7369" lng="-71.1832"/>
  <point lat="42.7470" lng="-71.2189"/>
  <point lat="42.7400" lng="-71.2560"/>
  <point lat="42.6986" lng="-71.2985"/>
  <point lat="42.7127" lng="-71.9151"/>
  <point lat="42.7309" lng="-72.5441"/>
  <point lat="42.7450" lng="-73.2541"/>
  <point lat="42.7460" lng="-73.2664"/>
  <point lat="42.5460" lng="-73.3406"/>
  <point lat="42.2671" lng="-73.4436"/>
  <point lat="42.1349" lng="-73.4917"/>
  <point lat="42.0880" lng="-73.5081"/>
  <point lat="42.0483" lng="-73.4985"/>
  <point lat="42.0452" lng="-73.1841"/>
  <point lat="42.0371" lng="-72.8146"/>
  <point lat="41.9962" lng="-72.8160"/>
  <point lat="42.0024" lng="-72.7803"/>
</state>
<state name ="Maryland" colour="#0000ff" >
  <point lat="39.7220" lng="-79.4778"/>
  <point lat="39.7220" lng="-78.3600"/>
  <point lat="39.7220" lng="-75.7878"/>
  <point lat="39.5655" lng="-75.7809"/>
  <point lat="39.3152" lng="-75.7617"/>
  <point lat="38.9498" lng="-75.7329"/>
  <point lat="38.4611" lng="-75.6944"/>
  <point lat="38.4482" lng="-74.8581"/>
  <point lat="38.0200" lng="-74.9721"/>
  <point lat="38.0275" lng="-75.2316"/>
  <point lat="37.9962" lng="-75.6079"/>
  <point lat="37.9951" lng="-75.6230"/>
  <point lat="37.9464" lng="-75.6436"/>
  <point lat="37.9529" lng="-75.7288"/>
  <point lat="37.9117" lng="-75.8084"/>
  <point lat="37.9095" lng="-75.9512"/>
  <point lat="37.9464" lng="-75.9430"/>
  <point lat="37.9529" lng="-76.0584"/>
  <point lat="37.8889" lng="-76.2396"/>
  <point lat="37.9474" lng="-76.3454"/>
  <point lat="37.9669" lng="-76.4154"/>
  <point lat="38.0146" lng="-76.4703"/>
  <point lat="38.0275" lng="-76.5170"/>
  <point lat="38.0751" lng="-76.5363"/>
  <point lat="38.1464" lng="-76.6063"/>
  <point lat="38.1616" lng="-76.6928"/>
  <point lat="38.1670" lng="-76.7601"/>
  <point lat="38.1637" lng="-76.8494"/>
  <point lat="38.2080" lng="-76.9482"/>
  <point lat="38.2748" lng="-76.9908"/>
  <point lat="38.3093" lng="-77.0306"/>
  <point lat="38.3761" lng="-77.0114"/>
  <point lat="38.4009" lng="-77.0430"/>
  <point lat="38.3697" lng="-77.0897"/>
  <point lat="38.3697" lng="-77.1432"/>
  <point lat="38.3320" lng="-77.2627"/>
  <point lat="38.4525" lng="-77.3135"/>
  <point lat="38.5514" lng="-77.2737"/>
  <point lat="38.5954" lng="-77.2490"/>
  <point lat="38.6373" lng="-77.1281"/>
  <point lat="38.6737" lng="-77.1378"/>
  <point lat="38.7112" lng="-77.0760"/>
  <point lat="38.7187" lng="-77.0361"/>
  <point lat="38.7766" lng="-77.0416"/>
  <point lat="38.8451" lng="-77.0320"/>
  <point lat="38.9025" lng="-77.0708"/>
  <point lat="38.9570" lng="-77.1395"/>
  <point lat="38.9773" lng="-77.2335"/>
  <point lat="39.0240" lng="-77.2462"/>
  <point lat="39.0634" lng="-77.3431"/>
  <point lat="39.0717" lng="-77.4351"/>
  <point lat="39.0792" lng="-77.4636"/>
  <point lat="39.1218" lng="-77.5202"/>
  <point lat="39.1804" lng="-77.5092"/>
  <point lat="39.2269" lng="-77.4577"/>
  <point lat="39.3051" lng="-77.5666"/>
  <point lat="39.3067" lng="-77.6321"/>
  <point lat="39.3202" lng="-77.7159"/>
  <point lat="39.3383" lng="-77.7626"/>
  <point lat="39.3810" lng="-77.7544"/>
  <point lat="39.4288" lng="-77.7602"/>
  <point lat="39.4367" lng="-77.8038"/>
  <point lat="39.4606" lng="-77.7997"/>
  <point lat="39.5019" lng="-77.7859"/>
  <point lat="39.5062" lng="-77.8436"/>
  <point lat="39.5210" lng="-77.8217"/>
  <point lat="39.5337" lng="-77.8354"/>
  <point lat="39.5231" lng="-77.8656"/>
  <point lat="39.5591" lng="-77.8848"/>
  <point lat="39.6015" lng="-77.8821"/>
  <point lat="39.6078" lng="-77.9974"/>
  <point lat="39.6247" lng="-78.0222"/>
  <point lat="39.6924" lng="-78.1430"/>
  <point lat="39.6945" lng="-78.1924"/>
  <point lat="39.6839" lng="-78.2062"/>
  <point lat="39.6839" lng="-78.2419"/>
  <point lat="39.6586" lng="-78.2281"/>
  <point lat="39.6226" lng="-78.2776"/>
  <point lat="39.6438" lng="-78.3517"/>
  <point lat="39.6120" lng="-78.3765"/>
  <point lat="39.6036" lng="-78.4067"/>
  <point lat="39.5824" lng="-78.4177"/>
  <point lat="39.5750" lng="-78.4245"/>
  <point lat="39.5464" lng="-78.4232"/>
  <point lat="39.5146" lng="-78.4698"/>
  <point lat="39.5189" lng="-78.5687"/>
  <point lat="39.5337" lng="-78.6676"/>
  <point lat="39.5888" lng="-78.7390"/>
  <point lat="39.6015" lng="-78.7720"/>
  <point lat="39.6184" lng="-78.7363"/>
  <point lat="39.6438" lng="-78.7775"/>
  <point lat="39.6036" lng="-78.7912"/>
  <point lat="39.6036" lng="-78.8187"/>
  <point lat="39.5549" lng="-78.8571"/>
  <point lat="39.4913" lng="-78.9203"/>
  <point lat="39.4426" lng="-78.9725"/>
  <point lat="39.4834" lng="-79.0542"/>
  <point lat="39.4738" lng="-79.0604"/>
  <point lat="39.4553" lng="-79.1043"/>
  <point lat="39.3853" lng="-79.1936"/>
  <point lat="39.3449" lng="-79.2705"/>
  <point lat="39.3014" lng="-79.3282"/>
  <point lat="39.2535" lng="-79.4044"/>
  <point lat="39.2073" lng="-79.4696"/>
  <point lat="39.2051" lng="-79.4861"/>
  <point lat="39.2546" lng="-79.4861"/>
  <point lat="39.3444" lng="-79.4854"/>
  <point lat="39.3454" lng="-79.4840"/>
  <point lat="39.5316" lng="-79.4833"/>
  <point lat="39.7214" lng="-79.4772"/>
</state>
<state name ="Maine" colour="#0000ff" >
  <point lat="45.3425" lng="-71.0129"/>
  <point lat="45.3328" lng="-70.9525"/>
  <point lat="45.2294" lng="-70.8618"/>
  <point lat="45.3917" lng="-70.8247"/>
  <point lat="45.4274" lng="-70.7808"/>
  <point lat="45.3830" lng="-70.6380"/>
  <point lat="45.5092" lng="-70.7190"/>
  <point lat="45.6544" lng="-70.5721"/>
  <point lat="45.7292" lng="-70.3894"/>
  <point lat="45.7924" lng="-70.4169"/>
  <point lat="45.9368" lng="-70.2493"/>
  <point lat="45.9597" lng="-70.3098"/>
  <point lat="46.0923" lng="-70.2946"/>
  <point lat="46.0989" lng="-70.2589"/>
  <point lat="46.1342" lng="-70.2466"/>
  <point lat="46.1903" lng="-70.2905"/>
  <point lat="46.2710" lng="-70.2466"/>
  <point lat="46.3270" lng="-70.2040"/>
  <point lat="46.4151" lng="-70.0571"/>
  <point lat="46.6956" lng="-69.9994"/>
  <point lat="47.4550" lng="-69.2303"/>
  <point lat="47.4132" lng="-69.0381"/>
  <point lat="47.2578" lng="-69.0504"/>
  <point lat="47.1748" lng="-68.8843"/>
  <point lat="47.2643" lng="-68.6206"/>
  <point lat="47.3546" lng="-68.3350"/>
  <point lat="47.3165" lng="-68.1564"/>
  <point lat="47.1038" lng="-67.8804"/>
  <point lat="47.0664" lng="-67.7898"/>
  <point lat="45.9359" lng="-67.7802"/>
  <point lat="45.9177" lng="-67.7527"/>
  <point lat="45.7599" lng="-67.8090"/>
  <point lat="45.6208" lng="-67.6524"/>
  <point lat="45.5987" lng="-67.4533"/>
  <point lat="45.5044" lng="-67.4176"/>
  <point lat="45.4823" lng="-67.5014"/>
  <point lat="45.3714" lng="-67.4231"/>
  <point lat="45.2768" lng="-67.4863"/>
  <point lat="45.1297" lng="-67.3434"/>
  <point lat="45.1830" lng="-67.2487"/>
  <point lat="45.1230" lng="-67.1223"/>
  <point lat="44.8315" lng="-66.9672"/>
  <point lat="44.7409" lng="-66.8628"/>
  <point lat="44.4945" lng="-67.3105"/>
  <point lat="44.3268" lng="-67.9051"/>
  <point lat="43.8702" lng="-68.6673"/>
  <point lat="43.7274" lng="-68.8431"/>
  <point lat="43.6639" lng="-69.7137"/>
  <point lat="43.5625" lng="-70.0818"/>
  <point lat="42.9182" lng="-70.5569"/>
  <point lat="43.0649" lng="-70.7108"/>
  <point lat="43.1391" lng="-70.8302"/>
  <point lat="43.2292" lng="-70.8179"/>
  <point lat="43.3631" lng="-70.9799"/>
  <point lat="43.5675" lng="-70.9717"/>
  <point lat="45.3029" lng="-71.0829"/>
</state>
<state name ="Michigan" colour="#FF0000" >
  <point lat="48.3033" lng="-88.3713"/>
  <point lat="48.0101" lng="-87.6050"/>
  <point lat="46.8902" lng="-84.8584"/>
  <point lat="46.6362" lng="-84.7650"/>
  <point lat="46.4606" lng="-84.5563"/>
  <point lat="46.4525" lng="-84.4780"/>
  <point lat="46.4894" lng="-84.4450"/>
  <point lat="46.5008" lng="-84.4203"/>
  <point lat="46.4989" lng="-84.3956"/>
  <point lat="46.5093" lng="-84.3750"/>
  <point lat="46.5069" lng="-84.3386"/>
  <point lat="46.4927" lng="-84.2905"/>
  <point lat="46.4951" lng="-84.2651"/>
  <point lat="46.5343" lng="-84.2253"/>
  <point lat="46.5404" lng="-84.1951"/>
  <point lat="46.5272" lng="-84.1779"/>
  <point lat="46.5348" lng="-84.1347"/>
  <point lat="46.5041" lng="-84.1113"/>
  <point lat="46.4189" lng="-84.1457"/>
  <point lat="46.3720" lng="-84.1395"/>
  <point lat="46.3218" lng="-84.1058"/>
  <point lat="46.3147" lng="-84.1203"/>
  <point lat="46.2672" lng="-84.1148"/>
  <point lat="46.2563" lng="-84.0969"/>
  <point lat="46.2411" lng="-84.1093"/>
  <point lat="46.2098" lng="-84.0859"/>
  <point lat="46.1879" lng="-84.0777"/>
  <point lat="46.1508" lng="-84.0097"/>
  <point lat="46.1180" lng="-84.0070"/>
  <point lat="46.1018" lng="-83.9761"/>
  <point lat="46.0570" lng="-83.9555"/>
  <point lat="46.0604" lng="-83.9040"/>
  <point lat="46.1185" lng="-83.8264"/>
  <point lat="46.1028" lng="-83.7598"/>
  <point lat="46.1218" lng="-83.6547"/>
  <point lat="46.1056" lng="-83.5723"/>
  <point lat="45.9993" lng="-83.4343"/>
  <point lat="45.8211" lng="-83.5977"/>
  <point lat="45.3396" lng="-82.5197"/>
  <point lat="43.5918" lng="-82.1221"/>
  <point lat="43.0112" lng="-82.4119"/>
  <point lat="42.9956" lng="-82.4249"/>
  <point lat="42.9579" lng="-82.4236"/>
  <point lat="42.9021" lng="-82.4648"/>
  <point lat="42.8543" lng="-82.4689"/>
  <point lat="42.8100" lng="-82.4826"/>
  <point lat="42.7863" lng="-82.4723"/>
  <point lat="42.7339" lng="-82.4847"/>
  <point lat="42.6855" lng="-82.5032"/>
  <point lat="42.6380" lng="-82.5108"/>
  <point lat="42.6036" lng="-82.5307"/>
  <point lat="42.5672" lng="-82.5774"/>
  <point lat="42.5490" lng="-82.5993"/>
  <point lat="42.5521" lng="-82.6501"/>
  <point lat="42.5354" lng="-82.6680"/>
  <point lat="42.4746" lng="-82.7257"/>
  <point lat="42.4726" lng="-82.7250"/>
  <point lat="42.3738" lng="-82.8280"/>
  <point lat="42.3469" lng="-82.9440"/>
  <point lat="42.3382" lng="-82.9550"/>
  <point lat="42.3098" lng="-83.0779"/>
  <point lat="42.2392" lng="-83.1294"/>
  <point lat="42.1741" lng="-83.1342"/>
  <point lat="42.1267" lng="-83.1212"/>
  <point lat="42.0411" lng="-83.1493"/>
  <point lat="41.9600" lng="-83.1116"/>
  <point lat="41.7344" lng="-83.4164"/>
  <point lat="41.7211" lng="-83.8724"/>
  <point lat="41.7057" lng="-84.3736"/>
  <point lat="41.6965" lng="-84.8062"/>
  <point lat="41.7611" lng="-84.8076"/>
  <point lat="41.7621" lng="-87.2067"/>
  <point lat="42.4934" lng="-87.0241"/>
  <point lat="43.3771" lng="-87.1477"/>
  <point lat="43.7056" lng="-87.1216"/>
  <point lat="43.9958" lng="-87.0474"/>
  <point lat="44.1674" lng="-86.9939"/>
  <point lat="44.4720" lng="-86.8662"/>
  <point lat="44.8841" lng="-86.6849"/>
  <point lat="45.0813" lng="-86.5009"/>
  <point lat="45.2353" lng="-86.2495"/>
  <point lat="45.4438" lng="-86.7563"/>
  <point lat="45.4438" lng="-87.0996"/>
  <point lat="45.3772" lng="-87.1518"/>
  <point lat="45.3502" lng="-87.1710"/>
  <point lat="45.2401" lng="-87.3166"/>
  <point lat="45.2024" lng="-87.4059"/>
  <point lat="45.0774" lng="-87.4416"/>
  <point lat="45.0910" lng="-87.5912"/>
  <point lat="45.1036" lng="-87.6407"/>
  <point lat="45.2207" lng="-87.6970"/>
  <point lat="45.3367" lng="-87.6476"/>
  <point lat="45.3878" lng="-87.6984"/>
  <point lat="45.3425" lng="-87.8494"/>
  <point lat="45.5025" lng="-87.7959"/>
  <point lat="45.6726" lng="-87.7890"/>
  <point lat="45.7570" lng="-87.9318"/>
  <point lat="45.7953" lng="-87.9922"/>
  <point lat="45.8058" lng="-88.1186"/>
  <point lat="45.8585" lng="-88.0870"/>
  <point lat="45.9531" lng="-88.1955"/>
  <point lat="45.9722" lng="-88.3438"/>
  <point lat="45.9836" lng="-88.3891"/>
  <point lat="46.0113" lng="-88.5457"/>
  <point lat="45.9970" lng="-88.7022"/>
  <point lat="46.0227" lng="-88.8135"/>
  <point lat="46.0418" lng="-88.8547"/>
  <point lat="46.1408" lng="-89.0936"/>
  <point lat="46.3384" lng="-90.1222"/>
  <point lat="46.5692" lng="-90.4175"/>
  <point lat="46.9034" lng="-90.2019"/>
  <point lat="47.2913" lng="-89.9547"/>
  <point lat="48.0129" lng="-89.4946"/>
  <point lat="47.9743" lng="-89.3381"/>
  <point lat="48.2448" lng="-88.6761"/>
  <point lat="48.3042" lng="-88.3726"/>
</state>
<state name ="Minnesota" colour="#0000ff" >
  <point lat="43.5008" lng="-96.4517"/>
  <point lat="43.5017" lng="-91.2195"/>
  <point lat="43.8226" lng="-91.3101"/>
  <point lat="43.9651" lng="-91.4914"/>
  <point lat="44.1113" lng="-91.7084"/>
  <point lat="44.2806" lng="-91.8951"/>
  <point lat="44.3710" lng="-91.9556"/>
  <point lat="44.4357" lng="-92.2083"/>
  <point lat="44.5513" lng="-92.3360"/>
  <point lat="44.6501" lng="-92.6367"/>
  <point lat="44.7877" lng="-92.7658"/>
  <point lat="45.3135" lng="-92.7081"/>
  <point lat="45.4505" lng="-92.6532"/>
  <point lat="45.6083" lng="-92.8482"/>
  <point lat="45.8307" lng="-92.7356"/>
  <point lat="45.9760" lng="-92.5159"/>
  <point lat="46.0151" lng="-92.3566"/>
  <point lat="46.0789" lng="-92.2934"/>
  <point lat="46.5957" lng="-92.2879"/>
  <point lat="47.3072" lng="-90.6564"/>
  <point lat="47.2885" lng="-89.9615"/>
  <point lat="48.0120" lng="-89.4919"/>
  <point lat="48.0193" lng="-89.7583"/>
  <point lat="48.0285" lng="-89.9931"/>
  <point lat="48.0827" lng="-90.0261"/>
  <point lat="48.1074" lng="-90.1758"/>
  <point lat="48.0955" lng="-90.3502"/>
  <point lat="48.1074" lng="-90.4834"/>
  <point lat="48.1175" lng="-90.5644"/>
  <point lat="48.0928" lng="-90.7471"/>
  <point lat="48.1588" lng="-90.7759"/>
  <point lat="48.2402" lng="-90.8405"/>
  <point lat="48.2174" lng="-90.9792"/>
  <point lat="48.0726" lng="-91.3252"/>
  <point lat="48.0505" lng="-91.5738"/>
  <point lat="48.1166" lng="-91.7070"/>
  <point lat="48.1963" lng="-91.7166"/>
  <point lat="48.2494" lng="-91.9844"/>
  <point lat="48.3188" lng="-92.0078"/>
  <point lat="48.3544" lng="-92.0531"/>
  <point lat="48.3599" lng="-92.1561"/>
  <point lat="48.3307" lng="-92.2975"/>
  <point lat="48.2475" lng="-92.2742"/>
  <point lat="48.2228" lng="-92.3717"/>
  <point lat="48.3854" lng="-92.4609"/>
  <point lat="48.4474" lng="-92.5104"/>
  <point lat="48.4611" lng="-92.7122"/>
  <point lat="48.4984" lng="-92.6340"/>
  <point lat="48.5403" lng="-92.6395"/>
  <point lat="48.6393" lng="-93.2066"/>
  <point lat="48.5884" lng="-93.4648"/>
  <point lat="48.5439" lng="-93.4621"/>
  <point lat="48.5166" lng="-93.8013"/>
  <point lat="48.6284" lng="-93.8356"/>
  <point lat="48.6547" lng="-94.2531"/>
  <point lat="48.7046" lng="-94.2792"/>
  <point lat="48.6982" lng="-94.4467"/>
  <point lat="48.7861" lng="-94.6925"/>
  <point lat="48.8756" lng="-94.6788"/>
  <point lat="49.0955" lng="-94.7488"/>
  <point lat="49.1889" lng="-94.7955"/>
  <point lat="49.3189" lng="-94.8175"/>
  <point lat="49.3815" lng="-94.9631"/>
  <point lat="49.3538" lng="-95.0400"/>
  <point lat="49.3681" lng="-95.1196"/>
  <point lat="49.3877" lng="-95.1553"/>
  <point lat="48.9991" lng="-95.1553"/>
  <point lat="49.0000" lng="-97.2304"/>
  <point lat="48.8647" lng="-97.1851"/>
  <point lat="48.7806" lng="-97.1576"/>
  <point lat="48.6683" lng="-97.1040"/>
  <point lat="48.5539" lng="-97.1645"/>
  <point lat="48.2832" lng="-97.1411"/>
  <point lat="48.1578" lng="-97.1397"/>
  <point lat="47.9633" lng="-97.0587"/>
  <point lat="47.7098" lng="-96.9434"/>
  <point lat="47.5821" lng="-96.8582"/>
  <point lat="47.2345" lng="-96.8335"/>
  <point lat="46.6702" lng="-96.8005"/>
  <point lat="46.5135" lng="-96.7126"/>
  <point lat="46.2786" lng="-96.6028"/>
  <point lat="46.0189" lng="-96.5767"/>
  <point lat="45.8173" lng="-96.5891"/>
  <point lat="45.6169" lng="-96.8486"/>
  <point lat="45.4601" lng="-96.7456"/>
  <point lat="45.3676" lng="-96.5918"/>
  <point lat="45.2961" lng="-96.4558"/>
  <point lat="43.5008" lng="-96.4531"/>
</state>
<state name ="Missouri" colour="#000088" >
  <point lat="40.6181" lng="-91.7468"/>
  <point lat="40.5597" lng="-91.6809"/>
  <point lat="40.5472" lng="-91.6260"/>
  <point lat="40.4658" lng="-91.5463"/>
  <point lat="40.3675" lng="-91.4337"/>
  <point lat="40.1663" lng="-91.5161"/>
  <point lat="39.9866" lng="-91.4900"/>
  <point lat="39.8634" lng="-91.4447"/>
  <point lat="39.7283" lng="-91.3623"/>
  <point lat="39.6861" lng="-91.3074"/>
  <point lat="39.5464" lng="-91.1096"/>
  <point lat="39.4022" lng="-90.9558"/>
  <point lat="39.2450" lng="-90.7306"/>
  <point lat="38.9893" lng="-90.6812"/>
  <point lat="38.8697" lng="-90.5878"/>
  <point lat="38.9722" lng="-90.4504"/>
  <point lat="38.8868" lng="-90.1813"/>
  <point lat="38.8269" lng="-90.1154"/>
  <point lat="38.7155" lng="-90.1978"/>
  <point lat="38.4149" lng="-90.3186"/>
  <point lat="38.2597" lng="-90.3790"/>
  <point lat="37.9572" lng="-89.9341"/>
  <point lat="37.6925" lng="-89.5331"/>
  <point lat="37.4007" lng="-89.4287"/>
  <point lat="37.2784" lng="-89.5386"/>
  <point lat="37.1734" lng="-89.4452"/>
  <point lat="37.0859" lng="-89.3793"/>
  <point lat="36.9938" lng="-89.2859"/>
  <point lat="37.0311" lng="-89.1829"/>
  <point lat="36.9839" lng="-89.1403"/>
  <point lat="36.9466" lng="-89.1005"/>
  <point lat="36.7884" lng="-89.1788"/>
  <point lat="36.6288" lng="-89.2035"/>
  <point lat="36.5449" lng="-89.2516"/>
  <point lat="36.6188" lng="-89.3532"/>
  <point lat="36.5538" lng="-89.4397"/>
  <point lat="36.4942" lng="-89.5358"/>
  <point lat="36.3594" lng="-89.5331"/>
  <point lat="36.2509" lng="-89.5345"/>
  <point lat="36.0891" lng="-89.6100"/>
  <point lat="36.0002" l

SYMBOL INDEX (968 symbols across 73 files)

FILE: first-edition/code-python3/clustering.py
  class KMeans (line 6) | class KMeans:
    method __init__ (line 9) | def __init__(self, k):
    method classify (line 13) | def classify(self, input):
    method train (line 18) | def train(self, inputs):
  function squared_clustering_errors (line 40) | def squared_clustering_errors(inputs, k):
  function plot_squared_clustering_errors (line 50) | def plot_squared_clustering_errors():
  function recolor_image (line 65) | def recolor_image(input_file, k=5):
  function is_leaf (line 87) | def is_leaf(cluster):
  function get_children (line 91) | def get_children(cluster):
  function get_values (line 99) | def get_values(cluster):
  function cluster_distance (line 109) | def cluster_distance(cluster1, cluster2, distance_agg=min):
  function get_merge_order (line 116) | def get_merge_order(cluster):
  function bottom_up_cluster (line 122) | def bottom_up_cluster(inputs, distance_agg=min):
  function generate_clusters (line 146) | def generate_clusters(base_cluster, num_clusters):

FILE: first-edition/code-python3/databases.py
  class Table (line 4) | class Table:
    method __init__ (line 5) | def __init__(self, columns):
    method __repr__ (line 9) | def __repr__(self):
    method insert (line 13) | def insert(self, row_values):
    method update (line 19) | def update(self, updates, predicate):
    method delete (line 25) | def delete(self, predicate=lambda row: True):
    method select (line 30) | def select(self, keep_columns=None, additional_columns=None):
    method where (line 49) | def where(self, predicate=lambda row: True):
    method limit (line 55) | def limit(self, num_rows=None):
    method group_by (line 63) | def group_by(self, group_by_columns, aggregates, having=None):
    method order_by (line 83) | def order_by(self, order):
    method join (line 88) | def join(self, other_table, left_join=False):
  function name_len (line 155) | def name_len(row): return len(row["name"])
  function min_user_id (line 164) | def min_user_id(rows): return min(row["user_id"] for row in rows)
  function first_letter_of_name (line 176) | def first_letter_of_name(row):
  function average_num_friends (line 179) | def average_num_friends(rows):
  function enough_friends (line 182) | def enough_friends(rows):
  function sum_user_ids (line 195) | def sum_user_ids(rows): return sum(row["user_id"] for row in rows)
  function count_interests (line 233) | def count_interests(rows):

FILE: first-edition/code-python3/decision_trees.py
  function entropy (line 5) | def entropy(class_probabilities):
  function class_probabilities (line 9) | def class_probabilities(labels):
  function data_entropy (line 14) | def data_entropy(labeled_data):
  function partition_entropy (line 19) | def partition_entropy(subsets):
  function group_by (line 26) | def group_by(items, key_fn):
  function partition_by (line 35) | def partition_by(inputs, attribute):
  function partition_entropy_by (line 40) | def partition_entropy_by(inputs,attribute):
  function classify (line 45) | def classify(tree, input):
  function build_tree_id3 (line 63) | def build_tree_id3(inputs, split_candidates=None):
  function forest_classify (line 100) | def forest_classify(trees, input):

FILE: first-edition/code-python3/getting_data.py
  function is_video (line 13) | def is_video(td):
  function book_info (line 20) | def book_info(td):
  function scrape (line 40) | def scrape(num_pages=31):
  function get_year (line 60) | def get_year(book):
  function plot_years (line 65) | def plot_years(plt, books):
  function call_twitter_search_api (line 108) | def call_twitter_search_api():
  class MyStreamer (line 125) | class MyStreamer(TwythonStreamer):
    method on_success (line 129) | def on_success(self, data):
    method on_error (line 141) | def on_error(self, status_code, data):
  function call_twitter_streaming_api (line 145) | def call_twitter_streaming_api():
  function process (line 155) | def process(date, symbol, price):

FILE: first-edition/code-python3/gradient_descent.py
  function sum_of_squares (line 6) | def sum_of_squares(v):
  function difference_quotient (line 10) | def difference_quotient(f, x, h):
  function plot_estimated_derivative (line 13) | def plot_estimated_derivative():
  function partial_difference_quotient (line 30) | def partial_difference_quotient(f, v, i, h):
  function estimate_gradient (line 38) | def estimate_gradient(f, v, h=0.00001):
  function step (line 42) | def step(v, direction, step_size):
  function sum_of_squares_gradient (line 47) | def sum_of_squares_gradient(v):
  function safe (line 50) | def safe(f):
  function minimize_batch (line 66) | def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
  function negate (line 90) | def negate(f):
  function negate_all (line 94) | def negate_all(f):
  function maximize_batch (line 98) | def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
  function in_random_order (line 108) | def in_random_order(data):
  function minimize_stochastic (line 115) | def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0...
  function maximize_stochastic (line 145) | def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0...

FILE: first-edition/code-python3/hypothesis_and_inference.py
  function normal_approximation_to_binomial (line 4) | def normal_approximation_to_binomial(n, p):
  function normal_probability_above (line 20) | def normal_probability_above(lo, mu=0, sigma=1):
  function normal_probability_between (line 24) | def normal_probability_between(lo, hi, mu=0, sigma=1):
  function normal_probability_outside (line 28) | def normal_probability_outside(lo, hi, mu=0, sigma=1):
  function normal_upper_bound (line 38) | def normal_upper_bound(probability, mu=0, sigma=1):
  function normal_lower_bound (line 42) | def normal_lower_bound(probability, mu=0, sigma=1):
  function normal_two_sided_bounds (line 46) | def normal_two_sided_bounds(probability, mu=0, sigma=1):
  function two_sided_p_value (line 59) | def two_sided_p_value(x, mu=0, sigma=1):
  function count_extreme_values (line 67) | def count_extreme_values():
  function run_experiment (line 86) | def run_experiment():
  function reject_fairness (line 90) | def reject_fairness(experiment):
  function estimated_parameters (line 102) | def estimated_parameters(N, n):
  function a_b_test_statistic (line 107) | def a_b_test_statistic(N_A, n_A, N_B, n_B):
  function B (line 118) | def B(alpha, beta):
  function beta_pdf (line 122) | def beta_pdf(x, alpha, beta):

FILE: first-edition/code-python3/introduction.py
  function number_of_friends (line 39) | def number_of_friends(user):
  function friends_of_friend_ids_bad (line 55) | def friends_of_friend_ids_bad(user):
  function not_the_same (line 63) | def not_the_same(user, other_user):
  function not_friends (line 67) | def not_friends(user, other_user):
  function friends_of_friend_ids (line 73) | def friends_of_friend_ids(user):
  function data_scientists_who_like (line 99) | def data_scientists_who_like(target_interest):
  function most_common_interests_with (line 118) | def most_common_interests_with(user_id):
  function make_chart_salaries_by_tenure (line 136) | def make_chart_salaries_by_tenure():
  function tenure_bucket (line 156) | def tenure_bucket(tenure):
  function predict_paid_or_unpaid (line 179) | def predict_paid_or_unpaid(years_experience):

FILE: first-edition/code-python3/linear_algebra.py
  function vector_add (line 12) | def vector_add(v, w):
  function vector_subtract (line 16) | def vector_subtract(v, w):
  function vector_sum (line 20) | def vector_sum(vectors):
  function scalar_multiply (line 23) | def scalar_multiply(c, v):
  function vector_mean (line 26) | def vector_mean(vectors):
  function dot (line 32) | def dot(v, w):
  function sum_of_squares (line 36) | def sum_of_squares(v):
  function magnitude (line 40) | def magnitude(v):
  function squared_distance (line 43) | def squared_distance(v, w):
  function distance (line 46) | def distance(v, w):
  function shape (line 53) | def shape(A):
  function get_row (line 58) | def get_row(A, i):
  function get_column (line 61) | def get_column(A, j):
  function make_matrix (line 64) | def make_matrix(num_rows, num_cols, entry_fn):
  function is_diagonal (line 70) | def is_diagonal(i, j):
  function matrix_add (line 94) | def matrix_add(A, B):
  function make_graph_dot_product_as_vector_projection (line 104) | def make_graph_dot_product_as_vector_projection(plt):

FILE: first-edition/code-python3/logistic_regression.py
  function logistic (line 10) | def logistic(x):
  function logistic_prime (line 13) | def logistic_prime(x):
  function logistic_log_likelihood_i (line 16) | def logistic_log_likelihood_i(x_i, y_i, beta):
  function logistic_log_likelihood (line 22) | def logistic_log_likelihood(x, y, beta):
  function logistic_log_partial_ij (line 26) | def logistic_log_partial_ij(x_i, y_i, beta, j):
  function logistic_log_gradient_i (line 32) | def logistic_log_gradient_i(x_i, y_i, beta):
  function logistic_log_gradient (line 39) | def logistic_log_gradient(x, y, beta):

FILE: first-edition/code-python3/machine_learning.py
  function split_data (line 8) | def split_data(data, prob):
  function train_test_split (line 15) | def train_test_split(x, y, test_pct):
  function accuracy (line 26) | def accuracy(tp, fp, fn, tn):
  function precision (line 31) | def precision(tp, fp, fn, tn):
  function recall (line 34) | def recall(tp, fp, fn, tn):
  function f1_score (line 37) | def f1_score(tp, fp, fn, tn):

FILE: first-edition/code-python3/mapreduce.py
  function word_count_old (line 6) | def word_count_old(documents):
  function wc_mapper (line 12) | def wc_mapper(document):
  function wc_reducer (line 17) | def wc_reducer(word, counts):
  function word_count (line 21) | def word_count(documents):
  function map_reduce (line 35) | def map_reduce(inputs, mapper, reducer):
  function reduce_with (line 47) | def reduce_with(aggregation_fn, key, values):
  function values_reducer (line 51) | def values_reducer(aggregation_fn):
  function data_science_day_mapper (line 73) | def data_science_day_mapper(status_update):
  function words_per_user_mapper (line 83) | def words_per_user_mapper(status_update):
  function most_popular_word_reducer (line 88) | def most_popular_word_reducer(user, words_and_counts):
  function liker_mapper (line 104) | def liker_mapper(status_update):
  function matrix_multiply_mapper (line 118) | def matrix_multiply_mapper(m, element):
  function matrix_multiply_reducer (line 132) | def matrix_multiply_reducer(m, key, indexed_values):

FILE: first-edition/code-python3/multiple_regression.py
  function predict (line 10) | def predict(x_i, beta):
  function error (line 13) | def error(x_i, y_i, beta):
  function squared_error (line 16) | def squared_error(x_i, y_i, beta):
  function squared_error_gradient (line 19) | def squared_error_gradient(x_i, y_i, beta):
  function estimate_beta (line 24) | def estimate_beta(x, y):
  function multiple_r_squared (line 32) | def multiple_r_squared(x, y, beta):
  function bootstrap_sample (line 37) | def bootstrap_sample(data):
  function bootstrap_statistic (line 41) | def bootstrap_statistic(data, stats_fn, num_samples):
  function estimate_sample_beta (line 46) | def estimate_sample_beta(sample):
  function p_value (line 50) | def p_value(beta_hat_j, sigma_hat_j):
  function ridge_penalty (line 62) | def ridge_penalty(beta, alpha):
  function squared_error_ridge (line 65) | def squared_error_ridge(x_i, y_i, beta, alpha):
  function ridge_penalty_gradient (line 69) | def ridge_penalty_gradient(beta, alpha):
  function squared_error_ridge_gradient (line 73) | def squared_error_ridge_gradient(x_i, y_i, beta, alpha):
  function estimate_beta_ridge (line 79) | def estimate_beta_ridge(x, y, alpha):
  function lasso_penalty (line 90) | def lasso_penalty(beta, alpha):

FILE: first-edition/code-python3/naive_bayes.py
  function tokenize (line 5) | def tokenize(message):
  function count_words (line 11) | def count_words(training_set):
  function word_probabilities (line 19) | def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
  function spam_probability (line 27) | def spam_probability(word_probs, message):
  class NaiveBayesClassifier (line 50) | class NaiveBayesClassifier:
    method __init__ (line 52) | def __init__(self, k=0.5):
    method train (line 56) | def train(self, training_set):
    method classify (line 71) | def classify(self, message):
  function get_subject_data (line 75) | def get_subject_data(path):
  function p_spam_given_word (line 94) | def p_spam_given_word(word_prob):
  function train_and_test_model (line 98) | def train_and_test_model(path):

FILE: first-edition/code-python3/natural_language_processing.py
  function plot_resumes (line 6) | def plot_resumes(plt):
  function fix_unicode (line 32) | def fix_unicode(text):
  function get_document (line 35) | def get_document():
  function generate_using_bigrams (line 53) | def generate_using_bigrams(transitions):
  function generate_using_trigrams (line 62) | def generate_using_trigrams(starts, trigram_transitions):
  function is_terminal (line 76) | def is_terminal(token):
  function expand (line 79) | def expand(grammar, tokens):
  function generate_sentence (line 97) | def generate_sentence(grammar):
  function roll_a_die (line 104) | def roll_a_die():
  function direct_sample (line 107) | def direct_sample():
  function random_y_given_x (line 112) | def random_y_given_x(x):
  function random_x_given_y (line 116) | def random_x_given_y(y):
  function gibbs_sample (line 126) | def gibbs_sample(num_iters=100):
  function compare_distributions (line 133) | def compare_distributions(num_samples=1000):
  function sample_from (line 144) | def sample_from(weights):
  function p_topic_given_document (line 185) | def p_topic_given_document(topic, d, alpha=0.1):
  function p_word_given_topic (line 192) | def p_word_given_topic(word, topic, beta=0.1):
  function topic_weight (line 199) | def topic_weight(d, word, k):
  function choose_new_topic (line 205) | def choose_new_topic(d, word):

FILE: first-edition/code-python3/nearest_neighbors.py
  function raw_majority_vote (line 7) | def raw_majority_vote(labels):
  function majority_vote (line 12) | def majority_vote(labels):
  function knn_classify (line 26) | def knn_classify(k, labeled_points, new_point):
  function plot_state_borders (line 43) | def plot_state_borders(plt, color='0.8'):
  function plot_cities (line 46) | def plot_cities():
  function classify_and_plot_grid (line 71) | def classify_and_plot_grid(k=1):
  function random_point (line 98) | def random_point(dim):
  function random_distances (line 101) | def random_distances(dim, num_pairs):

FILE: first-edition/code-python3/network_analysis.py
  function shortest_paths_from (line 36) | def shortest_paths_from(from_user):
  function farness (line 102) | def farness(user):
  function matrix_product_entry (line 115) | def matrix_product_entry(A, B, i, j):
  function matrix_multiply (line 118) | def matrix_multiply(A, B):
  function vector_as_matrix (line 126) | def vector_as_matrix(v):
  function vector_from_matrix (line 130) | def vector_from_matrix(v_as_matrix):
  function matrix_operate (line 134) | def matrix_operate(A, v):
  function find_eigenvector (line 139) | def find_eigenvector(A, tolerance=0.00001):
  function entry_fn (line 156) | def entry_fn(i, j):
  function page_rank (line 187) | def page_rank(users, damping = 0.85, num_iters = 100):

FILE: first-edition/code-python3/neural_networks.py
  function step_function (line 8) | def step_function(x):
  function perceptron_output (line 11) | def perceptron_output(weights, bias, x):
  function sigmoid (line 15) | def sigmoid(t):
  function neuron_output (line 18) | def neuron_output(weights, inputs):
  function feed_forward (line 21) | def feed_forward(neural_network, input_vector):
  function backpropagate (line 39) | def backpropagate(network, input_vector, target):
  function patch (line 62) | def patch(x, y, hatch, color):
  function show_weights (line 69) | def show_weights(neuron_idx):
  function make_digit (line 154) | def make_digit(raw_digit):
  function predict (line 185) | def predict(input):

FILE: first-edition/code-python3/plot_state_borders.py
  function plot_state_borders (line 22) | def plot_state_borders(color='0.8'):

FILE: first-edition/code-python3/probability.py
  function random_kid (line 4) | def random_kid():
  function uniform_pdf (line 7) | def uniform_pdf(x):
  function uniform_cdf (line 10) | def uniform_cdf(x):
  function normal_pdf (line 16) | def normal_pdf(x, mu=0, sigma=1):
  function plot_normal_pdfs (line 20) | def plot_normal_pdfs(plt):
  function normal_cdf (line 29) | def normal_cdf(x, mu=0,sigma=1):
  function plot_normal_cdfs (line 32) | def plot_normal_cdfs(plt):
  function inverse_normal_cdf (line 41) | def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
  function bernoulli_trial (line 64) | def bernoulli_trial(p):
  function binomial (line 67) | def binomial(p, n):
  function make_hist (line 70) | def make_hist(p, n, num_points):

FILE: first-edition/code-python3/recommender_systems.py
  function most_popular_new_interests (line 27) | def most_popular_new_interests(user_interests, max_results=5):
  function cosine_similarity (line 37) | def cosine_similarity(v, w):
  function make_user_interest_vector (line 44) | def make_user_interest_vector(user_interests):
  function most_similar_users_to (line 56) | def most_similar_users_to(user_id):
  function user_based_suggestions (line 67) | def user_based_suggestions(user_id, include_current_interests=False):
  function most_similar_interests_to (line 99) | def most_similar_interests_to(interest_id):
  function item_based_suggestions (line 108) | def item_based_suggestions(user_id, include_current_interests=False):

FILE: first-edition/code-python3/simple_linear_regression.py
  function predict (line 7) | def predict(alpha, beta, x_i):
  function error (line 10) | def error(alpha, beta, x_i, y_i):
  function sum_of_squared_errors (line 13) | def sum_of_squared_errors(alpha, beta, x, y):
  function least_squares_fit (line 17) | def least_squares_fit(x,y):
  function total_sum_of_squares (line 24) | def total_sum_of_squares(y):
  function r_squared (line 28) | def r_squared(alpha, beta, x, y):
  function squared_error (line 35) | def squared_error(x_i, y_i, theta):
  function squared_error_gradient (line 39) | def squared_error_gradient(x_i, y_i, theta):

FILE: first-edition/code-python3/stats.py
  function make_friend_counts_histogram (line 7) | def make_friend_counts_histogram(plt):
  function mean (line 29) | def mean(x):
  function median (line 32) | def median(v):
  function quantile (line 47) | def quantile(x, p):
  function mode (line 52) | def mode(x):
  function data_range (line 60) | def data_range(x):
  function de_mean (line 63) | def de_mean(x):
  function variance (line 68) | def variance(x):
  function standard_deviation (line 74) | def standard_deviation(x):
  function interquartile_range (line 77) | def interquartile_range(x):
  function covariance (line 88) | def covariance(x, y):
  function correlation (line 92) | def correlation(x, y):

FILE: first-edition/code-python3/visualizing_data.py
  function make_chart_simple_line_chart (line 4) | def make_chart_simple_line_chart():
  function make_chart_simple_bar_chart (line 20) | def make_chart_simple_bar_chart():
  function make_chart_histogram (line 39) | def make_chart_histogram():
  function make_chart_misleading_y_axis (line 55) | def make_chart_misleading_y_axis(mislead=True):
  function make_chart_several_line_charts (line 77) | def make_chart_several_line_charts():
  function make_chart_scatter_plot (line 99) | def make_chart_scatter_plot():
  function make_chart_scatterplot_axes (line 119) | def make_chart_scatterplot_axes(equal_axes=False):
  function make_chart_pie_chart (line 136) | def make_chart_pie_chart():

FILE: first-edition/code-python3/working_with_data.py
  function bucketize (line 12) | def bucketize(point, bucket_size):
  function make_histogram (line 16) | def make_histogram(points, bucket_size):
  function plot_histogram (line 20) | def plot_histogram(points, bucket_size, title=""):
  function compare_two_distributions (line 26) | def compare_two_distributions():
  function random_normal (line 37) | def random_normal():
  function scatter (line 46) | def scatter():
  function correlation_matrix (line 54) | def correlation_matrix(data):
  function make_scatterplot_matrix (line 65) | def make_scatterplot_matrix():
  function parse_row (line 109) | def parse_row(input_row, parsers):
  function parse_rows_with (line 115) | def parse_rows_with(reader, parsers):
  function try_or_none (line 120) | def try_or_none(f):
  function parse_row (line 128) | def parse_row(input_row, parsers):
  function try_parse_field (line 132) | def try_parse_field(field_name, value, parser_dict):
  function parse_dict (line 140) | def parse_dict(input_dict, parser_dict):
  function picker (line 150) | def picker(field_name):
  function pluck (line 154) | def pluck(field_name, rows):
  function group_by (line 158) | def group_by(grouper, rows, value_transform=None):
  function percent_price_change (line 169) | def percent_price_change(yesterday, today):
  function day_over_day_changes (line 172) | def day_over_day_changes(grouped_rows):
  function scale (line 187) | def scale(data_matrix):
  function rescale (line 195) | def rescale(data_matrix):
  function de_mean_matrix (line 316) | def de_mean_matrix(A):
  function direction (line 323) | def direction(w):
  function directional_variance_i (line 327) | def directional_variance_i(x_i, w):
  function directional_variance (line 331) | def directional_variance(X, w):
  function directional_variance_gradient_i (line 335) | def directional_variance_gradient_i(x_i, w):
  function directional_variance_gradient (line 341) | def directional_variance_gradient(X, w):
  function first_principal_component (line 344) | def first_principal_component(X):
  function first_principal_component_sgd (line 352) | def first_principal_component_sgd(X):
  function project (line 360) | def project(v, w):
  function remove_projection_from_vector (line 365) | def remove_projection_from_vector(v, w):
  function remove_projection (line 369) | def remove_projection(X, w):
  function principal_component_analysis (line 374) | def principal_component_analysis(X, num_components):
  function transform_vector (line 383) | def transform_vector(v, components):
  function transform (line 386) | def transform(X, components):
  function combine_pct_changes (line 445) | def combine_pct_changes(pct_change1, pct_change2):
  function overall_change (line 448) | def overall_change(changes):

FILE: first-edition/code/clustering.py
  class KMeans (line 7) | class KMeans:
    method __init__ (line 10) | def __init__(self, k):
    method classify (line 14) | def classify(self, input):
    method train (line 19) | def train(self, inputs):
  function squared_clustering_errors (line 41) | def squared_clustering_errors(inputs, k):
  function plot_squared_clustering_errors (line 51) | def plot_squared_clustering_errors(plt):
  function recolor_image (line 66) | def recolor_image(input_file, k=5):
  function is_leaf (line 88) | def is_leaf(cluster):
  function get_children (line 92) | def get_children(cluster):
  function get_values (line 100) | def get_values(cluster):
  function cluster_distance (line 110) | def cluster_distance(cluster1, cluster2, distance_agg=min):
  function get_merge_order (line 117) | def get_merge_order(cluster):
  function bottom_up_cluster (line 123) | def bottom_up_cluster(inputs, distance_agg=min):
  function generate_clusters (line 147) | def generate_clusters(base_cluster, num_clusters):

FILE: first-edition/code/databases.py
  class Table (line 5) | class Table:
    method __init__ (line 6) | def __init__(self, columns):
    method __repr__ (line 10) | def __repr__(self):
    method insert (line 14) | def insert(self, row_values):
    method update (line 20) | def update(self, updates, predicate):
    method delete (line 26) | def delete(self, predicate=lambda row: True):
    method select (line 31) | def select(self, keep_columns=None, additional_columns=None):
    method where (line 50) | def where(self, predicate=lambda row: True):
    method limit (line 56) | def limit(self, num_rows=None):
    method group_by (line 64) | def group_by(self, group_by_columns, aggregates, having=None):
    method order_by (line 84) | def order_by(self, order):
    method join (line 89) | def join(self, other_table, left_join=False):
  function name_len (line 156) | def name_len(row): return len(row["name"])
  function min_user_id (line 165) | def min_user_id(rows): return min(row["user_id"] for row in rows)
  function first_letter_of_name (line 177) | def first_letter_of_name(row):
  function average_num_friends (line 180) | def average_num_friends(rows):
  function enough_friends (line 183) | def enough_friends(rows):
  function sum_user_ids (line 196) | def sum_user_ids(rows): return sum(row["user_id"] for row in rows)
  function count_interests (line 234) | def count_interests(rows):

FILE: first-edition/code/decision_trees.py
  function entropy (line 6) | def entropy(class_probabilities):
  function class_probabilities (line 10) | def class_probabilities(labels):
  function data_entropy (line 15) | def data_entropy(labeled_data):
  function partition_entropy (line 20) | def partition_entropy(subsets):
  function group_by (line 27) | def group_by(items, key_fn):
  function partition_by (line 36) | def partition_by(inputs, attribute):
  function partition_entropy_by (line 41) | def partition_entropy_by(inputs,attribute):
  function classify (line 46) | def classify(tree, input):
  function build_tree_id3 (line 64) | def build_tree_id3(inputs, split_candidates=None):
  function forest_classify (line 101) | def forest_classify(trees, input):

FILE: first-edition/code/getting_data.py
  function is_video (line 14) | def is_video(td):
  function book_info (line 21) | def book_info(td):
  function scrape (line 41) | def scrape(num_pages=31):
  function get_year (line 61) | def get_year(book):
  function plot_years (line 66) | def plot_years(plt, books):
  function call_twitter_search_api (line 109) | def call_twitter_search_api():
  class MyStreamer (line 126) | class MyStreamer(TwythonStreamer):
    method on_success (line 130) | def on_success(self, data):
    method on_error (line 142) | def on_error(self, status_code, data):
  function call_twitter_streaming_api (line 146) | def call_twitter_streaming_api():
  function process (line 156) | def process(date, symbol, price):

FILE: first-edition/code/gradient_descent.py
  function sum_of_squares (line 6) | def sum_of_squares(v):
  function difference_quotient (line 10) | def difference_quotient(f, x, h):
  function plot_estimated_derivative (line 13) | def plot_estimated_derivative():
  function partial_difference_quotient (line 30) | def partial_difference_quotient(f, v, i, h):
  function estimate_gradient (line 38) | def estimate_gradient(f, v, h=0.00001):
  function step (line 42) | def step(v, direction, step_size):
  function sum_of_squares_gradient (line 47) | def sum_of_squares_gradient(v):
  function safe (line 50) | def safe(f):
  function minimize_batch (line 66) | def minimize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
  function negate (line 90) | def negate(f):
  function negate_all (line 94) | def negate_all(f):
  function maximize_batch (line 98) | def maximize_batch(target_fn, gradient_fn, theta_0, tolerance=0.000001):
  function in_random_order (line 108) | def in_random_order(data):
  function minimize_stochastic (line 115) | def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0...
  function maximize_stochastic (line 145) | def maximize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0...

FILE: first-edition/code/hypothesis_and_inference.py
  function normal_approximation_to_binomial (line 5) | def normal_approximation_to_binomial(n, p):
  function normal_probability_above (line 21) | def normal_probability_above(lo, mu=0, sigma=1):
  function normal_probability_between (line 25) | def normal_probability_between(lo, hi, mu=0, sigma=1):
  function normal_probability_outside (line 29) | def normal_probability_outside(lo, hi, mu=0, sigma=1):
  function normal_upper_bound (line 39) | def normal_upper_bound(probability, mu=0, sigma=1):
  function normal_lower_bound (line 43) | def normal_lower_bound(probability, mu=0, sigma=1):
  function normal_two_sided_bounds (line 47) | def normal_two_sided_bounds(probability, mu=0, sigma=1):
  function two_sided_p_value (line 60) | def two_sided_p_value(x, mu=0, sigma=1):
  function count_extreme_values (line 68) | def count_extreme_values():
  function run_experiment (line 87) | def run_experiment():
  function reject_fairness (line 91) | def reject_fairness(experiment):
  function estimated_parameters (line 103) | def estimated_parameters(N, n):
  function a_b_test_statistic (line 108) | def a_b_test_statistic(N_A, n_A, N_B, n_B):
  function B (line 119) | def B(alpha, beta):
  function beta_pdf (line 123) | def beta_pdf(x, alpha, beta):

FILE: first-edition/code/introduction.py
  function number_of_friends (line 41) | def number_of_friends(user):
  function friends_of_friend_ids_bad (line 57) | def friends_of_friend_ids_bad(user):
  function not_the_same (line 65) | def not_the_same(user, other_user):
  function not_friends (line 69) | def not_friends(user, other_user):
  function friends_of_friend_ids (line 75) | def friends_of_friend_ids(user):
  function data_scientists_who_like (line 101) | def data_scientists_who_like(target_interest):
  function most_common_interests_with (line 120) | def most_common_interests_with(user_id):
  function make_chart_salaries_by_tenure (line 138) | def make_chart_salaries_by_tenure():
  function tenure_bucket (line 158) | def tenure_bucket(tenure):
  function predict_paid_or_unpaid (line 181) | def predict_paid_or_unpaid(years_experience):

FILE: first-edition/code/linear_algebra.py
  function vector_add (line 13) | def vector_add(v, w):
  function vector_subtract (line 17) | def vector_subtract(v, w):
  function vector_sum (line 21) | def vector_sum(vectors):
  function scalar_multiply (line 24) | def scalar_multiply(c, v):
  function vector_mean (line 28) | def vector_mean(vectors):
  function dot (line 34) | def dot(v, w):
  function sum_of_squares (line 38) | def sum_of_squares(v):
  function magnitude (line 42) | def magnitude(v):
  function squared_distance (line 45) | def squared_distance(v, w):
  function distance (line 48) | def distance(v, w):
  function shape (line 55) | def shape(A):
  function get_row (line 60) | def get_row(A, i):
  function get_column (line 63) | def get_column(A, j):
  function make_matrix (line 66) | def make_matrix(num_rows, num_cols, entry_fn):
  function is_diagonal (line 72) | def is_diagonal(i, j):
  function matrix_add (line 96) | def matrix_add(A, B):
  function make_graph_dot_product_as_vector_projection (line 106) | def make_graph_dot_product_as_vector_projection(plt):

FILE: first-edition/code/logistic_regression.py
  function logistic (line 11) | def logistic(x):
  function logistic_prime (line 14) | def logistic_prime(x):
  function logistic_log_likelihood_i (line 17) | def logistic_log_likelihood_i(x_i, y_i, beta):
  function logistic_log_likelihood (line 23) | def logistic_log_likelihood(x, y, beta):
  function logistic_log_partial_ij (line 27) | def logistic_log_partial_ij(x_i, y_i, beta, j):
  function logistic_log_gradient_i (line 33) | def logistic_log_gradient_i(x_i, y_i, beta):
  function logistic_log_gradient (line 40) | def logistic_log_gradient(x, y, beta):

FILE: first-edition/code/machine_learning.py
  function split_data (line 9) | def split_data(data, prob):
  function train_test_split (line 16) | def train_test_split(x, y, test_pct):
  function accuracy (line 27) | def accuracy(tp, fp, fn, tn):
  function precision (line 32) | def precision(tp, fp, fn, tn):
  function recall (line 35) | def recall(tp, fp, fn, tn):
  function f1_score (line 38) | def f1_score(tp, fp, fn, tn):

FILE: first-edition/code/mapreduce.py
  function word_count_old (line 7) | def word_count_old(documents):
  function wc_mapper (line 13) | def wc_mapper(document):
  function wc_reducer (line 18) | def wc_reducer(word, counts):
  function word_count (line 22) | def word_count(documents):
  function map_reduce (line 36) | def map_reduce(inputs, mapper, reducer):
  function reduce_with (line 48) | def reduce_with(aggregation_fn, key, values):
  function values_reducer (line 52) | def values_reducer(aggregation_fn):
  function data_science_day_mapper (line 74) | def data_science_day_mapper(status_update):
  function words_per_user_mapper (line 84) | def words_per_user_mapper(status_update):
  function most_popular_word_reducer (line 89) | def most_popular_word_reducer(user, words_and_counts):
  function liker_mapper (line 105) | def liker_mapper(status_update):
  function matrix_multiply_mapper (line 119) | def matrix_multiply_mapper(m, element):
  function matrix_multiply_reducer (line 133) | def matrix_multiply_reducer(m, key, indexed_values):

FILE: first-edition/code/multiple_regression.py
  function predict (line 12) | def predict(x_i, beta):
  function error (line 15) | def error(x_i, y_i, beta):
  function squared_error (line 18) | def squared_error(x_i, y_i, beta):
  function squared_error_gradient (line 21) | def squared_error_gradient(x_i, y_i, beta):
  function estimate_beta (line 26) | def estimate_beta(x, y):
  function multiple_r_squared (line 34) | def multiple_r_squared(x, y, beta):
  function bootstrap_sample (line 39) | def bootstrap_sample(data):
  function bootstrap_statistic (line 43) | def bootstrap_statistic(data, stats_fn, num_samples):
  function estimate_sample_beta (line 48) | def estimate_sample_beta(sample):
  function p_value (line 52) | def p_value(beta_hat_j, sigma_hat_j):
  function ridge_penalty (line 64) | def ridge_penalty(beta, alpha):
  function squared_error_ridge (line 67) | def squared_error_ridge(x_i, y_i, beta, alpha):
  function ridge_penalty_gradient (line 71) | def ridge_penalty_gradient(beta, alpha):
  function squared_error_ridge_gradient (line 75) | def squared_error_ridge_gradient(x_i, y_i, beta, alpha):
  function estimate_beta_ridge (line 81) | def estimate_beta_ridge(x, y, alpha):
  function lasso_penalty (line 92) | def lasso_penalty(beta, alpha):

FILE: first-edition/code/naive_bayes.py
  function tokenize (line 6) | def tokenize(message):
  function count_words (line 12) | def count_words(training_set):
  function word_probabilities (line 20) | def word_probabilities(counts, total_spams, total_non_spams, k=0.5):
  function spam_probability (line 28) | def spam_probability(word_probs, message):
  class NaiveBayesClassifier (line 51) | class NaiveBayesClassifier:
    method __init__ (line 53) | def __init__(self, k=0.5):
    method train (line 57) | def train(self, training_set):
    method classify (line 72) | def classify(self, message):
  function get_subject_data (line 76) | def get_subject_data(path):
  function p_spam_given_word (line 95) | def p_spam_given_word(word_prob):
  function train_and_test_model (line 99) | def train_and_test_model(path):

FILE: first-edition/code/natural_language_processing.py
  function plot_resumes (line 7) | def plot_resumes(plt):
  function fix_unicode (line 33) | def fix_unicode(text):
  function get_document (line 36) | def get_document():
  function generate_using_bigrams (line 54) | def generate_using_bigrams(transitions):
  function generate_using_trigrams (line 63) | def generate_using_trigrams(starts, trigram_transitions):
  function is_terminal (line 77) | def is_terminal(token):
  function expand (line 80) | def expand(grammar, tokens):
  function generate_sentence (line 98) | def generate_sentence(grammar):
  function roll_a_die (line 105) | def roll_a_die():
  function direct_sample (line 108) | def direct_sample():
  function random_y_given_x (line 113) | def random_y_given_x(x):
  function random_x_given_y (line 117) | def random_x_given_y(y):
  function gibbs_sample (line 127) | def gibbs_sample(num_iters=100):
  function compare_distributions (line 134) | def compare_distributions(num_samples=1000):
  function sample_from (line 145) | def sample_from(weights):
  function p_topic_given_document (line 186) | def p_topic_given_document(topic, d, alpha=0.1):
  function p_word_given_topic (line 193) | def p_word_given_topic(word, topic, beta=0.1):
  function topic_weight (line 200) | def topic_weight(d, word, k):
  function choose_new_topic (line 206) | def choose_new_topic(d, word):

FILE: first-edition/code/nearest_neighbors.py
  function raw_majority_vote (line 8) | def raw_majority_vote(labels):
  function majority_vote (line 13) | def majority_vote(labels):
  function knn_classify (line 27) | def knn_classify(k, labeled_points, new_point):
  function plot_state_borders (line 44) | def plot_state_borders(plt, color='0.8'):
  function plot_cities (line 47) | def plot_cities():
  function classify_and_plot_grid (line 72) | def classify_and_plot_grid(k=1):
  function random_point (line 99) | def random_point(dim):
  function random_distances (line 102) | def random_distances(dim, num_pairs):

FILE: first-edition/code/network_analysis.py
  function shortest_paths_from (line 37) | def shortest_paths_from(from_user):
  function farness (line 103) | def farness(user):
  function matrix_product_entry (line 116) | def matrix_product_entry(A, B, i, j):
  function matrix_multiply (line 119) | def matrix_multiply(A, B):
  function vector_as_matrix (line 127) | def vector_as_matrix(v):
  function vector_from_matrix (line 131) | def vector_from_matrix(v_as_matrix):
  function matrix_operate (line 135) | def matrix_operate(A, v):
  function find_eigenvector (line 140) | def find_eigenvector(A, tolerance=0.00001):
  function entry_fn (line 157) | def entry_fn(i, j):
  function page_rank (line 188) | def page_rank(users, damping = 0.85, num_iters = 100):

FILE: first-edition/code/neural_networks.py
  function step_function (line 9) | def step_function(x):
  function perceptron_output (line 12) | def perceptron_output(weights, bias, x):
  function sigmoid (line 16) | def sigmoid(t):
  function neuron_output (line 19) | def neuron_output(weights, inputs):
  function feed_forward (line 22) | def feed_forward(neural_network, input_vector):
  function backpropagate (line 40) | def backpropagate(network, input_vector, target):
  function patch (line 63) | def patch(x, y, hatch, color):
  function show_weights (line 70) | def show_weights(neuron_idx):
  function make_digit (line 155) | def make_digit(raw_digit):
  function predict (line 186) | def predict(input):

FILE: first-edition/code/plot_state_borders.py
  function plot_state_borders (line 21) | def plot_state_borders(plt, color='0.8'):

FILE: first-edition/code/probability.py
  function random_kid (line 5) | def random_kid():
  function uniform_pdf (line 8) | def uniform_pdf(x):
  function uniform_cdf (line 11) | def uniform_cdf(x):
  function normal_pdf (line 17) | def normal_pdf(x, mu=0, sigma=1):
  function plot_normal_pdfs (line 21) | def plot_normal_pdfs(plt):
  function normal_cdf (line 30) | def normal_cdf(x, mu=0,sigma=1):
  function plot_normal_cdfs (line 33) | def plot_normal_cdfs(plt):
  function inverse_normal_cdf (line 42) | def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
  function bernoulli_trial (line 65) | def bernoulli_trial(p):
  function binomial (line 68) | def binomial(p, n):
  function make_hist (line 71) | def make_hist(p, n, num_points):

FILE: first-edition/code/recommender_systems.py
  function most_popular_new_interests (line 28) | def most_popular_new_interests(user_interests, max_results=5):
  function cosine_similarity (line 38) | def cosine_similarity(v, w):
  function make_user_interest_vector (line 45) | def make_user_interest_vector(user_interests):
  function most_similar_users_to (line 57) | def most_similar_users_to(user_id):
  function user_based_suggestions (line 68) | def user_based_suggestions(user_id, include_current_interests=False):
  function most_similar_interests_to (line 100) | def most_similar_interests_to(interest_id):
  function item_based_suggestions (line 109) | def item_based_suggestions(user_id, include_current_interests=False):

FILE: first-edition/code/simple_linear_regression.py
  function predict (line 8) | def predict(alpha, beta, x_i):
  function error (line 11) | def error(alpha, beta, x_i, y_i):
  function sum_of_squared_errors (line 14) | def sum_of_squared_errors(alpha, beta, x, y):
  function least_squares_fit (line 18) | def least_squares_fit(x,y):
  function total_sum_of_squares (line 25) | def total_sum_of_squares(y):
  function r_squared (line 29) | def r_squared(alpha, beta, x, y):
  function squared_error (line 36) | def squared_error(x_i, y_i, theta):
  function squared_error_gradient (line 40) | def squared_error_gradient(x_i, y_i, theta):

FILE: first-edition/code/statistics.py
  function make_friend_counts_histogram (line 8) | def make_friend_counts_histogram(plt):
  function mean (line 30) | def mean(x):
  function median (line 33) | def median(v):
  function quantile (line 48) | def quantile(x, p):
  function mode (line 53) | def mode(x):
  function data_range (line 61) | def data_range(x):
  function de_mean (line 64) | def de_mean(x):
  function variance (line 69) | def variance(x):
  function standard_deviation (line 75) | def standard_deviation(x):
  function interquartile_range (line 78) | def interquartile_range(x):
  function covariance (line 89) | def covariance(x, y):
  function correlation (line 93) | def correlation(x, y):

FILE: first-edition/code/visualizing_data.py
  function make_chart_simple_line_chart (line 4) | def make_chart_simple_line_chart(plt):
  function make_chart_simple_bar_chart (line 20) | def make_chart_simple_bar_chart(plt):
  function make_chart_histogram (line 39) | def make_chart_histogram(plt):
  function make_chart_misleading_y_axis (line 55) | def make_chart_misleading_y_axis(plt, mislead=True):
  function make_chart_several_line_charts (line 77) | def make_chart_several_line_charts(plt):
  function make_chart_scatter_plot (line 99) | def make_chart_scatter_plot(plt):
  function make_chart_scatterplot_axes (line 119) | def make_chart_scatterplot_axes(plt, equal_axes=False):
  function make_chart_pie_chart (line 136) | def make_chart_pie_chart(plt):

FILE: first-edition/code/working_with_data.py
  function bucketize (line 13) | def bucketize(point, bucket_size):
  function make_histogram (line 17) | def make_histogram(points, bucket_size):
  function plot_histogram (line 21) | def plot_histogram(points, bucket_size, title=""):
  function compare_two_distributions (line 27) | def compare_two_distributions():
  function random_normal (line 38) | def random_normal():
  function scatter (line 47) | def scatter():
  function correlation_matrix (line 55) | def correlation_matrix(data):
  function make_scatterplot_matrix (line 66) | def make_scatterplot_matrix():
  function parse_row (line 110) | def parse_row(input_row, parsers):
  function parse_rows_with (line 116) | def parse_rows_with(reader, parsers):
  function try_or_none (line 121) | def try_or_none(f):
  function parse_row (line 129) | def parse_row(input_row, parsers):
  function try_parse_field (line 133) | def try_parse_field(field_name, value, parser_dict):
  function parse_dict (line 141) | def parse_dict(input_dict, parser_dict):
  function picker (line 151) | def picker(field_name):
  function pluck (line 155) | def pluck(field_name, rows):
  function group_by (line 159) | def group_by(grouper, rows, value_transform=None):
  function percent_price_change (line 170) | def percent_price_change(yesterday, today):
  function day_over_day_changes (line 173) | def day_over_day_changes(grouped_rows):
  function scale (line 188) | def scale(data_matrix):
  function rescale (line 196) | def rescale(data_matrix):
  function de_mean_matrix (line 317) | def de_mean_matrix(A):
  function direction (line 324) | def direction(w):
  function directional_variance_i (line 328) | def directional_variance_i(x_i, w):
  function directional_variance (line 332) | def directional_variance(X, w):
  function directional_variance_gradient_i (line 336) | def directional_variance_gradient_i(x_i, w):
  function directional_variance_gradient (line 342) | def directional_variance_gradient(X, w):
  function first_principal_component (line 345) | def first_principal_component(X):
  function first_principal_component_sgd (line 353) | def first_principal_component_sgd(X):
  function project (line 361) | def project(v, w):
  function remove_projection_from_vector (line 366) | def remove_projection_from_vector(v, w):
  function remove_projection (line 370) | def remove_projection(X, w):
  function principal_component_analysis (line 375) | def principal_component_analysis(X, num_components):
  function transform_vector (line 384) | def transform_vector(v, components):
  function transform (line 387) | def transform(X, components):
  function combine_pct_changes (line 446) | def combine_pct_changes(pct_change1, pct_change2):
  function overall_change (line 449) | def overall_change(changes):

FILE: scratch/clustering.py
  function num_differences (line 3) | def num_differences(v1: Vector, v2: Vector) -> int:
  function cluster_means (line 13) | def cluster_means(k: int,
  class KMeans (line 30) | class KMeans:
    method __init__ (line 31) | def __init__(self, k: int) -> None:
    method classify (line 35) | def classify(self, input: Vector) -> int:
    method train (line 40) | def train(self, inputs: List[Vector]) -> None:
  class Leaf (line 62) | class Leaf(NamedTuple):
  class Merged (line 68) | class Merged(NamedTuple):
  function get_values (line 76) | def get_values(cluster: Cluster) -> List[Vector]:
  function cluster_distance (line 89) | def cluster_distance(cluster1: Cluster,
  function get_merge_order (line 100) | def get_merge_order(cluster: Cluster) -> float:
  function get_children (line 108) | def get_children(cluster: Cluster):
  function bottom_up_cluster (line 114) | def bottom_up_cluster(inputs: List[Vector],
  function generate_clusters (line 142) | def generate_clusters(base_cluster: Cluster,
  function main (line 160) | def main():

FILE: scratch/crash_course_in_python.py
  function double (line 49) | def double(x):
  function apply_to_one (line 56) | def apply_to_one(f):
  function another_double (line 73) | def another_double(x):
  function my_print (line 77) | def my_print(message = "my default message"):
  function full_name (line 83) | def full_name(first = "What's-his-name", last = "Something"):
  function sum_and_product (line 207) | def sum_and_product(x, y):
  function some_function_that_returns_a_string (line 393) | def some_function_that_returns_a_string():
  function smallest_item (line 465) | def smallest_item(xs):
  function smallest_item (line 471) | def smallest_item(xs):
  class CountingClicker (line 475) | class CountingClicker:
    method __init__ (line 478) | def __init__(self, count = 0):
    method __repr__ (line 481) | def __repr__(self):
    method click (line 484) | def click(self, num_times = 1):
    method read (line 488) | def read(self):
    method reset (line 491) | def reset(self):
  class NoResetClicker (line 503) | class NoResetClicker(CountingClicker):
    method reset (line 507) | def reset(self):
  function generate_range (line 517) | def generate_range(n):
  function natural_numbers (line 526) | def natural_numbers():
  function add (line 620) | def add(a, b): return a + b
  function doubler (line 629) | def doubler(f):
  function f1 (line 637) | def f1(x):
  function f2 (line 644) | def f2(x, y):
  function magic (line 653) | def magic(*args, **kwargs):
  function other_way_magic (line 663) | def other_way_magic(x, y, z):
  function doubler_correct (line 670) | def doubler_correct(f):
  function add (line 680) | def add(a, b):
  function add (line 692) | def add(a: int, b: int) -> int:
  function dot_product (line 704) | def dot_product(x, y): ...
  function dot_product (line 707) | def dot_product(x: Vector, y: Vector) -> float: ...
  function secretly_ugly_function (line 711) | def secretly_ugly_function(value, operation): ...
  function ugly_function (line 713) | def ugly_function(value: int, operation: Union[str, int, float, bool]) -...
  function total (line 716) | def total(xs: list) -> float:
  function total (line 721) | def total(xs: List[float]) -> float:
  function twice (line 758) | def twice(repeater: Callable[[str, int], str], s: str) -> str:
  function comma_repeater (line 761) | def comma_repeater(s: str, n: int) -> str:
  function total (line 770) | def total(xs: Numbers) -> Number:

FILE: scratch/databases.py
  class Table (line 14) | class Table:
    method __init__ (line 15) | def __init__(self, columns: List[str], types: List[type]) -> None:
    method col2type (line 22) | def col2type(self, col: str) -> type:
    method insert (line 26) | def insert(self, values: list) -> None:
    method __getitem__ (line 39) | def __getitem__(self, idx: int) -> Row:
    method __iter__ (line 42) | def __iter__(self) -> Iterator[Row]:
    method __len__ (line 45) | def __len__(self) -> int:
    method __repr__ (line 48) | def __repr__(self):
    method update (line 54) | def update(self,
    method delete (line 72) | def delete(self, predicate: WhereClause = lambda row: True) -> None:
    method select (line 76) | def select(self,
    method where (line 106) | def where(self, predicate: WhereClause = lambda row: True) -> 'Table':
    method limit (line 115) | def limit(self, num_rows: int) -> 'Table':
    method group_by (line 125) | def group_by(self,
    method order_by (line 153) | def order_by(self, order: Callable[[Row], Any]) -> 'Table':
    method join (line 158) | def join(self, other_table: 'Table', left_join: bool = False) -> 'Table':
  function main (line 191) | def main():

FILE: scratch/decision_trees.py
  function entropy (line 4) | def entropy(class_probabilities: List[float]) -> float:
  function class_probabilities (line 17) | def class_probabilities(labels: List[Any]) -> List[float]:
  function data_entropy (line 22) | def data_entropy(labels: List[Any]) -> float:
  function partition_entropy (line 29) | def partition_entropy(subsets: List[List[Any]]) -> float:
  class Candidate (line 38) | class Candidate(NamedTuple):
  function partition_by (line 67) | def partition_by(inputs: List[T], attribute: str) -> Dict[Any, List[T]]:
  function partition_entropy_by (line 75) | def partition_entropy_by(inputs: List[Any],
  class Leaf (line 104) | class Leaf(NamedTuple):
  class Split (line 107) | class Split(NamedTuple):
  function classify (line 126) | def classify(tree: DecisionTree, input: Any) -> Any:
  function build_tree_id3 (line 144) | def build_tree_id3(inputs: List[Any],

FILE: scratch/deep_learning.py
  function shape (line 5) | def shape(tensor: Tensor) -> List[int]:
  function is_1d (line 15) | def is_1d(tensor: Tensor) -> bool:
  function tensor_sum (line 25) | def tensor_sum(tensor: Tensor) -> float:
  function tensor_apply (line 38) | def tensor_apply(f: Callable[[float], float], tensor: Tensor) -> Tensor:
  function zeros_like (line 48) | def zeros_like(tensor: Tensor) -> Tensor:
  function tensor_combine (line 54) | def tensor_combine(f: Callable[[float, float], float],
  class Layer (line 70) | class Layer:
    method forward (line 76) | def forward(self, input):
    method backward (line 84) | def backward(self, gradient):
    method params (line 92) | def params(self) -> Iterable[Tensor]:
    method grads (line 100) | def grads(self) -> Iterable[Tensor]:
  class Sigmoid (line 108) | class Sigmoid(Layer):
    method forward (line 109) | def forward(self, input: Tensor) -> Tensor:
    method backward (line 117) | def backward(self, gradient: Tensor) -> Tensor:
  function random_uniform (line 126) | def random_uniform(*dims: int) -> Tensor:
  function random_normal (line 132) | def random_normal(*dims: int,
  function random_tensor (line 145) | def random_tensor(*dims: int, init: str = 'normal') -> Tensor:
  class Linear (line 158) | class Linear(Layer):
    method __init__ (line 159) | def __init__(self, input_dim: int, output_dim: int, init: str = 'xavie...
    method forward (line 173) | def forward(self, input: Tensor) -> Tensor:
    method backward (line 181) | def backward(self, gradient: Tensor) -> Tensor:
    method params (line 198) | def params(self) -> Iterable[Tensor]:
    method grads (line 201) | def grads(self) -> Iterable[Tensor]:
  class Sequential (line 206) | class Sequential(Layer):
    method __init__ (line 212) | def __init__(self, layers: List[Layer]) -> None:
    method forward (line 215) | def forward(self, input):
    method backward (line 221) | def backward(self, gradient):
    method params (line 227) | def params(self) -> Iterable[Tensor]:
    method grads (line 231) | def grads(self) -> Iterable[Tensor]:
  class Loss (line 235) | class Loss:
    method loss (line 236) | def loss(self, predicted: Tensor, actual: Tensor) -> float:
    method gradient (line 240) | def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
  class SSE (line 244) | class SSE(Loss):
    method loss (line 246) | def loss(self, predicted: Tensor, actual: Tensor) -> float:
    method gradient (line 256) | def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
  class Optimizer (line 267) | class Optimizer:
    method step (line 272) | def step(self, layer: Layer) -> None:
  class GradientDescent (line 275) | class GradientDescent(Optimizer):
    method __init__ (line 276) | def __init__(self, learning_rate: float = 0.1) -> None:
    method step (line 279) | def step(self, layer: Layer) -> None:
  class Momentum (line 297) | class Momentum(Optimizer):
    method __init__ (line 298) | def __init__(self,
    method step (line 305) | def step(self, layer: Layer) -> None:
  function tanh (line 327) | def tanh(x: float) -> float:
  class Tanh (line 336) | class Tanh(Layer):
    method forward (line 337) | def forward(self, input: Tensor) -> Tensor:
    method backward (line 342) | def backward(self, gradient: Tensor) -> Tensor:
  class Relu (line 348) | class Relu(Layer):
    method forward (line 349) | def forward(self, input: Tensor) -> Tensor:
    method backward (line 353) | def backward(self, gradient: Tensor) -> Tensor:
  function softmax (line 358) | def softmax(tensor: Tensor) -> Tensor:
  class SoftmaxCrossEntropy (line 372) | class SoftmaxCrossEntropy(Loss):
    method loss (line 378) | def loss(self, predicted: Tensor, actual: Tensor) -> float:
    method gradient (line 391) | def gradient(self, predicted: Tensor, actual: Tensor) -> Tensor:
  class Dropout (line 399) | class Dropout(Layer):
    method __init__ (line 400) | def __init__(self, p: float) -> None:
    method forward (line 404) | def forward(self, input: Tensor) -> Tensor:
    method backward (line 417) | def backward(self, gradient: Tensor) -> Tensor:
  function one_hot_encode (line 428) | def one_hot_encode(i: int, num_labels: int = 10) -> List[float]:
  function save_weights (line 439) | def save_weights(model: Layer, filename: str) -> None:
  function load_weights (line 444) | def load_weights(model: Layer, filename: str) -> None:
  function main (line 456) | def main():

FILE: scratch/getting_data.py
  function get_domain (line 8) | def get_domain(email_address: str) -> str:
  function process (line 33) | def process(date: str, symbol: str, closing_price: float) -> None:
  function paragraph_mentions (line 136) | def paragraph_mentions(text: str, keyword: str) -> bool:
  function main (line 166) | def main():

FILE: scratch/gradient_descent.py
  function sum_of_squares (line 3) | def sum_of_squares(v: Vector) -> float:
  function difference_quotient (line 9) | def difference_quotient(f: Callable[[float], float],
  function square (line 14) | def square(x: float) -> float:
  function derivative (line 17) | def derivative(x: float) -> float:
  function estimate_gradient (line 20) | def estimate_gradient(f: Callable[[Vector], float],
  function gradient_step (line 29) | def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
  function sum_of_squares_gradient (line 35) | def sum_of_squares_gradient(v: Vector) -> Vector:
  function linear_gradient (line 41) | def linear_gradient(x: float, y: float, theta: Vector) -> Vector:
  function minibatches (line 53) | def minibatches(dataset: List[T],
  function main (line 66) | def main():

FILE: scratch/inference.py
  function normal_approximation_to_binomial (line 4) | def normal_approximation_to_binomial(n: int, p: float) -> Tuple[float, f...
  function normal_probability_above (line 16) | def normal_probability_above(lo: float,
  function normal_probability_between (line 23) | def normal_probability_between(lo: float,
  function normal_probability_outside (line 31) | def normal_probability_outside(lo: float,
  function normal_upper_bound (line 40) | def normal_upper_bound(probability: float,
  function normal_lower_bound (line 46) | def normal_lower_bound(probability: float,
  function normal_two_sided_bounds (line 52) | def normal_two_sided_bounds(probability: float,
  function two_sided_p_value (line 106) | def two_sided_p_value(x: float, mu: float = 0, sigma: float = 1) -> float:
  function run_experiment (line 158) | def run_experiment() -> List[bool]:
  function reject_fairness (line 162) | def reject_fairness(experiment: List[bool]) -> bool:
  function estimated_parameters (line 175) | def estimated_parameters(N: int, n: int) -> Tuple[float, float]:
  function a_b_test_statistic (line 180) | def a_b_test_statistic(N_A: int, n_A: int, N_B: int, n_B: int) -> float:
  function B (line 198) | def B(alpha: float, beta: float) -> float:
  function beta_pdf (line 202) | def beta_pdf(x: float, alpha: float, beta: float) -> float:

FILE: scratch/introduction.py
  function number_of_friends (line 32) | def number_of_friends(user):
  function foaf_ids_bad (line 67) | def foaf_ids_bad(user):
  function friends_of_friends (line 89) | def friends_of_friends(user):
  function data_scientists_who_like (line 122) | def data_scientists_who_like(target_interest):
  function most_common_interests_with (line 142) | def most_common_interests_with(user):
  function tenure_bucket (line 193) | def tenure_bucket(tenure):
  function predict_paid_or_unpaid (line 225) | def predict_paid_or_unpaid(years_experience):

FILE: scratch/k_nearest_neighbors.py
  function raw_majority_vote (line 4) | def raw_majority_vote(labels: List[str]) -> str:
  function majority_vote (line 11) | def majority_vote(labels: List[str]) -> str:
  class LabeledPoint (line 30) | class LabeledPoint(NamedTuple):
  function knn_classify (line 34) | def knn_classify(k: int,
  function random_point (line 51) | def random_point(dim: int) -> Vector:
  function random_distances (line 54) | def random_distances(dim: int, num_pairs: int) -> List[float]:
  function main (line 58) | def main():

FILE: scratch/linear_algebra.py
  function add (line 14) | def add(v: Vector, w: Vector) -> Vector:
  function subtract (line 22) | def subtract(v: Vector, w: Vector) -> Vector:
  function vector_sum (line 30) | def vector_sum(vectors: List[Vector]) -> Vector:
  function scalar_multiply (line 45) | def scalar_multiply(c: float, v: Vector) -> Vector:
  function vector_mean (line 51) | def vector_mean(vectors: List[Vector]) -> Vector:
  function dot (line 58) | def dot(v: Vector, w: Vector) -> float:
  function sum_of_squares (line 66) | def sum_of_squares(v: Vector) -> float:
  function magnitude (line 74) | def magnitude(v: Vector) -> float:
  function squared_distance (line 80) | def squared_distance(v: Vector, w: Vector) -> float:
  function distance (line 84) | def distance(v: Vector, w: Vector) -> float:
  function distance (line 89) | def distance(v: Vector, w: Vector) -> float:  # type: ignore
  function shape (line 104) | def shape(A: Matrix) -> Tuple[int, int]:
  function get_row (line 112) | def get_row(A: Matrix, i: int) -> Vector:
  function get_column (line 116) | def get_column(A: Matrix, j: int) -> Vector:
  function make_matrix (line 123) | def make_matrix(num_rows: int,
  function identity_matrix (line 134) | def identity_matrix(n: int) -> Matrix:

FILE: scratch/logistic_regression.py
  function logistic (line 11) | def logistic(x: float) -> float:
  function logistic_prime (line 14) | def logistic_prime(x: float) -> float:
  function _negative_log_likelihood (line 21) | def _negative_log_likelihood(x: Vector, y: float, beta: Vector) -> float:
  function negative_log_likelihood (line 30) | def negative_log_likelihood(xs: List[Vector],
  function _negative_log_partial_j (line 38) | def _negative_log_partial_j(x: Vector, y: float, beta: Vector, j: int) -...
  function _negative_log_gradient (line 45) | def _negative_log_gradient(x: Vector, y: float, beta: Vector) -> Vector:
  function negative_log_gradient (line 52) | def negative_log_gradient(xs: List[Vector],
  function main (line 58) | def main():

FILE: scratch/machine_learning.py
  function split_data (line 5) | def split_data(data: List[X], prob: float) -> Tuple[List[X], List[X]]:
  function train_test_split (line 24) | def train_test_split(xs: List[X],
  function accuracy (line 48) | def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
  function precision (line 55) | def precision(tp: int, fp: int, fn: int, tn: int) -> float:
  function recall (line 60) | def recall(tp: int, fp: int, fn: int, tn: int) -> float:
  function f1_score (line 65) | def f1_score(tp: int, fp: int, fn: int, tn: int) -> float:

FILE: scratch/mapreduce.py
  function tokenize (line 4) | def tokenize(document: str) -> List[str]:
  function word_count_old (line 8) | def word_count_old(documents: List[str]):
  function wc_mapper (line 16) | def wc_mapper(document: str) -> Iterator[Tuple[str, int]]:
  function wc_reducer (line 23) | def wc_reducer(word: str,
  function word_count (line 30) | def word_count(documents: List[str]) -> List[Tuple[str, int]]:
  function map_reduce (line 60) | def map_reduce(inputs: Iterable,
  function values_reducer (line 74) | def values_reducer(values_fn: Callable) -> Reducer:
  class Entry (line 93) | class Entry(NamedTuple):
  function matrix_multiply_mapper (line 99) | def matrix_multiply_mapper(num_rows_a: int, num_cols_b: int) -> Mapper:
  function matrix_multiply_reducer (line 118) | def matrix_multiply_reducer(key: Tuple[int, int],
  function main (line 141) | def main():

FILE: scratch/multiple_regression.py
  function predict (line 8) | def predict(x: Vector, beta: Vector) -> float:
  function error (line 19) | def error(x: Vector, y: float, beta: Vector) -> float:
  function squared_error (line 22) | def squared_error(x: Vector, y: float, beta: Vector) -> float:
  function sqerror_gradient (line 32) | def sqerror_gradient(x: Vector, y: float, beta: Vector) -> Vector:
  function least_squares_fit (line 44) | def least_squares_fit(xs: List[Vector],
  function multiple_r_squared (line 69) | def multiple_r_squared(xs: List[Vector], ys: Vector, beta: Vector) -> fl...
  function bootstrap_sample (line 79) | def bootstrap_sample(data: List[X]) -> List[X]:
  function bootstrap_statistic (line 83) | def bootstrap_statistic(data: List[X],
  function p_value (line 108) | def p_value(beta_hat_j: float, sigma_hat_j: float) -> float:
  function ridge_penalty (line 124) | def ridge_penalty(beta: Vector, alpha: float) -> float:
  function squared_error_ridge (line 127) | def squared_error_ridge(x: Vector,
  function ridge_penalty_gradient (line 136) | def ridge_penalty_gradient(beta: Vector, alpha: float) -> Vector:
  function sqerror_ridge_gradient (line 140) | def sqerror_ridge_gradient(x: Vector,
  function least_squares_fit_ridge (line 157) | def least_squares_fit_ridge(xs: List[Vector],
  function lasso_penalty (line 177) | def lasso_penalty(beta, alpha):
  function main (line 180) | def main():

FILE: scratch/naive_bayes.py
  function tokenize (line 4) | def tokenize(text: str) -> Set[str]:
  class Message (line 13) | class Message(NamedTuple):
  class NaiveBayesClassifier (line 21) | class NaiveBayesClassifier:
    method __init__ (line 22) | def __init__(self, k: float = 0.5) -> None:
    method train (line 30) | def train(self, messages: Iterable[Message]) -> None:
    method _probabilities (line 46) | def _probabilities(self, token: str) -> Tuple[float, float]:
    method predict (line 56) | def predict(self, text: str) -> float:
  function drop_final_s (line 115) | def drop_final_s(word):
  function main (line 118) | def main():

FILE: scratch/network_analysis.py
  class User (line 3) | class User(NamedTuple):
  function shortest_paths_from (line 32) | def shortest_paths_from(from_user_id: int,
  function farness (line 92) | def farness(user_id: int) -> float:
  function matrix_times_matrix (line 101) | def matrix_times_matrix(m1: Matrix, m2: Matrix) -> Matrix:
  function matrix_times_vector (line 114) | def matrix_times_vector(m: Matrix, v: Vector) -> Vector:
  function find_eigenvector (line 125) | def find_eigenvector(m: Matrix,
  function entry_fn (line 146) | def entry_fn(i: int, j: int):
  function page_rank (line 162) | def page_rank(users: List[User],

FILE: scratch/neural_networks.py
  function step_function (line 3) | def step_function(x: float) -> float:
  function perceptron_output (line 6) | def perceptron_output(weights: Vector, bias: float, x: Vector) -> float:
  function sigmoid (line 35) | def sigmoid(t: float) -> float:
  function neuron_output (line 38) | def neuron_output(weights: Vector, inputs: Vector) -> float:
  function feed_forward (line 44) | def feed_forward(neural_network: List[List[Vector]],
  function sqerror_gradients (line 76) | def sqerror_gradients(network: List[List[Vector]],
  function fizz_buzz_encode (line 114) | def fizz_buzz_encode(x: int) -> Vector:
  function binary_encode (line 129) | def binary_encode(x: int) -> Vector:
  function argmax (line 145) | def argmax(xs: list) -> int:
  function main (line 153) | def main():

FILE: scratch/nlp.py
  function fix_unicode (line 16) | def fix_unicode(text: str) -> str:
  function generate_using_bigrams (line 42) | def generate_using_bigrams() -> str:
  function generate_using_trigrams (line 61) | def generate_using_trigrams() -> str:
  function is_terminal (line 92) | def is_terminal(token: str) -> bool:
  function expand (line 95) | def expand(grammar: Grammar, tokens: List[str]) -> List[str]:
  function generate_sentence (line 117) | def generate_sentence(grammar: Grammar) -> List[str]:
  function roll_a_die (line 123) | def roll_a_die() -> int:
  function direct_sample (line 126) | def direct_sample() -> Tuple[int, int]:
  function random_y_given_x (line 131) | def random_y_given_x(x: int) -> int:
  function random_x_given_y (line 135) | def random_x_given_y(y: int) -> int:
  function gibbs_sample (line 145) | def gibbs_sample(num_iters: int = 100) -> Tuple[int, int]:
  function compare_distributions (line 152) | def compare_distributions(num_samples: int = 1000) -> Dict[int, List[int]]:
  function sample_from (line 159) | def sample_from(weights: List[float]) -> int:
  function p_topic_given_document (line 213) | def p_topic_given_document(topic: int, d: int, alpha: float = 0.1) -> fl...
  function p_word_given_topic (line 221) | def p_word_given_topic(word: str, topic: int, beta: float = 0.1) -> float:
  function topic_weight (line 229) | def topic_weight(d: int, word: str, k: int) -> float:
  function choose_new_topic (line 236) | def choose_new_topic(d: int, word: str) -> int:
  function cosine_similarity (line 294) | def cosine_similarity(v1: Vector, v2: Vector) -> float:
  function make_sentence (line 307) | def make_sentence() -> str:
  class Vocabulary (line 325) | class Vocabulary:
    method __init__ (line 326) | def __init__(self, words: List[str] = None) -> None:
    method size (line 334) | def size(self) -> int:
    method add (line 338) | def add(self, word: str) -> None:
    method get_id (line 344) | def get_id(self, word: str) -> int:
    method get_word (line 348) | def get_word(self, word_id: int) -> str:
    method one_hot_encode (line 352) | def one_hot_encode(self, word: str) -> Tensor:
  function save_vocab (line 371) | def save_vocab(vocab: Vocabulary, filename: str) -> None:
  function load_vocab (line 375) | def load_vocab(filename: str) -> Vocabulary:
  class Embedding (line 386) | class Embedding(Layer):
    method __init__ (line 387) | def __init__(self, num_embeddings: int, embedding_dim: int) -> None:
    method forward (line 398) | def forward(self, input_id: int) -> Tensor:
    method backward (line 404) | def backward(self, gradient: Tensor) -> None:
    method params (line 414) | def params(self) -> Iterable[Tensor]:
    method grads (line 417) | def grads(self) -> Iterable[Tensor]:
  class TextEmbedding (line 420) | class TextEmbedding(Embedding):
    method __init__ (line 421) | def __init__(self, vocab: Vocabulary, embedding_dim: int) -> None:
    method __getitem__ (line 428) | def __getitem__(self, word: str) -> Tensor:
    method closest (line 435) | def closest(self, word: str, n: int = 5) -> List[Tuple[float, str]]:
  class SimpleRnn (line 448) | class SimpleRnn(Layer):
    method __init__ (line 450) | def __init__(self, input_dim: int, hidden_dim: int) -> None:
    method reset_hidden_state (line 460) | def reset_hidden_state(self) -> None:
    method forward (line 463) | def forward(self, input: Tensor) -> Tensor:
    method backward (line 475) | def backward(self, gradient: Tensor):
    method params (line 500) | def params(self) -> Iterable[Tensor]:
    method grads (line 503) | def grads(self) -> Iterable[Tensor]:
  function main (line 506) | def main():

FILE: scratch/nlp_advanced.py
  class EmbeddingOptimizer (line 3) | class EmbeddingOptimizer(Optimizer):
    method __init__ (line 8) | def __init__(self, learning_rate: float) -> None:
    method step (line 11) | def step(self, layer: Layer) -> None:

FILE: scratch/probability.py
  function uniform_cdf (line 1) | def uniform_cdf(x: float) -> float:
  function normal_pdf (line 10) | def normal_pdf(x: float, mu: float = 0, sigma: float = 1) -> float:
  function normal_cdf (line 29) | def normal_cdf(x: float, mu: float = 0, sigma: float = 1) -> float:
  function inverse_normal_cdf (line 46) | def inverse_normal_cdf(p: float,
  function bernoulli_trial (line 71) | def bernoulli_trial(p: float) -> int:
  function binomial (line 75) | def binomial(n: int, p: float) -> int:
  function binomial_histogram (line 81) | def binomial_histogram(p: float, n: int, num_points: int) -> None:
  function main (line 103) | def main():

FILE: scratch/recommender_systems.py
  function most_popular_new_interests (line 27) | def most_popular_new_interests(
  function make_user_interest_vector (line 49) | def make_user_interest_vector(user_interests: List[str]) -> List[int]:
  function most_similar_users_to (line 72) | def most_similar_users_to(user_id: int) -> List[Tuple[int, float]]:
  function user_based_suggestions (line 93) | def user_based_suggestions(user_id: int,
  function most_similar_interests_to (line 133) | def most_similar_interests_to(interest_id: int):
  function item_based_suggestions (line 149) | def item_based_suggestions(user_id: int,
  function main (line 194) | def main():

FILE: scratch/simple_linear_regression.py
  function predict (line 1) | def predict(alpha: float, beta: float, x_i: float) -> float:
  function error (line 4) | def error(alpha: float, beta: float, x_i: float, y_i: float) -> float:
  function sum_of_sqerrors (line 13) | def sum_of_sqerrors(alpha: float, beta: float, x: Vector, y: Vector) -> ...
  function least_squares_fit (line 21) | def least_squares_fit(x: Vector, y: Vector) -> Tuple[float, float]:
  function total_sum_of_squares (line 44) | def total_sum_of_squares(y: Vector) -> float:
  function r_squared (line 48) | def r_squared(alpha: float, beta: float, x: Vector, y: Vector) -> float:
  function main (line 59) | def main():

FILE: scratch/statistics.py
  function mean (line 42) | def mean(xs: List[float]) -> float:
  function _median_odd (line 53) | def _median_odd(xs: List[float]) -> float:
  function _median_even (line 57) | def _median_even(xs: List[float]) -> float:
  function median (line 63) | def median(v: List[float]) -> float:
  function quantile (line 73) | def quantile(xs: List[float], p: float) -> float:
  function mode (line 83) | def mode(x: List[float]) -> List[float]:
  function data_range (line 93) | def data_range(xs: List[float]) -> float:
  function de_mean (line 100) | def de_mean(xs: List[float]) -> List[float]:
  function variance (line 105) | def variance(xs: List[float]) -> float:
  function standard_deviation (line 117) | def standard_deviation(xs: List[float]) -> float:
  function interquartile_range (line 123) | def interquartile_range(xs: List[float]) -> float:
  function covariance (line 136) | def covariance(xs: List[float], ys: List[float]) -> float:
  function correlation (line 144) | def correlation(xs: List[float], ys: List[float]) -> float:

FILE: scratch/working_with_data.py
  function bucketize (line 7) | def bucketize(point: float, bucket_size: float) -> float:
  function make_histogram (line 11) | def make_histogram(points: List[float], bucket_size: float) -> Dict[floa...
  function plot_histogram (line 15) | def plot_histogram(points: List[float], bucket_size: float, title: str =...
  function random_normal (line 24) | def random_normal() -> float:
  function correlation_matrix (line 53) | def correlation_matrix(data: List[Vector]) -> Matrix:
  class StockPrice (line 84) | class StockPrice(NamedTuple):
    method is_high_tech (line 89) | def is_high_tech(self) -> bool:
  function parse_row (line 101) | def parse_row(row: List[str]) -> StockPrice:
  function try_parse_row (line 117) | def try_parse_row(row: List[str]) -> Optional[StockPrice]:
  function pct_change (line 189) | def pct_change(yesterday: StockPrice, today: StockPrice) -> float:
  class DailyChange (line 192) | class DailyChange(NamedTuple):
  function day_over_day_changes (line 197) | def day_over_day_changes(prices: List[StockPrice]) -> List[DailyChange]:
  function scale (line 250) | def scale(data: List[Vector]) -> Tuple[Vector, Vector]:
  function rescale (line 265) | def rescale(data: List[Vector]) -> List[Vector]:
  function de_mean (line 396) | def de_mean(data: List[Vector]) -> List[Vector]:
  function direction (line 403) | def direction(w: Vector) -> Vector:
  function directional_variance (line 409) | def directional_variance(data: List[Vector], w: Vector) -> float:
  function directional_variance_gradient (line 416) | def directional_variance_gradient(data: List[Vector], w: Vector) -> Vector:
  function first_principal_component (line 426) | def first_principal_component(data: List[Vector],
  function project (line 443) | def project(v: Vector, w: Vector) -> Vector:
  function remove_projection_from_vector (line 450) | def remove_projection_from_vector(v: Vector, w: Vector) -> Vector:
  function remove_projection (line 454) | def remove_projection(data: List[Vector], w: Vector) -> List[Vector]:
  function pca (line 457) | def pca(data: List[Vector], num_components: int) -> List[Vector]:
  function transform_vector (line 466) | def transform_vector(v: Vector, components: List[Vector]) -> Vector:
  function transform (line 469) | def transform(data: List[Vector], components: List[Vector]) -> List[Vect...
  function main (line 472) | def main():

Condensed preview: 108 files, each showing path, character count, and a content snippet (the full structured content is 3,485K chars).

[
  {
    "path": ".gitignore",
    "chars": 19,
    "preview": "__pycache__\n*.png\n\n"
  },
  {
    "path": "INSTALL.md",
    "chars": 339,
    "preview": "# How to Install Python\n\nIf you don't already have Python, I strongly recommend you install the Anaconda version,\nwhich "
  },
  {
    "path": "LICENSE",
    "chars": 1066,
    "preview": "MIT License\n\nCopyright (c) 2019 Joel Grus\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\n"
  },
  {
    "path": "README.md",
    "chars": 3877,
    "preview": "Data Science from Scratch\n=========================\n\nHere's all the code and examples from the second edition of my book"
  },
  {
    "path": "comma_delimited_stock_prices.csv",
    "chars": 119,
    "preview": "AAPL,6/20/2014,90.91\nMSFT,6/20/2014,41.68\nFB,6/20/3014,64.5\nAAPL,6/19/2014,91.86\nMSFT,6/19/2014,n/a\nFB,6/19/2014,64.34\n"
  },
  {
    "path": "first-edition/README.md",
    "chars": 3898,
    "preview": "Data Science from Scratch\n=========================\n\nHere's all the code and examples from the first edition of my book "
  },
  {
    "path": "first-edition/code/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "first-edition/code/charts.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "first-edition/code/clustering.py",
    "chars": 6438,
    "preview": "from __future__ import division\nfrom linear_algebra import squared_distance, vector_mean, distance\nimport math, random\ni"
  },
  {
    "path": "first-edition/code/colon_delimited_stock_prices.txt",
    "chars": 85,
    "preview": "date:symbol:closing_price\n6/20/2014:AAPL:90.91\n6/20/2014:MSFT:41.68\n6/20/2014:FB:64.5"
  },
  {
    "path": "first-edition/code/comma_delimited_stock_prices.csv",
    "chars": 118,
    "preview": "6/20/2014,AAPL,90.91\n6/20/2014,MSFT,41.68\n6/20/3014,FB,64.5\n6/19/2014,AAPL,91.86\n6/19/2014,MSFT,n/a\n6/19/2014,FB,64.34"
  },
  {
    "path": "first-edition/code/comma_delimited_stock_prices.txt",
    "chars": 30,
    "preview": "AAPL,90.91\nFB,64.5\nMSFT,41.68\n"
  },
  {
    "path": "first-edition/code/databases.py",
    "chars": 8165,
    "preview": "from __future__ import division\nimport math, random, re\nfrom collections import defaultdict\n\nclass Table:\n    def __init"
  },
  {
    "path": "first-edition/code/decision_trees.py",
    "chars": 5866,
    "preview": "from __future__ import division\nfrom collections import Counter, defaultdict\nfrom functools import partial\nimport math, "
  },
  {
    "path": "first-edition/code/egrep.py",
    "chars": 444,
    "preview": "# egrep.py\nimport sys, re\n\nif __name__ == \"__main__\":\n\n    # sys.argv is the list of command-line arguments\n    # sys.ar"
  },
  {
    "path": "first-edition/code/getting_data.py",
    "chars": 6317,
    "preview": "from __future__ import division\nfrom collections import Counter\nimport math, random, csv, json\n\nfrom bs4 import Beautifu"
  },
  {
    "path": "first-edition/code/gradient_descent.py",
    "chars": 5895,
    "preview": "from __future__ import division\nfrom collections import Counter\nfrom linear_algebra import distance, vector_subtract, sc"
  },
  {
    "path": "first-edition/code/hypothesis_and_inference.py",
    "chars": 6054,
    "preview": "from __future__ import division\nfrom probability import normal_cdf, inverse_normal_cdf\nimport math, random\n\ndef normal_a"
  },
  {
    "path": "first-edition/code/introduction.py",
    "chars": 8189,
    "preview": "from __future__ import division\n\n# at this stage in the book we haven't actually installed matplotlib,\n# comment this ou"
  },
  {
    "path": "first-edition/code/line_count.py",
    "chars": 163,
    "preview": "# line_count.py\nimport sys\n\nif __name__ == \"__main__\":\n\n    count = 0\n    for line in sys.stdin:\n        count += 1\n\n   "
  },
  {
    "path": "first-edition/code/linear_algebra.py",
    "chars": 3699,
    "preview": "# -*- coding: iso-8859-15 -*-\n\nfrom __future__ import division # want 3 / 2 == 1.5\nimport re, math, random # regexes, ma"
  },
  {
    "path": "first-edition/code/logistic_regression.py",
    "chars": 6078,
    "preview": "from __future__ import division\nfrom collections import Counter\nfrom functools import partial\nfrom linear_algebra import"
  },
  {
    "path": "first-edition/code/machine_learning.py",
    "chars": 1382,
    "preview": "from __future__ import division\nfrom collections import Counter\nimport math, random\n\n#\n# data splitting\n#\n\ndef split_dat"
  },
  {
    "path": "first-edition/code/mapreduce.py",
    "chars": 5686,
    "preview": "from __future__ import division\nimport math, random, re, datetime\nfrom collections import defaultdict, Counter\nfrom func"
  },
  {
    "path": "first-edition/code/most_common_words.py",
    "chars": 740,
    "preview": "# most_common_words.py\nimport sys\nfrom collections import Counter\n\nif __name__ == \"__main__\":\n\n    # pass in number of w"
  },
  {
    "path": "first-edition/code/multiple_regression.py",
    "chars": 8590,
    "preview": "from __future__ import division\nfrom collections import Counter\nfrom functools import partial\nfrom linear_algebra import"
  },
  {
    "path": "first-edition/code/naive_bayes.py",
    "chars": 4535,
    "preview": "from __future__ import division\nfrom collections import Counter, defaultdict\nfrom machine_learning import split_data\nimp"
  },
  {
    "path": "first-edition/code/natural_language_processing.py",
    "chars": 10007,
    "preview": "from __future__ import division\nimport math, random, re\nfrom collections import defaultdict, Counter\nfrom bs4 import Bea"
  },
  {
    "path": "first-edition/code/nearest_neighbors.py",
    "chars": 7357,
    "preview": "from __future__ import division\nfrom collections import Counter\nfrom linear_algebra import distance\nfrom statistics impo"
  },
  {
    "path": "first-edition/code/network_analysis.py",
    "chars": 7175,
    "preview": "from __future__ import division\nimport math, random, re\nfrom collections import defaultdict, Counter, deque\nfrom linear_"
  },
  {
    "path": "first-edition/code/neural_networks.py",
    "chars": 6622,
    "preview": "from __future__ import division\nfrom collections import Counter\nfrom functools import partial\nfrom linear_algebra import"
  },
  {
    "path": "first-edition/code/plot_state_borders.py",
    "chars": 597,
    "preview": "import re\n\nsegments = []\npoints = []\n\nlat_long_regex = r\"<point lat=\\\"(.*)\\\" lng=\\\"(.*)\\\"\"\n\nwith open(\"states.txt\", \"r\")"
  },
  {
    "path": "first-edition/code/probability.py",
    "chars": 3863,
    "preview": "from __future__ import division\nfrom collections import Counter\nimport math, random\n\ndef random_kid():\n    return random"
  },
  {
    "path": "first-edition/code/recommender_systems.py",
    "chars": 6291,
    "preview": "from __future__ import division\nimport math, random\nfrom collections import defaultdict, Counter\nfrom linear_algebra imp"
  },
  {
    "path": "first-edition/code/simple_linear_regression.py",
    "chars": 3982,
    "preview": "from __future__ import division\nfrom collections import Counter, defaultdict\nfrom linear_algebra import vector_subtract\n"
  },
  {
    "path": "first-edition/code/states.txt",
    "chars": 133502,
    "preview": "<state name =\"Alabama\" colour=\"#ff0000\" >\n  <point lat=\"35.0041\" lng=\"-88.1955\"/>\n  <point lat=\"34.9918\" lng=\"-85.6068\"/"
  },
  {
    "path": "first-edition/code/statistics.py",
    "chars": 5779,
    "preview": "from __future__ import division\nfrom collections import Counter\nfrom linear_algebra import sum_of_squares, dot\nimport ma"
  },
  {
    "path": "first-edition/code/stocks.txt",
    "chars": 351764,
    "preview": "symbol\tdate\tclosing_price\nAAPL\t2015-01-23\t112.98\nAAPL\t2015-01-22\t112.4\nAAPL\t2015-01-21\t109.55\nAAPL\t2015-01-20\t108.72\nAAP"
  },
  {
    "path": "first-edition/code/tab_delimited_stock_prices.txt",
    "chars": 120,
    "preview": "6/20/2014\tAAPL\t90.91\n6/20/2014\tMSFT\t41.68\n6/20/2014\tFB\t64.5\n6/19/2014\tAAPL\t91.86\n6/19/2014\tMSFT\t41.51\n6/19/2014\tFB\t64.34"
  },
  {
    "path": "first-edition/code/visualizing_data.py",
    "chars": 5116,
    "preview": "import matplotlib.pyplot as plt\nfrom collections import Counter\n\ndef make_chart_simple_line_chart(plt):\n\n    years = [19"
  },
  {
    "path": "first-edition/code/working_with_data.py",
    "chars": 16549,
    "preview": "from __future__ import division\nfrom collections import Counter, defaultdict\nfrom functools import partial\nfrom linear_a"
  },
  {
    "path": "first-edition/code-python3/README.md",
    "chars": 3823,
    "preview": "# Updating the code from Python 2 to Python 3\n\nAfter many requests, here's the code from the book updated from Python 2 "
  },
  {
    "path": "first-edition/code-python3/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "first-edition/code-python3/charts.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "first-edition/code-python3/clustering.py",
    "chars": 6327,
    "preview": "from linear_algebra import squared_distance, vector_mean, distance\nimport math, random\nimport matplotlib.image as mpimg\n"
  },
  {
    "path": "first-edition/code-python3/colon_delimited_stock_prices.txt",
    "chars": 85,
    "preview": "date:symbol:closing_price\n6/20/2014:AAPL:90.91\n6/20/2014:MSFT:41.68\n6/20/2014:FB:64.5"
  },
  {
    "path": "first-edition/code-python3/comma_delimited_stock_prices.csv",
    "chars": 118,
    "preview": "6/20/2014,AAPL,90.91\n6/20/2014,MSFT,41.68\n6/20/3014,FB,64.5\n6/19/2014,AAPL,91.86\n6/19/2014,MSFT,n/a\n6/19/2014,FB,64.34"
  },
  {
    "path": "first-edition/code-python3/comma_delimited_stock_prices.txt",
    "chars": 33,
    "preview": "FB,64.5\r\nMSFT,41.68\r\nAAPL,90.91\r\n"
  },
  {
    "path": "first-edition/code-python3/databases.py",
    "chars": 8190,
    "preview": "import math, random, re\nfrom collections import defaultdict\n\nclass Table:\n    def __init__(self, columns):\n        self."
  },
  {
    "path": "first-edition/code-python3/decision_trees.py",
    "chars": 5734,
    "preview": "from collections import Counter, defaultdict\nfrom functools import partial\nimport math, random\n\ndef entropy(class_probab"
  },
  {
    "path": "first-edition/code-python3/egrep.py",
    "chars": 444,
    "preview": "# egrep.py\nimport sys, re\n\nif __name__ == \"__main__\":\n\n    # sys.argv is the list of command-line arguments\n    # sys.ar"
  },
  {
    "path": "first-edition/code-python3/getting_data.py",
    "chars": 6522,
    "preview": "from collections import Counter\nimport math, random, csv, json, re\n\nfrom bs4 import BeautifulSoup\nimport requests\n\n#####"
  },
  {
    "path": "first-edition/code-python3/gradient_descent.py",
    "chars": 5816,
    "preview": "from collections import Counter\nfrom linear_algebra import distance, vector_subtract, scalar_multiply\nfrom functools imp"
  },
  {
    "path": "first-edition/code-python3/hypothesis_and_inference.py",
    "chars": 6012,
    "preview": "from probability import normal_cdf, inverse_normal_cdf\nimport math, random\n\ndef normal_approximation_to_binomial(n, p):\n"
  },
  {
    "path": "first-edition/code-python3/introduction.py",
    "chars": 8200,
    "preview": "# at this stage in the book we haven't actually installed matplotlib,\n# comment this out if you need to\nfrom matplotlib "
  },
  {
    "path": "first-edition/code-python3/line_count.py",
    "chars": 165,
    "preview": "# line_count.py\nimport sys\n\nif __name__ == \"__main__\":\n\n    count = 0\n    for line in sys.stdin:\n        count += 1\n\n   "
  },
  {
    "path": "first-edition/code-python3/linear_algebra.py",
    "chars": 3564,
    "preview": "# -*- coding: iso-8859-15 -*-\n\nimport re, math, random # regexes, math functions, random numbers\nimport matplotlib.pyplo"
  },
  {
    "path": "first-edition/code-python3/logistic_regression.py",
    "chars": 6055,
    "preview": "from collections import Counter\nfrom functools import partial, reduce\nfrom linear_algebra import dot, vector_add\nfrom gr"
  },
  {
    "path": "first-edition/code-python3/machine_learning.py",
    "chars": 1357,
    "preview": "from collections import Counter\nimport math, random\n\n#\n# data splitting\n#\n\ndef split_data(data, prob):\n    \"\"\"split data"
  },
  {
    "path": "first-edition/code-python3/mapreduce.py",
    "chars": 5556,
    "preview": "import math, random, re, datetime\nfrom collections import defaultdict, Counter\nfrom functools import partial\nfrom naive_"
  },
  {
    "path": "first-edition/code-python3/most_common_words.py",
    "chars": 667,
    "preview": "# most_common_words.py\nimport sys\nfrom collections import Counter\n\nif __name__ == \"__main__\":\n\n    # pass in number of w"
  },
  {
    "path": "first-edition/code-python3/multiple_regression.py",
    "chars": 8556,
    "preview": "from collections import Counter\nfrom functools import partial\nfrom linear_algebra import dot, vector_add\nfrom stats impo"
  },
  {
    "path": "first-edition/code-python3/naive_bayes.py",
    "chars": 4513,
    "preview": "from collections import Counter, defaultdict\nfrom machine_learning import split_data\nimport math, random, re, glob\n\ndef "
  },
  {
    "path": "first-edition/code-python3/natural_language_processing.py",
    "chars": 10000,
    "preview": "import math, random, re\nfrom collections import defaultdict, Counter\nfrom bs4 import BeautifulSoup\nimport requests\n\ndef "
  },
  {
    "path": "first-edition/code-python3/nearest_neighbors.py",
    "chars": 7318,
    "preview": "from collections import Counter\nfrom linear_algebra import distance\nfrom stats import mean\nimport math, random\nimport ma"
  },
  {
    "path": "first-edition/code-python3/network_analysis.py",
    "chars": 6998,
    "preview": "import math, random, re\nfrom collections import defaultdict, Counter, deque\nfrom linear_algebra import dot, get_row, get"
  },
  {
    "path": "first-edition/code-python3/neural_networks.py",
    "chars": 6417,
    "preview": "from collections import Counter\nfrom functools import partial\nfrom linear_algebra import dot\nimport math, random\nimport "
  },
  {
    "path": "first-edition/code-python3/plot_state_borders.py",
    "chars": 625,
    "preview": "import re\nimport matplotlib.pyplot as plt\n\nsegments = []\npoints = []\n\nlat_long_regex = r\"<point lat=\\\"(.*)\\\" lng=\\\"(.*)\\"
  },
  {
    "path": "first-edition/code-python3/probability.py",
    "chars": 3809,
    "preview": "from collections import Counter\nimport math, random\n\ndef random_kid():\n    return random.choice([\"boy\", \"girl\"])\n\ndef un"
  },
  {
    "path": "first-edition/code-python3/recommender_systems.py",
    "chars": 6248,
    "preview": "import math, random\nfrom collections import defaultdict, Counter\nfrom linear_algebra import dot\n\nusers_interests = [\n   "
  },
  {
    "path": "first-edition/code-python3/simple_linear_regression.py",
    "chars": 3948,
    "preview": "from collections import Counter, defaultdict\nfrom linear_algebra import vector_subtract\nfrom stats import mean, correlat"
  },
  {
    "path": "first-edition/code-python3/states.txt",
    "chars": 133502,
    "preview": "<state name =\"Alabama\" colour=\"#ff0000\" >\n  <point lat=\"35.0041\" lng=\"-88.1955\"/>\n  <point lat=\"34.9918\" lng=\"-85.6068\"/"
  },
  {
    "path": "first-edition/code-python3/stats.py",
    "chars": 5737,
    "preview": "from collections import Counter\nfrom linear_algebra import sum_of_squares, dot\nimport math\n\nnum_friends = [100,49,41,40,"
  },
  {
    "path": "first-edition/code-python3/stocks.txt",
    "chars": 351764,
    "preview": "symbol\tdate\tclosing_price\nAAPL\t2015-01-23\t112.98\nAAPL\t2015-01-22\t112.4\nAAPL\t2015-01-21\t109.55\nAAPL\t2015-01-20\t108.72\nAAP"
  },
  {
    "path": "first-edition/code-python3/tab_delimited_stock_prices.txt",
    "chars": 120,
    "preview": "6/20/2014\tAAPL\t90.91\n6/20/2014\tMSFT\t41.68\n6/20/2014\tFB\t64.5\n6/19/2014\tAAPL\t91.86\n6/19/2014\tMSFT\t41.51\n6/19/2014\tFB\t64.34"
  },
  {
    "path": "first-edition/code-python3/visualizing_data.py",
    "chars": 5036,
    "preview": "import matplotlib.pyplot as plt\nfrom collections import Counter\n\ndef make_chart_simple_line_chart():\n\n    years = [1950,"
  },
  {
    "path": "first-edition/code-python3/working_with_data.py",
    "chars": 16545,
    "preview": "from collections import Counter, defaultdict\nfrom functools import partial, reduce\nfrom linear_algebra import shape, get"
  },
  {
    "path": "im/README.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "links.md",
    "chars": 9600,
    "preview": "Links\n=====\n\n## Preface\n\n[Data Science Venn Diagram](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram)\n"
  },
  {
    "path": "requirements.txt",
    "chars": 320,
    "preview": "# For a nicer terminal\nipython\n\n# For plotting graphs\nmatplotlib\n\n# For reading in images\npillow\n\n# For making HTTP requ"
  },
  {
    "path": "scratch/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "scratch/clustering.py",
    "chars": 10283,
    "preview": "from scratch.linear_algebra import Vector\n\ndef num_differences(v1: Vector, v2: Vector) -> int:\n    assert len(v1) == len"
  },
  {
    "path": "scratch/crash_course_in_python.py",
    "chars": 20293,
    "preview": "\n\"\"\"\nThis is just code for the introduction to Python.\nIt also won't be used anywhere else in the book.\n\"\"\"\n# type: igno"
  },
  {
    "path": "scratch/databases.py",
    "chars": 12901,
    "preview": "users = [[0, \"Hero\", 0],\n         [1, \"Dunn\", 2],\n         [2, \"Sue\", 3],\n         [3, \"Chi\", 3]]\n\nfrom typing import Tu"
  },
  {
    "path": "scratch/decision_trees.py",
    "chars": 7311,
    "preview": "from typing import List\nimport math\n\ndef entropy(class_probabilities: List[float]) -> float:\n    \"\"\"Given a list of clas"
  },
  {
    "path": "scratch/deep_learning.py",
    "chars": 23857,
    "preview": "Tensor = list\n\nfrom typing import List\n\ndef shape(tensor: Tensor) -> List[int]:\n    sizes: List[int] = []\n    while isin"
  },
  {
    "path": "scratch/getting_data.py",
    "chars": 11107,
    "preview": "\n# Just stick some data there\nwith open('email_addresses.txt', 'w') as f:\n    f.write(\"joelgrus@gmail.com\\n\")\n    f.writ"
  },
  {
    "path": "scratch/gradient_descent.py",
    "chars": 5298,
    "preview": "from scratch.linear_algebra import Vector, dot\n\ndef sum_of_squares(v: Vector) -> float:\n    \"\"\"Computes the sum of squar"
  },
  {
    "path": "scratch/inference.py",
    "chars": 6982,
    "preview": "from typing import Tuple\nimport math\n\ndef normal_approximation_to_binomial(n: int, p: float) -> Tuple[float, float]:\n   "
  },
  {
    "path": "scratch/introduction.py",
    "chars": 8266,
    "preview": "\n\"\"\"\nThis is code for the introduction chapter. As such, it stands alone\nand won't be used anywhere else in the book.\n\"\""
  },
  {
    "path": "scratch/k_nearest_neighbors.py",
    "chars": 4880,
    "preview": "from typing import List\nfrom collections import Counter\n\ndef raw_majority_vote(labels: List[str]) -> str:\n    votes = Co"
  },
  {
    "path": "scratch/linear_algebra.py",
    "chars": 5566,
    "preview": "from typing import List\n\nVector = List[float]\n\nheight_weight_age = [70,  # inches,\n                     170, # pounds,\n "
  },
  {
    "path": "scratch/logistic_regression.py",
    "chars": 7642,
    "preview": "\ntuples = [(0.7,48000,1),(1.9,48000,0),(2.5,60000,1),(4.2,63000,0),(6,76000,0),(6.5,69000,0),(7.5,76000,0),(8.1,88000,0)"
  },
  {
    "path": "scratch/machine_learning.py",
    "chars": 2431,
    "preview": "import random\nfrom typing import TypeVar, List, Tuple\nX = TypeVar('X')  # generic type to represent a data point\n\ndef sp"
  },
  {
    "path": "scratch/mapreduce.py",
    "chars": 6855,
    "preview": "from typing import List\nfrom collections import Counter\n\ndef tokenize(document: str) -> List[str]:\n    \"\"\"Just split on "
  },
  {
    "path": "scratch/multiple_regression.py",
    "chars": 11210,
    "preview": "\nfrom typing import List\n\ninputs: List[List[float]] = [[1.,49,4,0],[1,41,9,0],[1,40,8,0],[1,25,6,0],[1,21,1,0],[1,21,0,0"
  },
  {
    "path": "scratch/naive_bayes.py",
    "chars": 6075,
    "preview": "from typing import Set\nimport re\n\ndef tokenize(text: str) -> Set[str]:\n    text = text.lower()                         #"
  },
  {
    "path": "scratch/network_analysis.py",
    "chars": 6665,
    "preview": "from typing import NamedTuple\n\nclass User(NamedTuple):\n    id: int\n    name: str\n\nusers = [User(0, \"Hero\"), User(1, \"Dun"
  },
  {
    "path": "scratch/neural_networks.py",
    "chars": 8508,
    "preview": "from scratch.linear_algebra import Vector, dot\n\ndef step_function(x: float) -> float:\n    return 1.0 if x >= 0 else 0.0\n"
  },
  {
    "path": "scratch/nlp.py",
    "chars": 25691,
    "preview": "\nimport matplotlib.pyplot as plt\nplt.gca().clear()\n\ndata = [ (\"big data\", 100, 15), (\"Hadoop\", 95, 25), (\"Python\", 75, 5"
  },
  {
    "path": "scratch/nlp_advanced.py",
    "chars": 690,
    "preview": "from scratch.deep_learning import Optimizer, Layer\n\nclass EmbeddingOptimizer(Optimizer):\n    \"\"\"\n    Optimized for the c"
  },
  {
    "path": "scratch/probability.py",
    "chars": 4737,
    "preview": "def uniform_cdf(x: float) -> float:\n    \"\"\"Returns the probability that a uniform random variable is <= x\"\"\"\n    if x < "
  },
  {
    "path": "scratch/recommender_systems.py",
    "chars": 12810,
    "preview": "users_interests = [\n    [\"Hadoop\", \"Big Data\", \"HBase\", \"Java\", \"Spark\", \"Storm\", \"Cassandra\"],\n    [\"NoSQL\", \"MongoDB\","
  },
  {
    "path": "scratch/simple_linear_regression.py",
    "chars": 3316,
    "preview": "def predict(alpha: float, beta: float, x_i: float) -> float:\n    return beta * x_i + alpha\n\ndef error(alpha: float, beta"
  },
  {
    "path": "scratch/statistics.py",
    "chars": 6774,
    "preview": "\nnum_friends = [100.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10"
  },
  {
    "path": "scratch/visualization.py",
    "chars": 4696,
    "preview": "from matplotlib import pyplot as plt\n\nyears = [1950, 1960, 1970, 1980, 1990, 2000, 2010]\ngdp = [300.2, 543.3, 1075.9, 28"
  },
  {
    "path": "scratch/working_with_data.py",
    "chars": 18501,
    "preview": "from typing import List, Dict\nfrom collections import Counter\nimport math\n\nimport matplotlib.pyplot as plt\n\ndef bucketiz"
  },
  {
    "path": "stocks.csv",
    "chars": 1734956,
    "preview": "Symbol,Date,Open,High,Low,Close,Adj Close,Volume\r\nAAPL,1980-12-12,0.513393,0.515625,0.513393,0.513393,0.023106,117258400"
  }
]

About this extraction

This page contains the full source code of the joelgrus/data-science-from-scratch GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 108 files (3.1 MB), approximately 823.6k tokens, and a symbol index with 968 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract, a free GitHub repo to text converter for AI. Built by Nikandr Surkov.
