Full Code of dvgodoy/handyspark for AI

master 0fb4c8707b34 cached

49 files

467.3 KB

201.5k tokens

344 symbols

1 requests

Download .txt

Showing preview only (487K chars total). Download the full file or copy to clipboard to get everything.

Repository: dvgodoy/handyspark
Branch: master
Commit: 0fb4c8707b34
Files: 49
Total size: 467.3 KB

Directory structure:
gitextract_4phs78pk/

├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── README.rst
├── docs/
│   ├── Makefile
│   └── source/
│       ├── conf.py
│       ├── handyspark.extensions.rst
│       ├── handyspark.ml.rst
│       ├── handyspark.rst
│       ├── handyspark.sql.rst
│       ├── includeme.rst
│       ├── index.rst
│       └── modules.rst
├── handyspark/
│   ├── __init__.py
│   ├── extensions/
│   │   ├── __init__.py
│   │   ├── common.py
│   │   ├── evaluation.py
│   │   └── types.py
│   ├── ml/
│   │   ├── __init__.py
│   │   └── base.py
│   ├── plot.py
│   ├── sql/
│   │   ├── __init__.py
│   │   ├── dataframe.py
│   │   ├── datetime.py
│   │   ├── pandas.py
│   │   ├── schema.py
│   │   ├── string.py
│   │   └── transform.py
│   ├── stats.py
│   └── util.py
├── notebooks/
│   └── Exploring_Titanic.ipynb
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests/
    ├── handyspark/
    │   ├── conftest.py
    │   ├── extensions/
    │   │   ├── test_evaluation.py
    │   │   └── test_types.py
    │   ├── ml/
    │   │   └── test_base.py
    │   ├── sql/
    │   │   ├── test_dataframe.py
    │   │   ├── test_datetime.py
    │   │   ├── test_pandas.py
    │   │   ├── test_schema.py
    │   │   ├── test_string.py
    │   │   └── test_transform.py
    │   ├── test_plot.py
    │   ├── test_stats.py
    │   └── test_util.py
    └── rawdata/
        └── train.csv

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea
examples/spark-warehouse/
tests/spark-warehouse

================================================
FILE: .travis.yml
================================================
language: python
sudo: required
dist: trusty
cache:
  directories:
    - $HOME/.ivy2
    - $HOME/spark
    - $HOME/.cache/pip
    - $HOME/.pip-cache
    - $HOME/.sbt/launchers
jdk:
  - oraclejdk8
python:
  - 3.6
sudo: false
addons:
  apt:
    packages:
      - axel
cache: pip
before_install:
  - export PATH=$HOME/.local/bin:$PATH
  - pip install -U pip
  - export PYTHONPATH=$PYTHONPATH:$(pwd)
install:
  # Download spark 2.3.3
  - "[ -f spark ] || mkdir spark && cd spark && axel http://www-us.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz && cd .."
  - "tar -xf ./spark/spark-2.3.3-bin-hadoop2.7.tgz"
  - "export SPARK_HOME=`pwd`/spark-2.3.3-bin-hadoop2.7"
  - "export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python"
  - echo "spark.yarn.jars=$SPARK_HOME/jars/*.jar" > $SPARK_HOME/conf/spark-defaults.conf
  - pip install -r requirements.txt
script:
  - pytest ./tests


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Daniel Voigt Godoy

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
[![Build Status](https://travis-ci.org/dvgodoy/handyspark.svg?branch=master)](https://travis-ci.org/dvgodoy/handyspark)

# HandySpark

## Bringing pandas-like capabilities to Spark dataframes!

***HandySpark*** is a package designed to improve ***PySpark*** user experience, especially when it comes to ***exploratory data analysis***, including ***visualization*** capabilities!

It makes fetching data or computing statistics for columns really easy, returning ***pandas objects*** straight away.

It also leverages on the recently released ***pandas UDFs*** in Spark to allow for an out-of-the-box usage of common ***pandas functions*** in a Spark dataframe.

Moreover, it introduces the ***stratify*** operation, so users can perform more sophisticated analysis, imputation and outlier detection on stratified data without incurring in very computationally expensive ***groupby*** operations.

It brings the long missing capability of ***plotting*** data while retaining the advantage of performing distributed computation (unlike many tutorials on the internet, which just convert the whole dataset to pandas and then plot it - don't ever do that!).

Finally, it also extends ***evaluation metrics*** for ***binary classification***, so you can easily choose which threshold to use!

## Google Colab

Eager to try it out right away? Don't wait any longer!

Open the notebook directly on Google Colab and try it yourself:

- [Exploring Titanic](https://colab.research.google.com/github/dvgodoy/handyspark/blob/master/notebooks/Exploring_Titanic.ipynb)

## Installation

To install ***HandySpark*** from [PyPI](https://pypi.org/project/handyspark/), just type:
```python
pip install handyspark
```

## Documentation

You can find the full documentation [here](http://dvgodoy.github.com/handyspark).

Here is a ***handy*** list of direct links to some classes, objects and methods used:

- [HandyFrame](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.HandyFrame)
  - [cols](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.HandyColumns)
  - [pandas](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.pandas.HandyPandas)
  - [transformers](https://dvgodoy.github.io/handyspark/handyspark.ml.html#handyspark.ml.base.HandyTransformers)
  - [isnull](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.isnull)
  - [fill](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.fill)
  - [outliers](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.outliers)
  - [fence](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.fence)
  - [stratify](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.HandyFrame.stratify)

- [Bucket](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.Bucket)
- [Quantile](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.Quantile)

- [HandyImputer](https://dvgodoy.github.io/handyspark/handyspark.ml.html#handyspark.ml.base.HandyImputer)
- [HandyFencer](https://dvgodoy.github.io/handyspark/handyspark.ml.html#handyspark.ml.base.HandyFencer)

## Quick Start

To use ***HandySpark***, all you need to do is import the package and, after loading your data into a Spark dataframe, call the ***toHandy()*** method to get your own ***HandyFrame***:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from handyspark import *
sdf = spark.read.csv('./tests/rawdata/train.csv', header=True, inferSchema=True)
hdf = sdf.toHandy()
```

### Fetching and plotting data

Now you can easily fetch data as if you were using pandas, just use the ***cols*** object from your ***HandyFrame***:
```python
hdf.cols['Name'][:5]
```

It should return a pandas Series object:
```
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object
```

If you include a list of columns, it will return a pandas DataFrame.

Due to the distributed nature of data in Spark, it is only possible to fetch the top rows of any given ***HandyFrame***.

Using ***cols*** you have access to several pandas-like column and DataFrame based methods implemented in Spark:

- min / max / median / q1 / q3 / stddev / mode
- nunique
- value_counts
- corr
- hist
- boxplot
- scatterplot

For instance:
```python
hdf.cols['Embarked'].value_counts(dropna=False)
```

```
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
```

You can also make some plots:
```python
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(12, 4))
hdf.cols['Embarked'].hist(ax=axs[0])
hdf.cols['Age'].boxplot(ax=axs[1])
hdf.cols['Fare'].boxplot(ax=axs[2])
hdf.cols[['Fare', 'Age']].scatterplot(ax=axs[3])
```

![cols plots](/images/cols_plot.png)

Handy, right (pun intended!)? But things can get ***even more*** interesting if you use ***stratify***!

### Stratify

Stratifying a HandyFrame means using a ***split-apply-combine*** approach. It will first split your HandyFrame according to the specified (discrete) columns, then it will apply some function to each stratum of data and finally combine the results back together.

This is better illustrated with an example - let's try the stratified version of our previous `value_counts`:
```python
hdf.stratify(['Pclass']).cols['Embarked'].value_counts()
```

```
Pclass  Embarked
1       C            85
        Q             2
        S           127
2       C            17
        Q             3
        S           164
3       C            66
        Q            72
        S           353
Name: value_counts, dtype: int64
```

Cool, isn't it? Besides, under the hood, not a single ***group by*** operation was performed - everything is handled using filter clauses! So, ***no data shuffling***!

What if you want to ***stratify*** on a column containing continuous values? No problem!

```python
hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].value_counts()
```

```
Sex     Age                                Embarked
female  Age >= 0.4200 and Age < 40.2100    C            46
                                           Q            12
                                           S           154
        Age >= 40.2100 and Age <= 80.0000  C            15
                                           S            32
male    Age >= 0.4200 and Age < 40.2100    C            53
                                           Q            11
                                           S           287
        Age >= 40.2100 and Age <= 80.0000  C            16
                                           Q             5
                                           S            81
Name: value_counts, dtype: int64
```

You can use either ***Bucket*** or ***Quantile*** to discretize your data in any given number of bins!

What about ***plotting*** it? Yes, ***HandySpark*** can handle that as well!

```python
hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].hist(figsize=(8, 6))
```

![stratified hist](/images/stratified_hist.png)

### Handling missing data

***HandySpark*** makes it very easy to spot and fill missing values. To figure if there are any missing values, just use ***isnull***:
```python
hdf.isnull(ratio=True)
```

```
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
Name: missing(ratio), dtype: float64
```

Ok, now you know there are 3 columns with missing values: `Age`, `Cabin` and `Embarked`. It's time to fill those values up! But, let's skip `Cabin`, which has 77% of its values missing!

So, `Age` is a continuous variable, while `Embarked` is a categorical variable. Let's start with the latter:

```python
hdf_filled = hdf.fill(categorical=['Embarked'])
```

***HandyFrame*** has a ***fill*** method which takes up to 3 arguments:
- categorical: a list of categorical variables
- continuous: a list of continuous variables
- strategy: which strategy to use for each one of the continuous variables (either `mean` or `median`)

Categorical variables use a `mode` strategy by default.

But you do not need to stick with the basics anymore... you can fancy it up using ***stratify*** together with ***fill***:
```python
hdf_filled = hdf_filled.stratify(['Pclass', 'Sex']).fill(continuous=['Age'], strategy=['mean'])
```

How do you know which values are being used? Simple enough:
```python
hdf_filled.statistics_
```

```
{'Age': {'Pclass == "1" and Sex == "female"': 34.61176470588235,
  'Pclass == "1" and Sex == "male"': 41.28138613861386,
  'Pclass == "2" and Sex == "female"': 28.722972972972972,
  'Pclass == "2" and Sex == "male"': 30.74070707070707,
  'Pclass == "3" and Sex == "female"': 21.75,
  'Pclass == "3" and Sex == "male"': 26.507588932806325},
 'Embarked': 'S'}
```

There you go! The filter clauses and the corresponding imputation values!

But there is ***more*** - once you're with your imputation procedure, why not generate a ***custom transformer*** to do that for you, either on your test set or in production?

You only need to call the ***imputer*** method of the ***transformer*** object that every ***HandyFrame*** has:
```python
imputer = hdf_filled.transformers.imputer()
```

In the example above, ***imputer*** is now a full-fledged serializable PySpark transformer! What does that mean? You can use it in your ***pipeline*** and ***save / load*** at will :-)

###  Detecting outliers

Second only to the problem of missing data, outliers can pose a challenge for training machine learning models.

***HandyFrame*** to the rescue, with its ***outliers*** method:

```python
hdf_filled.outliers(method='tukey', k=3.)
```

```
PassengerId      0.0
Survived         0.0
Pclass           0.0
Age              1.0
SibSp           12.0
Parch          213.0
Fare            53.0
dtype: float64
```

Currently, only [***Tukey's***](https://en.wikipedia.org/wiki/Outlier#Tukey's_fences) method is available. This method takes an optional ***k*** argument, which you can set to larger values (like 3) to allow for a more loose detection.

The good thing is, now we can take a peek at the data by plotting it:

```python
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
hdf_filled.cols['Parch'].hist(ax=axs[0])
hdf_filled.cols['SibSp'].hist(ax=axs[1])
hdf_filled.cols['Age'].boxplot(ax=axs[2], k=3)
hdf_filled.cols['Fare'].boxplot(ax=axs[3], k=3)
```

![outliers](/images/outliers.png)

Let's focus on the `Fare` column - what can we do about it? Well, we could use Tukey's fences to, er... ***fence*** the outliers :-)

```python
hdf_fenced = hdf_filled.fence(['Fare'])
```

Which values were used, you ask?
```python
hdf_fenced.fences_
```

```
{'Fare': [-26.0105, 64.4063]}
```

It works quite similarly to the ***fill*** method and, I hope you guessed, it ***also*** gives you the ability to create the corresponding ***custom transformer*** :-)

```python
fencer = hdf_fenced.transformers.fencer()
```

You can also use [***Mahalanobis distance***](https://en.wikipedia.org/wiki/Mahalanobis_distance) to identify outliers in a multi-dimensional space, given a critical value (usually 99.9%, but you are free to have either more restriced or relaxed threshold).

To get the outliers for a subset of columns (only ***numerical*** columns are considered!):

```
outliers = hdf_filled.cols[['Age', 'Fare', 'SibSp']].get_outliers(critical_value=.90)
```

Let's take a look at the first 5 outliers found:

```
outliers.cols[:][:5]
```

![outliers](/images/mahalanobis_outliers.png)

What if you want to discard these sample? You just need to call `remove_outliers`:

```
hdf_without_outliers = hdf_filled.cols[['Age', 'Fare', 'SibSp']].remove_outliers(critical_value=0.90)
```

### Evaluating your model!

You cleaned your data, you trained your classification model, you fine-tuned it and now you want to ***evaluate*** it, right?

```
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assem = VectorAssembler(inputCols=['Fare', 'Pclass', 'Age'], outputCol='features')
rf = RandomForestClassifier(featuresCol='features', labelCol='Survived', numTrees=20)
pipeline = Pipeline(stages=[assem, rf])
model = pipeline.fit(hdf_fenced)

predictions = model.transform(hdf_fenced)
evaluator = BinaryClassificationEvaluator(labelCol='Survived')
evaluator.evaluate(predictions)
```

Then you realize evaluators only give you `areaUnderROC` and `areaUnderPR`. How about ***plotting ROC or PR curves***? How about ***finding a threshold*** that suits your needs for False Positive or False negatives?

***HandySpark*** extends the ***BinaryClassificationMetrics*** object to take ***DataFrames*** and output ***all your evaluation needs***!

```
bcm = BinaryClassificationMetrics(predictions, scoreCol='probability', labelCol='Survived')
```

Now you can ***plot*** the curves...

```
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
bcm.plot_roc_curve(ax=axs[0])
bcm.plot_pr_curve(ax=axs[1])
```

![curves](/images/evaluation_curves.png)

...or get metrics for every ***threshold***...

```
bcm.getMetricsByThreshold().toPandas()[100:105]
```

![metrics](/images/metrics_thresholds.png)

...or the ***confusion matrix*** for the threshold you chose:

```
bcm.print_confusion_matrix(.572006)
```

![cm](/images/confusion.png)

### Pandas and more pandas!

With ***HandySpark*** you can feel ***almost*** as if you were using traditional pandas :-)

To gain access to the whole suite of available pandas functions, you need to leverage the ***pandas*** object of your ***HandyFrame***:
```python
some_ports = hdf_fenced.pandas['Embarked'].isin(values=['C', 'Q'])
some_ports
```

```
Column<b'udf(Embarked) AS `<lambda>(Embarked,)`'>
```

In the example above, ***HandySpark*** treats the `Embarked` column as if it were a pandas Series and, therefore, you may call its ***isin*** method!

But, remember Spark has ***lazy evaluation***, so the result is a ***column expression*** which leverages the power of ***pandas UDFs*** (provived that PyArrow is installed, otherwise it will fall back to traditional UDFs).

The only thing left to do is to actually ***assign*** the results to a new column, right?
```python
hdf_fenced = hdf_fenced.assign(is_c_or_q=some_ports)
# What's in there?
hdf_fenced.cols['is_c_or_q'][:5]
```

```
0     True
1    False
2    False
3     True
4     True
Name: is_c_or_q, dtype: bool
```

You got that right! ***HandyFrame*** has a very convenient ***assign*** method, just like in pandas!

It does not get much easier than that :-) There are several column methods available already:
- betweeen / between_time
- isin
- isna / isnull
- notna / notnull
- abs
- clip / clip_lower / clip_upper
- replace
- round / truncate
- tz_convert / tz_localize

And this is not all! Both specialized ***str*** and ***dt*** objects from pandas are available as well!

For instance, if you want to find if a given string contains another substring?

```python
col_mrs = hdf_fenced.pandas['Name'].str.find(sub='Mrs.')
hdf_fenced = hdf_fenced.assign(is_mrs=col_mrs > 0)
```

![is mrs](/images/is_mrs.png)

There are many, many more available methods:

1. ***String methods***:
- contains
- startswith / endswitch
- match
- isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace
- islower / isupper / istitle
- replace
- repeat
- join
- pad
- slice / slice_replace
- strip / lstrip / rstrip
- wrap / center / ljust / rjust
- translate
- get
- normalize
- lower / upper / capitalize / swapcase / title
- zfill
- count
- find / rfind
- len

2. ***Date / Datetime methods***:
- is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start
- strftime
- tz / time / tz_convert / tz_localize
- day / dayofweek / dayofyear / days_in_month / daysinmonth
- hour / microsecond / minute / nanosecond / second
- week / weekday / weekday_name
- month / quarter / year / weekofyear
- date
- ceil / floor / round
- normalize

### Your own functions

The sky is the limit! You can create regular Python functions and use assign to create new columns :-)

No need to worry about turning them into ***pandas UDFs*** - everything is handled by ***HandySpark*** under the hood!

The arguments of your function (or `lambda`) should have the names of the columns you want to use. For instance, to take the `log` of `Fare`:

```python
import numpy as np
hdf_fenced = hdf_fenced.assign(logFare=lambda Fare: np.log(Fare + 1))
```

![logfare](/images/logfare.png)

You can also use multiple columns:

```python
hdf_fenced = hdf_fenced.assign(fare_times_age=lambda Fare, Age: Fare * Age)
```

Even though the result is kinda pointless, it will work :-)

Keep in mind that the ***return type***, that is, the column type of the new column, will be the same as the first column used (`Fare`, in the example).

What if you want to return something of a ***different*** type?! No worries! You only need to ***wrap*** your function with the desired return type. An example should make this more clear:

```python
from pyspark.sql.types import StringType

hdf_fenced = hdf_fenced.assign(str_fare=StringType.ret(lambda Fare: Fare.map('${:,.2f}'.format)))

hdf_fenced.cols['str_fare'][:5]
```

```
0    $65.66
1    $53.10
2    $26.55
3    $65.66
4    $65.66
Name: str_fare, dtype: object
```

Basically, we imported the desired output type - ***StringType*** - and used its extended method ***ret*** to wrap our `lambda` function that formats our numeric `Fare` column into a string.

It is also possible to create a more complex type, like an array of doubles:

```python
from pyspark.sql.types import ArrayType, DoubleType

def make_list(Fare):
    return Fare.apply(lambda v: [v, v*2])

hdf_fenced = hdf_fenced.assign(fare_list=ArrayType(DoubleType()).ret(make_list))

hdf_fenced.cols['fare_list'][:5]
```

```
0           [7.25, 14.5]
1    [71.2833, 142.5666]
2         [7.925, 15.85]
3          [53.1, 106.2]
4           [8.05, 16.1]
Name: fare_list, dtype: object
```

OK, so, what happened here?

1. First, we imported the necessary types, ***ArrayType*** and ***DoubleType***, since we are building a function that returns a list of doubles.
2. We actually built the function - notice that we call ***apply*** straight from ***Fare***, which is treated as a pandas Series under the hood.
3. We ***wrap*** the function with the return type `ArrayType(DoubleType())` by invoking the extended method `ret`.
4. Finally, we assign it to a new column name, and that's it!

### Nicer exceptions

Now, suppose you make a mistake while creating your function... if you have used Spark for a while, you already realized that, when an exception is raised, it will be ***loooong***, right?

To help you with that, ***HandySpark*** analyzes the error message and parses it nicely for you at the very ***top*** of the error message, in ***bold red***:

![exception](/images/handy_exception.png)

### Safety first

***HandySpark*** wants to protect your cluster and network, so it implements a ***safety*** whenever you perform an operation that are going to retrieve ***ALL*** data from your ***HandyFrame***, like `collect` or `toPandas`.

How does that work? Every time a ***HandyFrame*** has one of these methods called, it will output up to the ***safety limit***, which has a default of ***1,000 elements***.

![safety on](/images/safety_on.png)

Do you want to set a different safety limit for your ***HandyFrame***?

![safety limit](/images/safety_limit.png)

What if you want to retrieve everything nonetheless?! You can invoke the ***safety_off*** method prior to the actual method you want to call and you get a ***one-time*** unlimited result.

![safety off](/images/safety_off.png)

### Don't feel like Handy anymore?

To get back your original Spark dataframe, you only need to call ***notHandy*** to make it not handy again:

```python
hdf_fenced.notHandy()
```

```
DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string, logFare: double, is_c_or_q: boolean]
```

## Comments, questions, suggestions, bugs

***DISCLAIMER***: this is a project ***under development***, so it is likely you'll run into bugs/problems.

So, if you find any bugs/problems, please open an [issue](https://github.com/dvgodoy/handyspark/issues) or submit a [pull request](https://github.com/dvgodoy/handyspark/pulls).


================================================
FILE: README.rst
================================================


.. image:: https://travis-ci.org/dvgodoy/handyspark.svg?branch=master
   :target: https://travis-ci.org/dvgodoy/handyspark
   :alt: Build Status


HandySpark
==========

Bringing pandas-like capabilities to Spark dataframes!
------------------------------------------------------

*HandySpark* is a package designed to improve *PySpark* user experience, especially when it comes to *exploratory data analysis* , including *visualization* capabilities!

It makes fetching data or computing statistics for columns really easy, returning *pandas objects* straight away.

It also leverages on the recently released *pandas UDFs* in Spark to allow for an out-of-the-box usage of common *pandas functions* in a Spark dataframe.

Moreover, it introduces the *stratify* operation, so users can perform more sophisticated analysis, imputation and outlier detection on stratified data without incurring in very computationally expensive *groupby* operations.

Finally, it brings the long missing capability of *plotting* data while retaining the advantage of performing distributed computation (unlike many tutorials on the internet, which just convert the whole dataset to pandas and then plot it - don't ever do that!).

Google Colab
------------

Eager to try it out right away? Don't wait any longer!

Open the notebook directly on Google Colab and try it yourself:


* `Exploring Titanic <https://colab.research.google.com/github/dvgodoy/handyspark/blob/master/notebooks/Exploring_Titanic.ipynb>`_

Installation
------------

To install *HandySpark* from `PyPI <https://pypi.org/project/handyspark/>`_, just type:

.. code-block:: python

   pip install handyspark

Documentation
-------------

You can find the full documentation `here <http://dvgodoy.github.com/handyspark>`_.

Quick Start
-----------

To use *HandySpark* , all you need to do is import the package and, after loading your data into a Spark dataframe, call the *toHandy()* method to get your own *HandyFrame* :

.. code-block:: python

   from pyspark.sql import SparkSession
   spark = SparkSession.builder.getOrCreate()

   from handyspark import *
   sdf = spark.read.csv('./tests/rawdata/train.csv', header=True, inferSchema=True)
   hdf = sdf.toHandy()

Fetching and plotting data
^^^^^^^^^^^^^^^^^^^^^^^^^^

Now you can easily fetch data as if you were using pandas, just use the *cols* object from your *HandyFrame* :

.. code-block:: python

   hdf.cols['Name'][:5]

It should return a pandas Series object:

.. code-block::

   0                              Braund, Mr. Owen Harris
   1    Cumings, Mrs. John Bradley (Florence Briggs Th...
   2                               Heikkinen, Miss. Laina
   3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
   4                             Allen, Mr. William Henry
   Name: Name, dtype: object

If you include a list of columns, it will return a pandas DataFrame.

Due to the distributed nature of data in Spark, it is only possible to fetch the top rows of any given *HandyFrame*.

Using *cols* you have access to several pandas-like column and DataFrame based methods implemented in Spark:


* min / max / median / q1 / q3 / stddev / mode
* nunique
* value_counts
* corr
* hist
* boxplot
* scatterplot

For instance:

.. code-block:: python

   hdf.cols['Embarked'].value_counts(dropna=False)

.. code-block::

   S      644
   C      168
   Q       77
   NaN      2
   Name: Embarked, dtype: int64

You can also make some plots:

.. code-block:: python

   from matplotlib import pyplot as plt
   fig, axs = plt.subplots(1, 4, figsize=(12, 4))
   hdf.cols['Embarked'].hist(ax=axs[0])
   hdf.cols['Age'].boxplot(ax=axs[1])
   hdf.cols['Fare'].boxplot(ax=axs[2])
   hdf.cols[['Fare', 'Age']].scatterplot(ax=axs[3])


.. image:: /images/cols_plot.png
   :target: /images/cols_plot.png
   :alt: cols plots


Handy, right (pun intended!)? But things can get *even more* interesting if you use *stratify* !

Stratify
^^^^^^^^

Stratifying a HandyFrame means using a *split-apply-combine* approach. It will first split your HandyFrame according to the specified (discrete) columns, then it will apply some function to each stratum of data and finally combine the results back together.

This is better illustrated with an example - let's try the stratified version of our previous ``value_counts``\ :

.. code-block:: python

   hdf.stratify(['Pclass']).cols['Embarked'].value_counts()

.. code-block::

   Pclass  Embarked
   1       C            85
           Q             2
           S           127
   2       C            17
           Q             3
           S           164
   3       C            66
           Q            72
           S           353
   Name: value_counts, dtype: int64

Cool, isn't it? Besides, under the hood, not a single *group by* operation was performed - everything is handled using filter clauses! So, *no data shuffling* !

What if you want to *stratify* on a column containing continuous values? No problem!

.. code-block:: python

   hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].value_counts()

.. code-block::

   Sex     Age                                Embarked
   female  Age >= 0.4200 and Age < 40.2100    C            46
                                              Q            12
                                              S           154
           Age >= 40.2100 and Age <= 80.0000  C            15
                                              S            32
   male    Age >= 0.4200 and Age < 40.2100    C            53
                                              Q            11
                                              S           287
           Age >= 40.2100 and Age <= 80.0000  C            16
                                              Q             5
                                              S            81
   Name: value_counts, dtype: int64

You can use either *Bucket* or *Quantile* to discretize your data in any given number of bins!

What about *plotting* it? Yes, *HandySpark* can handle that as well!

.. code-block:: python

   hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].hist(figsize=(8, 6))


.. image:: /images/stratified_hist.png
   :target: /images/stratified_hist.png
   :alt: stratified hist


Handling missing data
^^^^^^^^^^^^^^^^^^^^^

*HandySpark* makes it very easy to spot and fill missing values. To figure if there are any missing values, just use *isnull* :

.. code-block:: python

   hdf.isnull(ratio=True)

.. code-block::

   PassengerId    0.000000
   Survived       0.000000
   Pclass         0.000000
   Name           0.000000
   Sex            0.000000
   Age            0.198653
   SibSp          0.000000
   Parch          0.000000
   Ticket         0.000000
   Fare           0.000000
   Cabin          0.771044
   Embarked       0.002245
   Name: missing(ratio), dtype: float64

Ok, now you know there are 3 columns with missing values: ``Age``\ , ``Cabin`` and ``Embarked``. It's time to fill those values up! But, let's skip ``Cabin``\ , which has 77% of its values missing!

So, ``Age`` is a continuous variable, while ``Embarked`` is a categorical variable. Let's start with the latter:

.. code-block:: python

   hdf_filled = hdf.fill(categorical=['Embarked'])

*HandyFrame* has a *fill* method which takes up to 3 arguments:


* categorical: a list of categorical variables
* continuous: a list of continuous variables
* strategy: which strategy to use for each one of the continuous variables (either ``mean`` or ``median``\ )

Categorical variables use a ``mode`` strategy by default.

But you do not need to stick with the basics anymore... you can fancy it up using *stratify* together with *fill* :

.. code-block:: python

   hdf_filled = hdf_filled.stratify(['Pclass', 'Sex']).fill(continuous=['Age'], strategy=['mean'])

How do you know which values are being used? Simple enough:

.. code-block:: python

   hdf_filled.statistics_

.. code-block::

   {'Embarked': 'S',
    'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
    'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
    'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
    'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
    'Pclass == "3" and Sex == "female"': {'Age': 21.75},
    'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}

There you go! The filter clauses and the corresponding imputation values!

But there is *more* - once you're with your imputation procedure, why not generate a *custom transformer* to do that for you, either on your test set or in production?

You only need to call the *imputer* method of the *transformer* object that every *HandyFrame* has:

.. code-block:: python

   imputer = hdf_filled.transformers.imputer()

In the example above, *imputer* is now a full-fledged serializable PySpark transformer! What does that mean? You can use it in your *pipeline* and *save / load* at will :-)

Detecting outliers
^^^^^^^^^^^^^^^^^^

Second only to the problem of missing data, outliers can pose a challenge for training machine learning models.

*HandyFrame* to the rescue, with its *outliers* method:

.. code-block:: python

   hdf_filled.outliers(method='tukey', k=3.)

.. code-block::

   PassengerId      0.0
   Survived         0.0
   Pclass           0.0
   Age              1.0
   SibSp           12.0
   Parch          213.0
   Fare            53.0
   dtype: float64

Currently, only `\ *Tukey's* <https://en.wikipedia.org/wiki/Outlier#Tukey's_fences>`_ method is available (I am working on Mahalanobis distance!). This method takes an optional *k* argument, which you can set to larger values (like 3) to allow for a more loose detection.

The good thing is, now we can take a peek at the data by plotting it:

.. code-block:: python

   from matplotlib import pyplot as plt
   fig, axs = plt.subplots(1, 4, figsize=(16, 4))
   hdf_filled.cols['Parch'].hist(ax=axs[0])
   hdf_filled.cols['SibSp'].hist(ax=axs[1])
   hdf_filled.cols['Age'].boxplot(ax=axs[2], k=3)
   hdf_filled.cols['Fare'].boxplot(ax=axs[3], k=3)


.. image:: /images/outliers.png
   :target: /images/outliers.png
   :alt: outliers


Let's focus on the ``Fare`` column - what can we do about it? Well, we could use Tukey's fences to, er... *fence* the outliers :-)

.. code-block:: python

   hdf_fenced = hdf_filled.fence(['Fare'])

Which values were used, you ask?

.. code-block:: python

   hdf_fenced.fences_

.. code-block::

   {'Fare': [-26.7605, 65.6563]}

It works quite similarly to the *fill* method and, I hope you guessed, it *also* gives you the ability to create the corresponding *custom transformer* :-)

.. code-block:: python

   fencer = hdf_fenced.transformers.fencer()

Pandas and more pandas!
^^^^^^^^^^^^^^^^^^^^^^^

With *HandySpark* you can feel *almost* as if you were using traditional pandas :-)

To gain access to the whole suite of available pandas functions, you need to leverage the *pandas* object of your *HandyFrame* :

.. code-block:: python

   some_ports = hdf_fenced.pandas['Embarked'].isin(values=['C', 'Q'])
   some_ports

.. code-block::

   Column<b'udf(Embarked) AS `<lambda>(Embarked,)`'>

In the example above, *HandySpark* treats the ``Embarked`` column as if it were a pandas Series and, therefore, you may call its *isin* method!

But, remember Spark has *lazy evaluation* , so the result is a *column expression* which leverages the power of *pandas UDFs* (provived that PyArrow is installed, otherwise it will fall back to traditional UDFs).

The only thing left to do is to actually *assign* the results to a new column, right?

.. code-block:: python

   hdf_fenced = hdf_fenced.assign(is_c_or_q=some_ports)
   # What's in there?
   hdf_fenced.cols['is_c_or_q'][:5]

.. code-block::

   0     True
   1    False
   2    False
   3     True
   4     True
   Name: is_c_or_q, dtype: bool

You got that right! *HandyFrame* has a very convenient *assign* method, just like in pandas!

It does not get much easier than that :-) There are several column methods available already:


* betweeen / between_time
* isin
* isna / isnull
* notna / notnull
* abs
* clip / clip_lower / clip_upper
* replace
* round / truncate
* tz_convert / tz_localize

And this is not all! Both specialized *str* and *dt* objects from pandas are available as well!

For instance, if you want to find if a given string contains another substring?

.. code-block:: python

   col_mrs = hdf_fenced.pandas['Name'].str.find(sub='Mrs.')
   hdf_fenced = hdf_fenced.assign(is_mrs=col_mrs > 0)


.. image:: /images/is_mrs.png
   :target: /images/is_mrs.png
   :alt: is mrs


There are many, many more available methods:


*String methods* :

#. contains
#. startswith / endswitch
#. match
#. isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace
#. islower / isupper / istitle
#. replace
#. repeat
#. join
#. pad
#. slice / slice_replace
#. strip / lstrip / rstrip
#. wrap / center / ljust / rjust
#. translate
#. get
#. normalize
#. lower / upper / capitalize / swapcase / title
#. zfill
#. count
#. find / rfind
#. len

*Date / Datetime methods* :

#. is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start
#. strftime
#. tz / time / tz_convert / tz_localize
#. day / dayofweek / dayofyear / days_in_month / daysinmonth
#. hour / microsecond / minute / nanosecond / second
#. week / weekday / weekday_name
#. month / quarter / year / weekofyear
#. date
#. ceil / floor / round
#. normalize

Your own functions
^^^^^^^^^^^^^^^^^^

The sky is the limit! You can create regular Python functions and use assign to create new columns :-)

No need to worry about turning them into *pandas UDFs* - everything is handled by *HandySpark* under the hood!

The arguments of your function (or ``lambda``\ ) should have the names of the columns you want to use. For instance, to take the ``log`` of ``Fare``\ :

.. code-block:: python

   import numpy as np
   hdf_fenced = hdf_fenced.assign(logFare=lambda Fare: np.log(Fare + 1))


.. image:: /images/logfare.png
   :target: /images/logfare.png
   :alt: logfare


You can also use multiple columns:

.. code-block:: python

   hdf_fenced = hdf_fenced.assign(fare_times_age=lambda Fare, Age: Fare * Age)

Even though the result is kinda pointless, it will work :-)

Keep in mind that the *return type* , that is, the column type of the new column, will be the same as the first column used (\ ``Fare``\ , in the example).

What if you want to return something of a *different* type?! No worries! You only need to *wrap* your function with the desired return type. An example should make this more clear:

.. code-block:: python

   from pyspark.sql.types import StringType

   hdf_fenced = hdf_fenced.assign(str_fare=StringType.ret(lambda Fare: Fare.map('${:,.2f}'.format)))

   hdf_fenced.cols['str_fare'][:5]

.. code-block::

   0    $65.66
   1    $53.10
   2    $26.55
   3    $65.66
   4    $65.66
   Name: str_fare, dtype: object

Basically, we imported the desired output type - *StringType* - and used its extended method *ret* to wrap our ``lambda`` function that formats our numeric ``Fare`` column into a string.

It is also possible to create a more complex type, like an array of doubles:

.. code-block:: python

   from pyspark.sql.types import ArrayType, DoubleType

   def make_list(Fare):
       return Fare.apply(lambda v: [v, v*2])

   hdf_fenced = hdf_fenced.assign(fare_list=ArrayType(DoubleType()).ret(make_list))

   hdf_fenced.cols['fare_list'][:5]

.. code-block::

   0           [7.25, 14.5]
   1    [71.2833, 142.5666]
   2         [7.925, 15.85]
   3          [53.1, 106.2]
   4           [8.05, 16.1]
   Name: fare_list, dtype: object

OK, so, what happened here?


#. First, we imported the necessary types, *ArrayType* and *DoubleType* , since we are building a function that returns a list of doubles.
#. We actually built the function - notice that we call *apply* straight from *Fare* , which is treated as a pandas Series under the hood.
#. We *wrap* the function with the return type ``ArrayType(DoubleType())`` by invoking the extended method ``ret``.
#. Finally, we assign it to a new column name, and that's it!

Nicer exceptions
^^^^^^^^^^^^^^^^

Now, suppose you make a mistake while creating your function... if you have used Spark for a while, you already realized that, when an exception is raised, it will be *loooong* , right?

To help you with that, *HandySpark* analyzes the error message and parses it nicely for you at the very *top* of the error message, in *bold red* :


.. image:: /images/handy_exception.png
   :target: /images/handy_exception.png
   :alt: exception


Safety first
^^^^^^^^^^^^

*HandySpark* wants to protect your cluster and network, so it implements a *safety* whenever you perform an operation that are going to retrieve *ALL* data from your *HandyFrame* , like ``collect`` or ``toPandas``.

How does that work? Every time a *HandyFrame* has one of these methods called, it will output up to the *safety limit* , which has a default of *1,000 elements*.


.. image:: /images/safety_on.png
   :target: /images/safety_on.png
   :alt: safety on


Do you want to set a different safety limit for your *HandyFrame* ?


.. image:: /images/safety_limit.png
   :target: /images/safety_limit.png
   :alt: safety limit


What if you want to retrieve everything nonetheless?! You can invoke the *safety_off* method prior to the actual method you want to call and you get a *one-time* unlimited result.


.. image:: /images/safety_off.png
   :target: /images/safety_off.png
   :alt: safety off


Don't feel like Handy anymore?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To get back your original Spark dataframe, you only need to call *notHandy* to make it not handy again:

.. code-block:: python

   hdf_fenced.notHandy()

.. code-block::

   DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string, logFare: double, is_c_or_q: boolean]

Comments, questions, suggestions, bugs
--------------------------------------

*DISCLAIMER* : this is a project *under development* , so it is likely you'll run into bugs/problems.

So, if you find any bugs/problems, please open an `issue <https://github.com/dvgodoy/handyspark/issues>`_ or submit a `pull request <https://github.com/dvgodoy/handyspark/pulls>`_.


================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
SPHINXPROJ    = HandySpark
SOURCEDIR     = source
BUILDDIR      = ../../handyspark-docs

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)



================================================
FILE: docs/source/conf.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# HandySpark documentation build configuration file, created by
# sphinx-quickstart on Sun Oct 28 17:42:51 2018.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
sys.setrecursionlimit(1500)


# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc',
    'sphinx.ext.intersphinx',
    'sphinx.ext.mathjax',
    'sphinx.ext.ifconfig',
    'sphinx.ext.viewcode',
    'sphinx.ext.githubpages',
    'sphinx.ext.napoleon']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = 'HandySpark'
copyright = '2018, Daniel Voigt Godoy'
author = 'Daniel Voigt Godoy'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.0.1'
# The full version, including alpha/beta/rc tags.
release = '0.0.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = []

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'

# Theme options are theme-specific and customize the look and feel of a theme
# further.  For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# This is required for the alabaster theme
# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
html_sidebars = {
    '**': [
        'relations.html',  # needs 'show_related': True theme option to display
        'searchbox.html',
    ]
}


# -- Options for HTMLHelp output ------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'HandySparkdoc'


# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'HandySpark.tex', 'HandySpark Documentation',
     'Daniel Voigt Godoy', 'manual'),
]


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'handyspark', 'HandySpark Documentation',
     [author], 1)
]


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'HandySpark', 'HandySpark Documentation',
     author, 'HandySpark', 'One line description of project.',
     'Miscellaneous'),
]



# -- Options for Epub output ----------------------------------------------

# Bibliographic Dublin Core info.
epub_title = project
epub_author = author
epub_publisher = author
epub_copyright = copyright

# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''

# A unique identification for the text.
#
# epub_uid = ''

# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']



# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'https://docs.python.org/': None}


================================================
FILE: docs/source/handyspark.extensions.rst
================================================
handyspark\.extensions package
==============================

Submodules
----------

handyspark\.extensions\.common module
-------------------------------------

.. automodule:: handyspark.extensions.common
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.extensions\.evaluation module
-----------------------------------------

.. automodule:: handyspark.extensions.evaluation
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.extensions\.types module
------------------------------------

.. automodule:: handyspark.extensions.types
    :members:
    :undoc-members:
    :show-inheritance:


Module contents
---------------

.. automodule:: handyspark.extensions
    :members:
    :undoc-members:
    :show-inheritance:


================================================
FILE: docs/source/handyspark.ml.rst
================================================
handyspark\.ml package
======================

Submodules
----------

handyspark\.ml\.base module
---------------------------

.. automodule:: handyspark.ml.base
    :members:
    :undoc-members:
    :show-inheritance:


Module contents
---------------

.. automodule:: handyspark.ml
    :members:
    :undoc-members:
    :show-inheritance:


================================================
FILE: docs/source/handyspark.rst
================================================
handyspark package
==================

Subpackages
-----------

.. toctree::

    handyspark.extensions
    handyspark.ml
    handyspark.sql

Submodules
----------

handyspark\.plot module
-----------------------

.. automodule:: handyspark.plot
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.stats module
------------------------

.. automodule:: handyspark.stats
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.util module
-----------------------

.. automodule:: handyspark.util
    :members:
    :undoc-members:
    :show-inheritance:


Module contents
---------------

.. automodule:: handyspark
    :members:
    :undoc-members:
    :show-inheritance:


================================================
FILE: docs/source/handyspark.sql.rst
================================================
handyspark\.sql package
=======================

Submodules
----------

handyspark\.sql\.dataframe module
---------------------------------

.. automodule:: handyspark.sql.dataframe
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.sql\.datetime module
--------------------------------

.. automodule:: handyspark.sql.datetime
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.sql\.pandas module
------------------------------

.. automodule:: handyspark.sql.pandas
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.sql\.schema module
------------------------------

.. automodule:: handyspark.sql.schema
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.sql\.string module
------------------------------

.. automodule:: handyspark.sql.string
    :members:
    :undoc-members:
    :show-inheritance:

handyspark\.sql\.transform module
---------------------------------

.. automodule:: handyspark.sql.transform
    :members:
    :undoc-members:
    :show-inheritance:


Module contents
---------------

.. automodule:: handyspark.sql
    :members:
    :undoc-members:
    :show-inheritance:


================================================
FILE: docs/source/includeme.rst
================================================
.. include:: ../../README.rst



================================================
FILE: docs/source/index.rst
================================================
.. HandySpark documentation master file, created by
   sphinx-quickstart on Sun Oct 28 17:42:51 2018.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to HandySpark's documentation!
======================================

.. toctree::
   :maxdepth: 2
   
   includeme



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


================================================
FILE: docs/source/modules.rst
================================================
handyspark
==========

.. toctree::
   :maxdepth: 4

   handyspark


================================================
FILE: handyspark/__init__.py
================================================
from handyspark.extensions.evaluation import BinaryClassificationMetrics
from handyspark.sql import HandyFrame, Bucket, Quantile, DataFrame

__all__ = [
    'HandyFrame', 'Bucket', 'Quantile', 'BinaryClassificationMetrics'
]

================================================
FILE: handyspark/extensions/__init__.py
================================================
from handyspark.extensions.common import JavaModelWrapper
from handyspark.extensions.evaluation import BinaryClassificationMetrics
from handyspark.extensions.types import AtomicType

__all__ = [
    'BinaryClassificationMetrics'
]


================================================
FILE: handyspark/extensions/common.py
================================================
from pyspark.mllib.common import _java2py, _py2java, JavaModelWrapper

def call2(self, name, *a):
    """Another call method for JavaModelWrapper.
    This method should be used whenever the JavaModel returns a Scala Tuple
    that needs to be deserialized before converted to Python.
    """
    serde = self._sc._jvm.org.apache.spark.mllib.api.python.SerDe
    args = [_py2java(self._sc, a) for a in a]
    java_res = getattr(self._java_model, name)(*args)
    java_res = serde.fromTuple2RDD(java_res)
    res = _java2py(self._sc, java_res)
    return res

JavaModelWrapper.call2 = call2


================================================
FILE: handyspark/extensions/evaluation.py
================================================
import pandas as pd
from operator import itemgetter
from handyspark.plot import roc_curve, pr_curve
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics
from pyspark.sql import SQLContext, DataFrame, functions as F
from pyspark.sql.types import StructField, StructType, DoubleType

def thresholds(self):
    """
    * Returns thresholds in descending order.
    """
    return self.call('thresholds')

def roc(self):
    """Calls the `roc` method from the Java class

    * Returns the receiver operating characteristic (ROC) curve,
    * which is an RDD of (false positive rate, true positive rate)
    * with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
    * @see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
    * Receiver operating characteristic (Wikipedia)</a>
    """
    return self.call2('roc')

def pr(self):
    """Calls the `pr` method from the Java class

    * Returns the precision-recall curve, which is an RDD of (recall, precision),
    * NOT (precision, recall), with (0.0, p) prepended to it, where p is the precision
    * associated with the lowest recall on the curve.
    * @see <a href="http://en.wikipedia.org/wiki/Precision_and_recall">
    * Precision and recall (Wikipedia)</a>
    """
    return self.call2('pr')

def fMeasureByThreshold(self, beta=1.0):
    """Calls the `fMeasureByThreshold` method from the Java class

    * Returns the (threshold, F-Measure) curve.
    * @param beta the beta factor in F-Measure computation.
    * @return an RDD of (threshold, F-Measure) pairs.
    * @see <a href="http://en.wikipedia.org/wiki/F1_score">F1 score (Wikipedia)</a>
    """
    return self.call2('fMeasureByThreshold', beta)

def precisionByThreshold(self):
    """Calls the `precisionByThreshold` method from the Java class

    * Returns the (threshold, precision) curve.
    """
    return self.call2('precisionByThreshold')

def recallByThreshold(self):
    """Calls the `recallByThreshold` method from the Java class

    * Returns the (threshold, recall) curve.
    """
    return self.call2('recallByThreshold')

def getMetricsByThreshold(self):
    """Returns DataFrame containing all metrics (FPR, Recall and
    Precision) for every threshold.

    Returns
    -------
    metrics: DataFrame
    """
    thresholds = self.call('thresholds').collect()
    roc = self.call2('roc').collect()[1:-1]
    pr = self.call2('pr').collect()[1:]
    metrics = list(zip(thresholds, map(itemgetter(0), roc), map(itemgetter(1), roc), map(itemgetter(1), pr)))
    metrics += [(0., 1., 1., 0.)]
    sql_ctx = SQLContext.getOrCreate(self._sc)
    df = sql_ctx.createDataFrame(metrics).toDF('threshold', 'fpr', 'recall', 'precision')
    return df

def confusionMatrix(self, threshold=0.5):
    """Returns confusion matrix: predicted classes are in columns,
    they are ordered by class label ascending, as in "labels".

    Predicted classes are computed according to informed threshold.

    Parameters
    ----------
    threshold: double, optional
        Threshold probability for the positive class.
        Default is 0.5.

    Returns
    -------
    confusionMatrix: DenseMatrix
    """
    scoreAndLabels = self.call2('scoreAndLabels').map(lambda t: (float(t[0] > threshold), t[1]))
    mcm = MulticlassMetrics(scoreAndLabels)
    return mcm.confusionMatrix()

def print_confusion_matrix(self, threshold=0.5):
    """Returns confusion matrix: predicted classes are in columns,
    they are ordered by class label ascending, as in "labels".

    Predicted classes are computed according to informed threshold.

    Parameters
    ----------
    threshold: double, optional
        Threshold probability for the positive class.
        Default is 0.5.

    Returns
    -------
    confusionMatrix: pd.DataFrame
    """
    cm = self.confusionMatrix(threshold).toArray()
    df = pd.concat([pd.DataFrame(cm)], keys=['Actual'], names=[])
    df.columns = pd.MultiIndex.from_product([['Predicted'], df.columns])
    return df

def plot_roc_curve(self, ax=None):
    """Makes a plot of Receiver Operating Characteristic (ROC) curve.

    Parameter
    ---------
    ax : matplotlib axes object, default None
    """
    metrics = self.getMetricsByThreshold().toPandas()
    return roc_curve(metrics.fpr, metrics.recall, self.areaUnderROC, ax)

def plot_pr_curve(self, ax=None):
    """Makes a plot of Precision-Recall (PR) curve.

    Parameter
    ---------
    ax : matplotlib axes object, default None
    """
    metrics = self.getMetricsByThreshold().toPandas()
    return pr_curve(metrics.precision, metrics.recall, self.areaUnderPR, ax)

def __init__(self, scoreAndLabels, scoreCol='score', labelCol='label'):
    if isinstance(scoreAndLabels, DataFrame):
        scoreAndLabels = (scoreAndLabels
                          .select(scoreCol, labelCol)
                          .rdd.map(lambda row:(float(row[scoreCol][1]), float(row[labelCol]))))

    sc = scoreAndLabels.ctx
    sql_ctx = SQLContext.getOrCreate(sc)
    df = sql_ctx.createDataFrame(scoreAndLabels, schema=StructType([
        StructField("score", DoubleType(), nullable=False),
        StructField("label", DoubleType(), nullable=False)]))

    java_class = sc._jvm.org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    java_model = java_class(df._jdf)
    super(BinaryClassificationMetrics, self).__init__(java_model)

BinaryClassificationMetrics.__init__ = __init__
BinaryClassificationMetrics.thresholds = thresholds
BinaryClassificationMetrics.roc = roc
BinaryClassificationMetrics.pr = pr
BinaryClassificationMetrics.fMeasureByThreshold = fMeasureByThreshold
BinaryClassificationMetrics.precisionByThreshold = precisionByThreshold
BinaryClassificationMetrics.recallByThreshold = recallByThreshold
BinaryClassificationMetrics.getMetricsByThreshold = getMetricsByThreshold
BinaryClassificationMetrics.confusionMatrix = confusionMatrix
BinaryClassificationMetrics.plot_roc_curve = plot_roc_curve
BinaryClassificationMetrics.plot_pr_curve = plot_pr_curve
BinaryClassificationMetrics.print_confusion_matrix = print_confusion_matrix

================================================
FILE: handyspark/extensions/types.py
================================================
from pyspark.sql.types import AtomicType, ArrayType, MapType

@classmethod
def ret(cls, expr):
    """Assigns a return type to the expression when used inside an `assign` method.
    """
    return expr, cls.typeName()

AtomicType.ret = ret

def ret(self, expr):
    """Assigns a return type to the expression when used inside an `assign` method.
    """
    return expr, self.simpleString()

ArrayType.ret = ret
MapType.ret = ret


================================================
FILE: handyspark/ml/__init__.py
================================================
from handyspark.ml.base import HandyFencer, HandyImputer

__all__ = [
    'HandyFencer', 'HandyImputer'
]

================================================
FILE: handyspark/ml/base.py
================================================
import json
from pyspark.ml.base import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param import *
from pyspark.sql import functions as F

class HandyTransformers(object):
    """Generates transformers to be used in pipelines.

    Available transformers:
    imputer: Transformer
        Imputation transformer for completing missing values.
    fencer: Transformer
        Fencer transformer for capping outliers according to lower and upper fences.
    """
    def __init__(self, df):
        self._df = df
        self._handy = df._handy

    def imputer(self):
        """
        Generates a transformer to impute missing values, using values
        from the HandyFrame
        """
        return HandyImputer().setDictValues(self._df.statistics_)

    def fencer(self):
        """
        Generates a transformer to fence outliers, using statistics
        from the HandyFrame
        """
        return HandyFencer().setDictValues(self._df.fences_)


class HasDict(Params):
    """Mixin for a Dictionary parameter.
    It dumps the dictionary into a JSON string for storage and
    reloads it whenever needed.
    """
    dictValues = Param(Params._dummy(), "dictValues", "Dictionary values", typeConverter=TypeConverters.toString)

    def __init__(self):
        super(HasDict, self).__init__()
        self._setDefault(dictValues='{}')

    def setDictValues(self, value):
        """
        Sets the value of :py:attr:`dictValues`.
        """
        if isinstance(value, dict):
            value = json.dumps(value).replace('\'', '"')
        return self._set(dictValues=value)

    def getDictValues(self):
        """
        Gets the value of dictValues or its default value.
        """
        values = self.getOrDefault(self.dictValues)
        return json.loads(values)


class HandyImputer(Transformer, HasDict, DefaultParamsReadable, DefaultParamsWritable):
    """Imputation transformer for completing missing values.

    Attributes
    ----------
    statistics : dict
        The imputation fill value for each feature. If stratified, first level keys are
        filter clauses for stratification.
    """
    def _transform(self, dataset):
        # Loads dictionary with values for imputation
        fillingValues = self.getDictValues()

        items = fillingValues.items()
        target = dataset
        # Loops over columns...
        for colname, v in items:
            # If value is another dictionary, it means we're dealing with
            # stratified imputation - the key is the filering clause
            # and its value is going to be used for imputation
            if isinstance(v, dict):
                clauses = v.keys()
                whens = ' '.join(['WHEN (({clause}) AND (isnan({col}) OR isnull({col}))) THEN {quote}{filling}{quote}'
                                 .format(clause=clause, col=colname, filling=v[clause],
                                         quote='"' if isinstance(v[clause], str) else '')
                                   for clause in clauses])
            # Otherwise uses the non-stratified dictionary to fill the values
            else:
                whens = ('WHEN (isnan({col}) OR isnull({col})) THEN {quote}{filling}{quote}'
                         .format(col=colname, filling=v,
                                 quote='"' if isinstance(v, str) else ''))

            expression = F.expr('CASE {expr} ELSE {col} END'.format(expr=whens, col=colname))
            target = target.withColumn(colname, expression)

        # If it is a HandyFrame, make it a regular DataFrame
        try:
            target = target.notHandy()
        except AttributeError:
            pass
        return target

    @property
    def statistics(self):
        return self.getDictValues()


class HandyFencer(Transformer, HasDict, DefaultParamsReadable, DefaultParamsWritable):
    """Fencer transformer for capping outliers according to lower and upper fences.

    Attributes
    ----------
    fences : dict
        The fence values for each feature. If stratified, first level keys are
        filter clauses for stratification.
    """
    def _transform(self, dataset):
        # Loads dictionary with values for fencing
        fences = self.getDictValues()

        items = fences.items()
        target = dataset
        for colname, v in items:
            # If value is another dictionary, it means we're dealing with
            # stratified imputation - the key is the filering clause
            # and its value is going to be used for imputation
            if isinstance(v, dict):
                clauses = v.keys()
                whens1 = ' '.join(['WHEN ({clause}) THEN greatest({col}, {fence})'.format(clause=clause,
                                                                                          col=colname,
                                                                                          fence=v[clause][0])
                                   for clause in clauses])
                whens2 = ' '.join(['WHEN ({clause}) THEN least({col}, {fence})'.format(clause=clause,
                                                                                       col=colname,
                                                                                       fence=v[clause][1])
                                   for clause in clauses])
                expression1 = F.expr('CASE {} END'.format(whens1))
                expression2 = F.expr('CASE {} END'.format(whens2))
            # Otherwise uses the non-stratified dictionary to fill the values
            else:
                expression1 = F.expr('greatest({col}, {fence})'.format(col=colname, fence=v[0]))
                expression2 = F.expr('least({col}, {fence})'.format(col=colname, fence=v[1]))

            target = target.withColumn(colname, expression1).withColumn(colname, expression2)

        # If it is a HandyFrame, make it a regular DataFrame
        try:
            target = target.notHandy()
        except AttributeError:
            pass
        return target

    @property
    def fences(self):
        return self.getDictValues()

================================================
FILE: handyspark/plot.py
================================================
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from inspect import signature
from handyspark.util import get_buckets, none2zero, ensure_list
from operator import add, itemgetter
from pyspark.ml.feature import Bucketizer
from pyspark.ml.pipeline import Pipeline
from pyspark.sql import functions as F
from matplotlib.artist import setp
import matplotlib as mpl
mpl.rc("lines", markeredgewidth=0.5)

def title_fom_clause(clause):
    return clause.replace(' and ', '\n').replace(' == ', '=').replace('"', '')

def consolidate_plots(fig, axs, title, clauses):
    axs[0].set_title(title)
    fig.tight_layout()
    if len(axs) > 1:
        assert len(axs) == len(clauses), 'Mismatched number of plots and clauses!'
        xlim = list(map(lambda ax: ax.get_xlim(), axs))
        xlim = [np.min(list(map(itemgetter(0), xlim))), np.max(list(map(itemgetter(1), xlim)))]
        ylim = list(map(lambda ax: ax.get_ylim(), axs))
        ylim = [np.min(list(map(itemgetter(0), ylim))), np.max(list(map(itemgetter(1), ylim)))]
        for i, ax in enumerate(axs):
            subtitle = title_fom_clause(clauses[i])
            ax.set_title(subtitle, fontdict={'fontsize': 10})
            ax.set_xlim(xlim)
            ax.set_ylim(ylim)
            #if ax.colNum > 0:
            #    ax.get_yaxis().set_visible(False)
            #if ax.rowNum < (ax.numRows - 1):
            #    ax.get_xaxis().set_visible(False)
        if isinstance(title, list):
            title = ', '.join(title)
        fig.suptitle(title)
        fig.tight_layout()
        fig.subplots_adjust(top=0.85)
    return fig, axs

### Correlations
def plot_correlations(pdf, ax=None):
    if ax is None:
        fig, ax = plt.subplots(1, 1)
    return sns.heatmap(round(pdf,2), annot=True, cmap="coolwarm", fmt='.2f', linewidths=.05, ax=ax)

### Scatterplot
def strat_scatterplot(sdf, col1, col2, n=30):
    stages = []
    for col in [col1, col2]:
        splits = np.linspace(*sdf.agg(F.min(col), F.max(col)).rdd.map(tuple).collect()[0], n + 1)
        bucket_name = '__{}_bucket'.format(col)
        stages.append(Bucketizer(splits=splits,
                                 inputCol=col,
                                 outputCol=bucket_name,
                                 handleInvalid="skip"))

    pipeline = Pipeline(stages=stages)
    model = pipeline.fit(sdf)
    return model, sdf.count()

def scatterplot(sdf, col1, col2, n=30, ax=None):
    strat_ax, data = sdf._get_strata()
    if data is None:
        data = strat_scatterplot(sdf, col1, col2, n)
    else:
        ax = strat_ax
    model, total = data

    if ax is None:
        fig, ax = plt.subplots(1, 1)

    axes = ensure_list(ax)
    clauses = sdf._handy._strata_raw_clauses
    if not len(clauses):
        clauses = [None]

    bucket_name1, bucket_name2 = '__{}_bucket'.format(col1), '__{}_bucket'.format(col2)
    strata = sdf._handy.strata_colnames
    colnames = strata + [bucket_name1, bucket_name2]
    result = model.transform(sdf).select(colnames).groupby(colnames).agg(F.count('*').alias('count')).toPandas().sort_values(by=colnames)

    splits = [bucket.getSplits() for bucket in model.stages]
    splits = [list(map(np.mean, zip(split[1:], split[:-1]))) for split in splits]
    splits1 = pd.DataFrame({bucket_name1: np.arange(0, n), col1: splits[0]})
    splits2 = pd.DataFrame({bucket_name2: np.arange(0, n), col2: splits[1]})

    df_counts = result.merge(splits1).merge(splits2)[strata + [col1, col2, 'count']].rename(columns={'count': 'Proportion'})

    df_counts.loc[:, 'Proportion'] = df_counts.Proportion.apply(lambda p: round(p / total, 4))

    for ax, clause in zip(axes, clauses):
        data = df_counts
        if clause is not None:
            data = data.query(clause)
        sns.scatterplot(data=data,
                        x=col1,
                        y=col2,
                        size='Proportion',
                        ax=ax,
                        legend=False)

    if len(axes) == 1:
        axes = axes[0]

    return axes

### Histogram
def strat_histogram(sdf, colname, bins=10, categorical=False):
    if categorical:
        result = sdf.cols[colname]._value_counts(dropna=False, raw=True)

        if hasattr(result.index, 'levels'):
            indexes = pd.MultiIndex.from_product(result.index.levels[:-1] +
                                                 [result.reset_index()[colname].unique().tolist()],
                                                 names=result.index.names)
            result = (pd.DataFrame(index=indexes)
                      .join(result.to_frame(), how='left')
                      .fillna(0)[result.name]
                      .astype(result.dtype))

        start_values = result.index.tolist()
    else:
        bucket_name = '__{}_bucket'.format(colname)
        strata = sdf._handy.strata_colnames
        colnames = strata + ensure_list(bucket_name)

        start_values = np.linspace(*sdf.agg(F.min(colname), F.max(colname)).rdd.map(tuple).collect()[0], bins + 1)
        bucketizer = Bucketizer(splits=start_values, inputCol=colname, outputCol=bucket_name, handleInvalid="skip")
        result = (bucketizer
                  .transform(sdf)
                  .select(colnames)
                  .groupby(colnames)
                  .agg(F.count('*').alias('count'))
                  .toPandas()
                  .sort_values(by=colnames))

        indexes = pd.DataFrame({bucket_name: np.arange(0, bins), 'bucket': start_values[:-1]})
        if len(strata):
            indexes = (indexes
                       .assign(key=1)
                       .merge(result[strata].drop_duplicates().assign(key=1), on='key')
                       .drop(columns=['key']))
        result = indexes.merge(result, how='left', on=strata + [bucket_name]).fillna(0)[strata + [bucket_name, 'count']]

    return start_values, result

def histogram(sdf, colname, bins=10, categorical=False, ax=None):
    strat_ax, data = sdf._get_strata()
    if data is None:
        data = strat_histogram(sdf, colname, bins, categorical)
    else:
        ax = strat_ax
    start_values, counts = data

    if ax is None:
        fig, ax = plt.subplots(1, 1)

    axes = ensure_list(ax)
    clauses = sdf._handy._strata_raw_clauses
    if not len(clauses):
        clauses = [None]

    for ax, clause in zip(axes, clauses):
        if categorical:
            pdf = counts.sort_index().to_frame()
            if clause is not None:
                pdf = pdf.query(clause).reset_index(sdf._handy.strata_colnames).drop(columns=sdf._handy.strata_colnames)
            pdf.iloc[:bins].plot(kind='bar', color='C0', legend=False, rot=0, ax=ax, title=colname)
        else:
            mid_point_bins = start_values[:-1]
            weights = counts
            if clause is not None:
                weights = counts.query(clause)
            ax.hist(mid_point_bins, bins=start_values, weights=weights['count'].values)
            ax.set_title(colname)

    if len(axes) == 1:
        axes = axes[0]

    return axes

### Boxplot
def _gen_dict(rc_name, properties):
    """ Loads properties in the dictionary from rc file if not already
    in the dictionary"""
    rc_str = 'boxplot.{0}.{1}'
    dictionary = dict()
    for prop_dict in properties:
        dictionary.setdefault(prop_dict,
                        plt.rcParams[rc_str.format(rc_name, prop_dict)])
    return dictionary

def draw_boxplot(ax, stats):
    flier_props = ['color', 'marker', 'markerfacecolor', 'markeredgecolor',
                   'markersize', 'linestyle', 'linewidth']
    default_props = ['color', 'linewidth', 'linestyle']
    boxprops = _gen_dict('boxprops', default_props)
    whiskerprops = _gen_dict('whiskerprops', default_props)
    capprops = _gen_dict('capprops', default_props)
    medianprops = _gen_dict('medianprops', default_props)
    meanprops = _gen_dict('meanprops', default_props)
    flierprops = _gen_dict('flierprops', flier_props)

    props = dict(boxprops=boxprops,
                 flierprops=flierprops,
                 medianprops=medianprops,
                 meanprops=meanprops,
                 capprops=capprops,
                 whiskerprops=whiskerprops)

    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b',
              '#e377c2', '#7f7f7f', '#bcbd22', '#17becf', '#1f77b4']
    bp = ax.bxp(stats, **props)
    ax.grid(True)
    setp(bp['boxes'], color=colors[0], alpha=1)
    setp(bp['whiskers'], color=colors[0], alpha=1)
    setp(bp['medians'], color=colors[2], alpha=1)
    return ax

def boxplot(sdf, colnames, ax=None, showfliers=True, k=1.5, precision=.0001):
    strat_ax, data = sdf._get_strata()
    if data is None:
        if ax is None:
            fig, ax = plt.subplots(1, 1)

    title_clauses = sdf._handy._strata_clauses
    if not len(title_clauses):
        title_clauses = [None]

    pdf = sdf._handy._calc_fences(colnames, k, precision)
    stats = []
    for colname in colnames:
        items, _, _ = sdf._handy._calc_bxp_stats(pdf, colname, showfliers=showfliers)
        for title_clause, item in zip(title_clauses, items):
            name = colname if len(colnames) > 1 else (title_fom_clause(title_clause) if title_clause is not None else colname)
            item.update({'label': name})

        # each list of items corresponds to a different column
        stats.append(items)

    # Stats is a list of columns, containing each a list of clauses
    if ax is not None:
        if title_clauses[0] is None:
            if len(colnames) == 1:
                stats = stats[0]
            else:
                stats = np.squeeze(stats).tolist()
        return draw_boxplot(ax, stats)
    else:
        if len(strat_ax) > 1:
            stats = [[stats[j][i] for j in range(len(stats))] for i in range(len(title_clauses))]
        return stats

def post_boxplot(axs, stats):
    new_res = []
    for ax, stat in zip(axs, stats):
        ax = draw_boxplot(ax, stat)
        new_res.append(ax)
    return new_res

def roc_curve(fpr, tpr, roc_auc, ax=None):
    if ax is None:
        fig, ax = plt.subplots(1, 1)

    ax.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.4f)' % roc_auc)
    ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('Receiver Operating Characteristic Curve')
    ax.legend(loc="lower right")
    return ax

def pr_curve(precision, recall, pr_auc, ax=None):
    if ax is None:
        fig, ax = plt.subplots(1, 1)

    # In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
    step_kwargs = ({'step': 'post'}
                   if 'step' in signature(plt.fill_between).parameters
                   else {})
    ax.step(recall, precision, color='b', alpha=0.2, where='post', label='PR curve (area = %0.4f)' % pr_auc)
    ax.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_ylim([0.0, 1.05])
    ax.set_xlim([0.0, 1.0])
    ax.legend(loc="lower left")
    ax.set_title('Precision-Recall Curve')
    return ax

================================================
FILE: handyspark/sql/__init__.py
================================================
from handyspark.sql.dataframe import HandyFrame, Bucket, Quantile, DataFrame
from handyspark.sql.schema import generate_schema

__all__ = [
    'HandyFrame', 'Bucket', 'Quantile', 'generate_schema'
]

================================================
FILE: handyspark/sql/dataframe.py
================================================
from copy import deepcopy
from handyspark.ml.base import HandyTransformers
from handyspark.plot import histogram, boxplot, scatterplot, strat_scatterplot, strat_histogram,\
    consolidate_plots, post_boxplot
from handyspark.sql.pandas import HandyPandas
from handyspark.sql.transform import _MAPPING, HandyTransform
from handyspark.util import HandyException, dense_to_array, disassemble, ensure_list, check_columns, \
    none2default
import inspect
from matplotlib.axes import Axes
from collections import OrderedDict
import matplotlib.pyplot as plt
import numpy as np
from operator import itemgetter, add
import pandas as pd
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import Bucketizer
from pyspark.mllib.stat import Statistics
from pyspark.sql import DataFrame, GroupedData, Window, functions as F, Column, Row
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml.pipeline import Pipeline
from scipy.stats import chi2
from scipy.linalg import inv

def toHandy(self):
    """Converts Spark DataFrame into HandyFrame.
    """
    return HandyFrame(self)

def notHandy(self):
    return self

DataFrame.toHandy = toHandy
DataFrame.notHandy = notHandy

def agg(f):
    f.__is_agg = True
    return f

def inccol(f):
    f.__is_inccol = True
    return f

class Handy(object):
    def __init__(self, df):
        self._df = df

        # classification
        self._is_classification = False
        self._nclasses = None
        self._classes = None

        # transformers
        self._imputed_values = {}
        self._fenced_values = {}

        # groups / strata
        self._group_cols = None
        self._strata = None
        self._strata_object = None
        self._strata_plot = None

        self._clear_stratification()
        self._safety_limit = 1000
        self._safety = True

        self._update_types()

    def __deepcopy__(self, memo):
        cls = self.__class__
        result = cls.__new__(cls)
        memo[id(self)] = result
        for k, v in self.__dict__.items():
            if k not in ['_df', '_strata_object', '_strata_plot']:
                setattr(result, k, deepcopy(v, memo))
        return result

    def __getitem__(self, *args):
        if isinstance(args[0], tuple):
            args = args[0]
        item = args[0]
        n = 20
        if len(args) > 1:
            n = args[1]
            if n is None:
                n = -1

        if isinstance(item, int):
            idx = item + (len(self._group_cols) if self._group_cols is not None else 0)
            assert idx < len(self._df.columns), "Invalid column index {}".format(idx)
            item = list(self._df.columns)[idx]

        if isinstance(item, str):
            if self._group_cols is None or len(self._group_cols) == 0:
                res = self._take_array(item, n)
                if res.ndim > 1:
                    res = res.tolist()
                res = pd.Series(res, name=item)
                if self._strata is not None:
                    strata = list(map(lambda v: v[1].to_dict(), self.strata.iterrows()))
                    if len(strata) == len(res):
                        res = pd.concat([pd.DataFrame(strata), res], axis=1).set_index(self._strata).sort_index()
                return res
            else:
                check_columns(self._df, list(self._group_cols) + [item])
                pdf = self._df.notHandy().select(list(self._group_cols) + [item])
                if n != -1:
                    pdf = pdf.limit(n)
                res = pdf.toPandas().set_index(list(self._group_cols)).sort_index()[item]
                return res

    @property
    def stages(self):
        return (len(list(filter(lambda v: '+' == v,
                                map(lambda s: s.strip()[0],
                                    self._df.rdd.toDebugString().decode().split('\n'))))) + 1)

    @property
    def statistics_(self):
        return self._imputed_values

    @property
    def fences_(self):
        return self._fenced_values

    @property
    def is_classification(self):
        return self._is_classification

    @property
    def classes(self):
        return self._classes

    @property
    def nclasses(self):
        return self._nclasses

    @property
    def response(self):
        return self._response

    @property
    def ncols(self):
        return len(self._types)

    @property
    def nrows(self):
        return self._df.count()

    @property
    def shape(self):
        return (self.nrows, self.ncols)

    @property
    def strata(self):
        if self._strata is not None:
            return pd.DataFrame(data=self._strata_combinations, columns=self._strata)

    @property
    def strata_colnames(self):
        if self._strata is not None:
            return list(map(str, ensure_list(self._strata)))
        else:
            return []

    def _stratify(self, strata):
        return HandyStrata(self, strata)

    def _clear_stratification(self):
        self._strata = None
        self._strata_object = None
        self._strata_plot = None
        self._strata_combinations = []
        self._strata_raw_combinations = []
        self._strata_clauses = []
        self._strata_raw_clauses = []
        self._n_cols = 1
        self._n_rows = 1

    def _set_stratification(self, strata, raw_combinations, raw_clauses, combinations, clauses):
        if strata is not None:
            assert len(combinations[0]) == len(strata), "Mismatched number of combinations and strata!"
            self._strata = strata
            self._strata_raw_combinations = raw_combinations
            self._strata_raw_clauses = raw_clauses
            self._strata_combinations = combinations
            self._strata_clauses = clauses
            self._n_cols = len(set(map(itemgetter(0), combinations)))
            try:
                self._n_rows = len(set(map(itemgetter(1), combinations)))
            except IndexError:
                self._n_rows = 1

    def _build_strat_plot(self, n_rows, n_cols, **kwargs):
        fig, axs = plt.subplots(n_rows, n_cols, **kwargs)
        if n_rows == 1:
            axs = [axs]
            if n_cols == 1:
                axs = [axs]
        self._strata_plot = (fig, [ax for col in np.transpose(axs) for ax in col])

    def _update_types(self):
        self._types = list(map(lambda t: (t.name, t.dataType.typeName()), self._df.schema.fields))

        self._numerical = list(map(itemgetter(0), filter(lambda t: t[1] in ['byte', 'short', 'integer', 'long',
                                                                            'float', 'double'], self._types)))
        self._continuous = list(map(itemgetter(0), filter(lambda t: t[1] in ['double', 'float'], self._types)))
        self._categorical = list(map(itemgetter(0), filter(lambda t: t[1] in ['byte', 'short', 'integer', 'long',
                                                                              'boolan', 'string'], self._types)))
        self._array = list(map(itemgetter(0), filter(lambda t: t[1] in ['array', 'map'], self._types)))
        self._string = list(map(itemgetter(0), filter(lambda t: t[1] in ['string'], self._types)))

    def _take_array(self, colname, n):
        check_columns(self._df, colname)
        datatype = self._df.notHandy().select(colname).schema.fields[0].dataType.typeName()
        rdd = self._df.notHandy().select(colname).rdd.map(itemgetter(0))

        if n == -1:
            data = rdd.collect()
        else:
            data = rdd.take(n)

        return np.array(data, dtype=_MAPPING.get(datatype, 'object'))

    def _value_counts(self, colnames, dropna=True, raw=False):
        colnames = ensure_list(colnames)
        strata = self.strata_colnames
        colnames = strata + colnames

        check_columns(self._df, colnames)
        data = self._df.notHandy().select(colnames)
        if dropna:
            data = data.dropna()

        values = (data.groupby(colnames).agg(F.count('*').alias('value_counts'))
                  .toPandas().set_index(colnames).sort_index()['value_counts'])

        if not raw:
            for level, col in enumerate(ensure_list(self._strata)):
                if not isinstance(col, str):
                    values.index.set_levels(pd.Index(col._clauses[1:-1]), level=level, inplace=True)
                    values.index.set_names(col.colname, level=level, inplace=True)

        return values

    def _fillna(self, target, values):
        assert isinstance(target, DataFrame), "Target must be a DataFrame"

        items = values.items()
        for colname, v in items:
            if isinstance(v, dict):
                clauses = v.keys()
                whens = ' '.join(['WHEN (({clause}) AND (isnan({col}) OR isnull({col}))) THEN {quote}{filling}{quote}'
                                 .format(clause=clause, col=colname, filling=v[clause],
                                         quote='"' if isinstance(v[clause], str) else '')
                                   for clause in clauses])
            else:
                whens = ('WHEN (isnan({col}) OR isnull({col})) THEN {quote}{filling}{quote}'
                         .format(col=colname, filling=v,
                                 quote='"' if isinstance(v, str) else ''))

            expression = F.expr('CASE {expr} ELSE {col} END'.format(expr=whens, col=colname))
            target = target.withColumn(colname, expression)

        return target

    def __stat_to_dict(self, colname, stat):
        if len(self._strata_clauses):
            if isinstance(stat, pd.Series):
                stat = stat.to_frame(colname)
            return {clause: stat.query(raw_clause)[colname].iloc[0]
                    for clause, raw_clause in zip(self._strata_clauses, self._strata_raw_clauses)}
        else:
            return stat[colname]

    def _fill_values(self, continuous, categorical, strategy):
        values = {}
        colnames = list(map(itemgetter(0), filter(lambda t: t[1] == 'mean', zip(continuous, strategy))))
        values.update(dict([(col, self.__stat_to_dict(col, self.mean(col))) for col in colnames]))

        colnames = list(map(itemgetter(0), filter(lambda t: t[1] == 'median', zip(continuous, strategy))))
        values.update(dict([(col, self.__stat_to_dict(col, self.median(col))) for col in colnames]))

        values.update(dict([(col, self.__stat_to_dict(col, self.mode(col)))
                            for col in categorical if col in self._categorical]))
        return values

    def __fill_self(self, continuous, categorical, strategy):
        continuous = ensure_list(continuous)
        categorical = ensure_list(categorical)
        check_columns(self._df, continuous + categorical)

        strategy = none2default(strategy, 'mean')

        if continuous == ['all']:
            continuous = self._continuous
        if categorical == ['all']:
            categorical = self._categorical

        if isinstance(strategy, (list, tuple)):
            assert len(continuous) == len(strategy), "There must be a strategy to each column."
        else:
            strategy = [strategy] * len(continuous)

        values = self._fill_values(continuous, categorical, strategy)
        self._imputed_values.update(values)
        res = HandyFrame(self._fillna(self._df, values), self)
        return res

    def _dense_to_array(self, colname, array_colname):
        check_columns(self._df, colname)
        res = dense_to_array(self._df.notHandy(), colname, array_colname)
        return HandyFrame(res, self)

    def _agg(self, name, func, colnames):
        colnames = none2default(colnames, self._df.columns)
        colnames = ensure_list(colnames)
        check_columns(self._df, self.strata_colnames + [col for col in colnames if not isinstance(col, Column)])
        if func is None:
            func = getattr(F, name)

        res = (self._df.notHandy()
               .groupby(self.strata_colnames)
               .agg(*(func(col).alias(str(col)) for col in colnames if str(col) not in self.strata_colnames))
               .toPandas())

        if len(res) == 1:
            res = res.iloc[0]
            res.name = name
        return res

    def _calc_fences(self, colnames, k=1.5, precision=.01):
        colnames = none2default(colnames, self._numerical)
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]
        strata = self.strata_colnames

        pdf = (self._df.notHandy()
               .groupby(strata)
               .agg(F.count(F.lit(1)).alias('nrows'),
                    *[F.expr('approx_percentile({}, {}, {})'.format(c, q, 1./precision)).alias('{}_{}%'.format(c, int(q * 100)))
                      for q in [.25, .50, .75] for c in colnames],
                    *[F.mean(c).alias('{}_mean'.format(c)) for c in colnames]).toPandas())

        for col in colnames:
            pdf.loc[:, '{}_iqr'.format(col)] = pdf.loc[:, '{}_75%'.format(col)] - pdf.loc[:, '{}_25%'.format(col)]
            pdf.loc[:, '{}_lfence'.format(col)] = pdf.loc[:, '{}_25%'.format(col)] - k * pdf.loc[:, '{}_iqr'.format(col)]
            pdf.loc[:, '{}_ufence'.format(col)] = pdf.loc[:, '{}_75%'.format(col)] + k * pdf.loc[:, '{}_iqr'.format(col)]

        return pdf

    def _calc_mahalanobis_distance(self, colnames, output_col='__mahalanobis'):
        """Computes Mahalanobis distance from origin
        """
        sdf = self._df.notHandy()
        check_columns(sdf, colnames)
        # Builds pipeline to assemble feature columns and scale them
        assembler = VectorAssembler(inputCols=colnames, outputCol='__features')
        scaler = StandardScaler(inputCol='__features', outputCol='__scaled', withMean=True)
        pipeline = Pipeline(stages=[assembler, scaler])
        features = pipeline.fit(sdf).transform(sdf)

        # Computes correlation between features and inverts it
        # Since we scaled the features, we can assume they have unit variance
        # and therefore, correlation and covariance matrices are the same!
        mat = Correlation.corr(features, '__scaled').head()[0].toArray()
        inv_mat = inv(mat)

        # Builds Pandas UDF to compute Mahalanobis distance from origin
        # sqrt((V - 0) * inv_M * (V - 0))
        try:
            import pyarrow
            @F.pandas_udf('double')
            def pudf_mult(v):
                return v.apply(lambda v: np.sqrt(np.dot(np.dot(v, inv_mat), v)))
        except:
            @F.udf('double')
            def pudf_mult(v):
                return v.apply(lambda v: np.sqrt(np.dot(np.dot(v, inv_mat), v)))

        # Convert feature vector into array
        features = dense_to_array(features, '__scaled', '__array_scaled')
        # Computes Mahalanobis distance and flags as outliers all elements above critical value
        distance = (features
                    .withColumn('__mahalanobis', pudf_mult('__array_scaled'))
                    .drop('__features', '__scaled', '__array_scaled'))
        return distance

    def _set_mahalanobis_outliers(self, colnames, critical_value=.999,
                                  input_col='__mahalanobis', output_col='__outlier'):
        """Compares Mahalanobis distances to critical values using
         Chi-Squared distribution to identify possible outliers.
        """
        distance = self._calc_mahalanobis_distance(colnames)
        # Computes critical value
        critical_value = chi2.ppf(critical_value, len(colnames))
        # Computes Mahalanobis distance and flags as outliers all elements above critical value
        outlier = (distance.withColumn(output_col, F.col(input_col) > critical_value))
        return outlier

    def _calc_bxp_stats(self, fences_df, colname, showfliers=False):
        strata = self.strata_colnames
        clauses = self._strata_raw_clauses
        if not len(clauses):
            clauses = [None]

        qnames = ['25%', '50%', '75%', 'mean', 'lfence', 'ufence']
        col_summ = fences_df[strata + ['{}_{}'.format(colname, q) for q in qnames] + ['nrows']]
        col_summ.columns = strata + qnames + ['nrows']
        if len(strata):
            col_summ = col_summ.set_index(strata)
        lfence, ufence = col_summ[['lfence']], col_summ[['ufence']]

        expression = None
        for clause in clauses:
            if clause is not None:
                partial = F.col(colname).between(lfence.query(clause).iloc[0, 0], ufence.query(clause).iloc[0, 0])
                partial &= F.expr(clause)
            else:
                partial = F.col(colname).between(lfence.iloc[0, 0], ufence.iloc[0, 0])

            if expression is None:
                expression = partial
            else:
                expression |= partial

        outlier = self._df.notHandy().withColumn('__{}_outlier'.format(colname), ~expression)
        minmax = (outlier
                  .filter('not __{}_outlier'.format(colname))
                  .groupby(strata)
                  .agg(F.min(colname).alias('min'),
                       F.max(colname).alias('max'))
                  .toPandas())

        if len(strata):
            minmax = [minmax.query(clause).iloc[0][['min', 'max']].values for clause in clauses]
        else:
            minmax = [minmax.iloc[0][['min', 'max']].values]

        fliers_df = outlier.filter('__{}_outlier'.format(colname))
        fliers_df = [fliers_df.filter(clause) for clause in clauses] if len(strata) else [fliers_df]
        fliers_count = [df.count() for df in fliers_df]

        if showfliers:
            fliers = [(df
                       .select(F.abs(F.col(colname)).alias(colname))
                       .orderBy(F.desc(colname))
                       .limit(1000)
                       .toPandas()[colname].values) for df in fliers_df]
        else:
            fliers = [[]] * len(clauses)

        stats = []  # each item corresponds to a different clause - all items belong to the same column
        nrows = []
        for clause, whiskers, outliers in zip(clauses, minmax, fliers):
            summary = col_summ
            if clause is not None:
                summary = summary.query(clause)
            item = {'mean': summary['mean'].values[0],
                    'med': summary['50%'].values[0],
                    'q1': summary['25%'].values[0],
                    'q3': summary['75%'].values[0],
                    'whislo': whiskers[0],
                    'whishi': whiskers[1],
                    'fliers': outliers}
            stats.append(item)
            nrows.append(summary['nrows'].values[0])

        if not len(nrows):
            nrows = summary['nrows'].values[0]

        return stats, fliers_count, nrows

    def set_response(self, colname):
        check_columns(self._df, colname)
        self._response = colname
        if colname is not None:
            if colname not in self._continuous:
                self._is_classification = True
                self._classes = self._df.notHandy().select(colname).rdd.map(itemgetter(0)).distinct().collect()
                self._nclasses = len(self._classes)

        return self

    def disassemble(self, colname, new_colnames=None):
        check_columns(self._df, colname)
        res = disassemble(self._df.notHandy(), colname, new_colnames)
        return HandyFrame(res, self)

    def to_metrics_RDD(self, prob_col, label):
        check_columns(self._df, [prob_col, label])
        return self.disassemble(prob_col).select('{}_1'.format(prob_col), F.col(label).cast('double')).rdd.map(tuple)

    def corr(self, colnames=None, method='pearson'):
        colnames = none2default(colnames, self._numerical)
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]
        if self._strata is not None:
            colnames = sorted([col for col in colnames if col not in self.strata_colnames])

        correlations = Statistics.corr(self._df.notHandy().select(colnames).dropna().rdd.map(lambda row: row[0:]), method=method)
        pdf = pd.DataFrame(correlations, columns=colnames, index=colnames)
        return pdf

    def fill(self, *args, continuous=None, categorical=None, strategy=None):
        if len(args) and isinstance(args[0], DataFrame):
            return self._fillna(args[0], self._imputed_values)
        else:
            return self.__fill_self(continuous=continuous, categorical=categorical, strategy=strategy)

    @agg
    def isnull(self, ratio=False):
        def func(colname):
            return F.sum(F.isnull(colname).cast('int')).alias(colname)

        name = 'missing'
        if ratio:
            name += '(ratio)'
        missing = self._agg(name, func, self._df.columns)

        if ratio:
            nrows = self._agg('nrows', F.sum, F.lit(1))
            if isinstance(missing, pd.Series):
                missing = missing / nrows["Column<b'1'>"]
            else:
                missing.iloc[:, 1:] = missing.iloc[:, 1:].values / nrows["Column<b'1'>"].values.reshape(-1, 1)

        if len(self.strata_colnames):
            missing = missing.set_index(self.strata_colnames).T.unstack()
            missing.name = name

        return missing

    @agg
    def nunique(self, colnames=None):
        res = self._agg('nunique', F.approx_count_distinct, colnames)
        if len(self.strata_colnames):
            res = res.set_index(self.strata_colnames).T.unstack()
            res.name = 'nunique'
        return res

    def outliers(self, colnames=None, ratio=False, method='tukey', **kwargs):
        colnames = none2default(colnames, self._numerical)
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]

        res = None
        if method == 'tukey':
            outliers = []
            try:
                k = float(kwargs['k'])
            except KeyError:
                k = 1.5
            fences_df = self._calc_fences(colnames, k=k, precision=.01)

            index = fences_df[self.strata_colnames].set_index(self.strata_colnames).index \
                if len(self.strata_colnames) else None

            for colname in colnames:
                stats, counts, nrows = self._calc_bxp_stats(fences_df, colname, showfliers=False)
                outliers.append(pd.Series(counts, index=index, name=colname))
                if ratio:
                    outliers[-1] /= nrows

            res = pd.DataFrame(outliers).unstack()
            if not len(self.strata_colnames):
                res = res.droplevel(0)
            name = 'outliers'
            if ratio:
                name += '(ratio)'
            res.name = name

        return res

    def get_outliers(self, colnames=None, critical_value=.999):
        colnames = none2default(colnames, self._numerical)
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]

        outliers = self._set_mahalanobis_outliers(colnames, critical_value)
        df = outliers.filter('__outlier').orderBy(F.desc('__mahalanobis')).drop('__outlier', '__mahalanobis')
        return HandyFrame(df, self)

    def remove_outliers(self, colnames=None, critical_value=.999):
        colnames = none2default(colnames, self._numerical)
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]

        outliers = self._set_mahalanobis_outliers(colnames, critical_value)
        df = outliers.filter('not __outlier').drop('__outlier', '__mahalanobis')
        return HandyFrame(df, self)

    def fence(self, colnames, k=1.5):
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]

        pdf = self._calc_fences(colnames, k=k)
        if len(self.strata_colnames):
            pdf = pdf.set_index(self.strata_colnames)

        df = self._df.notHandy()
        for colname in colnames:
            lfence, ufence = pdf.loc[:, ['{}_lfence'.format(colname)]], pdf.loc[:, ['{}_ufence'.format(colname)]]
            if len(self._strata_raw_clauses):
                whens1 = ' '.join(['WHEN ({clause}) THEN greatest({col}, {fence})'.format(clause=clause,
                                                                                          col=colname,
                                                                                          fence=lfence.query(clause).iloc[0, 0])
                                   for clause in self._strata_raw_clauses])
                whens2 = ' '.join(['WHEN ({clause}) THEN least({col}, {fence})'.format(clause=clause,
                                                                                       col=colname,
                                                                                       fence=ufence.query(clause).iloc[0, 0])
                                   for clause in self._strata_raw_clauses])
                expression1 = F.expr('CASE {} END'.format(whens1))
                expression2 = F.expr('CASE {} END'.format(whens2))
                self._fenced_values.update({colname: {clause: [lfence.query(clause).iloc[0, 0],
                                                               ufence.query(clause).iloc[0, 0]]
                                                      for clause in self._strata_clauses}})
            else:
                self._fenced_values.update({colname: [lfence.iloc[0, 0], ufence.iloc[0, 0]]})

                expression1 = F.expr('greatest({col}, {fence})'.format(col=colname, fence=lfence.iloc[0, 0]))
                expression2 = F.expr('least({col}, {fence})'.format(col=colname, fence=ufence.iloc[0, 0]))
            df = df.withColumn(colname, expression1).withColumn(colname, expression2)

        return HandyFrame(df.select(self._df.columns), self)

    @inccol
    def value_counts(self, colnames, dropna=True):
        return self._value_counts(colnames, dropna)

    @inccol
    def mode(self, colname):
        check_columns(self._df, [colname])

        if self._strata is None:
            values = (self._df.notHandy().select(colname).dropna()
                      .groupby(colname).agg(F.count('*').alias('mode'))
                      .orderBy(F.desc('mode')).limit(1)
                      .toPandas()[colname][0])
            return pd.Series(values, index=[colname], name='mode')
        else:
            strata = self.strata_colnames
            colnames = strata + [colname]
            values = (self._df.notHandy().select(colnames).dropna()
                      .groupby(colnames).agg(F.count('*').alias('mode'))
                      .withColumn('order', F.row_number().over(Window.partitionBy(strata).orderBy(F.desc('mode'))))
                      .filter('order == 1').drop('order')
                      .toPandas().set_index(strata).sort_index()[colname])
            values.name = 'mode'
            return values

    @inccol
    def entropy(self, colnames):
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        sdf = self._df.notHandy()
        n = sdf.count()
        entropy = []
        for colname in colnames:
            if colname in self._categorical:
                res = (self._df
                       .groupby(self.strata_colnames + [colname])
                       .agg(F.count('*').alias('value_counts')).withColumn('probability', F.col('value_counts') / n)
                       .groupby(self.strata_colnames)
                       .agg(F.sum(F.expr('-log2(probability) * probability')).alias(colname))
                       .safety_off()
                       .cols[self.strata_colnames + [colname]][:])

                if len(self.strata_colnames):
                    res.set_index(self.strata_colnames, inplace=True)
                    res = res.unstack()
                else:
                    res = res[colname]
                    res.index = [colname]
            else:
                res = pd.Series(None, index=[colname])
            res.name = 'entropy'
            entropy.append(res)
        return pd.concat(entropy).sort_index()

    @inccol
    def mutual_info(self, colnames):
        def distribution(sdf, colnames):
            return sdf.groupby(colnames).agg(F.count('*').alias('__count'))

        check_columns(self._df, colnames)
        n = len(colnames)
        probs = []
        sdf = self._df.notHandy()
        for i in range(n):
            probs.append(distribution(sdf, self.strata_colnames + [colnames[i]]))

        if len(self.strata_colnames):
            nrows = sdf.groupby(self.strata_colnames).agg(F.count('*').alias('__n'))
        else:
            nrows = sdf.count()

        entropies = self.entropy(colnames)
        res = []
        for i in range(n):
            for j in range(i, n):
                if i == j:
                    mi = pd.Series(entropies[colnames[i]], name='mi').to_frame()
                else:
                    tdf = distribution(sdf, self.strata_colnames + [colnames[i], colnames[j]])
                    if len(self.strata_colnames):
                        tdf = tdf.join(nrows, on=self.strata_colnames)
                    else:
                        tdf = tdf.withColumn('__n', F.lit(nrows))
                    tdf = tdf.join(probs[i].toDF(*self.strata_colnames, colnames[i], '__count0'), on=self.strata_colnames + [colnames[i]])
                    tdf = tdf.join(probs[j].toDF(*self.strata_colnames, colnames[j], '__count1'), on=self.strata_colnames + [colnames[j]])
                    mi = (tdf
                          .groupby(self.strata_colnames)
                          .agg(F.sum(F.expr('log2(__count * __n / (__count0 * __count1)) * __count / __n')).alias('mi'))
                          .toPandas())

                    if len(self.strata_colnames):
                        mi.set_index(self.strata_colnames, inplace=True)

                    res.append(mi.assign(ci=colnames[j], cj=colnames[i]))

                res.append(mi.assign(ci=colnames[i], cj=colnames[j]))

        res = pd.concat(res).set_index(['ci', 'cj'], append=len(self.strata_colnames)).sort_index()
        res = pd.pivot_table(res, index=self.strata_colnames + ['ci'], columns=['cj'])
        res.index.names = self.strata_colnames + ['']
        res.columns = res.columns.droplevel(0).rename('')
        return res

    @agg
    def mean(self, colnames):
        return self._agg('mean', F.mean, colnames)

    @agg
    def min(self, colnames):
        return self._agg('min', F.min, colnames)

    @agg
    def max(self, colnames):
        return self._agg('max', F.max, colnames)

    @agg
    def percentile(self, colnames, perc=50, precision=.01):
        def func(c):
            return F.expr('approx_percentile({}, {}, {})'.format(c, perc/100., 1./precision))
        try:
            name = {25: 'q1', 50: 'median', 75: 'q3'}[perc]
        except KeyError:
            name = 'percentile_{}'.format(perc)
        return self._agg(name, func, colnames)

    @agg
    def median(self, colnames, precision=.01):
        return self.percentile(colnames, 50, precision)

    @agg
    def stddev(self, colnames):
        return self._agg('stddev', F.stddev, colnames)

    @agg
    def var(self, colnames):
        return self._agg('var', F.stddev, colnames) ** 2

    @agg
    def q1(self, colnames, precision=.01):
        return self.percentile(colnames, 25, precision)

    @agg
    def q3(self, colnames, precision=.01):
        return self.percentile(colnames, 75, precision)

    ### Boxplot functions
    def _strat_boxplot(self, colnames, **kwargs):
        n_rows = n_cols = 1
        kwds = deepcopy(kwargs)
        for kw in ['showfliers', 'precision']:
            try:
                del kwds[kw]
            except KeyError:
                pass
        if isinstance(colnames, (tuple, list)) and (len(colnames) > 1):
            n_rows = self._n_rows
            n_cols = self._n_cols
        self._build_strat_plot(n_rows, n_cols, **kwds)
        return None

    @inccol
    def boxplot(self, colnames, ax=None, showfliers=True, k=1.5, precision=.01, **kwargs):
        colnames = ensure_list(colnames)
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]
        assert len(colnames), "Only numerical columns can be plot!"
        return boxplot(self._df, colnames, ax, showfliers, k, precision)

    def _post_boxplot(self, res):
        return post_boxplot(self._strata_plot[1], res)

    ### Scatterplot functions
    def _strat_scatterplot(self, colnames, **kwargs):
        self._build_strat_plot(self._n_rows, self._n_cols, **kwargs)
        return strat_scatterplot(self._df.notHandy(), colnames[0], colnames[1])

    @inccol
    def scatterplot(self, colnames, ax=None, **kwargs):
        assert len(colnames) == 2, "There must be two columns to plot!"
        check_columns(self._df, colnames)
        colnames = [col for col in colnames if col in self._numerical]
        assert len(colnames) == 2, "Both columns must be numerical!"
        return scatterplot(self._df, colnames[0], colnames[1], ax=ax)

    ### Histogram functions
    def _strat_hist(self, colname, bins=10, **kwargs):
        self._build_strat_plot(self._n_rows, self._n_cols, **kwargs)
        categorical = True
        if colname in self._continuous:
            categorical = False
        #res = strat_histogram(self._df.notHandy(), colname, bins, categorical)
        res = strat_histogram(self._df, colname, bins, categorical)
        self._strata_plot[0].suptitle('')
        plt.tight_layout()
        return res

    @inccol
    def hist(self, colname, bins=10, ax=None, **kwargs):
        # TO DO
        # include split per response/columns
        assert len(ensure_list(colname)) == 1, "Only single columns can be plot!"
        check_columns(self._df, colname)
        if colname in self._continuous:
            return histogram(self._df, colname, bins=bins, categorical=False, ax=ax)
        else:
            return histogram(self._df, colname, bins=bins, categorical=True, ax=ax)


class HandyGrouped(GroupedData):
    def __init__(self, jgd, df, *args):
        self._jgd = jgd
        self._df = df
        self.sql_ctx = df.sql_ctx
        self._cols = args

    def agg(self, *exprs):
        df = super().agg(*exprs)
        handy = deepcopy(self._df._handy)
        handy._group_cols = self._cols
        return HandyFrame(df, handy)

    def __repr__(self):
        return "HandyGrouped[%s]" % (", ".join("%s" % c for c in self._group_cols))


class HandyFrame(DataFrame):
    """HandySpark version of DataFrame.

    Attributes
    ----------
    cols: HandyColumns
        class to access pandas-like column based methods implemented in Spark
    pandas: HandyPandas
        class to access pandas-like column based methods through pandas UDFs
    transformers: HandyTransformers
        class to generate Handy transformers
    stages: integer
        number of stages in the execution plan
    response: string
        name of the response column
    is_classification: boolean
        True if response is a categorical variable
    classes: list
        list of classes for a classification problem
    nclasses: integer
        number of classes for a classification problem
    ncols: integer
        number of columns of the HandyFrame
    nrows: integer
        number of rows of the HandyFrame
    shape: tuple
        tuple representing dimensionality of the HandyFrame
    statistics_: dict
        imputation fill value for each feature
        If stratified, first level keys are filter clauses for stratification
    fences_: dict
        fence values for each feature
        If stratified, first level keys are filter clauses for stratification
    is_stratified: boolean
        True if HandyFrame was stratified
    values: ndarray
        Numpy representation of HandyFrame.

    Available methods:
    - notHandy: makes it a plain Spark dataframe
    - stratify: used to perform stratified operations
    - isnull: checks for missing values
    - fill: fills missing values
    - outliers: returns counts of outliers, columnwise, using Tukey's method
    - get_outliers: returns list of outliers using Mahalanobis distance
    - remove_outliers: filters out outliers using Mahalanobis distance
    - fence: fences outliers
    - set_safety_limit: defines new safety limit for collect operations
    - safety_off: disables safety limit for a single operation
    - assign: appends a new columns based on an expression
    - nunique: returns number of unique values in each column
    - set_response: sets column to be used as response / label
    - disassemble: turns a vector / array column into multiple columns
    - to_metrics_RDD: turns probability and label columns into a tuple RDD
    """

    def __init__(self, df, handy=None):
        super().__init__(df._jdf, df.sql_ctx)
        if handy is None:
            handy = Handy(self)
        else:
            handy = deepcopy(handy)
            handy._df = self
            handy._update_types()
        self._handy = handy
        self._safety = self._handy._safety
        self._safety_limit = self._handy._safety_limit
        self.__overriden = ['collect', 'take']
        self._strat_handy = None
        self._strat_index = None

    def __getattribute__(self, name):
        attr = object.__getattribute__(self, name)
        if hasattr(attr, '__call__') and name not in self.__overriden:
            def wrapper(*args, **kwargs):
                try:
                    res = attr(*args, **kwargs)
                except HandyException as e:
                    raise HandyException(str(e), summary=False)
                except Exception as e:
                    raise HandyException(str(e), summary=True)

                if name != 'notHandy':
                    if not isinstance(res, HandyFrame):
                        if isinstance(res, DataFrame):
                            res = HandyFrame(res, self._handy)
                        if isinstance(res, GroupedData):
                            res = HandyGrouped(res._jgd, res._df, *args)
                return res
            return wrapper
        else:
            return attr

    def __repr__(self):
        return "HandyFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))

    def _get_strata(self):
        plot = None
        object = None
        if self._strat_handy is not None:
            try:
                object = self._strat_handy._strata_object
            except AttributeError:
                pass
            if object is None:
                object = True
            try:
                plots = self._strat_handy._strata_plot[1]
                #if len(plots) > 1:
                #    plot = plots[self._strat_index]
                plot = plots
            except (AttributeError, IndexError):
                pass
        return plot, object

    def _gen_row_ids(self, *args):
        # EXPERIMENTAL - DO NOT USE!
        return (self
                .sort(*args)
                .withColumn('_miid', F.monotonically_increasing_id())
                .withColumn('_row_id', F.row_number().over(Window().orderBy(F.col('_miid'))))
                .drop('_miid'))

    def _loc(self, lower_bound, upper_bound):
        # EXPERIMENTAL - DO NOT USE!
        assert '_row_id' in self.columns, "Cannot use LOC without generating `row_id`s first!"
        clause = F.col('_row_id').between(lower_bound, upper_bound)
        return self.filter(clause)

    @property
    def cols(self):
        """Returns a class to access pandas-like column based methods implemented in Spark

        Available methods:
        - min
        - max
        - median
        - q1
        - q3
        - stddev
        - value_counts
        - mode
        - corr
        - nunique
        - hist
        - boxplot
        - scatterplot
        """
        return HandyColumns(self, self._handy)

    @property
    def pandas(self):
        """Returns a class to access pandas-like column based methods through pandas UDFs

        Available methods:
        - betweeen / between_time
        - isin
        - isna / isnull
        - notna / notnull
        - abs
        - clip / clip_lower / clip_upper
        - replace
        - round / truncate
        - tz_convert / tz_localize
        """
        return HandyPandas(self)

    @property
    def transformers(self):
        """Returns a class to generate Handy transformers

        Available transformers:
        - HandyImputer
        - HandyFencer
        """
        return HandyTransformers(self)

    @property
    def stages(self):
        """Returns the number of stages in the execution plan.
        """
        return self._handy.stages

    @property
    def response(self):
        """Returns the name of the response column.
        """
        return self._handy.response

    @property
    def is_classification(self):
        """Returns True if response is a categorical variable.
        """
        return self._handy.is_classification

    @property
    def classes(self):
        """Returns list of classes for a classification problem.
        """
        return self._handy.classes

    @property
    def nclasses(self):
        """Returns the number of classes for a classification problem.
        """
        return self._handy.nclasses

    @property
    def ncols(self):
        """Returns the number of columns of the HandyFrame.
        """
        return self._handy.ncols

    @property
    def nrows(self):
        """Returns the number of rows of the HandyFrame.
        """
        return self._handy.nrows

    @property
    def shape(self):
        """Return a tuple representing the dimensionality of the HandyFrame.
        """
        return self._handy.shape

    @property
    def statistics_(self):
        """Returns dictionary with imputation fill value for each feature.
        If stratified, first level keys are filter clauses for stratification.
        """
        return self._handy.statistics_

    @property
    def fences_(self):
        """Returns dictionary with fence values for each feature.
        If stratified, first level keys are filter clauses for stratification.
        """
        return self._handy.fences_

    @property
    def values(self):
        """Numpy representation of HandyFrame.
        """
        # safety limit will kick in, unless explicitly off before
        tdf = self
        if self._safety:
            tdf = tdf.limit(self._safety_limit)
        return np.array(tdf.rdd.map(tuple).collect())

    def notHandy(self):
        """Converts HandyFrame back into Spark's DataFrame
        """
        return DataFrame(self._jdf, self.sql_ctx)

    def set_safety_limit(self, limit):
        """Sets safety limit used for ``collect`` method.
        """
        self._handy._safety_limit = limit
        self._safety_limit = limit

    def safety_off(self):
        """Disables safety limit for a single call of ``collect`` method.
        """
        self._handy._safety = False
        self._safety = False
        return self

    def collect(self):
        """Returns all the records as a list of :class:`Row`.

        By default, its output is limited by the safety limit.
        To get original `collect` behavior, call ``safety_off`` method first.
        """
        try:
            if self._safety:
                print('\nINFO: Safety is ON - returning up to {} instances.'.format(self._safety_limit))
                return super().limit(self._safety_limit).collect()
            else:
                res = super().collect()
                self._safety = True
                return res
        except HandyException as e:
            raise HandyException(str(e), summary=False)
        except Exception as e:
            raise HandyException(str(e), summary=True)

    def take(self, num):
        """Returns the first ``num`` rows as a :class:`list` of :class:`Row`.
        """
        self._handy._safety = False
        res = super().take(num)
        self._handy._safety = True
        return res

    def stratify(self, strata):
        """Stratify the HandyFrame.

        Stratified operations should be more efficient than group by operations, as they
        rely on three iterative steps, namely: filtering the underlying HandyFrame, performing
        the operation and aggregating the results.
        """
        strata = ensure_list(strata)
        check_columns(self, strata)
        return self._handy._stratify(strata)

    def transform(self, f, name=None, args=None, returnType=None):
        """INTERNAL USE
        """
        return HandyTransform.transform(self, f, name=name, args=args, returnType=returnType)

    def apply(self, f, name=None, args=None, returnType=None):
        """INTERNAL USE
        """
        return HandyTransform.apply(self, f, name=name, args=args, returnType=returnType)

    def assign(self, **kwargs):
        """Assign new columns to a HandyFrame, returning a new object (a copy)
        with all the original columns in addition to the new ones.

        Parameters
        ----------
        kwargs : keyword, value pairs
            keywords are the column names.
            If the values are callable, they are computed on the DataFrame and
            assigned to the new columns.
            If the values are not callable, (e.g. a scalar, or string),
            they are simply assigned.

        Returns
        -------
        df : HandyFrame
            A new HandyFrame with the new columns in addition to
            all the existing columns.
        """
        return HandyTransform.assign(self, **kwargs)

    @agg
    def isnull(self, ratio=False):
        """Returns array with counts of missing value for each column in the HandyFrame.

        Parameters
        ----------
        ratio: boolean, default False
            If True, returns ratios instead of absolute counts.

        Returns
        -------
        counts: Series
        """
        return self._handy.isnull(ratio)

    @agg
    def nunique(self):
        """Return Series with number of distinct observations for all columns.

        Parameters
        ----------
        exact: boolean, optional
            If True, computes exact number of unique values, otherwise uses an approximation.

        Returns
        -------
        nunique: Series
        """
        return self._handy.nunique(self.columns) #, exact)

    @inccol
    def outliers(self, ratio=False, method='tukey', **kwargs):
        """Return Series with number of outlier observations according to
         the specified method for all columns.

         Parameters
         ----------
         ratio: boolean, optional
            If True, returns proportion instead of counts.
            Default is True.
         method: string, optional
            Method used to detect outliers. Currently, only Tukey's method is supported.
            Default is tukey.

         Returns
         -------
         outliers: Series
        """
        return self._handy.outliers(self.columns, ratio=ratio, method=method, **kwargs)

    def get_outliers(self, colnames=None, critical_value=.999):
        """Returns HandyFrame containing all rows deemed as outliers using
        Mahalanobis distance and informed critical value.

        Parameters
        ----------
        colnames: list of str, optional
            List of columns to be used for computing Mahalanobis distance.
            Default includes all numerical columns
        critical_value: float, optional
            Critical value for chi-squared distribution to classify outliers
            according to Mahalanobis distance.
            Default is .999 (99.9%).
        """
        return self._handy.get_outliers(colnames, critical_value)

    def remove_outliers(self, colnames=None, critical_value=.999):
        """Returns HandyFrame containing only rows NOT deemed as outliers
        using  Mahalanobis distance and informed critical value.

        Parameters
        ----------
        colnames: list of str, optional
            List of columns to be used for computing Mahalanobis distance.
            Default includes all numerical columns
        critical_value: float, optional
            Critical value for chi-squared distribution to classify outliers
            according to Mahalanobis distance.
            Default is .999 (99.9%).
        """
        return self._handy.remove_outliers(colnames, critical_value)

    def set_response(self, colname):
        """Sets column to be used as response in supervised learning algorithms.

        Parameters
        ----------
        colname: string

        Returns
        -------
        self
        """
        check_columns(self, colname)
        return self._handy.set_response(colname)

    @inccol
    def fill(self, *args, categorical=None, continuous=None, strategy=None):
        """Fill NA/NaN values using the specified methods.

        The values used for imputation are kept in ``statistics_`` property
        and can later be used to generate a corresponding HandyImputer transformer.

        Parameters
        ----------
        categorical: 'all' or list of string, optional
            List of categorical columns.
            These columns are filled with its coresponding modes (most common values).
        continuous: 'all' or list of string, optional
            List of continuous value columns.
            By default, these columns are filled with its  corresponding means.
            If a same-sized list is provided in the ``strategy`` argument, it uses
            the corresponding straegy for each column.
        strategy: list of string, optional
            If informed, it must contain a strategy - either ``mean`` or ``median`` - for
            each one of the continuous columns.

        Returns
        -------
        df : HandyFrame
            A new HandyFrame with filled missing values.
        """
        return self._handy.fill(*args, continuous=continuous, categorical=categorical, strategy=strategy)

    @inccol
    def fence(self, colnames, k=1.5):
        """Caps outliers using lower and upper fences given by Tukey's method,
        using 1.5 times the interquartile range (IQR).

        The fence values used for capping outliers are kept in ``fences_`` property
        and can later be used to generate a corresponding HandyFencer transformer.

        For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey's_fences

        Parameters
        ----------
        colnames: list of string
            Column names to apply fencing.
        k: float, optional
            Constant multiplier for the IQR.
            Default is 1.5 (corresponding to Tukey's outlier, use 3 for "far out" values)

        Returns
        -------
        df : HandyFrame
            A new HandyFrame with capped outliers.
        """
        return self._handy.fence(colnames, k=k)

    def disassemble(self, colname, new_colnames=None):
        """Disassembles a Vector or Array column into multiple columns.

        Parameters
        ----------
        colname: string
            Column containing Vector or Array elements.
        new_colnames: list of string, optional
            Default is None, column names are generated using a sequentially
            generated suffix (e.g., _0, _1, etc.) for ``colname``.
            If informed, it must have as many column names as elements
            in the shortest vector/array of ``colname``.

        Returns
        -------
        df : HandyFrame
            A new HandyFrame with the new disassembled columns in addition to
            all the existing columns.
        """
        return self._handy.disassemble(colname, new_colnames)

    def to_metrics_RDD(self, prob_col='probability', label_col='label'):
        """Converts a DataFrame containing predicted probabilities and classification labels
        into a RDD suited for use with ``BinaryClassificationMetrics`` object.

        Parameters
        ----------
        prob_col: string, optional
            Column containing Vectors of probabilities.
            Default is 'probability'.
        label_col: string, optional
            Column containing labels.
            Default is 'label'.

        Returns
        -------
        rdd: RDD
            RDD of tuples (probability, label)
        """
        return self._handy.to_metrics_RDD(prob_col, label_col)


class Bucket(object):
    """Bucketizes a column of continuous values into equal sized bins
    to perform stratification.

    Parameters
    ----------
    colname: string
        Column containing continuous values
    bins: integer
        Number of equal sized bins to map original values to.

    Returns
    -------
    bucket: Bucket
        Bucket object to be used as column in stratification.
    """
    def __init__(self, colname, bins=5):
        self._colname = colname
        self._bins = bins
        self._buckets = None
        self._clauses = None

    def __repr__(self):
        return 'Bucket_{}_{}'.format(self._colname, self._bins)

    @property
    def colname(self):
        return self._colname

    def _get_buckets(self, df):
        check_columns(df, self._colname)
        buckets = ([-float('inf')] +
                   np.linspace(*df.agg(F.min(self._colname),
                                       F.max(self._colname)).rdd.map(tuple).collect()[0],
                               self._bins + 1).tolist() +
                   [float('inf')])
        buckets[-2] += 1e-7
        self._buckets = buckets
        return buckets

    def _get_clauses(self, buckets):
        clauses = []
        clauses.append('{} < {:.4f}'.format(self._colname, buckets[1]))
        for b, e in zip(buckets[1:-2], buckets[2:-1]):
            clauses.append('{} >= {:.4f} and {} < {:.4f}'.format(self._colname, b, self._colname, e))
        clauses[-1] = clauses[-1].replace('<', '<=')
        clauses.append('{} > {:.4f}'.format(self._colname, buckets[-2]))
        self._clauses = clauses
        return clauses


class Quantile(Bucket):
    """Bucketizes a column of continuous values into quantiles
    to perform stratification.

    Parameters
    ----------
    colname: string
        Column containing continuous values
    bins: integer
        Number of quantiles to map original values to.

    Returns
    -------
    quantile: Quantile
        Quantile object to be used as column in stratification.
    """
    def __repr__(self):
        return 'Quantile{}_{}'.format(self._colname, self._bins)

    def _get_buckets(self, df):
        buckets = ([-float('inf')] +
                   df.approxQuantile(col=self._colname,
                                     probabilities=np.linspace(0, 1, self._bins + 1).tolist(),
                                     relativeError=0.01) +
                   [float('inf')])
        buckets[-2] += 1e-7
        return buckets


class HandyColumns(object):
    """HandyColumn(s) in a HandyFrame.

    Attributes
    ----------
    numerical: list of string
        List of numerical columns (integer, float, double)
    categorical: list of string
        List of categorical columns (string, integer)
    continuous: list of string
        List of continous columns (float, double)
    string: list of string
        List of string columns (string)
    array: list of string
        List of array columns (array, map)
    """
    def __init__(self, df, handy, strata=None):
        self._df = df
        self._handy = handy
        self._strata = strata
        self._colnames = None
        self.COLTYPES = {'continuous': self.continuous,
                         'categorical': self.categorical,
                         'numerical': self.numerical,
                         'string': self.string,
                         'array': self.array}

    def __getitem__(self, *args):
        if isinstance(args[0], tuple):
            args = args[0]
        item = args[0]
        if self._strata is None:
            if self._colnames is None:
                if item == slice(None, None, None):
                    item = self._df.columns

                if isinstance(item, str):
                    try:
                        # try it as an alias
                        item = self.COLTYPES[item]
                    except KeyError:
                        pass

                check_columns(self._df, item)
                self._colnames = item

                if isinstance(self._colnames, int):
                    idx = self._colnames + (len(self._handy._group_cols) if self._handy._group_cols is not None else 0)
                    assert idx < len(self._df.columns), "Invalid column index {}".format(idx)
                    self._colnames = list(self._df.columns)[idx]

                return self
            else:
                try:
                    n = item.stop
                    if n is None:
                        n = -1
                except:
                    n = 20

                if isinstance(self._colnames, (tuple, list)):
                    res = self._df.notHandy().select(self._colnames)
                    if n == -1:
                        if self._df._safety:
                            print('\nINFO: Safety is ON - returning up to {} instances.'.format(self._df._safety_limit))
                            n = self._df._safety_limit
                    if n != -1:
                        res = res.limit(n)
                    res = res.toPandas()
                    self._handy._safety = True
                    self._df._safety = True
                    return res
                else:
                    return self._handy.__getitem__(self._colnames, n)
        else:
            if self._colnames is None:
                if item == slice(None, None, None):
                    item = self._df.columns

                if isinstance(item, str):
                    try:
                        # try it as an alias
                        item = self.COLTYPES[item]
                    except KeyError:
                        pass

            self._strata._handycolumns = item
            return self._strata

    def __repr__(self):
        colnames = ensure_list(self._colnames)
        return "HandyColumns[%s]" % (", ".join("%s" % str(c) for c in colnames))

    @property
    def numerical(self):
        """Returns list of numerical columns in the HandyFrame.
        """
        return self._handy._numerical

    @property
    def categorical(self):
        """Returns list of categorical columns in the HandyFrame.
        """
        return self._handy._categorical

    @property
    def continuous(self):
        """Returns list of continuous columns in the HandyFrame.
        """
        return self._handy._continuous

    @property
    def string(self):
        """Returns list of string columns in the HandyFrame.
        """
        return self._handy._string

    @property
    def array(self):
        """Returns list of array or map columns in the HandyFrame.
        """
        return self._handy._array

    def mean(self):
        return self._handy.mean(self._colnames)

    def min(self):
        return self._handy.min(self._colnames)

    def max(self):
        return self._handy.max(self._colnames)

    def median(self, precision=.01):
        """Returns approximate median with given precision.

        Parameters
        ----------
        precision: float, optional
            Default is 0.01
        """
        return self._handy.median(self._colnames, precision)

    def stddev(self):
        return self._handy.stddev(self._colnames)

    def var(self):
        return self._handy.var(self._colnames)

    def percentile(self, perc, precision=.01):
        """Returns approximate percentile with given precision.

        Parameters
        ----------
        perc: integer
            Percentile to be computed
        precision: float, optional
            Default is 0.01
        """
        return self._handy.percentile(self._colnames, perc, precision)

    def q1(self, precision=.01):
        """Returns approximate first quartile with given precision.

        Parameters
        ----------
        precision: float, optional
            Default is 0.01
        """
        return self._handy.q1(self._colnames, precision)

    def q3(self, precision=.01):
        """Returns approximate third quartile with given precision.

        Parameters
        ----------
        precision: float, optional
            Default is 0.01
        """
        return self._handy.q3(self._colnames, precision)

    def _value_counts(self, dropna=True, raw=True):
        assert len(ensure_list(self._colnames)) == 1, "A single column must be selected!"
        return self._handy._value_counts(self._colnames, dropna, raw)

    def value_counts(self, dropna=True):
        """Returns object containing counts of unique values.

        The resulting object will be in descending order so that the
        first element is the most frequently-occurring element.
        Excludes NA values by default.


        Parameters
        ----------
        dropna : boolean, default True
            Don't include counts of missing values.

        Returns
        -------
        counts: Series
        """
        assert len(ensure_list(self._colnames)) == 1, "A single column must be selected!"
        return self._handy.value_counts(self._colnames, dropna)

    def entropy(self):
        """Returns object containing entropy (base 2) of each column.

        Returns
        -------
        entropy: Series
        """
        return self._handy.entropy(self._colnames)

    def mutual_info(self):
        """Returns object containing matrix of mutual information
        between every pair of columns.

        Returns
        -------
        mutual_info: pd.DataFrame
        """
        return self._handy.mutual_info(self._colnames)

    def mode(self):
        """Returns same-type modal (most common) value for each column.

        Returns
        -------
        mode: Series
        """
        colnames = ensure_list(self._colnames)
        modes = [self._handy.mode(colname) for colname in colnames]
        if len(colnames) == 1:
            return modes[0]
        else:
            return pd.concat(modes, axis=0)

    def corr(self, method='pearson'):
        """Compute pairwise correlation of columns, excluding NA/null values.

        Parameters
        ----------
        method : {'pearson', 'spearman'}
            * pearson : standard correlation coefficient
            * spearman : Spearman rank correlation

        Returns
        -------
        y : DataFrame
        """
        colnames = [col for col in self._colnames if col in self.numerical]
        return self._handy.corr(colnames, method=method)

    def nunique(self):
        """Return Series with number of distinct observations for specified columns.

        Parameters
        ----------
        exact: boolean, optional
            If True, computes exact number of unique values, otherwise uses an approximation.

        Returns
        -------
        nunique: Series
        """
        return self._handy.nunique(self._colnames) #, exact)

    def outliers(self, ratio=False, method='tukey', **kwargs):
        """Return Series with number of outlier observations according to
         the specified method for all columns.

         Parameters
         ----------
         ratio: boolean, optional
            If True, returns proportion instead of counts.
            Default is True.
         method: string, optional
            Method used to detect outliers. Currently, only Tukey's method is supported.
            Default is tukey.

         Returns
         -------
         outliers: Series
        """
        return self._handy.outliers(self._colnames, ratio=ratio, method=method, **kwargs)

    def get_outliers(self, critical_value=.999):
        """Returns HandyFrame containing all rows deemed as outliers using
        Mahalanobis distance and informed critical value.

        Parameters
        ----------
        critical_value: float, optional
            Critical value for chi-squared distribution to classify outliers
            according to Mahalanobis distance.
            Default is .999 (99.9%).
        """
        return self._handy.get_outliers(self._colnames, critical_value)

    def remove_outliers(self, critical_value=.999):
        """Returns HandyFrame containing only rows NOT deemed as outliers
        using  Mahalanobis distance and informed critical value.

        Parameters
        ----------
        critical_value: float, optional
            Critical value for chi-squared distribution to classify outliers
            according to Mahalanobis distance.
            Default is .999 (99.9%).
        """
        return self._handy.remove_outliers(self._colnames, critical_value)

    def hist(self, bins=10, ax=None):
        """Draws histogram of the HandyFrame's column using matplotlib / pylab.

        Parameters
        ----------
        bins : integer, default 10
            Number of histogram bins to be used
        ax : matplotlib axes object, default None
        """
        return self._handy.hist(self._colnames, bins, ax)

    def boxplot(self, ax=None, showfliers=True, k=1.5, precision=.01):
        """Makes a box plot from HandyFrame column.

        Parameters
        ----------
        ax : matplotlib axes object, default None
        showfliers : bool, optional (True)
            Show the outliers beyond the caps.
        k: float, optional
            Constant multiplier for the IQR.
            Default is 1.5 (corresponding to Tukey's outlier, use 3 for "far out" values)
        """
        return self._handy.boxplot(self._colnames, ax, showfliers, k, precision)

    def scatterplot(self, ax=None):
        """Makes a scatter plot of two HandyFrame columns.

        Parameters
        ----------
        ax : matplotlib axes object, default None
        """
        return self._handy.scatterplot(self._colnames, ax)


class HandyStrata(object):
    __handy_methods = (list(filter(lambda n: n[0] != '_',
                               (map(itemgetter(0),
                                    inspect.getmembers(HandyFrame,
                                                       predicate=inspect.isfunction) +
                                    inspect.getmembers(HandyColumns,
                                                       predicate=inspect.isfunction)))))) + ['handy']

    def __init__(self, handy, strata):
        self._handy = handy
        self._df = handy._df
        self._strata = strata
        self._col_clauses = []
        self._colnames = []
        self._temp_colnames = []

        temp_df = self._df
        temp_df._handy = self._handy
        for col in self._strata:
            clauses = []
            colname = str(col)
            self._colnames.append(colname)
            if isinstance(col, Bucket):
                self._temp_colnames.append(colname)
                buckets = col._get_buckets(self._df)
                clauses = col._get_clauses(buckets)
                bucketizer = Bucketizer(splits=buckets, inputCol=col.colname, outputCol=colname)
                temp_df = HandyFrame(bucketizer.transform(temp_df), self._handy)
            self._col_clauses.append(clauses)

        self._df = temp_df
        self._handy._df = temp_df
        self._df._handy = self._handy

        value_counts = self._df._handy._value_counts(self._colnames, raw=True).reset_index()
        self._raw_combinations = sorted(list(map(tuple, zip(*[value_counts[colname].values
                                                              for colname in self._colnames]))))
        self._raw_clauses = [' and '.join('{} == {}'.format(str(col), value) if isinstance(col, Bucket)
                                      else  '{} == "{}"'.format(str(col),
                                                                value[0] if isinstance(value, tuple) else value)
                                      for col, value in zip(self._strata, comb))
                         for comb in self._raw_combinations]

        self._combinations = [tuple(value if not len(clauses) else clauses[int(float(value))]
                                    for value, clauses in zip(comb, self._col_clauses))
                              for comb in self._raw_combinations]
        self._clauses = [' and '.join(value if isinstance(col, Bucket)
                                      else  '{} == "{}"'.format(str(col),
                                                                value[0] if isinstance(value, tuple) else value)
                                      for col, value in zip(self._strata, comb))
                         for comb in self._combinations]
        self._strat_df = [self._df.filter(clause) for clause in self._clauses]

        self._df._strat_handy = self._handy
        # Shares the same HANDY object among all sub dataframes
        for i, df in enumerate(self._strat_df):
            df._strat_index = i
            df._strat_handy = self._handy
        self._imputed_values = {}
        self._handycolumns = None

    def __repr__(self):
        repr = "HandyStrata[%s]" % (", ".join("%s" % str(c) for c in self._strata))
        if self._handycolumns is not None:
            colnames = ensure_list(self._handycolumns)
            repr = "HandyColumns[%s] by %s" % (", ".join("%s" % str(c) for c in colnames), repr)
        return repr

    def __getattribute__(self, name):
        try:
            if name == 'cols':
                return HandyColumns(self._df, self._handy, self)
            else:
                attr = object.__getattribute__(self, name)
                return attr
        except AttributeError as e:
            if name in self.__handy_methods:
                def wrapper(*args, **kwargs):
                    raised = True
                    try:
                        # Makes stratification
                        for df in self._strat_df:
                            df._handy._strata = self._strata
                        self._handy._set_stratification(self._strata,
                                                        self._raw_combinations, self._raw_clauses,
                                                        self._combinations, self._clauses)

                        if self._handycolumns is not None:
                            args = (self._handycolumns,) + args

                        try:
                            attr_strata = getattr(self._handy, '_strat_{}'.format(name))
                            self._handy._strata_object = attr_strata(*args, **kwargs)
                        except AttributeError:
                            pass

                        try:
                            if self._handycolumns is not None:
                                f = object.__getattribute__(self._handy, name)
                            else:
                                f = object.__getattribute__(self._df, name)
                            is_agg = getattr(f, '__is_agg', False)
                            is_inccol = getattr(f, '__is_inccol', False)
                        except AttributeError:
                            is_agg = False
                            is_inccol = False

                        if is_agg or is_inccol:
                            if self._handycolumns is not None:
                                colnames = ensure_list(args[0])
                            else:
                                colnames = self._df.columns
                            res = getattr(self._handy, name)(*args, **kwargs)
                        else:
                            if self._handycolumns is not None:
                                res = [getattr(df._handy, name)(*args, **kwargs) for df in self._strat_df]
                            else:
                                res = [getattr(df, name)(*args, **kwargs) for df in self._strat_df]

                        if isinstance(res, pd.DataFrame):
                            if len(self._handy.strata_colnames):
                                res = res.set_index(self._handy.strata_colnames).sort_index()
                            if is_agg:
                                if len(colnames) == 1:
                                    res = res[colnames[0]]

                        try:
                            attr_post = getattr(self._handy, '_post_{}'.format(name))
                            res = attr_post(res)
                        except AttributeError:
                            pass

                        strata = list(map(lambda v: v[1].to_dict(OrderedDict), self._handy.strata.iterrows()))
                        strata_cols = [c if isinstance(c, str) else c.colname for c in self._strata]
                        if isinstance(res, list):
                            if isinstance(res[0], DataFrame):
                                joined_df = res[0]
                                self._imputed_values = joined_df.statistics_
                                self._fenced_values = joined_df.fences_
                                if len(res) > 1:
                                    if len(joined_df.statistics_):
                                        self._imputed_values = {self._clauses[0]: joined_df.statistics_}
                                    if len(joined_df.fences_):
                                        self._fenced_values = {self._clauses[0]: joined_df.fences_}
                                    for strat_df, clause in zip(res[1:], self._clauses[1:]):
                                        if len(joined_df.statistics_):
                                            self._imputed_values.update({clause: strat_df.statistics_})
                                        if len(joined_df.fences_):
                                            self._fenced_values.update({clause: strat_df.fences_})
                                        joined_df = joined_df.unionAll(strat_df)
                                    # Clears stratification
                                    self._handy._clear_stratification()
                                    self._df._strat_handy = None
                                    self._df._strat_index = None

                                    if len(self._temp_colnames):
                                        joined_df = joined_df.drop(*self._temp_colnames)

                                    res = HandyFrame(joined_df, self._handy)
                                    res._handy._imputed_values = self._imputed_values
                                    res._handy._fenced_values = self._fenced_values
                            elif isinstance(res[0], pd.DataFrame):
                                strat_res = []
                                indexes = res[0].index.names
                                if indexes[0] is None:
                                    indexes = ['index']
                                for r, s in zip(res, strata):
                                    strata_dict = dict([(k if isinstance(k, str) else k.colname, v) for k, v in s.items()])
                                    strat_res.append(r.assign(**strata_dict)
                                                     .reset_index())
                                res = (pd.concat(strat_res)
                                       .sort_values(by=strata_cols)
                                       .set_index(strata_cols + indexes)
                                       .sort_index())
                            elif isinstance(res[0], pd.Series):
                                # TODO: TEST
                                strat_res = []
                                for r, s in zip(res, strata):
                                    strata_dict = dict([(k if isinstance(k, str) else k.colname, v) for k, v in s.items()])
                                    series_name = none2default(r.name, 0)
                                    if series_name == name:
                                        series_name = 'index'
                                    strat_res.append(r.reset_index()
                                                     .rename(columns={series_name: name, 'index': series_name})
                                                     .assign(**strata_dict)
                                                     .set_index(strata_cols + [series_name])[name])
                                res = pd.concat(strat_res).sort_index()
                                if len(ensure_list(self._handycolumns)) > 1:
                                    try:
                                        res = res.astype(np.float64)
                                        res = res.to_frame().reset_index().pivot_table(values=name,
                                                                                       index=strata_cols,
                                                                                       columns=series_name)
                                        res.columns.name = ''
                                    except ValueError:
                                        pass
                            elif isinstance(res[0], np.ndarray):
                                # TODO: TEST
                                strat_res = []
                                for r, s in zip(res, strata):
                                    strata_dict = dict([(k if isinstance(k, str) else k.colname, v) for k, v in s.items()])
                                    strat_res.append(pd.DataFrame(r, columns=[name])
                                                     .assign(**strata_dict)
                                                     .set_index(strata_cols)[name])
                                res = pd.concat(strat_res).sort_index()
                            elif isinstance(res[0], Axes):
                                res, axs = self._handy._strata_plot
                                res = consolidate_plots(res, axs, args[0], self._clauses)
                            elif isinstance(res[0], list):
                                joined_list = res[0]
                                for l in res[1:]:
                                    joined_list += l
                                return joined_list
                            elif len(res) == len(self._combinations):
                                # TODO: TEST
                                strata_df = pd.DataFrame(strata)
                                strata_df.columns = strata_cols
                                res = (pd.concat([pd.DataFrame(res, columns=[name]), strata_df], axis=1)
                                       .set_index(strata_cols)
                                       .sort_index())
                        raised = False
                        return res
                    except HandyException as e:
                        raise HandyException(str(e), summary=False)
                    except Exception as e:
                        raise HandyException(str(e), summary=True)
                    finally:
                        if not raised:
                            if isinstance(res, HandyFrame):
                                res._handy._clear_stratification()

                        self._handy._clear_stratification()
                        self._df._strat_handy = None
                        self._df._strat_index = None

                        if len(self._temp_colnames):
                            self._df = self._df.drop(*self._temp_colnames)
                            self._handy._df = self._df
                return wrapper
            else:
                raise e


================================================
FILE: handyspark/sql/datetime.py
================================================
from handyspark.sql.transform import HandyTransform
import pandas as pd

class HandyDatetime(object):
    __supported = {'boolean': ['is_leap_year', 'is_month_end', 'is_month_start', 'is_quarter_end', 'is_quarter_start',
                            'is_year_end', 'is_year_start'],
                   'string': ['strftime', 'tz', 'weekday_name'],
                   'integer': ['day', 'dayofweek', 'dayofyear', 'days_in_month', 'daysinmonth', 'hour', 'microsecond',
                           'minute', 'month', 'nanosecond', 'quarter', 'second', 'week', 'weekday', 'weekofyear',
                           'year'],
                   'date': ['date'],
                   'timestamp': ['ceil', 'floor', 'round', 'normalize', 'time', 'tz_convert', 'tz_localize']}
    __unsupported = ['freq', 'to_period', 'to_pydatetime']
    __functions = ['strftime', 'ceil', 'floor', 'round', 'normalize', 'tz_convert', 'tz_localize']
    __available = sorted(__supported['boolean'] + __supported['string'] + __supported['integer'] + __supported['date'] +
                         __supported['timestamp'])
    __types = {n: t for t, v in __supported.items() for n in v}
    _colname = None

    def __init__(self, df, colname):
        self._df = df
        self._colname = colname
        if self._df.notHandy().select(colname).dtypes[0][1] != 'timestamp':
            raise AttributeError('Can only use .dt accessor with datetimelike values')

    def __getattribute__(self, name):
        try:
            attr = object.__getattribute__(self, name)
            return attr
        except AttributeError as e:
            if name in self.__available:
                if name in self.__functions:
                    def wrapper(*args, **kwargs):
                        return HandyTransform.gen_pandas_udf(f=lambda col: col.dt.__getattribute__(name)(**kwargs),
                                                             args=(self._colname,),
                                                             returnType=self.__types.get(name, 'string'))
                    wrapper.__doc__ = getattr(pd.Series.dt, name).__doc__
                    return wrapper
                else:
                    func = HandyTransform.gen_pandas_udf(f=lambda col: col.dt.__getattribute__(name),
                                                         args=(self._colname,),
                                                         returnType=self.__types.get(name, 'string'))
                    func.__doc__ = getattr(pd.Series.dt, name).__doc__
                    return func
            else:
                raise e


================================================
FILE: handyspark/sql/pandas.py
================================================
from handyspark.sql.datetime import HandyDatetime
from handyspark.sql.string import HandyString
from handyspark.sql.transform import HandyTransform
from handyspark.util import check_columns
import pandas as pd

class HandyPandas(object):
    __supported = {'boolean': ['between', 'between_time', 'isin', 'isna', 'isnull', 'notna', 'notnull'],
                   'same': ['abs', 'clip', 'clip_lower', 'clip_upper', 'replace', 'round', 'truncate',
                            'tz_convert', 'tz_localize']}
    __as_series = ['rank', 'interpolate', 'pct_change', 'bfill', 'cummax', 'cummin', 'cumprod', 'cumsum', 'diff',
                   'ffill', 'fillna', 'shift']
    __available = sorted(__supported['boolean'] + __supported['same'])
    __types = {n: t for t, v in __supported.items() for n in v}

    def __init__(self, df):
        self._df = df
        self._colname = None

    def __getitem__(self, *args):
        if isinstance(args[0], tuple):
            args = args[0]
        item = args[0]
        check_columns(self._df, item)
        self._colname = item
        return self

    @property
    def str(self):
        """Returns a class to access pandas-like string column based methods through pandas UDFs

        Available methods:
        - contains
        - startswith / endswitch
        - match
        - isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace
        - islower / isupper / istitle
        - replace
        - repeat
        - join
        - pad
        - slice / slice_replace
        - strip / lstrip / rstrip
        - wrap / center / ljust / rjust
        - translate
        - get
        - normalize
        - lower / upper / capitalize / swapcase / title
        - zfill
        - count
        - find / rfind
        - len
        """
        return HandyString(self._df, self._colname)

    @property
    def dt(self):
        """Returns a class to access pandas-like datetime column based methods through pandas UDFs

        Available methods:
        - is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start
        - strftime
        - tz / time / tz_convert / tz_localize
        - day / dayofweek / dayofyear / days_in_month / daysinmonth
        - hour / microsecond / minute / nanosecond / second
        - week / weekday / weekday_name
        - month / quarter / year / weekofyear
        - date
        - ceil / floor / round
        - normalize
        """
        return HandyDatetime(self._df, self._colname)

    def __getattribute__(self, name):
        try:
            attr = object.__getattribute__(self, name)
            return attr
        except AttributeError as e:
            if name in self.__available:
                def wrapper(*args, **kwargs):
                    returnType=self.__types.get(name, 'string')
                    if returnType == 'same':
                        returnType = self._df.notHandy().select(self._colname).dtypes[0][1]
                    return HandyTransform.gen_pandas_udf(f=lambda col: col.__getattribute__(name)(**kwargs),
                                                         args=(self._colname,),
                                                         returnType=returnType)
                if name not in ['str', 'dt']:
                    wrapper.__doc__ = getattr(pd.Series, name).__doc__
                return wrapper
            else:
                raise e


================================================
FILE: handyspark/sql/schema.py
================================================
import numpy as np
import datetime
from operator import itemgetter
from pyspark.sql.types import StructType

_mapping = {str: 'string',
            bool: 'boolean',
            int: 'integer',
            float: 'float',
            datetime.date: 'date',
            datetime.datetime: 'timestamp',
            np.bool: 'boolean',
            np.int8: 'byte',
            np.int16: 'short',
            np.int32: 'integer',
            np.int64: 'long',
            np.float32: 'float',
            np.float64: 'double',
            np.ndarray: 'array',
            object: 'string',
            list: 'array',
            tuple: 'array',
            dict: 'map'}

def generate_schema(columns, nullable_columns='all'):
    """
    Parameters
    ----------
    columns: dict of column names (keys) and types (values)
    nullables: list of nullable columns, optional, default is 'all'

    Returns
    -------
    schema: StructType
        Spark DataFrame schema corresponding to Python/numpy types.
    """
    columns = sorted(columns.items())
    colnames = list(map(itemgetter(0), columns))
    coltypes = list(map(itemgetter(1), columns))

    invalid_types = []
    new_types = []
    keys = list(map(itemgetter(0), list(_mapping.items())))
    for coltype in coltypes:
        if coltype not in keys:
            invalid_types.append(coltype)
        else:
            if coltype == np.dtype('O'):
                new_types.append(str)
            else:
                new_types.append(keys[keys.index(coltype)])
    assert len(invalid_types) == 0, "Invalid type(s) specified: {}".format(str(invalid_types))

    if nullable_columns == 'all':
        nullables = [True] * len(colnames)
    else:
        nullables = [col in nullable_columns for col in colnames]

    fields = [{"metadata": {}, "name": name, "nullable": nullable, "type": _mapping[typ]}
              for name, typ, nullable in zip(colnames, new_types, nullables)]
    return StructType.fromJson({"type": "struct", "fields": fields})


================================================
FILE: handyspark/sql/string.py
================================================
from handyspark.sql.transform import HandyTransform
import unicodedata
import pandas as pd

class HandyString(object):
    __supported = {'boolean': ['contains', 'startswith', 'endswith', 'match', 'isalpha', 'isnumeric', 'isalnum', 'isdigit',
                            'isdecimal', 'isspace', 'islower', 'isupper', 'istitle'],
                   'string': ['replace', 'repeat', 'join', 'pad', 'slice', 'slice_replace', 'strip', 'wrap', 'translate',
                           'get', 'center', 'ljust', 'rjust', 'zfill', 'lstrip', 'rstrip',
                           'normalize', 'lower', 'upper', 'title', 'capitalize', 'swapcase'],
                   'integer': ['count', 'find', 'len', 'rfind']}
    __unsupported = ['cat', 'extract', 'extractall', 'get_dummies', 'findall', 'index', 'split', 'rsplit', 'partition',
                     'rpartition', 'rindex', 'decode', 'encode']
    __available = sorted(__supported['boolean'] + __supported['string'] + __supported['integer'])
    __types = {n: t for t, v in __supported.items() for n in v}
    _colname = None

    def __init__(self, df, colname):
        self._df = df
        self._colname = colname

    @staticmethod
    def _remove_accents(input):
        return unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore').decode('unicode_escape')

    def remove_accents(self):
        return HandyTransform.gen_pandas_udf(f=lambda col: col.apply(HandyString._remove_accents),
                                             args=(self._colname,),
                                             returnType='string')

    def __getattribute__(self, name):
        try:
            attr = object.__getattribute__(self, name)
            return attr
        except AttributeError as e:
            if name in self.__available:
                def wrapper(*args, **kwargs):
                    return HandyTransform.gen_pandas_udf(f=lambda col: col.str.__getattribute__(name)(**kwargs),
                                                         args=(self._colname,),
                                                         returnType=self.__types.get(name, 'string'))
                wrapper.__doc__ = getattr(pd.Series.str, name).__doc__
                return wrapper
            else:
                raise e


================================================
FILE: handyspark/sql/transform.py
================================================
import datetime
import inspect
import numpy as np
from pyspark.sql import functions as F

_MAPPING = {'string': str,
            'date': datetime.date,
            'timestamp': datetime.datetime,
            'boolean': np.bool,
            'binary': np.byte,
            'byte': np.int8,
            'short': np.int16,
            'integer': np.int32,
            'long': np.int64,
            'float': np.float32,
            'double': np.float64,
            'array': np.ndarray,
            'map': dict}


class HandyTransform(object):
    _mapping = dict([(v.__name__, k) for k, v in  _MAPPING.items()])
    _mapping.update({'float': 'double', 'int': 'integer', 'list': 'array', 'bool': 'boolean'})

    @staticmethod
    def _get_return(sdf, f, args):
        returnType = None
        if args is None:
            args = f.__code__.co_varnames
        if len(args):
            returnType = sdf.select(args[0]).dtypes[0][1]
        return returnType

    @staticmethod
    def _signatureType(sig):
        returnType = None
        signatureType = str(sig.return_annotation)[7:]
        if '_empty' not in signatureType:
            returnType = signatureType
            types = returnType.replace(']', '').replace('[', ',').split(',')[:3]
            for returnType in types:
                assert returnType.lower().strip() in HandyTransform._mapping.keys(), "invalid returnType"
            types = list(map(lambda t: HandyTransform._mapping[t.lower().strip()], types))
            returnType = types[0]
            if len(types) > 1:
                returnType = '<'.join([returnType, ','.join(types[1:])])
                returnType += '>'
        return returnType

    @staticmethod
    def gen_pandas_udf(f, args=None, returnType=None):
        sig = inspect.signature(f)

        if args is None:
            args = tuple(sig.parameters.keys())
        assert isinstance(args, (list, tuple)), "args must be list or tuple"
        name = '{}{}'.format(f.__name__, str(args).replace("'", ""))

        if returnType is None:
            returnType = HandyTransform._signatureType(sig)

        try:
            import pyarrow
            @F.pandas_udf(returnType=returnType)
            def udf(*args):
                return f(*args)
        except:
            @F.udf(returnType=returnType)
            def udf(*args):
                return f(*args)

        return udf(*args).alias(name)

    @staticmethod
    def gen_grouped_pandas_udf(sdf, f, args=None, returnType=None):
        # TODO: test it properly!
        sig = inspect.signature(f)

        if args is None:
            args = tuple(sig.parameters.keys())
        assert isinstance(args, (list, tuple)), "args must be list or tuple"
        name = '{}{}'.format(f.__name__, str(f.__code__.co_varnames).replace("'", ""))

        if returnType is None:
            returnType = HandyTransform._signatureType(sig)

        schema = sdf.notHandy().select(*args).withColumn(name, F.lit(None).cast(returnType)).schema

        @F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
        def pudf(pdf):
            computed = pdf.apply(lambda row: f(*tuple(row[p] for p in f.__code__.co_varnames)), axis=1)
            return pdf.assign(__computed=computed).rename(columns={'__computed': name})

        return pudf

    @staticmethod
    def transform(sdf, f, name=None, args=None, returnType=None):
        if name is None:
            name = '{}{}'.format(f.__name__, str(f.__code__.co_varnames).replace("'", ""))
        if isinstance(f, tuple):
            f, returnType = f
        if returnType is None:
            returnType = HandyTransform._get_return(sdf, f, args)
        return sdf.withColumn(name, HandyTransform.gen_pandas_udf(f, args, returnType))

    @staticmethod
    def apply(sdf, f, name=None, args=None, returnType=None):
        if name is None:
            name = '{}{}'.format(f.__name__, str(f.__code__.co_varnames).replace("'", ""))
        if isinstance(f, tuple):
            f, returnType = f
        if returnType is None:
            returnType = HandyTransform._get_return(sdf, f, args)
        return sdf.select(HandyTransform.gen_pandas_udf(f, args, returnType).alias(name))

    @staticmethod
    def assign(sdf, **kwargs):
        for c, f in kwargs.items():
            typename = None
            if isinstance(f, tuple):
                f, typename = f
            if callable(f):
                if typename is None:
                    typename = HandyTransform._get_return(sdf, f, None)
                if typename is not None:
                    sdf = sdf.transform(f, name=c, returnType=typename)
                else:
                    sdf = sdf.withColumn(c, F.lit(f()))
            else:
                sdf = sdf.withColumn(c, F.lit(f))
        return sdf


================================================
FILE: handyspark/stats.py
================================================
import numpy as np
from handyspark.util import check_columns, ensure_list
from pyspark.mllib.common import _py2java
from pyspark.mllib.stat.test import KolmogorovSmirnovTestResult

def StatisticalSummaryValues(sdf, colnames):
    """Builds a Java StatisticalSummaryValues object for each column
    """
    colnames = ensure_list(colnames)
    check_columns(sdf, colnames)

    jvm = sdf._sc._jvm
    summ = sdf.notHandy().select(colnames).describe().toPandas().set_index('summary')
    ssvs = {}
    for colname in colnames:
        values = list(map(float, summ[colname].values))
        values = values[1], np.sqrt(values[2]), int(values[0]), values[4], values[3], values[0] * values[1]
        java_class = jvm.org.apache.commons.math3.stat.descriptive.StatisticalSummaryValues
        ssvs.update({colname: java_class(*values)})
    return ssvs

def tTest(jvm, *ssvs):
    """Performs a t-Test for difference of means using StatisticalSummaryValues objects
    """
    n = len(ssvs)
    res = np.identity(n)
    java_class = jvm.org.apache.commons.math3.stat.inference.TTest
    java_obj = java_class()
    for i in range(n):
        for j in range(i + 1, n):
            pvalue = java_obj.tTest(ssvs[i], ssvs[j])
            res[i, j] = pvalue
            res[j, i] = pvalue
    return res

def KolmogorovSmirnovTest(sdf, colname, dist='normal', *params):
    """Performs a KolmogorovSmirnov test for comparing the distribution of values in a column
    to a named canonical distribution.
    """
    check_columns(sdf, colname)
    # Supported distributions
    _distributions = ['Beta', 'Cauchy', 'ChiSquared', 'Exponential', ' F', 'Gamma', 'Gumbel', 'Laplace', 'Levy',
                      'Logistic', 'LogNormal', 'Nakagami', 'Normal', 'Pareto', 'T', 'Triangular', 'Uniform', 'Weibull']
    _distlower = list(map(lambda v: v.lower(), _distributions))
    try:
        dist = _distributions[_distlower.index(dist)]
        # the actual name for the Uniform distribution is UniformReal
        if dist == 'Uniform':
            dist += 'Real'
    except ValueError:
        # If we cannot find a distribution, fall back to Normal
        dist = 'Normal'
        params = (0., 1.)
    jvm = sdf._sc._jvm
    # Maps the DF column into a numeric RDD and turns it into Java RDD
    rdd = sdf.notHandy().select(colname).rdd.map(lambda t: t[0])
    jrdd = _py2java(sdf._sc, rdd)
    # Gets the Java class of the corresponding distribution and creates an obj
    java_class = getattr(jvm, 'org.apache.commons.math3.distribution.{}Distribution'.format(dist))
    java_obj = java_class(*params)
    # Loads the KS test class and performs the test
    ks = jvm.org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest
    res = ks.testOneSample(jrdd.rdd(), java_obj)
    return KolmogorovSmirnovTestResult(res)


================================================
FILE: handyspark/util.py
================================================
from math import isnan, isinf
import pandas as pd
from pyspark.ml.linalg import DenseVector
from pyspark.rdd import RDD
from pyspark.sql import functions as F, DataFrame, Row
from pyspark.sql.types import ArrayType, DoubleType, StructType, StructField
from pyspark.mllib.common import _java2py, _py2java
import traceback

def none2default(value, default):
    return value if value is not None else default

def none2zero(value):
    return none2default(value, 0)

def ensure_list(value):
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return value
    else:
        return [value]

def check_columns(df, colnames):
    if colnames is not None:
        available = df.columns
        colnames = ensure_list(colnames)
        colnames = [col if isinstance(col, str) else col.colname for col in colnames]
        diff = set(colnames).difference(set(available))
        assert not len(diff), "DataFrame does not have {} column(s)".format(str(list(diff))[1:-1])

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

class HandyException(Exception):
    def __init__(self, *args, **kwargs):
        try:
            # Summary is a boolean argument
            # If True, it prints the exception summary
            # This way, we can avoid printing the summary all
            # the way along the exception "bubbling up"
            summary = kwargs['summary']
            if summary:
                print(HandyException.exception_summary())
        except KeyError:
            pass

    @staticmethod
    def colortext(text, color_code):
        return color_code + text + (bcolors.ENDC if text[-4:] != bcolors.ENDC else '')

    @staticmethod
    def errortext(text):
        # Makes exception summary both BOLD and RED (FAIL)
        return HandyException.colortext(HandyException.colortext(text, bcolors.FAIL), bcolors.BOLD)

    @staticmethod
    def exception_summary():
        # Gets the error stack
        msg = traceback.format_exc()
        try:
            # Builds the "frame" around the text
            top = HandyException.errortext('-' * 75 + '\nHANDY EXCEPTION SUMMARY\n')
            bottom = HandyException.errortext('-' * 75)
            # Gets the information about the error and makes it BOLD and RED
            info = list(filter(lambda t: len(t) and t[0] != '\t', msg.split('\n')[::-1]))
            error = HandyException.errortext('Error\t: {}'.format(info[0]))
            # Figure out where the error happened - location (file/notebook), line and function
            idx = [t.strip()[:4] for t in info].index('File')
            where = [v.strip() for v in info[idx].strip().split(',')]
            location, line, func = where[0][5:], where[1][5:], where[2][3:]
            # If it is a notebook, figures out the cell
            if 'ipython-input' in location:
                location = 'IPython - In [{}]'.format(location.split('-')[2])
            # If it is a pyspark error, just go with it
            if 'pyspark' in error:
                new_msg = '\n{}\n{}\n{}'.format(top, error, bottom)
            # Otherwise, build the summary
            else:
                new_msg = '\n{}\nLocation: {}\nLine\t: {}\nFunction: {}\n{}\n{}'.format(top, location, line, func, error, bottom)
            return new_msg
        except Exception as e:
            # If we managed to raise an exception while trying to format the original exception...
            # Oh, well...
            return 'This is awkward... \n{}'.format(str(e))

def get_buckets(rdd, buckets):
    """Extracted from pyspark.rdd.RDD.histogram function
    """
    if buckets < 1:
        raise ValueError("number of buckets must be >= 1")

    # filter out non-comparable elements
    def comparable(x):
        if x is None:
            return False
        if type(x) is float and isnan(x):
            return False
        return True

    filtered = rdd.filter(comparable)

    # faster than stats()
    def minmax(a, b):
        return min(a[0], b[0]), max(a[1], b[1])
    try:
        minv, maxv = filtered.map(lambda x: (x, x)).reduce(minmax)
    except TypeError as e:
        if " empty " in str(e):
            raise ValueError("can not generate buckets from empty RDD")
        raise

    if minv == maxv or buckets == 1:
        return [minv, maxv], [filtered.count()]

    try:
        inc = (maxv - minv) / buckets
    except TypeError:
        raise TypeError("Can not generate buckets with non-number in RDD")

    if isinf(inc):
        raise ValueError("Can not generate buckets with infinite value")

    # keep them as integer if possible
    inc = int(inc)
    if inc * buckets != maxv - minv:
        inc = (maxv - minv) * 1.0 / buckets

    buckets = [i * inc + minv for i in range(buckets)]
    buckets.append(maxv)  # fix accumulated error
    return buckets

def dense_to_array(sdf, colname, new_colname):
    """Casts a Vector column into a new Array column.
    """
    # Gets type of original column
    coltype = sdf.notHandy().select(colname).dtypes[0][1]
    # If it is indeed a vector...
    if coltype == 'vector':
        newrow = Row(*sdf.columns, new_colname)
        res = sdf.rdd.map(lambda row: newrow(*row, row[colname].values.tolist())).toDF(sdf.columns + [new_colname])
    # Otherwise just copy the original column into a new one
    else:
        res = sdf.withColumn(new_colname, F.col(colname))

    # Makes it a HandyFrame
    if isinstance(res, DataFrame):
        res = res.toHandy()
    return res

def disassemble(sdf, colname, new_colnames=None):
    """Disassembles a Vector/Array column into multiple columns
    """
    array_col = '_{}'.format(colname)
    # Gets type of original column
    coltype = sdf.notHandy().select(colname).schema.fields[0].dataType.typeName()
    # If it is a vector or array...
    if coltype in ['vectorudt', 'array']:
        # Makes the conversion from vector to array (or not :-))
        tdf = dense_to_array(sdf, colname, array_col)
        # Checks the MIN size of the arrays in the dataset
        # If there are arrays with multiple sizes, it can still safely
        # convert up to that size
        size = tdf.notHandy().select(F.min(F.size(array_col))).take(1)[0][0]
        # If no new names were given, just uses the original name and
        # a sequence number as suffix
        if new_colnames is None:
            new_colnames = ['{}_{}'.format(colname, i) for i in range(size)]
        assert len(new_colnames) == size, \
            "There must be {} column names, only {} found!".format(size, len(new_colnames))
        # Uses `getItem` to disassemble the array into multiple columns
        res = tdf.select(*sdf.columns,
                         *(F.col(array_col).getItem(i).alias(n) for i, n in zip(range(size), new_colnames)))
    # Otherwise just copy the original column into a new one
    else:
        if new_colnames is None:
            new_colnames = [colname]
        res = sdf.withColumn(new_colnames[0], F.col(colname))

    # Makes it a HandyFrame
    if isinstance(res, DataFrame):
        res = res.toHandy()
    return res

def get_jvm_class(cl):
    """Builds JVM class name from Python class
    """
    return 'org.apache.{}.{}'.format(cl.__module__[2:], cl.__name__)

def call_scala_method(py_class, scala_method, df, *args):
    """Given a Python class, calls a method from its Scala equivalent
    """
    sc = df.sql_ctx._sc
    # Gets the Java class from the JVM, given the name built from the Python class
    java_class = getattr(sc._jvm , get_jvm_class(py_class))
    # Converts all columns into doubles and access it as Java DF
    jdf = df.select(*(F.col(col).astype('double') for col in df.columns))._jdf
    # Creates a Java object from both Java class and DataFrame
    java_obj = java_class(jdf)
    # Converts remaining args from Python to Java as well
    args = [_py2java(sc, a) for a in args]
    # Gets method from Java Object and passes arguments to it to get results
    java_res = getattr(java_obj, scala_method)(*args)
    # Converts results from Java back to Python
    res = _java2py(sc, java_res)
    # If result is an RDD, it could be the case its elements are still
    # serialized tuples from Scala...
    if isinstance(res, RDD):
        try:
            # Takes the first element from the result, to check what it is
            first = res.take(1)[0]
            # If it is a dictionary, we need to check its value
            if isinstance(first, dict):
                first = list(first.values())[0]
                # If the value is a scala tuple, we need to deserialize it
                if first.startswith('scala.Tuple'):
                    serde = sc._jvm.org.apache.spark.mllib.api.python.SerDe
                    # We assume it is a Tuple2 and deserialize it
                    java_res = serde.fromTuple2RDD(java_res)
                    # Finally, we convert the deserialized result from Java to Python
                    res = _java2py(sc, java_res)
        except IndexError:
            pass
    return res

def counts_to_df(value_counts, colnames, n_points):
    """DO NOT USE IT!
    """
    pdf = pd.DataFrame(value_counts
                       .to_frame('count')
                       .reset_index()
                       .apply(lambda row: dict({'count': row['count']},
                                               **dict(zip(colnames, row['index'].toArray()))),
                              axis=1)
                       .values
                       .tolist())
    pdf['count'] /= pdf['count'].sum()
    proportions = pdf['count'] / pdf['count'].min()
    factor = int(n_points / proportions.sum())
    pdf = pd.concat([pdf[colnames], (proportions * factor).astype(int)], axis=1)
    combinations = pdf.apply(lambda row: row.to_dict(), axis=1).values.tolist()
    return pd.DataFrame([dict(v) for c in combinations for v in int(c.pop('count')) * [list(c.items())]])


================================================
FILE: notebooks/Exploring_Titanic.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HandySpark\n",
    "\n",
    "### Bringing pandas-like capabilities to Spark dataframes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNCOMMENT THIS IF YOU'RE USING GOOGLE COLAB!\n",
    "\n",
    "#!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
    "#!wget -q http://apache.osuosl.org/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz\n",
    "#!tar xf spark-2.3.3-bin-hadoop2.7.tgz\n",
    "#!pip install numpy==1.15\n",
    "#!pip install -q pandas==0.24.1\n",
    "#!pip install -q seaborn==0.9\n",
    "#!pip install -q pyspark==2.3.3\n",
    "#!pip install -q findspark\n",
    "#!pip install -q handyspark\n",
    "\n",
    "# AFTER RUNNING THIS CELL, YOU MUST RESTART THE RUNTIME TO USE UPDATED VERSIONS OF PACKAGES!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNCOMMENT THIS IF YOU'RE USING GOOGLE COLAB!\n",
    "\n",
    "#import os\n",
    "#os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
    "#os.environ[\"SPARK_HOME\"] = \"/content/spark-2.3.3-bin-hadoop2.7\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/dvgodoy/handyspark/master/tests/rawdata/train.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import findspark\n",
    "import pandas as pd\n",
    "from pyspark.sql import SparkSession\n",
    "from pyspark.sql import functions as F\n",
    "from handyspark import *\n",
    "from matplotlib import pyplot as plt\n",
    "# fixes issue with seaborn hiding fliers on boxplot\n",
    "import matplotlib as mpl\n",
    "mpl.rc(\"lines\", markeredgewidth=0.5)\n",
    "\n",
    "findspark.init()\n",
    "os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell'\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "spark = SparkSession.builder.getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Loading Data into a `HandyFrame`\n",
    "\n",
    "### After loading data as usual, just call method `toHandy()` (an extension to Spark's dataframe)!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "HandyFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sdf = spark.read.csv('train.csv', header=True, inferSchema=True)\n",
    "hdf = sdf.toHandy()\n",
    "hdf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fetching some data\n",
    "\n",
    "- using an instance of `cols` from your `HandyFrame`, you can retrieve values for given columns in the top N rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Single column will be returned as a pandas Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                              Braund, Mr. Owen Harris\n",
       "1    Cumings, Mrs. John Bradley (Florence Briggs Th...\n",
       "2                               Heikkinen, Miss. Laina\n",
       "3         Futrelle, Mrs. Jacques Heath (Lily May Peel)\n",
       "4                             Allen, Mr. William Henry\n",
       "Name: Name, dtype: object"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hdf.cols['Name'][:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Multiple columns will be returned as a pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Pclass</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Allen, Mr. William Henry</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                Name  Pclass\n",
       "0                            Braund, Mr. Owen Harris       3\n",
       "1  Cumings, Mrs. John Bradley (Florence Briggs Th...       1\n",
       "2                             Heikkinen, Miss. Laina       3\n",
       "3       Futrelle, Mrs. Jacques Heath (Lily May Peel)       1\n",
       "4                           Allen, Mr. William Henry       3"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hdf.cols[['Name', 'Pclass']][:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### You can also use `:` to get all columns!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PassengerId</th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>SibSp</th>\n",
       "      <th>Parch</th>\n",
       "      <th>Ticket</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Cabin</th>\n",
       "      <th>Embarked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Braund, Mr. Owen Harris</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>A/5 21171</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>PC 17599</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>C85</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Heikkinen, Miss. Laina</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>STON/O2. 3101282</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>None</td>\n",
       "      <td>S</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td

Download .txt

gitextract_4phs78pk/

├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── README.rst
├── docs/
│   ├── Makefile
│   └── source/
│       ├── conf.py
│       ├── handyspark.extensions.rst
│       ├── handyspark.ml.rst
│       ├── handyspark.rst
│       ├── handyspark.sql.rst
│       ├── includeme.rst
│       ├── index.rst
│       └── modules.rst
├── handyspark/
│   ├── __init__.py
│   ├── extensions/
│   │   ├── __init__.py
│   │   ├── common.py
│   │   ├── evaluation.py
│   │   └── types.py
│   ├── ml/
│   │   ├── __init__.py
│   │   └── base.py
│   ├── plot.py
│   ├── sql/
│   │   ├── __init__.py
│   │   ├── dataframe.py
│   │   ├── datetime.py
│   │   ├── pandas.py
│   │   ├── schema.py
│   │   ├── string.py
│   │   └── transform.py
│   ├── stats.py
│   └── util.py
├── notebooks/
│   └── Exploring_Titanic.ipynb
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests/
    ├── handyspark/
    │   ├── conftest.py
    │   ├── extensions/
    │   │   ├── test_evaluation.py
    │   │   └── test_types.py
    │   ├── ml/
    │   │   └── test_base.py
    │   ├── sql/
    │   │   ├── test_dataframe.py
    │   │   ├── test_datetime.py
    │   │   ├── test_pandas.py
    │   │   ├── test_schema.py
    │   │   ├── test_string.py
    │   │   └── test_transform.py
    │   ├── test_plot.py
    │   ├── test_stats.py
    │   └── test_util.py
    └── rawdata/
        └── train.csv

Download .txt

SYMBOL INDEX (344 symbols across 27 files)

FILE: handyspark/extensions/common.py
  function call2 (line 3) | def call2(self, name, *a):

FILE: handyspark/extensions/evaluation.py
  function thresholds (line 8) | def thresholds(self):
  function roc (line 14) | def roc(self):
  function pr (line 25) | def pr(self):
  function fMeasureByThreshold (line 36) | def fMeasureByThreshold(self, beta=1.0):
  function precisionByThreshold (line 46) | def precisionByThreshold(self):
  function recallByThreshold (line 53) | def recallByThreshold(self):
  function getMetricsByThreshold (line 60) | def getMetricsByThreshold(self):
  function confusionMatrix (line 77) | def confusionMatrix(self, threshold=0.5):
  function print_confusion_matrix (line 97) | def print_confusion_matrix(self, threshold=0.5):
  function plot_roc_curve (line 118) | def plot_roc_curve(self, ax=None):
  function plot_pr_curve (line 128) | def plot_pr_curve(self, ax=None):
  function __init__ (line 138) | def __init__(self, scoreAndLabels, scoreCol='score', labelCol='label'):

FILE: handyspark/extensions/types.py
  function ret (line 4) | def ret(cls, expr):
  function ret (line 11) | def ret(self, expr):

FILE: handyspark/ml/base.py
  class HandyTransformers (line 7) | class HandyTransformers(object):
    method __init__ (line 16) | def __init__(self, df):
    method imputer (line 20) | def imputer(self):
    method fencer (line 27) | def fencer(self):
  class HasDict (line 35) | class HasDict(Params):
    method __init__ (line 42) | def __init__(self):
    method setDictValues (line 46) | def setDictValues(self, value):
    method getDictValues (line 54) | def getDictValues(self):
  class HandyImputer (line 62) | class HandyImputer(Transformer, HasDict, DefaultParamsReadable, DefaultP...
    method _transform (line 71) | def _transform(self, dataset):
    method statistics (line 105) | def statistics(self):
  class HandyFencer (line 109) | class HandyFencer(Transformer, HasDict, DefaultParamsReadable, DefaultPa...
    method _transform (line 118) | def _transform(self, dataset):
    method fences (line 155) | def fences(self):

FILE: handyspark/plot.py
  function title_fom_clause (line 15) | def title_fom_clause(clause):
  function consolidate_plots (line 18) | def consolidate_plots(fig, axs, title, clauses):
  function plot_correlations (line 44) | def plot_correlations(pdf, ax=None):
  function strat_scatterplot (line 50) | def strat_scatterplot(sdf, col1, col2, n=30):
  function scatterplot (line 64) | def scatterplot(sdf, col1, col2, n=30, ax=None):
  function strat_histogram (line 111) | def strat_histogram(sdf, colname, bins=10, categorical=False):
  function histogram (line 150) | def histogram(sdf, colname, bins=10, categorical=False, ax=None):
  function _gen_dict (line 186) | def _gen_dict(rc_name, properties):
  function draw_boxplot (line 196) | def draw_boxplot(ax, stats):
  function boxplot (line 223) | def boxplot(sdf, colnames, ax=None, showfliers=True, k=1.5, precision=.0...
  function post_boxplot (line 257) | def post_boxplot(axs, stats):
  function roc_curve (line 264) | def roc_curve(fpr, tpr, roc_auc, ax=None):
  function pr_curve (line 278) | def pr_curve(precision, recall, pr_auc, ax=None):

FILE: handyspark/sql/dataframe.py
  function toHandy (line 25) | def toHandy(self):
  function notHandy (line 30) | def notHandy(self):
  function agg (line 36) | def agg(f):
  function inccol (line 40) | def inccol(f):
  class Handy (line 44) | class Handy(object):
    method __init__ (line 45) | def __init__(self, df):
    method __deepcopy__ (line 69) | def __deepcopy__(self, memo):
    method __getitem__ (line 78) | def __getitem__(self, *args):
    method stages (line 113) | def stages(self):
    method statistics_ (line 119) | def statistics_(self):
    method fences_ (line 123) | def fences_(self):
    method is_classification (line 127) | def is_classification(self):
    method classes (line 131) | def classes(self):
    method nclasses (line 135) | def nclasses(self):
    method response (line 139) | def response(self):
    method ncols (line 143) | def ncols(self):
    method nrows (line 147) | def nrows(self):
    method shape (line 151) | def shape(self):
    method strata (line 155) | def strata(self):
    method strata_colnames (line 160) | def strata_colnames(self):
    method _stratify (line 166) | def _stratify(self, strata):
    method _clear_stratification (line 169) | def _clear_stratification(self):
    method _set_stratification (line 180) | def _set_stratification(self, strata, raw_combinations, raw_clauses, c...
    method _build_strat_plot (line 194) | def _build_strat_plot(self, n_rows, n_cols, **kwargs):
    method _update_types (line 202) | def _update_types(self):
    method _take_array (line 213) | def _take_array(self, colname, n):
    method _value_counts (line 225) | def _value_counts(self, colnames, dropna=True, raw=False):
    method _fillna (line 246) | def _fillna(self, target, values):
    method __stat_to_dict (line 267) | def __stat_to_dict(self, colname, stat):
    method _fill_values (line 276) | def _fill_values(self, continuous, categorical, strategy):
    method __fill_self (line 288) | def __fill_self(self, continuous, categorical, strategy):
    method _dense_to_array (line 310) | def _dense_to_array(self, colname, array_colname):
    method _agg (line 315) | def _agg(self, name, func, colnames):
    method _calc_fences (line 332) | def _calc_fences(self, colnames, k=1.5, precision=.01):
    method _calc_mahalanobis_distance (line 353) | def _calc_mahalanobis_distance(self, colnames, output_col='__mahalanob...
    method _set_mahalanobis_outliers (line 390) | def _set_mahalanobis_outliers(self, colnames, critical_value=.999,
    method _calc_bxp_stats (line 402) | def _calc_bxp_stats(self, fences_df, colname, showfliers=False):
    method set_response (line 475) | def set_response(self, colname):
    method disassemble (line 486) | def disassemble(self, colname, new_colnames=None):
    method to_metrics_RDD (line 491) | def to_metrics_RDD(self, prob_col, label):
    method corr (line 495) | def corr(self, colnames=None, method='pearson'):
    method fill (line 507) | def fill(self, *args, continuous=None, categorical=None, strategy=None):
    method isnull (line 514) | def isnull(self, ratio=False):
    method nunique (line 537) | def nunique(self, colnames=None):
    method outliers (line 544) | def outliers(self, colnames=None, ratio=False, method='tukey', **kwargs):
    method get_outliers (line 578) | def get_outliers(self, colnames=None, critical_value=.999):
    method remove_outliers (line 588) | def remove_outliers(self, colnames=None, critical_value=.999):
    method fence (line 598) | def fence(self, colnames, k=1.5):
    method value_counts (line 634) | def value_counts(self, colnames, dropna=True):
    method mode (line 638) | def mode(self, colname):
    method entropy (line 659) | def entropy(self, colnames):
    method mutual_info (line 688) | def mutual_info(self, colnames):
    method mean (line 737) | def mean(self, colnames):
    method min (line 741) | def min(self, colnames):
    method max (line 745) | def max(self, colnames):
    method percentile (line 749) | def percentile(self, colnames, perc=50, precision=.01):
    method median (line 759) | def median(self, colnames, precision=.01):
    method stddev (line 763) | def stddev(self, colnames):
    method var (line 767) | def var(self, colnames):
    method q1 (line 771) | def q1(self, colnames, precision=.01):
    method q3 (line 775) | def q3(self, colnames, precision=.01):
    method _strat_boxplot (line 779) | def _strat_boxplot(self, colnames, **kwargs):
    method boxplot (line 794) | def boxplot(self, colnames, ax=None, showfliers=True, k=1.5, precision...
    method _post_boxplot (line 801) | def _post_boxplot(self, res):
    method _strat_scatterplot (line 805) | def _strat_scatterplot(self, colnames, **kwargs):
    method scatterplot (line 810) | def scatterplot(self, colnames, ax=None, **kwargs):
    method _strat_hist (line 818) | def _strat_hist(self, colname, bins=10, **kwargs):
    method hist (line 830) | def hist(self, colname, bins=10, ax=None, **kwargs):
  class HandyGrouped (line 841) | class HandyGrouped(GroupedData):
    method __init__ (line 842) | def __init__(self, jgd, df, *args):
    method agg (line 848) | def agg(self, *exprs):
    method __repr__ (line 854) | def __repr__(self):
  class HandyFrame (line 858) | class HandyFrame(DataFrame):
    method __init__ (line 914) | def __init__(self, df, handy=None):
    method __getattribute__ (line 929) | def __getattribute__(self, name):
    method __repr__ (line 951) | def __repr__(self):
    method _get_strata (line 954) | def _get_strata(self):
    method _gen_row_ids (line 973) | def _gen_row_ids(self, *args):
    method _loc (line 981) | def _loc(self, lower_bound, upper_bound):
    method cols (line 988) | def cols(self):
    method pandas (line 1009) | def pandas(self):
    method transformers (line 1026) | def transformers(self):
    method stages (line 1036) | def stages(self):
    method response (line 1042) | def response(self):
    method is_classification (line 1048) | def is_classification(self):
    method classes (line 1054) | def classes(self):
    method nclasses (line 1060) | def nclasses(self):
    method ncols (line 1066) | def ncols(self):
    method nrows (line 1072) | def nrows(self):
    method shape (line 1078) | def shape(self):
    method statistics_ (line 1084) | def statistics_(self):
    method fences_ (line 1091) | def fences_(self):
    method values (line 1098) | def values(self):
    method notHandy (line 1107) | def notHandy(self):
    method set_safety_limit (line 1112) | def set_safety_limit(self, limit):
    method safety_off (line 1118) | def safety_off(self):
    method collect (line 1125) | def collect(self):
    method take (line 1144) | def take(self, num):
    method stratify (line 1152) | def stratify(self, strata):
    method transform (line 1163) | def transform(self, f, name=None, args=None, returnType=None):
    method apply (line 1168) | def apply(self, f, name=None, args=None, returnType=None):
    method assign (line 1173) | def assign(self, **kwargs):
    method isnull (line 1195) | def isnull(self, ratio=False):
    method nunique (line 1210) | def nunique(self):
    method outliers (line 1225) | def outliers(self, ratio=False, method='tukey', **kwargs):
    method get_outliers (line 1244) | def get_outliers(self, colnames=None, critical_value=.999):
    method remove_outliers (line 1260) | def remove_outliers(self, colnames=None, critical_value=.999):
    method set_response (line 1276) | def set_response(self, colname):
    method fill (line 1291) | def fill(self, *args, categorical=None, continuous=None, strategy=None):
    method fence (line 1319) | def fence(self, colnames, k=1.5):
    method disassemble (line 1343) | def disassemble(self, colname, new_colnames=None):
    method to_metrics_RDD (line 1364) | def to_metrics_RDD(self, prob_col='probability', label_col='label'):
  class Bucket (line 1385) | class Bucket(object):
    method __init__ (line 1401) | def __init__(self, colname, bins=5):
    method __repr__ (line 1407) | def __repr__(self):
    method colname (line 1411) | def colname(self):
    method _get_buckets (line 1414) | def _get_buckets(self, df):
    method _get_clauses (line 1425) | def _get_clauses(self, buckets):
  class Quantile (line 1436) | class Quantile(Bucket):
    method __repr__ (line 1452) | def __repr__(self):
    method _get_buckets (line 1455) | def _get_buckets(self, df):
  class HandyColumns (line 1465) | class HandyColumns(object):
    method __init__ (line 1481) | def __init__(self, df, handy, strata=None):
    method __getitem__ (line 1492) | def __getitem__(self, *args):
    method __repr__ (line 1554) | def __repr__(self):
    method numerical (line 1559) | def numerical(self):
    method categorical (line 1565) | def categorical(self):
    method continuous (line 1571) | def continuous(self):
    method string (line 1577) | def string(self):
    method array (line 1583) | def array(self):
    method mean (line 1588) | def mean(self):
    method min (line 1591) | def min(self):
    method max (line 1594) | def max(self):
    method median (line 1597) | def median(self, precision=.01):
    method stddev (line 1607) | def stddev(self):
    method var (line 1610) | def var(self):
    method percentile (line 1613) | def percentile(self, perc, precision=.01):
    method q1 (line 1625) | def q1(self, precision=.01):
    method q3 (line 1635) | def q3(self, precision=.01):
    method _value_counts (line 1645) | def _value_counts(self, dropna=True, raw=True):
    method value_counts (line 1649) | def value_counts(self, dropna=True):
    method entropy (line 1669) | def entropy(self):
    method mutual_info (line 1678) | def mutual_info(self):
    method mode (line 1688) | def mode(self):
    method corr (line 1702) | def corr(self, method='pearson'):
    method nunique (line 1718) | def nunique(self):
    method outliers (line 1732) | def outliers(self, ratio=False, method='tukey', **kwargs):
    method get_outliers (line 1751) | def get_outliers(self, critical_value=.999):
    method remove_outliers (line 1764) | def remove_outliers(self, critical_value=.999):
    method hist (line 1777) | def hist(self, bins=10, ax=None):
    method boxplot (line 1788) | def boxplot(self, ax=None, showfliers=True, k=1.5, precision=.01):
    method scatterplot (line 1802) | def scatterplot(self, ax=None):
  class HandyStrata (line 1812) | class HandyStrata(object):
    method __init__ (line 1820) | def __init__(self, handy, strata):
    method __repr__ (line 1873) | def __repr__(self):
    method __getattribute__ (line 1880) | def __getattribute__(self, name):

FILE: handyspark/sql/datetime.py
  class HandyDatetime (line 4) | class HandyDatetime(object):
    method __init__ (line 20) | def __init__(self, df, colname):
    method __getattribute__ (line 26) | def __getattribute__(self, name):

FILE: handyspark/sql/pandas.py
  class HandyPandas (line 7) | class HandyPandas(object):
    method __init__ (line 16) | def __init__(self, df):
    method __getitem__ (line 20) | def __getitem__(self, *args):
    method str (line 29) | def str(self):
    method dt (line 57) | def dt(self):
    method __getattribute__ (line 74) | def __getattribute__(self, name):

FILE: handyspark/sql/schema.py
  function generate_schema (line 25) | def generate_schema(columns, nullable_columns='all'):

FILE: handyspark/sql/string.py
  class HandyString (line 5) | class HandyString(object):
    method __init__ (line 18) | def __init__(self, df, colname):
    method _remove_accents (line 23) | def _remove_accents(input):
    method remove_accents (line 26) | def remove_accents(self):
    method __getattribute__ (line 31) | def __getattribute__(self, name):

FILE: handyspark/sql/transform.py
  class HandyTransform (line 21) | class HandyTransform(object):
    method _get_return (line 26) | def _get_return(sdf, f, args):
    method _signatureType (line 35) | def _signatureType(sig):
    method gen_pandas_udf (line 51) | def gen_pandas_udf(f, args=None, returnType=None):
    method gen_grouped_pandas_udf (line 75) | def gen_grouped_pandas_udf(sdf, f, args=None, returnType=None):
    method transform (line 97) | def transform(sdf, f, name=None, args=None, returnType=None):
    method apply (line 107) | def apply(sdf, f, name=None, args=None, returnType=None):
    method assign (line 117) | def assign(sdf, **kwargs):

FILE: handyspark/stats.py
  function StatisticalSummaryValues (line 6) | def StatisticalSummaryValues(sdf, colnames):
  function tTest (line 22) | def tTest(jvm, *ssvs):
  function KolmogorovSmirnovTest (line 36) | def KolmogorovSmirnovTest(sdf, colname, dist='normal', *params):

FILE: handyspark/util.py
  function none2default (line 10) | def none2default(value, default):
  function none2zero (line 13) | def none2zero(value):
  function ensure_list (line 16) | def ensure_list(value):
  function check_columns (line 24) | def check_columns(df, colnames):
  class bcolors (line 32) | class bcolors:
  class HandyException (line 42) | class HandyException(Exception):
    method __init__ (line 43) | def __init__(self, *args, **kwargs):
    method colortext (line 56) | def colortext(text, color_code):
    method errortext (line 60) | def errortext(text):
    method exception_summary (line 65) | def exception_summary():
  function get_buckets (line 94) | def get_buckets(rdd, buckets):
  function dense_to_array (line 140) | def dense_to_array(sdf, colname, new_colname):
  function disassemble (line 158) | def disassemble(sdf, colname, new_colnames=None):
  function get_jvm_class (line 192) | def get_jvm_class(cl):
  function call_scala_method (line 197) | def call_scala_method(py_class, scala_method, df, *args):
  function counts_to_df (line 233) | def counts_to_df(value_counts, colnames, n_points):

FILE: setup.py
  function readme (line 3) | def readme():

FILE: tests/handyspark/conftest.py
  function sdf (line 17) | def sdf():
  function sdates (line 21) | def sdates():
  function pdf (line 25) | def pdf():
  function pdates (line 30) | def pdates():
  function predicted (line 34) | def predicted():

FILE: tests/handyspark/extensions/test_evaluation.py
  function test_confusion_matrix (line 11) | def test_confusion_matrix(sdf):
  function test_get_metrics_by_threshold (line 30) | def test_get_metrics_by_threshold(sdf):

FILE: tests/handyspark/extensions/test_types.py
  function test_atomic_types (line 5) | def test_atomic_types():
  function test_composite_types (line 9) | def test_composite_types():

FILE: tests/handyspark/ml/test_base.py
  function test_imputer (line 7) | def test_imputer(sdf, pdf):
  function test_fencer (line 25) | def test_fencer(sdf, pdf):

FILE: tests/handyspark/sql/test_dataframe.py
  function test_to_from_handy (line 14) | def test_to_from_handy(sdf):
  function test_shape (line 20) | def test_shape(sdf):
  function test_response (line 23) | def test_response(sdf):
  function test_safety_limit (line 31) | def test_safety_limit(sdf):
  function test_safety_limit2 (line 49) | def test_safety_limit2(sdf):
  function test_values (line 64) | def test_values(sdf, pdf):
  function test_stages (line 70) | def test_stages(sdf):
  function test_value_counts (line 76) | def test_value_counts(sdf, pdf):
  function test_column_values (line 82) | def test_column_values(sdf, pdf):
  function test_dataframe_values (line 87) | def test_dataframe_values(sdf, pdf):
  function test_isnull (line 92) | def test_isnull(sdf, pdf):
  function test_nunique (line 101) | def test_nunique(sdf, pdf):
  function test_columns_nunique (line 108) | def test_columns_nunique(sdf, pdf):
  function test_outliers (line 114) | def test_outliers(sdf, pdf):
  function test_mean (line 129) | def test_mean(sdf, pdf):
  function test_stratified_mean (line 135) | def test_stratified_mean(sdf, pdf):
  function test_mode (line 141) | def test_mode(sdf, pdf):
  function test_median (line 154) | def test_median(sdf, pdf):
  function test_types (line 169) | def test_types(sdf):
  function test_fill_categorical (line 178) | def test_fill_categorical(sdf):
  function test_fill_continuous (line 184) | def test_fill_continuous(sdf, pdf):
  function test_sequential_fill (line 196) | def test_sequential_fill(sdf):
  function test_corr (line 204) | def test_corr(sdf, pdf):
  function test_stratified_corr (line 210) | def test_stratified_corr(sdf, pdf):
  function test_fence (line 216) | def test_fence(sdf, pdf):
  function test_stratified_fence (line 229) | def test_stratified_fence(sdf):
  function test_grouped_column_values (line 236) | def test_grouped_column_values(sdf, pdf):
  function test_bucket (line 242) | def test_bucket(sdf, pdf):
  function test_quantile (line 252) | def test_quantile(sdf, pdf):
  function test_stratify_length (line 262) | def test_stratify_length(sdf, pdf):
  function test_stratify_list (line 269) | def test_stratify_list(sdf, pdf):
  function test_stratify_pandas_df (line 277) | def test_stratify_pandas_df(sdf, pdf):
  function test_stratify_pandas_series (line 284) | def test_stratify_pandas_series(sdf, pdf):
  function test_stratify_spark_df (line 291) | def test_stratify_spark_df(sdf, pdf):
  function test_stratify_fill (line 298) | def test_stratify_fill(sdf, pdf):
  function test_repr (line 319) | def test_repr(sdf):
  function test_stratify_bucket (line 324) | def test_stratify_bucket(sdf):
  function test_stratified_nunique (line 334) | def test_stratified_nunique(sdf, pdf):
  function test_mahalanobis (line 340) | def test_mahalanobis(sdf, pdf):
  function test_entropy (line 350) | def test_entropy(sdf, pdf):
  function test_mutual_info (line 356) | def test_mutual_info(sdf, pdf):

FILE: tests/handyspark/sql/test_datetime.py
  function test_is_leap_year (line 4) | def test_is_leap_year(sdates, pdates):
  function test_strftime (line 11) | def test_strftime(sdates, pdates):
  function test_weekday_name (line 18) | def test_weekday_name(sdates, pdates):
  function test_round (line 25) | def test_round(sdates, pdates):

FILE: tests/handyspark/sql/test_pandas.py
  function test_between (line 5) | def test_between(sdf, pdf):
  function test_isin (line 12) | def test_isin(sdf, pdf):
  function test_isna (line 19) | def test_isna(sdf, pdf):
  function test_notna (line 26) | def test_notna(sdf, pdf):
  function test_clip (line 34) | def test_clip(sdf, pdf):
  function test_replace (line 41) | def test_replace(sdf, pdf):
  function test_round (line 48) | def test_round(sdf, pdf):

FILE: tests/handyspark/sql/test_schema.py
  function test_generate_schema (line 5) | def test_generate_schema(sdf):

FILE: tests/handyspark/sql/test_string.py
  function test_count (line 5) | def test_count(sdf, pdf):
  function test_find (line 12) | def test_find(sdf, pdf):
  function test_len (line 19) | def test_len(sdf, pdf):
  function test_rfind (line 26) | def test_rfind(sdf, pdf):
  function test_contains (line 34) | def test_contains(sdf, pdf):
  function test_startswith (line 41) | def test_startswith(sdf, pdf):
  function test_match (line 48) | def test_match(sdf, pdf):
  function test_isalpha (line 55) | def test_isalpha(sdf, pdf):
  function test_replace (line 63) | def test_replace(sdf, pdf):
  function test_repeat (line 70) | def test_repeat(sdf, pdf):
  function test_join (line 77) | def test_join(sdf, pdf):
  function test_pad (line 84) | def test_pad(sdf, pdf):
  function test_slice (line 91) | def test_slice(sdf, pdf):
  function test_slice_replace (line 98) | def test_slice_replace(sdf, pdf):
  function test_strip (line 105) | def test_strip(sdf, pdf):
  function test_wrap (line 112) | def test_wrap(sdf, pdf):
  function test_get (line 119) | def test_get(sdf, pdf):
  function test_center (line 126) | def test_center(sdf, pdf):
  function test_zfill (line 133) | def test_zfill(sdf, pdf):
  function test_normalize (line 140) | def test_normalize(sdf, pdf):
  function test_upper (line 147) | def test_upper(sdf, pdf):

FILE: tests/handyspark/sql/test_transform.py
  function test_apply_axis0 (line 5) | def test_apply_axis0(sdf, pdf):
  function test_apply_axis1 (line 15) | def test_apply_axis1(sdf, pdf):
  function test_transform_axis0 (line 28) | def test_transform_axis0(sdf, pdf):
  function test_transform_axis1 (line 38) | def test_transform_axis1(sdf, pdf):
  function test_assign_axis0 (line 51) | def test_assign_axis0(sdf, pdf):
  function test_assign_axis1 (line 58) | def test_assign_axis1(sdf, pdf):

FILE: tests/handyspark/test_plot.py
  function plot_to_base64 (line 10) | def plot_to_base64(fig):
  function plot_to_pixels (line 18) | def plot_to_pixels(fig, shape=None):
  function test_boxplot_single (line 30) | def test_boxplot_single(sdf, pdf):
  function test_boxplot_multiple (line 41) | def test_boxplot_multiple(sdf, pdf):
  function test_hist_categorical (line 57) | def test_hist_categorical(sdf, pdf):
  function test_hist_continuous (line 68) | def test_hist_continuous(sdf, pdf):
  function test_scatterplot (line 81) | def test_scatterplot(sdf, pdf):
  function test_stratified_boxplot (line 106) | def test_stratified_boxplot(sdf, pdf):
  function test_stratified_hist (line 123) | def test_stratified_hist(sdf, pdf):

FILE: tests/handyspark/test_stats.py
  function test_ks (line 5) | def test_ks(sdf):

FILE: tests/handyspark/test_util.py
  function test_dense_to_array (line 5) | def test_dense_to_array(sdf):
  function test_disassemble (line 12) | def test_disassemble(sdf):

Download .json

Condensed preview — 49 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (497K chars).

[
  {
    "path": ".gitignore",
    "chars": 1211,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": ".travis.yml",
    "chars": 886,
    "preview": "language: python\nsudo: required\ndist: trusty\ncache:\n  directories:\n    - $HOME/.ivy2\n    - $HOME/spark\n    - $HOME/.cach"
  },
  {
    "path": "LICENSE",
    "chars": 1075,
    "preview": "MIT License\n\nCopyright (c) 2018 Daniel Voigt Godoy\n\nPermission is hereby granted, free of charge, to any person obtainin"
  },
  {
    "path": "README.md",
    "chars": 21089,
    "preview": "[![Build Status](https://travis-ci.org/dvgodoy/handyspark.svg?branch=master)](https://travis-ci.org/dvgodoy/handyspark)\n"
  },
  {
    "path": "README.rst",
    "chars": 18434,
    "preview": "\n\n.. image:: https://travis-ci.org/dvgodoy/handyspark.svg?branch=master\n   :target: https://travis-ci.org/dvgodoy/handys"
  },
  {
    "path": "docs/Makefile",
    "chars": 629,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHI"
  },
  {
    "path": "docs/source/conf.py",
    "chars": 5929,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n#\n# HandySpark documentation build configuration file, created by\n# sphin"
  },
  {
    "path": "docs/source/handyspark.extensions.rst",
    "chars": 764,
    "preview": "handyspark\\.extensions package\n==============================\n\nSubmodules\n----------\n\nhandyspark\\.extensions\\.common mod"
  },
  {
    "path": "docs/source/handyspark.ml.rst",
    "chars": 341,
    "preview": "handyspark\\.ml package\n======================\n\nSubmodules\n----------\n\nhandyspark\\.ml\\.base module\n----------------------"
  },
  {
    "path": "docs/source/handyspark.rst",
    "chars": 703,
    "preview": "handyspark package\n==================\n\nSubpackages\n-----------\n\n.. toctree::\n\n    handyspark.extensions\n    handyspark.m"
  },
  {
    "path": "docs/source/handyspark.sql.rst",
    "chars": 1172,
    "preview": "handyspark\\.sql package\n=======================\n\nSubmodules\n----------\n\nhandyspark\\.sql\\.dataframe module\n--------------"
  },
  {
    "path": "docs/source/includeme.rst",
    "chars": 31,
    "preview": ".. include:: ../../README.rst\n\n"
  },
  {
    "path": "docs/source/index.rst",
    "chars": 440,
    "preview": ".. HandySpark documentation master file, created by\n   sphinx-quickstart on Sun Oct 28 17:42:51 2018.\n   You can adapt t"
  },
  {
    "path": "docs/source/modules.rst",
    "chars": 67,
    "preview": "handyspark\n==========\n\n.. toctree::\n   :maxdepth: 4\n\n   handyspark\n"
  },
  {
    "path": "handyspark/__init__.py",
    "chars": 224,
    "preview": "from handyspark.extensions.evaluation import BinaryClassificationMetrics\nfrom handyspark.sql import HandyFrame, Bucket, "
  },
  {
    "path": "handyspark/extensions/__init__.py",
    "chars": 231,
    "preview": "from handyspark.extensions.common import JavaModelWrapper\nfrom handyspark.extensions.evaluation import BinaryClassificat"
  },
  {
    "path": "handyspark/extensions/common.py",
    "chars": 590,
    "preview": "from pyspark.mllib.common import _java2py, _py2java, JavaModelWrapper\n\ndef call2(self, name, *a):\n    \"\"\"Another call me"
  },
  {
    "path": "handyspark/extensions/evaluation.py",
    "chars": 6115,
    "preview": "import pandas as pd\nfrom operator import itemgetter\nfrom handyspark.plot import roc_curve, pr_curve\nfrom pyspark.mllib.e"
  },
  {
    "path": "handyspark/extensions/types.py",
    "chars": 431,
    "preview": "from pyspark.sql.types import AtomicType, ArrayType, MapType\n\n@classmethod\ndef ret(cls, expr):\n    \"\"\"Assigns a return t"
  },
  {
    "path": "handyspark/ml/__init__.py",
    "chars": 105,
    "preview": "from handyspark.ml.base import HandyFencer, HandyImputer\n\n__all__ = [\n    'HandyFencer', 'HandyImputer'\n]"
  },
  {
    "path": "handyspark/ml/base.py",
    "chars": 6187,
    "preview": "import json\nfrom pyspark.ml.base import Transformer\nfrom pyspark.ml.util import DefaultParamsReadable, DefaultParamsWrit"
  },
  {
    "path": "handyspark/plot.py",
    "chars": 11255,
    "preview": "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\nfrom inspect import signatu"
  },
  {
    "path": "handyspark/sql/__init__.py",
    "chars": 199,
    "preview": "from handyspark.sql.dataframe import HandyFrame, Bucket, Quantile, DataFrame\nfrom handyspark.sql.schema import generate_"
  },
  {
    "path": "handyspark/sql/dataframe.py",
    "chars": 80200,
    "preview": "from copy import deepcopy\nfrom handyspark.ml.base import HandyTransformers\nfrom handyspark.plot import histogram, boxplo"
  },
  {
    "path": "handyspark/sql/datetime.py",
    "chars": 2601,
    "preview": "from handyspark.sql.transform import HandyTransform\nimport pandas as pd\n\nclass HandyDatetime(object):\n    __supported = "
  },
  {
    "path": "handyspark/sql/pandas.py",
    "chars": 3448,
    "preview": "from handyspark.sql.datetime import HandyDatetime\nfrom handyspark.sql.string import HandyString\nfrom handyspark.sql.tran"
  },
  {
    "path": "handyspark/sql/schema.py",
    "chars": 2010,
    "preview": "import numpy as np\nimport datetime\nfrom operator import itemgetter\nfrom pyspark.sql.types import StructType\n\n_mapping = "
  },
  {
    "path": "handyspark/sql/string.py",
    "chars": 2271,
    "preview": "from handyspark.sql.transform import HandyTransform\nimport unicodedata\nimport pandas as pd\n\nclass HandyString(object):\n "
  },
  {
    "path": "handyspark/sql/transform.py",
    "chars": 4784,
    "preview": "import datetime\nimport inspect\nimport numpy as np\nfrom pyspark.sql import functions as F\n\n_MAPPING = {'string': str,\n   "
  },
  {
    "path": "handyspark/stats.py",
    "chars": 2808,
    "preview": "import numpy as np\nfrom handyspark.util import check_columns, ensure_list\nfrom pyspark.mllib.common import _py2java\nfrom"
  },
  {
    "path": "handyspark/util.py",
    "chars": 10056,
    "preview": "from math import isnan, isinf\nimport pandas as pd\nfrom pyspark.ml.linalg import DenseVector\nfrom pyspark.rdd import RDD\n"
  },
  {
    "path": "notebooks/Exploring_Titanic.ipynb",
    "chars": 191317,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# HandySpark\\n\",\n    \"\\n\",\n    \"###"
  },
  {
    "path": "requirements.txt",
    "chars": 125,
    "preview": "numpy>=1.14\nscikit-learn>=0.20.0\npandas>=0.24\nmatplotlib>=2.2.3\nseaborn>=0.9\npyspark>=2.3\nscipy>=1.0\nfindspark\npyarrow>="
  },
  {
    "path": "setup.cfg",
    "chars": 39,
    "preview": "[metadata]\ndescription-file = README.md"
  },
  {
    "path": "setup.py",
    "chars": 1332,
    "preview": "from setuptools import setup, find_packages\n\ndef readme():\n    with open('README.md') as f:\n        return f.read()\n\nset"
  },
  {
    "path": "tests/handyspark/conftest.py",
    "chars": 1237,
    "preview": "import findspark\nimport os\nimport pandas as pd\nimport pytest\nfrom pyspark.sql import SparkSession\nfrom pyspark.ml.featur"
  },
  {
    "path": "tests/handyspark/extensions/test_evaluation.py",
    "chars": 2964,
    "preview": "import numpy as np\nimport numpy.testing as npt\nimport pandas as pd\nfrom handyspark import *\nfrom pyspark.ml.classificati"
  },
  {
    "path": "tests/handyspark/extensions/test_types.py",
    "chars": 452,
    "preview": "from handyspark import *\nimport numpy.testing as npt\nfrom pyspark.sql.types import IntegerType, StringType, ArrayType, M"
  },
  {
    "path": "tests/handyspark/ml/test_base.py",
    "chars": 1620,
    "preview": "import numpy as np\nimport numpy.testing as npt\nimport handyspark\nfrom operator import itemgetter\nfrom sklearn.preprocess"
  },
  {
    "path": "tests/handyspark/sql/test_dataframe.py",
    "chars": 13864,
    "preview": "import numpy as np\nimport numpy.testing as npt\nfrom handyspark import *\nimport pandas as pd\nfrom pyspark.sql import Data"
  },
  {
    "path": "tests/handyspark/sql/test_datetime.py",
    "chars": 1090,
    "preview": "import numpy.testing as npt\nfrom handyspark import *\n\ndef test_is_leap_year(sdates, pdates):\n    hdf = sdates.toHandy()\n"
  },
  {
    "path": "tests/handyspark/sql/test_pandas.py",
    "chars": 1755,
    "preview": "import numpy.testing as npt\nfrom handyspark import *\n\n# boolean returns\ndef test_between(sdf, pdf):\n    hdf = sdf.toHand"
  },
  {
    "path": "tests/handyspark/sql/test_schema.py",
    "chars": 434,
    "preview": "import numpy as np\nimport numpy.testing as npt\nfrom handyspark.sql import generate_schema\n\ndef test_generate_schema(sdf)"
  },
  {
    "path": "tests/handyspark/sql/test_string.py",
    "chars": 5202,
    "preview": "import numpy.testing as npt\nfrom handyspark import *\n\n# integer returns\ndef test_count(sdf, pdf):\n    hdf = sdf.toHandy("
  },
  {
    "path": "tests/handyspark/sql/test_transform.py",
    "chars": 3011,
    "preview": "import numpy.testing as npt\nfrom pyspark.sql.types import DoubleType, StringType\nfrom handyspark import *\n\ndef test_appl"
  },
  {
    "path": "tests/handyspark/test_plot.py",
    "chars": 4902,
    "preview": "import base64\nimport numpy.testing as npt\nimport numpy as np\nimport seaborn as sns\nfrom handyspark import *\nfrom handysp"
  },
  {
    "path": "tests/handyspark/test_stats.py",
    "chars": 864,
    "preview": "import numpy.testing as npt\nfrom handyspark.stats import KolmogorovSmirnovTest\nfrom pyspark.sql import functions as F\n\nd"
  },
  {
    "path": "tests/handyspark/test_util.py",
    "chars": 854,
    "preview": "import numpy.testing as npt\nfrom pyspark.ml.feature import VectorAssembler\nfrom handyspark.util import dense_to_array, d"
  },
  {
    "path": "tests/rawdata/train.csv",
    "chars": 61194,
    "preview": "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\r\n1,0,3,\"Braund, Mr. Owen Harris\",male,22"
  }
]

About this extraction

This page contains the full source code of the dvgodoy/handyspark GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 49 files (467.3 KB), approximately 201.5k tokens, and a symbol index with 344 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo