Repository: dvgodoy/handyspark
Branch: master
Commit: 0fb4c8707b34
Files: 49
Total size: 467.3 KB
Directory structure:
gitextract_4phs78pk/
├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── README.rst
├── docs/
│   ├── Makefile
│   └── source/
│       ├── conf.py
│       ├── handyspark.extensions.rst
│       ├── handyspark.ml.rst
│       ├── handyspark.rst
│       ├── handyspark.sql.rst
│       ├── includeme.rst
│       ├── index.rst
│       └── modules.rst
├── handyspark/
│   ├── __init__.py
│   ├── extensions/
│   │   ├── __init__.py
│   │   ├── common.py
│   │   ├── evaluation.py
│   │   └── types.py
│   ├── ml/
│   │   ├── __init__.py
│   │   └── base.py
│   ├── plot.py
│   ├── sql/
│   │   ├── __init__.py
│   │   ├── dataframe.py
│   │   ├── datetime.py
│   │   ├── pandas.py
│   │   ├── schema.py
│   │   ├── string.py
│   │   └── transform.py
│   ├── stats.py
│   └── util.py
├── notebooks/
│   └── Exploring_Titanic.ipynb
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests/
    ├── handyspark/
    │   ├── conftest.py
    │   ├── extensions/
    │   │   ├── test_evaluation.py
    │   │   └── test_types.py
    │   ├── ml/
    │   │   └── test_base.py
    │   ├── sql/
    │   │   ├── test_dataframe.py
    │   │   ├── test_datetime.py
    │   │   ├── test_pandas.py
    │   │   ├── test_schema.py
    │   │   ├── test_string.py
    │   │   └── test_transform.py
    │   ├── test_plot.py
    │   ├── test_stats.py
    │   └── test_util.py
    └── rawdata/
        └── train.csv
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# dotenv
.env
# virtualenv
.venv
venv/
ENV/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.idea
examples/spark-warehouse/
tests/spark-warehouse
================================================
FILE: .travis.yml
================================================
language: python
sudo: required
dist: trusty
cache:
  directories:
    - $HOME/.ivy2
    - $HOME/spark
    - $HOME/.cache/pip
    - $HOME/.pip-cache
    - $HOME/.sbt/launchers
jdk:
  - oraclejdk8
python:
  - 3.6
sudo: false
addons:
  apt:
    packages:
      - axel
cache: pip
before_install:
  - export PATH=$HOME/.local/bin:$PATH
  - pip install -U pip
  - export PYTHONPATH=$PYTHONPATH:$(pwd)
install:
  # Download spark 2.3.3
  - "[ -f spark ] || mkdir spark && cd spark && axel http://www-us.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz && cd .."
  - "tar -xf ./spark/spark-2.3.3-bin-hadoop2.7.tgz"
  - "export SPARK_HOME=`pwd`/spark-2.3.3-bin-hadoop2.7"
  - "export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python"
  - echo "spark.yarn.jars=$SPARK_HOME/jars/*.jar" > $SPARK_HOME/conf/spark-defaults.conf
  - pip install -r requirements.txt
script:
  - pytest ./tests
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2018 Daniel Voigt Godoy
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
[![Build Status](https://travis-ci.org/dvgodoy/handyspark.svg?branch=master)](https://travis-ci.org/dvgodoy/handyspark)
# HandySpark
## Bringing pandas-like capabilities to Spark dataframes!
***HandySpark*** is a package designed to improve ***PySpark*** user experience, especially when it comes to ***exploratory data analysis***, including ***visualization*** capabilities!
It makes fetching data or computing statistics for columns really easy, returning ***pandas objects*** straight away.
It also leverages the recently released ***pandas UDFs*** in Spark to allow for an out-of-the-box usage of common ***pandas functions*** in a Spark dataframe.
Moreover, it introduces the ***stratify*** operation, so users can perform more sophisticated analysis, imputation and outlier detection on stratified data without incurring very computationally expensive ***groupby*** operations.
It brings the long missing capability of ***plotting*** data while retaining the advantage of performing distributed computation (unlike many tutorials on the internet, which just convert the whole dataset to pandas and then plot it - don't ever do that!).
Finally, it also extends ***evaluation metrics*** for ***binary classification***, so you can easily choose which threshold to use!
## Google Colab
Eager to try it out right away? Don't wait any longer!
Open the notebook directly on Google Colab and try it yourself:
- [Exploring Titanic](https://colab.research.google.com/github/dvgodoy/handyspark/blob/master/notebooks/Exploring_Titanic.ipynb)
## Installation
To install ***HandySpark*** from [PyPI](https://pypi.org/project/handyspark/), just type:
```bash
pip install handyspark
```
## Documentation
You can find the full documentation [here](http://dvgodoy.github.com/handyspark).
Here is a ***handy*** list of direct links to some classes, objects and methods used:
- [HandyFrame](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.HandyFrame)
- [cols](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.HandyColumns)
- [pandas](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.pandas.HandyPandas)
- [transformers](https://dvgodoy.github.io/handyspark/handyspark.ml.html#handyspark.ml.base.HandyTransformers)
- [isnull](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.isnull)
- [fill](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.fill)
- [outliers](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.outliers)
- [fence](https://dvgodoy.github.io/handyspark/handyspark.html#handyspark.HandyFrame.fence)
- [stratify](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.HandyFrame.stratify)
- [Bucket](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.Bucket)
- [Quantile](https://dvgodoy.github.io/handyspark/handyspark.sql.html#handyspark.sql.dataframe.Quantile)
- [HandyImputer](https://dvgodoy.github.io/handyspark/handyspark.ml.html#handyspark.ml.base.HandyImputer)
- [HandyFencer](https://dvgodoy.github.io/handyspark/handyspark.ml.html#handyspark.ml.base.HandyFencer)
## Quick Start
To use ***HandySpark***, all you need to do is import the package and, after loading your data into a Spark dataframe, call the ***toHandy()*** method to get your own ***HandyFrame***:
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from handyspark import *
sdf = spark.read.csv('./tests/rawdata/train.csv', header=True, inferSchema=True)
hdf = sdf.toHandy()
```
### Fetching and plotting data
Now you can easily fetch data as if you were using pandas, just use the ***cols*** object from your ***HandyFrame***:
```python
hdf.cols['Name'][:5]
```
It should return a pandas Series object:
```
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
Name: Name, dtype: object
```
If you include a list of columns, it will return a pandas DataFrame.
Due to the distributed nature of data in Spark, it is only possible to fetch the top rows of any given ***HandyFrame***.
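For instance, here is a minimal sketch of fetching more than one column at once, using the same slicing shown above:
```python
# Fetching a list of columns returns a pandas DataFrame with the top rows
hdf.cols[['Name', 'Age']][:5]
```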
Using ***cols*** you have access to several pandas-like column and DataFrame based methods implemented in Spark:
- min / max / median / q1 / q3 / stddev / mode
- nunique
- value_counts
- corr
- hist
- boxplot
- scatterplot
For instance:
```python
hdf.cols['Embarked'].value_counts(dropna=False)
```
```
S 644
C 168
Q 77
NaN 2
Name: Embarked, dtype: int64
```
You can also make some plots:
```python
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(12, 4))
hdf.cols['Embarked'].hist(ax=axs[0])
hdf.cols['Age'].boxplot(ax=axs[1])
hdf.cols['Fare'].boxplot(ax=axs[2])
hdf.cols[['Fare', 'Age']].scatterplot(ax=axs[3])
```
![cols plots](/images/cols_plot.png)
Handy, right (pun intended!)? But things can get ***even more*** interesting if you use ***stratify***!
### Stratify
Stratifying a HandyFrame means using a ***split-apply-combine*** approach. It will first split your HandyFrame according to the specified (discrete) columns, then it will apply some function to each stratum of data and finally combine the results back together.
This is better illustrated with an example - let's try the stratified version of our previous `value_counts`:
```python
hdf.stratify(['Pclass']).cols['Embarked'].value_counts()
```
```
Pclass Embarked
1 C 85
Q 2
S 127
2 C 17
Q 3
S 164
3 C 66
Q 72
S 353
Name: value_counts, dtype: int64
```
Cool, isn't it? Besides, under the hood, not a single ***group by*** operation was performed - everything is handled using filter clauses! So, ***no data shuffling***!
What if you want to ***stratify*** on a column containing continuous values? No problem!
```python
hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].value_counts()
```
```
Sex Age Embarked
female Age >= 0.4200 and Age < 40.2100 C 46
Q 12
S 154
Age >= 40.2100 and Age <= 80.0000 C 15
S 32
male Age >= 0.4200 and Age < 40.2100 C 53
Q 11
S 287
Age >= 40.2100 and Age <= 80.0000 C 16
Q 5
S 81
Name: value_counts, dtype: int64
```
You can use either ***Bucket*** or ***Quantile*** to discretize your data in any given number of bins!
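For instance, here is a minimal sketch using ***Quantile*** instead, assuming it takes a column name and a number of bins just like ***Bucket***:
```python
# Stratify on two Fare quantiles instead of fixed-width Age buckets
hdf.stratify(['Sex', Quantile('Fare', 2)]).cols['Embarked'].value_counts()
```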
What about ***plotting*** it? Yes, ***HandySpark*** can handle that as well!
```python
hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].hist(figsize=(8, 6))
```
![stratified hist](/images/stratified_hist.png)
### Handling missing data
***HandySpark*** makes it very easy to spot and fill missing values. To figure out if there are any missing values, just use ***isnull***:
```python
hdf.isnull(ratio=True)
```
```
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 0.198653
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 0.771044
Embarked 0.002245
Name: missing(ratio), dtype: float64
```
Ok, now you know there are 3 columns with missing values: `Age`, `Cabin` and `Embarked`. It's time to fill those values up! But, let's skip `Cabin`, which has 77% of its values missing!
So, `Age` is a continuous variable, while `Embarked` is a categorical variable. Let's start with the latter:
```python
hdf_filled = hdf.fill(categorical=['Embarked'])
```
***HandyFrame*** has a ***fill*** method which takes up to 3 arguments:
- categorical: a list of categorical variables
- continuous: a list of continuous variables
- strategy: which strategy to use for each one of the continuous variables (either `mean` or `median`)
Categorical variables use a `mode` strategy by default.
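For instance, a minimal sketch of a plain (non-stratified) imputation of `Age` using the median, following the same arguments as the stratified call shown next:
```python
# Impute missing Age values using the overall median
hdf.fill(continuous=['Age'], strategy=['median'])
```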
But you do not need to stick with the basics anymore... you can fancy it up using ***stratify*** together with ***fill***:
```python
hdf_filled = hdf_filled.stratify(['Pclass', 'Sex']).fill(continuous=['Age'], strategy=['mean'])
```
How do you know which values are being used? Simple enough:
```python
hdf_filled.statistics_
```
```
{'Age': {'Pclass == "1" and Sex == "female"': 34.61176470588235,
'Pclass == "1" and Sex == "male"': 41.28138613861386,
'Pclass == "2" and Sex == "female"': 28.722972972972972,
'Pclass == "2" and Sex == "male"': 30.74070707070707,
'Pclass == "3" and Sex == "female"': 21.75,
'Pclass == "3" and Sex == "male"': 26.507588932806325},
'Embarked': 'S'}
```
There you go! The filter clauses and the corresponding imputation values!
But there is ***more*** - once you're done with your imputation procedure, why not generate a ***custom transformer*** to do that for you, either on your test set or in production?
You only need to call the ***imputer*** method of the ***transformer*** object that every ***HandyFrame*** has:
```python
imputer = hdf_filled.transformers.imputer()
```
In the example above, ***imputer*** is now a full-fledged serializable PySpark transformer! What does that mean? You can use it in your ***pipeline*** and ***save / load*** at will :-)
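For instance, here is a minimal sketch of what that could look like, assuming a hypothetical test DataFrame named `sdf_test` and the standard PySpark ML `save` / `load` persistence pattern:
```python
from handyspark.ml import HandyImputer

# Apply the fitted imputation rules to unseen data
imputed_test = imputer.transform(sdf_test)

# Persist the transformer and load it back later
imputer.save('titanic_imputer')
imputer = HandyImputer.load('titanic_imputer')
```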
### Detecting outliers
Second only to the problem of missing data, outliers can pose a challenge for training machine learning models.
***HandyFrame*** to the rescue, with its ***outliers*** method:
```python
hdf_filled.outliers(method='tukey', k=3.)
```
```
PassengerId 0.0
Survived 0.0
Pclass 0.0
Age 1.0
SibSp 12.0
Parch 213.0
Fare 53.0
dtype: float64
```
Currently, only [***Tukey's***](https://en.wikipedia.org/wiki/Outlier#Tukey's_fences) method is available. This method takes an optional ***k*** argument, which you can set to larger values (like 3) to allow for a looser detection.
The good thing is, now we can take a peek at the data by plotting it:
```python
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
hdf_filled.cols['Parch'].hist(ax=axs[0])
hdf_filled.cols['SibSp'].hist(ax=axs[1])
hdf_filled.cols['Age'].boxplot(ax=axs[2], k=3)
hdf_filled.cols['Fare'].boxplot(ax=axs[3], k=3)
```
![outliers](/images/outliers.png)
Let's focus on the `Fare` column - what can we do about it? Well, we could use Tukey's fences to, er... ***fence*** the outliers :-)
```python
hdf_fenced = hdf_filled.fence(['Fare'])
```
Which values were used, you ask?
```python
hdf_fenced.fences_
```
```
{'Fare': [-26.0105, 64.4063]}
```
It works quite similarly to the ***fill*** method and, I hope you guessed, it ***also*** gives you the ability to create the corresponding ***custom transformer*** :-)
```python
fencer = hdf_fenced.transformers.fencer()
```
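And, just like the imputer, you can drop it straight into a pipeline - a minimal sketch, reusing the hypothetical `sdf_test` from above:
```python
from pyspark.ml.pipeline import Pipeline

# Chain imputation and fencing as preprocessing stages
prep = Pipeline(stages=[imputer, fencer])
prepped_test = prep.fit(sdf_test).transform(sdf_test)
```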
You can also use [***Mahalanobis distance***](https://en.wikipedia.org/wiki/Mahalanobis_distance) to identify outliers in a multi-dimensional space, given a critical value (usually 99.9%, but you are free to use either a more restricted or a more relaxed threshold).
To get the outliers for a subset of columns (only ***numerical*** columns are considered!):
```
outliers = hdf_filled.cols[['Age', 'Fare', 'SibSp']].get_outliers(critical_value=.90)
```
Let's take a look at the first 5 outliers found:
```
outliers.cols[:][:5]
```

What if you want to discard these samples? You just need to call `remove_outliers`:
```
hdf_without_outliers = hdf_filled.cols[['Age', 'Fare', 'SibSp']].remove_outliers(critical_value=0.90)
```
### Evaluating your model!
You cleaned your data, you trained your classification model, you fine-tuned it and now you want to ***evaluate*** it, right?
```
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
assem = VectorAssembler(inputCols=['Fare', 'Pclass', 'Age'], outputCol='features')
rf = RandomForestClassifier(featuresCol='features', labelCol='Survived', numTrees=20)
pipeline = Pipeline(stages=[assem, rf])
model = pipeline.fit(hdf_fenced)
predictions = model.transform(hdf_fenced)
evaluator = BinaryClassificationEvaluator(labelCol='Survived')
evaluator.evaluate(predictions)
```
Then you realize evaluators only give you `areaUnderROC` and `areaUnderPR`. How about ***plotting ROC or PR curves***? How about ***finding a threshold*** that suits your needs regarding false positives or false negatives?
***HandySpark*** extends the ***BinaryClassificationMetrics*** object to take ***DataFrames*** and output ***all your evaluation needs***!
```
bcm = BinaryClassificationMetrics(predictions, scoreCol='probability', labelCol='Survived')
```
Now you can ***plot*** the curves...
```
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(12, 4))
bcm.plot_roc_curve(ax=axs[0])
bcm.plot_pr_curve(ax=axs[1])
```

...or get metrics for every ***threshold***...
```
bcm.getMetricsByThreshold().toPandas()[100:105]
```

...or the ***confusion matrix*** for the threshold you chose:
```
bcm.print_confusion_matrix(.572006)
```

### Pandas and more pandas!
With ***HandySpark*** you can feel ***almost*** as if you were using traditional pandas :-)
To gain access to the whole suite of available pandas functions, you need to leverage the ***pandas*** object of your ***HandyFrame***:
```python
some_ports = hdf_fenced.pandas['Embarked'].isin(values=['C', 'Q'])
some_ports
```
```
Column<b'udf(Embarked) AS `<lambda>(Embarked,)`'>
```
In the example above, ***HandySpark*** treats the `Embarked` column as if it were a pandas Series and, therefore, you may call its ***isin*** method!
But, remember Spark has ***lazy evaluation***, so the result is a ***column expression*** which leverages the power of ***pandas UDFs*** (provided that PyArrow is installed, otherwise it will fall back to traditional UDFs).
The only thing left to do is to actually ***assign*** the results to a new column, right?
```python
hdf_fenced = hdf_fenced.assign(is_c_or_q=some_ports)
# What's in there?
hdf_fenced.cols['is_c_or_q'][:5]
```
```
0 True
1 False
2 False
3 True
4 True
Name: is_c_or_q, dtype: bool
```
You got that right! ***HandyFrame*** has a very convenient ***assign*** method, just like in pandas!
It does not get much easier than that :-) There are several column methods available already:
- between / between_time
- isin
- isna / isnull
- notna / notnull
- abs
- clip / clip_lower / clip_upper
- replace
- round / truncate
- tz_convert / tz_localize
And this is not all! Both specialized ***str*** and ***dt*** objects from pandas are available as well!
For instance, what if you want to find out whether a given string contains another substring?
```python
col_mrs = hdf_fenced.pandas['Name'].str.find(sub='Mrs.')
hdf_fenced = hdf_fenced.assign(is_mrs=col_mrs > 0)
```
![is mrs](/images/is_mrs.png)
There are many, many more available methods:
1. ***String methods***:
- contains
- startswith / endswith
- match
- isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace
- islower / isupper / istitle
- replace
- repeat
- join
- pad
- slice / slice_replace
- strip / lstrip / rstrip
- wrap / center / ljust / rjust
- translate
- get
- normalize
- lower / upper / capitalize / swapcase / title
- zfill
- count
- find / rfind
- len
2. ***Date / Datetime methods***:
- is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start
- strftime
- tz / time / tz_convert / tz_localize
- day / dayofweek / dayofyear / days_in_month / daysinmonth
- hour / microsecond / minute / nanosecond / second
- week / weekday / weekday_name
- month / quarter / year / weekofyear
- date
- ceil / floor / round
- normalize
### Your own functions
The sky is the limit! You can create regular Python functions and use assign to create new columns :-)
No need to worry about turning them into ***pandas UDFs*** - everything is handled by ***HandySpark*** under the hood!
The arguments of your function (or `lambda`) should have the names of the columns you want to use. For instance, to take the `log` of `Fare`:
```python
import numpy as np
hdf_fenced = hdf_fenced.assign(logFare=lambda Fare: np.log(Fare + 1))
```
![logfare](/images/logfare.png)
You can also use multiple columns:
```python
hdf_fenced = hdf_fenced.assign(fare_times_age=lambda Fare, Age: Fare * Age)
```
Even though the result is kinda pointless, it will work :-)
Keep in mind that the ***return type***, that is, the column type of the new column, will be the same as the first column used (`Fare`, in the example).
What if you want to return something of a ***different*** type?! No worries! You only need to ***wrap*** your function with the desired return type. An example should make this more clear:
```python
from pyspark.sql.types import StringType
hdf_fenced = hdf_fenced.assign(str_fare=StringType.ret(lambda Fare: Fare.map('${:,.2f}'.format)))
hdf_fenced.cols['str_fare'][:5]
```
```
0 $65.66
1 $53.10
2 $26.55
3 $65.66
4 $65.66
Name: str_fare, dtype: object
```
Basically, we imported the desired output type - ***StringType*** - and used its extended method ***ret*** to wrap our `lambda` function that formats our numeric `Fare` column into a string.
It is also possible to create a more complex type, like an array of doubles:
```python
from pyspark.sql.types import ArrayType, DoubleType
def make_list(Fare):
return Fare.apply(lambda v: [v, v*2])
hdf_fenced = hdf_fenced.assign(fare_list=ArrayType(DoubleType()).ret(make_list))
hdf_fenced.cols['fare_list'][:5]
```
```
0 [7.25, 14.5]
1 [71.2833, 142.5666]
2 [7.925, 15.85]
3 [53.1, 106.2]
4 [8.05, 16.1]
Name: fare_list, dtype: object
```
OK, so, what happened here?
1. First, we imported the necessary types, ***ArrayType*** and ***DoubleType***, since we are building a function that returns a list of doubles.
2. We actually built the function - notice that we call ***apply*** straight from ***Fare***, which is treated as a pandas Series under the hood.
3. We ***wrap*** the function with the return type `ArrayType(DoubleType())` by invoking the extended method `ret`.
4. Finally, we assign it to a new column name, and that's it!
### Nicer exceptions
Now, suppose you make a mistake while creating your function... if you have used Spark for a while, you have probably realized that, when an exception is raised, the error message will be ***loooong***, right?
To help you with that, ***HandySpark*** analyzes the error and shows a nicely parsed version of it at the very ***top*** of the error message, in ***bold red***:
![exception](/images/handy_exception.png)
### Safety first
***HandySpark*** wants to protect your cluster and network, so it implements a ***safety*** mechanism that kicks in whenever you perform an operation that is going to retrieve ***ALL*** data from your ***HandyFrame***, like `collect` or `toPandas`.
How does that work? Every time one of these methods is called on a ***HandyFrame***, it will return at most the ***safety limit*** number of elements, which defaults to ***1,000***.
![safety on](/images/safety_on.png)
Do you want to set a different safety limit for your ***HandyFrame***?
![safety limit](/images/safety_limit.png)
What if you want to retrieve everything nonetheless?! You can invoke the ***safety_off*** method prior to the actual method you want to call and you get a ***one-time*** unlimited result.
![safety off](/images/safety_off.png)
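In code, that could look like the sketch below - note that `set_safety_limit` is the method name assumed here for adjusting the limit:
```python
# Raise the safety limit for this HandyFrame (assumed method name)
hdf_fenced.set_safety_limit(2000)

# One-time unlimited retrieval, bypassing the safety limit
full_pdf = hdf_fenced.safety_off().toPandas()
```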
### Don't feel like Handy anymore?
To get back your original Spark dataframe, you only need to call ***notHandy*** to make it not handy again:
```python
hdf_fenced.notHandy()
```
```
DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string, logFare: double, is_c_or_q: boolean]
```
## Comments, questions, suggestions, bugs
***DISCLAIMER***: this is a project ***under development***, so it is likely you'll run into bugs/problems.
So, if you find any bugs/problems, please open an [issue](https://github.com/dvgodoy/handyspark/issues) or submit a [pull request](https://github.com/dvgodoy/handyspark/pulls).
================================================
FILE: README.rst
================================================
.. image:: https://travis-ci.org/dvgodoy/handyspark.svg?branch=master
:target: https://travis-ci.org/dvgodoy/handyspark
:alt: Build Status
HandySpark
==========
Bringing pandas-like capabilities to Spark dataframes!
------------------------------------------------------
*HandySpark* is a package designed to improve *PySpark* user experience, especially when it comes to *exploratory data analysis* , including *visualization* capabilities!
It makes fetching data or computing statistics for columns really easy, returning *pandas objects* straight away.
It also leverages the recently released *pandas UDFs* in Spark to allow for an out-of-the-box usage of common *pandas functions* in a Spark dataframe.
Moreover, it introduces the *stratify* operation, so users can perform more sophisticated analysis, imputation and outlier detection on stratified data without incurring very computationally expensive *groupby* operations.
Finally, it brings the long missing capability of *plotting* data while retaining the advantage of performing distributed computation (unlike many tutorials on the internet, which just convert the whole dataset to pandas and then plot it - don't ever do that!).
Google Colab
------------
Eager to try it out right away? Don't wait any longer!
Open the notebook directly on Google Colab and try it yourself:
* `Exploring Titanic <https://colab.research.google.com/github/dvgodoy/handyspark/blob/master/notebooks/Exploring_Titanic.ipynb>`_
Installation
------------
To install *HandySpark* from `PyPI <https://pypi.org/project/handyspark/>`_, just type:
.. code-block:: bash
pip install handyspark
Documentation
-------------
You can find the full documentation `here <http://dvgodoy.github.com/handyspark>`_.
Quick Start
-----------
To use *HandySpark* , all you need to do is import the package and, after loading your data into a Spark dataframe, call the *toHandy()* method to get your own *HandyFrame* :
.. code-block:: python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from handyspark import *
sdf = spark.read.csv('./tests/rawdata/train.csv', header=True, inferSchema=True)
hdf = sdf.toHandy()
Fetching and plotting data
^^^^^^^^^^^^^^^^^^^^^^^^^^
Now you can easily fetch data as if you were using pandas, just use the *cols* object from your *HandyFrame* :
.. code-block:: python
hdf.cols['Name'][:5]
It should return a pandas Series object:
.. code-block::
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
Name: Name, dtype: object
If you include a list of columns, it will return a pandas DataFrame.
Due to the distributed nature of data in Spark, it is only possible to fetch the top rows of any given *HandyFrame*.
Using *cols* you have access to several pandas-like column and DataFrame based methods implemented in Spark:
* min / max / median / q1 / q3 / stddev / mode
* nunique
* value_counts
* corr
* hist
* boxplot
* scatterplot
For instance:
.. code-block:: python
hdf.cols['Embarked'].value_counts(dropna=False)
.. code-block::
S 644
C 168
Q 77
NaN 2
Name: Embarked, dtype: int64
You can also make some plots:
.. code-block:: python
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(12, 4))
hdf.cols['Embarked'].hist(ax=axs[0])
hdf.cols['Age'].boxplot(ax=axs[1])
hdf.cols['Fare'].boxplot(ax=axs[2])
hdf.cols[['Fare', 'Age']].scatterplot(ax=axs[3])
.. image:: /images/cols_plot.png
:target: /images/cols_plot.png
:alt: cols plots
Handy, right (pun intended!)? But things can get *even more* interesting if you use *stratify* !
Stratify
^^^^^^^^
Stratifying a HandyFrame means using a *split-apply-combine* approach. It will first split your HandyFrame according to the specified (discrete) columns, then it will apply some function to each stratum of data and finally combine the results back together.
This is better illustrated with an example - let's try the stratified version of our previous ``value_counts``\ :
.. code-block:: python
hdf.stratify(['Pclass']).cols['Embarked'].value_counts()
.. code-block::
Pclass Embarked
1 C 85
Q 2
S 127
2 C 17
Q 3
S 164
3 C 66
Q 72
S 353
Name: value_counts, dtype: int64
Cool, isn't it? Besides, under the hood, not a single *group by* operation was performed - everything is handled using filter clauses! So, *no data shuffling* !
What if you want to *stratify* on a column containing continuous values? No problem!
.. code-block:: python
hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].value_counts()
.. code-block::
Sex Age Embarked
female Age >= 0.4200 and Age < 40.2100 C 46
Q 12
S 154
Age >= 40.2100 and Age <= 80.0000 C 15
S 32
male Age >= 0.4200 and Age < 40.2100 C 53
Q 11
S 287
Age >= 40.2100 and Age <= 80.0000 C 16
Q 5
S 81
Name: value_counts, dtype: int64
You can use either *Bucket* or *Quantile* to discretize your data in any given number of bins!
What about *plotting* it? Yes, *HandySpark* can handle that as well!
.. code-block:: python
hdf.stratify(['Sex', Bucket('Age', 2)]).cols['Embarked'].hist(figsize=(8, 6))
.. image:: /images/stratified_hist.png
:target: /images/stratified_hist.png
:alt: stratified hist
Handling missing data
^^^^^^^^^^^^^^^^^^^^^
*HandySpark* makes it very easy to spot and fill missing values. To figure out if there are any missing values, just use *isnull* :
.. code-block:: python
hdf.isnull(ratio=True)
.. code-block::
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 0.198653
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 0.771044
Embarked 0.002245
Name: missing(ratio), dtype: float64
Ok, now you know there are 3 columns with missing values: ``Age``\ , ``Cabin`` and ``Embarked``. It's time to fill those values up! But, let's skip ``Cabin``\ , which has 77% of its values missing!
So, ``Age`` is a continuous variable, while ``Embarked`` is a categorical variable. Let's start with the latter:
.. code-block:: python
hdf_filled = hdf.fill(categorical=['Embarked'])
*HandyFrame* has a *fill* method which takes up to 3 arguments:
* categorical: a list of categorical variables
* continuous: a list of continuous variables
* strategy: which strategy to use for each one of the continuous variables (either ``mean`` or ``median``\ )
Categorical variables use a ``mode`` strategy by default.
But you do not need to stick with the basics anymore... you can fancy it up using *stratify* together with *fill* :
.. code-block:: python
hdf_filled = hdf_filled.stratify(['Pclass', 'Sex']).fill(continuous=['Age'], strategy=['mean'])
How do you know which values are being used? Simple enough:
.. code-block:: python
hdf_filled.statistics_
.. code-block::
{'Embarked': 'S',
'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
'Pclass == "3" and Sex == "female"': {'Age': 21.75},
'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}
There you go! The filter clauses and the corresponding imputation values!
But there is *more* - once you're done with your imputation procedure, why not generate a *custom transformer* to do that for you, either on your test set or in production?
You only need to call the *imputer* method of the *transformer* object that every *HandyFrame* has:
.. code-block:: python
imputer = hdf_filled.transformers.imputer()
In the example above, *imputer* is now a full-fledged serializable PySpark transformer! What does that mean? You can use it in your *pipeline* and *save / load* at will :-)
Detecting outliers
^^^^^^^^^^^^^^^^^^
Second only to the problem of missing data, outliers can pose a challenge for training machine learning models.
*HandyFrame* to the rescue, with its *outliers* method:
.. code-block:: python
hdf_filled.outliers(method='tukey', k=3.)
.. code-block::
PassengerId 0.0
Survived 0.0
Pclass 0.0
Age 1.0
SibSp 12.0
Parch 213.0
Fare 53.0
dtype: float64
Currently, only `\ *Tukey's* <https://en.wikipedia.org/wiki/Outlier#Tukey's_fences>`_ method is available (I am working on Mahalanobis distance!). This method takes an optional *k* argument, which you can set to larger values (like 3) to allow for a looser detection.
The good thing is, now we can take a peek at the data by plotting it:
.. code-block:: python
from matplotlib import pyplot as plt
fig, axs = plt.subplots(1, 4, figsize=(16, 4))
hdf_filled.cols['Parch'].hist(ax=axs[0])
hdf_filled.cols['SibSp'].hist(ax=axs[1])
hdf_filled.cols['Age'].boxplot(ax=axs[2], k=3)
hdf_filled.cols['Fare'].boxplot(ax=axs[3], k=3)
.. image:: /images/outliers.png
:target: /images/outliers.png
:alt: outliers
Let's focus on the ``Fare`` column - what can we do about it? Well, we could use Tukey's fences to, er... *fence* the outliers :-)
.. code-block:: python
hdf_fenced = hdf_filled.fence(['Fare'])
Which values were used, you ask?
.. code-block:: python
hdf_fenced.fences_
.. code-block::
{'Fare': [-26.7605, 65.6563]}
It works quite similarly to the *fill* method and, I hope you guessed, it *also* gives you the ability to create the corresponding *custom transformer* :-)
.. code-block:: python
fencer = hdf_fenced.transformers.fencer()
Pandas and more pandas!
^^^^^^^^^^^^^^^^^^^^^^^
With *HandySpark* you can feel *almost* as if you were using traditional pandas :-)
To gain access to the whole suite of available pandas functions, you need to leverage the *pandas* object of your *HandyFrame* :
.. code-block:: python
some_ports = hdf_fenced.pandas['Embarked'].isin(values=['C', 'Q'])
some_ports
.. code-block::
Column<b'udf(Embarked) AS `<lambda>(Embarked,)`'>
In the example above, *HandySpark* treats the ``Embarked`` column as if it were a pandas Series and, therefore, you may call its *isin* method!
But, remember Spark has *lazy evaluation* , so the result is a *column expression* which leverages the power of *pandas UDFs* (provided that PyArrow is installed, otherwise it will fall back to traditional UDFs).
The only thing left to do is to actually *assign* the results to a new column, right?
.. code-block:: python
hdf_fenced = hdf_fenced.assign(is_c_or_q=some_ports)
# What's in there?
hdf_fenced.cols['is_c_or_q'][:5]
.. code-block::
0 True
1 False
2 False
3 True
4 True
Name: is_c_or_q, dtype: bool
You got that right! *HandyFrame* has a very convenient *assign* method, just like in pandas!
It does not get much easier than that :-) There are several column methods available already:
* between / between_time
* isin
* isna / isnull
* notna / notnull
* abs
* clip / clip_lower / clip_upper
* replace
* round / truncate
* tz_convert / tz_localize
And this is not all! Both specialized *str* and *dt* objects from pandas are available as well!
For instance, what if you want to find out whether a given string contains another substring?
.. code-block:: python
col_mrs = hdf_fenced.pandas['Name'].str.find(sub='Mrs.')
hdf_fenced = hdf_fenced.assign(is_mrs=col_mrs > 0)
.. image:: /images/is_mrs.png
:target: /images/is_mrs.png
:alt: is mrs
There are many, many more available methods:
*String methods* :
#. contains
#. startswith / endswith
#. match
#. isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace
#. islower / isupper / istitle
#. replace
#. repeat
#. join
#. pad
#. slice / slice_replace
#. strip / lstrip / rstrip
#. wrap / center / ljust / rjust
#. translate
#. get
#. normalize
#. lower / upper / capitalize / swapcase / title
#. zfill
#. count
#. find / rfind
#. len
*Date / Datetime methods* :
#. is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start
#. strftime
#. tz / time / tz_convert / tz_localize
#. day / dayofweek / dayofyear / days_in_month / daysinmonth
#. hour / microsecond / minute / nanosecond / second
#. week / weekday / weekday_name
#. month / quarter / year / weekofyear
#. date
#. ceil / floor / round
#. normalize
Your own functions
^^^^^^^^^^^^^^^^^^
The sky is the limit! You can create regular Python functions and use assign to create new columns :-)
No need to worry about turning them into *pandas UDFs* - everything is handled by *HandySpark* under the hood!
The arguments of your function (or ``lambda``\ ) should have the names of the columns you want to use. For instance, to take the ``log`` of ``Fare``\ :
.. code-block:: python
import numpy as np
hdf_fenced = hdf_fenced.assign(logFare=lambda Fare: np.log(Fare + 1))
.. image:: /images/logfare.png
:target: /images/logfare.png
:alt: logfare
You can also use multiple columns:
.. code-block:: python
hdf_fenced = hdf_fenced.assign(fare_times_age=lambda Fare, Age: Fare * Age)
Even though the result is kinda pointless, it will work :-)
Keep in mind that the *return type* , that is, the column type of the new column, will be the same as the first column used (\ ``Fare``\ , in the example).
What if you want to return something of a *different* type?! No worries! You only need to *wrap* your function with the desired return type. An example should make this more clear:
.. code-block:: python
from pyspark.sql.types import StringType
hdf_fenced = hdf_fenced.assign(str_fare=StringType.ret(lambda Fare: Fare.map('${:,.2f}'.format)))
hdf_fenced.cols['str_fare'][:5]
.. code-block::
0 $65.66
1 $53.10
2 $26.55
3 $65.66
4 $65.66
Name: str_fare, dtype: object
Basically, we imported the desired output type - *StringType* - and used its extended method *ret* to wrap our ``lambda`` function that formats our numeric ``Fare`` column into a string.
It is also possible to create a more complex type, like an array of doubles:
.. code-block:: python
from pyspark.sql.types import ArrayType, DoubleType
def make_list(Fare):
return Fare.apply(lambda v: [v, v*2])
hdf_fenced = hdf_fenced.assign(fare_list=ArrayType(DoubleType()).ret(make_list))
hdf_fenced.cols['fare_list'][:5]
.. code-block::
0 [7.25, 14.5]
1 [71.2833, 142.5666]
2 [7.925, 15.85]
3 [53.1, 106.2]
4 [8.05, 16.1]
Name: fare_list, dtype: object
OK, so, what happened here?
#. First, we imported the necessary types, *ArrayType* and *DoubleType* , since we are building a function that returns a list of doubles.
#. We actually built the function - notice that we call *apply* straight from *Fare* , which is treated as a pandas Series under the hood.
#. We *wrap* the function with the return type ``ArrayType(DoubleType())`` by invoking the extended method ``ret``.
#. Finally, we assign it to a new column name, and that's it!
Nicer exceptions
^^^^^^^^^^^^^^^^
Now, suppose you make a mistake while creating your function... if you have used Spark for a while, you have probably realized that, when an exception is raised, the error message will be *loooong* , right?
To help you with that, *HandySpark* analyzes the error and shows a nicely parsed version of it at the very *top* of the error message, in *bold red* :
.. image:: /images/handy_exception.png
:target: /images/handy_exception.png
:alt: exception
Safety first
^^^^^^^^^^^^
*HandySpark* wants to protect your cluster and network, so it implements a *safety* mechanism that kicks in whenever you perform an operation that is going to retrieve *ALL* data from your *HandyFrame* , like ``collect`` or ``toPandas``.
How does that work? Every time one of these methods is called on a *HandyFrame* , it will return at most the *safety limit* number of elements, which defaults to *1,000*.
.. image:: /images/safety_on.png
:target: /images/safety_on.png
:alt: safety on
Do you want to set a different safety limit for your *HandyFrame* ?
.. image:: /images/safety_limit.png
:target: /images/safety_limit.png
:alt: safety limit
What if you want to retrieve everything nonetheless?! You can invoke the *safety_off* method prior to the actual method you want to call and you get a *one-time* unlimited result.
.. image:: /images/safety_off.png
:target: /images/safety_off.png
:alt: safety off
Don't feel like Handy anymore?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To get back your original Spark dataframe, you only need to call *notHandy* to make it not handy again:
.. code-block:: python
hdf_fenced.notHandy()
.. code-block::
DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string, logFare: double, is_c_or_q: boolean]
Comments, questions, suggestions, bugs
--------------------------------------
*DISCLAIMER* : this is a project *under development* , so it is likely you'll run into bugs/problems.
So, if you find any bugs/problems, please open an `issue <https://github.com/dvgodoy/handyspark/issues>`_ or submit a `pull request <https://github.com/dvgodoy/handyspark/pulls>`_.
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = HandySpark
SOURCEDIR = source
BUILDDIR = ../../handyspark-docs
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/source/conf.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# HandySpark documentation build configuration file, created by
# sphinx-quickstart on Sun Oct 28 17:42:51 2018.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../..'))
sys.setrecursionlimit(1500)
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc',
'sphinx.ext.intersphinx',
'sphinx.ext.mathjax',
'sphinx.ext.ifconfig',
'sphinx.ext.viewcode',
'sphinx.ext.githubpages',
'sphinx.ext.napoleon']
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = 'HandySpark'
copyright = '2018, Daniel Voigt Godoy'
author = 'Daniel Voigt Godoy'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.0.1'
# The full version, including alpha/beta/rc tags.
release = '0.0.1'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = []
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# This is required for the alabaster theme
# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
html_sidebars = {
'**': [
'relations.html', # needs 'show_related': True theme option to display
'searchbox.html',
]
}
# -- Options for HTMLHelp output ------------------------------------------
# Output file base name for HTML help builder.
htmlhelp_basename = 'HandySparkdoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
#
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
#
# 'preamble': '',
# Latex figure (float) alignment
#
# 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
(master_doc, 'HandySpark.tex', 'HandySpark Documentation',
'Daniel Voigt Godoy', 'manual'),
]
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
(master_doc, 'handyspark', 'HandySpark Documentation',
[author], 1)
]
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
(master_doc, 'HandySpark', 'HandySpark Documentation',
author, 'HandySpark', 'One line description of project.',
'Miscellaneous'),
]
# -- Options for Epub output ----------------------------------------------
# Bibliographic Dublin Core info.
epub_title = project
epub_author = author
epub_publisher = author
epub_copyright = copyright
# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''
# A unique identification for the text.
#
# epub_uid = ''
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'https://docs.python.org/': None}
================================================
FILE: docs/source/handyspark.extensions.rst
================================================
handyspark\.extensions package
==============================
Submodules
----------
handyspark\.extensions\.common module
-------------------------------------
.. automodule:: handyspark.extensions.common
:members:
:undoc-members:
:show-inheritance:
handyspark\.extensions\.evaluation module
-----------------------------------------
.. automodule:: handyspark.extensions.evaluation
:members:
:undoc-members:
:show-inheritance:
handyspark\.extensions\.types module
------------------------------------
.. automodule:: handyspark.extensions.types
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: handyspark.extensions
:members:
:undoc-members:
:show-inheritance:
================================================
FILE: docs/source/handyspark.ml.rst
================================================
handyspark\.ml package
======================
Submodules
----------
handyspark\.ml\.base module
---------------------------
.. automodule:: handyspark.ml.base
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: handyspark.ml
:members:
:undoc-members:
:show-inheritance:
================================================
FILE: docs/source/handyspark.rst
================================================
handyspark package
==================
Subpackages
-----------
.. toctree::
handyspark.extensions
handyspark.ml
handyspark.sql
Submodules
----------
handyspark\.plot module
-----------------------
.. automodule:: handyspark.plot
:members:
:undoc-members:
:show-inheritance:
handyspark\.stats module
------------------------
.. automodule:: handyspark.stats
:members:
:undoc-members:
:show-inheritance:
handyspark\.util module
-----------------------
.. automodule:: handyspark.util
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: handyspark
:members:
:undoc-members:
:show-inheritance:
================================================
FILE: docs/source/handyspark.sql.rst
================================================
handyspark\.sql package
=======================
Submodules
----------
handyspark\.sql\.dataframe module
---------------------------------
.. automodule:: handyspark.sql.dataframe
:members:
:undoc-members:
:show-inheritance:
handyspark\.sql\.datetime module
--------------------------------
.. automodule:: handyspark.sql.datetime
:members:
:undoc-members:
:show-inheritance:
handyspark\.sql\.pandas module
------------------------------
.. automodule:: handyspark.sql.pandas
:members:
:undoc-members:
:show-inheritance:
handyspark\.sql\.schema module
------------------------------
.. automodule:: handyspark.sql.schema
:members:
:undoc-members:
:show-inheritance:
handyspark\.sql\.string module
------------------------------
.. automodule:: handyspark.sql.string
:members:
:undoc-members:
:show-inheritance:
handyspark\.sql\.transform module
---------------------------------
.. automodule:: handyspark.sql.transform
:members:
:undoc-members:
:show-inheritance:
Module contents
---------------
.. automodule:: handyspark.sql
:members:
:undoc-members:
:show-inheritance:
================================================
FILE: docs/source/includeme.rst
================================================
.. include:: ../../README.rst
================================================
FILE: docs/source/index.rst
================================================
.. HandySpark documentation master file, created by
sphinx-quickstart on Sun Oct 28 17:42:51 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to HandySpark's documentation!
======================================
.. toctree::
:maxdepth: 2
includeme
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
================================================
FILE: docs/source/modules.rst
================================================
handyspark
==========
.. toctree::
:maxdepth: 4
handyspark
================================================
FILE: handyspark/__init__.py
================================================
from handyspark.extensions.evaluation import BinaryClassificationMetrics
from handyspark.sql import HandyFrame, Bucket, Quantile, DataFrame
__all__ = [
'HandyFrame', 'Bucket', 'Quantile', 'BinaryClassificationMetrics'
]
================================================
FILE: handyspark/extensions/__init__.py
================================================
from handyspark.extensions.common import JavaModelWrapper
from handyspark.extensions.evaluation import BinaryClassificationMetrics
from handyspark.extensions.types import AtomicType
__all__ = [
'BinaryClassificationMetrics'
]
================================================
FILE: handyspark/extensions/common.py
================================================
from pyspark.mllib.common import _java2py, _py2java, JavaModelWrapper
def call2(self, name, *a):
    """Another call method for JavaModelWrapper.

    This method should be used whenever the JavaModel returns a Scala Tuple
    that needs to be deserialized before converted to Python.
    """
    serde = self._sc._jvm.org.apache.spark.mllib.api.python.SerDe
    args = [_py2java(self._sc, a) for a in a]
    java_res = getattr(self._java_model, name)(*args)
    java_res = serde.fromTuple2RDD(java_res)
    res = _java2py(self._sc, java_res)
    return res

JavaModelWrapper.call2 = call2
================================================
FILE: handyspark/extensions/evaluation.py
================================================
import pandas as pd
from operator import itemgetter
from handyspark.plot import roc_curve, pr_curve
from pyspark.mllib.evaluation import BinaryClassificationMetrics, MulticlassMetrics
from pyspark.sql import SQLContext, DataFrame, functions as F
from pyspark.sql.types import StructField, StructType, DoubleType
def thresholds(self):
"""
* Returns thresholds in descending order.
"""
return self.call('thresholds')
def roc(self):
"""Calls the `roc` method from the Java class
* Returns the receiver operating characteristic (ROC) curve,
* which is an RDD of (false positive rate, true positive rate)
* with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
* @see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">
* Receiver operating characteristic (Wikipedia)</a>
"""
return self.call2('roc')
def pr(self):
"""Calls the `pr` method from the Java class
* Returns the precision-recall curve, which is an RDD of (recall, precision),
* NOT (precision, recall), with (0.0, p) prepended to it, where p is the precision
* associated with the lowest recall on the curve.
* @see <a href="http://en.wikipedia.org/wiki/Precision_and_recall">
* Precision and recall (Wikipedia)</a>
"""
return self.call2('pr')
def fMeasureByThreshold(self, beta=1.0):
"""Calls the `fMeasureByThreshold` method from the Java class
* Returns the (threshold, F-Measure) curve.
* @param beta the beta factor in F-Measure computation.
* @return an RDD of (threshold, F-Measure) pairs.
* @see <a href="http://en.wikipedia.org/wiki/F1_score">F1 score (Wikipedia)</a>
"""
return self.call2('fMeasureByThreshold', beta)
def precisionByThreshold(self):
"""Calls the `precisionByThreshold` method from the Java class
* Returns the (threshold, precision) curve.
"""
return self.call2('precisionByThreshold')
def recallByThreshold(self):
"""Calls the `recallByThreshold` method from the Java class
* Returns the (threshold, recall) curve.
"""
return self.call2('recallByThreshold')
def getMetricsByThreshold(self):
"""Returns DataFrame containing all metrics (FPR, Recall and
Precision) for every threshold.
Returns
-------
metrics: DataFrame
"""
thresholds = self.call('thresholds').collect()
roc = self.call2('roc').collect()[1:-1]
pr = self.call2('pr').collect()[1:]
metrics = list(zip(thresholds, map(itemgetter(0), roc), map(itemgetter(1), roc), map(itemgetter(1), pr)))
metrics += [(0., 1., 1., 0.)]
sql_ctx = SQLContext.getOrCreate(self._sc)
df = sql_ctx.createDataFrame(metrics).toDF('threshold', 'fpr', 'recall', 'precision')
return df
def confusionMatrix(self, threshold=0.5):
"""Returns confusion matrix: predicted classes are in columns,
they are ordered by class label ascending, as in "labels".
Predicted classes are computed according to informed threshold.
Parameters
----------
threshold: double, optional
Threshold probability for the positive class.
Default is 0.5.
Returns
-------
confusionMatrix: DenseMatrix
"""
scoreAndLabels = self.call2('scoreAndLabels').map(lambda t: (float(t[0] > threshold), t[1]))
mcm = MulticlassMetrics(scoreAndLabels)
return mcm.confusionMatrix()
def print_confusion_matrix(self, threshold=0.5):
"""Returns confusion matrix: predicted classes are in columns,
they are ordered by class label ascending, as in "labels".
Predicted classes are computed according to informed threshold.
Parameters
----------
threshold: double, optional
Threshold probability for the positive class.
Default is 0.5.
Returns
-------
confusionMatrix: pd.DataFrame
"""
cm = self.confusionMatrix(threshold).toArray()
df = pd.concat([pd.DataFrame(cm)], keys=['Actual'], names=[])
df.columns = pd.MultiIndex.from_product([['Predicted'], df.columns])
return df
def plot_roc_curve(self, ax=None):
"""Makes a plot of Receiver Operating Characteristic (ROC) curve.
Parameters
----------
ax : matplotlib axes object, default None
"""
metrics = self.getMetricsByThreshold().toPandas()
return roc_curve(metrics.fpr, metrics.recall, self.areaUnderROC, ax)
def plot_pr_curve(self, ax=None):
"""Makes a plot of Precision-Recall (PR) curve.
Parameters
----------
ax : matplotlib axes object, default None
"""
metrics = self.getMetricsByThreshold().toPandas()
return pr_curve(metrics.precision, metrics.recall, self.areaUnderPR, ax)
def __init__(self, scoreAndLabels, scoreCol='score', labelCol='label'):
if isinstance(scoreAndLabels, DataFrame):
scoreAndLabels = (scoreAndLabels
.select(scoreCol, labelCol)
.rdd.map(lambda row:(float(row[scoreCol][1]), float(row[labelCol]))))
sc = scoreAndLabels.ctx
sql_ctx = SQLContext.getOrCreate(sc)
df = sql_ctx.createDataFrame(scoreAndLabels, schema=StructType([
StructField("score", DoubleType(), nullable=False),
StructField("label", DoubleType(), nullable=False)]))
java_class = sc._jvm.org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
java_model = java_class(df._jdf)
super(BinaryClassificationMetrics, self).__init__(java_model)
BinaryClassificationMetrics.__init__ = __init__
BinaryClassificationMetrics.thresholds = thresholds
BinaryClassificationMetrics.roc = roc
BinaryClassificationMetrics.pr = pr
BinaryClassificationMetrics.fMeasureByThreshold = fMeasureByThreshold
BinaryClassificationMetrics.precisionByThreshold = precisionByThreshold
BinaryClassificationMetrics.recallByThreshold = recallByThreshold
BinaryClassificationMetrics.getMetricsByThreshold = getMetricsByThreshold
BinaryClassificationMetrics.confusionMatrix = confusionMatrix
BinaryClassificationMetrics.plot_roc_curve = plot_roc_curve
BinaryClassificationMetrics.plot_pr_curve = plot_pr_curve
BinaryClassificationMetrics.print_confusion_matrix = print_confusion_matrix
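# --- Illustrative usage (not part of the library) ---
# A minimal sketch of how these patched BinaryClassificationMetrics methods might be
# used once the extensions are applied (assuming importing handyspark triggers them).
# `predictions` - a DataFrame with 'probability' and 'label' columns produced by some
# fitted classifier - is a hypothetical name used only for illustration.
#
# import handyspark
# from pyspark.mllib.evaluation import BinaryClassificationMetrics
#
# bcm = BinaryClassificationMetrics(predictions, scoreCol='probability', labelCol='label')
# bcm.getMetricsByThreshold().show(5)              # threshold, fpr, recall, precision
# print(bcm.print_confusion_matrix(threshold=0.4)) # pandas DataFrame with Actual/Predicted
# ax = bcm.plot_roc_curve()                        # uses handyspark.plot.roc_curve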
================================================
FILE: handyspark/extensions/types.py
================================================
from pyspark.sql.types import AtomicType, ArrayType, MapType
@classmethod
def ret(cls, expr):
"""Assigns a return type to the expression when used inside an `assign` method.
"""
return expr, cls.typeName()
AtomicType.ret = ret
def ret(self, expr):
"""Assigns a return type to the expression when used inside an `assign` method.
"""
return expr, self.simpleString()
ArrayType.ret = ret
MapType.ret = ret
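# --- Illustrative usage (not part of the library) ---
# A hedged sketch of how `ret` tags an expression with its Spark return type for use in
# HandyFrame.assign; the HandyFrame `hdf` and its 'Fare' column are assumptions made
# only for illustration.
#
# import numpy as np
# from pyspark.sql.types import DoubleType
# hdf = hdf.assign(log_fare=DoubleType.ret(lambda df: np.log(df['Fare'] + 1)))
# DoubleType.ret(expr) returns the tuple (expr, 'double'), telling the assign
# machinery which Spark type the generated column should have.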
================================================
FILE: handyspark/ml/__init__.py
================================================
from handyspark.ml.base import HandyFencer, HandyImputer
__all__ = [
'HandyFencer', 'HandyImputer'
]
================================================
FILE: handyspark/ml/base.py
================================================
import json
from pyspark.ml.base import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.ml.param import *
from pyspark.sql import functions as F
class HandyTransformers(object):
"""Generates transformers to be used in pipelines.
Available transformers:
imputer: Transformer
Imputation transformer for completing missing values.
fencer: Transformer
Fencer transformer for capping outliers according to lower and upper fences.
"""
def __init__(self, df):
self._df = df
self._handy = df._handy
def imputer(self):
"""
Generates a transformer to impute missing values, using values
from the HandyFrame
"""
return HandyImputer().setDictValues(self._df.statistics_)
def fencer(self):
"""
Generates a transformer to fence outliers, using statistics
from the HandyFrame
"""
return HandyFencer().setDictValues(self._df.fences_)
class HasDict(Params):
"""Mixin for a Dictionary parameter.
It dumps the dictionary into a JSON string for storage and
reloads it whenever needed.
"""
dictValues = Param(Params._dummy(), "dictValues", "Dictionary values", typeConverter=TypeConverters.toString)
def __init__(self):
super(HasDict, self).__init__()
self._setDefault(dictValues='{}')
def setDictValues(self, value):
"""
Sets the value of :py:attr:`dictValues`.
"""
if isinstance(value, dict):
value = json.dumps(value).replace('\'', '"')
return self._set(dictValues=value)
def getDictValues(self):
"""
Gets the value of dictValues or its default value.
"""
values = self.getOrDefault(self.dictValues)
return json.loads(values)
class HandyImputer(Transformer, HasDict, DefaultParamsReadable, DefaultParamsWritable):
"""Imputation transformer for completing missing values.
Attributes
----------
statistics : dict
The imputation fill value for each feature. If stratified, first level keys are
filter clauses for stratification.
"""
def _transform(self, dataset):
# Loads dictionary with values for imputation
fillingValues = self.getDictValues()
items = fillingValues.items()
target = dataset
# Loops over columns...
for colname, v in items:
# If value is another dictionary, it means we're dealing with
# stratified imputation - the key is the filtering clause
# and its value is going to be used for imputation
if isinstance(v, dict):
clauses = v.keys()
whens = ' '.join(['WHEN (({clause}) AND (isnan({col}) OR isnull({col}))) THEN {quote}{filling}{quote}'
.format(clause=clause, col=colname, filling=v[clause],
quote='"' if isinstance(v[clause], str) else '')
for clause in clauses])
# Otherwise uses the non-stratified dictionary to fill the values
else:
whens = ('WHEN (isnan({col}) OR isnull({col})) THEN {quote}{filling}{quote}'
.format(col=colname, filling=v,
quote='"' if isinstance(v, str) else ''))
expression = F.expr('CASE {expr} ELSE {col} END'.format(expr=whens, col=colname))
target = target.withColumn(colname, expression)
# If it is a HandyFrame, make it a regular DataFrame
try:
target = target.notHandy()
except AttributeError:
pass
return target
@property
def statistics(self):
return self.getDictValues()
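# --- Illustrative usage (not part of the library) ---
# A minimal sketch of the dictionary format HandyImputer stores: plain values for
# regular imputation and nested {filter clause: value} dicts for stratified imputation.
# The column names, clauses and `some_spark_df` below are assumptions for illustration.
#
# imputer = HandyImputer().setDictValues({'Age': 29.0,
#                                         'Embarked': {'Pclass == "1"': 'C',
#                                                      'Pclass == "3"': 'S'}})
# imputed_df = imputer.transform(some_spark_df)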
class HandyFencer(Transformer, HasDict, DefaultParamsReadable, DefaultParamsWritable):
"""Fencer transformer for capping outliers according to lower and upper fences.
Attributes
----------
fences : dict
The fence values for each feature. If stratified, first level keys are
filter clauses for stratification.
"""
def _transform(self, dataset):
# Loads dictionary with values for fencing
fences = self.getDictValues()
items = fences.items()
target = dataset
for colname, v in items:
# If value is another dictionary, it means we're dealing with
# stratified fencing - the key is the filtering clause
# and its value holds the lower and upper fences
if isinstance(v, dict):
clauses = v.keys()
whens1 = ' '.join(['WHEN ({clause}) THEN greatest({col}, {fence})'.format(clause=clause,
col=colname,
fence=v[clause][0])
for clause in clauses])
whens2 = ' '.join(['WHEN ({clause}) THEN least({col}, {fence})'.format(clause=clause,
col=colname,
fence=v[clause][1])
for clause in clauses])
expression1 = F.expr('CASE {} END'.format(whens1))
expression2 = F.expr('CASE {} END'.format(whens2))
# Otherwise uses the non-stratified fence values directly
else:
expression1 = F.expr('greatest({col}, {fence})'.format(col=colname, fence=v[0]))
expression2 = F.expr('least({col}, {fence})'.format(col=colname, fence=v[1]))
target = target.withColumn(colname, expression1).withColumn(colname, expression2)
# If it is a HandyFrame, make it a regular DataFrame
try:
target = target.notHandy()
except AttributeError:
pass
return target
@property
def fences(self):
return self.getDictValues()
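# --- Illustrative usage (not part of the library) ---
# A hedged sketch of the intended workflow: compute fences on a HandyFrame and reuse
# them on new data through the generated transformer. `hdf` and `new_df` are assumed
# names for illustration only.
#
# hdf = hdf.fence(['Fare'])                 # caps outliers and stores fences in fences_
# fencer = hdf.transformers.fencer()        # HandyFencer carrying those same fence values
# capped_new_df = fencer.transform(new_df)  # applies greatest/least capping to new data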
================================================
FILE: handyspark/plot.py
================================================
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from inspect import signature
from handyspark.util import get_buckets, none2zero, ensure_list
from operator import add, itemgetter
from pyspark.ml.feature import Bucketizer
from pyspark.ml.pipeline import Pipeline
from pyspark.sql import functions as F
from matplotlib.artist import setp
import matplotlib as mpl
mpl.rc("lines", markeredgewidth=0.5)
def title_fom_clause(clause):
return clause.replace(' and ', '\n').replace(' == ', '=').replace('"', '')
def consolidate_plots(fig, axs, title, clauses):
axs[0].set_title(title)
fig.tight_layout()
if len(axs) > 1:
assert len(axs) == len(clauses), 'Mismatched number of plots and clauses!'
xlim = list(map(lambda ax: ax.get_xlim(), axs))
xlim = [np.min(list(map(itemgetter(0), xlim))), np.max(list(map(itemgetter(1), xlim)))]
ylim = list(map(lambda ax: ax.get_ylim(), axs))
ylim = [np.min(list(map(itemgetter(0), ylim))), np.max(list(map(itemgetter(1), ylim)))]
for i, ax in enumerate(axs):
subtitle = title_fom_clause(clauses[i])
ax.set_title(subtitle, fontdict={'fontsize': 10})
ax.set_xlim(xlim)
ax.set_ylim(ylim)
#if ax.colNum > 0:
# ax.get_yaxis().set_visible(False)
#if ax.rowNum < (ax.numRows - 1):
# ax.get_xaxis().set_visible(False)
if isinstance(title, list):
title = ', '.join(title)
fig.suptitle(title)
fig.tight_layout()
fig.subplots_adjust(top=0.85)
return fig, axs
### Correlations
def plot_correlations(pdf, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1)
return sns.heatmap(round(pdf,2), annot=True, cmap="coolwarm", fmt='.2f', linewidths=.05, ax=ax)
### Scatterplot
def strat_scatterplot(sdf, col1, col2, n=30):
stages = []
for col in [col1, col2]:
splits = np.linspace(*sdf.agg(F.min(col), F.max(col)).rdd.map(tuple).collect()[0], n + 1)
bucket_name = '__{}_bucket'.format(col)
stages.append(Bucketizer(splits=splits,
inputCol=col,
outputCol=bucket_name,
handleInvalid="skip"))
pipeline = Pipeline(stages=stages)
model = pipeline.fit(sdf)
return model, sdf.count()
def scatterplot(sdf, col1, col2, n=30, ax=None):
strat_ax, data = sdf._get_strata()
if data is None:
data = strat_scatterplot(sdf, col1, col2, n)
else:
ax = strat_ax
model, total = data
if ax is None:
fig, ax = plt.subplots(1, 1)
axes = ensure_list(ax)
clauses = sdf._handy._strata_raw_clauses
if not len(clauses):
clauses = [None]
bucket_name1, bucket_name2 = '__{}_bucket'.format(col1), '__{}_bucket'.format(col2)
strata = sdf._handy.strata_colnames
colnames = strata + [bucket_name1, bucket_name2]
result = model.transform(sdf).select(colnames).groupby(colnames).agg(F.count('*').alias('count')).toPandas().sort_values(by=colnames)
splits = [bucket.getSplits() for bucket in model.stages]
splits = [list(map(np.mean, zip(split[1:], split[:-1]))) for split in splits]
splits1 = pd.DataFrame({bucket_name1: np.arange(0, n), col1: splits[0]})
splits2 = pd.DataFrame({bucket_name2: np.arange(0, n), col2: splits[1]})
df_counts = result.merge(splits1).merge(splits2)[strata + [col1, col2, 'count']].rename(columns={'count': 'Proportion'})
df_counts.loc[:, 'Proportion'] = df_counts.Proportion.apply(lambda p: round(p / total, 4))
for ax, clause in zip(axes, clauses):
data = df_counts
if clause is not None:
data = data.query(clause)
sns.scatterplot(data=data,
x=col1,
y=col2,
size='Proportion',
ax=ax,
legend=False)
if len(axes) == 1:
axes = axes[0]
return axes
### Histogram
def strat_histogram(sdf, colname, bins=10, categorical=False):
if categorical:
result = sdf.cols[colname]._value_counts(dropna=False, raw=True)
if hasattr(result.index, 'levels'):
indexes = pd.MultiIndex.from_product(result.index.levels[:-1] +
[result.reset_index()[colname].unique().tolist()],
names=result.index.names)
result = (pd.DataFrame(index=indexes)
.join(result.to_frame(), how='left')
.fillna(0)[result.name]
.astype(result.dtype))
start_values = result.index.tolist()
else:
bucket_name = '__{}_bucket'.format(colname)
strata = sdf._handy.strata_colnames
colnames = strata + ensure_list(bucket_name)
start_values = np.linspace(*sdf.agg(F.min(colname), F.max(colname)).rdd.map(tuple).collect()[0], bins + 1)
bucketizer = Bucketizer(splits=start_values, inputCol=colname, outputCol=bucket_name, handleInvalid="skip")
result = (bucketizer
.transform(sdf)
.select(colnames)
.groupby(colnames)
.agg(F.count('*').alias('count'))
.toPandas()
.sort_values(by=colnames))
indexes = pd.DataFrame({bucket_name: np.arange(0, bins), 'bucket': start_values[:-1]})
if len(strata):
indexes = (indexes
.assign(key=1)
.merge(result[strata].drop_duplicates().assign(key=1), on='key')
.drop(columns=['key']))
result = indexes.merge(result, how='left', on=strata + [bucket_name]).fillna(0)[strata + [bucket_name, 'count']]
return start_values, result
def histogram(sdf, colname, bins=10, categorical=False, ax=None):
strat_ax, data = sdf._get_strata()
if data is None:
data = strat_histogram(sdf, colname, bins, categorical)
else:
ax = strat_ax
start_values, counts = data
if ax is None:
fig, ax = plt.subplots(1, 1)
axes = ensure_list(ax)
clauses = sdf._handy._strata_raw_clauses
if not len(clauses):
clauses = [None]
for ax, clause in zip(axes, clauses):
if categorical:
pdf = counts.sort_index().to_frame()
if clause is not None:
pdf = pdf.query(clause).reset_index(sdf._handy.strata_colnames).drop(columns=sdf._handy.strata_colnames)
pdf.iloc[:bins].plot(kind='bar', color='C0', legend=False, rot=0, ax=ax, title=colname)
else:
mid_point_bins = start_values[:-1]
weights = counts
if clause is not None:
weights = counts.query(clause)
ax.hist(mid_point_bins, bins=start_values, weights=weights['count'].values)
ax.set_title(colname)
if len(axes) == 1:
axes = axes[0]
return axes
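# --- Illustrative usage (not part of the library) ---
# A minimal sketch of calling `histogram` directly; it expects a HandyFrame (it uses
# `_get_strata` and `_handy` internally), so a plain Spark DataFrame must be converted
# first. The 'Fare' column and `some_spark_df` are assumptions for illustration.
#
# hdf = some_spark_df.toHandy()
# ax = histogram(hdf, 'Fare', bins=20, categorical=False)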
### Boxplot
def _gen_dict(rc_name, properties):
""" Loads properties in the dictionary from rc file if not already
in the dictionary"""
rc_str = 'boxplot.{0}.{1}'
dictionary = dict()
for prop_dict in properties:
dictionary.setdefault(prop_dict,
plt.rcParams[rc_str.format(rc_name, prop_dict)])
return dictionary
def draw_boxplot(ax, stats):
flier_props = ['color', 'marker', 'markerfacecolor', 'markeredgecolor',
'markersize', 'linestyle', 'linewidth']
default_props = ['color', 'linewidth', 'linestyle']
boxprops = _gen_dict('boxprops', default_props)
whiskerprops = _gen_dict('whiskerprops', default_props)
capprops = _gen_dict('capprops', default_props)
medianprops = _gen_dict('medianprops', default_props)
meanprops = _gen_dict('meanprops', default_props)
flierprops = _gen_dict('flierprops', flier_props)
props = dict(boxprops=boxprops,
flierprops=flierprops,
medianprops=medianprops,
meanprops=meanprops,
capprops=capprops,
whiskerprops=whiskerprops)
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b',
'#e377c2', '#7f7f7f', '#bcbd22', '#17becf', '#1f77b4']
bp = ax.bxp(stats, **props)
ax.grid(True)
setp(bp['boxes'], color=colors[0], alpha=1)
setp(bp['whiskers'], color=colors[0], alpha=1)
setp(bp['medians'], color=colors[2], alpha=1)
return ax
def boxplot(sdf, colnames, ax=None, showfliers=True, k=1.5, precision=.0001):
strat_ax, data = sdf._get_strata()
if data is None:
if ax is None:
fig, ax = plt.subplots(1, 1)
title_clauses = sdf._handy._strata_clauses
if not len(title_clauses):
title_clauses = [None]
pdf = sdf._handy._calc_fences(colnames, k, precision)
stats = []
for colname in colnames:
items, _, _ = sdf._handy._calc_bxp_stats(pdf, colname, showfliers=showfliers)
for title_clause, item in zip(title_clauses, items):
name = colname if len(colnames) > 1 else (title_fom_clause(title_clause) if title_clause is not None else colname)
item.update({'label': name})
# each list of items corresponds to a different column
stats.append(items)
# Stats is a list of columns, containing each a list of clauses
if ax is not None:
if title_clauses[0] is None:
if len(colnames) == 1:
stats = stats[0]
else:
stats = np.squeeze(stats).tolist()
return draw_boxplot(ax, stats)
else:
if len(strat_ax) > 1:
stats = [[stats[j][i] for j in range(len(stats))] for i in range(len(title_clauses))]
return stats
def post_boxplot(axs, stats):
new_res = []
for ax, stat in zip(axs, stats):
ax = draw_boxplot(ax, stat)
new_res.append(ax)
return new_res
def roc_curve(fpr, tpr, roc_auc, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1)
ax.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.4f)' % roc_auc)
ax.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('Receiver Operating Characteristic Curve')
ax.legend(loc="lower right")
return ax
def pr_curve(precision, recall, pr_auc, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1)
# In matplotlib < 1.5, plt.fill_between does not have a 'step' argument
step_kwargs = ({'step': 'post'}
if 'step' in signature(plt.fill_between).parameters
else {})
ax.step(recall, precision, color='b', alpha=0.2, where='post', label='PR curve (area = %0.4f)' % pr_auc)
ax.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_ylim([0.0, 1.05])
ax.set_xlim([0.0, 1.0])
ax.legend(loc="lower left")
ax.set_title('Precision-Recall Curve')
return ax
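# --- Illustrative usage (not part of the library) ---
# `roc_curve` and `pr_curve` are plain matplotlib helpers, so they can be exercised
# with synthetic arrays; the numbers below are made up for illustration only.
#
# import numpy as np
# fpr = np.array([0.0, 0.1, 0.3, 1.0])
# tpr = np.array([0.0, 0.6, 0.9, 1.0])
# ax = roc_curve(fpr, tpr, roc_auc=0.85)
# recall = np.array([0.0, 0.5, 0.8, 1.0])
# precision = np.array([1.0, 0.9, 0.7, 0.5])
# ax = pr_curve(precision, recall, pr_auc=0.78)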
================================================
FILE: handyspark/sql/__init__.py
================================================
from handyspark.sql.dataframe import HandyFrame, Bucket, Quantile, DataFrame
from handyspark.sql.schema import generate_schema
__all__ = [
'HandyFrame', 'Bucket', 'Quantile', 'generate_schema'
]
================================================
FILE: handyspark/sql/dataframe.py
================================================
from copy import deepcopy
from handyspark.ml.base import HandyTransformers
from handyspark.plot import histogram, boxplot, scatterplot, strat_scatterplot, strat_histogram,\
consolidate_plots, post_boxplot
from handyspark.sql.pandas import HandyPandas
from handyspark.sql.transform import _MAPPING, HandyTransform
from handyspark.util import HandyException, dense_to_array, disassemble, ensure_list, check_columns, \
none2default
import inspect
from matplotlib.axes import Axes
from collections import OrderedDict
import matplotlib.pyplot as plt
import numpy as np
from operator import itemgetter, add
import pandas as pd
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import Bucketizer
from pyspark.mllib.stat import Statistics
from pyspark.sql import DataFrame, GroupedData, Window, functions as F, Column, Row
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA
from pyspark.ml.pipeline import Pipeline
from scipy.stats import chi2
from scipy.linalg import inv
def toHandy(self):
"""Converts Spark DataFrame into HandyFrame.
"""
return HandyFrame(self)
def notHandy(self):
return self
DataFrame.toHandy = toHandy
DataFrame.notHandy = notHandy
def agg(f):
f.__is_agg = True
return f
def inccol(f):
f.__is_inccol = True
return f
class Handy(object):
def __init__(self, df):
self._df = df
# classification
self._is_classification = False
self._nclasses = None
self._classes = None
# transformers
self._imputed_values = {}
self._fenced_values = {}
# groups / strata
self._group_cols = None
self._strata = None
self._strata_object = None
self._strata_plot = None
self._clear_stratification()
self._safety_limit = 1000
self._safety = True
self._update_types()
def __deepcopy__(self, memo):
cls = self.__class__
result = cls.__new__(cls)
memo[id(self)] = result
for k, v in self.__dict__.items():
if k not in ['_df', '_strata_object', '_strata_plot']:
setattr(result, k, deepcopy(v, memo))
return result
def __getitem__(self, *args):
if isinstance(args[0], tuple):
args = args[0]
item = args[0]
n = 20
if len(args) > 1:
n = args[1]
if n is None:
n = -1
if isinstance(item, int):
idx = item + (len(self._group_cols) if self._group_cols is not None else 0)
assert idx < len(self._df.columns), "Invalid column index {}".format(idx)
item = list(self._df.columns)[idx]
if isinstance(item, str):
if self._group_cols is None or len(self._group_cols) == 0:
res = self._take_array(item, n)
if res.ndim > 1:
res = res.tolist()
res = pd.Series(res, name=item)
if self._strata is not None:
strata = list(map(lambda v: v[1].to_dict(), self.strata.iterrows()))
if len(strata) == len(res):
res = pd.concat([pd.DataFrame(strata), res], axis=1).set_index(self._strata).sort_index()
return res
else:
check_columns(self._df, list(self._group_cols) + [item])
pdf = self._df.notHandy().select(list(self._group_cols) + [item])
if n != -1:
pdf = pdf.limit(n)
res = pdf.toPandas().set_index(list(self._group_cols)).sort_index()[item]
return res
@property
def stages(self):
return (len(list(filter(lambda v: '+' == v,
map(lambda s: s.strip()[0],
self._df.rdd.toDebugString().decode().split('\n'))))) + 1)
@property
def statistics_(self):
return self._imputed_values
@property
def fences_(self):
return self._fenced_values
@property
def is_classification(self):
return self._is_classification
@property
def classes(self):
return self._classes
@property
def nclasses(self):
return self._nclasses
@property
def response(self):
return self._response
@property
def ncols(self):
return len(self._types)
@property
def nrows(self):
return self._df.count()
@property
def shape(self):
return (self.nrows, self.ncols)
@property
def strata(self):
if self._strata is not None:
return pd.DataFrame(data=self._strata_combinations, columns=self._strata)
@property
def strata_colnames(self):
if self._strata is not None:
return list(map(str, ensure_list(self._strata)))
else:
return []
def _stratify(self, strata):
return HandyStrata(self, strata)
def _clear_stratification(self):
self._strata = None
self._strata_object = None
self._strata_plot = None
self._strata_combinations = []
self._strata_raw_combinations = []
self._strata_clauses = []
self._strata_raw_clauses = []
self._n_cols = 1
self._n_rows = 1
def _set_stratification(self, strata, raw_combinations, raw_clauses, combinations, clauses):
if strata is not None:
assert len(combinations[0]) == len(strata), "Mismatched number of combinations and strata!"
self._strata = strata
self._strata_raw_combinations = raw_combinations
self._strata_raw_clauses = raw_clauses
self._strata_combinations = combinations
self._strata_clauses = clauses
self._n_cols = len(set(map(itemgetter(0), combinations)))
try:
self._n_rows = len(set(map(itemgetter(1), combinations)))
except IndexError:
self._n_rows = 1
def _build_strat_plot(self, n_rows, n_cols, **kwargs):
fig, axs = plt.subplots(n_rows, n_cols, **kwargs)
if n_rows == 1:
axs = [axs]
if n_cols == 1:
axs = [axs]
self._strata_plot = (fig, [ax for col in np.transpose(axs) for ax in col])
def _update_types(self):
self._types = list(map(lambda t: (t.name, t.dataType.typeName()), self._df.schema.fields))
self._numerical = list(map(itemgetter(0), filter(lambda t: t[1] in ['byte', 'short', 'integer', 'long',
'float', 'double'], self._types)))
self._continuous = list(map(itemgetter(0), filter(lambda t: t[1] in ['double', 'float'], self._types)))
self._categorical = list(map(itemgetter(0), filter(lambda t: t[1] in ['byte', 'short', 'integer', 'long',
'boolean', 'string'], self._types)))
self._array = list(map(itemgetter(0), filter(lambda t: t[1] in ['array', 'map'], self._types)))
self._string = list(map(itemgetter(0), filter(lambda t: t[1] in ['string'], self._types)))
def _take_array(self, colname, n):
check_columns(self._df, colname)
datatype = self._df.notHandy().select(colname).schema.fields[0].dataType.typeName()
rdd = self._df.notHandy().select(colname).rdd.map(itemgetter(0))
if n == -1:
data = rdd.collect()
else:
data = rdd.take(n)
return np.array(data, dtype=_MAPPING.get(datatype, 'object'))
def _value_counts(self, colnames, dropna=True, raw=False):
colnames = ensure_list(colnames)
strata = self.strata_colnames
colnames = strata + colnames
check_columns(self._df, colnames)
data = self._df.notHandy().select(colnames)
if dropna:
data = data.dropna()
values = (data.groupby(colnames).agg(F.count('*').alias('value_counts'))
.toPandas().set_index(colnames).sort_index()['value_counts'])
if not raw:
for level, col in enumerate(ensure_list(self._strata)):
if not isinstance(col, str):
values.index.set_levels(pd.Index(col._clauses[1:-1]), level=level, inplace=True)
values.index.set_names(col.colname, level=level, inplace=True)
return values
def _fillna(self, target, values):
assert isinstance(target, DataFrame), "Target must be a DataFrame"
items = values.items()
for colname, v in items:
if isinstance(v, dict):
clauses = v.keys()
whens = ' '.join(['WHEN (({clause}) AND (isnan({col}) OR isnull({col}))) THEN {quote}{filling}{quote}'
.format(clause=clause, col=colname, filling=v[clause],
quote='"' if isinstance(v[clause], str) else '')
for clause in clauses])
else:
whens = ('WHEN (isnan({col}) OR isnull({col})) THEN {quote}{filling}{quote}'
.format(col=colname, filling=v,
quote='"' if isinstance(v, str) else ''))
expression = F.expr('CASE {expr} ELSE {col} END'.format(expr=whens, col=colname))
target = target.withColumn(colname, expression)
return target
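# Example of the SQL generated by _fillna (illustrative, for a hypothetical
# {'Age': 29.0} dictionary without stratification):
#   CASE WHEN (isnan(Age) OR isnull(Age)) THEN 29.0 ELSE Age END
# String fillings are double-quoted instead, and stratified fillings add the
# filter clause to each WHEN condition.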
def __stat_to_dict(self, colname, stat):
if len(self._strata_clauses):
if isinstance(stat, pd.Series):
stat = stat.to_frame(colname)
return {clause: stat.query(raw_clause)[colname].iloc[0]
for clause, raw_clause in zip(self._strata_clauses, self._strata_raw_clauses)}
else:
return stat[colname]
def _fill_values(self, continuous, categorical, strategy):
values = {}
colnames = list(map(itemgetter(0), filter(lambda t: t[1] == 'mean', zip(continuous, strategy))))
values.update(dict([(col, self.__stat_to_dict(col, self.mean(col))) for col in colnames]))
colnames = list(map(itemgetter(0), filter(lambda t: t[1] == 'median', zip(continuous, strategy))))
values.update(dict([(col, self.__stat_to_dict(col, self.median(col))) for col in colnames]))
values.update(dict([(col, self.__stat_to_dict(col, self.mode(col)))
for col in categorical if col in self._categorical]))
return values
def __fill_self(self, continuous, categorical, strategy):
continuous = ensure_list(continuous)
categorical = ensure_list(categorical)
check_columns(self._df, continuous + categorical)
strategy = none2default(strategy, 'mean')
if continuous == ['all']:
continuous = self._continuous
if categorical == ['all']:
categorical = self._categorical
if isinstance(strategy, (list, tuple)):
assert len(continuous) == len(strategy), "There must be a strategy to each column."
else:
strategy = [strategy] * len(continuous)
values = self._fill_values(continuous, categorical, strategy)
self._imputed_values.update(values)
res = HandyFrame(self._fillna(self._df, values), self)
return res
def _dense_to_array(self, colname, array_colname):
check_columns(self._df, colname)
res = dense_to_array(self._df.notHandy(), colname, array_colname)
return HandyFrame(res, self)
def _agg(self, name, func, colnames):
colnames = none2default(colnames, self._df.columns)
colnames = ensure_list(colnames)
check_columns(self._df, self.strata_colnames + [col for col in colnames if not isinstance(col, Column)])
if func is None:
func = getattr(F, name)
res = (self._df.notHandy()
.groupby(self.strata_colnames)
.agg(*(func(col).alias(str(col)) for col in colnames if str(col) not in self.strata_colnames))
.toPandas())
if len(res) == 1:
res = res.iloc[0]
res.name = name
return res
def _calc_fences(self, colnames, k=1.5, precision=.01):
colnames = none2default(colnames, self._numerical)
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
strata = self.strata_colnames
pdf = (self._df.notHandy()
.groupby(strata)
.agg(F.count(F.lit(1)).alias('nrows'),
*[F.expr('approx_percentile({}, {}, {})'.format(c, q, 1./precision)).alias('{}_{}%'.format(c, int(q * 100)))
for q in [.25, .50, .75] for c in colnames],
*[F.mean(c).alias('{}_mean'.format(c)) for c in colnames]).toPandas())
for col in colnames:
pdf.loc[:, '{}_iqr'.format(col)] = pdf.loc[:, '{}_75%'.format(col)] - pdf.loc[:, '{}_25%'.format(col)]
pdf.loc[:, '{}_lfence'.format(col)] = pdf.loc[:, '{}_25%'.format(col)] - k * pdf.loc[:, '{}_iqr'.format(col)]
pdf.loc[:, '{}_ufence'.format(col)] = pdf.loc[:, '{}_75%'.format(col)] + k * pdf.loc[:, '{}_iqr'.format(col)]
return pdf
def _calc_mahalanobis_distance(self, colnames, output_col='__mahalanobis'):
"""Computes Mahalanobis distance from origin
"""
sdf = self._df.notHandy()
check_columns(sdf, colnames)
# Builds pipeline to assemble feature columns and scale them
assembler = VectorAssembler(inputCols=colnames, outputCol='__features')
scaler = StandardScaler(inputCol='__features', outputCol='__scaled', withMean=True)
pipeline = Pipeline(stages=[assembler, scaler])
features = pipeline.fit(sdf).transform(sdf)
# Computes correlation between features and inverts it
# Since we scaled the features, we can assume they have unit variance
# and therefore, correlation and covariance matrices are the same!
mat = Correlation.corr(features, '__scaled').head()[0].toArray()
inv_mat = inv(mat)
# Builds Pandas UDF to compute Mahalanobis distance from origin
# sqrt((V - 0) * inv_M * (V - 0))
try:
import pyarrow
# Pandas UDF: v is a pandas Series of arrays; returns one distance per row
@F.pandas_udf('double')
def pudf_mult(v):
return v.apply(lambda v: np.sqrt(np.dot(np.dot(v, inv_mat), v)))
except Exception:
# Fallback plain UDF: v is a single array, so the distance is computed directly
@F.udf('double')
def pudf_mult(v):
return float(np.sqrt(np.dot(np.dot(v, inv_mat), v)))
# Convert feature vector into array
features = dense_to_array(features, '__scaled', '__array_scaled')
# Computes Mahalanobis distance and flags as outliers all elements above critical value
distance = (features
.withColumn('__mahalanobis', pudf_mult('__array_scaled'))
.drop('__features', '__scaled', '__array_scaled'))
return distance
def _set_mahalanobis_outliers(self, colnames, critical_value=.999,
input_col='__mahalanobis', output_col='__outlier'):
"""Compares Mahalanobis distances to critical values using
Chi-Squared distribution to identify possible outliers.
"""
distance = self._calc_mahalanobis_distance(colnames)
# Computes critical value
critical_value = chi2.ppf(critical_value, len(colnames))
# Computes Mahalanobis distance and flags as outliers all elements above critical value
outlier = (distance.withColumn(output_col, F.col(input_col) > critical_value))
return outlier
def _calc_bxp_stats(self, fences_df, colname, showfliers=False):
strata = self.strata_colnames
clauses = self._strata_raw_clauses
if not len(clauses):
clauses = [None]
qnames = ['25%', '50%', '75%', 'mean', 'lfence', 'ufence']
col_summ = fences_df[strata + ['{}_{}'.format(colname, q) for q in qnames] + ['nrows']]
col_summ.columns = strata + qnames + ['nrows']
if len(strata):
col_summ = col_summ.set_index(strata)
lfence, ufence = col_summ[['lfence']], col_summ[['ufence']]
expression = None
for clause in clauses:
if clause is not None:
partial = F.col(colname).between(lfence.query(clause).iloc[0, 0], ufence.query(clause).iloc[0, 0])
partial &= F.expr(clause)
else:
partial = F.col(colname).between(lfence.iloc[0, 0], ufence.iloc[0, 0])
if expression is None:
expression = partial
else:
expression |= partial
outlier = self._df.notHandy().withColumn('__{}_outlier'.format(colname), ~expression)
minmax = (outlier
.filter('not __{}_outlier'.format(colname))
.groupby(strata)
.agg(F.min(colname).alias('min'),
F.max(colname).alias('max'))
.toPandas())
if len(strata):
minmax = [minmax.query(clause).iloc[0][['min', 'max']].values for clause in clauses]
else:
minmax = [minmax.iloc[0][['min', 'max']].values]
fliers_df = outlier.filter('__{}_outlier'.format(colname))
fliers_df = [fliers_df.filter(clause) for clause in clauses] if len(strata) else [fliers_df]
fliers_count = [df.count() for df in fliers_df]
if showfliers:
fliers = [(df
.select(F.abs(F.col(colname)).alias(colname))
.orderBy(F.desc(colname))
.limit(1000)
.toPandas()[colname].values) for df in fliers_df]
else:
fliers = [[]] * len(clauses)
stats = [] # each item corresponds to a different clause - all items belong to the same column
nrows = []
for clause, whiskers, outliers in zip(clauses, minmax, fliers):
summary = col_summ
if clause is not None:
summary = summary.query(clause)
item = {'mean': summary['mean'].values[0],
'med': summary['50%'].values[0],
'q1': summary['25%'].values[0],
'q3': summary['75%'].values[0],
'whislo': whiskers[0],
'whishi': whiskers[1],
'fliers': outliers}
stats.append(item)
nrows.append(summary['nrows'].values[0])
if not len(nrows):
nrows = summary['nrows'].values[0]
return stats, fliers_count, nrows
def set_response(self, colname):
check_columns(self._df, colname)
self._response = colname
if colname is not None:
if colname not in self._continuous:
self._is_classification = True
self._classes = self._df.notHandy().select(colname).rdd.map(itemgetter(0)).distinct().collect()
self._nclasses = len(self._classes)
return self
def disassemble(self, colname, new_colnames=None):
check_columns(self._df, colname)
res = disassemble(self._df.notHandy(), colname, new_colnames)
return HandyFrame(res, self)
def to_metrics_RDD(self, prob_col, label):
check_columns(self._df, [prob_col, label])
return self.disassemble(prob_col).select('{}_1'.format(prob_col), F.col(label).cast('double')).rdd.map(tuple)
def corr(self, colnames=None, method='pearson'):
colnames = none2default(colnames, self._numerical)
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
if self._strata is not None:
colnames = sorted([col for col in colnames if col not in self.strata_colnames])
correlations = Statistics.corr(self._df.notHandy().select(colnames).dropna().rdd.map(lambda row: row[0:]), method=method)
pdf = pd.DataFrame(correlations, columns=colnames, index=colnames)
return pdf
def fill(self, *args, continuous=None, categorical=None, strategy=None):
if len(args) and isinstance(args[0], DataFrame):
return self._fillna(args[0], self._imputed_values)
else:
return self.__fill_self(continuous=continuous, categorical=categorical, strategy=strategy)
@agg
def isnull(self, ratio=False):
def func(colname):
return F.sum(F.isnull(colname).cast('int')).alias(colname)
name = 'missing'
if ratio:
name += '(ratio)'
missing = self._agg(name, func, self._df.columns)
if ratio:
nrows = self._agg('nrows', F.sum, F.lit(1))
if isinstance(missing, pd.Series):
missing = missing / nrows["Column<b'1'>"]
else:
missing.iloc[:, 1:] = missing.iloc[:, 1:].values / nrows["Column<b'1'>"].values.reshape(-1, 1)
if len(self.strata_colnames):
missing = missing.set_index(self.strata_colnames).T.unstack()
missing.name = name
return missing
@agg
def nunique(self, colnames=None):
res = self._agg('nunique', F.approx_count_distinct, colnames)
if len(self.strata_colnames):
res = res.set_index(self.strata_colnames).T.unstack()
res.name = 'nunique'
return res
def outliers(self, colnames=None, ratio=False, method='tukey', **kwargs):
colnames = none2default(colnames, self._numerical)
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
res = None
if method == 'tukey':
outliers = []
try:
k = float(kwargs['k'])
except KeyError:
k = 1.5
fences_df = self._calc_fences(colnames, k=k, precision=.01)
index = fences_df[self.strata_colnames].set_index(self.strata_colnames).index \
if len(self.strata_colnames) else None
for colname in colnames:
stats, counts, nrows = self._calc_bxp_stats(fences_df, colname, showfliers=False)
outliers.append(pd.Series(counts, index=index, name=colname))
if ratio:
outliers[-1] /= nrows
res = pd.DataFrame(outliers).unstack()
if not len(self.strata_colnames):
res = res.droplevel(0)
name = 'outliers'
if ratio:
name += '(ratio)'
res.name = name
return res
def get_outliers(self, colnames=None, critical_value=.999):
colnames = none2default(colnames, self._numerical)
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
outliers = self._set_mahalanobis_outliers(colnames, critical_value)
df = outliers.filter('__outlier').orderBy(F.desc('__mahalanobis')).drop('__outlier', '__mahalanobis')
return HandyFrame(df, self)
def remove_outliers(self, colnames=None, critical_value=.999):
colnames = none2default(colnames, self._numerical)
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
outliers = self._set_mahalanobis_outliers(colnames, critical_value)
df = outliers.filter('not __outlier').drop('__outlier', '__mahalanobis')
return HandyFrame(df, self)
def fence(self, colnames, k=1.5):
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
pdf = self._calc_fences(colnames, k=k)
if len(self.strata_colnames):
pdf = pdf.set_index(self.strata_colnames)
df = self._df.notHandy()
for colname in colnames:
lfence, ufence = pdf.loc[:, ['{}_lfence'.format(colname)]], pdf.loc[:, ['{}_ufence'.format(colname)]]
if len(self._strata_raw_clauses):
whens1 = ' '.join(['WHEN ({clause}) THEN greatest({col}, {fence})'.format(clause=clause,
col=colname,
fence=lfence.query(clause).iloc[0, 0])
for clause in self._strata_raw_clauses])
whens2 = ' '.join(['WHEN ({clause}) THEN least({col}, {fence})'.format(clause=clause,
col=colname,
fence=ufence.query(clause).iloc[0, 0])
for clause in self._strata_raw_clauses])
expression1 = F.expr('CASE {} END'.format(whens1))
expression2 = F.expr('CASE {} END'.format(whens2))
self._fenced_values.update({colname: {clause: [lfence.query(clause).iloc[0, 0],
ufence.query(clause).iloc[0, 0]]
for clause in self._strata_clauses}})
else:
self._fenced_values.update({colname: [lfence.iloc[0, 0], ufence.iloc[0, 0]]})
expression1 = F.expr('greatest({col}, {fence})'.format(col=colname, fence=lfence.iloc[0, 0]))
expression2 = F.expr('least({col}, {fence})'.format(col=colname, fence=ufence.iloc[0, 0]))
df = df.withColumn(colname, expression1).withColumn(colname, expression2)
return HandyFrame(df.select(self._df.columns), self)
@inccol
def value_counts(self, colnames, dropna=True):
return self._value_counts(colnames, dropna)
@inccol
def mode(self, colname):
check_columns(self._df, [colname])
if self._strata is None:
values = (self._df.notHandy().select(colname).dropna()
.groupby(colname).agg(F.count('*').alias('mode'))
.orderBy(F.desc('mode')).limit(1)
.toPandas()[colname][0])
return pd.Series(values, index=[colname], name='mode')
else:
strata = self.strata_colnames
colnames = strata + [colname]
values = (self._df.notHandy().select(colnames).dropna()
.groupby(colnames).agg(F.count('*').alias('mode'))
.withColumn('order', F.row_number().over(Window.partitionBy(strata).orderBy(F.desc('mode'))))
.filter('order == 1').drop('order')
.toPandas().set_index(strata).sort_index()[colname])
values.name = 'mode'
return values
@inccol
def entropy(self, colnames):
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
sdf = self._df.notHandy()
n = sdf.count()
entropy = []
for colname in colnames:
if colname in self._categorical:
res = (self._df
.groupby(self.strata_colnames + [colname])
.agg(F.count('*').alias('value_counts')).withColumn('probability', F.col('value_counts') / n)
.groupby(self.strata_colnames)
.agg(F.sum(F.expr('-log2(probability) * probability')).alias(colname))
.safety_off()
.cols[self.strata_colnames + [colname]][:])
if len(self.strata_colnames):
res.set_index(self.strata_colnames, inplace=True)
res = res.unstack()
else:
res = res[colname]
res.index = [colname]
else:
res = pd.Series(None, index=[colname])
res.name = 'entropy'
entropy.append(res)
return pd.concat(entropy).sort_index()
@inccol
def mutual_info(self, colnames):
def distribution(sdf, colnames):
return sdf.groupby(colnames).agg(F.count('*').alias('__count'))
check_columns(self._df, colnames)
n = len(colnames)
probs = []
sdf = self._df.notHandy()
for i in range(n):
probs.append(distribution(sdf, self.strata_colnames + [colnames[i]]))
if len(self.strata_colnames):
nrows = sdf.groupby(self.strata_colnames).agg(F.count('*').alias('__n'))
else:
nrows = sdf.count()
entropies = self.entropy(colnames)
res = []
for i in range(n):
for j in range(i, n):
if i == j:
mi = pd.Series(entropies[colnames[i]], name='mi').to_frame()
else:
tdf = distribution(sdf, self.strata_colnames + [colnames[i], colnames[j]])
if len(self.strata_colnames):
tdf = tdf.join(nrows, on=self.strata_colnames)
else:
tdf = tdf.withColumn('__n', F.lit(nrows))
tdf = tdf.join(probs[i].toDF(*self.strata_colnames, colnames[i], '__count0'), on=self.strata_colnames + [colnames[i]])
tdf = tdf.join(probs[j].toDF(*self.strata_colnames, colnames[j], '__count1'), on=self.strata_colnames + [colnames[j]])
mi = (tdf
.groupby(self.strata_colnames)
.agg(F.sum(F.expr('log2(__count * __n / (__count0 * __count1)) * __count / __n')).alias('mi'))
.toPandas())
if len(self.strata_colnames):
mi.set_index(self.strata_colnames, inplace=True)
res.append(mi.assign(ci=colnames[j], cj=colnames[i]))
res.append(mi.assign(ci=colnames[i], cj=colnames[j]))
res = pd.concat(res).set_index(['ci', 'cj'], append=len(self.strata_colnames)).sort_index()
res = pd.pivot_table(res, index=self.strata_colnames + ['ci'], columns=['cj'])
res.index.names = self.strata_colnames + ['']
res.columns = res.columns.droplevel(0).rename('')
return res
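# Note on the expression used in mutual_info: with __count/__n as the joint
# probability p(x, y) and __count0/__n, __count1/__n as the marginals p(x), p(y),
# the term log2(__count * __n / (__count0 * __count1)) * __count / __n equals
# p(x, y) * log2(p(x, y) / (p(x) * p(y))), so the sum is the usual mutual information
# I(X; Y) = sum over x, y of p(x, y) * log2(p(x, y) / (p(x) * p(y))).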
@agg
def mean(self, colnames):
return self._agg('mean', F.mean, colnames)
@agg
def min(self, colnames):
return self._agg('min', F.min, colnames)
@agg
def max(self, colnames):
return self._agg('max', F.max, colnames)
@agg
def percentile(self, colnames, perc=50, precision=.01):
def func(c):
return F.expr('approx_percentile({}, {}, {})'.format(c, perc/100., 1./precision))
try:
name = {25: 'q1', 50: 'median', 75: 'q3'}[perc]
except KeyError:
name = 'percentile_{}'.format(perc)
return self._agg(name, func, colnames)
@agg
def median(self, colnames, precision=.01):
return self.percentile(colnames, 50, precision)
@agg
def stddev(self, colnames):
return self._agg('stddev', F.stddev, colnames)
@agg
def var(self, colnames):
return self._agg('var', F.stddev, colnames) ** 2
@agg
def q1(self, colnames, precision=.01):
return self.percentile(colnames, 25, precision)
@agg
def q3(self, colnames, precision=.01):
return self.percentile(colnames, 75, precision)
### Boxplot functions
def _strat_boxplot(self, colnames, **kwargs):
n_rows = n_cols = 1
kwds = deepcopy(kwargs)
for kw in ['showfliers', 'precision']:
try:
del kwds[kw]
except KeyError:
pass
if isinstance(colnames, (tuple, list)) and (len(colnames) > 1):
n_rows = self._n_rows
n_cols = self._n_cols
self._build_strat_plot(n_rows, n_cols, **kwds)
return None
@inccol
def boxplot(self, colnames, ax=None, showfliers=True, k=1.5, precision=.01, **kwargs):
colnames = ensure_list(colnames)
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
assert len(colnames), "Only numerical columns can be plot!"
return boxplot(self._df, colnames, ax, showfliers, k, precision)
def _post_boxplot(self, res):
return post_boxplot(self._strata_plot[1], res)
### Scatterplot functions
def _strat_scatterplot(self, colnames, **kwargs):
self._build_strat_plot(self._n_rows, self._n_cols, **kwargs)
return strat_scatterplot(self._df.notHandy(), colnames[0], colnames[1])
@inccol
def scatterplot(self, colnames, ax=None, **kwargs):
assert len(colnames) == 2, "There must be two columns to plot!"
check_columns(self._df, colnames)
colnames = [col for col in colnames if col in self._numerical]
assert len(colnames) == 2, "Both columns must be numerical!"
return scatterplot(self._df, colnames[0], colnames[1], ax=ax)
### Histogram functions
def _strat_hist(self, colname, bins=10, **kwargs):
self._build_strat_plot(self._n_rows, self._n_cols, **kwargs)
categorical = True
if colname in self._continuous:
categorical = False
#res = strat_histogram(self._df.notHandy(), colname, bins, categorical)
res = strat_histogram(self._df, colname, bins, categorical)
self._strata_plot[0].suptitle('')
plt.tight_layout()
return res
@inccol
def hist(self, colname, bins=10, ax=None, **kwargs):
# TO DO
# include split per response/columns
assert len(ensure_list(colname)) == 1, "Only single columns can be plotted!"
check_columns(self._df, colname)
if colname in self._continuous:
return histogram(self._df, colname, bins=bins, categorical=False, ax=ax)
else:
return histogram(self._df, colname, bins=bins, categorical=True, ax=ax)
class HandyGrouped(GroupedData):
def __init__(self, jgd, df, *args):
self._jgd = jgd
self._df = df
self.sql_ctx = df.sql_ctx
self._cols = args
def agg(self, *exprs):
df = super().agg(*exprs)
handy = deepcopy(self._df._handy)
handy._group_cols = self._cols
return HandyFrame(df, handy)
def __repr__(self):
return "HandyGrouped[%s]" % (", ".join("%s" % c for c in self._group_cols))
class HandyFrame(DataFrame):
"""HandySpark version of DataFrame.
Attributes
----------
cols: HandyColumns
class to access pandas-like column based methods implemented in Spark
pandas: HandyPandas
class to access pandas-like column based methods through pandas UDFs
transformers: HandyTransformers
class to generate Handy transformers
stages: integer
number of stages in the execution plan
response: string
name of the response column
is_classification: boolean
True if response is a categorical variable
classes: list
list of classes for a classification problem
nclasses: integer
number of classes for a classification problem
ncols: integer
number of columns of the HandyFrame
nrows: integer
number of rows of the HandyFrame
shape: tuple
tuple representing dimensionality of the HandyFrame
statistics_: dict
imputation fill value for each feature
If stratified, first level keys are filter clauses for stratification
fences_: dict
fence values for each feature
If stratified, first level keys are filter clauses for stratification
is_stratified: boolean
True if HandyFrame was stratified
values: ndarray
Numpy representation of HandyFrame.
Available methods:
- notHandy: makes it a plain Spark dataframe
- stratify: used to perform stratified operations
- isnull: checks for missing values
- fill: fills missing values
- outliers: returns counts of outliers, columnwise, using Tukey's method
- get_outliers: returns list of outliers using Mahalanobis distance
- remove_outliers: filters out outliers using Mahalanobis distance
- fence: fences outliers
- set_safety_limit: defines new safety limit for collect operations
- safety_off: disables safety limit for a single operation
- assign: appends a new columns based on an expression
- nunique: returns number of unique values in each column
- set_response: sets column to be used as response / label
- disassemble: turns a vector / array column into multiple columns
- to_metrics_RDD: turns probability and label columns into a tuple RDD
"""
def __init__(self, df, handy=None):
super().__init__(df._jdf, df.sql_ctx)
if handy is None:
handy = Handy(self)
else:
handy = deepcopy(handy)
handy._df = self
handy._update_types()
self._handy = handy
self._safety = self._handy._safety
self._safety_limit = self._handy._safety_limit
self.__overriden = ['collect', 'take']
self._strat_handy = None
self._strat_index = None
def __getattribute__(self, name):
attr = object.__getattribute__(self, name)
if hasattr(attr, '__call__') and name not in self.__overriden:
def wrapper(*args, **kwargs):
try:
res = attr(*args, **kwargs)
except HandyException as e:
raise HandyException(str(e), summary=False)
except Exception as e:
raise HandyException(str(e), summary=True)
if name != 'notHandy':
if not isinstance(res, HandyFrame):
if isinstance(res, DataFrame):
res = HandyFrame(res, self._handy)
if isinstance(res, GroupedData):
res = HandyGrouped(res._jgd, res._df, *args)
return res
return wrapper
else:
return attr
def __repr__(self):
return "HandyFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))
def _get_strata(self):
plot = None
object = None
if self._strat_handy is not None:
try:
object = self._strat_handy._strata_object
except AttributeError:
pass
if object is None:
object = True
try:
plots = self._strat_handy._strata_plot[1]
#if len(plots) > 1:
# plot = plots[self._strat_index]
plot = plots
except (AttributeError, IndexError):
pass
return plot, object
def _gen_row_ids(self, *args):
# EXPERIMENTAL - DO NOT USE!
return (self
.sort(*args)
.withColumn('_miid', F.monotonically_increasing_id())
.withColumn('_row_id', F.row_number().over(Window().orderBy(F.col('_miid'))))
.drop('_miid'))
def _loc(self, lower_bound, upper_bound):
# EXPERIMENTAL - DO NOT USE!
assert '_row_id' in self.columns, "Cannot use LOC without generating `row_id`s first!"
clause = F.col('_row_id').between(lower_bound, upper_bound)
return self.filter(clause)
@property
def cols(self):
"""Returns a class to access pandas-like column based methods implemented in Spark
Available methods:
- min
- max
- median
- q1
- q3
- stddev
- value_counts
- mode
- corr
- nunique
- hist
- boxplot
- scatterplot
"""
return HandyColumns(self, self._handy)
@property
def pandas(self):
"""Returns a class to access pandas-like column based methods through pandas UDFs
Available methods:
- between / between_time
- isin
- isna / isnull
- notna / notnull
- abs
- clip / clip_lower / clip_upper
- replace
- round / truncate
- tz_convert / tz_localize
"""
return HandyPandas(self)
@property
def transformers(self):
"""Returns a class to generate Handy transformers
Available transformers:
- HandyImputer
- HandyFencer
"""
return HandyTransformers(self)
@property
def stages(self):
"""Returns the number of stages in the execution plan.
"""
return self._handy.stages
@property
def response(self):
"""Returns the name of the response column.
"""
return self._handy.response
@property
def is_classification(self):
"""Returns True if response is a categorical variable.
"""
return self._handy.is_classification
@property
def classes(self):
"""Returns list of classes for a classification problem.
"""
return self._handy.classes
@property
def nclasses(self):
"""Returns the number of classes for a classification problem.
"""
return self._handy.nclasses
@property
def ncols(self):
"""Returns the number of columns of the HandyFrame.
"""
return self._handy.ncols
@property
def nrows(self):
"""Returns the number of rows of the HandyFrame.
"""
return self._handy.nrows
@property
def shape(self):
"""Return a tuple representing the dimensionality of the HandyFrame.
"""
return self._handy.shape
@property
def statistics_(self):
"""Returns dictionary with imputation fill value for each feature.
If stratified, first level keys are filter clauses for stratification.
"""
return self._handy.statistics_
@property
def fences_(self):
"""Returns dictionary with fence values for each feature.
If stratified, first level keys are filter clauses for stratification.
"""
return self._handy.fences_
@property
def values(self):
"""Numpy representation of HandyFrame.
"""
# safety limit will kick in, unless explicitly off before
tdf = self
if self._safety:
tdf = tdf.limit(self._safety_limit)
return np.array(tdf.rdd.map(tuple).collect())
def notHandy(self):
"""Converts HandyFrame back into Spark's DataFrame
"""
return DataFrame(self._jdf, self.sql_ctx)
def set_safety_limit(self, limit):
"""Sets safety limit used for ``collect`` method.
"""
self._handy._safety_limit = limit
self._safety_limit = limit
def safety_off(self):
"""Disables safety limit for a single call of ``collect`` method.
"""
self._handy._safety = False
self._safety = False
return self
def collect(self):
"""Returns all the records as a list of :class:`Row`.
By default, its output is limited by the safety limit.
To get original `collect` behavior, call ``safety_off`` method first.
"""
try:
if self._safety:
print('\nINFO: Safety is ON - returning up to {} instances.'.format(self._safety_limit))
return super().limit(self._safety_limit).collect()
else:
res = super().collect()
self._safety = True
return res
except HandyException as e:
raise HandyException(str(e), summary=False)
except Exception as e:
raise HandyException(str(e), summary=True)
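# --- Illustrative usage (not part of the library) ---
# A hedged sketch of how the safety limit interacts with collect; `hdf` is an
# assumed HandyFrame used only for illustration.
#
# hdf.set_safety_limit(100)               # collect() will now return at most 100 rows
# rows = hdf.collect()                    # prints an INFO message and truncates the result
# all_rows = hdf.safety_off().collect()   # full collect for this single call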
def take(self, num):
"""Returns the first ``num`` rows as a :class:`list` of :class:`Row`.
"""
self._handy._safety = False
res = super().take(num)
self._handy._safety = True
return res
def stratify(self, strata):
"""Stratify the HandyFrame.
Stratified operations should be more efficient than group by operations, as they
rely on three iterative steps, namely: filtering the underlying HandyFrame, performing
the operation and aggregating the results.
"""
strata = ensure_list(strata)
check_columns(self, strata)
return self._handy._stratify(strata)
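# --- Illustrative usage (not part of the library) ---
# A minimal sketch of stratified operations; the Titanic-style column names are
# assumptions made only for illustration.
#
# hdf.stratify(['Pclass']).cols['Embarked'].mode()
# hdf.stratify(['Sex']).fill(continuous=['Age'], strategy=['median'])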
def transform(self, f, name=None, args=None, returnType=None):
"""INTERNAL USE
"""
return HandyTransform.transform(self, f, name=name, args=args, returnType=returnType)
def apply(self, f, name=None, args=None, returnType=None):
"""INTERNAL USE
"""
return HandyTransform.apply(self, f, name=name, args=args, returnType=returnType)
def assign(self, **kwargs):
"""Assign new columns to a HandyFrame, returning a new object (a copy)
with all the original columns in addition to the new ones.
Parameters
----------
kwargs : keyword, value pairs
keywords are the column names.
If the values are callable, they are computed on the DataFrame and
assigned to the new columns.
If the values are not callable, (e.g. a scalar, or string),
they are simply assigned.
Returns
-------
df : HandyFrame
A new HandyFrame with the new columns in addition to
all the existing columns.
"""
return HandyTransform.assign(self, **kwargs)
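# --- Illustrative usage (not part of the library) ---
# A hedged sketch of assign with a callable evaluated through pandas UDFs; `hdf` and
# the 'Fare' column are assumptions. The `ret` extension from
# handyspark.extensions.types can tag an explicit return type when needed.
#
# import numpy as np
# hdf = hdf.assign(log_fare=lambda df: np.log(df['Fare'] + 1))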
@agg
def isnull(self, ratio=False):
"""Returns array with counts of missing value for each column in the HandyFrame.
Parameters
----------
ratio: boolean, default False
If True, returns ratios instead of absolute counts.
Returns
-------
counts: Series
"""
return self._handy.isnull(ratio)
@agg
def nunique(self):
"""Return Series with number of distinct observations for all columns.
Parameters
----------
exact: boolean, optional
If True, computes exact number of unique values, otherwise uses an approximation.
Returns
-------
nunique: Series
"""
return self._handy.nunique(self.columns) #, exact)
@inccol
def outliers(self, ratio=False, method='tukey', **kwargs):
"""Return Series with number of outlier observations according to
the specified method for all columns.
Parameters
----------
ratio: boolean, optional
If True, returns proportion instead of counts.
Default is False.
method: string, optional
Method used to detect outliers. Currently, only Tukey's method is supported.
Default is tukey.
Returns
-------
outliers: Series
"""
return self._handy.outliers(self.columns, ratio=ratio, method=method, **kwargs)
def get_outliers(self, colnames=None, critical_value=.999):
"""Returns HandyFrame containing all rows deemed as outliers using
Mahalanobis distance and the given critical value.
Parameters
----------
colnames: list of str, optional
List of columns to be used for computing Mahalanobis distance.
Default includes all numerical columns
critical_value: float, optional
Critical value for chi-squared distribution to classify outliers
according to Mahalanobis distance.
Default is .999 (99.9%).
"""
return self._handy.get_outliers(colnames, critical_value)
def remove_outliers(self, colnames=None, critical_value=.999):
"""Returns HandyFrame containing only rows NOT deemed as outliers
using Mahalanobis distance and the given critical value.
Parameters
----------
colnames: list of str, optional
List of columns to be used for computing Mahalanobis distance.
Default includes all numerical columns
critical_value: float, optional
Critical value for chi-squared distribution to classify outliers
according to Mahalanobis distance.
Default is .999 (99.9%).
"""
return self._handy.remove_outliers(colnames, critical_value)
def set_response(self, colname):
"""Sets column to be used as response in supervised learning algorithms.
Parameters
----------
colname: string
Returns
-------
self
"""
check_columns(self, colname)
return self._handy.set_response(colname)
@inccol
def fill(self, *args, categorical=None, continuous=None, strategy=None):
"""Fill NA/NaN values using the specified methods.
The values used for imputation are kept in ``statistics_`` property
and can later be used to generate a corresponding HandyImputer transformer.
Parameters
----------
categorical: 'all' or list of string, optional
List of categorical columns.
These columns are filled with their corresponding modes (most common values).
continuous: 'all' or list of string, optional
List of continuous value columns.
By default, these columns are filled with their corresponding means.
If a same-sized list is provided in the ``strategy`` argument, it uses
the corresponding strategy for each column.
strategy: list of string, optional
If provided, it must contain a strategy - either ``mean`` or ``median`` - for
each one of the continuous columns.
Returns
-------
df : HandyFrame
A new HandyFrame with filled missing values.
"""
return self._handy.fill(*args, continuous=continuous, categorical=categorical, strategy=strategy)
@inccol
def fence(self, colnames, k=1.5):
"""Caps outliers using lower and upper fences given by Tukey's method,
using 1.5 times the interquartile range (IQR).
The fence values used for capping outliers are kept in ``fences_`` property
and can later be used to generate a corresponding HandyFencer transformer.
For more information, check: https://en.wikipedia.org/wiki/Outlier#Tukey's_fences
Parameters
----------
colnames: list of string
Column names to apply fencing.
k: float, optional
Constant multiplier for the IQR.
Default is 1.5 (corresponding to Tukey's fences; use 3 for "far out" values).
Returns
-------
df : HandyFrame
A new HandyFrame with capped outliers.
"""
return self._handy.fence(colnames, k=k)
def disassemble(self, colname, new_colnames=None):
"""Disassembles a Vector or Array column into multiple columns.
Parameters
----------
colname: string
Column containing Vector or Array elements.
new_colnames: list of string, optional
Default is None; column names are generated by appending a sequential
suffix (e.g., _0, _1, etc.) to ``colname``.
If informed, it must have as many column names as elements
in the shortest vector/array of ``colname``.
Returns
-------
df : HandyFrame
A new HandyFrame with the new disassembled columns in addition to
all the existing columns.
"""
return self._handy.disassemble(colname, new_colnames)
def to_metrics_RDD(self, prob_col='probability', label_col='label'):
"""Converts a DataFrame containing predicted probabilities and classification labels
into a RDD suited for use with ``BinaryClassificationMetrics`` object.
Parameters
----------
prob_col: string, optional
Column containing Vectors of probabilities.
Default is 'probability'.
label_col: string, optional
Column containing labels.
Default is 'label'.
Returns
-------
rdd: RDD
RDD of tuples (probability, label)
"""
return self._handy.to_metrics_RDD(prob_col, label_col)
class Bucket(object):
"""Bucketizes a column of continuous values into equal sized bins
to perform stratification.
Parameters
----------
colname: string
Column containing continuous values
bins: integer
Number of equal-sized bins to map original values to.
Returns
-------
bucket: Bucket
Bucket object to be used as column in stratification.
"""
def __init__(self, colname, bins=5):
self._colname = colname
self._bins = bins
self._buckets = None
self._clauses = None
def __repr__(self):
return 'Bucket_{}_{}'.format(self._colname, self._bins)
@property
def colname(self):
return self._colname
def _get_buckets(self, df):
check_columns(df, self._colname)
buckets = ([-float('inf')] +
np.linspace(*df.agg(F.min(self._colname),
F.max(self._colname)).rdd.map(tuple).collect()[0],
self._bins + 1).tolist() +
[float('inf')])
buckets[-2] += 1e-7
self._buckets = buckets
return buckets
def _get_clauses(self, buckets):
clauses = []
clauses.append('{} < {:.4f}'.format(self._colname, buckets[1]))
for b, e in zip(buckets[1:-2], buckets[2:-1]):
clauses.append('{} >= {:.4f} and {} < {:.4f}'.format(self._colname, b, self._colname, e))
clauses[-1] = clauses[-1].replace('<', '<=')
clauses.append('{} > {:.4f}'.format(self._colname, buckets[-2]))
self._clauses = clauses
return clauses
class Quantile(Bucket):
"""Bucketizes a column of continuous values into quantiles
to perform stratification.
Parameters
----------
colname: string
Column containing continuous values
bins: integer
Number of quantiles to map original values to.
Returns
-------
quantile: Quantile
Quantile object to be used as column in stratification.
"""
def __repr__(self):
return 'Quantile{}_{}'.format(self._colname, self._bins)
def _get_buckets(self, df):
buckets = ([-float('inf')] +
df.approxQuantile(col=self._colname,
probabilities=np.linspace(0, 1, self._bins + 1).tolist(),
relativeError=0.01) +
[float('inf')])
buckets[-2] += 1e-7
return buckets
class HandyColumns(object):
"""HandyColumn(s) in a HandyFrame.
Attributes
----------
numerical: list of string
List of numerical columns (integer, float, double)
categorical: list of string
List of categorical columns (string, integer)
continuous: list of string
List of continuous columns (float, double)
string: list of string
List of string columns (string)
array: list of string
List of array columns (array, map)
"""
def __init__(self, df, handy, strata=None):
self._df = df
self._handy = handy
self._strata = strata
self._colnames = None
self.COLTYPES = {'continuous': self.continuous,
'categorical': self.categorical,
'numerical': self.numerical,
'string': self.string,
'array': self.array}
def __getitem__(self, *args):
if isinstance(args[0], tuple):
args = args[0]
item = args[0]
if self._strata is None:
if self._colnames is None:
if item == slice(None, None, None):
item = self._df.columns
if isinstance(item, str):
try:
# try it as an alias
item = self.COLTYPES[item]
except KeyError:
pass
check_columns(self._df, item)
self._colnames = item
if isinstance(self._colnames, int):
idx = self._colnames + (len(self._handy._group_cols) if self._handy._group_cols is not None else 0)
assert idx < len(self._df.columns), "Invalid column index {}".format(idx)
self._colnames = list(self._df.columns)[idx]
return self
else:
try:
n = item.stop
if n is None:
n = -1
except:
n = 20
if isinstance(self._colnames, (tuple, list)):
res = self._df.notHandy().select(self._colnames)
if n == -1:
if self._df._safety:
print('\nINFO: Safety is ON - returning up to {} instances.'.format(self._df._safety_limit))
n = self._df._safety_limit
if n != -1:
res = res.limit(n)
res = res.toPandas()
self._handy._safety = True
self._df._safety = True
return res
else:
return self._handy.__getitem__(self._colnames, n)
else:
if self._colnames is None:
if item == slice(None, None, None):
item = self._df.columns
if isinstance(item, str):
try:
# try it as an alias
item = self.COLTYPES[item]
except KeyError:
pass
self._strata._handycolumns = item
return self._strata
def __repr__(self):
colnames = ensure_list(self._colnames)
return "HandyColumns[%s]" % (", ".join("%s" % str(c) for c in colnames))
@property
def numerical(self):
"""Returns list of numerical columns in the HandyFrame.
"""
return self._handy._numerical
@property
def categorical(self):
"""Returns list of categorical columns in the HandyFrame.
"""
return self._handy._categorical
@property
def continuous(self):
"""Returns list of continuous columns in the HandyFrame.
"""
return self._handy._continuous
@property
def string(self):
"""Returns list of string columns in the HandyFrame.
"""
return self._handy._string
@property
def array(self):
"""Returns list of array or map columns in the HandyFrame.
"""
return self._handy._array
def mean(self):
return self._handy.mean(self._colnames)
def min(self):
return self._handy.min(self._colnames)
def max(self):
return self._handy.max(self._colnames)
def median(self, precision=.01):
"""Returns approximate median with given precision.
Parameters
----------
precision: float, optional
Default is 0.01
"""
return self._handy.median(self._colnames, precision)
def stddev(self):
return self._handy.stddev(self._colnames)
def var(self):
return self._handy.var(self._colnames)
def percentile(self, perc, precision=.01):
"""Returns approximate percentile with given precision.
Parameters
----------
perc: integer
Percentile to be computed
precision: float, optional
Default is 0.01
"""
return self._handy.percentile(self._colnames, perc, precision)
def q1(self, precision=.01):
"""Returns approximate first quartile with given precision.
Parameters
----------
precision: float, optional
Default is 0.01
"""
return self._handy.q1(self._colnames, precision)
def q3(self, precision=.01):
"""Returns approximate third quartile with given precision.
Parameters
----------
precision: float, optional
Default is 0.01
"""
return self._handy.q3(self._colnames, precision)
def _value_counts(self, dropna=True, raw=True):
assert len(ensure_list(self._colnames)) == 1, "A single column must be selected!"
return self._handy._value_counts(self._colnames, dropna, raw)
def value_counts(self, dropna=True):
"""Returns object containing counts of unique values.
The resulting object will be in descending order so that the
first element is the most frequently-occurring element.
Excludes NA values by default.
Parameters
----------
dropna : boolean, default True
Don't include counts of missing values.
Returns
-------
counts: Series
"""
assert len(ensure_list(self._colnames)) == 1, "A single column must be selected!"
return self._handy.value_counts(self._colnames, dropna)
def entropy(self):
"""Returns object containing entropy (base 2) of each column.
Returns
-------
entropy: Series
"""
return self._handy.entropy(self._colnames)
def mutual_info(self):
"""Returns object containing matrix of mutual information
between every pair of columns.
Returns
-------
mutual_info: pd.DataFrame
"""
return self._handy.mutual_info(self._colnames)
def mode(self):
"""Returns same-type modal (most common) value for each column.
Returns
-------
mode: Series
"""
colnames = ensure_list(self._colnames)
modes = [self._handy.mode(colname) for colname in colnames]
if len(colnames) == 1:
return modes[0]
else:
return pd.concat(modes, axis=0)
def corr(self, method='pearson'):
"""Compute pairwise correlation of columns, excluding NA/null values.
Parameters
----------
method : {'pearson', 'spearman'}
* pearson : standard correlation coefficient
* spearman : Spearman rank correlation
Returns
-------
y : DataFrame
"""
colnames = [col for col in self._colnames if col in self.numerical]
return self._handy.corr(colnames, method=method)
def nunique(self):
"""Return Series with number of distinct observations for specified columns.
Parameters
----------
exact: boolean, optional
If True, computes exact number of unique values, otherwise uses an approximation.
Returns
-------
nunique: Series
"""
return self._handy.nunique(self._colnames) #, exact)
def outliers(self, ratio=False, method='tukey', **kwargs):
"""Return Series with number of outlier observations according to
the specified method for all columns.
Parameters
----------
ratio: boolean, optional
If True, returns proportion instead of counts.
Default is True.
method: string, optional
Method used to detect outliers. Currently, only Tukey's method is supported.
Default is tukey.
Returns
-------
outliers: Series
"""
return self._handy.outliers(self._colnames, ratio=ratio, method=method, **kwargs)
def get_outliers(self, critical_value=.999):
"""Returns HandyFrame containing all rows deemed as outliers using
Mahalanobis distance and informed critical value.
Parameters
----------
critical_value: float, optional
Critical value for chi-squared distribution to classify outliers
according to Mahalanobis distance.
Default is .999 (99.9%).
"""
return self._handy.get_outliers(self._colnames, critical_value)
def remove_outliers(self, critical_value=.999):
"""Returns HandyFrame containing only rows NOT deemed as outliers
using Mahalanobis distance and informed critical value.
Parameters
----------
critical_value: float, optional
Critical value for chi-squared distribution to classify outliers
according to Mahalanobis distance.
Default is .999 (99.9%).
"""
return self._handy.remove_outliers(self._colnames, critical_value)
def hist(self, bins=10, ax=None):
"""Draws histogram of the HandyFrame's column using matplotlib / pylab.
Parameters
----------
bins : integer, default 10
Number of histogram bins to be used
ax : matplotlib axes object, default None
"""
return self._handy.hist(self._colnames, bins, ax)
def boxplot(self, ax=None, showfliers=True, k=1.5, precision=.01):
"""Makes a box plot from HandyFrame column.
Parameters
----------
ax : matplotlib axes object, default None
showfliers : bool, optional (True)
Show the outliers beyond the caps.
k: float, optional
Constant multiplier for the IQR.
Default is 1.5 (corresponding to Tukey's fences; use 3 for "far out" values).
"""
return self._handy.boxplot(self._colnames, ax, showfliers, k, precision)
def scatterplot(self, ax=None):
"""Makes a scatter plot of two HandyFrame columns.
Parameters
----------
ax : matplotlib axes object, default None
"""
return self._handy.scatterplot(self._colnames, ax)
class HandyStrata(object):
__handy_methods = (list(filter(lambda n: n[0] != '_',
(map(itemgetter(0),
inspect.getmembers(HandyFrame,
predicate=inspect.isfunction) +
inspect.getmembers(HandyColumns,
predicate=inspect.isfunction)))))) + ['handy']
def __init__(self, handy, strata):
self._handy = handy
self._df = handy._df
self._strata = strata
self._col_clauses = []
self._colnames = []
self._temp_colnames = []
temp_df = self._df
temp_df._handy = self._handy
for col in self._strata:
clauses = []
colname = str(col)
self._colnames.append(colname)
if isinstance(col, Bucket):
self._temp_colnames.append(colname)
buckets = col._get_buckets(self._df)
clauses = col._get_clauses(buckets)
bucketizer = Bucketizer(splits=buckets, inputCol=col.colname, outputCol=colname)
temp_df = HandyFrame(bucketizer.transform(temp_df), self._handy)
self._col_clauses.append(clauses)
self._df = temp_df
self._handy._df = temp_df
self._df._handy = self._handy
value_counts = self._df._handy._value_counts(self._colnames, raw=True).reset_index()
self._raw_combinations = sorted(list(map(tuple, zip(*[value_counts[colname].values
for colname in self._colnames]))))
self._raw_clauses = [' and '.join('{} == {}'.format(str(col), value) if isinstance(col, Bucket)
else '{} == "{}"'.format(str(col),
value[0] if isinstance(value, tuple) else value)
for col, value in zip(self._strata, comb))
for comb in self._raw_combinations]
self._combinations = [tuple(value if not len(clauses) else clauses[int(float(value))]
for value, clauses in zip(comb, self._col_clauses))
for comb in self._raw_combinations]
self._clauses = [' and '.join(value if isinstance(col, Bucket)
else '{} == "{}"'.format(str(col),
value[0] if isinstance(value, tuple) else value)
for col, value in zip(self._strata, comb))
for comb in self._combinations]
self._strat_df = [self._df.filter(clause) for clause in self._clauses]
self._df._strat_handy = self._handy
# Shares the same HANDY object among all sub dataframes
for i, df in enumerate(self._strat_df):
df._strat_index = i
df._strat_handy = self._handy
self._imputed_values = {}
self._handycolumns = None
def __repr__(self):
repr = "HandyStrata[%s]" % (", ".join("%s" % str(c) for c in self._strata))
if self._handycolumns is not None:
colnames = ensure_list(self._handycolumns)
repr = "HandyColumns[%s] by %s" % (", ".join("%s" % str(c) for c in colnames), repr)
return repr
def __getattribute__(self, name):
try:
if name == 'cols':
return HandyColumns(self._df, self._handy, self)
else:
attr = object.__getattribute__(self, name)
return attr
except AttributeError as e:
if name in self.__handy_methods:
def wrapper(*args, **kwargs):
raised = True
try:
# Makes stratification
for df in self._strat_df:
df._handy._strata = self._strata
self._handy._set_stratification(self._strata,
self._raw_combinations, self._raw_clauses,
self._combinations, self._clauses)
if self._handycolumns is not None:
args = (self._handycolumns,) + args
try:
attr_strata = getattr(self._handy, '_strat_{}'.format(name))
self._handy._strata_object = attr_strata(*args, **kwargs)
except AttributeError:
pass
try:
if self._handycolumns is not None:
f = object.__getattribute__(self._handy, name)
else:
f = object.__getattribute__(self._df, name)
is_agg = getattr(f, '__is_agg', False)
is_inccol = getattr(f, '__is_inccol', False)
except AttributeError:
is_agg = False
is_inccol = False
if is_agg or is_inccol:
if self._handycolumns is not None:
colnames = ensure_list(args[0])
else:
colnames = self._df.columns
res = getattr(self._handy, name)(*args, **kwargs)
else:
if self._handycolumns is not None:
res = [getattr(df._handy, name)(*args, **kwargs) for df in self._strat_df]
else:
res = [getattr(df, name)(*args, **kwargs) for df in self._strat_df]
if isinstance(res, pd.DataFrame):
if len(self._handy.strata_colnames):
res = res.set_index(self._handy.strata_colnames).sort_index()
if is_agg:
if len(colnames) == 1:
res = res[colnames[0]]
try:
attr_post = getattr(self._handy, '_post_{}'.format(name))
res = attr_post(res)
except AttributeError:
pass
strata = list(map(lambda v: v[1].to_dict(OrderedDict), self._handy.strata.iterrows()))
strata_cols = [c if isinstance(c, str) else c.colname for c in self._strata]
if isinstance(res, list):
if isinstance(res[0], DataFrame):
joined_df = res[0]
self._imputed_values = joined_df.statistics_
self._fenced_values = joined_df.fences_
if len(res) > 1:
if len(joined_df.statistics_):
self._imputed_values = {self._clauses[0]: joined_df.statistics_}
if len(joined_df.fences_):
self._fenced_values = {self._clauses[0]: joined_df.fences_}
for strat_df, clause in zip(res[1:], self._clauses[1:]):
if len(joined_df.statistics_):
self._imputed_values.update({clause: strat_df.statistics_})
if len(joined_df.fences_):
self._fenced_values.update({clause: strat_df.fences_})
joined_df = joined_df.unionAll(strat_df)
# Clears stratification
self._handy._clear_stratification()
self._df._strat_handy = None
self._df._strat_index = None
if len(self._temp_colnames):
joined_df = joined_df.drop(*self._temp_colnames)
res = HandyFrame(joined_df, self._handy)
res._handy._imputed_values = self._imputed_values
res._handy._fenced_values = self._fenced_values
elif isinstance(res[0], pd.DataFrame):
strat_res = []
indexes = res[0].index.names
if indexes[0] is None:
indexes = ['index']
for r, s in zip(res, strata):
strata_dict = dict([(k if isinstance(k, str) else k.colname, v) for k, v in s.items()])
strat_res.append(r.assign(**strata_dict)
.reset_index())
res = (pd.concat(strat_res)
.sort_values(by=strata_cols)
.set_index(strata_cols + indexes)
.sort_index())
elif isinstance(res[0], pd.Series):
# TODO: TEST
strat_res = []
for r, s in zip(res, strata):
strata_dict = dict([(k if isinstance(k, str) else k.colname, v) for k, v in s.items()])
series_name = none2default(r.name, 0)
if series_name == name:
series_name = 'index'
strat_res.append(r.reset_index()
.rename(columns={series_name: name, 'index': series_name})
.assign(**strata_dict)
.set_index(strata_cols + [series_name])[name])
res = pd.concat(strat_res).sort_index()
if len(ensure_list(self._handycolumns)) > 1:
try:
res = res.astype(np.float64)
res = res.to_frame().reset_index().pivot_table(values=name,
index=strata_cols,
columns=series_name)
res.columns.name = ''
except ValueError:
pass
elif isinstance(res[0], np.ndarray):
# TODO: TEST
strat_res = []
for r, s in zip(res, strata):
strata_dict = dict([(k if isinstance(k, str) else k.colname, v) for k, v in s.items()])
strat_res.append(pd.DataFrame(r, columns=[name])
.assign(**strata_dict)
.set_index(strata_cols)[name])
res = pd.concat(strat_res).sort_index()
elif isinstance(res[0], Axes):
res, axs = self._handy._strata_plot
res = consolidate_plots(res, axs, args[0], self._clauses)
elif isinstance(res[0], list):
joined_list = res[0]
for l in res[1:]:
joined_list += l
return joined_list
elif len(res) == len(self._combinations):
# TODO: TEST
strata_df = pd.DataFrame(strata)
strata_df.columns = strata_cols
res = (pd.concat([pd.DataFrame(res, columns=[name]), strata_df], axis=1)
.set_index(strata_cols)
.sort_index())
raised = False
return res
except HandyException as e:
raise HandyException(str(e), summary=False)
except Exception as e:
raise HandyException(str(e), summary=True)
finally:
if not raised:
if isinstance(res, HandyFrame):
res._handy._clear_stratification()
self._handy._clear_stratification()
self._df._strat_handy = None
self._df._strat_index = None
if len(self._temp_colnames):
self._df = self._df.drop(*self._temp_colnames)
self._handy._df = self._df
return wrapper
else:
raise e
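A brief usage sketch (not part of the repository) of the stratification, imputation and fencing API defined above; it assumes the Titanic ``train.csv`` used by the test suite, an active SparkSession, and that handyspark has been imported so the DataFrame extensions are patched in:
from pyspark.sql import SparkSession
from handyspark import *
from handyspark.sql.dataframe import Bucket
spark = SparkSession.builder.getOrCreate()
hdf = spark.read.csv('train.csv', header=True, inferSchema=True).toHandy()
# per-stratum mean imputation of 'Age', stratified by 'Pclass' and 'Sex'
filled = hdf.stratify(['Pclass', 'Sex']).fill(continuous=['Age'], strategy=['mean'])
print(filled.statistics_)            # imputation values, keyed by stratum clause
# cap 'Fare' outliers with Tukey's fences inside 5 equal-sized 'Age' buckets
fenced = filled.stratify([Bucket('Age', bins=5)]).fence(['Fare'])
print(fenced.fences_)                # fence values, keyed by stratum clause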
================================================
FILE: handyspark/sql/datetime.py
================================================
from handyspark.sql.transform import HandyTransform
import pandas as pd
class HandyDatetime(object):
__supported = {'boolean': ['is_leap_year', 'is_month_end', 'is_month_start', 'is_quarter_end', 'is_quarter_start',
'is_year_end', 'is_year_start'],
'string': ['strftime', 'tz', 'weekday_name'],
'integer': ['day', 'dayofweek', 'dayofyear', 'days_in_month', 'daysinmonth', 'hour', 'microsecond',
'minute', 'month', 'nanosecond', 'quarter', 'second', 'week', 'weekday', 'weekofyear',
'year'],
'date': ['date'],
'timestamp': ['ceil', 'floor', 'round', 'normalize', 'time', 'tz_convert', 'tz_localize']}
__unsupported = ['freq', 'to_period', 'to_pydatetime']
__functions = ['strftime', 'ceil', 'floor', 'round', 'normalize', 'tz_convert', 'tz_localize']
__available = sorted(__supported['boolean'] + __supported['string'] + __supported['integer'] + __supported['date'] +
__supported['timestamp'])
__types = {n: t for t, v in __supported.items() for n in v}
_colname = None
def __init__(self, df, colname):
self._df = df
self._colname = colname
if self._df.notHandy().select(colname).dtypes[0][1] != 'timestamp':
raise AttributeError('Can only use .dt accessor with datetimelike values')
def __getattribute__(self, name):
try:
attr = object.__getattribute__(self, name)
return attr
except AttributeError as e:
if name in self.__available:
if name in self.__functions:
def wrapper(*args, **kwargs):
return HandyTransform.gen_pandas_udf(f=lambda col: col.dt.__getattribute__(name)(**kwargs),
args=(self._colname,),
returnType=self.__types.get(name, 'string'))
wrapper.__doc__ = getattr(pd.Series.dt, name).__doc__
return wrapper
else:
func = HandyTransform.gen_pandas_udf(f=lambda col: col.dt.__getattribute__(name),
args=(self._colname,),
returnType=self.__types.get(name, 'string'))
func.__doc__ = getattr(pd.Series.dt, name).__doc__
return func
else:
raise e
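A brief sketch (not part of the repository) of how the ``.dt`` accessor above is reached in practice; ``hdf`` is assumed to be a HandyFrame with a timestamp column named ``ts`` (hypothetical name):
year_col = hdf.pandas['ts'].dt.year                        # plain attributes yield Spark Columns
month_end = hdf.pandas['ts'].dt.is_month_end               # boolean Column
ym = hdf.pandas['ts'].dt.strftime(date_format='%Y-%m')     # methods must be called with keyword arguments
hdf2 = hdf.withColumn('year', year_col).withColumn('ym', ym)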
================================================
FILE: handyspark/sql/pandas.py
================================================
from handyspark.sql.datetime import HandyDatetime
from handyspark.sql.string import HandyString
from handyspark.sql.transform import HandyTransform
from handyspark.util import check_columns
import pandas as pd
class HandyPandas(object):
__supported = {'boolean': ['between', 'between_time', 'isin', 'isna', 'isnull', 'notna', 'notnull'],
'same': ['abs', 'clip', 'clip_lower', 'clip_upper', 'replace', 'round', 'truncate',
'tz_convert', 'tz_localize']}
__as_series = ['rank', 'interpolate', 'pct_change', 'bfill', 'cummax', 'cummin', 'cumprod', 'cumsum', 'diff',
'ffill', 'fillna', 'shift']
__available = sorted(__supported['boolean'] + __supported['same'])
__types = {n: t for t, v in __supported.items() for n in v}
def __init__(self, df):
self._df = df
self._colname = None
def __getitem__(self, *args):
if isinstance(args[0], tuple):
args = args[0]
item = args[0]
check_columns(self._df, item)
self._colname = item
return self
@property
def str(self):
"""Returns a class to access pandas-like string column based methods through pandas UDFs
Available methods:
- contains
- startswith / endswith
- match
- isalpha / isnumeric / isalnum / isdigit / isdecimal / isspace
- islower / isupper / istitle
- replace
- repeat
- join
- pad
- slice / slice_replace
- strip / lstrip / rstrip
- wrap / center / ljust / rjust
- translate
- get
- normalize
- lower / upper / capitalize / swapcase / title
- zfill
- count
- find / rfind
- len
"""
return HandyString(self._df, self._colname)
@property
def dt(self):
"""Returns a class to access pandas-like datetime column based methods through pandas UDFs
Available methods:
- is_leap_year / is_month_end / is_month_start / is_quarter_end / is_quarter_start / is_year_end / is_year_start
- strftime
- tz / time / tz_convert / tz_localize
- day / dayofweek / dayofyear / days_in_month / daysinmonth
- hour / microsecond / minute / nanosecond / second
- week / weekday / weekday_name
- month / quarter / year / weekofyear
- date
- ceil / floor / round
- normalize
"""
return HandyDatetime(self._df, self._colname)
def __getattribute__(self, name):
try:
attr = object.__getattribute__(self, name)
return attr
except AttributeError as e:
if name in self.__available:
def wrapper(*args, **kwargs):
returnType=self.__types.get(name, 'string')
if returnType == 'same':
returnType = self._df.notHandy().select(self._colname).dtypes[0][1]
return HandyTransform.gen_pandas_udf(f=lambda col: col.__getattribute__(name)(**kwargs),
args=(self._colname,),
returnType=returnType)
if name not in ['str', 'dt']:
wrapper.__doc__ = getattr(pd.Series, name).__doc__
return wrapper
else:
raise e
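A brief sketch (not part of the repository) of the accessor above; the column names come from the Titanic dataset used elsewhere in this repository, and ``hdf`` is assumed to be the corresponding HandyFrame:
age_missing = hdf.pandas['Age'].isnull()                   # boolean Column built from a pandas UDF
is_child = hdf.pandas['Age'].between(left=0, right=12)     # only keyword arguments are forwarded
children = hdf.filter(is_child)                            # Columns combine with regular Spark operations
hdf2 = hdf.withColumn('age_missing', age_missing)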
================================================
FILE: handyspark/sql/schema.py
================================================
import numpy as np
import datetime
from operator import itemgetter
from pyspark.sql.types import StructType
_mapping = {str: 'string',
bool: 'boolean',
int: 'integer',
float: 'float',
datetime.date: 'date',
datetime.datetime: 'timestamp',
np.bool: 'boolean',
np.int8: 'byte',
np.int16: 'short',
np.int32: 'integer',
np.int64: 'long',
np.float32: 'float',
np.float64: 'double',
np.ndarray: 'array',
object: 'string',
list: 'array',
tuple: 'array',
dict: 'map'}
def generate_schema(columns, nullable_columns='all'):
"""
Parameters
----------
columns: dict of column names (keys) and types (values)
nullable_columns: list of nullable column names, optional, default is 'all'
Returns
-------
schema: StructType
Spark DataFrame schema corresponding to Python/numpy types.
"""
columns = sorted(columns.items())
colnames = list(map(itemgetter(0), columns))
coltypes = list(map(itemgetter(1), columns))
invalid_types = []
new_types = []
keys = list(map(itemgetter(0), list(_mapping.items())))
for coltype in coltypes:
if coltype not in keys:
invalid_types.append(coltype)
else:
if coltype == np.dtype('O'):
new_types.append(str)
else:
new_types.append(keys[keys.index(coltype)])
assert len(invalid_types) == 0, "Invalid type(s) specified: {}".format(str(invalid_types))
if nullable_columns == 'all':
nullables = [True] * len(colnames)
else:
nullables = [col in nullable_columns for col in colnames]
fields = [{"metadata": {}, "name": name, "nullable": nullable, "type": _mapping[typ]}
for name, typ, nullable in zip(colnames, new_types, nullables)]
return StructType.fromJson({"type": "struct", "fields": fields})
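A brief sketch (not part of the repository) of ``generate_schema``; note that fields come out sorted by column name, and an active SparkSession ``spark`` is assumed:
import numpy as np
from handyspark.sql.schema import generate_schema
schema = generate_schema({'name': str, 'age': np.int32, 'fare': np.float64},
                         nullable_columns=['fare'])
# fields are sorted alphabetically: age (integer), fare (double, nullable), name (string)
sdf = spark.createDataFrame([(30, 7.25, 'Alice')], schema=schema)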
================================================
FILE: handyspark/sql/string.py
================================================
from handyspark.sql.transform import HandyTransform
import unicodedata
import pandas as pd
class HandyString(object):
__supported = {'boolean': ['contains', 'startswith', 'endswith', 'match', 'isalpha', 'isnumeric', 'isalnum', 'isdigit',
'isdecimal', 'isspace', 'islower', 'isupper', 'istitle'],
'string': ['replace', 'repeat', 'join', 'pad', 'slice', 'slice_replace', 'strip', 'wrap', 'translate',
'get', 'center', 'ljust', 'rjust', 'zfill', 'lstrip', 'rstrip',
'normalize', 'lower', 'upper', 'title', 'capitalize', 'swapcase'],
'integer': ['count', 'find', 'len', 'rfind']}
__unsupported = ['cat', 'extract', 'extractall', 'get_dummies', 'findall', 'index', 'split', 'rsplit', 'partition',
'rpartition', 'rindex', 'decode', 'encode']
__available = sorted(__supported['boolean'] + __supported['string'] + __supported['integer'])
__types = {n: t for t, v in __supported.items() for n in v}
_colname = None
def __init__(self, df, colname):
self._df = df
self._colname = colname
@staticmethod
def _remove_accents(input):
return unicodedata.normalize('NFKD', input).encode('ASCII', 'ignore').decode('unicode_escape')
def remove_accents(self):
return HandyTransform.gen_pandas_udf(f=lambda col: col.apply(HandyString._remove_accents),
args=(self._colname,),
returnType='string')
def __getattribute__(self, name):
try:
attr = object.__getattribute__(self, name)
return attr
except AttributeError as e:
if name in self.__available:
def wrapper(*args, **kwargs):
return HandyTransform.gen_pandas_udf(f=lambda col: col.str.__getattribute__(name)(**kwargs),
args=(self._colname,),
returnType=self.__types.get(name, 'string'))
wrapper.__doc__ = getattr(pd.Series.str, name).__doc__
return wrapper
else:
raise e
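A brief sketch (not part of the repository) of the ``.str`` accessor above; since only keyword arguments are forwarded to the underlying pandas method, calls such as ``contains`` must name their parameters. ``hdf`` is assumed to be a HandyFrame of the Titanic data:
is_mrs = hdf.pandas['Name'].str.contains(pat='Mrs\\.')     # boolean Column
ascii_name = hdf.pandas['Name'].str.remove_accents()       # helper defined above, string Column
hdf2 = hdf.withColumn('is_mrs', is_mrs).withColumn('ascii_name', ascii_name)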
================================================
FILE: handyspark/sql/transform.py
================================================
import datetime
import inspect
import numpy as np
from pyspark.sql import functions as F
_MAPPING = {'string': str,
'date': datetime.date,
'timestamp': datetime.datetime,
'boolean': np.bool,
'binary': np.byte,
'byte': np.int8,
'short': np.int16,
'integer': np.int32,
'long': np.int64,
'float': np.float32,
'double': np.float64,
'array': np.ndarray,
'map': dict}
class HandyTransform(object):
_mapping = dict([(v.__name__, k) for k, v in _MAPPING.items()])
_mapping.update({'float': 'double', 'int': 'integer', 'list': 'array', 'bool': 'boolean'})
@staticmethod
def _get_return(sdf, f, args):
returnType = None
if args is None:
args = f.__code__.co_varnames
if len(args):
returnType = sdf.select(args[0]).dtypes[0][1]
return returnType
@staticmethod
def _signatureType(sig):
returnType = None
signatureType = str(sig.return_annotation)[7:]
if '_empty' not in signatureType:
returnType = signatureType
types = returnType.replace(']', '').replace('[', ',').split(',')[:3]
for returnType in types:
assert returnType.lower().strip() in HandyTransform._mapping.keys(), "invalid returnType"
types = list(map(lambda t: HandyTransform._mapping[t.lower().strip()], types))
returnType = types[0]
if len(types) > 1:
returnType = '<'.join([returnType, ','.join(types[1:])])
returnType += '>'
return returnType
@staticmethod
def gen_pandas_udf(f, args=None, returnType=None):
sig = inspect.signature(f)
if args is None:
args = tuple(sig.parameters.keys())
assert isinstance(args, (list, tuple)), "args must be list or tuple"
name = '{}{}'.format(f.__name__, str(args).replace("'", ""))
if returnType is None:
returnType = HandyTransform._signatureType(sig)
try:
import pyarrow
@F.pandas_udf(returnType=returnType)
def udf(*args):
return f(*args)
except:
@F.udf(returnType=returnType)
def udf(*args):
return f(*args)
return udf(*args).alias(name)
@staticmethod
def gen_grouped_pandas_udf(sdf, f, args=None, returnType=None):
# TODO: test it properly!
sig = inspect.signature(f)
if args is None:
args = tuple(sig.parameters.keys())
assert isinstance(args, (list, tuple)), "args must be list or tuple"
name = '{}{}'.format(f.__name__, str(f.__code__.co_varnames).replace("'", ""))
if returnType is None:
returnType = HandyTransform._signatureType(sig)
schema = sdf.notHandy().select(*args).withColumn(name, F.lit(None).cast(returnType)).schema
@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def pudf(pdf):
computed = pdf.apply(lambda row: f(*tuple(row[p] for p in f.__code__.co_varnames)), axis=1)
return pdf.assign(__computed=computed).rename(columns={'__computed': name})
return pudf
@staticmethod
def transform(sdf, f, name=None, args=None, returnType=None):
if name is None:
name = '{}{}'.format(f.__name__, str(f.__code__.co_varnames).replace("'", ""))
if isinstance(f, tuple):
f, returnType = f
if returnType is None:
returnType = HandyTransform._get_return(sdf, f, args)
return sdf.withColumn(name, HandyTransform.gen_pandas_udf(f, args, returnType))
@staticmethod
def apply(sdf, f, name=None, args=None, returnType=None):
if name is None:
name = '{}{}'.format(f.__name__, str(f.__code__.co_varnames).replace("'", ""))
if isinstance(f, tuple):
f, returnType = f
if returnType is None:
returnType = HandyTransform._get_return(sdf, f, args)
return sdf.select(HandyTransform.gen_pandas_udf(f, args, returnType).alias(name))
@staticmethod
def assign(sdf, **kwargs):
for c, f in kwargs.items():
typename = None
if isinstance(f, tuple):
f, typename = f
if callable(f):
if typename is None:
typename = HandyTransform._get_return(sdf, f, None)
if typename is not None:
sdf = sdf.transform(f, name=c, returnType=typename)
else:
sdf = sdf.withColumn(c, F.lit(f()))
else:
sdf = sdf.withColumn(c, F.lit(f))
return sdf
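A brief sketch (not part of the repository) showing HandyTransform turning a plain Python function into a pandas UDF column; the explicit ``returnType`` avoids relying on a type annotation, and 'Fare' is a Titanic column (``hdf`` is the corresponding HandyFrame):
import numpy as np
from handyspark.sql.transform import HandyTransform
# adds a 'log_fare' column; the function's argument name must match the source column name
hdf2 = HandyTransform.transform(hdf, lambda Fare: np.log1p(Fare),
                                name='log_fare', returnType='double')
# the same expression as a standalone Column
col = HandyTransform.gen_pandas_udf(lambda Fare: np.log1p(Fare),
                                    args=('Fare',), returnType='double')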
================================================
FILE: handyspark/stats.py
================================================
import numpy as np
from handyspark.util import check_columns, ensure_list
from pyspark.mllib.common import _py2java
from pyspark.mllib.stat.test import KolmogorovSmirnovTestResult
def StatisticalSummaryValues(sdf, colnames):
"""Builds a Java StatisticalSummaryValues object for each column
"""
colnames = ensure_list(colnames)
check_columns(sdf, colnames)
jvm = sdf._sc._jvm
summ = sdf.notHandy().select(colnames).describe().toPandas().set_index('summary')
ssvs = {}
for colname in colnames:
values = list(map(float, summ[colname].values))
values = values[1], np.sqrt(values[2]), int(values[0]), values[4], values[3], values[0] * values[1]
java_class = jvm.org.apache.commons.math3.stat.descriptive.StatisticalSummaryValues
ssvs.update({colname: java_class(*values)})
return ssvs
def tTest(jvm, *ssvs):
"""Performs a t-Test for difference of means using StatisticalSummaryValues objects
"""
n = len(ssvs)
res = np.identity(n)
java_class = jvm.org.apache.commons.math3.stat.inference.TTest
java_obj = java_class()
for i in range(n):
for j in range(i + 1, n):
pvalue = java_obj.tTest(ssvs[i], ssvs[j])
res[i, j] = pvalue
res[j, i] = pvalue
return res
def KolmogorovSmirnovTest(sdf, colname, dist='normal', *params):
"""Performs a KolmogorovSmirnov test for comparing the distribution of values in a column
to a named canonical distribution.
"""
check_columns(sdf, colname)
# Supported distributions
_distributions = ['Beta', 'Cauchy', 'ChiSquared', 'Exponential', 'F', 'Gamma', 'Gumbel', 'Laplace', 'Levy',
'Logistic', 'LogNormal', 'Nakagami', 'Normal', 'Pareto', 'T', 'Triangular', 'Uniform', 'Weibull']
_distlower = list(map(lambda v: v.lower(), _distributions))
try:
dist = _distributions[_distlower.index(dist)]
# the actual name for the Uniform distribution is UniformReal
if dist == 'Uniform':
dist += 'Real'
except ValueError:
# If we cannot find a distribution, fall back to Normal
dist = 'Normal'
params = (0., 1.)
jvm = sdf._sc._jvm
# Maps the DF column into a numeric RDD and turns it into Java RDD
rdd = sdf.notHandy().select(colname).rdd.map(lambda t: t[0])
jrdd = _py2java(sdf._sc, rdd)
# Gets the Java class of the corresponding distribution and creates an obj
java_class = getattr(jvm, 'org.apache.commons.math3.distribution.{}Distribution'.format(dist))
java_obj = java_class(*params)
# Loads the KS test class and performs the test
ks = jvm.org.apache.spark.mllib.stat.test.KolmogorovSmirnovTest
res = ks.testOneSample(jrdd.rdd(), java_obj)
return KolmogorovSmirnovTestResult(res)
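A brief sketch (not part of the repository) of the statistical helpers above, run against the Titanic HandyFrame ``hdf``; the mean and standard deviation passed to the KS test are made-up illustration values:
from handyspark.stats import StatisticalSummaryValues, tTest, KolmogorovSmirnovTest
ssvs = StatisticalSummaryValues(hdf, ['Fare', 'Age'])
jvm = hdf._sc._jvm                                          # same JVM handle the helpers above use
pvalues = tTest(jvm, *ssvs.values())                        # symmetric matrix of pairwise p-values
ks = KolmogorovSmirnovTest(hdf, 'Fare', 'normal', 32.0, 50.0)
print(ks.pValue, ks.statistic)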
================================================
FILE: handyspark/util.py
================================================
from math import isnan, isinf
import pandas as pd
from pyspark.ml.linalg import DenseVector
from pyspark.rdd import RDD
from pyspark.sql import functions as F, DataFrame, Row
from pyspark.sql.types import ArrayType, DoubleType, StructType, StructField
from pyspark.mllib.common import _java2py, _py2java
import traceback
def none2default(value, default):
return value if value is not None else default
def none2zero(value):
return none2default(value, 0)
def ensure_list(value):
if value is None:
return []
if isinstance(value, (list, tuple)):
return value
else:
return [value]
def check_columns(df, colnames):
if colnames is not None:
available = df.columns
colnames = ensure_list(colnames)
colnames = [col if isinstance(col, str) else col.colname for col in colnames]
diff = set(colnames).difference(set(available))
assert not len(diff), "DataFrame does not have {} column(s)".format(str(list(diff))[1:-1])
class bcolors:
HEADER = '\033[95m'
OKBLUE = '\033[94m'
OKGREEN = '\033[92m'
WARNING = '\033[93m'
FAIL = '\033[91m'
ENDC = '\033[0m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
class HandyException(Exception):
def __init__(self, *args, **kwargs):
try:
# Summary is a boolean argument
# If True, it prints the exception summary
# This way, we can avoid printing the summary all
# the way along the exception "bubbling up"
summary = kwargs['summary']
if summary:
print(HandyException.exception_summary())
except KeyError:
pass
@staticmethod
def colortext(text, color_code):
return color_code + text + (bcolors.ENDC if text[-4:] != bcolors.ENDC else '')
@staticmethod
def errortext(text):
# Makes exception summary both BOLD and RED (FAIL)
return HandyException.colortext(HandyException.colortext(text, bcolors.FAIL), bcolors.BOLD)
@staticmethod
def exception_summary():
# Gets the error stack
msg = traceback.format_exc()
try:
# Builds the "frame" around the text
top = HandyException.errortext('-' * 75 + '\nHANDY EXCEPTION SUMMARY\n')
bottom = HandyException.errortext('-' * 75)
# Gets the information about the error and makes it BOLD and RED
info = list(filter(lambda t: len(t) and t[0] != '\t', msg.split('\n')[::-1]))
error = HandyException.errortext('Error\t: {}'.format(info[0]))
# Figure out where the error happened - location (file/notebook), line and function
idx = [t.strip()[:4] for t in info].index('File')
where = [v.strip() for v in info[idx].strip().split(',')]
location, line, func = where[0][5:], where[1][5:], where[2][3:]
# If it is a notebook, figures out the cell
if 'ipython-input' in location:
location = 'IPython - In [{}]'.format(location.split('-')[2])
# If it is a pyspark error, just go with it
if 'pyspark' in error:
new_msg = '\n{}\n{}\n{}'.format(top, error, bottom)
# Otherwise, build the summary
else:
new_msg = '\n{}\nLocation: {}\nLine\t: {}\nFunction: {}\n{}\n{}'.format(top, location, line, func, error, bottom)
return new_msg
except Exception as e:
# If we managed to raise an exception while trying to format the original exception...
# Oh, well...
return 'This is awkward... \n{}'.format(str(e))
def get_buckets(rdd, buckets):
"""Extracted from pyspark.rdd.RDD.histogram function
"""
if buckets < 1:
raise ValueError("number of buckets must be >= 1")
# filter out non-comparable elements
def comparable(x):
if x is None:
return False
if type(x) is float and isnan(x):
return False
return True
filtered = rdd.filter(comparable)
# faster than stats()
def minmax(a, b):
return min(a[0], b[0]), max(a[1], b[1])
try:
minv, maxv = filtered.map(lambda x: (x, x)).reduce(minmax)
except TypeError as e:
if " empty " in str(e):
raise ValueError("can not generate buckets from empty RDD")
raise
if minv == maxv or buckets == 1:
return [minv, maxv], [filtered.count()]
try:
inc = (maxv - minv) / buckets
except TypeError:
raise TypeError("Can not generate buckets with non-number in RDD")
if isinf(inc):
raise ValueError("Can not generate buckets with infinite value")
# keep them as integer if possible
inc = int(inc)
if inc * buckets != maxv - minv:
inc = (maxv - minv) * 1.0 / buckets
buckets = [i * inc + minv for i in range(buckets)]
buckets.append(maxv) # fix accumulated error
return buckets
def dense_to_array(sdf, colname, new_colname):
"""Casts a Vector column into a new Array column.
"""
# Gets type of original column
coltype = sdf.notHandy().select(colname).dtypes[0][1]
# If it is indeed a vector...
if coltype == 'vector':
newrow = Row(*sdf.columns, new_colname)
res = sdf.rdd.map(lambda row: newrow(*row, row[colname].values.tolist())).toDF(sdf.columns + [new_colname])
# Otherwise just copy the original column into a new one
else:
res = sdf.withColumn(new_colname, F.col(colname))
# Makes it a HandyFrame
if isinstance(res, DataFrame):
res = res.toHandy()
return res
def disassemble(sdf, colname, new_colnames=None):
"""Disassembles a Vector/Array column into multiple columns
"""
array_col = '_{}'.format(colname)
# Gets type of original column
coltype = sdf.notHandy().select(colname).schema.fields[0].dataType.typeName()
# If it is a vector or array...
if coltype in ['vectorudt', 'array']:
# Makes the conversion from vector to array (or not :-))
tdf = dense_to_array(sdf, colname, array_col)
# Checks the MIN size of the arrays in the dataset
# If there are arrays with multiple sizes, it can still safely
# convert up to that size
size = tdf.notHandy().select(F.min(F.size(array_col))).take(1)[0][0]
# If no new names were given, just uses the original name and
# a sequence number as suffix
if new_colnames is None:
new_colnames = ['{}_{}'.format(colname, i) for i in range(size)]
assert len(new_colnames) == size, \
"There must be {} column names, only {} found!".format(size, len(new_colnames))
# Uses `getItem` to disassemble the array into multiple columns
res = tdf.select(*sdf.columns,
*(F.col(array_col).getItem(i).alias(n) for i, n in zip(range(size), new_colnames)))
# Otherwise just copy the original column into a new one
else:
if new_colnames is None:
new_colnames = [colname]
res = sdf.withColumn(new_colnames[0], F.col(colname))
# Makes it a HandyFrame
if isinstance(res, DataFrame):
res = res.toHandy()
return res
def get_jvm_class(cl):
"""Builds JVM class name from Python class
"""
return 'org.apache.{}.{}'.format(cl.__module__[2:], cl.__name__)
def call_scala_method(py_class, scala_method, df, *args):
"""Given a Python class, calls a method from its Scala equivalent
"""
sc = df.sql_ctx._sc
# Gets the Java class from the JVM, given the name built from the Python class
java_class = getattr(sc._jvm , get_jvm_class(py_class))
# Converts all columns into doubles and access it as Java DF
jdf = df.select(*(F.col(col).astype('double') for col in df.columns))._jdf
# Creates a Java object from both Java class and DataFrame
java_obj = java_class(jdf)
# Converts remaining args from Python to Java as well
args = [_py2java(sc, a) for a in args]
# Gets method from Java Object and passes arguments to it to get results
java_res = getattr(java_obj, scala_method)(*args)
# Converts results from Java back to Python
res = _java2py(sc, java_res)
# If result is an RDD, it could be the case its elements are still
# serialized tuples from Scala...
if isinstance(res, RDD):
try:
# Takes the first element from the result, to check what it is
first = res.take(1)[0]
# If it is a dictionary, we need to check its value
if isinstance(first, dict):
first = list(first.values())[0]
# If the value is a scala tuple, we need to deserialize it
if first.startswith('scala.Tuple'):
serde = sc._jvm.org.apache.spark.mllib.api.python.SerDe
# We assume it is a Tuple2 and deserialize it
java_res = serde.fromTuple2RDD(java_res)
# Finally, we convert the deserialized result from Java to Python
res = _java2py(sc, java_res)
except IndexError:
pass
return res
def counts_to_df(value_counts, colnames, n_points):
"""DO NOT USE IT!
"""
pdf = pd.DataFrame(value_counts
.to_frame('count')
.reset_index()
.apply(lambda row: dict({'count': row['count']},
**dict(zip(colnames, row['index'].toArray()))),
axis=1)
.values
.tolist())
pdf['count'] /= pdf['count'].sum()
proportions = pdf['count'] / pdf['count'].min()
factor = int(n_points / proportions.sum())
pdf = pd.concat([pdf[colnames], (proportions * factor).astype(int)], axis=1)
combinations = pdf.apply(lambda row: row.to_dict(), axis=1).values.tolist()
return pd.DataFrame([dict(v) for c in combinations for v in int(c.pop('count')) * [list(c.items())]])
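A brief sketch (not part of the repository) of ``disassemble``, which the HandyFrame method of the same name delegates to; it assumes handyspark has been imported and that ``sdf`` is a DataFrame with a 3-element Vector column named 'features' (hypothetical, e.g. the output of a VectorAssembler):
from handyspark.util import disassemble
exploded = disassemble(sdf, 'features', new_colnames=['f0', 'f1', 'f2'])
exploded.printSchema()    # original columns plus f0, f1, f2; the result is already a HandyFrame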
================================================
FILE: notebooks/Exploring_Titanic.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HandySpark\n",
"\n",
"### Bringing pandas-like capabilities to Spark dataframes!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT THIS IF YOU'RE USING GOOGLE COLAB!\n",
"\n",
"#!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"#!wget -q http://apache.osuosl.org/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz\n",
"#!tar xf spark-2.3.3-bin-hadoop2.7.tgz\n",
"#!pip install numpy==1.15\n",
"#!pip install -q pandas==0.24.1\n",
"#!pip install -q seaborn==0.9\n",
"#!pip install -q pyspark==2.3.3\n",
"#!pip install -q findspark\n",
"#!pip install -q handyspark\n",
"\n",
"# AFTER RUNNING THIS CELL, YOU MUST RESTART THE RUNTIME TO USE UPDATED VERSIONS OF PACKAGES!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# UNCOMMENT THIS IF YOU'RE USING GOOGLE COLAB!\n",
"\n",
"#import os\n",
"#os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"#os.environ[\"SPARK_HOME\"] = \"/content/spark-2.3.3-bin-hadoop2.7\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget https://raw.githubusercontent.com/dvgodoy/handyspark/master/tests/rawdata/train.csv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import numpy as np\n",
"import findspark\n",
"import pandas as pd\n",
"from pyspark.sql import SparkSession\n",
"from pyspark.sql import functions as F\n",
"from handyspark import *\n",
"from matplotlib import pyplot as plt\n",
"# fixes issue with seaborn hiding fliers on boxplot\n",
"import matplotlib as mpl\n",
"mpl.rc(\"lines\", markeredgewidth=0.5)\n",
"\n",
"findspark.init()\n",
"os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell'\n",
"\n",
"%matplotlib inline\n",
"\n",
"spark = SparkSession.builder.getOrCreate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Loading Data into a `HandyFrame`\n",
"\n",
"### After loading data as usual, just call method `toHandy()` (an extension to Spark's dataframe)!"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"HandyFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sdf = spark.read.csv('train.csv', header=True, inferSchema=True)\n",
"hdf = sdf.toHandy()\n",
"hdf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fetching some data\n",
"\n",
"- using an instance of `cols` from your `HandyFrame`, you can retrieve values for given columns in the top N rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Single column will be returned as a pandas Series"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Braund, Mr. Owen Harris\n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th...\n",
"2 Heikkinen, Miss. Laina\n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel)\n",
"4 Allen, Mr. William Henry\n",
"Name: Name, dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hdf.cols['Name'][:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Multiple columns will be returned as a pandas DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Pclass</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Pclass\n",
"0 Braund, Mr. Owen Harris 3\n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1\n",
"2 Heikkinen, Miss. Laina 3\n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1\n",
"4 Allen, Mr. William Henry 3"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hdf.cols[['Name', 'Pclass']][:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### You can also use `:` to get all columns!"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>None</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>None</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td
================================================
SYMBOL INDEX (344 symbols across 27 files)
================================================
FILE: handyspark/extensions/common.py
function call2 (line 3) | def call2(self, name, *a):
FILE: handyspark/extensions/evaluation.py
function thresholds (line 8) | def thresholds(self):
function roc (line 14) | def roc(self):
function pr (line 25) | def pr(self):
function fMeasureByThreshold (line 36) | def fMeasureByThreshold(self, beta=1.0):
function precisionByThreshold (line 46) | def precisionByThreshold(self):
function recallByThreshold (line 53) | def recallByThreshold(self):
function getMetricsByThreshold (line 60) | def getMetricsByThreshold(self):
function confusionMatrix (line 77) | def confusionMatrix(self, threshold=0.5):
function print_confusion_matrix (line 97) | def print_confusion_matrix(self, threshold=0.5):
function plot_roc_curve (line 118) | def plot_roc_curve(self, ax=None):
function plot_pr_curve (line 128) | def plot_pr_curve(self, ax=None):
function __init__ (line 138) | def __init__(self, scoreAndLabels, scoreCol='score', labelCol='label'):
FILE: handyspark/extensions/types.py
function ret (line 4) | def ret(cls, expr):
function ret (line 11) | def ret(self, expr):
FILE: handyspark/ml/base.py
class HandyTransformers (line 7) | class HandyTransformers(object):
method __init__ (line 16) | def __init__(self, df):
method imputer (line 20) | def imputer(self):
method fencer (line 27) | def fencer(self):
class HasDict (line 35) | class HasDict(Params):
method __init__ (line 42) | def __init__(self):
method setDictValues (line 46) | def setDictValues(self, value):
method getDictValues (line 54) | def getDictValues(self):
class HandyImputer (line 62) | class HandyImputer(Transformer, HasDict, DefaultParamsReadable, DefaultP...
method _transform (line 71) | def _transform(self, dataset):
method statistics (line 105) | def statistics(self):
class HandyFencer (line 109) | class HandyFencer(Transformer, HasDict, DefaultParamsReadable, DefaultPa...
method _transform (line 118) | def _transform(self, dataset):
method fences (line 155) | def fences(self):
FILE: handyspark/plot.py
function title_fom_clause (line 15) | def title_fom_clause(clause):
function consolidate_plots (line 18) | def consolidate_plots(fig, axs, title, clauses):
function plot_correlations (line 44) | def plot_correlations(pdf, ax=None):
function strat_scatterplot (line 50) | def strat_scatterplot(sdf, col1, col2, n=30):
function scatterplot (line 64) | def scatterplot(sdf, col1, col2, n=30, ax=None):
function strat_histogram (line 111) | def strat_histogram(sdf, colname, bins=10, categorical=False):
function histogram (line 150) | def histogram(sdf, colname, bins=10, categorical=False, ax=None):
function _gen_dict (line 186) | def _gen_dict(rc_name, properties):
function draw_boxplot (line 196) | def draw_boxplot(ax, stats):
function boxplot (line 223) | def boxplot(sdf, colnames, ax=None, showfliers=True, k=1.5, precision=.0...
function post_boxplot (line 257) | def post_boxplot(axs, stats):
function roc_curve (line 264) | def roc_curve(fpr, tpr, roc_auc, ax=None):
function pr_curve (line 278) | def pr_curve(precision, recall, pr_auc, ax=None):
FILE: handyspark/sql/dataframe.py
function toHandy (line 25) | def toHandy(self):
function notHandy (line 30) | def notHandy(self):
function agg (line 36) | def agg(f):
function inccol (line 40) | def inccol(f):
class Handy (line 44) | class Handy(object):
method __init__ (line 45) | def __init__(self, df):
method __deepcopy__ (line 69) | def __deepcopy__(self, memo):
method __getitem__ (line 78) | def __getitem__(self, *args):
method stages (line 113) | def stages(self):
method statistics_ (line 119) | def statistics_(self):
method fences_ (line 123) | def fences_(self):
method is_classification (line 127) | def is_classification(self):
method classes (line 131) | def classes(self):
method nclasses (line 135) | def nclasses(self):
method response (line 139) | def response(self):
method ncols (line 143) | def ncols(self):
method nrows (line 147) | def nrows(self):
method shape (line 151) | def shape(self):
method strata (line 155) | def strata(self):
method strata_colnames (line 160) | def strata_colnames(self):
method _stratify (line 166) | def _stratify(self, strata):
method _clear_stratification (line 169) | def _clear_stratification(self):
method _set_stratification (line 180) | def _set_stratification(self, strata, raw_combinations, raw_clauses, c...
method _build_strat_plot (line 194) | def _build_strat_plot(self, n_rows, n_cols, **kwargs):
method _update_types (line 202) | def _update_types(self):
method _take_array (line 213) | def _take_array(self, colname, n):
method _value_counts (line 225) | def _value_counts(self, colnames, dropna=True, raw=False):
method _fillna (line 246) | def _fillna(self, target, values):
method __stat_to_dict (line 267) | def __stat_to_dict(self, colname, stat):
method _fill_values (line 276) | def _fill_values(self, continuous, categorical, strategy):
method __fill_self (line 288) | def __fill_self(self, continuous, categorical, strategy):
method _dense_to_array (line 310) | def _dense_to_array(self, colname, array_colname):
method _agg (line 315) | def _agg(self, name, func, colnames):
method _calc_fences (line 332) | def _calc_fences(self, colnames, k=1.5, precision=.01):
method _calc_mahalanobis_distance (line 353) | def _calc_mahalanobis_distance(self, colnames, output_col='__mahalanob...
method _set_mahalanobis_outliers (line 390) | def _set_mahalanobis_outliers(self, colnames, critical_value=.999,
method _calc_bxp_stats (line 402) | def _calc_bxp_stats(self, fences_df, colname, showfliers=False):
method set_response (line 475) | def set_response(self, colname):
method disassemble (line 486) | def disassemble(self, colname, new_colnames=None):
method to_metrics_RDD (line 491) | def to_metrics_RDD(self, prob_col, label):
method corr (line 495) | def corr(self, colnames=None, method='pearson'):
method fill (line 507) | def fill(self, *args, continuous=None, categorical=None, strategy=None):
method isnull (line 514) | def isnull(self, ratio=False):
method nunique (line 537) | def nunique(self, colnames=None):
method outliers (line 544) | def outliers(self, colnames=None, ratio=False, method='tukey', **kwargs):
method get_outliers (line 578) | def get_outliers(self, colnames=None, critical_value=.999):
method remove_outliers (line 588) | def remove_outliers(self, colnames=None, critical_value=.999):
method fence (line 598) | def fence(self, colnames, k=1.5):
method value_counts (line 634) | def value_counts(self, colnames, dropna=True):
method mode (line 638) | def mode(self, colname):
method entropy (line 659) | def entropy(self, colnames):
method mutual_info (line 688) | def mutual_info(self, colnames):
method mean (line 737) | def mean(self, colnames):
method min (line 741) | def min(self, colnames):
method max (line 745) | def max(self, colnames):
method percentile (line 749) | def percentile(self, colnames, perc=50, precision=.01):
method median (line 759) | def median(self, colnames, precision=.01):
method stddev (line 763) | def stddev(self, colnames):
method var (line 767) | def var(self, colnames):
method q1 (line 771) | def q1(self, colnames, precision=.01):
method q3 (line 775) | def q3(self, colnames, precision=.01):
method _strat_boxplot (line 779) | def _strat_boxplot(self, colnames, **kwargs):
method boxplot (line 794) | def boxplot(self, colnames, ax=None, showfliers=True, k=1.5, precision...
method _post_boxplot (line 801) | def _post_boxplot(self, res):
method _strat_scatterplot (line 805) | def _strat_scatterplot(self, colnames, **kwargs):
method scatterplot (line 810) | def scatterplot(self, colnames, ax=None, **kwargs):
method _strat_hist (line 818) | def _strat_hist(self, colname, bins=10, **kwargs):
method hist (line 830) | def hist(self, colname, bins=10, ax=None, **kwargs):
class HandyGrouped (line 841) | class HandyGrouped(GroupedData):
method __init__ (line 842) | def __init__(self, jgd, df, *args):
method agg (line 848) | def agg(self, *exprs):
method __repr__ (line 854) | def __repr__(self):
class HandyFrame (line 858) | class HandyFrame(DataFrame):
method __init__ (line 914) | def __init__(self, df, handy=None):
method __getattribute__ (line 929) | def __getattribute__(self, name):
method __repr__ (line 951) | def __repr__(self):
method _get_strata (line 954) | def _get_strata(self):
method _gen_row_ids (line 973) | def _gen_row_ids(self, *args):
method _loc (line 981) | def _loc(self, lower_bound, upper_bound):
method cols (line 988) | def cols(self):
method pandas (line 1009) | def pandas(self):
method transformers (line 1026) | def transformers(self):
method stages (line 1036) | def stages(self):
method response (line 1042) | def response(self):
method is_classification (line 1048) | def is_classification(self):
method classes (line 1054) | def classes(self):
method nclasses (line 1060) | def nclasses(self):
method ncols (line 1066) | def ncols(self):
method nrows (line 1072) | def nrows(self):
method shape (line 1078) | def shape(self):
method statistics_ (line 1084) | def statistics_(self):
method fences_ (line 1091) | def fences_(self):
method values (line 1098) | def values(self):
method notHandy (line 1107) | def notHandy(self):
method set_safety_limit (line 1112) | def set_safety_limit(self, limit):
method safety_off (line 1118) | def safety_off(self):
method collect (line 1125) | def collect(self):
method take (line 1144) | def take(self, num):
method stratify (line 1152) | def stratify(self, strata):
method transform (line 1163) | def transform(self, f, name=None, args=None, returnType=None):
method apply (line 1168) | def apply(self, f, name=None, args=None, returnType=None):
method assign (line 1173) | def assign(self, **kwargs):
method isnull (line 1195) | def isnull(self, ratio=False):
method nunique (line 1210) | def nunique(self):
method outliers (line 1225) | def outliers(self, ratio=False, method='tukey', **kwargs):
method get_outliers (line 1244) | def get_outliers(self, colnames=None, critical_value=.999):
method remove_outliers (line 1260) | def remove_outliers(self, colnames=None, critical_value=.999):
method set_response (line 1276) | def set_response(self, colname):
method fill (line 1291) | def fill(self, *args, categorical=None, continuous=None, strategy=None):
method fence (line 1319) | def fence(self, colnames, k=1.5):
method disassemble (line 1343) | def disassemble(self, colname, new_colnames=None):
method to_metrics_RDD (line 1364) | def to_metrics_RDD(self, prob_col='probability', label_col='label'):
class Bucket (line 1385) | class Bucket(object):
method __init__ (line 1401) | def __init__(self, colname, bins=5):
method __repr__ (line 1407) | def __repr__(self):
method colname (line 1411) | def colname(self):
method _get_buckets (line 1414) | def _get_buckets(self, df):
method _get_clauses (line 1425) | def _get_clauses(self, buckets):
class Quantile (line 1436) | class Quantile(Bucket):
method __repr__ (line 1452) | def __repr__(self):
method _get_buckets (line 1455) | def _get_buckets(self, df):
class HandyColumns (line 1465) | class HandyColumns(object):
method __init__ (line 1481) | def __init__(self, df, handy, strata=None):
method __getitem__ (line 1492) | def __getitem__(self, *args):
method __repr__ (line 1554) | def __repr__(self):
method numerical (line 1559) | def numerical(self):
method categorical (line 1565) | def categorical(self):
method continuous (line 1571) | def continuous(self):
method string (line 1577) | def string(self):
method array (line 1583) | def array(self):
method mean (line 1588) | def mean(self):
method min (line 1591) | def min(self):
method max (line 1594) | def max(self):
method median (line 1597) | def median(self, precision=.01):
method stddev (line 1607) | def stddev(self):
method var (line 1610) | def var(self):
method percentile (line 1613) | def percentile(self, perc, precision=.01):
method q1 (line 1625) | def q1(self, precision=.01):
method q3 (line 1635) | def q3(self, precision=.01):
method _value_counts (line 1645) | def _value_counts(self, dropna=True, raw=True):
method value_counts (line 1649) | def value_counts(self, dropna=True):
method entropy (line 1669) | def entropy(self):
method mutual_info (line 1678) | def mutual_info(self):
method mode (line 1688) | def mode(self):
method corr (line 1702) | def corr(self, method='pearson'):
method nunique (line 1718) | def nunique(self):
method outliers (line 1732) | def outliers(self, ratio=False, method='tukey', **kwargs):
method get_outliers (line 1751) | def get_outliers(self, critical_value=.999):
method remove_outliers (line 1764) | def remove_outliers(self, critical_value=.999):
method hist (line 1777) | def hist(self, bins=10, ax=None):
method boxplot (line 1788) | def boxplot(self, ax=None, showfliers=True, k=1.5, precision=.01):
method scatterplot (line 1802) | def scatterplot(self, ax=None):
class HandyStrata (line 1812) | class HandyStrata(object):
method __init__ (line 1820) | def __init__(self, handy, strata):
method __repr__ (line 1873) | def __repr__(self):
method __getattribute__ (line 1880) | def __getattribute__(self, name):
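Taken together, the classes above form the user-facing API: toHandy() wraps a Spark DataFrame into a HandyFrame, .cols exposes pandas-like, column-oriented operations, and .stratify() runs the same operations per stratum. A hedged sketch of typical usage, assuming a Spark DataFrame sdf with the Titanic columns from tests/rawdata/train.csv (Pclass, Sex, Age, Fare, Embarked); the list passed as strategy to fill() is an assumption read off the signatures above:

import matplotlib.pyplot as plt
from handyspark import *

hdf = sdf.toHandy()                          # Spark DataFrame -> HandyFrame

hdf.isnull(ratio=True)                       # missing-value ratio per column
hdf.cols['Embarked'].value_counts(dropna=False)
hdf.cols[['Fare', 'Age']].mean()

# any .cols operation can be stratified; Bucket/Quantile discretize a
# continuous column so it can serve as a stratum
hdf.stratify(['Pclass', 'Sex']).cols['Age'].median()
hdf.stratify([Bucket('Age', 3), 'Pclass']).cols['Embarked'].mode()

# imputation and outlier fencing return new HandyFrames and keep the fitted
# values in .statistics_ / .fences_
hdf_filled = hdf.stratify(['Pclass']).fill(continuous=['Age'], strategy=['mean'])
hdf_fenced = hdf_filled.fence(['Fare'])

# plotting delegates to the functions in handyspark/plot.py
fig, ax = plt.subplots(1, 1)
hdf.cols['Age'].hist(bins=10, ax=ax)

sdf_plain = hdf.notHandy()                   # back to a plain Spark DataFrame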
FILE: handyspark/sql/datetime.py
class HandyDatetime (line 4) | class HandyDatetime(object):
method __init__ (line 20) | def __init__(self, df, colname):
method __getattribute__ (line 26) | def __getattribute__(self, name):
FILE: handyspark/sql/pandas.py
class HandyPandas (line 7) | class HandyPandas(object):
method __init__ (line 16) | def __init__(self, df):
method __getitem__ (line 20) | def __getitem__(self, *args):
method str (line 29) | def str(self):
method dt (line 57) | def dt(self):
method __getattribute__ (line 74) | def __getattribute__(self, name):
FILE: handyspark/sql/schema.py
function generate_schema (line 25) | def generate_schema(columns, nullable_columns='all'):
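generate_schema builds a Spark StructType without spelling the fields out by hand. A heavily hedged sketch: it assumes columns maps column names to Python/numpy types (the module imports numpy and datetime and keeps an internal type mapping) and that nullable_columns restricts which fields are nullable; neither detail is confirmed by the signature alone:

import numpy as np
from handyspark.sql import generate_schema

# assumed input shape: {column name: Python/numpy type}
schema = generate_schema({'PassengerId': int, 'Name': str, 'Fare': np.float64},
                         nullable_columns=['Name', 'Fare'])

# hypothetical follow-up, given an existing SparkSession `spark` and `rows`
df = spark.createDataFrame(rows, schema=schema)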
FILE: handyspark/sql/string.py
class HandyString (line 5) | class HandyString(object):
method __init__ (line 18) | def __init__(self, df, colname):
method _remove_accents (line 23) | def _remove_accents(input):
method remove_accents (line 26) | def remove_accents(self):
method __getattribute__ (line 31) | def __getattribute__(self, name):
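HandyString, HandyDatetime and HandyPandas (all listed above) delegate to pandas' Series.str and Series.dt methods through pandas UDFs, reached from a HandyFrame's pandas attribute. A hedged sketch based on the test suite: the alias keyword and the convention that each call returns a new HandyFrame with the extra column are assumptions, as is the existence of a timestamp column named 'Date':

hdf = sdf.toHandy()

# str methods mirror pandas' Series.str
hdf_mr = hdf.pandas['Name'].str.contains(pat='Mr\\.', alias='has_mr')

# boolean helpers live directly on the accessor
hdf_miss = hdf.pandas['Age'].isna(alias='age_missing')

# dt methods mirror pandas' Series.dt (assumed 'Date' timestamp column)
hdf_leap = hdf.pandas['Date'].dt.is_leap_year(alias='leap')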
FILE: handyspark/sql/transform.py
class HandyTransform (line 21) | class HandyTransform(object):
method _get_return (line 26) | def _get_return(sdf, f, args):
method _signatureType (line 35) | def _signatureType(sig):
method gen_pandas_udf (line 51) | def gen_pandas_udf(f, args=None, returnType=None):
method gen_grouped_pandas_udf (line 75) | def gen_grouped_pandas_udf(sdf, f, args=None, returnType=None):
method transform (line 97) | def transform(sdf, f, name=None, args=None, returnType=None):
method apply (line 107) | def apply(sdf, f, name=None, args=None, returnType=None):
method assign (line 117) | def assign(sdf, **kwargs):
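HandyTransform turns plain Python functions into pandas UDFs: the function's argument names are matched against column names and the return type is taken from a type hint or inferred from a sample (cf. _get_return above). A minimal sketch of the HandyFrame-level wrappers, assuming the Titanic 'Fare' column:

import numpy as np

hdf = sdf.toHandy()

# assign: the keyword becomes the new column name, the lambda's parameter
# name ('Fare') selects the input column, and the lambda receives a Series
hdf2 = hdf.assign(logFare=lambda Fare: np.log(Fare + 1))

# transform: same idea, but the output column name is passed explicitly
hdf3 = hdf.transform(lambda Fare: np.log(Fare + 1), name='logFare')

hdf2.cols['logFare'][:5]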
FILE: handyspark/stats.py
function StatisticalSummaryValues (line 6) | def StatisticalSummaryValues(sdf, colnames):
function tTest (line 22) | def tTest(jvm, *ssvs):
function KolmogorovSmirnovTest (line 36) | def KolmogorovSmirnovTest(sdf, colname, dist='normal', *params):
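KolmogorovSmirnovTest wraps the JVM-side KS test; extra distribution parameters are passed positionally after the distribution name. A hedged sketch, testing 'Fare' against a normal distribution with the sample's own mean and standard deviation; the parameter order and the shape of the returned result are assumptions not visible from the signature alone:

from pyspark.sql import functions as F
from handyspark.stats import KolmogorovSmirnovTest

# sample mean / stddev of the column under test
mu, sigma = sdf.agg(F.mean('Fare'), F.stddev('Fare')).first()

# distribution name first, then its parameters (assumed order: mean, stddev)
result = KolmogorovSmirnovTest(sdf, 'Fare', 'normal', mu, sigma)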
FILE: handyspark/util.py
function none2default (line 10) | def none2default(value, default):
function none2zero (line 13) | def none2zero(value):
function ensure_list (line 16) | def ensure_list(value):
function check_columns (line 24) | def check_columns(df, colnames):
class bcolors (line 32) | class bcolors:
class HandyException (line 42) | class HandyException(Exception):
method __init__ (line 43) | def __init__(self, *args, **kwargs):
method colortext (line 56) | def colortext(text, color_code):
method errortext (line 60) | def errortext(text):
method exception_summary (line 65) | def exception_summary():
function get_buckets (line 94) | def get_buckets(rdd, buckets):
function dense_to_array (line 140) | def dense_to_array(sdf, colname, new_colname):
function disassemble (line 158) | def disassemble(sdf, colname, new_colnames=None):
function get_jvm_class (line 192) | def get_jvm_class(cl):
function call_scala_method (line 197) | def call_scala_method(py_class, scala_method, df, *args):
function counts_to_df (line 233) | def counts_to_df(value_counts, colnames, n_points):
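dense_to_array and disassemble undo the packing done by VectorAssembler: the former turns a vector column into an array column, the latter splits it into one column per element. A hedged sketch following the signatures above; the column names are illustrative:

from pyspark.ml.feature import VectorAssembler
from handyspark.util import dense_to_array, disassemble

# pack two numeric columns into a single vector column
assembled = (VectorAssembler(inputCols=['Fare', 'Age'], outputCol='features')
             .transform(sdf.dropna(subset=['Fare', 'Age'])))

# vector -> array column, with an explicit name for the new column
with_array = dense_to_array(assembled, 'features', 'features_array')

# vector -> one column per element, optionally naming them
flat = disassemble(assembled, 'features', new_colnames=['Fare_v', 'Age_v'])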
FILE: setup.py
function readme (line 3) | def readme():
FILE: tests/handyspark/conftest.py
function sdf (line 17) | def sdf():
function sdates (line 21) | def sdates():
function pdf (line 25) | def pdf():
function pdates (line 30) | def pdates():
function predicted (line 34) | def predicted():
FILE: tests/handyspark/extensions/test_evaluation.py
function test_confusion_matrix (line 11) | def test_confusion_matrix(sdf):
function test_get_metrics_by_threshold (line 30) | def test_get_metrics_by_threshold(sdf):
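These tests exercise the BinaryClassificationMetrics extension in handyspark/extensions/evaluation.py. A hedged sketch of how the indexed pieces fit together: HandyFrame.to_metrics_RDD produces the (score, label) RDD the standard constructor expects, and the extension adds threshold-based metrics plus the ROC/PR plots backed by handyspark/plot.py. The column names and the getMetricsByThreshold name (read off test_get_metrics_by_threshold) are assumptions:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# `predictions` is assumed to be a HandyFrame with a fitted classifier's
# output: a 'probability' vector column and a 'Survived' label column
scoreAndLabels = predictions.to_metrics_RDD(prob_col='probability',
                                            label_col='Survived')
bcm = BinaryClassificationMetrics(scoreAndLabels)

bcm.areaUnderROC             # standard pyspark.mllib metric
bcm.getMetricsByThreshold()  # assumed extension method, per the test name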
FILE: tests/handyspark/extensions/test_types.py
function test_atomic_types (line 5) | def test_atomic_types():
function test_composite_types (line 9) | def test_composite_types():
FILE: tests/handyspark/ml/test_base.py
function test_imputer (line 7) | def test_imputer(sdf, pdf):
function test_fencer (line 25) | def test_fencer(sdf, pdf):
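test_imputer and test_fencer cover the HandyImputer / HandyFencer Transformers from handyspark/ml/base.py. A hedged sketch, assuming they are built from a previously filled or fenced HandyFrame through the transformers property indexed above (the imputer()/fencer() method names are assumptions):

hdf_filled = sdf.toHandy().stratify(['Pclass']).fill(continuous=['Age'],
                                                     strategy=['mean'])
imputer = hdf_filled.transformers.imputer()   # a regular Spark ML Transformer

imputer.save('age_imputer')                   # DefaultParamsWritable persistence
filled_again = imputer.transform(sdf)         # reusable on new DataFrames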
FILE: tests/handyspark/sql/test_dataframe.py
function test_to_from_handy (line 14) | def test_to_from_handy(sdf):
function test_shape (line 20) | def test_shape(sdf):
function test_response (line 23) | def test_response(sdf):
function test_safety_limit (line 31) | def test_safety_limit(sdf):
function test_safety_limit2 (line 49) | def test_safety_limit2(sdf):
function test_values (line 64) | def test_values(sdf, pdf):
function test_stages (line 70) | def test_stages(sdf):
function test_value_counts (line 76) | def test_value_counts(sdf, pdf):
function test_column_values (line 82) | def test_column_values(sdf, pdf):
function test_dataframe_values (line 87) | def test_dataframe_values(sdf, pdf):
function test_isnull (line 92) | def test_isnull(sdf, pdf):
function test_nunique (line 101) | def test_nunique(sdf, pdf):
function test_columns_nunique (line 108) | def test_columns_nunique(sdf, pdf):
function test_outliers (line 114) | def test_outliers(sdf, pdf):
function test_mean (line 129) | def test_mean(sdf, pdf):
function test_stratified_mean (line 135) | def test_stratified_mean(sdf, pdf):
function test_mode (line 141) | def test_mode(sdf, pdf):
function test_median (line 154) | def test_median(sdf, pdf):
function test_types (line 169) | def test_types(sdf):
function test_fill_categorical (line 178) | def test_fill_categorical(sdf):
function test_fill_continuous (line 184) | def test_fill_continuous(sdf, pdf):
function test_sequential_fill (line 196) | def test_sequential_fill(sdf):
function test_corr (line 204) | def test_corr(sdf, pdf):
function test_stratified_corr (line 210) | def test_stratified_corr(sdf, pdf):
function test_fence (line 216) | def test_fence(sdf, pdf):
function test_stratified_fence (line 229) | def test_stratified_fence(sdf):
function test_grouped_column_values (line 236) | def test_grouped_column_values(sdf, pdf):
function test_bucket (line 242) | def test_bucket(sdf, pdf):
function test_quantile (line 252) | def test_quantile(sdf, pdf):
function test_stratify_length (line 262) | def test_stratify_length(sdf, pdf):
function test_stratify_list (line 269) | def test_stratify_list(sdf, pdf):
function test_stratify_pandas_df (line 277) | def test_stratify_pandas_df(sdf, pdf):
function test_stratify_pandas_series (line 284) | def test_stratify_pandas_series(sdf, pdf):
function test_stratify_spark_df (line 291) | def test_stratify_spark_df(sdf, pdf):
function test_stratify_fill (line 298) | def test_stratify_fill(sdf, pdf):
function test_repr (line 319) | def test_repr(sdf):
function test_stratify_bucket (line 324) | def test_stratify_bucket(sdf):
function test_stratified_nunique (line 334) | def test_stratified_nunique(sdf, pdf):
function test_mahalanobis (line 340) | def test_mahalanobis(sdf, pdf):
function test_entropy (line 350) | def test_entropy(sdf, pdf):
function test_mutual_info (line 356) | def test_mutual_info(sdf, pdf):
FILE: tests/handyspark/sql/test_datetime.py
function test_is_leap_year (line 4) | def test_is_leap_year(sdates, pdates):
function test_strftime (line 11) | def test_strftime(sdates, pdates):
function test_weekday_name (line 18) | def test_weekday_name(sdates, pdates):
function test_round (line 25) | def test_round(sdates, pdates):
FILE: tests/handyspark/sql/test_pandas.py
function test_between (line 5) | def test_between(sdf, pdf):
function test_isin (line 12) | def test_isin(sdf, pdf):
function test_isna (line 19) | def test_isna(sdf, pdf):
function test_notna (line 26) | def test_notna(sdf, pdf):
function test_clip (line 34) | def test_clip(sdf, pdf):
function test_replace (line 41) | def test_replace(sdf, pdf):
function test_round (line 48) | def test_round(sdf, pdf):
FILE: tests/handyspark/sql/test_schema.py
function test_generate_schema (line 5) | def test_generate_schema(sdf):
FILE: tests/handyspark/sql/test_string.py
function test_count (line 5) | def test_count(sdf, pdf):
function test_find (line 12) | def test_find(sdf, pdf):
function test_len (line 19) | def test_len(sdf, pdf):
function test_rfind (line 26) | def test_rfind(sdf, pdf):
function test_contains (line 34) | def test_contains(sdf, pdf):
function test_startswith (line 41) | def test_startswith(sdf, pdf):
function test_match (line 48) | def test_match(sdf, pdf):
function test_isalpha (line 55) | def test_isalpha(sdf, pdf):
function test_replace (line 63) | def test_replace(sdf, pdf):
function test_repeat (line 70) | def test_repeat(sdf, pdf):
function test_join (line 77) | def test_join(sdf, pdf):
function test_pad (line 84) | def test_pad(sdf, pdf):
function test_slice (line 91) | def test_slice(sdf, pdf):
function test_slice_replace (line 98) | def test_slice_replace(sdf, pdf):
function test_strip (line 105) | def test_strip(sdf, pdf):
function test_wrap (line 112) | def test_wrap(sdf, pdf):
function test_get (line 119) | def test_get(sdf, pdf):
function test_center (line 126) | def test_center(sdf, pdf):
function test_zfill (line 133) | def test_zfill(sdf, pdf):
function test_normalize (line 140) | def test_normalize(sdf, pdf):
function test_upper (line 147) | def test_upper(sdf, pdf):
FILE: tests/handyspark/sql/test_transform.py
function test_apply_axis0 (line 5) | def test_apply_axis0(sdf, pdf):
function test_apply_axis1 (line 15) | def test_apply_axis1(sdf, pdf):
function test_transform_axis0 (line 28) | def test_transform_axis0(sdf, pdf):
function test_transform_axis1 (line 38) | def test_transform_axis1(sdf, pdf):
function test_assign_axis0 (line 51) | def test_assign_axis0(sdf, pdf):
function test_assign_axis1 (line 58) | def test_assign_axis1(sdf, pdf):
FILE: tests/handyspark/test_plot.py
function plot_to_base64 (line 10) | def plot_to_base64(fig):
function plot_to_pixels (line 18) | def plot_to_pixels(fig, shape=None):
function test_boxplot_single (line 30) | def test_boxplot_single(sdf, pdf):
function test_boxplot_multiple (line 41) | def test_boxplot_multiple(sdf, pdf):
function test_hist_categorical (line 57) | def test_hist_categorical(sdf, pdf):
function test_hist_continuous (line 68) | def test_hist_continuous(sdf, pdf):
function test_scatterplot (line 81) | def test_scatterplot(sdf, pdf):
function test_stratified_boxplot (line 106) | def test_stratified_boxplot(sdf, pdf):
function test_stratified_hist (line 123) | def test_stratified_hist(sdf, pdf):
FILE: tests/handyspark/test_stats.py
function test_ks (line 5) | def test_ks(sdf):
FILE: tests/handyspark/test_util.py
function test_dense_to_array (line 5) | def test_dense_to_array(sdf):
function test_disassemble (line 12) | def test_disassemble(sdf):
Condensed preview — 49 files, each showing path, character count, and a content snippet.
[
{
"path": ".gitignore",
"chars": 1211,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": ".travis.yml",
"chars": 886,
"preview": "language: python\nsudo: required\ndist: trusty\ncache:\n directories:\n - $HOME/.ivy2\n - $HOME/spark\n - $HOME/.cach"
},
{
"path": "LICENSE",
"chars": 1075,
"preview": "MIT License\n\nCopyright (c) 2018 Daniel Voigt Godoy\n\nPermission is hereby granted, free of charge, to any person obtainin"
},
{
"path": "README.md",
"chars": 21089,
"preview": "[](https://travis-ci.org/dvgodoy/handyspark)\n"
},
{
"path": "README.rst",
"chars": 18434,
"preview": "\n\n.. image:: https://travis-ci.org/dvgodoy/handyspark.svg?branch=master\n :target: https://travis-ci.org/dvgodoy/handys"
},
{
"path": "docs/Makefile",
"chars": 629,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS =\nSPHI"
},
{
"path": "docs/source/conf.py",
"chars": 5929,
"preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n#\n# HandySpark documentation build configuration file, created by\n# sphin"
},
{
"path": "docs/source/handyspark.extensions.rst",
"chars": 764,
"preview": "handyspark\\.extensions package\n==============================\n\nSubmodules\n----------\n\nhandyspark\\.extensions\\.common mod"
},
{
"path": "docs/source/handyspark.ml.rst",
"chars": 341,
"preview": "handyspark\\.ml package\n======================\n\nSubmodules\n----------\n\nhandyspark\\.ml\\.base module\n----------------------"
},
{
"path": "docs/source/handyspark.rst",
"chars": 703,
"preview": "handyspark package\n==================\n\nSubpackages\n-----------\n\n.. toctree::\n\n handyspark.extensions\n handyspark.m"
},
{
"path": "docs/source/handyspark.sql.rst",
"chars": 1172,
"preview": "handyspark\\.sql package\n=======================\n\nSubmodules\n----------\n\nhandyspark\\.sql\\.dataframe module\n--------------"
},
{
"path": "docs/source/includeme.rst",
"chars": 31,
"preview": ".. include:: ../../README.rst\n\n"
},
{
"path": "docs/source/index.rst",
"chars": 440,
"preview": ".. HandySpark documentation master file, created by\n sphinx-quickstart on Sun Oct 28 17:42:51 2018.\n You can adapt t"
},
{
"path": "docs/source/modules.rst",
"chars": 67,
"preview": "handyspark\n==========\n\n.. toctree::\n :maxdepth: 4\n\n handyspark\n"
},
{
"path": "handyspark/__init__.py",
"chars": 224,
"preview": "from handyspark.extensions.evaluation import BinaryClassificationMetrics\nfrom handyspark.sql import HandyFrame, Bucket, "
},
{
"path": "handyspark/extensions/__init__.py",
"chars": 231,
"preview": "from handyspark.extensions.common import JavaModelWrapper\nfrom handyspark.extensions.evaluation import BinaryClassificat"
},
{
"path": "handyspark/extensions/common.py",
"chars": 590,
"preview": "from pyspark.mllib.common import _java2py, _py2java, JavaModelWrapper\n\ndef call2(self, name, *a):\n \"\"\"Another call me"
},
{
"path": "handyspark/extensions/evaluation.py",
"chars": 6115,
"preview": "import pandas as pd\nfrom operator import itemgetter\nfrom handyspark.plot import roc_curve, pr_curve\nfrom pyspark.mllib.e"
},
{
"path": "handyspark/extensions/types.py",
"chars": 431,
"preview": "from pyspark.sql.types import AtomicType, ArrayType, MapType\n\n@classmethod\ndef ret(cls, expr):\n \"\"\"Assigns a return t"
},
{
"path": "handyspark/ml/__init__.py",
"chars": 105,
"preview": "from handyspark.ml.base import HandyFencer, HandyImputer\n\n__all__ = [\n 'HandyFencer', 'HandyImputer'\n]"
},
{
"path": "handyspark/ml/base.py",
"chars": 6187,
"preview": "import json\nfrom pyspark.ml.base import Transformer\nfrom pyspark.ml.util import DefaultParamsReadable, DefaultParamsWrit"
},
{
"path": "handyspark/plot.py",
"chars": 11255,
"preview": "import matplotlib.pyplot as plt\nimport numpy as np\nimport pandas as pd\nimport seaborn as sns\nfrom inspect import signatu"
},
{
"path": "handyspark/sql/__init__.py",
"chars": 199,
"preview": "from handyspark.sql.dataframe import HandyFrame, Bucket, Quantile, DataFrame\nfrom handyspark.sql.schema import generate_"
},
{
"path": "handyspark/sql/dataframe.py",
"chars": 80200,
"preview": "from copy import deepcopy\nfrom handyspark.ml.base import HandyTransformers\nfrom handyspark.plot import histogram, boxplo"
},
{
"path": "handyspark/sql/datetime.py",
"chars": 2601,
"preview": "from handyspark.sql.transform import HandyTransform\nimport pandas as pd\n\nclass HandyDatetime(object):\n __supported = "
},
{
"path": "handyspark/sql/pandas.py",
"chars": 3448,
"preview": "from handyspark.sql.datetime import HandyDatetime\nfrom handyspark.sql.string import HandyString\nfrom handyspark.sql.tran"
},
{
"path": "handyspark/sql/schema.py",
"chars": 2010,
"preview": "import numpy as np\nimport datetime\nfrom operator import itemgetter\nfrom pyspark.sql.types import StructType\n\n_mapping = "
},
{
"path": "handyspark/sql/string.py",
"chars": 2271,
"preview": "from handyspark.sql.transform import HandyTransform\nimport unicodedata\nimport pandas as pd\n\nclass HandyString(object):\n "
},
{
"path": "handyspark/sql/transform.py",
"chars": 4784,
"preview": "import datetime\nimport inspect\nimport numpy as np\nfrom pyspark.sql import functions as F\n\n_MAPPING = {'string': str,\n "
},
{
"path": "handyspark/stats.py",
"chars": 2808,
"preview": "import numpy as np\nfrom handyspark.util import check_columns, ensure_list\nfrom pyspark.mllib.common import _py2java\nfrom"
},
{
"path": "handyspark/util.py",
"chars": 10056,
"preview": "from math import isnan, isinf\nimport pandas as pd\nfrom pyspark.ml.linalg import DenseVector\nfrom pyspark.rdd import RDD\n"
},
{
"path": "notebooks/Exploring_Titanic.ipynb",
"chars": 191317,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"metadata\": {},\n \"source\": [\n \"# HandySpark\\n\",\n \"\\n\",\n \"###"
},
{
"path": "requirements.txt",
"chars": 125,
"preview": "numpy>=1.14\nscikit-learn>=0.20.0\npandas>=0.24\nmatplotlib>=2.2.3\nseaborn>=0.9\npyspark>=2.3\nscipy>=1.0\nfindspark\npyarrow>="
},
{
"path": "setup.cfg",
"chars": 39,
"preview": "[metadata]\ndescription-file = README.md"
},
{
"path": "setup.py",
"chars": 1332,
"preview": "from setuptools import setup, find_packages\n\ndef readme():\n with open('README.md') as f:\n return f.read()\n\nset"
},
{
"path": "tests/handyspark/conftest.py",
"chars": 1237,
"preview": "import findspark\nimport os\nimport pandas as pd\nimport pytest\nfrom pyspark.sql import SparkSession\nfrom pyspark.ml.featur"
},
{
"path": "tests/handyspark/extensions/test_evaluation.py",
"chars": 2964,
"preview": "import numpy as np\nimport numpy.testing as npt\nimport pandas as pd\nfrom handyspark import *\nfrom pyspark.ml.classificati"
},
{
"path": "tests/handyspark/extensions/test_types.py",
"chars": 452,
"preview": "from handyspark import *\nimport numpy.testing as npt\nfrom pyspark.sql.types import IntegerType, StringType, ArrayType, M"
},
{
"path": "tests/handyspark/ml/test_base.py",
"chars": 1620,
"preview": "import numpy as np\nimport numpy.testing as npt\nimport handyspark\nfrom operator import itemgetter\nfrom sklearn.preprocess"
},
{
"path": "tests/handyspark/sql/test_dataframe.py",
"chars": 13864,
"preview": "import numpy as np\nimport numpy.testing as npt\nfrom handyspark import *\nimport pandas as pd\nfrom pyspark.sql import Data"
},
{
"path": "tests/handyspark/sql/test_datetime.py",
"chars": 1090,
"preview": "import numpy.testing as npt\nfrom handyspark import *\n\ndef test_is_leap_year(sdates, pdates):\n hdf = sdates.toHandy()\n"
},
{
"path": "tests/handyspark/sql/test_pandas.py",
"chars": 1755,
"preview": "import numpy.testing as npt\nfrom handyspark import *\n\n# boolean returns\ndef test_between(sdf, pdf):\n hdf = sdf.toHand"
},
{
"path": "tests/handyspark/sql/test_schema.py",
"chars": 434,
"preview": "import numpy as np\nimport numpy.testing as npt\nfrom handyspark.sql import generate_schema\n\ndef test_generate_schema(sdf)"
},
{
"path": "tests/handyspark/sql/test_string.py",
"chars": 5202,
"preview": "import numpy.testing as npt\nfrom handyspark import *\n\n# integer returns\ndef test_count(sdf, pdf):\n hdf = sdf.toHandy("
},
{
"path": "tests/handyspark/sql/test_transform.py",
"chars": 3011,
"preview": "import numpy.testing as npt\nfrom pyspark.sql.types import DoubleType, StringType\nfrom handyspark import *\n\ndef test_appl"
},
{
"path": "tests/handyspark/test_plot.py",
"chars": 4902,
"preview": "import base64\nimport numpy.testing as npt\nimport numpy as np\nimport seaborn as sns\nfrom handyspark import *\nfrom handysp"
},
{
"path": "tests/handyspark/test_stats.py",
"chars": 864,
"preview": "import numpy.testing as npt\nfrom handyspark.stats import KolmogorovSmirnovTest\nfrom pyspark.sql import functions as F\n\nd"
},
{
"path": "tests/handyspark/test_util.py",
"chars": 854,
"preview": "import numpy.testing as npt\nfrom pyspark.ml.feature import VectorAssembler\nfrom handyspark.util import dense_to_array, d"
},
{
"path": "tests/rawdata/train.csv",
"chars": 61194,
"preview": "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\r\n1,0,3,\"Braund, Mr. Owen Harris\",male,22"
}
]