Full Code of d6t/d6tstack for AI

master a0924bd7d63b cached

44 files

444.9 KB

155.3k tokens

196 symbols

1 requests

Download .txt

Showing preview only (463K chars total). Download the full file or copy to clipboard to get everything.

Repository: d6t/d6tstack
Branch: master
Commit: a0924bd7d63b
Files: 44
Total size: 444.9 KB

Directory structure:
gitextract_bcekc6b6/

├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── d6tstack/
│   ├── __init__.py
│   ├── combine_csv.py
│   ├── convert_xls.py
│   ├── helpers.py
│   ├── pyftp_final.py
│   ├── sniffer.py
│   ├── sync.py
│   └── utils.py
├── docs/
│   ├── Makefile
│   ├── make.bat
│   ├── make_zip_sample_csv.py
│   ├── make_zip_sample_xls.py
│   ├── shell-napoleon-html.sh
│   ├── shell-napoleon-recreate.sh
│   └── source/
│       ├── conf.py
│       ├── d6tstack.rst
│       ├── index.rst
│       ├── modules.rst
│       ├── setup.rst
│       └── tests.rst
├── examples-csv.ipynb
├── examples-dask.ipynb
├── examples-excel.ipynb
├── examples-pyspark.ipynb
├── examples-read-write.ipynb
├── examples-sql.ipynb
├── requirements-dev.txt
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests/
    ├── __init__.py
    ├── pypi.sh
    ├── test-parquet.py
    ├── test_combine_csv.py
    ├── test_combine_old.py
    ├── test_sync.py
    ├── test_xls.py
    ├── tmp-reindex-withorder.py
    ├── tmp-runtest.py
    └── tmp.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
tests/.test-cred.yaml

.idea/
.env
temp/
fiddle*
.pytest_cache/
test-data/output/

# add this manually
test-data/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# pypi config file
.pypirc


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Databolt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: MANIFEST.in
================================================
include README.md
include LICENSE

================================================
FILE: README.md
================================================
# Databolt File Ingest

Quickly ingest raw files. Works for XLS, CSV, TXT which can be exported to CSV, Parquet, SQL and Pandas. `d6tstack` solves many performance and schema problems typically encountered when ingesting raw files. 

![](https://www.databolt.tech/images/combiner-landing-git.png)

### Features include

* Fast pd.to_sql() for postgres and mysql
* Quickly check columns for consistency across files
* Fix added/missing columns
* Fix renamed columns
* Check Excel tabs for consistency across files
* Excel to CSV converter (incl multi-sheet support)
* Out of core functionality to process large files
* Export to CSV, parquet, SQL, pandas dataframe

## Installation

Latest published version `pip install d6tstack`. Additional requirements:
* `d6tstack[psql]`: for pandas to postgres
* `d6tstack[mysql]`: for pandas to mysql
* `d6tstack[xls]`: for excel support
* `d6tstack[parquet]`: for ingest csv to parquet

Latest dev version from github `pip install git+https://github.com/d6t/d6tstack.git`  

### Sample Use

```

import d6tstack

# fast CSV to SQL import - see SQL examples notebook
d6tstack.utils.pd_to_psql(df, 'postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')
d6tstack.utils.pd_to_mysql(df, 'mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')
d6tstack.utils.pd_to_mssql(df, 'mssql+pymssql://usr:pwd@localhost/db', 'tablename') # experimental

# ingest mutiple CSVs which may have data schema changes - see CSV examples notebook

import glob
>>> c = d6tstack.combine_csv.CombinerCSV(glob.glob('data/*.csv'))

# show columns of each file
>>> c.columns()

# quick check if all files have consistent columns
>>> c.is_all_equal()
False

# show which files have missing columns
>>> c.is_column_present()
   filename  cost  date profit profit2 sales
0  feb.csv  True  True   True   False  True
2  mar.csv  True  True   True    True  True

>>> c.combine_preview() # keep all columns
   filename  cost        date profit profit2 sales
0   jan.csv  -80  2011-01-01     20     NaN   100
0   mar.csv  -100  2011-03-01    200     400   300

>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_select_common=True).combine_preview() # keep common columns
   filename  cost        date profit sales
0   jan.csv  -80  2011-01-01     20   100
0   mar.csv  -100  2011-03-01    200   300

>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_rename={'sales':'revenue'}).combine_preview()
   filename  cost        date profit profit2 revenue
0   jan.csv  -80  2011-01-01     20     NaN   100
0   mar.csv  -100  2011-03-01    200     400   300

# to come: check if columns match database
>>> c.is_columns_match_db('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')

# create csv with first nrows_preview rows of each file
>>> c.to_csv_head()

# export to csv, parquet, sql. Out of core with optimized fast imports for postgres and mysql
>>> c.to_pandas()
>>> c.to_csv_align(output_dir='process/')
>>> c.to_parquet_align(output_dir='process/')
>>> c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')
>>> c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast, using COPY FROM
>>> c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast, using LOAD DATA LOCAL INFILE

# read Excel files - see Excel examples notebook for more details
import d6tstack.convert_xls

d6tstack.convert_xls.read_excel_advanced('test.xls',
    sheet_name='Sheet1', header_xls_range="B2:E2")

d6tstack.convert_xls.XLStoCSVMultiSheet('test.xls').convert_all(header_xls_range="B2:E2")

d6tstack.convert_xls.XLStoCSVMultiFile(glob.glob('*.xls'), 
    cfg_xls_sheets_sel_mode='name_global',cfg_xls_sheets_sel='Sheet1')
    .convert_all(header_xls_range="B2:E2")

```


## Documentation

*  [SQL examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb) - Fast loading of CSV to SQL with pandas preprocessing
*  [CSV examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-csv.ipynb) - Quickly load any type of CSV files
*  [Excel examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-excel.ipynb) - Quickly extract from Excel to CSV 
*  [Dask Examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-dask.ipynb) - How to use d6tstack to solve Dask input file problems
*  [Pyspark Examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-pyspark.ipynb) - How to use d6tstack to solve pyspark input file problems
*  [Function reference docs](http://d6tstack.readthedocs.io/en/latest/py-modindex.html) - Detailed documentation for modules, classes, functions

## Faster Data Engineering

Check out other d6t libraries to solve common data engineering problems, including  
* data worfklows,build highly effective data science workflows
* fuzzy joins, quickly join data
* data pipes, quickly share and distribute data

https://github.com/d6t/d6t-python

And we encourage you to join the Databolt blog to get updates and tips+tricks http://blog.databolt.tech

## Collecting Errors Messages and Usage statistics

We have put a lot of effort into making this library useful to you. To help us make this library even better, it collects ANONYMOUS error messages and usage statistics. See [d6tcollect](https://github.com/d6t/d6tcollect) for details including how to disable collection. Collection is asynchronous and doesn't impact your code in any way.

It may not catch all errors so if you run into any problems or have any questions, please raise an issue on github.


================================================
FILE: d6tstack/__init__.py
================================================
import d6tstack.combine_csv
#import d6tstack.convert_xls
import d6tstack.sniffer
#import d6tstack.sync
import d6tstack.utils


================================================
FILE: d6tstack/combine_csv.py
================================================
import numpy as np
import pandas as pd
pd.set_option('display.expand_frame_repr', False)
from scipy.stats import mode
import warnings
import ntpath, pathlib
import copy
import itertools
import os

import d6tcollect
# d6tcollect.init(__name__)

from .helpers import *
from .utils import PrintLogger


# ******************************************************************
# helpers
# ******************************************************************
def _dfconact(df):
    return pd.concat(itertools.chain.from_iterable(df), sort=False, copy=False, join='inner', ignore_index=True)

def _direxists(fname, logger):
    fdir = os.path.dirname(fname)
    if fdir and not os.path.exists(fdir):
        if logger:
            logger.send_log('creating ' + fdir, 'ok')
        os.makedirs(fdir)
    return True

# ******************************************************************
# combiner
# ******************************************************************

class CombinerCSV(object, metaclass=d6tcollect.Collect):
    """    
    Core combiner class. Sniffs columns, generates preview, combines aka stacks to various output formats.

    Args:
        fname_list (list): file names, eg ['a.csv','b.csv']
        sep (string): CSV delimiter, see pandas.read_csv()
        has_header (boolean): data has header row 
        nrows_preview (int): number of rows in preview
        chunksize (int): number of rows to read into memory while processing, see pandas.read_csv()
        read_csv_params (dict): additional parameters to pass to pandas.read_csv()
        columns_select (list): list of column names to keep
        columns_select_common (bool): keep only common columns. Use this instead of `columns_select`
        columns_rename (dict): dict of columns to rename `{'name_old':'name_new'}
        add_filename (bool): add filename column to output data frame. If `False`, will not add column.
        apply_after_read (function): function to apply after reading each file. needs to return a dataframe
        log (bool): send logs to logger
        logger (object): logger object with `send_log()`

    """

    def __init__(self, fname_list, sep=',', nrows_preview=3, chunksize=1e6, read_csv_params=None,
                 columns_select=None, columns_select_common=False, columns_rename=None, add_filename=True,
                 apply_after_read=None, log=True, logger=None):
        if not fname_list:
            raise ValueError("Filename list should not be empty")
        self.fname_list = np.sort(fname_list)
        self.nrows_preview = nrows_preview
        self.read_csv_params = read_csv_params
        if not self.read_csv_params:
            self.read_csv_params = {}
        if not 'sep' in self.read_csv_params:
            self.read_csv_params['sep'] = sep
        if not 'chunksize' in self.read_csv_params:
            self.read_csv_params['chunksize'] = chunksize
        self.logger = logger
        if not logger and log:
            self.logger = PrintLogger()
        if not log:
            self.logger = None
        self.sniff_results = None
        self.add_filename = add_filename
        self.columns_select = columns_select
        self.columns_select_common = columns_select_common
        if columns_select and columns_select_common:
            warnings.warn('columns_select will override columns_select_common, pick either one')
        self.columns_rename = columns_rename
        self._columns_reindex = None
        self._columns_rename_dict = None
        self.apply_after_read = apply_after_read

        self.df_combine_preview = None

        if self.columns_select:
            if max(collections.Counter(columns_select).values())>1:
                raise ValueError('Duplicate entries in columns_select')

    def _read_csv_yield(self, fname, read_csv_params):
        self._columns_reindex_available()
        dfs = pd.read_csv(fname, **read_csv_params)
        for dfc in dfs:
            if self.columns_rename and self._columns_rename_dict[fname]:
                dfc = dfc.rename(columns=self._columns_rename_dict[fname])

            dfc = dfc.reindex(columns=self._columns_reindex)
            if self.apply_after_read:
                dfc = self.apply_after_read(dfc)
            if self.add_filename:
                dfc['filepath'] = fname
                dfc['filename'] = ntpath.basename(fname)
            yield dfc

    def sniff_columns(self):

        """
        
        Checks column consistency by reading top nrows in all files. It checks both presence and order of columns in all files

        Returns:
            dict: results dictionary with
                files_columns (dict): dictionary with information, keys = filename, value = list of columns in file
                columns_all (list): all columns in files
                columns_common (list): only columns present in every file
                is_all_equal (boolean): all files equal in all files?
                df_columns_present (dataframe): which columns are present in which file?
                df_columns_order (dataframe): where in the file is the column?

        """

        if self.logger:
            self.logger.send_log('sniffing columns', 'ok')

        read_csv_params = copy.deepcopy(self.read_csv_params)
        read_csv_params['dtype'] = str
        read_csv_params['nrows'] = self.nrows_preview
        read_csv_params['chunksize'] = None

        # read nrows of every file
        self.dfl_all = []
        for fname in self.fname_list:
            # todo: make sure no nrows param in self.read_csv_params
            df = pd.read_csv(fname, **read_csv_params)
            self.dfl_all.append(df)

        # process columns
        dfl_all_col = [df.columns.tolist() for df in self.dfl_all]
        col_files = dict(zip(self.fname_list, dfl_all_col))
        col_common = list_common(list(col_files.values()))
        col_all = list_unique(list(col_files.values()))

        # find index in column list so can check order is correct
        df_col_present = {}
        for iFileName, iFileCol in col_files.items():
            df_col_present[iFileName] = [iCol in iFileCol for iCol in col_all]

        df_col_present = pd.DataFrame(df_col_present, index=col_all).T
        df_col_present.index.names = ['file_path']

        # find index in column list so can check order is correct
        df_col_idx = {}
        for iFileName, iFileCol in col_files.items():
            df_col_idx[iFileName] = [iFileCol.index(iCol) if iCol in iFileCol else np.nan for iCol in col_all]
        df_col_idx = pd.DataFrame(df_col_idx, index=col_all).T

        # order columns by where they appear in file
        m=mode(df_col_idx,axis=0)
        df_col_pos = pd.DataFrame({'o':m[0][0],'c':m[1][0]},index=df_col_idx.columns)
        df_col_pos = df_col_pos.sort_values(['o','c'])
        df_col_pos['iscommon']=df_col_pos.index.isin(col_common)


        # reorder by position
        col_all = df_col_pos.index.values.tolist()
        col_common = df_col_pos[df_col_pos['iscommon']].index.values.tolist()
        col_unique = df_col_pos[~df_col_pos['iscommon']].index.values.tolist()
        df_col_present = df_col_present[col_all]
        df_col_idx = df_col_idx[col_all]

        sniff_results = {'files_columns': col_files, 'columns_all': col_all, 'columns_common': col_common,
                       'columns_unique': col_unique, 'is_all_equal': columns_all_equal(dfl_all_col),
                       'df_columns_present': df_col_present, 'df_columns_order': df_col_idx}
        self.sniff_results = sniff_results

        return sniff_results

    def get_sniff_results(self):
        if not self.sniff_results:
            self.sniff_columns()
        return self.sniff_results

    def _sniff_available(self):
        if not self.sniff_results:
            self.sniff_columns()

    def is_all_equal(self):
        """
        Checks if all columns are equal in all files

        Returns:
             bool: all columns are equal in all files?
        """
        self._sniff_available()
        return self.sniff_results['is_all_equal']

    def is_column_present(self):
        """
        Shows which columns are present in which files

        Returns:
             dataframe: boolean values for column presence in each file
        """
        self._sniff_available()
        return self.sniff_results['df_columns_present']

    def is_column_present_unique(self):
        """
        Shows unique columns by file

        Returns:
             dataframe: boolean values for column presence in each file
        """
        self._sniff_available()
        return self.is_column_present()[self.sniff_results['columns_unique']]

    def columns_unique(self):
        """
        Shows unique columns by file

        Returns:
             dataframe: boolean values for column presence in each file
        """
        self.columns_unique()

    def is_column_present_common(self):
        """
        Shows common columns by file        

        Returns:
             dataframe: boolean values for column presence in each file
        """
        self._sniff_available()
        return self.is_column_present()[self.sniff_results['columns_common']]

    def columns_common(self):
        """
        Shows common columns by file        

        Returns:
             dataframe: boolean values for column presence in each file
        """
        return self.is_column_present_common()

    def columns(self):
        """
        Shows columns by file        

        Returns:
             dict: filename, columns
        """
        self._sniff_available()
        return self.sniff_results['files_columns']

    def head(self):
        """
        Shows preview rows for each file

        Returns:
             dict: filename, dataframe
        """
        self._sniff_available()
        return dict(zip(self.fname_list,self.dfl_all))

    def _columns_reindex_prep(self):

        self._sniff_available()
        self._columns_select_dict = {} # select columns by filename
        self._columns_rename_dict = {} # rename columns by filename

        for fname in self.fname_list:
            if self.columns_rename:
                columns_rename = self.columns_rename.copy()
                # check no naming conflicts
                columns_select2 = [columns_rename[k] if k in columns_rename.keys() else k for k in self.sniff_results['files_columns'][fname]]
                df_rename_count = collections.Counter(columns_select2)
                if df_rename_count and max(df_rename_count.values()) > 1:  # would the rename create naming conflict?
                    warnings.warn('Renaming conflict: {}'.format([(k, v) for k, v in df_rename_count.items() if v > 1]),
                                  UserWarning)
                    while df_rename_count and max(df_rename_count.values()) > 1:
                        # remove key value pair causing conflict
                        conflicting_keys = [i for i, j in df_rename_count.items() if j > 1]
                        columns_rename = {k: v for k, v in columns_rename.items() if k in conflicting_keys}
                        columns_select2 = [columns_rename[k] if k in columns_rename.keys() else k for k in
                                           self.sniff_results['files_columns'][fname]]
                        df_rename_count = collections.Counter(columns_select2)

                # store rename by file. keep only renames for columns actually present in file
                self._columns_rename_dict[fname] = dict((k,v) for k,v in columns_rename.items() if k in k in self.sniff_results['files_columns'][fname])

        if self.columns_select:
            columns_select2 = self.columns_select.copy()
        else:
            if self.columns_select_common:
                columns_select2 = self.sniff_results['columns_common'].copy()
            else:
                columns_select2 = self.sniff_results['columns_all'].copy()

        if self.columns_rename:
            columns_select2 = list(dict.fromkeys([columns_rename[k] if k in columns_rename.keys() else k for k in columns_select2]))  # set of columns after rename
        # store select by file
        self._columns_reindex = columns_select2

    def _columns_reindex_available(self):
        if not self._columns_rename_dict or not self._columns_reindex:
            self._columns_reindex_prep()

    def preview_rename(self):
        """
        Shows which columns will be renamed in processing

        Returns:
            dataframe: columns to be renamed from which file
        """
        self._columns_reindex_available()
        df = pd.DataFrame(self._columns_rename_dict).T
        return df

    def preview_select(self):
        """
        Shows which columns will be selected in processing

        Returns:
            list: columns to be selected from all files
        """
        self._columns_reindex_available()
        return self._columns_reindex

    def combine_preview(self):
        """
        Preview of what the combined data will look like

        Returns:
            dataframe: combined dataframe
        """
        read_csv_params = copy.deepcopy(self.read_csv_params)
        read_csv_params['nrows'] = self.nrows_preview

        df = [[dfc for dfc in self._read_csv_yield(fname, read_csv_params)] for fname in self.fname_list]
        df = _dfconact(df)
        self.df_combine_preview = df.copy()
        return df

    def _combine_preview_available(self):
        if self.df_combine_preview is None:
            self.combine_preview()

    def to_pandas(self):
        """
        Combine all files to a pandas dataframe

        Returns:
            dataframe: combined data
        """
        df = [[dfc for dfc in self._read_csv_yield(fname, self.read_csv_params)] for fname in self.fname_list]
        df = _dfconact(df)
        return df

    def _get_filepath_out(self, fname, output_dir, output_prefix, ext):
        # filename
        fname_out = ntpath.basename(fname)
        fname_out = os.path.splitext(fname_out)[0]
        fname_out = output_prefix + fname_out + ext

        # path
        output_dir = output_dir if output_dir else os.path.dirname(fname)
        fpath_out = os.path.join(output_dir, fname_out)
        assert _direxists(fpath_out, self.logger)
        return fpath_out

    def _to_csv_prep(self, write_params):
        if 'index' not in write_params:
            write_params['index'] = False
        write_params.pop('header', None) # library handles

        self._combine_preview_available()

        return write_params

    def to_csv_head(self, output_dir=None, write_params={}):
        """
        Save `nrows_preview` header rows as individual files

        Args:
            output_dir (str): directory to save files in. If not given save in the same directory as the original file
            write_params (dict): additional params to pass to `pandas.to_csv()`

        Returns:
            list: list of filenames of processed files
        """

        write_params = self._to_csv_prep(write_params)

        fnamesout = []
        for fname, dfg in dict(zip(self.fname_list,self.dfl_all)).items():
            filename = f'{fname}-head.csv'
            filename = filename if output_dir is None else str(pathlib.Path(output_dir)/filename)
            dfg.to_csv(filename, **write_params)
            fnamesout.append(filename)

        return fnamesout

    def to_csv_align(self, output_dir=None, output_prefix='d6tstack-', write_params={}):
        """
        Create cleaned versions of original files. Automatically runs out of core, using `self.chunksize`.

        Args:
            output_dir (str): directory to save files in. If not given save in the same directory as the original file
            output_prefix (str): prepend with prefix to distinguish from original files
            write_params (dict): additional params to pass to `pandas.to_csv()`

        Returns:
            list: list of filenames of processed files
        """
        # stream all chunks to multiple files

        write_params = self._to_csv_prep(write_params)

        fnamesout = []
        for fname in self.fname_list:
            filename = self._get_filepath_out(fname, output_dir, output_prefix, '.csv')
            if self.logger:
                self.logger.send_log('writing '+filename , 'ok')
            fhandle = open(filename, 'w')
            self.df_combine_preview[:0].to_csv(fhandle, **write_params)
            for dfc in self._read_csv_yield(fname, self.read_csv_params):
                dfc.to_csv(fhandle, header=False, **write_params)
            fhandle.close()
            fnamesout.append(filename)

        return fnamesout

    def to_csv_combine(self, filename, write_params={}):
        """
        Combines all files to a single csv file. Automatically runs out of core, using `self.chunksize`.

        Args:
            filename (str): file names
            write_params (dict): additional params to pass to `pandas.to_csv()`

        Returns:
            str: filename for combined data
        """
        # stream all chunks from all files to a single file
        write_params = self._to_csv_prep(write_params)

        assert _direxists(filename, self.logger)
        fhandle = open(filename, 'w')
        self.df_combine_preview[:0].to_csv(fhandle, **write_params)
        for fname in self.fname_list:
            for dfc in self._read_csv_yield(fname, self.read_csv_params):
                dfc.to_csv(fhandle, header=False, **write_params)
        fhandle.close()
        return filename

    def to_parquet_align(self, output_dir=None, output_prefix='d6tstack-', write_params={}):
        """
        Same as `to_csv_align` but outputs parquet files

        """
        # write_params for pyarrow.parquet.write_table

        # stream all chunks to multiple files
        self._combine_preview_available()

        import pyarrow as pa
        import pyarrow.parquet as pq

        fnamesout = []
        pqschema = pa.Table.from_pandas(self.df_combine_preview).schema
        for fname in self.fname_list:
            filename = self._get_filepath_out(fname, output_dir, output_prefix, '.pq')
            if self.logger:
                self.logger.send_log('writing '+filename , 'ok')
            pqwriter = pq.ParquetWriter(filename, pqschema)
            for dfc in self._read_csv_yield(fname, self.read_csv_params):
                pqwriter.write_table(pa.Table.from_pandas(dfc.astype(self.df_combine_preview.dtypes), schema=pqschema),**write_params)
            pqwriter.close()
            fnamesout.append(filename)

        return fnamesout

    def to_parquet_combine(self, filename, write_params={}):
        """
        Same as `to_csv_combine` but outputs parquet files

        """
        # stream all chunks from all files to a single file
        self._combine_preview_available()

        assert _direxists(filename, self.logger)
        import pyarrow as pa
        import pyarrow.parquet as pq

        # todo: fix mixed data type writing. at least give a warning
        pqwriter = pq.ParquetWriter(filename, pa.Table.from_pandas(self.df_combine_preview).schema)
        for fname in self.fname_list:
            for dfc in self._read_csv_yield(fname, self.read_csv_params):
                pqwriter.write_table(pa.Table.from_pandas(dfc.astype(self.df_combine_preview.dtypes)),**write_params)
        pqwriter.close()
        return filename

    def to_sql_combine(self, uri, tablename, if_exists='fail', write_params=None, return_create_sql=False):
        """
        Load all files into a sql table using sqlalchemy. Generic but slower than the optmized functions

        Args:
            uri (str): sqlalchemy database uri
            tablename (str): table to store data in
            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details
            write_params (dict): additional params to pass to `pandas.to_sql()`
            return_create_sql (dict): show create sql statement for combined file schema. Doesn't run data load

        Returns:
            bool: True if loader finished

        """
        if not write_params:
            write_params = {}
        if 'if_exists' not in write_params:
            write_params['if_exists'] = if_exists
        if 'index' not in write_params:
            write_params['index'] = False
        self._combine_preview_available()

        if 'mysql' in uri and not 'mysql+pymysql' in uri:
            raise ValueError('need to use pymysql for mysql (pip install pymysql)')

        import sqlalchemy

        sql_engine = sqlalchemy.create_engine(uri)

        # create table
        dfhead = self.df_combine_preview.astype(self.df_combine_preview.dtypes)[:0]

        if return_create_sql:
            return pd.io.sql.get_schema(dfhead, tablename).replace('"',"`")

        dfhead.to_sql(tablename, sql_engine, **write_params)

        # append data
        write_params['if_exists'] = 'append'
        for fname in self.fname_list:
            for dfc in self._read_csv_yield(fname, self.read_csv_params):
                dfc.astype(self.df_combine_preview.dtypes).to_sql(tablename, sql_engine, **write_params)

        return True

    def to_psql_combine(self, uri, table_name, if_exists='fail', sep=','):
        """
        Load all files into a sql table using native postgres COPY FROM. Chunks data load to reduce memory consumption

        Args:
            uri (str): postgres psycopg2 sqlalchemy database uri
            table_name (str): table to store data in
            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details
            sep (str): separator for temp file, eg ',' or '\t'

        Returns:
            bool: True if loader finished

        """

        if not 'psycopg2' in uri:
            raise ValueError('need to use psycopg2 uri')

        self._combine_preview_available()

        import sqlalchemy
        import io

        sql_engine = sqlalchemy.create_engine(uri)
        sql_cnxn = sql_engine.raw_connection()
        cursor = sql_cnxn.cursor()

        self.df_combine_preview[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)

        for fname in self.fname_list:
            for dfc in self._read_csv_yield(fname, self.read_csv_params):
                fbuf = io.StringIO()
                dfc.astype(self.df_combine_preview.dtypes).to_csv(fbuf, index=False, header=False, sep=sep)
                fbuf.seek(0)
                cursor.copy_from(fbuf, table_name, sep=sep, null='')
        sql_cnxn.commit()
        cursor.close()

        return True

    def to_mysql_combine(self, uri, table_name, if_exists='fail', tmpfile='mysql.csv', sep=','):
        """
        Load all files into a sql table using native postgres LOAD DATA LOCAL INFILE. Chunks data load to reduce memory consumption

        Args:
            uri (str): mysql mysqlconnector sqlalchemy database uri
            table_name (str): table to store data in
            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details
            tmpfile (str): filename for temporary file to load from
            sep (str): separator for temp file, eg ',' or '\t'

        Returns:
            bool: True if loader finished

        """
        if not 'mysql+mysqlconnector' in uri:
            raise ValueError('need to use mysql+mysqlconnector uri (pip install mysql-connector)')

        self._combine_preview_available()

        import sqlalchemy

        sql_engine = sqlalchemy.create_engine(uri)

        self.df_combine_preview[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)

        if self.logger:
            self.logger.send_log('creating ' + tmpfile, 'ok')
        self.to_csv_combine(tmpfile, write_params={'na_rep':'\\N','sep':sep})
        if self.logger:
            self.logger.send_log('loading ' + tmpfile, 'ok')
        sql_load = "LOAD DATA LOCAL INFILE '{}' INTO TABLE {} FIELDS TERMINATED BY '{}' IGNORE 1 LINES;".format(tmpfile, table_name, sep)
        sql_engine.execute(sql_load)

        os.remove(tmpfile)

        return True

    def to_mssql_combine(self, uri, table_name, schema_name=None, if_exists='fail', tmpfile='mysql.csv'):
        """
        Load all files into a sql table using native postgres LOAD DATA LOCAL INFILE. Chunks data load to reduce memory consumption

        Args:
            uri (str): mysql mysqlconnector sqlalchemy database uri
            table_name (str): table to store data in
            schema_name (str): name of schema to write to
            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details
            tmpfile (str): filename for temporary file to load from

        Returns:
            bool: True if loader finished

        """
        if not 'mssql+pymssql' in uri:
            raise ValueError('need to use mssql+pymssql uri (conda install -c prometeia pymssql)')

        self._combine_preview_available()

        import sqlalchemy

        sql_engine = sqlalchemy.create_engine(uri)

        self.df_combine_preview[:0].to_sql(table_name, sql_engine, schema=schema_name, if_exists=if_exists, index=False)

        if self.logger:
            self.logger.send_log('creating ' + tmpfile, 'ok')
        self.to_csv_combine(tmpfile, write_params={'na_rep':'\\N'})
        if self.logger:
            self.logger.send_log('loading ' + tmpfile, 'ok')
        if schema_name is not None:
            table_name = '{}.{}'.format(schema_name,table_name)
        sql_load = "BULK INSERT {} FROM '{}';".format()(table_name, tmpfile)
        sql_engine.execute(sql_load)

        os.remove(tmpfile)

        return True

# todo: ever need to rerun _available fct instead of using cache?


================================================
FILE: d6tstack/convert_xls.py
================================================
import warnings
import os.path

import numpy as np
import pandas as pd

import ntpath

import openpyxl
import xlrd

try:
    from openpyxl.utils.cell import coordinate_from_string
except:
    from openpyxl.utils import coordinate_from_string
from d6tstack.helpers import compare_pandas_versions, check_valid_xls

import d6tcollect
# d6tcollect.init(__name__)

#******************************************************************
# read_excel_advanced
#******************************************************************
def read_excel_advanced(fname, remove_blank_cols=True, remove_blank_rows=True, collapse_header=True,
                        header_xls_range=None, header_xls_start=None, header_xls_end=None,
                        is_preview=False, nrows_preview=3, **kwds):
    """
    Read Excel files to pandas dataframe with advanced options like set header ranges and remove blank columns and rows

    Args:
        fname (str): Excel file path
        remove_blank_cols (bool): remove blank columns
        remove_blank_rows (bool): remove blank rows
        collapse_header (bool): to convert multiline header to a single line string
        header_xls_range (string): range of headers in excel, eg: A4:B16
        header_xls_start (string): Starting cell of excel for header range, eg: A4
        header_xls_end (string): End cell of excel for header range, eg: B16
        is_preview (bool): Read only first `nrows_preview` lines
        nrows_preview (integer): Initial number of rows to be used for preview columns (default: 3)
        kwds (mixed): parameters for `pandas.read_excel()` to pass through

    Returns:
         df (dataframe): pandas dataframe

    Note:
        You can pass in any `pandas.read_excel()` parameters in particular `sheet_name`

    """
    header = []

    if header_xls_range:
        if not (header_xls_start and header_xls_end):
            header_xls_range = header_xls_range.split(':')
            header_xls_start, header_xls_end = header_xls_range
        else:
            raise ValueError('Parameter conflict. Can only pass header_xls_range or header_xls_start with header_xls_end')

    if header_xls_start and header_xls_end:
        if 'skiprows' in kwds or 'usecols' in kwds:
            raise ValueError('Parameter conflict. Cannot pass skiprows or usecols with header_xls')

        scol, srow = coordinate_from_string(header_xls_start)
        ecol, erow = coordinate_from_string(header_xls_end)

        # header, skiprows, usecols
        header = list(range(erow - srow + 1))
        usecols = scol + ":" + ecol
        skiprows = srow - 1

        if compare_pandas_versions(pd.__version__, "0.20.3") > 0:
            df = pd.read_excel(fname, header=header, skiprows=skiprows, usecols=usecols, **kwds)
        else:
            df = pd.read_excel(fname, header=header, skiprows=skiprows, parse_cols=usecols, **kwds)
    else:
        df = pd.read_excel(fname, **kwds)

    # remove blank cols and rows
    if remove_blank_cols:
        df = df.dropna(axis='columns', how='all')
    if remove_blank_rows:
        df = df.dropna(axis='rows', how='all')

    # todo: add df.reset_index() once no actual data in index

    # clean up header
    if collapse_header:
        if len(header) > 1:
            df.columns = [' '.join([s for s in col if not 'Unnamed' in s]).strip().replace("\n", ' ')
                          for col in df.columns.values]
            df = df.reset_index()
        else:
            df.rename(columns=lambda x: x.strip().replace("\n", ' '), inplace=True)

    # preview
    if is_preview:
        df = df.head(nrows_preview)

    return df


#******************************************************************
# XLSSniffer
#******************************************************************

class XLSSniffer(object, metaclass=d6tcollect.Collect):
    """

    Extracts available sheets from MULTIPLE Excel files and runs diagnostics

    Args:
        fname_list (list): file paths, eg ['dir/a.csv','dir/b.csv']
        logger (object): logger object with send_log(), optional

    """

    def __init__(self, fname_list, logger=None):
        if not fname_list:
            raise ValueError("Filename list should not be empty")
        self.fname_list = fname_list
        self.logger = logger
        check_valid_xls(self.fname_list)
        self.sniff()

    def sniff(self):
        """

        Executes sniffer

        Returns:
            boolean: True if everything ok. Results are accessible in ``.df_xls_sheets``

        """

        xls_sheets = {}
        for fname in self.fname_list:
            if self.logger:
                self.logger.send_log('sniffing sheets in '+ntpath.basename(fname),'ok')

            xls_fname = {}
            xls_fname['file_name'] = ntpath.basename(fname)
            if fname[-5:]=='.xlsx':
                fh = openpyxl.load_workbook(fname,read_only=True)
                xls_fname['sheets_names'] = fh.sheetnames
                fh.close()
                # todo: need to close file?
            elif fname[-4:]=='.xls':
                fh = xlrd.open_workbook(fname, on_demand=True)
                xls_fname['sheets_names'] = fh.sheet_names()
                fh.release_resources()
            else:
                raise IOError('Only .xls or .xlsx files can be combined')

            xls_fname['sheets_count'] = len(xls_fname['sheets_names'])
            xls_fname['sheets_idx'] = np.arange(xls_fname['sheets_count']).tolist()
            xls_sheets[fname] = xls_fname

            self.xls_sheets = xls_sheets

        df_xls_sheets = pd.DataFrame(xls_sheets).T
        df_xls_sheets.index.names = ['file_path']

        self.dict_xls_sheets = xls_sheets
        self.df_xls_sheets = df_xls_sheets

        return True

    def all_contain_sheetname(self,sheet_name):
        """
        Check if all files contain a certain sheet

        Args:
            sheet_name (string): sheetname to check

        Returns:
            boolean: If true

        """
        return np.all([sheet_name in self.dict_xls_sheets[fname]['sheets_names'] for fname in self.fname_list])

    def all_have_idx(self,sheet_idx):
        """
        Check if all files contain a certain index

        Args:
            sheet_idx (string): index to check

        Returns:
            boolean: If true

        """
        return np.all([sheet_idx<=(d['sheets_count']-1) for k,d in self.dict_xls_sheets.items()])

    def all_same_count(self):
        """
        Check if all files contain the same number of sheets

        Args:
            sheet_idx (string): index to check

        Returns:
            boolean: If true

        """
        first_elem = next(iter(self.dict_xls_sheets.values()))
        return np.all([first_elem['sheets_count']==d['sheets_count'] for k,d in self.dict_xls_sheets.items()])

    def all_same_names(self):
        first_elem = next(iter(self.dict_xls_sheets.values()))
        return np.all([first_elem['sheets_names']==d['sheets_names'] for k,d in self.dict_xls_sheets.items()])



#******************************************************************
# convertor
#******************************************************************
class XLStoBase(object, metaclass=d6tcollect.Collect):
    def __init__(self, if_exists='skip', output_dir=None, logger=None):
        """

        Base class for converting Excel files

        Args:
            if_exists (str): Possible values: skip and replace, default: skip, optional
            output_dir (str): If present, file is saved in given directory, optional
            logger (object): logger object with send_log('msg','status'), optional

        """

        if if_exists not in ['skip', 'replace']:
            raise ValueError("Possible value of 'if_exists' are 'skip' and 'replace'")
        self.logger = logger
        self.if_exists = if_exists
        self.output_dir = output_dir
        if self.output_dir:
            if not os.path.exists(self.output_dir):
                os.makedirs(self.output_dir)

    def _get_output_filename(self, fname):
        if self.output_dir:
            basename = os.path.basename(fname)
            fname_out = os.path.join(self.output_dir, basename)
        else:
            fname_out = fname
        is_skip = (self.if_exists == 'skip' and os.path.isfile(fname_out))
        return fname_out, is_skip

    def convert_single(self, fname, sheet_name, **kwds):
        """

        Converts single file

        Args:
            fname: path to file
            sheet_name (str): optional sheet_name to override global `cfg_xls_sheets_sel`
            Same as `d6tstack.utils.read_excel_advanced()`

        Returns:
            list: output file names

        """

        if self.logger:
            msg = 'converting file: '+ntpath.basename(fname)+' | sheet: '
            if hasattr(self, 'cfg_xls_sheets_sel'):
                msg += str(self.cfg_xls_sheets_sel[fname])
            self.logger.send_log(msg,'ok')

        fname_out = fname + '-' + str(sheet_name) + '.csv'
        fname_out, is_skip = self._get_output_filename(fname_out)
        if not is_skip:
            df = read_excel_advanced(fname, sheet_name=sheet_name, **kwds)
            df.to_csv(fname_out, index=False)
        else:
            warnings.warn('File %s exists, skipping' %fname)

        return fname_out


class XLStoCSVMultiFile(XLStoBase, metaclass=d6tcollect.Collect):
    """
    
    Converts xls|xlsx files to csv files. Selects a SINGLE SHEET from each file. To extract MULTIPLE SHEETS from a file use XLStoCSVMultiSheet

    Args:
        fname_list (list): file paths, eg ['dir/a.csv','dir/b.csv']
        cfg_xls_sheets_sel_mode (string): mode to select tabs

            * ``name``: select by name, provide name for each file, can customize by file
            * ``name_global``: select by name, one name for all files
            * ``idx``: select by index, provide index for each file, can customize by file
            * ``idx_global``: select by index, one index for all files

        cfg_xls_sheets_sel (dict): values to select tabs `{'filename':'value'}`
        output_dir (str): If present, file is saved in given directory, optional
        if_exists (str): Possible values: skip and replace, default: skip, optional
        logger (object): logger object with send_log('msg','status'), optional

    """

    def __init__(self, fname_list, cfg_xls_sheets_sel_mode='idx_global', cfg_xls_sheets_sel=0,
                 output_dir=None, if_exists='skip', logger=None):
        super().__init__(if_exists, output_dir, logger)
        if not fname_list:
            raise ValueError("Filename list should not be empty")
        self.set_files(fname_list)
        self.set_select_mode(cfg_xls_sheets_sel_mode, cfg_xls_sheets_sel)

    def set_files(self, fname_list):
        """

        Update input files. You will also need to update sheet selection with ``.set_select_mode()``.

        Args:
            fname_list (list): see class description for details

        """
        self.fname_list = fname_list
        self.xlsSniffer = XLSSniffer(fname_list)

    def set_select_mode(self, cfg_xls_sheets_sel_mode, cfg_xls_sheets_sel):
        """
        
        Update sheet selection values

        Args:
            cfg_xls_sheets_sel_mode (string): see class description for details
            cfg_xls_sheets_sel (list): see class description for details

        """

        assert cfg_xls_sheets_sel_mode in ['name','idx','name_global','idx_global']
        sheets = self.xlsSniffer.dict_xls_sheets

        if cfg_xls_sheets_sel_mode=='name_global':
            cfg_xls_sheets_sel_mode = 'name'
            cfg_xls_sheets_sel = dict(zip(self.fname_list,[cfg_xls_sheets_sel]*len(self.fname_list)))
        elif cfg_xls_sheets_sel_mode=='idx_global':
            cfg_xls_sheets_sel_mode = 'idx'
            cfg_xls_sheets_sel = dict(zip(self.fname_list,[cfg_xls_sheets_sel]*len(self.fname_list)))

        if not set(cfg_xls_sheets_sel.keys())==set(sheets.keys()):
            raise ValueError('Need to select a sheet from every file')

        # check given selection actually present in files
        if cfg_xls_sheets_sel_mode=='name':
            if not np.all([cfg_xls_sheets_sel[fname] in sheets[fname]['sheets_names'] for fname in self.fname_list]):
                raise ValueError('Invalid sheet name selected in one of the files')
                # todo show which file is mismatched
        elif cfg_xls_sheets_sel_mode=='idx':
            if not np.all([cfg_xls_sheets_sel[fname] <= sheets[fname]['sheets_count'] for fname in self.fname_list]):
                raise ValueError('Invalid index selected in one of the files')
                # todo show which file is mismatched
        else:
            raise ValueError('Invalid xls_sheets_mode')

        self.cfg_xls_sheets_sel_mode = cfg_xls_sheets_sel_mode
        self.cfg_xls_sheets_sel = cfg_xls_sheets_sel

    def convert_all(self, **kwds):
        """
        
        Converts all files

        Args:
            Any parameters for `d6tstack.utils.read_excel_advanced()`

        Returns: 
            list: output file names
        """

        fnames_converted = []
        for fname in self.fname_list:
            fname_out = self.convert_single(fname, self.cfg_xls_sheets_sel[fname], **kwds)
            fnames_converted.append(fname_out)

        return fnames_converted


class XLStoCSVMultiSheet(XLStoBase, metaclass=d6tcollect.Collect):
    """
    
    Converts ALL SHEETS from a SINGLE xls|xlsx files to separate csv files

    Args:
        fname (string): file path
        sheet_names (list): list of int or str. If not given, will convert all sheets in the file
        output_dir (str): If present, file is saved in given directory, optional
        if_exists (str): Possible values: skip and replace, default: skip, optional
        logger (object): logger object with send_log('msg','status'), optional

    """

    def __init__(self, fname, sheet_names=None, output_dir=None, if_exists='skip', logger=None):
        super().__init__(if_exists, output_dir, logger)
        self.fname = fname


        if sheet_names:
            if not isinstance(sheet_names, (list,str)):
                raise ValueError('sheet_names needs to be a list')
            self.sheet_names = sheet_names
        else:
            self.xlsSniffer = XLSSniffer([fname, ])
            self.sheet_names = self.xlsSniffer.xls_sheets[self.fname]['sheets_names']

    def convert_single(self, sheet_name, **kwds):
        """

        Converts all files

        Args:
            sheet_name (str): Excel sheet
            Any parameters for `d6tstack.utils.read_excel_advanced()`

        Returns:
            str: output file name
        """
        return super().convert_single(self.fname, sheet_name, **kwds)

    def convert_all(self, **kwds):
        """

        Converts all files

        Args:
            Any parameters for `d6tstack.utils.read_excel_advanced()`

        Returns:
            list: output file names
        """

        fnames_converted = []
        for iSheet in self.sheet_names:
            fname_out = self.convert_single(iSheet, **kwds)
            fnames_converted.append(fname_out)

        return fnames_converted


================================================
FILE: d6tstack/helpers.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""

Module with several helper functions

"""

import os
import collections
import re

def file_extensions_get(fname_list):
    """Returns file extensions in list

    Args:
        fname_list (list): file names, eg ['a.csv','b.csv']

    Returns:
        list: file extensions for each file name in input list, eg ['.csv','.csv']
    """
    return [os.path.splitext(fname)[-1] for fname in fname_list]


def file_extensions_all_equal(ext_list):
    """Checks that all file extensions are equal. 

    Args:
        ext_list (list): file extensions, eg ['.csv','.csv']

    Returns:
        bool: all extensions are equal to first extension in list?
    """
    return len(set(ext_list))==1


def file_extensions_contains_xls(ext_list):
    # Assumes all file extensions are equal! Only checks first file
    return ext_list[0] == '.xls'

def file_extensions_contains_xlsx(ext_list):
    # Assumes all file extensions are equal! Only checks first file
    return ext_list[0] == '.xlsx'

def file_extensions_contains_csv(ext_list):
    # Assumes all file extensions are equal! Only checks first file
    return (ext_list[0] == '.csv' or ext_list[0] == '.txt')

def file_extensions_valid(ext_list):
    """Checks if file list contains only valid files

    Notes:
        Assumes all file extensions are equal! Only checks first file

    Args:
        ext_list (list): file extensions, eg ['.csv','.csv']

    Returns:
        bool: first element in list is one of ['.csv','.txt','.xls','.xlsx']?
    """
    ext_list_valid = ['.csv','.txt','.xls','.xlsx']
    return ext_list[0] in ext_list_valid


def columns_all_equal(col_list):
    """Checks that all lists in col_list are equal. 

    Args:
        col_list (list): columns, eg [['a','b'],['a','b','c']]

    Returns:
        bool: all lists in list are equal?
    """
    return all([l==col_list[0] for l in col_list])


def list_common(_list, sort=True):
    l = list(set.intersection(*[set(l) for l in _list]))
    if sort:
        return sorted(l)
    else:
        return l


def list_unique(_list, sort=True):
    l = list(set.union(*[set(l) for l in _list]))
    if sort:
        return sorted(l)
    else:
        return l


def list_tofront(_list,val):
    return _list.insert(0, _list.pop(_list.index(val)))


def cols_filename_tofront(_list):
    return list_tofront(_list,'filename')


def df_filename_tofront(dfg):
    cfg_col = dfg.columns.tolist()
    return dfg[cols_filename_tofront(cfg_col)]


def check_valid_xls(fname_list):
    ext_list = file_extensions_get(fname_list)

    if not file_extensions_all_equal(ext_list):
        raise IOError('All file types and extensions have to be equal')

    if not(file_extensions_contains_xls(ext_list) or file_extensions_contains_xlsx(ext_list)):
        raise IOError('Only .xls, .xlsx files can be processed')

    return True


def compare_pandas_versions(version1, version2):
    def cmp(a, b):
        return (a > b) - (a < b)

    def normalize(v):
        return [int(x) for x in re.sub(r'(\.0+)*$','', v).split(".")]

    return cmp(normalize(version1), normalize(version2))


================================================
FILE: d6tstack/pyftp_final.py
================================================
from boto.s3.connection import S3Connection
from boto.s3.key import Key
import os
import ftputil


def get_ftp_files():
    fileSetftp = set()
    with ftputil.FTPHost(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd) as ftp_host:
        ftp_host.use_list_a_option = False
        for dir_, _, files in ftp_host.walk(cfg_dir_ftp):
            for fileName in files:
                relDir = os.path.relpath(dir_, cfg_dir_ftp)
                relFile = os.path.join(relDir, fileName)
                fileSetftp.add(relFile)
    return fileSetftp


def upload_ftp_files_s3(ftp_files, s3_files, bucket):
    files_ftp_sync = set(ftp_files).difference(s3_files)
    with ftputil.FTPHost(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd) as ftp_host:
        for ftp_file in files_ftp_sync:
            full_name = cfg_dir_ftp + ftp_file
            basename = os.path.basename(full_name)
            temp_path = '/tmp/'+basename
            ftp_host.download(full_name, temp_path)
            with open(temp_path, 'rb') as f:
                key = Key(bucket, ftp_file)
                key.set_contents_from_file(f)


def list_s3_files(bucket):
    s3_files = set()
    for key in bucket.list():
        s3_files.add(key.name.encode('utf-8'))
    return s3_files


def upload_to_s3(bucket):
    fname = '/home/anuj/Pictures/test/hp.jpg'
    basename = os.path.basename(fname)
    key = Key(bucket, basename)
    with open(fname, 'rb') as f:
        key.set_contents_from_file(f)


if __name__ == "__main__":
    print("S3 File sync")
    s3_id = ''
    s3_key = ''
    bucket_name = 'test-anuj-ftp-sync'

    cfg_ftp_host = 'ftp.fic.com.tw'
    cfg_ftp_usr = 'anonymous'
    cfg_ftp_pwd = 'random'
    cfg_dir_ftp = '/photo/ia/'

    s3_conn = S3Connection(s3_id, s3_key, host='s3.ap-south-1.amazonaws.com')
    bucket = s3_conn.get_bucket(bucket_name)
    s3_files = list_s3_files(bucket)
    upload_to_s3(bucket)

    ftp_files = get_ftp_files()
    print(ftp_files)

    upload_ftp_files_s3(ftp_files, s3_files, bucket)


================================================
FILE: d6tstack/sniffer.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""

Finds CSV settings and Excel sheets in multiple files. Often needed as input for stacking

"""
import collections
import csv

import d6tcollect
# d6tcollect.init(__name__)

#******************************************************************
# csv
#******************************************************************

def csv_count_rows(fname):
    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b: break
            yield b

    with open(fname) as f:
        nrows = sum(bl.count("\n") for bl in blocks(f))

    return nrows

class CSVSniffer(object, metaclass=d6tcollect.Collect):
    """
    
    Automatically detects settings needed to read csv files. SINGLE file only, for MULTI file use CSVSnifferList

    Args:
        fname (string): file path
        nlines (int): number of lines to sample from each file
        delims (string): possible delimiters, default ",;\t|"

    """

    def __init__(self, fname, nlines = 10, delims=',;\t|'):
        self.cfg_fname = fname
        self.nrows = csv_count_rows(fname) # todo: check for file size, if large don't run this
        self.cfg_nlines = min(nlines,self.nrows) # read_lines() doesn't check EOF # todo: check 1% of file up to a max
        self.cfg_delims_pool = delims
        self.delim = None # delim used for the file
        self.csv_lines = None # top n lines read from file
        self.csv_lines_delim = None # detected delim for each line in file
        self.csv_rows = None # top n lines split usingn delim

    def read_nlines(self):
        # read top lines
        fhandle = open(self.cfg_fname)
        self.csv_lines = [fhandle.readline().rstrip() for _ in range(self.cfg_nlines)]
        fhandle.close()

    def scan_delim(self):
        if not self.csv_lines:
            self.read_nlines()

        # get delimiter for each line in file
        delims = []
        for line in self.csv_lines:
            try:
                csv_sniff = csv.Sniffer().sniff(line, self.cfg_delims_pool)
                delims.append(csv_sniff.delimiter)
            except:
                delims.append(None) # todo: able to catch exception more specifically?

        self.csv_lines_delim = delims

    def get_delim(self):
        if not self.csv_lines_delim:
            self.scan_delim()

        # all delimiters the same?
        if len(set(self.csv_lines_delim))>1:
            self.delim_is_consistent = False
            csv_delim_count = collections.Counter(self.csv_lines_delim)
            csv_delim = csv_delim_count.most_common(1)[0][0] # use the most common used delimeter
            # todo: rerun on cfg_csv_scan_topline**2 files in case there is a large # of header rows
        else:
            self.delim_is_consistent = True
            csv_delim = self.csv_lines_delim[0]

        if csv_delim==None:
            raise IOError('Could not determine a valid delimiter, pleaes check your files are .csv or .txt using one delimiter of %s' %(self.cfg_delims_pool))
        else:
            self.delim = csv_delim

        self.csv_rows = [s.split(self.delim) for s in self.csv_lines][self.count_skiprows():]
        if self.check_column_length_consistent():
            self.certainty = 'high'
        else:
            self.certainty = 'probable'

        return self.delim

    def check_column_length_consistent(self):
        # check if all rows have the same length. NB: this is just on the sample!
        if not self.csv_rows:
            self.get_delim()

        return len(set([len(row) for row in self.csv_rows]))==1

    def count_skiprows(self):
        # finds the number of rows to skip by finding the last line which doesn't use the selected delimiter
        if not self.delim:
            self.get_delim()

        if self.delim_is_consistent: # all delims the same so nothing to skip
            return 0

        l = [d != self.delim for d in self.csv_lines_delim]
        l = list(reversed(l))
        return len(l) - l.index(True)

    def has_header_inverse(self):
        # checks if head present if all columns in first row contain a letter
        if not self.csv_rows:
            self.get_delim()

        def is_number(s):
            try:
                float(s)
                return True
            except ValueError:
                return False

        self.is_all_rows_number_col = all([any([is_number(s) for s in row]) for row in self.csv_rows])

        '''
        self.row_distance = [distance.jaccard(self.csv_rows[0], self.csv_rows[i]) for i in range(1,len(self.csv_rows))]

        iqr_low, iqr_high = np.percentile(self.row_distance[1:], [5, 95])
        is_first_row_different = not(iqr_low <= self.row_distance[0] <= iqr_high)
        '''

    def has_header(self):
        # more likely than not to contain headers so have to prove no header present
        self.has_header_inverse()
        return not self.is_all_rows_number_col

class CSVSnifferList(object, metaclass=d6tcollect.Collect):
    """
    
    Automatically detects settings needed to read csv files. MULTI file use

    Args:
        fname_list (list): file names, eg ['a.csv','b.csv']
        nlines (int): number of lines to sample from each file
        delims (string): possible delimiters, default ',;\t|'

    """


    def __init__(self, fname_list, nlines = 10, delims=',;\t|'):
        self.cfg_fname_list = fname_list
        self.sniffers = [CSVSniffer(fname, nlines, delims) for fname in fname_list]

    def get_all(self, fun_name, msg_error):
        val = []
        for sniffer in self.sniffers:
            func = getattr(sniffer, fun_name)
            val.append(func())

        if len(set(val))>1:
            raise NotImplementedError(msg_error+' Make sure all files have the same format')
            # todo: want to raise an exception here...? or just use whatever got detected for each file?
        else:
            return val[0]

    def get_delim(self):
        return self.get_all('get_delim','Inconsistent delimiters detected!')

    def count_skiprows(self):
        return self.get_all('count_skiprows','Inconsistent skiprows detected!')

    def has_header(self):
        return self.get_all('has_header','Inconsistent header setting detected!')

        # todo: propagate status of individual sniffers. instead of raising exception pass back status to get user input


def sniff_settings_csv(fname_list):
    sniff = CSVSnifferList(fname_list)
    csv_sniff = {}
    csv_sniff['delim'] = sniff.get_delim()
    csv_sniff['skiprows'] = sniff.count_skiprows()
    csv_sniff['has_header'] = sniff.has_header()
    csv_sniff['header'] = 0 if sniff.has_header() else None
    return csv_sniff




================================================
FILE: d6tstack/sync.py
================================================
import boto3
import botocore
import os
import ftputil
import numpy as np


class FTPSync:
    """

        FTP Sync class. It allows users to sync their files to s3 or local.

        Args:
            cfg_ftp_host (string): FTP host name
            cfg_ftp_usr (string): FTP login username
            cfg_ftp_pwd (string): FTP login password
            cfg_ftp_dir (string): FTP starting directory to be used for sync.
            cfg_s3_key (string): AWS S3 key for connection
            cfg_s3_secret (string): AWS S3 secret for connection
            bucket_name (string): Bucket name in s3 for syncing the files
            local_dir (string): local dir path to be used for sync. dir will be created if not exist.
            logger (object): logger object with send_log()

        """
    def __init__(self, cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd, cfg_ftp_dir,
                 cfg_s3_key=None, cfg_s3_secret=None, bucket_name=None,
                 local_dir='./data/', logger=None):
        self.cfg_ftp_host = cfg_ftp_host
        self.cfg_ftp_usr = cfg_ftp_usr
        self.cfg_ftp_pwd = cfg_ftp_pwd
        self.cfg_ftp_dir = cfg_ftp_dir
        self.ftp_host = ftputil.FTPHost(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd)
        self.ftp_host.use_list_a_option = False
        self.s3_client = None
        self.bucket_name = None
        if cfg_s3_key and cfg_s3_secret and bucket_name:
            self.s3_client = boto3.client(
                's3',
                aws_access_key_id=cfg_s3_key,
                aws_secret_access_key=cfg_s3_secret
            )
            exists = True
            try:
                self.s3_client.head_bucket(Bucket=bucket_name)
            except botocore.exceptions.ClientError as e:
                # If a client error is thrown, then check that it was a 404 error.
                # If it was a 404 error, then the bucket does not exist.
                error_code = int(e.response['Error']['Code'])
                if error_code == 404:
                    exists = False
            if not exists:
                if logger:
                    logger.send_log('Bucket does not exist. Creating bucket', 'ok')
                self.s3_client.create_bucket(Bucket=bucket_name)
            self.bucket_name = bucket_name
        self.local_dir = local_dir
        if not os.path.exists(local_dir):
            os.makedirs(local_dir)
        self.logger = logger

    def get_all_files(self, subdirs=True, ftp=False):
        """

        Get all file list from local or ftp

        Args:
            subdirs (bool): return all the files in directory recursively? If `false` it will not go to sub directories
            ftp (bool): local files if `true` otherwise local files

        Returns:
            Alphabetically Sorted file list

        """
        fileSet = set()
        host = os
        from_dir = self.local_dir
        if ftp:
            host = self.ftp_host
            from_dir = self.cfg_ftp_dir
        if subdirs:
            for dir_, _, files in host.walk(from_dir):
                for fileName in files:
                    relDir = os.path.relpath(dir_, from_dir)
                    relFile = os.path.join(relDir, fileName)
                    fileSet.add(relFile)
        else:
            for fileName in host.listdir(from_dir):
                relFile = os.path.join(from_dir, fileName)
                if host.path.isfile(relFile):
                    fileSet.add(relFile)
        return np.sort(list(fileSet))

    def get_s3_files(self):
        """

            Get all file list from s3 in the given bucket

           Returns:
                File list from s3 in bucket

        """
        if not self.s3_client or not self.bucket_name:
            raise ValueError("S3 credentials are mandatory to use this functionality")
        s3_files = set()
        all_files = self.s3_client.list_objects(Bucket=self.bucket_name)
        for content in all_files.get('Contents', []):
            s3_files.add(content.get('Key'))
        return s3_files

    def upload_to_s3(self, fname, local_path):
        """

            Upload a single file from local to s3

            Args:
                fname (string): Filename in s3
                local_path (string): Local path of file to be uploaded

        """

        with open(local_path, 'rb') as f:
            self.s3_client.upload_fileobj(f, self.bucket_name, fname)

    def get_files_for_sync(self, subdirs=True, to_s3=False):
        """

            Get File list for sync along with total file size

            Args:
                subdirs (bool): return all the files in directory recursively? If `false` it will not go to sub directories, Optional
                to_s3 (bool): get files to be sync from ftp to local. If `true` all files will be synced from ftp to s3

        """
        ftp_files = self.get_all_files(subdirs=subdirs, ftp=True)
        if to_s3:
            server_files = self.get_s3_files()
        else:
            server_files = self.get_all_files(subdirs=subdirs)
        files_ftp_sync = set(ftp_files).difference(set(server_files))
        total_file_size = sum([self.ftp_host.path.getsize(os.path.join(self.cfg_ftp_dir, f))
                               for f in files_ftp_sync])
        return files_ftp_sync, total_file_size

    def upload_ftp_files(self, subdirs=True, to_s3=False):
        """

            Get File list for sync along with total file size

            Args:
                subdirs (bool): Upload files from ftp recursively? If `false` it will not go to sub directories, Optional
                to_s3 (bool): upload files from ftp to local. If `true` files will be uploaded from ftp to s3

        """

        files_ftp_sync, total_file_size = self.get_files_for_sync(subdirs=subdirs, to_s3=to_s3)
        for ftp_file in files_ftp_sync:
            full_name = os.path.join(self.cfg_ftp_dir, ftp_file)
            local_path = os.path.join(self.local_dir, ftp_file)
            file_dir_local = os.path.dirname(local_path)
            if not os.path.exists(file_dir_local):
                os.makedirs(file_dir_local)
            self.ftp_host.download(full_name, local_path)
            if to_s3:
                self.upload_to_s3(ftp_file, local_path)


================================================
FILE: d6tstack/utils.py
================================================
import pandas as pd
import warnings

import d6tcollect
d6tcollect.init(__name__)

class PrintLogger(object):
    def send_log(self, msg, status):
        print(msg,status)

    def send(self, data):
        print(data)

import os

@d6tcollect.collect
def pd_readsql_query_from_sqlengine(uri, sql, schema_name=None, connect_args=None):
    """
    Load SQL statement into pandas dataframe using `sql_engine.execute` making execution faster.

    Args:
        uri (str): postgres psycopg2 sqlalchemy database uri
        sql (str): sql query
        schema_name (str): name of schema
        connect_args (dict): dictionary of connection arguments to pass to `sqlalchemy.create_engine`

    Returns:
        df: pandas dataframe

    """

    import sqlalchemy
    if connect_args is not None:
        sql_engine = sqlalchemy.create_engine(uri, connect_args=connect_args)
    elif schema_name is not None:
        if 'psycopg2' in uri:
            sql_engine = sqlalchemy.create_engine(uri, connect_args={'options': '-csearch_path={}'.format(schema_name)})
        else:
            raise NotImplementedError('only `psycopg2` supported with schema_name, pass connect_args for your db engine')
    else:
        sql_engine = sqlalchemy.create_engine(uri)

    sql = sql_engine.execute(sql)
    df = pd.DataFrame(sql.fetchall())

    return df


@d6tcollect.collect
def pd_readsql_table_from_sqlengine(uri, table_name, schema_name=None, connect_args=None):
    """
    Load SQL table into pandas dataframe using `sql_engine.execute` making execution faster. Convenience function that returns full table.

    Args:
        uri (str): postgres psycopg2 sqlalchemy database uri
        table_name (str): table
        schema_name (str): name of schema
        connect_args (dict): dictionary of connection arguments to pass to `sqlalchemy.create_engine`

    Returns:
        df: pandas dataframe

    """

    return pd_readsql_query_from_sqlengine(uri, "SELECT * FROM {};".fromat(table_name), schema_name=schema_name, connect_args=connect_args)


@d6tcollect.collect
def pd_to_psql(df, uri, table_name, schema_name=None, if_exists='fail', sep=','):
    """
    Load pandas dataframe into a sql table using native postgres COPY FROM.

    Args:
        df (dataframe): pandas dataframe
        uri (str): postgres psycopg2 sqlalchemy database uri
        table_name (str): table to store data in
        schema_name (str): name of schema in db to write to
        if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details
        sep (str): separator for temp file, eg ',' or '\t'

    Returns:
        bool: True if loader finished

    """

    if not 'psycopg2' in uri:
        raise ValueError('need to use psycopg2 uri eg postgresql+psycopg2://psqlusr:psqlpwdpsqlpwd@localhost/psqltest. install with `pip install psycopg2-binary`')
    table_name = table_name.lower()
    if schema_name:
        schema_name = schema_name.lower()
   
    import sqlalchemy
    import io

    if schema_name is not None:
        sql_engine = sqlalchemy.create_engine(uri, connect_args={'options': '-csearch_path={}'.format(schema_name)})
    else:
        sql_engine = sqlalchemy.create_engine(uri)
    sql_cnxn = sql_engine.raw_connection()
    cursor = sql_cnxn.cursor()

    df[:0].to_sql(table_name, sql_engine, schema=schema_name, if_exists=if_exists, index=False)

    fbuf = io.StringIO()
    df.to_csv(fbuf, index=False, header=False, sep=sep)
    fbuf.seek(0)
    cursor.copy_from(fbuf, table_name, sep=sep, null='')
    sql_cnxn.commit()
    cursor.close()

    return True


@d6tcollect.collect
def pd_to_mysql(df, uri, table_name, if_exists='fail', tmpfile='mysql.csv', sep=',', newline='\n'):
    """
    Load dataframe into a sql table using native postgres LOAD DATA LOCAL INFILE.

    Args:
        df (dataframe): pandas dataframe
        uri (str): mysql mysqlconnector sqlalchemy database uri
        table_name (str): table to store data in
        if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details
        tmpfile (str): filename for temporary file to load from
        sep (str): separator for temp file, eg ',' or '\t'

    Returns:
        bool: True if loader finished

    """
    if not 'mysql+mysqlconnector' in uri:
        raise ValueError('need to use mysql+mysqlconnector uri eg mysql+mysqlconnector://testusr:testpwd@localhost/testdb. install with `pip install mysql-connector`')
    table_name = table_name.lower()

    import sqlalchemy

    sql_engine = sqlalchemy.create_engine(uri)

    df[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)

    logger = PrintLogger()
    logger.send_log('creating ' + tmpfile, 'ok')
    with open(tmpfile, mode='w', newline=newline) as fhandle:
        df.to_csv(fhandle, na_rep='\\N', index=False, sep=sep)
    logger.send_log('loading ' + tmpfile, 'ok')
    sql_load = "LOAD DATA LOCAL INFILE '{}' INTO TABLE {} FIELDS TERMINATED BY '{}' LINES TERMINATED BY '{}' IGNORE 1 LINES;".format(tmpfile, table_name, sep, newline)
    sql_engine.execute(sql_load)

    os.remove(tmpfile)

    return True

@d6tcollect.collect
def pd_to_mssql(df, uri, table_name, schema_name=None, if_exists='fail', tmpfile='mysql.csv'):
    """
    Load dataframe into a sql table using native postgres LOAD DATA LOCAL INFILE.

    Args:
        df (dataframe): pandas dataframe
        uri (str): mysql mysqlconnector sqlalchemy database uri
        table_name (str): table to store data in
        schema_name (str): name of schema in db to write to
        if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details
        tmpfile (str): filename for temporary file to load from

    Returns:
        bool: True if loader finished

    """
    if not 'mssql+pymssql' in uri:
        raise ValueError('need to use mssql+pymssql uri (conda install -c prometeia pymssql)')
    table_name = table_name.lower()
    if schema_name:
        schema_name = schema_name.lower()

    warnings.warn('`.pd_to_mssql()` is experimental, if any problems please raise an issue on https://github.com/d6t/d6tstack/issues or make a pull request')
    import sqlalchemy

    sql_engine = sqlalchemy.create_engine(uri)

    df[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)

    logger = PrintLogger()
    logger.send_log('creating ' + tmpfile, 'ok')
    df.to_csv(tmpfile, na_rep='\\N', index=False)
    logger.send_log('loading ' + tmpfile, 'ok')
    if schema_name is not None:
        table_name = '{}.{}'.format(schema_name,table_name)
    sql_load = "BULK INSERT {} FROM '{}';".format(table_name, tmpfile)
    sql_engine.execute(sql_load)

    os.remove(tmpfile)

    return True



================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = python -msphinx
SPHINXPROJ    = d6tstack
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

================================================
FILE: docs/make.bat
================================================
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=python -msphinx
)
set SOURCEDIR=source
set BUILDDIR=build
set SPHINXPROJ=d6t-lib

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The Sphinx module was not found. Make sure you have Sphinx installed,
	echo.then set the SPHINXBUILD environment variable to point to the full
	echo.path of the 'sphinx-build' executable. Alternatively you may add the
	echo.Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.http://sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%

:end
popd


================================================
FILE: docs/make_zip_sample_csv.py
================================================
import zipfile
import glob
import os

if not os.path.exists('test-data/output/__init__.py'):
	fhandle = open('test-data/output/__init__.py', 'w')
	fhandle.close()


ziphandle = zipfile.ZipFile('test-data.zip', 'w')
cfg_path_base = 'test-data/input/test-data-input'
for fname in  glob.glob(cfg_path_base+'*.csv')+glob.glob(cfg_path_base+'*.xls')+glob.glob(cfg_path_base+'*.xlsx'):
	ziphandle.write(fname)

ziphandle.write('test-data/output/__init__.py')
ziphandle.close()


================================================
FILE: docs/make_zip_sample_xls.py
================================================
import zipfile
import glob
import os



import pandas as pd
import numpy as np
# generate fake data
cfg_tickers = ['AAP','M','SPLS']
cfg_ntickers = len(cfg_tickers)
cfg_ndates = 10
cfg_dates = pd.bdate_range('2018-01-01',periods=cfg_ndates).tolist()+pd.bdate_range('2018-02-01',periods=cfg_ndates).tolist()
cfg_nobs = cfg_ndates*2
dft = pd.DataFrame({'date':np.tile(cfg_dates,cfg_ntickers), 'ticker':np.repeat(cfg_tickers,cfg_nobs)})


#****************************************
# xls
#****************************************
def write_file_xls(dfg, fname, sheets, startrow=0,startcol=0):
    writer = pd.ExcelWriter(fname)
    for isheet in sheets:
        dft['data'] = np.random.normal(size=dfg.shape[0])
        dfg['xls_sheet'] = isheet
        dfg.to_excel(writer, isheet, index=False,startrow=startrow,startcol=startcol)
    writer.save()

# excel - bad case => d6tstack. Fake data
cfg_path_base = 'test-data/excel_adv_data/sample-xls-'
df = dft
np.random.seed(0)
write_file_xls(df, cfg_path_base+'case-simple.xlsx',['Sheet1'])
write_file_xls(df, cfg_path_base+'case-multisheet.xlsx',['Sheet1','Sheet2'])
write_file_xls(df, cfg_path_base+'case-multifile1.xlsx',['Sheet1'])
write_file_xls(df, cfg_path_base+'case-multifile2.xlsx',['Sheet1'])
write_file_xls(df, cfg_path_base+'case-badlayout1.xlsx',['Sheet1','Sheet2'],startrow=1,startcol=1)

ziphandle = zipfile.ZipFile('test-data-xls.zip', 'w')
for fname in  glob.glob(cfg_path_base+'*.xlsx'):
    ziphandle.write(fname)

ziphandle.write('test-data/output/__init__.py')
ziphandle.close()



================================================
FILE: docs/shell-napoleon-html.sh
================================================
make html


================================================
FILE: docs/shell-napoleon-recreate.sh
================================================
#rm ./source/*
#cp ./source-bak/* ./source/
sphinx-apidoc -f -o ./source ..
make clean
make html


================================================
FILE: docs/source/conf.py
================================================
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# d6t-lib documentation build configuration file, created by
# sphinx-quickstart on Tue Nov 28 11:32:56 2017.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys

sys.path.insert(0, os.path.abspath('.'))
sys.path.insert(0, os.path.dirname(os.path.abspath('.')))  # todo: why is this not working?
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath('.'))))
sys.path.insert(0, os.path.join(os.path.dirname((os.path.abspath('.'))), "d6tstack"))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc',
              'sphinx.ext.todo',
              'sphinx.ext.viewcode',
              'sphinx.ext.githubpages',
              'sphinx.ext.napoleon']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
#
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = 'd6tstack'
copyright = '2017, databolt'
author = 'databolt'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '0.1'
# The full version, including alpha/beta/rc tags.
release = '0.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = []

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = True

# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'  # 'alabaster'

# Theme options are theme-specific and customize the look and feel of a theme
# further.  For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# This is required for the alabaster theme
# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
# html_sidebars = {
#     '**': [
#         'about.html',
#         'navigation.html',
#         'relations.html',  # needs 'show_related': True theme option to display
#         'searchbox.html',
#         'donate.html',
#     ]
# }


# -- Options for HTMLHelp output ------------------------------------------

# Output file base name for HTML help builder.
htmlhelp_basename = 'd6tstack-doc'

# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #
    # 'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #
    # 'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #
    # 'preamble': '',

    # Latex figure (float) alignment
    #
    # 'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'd6tstack.tex', 'd6tstack Documentation',
     'nn', 'manual'),
]

# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'd6tstack', 'd6tstack Documentation',
     [author], 1)
]

# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'd6tstack', 'd6tstack Documentation',
     author, 'd6tstack', 'Databolt python library - Accelerate data engineering',
     'Miscellaneous'),
]


================================================
FILE: docs/source/d6tstack.rst
================================================
d6tstack package
================

Submodules
----------

d6tstack.combine\_csv module
----------------------------

.. automodule:: d6tstack.combine_csv
    :members:
    :undoc-members:
    :show-inheritance:

d6tstack.convert\_xls module
----------------------------

.. automodule:: d6tstack.convert_xls
    :members:
    :undoc-members:
    :show-inheritance:

d6tstack.helpers module
-----------------------

.. automodule:: d6tstack.helpers
    :members:
    :undoc-members:
    :show-inheritance:

d6tstack.helpers\_ui module
---------------------------

.. automodule:: d6tstack.helpers_ui
    :members:
    :undoc-members:
    :show-inheritance:

d6tstack.sniffer module
-----------------------

.. automodule:: d6tstack.sniffer
    :members:
    :undoc-members:
    :show-inheritance:

d6tstack.sync module
--------------------

.. automodule:: d6tstack.sync
    :members:
    :undoc-members:
    :show-inheritance:

d6tstack.utils module
---------------------

.. automodule:: d6tstack.utils
    :members:
    :undoc-members:
    :show-inheritance:


Module contents
---------------

.. automodule:: d6tstack
    :members:
    :undoc-members:
    :show-inheritance:


================================================
FILE: docs/source/index.rst
================================================
.. d6t-celery-combine documentation master file, created by
   sphinx-quickstart on Tue Nov 28 11:32:56 2017.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to d6tstack documentation!
==============================================

Documentation for using the databolt python File Stack Combine library.

Library Docs
==================

* :ref:`modindex`

Search
==================

* :ref:`search`


================================================
FILE: docs/source/modules.rst
================================================
d6tstack
========

.. toctree::
   :maxdepth: 4

   d6tstack
   setup


================================================
FILE: docs/source/setup.rst
================================================
setup module
============

.. automodule:: setup
    :members:
    :undoc-members:
    :show-inheritance:


================================================
FILE: docs/source/tests.rst
================================================
tests package
=============

Submodules
----------

tests.test\_combine module
--------------------------

.. automodule:: tests.test_combine
    :members:
    :undoc-members:
    :show-inheritance:

tests.test\_sync module
-----------------------

.. automodule:: tests.test_sync
    :members:
    :undoc-members:
    :show-inheritance:

tests.test\_xls module
----------------------

.. automodule:: tests.test_xls
    :members:
    :undoc-members:
    :show-inheritance:

tests.tmp module
----------------

.. automodule:: tests.tmp
    :members:
    :undoc-members:
    :show-inheritance:


Module contents
---------------

.. automodule:: tests
    :members:
    :undoc-members:
    :show-inheritance:


================================================
FILE: examples-csv.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Engineering in Python with databolt  - Quickly Load Any Type of CSV (d6tlib/d6tstack)\n",
    "\n",
    "Vendors often send large datasets in multiple files. Often there are missing and misaligned columns between files that have to be manually cleaned. With DataBolt File Stack you can easily stack them together into one consistent dataset.\n",
    "\n",
    "Features include:\n",
    "* Quickly check column consistency across multiple files\n",
    "* Fix added/missing columns\n",
    "* Fix renamed columns\n",
    "* Out of core functionality to process large files\n",
    "* Export to pandas, CSV, SQL, parquet\n",
    "    * Fast export to postgres and mysql with out of core support\n",
    "    \n",
    "In this workbook we will demonstrate the usage of the d6tstack library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import importlib\n",
    "import pandas as pd\n",
    "import glob\n",
    "\n",
    "import d6tstack.combine_csv as d6tc\n",
    "import d6tstack"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get sample data\n",
    "\n",
    "We've created some dummy sample data which you can download. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "cfg_fname_sample = 'test-data.zip'\n",
    "urllib.request.urlretrieve(\"https://github.com/d6t/d6tstack/raw/master/\"+cfg_fname_sample, cfg_fname_sample)\n",
    "import zipfile\n",
    "zip_ref = zipfile.ZipFile(cfg_fname_sample, 'r')\n",
    "zip_ref.extractall('.')\n",
    "zip_ref.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use Case: Checking Column Consistency\n",
    "\n",
    "Let's say you receive a bunch of csv files you want to ingest them, say for example into pandas, dask, pyspark, database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['test-data/input/test-data-input-csv-clean-mar.csv', 'test-data/input/test-data-input-csv-clean-feb.csv', 'test-data/input/test-data-input-csv-clean-jan.csv']\n"
     ]
    }
   ],
   "source": [
    "cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-clean-*.csv'))\n",
    "print(cfg_fnames)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check column consistency across all files\n",
    "\n",
    "Even if you think the files have a consistent column layout, it worthwhile using `d6tstack` to assert that that is actually the case. It's very quick to do even with very many large files!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n"
     ]
    }
   ],
   "source": [
    "# get previews\n",
    "c = d6tc.CombinerCSV(cfg_fnames) # all_strings=True makes reading faster\n",
    "col_sniff = c.sniff_columns()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "all columns equal? True\n",
      "\n",
      "which columns are present in which files?\n",
      "\n",
      "                                                   date  sales  cost  profit\n",
      "file_path                                                                   \n",
      "test-data/input/test-data-input-csv-clean-feb.csv  True   True  True    True\n",
      "test-data/input/test-data-input-csv-clean-jan.csv  True   True  True    True\n",
      "test-data/input/test-data-input-csv-clean-mar.csv  True   True  True    True\n",
      "\n",
      "in what order do columns appear in the files?\n",
      "\n",
      "   date  sales  cost  profit\n",
      "0     0      1     2       3\n",
      "1     0      1     2       3\n",
      "2     0      1     2       3\n"
     ]
    }
   ],
   "source": [
    "print('all columns equal?', c.is_all_equal())\n",
    "print('')\n",
    "print('which columns are present in which files?')\n",
    "print('')\n",
    "print(c.is_column_present())\n",
    "print('')\n",
    "print('in what order do columns appear in the files?')\n",
    "print('')\n",
    "print(col_sniff['df_columns_order'].reset_index(drop=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preview Combined Data\n",
    "\n",
    "You can see a preview of what the combined data from all files will look like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-01-01</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-jan.csv</td>\n",
       "      <td>test-data-input-csv-clean-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-01-02</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-jan.csv</td>\n",
       "      <td>test-data-input-csv-clean-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-01-03</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-jan.csv</td>\n",
       "      <td>test-data-input-csv-clean-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-03-01</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-mar.csv</td>\n",
       "      <td>test-data-input-csv-clean-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-03-02</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-mar.csv</td>\n",
       "      <td>test-data-input-csv-clean-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-03-03</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-mar.csv</td>\n",
       "      <td>test-data-input-csv-clean-mar.csv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit                                           filepath                           filename\n",
       "0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\n",
       "1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\n",
       "2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\n",
       "3  2011-01-01    100   -80      20  test-data/input/test-data-input-csv-clean-jan.csv  test-data-input-csv-clean-jan.csv\n",
       "4  2011-01-02    100   -80      20  test-data/input/test-data-input-csv-clean-jan.csv  test-data-input-csv-clean-jan.csv\n",
       "5  2011-01-03    100   -80      20  test-data/input/test-data-input-csv-clean-jan.csv  test-data-input-csv-clean-jan.csv\n",
       "6  2011-03-01    300  -100     200  test-data/input/test-data-input-csv-clean-mar.csv  test-data-input-csv-clean-mar.csv\n",
       "7  2011-03-02    300  -100     200  test-data/input/test-data-input-csv-clean-mar.csv  test-data-input-csv-clean-mar.csv\n",
       "8  2011-03-03    300  -100     200  test-data/input/test-data-input-csv-clean-mar.csv  test-data-input-csv-clean-mar.csv"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.combine_preview()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read All Files to Pandas\n",
    "\n",
    "You can quickly load the combined data into a pandas dataframe with a single command. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-02-04</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-02-05</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\n",
       "      <td>test-data-input-csv-clean-feb.csv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit                                           filepath                           filename\n",
       "0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\n",
       "1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\n",
       "2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\n",
       "3  2011-02-04    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\n",
       "4  2011-02-05    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.to_pandas().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use Case: Identifying and fixing inconsistent columns\n",
    "\n",
    "The first case was clean: all files had the same columns. It happens very frequently that the data schema changes over time with columns being added or deleted over time. Let's look at a case where an extra columns got added."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['test-data/input/test-data-input-csv-colmismatch-mar.csv', 'test-data/input/test-data-input-csv-colmismatch-feb.csv', 'test-data/input/test-data-input-csv-colmismatch-jan.csv']\n"
     ]
    }
   ],
   "source": [
    "cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))\n",
    "print(cfg_fnames)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n"
     ]
    }
   ],
   "source": [
    "# get previews\n",
    "c = d6tc.CombinerCSV(cfg_fnames) # all_strings=True makes reading faster\n",
    "col_sniff = c.sniff_columns()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "all columns equal? False\n",
      "\n",
      "which columns are unique? ['profit2']\n",
      "\n",
      "which files have unique columns?\n",
      "\n",
      "                                                    profit2\n",
      "file_path                                                  \n",
      "test-data/input/test-data-input-csv-colmismatch...    False\n",
      "test-data/input/test-data-input-csv-colmismatch...    False\n",
      "test-data/input/test-data-input-csv-colmismatch...     True\n"
     ]
    }
   ],
   "source": [
    "print('all columns equal?', c.is_all_equal())\n",
    "print('')\n",
    "print('which columns are unique?', col_sniff['columns_unique'])\n",
    "print('')\n",
    "print('which files have unique columns?')\n",
    "print('')\n",
    "print(c.is_column_present_unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>profit2</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-02-04</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-02-05</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit  profit2                                           filepath                                 filename\n",
       "0  2011-02-01    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "1  2011-02-02    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "2  2011-02-03    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "3  2011-02-04    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "4  2011-02-05    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.to_pandas().head() # keep all columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-02-04</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-02-05</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit                                           filepath                                 filename\n",
       "0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "3  2011-02-04    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "4  2011-02-05    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d6tc.CombinerCSV(cfg_fnames, columns_select_common=True).to_pandas().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use Case: align renamed columns. Select subset of columns\n",
    "\n",
    "Say a column has been renamed and now the data doesn't line up with the data from the old column name. You can easily fix such a situation by using `CombinerCSVAdvanced` which allows you to rename columns and automatically lines up the data. It also allows you to just load data from a subset of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n",
      "                                                    revenue  sales\n",
      "file_path                                                         \n",
      "test-data/input/test-data-input-csv-renamed-feb...    False   True\n",
      "test-data/input/test-data-input-csv-renamed-jan...    False   True\n",
      "test-data/input/test-data-input-csv-renamed-mar...     True  False\n"
     ]
    }
   ],
   "source": [
    "cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-renamed-*.csv'))\n",
    "c = d6tc.CombinerCSV(cfg_fnames)\n",
    "print(c.is_column_present_unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column `sales` got renamed to `revenue` in the March file, this would causes problems when reading the files. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>filename</th>\n",
       "      <th>revenue</th>\n",
       "      <th>sales</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>test-data-input-csv-renamed-feb.csv</td>\n",
       "      <td>NaN</td>\n",
       "      <td>200.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>test-data-input-csv-renamed-feb.csv</td>\n",
       "      <td>NaN</td>\n",
       "      <td>200.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>test-data-input-csv-renamed-feb.csv</td>\n",
       "      <td>NaN</td>\n",
       "      <td>200.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>test-data-input-csv-renamed-jan.csv</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>test-data-input-csv-renamed-jan.csv</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>test-data-input-csv-renamed-jan.csv</td>\n",
       "      <td>NaN</td>\n",
       "      <td>100.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>test-data-input-csv-renamed-mar.csv</td>\n",
       "      <td>300.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>test-data-input-csv-renamed-mar.csv</td>\n",
       "      <td>300.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>test-data-input-csv-renamed-mar.csv</td>\n",
       "      <td>300.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              filename  revenue  sales\n",
       "0  test-data-input-csv-renamed-feb.csv      NaN  200.0\n",
       "1  test-data-input-csv-renamed-feb.csv      NaN  200.0\n",
       "2  test-data-input-csv-renamed-feb.csv      NaN  200.0\n",
       "3  test-data-input-csv-renamed-jan.csv      NaN  100.0\n",
       "4  test-data-input-csv-renamed-jan.csv      NaN  100.0\n",
       "5  test-data-input-csv-renamed-jan.csv      NaN  100.0\n",
       "6  test-data-input-csv-renamed-mar.csv    300.0    NaN\n",
       "7  test-data-input-csv-renamed-mar.csv    300.0    NaN\n",
       "8  test-data-input-csv-renamed-mar.csv    300.0    NaN"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "col_sniff = c.sniff_columns()\n",
    "c.combine_preview()[['filename']+col_sniff['columns_unique']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can pass the columns you want to rename to `columns_rename` and it will rename and align those columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# only select particular columns\n",
    "cfg_col_sel = ['date','sales','cost','profit'] # don't select profit2\n",
    "# rename colums\n",
    "cfg_col_rename = {'sales':'revenue'} # rename all instances of sales to revenue"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>revenue</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-fe...</td>\n",
       "      <td>test-data-input-csv-renamed-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-fe...</td>\n",
       "      <td>test-data-input-csv-renamed-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-fe...</td>\n",
       "      <td>test-data-input-csv-renamed-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-01-01</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-ja...</td>\n",
       "      <td>test-data-input-csv-renamed-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-01-02</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-ja...</td>\n",
       "      <td>test-data-input-csv-renamed-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-01-03</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-ja...</td>\n",
       "      <td>test-data-input-csv-renamed-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-03-01</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-ma...</td>\n",
       "      <td>test-data-input-csv-renamed-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-03-02</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-ma...</td>\n",
       "      <td>test-data-input-csv-renamed-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-03-03</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-renamed-ma...</td>\n",
       "      <td>test-data-input-csv-renamed-mar.csv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  revenue  cost  profit                                           filepath                             filename\n",
       "0  2011-02-01      200   -90     110  test-data/input/test-data-input-csv-renamed-fe...  test-data-input-csv-renamed-feb.csv\n",
       "1  2011-02-02      200   -90     110  test-data/input/test-data-input-csv-renamed-fe...  test-data-input-csv-renamed-feb.csv\n",
       "2  2011-02-03      200   -90     110  test-data/input/test-data-input-csv-renamed-fe...  test-data-input-csv-renamed-feb.csv\n",
       "3  2011-01-01      100   -80      20  test-data/input/test-data-input-csv-renamed-ja...  test-data-input-csv-renamed-jan.csv\n",
       "4  2011-01-02      100   -80      20  test-data/input/test-data-input-csv-renamed-ja...  test-data-input-csv-renamed-jan.csv\n",
       "5  2011-01-03      100   -80      20  test-data/input/test-data-input-csv-renamed-ja...  test-data-input-csv-renamed-jan.csv\n",
       "6  2011-03-01      300  -100     200  test-data/input/test-data-input-csv-renamed-ma...  test-data-input-csv-renamed-mar.csv\n",
       "7  2011-03-02      300  -100     200  test-data/input/test-data-input-csv-renamed-ma...  test-data-input-csv-renamed-mar.csv\n",
       "8  2011-03-03      300  -100     200  test-data/input/test-data-input-csv-renamed-ma...  test-data-input-csv-renamed-mar.csv"
      ]
     },
     "execution_count": 92,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c = d6tc.CombinerCSV(cfg_fnames, columns_rename = cfg_col_rename, columns_select = cfg_col_sel) \n",
    "c.combine_preview() \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Case: Identify change in column order\n",
    "\n",
    "If you read your files into a database this will be a real problem because it look like the files are all the same whereas in fact they have changes. This is because programs like dask or sql loaders assume the column order is the same. With `d6tstack` you can easily identify and fix such a case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['test-data/input/test-data-input-csv-reorder-jan.csv', 'test-data/input/test-data-input-csv-reorder-mar.csv', 'test-data/input/test-data-input-csv-reorder-feb.csv']\n"
     ]
    }
   ],
   "source": [
    "cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-reorder-*.csv'))\n",
    "print(cfg_fnames)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n"
     ]
    }
   ],
   "source": [
    "# get previews\n",
    "c = d6tc.CombinerCSV(cfg_fnames) # all_strings=True makes reading faster\n",
    "col_sniff = c.sniff_columns()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that all columns are not equal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "all columns equal? False\n",
      "\n",
      "in what order do columns appear in the files?\n",
      "\n",
      "   date  sales  cost  profit\n",
      "0     0      1     2       3\n",
      "1     0      1     2       3\n",
      "2     0      1     3       2\n"
     ]
    }
   ],
   "source": [
    "print('all columns equal?', col_sniff['is_all_equal'])\n",
    "print('')\n",
    "print('in what order do columns appear in the files?')\n",
    "print('')\n",
    "print(col_sniff['df_columns_order'].reset_index(drop=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\n",
       "      <td>test-data-input-csv-reorder-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\n",
       "      <td>test-data-input-csv-reorder-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\n",
       "      <td>test-data-input-csv-reorder-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-01-01</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\n",
       "      <td>test-data-input-csv-reorder-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-01-02</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\n",
       "      <td>test-data-input-csv-reorder-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-01-03</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\n",
       "      <td>test-data-input-csv-reorder-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-03-01</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\n",
       "      <td>test-data-input-csv-reorder-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-03-02</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\n",
       "      <td>test-data-input-csv-reorder-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-03-03</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\n",
       "      <td>test-data-input-csv-reorder-mar.csv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit                                           filepath                             filename\n",
       "0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\n",
       "1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\n",
       "2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\n",
       "3  2011-01-01    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\n",
       "4  2011-01-02    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\n",
       "5  2011-01-03    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\n",
       "6  2011-03-01    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\n",
       "7  2011-03-02    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\n",
       "8  2011-03-03    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.combine_preview() # automatically puts it in the right order"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Customize separator and pass pd.read_csv() params\n",
    "\n",
    "You can pass additional parameters such as separators and any params for `pd.read_csv()` to the combiner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n",
      "{'files_columns': {'test-data/input/test-data-input-csv-reorder-feb.csv': ['date', 'sales', 'cost', 'profit'], 'test-data/input/test-data-input-csv-reorder-jan.csv': ['date', 'sales', 'cost', 'profit'], 'test-data/input/test-data-input-csv-reorder-mar.csv': ['date', 'sales', 'profit', 'cost']}, 'columns_all': ['date', 'sales', 'cost', 'profit'], 'columns_common': ['date', 'sales', 'cost', 'profit'], 'columns_unique': [], 'is_all_equal': False, 'df_columns_present':                                                     date  sales  cost  profit\n",
      "file_path                                                                    \n",
      "test-data/input/test-data-input-csv-reorder-feb...  True   True  True    True\n",
      "test-data/input/test-data-input-csv-reorder-jan...  True   True  True    True\n",
      "test-data/input/test-data-input-csv-reorder-mar...  True   True  True    True, 'df_columns_order':                                                     date  sales  cost  profit\n",
      "test-data/input/test-data-input-csv-reorder-feb...     0      1     2       3\n",
      "test-data/input/test-data-input-csv-reorder-jan...     0      1     2       3\n",
      "test-data/input/test-data-input-csv-reorder-mar...     0      1     3       2}\n"
     ]
    }
   ],
   "source": [
    "c = d6tc.CombinerCSV(cfg_fnames, sep=',', read_csv_params={'header': None})\n",
    "col_sniff = c.sniff_columns()\n",
    "print(col_sniff)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CSV out of core functionality\n",
    "\n",
    "If your files are large you don't want to read them all in memory and then save. Instead you can write directly to the output file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'test-data/output/test.csv'"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c.to_csv_combine('test-data/output/test.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Auto Detect pd.read_csv() settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "### Detect CSV settings across a single file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'delim': ',', 'skiprows': 0, 'has_header': True, 'header': 0}\n"
     ]
    }
   ],
   "source": [
    "cfg_sniff = d6tstack.sniffer.sniff_settings_csv([cfg_fnames[0]])\n",
    "print(cfg_sniff)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detect CSV settings across multiple files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'delim': ',', 'skiprows': 0, 'has_header': True, 'header': 0}\n"
     ]
    }
   ],
   "source": [
    "# finds common csv across all files\n",
    "cfg_sniff = d6tstack.sniffer.sniff_settings_csv(cfg_fnames)\n",
    "print(cfg_sniff)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}


================================================
FILE: examples-dask.ipynb
================================================
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# d6tstack with Dask\n",
    "\n",
    "Dask is a great library for out-of-core computing. But if input files are not properly organized it quickly breaks. For example:\n",
    "\n",
    "1) if columns are different between files, dask won't even read the data! It doesn't tell you what you need to do to fix it.\n",
    "\n",
    "2) if column order is rearranged between files it will read data, but into the wrong columns and you won't notice it\n",
    "\n",
    "Dask can't handle those scenarios. With d6tstack you can easily fix the situation with just a few lines of code!\n",
    "\n",
    "For more instructions, examples and documentation see https://github.com/d6t/d6tstack"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Base Case: Columns are same between all files\n",
    "As a base case, we have input files which have consistent input columns and thus can be easily read in dask."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\n",
      "  return f(*args, **kwds)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-02-04</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-02-05</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-02-06</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-02-07</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-02-08</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-02-09</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-02-10</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-01-01</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-01-02</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-01-03</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-01-04</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-01-05</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-01-06</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-01-07</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-01-08</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-01-09</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-01-10</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-03-01</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-03-02</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-03-03</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-03-04</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-03-05</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-03-06</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-03-07</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-03-08</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-03-09</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-03-10</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit\n",
       "0  2011-02-01    200   -90     110\n",
       "1  2011-02-02    200   -90     110\n",
       "2  2011-02-03    200   -90     110\n",
       "3  2011-02-04    200   -90     110\n",
       "4  2011-02-05    200   -90     110\n",
       "5  2011-02-06    200   -90     110\n",
       "6  2011-02-07    200   -90     110\n",
       "7  2011-02-08    200   -90     110\n",
       "8  2011-02-09    200   -90     110\n",
       "9  2011-02-10    200   -90     110\n",
       "0  2011-01-01    100   -80      20\n",
       "1  2011-01-02    100   -80      20\n",
       "2  2011-01-03    100   -80      20\n",
       "3  2011-01-04    100   -80      20\n",
       "4  2011-01-05    100   -80      20\n",
       "5  2011-01-06    100   -80      20\n",
       "6  2011-01-07    100   -80      20\n",
       "7  2011-01-08    100   -80      20\n",
       "8  2011-01-09    100   -80      20\n",
       "9  2011-01-10    100   -80      20\n",
       "0  2011-03-01    300  -100     200\n",
       "1  2011-03-02    300  -100     200\n",
       "2  2011-03-03    300  -100     200\n",
       "3  2011-03-04    300  -100     200\n",
       "4  2011-03-05    300  -100     200\n",
       "5  2011-03-06    300  -100     200\n",
       "6  2011-03-07    300  -100     200\n",
       "7  2011-03-08    300  -100     200\n",
       "8  2011-03-09    300  -100     200\n",
       "9  2011-03-10    300  -100     200"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import dask.dataframe as dd\n",
    "\n",
    "# consistent format\n",
    "ddf = dd.read_csv('test-data/input/test-data-input-csv-clean-*.csv')\n",
    "ddf.compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem Case 1: Columns are different between files\n",
    "That worked well. But what happens if your input files have inconsistent columns across files? Say for example one file has a new column that the other files don't have."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "Length mismatch: Expected axis has 5 elements, new values have 4 elements",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-23-1bbb9709e59f>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[1;31m# consistent format\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m      2\u001b[0m \u001b[0mddf\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'test-data/input/test-data-input-csv-colmismatch-*.csv'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 3\u001b[1;33m \u001b[0mddf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcompute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\base.py\u001b[0m in \u001b[0;36mcompute\u001b[1;34m(self, **kwargs)\u001b[0m\n\u001b[0;32m    153\u001b[0m         \u001b[0mdask\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mbase\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcompute\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    154\u001b[0m         \"\"\"\n\u001b[1;32m--> 155\u001b[1;33m         \u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcompute\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtraverse\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mFalse\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    156\u001b[0m         \u001b[1;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    157\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\base.py\u001b[0m in \u001b[0;36mcompute\u001b[1;34m(*args, **kwargs)\u001b[0m\n\u001b[0;32m    402\u001b[0m     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)\n\u001b[0;32m    403\u001b[0m                     else (None, a) for a in args]\n\u001b[1;32m--> 404\u001b[1;33m     \u001b[0mresults\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mget\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mdsk\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkeys\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    405\u001b[0m     \u001b[0mresults_iter\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0miter\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresults\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    406\u001b[0m     return tuple(a if f is None else f(next(results_iter), *a)\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\threaded.py\u001b[0m in \u001b[0;36mget\u001b[1;34m(dsk, result, cache, num_workers, **kwargs)\u001b[0m\n\u001b[0;32m     73\u001b[0m     results = get_async(pool.apply_async, len(pool._pool), dsk, result,\n\u001b[0;32m     74\u001b[0m                         \u001b[0mcache\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0mcache\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mget_id\u001b[0m\u001b[1;33m=\u001b[0m\u001b[0m_thread_get_id\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 75\u001b[1;33m                         pack_exception=pack_exception, **kwargs)\n\u001b[0m\u001b[0;32m     76\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     77\u001b[0m     \u001b[1;31m# Cleanup pools associated to dead threads\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\local.py\u001b[0m in \u001b[0;36mget_async\u001b[1;34m(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)\u001b[0m\n\u001b[0;32m    519\u001b[0m                         \u001b[0m_execute_task\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m  \u001b[1;31m# Re-execute locally\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    520\u001b[0m                     \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 521\u001b[1;33m                         \u001b[0mraise_exception\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mexc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mtb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    522\u001b[0m                 \u001b[0mres\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mworker_id\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mloads\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mres_info\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    523\u001b[0m                 \u001b[0mstate\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;34m'cache'\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0mkey\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mres\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\compatibility.py\u001b[0m in \u001b[0;36mreraise\u001b[1;34m(exc, tb)\u001b[0m\n\u001b[0;32m     65\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[0mexc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__traceback__\u001b[0m \u001b[1;32mis\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mtb\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     66\u001b[0m             \u001b[1;32mraise\u001b[0m \u001b[0mexc\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mwith_traceback\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtb\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 67\u001b[1;33m         \u001b[1;32mraise\u001b[0m \u001b[0mexc\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     68\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     69\u001b[0m \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\local.py\u001b[0m in \u001b[0;36mexecute_task\u001b[1;34m(key, task_info, dumps, loads, get_id, pack_exception)\u001b[0m\n\u001b[0;32m    288\u001b[0m     \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    289\u001b[0m         \u001b[0mtask\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mloads\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtask_info\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 290\u001b[1;33m         \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_execute_task\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mtask\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    291\u001b[0m         \u001b[0mid\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mget_id\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    292\u001b[0m         \u001b[0mresult\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mdumps\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mresult\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mid\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\local.py\u001b[0m in \u001b[0;36m_execute_task\u001b[1;34m(arg, cache, dsk)\u001b[0m\n\u001b[0;32m    269\u001b[0m         \u001b[0mfunc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0margs\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0marg\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0marg\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    270\u001b[0m         \u001b[0margs2\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m[\u001b[0m\u001b[0m_execute_task\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0ma\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcache\u001b[0m\u001b[1;33m)\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0ma\u001b[0m \u001b[1;32min\u001b[0m \u001b[0margs\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 271\u001b[1;33m         \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs2\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    272\u001b[0m     \u001b[1;32melif\u001b[0m \u001b[1;32mnot\u001b[0m \u001b[0mishashable\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0marg\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    273\u001b[0m         \u001b[1;32mreturn\u001b[0m \u001b[0marg\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\compatibility.py\u001b[0m in \u001b[0;36mapply\u001b[1;34m(func, args, kwargs)\u001b[0m\n\u001b[0;32m     46\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0mapply\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     47\u001b[0m         \u001b[1;32mif\u001b[0m \u001b[0mkwargs\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 48\u001b[1;33m             \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;33m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     49\u001b[0m         \u001b[1;32melse\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     50\u001b[0m             \u001b[1;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m*\u001b[0m\u001b[0margs\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\dask\\dataframe\\io\\csv.py\u001b[0m in \u001b[0;36mpandas_read_text\u001b[1;34m(reader, b, header, kwargs, dtypes, columns, write_header, enforce)\u001b[0m\n\u001b[0;32m     69\u001b[0m         \u001b[1;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"Columns do not match\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     70\u001b[0m     \u001b[1;32melif\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m---> 71\u001b[1;33m         \u001b[0mdf\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mcolumns\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m     72\u001b[0m     \u001b[1;32mreturn\u001b[0m \u001b[0mdf\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m     73\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36m__setattr__\u001b[1;34m(self, name, value)\u001b[0m\n\u001b[0;32m   3625\u001b[0m         \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   3626\u001b[0m             \u001b[0mobject\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mname\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3627\u001b[1;33m             \u001b[1;32mreturn\u001b[0m \u001b[0mobject\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__setattr__\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mname\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m   3628\u001b[0m         \u001b[1;32mexcept\u001b[0m \u001b[0mAttributeError\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   3629\u001b[0m             \u001b[1;32mpass\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mpandas/_libs/properties.pyx\u001b[0m in \u001b[0;36mpandas._libs.properties.AxisProperty.__set__\u001b[1;34m()\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\pandas\\core\\generic.py\u001b[0m in \u001b[0;36m_set_axis\u001b[1;34m(self, axis, labels)\u001b[0m\n\u001b[0;32m    557\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    558\u001b[0m     \u001b[1;32mdef\u001b[0m \u001b[0m_set_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlabels\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 559\u001b[1;33m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_data\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mset_axis\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mlabels\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m    560\u001b[0m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_clear_item_cache\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m    561\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;32mC:\\Anaconda3\\lib\\site-packages\\pandas\\core\\internals.py\u001b[0m in \u001b[0;36mset_axis\u001b[1;34m(self, axis, new_labels)\u001b[0m\n\u001b[0;32m   3072\u001b[0m             raise ValueError('Length mismatch: Expected axis has %d elements, '\n\u001b[0;32m   3073\u001b[0m                              \u001b[1;34m'new values have %d elements'\u001b[0m \u001b[1;33m%\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 3074\u001b[1;33m                              (old_len, new_len))\n\u001b[0m\u001b[0;32m   3075\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m   3076\u001b[0m         \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0maxes\u001b[0m\u001b[1;33m[\u001b[0m\u001b[0maxis\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnew_labels\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
      "\u001b[1;31mValueError\u001b[0m: Length mismatch: Expected axis has 5 elements, new values have 4 elements"
     ]
    }
   ],
   "source": [
    "# consistent format\n",
    "ddf = dd.read_csv('test-data/input/test-data-input-csv-colmismatch-*.csv')\n",
    "ddf.compute()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fixing the problem with d6stack\n",
    "Urgh! There's no way to use these files in dask. You don't even know what's going on. What file caused the problem? Why did it cause a problem? All you know is one file has more columns than the first file.\n",
    "\n",
    "You can either manually process those files or use d6tstack to easily check for such a situation and fix it with a few lines of code - no manual processing required. Let's take a look!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n",
      "all equal False\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>profit2</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>file_path</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>test-data/input/test-data-input-csv-colmismatch-feb.csv</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test-data/input/test-data-input-csv-colmismatch-jan.csv</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>test-data/input/test-data-input-csv-colmismatch-mar.csv</th>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    date  sales  cost  profit  profit2\n",
       "file_path                                                                             \n",
       "test-data/input/test-data-input-csv-colmismatch...  True   True  True    True    False\n",
       "test-data/input/test-data-input-csv-colmismatch...  True   True  True    True    False\n",
       "test-data/input/test-data-input-csv-colmismatch...  True   True  True    True     True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "import d6tstack.combine_csv\n",
    "\n",
    "cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))\n",
    "c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)\n",
    "\n",
    "# check columns\n",
    "print('all equal',c.is_all_equal())\n",
    "print('')\n",
    "c.is_column_present()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before using dask you can quickly use d6stack to check if all colums are consistent with `d6tstack.combine_csv.CombinerCSV.is_all_equal()`. If they are not consistent you can easily see which files are causing problems with `d6tstack.combine_csv.CombinerCSV.is_col_present()`, in this case there is a new column \"profit2\" in \"test-data-input-csv-colmismatch-mar.csv\".\n",
    "\n",
    "**Let's use d6stack to fix the situation.** We will use out-of-core processing with `d6tstack.combine_csv.CombinerCSVAdvanced.combine_save()` to save data from all files into one combined file with constistent columns. Any missing data is filled with NaN (to keep only common columns use `cfg_col_sel=c.col_preview['columns_common']`) Just 2 lines of code! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n",
      "writing test-data/output/d6tstack-test-data-input-csv-colmismatch-feb.csv ok\n",
      "writing test-data/output/d6tstack-test-data-input-csv-colmismatch-jan.csv ok\n",
      "writing test-data/output/d6tstack-test-data-input-csv-colmismatch-mar.csv ok\n"
     ]
    }
   ],
   "source": [
    "# out-of-core combining\n",
    "fnames = d6tstack.combine_csv.CombinerCSV(cfg_fnames).to_csv_align(output_dir='test-data/output/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NB: Instead of `to_csv_align()` you can also run `to_csv_combine()` which creates a single combined file.\n",
    "\n",
    "Now you can read this in dask and do whatever you wanted to do in the first place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>profit2</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-02-04</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-02-05</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-02-06</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-02-07</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-02-08</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-02-09</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-02-10</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-feb.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-01-01</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-01-02</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-01-03</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-01-04</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-01-05</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-01-06</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-01-07</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-01-08</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-01-09</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-01-10</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "      <td>NaN</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-jan.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-03-01</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-03-02</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-03-03</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-03-04</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-03-05</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-03-06</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-03-07</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-03-08</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-03-09</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-03-10</td>\n",
       "      <td>300</td>\n",
       "      <td>-100</td>\n",
       "      <td>200</td>\n",
       "      <td>400.0</td>\n",
       "      <td>test-data/input/test-data-input-csv-colmismatc...</td>\n",
       "      <td>test-data-input-csv-colmismatch-mar.csv</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit  profit2                                           filepath                                 filename\n",
       "0  2011-02-01    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "1  2011-02-02    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "2  2011-02-03    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "3  2011-02-04    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "4  2011-02-05    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "5  2011-02-06    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "6  2011-02-07    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "7  2011-02-08    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "8  2011-02-09    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "9  2011-02-10    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\n",
       "0  2011-01-01    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "1  2011-01-02    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "2  2011-01-03    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "3  2011-01-04    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "4  2011-01-05    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "5  2011-01-06    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "6  2011-01-07    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "7  2011-01-08    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "8  2011-01-09    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "9  2011-01-10    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\n",
       "0  2011-03-01    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "1  2011-03-02    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "2  2011-03-03    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "3  2011-03-04    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "4  2011-03-05    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "5  2011-03-06    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "6  2011-03-07    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "7  2011-03-08    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "8  2011-03-09    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\n",
       "9  2011-03-10    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# consistent format\n",
    "ddf = dd.read_csv('test-data/output/d6tstack-test-data-input-csv-colmismatch-*.csv')\n",
    "ddf.compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem Case 2: Columns are reordered between files\n",
    "This is a sneaky case. The columns are the same but the order is different! Dask will read everything just fine without a warning but your data is totally messed up!\n",
    "\n",
    "In the example below, the \"profit\" column contains data from the \"cost\" column!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-02-02</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-02-03</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-02-04</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-02-05</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-02-06</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-02-07</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-02-08</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-02-09</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-02-10</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-01-01</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-01-02</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-01-03</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-01-04</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-01-05</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-01-06</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-01-07</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-01-08</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-01-09</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-01-10</td>\n",
       "      <td>100</td>\n",
       "      <td>-80</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-03-01</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2011-03-02</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2011-03-03</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2011-03-04</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2011-03-05</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2011-03-06</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2011-03-07</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2011-03-08</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2011-03-09</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2011-03-10</td>\n",
       "      <td>300</td>\n",
       "      <td>200</td>\n",
       "      <td>-100</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  sales  cost  profit\n",
       "0  2011-02-01    200   -90     110\n",
       "1  2011-02-02    200   -90     110\n",
       "2  2011-02-03    200   -90     110\n",
       "3  2011-02-04    200   -90     110\n",
       "4  2011-02-05    200   -90     110\n",
       "5  2011-02-06    200   -90     110\n",
       "6  2011-02-07    200   -90     110\n",
       "7  2011-02-08    200   -90     110\n",
       "8  2011-02-09    200   -90     110\n",
       "9  2011-02-10    200   -90     110\n",
       "0  2011-01-01    100   -80      20\n",
       "1  2011-01-02    100   -80      20\n",
       "2  2011-01-03    100   -80      20\n",
       "3  2011-01-04    100   -80      20\n",
       "4  2011-01-05    100   -80      20\n",
       "5  2011-01-06    100   -80      20\n",
       "6  2011-01-07    100   -80      20\n",
       "7  2011-01-08    100   -80      20\n",
       "8  2011-01-09    100   -80      20\n",
       "9  2011-01-10    100   -80      20\n",
       "0  2011-03-01    300   200    -100\n",
       "1  2011-03-02    300   200    -100\n",
       "2  2011-03-03    300   200    -100\n",
       "3  2011-03-04    300   200    -100\n",
       "4  2011-03-05    300   200    -100\n",
       "5  2011-03-06    300   200    -100\n",
       "6  2011-03-07    300   200    -100\n",
       "7  2011-03-08    300   200    -100\n",
       "8  2011-03-09    300   200    -100\n",
       "9  2011-03-10    300   200    -100"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# consistent format\n",
    "ddf = dd.read_csv('test-data/input/test-data-input-csv-reorder-*.csv')\n",
    "ddf.compute()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n",
      "all columns equal? False\n",
      "\n",
      "in what order do columns appear in the files?\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   date  sales  cost  profit\n",
       "0     0      1     2       3\n",
       "1     0      1     2       3\n",
       "2     0      1     3       2"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-reorder-*.csv'))\n",
    "c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)\n",
    "\n",
    "# check columns\n",
    "col_sniff = c.sniff_columns()\n",
    "print('all columns equal?' , c.is_all_equal())\n",
    "print('')\n",
    "print('in what order do columns appear in the files?')\n",
    "print('')\n",
    "col_sniff['df_columns_order'].reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, just a useful check before loading data into dask you can see that the columns don't line up. It's very fast to run because it only reads the headers, there's NO reason for you NOT to do it from a QA perspective.\n",
    "\n",
    "Same as above, the fix is the same few lines of code with d6stack."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sniffing columns ok\n",
      "writing test-data/output/d6tstack-test-data-input-csv-reorder-feb.csv ok\n",
      "writing test-data/output/d6tstack-test-data-input-csv-reorder-jan.csv ok\n",
      "writing test-data/output/d6tstack-test-data-input-csv-reorder-mar.csv ok\n"
     ]
    }
   ],
   "source": [
    "# out-of-core combining\n",
    "fnames = d6tstack.combine_csv.CombinerCSV(cfg_fnames).to_csv_align(output_dir='test-data/output/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>sales</th>\n",
       "      <th>cost</th>\n",
       "      <th>profit</th>\n",
       "      <th>filepath</th>\n",
       "      <th>filename</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2011-02-01</td>\n",
       "      <td>200</td>\n",
       "      <td>-90</td>\n",
       "      <td>110</td>\n",
       "      <td>test

Download .txt

gitextract_bcekc6b6/

├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── d6tstack/
│   ├── __init__.py
│   ├── combine_csv.py
│   ├── convert_xls.py
│   ├── helpers.py
│   ├── pyftp_final.py
│   ├── sniffer.py
│   ├── sync.py
│   └── utils.py
├── docs/
│   ├── Makefile
│   ├── make.bat
│   ├── make_zip_sample_csv.py
│   ├── make_zip_sample_xls.py
│   ├── shell-napoleon-html.sh
│   ├── shell-napoleon-recreate.sh
│   └── source/
│       ├── conf.py
│       ├── d6tstack.rst
│       ├── index.rst
│       ├── modules.rst
│       ├── setup.rst
│       └── tests.rst
├── examples-csv.ipynb
├── examples-dask.ipynb
├── examples-excel.ipynb
├── examples-pyspark.ipynb
├── examples-read-write.ipynb
├── examples-sql.ipynb
├── requirements-dev.txt
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests/
    ├── __init__.py
    ├── pypi.sh
    ├── test-parquet.py
    ├── test_combine_csv.py
    ├── test_combine_old.py
    ├── test_sync.py
    ├── test_xls.py
    ├── tmp-reindex-withorder.py
    ├── tmp-runtest.py
    └── tmp.py

Download .txt

SYMBOL INDEX (196 symbols across 13 files)

FILE: d6tstack/combine_csv.py
  function _dfconact (line 21) | def _dfconact(df):
  function _direxists (line 24) | def _direxists(fname, logger):
  class CombinerCSV (line 36) | class CombinerCSV(object, metaclass=d6tcollect.Collect):
    method __init__ (line 57) | def __init__(self, fname_list, sep=',', nrows_preview=3, chunksize=1e6...
    method _read_csv_yield (line 93) | def _read_csv_yield(self, fname, read_csv_params):
    method sniff_columns (line 108) | def sniff_columns(self):
    method get_sniff_results (line 181) | def get_sniff_results(self):
    method _sniff_available (line 186) | def _sniff_available(self):
    method is_all_equal (line 190) | def is_all_equal(self):
    method is_column_present (line 200) | def is_column_present(self):
    method is_column_present_unique (line 210) | def is_column_present_unique(self):
    method columns_unique (line 220) | def columns_unique(self):
    method is_column_present_common (line 229) | def is_column_present_common(self):
    method columns_common (line 239) | def columns_common(self):
    method columns (line 248) | def columns(self):
    method head (line 258) | def head(self):
    method _columns_reindex_prep (line 268) | def _columns_reindex_prep(self):
    method _columns_reindex_available (line 307) | def _columns_reindex_available(self):
    method preview_rename (line 311) | def preview_rename(self):
    method preview_select (line 322) | def preview_select(self):
    method combine_preview (line 332) | def combine_preview(self):
    method _combine_preview_available (line 347) | def _combine_preview_available(self):
    method to_pandas (line 351) | def to_pandas(self):
    method _get_filepath_out (line 362) | def _get_filepath_out(self, fname, output_dir, output_prefix, ext):
    method _to_csv_prep (line 374) | def _to_csv_prep(self, write_params):
    method to_csv_head (line 383) | def to_csv_head(self, output_dir=None, write_params={}):
    method to_csv_align (line 406) | def to_csv_align(self, output_dir=None, output_prefix='d6tstack-', wri...
    method to_csv_combine (line 436) | def to_csv_combine(self, filename, write_params={}):
    method to_parquet_align (line 459) | def to_parquet_align(self, output_dir=None, output_prefix='d6tstack-',...
    method to_parquet_combine (line 486) | def to_parquet_combine(self, filename, write_params={}):
    method to_sql_combine (line 506) | def to_sql_combine(self, uri, tablename, if_exists='fail', write_param...
    method to_psql_combine (line 552) | def to_psql_combine(self, uri, table_name, if_exists='fail', sep=','):
    method to_mysql_combine (line 592) | def to_mysql_combine(self, uri, table_name, if_exists='fail', tmpfile=...
    method to_mssql_combine (line 630) | def to_mssql_combine(self, uri, table_name, schema_name=None, if_exist...

FILE: d6tstack/convert_xls.py
  function read_excel_advanced (line 24) | def read_excel_advanced(fname, remove_blank_cols=True, remove_blank_rows...
  class XLSSniffer (line 105) | class XLSSniffer(object, metaclass=d6tcollect.Collect):
    method __init__ (line 116) | def __init__(self, fname_list, logger=None):
    method sniff (line 124) | def sniff(self):
    method all_contain_sheetname (line 167) | def all_contain_sheetname(self,sheet_name):
    method all_have_idx (line 180) | def all_have_idx(self,sheet_idx):
    method all_same_count (line 193) | def all_same_count(self):
    method all_same_names (line 207) | def all_same_names(self):
  class XLStoBase (line 216) | class XLStoBase(object, metaclass=d6tcollect.Collect):
    method __init__ (line 217) | def __init__(self, if_exists='skip', output_dir=None, logger=None):
    method _get_output_filename (line 238) | def _get_output_filename(self, fname):
    method convert_single (line 247) | def convert_single(self, fname, sheet_name, **kwds):
  class XLStoCSVMultiFile (line 279) | class XLStoCSVMultiFile(XLStoBase, metaclass=d6tcollect.Collect):
    method __init__ (line 300) | def __init__(self, fname_list, cfg_xls_sheets_sel_mode='idx_global', c...
    method set_files (line 308) | def set_files(self, fname_list):
    method set_select_mode (line 320) | def set_select_mode(self, cfg_xls_sheets_sel_mode, cfg_xls_sheets_sel):
    method convert_all (line 359) | def convert_all(self, **kwds):
  class XLStoCSVMultiSheet (line 379) | class XLStoCSVMultiSheet(XLStoBase, metaclass=d6tcollect.Collect):
    method __init__ (line 393) | def __init__(self, fname, sheet_names=None, output_dir=None, if_exists...
    method convert_single (line 406) | def convert_single(self, sheet_name, **kwds):
    method convert_all (line 420) | def convert_all(self, **kwds):

FILE: d6tstack/helpers.py
  function file_extensions_get (line 13) | def file_extensions_get(fname_list):
  function file_extensions_all_equal (line 25) | def file_extensions_all_equal(ext_list):
  function file_extensions_contains_xls (line 37) | def file_extensions_contains_xls(ext_list):
  function file_extensions_contains_xlsx (line 41) | def file_extensions_contains_xlsx(ext_list):
  function file_extensions_contains_csv (line 45) | def file_extensions_contains_csv(ext_list):
  function file_extensions_valid (line 49) | def file_extensions_valid(ext_list):
  function columns_all_equal (line 65) | def columns_all_equal(col_list):
  function list_common (line 77) | def list_common(_list, sort=True):
  function list_unique (line 85) | def list_unique(_list, sort=True):
  function list_tofront (line 93) | def list_tofront(_list,val):
  function cols_filename_tofront (line 97) | def cols_filename_tofront(_list):
  function df_filename_tofront (line 101) | def df_filename_tofront(dfg):
  function check_valid_xls (line 106) | def check_valid_xls(fname_list):
  function compare_pandas_versions (line 118) | def compare_pandas_versions(version1, version2):

FILE: d6tstack/pyftp_final.py
  function get_ftp_files (line 7) | def get_ftp_files():
  function upload_ftp_files_s3 (line 19) | def upload_ftp_files_s3(ftp_files, s3_files, bucket):
  function list_s3_files (line 32) | def list_s3_files(bucket):
  function upload_to_s3 (line 39) | def upload_to_s3(bucket):

FILE: d6tstack/sniffer.py
  function csv_count_rows (line 18) | def csv_count_rows(fname):
  class CSVSniffer (line 30) | class CSVSniffer(object, metaclass=d6tcollect.Collect):
    method __init__ (line 42) | def __init__(self, fname, nlines = 10, delims=',;\t|'):
    method read_nlines (line 52) | def read_nlines(self):
    method scan_delim (line 58) | def scan_delim(self):
    method get_delim (line 73) | def get_delim(self):
    method check_column_length_consistent (line 100) | def check_column_length_consistent(self):
    method count_skiprows (line 107) | def count_skiprows(self):
    method has_header_inverse (line 119) | def has_header_inverse(self):
    method has_header (line 140) | def has_header(self):
  class CSVSnifferList (line 145) | class CSVSnifferList(object, metaclass=d6tcollect.Collect):
    method __init__ (line 158) | def __init__(self, fname_list, nlines = 10, delims=',;\t|'):
    method get_all (line 162) | def get_all(self, fun_name, msg_error):
    method get_delim (line 174) | def get_delim(self):
    method count_skiprows (line 177) | def count_skiprows(self):
    method has_header (line 180) | def has_header(self):
  function sniff_settings_csv (line 186) | def sniff_settings_csv(fname_list):

FILE: d6tstack/sync.py
  class FTPSync (line 8) | class FTPSync:
    method __init__ (line 25) | def __init__(self, cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd, cfg_ftp_dir,
    method get_all_files (line 61) | def get_all_files(self, subdirs=True, ftp=False):
    method get_s3_files (line 93) | def get_s3_files(self):
    method upload_to_s3 (line 110) | def upload_to_s3(self, fname, local_path):
    method get_files_for_sync (line 124) | def get_files_for_sync(self, subdirs=True, to_s3=False):
    method upload_ftp_files (line 144) | def upload_ftp_files(self, subdirs=True, to_s3=False):

FILE: d6tstack/utils.py
  class PrintLogger (line 7) | class PrintLogger(object):
    method send_log (line 8) | def send_log(self, msg, status):
    method send (line 11) | def send(self, data):
  function pd_readsql_query_from_sqlengine (line 17) | def pd_readsql_query_from_sqlengine(uri, sql, schema_name=None, connect_...
  function pd_readsql_table_from_sqlengine (line 50) | def pd_readsql_table_from_sqlengine(uri, table_name, schema_name=None, c...
  function pd_to_psql (line 69) | def pd_to_psql(df, uri, table_name, schema_name=None, if_exists='fail', ...
  function pd_to_mysql (line 115) | def pd_to_mysql(df, uri, table_name, if_exists='fail', tmpfile='mysql.cs...
  function pd_to_mssql (line 154) | def pd_to_mssql(df, uri, table_name, schema_name=None, if_exists='fail',...

FILE: docs/make_zip_sample_xls.py
  function write_file_xls (line 21) | def write_file_xls(dfg, fname, sheets, startrow=0,startcol=0):

FILE: tests/test_combine_csv.py
  class DebugLogger (line 48) | class DebugLogger(object):
    method __init__ (line 49) | def __init__(self, event):
    method send_log (line 52) | def send_log(self, msg, status):
    method send (line 55) | def send(self, data):
  function create_files_df_clean (line 61) | def create_files_df_clean():
  function create_files_df_clean_combine (line 71) | def create_files_df_clean_combine():
  function create_files_df_clean_combine_with_filename (line 79) | def create_files_df_clean_combine_with_filename(fname_list):
  function create_files_df_colmismatch_combine (line 90) | def create_files_df_colmismatch_combine(cfg_col_common,allstr=True):
  function check_df_colmismatch_combine (line 103) | def check_df_colmismatch_combine(dfg,is_common=False, convert_date=True):
  function create_files_df_colmismatch_combine2 (line 112) | def create_files_df_colmismatch_combine2(cfg_col_common):
  function create_files_csv (line 127) | def create_files_csv():
  function create_files_csv_colmismatch (line 139) | def create_files_csv_colmismatch():
  function create_files_csv_colmismatch2 (line 152) | def create_files_csv_colmismatch2():
  function create_files_csv_colreorder (line 166) | def create_files_csv_colreorder():
  function create_files_csv_noheader (line 182) | def create_files_csv_noheader():
  function create_files_csv_col_renamed (line 194) | def create_files_csv_col_renamed():
  function create_files_csv_dirty (line 209) | def create_files_csv_dirty(cfg_sep=",", cfg_header=True):
  function create_files_xls_single_helper (line 217) | def create_files_xls_single_helper(cfg_fname):
  function create_files_xls_single (line 226) | def create_files_xls_single():
  function create_files_xlsx_single (line 230) | def create_files_xlsx_single():
  function write_file_xls (line 234) | def write_file_xls(dfg, fname, startrow=0,startcol=0):
  function create_files_xls_multiple_helper (line 241) | def create_files_xls_multiple_helper(cfg_fname):
  function create_files_xls_multiple (line 251) | def create_files_xls_multiple():
  function create_files_xlsx_multiple (line 255) | def create_files_xlsx_multiple():
  function test_file_extensions_get (line 262) | def test_file_extensions_get():
  function test_file_extensions_all_equal (line 271) | def test_file_extensions_all_equal():
  function test_file_extensions_valid (line 279) | def test_file_extensions_valid():
  function test_csv_sniff (line 292) | def test_csv_sniff(create_files_csv, create_files_csv_colmismatch, creat...
  function test_csv_selectrename (line 325) | def test_csv_selectrename(create_files_csv, create_files_csv_colmismatch):
  function test_to_pandas (line 360) | def test_to_pandas(create_files_csv, create_files_csv_colmismatch, creat...
  function test_combinepreview (line 379) | def test_combinepreview(create_files_csv_colmismatch):
  function test_tocsv (line 393) | def test_tocsv(create_files_csv_colmismatch):
  function test_topq (line 434) | def test_topq(create_files_csv_colmismatch):
  function test_tosql (line 470) | def test_tosql(create_files_csv_colmismatch):
  function test_tosql_util (line 517) | def test_tosql_util(create_files_csv_colmismatch):

FILE: tests/test_combine_old.py
  class TestLogPusher (line 41) | class TestLogPusher(object):
    method __init__ (line 42) | def __init__(self, event):
    method send_log (line 45) | def send_log(self, msg, status):
    method send (line 48) | def send(self, data):
  function create_files_df_clean (line 54) | def create_files_df_clean():
  function create_files_df_clean_combine (line 64) | def create_files_df_clean_combine():
  function create_files_df_clean_combine_with_filename (line 72) | def create_files_df_clean_combine_with_filename(fname_list):
  function create_files_df_colmismatch_combine (line 83) | def create_files_df_colmismatch_combine(cfg_col_common):
  function create_files_df_colmismatch_combine2 (line 95) | def create_files_df_colmismatch_combine2(cfg_col_common):
  function create_files_csv (line 110) | def create_files_csv():
  function create_files_csv_colmismatch (line 122) | def create_files_csv_colmismatch():
  function create_files_csv_colmismatch2 (line 135) | def create_files_csv_colmismatch2():
  function create_files_csv_colreorder (line 149) | def create_files_csv_colreorder():
  function create_files_csv_noheader (line 165) | def create_files_csv_noheader():
  function create_files_csv_col_renamed (line 177) | def create_files_csv_col_renamed():
  function test_create_files_csv_col_renamed (line 191) | def test_create_files_csv_col_renamed(create_files_csv_col_renamed):
  function create_files_csv_dirty (line 194) | def create_files_csv_dirty(cfg_sep=",", cfg_header=True):
  function create_files_xls_single_helper (line 202) | def create_files_xls_single_helper(cfg_fname):
  function create_files_xls_single (line 211) | def create_files_xls_single():
  function create_files_xlsx_single (line 215) | def create_files_xlsx_single():
  function write_file_xls (line 219) | def write_file_xls(dfg, fname, startrow=0,startcol=0):
  function create_files_xls_multiple_helper (line 226) | def create_files_xls_multiple_helper(cfg_fname):
  function create_files_xls_multiple (line 236) | def create_files_xls_multiple():
  function create_files_xlsx_multiple (line 240) | def create_files_xlsx_multiple():
  function test_file_extensions_get (line 247) | def test_file_extensions_get():
  function test_file_extensions_all_equal (line 256) | def test_file_extensions_all_equal():
  function test_file_extensions_valid (line 264) | def test_file_extensions_valid():
  function test_csv_sniff_single (line 277) | def test_csv_sniff_single(create_files_csv, create_files_csv_noheader):
  function test_csv_sniff_multi (line 300) | def test_csv_sniff_multi(create_files_csv, create_files_csv_noheader):
  function test_CombinerCSV_columns (line 314) | def test_CombinerCSV_columns(create_files_csv, create_files_csv_colmisma...
  function test_CombinerCSV_combine (line 347) | def test_CombinerCSV_combine(create_files_csv, create_files_csv_colmisma...
  function test_CombinerCSV_combine_advanced (line 394) | def test_CombinerCSV_combine_advanced(create_files_csv):
  function test_preview_dict (line 422) | def test_preview_dict():
  function create_df_rename (line 430) | def create_df_rename():
  function create_files_csv_rename (line 440) | def create_files_csv_rename():
  function test_create_files_csv_rename (line 452) | def test_create_files_csv_rename(create_files_csv_rename):
  function create_out_files_csv_align_save (line 456) | def create_out_files_csv_align_save():
  function create_out_files_parquet_align_save (line 462) | def create_out_files_parquet_align_save():
  function test_apply_select_rename (line 467) | def test_apply_select_rename():
  function test_CombinerCSV_rename (line 485) | def test_CombinerCSV_rename(create_files_csv_rename):
  function test_CombinerCSV_align_save_advanced (line 548) | def test_CombinerCSV_align_save_advanced(create_files_csv_rename, create...
  function test_CombinerCSV_sql_advanced (line 592) | def test_CombinerCSV_sql_advanced(create_files_csv_rename):
  function test_CombinerCSV_sql (line 644) | def test_CombinerCSV_sql(create_files_csv):
  function test_combinercsv_to_csv (line 673) | def test_combinercsv_to_csv(create_files_csv_rename, create_out_files_cs...
  function test_combinercsv_to_parquet (line 712) | def test_combinercsv_to_parquet(create_files_csv_rename, create_out_file...

FILE: tests/test_sync.py
  function _remove_local_dir (line 15) | def _remove_local_dir(folder):
  function test_sync_local (line 21) | def test_sync_local():
  function test_sync_s3 (line 49) | def test_sync_s3():

FILE: tests/test_xls.py
  function test_xls_scan_sheets_single (line 21) | def test_xls_scan_sheets_single(create_files_xls_single, create_files_xl...
  function test_xls_scan_sheets_multipe (line 40) | def test_xls_scan_sheets_multipe(create_files_xls_multiple, create_files...
  function test_read_excel_adv (line 56) | def test_read_excel_adv():
  function test_XLStoBase (line 101) | def test_XLStoBase():
  function test_XLStoCSVMultiFile (line 156) | def test_XLStoCSVMultiFile(create_files_xls_single,create_files_xlsx_sin...
  function test_XLStoCSVMultiSheet (line 218) | def test_XLStoCSVMultiSheet(create_files_xlsx_multiple):

FILE: tests/tmp-runtest.py
  function apply (line 22) | def apply(dfg):

Download .json

Condensed preview — 44 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (493K chars).

[
  {
    "path": ".gitignore",
    "chars": 1345,
    "preview": "tests/.test-cred.yaml\n\n.idea/\n.env\ntemp/\nfiddle*\n.pytest_cache/\ntest-data/output/\n\n# add this manually\ntest-data/\n\n# Byt"
  },
  {
    "path": "LICENSE",
    "chars": 1065,
    "preview": "MIT License\n\nCopyright (c) 2018 Databolt\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\no"
  },
  {
    "path": "MANIFEST.in",
    "chars": 33,
    "preview": "include README.md\ninclude LICENSE"
  },
  {
    "path": "README.md",
    "chars": 5557,
    "preview": "# Databolt File Ingest\n\nQuickly ingest raw files. Works for XLS, CSV, TXT which can be exported to CSV, Parquet, SQL and"
  },
  {
    "path": "d6tstack/__init__.py",
    "chars": 125,
    "preview": "import d6tstack.combine_csv\n#import d6tstack.convert_xls\nimport d6tstack.sniffer\n#import d6tstack.sync\nimport d6tstack.u"
  },
  {
    "path": "d6tstack/combine_csv.py",
    "chars": 25903,
    "preview": "import numpy as np\nimport pandas as pd\npd.set_option('display.expand_frame_repr', False)\nfrom scipy.stats import mode\nim"
  },
  {
    "path": "d6tstack/convert_xls.py",
    "chars": 15299,
    "preview": "import warnings\nimport os.path\n\nimport numpy as np\nimport pandas as pd\n\nimport ntpath\n\nimport openpyxl\nimport xlrd\n\ntry:"
  },
  {
    "path": "d6tstack/helpers.py",
    "chars": 3149,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\n\nModule with several helper functions\n\n\"\"\"\n\nimport os\nimport collecti"
  },
  {
    "path": "d6tstack/pyftp_final.py",
    "chars": 2000,
    "preview": "from boto.s3.connection import S3Connection\nfrom boto.s3.key import Key\nimport os\nimport ftputil\n\n\ndef get_ftp_files():\n"
  },
  {
    "path": "d6tstack/sniffer.py",
    "chars": 6732,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\n\nFinds CSV settings and Excel sheets in multiple files. Often needed "
  },
  {
    "path": "d6tstack/sync.py",
    "chars": 6266,
    "preview": "import boto3\nimport botocore\nimport os\nimport ftputil\nimport numpy as np\n\n\nclass FTPSync:\n    \"\"\"\n\n        FTP Sync clas"
  },
  {
    "path": "d6tstack/utils.py",
    "chars": 6769,
    "preview": "import pandas as pd\nimport warnings\n\nimport d6tcollect\nd6tcollect.init(__name__)\n\nclass PrintLogger(object):\n    def sen"
  },
  {
    "path": "docs/Makefile",
    "chars": 612,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHI"
  },
  {
    "path": "docs/make.bat",
    "chars": 773,
    "preview": "@ECHO OFF\n\npushd %~dp0\n\nREM Command file for Sphinx documentation\n\nif \"%SPHINXBUILD%\" == \"\" (\n\tset SPHINXBUILD=python -m"
  },
  {
    "path": "docs/make_zip_sample_csv.py",
    "chars": 471,
    "preview": "import zipfile\nimport glob\nimport os\n\nif not os.path.exists('test-data/output/__init__.py'):\n\tfhandle = open('test-data/"
  },
  {
    "path": "docs/make_zip_sample_xls.py",
    "chars": 1546,
    "preview": "import zipfile\nimport glob\nimport os\n\n\n\nimport pandas as pd\nimport numpy as np\n# generate fake data\ncfg_tickers = ['AAP'"
  },
  {
    "path": "docs/shell-napoleon-html.sh",
    "chars": 10,
    "preview": "make html\n"
  },
  {
    "path": "docs/shell-napoleon-recreate.sh",
    "chars": 97,
    "preview": "#rm ./source/*\n#cp ./source-bak/* ./source/\nsphinx-apidoc -f -o ./source ..\nmake clean\nmake html\n"
  },
  {
    "path": "docs/source/conf.py",
    "chars": 5575,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n#\n# d6t-lib documentation build configuration file, created by\n# sphinx-q"
  },
  {
    "path": "docs/source/d6tstack.rst",
    "chars": 1178,
    "preview": "d6tstack package\n================\n\nSubmodules\n----------\n\nd6tstack.combine\\_csv module\n----------------------------\n\n.. "
  },
  {
    "path": "docs/source/index.rst",
    "chars": 480,
    "preview": ".. d6t-celery-combine documentation master file, created by\n   sphinx-quickstart on Tue Nov 28 11:32:56 2017.\n   You can"
  },
  {
    "path": "docs/source/modules.rst",
    "chars": 70,
    "preview": "d6tstack\n========\n\n.. toctree::\n   :maxdepth: 4\n\n   d6tstack\n   setup\n"
  },
  {
    "path": "docs/source/setup.rst",
    "chars": 106,
    "preview": "setup module\n============\n\n.. automodule:: setup\n    :members:\n    :undoc-members:\n    :show-inheritance:\n"
  },
  {
    "path": "docs/source/tests.rst",
    "chars": 707,
    "preview": "tests package\n=============\n\nSubmodules\n----------\n\ntests.test\\_combine module\n--------------------------\n\n.. automodule"
  },
  {
    "path": "examples-csv.ipynb",
    "chars": 48659,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Data Engineering in Python with d"
  },
  {
    "path": "examples-dask.ipynb",
    "chars": 77397,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# d6tstack with Dask\\n\",\n    \"\\n\",\n"
  },
  {
    "path": "examples-excel.ipynb",
    "chars": 38653,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Data Engineering in Python with d"
  },
  {
    "path": "examples-pyspark.ipynb",
    "chars": 47300,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# d6tstack with pyspark\\n\",\n    \"\\n"
  },
  {
    "path": "examples-read-write.ipynb",
    "chars": 76637,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"o"
  },
  {
    "path": "examples-sql.ipynb",
    "chars": 12041,
    "preview": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Data Engineering in Python with d"
  },
  {
    "path": "requirements-dev.txt",
    "chars": 101,
    "preview": "pytest\nsphinx\nsphinxcontrib-napoleon\nsphinx_rtd_theme\ndask[dataframe]\nfastparquet\npython-snappy\nxlwt\n"
  },
  {
    "path": "requirements.txt",
    "chars": 95,
    "preview": "numpy\nopenpyxl\nxlrd\npandas>=0.22.0\nsqlalchemy\nscipy\npyarrow\npsycopg2\nmysql-connector\nd6tcollect"
  },
  {
    "path": "setup.cfg",
    "chars": 39,
    "preview": "[metadata]\ndescription-file = README.md"
  },
  {
    "path": "setup.py",
    "chars": 949,
    "preview": "from setuptools import setup\n\nextras = {\n    'xls': ['openpyxl','xlrd'],\n    'parquet': ['pyarrow'],\n    'psql': ['psyco"
  },
  {
    "path": "tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/pypi.sh",
    "chars": 90,
    "preview": "# pip install setuptools wheel twine\npython setup.py sdist bdist_wheel\ntwine upload dist/*"
  },
  {
    "path": "tests/test-parquet.py",
    "chars": 774,
    "preview": "# -*- coding: utf-8 -*-\n\"\"\"\nCreated on Sun Jun 24 10:45:14 2018\n\n@author: deepmind\n\"\"\"\n\nimport pandas as pd\nimport glob\n"
  },
  {
    "path": "tests/test_combine_csv.py",
    "chars": 20933,
    "preview": "\"\"\"Run unit tests.\n\nUse this to run tests and understand how tasks.py works.\n\nSetup::\n\n    mkdir -p test-data/input\n    "
  },
  {
    "path": "tests/test_combine_old.py",
    "chars": 26679,
    "preview": "\"\"\"Run unit tests.\n\nUse this to run tests and understand how tasks.py works.\n\nExample:\n\n    Create directories::\n\n      "
  },
  {
    "path": "tests/test_sync.py",
    "chars": 2242,
    "preview": "from moto import mock_s3\nimport shutil\nimport os\nimport pytest\nfrom d6tstack.sync import FTPSync\n\ncfg_ftp_host = 'ftp.fi"
  },
  {
    "path": "tests/test_xls.py",
    "chars": 11034,
    "preview": "import pytest\nimport shutil\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport xlrd\n\nfrom d6tstack."
  },
  {
    "path": "tests/tmp-reindex-withorder.py",
    "chars": 2060,
    "preview": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\nCreated on Wed Oct  3 11:36:43 2018\n\n@author: deepmind\n\"\"\"\n\nimport pa"
  },
  {
    "path": "tests/tmp-runtest.py",
    "chars": 2052,
    "preview": "import pandas as pd\nimport numpy as np\nimport glob\nimport d6tstack.combine_csv\nimport importlib\nimportlib.reload(d6tstac"
  },
  {
    "path": "tests/tmp.py",
    "chars": 1946,
    "preview": "import importlib\nimport d6tstack.utils\nimportlib.reload(d6tstack.utils)\n\nimport time\nimport yaml\n\nconfig = yaml.load(ope"
  }
]

About this extraction

This page contains the full source code of the d6t/d6tstack GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 44 files (444.9 KB), approximately 155.3k tokens, and a symbol index with 196 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo