[
  {
    "path": ".gitignore",
    "content": "tests/.test-cred.yaml\n\n.idea/\n.env\ntemp/\nfiddle*\n.pytest_cache/\ntest-data/output/\n\n# add this manually\ntest-data/\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\n.static_storage/\n.media/\nlocal_settings.py\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# pyenv\n.python-version\n\n# celery beat schedule file\ncelerybeat-schedule\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n\n# pypi config file\n.pypirc\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2018 Databolt\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "include README.md\ninclude LICENSE"
  },
  {
    "path": "README.md",
    "content": "# Databolt File Ingest\n\nQuickly ingest raw files. Works for XLS, CSV, TXT which can be exported to CSV, Parquet, SQL and Pandas. `d6tstack` solves many performance and schema problems typically encountered when ingesting raw files. \n\n![](https://www.databolt.tech/images/combiner-landing-git.png)\n\n### Features include\n\n* Fast pd.to_sql() for postgres and mysql\n* Quickly check columns for consistency across files\n* Fix added/missing columns\n* Fix renamed columns\n* Check Excel tabs for consistency across files\n* Excel to CSV converter (incl multi-sheet support)\n* Out of core functionality to process large files\n* Export to CSV, parquet, SQL, pandas dataframe\n\n## Installation\n\nLatest published version `pip install d6tstack`. Additional requirements:\n* `d6tstack[psql]`: for pandas to postgres\n* `d6tstack[mysql]`: for pandas to mysql\n* `d6tstack[xls]`: for excel support\n* `d6tstack[parquet]`: for ingest csv to parquet\n\nLatest dev version from github `pip install git+https://github.com/d6t/d6tstack.git`  \n\n### Sample Use\n\n```\n\nimport d6tstack\n\n# fast CSV to SQL import - see SQL examples notebook\nd6tstack.utils.pd_to_psql(df, 'postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')\nd6tstack.utils.pd_to_mysql(df, 'mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')\nd6tstack.utils.pd_to_mssql(df, 'mssql+pymssql://usr:pwd@localhost/db', 'tablename') # experimental\n\n# ingest mutiple CSVs which may have data schema changes - see CSV examples notebook\n\nimport glob\n>>> c = d6tstack.combine_csv.CombinerCSV(glob.glob('data/*.csv'))\n\n# show columns of each file\n>>> c.columns()\n\n# quick check if all files have consistent columns\n>>> c.is_all_equal()\nFalse\n\n# show which files have missing columns\n>>> c.is_column_present()\n   filename  cost  date profit profit2 sales\n0  feb.csv  True  True   True   False  True\n2  mar.csv  True  True   True    True  True\n\n>>> c.combine_preview() # keep all columns\n   
filename  cost        date profit profit2 sales\n0   jan.csv  -80  2011-01-01     20     NaN   100\n0   mar.csv  -100  2011-03-01    200     400   300\n\n>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_select_common=True).combine_preview() # keep common columns\n   filename  cost        date profit sales\n0   jan.csv  -80  2011-01-01     20   100\n0   mar.csv  -100  2011-03-01    200   300\n\n>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_rename={'sales':'revenue'}).combine_preview()\n   filename  cost        date profit profit2 revenue\n0   jan.csv  -80  2011-01-01     20     NaN   100\n0   mar.csv  -100  2011-03-01    200     400   300\n\n# to come: check if columns match database\n>>> c.is_columns_match_db('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')\n\n# create csv with first nrows_preview rows of each file\n>>> c.to_csv_head()\n\n# export to csv, parquet, sql. Out of core with optimized fast imports for postgres and mysql\n>>> c.to_pandas()\n>>> c.to_csv_align(output_dir='process/')\n>>> c.to_parquet_align(output_dir='process/')\n>>> c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')\n>>> c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast, using COPY FROM\n>>> c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast, using LOAD DATA LOCAL INFILE\n\n# read Excel files - see Excel examples notebook for more details\nimport d6tstack.convert_xls\n\nd6tstack.convert_xls.read_excel_advanced('test.xls',\n    sheet_name='Sheet1', header_xls_range=\"B2:E2\")\n\nd6tstack.convert_xls.XLStoCSVMultiSheet('test.xls').convert_all(header_xls_range=\"B2:E2\")\n\nd6tstack.convert_xls.XLStoCSVMultiFile(glob.glob('*.xls'), \n    cfg_xls_sheets_sel_mode='name_global',cfg_xls_sheets_sel='Sheet1')\n    .convert_all(header_xls_range=\"B2:E2\")\n\n```\n\n\n## Documentation\n\n*  [SQL examples 
notebook](https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb) - Fast loading of CSV to SQL with pandas preprocessing\n*  [CSV examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-csv.ipynb) - Quickly load any type of CSV files\n*  [Excel examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-excel.ipynb) - Quickly extract from Excel to CSV\n*  [Dask examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-dask.ipynb) - How to use d6tstack to solve Dask input file problems\n*  [Pyspark examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-pyspark.ipynb) - How to use d6tstack to solve pyspark input file problems\n*  [Function reference docs](http://d6tstack.readthedocs.io/en/latest/py-modindex.html) - Detailed documentation for modules, classes, functions\n\n## Faster Data Engineering\n\nCheck out other d6t libraries to solve common data engineering problems, including:\n* data workflows: build highly effective data science workflows\n* fuzzy joins: quickly join data\n* data pipes: quickly share and distribute data\n\nhttps://github.com/d6t/d6t-python\n\nWe also encourage you to follow the Databolt blog for updates and tips+tricks: http://blog.databolt.tech\n\n## Collecting Error Messages and Usage Statistics\n\nWe have put a lot of effort into making this library useful to you. To help us make it even better, the library collects ANONYMOUS error messages and usage statistics. See [d6tcollect](https://github.com/d6t/d6tcollect) for details, including how to disable collection. Collection is asynchronous and doesn't impact your code in any way.\n\nIt may not catch all errors, so if you run into any problems or have any questions, please raise an issue on github.\n
  },
  {
    "path": "d6tstack/__init__.py",
    "content": "import d6tstack.combine_csv\n#import d6tstack.convert_xls\nimport d6tstack.sniffer\n#import d6tstack.sync\nimport d6tstack.utils\n"
  },
  {
    "path": "d6tstack/combine_csv.py",
    "content": "import numpy as np\nimport pandas as pd\npd.set_option('display.expand_frame_repr', False)\nfrom scipy.stats import mode\nimport warnings\nimport ntpath, pathlib\nimport copy\nimport itertools\nimport os\n\nimport d6tcollect\n# d6tcollect.init(__name__)\n\nfrom .helpers import *\nfrom .utils import PrintLogger\n\n\n# ******************************************************************\n# helpers\n# ******************************************************************\ndef _dfconact(df):\n    return pd.concat(itertools.chain.from_iterable(df), sort=False, copy=False, join='inner', ignore_index=True)\n\ndef _direxists(fname, logger):\n    fdir = os.path.dirname(fname)\n    if fdir and not os.path.exists(fdir):\n        if logger:\n            logger.send_log('creating ' + fdir, 'ok')\n        os.makedirs(fdir)\n    return True\n\n# ******************************************************************\n# combiner\n# ******************************************************************\n\nclass CombinerCSV(object, metaclass=d6tcollect.Collect):\n    \"\"\"    \n    Core combiner class. Sniffs columns, generates preview, combines aka stacks to various output formats.\n\n    Args:\n        fname_list (list): file names, eg ['a.csv','b.csv']\n        sep (string): CSV delimiter, see pandas.read_csv()\n        has_header (boolean): data has header row \n        nrows_preview (int): number of rows in preview\n        chunksize (int): number of rows to read into memory while processing, see pandas.read_csv()\n        read_csv_params (dict): additional parameters to pass to pandas.read_csv()\n        columns_select (list): list of column names to keep\n        columns_select_common (bool): keep only common columns. Use this instead of `columns_select`\n        columns_rename (dict): dict of columns to rename `{'name_old':'name_new'}\n        add_filename (bool): add filename column to output data frame. 
If `False`, will not add column.\n        apply_after_read (function): function to apply after reading each file. needs to return a dataframe\n        log (bool): send logs to logger\n        logger (object): logger object with `send_log()`\n\n    \"\"\"\n\n    def __init__(self, fname_list, sep=',', nrows_preview=3, chunksize=1e6, read_csv_params=None,\n                 columns_select=None, columns_select_common=False, columns_rename=None, add_filename=True,\n                 apply_after_read=None, log=True, logger=None):\n        if not fname_list:\n            raise ValueError(\"Filename list should not be empty\")\n        self.fname_list = np.sort(fname_list)\n        self.nrows_preview = nrows_preview\n        self.read_csv_params = read_csv_params\n        if not self.read_csv_params:\n            self.read_csv_params = {}\n        if not 'sep' in self.read_csv_params:\n            self.read_csv_params['sep'] = sep\n        if not 'chunksize' in self.read_csv_params:\n            self.read_csv_params['chunksize'] = chunksize\n        self.logger = logger\n        if not logger and log:\n            self.logger = PrintLogger()\n        if not log:\n            self.logger = None\n        self.sniff_results = None\n        self.add_filename = add_filename\n        self.columns_select = columns_select\n        self.columns_select_common = columns_select_common\n        if columns_select and columns_select_common:\n            warnings.warn('columns_select will override columns_select_common, pick either one')\n        self.columns_rename = columns_rename\n        self._columns_reindex = None\n        self._columns_rename_dict = None\n        self.apply_after_read = apply_after_read\n\n        self.df_combine_preview = None\n\n        if self.columns_select:\n            if max(collections.Counter(columns_select).values())>1:\n                raise ValueError('Duplicate entries in columns_select')\n\n    def _read_csv_yield(self, fname, read_csv_params):\n       
 self._columns_reindex_available()\n        dfs = pd.read_csv(fname, **read_csv_params)\n        for dfc in dfs:\n            if self.columns_rename and self._columns_rename_dict[fname]:\n                dfc = dfc.rename(columns=self._columns_rename_dict[fname])\n\n            dfc = dfc.reindex(columns=self._columns_reindex)\n            if self.apply_after_read:\n                dfc = self.apply_after_read(dfc)\n            if self.add_filename:\n                dfc['filepath'] = fname\n                dfc['filename'] = ntpath.basename(fname)\n            yield dfc\n\n    def sniff_columns(self):\n\n        \"\"\"\n        Checks column consistency by reading the top nrows of all files. It checks both presence and order of columns across all files\n\n        Returns:\n            dict: results dictionary with\n                files_columns (dict): dictionary with information, keys = filename, value = list of columns in file\n                columns_all (list): all columns in files\n                columns_common (list): only columns present in every file\n                is_all_equal (boolean): are columns equal across all files?\n                df_columns_present (dataframe): which columns are present in which file?\n                df_columns_order (dataframe): where in the file is the column?\n\n        \"\"\"\n\n        if self.logger:\n            self.logger.send_log('sniffing columns', 'ok')\n\n        read_csv_params = copy.deepcopy(self.read_csv_params)\n        read_csv_params['dtype'] = str\n        read_csv_params['nrows'] = self.nrows_preview\n        read_csv_params['chunksize'] = None\n\n        # read nrows of every file\n        self.dfl_all = []\n        for fname in self.fname_list:\n            # todo: make sure no nrows param in self.read_csv_params\n            df = pd.read_csv(fname, **read_csv_params)\n            self.dfl_all.append(df)\n\n        # process columns\n        dfl_all_col = [df.columns.tolist() for df in self.dfl_all]\n        
col_files = dict(zip(self.fname_list, dfl_all_col))\n        col_common = list_common(list(col_files.values()))\n        col_all = list_unique(list(col_files.values()))\n\n        # check which columns are present in each file\n        df_col_present = {}\n        for iFileName, iFileCol in col_files.items():\n            df_col_present[iFileName] = [iCol in iFileCol for iCol in col_all]\n\n        df_col_present = pd.DataFrame(df_col_present, index=col_all).T\n        df_col_present.index.names = ['file_path']\n\n        # find index in column list so can check order is correct\n        df_col_idx = {}\n        for iFileName, iFileCol in col_files.items():\n            df_col_idx[iFileName] = [iFileCol.index(iCol) if iCol in iFileCol else np.nan for iCol in col_all]\n        df_col_idx = pd.DataFrame(df_col_idx, index=col_all).T\n\n        # order columns by where they appear in file\n        m = mode(df_col_idx, axis=0)\n        df_col_pos = pd.DataFrame({'o': m[0][0], 'c': m[1][0]}, index=df_col_idx.columns)\n        df_col_pos = df_col_pos.sort_values(['o', 'c'])\n        df_col_pos['iscommon'] = df_col_pos.index.isin(col_common)\n\n        # reorder by position\n        col_all = df_col_pos.index.values.tolist()\n        col_common = df_col_pos[df_col_pos['iscommon']].index.values.tolist()\n        col_unique = df_col_pos[~df_col_pos['iscommon']].index.values.tolist()\n        df_col_present = df_col_present[col_all]\n        df_col_idx = df_col_idx[col_all]\n\n        sniff_results = {'files_columns': col_files, 'columns_all': col_all, 'columns_common': col_common,\n                       'columns_unique': col_unique, 'is_all_equal': columns_all_equal(dfl_all_col),\n                       'df_columns_present': df_col_present, 'df_columns_order': df_col_idx}\n        self.sniff_results = sniff_results\n\n        return sniff_results\n\n    def get_sniff_results(self):\n        if not self.sniff_results:\n            self.sniff_columns()\n        return 
self.sniff_results\n\n    def _sniff_available(self):\n        if not self.sniff_results:\n            self.sniff_columns()\n\n    def is_all_equal(self):\n        \"\"\"\n        Checks if all columns are equal in all files\n\n        Returns:\n             bool: all columns are equal in all files?\n        \"\"\"\n        self._sniff_available()\n        return self.sniff_results['is_all_equal']\n\n    def is_column_present(self):\n        \"\"\"\n        Shows which columns are present in which files\n\n        Returns:\n             dataframe: boolean values for column presence in each file\n        \"\"\"\n        self._sniff_available()\n        return self.sniff_results['df_columns_present']\n\n    def is_column_present_unique(self):\n        \"\"\"\n        Shows unique columns by file\n\n        Returns:\n             dataframe: boolean values for column presence in each file\n        \"\"\"\n        self._sniff_available()\n        return self.is_column_present()[self.sniff_results['columns_unique']]\n\n    def columns_unique(self):\n        \"\"\"\n        Shows unique columns by file\n\n        Returns:\n             dataframe: boolean values for column presence in each file\n        \"\"\"\n        return self.is_column_present_unique()\n\n    def is_column_present_common(self):\n        \"\"\"\n        Shows common columns by file\n\n        Returns:\n             dataframe: boolean values for column presence in each file\n        \"\"\"\n        self._sniff_available()\n        return self.is_column_present()[self.sniff_results['columns_common']]\n\n    def columns_common(self):\n        \"\"\"\n        Shows common columns by file\n\n        Returns:\n             dataframe: boolean values for column presence in each file\n        \"\"\"\n        return self.is_column_present_common()\n\n    def columns(self):\n        \"\"\"\n        Shows columns by file\n\n        Returns:\n             dict: filename, columns\n        \"\"\"\n        
self._sniff_available()\n        return self.sniff_results['files_columns']\n\n    def head(self):\n        \"\"\"\n        Shows preview rows for each file\n\n        Returns:\n             dict: filename, dataframe\n        \"\"\"\n        self._sniff_available()\n        return dict(zip(self.fname_list,self.dfl_all))\n\n    def _columns_reindex_prep(self):\n\n        self._sniff_available()\n        self._columns_select_dict = {} # select columns by filename\n        self._columns_rename_dict = {} # rename columns by filename\n\n        for fname in self.fname_list:\n            if self.columns_rename:\n                columns_rename = self.columns_rename.copy()\n                # check no naming conflicts\n                columns_select2 = [columns_rename[k] if k in columns_rename.keys() else k for k in self.sniff_results['files_columns'][fname]]\n                df_rename_count = collections.Counter(columns_select2)\n                if df_rename_count and max(df_rename_count.values()) > 1:  # would the rename create naming conflict?\n                    warnings.warn('Renaming conflict: {}'.format([(k, v) for k, v in df_rename_count.items() if v > 1]),\n                                  UserWarning)\n                    while df_rename_count and max(df_rename_count.values()) > 1:\n                        # drop the rename entries whose target name causes the conflict\n                        conflicting_keys = [i for i, j in df_rename_count.items() if j > 1]\n                        columns_rename = {k: v for k, v in columns_rename.items() if v not in conflicting_keys}\n                        columns_select2 = [columns_rename[k] if k in columns_rename.keys() else k for k in\n                                           self.sniff_results['files_columns'][fname]]\n                        df_rename_count = collections.Counter(columns_select2)\n\n                # store rename by file. 
keep only renames for columns actually present in file\n                self._columns_rename_dict[fname] = dict((k,v) for k,v in columns_rename.items() if k in self.sniff_results['files_columns'][fname])\n\n        if self.columns_select:\n            columns_select2 = self.columns_select.copy()\n        else:\n            if self.columns_select_common:\n                columns_select2 = self.sniff_results['columns_common'].copy()\n            else:\n                columns_select2 = self.sniff_results['columns_all'].copy()\n\n        if self.columns_rename:\n            columns_select2 = list(dict.fromkeys([self.columns_rename[k] if k in self.columns_rename else k for k in columns_select2]))  # set of columns after rename\n        # store select by file\n        self._columns_reindex = columns_select2\n\n    def _columns_reindex_available(self):\n        if not self._columns_rename_dict or not self._columns_reindex:\n            self._columns_reindex_prep()\n\n    def preview_rename(self):\n        \"\"\"\n        Shows which columns will be renamed in processing\n\n        Returns:\n            dataframe: columns to be renamed from which file\n        \"\"\"\n        self._columns_reindex_available()\n        df = pd.DataFrame(self._columns_rename_dict).T\n        return df\n\n    def preview_select(self):\n        \"\"\"\n        Shows which columns will be selected in processing\n\n        Returns:\n            list: columns to be selected from all files\n        \"\"\"\n        self._columns_reindex_available()\n        return self._columns_reindex\n\n    def combine_preview(self):\n        \"\"\"\n        Preview of what the combined data will look like\n\n        Returns:\n            dataframe: combined dataframe\n        \"\"\"\n        read_csv_params = copy.deepcopy(self.read_csv_params)\n        read_csv_params['nrows'] = self.nrows_preview\n\n        df = [[dfc for dfc in self._read_csv_yield(fname, read_csv_params)] for fname in self.fname_list]\n   
     df = _dfconact(df)\n        self.df_combine_preview = df.copy()\n        return df\n\n    def _combine_preview_available(self):\n        if self.df_combine_preview is None:\n            self.combine_preview()\n\n    def to_pandas(self):\n        \"\"\"\n        Combine all files to a pandas dataframe\n\n        Returns:\n            dataframe: combined data\n        \"\"\"\n        df = [[dfc for dfc in self._read_csv_yield(fname, self.read_csv_params)] for fname in self.fname_list]\n        df = _dfconact(df)\n        return df\n\n    def _get_filepath_out(self, fname, output_dir, output_prefix, ext):\n        # filename\n        fname_out = ntpath.basename(fname)\n        fname_out = os.path.splitext(fname_out)[0]\n        fname_out = output_prefix + fname_out + ext\n\n        # path\n        output_dir = output_dir if output_dir else os.path.dirname(fname)\n        fpath_out = os.path.join(output_dir, fname_out)\n        assert _direxists(fpath_out, self.logger)\n        return fpath_out\n\n    def _to_csv_prep(self, write_params):\n        if 'index' not in write_params:\n            write_params['index'] = False\n        write_params.pop('header', None) # library handles\n\n        self._combine_preview_available()\n\n        return write_params\n\n    def to_csv_head(self, output_dir=None, write_params={}):\n        \"\"\"\n        Save `nrows_preview` header rows as individual files\n\n        Args:\n            output_dir (str): directory to save files in. 
If not given save in the same directory as the original file\n            write_params (dict): additional params to pass to `pandas.to_csv()`\n\n        Returns:\n            list: list of filenames of processed files\n        \"\"\"\n\n        write_params = self._to_csv_prep(write_params)\n\n        fnamesout = []\n        for fname, dfg in dict(zip(self.fname_list,self.dfl_all)).items():\n            filename = f'{fname}-head.csv'\n            filename = filename if output_dir is None else str(pathlib.Path(output_dir)/filename)\n            dfg.to_csv(filename, **write_params)\n            fnamesout.append(filename)\n\n        return fnamesout\n\n    def to_csv_align(self, output_dir=None, output_prefix='d6tstack-', write_params={}):\n        \"\"\"\n        Create cleaned versions of original files. Automatically runs out of core, using `self.chunksize`.\n\n        Args:\n            output_dir (str): directory to save files in. If not given save in the same directory as the original file\n            output_prefix (str): prepend with prefix to distinguish from original files\n            write_params (dict): additional params to pass to `pandas.to_csv()`\n\n        Returns:\n            list: list of filenames of processed files\n        \"\"\"\n        # stream all chunks to multiple files\n\n        write_params = self._to_csv_prep(write_params)\n\n        fnamesout = []\n        for fname in self.fname_list:\n            filename = self._get_filepath_out(fname, output_dir, output_prefix, '.csv')\n            if self.logger:\n                self.logger.send_log('writing '+filename , 'ok')\n            fhandle = open(filename, 'w')\n            self.df_combine_preview[:0].to_csv(fhandle, **write_params)\n            for dfc in self._read_csv_yield(fname, self.read_csv_params):\n                dfc.to_csv(fhandle, header=False, **write_params)\n            fhandle.close()\n            fnamesout.append(filename)\n\n        return fnamesout\n\n    def 
to_csv_combine(self, filename, write_params={}):\n        \"\"\"\n        Combines all files to a single csv file. Automatically runs out of core, using `self.chunksize`.\n\n        Args:\n            filename (str): output file name\n            write_params (dict): additional params to pass to `pandas.to_csv()`\n\n        Returns:\n            str: filename for combined data\n        \"\"\"\n        # stream all chunks from all files to a single file\n        write_params = self._to_csv_prep(write_params)\n\n        assert _direxists(filename, self.logger)\n        fhandle = open(filename, 'w')\n        self.df_combine_preview[:0].to_csv(fhandle, **write_params)\n        for fname in self.fname_list:\n            for dfc in self._read_csv_yield(fname, self.read_csv_params):\n                dfc.to_csv(fhandle, header=False, **write_params)\n        fhandle.close()\n        return filename\n\n    def to_parquet_align(self, output_dir=None, output_prefix='d6tstack-', write_params={}):\n        \"\"\"\n        Same as `to_csv_align` but outputs parquet files\n\n        \"\"\"\n        # write_params for pyarrow.parquet.write_table\n\n        # stream all chunks to multiple files\n        self._combine_preview_available()\n\n        import pyarrow as pa\n        import pyarrow.parquet as pq\n\n        fnamesout = []\n        pqschema = pa.Table.from_pandas(self.df_combine_preview).schema\n        for fname in self.fname_list:\n            filename = self._get_filepath_out(fname, output_dir, output_prefix, '.pq')\n            if self.logger:\n                self.logger.send_log('writing ' + filename, 'ok')\n            pqwriter = pq.ParquetWriter(filename, pqschema)\n            for dfc in self._read_csv_yield(fname, self.read_csv_params):\n                pqwriter.write_table(pa.Table.from_pandas(dfc.astype(self.df_combine_preview.dtypes), schema=pqschema),**write_params)\n            pqwriter.close()\n            fnamesout.append(filename)\n\n        return fnamesout\n\n  
  def to_parquet_combine(self, filename, write_params={}):\n        \"\"\"\n        Same as `to_csv_combine` but outputs parquet files\n\n        \"\"\"\n        # stream all chunks from all files to a single file\n        self._combine_preview_available()\n\n        assert _direxists(filename, self.logger)\n        import pyarrow as pa\n        import pyarrow.parquet as pq\n\n        # todo: fix mixed data type writing. at least give a warning\n        pqwriter = pq.ParquetWriter(filename, pa.Table.from_pandas(self.df_combine_preview).schema)\n        for fname in self.fname_list:\n            for dfc in self._read_csv_yield(fname, self.read_csv_params):\n                pqwriter.write_table(pa.Table.from_pandas(dfc.astype(self.df_combine_preview.dtypes)),**write_params)\n        pqwriter.close()\n        return filename\n\n    def to_sql_combine(self, uri, tablename, if_exists='fail', write_params=None, return_create_sql=False):\n        \"\"\"\n        Load all files into a sql table using sqlalchemy. Generic but slower than the optimized functions\n\n        Args:\n            uri (str): sqlalchemy database uri\n            tablename (str): table to store data in\n            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details\n            write_params (dict): additional params to pass to `pandas.to_sql()`\n            return_create_sql (bool): show create sql statement for combined file schema. 
Doesn't run data load\n\n        Returns:\n            bool: True if loader finished\n\n        \"\"\"\n        if not write_params:\n            write_params = {}\n        if 'if_exists' not in write_params:\n            write_params['if_exists'] = if_exists\n        if 'index' not in write_params:\n            write_params['index'] = False\n        self._combine_preview_available()\n\n        if 'mysql' in uri and not 'mysql+pymysql' in uri:\n            raise ValueError('need to use pymysql for mysql (pip install pymysql)')\n\n        import sqlalchemy\n\n        sql_engine = sqlalchemy.create_engine(uri)\n\n        # create table\n        dfhead = self.df_combine_preview.astype(self.df_combine_preview.dtypes)[:0]\n\n        if return_create_sql:\n            return pd.io.sql.get_schema(dfhead, tablename).replace('\"',\"`\")\n\n        dfhead.to_sql(tablename, sql_engine, **write_params)\n\n        # append data\n        write_params['if_exists'] = 'append'\n        for fname in self.fname_list:\n            for dfc in self._read_csv_yield(fname, self.read_csv_params):\n                dfc.astype(self.df_combine_preview.dtypes).to_sql(tablename, sql_engine, **write_params)\n\n        return True\n\n    def to_psql_combine(self, uri, table_name, if_exists='fail', sep=','):\n        \"\"\"\n        Load all files into a sql table using native postgres COPY FROM. Chunks data load to reduce memory consumption\n\n        Args:\n            uri (str): postgres psycopg2 sqlalchemy database uri\n            table_name (str): table to store data in\n            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. 
See `pandas.to_sql()` for details\n            sep (str): separator for temp file, eg ',' or '\\t'\n\n        Returns:\n            bool: True if loader finished\n\n        \"\"\"\n\n        if 'psycopg2' not in uri:\n            raise ValueError('need to use psycopg2 uri')\n\n        self._combine_preview_available()\n\n        import sqlalchemy\n        import io\n\n        sql_engine = sqlalchemy.create_engine(uri)\n        sql_cnxn = sql_engine.raw_connection()\n        cursor = sql_cnxn.cursor()\n\n        self.df_combine_preview[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)\n\n        for fname in self.fname_list:\n            for dfc in self._read_csv_yield(fname, self.read_csv_params):\n                fbuf = io.StringIO()\n                dfc.astype(self.df_combine_preview.dtypes).to_csv(fbuf, index=False, header=False, sep=sep)\n                fbuf.seek(0)\n                cursor.copy_from(fbuf, table_name, sep=sep, null='')\n        sql_cnxn.commit()\n        cursor.close()\n\n        return True\n\n    def to_mysql_combine(self, uri, table_name, if_exists='fail', tmpfile='mysql.csv', sep=','):\n        \"\"\"\n        Load all files into a sql table using native MySQL LOAD DATA LOCAL INFILE. Chunks data load to reduce memory consumption\n\n        Args:\n            uri (str): mysql mysqlconnector sqlalchemy database uri\n            table_name (str): table to store data in\n            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. 
See `pandas.to_sql()` for details\n            tmpfile (str): filename for temporary file to load from\n            sep (str): separator for temp file, eg ',' or '\\t'\n\n        Returns:\n            bool: True if loader finished\n\n        \"\"\"\n        if not 'mysql+mysqlconnector' in uri:\n            raise ValueError('need to use mysql+mysqlconnector uri (pip install mysql-connector)')\n\n        self._combine_preview_available()\n\n        import sqlalchemy\n\n        sql_engine = sqlalchemy.create_engine(uri)\n\n        self.df_combine_preview[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)\n\n        if self.logger:\n            self.logger.send_log('creating ' + tmpfile, 'ok')\n        self.to_csv_combine(tmpfile, write_params={'na_rep':'\\\\N','sep':sep})\n        if self.logger:\n            self.logger.send_log('loading ' + tmpfile, 'ok')\n        sql_load = \"LOAD DATA LOCAL INFILE '{}' INTO TABLE {} FIELDS TERMINATED BY '{}' IGNORE 1 LINES;\".format(tmpfile, table_name, sep)\n        sql_engine.execute(sql_load)\n\n        os.remove(tmpfile)\n\n        return True\n\n    def to_mssql_combine(self, uri, table_name, schema_name=None, if_exists='fail', tmpfile='mysql.csv'):\n        \"\"\"\n        Load all files into a sql table using native postgres LOAD DATA LOCAL INFILE. Chunks data load to reduce memory consumption\n\n        Args:\n            uri (str): mysql mysqlconnector sqlalchemy database uri\n            table_name (str): table to store data in\n            schema_name (str): name of schema to write to\n            if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. 
See `pandas.to_sql()` for details\n            tmpfile (str): filename for temporary file to load from\n\n        Returns:\n            bool: True if loader finished\n\n        \"\"\"\n        if not 'mssql+pymssql' in uri:\n            raise ValueError('need to use mssql+pymssql uri (conda install -c prometeia pymssql)')\n\n        self._combine_preview_available()\n\n        import sqlalchemy\n\n        sql_engine = sqlalchemy.create_engine(uri)\n\n        self.df_combine_preview[:0].to_sql(table_name, sql_engine, schema=schema_name, if_exists=if_exists, index=False)\n\n        if self.logger:\n            self.logger.send_log('creating ' + tmpfile, 'ok')\n        self.to_csv_combine(tmpfile, write_params={'na_rep':'\\\\N'})\n        if self.logger:\n            self.logger.send_log('loading ' + tmpfile, 'ok')\n        if schema_name is not None:\n            table_name = '{}.{}'.format(schema_name,table_name)\n        sql_load = \"BULK INSERT {} FROM '{}';\".format()(table_name, tmpfile)\n        sql_engine.execute(sql_load)\n\n        os.remove(tmpfile)\n\n        return True\n\n# todo: ever need to rerun _available fct instead of using cache?\n"
  },
  {
    "path": "d6tstack/convert_xls.py",
    "content": "import warnings\nimport os.path\n\nimport numpy as np\nimport pandas as pd\n\nimport ntpath\n\nimport openpyxl\nimport xlrd\n\ntry:\n    from openpyxl.utils.cell import coordinate_from_string\nexcept:\n    from openpyxl.utils import coordinate_from_string\nfrom d6tstack.helpers import compare_pandas_versions, check_valid_xls\n\nimport d6tcollect\n# d6tcollect.init(__name__)\n\n#******************************************************************\n# read_excel_advanced\n#******************************************************************\ndef read_excel_advanced(fname, remove_blank_cols=True, remove_blank_rows=True, collapse_header=True,\n                        header_xls_range=None, header_xls_start=None, header_xls_end=None,\n                        is_preview=False, nrows_preview=3, **kwds):\n    \"\"\"\n    Read Excel files to pandas dataframe with advanced options like set header ranges and remove blank columns and rows\n\n    Args:\n        fname (str): Excel file path\n        remove_blank_cols (bool): remove blank columns\n        remove_blank_rows (bool): remove blank rows\n        collapse_header (bool): to convert multiline header to a single line string\n        header_xls_range (string): range of headers in excel, eg: A4:B16\n        header_xls_start (string): Starting cell of excel for header range, eg: A4\n        header_xls_end (string): End cell of excel for header range, eg: B16\n        is_preview (bool): Read only first `nrows_preview` lines\n        nrows_preview (integer): Initial number of rows to be used for preview columns (default: 3)\n        kwds (mixed): parameters for `pandas.read_excel()` to pass through\n\n    Returns:\n         df (dataframe): pandas dataframe\n\n    Note:\n        You can pass in any `pandas.read_excel()` parameters in particular `sheet_name`\n\n    \"\"\"\n    header = []\n\n    if header_xls_range:\n        if not (header_xls_start and header_xls_end):\n            header_xls_range = 
header_xls_range.split(':')\n            header_xls_start, header_xls_end = header_xls_range\n        else:\n            raise ValueError('Parameter conflict. Can only pass header_xls_range or header_xls_start with header_xls_end')\n\n    if header_xls_start and header_xls_end:\n        if 'skiprows' in kwds or 'usecols' in kwds:\n            raise ValueError('Parameter conflict. Cannot pass skiprows or usecols with header_xls')\n\n        scol, srow = coordinate_from_string(header_xls_start)\n        ecol, erow = coordinate_from_string(header_xls_end)\n\n        # header, skiprows, usecols\n        header = list(range(erow - srow + 1))\n        usecols = scol + \":\" + ecol\n        skiprows = srow - 1\n\n        if compare_pandas_versions(pd.__version__, \"0.20.3\") > 0:\n            df = pd.read_excel(fname, header=header, skiprows=skiprows, usecols=usecols, **kwds)\n        else:\n            df = pd.read_excel(fname, header=header, skiprows=skiprows, parse_cols=usecols, **kwds)\n    else:\n        df = pd.read_excel(fname, **kwds)\n\n    # remove blank cols and rows\n    if remove_blank_cols:\n        df = df.dropna(axis='columns', how='all')\n    if remove_blank_rows:\n        df = df.dropna(axis='rows', how='all')\n\n    # todo: add df.reset_index() once no actual data in index\n\n    # clean up header\n    if collapse_header:\n        if len(header) > 1:\n            df.columns = [' '.join([s for s in col if not 'Unnamed' in s]).strip().replace(\"\\n\", ' ')\n                          for col in df.columns.values]\n            df = df.reset_index()\n        else:\n            df.rename(columns=lambda x: x.strip().replace(\"\\n\", ' '), inplace=True)\n\n    # preview\n    if is_preview:\n        df = df.head(nrows_preview)\n\n    return df\n\n\n#******************************************************************\n# XLSSniffer\n#******************************************************************\n\nclass XLSSniffer(object, metaclass=d6tcollect.Collect):\n    
\"\"\"\n\n    Extracts available sheets from MULTIPLE Excel files and runs diagnostics\n\n    Args:\n        fname_list (list): file paths, eg ['dir/a.csv','dir/b.csv']\n        logger (object): logger object with send_log(), optional\n\n    \"\"\"\n\n    def __init__(self, fname_list, logger=None):\n        if not fname_list:\n            raise ValueError(\"Filename list should not be empty\")\n        self.fname_list = fname_list\n        self.logger = logger\n        check_valid_xls(self.fname_list)\n        self.sniff()\n\n    def sniff(self):\n        \"\"\"\n\n        Executes sniffer\n\n        Returns:\n            boolean: True if everything ok. Results are accessible in ``.df_xls_sheets``\n\n        \"\"\"\n\n        xls_sheets = {}\n        for fname in self.fname_list:\n            if self.logger:\n                self.logger.send_log('sniffing sheets in '+ntpath.basename(fname),'ok')\n\n            xls_fname = {}\n            xls_fname['file_name'] = ntpath.basename(fname)\n            if fname[-5:]=='.xlsx':\n                fh = openpyxl.load_workbook(fname,read_only=True)\n                xls_fname['sheets_names'] = fh.sheetnames\n                fh.close()\n                # todo: need to close file?\n            elif fname[-4:]=='.xls':\n                fh = xlrd.open_workbook(fname, on_demand=True)\n                xls_fname['sheets_names'] = fh.sheet_names()\n                fh.release_resources()\n            else:\n                raise IOError('Only .xls or .xlsx files can be combined')\n\n            xls_fname['sheets_count'] = len(xls_fname['sheets_names'])\n            xls_fname['sheets_idx'] = np.arange(xls_fname['sheets_count']).tolist()\n            xls_sheets[fname] = xls_fname\n\n            self.xls_sheets = xls_sheets\n\n        df_xls_sheets = pd.DataFrame(xls_sheets).T\n        df_xls_sheets.index.names = ['file_path']\n\n        self.dict_xls_sheets = xls_sheets\n        self.df_xls_sheets = df_xls_sheets\n\n        return 
True\n\n    def all_contain_sheetname(self,sheet_name):\n        \"\"\"\n        Check if all files contain a certain sheet\n\n        Args:\n            sheet_name (string): sheetname to check\n\n        Returns:\n            boolean: If true\n\n        \"\"\"\n        return np.all([sheet_name in self.dict_xls_sheets[fname]['sheets_names'] for fname in self.fname_list])\n\n    def all_have_idx(self,sheet_idx):\n        \"\"\"\n        Check if all files contain a certain index\n\n        Args:\n            sheet_idx (string): index to check\n\n        Returns:\n            boolean: If true\n\n        \"\"\"\n        return np.all([sheet_idx<=(d['sheets_count']-1) for k,d in self.dict_xls_sheets.items()])\n\n    def all_same_count(self):\n        \"\"\"\n        Check if all files contain the same number of sheets\n\n        Args:\n            sheet_idx (string): index to check\n\n        Returns:\n            boolean: If true\n\n        \"\"\"\n        first_elem = next(iter(self.dict_xls_sheets.values()))\n        return np.all([first_elem['sheets_count']==d['sheets_count'] for k,d in self.dict_xls_sheets.items()])\n\n    def all_same_names(self):\n        first_elem = next(iter(self.dict_xls_sheets.values()))\n        return np.all([first_elem['sheets_names']==d['sheets_names'] for k,d in self.dict_xls_sheets.items()])\n\n\n\n#******************************************************************\n# convertor\n#******************************************************************\nclass XLStoBase(object, metaclass=d6tcollect.Collect):\n    def __init__(self, if_exists='skip', output_dir=None, logger=None):\n        \"\"\"\n\n        Base class for converting Excel files\n\n        Args:\n            if_exists (str): Possible values: skip and replace, default: skip, optional\n            output_dir (str): If present, file is saved in given directory, optional\n            logger (object): logger object with send_log('msg','status'), optional\n\n        \"\"\"\n\n       
 if if_exists not in ['skip', 'replace']:\n            raise ValueError(\"Possible value of 'if_exists' are 'skip' and 'replace'\")\n        self.logger = logger\n        self.if_exists = if_exists\n        self.output_dir = output_dir\n        if self.output_dir:\n            if not os.path.exists(self.output_dir):\n                os.makedirs(self.output_dir)\n\n    def _get_output_filename(self, fname):\n        if self.output_dir:\n            basename = os.path.basename(fname)\n            fname_out = os.path.join(self.output_dir, basename)\n        else:\n            fname_out = fname\n        is_skip = (self.if_exists == 'skip' and os.path.isfile(fname_out))\n        return fname_out, is_skip\n\n    def convert_single(self, fname, sheet_name, **kwds):\n        \"\"\"\n\n        Converts single file\n\n        Args:\n            fname: path to file\n            sheet_name (str): optional sheet_name to override global `cfg_xls_sheets_sel`\n            Same as `d6tstack.utils.read_excel_advanced()`\n\n        Returns:\n            list: output file names\n\n        \"\"\"\n\n        if self.logger:\n            msg = 'converting file: '+ntpath.basename(fname)+' | sheet: '\n            if hasattr(self, 'cfg_xls_sheets_sel'):\n                msg += str(self.cfg_xls_sheets_sel[fname])\n            self.logger.send_log(msg,'ok')\n\n        fname_out = fname + '-' + str(sheet_name) + '.csv'\n        fname_out, is_skip = self._get_output_filename(fname_out)\n        if not is_skip:\n            df = read_excel_advanced(fname, sheet_name=sheet_name, **kwds)\n            df.to_csv(fname_out, index=False)\n        else:\n            warnings.warn('File %s exists, skipping' %fname)\n\n        return fname_out\n\n\nclass XLStoCSVMultiFile(XLStoBase, metaclass=d6tcollect.Collect):\n    \"\"\"\n    \n    Converts xls|xlsx files to csv files. Selects a SINGLE SHEET from each file. 
To extract MULTIPLE SHEETS from a file use XLStoCSVMultiSheet\n\n    Args:\n        fname_list (list): file paths, eg ['dir/a.csv','dir/b.csv']\n        cfg_xls_sheets_sel_mode (string): mode to select tabs\n\n            * ``name``: select by name, provide name for each file, can customize by file\n            * ``name_global``: select by name, one name for all files\n            * ``idx``: select by index, provide index for each file, can customize by file\n            * ``idx_global``: select by index, one index for all files\n\n        cfg_xls_sheets_sel (dict): values to select tabs `{'filename':'value'}`\n        output_dir (str): If present, file is saved in given directory, optional\n        if_exists (str): Possible values: skip and replace, default: skip, optional\n        logger (object): logger object with send_log('msg','status'), optional\n\n    \"\"\"\n\n    def __init__(self, fname_list, cfg_xls_sheets_sel_mode='idx_global', cfg_xls_sheets_sel=0,\n                 output_dir=None, if_exists='skip', logger=None):\n        super().__init__(if_exists, output_dir, logger)\n        if not fname_list:\n            raise ValueError(\"Filename list should not be empty\")\n        self.set_files(fname_list)\n        self.set_select_mode(cfg_xls_sheets_sel_mode, cfg_xls_sheets_sel)\n\n    def set_files(self, fname_list):\n        \"\"\"\n\n        Update input files. 
You will also need to update sheet selection with ``.set_select_mode()``.\n\n        Args:\n            fname_list (list): see class description for details\n\n        \"\"\"\n        self.fname_list = fname_list\n        self.xlsSniffer = XLSSniffer(fname_list)\n\n    def set_select_mode(self, cfg_xls_sheets_sel_mode, cfg_xls_sheets_sel):\n        \"\"\"\n        \n        Update sheet selection values\n\n        Args:\n            cfg_xls_sheets_sel_mode (string): see class description for details\n            cfg_xls_sheets_sel (list): see class description for details\n\n        \"\"\"\n\n        assert cfg_xls_sheets_sel_mode in ['name','idx','name_global','idx_global']\n        sheets = self.xlsSniffer.dict_xls_sheets\n\n        if cfg_xls_sheets_sel_mode=='name_global':\n            cfg_xls_sheets_sel_mode = 'name'\n            cfg_xls_sheets_sel = dict(zip(self.fname_list,[cfg_xls_sheets_sel]*len(self.fname_list)))\n        elif cfg_xls_sheets_sel_mode=='idx_global':\n            cfg_xls_sheets_sel_mode = 'idx'\n            cfg_xls_sheets_sel = dict(zip(self.fname_list,[cfg_xls_sheets_sel]*len(self.fname_list)))\n\n        if not set(cfg_xls_sheets_sel.keys())==set(sheets.keys()):\n            raise ValueError('Need to select a sheet from every file')\n\n        # check given selection actually present in files\n        if cfg_xls_sheets_sel_mode=='name':\n            if not np.all([cfg_xls_sheets_sel[fname] in sheets[fname]['sheets_names'] for fname in self.fname_list]):\n                raise ValueError('Invalid sheet name selected in one of the files')\n                # todo show which file is mismatched\n        elif cfg_xls_sheets_sel_mode=='idx':\n            if not np.all([cfg_xls_sheets_sel[fname] <= sheets[fname]['sheets_count'] for fname in self.fname_list]):\n                raise ValueError('Invalid index selected in one of the files')\n                # todo show which file is mismatched\n        else:\n            raise ValueError('Invalid 
xls_sheets_mode')\n\n        self.cfg_xls_sheets_sel_mode = cfg_xls_sheets_sel_mode\n        self.cfg_xls_sheets_sel = cfg_xls_sheets_sel\n\n    def convert_all(self, **kwds):\n        \"\"\"\n        \n        Converts all files\n\n        Args:\n            Any parameters for `d6tstack.utils.read_excel_advanced()`\n\n        Returns: \n            list: output file names\n        \"\"\"\n\n        fnames_converted = []\n        for fname in self.fname_list:\n            fname_out = self.convert_single(fname, self.cfg_xls_sheets_sel[fname], **kwds)\n            fnames_converted.append(fname_out)\n\n        return fnames_converted\n\n\nclass XLStoCSVMultiSheet(XLStoBase, metaclass=d6tcollect.Collect):\n    \"\"\"\n    \n    Converts ALL SHEETS from a SINGLE xls|xlsx files to separate csv files\n\n    Args:\n        fname (string): file path\n        sheet_names (list): list of int or str. If not given, will convert all sheets in the file\n        output_dir (str): If present, file is saved in given directory, optional\n        if_exists (str): Possible values: skip and replace, default: skip, optional\n        logger (object): logger object with send_log('msg','status'), optional\n\n    \"\"\"\n\n    def __init__(self, fname, sheet_names=None, output_dir=None, if_exists='skip', logger=None):\n        super().__init__(if_exists, output_dir, logger)\n        self.fname = fname\n\n\n        if sheet_names:\n            if not isinstance(sheet_names, (list,str)):\n                raise ValueError('sheet_names needs to be a list')\n            self.sheet_names = sheet_names\n        else:\n            self.xlsSniffer = XLSSniffer([fname, ])\n            self.sheet_names = self.xlsSniffer.xls_sheets[self.fname]['sheets_names']\n\n    def convert_single(self, sheet_name, **kwds):\n        \"\"\"\n\n        Converts all files\n\n        Args:\n            sheet_name (str): Excel sheet\n            Any parameters for `d6tstack.utils.read_excel_advanced()`\n\n        
Returns:\n            str: output file name\n        \"\"\"\n        return super().convert_single(self.fname, sheet_name, **kwds)\n\n    def convert_all(self, **kwds):\n        \"\"\"\n\n        Converts all files\n\n        Args:\n            Any parameters for `d6tstack.utils.read_excel_advanced()`\n\n        Returns:\n            list: output file names\n        \"\"\"\n\n        fnames_converted = []\n        for iSheet in self.sheet_names:\n            fname_out = self.convert_single(iSheet, **kwds)\n            fnames_converted.append(fname_out)\n\n        return fnames_converted\n"
  },
  {
    "path": "d6tstack/helpers.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\n\nModule with several helper functions\n\n\"\"\"\n\nimport os\nimport collections\nimport re\n\ndef file_extensions_get(fname_list):\n    \"\"\"Returns file extensions in list\n\n    Args:\n        fname_list (list): file names, eg ['a.csv','b.csv']\n\n    Returns:\n        list: file extensions for each file name in input list, eg ['.csv','.csv']\n    \"\"\"\n    return [os.path.splitext(fname)[-1] for fname in fname_list]\n\n\ndef file_extensions_all_equal(ext_list):\n    \"\"\"Checks that all file extensions are equal. \n\n    Args:\n        ext_list (list): file extensions, eg ['.csv','.csv']\n\n    Returns:\n        bool: all extensions are equal to first extension in list?\n    \"\"\"\n    return len(set(ext_list))==1\n\n\ndef file_extensions_contains_xls(ext_list):\n    # Assumes all file extensions are equal! Only checks first file\n    return ext_list[0] == '.xls'\n\ndef file_extensions_contains_xlsx(ext_list):\n    # Assumes all file extensions are equal! Only checks first file\n    return ext_list[0] == '.xlsx'\n\ndef file_extensions_contains_csv(ext_list):\n    # Assumes all file extensions are equal! Only checks first file\n    return (ext_list[0] == '.csv' or ext_list[0] == '.txt')\n\ndef file_extensions_valid(ext_list):\n    \"\"\"Checks if file list contains only valid files\n\n    Notes:\n        Assumes all file extensions are equal! Only checks first file\n\n    Args:\n        ext_list (list): file extensions, eg ['.csv','.csv']\n\n    Returns:\n        bool: first element in list is one of ['.csv','.txt','.xls','.xlsx']?\n    \"\"\"\n    ext_list_valid = ['.csv','.txt','.xls','.xlsx']\n    return ext_list[0] in ext_list_valid\n\n\ndef columns_all_equal(col_list):\n    \"\"\"Checks that all lists in col_list are equal. 
\n\n    Args:\n        col_list (list): columns, eg [['a','b'],['a','b','c']]\n\n    Returns:\n        bool: all lists in list are equal?\n    \"\"\"\n    return all([l==col_list[0] for l in col_list])\n\n\ndef list_common(_list, sort=True):\n    l = list(set.intersection(*[set(l) for l in _list]))\n    if sort:\n        return sorted(l)\n    else:\n        return l\n\n\ndef list_unique(_list, sort=True):\n    l = list(set.union(*[set(l) for l in _list]))\n    if sort:\n        return sorted(l)\n    else:\n        return l\n\n\ndef list_tofront(_list,val):\n    # move val to the front of the list, in place, and return the list\n    _list.insert(0, _list.pop(_list.index(val)))\n    return _list\n\n\ndef cols_filename_tofront(_list):\n    return list_tofront(_list,'filename')\n\n\ndef df_filename_tofront(dfg):\n    cfg_col = dfg.columns.tolist()\n    return dfg[cols_filename_tofront(cfg_col)]\n\n\ndef check_valid_xls(fname_list):\n    ext_list = file_extensions_get(fname_list)\n\n    if not file_extensions_all_equal(ext_list):\n        raise IOError('All file types and extensions have to be equal')\n\n    if not(file_extensions_contains_xls(ext_list) or file_extensions_contains_xlsx(ext_list)):\n        raise IOError('Only .xls, .xlsx files can be processed')\n\n    return True\n\n\ndef compare_pandas_versions(version1, version2):\n    def cmp(a, b):\n        return (a > b) - (a < b)\n\n    def normalize(v):\n        return [int(x) for x in re.sub(r'(\\.0+)*$','', v).split(\".\")]\n\n    return cmp(normalize(version1), normalize(version2))\n"
  },
  {
    "path": "d6tstack/pyftp_final.py",
    "content": "from boto.s3.connection import S3Connection\nfrom boto.s3.key import Key\nimport os\nimport ftputil\n\n\ndef get_ftp_files():\n    fileSetftp = set()\n    with ftputil.FTPHost(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd) as ftp_host:\n        ftp_host.use_list_a_option = False\n        for dir_, _, files in ftp_host.walk(cfg_dir_ftp):\n            for fileName in files:\n                relDir = os.path.relpath(dir_, cfg_dir_ftp)\n                relFile = os.path.join(relDir, fileName)\n                fileSetftp.add(relFile)\n    return fileSetftp\n\n\ndef upload_ftp_files_s3(ftp_files, s3_files, bucket):\n    files_ftp_sync = set(ftp_files).difference(s3_files)\n    with ftputil.FTPHost(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd) as ftp_host:\n        for ftp_file in files_ftp_sync:\n            full_name = cfg_dir_ftp + ftp_file\n            basename = os.path.basename(full_name)\n            temp_path = '/tmp/'+basename\n            ftp_host.download(full_name, temp_path)\n            with open(temp_path, 'rb') as f:\n                key = Key(bucket, ftp_file)\n                key.set_contents_from_file(f)\n\n\ndef list_s3_files(bucket):\n    s3_files = set()\n    for key in bucket.list():\n        s3_files.add(key.name.encode('utf-8'))\n    return s3_files\n\n\ndef upload_to_s3(bucket):\n    fname = '/home/anuj/Pictures/test/hp.jpg'\n    basename = os.path.basename(fname)\n    key = Key(bucket, basename)\n    with open(fname, 'rb') as f:\n        key.set_contents_from_file(f)\n\n\nif __name__ == \"__main__\":\n    print(\"S3 File sync\")\n    s3_id = ''\n    s3_key = ''\n    bucket_name = 'test-anuj-ftp-sync'\n\n    cfg_ftp_host = 'ftp.fic.com.tw'\n    cfg_ftp_usr = 'anonymous'\n    cfg_ftp_pwd = 'random'\n    cfg_dir_ftp = '/photo/ia/'\n\n    s3_conn = S3Connection(s3_id, s3_key, host='s3.ap-south-1.amazonaws.com')\n    bucket = s3_conn.get_bucket(bucket_name)\n    s3_files = list_s3_files(bucket)\n    upload_to_s3(bucket)\n\n    ftp_files = 
get_ftp_files()\n    print(ftp_files)\n\n    upload_ftp_files_s3(ftp_files, s3_files, bucket)\n"
  },
  {
    "path": "d6tstack/sniffer.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\n\nFinds CSV settings and Excel sheets in multiple files. Often needed as input for stacking\n\n\"\"\"\nimport collections\nimport csv\n\nimport d6tcollect\n# d6tcollect.init(__name__)\n\n#******************************************************************\n# csv\n#******************************************************************\n\ndef csv_count_rows(fname):\n    def blocks(files, size=65536):\n        while True:\n            b = files.read(size)\n            if not b: break\n            yield b\n\n    with open(fname) as f:\n        nrows = sum(bl.count(\"\\n\") for bl in blocks(f))\n\n    return nrows\n\nclass CSVSniffer(object, metaclass=d6tcollect.Collect):\n    \"\"\"\n    \n    Automatically detects settings needed to read csv files. SINGLE file only, for MULTI file use CSVSnifferList\n\n    Args:\n        fname (string): file path\n        nlines (int): number of lines to sample from each file\n        delims (string): possible delimiters, default \",;\\t|\"\n\n    \"\"\"\n\n    def __init__(self, fname, nlines = 10, delims=',;\\t|'):\n        self.cfg_fname = fname\n        self.nrows = csv_count_rows(fname) # todo: check for file size, if large don't run this\n        self.cfg_nlines = min(nlines,self.nrows) # read_lines() doesn't check EOF # todo: check 1% of file up to a max\n        self.cfg_delims_pool = delims\n        self.delim = None # delim used for the file\n        self.csv_lines = None # top n lines read from file\n        self.csv_lines_delim = None # detected delim for each line in file\n        self.csv_rows = None # top n lines split usingn delim\n\n    def read_nlines(self):\n        # read top lines\n        fhandle = open(self.cfg_fname)\n        self.csv_lines = [fhandle.readline().rstrip() for _ in range(self.cfg_nlines)]\n        fhandle.close()\n\n    def scan_delim(self):\n        if not self.csv_lines:\n            self.read_nlines()\n\n        # get 
delimiter for each line in file\n        delims = []\n        for line in self.csv_lines:\n            try:\n                csv_sniff = csv.Sniffer().sniff(line, self.cfg_delims_pool)\n                delims.append(csv_sniff.delimiter)\n            except:\n                delims.append(None) # todo: able to catch exception more specifically?\n\n        self.csv_lines_delim = delims\n\n    def get_delim(self):\n        if not self.csv_lines_delim:\n            self.scan_delim()\n\n        # all delimiters the same?\n        if len(set(self.csv_lines_delim))>1:\n            self.delim_is_consistent = False\n            csv_delim_count = collections.Counter(self.csv_lines_delim)\n            csv_delim = csv_delim_count.most_common(1)[0][0] # use the most common used delimeter\n            # todo: rerun on cfg_csv_scan_topline**2 files in case there is a large # of header rows\n        else:\n            self.delim_is_consistent = True\n            csv_delim = self.csv_lines_delim[0]\n\n        if csv_delim==None:\n            raise IOError('Could not determine a valid delimiter, pleaes check your files are .csv or .txt using one delimiter of %s' %(self.cfg_delims_pool))\n        else:\n            self.delim = csv_delim\n\n        self.csv_rows = [s.split(self.delim) for s in self.csv_lines][self.count_skiprows():]\n        if self.check_column_length_consistent():\n            self.certainty = 'high'\n        else:\n            self.certainty = 'probable'\n\n        return self.delim\n\n    def check_column_length_consistent(self):\n        # check if all rows have the same length. 
NB: this is just on the sample!\n        if not self.csv_rows:\n            self.get_delim()\n\n        return len(set([len(row) for row in self.csv_rows]))==1\n\n    def count_skiprows(self):\n        # finds the number of rows to skip by finding the last line which doesn't use the selected delimiter\n        if not self.delim:\n            self.get_delim()\n\n        if self.delim_is_consistent: # all delims the same so nothing to skip\n            return 0\n\n        l = [d != self.delim for d in self.csv_lines_delim]\n        l = list(reversed(l))\n        return len(l) - l.index(True)\n\n    def has_header_inverse(self):\n        # checks if head present if all columns in first row contain a letter\n        if not self.csv_rows:\n            self.get_delim()\n\n        def is_number(s):\n            try:\n                float(s)\n                return True\n            except ValueError:\n                return False\n\n        self.is_all_rows_number_col = all([any([is_number(s) for s in row]) for row in self.csv_rows])\n\n        '''\n        self.row_distance = [distance.jaccard(self.csv_rows[0], self.csv_rows[i]) for i in range(1,len(self.csv_rows))]\n\n        iqr_low, iqr_high = np.percentile(self.row_distance[1:], [5, 95])\n        is_first_row_different = not(iqr_low <= self.row_distance[0] <= iqr_high)\n        '''\n\n    def has_header(self):\n        # more likely than not to contain headers so have to prove no header present\n        self.has_header_inverse()\n        return not self.is_all_rows_number_col\n\nclass CSVSnifferList(object, metaclass=d6tcollect.Collect):\n    \"\"\"\n    \n    Automatically detects settings needed to read csv files. 
MULTI file use\n\n    Args:\n        fname_list (list): file names, eg ['a.csv','b.csv']\n        nlines (int): number of lines to sample from each file\n        delims (string): possible delimiters, default ',;\\t|'\n\n    \"\"\"\n\n\n    def __init__(self, fname_list, nlines = 10, delims=',;\\t|'):\n        self.cfg_fname_list = fname_list\n        self.sniffers = [CSVSniffer(fname, nlines, delims) for fname in fname_list]\n\n    def get_all(self, fun_name, msg_error):\n        val = []\n        for sniffer in self.sniffers:\n            func = getattr(sniffer, fun_name)\n            val.append(func())\n\n        if len(set(val))>1:\n            raise NotImplementedError(msg_error+' Make sure all files have the same format')\n            # todo: want to raise an exception here...? or just use whatever got detected for each file?\n        else:\n            return val[0]\n\n    def get_delim(self):\n        return self.get_all('get_delim','Inconsistent delimiters detected!')\n\n    def count_skiprows(self):\n        return self.get_all('count_skiprows','Inconsistent skiprows detected!')\n\n    def has_header(self):\n        return self.get_all('has_header','Inconsistent header setting detected!')\n\n        # todo: propagate status of individual sniffers. instead of raising exception pass back status to get user input\n\n\ndef sniff_settings_csv(fname_list):\n    sniff = CSVSnifferList(fname_list)\n    csv_sniff = {}\n    csv_sniff['delim'] = sniff.get_delim()\n    csv_sniff['skiprows'] = sniff.count_skiprows()\n    csv_sniff['has_header'] = sniff.has_header()\n    csv_sniff['header'] = 0 if sniff.has_header() else None\n    return csv_sniff\n\n\n"
  },
  {
    "path": "d6tstack/sync.py",
    "content": "import boto3\nimport botocore\nimport os\nimport ftputil\nimport numpy as np\n\n\nclass FTPSync:\n    \"\"\"\n\n        FTP Sync class. It allows users to sync their files to s3 or local.\n\n        Args:\n            cfg_ftp_host (string): FTP host name\n            cfg_ftp_usr (string): FTP login username\n            cfg_ftp_pwd (string): FTP login password\n            cfg_ftp_dir (string): FTP starting directory to be used for sync.\n            cfg_s3_key (string): AWS S3 key for connection\n            cfg_s3_secret (string): AWS S3 secret for connection\n            bucket_name (string): Bucket name in s3 for syncing the files\n            local_dir (string): local dir path to be used for sync. dir will be created if not exist.\n            logger (object): logger object with send_log()\n\n        \"\"\"\n    def __init__(self, cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd, cfg_ftp_dir,\n                 cfg_s3_key=None, cfg_s3_secret=None, bucket_name=None,\n                 local_dir='./data/', logger=None):\n        self.cfg_ftp_host = cfg_ftp_host\n        self.cfg_ftp_usr = cfg_ftp_usr\n        self.cfg_ftp_pwd = cfg_ftp_pwd\n        self.cfg_ftp_dir = cfg_ftp_dir\n        self.ftp_host = ftputil.FTPHost(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd)\n        self.ftp_host.use_list_a_option = False\n        self.s3_client = None\n        self.bucket_name = None\n        if cfg_s3_key and cfg_s3_secret and bucket_name:\n            self.s3_client = boto3.client(\n                's3',\n                aws_access_key_id=cfg_s3_key,\n                aws_secret_access_key=cfg_s3_secret\n            )\n            exists = True\n            try:\n                self.s3_client.head_bucket(Bucket=bucket_name)\n            except botocore.exceptions.ClientError as e:\n                # If a client error is thrown, then check that it was a 404 error.\n                # If it was a 404 error, then the bucket does not exist.\n                error_code = 
int(e.response['Error']['Code'])\n                if error_code == 404:\n                    exists = False\n            if not exists:\n                if logger:\n                    logger.send_log('Bucket does not exist. Creating bucket', 'ok')\n                self.s3_client.create_bucket(Bucket=bucket_name)\n            self.bucket_name = bucket_name\n        self.local_dir = local_dir\n        if not os.path.exists(local_dir):\n            os.makedirs(local_dir)\n        self.logger = logger\n\n    def get_all_files(self, subdirs=True, ftp=False):\n        \"\"\"\n\n        Get all file list from local or ftp\n\n        Args:\n            subdirs (bool): return all the files in directory recursively? If `false` it will not go to sub directories\n            ftp (bool): local files if `true` otherwise local files\n\n        Returns:\n            Alphabetically Sorted file list\n\n        \"\"\"\n        fileSet = set()\n        host = os\n        from_dir = self.local_dir\n        if ftp:\n            host = self.ftp_host\n            from_dir = self.cfg_ftp_dir\n        if subdirs:\n            for dir_, _, files in host.walk(from_dir):\n                for fileName in files:\n                    relDir = os.path.relpath(dir_, from_dir)\n                    relFile = os.path.join(relDir, fileName)\n                    fileSet.add(relFile)\n        else:\n            for fileName in host.listdir(from_dir):\n                relFile = os.path.join(from_dir, fileName)\n                if host.path.isfile(relFile):\n                    fileSet.add(relFile)\n        return np.sort(list(fileSet))\n\n    def get_s3_files(self):\n        \"\"\"\n\n            Get all file list from s3 in the given bucket\n\n           Returns:\n                File list from s3 in bucket\n\n        \"\"\"\n        if not self.s3_client or not self.bucket_name:\n            raise ValueError(\"S3 credentials are mandatory to use this functionality\")\n        s3_files = set()\n        
all_files = self.s3_client.list_objects(Bucket=self.bucket_name)\n        for content in all_files.get('Contents', []):\n            s3_files.add(content.get('Key'))\n        return s3_files\n\n    def upload_to_s3(self, fname, local_path):\n        \"\"\"\n\n            Upload a single file from local to s3\n\n            Args:\n                fname (string): Filename in s3\n                local_path (string): Local path of file to be uploaded\n\n        \"\"\"\n\n        with open(local_path, 'rb') as f:\n            self.s3_client.upload_fileobj(f, self.bucket_name, fname)\n\n    def get_files_for_sync(self, subdirs=True, to_s3=False):\n        \"\"\"\n\n            Get File list for sync along with total file size\n\n            Args:\n                subdirs (bool): return all the files in directory recursively? If `false` it will not go to sub directories, Optional\n                to_s3 (bool): get files to be sync from ftp to local. If `true` all files will be synced from ftp to s3\n\n        \"\"\"\n        ftp_files = self.get_all_files(subdirs=subdirs, ftp=True)\n        if to_s3:\n            server_files = self.get_s3_files()\n        else:\n            server_files = self.get_all_files(subdirs=subdirs)\n        files_ftp_sync = set(ftp_files).difference(set(server_files))\n        total_file_size = sum([self.ftp_host.path.getsize(os.path.join(self.cfg_ftp_dir, f))\n                               for f in files_ftp_sync])\n        return files_ftp_sync, total_file_size\n\n    def upload_ftp_files(self, subdirs=True, to_s3=False):\n        \"\"\"\n\n            Get File list for sync along with total file size\n\n            Args:\n                subdirs (bool): Upload files from ftp recursively? If `false` it will not go to sub directories, Optional\n                to_s3 (bool): upload files from ftp to local. 
If `true` files will be uploaded from ftp to s3\n\n        \"\"\"\n\n        files_ftp_sync, total_file_size = self.get_files_for_sync(subdirs=subdirs, to_s3=to_s3)\n        for ftp_file in files_ftp_sync:\n            full_name = os.path.join(self.cfg_ftp_dir, ftp_file)\n            local_path = os.path.join(self.local_dir, ftp_file)\n            file_dir_local = os.path.dirname(local_path)\n            if not os.path.exists(file_dir_local):\n                os.makedirs(file_dir_local)\n            self.ftp_host.download(full_name, local_path)\n            if to_s3:\n                self.upload_to_s3(ftp_file, local_path)\n"
  },
  {
    "path": "d6tstack/utils.py",
    "content": "import pandas as pd\nimport warnings\n\nimport d6tcollect\nd6tcollect.init(__name__)\n\nclass PrintLogger(object):\n    def send_log(self, msg, status):\n        print(msg,status)\n\n    def send(self, data):\n        print(data)\n\nimport os\n\n@d6tcollect.collect\ndef pd_readsql_query_from_sqlengine(uri, sql, schema_name=None, connect_args=None):\n    \"\"\"\n    Load SQL statement into pandas dataframe using `sql_engine.execute` making execution faster.\n\n    Args:\n        uri (str): postgres psycopg2 sqlalchemy database uri\n        sql (str): sql query\n        schema_name (str): name of schema\n        connect_args (dict): dictionary of connection arguments to pass to `sqlalchemy.create_engine`\n\n    Returns:\n        df: pandas dataframe\n\n    \"\"\"\n\n    import sqlalchemy\n    if connect_args is not None:\n        sql_engine = sqlalchemy.create_engine(uri, connect_args=connect_args)\n    elif schema_name is not None:\n        if 'psycopg2' in uri:\n            sql_engine = sqlalchemy.create_engine(uri, connect_args={'options': '-csearch_path={}'.format(schema_name)})\n        else:\n            raise NotImplementedError('only `psycopg2` supported with schema_name, pass connect_args for your db engine')\n    else:\n        sql_engine = sqlalchemy.create_engine(uri)\n\n    sql = sql_engine.execute(sql)\n    df = pd.DataFrame(sql.fetchall())\n\n    return df\n\n\n@d6tcollect.collect\ndef pd_readsql_table_from_sqlengine(uri, table_name, schema_name=None, connect_args=None):\n    \"\"\"\n    Load SQL table into pandas dataframe using `sql_engine.execute` making execution faster. 
Convenience function that returns full table.\n\n    Args:\n        uri (str): postgres psycopg2 sqlalchemy database uri\n        table_name (str): table\n        schema_name (str): name of schema\n        connect_args (dict): dictionary of connection arguments to pass to `sqlalchemy.create_engine`\n\n    Returns:\n        df: pandas dataframe\n\n    \"\"\"\n\n    return pd_readsql_query_from_sqlengine(uri, \"SELECT * FROM {};\".fromat(table_name), schema_name=schema_name, connect_args=connect_args)\n\n\n@d6tcollect.collect\ndef pd_to_psql(df, uri, table_name, schema_name=None, if_exists='fail', sep=','):\n    \"\"\"\n    Load pandas dataframe into a sql table using native postgres COPY FROM.\n\n    Args:\n        df (dataframe): pandas dataframe\n        uri (str): postgres psycopg2 sqlalchemy database uri\n        table_name (str): table to store data in\n        schema_name (str): name of schema in db to write to\n        if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details\n        sep (str): separator for temp file, eg ',' or '\\t'\n\n    Returns:\n        bool: True if loader finished\n\n    \"\"\"\n\n    if not 'psycopg2' in uri:\n        raise ValueError('need to use psycopg2 uri eg postgresql+psycopg2://psqlusr:psqlpwdpsqlpwd@localhost/psqltest. 
install with `pip install psycopg2-binary`')\n    table_name = table_name.lower()\n    if schema_name:\n        schema_name = schema_name.lower()\n   \n    import sqlalchemy\n    import io\n\n    if schema_name is not None:\n        sql_engine = sqlalchemy.create_engine(uri, connect_args={'options': '-csearch_path={}'.format(schema_name)})\n    else:\n        sql_engine = sqlalchemy.create_engine(uri)\n    sql_cnxn = sql_engine.raw_connection()\n    cursor = sql_cnxn.cursor()\n\n    df[:0].to_sql(table_name, sql_engine, schema=schema_name, if_exists=if_exists, index=False)\n\n    fbuf = io.StringIO()\n    df.to_csv(fbuf, index=False, header=False, sep=sep)\n    fbuf.seek(0)\n    cursor.copy_from(fbuf, table_name, sep=sep, null='')\n    sql_cnxn.commit()\n    cursor.close()\n\n    return True\n\n\n@d6tcollect.collect\ndef pd_to_mysql(df, uri, table_name, if_exists='fail', tmpfile='mysql.csv', sep=',', newline='\\n'):\n    \"\"\"\n    Load dataframe into a sql table using native postgres LOAD DATA LOCAL INFILE.\n\n    Args:\n        df (dataframe): pandas dataframe\n        uri (str): mysql mysqlconnector sqlalchemy database uri\n        table_name (str): table to store data in\n        if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. See `pandas.to_sql()` for details\n        tmpfile (str): filename for temporary file to load from\n        sep (str): separator for temp file, eg ',' or '\\t'\n\n    Returns:\n        bool: True if loader finished\n\n    \"\"\"\n    if not 'mysql+mysqlconnector' in uri:\n        raise ValueError('need to use mysql+mysqlconnector uri eg mysql+mysqlconnector://testusr:testpwd@localhost/testdb. 
install with `pip install mysql-connector`')\n    table_name = table_name.lower()\n\n    import sqlalchemy\n\n    sql_engine = sqlalchemy.create_engine(uri)\n\n    df[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)\n\n    logger = PrintLogger()\n    logger.send_log('creating ' + tmpfile, 'ok')\n    with open(tmpfile, mode='w', newline=newline) as fhandle:\n        df.to_csv(fhandle, na_rep='\\\\N', index=False, sep=sep)\n    logger.send_log('loading ' + tmpfile, 'ok')\n    sql_load = \"LOAD DATA LOCAL INFILE '{}' INTO TABLE {} FIELDS TERMINATED BY '{}' LINES TERMINATED BY '{}' IGNORE 1 LINES;\".format(tmpfile, table_name, sep, newline)\n    sql_engine.execute(sql_load)\n\n    os.remove(tmpfile)\n\n    return True\n\n@d6tcollect.collect\ndef pd_to_mssql(df, uri, table_name, schema_name=None, if_exists='fail', tmpfile='mysql.csv'):\n    \"\"\"\n    Load dataframe into a sql table using native postgres LOAD DATA LOCAL INFILE.\n\n    Args:\n        df (dataframe): pandas dataframe\n        uri (str): mysql mysqlconnector sqlalchemy database uri\n        table_name (str): table to store data in\n        schema_name (str): name of schema in db to write to\n        if_exists (str): {‘fail’, ‘replace’, ‘append’}, default ‘fail’. 
See `pandas.to_sql()` for details\n        tmpfile (str): filename for temporary file to load from\n\n    Returns:\n        bool: True if loader finished\n\n    \"\"\"\n    if not 'mssql+pymssql' in uri:\n        raise ValueError('need to use mssql+pymssql uri (conda install -c prometeia pymssql)')\n    table_name = table_name.lower()\n    if schema_name:\n        schema_name = schema_name.lower()\n\n    warnings.warn('`.pd_to_mssql()` is experimental, if any problems please raise an issue on https://github.com/d6t/d6tstack/issues or make a pull request')\n    import sqlalchemy\n\n    sql_engine = sqlalchemy.create_engine(uri)\n\n    df[:0].to_sql(table_name, sql_engine, if_exists=if_exists, index=False)\n\n    logger = PrintLogger()\n    logger.send_log('creating ' + tmpfile, 'ok')\n    df.to_csv(tmpfile, na_rep='\\\\N', index=False)\n    logger.send_log('loading ' + tmpfile, 'ok')\n    if schema_name is not None:\n        table_name = '{}.{}'.format(schema_name,table_name)\n    sql_load = \"BULK INSERT {} FROM '{}';\".format(table_name, tmpfile)\n    sql_engine.execute(sql_load)\n\n    os.remove(tmpfile)\n\n    return True\n\n"
  },
  {
    "path": "docs/Makefile",
    "content": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line.\nSPHINXOPTS    =\nSPHINXBUILD   = python -msphinx\nSPHINXPROJ    = d6tstack\nSOURCEDIR     = source\nBUILDDIR      = build\n\n# Put it first so that \"make\" without argument is like \"make help\".\nhelp:\n\t@$(SPHINXBUILD) -M help \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)\n\n.PHONY: help Makefile\n\n# Catch-all target: route all unknown targets to Sphinx using the new\n# \"make mode\" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).\n%: Makefile\n\t@$(SPHINXBUILD) -M $@ \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)"
  },
  {
    "path": "docs/make.bat",
    "content": "@ECHO OFF\n\npushd %~dp0\n\nREM Command file for Sphinx documentation\n\nif \"%SPHINXBUILD%\" == \"\" (\n\tset SPHINXBUILD=python -msphinx\n)\nset SOURCEDIR=source\nset BUILDDIR=build\nset SPHINXPROJ=d6t-lib\n\nif \"%1\" == \"\" goto help\n\n%SPHINXBUILD% >NUL 2>NUL\nif errorlevel 9009 (\n\techo.\n\techo.The Sphinx module was not found. Make sure you have Sphinx installed,\n\techo.then set the SPHINXBUILD environment variable to point to the full\n\techo.path of the 'sphinx-build' executable. Alternatively you may add the\n\techo.Sphinx directory to PATH.\n\techo.\n\techo.If you don't have Sphinx installed, grab it from\n\techo.http://sphinx-doc.org/\n\texit /b 1\n)\n\n%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%\ngoto end\n\n:help\n%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%\n\n:end\npopd\n"
  },
  {
    "path": "docs/make_zip_sample_csv.py",
    "content": "import zipfile\nimport glob\nimport os\n\nif not os.path.exists('test-data/output/__init__.py'):\n\tfhandle = open('test-data/output/__init__.py', 'w')\n\tfhandle.close()\n\n\nziphandle = zipfile.ZipFile('test-data.zip', 'w')\ncfg_path_base = 'test-data/input/test-data-input'\nfor fname in  glob.glob(cfg_path_base+'*.csv')+glob.glob(cfg_path_base+'*.xls')+glob.glob(cfg_path_base+'*.xlsx'):\n\tziphandle.write(fname)\n\nziphandle.write('test-data/output/__init__.py')\nziphandle.close()\n"
  },
  {
    "path": "docs/make_zip_sample_xls.py",
    "content": "import zipfile\nimport glob\nimport os\n\n\n\nimport pandas as pd\nimport numpy as np\n# generate fake data\ncfg_tickers = ['AAP','M','SPLS']\ncfg_ntickers = len(cfg_tickers)\ncfg_ndates = 10\ncfg_dates = pd.bdate_range('2018-01-01',periods=cfg_ndates).tolist()+pd.bdate_range('2018-02-01',periods=cfg_ndates).tolist()\ncfg_nobs = cfg_ndates*2\ndft = pd.DataFrame({'date':np.tile(cfg_dates,cfg_ntickers), 'ticker':np.repeat(cfg_tickers,cfg_nobs)})\n\n\n#****************************************\n# xls\n#****************************************\ndef write_file_xls(dfg, fname, sheets, startrow=0,startcol=0):\n    writer = pd.ExcelWriter(fname)\n    for isheet in sheets:\n        dft['data'] = np.random.normal(size=dfg.shape[0])\n        dfg['xls_sheet'] = isheet\n        dfg.to_excel(writer, isheet, index=False,startrow=startrow,startcol=startcol)\n    writer.save()\n\n# excel - bad case => d6tstack. Fake data\ncfg_path_base = 'test-data/excel_adv_data/sample-xls-'\ndf = dft\nnp.random.seed(0)\nwrite_file_xls(df, cfg_path_base+'case-simple.xlsx',['Sheet1'])\nwrite_file_xls(df, cfg_path_base+'case-multisheet.xlsx',['Sheet1','Sheet2'])\nwrite_file_xls(df, cfg_path_base+'case-multifile1.xlsx',['Sheet1'])\nwrite_file_xls(df, cfg_path_base+'case-multifile2.xlsx',['Sheet1'])\nwrite_file_xls(df, cfg_path_base+'case-badlayout1.xlsx',['Sheet1','Sheet2'],startrow=1,startcol=1)\n\nziphandle = zipfile.ZipFile('test-data-xls.zip', 'w')\nfor fname in  glob.glob(cfg_path_base+'*.xlsx'):\n    ziphandle.write(fname)\n\nziphandle.write('test-data/output/__init__.py')\nziphandle.close()\n\n"
  },
  {
    "path": "docs/shell-napoleon-html.sh",
    "content": "make html\n"
  },
  {
    "path": "docs/shell-napoleon-recreate.sh",
    "content": "#rm ./source/*\n#cp ./source-bak/* ./source/\nsphinx-apidoc -f -o ./source ..\nmake clean\nmake html\n"
  },
  {
    "path": "docs/source/conf.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n#\n# d6t-lib documentation build configuration file, created by\n# sphinx-quickstart on Tue Nov 28 11:32:56 2017.\n#\n# This file is execfile()d with the current directory set to its\n# containing dir.\n#\n# Note that not all possible configuration values are present in this\n# autogenerated file.\n#\n# All configuration values have a default; values that are commented out\n# serve to show the default.\n\n# If extensions (or modules to document with autodoc) are in another directory,\n# add these directories to sys.path here. If the directory is relative to the\n# documentation root, use os.path.abspath to make it absolute, like shown here.\n#\nimport os\nimport sys\n\nsys.path.insert(0, os.path.abspath('.'))\nsys.path.insert(0, os.path.dirname(os.path.abspath('.')))  # todo: why is this not working?\nsys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath('.'))))\nsys.path.insert(0, os.path.join(os.path.dirname((os.path.abspath('.'))), \"d6tstack\"))\n\n# -- General configuration ------------------------------------------------\n\n# If your documentation needs a minimal Sphinx version, state it here.\n#\n# needs_sphinx = '1.0'\n\n# Add any Sphinx extension module names here, as strings. 
They can be\n# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n# ones.\nextensions = ['sphinx.ext.autodoc',\n              'sphinx.ext.todo',\n              'sphinx.ext.viewcode',\n              'sphinx.ext.githubpages',\n              'sphinx.ext.napoleon']\n\n# Add any paths that contain templates here, relative to this directory.\ntemplates_path = ['_templates']\n\n# The suffix(es) of source filenames.\n# You can specify multiple suffix as a list of string:\n#\n# source_suffix = ['.rst', '.md']\nsource_suffix = '.rst'\n\n# The master toctree document.\nmaster_doc = 'index'\n\n# General information about the project.\nproject = 'd6tstack'\ncopyright = '2017, databolt'\nauthor = 'databolt'\n\n# The version info for the project you're documenting, acts as replacement for\n# |version| and |release|, also used in various other places throughout the\n# built documents.\n#\n# The short X.Y version.\nversion = '0.1'\n# The full version, including alpha/beta/rc tags.\nrelease = '0.1'\n\n# The language for content autogenerated by Sphinx. Refer to documentation\n# for a list of supported languages.\n#\n# This is also used if you do content translation via gettext catalogs.\n# Usually you set \"language\" from the command line for these cases.\nlanguage = None\n\n# List of patterns, relative to source directory, that match files and\n# directories to ignore when looking for source files.\n# This patterns also effect to html_static_path and html_extra_path\nexclude_patterns = []\n\n# The name of the Pygments (syntax highlighting) style to use.\npygments_style = 'sphinx'\n\n# If true, `todo` and `todoList` produce output, else they produce nothing.\ntodo_include_todos = True\n\n# -- Options for HTML output ----------------------------------------------\n\n# The theme to use for HTML and HTML Help pages.  
See the documentation for\n# a list of builtin themes.\n#\nhtml_theme = 'sphinx_rtd_theme'  # 'alabaster'\n\n# Theme options are theme-specific and customize the look and feel of a theme\n# further.  For a list of options available for each theme, see the\n# documentation.\n#\n# html_theme_options = {}\n\n# Add any paths that contain custom static files (such as style sheets) here,\n# relative to this directory. They are copied after the builtin static files,\n# so a file named \"default.css\" will overwrite the builtin \"default.css\".\nhtml_static_path = ['_static']\n\n# Custom sidebar templates, must be a dictionary that maps document names\n# to template names.\n#\n# This is required for the alabaster theme\n# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars\n# html_sidebars = {\n#     '**': [\n#         'about.html',\n#         'navigation.html',\n#         'relations.html',  # needs 'show_related': True theme option to display\n#         'searchbox.html',\n#         'donate.html',\n#     ]\n# }\n\n\n# -- Options for HTMLHelp output ------------------------------------------\n\n# Output file base name for HTML help builder.\nhtmlhelp_basename = 'd6tstack-doc'\n\n# -- Options for LaTeX output ---------------------------------------------\n\nlatex_elements = {\n    # The paper size ('letterpaper' or 'a4paper').\n    #\n    # 'papersize': 'letterpaper',\n\n    # The font size ('10pt', '11pt' or '12pt').\n    #\n    # 'pointsize': '10pt',\n\n    # Additional stuff for the LaTeX preamble.\n    #\n    # 'preamble': '',\n\n    # Latex figure (float) alignment\n    #\n    # 'figure_align': 'htbp',\n}\n\n# Grouping the document tree into LaTeX files. 
List of tuples\n# (source start file, target name, title,\n#  author, documentclass [howto, manual, or own class]).\nlatex_documents = [\n    (master_doc, 'd6tstack.tex', 'd6tstack Documentation',\n     'nn', 'manual'),\n]\n\n# -- Options for manual page output ---------------------------------------\n\n# One entry per manual page. List of tuples\n# (source start file, name, description, authors, manual section).\nman_pages = [\n    (master_doc, 'd6tstack', 'd6tstack Documentation',\n     [author], 1)\n]\n\n# -- Options for Texinfo output -------------------------------------------\n\n# Grouping the document tree into Texinfo files. List of tuples\n# (source start file, target name, title, author,\n#  dir menu entry, description, category)\ntexinfo_documents = [\n    (master_doc, 'd6tstack', 'd6tstack Documentation',\n     author, 'd6tstack', 'Databolt python library - Accelerate data engineering',\n     'Miscellaneous'),\n]\n"
  },
  {
    "path": "docs/source/d6tstack.rst",
    "content": "d6tstack package\n================\n\nSubmodules\n----------\n\nd6tstack.combine\\_csv module\n----------------------------\n\n.. automodule:: d6tstack.combine_csv\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\nd6tstack.convert\\_xls module\n----------------------------\n\n.. automodule:: d6tstack.convert_xls\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\nd6tstack.helpers module\n-----------------------\n\n.. automodule:: d6tstack.helpers\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\nd6tstack.helpers\\_ui module\n---------------------------\n\n.. automodule:: d6tstack.helpers_ui\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\nd6tstack.sniffer module\n-----------------------\n\n.. automodule:: d6tstack.sniffer\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\nd6tstack.sync module\n--------------------\n\n.. automodule:: d6tstack.sync\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\nd6tstack.utils module\n---------------------\n\n.. automodule:: d6tstack.utils\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\n\nModule contents\n---------------\n\n.. automodule:: d6tstack\n    :members:\n    :undoc-members:\n    :show-inheritance:\n"
  },
  {
    "path": "docs/source/index.rst",
    "content": ".. d6t-celery-combine documentation master file, created by\n   sphinx-quickstart on Tue Nov 28 11:32:56 2017.\n   You can adapt this file completely to your liking, but it should at least\n   contain the root `toctree` directive.\n\nWelcome to d6tstack documentation!\n==============================================\n\nDocumentation for using the databolt python File Stack Combine library.\n\nLibrary Docs\n==================\n\n* :ref:`modindex`\n\nSearch\n==================\n\n* :ref:`search`\n"
  },
  {
    "path": "docs/source/modules.rst",
    "content": "d6tstack\n========\n\n.. toctree::\n   :maxdepth: 4\n\n   d6tstack\n   setup\n"
  },
  {
    "path": "docs/source/setup.rst",
    "content": "setup module\n============\n\n.. automodule:: setup\n    :members:\n    :undoc-members:\n    :show-inheritance:\n"
  },
  {
    "path": "docs/source/tests.rst",
    "content": "tests package\n=============\n\nSubmodules\n----------\n\ntests.test\\_combine module\n--------------------------\n\n.. automodule:: tests.test_combine\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\ntests.test\\_sync module\n-----------------------\n\n.. automodule:: tests.test_sync\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\ntests.test\\_xls module\n----------------------\n\n.. automodule:: tests.test_xls\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\ntests.tmp module\n----------------\n\n.. automodule:: tests.tmp\n    :members:\n    :undoc-members:\n    :show-inheritance:\n\n\nModule contents\n---------------\n\n.. automodule:: tests\n    :members:\n    :undoc-members:\n    :show-inheritance:\n"
  },
  {
    "path": "examples-csv.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Data Engineering in Python with databolt  - Quickly Load Any Type of CSV (d6tlib/d6tstack)\\n\",\n    \"\\n\",\n    \"Vendors often send large datasets in multiple files. Often there are missing and misaligned columns between files that have to be manually cleaned. With DataBolt File Stack you can easily stack them together into one consistent dataset.\\n\",\n    \"\\n\",\n    \"Features include:\\n\",\n    \"* Quickly check column consistency across multiple files\\n\",\n    \"* Fix added/missing columns\\n\",\n    \"* Fix renamed columns\\n\",\n    \"* Out of core functionality to process large files\\n\",\n    \"* Export to pandas, CSV, SQL, parquet\\n\",\n    \"    * Fast export to postgres and mysql with out of core support\\n\",\n    \"    \\n\",\n    \"In this workbook we will demonstrate the usage of the d6tstack library.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 100,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import importlib\\n\",\n    \"import pandas as pd\\n\",\n    \"import glob\\n\",\n    \"\\n\",\n    \"import d6tstack.combine_csv as d6tc\\n\",\n    \"import d6tstack\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Get sample data\\n\",\n    \"\\n\",\n    \"We've created some dummy sample data which you can download. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 78,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import urllib.request\\n\",\n    \"cfg_fname_sample = 'test-data.zip'\\n\",\n    \"urllib.request.urlretrieve(\\\"https://github.com/d6t/d6tstack/raw/master/\\\"+cfg_fname_sample, cfg_fname_sample)\\n\",\n    \"import zipfile\\n\",\n    \"zip_ref = zipfile.ZipFile(cfg_fname_sample, 'r')\\n\",\n    \"zip_ref.extractall('.')\\n\",\n    \"zip_ref.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Use Case: Checking Column Consistency\\n\",\n    \"\\n\",\n    \"Let's say you receive a bunch of csv files you want to ingest them, say for example into pandas, dask, pyspark, database.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 79,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"['test-data/input/test-data-input-csv-clean-mar.csv', 'test-data/input/test-data-input-csv-clean-feb.csv', 'test-data/input/test-data-input-csv-clean-jan.csv']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-clean-*.csv'))\\n\",\n    \"print(cfg_fnames)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Check column consistency across all files\\n\",\n    \"\\n\",\n    \"Even if you think the files have a consistent column layout, it worthwhile using `d6tstack` to assert that that is actually the case. 
It's very quick to do even with very many large files!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 80,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# get previews\\n\",\n    \"c = d6tc.CombinerCSV(cfg_fnames) # all_strings=True makes reading faster\\n\",\n    \"col_sniff = c.sniff_columns()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 81,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"all columns equal? True\\n\",\n      \"\\n\",\n      \"which columns are present in which files?\\n\",\n      \"\\n\",\n      \"                                                   date  sales  cost  profit\\n\",\n      \"file_path                                                                   \\n\",\n      \"test-data/input/test-data-input-csv-clean-feb.csv  True   True  True    True\\n\",\n      \"test-data/input/test-data-input-csv-clean-jan.csv  True   True  True    True\\n\",\n      \"test-data/input/test-data-input-csv-clean-mar.csv  True   True  True    True\\n\",\n      \"\\n\",\n      \"in what order do columns appear in the files?\\n\",\n      \"\\n\",\n      \"   date  sales  cost  profit\\n\",\n      \"0     0      1     2       3\\n\",\n      \"1     0      1     2       3\\n\",\n      \"2     0      1     2       3\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('all columns equal?', c.is_all_equal())\\n\",\n    \"print('')\\n\",\n    \"print('which columns are present in which files?')\\n\",\n    \"print('')\\n\",\n    \"print(c.is_column_present())\\n\",\n    \"print('')\\n\",\n    \"print('in what order do columns appear in the files?')\\n\",\n    \"print('')\\n\",\n    \"print(col_sniff['df_columns_order'].reset_index(drop=True))\"\n   ]\n  },\n  
{\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Preview Combined Data\\n\",\n    \"\\n\",\n    \"You can see a preview of what the combined data from all files will look like.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 82,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>filepath</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       
\"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-jan.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-jan.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-jan.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-mar.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       
\"      <th>7</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-mar.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-mar.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit                                           filepath                           filename\\n\",\n       \"0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\\n\",\n       \"1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\\n\",\n       \"2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\\n\",\n       \"3  2011-01-01    100   -80      20  test-data/input/test-data-input-csv-clean-jan.csv  test-data-input-csv-clean-jan.csv\\n\",\n       \"4  2011-01-02    100   -80      20  test-data/input/test-data-input-csv-clean-jan.csv  test-data-input-csv-clean-jan.csv\\n\",\n       \"5  2011-01-03    100   -80      20  test-data/input/test-data-input-csv-clean-jan.csv  test-data-input-csv-clean-jan.csv\\n\",\n       \"6  2011-03-01    300  -100     200  test-data/input/test-data-input-csv-clean-mar.csv  test-data-input-csv-clean-mar.csv\\n\",\n       \"7  2011-03-02    300  -100     200  
test-data/input/test-data-input-csv-clean-mar.csv  test-data-input-csv-clean-mar.csv\\n\",\n       \"8  2011-03-03    300  -100     200  test-data/input/test-data-input-csv-clean-mar.csv  test-data-input-csv-clean-mar.csv\"\n      ]\n     },\n     \"execution_count\": 82,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.combine_preview()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Read All Files to Pandas\\n\",\n    \"\\n\",\n    \"You can quickly load the combined data into a pandas dataframe with a single command. \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 83,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>filepath</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      
<td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"      <td>test-data-input-csv-clean-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit                                           filepath                           filename\\n\",\n       \"0  2011-02-01    200   -90     110  
test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\\n\",\n       \"1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\\n\",\n       \"2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\\n\",\n       \"3  2011-02-04    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\\n\",\n       \"4  2011-02-05    200   -90     110  test-data/input/test-data-input-csv-clean-feb.csv  test-data-input-csv-clean-feb.csv\"\n      ]\n     },\n     \"execution_count\": 83,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.to_pandas().head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Use Case: Identifying and fixing inconsistent columns\\n\",\n    \"\\n\",\n    \"The first case was clean: all files had the same columns. It frequently happens that the data schema changes over time, with columns being added or deleted. 
Let's look at a case where an extra column was added.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 84,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"['test-data/input/test-data-input-csv-colmismatch-mar.csv', 'test-data/input/test-data-input-csv-colmismatch-feb.csv', 'test-data/input/test-data-input-csv-colmismatch-jan.csv']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))\\n\",\n    \"print(cfg_fnames)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 85,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# get previews\\n\",\n    \"c = d6tc.CombinerCSV(cfg_fnames) # all_strings=True makes reading faster\\n\",\n    \"col_sniff = c.sniff_columns()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 86,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"all columns equal? False\\n\",\n      \"\\n\",\n      \"which columns are unique? ['profit2']\\n\",\n      \"\\n\",\n      \"which files have unique columns?\\n\",\n      \"\\n\",\n      \"                                                    profit2\\n\",\n      \"file_path                                                  \\n\",\n      \"test-data/input/test-data-input-csv-colmismatch...    False\\n\",\n      \"test-data/input/test-data-input-csv-colmismatch...    False\\n\",\n      \"test-data/input/test-data-input-csv-colmismatch...     
True\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('all columns equal?', c.is_all_equal())\\n\",\n    \"print('')\\n\",\n    \"print('which columns are unique?', col_sniff['columns_unique'])\\n\",\n    \"print('')\\n\",\n    \"print('which files have unique columns?')\\n\",\n    \"print('')\\n\",\n    \"print(c.is_column_present_unique())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 87,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>filepath</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit  profit2                                           filepath                                 filename\\n\",\n       \"0  2011-02-01    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  
test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"1  2011-02-02    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"2  2011-02-03    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"3  2011-02-04    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"4  2011-02-05    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\"\n      ]\n     },\n     \"execution_count\": 87,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.to_pandas().head() # keep all columns\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 88,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>filepath</th>\\n\",\n       \"      
<th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       
\"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit                                           filepath                                 filename\\n\",\n       \"0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"3  2011-02-04    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"4  2011-02-05    200   -90     110  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\"\n      ]\n     },\n     \"execution_count\": 88,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"d6tc.CombinerCSV(cfg_fnames, columns_select_common=True).to_pandas().head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Use Case: Align renamed columns and select a subset of columns\\n\",\n    \"\\n\",\n    \"Say a column has been renamed and now the data doesn't line up with the data from the old column name. You can easily fix such a situation with the `columns_rename` parameter, which renames columns and automatically aligns the data. 
You can also load just a subset of columns with the `columns_select` parameter.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 89,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"                                                    revenue  sales\\n\",\n      \"file_path                                                         \\n\",\n      \"test-data/input/test-data-input-csv-renamed-feb...    False   True\\n\",\n      \"test-data/input/test-data-input-csv-renamed-jan...    False   True\\n\",\n      \"test-data/input/test-data-input-csv-renamed-mar...     True  False\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-renamed-*.csv'))\\n\",\n    \"c = d6tc.CombinerCSV(cfg_fnames)\\n\",\n    \"print(c.is_column_present_unique())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The column `sales` was renamed to `revenue` in the March file, which would cause problems when reading the files. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 90,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"      <th>revenue</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-feb.csv</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>200.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-feb.csv</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>200.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-feb.csv</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>200.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-jan.csv</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      
<td>100.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-jan.csv</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-jan.csv</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100.0</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-mar.csv</td>\\n\",\n       \"      <td>300.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-mar.csv</td>\\n\",\n       \"      <td>300.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>test-data-input-csv-renamed-mar.csv</td>\\n\",\n       \"      <td>300.0</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                              filename  revenue  sales\\n\",\n       \"0  test-data-input-csv-renamed-feb.csv      NaN  200.0\\n\",\n       \"1  test-data-input-csv-renamed-feb.csv      NaN  200.0\\n\",\n       \"2  test-data-input-csv-renamed-feb.csv      NaN  200.0\\n\",\n       \"3  test-data-input-csv-renamed-jan.csv      NaN  100.0\\n\",\n       \"4  test-data-input-csv-renamed-jan.csv      NaN  100.0\\n\",\n       \"5  test-data-input-csv-renamed-jan.csv      NaN  100.0\\n\",\n       \"6  test-data-input-csv-renamed-mar.csv    300.0    NaN\\n\",\n       \"7  test-data-input-csv-renamed-mar.csv    300.0    NaN\\n\",\n       \"8  test-data-input-csv-renamed-mar.csv    300.0    NaN\"\n      ]\n     },\n     
\"execution_count\": 90,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"col_sniff = c.sniff_columns()\\n\",\n    \"c.combine_preview()[['filename']+col_sniff['columns_unique']]\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"You can pass the columns you want to rename to `columns_rename` and it will rename and align those columns.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 91,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# only select particular columns\\n\",\n    \"cfg_col_sel = ['date','sales','cost','profit'] # don't select profit2\\n\",\n    \"# rename columns\\n\",\n    \"cfg_col_rename = {'sales':'revenue'} # rename all instances of sales to revenue\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 92,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>revenue</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      
<th>filepath</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-ja...</td>\\n\",\n       \"      
<td>test-data-input-csv-renamed-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-renamed-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-renamed-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  revenue  cost  profit                                           filepath                             filename\\n\",\n       \"0  2011-02-01      200   -90     110  test-data/input/test-data-input-csv-renamed-fe...  
test-data-input-csv-renamed-feb.csv\\n\",\n       \"1  2011-02-02      200   -90     110  test-data/input/test-data-input-csv-renamed-fe...  test-data-input-csv-renamed-feb.csv\\n\",\n       \"2  2011-02-03      200   -90     110  test-data/input/test-data-input-csv-renamed-fe...  test-data-input-csv-renamed-feb.csv\\n\",\n       \"3  2011-01-01      100   -80      20  test-data/input/test-data-input-csv-renamed-ja...  test-data-input-csv-renamed-jan.csv\\n\",\n       \"4  2011-01-02      100   -80      20  test-data/input/test-data-input-csv-renamed-ja...  test-data-input-csv-renamed-jan.csv\\n\",\n       \"5  2011-01-03      100   -80      20  test-data/input/test-data-input-csv-renamed-ja...  test-data-input-csv-renamed-jan.csv\\n\",\n       \"6  2011-03-01      300  -100     200  test-data/input/test-data-input-csv-renamed-ma...  test-data-input-csv-renamed-mar.csv\\n\",\n       \"7  2011-03-02      300  -100     200  test-data/input/test-data-input-csv-renamed-ma...  test-data-input-csv-renamed-mar.csv\\n\",\n       \"8  2011-03-03      300  -100     200  test-data/input/test-data-input-csv-renamed-ma...  test-data-input-csv-renamed-mar.csv\"\n      ]\n     },\n     \"execution_count\": 92,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c = d6tc.CombinerCSV(cfg_fnames, columns_rename = cfg_col_rename, columns_select = cfg_col_sel) \\n\",\n    \"c.combine_preview() \\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Use Case: Identifying changes in column order\\n\",\n    \"\\n\",\n    \"If you read your files into a database, this is a real problem because it looks like the files are all the same whereas in fact they have changed. This is because programs like dask or SQL loaders assume the column order is the same. 
With `d6tstack` you can easily identify and fix such a case.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 93,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"['test-data/input/test-data-input-csv-reorder-jan.csv', 'test-data/input/test-data-input-csv-reorder-mar.csv', 'test-data/input/test-data-input-csv-reorder-feb.csv']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-reorder-*.csv'))\\n\",\n    \"print(cfg_fnames)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 94,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# get previews\\n\",\n    \"c = d6tc.CombinerCSV(cfg_fnames) # all_strings=True makes reading faster\\n\",\n    \"col_sniff = c.sniff_columns()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Here we can see that not all columns are equal.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 95,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"all columns equal? 
False\\n\",\n      \"\\n\",\n      \"in what order do columns appear in the files?\\n\",\n      \"\\n\",\n      \"   date  sales  cost  profit\\n\",\n      \"0     0      1     2       3\\n\",\n      \"1     0      1     2       3\\n\",\n      \"2     0      1     3       2\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"print('all columns equal?', col_sniff['is_all_equal'])\\n\",\n    \"print('')\\n\",\n    \"print('in what order do columns appear in the files?')\\n\",\n    \"print('')\\n\",\n    \"print(col_sniff['df_columns_order'].reset_index(drop=True))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 96,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>filepath</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n  
     \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    
<tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit                                           filepath                             filename\\n\",\n       \"0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"3  2011-01-01    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  
test-data-input-csv-reorder-jan.csv\\n\",\n       \"4  2011-01-02    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"5  2011-01-03    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"6  2011-03-01    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"7  2011-03-02    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"8  2011-03-03    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\"\n      ]\n     },\n     \"execution_count\": 96,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.combine_preview() # automatically puts it in the right order\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Customize separator and pass pd.read_csv() params\\n\",\n    \"\\n\",\n    \"You can pass additional parameters such as separators and any params for `pd.read_csv()` to the combiner.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 97,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"{'files_columns': {'test-data/input/test-data-input-csv-reorder-feb.csv': ['date', 'sales', 'cost', 'profit'], 'test-data/input/test-data-input-csv-reorder-jan.csv': ['date', 'sales', 'cost', 'profit'], 'test-data/input/test-data-input-csv-reorder-mar.csv': ['date', 'sales', 'profit', 'cost']}, 'columns_all': ['date', 'sales', 'cost', 'profit'], 'columns_common': ['date', 'sales', 'cost', 'profit'], 'columns_unique': [], 'is_all_equal': False, 'df_columns_present':                  
                                   date  sales  cost  profit\\n\",\n      \"file_path                                                                    \\n\",\n      \"test-data/input/test-data-input-csv-reorder-feb...  True   True  True    True\\n\",\n      \"test-data/input/test-data-input-csv-reorder-jan...  True   True  True    True\\n\",\n      \"test-data/input/test-data-input-csv-reorder-mar...  True   True  True    True, 'df_columns_order':                                                     date  sales  cost  profit\\n\",\n      \"test-data/input/test-data-input-csv-reorder-feb...     0      1     2       3\\n\",\n      \"test-data/input/test-data-input-csv-reorder-jan...     0      1     2       3\\n\",\n      \"test-data/input/test-data-input-csv-reorder-mar...     0      1     3       2}\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"c = d6tc.CombinerCSV(cfg_fnames, sep=',', read_csv_params={'header': None})\\n\",\n    \"col_sniff = c.sniff_columns()\\n\",\n    \"print(col_sniff)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# CSV out of core functionality\\n\",\n    \"\\n\",\n    \"If your files are large you don't want to read them all in memory and then save. 
Instead you can write directly to the output file.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 98,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"'test-data/output/test.csv'\"\n      ]\n     },\n     \"execution_count\": 98,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.to_csv_combine('test-data/output/test.csv')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Auto Detect pd.read_csv() settings\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Detect CSV settings across a single file\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 104,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{'delim': ',', 'skiprows': 0, 'has_header': True, 'header': 0}\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_sniff = d6tstack.sniffer.sniff_settings_csv([cfg_fnames[0]])\\n\",\n    \"print(cfg_sniff)\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Detect CSV settings across multiple files\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 105,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"{'delim': ',', 'skiprows': 0, 'has_header': True, 'header': 0}\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# finds the common CSV settings across all files\\n\",\n    \"cfg_sniff = d6tstack.sniffer.sniff_settings_csv(cfg_fnames)\\n\",\n    \"print(cfg_sniff)\\n\"\n   ]\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   
\"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.3\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples-dask.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# d6tstack with Dask\\n\",\n    \"\\n\",\n    \"Dask is a great library for out-of-core computing, but if the input files are not properly organized it quickly breaks. For example:\\n\",\n    \"\\n\",\n    \"1) If columns differ between files, dask won't even read the data, and it doesn't tell you what you need to do to fix it.\\n\",\n    \"\\n\",\n    \"2) If the column order is rearranged between files, dask will read the data, but into the wrong columns, and you won't notice.\\n\",\n    \"\\n\",\n    \"Dask can't handle those scenarios. With d6tstack you can easily fix the situation with just a few lines of code!\\n\",\n    \"\\n\",\n    \"For more instructions, examples and documentation see https://github.com/d6t/d6tstack\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Base Case: Columns are the same in all files\\n\",\n    \"As a base case, we have input files with consistent columns, which dask can easily read.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/opt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. 
Expected 96, got 88\\n\",\n      \"  return f(*args, **kwds)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      
<td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-02-08</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    
</tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \" 
     <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-03-06</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n     
  \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit\\n\",\n       \"0  2011-02-01    200   -90     110\\n\",\n       \"1  2011-02-02    200   -90     110\\n\",\n       \"2  2011-02-03    200   -90     110\\n\",\n       \"3  2011-02-04    200   -90     110\\n\",\n       \"4  2011-02-05    200   -90     110\\n\",\n       \"5  2011-02-06    200   -90     110\\n\",\n       \"6  2011-02-07    200   -90     110\\n\",\n       \"7  2011-02-08    200   -90     110\\n\",\n       \"8  2011-02-09    200   -90     110\\n\",\n       \"9  2011-02-10    200   -90     110\\n\",\n       \"0  2011-01-01    100   -80      20\\n\",\n       \"1  2011-01-02    100   -80      20\\n\",\n       \"2  2011-01-03    100   -80      20\\n\",\n       \"3  2011-01-04    100   -80      20\\n\",\n       \"4  2011-01-05    100   -80      20\\n\",\n       \"5  2011-01-06    100   -80      20\\n\",\n       \"6  2011-01-07    100   -80      20\\n\",\n       \"7  2011-01-08    100   -80      20\\n\",\n       \"8  2011-01-09    100   -80      20\\n\",\n       \"9  2011-01-10    100   -80      20\\n\",\n       \"0  2011-03-01    300  -100     200\\n\",\n       \"1  2011-03-02    300  -100     200\\n\",\n       \"2  2011-03-03    300  -100     200\\n\",\n       \"3  2011-03-04    300  -100     200\\n\",\n       \"4  2011-03-05    300  -100     200\\n\",\n       \"5  2011-03-06    300  -100     200\\n\",\n       \"6  2011-03-07    300  -100     200\\n\",\n       \"7  2011-03-08    300  -100     200\\n\",\n       \"8  2011-03-09    300  -100     200\\n\",\n       \"9  2011-03-10    300  -100     200\"\n      ]\n     },\n     \"execution_count\": 1,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import dask.dataframe as dd\\n\",\n    \"\\n\",\n    \"# consistent format\\n\",\n    \"ddf = dd.read_csv('test-data/input/test-data-input-csv-clean-*.csv')\\n\",\n    \"ddf.compute()\"\n   ]\n  },\n  {\n   \"cell_type\": 
\"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Problem Case 1: Columns are different between files\\n\",\n    \"That worked well. But what happens if your input files have inconsistent columns across files? Say for example one file has a new column that the other files don't have.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"ValueError\",\n     \"evalue\": \"Length mismatch: Expected axis has 5 elements, new values have 4 elements\",\n     \"output_type\": \"error\",\n     \"traceback\": [\n      \"\\u001b[1;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[1;31mValueError\\u001b[0m                                Traceback (most recent call last)\",\n      \"\\u001b[1;32m<ipython-input-23-1bbb9709e59f>\\u001b[0m in \\u001b[0;36m<module>\\u001b[1;34m()\\u001b[0m\\n\\u001b[0;32m      1\\u001b[0m \\u001b[1;31m# consistent format\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m      2\\u001b[0m \\u001b[0mddf\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mdd\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mread_csv\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;34m'test-data/input/test-data-input-csv-colmismatch-*.csv'\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m----> 3\\u001b[1;33m \\u001b[0mddf\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mcompute\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\base.py\\u001b[0m in \\u001b[0;36mcompute\\u001b[1;34m(self, **kwargs)\\u001b[0m\\n\\u001b[0;32m    153\\u001b[0m         \\u001b[0mdask\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mbase\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mcompute\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    154\\u001b[0m         
\\\"\\\"\\\"\\n\\u001b[1;32m--> 155\\u001b[1;33m         \\u001b[1;33m(\\u001b[0m\\u001b[0mresult\\u001b[0m\\u001b[1;33m,\\u001b[0m\\u001b[1;33m)\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mcompute\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mtraverse\\u001b[0m\\u001b[1;33m=\\u001b[0m\\u001b[1;32mFalse\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[1;33m**\\u001b[0m\\u001b[0mkwargs\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m    156\\u001b[0m         \\u001b[1;32mreturn\\u001b[0m \\u001b[0mresult\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    157\\u001b[0m \\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\base.py\\u001b[0m in \\u001b[0;36mcompute\\u001b[1;34m(*args, **kwargs)\\u001b[0m\\n\\u001b[0;32m    402\\u001b[0m     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)\\n\\u001b[0;32m    403\\u001b[0m                     else (None, a) for a in args]\\n\\u001b[1;32m--> 404\\u001b[1;33m     \\u001b[0mresults\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mget\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mdsk\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mkeys\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[1;33m**\\u001b[0m\\u001b[0mkwargs\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m    405\\u001b[0m     \\u001b[0mresults_iter\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0miter\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mresults\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    406\\u001b[0m     return tuple(a if f is None else f(next(results_iter), *a)\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\threaded.py\\u001b[0m in \\u001b[0;36mget\\u001b[1;34m(dsk, result, cache, num_workers, **kwargs)\\u001b[0m\\n\\u001b[0;32m     73\\u001b[0m     results = get_async(pool.apply_async, 
len(pool._pool), dsk, result,\\n\\u001b[0;32m     74\\u001b[0m                         \\u001b[0mcache\\u001b[0m\\u001b[1;33m=\\u001b[0m\\u001b[0mcache\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mget_id\\u001b[0m\\u001b[1;33m=\\u001b[0m\\u001b[0m_thread_get_id\\u001b[0m\\u001b[1;33m,\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m---> 75\\u001b[1;33m                         pack_exception=pack_exception, **kwargs)\\n\\u001b[0m\\u001b[0;32m     76\\u001b[0m \\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m     77\\u001b[0m     \\u001b[1;31m# Cleanup pools associated to dead threads\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\local.py\\u001b[0m in \\u001b[0;36mget_async\\u001b[1;34m(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)\\u001b[0m\\n\\u001b[0;32m    519\\u001b[0m                         \\u001b[0m_execute_task\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mtask\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mdata\\u001b[0m\\u001b[1;33m)\\u001b[0m  \\u001b[1;31m# Re-execute locally\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    520\\u001b[0m                     \\u001b[1;32melse\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m--> 521\\u001b[1;33m                         \\u001b[0mraise_exception\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mexc\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mtb\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m    522\\u001b[0m                 \\u001b[0mres\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mworker_id\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mloads\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mres_info\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    523\\u001b[0m                 
\\u001b[0mstate\\u001b[0m\\u001b[1;33m[\\u001b[0m\\u001b[1;34m'cache'\\u001b[0m\\u001b[1;33m]\\u001b[0m\\u001b[1;33m[\\u001b[0m\\u001b[0mkey\\u001b[0m\\u001b[1;33m]\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mres\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\compatibility.py\\u001b[0m in \\u001b[0;36mreraise\\u001b[1;34m(exc, tb)\\u001b[0m\\n\\u001b[0;32m     65\\u001b[0m         \\u001b[1;32mif\\u001b[0m \\u001b[0mexc\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0m__traceback__\\u001b[0m \\u001b[1;32mis\\u001b[0m \\u001b[1;32mnot\\u001b[0m \\u001b[0mtb\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m     66\\u001b[0m             \\u001b[1;32mraise\\u001b[0m \\u001b[0mexc\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mwith_traceback\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mtb\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m---> 67\\u001b[1;33m         \\u001b[1;32mraise\\u001b[0m \\u001b[0mexc\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m     68\\u001b[0m \\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m     69\\u001b[0m \\u001b[1;32melse\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\local.py\\u001b[0m in \\u001b[0;36mexecute_task\\u001b[1;34m(key, task_info, dumps, loads, get_id, pack_exception)\\u001b[0m\\n\\u001b[0;32m    288\\u001b[0m     \\u001b[1;32mtry\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    289\\u001b[0m         \\u001b[0mtask\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mdata\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mloads\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mtask_info\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m--> 290\\u001b[1;33m         \\u001b[0mresult\\u001b[0m \\u001b[1;33m=\\u001b[0m 
\\u001b[0m_execute_task\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mtask\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mdata\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m    291\\u001b[0m         \\u001b[0mid\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mget_id\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    292\\u001b[0m         \\u001b[0mresult\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mdumps\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mresult\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mid\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\local.py\\u001b[0m in \\u001b[0;36m_execute_task\\u001b[1;34m(arg, cache, dsk)\\u001b[0m\\n\\u001b[0;32m    269\\u001b[0m         \\u001b[0mfunc\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0margs\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0marg\\u001b[0m\\u001b[1;33m[\\u001b[0m\\u001b[1;36m0\\u001b[0m\\u001b[1;33m]\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0marg\\u001b[0m\\u001b[1;33m[\\u001b[0m\\u001b[1;36m1\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m]\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    270\\u001b[0m         \\u001b[0margs2\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[1;33m[\\u001b[0m\\u001b[0m_execute_task\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0ma\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mcache\\u001b[0m\\u001b[1;33m)\\u001b[0m \\u001b[1;32mfor\\u001b[0m \\u001b[0ma\\u001b[0m \\u001b[1;32min\\u001b[0m \\u001b[0margs\\u001b[0m\\u001b[1;33m]\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m--> 271\\u001b[1;33m         \\u001b[1;32mreturn\\u001b[0m \\u001b[0mfunc\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;33m*\\u001b[0m\\u001b[0margs2\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m    272\\u001b[0m     
\\u001b[1;32melif\\u001b[0m \\u001b[1;32mnot\\u001b[0m \\u001b[0mishashable\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0marg\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    273\\u001b[0m         \\u001b[1;32mreturn\\u001b[0m \\u001b[0marg\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\compatibility.py\\u001b[0m in \\u001b[0;36mapply\\u001b[1;34m(func, args, kwargs)\\u001b[0m\\n\\u001b[0;32m     46\\u001b[0m     \\u001b[1;32mdef\\u001b[0m \\u001b[0mapply\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mfunc\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0margs\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mkwargs\\u001b[0m\\u001b[1;33m=\\u001b[0m\\u001b[1;32mNone\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m     47\\u001b[0m         \\u001b[1;32mif\\u001b[0m \\u001b[0mkwargs\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m---> 48\\u001b[1;33m             \\u001b[1;32mreturn\\u001b[0m \\u001b[0mfunc\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;33m*\\u001b[0m\\u001b[0margs\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[1;33m**\\u001b[0m\\u001b[0mkwargs\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m     49\\u001b[0m         \\u001b[1;32melse\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m     50\\u001b[0m             \\u001b[1;32mreturn\\u001b[0m \\u001b[0mfunc\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;33m*\\u001b[0m\\u001b[0margs\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\dask\\\\dataframe\\\\io\\\\csv.py\\u001b[0m in \\u001b[0;36mpandas_read_text\\u001b[1;34m(reader, b, header, kwargs, dtypes, columns, write_header, enforce)\\u001b[0m\\n\\u001b[0;32m     69\\u001b[0m         
\\u001b[1;32mraise\\u001b[0m \\u001b[0mValueError\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;34m\\\"Columns do not match\\\"\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mdf\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mcolumns\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mcolumns\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m     70\\u001b[0m     \\u001b[1;32melif\\u001b[0m \\u001b[0mcolumns\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m---> 71\\u001b[1;33m         \\u001b[0mdf\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mcolumns\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mcolumns\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m     72\\u001b[0m     \\u001b[1;32mreturn\\u001b[0m \\u001b[0mdf\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m     73\\u001b[0m \\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\pandas\\\\core\\\\generic.py\\u001b[0m in \\u001b[0;36m__setattr__\\u001b[1;34m(self, name, value)\\u001b[0m\\n\\u001b[0;32m   3625\\u001b[0m         \\u001b[1;32mtry\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m   3626\\u001b[0m             \\u001b[0mobject\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0m__getattribute__\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mname\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m-> 3627\\u001b[1;33m             \\u001b[1;32mreturn\\u001b[0m \\u001b[0mobject\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0m__setattr__\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mname\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mvalue\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m   3628\\u001b[0m         \\u001b[1;32mexcept\\u001b[0m 
\\u001b[0mAttributeError\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m   3629\\u001b[0m             \\u001b[1;32mpass\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mpandas/_libs/properties.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.properties.AxisProperty.__set__\\u001b[1;34m()\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\pandas\\\\core\\\\generic.py\\u001b[0m in \\u001b[0;36m_set_axis\\u001b[1;34m(self, axis, labels)\\u001b[0m\\n\\u001b[0;32m    557\\u001b[0m \\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    558\\u001b[0m     \\u001b[1;32mdef\\u001b[0m \\u001b[0m_set_axis\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0maxis\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mlabels\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m:\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m--> 559\\u001b[1;33m         \\u001b[0mself\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0m_data\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0mset_axis\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[0maxis\\u001b[0m\\u001b[1;33m,\\u001b[0m \\u001b[0mlabels\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[0;32m    560\\u001b[0m         \\u001b[0mself\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0m_clear_item_cache\\u001b[0m\\u001b[1;33m(\\u001b[0m\\u001b[1;33m)\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m    561\\u001b[0m \\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;32mC:\\\\Anaconda3\\\\lib\\\\site-packages\\\\pandas\\\\core\\\\internals.py\\u001b[0m in \\u001b[0;36mset_axis\\u001b[1;34m(self, axis, new_labels)\\u001b[0m\\n\\u001b[0;32m   3072\\u001b[0m             raise ValueError('Length mismatch: Expected axis has %d elements, '\\n\\u001b[0;32m   3073\\u001b[0m                              \\u001b[1;34m'new values have %d elements'\\u001b[0m 
\\u001b[1;33m%\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m-> 3074\\u001b[1;33m                              (old_len, new_len))\\n\\u001b[0m\\u001b[0;32m   3075\\u001b[0m \\u001b[1;33m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m   3076\\u001b[0m         \\u001b[0mself\\u001b[0m\\u001b[1;33m.\\u001b[0m\\u001b[0maxes\\u001b[0m\\u001b[1;33m[\\u001b[0m\\u001b[0maxis\\u001b[0m\\u001b[1;33m]\\u001b[0m \\u001b[1;33m=\\u001b[0m \\u001b[0mnew_labels\\u001b[0m\\u001b[1;33m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[1;31mValueError\\u001b[0m: Length mismatch: Expected axis has 5 elements, new values have 4 elements\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# consistent format\\n\",\n    \"ddf = dd.read_csv('test-data/input/test-data-input-csv-colmismatch-*.csv')\\n\",\n    \"ddf.compute()\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Fixing the problem with d6tstack\\n\",\n    \"Urgh! There's no way to use these files in dask. You can't even tell what's going on. Which file caused the problem? Why did it cause a problem? All you know is that one file has more columns than the first file.\\n\",\n    \"\\n\",\n    \"You can either manually process those files or use d6tstack to easily check for such a situation and fix it with a few lines of code - no manual processing required. 
Let's take a look!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"all equal False\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>file_path</th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>test-data/input/test-data-input-csv-colmismatch-feb.csv</th>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>False</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>test-data/input/test-data-input-csv-colmismatch-jan.csv</th>\\n\",\n       \"      <td>True</td>\\n\",\n     
  \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>False</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>test-data/input/test-data-input-csv-colmismatch-mar.csv</th>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                                                    date  sales  cost  profit  profit2\\n\",\n       \"file_path                                                                             \\n\",\n       \"test-data/input/test-data-input-csv-colmismatch...  True   True  True    True    False\\n\",\n       \"test-data/input/test-data-input-csv-colmismatch...  True   True  True    True    False\\n\",\n       \"test-data/input/test-data-input-csv-colmismatch...  True   True  True    True     True\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import glob\\n\",\n    \"import d6tstack.combine_csv\\n\",\n    \"\\n\",\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))\\n\",\n    \"c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)\\n\",\n    \"\\n\",\n    \"# check columns\\n\",\n    \"print('all equal',c.is_all_equal())\\n\",\n    \"print('')\\n\",\n    \"c.is_column_present()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Before using dask you can quickly use d6stack to check if all colums are consistent with `d6tstack.combine_csv.CombinerCSV.is_all_equal()`. 
If they are not consistent, you can easily see which files are causing problems with `d6tstack.combine_csv.CombinerCSV.is_column_present()`. In this case, there is a new column \\\"profit2\\\" in \\\"test-data-input-csv-colmismatch-mar.csv\\\".\\n\",\n    \"\\n\",\n    \"**Let's use d6tstack to fix the situation.** We will use out-of-core processing with `d6tstack.combine_csv.CombinerCSV.to_csv_align()` to save all files with consistent columns. Any missing data is filled with NaN (to keep only common columns, use `cfg_col_sel=c.col_preview['columns_common']`). Just 2 lines of code! \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"writing test-data/output/d6tstack-test-data-input-csv-colmismatch-feb.csv ok\\n\",\n      \"writing test-data/output/d6tstack-test-data-input-csv-colmismatch-jan.csv ok\\n\",\n      \"writing test-data/output/d6tstack-test-data-input-csv-colmismatch-mar.csv ok\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# out-of-core combining\\n\",\n    \"fnames = d6tstack.combine_csv.CombinerCSV(cfg_fnames).to_csv_align(output_dir='test-data/output/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"NB: Instead of `to_csv_align()` you can also run `to_csv_combine()` which creates a single combined file.\\n\",\n    \"\\n\",\n    \"Now you can read this in dask and do whatever you wanted to do in the first place.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       
\"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>filepath</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      
<td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-02-08</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n      
 \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      
<td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       
\"      <td>2011-01-08</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      
<td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-03-06</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      
<td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-colmismatc...</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit  profit2                                           filepath                                 filename\\n\",\n       \"0  2011-02-01    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"1  2011-02-02    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  
test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"2  2011-02-03    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"3  2011-02-04    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"4  2011-02-05    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"5  2011-02-06    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"6  2011-02-07    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"7  2011-02-08    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"8  2011-02-09    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"9  2011-02-10    200   -90     110      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-feb.csv\\n\",\n       \"0  2011-01-01    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"1  2011-01-02    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"2  2011-01-03    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"3  2011-01-04    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"4  2011-01-05    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  
test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"5  2011-01-06    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"6  2011-01-07    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"7  2011-01-08    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"8  2011-01-09    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"9  2011-01-10    100   -80      20      NaN  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-jan.csv\\n\",\n       \"0  2011-03-01    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"1  2011-03-02    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"2  2011-03-03    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"3  2011-03-04    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"4  2011-03-05    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"5  2011-03-06    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"6  2011-03-07    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"7  2011-03-08    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  
test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"8  2011-03-09    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\\n\",\n       \"9  2011-03-10    300  -100     200    400.0  test-data/input/test-data-input-csv-colmismatc...  test-data-input-csv-colmismatch-mar.csv\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# consistent format\\n\",\n    \"ddf = dd.read_csv('test-data/output/d6tstack-test-data-input-csv-colmismatch-*.csv')\\n\",\n    \"ddf.compute()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Problem Case 2: Columns are reordered between files\\n\",\n    \"This is a sneaky case. The columns are the same but the order is different! Dask will read everything just fine without a warning but your data is totally messed up!\\n\",\n    \"\\n\",\n    \"In the example below, the \\\"profit\\\" column contains data from the \\\"cost\\\" column!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      
<th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-02-08</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    
<tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n 
      \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      
<td>2011-03-06</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit\\n\",\n       \"0  2011-02-01    200   -90     110\\n\",\n       \"1  2011-02-02    200   -90     110\\n\",\n       \"2  2011-02-03    200   -90     110\\n\",\n       \"3  2011-02-04    200   -90     110\\n\",\n       \"4  2011-02-05    200   -90     110\\n\",\n       \"5  2011-02-06    200   -90     110\\n\",\n       \"6  2011-02-07    200   -90     110\\n\",\n       \"7  2011-02-08    200   -90     110\\n\",\n       \"8  2011-02-09    200   -90     110\\n\",\n       \"9  2011-02-10    200   -90     110\\n\",\n       \"0  2011-01-01    100   -80      20\\n\",\n       \"1  2011-01-02    100   -80      20\\n\",\n       \"2  2011-01-03    100   -80      20\\n\",\n       \"3  2011-01-04    100   -80      20\\n\",\n       \"4  2011-01-05    100   -80      
20\\n\",\n       \"5  2011-01-06    100   -80      20\\n\",\n       \"6  2011-01-07    100   -80      20\\n\",\n       \"7  2011-01-08    100   -80      20\\n\",\n       \"8  2011-01-09    100   -80      20\\n\",\n       \"9  2011-01-10    100   -80      20\\n\",\n       \"0  2011-03-01    300   200    -100\\n\",\n       \"1  2011-03-02    300   200    -100\\n\",\n       \"2  2011-03-03    300   200    -100\\n\",\n       \"3  2011-03-04    300   200    -100\\n\",\n       \"4  2011-03-05    300   200    -100\\n\",\n       \"5  2011-03-06    300   200    -100\\n\",\n       \"6  2011-03-07    300   200    -100\\n\",\n       \"7  2011-03-08    300   200    -100\\n\",\n       \"8  2011-03-09    300   200    -100\\n\",\n       \"9  2011-03-10    300   200    -100\"\n      ]\n     },\n     \"execution_count\": 8,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# consistent format\\n\",\n    \"ddf = dd.read_csv('test-data/input/test-data-input-csv-reorder-*.csv')\\n\",\n    \"ddf.compute()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 9,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"all columns equal? 
False\\n\",\n      \"\\n\",\n      \"in what order do columns appear in the files?\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   date  sales  cost  profit\\n\",\n       \"0     0      1     2       3\\n\",\n       \"1     0      1     2       3\\n\",\n       \"2     0      1     3       2\"\n      ]\n     },\n     
\"execution_count\": 9,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-reorder-*.csv'))\\n\",\n    \"c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)\\n\",\n    \"\\n\",\n    \"# check columns\\n\",\n    \"col_sniff = c.sniff_columns()\\n\",\n    \"print('all columns equal?' , c.is_all_equal())\\n\",\n    \"print('')\\n\",\n    \"print('in what order do columns appear in the files?')\\n\",\n    \"print('')\\n\",\n    \"col_sniff['df_columns_order'].reset_index(drop=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Again, just a useful check before loading data into dask you can see that the columns don't line up. It's very fast to run because it only reads the headers, there's NO reason for you NOT to do it from a QA perspective.\\n\",\n    \"\\n\",\n    \"Same as above, the fix is the same few lines of code with d6stack.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"writing test-data/output/d6tstack-test-data-input-csv-reorder-feb.csv ok\\n\",\n      \"writing test-data/output/d6tstack-test-data-input-csv-reorder-jan.csv ok\\n\",\n      \"writing test-data/output/d6tstack-test-data-input-csv-reorder-mar.csv ok\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# out-of-core combining\\n\",\n    \"fnames = d6tstack.combine_csv.CombinerCSV(cfg_fnames).to_csv_align(output_dir='test-data/output/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        
vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>filepath</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>3</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-02-08</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \" 
     <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-fe...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      
<td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      
<td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ja...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    
<tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-03-06</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"    
  <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data/input/test-data-input-csv-reorder-ma...</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date  sales  cost  profit                                           filepath                             filename\\n\",\n       \"0  2011-02-01    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"1  2011-02-02    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"2  2011-02-03    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"3  2011-02-04    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"4  2011-02-05    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"5  2011-02-06    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"6  2011-02-07    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"7  2011-02-08    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"8  2011-02-09    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"9  2011-02-10    200   -90     110  test-data/input/test-data-input-csv-reorder-fe...  test-data-input-csv-reorder-feb.csv\\n\",\n       \"0  2011-01-01    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  
test-data-input-csv-reorder-jan.csv\\n\",\n       \"1  2011-01-02    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"2  2011-01-03    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"3  2011-01-04    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"4  2011-01-05    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"5  2011-01-06    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"6  2011-01-07    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"7  2011-01-08    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"8  2011-01-09    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"9  2011-01-10    100   -80      20  test-data/input/test-data-input-csv-reorder-ja...  test-data-input-csv-reorder-jan.csv\\n\",\n       \"0  2011-03-01    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"1  2011-03-02    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"2  2011-03-03    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"3  2011-03-04    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"4  2011-03-05    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  
test-data-input-csv-reorder-mar.csv\\n\",\n       \"5  2011-03-06    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"6  2011-03-07    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"7  2011-03-08    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"8  2011-03-09    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\\n\",\n       \"9  2011-03-10    300  -100     200  test-data/input/test-data-input-csv-reorder-ma...  test-data-input-csv-reorder-mar.csv\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# consistent format\\n\",\n    \"ddf = dd.read_csv('test-data/output/d6tstack-test-data-input-csv-reorder-*.csv')\\n\",\n    \"ddf.compute()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.3\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples-excel.ipynb",
"content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Data Engineering in Python with databolt - Quickly Extract Data from Excel Files (d6tlib/d6tstack)\n\",\n    \"\n\",\n    \"Excel files are very common because non-technical users like accessing and manipulating data in Excel. For data engineering and data science, however, those Excel files are not easy to read; for example, `dask` and `pyspark` don't read Excel files.\n\",\n    \"\n\",\n    \"**In this workbook we will demonstrate how to use d6tstack to quickly extract data from messy Excel files into clean CSV data.**\n\",\n    \"\n\",\n    \"We will be covering the following use cases:\n\",\n    \"* Check tab consistency across multiple files\n\",\n    \"* Extract tabs from multiple Excel files\n\",\n    \"* Extract all tabs from an Excel file\n\",\n    \"* Extract data from unstructured files\n\",\n    \"* Clean empty columns and rows\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 35,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import d6tstack.convert_xls\n\",\n    \"from d6tstack.convert_xls import XLSSniffer\n\",\n    \"from d6tstack.utils import PrintLogger\n\",\n    \"\n\",\n    \"import pandas as pd\n\",\n    \"import dask.dataframe as dd\n\",\n    \"import glob\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Get sample data\n\",\n    \"\n\",\n    \"We've created some dummy sample data which you can download. 
\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import urllib.request\\n\",\n    \"cfg_fname_sample = 'test-data-xls.zip'\\n\",\n    \"urllib.request.urlretrieve(\\\"https://github.com/d6t/d6tstack/raw/master/\\\"+cfg_fname_sample, cfg_fname_sample)\\n\",\n    \"import zipfile\\n\",\n    \"zip_ref = zipfile.ZipFile(cfg_fname_sample, 'r')\\n\",\n    \"zip_ref.extractall('.')\\n\",\n    \"zip_ref.close()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Use Case: Extract all sheets from a single file\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 10,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"converting file: sample-xls-case-multisheet.xlsx | sheet:  ok\\n\",\n      \"converting file: sample-xls-case-multisheet.xlsx | sheet:  ok\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/mnt/data/dev/d6t-lib/d6tstack/d6tstack/convert_xls.py:72: UserWarning: File test-data/excel_adv_data/sample-xls-case-multisheet.xlsx exists, skipping\\n\",\n      \"  warnings.warn('File %s exists, skipping' %fname)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['test-data/output/sample-xls-case-multisheet.xlsx-Sheet1.csv',\\n\",\n       \" 'test-data/output/sample-xls-case-multisheet.xlsx-Sheet2.csv']\"\n      ]\n     },\n     \"execution_count\": 10,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c = d6tstack.convert_xls.XLStoCSVMultiSheet('test-data/excel_adv_data/sample-xls-case-multisheet.xlsx', \\n\",\n    \"                                            output_dir = 'test-data/output', logger=PrintLogger())\\n\",\n    
\"c.convert_all()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 11,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>ticker</th>\\n\",\n       \"      <th>data</th>\\n\",\n       \"      <th>xls_sheet</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2018-01-01</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.672460</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2018-01-02</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.359553</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2018-01-03</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.813146</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2018-01-04</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-1.726283</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \" 
   </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2018-01-05</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>0.177426</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date ticker      data xls_sheet\\n\",\n       \"0  2018-01-01    AAP -0.672460    Sheet1\\n\",\n       \"1  2018-01-02    AAP -0.359553    Sheet1\\n\",\n       \"2  2018-01-03    AAP -0.813146    Sheet1\\n\",\n       \"3  2018-01-04    AAP -1.726283    Sheet1\\n\",\n       \"4  2018-01-05    AAP  0.177426    Sheet1\"\n      ]\n     },\n     \"execution_count\": 11,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ddf = dd.read_csv('test-data/output/sample-xls-case-multisheet.xlsx-*.csv')\\n\",\n    \"ddf.compute().head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Use Case: Extract a sheet from multiple files\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Checking if the sheet exists across all files\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 18,\n   \"metadata\": {},\n   \"outputs\": [],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/excel_adv_data/sample-xls-case-multifile*.xlsx'))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 20,\n   \"metadata\": {\n    \"scrolled\": true\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"all files have same sheet count? True\\n\",\n      \"\\n\",\n      \"all files have same sheet names? True\\n\",\n      \"\\n\",\n      \"all files contain sheet? 
True\\n\",\n      \"\\n\",\n      \"detailed dataframe\\n\",\n      \"\\n\",\n      \"                         file_name sheets_count sheets_idx sheets_names\\n\",\n      \"0  sample-xls-case-multifile1.xlsx            1        [0]     [Sheet1]\\n\",\n      \"1  sample-xls-case-multifile2.xlsx            1        [0]     [Sheet1]\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"# finds sheets across all files\\n\",\n    \"sniffer = XLSSniffer(cfg_fnames)\\n\",\n    \"\\n\",\n    \"print('all files have same sheet count?', sniffer.all_same_count())\\n\",\n    \"print('')\\n\",\n    \"print('all files have same sheet names?', sniffer.all_same_names())\\n\",\n    \"print('')\\n\",\n    \"print('all files contain sheet?', sniffer.all_contain_sheetname('Sheet1'))\\n\",\n    \"print('')\\n\",\n    \"print('detailed dataframe')\\n\",\n    \"print('')\\n\",\n    \"print(sniffer.df_xls_sheets.reset_index(drop=True).head())\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Extracting to csv\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 23,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"converting file: sample-xls-case-multifile1.xlsx | sheet: Sheet1 ok\\n\",\n      \"converting file: sample-xls-case-multifile2.xlsx | sheet: Sheet1 ok\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['test-data/output/sample-xls-case-multifile1.xlsx-Sheet1.csv',\\n\",\n       \" 'test-data/output/sample-xls-case-multifile2.xlsx-Sheet1.csv']\"\n      ]\n     },\n     \"execution_count\": 23,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"d6tstack.convert_xls.XLStoCSVMultiFile(cfg_fnames,output_dir = 'test-data/output',\\n\",\n    \"                                       
cfg_xls_sheets_sel_mode='name_global',cfg_xls_sheets_sel='Sheet1',logger=PrintLogger()).convert_all()\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 25,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>ticker</th>\\n\",\n       \"      <th>data</th>\\n\",\n       \"      <th>xls_sheet</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2018-01-01</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.353994</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2018-01-02</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-1.374951</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2018-01-03</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.643618</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2018-01-04</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n  
     \"      <td>-2.223403</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2018-01-05</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>0.625231</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date ticker      data xls_sheet\\n\",\n       \"0  2018-01-01    AAP -0.353994    Sheet1\\n\",\n       \"1  2018-01-02    AAP -1.374951    Sheet1\\n\",\n       \"2  2018-01-03    AAP -0.643618    Sheet1\\n\",\n       \"3  2018-01-04    AAP -2.223403    Sheet1\\n\",\n       \"4  2018-01-05    AAP  0.625231    Sheet1\"\n      ]\n     },\n     \"execution_count\": 25,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"ddf = dd.read_csv('test-data/output/sample-xls-case-multifile1.xlsx-*.csv')\\n\",\n    \"ddf.compute().head()\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Use Case: Extract a sheet from multiple files, with complex layout\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"### Checking if the sheet exists across all files\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 26,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"1\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/excel_adv_data/sample-xls-case-badlayout1*.xlsx'))\\n\",\n    \"print(len(cfg_fnames))\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 36,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \" 
   .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>Unnamed: 0</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>ticker</th>\\n\",\n       \"      <th>data</th>\\n\",\n       \"      <th>xls_sheet</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>2018-01-01</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-1.306527</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>2018-01-02</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>1.658131</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>2018-01-03</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.118164</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>2018-01-04</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.680178</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>4</th>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>2018-01-05</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>0.666383</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   Unnamed: 0       date ticker      data xls_sheet\\n\",\n       \"0         NaN 2018-01-01    AAP -1.306527    Sheet1\\n\",\n       \"1         NaN 2018-01-02    AAP  1.658131    Sheet1\\n\",\n       \"2         NaN 2018-01-03    AAP -0.118164    Sheet1\\n\",\n       \"3         NaN 2018-01-04    AAP -0.680178    Sheet1\\n\",\n       \"4         NaN 2018-01-05    AAP  0.666383    Sheet1\"\n      ]\n     },\n     \"execution_count\": 36,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"pd.read_excel(cfg_fnames[0]).head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 38,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>ticker</th>\\n\",\n       \"      <th>data</th>\\n\",\n       \"      <th>xls_sheet</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n    
   \"      <th>0</th>\\n\",\n       \"      <td>2018-01-01</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-1.306527</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2018-01-02</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>1.658131</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2018-01-03</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.118164</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2018-01-04</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.680178</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2018-01-05</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>0.666383</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"        date ticker      data xls_sheet\\n\",\n       \"0 2018-01-01    AAP -1.306527    Sheet1\\n\",\n       \"1 2018-01-02    AAP  1.658131    Sheet1\\n\",\n       \"2 2018-01-03    AAP -0.118164    Sheet1\\n\",\n       \"3 2018-01-04    AAP -0.680178    Sheet1\\n\",\n       \"4 2018-01-05    AAP  0.666383    Sheet1\"\n      ]\n     },\n     \"execution_count\": 38,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"d6tstack.convert_xls.read_excel_advanced(cfg_fnames[0],\\n\",\n    \"                                   sheet_name='Sheet1', header_xls_range=\\\"B2:E2\\\").head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 
39,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"converting file: sample-xls-case-badlayout1.xlsx | sheet:  ok\\n\",\n      \"converting file: sample-xls-case-badlayout1.xlsx | sheet:  ok\\n\"\n     ]\n    },\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/mnt/data/dev/d6t-lib/d6tstack/d6tstack/convert_xls.py:72: UserWarning: File test-data/excel_adv_data/sample-xls-case-badlayout1.xlsx exists, skipping\\n\",\n      \"  warnings.warn('File %s exists, skipping' %fname)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"['test-data/output/sample-xls-case-badlayout1.xlsx-Sheet1.csv',\\n\",\n       \" 'test-data/output/sample-xls-case-badlayout1.xlsx-Sheet2.csv']\"\n      ]\n     },\n     \"execution_count\": 39,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c = d6tstack.convert_xls.XLStoCSVMultiSheet(cfg_fnames[0],output_dir = 'test-data/output',logger=PrintLogger())\\n\",\n    \"c.convert_all(header_xls_range=\\\"B2:B2\\\")\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 40,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"ValueError\",\n     \"evalue\": \"Length mismatch: Expected axis has 1 elements, new values have 4 elements\",\n     \"output_type\": \"error\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mValueError\\u001b[0m                                Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-40-bbc76ddb79a3>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[1;32m      1\\u001b[0m \\u001b[0mddf\\u001b[0m \\u001b[0;34m=\\u001b[0m 
\\u001b[0mdd\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m'test-data/output/sample-xls-case-badlayout1.xlsx-*.csv'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m----> 2\\u001b[0;31m \\u001b[0mddf\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcompute\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mhead\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/base.py\\u001b[0m in \\u001b[0;36mcompute\\u001b[0;34m(self, **kwargs)\\u001b[0m\\n\\u001b[1;32m    154\\u001b[0m         \\u001b[0mdask\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mbase\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcompute\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    155\\u001b[0m         \\\"\\\"\\\"\\n\\u001b[0;32m--> 156\\u001b[0;31m         \\u001b[0;34m(\\u001b[0m\\u001b[0mresult\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mcompute\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mtraverse\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mFalse\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m**\\u001b[0m\\u001b[0mkwargs\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    157\\u001b[0m         \\u001b[0;32mreturn\\u001b[0m \\u001b[0mresult\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    158\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/base.py\\u001b[0m in \\u001b[0;36mcompute\\u001b[0;34m(*args, **kwargs)\\u001b[0m\\n\\u001b[1;32m    400\\u001b[0m     \\u001b[0mkeys\\u001b[0m \\u001b[0;34m=\\u001b[0m 
\\u001b[0;34m[\\u001b[0m\\u001b[0mx\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m__dask_keys__\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;32mfor\\u001b[0m \\u001b[0mx\\u001b[0m \\u001b[0;32min\\u001b[0m \\u001b[0mcollections\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    401\\u001b[0m     \\u001b[0mpostcomputes\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0;34m[\\u001b[0m\\u001b[0mx\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m__dask_postcompute__\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;32mfor\\u001b[0m \\u001b[0mx\\u001b[0m \\u001b[0;32min\\u001b[0m \\u001b[0mcollections\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 402\\u001b[0;31m     \\u001b[0mresults\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mschedule\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mdsk\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mkeys\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m**\\u001b[0m\\u001b[0mkwargs\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    403\\u001b[0m     \\u001b[0;32mreturn\\u001b[0m \\u001b[0mrepack\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m[\\u001b[0m\\u001b[0mf\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mr\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m*\\u001b[0m\\u001b[0ma\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;32mfor\\u001b[0m \\u001b[0mr\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m(\\u001b[0m\\u001b[0mf\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0ma\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;32min\\u001b[0m \\u001b[0mzip\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mresults\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mpostcomputes\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    404\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      
\"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/threaded.py\\u001b[0m in \\u001b[0;36mget\\u001b[0;34m(dsk, result, cache, num_workers, **kwargs)\\u001b[0m\\n\\u001b[1;32m     73\\u001b[0m     results = get_async(pool.apply_async, len(pool._pool), dsk, result,\\n\\u001b[1;32m     74\\u001b[0m                         \\u001b[0mcache\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mcache\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mget_id\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0m_thread_get_id\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m---> 75\\u001b[0;31m                         pack_exception=pack_exception, **kwargs)\\n\\u001b[0m\\u001b[1;32m     76\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     77\\u001b[0m     \\u001b[0;31m# Cleanup pools associated to dead threads\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/local.py\\u001b[0m in \\u001b[0;36mget_async\\u001b[0;34m(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)\\u001b[0m\\n\\u001b[1;32m    519\\u001b[0m                         \\u001b[0m_execute_task\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mtask\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mdata\\u001b[0m\\u001b[0;34m)\\u001b[0m  \\u001b[0;31m# Re-execute locally\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    520\\u001b[0m                     \\u001b[0;32melse\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 521\\u001b[0;31m                         \\u001b[0mraise_exception\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mexc\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mtb\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    522\\u001b[0m                 \\u001b[0mres\\u001b[0m\\u001b[0;34m,\\u001b[0m 
\\u001b[0mworker_id\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mloads\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mres_info\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    523\\u001b[0m                 \\u001b[0mstate\\u001b[0m\\u001b[0;34m[\\u001b[0m\\u001b[0;34m'cache'\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m[\\u001b[0m\\u001b[0mkey\\u001b[0m\\u001b[0;34m]\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mres\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/compatibility.py\\u001b[0m in \\u001b[0;36mreraise\\u001b[0;34m(exc, tb)\\u001b[0m\\n\\u001b[1;32m     67\\u001b[0m         \\u001b[0;32mif\\u001b[0m \\u001b[0mexc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m__traceback__\\u001b[0m \\u001b[0;32mis\\u001b[0m \\u001b[0;32mnot\\u001b[0m \\u001b[0mtb\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     68\\u001b[0m             \\u001b[0;32mraise\\u001b[0m \\u001b[0mexc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mwith_traceback\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mtb\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m---> 69\\u001b[0;31m         \\u001b[0;32mraise\\u001b[0m \\u001b[0mexc\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m     70\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     71\\u001b[0m \\u001b[0;32melse\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/local.py\\u001b[0m in \\u001b[0;36mexecute_task\\u001b[0;34m(key, task_info, dumps, loads, get_id, pack_exception)\\u001b[0m\\n\\u001b[1;32m    288\\u001b[0m     \\u001b[0;32mtry\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    289\\u001b[0m         \\u001b[0mtask\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mdata\\u001b[0m \\u001b[0;34m=\\u001b[0m 
\\u001b[0mloads\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mtask_info\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 290\\u001b[0;31m         \\u001b[0mresult\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0m_execute_task\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mtask\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mdata\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    291\\u001b[0m         \\u001b[0mid\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mget_id\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    292\\u001b[0m         \\u001b[0mresult\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mdumps\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mresult\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mid\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/local.py\\u001b[0m in \\u001b[0;36m_execute_task\\u001b[0;34m(arg, cache, dsk)\\u001b[0m\\n\\u001b[1;32m    269\\u001b[0m         \\u001b[0mfunc\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0margs\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0marg\\u001b[0m\\u001b[0;34m[\\u001b[0m\\u001b[0;36m0\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0marg\\u001b[0m\\u001b[0;34m[\\u001b[0m\\u001b[0;36m1\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    270\\u001b[0m         \\u001b[0margs2\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0;34m[\\u001b[0m\\u001b[0m_execute_task\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0ma\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mcache\\u001b[0m\\u001b[0;34m)\\u001b[0m \\u001b[0;32mfor\\u001b[0m \\u001b[0ma\\u001b[0m \\u001b[0;32min\\u001b[0m \\u001b[0margs\\u001b[0m\\u001b[0;34m]\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 271\\u001b[0;31m         
\\u001b[0;32mreturn\\u001b[0m \\u001b[0mfunc\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m*\\u001b[0m\\u001b[0margs2\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    272\\u001b[0m     \\u001b[0;32melif\\u001b[0m \\u001b[0;32mnot\\u001b[0m \\u001b[0mishashable\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0marg\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    273\\u001b[0m         \\u001b[0;32mreturn\\u001b[0m \\u001b[0marg\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/compatibility.py\\u001b[0m in \\u001b[0;36mapply\\u001b[0;34m(func, args, kwargs)\\u001b[0m\\n\\u001b[1;32m     48\\u001b[0m     \\u001b[0;32mdef\\u001b[0m \\u001b[0mapply\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mfunc\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0margs\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mkwargs\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mNone\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     49\\u001b[0m         \\u001b[0;32mif\\u001b[0m \\u001b[0mkwargs\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m---> 50\\u001b[0;31m             \\u001b[0;32mreturn\\u001b[0m \\u001b[0mfunc\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m*\\u001b[0m\\u001b[0margs\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m**\\u001b[0m\\u001b[0mkwargs\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m     51\\u001b[0m         \\u001b[0;32melse\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     52\\u001b[0m             \\u001b[0;32mreturn\\u001b[0m \\u001b[0mfunc\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m*\\u001b[0m\\u001b[0margs\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      
\"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/dask/dataframe/io/csv.py\\u001b[0m in \\u001b[0;36mpandas_read_text\\u001b[0;34m(reader, b, header, kwargs, dtypes, columns, write_header, enforce)\\u001b[0m\\n\\u001b[1;32m     69\\u001b[0m         \\u001b[0;32mraise\\u001b[0m \\u001b[0mValueError\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m\\\"Columns do not match\\\"\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mdf\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcolumns\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mcolumns\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     70\\u001b[0m     \\u001b[0;32melif\\u001b[0m \\u001b[0mcolumns\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m---> 71\\u001b[0;31m         \\u001b[0mdf\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcolumns\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mcolumns\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m     72\\u001b[0m     \\u001b[0;32mreturn\\u001b[0m \\u001b[0mdf\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     73\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py\\u001b[0m in \\u001b[0;36m__setattr__\\u001b[0;34m(self, name, value)\\u001b[0m\\n\\u001b[1;32m   4387\\u001b[0m         \\u001b[0;32mtry\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   4388\\u001b[0m             \\u001b[0mobject\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m__getattribute__\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mname\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 4389\\u001b[0;31m             \\u001b[0;32mreturn\\u001b[0m \\u001b[0mobject\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m__setattr__\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m,\\u001b[0m 
\\u001b[0mname\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mvalue\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m   4390\\u001b[0m         \\u001b[0;32mexcept\\u001b[0m \\u001b[0mAttributeError\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   4391\\u001b[0m             \\u001b[0;32mpass\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/properties.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.properties.AxisProperty.__set__\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py\\u001b[0m in \\u001b[0;36m_set_axis\\u001b[0;34m(self, axis, labels)\\u001b[0m\\n\\u001b[1;32m    644\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    645\\u001b[0m     \\u001b[0;32mdef\\u001b[0m \\u001b[0m_set_axis\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0maxis\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mlabels\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 646\\u001b[0;31m         \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_data\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mset_axis\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0maxis\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mlabels\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    647\\u001b[0m         \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_clear_item_cache\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    648\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py\\u001b[0m in \\u001b[0;36mset_axis\\u001b[0;34m(self, axis, new_labels)\\u001b[0m\\n\\u001b[1;32m   3321\\u001b[0m             raise ValueError(\\n\\u001b[1;32m   
3322\\u001b[0m                 \\u001b[0;34m'Length mismatch: Expected axis has {old} elements, new '\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 3323\\u001b[0;31m                 'values have {new} elements'.format(old=old_len, new=new_len))\\n\\u001b[0m\\u001b[1;32m   3324\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   3325\\u001b[0m         \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0maxes\\u001b[0m\\u001b[0;34m[\\u001b[0m\\u001b[0maxis\\u001b[0m\\u001b[0;34m]\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mnew_labels\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;31mValueError\\u001b[0m: Length mismatch: Expected axis has 1 elements, new values have 4 elements\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"ddf = dd.read_csv('test-data/output/sample-xls-case-badlayout1.xlsx-*.csv')\\n\",\n    \"ddf.compute().head() # dask breaks! use d6tstack.combine_csv\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 42,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>ticker</th>\\n\",\n       \"      <th>data</th>\\n\",\n       \"      <th>xls_sheet</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \" 
 <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2018-01-01</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-1.306526851735317</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"      <td>sample-xls-case-badlayout1.xlsx-Sheet1.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2018-01-02</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>1.658130679618188</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"      <td>sample-xls-case-badlayout1.xlsx-Sheet1.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2018-01-03</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.1181640451285698</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"      <td>sample-xls-case-badlayout1.xlsx-Sheet1.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>2018-01-04</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>-0.6801782039968504</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"      <td>sample-xls-case-badlayout1.xlsx-Sheet1.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2018-01-05</td>\\n\",\n       \"      <td>AAP</td>\\n\",\n       \"      <td>0.6663830820319143</td>\\n\",\n       \"      <td>Sheet1</td>\\n\",\n       \"      <td>sample-xls-case-badlayout1.xlsx-Sheet1.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date ticker                 data xls_sheet  \\\\\\n\",\n       \"0  2018-01-01    AAP   -1.306526851735317    Sheet1   \\n\",\n       \"1  2018-01-02    AAP    1.658130679618188    Sheet1   \\n\",\n       \"2  2018-01-03    AAP  
-0.1181640451285698    Sheet1   \\n\",\n       \"3  2018-01-04    AAP  -0.6801782039968504    Sheet1   \\n\",\n       \"4  2018-01-05    AAP   0.6663830820319143    Sheet1   \\n\",\n       \"\\n\",\n       \"                                     filename  \\n\",\n       \"0  sample-xls-case-badlayout1.xlsx-Sheet1.csv  \\n\",\n       \"1  sample-xls-case-badlayout1.xlsx-Sheet1.csv  \\n\",\n       \"2  sample-xls-case-badlayout1.xlsx-Sheet1.csv  \\n\",\n       \"3  sample-xls-case-badlayout1.xlsx-Sheet1.csv  \\n\",\n       \"4  sample-xls-case-badlayout1.xlsx-Sheet1.csv  \"\n      ]\n     },\n     \"execution_count\": 42,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/output/sample-xls-case-badlayout1.xlsx-*.csv'))\\n\",\n    \"len(cfg_fnames)\\n\",\n    \"c = d6tstack.combine_csv.CombinerCSV(cfg_fnames, all_strings=True)\\n\",\n    \"c.combine().head()\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.3\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples-pyspark.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# d6tstack with pyspark\\n\",\n    \"\\n\",\n    \"Pyspark is a great library for out-of-core computing. But if input files are not properly organized it quickly breaks. For example:\\n\",\n    \"\\n\",\n    \"1) if columns are different between files: [unlike dask](https://github.com/d6t/d6tstack/blob/master/examples-dask.ipynb) pyspark actually handles that\\n\",\n    \"\\n\",\n    \"2) if column order is rearranged between files it will read data, but into the wrong columns and you won't notice it\\n\",\n    \"\\n\",\n    \"3) if columns are named between files, you'll have to manually fix the situation\\n\",\n    \"\\n\",\n    \"Pyspark can't easily handle those scenarios. With d6tstack you can easily fix the situation with just a few lines of code!\\n\",\n    \"\\n\",\n    \"For more instructions, examples and documentation see https://github.com/d6t/d6tstack\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import findspark\\n\",\n    \"findspark.init(r'E:\\\\progs.install\\\\spark-2.2.0-bin-hadoop2.7')\\n\",\n    \"\\n\",\n    \"import pyspark\\n\",\n    \"sc = pyspark.SparkContext(appName=\\\"myAppName\\\")\\n\",\n    \"from pyspark.sql import SQLContext\\n\",\n    \"sqlc = SQLContext(sc)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Base Case: Columns are same between all files\\n\",\n    \"As a base case, we have input files which have consistent input columns and thus can be easily read in dask.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 2,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type 
{\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      
<td>-100</td>\\n\",\n       \"      <td>2011-03-06</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>10</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>11</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>12</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>13</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"  
  </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>14</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>15</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>16</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>17</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-08</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>18</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>19</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>20</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>21</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>22</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      
<td>2011-01-03</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>23</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>24</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>25</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>26</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>27</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>28</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>29</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"    cost        date profit sales\\n\",\n       \"0   -100  2011-03-01    200   300\\n\",\n       \"1   -100  2011-03-02    200   300\\n\",\n     
  \"2   -100  2011-03-03    200   300\\n\",\n       \"3   -100  2011-03-04    200   300\\n\",\n       \"4   -100  2011-03-05    200   300\\n\",\n       \"5   -100  2011-03-06    200   300\\n\",\n       \"6   -100  2011-03-07    200   300\\n\",\n       \"7   -100  2011-03-08    200   300\\n\",\n       \"8   -100  2011-03-09    200   300\\n\",\n       \"9   -100  2011-03-10    200   300\\n\",\n       \"10   -90  2011-02-01    110   200\\n\",\n       \"11   -90  2011-02-02    110   200\\n\",\n       \"12   -90  2011-02-03    110   200\\n\",\n       \"13   -90  2011-02-04    110   200\\n\",\n       \"14   -90  2011-02-05    110   200\\n\",\n       \"15   -90  2011-02-06    110   200\\n\",\n       \"16   -90  2011-02-07    110   200\\n\",\n       \"17   -90  2011-02-08    110   200\\n\",\n       \"18   -90  2011-02-09    110   200\\n\",\n       \"19   -90  2011-02-10    110   200\\n\",\n       \"20   -80  2011-01-01     20   100\\n\",\n       \"21   -80  2011-01-02     20   100\\n\",\n       \"22   -80  2011-01-03     20   100\\n\",\n       \"23   -80  2011-01-04     20   100\\n\",\n       \"24   -80  2011-01-05     20   100\\n\",\n       \"25   -80  2011-01-06     20   100\\n\",\n       \"26   -80  2011-01-07     20   100\\n\",\n       \"27   -80  2011-01-08     20   100\\n\",\n       \"28   -80  2011-01-09     20   100\\n\",\n       \"29   -80  2011-01-10     20   100\"\n      ]\n     },\n     \"execution_count\": 2,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"sdf = sqlc.read.csv('test-data/input/test-data-input-csv-clean-*.csv', inferSchema=False, header=True)\\n\",\n    \"sdf.toPandas()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Problem Case 1: Columns are different between files\\n\",\n    \"That worked well. But what happens if your input files have inconsistent columns across files? 
Say for example one file has a new column that the other files don't have.\\n\",\n    \"\\n\",\n    \"[Unlike dask](https://github.com/d6t/d6tstack/blob/master/examples-dask.ipynb), pyspark actually handles this case: the new column is correctly added.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 3,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \" 
     <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-06</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"    </tr>\\n\",\n       \" 
   <tr>\\n\",\n       \"      <th>10</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>11</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>12</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>13</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>14</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>15</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>16</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>17</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      
<td>2011-02-08</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>18</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>19</th>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>20</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>21</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>22</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>23</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>24</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      
<td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>25</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>26</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>27</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>28</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>29</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"    cost        date profit sales profit2\\n\",\n       \"0   -100  2011-03-01    200   300     400\\n\",\n       \"1   -100  2011-03-02    200   300     400\\n\",\n       \"2   -100  2011-03-03    200   300     400\\n\",\n       \"3   -100  2011-03-04    200   300     400\\n\",\n       \"4   -100  2011-03-05    200   300     400\\n\",\n       \"5   -100  2011-03-06    200   300     400\\n\",\n       \"6   -100  2011-03-07    200   300     400\\n\",\n       \"7   -100  2011-03-08    200   300     
400\\n\",\n       \"8   -100  2011-03-09    200   300     400\\n\",\n       \"9   -100  2011-03-10    200   300     400\\n\",\n       \"10   -90  2011-02-01    110   200    None\\n\",\n       \"11   -90  2011-02-02    110   200    None\\n\",\n       \"12   -90  2011-02-03    110   200    None\\n\",\n       \"13   -90  2011-02-04    110   200    None\\n\",\n       \"14   -90  2011-02-05    110   200    None\\n\",\n       \"15   -90  2011-02-06    110   200    None\\n\",\n       \"16   -90  2011-02-07    110   200    None\\n\",\n       \"17   -90  2011-02-08    110   200    None\\n\",\n       \"18   -90  2011-02-09    110   200    None\\n\",\n       \"19   -90  2011-02-10    110   200    None\\n\",\n       \"20   -80  2011-01-01     20   100    None\\n\",\n       \"21   -80  2011-01-02     20   100    None\\n\",\n       \"22   -80  2011-01-03     20   100    None\\n\",\n       \"23   -80  2011-01-04     20   100    None\\n\",\n       \"24   -80  2011-01-05     20   100    None\\n\",\n       \"25   -80  2011-01-06     20   100    None\\n\",\n       \"26   -80  2011-01-07     20   100    None\\n\",\n       \"27   -80  2011-01-08     20   100    None\\n\",\n       \"28   -80  2011-01-09     20   100    None\\n\",\n       \"29   -80  2011-01-10     20   100    None\"\n      ]\n     },\n     \"execution_count\": 3,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"sdf = sqlc.read.csv('test-data/input/test-data-input-csv-colmismatch-*.csv', inferSchema=False, header=True)\\n\",\n    \"sdf.toPandas()\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Problem Case 2: Columns are reordered between files\\n\",\n    \"This is a sneaky case. The columns are the same but the order is different! Pyspark will read everything just fine without a warning but your data is totally messed up! You don't even notice it! 
You'll start using the data and at some point notice something weird is going on!\\n\",\n    \"\\n\",\n    \"In the example below, the \\\"profit\\\" column contains data from the \\\"cost\\\" column!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 4,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      
<th>3</th>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>2011-03-06</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>10</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>11</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"    
  <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>12</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>13</th>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>14</th>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>15</th>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>16</th>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>17</th>\\n\",\n       \"      <td>2011-02-08</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>18</th>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>19</th>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>20</th>\\n\",\n       \"      
<td>2011-01-01</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>21</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>22</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>23</th>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>24</th>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>25</th>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>26</th>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>27</th>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>28</th>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    
</tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>29</th>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"          date sales profit  cost\\n\",\n       \"0   2011-03-01   300    200  -100\\n\",\n       \"1   2011-03-02   300    200  -100\\n\",\n       \"2   2011-03-03   300    200  -100\\n\",\n       \"3   2011-03-04   300    200  -100\\n\",\n       \"4   2011-03-05   300    200  -100\\n\",\n       \"5   2011-03-06   300    200  -100\\n\",\n       \"6   2011-03-07   300    200  -100\\n\",\n       \"7   2011-03-08   300    200  -100\\n\",\n       \"8   2011-03-09   300    200  -100\\n\",\n       \"9   2011-03-10   300    200  -100\\n\",\n       \"10  2011-02-01   200    -90   110\\n\",\n       \"11  2011-02-02   200    -90   110\\n\",\n       \"12  2011-02-03   200    -90   110\\n\",\n       \"13  2011-02-04   200    -90   110\\n\",\n       \"14  2011-02-05   200    -90   110\\n\",\n       \"15  2011-02-06   200    -90   110\\n\",\n       \"16  2011-02-07   200    -90   110\\n\",\n       \"17  2011-02-08   200    -90   110\\n\",\n       \"18  2011-02-09   200    -90   110\\n\",\n       \"19  2011-02-10   200    -90   110\\n\",\n       \"20  2011-01-01   100    -80    20\\n\",\n       \"21  2011-01-02   100    -80    20\\n\",\n       \"22  2011-01-03   100    -80    20\\n\",\n       \"23  2011-01-04   100    -80    20\\n\",\n       \"24  2011-01-05   100    -80    20\\n\",\n       \"25  2011-01-06   100    -80    20\\n\",\n       \"26  2011-01-07   100    -80    20\\n\",\n       \"27  2011-01-08   100    -80    20\\n\",\n       \"28  2011-01-09   100    -80    20\\n\",\n       \"29  2011-01-10   100    -80    20\"\n      ]\n     },\n     \"execution_count\": 4,\n     \"metadata\": {},\n     \"output_type\": 
\"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"sdf = sqlc.read.csv('test-data/input/test-data-input-csv-reorder-*.csv', inferSchema=False, header=True)\\n\",\n    \"sdf.toPandas()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Fixing the problem with d6tstack\\n\",\n    \"Eventually you'll get to the root of the problem. At that point you can either fix those files manually, or use d6tstack to check for this situation and fix it in a few lines of code - no manual processing required. Let's take a look!\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 1,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/opt/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\\n\",\n      \"  return f(*args, **kwds)\\n\"\n     ]\n    },\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"all columns equal? 
False\\n\",\n      \"\\n\",\n      \"in what order do columns appear in the files?\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>2</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   date  sales  cost  profit\\n\",\n       \"0     0      1     2       3\\n\",\n       \"1     0      1     2       3\\n\",\n       \"2     0      1     3       2\"\n      ]\n     },\n     
\"execution_count\": 1,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import glob\\n\",\n    \"import d6tstack.combine_csv\\n\",\n    \"\\n\",\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-reorder-*.csv'))\\n\",\n    \"c = d6tstack.combine_csv.CombinerCSV(cfg_fnames, all_strings=True)\\n\",\n    \"\\n\",\n    \"# check columns\\n\",\n    \"col_sniff = c.sniff_columns()\\n\",\n    \"print('all columns equal?', col_sniff['is_all_equal'])\\n\",\n    \"print('')\\n\",\n    \"print('in what order do columns appear in the files?')\\n\",\n    \"print('')\\n\",\n    \"col_sniff['df_columns_order'].reset_index(drop=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"Again, this is a useful check to run before loading the data into Spark: you can see that the columns don't line up. It's very fast because it only reads the headers, so from a QA perspective there's no reason not to do it.\\n\",\n    \"\\n\",\n    \"As above, the fix is the same few lines of code with d6tstack.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 6,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"True\"\n      ]\n     },\n     \"execution_count\": 6,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# out-of-core combining\\n\",\n    \"c.to_csv_align(output_dir='test-data/output/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 7,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n     
  \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-04</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-05</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      
<td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>5</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-06</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>6</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-07</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>7</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-08</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>8</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-09</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>9</th>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>2011-02-10</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>10</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>11</th>\\n\",\n       \"      <td>20</td>\\n\",\n      
 \"      <td>2011-01-02</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>12</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>13</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>14</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>15</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>16</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>17</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    
<tr>\\n\",\n       \"      <th>18</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>19</th>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>20</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>21</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>22</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>23</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>24</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \" 
     <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>25</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-06</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>26</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-07</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>27</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-08</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>28</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-09</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>29</th>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>2011-03-10</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>test-data-input-csv-reorder-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   profit        date  cost sales                             filename\\n\",\n       \"0     110  2011-02-01   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"1     110  2011-02-02   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"2     110  2011-02-03   -90   200  
test-data-input-csv-reorder-feb.csv\\n\",\n       \"3     110  2011-02-04   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"4     110  2011-02-05   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"5     110  2011-02-06   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"6     110  2011-02-07   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"7     110  2011-02-08   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"8     110  2011-02-09   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"9     110  2011-02-10   -90   200  test-data-input-csv-reorder-feb.csv\\n\",\n       \"10     20  2011-01-01   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"11     20  2011-01-02   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"12     20  2011-01-03   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"13     20  2011-01-04   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"14     20  2011-01-05   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"15     20  2011-01-06   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"16     20  2011-01-07   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"17     20  2011-01-08   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"18     20  2011-01-09   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"19     20  2011-01-10   -80   100  test-data-input-csv-reorder-jan.csv\\n\",\n       \"20    200  2011-03-01  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"21    200  2011-03-02  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"22    200  2011-03-03  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"23    200  2011-03-04  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"24    200  2011-03-05  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"25    200  2011-03-06  -100   300 
 test-data-input-csv-reorder-mar.csv\\n\",\n       \"26    200  2011-03-07  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"27    200  2011-03-08  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"28    200  2011-03-09  -100   300  test-data-input-csv-reorder-mar.csv\\n\",\n       \"29    200  2011-03-10  -100   300  test-data-input-csv-reorder-mar.csv\"\n      ]\n     },\n     \"execution_count\": 7,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"sdf = sqlc.read.csv('test-data/output/d6tstack-test-data-input-csv-reorder-*.csv', inferSchema=False, header=True)\\n\",\n    \"sdf.toPandas()\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"## Problem Case 3: Columns are renamed between files\\n\",\n    \"In this case a column gets renamed between files, so you end up with two partially-NaN columns that should really be one column. You would have to manually inspect which columns this applies to and then merge them by hand.\\n\",\n    \"\\n\",\n    \"Instead, you can use d6tstack to make your input files consistent.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 8,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# see examples-csv.ipynb\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   
\"version\": \"3.6.3\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples-read-write.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"import importlib\\n\",\n    \"import pandas as pd\\n\",\n    \"import numpy as np\\n\",\n    \"import glob\\n\",\n    \"\\n\",\n    \"import d6tstack.combine_csv\\n\",\n    \"from d6tstack.utils import PrintLogger\\n\",\n    \"logger = PrintLogger()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# CombinerCSV\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 77,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"['test-data/input/test-data-input-csv-colmismatch-mar.csv', 'test-data/input/test-data-input-csv-colmismatch-feb.csv', 'test-data/input/test-data-input-csv-colmismatch-jan.csv']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))\\n\",\n    \"print(cfg_fnames)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 88,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"c = d6tstack.combine_csv.CombinerCSV(cfg_fnames, all_strings=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 89,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       
\"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \" 
     <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   cost        date  profit  profit2  sales  \\\\\\n\",\n       \"0   -80  2011-01-01      20      NaN    100   \\n\",\n       \"1   -80  2011-01-02      20      NaN    100   \\n\",\n       \"2   -80  2011-01-03      20      NaN    100   \\n\",\n       \"3   -80  2011-01-04      20      NaN    100   \\n\",\n       \"4   -80  2011-01-05      20      NaN    100   \\n\",\n       \"\\n\",\n       \"                                          filename  \\n\",\n       \"0  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"1  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"2  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"3  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"4  test-data-input-csv-colmismatch-jan-matched.csv  \"\n      ]\n     },\n     \"execution_count\": 89,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.to_csv(output_dir='test-data/output/',overwrite=True)\\n\",\n    \"pd.read_csv('test-data/output/test-data-input-csv-colmismatch-jan-matched.csv').head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 90,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: 
right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      
<td>-80</td>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   cost        date  profit  profit2  sales  \\\\\\n\",\n       \"0   -80  2011-01-01      20      NaN    100   \\n\",\n       \"1   -80  2011-01-02      20      NaN    100   \\n\",\n       \"2   -80  2011-01-03      20      NaN    100   \\n\",\n       \"3   -80  2011-01-04      20      NaN    100   \\n\",\n       \"4   -80  2011-01-05      20      NaN    100   \\n\",\n       \"\\n\",\n       \"                                          filename  \\n\",\n       \"0  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"1  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"2  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"3  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"4  test-data-input-csv-colmismatch-jan-matched.csv  \"\n      ]\n     },\n     \"execution_count\": 90,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# doesn't raise any warnings... 
thought we had overwrite warnings implemented?\\n\",\n    \"c.to_csv(output_dir='test-data/output/',overwrite=False)\\n\",\n    \"pd.read_csv('test-data/output/test-data-input-csv-colmismatch-jan-matched.csv').head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 92,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"      <th>filename.1</th>\\n\",\n       \"      <th>filename.2</th>\\n\",\n       \"      <th>filename.3</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      
<td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-04</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-05</td>\\n\",\n       \"      
<td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan-matched.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   cost        date  profit  profit2  sales  \\\\\\n\",\n       \"0   -80  2011-01-01      20      NaN    100   \\n\",\n       \"1   -80  2011-01-02      20      NaN    100   \\n\",\n       \"2   -80  2011-01-03      20      NaN    100   \\n\",\n       \"3   -80  2011-01-04      20      NaN    100   \\n\",\n       \"4   -80  2011-01-05      20      NaN    100   \\n\",\n       \"\\n\",\n       \"                                          filename  \\\\\\n\",\n       \"0  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"1  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"2  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"3  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"4  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"\\n\",\n       \"                                        filename.1  \\\\\\n\",\n       \"0  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"1  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"2  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"3  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"4  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"\\n\",\n       \"                                        filename.2  \\\\\\n\",\n       \"0  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"1  
test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"2  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"3  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"4  test-data-input-csv-colmismatch-jan-matched.csv   \\n\",\n       \"\\n\",\n       \"                                        filename.3  \\n\",\n       \"0  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"1  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"2  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"3  test-data-input-csv-colmismatch-jan-matched.csv  \\n\",\n       \"4  test-data-input-csv-colmismatch-jan-matched.csv  \"\n      ]\n     },\n     \"execution_count\": 92,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# adds 3 columns for filename\\n\",\n    \"c.to_csv(output_dir='test-data/output/',overwrite=True)\\n\",\n    \"pd.read_csv('test-data/output/test-data-input-csv-colmismatch-jan-matched.csv').head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 40,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# not writing a file / raising error\\n\",\n    \"c.to_csv(output_dir='test-data/output/',separate_files=False)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 41,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"# not writing a file\\n\",\n    \"c.to_csv(output_dir='test-data/output/',separate_files=False)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 93,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py:271: FutureWarning: Sorting because non-concatenation axis is not aligned. 
A future version\\n\",\n      \"of pandas will change to not sort by default.\\n\",\n      \"\\n\",\n      \"To accept the future behavior, pass 'sort=True'.\\n\",\n      \"\\n\",\n      \"To retain the current behavior and silence the warning, pass sort=False\\n\",\n      \"\\n\",\n      \"  df_all = pd.concat(dfl_all)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>300</td>\\n\",\n       
\"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400.0</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   cost        date                                 filename  profit  profit2  \\\\\\n\",\n       \"0  -100  2011-03-01  test-data-input-csv-colmismatch-mar.csv     200    400.0   \\n\",\n       \"1  -100  2011-03-02  test-data-input-csv-colmismatch-mar.csv     200    400.0   \\n\",\n       \"2  -100  2011-03-03  test-data-input-csv-colmismatch-mar.csv     200    400.0   \\n\",\n       \"3  -100  2011-03-04  test-data-input-csv-colmismatch-mar.csv     200    400.0   \\n\",\n       \"4  -100  2011-03-05  test-data-input-csv-colmismatch-mar.csv     200    400.0   \\n\",\n       \"\\n\",\n       \"   sales  \\n\",\n       \"0    300  \\n\",\n       \"1    300  \\n\",\n       \"2    300  \\n\",\n       \"3    300  \\n\",\n       \"4    300  \"\n      ]\n     },\n     \"execution_count\": 93,\n     \"metadata\": {},\n     \"output_type\": 
\"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"# ok\\n\",\n    \"c.to_csv(out_filename='test-data/output/test-combined.csv',separate_files=False)\\n\",\n    \"pd.read_csv('test-data/output/test-combined.csv').head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 94,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>25</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>26</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"   
 </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>27</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>28</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>29</th>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"    cost        date                                 filename  profit  \\\\\\n\",\n       \"25   -80  2011-01-06  test-data-input-csv-colmismatch-jan.csv      20   \\n\",\n       \"26   -80  2011-01-07  test-data-input-csv-colmismatch-jan.csv      20   \\n\",\n       \"27   -80  2011-01-08  test-data-input-csv-colmismatch-jan.csv      20   \\n\",\n       \"28   -80  2011-01-09  test-data-input-csv-colmismatch-jan.csv      20   \\n\",\n       \"29   -80  2011-01-10  test-data-input-csv-colmismatch-jan.csv      20   \\n\",\n       \"\\n\",\n       \"    profit2  sales  \\n\",\n       \"25      NaN    100  \\n\",\n       \"26      NaN    100  \\n\",\n       \"27      NaN    100  \\n\",\n       \"28      NaN    100  \\n\",\n       \"29      NaN    100  \"\n      ]\n     },\n     \"execution_count\": 94,\n     \"metadata\": {},\n     \"output_type\": 
\"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"pd.read_csv('test-data/output/test-combined.csv').tail()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 95,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"TypeError\",\n     \"evalue\": \"to_csv() got an unexpected keyword argument 'is_col_common'\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mTypeError\\u001b[0m                                 Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-95-436aa482cf5f>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[1;32m      1\\u001b[0m \\u001b[0;31m# add is_col_common to pass through\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m----> 2\\u001b[0;31m \\u001b[0mc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;34m'test-data/output/test-combined.csv'\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0mseparate_files\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mFalse\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0mis_col_common\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mFalse\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;31mTypeError\\u001b[0m: to_csv() got an unexpected keyword argument 'is_col_common'\"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"# add is_col_common to pass through\\n\",\n    \"c.to_csv(out_filename='test-data/output/test-combined.csv',separate_files=False,is_col_common=False)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 96,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"TypeError\",\n     \"evalue\": \"to_csv() got an unexpected keyword argument 'streaming'\",\n     \"traceback\": [\n   
   \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mTypeError\\u001b[0m                                 Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-96-970c7618acf7>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[1;32m      1\\u001b[0m \\u001b[0;31m# how do I do streaming?\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m----> 2\\u001b[0;31m \\u001b[0mc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;34m'test-data/output/test-combined.csv'\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0mseparate_files\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mFalse\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0mstreaming\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mTrue\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;31mTypeError\\u001b[0m: to_csv() got an unexpected keyword argument 'streaming'\"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"# how do I do streaming?\\n\",\n    \"c.to_csv(out_filename='test-data/output/test-combined.csv',separate_files=False,streaming=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 97,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"AttributeError\",\n     \"evalue\": \"'CombinerCSV' object has no attribute 'to_parquet'\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mAttributeError\\u001b[0m                            Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-97-bb0939e7210f>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[0;32m----> 1\\u001b[0;31m 
\\u001b[0mc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_parquet\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0moutput_dir\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;34m'test-data/output/'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;31mAttributeError\\u001b[0m: 'CombinerCSV' object has no attribute 'to_parquet'\"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"c.to_parquet(output_dir='test-data/output/')\\n\",\n    \"import pyarrow.parquet as pq\\n\",\n    \"table = pq.read_table('test-data/output/test-data-input-csv-colmismatch-jan-matched')\\n\",\n    \"table.to_pandas()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 61,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stderr\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py:271: FutureWarning: Sorting because non-concatenation axis is not aligned. 
A future version\\n\",\n      \"of pandas will change to not sort by default.\\n\",\n      \"\\n\",\n      \"To accept the future behavior, pass 'sort=True'.\\n\",\n      \"\\n\",\n      \"To retain the current behavior and silence the warning, pass sort=False\\n\",\n      \"\\n\",\n      \"  df_all = pd.concat(dfl_all)\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"True\"\n      ]\n     },\n     \"execution_count\": 61,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.to_sql('mysql+mysqlconnector://testusr:testusr@localhost/test','testd6tstack')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 62,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"from sqlalchemy.engine import create_engine\\n\",\n    \"sqlcnxn = create_engine('mysql+mysqlconnector://testusr:testusr@localhost/test').connect()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 66,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>index</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      
<th>profit2</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>0</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>1</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>3</th>\\n\",\n       \"      <td>3</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-04</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>4</th>\\n\",\n       \"      <td>4</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>2011-03-05</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"    </tr>\\n\",\n     
  \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"   index  cost        date                                 filename profit  \\\\\\n\",\n       \"0      0  -100  2011-03-01  test-data-input-csv-colmismatch-mar.csv    200   \\n\",\n       \"1      1  -100  2011-03-02  test-data-input-csv-colmismatch-mar.csv    200   \\n\",\n       \"2      2  -100  2011-03-03  test-data-input-csv-colmismatch-mar.csv    200   \\n\",\n       \"3      3  -100  2011-03-04  test-data-input-csv-colmismatch-mar.csv    200   \\n\",\n       \"4      4  -100  2011-03-05  test-data-input-csv-colmismatch-mar.csv    200   \\n\",\n       \"\\n\",\n       \"  profit2 sales  \\n\",\n       \"0     400   300  \\n\",\n       \"1     400   300  \\n\",\n       \"2     400   300  \\n\",\n       \"3     400   300  \\n\",\n       \"4     400   300  \"\n      ]\n     },\n     \"execution_count\": 66,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"pd.read_sql_table('testd6tstack',sqlcnxn).head()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 67,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>index</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      
<th>date</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>profit2</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>25</th>\\n\",\n       \"      <td>5</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-06</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>26</th>\\n\",\n       \"      <td>6</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-07</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>27</th>\\n\",\n       \"      <td>7</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-08</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>28</th>\\n\",\n       \"      <td>8</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-09</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>29</th>\\n\",\n       \"      <td>9</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>2011-01-10</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"      
<td>20</td>\\n\",\n       \"      <td>None</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"    index cost        date                                 filename profit  \\\\\\n\",\n       \"25      5  -80  2011-01-06  test-data-input-csv-colmismatch-jan.csv     20   \\n\",\n       \"26      6  -80  2011-01-07  test-data-input-csv-colmismatch-jan.csv     20   \\n\",\n       \"27      7  -80  2011-01-08  test-data-input-csv-colmismatch-jan.csv     20   \\n\",\n       \"28      8  -80  2011-01-09  test-data-input-csv-colmismatch-jan.csv     20   \\n\",\n       \"29      9  -80  2011-01-10  test-data-input-csv-colmismatch-jan.csv     20   \\n\",\n       \"\\n\",\n       \"   profit2 sales  \\n\",\n       \"25    None   100  \\n\",\n       \"26    None   100  \\n\",\n       \"27    None   100  \\n\",\n       \"28    None   100  \\n\",\n       \"29    None   100  \"\n      ]\n     },\n     \"execution_count\": 67,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"pd.read_sql_table('testd6tstack',sqlcnxn).tail()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# CombinerCSVAdvanced.to_csv()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 51,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table 
border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>profit3</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      <th>filename</th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-03-01</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-03-02</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-03-03</td>\\n\",\n       \"      <td>400</td>\\n\",\n       \"      <td>300</td>\\n\",\n       \"      <td>-100</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-mar.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-02-01</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-02-02</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>200</td>\\n\",\n       
\"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-02-03</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>200</td>\\n\",\n       \"      <td>-90</td>\\n\",\n       \"      <td>110</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-feb.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>0</th>\\n\",\n       \"      <td>2011-01-01</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>1</th>\\n\",\n       \"      <td>2011-01-02</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>2</th>\\n\",\n       \"      <td>2011-01-03</td>\\n\",\n       \"      <td>NaN</td>\\n\",\n       \"      <td>100</td>\\n\",\n       \"      <td>-80</td>\\n\",\n       \"      <td>20</td>\\n\",\n       \"      <td>test-data-input-csv-colmismatch-jan.csv</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"         date profit3 sales  cost profit  \\\\\\n\",\n       \"0  2011-03-01     400   300  -100    200   \\n\",\n       \"1  2011-03-02     400   300  -100    200   \\n\",\n       \"2  2011-03-03     400   300  -100    200   \\n\",\n       \"0  2011-02-01     NaN   200   -90    110   \\n\",\n       \"1  2011-02-02     NaN   200   -90    110   \\n\",\n       \"2  2011-02-03    
 NaN   200   -90    110   \\n\",\n       \"0  2011-01-01     NaN   100   -80     20   \\n\",\n       \"1  2011-01-02     NaN   100   -80     20   \\n\",\n       \"2  2011-01-03     NaN   100   -80     20   \\n\",\n       \"\\n\",\n       \"                                  filename  \\n\",\n       \"0  test-data-input-csv-colmismatch-mar.csv  \\n\",\n       \"1  test-data-input-csv-colmismatch-mar.csv  \\n\",\n       \"2  test-data-input-csv-colmismatch-mar.csv  \\n\",\n       \"0  test-data-input-csv-colmismatch-feb.csv  \\n\",\n       \"1  test-data-input-csv-colmismatch-feb.csv  \\n\",\n       \"2  test-data-input-csv-colmismatch-feb.csv  \\n\",\n       \"0  test-data-input-csv-colmismatch-jan.csv  \\n\",\n       \"1  test-data-input-csv-colmismatch-jan.csv  \\n\",\n       \"2  test-data-input-csv-colmismatch-jan.csv  \"\n      ]\n     },\n     \"execution_count\": 51,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"combiner2 = d6tstack.combine_csv.CombinerCSVAdvanced(c, c.preview_columns()['columns_all'], {'profit2':'profit3'})\\n\",\n    \"combiner2.preview_combine() \"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 53,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"TypeError\",\n     \"evalue\": \"combine_save() got an unexpected keyword argument 'parquet_output'\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mTypeError\\u001b[0m                                 Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-53-998d023decfd>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[0;32m----> 1\\u001b[0;31m 
\\u001b[0mcombiner2\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;34m'test-data/output/test-combined.csv'\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0mseparate_files\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mFalse\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mto_csv\\u001b[0;34m(self, out_filename, separate_files, output_dir, suffix, overwrite, streaming, chunksize)\\u001b[0m\\n\\u001b[1;32m    617\\u001b[0m         \\\"\\\"\\\"\\n\\u001b[1;32m    618\\u001b[0m         convert_to_csv_parquet(self, out_filename=out_filename, separate_files=separate_files, output_dir=output_dir,\\n\\u001b[0;32m--> 619\\u001b[0;31m                                suffix=suffix, overwrite=overwrite, streaming=streaming, chunksize=chunksize)\\n\\u001b[0m\\u001b[1;32m    620\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    621\\u001b[0m     def to_parquet(self, out_filename=None, separate_files=True, output_dir=None, suffix='-matched',\\n\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mconvert_to_csv_parquet\\u001b[0;34m(combiner, out_filename, separate_files, output_dir, suffix, overwrite, streaming, chunksize, parquet_output)\\u001b[0m\\n\\u001b[1;32m     58\\u001b[0m                             chunksize=chunksize, parquet_output=parquet_output)\\n\\u001b[1;32m     59\\u001b[0m     \\u001b[0;32melif\\u001b[0m \\u001b[0mstreaming\\u001b[0m \\u001b[0;32mand\\u001b[0m \\u001b[0mout_filename\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m---> 60\\u001b[0;31m         \\u001b[0mcombiner\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcombine_save\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m,\\u001b[0m 
\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m     61\\u001b[0m     \\u001b[0;32melif\\u001b[0m \\u001b[0mout_filename\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     62\\u001b[0m         \\u001b[0mdf\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mcombiner\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcombine\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;31mTypeError\\u001b[0m: combine_save() got an unexpected keyword argument 'parquet_output'\"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"# bug: convert_to_csv_parquet() passes parquet_output= to combine_save(), which does not accept it\\n\",\n    \"combiner2.to_csv(out_filename='test-data/output/test-combined.csv',separate_files=False)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 54,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"TypeError\",\n     \"evalue\": \"align_save() got an unexpected keyword argument 'parquet_output'\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mTypeError\\u001b[0m                                 Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-54-ebe0679ac8cd>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[1;32m      1\\u001b[0m \\u001b[0;31m# bug??\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m----> 2\\u001b[0;31m 
\\u001b[0mcombiner2\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;34m'test-data/output/test-combined.csv'\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0mseparate_files\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mTrue\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mto_csv\\u001b[0;34m(self, out_filename, separate_files, output_dir, suffix, overwrite, streaming, chunksize)\\u001b[0m\\n\\u001b[1;32m    617\\u001b[0m         \\\"\\\"\\\"\\n\\u001b[1;32m    618\\u001b[0m         convert_to_csv_parquet(self, out_filename=out_filename, separate_files=separate_files, output_dir=output_dir,\\n\\u001b[0;32m--> 619\\u001b[0;31m                                suffix=suffix, overwrite=overwrite, streaming=streaming, chunksize=chunksize)\\n\\u001b[0m\\u001b[1;32m    620\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    621\\u001b[0m     def to_parquet(self, out_filename=None, separate_files=True, output_dir=None, suffix='-matched',\\n\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mconvert_to_csv_parquet\\u001b[0;34m(combiner, out_filename, separate_files, output_dir, suffix, overwrite, streaming, chunksize, parquet_output)\\u001b[0m\\n\\u001b[1;32m     56\\u001b[0m     \\u001b[0;32mif\\u001b[0m \\u001b[0mseparate_files\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     57\\u001b[0m         combiner.align_save(output_dir=output_dir, suffix=suffix, overwrite=overwrite,\\n\\u001b[0;32m---> 58\\u001b[0;31m                             chunksize=chunksize, parquet_output=parquet_output)\\n\\u001b[0m\\u001b[1;32m     59\\u001b[0m     \\u001b[0;32melif\\u001b[0m \\u001b[0mstreaming\\u001b[0m \\u001b[0;32mand\\u001b[0m 
\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     60\\u001b[0m         \\u001b[0mcombiner\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcombine_save\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mchunksize\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;31mTypeError\\u001b[0m: align_save() got an unexpected keyword argument 'parquet_output'\"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"# bug: convert_to_csv_parquet() passes parquet_output= to align_save(), which does not accept it\\n\",\n    \"combiner2.to_csv(out_filename='test-data/output/test-combined.csv',separate_files=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 58,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"TypeError\",\n     \"evalue\": \"align_save() got an unexpected keyword argument 'parquet_output'\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mTypeError\\u001b[0m                                 Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-58-79723936bb0c>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[0;32m----> 1\\u001b[0;31m \\u001b[0mcombiner2\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_parquet\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0moutput_dir\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;34m'test-data/output/'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mto_parquet\\u001b[0;34m(self, out_filename, separate_files, output_dir, suffix, overwrite, streaming, 
chunksize)\\u001b[0m\\n\\u001b[1;32m    637\\u001b[0m         convert_to_csv_parquet(self, out_filename=out_filename, separate_files=separate_files, output_dir=output_dir,\\n\\u001b[1;32m    638\\u001b[0m                                \\u001b[0msuffix\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0msuffix\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0moverwrite\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0moverwrite\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mstreaming\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mstreaming\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mchunksize\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 639\\u001b[0;31m                                parquet_output=True)\\n\\u001b[0m\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mconvert_to_csv_parquet\\u001b[0;34m(combiner, out_filename, separate_files, output_dir, suffix, overwrite, streaming, chunksize, parquet_output)\\u001b[0m\\n\\u001b[1;32m     56\\u001b[0m     \\u001b[0;32mif\\u001b[0m \\u001b[0mseparate_files\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     57\\u001b[0m         combiner.align_save(output_dir=output_dir, suffix=suffix, overwrite=overwrite,\\n\\u001b[0;32m---> 58\\u001b[0;31m                             chunksize=chunksize, parquet_output=parquet_output)\\n\\u001b[0m\\u001b[1;32m     59\\u001b[0m     \\u001b[0;32melif\\u001b[0m \\u001b[0mstreaming\\u001b[0m \\u001b[0;32mand\\u001b[0m \\u001b[0mout_filename\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     60\\u001b[0m         \\u001b[0mcombiner\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcombine_save\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mchunksize\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m,\\u001b[0m 
\\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;31mTypeError\\u001b[0m: align_save() got an unexpected keyword argument 'parquet_output'\"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"combiner2.to_parquet(output_dir='test-data/output/')\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# DEBUG: large files\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 70,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"9\\n\",\n      \"['/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20140401_20140430_D.txt', '/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20140501_20140829_D.txt', '/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20140901_20141231_D.txt', '/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20150101_20150630_D.txt', '/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20150701_20151231_D.txt', '/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20160101_20160630_D.txt', '/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20160701_20161230_D.txt', '/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20170102_20170630_D.txt', 
'/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20170703_20170731_D.txt']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(np.sort(glob.glob('/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_*.txt')))\\n\",\n    \"print(len(cfg_fnames))\\n\",\n    \"print(cfg_fnames)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 71,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"1\\n\",\n      \"['/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20150101_20150630_D.txt']\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(np.sort(glob.glob('/mnt/data/dev/ubs-alphahack2017-shared/data-raw/ihs/US_Factors_Zscores/US_Factors_TotalCap_Cusip_Zscore_Historical_20150101_20150630_D.txt')))\\n\",\n    \"print(len(cfg_fnames))\\n\",\n    \"print(cfg_fnames)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 72,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": [\n    \"c = d6tstack.combine_csv.CombinerCSV(cfg_fnames, all_strings=True)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 73,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"data\": {\n      \"text/plain\": [\n       \"True\"\n      ]\n     },\n     \"execution_count\": 73,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"c.is_all_equal()\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 31,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"KeyboardInterrupt\",\n     \"evalue\": \"\",\n     \"traceback\": [\n      
\"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mKeyboardInterrupt\\u001b[0m                         Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-31-3dcf1763c8be>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[1;32m      1\\u001b[0m \\u001b[0;31m# how do I do streaming?\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m----> 2\\u001b[0;31m \\u001b[0mc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;34m'test-data/output/test-combined.csv'\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0mseparate_files\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mFalse\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mto_csv\\u001b[0;34m(self, out_filename, separate_files, output_dir, suffix, overwrite, chunksize)\\u001b[0m\\n\\u001b[1;32m    434\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    435\\u001b[0m         convert_to_csv_parquet(self, out_filename=out_filename, separate_files=separate_files, output_dir=output_dir,\\n\\u001b[0;32m--> 436\\u001b[0;31m                                suffix=suffix, overwrite=overwrite, streaming=False, chunksize=chunksize)\\n\\u001b[0m\\u001b[1;32m    437\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    438\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mconvert_to_csv_parquet\\u001b[0;34m(combiner, out_filename, separate_files, output_dir, suffix, overwrite, streaming, chunksize, parquet_output)\\u001b[0m\\n\\u001b[1;32m     60\\u001b[0m         
\\u001b[0mcombiner\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcombine_save\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mout_filename\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mchunksize\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     61\\u001b[0m     \\u001b[0;32melif\\u001b[0m \\u001b[0mout_filename\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m---> 62\\u001b[0;31m         \\u001b[0mdf\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mcombiner\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mcombine\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m     63\\u001b[0m         \\u001b[0;32mif\\u001b[0m \\u001b[0mparquet_output\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m     64\\u001b[0m             \\u001b[0;32mimport\\u001b[0m \\u001b[0mpyarrow\\u001b[0m \\u001b[0;32mas\\u001b[0m \\u001b[0mpa\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mcombine\\u001b[0;34m(self, is_col_common, is_preview)\\u001b[0m\\n\\u001b[1;32m    261\\u001b[0m         \\\"\\\"\\\"\\n\\u001b[1;32m    262\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 263\\u001b[0;31m         \\u001b[0mdfl_all\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread_csv_all\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m'reading full file'\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mis_preview\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mis_preview\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    264\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    265\\u001b[0m   
      \\u001b[0;32mif\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mlogger\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mread_csv_all\\u001b[0;34m(self, msg, is_preview, chunksize, cfg_col_sel, cfg_col_rename)\\u001b[0m\\n\\u001b[1;32m    125\\u001b[0m             \\u001b[0;32mif\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mlogger\\u001b[0m \\u001b[0;32mand\\u001b[0m \\u001b[0mmsg\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    126\\u001b[0m                 \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mlogger\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0msend_log\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mmsg\\u001b[0m \\u001b[0;34m+\\u001b[0m \\u001b[0;34m' '\\u001b[0m \\u001b[0;34m+\\u001b[0m \\u001b[0mntpath\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mbasename\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mfname\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m'ok'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 127\\u001b[0;31m             \\u001b[0mdf\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mfname\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mis_preview\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mis_preview\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mchunksize\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    128\\u001b[0m             \\u001b[0;32mif\\u001b[0m \\u001b[0mcfg_col_sel\\u001b[0m \\u001b[0;32mor\\u001b[0m \\u001b[0mcfg_col_rename\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    129\\u001b[0m                 \\u001b[0mdf\\u001b[0m 
\\u001b[0;34m=\\u001b[0m \\u001b[0mapply_select_rename\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mdf\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mcfg_col_sel\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mcfg_col_rename\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mread_csv\\u001b[0;34m(self, fname, is_preview, chunksize)\\u001b[0m\\n\\u001b[1;32m    113\\u001b[0m         \\u001b[0mcfg_nrows\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mnrows_preview\\u001b[0m \\u001b[0;32mif\\u001b[0m \\u001b[0mis_preview\\u001b[0m \\u001b[0;32melse\\u001b[0m \\u001b[0;32mNone\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    114\\u001b[0m         return pd.read_csv(fname, dtype=cfg_dype, nrows=cfg_nrows, chunksize=chunksize,\\n\\u001b[0;32m--> 115\\u001b[0;31m                            **self.read_csv_params)\\n\\u001b[0m\\u001b[1;32m    116\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    117\\u001b[0m     def read_csv_all(self, msg=None, is_preview=False, chunksize=None, cfg_col_sel=None,\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36mparser_f\\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)\\u001b[0m\\n\\u001b[1;32m    676\\u001b[0m                     
skip_blank_lines=skip_blank_lines)\\n\\u001b[1;32m    677\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 678\\u001b[0;31m         \\u001b[0;32mreturn\\u001b[0m \\u001b[0m_read\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mfilepath_or_buffer\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mkwds\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    679\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    680\\u001b[0m     \\u001b[0mparser_f\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m__name__\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mname\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36m_read\\u001b[0;34m(filepath_or_buffer, kwds)\\u001b[0m\\n\\u001b[1;32m    444\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    445\\u001b[0m     \\u001b[0;32mtry\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 446\\u001b[0;31m         \\u001b[0mdata\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mparser\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mnrows\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    447\\u001b[0m     \\u001b[0;32mfinally\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    448\\u001b[0m         \\u001b[0mparser\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mclose\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36mread\\u001b[0;34m(self, nrows)\\u001b[0m\\n\\u001b[1;32m   1034\\u001b[0m                 \\u001b[0;32mraise\\u001b[0m \\u001b[0mValueError\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m'skipfooter not supported for 
iteration'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1035\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1036\\u001b[0;31m         \\u001b[0mret\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_engine\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mnrows\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m   1037\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1038\\u001b[0m         \\u001b[0;31m# May alter columns / col_dict\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36mread\\u001b[0;34m(self, nrows)\\u001b[0m\\n\\u001b[1;32m   1846\\u001b[0m     \\u001b[0;32mdef\\u001b[0m \\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mnrows\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mNone\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1847\\u001b[0m         \\u001b[0;32mtry\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1848\\u001b[0;31m             \\u001b[0mdata\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_reader\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mnrows\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m   1849\\u001b[0m         \\u001b[0;32mexcept\\u001b[0m \\u001b[0mStopIteration\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1850\\u001b[0m             \\u001b[0;32mif\\u001b[0m 
\\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_first_chunk\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader.read\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._read_low_memory\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._read_rows\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._convert_column_data\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._convert_tokens\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._convert_with_dtype\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/common.py\\u001b[0m in \\u001b[0;36mis_integer_dtype\\u001b[0;34m(arr_or_dtype)\\u001b[0m\\n\\u001b[1;32m    809\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    810\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 811\\u001b[0;31m \\u001b[0;32mdef\\u001b[0m \\u001b[0mis_integer_dtype\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0marr_or_dtype\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    812\\u001b[0m     \\\"\\\"\\\"\\n\\u001b[1;32m    813\\u001b[0m     \\u001b[0mCheck\\u001b[0m \\u001b[0mwhether\\u001b[0m \\u001b[0mthe\\u001b[0m \\u001b[0mprovided\\u001b[0m \\u001b[0marray\\u001b[0m \\u001b[0;32mor\\u001b[0m \\u001b[0mdtype\\u001b[0m \\u001b[0;32mis\\u001b[0m \\u001b[0mof\\u001b[0m \\u001b[0man\\u001b[0m \\u001b[0minteger\\u001b[0m 
\\u001b[0mdtype\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;31mKeyboardInterrupt\\u001b[0m: \"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"# how do I do streaming? this loads everything into memory...\\n\",\n    \"c.to_csv(out_filename='test-data/output/test-combined.csv',separate_files=False)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 75,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"ename\": \"KeyboardInterrupt\",\n     \"evalue\": \"\",\n     \"traceback\": [\n      \"\\u001b[0;31m---------------------------------------------------------------------------\\u001b[0m\",\n      \"\\u001b[0;31mKeyboardInterrupt\\u001b[0m                         Traceback (most recent call last)\",\n      \"\\u001b[0;32m<ipython-input-75-6289d35d16bb>\\u001b[0m in \\u001b[0;36m<module>\\u001b[0;34m()\\u001b[0m\\n\\u001b[0;32m----> 1\\u001b[0;31m \\u001b[0mc\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mto_sql_stream\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m'mysql+mysqlconnector://testusr:testusr@localhost/test'\\u001b[0m\\u001b[0;34m,\\u001b[0m\\u001b[0;34m'testd6tstack'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\",\n      \"\\u001b[0;32m/mnt/data/dev/d6t-lib/d6tstack/d6tstack/combine_csv.py\\u001b[0m in \\u001b[0;36mto_sql_stream\\u001b[0;34m(self, cnxn_string, table_name, if_exists, chunksize, sql_chunksize, cfg_col_sel, is_col_common, cfg_col_rename)\\u001b[0m\\n\\u001b[1;32m    403\\u001b[0m             \\u001b[0;32mif\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mlogger\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    404\\u001b[0m                 \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mlogger\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0msend_log\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m'processing '\\u001b[0m \\u001b[0;34m+\\u001b[0m 
\\u001b[0mntpath\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mbasename\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mfname\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0;34m'ok'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 405\\u001b[0;31m             \\u001b[0;32mfor\\u001b[0m \\u001b[0mdf_chunk\\u001b[0m \\u001b[0;32min\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread_csv\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mfname\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mchunksize\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0mchunksize\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    406\\u001b[0m                 \\u001b[0;32mif\\u001b[0m \\u001b[0mcfg_col_sel\\u001b[0m \\u001b[0;32mor\\u001b[0m \\u001b[0mcfg_col_rename\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    407\\u001b[0m                     \\u001b[0mdf_chunk\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mapply_select_rename\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mdf_chunk\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mcfg_col_sel\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mcfg_col_rename\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36m__next__\\u001b[0;34m(self)\\u001b[0m\\n\\u001b[1;32m   1005\\u001b[0m     \\u001b[0;32mdef\\u001b[0m \\u001b[0m__next__\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1006\\u001b[0m         \\u001b[0;32mtry\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1007\\u001b[0;31m             \\u001b[0;32mreturn\\u001b[0m 
\\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mget_chunk\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m   1008\\u001b[0m         \\u001b[0;32mexcept\\u001b[0m \\u001b[0mStopIteration\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1009\\u001b[0m             \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mclose\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36mget_chunk\\u001b[0;34m(self, size)\\u001b[0m\\n\\u001b[1;32m   1068\\u001b[0m                 \\u001b[0;32mraise\\u001b[0m \\u001b[0mStopIteration\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1069\\u001b[0m             \\u001b[0msize\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mmin\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0msize\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mnrows\\u001b[0m \\u001b[0;34m-\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_currow\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1070\\u001b[0;31m         \\u001b[0;32mreturn\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mnrows\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0msize\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m   1071\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1072\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36mread\\u001b[0;34m(self, nrows)\\u001b[0m\\n\\u001b[1;32m   1034\\u001b[0m                 \\u001b[0;32mraise\\u001b[0m 
\\u001b[0mValueError\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0;34m'skipfooter not supported for iteration'\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1035\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1036\\u001b[0;31m         \\u001b[0mret\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_engine\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mnrows\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m   1037\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1038\\u001b[0m         \\u001b[0;31m# May alter columns / col_dict\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py\\u001b[0m in \\u001b[0;36mread\\u001b[0;34m(self, nrows)\\u001b[0m\\n\\u001b[1;32m   1846\\u001b[0m     \\u001b[0;32mdef\\u001b[0m \\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mself\\u001b[0m\\u001b[0;34m,\\u001b[0m \\u001b[0mnrows\\u001b[0m\\u001b[0;34m=\\u001b[0m\\u001b[0;32mNone\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1847\\u001b[0m         \\u001b[0;32mtry\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m-> 1848\\u001b[0;31m             \\u001b[0mdata\\u001b[0m \\u001b[0;34m=\\u001b[0m \\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_reader\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0mread\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0mnrows\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m   1849\\u001b[0m         \\u001b[0;32mexcept\\u001b[0m \\u001b[0mStopIteration\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m   1850\\u001b[0m             \\u001b[0;32mif\\u001b[0m 
\\u001b[0mself\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0m_first_chunk\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader.read\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._read_low_memory\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._read_rows\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._convert_column_data\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._convert_tokens\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32mpandas/_libs/parsers.pyx\\u001b[0m in \\u001b[0;36mpandas._libs.parsers.TextReader._convert_with_dtype\\u001b[0;34m()\\u001b[0m\\n\",\n      \"\\u001b[0;32m/opt/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/common.py\\u001b[0m in \\u001b[0;36mis_integer_dtype\\u001b[0;34m(arr_or_dtype)\\u001b[0m\\n\\u001b[1;32m    809\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[1;32m    810\\u001b[0m \\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0;32m--> 811\\u001b[0;31m \\u001b[0;32mdef\\u001b[0m \\u001b[0mis_integer_dtype\\u001b[0m\\u001b[0;34m(\\u001b[0m\\u001b[0marr_or_dtype\\u001b[0m\\u001b[0;34m)\\u001b[0m\\u001b[0;34m:\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\\u001b[0m\\u001b[1;32m    812\\u001b[0m     \\\"\\\"\\\"\\n\\u001b[1;32m    813\\u001b[0m     \\u001b[0mCheck\\u001b[0m \\u001b[0mwhether\\u001b[0m \\u001b[0mthe\\u001b[0m \\u001b[0mprovided\\u001b[0m \\u001b[0marray\\u001b[0m \\u001b[0;32mor\\u001b[0m \\u001b[0mdtype\\u001b[0m \\u001b[0;32mis\\u001b[0m \\u001b[0mof\\u001b[0m \\u001b[0man\\u001b[0m \\u001b[0minteger\\u001b[0m 
\\u001b[0mdtype\\u001b[0m\\u001b[0;34m.\\u001b[0m\\u001b[0;34m\\u001b[0m\\u001b[0m\\n\",\n      \"\\u001b[0;31mKeyboardInterrupt\\u001b[0m: \"\n     ],\n     \"output_type\": \"error\"\n    }\n   ],\n   \"source\": [\n    \"c.to_sql_stream('mysql+mysqlconnector://testusr:testusr@localhost/test','testd6tstack')\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.3\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "examples-sql.ipynb",
    "content": "{\n \"cells\": [\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Data Engineering in Python with databolt  - Fast Loading to SQL with pandas (d6tlib/d6tstack)\\n\",\n    \"\\n\",\n    \"Pandas and SQL are great but they have some problems:\\n\",\n    \"* loading data from pandas to SQL is very slow. So you can't preprocess data with python and then quickly store it in a db\\n\",\n    \"* Loading CSV files into SQL is cumbersome and quickly breaks when input files are not consistent\\n\",\n    \"\\n\",\n    \"With `d6tstack` you can:\\n\",\n    \"* load pandas dataframes to postgres or mysql much faster than with `pd.to_sql()` and with minimal memory consumption\\n\",\n    \"* preprocess CSV files with pandas before writing to db\\n\",\n    \"* solve data schema problems (eg new or renamed columns) before writing to db \\n\",\n    \"* out of core functionality where large files are processed in chunks\\n\",\n    \"\\n\",\n    \"In this workbook we will demonstrate the usage of the d6tstack library for quickly loading data into SQL from CSV files and pandas.\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# pd.to_sql() is slow\\n\",\n    \"Let's see how slow `pd.to_sql()` is storing 100k rows of random data.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 34,\n   \"metadata\": {\n    \"scrolled\": false\n   },\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"(100000, 23)\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import pandas as pd\\n\",\n    \"import numpy as np\\n\",\n    \"import uuid\\n\",\n    \"import sqlalchemy\\n\",\n    \"import glob\\n\",\n    \"import time\\n\",\n    \"\\n\",\n    \"cfg_uri_psql = 'postgresql+psycopg2://psqlusr:psqlpwdpsqlpwd@localhost/psqltest'\\n\",\n    \"cfg_uri_mysql = 
'mysql+mysqlconnector://testusr:testpwd@localhost/testdb'\\n\",\n    \"\\n\",\n    \"cfg_nobs = int(1e5)\\n\",\n    \"np.random.seed(0)\\n\",\n    \"df = pd.DataFrame({'id':range(cfg_nobs)})\\n\",\n    \"df['uuid']=[uuid.uuid4().hex.upper()[0:10] for _ in range(cfg_nobs)]\\n\",\n    \"df['date']=pd.date_range('1/1/2010',periods=cfg_nobs, freq='1T')\\n\",\n    \"for i in range(20):\\n\",\n    \"    df['d'+str(i)]=np.random.normal(size=int(cfg_nobs))\\n\",\n    \"\\n\",\n    \"print(df.shape)\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 35,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"--- 28.010449647903442 seconds ---\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"sqlengine = sqlalchemy.create_engine(cfg_uri_psql)\\n\",\n    \"\\n\",\n    \"start_time = time.time()\\n\",\n    \"df.to_sql('benchmark',sqlengine,if_exists='replace')\\n\",\n    \"print(\\\"--- %s seconds ---\\\" % (time.time() - start_time))\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Speeding up pd.to_sql() in postgres and mysql with d6tstack\\n\",\n    \"Let's see how we can make this faster. 
In this simple example we get a ~5x speedup, and the speedup grows with larger datasets.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 36,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"--- 4.6529316902160645 seconds ---\\n\",\n      \"creating mysql.csv ok\\n\",\n      \"loading mysql.csv ok\\n\",\n      \"--- 7.102342367172241 seconds ---\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"import d6tstack.utils\\n\",\n    \"\\n\",\n    \"# psql\\n\",\n    \"start_time = time.time()\\n\",\n    \"d6tstack.utils.pd_to_psql(df, cfg_uri_psql, 'benchmark', if_exists='replace')\\n\",\n    \"print(\\\"--- %s seconds ---\\\" % (time.time() - start_time))\\n\",\n    \"\\n\",\n    \"# mysql\\n\",\n    \"start_time = time.time()\\n\",\n    \"d6tstack.utils.pd_to_mysql(df, cfg_uri_mysql, 'benchmark', if_exists='replace')\\n\",\n    \"print(\\\"--- %s seconds ---\\\" % (time.time() - start_time))\\n\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Using Pandas for preprocessing CSVs before storing to a database\\n\",\n    \"Pandas is great for preprocessing data. For example, let's say we want to process dates before importing them to a database. 
`d6tstack` makes this easy: you simply pass the filename or list of files along with the preprocessing function, and the data is quickly loaded into SQL without loading everything into memory.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 46,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"         date  sales  cost  profit\\n\",\n      \"0  2011-02-01    200   -90     110\\n\",\n      \"1  2011-02-02    200   -90     110\\n\",\n      \"2  2011-02-03    200   -90     110\\n\",\n      \"3  2011-02-04    200   -90     110\\n\",\n      \"4  2011-02-05    200   -90     110\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fname = 'test-data/input/test-data-input-csv-colmismatch-feb.csv'\\n\",\n    \"print(pd.read_csv(cfg_fname).head())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 47,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"        date  sales  cost  profit date_year_quarter date_monthend\\n\",\n      \"0 2011-02-01    200   -90     110              11Q1    2011-02-28\\n\",\n      \"1 2011-02-02    200   -90     110              11Q1    2011-02-28\\n\",\n      \"2 2011-02-03    200   -90     110              11Q1    2011-02-28\\n\",\n      \"3 2011-02-04    200   -90     110              11Q1    2011-02-28\\n\",\n      \"4 2011-02-05    200   -90     110              11Q1    2011-02-28\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"def apply(dfg):\\n\",\n    \"    dfg['date'] = pd.to_datetime(dfg['date'], format='%Y-%m-%d')\\n\",\n    \"    dfg['date_year_quarter'] = (dfg['date'].dt.year).astype(str).str[-2:]+'Q'+(dfg['date'].dt.quarter).astype(str)\\n\",\n    \"    dfg['date_monthend'] = dfg['date'] + pd.tseries.offsets.MonthEnd()\\n\",\n    \"    return dfg\\n\",\n    \"\\n\",\n 
   \"d6tstack.combine_csv.CombinerCSV([cfg_fname], apply_after_read=apply,add_filename=False).to_psql_combine(cfg_uri_psql, 'benchmark', if_exists='replace')\\n\",\n    \"print(pd.read_sql_table('benchmark',sqlengine).head())\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"# Loading multiple CSV to SQL with data schema changes\\n\",\n    \"Native database import commands only support one file. You can write a script to process multipe files which first of all is annoying and even worse it often breaks eg if there are schema changes. With `d6tstack` you quickly import multiple files and deal with data schema changes with just a couple of lines of python. The below is a quick example, to explore full functionality see  https://github.com/d6t/d6tstack/blob/master/examples-csv.ipynb\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 48,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"all equal False\\n\",\n      \"\\n\"\n     ]\n    },\n    {\n     \"data\": {\n      \"text/html\": [\n       \"<div>\\n\",\n       \"<style scoped>\\n\",\n       \"    .dataframe tbody tr th:only-of-type {\\n\",\n       \"        vertical-align: middle;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe tbody tr th {\\n\",\n       \"        vertical-align: top;\\n\",\n       \"    }\\n\",\n       \"\\n\",\n       \"    .dataframe thead th {\\n\",\n       \"        text-align: right;\\n\",\n       \"    }\\n\",\n       \"</style>\\n\",\n       \"<table border=\\\"1\\\" class=\\\"dataframe\\\">\\n\",\n       \"  <thead>\\n\",\n       \"    <tr style=\\\"text-align: right;\\\">\\n\",\n       \"      <th></th>\\n\",\n       \"      <th>date</th>\\n\",\n       \"      <th>sales</th>\\n\",\n       \"      <th>cost</th>\\n\",\n       \"      <th>profit</th>\\n\",\n       \"      
<th>profit2</th>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>file_path</th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"      <th></th>\\n\",\n       \"    </tr>\\n\",\n       \"  </thead>\\n\",\n       \"  <tbody>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>test-data/input/test-data-input-csv-colmismatch-feb.csv</th>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>False</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>test-data/input/test-data-input-csv-colmismatch-jan.csv</th>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>False</td>\\n\",\n       \"    </tr>\\n\",\n       \"    <tr>\\n\",\n       \"      <th>test-data/input/test-data-input-csv-colmismatch-mar.csv</th>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"      <td>True</td>\\n\",\n       \"    </tr>\\n\",\n       \"  </tbody>\\n\",\n       \"</table>\\n\",\n       \"</div>\"\n      ],\n      \"text/plain\": [\n       \"                                                    date  sales  cost  profit  profit2\\n\",\n       \"file_path                                                                             \\n\",\n       \"test-data/input/test-data-input-csv-colmismatch...  True   True  True    True    False\\n\",\n       \"test-data/input/test-data-input-csv-colmismatch...  True   True  True    True    False\\n\",\n       \"test-data/input/test-data-input-csv-colmismatch...  
True   True  True    True     True\"\n      ]\n     },\n     \"execution_count\": 48,\n     \"metadata\": {},\n     \"output_type\": \"execute_result\"\n    }\n   ],\n   \"source\": [\n    \"import glob\\n\",\n    \"import d6tstack.combine_csv\\n\",\n    \"\\n\",\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))\\n\",\n    \"c = d6tstack.combine_csv.CombinerCSV(cfg_fnames)\\n\",\n    \"\\n\",\n    \"# check columns\\n\",\n    \"print('all equal',c.is_all_equal())\\n\",\n    \"print('')\\n\",\n    \"c.is_column_present()\"\n   ]\n  },\n  {\n   \"cell_type\": \"markdown\",\n   \"metadata\": {},\n   \"source\": [\n    \"The presence of the additional `profit2` column in the 3rd file would break the data load. `d6tstack` will fix the situation and load everything correctly. And you can run any additional preprocessing logic like in the above example. All this is done out of core so you can process even large files without any memory issues.\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": 49,\n   \"metadata\": {},\n   \"outputs\": [\n    {\n     \"name\": \"stdout\",\n     \"output_type\": \"stream\",\n     \"text\": [\n      \"sniffing columns ok\\n\",\n      \"         date  sales  cost  profit  profit2 date_year_quarter date_monthend\\n\",\n      \"25 2011-03-06    300  -100     200    400.0              11Q1    2011-03-31\\n\",\n      \"26 2011-03-07    300  -100     200    400.0              11Q1    2011-03-31\\n\",\n      \"27 2011-03-08    300  -100     200    400.0              11Q1    2011-03-31\\n\",\n      \"28 2011-03-09    300  -100     200    400.0              11Q1    2011-03-31\\n\",\n      \"29 2011-03-10    300  -100     200    400.0              11Q1    2011-03-31\\n\"\n     ]\n    }\n   ],\n   \"source\": [\n    \"cfg_fnames = list(glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv'))\\n\",\n    \"d6tstack.combine_csv.CombinerCSV(cfg_fnames, 
apply_after_read=apply,add_filename=False).to_psql_combine(cfg_uri_psql, 'benchmark', if_exists='replace')\\n\",\n    \"print(pd.read_sql_table('benchmark',sqlengine).tail())\"\n   ]\n  },\n  {\n   \"cell_type\": \"code\",\n   \"execution_count\": null,\n   \"metadata\": {\n    \"collapsed\": true\n   },\n   \"outputs\": [],\n   \"source\": []\n  }\n ],\n \"metadata\": {\n  \"kernelspec\": {\n   \"display_name\": \"Python 3\",\n   \"language\": \"python\",\n   \"name\": \"python3\"\n  },\n  \"language_info\": {\n   \"codemirror_mode\": {\n    \"name\": \"ipython\",\n    \"version\": 3\n   },\n   \"file_extension\": \".py\",\n   \"mimetype\": \"text/x-python\",\n   \"name\": \"python\",\n   \"nbconvert_exporter\": \"python\",\n   \"pygments_lexer\": \"ipython3\",\n   \"version\": \"3.6.3\"\n  }\n },\n \"nbformat\": 4,\n \"nbformat_minor\": 2\n}\n"
  },
  {
    "path": "requirements-dev.txt",
    "content": "pytest\nsphinx\nsphinxcontrib-napoleon\nsphinx_rtd_theme\ndask[dataframe]\nfastparquet\npython-snappy\nxlwt\n"
  },
  {
    "path": "requirements.txt",
    "content": "numpy\nopenpyxl\nxlrd\npandas>=0.22.0\nsqlalchemy\nscipy\npyarrow\npsycopg2\nmysql-connector\nd6tcollect"
  },
  {
    "path": "setup.cfg",
    "content": "[metadata]\ndescription-file = README.md"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup\n\nextras = {\n    'xls': ['openpyxl','xlrd'],\n    'parquet': ['pyarrow'],\n    'psql': ['psycopg2-binary'],\n    'mysql': ['mysql-connector'],\n}\n\nsetup(\n    name='d6tstack',\n    version='0.2.0',\n    packages=['d6tstack'],\n    url='https://github.com/d6t/d6tstack',\n    license='MIT',\n    author='DataBolt Team',\n    author_email='support@databolt.tech',\n    description='d6tstack: Quickly ingest CSV and XLS files. Export to pandas, SQL, parquet',\n    long_description='Quickly ingest raw files. Works for XLS, CSV, TXT which can be exported to CSV, Parquet, SQL and Pandas. d6tstack solves many performance and schema problems typically encountered when ingesting raw files.',\n    install_requires=[\n        'numpy','pandas>=0.22.0','sqlalchemy','scipy','d6tcollect'\n    ],\n    extras_require=extras,\n    include_package_data=True,\n    python_requires='>=3.5',\n    keywords=['d6tstack', 'ingest csv'],\n    classifiers=[]\n)\n"
  },
  {
    "path": "tests/__init__.py",
    "content": ""
  },
  {
    "path": "tests/pypi.sh",
    "content": "# pip install setuptools wheel twine\npython setup.py sdist bdist_wheel\ntwine upload dist/*"
  },
  {
    "path": "tests/test-parquet.py",
    "content": "# -*- coding: utf-8 -*-\n\"\"\"\nCreated on Sun Jun 24 10:45:14 2018\n\n@author: deepmind\n\"\"\"\n\nimport pandas as pd\nimport glob\nfrom fastparquet import write\nfrom fastparquet import ParquetFile\n\nimport pyarrow.parquet as pq\n\nfor fname in glob.glob('test-data-input-csv-*.csv'):\n    df=pd.read_csv(fname)\n    df['date']=pd.to_datetime(df['date'],format='%Y-%m-%d')\n#    write(fname[:-4]+'.parq', df)\n    pq.write_table(table, 'example.parquet')pa.Table.from_pandas(df)\n\nimport dask.dataframe as dd\nddf = dd.read_parquet('test-data-input-csv-*.csv')\nddf.head()\n\nddf = dd.read_parquet('test-data-input-csv-*.parq')\nddf.head()\nddf.tail()\nddf.compute()\n\ndft = ParquetFile('test-data-input-csv-mar.parq').to_pandas()\nassert df.equals(dft)\n\nddf = dd.read_csv('test-data-input-csv-*.parq')\n"
  },
  {
    "path": "tests/test_combine_csv.py",
    "content": "\"\"\"Run unit tests.\n\nUse this to run tests and understand how tasks.py works.\n\nSetup::\n\n    mkdir -p test-data/input\n    mkdir -p test-data/output\n    mysql -u root -p\n        CREATE DATABASE testdb;\n        CREATE USER 'testusr'@'localhost' IDENTIFIED BY 'testpwd';\n        GRANT ALL PRIVILEGES ON testdb.* TO 'testusr'@'%';\n\nRun tests::\n\n    pytest test_combine.py -s\n\nNotes:\n\n    * this will create sample csv, xls and xlsx files    \n    * test_combine_() test the main combine function\n\n\"\"\"\n\nfrom d6tstack.combine_csv import *\nfrom d6tstack.sniffer import CSVSniffer\nimport d6tstack.utils\n\nimport math\nimport pandas as pd\n# import pyarrow as pa\n# import pyarrow.parquet as pq\nimport ntpath\nimport shutil\nimport dask.dataframe as dd\nimport sqlalchemy\n\nimport pytest\n\ncfg_fname_base_in = 'test-data/input/test-data-'\ncfg_fname_base_out_dir = 'test-data/output'\ncfg_fname_base_out = cfg_fname_base_out_dir+'/test-data-'\ncnxn_string = 'sqlite:///test-data/db/{}.db'\n\n#************************************************************\n# fixtures\n#************************************************************\nclass DebugLogger(object):\n    def __init__(self, event):\n        pass\n        \n    def send_log(self, msg, status):\n        pass\n\n    def send(self, data):\n        pass\n\nlogger = DebugLogger('combiner')\n\n# sample data\ndef create_files_df_clean():\n    # create sample data\n    df1=pd.DataFrame({'date':pd.date_range('1/1/2011', periods=10), 'sales': 100, 'cost':-80, 'profit':20})\n    df2=pd.DataFrame({'date':pd.date_range('2/1/2011', periods=10), 'sales': 200, 'cost':-90, 'profit':200-90})\n    df3=pd.DataFrame({'date':pd.date_range('3/1/2011', periods=10), 'sales': 300, 'cost':-100, 'profit':300-100})\n#    cfg_col = [ 'date', 'sales','cost','profit']\n    \n    # return df1[cfg_col], df2[cfg_col], df3[cfg_col]\n    return df1, df2, df3\n\ndef create_files_df_clean_combine():\n    df1,df2,df3 = 
create_files_df_clean()\n    df_all = pd.concat([df1,df2,df3])\n    df_all = df_all[df_all.columns].astype(str)\n    \n    return df_all\n\n\ndef create_files_df_clean_combine_with_filename(fname_list):\n    df1, df2, df3 = create_files_df_clean()\n    df1['filename'] = os.path.basename(fname_list[0])\n    df2['filename'] = os.path.basename(fname_list[1])\n    df3['filename'] = os.path.basename(fname_list[2])\n    df_all = pd.concat([df1, df2, df3])\n    df_all = df_all[df_all.columns].astype(str)\n\n    return df_all\n\n\ndef create_files_df_colmismatch_combine(cfg_col_common,allstr=True):\n    df1, df2, df3 = create_files_df_clean()\n    df3['profit2']=df3['profit']*2\n    if cfg_col_common:\n        df_all = pd.concat([df1, df2, df3], join='inner')\n    else:\n        df_all = pd.concat([df1, df2, df3])\n    if allstr:\n        df_all = df_all[df_all.columns].astype(str)\n\n    return df_all\n\n\ndef check_df_colmismatch_combine(dfg,is_common=False, convert_date=True):\n    dfg = dfg.drop(['filepath','filename'],1).sort_values('date').reset_index(drop=True)\n    if convert_date:\n        dfg['date'] = pd.to_datetime(dfg['date'], format='%Y-%m-%d')\n    dfchk = create_files_df_colmismatch_combine(is_common,False).reset_index(drop=True)[dfg.columns]\n    assert dfg.equals(dfchk)\n    return True\n\n\ndef create_files_df_colmismatch_combine2(cfg_col_common):\n    df1, df2, df3 = create_files_df_clean()\n    for i in range(15):\n        df3['profit'+str(i)]=df3['profit']*2\n    if cfg_col_common:\n        df_all = pd.concat([df1, df2, df3], join='inner')\n    else:\n        df_all = pd.concat([df1, df2, df3])\n    df_all = df_all[df_all.columns].astype(str)\n\n    return df_all\n\n\n# csv standard\n@pytest.fixture(scope=\"module\")\ndef create_files_csv():\n\n    df1,df2,df3 = create_files_df_clean()\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-clean-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False)\n    df2.to_csv(cfg_fname % 
'feb',index=False)\n    df3.to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_colmismatch():\n\n    df1,df2,df3 = create_files_df_clean()\n    df3['profit2']=df3['profit']*2\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-colmismatch-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False)\n    df2.to_csv(cfg_fname % 'feb',index=False)\n    df3.to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_colmismatch2():\n\n    df1,df2,df3 = create_files_df_clean()\n    for i in range(15):\n        df3['profit'+str(i)]=df3['profit']*2\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-colmismatch2-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False)\n    df2.to_csv(cfg_fname % 'feb',index=False)\n    df3.to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_colreorder():\n\n    df1,df2,df3 = create_files_df_clean()\n    cfg_col = [ 'date', 'sales','cost','profit']\n    cfg_col2 = [ 'date', 'sales','profit','cost']\n    \n    # return df1[cfg_col], df2[cfg_col], df3[cfg_col]\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-reorder-%s.csv'\n    df1[cfg_col].to_csv(cfg_fname % 'jan',index=False)\n    df2[cfg_col].to_csv(cfg_fname % 'feb',index=False)\n    df3[cfg_col2].to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_noheader():\n\n    df1,df2,df3 = create_files_df_clean()\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-noheader-csv-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False, header=False)\n    df2.to_csv(cfg_fname % 'feb',index=False, header=False)\n    
df3.to_csv(cfg_fname % 'mar',index=False, header=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_col_renamed():\n\n    df1, df2, df3 = create_files_df_clean()\n    df3 = df3.rename(columns={'sales':'revenue'})\n    cfg_col = ['date', 'sales', 'profit', 'cost']\n    cfg_col2 = ['date', 'revenue', 'profit', 'cost']\n\n    cfg_fname = cfg_fname_base_in + 'input-csv-renamed-%s.csv'\n    df1[cfg_col].to_csv(cfg_fname % 'jan', index=False)\n    df2[cfg_col].to_csv(cfg_fname % 'feb', index=False)\n    df3[cfg_col2].to_csv(cfg_fname % 'mar', index=False)\n\n    return [cfg_fname % 'jan', cfg_fname % 'feb', cfg_fname % 'mar']\n\n\ndef create_files_csv_dirty(cfg_sep=\",\", cfg_header=True):\n\n    df1,df2,df3 = create_files_df_clean()\n    df1.to_csv(cfg_fname_base_in+'debug.csv',index=False, sep=cfg_sep, header=cfg_header)\n\n    return cfg_fname_base_in+'debug.csv'\n\n# excel single-tab\ndef create_files_xls_single_helper(cfg_fname):\n    df1,df2,df3 = create_files_df_clean()\n    df1.to_excel(cfg_fname % 'jan',index=False)\n    df2.to_excel(cfg_fname % 'feb',index=False)\n    df3.to_excel(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_xls_single():\n    return create_files_xls_single_helper(cfg_fname_base_in+'input-xls-sing-%s.xls')\n\n@pytest.fixture(scope=\"module\")\ndef create_files_xlsx_single():\n    return create_files_xls_single_helper(cfg_fname_base_in+'input-xls-sing-%s.xlsx')\n\n\ndef write_file_xls(dfg, fname, startrow=0,startcol=0):\n    writer = pd.ExcelWriter(fname)\n    dfg.to_excel(writer, 'Sheet1', index=False,startrow=startrow,startcol=startcol)\n    dfg.to_excel(writer, 'Sheet2', index=False,startrow=startrow,startcol=startcol)\n    writer.save()\n\n# excel multi-tab\ndef create_files_xls_multiple_helper(cfg_fname):\n\n    df1,df2,df3 = 
create_files_df_clean()\n    write_file_xls(df1,cfg_fname % 'jan')\n    write_file_xls(df2,cfg_fname % 'feb')\n    write_file_xls(df3,cfg_fname % 'mar')\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n    \n@pytest.fixture(scope=\"module\")\ndef create_files_xls_multiple():\n    return create_files_xls_multiple_helper(cfg_fname_base_in+'input-xls-mult-%s.xls')\n\n@pytest.fixture(scope=\"module\")\ndef create_files_xlsx_multiple():\n    return create_files_xls_multiple_helper(cfg_fname_base_in+'input-xls-mult-%s.xlsx')\n    \n#************************************************************\n# tests - helpers\n#************************************************************\n\ndef test_file_extensions_get():\n    fname_list = ['a.csv','b.csv']\n    ext_list = file_extensions_get(fname_list)\n    assert ext_list==['.csv','.csv']\n    \n    fname_list = ['a.xls','b.xls']\n    ext_list = file_extensions_get(fname_list)\n    assert ext_list==['.xls','.xls']\n\ndef test_file_extensions_all_equal():\n    ext_list = ['.csv']*2\n    assert file_extensions_all_equal(ext_list)\n    ext_list = ['.xls']*2\n    assert file_extensions_all_equal(ext_list)\n    ext_list = ['.csv','.xls']\n    assert not file_extensions_all_equal(ext_list)\n    \ndef test_file_extensions_valid():\n    ext_list = ['.csv']*2\n    assert file_extensions_valid(ext_list)\n    ext_list = ['.xls']*2\n    assert file_extensions_valid(ext_list)\n    ext_list = ['.exe','.xls']\n    assert not file_extensions_valid(ext_list)\n\n#************************************************************\n#************************************************************\n# scan header\n#************************************************************\n#************************************************************\ndef test_csv_sniff(create_files_csv, create_files_csv_colmismatch, create_files_csv_colreorder):\n\n    with pytest.raises(ValueError) as e:\n        c = CombinerCSV([])\n\n    # clean\n    combiner = 
CombinerCSV(fname_list=create_files_csv)\n    combiner.sniff_columns()\n    assert combiner.is_all_equal()\n    assert combiner.is_column_present().all().all()\n    assert combiner.sniff_results['columns_all'] == ['date', 'sales', 'cost', 'profit']\n    assert combiner.sniff_results['columns_common'] == combiner.sniff_results['columns_all']\n    assert combiner.sniff_results['columns_unique'] == []\n\n    # extra column\n    combiner = CombinerCSV(fname_list=create_files_csv_colmismatch)\n    combiner.sniff_columns()\n    assert not combiner.is_all_equal()\n    assert not combiner.is_column_present().all().all()\n    assert combiner.is_column_present().all().values.tolist()==[True, True, True, True, False]\n    assert combiner.sniff_results['columns_all'] == ['date', 'sales', 'cost', 'profit', 'profit2']\n    assert combiner.sniff_results['columns_common'] == ['date', 'sales', 'cost', 'profit']\n    assert combiner.is_column_present_common().columns.tolist() == ['date', 'sales', 'cost', 'profit']\n    assert combiner.sniff_results['columns_unique'] == ['profit2']\n    assert combiner.is_column_present_unique().columns.tolist() == ['profit2']\n\n    # mixed order\n    combiner = CombinerCSV(fname_list=create_files_csv_colreorder)\n    combiner.sniff_columns()\n    assert not combiner.is_all_equal()\n    assert combiner.sniff_results['df_columns_order']['profit'].values.tolist() == [3, 3, 2]\n\n\ndef test_csv_selectrename(create_files_csv, create_files_csv_colmismatch):\n\n    # rename\n    df = CombinerCSV(fname_list=create_files_csv).preview_rename()\n    assert df.empty\n    df = CombinerCSV(fname_list=create_files_csv, columns_rename={'notthere':'nan'}).preview_rename()\n    assert df.empty\n\n    df = CombinerCSV(fname_list=create_files_csv, columns_rename={'cost':'cost2'}).preview_rename()\n    assert df.columns.tolist()==['cost']\n    assert df['cost'].unique().tolist()==['cost2']\n\n    df = CombinerCSV(fname_list=create_files_csv_colmismatch, 
columns_rename={'profit2':'profit3'}).preview_rename()\n    assert df.columns.tolist()==['profit2']\n    assert df['profit2'].unique().tolist()==[np.nan, 'profit3']\n\n    # select\n    l = CombinerCSV(fname_list=create_files_csv).preview_select()\n    assert l == ['date', 'sales', 'cost', 'profit']\n    l2 = CombinerCSV(fname_list=create_files_csv, columns_select_common=True).preview_select()\n    assert l2==l\n    l = CombinerCSV(fname_list=create_files_csv, columns_select=['date', 'sales', 'cost']).preview_select()\n    assert l == ['date', 'sales', 'cost']\n\n    l = CombinerCSV(fname_list=create_files_csv_colmismatch).preview_select()\n    assert l == ['date', 'sales', 'cost', 'profit', 'profit2']\n    l = CombinerCSV(fname_list=create_files_csv_colmismatch, columns_select_common=True).preview_select()\n    assert l == ['date', 'sales', 'cost', 'profit']\n\n    # rename+select\n    l = CombinerCSV(fname_list=create_files_csv_colmismatch, columns_select=['date','profit2'], columns_rename={'profit2':'profit3'}).preview_select()\n    assert l==['date', 'profit3']\n    l = CombinerCSV(fname_list=create_files_csv_colmismatch, columns_select=['date','profit3'], columns_rename={'profit2':'profit3'}).preview_select()\n    assert l==['date', 'profit3']\n\ndef test_to_pandas(create_files_csv, create_files_csv_colmismatch, create_files_csv_colreorder):\n    df = CombinerCSV(fname_list=create_files_csv).to_pandas()\n    assert df.shape == (30, 6)\n    \n    df = CombinerCSV(fname_list=create_files_csv_colmismatch).to_pandas()\n    assert df.shape == (30, 6+1)\n    assert df['profit2'].isnull().unique().tolist() == [True, False]\n    df = CombinerCSV(fname_list=create_files_csv_colmismatch, columns_select_common=True).to_pandas()\n    assert df.shape == (30, 6)\n    assert 'profit2' not in df.columns\n\n    # rename+select\n    df = CombinerCSV(fname_list=create_files_csv_colmismatch, columns_select=['date','profit2'], columns_rename={'profit2':'profit3'}, 
add_filename=False).to_pandas()\n    assert df.shape == (30, 2)\n    assert 'profit3' in df.columns and not 'profit2' in df.columns\n    df = CombinerCSV(fname_list=create_files_csv_colmismatch, columns_select=['date','profit3'], columns_rename={'profit2':'profit3'}, add_filename=False).to_pandas()\n    assert df.shape == (30, 2)\n    assert 'profit3' in df.columns and not 'profit2' in df.columns\n\ndef test_combinepreview(create_files_csv_colmismatch):\n    df = CombinerCSV(fname_list=create_files_csv_colmismatch).combine_preview()\n    assert df.shape == (9, 6+1)\n    assert df.dtypes.tolist() == [np.dtype('O'), np.dtype('int64'), np.dtype('int64'), np.dtype('int64'), np.dtype('float64'), np.dtype('O'), np.dtype('O')]\n\n    def apply(dfg):\n        dfg['date'] = pd.to_datetime(dfg['date'], format='%Y-%m-%d')\n        return dfg\n\n    df = CombinerCSV(fname_list=create_files_csv_colmismatch, apply_after_read=apply).combine_preview()\n    assert df.shape == (9, 6+1)\n    assert df.dtypes.tolist() == [np.dtype('<M8[ns]'), np.dtype('int64'), np.dtype('int64'), np.dtype('int64'), np.dtype('float64'), np.dtype('O'), np.dtype('O')]\n\n\ndef test_tocsv(create_files_csv_colmismatch):\n    fname = 'test-data/output/combined.csv'\n    fnameout = CombinerCSV(fname_list=create_files_csv_colmismatch).to_csv_combine(filename=fname)\n    assert fname == fnameout\n    df = pd.read_csv(fname)\n    dfchk = df.copy()\n    assert df.shape == (30, 4+1+2)\n    assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'profit2', 'filepath', 'filename']\n    assert check_df_colmismatch_combine(df)\n    fnameout = CombinerCSV(fname_list=create_files_csv_colmismatch, columns_select_common=True).to_csv_combine(filename=fname)\n    df = pd.read_csv(fname)\n    assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'filepath', 'filename']\n    assert check_df_colmismatch_combine(df,is_common=True)\n\n    def helper(fdir):\n        fnamesout = 
CombinerCSV(fname_list=create_files_csv_colmismatch).to_csv_align(output_dir=fdir)\n        for fname in fnamesout:\n            df = pd.read_csv(fname)\n            assert df.shape == (10, 4+1+2)\n            assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'profit2', 'filepath', 'filename']\n    helper('test-data/output')\n    helper('test-data/output/')\n\n    df = dd.read_csv('test-data/output/d6tstack-test-data-input-csv-colmismatch-*.csv')\n    df = df.compute()\n    assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'profit2', 'filepath', 'filename']\n    assert df.reset_index(drop=True).equals(dfchk)\n    assert check_df_colmismatch_combine(df)\n\n    # check creates directory\n    try:\n        shutil.rmtree('test-data/output-tmp')\n    except:\n        pass\n    _ = CombinerCSV(fname_list=create_files_csv_colmismatch).to_csv_align(output_dir='test-data/output-tmp')\n    try:\n        shutil.rmtree('test-data/output-tmp')\n    except:\n        pass\n\n\ndef test_topq(create_files_csv_colmismatch):\n    fname = 'test-data/output/combined.pq'\n    fnameout = CombinerCSV(fname_list=create_files_csv_colmismatch).to_parquet_combine(filename=fname)\n    assert fname == fnameout\n    df = pd.read_parquet(fname, engine='fastparquet')\n    assert df.shape == (30, 4+1+2)\n    assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'profit2', 'filepath', 'filename']\n    df2 = pd.read_parquet(fname, engine='pyarrow')\n    assert df2.equals(df)\n    assert check_df_colmismatch_combine(df)\n\n    df = dd.read_parquet(fname)\n    df = df.compute()\n    assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'profit2', 'filepath', 'filename']\n    df2 = pd.read_parquet(fname, engine='fastparquet')\n    assert df2.equals(df)\n    df3 = pd.read_parquet(fname, engine='pyarrow')\n    assert df3.equals(df)\n    assert check_df_colmismatch_combine(df)\n\n\n    def helper(fdir):\n        fnamesout = 
CombinerCSV(fname_list=create_files_csv_colmismatch).to_parquet_align(output_dir=fdir)\n        for fname in fnamesout:\n            df = pd.read_parquet(fname, engine='fastparquet')\n            assert df.shape == (10, 4+1+2)\n            assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'profit2', 'filepath', 'filename']\n    helper('test-data/output')\n\n    df = dd.read_parquet('test-data/output/d6tstack-test-data-input-csv-colmismatch-*.pq')\n    df = df.compute()\n    assert df.columns.tolist() == ['date', 'sales', 'cost', 'profit', 'profit2', 'filepath', 'filename']\n    assert check_df_colmismatch_combine(df)\n\n    # todo: write tests such that compare to concat df not always repeat same code to test shape and columns\n\ndef test_tosql(create_files_csv_colmismatch):\n    tblname = 'testd6tstack'\n\n    def apply(dfg):\n        dfg['date'] = pd.to_datetime(dfg['date'], format='%Y-%m-%d')\n        return dfg\n\n    def helper(uri):\n        sql_engine = sqlalchemy.create_engine(uri)\n        CombinerCSV(fname_list=create_files_csv_colmismatch).to_sql_combine(uri, tblname, 'replace')\n        df = pd.read_sql_table(tblname, sql_engine)\n        assert check_df_colmismatch_combine(df)\n\n        # with date convert\n        CombinerCSV(fname_list=create_files_csv_colmismatch, apply_after_read=apply).to_sql_combine(uri, tblname, 'replace')\n        df = pd.read_sql_table(tblname, sql_engine)\n        assert check_df_colmismatch_combine(df, convert_date=False)\n\n    uri = 'postgresql+psycopg2://psqlusr:psqlpwdpsqlpwd@localhost/psqltest'\n    helper(uri)\n    uri = 'mysql+pymysql://testusr:testpwd@localhost/testdb'\n    helper(uri)\n\n    uri = 'postgresql+psycopg2://psqlusr:psqlpwdpsqlpwd@localhost/psqltest'\n    sql_engine = sqlalchemy.create_engine(uri)\n    CombinerCSV(fname_list=create_files_csv_colmismatch).to_psql_combine(uri, tblname, if_exists='replace')\n    df = pd.read_sql_table(tblname, sql_engine)\n    assert df.shape == (30, 
4+1+2)\n    assert check_df_colmismatch_combine(df)\n\n    CombinerCSV(fname_list=create_files_csv_colmismatch, apply_after_read=apply).to_psql_combine(uri, tblname, if_exists='replace')\n    df = pd.read_sql_table(tblname, sql_engine)\n    assert check_df_colmismatch_combine(df, convert_date=False)\n\n    uri = 'mysql+mysqlconnector://testusr:testpwd@localhost/testdb'\n    sql_engine = sqlalchemy.create_engine(uri)\n    CombinerCSV(fname_list=create_files_csv_colmismatch).to_mysql_combine(uri, tblname, if_exists='replace')\n    df = pd.read_sql_table(tblname, sql_engine)\n    assert df.shape == (30, 4+1+2)\n    assert check_df_colmismatch_combine(df)\n\n    # todo: mysql import makes NaNs 0s\n    CombinerCSV(fname_list=create_files_csv_colmismatch, apply_after_read=apply).to_mysql_combine(uri, tblname, if_exists='replace')\n    df = pd.read_sql_table(tblname, sql_engine)\n    assert check_df_colmismatch_combine(df, convert_date=False)\n\n\ndef test_tosql_util(create_files_csv_colmismatch):\n    tblname = 'testd6tstack'\n\n    uri = 'postgresql+psycopg2://psqlusr:psqlpwdpsqlpwd@localhost/psqltest'\n    sql_engine = sqlalchemy.create_engine(uri)\n    dfc = CombinerCSV(fname_list=create_files_csv_colmismatch).to_pandas()\n\n    # psql\n    d6tstack.utils.pd_to_psql(dfc, uri, tblname, if_exists='replace')\n    df = pd.read_sql_table(tblname, sql_engine)\n    assert df.equals(dfc)\n\n    uri = 'mysql+mysqlconnector://testusr:testpwd@localhost/testdb'\n    sql_engine = sqlalchemy.create_engine(uri)\n    d6tstack.utils.pd_to_mysql(dfc, uri, tblname, if_exists='replace')\n    df = pd.read_sql_table(tblname, sql_engine)\n    assert df.equals(dfc)\n"
  },
  {
    "path": "tests/test_combine_old.py",
    "content": "\"\"\"Run unit tests.\n\nUse this to run tests and understand how tasks.py works.\n\nExample:\n\n    Create directories::\n\n        mkdir -p test-data/input\n        mkdir -p test-data/output\n\n    Run tests::\n\n        pytest test_combine.py -s\n\nNotes:\n\n    * this will create sample csv, xls and xlsx files    \n    * test_combine_() test the main combine function\n\n\"\"\"\n\nfrom d6tstack.combine_csv import *\nfrom d6tstack.sniffer import CSVSniffer\n\nimport pandas as pd\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nimport ntpath\n\nimport pytest\n\ncfg_fname_base_in = 'test-data/input/test-data-'\ncfg_fname_base_out_dir = 'test-data/output'\ncfg_fname_base_out = cfg_fname_base_out_dir+'/test-data-'\ncnxn_string = 'sqlite:///test-data/db/{}.db'\n\n#************************************************************\n# fixtures\n#************************************************************\nclass TestLogPusher(object):\n    def __init__(self, event):\n        pass\n        \n    def send_log(self, msg, status):\n        pass\n\n    def send(self, data):\n        pass\n\nlogger = TestLogPusher('combiner')\n\n# sample data\ndef create_files_df_clean():\n    # create sample data\n    df1=pd.DataFrame({'date':pd.date_range('1/1/2011', periods=10), 'sales': 100, 'cost':-80, 'profit':20})\n    df2=pd.DataFrame({'date':pd.date_range('2/1/2011', periods=10), 'sales': 200, 'cost':-90, 'profit':200-90})\n    df3=pd.DataFrame({'date':pd.date_range('3/1/2011', periods=10), 'sales': 300, 'cost':-100, 'profit':300-100})\n#    cfg_col = [ 'date', 'sales','cost','profit']\n    \n    # return df1[cfg_col], df2[cfg_col], df3[cfg_col]\n    return df1, df2, df3\n\ndef create_files_df_clean_combine():\n    df1,df2,df3 = create_files_df_clean()\n    df_all = pd.concat([df1,df2,df3])\n    df_all = df_all[df_all.columns].astype(str)\n    \n    return df_all\n\n\ndef create_files_df_clean_combine_with_filename(fname_list):\n    df1, df2, df3 = 
create_files_df_clean()\n    df1['filename'] = os.path.basename(fname_list[0])\n    df2['filename'] = os.path.basename(fname_list[1])\n    df3['filename'] = os.path.basename(fname_list[2])\n    df_all = pd.concat([df1, df2, df3])\n    df_all = df_all[df_all.columns].astype(str)\n\n    return df_all\n\n\ndef create_files_df_colmismatch_combine(cfg_col_common):\n    df1, df2, df3 = create_files_df_clean()\n    df3['profit2']=df3['profit']*2\n    if cfg_col_common:\n        df_all = pd.concat([df1, df2, df3], join='inner')\n    else:\n        df_all = pd.concat([df1, df2, df3])\n    df_all = df_all[df_all.columns].astype(str)\n\n    return df_all\n\n\ndef create_files_df_colmismatch_combine2(cfg_col_common):\n    df1, df2, df3 = create_files_df_clean()\n    for i in range(15):\n        df3['profit'+str(i)]=df3['profit']*2\n    if cfg_col_common:\n        df_all = pd.concat([df1, df2, df3], join='inner')\n    else:\n        df_all = pd.concat([df1, df2, df3])\n    df_all = df_all[df_all.columns].astype(str)\n\n    return df_all\n\n\n# csv standard\n@pytest.fixture(scope=\"module\")\ndef create_files_csv():\n\n    df1,df2,df3 = create_files_df_clean()\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-clean-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False)\n    df2.to_csv(cfg_fname % 'feb',index=False)\n    df3.to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_colmismatch():\n\n    df1,df2,df3 = create_files_df_clean()\n    df3['profit2']=df3['profit']*2\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-colmismatch-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False)\n    df2.to_csv(cfg_fname % 'feb',index=False)\n    df3.to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_colmismatch2():\n\n    df1,df2,df3 = 
create_files_df_clean()\n    for i in range(15):\n        df3['profit'+str(i)]=df3['profit']*2\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-colmismatch2-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False)\n    df2.to_csv(cfg_fname % 'feb',index=False)\n    df3.to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_colreorder():\n\n    df1,df2,df3 = create_files_df_clean()\n    cfg_col = [ 'date', 'sales','cost','profit']\n    cfg_col2 = [ 'date', 'sales','profit','cost']\n    \n    # return df1[cfg_col], df2[cfg_col], df3[cfg_col]\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-reorder-%s.csv'\n    df1[cfg_col].to_csv(cfg_fname % 'jan',index=False)\n    df2[cfg_col].to_csv(cfg_fname % 'feb',index=False)\n    df3[cfg_col2].to_csv(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_noheader():\n\n    df1,df2,df3 = create_files_df_clean()\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-noheader-csv-%s.csv'\n    df1.to_csv(cfg_fname % 'jan',index=False, header=False)\n    df2.to_csv(cfg_fname % 'feb',index=False, header=False)\n    df3.to_csv(cfg_fname % 'mar',index=False, header=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_col_renamed():\n\n    df1, df2, df3 = create_files_df_clean()\n    df3 = df3.rename(columns={'sales':'revenue'})\n    cfg_col = ['date', 'sales', 'profit', 'cost']\n    cfg_col2 = ['date', 'revenue', 'profit', 'cost']\n\n    cfg_fname = cfg_fname_base_in + 'input-csv-renamed-%s.csv'\n    df1[cfg_col].to_csv(cfg_fname % 'jan', index=False)\n    df2[cfg_col].to_csv(cfg_fname % 'feb', index=False)\n    df3[cfg_col2].to_csv(cfg_fname % 'mar', index=False)\n\n    return [cfg_fname % 'jan', cfg_fname 
% 'feb', cfg_fname % 'mar']\n\ndef test_create_files_csv_col_renamed(create_files_csv_col_renamed):\n    pass\n\ndef create_files_csv_dirty(cfg_sep=\",\", cfg_header=True):\n\n    df1,df2,df3 = create_files_df_clean()\n    df1.to_csv(cfg_fname_base_in+'debug.csv',index=False, sep=cfg_sep, header=cfg_header)\n\n    return cfg_fname_base_in+'debug.csv'\n\n# excel single-tab\ndef create_files_xls_single_helper(cfg_fname):\n    df1,df2,df3 = create_files_df_clean()\n    df1.to_excel(cfg_fname % 'jan',index=False)\n    df2.to_excel(cfg_fname % 'feb',index=False)\n    df3.to_excel(cfg_fname % 'mar',index=False)\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n\n@pytest.fixture(scope=\"module\")\ndef create_files_xls_single():\n    return create_files_xls_single_helper(cfg_fname_base_in+'input-xls-sing-%s.xls')\n\n@pytest.fixture(scope=\"module\")\ndef create_files_xlsx_single():\n    return create_files_xls_single_helper(cfg_fname_base_in+'input-xls-sing-%s.xlsx')\n\n\ndef write_file_xls(dfg, fname, startrow=0,startcol=0):\n    writer = pd.ExcelWriter(fname)\n    dfg.to_excel(writer, 'Sheet1', index=False,startrow=startrow,startcol=startcol)\n    dfg.to_excel(writer, 'Sheet2', index=False,startrow=startrow,startcol=startcol)\n    writer.save()\n\n# excel multi-tab\ndef create_files_xls_multiple_helper(cfg_fname):\n\n    df1,df2,df3 = create_files_df_clean()\n    write_file_xls(df1,cfg_fname % 'jan')\n    write_file_xls(df2,cfg_fname % 'feb')\n    write_file_xls(df3,cfg_fname % 'mar')\n\n    return [cfg_fname % 'jan',cfg_fname % 'feb',cfg_fname % 'mar']\n    \n@pytest.fixture(scope=\"module\")\ndef create_files_xls_multiple():\n    return create_files_xls_multiple_helper(cfg_fname_base_in+'input-xls-mult-%s.xls')\n\n@pytest.fixture(scope=\"module\")\ndef create_files_xlsx_multiple():\n    return create_files_xls_multiple_helper(cfg_fname_base_in+'input-xls-mult-%s.xlsx')\n    \n#************************************************************\n# tests - 
helpers\n#************************************************************\n\ndef test_file_extensions_get():\n    fname_list = ['a.csv','b.csv']\n    ext_list = file_extensions_get(fname_list)\n    assert ext_list==['.csv','.csv']\n    \n    fname_list = ['a.xls','b.xls']\n    ext_list = file_extensions_get(fname_list)\n    assert ext_list==['.xls','.xls']\n\ndef test_file_extensions_all_equal():\n    ext_list = ['.csv']*2\n    assert file_extensions_all_equal(ext_list)\n    ext_list = ['.xls']*2\n    assert file_extensions_all_equal(ext_list)\n    ext_list = ['.csv','.xls']\n    assert not file_extensions_all_equal(ext_list)\n    \ndef test_file_extensions_valid():\n    ext_list = ['.csv']*2\n    assert file_extensions_valid(ext_list)\n    ext_list = ['.xls']*2\n    assert file_extensions_valid(ext_list)\n    ext_list = ['.exe','.xls']\n    assert not file_extensions_valid(ext_list)\n\n#************************************************************\n#************************************************************\n# combine_csv\n#************************************************************\n#************************************************************\ndef test_csv_sniff_single(create_files_csv, create_files_csv_noheader):\n    sniff = CSVSniffer(create_files_csv[0])\n    sniff.get_delim()\n    assert sniff.delim == ','\n    assert sniff.count_skiprows() == 0\n    assert sniff.has_header()\n\n    fname = create_files_csv_dirty(\"|\")\n    sniff = CSVSniffer(fname)\n    sniff.get_delim()\n    assert sniff.delim == \"|\"\n    assert sniff.has_header()\n\n    df1,df2,df3 = create_files_df_clean()\n    assert sniff.nrows == df1.shape[0]+1\n\n    # no header test\n    sniff = CSVSniffer(create_files_csv_noheader[0])\n    sniff.get_delim()\n    assert sniff.delim == ','\n    assert sniff.count_skiprows() == 0\n    assert not sniff.has_header()\n\ndef test_csv_sniff_multi(create_files_csv, create_files_csv_noheader):\n    sniff = CSVSnifferList(create_files_csv)\n    assert 
sniff.get_delim() == ','\n    assert sniff.count_skiprows() == 0\n    assert sniff.has_header()\n\n    # no header test\n    sniff = CSVSnifferList(create_files_csv_noheader)\n    sniff.get_delim()\n    assert sniff.get_delim() == ','\n    assert sniff.count_skiprows() == 0\n    assert not sniff.has_header()\n\n\ndef test_CombinerCSV_columns(create_files_csv, create_files_csv_colmismatch, create_files_csv_colreorder):\n\n    with pytest.raises(ValueError) as e:\n        c = CombinerCSV([])\n\n    fname_list = create_files_csv\n    combiner = CombinerCSV(fname_list=fname_list, all_strings=True)\n    col_preview = combiner.preview_columns()\n    # todo: cache the preview dfs somehow? reading the same in next step\n    \n    assert col_preview['is_all_equal']\n    assert col_preview['columns_all']==col_preview['columns_common']\n    assert col_preview['columns_all']==['cost', 'date', 'profit', 'sales']\n\n    fname_list = create_files_csv_colmismatch\n    combiner = CombinerCSV(fname_list=fname_list, all_strings=True)\n    col_preview = combiner.preview_columns()\n    # todo: cache the preview dfs somehow? 
reading the same in next step\n    \n    assert not col_preview['is_all_equal']\n    assert not col_preview['columns_all']==col_preview['columns_common']\n    assert col_preview['columns_all']==['cost', 'date', 'profit', 'profit2', 'sales']\n    assert col_preview['columns_common']==['cost', 'date', 'profit', 'sales']\n    assert col_preview['columns_unique']==['profit2']\n\n    fname_list = create_files_csv_colreorder\n    combiner = CombinerCSV(fname_list=fname_list, all_strings=True)\n    col_preview = combiner.preview_columns()\n\n    assert not col_preview['is_all_equal']\n    assert col_preview['columns_all']==col_preview['columns_common']\n\n\ndef test_CombinerCSV_combine(create_files_csv, create_files_csv_colmismatch, create_files_csv_colreorder):\n\n    # all columns present\n    fname_list = create_files_csv\n    combiner = CombinerCSV(fname_list=fname_list, all_strings=True)\n    df = combiner.combine()\n\n    df = df.sort_values('date').drop(['filename'],axis=1)\n    df_chk = create_files_df_clean_combine()\n    assert df.equals(df_chk)\n\n    df = combiner.combine()\n    df = df.groupby('filename').head(combiner.nrows_preview)\n    df_chk = combiner.preview_combine()\n    assert df.equals(df_chk)\n\n    # columns mismatch, all columns\n    fname_list = create_files_csv_colmismatch\n    combiner = CombinerCSV(fname_list=fname_list, all_strings=True, add_filename=True)\n    df = combiner.combine()\n    df = df.sort_values('date').drop(['filename'],axis=1)\n    df_chk = create_files_df_colmismatch_combine(cfg_col_common=False)\n    assert df.shape[1] == df_chk.shape[1]\n\n    # columns mismatch, common columns\n    df = combiner.combine(is_col_common=True)\n    df = df.sort_values('date').drop(['filename'], axis=1)\n    df_chk = create_files_df_colmismatch_combine(cfg_col_common=True)\n    assert df.shape[1] == df_chk.shape[1]\n\n    # Filename column True\n    fname_list = create_files_csv\n    combiner = CombinerCSV(fname_list=fname_list, 
all_strings=True)\n    df = combiner.combine()\n\n    df = df.sort_values('date')\n    df_chk = create_files_df_clean_combine_with_filename(fname_list)\n    assert df.equals(df_chk)\n\n    # Filename column False\n    combiner = CombinerCSV(fname_list=fname_list, all_strings=True, add_filename=False)\n    df = combiner.combine()\n    df = df.sort_values('date')\n    df_chk = create_files_df_clean_combine()\n    assert df.equals(df_chk)\n\n\ndef test_CombinerCSV_combine_advanced(create_files_csv):\n\n    # Check if rename worked correctly.\n    fname_list = create_files_csv\n    combiner = CombinerCSV(fname_list=fname_list, all_strings=True)\n    adv_combiner = CombinerCSV(fname_list=fname_list, all_strings=True,\n                               columns_select=None, columns_rename={'date':'date1'})\n\n    df = adv_combiner.combine()\n    assert 'date1' in df.columns.values\n    assert 'date' not in df.columns.values\n\n    df = adv_combiner.preview_combine()\n    assert 'date1' in df.columns.values\n    assert 'date' not in df.columns.values\n\n    adv_combiner = CombinerCSV(fname_list=fname_list, all_strings=True,\n                               columns_select=['cost', 'date', 'profit', 'profit2', 'sales'])\n\n    df = adv_combiner.combine()\n    assert 'profit2' in df.columns.values\n    assert df['profit2'].isnull().all()\n\n    df = adv_combiner.preview_combine()\n    assert 'profit2' in df.columns.values\n    assert df['profit2'].isnull().all()\n\n\ndef test_preview_dict():\n    df = pd.DataFrame({'col1':[0,1],'col2':[0,1]})\n    assert preview_dict(df) == {'columns': ['col1', 'col2'], 'rows': {0: [[0]], 1: [[1]]}}\n\n\n#************************************************************\n# tests - CombinerCSV rename and select columns\n#************************************************************\ndef create_df_rename():\n    df11 = pd.DataFrame({'a':range(10)})\n    df12 = pd.DataFrame({'b': range(10)})\n    df21 = pd.DataFrame({'a':range(10),'c': range(10)})\n    
df22 = pd.DataFrame({'b': range(10),'c': range(10)})\n\n    return df11, df12, df21, df22\n\n# csv standard\n@pytest.fixture(scope=\"module\")\ndef create_files_csv_rename():\n\n    df11, df12, df21, df22 = create_df_rename()\n    # save files\n    cfg_fname = cfg_fname_base_in+'input-csv-rename-%s.csv'\n    df11.to_csv(cfg_fname % '11',index=False)\n    df12.to_csv(cfg_fname % '12',index=False)\n    df21.to_csv(cfg_fname % '21',index=False)\n    df22.to_csv(cfg_fname % '22',index=False)\n\n    return [cfg_fname % '11',cfg_fname % '12',cfg_fname % '21',cfg_fname % '22']\n\ndef test_create_files_csv_rename(create_files_csv_rename):\n    pass\n\n@pytest.fixture(scope=\"module\")\ndef create_out_files_csv_align_save():\n    cfg_outname = cfg_fname_base_out + 'input-csv-rename-%s-align-save.csv'\n    return [cfg_outname % '11', cfg_outname % '12',cfg_outname % '21',cfg_outname % '22']\n\n\n@pytest.fixture(scope=\"module\")\ndef create_out_files_parquet_align_save():\n    cfg_outname = cfg_fname_base_out + 'input-csv-rename-%s-align-save.parquet'\n    return [cfg_outname % '11', cfg_outname % '12',cfg_outname % '21',cfg_outname % '22']\n\n\ndef test_apply_select_rename():\n    df11, df12, df21, df22 = create_df_rename()\n\n    # rename 1, select all\n    assert df11.equals(apply_select_rename(df12.copy(),[],{'b':'a'}))\n\n    # rename and select 1\n    assert df11.equals(apply_select_rename(df22.copy(),['b'],{'b':'a'}))\n    assert df11.equals(apply_select_rename(df22.copy(),['a'],{'b':'a'}))\n\n    # rename and select 2\n    assert df21[list(dict.fromkeys(df21.columns))].equals(apply_select_rename(df22.copy(),['b','c'],{'b':'a'}))\n    assert df21[list(dict.fromkeys(df21.columns))].equals(apply_select_rename(df22.copy(),['a','c'],{'b':'a'}))\n\n    with pytest.warns(UserWarning, match=\"Renaming conflict\"):\n        assert df22.equals(apply_select_rename(df22.copy(), ['b', 'c'], {'c': 'b'}))\n\n\ndef test_CombinerCSV_rename(create_files_csv_rename):\n    df11, df12, 
df21, df22 = create_df_rename()\n    df_chk1 = pd.concat([df11,df11])\n    df_chk2 = pd.concat([df11,df21])\n\n    def helper(fnames, cfg_col_sel,cfg_col_rename, df_chk, chk_filename=False, is_filename_col=True):\n        if cfg_col_sel and cfg_col_rename:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col,\n                             columns_select=cfg_col_sel, columns_rename=cfg_col_rename)\n        elif cfg_col_rename:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col, columns_rename=cfg_col_rename)\n        else:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col)\n\n        dfc = c2.combine()\n        if (not chk_filename) and is_filename_col:\n            dfc = dfc.drop(['filename'], axis=1)\n        assert dfc.equals(df_chk)\n\n        if cfg_col_sel:\n            fname_out = cfg_fname_base_out_dir + '/test_save.csv'\n            c2.combine_save(fname_out)\n            dfc = pd.read_csv(fname_out)\n            if (not chk_filename) and is_filename_col:\n                dfc = dfc.drop(['filename'], axis=1)\n            assert dfc.equals(df_chk.reset_index(drop=True))\n\n    # rename 1, select all\n    l = create_files_csv_rename[:2]\n    helper(l,None,{'b':'a'},df_chk1)\n\n    with pytest.raises(ValueError) as e:\n        c2 = CombinerCSV(l, columns_select=['a','a'])\n\n    # rename 1, select some\n    l = [create_files_csv_rename[0],create_files_csv_rename[-1]]\n    helper(l,['a'],{'b':'a'},df_chk1)\n    helper(l,['b'],{'b':'a'},df_chk1)\n    helper(l,None,{'b':'a'},df_chk2)\n\n    l = [create_files_csv_rename[1],create_files_csv_rename[-1]]\n    helper(l,['a'],{'b':'a'},df_chk1)\n    helper(l,['b'],{'b':'a'},df_chk1)\n    helper(l,None,{'b':'a'},df_chk2)\n\n    with pytest.warns(UserWarning, match=\"Renaming conflict\"):\n        c2 = CombinerCSV(l, columns_rename={'b': 'a', 'c': 'a'})\n        c2.combine()\n\n    # rename none, select all\n    l = [create_files_csv_rename[0],create_files_csv_rename[2]]\n    
helper(l,None,None,df_chk2)\n\n    # filename col True\n    df31 = df11.copy()\n    df32 = df21.copy()\n    df31['filename'] = os.path.basename(l[0])\n    df32['filename'] = os.path.basename(l[1])\n    df_chk3 = pd.concat([df31, df32])\n    helper(l, None, None, df_chk3, is_filename_col=True, chk_filename=True)\n    helper(l, None, None, df_chk2, is_filename_col=False, chk_filename=True)\n\n\ndef test_CombinerCSV_align_save_advanced(create_files_csv_rename, create_out_files_csv_align_save):\n    df11, df12, df21, df22 = create_df_rename()\n\n    def helper(fnames, cfg_col_sel, cfg_col_rename, new_fnames, df_chks, is_filename_col=False):\n        if cfg_col_sel and cfg_col_rename:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col,\n                             columns_select=cfg_col_sel, columns_rename=cfg_col_rename)\n        elif cfg_col_sel:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col, columns_select=cfg_col_sel)\n        elif cfg_col_rename:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col, columns_rename=cfg_col_rename)\n        else:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col)\n\n        c2.align_save(output_dir=cfg_fname_base_out_dir, prefix=\"-align-save\")\n        for fname_out, df_chk in zip(new_fnames, df_chks):\n            dfc = pd.read_csv(fname_out)\n            assert dfc.equals(df_chk)\n\n    # rename 1, select all\n    l = create_files_csv_rename[:2]\n    outl = create_out_files_csv_align_save[:2]\n    helper(l, ['a'], {'b':'a'}, outl, [df11, df11])\n\n    # rename 1, select some\n    l = [create_files_csv_rename[2]]\n    outl = [create_out_files_csv_align_save[2]]\n    helper(l, ['a'], {'b':'a'}, outl, [df11])\n\n    # rename none, select 1\n    l = [create_files_csv_rename[2]]\n    outl = [create_out_files_csv_align_save[2]]\n    helper(l, ['a'], None, outl, [df11])\n\n    # rename none, select all\n    l = [create_files_csv_rename[2]]\n    outl = 
[create_out_files_csv_align_save[2]]\n    helper(l, ['a', 'c'], None, outl, [df21])\n\n    # rename none, select all, filename col true\n    df21['filename'] = os.path.basename(outl[0])\n    helper(l, ['a', 'c'], None, outl, [df21], is_filename_col=True)\n\n\ndef test_CombinerCSV_sql_advanced(create_files_csv_rename):\n    df11, df12, df21, df22 = create_df_rename()\n\n    def helper(fnames, cfg_col_sel, cfg_col_rename, df_chks, is_filename_col=False, stream=False):\n        if cfg_col_sel and cfg_col_rename:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col,\n                             columns_select=cfg_col_sel, columns_rename=cfg_col_rename)\n        elif cfg_col_sel:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col, columns_select=cfg_col_sel)\n        elif cfg_col_rename:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col, columns_rename=cfg_col_rename)\n        else:\n            c2 = CombinerCSV(fnames, add_filename=is_filename_col)\n        df_chk = pd.concat(df_chks)  # DataFrame.append is deprecated, build expected frame with concat\n        table_name = 'test'\n        db_cnxn_string = cnxn_string.format('test-combined-adv')\n        if stream:\n            c2.to_sql_stream(db_cnxn_string, table_name)\n        else:\n            c2.to_sql(db_cnxn_string, table_name)\n        dfc = pd.read_sql(\"select * from test\", db_cnxn_string)\n        dfc = dfc.set_index('index')\n        dfc.index.name = None\n        pd.testing.assert_frame_equal(dfc, df_chk)\n        assert dfc.equals(df_chk)\n\n    # rename 1, select all\n    l = create_files_csv_rename[:2]\n    helper(l, ['a'], {'b': 'a'}, [df11, df11])\n\n    # test sql stream\n    helper(l, ['a'], {'b': 'a'}, [df11, df11], stream=True)\n\n    # rename 1, select some\n    l = [create_files_csv_rename[2]]\n    helper(l, ['a'], {'b': 'a'}, [df11])\n\n    # rename none, select 1\n    l = [create_files_csv_rename[2]]\n    helper(l, ['a'], None, [df11])\n\n    # 
rename none, select all\n    l = [create_files_csv_rename[2]]\n    helper(l, ['a', 'c'], None, [df21])\n\n    # rename none, select all, filename col true\n    df21['filename'] = os.path.basename(l[0])\n    helper(l, ['a', 'c'], None, [df21], is_filename_col=True)\n\n\ndef test_CombinerCSV_sql(create_files_csv, create_files_csv_colmismatch):\n\n    def helper(fnames, is_col_common=False, is_filename_col=False, stream=False):\n        combiner = CombinerCSV(fname_list=fnames, all_strings=True, add_filename=is_filename_col)\n        table_name = 'test'\n        db_cnxn_string = cnxn_string.format('test-combined-adv')\n        if stream:\n            combiner.to_sql_stream(db_cnxn_string, table_name, is_col_common=is_col_common)\n        else:\n            combiner.to_sql(db_cnxn_string, table_name, is_col_common=is_col_common)\n        df = pd.read_sql(\"select * from test\", db_cnxn_string)\n        df = df.set_index('index')\n        df.index.name = None\n        return df\n\n    # all columns present, to_sql\n    fname_list = create_files_csv\n    df_chk = create_files_df_clean_combine()\n    assert df_chk.equals(helper(fname_list))\n\n    # to sql stream\n    assert df_chk.equals(helper(fname_list, stream=True))\n\n    # columns mismatch, common columns, to_sql\n    fname_list = create_files_csv_colmismatch\n    df_chk = create_files_df_colmismatch_combine(cfg_col_common=True)\n    assert helper(fname_list, is_col_common=True).shape[1] == df_chk.shape[1]\n\n\ndef test_combinercsv_to_csv(create_files_csv_rename, create_out_files_csv_align_save):\n    fnames = create_files_csv_rename\n    df11, df12, df21, df22 = create_df_rename()\n\n    # error when separate files is False and no out_filename\n    with pytest.raises(ValueError):\n        c = CombinerCSV(fnames)\n        c.to_csv(separate_files=False)\n\n    # to_csv will call align_save\n    fnames = create_files_csv_rename[:2]\n    new_names = create_out_files_csv_align_save[:2]\n    c2 = CombinerCSV(fnames, columns_select=['a'],\n         
            columns_rename={'b': 'a'}, add_filename=False)\n    c2.to_csv(output_dir=cfg_fname_base_out_dir, prefix=\"-align-save\")\n    df_chks = [df11, df11]\n    for fname_out, df_chk in zip(new_names, df_chks):\n        dfc = pd.read_csv(fname_out)\n        assert dfc.equals(df_chk)\n\n    # to_csv will call combine_save\n    df_chk = pd.concat([df11, df11])\n    fnames = [create_files_csv_rename[0], create_files_csv_rename[-1]]\n\n    c3 = CombinerCSV(fnames, columns_select=['a'], columns_rename={'b': 'a'})\n\n    dfc = c3.combine()\n    dfc = dfc.drop(['filename'], axis=1)\n    assert dfc.equals(df_chk)\n\n    fname_out = cfg_fname_base_out_dir + '/test_save.csv'\n    with pytest.warns(UserWarning, match=\"File already exists\"):\n        c3.to_csv(out_filename=fname_out, separate_files=False, streaming=True, overwrite=False)\n    c3.to_csv(out_filename=fname_out, separate_files=False, streaming=True)\n    dfc = pd.read_csv(fname_out)\n    dfc = dfc.drop(['filename'], axis=1)\n    assert dfc.equals(df_chk.reset_index(drop=True))\n\n\ndef test_combinercsv_to_parquet(create_files_csv_rename, create_out_files_parquet_align_save):\n    fnames = create_files_csv_rename\n    df11, df12, df21, df22 = create_df_rename()\n\n    # error when separate files is False and no out_filename\n    with pytest.raises(ValueError):\n        c = CombinerCSV(fnames)\n        c.to_parquet(separate_files=False)\n\n    # to_parquet will call align_save\n    fnames = create_files_csv_rename[:2]\n    new_names = create_out_files_parquet_align_save[:2]\n    c2 = CombinerCSV(fnames, columns_select=['a'],\n                     columns_rename={'b': 'a'}, add_filename=False)\n    c2.to_parquet(output_dir=cfg_fname_base_out_dir, prefix=\"-align-save\")\n    df_chks = [df11, df11]\n    for fname_out, df_chk in zip(new_names, df_chks):\n        table = pq.read_table(fname_out)\n        dfc = table.to_pandas()\n        assert dfc.equals(df_chk)\n\n    # to_parquet will call combine_save\n    df_chk = 
pd.concat([df11, df11])\n    fnames = [create_files_csv_rename[0], create_files_csv_rename[-1]]\n\n    c3 = CombinerCSV(fnames, columns_select=['a'], columns_rename={'b': 'a'})\n\n    fname_out = cfg_fname_base_out_dir + '/test_save.parquet'\n    c3.to_parquet(out_filename=fname_out, separate_files=False, streaming=True)\n    table = pq.read_table(fname_out)\n    dfc = table.to_pandas()\n    dfc = dfc.drop(['filename'], axis=1)\n    assert dfc.equals(df_chk)\n"
  },
  {
    "path": "tests/test_sync.py",
    "content": "from moto import mock_s3\nimport shutil\nimport os\nimport pytest\nfrom d6tstack.sync import FTPSync\n\ncfg_ftp_host = 'ftp.fic.com.tw'\ncfg_ftp_usr = 'anonymous'\ncfg_ftp_pwd = 'random'\ncfg_ftp_dir_base = '/photo/ia/'\nlocal_dir = '/tmp/new_data/'\nftp_file_path = 'aquapad/AquaPad.jpg'\n\n\ndef _remove_local_dir(folder):\n    if os.path.exists(folder):\n        shutil.rmtree(folder)\n\n\n@pytest.mark.slow\ndef test_sync_local():\n    _remove_local_dir(local_dir)\n    ftpsync = FTPSync(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd, cfg_ftp_dir_base,\n                      local_dir=local_dir, logger=None)\n\n    # check local dir is created\n    assert os.path.exists(local_dir)\n\n    # Local files should be empty\n    assert ftpsync.get_all_files().tolist() == []\n\n    # FTP files should be there\n    assert ftpsync.get_all_files(ftp=True).tolist() == [ftp_file_path]\n\n    # Get files for sync local\n    assert ftpsync.get_files_for_sync() == ({ftp_file_path}, 278683)\n\n    # Upload ftp files to local (subdirs false)\n    ftpsync.upload_ftp_files(subdirs=False)\n    assert ftpsync.get_all_files().tolist() == []\n\n    # Upload ftp files to local (subdirs true)\n    ftpsync.upload_ftp_files()\n    assert ftpsync.get_all_files().tolist() == [ftp_file_path]\n\n\n@pytest.mark.slow\n@mock_s3\ndef test_sync_s3():\n    ftpsync = FTPSync(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd, cfg_ftp_dir_base,\n                      local_dir=local_dir, logger=None)\n\n    # s3 files error for no connection details\n    with pytest.raises(ValueError) as e:\n        ftpsync.get_s3_files()\n\n    ftpsync = FTPSync(cfg_ftp_host, cfg_ftp_usr, cfg_ftp_pwd, cfg_ftp_dir_base,\n                      cfg_s3_key=\"test\", cfg_s3_secret=\"test\", bucket_name=\"test\",\n                      local_dir=local_dir, logger=None)\n\n    assert ftpsync.get_s3_files() == set()\n\n    file_key = \"test/hp.jpg\"\n    with pytest.raises(FileNotFoundError) as e:\n        
ftpsync.upload_to_s3(file_key, \"test-data/syncasdsadf/hp.jpg\")\n\n    ftpsync.upload_to_s3(file_key, \"test-data/sync/hp.jpg\")\n    assert ftpsync.get_s3_files() == {file_key}\n\n    assert ftpsync.get_files_for_sync(to_s3=True) == ({ftp_file_path}, 278683)\n\n    ftpsync.upload_ftp_files(to_s3=True)\n    assert ftpsync.get_s3_files() == {ftp_file_path, file_key}\n"
  },
  {
    "path": "tests/test_xls.py",
    "content": "import pytest\nimport shutil\nfrom pathlib import Path\n\nimport numpy as np\nimport pandas as pd\nimport xlrd\n\nfrom d6tstack.convert_xls import *\n\nfrom tests.test_combine import cfg_fname_base_out_dir\nfrom tests.test_combine import create_files_xls_single, create_files_xlsx_single, create_files_xls_multiple, create_files_xlsx_multiple\nfrom tests.test_combine import write_file_xls\n\ncfg_fname_test_base = 'test.xlsx'\n\n\n# ************************************************************\n# XLSSniffer\n# ************************************************************\ndef test_xls_scan_sheets_single(create_files_xls_single, create_files_xlsx_single):\n    def helper(fnames):\n        xlsSniffer = XLSSniffer(fnames)\n        sheets = xlsSniffer.dict_xls_sheets\n        assert np.all([file['sheets_names'] == ['Sheet1'] for file in sheets.values()])\n        assert np.all([file['sheets_count'] == 1 for file in sheets.values()])\n        assert xlsSniffer.all_same_count()\n        assert xlsSniffer.all_same_names()\n        assert xlsSniffer.all_contain_sheetname('Sheet1')\n        assert xlsSniffer.all_have_idx(0)\n        assert not xlsSniffer.all_have_idx(1)\n\n    with pytest.raises(ValueError) as e:\n        x = XLSSniffer([])\n\n    helper(create_files_xls_single)\n    helper(create_files_xlsx_single)\n\n\ndef test_xls_scan_sheets_multipe(create_files_xls_multiple, create_files_xlsx_multiple):\n    def helper(fnames):\n        xlsSniffer = XLSSniffer(fnames)\n        sheets = xlsSniffer.dict_xls_sheets\n        assert np.all([file['sheets_names'] == ['Sheet1', 'Sheet2'] for file in sheets.values()])\n        assert np.all([file['sheets_count'] == 2 for file in sheets.values()])\n\n    helper(create_files_xls_multiple)\n    helper(create_files_xlsx_multiple)\n\n\n#************************************************************\n# read_excel_advanced\n#************************************************************\ncfg_fname_dir_xls = 
'test-data/excel_adv_data/'\n\ndef test_read_excel_adv():\n    # actual file\n    fname = cfg_fname_dir_xls + 'read_excel_adv - sample1.xlsx'\n    df = read_excel_advanced(fname, header_xls_start=\"A8\", header_xls_end=\"O9\")\n    assert 'Balance' in df.columns\n    assert 'Billing Type' in df.columns\n\n    fname = cfg_fname_dir_xls + 'read_excel_adv - sample3.xlsx'\n    df = read_excel_advanced(fname, header_xls_start=\"A10\", header_xls_end=\"G10\")\n    assert 'Product Code' in df.columns\n\n    # synthetic data\n    dfc = pd.DataFrame({'a':range(10),'b':range(10)})\n    fname = cfg_fname_dir_xls + cfg_fname_test_base\n    dfc.to_excel(fname,startrow=1,startcol=1,index=False)\n\n    # basic\n    dfr = read_excel_advanced(fname, header_xls_start=\"B2\", header_xls_end=\"C2\")\n    assert dfr.equals(dfc)\n    dfr = read_excel_advanced(fname, header_xls_range=\"B2:C2\")\n    assert dfr.equals(dfc)\n\n    # empty rows/columns\n    def helper(dfc,dfc2):\n        dfc2.to_excel(fname,startrow=1,startcol=1,index=False)\n        dfr = read_excel_advanced(fname, header_xls_range=\"B2:D2\")\n        assert dfr.astype(int).reset_index(drop=True).equals(dfc)\n        dfr = read_excel_advanced(fname, header_xls_range=\"B2:D2\", remove_blank_cols=False, remove_blank_rows=False)\n        assert dfr.equals(dfc2)\n\n    helper(dfc, dfc.reindex(['a', 'c', 'b'], axis=1))\n    helper(dfc, dfc.reindex(range(-1,10)).reset_index(drop=True))\n\n    # collapse header\n    # todo: complete\n\n\n#************************************************************\n# XLStoBase\n#************************************************************\n\ndef test_XLStoBase():\n    cfg_output_dir = 'testout'\n    cfg_fname_base_out_dir2 = os.path.join(cfg_fname_base_out_dir,cfg_output_dir)\n    cfg_fname_test1 = 
os.path.join(cfg_fname_base_out_dir,cfg_fname_test_base)\n    cfg_fname_test2 = os.path.join(cfg_fname_base_out_dir2,cfg_fname_test_base)\n\n    with pytest.raises(ValueError) as e:\n        x = XLStoBase(if_exists='invalid')\n\n    # output_dir\n    if os.path.exists(cfg_fname_base_out_dir2):\n        shutil.rmtree(cfg_fname_base_out_dir2)\n    assert not os.path.exists(cfg_fname_base_out_dir2)\n\n    x = XLStoBase(output_dir=cfg_fname_base_out_dir2)\n    assert os.path.exists(cfg_fname_base_out_dir2)\n\n    fname_out, is_skip = x._get_output_filename(cfg_fname_test1)\n    assert Path(fname_out).parts[-2] == cfg_output_dir\n    assert not is_skip\n\n    # if_exists\n    dfc = pd.DataFrame({'a':range(10),'b':range(10)})\n    dfc.to_excel(cfg_fname_test2,index=False)\n    fname_out, is_skip = x._get_output_filename(cfg_fname_test1)\n    assert is_skip\n\n    x = XLStoBase(output_dir=cfg_fname_base_out_dir2, if_exists='skip')\n    fname_out, is_skip = x._get_output_filename(cfg_fname_test1)\n    assert is_skip\n\n    x = XLStoBase(output_dir=cfg_fname_base_out_dir2,if_exists='replace')\n    fname_out, is_skip = x._get_output_filename(cfg_fname_test1)\n    assert not is_skip\n\n    # convert_single\n    def helper(sheet_name):\n        fname_out = x.convert_single(cfg_fname_test2,sheet_name)\n        assert sheet_name in fname_out and fname_out[-4:]=='.csv'\n        dfr = pd.read_csv(fname_out)\n        assert dfr.equals(dfc)\n\n    helper('Sheet1')\n    write_file_xls(dfc, cfg_fname_test2)\n    helper('Sheet2')\n\n    # convert advanced\n    dfc.to_excel(cfg_fname_test2,startrow=1,startcol=1,index=False)\n    fname_out = x.convert_single(cfg_fname_test2, 'Sheet1', header_xls_range=\"B2:C2\")\n    dfr = pd.read_csv(fname_out)\n    assert dfr.equals(dfc)\n\n#************************************************************\n# XLStoCSVMultiFile\n#************************************************************\ndef 
test_XLStoCSVMultiFile(create_files_xls_single,create_files_xlsx_single):\n\n    # global mode\n    def helper1(flist,select_mode,select_val, if_exists='replace'):\n        x = XLStoCSVMultiFile(flist,output_dir=cfg_fname_base_out_dir,cfg_xls_sheets_sel_mode=select_mode,\n                              cfg_xls_sheets_sel=select_val,if_exists=if_exists)\n        fnames_out = x.convert_all()\n        fnames_out_chk = [x._get_output_filename(fname+'-'+str(select_val)+'.csv')[0] for fname in flist]\n        assert fnames_out==fnames_out_chk\n        assert all([os.path.exists(fname) for fname in fnames_out_chk])\n\n    helper1(create_files_xlsx_single,'name_global','Sheet1')\n    helper1(create_files_xls_single,'name_global','Sheet1')\n    helper1(create_files_xlsx_single,'idx_global',0)\n    # if_exists - skip\n    with pytest.warns(UserWarning):\n        helper1(create_files_xlsx_single, 'idx_global', 0, if_exists='skip')\n\n    # Without output dir\n    flist = create_files_xlsx_single\n    x = XLStoCSVMultiFile(flist, cfg_xls_sheets_sel_mode='idx_global',\n                          cfg_xls_sheets_sel=0, if_exists='replace')\n    fnames_out = x.convert_all()\n    # same directory\n    fnames_out_chk = [fname + '-' + str(0) + '.csv' for fname in flist]\n    assert fnames_out == fnames_out_chk\n    assert all([os.path.exists(fname) for fname in fnames_out_chk])\n\n    # by file mode\n    def helper2(flist,select_mode,select_val_list, if_exists='replace'):\n        x = XLStoCSVMultiFile(flist,output_dir=cfg_fname_base_out_dir,cfg_xls_sheets_sel_mode=select_mode,\n                              cfg_xls_sheets_sel=select_val_list,if_exists=if_exists)\n        fnames_out = x.convert_all()\n        fnames_out_chk = [x._get_output_filename(fname+'-'+str(select_val_list[fname])+'.csv')[0] for fname in flist]\n        assert fnames_out==fnames_out_chk\n        assert all([os.path.exists(fname) for fname in fnames_out_chk])\n\n    
helper2(create_files_xlsx_single,'idx',dict(zip(create_files_xlsx_single,[0]*len(create_files_xlsx_single))))\n    helper2(create_files_xlsx_single,'name',dict(zip(create_files_xlsx_single,['Sheet1']*len(create_files_xlsx_single))))\n    # if exists - skip\n    with pytest.warns(UserWarning):\n        helper2(create_files_xlsx_single,'name',\n                dict(zip(create_files_xlsx_single,['Sheet1']*len(create_files_xlsx_single))),\n                if_exists='skip')\n\n    # global advanced\n    dfc = pd.DataFrame({'a':range(10),'b':range(10)})\n    fname = cfg_fname_dir_xls + cfg_fname_test_base\n    dfc.to_excel(fname,startrow=1,startcol=1,index=False)\n    x = XLStoCSVMultiFile([fname],output_dir=cfg_fname_base_out_dir,cfg_xls_sheets_sel_mode='name_global',\n                          cfg_xls_sheets_sel='Sheet1',if_exists='replace')\n    fnames_out = x.convert_all(header_xls_range=\"B2:C2\")\n    dfr = pd.read_csv(fnames_out[0])\n    assert dfr.equals(dfc)\n\n    with pytest.raises(ValueError) as e:\n        x = XLStoCSVMultiFile([], 'name', {})\n\n\n#************************************************************\n# XLStoCSVMultiSheet\n#************************************************************\ndef test_XLStoCSVMultiSheet(create_files_xlsx_multiple):\n    x = XLStoCSVMultiSheet(create_files_xlsx_multiple[0],output_dir=cfg_fname_base_out_dir,if_exists='replace')\n\n    fname_out = x.convert_single('Sheet1')\n    assert 'Sheet1' in fname_out\n    fname_out = x.convert_single('Sheet2')\n    assert 'Sheet2' in fname_out\n    # path should be output dir given\n    assert os.path.dirname(fname_out) == cfg_fname_base_out_dir\n\n    with pytest.raises(xlrd.XLRDError) as e:\n        x.convert_single('Sheet3')\n\n    dfc = pd.DataFrame({'a':range(10),'b':range(10)})\n    fname = cfg_fname_dir_xls + cfg_fname_test_base\n    write_file_xls(dfc, fname, startrow=1, startcol=1)\n\n    x = XLStoCSVMultiSheet(fname, output_dir=cfg_fname_base_out_dir, if_exists='skip')\n    
with pytest.warns(UserWarning):\n        fname_out = x.convert_single('Sheet1', header_xls_range=\"B2:C2\")\n\n    x = XLStoCSVMultiSheet(fname,output_dir=cfg_fname_base_out_dir,if_exists='replace')\n    fname_out = x.convert_single('Sheet1',header_xls_range=\"B2:C2\")\n    assert 'Sheet1' in fname_out\n    dfr = pd.read_csv(fname_out)\n    assert dfr.equals(dfc)\n\n    fnames_out = x.convert_all(header_xls_range=\"B2:C2\")\n    assert len(fnames_out)==2\n    assert 'Sheet1' in fnames_out[0]\n    assert 'Sheet2' in fnames_out[1]\n    dfr = pd.read_csv(fnames_out[0])\n    assert dfr.equals(dfc)\n    dfr = pd.read_csv(fnames_out[1])\n    assert dfr.equals(dfc)\n\n    x = XLStoCSVMultiSheet(fname,sheet_names=['Sheet1'],output_dir=cfg_fname_base_out_dir,\n                           if_exists='replace')\n    fnames_out = x.convert_all(header_xls_range=\"B2:C2\")\n    assert len(fnames_out)==1\n    assert 'Sheet1' in fnames_out[0]\n\n    # Single sheet Without output dir\n    x = XLStoCSVMultiSheet(create_files_xlsx_multiple[0], if_exists='replace')\n    fname_out = x.convert_single('Sheet1')\n    # same directory\n    fname_out_chk = create_files_xlsx_multiple[0]\n    assert os.path.dirname(fname_out) == os.path.dirname(fname_out_chk)\n\n    # Single sheet Without output dir\n    x = XLStoCSVMultiSheet(fname, if_exists='replace')\n    fnames_out = x.convert_all(header_xls_range=\"B2:C2\")\n    # same directory\n    for fname_out in fnames_out:\n        assert os.path.dirname(fname_out) == os.path.dirname(fname)\n"
  },
  {
    "path": "tests/tmp-reindex-withorder.py",
    "content": "#!/usr/bin/env python3\n# -*- coding: utf-8 -*-\n\"\"\"\nCreated on Wed Oct  3 11:36:43 2018\n\n@author: deepmind\n\"\"\"\n\nimport pandas as pd\nfrom d6tstack.helpers import *\nimport ntpath\nimport numpy as np\n\ndf1=pd.DataFrame({'b':range(10),'a':range(10)})\ndf2=pd.DataFrame({'b':range(10),'a':range(10),'c':range(10),})\n\ndfl_all_col = [df.columns.tolist() for df in [df1,df2]]\ncol_files = dict(zip(['1','2'], dfl_all_col))\ncol_common = list_common(list(col_files.values()))\ncol_all = list_unique(list(col_files.values()))\ncol_unique = list(set(col_all) - set(col_common))\n\nset(df1.columns.tolist())\n\n\ndf1.reindex(columns=df1.columns[df1.columns.isin(col_common)])\ndf2.reindex(columns=df1.columns[df1.columns.isin(col_common)])\ndf1.reindex(columns=df1.columns[df1.columns.isin(col_common)].tolist()+col_unique, copy=False)\ndf2.reindex(columns=df1.columns[df1.columns.isin(col_common)].tolist()+col_unique, copy=False)\ndf2\n\npd.concat([df1,df2],ignore_index=True,sort=False)\n\n\n# find index in column list so can check order is correct\ndf_col_present = {}\nfor iFileName, iFileCol in col_files.items():\n    df_col_present[iFileName] = [ntpath.basename(iFileName), ] + [iCol in iFileCol for iCol in col_all]\n\ndf_col_present = pd.DataFrame(df_col_present, index=['filename'] + col_all).T\ndf_col_present.index.names = ['file_path']\n\n# find index in column list so can check order is correct\ndf_col_order = {}\nfor iFileName, iFileCol in col_files.items():\n    df_col_order[iFileName] = [ntpath.basename(iFileName), ] + [\n        iFileCol.index(iCol) if iCol in iFileCol else np.nan for iCol in col_all]\ndf_col_order = pd.DataFrame(df_col_order, index=['filename'] + col_all).T\n\ndf_col_order['b'].value_counts()\n\nfrom scipy.stats import mode\ndft=df_col_order.set_index('filename')\nm=mode(dft,axis=0)\ndforder = pd.DataFrame({'o':m[0][0],'c':m[1][0]},index=dft.columns)\ndforder = 
dforder.sort_values(['o','c'])\ndforder['iscommon']=dforder.index.isin(col_common)\ndforder\ndforder.index.values.tolist()\n\ndf_col_order.set_index('filename').T.groupby(level=0).agg(lambda x: mode(x))\n\n\n\n\n\n\n\n"
  },
  {
    "path": "tests/tmp-runtest.py",
    "content": "import pandas as pd\nimport numpy as np\nimport glob\nimport d6tstack.combine_csv\nimport importlib\nimportlib.reload(d6tstack.combine_csv)\nimport sqlalchemy\n\nfname_list=glob.glob('test-data/input/test-data-input-csv-clean-*.csv')\nfname_list=glob.glob('test-data/input/test-data-input-csv-reorder-*.csv')\nfname_list=glob.glob('test-data/input/test-data-input-csv-colmismatch-*.csv')\n# combiner = d6tstack.combine_csv.CombinerCSV(fname_list)\n# combiner.is_column_present().all().values.tolist()\n# combiner.is_column_present_common()\n# combiner.sniff_results['df_columns_order']['profit'].values.tolist()\n\nuri = 'mysql+mysqlconnector://testusr:testpwd@localhost/testdb'\nuri = 'mysql+pymysql://testusr:testpwd@localhost/testdb'\n\ntblname = 'testd6tstack'\n\ndef apply(dfg):\n    dfg['date'] = pd.to_datetime(dfg['date'], format='%Y-%m-%d')\n    return dfg\n\n# importlib.reload(d6tstack.combine_csv)\n# combiner = d6tstack.combine_csv.CombinerCSV(fname_list)\n# fnamesout = d6tstack.combine_csv.CombinerCSV(fname_list=fname_list, apply_after_read=apply).to_mysql_combine(uri,tblname,'replace')\n#\n# sql_engine = sqlalchemy.create_engine(uri)\n# df = pd.read_sql_table(tblname, sql_engine)\n# df['profit2']\n\nimportlib.reload(d6tstack.combine_csv)\nsql_engine = sqlalchemy.create_engine(uri)\nd6tstack.combine_csv.CombinerCSV(fname_list=fname_list).to_sql_combine(uri, tblname, {'if_exists': 'replace'})\ndf = pd.read_sql_table(tblname, sql_engine)\nassert check_df_colmismatch_combine(df)\n\n# todo: mysql import makes NaNs 0s\n\nimportlib.reload(d6tstack.combine_csv)\ncombiner = d6tstack.combine_csv.CombinerCSV(fname_list)\nfnamesout = d6tstack.combine_csv.CombinerCSV(fname_list=fname_list).to_parquet_align(output_dir='test-data/output')\n# fnamesout\n\nimport dask.dataframe as dd\n\ndf = dd.read_parquet('test-data/output/d6tstack-test-data-input-csv-colmismatch-*.pq',index='__index_level_0__')\ndf = df.compute()\n\nfor fname in fnamesout:\n    df = 
dd.read_parquet(fname)\n    df = df.compute()\n    print(fname, df.columns)\n\ndf = dd.read_parquet(fnamesout[0])\ndf = df.compute()\ndf\n\n\n"
  },
  {
    "path": "tests/tmp.py",
    "content": "import importlib\nimport d6tstack.utils\nimportlib.reload(d6tstack.utils)\n\nimport time\nimport yaml\n\nconfig = yaml.load(open('tests/.test-cred.yaml'))\n\ncfg_uri_psql = config['rds']\ncfg_uri_psql = config['wlo']\n\nimport pandas as pd\ndf = pd.DataFrame({'a':range(10),'b':range(10)})\nd6tstack.utils.pd_to_psql(df,cfg_uri_psql,'quick',sep='\\t',if_exists='replace')\nprint(pd.read_sql_table('quick',sqlengine))\n\n\n\nimport yaml\nconfig = yaml.load(open('.test-cred.yaml'))\ncfg_uri_psql = config['wlo']\n\nimport pandas as pd\ndf = pd.DataFrame({'a':range(10),'b':range(10),'name':['name,first name']*10})\n\nimport d6tstack.utils\nd6tstack.utils.pd_to_psql(df,cfg_uri_psql,'quick',sep='\\t',if_exists='replace')\n\nimport sqlalchemy\nsqlengine = sqlalchemy.create_engine(cfg_uri_psql)\nprint(pd.read_sql_table('quick',sqlengine))\n\n\n\n\n\nconfig = yaml.load(open('tests/.test-cred.yaml'))\ncfg_uri_mysql = config['local-mysql']\nsqlengine = sqlalchemy.create_engine(cfg_uri_mysql)\nimportlib.reload(d6tstack.utils)\nd6tstack.utils.pd_to_mysql(df,cfg_uri_mysql,'quick',if_exists='replace')\nprint(pd.read_sql_table('quick',sqlengine))\n\n\nimport sqlalchemy\nsqlengine = sqlalchemy.create_engine(cfg_uri_psql)\nsqlengine = sqlalchemy.create_engine(cfg_uri_mysql)\n\nsqlengine = sqlalchemy.create_engine(cfg_uri_psql)\nprint(pd.read_sql_table('benchmark',sqlengine).head())\n\ndft = pd.read_sql_table('benchmark',sqlengine)\ndft.shape\n\n# cursor = sqlengine.cursor()\nsql = sqlengine.execute(\"SELECT * FROM benchmark;\")\ndft2 = pd.DataFrame(sql.fetchall())\ndft2.shape\nsql.keys()\n\nimportlib.reload(d6tstack.utils)\n\nstart_time = time.time()\ndft2 = d6tstack.utils.pd_from_sqlengine(cfg_uri_psql, \"SELECT * FROM benchmark;\")\nassert dft2.shape==(100000, 23)\nprint(\"--- %s seconds ---\" % (time.time() - start_time))\n\nstart_time = time.time()\ndft = pd.read_sql_table('benchmark',sqlengine)\nassert dft.shape==(100000, 23)\nprint(\"--- %s seconds ---\" % 
(time.time() - start_time))\n\nd6tstack.utils.test()\n\n"
  }
]